
Monday, May 18, 2015

How to Split a MultiPolygon Column Into the Corresponding Multiple Polygons in SQL Server


If you're looking for a way to decompose SQL Server geometry (or geography) columns containing multiple polygons (MULTIPOLYGON) into their individual polygons, the following method will surely come in handy.

First of all, we create a custom dataset (a table variable) containing our sample multipolygons:

declare
    @tv_MultiPolygonsAndPolygons
table
(
    MultiPolygon_ID int identity
    , MultiPolygon_Shape geometry
)
;
insert into
    @tv_MultiPolygonsAndPolygons (MultiPolygon_Shape)
values
    (geometry::STPolyFromText('POLYGON((0 0, 2 0, 2 2, 0 2, 0 0))', 0))
    , (geometry::STMPolyFromText('MULTIPOLYGON(((6 20, 8 20, 8 22, 6 22, 6 20)),((3 6, 6 7, 5 9, 3 6)))', 0))
    , (geometry::STPolyFromText('POLYGON((10 0, 12 0, 12 2, 10 2, 10 0))', 0))
    , (geometry::STMPolyFromText('MULTIPOLYGON(((0 10, 2 10, 2 12, 0 12, 0 10)),((15 10, 25 12, 23 20, 15 20, 15 10)))', 0))
;

By querying this table variable:

select
    g.*
from
    @tv_MultiPolygonsAndPolygons g
;

we get the expected four geometries - two plain polygons and two multipolygons - each of them rendered in its own distinctive color:


The easiest and cleanest way to split our multipolygons into their separate polygons is to create a table-valued function and then cross-apply (or outer-apply) it to the table variable containing the multipolygons. In order to do so, we make use of a recursive CTE:

create function [dbo].[ft_Geographie_SplitMultiPolygon](@pr_MultiPolygon geometry)
returns table
as
return
(
    with
        NS(Num) as
    (
        select
            Num = 1

        union all

        select
            Num = Num + 1
        from
            NS
        where
            Num < @pr_MultiPolygon.STNumGeometries()
    )
    select
        ID          =  NS.Num
        , Polygon   =  @pr_MultiPolygon.STGeometryN(NS.Num)
    from
        NS
    -- Note: the OPTION (MAXRECURSION n) hint is not allowed inside an inline
    -- table-valued function; if a multipolygon contains more than 100 geometries,
    -- specify the hint on the statement calling the function (see below).
)

The CTE relies on the STNumGeometries() function, which returns the number of geometries that make up a geometry instance - in our case, the individual polygons. STGeometryN() then extracts the n-th of these geometries.
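As a quick, purely illustrative sanity check (the multipolygon literal below is my own ad-hoc example), both methods can be called directly on a geometry instance:

-- Ad-hoc multipolygon with two rings, used only to illustrate the two methods
declare @mp geometry = geometry::STMPolyFromText('MULTIPOLYGON(((0 0, 2 0, 2 2, 0 2, 0 0)),((3 6, 6 7, 5 9, 3 6)))', 0);

select
    NumGeometries   = @mp.STNumGeometries()          -- returns 2
    , FirstPolygon  = @mp.STGeometryN(1).ToString()  -- the first polygon as WKT
;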

We're now ready to test our brand new table-valued splitting function:

select
    g.MultiPolygon_ID
    , Polygon_ID        =    sp.ID
    , Polygon_Shape     =    sp.Polygon
from
    @tv_MultiPolygonsAndPolygons g
    cross apply dbo.ft_Geographie_SplitMultiPolygon(g.MultiPolygon_Shape) sp
order by
    MultiPolygon_ID
    , Polygon_ID
option (maxrecursion 1000)
;

And, as expected, we now get a separate row for each polygon:


Monday, March 23, 2015

Dynamic Labelling of Dimension Attributes


In Dimensional Modelling, dimensions are usually implemented as single, fully denormalized tables. This morphology translates into a simpler and more readable structure, as well as into a performance gain when processing user queries, since fewer joins between tables are involved.

Dimensions essentially consist of attributes. Each dimension attribute should always be uniquely identified by a code and might carry one or more descriptive fields:



The "Car" Dimension and its attributes.









In this example, each car to be sold is associated with a model, a manufacturer, a building year, a price, and a color. "Building year" and "price" are numeric attributes and therefore don't need any additional descriptive information - in some cases we might even copy the price field back to the fact tables, in order to allow user analysis over it. But what about "model", "manufacturer" and "color"?
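For reference, here is a minimal sketch of how such a dimension table might look - column names and data types are assumptions, inferred from the query further below:

-- Hypothetical structure of the car dimension (names inferred from the query below)
create table tb_Dim_Car
(
    ID                      int             not null primary key
    , Model_Code            varchar(20)     not null
    , Manufacturer_Code     varchar(20)     not null
    , Color_Code            varchar(20)     not null
    , Building_Year         smallint        not null
    , Price                 decimal(18, 2)  not null
);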

Those attributes are identified only by a code. During the processing of the dimension table, we therefore need a way to dynamically associate these codes with their corresponding text labels. Here is a simple SQL-based solution (shown in T-SQL):



; with
    iPvt
as
(
    select
        ID_iPvt                         =    ID    -- Dimension ID
        , Color_Descr                   =    max([Color])
        , Manufacturer_Descr            =    max([Manufacturer])
        , Model_Descr                   =    max([Model])
    from
        (
            select
                k.ID
                , Attribute = m.Type
                , m.Description
            from
                tb_Dim_Car k
                inner join tb_Car_Attribute_Map m on
                    (m.Code = k.Color_Code and m.Type = 'Color')
                    or (m.Code = k.Manufacturer_Code and m.Type = 'Manufacturer')
                    or (m.Code = k.Model_Code and m.Type = 'Model')
        ) Src
        pivot
        (
            max([Description])
            for [Attribute] in
            (
                [Color]
                , [Manufacturer]
                , [Model]
            )
        ) Pvt
    group by
        ID
)
select
    ID                              =    k.ID   
    , Color_Code                    =    k.Color_Code
    , Color_Descr                   =    coalesce(p.Color_Descr, 'n/a')
    , Manufacturer_Code             =    k.Manufacturer_Code
    , Manufacturer_Descr            =    coalesce(p.Manufacturer_Descr, 'n/a')
    , Model_Code                    =    k.Model_Code
    , Model_Descr                   =    coalesce(p.Model_Descr, 'n/a')
    , Building_Year                 =    k.Building_Year
    , Price                         =    k.Price
from
    tb_Dim_Car k
    left outer join iPvt p on
        p.ID_iPvt = k.ID

;


This simple query joins every attribute code of our "tb_Dim_Car" dimension with "tb_Car_Attribute_Map", a normalised (2NF) table containing all the metadata we need to allow a dynamic mapping across all attributes:


The tb_Car_Attribute_Map mapping table.
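A possible structure and a few sample rows for this mapping table could look as follows - codes and descriptions are invented purely for illustration:

-- Hypothetical structure and sample content of the mapping table
create table tb_Car_Attribute_Map
(
    Type            varchar(20)     not null    -- attribute name: 'Color', 'Manufacturer', 'Model'
    , Code          varchar(20)     not null    -- attribute code, as stored in the dimension
    , Description   varchar(100)    not null    -- descriptive label shown to the end user
    , primary key (Type, Code)
);

insert into tb_Car_Attribute_Map (Type, Code, Description)
values
    ('Color', 'RD', 'Red')
    , ('Color', 'GN', 'Green')
    , ('Manufacturer', 'VW', 'Volkswagen')
    , ('Model', 'GOLF', 'Golf')
;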

The result of this join is then pivoted back and outer joined with the original dimension table. To allow this, however, we need a unique "ID" for each dimension row - should one be missing, a ranking window function comes in handy, if supported by the database engine; a minimal sketch follows.
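-- Sketch only: deriving a surrogate ID on the fly when the dimension lacks one;
-- the ordering columns are assumptions taken from the hypothetical table above.
select
    ID = row_number() over (order by Model_Code, Manufacturer_Code, Color_Code)
    , k.*
from
    tb_Dim_Car k
;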

This pattern can be easily extended to multilingual description attributes by simply adding a "Language Code" column to the mapping table, as sketched below.
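A sketch of the multilingual variant - here the mapping table is assumed to carry one row per code and language, and @Language_Code is a hypothetical parameter, not part of the original example:

declare @Language_Code char(2) = 'EN';   -- hypothetical language parameter

select
    k.ID
    , Attribute     = m.Type
    , Description   = m.Description
from
    tb_Dim_Car k
    inner join tb_Car_Attribute_Map m on
        m.Language_Code = @Language_Code   -- Language_Code is the additional column mentioned above
        and (
            (m.Code = k.Color_Code and m.Type = 'Color')
            or (m.Code = k.Manufacturer_Code and m.Type = 'Manufacturer')
            or (m.Code = k.Model_Code and m.Type = 'Model')
        )
;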

Friday, January 16, 2015

First Steps to Market Basket Analysis with R


In this article I'd like to provide a walkthrough on how to perform a simple market basket analysis using R. First of all, I suggest a brief reading of the book "Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management" by Gordon S. Linoff and Michael J. A. Berry, and more specifically of its chapter 15.


Market Basket Analysis is a marketing application of the data mining technique known as "affinity analysis". It is often used in the retail industry as an instrument to infer customer purchase behaviour, in order to improve the sales process by means of activities such as loyalty programs, discounts, and cross-selling.


A well-known application of market basket analysis: Amazon's "customers who bought this item also bought" cross-selling functionality.


Suppose we have access to the historical user search data of a car retail portal. In order to perform a Market Basket Analysis, the data must first be fully pivoted - i.e. every user search must correspond to a row, and every search criterion to a column.


Fully pivoted user search data. Source: "Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management" by Gordon S. Linoff and Michael J. A. Berry, p. 295.
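On the database side, such a fully pivoted layout can be produced, for instance, with plain conditional aggregation - the table, column and criterion names below are purely illustrative, not part of the original example:

-- Illustrative only: flag-style pivot of raw search criteria, one row per user search
select
    s.Search_ID
    , WithDiscount  = max(case when s.Criterion = 'Discount'     then 1 else 0 end)
    , Diesel        = max(case when s.Criterion = 'EngineDiesel' then 1 else 0 end)
    , ColorRed      = max(case when s.Criterion = 'ColorRed'     then 1 else 0 end)
from
    tb_User_Search_Criterion s      -- hypothetical table: one row per search and criterion
group by
    s.Search_ID
;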


We now need to import the pivoted data into our R environment. Through ODBC, by querying a table or executing a stored procedure, we can import them directly - thus generating an R data frame object:

require(RODBC)
channel <- odbcConnect("dwh", uid = "<your_username>", pwd = "<your_password>")
db <- sqlQuery(channel, "select * from <your_table>")
close(channel)
df <- db[, c('column1', 'column2', 'column3', 'column4')]


Our data are now in the "df" data frame object.

In order to perform a Market Basket Analysis, we need to load the "arules" and the "arulesViz" R packages:

library(arules)
library(arulesViz)

Are all our data columns of "factor" type? This is required by the arules package.

sapply(df, class)

If not, we can "correct" the "wrong" column by using a cast operation:

class(df$ProductX)
df$ProductX <- as.factor(df$ProductX)
class(df$ProductX)
We might not want to manually cast every single column of our data frame. If this is the case, we can use the colwise function of the plyr package:

require(plyr)
tofactor <- function(x) { x <- as.factor(x) }
df <- colwise(tofactor)(df)

The Market Basket Analysis algorithm ignores NA values (the R equivalent of the database NULL). In case the quality of our DWH dataset is not yet good enough, we might want to clean it manually:

df$WithDiscount <- replace(df$WithDiscount, df$WithDiscount==0, NA)

Our data are now ready to generate useful information. We begin with a standard support of 0.1% and a confidence of 80%:

rules <- apriori(df, parameter = list(supp = 0.001, conf = 0.8, maxlen=5))


Let's take a look at the top 7 rules, ordered by lift:

options(digits = 2)
rules <- sort(rules, by = "lift", decreasing = TRUE)
inspect(rules[1:7])

We obtain something like this:

   lhs                        rhs                    support  confidence  lift
1  {WithDiscount=1}       =>  {Model=Golf}           0.0070   0.92        1.3
2  {WithDiscount=1}       =>  {EngineType=Diesel}    0.0064   0.85        1.1
3  {Model=Golf}           =>  {Color=Green}          0.0300   0.97        1.4
4  {Model=Golf}           =>  {Color=Red}            0.4138   0.90        1.3
5  {Color=Red}            =>  {EngineType=Gasoline}  0.4138   0.90        1.2
6  {Model=Golf}           =>  {EngineType=Diesel}    0.6343   0.89        1.2
7  {EngineType=Electric}  =>  {Color=White}          0.6343   0.

For better readability we can export our rules to a CSV file:
write(rules, file="rules.csv", quote=TRUE, sep=";")

As usual, most data mining results appear to be trivial - i.e. they are already known to anyone familiar with the business.

In this example, rule 7 (a customer choosing an electric vehicle is more likely to pick it in white) and rule 5 (future owners of red cars tend to prefer traditional gasoline engines over diesel or electric ones) may appear non-trivial and therefore worth further analysis.

Finding out a sound and useful interpretation of these derived rules is, of course, part of the analysis itself.

Monday, October 13, 2014

Distinct Count of Values in R

Multidimensional datasets often show the issue of rows containing duplicate values. In SQL we can easily handle this problem thanks to the COUNT DISTINCT aggregate function. But what about R?
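For comparison, this is the SQL statement whose behaviour we want to reproduce in R (table and column names are placeholders):

-- number of distinct, non-NULL values of column1
select count(distinct column1) from your_table;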

According to a couple of websites and blogs I've quickly checked, the fastest and most efficient way to get a distinct count of values in R seems to be to make use of the unique function:

unique(dataset$column)

where "column" is the column name of the "dataset" dataset, whose values we'd like to distinct count.
The function is gonna return us a vector containing the unique list of values of the specified column - i.e. a vector without duplicate elements.

Thus what we need now is a simple count of the elements of this vector:

uniquevalues <- unique(dataset$column)
length(uniquevalues)

Wrapped into one single, scalar-returning statement:

length(unique(dataset$column))

Note that nrow() would not work here: unique() applied to a single column yields a vector, not a data frame, and nrow() of a vector returns NULL.

If we want to apply the same logic to the whole dataset rather than to a single column, we can use sapply() with an anonymous (lambda) function:

sapply(dataset, function(x) length(unique(x)))

Wednesday, June 11, 2014

How to Determine the Bounding Box of a Spatial Index in SQL Server

In SQL Server you can associate an object with a region by invoking the STIntersects() method against a geometry or geography column, as if it were a traditional join:

DECLARE @g geometry = geometry::STGeomFromText('LINESTRING(0 2, 2 0, 4 2)', 0)
DECLARE @h geometry = geometry::STGeomFromText('POINT(1 1)', 0)
SELECT @g.STIntersects(@h)

This method allows objects to be located spatially with a standard SQL query: we're basically asking SQL Server which polygons our objects/points fall in.
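A sketch of such a query - the table and column names are assumptions for illustration, not part of the original example:

-- Which region polygon does each object position fall in?
select
    p.Object_ID
    , g.Region_ID
from
    tb_Object_Position p                        -- hypothetical table of points
    inner join dbo.tb_Dim_Geographie g on       -- hypothetical table of polygons
        g.Polygon.STIntersects(p.Position) = 1
;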

Tracking a large population of objects, however - especially if they're moving - quickly leads to very poor query performance. Is there any way to speed up this geometric logic?

According to the Microsoft online documentation, we can make use of a spatial index - a standard B-tree structure which decomposes the 2-dimensional spatial data into a linear, hierarchical grid. By means of a spatial index, SQL Server can then internally perform spatial calculations using simple and fast integer arithmetic.


A spatial index decomposes the 2-dimensional spatial data into a linear, hierarchical grid.


The syntax is pretty much trivial:

create spatial index
    six_tb_Dim_Geographie_Polygon
on
    dbo.tb_Dim_Geographie(Polygon)
using GEOMETRY_GRID
with
(
    BOUNDING_BOX = (xmin = 9.53074889950784, ymin = 46.3723050741545, xmax = 17.1607732090805, ymax = 49.0205207370626),
    GRIDS = (LOW, LOW, MEDIUM, HIGH),
    CELLS_PER_OBJECT = 64,
    PAD_INDEX = ON
);

Since geometry data can - theoretically - occupy an infinite plane, the spatial index requires a rectangular bounding box, i.e. the x/y coordinates of the lower-left and upper-right corners of the entire geometrical structure. How can we calculate them? Here is a simple query:

; with
    x
as
(
    select
        geom = geometry::EnvelopeAggregate(g.Polygon)
    from
        dbo.tb_Dim_Geographie g
)
select
    xMin   = x.geom.STPointN(1).STX
    , yMin = x.geom.STPointN(1).STY
    , xMax = x.geom.STPointN(3).STX
    , yMax = x.geom.STPointN(3).STY
from
    x
;





Here point 1 is the lower-left corner and point 3 the upper-right one; dbo.tb_Dim_Geographie could be the geographical dimension of your data warehouse, or any table containing the geographical structures in a normalised environment.

Monday, April 28, 2014

How to convert a Shapefile from UTM (WGS84) Coordinates into GPS Latitude/Longitude in R

According to Gartner, more than 80% of all information is estimated to have a spatial reference.
The business value of any kind of data with a spatial reference can be dramatically increased by integrating and visualizing it together with geographical, demographic and geopolitical data.

Developers and IT professionals often choose to create their own geographical master data instead of relying on traditional GIS applications, which are often too highly specialized to be integrated into existing information systems.

For this purpose, many open data sources and technologies can be used; the most common data format is the ESRI shapefile. To import a shapefile from the filesystem into a database, many different tools can be used; in SQL Server environments I suggest the free and fast Shape2SQL tool.

Important: during the upload, Shape2SQL uses a single transaction. If for any reason the creation of the SQL spatial index fails, the entire transaction is rolled back and you won't see any newly created table in your target database - I suggest disabling this automatic and somewhat buggy index creation feature. Also, do not forget to set the SRID to 4326.

You might think that, once your "shape" table has been created, your pairs of latitude and longitude coordinates are ready to uniquely identify every surface/polygon as well as every point on the earth's surface. Unfortunately, it's not quite that simple.

If your goal is to integrate and visualize data on standard platforms such as Google Maps, OpenStreetMap or Bing Maps, you in fact need GPS latitude/longitude coordinates in WGS 1984 format - otherwise known as EPSG 4326. Most open data sources, however, publish shapefile data in UTM coordinates, a projected system that differs from the latitude/longitude system in several respects.

How can we convert a shapefile from UTM to latitude/longitude GPS format? Here is a fast, no-cost solution using the popular open source data manipulation and analysis framework R, together with the rgdal package (an interface to the GDAL library).

First of all, we install and load rgdal, and switch to our working directory (i.e., the directory into which we copied the original UTM shapefile):

install.packages("rgdal")
library(rgdal)
setwd("[yourworkingdirectory]")
getwd()


We then import the shapefile by creating a dataset in our workspace:
shape <- readOGR(dsn = ".", layer = "filename_without_extension")

A dataset called "shape" of type SpatialPolygonsDataFrame has now been created. We are curious to see what exactly we just imported, and how does it looks like:

dimensions(shape)
summary(shape)
plot(shape)

You should see something like this:



We are now ready for the UTM to GPS Lat/Long conversion:

shape_gps <- spTransform(shape, CRS("+proj=longlat +datum=WGS84"))   # reproject to WGS84 lat/long (EPSG 4326)

Finally, we write the result back to the filesystem:

writeOGR(shape_gps, ".", "shape_gps", driver="ESRI Shapefile")

A file called "shape_gps.shp" will now contain your gps coordinate data.

In case SQL Server complains about the validity of your shape data, make use of the MakeValid() SQL (CLR) function.
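A minimal sketch of such a repair step - the table and column names are assumptions standing in for whatever Shape2SQL created in your database:

-- Repair only the rows whose geometry is reported as invalid
update dbo.tb_Shape_Import
set Geom = Geom.MakeValid()
where Geom.STIsValid() = 0;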

Tuesday, April 22, 2014

Talend Open Studio Cookbook by Rick Barton



Packt Publishing has recently published a new book: "Talend Open Studio Cookbook" by Rick Barton.

As a data warehouse developer and engineer, I have been using Talend extensively for many years. Apart from the basic tutorials provided by Talend, however, my only source for learning has always been the proactive Talend online community.

This book doesn't digress too much into theory and provides a full, comprehensive view of many everyday, concrete situations and their corresponding solution patterns. The learning-by-doing "recipe" approach followed by the book makes it easy to read and understand. The book also provides a good foundation in XML principles; it lacks, however, a chapter illustrating the development of custom components, a scenario that is more likely to occur than expected.

Prerequisites for the book are a basic knowledge of Java or any C-like object-oriented programming language, as well as a rudimentary understanding of relational concepts.

I highly recommend this book for both IT experts and novices with a focus on system and data integration.