
Monday, May 18, 2015

How to Split a MultiPolygon Column Into the Corresponding Multiple Polygons in SQL Server


If you're looking for a way to decompose SQL Server geometry (or geography) columns containing multiple polygons (MULTIPOLYGON) into their individual polygons, the following method will surely come in handy.

First of all, we create a custom dataset (a table variable) containing our sample multipolygons:

declare
    @tv_MultiPolygonsAndPolygons
table
(
    MultiPolygon_ID int identity
    , MultiPolygon_Shape geometry
)
;
insert into
    @tv_MultiPolygonsAndPolygons (MultiPolygon_Shape)
values
    (geometry::STPolyFromText('POLYGON((0 0, 2 0, 2 2, 0 2, 0 0))', 0))
    , (geometry::STMPolyFromText('MULTIPOLYGON(((6 20, 8 20, 8 22, 6 22, 6 20)),((3 6, 6 7, 5 9, 3 6)))', 0))
    , (geometry::STPolyFromText('POLYGON((10 0, 12 0, 12 2, 10 2, 10 0))', 0))
    , (geometry::STMPolyFromText('MULTIPOLYGON(((0 10, 2 10, 2 12, 0 12, 0 10)),((15 10, 25 12, 23 20, 15 20, 15 10)))', 0))
;

By querying this table variable:

select
    g.*
from
    @tv_MultiPolygonsAndPolygons g
;
we get, as expected, the following four shapes - two single polygons and two multipolygons - each rendered with its own distinctive color:


The easiest and cleanest way to split our multipolygons into their separate polygons is to create a table-valued function and then cross apply (or outer apply) our table variable containing the multipolygons to it. In order to do so, we make use of a recursive CTE:

create function [dbo].[ft_Geographie_SplitMultiPolygon] (@pr_MultiPolygon geometry)
returns table
as
return
(
    with
        NS (Num) as
    (
        select
            Num = 1

        union all

        select
            Num = Num + 1
        from
            NS
        where
            Num < @pr_MultiPolygon.STNumGeometries()
    )
    select
        Polygon_ID    =    NS.Num
        , Polygon     =    @pr_MultiPolygon.STGeometryN(NS.Num)
    from
        NS
    -- note: if a multipolygon may contain more than 100 polygons, specify
    -- OPTION (MAXRECURSION n) in the query that calls this function
)

The recursive CTE relies on the STNumGeometries() method, which returns the number of geometries that make up a geometry instance - in our case, the individual polygons - while STGeometryN() extracts each of them by its one-based index.
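To see the two methods at work in isolation, here is a quick standalone example (the WKT literal is just an ad-hoc sample):

declare @g geometry = geometry::STMPolyFromText(
    'MULTIPOLYGON(((0 0, 2 0, 2 2, 0 2, 0 0)),((3 6, 6 7, 5 9, 3 6)))', 0);

select
    NumberOfPolygons    =    @g.STNumGeometries()            -- returns 2
    , SecondPolygonWkt  =    @g.STGeometryN(2).ToString()    -- POLYGON ((3 6, 6 7, 5 9, 3 6))
;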

We're now ready to test our brand new table-valued splitting function:

select
    g.MultiPolygon_ID
    , Polygon_ID        =    sp.Polygon_ID
    , Polygon_Shape     =    sp.Polygon
from
    @tv_MultiPolygonsAndPolygons g
    cross apply dbo.ft_Geographie_SplitMultiPolygon(g.MultiPolygon_Shape) sp
order by
    g.MultiPolygon_ID
    , sp.Polygon_ID
option (MaxRecursion 1000)
;

And, as expected, we're now getting a different row for each polygon:
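With the Polygon column rendered as WKT (for instance via .ToString()), the result should look roughly as follows:

MultiPolygon_ID  Polygon_ID  Polygon_Shape
1                1           POLYGON ((0 0, 2 0, 2 2, 0 2, 0 0))
2                1           POLYGON ((6 20, 8 20, 8 22, 6 22, 6 20))
2                2           POLYGON ((3 6, 6 7, 5 9, 3 6))
3                1           POLYGON ((10 0, 12 0, 12 2, 10 2, 10 0))
4                1           POLYGON ((0 10, 2 10, 2 12, 0 12, 0 10))
4                2           POLYGON ((15 10, 25 12, 23 20, 15 20, 15 10))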


Monday, March 23, 2015

Dynamic Labelling of Dimension Attributes


In Dimensional Modelling, dimensions are usually implemented as single, fully denormalized tables. This design translates into a simpler, more readable structure as well as into a performance gain when processing user queries, since fewer joins between tables are involved.

Dimensions essentially consist of attributes. Each dimension attribute should always be uniquely identified by a code and may carry one or more descriptive fields:



The "Car" Dimension and its attributes.

In this example, each car to be sold is associated with a model, a manufacturer, a building year, a price, and a color. "Building year" and "price" are numeric attributes and therefore don't need any additional descriptive information - in some cases we might even copy the price field back to the fact tables in order to allow user analysis over it. But what about "model", "manufacturer" and "color"?
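For reference, here is a minimal sketch of such a dimension table, with column names inferred from the query further below (the data types are assumptions):

create table tb_Dim_Car
(
    ID                    int             not null primary key
    , Model_Code          varchar(10)     not null
    , Manufacturer_Code   varchar(10)     not null
    , Building_Year       smallint        not null
    , Price               decimal(10, 2)  not null
    , Color_Code          varchar(10)     not null
);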

Those attributes are uniquely identified by a code. However, during the processing of the dimension table, we need a way to dynamically associate these codes with their corresponding text labels. Here is a simple, set-based SQL solution:



; with
    iPvt
as
(
    select
        ID_iPvt                         =    ID    -- dimension ID
        , Color_Descr                   =    max([Color])
        , Manufacturer_Descr            =    max([Manufacturer])
        , Model_Descr                   =    max([Model])
    from
        (
            select
                k.ID
                , m.Code
                , Attribute = m.Type
                , m.Description
            from
                tb_Dim_Car k
                inner join tb_Car_Attribute_Map m on
                    (m.Code = k.Color_Code and m.Type = 'Color')
                    or (m.Code = k.Manufacturer_Code and m.Type = 'Manufacturer')
                    or (m.Code = k.Model_Code and m.Type = 'Model')
        ) Src
        pivot
        (
            max([Description])
            for [Attribute] in
            (
                [Color]
                , [Manufacturer]
                , [Model]
            )
        ) Pvt
    group by
        ID
)
select
    ID                              =    k.ID   
    , Color_Code                    =    k.Color_Code
    , Color_Descr                   =    coalesce(p.Color_Descr, 'n/a')
    , Manufacturer_Code             =    k.Manufacturer_Code
    , Manufacturer_Descr            =    coalesce(p.Manufacturer_Descr, 'n/a')
    , Model_Code                    =    k.Model_Code
    , Model_Descr                   =    coalesce(p.Model_Descr, 'n/a')
    , Building_Year                 =    k.Building_Year
    , Price                         =    k.Price
from
    tb_Dim_Car k
    left outer join iPvt p on
        p.ID_iPvt = k.ID
;


This simple query joins every attribute code of our "tb_Dim_Car" dimension with "tb_Car_Attribute_Map", a 2NF table containing all the metadata we need to dynamically map codes to descriptions across all attributes:


The tb_Car_Attribute_Map mapping table.
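A minimal sketch of what such a mapping table could look like, inferred from the query above; the sample codes and descriptions are purely illustrative:

create table tb_Car_Attribute_Map
(
    Type            varchar(20)     not null    -- 'Color', 'Manufacturer' or 'Model'
    , Code          varchar(10)     not null
    , Description   varchar(100)    not null
    , primary key (Type, Code)
);

insert into tb_Car_Attribute_Map (Type, Code, Description)
values
    ('Color', 'RD', 'Red')
    , ('Manufacturer', 'VW', 'Volkswagen')
    , ('Model', 'GLF', 'Golf')
;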

The result of this join is then pivoted and outer joined with the original dimension table. In order to allow this, however, we need a unique "ID" for each dimension row - in case we lack one, a ranking window function comes in handy, if supported by the database engine (see the sketch below).
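A minimal sketch of how such an ID could be generated with ROW_NUMBER(), assuming - hypothetically - that tb_Dim_Car lacked its ID column:

select
    ID    =    row_number() over (order by k.Model_Code, k.Manufacturer_Code, k.Color_Code)
    , k.*
from
    tb_Dim_Car k
;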

This pattern can be easily extended to multilingual description attributes by simply adding a "Language Code" column to the mapping table, as sketched below.
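A minimal sketch of this extension, assuming the hypothetical column name Language_Code:

alter table tb_Car_Attribute_Map
    add Language_Code char(2) not null default 'en';

-- the join in the CTE then only needs one extra predicate per language, e.g.:
--     and m.Language_Code = 'de'
-- note that the primary key of the mapping table would have to include Language_Code as well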

Friday, January 16, 2015

First Steps to Market Basket Analysis with R


In this article I'd like to provide a walkthrough on how to perform a simple market basket analysis using R. First of all, I suggest a brief reading of the book "Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management" by Gordon S. Linoff and Michael J. A. Berry, and more specifically of its chapter 15.


Market basket analysis is a marketing application of the data mining technique known as "affinity analysis". It is often used in the retail industry as an instrument to infer customer purchase behavior, in order to improve the sales process by means of activities such as loyalty programs, discounts, and cross-selling.


A well-known application of market basket analysis: the Amazon "customers who bought this item also bought" cross-selling functionality.


Suppose we have access to the historical user search data of a car retail portal. In order to perform a market basket analysis, the data must first be fully pivoted - i.e. every user search must correspond to a row, and every search criterion to a column.


Fully pivoted user search data. Source: "Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management" by Gordon S. Linoff and Michael J. A. Berry, p. 295.


We now need to import our pivoted data into the R environment. Through ODBC, by querying a table or executing a stored procedure, we can import them directly and generate an R data frame object:

require(RODBC)
channel <- odbcConnect("dwh", uid = "<your_username>", pwd = "<your_password>")
db <- sqlQuery(channel, "select * from <your_table>")   # import the pivoted data
close(channel)
df <- db[, c('column1', 'column2', 'column3', 'column4')]   # keep the relevant columns


Our data are now in the "df" data frame object.

In order to perform a Market Basket Analysis, we need to load the "arules" and the "arulesViz" R packages:

library(arules)
library(arulesViz)

Are all our data columns of "factor" type? This is required by the arules package.

sapply(df, class)

If not, we can "correct" the "wrong" column by using a cast operation:

class(df$ProductX)
df$ProductX <- as.factor(df$ProductX)
class(df$ProductX)

We might not want to manually cast every single column of our data frame. If this is the case, we can use the colwise function of the plyr package:

require(plyr)
tofactor <- function(x) { x <- as.factor(x) }
df <- colwise(tofactor)(df)

The market basket analysis algorithm ignores NA values (the R equivalent of the database NULL). In case our DWH dataset is not yet of sufficient quality, we might want to clean it manually - for example by turning meaningless zeros into NAs:

df$WithDiscount <- replace(df$WithDiscount, df$WithDiscount==0, NA)

Our data are now ready to generate useful information. We begin with a minimum support of 0.1% and a confidence of 80%:

rules <- apriori(df, parameter = list(supp = 0.001, conf = 0.8, maxlen=5))


Let's take a look at the top 7 rules, ordered by lift:

options(digits = 2)
rules <- sort(rules, by = "lift", decreasing = TRUE)
inspect(rules[1:7])

We obtain something like this:

   lhs                      rhs                    support  confidence  lift
1  {WithDiscount=1}      => {Model=Golf}           0.0070   0.92        1.3
2  {WithDiscount=1}      => {EngineType=Diesel}    0.0064   0.85        1.1
3  {Model=Golf}          => {Color=Green}          0.0300   0.97        1.4
4  {Model=Golf}          => {Color=Red}            0.4138   0.90        1.3
5  {Color=Red}           => {EngineType=Gasoline}  0.4138   0.90        1.2
6  {Model=Golf}          => {EngineType=Diesel}    0.6343   0.89        1.2
7  {EngineType=Electric} => {Color=White}          0.6343   0.

For easier reading, we can export our rules to a CSV file:

write(rules, file = "rules.csv", quote = TRUE, sep = ";")

As usual, most of the data mining results appear to be trivial - i.e. they are already known to anyone familiar with the business.

In this example, rule 7 (a customer choosing an electric vehicle is more likely to purchase it in white) and rule 5 (future owners of red-painted cars tend to prefer traditional gasoline engines over diesel or electric ones) may appear non-trivial and therefore worth further analysis.

Finding out a sound and useful interpretation of these derived rules is, of course, part of the analysis itself.