
Monday, October 13, 2014

Distinct Count of Values in R

Multidimensional datasets often show the issue of rows containing duplicate values. In SQL we can easily handle this problem thanks to the COUNT(DISTINCT) aggregate function. But what about R?

According to a couple of websites and blogs I've quickly checked, the fastest and most efficient way to get a distinct count of values in R seems to be to use R's unique() function:

newdataset <- unique(dataset$column)

where "column" is the name of the column in the "dataset" data frame whose distinct values we'd like to count.
The function returns a vector containing the unique values of the specified column - i.e. a vector without duplicate elements - which we store here in "newdataset".

Thus what we need now is a simple count of the elements of this vector:

length(newdataset)

Wrapping it all in one single, scalar-returning statement:

length(unique(dataset$column))

If we want to apply the same logic to the whole dataset rather than a single column, we can use sapply() with an anonymous function:

sapply(dataset, function(x) length(unique(x)))
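
As a quick sanity check, here is the same idiom applied to the built-in mtcars data frame (used purely as a stand-in for your own dataset):

# distinct count for a single column: 3 (4, 6 and 8 cylinders)
length(unique(mtcars$cyl))

# distinct count for every column at once
sapply(mtcars, function(x) length(unique(x)))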

Wednesday, June 11, 2014

How to Determine the Bounding Box of a Spatial Index in SQL Server

In SQL Server you can associate an object with a region by invoking the STIntersects() method against a geometry or geography column, as if it were a traditional join:

DECLARE @g geometry = geometry::STGeomFromText('LINESTRING(0 2, 2 0, 4 2)', 0)
DECLARE @h geometry = geometry::STGeomFromText('POINT(1 1)', 0)
SELECT @g.STIntersects(@h)

This method allows us to locate objects spatially using a standard SQL query: we're basically asking SQL Server which polygons our objects/points fall in.
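
The same predicate can also act as a join condition between two tables. Here is a minimal sketch, assuming a hypothetical points table dbo.tb_Points(PointId, PointGeom) and a polygons table dbo.tb_Regions(RegionId, Polygon):

-- which region does each point fall into? (hypothetical table and column names)
SELECT
    p.PointId
    , r.RegionId
FROM
    dbo.tb_Points AS p
    INNER JOIN dbo.tb_Regions AS r
        ON p.PointGeom.STIntersects(r.Polygon) = 1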

Tracking a large population of objects, especially if they're moving, however, leads to really poor query performance. Is there any way to speed up this geometrical logic?

According to the Microsoft online documentation, we can make use of a spatial index - a standard B-tree structure which decomposes the 2-dimensional spatial data into a linear, hierarchical grid. By means of a spatial index, SQL Server can then internally perform a spatial calculation using simple and fast integer arithmetic.


A spatial index decomposes the 2-dimensional spatial data into a linear, hierarchical grid.


The syntax is pretty straightforward:

create spatial index
    six_tb_Dim_Geographie_Polygon
on
    dbo.tb_Dim_Geographie(Polygon)
using GEOMETRY_GRID
with
(
    BOUNDING_BOX = (xmin=9.53074889950784, ymin=46.3723050741545, xmax=17.1607732090805, ymax=49.0205207370626),
    GRIDS = (LOW, LOW, MEDIUM, HIGH),
    CELLS_PER_OBJECT = 64,
    PAD_INDEX  = ON
);

Since geometry data can - theoretically - occupy an infinite plane, the spatial index requires a rectangular bounding box, i.e. the x/y coordinates of the lower-left and upper-right corners of the entire geometrical structure. How do we calculate them? Here is a simple query:

;with x as
(
    select
        geom = geometry::EnvelopeAggregate(g.Polygon)
    from
        dbo.tb_Dim_Geography g
)
select
    xMin = x.geom.STPointN(1).STX
    , yMin = x.geom.STPointN(1).STY
    , xMax = x.geom.STPointN(3).STX
    , yMax = x.geom.STPointN(3).STY
from
    x
;





Here point '1' is the lower-left corner and point '3' the upper-right one; 'dbo.tb_Dim_Geography' could be the geographical dimension of your data warehouse, or any table containing the geographical structure in a normalised environment.
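
If you want to recreate the spatial index whenever the underlying data changes, the two fragments can be combined: the corners computed above are fed into a dynamically built CREATE SPATIAL INDEX statement. The following is only a sketch, reusing the table and index names from the earlier example (drop the existing index first and adjust the names to your schema):

declare @xMin decimal(18,12), @yMin decimal(18,12), @xMax decimal(18,12), @yMax decimal(18,12);

-- corners of the envelope of all polygons, as in the query above
select
    @xMin = e.geom.STPointN(1).STX
    , @yMin = e.geom.STPointN(1).STY
    , @xMax = e.geom.STPointN(3).STX
    , @yMax = e.geom.STPointN(3).STY
from
    (select geom = geometry::EnvelopeAggregate(g.Polygon)
     from dbo.tb_Dim_Geographie g) as e;

-- build and run the index definition with the freshly computed bounding box
declare @sql nvarchar(max) = N'
create spatial index six_tb_Dim_Geographie_Polygon
on dbo.tb_Dim_Geographie(Polygon)
using GEOMETRY_GRID
with
(
    BOUNDING_BOX = (xmin=' + convert(nvarchar(50), @xMin)
    + N', ymin=' + convert(nvarchar(50), @yMin)
    + N', xmax=' + convert(nvarchar(50), @xMax)
    + N', ymax=' + convert(nvarchar(50), @yMax) + N'),
    GRIDS = (LOW, LOW, MEDIUM, HIGH),
    CELLS_PER_OBJECT = 64
);';

exec sp_executesql @sql;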

Monday, April 28, 2014

How to convert a Shapefile from UTM (WGS84) Coordinates into GPS Latitude/Longitude in R

According to Gartner, more than 80% of all information is estimated to have a spatial reference.
The business value of any kind of data with a spatial reference can be dramatically increased by integrating and visualising it together with geographical, demographic and geopolitical data.

Developers and IT professionals often choose to create their own geographical master data instead of relying on traditional GIS applications, which are often too highly specialised to be integrated into existing information systems.

For this purpose, many open data sources and technologies can be used; the most common data format is the ESRI shapefile. To import a shapefile from the filesystem into a database, many different tools can be used; in SQL Server environments I suggest the free and fast Shape2SQL tool.

Important: during the upload, Shape2SQL uses a single transaction. If for any reason the creation of the SQL spatial index fails, the entire transaction will be rolled back and you won't see any newly created table in your target database - I suggest disabling this automatic and buggy index creation feature. Also, don't forget to set the SRID to 4326.

You might think that once your "shape" table has been created, your pairs of latitude and longitude coordinates are ready to uniquely identify every surface/polygon as well as every point on the earth's surface. Unfortunately, it's not quite that simple.

If your goal is to integrate and visualise data on standard platforms such as Google Maps, OpenStreetMap or Bing Maps, you in fact need GPS latitude/longitude coordinates in the WGS 1984 format - otherwise known as EPSG 4326. Most open data sources, however, publish shapefile data in UTM coordinates, a projected coordinate system that differs from the latitude/longitude system in several respects.

How to convert a shapefile from UTM to the latitude/longitude GPS format? Here is a fast and no-cost solution, using the popular open-source data manipulation and analysis framework "R", together with the rgdal package.

First of all, we install and load rgdal and switch to our working directory (i.e., the directory in which we copied the original UTM shapefile):

install.packages("rgdal")
library(rgdal)
setwd("[yourworkingdirectory]")
getwd()


We then import the shapefile from the working directory by creating a dataset in our workspace:

shape <- readOGR(".", layer = "filename_without_extension")

A dataset called "shape" of type SpatialPolygonsDataFrame has now been created. We are curious to see what exactly we just imported and what it looks like:

dimensions(shape)
summary(shape)
plot(shape)

summary(shape) reports the projection attributes (the proj4string, still in UTM at this point) together with the bounding box, while plot(shape) draws the imported polygons.

We are now ready for the UTM to GPS Lat/Long conversion:

shape_gps <- spTransform(shape, CRS("+proj=longlat +datum=WGS84"))
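
Before writing anything back, it's worth a quick sanity check that the transformation really produced longitude/latitude values (both functions come from the sp package, loaded together with rgdal):

# the proj4string should now report longlat instead of UTM,
# and the bounding box should contain plausible degree values
proj4string(shape_gps)
bbox(shape_gps)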

Finally, we write the result back to the filesystem:

writeOGR(shape_gps, ".", "shape_gps", driver="ESRI Shapefile")

A file called "shape_gps.shp" will now contain your GPS coordinate data.

In case SQL Server complains about the validity of your shape data after re-importing it, make use of the MakeValid() (CLR) function, for example as sketched below.
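
A minimal sketch, assuming a hypothetical staging table dbo.tb_Shape with a geometry column named Polygon:

-- repair invalid geometries in place (hypothetical table and column names)
UPDATE dbo.tb_Shape
SET    Polygon = Polygon.MakeValid()
WHERE  Polygon.STIsValid() = 0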

Tuesday, April 22, 2014

Talend Open Studio Cookbook by Rick Barton



Packt Publishing has recently published a new book: the "Talend Open Studio Cookbook" by Rick Barton.

As a data warehouse developer and engineer, I have used Talend extensively for many years. Apart from the basic tutorials provided by Talend, however, my only source of learning has always been the very active Talend online community.

This book doesn't digress too much into theory and provides a comprehensive view of many everyday, concrete situations and their corresponding solution patterns. The learning-by-doing "recipe" approach followed by the book makes it easy to read and understand. The book also provides a good foundation of XML principles; it lacks, however, a chapter illustrating the development of custom components, a scenario that occurs more often than you might expect.

Prerequisites for the book are a basic knowledge of Java or any C-like object-oriented programming language, as well as a rudimentary understanding of relational concepts.

I highly recommend this book for both IT experts and novices with a focus on system and data integration.

Friday, January 17, 2014

How to Determine if Two Date Ranges Overlap in SQL

Of the many possible configurations two time intervals can have with respect to each other, the most common case to test for is whether two or more date ranges overlap. How do we express this in relational terms?

Suppose we have a reference time range X. In SQL, X will be represented by a pair of date columns or variables (how else?): @pr_DateFrom and @pr_DateTo.

We are now interested in determining whether the date ranges contained in one table, tb_DateRanges, overlap with our reference date range. tb_DateRanges has two columns, DateFrom and DateTo. Moreover, if even just one of those ranges overlaps, our query should return a positive result. In SQL, the solution to this problem can easily be obtained with an aggregating query.


The reference date range (black), the overlapping date ranges (green) and the non-overlapping date ranges (red).



We begin by expressing the opposite condition, i.e. when a date range A does NOT overlap with the reference range X. This happens when:
   
tb_DateRanges.DateTo <= @pr_DateFrom

OR

tb_DateRanges.DateFrom >= @pr_DateTo

Now, according to De Morgan's law:
   
not(A OR B)

is equivalent to  

not(A) AND not(B)

which means:

NOT(tb_DateRanges.DateTo <= @pr_DateFrom)

AND

NOT(tb_DateRanges.DateFrom >= @pr_DateTo)
   


By applying this condition to every row of tb_DateRanges, we obtain a sequence of "true" (1) and "false" (0) values. By aggregating this raw result with the MAX() SQL function, we simulate an "OR" across the values.
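
Putting the pieces together, here is a minimal sketch of the aggregating query, using the variable and table names introduced above (the reference dates are just placeholders):

DECLARE @pr_DateFrom date = '2014-01-01'
DECLARE @pr_DateTo   date = '2014-01-31'

-- 1 if at least one stored range overlaps the reference range, 0 otherwise
SELECT
    OverlapsReference = MAX(CASE
                                WHEN NOT(tb_DateRanges.DateTo <= @pr_DateFrom)
                                 AND NOT(tb_DateRanges.DateFrom >= @pr_DateTo)
                                THEN 1 ELSE 0
                            END)
FROM
    tb_DateRanges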
   
With this solution, two ranges that merely touch at an endpoint (i.e. their edges coincide exactly) are not counted as overlapping. If you wish to treat such configurations as overlaps, change the comparison operators from "<=" to "<" and from ">=" to ">".

A similar pattern can be used for spatial problems. For instance, for three-dimensional objects the same logic can be applied by stating the condition for each coordinate axis separately, as sketched below.
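
A sketch with purely hypothetical table and variable names:

-- reference box (hypothetical values)
DECLARE @pr_xFrom float = 0, @pr_xTo float = 10
DECLARE @pr_yFrom float = 0, @pr_yTo float = 10
DECLARE @pr_zFrom float = 0, @pr_zTo float = 10

-- 1 if at least one box in tb_Boxes overlaps the reference box on all three axes
SELECT
    AnyBoxOverlaps = MAX(CASE
                             WHEN NOT(b.xTo <= @pr_xFrom) AND NOT(b.xFrom >= @pr_xTo)
                              AND NOT(b.yTo <= @pr_yFrom) AND NOT(b.yFrom >= @pr_yTo)
                              AND NOT(b.zTo <= @pr_zFrom) AND NOT(b.zFrom >= @pr_zTo)
                             THEN 1 ELSE 0
                         END)
FROM
    tb_Boxes AS b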

Wednesday, January 8, 2014

How to Split a Comma Separated Values Column into Multiple Rows

In many cases a SQL Developer may face a situation in which different values are "wrapped" into the same column. Even though this violates the First Normal Form, it is not uncommon.


Depending on your design considerations, you may want to "unwrap" the column, generating a new table with multiple rows sharing the same ID - one for each split value - or simply use an additional reference table. The first solution fits well with the fact table of a dimensional data warehouse, while the second is more suitable for a normalised environment, such as a traditional information system.


Here is my definitely not elegant but working solution for a DWH scenario. First of all, you need a splitting table-valued function:

create function [dbo].[ft_Split]
(
    @pr_RowData nvarchar(2000),
    @pr_SplitOn nvarchar(5)
)
returns @vr_Rtn table
(
    Id int identity(1,1),
    Data nvarchar(max)
)
as
begin
    declare @vr_Cnt int
    set @vr_Cnt = 1

    -- as long as another separator is found, cut off the leading token and store it
    while (charindex(@pr_SplitOn, @pr_RowData) > 0)
    begin
        insert into @vr_Rtn (Data)
        select
            Data = ltrim(rtrim(substring(@pr_RowData, 1, charindex(@pr_SplitOn, @pr_RowData) - 1)))

        set @pr_RowData = substring(@pr_RowData, charindex(@pr_SplitOn, @pr_RowData) + 1, len(@pr_RowData))
        set @vr_Cnt = @vr_Cnt + 1
    end

    -- whatever remains after the last separator (NULL if empty)
    insert into @vr_Rtn (Data)
    select Data = nullif(ltrim(rtrim(@pr_RowData)), N'')

    return
end
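
A quick test of the function on a literal string:

select Id, Data from dbo.ft_Split(N'alpha,beta,gamma', N',')
-- returns three rows: (1, alpha), (2, beta), (3, gamma)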

Now you can write a query with the CROSS APPLY clause, which works very much like a traditional join:

select
    s.ID
    , virtualSplit.Data as SplittedValue
    ...
from
    dbo.tb_CrazyWrappedTable as s
    cross apply dbo.ft_Split(s.CSVWrappedColumn, N',') as virtualSplit


The CROSS APPLY operator was introduced in SQL Server 2005; in our scenario it returns the rows of the original table (tb_CrazyWrappedTable) matched with the rows of the virtual table (virtualSplit) generated by ft_Split.

As you might have guessed, the left table expression is processed first; the right table expression is then evaluated against each row of the left one. The final result set contains all the selected columns from the left table expression combined with all the corresponding columns of the right table expression - which means the original source table ID will be duplicated! Queries against your resulting fact table should therefore use the COUNT(DISTINCT) aggregate function - slow and disgraceful, but logically consistent.
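
For instance, to count how many source rows (rather than split rows) contain a given value - 'alpha' here is just a hypothetical example value:

select
    count(distinct s.ID) as SourceRows
from
    dbo.tb_CrazyWrappedTable as s
    cross apply dbo.ft_Split(s.CSVWrappedColumn, N',') as virtualSplit
where
    virtualSplit.Data = N'alpha'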

To reduce performance issues or simply speed up your read-only queries, you may want to persist the result of the two previous code fragments. Note, however, that an indexed view cannot reference a table-valued function or the APPLY operator, so in practice this means materialising the split result into a helper table during your load process.