Friday, January 17, 2014

How to Determine if Two Date Ranges Overlap in SQL

Of the many possible configurations two time intervals can have with respect to each other, the most common question is whether two or more date ranges overlap. How can we express this in relational terms?

Suppose we have a reference time range X. In SQL, X will be represented by a pair of date columns or variables (how else?): @pr_DateFrom and @pr_DateTo.

We are now interested in determining whether the date ranges contained in one table, tb_DateRanges, overlap with our reference date range. tb_DateRanges has two columns, DateFrom and DateTo. Moreover, if even a single one of those ranges overlaps, our query should return a positive result. In SQL, the solution to this problem can easily be determined with an aggregating query.


The reference date range (black), the overlapping date ranges (green) and the non-overlapping date ranges (red).



We begin by expressing the opposite condition, i.e. if date range A does NOT overlap with reference range X. This happens when:
   
tb_DateRanges.DateTo <= @pr_DateFrom

OR

tb_DateRanges.DateFrom >= @pr_DateTo

Now, according to De Morgan's law:
   
not(A OR B)

is equivalent to  

not(A) AND not(B)

which means:

NOT(tb_DateRanges.DateTo <= @pr_DateFrom)

AND

NOT(tb_DateRanges.DateFrom >= @pr_DateTo)
   


By evaluating this condition against every row of tb_DateRanges, we obtain a sequence of "true" (1) and "false" (0) values. By aggregating this raw result with the MAX() SQL function, we simulate an "OR" condition across the values.
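
As a minimal sketch of such an aggregating query (assuming the table and variable names used above; the CASE expression is just one possible way to turn the predicate into 1/0 values):

select
    max(case
            when not(r.DateTo <= @pr_DateFrom)     -- i.e. r.DateTo > @pr_DateFrom
             and not(r.DateFrom >= @pr_DateTo)     -- i.e. r.DateFrom < @pr_DateTo
            then 1
            else 0
        end) as HasOverlap    -- 1 if at least one range overlaps the reference range, else 0
from
    dbo.tb_DateRanges as r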
   
This solution admits configurations where the edges touch exactly, i.e. such ranges are not counted as overlapping. If you wish to exclude that, you may change the inequality operators from ">=" to ">" and from "<=" to "<", or vice versa.

A similar pattern can be used for spatial problems. For instance, for three-dimensional objects the same logic can be applied by listing the relation for each coordinate separately, as sketched below.
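
As a purely illustrative sketch, not taken from the article (the table tb_Boxes, its columns XFrom/XTo, YFrom/YTo, ZFrom/ZTo and the corresponding @pr_* variables are all hypothetical), an axis-aligned box test simply repeats the same predicate once per coordinate:

select
    max(case
            when not(b.XTo <= @pr_XFrom) and not(b.XFrom >= @pr_XTo)    -- overlap on X
             and not(b.YTo <= @pr_YFrom) and not(b.YFrom >= @pr_YTo)    -- overlap on Y
             and not(b.ZTo <= @pr_ZFrom) and not(b.ZFrom >= @pr_ZTo)    -- overlap on Z
            then 1
            else 0
        end) as HasOverlap
from
    dbo.tb_Boxes as b    -- hypothetical table of axis-aligned boxes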

Wednesday, January 8, 2014

How to Split a Comma Separated Values Column into Multiple Rows

In many cases a SQL developer may face a situation in which several values are "wrapped" into the same column. Even though this violates First Normal Form, it is not uncommon.
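
To make the goal concrete, here is a purely illustrative example (the sample values are invented; the column names follow the query shown further below):

    ID    CSVWrappedColumn
    1     red,green,blue
    2     green

which, once unwrapped, should become:

    ID    SplittedValue
    1     red
    1     green
    1     blue
    2     green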


Depending on your design considerations, you may want to "unwrap" the column, generating a new table with multiple rows sharing the same ID (one for every corresponding split value), or simply use an additional reference table. The first solution fits well with the fact table of a dimensional data warehouse, while the second is more suitable for a normalized environment, such as a traditional information system.


Here is my definitely not elegant but working solution for a DWH scenario. First of all, you need a splitting table-valued function:

create function [dbo].[ft_Split]
(
    @pr_RowData nvarchar(2000),
    @pr_SplitOn nvarchar(5)
)
returns @vr_Rtn table
(
    Id int identity(1,1),
    Data nvarchar(max)
)
as
begin
    -- Consume @pr_RowData one delimiter at a time,
    -- inserting each trimmed token into the return table
    while (Charindex(@pr_SplitOn, @pr_RowData) > 0)
    begin
        Insert Into @vr_Rtn (Data)
        Select
            Data = ltrim(rtrim(substring(@pr_RowData, 1, charindex(@pr_SplitOn, @pr_RowData) - 1)))

        -- Drop the token just extracted (and its delimiter) from the input string
        Set @pr_RowData = substring(@pr_RowData, charindex(@pr_SplitOn, @pr_RowData) + 1, len(@pr_RowData))
    end

    -- Whatever is left after the last delimiter is the final token;
    -- an empty remainder becomes NULL
    Insert Into @vr_Rtn (Data)
    Select Data = nullif(ltrim(rtrim(@pr_RowData)), N'')

    Return
end
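
To try the function in isolation (the sample string is invented), you can simply query it like a table:

    select Id, Data
    from dbo.ft_Split(N'red,green,blue', N',')

which returns one row per token: red, green and blue.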

Now you can write a query with the CROSS APPLY clause, which works very much like a traditional join:

select
    s.ID
    , virtualSplit.Data as SplittedValue
    ...
from
    dbo.tb_CrazyWrappedTable as s
    cross apply dbo.ft_Split(s.CSVWrappedColumn, N',') as virtualSplit


The CROSS APPLY operator was introduced in SQL Server 2005; in our scenario it returns the rows of the original table (tb_CrazyWrappedTable) matched with the rows of the virtual table (virtualSplit) generated by ft_Split.

As you might have guessed, the left (original) table expression is processed first; the right table expression is then evaluated against each of its rows. The final result set contains all the selected columns from the left table expression combined with all the corresponding columns of the right table expression - and this means the original source table ID will be duplicated! Queries against your resulting fact table should therefore use the COUNT(DISTINCT ...) aggregate - slow and inelegant, but logically consistent.
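
For instance, a minimal sketch (assuming the split result has been persisted into a hypothetical fact table named tb_FactSplit that keeps the ID and SplittedValue columns) to count the original source rows would be:

    select
        count(distinct f.ID) as SourceRowCount    -- collapses the IDs duplicated by the split
    from
        dbo.tb_FactSplit as f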

To reduce performance issues, or simply to speed up your read-only queries, you could build an indexed view upon the original table, based on the two previous code fragments.