Home Ask Login Register

Developers Planet

Your answer is one click away!

Oryx February 2016

TSQL - Run date comparison for "duplicates"/false positives on initial query?

I'm pretty new to SQL and am working on pulling some data from several very large tables for analysis. The data is basically triggered events for assets on a system. The events all have a created_date (datetime) field that I care about.

I was able to put together the query below to get the data I need (YAY):

  FROM event
  LEFT JOIN asset
         ON event.a_key = asset.a_key
         ON event.l_key = l.l_key

  WHERE event.e_key IN (350, 352, 378)

  ORDER BY asset.a_id, event.created_date

However, while this gives me the data for the specific events I want, I still have another problem. Assets can trigger these events repeatedly, which can result in large numbers of "false positives" for what I'm looking at.

What I need to do is go through the result set of the query above and remove any events for an asset that occur closer than N minutes together (say 30 minutes for this example). So IF the asset_ID is the same AND the event.created_date is within 30 minutes of another event for that asset in the set THEN I want that removed. For example:

For the following records

a_id 1124 created 2016-02-01 12:30:30
a_id 1124 created 2016-02-01 12:35:31
a_id 1124 created 2016-02-01 12:40:33
a_id 1124 created 2016-02-01 12:45:42
a_id 1124 created 2016-02-02 12:30:30
a_id 1124 created 2016-02-02 13:00:30
a_id 1115 created 2016-02-01-12:30:30

I'd want to return only:

a_id 1124 created 2016-02-01 12:30:30 
a_id 1124 created 2016-02-02 12:30:30 
a_id 1124 created 2016-02-02 13:00:30 
a_id 1115 created 2016-02-01-12:30:30

I tried referencing this<


HABO February 2016

-- Sample data.
declare @Samples as Table ( Id Int Identity, A_Id Int, CreatedDate DateTime );
insert into @Samples ( A_Id, CreatedDate ) values
  ( 1124, '2016-02-01 12:30:30' ),
  ( 1124, '2016-02-01 12:35:31' ),
  ( 1124, '2016-02-01 12:40:33' ),
  ( 1124, '2016-02-01 12:45:42' ),
  ( 1124, '2016-02-02 12:30:30' ),
  ( 1124, '2016-02-02 13:00:30' ),
  ( 1125, '2016-02-01 12:30:30' );
select * from @Samples;

-- Calculate the windows of 30 minutes before and after each   CreatedDate   and check for conflicts with other rows.
with Ranges as (
  select Id, A_Id, CreatedDate,
    DateAdd( minute, -30, S.CreatedDate ) as RangeStart, DateAdd( minute, 30, S.CreatedDate ) as RangeEnd
    from @Samples as S )
  select Id, A_Id, CreatedDate, RangeStart, RangeEnd,
    -- Check for a conflict with another row with:
    --   the same   A_Id   value and an earlier   CreatedDate   that falls inside the +/-30 minute range.
    case when exists ( select 42 from @Samples where A_Id = R.A_Id and CreatedDate < R.CreatedDate and R.RangeStart < CreatedDate and CreatedDate < R.RangeEnd ) then 1
      else 0 end as Conflict
    from Ranges as R;

Becuzz February 2016

This is a trickier problem than it initially appears. The hard part is capturing the previous good row and removing the next bad rows but not allowing those bad rows to influence whether or not the next row is good. Here is what I came up with. I've tried to explain what is going on with comments in the code.

--sample data since I don't have your table structure and your original query won't work for me
declare @events table
  id int,
  timestamp datetime

--note that I changed some of your sample data to test some different scenarios
insert into @events values( 1124, '2016-02-01 12:30:30')
insert into @events values( 1124, '2016-02-01 12:35:31')
insert into @events values( 1124, '2016-02-01 12:40:33')
insert into @events values( 1124, '2016-02-01 13:05:42')
insert into @events values( 1124, '2016-02-02 12:30:30')
insert into @events values( 1124, '2016-02-02 13:00:30')
insert into @events values( 1115, '2016-02-01 12:30:30')

--using a cte here to split the result set of your query into groups
--by id (you would want to partition by whatever criteria you use
--to determine that rows are talking about the same event)
--the row_number function gets the row number for each row within that 
--id partition
--the over clause specifies how to break up the result set into groups 
--(partitions) and what order to put the rows in within that group so 
--that the numbering stays consistant
;with orderedEvents as
    select id, timestamp, row_number() over (partition by id order by timestamp) as rn
    from @events
    --you would replace @events here with your query
--using a second recursive cte here to determine which rows are "good"
--and which ones are not.  
, previousGoodTimestamps as 
    --this is the "seeding" part of the recursive cte where I pick the
    --first rows of each group as being a desired result.  Since they 
    --are the first in each group, I know they are good.  I also assign
    --their timestamp as the previous good  

Post Status

Asked in February 2016
Viewed 3,516 times
Voted 6
Answered 2 times


Leave an answer

Quote of the day: live life