Exclude Overlapping Periods In Time Aggregate Function

October 21, 2024 Post a Comment

I have a table containing each a start and and end date: DROP TABLE temp_period; CREATE TABLE public.temp_period ( id integer NOT NULL, 'startDate' date, 'endDate' date );

Solution 1:

What about this:

WITH/* get all time points where something changes */
   points AS (
       SELECT "startDate" AS p
       FROM temp_period
       UNIONSELECT "endDate"
       FROM temp_period
   ),
   /*
    * Get all date ranges between these time points.
    * The first time range will start with NULL,
    * but that will be excluded in the next CTE anyway.
    */
   inter AS (
      SELECT daterange(
                lag(p) OVER (ORDERBY p),
                p
             ) i
      FROM points
   ),
   /*
    * Get all date ranges that are contained
    * in at least one of the intervals.
    */
   overlap AS (
      SELECTDISTINCT i
      FROM inter
         CROSSJOIN temp_period
      WHERE i <@ daterange("startDate", "endDate")
   )
/* sum the lengths of the date ranges */SELECTsum(age(upper(i), lower(i)))
FROM overlap;

For your data it will return:

┌──────────┐
│ interval │
├──────────┤
│ 576 days │
└──────────┘
(1row)

Solution 2:

You could try to use recursive cte to calculate the period. For each record, we will check if it's overlapped with previous records. If it is, we only calculate the period that is not overlapping.

WITH RECURSIVE days_count AS 
  ( 
         SELECT startDate, 
                endDate, 
                AGE(endDate, startDate) AS total_days, 
                rowSeq 
         FROM   ordered_data 
         WHERE  rowSeq = 1 
         UNION ALL 
         SELECT     GREATEST(curr.startDate, prev.endDate)                                            AS startDate,
                    GREATEST(curr.endDate, prev.endDate)                                              AS endDate,
                    AGE(GREATEST(curr.endDate, prev.endDate), GREATEST(curr.startDate, prev.endDate)) AS total_days,
                    curr.rowSeq 
         FROM       ordered_data curr 
         INNER JOIN days_count prev 
         ON         curr.rowSeq > 1AND        curr.rowSeq = prev.rowSeq + 1), 
ordered_data AS 
  ( 
           SELECT   *, 
                    ROW_NUMBER() OVER (ORDERBY startDate) AS rowSeq 
           FROM     temp_period) 
SELECT SUM(total_days) AS total_days
FROM   days_count;

I've created a demo here

Solution 3:

Actually there is a case that is not covered by the previous examples. What if we have such a period ?

INSERTINTO temp_period(id,"startDate","endDate") VALUES(100,'2010-01-03','2010-02-10');

We have the following intervals:

Interval No. || start_date ||  end_date
--------------+------------------+------------+----------------+------------1|Intervalstart|2010-01-01|Intervalend|2010-03-312|Intervalstart|2010-01-03|Intervalend|2010-02-103|Intervalstart|2010-02-15|Intervalend|2010-05-314|Intervalstart|2013-05-17|Intervalend|2013-07-185|Intervalstart|2014-01-01|Intervalend|2014-12-316|Intervalstart|2014-03-31|Intervalend|2014-06-30

Even though segment 3 overlaps segment 1, it's seen as a new segment, hence the (wrong) result:

 sum
-----
 620
(1 row)

The solution is to tweak the core of the query

CASEWHEN start_date <lag(end_date) OVER (ORDERBY start_date, end_date) thenNULLELSE start_date END

needs to be replaced by

CASEWHEN start_date <max(end_date) OVER (ORDERBY start_date, end_date rowsbetween unbounded preceding and1 preceding) thenNULLELSE start_date END

then it works as expected

 sum
-----
 576
(1 row)

Summary:

SELECTsum(e - s)
  FROM (
    SELECT left_edge as s, max(end_date) as e
    FROM (   
      SELECT start_date, end_date, max(new_start) over (ORDERBY start_date, end_date) as left_edge
      FROM ( 
        SELECT start_date, end_date, CASEWHEN start_date <max(end_date) OVER (ORDERBY start_date, end_date rowsbetween unbounded preceding and1 preceding) thenNULLELSE start_date ENDAS new_start
        FROM temp_period
      ) s1
    ) s2
    GROUPBY left_edge
  ) s3;

Solution 4:

This one required two outer joins on a complex query. One join to identify all overlaps with a startdate larger than THIS and to expand the timespan to match the larger of the two. The second join is needed to match records with no overlaps. Take the Min of the min and the max of the max, including non matched. I was using MSSQL so the syntax may be a bit different.

DECLARE@temp_period TABLE
(
  id intNOTNULL,
  startDate datetime,
  endDate datetime
)

INSERTINTO@temp_period(id,startDate,endDate) VALUES(1,'2010-01-01','2010-03-31')
INSERTINTO@temp_period(id,startDate,endDate) VALUES(2,'2013-05-17','2013-07-18')
INSERTINTO@temp_period(id,startDate,endDate) VALUES(3,'2010-02-15','2010-05-31')
INSERTINTO@temp_period(id,startDate,endDate) VALUES(3,'2010-02-15','2010-07-31')
INSERTINTO@temp_period(id,startDate,endDate) VALUES(7,'2014-01-01','2014-12-31')
INSERTINTO@temp_period(id,startDate,endDate) VALUES(56,'2014-03-31','2014-06-30')


;WITHOverLapsAS
(
    SELECT 
        Main.id,
        OverlappedID=Overlaps.id,
        OverlapMinDate,
        OverlapMaxDate
    FROM@temp_period Main
        LEFTOUTERJOIN
        (
            SELECT 
                This.id,
                OverlapMinDate=CASEWHEN This.StartDate<Prior.StartDate THEN This.StartDate ELSE Prior.StartDate END,
                OverlapMaxDate=CASEWHEN This.EndDate>Prior.EndDate THEN This.EndDate ELSE Prior.EndDate END,
                PriorID=Prior.id
            FROM@temp_period This
                LEFTOUTERJOIN@temp_period Prior ON Prior.endDate > This.startDate AND Prior.startdate < this.endDate AND This.Id<>Prior.ID
        ) OverlapsON Main.Id=Overlaps.PriorId
)

SELECT
    T.Id,
    --If has overlapped then sum all overlapped records prior to this one, else not and overlap get the start and end
    MinDate=MIN(COALESCE(HasOverlapped.OverlapMinDate,startDate)),
    MaxDate=MAX(COALESCE(HasOverlapped.OverlapMaxDate,endDate))
FROM@temp_period T
    LEFTOUTERJOINOverLaps IsAOverlap ON IsAOverlap.OverlappedID=T.id
    LEFTOUTERJOINOverLaps HasOverlapped ON HasOverlapped.Id=T.id
WHERE
    IsAOverlap.OverlappedID ISNULL-- Exclude older records that have overlapsGROUPBY
    T.Id

Solution 5:

Beware: the answer by Laurenz Albe has a huge scalability issue.

I was more than happy when I found it. I customized it for our needs. We deployed to staging and very soon, the server took several minutes to return the results.

Then I found this answer on postgresql.org. Much more efficient. https://wiki.postgresql.org/wiki/Range_aggregation

SELECTsum(e - s)
FROM (
  SELECT left_edge as s, max(end_date) as e
  FROM (   
    SELECT start_date, end_date, max(new_start) over (ORDERBY start_date, end_date) as left_edge
    FROM ( 
      SELECT start_date, end_date, CASEWHEN start_date <lag(end_date) OVER (ORDERBY start_date, end_date) thenNULLELSE start_date ENDAS new_start
      FROM temp_period
      ) s1
    ) s2
  GROUPBY left_edge
  ) s3;

Result:

 sum
-----
 576
(1 row)

Lab Sql