Exclude Overlapping Periods In Time Aggregate Function
Solution 1:
What about this:
WITH/* get all time points where something changes */
points AS (
SELECT "startDate" AS p
FROM temp_period
UNIONSELECT "endDate"
FROM temp_period
),
/*
* Get all date ranges between these time points.
* The first time range will start with NULL,
* but that will be excluded in the next CTE anyway.
*/
inter AS (
SELECT daterange(
lag(p) OVER (ORDERBY p),
p
) i
FROM points
),
/*
* Get all date ranges that are contained
* in at least one of the intervals.
*/
overlap AS (
SELECTDISTINCT i
FROM inter
CROSSJOIN temp_period
WHERE i <@ daterange("startDate", "endDate")
)
/* sum the lengths of the date ranges */SELECTsum(age(upper(i), lower(i)))
FROM overlap;
For your data it will return:
┌──────────┐
│ interval │
├──────────┤
│ 576 days │
└──────────┘
(1row)
Solution 2:
You could try to use recursive cte to calculate the period. For each record, we will check if it's overlapped with previous records. If it is, we only calculate the period that is not overlapping.
WITH RECURSIVE days_count AS
(
SELECT startDate,
endDate,
AGE(endDate, startDate) AS total_days,
rowSeq
FROM ordered_data
WHERE rowSeq = 1
UNION ALL
SELECT GREATEST(curr.startDate, prev.endDate) AS startDate,
GREATEST(curr.endDate, prev.endDate) AS endDate,
AGE(GREATEST(curr.endDate, prev.endDate), GREATEST(curr.startDate, prev.endDate)) AS total_days,
curr.rowSeq
FROM ordered_data curr
INNER JOIN days_count prev
ON curr.rowSeq > 1AND curr.rowSeq = prev.rowSeq + 1),
ordered_data AS
(
SELECT *,
ROW_NUMBER() OVER (ORDERBY startDate) AS rowSeq
FROM temp_period)
SELECT SUM(total_days) AS total_days
FROM days_count;
I've created a demo here
Solution 3:
Actually there is a case that is not covered by the previous examples. What if we have such a period ?
INSERTINTO temp_period(id,"startDate","endDate") VALUES(100,'2010-01-03','2010-02-10');
We have the following intervals:
Interval No. || start_date || end_date
--------------+------------------+------------+----------------+------------1|Intervalstart|2010-01-01|Intervalend|2010-03-312|Intervalstart|2010-01-03|Intervalend|2010-02-103|Intervalstart|2010-02-15|Intervalend|2010-05-314|Intervalstart|2013-05-17|Intervalend|2013-07-185|Intervalstart|2014-01-01|Intervalend|2014-12-316|Intervalstart|2014-03-31|Intervalend|2014-06-30Even though segment 3 overlaps segment 1, it's seen as a new segment, hence the (wrong) result:
sum
-----
620
(1 row)
The solution is to tweak the core of the query
CASEWHEN start_date <lag(end_date) OVER (ORDERBY start_date, end_date) thenNULLELSE start_date ENDneeds to be replaced by
CASEWHEN start_date <max(end_date) OVER (ORDERBY start_date, end_date rowsbetween unbounded preceding and1 preceding) thenNULLELSE start_date ENDthen it works as expected
sum
-----
576
(1 row)
Summary:
SELECTsum(e - s)
FROM (
SELECT left_edge as s, max(end_date) as e
FROM (
SELECT start_date, end_date, max(new_start) over (ORDERBY start_date, end_date) as left_edge
FROM (
SELECT start_date, end_date, CASEWHEN start_date <max(end_date) OVER (ORDERBY start_date, end_date rowsbetween unbounded preceding and1 preceding) thenNULLELSE start_date ENDAS new_start
FROM temp_period
) s1
) s2
GROUPBY left_edge
) s3;
Solution 4:
This one required two outer joins on a complex query. One join to identify all overlaps with a startdate larger than THIS and to expand the timespan to match the larger of the two. The second join is needed to match records with no overlaps. Take the Min of the min and the max of the max, including non matched. I was using MSSQL so the syntax may be a bit different.
DECLARE@temp_period TABLE
(
id intNOTNULL,
startDate datetime,
endDate datetime
)
INSERTINTO@temp_period(id,startDate,endDate) VALUES(1,'2010-01-01','2010-03-31')
INSERTINTO@temp_period(id,startDate,endDate) VALUES(2,'2013-05-17','2013-07-18')
INSERTINTO@temp_period(id,startDate,endDate) VALUES(3,'2010-02-15','2010-05-31')
INSERTINTO@temp_period(id,startDate,endDate) VALUES(3,'2010-02-15','2010-07-31')
INSERTINTO@temp_period(id,startDate,endDate) VALUES(7,'2014-01-01','2014-12-31')
INSERTINTO@temp_period(id,startDate,endDate) VALUES(56,'2014-03-31','2014-06-30')
;WITHOverLapsAS
(
SELECT
Main.id,
OverlappedID=Overlaps.id,
OverlapMinDate,
OverlapMaxDate
FROM@temp_period Main
LEFTOUTERJOIN
(
SELECT
This.id,
OverlapMinDate=CASEWHEN This.StartDate<Prior.StartDate THEN This.StartDate ELSE Prior.StartDate END,
OverlapMaxDate=CASEWHEN This.EndDate>Prior.EndDate THEN This.EndDate ELSE Prior.EndDate END,
PriorID=Prior.id
FROM@temp_period This
LEFTOUTERJOIN@temp_period Prior ON Prior.endDate > This.startDate AND Prior.startdate < this.endDate AND This.Id<>Prior.ID
) OverlapsON Main.Id=Overlaps.PriorId
)
SELECT
T.Id,
--If has overlapped then sum all overlapped records prior to this one, else not and overlap get the start and end
MinDate=MIN(COALESCE(HasOverlapped.OverlapMinDate,startDate)),
MaxDate=MAX(COALESCE(HasOverlapped.OverlapMaxDate,endDate))
FROM@temp_period T
LEFTOUTERJOINOverLaps IsAOverlap ON IsAOverlap.OverlappedID=T.id
LEFTOUTERJOINOverLaps HasOverlapped ON HasOverlapped.Id=T.id
WHERE
IsAOverlap.OverlappedID ISNULL-- Exclude older records that have overlapsGROUPBY
T.Id
Solution 5:
Beware: the answer by Laurenz Albe has a huge scalability issue.
I was more than happy when I found it. I customized it for our needs. We deployed to staging and very soon, the server took several minutes to return the results.
Then I found this answer on postgresql.org. Much more efficient. https://wiki.postgresql.org/wiki/Range_aggregation
SELECTsum(e - s)
FROM (
SELECT left_edge as s, max(end_date) as e
FROM (
SELECT start_date, end_date, max(new_start) over (ORDERBY start_date, end_date) as left_edge
FROM (
SELECT start_date, end_date, CASEWHEN start_date <lag(end_date) OVER (ORDERBY start_date, end_date) thenNULLELSE start_date ENDAS new_start
FROM temp_period
) s1
) s2
GROUPBY left_edge
) s3;
Result:
sum
-----
576
(1 row)
Post a Comment for "Exclude Overlapping Periods In Time Aggregate Function"