PySpark rolling operation with variable ranges

I have a dataframe looking like this
some_data | date     | date_from | date_to
1234      | 1-2-2020 | 1-2-2020  | 2-2-2020
5678      | 2-2-2020 | 1-2-2020  | 2-3-2020
I need to perform some operations on some_data based on time ranges that differ for every row and are stored in date_from and date_to. This is basically a rolling operation of some_data over date, where the width of the window is not constant.
If the time ranges were the same, like always 7 days preceding/following, I would just do a window with rangeBetween. Any idea how I can still use rangeBetween with these variable ranges? I could really use the partitioning capability Window provides...
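(For reference, with a constant range I would do something like the following sketch; "partition_col" is hypothetical, and it assumes date is a proper date/timestamp column.)

from pyspark.sql import Window, functions as F

# Fixed 7-day preceding window via rangeBetween over the timestamp as seconds.
# "partition_col" is a stand-in for whatever column the data is partitioned by.
days = lambda n: n * 86400
w = (
    Window
    .partitionBy("partition_col")
    .orderBy(F.col("date").cast("timestamp").cast("long"))
    .rangeBetween(-days(7), 0)
)
df_fixed = df.withColumn("rolling_sum", F.sum("some_data").over(w))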
My current solution (sketched in the code below) is:
- a join of the table with itself to obtain a secondary/nested date column; at this point every primary date has the full list of possible dates
- a set of where filters to select, for each primary date, the proper secondary dates according to date_from and date_to
- a groupBy on the primary date, with agg performing the actual operation on the selected rows
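Roughly, a sketch of this approach looks like the following (column names follow the sample above; sum is just a placeholder for the actual operation, and the alias names are illustrative):

from pyspark.sql import functions as F

# Self-join: pair every primary date with every (date, some_data) pair,
# then keep only the secondary dates that fall inside the row's range.
primary = df.select("date", "date_from", "date_to")
secondary = df.select(
    F.col("date").alias("secondary_date"),
    F.col("some_data").alias("secondary_data"),
)

result = (
    primary.crossJoin(secondary)
    .where(F.col("secondary_date").between(F.col("date_from"), F.col("date_to")))
    .groupBy("date")
    .agg(F.sum("secondary_data").alias("rolled_data"))
)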
But I am afraid this would not be very performant on large datasets. Can this be done with Window? Do you have a better/more performant suggestion?
Thanks a lot,
Andrea.

Related

Convert date column in source to date + time format in target (YYYY-MM-DD.HH.MM.SS.nnnnnn) and for each duplicate entry increment by 1 nanosecond

I have a requirement to convert a date column in the source table to a date-time in the target. This date-time column is part of a composite primary key in the target, so if there are any duplicate entries we have to increase the nanosecond by 1. This has to be done in a Postgres CTE for a DBT query. There can also be duplicates in the source, so to achieve a unique value we need to add a nanosecond during conversion for the duplicate rows.
For example: 2021-07-30 00:00:00.000000
If there is more than one row for the same effective date, increment the nanosecond by 1.
Update: Postgres version 11.9
Postgres 9.4 doesn't have ON CONFLICT. And Postgres doesn't support "nanosecond" as an interval. But if you won't get conflicts after incrementing, you can try:
insert into target (dt)
    select (case when t.dt is null then s.dt
                 else s.dt + interval '1 microsecond'
            end)
    from source s left join
         target t
         on s.dt = t.dt;
This problem gets a bit trickier if you have duplicates in the source or if there are conflicts after incrementing. You haven't provided sample data and desired results, so this answers the simplest interpretation of your question.
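If the source itself contains duplicate dates, one possible sketch (assuming each duplicate should simply get the next microsecond added, and ignoring rows already present in target) is:

-- Spread source duplicates out by whole microseconds within each dt group;
-- the order within a group of identical dates is arbitrary here.
insert into target (dt)
    select s.dt + (row_number() over (partition by s.dt order by s.dt) - 1) * interval '1 microsecond'
    from source s;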

Query to fetch records between two dates and times

I have been using PostgreSQL. My table has 3 columns: date, time, and userId. I have to find records between a given date and time frame. Since the date and time columns are separate, the BETWEEN clause is not providing valid results.
Combine the two columns into a single timestamp by adding the time to the date:
select *
from some_table
where date_column + time_column
between timestamp '2017-06-14 17:30:00' and timestamp '2017-06-19 08:26:00';
Note that this will not use an index on date_column or time_column. You would need to create an index on that expression. Or better: use a single column defined as timestamp instead.
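For example, such an expression index might look like this (the index name is just illustrative):

-- Index on the combined timestamp expression so the BETWEEN filter can use it.
create index some_table_ts_idx on some_table ((date_column + time_column));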

DB2 select last day with a lot of data

I have a very big table in DB2 with around 500 million rows.
I need to select only the last day, based on a timestamp column and other conditions.
I did something like this, but it takes forever (about 10 minutes) to get the results. Is there any other way to query this faster? I am not familiar with DB2.
DTM is a timestamp datatype.
select a, b, c, d, e, DTM from table1
where e = 'I' and DTM > current timestamp - 1 days
Any help please
Besides an index, another option may be range partitioning on this table. If you could range partition by month, you would only have to scan one month's partition, for example. Even better if you could partition by day (and have the partitioning key in the index, so you would have a partitioned index too).

How to optimize a table for queries sorted by insertion order in Postgres

I have a table of time series data where, for almost all queries, I wish to select data ordered by collection time. I do have a timestamp column, but I do not want to rely on actual timestamps for this, because if two entries have the same timestamp it is crucial that I be able to sort them in the order they were collected, which is information I have at insert time.
My current schema just has a timestamp column. How would I alter my schema to make sure I can sort based on collection/insertion time, and make sure querying in collection/insertion order is efficient?
Add a column based on a sequence (i.e. serial), and create an index on (timestamp_column, serial_column). Then you can get insertion order (more or less) by doing:
ORDER BY timestamp_column, serial_column;
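A minimal sketch of that schema change, assuming a table called my_table (the question does not name the table):

-- New sequence-backed column plus a composite index for ordered scans.
alter table my_table add column serial_column bigserial;
create index my_table_ts_serial_idx on my_table (timestamp_column, serial_column);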
You could use a SERIAL column called insert_order. This way there will be no two rows with the same value. However, I am not sure that your requirement of being in absolute time order is possible to achieve.
For example, suppose there are two transactions, T1 and T2, and they happen at the same time, and you are running on a machine with multiple processors, so in fact both T1 and T2 did the insert at exactly the same instant. Is this a case that you are concerned about? There was not enough info in your question to know exactly.
Also, with a serial column you have the issue of gaps: for example, T1 could grab serial value 14 and T2 could grab value 15, then T1 rolls back and T2 does not, so you have to expect that the insert_order column might have gaps in it.

PostgreSQL amount for each day summed up in weeks

I've been trying to find a solution to this challenge all day.
I've got a table:
  id  | amount | type |            date            | description | club_id
------+--------+------+----------------------------+-------------+---------
  783 |  10000 |    5 | 2011-08-23 12:52:19.995249 | Sign on fee |       7
The table has a lot more data than this.
What I'm trying to do is get the sum of amount for each week, given a specific club_id.
The last thing I ended up with was this, but it doesn't work:
WITH RECURSIVE t AS (
    SELECT EXTRACT(WEEK FROM date) AS week, amount
    FROM club_expenses
    WHERE club_id = 20 AND EXTRACT(WEEK FROM date) < 10
    ORDER BY week
    UNION ALL
    SELECT week+1, amount FROM t WHERE week < 3
)
SELECT week, amount FROM t;
I'm not sure why it doesn't work, but it complains about the UNION ALL.
I'll be off to bed in a minute, so I won't be able to see any answers before tomorrow (sorry).
I hope I've described it adequately.
Thanks in advance!
It looks to me like you are trying to use the UNION ALL to retrieve a subset of the first part of the query. That won't work. You have two options. The first is to use user-defined functions to add behavior as you need it, and the second is to nest your WITH clauses. I tend to prefer the former, but you may prefer the latter.
To do the functions/table-methods approach, you create a function which accepts a row of a table as input and does not hit the table directly. This provides a bunch of benefits, including the ability to easily index the output. Here the function would look like:
CREATE FUNCTION week(club_expenses) RETURNS int LANGUAGE SQL IMMUTABLE AS $$
    select EXTRACT(WEEK FROM $1.date)::int
$$;
Now you have a usable macro which can be used where you would use a column. You can then:
SELECT c.week, sum(amount) FROM club_expenses c
GROUP BY c.week;
Note that the c. before week is not optional. The parser converts that into week(c). If you want to limit this to a year, you can do the same with years.
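For instance, a year() helper along the same lines might look like this (a sketch mirroring the week() function above):

CREATE FUNCTION year(club_expenses) RETURNS int LANGUAGE SQL IMMUTABLE AS $$
    select EXTRACT(YEAR FROM $1.date)::int
$$;

-- c.year resolves to year(c), just as c.week resolves to week(c).
SELECT c.week, sum(amount) FROM club_expenses c
WHERE c.year = 2011
GROUP BY c.week;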
This is a really neat, useful feature of Postgres.