MySQL query to fill out the time gaps between the records - mysql-5.6

I want to write an optimized query that fills the time gaps between records with the stock value that is most recent as of each date.
The requirement is to have the latest stock value for every combination of id_warehouse, id_stock, and date. The table is already quite large (2 million records) and keeps growing, so I would like to optimize the query I have added below.
daily_stock_levels:
date        id_stock  id_warehouse  new_stock  is_stock_available
2022-01-01  1         1             24         1
2022-01-01  1         1             25         1
2022-01-01  1         1             29         1
2022-01-02  1         1             30         1
2022-01-06  1         1             27         1
2022-01-09  1         1             26         1
Result:
date        id_stock  id_warehouse  most_recent_stock_value  is_stock_available
2022-01-01  1         1             29                       1
2022-01-02  1         1             30                       1
2022-01-03  1         1             30                       1
2022-01-04  1         1             30                       1
2022-01-05  1         1             30                       1
2022-01-06  1         1             27                       1
2022-01-07  1         1             27                       1
2022-01-08  1         1             27                       1
2022-01-09  1         1             26                       1
2022-01-10  1         1             26                       1
...
2022-08-08  1         1             26                       1
SELECT
    sl.date,
    sl.id_warehouse,
    sl.id_stock,
    (SELECT s.date
     FROM daily_stock_levels s
     WHERE s.is_stock_available = 1
       AND sl.id_warehouse = s.id_warehouse
       AND sl.id_stock = s.id_stock
       AND sl.date >= s.date
     ORDER BY s.date DESC
     LIMIT 1) AS closest_date_with_stock_value,
    (SELECT s.new_stock
     FROM daily_stock_levels s
     WHERE s.is_stock_available = 1
       AND sl.id_warehouse = s.id_warehouse
       AND sl.id_stock = s.id_stock
       AND sl.date >= s.date
     ORDER BY s.date DESC
     LIMIT 1) AS most_recent_stock_value
FROM daily_stock_levels sl
GROUP BY sl.id_warehouse,
         sl.id_stock,
         sl.date
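
One common optimization for the two correlated subqueries, sketched here under the assumption that no such index exists yet and that the column names are the ones shown in the table above, is a composite index that covers the lookup, so each subquery can seek straight to the latest matching row instead of scanning:

-- covering index for the per-(warehouse, stock) latest-date lookups
ALTER TABLE daily_stock_levels
  ADD INDEX idx_warehouse_stock_avail_date (id_warehouse, id_stock, is_stock_available, date, new_stock);

With the equality columns first and date after them, both subqueries can resolve ORDER BY s.date DESC LIMIT 1 from the index; appending new_stock makes the second subquery index-only.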

Related

Create a range of dates in a pyspark DataFrame

I have the following abstracted DataFrame (my original DF has 60+ billion rows):
Id Date Val1 Val2
1 2021-02-01 10 2
1 2021-02-05 8 4
2 2021-02-03 2 0
1 2021-02-07 12 5
2 2021-02-05 1 3
My expected output is:
Id Date Val1 Val2
1 2021-02-01 10 2
1 2021-02-02 10 2
1 2021-02-03 10 2
1 2021-02-04 10 2
1 2021-02-05 8 4
1 2021-02-06 8 4
1 2021-02-07 12 5
2 2021-02-03 2 0
2 2021-02-04 2 0
2 2021-02-05 1 3
Basically, what I need is: if Val1 or Val2 changes between two dates, all the dates in between must carry the value from the previous date. (To see it more clearly, look at Id 2.)
I know that I can do this in many ways (window function, UDF, ...), but my doubt is: since my original DF has more than 60 billion rows, what is the best-performing approach for this processing?
I think the best approach (performance-wise) is an inner join (probably with broadcasting). If you are worried about the number of records, I suggest you run them in batches (split by record count, by date, or even by a random key). The general idea is just to avoid running everything at once.
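
As a rough illustration only, here is a minimal sketch of the join-based gap filling on the small example above; it assumes Date is a date column and Spark 2.4+ (for F.sequence), mixes the join with a forward-fill window, and leaves out the batching/broadcast tuning described above:

from pyspark.sql import functions as F, Window

# one row per Id for every calendar day between that Id's first and last date
date_ranges = (df.groupBy("Id")
                 .agg(F.min("Date").alias("start"), F.max("Date").alias("end"))
                 .select("Id", F.explode(F.sequence("start", "end")).alias("Date")))

# join the real rows back in, then forward-fill Val1/Val2 from the latest earlier date
w = Window.partitionBy("Id").orderBy("Date")
filled = (date_ranges.join(df, ["Id", "Date"], "left")
                     .withColumn("Val1", F.last("Val1", ignorenulls=True).over(w))
                     .withColumn("Val2", F.last("Val2", ignorenulls=True).over(w)))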

How to perform a pivot() and write.parquet on each partition of pyspark dataframe?

I have a spark dataframe df as below:
key| date | val | col3
1 1 10 1
1 2 12 1
2 1 5 1
2 2 7 1
3 1 30 2
3 2 20 2
4 1 12 2
4 2 8 2
5 1 0 2
5 2 12 2
I want to:
1) df_pivot = df.groupBy(['date', 'col3']).pivot('key').sum('val')
2) df_pivot.write.parquet('location')
But my data can get really big, with millions of unique keys and unique col3 values.
Is there any way to do the above operations per partition of col3?
E.g. for the partition where col3 == 1, do the pivot and write the parquet.
Note: I do not want to use a for loop!
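
One possibility worth sketching (my own suggestion, not something stated in the question): pivot once and let the writer split the output by col3, so each col3 value ends up in its own directory without an explicit loop. Note that the pivot itself still runs over all keys globally, so the frame can stay very wide if the key sets differ a lot between col3 values.

# pivot across all keys, then partition the output files by col3 at write time
df_pivot = df.groupBy('date', 'col3').pivot('key').sum('val')
df_pivot.write.partitionBy('col3').parquet('location')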

Postgres: for each row evaluate all successive rows under conditions

I have this table:
id | datetime | row_number
1 2018-04-09 06:27:00 1
1 2018-04-09 14:15:00 2
1 2018-04-09 15:25:00 3
1 2018-04-09 15:35:00 4
1 2018-04-09 15:51:00 5
1 2018-04-09 17:05:00 6
1 2018-04-10 06:42:00 7
1 2018-04-10 16:39:00 8
1 2018-04-10 18:58:00 9
1 2018-04-10 19:41:00 10
1 2018-04-14 17:05:00 11
1 2018-04-14 17:48:00 12
1 2018-04-14 18:57:00 13
For each row I'd like to count the successive rows that fall within '01:30:00' of the start of the current run, and restart the evaluation from the first row that doesn't meet the condition.
Let me try to explain the question better.
Using the window function lag():
SELECT id, datetime,
CASE WHEN datetime - lag (datetime,1) OVER(PARTITION BY id ORDER BY datetime)
< '01:30:00' THEN 1 ELSE 0 END AS count
FROM table
result is:
id | datetime | count
1 2018-04-09 06:27:00 0
1 2018-04-09 14:15:00 0
1 2018-04-09 15:25:00 1
1 2018-04-09 15:35:00 1
1 2018-04-09 15:51:00 1
1 2018-04-09 17:05:00 1
1 2018-04-10 06:42:00 0
1 2018-04-10 16:39:00 0
1 2018-04-10 18:58:00 0
1 2018-04-10 19:41:00 1
1 2018-04-14 17:05:00 0
1 2018-04-14 17:48:00 1
1 2018-04-14 18:57:00 1
But it's not OK for me, because I want to exclude row_number 5: the interval between row_number 5 and row_number 2 is > '01:30:00', and the new evaluation should start from row_number 5.
The same goes for row_number 13.
The right output could be:
id | datetime | count
1 2018-04-09 06:27:00 0
1 2018-04-09 14:15:00 0
1 2018-04-09 15:25:00 1
1 2018-04-09 15:35:00 1
1 2018-04-09 15:51:00 0
1 2018-04-09 17:05:00 1
1 2018-04-10 06:42:00 0
1 2018-04-10 16:39:00 0
1 2018-04-10 18:58:00 0
1 2018-04-10 19:41:00 1
1 2018-04-14 17:05:00 0
1 2018-04-14 17:48:00 1
1 2018-04-14 18:57:00 0
So the right count is 5.
I'd use a recursive query for this:
WITH RECURSIVE tmp AS (
    SELECT
        id,
        datetime,
        row_number,
        0 AS counting,
        datetime AS last_start
    FROM mytable
    WHERE row_number = 1

    UNION ALL

    SELECT
        t1.id,
        t1.datetime,
        t1.row_number,
        CASE WHEN lateral_1.counting THEN 1 ELSE 0 END AS counting,
        CASE WHEN lateral_1.counting THEN tmp.last_start ELSE t1.datetime END AS last_start
    FROM mytable AS t1
    INNER JOIN tmp
        ON (t1.id = tmp.id AND t1.row_number - 1 = tmp.row_number),
    LATERAL (SELECT (t1.datetime - tmp.last_start) < '1h 30m'::interval AS counting) AS lateral_1
)
SELECT id, datetime, counting
FROM tmp
ORDER BY id, datetime;
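
If only the per-id total is needed (5 for the sample above), the final SELECT of the same recursive query can aggregate instead of listing the rows:

-- replace the final SELECT above with an aggregate
SELECT id, SUM(counting) AS total
FROM tmp
GROUP BY id;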

Pandas edit dataframe

I am querying a MongoDB collection with two queries and appending the results to get a single data frame (the keys are: status, date, uniqueid).
for record in results:
    query1 = (record["sensordata"]["user"])
    df1 = pd.DataFrame(query1.items())
    query2 = (record["created_date"])
    df2 = pd.DataFrame(query2.items())
    index = "status"
    result = df1.append(df2, index)
    b = result.transpose()
    print b
    b.to_csv(q)
output is :
0 1 2
0 status uniqueid date
1 0 191b117fcf5c 2017-03-01 17:51:08.263000
0 1 2
0 status uniqueid date
1 1 191b117fcf5c 2017-03-01 17:51:17.216000
0 1 2
0 status uniqueid date
1 1 191b117fcf5c 2017-03-01 17:51:23.269000
0 1 2
0 status uniqueid date
1 1 191b117fcf5c 2017-03-01 18:26:17.216000
0 1 2
0 status uniqueid date
1 1 191b117fcf5c 2017-03-01 18:26:21.130000
0 1 2
0 status uniqueid date
1 0 191b117fcf5c 2017-03-01 18:26:28.217000
How do I remove these extra 0, 1, 2 labels in the rows and columns?
Also, I don't want status, uniqueid and date repeated every time.
My desired output should be like this:
status uniqueid date
0 191b117fcf5c 2017-03-01 18:26:28.217000
1 191b117fcf5c 2017-03-01 19:26:28.192000
1 191b117fcf5c 2017-04-01 11:16:28.222000
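
A minimal sketch of one way to get that shape (the field names inside record["sensordata"]["user"] are my assumption, based on the output above): build one dict per document, create a single DataFrame, and write the CSV once without the index.

import pandas as pd

rows = []
for record in results:
    user = record["sensordata"]["user"]
    rows.append({
        "status": user["status"],      # assumed field name
        "uniqueid": user["uniqueid"],  # assumed field name
        "date": record["created_date"],
    })

df = pd.DataFrame(rows, columns=["status", "uniqueid", "date"])
df.to_csv(q, index=False)  # index=False drops the 0, 1, 2 row labels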

Replace DataFrame rows with most recent data based on key

I have a dataframe that looks like this:
user_id val date
1 10 2015-02-01
1 11 2015-01-01
2 12 2015-03-01
2 13 2015-02-01
3 14 2015-03-01
3 15 2015-04-01
I need to run a function that calculates (let's say) the sum of vals as of a given date. If a user has a more recent row on or before that date, use it; if not, keep the older one.
For example. If I run the function with the date 2015-03-15, then the table will be:
user_id val date
1 10 2015-02-01
2 12 2015-03-01
3 14 2015-03-01
Giving me a sum of 36.
If I run the function with the date 2015-04-15, then the table will be:
user_id val date
1 10 2015-02-01
2 12 2015-03-01
3 15 2015-04-01
(User 3's row was replaced with a more recent date).
I know this is fairly esoteric, but I thought I could bounce this off all of you, as I have been trying to think of a simple way of doing this.
try this:
In [36]: df.loc[df.date <= '2015-03-15']
Out[36]:
user_id val date
0 1 10 2015-02-01
1 1 11 2015-01-01
2 2 12 2015-03-01
3 2 13 2015-02-01
4 3 14 2015-03-01
In [39]: df.loc[df.date <= '2015-03-15'].sort_values('date').groupby('user_id').agg({'date':'last', 'val':'last'}).reset_index()
Out[39]:
user_id date val
0 1 2015-02-01 10
1 2 2015-03-01 12
2 3 2015-03-01 14
or:
In [40]: df.loc[df.date <= '2015-03-15'].sort_values('date').groupby('user_id').last().reset_index()
Out[40]:
user_id val date
0 1 10 2015-02-01
1 2 12 2015-03-01
2 3 14 2015-03-01
In [41]: df.loc[df.date <= '2015-04-15'].sort_values('date').groupby('user_id').last().reset_index()
Out[41]:
user_id val date
0 1 10 2015-02-01
1 2 12 2015-03-01
2 3 15 2015-04-01
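
Following on from those frames, the total the question asks for is just the sum of the val column of the grouped result; a small usage sketch:

# sum of the most recent val per user as of the cutoff date
cutoff = '2015-03-15'
latest = (df.loc[df.date <= cutoff]
            .sort_values('date')
            .groupby('user_id')
            .last()
            .reset_index())
print(latest['val'].sum())  # 10 + 12 + 14 == 36 for the sample data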