Writing a Spark DataFrame by overwriting the values in a key as a Redis list - Scala

I have a Redis key into which I inserted the data using a command of the following format:
lpush key_name 'json data'
lpush test4 '{"id":"358899055773504","start_lat":0,"start_long":0,"end_lat":26.075942,"end_long":83.179573,"start_interval":"2018-02-01 00:01:00","end_interval":"2018-02-01 00:02:00"}'
Now I have done some processing in Spark Scala and have a DataFrame like this:
id end_interval start_interval start_lat start_long end_lat end_long
866561010400483 2018-02-01 00:02:00 2018-02-01 00:01:00 0 0 26.075942 83.179573
358899055773504 2018-08-02 04:57:29 2018-08-01 21:35:52 22.684658 75.909716 22.684658 75.909716
862304021520161 2018-02-01 00:02:00 2018-02-01 00:01:00 0 0 26.075942 83.179573
862304021470656 2018-08-02 05:25:11 2018-08-02 00:03:21 26.030764 75.180587 26.030764 75.180587
351608081284031 2018-08-02 05:22:10 2018-08-02 05:06:17 17.117284 78.269013 17.117284 78.269013
866561010407496 2018-02-01 00:02:00 2018-02-01 00:01:00 0 0 26.075942 83.179573
862304021504975 2018-02-01 00:02:00 2018-02-01 00:01:00 0 0 26.075942 83.179573
866561010407868 2018-02-01 00:02:00 2018-02-01 00:01:00 0 0 26.075942 83.179573
862304021483931 2018-02-01 00:02:00 2018-02-01 00:01:00 0 0 26.075942 83.179573
I want to insert this DataFrame into the same key (test4), overwriting it as a Redis list (as it was before, but now containing the rows of the DataFrame).
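One way to do this (a minimal sketch, assuming the Jedis client is on the classpath, the DataFrame is small enough to collect to the driver, and localhost:6379 is the Redis endpoint) is to serialize the rows back to JSON with toJSON and then replace the key atomically with DEL followed by RPUSH inside a MULTI/EXEC transaction:

import redis.clients.jedis.Jedis

// Assumption: df is the processed DataFrame shown above and fits on the driver.
val rows: Array[String] = df.toJSON.collect()

val jedis = new Jedis("localhost", 6379)   // assumed host/port
try {
  val tx = jedis.multi()                   // DEL + RPUSH run as one atomic unit
  tx.del("test4")                          // drop the old list
  if (rows.nonEmpty)
    tx.rpush("test4", rows: _*)            // RPUSH keeps DataFrame row order; use lpush to mimic the original insert direction
  tx.exec()
} finally {
  jedis.close()
}

For larger DataFrames, a foreachPartition-based write or the spark-redis connector avoids collecting everything to the driver, though the delete of the old list still has to happen exactly once before the pushes.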

Related

MySQL query to fill out the time gaps between the records

I want to write an optimized query to fill out the time gaps between the records with the stock value that is most recent as of each date.
The requirement is to have the latest stock value for every group of id_warehouse, id_stock, and date. The table is quite large (2 million records) and keeps growing, so I would like to optimize the query I have added below.
daily_stock_levels:
date        id_stock  id_warehouse  new_stock  is_stock_available
2022-01-01  1         1             24         1
2022-01-01  1         1             25         1
2022-01-01  1         1             29         1
2022-01-02  1         1             30         1
2022-01-06  1         1             27         1
2022-01-09  1         1             26         1
Result:
date        id_stock  id_warehouse  closest_date_with_stock_value  most_recent_stock_value
2022-01-01  1         1             29                             1
2022-01-02  1         1             30                             1
2022-01-03  1         1             30                             1
2022-01-04  1         1             30                             1
2022-01-05  1         1             30                             1
2022-01-06  1         1             27                             1
2022-01-07  1         1             27                             1
2022-01-07  1         1             27                             1
2022-01-09  1         1             26                             1
2022-01-10  1         1             26                             1
...         ...       ...           ...                            ...
2022-08-08  1         1             26                             1
SELECT
    sl.date,
    sl.id_warehouse,
    sl.id_item,
    (SELECT s.date
     FROM daily_stock_levels s
     WHERE s.is_stock_available = 1
       AND sl.id_warehouse = s.id_warehouse
       AND sl.id_item = s.id_item
       AND sl.date >= s.date
     ORDER BY s.date DESC
     LIMIT 1) AS closest_date_with_stock_value,
    (SELECT s.new_stock
     FROM daily_stock_levels s
     WHERE s.is_stock_available = 1
       AND sl.id_warehouse = s.id_warehouse
       AND sl.id_item = s.id_item
       AND sl.date >= s.date
     ORDER BY s.date DESC
     LIMIT 1) AS most_recent_stock_value
FROM daily_stock_levels sl
GROUP BY sl.id_warehouse,
         sl.id_item,
         sl.date
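The question asks for MySQL, but since the rest of this page is Spark/Scala, here is a hedged sketch of the same forward-fill idea with Spark window functions (column names are taken from the question; using max(new_stock) as the per-day tiebreaker is an assumption, because the sample data has no ordering column within a day):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Assumes `import spark.implicits._` is in scope and `stock` is a DataFrame
// with the question's columns, where `date` is a DateType column.
val perDay = stock
  .filter($"is_stock_available" === 1)
  .groupBy($"id_warehouse", $"id_stock", $"date")
  .agg(max($"new_stock").alias("new_stock"))        // assumed tiebreaker for duplicate days

// Full calendar per (warehouse, item) between its first and last date.
val calendar = perDay
  .groupBy($"id_warehouse", $"id_stock")
  .agg(min($"date").alias("min_d"), max($"date").alias("max_d"))
  .withColumn("date", explode(expr("sequence(min_d, max_d, interval 1 day)")))
  .drop("min_d", "max_d")

// Forward-fill the last known value onto the gap days.
val w = Window.partitionBy($"id_warehouse", $"id_stock").orderBy($"date")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

val filled = calendar
  .join(perDay, Seq("id_warehouse", "id_stock", "date"), "left")
  .withColumn("closest_date_with_stock_value",
    last(when($"new_stock".isNotNull, $"date"), ignoreNulls = true).over(w))
  .withColumn("most_recent_stock_value",
    last($"new_stock", ignoreNulls = true).over(w))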

Postgres: for each row evaluate all successive rows under conditions

I have this table:
id | datetime | row_number
1 2018-04-09 06:27:00 1
1 2018-04-09 14:15:00 2
1 2018-04-09 15:25:00 3
1 2018-04-09 15:35:00 4
1 2018-04-09 15:51:00 5
1 2018-04-09 17:05:00 6
1 2018-04-10 06:42:00 7
1 2018-04-10 16:39:00 8
1 2018-04-10 18:58:00 9
1 2018-04-10 19:41:00 10
1 2018-04-14 17:05:00 11
1 2018-04-14 17:48:00 12
1 2018-04-14 18:57:00 13
I'd like to count, for each row, the successive rows that fall within '01:30:00', and restart the evaluation from the first row that doesn't meet the condition.
Let me try to explain the question better.
Using the window function lag():
SELECT id, datetime,
CASE WHEN datetime - lag (datetime,1) OVER(PARTITION BY id ORDER BY datetime)
< '01:30:00' THEN 1 ELSE 0 END AS count
FROM table
result is:
id | datetime | count
1 2018-04-09 06:27:00 0
1 2018-04-09 14:15:00 0
1 2018-04-09 15:25:00 1
1 2018-04-09 15:35:00 1
1 2018-04-09 15:51:00 1
1 2018-04-09 17:05:00 1
1 2018-04-10 06:42:00 0
1 2018-04-10 16:39:00 0
1 2018-04-10 18:58:00 0
1 2018-04-10 19:41:00 1
1 2018-04-14 17:05:00 0
1 2018-04-14 17:48:00 1
1 2018-04-14 18:57:00 1
But this isn't right for me, because I want to exclude row_number 5, since the interval between row_number 5 and row_number 2 is > '01:30:00', and start the new evaluation from row_number 5.
The same goes for row_number 13.
The right output could be:
id | datetime | count
1 2018-04-09 06:27:00 0
1 2018-04-09 14:15:00 0
1 2018-04-09 15:25:00 1
1 2018-04-09 15:35:00 1
1 2018-04-09 15:51:00 0
1 2018-04-09 17:05:00 1
1 2018-04-10 06:42:00 0
1 2018-04-10 16:39:00 0
1 2018-04-10 18:58:00 0
1 2018-04-10 19:41:00 1
1 2018-04-14 17:05:00 0
1 2018-04-14 17:48:00 1
1 2018-04-14 18:57:00 0
So the right count is 5.
I'd use a recursive query for this:
WITH RECURSIVE tmp AS (
    SELECT
        id,
        datetime,
        row_number,
        0 AS counting,
        datetime AS last_start
    FROM mytable
    WHERE row_number = 1

    UNION ALL

    SELECT
        t1.id,
        t1.datetime,
        t1.row_number,
        CASE WHEN lateral_1.counting THEN 1 ELSE 0 END AS counting,
        CASE WHEN lateral_1.counting THEN tmp.last_start ELSE t1.datetime END AS last_start
    FROM mytable AS t1
    INNER JOIN tmp ON (t1.id = tmp.id AND t1.row_number - 1 = tmp.row_number),
    LATERAL (SELECT (t1.datetime - tmp.last_start) < '1h 30m'::interval AS counting) AS lateral_1
)
SELECT id, datetime, counting
FROM tmp
ORDER BY id, datetime;

Scala Spark group as per value change

I have the following dataset:
ID Sensor State DateTime
1 S1 0 2018-09-10 10:10:05
1 S1 0 2018-09-10 10:10:10
1 S1 0 2018-09-10 10:10:20
1 S1 1 2018-09-10 10:10:30
1 S1 1 2018-09-10 10:10:40
1 S1 1 2018-09-10 10:10:50
1 S1 1 2018-09-10 10:10:60
1 S2 0 2018-09-10 10:10:10
1 S2 0 2018-09-10 10:10:20
1 S2 0 2018-09-10 10:10:30
1 S2 1 2018-09-10 10:10:40
1 S2 1 2018-09-10 10:10:50
2 S1 0 2018-09-10 10:10:30
2 S1 1 2018-09-10 10:10:40
2 S1 1 2018-09-10 10:10:50
Required Output
ID Sensor State MinDT MaxDT
1 S1 0 2018-09-10 10:10:05 2018-09-10 10:10:20
1 S1 1 2018-09-10 10:10:30 2018-09-10 10:10:60
1 S2 0 2018-09-10 10:10:10 2018-09-10 10:10:30
1 S2 1 2018-09-10 10:10:40 2018-09-10 10:10:50
2 S1 0 2018-09-10 10:10:30 2018-09-10 10:10:30
2 S1 1 2018-09-10 10:10:40 2018-09-10 10:10:50
I want to group on the basis of changes in the sensor state value, and I need the time range for each run of a value. I tried a simple approach of initializing the value in variables and then iterating over each row to check for a change in value, storing the result set in an array, but this approach is not distributed across the cluster. Any suggestions, please?
You can simply group like this and achieve the result you want:
df.groupBy("ID", "Sensor", "State")
  .agg(
    date_format(max(to_timestamp($"DateTime", "yyyy-MM-dd HH:mm:ss")), "yyyy-MM-dd HH:mm:ss").alias("MaxDT"),
    date_format(min(to_timestamp($"DateTime", "yyyy-MM-dd HH:mm:ss")), "yyyy-MM-dd HH:mm:ss").alias("MinDT"))
  .show()
Output:
+---+------+-----+-------------------+-------------------+
| ID|Sensor|State| MaxDT| MinDT|
+---+------+-----+-------------------+-------------------+
| 2| S1| 0|2018-09-10 10:10:30|2018-09-10 10:10:30|
| 1| S2| 1|2018-09-10 10:10:50|2018-09-10 10:10:40|
| 2| S1| 1|2018-09-10 10:10:50|2018-09-10 10:10:40|
| 1| S1| 0|2018-09-10 10:10:20|2018-09-10 10:10:05|
| 1| S2| 0|2018-09-10 10:10:30|2018-09-10 10:10:10|
| 1| S1| 1|2018-09-10 10:10:50|2018-09-10 10:10:30|
+---+------+-----+-------------------+-------------------+
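Note that a plain groupBy merges every row with the same State, so if a sensor ever flips back to an earlier state the separate runs collapse into one group. If distinct runs matter, the usual gaps-and-islands approach with lag plus a running sum works; a hedged sketch (column names from the question, spark.implicits._ assumed in scope):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val byTime = Window.partitionBy($"ID", $"Sensor").orderBy($"DateTime")
val runSum = byTime.rowsBetween(Window.unboundedPreceding, Window.currentRow)

val runs = df
  .withColumn("changed",                                  // 1 whenever State differs from the previous row
    when(lag($"State", 1).over(byTime) === $"State", lit(0)).otherwise(lit(1)))
  .withColumn("run_id", sum($"changed").over(runSum))     // running sum labels each consecutive run

runs.groupBy($"ID", $"Sensor", $"State", $"run_id")
  .agg(min($"DateTime").alias("MinDT"), max($"DateTime").alias("MaxDT"))
  .drop("run_id")
  .show()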

Replace DataFrame rows with most recent data based on key

I have a dataframe that looks like this:
user_id val date
1 10 2015-02-01
1 11 2015-01-01
2 12 2015-03-01
2 13 2015-02-01
3 14 2015-03-01
3 15 2015-04-01
I need to run a function that calculates (let's say) the sum of vals chronologically as of a given date: for each user, if there is a more recent row on or before that date, use it, otherwise keep the older one.
For example, if I run the function with the date 2015-03-15, the table will be:
user_id val date
1 10 2015-02-01
2 12 2015-03-01
3 14 2015-03-01
Giving me a sum of 36.
If I run the function with the date 2015-04-15, then the table will be:
user_id val date
1 10 2015-02-01
2 12 2015-03-01
3 15 2015-04-01
(User 3's row was replaced with a more recent date).
I know this is fairly esoteric, but I thought I could bounce it off all of you, as I have been trying to think of a simple way of doing this.
try this:
In [36]: df.loc[df.date <= '2015-03-15']
Out[36]:
user_id val date
0 1 10 2015-02-01
1 1 11 2015-01-01
2 2 12 2015-03-01
3 2 13 2015-02-01
4 3 14 2015-03-01
In [39]: df.loc[df.date <= '2015-03-15'].sort_values('date').groupby('user_id').agg({'date':'last', 'val':'last'}).reset_index()
Out[39]:
user_id date val
0 1 2015-02-01 10
1 2 2015-03-01 12
2 3 2015-03-01 14
or:
In [40]: df.loc[df.date <= '2015-03-15'].sort_values('date').groupby('user_id').last().reset_index()
Out[40]:
user_id val date
0 1 10 2015-02-01
1 2 12 2015-03-01
2 3 14 2015-03-01
In [41]: df.loc[df.date <= '2015-04-15'].sort_values('date').groupby('user_id').last().reset_index()
Out[41]:
user_id val date
0 1 10 2015-02-01
1 2 12 2015-03-01
2 3 15 2015-04-01
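For anyone doing the same as-of selection in Spark Scala (the stack of the main question at the top of the page), a hedged sketch with one window per user_id (column names from the question, spark.implicits._ assumed in scope):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val asOf = "2015-03-15"                                   // the cutoff date parameter
val w = Window.partitionBy($"user_id").orderBy($"date".desc)

val latest = df
  .filter($"date" <= lit(asOf))                           // drop rows after the cutoff
  .withColumn("rn", row_number().over(w))                 // rank remaining rows, newest first
  .filter($"rn" === 1)                                    // keep the newest row per user
  .drop("rn")

latest.agg(sum($"val")).show()                            // 36 for 2015-03-15 with the sample data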

How to calculate the interest rate between two dates, with two date parameters for the output

Transaction table
TransactionDate Balance
2014-03-30 50000
2014-04-05 2000
2014-04-10 1000
2014-04-20 25000
2014-05-05 10000
Interest Rate table
FromDate ToDate Rate
2014-02-28 2014-04-01 2.1%
2014-04-02 2014-04-15 2.2%
2014-04-16 2014-04-21 2.7%
2014-04-22 2014-05-02 2.8%
2014-05-03 2015-03-31 2.9%
The SQL query output is required in the format below, for these input parameters:
FromDate: 2014-04-01
ToDate: 2014-05-30
Output table as below:
Balance FromDate ToDate Rate
50000 2014-03-31 2014-04-01 2.1
50000 2014-04-01 2014-04-05 2.2
52000 2014-04-05 2014-04-10 2.2
53000 2014-04-10 2014-04-15 2.2
53000 2014-04-15 2014-04-20 2.7
78000 2014-04-20 2014-04-21 2.7
78000 2014-04-21 2014-05-02 2.8
78000 2014-05-02 2014-05-05 2.9
79000 2014-05-05 2014-05-30 2.9
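As a partial, hedged sketch (in Spark Scala rather than SQL, to match the rest of the page), the two building blocks here are a running balance over the transactions and a range join onto the rate table; the full answer would still have to split the output intervals at every transaction date and rate boundary. The column Amount below is an assumption: the question's Balance column is treated as the per-transaction movement, since the expected output accumulates it.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("interest-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Sample data copied from the question.
val tx = Seq(
  ("2014-03-30", 50000), ("2014-04-05", 2000), ("2014-04-10", 1000),
  ("2014-04-20", 25000), ("2014-05-05", 10000)
).toDF("TransactionDate", "Amount")

val rates = Seq(
  ("2014-02-28", "2014-04-01", 2.1), ("2014-04-02", "2014-04-15", 2.2),
  ("2014-04-16", "2014-04-21", 2.7), ("2014-04-22", "2014-05-02", 2.8),
  ("2014-05-03", "2015-03-31", 2.9)
).toDF("FromDate", "ToDate", "Rate")

// Running balance as of each transaction date.
val runningBalance = tx.withColumn("Balance",
  sum($"Amount").over(Window.orderBy($"TransactionDate")))

// Rate in force on each transaction date (ISO date strings compare correctly as text).
val withRate = runningBalance.join(rates,
  $"TransactionDate".between($"FromDate", $"ToDate"), "left")
withRate.show()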