Update dataframe value by timestamp in Scala

I have this dataframe:
+----------+---------------------+------+---+----+
|customerid|event                |A     |B  |C   |
+----------+---------------------+------+---+----+
|   1222222|2019-02-07 06:50:40.0|aaaaaa| 25|5025|
|   1222222|2019-02-07 06:50:42.0|aaaaaa| 35|5000|
|   1222222|2019-02-07 06:51:56.0|aaaaaa|100|4965|
+----------+---------------------+------+---+----+
I want to update the value of column C by event (timestamp) and keep only the row with the latest update in a new dataframe, like this:
+----------+---------------------+------+---+----+
|customerid|event                |A     |B  |C   |
+----------+---------------------+------+---+----+
|   1222222|2019-02-07 06:51:56.0|aaaaaa|100|4965|
+----------+---------------------+------+---+----+
The data arrives in streaming mode via Spark Streaming.

You can try creating a row number partitioned by customerid and ordered by event descending, then keeping only the rows where rownum is 1. I hope this helps.
df.withColumn("rownum", row_number().over(Window.partitionBy("customerid").orderBy(col("event").desc)))
.filter(col("rownum") === 1)
.drop("rownum")
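Since the data arrives via Spark Streaming, note that ranking window functions such as row_number are not supported directly on a streaming DataFrame in Structured Streaming; one common workaround is to apply the same deduplication inside foreachBatch on each micro-batch, which keeps the latest row per customer within that batch (keeping it across batches needs stateful processing).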

Related

Compare consecutive rows and extract words (excluding the subsets) in Spark

I am working on a Spark dataframe. The input dataframe looks like below (Table 1). I need to write logic to get the keywords with maximum length for each session id; there can be multiple keywords in the output for each session_id. The expected output looks like Table 2.
Input dataframe:
(Table 1)
|-----------+------------+-----------------------------------|
| session_id| value | Timestamp |
|-----------+------------+-----------------------------------|
| 1 | cat | 2021-01-11T13:48:54.2514887-05:00 |
| 1 | catc | 2021-01-11T13:48:54.3514887-05:00 |
| 1 | catch | 2021-01-11T13:48:54.4514887-05:00 |
| 1 | par | 2021-01-11T13:48:55.2514887-05:00 |
| 1 | part | 2021-01-11T13:48:56.5514887-05:00 |
| 1 | party | 2021-01-11T13:48:57.7514887-05:00 |
| 1 | partyy | 2021-01-11T13:48:58.7514887-05:00 |
| 2 | fal | 2021-01-11T13:49:54.2514887-05:00 |
| 2 | fall | 2021-01-11T13:49:54.3514887-05:00 |
| 2 | falle | 2021-01-11T13:49:54.4514887-05:00 |
| 2 | fallen | 2021-01-11T13:49:54.8514887-05:00 |
| 2 | Tem | 2021-01-11T13:49:56.5514887-05:00 |
| 2 | Temp | 2021-01-11T13:49:56.7514887-05:00 |
|-----------+------------+-----------------------------------|
Expected Output:
(Table 2)
|-----------+------------+
| session_id| value |
|-----------+------------+
| 1 | catch |
| 1 | partyy |
| 2 | fallen |
| 2 | Temp |
|-----------+------------|
Solution I tried:
I added another column called col_length which captures the length of each word in the value column. Later on, I tried to compare each row with its subsequent row to see if it has the maximum length. But this solution only works partly.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{length, max}
import spark.implicits._ // for the $"col" syntax (already in scope in spark-shell)

val df = spark.read.parquet("/project/project_name/abc")
val dfM = df.select($"session_id", $"value", $"Timestamp").withColumn("col_length", length($"value"))
val ts = Window
  .orderBy("session_id")
  .rangeBetween(Window.unboundedPreceding, Window.currentRow)
val result = dfM
  .withColumn("running_max", max("col_length") over ts)
  .where($"running_max" === $"col_length")
  .select("session_id", "value", "Timestamp")
Current Output:
|-----------+------------+
| session_id| value |
|-----------+------------+
| 1 | catch |
| 2 | fallen |
|-----------+------------|
Multiple columns do not work inside an orderBy clause with a window function, so I didn't get the desired output; I got one output row per session id. Any suggestions would be highly appreciated. Thanks in advance.
You can solve it by using the lead function:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lead, length}

// order by Timestamp within each session so lead() really is the next keystroke
val windowSpec = Window.partitionBy("session_id").orderBy("Timestamp")
dfM
  .withColumn("lead", lead("value", 1).over(windowSpec))
  .filter(length(col("lead")) < length(col("value")) || col("lead").isNull)
  .drop("lead")
  .show()
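With the sample data this keeps catch, partyy, fallen and Temp (the rows of Table 2), since each of them is either followed by a shorter value within its session or is the last row of its session.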

How to convert rows into columns in PostgreSQL for the below table

I am trying to convert the Trace table into the Result table in PostgreSQL. I have huge data in the table.
I have a table with the name: Trace
entity_id                       | ts            | key         | bool_v | dbl_v | str_v   | long_v |
----------------------------------------------------------------------------------------------------
1ea815c48c5ac30bca403a1010b09f1 | 1593934026155 | temperature |        |       |         |     45 |
1ea815c48c5ac30bca403a1010b09f1 | 1593934026155 | operation   |        |       | Normal  |        |
1ea815c48c5ac30bca403a1010b09f1 | 1593934026155 | period      |        |       |         |   6968 |
1ea815c48c5ac30bca403a1010b09f1 | 1593933202984 | temperature |        |       |         |     44 |
1ea815c48c5ac30bca403a1010b09f1 | 1593933202984 | operation   |        |       | Reverse |        |
1ea815c48c5ac30bca403a1010b09f1 | 1593933202984 | period      |        |       |         |   3535 |
Trace Table
I want to convert the above table into the following table in PostgreSQL.
Output Table: Result
entity_id                       | ts            | temperature | operation | period |
-------------------------------------------------------------------------------------
1ea815c48c5ac30bca403a1010b09f1 | 1593934026155 |          45 | Normal    |   6968 |
1ea815c48c5ac30bca403a1010b09f1 | 1593933202984 |          44 | Reverse   |   3535 |
Result Table
Have you tried this yet?
select entity_id, ts,
       max(long_v) filter (where key = 'temperature') as temperature,
       max(str_v) filter (where key = 'operation') as operation,
       max(long_v) filter (where key = 'period') as period
from trace
group by entity_id, ts;
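The filter clause limits each aggregate to the rows whose key matches, and max simply collapses the single matching value per (entity_id, ts) group into one column; any aggregate that picks out the lone value would do.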

How to combine pyspark dataframes with different shapes and different columns

I have two dataframes in Pyspark. One has more than 1000 rows and the other only 4 rows. The columns also do not match.
df1 with more than 1000 rows:
+----+--------+--------------+-------------+
| ID | col1 | col2 | col 3 |
+----+--------+--------------+-------------+
| 1 | time1 | value_col2 | value_col3 |
| 2 | time 2 | value2_col2 | value2_col3 |
+----+--------+--------------+-------------+
...
df2 with only 4 rows:
+-----+--------------+--------------+
| key | col_c | col_d |
+-----+--------------+--------------+
| a | valuea_colc | valuea_cold |
| b | valueb_colc | valueb_cold |
+-----+--------------+--------------+
I want to create a dataframe looking like this:
+----+--------+-------------+-------------+--------------+---------------+--------------+-------------+
| ID | col1 | col2 | col 3 | a_col_c | a_col_d | b_col_c | b_col_d |
+----+--------+-------------+-------------+--------------+---------------+--------------+-------------+
| 1 | time1 | value_col2 | value_col3 | valuea_colc | valuea_cold | valueb_colc | valueb_cold |
| 2 | time 2 | value2_col2 | value2_col3 | valuea_colc | valuea_cold | valueb_colc | valueb_cold |
+----+--------+-------------+-------------+--------------+---------------+--------------+-------------+
Can you please help with this? I prefer not to use Pandas.
Thank you!
I actually figured this out using crossJoin.
https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html explains how to use crossJoin with Pyspark DataFrames.
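For completeness, a minimal sketch of that idea, assuming df1 and df2 are the dataframes above: pivot df2 into a single wide row keyed by key, then cross join that row onto df1 (the a_col_c, b_col_d, ... column names come from pivoting with aliased aggregates).
from pyspark.sql import functions as F

# Sketch only: collapse df2 to one wide row, then append its columns to every row of df1.
df2_wide = (
    df2.withColumn("g", F.lit(1))    # dummy grouping column
       .groupBy("g")
       .pivot("key")                 # one set of columns per key value (a, b, ...)
       .agg(F.first("col_c").alias("col_c"), F.first("col_d").alias("col_d"))
       .drop("g")
)
# with several aliased aggregations, pivot names the columns "<key>_<alias>",
# e.g. a_col_c, a_col_d, b_col_c, b_col_d
result = df1.crossJoin(df2_wide)     # df2_wide has a single row
result.show(truncate=False)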

In PostgreSQL, how do you find an aggregate based on a time range

For example, say I have a database table of transactions done over the counter, and I would like to find out whether there was any time that was extremely busy (more than 10 transactions processed in a span of 10 minutes). How would I go about querying that? Could I aggregate based on time ranges and count the number of transaction ids within those ranges?
Adding an example to clarify my input and desired output:
+----+--------------------+
| Id | register_timestamp |
+----+--------------------+
| 25 | 08:10:50 |
| 26 | 09:07:36 |
| 27 | 09:08:06 |
| 28 | 09:08:35 |
| 29 | 09:12:08 |
| 30 | 09:12:18 |
| 31 | 09:12:44 |
| 32 | 09:15:29 |
| 33 | 09:15:47 |
| 34 | 09:18:13 |
| 35 | 09:18:42 |
| 36 | 09:20:33 |
| 37 | 09:20:36 |
| 38 | 09:21:04 |
| 39 | 09:21:53 |
| 40 | 09:22:23 |
| 41 | 09:22:42 |
| 42 | 09:22:51 |
| 43 | 09:28:14 |
+----+--------------------+
Desired output would be something like:
+-------+----------+
| Count | Min |
+-------+----------+
| 1 | 08:10:50 |
| 3 | 09:07:36 |
| 7 | 09:12:08 |
| 8 | 09:20:33 |
+-------+----------+
How about this:
SELECT c, time
FROM (
  SELECT count(*) AS c, min(time) AS time
  FROM transactions
  GROUP BY floor(extract(epoch from time)/600)
) AS t
WHERE c > 10;
This will find all ten minute intervals for which more than ten transactions occurred within that interval. It assumes that the table is called transactions and that it has a column called time where the timestamp is stored.
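As a worked example on the sample data (for a time-of-day column, extract(epoch from ...) gives seconds since midnight): 09:07:36 is 9*3600 + 7*60 + 36 = 32856 seconds, and floor(32856 / 600) = 54, i.e. the 09:00:00 to 09:10:00 bucket, which it shares with 09:08:06 and 09:08:35; that group is where the count of 3 in the desired output comes from.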
Thanks to redneb, I ended up with the following query:
SELECT count(*) AS c, min(register_timestamp) AS register_timestamp
FROM trak_participants_data
GROUP BY floor(extract(epoch from register_timestamp)/600)
order by register_timestamp
It works well enough for me to tell which time chunks are the busiest at the counter.

Missing Spark partition column in partitioned table

I am creating a partitioned parquet file in HDFS with a datasource.
The datasource looks like:
scala> sqlContext.sql("select * from parquetFile").show()
+--------+-----------------+
|area_tag| vin|
+--------+-----------------+
| 0|LSKG5GC19BA210794|
| 0|LSKG5GC15BA210372|
| 0|LSKG5GC18BA210107|
| 0|LSKG4GC16BA211971|
| 0|LSKG4GC19BA210233|
| 0|LSKG5GC17BA210017|
| 0|LSKG4GC19BA211785|
| 0|LSKG4GC15BA210004|
| 0|LSKG4GC12BA211739|
| 0|LSKG4GC18BA210238|
| 0|LSKG4GC13BA210261|
| 0|LSKG5GC16BA210106|
| 0|LSKG4GC1XBA210287|
| 0|LSKG4GC10BA210265|
| 0|LSKG5GC10CA210118|
| 0|LSKG5GC16BA212289|
| 0|LSKG5GC1XBA211016|
| 0|LSKG5GC15CA210194|
| 0|LSKG5GC12CA210119|
| 0|LSKG4GC19BA211379|
+--------+-----------------+
I create the partition with the following commands (I did it in the spark shell):
scala> val df1 = sqlContext.sql("select * from parquetFile where area_tag=0 ")
scala> df1.write.parquet("/tmp/test_table3/area_tag=0")
scala> val p1 = sqlContext.read.parquet("/tmp/test_table3")
When I print the data by loading from the partitioned table, it shows:
scala> p1.show()
+--------+-----------------+
|area_tag| vin|
+--------+-----------------+
| |LSKG5GC19BA210794|
| |LSKG5GC15BA210372|
| |LSKG5GC18BA210107|
| |LSKG4GC16BA211971|
| |LSKG4GC19BA210233|
| |LSKG5GC17BA210017|
| |LSKG4GC19BA211785|
| |LSKG4GC15BA210004|
| |LSKG4GC12BA211739|
| |LSKG4GC18BA210238|
| |LSKG4GC13BA210261|
| |LSKG5GC16BA210106|
| |LSKG4GC1XBA210287|
| |LSKG4GC10BA210265|
| |LSKG5GC10CA210118|
| |LSKG5GC16BA212289|
| |LSKG5GC1XBA211016|
| |LSKG5GC15CA210194|
| |LSKG5GC12CA210119|
| |LSKG4GC19BA211379|
+--------+-----------------+
only showing top 20 rows
The partition column values are missing. What happened to the column? Is it a bug?