Simulate Lag Function - Spark Structured Streaming - Scala

I'm using Spark Structured Streaming to analyze sensor data and need to perform calculations based on a sensor's previous timestamp. My incoming data stream has three columns: sensor_id, timestamp, and temp. I need to add a fourth column containing that sensor's previous timestamp, so that I can then calculate the time between data points for each sensor.
This is easy in traditional batch processing, using a lag function over a window partitioned by sensor_id. What is the best way to approach this in a streaming situation?
So for example if my streaming dataframe looked like this:
+----------+-----------+------+
| SensorId | Timestamp | Temp |
+----------+-----------+------+
|     1800 |        34 |   23 |
|      500 |        36 |   54 |
|     1800 |        45 |   23 |
|      500 |        60 |   54 |
|     1800 |        78 |   23 |
+----------+-----------+------+
I would like something like this:
+----------+-----------+------+---------+
| SensorId | Timestamp | Temp | Prev_ts |
+----------+-----------+------+---------+
|     1800 |        34 |   23 |      21 |
|      500 |        36 |   54 |      27 |
|     1800 |        45 |   23 |      34 |
|      500 |        60 |   54 |      36 |
|     1800 |        78 |   23 |      45 |
+----------+-----------+------+---------+
If I try
test = filteredData.withColumn("prev_ts", lag("ts").over(Window.partitionBy("sensor_id").orderBy("ts")))
I get an AnalysisException: 'Non-time-based windows are not supported on streaming DataFrames/Datasets
Could I save the previous timestamp of each sensor in a data structure that I could reference and then update with each new timestamp?

There is no need to "simulate" anything. Standard window functions can be used with Structured Streaming.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag

val s = spark.readStream.
  ...
  load()

s.withColumn("prev_ts", lag("Timestamp", 1).over(
  Window.partitionBy("SensorId").orderBy("Timestamp")
))
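Note that if your Spark version still raises the AnalysisException quoted in the question (non-time-based windows are not supported on streaming Datasets), the idea of saving each sensor's previous timestamp in a data structure maps onto Spark's arbitrary stateful processing via flatMapGroupsWithState. Below is a minimal sketch of that approach; the case classes, the readings Dataset, and the assumption that spark.implicits._ is in scope are illustrative, not taken from the question:

import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

case class Reading(sensorId: Long, timestamp: Long, temp: Double)
case class ReadingWithPrev(sensorId: Long, timestamp: Long, temp: Double, prevTs: Option[Long])

// readings: Dataset[Reading] parsed from the input stream (hypothetical name)
val withPrev = readings
  .groupByKey(_.sensorId)
  .flatMapGroupsWithState[Long, ReadingWithPrev](
    OutputMode.Append, GroupStateTimeout.NoTimeout) {
    (sensorId, rows, state: GroupState[Long]) =>
      // Keep the sensor's last seen timestamp across micro-batches in GroupState
      // and emit each row together with the previously seen timestamp.
      var prevTs = state.getOption
      val out = rows.toSeq.sortBy(_.timestamp).map { r =>
        val enriched = ReadingWithPrev(r.sensorId, r.timestamp, r.temp, prevTs)
        prevTs = Some(r.timestamp)
        enriched
      }
      prevTs.foreach(state.update)
      out.iterator
  }

This only orders rows within each micro-batch; readings arriving out of order across batches would need additional handling (e.g. event-time watermarking and timeouts), which this sketch ignores.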

Related

Reformatting table in SQL

I have a table in SQL:
Table Link
Is there a way to reformat this table with contract_kind as rows and percentile values as columns, without me having to create 4 tables with WHERE clauses and join them? I'm using Postgres.
TL;DR: just use the crosstab function, as per the documentation.
Long reply:
I recreated a similar case with
create table test (id int, contract_kind text, percentile int, cut_off_time float);
insert into test values
(1,'TEMPLATE',25,1.91),
(2,'TEMPLATE',50,51.93),
(3,'TEMPLATE',75,158.41),
(4,'TEMPLATE',90,343.01),
(5,'TEMPLATE_EDITABLE',25,26),
(6,'TEMPLATE_EDITABLE',50,27),
(7,'TEMPLATE_EDITABLE',75,28),
(8,'TEMPLATE_EDITABLE',90,29),
(9,'UPLOAD_EDITABLE',25,10),
(10,'UPLOAD_EDITABLE',50,20),
(11,'UPLOAD_EDITABLE',75,30),
(12,'UPLOAD_EDITABLE',90,40),
(13,'UPLOAD_SIGN',25,40),
(14,'UPLOAD_SIGN',50,77),
(15,'UPLOAD_SIGN',75,99),
(16,'UPLOAD_SIGN',90,133);
result:
id | contract_kind | percentile | cut_off_time
----+-------------------+------------+--------------
1 | TEMPLATE | 25 | 1.91
2 | TEMPLATE | 50 | 51.93
3 | TEMPLATE | 75 | 158.41
4 | TEMPLATE | 90 | 343.01
5 | TEMPLATE_EDITABLE | 25 | 26
6 | TEMPLATE_EDITABLE | 50 | 27
7 | TEMPLATE_EDITABLE | 75 | 28
8 | TEMPLATE_EDITABLE | 90 | 29
9 | UPLOAD_EDITABLE | 25 | 10
10 | UPLOAD_EDITABLE | 50 | 20
11 | UPLOAD_EDITABLE | 75 | 30
12 | UPLOAD_EDITABLE | 90 | 40
13 | UPLOAD_SIGN | 25 | 40
14 | UPLOAD_SIGN | 50 | 77
15 | UPLOAD_SIGN | 75 | 99
16 | UPLOAD_SIGN | 90 | 133
(16 rows)
Now to use the crosstab you need to create the tablefunc extension.
create extension tablefunc;
and then you can use it to pivot the data
select * from
crosstab('select percentile, contract_kind, cut_off_time from test order by 1,2')
as ct(percentile int, template float, template_editable float, upload_editable float, upload_sign float);
result
percentile | template | template_editable | upload_editable | upload_sign
------------+----------+-------------------+-----------------+-------------
25 | 1.91 | 26 | 10 | 40
50 | 51.93 | 27 | 20 | 77
75 | 158.41 | 28 | 30 | 99
90 | 343.01 | 29 | 40 | 133
(4 rows)

How to Join data from a dataframe

I have one table with many types of data, and some of the rows carry one piece of information that is really important for analysing the rest of the data.
This is the table that I have
| name     | player_id | data_ms | coins | progress |
| progress |      1223 |      10 |       |      128 |
| complete |      1223 |      11 |   154 |          |
| win      |      1223 |       9 |   111 |          |
| progress |      1223 |      11 |       |      129 |
| played   |      1111 |      19 |   141 |          |
| progress |      1111 |      25 |       |      225 |
This is the table that I want
| name     | player_id | data_ms | coins | progress |
| progress |      1223 |      10 |       |      128 |
| complete |      1223 |      11 |   154 |      128 |
| win      |      1223 |       9 |   111 |      129 |
| progress |      1223 |      11 |       |      129 |
| played   |      1111 |      19 |   141 |      225 |
| progress |      1111 |      25 |       |      225 |
I need to find the progress of the player, with the condition that it has to be the first progress emitted after the data_ms (epoch unix timestamp) of this event.
My table has 4 billion lines of data and is partitioned by date.
I tried to create a UDF that would read and filter the table, but that's not an option, since you can't serialize the Spark session into a UDF.
Any idea of how I should do this?
It seems like you want to fill gaps in the progress column. I didn't really understand the condition, but if it's based on data_ms then your Spark SQL query should look like this:
dataFrame.createOrReplaceTempView("your_table")

val progressDf = sparkSession.sql(
  """
  SELECT name, player_id, data_ms, coins,
         COALESCE(progress,
                  LAST_VALUE(progress, TRUE) OVER (PARTITION BY player_id
                                                   ORDER BY data_ms
                                                   ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)) AS progress
  FROM your_table
  """
)
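If you prefer the DataFrame API over Spark SQL, the same forward fill can be sketched with last(..., ignoreNulls = true) over an equivalent window (a sketch, assuming dataFrame is the table shown above):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, col, last}

// Forward-fill progress per player, ordered by data_ms.
val fillWindow = Window
  .partitionBy("player_id")
  .orderBy("data_ms")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

val progressDf = dataFrame.withColumn(
  "progress",
  coalesce(col("progress"), last(col("progress"), ignoreNulls = true).over(fillWindow))
)

Like the SQL version, this takes the most recent non-null progress at or before each row; if the requirement really is the first progress emitted after the event, the window frame would need to look forward instead.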

Dynamic groups in Postgresql data

I have a PostgreSQL 9.1 database with a table containing measurement data, which includes setpoint information, for example temperature setpoints. The measurements are taken while at a setpoint, after which the next setpoint is set. A setpoint can be reached multiple times, e.g. -25, 25, 75, 125, 75, 25 degrees Celsius. In this case 25 and 75 degrees Celsius are reached multiple times.
Now I want to group the data per setpoint, but without grouping together data from another setpoint that has the same value but is reached at a later point in time.
Example data:
| id | setpoint | value | <dyn.group> |
|  1 |      -25 | 5.324 |           1 |
|  2 |      -25 | 6.343 |           1 |
|  3 |      -25 | 6.432 |           1 |
|  4 |       25 | 3.432 |           2 |
|  5 |       25 | 4.472 |           2 |
|  6 |       25 | 6.221 |           2 |
|  7 |       75 | 5.142 |           3 |
|  8 |       75 | 7.922 |           3 |
|  9 |       75 | 3.832 |           3 |
| 10 |      125 | 8.882 |           4 |
| 11 |      125 | 9.742 |           4 |
| 12 |      125 | 7.632 |           4 |
| 13 |       75 | 5.542 |           5 |
| 14 |       75 | 2.452 |           5 |
| 15 |       75 | 1.332 |           5 |
| 16 |       25 | 3.232 |           6 |
| 17 |       25 | 4.132 |           6 |
| 18 |       25 | 5.432 |           6 |
A normal GROUP BY clause will fail, because the same setpoint can occur multiple times but should not be grouped together.
Looking at the previous/next values with LEAD and LAG is also not desired, because the changes will most likely be similar (e.g. if setpoint 75 is repeated, then most likely the step from 25->75 will also be repeated).
The expected outcome is the 4th column (<dyn.group>). With that column I can, for example, average over these groups.
It can be done with a custom aggregate function that keeps the previous setpoint as its state and increments a counter whenever the setpoint changes, generating the "group index" value, and then a GROUP BY clause on that value.

In PostgreSQL, how do you aggregate based on a time range

For example, say I have a database table of transactions done over the counter, and I would like to find out whether there was any time that could be defined as extremely busy (more than 10 transactions processed in the span of 10 minutes). How would I go about querying that? Could I aggregate based on time ranges and count the number of transaction ids within those ranges?
Adding an example to clarify my input and desired output:
+----+--------------------+
| Id | register_timestamp |
+----+--------------------+
| 25 | 08:10:50 |
| 26 | 09:07:36 |
| 27 | 09:08:06 |
| 28 | 09:08:35 |
| 29 | 09:12:08 |
| 30 | 09:12:18 |
| 31 | 09:12:44 |
| 32 | 09:15:29 |
| 33 | 09:15:47 |
| 34 | 09:18:13 |
| 35 | 09:18:42 |
| 36 | 09:20:33 |
| 37 | 09:20:36 |
| 38 | 09:21:04 |
| 39 | 09:21:53 |
| 40 | 09:22:23 |
| 41 | 09:22:42 |
| 42 | 09:22:51 |
| 43 | 09:28:14 |
+----+--------------------+
Desired output would be something like:
+-------+----------+
| Count | Min |
+-------+----------+
| 1 | 08:10:50 |
| 3 | 09:07:36 |
| 7 | 09:12:08 |
| 8 | 09:20:33 |
+-------+----------+
How about this:
SELECT c, time
FROM (
  SELECT count(*) AS c, min(time) AS time
  FROM transactions
  GROUP BY floor(extract(epoch from time)/600)
) AS t
WHERE c > 10;
This will find all ten-minute intervals in which more than ten transactions occurred. It assumes that the table is called transactions and that it has a column called time where the timestamp is stored.
Thanks to redneb, I ended up with the following query:
SELECT count(*) AS c, min(register_timestamp) AS register_timestamp
FROM trak_participants_data
GROUP BY floor(extract(epoch from register_timestamp)/600)
ORDER BY register_timestamp
It works close enough for me to be able to tell which time chunks are the busiest for the counter.

Displaying 2 metrics on a Tableau map

I am new to Tableau and I have requirements as below:
I need to create a dashboard with a filter on Paywave or EMV and show count of Confirmed and Probable on a geo map.
When I select EMV from the quick filter, it should show a count of confirm & probable for that city. I should be able to drill down and see a count of confirm and probable for zip codes as well.
I am not sure how to achieve the above requirements.
As shown below I have fields like:
+----------------+--------------+-------------+--------------+-----------------+------------------+
| mrchchant_city | mrch_zipcode | EMV confirm | EMV probable | Paywave confirm | Paywave probable |
+----------------+--------------+-------------+--------------+-----------------+------------------+
| A              |         1001 |          10 |           15 |              20 |               18 |
| B              |         1005 |          34 |           67 |              78 |               12 |
| C              |         2001 |          24 |           56 |              76 |               45 |
| C              |         2001 |          46 |           19 |              63 |               25 |
+----------------+--------------+-------------+--------------+-----------------+------------------+
Please let me know if any information required from my side.
This will be a lot easier on you if you restructure your data a bit. More often than not, the goal in Tableau is to provide an aggregated summary of the data, rather than showing each individual row. We'll want to group by dimensions (categorical data like "EMV"/"Paywave" or "Confirm"/"Probable"), so this data will be much easier to work with if we get those dimensions into their own columns.
Here's how I personally would go about structuring your table:
+----------------+--------------+---------+----------+-------+-----+
| mrchchant_city | mrch_zipcode | dim1 | dim2 | count | ... |
+----------------+--------------+---------+----------+-------+-----+
| A | 1001 | Paywave | confirm | 20 | ... |
| A | 1001 | Paywave | probable | 18 | ... |
| A | 1001 | EMV | confirm | 10 | ... |
| A | 1001 | EMV | probable | 15 | ... |
| B | 1005 | Paywave | confirm | 78 | ... |
| B | 1005 | Paywave | probable | 12 | ... |
| B | 1005 | EMV | confirm | 34 | ... |
| B | 1005 | EMV | probable | 67 | ... |
| C | 2001 | Paywave | confirm | 76 | ... |
| C | 2001 | Paywave | probable | 45 | ... |
| C | 2001 | EMV | confirm | 24 | ... |
| C | 2001 | EMV | probable | 56 | ... |
| C | 2001 | Paywave | confirm | 63 | ... |
| C | 2001 | Paywave | probable | 25 | ... |
| C | 2001 | EMV | confirm | 46 | ... |
| C | 2001 | EMV | probable | 19 | ... |
| ... | ... | ... | ... | ... | ... |
+----------------+--------------+---------+----------+-------+-----+
(Sorry about the dim1 and dim2, I don't really know what those dimensions represent. You can/should obviously pick a more intuitive nomenclature.)
Once you have a table with columns for your categorical data, it will be simple to filter and group by those dimensions.