How to join data from a DataFrame - Scala

I have one table with many different types of events, and some of the rows carry a piece of information that is really important for analysing the rest of the data.
This is the table that I have
name     | player_id | data_ms | coins | progress
progress | 1223      | 10      |       | 128
complete | 1223      | 11      | 154   |
win      | 1223      | 9       | 111   |
progress | 1223      | 11      |       | 129
played   | 1111      | 19      | 141   |
progress | 1111      | 25      |       | 225
This is the table that I want
name     | player_id | data_ms | coins | progress
progress | 1223      | 10      |       | 128
complete | 1223      | 11      | 154   | 128
win      | 1223      | 9       | 111   | 129
progress | 1223      | 11      |       | 129
played   | 1111      | 19      | 141   | 225
progress | 1111      | 25      |       | 225
I need to find the progress of the player, using the condition that it has to be the first progress emitted after the data_ms (epoch Unix timestamp) of the event in question.
My table has 4 billion rows and is partitioned by date.
I tried to create a UDF that reads and filters the table, but that isn't an option, since you can't serialize the Spark session into a UDF.
Any idea of how I should do this?

It seems like you want to fill gaps in the progress column. I didn't really understand the condition, but if it's based on data_ms then your Spark SQL query should look like this:
dataFrame.createOrReplaceTempView("your_table")

val progressDf = sparkSession.sql(
  """
  SELECT name, player_id, data_ms, coins,
         COALESCE(progress,
                  LAST_VALUE(progress, TRUE) OVER (
                    PARTITION BY player_id
                    ORDER BY data_ms
                    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)) AS progress
  FROM your_table
  """
)
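If the condition really is "the first non-null progress emitted at or after this event's data_ms", a forward-looking frame is one way to express it. This is only a sketch against the same temp view; it assumes FIRST_VALUE accepts the same ignore-nulls flag as the LAST_VALUE call above, and it does not decide how ties on data_ms should be broken:

-- Fill progress with the first non-null progress at or after this row's
-- data_ms, per player (sketch; frame boundaries may need adjusting).
SELECT name, player_id, data_ms, coins,
       COALESCE(progress,
                FIRST_VALUE(progress, TRUE) OVER (
                  PARTITION BY player_id
                  ORDER BY data_ms
                  ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)) AS progress
FROM your_table

Either way, the window is partitioned by player_id, so on a table this size Spark will shuffle the data by that key; it is worth checking that this fits your cluster before running it over all 4 billion rows.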

Related

SQL Server 2008 R2 - converting columns to rows and have all values in one column

I am having a hard time trying to wrap my head around the pivot/unpivot concepts and am hoping someone can help or give me some guidance on how to approach my problem.
Here is a simplified sample table I have
+-------+------+------+------+------+------+
| SAUID | COM1 | COM2 | COM3 | COM4 | COM5 |
+-------+------+------+------+------+------+
| 1 | 24 | 22 | 100 | 0 | 45 |
| 2 | 34 | 55 | 789 | 23 | 0 |
| 3 | 33 | 99 | 5552 | 35 | 4675 |
+-------+------+------+------+------+------+
The end result I am looking for is a table similar to the one below
+-------+-----------+-------+
| SAUID | OCCUPANCY | VALUE |
+-------+-----------+-------+
| 1 | COM1 | 24 |
| 1 | COM2 | 22 |
| 1 | COM3 | 100 |
| 1 | COM4 | 0 |
| 1 | COM5 | 45 |
| 2 | COM1 | 34 |
| 2 | COM2 | 55 |
| 2 | COM3 | 789 |
| 2 | COM4 | 23 |
| 2 | COM5 | 0 |
| 3 | COM1 | 33 |
| 3 | COM2 | 99 |
| 3 | COM3 | 5552 |
| 3 | COM4 | 35 |
| 3 | COM5 | 4675 |
+-------+-----------+-------+
I'm looking around, but most of the examples seem to use PIVOT, and I'm having a hard time mapping that onto my case since I need all the values in one column.
I was hoping to experiment with some hardcoding to get familiar with it using my example, but my actual tables have ~100 columns with varying numbers of SAUID rows per table, so it looks like this will require dynamic SQL?
Thanks for the help in advance.
Use UNPIVOT:
SELECT u.SAUID, u.OCCUPANCY, u.VALUE
FROM yourTable t
UNPIVOT
(
    VALUE FOR OCCUPANCY IN (COM1, COM2, COM3, COM4, COM5)
) u
ORDER BY
    u.SAUID, u.OCCUPANCY;
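For the ~100-column case mentioned in the question, the IN list can be built dynamically from sys.columns instead of being hardcoded. This is a rough sketch, assuming the table is dbo.yourTable and that every column to unpivot starts with COM (adjust the filter to your real naming); note that UNPIVOT also requires all listed columns to share a data type:

-- Sketch: collect the COMn column names, then build and run the same UNPIVOT
-- as above. Uses STUFF/FOR XML PATH, which works on SQL Server 2008 R2.
DECLARE @cols NVARCHAR(MAX), @sql NVARCHAR(MAX);

SELECT @cols = STUFF((
    SELECT ',' + QUOTENAME(c.name)
    FROM sys.columns c
    WHERE c.object_id = OBJECT_ID(N'dbo.yourTable')
      AND c.name LIKE 'COM%'        -- assumption: all unpivoted columns start with COM
    ORDER BY c.column_id
    FOR XML PATH('')), 1, 1, '');

SET @sql = N'
SELECT u.SAUID, u.OCCUPANCY, u.VALUE
FROM dbo.yourTable t
UNPIVOT
(
    VALUE FOR OCCUPANCY IN (' + @cols + N')
) u
ORDER BY u.SAUID, u.OCCUPANCY;';

EXEC sp_executesql @sql;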

Simulate Lag Function - Spark structured streaming

I'm using Spark Structured Streaming to analyze sensor data and need to perform calculations based on a sensor's previous timestamp. My incoming data stream has three columns: sensor_id, timestamp, and temp. I need to add a fourth column that holds that sensor's previous timestamp so that I can then calculate the time between data points for each sensor.
This is easy in traditional batch processing using a lag function and partitioning by sensor_id. What is the best way to approach this in a streaming situation?
So for example if my streaming dataframe looked like this:
+----------+-----------+------+
| SensorId | Timestamp | Temp |
+----------+-----------+------+
| 1800 | 34 | 23 |
| 500 | 36 | 54 |
| 1800 | 45 | 23 |
| 500 | 60 | 54 |
| 1800 | 78 | 23 |
+----------+-----------+------+
I would like something like this:
+----------+-----------+------+---------+
| SensorId | Timestamp | Temp | Prev_ts |
+----------+-----------+------+---------+
| 1800 | 34 | 23 | 21 |
| 500 | 36 | 54 | 27 |
| 1800 | 45 | 23 | 34 |
| 500 | 60 | 54 | 36 |
| 1800 | 78 | 23 | 45 |
+----------+-----------+------+---------+
If I try
test = filteredData.withColumn("prev_ts", lag("ts").over(Window.partitionBy("sensor_id").orderBy("ts")))
I get an AnalysisException: 'Non-time-based windows are not supported on streaming DataFrames/Datasets'.
Could I save the previous timestamp of each sensor in a data structure that I could reference and then update with each new timestamp?
There is no need to "simulate" anything. Standard window functions can be used with Structured Streaming.
s = spark.readStream.
    ...
    load()

s.withColumn("prev_ts", lag("Timestamp").over(
    Window.partitionBy("SensorId").orderBy("Timestamp")
))

PostgreSQL Crosstab generate_series of weeks for columns

From a table of "time entries" I'm trying to create a report of weekly totals for each user.
Sample of the table:
+-----+---------+-------------------------+--------------+
| id | user_id | start_time | hours_worked |
+-----+---------+-------------------------+--------------+
| 997 | 6 | 2018-01-01 03:05:00 UTC | 1.0 |
| 996 | 6 | 2017-12-01 05:05:00 UTC | 1.0 |
| 998 | 6 | 2017-12-01 05:05:00 UTC | 1.5 |
| 999 | 20 | 2017-11-15 19:00:00 UTC | 1.0 |
| 995 | 6 | 2017-11-11 20:47:42 UTC | 0.04 |
+-----+---------+-------------------------+--------------+
Right now I can run the following and basically get what I need
SELECT COALESCE(SUM(time_entries.hours_worked), 0) AS total,
       time_entries.user_id,
       week::date
-- Using generate_series here to account for weeks with no time entries when
-- doing the join
FROM generate_series(DATE_TRUNC('week', '2017-11-01 00:00:00'::date),
                     DATE_TRUNC('week', '2017-12-31 23:59:59.999999'::date),
                     interval '7 day') AS week
LEFT JOIN time_entries
       ON DATE_TRUNC('week', time_entries.start_time) = week
GROUP BY week, time_entries.user_id
ORDER BY week
This will return
+-------+---------+------------+
| total | user_id | week |
+-------+---------+------------+
| 14.08 | 5 | 2017-10-30 |
| 21.92 | 6 | 2017-10-30 |
| 10.92 | 7 | 2017-10-30 |
| 14.26 | 8 | 2017-10-30 |
| 14.78 | 10 | 2017-10-30 |
| 14.08 | 13 | 2017-10-30 |
| 15.83 | 15 | 2017-10-30 |
| 8.75 | 5 | 2017-11-06 |
| 10.53 | 6 | 2017-11-06 |
| 13.73 | 7 | 2017-11-06 |
| 14.26 | 8 | 2017-11-06 |
| 19.45 | 10 | 2017-11-06 |
| 15.95 | 13 | 2017-11-06 |
| 14.16 | 15 | 2017-11-06 |
| 1.00 | 20 | 2017-11-13 |
| 0 | | 2017-11-20 |
| 2.50 | 6 | 2017-11-27 |
| 0 | | 2017-12-04 |
| 0 | | 2017-12-11 |
| 0 | | 2017-12-18 |
| 0 | | 2017-12-25 |
+-------+---------+------------+
However, this is difficult to parse, particularly when there's no data for a week. What I would like is a pivot or crosstab table where the weeks are the columns and the rows are the users, including nulls from each side (for instance when a user had no entries in a week, or a week had no entries from any user).
Something like this
+---------+---------------+--------------+--------------+
| user_id | 2017-10-30 | 2017-11-06 | 2017-11-13 |
+---------+---------------+--------------+--------------+
| 6 | 4.0 | 1.0 | 0 |
| 7 | 4.0 | 1.0 | 0 |
| 8 | 4.0 | 0 | 0 |
| 9 | 0 | 1.0 | 0 |
| 10 | 4.0 | 0.04 | 0 |
+---------+---------------+--------------+--------------+
I've been looking around online and it seems that "dynamically" generating a list of columns for crosstab is difficult. I'd rather not hard code them, which seems weird to do anyway for dates. Or use something like this case with week number.
Should I look for another solution besides crosstab? If I could get the series of weeks for each user including all nulls I think that would be good enough. It just seems that right now my join strategy isn't returning that.
Personally I would use a Date Dimension table and use that table as the basis for the query. I find it far easier to use tabular data for these types of calculations as it leads to SQL that's easier to read and maintain. There's a great article on creating a Date Dimension table in PostgreSQL at https://medium.com/@duffn/creating-a-date-dimension-table-in-postgresql-af3f8e2941ac, though you could get away with a much simpler version of this table.
Ultimately what you would do is use the Date table as the base for the SELECT cols FROM table section and then join against that, or probably use Common Table Expressions, to create the calculations.
I'll write up a solution demonstrating how you could create such a query if you would like.
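In the meantime, the simpler thing asked for near the end of the question, "the series of weeks for each user including all nulls", can be produced without a date dimension by building the full user-by-week grid first and left-joining the time entries onto it. A sketch using the question's time_entries table and the same hardcoded date range; once every user/week pair is present, the pivot itself can be done by the reporting tool or by tablefunc's crosstab with a fixed column list:

-- Build every (user, week) combination, then attach the hours so that weeks
-- with no entries still appear with a zero total.
WITH weeks AS (
    SELECT week::date AS week
    FROM generate_series(DATE_TRUNC('week', DATE '2017-11-01'),
                         DATE_TRUNC('week', DATE '2017-12-31'),
                         interval '7 day') AS week
),
users AS (
    SELECT DISTINCT user_id FROM time_entries
)
SELECT u.user_id,
       w.week,
       COALESCE(SUM(te.hours_worked), 0) AS total
FROM users u
CROSS JOIN weeks w
LEFT JOIN time_entries te
       ON te.user_id = u.user_id
      AND DATE_TRUNC('week', te.start_time) = w.week
GROUP BY u.user_id, w.week
ORDER BY u.user_id, w.week;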

In postgresql, how do you find aggregate base on time range

For example, say I have a database table of transactions done over the counter, and I would like to find out whether there was any period that was extremely busy (more than 10 transactions processed in the span of 10 minutes). How would I go about querying that? Could I aggregate based on time ranges and count the number of transaction ids within those ranges?
Adding example to clarify my input and desired output:
+----+--------------------+
| Id | register_timestamp |
+----+--------------------+
| 25 | 08:10:50 |
| 26 | 09:07:36 |
| 27 | 09:08:06 |
| 28 | 09:08:35 |
| 29 | 09:12:08 |
| 30 | 09:12:18 |
| 31 | 09:12:44 |
| 32 | 09:15:29 |
| 33 | 09:15:47 |
| 34 | 09:18:13 |
| 35 | 09:18:42 |
| 36 | 09:20:33 |
| 37 | 09:20:36 |
| 38 | 09:21:04 |
| 39 | 09:21:53 |
| 40 | 09:22:23 |
| 41 | 09:22:42 |
| 42 | 09:22:51 |
| 43 | 09:28:14 |
+----+--------------------+
Desired output would be something like:
+-------+----------+
| Count | Min |
+-------+----------+
| 1 | 08:10:50 |
| 3 | 09:07:36 |
| 7 | 09:12:08 |
| 8 | 09:20:33 |
+-------+----------+
How about this:
SELECT c, time
FROM (
    SELECT count(*) AS c, min(time) AS time
    FROM transactions
    GROUP BY floor(extract(epoch FROM time) / 600)
) AS buckets
WHERE c > 10;
This will find all ten minute intervals for which more than ten transactions occurred within that interval. It assumes that the table is called transactions and that it has a column called time where the timestamp is stored.
Thanks to redneb, I ended up with the following query:
SELECT count(*) AS c, min(register_timestamp) AS register_timestamp
FROM trak_participants_data
GROUP BY floor(extract(epoch FROM register_timestamp) / 600)
ORDER BY register_timestamp
It works closely enough for me to be able to tell which time chunks are the busiest at the counter.
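One caveat with fixed 10-minute buckets is that a busy burst straddling a bucket boundary gets split across two groups and can be missed. If that ever matters, a sliding-window variant is possible; this is just a sketch using the same table and column names, and it assumes PostgreSQL 9.3+ for LATERAL:

-- For each transaction, count the transactions in the 10 minutes starting at
-- it, then keep only the starts of genuinely busy windows.
SELECT t.register_timestamp, w.c
FROM trak_participants_data t
CROSS JOIN LATERAL (
    SELECT count(*) AS c
    FROM trak_participants_data t2
    WHERE t2.register_timestamp >= t.register_timestamp
      AND t2.register_timestamp <  t.register_timestamp + interval '10 minutes'
) w
WHERE w.c > 10
ORDER BY t.register_timestamp;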

Subtract fields of a column - Tableau

I would like to subtract promoters and detractors in Tableau by creating a new column. Thanks for all the help!
Customer Type Table (I would like to create the NPS field as shown below):
+---------+------------+----------+-----------+--------------+
| Quarter | Detractors | Passives | Promoters | NPS |
+---------+------------+----------+-----------+--------------+
| Q1 15 | 40.56 | 23.56 | 35.79 | =35.79-40.56 |
| ... | ... | ... | ... | ... |
+---------+------------+----------+-----------+--------------+
Simply create a calculated field (called NPS):
[Promoters] - [Detractors]
This will add a new field to every row of your partition called NPS.
Check out the Tableau online help on calculated fields - this is a skill well worth learning.
I understand the OP's question. The data comes in like this:
+---------+---------------+-------+
| Quarter | Customer Type | Score |
+---------+---------------+-------+
| Q1 15   | Detractors    | 25    |
| Q1 15   | Promoters     | 32    |
| Q1 15   | Passives      | 45    |
| Q1 15   | Detractors    | 17    |
| Q1 15   | Detractors    | 28    |
| ...     | ...           | ...   |
+---------+---------------+-------+
And when brought into Tableau, the [Customer Type] field is put on the Columns shelf, which arranges the data like the table below. The OP wants to calculate the [NPS] column (Promoters - Detractors).
+---------+------------+----------+-----------+--------------+
| Quarter | Detractors | Passives | Promoters | NPS |
+---------+------------+----------+-----------+--------------+
| Q1 15 | 40.56 | 23.56 | 35.79 | =35.79-40.56 |
| ... | ... | ... | ... | ... |
+---------+------------+----------+-----------+--------------+
I hope this clarifies. I am stuck with a similar situation (I want a column that shows the difference between 2015 and 2016):
+---------+------+------+------------+
| Measure | 2015 | 2016 | Difference |
+---------+------+------+------------+
| # Hires | 100  | 115  | 15         |
| # Terms | 9    | 6    | 3          |
+---------+------+------+------------+
I believe the steps are similar. I hope someone can help.