Aggregate all previous rows for a specific time difference - Scala

I have a Spark DataFrame with the following entries:
+----------+------------------+-----+
| order id | time             | amt |
+----------+------------------+-----+
| 1        | 2017-10-01 12:00 | 100 |
| 2        | 2017-10-01 15:00 | 100 |
| 3        | 2017-10-01 17:00 | 100 |
| 4        | 2017-10-02 16:00 | 100 |
| 5        | 2017-10-02 23:00 | 100 |
+----------+------------------+-----+
I want to add a column amt_24h that contains, for each order, the sum of amt over all other orders placed in the preceding 24 hours:
+----------+------------------+-----+---------+
| order id | time             | amt | amt_24h |
+----------+------------------+-----+---------+
| 1        | 2017-10-01 12:00 | 100 | 0       |
| 2        | 2017-10-01 15:00 | 100 | 100     |
| 3        | 2017-10-01 17:00 | 100 | 200     |
| 4        | 2017-10-02 16:00 | 100 | 100     |
| 5        | 2017-10-02 23:00 | 100 | 100     |
+----------+------------------+-----+---------+
How would I go about doing it?

This is PySpark code; the Scala API is very similar.
from pyspark.sql import functions as F, Window

# 24-hour range frame: from 24 hours before the current order up to 1 second before it
df = df.withColumn('time_uts', F.unix_timestamp('time', format='yyyy-MM-dd HH:mm'))
df = df.withColumn('amt_24h', F.sum('amt').over(Window.orderBy('time_uts').rangeBetween(-24 * 3600, -1))).fillna(0, subset='amt_24h')
I hope this helps.
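
For reference, a rough Scala equivalent of the same approach might look like this (a sketch only, assuming the column names from the question and that df is the DataFrame shown above):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum, unix_timestamp}

// Convert the string timestamp to Unix seconds so a range frame expressed in seconds can be used
val withSeconds = df.withColumn("time_uts", unix_timestamp(col("time"), "yyyy-MM-dd HH:mm"))

// Frame covering every order from 24 hours before the current one up to 1 second before it
val last24h = Window.orderBy("time_uts").rangeBetween(-24 * 3600, -1)

val result = withSeconds
  .withColumn("amt_24h", sum("amt").over(last24h))
  .na.fill(0L, Seq("amt_24h"))

Note that Window.orderBy without a partitionBy pulls all rows into a single partition; if the orders can be split into independent groups (per customer, for example), adding Window.partitionBy(...) keeps this scalable.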

Related

PostgreSQL question: get data by the last date of each record and subtract a number of days from that last date

Please help me write a query; I'm at a dead end.
There are 2 tables:
“Trains”:
+----+---------+
| id | numbers |
+----+---------+
| 1 | 101 |
| 2 | 102 |
| 3 | 103 |
| 4 | 104 |
| 5 | 105 |
+----+---------+
“Passages”:
+----+--------------+-------+---------------------+
| id | train_number | speed | date_time |
+----+--------------+-------+---------------------+
| 1 | 101 | 26 | 2021-11-10 16:26:30 |
| 2 | 101 | 28 | 2021-11-12 16:26:30 |
| 3 | 102 | 24 | 2021-11-14 16:26:30 |
| 4 | 103 | 27 | 2021-11-15 16:26:30 |
| 5 | 101 | 29 | 2021-11-16 16:26:30 |
+----+--------------+-------+---------------------+
The goal is to go through the train numbers from the Trains table and, for each one, take from the Passages table the latest date (date_time) and the number of passages within "the last date for each train" minus N days (as I understand it, date_time - interval 'N days'). The result should look something like this:
+----+--------+---------------------+----------------+
| id | train | last_passage | count_passages |
+----+--------+---------------------+----------------+
| 1 | 101 | 2021-11-10 16:26:30 | 2 |
| 2 | 102 | 2021-11-14 16:26:30 | 1 |
| 3 | 103 | 2021-11-15 16:26:30 | 1 |
| 4 | 104 | null | 0 |
| 5 | 105 | null | 0 |
+----+--------+---------------------+----------------+
ps: "count_passages" - for example, last passage date minus 4 days
I tried through "where in" but I can’t create the necessary and correct request

How to display results for each year in a dynamic column in Crystal Reports

How to display each year's cost in dynamic columns (max 3 years) in Crystal Reports.
Parameters: Date From and Date To
Crystal Reports version: 2013
Table : Jobs
+-------+------------+------------+
| EQ_no | Job_Date | Total_Cost |
+-------+------------+------------+
| 1006 | 01/30/2017 | 250 |
| 1006 | 01/31/2018 | 350 |
| 1006 | 01/01/2019 | 150 |
| 1006 | 02/01/2019 | 322 |
| 1006 | 05/05/2019 | 450 |
| 1006 | 02/02/2020 | 500 |
| 1006 | 02/03/2021 | 1212 |
| 29198 | 02/04/2017 | 3000 |
| 29198 | 02/05/2018 | 250 |
+-------+------------+------------+
Table : Equipment
+-------+-----------+
| EQ_no | Serial no |
+-------+-----------+
| 1006 | MDRSC12 |
| 29198 | FDRSC13 |
| 6218 | REAFC14 |
+-------+-----------+
Result:
+-------+-----------+------+------+------+
| EQ_no | Serial no | 2018 | 2019 | 2020 |
+-------+-----------+------+------+------+
| 1006 | MDRSC12 | 350 | 922 | 500 |
| 29198 | FDRSC13 | 250 | 0 | 0 |
| 6218 | REAFC14 | 0 | 0 | 0 |
+-------+-----------+------+------+------+
If the date range is 1-Jan-2018 to 1-June-2020, then show each year's total cost for 2018, 2019 & 2020.
If the date range is 1-Jan-2020 to 1-June-2021, then show each year's total cost for 2020 & 2021 only.
Create a Formula Field that uses the Year() function to extract only the 4 digit numerical year from your Job_Date field. Name this field whatever you like, but I will call it "JobYear" going forward in this answer.
The formula will be Year(Job_Date).
Now create a second Formula Field that uses the same function to extract the 4 digit numerical year from today's date. I will call this formula field "CurrentYear" going forward.
This formula will be Year(CurrentDate).
Now create 3 Running Total Fields. Name them something like ThisYear, LastYear, and TwoYearsAgo. Set all three of these fields to summarize the Total_Cost field. Set the reset conditions to whatever is most appropriate for your report, and then set the evaluate conditions to use a formula and use the following formulas for each one.
For ThisYear the formula should be CurrentYear = JobYear.
For LastYear the formula should be CurrentYear - 1 = JobYear.
For TwoYearsAgo the formula should be CurrentYear - 2 = JobYear.
This will allow the running total fields to summarize the total cost for any job into the correct buckets based upon the year the job was completed.

PostgreSQL Crosstab generate_series of weeks for columns

From a table of "time entries" I'm trying to create a report of weekly totals for each user.
Sample of the table:
+-----+---------+-------------------------+--------------+
| id | user_id | start_time | hours_worked |
+-----+---------+-------------------------+--------------+
| 997 | 6 | 2018-01-01 03:05:00 UTC | 1.0 |
| 996 | 6 | 2017-12-01 05:05:00 UTC | 1.0 |
| 998 | 6 | 2017-12-01 05:05:00 UTC | 1.5 |
| 999 | 20 | 2017-11-15 19:00:00 UTC | 1.0 |
| 995 | 6 | 2017-11-11 20:47:42 UTC | 0.04 |
+-----+---------+-------------------------+--------------+
Right now I can run the following and basically get what I need
SELECT COALESCE(SUM(time_entries.hours_worked), 0) AS total,
       time_entries.user_id,
       week::date
-- Using generate_series here to account for weeks with no time entries
-- when doing the join
FROM generate_series(DATE_TRUNC('week', '2017-11-01 00:00:00'::date),
                     DATE_TRUNC('week', '2017-12-31 23:59:59.999999'::date),
                     interval '7 day') AS week
LEFT JOIN time_entries
       ON DATE_TRUNC('week', time_entries.start_time) = week
GROUP BY week, time_entries.user_id
ORDER BY week
This will return
+-------+---------+------------+
| total | user_id | week |
+-------+---------+------------+
| 14.08 | 5 | 2017-10-30 |
| 21.92 | 6 | 2017-10-30 |
| 10.92 | 7 | 2017-10-30 |
| 14.26 | 8 | 2017-10-30 |
| 14.78 | 10 | 2017-10-30 |
| 14.08 | 13 | 2017-10-30 |
| 15.83 | 15 | 2017-10-30 |
| 8.75 | 5 | 2017-11-06 |
| 10.53 | 6 | 2017-11-06 |
| 13.73 | 7 | 2017-11-06 |
| 14.26 | 8 | 2017-11-06 |
| 19.45 | 10 | 2017-11-06 |
| 15.95 | 13 | 2017-11-06 |
| 14.16 | 15 | 2017-11-06 |
| 1.00 | 20 | 2017-11-13 |
| 0 | | 2017-11-20 |
| 2.50 | 6 | 2017-11-27 |
| 0 | | 2017-12-04 |
| 0 | | 2017-12-11 |
| 0 | | 2017-12-18 |
| 0 | | 2017-12-25 |
+-------+---------+------------+
However, this is difficult to parse, particularly when there's no data for a week. What I would like is a pivot or crosstab table where the weeks are the columns and the rows are the users, including empty cells in both directions (for instance when a user had no entries in a given week, or a week had no entries from any user).
Something like this
+---------+---------------+--------------+--------------+
| user_id | 2017-10-30 | 2017-11-06 | 2017-11-13 |
+---------+---------------+--------------+--------------+
| 6 | 4.0 | 1.0 | 0 |
| 7 | 4.0 | 1.0 | 0 |
| 8 | 4.0 | 0 | 0 |
| 9 | 0 | 1.0 | 0 |
| 10 | 4.0 | 0.04 | 0 |
+---------+---------------+--------------+--------------+
I've been looking around online and it seems that "dynamically" generating a list of columns for crosstab is difficult. I'd rather not hard-code them, which seems weird to do anyway for dates, or resort to something like a CASE expression per week number.
Should I look for another solution besides crosstab? If I could get the series of weeks for each user including all nulls I think that would be good enough. It just seems that right now my join strategy isn't returning that.
Personally I would use a date dimension table and use that table as the basis for the query. I find it far easier to use tabular data for these types of calculations, as it leads to SQL that's easier to read and maintain. There's a great article on creating a date dimension table in PostgreSQL at https://medium.com/@duffn/creating-a-date-dimension-table-in-postgresql-af3f8e2941ac, though you could get away with a much simpler version of that table.
Ultimately, you would use the date table as the base of the FROM clause and join against it, probably via common table expressions, to build the calculations.
If you would like, I'll write up a solution demonstrating how you could create such a query.

Spotfire - Calculate average only if there are minimum 3 values

I want to create a cross table in Spotfire in which the average is calculated only when there are at least 3 values. If there are no values, or fewer than 3, the average should be blank.
+-------+-----+---------+
| Month | Age | Average |
+-------+-----+---------+
| 1 | 10 | |
| 2 | 11 | |
| 3 | 2 | 7.7 |
| 4 | | |
| 5 | 13 | |
| 6 | 14 | |
| 7 | | |
| 8 | 19 | |
| 9 | 20 | |
| 10 | 21 | 20 |
+-------+-----+---------+
If I'm understanding you correctly, you want to group by Month, and then have something like this as your aggregation:
If(Count()>2,Avg([Age]),null) as [AverageAge_3Min]

How can I disaggregate rows of a data frame in Spark?

I have a Spark dataframe containing data similar to the following:
+----+---------------------+-------+----------+-------------+
| ID | Timestamp | Value | Interval | Consumption |
+----+---------------------+-------+----------+-------------+
| 1 | 2012-05-02 12:30:00 | 550 | 1 | 5 |
| 1 | 2012-05-02 12:45:00 | 551 | 1 | 1 |
| 1 | 2012-05-02 13:00:00 | 554 | 1 | 3 |
| 1 | 2012-05-02 14:00:00 | 578 | 4 | 24 |
| 1 | 2012-05-02 14:15:00 | 578 | 1 | 0 |
| 1 | 2012-05-02 14:30:00 | 584 | 1 | 6 |
+----+---------------------+-------+----------+-------------+
I'm looking to turn this into something like the following:
+----+---------------------+-------+----------+-------------+------------+
| ID | Timestamp | Value | Interval | Consumption | Estimation |
+----+---------------------+-------+----------+-------------+------------+
| 1 | 2012-05-02 12:30:00 | 550 | 1 | 5 | ? |
| 1 | 2012-05-02 12:45:00 | 551 | 1 | 1 | ? |
| 1 | 2012-05-02 13:00:00 | 554 | 1 | 3 | ? |
| 1 | 2012-05-02 13:15:00 | 560 | 1 | 6 | 4 |
| 1 | 2012-05-02 13:30:00 | 566 | 1 | 6 | 4 |
| 1 | 2012-05-02 13:45:00 | 572 | 1 | 6 | 4 |
| 1 | 2012-05-02 14:00:00 | 578 | 1 | 6 | 4 |
| 1 | 2012-05-02 14:15:00 | 578 | 1 | 0 | ? |
| 1 | 2012-05-02 14:30:00 | 584 | 1 | 6 | ? |
+----+---------------------+-------+----------+-------------+------------+
More specifically I want to turn this:
+----+---------------------+-------+----------+-------------+
| ID | Timestamp | Value | Interval | Consumption |
+----+---------------------+-------+----------+-------------+
| 1 | 2012-05-02 14:00:00 | 578 | 4 | 24 |
+----+---------------------+-------+----------+-------------+
Into this:
+----+---------------------+-------+----------+-------------+------------+
| ID | Timestamp | Value | Interval | Consumption | Estimation |
+----+---------------------+-------+----------+-------------+------------+
| 1 | 2012-05-02 13:15:00 | 560 | 1 | 6 | 4 |
| 1 | 2012-05-02 13:30:00 | 566 | 1 | 6 | 4 |
| 1 | 2012-05-02 13:45:00 | 572 | 1 | 6 | 4 |
| 1 | 2012-05-02 14:00:00 | 578 | 1 | 6 | 4 |
+----+---------------------+-------+----------+-------------+------------+
I want to take the rows with more than 1 interval out of the original table, interpolate Values for the missing intervals, and reinsert the newly created rows into the initial table in place of the original rows. I have ideas of how to achieve this (in PostgreSQL, for example, I would simply use the generate_series() function to create the required Timestamps and calculate the new Values), but implementing them in Spark/Scala is proving troublesome.
Assuming I've created a new dataframe containing only the rows with Interval > 1, how could I replicate those rows 'n' times, with 'n' being the value of Interval? I believe that would give me enough to get going, using a counter function partitioned by some row reference I can create.
If there's a way to replicate the behavior of generate_series() that I've missed, even better.
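
For the narrower question at the end (replicating each row n times, with n equal to Interval), one option in Spark 2.4+ is to explode a generated sequence. The sketch below assumes the column names shown above and a 15-minute base interval; it splits Consumption evenly and carries the original Interval in an Estimation column (matching the 4s in the example output), but leaves the Value interpolation (which needs the previous row's Value, e.g. via lag over a window ordered by Timestamp) as a separate step:

import org.apache.spark.sql.functions._

// Rows that span more than one 15-minute interval
val multi = df.filter(col("Interval") > 1)

// Replicate each row Interval times by exploding a 0 .. Interval-1 sequence (Spark 2.4+)
val replicated = multi.withColumn("step", explode(sequence(lit(0), col("Interval") - 1)))

// Walk the timestamp back 15 minutes per remaining step, split Consumption evenly,
// and keep the original Interval as the Estimation value
val estimated = replicated
  .withColumn("Estimation", col("Interval"))
  .withColumn("Timestamp",
    (unix_timestamp(col("Timestamp")) - (col("Interval") - 1 - col("step")) * 15 * 60)
      .cast("timestamp"))
  .withColumn("Consumption", col("Consumption") / col("Interval"))
  .withColumn("Interval", lit(1))
  .drop("step")

These rows could then be unioned back with the Interval = 1 rows of the original frame. The sequence/explode pair is also the closest Spark analogue to PostgreSQL's generate_series() that I'm aware of.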