How to get the runtime value of a column and use it in the next row - scala

I have the dataframe below. Based on the visited date I need to create a new column allowed: if the customer visited within a week of the last allowed visit, I have to mark allowed as No (4th row: 2020-01-10 - 2020-01-09 < 7), and if it is more than 1 week, allowed is Yes (3rd row: 2020-01-09 - 2020-01-01 > 7).
Input DF
Customar visited_date
John 2020-01-01
John 2020-01-05
John 2020-01-09
John 2020-01-10
John 2020-01-17
output DF
Customar visited_date allowed
John 2020-01-01 Yes
John 2020-01-05 No
John 2020-01-09 Yes
John 2020-01-10 No
John 2020-01-17 Yes
I don't know how to calculate the column value at runtime and use it in the calculation for the subsequent rows.
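Since the allowed flag of a row depends on the date of the last allowed visit (not just the previous row), a plain window function is not enough; one option is to walk each customer's visits in order and carry that date forward. Below is a minimal sketch of that idea using groupByKey/flatMapGroups, assuming the Customar and visited_date columns from the question; the case-class names, the local SparkSession and the strict "more than 7 days" rule are my assumptions.

import java.time.LocalDate
import java.time.temporal.ChronoUnit
import org.apache.spark.sql.SparkSession

// Illustrative case classes mirroring the question's columns.
case class Visit(Customar: String, visited_date: String)
case class Flagged(Customar: String, visited_date: String, allowed: String)

val spark = SparkSession.builder.appName("allowed-visits").master("local[*]").getOrCreate()
import spark.implicits._

// Sample data from the question.
val visits = Seq(
  Visit("John", "2020-01-01"),
  Visit("John", "2020-01-05"),
  Visit("John", "2020-01-09"),
  Visit("John", "2020-01-10"),
  Visit("John", "2020-01-17")
).toDS()

val flagged = visits
  .groupByKey(_.Customar)
  .flatMapGroups { (customer, rows) =>
    // Process one customer's visits in date order, carrying the date of the
    // last ALLOWED visit so the next row can be compared against it.
    val sorted = rows.toSeq.sortBy(_.visited_date)
    sorted.foldLeft((Option.empty[LocalDate], Vector.empty[Flagged])) {
      case ((lastAllowed, acc), v) =>
        val d  = LocalDate.parse(v.visited_date)
        val ok = lastAllowed.forall(prev => ChronoUnit.DAYS.between(prev, d) > 7)
        (if (ok) Some(d) else lastAllowed,
         acc :+ Flagged(customer, v.visited_date, if (ok) "Yes" else "No"))
    }._2
  }

flagged.orderBy("Customar", "visited_date").show()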

Related

How to calculate current month / six months ago and result as a percent change in Postgresql?

create table your_table(type text,compdate date,amount numeric);
insert into your_table values
('A','2022-01-01',50),
('A','2022-02-01',76),
('A','2022-03-01',300),
('A','2022-04-01',234),
('A','2022-05-01',14),
('A','2022-06-01',9),
('B','2022-01-01',201),
('B','2022-02-01',33),
('B','2022-03-01',90),
('B','2022-04-01',41),
('B','2022-05-01',11),
('B','2022-06-01',5),
('C','2022-01-01',573),
('C','2022-02-01',77),
('C','2022-03-01',109),
('C','2022-04-01',137),
('C','2022-05-01',405),
('C','2022-06-01',621);
I am trying to calculate and show the percentage change in $ from 6 months prior to today's date for each type. For example:
Type A decreased -82% over six months.
Type B decreased -97.5%
Type C increased +8.4%.
How do I write this in postgresql mixed in with other statements?
It looks like you are comparing against 5, not 6, months prior, and 2022-06-01 isn't today's date.
Join the table with itself based on the matching type and desired time difference. Demo
select
b.type,
b.compdate,
a.compdate "6 months earlier",
b.amount "current amount",
round(-(100-b.amount/a.amount*100),2) "change"
from your_table a
inner join your_table b
on a.type=b.type
and a.compdate = b.compdate - '5 months'::interval;
-- type | compdate | 6 months earlier | current amount | change
--------+------------+------------------+----------------------+--------
-- A | 2022-06-01 | 2022-01-01 | 9 | -82.00
-- B | 2022-06-01 | 2022-01-01 | 5 | -97.51
-- C | 2022-06-01 | 2022-01-01 | 621 | 8.38
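Since the rest of this page is Scala/Spark, here is a rough equivalent of the same self-join in Spark, assuming a DataFrame built from the sample rows with columns type, compdate and amount; the DataFrame name and the local SparkSession are illustrative, and only the January and June rows are included for brevity.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

val amounts = Seq(
  ("A", "2022-01-01", 50.0), ("A", "2022-06-01", 9.0),
  ("B", "2022-01-01", 201.0), ("B", "2022-06-01", 5.0),
  ("C", "2022-01-01", 573.0), ("C", "2022-06-01", 621.0)
).toDF("type", "compdate", "amount")
  .withColumn("compdate", to_date($"compdate"))

val a = amounts.alias("a")   // the earlier month
val b = amounts.alias("b")   // the later month

// Same join condition as the SQL: match on type with a 5-month offset.
a.join(b, col("a.type") === col("b.type") &&
          col("a.compdate") === add_months(col("b.compdate"), -5))
  .select(
    col("b.type"),
    col("b.compdate"),
    col("a.compdate").as("6 months earlier"),
    col("b.amount").as("current amount"),
    round((col("b.amount") / col("a.amount") - 1) * 100, 2).as("change"))
  .show()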

How do I sort hierarchical data in Palantir?

Let's say I have flight data (from Foundry Academy).
Starting dataset:
| Date | flight_id | origin_state | carrier_name |
| ---- | --------- | ------------ | ------------ |
| jan | 000000001 | California | delta air |
| jan | 000000002 | Alabama | delta air |
| jan | 000000003 | California | southwest |
| feb | 000000004 | California | southwest |
| ... | ... | ... | ... |
I'm doing monthly data aggregation by state and by carrier. Header of my aggregated data looks like this:
| origin state | carrier name | jan | feb | ... |
| ------------ | ------------ | --- | --- | --- |
| Alabama | delta air | 1 | 0 | ... |
| California | delta air | 1 | 0 | ... |
| California | southwest | 1 | 1 | ... |
I need to get subtotals for each state;
I need to sort by most flights;
and I want it to be sorted by states, then by carrier.
Desired output:
| origin state | carrier name | jan | feb | ... |
| ------------ | ------------ | --- | --- | --- |
| California | null | 2 | 1 | ... |
| California | delta air | 1 | 0 | ... |
| California | southwest | 1 | 1 | ... |
| Alabama | null | 1 | 0 | ... |
| Alabama | delta air | 1 | 0 | ... |
PIVOT - doesn't provide subtotals for categories;
EXPRESSION - doesn't offer the possibility to split the date column into separate columns.
I solved it in Contour. Not the prettiest solution, but it works.
I've created two paths to the same dataset:
| Date | flight_id | origin_state | carrier_name |
| ---- | --------- | ------------ | ------------ |
| ... | ... | ... | ... |
The 1st path was used to calculate the full aggregation: pivot the table and switch to pivoted data:
Switch to pivoted data: using column "date",
grouped by "origin_state" and "carrier_name",
aggregated by Count
The 2nd path was used to get the subtotals:
Switch to pivoted data: using column "date",
grouped by "origin_state",
aggregated by Count
Afterwards I added an empty column "carrier_name" to the second dataset and made a union of both datasets:
Add rows that appear in "second_path" by column name
After that I added an additional column with the expression:
Add new column "order" from max("Jan") OVER (
PARTITION BY "origin_state" )
After that I sorted the resulting dataset:
Sort dataset by "order" descending, then by "Jan" descending
I got the result, but it has an additional column, and now I wish to change the row formatting of the subtotals.
Other approaches are welcome, as my real data has more hierarchical levels.
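For comparison, roughly the same two-path idea can be sketched outside Contour in Spark/Scala; the column names follow the question, while the pivoted month list, the "order" helper column and the local SparkSession are assumptions rather than any Foundry API.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

val flights = Seq(
  ("jan", "000000001", "California", "delta air"),
  ("jan", "000000002", "Alabama", "delta air"),
  ("jan", "000000003", "California", "southwest"),
  ("feb", "000000004", "California", "southwest")
).toDF("date", "flight_id", "origin_state", "carrier_name")

// 1st path: counts per state and carrier, pivoted by month.
val byCarrier = flights
  .groupBy("origin_state", "carrier_name")
  .pivot("date", Seq("jan", "feb"))
  .count()
  .na.fill(0)

// 2nd path: state-level subtotals, with a null carrier so the schemas line up.
val byState = flights
  .groupBy("origin_state")
  .pivot("date", Seq("jan", "feb"))
  .count()
  .na.fill(0)
  .withColumn("carrier_name", lit(null).cast("string"))
  .select("origin_state", "carrier_name", "jan", "feb")

// Union, then sort states by their subtotal ("order"), keeping each state's
// subtotal row first and its carriers ordered by flight count.
val order = max("jan").over(Window.partitionBy("origin_state"))
byCarrier.select("origin_state", "carrier_name", "jan", "feb")
  .union(byState)
  .withColumn("order", order)
  .orderBy(desc("order"), col("origin_state"), desc("jan"))
  .show()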

Scala Spark get sum by time bucket across time spans and key

I have a question that is very similar to How to group by time interval in Spark SQL
However, my metric is time spent (duration), so my data looks like
KEY |Event_Type | duration | Time
001 |event1 | 10 | 2016-05-01 10:49:51
002 |event2 | 100 | 2016-05-01 10:50:53
001 |event3 | 20 | 2016-05-01 10:50:55
001 |event1 | 15 | 2016-05-01 10:51:50
003 |event1 | 13 | 2016-05-01 10:55:30
001 |event2 | 12 | 2016-05-01 10:57:00
001 |event3 | 11 | 2016-05-01 11:00:01
Is there a way to sum the time spent into five minute buckets, grouped by key, and know when the duration goes outside of the bound of the bucket?
For example, the first row starts at 10:49:51 and ends at 10:50:01
Thus, the bucket for key 001 in window [2016-05-01 10:45:00.0, 2016-05-01 10:50:00.0] would get 9 seconds of duration (10:49:51 to 10:50:00) and the 10:50 to 10:55 bucket would get the remaining 1 second, plus the relevant seconds from other log lines (20 seconds from the third row, 15 from the 4th row).
I want to sum the time in a specific bucket, but the solution on the other thread of
df.groupBy($"KEY", window($"time", "5 minutes")).sum("metric")
would put an event's whole duration into the bucket its timestamp starts in, overcounting that bucket and undercounting the subsequent ones.
Note: My Time column is also in Epoch timestamps like 1636503077, but I can easily cast it to the above format if that makes this calculation easier.
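(For reference, a minimal sketch of that cast, assuming the DataFrame is called df and the Time column holds epoch seconds:)

// Interpret the epoch-seconds column as a timestamp.
val withTs = df.withColumn("Time", col("Time").cast("timestamp"))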
In my opinion, you may need to preprocess your data by splitting each duration at every minute (or every five-minute) boundary.
As you describe it, the first row
001 |event1 | 10 | 2016-05-01 10:49:51
should be converted to
001 |event1 | 9 | 2016-05-01 10:49:51
001 |event1 | 1 | 2016-05-01 10:50:00
Then you can use the Spark window function to sum it properly:
df.groupBy($"KEY", window($"time", "5 minutes")).sum("metric")
That will not change the result if you only want to know the duration per time bucket, but it will increase the record count.
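A minimal sketch of that split-then-aggregate idea, assuming the schema from the question; the case class, the local SparkSession and the splitting helper are my own, not a standard API.

import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Illustrative case class mirroring the question's columns (duration in seconds).
case class Event(KEY: String, Event_Type: String, duration: Long, Time: Timestamp)

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

val bucketSeconds = 5 * 60L

val events = Seq(
  Event("001", "event1", 10, Timestamp.valueOf("2016-05-01 10:49:51")),
  Event("002", "event2", 100, Timestamp.valueOf("2016-05-01 10:50:53")),
  Event("001", "event3", 20, Timestamp.valueOf("2016-05-01 10:50:55"))
).toDS()

// Split every event at 5-minute boundaries so each piece lies inside one bucket.
val pieces = events.flatMap { e =>
  val start = e.Time.getTime / 1000          // epoch seconds
  val end   = start + e.duration
  Iterator
    .iterate(start)(t => (t / bucketSeconds + 1) * bucketSeconds)  // next boundary
    .takeWhile(_ < end)
    .map { t =>
      val pieceEnd = math.min((t / bucketSeconds + 1) * bucketSeconds, end)
      (e.KEY, e.Event_Type, pieceEnd - t, new Timestamp(t * 1000))
    }
    .toSeq
}.toDF("KEY", "Event_Type", "duration", "Time")

// Now the windowed sum counts every second in the bucket it actually falls into.
pieces
  .groupBy($"KEY", window($"Time", "5 minutes"))
  .sum("duration")
  .show(false)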

Filtering inactivated rows in Spark using Scala

I am very new to Spark and Scala programming and I have a problem that I hope some smart people can help me to solve.
I have a table named users with 4 columns: status, user_id, name, date
Rows are:
status user_id name date
active 1 Peter 2020-01-01
active 2 John 2020-01-01
active 3 Alex 2020-01-01
inactive 1 Peter 2020-02-01
inactive 2 John 2020-01-01
I need to select only active users. Two users were inactivated, but only one was inactivated for the same date as his active row.
What I aim to do is filter out rows with inactive status (this I can do) and also filter out inactivated users whose inactivation row matches the columns of an active row. Peter was inactivated for a different date, so he is not filtered. The desired result would be:
1 Peter 2020-01-01
3 Alex 2020-01-01
Rows with inactive status are filtered out. John is inactivated, so his active row is filtered too.
The closest I have come is to filter out users that have inactive status:
val users = spark.table("db.users")
.filter(col("status").notEqual("inactive"))
.select("user_id", "name", "date")
Any ideas or suggestions how to solve this?
Thanks!
Find the inactive rows first with a group-by per user and date, then join this result back to the original df.
val df2 = df.groupBy('user_id, 'date).agg(max('status).as("status"))
.filter("status = 'inactive'")
.withColumnRenamed("status", "inactive")
df.join(df2, Seq("user_id", "date"), "left")
.filter('inactive.isNull)
.select(df.columns.head, df.columns.tail: _*)
.show()
+------+-------+-----+----------+
|status|user_id| name| date|
+------+-------+-----+----------+
|active| 1|Peter|2020-01-01|
|active| 3| Alex|2020-01-01|
+------+-------+-----+----------+
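A small variant of the join above, assuming the same df and df2: Spark's left anti join keeps only the rows with no inactive match, so the null check can be dropped.

// Keep rows of df whose (user_id, date) pair has no match in the inactive set.
df.join(df2.select("user_id", "date"), Seq("user_id", "date"), "left_anti")
  .select(df.columns.head, df.columns.tail: _*)
  .show()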

Remove duplicates in spark with 90 percent column match

Compare rows in a dataframe in Spark and remove a row if 90 percent of its columns match another row (if there are 10 columns and 9 of them match). How to do this?
Name Country City Married Salary
Tony India Delhi Yes 30000
Carol USA Chicago Yes 35000
Shuaib France Paris No 25000
Dimitris Spain Madrid No 28000
Richard Italy Milan Yes 32000
Adam Portugal Lisbon Yes 36000
Tony India Delhi Yes 22000 <--
Carol USA Chicago Yes 21000 <--
Shuaib France Paris No 20000 <--
I have to remove the marked rows since 4 out of 5 column values match rows that already exist. How to do this with a PySpark DataFrame? TIA
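The question says PySpark, but in keeping with the rest of this page here is a rough Scala/Spark sketch of one way to do it; the id column, the 4-of-5 threshold and the full pairwise self-join are assumptions, and a comparison like this only scales to fairly small data. The same approach translates directly to PySpark.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(
  ("Tony", "India", "Delhi", "Yes", 30000),
  ("Carol", "USA", "Chicago", "Yes", 35000),
  ("Shuaib", "France", "Paris", "No", 25000),
  ("Dimitris", "Spain", "Madrid", "No", 28000),
  ("Richard", "Italy", "Milan", "Yes", 32000),
  ("Adam", "Portugal", "Lisbon", "Yes", 36000),
  ("Tony", "India", "Delhi", "Yes", 22000),
  ("Carol", "USA", "Chicago", "Yes", 21000),
  ("Shuaib", "France", "Paris", "No", 20000)
).toDF("Name", "Country", "City", "Married", "Salary")
  .withColumn("id", monotonically_increasing_id())   // follows the original row order here

val cols = Seq("Name", "Country", "City", "Married", "Salary")
val minMatches = 4   // the "4 out of 5 column values" from the question

val a = df.alias("a")
val b = df.alias("b")

// Count how many columns two rows agree on.
val matches = cols.map(c => when(col(s"a.$c") === col(s"b.$c"), 1).otherwise(0)).reduce(_ + _)

// ids of rows that match an EARLIER row on at least minMatches columns.
val dupIds = a.join(b, col("a.id") > col("b.id"))
  .where(matches >= minMatches)
  .select(col("a.id").as("id"))
  .distinct()

df.join(dupIds, Seq("id"), "left_anti")
  .drop("id")
  .show()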