Spark Scala SQL making a new Column based on another column - scala

I have a DataFrame as follows:
EmpId EmpName Salary SalaryDate
1 Amit 1000.0 2016-01-01
1 Amit 2000.0 2016-02-01
1 Amit 1000.0 2016-03-01
1 Amit 2000.0 2016-04-01
1 Amit 3000.0 2016-05-01
1 Amit 1000.0 2016-06-01
I want to add a new column named prevSal which will hold Amit's previous row's Salary value.
Expected Output:
EmpId EmpName Salary SalaryDate prevSal
1 Amit 1000.0 2016-01-01 null
1 Amit 2000.0 2016-02-01 1000.0
1 Amit 1000.0 2016-03-01 2000.0
1 Amit 2000.0 2016-04-01 1000.0
1 Amit 3000.0 2016-05-01 2000.0
1 Amit 1000.0 2016-06-01 3000.0
Also, I want a new column named nextSal which will hold Amit's next row's Salary value.
Expected Output:
EmpId EmpName Salary SalaryDate prevSal nextSal
1 Amit 1000.0 2016-01-01 null 2000.0
1 Amit 2000.0 2016-02-01 1000.0 1000.0
1 Amit 1000.0 2016-03-01 2000.0 2000.0
1 Amit 2000.0 2016-04-01 1000.0 3000.0
1 Amit 3000.0 2016-05-01 2000.0 1000.0
1 Amit 1000.0 2016-06-01 3000.0 null

As ggordon pointed out in their comment, this is a simple case of using lag and lead to access the previous and next row's value relative to the row currently being scanned.
The key to making these functions work is enforcing an order on the rows of the DataFrame, so that it is fully determined which row comes before and after another. For this we use a window specification with orderBy, applied to a column that serves as the reference for ordering the rows. Since your rows are already ordered by the SalaryDate column, something like this is sufficient:
val w = Window.orderBy("SalaryDate")
For the prevSal column, the usage of lag can be seen here, and based on that we can use the window specification from above to access the previous row's Salary value with something like this (the first argument is the name of the column to look up, the second argument is the offset, i.e. how many rows back we want to go each time):
lag("Salary", 1).over(w)
lead works the same way, but for the following rows' Salary values. lead's usage can be seen here:
lead("Salary", 1).over(w)
So all of this can look a bit like this (given that df is the name of your DataFrame):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lag, lead}

val w = Window.orderBy("SalaryDate")
df.withColumn("prevSal", lag("Salary", 1).over(w))
  .withColumn("nextSal", lead("Salary", 1).over(w))
  .show()
So the end result would look like this:
+-----+-------+------+----------+-------+-------+
|EmpId|EmpName|Salary|SalaryDate|prevSal|nextSal|
+-----+-------+------+----------+-------+-------+
| 1| Amit| 1000|2016-01-01| null| 2000|
| 1| Amit| 2000|2016-02-01| 1000| 1000|
| 1| Amit| 1000|2016-03-01| 2000| 2000|
| 1| Amit| 2000|2016-04-01| 1000| 3000|
| 1| Amit| 3000|2016-05-01| 2000| 1000|
| 1| Amit| 1000|2016-06-01| 3000| null|
+-----+-------+------+----------+-------+-------+
In case you want to make prevSal and nextSal specific for each employee (e.g. Amit to have a different set of prevSal/nextSal than, let's say, John), you simply change the window specification by first partitioning the table by EmpName and then ordering it by SalaryDate:
val w = Window.partitionBy("EmpName").orderBy("SalaryDate")
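For reference, a rough PySpark equivalent of this per-employee approach could look like the sketch below (the sample rows are just the first few from the question, and the DataFrame name df is assumed):
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# a few sample rows from the question, just to make the sketch runnable
df = spark.createDataFrame(
    [(1, "Amit", 1000.0, "2016-01-01"),
     (1, "Amit", 2000.0, "2016-02-01"),
     (1, "Amit", 1000.0, "2016-03-01")],
    ["EmpId", "EmpName", "Salary", "SalaryDate"],
)

# one window per employee, ordered by the salary date
w = Window.partitionBy("EmpName").orderBy("SalaryDate")

result = (
    df.withColumn("prevSal", F.lag("Salary", 1).over(w))
      .withColumn("nextSal", F.lead("Salary", 1).over(w))
)
result.show()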

Related

Keep only modified rows in Pyspark

I need to clean a dataset by keeping only the rows that were modified (compared to the previous row) based on certain fields (in the example below we only consider city and sport, for each id), keeping only the first occurrence.
If a row goes back to a previous state (but not to the immediately preceding one), I still want to keep it.
Input df1:
id  | city       | sport      | date
abc | london     | football   | 2022-02-11
abc | paris      | football   | 2022-02-12
abc | paris      | football   | 2022-02-13
abc | paris      | football   | 2022-02-14
abc | paris      | football   | 2022-02-15
abc | london     | football   | 2022-02-16
abc | paris      | football   | 2022-02-17
def | paris      | volley     | 2022-02-10
def | paris      | volley     | 2022-02-11
ghi | manchester | basketball | 2022-02-09
Desired output:
id  | city       | sport      | date
abc | london     | football   | 2022-02-11
abc | paris      | football   | 2022-02-12
abc | london     | football   | 2022-02-16
abc | paris      | football   | 2022-02-17
def | paris      | volley     | 2022-02-10
ghi | manchester | basketball | 2022-02-09
I would simply use a lag function to compare over a hash:
from pyspark.sql import functions as F, Window

output_df = (
    df.withColumn("hash", F.hash(F.col("city"), F.col("sport")))
    .withColumn(
        "prev_hash", F.lag("hash").over(Window.partitionBy("id").orderBy("date"))
    )
    # eqNullSafe keeps the first row per id, whose prev_hash is null
    .where(~F.col("hash").eqNullSafe(F.col("prev_hash")))
    .drop("hash", "prev_hash")
)
output_df.show()
+---+----------+----------+----------+
| id| city| sport| date|
+---+----------+----------+----------+
|abc| london| football|2022-02-11|
|abc| paris| football|2022-02-12|
|abc| london| football|2022-02-16|
|abc| paris| football|2022-02-17|
|def| paris| volley|2022-02-10|
|ghi|manchester|basketball|2022-02-09|
+---+----------+----------+----------+
Though the following solution works for the given data, there are two caveats:
Spark's architecture is not suitable for serial processing like this.
As I pointed out in the comment, you must have a key attribute, or a combination of attributes, that can bring your data back into order if it gets fragmented. A slight change in partitioning and fragmentation can change the results.
The logic is:
Shift "city" and "sport" down by one row.
Compare this row's "city" and "sport" with the shifted values. If you see a difference, then that is a new row; for identical rows there is no difference. For this we use Spark's Window utility and a "dummy_serial_key".
Filter the data that matches the above condition.
Feel free to add more columns as per your data design:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = spark.createDataFrame(data=[["abc","london","football","2022-02-11"],["abc","paris","football","2022-02-12"],["abc","paris","football","2022-02-13"],["abc","paris","football","2022-02-14"],["abc","paris","football","2022-02-15"],["abc","london","football","2022-02-16"],["abc","paris","football","2022-02-17"],["def","paris","volley","2022-02-10"],["def","paris","volley","2022-02-11"],["ghi","manchester","basketball","2022-02-09"]], schema=["id","city","sport","date"])
df = df.withColumn("date", F.to_date("date", format="yyyy-MM-dd"))
df = df.withColumn("dummy_serial_key", F.lit(0))
dummy_w = Window.partitionBy("dummy_serial_key").orderBy("dummy_serial_key")
df = df.withColumn("city_prev", F.lag("city", offset=1).over(dummy_w))
df = df.withColumn("sport_prev", F.lag("sport", offset=1).over(dummy_w))
df = df.filter(
    (F.col("city_prev").isNull())
    | (F.col("sport_prev").isNull())
    | (F.col("city") != F.col("city_prev"))
    | (F.col("sport") != F.col("sport_prev"))
)
df = df.drop("dummy_serial_key", "city_prev", "sport_prev")
df.show()
+---+----------+----------+----------+
| id| city| sport| date|
+---+----------+----------+----------+
|abc| london| football|2022-02-11|
|abc| paris| football|2022-02-12|
|abc| london| football|2022-02-16|
|abc| paris| football|2022-02-17|
|def| paris| volley|2022-02-10|
|ghi|manchester|basketball|2022-02-09|
+---+----------+----------+----------+

Can temporals be used as columns in KDB?

I've created a pivot table based on:
https://code.kx.com/q/kb/pivoting-tables/
I've just replaced the symbols with minutes:
t:([]k:1 2 3 2 3;p:09:00 09:30 10:00 09:00 09:30; v:10 20 30 40 50)
P:asc exec distinct p from t;
exec P#(p!v) by k:k from t
Suffice to say, this doesn't work:
k|
-| -----------------------------
1| `s#09:00 09:30 10:00!10 0N 0N
2| `s#09:00 09:30 10:00!40 20 0N
3| `s#09:00 09:30 10:00!0N 50 30
which I expected, as the docs say P must be a list of symbols.
My question is; can temporal datatypes be used as columns at all in KDB?
Column names must be symbols. You can use .Q.id to give columns valid names, for example:
q)t:([]k:1 2 3 2 3;p:09:00 09:30 10:00 09:00 09:30; v:10 20 30 40 50)
q)P:.Q.id each asc exec distinct p from t;
q)exec P#.Q.id'[p]!v by k:k from t
k| a0900 a0930 a1000
-| -----------------
1| 10
2| 40 20
3| 50 30
You could, of course, convert the minutes to their symbolic representation like this:
q)P:`$string asc exec distinct p from t;
q)exec P#(`$string p)!v by k:k from t
k| 09:00 09:30 10:00
-| -----------------
1| 10
2| 40 20
3| 50 30
but the result would be confusing at best; I strongly advise against such column names.

Pyspark calculated field based off time difference

I have a table that looks like this:
trip_distance | tpep_pickup_datetime | tpep_dropoff_datetime|
+-------------+----------------------+----------------------+
1.5 | 2019-01-01 00:46:40 | 2019-01-01 00:53:20 |
In the end, I need to create a speed column for each row, so something like this:
trip_distance | tpep_pickup_datetime | tpep_dropoff_datetime| speed |
+-------------+----------------------+----------------------+-------+
1.5 | 2019-01-01 00:46:40 | 2019-01-01 00:53:20 | 13.5 |
So this is what I'm trying to do to get there. I figure I should add an interim column to help out, called trip_time, which is calculated as tpep_dropoff_datetime - tpep_pickup_datetime. Here is the code I'm using to get that:
df4 = df.withColumn('trip_time', df.tpep_dropoff_datetime - df.tpep_pickup_datetime)
which is producing a nice trip_time column:
trip_distance | tpep_pickup_datetime | tpep_dropoff_datetime| trip_time|
+-------------+----------------------+----------------------+-----------------------+
1.5 | 2019-01-01 00:46:40 | 2019-01-01 00:53:20 | 6 minutes 40 seconds|
But now I want to do the speed column, and this how I'm trying to do that:
df4 = df4.withColumn('speed', (F.col('trip_distance') / F.col('trip_time')))
But that is giving me this error:
AnalysisException: cannot resolve '(trip_distance/trip_time)' due to data type mismatch: differing types in '(trip_distance/trip_time)' (float and interval).;;
Is there a better way?
One option is to convert your timestamps with unix_timestamp, which gives the time in seconds; then you can do the subtraction, which gives you the interval as an integer that can be further used to calculate the speed:
import pyspark.sql.functions as f

df.withColumn('speed', f.col('trip_distance') * 3600 / (
    f.unix_timestamp('tpep_dropoff_datetime') - f.unix_timestamp('tpep_pickup_datetime'))
).show()
+-------------+--------------------+---------------------+-----+
|trip_distance|tpep_pickup_datetime|tpep_dropoff_datetime|speed|
+-------------+--------------------+---------------------+-----+
| 1.5| 2019-01-01 00:46:40| 2019-01-01 00:53:20| 13.5|
+-------------+--------------------+---------------------+-----+
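If you would rather keep the interim trip_time idea from the question, a similar sketch (assuming the pickup/dropoff columns are already timestamp-typed; df_with_speed and trip_time_sec are hypothetical names) is to cast both timestamps to long, which yields epoch seconds, and then divide the distance by the duration in hours:
import pyspark.sql.functions as F

# duration in seconds, obtained by casting the timestamps to long (epoch seconds)
df_with_speed = (
    df.withColumn(
        "trip_time_sec",
        F.col("tpep_dropoff_datetime").cast("long") - F.col("tpep_pickup_datetime").cast("long"),
    )
    .withColumn("speed", F.col("trip_distance") / (F.col("trip_time_sec") / 3600))
)
df_with_speed.show()
For the sample row, the duration is 400 seconds, so 1.5 / (400 / 3600) = 13.5, matching the expected output.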

Filter out null strings and empty strings in hivecontext.sql

I'm using pyspark and hivecontext.sql and I want to filter out all null and empty values from my data.
So I used simple SQL commands to first filter out the null values, but it doesn't work.
My code:
hiveContext.sql("select column1 from table where column2 is not null")
but it works without the expression "where column2 is not null".
Error:
Py4JJavaError: An error occurred while calling o577.showString
I think it is because my select is wrong.
Data example:
column 1 | column 2
null | 1
null | 2
1 | 3
2 | 4
null | 2
3 | 8
Objective:
column 1 | column 2
1 | 3
2 | 4
3 | 8
Tks
We cannot pass the Hive table name directly to the HiveContext sql method, since it doesn't understand the bare table name. One way to read a Hive table is by using the pyspark shell.
We need to register the DataFrame we get from reading the Hive table; then we can run the SQL query.
You have to give database_name.table and run the same query and it will work. Please let me know if that helps.
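As a rough sketch of that registration step (the view name my_table and the database_name.table reference are placeholders for your own names; sc is the SparkContext predefined in the pyspark shell):
from pyspark.sql import HiveContext

hiveContext = HiveContext(sc)  # sc is an existing SparkContext

# read the Hive table using database_name.table and register it as a temporary table
df = hiveContext.sql("select * from database_name.table")
df.registerTempTable("my_table")

# the null filter then runs against the registered table
hiveContext.sql("select column1 from my_table where column2 is not null").show()
On newer Spark versions the same idea is expressed with SparkSession.builder.enableHiveSupport() and df.createOrReplaceTempView instead of HiveContext and registerTempTable.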
It works for me:
df.na.drop(subset=["column1"])
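Since the question also asks to remove empty strings, not just nulls, a small extension of this (a sketch, assuming column1 is the column to clean) could be:
from pyspark.sql import functions as F

cleaned = (
    df.na.drop(subset=["column1"])               # drop real nulls
      .filter(F.trim(F.col("column1")) != "")    # drop empty or whitespace-only strings
)
cleaned.show()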
Have you entered null values manually?
If yes, then they will be treated as normal strings.
I tried the following two use cases:
dbname.person table in hive
name age
aaa null // this null was entered manually - case 1
Andy 30
Justin 19
okay NULL // this NULL appeared because the field was left blank - case 2
---------------------------------
hiveContext.sql("select * from dbname.person").show();
+------+----+
| name| age|
+------+----+
| aaa |null|
| Andy| 30|
|Justin| 19|
| okay|null|
+------+----+
-----------------------------
case 2
hiveContext.sql("select * from dbname.person where age is not null").show();
+------+----+
| name|age |
+------+----+
| aaa |null|
| Andy| 30 |
|Justin| 19 |
+------+----+
------------------------------------
case 1
hiveContext.sql("select * from dbname.person where age!= 'null'").show();
+------+----+
| name| age|
+------+----+
| Andy| 30|
|Justin| 19|
| okay|null|
+------+----+
------------------------------------
I hope the above use cases clear your doubts about filtering null values out.
And if you are querying a table registered in Spark, then use sqlContext.
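Putting the two cases together, a combined filter that drops both the real NULLs and the manually entered 'null' strings (a sketch against the dbname.person example above) might look like:
hiveContext.sql("""
    select name, age
    from dbname.person
    where age is not null      -- real NULLs (case 2)
      and age != 'null'        -- manually entered 'null' strings (case 1)
""").show()
If you also need to drop empty strings, as in the original question, adding and age != '' to the where clause follows the same pattern.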

SQL calculating stock per month

I have a specific task and don't know how to realize it. I hope someone can help me =)
I have stock_move table:
product_id |location_id |location_dest_id |product_qty |date_expected |
-----------|------------|-----------------|------------|--------------------|
327 |80 |84 |10 |2014-05-28 00:00:00 |
327 |80 |84 |10 |2014-05-23 00:00:00 |
327 |80 |84 |10 |2014-02-26 00:00:00 |
327 |80 |85 |10 |2014-02-21 00:00:00 |
327 |80 |84 |10 |2014-02-12 00:00:00 |
327 |84 |85 |20 |2014-02-06 00:00:00 |
322 |84 |80 |120 |2015-12-16 00:00:00 |
322 |80 |84 |30 |2015-12-10 00:00:00 |
322 |80 |84 |30 |2015-12-04 00:00:00 |
322 |80 |84 |15 |2015-11-26 00:00:00 |
i.e. it's a table of product moves from one warehouse to another.
I can calculate the stock at a custom date if I use something like this:
select
    coalesce(si.product_id, so.product_id) as "Product",
    (coalesce(si.stock, 0) - coalesce(so.stock, 0)) as "Stock"
from
    (
        select
            product_id
            ,sum(product_qty * price_unit) as stock
        from stock_move
        where
            location_dest_id = 80
            and date_expected < now()
        group by product_id
    ) as si
    full outer join (
        select
            product_id
            ,sum(product_qty * price_unit) as stock
        from stock_move
        where
            location_id = 80
            and date_expected < now()
        group by product_id
    ) as so
        on si.product_id = so.product_id
As a result I have the current stock:
Product |Stock |
--------|------|
325 |1058 |
313 |34862 |
304 |2364 |
BUT what do I do if I need the stock per month?
something like this?
Month |Total Stock |
--------|------------|
Jan |130238 |
Feb |348262 |
Mar |2323364 |
How can I sum the product qty from the start of the period to the end of each month?
My only idea is to use 24 subqueries to get the stock for each month (example below):
Jan |Feb | Mar |
----|----|-----|
123 |234 |345 |
And after this, rotate rows and columns?
I think this is stupid, but I don't know another way... Help me please =)
Something like this could give you monthly "ending" inventory snapshots. The trick is that your data may omit certain months for certain parts, but those parts still have a balance (i.e. 50 received in January, nothing happened in February, but you still want to show February with a running total of 50).
One way to handle this is to come up with all possible part/date combinations. I assumed 1/1/14 + 24 months in this example, but that's easily changed in the all_months subquery. For example, you may want to start with the minimum date from the stock_move table instead.
with all_months as (
    select '2014-01-01'::date + interval '1 month' * generate_series(0, 23) as month_begin
),
stock_calc as (
    select
        product_id, date_expected,
        date_trunc ('month', date_expected)::date as month_expected,
        case
            when location_id = 80 then -product_qty * price_unit
            when location_dest_id = 80 then product_qty * price_unit
            else 0
        end as qty
    from stock_move
    union all
    select distinct
        s.product_id, m.month_begin::date, m.month_begin::date, 0
    from
        stock_move s
        cross join all_months m
),
running_totals as (
    select
        product_id, date_expected, month_expected,
        sum (qty) over (partition by product_id order by date_expected) as end_qty,
        row_number() over (partition by product_id, month_expected
                           order by date_expected desc) as rn
    from stock_calc
)
select
    product_id, month_expected, end_qty
from running_totals
where
    rn = 1