Spark Pivot On Date - Scala

The original DataFrame looks like:
+--------------------+--------------------+--------------------+
| user_id| measurement_date| features|
+--------------------+--------------------+--------------------+
|b6d0bb3d-7a8e-4ac...|2016-06-28 02:00:...|[3492.68576170840...|
..
|048ffee9-a942-4d1...|2016-04-28 02:00:...|[1404.42230898422...|
|05101595-5a6f-4cd...|2016-07-10 02:00:...|[1898.50082132108...|
+--------------------+--------------------+--------------------+
My pivoting efforts:
data = data.select(data.col("user_id"),data.col("features"),data.col("measurement_date").cast(DateType).alias("date")).filter(data.col("measurement_date").between("2016-01-01", "2016-01-07"))
data = data.select(data.col("user_id"),data.col("features"),data.col("date")).groupBy("user_id","features").pivot("date").min()
The output is:
+--------------------+--------------------+
| user_id| features|
+--------------------+--------------------+
|14cd26dc-200a-436...|[2281.34579074947...|
..
|d8ae1b5e-c1e0-4bf...|[2568.49641198251...|
|1cceb175-12b4-4c3...|[4436.36029554227...|
+--------------------+--------------------+
The columns I want, 2016-01-01 through 2016-01-07, are missing; nothing was pivoted at all.
What am I doing wrong?
EDIT:
This is how the DataFrame looks after the first statement:
| user_id| features| date|
+--------------------+--------------------+----------+
|60f1cd63-0d5a-4f2...|[1553.35305181118...|2016-01-05|
|a56d1fef-5f17-4c9...|[1704.34897309186...|2016-01-02|
..
|992b6a34-803d-44b...|[1518.14292508305...|2016-01-05|
It might be noteworthy that (user_id, features) is not a complete time series; there are gaps in the data. Sometimes there are no measurements for certain dates, and in that case I want null as the entry.

You forgot the aggregation part, so your second line of code should be:
data = data.select(data.col("user_id"),data.col("features"),data.col("date")).groupBy("user_id","features").pivot("date").agg(min("date"))
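Regarding the null-for-missing-dates requirement from the edit: pivot already does that, since any (group, date) combination without a row comes out as null. A minimal sketch, assuming you actually want one row per user_id with the features value in each date cell (first is just an arbitrary pick of the single value per user and date), could look like this:
import org.apache.spark.sql.functions.first

// Listing the pivot values explicitly guarantees all seven date columns exist,
// even if no row in the filtered data falls on one of those dates.
val dates = Seq("2016-01-01", "2016-01-02", "2016-01-03", "2016-01-04",
                "2016-01-05", "2016-01-06", "2016-01-07")

val pivoted = data
  .groupBy("user_id")
  .pivot("date", dates)
  .agg(first("features"))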

Related

how to generate a bi-weekly column in pyspark

I was asked to aggregate my data by two-week periods. My data starts on Jun 1st, 2020, which happens to be a MONDAY, and from then on I need to aggregate it every two weeks.
I will show you with a simple dataset below:
+----+-----------+----------+
| id|day_revenue| ts_string|
+----+-----------+----------+
| 1| 10|2020-06-01|
| 1| 8|2020-06-04|
| 2| 10|2020-06-30|
|1081| 100|2020-07-07|
+----+-----------+----------+
I skipped a lot of data; this is only a sample.
My goal is to make this dataframe look like the one below:
+----+-----------+----------+-------------+---------------------+
|id |day_revenue|ts_string |bi_week_start|bi_week_full |
+----+-----------+----------+-------------+---------------------+
|1 |10 |2020-06-01|2020-06-01 |2020-06-01/2020-06-14|
|1 |8 |2020-06-04|2020-06-01 |2020-06-01/2020-06-14|
|2 |10 |2020-06-30|2020-06-29 |2020-06-29/2020-07-12|
|1081|100 |2020-07-07|2020-06-29 |2020-06-29/2020-07-12|
+----+-----------+----------+-------------+---------------------+
So, whatever you do, 2020-06-01 is a magic day, because we split every two weeks based on this day. Importantly, the bi_week_full column is not strictly necessary, but if you can create a new column in that form, it will be much more convenient for me.
There are two things to keep in mind:
My dataframe is still growing every single day, so I would like to find a universal function or general way to handle this kind of ad-hoc request.
My data starts on 2020-06-01 and has accumulated for more than two years; it is a big dataframe.
THANK YOU in advance.
You can create the sample df with the code below:
data_ls = [('1', '10', '2020-06-01'),
           ('1', '8', '2020-06-04'),
           ('2', '10', '2020-06-30'),
           ('1081', '100', '2020-07-07')]

data_sdf = spark.sparkContext.parallelize(data_ls).toDF(['id', 'day_revenue', 'ts_string'])
Use a combination of datediff() and date_add():
import pyspark.sql.functions as F
(
    data_sdf
    .withColumn('start', F.lit('2020-06-01'))
    .withColumn('n', F.floor(F.datediff('ts_string', 'start') / 14).cast('int'))
    .select(
        'id', 'day_revenue', 'ts_string',
        F.date_add('start', F.col('n') * 14).alias('bi_week_start'),
        F.date_add('start', (F.col('n') + 1) * 14 - 1).alias('bi_week_end'),
    )
    .withColumn('bi_week_full', F.concat_ws('/', 'bi_week_start', 'bi_week_end'))
    .show(truncate=False)
)
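To sanity-check the arithmetic against the expected output: 2020-06-30 is 29 days after the 2020-06-01 anchor, so n = floor(29 / 14) = 2, bi_week_start = date_add('2020-06-01', 2 * 14) = 2020-06-29 and bi_week_end = 2020-07-12, which matches the table above. One caveat: passing a Column as the day count to date_add requires a reasonably recent Spark release (older versions only accept a plain int there); if that is a concern, the same offsets can be computed through an equivalent SQL expression via F.expr.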

How to create a new sequential timestamp column in a CSV file using Spark

I have a sample CSV file with columns as shown below.
col1,col2
1,57.5
2,24.0
3,56.7
4,12.5
5,75.5
I want a new timestamp column in the HH:mm:ss format that increases by one second per row, as shown below.
col1,col2,ts
1,57.5,00:00:00
2,24.0,00:00:01
3,56.7,00:00:02
4,12.5,00:00:03
5,75.5,00:00:04
Thanks in advance for your help.
I can propose a solution based on PySpark; the Scala implementation should be almost identical.
My idea is to create a column filled with a fixed timestamp (1980 here as an example, the exact date does not matter) and add seconds based on your first column (the row number). Then you just reformat the timestamp to show only the time part.
import pyspark.sql.functions as psf
df = (df
      # 'yyyy' is the calendar-year pattern ('YYYY' is the week year and may be rejected by newer Spark parsers)
      .withColumn("ts", psf.unix_timestamp(timestamp=psf.lit('1980-01-01 00:00:00'), format='yyyy-MM-dd HH:mm:ss'))
      .withColumn("ts", psf.col("ts") + psf.col("i") - 1)
      .withColumn("ts", psf.from_unixtime("ts", format='HH:mm:ss'))
     )
df.show(2)
+---+----+---------+
| i| x| ts|
+---+----+---------+
| 1|57.5| 00:00:00|
| 2|24.0| 00:00:01|
+---+----+---------+
only showing top 2 rows
Data generation
df = spark.createDataFrame([(1, 57.5),
                            (2, 24.0),
                            (3, 56.7),
                            (4, 12.5),
                            (5, 75.5)], ['i', 'x'])
df.show(2)
+---+----+
| i| x|
+---+----+
| 1|57.5|
| 2|24.0|
+---+----+
only showing top 2 rows
Update: if you don't have a row number in your csv (from your comment)
In that case, you will need the row_number function.
Numbering rows in Spark is not straightforward because the data is distributed over independent partitions and locations, and the order observed in the CSV will not be respected by Spark when mapping file rows to partitions. If the order in the file matters, I think it is better not to use Spark to number the rows; a pre-processing step based on pandas, looping over your files one at a time, could do the job.
Anyway, here is a solution if you don't mind the row order being different from the one in the CSV stored on disk.
import pyspark.sql.window as psw
w = psw.Window.partitionBy().orderBy("x")
(df
 .drop("i")
 .withColumn("i", psf.row_number().over(w))
 .withColumn("Timestamp", psf.unix_timestamp(timestamp=psf.lit('1980-01-01 00:00:00'), format='yyyy-MM-dd HH:mm:ss'))
 .withColumn("Timestamp", psf.col("Timestamp") + psf.col("i") - 1)
 .withColumn("Timestamp", psf.from_unixtime("Timestamp", format='HH:mm:ss'))
 .show(2)
)
+----+---+---------+
| x| i|Timestamp|
+----+---+---------+
|12.5| 1| 00:00:00|
|24.0| 2| 00:00:01|
+----+---+---------+
only showing top 2 rows
In terms of efficiency this is bad (all the data ends up in a single partition, much like collecting everything in one place) because the window has no partitionBy; for this step, Spark is overkill.
You could also add a temporary constant column and order by that one instead. In this particular example it produces the expected output, but I am not sure it works well in general:
w2 = psw.Window.partitionBy().orderBy("temp")
(df
 .drop("i")
 .withColumn("temp", psf.lit(1))
 .withColumn("i", psf.row_number().over(w2))
 .withColumn("Timestamp", psf.unix_timestamp(timestamp=psf.lit('1980-01-01 00:00:00'), format='yyyy-MM-dd HH:mm:ss'))
 .withColumn("Timestamp", psf.col("Timestamp") + psf.col("i") - 1)
 .withColumn("Timestamp", psf.from_unixtime("Timestamp", format='HH:mm:ss'))
 .show(2)
)
+----+----+---+---------+
| x|temp| i|Timestamp|
+----+----+---+---------+
|57.5| 1| 1| 00:00:00|
|24.0| 1| 2| 00:00:01|
+----+----+---+---------+
only showing top 2 rows

How to transform DataFrame before joining operation?

The following code is used to extract ranks from the column products. The ranks are the second numbers in each pair [...]. For example, given [[222,66],[333,55]], the ranks are 66 and 55 for the products with PK 222 and 333, respectively.
But the code in Spark 2.2 works very slowly when df_products is around 800 MB:
df_products.createOrReplaceTempView("df_products")
val result = df.as("df2")
  .join(spark.sql("SELECT * FROM df_products")
          .select($"product_PK", explode($"products").as("products"))
          .withColumnRenamed("product_PK", "product_PK_temp")
          .as("df1"),
        $"df2.product_PK" === $"df1.product_PK_temp" and $"df2.rec_product_PK" === $"df1.products.product_PK",
        "left")
  .drop($"df1.product_PK_temp")
  .select($"product_PK", $"rec_product_PK", coalesce($"df1.products.col2", lit(0.0)).as("rank_product"))
This is a small sample of df_products and df:
df_products =
+----------+--------------------+
|product_PK| products|
+----------+--------------------+
| 111|[[222,66],[333,55...|
| 222|[[333,24],[444,77...|
...
+----------+--------------------+
df =
+----------+-----------------+
|product_PK| rec_product_PK|
+----------+-----------------+
| 111| 222|
| 222| 888|
+----------+-----------------+
The code above works well when the arrays in each row of products contain a small number of elements, but when the arrays hold many elements [[..],[..],...], the code seems to get stuck and does not advance.
How can I optimize the code? Any help is highly appreciated.
Is it possible, for example, to transform df_products into the following DataFrame before joining?
df_products =
+----------+--------------------+------+
|product_PK| rec_product_PK| rank|
+----------+--------------------+------+
| 111| 222| 66|
| 111| 333| 55|
| 222| 333| 24|
| 222| 444| 77|
...
+----------+--------------------+------+
As per my answer here, you can transform df_products using something like this:
import org.apache.spark.sql.functions.explode
val df1 = df_products.withColumn("array_elem", explode(df_products("products")))
val df2 = df1.select("product_PK", "array_elem.*")
This assumes products is an array of structs. If products is an array of arrays, you can use the following instead:
val df2 = df1.withColumn("rank", df1("array_elem").getItem(1))
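To get the flattened shape asked for above before joining, a sketch along these lines should work, assuming each element of products is a struct with the fields product_PK and col2 (which is what the original join condition implies):
import org.apache.spark.sql.functions.{coalesce, explode, lit}

// Explode the array and pull the struct fields into plain columns.
val dfRanks = df_products
  .select($"product_PK", explode($"products").as("elem"))
  .select(
    $"product_PK",
    $"elem.product_PK".as("rec_product_PK"),
    $"elem.col2".as("rank"))

// The expensive join condition then becomes a plain two-column equi-join.
val result = df
  .join(dfRanks, Seq("product_PK", "rec_product_PK"), "left")
  .withColumn("rank_product", coalesce($"rank", lit(0.0)))
  .drop("rank")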

Spark SQL DataFrame API - build filter condition dynamically

I have two Spark dataframes, df1 and df2:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
|shankar|12121| 28|
| ramesh| 1212| 29|
| suresh| 1111| 30|
| aarush| 0707| 15|
+-------+-----+---+
+------+-----+---+-----+
| eName| eNo|age| city|
+------+-----+---+-----+
|aarush|12121| 15|malmo|
|ramesh| 1212| 29|malmo|
+------+-----+---+-----+
I need to get the non-matching records from df1, based on a number of columns specified in another file.
For example, the column lookup file is something like this:
df1col,df2col
name,eName
empNo, eNo
Expected output is:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
|shankar|12121| 28|
| suresh| 1111| 30|
| aarush| 0707| 15|
+-------+-----+---+
The question is how to build a where condition dynamically for the above scenario, because the lookup file is configurable and might have 1 to n fields.
You can use the except DataFrame method. For simplicity, I'm assuming the columns to use are given in two lists. The order of both lists must match: columns at the same position in the lists are compared, regardless of column name. After except, use a join to get the missing columns back from the first dataframe.
import org.apache.spark.sql.functions.broadcast
import spark.implicits._

val df1 = Seq(("shankar", "12121", 28), ("ramesh", "1212", 29), ("suresh", "1111", 30), ("aarush", "0707", 15))
  .toDF("name", "empNo", "age")
val df2 = Seq(("aarush", "12121", 15, "malmo"), ("ramesh", "1212", 29, "malmo"))
  .toDF("eName", "eNo", "age", "city")

val df1Cols = List("name", "empNo")
val df2Cols = List("eName", "eNo")

val tempDf = df1.select(df1Cols.head, df1Cols.tail: _*)
  .except(df2.select(df2Cols.head, df2Cols.tail: _*))

val df = df1.join(broadcast(tempDf), df1Cols)
The resulting dataframe will look as wanted:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
| aarush| 0707| 15|
| suresh| 1111| 30|
|shankar|12121| 28|
+-------+-----+---+
If you're doing this from a SQL query, I would remap the column names in the SQL query itself, with something like "Changing a SQL column title via query"; you could do a simple text replace in the query to normalize them to the df1 or df2 column names.
Once you have that, you can diff using something like "How to obtain the difference between two DataFrames?".
If you need more columns that wouldn't be used in the diff (e.g. age), you can reselect the data again based on your diff results. This may not be the optimal way of doing it, but it would probably work.
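To tie this back to the "build the condition dynamically" part: once the lookup file has been parsed into the two column lists used in the first answer, one alternative sketch is to fold the column pairs into a join condition and use a left_anti join, which keeps all df1 columns without a second join:
// Pair up the configured columns and AND the equality conditions together.
val joinCond = df1Cols.zip(df2Cols)
  .map { case (c1, c2) => df1(c1) === df2(c2) }
  .reduce(_ && _)

// left_anti returns the df1 rows that have no match in df2 on those columns.
val nonMatching = df1.join(df2, joinCond, "left_anti")
nonMatching.show()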

How to compare two dataframe and print columns that are different in scala

We have two data frames here:
the expected dataframe:
+------+---------+--------+----------+-------+--------+
|emp_id| emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+---------+--------+----------+-------+--------+
| 3| Chennai| rahman|9848022330| 45000|SanRamon|
| 1|Hyderabad| ram|9848022338| 50000| SF|
| 2|Hyderabad| robin|9848022339| 40000| LA|
| 4| sanjose| romin|9848022331| 45123|SanRamon|
+------+---------+--------+----------+-------+--------+
and the actual data frame:
+------+---------+--------+----------+-------+--------+
|emp_id| emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+---------+--------+----------+-------+--------+
| 3| Chennai| rahman|9848022330| 45000|SanRamon|
| 1|Hyderabad| ram|9848022338| 50000| SF|
| 2|Hyderabad| robin|9848022339| 40000| LA|
| 4| sanjose| romino|9848022331| 45123|SanRamon|
+------+---------+--------+----------+-------+--------+
the difference between the two dataframes now is:
+------+--------+--------+----------+-------+--------+
|emp_id|emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+--------+--------+----------+-------+--------+
| 4| sanjose| romino|9848022331| 45123|SanRamon|
+------+--------+--------+----------+-------+--------+
We are using the except function, df1.except(df2); however, the problem with this is that it returns the entire rows that are different. What we want is to see which columns differ within that row (in this case, "romin" and "romino" in "emp_name"). We have been having tremendous difficulty with this and any help would be great.
From the scenario described in the question above, it looks like the difference has to be found between columns, not rows.
To do that, we need to apply a selective difference, which will give us the columns that have different values, along with those values.
To apply the selective difference, we have to write code along these lines.
First we need to find the columns of the expected and actual data frames.
val columns = df1.schema.fields.map(_.name)
Then we have to find the difference columnwise.
val selectiveDifferences = columns.map(col => df1.select(col).except(df2.select(col)))
Finally, we need to find out which columns contain different values:
selectiveDifferences.map(diff => {if(diff.count > 0) diff.show})
And, we will get only the columns that contain different values. Like this:
+--------+
|emp_name|
+--------+
| romino|
+--------+
I hope this helps!
list_col = []
cols = df1.columns

# Prepare a list of per-column difference dataframes
for col in cols:
    list_col.append(df1.select(col).subtract(df2.select(col)))

# Render/persist
for l in list_col:
    if l.count() > 0:
        l.show()
The spark-extension library has an API for this: Diff. I believe you can use it like this:
left.diff(right).show()
Or supply emp_id as an id column, like this:
left.diff(right, "emp_id").show()
This API is available for Spark 2.4.x - 3.x.
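For completeness, the snippet above needs the library's implicit conversions in scope. From memory (double-check the spark-extension README for the exact import and artifact coordinates for your Spark/Scala version), the wiring looks roughly like this:
// Assumed import path - verify against the spark-extension documentation.
import uk.co.gresearch.spark.diff._

left.diff(right, "emp_id").show()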