Finding difference in timestamps when some of them are null in PySpark

I have two timestamps columns in a pyspark dataframe like so:
+--------------------+--------------------+
| TIME_STAMP| TIME_STAMP2|
+--------------------+--------------------+
|2020-01-03 12:58:...| null|
|2020-01-03 12:59:...| null|
|2020-01-03 13:01:...| null|
|2020-01-03 13:02:...| null|
|2020-01-03 13:04:...| null|
|2020-01-03 13:05:...| null|
|2020-01-03 13:07:...| null|
|2020-01-03 13:08:...|2020-01-03 12:58:...|
|2020-01-03 13:10:...|2020-01-03 12:59:...|
|2020-01-03 13:11:...|2020-01-03 13:01:...|
|2020-01-03 13:13:...|2020-01-03 13:02:...|
|2020-01-03 13:14:...|2020-01-03 13:04:...|
|2020-01-03 13:16:...|2020-01-03 13:05:...|
|2020-01-03 13:17:...|2020-01-03 13:07:...|
|2020-01-03 13:19:...|2020-01-03 13:08:...|
|2020-01-03 13:20:...|2020-01-03 13:10:...|
|2020-01-03 13:22:...|2020-01-03 13:11:...|
|2020-01-03 13:23:...|2020-01-03 13:13:...|
|2020-01-03 13:24:...|2020-01-03 13:14:...|
|2020-01-03 13:26:...|2020-01-03 13:16:...|
+--------------------+--------------------+
I would like to find the difference; however, if one of the values is null, I am getting an error.
Is there a way to circumvent this?
This is the error I am getting:
An error was encountered: "cannot resolve '(TIME_STAMP -
TIME_STAMP2)' due to data type mismatch: '(TIME_STAMP -
TIME_STAMP2)' requires (numeric or calendarinterval) type, not
timestamp;;

You can cast the timestamp values to long and subtract them. You'll get the difference in seconds:
from pyspark.sql import functions as f
df = df.withColumn('diff_in_seconds', f.col('TIME_STAMP').cast('long') - f.col('TIME_STAMP2').cast('long'))
df.show(10, False)
Note that if either of the values is null, the result will be null as well.
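If you would rather get a default value than a null when one side is missing, one option is to wrap the expression in coalesce. This is a minimal sketch; the placeholder 0 is my own assumption, so pick whatever value suits your data:
from pyspark.sql import functions as f

# Fall back to 0 seconds when either timestamp is null
df = df.withColumn(
    'diff_in_seconds',
    f.coalesce(f.col('TIME_STAMP').cast('long') - f.col('TIME_STAMP2').cast('long'), f.lit(0))
)
df.show(10, False)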

Related

How to use dense_rank and rangeBetween on timestamp value?

In PySpark, I am trying to use dense_rank() to group rows into the same group based on the userId and the time value.
Here is my initial dataframe:
+--------------------+--------------------+--------------------+
| userId| BeginTime| EndTime|
+--------------------+--------------------+--------------------+
| A|2021-02-09 15:56:...|2021-02-09 15:56:...|
| A|2021-02-09 15:57:...|2021-02-09 15:57:...|
| A|2021-02-09 15:58:...|2021-02-09 15:58:...|
| B|2021-02-05 13:16:...|2021-02-05 13:16:...|
| B|2021-02-05 13:16:...|2021-02-05 13:16:...|
| B|2021-02-05 18:27:...|2021-02-05 18:37:...|
+--------------------+--------------------+--------------------+
One row represents one action made by one user and gives the start and end time of that action. I want to gather actions that were made in succession: if the duration between two actions is more than 1 hour, I consider that these two actions were not made in succession.
So here is what I expect:
+--------------------+--------------------+--------------------+---------+
| userId| BeginTime| EndTime| sequence|
+--------------------+--------------------+--------------------+---------+
| A|2021-02-09 15:56:...|2021-02-09 15:56:...| 1|
| A|2021-02-09 15:57:...|2021-02-09 15:57:...| 1|
| A|2021-02-09 15:58:...|2021-02-09 15:58:...| 1|
| B|2021-02-05 13:16:...|2021-02-05 13:16:...| 1|
| B|2021-02-05 13:16:...|2021-02-05 13:16:...| 1|
| B|2021-02-05 18:27:...|2021-02-05 18:37:...| 2|
+--------------------+--------------------+--------------------+---------+
I tried to use dense_rank() and rangeBetween in my Window like this:
w_rank = (Window
    .partitionBy("userId")
    .orderBy(col("BeginTime").cast("timestamp").cast("long"))
    .rangeBetween(0, 3600))
df = df.withColumn('sequence', dense_rank().over(w_rank))
But I get this error:
AnalysisException : Window Frame specifiedwindowframe(RangeFrame, currentrow$(), 3600) must match the required frame specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$());
I am quite new to PySpark, so if anyone could help me on this one I'd be very grateful. Thanks in advance!
I managed to find something that works for my case; I am posting my answer here in case it helps someone:
import sys
from pyspark.sql import Window
from pyspark.sql.functions import col, lag, dense_rank, when, last

w = Window.partitionBy('userId').orderBy(col('BeginTime'))
df = (df.withColumn('duration_between_series', col('BeginTime').cast('long') - lag(col('EndTime').cast('long')).over(w))
    .withColumn('rank', dense_rank().over(w))
    .withColumn('sequence_temp', when(col('rank') == 1, 1).when(col('duration_between_series') > 3600, col('rank')).otherwise(None))
    .withColumn('sequence', last('sequence_temp', True).over(w.rowsBetween(-sys.maxsize, 0)))
    .drop('sequence_temp', 'duration_between_series'))
Output:
+--------------------+--------------------+--------------------+---------+---------+
| userId| BeginTime| EndTime| rank| sequence|
+--------------------+--------------------+--------------------+---------+---------+
| A|2021-02-09 15:56:...|2021-02-09 15:56:...| 1| 1|
| A|2021-02-09 15:57:...|2021-02-09 15:57:...| 2| 1|
| A|2021-02-09 15:58:...|2021-02-09 15:58:...| 3| 1|
| B|2021-02-05 13:16:...|2021-02-05 13:16:...| 1| 1|
| B|2021-02-05 13:16:...|2021-02-05 13:16:...| 2| 1|
| B|2021-02-05 18:27:...|2021-02-05 18:37:...| 3| 3|
+--------------------+--------------------+--------------------+---------+---------+
The sequence column is not exactly what I expected, but as long as I have different values for each group, I am fine with that :)
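If a strictly consecutive numbering (1, 2, 3, ...) per user were needed, one possible follow-up (a sketch on top of the dataframe above, not part of the original answer) is to re-rank the sequence values with another dense_rank:
from pyspark.sql import Window
from pyspark.sql.functions import col, dense_rank

# Re-number the existing 'sequence' values so each user's groups become 1, 2, 3, ...
w_seq = Window.partitionBy('userId').orderBy(col('sequence'))
df = df.withColumn('sequence', dense_rank().over(w_seq))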

How do I count the occurrences of a string in a PySpark dataframe column?

Suppose I have the following PySpark Dataframe:
+---+------+-------+-----------------+
|age|height| name| friends |
+---+------+-------+-----------------+
| 10| 80| Alice| 'Grace, Sarah'|
| 15| null| Bob| 'Sarah'|
| 12| null| Tom|'Amy, Sarah, Bob'|
| 13| null| Rachel| 'Tom, Bob'|
+---+------+-------+-----------------+
How do I count the number of people who have 'Sarah' as a friend without creating another column?
I have tried df.friends.apply(lambda x: x[x.str.contains('Sarah')].count()) but got TypeError: 'Column' object is not callable
You can try the following code:
from pyspark.sql.functions import lit

df = df.withColumn('sarah', lit('Sarah'))
df.filter(df['friends'].contains(df['sarah'])).count()
Thanks pault
df.where(df.friends.like('%Sarah%')).count()
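Note that like('%Sarah%') also matches names that merely contain "Sarah". If an exact match on the name is needed, a hedged alternative (assuming friends is a plain comma-separated string) is to split it and use array_contains:
from pyspark.sql import functions as f

# Split on commas (plus optional whitespace) and look for the exact element 'Sarah'
df.filter(f.array_contains(f.split(f.col('friends'), r',\s*'), 'Sarah')).count()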

How do I filter bad or corrupted rows from a Spark dataframe after casting

df1
+-------+-------+-----+
| ID | Score| hits|
+-------+-------+-----+
| 01| 100| Null|
| 02| Null| 80|
| 03| spark| 1|
| 04| 300| 1|
+-------+-------+-----+
After casting Score to int and hits to float I get the below dataframe:
df2
+-------+-------+-----+
| ID | Score| hits|
+-------+-------+-----+
| 01| 100| Null|
| 02| Null| 80.0|
| 03| Null| 1.0|
| 04| 300| 1.0|
+-------+-------+-----+
Now I want to extract only the bad records; by bad records I mean those where a null was produced by the casting.
I want to do the operations only on the existing dataframe. Please help me out if there is any built-in way to get the bad records after casting.
Please also consider that this is a sample dataframe. The solution should work for any number of columns and any scenario.
I tried separating the null records from both dataframes and comparing them. I have also thought of adding another column with the number of nulls and then comparing the two dataframes: if the number of nulls is greater in df2 than in df1, those are the bad ones. But I think these solutions are pretty old school.
I would like to know a better way to resolve it.
You can use a custom function/UDF to convert the string to an integer and map non-integer values to a specific number, e.g. -999999999.
Later you can filter on -999999999 to identify the originally non-integer records.
def udfInt(value):
    if value is None:
        return None
    elif value.isdigit():
        return int(value)
    else:
        return -999999999

spark.udf.register('udfInt', udfInt)

df.selectExpr("*", "udfInt(Score) AS new_Score").show()
#+---+-----+----+----------+
#| ID|Score|hits| new_Score|
#+---+-----+----+----------+
#| 01| 100|null| 100|
#| 02| null| 80| null|
#| 03|spark| 1|-999999999|
#| 04| 300| 1| 300|
#+---+-----+----+----------+
Filter on -999999999 to identify the non-integer (bad) records:
df.selectExpr("*","udfInt(Score) AS new_Score").filter("new_score == -999999999").show()
#+---+-----+----+----------+
#| ID|Score|hits| new_Score|
#+---+-----+----+----------+
#| 03|spark| 1|-999999999|
#+---+-----+----+----------+
In the same way you can have a customized UDF for the float conversion.
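For example, a sketch of that float version (udfFloat and the -999999999.0 sentinel are names I am introducing here, not part of the original answer):
from pyspark.sql.types import FloatType

def udfFloat(value):
    # Map non-numeric strings to a sentinel so they can be filtered out afterwards
    if value is None:
        return None
    try:
        return float(value)
    except ValueError:
        return -999999999.0

spark.udf.register('udfFloat', udfFloat, FloatType())
df.selectExpr("*", "udfFloat(hits) AS new_hits").filter("new_hits == -999999999.0").show()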

Reading a dataframe after converting to csv file renders incorrect dataframe in Scala

I am trying to write the below dataframe into a csv file:
df:
+--------------------+-------------------------+----------------------------+------------------------------+----------------+-----+--------------------+--------------------+--------+-----+------------+
| title|UserData.UserValue._title|UserData.UserValue._valueRef|UserData.UserValue._valuegiven|UserData._idUser| _id| author| description| genre|price|publish_date|
+--------------------+-------------------------+----------------------------+------------------------------+----------------+-----+--------------------+--------------------+--------+-----+------------+
|XML Developer's G...| _CONFIG_CONTEXT| #id13| qwe| 18|bk101|Gambardella, Matthew|An in-depth look ...|Computer|44.95| 2000-10-01|
| Midnight Rain| _CONFIG_CONTEXT| #id13| dfdfrt| 19|bk102| Ralls, Kim|A former architec...| Fantasy| 5.95| 2000-12-16|
| Maeve Ascendant| _CONFIG_CONTEXT| #id13| dfdf| 20|bk103| Corets, Eva|After the collaps...| Fantasy| 5.95| 2000-11-17|
+--------------------+-------------------------+----------------------------+------------------------------+----------------+-----+--------------------+--------------------+--------+-----+------------+
I am using this code to write to a csv file:
df.write.format("com.databricks.spark.csv").option("header", "true").save("hdfsOut")
Using this, it creates 3 different csv files in the folder hdfsOut. When I try to read that dataframe back using
var csvdf = spark.read.format("org.apache.spark.csv").option("header", true).csv("hdfsOut")
csvdf.show()
it displays the dataframe in an incorrect form, like this:
+--------------------+-------------------------+----------------------------+------------------------------+----------------+-----+--------------------+--------------------+-----+-----+------------+
| title|UserData.UserValue._title|UserData.UserValue._valueRef|UserData.UserValue._valuegiven|UserData._idUser| _id| author| description|genre|price|publish_date|
+--------------------+-------------------------+----------------------------+------------------------------+----------------+-----+--------------------+--------------------+-----+-----+------------+
| Maeve Ascendant| _CONFIG_CONTEXT| #id13| dfdf| 20|bk103| Corets, Eva|After the collaps...| null| null| null|
| society in ...| the young surviv...| null| null| null| null| null| null| null| null| null|
| foundation ...| Fantasy| 5.95| 2000-11-17| null| null| null| null| null| null| null|
| Midnight Rain| _CONFIG_CONTEXT| #id13| dfdfrt| 19|bk102| Ralls, Kim|A former architec...| null| null| null|
| an evil sor...| and her own chil...| null| null| null| null| null| null| null| null| null|
| of the world."| Fantasy| 5.95| 2000-12-16| null| null| null| null| null| null| null|
|XML Developer's G...| _CONFIG_CONTEXT| #id13| qwe| 18|bk101|Gambardella, Matthew|An in-depth look ...| null| null| null|
| with XML...| Computer| 44.95| 2000-10-01| null| null| null| null| null| null| null|
+--------------------+-------------------------+----------------------------+------------------------------+----------------+-----+--------------------+--------------------+-----+-----+------------+
I need this csv file in order to feed it to Amazon Athena. When I do this, Athena also renders the data in the same format as shown in the second output. Ideally, it should show me only 3 rows after reading the converted csv file back.
Any idea why this is happening and how I can resolve this issue so the csv data is rendered in its correct form, as shown in the first output?
Your data in the description column must contain newline characters and commas, as below:
"After the collapse of a nanotechnology \nsociety in England, the young survivors lay the \nfoundation for a new society"
So for test purposes I created a dataframe as follows:
val df = Seq(
("Maeve Ascendant", "_CONFIG_CONTEXT", "#id13", "dfdf", "20", "bk103", "Corets, Eva", "After the collapse of a nanotechnology \nsociety in England, the young survivors lay the \nfoundation for a new society", "Fantasy", "5.95", "2000-11-17")
).toDF("title", "UserData.UserValue._title", "UserData.UserValue._valueRef", "UserData.UserValue._valuegiven", "UserData._idUser", "_id", "author", "description", "genre", "price", "publish_date")
df.show() showed me the same dataframe format as you have in your question:
+---------------+-------------------------+----------------------------+------------------------------+----------------+-----+-----------+--------------------+-------+-----+------------+
| title|UserData.UserValue._title|UserData.UserValue._valueRef|UserData.UserValue._valuegiven|UserData._idUser| _id| author| description| genre|price|publish_date|
+---------------+-------------------------+----------------------------+------------------------------+----------------+-----+-----------+--------------------+-------+-----+------------+
|Maeve Ascendant| _CONFIG_CONTEXT| #id13| dfdf| 20|bk103|Corets, Eva|After the collaps...|Fantasy| 5.95| 2000-11-17|
+---------------+-------------------------+----------------------------+------------------------------+----------------+-----+-----------+--------------------+-------+-----+------------+
But df.show(false) gave the exact values as
+---------------+-------------------------+----------------------------+------------------------------+----------------+-----+-----------+---------------------------------------------------------------------------------------------------------------------+-------+-----+------------+
|title |UserData.UserValue._title|UserData.UserValue._valueRef|UserData.UserValue._valuegiven|UserData._idUser|_id |author |description |genre |price|publish_date|
+---------------+-------------------------+----------------------------+------------------------------+----------------+-----+-----------+---------------------------------------------------------------------------------------------------------------------+-------+-----+------------+
|Maeve Ascendant|_CONFIG_CONTEXT |#id13 |dfdf |20 |bk103|Corets, Eva|After the collapse of a nanotechnology
society in England, the young survivors lay the
foundation for a new society|Fantasy|5.95 |2000-11-17 |
+---------------+-------------------------+----------------------------+------------------------------+----------------+-----+-----------+---------------------------------------------------------------------------------------------------------------------+-------+-----+------------+
When you save it as csv, Spark writes it as a plain text file, so the line feeds and commas are treated as ordinary csv text. And in csv format, a line feed starts a new row and a comma starts a new field. That is the culprit in your data.
Solution 1
You can save the dataframe in parquet format, as parquet preserves the properties of the dataframe, and read it back as parquet:
df.write.parquet("hdfsOut")
var csvdf = spark.read.parquet("hdfsOut")
Solution 2
Save it in csv format and use the multiLine option while reading it back:
df.write.format("com.databricks.spark.csv").option("header", "true").save("hdfsOut")
var csvdf = spark.read.format("org.apache.spark.csv").option("multiLine", "true").option("header", true).csv("hdfsOut")
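For anyone doing the read side in PySpark instead of Scala, the same multiLine option applies (a sketch, assuming the same hdfsOut path):
csvdf = (spark.read
    .option("multiLine", "true")
    .option("header", "true")
    .csv("hdfsOut"))
csvdf.show()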
I hope the answer is helpful

Remove Nulls in specific Rows in Dataframe and combine rows

I need to do the below activity on Spark dataframes using Scala.
I have tried some basic filters, isNotNull conditions and others, but no luck.
Input
+----------+----------+----------+
| Amber| Green| Red|
+----------+----------+----------+
| null| null|[AE,AA,CV]|
| null|[AH,EE,CC]| null|
|[DD,DE,QQ]| null| null|
+----------+----------+----------+
Output
+----------+----------+----------+
| Amber| Green| Red|
+----------+----------+----------+
|[DD,DE,QQ]|[AH,EE,CC]|[AE,AA,CV]|
+----------+----------+----------+
If the input dataframe is limited to only
+----------+----------+----------+
| Amber| Green| Red|
+----------+----------+----------+
| null| null|[AE,AA,CV]|
| null|[AH,EE,CC]| null|
|[DD,DE,QQ]| null| null|
+----------+----------+----------+
Then doing the following should get you the desired final dataframe
import org.apache.spark.sql.functions._
df.select(collect_list("Amber")(0).as("Amber"), collect_list("Green")(0).as("Green"), collect_list("Red")(0).as("Red")).show(false)
You should be getting
+------------+------------+------------+
|Amber |Green |Red |
+------------+------------+------------+
|[DD, DE, QQ]|[AH, EE, CC]|[AE, AA, CV]|
+------------+------------+------------+
The collect_list built-in function ignores null values.
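For reference, a PySpark sketch of the same idea (assuming the same column names); first() with ignorenulls=True likewise skips the nulls in each column:
from pyspark.sql.functions import first

df.select(
    first('Amber', ignorenulls=True).alias('Amber'),
    first('Green', ignorenulls=True).alias('Green'),
    first('Red', ignorenulls=True).alias('Red')
).show(truncate=False)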