Relationalize json deep nested array - pyspark

I have the following catalog and want to use AWS Glue to flatten it:
| accountId | resourceId | items |
|-----------|------------|-----------------------------------------------------------------|
| 1 | r1 | {application:{component:[{name: "tool", version: "1.0"}, {name: "app", version: "1.0"}]}} |
| 1 | r2 | {application:{component:[{name: "tool", version: "2.0"}, {name: "app", version: "2.0"}]}} |
| 2 | r3 | {application:{component:[{name: "tool", version: "3.0"}, {name: "app", version: "3.0"}]}} |
Here is my schema
root
|-- accountId:
|-- resourceId:
|-- PeriodId:
|-- items:
| |-- application:
| | |-- component: array
I want to flatten it to the following:
| accountId | resourceId | name | version |
|-----------|------------|------|---------|
| 1 | r1 | tool | 1.0 |
| 1 | r1 | app | 1.0 |
| 1 | r2 | tool | 2.0 |
| 1 | r2 | app | 2.0 |
| 2 | r3 | tool | 3.0 |
| 2 | r3 | app | 3.0 |

From what I understand of your schema and data, items is a deeply nested structure, so you can explode items.application.component and then select the name and version fields from the result.
This link might help you understand: https://docs.databricks.com/spark/latest/dataframes-datasets/complex-nested-data.html
from pyspark.sql import functions as F
df.withColumn("items", F.explode(F.col("items.application.component")))\
.select("accountId","resourceId","items.name","items.version").show()
+---------+----------+----+-------+
|accountId|resourceId|name|version|
+---------+----------+----+-------+
| 1| r1|tool| 1.0|
| 1| r1| app| 1.0|
| 1| r2|tool| 2.0|
| 1| r2| app| 2.0|
| 2| r3|tool| 3.0|
| 2| r3| app| 3.0|
+---------+----------+----+-------+
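If you want to try this outside of Glue first, here is a minimal sketch that rebuilds the nested sample data from JSON strings (in a Glue job you would typically convert the DynamicFrame with .toDF() instead; the from_json parsing below exists only to reproduce the question's schema):
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

spark = SparkSession.builder.getOrCreate()

# the catalog rows re-created as JSON strings (values taken from the question)
rows = [
    (1, "r1", '{"application": {"component": [{"name": "tool", "version": "1.0"}, {"name": "app", "version": "1.0"}]}}'),
    (1, "r2", '{"application": {"component": [{"name": "tool", "version": "2.0"}, {"name": "app", "version": "2.0"}]}}'),
    (2, "r3", '{"application": {"component": [{"name": "tool", "version": "3.0"}, {"name": "app", "version": "3.0"}]}}'),
]
raw = spark.createDataFrame(rows, ["accountId", "resourceId", "items_json"])

# parse the JSON column into the nested struct/array schema, then explode as above
component = StructType([StructField("name", StringType()), StructField("version", StringType())])
items_schema = StructType([StructField("application", StructType([StructField("component", ArrayType(component))]))])

df = raw.withColumn("items", F.from_json("items_json", items_schema)).drop("items_json")
df.withColumn("items", F.explode("items.application.component")) \
  .select("accountId", "resourceId", "items.name", "items.version") \
  .show()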

pyspark dataframe check if string contains substring

I need help implementing the Python (pandas) logic below in a PySpark dataframe.
Python:
df1['isRT'] = df1['main_string'].str.lower().str.contains('|'.join(df2['sub_string'].str.lower()))
df1.show()
+--------+---------------------------+
|id | main_string |
+--------+---------------------------+
| 1 | i am a boy |
| 2 | i am from london |
| 3 | big data hadoop |
| 4 | always be happy |
| 5 | software and hardware |
+--------+---------------------------+
df2.show()
+--------+---------------------------+
|id | sub_string |
+--------+---------------------------+
| 1 | happy |
| 2 | xxxx |
| 3 | i am a boy |
| 4 | yyyy |
| 5 | from london |
+--------+---------------------------+
Final Output:
df1.show()
+--------+---------------------------+--------+
|id | main_string | isRT |
+--------+---------------------------+--------+
| 1 | i am a boy | True |
| 2 | i am from london | True |
| 3 | big data hadoop | False |
| 4 | always be happy | True |
| 5 | software and hardware | False |
+--------+---------------------------+--------+
First construct the substring list substr_list, and then use the rlike function to generate the isRT column.
df3 = df2.select(F.expr('collect_list(lower(sub_string))').alias('substr'))
substr_list = '|'.join(df3.first()[0])
df = df1.withColumn('isRT', F.expr(f'lower(main_string) rlike "{substr_list}"'))
df.show(truncate=False)
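One caveat worth hedging: rlike treats the joined string as a regular expression, so any sub_string containing regex metacharacters (., *, (, and so on) would not be matched literally. A sketch of the same idea with re.escape applied to each substring, using Column.rlike so the pattern is not re-parsed inside a SQL string literal:
import re
from pyspark.sql import functions as F

# collect the lowercased substrings and escape regex metacharacters so each one matches literally
substrings = [row[0] for row in df2.select(F.lower('sub_string')).collect()]
substr_list = '|'.join(re.escape(s) for s in substrings)

df = df1.withColumn('isRT', F.lower(F.col('main_string')).rlike(substr_list))
df.show(truncate=False)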
For your two dataframes,
df1 = spark.createDataFrame(['i am a boy', 'i am from london', 'big data hadoop', 'always be happy', 'software and hardware'], 'string').toDF('main_string')
df1.show(truncate=False)
df2 = spark.createDataFrame(['happy', 'xxxx', 'i am a boy', 'yyyy', 'from london'], 'string').toDF('sub_string')
df2.show(truncate=False)
+---------------------+
|main_string |
+---------------------+
|i am a boy |
|i am from london |
|big data hadoop |
|always be happy |
|software and hardware|
+---------------------+
+-----------+
|sub_string |
+-----------+
|happy |
|xxxx |
|i am a boy |
|yyyy |
|from london|
+-----------+
you can get the following result with a simple join expression.
from pyspark.sql import functions as f
df1.join(df2, f.col('main_string').contains(f.col('sub_string')), 'left') \
   .withColumn('isRT', f.expr('if(sub_string is null, False, True)')) \
   .drop('sub_string') \
   .show()
+--------------------+-----+
| main_string| isRT|
+--------------------+-----+
| i am a boy| true|
| i am from london| true|
| big data hadoop|false|
| always be happy| true|
|software and hard...|false|
+--------------------+-----+
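Note that the original pandas logic lowercases both sides, while the join condition above is case-sensitive. If that matters for your data, a hedged variant of the same join with lower() applied to both columns (and isNotNull instead of the if expression):
from pyspark.sql import functions as f

df1.join(df2, f.lower(f.col('main_string')).contains(f.lower(f.col('sub_string'))), 'left') \
   .withColumn('isRT', f.col('sub_string').isNotNull()) \
   .drop('sub_string') \
   .show()
Keep in mind that if several substrings match the same main_string, the left join duplicates that row, so a dropDuplicates(['main_string']) may be needed afterwards.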

PySpark Windows Function with Conditional Reset

I have a dataframe like this
| user_id | activity_date |
| -------- | ------------ |
| 49630701 | 1/1/2019 |
| 49630701 | 1/10/2019 |
| 49630701 | 1/28/2019 |
| 49630701 | 2/5/2019 |
| 49630701 | 3/10/2019 |
| 49630701 | 3/21/2019 |
| 49630701 | 5/25/2019 |
| 49630701 | 5/28/2019 |
| 49630701 | 9/10/2019 |
| 49630701 | 1/1/2020 |
| 49630701 | 1/10/2020 |
| 49630701 | 1/28/2020 |
| 49630701 | 2/10/2020 |
| 49630701 | 3/10/2020 |
What I need to create is the "Group" column. The logic is: for every user, keep the same group number while the cumulative date difference is at most 30 days; whenever the cumulative date difference exceeds 30 days, increment the group number and reset the cumulative date difference to zero.
| user_id | activity_date | Group |
| -------- | ------------ | ----- |
| 49630701 | 1/1/2019 | 1 |
| 49630701 | 1/10/2019 | 1 |
| 49630701 | 1/28/2019 | 1 |
| 49630701 | 2/5/2019 | 2 | <- Cumulative date diff till here is 35, which is greater than 30, so increment the Group by 1 and reset the cumulative diff to 0
| 49630701 | 3/10/2019 | 3 |
| 49630701 | 3/21/2019 | 3 |
| 49630701 | 5/25/2019 | 4 |
| 49630701 | 5/28/2019 | 4 |
| 49630701 | 9/10/2019 | 5 |
| 49630701 | 1/1/2020 | 6 |
| 49630701 | 1/10/2020 | 6 |
| 49630701 | 1/28/2020 | 6 |
| 49630701 | 2/10/2020 | 7 |
| 49630701 | 3/10/2020 | 7 |
I tried the loop-based code below, but it is not efficient and runs for hours. Is there a better way to achieve this? Any help would be really appreciated.
from pyspark.sql.window import Window
from pyspark.sql.functions import col, lag, lit, when, datediff, rank

df = spark.read.table('excel_file')
df1 = df.select(col("user_id"), col("activity_date")).distinct()

partitionWindow = Window.partitionBy("user_id").orderBy(col("activity_date").asc())
lagTest = lag(col("activity_date"), 1, "0000-00-00 00:00:00").over(partitionWindow)
df1 = df1.select(col("*"), datediff(col("activity_date"), lagTest).cast("int").alias("diff_val_with_previous"))
df1 = df1.withColumn('diff_val_with_previous', when(col('diff_val_with_previous').isNull(), lit(0)).otherwise(col('diff_val_with_previous')))

distinctUser = [i['user_id'] for i in df1.select(col("user_id")).distinct().collect()]
rankTest = rank().over(partitionWindow)
df2 = df1.select(col("*"), rankTest.alias("rank"))

interimSessionThreshold = 30
totalSessionTimeThreshold = 30

rowList = []
for x in distinctUser:
    tempDf = df2.filter(col("user_id") == x).orderBy(col('activity_date'))
    cumulDiff = 0
    group = 1
    startBatch = True
    len_df = tempDf.count()
    dp = 0
    for i in range(1, len_df + 1):
        # one filter plus several collects per row: this is what makes the loop so slow
        r = tempDf.filter(col("rank") == i)
        dp = r.select("diff_val_with_previous").first()[0]
        cumulDiff += dp
        if (dp <= interimSessionThreshold) and (cumulDiff <= totalSessionTimeThreshold):
            startBatch = False
            rowList.append([r.select("user_id").first()[0], r.select("activity_date").first()[0], group])
        else:
            group += 1
            cumulDiff = 0
            startBatch = True
            dp = 0
            rowList.append([r.select("user_id").first()[0], r.select("activity_date").first()[0], group])

ddf = spark.createDataFrame(rowList, ['user_id', 'activity_date', 'group'])
I can think of two solutions, but neither of them matches exactly what you want:
from pyspark.sql import functions as F, Window
df.withColumn(
    "idx", F.monotonically_increasing_id()
).withColumn(
    "date_as_num", F.unix_timestamp("activity_date")
).withColumn(
    "group",
    F.min("idx").over(
        Window.partitionBy("user_id").orderBy("date_as_num").rangeBetween(-60 * 60 * 24 * 30, 0)
    )
).withColumn(
    "group", F.dense_rank().over(Window.partitionBy("user_id").orderBy("group"))
).show()
+--------+-------------+----------+-----------+-----+
| user_id|activity_date| idx|date_as_num|group|
+--------+-------------+----------+-----------+-----+
|49630701| 2019-01-01| 0| 1546300800| 1|
|49630701| 2019-01-10| 1| 1547078400| 1|
|49630701| 2019-01-28| 2| 1548633600| 1|
|49630701| 2019-02-05| 3| 1549324800| 2|
|49630701| 2019-03-10| 4| 1552176000| 3|
|49630701| 2019-03-21| 5| 1553126400| 3|
|49630701| 2019-05-25| 6| 1558742400| 4|
|49630701| 2019-05-28|8589934592| 1559001600| 4|
|49630701| 2019-09-10|8589934593| 1568073600| 5|
|49630701| 2020-01-01|8589934594| 1577836800| 6|
|49630701| 2020-01-10|8589934595| 1578614400| 6|
|49630701| 2020-01-28|8589934596| 1580169600| 6|
|49630701| 2020-02-10|8589934597| 1581292800| 7|
|49630701| 2020-03-10|8589934598| 1583798400| 8|
+--------+-------------+----------+-----------+-----+
or
df.withColumn(
    "group",
    F.datediff(
        F.col("activity_date"),
        F.lag("activity_date").over(Window.partitionBy("user_id").orderBy("activity_date")),
    ),
).withColumn(
    "group", F.sum("group").over(Window.partitionBy("user_id").orderBy("activity_date"))
).withColumn(
    "group", F.floor(F.coalesce(F.col("group"), F.lit(0)) / 30)
).withColumn(
    "group", F.dense_rank().over(Window.partitionBy("user_id").orderBy("group"))
).show()
+--------+-------------+-----+
| user_id|activity_date|group|
+--------+-------------+-----+
|49630701| 2019-01-01| 1|
|49630701| 2019-01-10| 1|
|49630701| 2019-01-28| 1|
|49630701| 2019-02-05| 2|
|49630701| 2019-03-10| 3|
|49630701| 2019-03-21| 3|
|49630701| 2019-05-25| 4|
|49630701| 2019-05-28| 4|
|49630701| 2019-09-10| 5|
|49630701| 2020-01-01| 6|
|49630701| 2020-01-10| 6|
|49630701| 2020-01-28| 7|
|49630701| 2020-02-10| 7|
|49630701| 2020-03-10| 8|
+--------+-------------+-----+
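If the exact cumulative-reset semantics from the question are required, the group of a row depends on decisions made for earlier rows, which is hard to express with plain window functions. A hedged sketch using groupBy().applyInPandas (Spark >= 3.0 with pyarrow installed; column names and sample values taken from the question) keeps the per-user sequential pass but runs it once per user instead of firing Spark jobs per row:
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample data from the question; dates are kept as strings and parsed inside the pandas function
dates = ["1/1/2019", "1/10/2019", "1/28/2019", "2/5/2019", "3/10/2019", "3/21/2019",
         "5/25/2019", "5/28/2019", "9/10/2019", "1/1/2020", "1/10/2020", "1/28/2020",
         "2/10/2020", "3/10/2020"]
df = spark.createDataFrame([("49630701", d) for d in dates], ["user_id", "activity_date"])

def assign_groups(pdf: pd.DataFrame) -> pd.DataFrame:
    # one sequential pass per user: cumulative day difference with a reset at every new group
    pdf = pdf.assign(_d=pd.to_datetime(pdf["activity_date"])).sort_values("_d")
    diffs = pdf["_d"].diff().dt.days.fillna(0)
    group, cumul, groups = 1, 0.0, []
    for d in diffs:
        cumul += d
        if cumul > 30:      # threshold exceeded: open a new group and reset the running total
            group += 1
            cumul = 0.0
        groups.append(group)
    return pdf.drop(columns="_d").assign(group=groups)

result = df.groupBy("user_id").applyInPandas(
    assign_groups, schema="user_id string, activity_date string, group int")
result.show()
This reproduces the expected 1, 1, 1, 2, 3, 3, 4, 4, 5, 6, 6, 6, 7, 7 grouping for the sample user.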

PySpark Column Creation by queuing filtered past rows

In PySpark, I want to make a new column in an existing table that stores the last K texts for a particular user that had label 1.
Example-
| Index | user_name | text | label |
|-------|-----------|------|-------|
| 0 | u1 | t0 | 0 |
| 1 | u1 | t1 | 1 |
| 2 | u2 | t2 | 0 |
| 3 | u1 | t3 | 1 |
| 4 | u2 | t4 | 0 |
| 5 | u2 | t5 | 1 |
| 6 | u2 | t6 | 1 |
| 7 | u1 | t7 | 0 |
| 8 | u1 | t8 | 1 |
| 9 | u1 | t9 | 0 |
The table after adding the new column (text_list) should be as follows, storing the last K = 2 such texts for each user.
| Index | user_name | text | label | text_list |
|-------|-----------|------|-------|-----------|
| 0 | u1 | t0 | 0 | [] |
| 1 | u1 | t1 | 1 | [] |
| 2 | u2 | t2 | 0 | [] |
| 3 | u1 | t3 | 1 | [t1] |
| 4 | u2 | t4 | 0 | [] |
| 5 | u2 | t5 | 1 | [] |
| 6 | u2 | t6 | 1 | [t5] |
| 7 | u1 | t7 | 0 | [t3, t1] |
| 8 | u1 | t8 | 1 | [t3, t1] |
| 9 | u1 | t9 | 0 | [t8, t3] |
A naïve way to do this would be to loop through each row and maintain a queue for each user. But the table could have millions of rows. Can we do this without looping in a more scalable, efficient way?
If you are using spark version >= 2.4, there is a way you can try. Let's say df is your dataframe.
df.show()
# +-----+---------+----+-----+
# |Index|user_name|text|label|
# +-----+---------+----+-----+
# | 0| u1| t0| 0|
# | 1| u1| t1| 1|
# | 2| u2| t2| 0|
# | 3| u1| t3| 1|
# | 4| u2| t4| 0|
# | 5| u2| t5| 1|
# | 6| u2| t6| 1|
# | 7| u1| t7| 0|
# | 8| u1| t8| 1|
# | 9| u1| t9| 0|
# +-----+---------+----+-----+
Two steps:
- collect a list of structs of the text and label columns over a window using collect_list
- filter the array where label = 1 and take the text values, sort the array in descending order using sort_array, and take the first two elements using slice
It would be something like this
from pyspark.sql.functions import col, collect_list, struct, expr, sort_array, slice
from pyspark.sql.window import Window
# window : first row to row before current row
w = Window.partitionBy('user_name').orderBy('index').rowsBetween(Window.unboundedPreceding, -1)
df = (df
      .withColumn('text_list', collect_list(struct(col('text'), col('label'))).over(w))
      .withColumn('text_list', slice(sort_array(expr("FILTER(text_list, value -> value.label = 1).text"), asc=False), 1, 2))
     )
df.sort('Index').show()
# +-----+---------+----+-----+---------+
# |Index|user_name|text|label|text_list|
# +-----+---------+----+-----+---------+
# | 0| u1| t0| 0| []|
# | 1| u1| t1| 1| []|
# | 2| u2| t2| 0| []|
# | 3| u1| t3| 1| [t1]|
# | 4| u2| t4| 0| []|
# | 5| u2| t5| 1| []|
# | 6| u2| t6| 1| [t5]|
# | 7| u1| t7| 0| [t3, t1]|
# | 8| u1| t8| 1| [t3, t1]|
# | 9| u1| t9| 0| [t8, t3]|
# +-----+---------+----+-----+---------+
Thanks to the solution posted above. I modified it slightly (since it assumed the text field can be sorted) and was finally able to arrive at a working solution. Here it is:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from pyspark.sql.functions import col, when, collect_list, slice, reverse

K = 2
# all rows of the same user strictly before the current row
windowPast = Window.partitionBy("user_name").orderBy("Index").rowsBetween(Window.unboundedPreceding, Window.currentRow - 1)

# collect_list skips the nulls produced by otherwise(), so only label == 1 texts are kept
df.withColumn("text_list",
              collect_list(when(col("label") == 1, col("text")).otherwise(F.lit(None))).over(windowPast)) \
  .withColumn("text_list", slice(reverse(col("text_list")), 1, K)) \
  .sort(F.col("Index")) \
  .show()
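For completeness, a minimal sketch that rebuilds the df used in both snippets above (column names and values taken from the question's table), in case you want to run them end to end:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

rows = [
    (0, "u1", "t0", 0), (1, "u1", "t1", 1), (2, "u2", "t2", 0), (3, "u1", "t3", 1),
    (4, "u2", "t4", 0), (5, "u2", "t5", 1), (6, "u2", "t6", 1), (7, "u1", "t7", 0),
    (8, "u1", "t8", 1), (9, "u1", "t9", 0),
]
df = spark.createDataFrame(rows, ["Index", "user_name", "text", "label"])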

Scala spark to filter out reoccurring zero values

I created a dataframe in spark with the following schema:
root
|-- user_id: string (nullable = true)
|-- rate: decimal(32,16) (nullable = true)
|-- date: timestamp (nullable = true)
|-- type: string (nullable = true)
The data in this schema looks like this:
+---------+------+------------+------+
| user_id | rate | date       | type |
+---------+------+------------+------+
| XO_121  | 10   | 2020-04-20 | A    |
| XO_121  | 10   | 2020-04-21 | A    |
| XO_121  | 30   | 2020-04-22 | A    |
| XO_121  | 0    | 2020-04-23 | A    |
| XO_121  | 0    | 2020-04-24 | A    |
| XO_121  | 0    | 2020-04-25 | A    |
| XO_121  | 0    | 2020-04-26 | A    |
| XO_121  | 5    | 2020-04-27 | A    |
| XO_121  | 0    | 2020-04-28 | A    |
| XO_121  | 0    | 2020-04-29 | A    |
| XO_121  | 1    | 2020-04-30 | A    |
+---------+------+------------+------+
I want to save space, so I want to skip rows where rate is zero, keeping only its first occurrence; other duplicate rates are allowed (as you can see with 10), and the date order must be preserved. So after applying the filter my data should look like this:
+---------+------+------------+------+
| user_id | rate | date       | type |
+---------+------+------------+------+
| XO_121  | 10   | 2020-04-20 | A    |
| XO_121  | 10   | 2020-04-21 | A    |
| XO_121  | 30   | 2020-04-22 | A    |
| XO_121  | 0    | 2020-04-23 | A    |
| XO_121  | 5    | 2020-04-27 | A    |
| XO_121  | 0    | 2020-04-28 | A    |
| XO_121  | 1    | 2020-04-30 | A    |
+---------+------+------------+------+
I'm new to Spark and just want to find a way to filter this. I tried the rank concept but that doesn't work. Can anybody provide a solution to this problem?
Data preparation:
val df = Seq(
  ("XO_121", "10", "2020-04-20"), ("XO_121", "10", "2020-04-21"), ("XO_121", "30", "2020-04-22"),
  ("XO_121", "0", "2020-04-23"), ("XO_121", "0", "2020-04-24"), ("XO_121", "0", "2020-04-25"),
  ("XO_121", "0", "2020-04-26"), ("XO_121", "5", "2020-04-27"), ("XO_121", "0", "2020-04-28"),
  ("XO_121", "0", "2020-04-29"), ("XO_121", "1", "2020-04-30")
).toDF("user_id", "rate", "date")
Get the previous value of rate with lag, and filter out each record where "rate" === "0" && "previous_rate" === "0":
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag

val winSpec = Window.partitionBy("user_id").orderBy("date")
val finalDf = df.withColumn("previous_rate", lag("rate", 1).over(winSpec))
  .filter(!($"rate" === "0" && $"previous_rate" === "0"))
  .drop("previous_rate")
Output:
scala> finalDf.show
+-------+----+----------+
|user_id|rate| date|
+-------+----+----------+
| XO_121| 10|2020-04-20|
| XO_121| 10|2020-04-21|
| XO_121| 30|2020-04-22|
| XO_121| 0|2020-04-23|
| XO_121| 5|2020-04-27|
| XO_121| 0|2020-04-28|
| XO_121| 1|2020-04-30|
+-------+----+----------+
Now you can apply orderBy($"date") or orderBy($"user_id", $"date"), whichever is applicable for you.
You can use row_number() instead of rank as below:
from pyspark.sql import functions as F
from pyspark.sql.window import Window as W

_w = W.partitionBy("col2").orderBy("col1")
df = df.withColumn("rnk", F.row_number().over(_w))
df = df.filter(F.col("rnk") == F.lit("1"))
df.show()
+------+----+---+
| col1|col2|rnk|
+------+----+---+
|XO_121| 0| 1|
|XO_121| 10| 1|
|XO_121| 30| 1|
|XO_121| 20| 1|
|XO_121| 40| 1|
+------+----+---+
Also, you can use first() in case you know that only the value 0 is repeated:
df = df.groupBy("col1","col2").agg(F.first("col2").alias("col2")).orderBy("col2")
df.show()
+------+----+----+
| col1|col2|col2|
+------+----+----+
|XO_121| 0| 0|
|XO_121| 10| 10|
|XO_121| 20| 20|
|XO_121| 30| 30|
|XO_121| 50| 50|
+------+----+----+
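Since this part of the thread is already in PySpark: if you want the consecutive-zero filter that matches the expected output (the lag-based approach shown in the first answer above), a hedged PySpark equivalent with the question's column names would be:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("user_id").orderBy("date")

# keep a zero only when the previous rate for the same user was not also zero
filtered = (df.withColumn("previous_rate", F.lag("rate", 1).over(w))
              .filter(~((F.col("rate") == 0) & (F.col("previous_rate") == 0)))
              .drop("previous_rate"))
filtered.show()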

Spark groupby filter sorting with top 3 read articles each city

I have table data like the following:
+-----------+--------+-------------+
| City Name | URL | Read Count |
+-----------+--------+-------------+
| Gurgaon | URL1 | 3 |
| Gurgaon | URL3 | 6 |
| Gurgaon | URL6 | 5 |
| Gurgaon | URL4 | 1 |
| Gurgaon | URL5 | 5 |
| Delhi | URL3 | 4 |
| Delhi | URL7 | 2 |
| Delhi | URL5 | 1 |
| Delhi | URL6 | 6 |
| Punjab | URL6 | 5 |
| Punjab | URL4 | 1 |
| Mumbai | URL5 | 5 |
+-----------+--------+-------------+
I would like to see something like this -> the top 3 read articles (if they exist) for each city:
+-----------+--------+--------+
| City Name | URL | Count |
+-----------+--------+--------+
| Gurgaon | URL3 | 6 |
| Gurgaon | URL6 | 5 |
| Gurgaon | URL5 | 5 |
| Delhi | URL6 | 6 |
| Delhi | URL3 | 4 |
| Delhi | URL1 | 3 |
| Punjab | URL6 | 5 |
| Punjab | URL4 | 1 |
| Mumbai | URL5 | 5 |
+-----------+--------+--------+
I am working on Spark 2.0.2, Scala 2.11.8
You can use a window function to get the output.
import org.apache.spark.sql.expressions.Window
val df = sc.parallelize(Seq(
  ("Gurgaon","URL1",3), ("Gurgaon","URL3",6), ("Gurgaon","URL6",5), ("Gurgaon","URL4",1), ("Gurgaon","URL5",5),
  ("DELHI","URL3",4), ("DELHI","URL7",2), ("DELHI","URL5",1), ("DELHI","URL6",6), ("Mumbai","URL5",5),
  ("Punjab","URL6",6), ("Punjab","URL4",1))).toDF("City", "URL", "Count")
df.show()
+-------+----+-----+
| City| URL|Count|
+-------+----+-----+
|Gurgaon|URL1| 3|
|Gurgaon|URL3| 6|
|Gurgaon|URL6| 5|
|Gurgaon|URL4| 1|
|Gurgaon|URL5| 5|
| DELHI|URL3| 4|
| DELHI|URL7| 2|
| DELHI|URL5| 1|
| DELHI|URL6| 6|
| Mumbai|URL5| 5|
| Punjab|URL6| 6|
| Punjab|URL4| 1|
+-------+----+-----+
val w = Window.partitionBy($"City").orderBy($"Count".desc)
val dfTop = df.withColumn("row", rowNumber.over(w)).where($"row" <= 3).drop("row")
dfTop.show
+-------+----+-----+
| City| URL|Count|
+-------+----+-----+
|Gurgaon|URL3| 6|
|Gurgaon|URL6| 5|
|Gurgaon|URL5| 5|
| Mumbai|URL5| 5|
| DELHI|URL6| 6|
| DELHI|URL3| 4|
| DELHI|URL7| 2|
| Punjab|URL6| 6|
| Punjab|URL4| 1|
+-------+----+-----+
Output tested on Spark 1.6.2 (on Spark 2.x, use row_number() instead of rowNumber, which has been removed).
Window functions are probably the way to go, and there is a built-in function for this purpose:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{rank, desc}
val window = Window.partitionBy($"City").orderBy(desc("Count"))
val dfTop = df.withColumn("rank", rank.over(window)).where($"rank" <= 3)