Parse through each element of an array in pyspark and apply substring - pyspark

Hi, I have a PySpark DataFrame with an array column, shown below.
I want to iterate through each element, fetch only the string prior to the hyphen, and create another column.
+------------------------------+
|array_col |
+------------------------------+
|[hello-123, abc-111] |
|[hello-234, def-22, xyz-33] |
|[hiiii-111, def2-333, lmn-222]|
+------------------------------+
Desired Output:
+------------------------------+--------------------+
|col1 |new_column |
+------------------------------+--------------------+
|[hello-123, abc-111] |[hello, abc] |
|[hello-234, def-22, xyz-33] |[hello, def, xyz] |
|[hiiii-111, def2-333, lmn-222]|[hiiii, def2, lmn] |
+------------------------------+--------------------+
I am trying something like the below, but I could not apply a regex/substring inside a UDF.
from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, StringType

cust_udf = udf(lambda arr: [x for x in arr], ArrayType(StringType()))
df1.withColumn('new_column', cust_udf(col("col1")))
Can anyone please help with this? Thanks.

From Spark 2.4, use the transform higher-order function.
Example:
df.show(10,False)
#+---------------------------+
#|array_col |
#+---------------------------+
#|[hello-123, abc-111] |
#|[hello-234, def-22, xyz-33]|
#+---------------------------+
df.printSchema()
#root
# |-- array_col: array (nullable = true)
# | |-- element: string (containsNull = true)
from pyspark.sql.functions import *
df.withColumn("new_column",expr('transform(array_col,x -> split(x,"-")[0])')).\
show()
#+--------------------+-----------------+
#| array_col| new_column|
#+--------------------+-----------------+
#|[hello-123, abc-111]| [hello, abc]|
#|[hello-234, def-2...|[hello, def, xyz]|
#+--------------------+-----------------+
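If you are on a Spark version older than 2.4, or simply prefer a UDF, the same split-on-hyphen logic can be written in a plain Python UDF. A minimal sketch, assuming the question's df1 with the array column named array_col:

from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, StringType

# keep only the part before the first hyphen in each array element
cust_udf = udf(lambda arr: [x.split("-")[0] for x in arr] if arr is not None else None,
               ArrayType(StringType()))
df1.withColumn("new_column", cust_udf(col("array_col"))).show(truncate=False)

Note that the built-in transform avoids the Python serialization overhead of a UDF, so prefer it where available.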

Related

get keys from MapType column in pyspark and use it in navigation

I have a Spark df with the following schema:
|-- col1 : string
|-- col2 : string
|-- data: struct
| |-- items: map (nullable = true)
| | |-- key: string
| | |-- value: struct
| | | |-- id: string
| | | |-- legalNature
and I'm receiving different JSON with different keys under the items object.
Here is an example:
+-------+---------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| col1| col2 | data |
+-------+---------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| xxx | yyy |{"data":{"items":{"71f3e2e4-5c3a-466d-8063-7bfd753b303c":{"id":"123","legalNature":"legalNature","allowedSignature":[],"category":"TP01","createdOn":"2021-01-22T13:17:12.502+01:00"}}}}|
| ... | ... |{"data":{"items":{"86c6b41e-085c-eb11-a812-000d3ab29c25":{"id":"153","legalNature":"legalNature","allowedSignature":[],"category":"TP01","createdOn":"2021-01-22T13:17:12.502+01:00"}}}}|
| | |{"data":{"items":{"56c6b41e-085c-eb11-a812-000d3ab29c24":{"id":"173","legalNature":"legalNature","allowedSignature":[],"category":"TP01","createdOn":"2021-01-22T13:17:12.502+01:00"}}}}|
| | |{"data":{"items":{"1843f179-3687-eb11-a812-0022489bac2c":{"id":"193","legalNature":"legalNature","allowedSignature":[],"category":"TP01","createdOn":"2021-01-22T13:17:12.502+01:00"}}}}|
| | |{"data":{"items":{"2643f179-3687-eb11-a812-0022489bac2a":{"id":"133","legalNature":"legalNature","allowedSignature":[],"category":"TP01","createdOn":"2021-01-22T13:17:12.502+01:00"}}}}|
| | |{"data":{"items":{"91f3e2e4-5c3a-466d-8063-7bfd753b345i":{"id":"143","legalNature":"legalNature","allowedSignature":[],"category":"TP01","createdOn":"2021-01-22T13:17:12.502+01:00"}}}}|
+-------+---------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
items is a struct of type MapType(StringType(), itemsSchema). Since the key string of the map may change in each JSON I get, how can I navigate my JSON schema dynamically in order to get the fields inside the items struct?
For instance I need something like this for performing select operation:
df.select(
    col("col1"),
    col("col2").alias("my_col_2"),
    col(f"data.items.{itemsKey}.legalNature"),
    col(f"data.items.{itemsKey}.id"),
)
where itemsKey change for each json inside my df.
I already see that for getting key I can use map_keys function, like this:
df.select(map_keys("data.items"))
But the problem is that this function returns a DataFrame, not a string; more precisely, a DataFrame with a different itemsKey for each row.
Is there a way to get my itemsKey dynamically?
I hope I've been clear, any help is appreciated.
You should use the explode function from pyspark.sql.functions:
from pyspark.sql.functions import explode

df3 = df.select("*", explode("data.items"))
This will give results like:
+------+--------------------------------------+----------------------------------------------+
| col1 | key                                  | value                                        |
+------+--------------------------------------+----------------------------------------------+
| ...  | a04452cb-a909-47b0-ad5a-9bc44c6014e3 | {"legalNature":"legalNature","id": "123",..} |
+------+--------------------------------------+----------------------------------------------+
To filter on your particular itemsKey, simply use the filter function:
filteredValue = df3.filter(df3.key == "Your_item_id").select("value").first()
legalNature = filteredValue["value"]["legalNature"]
id = filteredValue["value"]["id"]
...
...
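If instead of a filter you want the select shape from the question (one row per item, with legalNature and id as their own columns), a minimal sketch along the same explode lines could look like this; the alias names itemsKey and item are just illustrative:

from pyspark.sql.functions import col, explode

# explode the map into one row per (key, value) pair
exploded = df.select(
    "col1",
    col("col2").alias("my_col_2"),
    explode(col("data.items")).alias("itemsKey", "item"),
)
# pull the struct fields out of the map value
result = exploded.select(
    "col1",
    "my_col_2",
    col("item.legalNature").alias("legalNature"),
    col("item.id").alias("id"),
)
result.show(truncate=False)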

Find max value from different columns in a single row in scala DataFrame

I am trying to find the max value from different columns in a single row of a Scala DataFrame.
The data available in the DataFrame is as below.
+-------+---------------------------------------+---------------------------------------+---------------------------------------+
| NUM| SIG1| SIG2| SIG3|
+-------+---------------------------------------+---------------------------------------+---------------------------------------+
|XXXXX01|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531001,"VALUE":4.7825}]|[{"TIME":1569560531002,"VALUE":2.7825}]|
|XXXXX01|[{"TIME":1569560541001,"VALUE":1.7825}]|[{"TIME":1569560541000,"VALUE":8.7825}]|[{"TIME":1569560541003,"VALUE":5.7825}]|
|XXXXX01|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531009,"VALUE":3.7825}]| null |
|XXXXX02|[{"TIME":1569560531000,"VALUE":5.7825}]|[{"TIME":1569560531007,"VALUE":8.7825}]|[{"TIME":1569560531006,"VALUE":3.7825}]|
|XXXXX02|[{"TIME":1569560531000,"VALUE":9.7825}]|[{"TIME":1569560531009,"VALUE":1.7825}]|[{"TIME":1569560531010,"VALUE":3.7825}]|
and the schema is
scala> DF.printSchema
root
|-- NUM: string (nullable = true)
|-- SIG1: string (nullable = true)
|-- SIG2: string (nullable = true)
|-- SIG3: string (nullable = true)
The expected output is as below.
+-------+--------------+----------+------------+------------+
| NUM   | TIME         | SIG1     | SIG2       | SIG3       |
+-------+--------------+----------+------------+------------+
|XXXXX01| 1569560531002| 3.7825 | 4.7825 | 2.7825 |
|XXXXX01| 1569560541003| 1.7825 | 8.7825 | 5.7825 |
|XXXXX01| 1569560531009| 3.7825 | 3.7825 | null |
|XXXXX02| 1569560531007| 5.7825 | 8.7825 | 3.7825 |
|XXXXX02| 1569560531010| 9.7825 | 1.7825 | 3.7825 |
I need to add a new column with the highest TIME from each row, and keep the SIG columns with their VALUE only.
Basically, the TIME in each column is replaced by the highest TIME value available in that row, and the TIME and VALUEs are split out into separate columns.
Is there any UDF or function to achieve this?
Thanks in Advance.
Use the get_json_object function to extract values from JSON stored as a string.
Then it's quite straightforward:
import org.apache.spark.sql.functions._
import spark.implicits._  // for the 'colName symbol-to-Column syntax

DF.withColumn("TIME", greatest(get_json_object('SIG1, "$[0].TIME"),
                               get_json_object('SIG2, "$[0].TIME"),
                               get_json_object('SIG3, "$[0].TIME")))
  .withColumn("SIG1", get_json_object('SIG1, "$[0].VALUE"))
  .withColumn("SIG2", get_json_object('SIG2, "$[0].VALUE"))
  .withColumn("SIG3", get_json_object('SIG3, "$[0].VALUE"))
  .show
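For anyone doing the same thing in PySpark rather than Scala, a rough equivalent of the answer above (assuming a DataFrame named df with the same NUM, SIG1, SIG2 and SIG3 string columns) would be:

from pyspark.sql.functions import get_json_object, greatest

(df.withColumn("TIME", greatest(get_json_object("SIG1", "$[0].TIME"),
                                get_json_object("SIG2", "$[0].TIME"),
                                get_json_object("SIG3", "$[0].TIME")))
   .withColumn("SIG1", get_json_object("SIG1", "$[0].VALUE"))
   .withColumn("SIG2", get_json_object("SIG2", "$[0].VALUE"))
   .withColumn("SIG3", get_json_object("SIG3", "$[0].VALUE"))
   .show(truncate=False))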

Pivot spark dataframe array of kv pairs into individual columns

I have following schema:
root
|-- id: string (nullable = true)
|-- date: timestamp (nullable = true)
|-- config: struct (nullable = true)
| |-- entry: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- key: string (nullable = true)
| | | |-- value: string (nullable = true)
There will not be more than 3 key-value pairs (k1, k2, k3) in the array, and I would like to turn the key of each pair into its own column, with the corresponding value from the same kv pair as the data.
+--------+----------+----------+----------+---------+
|id |date |k1 |k2 |k3 |
+--------+----------+----------+----------+---------+
| id1 |2019-08-12|id1-v1 |id1-v2 |id1-v3 |
| id2 |2019-08-12|id2-v1 |id2-v2 |id2-v3 |
+--------+----------+----------+----------+---------+
So far I tried something like this:
sourceDF.filter($"someColumn".contains("SOME_STRING"))
  .select($"id", $"date", $"config.entry" as "kvpairs")
  .withColumn($"kvpairs".getItem(0).getField("key").toString(), $"kvpairs".getItem(0).getField("value"))
  .withColumn($"kvpairs".getItem(1).getField("key").toString(), $"kvpairs".getItem(1).getField("value"))
  .withColumn($"kvpairs".getItem(2).getField("key").toString(), $"kvpairs".getItem(2).getField("value"))
But in this case, the column names are shown as kvpairs[0][key], kvpairs[1][key] and kvpairs[2][key] as shown below:
+--------+----------+---------------+---------------+---------------+
|id |date |kvpairs[0][key]|kvpairs[1][key]|kvpairs[2][key]|
+--------+----------+---------------+---------------+---------------+
| id1 |2019-08-12| id1-v1 | id1-v2 | id1-v3 |
| id2 |2019-08-12| id2-v1 | id2-v2 | id2-v3 |
+--------+----------+---------------+---------------+---------------+
Two questions:
1. Is my approach right? Is there a better and easier way to pivot this so that I get one row per array with the 3 kv pairs as 3 columns? I want to handle cases where the order of the kv pairs may differ.
2. If the above approach is fine, how do I alias the column name to the data of the "key" element in the array?
Using multiple withColumn calls together with getItem will not work, since the order of the kv pairs may differ. What you can do instead is explode the array and then use pivot, as follows:
sourceDF.filter($"someColumn".contains("SOME_STRING"))
  .select($"id", $"date", explode($"config.entry") as "exploded")
  .select($"id", $"date", $"exploded.*")
  .groupBy("id", "date")
  .pivot("key")
  .agg(first("value"))
The usage of first inside the aggregation here assumes there will be a single value for each key. Otherwise collect_list or collect_set can be used.
Result:
+---+----------+------+------+------+
|id |date      |k1    |k2    |k3    |
+---+----------+------+------+------+
|id1|2019-08-12|id1-v1|id1-v2|id1-v3|
|id2|2019-08-12|id2-v1|id2-v2|id2-v3|
+---+----------+------+------+------+
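For reference, a rough PySpark version of the same explode-and-pivot approach (assuming the same sourceDF, someColumn filter and config.entry schema as above) might look like this:

from pyspark.sql.functions import col, explode, first

pivoted = (
    sourceDF.filter(col("someColumn").contains("SOME_STRING"))
    .select("id", "date", explode(col("config.entry")).alias("exploded"))
    .select("id", "date", "exploded.*")   # expands to key, value columns
    .groupBy("id", "date")
    .pivot("key")
    .agg(first("value"))
)
pivoted.show(truncate=False)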

Convert Array with nested struct to string column along with other columns from the PySpark DataFrame

This is similar to Pyspark: cast array with nested struct to string.
But the accepted answer is not working for my case, so I am asking here. My schema is:
|-- Col1: string (nullable = true)
|-- Col2: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Col2Sub: string (nullable = true)
Sample JSON
{"Col1":"abc123","Col2":[{"Col2Sub":"foo"},{"Col2Sub":"bar"}]}
This gives the result in a single column:
import pyspark.sql.functions as F
df.selectExpr("EXPLODE(Col2) AS structCol").select(F.expr("concat_ws(',', structCol.*)").alias("Col2_concated")).show()
+----------------+
| Col2_concated |
+----------------+
|foo,bar |
+----------------+
But how do I get a result or DataFrame like this?
+-------+---------------+
|Col1 | Col2_concated |
+-------+---------------+
|abc123 |foo,bar |
+-------+---------------+
EDIT:
This solution gives the wrong result
df.selectExpr("Col1","EXPLODE(Col2) AS structCol").select("Col1", F.expr("concat_ws(',', structCol.*)").alias("Col2_concated")).show()
+-------+---------------+
|Col1 | Col2_concated |
+-------+---------------+
|abc123 |foo |
|abc123 |bar |
+-------+---------------+
Just avoid the explode and you are already there. All you need is the concat_ws function. This function concatenates multiple string columns with a given separator. See the example below:
from pyspark.sql import functions as F
j = '{"Col1":"abc123","Col2":[{"Col2Sub":"foo"},{"Col2Sub":"bar"}]}'
df = spark.read.json(sc.parallelize([j]))
#printSchema tells us the column names we can use with concat_ws
df.printSchema()
Output:
root
|-- Col1: string (nullable = true)
|-- Col2: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Col2Sub: string (nullable = true)
The column Col2 is an array of Col2Sub and we can use this column name to get the desired result:
bla = df.withColumn('Col2', F.concat_ws(',', df.Col2.Col2Sub))
bla.show()
+------+-------+
| Col1| Col2|
+------+-------+
|abc123|foo,bar|
+------+-------+
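As a side note, on Spark 2.4+ the same result can also be produced with array_join, which concatenates the elements of an array column directly; a small sketch using the same df as above:

from pyspark.sql import functions as F

# Col2.Col2Sub yields an array<string>; array_join glues its elements together
bla = df.withColumn("Col2", F.array_join(df.Col2.Col2Sub, ","))
bla.show()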

NullPointerException Error when running CountVectorizer from scala [duplicate]

I have a column in my Spark DataFrame:
|-- topics_A: array (nullable = true)
| |-- element: string (containsNull = true)
I'm using CountVectorizer on it:
topic_vectorizer_A = CountVectorizer(inputCol="topics_A", outputCol="topics_vec_A")
I get NullPointerExceptions because sometimes the topics_A column contains null.
Is there a way around this? Filling it with a zero-length array would work (although it will blow out the data size quite a lot), but I can't work out how to do a fillna on an array column in PySpark.
Personally I would drop rows with NULL values because there is no useful information there, but you can replace nulls with empty arrays. First some imports:
from pyspark.sql.functions import when, col, coalesce, array
You can define an empty array of specific type as:
fill = array().cast("array<string>")
and combine it with when clause:
topics_a = when(col("topics_A").isNull(), fill).otherwise(col("topics_A"))
or coalesce:
topics_a = coalesce(col("topics_A"), fill)
and use it as:
df.withColumn("topics_A", topics_a)
so with example data:
df = sc.parallelize([(1, ["a", "b"]), (2, None)]).toDF(["id", "topics_A"])
df_ = df.withColumn("topics_A", topics_a)
topic_vectorizer_A.fit(df_).transform(df_)
the result will be:
+---+--------+-------------------+
| id|topics_A| topics_vec_A|
+---+--------+-------------------+
| 1| [a, b]|(2,[0,1],[1.0,1.0])|
| 2| []| (2,[],[])|
+---+--------+-------------------+
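Putting the pieces above together, a minimal end-to-end sketch (assuming an existing SparkSession named spark) would be:

from pyspark.ml.feature import CountVectorizer
from pyspark.sql.functions import array, coalesce, col

df = spark.createDataFrame([(1, ["a", "b"]), (2, None)], ["id", "topics_A"])

# replace null arrays with an empty array<string> so CountVectorizer never sees null
fill = array().cast("array<string>")
df_ = df.withColumn("topics_A", coalesce(col("topics_A"), fill))

topic_vectorizer_A = CountVectorizer(inputCol="topics_A", outputCol="topics_vec_A")
topic_vectorizer_A.fit(df_).transform(df_).show()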
I had a similar issue; based on the comments, I used the following syntax to remove the null values before tokenizing:
from pyspark.sql.functions import col

# show rows where title is null
clean_text_ddf.where(col("title").isNull()).show()
# drop rows with null titles
cleaned_text = clean_text_ddf.na.drop(subset=["title"])
# verify that no null titles remain
cleaned_text.where(col("title").isNull()).show()
cleaned_text.printSchema()
cleaned_text.show(2)
+-----+
|title|
+-----+
+-----+
+-----+
|title|
+-----+
+-----+
root
|-- title: string (nullable = true)
+--------------------+
| title|
+--------------------+
|Mr. Beautiful (Up...|
|House of Ravens (...|
+--------------------+
only showing top 2 rows