PySpark DataFrame change column of string to array before using explode - pyspark

I have a column called event_data that contains JSON in my Spark DataFrame. After parsing it with from_json, I get this schema:
root
|-- user_id: string (nullable = true)
|-- event_data: struct (nullable = true)
| |-- af_content_id: string (nullable = true)
| |-- af_currency: string (nullable = true)
| |-- af_order_id: long (nullable = true)
I only need af_content_id from this column. This attribute can be of different formats:
a String
an Integer
a List of Int and Str, e.g. ['ghhjj23','123546',12356]
None (sometimes event_data doesn't contain af_content_id)
I want to use the explode function in order to return a new row for each element in af_content_id when it is a List. But when I apply it, I get an error:
from pyspark.sql.functions import explode

def get_content_id(column):
    return column.af_content_id

df_transf_1 = df_transf_1.withColumn(
    "products_basket",
    get_content_id(df_transf_1.event_data)
)
df_transf_1 = df_transf_1.withColumn(
    "product_id",
    explode(df_transf_1.products_basket)
)
cannot resolve 'explode(products_basket)' due to data type mismatch: input to function explode should be array or map type, not StringType;
I know the reason: it's because of the different types that the field af_content_id may contain, but I don't know how to resolve it. Using pyspark.sql.functions.array() directly on the column doesn't work, because it becomes an array of arrays and explode will not produce the expected result.
Here is sample code to reproduce the step I'm stuck on:
import pandas as pd

arr = [
    ['b5ad805c-f295-4852-82fc-961a88', 12732936],
    ['0FD6955D-484C-4FC8-8C3F-DA7D28', ['Gklb38', '123655']],
    ['0E3D17EA-BEEF-4931-8104', '12909841'],
    ['CC2877D0-A15C-4C0A-AD65-762A35C1', [12645715, 12909837, 12909837]]
]
df = pd.DataFrame(arr, columns=['user_id', 'products_basket'])
df = df[['user_id', 'products_basket']].astype(str)
df_transf_1 = spark.createDataFrame(df)
I'm looking for a way to convert products_basket to one single format, an Array, so that when I apply explode it produces one id per row.

If you are starting with a DataFrame like:
df_transf_1.show(truncate=False)
#+--------------------------------+------------------------------+
#|user_id |products_basket |
#+--------------------------------+------------------------------+
#|b5ad805c-f295-4852-82fc-961a88 |12732936 |
#|0FD6955D-484C-4FC8-8C3F-DA7D28 |['Gklb38', '123655'] |
#|0E3D17EA-BEEF-4931-8104 |12909841 |
#|CC2877D0-A15C-4C0A-AD65-762A35C1|[12645715, 12909837, 12909837]|
#+--------------------------------+------------------------------+
where the products_basket column is a StringType:
df_transf_1.printSchema()
#root
# |-- user_id: string (nullable = true)
# |-- products_basket: string (nullable = true)
You can't call explode on products_basket because it's not an array or map.
One workaround is to remove any leading/trailing square brackets and then split the string on ", " (comma followed by a space). This will convert the string into an array of strings.
from pyspark.sql.functions import col, regexp_replace, split

df_transf_new = df_transf_1.withColumn(
    "products_basket",
    split(regexp_replace(col("products_basket"), r"(^\[)|(\]$)|(')", ""), ", ")
)
df_transf_new.show(truncate=False)
#+--------------------------------+------------------------------+
#|user_id |products_basket |
#+--------------------------------+------------------------------+
#|b5ad805c-f295-4852-82fc-961a88 |[12732936] |
#|0FD6955D-484C-4FC8-8C3F-DA7D28 |[Gklb38, 123655] |
#|0E3D17EA-BEEF-4931-8104 |[12909841] |
#|CC2877D0-A15C-4C0A-AD65-762A35C1|[12645715, 12909837, 12909837]|
#+--------------------------------+------------------------------+
The regular expression pattern matches any of the following:
(^\[): An opening square bracket at the start of the string
(\]$): A closing square bracket at the end of the string
('): Any single quote (because your strings are quoted)
and replaces these with an empty string.
This assumes that your data does not contain any single quotes or square brackets that you need to keep inside products_basket.
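As a quick sanity check (plain Python, outside Spark), the same pattern behaves like this on one of the sample values:
import re

# Strip the leading "[", the trailing "]" and any single quotes, then split on ", ".
re.sub(r"(^\[)|(\]$)|(')", "", "['Gklb38', '123655']").split(", ")
# ['Gklb38', '123655']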
After the split, the schema of the new DataFrame is:
df_transf_new.printSchema()
#root
# |-- user_id: string (nullable = true)
# |-- products_basket: array (nullable = true)
# | |-- element: string (containsNull = true)
Now you can call explode:
from pyspark.sql.functions import explode
df_transf_new.withColumn("product_id", explode("products_basket")).show(truncate=False)
#+--------------------------------+------------------------------+----------+
#|user_id |products_basket |product_id|
#+--------------------------------+------------------------------+----------+
#|b5ad805c-f295-4852-82fc-961a88 |[12732936] |12732936 |
#|0FD6955D-484C-4FC8-8C3F-DA7D28 |[Gklb38, 123655] |Gklb38 |
#|0FD6955D-484C-4FC8-8C3F-DA7D28 |[Gklb38, 123655] |123655 |
#|0E3D17EA-BEEF-4931-8104 |[12909841] |12909841 |
#|CC2877D0-A15C-4C0A-AD65-762A35C1|[12645715, 12909837, 12909837]|12645715 |
#|CC2877D0-A15C-4C0A-AD65-762A35C1|[12645715, 12909837, 12909837]|12909837 |
#|CC2877D0-A15C-4C0A-AD65-762A35C1|[12645715, 12909837, 12909837]|12909837 |
#+--------------------------------+------------------------------+----------+
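A side note that is not part of the original answer: explode drops rows whose array is null, which can matter for the None case mentioned in the question. A minimal sketch using explode_outer (available since Spark 2.2), which keeps such rows with a null product_id:
from pyspark.sql.functions import explode_outer

# Like explode, but rows with a null or empty products_basket are kept,
# with product_id set to null instead of the row being dropped.
df_transf_new.withColumn("product_id", explode_outer("products_basket")).show(truncate=False)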

Unclosed character in matching string for splitting operation in pyspark

I have the following information in a pyspark data frame column:
[["A"],["B"]]
and
["A","B"]
I would like to split the column when the values look like the first example, and leave values like the second example intact.
However, upon trying to do this via the split operation:
df = df.selectExpr("split(col, '],[') col")
I receive the following error:
'Unclosed character class near index...'
I have also tried to replace the actual characters with their ASCII equivalents:
df = df.selectExpr("split(col, '\x5D\2C\x5B') col")
But it resulted in the same error as above.
Any suggestions are welcome. Tnx.
There is clearly something strange about the character [ here: split treats the pattern as a regular expression, and an unescaped [ opens a character class that is never closed. If you remove (or escape) this character, there is no error anymore.
But you can also use the function version of split, which solves the issue:
from pyspark.sql import functions as F
df.select(F.split(F.col("col"), '],\[')).show()
+----------------+
|split(col, ],\[)|
+----------------+
| [[["A", "B"]]]|
| [["A","B"]]|
+----------------+
Setting aside your immediate problem with split and looking at the real use case you're facing, from_json is probably a better idea:
from pyspark.sql import types as T

schema_1 = T.ArrayType(T.StringType())
schema_2 = T.ArrayType(T.ArrayType(T.StringType()))
df2 = df.select(
    "col",
    F.from_json("col", schema_1),
    F.from_json("col", schema_2)
)
df2.show()
+-------------+------------------+------------------+
| col|jsontostructs(col)|jsontostructs(col)|
+-------------+------------------+------------------+
|[["A"],["B"]]| [["A"], ["B"]]| [[A], [B]]|
| ["A","B"]| [A, B]| null|
+-------------+------------------+------------------+
df2.printSchema()
root
|-- col: string (nullable = true)
|-- jsontostructs(col): array (nullable = true)
| |-- element: string (containsNull = true)
|-- jsontostructs(col): array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
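Not part of the original answer, but if the goal is to end up with a single array column regardless of which of the two shapes a row has, one possible sketch (assuming Spark 2.4+ for flatten) combines both parses with coalesce:
from pyspark.sql import functions as F, types as T

# Parse with the nested schema first; when that yields null (the ["A","B"] shape),
# fall back to the flat schema. flatten turns [["A"], ["B"]] into [A, B].
df3 = df.withColumn(
    "parsed",
    F.coalesce(
        F.flatten(F.from_json("col", T.ArrayType(T.ArrayType(T.StringType())))),
        F.from_json("col", T.ArrayType(T.StringType()))
    )
)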

Join two Spark DataFrames using a nested column and update one of the columns

I am working on a requirement in which I get one small table in the form of a CSV file, as follows:
root
|-- ACCT_NO: string (nullable = true)
|-- SUBID: integer (nullable = true)
|-- MCODE: string (nullable = true)
|-- NewClosedDate: timestamp (nullable = true)
We also have a very big external Hive table in Avro format, stored in HDFS, as follows:
root
|-- accountlinks: array (nullable = true)
| | |-- account: struct (nullable = true)
| | | |-- acctno: string (nullable = true)
| | | |-- subid: string (nullable = true)
| | | |-- mcode: string (nullable = true)
| | | |-- openeddate: string (nullable = true)
| | | |-- closeddate: string (nullable = true)
Now, the requirement is to look up the external Hive table based on three columns from the CSV file: ACCT_NO - SUBID - MCODE. If there is a match, update accountlinks.account.closeddate with NewClosedDate from the CSV file.
I have already written the following code to explode the required columns and join them with the small table, but I am not really sure how to update the closeddate field (currently null for all account holders) with NewClosedDate, because closeddate is a nested column and I cannot easily use withColumn to populate it. In addition, the schema and column names cannot be changed, as these files are linked to an external Hive table.
val df = spark.sql("select * from db.table where archive='201711'")
val ExtractedColumn = df
  .coalesce(150)
  .withColumn("ACCT_NO", explode($"accountlinks.account.acctno"))
  .withColumn("SUBID", explode($"accountlinks.account.acctsubid"))
  .withColumn("MCODE", explode($"C.mcode"))
val ReferenceData = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("file.csv")
val FinalData = ExtractedColumn.join(ReferenceData, Seq("ACCT_NO", "SUBID", "MCODE"), "left")
All you need is to explode the accountlinks array and then join the two DataFrames like this:
val explodedDF = df.withColumn("account", explode($"accountlinks"))
val joinCondition = $"ACCT_NO" === $"account.acctno" && $"SUBID" === $"account.subid" && $"MCODE" === $"account.mcode"
val joinDF = explodedDF.join(ReferenceData, joinCondition, "left")
Now you can update the account struct column as below, and use collect_list to get back the array structure:
val FinalData = joinDF
  .withColumn("account",
    struct($"account.acctno", $"account.subid", $"account.mcode",
      $"account.openeddate", $"NewClosedDate".alias("closeddate")
    )
  )
  .groupBy().agg(collect_list($"account").alias("accountlinks"))
The idea is to create a new struct with all the fields from account except closeddate, which you take from the NewClosedDate column.
If the struct contains many fields, you can build the field list programmatically (for example with a for-comprehension over the struct's fields) instead of typing them all out; see the sketch below.
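Not from the original answer, but here is a minimal sketch of that "build the field list programmatically" idea, written in PySpark for illustration (it assumes an equivalent DataFrame named joinDF with the same account and NewClosedDate columns):
from pyspark.sql import functions as F

# Take every field of the account struct except closeddate, then append
# NewClosedDate under the name closeddate.
account_fields = joinDF.select("account.*").columns
joinDF = joinDF.withColumn(
    "account",
    F.struct(
        *[F.col("account." + f) for f in account_fields if f != "closeddate"],
        F.col("NewClosedDate").alias("closeddate")
    )
)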

Retrieve subkey values of all the keys in a JSON Spark DataFrame

I have a data frame with a schema like below (I have a large number of keys):
|-- loginRequest: struct (nullable = true)
| |-- responseHeader: struct (nullable = true)
| | |-- status: long (nullable = true)
| | |-- code: long (nullable = true)
|-- loginResponse: struct (nullable = true)
| |-- responseHeader: struct (nullable = true)
| | |-- status: long (nullable = true)
| | |-- code: long (nullable = true)
I want to create a column with the status taken from responseHeader.status across all the keys.
Expected
+--------------------+--------------------+------------+
| loginRequest| loginResponse| status |
+--------------------+--------------------+------------+
|[0,1] | null| 0 |
| null|[0,1] | 0 |
| null| [0,1]| 0 |
| null| [1,0]| 1 |
+--------------------+--------------------+------------+
Thanks in Advance
A simple select will solve your problem.
You have a nested field:
loginResponse: struct (nullable = true)
| |-- responseHeader: struct (nullable = true)
| | |-- status
A quick way would be to flatten your dataframe.
Doing something like this:
df.select(df.col("loginRequest.*"),df.col("loginResponse.*"))
And work from there. Or, you could use something like this:
var explodeDF = df.withColumn("statusRequest", df("loginRequest.responseHeader"))
which you can work out with the help of these questions:
Flattening Rows in Spark
DataFrame explode list of JSON objects
In order to populate it from either the response or the request, you can use a when condition in Spark (see the sketch after the link below):
- How to use AND or OR condition in when in Spark
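Not from the original answer, but as an illustration of that when condition idea, in PySpark (assuming the DataFrame is called df):
from pyspark.sql import functions as F

# Take loginRequest's status when that struct is present, otherwise fall back
# to loginResponse's status.
df_status = df.withColumn(
    "status",
    F.when(F.col("loginRequest").isNotNull(),
           F.col("loginRequest.responseHeader.status"))
     .otherwise(F.col("loginResponse.responseHeader.status"))
)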
You are able to get the subfields with the . delimiter in the select statement, and with the help of the coalesce method you should get exactly what you aim for. Let's call the input DataFrame df, with your specified input schema; then this piece of code should do the work:
import org.apache.spark.sql.functions.{coalesce, col}
val df_status = df.withColumn("status",
  coalesce(
    col("loginRequest.responseHeader.status"),
    col("loginResponse.responseHeader.status")
  )
)
What coalesce does is take the first non-null value, in the order of the input columns passed to the method; if there is no non-null value, it returns null (see https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/functions.html#coalesce-org.apache.spark.sql.Column...-).

Convert array<map<string,string>> type to <string,string> in Scala

I am facing a problem in converting a column in my DataFrame to string format. An example of the DataFrame schema is as follows:
|-- example_code_b: string (nullable = true)
|-- example_code: array (nullable = true)
| |-- element: map (containsNull = true)
| | |-- key: string
| | |-- value: string (valueContainsNull = true)
I want to convert example_code from the current array(map(string,string)) to (string,string) format.
The input is in the form of [Map(entity -> PER), Map(entity -> PER)] and
I want the output to be in the form of PER,PER
You can either write a UDF in the DataFrame API or use the Dataset API to do it:
import spark.implicits._

df.as[Seq[Map[String, String]]]
  .map(s => s.reduce(_ ++ _))
  .toDF("example_code")
  .show()
Note that this does not consider the case of multiple keys: they are not "merged" but simply overwritten.
You can simply use the explode function on any array column, which will create a separate row for each value of the array.
val newDF = df.withColumn("mymap", explode(col("example_code")))
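Neither snippet produces the PER,PER string directly. A possible sketch for that last step, not from the original answers, written in PySpark and assuming Spark 2.4+ for the higher-order functions:
from pyspark.sql import functions as F

# transform(example_code, m -> map_values(m)) -> array of arrays of map values
# flatten(...)                                -> one flat array, e.g. [PER, PER]
# concat_ws(",", ...)                         -> "PER,PER"
df.withColumn(
    "example_code_str",
    F.concat_ws(",", F.expr("flatten(transform(example_code, m -> map_values(m)))"))
)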

Accessing a Nested Map column in Spark Dataframes without using explode

I have a column in a Spark dataframe where the schema looks something like this:
|-- seg: map (nullable = false)
| |-- key: string
| |-- value: array (valueContainsNull = false)
| | |-- element: struct (containsNull = false)
| | | |-- id: integer (nullable = false)
| | | |-- expiry: long (nullable = false)
The value in the column looks something like this:
Map(10000124 -> WrappedArray([20185255,1561507200], [20185256,1561507200]))
What I want to do is create a column from this Map column which contains only an array like [20185255, 20185256] (the elements of the array are the first element of each array in the WrappedArray). How do I do this?
I am trying not to use "explode".
Also, is there a way I can use a UDF which takes in the Map and gets those values?
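The thread ends here without an answer. One possible approach, an assumption on my part rather than something from the thread, uses Spark SQL higher-order functions (2.4+) so no explode is needed, plus the UDF variant that was asked about (assuming the DataFrame is called df):
from pyspark.sql import functions as F, types as T

# Without explode: map_values(seg) gives an array of the map's values (each one an
# array of structs); flatten concatenates them; transform keeps only the id field.
df_ids = df.withColumn("ids", F.expr("transform(flatten(map_values(seg)), x -> x.id)"))

# UDF variant, as asked: the map arrives in Python as a dict of lists of Rows.
extract_ids = F.udf(
    lambda seg: [e.id for arr in seg.values() for e in arr] if seg else None,
    T.ArrayType(T.IntegerType())
)
df_ids_udf = df.withColumn("ids", extract_ids(F.col("seg")))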