PySpark: collect all the keys in a given data frame column

I'm a Spark beginner. I'm trying to collect all the keys present in a particular column, where different rows have different key-value pairs.
|-- A: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
A                                          ID
name: 'Peter', age: '25'                   5
name: 'John', country: 'USA', pet: 'dog'   7
I need to transform this to a data frame with all the keys as new columns. I tried exploding the column, which creates new "key" and "value" columns, but the data frame is a few GB in size and the Spark job fails.
dataframe.select(explode("A")).select("key").show()
The expected result is :
name   age   ID  country  pet
Peter  25    5   null     null
John   null  7   USA      dog
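One way this is sometimes approached (a sketch only, not a tested answer: it assumes Spark 2.3+ for map_keys and that the set of distinct key names is small enough to collect to the driver; dataframe, A and ID are taken from the question) is to collect the distinct key names first and then turn each key into its own column with getItem:
from pyspark.sql import functions as F

# Collect only the distinct key names (not the values) to the driver.
keys = [r["key"] for r in
        dataframe.select(F.explode(F.map_keys("A")).alias("key"))
                 .distinct()
                 .collect()]

# One column per key; rows that lack a key get null for it.
result = dataframe.select(F.col("ID"),
                          *[F.col("A").getItem(k).alias(k) for k in keys])
result.show()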

Related

Not able to write loop expression in withcolumn in pyspark

I have a dataframe where DealKeys has data like:
[{"Charge_Type": "DET", "Country": "VN", "Tariff_Loc": "VNSGN"}]
The expected output could be:
[{"keyname": "Charge_Type", "value": "DET", "description": "..."}, {"keyname": "Country", "value": "VN", "description": "..."}, {"keyname": "Tariff_Loc", "value": "VNSGN", "description": "..."}]
When I create the dataframe I get the below error:
df = df2.withColumn('new_column',({'keyname' : i, 'value' : dictionary[i],'description' : "..."} for i in col("Dealkeys")))
Error: Column is not iterable
DF2 schema:
root
|-- Charge_No: string (nullable = true)
|-- Status: string (nullable = true)
|-- DealKeys: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- Charge_Type: string (nullable = true)
| | |-- Country: string (nullable = true)
| | |-- Tariff_Loc: string (nullable = true)
We cannot iterate through a dataframe column in PySpark, hence the error occurred.
To get the expected output, you need to follow the approach specified below.
Create the new_column column with an empty string as its value beforehand so that we can update it as we iterate through each row.
Since a column cannot be iterated, we can use the collect() method to get the DealKeys values so that we can insert them into the corresponding new_column value.
df.collect() returns a list of rows (rows can be iterated through). As per the schema, each element of DealKeys is itself a Row. Using dealkey_row (a DealKeys element as a Row) with asDict(), perform a list comprehension to create the list of dictionaries that will be inserted against the corresponding Charge_No value.
# df is the initial dataframe
from pyspark.sql.functions import lit, col, when

df = df.withColumn("new_column", lit(''))
rows = df.collect()
for row in rows:
    key = row[0]             # Charge_No column value (string type)
    dealkey_row = row[2][0]  # first element of the DealKeys array (Row type)
    # dealkey_row.asDict() gives the Row's fields as a dictionary
    lst = [{'keyname': i, 'value': dealkey_row[i], 'description': "..."} for i in dealkey_row.asDict()]
    df = df.withColumn('new_column', when(col('Charge_No') == key, str(lst)).otherwise(col('new_column')))
df.show(truncate=False)
Row.asDict() converts a row into a dictionary so that the list comprehension can be done easily. Using withColumn() along with the when(<condition>, <update_value>) function in pyspark, insert the output of your list comprehension into the new_column column ('otherwise' helps to retain the previous value if the Charge_No value doesn't match).
The above code produced the expected output when I tested it.

How can I get the original datatype of the values after using f.coalesce in pyspark?

list = ["B", "A", "D", "C"]
data = [("B", "On","NULL",1632733508,"active"),
("B", "Off","NULL",1632733508, "active"),
("A","On","NULL",1632733511,"active"),
("A","Off","NULL",1632733512,"active"),
("D","NULL",450,1632733513,"inactive"),
("D","NULL",431,1632733515,"inactive"),
("C","NULL",20,1632733518,"inactive"),
("C","NULL",30,1632733521,"inactive")]
df = spark.createDataFrame(data, ["unique_string", "ID", "string_value", "numeric_value", "timestamp","mode"])
For splitting the df according to a list I have the following function.
def split_df(df, name):
    return (df.filter(
        f.col('ID') == name)
        .select(
            f.coalesce(f.col('string_value'),
                       f.col('numeric_value')).alias(name),
            f.col('timestamp'), f.col('mode')
        ))

dfs = [split_df(df, id) for id in list]
Startpoint
  ID string_value numeric_value   timestamp      mode
0  B           On          NULL  1632733508    active
1  B          Off          NULL  1632733508    active
2  A           On          NULL  1632733511    active
3  A          Off          NULL  1632733512    active
4  D         NULL           450  1632733513  inactive
5  D         NULL           431  1632733515  inactive
6  C         NULL            20  1632733518  inactive
7  C         NULL            30  1632733521  inactive
After using the function split_df there is a list of dataframes like the one below.
dfs[1].show()
     D   timestamp      mode
0  450  1632733513  inactive
1  431  1632733515  inactive
After using f.coalesce, all values in each column are strings. This is not good in the case of a numeric variable like ID "D". As printSchema shows, the "D" column is a string and not a double, and "timestamp" is also a string and not a long.
dfs[1].printSchema()
root
|-- D: string (nullable = true)
|-- timestamp: string (nullable = true)
|-- mode: string (nullable = true)
What do I have to do with the function to keep the original data types?
You can just cast it to whatever datatype you want, so something like
f.coalesce(
    f.col('str_col'),
    f.col('int_col'),
).cast('int')
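Applied to the split function above, a minimal sketch might look like the following (the value_type parameter and the cast targets are assumptions about what the "original" types should be, not something stated in the question):
from pyspark.sql import functions as f

def split_df(df, name, value_type):
    # value_type is a guess at the desired type per ID:
    # 'string' for B/A (On/Off values), 'double' or 'int' for D/C.
    return (df.filter(f.col('ID') == name)
              .select(
                  f.coalesce(f.col('string_value'), f.col('numeric_value'))
                   .cast(value_type)
                   .alias(name),
                  f.col('timestamp').cast('long').alias('timestamp'),
                  f.col('mode')))

dfs = [split_df(df, 'D', 'double'), split_df(df, 'B', 'string')]
dfs[0].printSchema()
# root
#  |-- D: double (nullable = true)
#  |-- timestamp: long (nullable = true)
#  |-- mode: string (nullable = true)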

Filter on length of arrays in a column containing arrays in Scala Spark dataframe [duplicate]

This question already has answers here:
Get the size/length of an array column
(3 answers)
Closed 4 years ago.
I have a schema for a DataFrame called "mydf" as follows:
root
|-- properties: struct
| |-- arrayCol: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- unimportantElem1: string (nullable = true)
| | | |-- unimportantElem2: integer (nullable = true)
I want to filter rows based on the "arrayCol" column having arrays with size (length of the array) equaling "s", and count the number of such rows.
mydf.filter(size($"properties.arrayCol") === 4).count()
Here I am filtering rows to find all rows having arrays of size 4 in column arrayCol.
Note that the arrayCol is nested (properties.arrayCol) so it might help someone with the use case of filtering on nested columns. I got the answer while posting the question.
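For comparison, since the surrounding posts are PySpark, a rough PySpark equivalent of the same filter (assuming mydf is a PySpark DataFrame with the same nested schema) would be:
from pyspark.sql import functions as F

# Count rows whose nested array properties.arrayCol has exactly 4 elements.
mydf.filter(F.size("properties.arrayCol") == 4).count()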

PySpark - Get the size of each list in group by

I have a massive pyspark dataframe. I need to group by Person and then collect their Budget items into a list, to perform a further calculation.
As an example,
a = [('Bob', 562,"Food", "12 May 2018"), ('Bob',880,"Food","01 June 2018"), ('Bob',380,'Household'," 16 June 2018"), ('Sue',85,'Household'," 16 July 2018"), ('Sue',963,'Household'," 16 Sept 2018")]
df = spark.createDataFrame(a, ["Person", "Amount","Budget", "Date"])
Group By:
import pyspark.sql.functions as F
df_grouped = df.groupby('person').agg(F.collect_list("Budget").alias("data"))
Schema:
root
|-- person: string (nullable = true)
|-- data: array (nullable = true)
| |-- element: string (containsNull = true)
However, I am getting a memory error when I try to apply a UDF on each person. How can I get the size (in megabytes or gigabytes) of each list (data) for each person?
I have done the following, but I am getting nulls:
import sys
from pyspark.sql.types import DoubleType

size_list_udf = F.udf(lambda data: sys.getsizeof(data)/1000, DoubleType())
df_grouped = df_grouped.withColumn("size", size_list_udf("data"))
df_grouped.show()
Output:
+------+--------------------+----+
|person| data|size|
+------+--------------------+----+
| Sue|[Household, House...|null|
| Bob|[Food, Food, Hous...|null|
+------+--------------------+----+
You just have one minor issue with your code: sys.getsizeof() returns the size of an object in bytes as an integer. You're dividing this by the integer value 1000 to get kilobytes, and in Python 2 that returns an integer. However, you defined your udf to return a DoubleType(). The simple fix is to divide by 1000.0.
import sys
import pyspark.sql.functions as f
from pyspark.sql.types import DoubleType

size_list_udf = f.udf(lambda data: sys.getsizeof(data)/1000.0, DoubleType())
df_grouped = df_grouped.withColumn("size", size_list_udf("data"))
df_grouped.show(truncate=False)
#+------+-----------------------+-----+
#|person|data |size |
#+------+-----------------------+-----+
#|Sue |[Household, Household] |0.112|
#|Bob |[Food, Food, Household]|0.12 |
#+------+-----------------------+-----+
I have found that in cases where a udf is returning null, the culprit is very frequently a type mismatch.
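As a hypothetical toy illustration of that failure mode (the dataframe and udfs below are made up, not taken from the question): a udf declared with DoubleType but returning a Python int yields nulls, while a matching return type does not.
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType, IntegerType

toy = spark.createDataFrame([(1,), (2,)], ["x"])

# Declared DoubleType, but the lambda returns an int -> the column becomes null.
bad_udf = F.udf(lambda x: x * 2, DoubleType())
# Declared type matches what the lambda actually returns -> values survive.
good_udf = F.udf(lambda x: x * 2, IntegerType())

toy.withColumn("bad", bad_udf("x")).withColumn("good", good_udf("x")).show()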

UDF to Concatenate Arrays of Undefined Case Class Buried in a Row Object

I have a dataframe, called sessions, with columns that may change over time. (Edit to Clarify: I do not have a case class for the columns - only a reflected schema.) I will consistently have a uuid and clientId in the outer scope with some other inner and outer scope columns that might constitute a tracking event so ... something like:
root
|-- runtimestamp: long (nullable = true)
|-- clientId: long (nullable = true)
|-- uuid: string (nullable = true)
|-- oldTrackingEvents: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- timestamp: long (nullable = true)
| | |-- actionid: integer (nullable = true)
| | |-- actiontype: string (nullable = true)
| | |-- <tbd ... maps, arrays and other stuff matches sibling> section
...
|-- newTrackingEvents: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- timestamp: long (nullable = true)
| | |-- actionid: integer (nullable = true)
| | |-- actiontype: string (nullable = true)
| | |-- <tbd ... maps, arrays and other stuff matches sibling>
...
I'd like to now merge oldTrackingEvents and newTrackingEvents with a UDF containing these parameters and yet-to-be resolved code logic:
val mergeTEs = udf((oldTEs: Seq[Row], newTEs: Seq[Row]) =>
  // do some stuff - figure out the best way
  // - to merge both groups of tracking events
  // - to remove duplicate tracking event structures
  // - to limit total tracking events to < 500
  result // same type as the UDF input params
)
The UDF's return result would be an array of the structure that is the resulting list of the two concatenated fields.
QUESTION:
My question is how to construct such a UDF: (1) use of the correct passed-in parameter types, (2) a way to manipulate these collections within a UDF, and (3) a clear way to return a value that doesn't produce a compiler error. I unsuccessfully tested Seq[Row] for the input / output (with val testUDF = udf((trackingEvents : Seq[Row]) => trackingEvents)) and received the error java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Row is not supported for a direct return of trackingEvents. However, I get no error for returning Some(1) instead of trackingEvents. What is the best way to manipulate the collections so that I can concatenate two lists of identical structures, as suggested by the schema above, with the UDF performing the activity described in the comments section? The goal is to use this operation:
sessions.select(mergeTEs('oldTrackingEvents, 'newTrackingEvents).as("cleanTrackingEvents"))
And in each row, ... get back a single array of 'trackingEvents' structure in a memory / speed efficient manner.
SUPPLEMENTAL:
Looking at a question shown to me ... there's a possible hint, if it's relevant: Defining a UDF that accepts an Array of objects in a Spark DataFrame? It notes that to create a struct, the function passed to udf has to return a Product type (Tuple* or case class), not Row.
Perhaps ... this other post is relevant / useful.
I think that the question you've linked explains it all, so just to reiterate. When working with a udf:
The input representation for a StructType is a weakly typed Row object.
The output type for a StructType has to be a Scala Product. You cannot return a Row object.
If this is too much of a burden, you should use a strongly typed Dataset:
val f: T => U
sessions.as[T].map(f): Dataset[U]
where T is an algebraic data type representing the Session schema, and U is an algebraic data type representing the result.
Alternatively ... if your goal is to merge sequences of some arbitrary row structure / schema with some manipulation, this is an alternative, generally-stated approach that avoids the partitioning discussion:
From the master dataframe, create dataframes for each trackingEvents section, new and old. With each, select the exploded 'trackingEvents' section's columns. Save these val dataframe declarations as newTE and oldTE.
Create another dataframe, where columns that are picked are unique to each tracking event in the arrays of oldTrackingEvents and newTrackingEvents such as each's uuid, clientId and event timestamp. Your pseudo-schema would be:
(uuid: String, clientId : Long, newTE : Seq[Long], oldTE : Seq[Long])
Use a UDF to join the two simple sequences of your structure, both Seq[Long], which is something like this untested example:
val limitEventsUDF = udf { (newTE: Seq[Long], oldTE: Seq[Long], limit: Int, tooOld: Long) =>
  (newTE ++ oldTE).filter(_ > tooOld).sortWith(_ > _).distinct.take(limit)
}
The UDF will return a column of cleaned tracking events, and you now have a very slim dataframe, with the excess events removed, to self-join back to your exploded newTE and oldTE frames after those have been unioned back together.
GroupBy as needed thereafter using collect_list.
Still ... this seems like a lot of work. Should this be voted as "the answer"? I'm not sure.