Stack overflow error when the same DataFrame is reused inside a loop - pyspark

When the same dataframe is reused inside a loop, a stack overflow error occurs.
The data volume is only 40k records. I have tried a single-node cluster with 14 GB/28 GB.
Sample data.
FT/RT,Country,Charge_Type,Tariff_Loc,Charge_No,Status,Validity_from,Validity_to,Range_Basis,Limited_Parties,Charge_Detail,Freetime_Unit,Freetime,Count_Holidays,Majeure,Start_Event,Same/Next_Day,Next_Day_if_AFTER,Availability_Date,Route_Group,Route_Code,Origin,LoadZone,FDischZone,PODZone,FDestZone,Equipment_Group,Equipment_Type,Range_From,Range_To,Cargo_Type,commodity,SC_Group,SC_Number,IMO,Shipper_Group,Cnee_Group,Direction,Service,haulage,Transport_Type,Option1,Option2,1st_of_Route_Group,1st_of_LoadZone,1st_of_FDischZone,1st_of_PODZone,1st_of_FDestZone,1st_of_Equipment_Group,1st_of_SC_Group,1st_of_Shipper_Group,1st_of_Cnee_Group,operationalFacilityGroup,operationalFacility,operator,commodityGroup,equipmentType,consignee,consigneeGroup,shipper,shipperGroup,serviceContract,serviceContractGroup,transportMode,agreementType
FT,IN,DET,INCCU,34298,EXPIRED,02-07-2020,30-11-2020,C/B,Y,2,DAY,14,Y,N,DISCHARG,S,null,N,MSL,null,null,null,null,null,null,ADRY,null,null,null,null,2313,null,ONLINE1,null,null,null,IMP,null,null,null,null,null,A1,null,null,null,null,20BULK,null,null,null,INCCU,,MSL,MSL,null,,null,,null,ONLINE1,null,null,SPOT
Expected output is as below.
It works for a few records; if the dataframe has more records, a stack overflow error occurs.
Please find the attached screenshot.

The error occurs mainly because of the usage of DataFrame.withColumn(). Calling this method many times, or inside a loop, can cause this error. Refer to the official documentation to understand DataFrame.withColumn().
pyspark.sql.DataFrame.withColumn — PySpark 3.3.0 documentation (apache.org)
The way to counter this error is to optimize the code so that withColumn() is not called repeatedly. Since you want to convert the data of multiple columns into JSON-like data, you can try the following code.
Instead of using a loop to add a new column of JSON data, use the create_map() function. This function combines multiple PySpark columns into one MapType() column.
from pyspark.sql.functions import create_map, lit, col

# Build the nested map in a single withColumn call instead of a loop.
df = df.withColumn("dealKeys", create_map(
    lit("Direction"), create_map(lit("Value"), col("Direction"), lit("Description"), lit("...")),
    lit("Country"), create_map(lit("Value"), col("Country"), lit("Description"), lit("..."))
))
The output will be as shown in the image below. I have created a MapType() with the column name as the key and another MapType() as the value; that inner map holds key-value pairs for the column's value and its description.
Though this output is not exactly the same as your requirement, this shape of data is much easier to access and transform further, even without using loops. You can use df['dealKeys.Direction'] to get its value (a MapType()), or df['dealKeys.Direction.Value'] to directly get the value it holds.
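For illustration only, here is a small assumed usage sketch (not part of the original answer) showing one way to read values back out of the dealKeys column created above; the column and key names simply follow the example:
from pyspark.sql.functions import col

# Pull individual values out of the nested map column "dealKeys".
df.select(
    col("dealKeys")["Direction"]["Value"].alias("direction_value"),
    col("dealKeys")["Country"]["Value"].alias("country_value")
).show(truncate=False)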

Related

How to extract specific value in a dictionary column with multiple lists

I'm trying to extract a specific value inside a column of a dataframe, as you can see in the next image, without any success; referring back to a similar question still didn't work for my code.
Is there any way to extract the values as [Culture, Climate change, technology, ...]?
Data
First Try
I have tried the split() function, but I reached a dead end because I still need the exact value after the word "name", and this new dataframe contains 75 columns. If I could just get a for loop to extract the value after the word "name", that is my latest idea for solving the problem.

dask groupby columns are not there even after reset_index

I am using groupby-apply on multiple columns with a user-defined function. However, these columns do not appear in the output data frame, even if I use reset_index.
Unfortunately, I could not create a minimal reproducible example without using my own data. My workaround was to explicitly add the groupby columns to the output data frame inside the user-defined function.
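A minimal sketch of that workaround, with assumed column names (key_a, key_b, value) and an assumed aggregation, might look like this:
import pandas as pd
import dask.dataframe as dd

def summarise(group: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical aggregation; the point is re-attaching the groupby columns.
    out = pd.DataFrame({"total": [group["value"].sum()]})
    out["key_a"] = group["key_a"].iloc[0]
    out["key_b"] = group["key_b"].iloc[0]
    return out

pdf = pd.DataFrame({"key_a": [1, 1, 2], "key_b": ["x", "y", "x"], "value": [10, 20, 30]})
ddf = dd.from_pandas(pdf, npartitions=2)

result = (
    ddf.groupby(["key_a", "key_b"])
       .apply(summarise, meta={"total": "i8", "key_a": "i8", "key_b": "object"})
       .compute()
)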

SCALA: How to use collect function to get the latest modified entry from a dataframe?

I have a scala dataframe with two columns:
id: String
updated: Timestamp
From this dataframe I just want to get the latest date, for which I use the following code at the moment:
df.agg(max("updated")).head()
// returns a row
I've just read about the collect() function, which I'm told is safer to use for such a problem - when it runs as a job, it appears it is not aggregating the max over the whole dataset, while it looks perfectly fine when running in a notebook - but I don't understand how it should be used.
I found an implementation like the following, but I could not figure out how it should be used...
df1.agg({"x": "max"}).collect()[0]
I tried it like the following:
df.agg(max("updated")).collect()(0)
Without (0) it returns an Array, which actually looks good. So the idea is that we should apply the aggregation on the whole dataset loaded into the driver, not just on a partition; otherwise it seems not to retrieve all the timestamps. My question now is: how is collect() actually supposed to work in such a situation?
Thanks a lot in advance!
I'm assuming that you are talking about a Spark DataFrame (not a plain Scala collection).
If you just want the latest date (only that column) you can do:
df.select(max("updated"))
You can see what's inside the dataframe with df.show(). Since DataFrames are immutable, you need to assign the result of the select to another variable, or add the show() after the select().
This will return a dataframe with just one row containing the max value of the "updated" column.
To answer your question:
So the idea is that we should apply the aggregation on the whole dataset loaded into the driver, not just on a partition; otherwise it seems not to retrieve all the timestamps
When you run a select on a dataframe, Spark selects data from the whole dataset; there is no separate partitioned version and driver version. Spark shards your data across your cluster, and all the operations that you define are performed on the entire dataset.
My question now is, how is collect() actually supposed to work in such a situation?
The collect operation converts a Spark dataframe into an array (which is not distributed), and that array ends up on the driver node. Bear in mind that if your dataframe size exceeds the memory available on the driver, you will get an OutOfMemoryError.
In this case if you do:
df.select(max("Timestamp")).collect().head
Your DF (which contains only one row with one column, your date) will be converted to a Scala array. In this case it is safe, because select(max()) will return just one row.
Take some time to read more about Spark DataFrames/RDDs and the difference between transformations and actions.
That sounds odd. First of all, you don't need to collect the dataframe to get the last element of a sorted dataframe. There are many answers on this topic:
How to get the last row from DataFrame?

df.withcolumn is too slow when I iterate through the column data in pyspark dataframe

I am doing AES encryption on a PySpark dataframe column.
I am iterating over the column data and replacing each column value with its encrypted value using df.withColumn, but it is too slow.
I am looking for an alternative approach, but I have not found one.
for i in column_data:
    obj = AES.new(key, AES.MODE_CBC, v)
    ciphertext = obj.encrypt(i)
    df = df.withColumn(col, F.when(df[col] == i, str(ciphertext)).otherwise(df[col]))
return df
But it's taking a long time.
Could you please suggest an alternative?
Your code is slow because of your for-loop, as it forces Spark to run only on one thread.
Please provide an example of input and expected output and someone might be able to help you with rewriting your code.
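In the meantime, here is a minimal sketch (not from the original post) of one common alternative: encrypting the column with a single withColumn call and a Python UDF instead of looping. It assumes pycryptodome, a hypothetical 16-byte key and IV, and a hypothetical column named col_to_encrypt:
from Crypto.Cipher import AES
from Crypto.Util.Padding import pad
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

key = b"0123456789abcdef"  # hypothetical 16-byte key
iv = b"abcdef9876543210"   # hypothetical 16-byte IV (fixed only for illustration)

def encrypt_value(value):
    if value is None:
        return None
    cipher = AES.new(key, AES.MODE_CBC, iv)
    return cipher.encrypt(pad(value.encode("utf-8"), AES.block_size)).hex()

encrypt_udf = F.udf(encrypt_value, StringType())

# One withColumn call over the whole column; Spark parallelises the UDF.
df = df.withColumn("col_to_encrypt", encrypt_udf(F.col("col_to_encrypt")))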

Getting the value of a DataFrame column in Spark

I am trying to retrieve the value of a DataFrame column and store it in a variable. I tried this:
val name=df.select("name")
val name1=name.collect()
But neither of the above returns the value of the column "name".
Spark version :2.2.0
Scala version :2.11.11
There are a couple of things here. If you want to see all the data, collect is the way to go. However, if your data is too huge, it will cause the driver to fail.
So the alternative is to check a few items from the dataframe. What I generally do is:
df.limit(10).select("name").as[String].collect()
This will provide an output of 10 elements. But the output doesn't look nice, so the second alternative is:
df.select("name").show(10)
This will print the first 10 elements. Sometimes, if the column values are big, it puts "..." instead of the actual value, which is annoying.
Hence there is a third option:
df.select("name").take(10).foreach(println)
This takes 10 elements and prints them.
Now, in none of these cases will you get a fair sample of the data, as the first 10 rows are picked. So to truly pick randomly from the dataframe you can use:
df.select("name").sample(.2, true).show(10)
or
df.select("name").sample(.2, true).take(10).foreach(println)
You can check the "sample" function on the dataframe.
The first will do :)
val name = df.select("name") will return another DataFrame. You can do, for example, name.show() to show the content of the DataFrame. You can also do collect (or collectAsMap on the underlying RDD) to materialize results on the driver, but be aware that the amount of data should not be too big for the driver.
You can also do:
val names = df.select("name").as[String].collect()
This will return an array of the names in this DataFrame.