df.withColumn is too slow when I iterate through the column data in a pyspark dataframe - pyspark

I am doing AES encryption on a PySpark dataframe column.
I iterate over the column data and replace each column value with its encrypted value using df.withColumn, but it is too slow.
I am looking for an alternative approach, but I have not found one.
for i in column_data:
    obj = AES.new(key, AES.MODE_CBC, v)
    ciphertext = obj.encrypt(i)
    df = df.withColumn(col, F.when(df[col] == i, str(ciphertext)).otherwise(df[col]))
return df
But it's taking a long time.
Could you please suggest an alternative?

Your code is slow because of your for-loop: it issues one withColumn call per value, which forces Spark to process the values one at a time instead of in parallel.
Please provide an example of input and expected output and someone might be able to help you with rewriting your code.
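For the record, the usual fix is to push the per-value work into a function that Spark applies once to the whole column (a UDF), instead of issuing one withColumn per value. Below is a minimal, Spark-free sketch of that row-level function; a toy XOR cipher stands in for AES so the sketch runs anywhere (the real job would build the pycryptodome cipher inside the function, as in the question), and the comments show how it would be wrapped in PySpark.

```python
import base64

def encrypt_value(plaintext: str, key: bytes) -> str:
    # Toy stand-in for AES so this sketch runs anywhere: XOR with the key.
    # In the real job this body would use pycryptodome, e.g.
    #   AES.new(key, AES.MODE_CBC, iv).encrypt(padded_bytes)
    data = plaintext.encode("utf-8")
    xored = bytes(b ^ key[i % len(key)] for i, b in enumerate(data))
    return base64.b64encode(xored).decode("ascii")

# In PySpark, register the function once and apply it column-wise,
# so there is a single withColumn call for the whole column:
#   from pyspark.sql import functions as F
#   encrypt_udf = F.udf(lambda v: encrypt_value(v, key))
#   df = df.withColumn(col_name, encrypt_udf(df[col_name]))

key = b"sixteen byte key"
token = encrypt_value("secret", key)
print(token)
```

The point is that the loop over rows disappears: Spark evaluates the UDF on each partition in parallel, and the query plan contains one projection instead of one per distinct value.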

Related

Avoid loading an empty dataframe into a table

I am creating a process in Spark Scala within an ETL that checks for events that occurred during the ETL process. I start with an empty dataframe, and if events occur this dataframe is filled with information (a dataframe can't be filled in place; it can only be joined with other dataframes with the same structure). The thing is that at the end of the process the generated dataframe is loaded into a table, but it can happen that the dataframe ends up empty because no event has occurred, and I don't want to load an empty dataframe because it makes no sense. So I'm wondering if there is an elegant way to load the dataframe into the table only if it is not empty, without using an if condition. Thanks!!
I recommend creating the dataframe anyway; if you don't create a table with the same schema, even if it's empty, your operations/transformations on the DF could fail, as they could refer to columns that may not be present.
To handle this, you should always create a DataFrame with the same schema, which means the same column names and datatypes, regardless of whether the data exists or not. You can populate it with data later.
If you still want to do it your way, I can point out a few options for Spark 2.1.0 and above:
df.head(1).isEmpty
df.take(1).isEmpty
df.limit(1).collect().isEmpty
These are equivalent.
I don't recommend using df.count > 0, because it is linear in time complexity, and you would still have to do a check like df != null beforehand.
A much better solution would be:
df.rdd.isEmpty
Or since Spark 2.4.0 there is also Dataset.isEmpty.
As you can see, whatever you decide to do, there is a check somewhere that you need to make, so you can't really get rid of the if condition - the requirement itself is conditional: load the table only if the dataframe is not empty.
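To make the shape of that guarded write concrete, here is a minimal, Spark-free sketch in Python (write_to_table is a hypothetical writer callback; in Spark the condition would be one of the emptiness checks listed above, e.g. df.isEmpty on 2.4+ or df.head(1).isEmpty earlier):

```python
def write_if_nonempty(rows, write_to_table):
    # Stand-in for: if (!df.isEmpty) df.write.saveAsTable(...)
    # (Spark 2.4+; on 2.1-2.3 use df.head(1).isEmpty instead.)
    if rows:                      # plays the role of !df.isEmpty
        write_to_table(rows)
        return True               # loaded
    return False                  # skipped: nothing to load

table = []
print(write_if_nonempty([{"event": "x"}], table.extend))
print(write_if_nonempty([], table.extend))
```

The if is still there; it has simply been folded into one reusable helper so the ETL's main flow stays clean.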

Iterating through a DataFrame using Pandas UDF and outputting a dataframe

I have a piece of code that I want to translate into a Pandas UDF in PySpark but I'm having a bit of trouble understanding whether or not you can use conditional statements.
def is_pass_in(df):
    x = list(df["string"])
    result = []
    for i in x:
        if "pass" in i:
            result.append("YES")
        else:
            result.append("NO")
    df["result"] = result
    return df
The code is super simple: all I'm trying to do is iterate through a column in which each row contains a sentence. I want to check whether the word "pass" is in that sentence and, if so, append "YES" to a list that will later become a column right next to the df["string"] column. I've tried to do this using a Pandas UDF, but the error messages I'm getting are something I don't understand because I'm new to Spark. Could someone point me in the correct direction?
There is no need to use a UDF; this can be done natively in PySpark as follows. (Even in pandas I would advise against iterating the way you have done - use np.where() instead.)
df.withColumn('result', when(col('string').contains('pass'), 'YES').otherwise('NO')).show()
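As a sanity check on the logic itself, here is the row-level rule in plain Python; in PySpark the entire loop collapses into a single when(...).otherwise(...) expression (shown in the comment), with no UDF and no loop at all:

```python
def label_pass(sentences):
    # "YES" if the word "pass" appears in the sentence, else "NO".
    # PySpark equivalent (one expression, no UDF, no loop):
    #   df.withColumn("result",
    #       when(col("string").contains("pass"), "YES").otherwise("NO"))
    return ["YES" if "pass" in s else "NO" for s in sentences]

print(label_pass(["you shall not pass", "all clear here"]))
# → ['YES', 'NO']
```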

Stack overflow error occurs when the same dataframe is repeated inside a loop in pyspark

A stack overflow error occurs when the same dataframe is repeatedly modified inside a loop.
The data volume is just 40k records. I tried cluster sizes of a single node with 14 GB/28 GB.
Sample data.
FT/RT,Country,Charge_Type,Tariff_Loc,Charge_No,Status,Validity_from,Validity_to,Range_Basis,Limited_Parties,Charge_Detail,Freetime_Unit,Freetime,Count_Holidays,Majeure,Start_Event,Same/Next_Day,Next_Day_if_AFTER,Availability_Date,Route_Group,Route_Code,Origin,LoadZone,FDischZone,PODZone,FDestZone,Equipment_Group,Equipment_Type,Range_From,Range_To,Cargo_Type,commodity,SC_Group,SC_Number,IMO,Shipper_Group,Cnee_Group,Direction,Service,haulage,Transport_Type,Option1,Option2,1st_of_Route_Group,1st_of_LoadZone,1st_of_FDischZone,1st_of_PODZone,1st_of_FDestZone,1st_of_Equipment_Group,1st_of_SC_Group,1st_of_Shipper_Group,1st_of_Cnee_Group,operationalFacilityGroup,operationalFacility,operator,commodityGroup,equipmentType,consignee,consigneeGroup,shipper,shipperGroup,serviceContract,serviceContractGroup,transportMode,agreementType
FT,IN,DET,INCCU,34298,EXPIRED,02-07-2020,30-11-2020,C/B,Y,2,DAY,14,Y,N,DISCHARG,S,null,N,MSL,null,null,null,null,null,null,ADRY,null,null,null,null,2313,null,ONLINE1,null,null,null,IMP,null,null,null,null,null,A1,null,null,null,null,20BULK,null,null,null,INCCU,,MSL,MSL,null,,null,,null,ONLINE1,null,null,SPOT
Expected output as below
It works for a few records, but if the dataframe has more records a stack overflow error occurs.
Please find the attached screenshot.
The error occurs mainly because of the usage of DataFrame.withColumn(). Calling this method many times, or inside loops, can cause it. Refer to the official documentation to understand DataFrame.withColumn():
pyspark.sql.DataFrame.withColumn — PySpark 3.3.0 documentation (apache.org)
The only way to counter this error is to optimize the code. Since you want to convert the data of multiple columns to JSON data, you can try the following approach.
Instead of using loops to add a new column consisting of JSON data, use the create_map() function. This function combines multiple PySpark columns into one MapType() column.
from pyspark.sql.functions import *

df = df.withColumn("dealKeys", create_map(
    lit("Direction"), create_map(lit("Value"), col("Direction"), lit("Description"), lit("...")),
    lit("Country"), create_map(lit("Value"), col("Country"), lit("Description"), lit("..."))
))
The output will be as shown in the image below. I have created a MapType() whose keys are the column names and whose values are themselves MapType() values, each holding key-value pairs for the column's value and its description.
Though this output is not the same as your requirement, this shape of the data is much easier to access and to transform further, even without loops. You can use df['dealKeys.Direction'] to get its value (a MapType()), or df['dealKeys.Direction.Value'] to directly get the value it is holding.
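The nested structure that create_map() builds can be pictured with plain Python dicts (the description strings here are placeholders, exactly as in the answer's lit("...")):

```python
DESCRIPTIONS = {"Direction": "...", "Country": "..."}  # placeholder descriptions

def deal_keys(row):
    # Mirrors the create_map() expression: each listed column becomes a key
    # whose value is a {"Value": ..., "Description": ...} map.
    return {
        name: {"Value": row.get(name), "Description": DESCRIPTIONS[name]}
        for name in ("Direction", "Country")
    }

row = {"Direction": "IMP", "Country": "IN"}
print(deal_keys(row)["Direction"]["Value"])
# → IMP
```

Building the whole map in one expression is what avoids the loop of withColumn calls, and therefore the ever-growing query plan that triggers the stack overflow.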

SCALA: How to use collect function to get the latest modified entry from a dataframe?

I have a scala dataframe with two columns:
id: String
updated: Timestamp
From this dataframe I just want to get out the latest date, for which I use the following code at the moment:
df.agg(max("updated")).head()
// returns a row
I've just read about the collect() function, which I'm told is safer to use for such a problem - when my code runs as a job, it appears it is not aggregating the max over the whole dataset, although it looks perfectly fine when running in a notebook - but I don't understand how it should be used.
I found an implementation like the following, but I could not figure out how it should be used:
df1.agg({"x": "max"}).collect()[0]
I tried it like the following:
df.agg(max("updated")).collect()(0)
Without (0) it returns an Array, which actually looks good. So the idea is that we should apply the aggregation on the whole dataset loaded in the driver, not just a partitioned version; otherwise it seems not to retrieve all the timestamps. My question now is: how is collect() actually supposed to work in such a situation?
Thanks a lot in advance!
I'm assuming that you are talking about a Spark dataframe (not a plain Scala collection).
If you just want the latest date (only that column) you can do:
df.select(max("updated"))
You can see what's inside the dataframe with df.show(). Since dataframes are immutable, you need to assign the result of the select to another variable or chain .show() after the select().
This will return a dataframe with just one row holding the max value of the "updated" column.
To answer to your question:
So the idea is that we should apply the aggregation on the whole dataset loaded in the driver, not just a partitioned version; otherwise it seems not to retrieve all the timestamps
When you select on a dataframe, Spark selects data from the whole dataset; there is no "partitioned version" versus "driver version". Spark shards your data across your cluster, and all the operations that you define are performed on the entire dataset.
My question now is, how is collect() actually supposed to work in such a situation?
The collect operation converts a Spark dataframe into an array (which is not distributed), and that array lives on the driver node. Bear in mind that if your dataframe size exceeds the memory available on the driver, you will get an OutOfMemoryError.
In this case if you do:
df.select(max("updated")).collect().head
Your DF (which contains only one row with one column, your date) will be converted to a Scala array. In this case it is safe, because select(max()) returns just one row.
Take some time to read more about Spark dataframes/RDDs and the difference between transformations and actions.
That sounds odd. First of all, you don't need to collect the dataframe to get the last element of a sorted dataframe. There are many answers on this topic:
How to get the last row from DataFrame?
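As a driver-side picture of what the two steps return (stand-in data, no Spark needed): collect() hands back an array holding a single one-field row, and the scalar sits one index deeper:

```python
# (id, updated) stand-ins for the two-column dataframe
rows = [("a", 3), ("b", 7), ("c", 5)]

# df.agg(max("updated")).collect() -> a list with a single one-field row
agg_result = [(max(updated for _, updated in rows),)]

row = agg_result[0]        # like .collect()(0): a Row, not the value
latest = row[0]            # like row.getAs[...](0): the actual max
print(row, latest)
# → (7,) 7
```

This is why collect()(0) alone still prints as a Row: one more index (or a getAs call in Scala) is needed to reach the timestamp itself.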

Apache Spark Multiple Aggregations

I am using Apache spark in Scala to run aggregations on multiple columns in a dataframe for example
select column1, sum(1) as count from df group by column1
select column2, sum(1) as count from df group by column2
The actual aggregation is more complicated than just the sum(1), but that's beside the point.
Query strings such as the examples above are compiled for each variable that I would like to aggregate, and I execute each string through a Spark SQL context to create a corresponding dataframe representing the aggregation in question.
The nature of my problem is that I would have to do this for thousands of variables.
My understanding is that Spark will have to "read" the main dataframe each time it executes an aggregation.
Is there maybe an alternative way to do this more efficiently?
Thanks for reading my question, and thanks in advance for any help.
Go ahead and cache the dataframe after you build it from your source data. Also, to avoid hard-coding all the queries, put them in a file and pass the file at run time, and have something in your code that reads the file and runs the queries. The best part about this approach is that you can change your queries by updating the file, not the application. Just make sure you find a way to give each output a unique name.
In PySpark, it would look something like this.
dataframe = sqlContext.read.parquet("/path/to/file.parquet")
# do your manipulations/filters
dataframe.cache()
dataframe.createOrReplaceTempView("df")  # so the queries can refer to "df"

queries = ...  # however you want to read/parse the query file
for i, query in enumerate(queries):
    output = sqlContext.sql(query)
    output.write.parquet("/path/to/output_{}.parquet".format(i))  # unique name per query
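The query-file idea can be sketched without Spark at all: one statement per line, and each result written under a unique, per-query name (the paths here are hypothetical; in Spark the loop body would be spark.sql(query).write.parquet(out_path), with the cached dataframe registered as the "df" temp view):

```python
from io import StringIO

# Stand-in for the query file passed at run time: one statement per line.
query_file = StringIO(
    "select column1, sum(1) as count from df group by column1\n"
    "select column2, sum(1) as count from df group by column2\n"
)

queries = [line.strip() for line in query_file if line.strip()]
for i, query in enumerate(queries):
    out_path = "/path/to/output_{}.parquet".format(i)  # unique name per query
    # In Spark: spark.sql(query).write.parquet(out_path)
    print(out_path)
```

Because the source dataframe is cached once before the loop, each of the thousands of queries reuses the in-memory data instead of re-reading the parquet files.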