I wrote some code that is supposed to load many tables (listed by the method LTables) into different DataFrames in Scala using Spark.
Here is my code:
LTables.iterator.foreach { table =>
  TableProcessor.execute(sparkSession, filterTenant, table)
  if (table.TableDf.count() > 0) {
    GenerateCsv.execute(sparkSession, table.TableDf, table.OutputFilename, filterTenant)
  }
}
In my foreach loop, TableProcessor.execute runs an SQL query, puts the result into a DataFrame, and applies a filter; GenerateCsv then writes the filtered data to a CSV file.
The thing is, I have a lot of tables with large amounts of data to process, so the full run is very slow (I tried it with a list of 160 tables).
I know Spark is great at processing one big DataFrame and not so great at dealing with many DataFrames, but I have to fetch the tables separately using SQL queries.
If you have a solution or advice to help me make this code run faster, it would be great.
Thanks for helping.
So we have a PySpark DataFrame which has around 25k records. We are trying to perform a count/empty check on this, and it is taking too long. We tried:
df.count()
df.rdd.isEmpty()
len(df.head(1))==0
Converted to pandas and tried pandas_df.empty (a property, not a method)
Tried the arrow option
df.cache() and df.persist() before the counts
df.repartition(n)
Tried writing the df to DBFS, but writing is also taking quite a long time (cancelled after 20 mins)
Could you please help us understand what we are doing wrong.
Note: there are no duplicate values in df, and we have done multiple joins to form it.
Without looking at df.explain() it's challenging to pinpoint the issue, but it certainly seems like you could have a skewed data set.
(Skew usually shows up in the Spark UI as one executor taking a lot longer than the others to finish its partitions.) If you are on a recent version of Spark, there are tools to help with this out of the box:
spark.sql.adaptive.enabled = true
spark.sql.adaptive.skewJoin.enabled = true
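If it's easier to manage these cluster-wide than per session, the same switches can be set in spark-defaults.conf or passed to spark-submit with --conf (property names as in the Spark 3.x configuration docs):

```
# spark-defaults.conf (or: spark-submit --conf spark.sql.adaptive.enabled=true ...)
spark.sql.adaptive.enabled          true
spark.sql.adaptive.skewJoin.enabled true
```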
Count is not taking too long; it's taking the time it needs to complete what you asked Spark to do. To refine what it's doing, you should do things you are likely already doing: filter the data before joining so that only the critical rows are transferred into the joins, and review your data for skew and program around it if you can't use adaptive query execution.
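The filter-before-join idea can be illustrated with a plain-Python stand-in (in Spark the same pattern means applying .filter()/WHERE before the join so fewer rows are shuffled; the data and column names here are made up for illustration):

```python
# Two small "tables": filter orders first so only the rows we
# actually need participate in the join.
orders = [
    {"id": 1, "customer": "a", "amount": 10},
    {"id": 2, "customer": "b", "amount": 250},
    {"id": 3, "customer": "a", "amount": 300},
]
customers = [{"name": "a", "region": "EU"}, {"name": "b", "region": "US"}]

# Filter BEFORE joining: only large orders reach the join step.
large = [o for o in orders if o["amount"] > 100]

# Simple hash join on customer name.
by_name = {c["name"]: c for c in customers}
joined = [{**o, **by_name[o["customer"]]} for o in large]
# joined contains only the 2 filtered rows, enriched with region
```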
Convince yourself this is a data issue. Limit each source table to 1,000 or 10,000 records and see if it runs fast. Then, one at a time, remove the limit from a single table (keeping the limit on all the others) and find the table that is the source of your problem. Then study that table and figure out how you can work around the issue (if adaptive query execution can't fix it).
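The one-table-at-a-time strategy above can be sketched generically; `run_pipeline` is a hypothetical callable standing in for your job, taking a dict of {table_name: row_limit or None} and returning the runtime in seconds:

```python
def find_slow_table(tables, run_pipeline, limit=10_000):
    """Unlimit one table at a time and time the run; the table whose
    unlimited run is slowest is the likely source of the problem."""
    timings = {}
    for candidate in tables:
        limits = {t: limit for t in tables}
        limits[candidate] = None  # remove the limit from this table only
        timings[candidate] = run_pipeline(limits)
    return max(timings, key=timings.get)
```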
Finally, if you are using Hive tables, you should make sure the table stats are up to date:
ANALYZE TABLE mytable COMPUTE STATISTICS;
I am currently analyzing the execution time of a combination of queries executed with RDDs and with DataFrames using PySpark. Both take the same data and return the same result; however, the DataFrame version is almost 30% faster.
I have read a lot that PySpark DataFrames are superior, but I want to find out why. I came across the Catalyst Optimizer, which is used by DataFrames but not by RDDs. To check the extent of Catalyst's impact, I would like to disable it completely and then compare the execution times again.
Is there a way to do this? I found some guides for Scala, but nothing for Python.
df = df_full[df_full.part_col.isin(['part_a', 'part_b'])]
df = df[df.some_other_col == 'some_value']
# df has shape of roughly (240k, 200)
# df_full has shape of roughly (30m, 200)
df.to_pandas().reset_index().to_csv('testyyy.csv', index=False)
If I do any groupby operation it is amazingly fast. The issue arises when I try to export a small subset of this large dataset to CSV: I am eventually able to export the DataFrame, but it takes far too long.
Warnings:
2022-05-08 13:01:15,948 WARN window.WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
df[column_name] = series
Note: part_a and part_b are stored as two separate parquet partitions. Also, I am using pyspark.pandas on Spark 3+.
So the question is: what is happening? And what is the most efficient way to export the filtered DataFrame to CSV?
I have 3 DataFrames, each with 50 columns and millions of records. I need to apply some common transformations on the above DataFrames.
Currently, I'm keeping those DataFrames in a Scala List and performing the operations on each of them Iteratively.
My question is: is it OK to keep big DataFrames in a Scala collection, or will it cause any performance issues? If so, what is the best way to work on multiple DataFrames iteratively?
Thanks in advance.
There is no issue doing so, as the List just holds references to your DataFrames, and DataFrames in Spark are lazily evaluated.
So until you call an action on one of the DataFrames, it will not be materialized.
And as soon as the action finishes, the intermediate data is cleaned up.
So it is equivalent to processing them separately three times; hence there is no issue with your approach.
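The "the List just holds references" point can be illustrated in plain Python; the same holds for a Scala List of DataFrames, where the collection stores pointers to lazy query plans rather than materialized data:

```python
# Putting an object into a list stores a reference, not a copy, so the
# list itself costs almost nothing regardless of the object's size.
big = list(range(1_000_000))  # stand-in for a large DataFrame
frames = [big, big, big]

# All three entries point at the very same object; no data was duplicated.
assert all(f is big for f in frames)
```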
I have a use case where I want to read a subsection of the data and then perform some operations on it.
I am doing something like this on spark.
DataFrame salesDf = sqlContext.read().format("com.stratio.datasource.mongodb").options(options).load();
This seems to load all the data on the table/collection. When I perform a select operation using
salesDf.registerTempTable("sales");
sqlContext.sql("SELECT * FROM sales WHERE exit_date = '08-08-2016'").show();
It seems like the select operation is being performed on the DataFrame that was just loaded, which looks like a complete waste of space to me.
My table/collection has close to 1 billion records and I want to process just 100,000 of them; loading 1B records into an object seems like a complete waste of space.
Please excuse the naivety of my question; I am very, very new to Spark. It would be wonderful if someone could show me a way to load just a subsection of the table data instead of loading everything and then processing it.