Does `pl.concat([lazyframe1, lazyframe2])` strictly preserve the order of the input dataframes? - python-polars

Suppose I create a Polars LazyFrame from a list of CSV files using pl.concat():
df = pl.concat([pl.scan_csv(file) for file in ['file1.csv', 'file2.csv']])
Is the data in the resulting dataframe guaranteed to have the exact order of the input files, or could there be a scenario where the query optimizer would mix things up?

The order is maintained. The engine may execute them in a different order, but the final result will always have the same order as the lazy computations provided by the caller.

Related

Spark : Dynamic generation of the query based on the fields in s3 file

Oversimplified Scenario:
A process generates monthly data in an S3 file. The number of fields can differ in each monthly run. Based on this data in S3, we load the data into a table and manually run SQL for a few metrics (manually, because columns may be added or removed in each run). There are more calculations/transforms on this data, but to start I'm presenting a simpler version of the use case.
Approach:
Given the schema-less nature of the data (the number of fields in the S3 file can differ in each run as fields are added or removed, which requires manual changes to the SQL every time), I'm planning to explore Spark/Scala so that we can read directly from S3 and dynamically generate the SQL based on the fields.
Query:
How can I achieve this in Scala/Spark SQL/the DataFrame API? The S3 file contains only the required fields from each run, so reading the dynamic fields from S3 is not an issue; the dataframe takes care of that. The issue is how to generate the DataFrame API/Spark SQL code that handles them.
I can read the S3 file into a dataframe and register it with createOrReplaceTempView to write SQL, but that doesn't help if I still have to change the Spark SQL manually whenever a new field is added in the next run. What is the best way to dynamically generate the SQL, or is there a better way to handle the issue?
Usecase-1:
First-run
dataframe: customer,month_1_count (here dataframe directly points to s3, which has only required attributes)
--sample code
SELECT customer,sum(month_1_count)
FROM dataframe
GROUP BY customer
--Dataframe API/SparkSQL
dataframe.groupBy("customer").sum("month_1_count").show()
Second-Run - One additional column was added
dataframe: customer,month_1_count,month_2_count (here dataframe directly points to s3, which has only required attributes)
--Sample SQL
SELECT customer,sum(month_1_count),sum(month_2_count)
FROM dataframe
GROUP BY customer
--Dataframe API/SparkSQL
dataframe.groupBy("customer").sum("month_1_count","month_2_count").show()
I'm new to Spark/Scala; it would be helpful if you could point me in a direction so that I can explore further.
It sounds like you want to perform the same operation over and over again on new columns as they appear in the dataframe schema? This works:
# predicate for the column names you want to sum; I put in "month"
column_search = lambda col_name: 'month' in col_name
# get the column names of a temp dataframe with only the columns you want to sum
relevant_columns = original_df.select(*filter(column_search, original_df.columns)).columns
# create a dictionary of relevant column names to be passed to the agg function
columns = {col_name: "sum" for col_name in relevant_columns}
# apply the agg function with your groupBy, passing in the columns dictionary
grouped_df = original_df.groupBy("customer").agg(columns)
# show result
grouped_df.show()
Some important concepts that can help you learn:
A DataFrame's column names are available as a list: dataframe.columns
Functions can be applied to lists to create new lists, as in "column_search"
The agg function accepts multiple expressions in a dictionary (column name mapped to aggregate function name), which is what I pass in as "columns"
Spark is lazy, so it doesn't change data state or perform operations until you perform an action like show(). This means that creating a temporary dataframe just to use one of its attributes, such as columns as I do here, is not costly, even though it may seem inefficient if you're used to SQL.
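To make the "generate it from the column list" idea concrete without a Spark session, here is a minimal plain-Python sketch. It takes a list of column names (what dataframe.columns would return) and builds both the dictionary for agg and an equivalent SQL string. The function name build_sum_query, its parameters, and the "sum_" output aliases are assumptions made for this illustration, not part of any Spark API.

```python
def build_sum_query(columns, group_col="customer", table="dataframe", marker="month"):
    """Build an agg() dictionary and an equivalent SQL string.

    Hypothetical helper: `columns` is a plain list of column names,
    `marker` is the substring that identifies the columns to sum.
    """
    # Keep only the metric columns, never the grouping column itself.
    metric_cols = [c for c in columns if marker in c and c != group_col]

    # Dictionary form, suitable for df.groupBy(group_col).agg(agg_map).
    agg_map = {c: "sum" for c in metric_cols}

    # SQL form, suitable for a registered temp view named `table`.
    select_list = ", ".join([group_col] + [f"sum({c}) AS sum_{c}" for c in metric_cols])
    sql = f"SELECT {select_list} FROM {table} GROUP BY {group_col}"
    return agg_map, sql


# Second run from the question: one additional month column.
agg_map, sql = build_sum_query(["customer", "month_1_count", "month_2_count"])
print(agg_map)
print(sql)
```

When a new month column shows up in a later run, the same call picks it up with no manual change to the query.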

SCALA: How to use collect function to get the latest modified entry from a dataframe?

I have a scala dataframe with two columns:
id: String
updated: Timestamp
From this dataframe I just want to get out the latest date, for which I use the following code at the moment:
df.agg(max("updated")).head()
// returns a row
I've just read about the collect() function, which I'm told is safer to use for such a problem. When the code runs as a job, it appears not to aggregate the max over the whole dataset, although it looks perfectly fine when run in a notebook. However, I don't understand how collect() should be used.
I found an implementation like the following, but I could not figure how it should be used...
df1.agg({"x": "max"}).collect()[0]
I tried it like the following:
df.agg(max("updated")).collect()(0)
Without (0) it returns an Array, which actually looks good. So the idea is that we should apply the aggregation on the whole dataset loaded in the driver, not just the partitioned version, otherwise it seems to not retrieve all the timestamps. My question now is, how is collect() actually supposed to work in such a situation?
Thanks a lot in advance!
I'm assuming that you are talking about a Spark dataframe (not a Scala one).
If you just want the latest date (only that column) you can do:
df.select(max("updated"))
You can see what's inside the dataframe with df.show(). Since dataframes are immutable, you need to assign the result of the select to another variable or call show() directly after the select().
This will return a dataframe with just one row with the max value in "updated" column.
To answer your question:
So the idea is that we should apply the aggregation on the whole dataset loaded in the driver, not just the partitioned version, otherwise it seems to not retrieve all the timestamps
When you select on a dataframe, Spark selects data from the whole dataset; there is no separate partitioned version and driver version. Spark shards your data across the cluster, and all the operations that you define are applied to the entire dataset.
My question now is, how is collect() actually supposed to work in such a situation?
The collect operation converts a Spark dataframe into an array (which is not distributed), and that array lives on the driver node. Bear in mind that if your dataframe's size exceeds the memory available on the driver, you will get an OutOfMemoryError.
In this case if you do:
df.select(max("updated")).collect().head
Your DF (which contains only one row with one column, your date) will be converted to a Scala array. In this case it is safe, because the select(max()) returns just one row.
Take some time to read more about spark dataframe/rdd and the difference between transformation and action.
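The point about partitions can be modelled in a few lines of plain Python. This is only a toy model under stated assumptions, not how Spark is implemented: the nested list `partitions` stands in for data spread across executors, and the helper names `distributed_max` and `collect` are invented for this sketch. It shows that the maximum is computed over every partition and that collecting afterwards only moves the single result row.

```python
from datetime import datetime

# Toy stand-in for a dataframe whose "updated" column is spread over 4 partitions.
partitions = [
    [datetime(2020, 1, 1), datetime(2020, 3, 5)],
    [datetime(2021, 7, 9)],
    [],  # a partition may be empty
    [datetime(2019, 12, 31), datetime(2021, 2, 2)],
]


def distributed_max(parts):
    """Max per partition first, then max of the partial results."""
    partials = [max(p) for p in parts if p]
    return max(partials) if partials else None


def collect(rows):
    """Model of collect(): materialise the (already tiny) result locally."""
    return list(rows)


# The aggregated "dataframe" has one row; collecting it is cheap and safe.
result_rows = collect([distributed_max(partitions)])
latest = result_rows[0]
print(latest)
```

Collecting before aggregating would instead pull every timestamp to the driver, which is the case the memory warning above is about.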
It sounds weird. First of all, you don't need to collect the dataframe to get the last element of a sorted dataframe. There are many answers on this topic:
How to get the last row from DataFrame?

Caching Large Dataframes in Spark Effectively

I am currently working on 11,000 files. Each file generates a data frame, which is unioned with the previous one. Below is the code:
var df1 = sc.parallelize(Array(("temp", 100))).toDF("key", "value").withColumn("Filename", lit("Temp"))
files.foreach(filename => {
  val a = filename.getPath.toString()
  val m = a.split("/")
  val name = m(6)
  println("FILENAME: " + name)
  if (name == "_SUCCESS") {
    println("Cannot Process '_SUCCESS' Filename")
  } else {
    val freqs = doSomething(a).toDF("key", "value").withColumn("Filename", lit(name))
    df1 = df1.unionAll(freqs)
  }
})
First, I got a java.lang.StackOverflowError on 11,000 files. Then I added the following line after df1=df1.unionAll(freqs):
df1=df1.cache()
It resolves the problem, but each iteration gets slower than the last. Can somebody please suggest what should be done to avoid the StackOverflowError without the slowdown?
Thanks!
The issue is that Spark manages a dataframe as a set of transformations. It begins with the "toDF" of the first dataframe, then performs the transformations on it (e.g. withColumn), then the unionAll with the previous dataframe, etc.
The unionAll is just another such transformation, and the tree becomes very deep (with 11K unionAll calls you have an execution tree of depth 11K). Processing a tree that deep can cause a stack overflow.
The caching doesn't solve this. However, I imagine you are adding some action along the way (otherwise nothing would run besides building the transformations). When you cache, Spark might skip some of the steps, and therefore the stack overflow would simply arrive later.
You can go back to RDDs for an iterative process (your example is actually not iterative but purely parallel: you can simply keep each separate dataframe along the way, then convert them to RDDs and use RDD union).
Since your case seems to be just unioning a bunch of dataframes without true iterations, you can also do the union in a tree manner (i.e. union pairs, then union pairs of pairs, etc.). This would change the depth from O(N) to O(log N), where N is the number of unions.
Lastly, you can write the dataframe to disk and read it back. The idea is that after every X (e.g. 20) unions, you would do df1.write.parquet(filex) and then df1 = spark.read.parquet(filex). After reading it back, the lineage of the dataframe is just the file read itself. The cost, of course, is the writing and reading of the file.
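The tree-style union mentioned above can be sketched in plain Python. This is a model under assumptions rather than Spark code: `Plan` is a made-up stand-in for a dataframe that records its rows and the depth of its transformation tree, and `union` plays the role of unionAll. The helper `tree_reduce` combines neighbours pairwise, level by level, so row order is preserved while the depth grows logarithmically.

```python
from functools import reduce
from typing import Callable, List, NamedTuple, TypeVar

T = TypeVar("T")


class Plan(NamedTuple):
    """Hypothetical stand-in for a dataframe: its rows plus its plan depth."""
    rows: list
    depth: int


def union(a: Plan, b: Plan) -> Plan:
    # Each union adds one level on top of the deeper of its two inputs.
    return Plan(a.rows + b.rows, max(a.depth, b.depth) + 1)


def tree_reduce(items: List[T], combine: Callable[[T, T], T]) -> T:
    """Combine adjacent pairs repeatedly until one item is left."""
    if not items:
        raise ValueError("tree_reduce needs at least one item")
    level = list(items)
    while len(level) > 1:
        merged = [combine(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            merged.append(level[-1])  # odd one out moves up unchanged
        level = merged
    return level[0]


frames = [Plan([i], 0) for i in range(11000)]

chained = reduce(union, frames)       # what the loop in the question builds
balanced = tree_reduce(frames, union)  # pairwise, tree-shaped union

print(chained.depth, balanced.depth)
```

In Spark the same helper could be called with the list of per-file dataframes and a function that unions two of them, assuming all of them share the same schema.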

Comparing columns in two data frame in spark

I have two dataframes, both of them contain different number of columns.
I need to compare three fields between them to check if those are equal.
I tried the following approach, but it's not working.
if(df_table_stats("rec_cnt").equals(df_aud("REC_CNT")) || df_table_stats("hashcount").equals(df_aud("HASH_CNT")) || round(df_table_stats("hashsum"),0).equals(round(df_aud("HASH_TTL"),0)))
{
println("Job executed successfully")
}
df_table_stats("rec_cnt") returns a Column rather than the actual value, hence the condition evaluates to false.
Also, please explain the difference between df_table_stats.select("rec_cnt") and df_table_stats("rec_cnt").
Thanks.
Use SQL and inner join both dataframes on your conditions.
Per my comment, the expressions you're using are simple column references; they don't actually return data. Assuming you MUST use Spark for this, you'd want a method that actually returns the data, known in Spark as an action. For this case you can use take to return the first Row of data and extract the desired columns:
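import org.apache.spark.sql.Row

// take(1) is an action: it runs the query and returns the first Row to the driver
val tableStatsRow: Row = df_table_stats.take(1).head
val audRow: Row = df_aud.take(1).head
// getAs casts the value; adjust the type parameter to the column's actual type
val tableStatsRecCount = tableStatsRow.getAs[Int]("rec_cnt")
val audRecCount = audRow.getAs[Int]("REC_CNT")
// repeat for the other values you need to capture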
However, Spark definitely is overkill if this is all you're using it for. You could use a simple JDBC library for Scala like ScalikeJDBC to do these queries and capture the primitives in the results.

Apache Spark Multiple Aggregations

I am using Apache Spark in Scala to run aggregations on multiple columns in a dataframe, for example:
select column1, sum(1) as count from df group by column1
select column2, sum(1) as count from df group by column2
The actual aggregation is more complicated than just the sum(1), but that's beside the point.
Query strings such as the examples above are compiled for each variable that I would like to aggregate, and I execute each string through a Spark SQL context to create a corresponding dataframe that represents the aggregation in question.
The nature of my problem is that I would have to do this for thousands of variables.
My understanding is that Spark will have to "read" the main dataframe each time it executes an aggregation.
Is there maybe an alternative way to do this more efficiently?
Thanks for reading my question, and thanks in advance for any help.
Cache the DataFrame once you have built it from your source data. Also, to avoid writing all the queries in the code, put them in a file and pass the file at run time. Have something in your code that reads the file, and then run the queries from it. The best part about this approach is that you can change your queries by updating the file rather than the application. Just make sure you find a way to give the outputs unique names.
In PySpark, it would look something like this.
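dataframe = sqlContext.read.parquet("/path/to/file.parquet")
# do your manipulations/filters
dataframe.cache()
# register the cached DataFrame so the queries can refer to it as "df"
# (on Spark 2.0+ use createOrReplaceTempView instead)
dataframe.registerTempTable("df")
queries = ...  # however you want to read/parse the query file
for i, query in enumerate(queries):
    output = sqlContext.sql(query)
    # index-based name (an assumption) so each output gets a unique path
    output.write.parquet("/path/to/output_%d.parquet" % i)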
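The "read the queries from a file" step is left open above. One possible shape for it, in plain Python and under clearly hypothetical assumptions: the file holds SQL statements separated by semicolons, and each output is named after the query's position in the file. The helper names `load_queries` and `output_path` are invented for this sketch.

```python
def load_queries(text):
    """Split the text of a query file into individual SQL statements.

    Assumes statements are separated by ';' and that no statement
    contains a ';' inside a string literal.
    """
    return [q.strip() for q in text.split(";") if q.strip()]


def output_path(base_dir, index):
    """Derive a unique output location from the query's position."""
    return f"{base_dir}/aggregation_{index:04d}.parquet"


# Inline text stands in for the contents of the query file.
query_file_text = """
select column1, sum(1) as count from df group by column1;
select column2, sum(1) as count from df group by column2;
"""

queries = load_queries(query_file_text)
paths = [output_path("/path/to/output", i) for i, _ in enumerate(queries)]

for query, path in zip(queries, paths):
    print(path, "<-", query)
```

Naming outputs by position keeps them unique, but the names change if queries are reordered in the file; a name stored alongside each query would be more stable.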