Saving a dataframe with columns (e.g. "a", "b") as Parquet and then reading that Parquet file at a later point in time does not preserve the column order the file was saved with (it could come back as "b", "a", for example).
Unfortunately, I was not able to figure out how the order is influenced or how I can control it.
How can I keep the original column order when reading in Parquet?
PARQUET-188 suggests that column ordering is not part of the parquet spec, so it's probably not a good idea to rely on the ordering. You could however manage this yourself, e.g. by loading/saving the dataframe columns in lexicographical order, or by storing the column names.
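As an illustration, here is a minimal sketch in Spark/Scala (the stored column list and path are hypothetical placeholders) that re-imposes a saved column order after reading:

import org.apache.spark.sql.functions.col

// column order recorded at write time (hypothetical; could also be read from a small metadata file)
val savedOrder = Seq("a", "b")

// re-impose the original order after reading the Parquet data back
val restored = spark.read.parquet("/path/to/data").select(savedOrder.map(col): _*)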
Suppose I create a Polars LazyFrame from a list of CSV files using pl.concat():
df = pl.concat([pl.scan_csv(file) for file in ['file1.csv', 'file2.csv']])
Is the data in the resulting dataframe guaranteed to have the exact order of the input files, or could there be a scenario where the query optimizer would mix things up?
The order is maintained. The engine may execute them in a different order, but the final result will always have the same order as the lazy computations provided by the caller.
I am creating a process in Spark Scala within an ETL that checks for events that occurred during the ETL process. I start with an empty dataframe, and if events occur this dataframe is filled with information (a dataframe can't be filled in place; it can only be joined with other dataframes that have the same structure). At the end of the process the generated dataframe is loaded into a table, but it can happen that the dataframe ends up empty because no event occurred, and I don't want to load an empty dataframe because that makes no sense. So I'm wondering if there is an elegant way to load the dataframe into the table only if it is not empty, without using an if condition. Thanks!!
I recommend creating the dataframe anyway; if you don't create it with the same schema, even when it's empty, your operations/transformations on the DF could fail because they might refer to columns that are not present.
To handle this, you should always create a DataFrame with the same schema, meaning the same column names and datatypes, regardless of whether the data exists or not. You can populate it with data later.
If you still want to do it your way, here are a few options for Spark 2.1.0 and above:
df.head(1).isEmpty
df.take(1).isEmpty
df.limit(1).collect().isEmpty
These are equivalent.
I don't recommend using df.count > 0 because it scans the whole dataset (linear time complexity), and you would still have to do a check like df != null beforehand.
A much better solution would be:
df.rdd.isEmpty
Or since Spark 2.4.0 there is also Dataset.isEmpty.
As you can see, whatever you decide to do, there is a check you need to perform somewhere, so you can't really get rid of the if condition if you want to avoid loading an empty dataframe.
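For illustration, a minimal sketch (the dataframe and table names are placeholders; Dataset.isEmpty assumes Spark 2.4.0+):

// write the events dataframe only if it actually contains rows
if (!eventsDf.isEmpty) {
  eventsDf.write.mode("append").saveAsTable("etl_events")
}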
I have a scala dataframe with two columns:
id: String
updated: Timestamp
From this dataframe I just want to get out the latest date, for which I use the following code at the moment:
df.agg(max("updated")).head()
// returns a row
I've just read about the collect() function, which I'm told is safer to use for such a problem (when it runs as a job, it appears not to aggregate the max over the whole dataset, although it looks perfectly fine when running in a notebook), but I don't understand how it should be used.
I found an implementation like the following, but I could not figure out how it should be used...
df1.agg({"x": "max"}).collect()[0]
I tried it like the following:
df.agg(max("updated")).collect()(0)
Without (0) it returns an Array, which actually looks good. So the idea is that we should apply the aggregation on the whole dataset loaded into the driver, not just the partitioned version, otherwise it seems not to retrieve all the timestamps. My question now is: how is collect() actually supposed to work in such a situation?
Thanks a lot in advance!
I'm assuming that you are talking about a spark dataframe (not scala).
If you just want the latest date (only that column) you can do:
df.select(max("updated"))
You can see what's inside the dataframe with df.show(). Since dataframes are immutable, you need to assign the result of the select to another variable or chain the show after the select().
This will return a dataframe with just one row holding the max value of the "updated" column.
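For example (a small sketch using the column from your question):

import org.apache.spark.sql.functions.max

val maxDf = df.select(max("updated"))  // dataframe with a single row and column
maxDf.show()                           // prints the latest timestamp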
To answer your question:
So the idea is that we should apply the aggregation on the whole dataset loaded into the driver, not just the partitioned version, otherwise it seems not to retrieve all the timestamps
When you select on a dataframe, Spark selects data from the whole dataset; there is no "partitioned version" versus "driver version". Spark distributes your data across the cluster, and all the operations you define are applied to the entire dataset.
My question now is, how is collect() actually supposed to work in such a situation?
The collect operation converts a Spark dataframe into an array (which is not distributed), and that array lives on the driver node. Bear in mind that if your dataframe size exceeds the memory available on the driver, you will get an OutOfMemoryError.
In this case if you do:
df.select(max("updated")).collect().head
Your DF (which contains only one row with one column, your date) will be converted to a Scala array. In this case it is safe because select(max()) returns just one row.
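A small sketch of pulling the value out (assuming "updated" really is a timestamp column):

import org.apache.spark.sql.functions.max

// collect() brings the single-row result to the driver; getTimestamp(0) reads the value
val latest: java.sql.Timestamp = df.select(max("updated")).collect().head.getTimestamp(0)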
Take some time to read more about Spark dataframes/RDDs and the difference between transformations and actions.
It sounds odd. First of all, you don't need to collect the dataframe to get the last element of a sorted dataframe. There are many answers to this topic:
How to get the last row from DataFrame?
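For instance, one common pattern from those answers, sketched here with the column from your question:

import org.apache.spark.sql.functions.col

// sort descending by the timestamp and take the first row on the driver
val latestRow = df.orderBy(col("updated").desc).head()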
I have a use case where I have to load millions of JSON-formatted records into Apache Hive tables.
So my solution was simply to load them into a dataframe and write them out as Parquet files.
Then I create an external table on top of them.
I am using Apache Spark 2.1.0 with Scala 2.11.8.
It so happens that all the messages follow a somewhat flexible schema.
For example, a column "amount" can have the value 1.0 or 1.
Since I am transforming data from a semi-structured format to a structured format, but my schema is slightly variable, I assumed the inferSchema option for data sources like JSON would help me.
spark.read.option("inferSchema","true").json(RDD[String])
When I used inferSchema as true while reading the JSON data:
Case 1: for smaller data, all the Parquet files have amount as double.
Case 2: for larger data, some Parquet files have amount as double and others have int64.
I tried to debug and found concepts like schema evolution and schema merging, which went over my head and left me with more doubts than answers.
My doubts/questions are:
When I infer the schema, does it not enforce the inferred schema onto the full dataset?
Since I cannot enforce a fixed schema due to my constraints, I thought of casting the whole column to the double datatype, since it can contain both integers and decimal numbers. Is there a simpler way?
My guess is that since the data is partitioned, inferSchema works per partition and then gives me a general schema, but it does not do anything like enforcing that schema. Please correct me if I am wrong.
Note: The reason I am using the inferSchema option is that the incoming data is too flexible/variable to enforce a case class of my own, though some of the columns are mandatory. If you have a simpler solution, please suggest it.
Schema inference really just processes all the rows to find the types.
Once it does that, it merges the results to find a schema common to the whole dataset.
For example, some of your fields may have values in some rows but not in others, so the inferred schema for such a field becomes nullable.
To answer your question, it's fine to infer schema for your input.
However, since you intend to use the output in Hive you should ensure all the output files have the same schema.
An easy way to do this is to use casting (as you suggest). I typically like to do a select at the final stage of my jobs and just list all the columns and types. I feel this makes the job more human-readable.
e.g.
import spark.implicits._                                      // enables the $"colName" column syntax
import org.apache.spark.sql.types.{IntegerType, StringType}

df
  .coalesce(numOutputFiles)
  .select(
    $"col1"        .cast(IntegerType).as("col1"),
    $"col2"        .cast(StringType).as("col2"),
    $"someOtherCol".cast(IntegerType).as("col3")
  )
  .write.parquet(outPath)
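Alternatively, if you would rather not rely on inference at all, you can pass an explicit schema when reading. A hedged sketch (the "id" field is a placeholder, and json(RDD[String]) is the variant available in Spark 2.1.0):

import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

// declare "amount" as double up front so both 1 and 1.0 land in the same type
val explicitSchema = StructType(Seq(
  StructField("id", StringType, nullable = true),      // placeholder column
  StructField("amount", DoubleType, nullable = true)
))

val parsed = spark.read.schema(explicitSchema).json(jsonRdd)  // jsonRdd: RDD[String]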
I need to run Spark SQL queries with my own custom correspondence from table names to Parquet data. Reading Parquet data to DataFrames with sqlContext.read.parquet and registering the DataFrames with df.registerTempTable isn't cutting it for my use case, because those calls have to be run before the SQL query, when I might not even know what tables are needed.
Rather than using registerTempTable, I'm trying to write an Analyzer that resolves table names using my own logic. However, I need to be able to resolve an UnresolvedRelation to a LogicalPlan representing Parquet data, but sqlContext.read.parquet gives a DataFrame, not a LogicalPlan.
A DataFrame seems to have a logicalPlan attribute, but that's marked protected[sql]. There's also a ParquetRelation class, but that's private[sql]. That's all I found for ways to get a LogicalPlan.
How can I resolve table names to Parquet with my own logic? Am I even on the right track with Analyzer?
You can actually retrieve the logicalPlan of your DataFrame with
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

val myLogicalPlan: LogicalPlan = myDF.queryExecution.logical
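If you then want to substitute such a plan for an UnresolvedRelation, a rough sketch follows (API details differ between Spark versions, and pathForTable is a hypothetical lookup you would supply):

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// hypothetical rule: replace unresolved table references with the logical
// plan of a Parquet-backed DataFrame chosen by your own naming logic
class ResolveParquetTables(sqlContext: SQLContext, pathForTable: String => String)
    extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case u: UnresolvedRelation =>
      sqlContext.read.parquet(pathForTable(u.tableName)).queryExecution.logical
  }
}

Wiring such a rule into the analyzer is version-specific (in Spark 1.x it typically meant subclassing SQLContext and overriding its analyzer), so treat this as a starting point rather than a drop-in solution.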