Let me first start with my scenario:
I have a huge dataframe stored in HDFS. I load the dataframe in a Spark session and create a new column without changing any of the existing content. After this, I want to store the dataframe back to its original directory in HDFS.
Now, I know I can practically do this with Spark's write operation in the fashion df.write.parquet("my_df_path", mode="overwrite"). Since the data is immense, I'm investigating whether there is, so to speak, a column-wise append mode or method that does not write the complete dataframe back but only the difference to the stored data. The final target is to save both memory and computational effort for the HDFS system.
Related
When two or more Spark jobs have the same output directory, mutual deletion of files is inevitable.
I'm writing a dataframe in append mode with Spark 2.4.4, and I want to add a timestamp to Spark's tmp dir to avoid these deletions.
Example:
my Spark job writes to hdfs:/outputFile/0/tmp/file1.parquet
the same Spark job, called with other data, writes to hdfs:/outputFile/0/tmp/file2.parquet
I want jobSpark1 to write to hdfs:/outputFile/0/tmp+(timeStamp)/file1.parquet
and the other job to write to hdfs:/outputFile/0/tmp+(timeStamp)/file2.parquet, and then move the parquet files to hdfs:/outputFile/
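For illustration, here is a rough sketch of the staging idea described in the question: write each run to its own timestamped tmp directory and then move the finished parquet files with a Hadoop FileSystem rename. All paths and names below are placeholders, not the actual job's values.
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("staged-append").getOrCreate()
val df = spark.range(10).toDF("id")   // stand-in for the real dataframe

// each run gets its own staging directory, so concurrent jobs never share a tmp dir
val staging = s"hdfs:///outputFile/0/tmp_${System.currentTimeMillis()}"
df.write.mode(SaveMode.Append).parquet(staging)

// move every finished parquet file from the staging dir into the final dir
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.listStatus(new Path(staging))
  .filter(_.getPath.getName.endsWith(".parquet"))
  .foreach { f =>
    fs.rename(f.getPath, new Path("hdfs:///outputFile/", f.getPath.getName))
  }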
df.write
  .option("mapreduce.fileoutputcommitter.algorithm.version", "2")
  .partitionBy("XXXXXXXX")
  .mode(SaveMode.Append)
  .format(fileFormat)
  .save(path)
When Spark appends data to an existing dataset, it uses FileOutputCommitter to manage staging output files and final output files. The behavior of FileOutputCommitter has a direct impact on the performance of jobs that write data.
A FileOutputCommitter has two methods, commitTask and commitJob. Apache Spark 2.0 and higher versions use Apache Hadoop 2, which uses the value of mapreduce.fileoutputcommitter.algorithm.version to control how commitTask and commitJob work. In Hadoop 2, the default value of mapreduce.fileoutputcommitter.algorithm.version is 1. With this version, commitTask moves data generated by a task from the task temporary directory to the job temporary directory, and when all tasks complete, commitJob moves the data from the job temporary directory to the final destination.
Because the driver does the work of commitJob, this operation can take a long time on cloud storage. You may often think that your cell is "hanging". However, when the value of mapreduce.fileoutputcommitter.algorithm.version is 2, commitTask moves data generated by a task directly to the final destination and commitJob is basically a no-op.
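As a hedged sketch, the committer switch can also be set once on the session's Hadoop configuration rather than on every writer; the app name, partition column and output path below are illustrative placeholders, not values from the original job.
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("append-job").getOrCreate()

// version 2: commitTask moves task output directly to the final destination,
// so commitJob becomes close to a no-op
spark.sparkContext.hadoopConfiguration
  .set("mapreduce.fileoutputcommitter.algorithm.version", "2")

// tiny illustrative dataframe; a real job would use its own data
val df = spark.range(100).withColumn("bucket", (col("id") % 10).cast("string"))

df.write
  .partitionBy("bucket")
  .mode(SaveMode.Append)
  .format("parquet")
  .save("hdfs:///outputFile/")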
We've been exploring using Glue to transform some JSON data to parquet. One scenario we tried was adding a column to the parquet table. So partition 1 has columns [A] and partition 2 has columns [A,B]. Then we wanted to write further Glue ETL jobs to aggregate the parquet table but the new column was not available. Using glue_context.create_dynamic_frame.from_catalog to load the dynamic frame our new column was never in the schema.
We tried several configurations for our table crawler: a single schema for all partitions, a single schema per S3 path, and a schema per partition. We could always see the new column in the Glue table data, but it was always null when we queried it from a Glue job using PySpark. The column was present in the parquet when we downloaded some samples and was available for querying via Athena.
Why are the new columns not available to pyspark?
This turned out to be a spark configuration issue. From the spark docs:
Like Protocol Buffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with a simple schema, and gradually add more columns to the schema as needed. In this way, users may end up with multiple Parquet files with different but mutually compatible schemas. The Parquet data source is now able to automatically detect this case and merge schemas of all these files.
Since schema merging is a relatively expensive operation, and is not a necessity in most cases, we turned it off by default starting from 1.5.0. You may enable it by
setting data source option mergeSchema to true when reading Parquet files (as shown in the examples below), or
setting the global SQL option spark.sql.parquet.mergeSchema to true.
We could enable schema merging in two ways.
set the option on the Spark session: spark.conf.set("spark.sql.parquet.mergeSchema", "true")
set mergeSchema to true in the additional_options when loading the dynamic frame:
source = glueContext.create_dynamic_frame.from_catalog(
    database="db",
    table_name="table",
    additional_options={"mergeSchema": "true"}
)
After that the new column was available in the frame's schema.
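For comparison, here is a minimal sketch of the two options from the Spark docs applied outside of Glue, assuming a placeholder table path.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("merge-schema").getOrCreate()

// option 1: enable merging globally for the session
spark.conf.set("spark.sql.parquet.mergeSchema", "true")

// option 2: enable merging per read, directly on the Parquet data source
val merged = spark.read
  .option("mergeSchema", "true")
  .parquet("hdfs:///warehouse/my_table")

// columns added only in newer partitions now appear in the schema;
// rows from older partitions carry null for them
merged.printSchema()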
I am a beginner in Spark and am trying to understand the mechanics of Spark dataframes.
I am comparing the performance of SQL queries on a Spark SQL dataframe when loading data from CSV versus Parquet. My understanding is that once the data is loaded into a Spark dataframe, it shouldn't matter where the data was sourced from (CSV or Parquet). However, I see a significant performance difference between the two. I am loading the data using the following commands and then writing queries against it.
dataframe_csv = sqlcontext.read.format("csv").load()
dataframe_parquet = sqlcontext.read.parquet()
Please explain the reason for the difference.
The reason you see different performance between CSV and Parquet is that Parquet uses a columnar storage format while CSV is a plain text format. Columnar storage is better for achieving a smaller storage size, but plain text is faster to read into a dataframe.
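One way to see the layout difference is to compare the physical plans Spark produces for the same query over both sources; a rough sketch, with placeholder paths and column name:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-vs-parquet").getOrCreate()

val fromCsv = spark.read.option("header", "true").csv("hdfs:///data/events_csv")
val fromParquet = spark.read.parquet("hdfs:///data/events_parquet")

// CSV has to read and parse whole text lines even when only one field is
// needed; Parquet can read just the chunks of the selected column
fromCsv.select("user_id").explain()
fromParquet.select("user_id").explain()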
I know parquet files store meta data, but is it possible to add custom metadata to a parquet file, using Scala (preferably) using Spark?
The idea is that I store many parquet files with a similar structure in Hadoop storage, but each has a uniquely named source (a String field, also present as a column in the parquet file). However, I'd like to access this information without the overhead of actually reading the parquet, and possibly even remove this redundant column from the parquet.
I really don't want to put this info in a filename, so my best option now is just to read the first row of each parquet and use the source column as a String field.
It works, but I was just wondering if there is a better way.
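For what it's worth, a small sketch of that workaround; because Parquet is columnar, selecting only the source column of a single row should avoid reading the rest of the file. The path below is a placeholder; "source" is the column name from the question.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("read-source-field").getOrCreate()

// only the "source" column is requested, so the other columns of the
// parquet file are not materialised
val source: String = spark.read
  .parquet("hdfs:///data/part_x.parquet")
  .select("source")
  .head()
  .getString(0)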
I have a Spark Streaming environment with Spark 1.2.0 where I retrieve data from a local folder, and every time I find a new file added to the folder I perform some transformation.
val ssc = new StreamingContext(sc, Seconds(10))
val data = ssc.textFileStream(directory)
In order to perform my analysis on the DStream data I have to transform it into an Array:
import scala.collection.mutable.ArrayBuffer

// collect every batch's records into a driver-side buffer
var arr = new ArrayBuffer[String]()
data.foreachRDD { rdd =>
  arr ++= rdd.collect()
}
Then I use the data obtained to extract the information I want and save it to HDFS.
val myRDD = sc.parallelize(arr)
myRDD.saveAsTextFile("hdfs directory....")
Since I really need to manipulate the data as an Array, it's impossible to save the data to HDFS with DStream.saveAsTextFiles("...") (which would work fine), so I have to save the RDD, but with this procedure I finally end up with empty output files named part-00000 etc...
With arr.foreach(println) I am able to see the correct results of the transformations.
My suspicion is that Spark tries at every batch to write data into the same files, deleting what was previously written. I tried to save into a dynamically named folder like myRDD.saveAsTextFile("folder" + System.currentTimeMillis().toString()), but always only one folder is created and the output files are still empty.
How can I write an RDD into HDFS in a spark-streaming context?
You are using Spark Streaming in a way it wasn't designed for. I'd recommend either dropping Spark for your use case or adapting your code so it works the Spark way. Collecting the array to the driver defeats the purpose of using a distributed engine and makes your app effectively single-machine (two machines will also cause more overhead than just processing the data on a single machine).
Everything you can do with an array, you can do with Spark. So just run your computations inside the stream, distributed on the workers, and write your output using DStream.saveAsTextFiles(). You can use foreachRDD + saveAsParquet(path, overwrite = true) to write to a single file.
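A minimal sketch of what that could look like, with a placeholder transformation standing in for the real analysis and placeholder directories:
import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sc = new SparkContext()   // or reuse the existing SparkContext
val ssc = new StreamingContext(sc, Seconds(10))

val data = ssc.textFileStream("file:///local/input/dir")

// transform each record inside the stream; nothing is collected to the driver
val transformed = data.map(_.toUpperCase)

// writes one output directory per batch: <prefix>-<batch time>.<suffix>
transformed.saveAsTextFiles("hdfs:///output/run", "txt")

ssc.start()
ssc.awaitTermination()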
#vzamboni: The Spark 1.5+ DataFrames API has this feature:
dataframe.write().mode(SaveMode.Append).format(FILE_FORMAT).partitionBy("parameter1", "parameter2").save(path);