Write parquet files concurrently with pyspark - pyspark

In Azure Databricks, I would like to write to the same set of parquet files concurrently from multiple notebooks using python / pyspark. I partitioned the target files so that the partitions are disjoint and written independently, which is supported according to the Databricks docs.
However, I keep getting an error in my cluster logs, and one of the concurrent write operations fails:
Py4JJavaError: An error occurred while calling o1033.save.
: org.apache.spark.SparkException: Job aborted.
...
Caused by: org.apache.hadoop.fs.PathIOException: `<filePath>/_SUCCESS': Input/output error: Parallel access to the create path detected. Failing request to honor single writer semantics
Here <filePath> is the base path where the parquet files are written.
Why is this happening? What are the _SUCCESS files even for? Can I disable them somehow to avoid this issue?

_SUCCESS is an empty file which is written at the very end of the process to confirm that everything went fine.
The link you provided is about delta only, which is a special format. Apparently, you are trying to write files in parquet format, not delta format. This is the reason why you are having conflicts.
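On the question of disabling the marker: Hadoop's file output committer has a property, mapreduce.fileoutputcommitter.marksuccessfuljobs, that controls whether _SUCCESS is written. The sketch below shows one way to pass it to Spark through the spark.hadoop. prefix. The helper names and the application name are hypothetical, and whether this removes the single-writer error on your storage is an assumption that would need testing; it only stops the marker file from being created.

```python
# Hadoop property controlling whether the output committer writes _SUCCESS.
SUCCESS_MARKER_KEY = "mapreduce.fileoutputcommitter.marksuccessfuljobs"


def success_marker_conf(enabled: bool = False) -> dict:
    """Return Spark settings that turn the _SUCCESS marker on or off.

    The "spark.hadoop." prefix makes Spark copy the property into the
    Hadoop configuration used by the writer.
    """
    return {"spark.hadoop." + SUCCESS_MARKER_KEY: "true" if enabled else "false"}


def build_session(app_name: str = "concurrent-parquet-writer"):
    """Create (or reuse) a SparkSession with the marker disabled.

    app_name is a hypothetical name chosen for this sketch.
    """
    from pyspark.sql import SparkSession  # imported lazily; needs pyspark

    builder = SparkSession.builder.appName(app_name)
    for key, value in success_marker_conf(enabled=False).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

In a notebook where the session already exists, the same key/value pair would need to go into the cluster's Spark configuration instead, since builder options may not be applied to a running session.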

Related

PySpark mergeSchema on Read operation Parquet vs Avro

I have around 200 parquet files, each with a different schema, and I am trying to read them with mergeSchema enabled on read. This takes almost 2 hours.
If I instead create equivalent Avro files and read them using the mergeSchema option on read (available only on Databricks runtime 9.3 LTS), the merge finishes within 5 minutes.
Question - Why does the Parquet schema merge on read take so long, whereas the Avro files are much faster?

Glue PySpark Job: An error occurred while calling o73.save. ThreadPoolExecutor already shutdown

I have a PySpark Glue job which runs in 6 parallel instances for 6 different datasets. In this job I read data from S3 using glue_context.create_dynamic_frame.from_catalog() and, after merging the new data, write the Spark dataframe back to S3 like this:
merged_df.write.mode("overwrite").format("json").partitionBy("year","month","day").save(destination_path)
The job runs fine for 5 of the datasets, but it sometimes fails for one of them (not always the same one) with this error:
An error occurred while calling o73.save. ThreadPoolExecutor already shutdown
Please help me fix this.

Is there a way to handle parquet file with INT96 parquet type residing in GCS using Data Fusion?

I want to load a parquet file with INT96 parquet type residing in GCS to BigQuery using Data Fusion.
I created a pipeline with GCS and BigQuery components without any Wrangler, as Wrangler does not support the parquet format.
"MapReduce program 'phase-1' failed with error: MapReduce JobId job_1567423947791_0001 failed. Please check the system logs for more details"
Q.1: Can we check the detailed MapReduce log for this job ID? I know we can do this in Cloudera-supported Apache Hadoop.
Q.2: The failure without Wrangler occurs not only with parquet but also with plain text files. Is Wrangler mandatory in a pipeline?
Q.3: When we tried the Spark engine instead of MapReduce, the failure reason was shown as “INT96 not yet implemented”. Is there any workaround for this error? A parquet file without an INT96 field was processed successfully.

Spark standalone cluster read parquet files after saving

I have a two-node Spark standalone cluster and I'm trying to read some parquet files that I just saved, but I am getting a file-not-found exception.
Checking the location, it looks like all the parquet files were created on one of the nodes in my standalone cluster.
The problem now is that, when reading the parquet files back, it says it cannot find the xasdad.part file.
The only way I have managed to load them is to scale the standalone Spark cluster down to one node.
My question is: how can I load my parquet files while running more than one node in my standalone cluster?
You have to put your files in a shared directory which is accessible to all Spark nodes under the same path. Otherwise, use Spark with Hadoop HDFS, a distributed file system.
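As a rough illustration of the point above, the following sketch flags output paths whose URI scheme suggests node-local storage. The function name and the set of schemes are assumptions made for this example. A plain path on a network mount (for example NFS) is also shared, but that cannot be detected from the path string, so this check is only a heuristic.

```python
from urllib.parse import urlparse

# Schemes that point at storage every node can reach; this set is an
# illustrative assumption, not an exhaustive list.
SHARED_SCHEMES = {"hdfs", "s3a", "abfss", "gs"}


def has_shared_scheme(path: str) -> bool:
    """Return True if the path's URI scheme names a shared filesystem."""
    return urlparse(path).scheme.lower() in SHARED_SCHEMES


# Hypothetical paths: a bare path and file:// both resolve to the local
# disk of whichever node runs the task, so they are reported as not shared.
for candidate in ("/tmp/out", "file:///tmp/out", "hdfs://namenode:9000/out"):
    print(candidate, has_shared_scheme(candidate))
```

Calling a helper like this before df.write.parquet(path) would catch the case from the question, where each node wrote part files to its own local disk.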

Spark write to parquet on hdfs

I have a 3-node cluster with Hadoop and Spark installed. I would like to load data from an RDBMS into a data frame and write it as parquet to HDFS. The "dfs.replication" value is 1.
When I try this with the following command, I see that all HDFS blocks are located on the node where I ran spark-shell.
scala> xfact.write.parquet("hdfs://sparknode01.localdomain:9000/xfact")
Is this the intended behaviour or should all blocks be distributed across the cluster?
Thanks
Since you are writing your data to HDFS, this does not depend on Spark but on HDFS. From Hadoop: The Definitive Guide:
Hadoop’s default strategy is to place the first replica on the same node as the client (for
clients running outside the cluster, a node is chosen at random, although the system
tries not to pick nodes that are too full or too busy).
So yes, this is the intended behaviour.
Just as #nik says. I ran my job with multiple clients, and that worked for me.
This is the Python snippet:
columns = xfact.columns
test = sqlContext.createDataFrame(xfact.rdd.map(lambda a: a),columns)
test.write.mode('overwrite').parquet('hdfs://sparknode01.localdomain:9000/xfact')