PySpark mergeSchema on Read operation Parquet vs Avro - pyspark

I have around 200 Parquet files, each with a different schema. Reading them with mergeSchema enabled takes almost 2 hours.
If I instead create equivalent Avro files and read them with the mergeSchema option (available only on Databricks Runtime 9.3 LTS), the merge completes within 5 minutes.
Question: Why does the Parquet schema merge on read take so long, whereas the Avro read is so much faster?
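For reference, the Parquet read described in the question can be sketched in PySpark as below. The function names are illustrative, and the second function is an assumption that goes beyond the question: it passes a caller-supplied schema so that Spark does not have to infer one from the files. Whether that shortens this particular job has not been tested.

```python
def read_parquet_merged(spark, path):
    """Read the Parquet files under `path`, merging the per-file schemas."""
    # `mergeSchema` is a documented Parquet read option in Spark.
    return spark.read.option("mergeSchema", "true").parquet(path)


def read_parquet_with_schema(spark, path, schema):
    """Read with a caller-supplied schema instead of inferring one."""
    # `schema` may be a StructType or a DDL string such as "id INT, name STRING".
    return spark.read.schema(schema).parquet(path)
```

Both functions expect an existing SparkSession as `spark`; the path and schema are whatever applies to your data.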

Related

Can Confluent S3SourceConnector read a CSV file from S3 bucket?

Is there a built-in Confluent Kafka connector that can read data from a CSV file in an S3 bucket?
Can S3SourceConnector do the job for me?
Try using
format.class=io.confluent.connect.s3.format.string.StringFormat
This should read lines from files.
You'd be better off using something else, such as SparkSQL, to actually parse the data.
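If the goal is only to parse the CSV, a minimal PySpark sketch of reading it directly with Spark might look like this. The function name, the example path in the comment, and the assumption that the file has a header row are all hypothetical; the cluster also needs an S3 connector configured.

```python
def read_csv_from_s3(spark, path, has_header=True):
    """Read a CSV file (or directory of CSV files) into a DataFrame."""
    # Spark's CSV reader takes the header flag as the option "header".
    reader = spark.read.option("header", str(has_header).lower())
    return reader.csv(path)


# Hypothetical usage, assuming an existing SparkSession named `spark`:
# df = read_csv_from_s3(spark, "s3a://my-bucket/input/data.csv")
# df.createOrReplaceTempView("csv_data")
```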

Write parquet files concurrently with pyspark

In Azure Databricks I would like to write to the same set of parquet files concurrently from multiple notebooks using Python / PySpark. I partitioned the target files so the partitions are disjoint and written independently, which is supported according to the Databricks docs.
However I keep getting an error in my cluster logs and one of the concurrent write operations fails:
Py4JJavaError: An error occurred while calling o1033.save.
: org.apache.spark.SparkException: Job aborted.
...
Caused by: org.apache.hadoop.fs.PathIOException: `<filePath>/_SUCCESS': Input/output error: Parallel access to the create path detected. Failing request to honor single writer semantics
Here <filePath> is the base path where the parquet files are written.
Why is this happening? What are the _SUCCESS files even for? Can I disable them somehow to avoid this issue?
_SUCCESS is an empty file which is written at the very end of the process to confirm that everything went fine.
The link you provided applies only to Delta, which is a special format. Apparently, you are writing Parquet files, not Delta, which is why you are getting conflicts.
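Two options can be sketched in PySpark, both under stated assumptions. The first turns off the _SUCCESS marker through the Hadoop property `mapreduce.fileoutputcommitter.marksuccessfuljobs`; whether that alone removes the error in this setup is not confirmed. The second writes in Delta format, which is the case the linked docs cover. The function names are hypothetical.

```python
SUCCESS_MARKER_KEY = "mapreduce.fileoutputcommitter.marksuccessfuljobs"


def disable_success_marker(spark):
    """Ask the Hadoop output committer not to write the _SUCCESS marker."""
    # `_jsc` is a private PySpark handle to the JVM context: widely used,
    # but not a public API, and it may be restricted on some clusters.
    hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
    hadoop_conf.set(SUCCESS_MARKER_KEY, "false")


def append_as_delta(df, path, partition_cols):
    """Append `df` to a Delta table at `path`, partitioned by `partition_cols`."""
    (df.write.format("delta")
        .mode("append")
        .partitionBy(*partition_cols)
        .save(path))
```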

Is there a way to handle parquet file with INT96 parquet type residing in GCS using Data Fusion?

I want to load a parquet file with INT96 parquet type residing in GCS to BigQuery using Data Fusion.
I created a pipeline with GCS and BigQuery components and no Wrangler, since Wrangler does not support the Parquet format. It fails with:
"MapReduce program 'phase-1' failed with error: MapReduce JobId job_1567423947791_0001 failed. Please check the system logs for more details"
Q.1: Can we check the detailed MapReduce log for this job ID? I know this is possible in Cloudera-supported Apache Hadoop.
Q.2: The failure without Wrangler occurs not only with Parquet but also with plain text files. Is Wrangler mandatory in a pipeline?
Q.3: When we tried the Spark engine instead of MapReduce, the failure reason shown was "INT96 not yet implemented". Is there a workaround for this error? A Parquet file without an INT96 field was processed successfully.
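One possible workaround for Q.3, sketched in PySpark on the assumption that the file can be rewritten before Data Fusion reads it, is to re-encode the timestamps as INT64. `spark.sql.parquet.outputTimestampType` is a Spark SQL setting (Spark 2.3 and later); the function name and paths are hypothetical, and this has not been tested against Data Fusion.

```python
def rewrite_without_int96(spark, src_path, dst_path):
    """Copy a Parquet dataset, writing timestamps as INT64 instead of INT96."""
    # Write timestamps as INT64 microseconds rather than the legacy INT96 type.
    spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
    # Spark can read INT96 timestamps; only the physical encoding changes.
    spark.read.parquet(src_path).write.mode("overwrite").parquet(dst_path)
```

The rewritten copy at `dst_path` would then be the input to the pipeline instead of the original file.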

Spark Structured Streaming unable to write parquet data to HDFS

I'm trying to write data to HDFS from Spark Structured Streaming code in Scala, but it fails with an error that I don't understand.
In my use case, I'm reading data from a Kafka topic and want to write it to HDFS in Parquet format. Everything else in my script works well, with no bugs so far.
I'm using a development Hadoop cluster with 1 namenode and 3 datanodes.
Whatever Hadoop configuration I try (2 datanodes, a single-node setup, and so on), I get the same error.
Here is the error:
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /test/metadata could only be replicated to 0 nodes instead of minReplication (=1). There are 3 datanode(s) running and 3 node(s) are excluded in this operation.
Here is the code I'm using to write the data:
val query = lbcAutomationCastDf
  .writeStream
  .outputMode("append")
  .format("parquet")
  .queryName("lbcautomation")
  .partitionBy("date_year", "date_month", "date_day")
  .option("checkpointLocation", "hdfs://NAMENODE_IP:8020/test/")
  .option("path", "hdfs://NAMENODE_IP:8020/test/")
  .start()
  .awaitTermination()
The Spark Scala code itself works correctly, because I can write the data to the server's local disk without any error.
I already tried reformatting the Hadoop cluster; it did not change anything.
Have you ever dealt with this case?
UPDATES:
Manually pushing a file to HDFS on the cluster works without issues.
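This error is often reported when the HDFS client can reach the namenode but not the datanodes. That is an assumption here, not something the question confirms. A minimal Python sketch for checking it from the machine that runs the Spark driver: it tests whether each datanode's data transfer port accepts TCP connections. The host names in the comment are hypothetical, and 9866 is the Hadoop 3 default port for `dfs.datanode.address` (50010 in Hadoop 2).

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def unreachable_datanodes(hosts, port=9866, timeout=3.0):
    """Return the subset of `hosts` whose datanode port cannot be reached."""
    return [h for h in hosts if not can_connect(h, port, timeout)]


# Hypothetical usage:
# print(unreachable_datanodes(["datanode1", "datanode2", "datanode3"]))
```

If every datanode shows up as unreachable from the driver machine but not from inside the cluster, the network path or the addresses the datanodes advertise would be the next thing to look at.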

Spark standalone cluster read parquet files after saving

I have a two-node Spark standalone cluster, and I'm trying to read some parquet files that I just saved, but I get a file-not-found exception.
Checking the location, it looks like all the parquet files got created on one of the nodes in my standalone cluster.
The problem is that when reading the parquet files back, it says it cannot find the xasdad.part file.
The only way I managed to load them was to scale the standalone Spark cluster down to one node.
My question is: how can I load my parquet files while running more than one node in my standalone cluster?
You have to put your files in a shared directory that is accessible to all Spark nodes under the same path. Otherwise, use Spark with Hadoop HDFS, a distributed file system.
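A minimal PySpark sketch of that advice, assuming a location that every node sees under the same path, such as an HDFS URI or a shared mount. The function name and the example paths in the comment are hypothetical.

```python
def save_and_reload(spark, df, shared_path):
    """Write `df` as Parquet to a shared location, then read it back."""
    # Every executor writes its part files here, so the path must resolve
    # to the same storage on all nodes.
    df.write.mode("overwrite").parquet(shared_path)
    return spark.read.parquet(shared_path)


# Hypothetical usage:
# df2 = save_and_reload(spark, df, "hdfs://namenode:8020/data/out")
# df2 = save_and_reload(spark, df, "file:///mnt/shared/out")  # shared mount
```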