Is there a way to handle a Parquet file with the INT96 Parquet type residing in GCS using Data Fusion? - google-cloud-dataproc

I want to load a Parquet file with the INT96 Parquet type, residing in GCS, into BigQuery using Data Fusion.
I created a pipeline with GCS and BigQuery components and no Wrangler stage, since Wrangler does not support the Parquet format. The pipeline fails with:
"MapReduce program 'phase-1' failed with error: MapReduce JobId job_1567423947791_0001 failed. Please check the system logs for more details"
Q.1: Can we check the detailed MapReduce log for this job ID? I know we can do this in Cloudera-supported Apache Hadoop.
Q.2: The failure without Wrangler occurs not only for Parquet but even for plain text files. Is Wrangler mandatory in a pipeline?
Q.3: When we tried the Spark engine instead of MapReduce, the failure reason shown was "INT96 not yet implemented". Is there any workaround for this error? A Parquet file without an INT96 field was processed successfully. One possible approach is sketched below.
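One workaround, if you can preprocess the file (for example with PySpark on Dataproc), is to rewrite it so the timestamps are stored as INT64 rather than INT96, then point the pipeline at the rewritten file. This is a minimal sketch; the bucket and file names are placeholders, not from the question:

    # Hedged sketch: rewrite INT96 timestamps as INT64 (TIMESTAMP_MICROS).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("int96-rewrite").getOrCreate()

    # Spark reads INT96 columns as timestamps; writing them back as
    # TIMESTAMP_MICROS produces a file with no INT96 columns at all.
    spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")

    df = spark.read.parquet("gs://my-bucket/raw/events.parquet")
    df.write.mode("overwrite").parquet("gs://my-bucket/clean/events.parquet")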

Related

PySpark mergeSchema on Read operation Parquet vs Avro

I have around 200 Parquet files, each with a different schema, and reading them with mergeSchema enabled on read takes almost 2 hours.
If I instead create equivalent Avro files and read them with the mergeSchema option on read (available only on Databricks Runtime 9.3 LTS), the merge completes within 5 minutes.
Question: why does the Parquet schema merge on read take so long, whereas the Avro files are faster?
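For reference, a minimal sketch of the two reads being compared, assuming a notebook where the spark session already exists; the paths are placeholders:

    # Parquet: merge the ~200 differing schemas at read time (the slow path;
    # Spark has to inspect the footer of every file to build the union schema).
    df_parquet = (spark.read
                  .option("mergeSchema", "true")
                  .parquet("/mnt/data/parquet/"))

    # Avro: same option, supported only on the Databricks runtime noted above.
    df_avro = (spark.read
               .format("avro")
               .option("mergeSchema", "true")
               .load("/mnt/data/avro/"))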

Write parquet files concurrently with pyspark

In Azure Databricks I would like to write to the same set of Parquet files concurrently from multiple notebooks using Python/PySpark. I partitioned the target files so the partitions are disjoint and written independently, which is supported according to the Databricks docs.
However, I keep getting an error in my cluster logs, and one of the concurrent write operations fails:
Py4JJavaError: An error occurred while calling o1033.save.
: org.apache.spark.SparkException: Job aborted.
...
Caused by: org.apache.hadoop.fs.PathIOException: `<filePath>/_SUCCESS': Input/output error: Parallel access to the create path detected. Failing request to honor single writer semantics
The <filePath> in the error is the base path the Parquet files are written to.
Why is this happening? What are the _SUCCESS files even for? Can I disable them somehow to avoid this issue?
_SUCCESS is an empty file written at the very end of the process to confirm that everything went fine.
The link you provided is about Delta only, which is a special format. Apparently you are writing plain Parquet, not Delta, and that is why you are getting the conflicts.
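If you do want to suppress the marker file, here is a minimal sketch using the standard Hadoop output-committer setting (verify it against your Databricks runtime; the partition column and target path are placeholders):

    # Hedged sketch: disable the _SUCCESS marker so concurrent jobs do not
    # collide on the same marker path.
    spark.sparkContext._jsc.hadoopConfiguration().set(
        "mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

    # Each notebook appends only its own disjoint partition values.
    (df.write
       .mode("append")
       .partitionBy("region")        # placeholder partition column
       .parquet("/mnt/target/parquet"))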

Can you append to a file using the alpakka HDFS connector?

I'm trying to use this connector to pull messages from Kafka and write them to HDFS. It works fine as long as the file doesn't already exist, but if it does, it throws a FileAlreadyExistsException. Is there a way to append to an already-existing file using this connector? I'm using an HdfsFlow.dataWithPassThrough flow, and it takes an HdfsWritingSettings, but that only lets you set an overwrite boolean; there's no append option.
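Since HdfsWritingSettings exposes only the overwrite flag, one possible workaround is to append outside the connector using HDFS's native append support. A minimal sketch via pyarrow's HDFS bindings; the host, port, and path are placeholders, and it assumes libhdfs is available on the client:

    # Hedged workaround sketch: append to an existing HDFS file directly.
    import pyarrow.fs as pafs

    hdfs = pafs.HadoopFileSystem(host="namenode", port=8020)
    with hdfs.open_append_stream("/logs/kafka-sink.log") as out:
        out.write(b"one more record\n")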

Spark standalone cluster read parquet files after saving

I have a two-node Spark standalone cluster and I'm trying to read some Parquet files that I just saved, but I'm getting a file-not-found exception.
Checking the location, it looks like all the Parquet files were created on one of the nodes in my standalone cluster.
The problem now is that, when reading the Parquet files back, it says it cannot find the xasdad.part file.
The only way I manage to load them is to scale the standalone Spark cluster down to one node.
My question is: how can I load my Parquet files while running more than one node in my standalone cluster?
You have to put your files in a shared directory which is accessible to all Spark nodes under the same path. Otherwise, use Spark with Hadoop HDFS, a distributed file system.
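A minimal sketch of the idea; the HDFS URI and path are placeholders:

    # Hedged sketch: use a path every node resolves identically (HDFS here),
    # so the part files land in shared storage instead of one node's disk.
    out = "hdfs://namenode:8020/data/events.parquet"

    df.write.mode("overwrite").parquet(out)   # parts go to shared storage
    df_back = spark.read.parquet(out)         # any node can now read them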

How to append streaming log data to an HDFS file in Flume? Does anyone have the MR source code to append data to a file in HDFS?

I need to append streaming data to HDFS using Flume. Without overwriting the existing log file, I need to append the streaming data to the existing file in HDFS. Could you please provide links to the MR code for this?
Flume does not overwrite existing data in an HDFS directory by default. This is because Flume saves incoming data to files whose names have the sink timestamp appended, such as
Flume.2345234523, so if you run Flume again against the same HDFS directory, it will create another file under the same HDFS path.
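For illustration, a minimal HDFS sink configuration sketch; the agent, channel, prefix, and path names are placeholders. The hdfs.filePrefix value is what Flume suffixes with a timestamp/counter to create a new file on each roll:

    # Hedged sketch of a Flume HDFS sink; names and paths are placeholders.
    agent.sinks.hdfsSink.type = hdfs
    agent.sinks.hdfsSink.channel = memChannel
    agent.sinks.hdfsSink.hdfs.path = /flume/events/%Y-%m-%d
    agent.sinks.hdfsSink.hdfs.useLocalTimeStamp = true
    agent.sinks.hdfsSink.hdfs.filePrefix = Flume
    agent.sinks.hdfsSink.hdfs.fileType = DataStream
    # Roll settings control when Flume closes one file and opens the next.
    agent.sinks.hdfsSink.hdfs.rollInterval = 300
    agent.sinks.hdfsSink.hdfs.rollSize = 0
    agent.sinks.hdfsSink.hdfs.rollCount = 0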