Spark - write to parquet never finishes - Scala

I have a strange issue with Spark on EMR.
When I run a Spark job and save the DataFrame to CSV, the job finishes successfully, but when I try to save to Parquet, the Spark application never finishes, even though I can see that all internal tasks have finished.
I can also see that all the Parquet files have been created in the relevant partitions.
I am running on EMR emr-5.13.0, with Spark 2.3.0 and Scala 2.11.
The write to Parquet is:
newDf.coalesce(partitions)
  .write.option("mapreduce.fileoutputcommitter.algorithm.version", "2")
  .partitionBy("key1", "key2")
  .mode(SaveMode.Append)
  .parquet(destination)

Related

pyspark - how to run and schedule streaming jobs in dataproc hosted on GCP

I am trying to write PySpark code that streams data from a Delta table and continuously merges it into a final Delta target, with an interval of 10-15 minutes between each cycle.
I have written a simple PySpark script and submit the job with the command "spark-submit gs://<pyspark_script>.py". However, the script runs once and does not pick up the next cycle.
Code sample:
(SourceDF.writeStream
    .format("delta")
    .outputMode("append")  # I have also tried "update"
    .foreachBatch(mergeToDelta)
    .option("checkpointLocation", "gs:<<path_for_the_checkpint_location>>")
    .trigger(processingTime="10 minutes")  # I have also tried continuous="10 minutes"
    .start())
How do I submit Spark jobs in Dataproc on Google Cloud so that they stream continuously?
Both the source and the target of the streaming job are Delta tables.
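For comparison, here is a minimal runnable sketch of the same pattern, assuming the missing piece is simply that the driver exits after start() returns; the table paths, the checkpoint location, and the body of mergeToDelta below are placeholders, not details from the original job:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-stream-merge").getOrCreate()

# Source stream; the path is a placeholder.
SourceDF = spark.readStream.format("delta").load("gs://<source_delta_table>")

def mergeToDelta(microBatchDF, batchId):
    # Placeholder merge logic; the real job would MERGE into the final Delta target here.
    microBatchDF.write.format("delta").mode("append").save("gs://<target_delta_table>")

query = (SourceDF.writeStream
         .foreachBatch(mergeToDelta)
         .option("checkpointLocation", "gs://<checkpoint_location>")
         .trigger(processingTime="10 minutes")
         .start())

# Without this call, the script ends right after start() and the streaming query
# shuts down with the driver, so it never reaches the next cycle.
query.awaitTermination()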

Not able to execute PySpark script using Spark action in Oozie - Error while instantiating 'org.apache.spark.sql.hive.HiveExternalCatalog'

I am facing the error below while running a Spark action through an Oozie workflow on an EMR 5.14 cluster:
pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveExternalCatalog'"
My PySpark script runs fine when executed as a normal Spark job, but it fails when executed via Oozie.
PySpark program:
from pyspark import SparkContext
from pyspark.sql import SparkSession, HiveContext

spark = SparkSession.builder.appName("PysparkTest").config("hive.support.quoted.identifiers", "none").enableHiveSupport().getOrCreate()
sc = SparkContext.getOrCreate()
sqlContext = HiveContext(sc)
sqlContext.sql("show databases").show()
I have created a workflow.xml and job.properties, taking reference from the LINK.
I copied all the Spark- and Hive-related configuration files into the same directory ($SPARK_CONF_DIR/).
Hive is also configured to use MySQL for the metastore.
It would be great if you could help me figure out the problem I am facing when running this PySpark program as a jar file in an Oozie Spark action.
"Error while instantiating 'org.apache.spark.sql.hive.HiveExternalCatalog'" means that the catalog jar it is trying to find is not in the Oozie sharelib spark directory.
Please add the following property to your job.properties file:
oozie.action.sharelib.for.spark=hive,spark,hcatalog
Also, can you please post the whole log?
And if possible, could you please run the same on EMR 5.29? I have faced a few jar issues on 5.26 and lower versions while running PySpark.

Spark Structured Streaming unable to write parquet data to HDFS

I'm trying to write data to HDFS from Spark Structured Streaming code in Scala, but I'm unable to do so due to an error I don't understand.
In my use case, I'm reading data from a Kafka topic and want to write it to HDFS in Parquet format. Everything else in my script works well; there is no bug so far.
For this I'm using a development Hadoop cluster with 1 namenode and 3 datanodes.
Whatever Hadoop configuration I try, I get the same error (2 datanodes, a single-node setup, and so on...).
Here is the error:
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /test/metadata could only be replicated to 0 nodes instead of minReplication (=1). There are 3 datanode(s) running and 3 node(s) are excluded in this operation.
Here is the code I'm using to write the data:
val query = lbcAutomationCastDf
  .writeStream
  .outputMode("append")
  .format("parquet")
  .queryName("lbcautomation")
  .partitionBy("date_year", "date_month", "date_day")
  .option("checkpointLocation", "hdfs://NAMENODE_IP:8020/test/")
  .option("path", "hdfs://NAMENODE_IP:8020/test/")
  .start()

query.awaitTermination()
The Spark Scala code works correctly, because I can write the data to the server's local disk without any error.
I have already tried reformatting the Hadoop cluster; it does not change anything.
Have you ever dealt with this case?
UPDATE:
Manually pushing a file to HDFS on the cluster works without issues.

Fault tolerance in Spark streaming

I am trying to understand fault tolerance here. Say I have files 1 to 10 in HDFS and Spark Streaming has read these files. Then, unfortunately, my Spark Streaming job stops. Now I have files 1 to 20 in HDFS, where 1 to 10 were already parsed by Spark Streaming and 11 to 20 were added while it was down. When I start Spark Streaming again, I can see files 1-30. Since I started Spark at the time of the 21st file in HDFS, my Spark Streaming will lose files 11-20. How do I get the lost files?
I use fileStream.
The behaviour of fileStream in Spark Streaming is to monitor a folder and pick up new files there, so it only picks up files that are new after the process has started. In order to process files 11-20, you might have to rename them after the process has started.
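For reference, a minimal sketch of that behaviour using PySpark's textFileStream (the text-file variant of fileStream); the monitored directory and batch interval are placeholder values, not taken from the question:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="FileStreamExample")
ssc = StreamingContext(sc, batchDuration=10)  # 10-second batches

# Only files that appear in this directory *after* ssc.start() is called are picked up.
lines = ssc.textFileStream("hdfs:///path/to/monitored/dir")
lines.count().pprint()

ssc.start()
ssc.awaitTermination()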
A better way to handle this scenario is to use messaging queues like Kafka, where you can continue processing streams from any point you like:
https://blog.cloudera.com/blog/2017/06/offset-management-for-apache-kafka-with-apache-spark-streaming/
Spark Streaming also provides an option for checkpointing.
If it is enabled, the process saves a checkpoint before the start of every batch (in a specified folder). Then, if the Spark Streaming process crashes for some reason, it can be restarted from the last checkpoint.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

def createContext(folderName):
    sc = SparkContext(appName='SparkApplication')
    ssc = StreamingContext(sc, 2)  # 2 second window
    ## Your stream configuration here
    ssc.checkpoint(folderName)
    return ssc

ssc = StreamingContext.getOrCreate('/path/to/checkpoint/directory',
                                   lambda: createContext('/path/to/dir'))
ssc.start()
ssc.awaitTermination()

Launching Spark job with Oozie fails (Error MetricsSystem)

I have a Spark jar that I launch with spark-submit, and it works fine (reading files, generating an RDD, storing it in HDFS). However, when I try to launch the same jar within an Oozie job (oozie:spark-action), the Spark job fails.
When I look at the logs, the first error to show up is:
Error MetricsSystem: Sink class
org.apache.spark.metrics.sink.MetricsServlet cannot be instantiated.
Furthermore, when I started playing with the Spark script, I found out that the problem has to do with the saveAsTextFile function. When I launch the same Spark job without writing to HDFS, the whole workflow works fine.
Any suggestions?
The problem was on the side of the cluster where I am executing the Oozie jobs.
I needed to explicitly add arguments in the job workflow, simply because they weren't being taken into consideration:
<spark-opts>--queue HQ_IBNF --conf "spark.executor.extraJavaOptions=-Djava.library.path=/opt/application/Hadoop/current/lib/native"</spark-opts>