I have a Spark jar that I launch with spark-submit and it works fine (reading files, generating RDDs, storing to HDFS). However, when I try to launch the same jar within an Oozie job (oozie:spark-action), the Spark job fails.
When I looked at the logs, the first error that shows up is:
Error MetricsSystem: Sink class
org.apache.spark.metrics.sink.MetricsServlet cannot be instantiated.
Furthermore, when I started playing with the Spark script, I found out that the problem has to do with the saveAsTextFile function. When I launch the same Spark job without writing to HDFS, the whole workflow works fine.
Any suggestions?
The problem was on the side of the cluster where I am executing Oozie jobs.
I needed to explicitly add arguments in the job workflow, simply because they weren't being taken into consideration:
<spark-opts>--queue HQ_IBNF --conf "spark.executor.extraJavaOptions=-Djava.library.path=/opt/application/Hadoop/current/lib/native"</spark-opts>
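As a fallback, the same settings can also be baked into the job itself through SparkConf, so they survive even if the workflow arguments are dropped again. This is only a hedged sketch, assuming a PySpark driver; it reuses the queue name and native-library path from the <spark-opts> line above:

from pyspark import SparkConf, SparkContext

# Programmatic equivalent of --queue and --conf from the <spark-opts> element.
conf = (SparkConf()
        .set("spark.yarn.queue", "HQ_IBNF")
        .set("spark.executor.extraJavaOptions",
             "-Djava.library.path=/opt/application/Hadoop/current/lib/native"))
sc = SparkContext(conf=conf)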
Related
In Azure Databricks I would like to write to the same set of parquet files concurrently from multiple notebooks using Python / PySpark. I partitioned the target files so the partitions are disjoint and written independently, which is supported according to the Databricks docs.
However I keep getting an error in my cluster logs and one of the concurrent write operations fails:
Py4JJavaError: An error occurred while calling o1033.save.
: org.apache.spark.SparkException: Job aborted.
...
Caused by: org.apache.hadoop.fs.PathIOException: `<filePath>/_SUCCESS': Input/output error: Parallel access to the create path detected. Failing request to honor single writer semantics
Here <filePath> is the base path that the parquet files are written to.
Why is this happening? What are the _SUCCESS files even for? Can I disable them somehow to avoid this issue?
_SUCCESS is an empty marker file written at the very end of the job to confirm that everything went fine.
The link you provided is about Delta only, which is a special format. Apparently, you are trying to write a parquet-format file, not a Delta one. This is the reason why you are having conflicts.
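If the goal is just to stop concurrent jobs racing on the marker file, _SUCCESS output can be switched off through the underlying Hadoop committer option. A minimal sketch, assuming a standard Databricks PySpark session; the setting is the Hadoop FileOutputCommitter option, and <filePath> stands for the real base path:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Tell the Hadoop FileOutputCommitter not to emit _SUCCESS marker files.
spark.sparkContext._jsc.hadoopConfiguration().set(
    "mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

# Concurrent writes to disjoint partitions no longer touch <filePath>/_SUCCESS.
# df.write.mode("append").partitionBy("part_col").parquet("<filePath>")

That said, for true concurrent writers to one table, converting it to Delta is the setup the linked docs actually describe.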
I'm trying to run a streaming job in a Databricks notebook on Spark 2.4.5, without using checkpointing. I have tried every possible solution but am unable to get hold of the active StreamingContext each time.
While testing, the job re-attempts to run, but the standard coding pattern above fails with the (also expected) exception: "Only one StreamingContext may be started in this JVM. Currently running StreamingContext was started at org.apache.spark.streaming.api.java.JavaStreamingContext.start".
So which Spark Streaming coding pattern should I use to be able to automatically re-run streaming jobs, stopping the old streaming context and starting a new one, without having to restart the cluster and, ideally, without restarting the Spark context?
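One pattern that fits this is to look up and stop any leftover context before starting a fresh one. A minimal sketch, assuming PySpark on Spark 2.4.x; the batch interval and the DStream wiring are placeholders:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext.getOrCreate()

# Stop a StreamingContext left over from a previous attempt, keeping the
# underlying SparkContext (and hence the cluster session) alive.
active = StreamingContext.getActive()
if active is not None:
    active.stop(stopSparkContext=False, stopGraceFully=False)

# Now it is safe to build and start a fresh streaming context.
ssc = StreamingContext(sc, 10)  # hypothetical 10-second batch interval
# ... define input DStreams and output operations here ...
ssc.start()
ssc.awaitTermination()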
I have a PySpark Glue job which runs as 6 parallel instances for 6 different datasets. In this job I am reading data from S3 using glue_context.create_dynamic_frame.from_catalog(), and after merging in the new data I write the Spark DataFrame back to S3 like this:
merged_df.write.mode("overwrite").format("json").partitionBy("year","month","day").save(destination_path)
This job runs fine for 5 of the datasets, but sometimes fails for one of them with the error:
An error occurred while calling o73.save. ThreadPoolExecutor already shutdown
Please help me fix this.
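No confirmed root cause here, but since the failure is sporadic, one generic mitigation is a small retry around the save. This is a hedged sketch of that idea only, not an established fix for this Glue error; merged_df and destination_path are the ones from the question:

import time

# Retry the S3 write a few times before giving up, since only some runs fail.
for attempt in range(3):
    try:
        (merged_df.write.mode("overwrite").format("json")
            .partitionBy("year", "month", "day").save(destination_path))
        break
    except Exception:  # hypothetical: narrow this to the real exception type
        if attempt == 2:
            raise
        time.sleep(30 * (attempt + 1))  # back off before retrying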
I am facing the below error while running a Spark action through an Oozie workflow on an EMR 5.14 cluster:
pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveExternalCatalog'"
My PySpark script runs fine when executed as a normal Spark job, but fails when executed via an Oozie Spark action.
PySpark program:
from pyspark import SparkContext
from pyspark.sql import SparkSession, HiveContext

# Hive-enabled session; HiveContext is built on the same underlying SparkContext.
spark = SparkSession.builder.appName("PysparkTest").config("hive.support.quoted.identifiers", "none").enableHiveSupport().getOrCreate()
sc = SparkContext.getOrCreate()
sqlContext = HiveContext(sc)
sqlContext.sql("show databases").show()
I have created a workflow.xml and job.properties taking reference from the LINK.
I copied all the Spark- and Hive-related configuration files under the same directory ($SPARK_CONF_DIR/).
Hive is also configured to use MySQL for the metastore.
It would be great if you could help me figure out the problem I am facing when running this PySpark program as a jar file in an Oozie Spark action.
"Error while instantiating 'org.apache.spark.sql.hive.HiveExternalCatalog'" means the catalog jar it is trying to find is not in the Oozie sharelib Spark directory.
Please add the following property in your job.properties file.
oozie.action.sharelib.for.spark=hive,spark,hcatalog
Also, can you please post the whole log?
And if possible, could you please run the same on EMR 5.29? I have faced a few jar issues on 5.26 and lower versions while running PySpark.
We are working on Qubole with Spark version 2.0.2.
We have a multi-step process in which all the intermediate steps write their output to HDFS and later this output is used in the reporting layer.
As per our use case, we want to avoid writing to HDFS, keep all the intermediate output as temporary tables in Spark, and directly write only the final reporting layer output (sketched below).
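To make the intended flow concrete, here is a minimal sketch of the temp-table approach, assuming the intermediate steps are ordinary DataFrame transformations; the table names and output path are illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Intermediate step: keep the result as an in-memory temp view instead of HDFS.
step1_df = spark.sql("SELECT * FROM source_table WHERE load_date = current_date()")
step1_df.createOrReplaceTempView("step1_output")

# Downstream steps read the view; only the final reporting output is persisted.
report_df = spark.sql("SELECT key, count(*) AS cnt FROM step1_output GROUP BY key")
report_df.write.mode("overwrite").parquet("s3://reporting-bucket/final/")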
For this implementation, we wanted to use the Job Server provided by Qubole, but when we try to trigger multiple queries on the Job Server, it runs our jobs sequentially.
I have observed the same behavior in Databricks cluster as well.
The cluster we are using has 30 r4.2xlarge nodes.
Does anyone have experience in running multiple jobs using the Job Server?
The community's help will be greatly appreciated!
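For reference, the standard way to get concurrent jobs inside a single Spark application is the FAIR scheduler plus one driver thread per query, as described in Spark's "scheduling within an application" docs. A hedged sketch, assuming that pattern applies to the Job Server setup; the queries, pool names, and paths are illustrative:

import threading
from pyspark.sql import SparkSession

# FAIR scheduling lets jobs submitted from separate threads share executors
# instead of queuing strictly behind one another.
spark = (SparkSession.builder
         .config("spark.scheduler.mode", "FAIR")
         .getOrCreate())

def run_query(pool, sql, out_path):
    # Per-thread scheduler pool, per Spark's scheduling-within-an-application docs.
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", pool)
    spark.sql(sql).write.mode("overwrite").parquet(out_path)

jobs = [("pool1", "SELECT * FROM step1_output", "s3://reporting-bucket/out1/"),
        ("pool2", "SELECT * FROM step2_output", "s3://reporting-bucket/out2/")]
threads = [threading.Thread(target=run_query, args=job) for job in jobs]
for t in threads:
    t.start()
for t in threads:
    t.join()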