I am running an Oozie Spark action on Cloudera and I need a way to capture the stdout of the Oozie Spark action and store it in HDFS.
I am new to Spark and need to clarify some doubts I have.
Can I schedule Spark jobs through Airflow?
My Airflow (Spark) jobs process raw CSV files present in an S3 bucket, transform them into Parquet format, store them back into an S3 bucket, and finally load them into Hive (queried through Presto) once processing is complete. The end user connects to Presto and queries the data to create visualisations.
Can this processed data be stored in Hive only, or in Presto only, so that the user can connect to Presto or Hive accordingly and query the database?
Well, you can always use the SparkSubmitOperator to schedule and submit your Spark jobs, or you can use the BashOperator and invoke the spark-submit command from it to schedule and submit Spark jobs.
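For illustration, here is a minimal DAG sketch showing both options; the import paths assume Airflow 2.x with the apache-airflow-providers-apache-spark package installed, and the DAG name, application path, connection ID, and schedule are placeholders:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="csv_to_parquet",            # hypothetical DAG name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Option 1: SparkSubmitOperator builds the spark-submit call for you
    transform = SparkSubmitOperator(
        task_id="transform_csv_to_parquet",
        application="/path/to/transform_job.py",  # placeholder job script
        conn_id="spark_default",
    )
    # Option 2: BashOperator wrapping a plain spark-submit command
    transform_bash = BashOperator(
        task_id="transform_via_bash",
        bash_command="spark-submit --master yarn /path/to/transform_job.py",
    )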
For your second question: after Spark has created the Parquet files, you can use Spark (the same Spark session) to write the data into Hive or Presto.
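As a rough sketch of the Hive part (assuming the SparkSession was created with Hive support; the S3 path and table name are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ParquetToHive").enableHiveSupport().getOrCreate()
df = spark.read.parquet("s3://my-bucket/processed/")        # placeholder output location
df.write.mode("overwrite").saveAsTable("analytics.events")  # placeholder Hive table

Presto can then query that same table through its Hive connector, so the end user can point either tool at it.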
I am facing the error below while running a Spark action through an Oozie workflow on an EMR 5.14 cluster:
pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveExternalCatalog'"
My PySpark script runs fine when executed as a normal Spark job, but it fails when executed via Oozie.
PySpark program:
from pyspark import SparkContext
from pyspark.sql import SparkSession, HiveContext

spark = SparkSession.builder.appName("PysparkTest").config("hive.support.quoted.identifiers", "none").enableHiveSupport().getOrCreate()
sc = SparkContext.getOrCreate()
sqlContext = HiveContext(sc)
sqlContext.sql("show databases").show()
I have created a workflow.xml and a job.properties, taking the LINK as a reference.
I copied all the Spark- and Hive-related configuration files into the same directory ($SPARK_CONF_DIR/).
Hive is also configured to use MySQL for the metastore.
It would be great if you could help me figure out the problem I am facing when running this PySpark program as a jar file in an Oozie Spark action.
The error "Error while instantiating 'org.apache.spark.sql.hive.HiveExternalCatalog'" means that the catalog jar Spark is trying to find is not in the Oozie sharelib spark directory.
Please add the following property to your job.properties file.
oozie.action.sharelib.for.spark=hive,spark,hcatalog
Also, could you please post the whole log?
And if possible, could you please run the same on EMR 5.29? I have faced a few jar issues on 5.26 and lower versions while running PySpark.
I have a strange issue with Spark on EMR.
When I run a Spark job and save the DataFrame to CSV, the job finishes successfully, but when I try to save to Parquet, the Spark application never finishes, even though I can see that all internal tasks are finished.
I can also see that all the Parquet files are created in the relevant partitions.
I run on EMR emr-5.13.0 with Spark 2.3.0 and Scala 2.11.
The write to Parquet is:
newDf.coalesce(partitions)
.write.option("mapreduce.fileoutputcommitter.algorithm.version", "2")
.partitionBy("key1", "key2")
.mode(SaveMode.Append)
.parquet(destination)
I am looking for a way to make use of Spark Streaming with NiFi. I have seen a couple of posts where a Site-to-Site TCP connection is used for the Spark Streaming application, but I think it would be better if I could launch Spark Streaming from a custom NiFi processor.
PublishKafka would publish messages into Kafka, and then the NiFi Spark Streaming processor would read from the Kafka topic.
I can launch a Spark Streaming application from a custom NiFi processor using the Spark Streaming launcher API, but the biggest challenge is that it would create a Spark streaming context for each flow file, which can be a costly operation.
Would anyone suggest storing the Spark streaming context in a controller service, or is there a better approach for running a Spark Streaming application with NiFi?
You can make use of ExecuteSparkInteractive to run the Spark code that you are trying to include in your Spark Streaming application.
You need a few things set up for the Spark code to run from within NiFi:
Set up a Livy server
Add NiFi controller services to start Spark Livy sessions:
LivySessionController
StandardSSLContextService (may be required)
Once you enable the LivySessionController within NiFi, it will start Spark sessions, and you can check on the Spark UI whether those Livy sessions are up and running.
Now that we have Livy Spark sessions running, whenever a flow file moves through the NiFi flow, the Spark code within ExecuteSparkInteractive will run.
This is similar to a Spark Streaming application running outside NiFi. For me this approach works very well and is easier to maintain compared to having a separate Spark Streaming application.
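For illustration, here is a minimal PySpark sketch of the kind of code that could go into the ExecuteSparkInteractive processor; the Livy session already exposes spark and sc, so no session is created here, and the input path, column, and table name are made up (it also assumes the Livy session has Hive support):

# Code body for ExecuteSparkInteractive; `spark` is provided by the Livy session.
df = spark.read.json("hdfs:///landing/events/latest")           # placeholder location fed by the NiFi flow
agg = df.groupBy("event_type").count()                          # placeholder transformation
agg.write.mode("append").saveAsTable("analytics.event_counts")  # placeholder Hive table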
Hope this helps!
"I can launch a Spark Streaming application from a custom NiFi processor using the Spark Streaming launcher API, but the biggest challenge is that it would create a Spark streaming context for each flow file, which can be a costly operation."
You'd be launching a standalone application in each case, which is not what you want. If you are going to integrate with Spark Streaming or Flink, you should be using something like Kafka to pub-sub between them.
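For illustration, here is a minimal PySpark sketch of the Spark side of that pub-sub pattern, using Structured Streaming (one of several ways to consume a topic) to read what NiFi's PublishKafka writes; the broker address, topic name, and console sink are placeholders, and it assumes the spark-sql-kafka package is on the classpath:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("NifiKafkaConsumer").getOrCreate()

# Read the topic that PublishKafka writes to (broker and topic are assumptions).
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "nifi-events")
          .load())

# Decode the Kafka value bytes and write somewhere; console is a placeholder sink.
query = (stream.selectExpr("CAST(value AS STRING) AS value")
         .writeStream
         .format("console")
         .outputMode("append")
         .start())

query.awaitTermination()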
I have a Spark jar that I launch with spark-submit, and it works fine (reading files, generating RDDs, storing to HDFS). However, when I try to launch the same jar within an Oozie job (oozie:spark-action), the Spark job fails.
When I looked at the logs, the first error that shows up is:
Error MetricsSystem: Sink class
org.apache.spark.metrics.sink.MetricsServlet cannot be instantiated.
Furthermore, when I started playing with the Spark script, I found out that the problem has to do with the saveAsTextFile function. When I launch the same Spark job without writing to HDFS, the whole workflow works fine.
Any suggestions?
The problem was on the side of the cluster where I am executing the Oozie jobs.
I needed to explicitly add arguments in the job workflow, simply because they weren't being taken into consideration:
<spark-opts>--queue HQ_IBNF --conf "spark.executor.extraJavaOptions=-Djava.library.path=/opt/application/Hadoop/current/lib/native"</spark-opts>