Spark 2.2.0 - unable to read recursively into directory structure - scala

Problem summary:
I am unable to read from nested subdirectories using my Spark program, despite setting the required Hadoop configuration (see attempted).
I get the error pasted below.
Any help is appreciated.
Version:
Spark 2.2.0
Input directory layout:
/user/akhanolk/data/myq/parsed/myq-app-logs/to-be-compacted/flat-view-format/batch_id=1502939225073/part-00000-3a44cd00-e895-4a01-9ab9-946064b739d4-c000.parquet
/user/akhanolk/data/myq/parsed/myq-app-logs/to-be-compacted/flat-view-format/batch_id=1502939234036/part-00000-cbd47353-0590-4cc1-b10d-c18886df1c25-c000.parquet
...
Input directory parameter passed:
/user/akhanolk/data/myq/parsed/myq-app-logs/to-be-compacted/flat-view-format/*/*
Attempted (1):
Set parameter in code...
val sparkSession: SparkSession = SparkSession.builder().master("yarn").getOrCreate()
// Recursive glob support & log level
import sparkSession.implicits._
sparkSession.sparkContext.hadoopConfiguration.setBoolean("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive", true)
Did not see the configuration in place in Spark UI.
Attempted (2):
Passed the config from the CLI - spark-submit, and set it in code (see below).
spark-submit --conf spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive=true \...
I do see the configuration in the Spark UI, but I get the same error: Spark cannot traverse into the directory structure.
Code:
// Spark session
val sparkSession: SparkSession = SparkSession.builder().master("yarn").getOrCreate()
// Recursive glob support
val conf = new SparkConf()
val cliRecursiveGlobConf = conf.get("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive")
import sparkSession.implicits._
sparkSession.sparkContext.hadoopConfiguration.set("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive", cliRecursiveGlobConf)
Error & overall output:
Full error is at - https://gist.github.com/airawat/77fbdb821410a5a87dfd29ffaf60fdf9
17/08/18 15:59:29 INFO state.StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
Exception in thread "main" java.io.FileNotFoundException: File /user/akhanolk/data/myq/parsed/myq-app-logs/to-be-compacted/flat-view-format/batch_id=*/* does not exist.
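One detail in both attempts may be worth checking: the `spark.hadoop.` prefix is only meaningful on SparkConf or `--conf` entries, where Spark strips it before passing the setting to Hadoop. When writing to `hadoopConfiguration` directly, the Hadoop key is normally given without the prefix. The sketch below shows that, and also reads the parent directory so that Spark's partition discovery can pick up the `batch_id=...` subdirectories. The object and method names are hypothetical, and the source does not confirm that this resolves the FileNotFoundException.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object NestedParquetRead {
  // Hadoop's own key name, without the "spark.hadoop." prefix.
  val RecursiveKey = "mapreduce.input.fileinputformat.input.dir.recursive"

  // Sets the flag directly on the session's Hadoop configuration.
  def enableRecursiveListing(spark: SparkSession): Unit =
    spark.sparkContext.hadoopConfiguration.setBoolean(RecursiveKey, true)

  // Assumption: basePath is the directory that contains the batch_id=...
  // subdirectories. Partition discovery then exposes batch_id as a column.
  def readAllBatches(spark: SparkSession, basePath: String): DataFrame =
    spark.read.parquet(basePath)
}
```

With the layout from the question, the call would be NestedParquetRead.readAllBatches(sparkSession, "/user/akhanolk/data/myq/parsed/myq-app-logs/to-be-compacted/flat-view-format"), that is, the parent directory rather than a `*/*` glob.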

Related

HiveException when running a sql example in Spark shell

I'm new to Apache Spark. I am using Spark 2.4.0 and Scala 2.11.12, and I'm trying to run the following code in my Spark shell:
import org.apache.spark.sql.SparkSession
import spark.implicits._
var df = spark.read.json("storesales.json")
df.createOrReplaceTempView("storesales")
spark.sql("SELECT * FROM storesales")
And I get the following error -
2018-12-18 07:05:03 WARN Hive:168 - Failed to access metastore. This class should not accessed in runtime.
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1236)
at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
at org.apache.hadoop.hive.ql.metadata.Hive.<clinit>(Hive.java:166)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:183)
at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:117)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
I also saw the question "Issues trying out example in Spark-shell" and, as per the accepted answer, tried starting my Spark shell like so:
~/spark-2.4.0-bin-hadoop2.7/bin/spark-shell --conf spark.sql.warehouse.dir=file:///tmp/spark-warehouse
However, it did not help and the issue persists.
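The `java.base/jdk.internal.reflect` frames in the trace suggest the shell is running on Java 9 or later. Spark 2.4 was built around Java 8, and Java 11 support arrived with Spark 3.0 as far as I know, so the JVM version is worth ruling out. A minimal sketch for checking it follows; the object and method names are hypothetical, and the source does not confirm that the Java version is the cause here.

```scala
object JavaVersionCheck {
  /** Extracts the major version from strings such as "1.8.0_191", "11.0.1" or "17-ea". */
  def majorVersion(version: String): Int = {
    val parts = version.split("[._-]")
    // JVMs before 9 report "1.x"; later ones lead with the major number.
    if (parts(0) == "1" && parts.length > 1) parts(1).toInt else parts(0).toInt
  }

  def main(args: Array[String]): Unit = {
    val reported = System.getProperty("java.version")
    println(s"JVM reports $reported (major ${majorVersion(reported)})")
  }
}
```

In spark-shell, the object can be pasted in and called as JavaVersionCheck.majorVersion(System.getProperty("java.version")). If the result is above 8, pointing JAVA_HOME at a Java 8 installation before starting the shell would be the thing to try.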

register hive udf in scala - java.net.MalformedURLException: unknown protocol: s3

I am trying to register a UDF in Scala Spark. Registering the following UDF works in Hive:
create temporary function udf_parallax as 'com.abc.edw.hww.etl.udf.parallax.ParallaxHiveHash' USING JAR 's3://bx-analytics-softwares/gdpr_hive_udfs/gdpr-hive-udfs-hadoop.jar'
In Spark, I am doing this:
val sparkSess = SparkSession.builder()
.appName("Opens")
.enableHiveSupport()
.config("set hive.exec.dynamic.partition.mode", "nonstrict").getOrCreate()
sparkSess.sql("""create temporary function udf_parallax as 'com.abc.edw.hww.etl.udf.parallax.ParallaxHiveHash' USING JAR 's3://bx-analytics-softwares/gdpr_hive_udfs/gdpr-hive-udfs-hadoop.jar'""");
I get an error saying
Exception in thread "main" java.net.MalformedURLException: unknown protocol: s3
I would like to know whether I have to set something in the config or do anything else; I have just started learning.
Any help with this is appreciated.
Why not add this gdpr-hive-udfs-hadoop.jar as an external jar to your project and then do this to register the udf:
val sqlContext = sparkSess.sqlContext
val udf_parallax = sqlContext.udf.register("udf_parallax", com.abc.edw.hww.etl.udf.parallax.ParallaxHiveHash _)
Update:
1. If Hive is running on a remote server:
val sparkSession= SparkSession.builder()
.appName("Opens")
.config("hive.metastore.uris", "thrift://METASTORE:9083")
.config("hive.exec.dynamic.partition.mode", "nonstrict") // the key is the property name only, without a leading "set "
.enableHiveSupport()
.getOrCreate()
sparkSession.sql("""create temporary function udf_parallax as 'com.abc.edw.hww.etl.udf.parallax.ParallaxHiveHash' USING JAR 's3://bx-analytics-softwares/gdpr_hive_udfs/gdpr-hive-udfs-hadoop.jar'""");
2. If Hive is not running on a remote server:
Copy hive-site.xml from your /hive/conf/ directory to the /spark/conf/ directory and create the SparkSession as you have it in the question.

Error instantiating 'org.apache.spark.sql.hive.HiveSessionState': on Linux server

I have a Scala Spark application that I'm trying to run on a Linux server using a shell script. I am getting the error:
Exception in thread "main" java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':
However, I don't understand what is wrong. I am doing this to instantiate Spark:
val sparkConf = new SparkConf().setAppName("HDFStoES").setMaster("local")
val spark: SparkSession = SparkSession.builder.enableHiveSupport().config(sparkConf).getOrCreate()
Am I doing this correctly? If so, what could be causing the error?
sparkSession = SparkSession.builder().appName("Test App").master("local[*]")
.config("hive.metastore.warehouse.dir", hiveWareHouseDir)
.config("spark.sql.warehouse.dir", hiveWareHouseDir).enableHiveSupport().getOrCreate();
Use the above. You need to specify the "hive.metastore.warehouse.dir" directory to enable Hive support in the Spark session.

How to use mesos master url in a self-contained Scala Spark program

I am creating a self-contained Scala program that uses Spark for parallelization in some parts. In my specific situation, the Spark cluster is available through mesos.
I create spark context like this:
val conf = new SparkConf().setMaster("mesos://zk://<mesos-url1>,<mesos-url2>/spark/mesos-rtspark").setAppName("foo")
val sc = new SparkContext(conf)
I found out from searching around that you have to specify MESOS_NATIVE_JAVA_LIBRARY env var to point to the libmesos library, so when running my Scala program I do this:
MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.dylib sbt run
But, this results in a SparkException:
ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Could not parse Master URL: 'mesos://zk://<mesos-url1>,<mesos-url2>/spark/mesos-rtspark'
At the same time, using spark-submit seems to work fine after exporting the MESOS_NATIVE_JAVA_LIBRARY env var.
MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.dylib spark-submit --class <MAIN CLASS> ./target/scala-2.10/<APP_JAR>.jar
Why?
How can I make the standalone program run like spark-submit?
Add spark-mesos jar to your classpath.
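For an sbt project, adding the jar usually means declaring it as a dependency. A minimal build.sbt fragment is sketched below. It assumes a Spark release in which Mesos support ships as the separate `spark-mesos` artifact (Spark 2.1 and later, as far as I know), and the version shown is a placeholder that must match the cluster. spark-submit presumably works because the Spark distribution already has this jar on its classpath.

```scala
// build.sbt (fragment)
// Assumption: "2.2.0" is a placeholder; use the Spark version of your cluster.
val sparkVersion = "2.2.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % sparkVersion,
  // Provides the cluster manager that understands mesos:// master URLs.
  "org.apache.spark" %% "spark-mesos" % sparkVersion
)
```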

Running Sparkling-Water with external H2O backend

I was following the steps for running Sparkling Water with an external backend from here. I am using Spark 1.4.1 and sparkling-water-1.4.16. I've built the extended h2o jar and exported the H2O_ORIGINAL_JAR and H2O_EXTENDED_JAR system variables. I start the h2o backend with
java -jar $H2O_EXTENDED_JAR -md5skip -name test
But when I start sparkling water via
./bin/sparkling-shell
and in it try to get the H2OConf with
import org.apache.spark.h2o._
val conf = new H2OConf(sc).setExternalClusterMode().useManualClusterStart().setCloudName("test")
val hc = H2OContext.getOrCreate(sc, conf)
it fails on the second line with
<console>:24: error: trait H2OConf is abstract; cannot be instantiated
val conf = new H2OConf(sc).setExternalClusterMode().useManualClusterStart().setCloudName("test")
^
I've tried adding the newly built extended h2o jar with the --jars parameter, either to Sparkling Water or to standalone Spark, with no progress. Does anyone have any hints?
This is unsupported for versions of Spark earlier than 2.0.
Download the latest version of the Sparkling Water jar and add it when starting the sparkling-shell:
./bin/sparkling-shell --master yarn-client --jars "<path to the jar located>"
Then run the code by setting the extended h2o driver:
import org.apache.spark.h2o._
val conf = new H2OConf(spark).setExternalClusterMode().useAutoClusterStart().setH2ODriverPath("//home//xyz//sparkling-water-2.2.5/bin//h2odriver-sw2.2.5-hdp2.6-extended.jar").setNumOfExternalH2ONodes(2).setMapperXmx("6G")
val hc = H2OContext.getOrCreate(spark, conf)