I have a Scala jar file called SGA.jar. Within it, there is a class called org/SGA/MainTest, which uses the underlying SGA.jar logic to perform some graph operations and looks like this:
package org.SGA

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

import java.io._
import scala.util._

object MainTest {

  def initialize() : Unit = {
    println("Initializing")
  }

  def perform(collection : Iterable[String]) : Unit = {
    val conf = new SparkConf().setAppName("maintest")
    val sparkContext = new SparkContext(conf)
    sparkContext.setLogLevel("ERROR")

    val edges = sparkContext.parallelize(collection.toList)
      .map(_.split(" "))
      .map { edgeCoordinates =>
        new Edge(edgeCoordinates(0).toLong, edgeCoordinates(1).toLong, edgeCoordinates(2).toDouble)
      }

    println("Creating graph")
    val graph : Graph[Any, Double] = Graph.fromEdges(edges, 0)
    println("Graph created")

    // ...
  }
}
SGA.jar is embedded into scalaWrapper.jar, which is a Java wrapper around the Scala SGA.jar and the necessary datasets. Its folder structure looks like this:
scalaWrapper.jar
| META-INF
| | MANIFEST.MF
| scalawrapper
| | datasets
| | | data1.txt
| | jars
| | | SGA.jar
| | FileParser.java
| | FileParser.class
| | WrapperClass.java
| | WrapperClass.class
| .classpath
| .project
The FileParser class converts the data available in the text files into usable structures and is of no further interest here. The main class, however, is WrapperClass:
package scalawrapper;

import scala.collection.*;
import scala.collection.Iterable;
import java.util.List;
import org.SGA.*;

public class WrapperClass {
    public static void main(String[] args) {
        FileParser fileparser = new FileParser();
        String filepath = "/scalawrapper/datasets/data1.txt";

        MainTest.initialize();

        List<String> list = fileparser.Parse(filepath);
        Iterable<String> scalaIterable = JavaConversions.collectionAsScalaIterable(list);

        MainTest.perform(scalaIterable);
    }
}
SGA.jar is built via SBT, and the Java jar is developed and exported from Eclipse. When running locally (in which case the SparkConf additionally has .setMaster("local[*]").set("spark.executor.memory","7g") appended to facilitate local execution), there are no issues and the code behaves as expected.
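For reference, the locally-run variant of the SparkConf from MainTest.perform looks like this:

val conf = new SparkConf()
  .setAppName("maintest")
  .setMaster("local[*]")               // local execution only; omitted for the cluster run
  .set("spark.executor.memory", "7g")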
The problem arises when scalaWrapper.jar is expected to run on an EMR cluster. The cluster is defined as 1 master + 4 worker nodes, with an additional Spark application step:
Main class : None
Arguments : spark-submit --deploy-mode cluster --class scalawrapper.WrapperClass --executor-memory 17g --executor-cores 16 --driver-memory 17g s3://scalaWrapperCluster/scalaWrapper.jar
The execution fails with:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/mnt1/yarn/usercache/hadoop/filecache/10/__spark_libs__1619195545177535823.zip/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
19/04/22 16:56:43 INFO SignalUtils: Registered signal handler for TERM
19/04/22 16:56:43 INFO SignalUtils: Registered signal handler for HUP
19/04/22 16:56:43 INFO SignalUtils: Registered signal handler for INT
19/04/22 16:56:43 INFO SecurityManager: Changing view acls to: yarn,hadoop
19/04/22 16:56:43 INFO SecurityManager: Changing modify acls to: yarn,hadoop
19/04/22 16:56:43 INFO SecurityManager: Changing view acls groups to:
19/04/22 16:56:43 INFO SecurityManager: Changing modify acls groups to:
19/04/22 16:56:43 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, hadoop); groups with view permissions: Set(); users with modify permissions: Set(yarn, hadoop); groups with modify permissions: Set()
19/04/22 16:56:44 INFO ApplicationMaster: Preparing Local resources
19/04/22 16:56:44 INFO ApplicationMaster: ApplicationAttemptId: appattempt_1555952041027_0001_000001
19/04/22 16:56:44 INFO ApplicationMaster: Starting the user application in a separate Thread
19/04/22 16:56:44 INFO ApplicationMaster: Waiting for spark context initialization...
19/04/22 16:56:44 ERROR ApplicationMaster: User class threw exception: java.lang.NoClassDefFoundError: org/SGA/MainTest
java.lang.NoClassDefFoundError: org/SGA/MainTest
at scalawrapper.WrapperClass.main(WrapperClass.java:20)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:678)
Caused by: java.lang.ClassNotFoundException: org.SGA.MainTest
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 6 more
Note that WrapperClass.java:20 corresponds to MainTest.initialize();.
This exception seems to be quite common, and I came upon quite a few attempted solutions (example), yet none solved my problem. I tried including in scalaWrapper.jar the scala-library that was used for building SGA.jar, eliminating static fields, and searching for mistakes in the project definitions, but had no luck.
I resolved the issue by uploading SGA.jar to S3 separately and passing it to spark-submit via the --jars parameter:
spark-submit --deploy-mode cluster --jars s3://scalaWrapperCluster/SGA.jar --class scalawrapper.WrapperClass --executor-memory 17g --executor-cores 16 --driver-memory 17g s3://scalaWrapperCluster/scalaWrapper.jar
Note that the original contents of scalaWrapper.jar (including the embedded SGA.jar) didn't change; it is the separately uploaded SGA.jar that is actually being executed.
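For background on why this works: the JVM's standard classloaders do not look inside a jar that is nested within another jar, so org/SGA/MainTest only becomes resolvable once SGA.jar itself is placed on the classpath, which is exactly what --jars does. An alternative, sketched below under the assumption that building the wrapper with sbt and the sbt-assembly plugin is acceptable (versions are illustrative), is to build one flat fat jar whose classes all sit at the top level:

// project/plugins.sbt -- a sketch, assuming the sbt-assembly plugin (version illustrative)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")

// build.sbt -- Spark itself is supplied by the EMR cluster, so it is marked
// "provided" and stays out of the fat jar (versions illustrative)
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"   % "2.4.0" % "provided",
  "org.apache.spark" %% "spark-graphx" % "2.4.0" % "provided"
)

// `sbt assembly` then produces a single jar with all classes at the top level,
// which spark-submit can run without any --jars argument.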
Related
I am trying to use Spark Structured Streaming with Kafka, and I have a problem when using spark-submit: the consumer still receives data from the producer, but Spark Structured Streaming errors out. Please help me find the issue in my code.
Here is my code in test.py:
from kafka import KafkaProducer
from kafka import KafkaConsumer
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('stream_test').getOrCreate()

import random

producer = KafkaProducer(bootstrap_servers=["localhost:9092"])
for i in range(0, 100):
    lg_value = str(random.uniform(5000, 10000))
    producer.send(topic='test', value=bytes(lg_value, encoding='utf-8'))
producer.flush()

df = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "test").load()
df_to_string = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
print("done")
When I run:
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 test.py
the terminal output is:
20/07/12 19:39:09 INFO Executor: Starting executor ID driver on host 192.168.31.129
20/07/12 19:39:09 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 38885.
20/07/12 19:39:09 INFO NettyBlockTransferService: Server created on 192.168.31.129:38885
20/07/12 19:39:09 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/07/12 19:39:09 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 192.168.31.129, 38885, None)
20/07/12 19:39:09 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.31.129:38885 with 413.9 MiB RAM, BlockManagerId(driver, 192.168.31.129, 38885, None)
20/07/12 19:39:09 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.31.129, 38885, None)
20/07/12 19:39:09 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 192.168.31.129, 38885, None)
20/07/12 19:39:11 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/home/thoaint2/spark-warehouse').
20/07/12 19:39:11 INFO SharedState: Warehouse path is 'file:/home/thoaint2/spark-warehouse'.
Traceback (most recent call last):
  File "/home/thoaint2/test.py", line 15, in <module>
    df = spark.readStream.format("kafka").option('kafka.bootstrap.servers','localhost:9092') \
  File "/home/thoaint2/spark-3.0.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/streaming.py", line 420, in load
  File "/home/thoaint2/spark-3.0.0-bin-hadoop2.7/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
  File "/home/thoaint2/spark-3.0.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py", line 131, in deco
  File "/home/thoaint2/spark-3.0.0-bin-hadoop2.7/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o31.load.
: java.lang.NoClassDefFoundError: org/apache/kafka/common/serialization/ByteArraySerializer
    at org.apache.spark.sql.kafka010.KafkaSourceProvider$.<init>(KafkaSourceProvider.scala:557)
    at org.apache.spark.sql.kafka010.KafkaSourceProvider$.<clinit>(KafkaSourceProvider.scala)
    at org.apache.spark.sql.kafka010.KafkaSourceProvider.org$apache$spark$sql$kafka010$KafkaSourceProvider$$validateStreamOptions(KafkaSourceProvider.scala:325)
    at org.apache.spark.sql.kafka010.KafkaSourceProvider.sourceSchema(KafkaSourceProvider.scala:70)
    at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:220)
    at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:112)
    at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:112)
    at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:35)
    at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:205)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.apache.kafka.common.serialization.ByteArraySerializer
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
NoClassDefFoundError: org/apache/kafka/common/serialization/ByteArraySerializer
This class is part of the kafka-clients JAR, which you'll want to add to your --packages, e.g.
spark-submit ... --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0,org.apache.kafka:kafka-clients:<<version>>
Also note that Spark works as a producer as well, so you don't need a different Python Kafka library.
If you simply want to process Kafka Streams without using a JVM, then look into Faust.
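To illustrate the producer point, here is a minimal sketch (in Scala; the same kafka format and options are available from PySpark) of Spark writing rows directly to Kafka with no separate producer library. The topic and broker address are the ones from the question:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("kafka_write_sketch").getOrCreate()
import spark.implicits._

// Any DataFrame with a string (or binary) "value" column can be written to Kafka.
Seq("5000.1", "7500.2", "9999.3").toDF("value")
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "test")
  .save()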
I am facing an issue while pushing data to Kafka from a Spark DataFrame.
Let me explain my scenario in detail with a sample example. I want to load data into Spark and send the Spark output to Kafka. I am using Gradle 3.5, Spark 2.3.1, and Kafka 1.0.1.
Here is build.gradle
buildscript {
    ext {
        springBootVersion = '1.5.15.RELEASE'
    }
    repositories {
        mavenCentral()
    }
    dependencies {
        classpath("org.springframework.boot:spring-boot-gradle-plugin:${springBootVersion}")
    }
}

apply plugin: 'scala'
apply plugin: 'java'
apply plugin: 'eclipse'
apply plugin: 'org.springframework.boot'

group = 'com.sample'
version = '0.0.1-SNAPSHOT'
sourceCompatibility = 1.8

repositories {
    mavenCentral()
}

dependencies {
    compile('org.springframework.boot:spring-boot-starter')
    compile('org.apache.spark:spark-core_2.11:2.3.1')
    compile('org.apache.spark:spark-sql_2.11:2.3.1')
    compile('org.apache.spark:spark-streaming-kafka-0-10_2.11:2.3.1')
    compile('org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.1')
    testCompile('org.springframework.boot:spring-boot-starter-test')
}
And here is my code:
package com.sample

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._

object SparkConnection {

  case class emp(empid: Integer, empname: String, empsal: Float)

  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("Spark Connection").setMaster("local[*]")
    val sc = new SparkContext(sparkConf)

    val dataRdd = sc.textFile("/home/sample/data/sample.txt")
    val mapRdd = dataRdd.map(row => row.split(","))
    val empRdd = mapRdd.map(row => emp(row(0).toInt, row(1), row(2).toFloat))

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._
    val empDF = empRdd.toDF()

    empDF
      .select(to_json(struct(empDF.columns.map(column): _*)).alias("value"))
      .write.format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("topic", "my-kafka-topic")
      .save()
  }
}
Please ignore the Spring Boot framework API in build.gradle.
After building my package with Gradle, I can see all the dependent classes mentioned in the .gradle file.
But when I run the code with spark-submit, like
spark-submit --class com.sample.SparkConnection spark_kafka_integration.jar
I am getting the following error
Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: kafka. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:635)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:241)
at com.iniste.SparkConnection$.main(SparkConnection.scala:29)
at com.iniste.SparkConnection.main(SparkConnection.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: kafka.DefaultSource
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23$$anonfun$apply$15.apply(DataSource.scala:618)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23$$anonfun$apply$15.apply(DataSource.scala:618)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23.apply(DataSource.scala:618)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23.apply(DataSource.scala:618)
at scala.util.Try.orElse(Try.scala:84)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:618)
... 13 more
2018-09-05 17:41:17 INFO SparkContext:54 - Invoking stop() from shutdown hook
2018-09-05 17:41:17 INFO AbstractConnector:318 - Stopped Spark@51684e4a{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2018-09-05 17:41:17 INFO MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
2018-09-05 17:41:17 INFO MemoryStore:54 - MemoryStore cleared
2018-09-05 17:41:17 INFO BlockManager:54 - BlockManager stopped
2018-09-05 17:41:17 INFO BlockManagerMaster:54 - BlockManagerMaster stopped
2018-09-05 17:41:17 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
2018-09-05 17:41:17 INFO SparkContext:54 - Successfully stopped SparkContext
2018-09-05 17:41:17 INFO ShutdownHookManager:54 - Shutdown hook called
2018-09-05 17:41:17 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-bd4cb4ef-3883-4c26-a93f-f355b13ef306
2018-09-05 17:41:17 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-156dfdbd-cff4-4c70-943f-35ef403a01ed
Please help me get out of this error. Some blogs suggested using the --packages option with spark-submit, but a proxy limitation on my side prevents downloading the mentioned packages. What I don't understand is why spark-submit is unable to find the jars that are already available. Please correct me where I am going wrong.
As with any Spark application, spark-submit is used to launch it. spark-sql-kafka-0-10_2.11 and its dependencies can be added directly to spark-submit using --packages, as below in your case:
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.1 --class com.sample.SparkConnection spark_kafka_integration.jar
This can be found here
However, as per cricket_007's suggestion, I have added the Shadow plugin (shadowJar) to your build.gradle.
The new one may look similar to this:
buildscript {
    ext {
        springBootVersion = '1.5.15.RELEASE'
    }
    repositories {
        mavenCentral()
    }
    dependencies {
        classpath("org.springframework.boot:spring-boot-gradle-plugin:${springBootVersion}")
    }
}

plugins {
    id "com.github.johnrengelman.shadow" version "2.0.4"
}

apply plugin: 'scala'
apply plugin: 'java'
apply plugin: 'eclipse'
apply plugin: 'org.springframework.boot'
apply plugin: "com.github.johnrengelman.shadow"

group = 'com.sample'
version = '0.0.1-SNAPSHOT'
sourceCompatibility = 1.8

repositories {
    mavenCentral()
}

dependencies {
    compile('org.springframework.boot:spring-boot-starter')
    compile('org.apache.spark:spark-core_2.11:2.3.1')
    compile('org.apache.spark:spark-sql_2.11:2.3.1')
    compile('org.apache.spark:spark-streaming-kafka-0-10_2.11:2.3.1')
    compile('org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.1')
    compile 'org.scala-lang:scala-library:2.11.12'
    // https://mvnrepository.com/artifact/org.apache.kafka/kafka
    //compile group: 'org.apache.kafka', name: 'kafka_2.10', version: '0.8.0'
    testCompile('org.springframework.boot:spring-boot-starter-test')
}

shadowJar {
    baseName = "spark_kafka_integration"
    zip64 true
    classifier = null
    version = null
}
So, to create your jar, the command is just the :shadowJar task from Gradle.
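With that in place, a typical round trip (the output path is Gradle's default and may differ in your setup) would be:

./gradlew shadowJar
spark-submit --class com.sample.SparkConnection build/libs/spark_kafka_integration.jar

Since the Kafka data source classes are now shaded into the jar itself, no --packages option (and hence no proxy access) is needed at submit time.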
I'm trying to connect to IBM Message Hub from Apache Spark 2.2.1 Structured Streaming.
The connection code is quite basic:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.appName("StreamingRetailTransactions").getOrCreate()
import spark.implicits._
val df = spark.readStream.
  format("kafka").
  option("kafka.bootstrap.servers", "kafka03-prod02.messagehub.services.eu-gb.bluemix.net:9093,kafka04-prod02.messagehub.services.eu-gb.bluemix.net:9093,kafka01-prod02.messagehub.services.eu-gb.bluemix.net:9093,kafka02-prod02.messagehub.services.eu-gb.bluemix.net:9093,kafka05-prod02.messagehub.services.eu-gb.bluemix.net:9093").
  option("subscribe", "transactions_load").
  option("security.protocol", "SASL_SSL").
  option("sasl.mechanism", "PLAIN").
  option("sasl.jaas.config", "org.apache.kafka.common.security.plain.PlainLoginModule required username=\"*****\" password=\"*****\";").
  option("ssl.protocol", "TLSv1.2").
  option("ssl.enabled.protocols", "TLSv1.2").
  option("ssl.endpoint.identification.algorithm", "HTTPS").
  option("auto.offset.reset","earliest").
  option("group.id", System.currentTimeMillis).
  load()
val query = df.writeStream.format("console").start()
I'm starting the spark shell with:
~/spark-2.2.1-bin-hadoop2.7$ ./bin/spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0
However, I'm getting a disconnected error:
scala> 18/01/09 08:34:17 WARN NetworkClient: Bootstrap broker kafka02-prod02.messagehub.services.eu-gb.bluemix.net:9093 disconnected
18/01/09 08:34:17 WARN NetworkClient: Bootstrap broker kafka03-prod02.messagehub.services.eu-gb.bluemix.net:9093 disconnected
18/01/09 08:34:17 WARN NetworkClient: Bootstrap broker kafka01-prod02.messagehub.services.eu-gb.bluemix.net:9093 disconnected
<<repeats forever>>
I've increased debug output with sc.setLogLevel("DEBUG") and I get:
<<log output omitted for brevity>>
18/01/09 08:38:28 DEBUG SessionState: SessionState user: null
18/01/09 08:38:28 DEBUG SessionState: HDFS root scratch dir: /tmp/hive with schema null, permission: rwx-wx-wx
18/01/09 08:38:28 INFO SessionState: Created local directory: /tmp/2b4557ce-dd17-46d1-9ab0-f9a36fd750f9_resources
18/01/09 08:38:28 INFO SessionState: Created HDFS directory: /tmp/hive/snowch/2b4557ce-dd17-46d1-9ab0-f9a36fd750f9
18/01/09 08:38:28 INFO SessionState: Created local directory: /tmp/snowch/2b4557ce-dd17-46d1-9ab0-f9a36fd750f9
18/01/09 08:38:28 INFO SessionState: Created HDFS directory: /tmp/hive/snowch/2b4557ce-dd17-46d1-9ab0-f9a36fd750f9/_tmp_space.db
18/01/09 08:38:28 INFO HiveClientImpl: Warehouse location for Hive client (version 1.2.1) is file:/home/snowch/spark-2.2.1-bin-hadoop2.7/spark-warehouse
18/01/09 08:38:28 DEBUG SessionState: Session is using authorization class class org.apache.hadoop.hive.ql.security.authorization.DefaultHiveAuthorizationProvider
18/01/09 08:38:28 DEBUG StateStoreCoordinatorRef: Retrieved existing StateStoreCoordinator endpoint
18/01/09 08:38:28 DEBUG StreamExecution: Starting Trigger Calculation
18/01/09 08:38:28 INFO StreamExecution: Starting new streaming query.
18/01/09 08:38:28 DEBUG UserGroupInformation: PrivilegedAction as:snowch (auth:SIMPLE) from:org.apache.hadoop.fs.FileContext.getAbstractFileSystem(FileContext.java:331)
18/01/09 08:38:28 DEBUG KafkaSource$$anon$1: Unable to find batch /tmp/temporary-fb3e6e1a-fbbe-4098-991c-0b29f63ecade/sources/0/0
18/01/09 08:38:28 DEBUG AbstractCoordinator: Sending coordinator request for group spark-kafka-source-9a14bb54-8f1b-47db-8497-19c083128496--998588290-driver-0 to broker kafka05-prod02.messagehub.services.eu-gb.bluemix.net:9093 (id: -5 rack: null)
18/01/09 08:38:28 DEBUG NetworkClient: Initiating connection to node -5 at kafka05-prod02.messagehub.services.eu-gb.bluemix.net:9093.
18/01/09 08:38:28 DEBUG NetworkClient: Initialize connection to node -3 for sending metadata request
18/01/09 08:38:28 DEBUG NetworkClient: Initiating connection to node -3 at kafka01-prod02.messagehub.services.eu-gb.bluemix.net:9093.
18/01/09 08:38:28 DEBUG Metrics: Added sensor with name node--5.bytes-sent
18/01/09 08:38:28 DEBUG Metrics: Added sensor with name node--5.bytes-received
18/01/09 08:38:28 DEBUG Metrics: Added sensor with name node--5.latency
18/01/09 08:38:28 DEBUG NetworkClient: Completed connection to node -5
18/01/09 08:38:28 DEBUG Metrics: Added sensor with name node--3.bytes-sent
18/01/09 08:38:28 DEBUG Metrics: Added sensor with name node--3.bytes-received
18/01/09 08:38:28 DEBUG Metrics: Added sensor with name node--3.latency
18/01/09 08:38:28 DEBUG NetworkClient: Completed connection to node -3
18/01/09 08:38:28 DEBUG NetworkClient: Sending metadata request {topics=[transactions_load]} to node -5
18/01/09 08:38:28 DEBUG Selector: Connection with kafka05-prod02.messagehub.services.eu-gb.bluemix.net/159.8.179.153 disconnected
java.io.EOFException
at org.apache.kafka.common.network.NetworkReceive.readFromReadableChannel(NetworkReceive.java:83)
at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:71)
at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:154)
at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:135)
at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:323)
at org.apache.kafka.common.network.Selector.poll(Selector.java:283)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:260)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.clientPoll(ConsumerNetworkClient.java:360)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:224)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:192)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:163)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureCoordinatorReady(AbstractCoordinator.java:179)
at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:974)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:938)
at org.apache.spark.sql.kafka010.KafkaOffsetReader$$anonfun$fetchLatestOffsets$1$$anonfun$apply$9.apply(KafkaOffsetReader.scala:174)
at org.apache.spark.sql.kafka010.KafkaOffsetReader$$anonfun$fetchLatestOffsets$1$$anonfun$apply$9.apply(KafkaOffsetReader.scala:172)
at org.apache.spark.sql.kafka010.KafkaOffsetReader$$anonfun$org$apache$spark$sql$kafka010$KafkaOffsetReader$$withRetriesWithoutInterrupt$1.apply$mcV$sp(KafkaOffsetReader.scala:263)
at org.apache.spark.sql.kafka010.KafkaOffsetReader$$anonfun$org$apache$spark$sql$kafka010$KafkaOffsetReader$$withRetriesWithoutInterrupt$1.apply(KafkaOffsetReader.scala:262)
at org.apache.spark.sql.kafka010.KafkaOffsetReader$$anonfun$org$apache$spark$sql$kafka010$KafkaOffsetReader$$withRetriesWithoutInterrupt$1.apply(KafkaOffsetReader.scala:262)
at org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:85)
at org.apache.spark.sql.kafka010.KafkaOffsetReader.org$apache$spark$sql$kafka010$KafkaOffsetReader$$withRetriesWithoutInterrupt(KafkaOffsetReader.scala:261)
at org.apache.spark.sql.kafka010.KafkaOffsetReader$$anonfun$fetchLatestOffsets$1.apply(KafkaOffsetReader.scala:172)
at org.apache.spark.sql.kafka010.KafkaOffsetReader$$anonfun$fetchLatestOffsets$1.apply(KafkaOffsetReader.scala:172)
at org.apache.spark.sql.kafka010.KafkaOffsetReader.runUninterruptibly(KafkaOffsetReader.scala:230)
at org.apache.spark.sql.kafka010.KafkaOffsetReader.fetchLatestOffsets(KafkaOffsetReader.scala:171)
at org.apache.spark.sql.kafka010.KafkaSource$$anonfun$initialPartitionOffsets$1.apply(KafkaSource.scala:132)
at org.apache.spark.sql.kafka010.KafkaSource$$anonfun$initialPartitionOffsets$1.apply(KafkaSource.scala:129)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.kafka010.KafkaSource.initialPartitionOffsets$lzycompute(KafkaSource.scala:129)
at org.apache.spark.sql.kafka010.KafkaSource.initialPartitionOffsets(KafkaSource.scala:97)
at org.apache.spark.sql.kafka010.KafkaSource.getOffset(KafkaSource.scala:163)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$10$$anonfun$apply$6.apply(StreamExecution.scala:521)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$10$$anonfun$apply$6.apply(StreamExecution.scala:521)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:279)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$10.apply(StreamExecution.scala:520)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$10.apply(StreamExecution.scala:518)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$constructNextBatch(StreamExecution.scala:518)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$populateStartOffsets(StreamExecution.scala:492)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(StreamExecution.scala:297)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply(StreamExecution.scala:294)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply(StreamExecution.scala:294)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:279)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1.apply$mcZ$sp(StreamExecution.scala:294)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:290)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:206)
18/01/09 08:38:28 DEBUG NetworkClient: Node -5 disconnected.
18/01/09 08:38:28 WARN NetworkClient: Bootstrap broker kafka05-prod02.messagehub.services.eu-gb.bluemix.net:9093 disconnected
18/01/09 08:38:28 DEBUG ConsumerNetworkClient: Cancelled GROUP_COORDINATOR request ClientRequest(expectResponse=true, callback=org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler@4eacac4e, request=RequestSend(header={api_key=10,api_version=0,correlation_id=0,client_id=consumer-1}, body={group_id=spark-kafka-source-9a14bb54-8f1b-47db-8497-19c083128496--998588290-driver-0}), createdTimeMs=1515487108479, sendTimeMs=1515487108598) with correlation id 0 due to node -5 being disconnected
18/01/09 08:38:28 DEBUG NetworkClient: Sending metadata request {topics=[transactions_load]} to node -3
18/01/09 08:38:28 DEBUG Selector: Connection with kafka01-prod02.messagehub.services.eu-gb.bluemix.net/159.8.179.149 disconnected
java.io.EOFException
at org.apache.kafka.common.network.NetworkReceive.readFromReadableChannel(NetworkReceive.java:83)
at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:71)
<<repeated>>
I have seen the following similar questions:
Kafka Error in I/O java.io.EOFException: null - however, this question is for a much older version of Kafka
I understand that some of the output is just 'noise'; however, my Spark streaming application does not appear to be receiving any data. I have connected with a console client and I am able to see data.
Update 1 - I've tried configuring JAAS, but I'm still getting the same error. The issue may be that the JAAS code needs to run on each worker node, but isn't getting run on them.
sc.setLogLevel("DEBUG")
def jaasClientConfig(username: String, password: String): Unit = {
import javax.security.auth.login.AppConfigurationEntry
import javax.security.auth.login.Configuration
import javax.security.auth.login.LoginException
import scala.collection.JavaConversions._
System.setProperty("java.security.auth.login.config", "")
Configuration.setConfiguration(new Configuration() {
def getAppConfigurationEntry(name: String): Array[AppConfigurationEntry] = {
val idMap = Map(
"serviceName" -> "kafka",
"username" -> username,
"password" -> password
)
val ace = new AppConfigurationEntry(
"org.apache.kafka.common.security.plain.PlainLoginModule",
AppConfigurationEntry.LoginModuleControlFlag.REQUIRED,
idMap
)
return Array(ace)
}
})
}
def startSparkStreaming(): Unit = {
  import org.apache.spark.sql.functions._
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder.appName("StreamingRetailTransactions").getOrCreate()
  import spark.implicits._

  val df = spark.readStream.
    format("kafka").
    option("kafka.bootstrap.servers", "kafka03-prod02.messagehub.services.eu-gb.bluemix.net:9093,kafka04-prod02.messagehub.services.eu-gb.bluemix.net:9093,kafka01-prod02.messagehub.services.eu-gb.bluemix.net:9093,kafka02-prod02.messagehub.services.eu-gb.bluemix.net:9093,kafka05-prod02.messagehub.services.eu-gb.bluemix.net:9093").
    option("subscribe", "transactions_load").
    option("security.protocol", "SASL_SSL").
    option("sasl.mechanism", "PLAIN").
    option("ssl.protocol", "TLSv1.2").
    option("ssl.enabled.protocols", "TLSv1.2").
    option("ssl.endpoint.identification.algorithm", "HTTPS").
    option("auto.offset.reset","earliest").
    option("group.id", System.currentTimeMillis).
    load()

  val query = df.writeStream.format("console").start()
}

jaasClientConfig("****","****")
startSparkStreaming()
Update 2
I've also tried with a jaas.conf:
~/spark-2.2.1-bin-hadoop2.7$ ./bin/spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.conf=jaas.conf" --files "jaas.conf"
and ...
KafkaClient {
    org.apache.kafka.common.security.plain.PlainLoginModule required
    username="*****"
    password="*****";
}
Still the same problem ...
First, I needed to run spark-shell with --conf options that point the executor and driver to my jaas.conf:
./bin/spark-shell --master local[1] \
--jars external/kafka-0-10-sql/target/spark-sql-kafka-0-10_2.11-2.2.2-SNAPSHOT.jar,external/kafka-0-10-assembly/target/spark-streaming-kafka-0-10-assembly_2.11-2.2.2-SNAPSHOT.jar \
--conf "spark.driver.extraJavaOptions=-Djava.security.auth.login.config=/path/to/jaas.conf" \
--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=/path/to/jaas.conf" \
--num-executors 1 --executor-cores 1
Next I had to add some kafka options:
val df = spark.readStream.
  format("kafka").
  option("kafka.bootstrap.servers", "kafka03-prod02.messagehub.services.eu-gb.bluemix.net:9093,kafka04-prod02.messagehub.services.eu-gb.bluemix.net:9093,kafka01-prod02.messagehub.services.eu-gb.bluemix.net:9093,kafka02-prod02.messagehub.services.eu-gb.bluemix.net:9093,kafka05-prod02.messagehub.services.eu-gb.bluemix.net:9093").
  option("subscribe", "transactions_load").
  option("kafka.security.protocol", "SASL_SSL").
  option("kafka.sasl.mechanism", "PLAIN").
  option("kafka.ssl.protocol", "TLSv1.2").
  option("kafka.ssl.enabled.protocols", "TLSv1.2").
  load()
Note that the kafka options need to be prefixed with kafka., for example:
security.protocol => kafka.security.protocol
These changes solved the connectivity issue for me.
This appears to be a failed authentication. At the current Message Hub version of Kafka, the server just closes the connection when authentication fails. The "sasl.jaas.config" setting is supported in the Kafka client from version 0.10.2, while the Kafka client used by Spark 2.2.1 is 0.10.0, so authentication fails as suspected.
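If you want to confirm which Kafka client actually ends up on the classpath, a quick check from the Spark shell (a sketch; AppInfoParser is a kafka-clients utility class) is:

// Prints the version of the kafka-clients jar that was loaded
println(org.apache.kafka.common.utils.AppInfoParser.getVersion)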
You can use the java.security.auth.login.config system property to specify a JAAS file. Alternatively, you can programmatically set the credentials for the client with a snippet like this:
import java.util.HashMap;
import javax.security.auth.login.AppConfigurationEntry;
import javax.security.auth.login.Configuration;

public static void jaasClientConfig(final String username, final String password) throws Exception {
    System.setProperty("java.security.auth.login.config", "");
    Configuration.setConfiguration(new Configuration() {
        public AppConfigurationEntry[] getAppConfigurationEntry(String name) {
            HashMap<String, String> idMap = new HashMap<>();
            idMap.put("serviceName", "kafka"); // seems to be optional
            idMap.put("username", username);
            idMap.put("password", password);

            AppConfigurationEntry ace = new AppConfigurationEntry(
                "org.apache.kafka.common.security.plain.PlainLoginModule",
                AppConfigurationEntry.LoginModuleControlFlag.REQUIRED, idMap);
            AppConfigurationEntry[] entry = { ace };
            return entry;
        }
    });
}
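Either way, the JAAS configuration has to be in place before the first Kafka client is instantiated, so make sure jaasClientConfig(...) runs before the readStream is built.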
I am using the MovieLens dataset to load movie information into a Spark program and print it using the following code snippet:
import org.apache.spark.{SparkConf, SparkContext}

object MovieApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("movie-recommender")
    val sc = new SparkContext(conf)

    val movieFile = "/mnt/DATASETS/ml-1m/movies.dat"
    val movieData = sc.textFile(movieFile)
    val movies = movieData.map(_.split("::") match {
      case Array(movieid, title, genres) =>
        val genreList = genres.split("|")
        (movieid, title, genreList)
    })

    println("Num movies:" + movies.count())
    movies.foreach { case movielist =>
      println("ID:" + movielist._1 + "Title:" + movielist._2)
    }
  }
}
When I run the code using the command
spark-submit --master local[4] --class "MovieApp" movie-recommender.jar
I get the expected output:
[root@philli ml]# /usr/lib/spark/bin/spark-submit --master local[4] --class "MovieApp" movie-recommender_2.10-1.0.jar
14/12/05 00:17:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Num movies:3883
ID:2020 Title:Dangerous Liaisons (1988)
ID:2021 Title:Dune (1984)
ID:2022 Title:Last Temptation of Christ, The (1988)
ID:2023 Title:Godfather: Part III, The (1990)
ID:2024 Title:Rapture, The (1991)
ID:2025 Title:Lolita (1997)
ID:2026 Title:Disturbing Behavior (1998)
ID:2027 Title:Mafia! (1998)
ID:2028 Title:Saving Private Ryan (1998)
ID:2029 Title:Billy's Hollywood Screen Kiss (1997)
...
but when I run the same on a Hadoop cluster using the command
spark-submit --master yarn-client --class "MovieApp" movie-recommender.jar
the output is different, as below (no movie details???):
[root@philli ml]# /usr/lib/spark/bin/spark-submit --master yarn-client --class "MovieApp" movie-recommender_2.10-1.0.jar
14/12/05 00:21:05 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/12/05 00:21:07 WARN BlockReaderLocal: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
--args is deprecated. Use --arg instead.
Num movies:3883
[root@philli ml]#
Why should the behavior of the program change between running locally and on the cluster? I have built spark-1.1.1 for Hadoop using the command
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
The cluster I am using is HDP 2.1.
Sample movies.dat file is as follows:
1::Toy Story (1995)::Animation|Children's|Comedy
2::Jumanji (1995)::Adventure|Children's|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
4::Waiting to Exhale (1995)::Comedy|Drama
5::Father of the Bride Part II (1995)::Comedy
6::Heat (1995)::Action|Crime|Thriller
7::Sabrina (1995)::Comedy|Romance
8::Tom and Huck (1995)::Adventure|Children's
9::Sudden Death (1995)::Action
10::GoldenEye (1995)::Action|Adventure|Thriller
When you run the program on a cluster, the foreach closure is executed on the workers, so the println calls do happen, but they go to each worker's stdout, not to the driver's.
Look into the YARN logs and you will find the expected output.
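If you want the records to show up on the driver's console instead, bring (a sample of) the data back to the driver before printing. A sketch, reasonable for a dataset of this size but not for large RDDs:

// take() returns an Array on the driver, so println now goes to the driver's stdout
movies.take(10).foreach { case (movieid, title, _) =>
  println("ID:" + movieid + " Title:" + title)
}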
The development section of the Shark/Spark wiki is really brief, so I tried to put together some code in an effort to programmatically query a table. Here it is:
object Test extends App {
  val master = "spark://localhost.localdomain:8084"
  val jobName = "scratch"
  val sparkHome = "/home/shengc/Downloads/software/spark-0.6.1"
  val executorEnvVars = Map[String, String](
    "SPARK_MEM" -> "1g",
    "SPARK_CLASSPATH" -> "",
    "HADOOP_HOME" -> "/home/shengc/Downloads/software/hadoop-0.20.205.0",
    "JAVA_HOME" -> "/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64",
    "HIVE_HOME" -> "/home/shengc/Downloads/software/hive-0.9.0-bin"
  )

  val sc = new shark.SharkContext(master, jobName, sparkHome, Nil, executorEnvVars)
  sc.sql2console("create table src")
  sc.sql2console("load data local inpath '/home/shengc/Downloads/software/hive-0.9.0-bin/examples/files/kv1.txt' into table src")
  sc.sql2console("select count(1) from src")
}
I can create table src and load data into it fine, but the last query threw an NPE and failed; here is the output:
13/01/06 17:33:20 INFO execution.SparkTask: Executing shark.execution.SparkTask
13/01/06 17:33:20 INFO shark.SharkEnv: Initializing SharkEnv
13/01/06 17:33:20 INFO execution.SparkTask: Adding jar file:///home/shengc/workspace/shark/hive/lib/hive-builtins-0.9.0.jar
java.lang.NullPointerException
at shark.execution.SparkTask$$anonfun$execute$5.apply(SparkTask.scala:58)
at shark.execution.SparkTask$$anonfun$execute$5.apply(SparkTask.scala:55)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:34)
at scala.collection.mutable.ArrayOps.foreach(ArrayOps.scala:38)
at shark.execution.SparkTask.execute(SparkTask.scala:55)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:134)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1326)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1118)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:951)
at shark.SharkContext.sql(SharkContext.scala:58)
at shark.SharkContext.sql2console(SharkContext.scala:84)
at Test$delayedInit$body.apply(Test.scala:20)
at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:60)
at scala.App$$anonfun$main$1.apply(App.scala:60)
at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
at scala.collection.immutable.List.foreach(List.scala:76)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:30)
at scala.App$class.main(App.scala:60)
at Test$.main(Test.scala:4)
at Test.main(Test.scala)
FAILED: Execution Error, return code -101 from shark.execution.SparkTask
13/01/06 17:33:20 ERROR ql.Driver: FAILED: Execution Error, return code -101 from shark.execution.SparkTask
13/01/06 17:33:20 INFO ql.Driver: </PERFLOG method=Driver.execute start=1357511600030 end=1357511600054 duration=24>
13/01/06 17:33:20 INFO ql.Driver: <PERFLOG method=releaseLocks>
13/01/06 17:33:20 INFO ql.Driver: </PERFLOG method=releaseLocks start=1357511600054 end=1357511600054 duration=0>
However, I can query the src table by typing select * from src within the shell invoked by bin/shark-withinfo.
You might ask about trying that SQL in the shell triggered by "bin/shark-shell". Well, I cannot get into that shell. Here is the error I came across:
https://groups.google.com/forum/?fromgroups=#!topic/shark-users/glZzrUfabGc
[EDIT 1]: this NPE seems to result from SharkEnv.sc not having been set, so I added
shark.SharkEnv.sc = sc
right before any sql2console operations are executed. It then complained with a ClassNotFoundException for scala.tools.nsc, so I manually put scala-compiler on the classpath. After that, the code complained about another ClassNotFoundException, which I cannot figure out how to fix, since I did put the Shark jar on the classpath.
13/01/06 18:09:34 INFO cluster.TaskSetManager: Lost TID 1 (task 1.0:1)
13/01/06 18:09:34 INFO cluster.TaskSetManager: Loss was due to java.lang.ClassNotFoundException: shark.execution.TableScanOperator$$anonfun$preprocessRdd$3
at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
[EDIT 2]: OK, I figured out another piece of code that does what I want, by following exactly how Shark's source code initializes the interactive REPL.
System.setProperty("MASTER", "spark://localhost.localdomain:8084")
System.setProperty("SPARK_MEM", "1g")
System.setProperty("SPARK_CLASSPATH", "")
System.setProperty("HADOOP_HOME", "/home/shengc/Downloads/software/hadoop-0.20.205.0")
System.setProperty("JAVA_HOME", "/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64")
System.setProperty("HIVE_HOME", "/home/shengc/Downloads/software/hive-0.9.0-bin")
System.setProperty("SCALA_HOME", "/home/shengc/Downloads/software/scala-2.9.2")
shark.SharkEnv.initWithSharkContext("scratch")
val sc = shark.SharkEnv.sc.asInstanceOf[shark.SharkContext]
sc.sql2console("select * from src")
This is ugly, but at least it works. Any comments on how to write a more robust piece of code are welcome!
For whoever wishes to operate on Shark programmatically, note that all Hive and Shark jars must be on your CLASSPATH, and the Scala compiler has to be on your classpath too. The other important thing is that Hadoop's conf directory should be on the classpath as well.
I believe the issue is that your SharkEnv is not initialized.
I'm using Shark 0.9.0 (but I believe you have to initialize SharkEnv in 0.6.1 too), and my SharkEnv is initialized in the following way:
// SharkContext
val sc = new SharkContext(master,
  jobName,
  System.getenv("SPARK_HOME"),
  Nil,
  executorEnvVar)
// Initialize SharkEnv
SharkEnv.sc = sc
// create and populate table
sc.runSql("CREATE TABLE src(key INT, value STRING)")
sc.runSql("LOAD DATA LOCAL INPATH '${env:HIVE_HOME}/examples/files/kv1.txt' INTO TABLE src")
// print result to stdout
println(sc.runSql("select * from src"))
println(sc.runSql("select count(*) from src"))
Also, try to query data from the src table without aggregate functions (comment out the line with "select count(*) ..."). I had a similar issue where the plain data query was OK but count(*) threw an exception; in my case it was fixed by adding mysql-connector-java.jar to yarn.application.classpath.