Spark Streaming Kinesis consumer returns empty data - scala

I am trying to consume a Kinesis stream using the Spark Streaming library, org.apache.spark.streaming.kinesis.KinesisUtils. I can verify that the stream has data in it using a Python script. However, when I try to write a consumer in Scala, I keep getting empty data. Here's my code:
def getKinesisData = {
  val endpointUrl = "https://kinesis.us-west-2.amazonaws.com"
  val streamName = "myAwesomeStream"
  val regionName = "us-west-2" // region matching the endpoint above
  val credentials = new DefaultAWSCredentialsProviderChain().getCredentials()
  require(credentials != null, "No AWS credentials found.")

  val kinesisClient = new AmazonKinesisClient(credentials)
  kinesisClient.setEndpoint(endpointUrl)

  val numShards = kinesisClient.describeStream(streamName).getStreamDescription().getShards().size
  val numStreams = numShards
  val batchInterval = Milliseconds(2000)
  val kinesisCheckpointInterval = batchInterval

  val sparkConfig = new SparkConf().setAppName("myAwesomeApp").setMaster("local")
  val ssc = new StreamingContext(sparkConfig, batchInterval)

  // One receiver-based stream per shard
  val kinesisStreams = (0 until numStreams).map { i =>
    println(i)
    KinesisUtils.createStream(ssc, "myAwesomeApp", streamName, endpointUrl, regionName,
      InitialPositionInStream.LATEST, kinesisCheckpointInterval, StorageLevel.MEMORY_AND_DISK_2)
  }

  val unionStreams = ssc.union(kinesisStreams)
  // Convert each line of Array[Byte] to String, and split into words
  val words = unionStreams.flatMap(byteArray => new String(byteArray).split(" "))
  val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)
  wordCounts.print()
}
I got this code as an example from GitHub, and I don't really care about the union, flatMap, and word-count parts in the later half of the code. I just need to know how I can get the actual data from the stream.
UPDATE:
It prints the following on the console while I run it
16/12/16 14:57:01 INFO SparkContext: Running Spark version 2.0.0
16/12/16 14:57:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/12/16 14:57:02 INFO SecurityManager: Changing view acls to:
16/12/16 14:57:02 INFO SecurityManager: Changing modify acls to:
16/12/16 14:57:02 INFO SecurityManager: Changing view acls groups to:
16/12/16 14:57:02 INFO SecurityManager: Changing modify acls groups to:
16/12/16 14:57:02 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(username); groups with view permissions: Set(); users with modify permissions: Set(username); groups with modify permissions: Set()
16/12/16 14:57:02 INFO Utils: Successfully started service 'sparkDriver' on port 54774.
16/12/16 14:57:02 INFO SparkEnv: Registering MapOutputTracker
16/12/16 14:57:02 INFO SparkEnv: Registering BlockManagerMaster
16/12/16 14:57:02 INFO DiskBlockManager: Created local directory at
16/12/16 14:57:02 INFO MemoryStore: MemoryStore started with capacity 2004.6 MB
16/12/16 14:57:02 INFO SparkEnv: Registering OutputCommitCoordinator
16/12/16 14:57:02 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/12/16 14:57:02 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://<I masked this IP address and port>
16/12/16 14:57:03 INFO Executor: Starting executor ID driver on host localhost
16/12/16 14:57:03 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 54775.
16/12/16 14:57:03 INFO NettyBlockTransferService: Server created on <I masked this IP address and port>
16/12/16 14:57:03 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, <I masked this IP address and port>)
16/12/16 14:57:03 INFO BlockManagerMasterEndpoint: Registering block manager <I masked this IP address and port> with 2004.6 MB RAM, BlockManagerId(driver, <I masked this IP address and port>)
16/12/16 14:57:03 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, <I masked this IP address and port>)
16/12/16 14:57:03 WARN StreamingContext: spark.master should be set as local[n], n > 1 in local mode if you have receivers to get data, otherwise Spark jobs will not get resources to process the received data.
0 <-- printing shard
1 <-- printing shard
#### PRINTING kinesisStreams ######
Vector(org.apache.spark.streaming.kinesis.KinesisInputDStream@2650f79, org.apache.spark.streaming.kinesis.KinesisInputDStream@75fc1992)
#### PRINTING unionStreams ######
()
#### PRINTING words ######
org.apache.spark.streaming.dstream.FlatMappedDStream@6fd12c5
#### PRINTING wordCounts ######
org.apache.spark.streaming.dstream.ShuffledDStream@790a251b
16/12/16 14:57:03 INFO SparkContext: Invoking stop() from shutdown hook
16/12/16 14:57:03 INFO SparkUI: Stopped Spark web UI at http://<I masked this IP address and port>
16/12/16 14:57:03 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/12/16 14:57:03 INFO MemoryStore: MemoryStore cleared
16/12/16 14:57:03 INFO BlockManager: BlockManager stopped
16/12/16 14:57:03 INFO BlockManagerMaster: BlockManagerMaster stopped
16/12/16 14:57:03 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/12/16 14:57:03 INFO SparkContext: Successfully stopped SparkContext
16/12/16 14:57:03 INFO ShutdownHookManager: Shutdown hook called
16/12/16 14:57:03 INFO ShutdownHookManager: Deleting directory

The problem was with version 1.5.2 of the Spark library, which does not work well with Kinesis.

Hope this can help someone having this issue.
If you are facing this issue, it may not be a real error.
The Kinesis integration, like Spark's receiver-based Kafka integration, uses the Receiver API, which runs in a different thread from both the driver and the executors. There is an initial lag period where everything looks started, but the Kinesis receiver is still running some setup procedures before it actually downloads data from Kinesis.
Solution: wait. In my case, data appeared on the Spark side after 40-50 seconds.
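Two more things are visible in the log above and are worth fixing before waiting. First, the startup WARN ("spark.master should be set as local[n], n > 1 in local mode if you have receivers") is real: with setMaster("local") the single core is consumed by a receiver and no batch can ever be processed, so with two shards you need at least local[3] (one core per receiver plus at least one for processing). Second, the snippet in the question never calls ssc.start(), which is why the shutdown hook fires immediately. A minimal sketch of the missing tail of getKinesisData (the timeout value is just an illustration chosen to cover the receiver warm-up lag):

```scala
// ... after wordCounts.print() ...
ssc.start()                               // start the receivers; the Kinesis receiver warms up asynchronously
ssc.awaitTerminationOrTimeout(120 * 1000) // keep the driver alive well past the ~40-50 s receiver lag
ssc.stop(stopSparkContext = true)
```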

Related

Exception in thread "main" java.lang.NullPointerException com.databricks.dbutils_v1.DBUtilsHolder$$anon$1.invoke

I would like to read a Parquet file in Azure Blob storage, so I mounted the data from Azure Blob to local with dbutils.fs.mount.
But I got the error Exception in thread "main" java.lang.NullPointerException.
Below is my log:
hello big data
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/06/10 23:20:10 INFO SparkContext: Running Spark version 2.1.0
20/06/10 23:20:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/06/10 23:20:11 INFO SecurityManager: Changing view acls to: Admin
20/06/10 23:20:11 INFO SecurityManager: Changing modify acls to: Admin
20/06/10 23:20:11 INFO SecurityManager: Changing view acls groups to:
20/06/10 23:20:11 INFO SecurityManager: Changing modify acls groups to:
20/06/10 23:20:11 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(Admin); groups with view permissions: Set(); users with modify permissions: Set(Admin); groups with modify permissions: Set()
20/06/10 23:20:12 INFO Utils: Successfully started service 'sparkDriver' on port 4725.
20/06/10 23:20:12 INFO SparkEnv: Registering MapOutputTracker
20/06/10 23:20:13 INFO SparkEnv: Registering BlockManagerMaster
20/06/10 23:20:13 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
20/06/10 23:20:13 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
20/06/10 23:20:13 INFO DiskBlockManager: Created local directory at C:\Users\Admin\AppData\Local\Temp\blockmgr-c023c3b8-fd70-461a-ac69-24ce9c770efe
20/06/10 23:20:13 INFO MemoryStore: MemoryStore started with capacity 894.3 MB
20/06/10 23:20:13 INFO SparkEnv: Registering OutputCommitCoordinator
20/06/10 23:20:13 INFO Utils: Successfully started service 'SparkUI' on port 4040.
20/06/10 23:20:13 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.0.102:4040
20/06/10 23:20:13 INFO Executor: Starting executor ID driver on host localhost
20/06/10 23:20:13 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 4738.
20/06/10 23:20:13 INFO NettyBlockTransferService: Server created on 192.168.0.102:4738
20/06/10 23:20:13 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/06/10 23:20:13 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 192.168.0.102, 4738, None)
20/06/10 23:20:13 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.0.102:4738 with 894.3 MB RAM, BlockManagerId(driver, 192.168.0.102, 4738, None)
20/06/10 23:20:13 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.0.102, 4738, None)
20/06/10 23:20:13 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 192.168.0.102, 4738, None)
20/06/10 23:20:14 INFO SharedState: Warehouse path is 'file:/E:/sparkdemo/sparkdemo/spark-warehouse/'.
Exception in thread "main" java.lang.NullPointerException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.databricks.dbutils_v1.DBUtilsHolder$$anon$1.invoke(DBUtilsHolder.scala:17)
at com.sun.proxy.$Proxy7.fs(Unknown Source)
at Transform$.main(Transform.scala:19)
at Transform.main(Transform.scala)
20/06/10 23:20:14 INFO SparkContext: Invoking stop() from shutdown hook
20/06/10 23:20:14 INFO SparkUI: Stopped Spark web UI at http://192.168.0.102:4040
20/06/10 23:20:14 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
20/06/10 23:20:14 INFO MemoryStore: MemoryStore cleared
20/06/10 23:20:14 INFO BlockManager: BlockManager stopped
20/06/10 23:20:14 INFO BlockManagerMaster: BlockManagerMaster stopped
20/06/10 23:20:14 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
20/06/10 23:20:14 INFO SparkContext: Successfully stopped SparkContext
20/06/10 23:20:14 INFO ShutdownHookManager: Shutdown hook called
20/06/10 23:20:14 INFO ShutdownHookManager: Deleting directory C:\Users\Admin\AppData\Local\Temp\spark-cbdbcfe7-bc70-4d34-ad8e-5baed8308ae2
My code:
import com.databricks.dbutils_v1.DBUtilsHolder.dbutils
import org.apache.spark.sql.SparkSession

object Demo {
  def main(args: Array[String]): Unit = {
    println("hello big data")
    val containerName = "container1"
    val storageAccountName = "storageaccount1"
    val sas = "saskey"
    val url = "wasbs://" + containerName + "@" + storageAccountName + ".blob.core.windows.net/"
    var config = "fs.azure.sas." + containerName + "." + storageAccountName + ".blob.core.windows.net"

    // Spark session
    val spark: SparkSession = SparkSession.builder
      .appName("SpartDemo")
      .master("local[1]")
      .getOrCreate()

    // Mount data
    dbutils.fs.mount(
      source = url,
      mountPoint = "/mnt/container1",
      extraConfigs = Map(config -> sas))

    val parquetFileDF = spark.read.parquet("/mnt/container1/test1.parquet")
    parquetFileDF.show()
  }
}
My sbt file:
name := "sparkdemo1"
version := "0.1"
scalaVersion := "2.11.8"
libraryDependencies ++= Seq(
  "com.databricks" % "dbutils-api_2.11" % "0.0.3",
  "org.apache.spark" % "spark-core_2.11" % "2.1.0",
  "org.apache.spark" % "spark-sql_2.11" % "2.1.0"
)
Are you running this on a Databricks instance?
If not, that's the problem: dbutils is provided by the Databricks execution context.
In that case, as far as I know, you have three options:
Package your application into a jar file and run it using a Databricks job
Use databricks-connect
Try to emulate a mocked dbutils instance outside Databricks as shown here:
com.databricks.dbutils_v1.DBUtilsHolder.dbutils0.set(
  new com.databricks.dbutils_v1.DBUtilsV1 {
    ...
  }
)
Anyway, I'd say that options 1 and 2 are better than the third one. Also, by choosing one of those you don't need to include the "dbutils-api_2.11" dependency, as it is provided by the Databricks cluster.
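If the goal is only to read the Parquet file rather than to exercise dbutils itself, there is also a workaround that skips the mount entirely: let Spark read the wasbs:// URL directly, passing the SAS key through the Hadoop configuration. This is a sketch using the question's names (container1, storageaccount1, saskey), and it assumes the hadoop-azure and azure-storage jars are on the classpath:

```scala
import org.apache.spark.sql.SparkSession

object DirectRead {
  def main(args: Array[String]): Unit = {
    val containerName = "container1"
    val storageAccountName = "storageaccount1"
    val sas = "saskey"

    val spark = SparkSession.builder
      .appName("DirectRead")
      .master("local[1]")
      .getOrCreate()

    // Hand the SAS token to the wasbs filesystem instead of mounting
    spark.sparkContext.hadoopConfiguration.set(
      s"fs.azure.sas.$containerName.$storageAccountName.blob.core.windows.net", sas)

    // Read straight from blob storage; no /mnt path and no dbutils needed
    val df = spark.read.parquet(
      s"wasbs://$containerName@$storageAccountName.blob.core.windows.net/test1.parquet")
    df.show()
  }
}
```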

How to raise log level to error in Spark?

I have tried to suppress logs with spark.sparkContext.setLogLevel("ERROR") in:
package com.databricks.example

import org.apache.log4j.Logger
import org.apache.spark.sql.SparkSession

object DFUtils extends Serializable {
  @transient lazy val logger = Logger.getLogger(getClass.getName)

  def pointlessUDF(raw: String) = {
    raw
  }
}

object DataFrameExample extends Serializable {
  def main(args: Array[String]): Unit = {
    val pathToDataFolder = args(0)
    // println(pathToDataFolder + "data.json")

    // start up the SparkSession
    // along with explicitly setting a given config
    val spark = SparkSession.builder().appName("Spark Example")
      .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
      .getOrCreate()

    // suppress logs by raising the log level
    spark.sparkContext.setLogLevel("ERROR")
    // println(spark.range(1, 2000).count());

    // udf registration
    spark.udf.register("myUDF", DFUtils.pointlessUDF(_: String): String)

    val df = spark.read.json(pathToDataFolder + "data.json")
    df.printSchema()
    // df.collect.foreach(println)
    // val x = df.select("value").foreach(x => println(x));
    val manipulated = df.groupBy("grouping").sum().collect().foreach(x => println(x))
    // val manipulated = df.groupBy(expr("myUDF(group)")).sum().collect().foreach(x => println(x))
  }
}
Why do I still get INFO and WARN level logs? Have I successfully raised the log level to ERROR? Thanks.
$ ~/programs/spark/spark-2.4.5-bin-hadoop2.7/bin/spark-submit --class com.databricks.example.DataFrameExample --master local target/scala-2.11/example_2.11-0.1-SNAPSHOT.jar /tmp/test/
20/03/19 10:09:10 WARN Utils: Your hostname, ocean resolves to a loopback address: 127.0.1.1; using 192.168.122.1 instead (on interface virbr0)
20/03/19 10:09:10 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
20/03/19 10:09:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/03/19 10:09:12 INFO SparkContext: Running Spark version 2.4.5
20/03/19 10:09:12 INFO SparkContext: Submitted application: Spark Example
20/03/19 10:09:12 INFO SecurityManager: Changing view acls to: t
20/03/19 10:09:12 INFO SecurityManager: Changing modify acls to: t
20/03/19 10:09:12 INFO SecurityManager: Changing view acls groups to:
20/03/19 10:09:12 INFO SecurityManager: Changing modify acls groups to:
20/03/19 10:09:12 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(t); groups with view permissions: Set(); users with modify permissions: Set(t); groups with modify permissions: Set()
20/03/19 10:09:13 INFO Utils: Successfully started service 'sparkDriver' on port 35821.
20/03/19 10:09:13 INFO SparkEnv: Registering MapOutputTracker
20/03/19 10:09:13 INFO SparkEnv: Registering BlockManagerMaster
20/03/19 10:09:13 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
20/03/19 10:09:13 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
20/03/19 10:09:13 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-ce47f30a-ee1c-44a8-9f5b-204905ee3b2d
20/03/19 10:09:13 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
20/03/19 10:09:13 INFO SparkEnv: Registering OutputCommitCoordinator
20/03/19 10:09:14 INFO Utils: Successfully started service 'SparkUI' on port 4040.
20/03/19 10:09:14 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.122.1:4040
20/03/19 10:09:14 INFO SparkContext: Added JAR file:/tmp/test/bookexample/target/scala-2.11/example_2.11-0.1-SNAPSHOT.jar at spark://192.168.122.1:35821/jars/example_2.11-0.1-SNAPSHOT.jar with timestamp 1584626954295
20/03/19 10:09:14 INFO Executor: Starting executor ID driver on host localhost
20/03/19 10:09:14 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 39215.
20/03/19 10:09:14 INFO NettyBlockTransferService: Server created on 192.168.122.1:39215
20/03/19 10:09:14 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/03/19 10:09:14 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 192.168.122.1, 39215, None)
20/03/19 10:09:14 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.122.1:39215 with 366.3 MB RAM, BlockManagerId(driver, 192.168.122.1, 39215, None)
20/03/19 10:09:14 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.122.1, 39215, None)
20/03/19 10:09:14 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 192.168.122.1, 39215, None)
root
|-- grouping: string (nullable = true)
|-- value: long (nullable = true)
[group_3,10]
[group_1,12]
[group_2,5]
[group_4,2]
You need to add a log4j.properties file to your resources folder. Otherwise Spark uses the default settings shipped in your Spark folder (on Linux usually somewhere like /etc/spark2/.../log4j-defaults.properties).
The location is also mentioned in your log output:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Make sure to set the rootCategory to ERROR, like in the following example:
# Set everything to be logged to the console
log4j.rootCategory=ERROR, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
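As for why the question's setLogLevel("ERROR") call seems ineffective: it only takes effect once the SparkContext exists, so everything logged during startup (all the INFO lines shown above) has already been emitted. Only a log4j.properties like the one above silences output from the first line. As a programmatic stopgap, you can also raise the level on the noisy logger hierarchies as early as possible in main, before building the session; this sketch uses the log4j 1.x API that Spark 2.x bundles (the logger names are the usual noisy hierarchies, adjust as needed):

```scala
import org.apache.log4j.{Level, Logger}

// Call before SparkSession.builder()...getOrCreate()
Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
Logger.getLogger("org.spark_project").setLevel(Level.ERROR)
Logger.getLogger("akka").setLevel(Level.ERROR)
```

Even this runs after the JVM starts, so anything spark-submit logs before main is reached will still appear.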

Spark Streaming job won't schedule additional work

Spark 2.1.1 built for Hadoop 2.7.3
Scala 2.11.11
The cluster has 3 Linux RHEL 7.3 Azure VMs, running in Spark standalone deploy mode (no YARN or Mesos, yet).
I have created a very simple Spark Streaming job using IntelliJ, written in Scala. I'm using Maven and building the job into a fat/uber jar that contains all dependencies.
When I run the job locally it works fine. If I copy the jar to the cluster and run it with a master of local[2], it also works fine. However, if I submit the job to the cluster master, it's as if it does not want to schedule additional work beyond the first task. The job starts up, grabs however many events are in the Azure Event Hub, processes them successfully, then never does any more work. It does not matter whether I submit the job to the master as a plain application or in supervised cluster mode; both do the same thing.
I've looked through all the logs I know of (master, driver (where applicable), and executor) and I am not seeing any errors or warnings that seem actionable. I've altered the log level, shown below, to show ALL/INFO/DEBUG and sifted through those logs without finding anything that seems relevant.
It may be worth noting that I had previously created several jobs that connect to Kafka, instead of the Azure Event Hub, using Java and those jobs run in supervised cluster mode without an issue on this same cluster. This leads me to believe that the cluster configuration isn't an issue, it's either something with my code (below) or the Azure Event Hub.
Any thoughts on where I might check next to isolate this issue? Here is the code for my simple job.
Thanks in advance.
Note: conf.{name} indicates values I'm loading from a config file. I've tested loading and hard-coding them, both with the same result.
package streamingJob

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.eventhubs.EventHubsUtils
import org.joda.time.DateTime

object TestJob {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
    sparkConf.setAppName("TestJob")
    // Uncomment to run locally
    //sparkConf.setMaster("local[2]")
    val sparkContext = new SparkContext(sparkConf)
    sparkContext.setLogLevel("ERROR")

    val streamingContext: StreamingContext = new StreamingContext(sparkContext, Seconds(1))

    val readerParams = Map[String, String](
      "eventhubs.policyname" -> conf.policyname,
      "eventhubs.policykey" -> conf.policykey,
      "eventhubs.namespace" -> conf.namespace,
      "eventhubs.name" -> conf.name,
      "eventhubs.partition.count" -> conf.partitionCount,
      "eventhubs.consumergroup" -> conf.consumergroup
    )

    val eventData = EventHubsUtils.createDirectStreams(
      streamingContext,
      conf.namespace,
      conf.progressdir,
      Map("name" -> readerParams))

    eventData.foreachRDD { r =>
      r.foreachPartition { p =>
        p.foreach(d => println(DateTime.now() + ": " + d))
      }
    }

    streamingContext.start()
    streamingContext.awaitTermination()
  }
}
Here is a set of logs from when I run this as a plain application, not in cluster/supervised mode.
/spark/bin/spark-submit --class streamingJob.TestJob --master spark://{ip}:7077 --total-executor-cores 1 /spark/job-files/fatjar.jar
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/11/06 17:52:04 INFO SparkContext: Running Spark version 2.1.1
17/11/06 17:52:05 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/11/06 17:52:05 INFO SecurityManager: Changing view acls to: root
17/11/06 17:52:05 INFO SecurityManager: Changing modify acls to: root
17/11/06 17:52:05 INFO SecurityManager: Changing view acls groups to:
17/11/06 17:52:05 INFO SecurityManager: Changing modify acls groups to:
17/11/06 17:52:05 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
17/11/06 17:52:06 INFO Utils: Successfully started service 'sparkDriver' on port 44384.
17/11/06 17:52:06 INFO SparkEnv: Registering MapOutputTracker
17/11/06 17:52:06 INFO SparkEnv: Registering BlockManagerMaster
17/11/06 17:52:06 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/11/06 17:52:06 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/11/06 17:52:06 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-b5e2c0f3-2500-42c6-b057-cf5d368580ab
17/11/06 17:52:06 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
17/11/06 17:52:06 INFO SparkEnv: Registering OutputCommitCoordinator
17/11/06 17:52:06 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/11/06 17:52:06 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://{ip}:4040
17/11/06 17:52:06 INFO SparkContext: Added JAR file:/spark/job-files/fatjar.jar at spark://{ip}:44384/jars/fatjar.jar with timestamp 1509990726989
17/11/06 17:52:07 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://{ip}:7077...
17/11/06 17:52:07 INFO TransportClientFactory: Successfully created connection to /{ip}:7077 after 72 ms (0 ms spent in bootstraps)
17/11/06 17:52:07 INFO StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20171106175207-0000
17/11/06 17:52:07 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 44624.
17/11/06 17:52:07 INFO NettyBlockTransferService: Server created on {ip}:44624
17/11/06 17:52:07 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/11/06 17:52:07 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20171106175207-0000/0 on worker-20171106173151-{ip}-46086 ({ip}:46086) with 1 cores
17/11/06 17:52:07 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, {ip}, 44624, None)
17/11/06 17:52:07 INFO StandaloneSchedulerBackend: Granted executor ID app-20171106175207-0000/0 on hostPort {ip}:46086 with 1 cores, 1024.0 MB RAM
17/11/06 17:52:07 INFO BlockManagerMasterEndpoint: Registering block manager {ip}:44624 with 366.3 MB RAM, BlockManagerId(driver, {ip}, 44624, None)
17/11/06 17:52:07 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, {ip}, 44624, None)
17/11/06 17:52:07 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, {ip}, 44624, None)
17/11/06 17:52:07 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20171106175207-0000/0 is now RUNNING
17/11/06 17:52:08 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0

Spark streaming with Kafka connector stopping

I am getting started with Spark Streaming. I want to get a stream from Kafka with sample code I found in the Spark documentation: https://spark.apache.org/docs/2.1.0/streaming-kafka-0-10-integration.html
Here is my code:
object SparkStreaming {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Test_kafka_spark").setMaster("local[*]") // local parallelism 1
    val ssc = new StreamingContext(conf, Seconds(1))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9093",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "test",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val topics = Array("spark")
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](topics, kafkaParams)
    )

    stream.map(record => (record.key, record.value))
  }
}
Everything seemed to start well, but the job stopped immediately. Logs follow:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/04/19 14:37:37 INFO SparkContext: Running Spark version 2.1.0
17/04/19 14:37:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/04/19 14:37:37 WARN Utils: Your hostname, thibaut-Precision-M4600 resolves to a loopback address: 127.0.1.1; using 10.192.176.101 instead (on interface eno1)
17/04/19 14:37:37 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
17/04/19 14:37:37 INFO SecurityManager: Changing view acls to: thibaut
17/04/19 14:37:37 INFO SecurityManager: Changing modify acls to: thibaut
17/04/19 14:37:37 INFO SecurityManager: Changing view acls groups to:
17/04/19 14:37:37 INFO SecurityManager: Changing modify acls groups to:
17/04/19 14:37:37 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(thibaut); groups with view permissions: Set(); users with modify permissions: Set(thibaut); groups with modify permissions: Set()
17/04/19 14:37:37 INFO Utils: Successfully started service 'sparkDriver' on port 41046.
17/04/19 14:37:37 INFO SparkEnv: Registering MapOutputTracker
17/04/19 14:37:37 INFO SparkEnv: Registering BlockManagerMaster
17/04/19 14:37:37 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/04/19 14:37:37 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/04/19 14:37:37 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-266e2f13-0eb2-40a8-9d2f-d50797099a29
17/04/19 14:37:37 INFO MemoryStore: MemoryStore started with capacity 879.3 MB
17/04/19 14:37:37 INFO SparkEnv: Registering OutputCommitCoordinator
17/04/19 14:37:38 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/04/19 14:37:38 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.192.176.101:4040
17/04/19 14:37:38 INFO Executor: Starting executor ID driver on host localhost
17/04/19 14:37:38 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 39207.
17/04/19 14:37:38 INFO NettyBlockTransferService: Server created on 10.192.176.101:39207
17/04/19 14:37:38 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/04/19 14:37:38 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.192.176.101, 39207, None)
17/04/19 14:37:38 INFO BlockManagerMasterEndpoint: Registering block manager 10.192.176.101:39207 with 879.3 MB RAM, BlockManagerId(driver, 10.192.176.101, 39207, None)
17/04/19 14:37:38 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.192.176.101, 39207, None)
17/04/19 14:37:38 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.192.176.101, 39207, None)
17/04/19 14:37:38 WARN KafkaUtils: overriding enable.auto.commit to false for executor
17/04/19 14:37:38 WARN KafkaUtils: overriding auto.offset.reset to none for executor
17/04/19 14:37:38 WARN KafkaUtils: overriding executor group.id to spark-executor-test
17/04/19 14:37:38 WARN KafkaUtils: overriding receive.buffer.bytes to 65536 see KAFKA-3135
17/04/19 14:37:38 INFO SparkContext: Invoking stop() from shutdown hook
17/04/19 14:37:38 INFO SparkUI: Stopped Spark web UI at http://10.192.176.101:4040
17/04/19 14:37:38 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/04/19 14:37:38 INFO MemoryStore: MemoryStore cleared
17/04/19 14:37:38 INFO BlockManager: BlockManager stopped
17/04/19 14:37:38 INFO BlockManagerMaster: BlockManagerMaster stopped
17/04/19 14:37:38 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/04/19 14:37:38 INFO SparkContext: Successfully stopped SparkContext
17/04/19 14:37:38 INFO ShutdownHookManager: Shutdown hook called
17/04/19 14:37:38 INFO ShutdownHookManager: Deleting directory /tmp/spark-f28a1361-58ba-416b-ac8e-11da0044c1f2
Thanks for any help.
It appears you haven't started your StreamingContext.
Try adding these two lines at the end:
ssc.start()
ssc.awaitTermination()
You did not call any output operation on the DStream, so nothing gets executed (map is a transformation and is lazy), and you also need to start the StreamingContext.
Please look at this complete example:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/DirectKafkaWordCount.scala
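Putting both answers together, the tail of the question's main would become something like this (print is just one convenient output operation; any output operation works):

```scala
// An output operation forces the lazy transformations to run each batch
stream.map(record => (record.key, record.value)).print()

ssc.start()            // actually begin consuming from Kafka
ssc.awaitTermination() // block so the driver doesn't exit immediately
```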

Spark cassandra connector doesn't work in Standalone Spark cluster

I have a Maven Scala application that submits a Spark job to a standalone single-node Spark cluster. When the job is submitted, the Spark application tries to access Cassandra, hosted on an Amazon EC2 instance, using spark-cassandra-connector. The connection is established, but results are not returned; after some time the connector disconnects. It works fine if I run Spark in local mode.
I tried to create a simple application, and my code looks like this:
val sc = SparkContextLoader.getSC

def runSparkJob(): Unit = {
  val table = sc.cassandraTable("prosolo_logs_zj", "logevents")
  println(table.collect().mkString("\n"))
}
SparkContext.scala
object SparkContextLoader {
  val sparkConf = new SparkConf()
  sparkConf.setMaster("spark://127.0.1.1:7077")
  sparkConf.set("spark.cores.max", "2")
  sparkConf.set("spark.executor.memory", "2g")
  sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  sparkConf.setAppName("Test application")
  sparkConf.set("spark.cassandra.connection.host", "xxx.xxx.xxx.xxx")
  sparkConf.set("spark.cassandra.connection.port", "9042")
  sparkConf.set("spark.ui.port", "4041")

  val oneJar = "/samplesparkmaven/target/samplesparkmaven-jar.jar"
  sparkConf.setJars(List(oneJar))

  @transient val sc = new SparkContext(sparkConf)
}
Console output looks like:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/02/14 23:11:25 INFO SparkContext: Running Spark version 2.1.0
17/02/14 23:11:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/02/14 23:11:27 WARN Utils: Your hostname, zoran-Latitude-E5420 resolves to a loopback address: 127.0.1.1; using 192.168.2.68 instead (on interface wlp2s0)
17/02/14 23:11:27 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
17/02/14 23:11:27 INFO SecurityManager: Changing view acls to: zoran
17/02/14 23:11:27 INFO SecurityManager: Changing modify acls to: zoran
17/02/14 23:11:27 INFO SecurityManager: Changing view acls groups to:
17/02/14 23:11:27 INFO SecurityManager: Changing modify acls groups to:
17/02/14 23:11:27 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(zoran); groups with view permissions: Set(); users with modify permissions: Set(zoran); groups with modify permissions: Set()
17/02/14 23:11:28 INFO Utils: Successfully started service 'sparkDriver' on port 33995.
17/02/14 23:11:28 INFO SparkEnv: Registering MapOutputTracker
17/02/14 23:11:28 INFO SparkEnv: Registering BlockManagerMaster
17/02/14 23:11:28 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/02/14 23:11:28 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/02/14 23:11:28 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-7b25a4cc-cb37-4332-a59b-e36fa45511cd
17/02/14 23:11:28 INFO MemoryStore: MemoryStore started with capacity 870.9 MB
17/02/14 23:11:28 INFO SparkEnv: Registering OutputCommitCoordinator
17/02/14 23:11:28 INFO Utils: Successfully started service 'SparkUI' on port 4041.
17/02/14 23:11:28 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.2.68:4041
17/02/14 23:11:28 INFO SparkContext: Added JAR /samplesparkmaven/target/samplesparkmaven-jar.jar at spark://192.168.2.68:33995/jars/samplesparkmaven-jar.jar with timestamp 1487142688817
17/02/14 23:11:28 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://127.0.1.1:7077...
17/02/14 23:11:28 INFO TransportClientFactory: Successfully created connection to /127.0.1.1:7077 after 62 ms (0 ms spent in bootstraps)
17/02/14 23:11:29 INFO StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20170214231129-0016
17/02/14 23:11:29 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 36901.
17/02/14 23:11:29 INFO NettyBlockTransferService: Server created on 192.168.2.68:36901
17/02/14 23:11:29 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/02/14 23:11:29 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 192.168.2.68, 36901, None)
17/02/14 23:11:29 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.2.68:36901 with 870.9 MB RAM, BlockManagerId(driver, 192.168.2.68, 36901, None)
17/02/14 23:11:29 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.2.68, 36901, None)
17/02/14 23:11:29 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 192.168.2.68, 36901, None)
17/02/14 23:11:29 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
17/02/14 23:11:29 INFO NettyUtil: Found Netty's native epoll transport in the classpath, using it
17/02/14 23:11:31 INFO Cluster: New Cassandra host /xxx.xxx.xxx.xxx:9042 added
17/02/14 23:11:31 INFO CassandraConnector: Connected to Cassandra cluster: Test Cluster
17/02/14 23:11:32 INFO SparkContext: Starting job: collect at SparkConnector.scala:28
17/02/14 23:11:32 INFO DAGScheduler: Got job 0 (collect at SparkConnector.scala:28) with 6 output partitions
17/02/14 23:11:32 INFO DAGScheduler: Final stage: ResultStage 0 (collect at SparkConnector.scala:28)
17/02/14 23:11:32 INFO DAGScheduler: Parents of final stage: List()
17/02/14 23:11:32 INFO DAGScheduler: Missing parents: List()
17/02/14 23:11:32 INFO DAGScheduler: Submitting ResultStage 0 (CassandraTableScanRDD[0] at RDD at CassandraRDD.scala:18), which has no missing parents
17/02/14 23:11:32 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 8.4 KB, free 870.9 MB)
17/02/14 23:11:32 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 4.4 KB, free 870.9 MB)
17/02/14 23:11:32 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.2.68:36901 (size: 4.4 KB, free: 870.9 MB)
17/02/14 23:11:32 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:996
17/02/14 23:11:32 INFO DAGScheduler: Submitting 6 missing tasks from ResultStage 0 (CassandraTableScanRDD[0] at RDD at CassandraRDD.scala:18)
17/02/14 23:11:32 INFO TaskSchedulerImpl: Adding task set 0.0 with 6 tasks
17/02/14 23:11:39 INFO CassandraConnector: Disconnected from Cassandra cluster: Test Cluster
I'm using:
Scala 2.11.6
Spark 2.1.0 (both for the standalone cluster and as the application dependency)
spark-cassandra-connector 2.0.0-M3
Cassandra Java driver 3.0.0
Apache Cassandra 3.9
The version compatibility table for the Cassandra connector doesn't show any problem with these versions, and I can't figure out what else might be causing the issue.
I've finally solved the problem. It turned out to be an issue with the jar path: I was using a local path but forgot to add "." at the beginning, so it was treated as an absolute path.
Unfortunately, the application threw no exception indicating that the file didn't exist at the provided path; the only exception came from the worker, which could not find the jar file in the Spark cluster.
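The difference between the two paths can be seen with plain `java.io.File` (a sketch; the directory names are just the ones from the question):

```scala
import java.io.File

// As written in SparkContextLoader: a leading "/" makes the path
// absolute, resolved from the filesystem root.
val asWritten = new File("/samplesparkmaven/target/samplesparkmaven-jar.jar")
assert(asWritten.isAbsolute)

// With a leading "." the path is relative, resolved against the
// current working directory -- which is what was intended.
val intended = new File("./samplesparkmaven/target/samplesparkmaven-jar.jar")
assert(!intended.isAbsolute)
assert(intended.getCanonicalPath ==
  new File(System.getProperty("user.dir"),
           "samplesparkmaven/target/samplesparkmaven-jar.jar").getCanonicalPath)
```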