System memory 259522560 must be at least 4.718592E8. Please use a larger heap size - scala

I get this error when I run my Spark scripts with Spark 1.6.
The same scripts work with Spark 1.5.
Java version: 1.8
Scala version: 2.11.7
I tried changing the JAVA_OPTS system environment variable (for example JAVA_OPTS=-Xms128m -Xmx512m) several times, with different values of Xms and Xmx, but it didn't change anything.
I also tried to modify the memory settings of IntelliJ:
Help / Change Memory Settings...
File / Settings / Scala Compiler...
Nothing worked.
There are several users on the computer; Java is installed at the root of the machine, while IntelliJ is installed in the folder of one of the users. Could that have an impact?
Here are the logs of the error:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/04/30 17:06:54 INFO SparkContext: Running Spark version 1.6.0
20/04/30 17:06:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/04/30 17:06:55 INFO SecurityManager: Changing view acls to:
20/04/30 17:06:55 INFO SecurityManager: Changing modify acls to:
20/04/30 17:06:55 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(); users with modify permissions: Set()
20/04/30 17:06:56 INFO Utils: Successfully started service 'sparkDriver' on port 57698.
20/04/30 17:06:57 INFO Slf4jLogger: Slf4jLogger started
20/04/30 17:06:57 INFO Remoting: Starting remoting
20/04/30 17:06:57 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@10.1.5.175:57711]
20/04/30 17:06:57 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 57711.
20/04/30 17:06:57 INFO SparkEnv: Registering MapOutputTracker
20/04/30 17:06:57 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: System memory 259522560 must be at least 4.718592E8. Please use a larger heap size.
at org.apache.spark.memory.UnifiedMemoryManager$.getMaxMemory(UnifiedMemoryManager.scala:193)
at org.apache.spark.memory.UnifiedMemoryManager$.apply(UnifiedMemoryManager.scala:175)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:354)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:193)
at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:288)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:457)
at batch.BatchJob$.main(BatchJob.scala:23)
at batch.BatchJob.main(BatchJob.scala)
20/04/30 17:06:57 INFO SparkContext: Successfully stopped SparkContext
Exception in thread "main" java.lang.IllegalArgumentException: System memory 259522560 must be at least 4.718592E8. Please use a larger heap size.
at org.apache.spark.memory.UnifiedMemoryManager$.getMaxMemory(UnifiedMemoryManager.scala:193)
at org.apache.spark.memory.UnifiedMemoryManager$.apply(UnifiedMemoryManager.scala:175)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:354)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:193)
at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:288)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:457)
at batch.BatchJob$.main(BatchJob.scala:23)
at batch.BatchJob.main(BatchJob.scala)
And the beginning of the code:
package batch
import java.lang.management.ManagementFactory
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.{SaveMode, SQLContext}
object BatchJob {
def main (args: Array[String]): Unit = {
// get spark configuration
val conf = new SparkConf()
.setAppName("Lambda with Spark")
// Check if running from IDE
if (ManagementFactory.getRuntimeMXBean.getInputArguments.toString.contains("IntelliJ IDEA")) {
System.setProperty("hadoop.home.dir", "C:\\Libraries\\WinUtils") // required for winutils
conf.setMaster("local[*]")
}
// setup spark context
val sc = new SparkContext(conf)
implicit val sqlContext = new SQLContext(sc)
...

I finally found a solution:
add -Xms2g -Xmx4g to the VM options directly in the IntelliJ Scala Console.
That's the only thing that worked for me.
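The threshold in the error message can be reproduced by hand. As far as I can tell, Spark 1.6's UnifiedMemoryManager reserves 300 MB and requires 1.5 times that amount, which is 471859200 bytes (the 4.718592E8 in the log), while the JVM here reported a maximum heap of only 259522560 bytes. Below is a minimal sketch of that check, assuming those two constants; HeapCheck and its members are hypothetical names for illustration, not Spark APIs.

```scala
object HeapCheck {
  // Assumption: 300 MB reserve and a 1.5x factor, as implied by the error message.
  val ReservedBytes: Long = 300L * 1024 * 1024
  val MinSystemMemory: Long = (ReservedBytes * 1.5).toLong

  // True when the given max heap would pass the minimum-memory check.
  def heapIsLargeEnough(maxHeapBytes: Long): Boolean =
    maxHeapBytes >= MinSystemMemory

  def main(args: Array[String]): Unit = {
    // Max heap of the JVM that actually runs this code (what -Xmx controls).
    val maxHeap = Runtime.getRuntime.maxMemory
    println(s"JVM max heap: $maxHeap bytes, required: $MinSystemMemory bytes")
    println(s"Large enough: ${heapIsLargeEnough(maxHeap)}")
  }
}
```

Running this from the same run configuration as the job shows whether the -Xmx value is really reaching the JVM that starts the SparkContext; if the printed heap does not change when JAVA_OPTS changes, the variable is probably not being picked up by that launcher.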


Exception in thread "main" java.lang.NullPointerException com.databricks.dbutils_v1.DBUtilsHolder$$anon$1.invoke

I would like to read a Parquet file in Azure Blob storage, so I mounted the data from Azure Blob locally with dbutils.fs.mount.
But I got the error: Exception in thread "main" java.lang.NullPointerException
Below is my log:
hello big data
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/06/10 23:20:10 INFO SparkContext: Running Spark version 2.1.0
20/06/10 23:20:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/06/10 23:20:11 INFO SecurityManager: Changing view acls to: Admin
20/06/10 23:20:11 INFO SecurityManager: Changing modify acls to: Admin
20/06/10 23:20:11 INFO SecurityManager: Changing view acls groups to:
20/06/10 23:20:11 INFO SecurityManager: Changing modify acls groups to:
20/06/10 23:20:11 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(Admin); groups with view permissions: Set(); users with modify permissions: Set(Admin); groups with modify permissions: Set()
20/06/10 23:20:12 INFO Utils: Successfully started service 'sparkDriver' on port 4725.
20/06/10 23:20:12 INFO SparkEnv: Registering MapOutputTracker
20/06/10 23:20:13 INFO SparkEnv: Registering BlockManagerMaster
20/06/10 23:20:13 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
20/06/10 23:20:13 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
20/06/10 23:20:13 INFO DiskBlockManager: Created local directory at C:\Users\Admin\AppData\Local\Temp\blockmgr-c023c3b8-fd70-461a-ac69-24ce9c770efe
20/06/10 23:20:13 INFO MemoryStore: MemoryStore started with capacity 894.3 MB
20/06/10 23:20:13 INFO SparkEnv: Registering OutputCommitCoordinator
20/06/10 23:20:13 INFO Utils: Successfully started service 'SparkUI' on port 4040.
20/06/10 23:20:13 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.0.102:4040
20/06/10 23:20:13 INFO Executor: Starting executor ID driver on host localhost
20/06/10 23:20:13 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 4738.
20/06/10 23:20:13 INFO NettyBlockTransferService: Server created on 192.168.0.102:4738
20/06/10 23:20:13 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/06/10 23:20:13 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 192.168.0.102, 4738, None)
20/06/10 23:20:13 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.0.102:4738 with 894.3 MB RAM, BlockManagerId(driver, 192.168.0.102, 4738, None)
20/06/10 23:20:13 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.0.102, 4738, None)
20/06/10 23:20:13 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 192.168.0.102, 4738, None)
20/06/10 23:20:14 INFO SharedState: Warehouse path is 'file:/E:/sparkdemo/sparkdemo/spark-warehouse/'.
Exception in thread "main" java.lang.NullPointerException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.databricks.dbutils_v1.DBUtilsHolder$$anon$1.invoke(DBUtilsHolder.scala:17)
at com.sun.proxy.$Proxy7.fs(Unknown Source)
at Transform$.main(Transform.scala:19)
at Transform.main(Transform.scala)
20/06/10 23:20:14 INFO SparkContext: Invoking stop() from shutdown hook
20/06/10 23:20:14 INFO SparkUI: Stopped Spark web UI at http://192.168.0.102:4040
20/06/10 23:20:14 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
20/06/10 23:20:14 INFO MemoryStore: MemoryStore cleared
20/06/10 23:20:14 INFO BlockManager: BlockManager stopped
20/06/10 23:20:14 INFO BlockManagerMaster: BlockManagerMaster stopped
20/06/10 23:20:14 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
20/06/10 23:20:14 INFO SparkContext: Successfully stopped SparkContext
20/06/10 23:20:14 INFO ShutdownHookManager: Shutdown hook called
20/06/10 23:20:14 INFO ShutdownHookManager: Deleting directory C:\Users\Admin\AppData\Local\Temp\spark-cbdbcfe7-bc70-4d34-ad8e-5baed8308ae2
My code:
import com.databricks.dbutils_v1.DBUtilsHolder.dbutils
import org.apache.spark.sql.SparkSession
object Demo {
def main(args:Array[String]): Unit = {
println("hello big data")
val containerName = "container1"
val storageAccountName = "storageaccount1"
val sas = "saskey"
val url = "wasbs://" + containerName + "@" + storageAccountName + ".blob.core.windows.net/"
var config = "fs.azure.sas." + containerName + "." + storageAccountName + ".blob.core.windows.net"
//Spark session
val spark : SparkSession = SparkSession.builder
.appName("SpartDemo")
.master("local[1]")
.getOrCreate()
//Mount data
dbutils.fs.mount(
source = url,
mountPoint = "/mnt/container1",
extraConfigs = Map(config -> sas))
val parquetFileDF = spark.read.parquet("/mnt/container1/test1.parquet")
parquetFileDF.show()
}
}
My sbt file:
name := "sparkdemo1"
version := "0.1"
scalaVersion := "2.11.8"
libraryDependencies ++= Seq(
"com.databricks" % "dbutils-api_2.11" % "0.0.3",
"org.apache.spark" % "spark-core_2.11" % "2.1.0",
"org.apache.spark" % "spark-sql_2.11" % "2.1.0"
)
Are you running this on a Databricks instance?
If not, that's the problem: dbutils is provided by the Databricks execution context.
In that case, as far as I know, you have three options:
1. Package your application into a jar file and run it as a Databricks job
2. Use databricks-connect
3. Try to emulate a mocked dbutils instance outside Databricks as shown here:
com.databricks.dbutils_v1.DBUtilsHolder.dbutils0.set(
new com.databricks.dbutils_v1.DBUtilsV1{
...
}
)
Anyway, I'd say that the first two options are better than the third one. Also, by choosing one of those you don't need to include the "dbutils-api_2.11" dependency, as it is provided by the Databricks cluster.
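If the same code has to run both locally and on Databricks, one way to avoid the NullPointerException is to check the environment before touching dbutils. The sketch below assumes that Databricks clusters export a DATABRICKS_RUNTIME_VERSION environment variable (worth verifying on your own runtime); DatabricksGuard and runningOnDatabricks are hypothetical names, not part of any Databricks library.

```scala
object DatabricksGuard {
  // Assumption: this variable is set on Databricks clusters and absent elsewhere.
  val RuntimeEnvVar = "DATABRICKS_RUNTIME_VERSION"

  // Takes the environment as a parameter so the check is easy to test.
  def runningOnDatabricks(env: Map[String, String]): Boolean =
    env.get(RuntimeEnvVar).exists(_.nonEmpty)

  def main(args: Array[String]): Unit = {
    if (runningOnDatabricks(sys.env))
      println("Databricks runtime detected: dbutils should be available.")
    else
      println("Not on Databricks: skip dbutils calls, they would fail as in the question.")
  }
}
```

In the local branch you would read the data some other way instead of calling dbutils.fs.mount.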

How to raise log level to error in Spark?

I have tried to suppress logs with spark.sparkContext.setLogLevel("ERROR") in:
package com.databricks.example
import org.apache.log4j.Logger
import org.apache.spark.sql.SparkSession
object DFUtils extends Serializable {
@transient lazy val logger = Logger.getLogger(getClass.getName)
def pointlessUDF(raw: String) = {
raw
}
}
object DataFrameExample extends Serializable {
def main(args: Array[String]): Unit = {
val pathToDataFolder = args(0)
// println(pathToDataFolder + "data.json")
// start up the SparkSession
// along with explicitly setting a given config
val spark = SparkSession.builder().appName("Spark Example")
.config("spark.sql.warehouse.dir", "/user/hive/warehouse")
.getOrCreate()
// suppress logs by raising the log level
spark.sparkContext.setLogLevel("ERROR")
// println(spark.range(1, 2000).count());
// udf registration
spark.udf.register("myUDF", DFUtils.pointlessUDF(_:String):String)
val df = spark.read.json(pathToDataFolder + "data.json")
df.printSchema()
// df.collect.foreach(println)
// val x = df.select("value").foreach(x => println(x));
val manipulated = df.groupBy("grouping").sum().collect().foreach(x => println(x))
// val manipulated = df.groupBy(expr("myUDF(group)")).sum().collect().foreach(x => println(x))
}
}
Why do I still get INFO and WARN level logs? Have I successfully raised the log level to ERROR? Thanks.
$ ~/programs/spark/spark-2.4.5-bin-hadoop2.7/bin/spark-submit --class com.databricks.example.DataFrameExample --master local target/scala-2.11/example_2.11-0.1-SNAPSHOT.jar /tmp/test/
20/03/19 10:09:10 WARN Utils: Your hostname, ocean resolves to a loopback address: 127.0.1.1; using 192.168.122.1 instead (on interface virbr0)
20/03/19 10:09:10 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
20/03/19 10:09:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/03/19 10:09:12 INFO SparkContext: Running Spark version 2.4.5
20/03/19 10:09:12 INFO SparkContext: Submitted application: Spark Example
20/03/19 10:09:12 INFO SecurityManager: Changing view acls to: t
20/03/19 10:09:12 INFO SecurityManager: Changing modify acls to: t
20/03/19 10:09:12 INFO SecurityManager: Changing view acls groups to:
20/03/19 10:09:12 INFO SecurityManager: Changing modify acls groups to:
20/03/19 10:09:12 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(t); groups with view permissions: Set(); users with modify permissions: Set(t); groups with modify permissions: Set()
20/03/19 10:09:13 INFO Utils: Successfully started service 'sparkDriver' on port 35821.
20/03/19 10:09:13 INFO SparkEnv: Registering MapOutputTracker
20/03/19 10:09:13 INFO SparkEnv: Registering BlockManagerMaster
20/03/19 10:09:13 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
20/03/19 10:09:13 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
20/03/19 10:09:13 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-ce47f30a-ee1c-44a8-9f5b-204905ee3b2d
20/03/19 10:09:13 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
20/03/19 10:09:13 INFO SparkEnv: Registering OutputCommitCoordinator
20/03/19 10:09:14 INFO Utils: Successfully started service 'SparkUI' on port 4040.
20/03/19 10:09:14 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.122.1:4040
20/03/19 10:09:14 INFO SparkContext: Added JAR file:/tmp/test/bookexample/target/scala-2.11/example_2.11-0.1-SNAPSHOT.jar at spark://192.168.122.1:35821/jars/example_2.11-0.1-SNAPSHOT.jar with timestamp 1584626954295
20/03/19 10:09:14 INFO Executor: Starting executor ID driver on host localhost
20/03/19 10:09:14 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 39215.
20/03/19 10:09:14 INFO NettyBlockTransferService: Server created on 192.168.122.1:39215
20/03/19 10:09:14 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/03/19 10:09:14 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 192.168.122.1, 39215, None)
20/03/19 10:09:14 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.122.1:39215 with 366.3 MB RAM, BlockManagerId(driver, 192.168.122.1, 39215, None)
20/03/19 10:09:14 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.122.1, 39215, None)
20/03/19 10:09:14 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 192.168.122.1, 39215, None)
root
|-- grouping: string (nullable = true)
|-- value: long (nullable = true)
[group_3,10]
[group_1,12]
[group_2,5]
[group_4,2]
You need to add a log4j.properties file to your resources folder. Otherwise Spark uses the default settings from your Spark folder (on Linux usually here: /etc/spark2/.../log4j-defaults.properties).
The location is also mentioned in your log output:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Make sure to set the rootCategory to ERROR, like in the following example:
# Set everything to be logged to the console
log4j.rootCategory=ERROR, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

Spark Streaming job won't schedule additional work

Spark 2.1.1 built for Hadoop 2.7.3
Scala 2.11.11
The cluster has 3 Linux RHEL 7.3 Azure VMs, running Spark Standalone Deploy Mode (no YARN or Mesos, yet)
I have created a very simple Spark Streaming job using IntelliJ, written in Scala. I'm using Maven and building the job into a fat/uber jar that contains all dependencies.
When I run the job locally it works fine. If I copy the jar to the cluster and run it with a master of local[2], it also works fine. However, if I submit the job to the cluster master, it behaves as if it does not want to schedule any work beyond the first task. The job starts up, grabs however many events are in the Azure Event Hub, processes them successfully, then never does any more work. It does not matter whether I submit the job to the master as a plain application or in supervised cluster mode; both do the same thing.
I've looked through all the logs I know of (master, driver (where applicable), and executor) and I am not seeing any errors or warnings that seem actionable. I've altered the log level, shown below, to show ALL/INFO/DEBUG and sifted through those logs without finding anything that seems relevant.
It may be worth noting that I had previously created several jobs in Java that connect to Kafka instead of the Azure Event Hub, and those jobs run in supervised cluster mode without an issue on this same cluster. This leads me to believe that the cluster configuration isn't the issue; it's either something in my code (below) or the Azure Event Hub.
Any thoughts on where I might check next to isolate this issue? Here is the code for my simple job.
Thanks in advance.
Note: conf.{name} indicates values I'm loading from a config file. I've tested loading and hard-coding them, both with the same result.
package streamingJob
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.eventhubs.EventHubsUtils
import org.joda.time.DateTime
object TestJob {
def main(args: Array[String]): Unit = {
val sparkConf = new SparkConf()
sparkConf.setAppName("TestJob")
// Uncomment to run locally
//sparkConf.setMaster("local[2]")
val sparkContext = new SparkContext(sparkConf)
sparkContext.setLogLevel("ERROR")
val streamingContext: StreamingContext = new StreamingContext(sparkContext, Seconds(1))
val readerParams = Map[String, String] (
"eventhubs.policyname" -> conf.policyname,
"eventhubs.policykey" -> conf.policykey,
"eventhubs.namespace" -> conf.namespace,
"eventhubs.name" -> conf.name,
"eventhubs.partition.count" -> conf.partitionCount,
"eventhubs.consumergroup" -> conf.consumergroup
)
val eventData = EventHubsUtils.createDirectStreams(
streamingContext,
conf.namespace,
conf.progressdir,
Map("name" -> readerParams))
eventData.foreachRDD(r => {
r.foreachPartition { p => {
p.foreach(d => {
println(DateTime.now() + ": " + d)
}) // end of EventData
}} // foreachPartition
}) // foreachRDD
streamingContext.start()
streamingContext.awaitTermination()
}
}
Here is a set of logs from when I run this as an application, not cluster/supervised.
/spark/bin/spark-submit --class streamingJob.TestJob --master spark://{ip}:7077 --total-executor-cores 1 /spark/job-files/fatjar.jar
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/11/06 17:52:04 INFO SparkContext: Running Spark version 2.1.1
17/11/06 17:52:05 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/11/06 17:52:05 INFO SecurityManager: Changing view acls to: root
17/11/06 17:52:05 INFO SecurityManager: Changing modify acls to: root
17/11/06 17:52:05 INFO SecurityManager: Changing view acls groups to:
17/11/06 17:52:05 INFO SecurityManager: Changing modify acls groups to:
17/11/06 17:52:05 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
17/11/06 17:52:06 INFO Utils: Successfully started service 'sparkDriver' on port 44384.
17/11/06 17:52:06 INFO SparkEnv: Registering MapOutputTracker
17/11/06 17:52:06 INFO SparkEnv: Registering BlockManagerMaster
17/11/06 17:52:06 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/11/06 17:52:06 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/11/06 17:52:06 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-b5e2c0f3-2500-42c6-b057-cf5d368580ab
17/11/06 17:52:06 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
17/11/06 17:52:06 INFO SparkEnv: Registering OutputCommitCoordinator
17/11/06 17:52:06 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/11/06 17:52:06 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://{ip}:4040
17/11/06 17:52:06 INFO SparkContext: Added JAR file:/spark/job-files/fatjar.jar at spark://{ip}:44384/jars/fatjar.jar with timestamp 1509990726989
17/11/06 17:52:07 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://{ip}:7077...
17/11/06 17:52:07 INFO TransportClientFactory: Successfully created connection to /{ip}:7077 after 72 ms (0 ms spent in bootstraps)
17/11/06 17:52:07 INFO StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20171106175207-0000
17/11/06 17:52:07 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 44624.
17/11/06 17:52:07 INFO NettyBlockTransferService: Server created on {ip}:44624
17/11/06 17:52:07 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/11/06 17:52:07 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20171106175207-0000/0 on worker-20171106173151-{ip}-46086 ({ip}:46086) with 1 cores
17/11/06 17:52:07 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, {ip}, 44624, None)
17/11/06 17:52:07 INFO StandaloneSchedulerBackend: Granted executor ID app-20171106175207-0000/0 on hostPort {ip}:46086 with 1 cores, 1024.0 MB RAM
17/11/06 17:52:07 INFO BlockManagerMasterEndpoint: Registering block manager {ip}:44624 with 366.3 MB RAM, BlockManagerId(driver, {ip}, 44624, None)
17/11/06 17:52:07 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, {ip}, 44624, None)
17/11/06 17:52:07 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, {ip}, 44624, None)
17/11/06 17:52:07 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20171106175207-0000/0 is now RUNNING
17/11/06 17:52:08 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0

SparkSQL-Scala with POM

I have a problem with the Cloudera VM and Spark. First of all, I'm completely new to Spark, and my boss asked me to run Spark with Scala in a virtual machine for some tests.
I downloaded the virtual machine for VirtualBox, then opened Eclipse and created a new Maven project.
Obviously, I had first started the Cloudera environment and all its services, such as Spark, YARN, Hive and so on.
All services work fine, and all checks in the Cloudera services are green. I ran some tests with Impala and that works perfectly.
With Eclipse and the Scala/Maven environment, things got worse. This is my very simple Scala code:
package org.test.spark
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
object TestSelectAlgorithm {
def main(args: Array[String]) = {
val conf = new SparkConf()
.setAppName("TestSelectAlgorithm")
.setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val df = sqlContext.sql("SELECT * FROM products").show()
}
}
The test is very simple, because the table "products" exists: if I copy and paste the same query into Impala, the query works fine!
In the Eclipse environment, however, I get this problem:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/06/30 05:43:17 INFO SparkContext: Running Spark version 1.6.0
16/06/30 05:43:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/06/30 05:43:18 WARN Utils: Your hostname, quickstart.cloudera resolves to a loopback address: 127.0.0.1; using 10.0.2.15 instead (on interface eth0)
16/06/30 05:43:18 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
16/06/30 05:43:18 INFO SecurityManager: Changing view acls to: cloudera
16/06/30 05:43:18 INFO SecurityManager: Changing modify acls to: cloudera
16/06/30 05:43:18 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(cloudera); users with modify permissions: Set(cloudera)
16/06/30 05:43:19 INFO Utils: Successfully started service 'sparkDriver' on port 53730.
16/06/30 05:43:19 INFO Slf4jLogger: Slf4jLogger started
16/06/30 05:43:19 INFO Remoting: Starting remoting
16/06/30 05:43:19 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@10.0.2.15:39288]
16/06/30 05:43:19 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 39288.
16/06/30 05:43:19 INFO SparkEnv: Registering MapOutputTracker
16/06/30 05:43:19 INFO SparkEnv: Registering BlockManagerMaster
16/06/30 05:43:19 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-7d685fc0-ea88-423a-9335-42ca12db85da
16/06/30 05:43:19 INFO MemoryStore: MemoryStore started with capacity 1619.3 MB
16/06/30 05:43:20 INFO SparkEnv: Registering OutputCommitCoordinator
16/06/30 05:43:20 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/06/30 05:43:20 INFO SparkUI: Started SparkUI at http://10.0.2.15:4040
16/06/30 05:43:20 INFO Executor: Starting executor ID driver on host localhost
16/06/30 05:43:20 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 57294.
16/06/30 05:43:20 INFO NettyBlockTransferService: Server created on 57294
16/06/30 05:43:20 INFO BlockManagerMaster: Trying to register BlockManager
16/06/30 05:43:20 INFO BlockManagerMasterEndpoint: Registering block manager localhost:57294 with 1619.3 MB RAM, BlockManagerId(driver, localhost, 57294)
16/06/30 05:43:20 INFO BlockManagerMaster: Registered BlockManager
Exception in thread "main" org.apache.spark.sql.AnalysisException: Table not found: products;
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:306)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$9.applyOrElse(Analyzer.scala:315)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$9.applyOrElse(Analyzer.scala:310)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:56)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:54)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:54)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:265)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:305)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:54)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:310)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:300)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80)
at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:72)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:72)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:36)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:36)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
at org.test.spark.TestSelectAlgorithm$.main(TestSelectAlgorithm.scala:18)
at org.test.spark.TestSelectAlgorithm.main(TestSelectAlgorithm.scala)
16/06/30 05:43:22 INFO SparkContext: Invoking stop() from shutdown hook
16/06/30 05:43:22 INFO SparkUI: Stopped Spark web UI at http://10.0.2.15:4040
16/06/30 05:43:22 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/06/30 05:43:22 INFO MemoryStore: MemoryStore cleared
16/06/30 05:43:22 INFO BlockManager: BlockManager stopped
16/06/30 05:43:22 INFO BlockManagerMaster: BlockManagerMaster stopped
16/06/30 05:43:22 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/06/30 05:43:22 INFO SparkContext: Successfully stopped SparkContext
16/06/30 05:43:22 INFO ShutdownHookManager: Shutdown hook called
16/06/30 05:43:22 INFO ShutdownHookManager: Deleting directory /tmp/spark-29d381e9-b5e7-485c-92f2-55dc57ca7d25
The main error (as far as I can tell) is:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Table not found: products;
I searched other sites and the documentation, and I found that the problem is connected with Hive tables... but I don't use a Hive table, I use Spark SQL...
Can anyone help me, please?
Thank you for any reply.
Spark has no direct support for Impala the way it does for Hive, so you have to load the data from a file. If it is CSV, you can use spark-csv:
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.load("your .csv file location")
import sqlContext.implicits._
import sqlContext._
df.registerTempTable("products")
sqlContext.sql("select * from products").show()
pom dependency for spark-csv
<!-- https://mvnrepository.com/artifact/com.databricks/spark-csv_2.10 -->
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-csv_2.10</artifactId>
<version>1.4.0</version>
</dependency>
For Avro, there is spark-avro:
import com.databricks.spark.avro._ // adds the avro method to the DataFrame reader
val sqlContext = new SQLContext(sc)
val df = sqlContext.read.avro("your .avro file location")
import sqlContext.implicits._
import sqlContext._
df.registerTempTable("products")
val result = sqlContext.sql("select * from products")
result.show()
result.write
.format("com.databricks.spark.avro")
.save("Your ouput location")
Maven (pom.xml) dependency for spark-avro:
<!-- http://mvnrepository.com/artifact/com.databricks/spark-avro_2.10 -->
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-avro_2.10</artifactId>
<version>2.0.1</version>
</dependency>
For Parquet, Spark has built-in support:
val sqlContext = new SQLContext(sc)
val parquetFile = sqlContext.read.parquet("your parquet file location")
parquetFile.registerTempTable("products")
sqlContext.sql("select * from products").show()
Can you check whether the /user/cloudera/.sparkStaging/stagingArea location exists and whether it contains an .avro file? Also, please replace "Your ouput location" with an actual directory location.
See the spark-avro GitHub page for more detail: https://github.com/databricks/spark-avro

Unable to connect to Spark master

I start my DataStax cassandra instance with Spark:
dse cassandra -k
I then run this program (from within Eclipse):
import org.apache.spark.sql.SQLContext
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
object Start {
def main(args: Array[String]): Unit = {
println("***** 1 *****")
val sparkConf = new SparkConf().setAppName("Start").setMaster("spark://127.0.0.1:7077")
println("***** 2 *****")
val sparkContext = new SparkContext(sparkConf)
println("***** 3 *****")
}
}
And I get the following output
***** 1 *****
***** 2 *****
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/12/29 15:27:50 INFO SparkContext: Running Spark version 1.5.2
15/12/29 15:27:51 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/12/29 15:27:51 INFO SecurityManager: Changing view acls to: nayan
15/12/29 15:27:51 INFO SecurityManager: Changing modify acls to: nayan
15/12/29 15:27:51 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(nayan); users with modify permissions: Set(nayan)
15/12/29 15:27:52 INFO Slf4jLogger: Slf4jLogger started
15/12/29 15:27:52 INFO Remoting: Starting remoting
15/12/29 15:27:53 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@10.0.1.88:55126]
15/12/29 15:27:53 INFO Utils: Successfully started service 'sparkDriver' on port 55126.
15/12/29 15:27:53 INFO SparkEnv: Registering MapOutputTracker
15/12/29 15:27:53 INFO SparkEnv: Registering BlockManagerMaster
15/12/29 15:27:53 INFO DiskBlockManager: Created local directory at /private/var/folders/pd/6rxlm2js10gg6xys5wm90qpm0000gn/T/blockmgr-21a96671-c33e-498c-83a4-bb5c57edbbfb
15/12/29 15:27:53 INFO MemoryStore: MemoryStore started with capacity 983.1 MB
15/12/29 15:27:53 INFO HttpFileServer: HTTP File server directory is /private/var/folders/pd/6rxlm2js10gg6xys5wm90qpm0000gn/T/spark-fce0a058-9264-4f2c-8220-c32d90f11bd8/httpd-2a0efcac-2426-49c5-982a-941cfbb48c88
15/12/29 15:27:53 INFO HttpServer: Starting HTTP Server
15/12/29 15:27:53 INFO Utils: Successfully started service 'HTTP file server' on port 55127.
15/12/29 15:27:53 INFO SparkEnv: Registering OutputCommitCoordinator
15/12/29 15:27:53 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/12/29 15:27:53 INFO SparkUI: Started SparkUI at http://10.0.1.88:4040
15/12/29 15:27:54 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
15/12/29 15:27:54 INFO AppClient$ClientEndpoint: Connecting to master spark://127.0.0.1:7077...
15/12/29 15:27:54 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkMaster@127.0.0.1:7077] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
15/12/29 15:28:14 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[appclient-registration-retry-thread,5,main]
java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@1f22aef0 rejected from java.util.concurrent.ThreadPoolExecutor@176cb4af[Running, pool size = 1, active threads = 1, queued tasks = 0, completed tasks = 0]
So something is going wrong during the creation of the Spark context.
When I look in $DSE_HOME/logs/spark, it is empty. I'm not sure where else to look.
It turns out that the problem was the Spark library version AND the Scala version. DataStax was running Spark 1.4.1 and Scala 2.10.5, while my Eclipse project was using 1.5.2 and 2.11.7 respectively.
Note that BOTH the Spark library and Scala appear to have to match. I tried other combinations, but it only worked when both matched.
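For illustration, here is a minimal sketch of pinning both versions in one place. It assumes an sbt build, which is an assumption on my part (the project above was run from Eclipse and its build tool is not stated); the version numbers are the ones the cluster reported above, and the spark-sql line is only there because the program imports SQLContext.

```scala
// build.sbt (sketch, assuming sbt is the build tool)

// Match the Scala version the cluster runs.
scalaVersion := "2.10.5"

// %% appends the Scala binary version (_2.10) to the artifact name,
// so the Spark artifacts stay in step with scalaVersion.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.4.1",
  "org.apache.spark" %% "spark-sql"  % "1.4.1"
)
```

With a Maven build the equivalent would be depending on the spark-core_2.10 and spark-sql_2.10 artifacts at version 1.4.1.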
I am getting pretty familiar with this part of your posted error:
WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://...
It can have numerous causes, pretty much all related to misconfigured IPs. First I would do whatever zero323 says, then here's my two cents: I have solved my own problems recently by using IP addresses, not hostnames, and the only config I use in a simple standalone cluster is SPARK_MASTER_IP.
Setting SPARK_MASTER_IP in $SPARK_HOME/conf/spark-env.sh on your master should then make the master web UI show the IP address you set:
spark://your.ip.address.numbers:7077
And your SparkConf setup can refer to that.
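As a rough way to sanity-check the address before handing it to SparkConf, something like the sketch below can help. The object and method names (MasterCheck, masterUrl, reachable) are hypothetical helpers of mine, not part of Spark, and 7077 is simply the port from the question. It only uses the JDK socket API, so it tells you whether anything accepts TCP connections at that IP and port, nothing more.

```scala
import java.io.IOException
import java.net.{InetSocketAddress, Socket}

// Hypothetical helper, not part of Spark: builds the master URL from a
// numeric IP and checks that something is listening at that address.
object MasterCheck {
  // The string you would pass to SparkConf.setMaster(...).
  def masterUrl(ip: String, port: Int = 7077): String = s"spark://$ip:$port"

  // Returns true if a TCP connection to ip:port succeeds within timeoutMs.
  def reachable(ip: String, port: Int, timeoutMs: Int = 2000): Boolean = {
    val socket = new Socket()
    try {
      socket.connect(new InetSocketAddress(ip, port), timeoutMs)
      true
    } catch {
      case _: IOException => false // refused, timed out, or unroutable
    } finally {
      socket.close()
    }
  }

  def main(args: Array[String]): Unit = {
    // Defaults to the loopback address used in the question.
    val ip = args.headOption.getOrElse("127.0.0.1")
    println(s"${masterUrl(ip)} reachable: ${reachable(ip, 7077)}")
  }
}
```

A successful connection does not prove the driver and master are compatible; a version mismatch can still cause the association to fail even when the port is open.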
Having said that, I am not familiar with your specific setup, but I notice two paths in the log containing:
/private/var/folders/pd/6rxlm2js10gg6xys5wm90qpm0000gn/T/
Have you looked there to see if there's a logs directory? Is that where $DSE_HOME points? Alternatively, connect to the driver at the address where it starts its web UI:
INFO SparkUI: Started SparkUI at http://10.0.1.88:4040
and you should see a link to an error log there somewhere.
More on the IP vs. hostname issue: this very old bug is marked as Resolved, but I have not figured out what they mean by Resolved, so I just tend toward IP addresses.