How to read data from MongoDB in Zeppelin using Spark? - mongodb

I'm working with Zeppelin on HDP 2.6 and I want to read a collection from MongoDB using the spark2 interpreter.
util.Properties.versionString
spark.version
res22: String = version 2.11.8
res23: String = 2.2.0.2.6.4.0-91
I'm using MongoDB 3.4.14, mongo-spark-connector 2.2.2, and mongo-java-driver 3.5.0. When I try this:
val customReadConfig = ReadConfig(Map("readPreference.name" -> "secondaryPreferred" ,"uri" -> "mongodb://127.0.0.1:27017/test.collections"))
val df5 = spark.sparkSession.read.mongo(customReadConfig)
I get this error:
customReadConfig: com.mongodb.spark.config.ReadConfig.Self = ReadConfig(test,collections,Some(mongodb://127.0.0.1:27017/test.collections),1000,DefaultMongoPartitioner,Map(),15,ReadPreferenceConfig(secondaryPreferred,None),ReadConcernConfig(None),false)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 20.0 failed 1 times, most recent failure: Lost task 0.0 in stage 20.0 (TID 20, localhost, executor driver): java.lang.NoClassDefFoundError: scala/collection/GenTraversableOnce$class
at com.mongodb.spark.rdd.MongoRDD$MongoCursorIterator.<init>(MongoRDD.scala:174)
at com.mongodb.spark.rdd.MongoRDD.compute(MongoRDD.scala:152)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
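For reference, a minimal sketch of the documented mongo-spark 2.2.x read path, with the imports it needs (URI and collection taken from the question; the assumption that `spark` is the SparkSession holds in the Zeppelin spark2 interpreter). Note that a NoClassDefFoundError on scala/collection/GenTraversableOnce$class usually points to a Scala binary-version mismatch, so it is worth checking that the connector JAR on the interpreter classpath is the mongo-spark-connector_2.11 artifact rather than _2.10, since Spark here runs on Scala 2.11.8.
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.ReadConfig

val readConfig = ReadConfig(Map(
  "uri" -> "mongodb://127.0.0.1:27017/test.collections",
  "readPreference.name" -> "secondaryPreferred"
))
// MongoSpark.load returns a DataFrame; `spark` is the SparkSession in Zeppelin's spark2 interpreter
val df = MongoSpark.load(spark, readConfig)
df.printSchema()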

Related

Scala null pointer exception while creating a dataframe

I'm trying to read files from a location and load them into a Spark DataFrame. The code below works correctly:
val tempDF: DataFrame = spark.read.orc(targetDirectory)
When I try to provide a schema for the same read, the code fails with this exception:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, brdn6136.target.com, executor 25): java.lang.NullPointerException
at org.apache.spark.sql.execution.datasources.orc.OrcColumnVector.getDouble(OrcColumnVector.java:152)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:836)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:836)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Below is the code I used:
val schema = StructType(
  List(
    StructField("Col1", DoubleType, true),
    StructField("Col2", StringType, true),
    StructField("Col3", DoubleType, true),
    StructField("Col4", DoubleType, true),
    StructField("Col5", DoubleType, true),
    StructField("Col6", StringType, true),
    StructField("Col7", StringType, true),
    StructField("Col8", StringType, true),
    StructField("Col9", StringType, true),
    StructField("Col10", StringType, true),
    StructField("Col11", StringType, true),
    StructField("Col12", StringType, true)
  )
)
val df: DataFrame = spark.read.format("orc")
  .schema(schema)
  .load(targetReadDirectory)
Can anyone please help me resolve this issue?
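Not a definitive fix, but one common workaround when OrcColumnVector.getDouble throws an NPE is to let Spark infer the ORC schema and then cast the columns afterwards, on the assumption that the supplied schema's DoubleType columns don't match the types actually written in the files:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Infer the schema from the files, then coerce the numeric columns explicitly.
val inferred = spark.read.orc(targetReadDirectory)
val casted = inferred
  .withColumn("Col1", col("Col1").cast(DoubleType))
  .withColumn("Col3", col("Col3").cast(DoubleType))
  .withColumn("Col4", col("Col4").cast(DoubleType))
  .withColumn("Col5", col("Col5").cast(DoubleType))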

Spark SQL execution in Scala

I have the following data (alldata), which contains a SQL query and a view name:
Select_Query|viewname
select v1,v2 from conditions|cond
select w1,w2 from locations|loca
I have split it and properly registered it as the temp table alldata.
val Select_Querydf = spark.sql("select Select_Query,ViewName from alldata")
When I try to execute each query and register a temp view or table from it, it throws a NullPointerException. But the println shows all the values in the table correctly when I comment out the spark.sql statement.
Select_Querydf.foreach { row =>
  val Selectstmt = row(0).toString()
  val viewname = row(1).toString()
  println(Selectstmt + "-->" + viewname)
  spark.sql(Selectstmt).registerTempTable(viewname) //.createOrReplaceTempView(viewname)
}
output:
select v1,v2 from conditions-->cond
select w1,w2 from locations-->loca
But when I execute it with spark.sql, it shows the following error. Please help me see where I am going wrong.
19/12/09 02:43:12 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 4)
java.lang.NullPointerException
at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:128)
at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:126)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:623)
at sparkscalacode1.SQLQueryexecutewithheader$$anonfun$main$1.apply(SQLQueryexecutewithheader.scala:36)
at sparkscalacode1.SQLQueryexecutewithheader$$anonfun$main$1.apply(SQLQueryexecutewithheader.scala:32)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:918)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:918)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
19/12/09 02:43:12 ERROR TaskSetManager: Task 0 in stage 4.0 failed 1 times; aborting job
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 (TID 4, localhost, executor driver): java.lang.NullPointerException
at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:128)
at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:126)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:623)
at sparkscalacode1.SQLQueryexecutewithheader$$anonfun$main$1.apply(SQLQueryexecutewithheader.scala:36)
at sparkscalacode1.SQLQueryexecutewithheader$$anonfun$main$1.apply(SQLQueryexecutewithheader.scala:32)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:918)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:918)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Here spark.sql, which belongs to the SparkSession, cannot be used inside the foreach of a DataFrame. The SparkSession is created on the driver, while foreach is executed on the workers, where the session is not serialized.
Assuming Select_Querydf is small, you can collect it as a list on the driver and use it as below:
Select_Querydf.collect().foreach { row =>
  val Selectstmt = row.getString(0)
  val viewname = row.getString(1)
  println(Selectstmt + "-->" + viewname)
  spark.sql(Selectstmt).createOrReplaceTempView(viewname)
}
Hope this helps!

Scala: Converting Iterator to RDD

Input : MyRDD.collect()
Array[(String, List[Int])] = Array((4,List(97, 99, 101, 102, 103)), (8,List(97, 98, 99, 102, 103, 104)), (19,List(97, 98, 102)), (15,List(97, 99, 101))
Output:
Array[(String, List[Int])] = Array((4,List(97, 99, 101, 102, 103)), (8,List(97, 98, 99, 102, 103, 104)), (19,List(97, 98, 102)), (15,List(97, 99, 101))
Function Call: MyRDD.mapPartitions(myfunc).collect()
I'm trying to convert the Iterator to an RDD inside myfunc() to perform a few other transformations on it.
def myfunc(chunk: Iterator[(String, List[Int])]): Iterator[Int] = {
  var k = 1
  var a = chunk.toMap
  val trainRdd: RDD[(String, List[Int])] = sc.parallelize(a.toSeq).map { case (k, v) => (k, v) }
  // Will perform a few operations on the RDD in the next few steps
  return Iterator(k)
}
While the conversion code works fine in the Scala shell when executed outside the function, the same line of code doesn't work inside the function. Assume the RDD has at least 2 partitions. It throws a NullPointerException, shown below.
ERROR Executor: Exception in task 1.0 in stage 7.0 (TID 10)
java.lang.NullPointerException
at $line46.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.somefuncpartition(<console>:37)
at $line47.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:44)
at $line47.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:44)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
18/03/01 00:09:44 ERROR Executor: Exception in task 0.0 in stage 7.0 (TID 9)
java.lang.NullPointerException
at $line46.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.somefuncpartition(<console>:37)
at $line47.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:44)
at $line47.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:44)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
18/03/01 00:09:45 WARN TaskSetManager: Lost task 1.0 in stage 7.0 (TID 10, localhost, executor driver): java.lang.NullPointerException
at $line46.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.somefuncpartition(<console>:37)
at $line47.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:44)
at $line47.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:44)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
18/03/01 00:09:45 ERROR TaskSetManager: Task 1 in stage 7.0 failed 1 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 7.0 failed 1 times, most recent failure: Lost task 1.0 in stage 7.0 (TID 10, localhost, executor driver): java.lang.NullPointerException
at somefuncpartition(<console>:37)
at $anonfun$1.apply(<console>:44)
at $anonfun$1.apply(<console>:44)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2050)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2069)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2094)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.collect(RDD.scala:935) ... 48 elided
Caused by: java.lang.NullPointerException
at somefuncpartition(<console>:37)
at $anonfun$1.apply(<console>:44)
at $anonfun$1.apply(<console>:44)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
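A note on the cause: myfunc runs inside mapPartitions on the executors, where the SparkContext (sc) is null, so sc.parallelize inside the function fails; RDDs cannot be created or nested inside tasks. The usual alternative is to transform the partition's Iterator directly. A minimal sketch, with a hypothetical per-record transformation standing in for the intended logic:
def myfunc(chunk: Iterator[(String, List[Int])]): Iterator[Int] =
  // Plain Scala collection operations work on the Iterator; no SparkContext needed.
  chunk.map { case (_, values) => values.sum } // hypothetical transformation

val result = MyRDD.mapPartitions(myfunc).collect()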

WARN ScalaRowValueReader: Field 'client' is backed by an array but the associated Spark Schema does not reflect this

This question is the continuation of the previous question.
I am trying to execute an Elasticsearch DSL query in Spark 2.2 with Scala 2.11.8. The version of Elasticsearch is 2.4.4. This is the library that I use in Spark:
<dependency>
  <groupId>org.elasticsearch</groupId>
  <artifactId>elasticsearch-spark-20_2.11</artifactId>
  <version>5.2.2</version>
</dependency>
This is my current code:
val spark = SparkSession.builder()
  .config("es.nodes", "localhost")
  .config("es.port", 9200)
  .config("es.nodes.wan.only", "true")
  .config("es.index.auto.create", "true")
  .config("es.read.field.as.array.include", "true")
  .appName("ES test")
  .master("local[*]")
  .getOrCreate()
val myquery = """{"query": {
  "bool": {
    "must": [
      {
        "has_child": {
          "filter": {
            ...
          }
        }
      }
    ]
  }
}}"""
val df = spark.read.format("org.elasticsearch.spark.sql")
  .option("query", myquery)
  .option("pushdown", "true")
  .load("myindex/items")
  .select("test_user", "test_reply")
The issue that I get seems to be related to this one: https://github.com/elastic/elasticsearch-hadoop/issues/1058
But it's not clear how to deal with it.
I get a lot of warnings:
18/01/22 22:01:44 WARN ScalaRowValueReader: Field 'author' is backed by an array but the associated Spark Schema does not reflect this;
(use es.read.field.as.array.include/exclude)
18/01/22 22:01:44 WARN ScalaRowValueReader: Field 'client' is backed by an array but the associated Spark Schema does not reflect this;
(use es.read.field.as.array.include/exclude)
18/01/22 22:01:44 WARN ScalaRowValueReader: Field 'project' is backed by an array but the associated Spark Schema does not reflect this;
(use es.read.field.as.array.include/exclude)
and after that I get this error:
18/01/22 22:01:44 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
scala.MatchError: Buffer(13473953) (of class scala.collection.convert.Wrappers$JListWrapper)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:276)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:275)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:379)
at org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$3.apply(ExistingRDD.scala:61)
at org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$3.apply(ExistingRDD.scala:58)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
18/01/22 22:01:44 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): scala.MatchError: Buffer(13473953) (of class scala.collection.convert.Wrappers$JListWrapper)
Then, df.count() works well, and df.printSchema() returns this result:
root
|-- test_user: string (nullable = true)
|-- test_reply: string (nullable = true)
What do these warnings mean, and how can I avoid them?
Also, if I don't use select, I get this error:
scala.MatchError: Buffer(13473953) (of class scala.collection.convert.Wrappers$JListWrapper)
Please try replacing the statement below:
.config("es.read.field.as.array.include","true")
with
.config("es.read.field.as.array.include","author,client,project")
"es.read.field.as.array.include" expects all the array fields that need to be included.
https://www.elastic.co/guide/en/elasticsearch/hadoop/master/mapping.html#mapping-multi-values
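For completeness, a minimal sketch of the corrected session setup; the field list comes from the warnings above, so adjust it to your own mapping:
val spark = SparkSession.builder()
  .config("es.nodes", "localhost")
  .config("es.port", "9200")
  .config("es.nodes.wan.only", "true")
  .config("es.index.auto.create", "true")
  // list the array-backed fields explicitly instead of "true"
  .config("es.read.field.as.array.include", "author,client,project")
  .appName("ES test")
  .master("local[*]")
  .getOrCreate()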

Multiple apps are getting submitted to the Spark cluster, keep waiting, and then exit with an error

I am submitting a Spark application to the cluster using the following command:
/root/spark/bin/spark-submit --conf spark.driver.memory=10g --class com.knoldus.SampleApp /pathToJar/Application.jar
But what happens is: multiple apps get submitted, one runs while all the others wait, and after some time the code exits with an exception.
[Spark UI screenshot: several submitted applications, one running and the rest waiting]
After this, the code exits with this error:
8.149.243): java.io.IOException: Failed to write statements to keyspace.tableName.
at com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:167)
at com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:135)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:111)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:110)
at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:140)
at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:110)
at com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:135)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:37)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:37)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/12/14 06:26:28 WARN TaskSetManager: Lost task 0.1 in stage 2.0 (TID 561, 10.178.149.243): java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:644)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/12/14 06:26:28 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times; aborting job
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 563, 10.178.149.225): java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:644)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1871)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1884)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1904)
at com.datastax.spark.connector.RDDFunctions.saveToCassandra(RDDFunctions.scala:37)
at com.knoldus.xml.RNF2Driver$.main(RNFIngestPipeline.scala:56)
at com.knoldus.xml.RNF2Driver.main(RNFIngestPipeline.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:644)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
My Spark conf is:
private val conf = new SparkConf()
  .setAppName("SampleApp")
  .setMaster(sparkClusterIP)
  .set("spark.sql.shuffle.partitions", "8")
  .set("spark.cassandra.connection.host", cassandraIP)
  .set("spark.sql.crossJoin.enabled", "true")
  .set("spark.kryoserializer.buffer.max", "640m")
  .set("spark.executor.memory", "10g")
  .set("spark.executor.cores", "3")
  .set("spark.cassandra.output.batch.size.rows", "10")
  .set("spark.cassandra.output.batch.size.bytes", "20480")
This is my sample code. Can anyone please let me know what the problem is?
val cassandraIDs = sc.cassandraTable[A](keySpace, tableName).map(_.filename.split("/").last.split("\\.")(0).toLong).collect()
val broadCastList = sc.broadcast(cassandraIDs)
val files = sc.wholeTextFiles(hdFSIP).map(_._1).filter { path =>
  val listOfCassandraID = broadCastList.value
  !listOfCassandraID.contains(path.split("/").last.split("\\.")(0).toLong)
}.take(100)
import sqlContext.implicits._
val fileNameRDD = sc.parallelize(files)
val cassandraRdd = fileNameRDD.map { path =>
  ...
  // do some task
}.toDF(columnNames)
cassandraRdd.saveToCassandra(keySpace, tablename)
println(s"Completed Processing of $numOfDocs in ${System.currentTimeMillis() - start} milliseconds")
sc.stop()
Why are multiple apps getting submitted?
Because you are submitting multiple Spark jobs from your driver code. Each of the statements below triggers a new job in the current Spark context:
val cassandraIDs = sc.cassandraTable[A]....toLong).collect()
sc.wholeTextFiles
sc.parallelize(files)
cassandraRdd.saveToCassandra(keySpace,tablename)
.toDF(columnNames)
sc.broadcast
(Not sure about the last two)
I haven't worked much with Cassandra via Spark, but the code above doesn't utilize the power of Spark.
You should pipeline the tasks as much as possible so that Spark can plan and run them in a distributed way.
Example:
In your code, you are creating files by calling take(100) and then creating an RDD called fileNameRDD using sc.parallelize(files).
This triggers two Spark jobs: one to take 100 items, and one to create a new RDD from those 100 items.
Instead, you should combine these two tasks so that Spark can pipeline them:
sc.wholeTextFiles(hdFSIP).map(_._1).filter { path =>
  val listOfCassandraID = broadCastList.value
  !listOfCassandraID.contains(path.split("/").last.split("\\.")(0).toLong)
}.map { path => /* <--- combine the two tasks like this */
  ...
  // do some task
}.toDF(columnNames)
Note: I have skipped the take(100) part, but you should be able to achieve the same bound using filters, as sketched below.
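A hedged sketch of what that filter-based bound could look like (variable names reuse the question's; note that zipWithIndex itself triggers one job to compute partition offsets, so this is a trade-off rather than a free replacement for take):
val boundedPaths = sc.wholeTextFiles(hdFSIP).map(_._1)
  .filter { path =>
    !broadCastList.value.contains(path.split("/").last.split("\\.")(0).toLong)
  }
  .zipWithIndex()                         // runs one job to compute partition offsets
  .filter { case (_, idx) => idx < 100 }  // keep roughly the first 100 paths
  .map { case (path, _) => path }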
Why are you getting the NoSuchElementException: None.get error?
This is most likely because of a Scala version mismatch between your code and the Scala version that Spark was built with.
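For example, if the cluster's Spark is built against Scala 2.11, the application build must use the same binary version. A hedged build.sbt sketch (the exact Spark version here is an assumption; match whatever the cluster actually runs):
scalaVersion := "2.11.8" // must match the Scala binary version of the cluster's Spark

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.2" % "provided", // assumed cluster version
  "org.apache.spark" %% "spark-sql"  % "2.0.2" % "provided"
)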