How to parse dynamic JSON with dynamic keys inside it in Scala

I am trying to parse a JSON structure that is dynamic in nature and load it into a database, but I am having difficulty where the JSON has dynamic keys inside it. I have tried using the explode function, but it didn't help. A mostly similar problem is described here: How to parse a dynamic JSON key in a Nested JSON result?
Below is my sample JSON:
{
  "_id": {
    "planId": "5f34dab0c661d8337097afb9",
    "version": { "$numberLong": "1" },
    "period": {
      "name": "3Q20",
      "startDate": 20200629,
      "endDate": 20200927
    },
    "line": "b443e9c0-fafc-4791-87c9-8e32339c7f3c",
    "channelId": "G7k5_-HWRIuF0-afe7q-rQ"
  },
  "unitRates": {
    "units": { "$numberLong": "0" },
    "rate": 0.0,
    "rcRate": 0.0
  },
  "demoValues": {
    "66": {
      "cpm": 0.0,
      "cpp": 0,
      "vpvh": 0.0,
      "imps": 0.0,
      "rcImps": 0.0,
      "ue": 0.0,
      "grps": 0.0,
      "demoId": "66"
    },
    "63": {
      "cpm": 0.0,
      "cpp": 0,
      "vpvh": 0.0,
      "imps": 0.0,
      "rcImps": 0.0,
      "ue": 0.0,
      "grps": 0.0,
      "demoId": "63"
    },
    "21": {
      "cpm": 0.0,
      "cpp": 0,
      "vpvh": 0.0,
      "imps": 0.0,
      "rcImps": 0.0,
      "ue": 0.0,
      "grps": 0.0,
      "demoId": "21"
    }
  },
  "hh-imps": 0.0
}
Below is my Scala code:
import org.apache.spark.sql.streaming.OutputMode
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import com.google.gson.JsonObject
import org.apache.spark.sql.types.{ArrayType, MapType, StringType, StructField, StructType}
import org.codehaus.jettison.json.JSONObject

object ParseDynamic_v2 {
  def main(args: Array[String]): Unit = {
    System.setProperty("hadoop.home.dir", "C:\\hadoop")
    val spark = SparkSession
      .builder
      .appName("ConfluentConsumer")
      .master("local[4]")
      .getOrCreate()
    import spark.implicits._

    val jsonStringDs = spark.createDataset[String](
      Seq(
        ("""{"_id" : {"planId" : "5f34dab0c661d8337097afb9","version" : {"$numberLong" : "1"},"period" : {"name" : "3Q20","startDate" : 20200629,"endDate" : 20200927},"line" : "b443e9c0-fafc-4791-87c9-8e32339c7f3c","channelId" : "G7k5_-HWRIuF0-afe7q-rQ"},"unitRates" : {"units" : {"$numberLong" : "0"},"rate" : 0.0,"rcRate" : 0.0},"demoValues" : {"66" : {"cpm" : 0.0,"cpp" : 0,"vpvh" : 0.0,"imps" : 0.0,"rcImps" : 0.0,"ue" : 0.0,"grps" : 0.0,"demoId" : "66"},"63" : {"cpm" : 0.0,"cpp" : 0,"vpvh" : 0.0,"imps" : 0.0,"rcImps" : 0.0,"ue" : 0.0,"grps" : 0.0,"demoId" : "63"},"21" : {"cpm" : 0.0,"cpp" : 0,"vpvh" : 0.0,"imps" : 0.0,"rcImps" : 0.0,"ue" : 0.0,"grps" : 0.0,"demoId" : "21"}},"hh-imps" : 0.0}""")
      ))
    jsonStringDs.show

    val df = spark.read.json(jsonStringDs)
    df.show(false)

    val app = df.select("demoValues.*")
    app.createOrReplaceTempView("app")
    app.printSchema
    app.show(false)

    val verticaProperties: Map[String, String] = Map(
      "db" -> "dbname",         // database name
      "user" -> "user",         // database username
      "password" -> "****",     // password
      "table" -> "tablename",   // Vertica table name
      "dbschema" -> "public",   // Vertica schema in which the table will be residing
      "host" -> "localhost",    // host on which Vertica is currently running
      "hdfs_url" -> "hdfs://localhost:8020/user/hadoop/planheader/", // HDFS directory in which the intermediate ORC file persists before being sent to Vertica
      "web_hdfs_url" -> "webhdfs://localhost:50070/user/hadoop/planheader/"
    )
    val verticaDataSource = "com.vertica.spark.datasource.DefaultSource"

    // batch write mode
    val loadStream = df.write.format(verticaDataSource).options(verticaProperties).mode("overwrite").save()

    // streaming write mode
    val saveToVertica: DataFrame => Unit =
      dataFrame =>
        dataFrame.write.format(verticaDataSource).options(verticaProperties).mode("append").save()

    val checkpointLocation = "/user/hadoop/planheader/checkpoint"
    val streamingQuery = df.writeStream
      .outputMode(OutputMode.Append)
      .option("checkpointLocation", checkpointLocation)
      //.trigger(ProcessingTime("25 seconds"))
      .foreachBatch((ds, _) => saveToVertica(ds)).start()

    streamingQuery.awaitTermination()
  }
}
expected output:

Here is what I did using Vertica:
I created a flex table, loaded it, and used Vertica's flex table function COMPUTE_FLEXTABLE_KEYS_AND_BUILD_VIEW() to get a view.
It turned out to be a single-row table:
-- CREATE the Flex Table
CREATE FLEX TABLE demovals();
-- copy it using the built-in JSON Parser (it creates a map container,
-- with all key-value pairs
COPY demovals FROM '/home/gessnerm/1/Vertica/supp/l.json' PARSER fjsonparser();
-- out vsql:/home/gessnerm/._vfv.sql:1: ROLLBACK 4213: Object "demovals" already exists
-- out Rows Loaded
-- out -------------
-- out 1
-- out (1 row)
-- out
-- out Time: First fetch (1 row): 112.540 ms. All rows formatted: 112.623 ms
-- the function on the next line guesses the data types in the values
-- matching the keys, stores the guessed data types in a second table,
-- and builds a view from all found keys
SELECT COMPUTE_FLEXTABLE_KEYS_AND_BUILD_VIEW('demovals');
-- out COMPUTE_FLEXTABLE_KEYS_AND_BUILD_VIEW
-- out --------------------------------------------------------------------------------------------------------
-- out Please see dbadmin.demovals_keys for updated keys
-- out The view dbadmin.demovals_view is ready for querying
-- out (1 row)
-- out
-- out Time: First fetch (1 row): 467.551 ms. All rows formatted: 467.583 ms
-- now, select from the single-row view on the flex table,
-- one row per column in the report (extended view: "\x" )
\x
SELECT * FROM dbadmin.demovals_view;
-- out -[ RECORD 1 ]---------------+-------------------------------------
-- out _id.channelid | G7k5_-HWRIuF0-afe7q-rQ
-- out _id.line | b443e9c0-fafc-4791-87c9-8e32339c7f3c
-- out _id.period.enddate | 20200927
-- out _id.period.name | 3Q20
-- out _id.period.startdate | 20200629
-- out _id.planid | 5f34dab0c661d8337097afb9
-- out _id.version.$numberlong | 1
-- out demovalues.21.cpm | 0.00
-- out demovalues.21.cpp | 0
-- out demovalues.21.demoid | 21
-- out demovalues.21.grps | 0.00
-- out demovalues.21.imps | 0.00
-- out demovalues.21.rcimps | 0.00
-- out demovalues.21.ue | 0.00
-- out demovalues.21.vpvh | 0.00
-- out demovalues.63.cpm | 0.00
-- out demovalues.63.cpp | 0
-- out demovalues.63.demoid | 63
-- out demovalues.63.grps | 0.00
-- out demovalues.63.imps | 0.00
-- out demovalues.63.rcimps | 0.00
-- out demovalues.63.ue | 0.00
-- out demovalues.63.vpvh | 0.00
-- out demovalues.66.cpm | 0.00
-- out demovalues.66.cpp | 0
-- out demovalues.66.demoid | 66
-- out demovalues.66.grps | 0.00
-- out demovalues.66.imps | 0.00
-- out demovalues.66.rcimps | 0.00
-- out demovalues.66.ue | 0.00
-- out demovalues.66.vpvh | 0.00
-- out hh-imps | 0.00
-- out unitrates.rate | 0.00
-- out unitrates.rcrate | 0.00
-- out unitrates.units.$numberlong | 0
For the children, for example:
CREATE FLEX TABLE children();
TRUNCATE TABLE children;
COPY children FROM '/home/gessnerm/1/Vertica/supp/l.json' PARSER fjsonparser(start_point='demoValues');
SELECT COMPUTE_FLEXTABLE_KEYS_AND_BUILD_VIEW('children');
\x
SELECT * FROM dbadmin.children_view;
-- out Time: First fetch (0 rows): 7.303 ms. All rows formatted: 7.308 ms
-- out Rows Loaded
-- out -------------
-- out 1
-- out (1 row)
-- out
-- out Time: First fetch (1 row): 13.848 ms. All rows formatted: 13.876 ms
-- out COMPUTE_FLEXTABLE_KEYS_AND_BUILD_VIEW
-- out --------------------------------------------------------------------------------------------------------
-- out Please see dbadmin.children_keys for updated keys
-- out The view dbadmin.children_view is ready for querying
-- out (1 row)
-- out
-- out Time: First fetch (1 row): 140.381 ms. All rows formatted: 140.404 ms
-- out -[ RECORD 1 ]---
-- out 21.cpm | 0.00
-- out 21.cpp | 0
-- out 21.demoid | 21
-- out 21.grps | 0.00
-- out 21.imps | 0.00
-- out 21.rcimps | 0.00
-- out 21.ue | 0.00
-- out 21.vpvh | 0.00
-- out 63.cpm | 0.00
-- out 63.cpp | 0
-- out 63.demoid | 63
-- out 63.grps | 0.00
-- out 63.imps | 0.00
-- out 63.rcimps | 0.00
-- out 63.ue | 0.00
-- out 63.vpvh | 0.00
-- out 66.cpm | 0.00
-- out 66.cpp | 0
-- out 66.demoid | 66
-- out 66.grps | 0.00
-- out 66.imps | 0.00
-- out 66.rcimps | 0.00
-- out 66.ue | 0.00
-- out 66.vpvh | 0.00
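For comparison, the same dot-named, single-row flattening can also be produced on the Spark side before loading. Below is a minimal sketch reusing the df built by spark.read.json in the question; the flattenSchema helper is mine, not part of this answer.

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StructField, StructType}

// Recursively turn nested structs into top-level columns named with the
// same dotted paths the flex view exposes (e.g. demoValues.66.cpm).
def flattenSchema(schema: StructType, prefix: String = ""): Array[Column] =
  schema.fields.flatMap {
    case StructField(name, inner: StructType, _, _) => flattenSchema(inner, s"$prefix$name.")
    case StructField(name, _, _, _)                 => Array(col(prefix + name).as(prefix + name))
  }

val flat = df.select(flattenSchema(df.schema): _*)
flat.show(false)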

I am not sure how efficient my code is, but it does the job.
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

// reading data from the JSON file
val df1 = spark.read.json("src/main/resources/data.json")

// defining the target schema here (types aligned with the columns selected below)
val schema = StructType(
  StructField("planid", StringType, true) ::
  StructField("periodname", StringType, false) ::
  StructField("cpm", DoubleType, false) ::
  StructField("vpvh", DoubleType, false) ::
  StructField("imps", DoubleType, false) ::
  StructField("demoids", StringType, false) :: Nil)

var someDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)

// this will have all the dynamic keys as columns, plus planId and name
val app = df1.select("demoValues.*", "_id.planId", "_id.period.name")
val arr = app.columns

// iterate only over the dynamic key columns (the last two columns are planId and name)
for (i <- 0 to arr.length - 3) {
  println("columnname: " + arr(i))
  // traversing through the keys to get the values, e.g. demoValues.63.cpm
  val cpm  = "demoValues." + arr(i) + ".cpm"
  val vpvh = "demoValues." + arr(i) + ".vpvh"
  val imps = "demoValues." + arr(i) + ".imps"
  val df2 = df1.select(col("_id.planId"), col("_id.period.name"), col(cpm),
    col(vpvh), col(imps), lit(arr(i)).as("demoids"))
  df2.show(false)
  someDF = someDF.union(df2)
}
someDF.show()
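An alternative sketch that avoids looping over the dynamic key columns: since every key under demoValues has the same value layout, the struct can be re-read as a map and exploded into one row per key. This assumes Spark 2.2+ (for from_json with a MapType schema) and reuses df1 from above; the field names come from the sample JSON.

import org.apache.spark.sql.functions.{col, explode, from_json, to_json}
import org.apache.spark.sql.types._

// shape shared by every dynamic key under demoValues
val demoStruct = StructType(Seq(
  StructField("cpm", DoubleType), StructField("cpp", DoubleType),
  StructField("vpvh", DoubleType), StructField("imps", DoubleType),
  StructField("rcImps", DoubleType), StructField("ue", DoubleType),
  StructField("grps", DoubleType), StructField("demoId", StringType)
))

// round-trip demoValues through JSON so it can be parsed as a map,
// then explode the map into one (demoKey, demo) row per dynamic key
val exploded = df1
  .select(
    col("_id.planId"),
    col("_id.period.name").as("periodName"),
    explode(from_json(to_json(col("demoValues")), MapType(StringType, demoStruct)))
      .as(Seq("demoKey", "demo")))
  .select(col("planId"), col("periodName"), col("demoKey"), col("demo.*"))

exploded.show(false)

With this shape, a new demo id simply becomes an extra row, so nothing needs to change when the keys vary from document to document.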

Related

How to convert sequential numerical processing of Cassandra table data to parallel in Spark?

We are doing some mathematical modelling on data from a Cassandra table using the Spark Cassandra connector, and the execution is currently sequential to get the output. How do you parallelize this for faster execution?
I'm new to Spark and I tried a few things, but I'm unable to understand how to use tabular data in map, groupBy, and reduceBy functions. If someone can explain (with some code snippets) how to parallelize tabular data, it would be really helpful.
import org.apache.spark.sql.{Row, SparkSession}
import com.datastax.spark.connector._
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

class SparkExample(sparkSession: SparkSession, pathToCsv: String) {
  private val sparkContext = sparkSession.sparkContext
  sparkSession.stop()
  val conf = new SparkConf(true)
    .set("spark.cassandra.connection.host", "127.0.0.1")
    .setAppName("cassandra").setMaster("local[*]")
  val sc = new SparkContext(conf)

  def testExample(): Unit = {
    val KNMI_rdd = sc.cassandraTable("dbks1", "knmi_w")
    val Table_count = KNMI_rdd.count()
    val KNMI_idx = KNMI_rdd.zipWithIndex
    val idx_key = KNMI_idx.map { case (k, v) => (v, k) }
    var i = 0
    var n: Int = Table_count.toInt
    println(Table_count)
    for (i <- 1 to n if i < n) {
      println(i)
      val Row = idx_key.lookup(i)
      println(Row)
      val firstRow = Row(0)
      val yyyy_var = firstRow.get[Int]("yyyy")
      val mm_var = firstRow.get[Double]("mm")
      val dd_var = firstRow.get[Double]("dd")
      val dr_var = firstRow.get[Double]("dr")
      val tg_var = firstRow.get[Double]("tg")
      val ug_var = firstRow.get[Double]("ug")
      val loc_var = firstRow.get[String]("loc")
      val pred_factor = (((0.15461 * tg_var) + (0.8954 * ug_var)) / ((0.0000451 * dr_var) + 0.0004487))
      println(yyyy_var, mm_var, dd_var, loc_var)
      println(pred_factor)
    }
  }
}
//test data
// loc | yyyy | mm | dd | dr | tg | ug
//-----+------+----+----+-----+-----+----
// AMS | 2019 | 1 | 1 | 35 | 5 | 84
// AMS | 2019 | 1 | 2 | 76 | 34 | 74
// AMS | 2019 | 1 | 3 | 46 | 33 | 85
// AMS | 2019 | 1 | 4 | 35 | 1 | 84
// AMS | 2019 | 1 | 5 | 29 | 0 | 93
// AMS | 2019 | 1 | 6 | 32 | 25 | 89
// AMS | 2019 | 1 | 7 | 42 | 23 | 89
// AMS | 2019 | 1 | 8 | 68 | 75 | 92
// AMS | 2019 | 1 | 9 | 98 | 42 | 86
// AMS | 2019 | 1 | 10 | 92 | 12 | 76
// AMS | 2019 | 1 | 11 | 66 | 0 | 71
// AMS | 2019 | 1 | 12 | 90 | 56 | 85
// AMS | 2019 | 1 | 13 | 83 | 139 | 90
Edit 1:
I tried using the map function and I'm able to calculate the mathematical computations; how do I add keys in front of these values, as defined by WeatherId?
case class Weather(loc: String, yyyy: Int, mm: Int, dd: Int, dr: Double, tg: Double, ug: Double)
case class WeatherId(loc: String, yyyy: Int, mm: Int, dd: Int)

val rows = dataset1
  .map(line => Weather(
    line.getAs[String]("loc"),
    line.getAs[Int]("yyyy"),
    line.getAs[Int]("mm"),
    line.getAs[Int]("dd"),
    line.getAs[Double]("dr"),
    line.getAs[Double]("tg"),
    line.getAs[Double]("ug")
  ))

val pred_factor = rows
  .map(x => ((x.dr * betaz) + (x.tg * betay)) + (x.ug * betaz))
Thanks
TL;DR:
Use a DataFrame/Dataset instead of an RDD.
The argument for DataFrames over RDDs is long, but the short of it is that DataFrames and their typed alternative, Datasets, outperform the low-level RDDs.
With the Spark Cassandra connector you can configure the input split size, which dictates the partition size in Spark: more partitions means more parallelism.
val lastdf = spark
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map(
    "table" -> "words",
    "keyspace" -> "test",
    "cluster" -> "ClusterOne",
    "spark.cassandra.input.split.size_in_mb" -> "48" // smaller size = more partitions
  ))
  .load()
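Once the table is loaded as a DataFrame, the per-row lookup loop from the question can be replaced by a single column expression, which Spark evaluates in parallel across all partitions. A minimal sketch, assuming the same read pattern pointed at the dbks1.knmi_w table from the question:

import org.apache.spark.sql.functions.{col, lit}

// hypothetical read of the knmi_w table with the Cassandra data source
val knmiDF = spark
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "knmi_w", "keyspace" -> "dbks1"))
  .load()

// compute the prediction factor for every row at once instead of looking
// rows up one by one on the driver
val withPred = knmiDF.withColumn(
  "pred_factor",
  (lit(0.15461) * col("tg") + lit(0.8954) * col("ug")) /
    (lit(0.0000451) * col("dr") + lit(0.0004487))
)

withPred.select("loc", "yyyy", "mm", "dd", "pred_factor").show()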

Spark dataframe Column content modification

I have a dataframe; df.show() gives:
+--------+---------+---------+---------+---------+
| Col11 | Col22 | Expend1 | Expend2 | Expend3 |
+--------+---------+---------+---------+---------+
| Value1 | value1 | 123 | 2264 | 56 |
| Value1 | value2 | 124 | 2255 | 23 |
+--------+---------+---------+---------+---------+
Can I transform the above data frame to the below using some SQL?
+--------+---------+-------------+---------------+------------+
| Col11 | Col22 | Expend1 | Expend2 | Expend3 |
+--------+---------+-------------+---------------+------------+
| Value1 | value1 | Expend1:123 | Expend2: 2264 | Expend3:56 |
| Value1 | value2 | Expend1:124 | Expend2: 2255 | Expend3:23 |
+--------+---------+-------------+---------------+------------+
You can use the idea of foldLeft here
import spark.implicits._
import org.apache.spark.sql.functions._
val df = spark.sparkContext.parallelize(Seq(
  ("Value1", "value1", "123", "2264", "56"),
  ("Value1", "value2", "124", "2255", "23")
)).toDF("Col11", "Col22", "Expend1", "Expend2", "Expend3")

// Lists your columns for the operation
val cols = List("Expend1", "Expend2", "Expend3")

val newDF = cols.foldLeft(df) { (acc, name) =>
  acc.withColumn(name, concat(lit(name + ":"), col(name)))
}
newDF.show()
Output:
+------+------+-----------+------------+----------+
| Col11| Col22| Expend1| Expend2| Expend3|
+------+------+-----------+------------+----------+
|Value1|value1|Expend1:123|Expend2:2264|Expend3:56|
|Value1|value2|Expend1:124|Expend2:2255|Expend3:23|
+------+------+-----------+------------+----------+
You can do that using a simple SQL select statement; if you want, you can use a UDF as well.
Ex -> select Col11, Col22, 'Expend1:' + cast(Expend1 as varchar(10)) as Expend1, .... from table
val df = Seq(("Value1", "value1", "123", "2264", "56"), ("Value1", "value2", "124", "2255", "23") ).toDF("Col11", "Col22", "Expend1", "Expend2", "Expend3")
val cols = df.columns.filter(!_.startsWith("Col")) // It will only fetch other than col% prefix columns
val getCombineData = udf { (colName:String, colvalue:String) => colName + ":"+ colvalue}
var in = df
for (e <- cols) {
in = in.withColumn(e, getCombineData(lit(e), col(e)) )
}
in.show
// results
+------+------+-----------+------------+----------+
| Col11| Col22| Expend1| Expend2| Expend3|
+------+------+-----------+------------+----------+
|Value1|value1|Expend1:123|Expend2:2264|Expend3:56|
|Value1|value2|Expend1:124|Expend2:2255|Expend3:23|
+------+------+-----------+------------+----------+
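For completeness, here is a minimal sketch of the plain-SQL route mentioned above, using Spark SQL's concat instead of the T-SQL-style + operator; the temp view name expend_tbl is mine.

df.createOrReplaceTempView("expend_tbl")

// concat() is the Spark SQL way to build the "name:value" strings
val sqlDF = spark.sql(
  """SELECT Col11, Col22,
    |       concat('Expend1:', Expend1) AS Expend1,
    |       concat('Expend2:', Expend2) AS Expend2,
    |       concat('Expend3:', Expend3) AS Expend3
    |FROM expend_tbl""".stripMargin)

sqlDF.show()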

SPARK : groupByKey vs reduceByKey which is better and efficient to combine the Maps?

I have a data frame [df] :
+----+----------+-------+
| id | itemName | Value |
+----+----------+-------+
| 1  | TV       | 4     |
| 1  | Movie    | 5     |
| 2  | TV       | 6     |
+----+----------+-------+
I am trying to transform it to :
{id : 1, itemMap : { "TV" : 4, "Movie" : 5}}
{id : 2, itemMap : {"TV" : 6}}
I want the final result to be in RDD[(String, String)] with itemMap as the value's name
So I am doing :
case class Data(itemMap: Map[String, Int]) extends Serializable

df.map {
  case r =>
    val id = r.getAs[String]("id")
    val itemName = r.getAs[String]("itemName")
    val Value = r.getAs[Int]("Value")
    (id, Map(itemName -> Value))
}.reduceByKey((x, y) => x ++ y).map {
  case (k, v) =>
    (k, JacksonUtil.toJson(Data(v)))
}
But it takes forever to run. Is it efficient to use reduceByKey here, or can I use groupByKey? Is there any other efficient way to do the transformation?
My config:
I have 10 slaves and a master of type r3.8xlarge
spark.driver.cores 30
spark.driver.memory 200g
spark.executor.cores 16
spark.executor.instances 40
spark.executor.memory 60g
spark.memory.fraction 0.95
spark.yarn.executor.memoryOverhead 2000
Is this the correct type of machine for this task ?
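For reference, a minimal sketch of the same aggregation kept in the DataFrame API, assuming Spark 2.4+ (for map_from_entries) and that id is a string column as in the snippet above: it builds the per-id map in a single aggregation and serialises it with the built-in to_json instead of a per-record Jackson call.

import org.apache.spark.sql.functions.{col, collect_list, map_from_entries, struct, to_json}

// build one map per id in a single aggregation
val itemMaps = df
  .groupBy("id")
  .agg(map_from_entries(collect_list(struct(col("itemName"), col("Value")))).as("itemMap"))

// serialise each record to JSON and drop down to RDD[(String, String)]
val result: org.apache.spark.rdd.RDD[(String, String)] = itemMaps
  .select(col("id"), to_json(struct(col("id"), col("itemMap"))).as("json"))
  .rdd
  .map(r => (r.getAs[String]("id"), r.getAs[String]("json")))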

How to do 2 distinct groupby conditions on the same data frame in Scala?

I have a data frame and I need to do two different groupBys on the same data frame.
+----+------+-------+-------+----------------------+
| id | type | item  | value | timestamp            |
+----+------+-------+-------+----------------------+
| 1  | rent | dvd   | 12    | 2016-09-19T00:00:00Z |
| 1  | rent | dvd   | 12    | 2016-09-19T00:00:00Z |
| 1  | buy  | tv    | 12    | 2016-09-20T00:00:00Z |
| 1  | rent | movie | 12    | 2016-09-20T00:00:00Z |
| 1  | buy  | movie | 12    | 2016-09-18T00:00:00Z |
| 1  | buy  | movie | 12    | 2016-09-18T00:00:00Z |
+----+------+-------+-------+----------------------+
I would like to get the result as :
id : 1
totalValue : 72 --- group by based on id
typeCount : {"rent" : 3, "buy" : 3} --- group by based on id
itemCount : {"dvd" : 2, "tv" : 1, "movie" : 3 } --- group by based on id
typeForDay : {"rent: 2, "buy" : 2 } --- group By based on id and dayofmonth(col("timestamp")) atmost 1 type per day
I tried :
val count_by_value = udf {( listValues :scala.collection.mutable.WrappedArray[String]) => if (listValues == null) null else listValues.groupBy(identity).mapValues(_.size)}
val group1 = df.groupBy("id").agg(collect_list("type"),sum("value") as "totalValue", collect_list("item"))
val group1Result = group1.withColumn("typeCount", count_by_value($"collect_list(type)"))
.drop("collect_list(type)")
.withColumn("itemCount", count_by_value($"collect_list(item)"))
.drop("collect_list(item)")
val group2 = df.groupBy("id", dayofmonth(col("timestamp"))).agg(collect_set("type"))
val group2Result = group2.withColumn("typeForDay", count_by_value($"collect_set(type)"))
.drop("collect_set(type)")
val groupedResult = group1Result.join(group2Result, "id").show()
But it takes time; is there any other efficient way of doing this?
A better approach is to add each group field to the key and reduce them instead of using groupBy(). You can use these:
df1.map(rec => (rec(0), rec(3).toString().toInt)).
  reduceByKey(_ + _).take(5).foreach(println)
=> (1,72)

df1.map(rec => ((rec(0), rec(1)), 1)).
  reduceByKey(_ + _).
  map(x => (x._1._1, x._1._2, x._2)).take(5).foreach(println)
=> (1,rent,3)
   (1,buy,3)

df1.map(rec => ((rec(0), rec(2)), 1)).
  reduceByKey(_ + _).
  map(x => (x._1._1, x._1._2, x._2)).take(5).foreach(println)
=> (1,dvd,2)
   (1,tv,1)
   (1,movie,3)

df1.map(rec => ((rec(0), rec(1), rec(4).toString().substring(8, 10)), 1)).
  reduceByKey(_ + _).map(x => (x._1._1, x._1._2, x._1._3, x._2)).
  take(5).foreach(println)
=> (1,rent,19,2)
   (1,buy,20,1)
   (1,buy,18,2)
   (1,rent,20,1)

Scala: Getting an error while adding a new column to an existing data frame in Spark?

I have a data frame : df
|---itemId----|----Country------------|
| 11 | US |
| 13 | France |
| 101 | Fra nce |
How do I add the column values in the same data frame :
|---itemId----|----Country------------|----Type-----|
| 11 | US | NA |
| 13 | France | EU |
| 101 | France | EU |
I tried :
df: org.apache.spark.sql.DataFrame = [itemId: string, Country: string]
testMap: scala.collection.Map[String,com.model.PeopleInfo]
val peopleMap = sc.broadcast(testMap)
val getTypeFunc : (String => String) = (country: String) => {
if (StringUtils.isNotBlank(peopleMap.value(itemId).getType)) {
peopleMap.value(itemId).getType
}
"Unknown Type"
}
val typefunc = udf(getTypeFunc)
val newDF = df.withColumn("Type",typefunc(col("Country")))
But I keep getting error :
org.apache.thrift.transport.TTransportException at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132) at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84) at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:362) at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:284) at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:191) at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69) at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.recv_interpret(RemoteInterpreterService.java:220) at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.interpret(RemoteInterpreterService.java:205) at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.interpret(RemoteInterpreter.java:211) at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:93) at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:207) at org.apache.zeppelin.scheduler.Job.run(Job.java:170) at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:304) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at
I am using spark 1.6 with EMR emr-4.3.0 Zeppelin-Sandbox 0.5.5 :
Cluster size = 30 type = r3.8Xlarge
spark.executor.instances 170
spark.executor.cores 5
spark.driver.memory 219695M
spark.yarn.driver.memoryOverhead 21969
spark.executor.memory 38G
spark.yarn.executor.memoryOverhead 21969
spark.default.parallelism 1856
spark.kryoserializer.buffer.max 512m
spark.sql.hive.convertMetastoreParquet false
spark.hadoop.mapreduce.input.fileinputformat.split.maxsize 33554432
Am I doing anything wrong here ?
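Independent of the reported Thrift error, the posted UDF as written would not compile: it references itemId, which is not in scope inside getTypeFunc, and the result of the if expression is discarded, so the function would always return "Unknown Type". Below is a minimal corrected sketch, assuming testMap is keyed by itemId (which is what peopleMap.value(itemId) suggests) and that the lookup column should therefore be itemId rather than Country.

import org.apache.spark.sql.functions.{col, udf}

// assumed fix: key the broadcast lookup on the value the UDF actually
// receives, with an explicit fallback when the key is missing or blank
val getTypeFunc: String => String = (itemId: String) =>
  peopleMap.value.get(itemId)
    .flatMap(p => Option(p.getType))
    .filter(_.trim.nonEmpty)
    .getOrElse("Unknown Type")

val typeFunc = udf(getTypeFunc)
val newDF = df.withColumn("Type", typeFunc(col("itemId")))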