Spark Structured Streaming output not displaying in inteliJ console - scala

I'm trying to emulate this example from Jacek Laskowski Book of read a CSV file and aggregate the data in the console, but for some reason, the output is not displaying in the InteliJ console.
scala> spark.version
res4: String = 2.2.0
I found some reference in some places (1, 2, 3, 4, 5) here in SO, I tried everything but I didn't solve the problem.
This is the code:
package org.sample
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
object App {
def main(args : Array[String]): Unit = {
val DIR = new java.io.File(".").getCanonicalPath + "dataset/stream_in"
val conf = new SparkConf()
.setMaster("local[*]")
.setAppName("Spark Structured Streaming Job")
val spark = SparkSession.builder()
.appName("Spark Structured Streaming Job")
.master("local[*]")
.getOrCreate()
val reader = spark.readStream
.format("csv")
.option("header", true)
.option("delimiter", ";")
.option("latestFirst", "true")
.schema(SchemaDefinition.csvSchema)
.load(DIR + "/*")
reader.createOrReplaceTempView("user_records")
val tranformation = spark.sql(
"""
SELECT carrier, marital_status, COUNT(1) as num_users
FROM user_records
GROUP BY carrier, marital_status
"""
)
val consoleStream = tranformation
.writeStream
.format("console")
.option("truncate", false)
.outputMode("complete")
.start()
consoleStream.awaitTermination()
}
}
My output it is only:
18/11/30 15:40:31 INFO StreamExecution: Streaming query made progress: {
"id" : "9420f826-0daf-40c9-a427-e89ed42ee738",
"runId" : "991c9085-3425-4ea6-82af-4cef20007a66",
"name" : null,
"timestamp" : "2018-11-30T14:40:31.117Z",
"numInputRows" : 0,
"inputRowsPerSecond" : 0.0,
"processedRowsPerSecond" : 0.0,
"durationMs" : {
"getOffset" : 2,
"triggerExecution" : 2
},
"eventTime" : {
"watermark" : "1970-01-01T00:00:00.000Z"
},
"stateOperators" : [ ],
"sources" : [ {
"description" : "FileStreamSource[file:/structured-streamming-taskdataset/stream_in/*]",
"startOffset" : null,
"endOffset" : null,
"numInputRows" : 0,
"inputRowsPerSecond" : 0.0,
"processedRowsPerSecond" : 0.0
} ],
"sink" : {
"description" : "org.apache.spark.sql.execution.streaming.ConsoleSink#6a62e7ef"
}
}

I redefine the file and right now worked for me:
Differences:
remove the unnecessary conf. With SparkSession we do not need to
call the conf
The .load(/*) didn't work. What worked was the keep only the path
dataset/stream_in;
The data to tranformation was wrong (the fields didn't match the
file)
Final code:
package org.sample
import org.apache.spark.sql.SparkSession
import org.apache.log4j.{Level, Logger}
object StreamCities {
def main(args : Array[String]): Unit = {
// Turn off logs in console
Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)
val spark = SparkSession.builder()
.appName("Spark Structured Streaming get CSV and agregate")
.master("local[*]")
.getOrCreate()
// 01. Schema Definition: We'll put the structure of our
// CSV file. Can be done using a class, but for simplicity
// I'll keep it here
import org.apache.spark.sql.types._
def csvSchema = StructType {
StructType(Array(
StructField("id", StringType, true),
StructField("name", StringType, true),
StructField("city", StringType, true)
))
}
// 02. Read the Stream: Create DataFrame representing the
// stream of the CSV according our Schema. The source it is
// the folder in the .load() option
val users = spark.readStream
.format("csv")
.option("sep", ",")
.option("header", true)
.schema(csvSchema)
.load("dataset/stream_in")
// 03. Aggregation of the Stream: To use the .writeStream()
// we must pass a DF aggregated. We can do this using the
// Untyped API or SparkSQL
// 03.1: Aggregation using untyped API
//val aggUsers = users.groupBy("city").count()
// 03.2: Aggregation using Spark SQL
users.createOrReplaceTempView("user_records")
val aggUsers = spark.sql(
"""
SELECT city, COUNT(1) as num_users
FROM user_records
GROUP BY city"""
)
// Print the schema of our aggregation
println(aggUsers.printSchema())
// 04. Output the stream: Now we'll write our stream in
// console and as new files will be included in the folder
// that Spark it's listening the results will be updated
val consoleStream = aggUsers.writeStream
.outputMode("complete")
.format("console")
.start()
.awaitTermination()
}
}

I had the same issue, I solved it with:
.option("startingOffsets", "earliest") \
in the
def read_from_kafka(spark):
df_sales = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", "spark_topic_sales") \
.option("startingOffsets", "earliest") \
.load()
return df_sales
this will make sure the data is read from the topic from the earliest. hope it will help someone and he would not spend 2 hours like I did!

Related

How to send parquet to kafka in batches using strcutured spark streaming?

I'am reading parquet files and convert it into JSON format, then send to kafka. The question is, it read the whole parquet so send to kafka one-time, but i want to send json data line by line or in batches:
object WriteParquet2Kafka {
def main(args: Array[String]): Unit = {
val spark: SparkSession = SparkSession
.builder
.master("yarn")
.appName("Write Parquet to Kafka")
.getOrCreate()
import spark.implicits._
val ds: DataFrame = spark.readStream
.schema(parquet-schema)
.parquet(path-to-parquet-file)
val df: DataFrame = ds.select($"vin" as "key", to_json( struct( ds.columns.map(col(_)):_* ) ) as "value" )
.filter($"key" isNotNull)
val ddf = df
.writeStream
.format("kafka")
.option("topic", topics)
.option("kafka.bootstrap.servers", "localhost:9092")
.option("checkpointLocation", "/tmp/test")
.trigger(Trigger.ProcessingTime("10 seconds"))
.start()
ddf.awaitTermination()
}
}
Is it possible to do this?
I finally figure out how to solve my question, just add a option and set a suitable number for maxFilesPerTrigger:
val df: DataFrame = spark
.readStream
.option("maxFilesPerTrigger", 1)
.schema(parquetSchema)
.parquet(parqurtUri)
Note: maxFilesPerTrigger must set to 1, so that every parquet file being readed.

java.lang.IllegalArgumentException: 'path' is not specified // Spark Consumer Issue

I am trying to create SparkConsumer so I can send messeges in this case a csv file to Kafka through Spark Streaming. But I have an error that 'path' is not specified.
See my code below
My code is as follows:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.streaming.FileStreamSource.Timestamp
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.streaming.OutputMode
object sparkConsumer extends App {
val conf = new SparkConf().setMaster("local").setAppName("Name")
val sc = new SparkContext(conf)
val rootLogger = Logger.getRootLogger()
rootLogger.setLevel(Level.ERROR)
val spark = SparkSession
.builder()
.appName("Spark-Kafka-Integration")
.master("local")
.getOrCreate()
val schema = StructType(Array(
StructField("InvoiceNo", StringType, nullable = true),
StructField("StockCode", StringType, nullable = true),
StructField("Description", StringType, nullable = true),
StructField("Quantity", StringType, nullable = true)
))
val streamingDataFrame = spark.readStream.schema(schema).csv("C:/Users/me/Desktop/Tasks/Tasks1/test.csv")
streamingDataFrame.selectExpr("CAST(InvoiceNo AS STRING) AS key", "to_json(struct(*)) AS value").
writeStream
.format("csv")
.option("topic", "topic_test")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("checkpointLocation", "C:/Users/me/IdeaProjects/SparkStreaming/checkpointLocation/")
.start()
import spark.implicits._
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "topic_test")
.load()
val df1 = df.selectExpr("CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP)").as[(String, Timestamp)]
.select(from_json($"value", schema).as("data"), $"timestamp")
.select("data.*", "timestamp")
df1.writeStream
.format("console")
.option("truncate","false")
.outputMode(OutputMode.Append)
.start()
.awaitTermination()
}
I become the following error:
Exception in thread "main" java.lang.IllegalArgumentException: 'path' is not specified
Does anyone know what I am missing?
It seems that it can be a problem on this part of your code:
streamingDataFrame.selectExpr("CAST(InvoiceNo AS STRING) AS key", "to_json(struct(*)) AS value").
writeStream
.format("csv")
.option("topic", "topic_test")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("checkpointLocation", "C:/Users/me/IdeaProjects/SparkStreaming/checkpointLocation/")
.start()
because you use use a "csv" format but you donĀ“t set the file location that it needs. Instead you configure Kafka properties to use a kafka topic as your sink. So if you change the format to "kafka" it should work.
Another problem you can experiment using csv as source is that your path should be a directory not file. In your case, if you create a directory and move your csv file it will work.
Just for testing, create a directoy named C:/Users/me/Desktop/Tasks/Tasks1/test.csv and create a file with the name part-0000.csv inside. Then include your csv content in this new file and start again the process.

Each query takes more time using Structured Streaming with Spark

I'm using Spark 2.3.0, Scala 2.11.8 and Kafka and I'm trying to write into parquet files all the messages from Kafka with Structured Streaming but for each query that my implementation does the total time for each one increase a lot Spark Stages Image.
I would like to know why this happens, I tried with different possibles triggers (Continues,0 seconds, 1 seconds, 10 seconds,10 minutes, etc) and always I get the same behavior. My code has this structure:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{Column, SparkSession}
import com.name.proto.ProtoMessages
import java.io._
import java.text.{DateFormat, SimpleDateFormat}
import java.util.Date
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.streaming.OutputMode
object StructuredStreaming {
def message_proto(value:Array[Byte]): Map[String, String] = {
try {
val dateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
val impression_proto = ProtoMessages.TrackingRequest.parseFrom(value)
val json = Map(
"id_req" -> (impression_proto.getIdReq().toString),
"ts_imp_request" -> (impression_proto.getTsRequest().toString),
"is_after" -> (impression_proto.getIsAfter().toString),
"type" -> (impression_proto.getType().toString)
)
return json
}catch{
case e:Exception=>
val pw = new PrintWriter(new File("/home/data/log.log" ))
pw.write(e.toString)
pw.close()
return Map("error" -> "error")
}
}
def main(args: Array[String]){
val proto_impressions_udf = udf(message_proto _)
val spark = SparkSession.builder.appName("Structured Streaming ").getOrCreate()
//fetchOffset.numRetries, fetchOffset.retryIntervalMs
val stream = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "ip:9092")
.option("subscribe", "ssp.impressions")
.option("startingOffsets", "latest")
.option("max.poll.records", "1000000")
.option("auto.commit.interval.ms", "100000")
.option("session.timeout.ms", "10000")
.option("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
.option("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
.option("failOnDataLoss", "false")
.option("latestFirst", "true")
.load()
try{
val query = stream.select(col("value").cast("string"))
.select(proto_impressions_udf(col("value")) as "value_udf")
.select(col("value_udf")("id_req").as("id_req"), col("value_udf")("is_after").as("is_after"),
date_format(col("value_udf")("ts_request"), "yyyy").as("date").as("year"),
date_format(col("value_udf")("ts_request"), "MM").as("date").as("month"),
date_format(col("value_udf")("ts_request"), "dd").as("date").as("day"),
date_format(col("value_udf")("ts_request"), "HH").as("date").as("hour"))
val query2 = query.writeStream.format("parquet")
.option("checkpointLocation", "/home/data/impressions/checkpoint")
.option("path", "/home/data/impressions")
.outputMode(OutputMode.Append())
.partitionBy("year", "month", "day", "hour")
.trigger(Trigger.ProcessingTime("1 seconds"))
.start()
}catch{
case e:Exception=>
val pw = new PrintWriter(new File("/home/data/log.log" ))
pw.write(e.toString)
pw.close()
}
}
}
I attached others images from the Spark UI:
Your problem is related to the batches, you need to define a good time for processing each batch, and that depends on your cluster processing power. Also, the time for solve each batch depends whether you are receiving all the fields without null because if you receive a lot of fields on null the process will take less time to process the batch.

How to display the records from Kafka to console?

I'm learning Structured Streaming and I was not able to display the output to my console.
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types._
import org.apache.spark.sql.streaming.ProcessingTime
object kafka_stream {
def main(args: Array[String]): Unit = {
val spark = SparkSession
.builder()
.appName("kafka-consumer")
.master("local[*]")
.getOrCreate()
import spark.implicits._
spark.sparkContext.setLogLevel("WARN")
// val schema = StructType().add("a", IntegerType()).add("b", StringType())
val schema = StructType(Seq(
StructField("a", IntegerType, true),
StructField("b", StringType, true)
))
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "172.21.0.187:9093")
.option("subscribe", "test")
.option("startingOffsets", "earliest")
.load()
val values = df.selectExpr("CAST(value AS STRING)").as[String]
values.writeStream
.outputMode("append")
.format("console")
.start()
.awaitTermination()
}
}
My input to Kafka
my name is abc how are you ?
I just want to display strings from Kafka to spark console

Flattening JSON into Tabular Structure using Spark-Scala RDD only fucntion

I have nested JSON and like to have output in tabular structure. I am able to parse the JSON values individually , but having some problems in tabularizing it. I am able to do it via dataframe easily. But I want do it using "RDD ONLY " functions. Any help much appreciated.
Input JSON:
{ "level":{"productReference":{
"prodID":"1234",
"unitOfMeasure":"EA"
},
"states":[
{
"state":"SELL",
"effectiveDateTime":"2015-10-09T00:55:23.6345Z",
"stockQuantity":{
"quantity":1400.0,
"stockKeepingLevel":"A"
}
},
{
"state":"HELD",
"effectiveDateTime":"2015-10-09T00:55:23.6345Z",
"stockQuantity":{
"quantity":800.0,
"stockKeepingLevel":"B"
}
}
] }}
Expected Output:
I tried Below Spark code . But getting output like this and Row() object is not able to parse this.
079562193,EA,List(SELLABLE, HELD),List(2015-10-09T00:55:23.6345Z, 2015-10-09T00:55:23.6345Z),List(1400.0, 800.0),List(SINGLE, SINGLE)
def main(Args : Array[String]): Unit = {
val conf = new SparkConf().setAppName("JSON Read and Write using Spark RDD").setMaster("local[1]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val salesSchema = StructType(Array(
StructField("prodID", StringType, true),
StructField("unitOfMeasure", StringType, true),
StructField("state", StringType, true),
StructField("effectiveDateTime", StringType, true),
StructField("quantity", StringType, true),
StructField("stockKeepingLevel", StringType, true)
))
val ReadAlljsonMessageInFile_RDD = sc.textFile("product_rdd.json")
val x = ReadAlljsonMessageInFile_RDD.map(eachJsonMessages => {
parse(eachJsonMessages)
}).map(insideEachJson=>{
implicit val formats = org.json4s.DefaultFormats
val prodID = (insideEachJson\ "level" \"productReference" \"TPNB").extract[String].toString
val unitOfMeasure = (insideEachJson\ "level" \ "productReference" \"unitOfMeasure").extract[String].toString
val state= (insideEachJson \ "level" \"states").extract[List[JValue]].
map(x=>(x\"state").extract[String]).toString()
val effectiveDateTime= (insideEachJson \ "level" \"states").extract[List[JValue]].
map(x=>(x\"effectiveDateTime").extract[String]).toString
val quantity= (insideEachJson \ "level" \"states").extract[List[JValue]].
map(x=>(x\"stockQuantity").extract[JValue]).map(x=>(x\"quantity").extract[Double]).
toString
val stockKeepingLevel= (insideEachJson \ "level" \"states").extract[List[JValue]].
map(x=>(x\"stockQuantity").extract[JValue]).map(x=>(x\"stockKeepingLevel").extract[String]).
toString
//Row(prodID,unitOfMeasure,state,effectiveDateTime,quantity,stockKeepingLevel)
println(prodID,unitOfMeasure,state,effectiveDateTime,quantity,stockKeepingLevel)
}).collect()
// sqlContext.createDataFrame(x,salesSchema).show(truncate = false)
}
HI below is the "DATAFRAME" ONLY Solution which I developed. Looking for complete "RDD ONLY" solution
def main (Args : Array[String]):Unit = {
val conf = new SparkConf().setAppName("JSON Read and Write using Spark DataFrame few more options").setMaster("local[1]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val sourceJsonDF = sqlContext.read.json("product.json")
val jsonFlatDF_level = sourceJsonDF.withColumn("explode_states",explode($"level.states"))
.withColumn("explode_link",explode($"level._link"))
.select($"level.productReference.TPNB".as("TPNB"),
$"level.productReference.unitOfMeasure".as("level_unitOfMeasure"),
$"level.locationReference.location".as("level_location"),
$"level.locationReference.type".as("level_type"),
$"explode_states.state".as("level_state"),
$"explode_states.effectiveDateTime".as("level_effectiveDateTime"),
$"explode_states.stockQuantity.quantity".as("level_quantity"),
$"explode_states.stockQuantity.stockKeepingLevel".as("level_stockKeepingLevel"),
$"explode_link.rel".as("level_rel"),
$"explode_link.href".as("level_href"),
$"explode_link.method".as("level_method"))
jsonFlatDF_oldLevel.show()
}
DataFrame and DataSet are much more optimized than rdd and there are a lot of options to try with to reach to the solution we desire.
In my opinion, DataFrame is developed to make the developers comfortable viewing data in tabular form so that logics can be implemented with ease. So I always suggest users to use dataframe or dataset.
Talking much less, I am posting you the solution below using dataframe. Once you have a dataframe, switching to rdd is very easy.
Your desired solution is below (you will have to find a way to read json file as its done with json string below : thats an assignment for you :) good luck)
import org.apache.spark.sql.functions._
val json = """ { "level":{"productReference":{
"prodID":"1234",
"unitOfMeasure":"EA"
},
"states":[
{
"state":"SELL",
"effectiveDateTime":"2015-10-09T00:55:23.6345Z",
"stockQuantity":{
"quantity":1400.0,
"stockKeepingLevel":"A"
}
},
{
"state":"HELD",
"effectiveDateTime":"2015-10-09T00:55:23.6345Z",
"stockQuantity":{
"quantity":800.0,
"stockKeepingLevel":"B"
}
}
] }}"""
val rddJson = sparkContext.parallelize(Seq(json))
var df = sqlContext.read.json(rddJson)
df = df.withColumn("prodID", df("level.productReference.prodID"))
.withColumn("unitOfMeasure", df("level.productReference.unitOfMeasure"))
.withColumn("states", explode(df("level.states")))
.drop("level")
df = df.withColumn("state", df("states.state"))
.withColumn("effectiveDateTime", df("states.effectiveDateTime"))
.withColumn("quantity", df("states.stockQuantity.quantity"))
.withColumn("stockKeepingLevel", df("states.stockQuantity.stockKeepingLevel"))
.drop("states")
df.show(false)
This will give out put as
+------+-------------+-----+-------------------------+--------+-----------------+
|prodID|unitOfMeasure|state|effectiveDateTime |quantity|stockKeepingLevel|
+------+-------------+-----+-------------------------+--------+-----------------+
|1234 |EA |SELL |2015-10-09T00:55:23.6345Z|1400.0 |A |
|1234 |EA |HELD |2015-10-09T00:55:23.6345Z|800.0 |B |
+------+-------------+-----+-------------------------+--------+-----------------+
Now that you have desired output as dataframe converting to rdd is just calling .rdd
df.rdd.foreach(println)
will give output as below
[1234,EA,SELL,2015-10-09T00:55:23.6345Z,1400.0,A]
[1234,EA,HELD,2015-10-09T00:55:23.6345Z,800.0,B]
I hope this is helpful
There are 2 versions of solutions to your question.
Version 1:
def main(Args : Array[String]): Unit = {
val conf = new SparkConf().setAppName("JSON Read and Write using Spark RDD").setMaster("local[1]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val salesSchema = StructType(Array(
StructField("prodID", StringType, true),
StructField("unitOfMeasure", StringType, true),
StructField("state", StringType, true),
StructField("effectiveDateTime", StringType, true),
StructField("quantity", StringType, true),
StructField("stockKeepingLevel", StringType, true)
))
val ReadAlljsonMessageInFile_RDD = sc.textFile("product_rdd.json")
val x = ReadAlljsonMessageInFile_RDD.map(eachJsonMessages => {
parse(eachJsonMessages)
}).map(insideEachJson=>{
implicit val formats = org.json4s.DefaultFormats
val prodID = (insideEachJson\ "level" \"productReference" \"prodID").extract[String].toString
val unitOfMeasure = (insideEachJson\ "level" \ "productReference" \"unitOfMeasure").extract[String].toString
val state= (insideEachJson \ "level" \"states").extract[List[JValue]].
map(x=>(x\"state").extract[String]).toString()
val effectiveDateTime= (insideEachJson \ "level" \"states").extract[List[JValue]].
map(x=>(x\"effectiveDateTime").extract[String]).toString
val quantity= (insideEachJson \ "level" \"states").extract[List[JValue]].
map(x=>(x\"stockQuantity").extract[JValue]).map(x=>(x\"quantity").extract[Double]).
toString
val stockKeepingLevel= (insideEachJson \ "level" \"states").extract[List[JValue]].
map(x=>(x\"stockQuantity").extract[JValue]).map(x=>(x\"stockKeepingLevel").extract[String]).
toString
Row(prodID,unitOfMeasure,state,effectiveDateTime,quantity,stockKeepingLevel)
})
sqlContext.createDataFrame(x,salesSchema).show(truncate = false)
}
This would give you following output:
+------+-------------+----------------+----------------------------------------------------------+-------------------+-----------------+
|prodID|unitOfMeasure|state |effectiveDateTime |quantity |stockKeepingLevel|
+------+-------------+----------------+----------------------------------------------------------+-------------------+-----------------+
|1234 |EA |List(SELL, HELD)|List(2015-10-09T00:55:23.6345Z, 2015-10-09T00:55:23.6345Z)|List(1400.0, 800.0)|List(A, B) |
+------+-------------+----------------+----------------------------------------------------------+-------------------+-----------------+
Version 2:
def main(Args : Array[String]): Unit = {
val conf = new SparkConf().setAppName("JSON Read and Write using Spark RDD").setMaster("local[1]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val salesSchema = StructType(Array(
StructField("prodID", StringType, true),
StructField("unitOfMeasure", StringType, true),
StructField("state", ArrayType(StringType, true), true),
StructField("effectiveDateTime", ArrayType(StringType, true), true),
StructField("quantity", ArrayType(DoubleType, true), true),
StructField("stockKeepingLevel", ArrayType(StringType, true), true)
))
val ReadAlljsonMessageInFile_RDD = sc.textFile("product_rdd.json")
val x = ReadAlljsonMessageInFile_RDD.map(eachJsonMessages => {
parse(eachJsonMessages)
}).map(insideEachJson=>{
implicit val formats = org.json4s.DefaultFormats
val prodID = (insideEachJson\ "level" \"productReference" \"prodID").extract[String].toString
val unitOfMeasure = (insideEachJson\ "level" \ "productReference" \"unitOfMeasure").extract[String].toString
val state= (insideEachJson \ "level" \"states").extract[List[JValue]].
map(x=>(x\"state").extract[String])
val effectiveDateTime= (insideEachJson \ "level" \"states").extract[List[JValue]].
map(x=>(x\"effectiveDateTime").extract[String])
val quantity= (insideEachJson \ "level" \"states").extract[List[JValue]].
map(x=>(x\"stockQuantity").extract[JValue]).map(x=>(x\"quantity").extract[Double])
val stockKeepingLevel= (insideEachJson \ "level" \"states").extract[List[JValue]].
map(x=>(x\"stockQuantity").extract[JValue]).map(x=>(x\"stockKeepingLevel").extract[String])
Row(prodID,unitOfMeasure,state,effectiveDateTime,quantity,stockKeepingLevel)
})
sqlContext.createDataFrame(x,salesSchema).show(truncate = false)
}
This would give you following output:
+------+-------------+------------+------------------------------------------------------+---------------+-----------------+
|prodID|unitOfMeasure|state |effectiveDateTime |quantity |stockKeepingLevel|
+------+-------------+------------+------------------------------------------------------+---------------+-----------------+
|1234 |EA |[SELL, HELD]|[2015-10-09T00:55:23.6345Z, 2015-10-09T00:55:23.6345Z]|[1400.0, 800.0]|[A, B] |
+------+-------------+------------+------------------------------------------------------+---------------+-----------------+
The difference between Version 1 & 2 is of schema. In Version 1 you are casting every column into String whereas in Version 2 they are being casted into Array.