I'm reading parquet files, converting them into JSON format, and then sending the JSON to Kafka. The problem is that the whole parquet file is read and sent to Kafka in one go, but I want to send the JSON data line by line or in batches:
object WriteParquet2Kafka {
def main(args: Array[String]): Unit = {
val spark: SparkSession = SparkSession
.builder
.master("yarn")
.appName("Write Parquet to Kafka")
.getOrCreate()
import spark.implicits._
val ds: DataFrame = spark.readStream
.schema(parquet-schema)
.parquet(path-to-parquet-file)
val df: DataFrame = ds.select($"vin" as "key", to_json( struct( ds.columns.map(col(_)):_* ) ) as "value" )
.filter($"key" isNotNull)
val ddf = df
.writeStream
.format("kafka")
.option("topic", topics)
.option("kafka.bootstrap.servers", "localhost:9092")
.option("checkpointLocation", "/tmp/test")
.trigger(Trigger.ProcessingTime("10 seconds"))
.start()
ddf.awaitTermination()
}
}
Is it possible to do this?
I finally figured out how to solve my question: just add an option and set a suitable number for maxFilesPerTrigger:
val df: DataFrame = spark
.readStream
.option("maxFilesPerTrigger", 1)
.schema(parquetSchema)
.parquet(parquetUri)
Note: maxFilesPerTrigger must be set to 1 so that every parquet file is read in its own micro-batch.
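For completeness, here is a minimal end-to-end sketch that combines this option with the Kafka sink from the question. It assumes the same spark session and the parquetSchema, parquetUri, and topics placeholders used above:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, struct, to_json}
import org.apache.spark.sql.streaming.Trigger
// Read one parquet file per micro-batch instead of everything at once
val df: DataFrame = spark.readStream
  .option("maxFilesPerTrigger", 1)
  .schema(parquetSchema)
  .parquet(parquetUri)
// Each row becomes one Kafka record: key = vin, value = the row serialized as JSON
val query = df
  .select(col("vin") as "key", to_json(struct(df.columns.map(c => col(c)): _*)) as "value")
  .filter(col("key").isNotNull)
  .writeStream
  .format("kafka")
  .option("topic", topics)
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("checkpointLocation", "/tmp/test")
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .start()
query.awaitTermination()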
I am trying to create a SparkConsumer so I can send messages, in this case a CSV file, to Kafka through Spark Streaming. But I get an error that 'path' is not specified.
My code is as follows:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.streaming.FileStreamSource.Timestamp
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.streaming.OutputMode
object sparkConsumer extends App {
val conf = new SparkConf().setMaster("local").setAppName("Name")
val sc = new SparkContext(conf)
val rootLogger = Logger.getRootLogger()
rootLogger.setLevel(Level.ERROR)
val spark = SparkSession
.builder()
.appName("Spark-Kafka-Integration")
.master("local")
.getOrCreate()
val schema = StructType(Array(
StructField("InvoiceNo", StringType, nullable = true),
StructField("StockCode", StringType, nullable = true),
StructField("Description", StringType, nullable = true),
StructField("Quantity", StringType, nullable = true)
))
val streamingDataFrame = spark.readStream.schema(schema).csv("C:/Users/me/Desktop/Tasks/Tasks1/test.csv")
streamingDataFrame.selectExpr("CAST(InvoiceNo AS STRING) AS key", "to_json(struct(*)) AS value").
writeStream
.format("csv")
.option("topic", "topic_test")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("checkpointLocation", "C:/Users/me/IdeaProjects/SparkStreaming/checkpointLocation/")
.start()
import spark.implicits._
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "topic_test")
.load()
val df1 = df.selectExpr("CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP)").as[(String, Timestamp)]
.select(from_json($"value", schema).as("data"), $"timestamp")
.select("data.*", "timestamp")
df1.writeStream
.format("console")
.option("truncate","false")
.outputMode(OutputMode.Append)
.start()
.awaitTermination()
}
I get the following error:
Exception in thread "main" java.lang.IllegalArgumentException: 'path' is not specified
Does anyone know what I am missing?
It seems that it can be a problem on this part of your code:
streamingDataFrame.selectExpr("CAST(InvoiceNo AS STRING) AS key", "to_json(struct(*)) AS value").
writeStream
.format("csv")
.option("topic", "topic_test")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("checkpointLocation", "C:/Users/me/IdeaProjects/SparkStreaming/checkpointLocation/")
.start()
because you use a "csv" format but you don't set the file location that it needs. Instead you configure Kafka properties to use a Kafka topic as your sink. So if you change the format to "kafka" it should work.
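For clarity, a sketch of what the corrected sink could look like, reusing the topic, brokers, and checkpoint location from the question:
streamingDataFrame
  .selectExpr("CAST(InvoiceNo AS STRING) AS key", "to_json(struct(*)) AS value")
  .writeStream
  .format("kafka") // Kafka sink instead of the csv file sink, so no 'path' is required
  .option("topic", "topic_test")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("checkpointLocation", "C:/Users/me/IdeaProjects/SparkStreaming/checkpointLocation/")
  .start()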
Another problem you may run into when using csv as a source is that your path should be a directory, not a file. In your case, if you create a directory and move your csv file into it, it will work.
Just for testing, create a directory named C:/Users/me/Desktop/Tasks/Tasks1/test.csv and create a file named part-0000.csv inside it. Then put your csv content in this new file and start the process again.
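The readStream call itself stays the same; only the filesystem layout changes, since the path now points at a directory that Spark watches for csv part files:
// Same call as in the question; "test.csv" is now a directory containing part-0000.csv
val streamingDataFrame = spark.readStream
  .schema(schema)
  .csv("C:/Users/me/Desktop/Tasks/Tasks1/test.csv")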
I'm using Spark 2.3.0, Scala 2.11.8, and Kafka, and I'm trying to write all the messages from Kafka into parquet files with Structured Streaming. However, for each query my implementation runs, the total time increases a lot (see the Spark Stages image).
I would like to know why this happens. I tried different possible triggers (Continuous, 0 seconds, 1 second, 10 seconds, 10 minutes, etc.) and I always get the same behavior. My code has this structure:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{Column, SparkSession}
import com.name.proto.ProtoMessages
import java.io._
import java.text.{DateFormat, SimpleDateFormat}
import java.util.Date
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.streaming.OutputMode
object StructuredStreaming {
def message_proto(value:Array[Byte]): Map[String, String] = {
try {
val dateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
val impression_proto = ProtoMessages.TrackingRequest.parseFrom(value)
val json = Map(
"id_req" -> (impression_proto.getIdReq().toString),
"ts_imp_request" -> (impression_proto.getTsRequest().toString),
"is_after" -> (impression_proto.getIsAfter().toString),
"type" -> (impression_proto.getType().toString)
)
return json
}catch{
case e:Exception=>
val pw = new PrintWriter(new File("/home/data/log.log" ))
pw.write(e.toString)
pw.close()
return Map("error" -> "error")
}
}
def main(args: Array[String]){
val proto_impressions_udf = udf(message_proto _)
val spark = SparkSession.builder.appName("Structured Streaming ").getOrCreate()
//fetchOffset.numRetries, fetchOffset.retryIntervalMs
val stream = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "ip:9092")
.option("subscribe", "ssp.impressions")
.option("startingOffsets", "latest")
.option("max.poll.records", "1000000")
.option("auto.commit.interval.ms", "100000")
.option("session.timeout.ms", "10000")
.option("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
.option("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
.option("failOnDataLoss", "false")
.option("latestFirst", "true")
.load()
try{
val query = stream.select(col("value").cast("string"))
.select(proto_impressions_udf(col("value")) as "value_udf")
.select(col("value_udf")("id_req").as("id_req"), col("value_udf")("is_after").as("is_after"),
date_format(col("value_udf")("ts_request"), "yyyy").as("date").as("year"),
date_format(col("value_udf")("ts_request"), "MM").as("date").as("month"),
date_format(col("value_udf")("ts_request"), "dd").as("date").as("day"),
date_format(col("value_udf")("ts_request"), "HH").as("date").as("hour"))
val query2 = query.writeStream.format("parquet")
.option("checkpointLocation", "/home/data/impressions/checkpoint")
.option("path", "/home/data/impressions")
.outputMode(OutputMode.Append())
.partitionBy("year", "month", "day", "hour")
.trigger(Trigger.ProcessingTime("1 seconds"))
.start()
}catch{
case e:Exception=>
val pw = new PrintWriter(new File("/home/data/log.log" ))
pw.write(e.toString)
pw.close()
}
}
}
I attached other images from the Spark UI:
Your problem is related to the batches: you need to define a good trigger interval for processing each batch, and that depends on your cluster's processing power. Also, the time to process each batch depends on whether you are receiving all the fields without nulls, because if you receive a lot of null fields the process will take less time to process the batch.
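As an illustration only (not part of the original answer), one way to keep batch size and cadence predictable is to cap how many Kafka offsets each micro-batch consumes with the source option maxOffsetsPerTrigger and to pick a trigger interval the cluster can sustain. The numbers below are hypothetical tuning values, and a SparkSession named spark is assumed:
import org.apache.spark.sql.streaming.Trigger
val stream = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "ip:9092")
  .option("subscribe", "ssp.impressions")
  .option("startingOffsets", "latest")
  .option("maxOffsetsPerTrigger", "200000") // cap records pulled per micro-batch (tune to your load)
  .load()
val query = stream.writeStream.format("parquet")
  .option("checkpointLocation", "/home/data/impressions/checkpoint")
  .option("path", "/home/data/impressions")
  .trigger(Trigger.ProcessingTime("30 seconds")) // an interval the cluster can actually keep up with
  .start()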
I'm learning Structured Streaming and I'm not able to display the output in my console.
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types._
import org.apache.spark.sql.streaming.ProcessingTime
object kafka_stream {
def main(args: Array[String]): Unit = {
val spark = SparkSession
.builder()
.appName("kafka-consumer")
.master("local[*]")
.getOrCreate()
import spark.implicits._
spark.sparkContext.setLogLevel("WARN")
// val schema = StructType().add("a", IntegerType()).add("b", StringType())
val schema = StructType(Seq(
StructField("a", IntegerType, true),
StructField("b", StringType, true)
))
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "172.21.0.187:9093")
.option("subscribe", "test")
.option("startingOffsets", "earliest")
.load()
val values = df.selectExpr("CAST(value AS STRING)").as[String]
values.writeStream
.outputMode("append")
.format("console")
.start()
.awaitTermination()
}
}
My input to Kafka
my name is abc how are you ?
I just want to display the strings from Kafka on the Spark console.
I have nested JSON and would like to have output in a tabular structure. I am able to parse the JSON values individually, but I am having some problems tabularizing it. I can do it easily via DataFrame, but I want to do it using "RDD ONLY" functions. Any help is much appreciated.
Input JSON:
{ "level":{"productReference":{
"prodID":"1234",
"unitOfMeasure":"EA"
},
"states":[
{
"state":"SELL",
"effectiveDateTime":"2015-10-09T00:55:23.6345Z",
"stockQuantity":{
"quantity":1400.0,
"stockKeepingLevel":"A"
}
},
{
"state":"HELD",
"effectiveDateTime":"2015-10-09T00:55:23.6345Z",
"stockQuantity":{
"quantity":800.0,
"stockKeepingLevel":"B"
}
}
] }}
Expected Output:
I tried the Spark code below, but I'm getting output like the following, and the Row() object is not able to parse it:
079562193,EA,List(SELLABLE, HELD),List(2015-10-09T00:55:23.6345Z, 2015-10-09T00:55:23.6345Z),List(1400.0, 800.0),List(SINGLE, SINGLE)
def main(Args : Array[String]): Unit = {
val conf = new SparkConf().setAppName("JSON Read and Write using Spark RDD").setMaster("local[1]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val salesSchema = StructType(Array(
StructField("prodID", StringType, true),
StructField("unitOfMeasure", StringType, true),
StructField("state", StringType, true),
StructField("effectiveDateTime", StringType, true),
StructField("quantity", StringType, true),
StructField("stockKeepingLevel", StringType, true)
))
val ReadAlljsonMessageInFile_RDD = sc.textFile("product_rdd.json")
val x = ReadAlljsonMessageInFile_RDD.map(eachJsonMessages => {
parse(eachJsonMessages)
}).map(insideEachJson=>{
implicit val formats = org.json4s.DefaultFormats
val prodID = (insideEachJson\ "level" \"productReference" \"TPNB").extract[String].toString
val unitOfMeasure = (insideEachJson\ "level" \ "productReference" \"unitOfMeasure").extract[String].toString
val state= (insideEachJson \ "level" \"states").extract[List[JValue]].
map(x=>(x\"state").extract[String]).toString()
val effectiveDateTime= (insideEachJson \ "level" \"states").extract[List[JValue]].
map(x=>(x\"effectiveDateTime").extract[String]).toString
val quantity= (insideEachJson \ "level" \"states").extract[List[JValue]].
map(x=>(x\"stockQuantity").extract[JValue]).map(x=>(x\"quantity").extract[Double]).
toString
val stockKeepingLevel= (insideEachJson \ "level" \"states").extract[List[JValue]].
map(x=>(x\"stockQuantity").extract[JValue]).map(x=>(x\"stockKeepingLevel").extract[String]).
toString
//Row(prodID,unitOfMeasure,state,effectiveDateTime,quantity,stockKeepingLevel)
println(prodID,unitOfMeasure,state,effectiveDateTime,quantity,stockKeepingLevel)
}).collect()
// sqlContext.createDataFrame(x,salesSchema).show(truncate = false)
}
Hi, below is the "DATAFRAME" only solution which I developed. I am looking for a complete "RDD ONLY" solution.
def main (Args : Array[String]):Unit = {
val conf = new SparkConf().setAppName("JSON Read and Write using Spark DataFrame few more options").setMaster("local[1]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val sourceJsonDF = sqlContext.read.json("product.json")
val jsonFlatDF_level = sourceJsonDF.withColumn("explode_states",explode($"level.states"))
.withColumn("explode_link",explode($"level._link"))
.select($"level.productReference.TPNB".as("TPNB"),
$"level.productReference.unitOfMeasure".as("level_unitOfMeasure"),
$"level.locationReference.location".as("level_location"),
$"level.locationReference.type".as("level_type"),
$"explode_states.state".as("level_state"),
$"explode_states.effectiveDateTime".as("level_effectiveDateTime"),
$"explode_states.stockQuantity.quantity".as("level_quantity"),
$"explode_states.stockQuantity.stockKeepingLevel".as("level_stockKeepingLevel"),
$"explode_link.rel".as("level_rel"),
$"explode_link.href".as("level_href"),
$"explode_link.method".as("level_method"))
jsonFlatDF_level.show()
}
DataFrame and Dataset are much more optimized than RDD, and there are a lot of options to try to reach the solution we desire.
In my opinion, DataFrame was developed to make developers comfortable viewing data in tabular form so that logic can be implemented with ease. So I always suggest users work with DataFrames or Datasets.
Without further ado, I am posting the solution below using DataFrame. Once you have a DataFrame, switching to an RDD is very easy.
Your desired solution is below (you will have to find a way to read the JSON file the same way it is done with the JSON string below: that's an assignment for you :) good luck).
import org.apache.spark.sql.functions._
val json = """ { "level":{"productReference":{
"prodID":"1234",
"unitOfMeasure":"EA"
},
"states":[
{
"state":"SELL",
"effectiveDateTime":"2015-10-09T00:55:23.6345Z",
"stockQuantity":{
"quantity":1400.0,
"stockKeepingLevel":"A"
}
},
{
"state":"HELD",
"effectiveDateTime":"2015-10-09T00:55:23.6345Z",
"stockQuantity":{
"quantity":800.0,
"stockKeepingLevel":"B"
}
}
] }}"""
val rddJson = sparkContext.parallelize(Seq(json))
var df = sqlContext.read.json(rddJson)
df = df.withColumn("prodID", df("level.productReference.prodID"))
.withColumn("unitOfMeasure", df("level.productReference.unitOfMeasure"))
.withColumn("states", explode(df("level.states")))
.drop("level")
df = df.withColumn("state", df("states.state"))
.withColumn("effectiveDateTime", df("states.effectiveDateTime"))
.withColumn("quantity", df("states.stockQuantity.quantity"))
.withColumn("stockKeepingLevel", df("states.stockQuantity.stockKeepingLevel"))
.drop("states")
df.show(false)
This will give output as:
+------+-------------+-----+-------------------------+--------+-----------------+
|prodID|unitOfMeasure|state|effectiveDateTime |quantity|stockKeepingLevel|
+------+-------------+-----+-------------------------+--------+-----------------+
|1234 |EA |SELL |2015-10-09T00:55:23.6345Z|1400.0 |A |
|1234 |EA |HELD |2015-10-09T00:55:23.6345Z|800.0 |B |
+------+-------------+-----+-------------------------+--------+-----------------+
Now that you have the desired output as a DataFrame, converting to an RDD is just a matter of calling .rdd:
df.rdd.foreach(println)
will give output as below
[1234,EA,SELL,2015-10-09T00:55:23.6345Z,1400.0,A]
[1234,EA,HELD,2015-10-09T00:55:23.6345Z,800.0,B]
I hope this is helpful
There are two versions of the solution to your question.
Version 1:
def main(Args : Array[String]): Unit = {
val conf = new SparkConf().setAppName("JSON Read and Write using Spark RDD").setMaster("local[1]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val salesSchema = StructType(Array(
StructField("prodID", StringType, true),
StructField("unitOfMeasure", StringType, true),
StructField("state", StringType, true),
StructField("effectiveDateTime", StringType, true),
StructField("quantity", StringType, true),
StructField("stockKeepingLevel", StringType, true)
))
val ReadAlljsonMessageInFile_RDD = sc.textFile("product_rdd.json")
val x = ReadAlljsonMessageInFile_RDD.map(eachJsonMessages => {
parse(eachJsonMessages)
}).map(insideEachJson=>{
implicit val formats = org.json4s.DefaultFormats
val prodID = (insideEachJson\ "level" \"productReference" \"prodID").extract[String].toString
val unitOfMeasure = (insideEachJson\ "level" \ "productReference" \"unitOfMeasure").extract[String].toString
val state= (insideEachJson \ "level" \"states").extract[List[JValue]].
map(x=>(x\"state").extract[String]).toString()
val effectiveDateTime= (insideEachJson \ "level" \"states").extract[List[JValue]].
map(x=>(x\"effectiveDateTime").extract[String]).toString
val quantity= (insideEachJson \ "level" \"states").extract[List[JValue]].
map(x=>(x\"stockQuantity").extract[JValue]).map(x=>(x\"quantity").extract[Double]).
toString
val stockKeepingLevel= (insideEachJson \ "level" \"states").extract[List[JValue]].
map(x=>(x\"stockQuantity").extract[JValue]).map(x=>(x\"stockKeepingLevel").extract[String]).
toString
Row(prodID,unitOfMeasure,state,effectiveDateTime,quantity,stockKeepingLevel)
})
sqlContext.createDataFrame(x,salesSchema).show(truncate = false)
}
This would give you the following output:
+------+-------------+----------------+----------------------------------------------------------+-------------------+-----------------+
|prodID|unitOfMeasure|state |effectiveDateTime |quantity |stockKeepingLevel|
+------+-------------+----------------+----------------------------------------------------------+-------------------+-----------------+
|1234 |EA |List(SELL, HELD)|List(2015-10-09T00:55:23.6345Z, 2015-10-09T00:55:23.6345Z)|List(1400.0, 800.0)|List(A, B) |
+------+-------------+----------------+----------------------------------------------------------+-------------------+-----------------+
Version 2:
def main(Args : Array[String]): Unit = {
val conf = new SparkConf().setAppName("JSON Read and Write using Spark RDD").setMaster("local[1]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val salesSchema = StructType(Array(
StructField("prodID", StringType, true),
StructField("unitOfMeasure", StringType, true),
StructField("state", ArrayType(StringType, true), true),
StructField("effectiveDateTime", ArrayType(StringType, true), true),
StructField("quantity", ArrayType(DoubleType, true), true),
StructField("stockKeepingLevel", ArrayType(StringType, true), true)
))
val ReadAlljsonMessageInFile_RDD = sc.textFile("product_rdd.json")
val x = ReadAlljsonMessageInFile_RDD.map(eachJsonMessages => {
parse(eachJsonMessages)
}).map(insideEachJson=>{
implicit val formats = org.json4s.DefaultFormats
val prodID = (insideEachJson\ "level" \"productReference" \"prodID").extract[String].toString
val unitOfMeasure = (insideEachJson\ "level" \ "productReference" \"unitOfMeasure").extract[String].toString
val state= (insideEachJson \ "level" \"states").extract[List[JValue]].
map(x=>(x\"state").extract[String])
val effectiveDateTime= (insideEachJson \ "level" \"states").extract[List[JValue]].
map(x=>(x\"effectiveDateTime").extract[String])
val quantity= (insideEachJson \ "level" \"states").extract[List[JValue]].
map(x=>(x\"stockQuantity").extract[JValue]).map(x=>(x\"quantity").extract[Double])
val stockKeepingLevel= (insideEachJson \ "level" \"states").extract[List[JValue]].
map(x=>(x\"stockQuantity").extract[JValue]).map(x=>(x\"stockKeepingLevel").extract[String])
Row(prodID,unitOfMeasure,state,effectiveDateTime,quantity,stockKeepingLevel)
})
sqlContext.createDataFrame(x,salesSchema).show(truncate = false)
}
This would give you the following output:
+------+-------------+------------+------------------------------------------------------+---------------+-----------------+
|prodID|unitOfMeasure|state |effectiveDateTime |quantity |stockKeepingLevel|
+------+-------------+------------+------------------------------------------------------+---------------+-----------------+
|1234 |EA |[SELL, HELD]|[2015-10-09T00:55:23.6345Z, 2015-10-09T00:55:23.6345Z]|[1400.0, 800.0]|[A, B] |
+------+-------------+------------+------------------------------------------------------+---------------+-----------------+
The difference between Version 1 and Version 2 is the schema. In Version 1 every column is cast to a String, whereas in Version 2 the list-valued columns are cast to Array types.
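Not part of the original answer, but if the goal is the fully tabular expected output (one row per state), a minimal RDD-only sketch could reuse Version 1's flat salesSchema and json4s parsing and flatMap over the states array:
// Illustrative only: one Row per state, in the same RDD-only style as Version 1.
// Assumes the same imports (json4s, Spark SQL Row/types), ReadAlljsonMessageInFile_RDD,
// sqlContext, and the flat salesSchema defined in Version 1.
val rows = ReadAlljsonMessageInFile_RDD.map(parse(_)).flatMap { insideEachJson =>
  implicit val formats = org.json4s.DefaultFormats
  val prodID = (insideEachJson \ "level" \ "productReference" \ "prodID").extract[String]
  val unitOfMeasure = (insideEachJson \ "level" \ "productReference" \ "unitOfMeasure").extract[String]
  // Emit one Row per element of the "states" array instead of collapsing the lists into strings
  (insideEachJson \ "level" \ "states").extract[List[JValue]].map { st =>
    Row(
      prodID,
      unitOfMeasure,
      (st \ "state").extract[String],
      (st \ "effectiveDateTime").extract[String],
      (st \ "stockQuantity" \ "quantity").extract[Double].toString,
      (st \ "stockQuantity" \ "stockKeepingLevel").extract[String]
    )
  }
}
sqlContext.createDataFrame(rows, salesSchema).show(truncate = false)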