Convert Spark RDD[Row] to DataFrame - Scala

I am having trouble transforming JSON into a DataFrame.
I am using Spark in a project that synchronizes tables into a data lake (Hudi) from a CDC tool (Canal) listening to the MySQL binlog. I receive JSON describing row changes and add some fields to it. This JSON stream contains multiple schemas: each schema has different columns, and new columns may be added in the future. So I build a GenericRowWithSchema for each JSON message and pass an individual schema for each row.
Now I need to transform the RDD[Row] into a DataFrame so I can write it to Hudi. How can I do that?
object code {
  def main(args: Array[String]): Unit = {
    val sss = SparkSession.builder().appName("SparkHudi").getOrCreate()
    //val sc = SparkContext.getOrCreate
    val sc = sss.sparkContext
    val ssc = new StreamingContext(sc, Seconds(1))
    //ssc.sparkContext.setLogLevel("INFO");

    import org.apache.kafka.common.serialization.StringDeserializer
    val kafkaParams = Map[String, Object](
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "kafka.test.com:9092",
      ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
      ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
      ConsumerConfig.GROUP_ID_CONFIG -> "group-88",
      ConsumerConfig.AUTO_OFFSET_RESET_CONFIG -> "latest",
      ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG -> Boolean.box(true)
    )

    val topics = Array("test")
    val kafkaDirectStream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](topics, kafkaParams)
    )
    // Cache your RDD before you perform any heavyweight operations.
    kafkaDirectStream.start()

    val saveRdd = kafkaDirectStream.map(x => {
      // receive json from kafka
      val jsonObject = JSON.parse(x.value()).asInstanceOf[JSONObject]
      jsonObject
    }).map(json => {
      /* some json field operations (elided); dataJson and sqlTypeJson are built from the json object here */
      val keySets = dataJson.keySet()
      val dataArray: ArrayBuffer[AnyRef] = ArrayBuffer[AnyRef]()
      val fieldArray: ArrayBuffer[StructField] = ArrayBuffer[StructField]()
      keySets.forEach(dataKey => {
        fieldArray.append(
          StructField(dataKey, SqlTypeConverter.toSparkType(sqlTypeJson.getIntValue(dataKey))))
        dataArray.append(dataJson.get(dataKey))
      })
      val schema = StructType(fieldArray)
      val row = new GenericRowWithSchema(dataArray.toArray, schema).asInstanceOf[Row]
      row
    })

    saveRdd.foreachRDD(rdd => {
      // Get the offset ranges in the RDD
      //println(rdd.map(x => x.toJSONString()).toDebugString());
      import sss.implicits._
      rdd.collect().foreach(x => {
        println(x.json)
        println(x.schema.sql)
      })
    })
  }
}
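One way to do the conversion (a minimal, untested sketch): SparkSession.createDataFrame(rowRDD, schema) needs a single schema per DataFrame, so inside foreachRDD you can recover the distinct schemas from the rows themselves (GenericRowWithSchema keeps the schema on each row) and build one DataFrame per schema before writing it to Hudi. This assumes the number of distinct schemas per batch is small; the Hudi write itself is only indicated by a comment.

saveRdd.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    rdd.cache()   // the batch is traversed once per distinct schema
    // GenericRowWithSchema rows carry their own schema, so the distinct
    // schemas can be recovered from the data itself.
    val schemas = rdd.map(_.schema).distinct().collect()
    schemas.foreach { schema =>
      val rowsForSchema = rdd.filter(_.schema == schema)
      // createDataFrame(RDD[Row], StructType) turns an RDD of Rows plus an
      // explicit schema into a DataFrame.
      val df = sss.createDataFrame(rowsForSchema, schema)
      df.printSchema()
      // ... write df to the corresponding Hudi table here (Hudi options omitted)
    }
    rdd.unpersist()
  }
}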

Related

java.io.IOException: Failed to write statements to batch_layer.test. The latest exception was Key may not be empty

I am trying to count the number of words in a text and save the result to a Cassandra database.
The producer reads the data from a file and sends it to Kafka. The consumer uses Spark Streaming to read and process the data, and then sends the result of the calculations to the table.
My producer looks like this:
object ProducerPlayground extends App {

  val topicName = "test"

  private def createProducer: Properties = {
    val producerProperties = new Properties()
    producerProperties.setProperty(
      ProducerConfig.BOOTSTRAP_SERVERS_CONFIG,
      "localhost:9092"
    )
    producerProperties.setProperty(
      ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
      classOf[IntegerSerializer].getName
    )
    producerProperties.setProperty(
      ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
      classOf[StringSerializer].getName
    )
    producerProperties
  }

  val producer = new KafkaProducer[Int, String](createProducer)

  val source = Source.fromFile("G:\\text.txt", "UTF-8")
  val lines = source.getLines()

  var key = 0
  for (line <- lines) {
    producer.send(new ProducerRecord[Int, String](topicName, key, line))
    key += 1
  }

  source.close()
  producer.flush()
}
The consumer looks like this:
object BatchLayer {
  def main(args: Array[String]) {
    val brokers = "localhost:9092"
    val topics = "test"
    val groupId = "groupId-1"

    val sparkConf = new SparkConf()
      .setAppName("BatchLayer")
      .setMaster("local[*]")
    val ssc = new StreamingContext(sparkConf, Seconds(3))
    val sc = ssc.sparkContext
    sc.setLogLevel("OFF")

    val topicsSet = topics.split(",").toSet
    val kafkaParams = Map[String, Object](
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers,
      ConsumerConfig.GROUP_ID_CONFIG -> groupId,
      ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
      ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
      ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG -> "false"
    )

    val stream =
      KafkaUtils.createDirectStream[String, String](
        ssc,
        LocationStrategies.PreferConsistent,
        ConsumerStrategies.Subscribe[String, String](topicsSet, kafkaParams)
      )

    val cass = CassandraConnector(sparkConf)
    cass.withSessionDo { session =>
      session.execute(
        s"CREATE KEYSPACE IF NOT EXISTS batch_layer WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1 }"
      )
      session.execute(s"CREATE TABLE IF NOT EXISTS batch_layer.test (key VARCHAR PRIMARY KEY, value INT)")
      session.execute(s"TRUNCATE batch_layer.test")
    }

    stream
      .map(v => v.value())
      .flatMap(x => x.split(" "))
      .filter(x => !x.contains(Array('\n', '\t')))
      .map(x => (x, 1))
      .reduceByKey(_ + _)
      .saveToCassandra("batch_layer", "test", SomeColumns("key", "value"))

    ssc.start()
    ssc.awaitTermination()
  }
}
After starting the producer, the program stops working with this error. What did I do wrong?
It makes very little sense to use legacy streaming in 2021 - it's very cumbersome to use, and you also need to track offsets for Kafka yourself, etc. It's better to use Structured Streaming instead - it will track offsets for you through the checkpoints, you will work with the high-level Dataset APIs, etc.
In your case the code could look as follows (not tested, but adapted from this working example):
import spark.implicits._   // needed for the $"..." column syntax

val streamingInputDF = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "test")
  .load()

val wordsCountsDF = streamingInputDF.selectExpr("CAST(value AS STRING) as value")
  .selectExpr("split(value, '[^a-zA-Z0-9]+', -1) as words")   // split on runs of non-alphanumeric characters
  .selectExpr("explode(words) as word")
  .filter("word != ''")
  .groupBy($"word")
  .count()
  .select($"word", $"count")

// create table ...

val query = wordsCountsDF.writeStream
  .outputMode(OutputMode.Update)
  .format("org.apache.spark.sql.cassandra")
  .option("checkpointLocation", "path_to_checkpoint")
  .option("keyspace", "test")
  .option("table", "<table_name>")
  .start()

query.awaitTermination()
P.S. In your example, the most probable error is that you're trying to use .saveToCassandra directly on the DStream - it doesn't work this way.
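If you do stay on DStreams, one pattern that works (a sketch, assuming the Spark Cassandra Connector's RDD API is available via import com.datastax.spark.connector._) is to drop into foreachRDD and save each micro-batch with the RDD-based saveToCassandra; filtering out empty words also avoids writing an empty primary key, which is likely what the "Key may not be empty" message is about.

import com.datastax.spark.connector._   // brings saveToCassandra into scope for RDDs

stream
  .map(v => v.value())
  .flatMap(_.split("\\s+"))
  .filter(_.nonEmpty)                    // never write an empty string as the primary key
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .foreachRDD { rdd =>
    // Each micro-batch is a plain RDD[(String, Int)], which the connector can save.
    rdd.saveToCassandra("batch_layer", "test", SomeColumns("key", "value"))
  }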

error: value toDF is not a member of org.apache.spark.rdd.RDD[org.apache.kafka.clients.consumer.ConsumerRecord[String,String]]

I am trying to capture Kafka events (which I am getting in serialised form) using Spark Streaming in Scala.
Here is my code-snippet:
val spark = SparkSession.builder().master("local[*]").appName("Spark-Kafka-Integration").getOrCreate()
spark.conf.set("spark.driver.allowMultipleContexts", "true")
val sc = spark.sparkContext
val ssc = new StreamingContext(sc, Seconds(5))
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

val topics = Set("<topic-name>")
val brokers = "<some-list>"
val groupId = "spark-streaming-test"

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> brokers,
  "auto.offset.reset" -> "earliest",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
  "group.id" -> groupId,
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val messages: InputDStream[ConsumerRecord[String, String]] =
  KafkaUtils.createDirectStream[String, String](
    ssc,
    LocationStrategies.PreferConsistent,
    ConsumerStrategies.Subscribe[String, String](topics, kafkaParams)
  )

messages.foreachRDD { rdd =>
  println(rdd.toDF())
}

ssc.start()
ssc.awaitTermination()
I am getting this error message:
Error:(59, 19) value toDF is not a member of org.apache.spark.rdd.RDD[org.apache.kafka.clients.consumer.ConsumerRecord[String,String]] println(rdd.toDF())
toDF comes through DatasetHolder:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SQLImplicits
I haven't replicated it, but my guess is that there's no Encoder for ConsumerRecord[String, String], so you can either provide one or first map it to something for which an Encoder can be derived (a case class or a primitive).
Also, println within foreachRDD will probably not act the way you want, due to the distributed nature of Spark.
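As a sketch of the second option (untested; it simply maps each ConsumerRecord to its String value, for which Spark already provides an Encoder):

messages.foreachRDD { rdd =>
  // ConsumerRecord has no Encoder, but String does, so map to the value first.
  val values = rdd.map(record => record.value())
  val df = values.toDF("value")   // works thanks to import sqlContext.implicits._
  df.show(5, truncate = false)    // show() prints on the driver, which is fine for debugging
}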

KafkaConsumer is not safe for multi-threaded access Spark Streaming Scala

I'm trying to join 2 different streams coming from Apache Kafka (2 different topics) in Apache Spark Streaming on a cluster of machines.
The messages I send are strings formatted as CSV (comma separated).
This is the Spark code:
// Create the context with a 3 second batch size
val sparkConf = new SparkConf().setAppName("SparkScript").set("spark.driver.allowMultipleContexts", "true").set("spark.streaming.concurrentJobs", "3").setMaster("spark://0.0.0.0:7077")
val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, Seconds(3))

case class Location(latitude: Double, longitude: Double, name: String)
case class Datas1(location: Location, timestamp: String, measurement: Double, unit: String, accuracy: Double, elem: String, elems: String, elemss: String)
case class Sensors1(sensor_name: String, start_date: String, end_date: String, data1: Datas1)

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "0.0.0.0:9092",
  "key.deserializer" -> classOf[StringDeserializer].getCanonicalName,
  "value.deserializer" -> classOf[StringDeserializer].getCanonicalName,
  "group.id" -> "test_luca",
  "auto.offset.reset" -> "earliest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val topics1 = Array("topics1")
val stream1 = KafkaUtils.createDirectStream[String, String](ssc, PreferConsistent, Subscribe[String, String](topics1, kafkaParams))
val s1pre = stream1.map(record => record.value.split(",").map(_.trim))
val s1 = s1pre.map(x => Sensors1(x.apply(6), "2016-03-01T00:00:00.000", "2018-09-01T00:00:00.000", Datas1(Location(x.apply(1).toDouble, x.apply(2).toDouble, ""), x.apply(0), x.apply(3).toDouble, x.apply(5), x.apply(4).toDouble, x.apply(7), x.apply(8), x.apply(9))))

val topics2 = Array("topics2")
val stream2 = KafkaUtils.createDirectStream[String, String](ssc, PreferConsistent, Subscribe[String, String](topics2, kafkaParams))
val s2pre = stream2.map(record => record.value.split(",").map(_.trim))
val s2 = s2pre.map(x => Sensors1(x.apply(6), "2016-03-01T00:00:00.000", "2018-09-01T00:00:00.000", Datas1(Location(x.apply(1).toDouble, x.apply(2).toDouble, ""), x.apply(0), x.apply(3).toDouble, x.apply(5), x.apply(4).toDouble, x.apply(7), x.apply(8), x.apply(9))))

val j1s1 = s1.map(x => (x.data1.timestamp, x))
val j1s2 = s2.map(x => (x.data1.timestamp, x))
val j1s1win = j1s1.window(Seconds(3), Seconds(6))
val j1s2win = j1s2.window(Seconds(3), Seconds(6))

val j1pre = j1s1win.join(j1s2win)

case class Sensorj1(sensor_name: String, start_date: String, end_date: String)
val j1 = j1pre.map { r => Sensorj1("j1", r._2._1.start_date, r._2._1.end_date) }
j1.print()
The problem I have is "KafkaConsumer is not safe for multi-threaded access".
After reading different posts, I changed my code by adding cache() at the end of the KafkaConsumer streams (val stream1 and val stream2).
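Concretely, the change amounts to something like this (only the two stream definitions; topics and kafkaParams are the same as above):

val stream1 = KafkaUtils.createDirectStream[String, String](ssc, PreferConsistent, Subscribe[String, String](topics1, kafkaParams)).cache()
val stream2 = KafkaUtils.createDirectStream[String, String](ssc, PreferConsistent, Subscribe[String, String](topics2, kafkaParams)).cache()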
After that I no longer get the same error, but I get a serialization error on the string I try to map.
I do not understand it and have no idea how to fix this problem.
Any suggestions?
Thanks
LF

How to extract RDD content and put it in a DataFrame using Spark (Scala)

What I am trying to do is simply to extract some information from an RDD and put it in a DataFrame, using Spark (Scala).
So far, what I've done is create a streaming pipeline that connects to a Kafka topic and puts the content of the topic in an RDD:
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "test",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val topics = Array("vittorio")
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams)
)

val row = stream.map(record => record.value)
row.foreachRDD { (rdd: RDD[String], time: Time) =>
  rdd.collect.foreach(println)
  val spark = SparkSessionSingleton.getInstance(rdd.sparkContext.getConf)
  import spark.implicits._
  val DF = rdd.toDF()
  DF.show()
}

ssc.start()            // Start the computation
ssc.awaitTermination()
}
object SparkSessionSingleton {

  @transient private var instance: SparkSession = _

  def getInstance(sparkConf: SparkConf): SparkSession = {
    if (instance == null) {
      instance = SparkSession
        .builder
        .config(sparkConf)
        .getOrCreate()
    }
    instance
  }
}
Now, the content of my RDD is:
{"event":"bank.legal.patch","ts":"2017-04-15T15:18:32.469+02:00","svc":"dpbank.stage.tlc-1","request":{"ts":"2017-04-15T15:18:32.993+02:00","aw":"876e6d71-47c4-40f6-8c49-5dbd7b8e246b","end_point":"/bank/v1/legal/mxHr+bhbNqEwFvXGn4l6jQ==","method":"PATCH","app_instance":"e73e93d9-e70d-4873-8f98-b00c6fe4d036-1491406011","user_agent":"Dry/1.0.st/Android/5.0.1/Sam-SM-N910C","user_id":53,"user_ip":"151.14.81.82","username":"7cV0Y62Rud3MQ==","app_id":"db2ffeac6c087712530981e9871","app_name":"DrApp"},"operation":{"scope":"mdpapp","result":{"http_status":200}},"resource":{"object_id":"mxHr+bhbNqEwFvXGn4l6jQ==","request_attributes":{"legal_user":{"sharing_id":"mxHr+bhbNqEwFvXGn4l6jQ==","ndg":"","taxcode":"IQ7hUUphxFBXnI0u2fxuCg==","status":"INCOMPLETE","residence":{"city":"CAA","address":"Via Batto 44","zipcode":"926","country_id":18,"city_id":122},"business_categories":[5],"company_name":"4Gzb+KJk1XAQ==","vat_number":"162340159"}},"response_attributes":{"legal_user":{"sharing_id":"mGn4l6jQ==","taxcode":"IQ7hFBXnI0u2fxuCg==","status":"INCOMPLETE","residence":{"city":"CATA","address":"Via Bllo 44","zipcode":"95126","country_id":128,"city_id":12203},"business_categories":[5],"company_name":"4GnU/Nczb+KJk1XAQ==","vat_number":"12960159"}}},"class":"DPAPI"}
and doing val DF = rdd.toDF() shows:
+--------------------+
| value|
+--------------------+
|{"event":"bank.le...|
+--------------------+
What I would like to achieve is a DataFrame that keeps being populated as new RDDs arrive from the stream. A sort of union, but I'm not sure it is the correct way, because I'm not sure all the RDDs will have the same schema.
For example, this is what I would like to achieve:
+--------------------+------------+----------+-----+
| _id| user_ip| status|_type|
+--------------------+------------+----------+-----+
|AVtJFVOUVxUyIIcAklfZ|151.14.81.82|INCOMPLETE|DPAPI|
|AVtJFVOUVxUyIIcAklfZ|151.14.81.82|INCOMPLETE|DPAPI|
+--------------------+------------+----------+-----+
thanks!
If your rdd is
{"event":"bank.legal.patch","ts":"2017-04-15T15:18:32.469+02:00","svc":"dpbank.stage.tlc-1","request":{"ts":"2017-04-15T15:18:32.993+02:00","aw":"876e6d71-47c4-40f6-8c49-5dbd7b8e246b","end_point":"/bank/v1/legal/mxHr+bhbNqEwFvXGn4l6jQ==","method":"PATCH","app_instance":"e73e93d9-e70d-4873-8f98-b00c6fe4d036-1491406011","user_agent":"Dry/1.0.st/Android/5.0.1/Sam-SM-N910C","user_id":53,"user_ip":"151.14.81.82","username":"7cV0Y62Rud3MQ==","app_id":"db2ffeac6c087712530981e9871","app_name":"DrApp"},"operation":{"scope":"mdpapp","result":{"http_status":200}},"resource":{"object_id":"mxHr+bhbNqEwFvXGn4l6jQ==","request_attributes":{"legal_user":{"sharing_id":"mxHr+bhbNqEwFvXGn4l6jQ==","ndg":"","taxcode":"IQ7hUUphxFBXnI0u2fxuCg==","status":"INCOMPLETE","residence":{"city":"CAA","address":"Via Batto 44","zipcode":"926","country_id":18,"city_id":122},"business_categories":[5],"company_name":"4Gzb+KJk1XAQ==","vat_number":"162340159"}},"response_attributes":{"legal_user":{"sharing_id":"mGn4l6jQ==","taxcode":"IQ7hFBXnI0u2fxuCg==","status":"INCOMPLETE","residence":{"city":"CATA","address":"Via Bllo 44","zipcode":"95126","country_id":128,"city_id":12203},"business_categories":[5],"company_name":"4GnU/Nczb+KJk1XAQ==","vat_number":"12960159"}}},"class":"DPAPI"}
Then you can use sqlContext's read.json to read the RDD into a valid DataFrame and then select only the needed fields:
val df = sqlContext.read.json(rdd)
df.select(
    $"request.user_id".as("user_id"),
    $"request.user_ip".as("user_ip"),
    $"request.app_id".as("app_id"),
    $"resource.request_attributes.legal_user.status".as("status"),
    $"class")
  .show(false)
This should produce the following DataFrame:
+-------+------------+---------------------------+----------+-----+
|user_id|user_ip |app_id |status |class|
+-------+------------+---------------------------+----------+-----+
|53 |151.14.81.82|db2ffeac6c087712530981e9871|INCOMPLETE|DPAPI|
+-------+------------+---------------------------+----------+-----+
You can get the required columns as you wish using the above method. I hope the answer is helpful.
You can union the current DataFrame with the existing one:
At first, create an empty DataFrame at program start:
val df = // here create DF with required schema
df.createOrReplaceTempView("savedDF")
Now in foreachRDD:
// here we are in foreachRDD
val df = // create DataFrame from RDD
val existingCachedDF = spark.table("savedDF") // get reference to existing DataFrame
val union = existingCachedDF.union(df)
union.createOrReplaceTempView("savedDF")
A good idea is to checkpoint the DataFrame every few micro-batches to keep its logical plan from growing extremely long.
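For example (a sketch; the checkpoint directory and the batchId counter maintained across foreachRDD calls are illustrative):

// once at startup
spark.sparkContext.setCheckpointDir("/tmp/df-checkpoints")   // illustrative path

// inside foreachRDD, after building the union
val union = existingCachedDF.union(df)
val compacted =
  if (batchId % 10 == 0) union.checkpoint()   // materializes the data and truncates the lineage
  else union
compacted.createOrReplaceTempView("savedDF")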
The other idea is to use Structured Streaming, which replaces Spark Streaming.

Save Scala Spark Streaming Data to MongoDB

Here's my simplified Apache Spark Streaming code which gets input via Kafka streams, combines them, prints them, and saves them to a file. But now I want the incoming stream of data to be saved in MongoDB.
val conf = new SparkConf().setMaster("local[*]")
  .setAppName("StreamingDataToMongoDB")
  .set("spark.streaming.concurrentJobs", "2")

val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val ssc = new StreamingContext(sc, Seconds(1))

val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val topicName1 = List("KafkaSimple").toSet
val topicName2 = List("SimpleKafka").toSet

val stream1 = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicName1)
val stream2 = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicName2)

val lines1 = stream1.map(_._2)
val lines2 = stream2.map(_._2)

val allThelines = lines1.union(lines2)
allThelines.print()
allThelines.repartition(1).saveAsTextFiles("File", "AllTheLinesCombined")
I have tried the Stratio Spark-MongoDB library and some other resources but still had no success. Can someone please help me proceed or redirect me to some useful working resource/tutorial? Cheers :)
If you want to write out to a format which isn't directly supported on DStreams, you can use foreachRDD to write out each batch one by one using the RDD-based API for Mongo.
lines1.foreachRDD { rdd =>
  rdd.foreach { data =>
    if (data != null) {
      // Save data here
    } else {
      println("Got no data in this window")
    }
  }
}
Do the same for lines2.
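For the actual save, one option (a sketch, assuming the MongoDB Spark Connector is on the classpath and spark.mongodb.output.uri is configured in the SparkConf; the document layout is illustrative) is to wrap each batch in BSON Documents and hand them to MongoSpark.save inside foreachRDD:

import com.mongodb.spark.MongoSpark
import org.bson.Document

allThelines.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // Wrap each incoming line in a Document; the connector writes the RDD
    // to the collection configured via spark.mongodb.output.uri.
    val docs = rdd.map(line => new Document("line", line))
    MongoSpark.save(docs)
  }
}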