PySpark predict using Kafka direct stream

I am trying to pull Kafka data into Spark Streaming, load an already-built model from HDFS, and then make predictions using the Kafka messages.
I tried several methods, but I'm stuck at model.predict because of a TypeError: Cannot convert type into Vector.
The data received from Kafka is comma-separated floats.
Here is my code:
sc = SparkContext(appName="PythonStreamingKafkaForecast")
ssc = StreamingContext(sc, 10)
# Create stream to get kafka messages
directKafkaStream = KafkaUtils.createDirectStream(ssc, ["my_topic"], {"metadata.broker.list": "kafka_ip"})
features = directKafkaStream.foreachRDD(lambda rdd: rdd.map(lambda s: Vectors.dense(s[1].split(","))))
model = LinearRegressionModel.load(sc, "hdfs://hadoop_ip/model.model")
#Predict
predicted = model.predict(features)
I also tried this:
lines = directKafkaStream.map(lambda x: x[1])
features = lines.map(lambda data: Vectors.dense([float(c) for c in data.split(',')]))
But this time, features is a TransformedDStream, which won't work for predictions...
Could you tell me what I'm doing wrong?
Thank you for your help.

OK, the issue was trying to read data from Kafka even when the topic was empty.
This solved my problem:
def predict(rdd):
    count = rdd.count()
    if count > 0:
        # rdd elements are (key, value) tuples; the value holds the comma-separated floats
        features = rdd.map(lambda s: Vectors.dense(s[1].split(",")))
        return features
    else:
        print("No data received")

directKafkaStream.foreachRDD(lambda rdd: predict(rdd))

Structured Streaming metrics understanding

I'm quite new to Structured Streaming and would like to understand Spark's main metrics in a bit more detail.
I have a Structured Streaming process in Databricks that reads events from one Event Hub, reads values from those events, creates a new DataFrame, and writes it to a second Event Hub.
The event that comes from the first Event Hub is an Event Grid event from which I read a URL (emitted when a blob is added to a storage account); inside a foreachBatch, I create a new DataFrame and write it to the second Event Hub.
The code has the following structure:
val streamingInputDF =
  spark.readStream
    .format("eventhubs")
    .options(eventHubsConf.toMap)
    .load()
    .select($"body".cast("string"))
def get_func(batchDF: DataFrame, batchID: Long): Unit = {
  batchDF.persist()
  for (row <- batchDF.rdd.collect) { // necessary to read the file with spark.read....
    val file_url = "/mnt/" + path
    // create df from the read url
    val df = spark
      .read
      .option("rowTag", "Transaction")
      .xml(file_url)
    if (!df.rdd.isEmpty) {
      // some filtering
      val eh_df = df.select(col(...).as(...), ...)
      val eh_jsoned = eh_df.toJSON.withColumnRenamed("value", "body")
      // write to Event Hub
      eh_jsoned.select("body")
        .write
        .format("eventhubs")
        .options(eventHubsConfWrite.toMap)
        .save()
    }
  }
  batchDF.unpersist()
}
val query_test = streamingInputDF
  .writeStream
  .queryName("query_test")
  .foreachBatch(get_func _)
  .start()
I have tried adding the maxEventsPerTrigger(100) parameter, but this greatly increases the time from when the data arrives in the Storage Account until it is consumed in Databricks.
The value for maxEventsPerTrigger was chosen arbitrarily, just to test the behaviour.
Looking at the metrics, why does the batch duration increase so much while the processing rate and input rate remain similar?
What approach should I consider to improve the process?
I'm running it from a Databricks 7.5 Notebook, Spark 3.0.1 and Scala 2.12.
Thank you all very much in advance.
NOTE:
The XML files are all the same size.
The first Event Hub has 20 partitions.
The data input rate to the first Event Hub is 2 events/sec.

What is the difference between saveAsObjectFile and persist in apache spark?

I am trying to compare Java and Kryo serialization. When saving the RDD to disk with saveAsObjectFile, both give the same size, but with persist the Spark UI shows different sizes: the Kryo one is smaller than the Java one. The irony is that the processing time with Java is lower than with Kryo, which I did not expect from the Spark UI.
case class Person(name: String, age: Int)

val conf = new SparkConf()
  .setAppName("kyroExample")
  .setMaster("local[*]")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(
    Array(classOf[Person], classOf[Array[Person]])
  )
val sparkContext = new SparkContext(conf)
val personList: Array[Person] = (1 to 100000).map(value => Person(value + "", value)).toArray
val rddPerson: RDD[Person] = sparkContext.parallelize(personList)
val evenAgePerson: RDD[Person] = rddPerson.filter(_.age % 2 == 0)
evenAgePerson.saveAsObjectFile("src/main/resources/objectFile")
evenAgePerson.persist(StorageLevel.MEMORY_ONLY_SER)
evenAgePerson.count()
persist and saveAsObjectFile serve different needs.
persist has a misleading name. It is not meant to permanently persist the RDD result; it is used to temporarily cache the result of an RDD computation during a Spark workflow. The user has no control over the location of the persisted data. persist is just caching with a choice of caching strategy: memory, disk, or both. In fact, cache simply calls persist with a default caching strategy.
For example:
val errors = df.filter(col("line").like("%ERROR%"))
// Counts all the errors
errors.count()
// Counts errors mentioning MySQL
// Runs again on the full dataframe of all the lines, repeating the above operation
errors.filter(col("line").like("%MySQL%")).count()
vs
val errors = df.filter(col("line").like("%ERROR%"))
errors.persist()
// Counts all the errors
errors.count()
// Counts errors mentioning MySQL
// Runs only on the errors tmp result containing only the filtered error lines
errors.filter(col("line").like("%MySQL%")).count()
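To make the "different caching strategy" point concrete, here is a minimal sketch assuming the same df of log lines as above; the storage levels shown are standard Spark options:
import org.apache.spark.sql.functions.col
import org.apache.spark.storage.StorageLevel

val errors = df.filter(col("line").like("%ERROR%"))
// cache() is just persist() with the default storage level
errors.persist(StorageLevel.MEMORY_AND_DISK) // alternatives: MEMORY_ONLY, MEMORY_ONLY_SER, DISK_ONLY, ...
errors.count()                               // the first action materializes the cached result
errors.unpersist()                           // release the temporary result when it is no longer needed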
saveAsObjectFile is for permanent persistence. It is used to serialize the final result of a Spark job onto a persistent, usually distributed, filesystem such as HDFS or Amazon S3.
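For contrast, a minimal sketch of permanent persistence, assuming the evenAgePerson RDD and sparkContext from the question (the HDFS path is only illustrative):
// write the final result to durable storage
evenAgePerson.saveAsObjectFile("hdfs:///data/even-age-persons")

// a later, separate job can read it back; the element type is supplied by the caller
val restored = sparkContext.objectFile[Person]("hdfs:///data/even-age-persons")
restored.count()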
Spark's persist and saveAsObjectFile are not the same at all.
persist - persists your RDD at the requested StorageLevel; from then on, any transformation applied to this RDD is computed from the persisted data rather than by re-evaluating its lineage (DAG).
saveAsObjectFile - just saves the RDD into a SequenceFile of serialized objects.
saveAsObjectFile does not use the "spark.serializer" configuration at all, as you can see in the code below:
/**
 * Save this RDD as a SequenceFile of serialized objects.
 */
def saveAsObjectFile(path: String): Unit = withScope {
  this.mapPartitions(iter => iter.grouped(10).map(_.toArray))
    .map(x => (NullWritable.get(), new BytesWritable(Utils.serialize(x))))
    .saveAsSequenceFile(path)
}
saveAsObjectFile uses Utils.serialize to serialize your objects, where the serialize method definition is:
/** Serialize an object using Java serialization */
def serialize[T](o: T): Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bos)
  oos.writeObject(o)
  oos.close()
  bos.toByteArray
}
So saveAsObjectFile always uses Java serialization.
persist, on the other hand, uses whatever spark.serializer you have configured.
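Tying this back to the question's observed sizes, here is a minimal sketch reusing the Person case class from the question: the Kryo setting only affects the persisted (serialized) copy shown in the Storage tab of the UI, while saveAsObjectFile still writes Java-serialized bytes.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// assumes the Person case class from the question is in scope
val conf = new SparkConf()
  .setAppName("kryoVsJavaSketch")
  .setMaster("local[*]")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Person], classOf[Array[Person]]))
val sc = new SparkContext(conf)

val people = sc.parallelize((1 to 100000).map(v => Person(v.toString, v)))
people.persist(StorageLevel.MEMORY_ONLY_SER)      // serialized in memory with Kryo (per spark.serializer)
people.count()
people.saveAsObjectFile("/tmp/people-objectfile") // always Java serialization, regardless of the setting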

Issue storing AVRO Kafka streams to File System

I want to store my Avro Kafka streams to the file system in delimited format using the Spark Streaming API with the following Scala code, but I am facing some challenges in achieving this:
record.write.mode(SaveMode.Append).csv("/Users/Documents/kafka-poc/consumer-out/")
Since record (a GenericRecord) is not a DataFrame or an RDD, I am not sure how to proceed.
Code
val messages = SparkUtilsScala.createCustomDirectKafkaStreamAvro(ssc, kafkaParams, zookeeper_host, kafkaOffsetZookeeperNode, topicsSet)
val requestLines = messages.map(_._2)
requestLines.foreachRDD((rdd, time: Time) => {
  rdd.foreachPartition { partitionOfRecords =>
    val recordInjection = SparkUtilsJava.getRecordInjection(topicsSet.last)
    for (avroLine <- partitionOfRecords) {
      val record = recordInjection.invert(avroLine).get
      println("Consumer output...." + record)
      println("Consumer output schema...." + record.getSchema)
    }
  }
})
Following is the output and schema:
{"username": "Str 1-0", "tweet": "Str 2-0", "timestamp": 0}
{"type":"record","name":"twitter_schema","fields":[{"name":"username","type":"string"},{"name":"tweet","type":"string"},{"name":"timestamp","type":"int"}]}
Thanks in advance and appreciate your help
I found a solution for this:
val jsonStrings: RDD[String] = sc.parallelize(Seq(record.toString()))
val result = sqlContext.read.json(jsonStrings).toDF()
result.write.mode("Append").csv("/Users/Documents/kafka-poc/consumer-out/")

HBase inserts are very slow when Kafka Avro records are converted to JSON

I am using Kafka 10 and receiving records in it from DB2 CDC. Kafka 10 uses the Confluent Schema Registry to store the DB2 table schema and sends the records as Avro Array[Byte]. I want to store these records into HBase (let's call it raw HBase) and then run some transformations over those new records (like dropping columns, aggregation, etc.) using Hive and store the transformed records into HBase again (let's call it conformed HBase). I tried two approaches and both are giving me some kind of issue. The records are wide, with ~500 columns (although only about 10% of the columns are required), and each record is ~10 KB in size.
1) I tried deserializing the records into Array[Byte] and then using the streamBulkPut method to insert them into HBase.
Deserializer code:
def toRecord(buffer: Array[Byte]): Array[Byte] = {
  var schemaRegistry: SchemaRegistryClient = null
  schemaRegistry = new CachedSchemaRegistryClient(url, 10)
  val bb = ByteBuffer.wrap(buffer)
  bb.get() // consume MAGIC_BYTE
  val schemaId = bb.getInt // consume schemaId //println(schemaId.toString)
  val schema = schemaRegistry.getByID(schemaId) // consult the Schema Registry //println(schema)
  val reader = new GenericDatumReader[GenericRecord](schema)
  val decoder = DecoderFactory.get().binaryDecoder(buffer, bb.position(), bb.remaining(), null)
  val writer = new GenericDatumWriter[GenericRecord](schema)
  val baos = new ByteArrayOutputStream
  val jsonEncoder = EncoderFactory.get.jsonEncoder(schema, baos)
  writer.write(reader.read(null, decoder), jsonEncoder) // reader.read(null, decoder) returns a GenericRecord
  jsonEncoder.flush
  baos.toByteArray
}
HBase bulkPut code:
val messages = KafkaUtils.createDirectStream[Object, Array[Byte], KafkaAvroDecoder, DefaultDecoder](ssc, kafkaParams, topicSet)
val hconf = HBaseConfiguration.create()
val hbaseContext = new HBaseContext(ssc.sparkContext, hconf)
val tableName = "your_table"
var rowKeyArray: Array[String] = null

hbaseContext.streamBulkPut(messages, TableName.valueOf(tableName), putFunction)

def putFunction(avroRecord: Tuple2[Object, Array[Byte]]): Put = {
  implicit val formats = DefaultFormats
  val recordKey = getKeyString(parse(avroRecord._1.toString.mkString).extract[Map[String, String]].values.mkString)
  var put = new Put(Bytes.toBytes(recordKey))
  put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("row"), AvroDeserializer.toRecord(avroRecord._2))
  put
}

def getKeyString(keystr: String): String = {
  (Math.abs(keystr.map(_.hashCode).reduceLeft(31 * _ + _)) % 10 + 48).toChar + "_" + keystr.trim
}
Now this method works, but the inserts are painfully slow: I am getting a throughput of ~5k records per minute. The plan was that once the records are in raw HBase, I would use Hive to read and explode the JSON and run the transformations.
2) Instead of re-serializing the records while storing into raw HBase, I thought of doing it while loading from raw to conformed HBase (I can tolerate the slowness there, as the data will already be with me, i.e. out of Kafka). So I tried storing the Avro records as-is into HBase and it ran very fast; I was able to insert 1.5 million records in 2 minutes. Below is the code:
hbaseContext.streamBulkPut(messages, TableName.valueOf(tableName), putFunction)

def putFunction(avroRecord: Tuple2[Object, Array[Byte]]): Put = {
  implicit val formats = DefaultFormats
  val recordKey = parse(avroRecord._1.toString.mkString).extract[Map[String, String]]
  var put = new Put(Bytes.toBytes(getKeyString(recordKey.values.mkString)))
  put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("row"), avroRecord._2)
  put
}
The problem with this approach is that Hive is not able to read the Avro records from HBase, so I cannot filter the records or run any logic on them.
I would appreciate any help or resources I can follow to improve the performance. Either approach would work for me if its corresponding issue were solved. Thanks.

Data parallelism in Spark: reading Avro data from HDFS

I am trying to read Avro data using Scala within a Spark environment. My data is not getting distributed: while running, it goes to only 2 nodes, although we have 20+ nodes. Here is my code snippet:
@serializable case class My_Class(val My_ID: String)

val filePath = "hdfs://path"
val avroRDD = sc.hadoopFile[AvroWrapper[GenericRecord], NullWritable, AvroInputFormat[GenericRecord]](filePath)
val rddprsid = avroRDD.map(A =>
  new My_Class(new String(A._1.datum.get("My_ID").toString()))
)
val uploadFilter = rddprsid.filter(E => E.My_ID ne null)
val as = uploadFilter.distinct(100).count
I am not able to use the parallelize operation on the RDD, as it complains with the following error:
<console>:30: error: type mismatch;
found : org.apache.spark.rdd.RDD[(org.apache.avro.mapred.AvroWrapper[org.apache.avro.generic.GenericRecord], org.apache.hadoop.io.NullWritable)]
required: Seq[?]
Can someone please help?
You are seeing only 2 nodes because a YARN submission defaults to 2 executors. You need to submit with --num-executors [NUMBER] and optionally --executor-cores [NUMBER].
As for parallelize: your data is already parallelized (hence the RDD wrapper). parallelize is only for distributing an in-memory local collection across the cluster.
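As a minimal sketch (assuming a YARN deployment; the application name and values below are only illustrative), the same executor settings can also be set programmatically, and parallelize is reserved for local collections:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("avroRead")
  .set("spark.executor.instances", "20") // equivalent to --num-executors 20
  .set("spark.executor.cores", "4")      // equivalent to --executor-cores 4
val sc = new SparkContext(conf)

// an RDD created with hadoopFile is already distributed across its input splits;
// parallelize is only for turning a local in-memory collection into an RDD:
val distributed = sc.parallelize(Seq(1, 2, 3, 4))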