Push Data to a NiFi Flow Using Apache Spark and Scala

I want to get data from a NiFi flow into Spark, do some processing, and then send the result back to the NiFi flow.
This is my NiFi flow that sends data to Spark using output ports.
To get data from the NiFi flow I wrote the function below.
import java.nio.charset.StandardCharsets

import org.apache.nifi.remote.client.SiteToSiteClient
import org.apache.nifi.spark.NiFiReceiver
import org.apache.spark.SparkConf
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

def process(): Unit = {
  val schema = StructType(
    Seq(
      StructField(name = "ID", dataType = StringType, nullable = false),
      StructField(name = "Client_Name", dataType = StringType, nullable = false),
      StructField(name = "Due_Date", dataType = StringType, nullable = false),
      StructField(name = "Score", dataType = StringType, nullable = false)
    )
  )

  // Site-to-Site config pointing at the NiFi output port
  val config = new SiteToSiteClient.Builder()
    .url("http://localhost:8090/nifi")
    .portName("Data For Spark")
    .buildConfig()

  val sparkConf = new SparkConf()
    .setMaster("local[*]")
    .setAppName("NiFi-Spark Streaming example")
  val ssc = new StreamingContext(sparkConf, Seconds(10))
  val spark = SparkSession.builder().config(sparkConf).getOrCreate()

  // Receive flow files from NiFi and decode their content as UTF-8 strings
  val packetStream = ssc.receiverStream(new NiFiReceiver(config, StorageLevel.MEMORY_ONLY))
  val file = packetStream.map(dataPacket => new String(dataPacket.getContent, StandardCharsets.UTF_8))

  file.foreachRDD(rdd => {
    val data = spark.createDataFrame(rdd
      .filter(!_.contains("ID,Client_Name,Due_Date,Score")) // drop the CSV header line
      .map(line => Row.fromSeq(line.split(",").toSeq)), schema)
    data.show(100)
    val id = data.select("ID")
  })

  ssc.start()
  ssc.awaitTermination()
}
The final result of the above function is the id DataFrame. I want to send that result back to the NiFi flow, and I don't want to write it out as a file to some destination and pick it up again with a GetFile processor.
How can I send the final result to the NiFi flow?

This is an interesting approach.
Have you considered introducing a brokering service such as Apache Kafka? This can be used both as a source and as a sink in your Apache Spark application and the integration is out of the box. You can also follow the official guide here: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html. The guide describes a flow using the relatively new Apache Spark Structured Streaming.
Then on Apache NiFi you can use the ConsumeKafkaRecord processor to consume from the same topic being used as a sink in your Apache Spark application. You can also make use of the PublishKafkaRecord processor if you wish to refactor your application to make use of Apache Kafka as a source as well rather than relying on Apache NiFi sockets directly.
Update: If you absolutely must write directly to Apache NiFi, using Apache Spark Structured Streaming you can extend the ForeachWriter class (https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.ForeachWriter) to implement your own custom sink.
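For illustration, here is a minimal sketch of such a ForeachWriter that pushes each row back to NiFi over Site-to-Site. The URL and the input port name ("From Spark") are assumptions and must match an input port in your own flow; this is a sketch, not the only way to structure the sink.

import java.nio.charset.StandardCharsets

import org.apache.nifi.remote.TransferDirection
import org.apache.nifi.remote.client.SiteToSiteClient
import org.apache.spark.sql.{ForeachWriter, Row}

// Hypothetical sink: sends every row to an assumed NiFi *input* port named "From Spark"
class NiFiForeachWriter(url: String, portName: String) extends ForeachWriter[Row] {
  @transient private var client: SiteToSiteClient = _

  override def open(partitionId: Long, epochId: Long): Boolean = {
    client = new SiteToSiteClient.Builder()
      .url(url)            // e.g. "http://localhost:8090/nifi" (assumption)
      .portName(portName)  // the input port your flow reads from (assumption)
      .build()
    true
  }

  override def process(row: Row): Unit = {
    // One flow file per row; adjust the payload format to whatever your flow expects
    val transaction = client.createTransaction(TransferDirection.SEND)
    transaction.send(row.mkString(",").getBytes(StandardCharsets.UTF_8),
      new java.util.HashMap[String, String]())
    transaction.confirm()
    transaction.complete()
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (client != null) client.close()
  }
}

// Usage (sketch, assuming a streaming Dataset of Rows called idStream):
// idStream.writeStream.foreach(new NiFiForeachWriter("http://localhost:8090/nifi", "From Spark")).start()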

Look at ListenHTTP. That way you treat NiFi as a simple REST service. Personally, I would prefer to have a message bus between Spark and NiFi, but if that is not possible for your use case you can try this and see if it works for you.
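As a rough sketch of that idea (the port 8085 and the contentListener base path are assumptions and must match what you configure on the ListenHTTP processor), each micro-batch could be POSTed from foreachBatch:

import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

import org.apache.spark.sql.DataFrame

// Posts every row of a micro-batch as JSON to an assumed ListenHTTP endpoint.
// Note: collect() pulls the batch to the driver, which is only reasonable for small batches.
def postBatchToNiFi(batch: DataFrame, batchId: Long): Unit = {
  val endpoint = "http://localhost:8085/contentListener" // assumption: ListenHTTP port + base path
  batch.toJSON.collect().foreach { json =>
    val conn = new URL(endpoint).openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setDoOutput(true)
    conn.setRequestProperty("Content-Type", "application/json")
    val out = conn.getOutputStream
    out.write(json.getBytes(StandardCharsets.UTF_8))
    out.close()
    conn.getResponseCode // force the request; ListenHTTP answers 200 on success
    conn.disconnect()
  }
}

// Usage (sketch): idStream.writeStream.foreachBatch(postBatchToNiFi _).start()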

Related

write into kafka topic using spark and scala

I am reading data from a Kafka topic and writing the received data back into another Kafka topic.
Below is my code:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.ForeachWriter

// Load data from Kafka
val data = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "*******:9092")
  .option("subscribe", "PARAMTABLE")
  .option("startingOffsets", "latest")
  .load()

// Extract the value from the JSON payload
val schema = new StructType()
  .add("PARAM_INSTANCE_ID", IntegerType)
  .add("ENTITY_ID", IntegerType)
  .add("PARAM_NAME", StringType)
  .add("VALUE", StringType)
val df1 = data.selectExpr("CAST(value AS STRING)")
val dataDF = df1.select(from_json(col("value"), schema).as("data")).select("data.*")

// Insert into another Kafka topic
val topic = "SparkParamValues"
val brokers = "********:9092"
val writer = new KafkaSink(topic, brokers)
val query = dataDF.writeStream
  .foreach(writer)
  .outputMode("update")
  .start()
  .awaitTermination()
I am getting the below error:
<console>:47: error: not found: type KafkaSink
val writer = new KafkaSink(topic, brokers)
I am very new to Spark. Can someone suggest how to resolve this, or check whether the above code is correct? Thanks in advance.
In Spark Structured Streaming, you can write to a Kafka topic after reading from another topic using the existing DataStreamWriter for Kafka, or you can create your own sink by extending the ForeachWriter class.
Without using a custom sink:
You can use the code below to write a DataFrame to Kafka, assuming df is the DataFrame generated by reading from the Kafka topic.
The DataFrame must have at least one column named value. If you have multiple columns, merge them into a single column named value (a sketch of this follows the snippet below). If no key column is specified, the key will be null in the destination topic.
df.select("key", "value")
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("topic", "<topicName>")
.start()
.awaitTermination()
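If df has more than one column, a common way to build that single value column is to serialize each row as JSON. A sketch, reusing the same df (the checkpoint path is a placeholder):

import org.apache.spark.sql.functions.{col, struct, to_json}

// Collapse all columns of df into one JSON string column named "value"
val kafkaReady = df.select(to_json(struct(df.columns.map(col): _*)).alias("value"))

kafkaReady.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("topic", "<topicName>")
  .option("checkpointLocation", "/tmp/kafka-sink-checkpoint") // placeholder path
  .start()
  .awaitTermination()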
Using a custom sink:
If you want to implement your own Kafka sink, create a class extending ForeachWriter, override its methods, and pass an instance of that class to the foreach() method.
// Using an anonymous class to extend ForeachWriter
df.writeStream.foreach(new ForeachWriter[Row] {
  // If you are writing a Dataset[String], use new ForeachWriter[String] instead
  def open(partitionId: Long, version: Long): Boolean = {
    // open the connection; return true to process this partition
    true
  }
  def process(record: Row): Unit = {
    // write the row to the connection
  }
  def close(errorOrNull: Throwable): Unit = {
    // close the connection
  }
}).start()
You can check this Databricks notebook for the implemented code (scroll down to the code under the Kafka Sink heading); I think that is the page you are referring to. To solve the error you need to make sure the KafkaSink class is visible to your Spark code: either put the class and the Spark code in the same package, or, if you are running in spark-shell, paste the KafkaSink class before pasting the Spark code.
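For reference, a KafkaSink along those lines could look roughly like the sketch below; the constructor arguments mirror the topic and brokers values from the question, and the comma-separated payload format is just an example:

import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.{ForeachWriter, Row}

// Minimal Kafka sink: one producer per partition, rows serialized as comma-separated strings
class KafkaSink(topic: String, servers: String) extends ForeachWriter[Row] {
  private var producer: KafkaProducer[String, String] = _

  override def open(partitionId: Long, version: Long): Boolean = {
    val props = new Properties()
    props.put("bootstrap.servers", servers)
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    producer = new KafkaProducer[String, String](props)
    true
  }

  override def process(row: Row): Unit = {
    // Adjust the serialization to whatever the downstream consumer expects
    producer.send(new ProducerRecord[String, String](topic, row.mkString(",")))
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (producer != null) producer.close()
  }
}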
Read the Structured Streaming Kafka integration guide to explore more.

How to Read data from kafka topic with different schema (has some optional objects) in structured streaming

I have data coming into a Kafka topic that contains an optional object, and since it is optional I am missing those records when reading with a defined schema.
For example, the schema I have:
val schema = new StructType()
  .add("obj_1", new ArrayType(
    new StructType(Array(
      StructField("field1", StringType),
      StructField("field2", StringType),
      StructField("obj_2", new ArrayType(
        new StructType(Array(
          StructField("field3", StringType),
          StructField("field4", LongType),
          StructField("obj_3", new ArrayType(
            new StructType(Array(
              StructField("field5", StringType),
              StructField("field6", StringType)
            )),
            containsNull = true))
        )),
        containsNull = true)),
      StructField("field7", StringType),
      StructField("field8", StringType)
    )),
    containsNull = true))
When publishing data to this topic we sometimes do not send obj_3, depending on certain conditions.
So when reading the topic and mapping it to the above schema, we lose the records that lack obj_3 and only keep the ones where obj_3 is present.
How can I read the data when obj_3 is sometimes missing?
Sample code:
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "bootstrap.servers")
  .option("subscribe", "topic.name")
  .option("startingOffsets", "offset.reset")
  .option("failOnDataLoss", "true")
  .load()

val cast = df.selectExpr("CAST(value AS STRING)")
  .as[String]

val resultedDf = cast.select(from_json(col("value"), schema).as("newDF"))
val finalDf = resultedDf.select(col("newDF.*"))
You could either:
use a flag (e.g. called "obj3flag" within a JSON structure) in the key of the Kafka message that tells your Structured Streaming job whether obj_3 exists in the Kafka value, and then choose one schema or the other to parse the JSON string. Something like:
import org.apache.spark.sql.functions._

val schemaWithObj3 = ...
val schemaWithOutObj3 = ...

val cast = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

val resultedDf = cast
  .withColumn("obj_3_flag", get_json_object(col("key"), "$.obj3flag"))
  .withColumn("data", when(col("obj_3_flag") === lit(1), from_json(col("value"), schemaWithObj3))
    .otherwise(from_json(col("value"), schemaWithOutObj3)))
  .select(col("data.*"))
or do a string search for "obj_3" in the Kafka value (cast as a string) and, if the string is found, apply one schema or the other to parse the JSON. The code will look very similar to the one for the other option.
Please note that I have written the code on my mobile, so you may find some syntax issues; however, I hope the idea gets across.
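For completeness, a rough sketch of that second (string-search) variant, under the same assumption of two pre-built schemas:

import org.apache.spark.sql.functions._

// Pick the schema based on whether the raw JSON string mentions "obj_3" at all
val resultedDf = df.selectExpr("CAST(value AS STRING)")
  .withColumn("data",
    when(col("value").contains("obj_3"), from_json(col("value"), schemaWithObj3))
      .otherwise(from_json(col("value"), schemaWithOutObj3)))
  .select(col("data.*"))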

Structured Streaming metrics understanding

I'm quite new to Structured Streaming and would like to understand the main Spark metrics in a bit more detail.
I have a Structured Streaming process in Databricks that reads events from one Event Hub, reads values from those events, creates a new DataFrame and writes it to a second Event Hub.
The event that comes from the first Event Hub is an Event Grid event from which I read a URL (emitted when a blob is added to a storage account), and inside a foreachBatch I create a new DataFrame and write it to the second Event Hub.
The code has the following structure:
val streamingInputDF =
  spark.readStream
    .format("eventhubs")
    .options(eventHubsConf.toMap)
    .load()
    .select(($"body").cast("string"))

def get_func(batchDF: DataFrame, batchID: Long): Unit = {
  batchDF.persist()
  for (row <- batchDF.rdd.collect) { // necessary to read the file with spark.read...
    val file_url = "/mnt/" + path
    // create a df from the url that was read
    val df = spark
      .read
      .option("rowTag", "Transaction")
      .xml(file_url)
    if (!df.rdd.isEmpty) {
      // some filtering
      val eh_df = df.select(col(...).as(...), ...)
      val eh_jsoned = eh_df.toJSON.withColumnRenamed("value", "body")
      // write to Event Hub
      eh_jsoned.select("body")
        .write
        .format("eventhubs")
        .options(eventHubsConfWrite.toMap)
        .save()
    }
  }
  batchDF.unpersist()
}

val query_test = streamingInputDF
  .writeStream
  .queryName("query_test")
  .foreachBatch(get_func _)
  .start()
I have tried adding the maxEventsPerTrigger(100) parameter, but this greatly increases the time between when the data arrives at the storage account and when it is consumed in Databricks.
The value for maxEventsPerTrigger was chosen arbitrarily in order to test the behaviour.
Looking at the metrics, how does it make sense that the batch duration keeps increasing while the processing rate and input rate stay similar?
What approach should I consider to improve the process?
I'm running it from a Databricks 7.5 Notebook, Spark 3.0.1 and Scala 2.12.
Thank you all very much in advance.
NOTE:
The XML files all have the same size
The first Event Hub has 20 partitions
The input rate to the first Event Hub is 2 events/sec

Spark Structured Streaming dynamic lookup with Redis

I am new to Spark.
We are currently building a pipeline:
Read the events from a Kafka topic
Enrich this data with the help of a Redis lookup
Write events to a new Kafka topic
My problem is that when I use the spark-redis library it performs very well, but the data stays static in my streaming job.
Although the data is refreshed in Redis, it does not reflect in my DataFrame.
Spark reads the data at first and never updates it afterwards.
Also, I read all the Redis data up front: about 1 million key-value strings in total.
What kind of approaches/methods can I use? I want to use Redis as an in-memory dynamic lookup,
and the lookup table changes roughly every hour.
Thanks.
used libraries:
spark-redis-2.4.1.jar
commons-pool2-2.0.jar
jedis-3.2.0.jar
Here is the code part:
import com.intertech.hortonworks.spark.registry.functions._

val config = Map[String, Object]("schema.registry.url" -> "http://aa.bbb.ccc.yyy:xxxx/api/v1")
implicit val srConfig: SchemaRegistryConfig = SchemaRegistryConfig(config)
var rawEventSchema = sparkSchema("my_raw_json_events")

val my_raw_events_df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "aa.bbb.ccc.yyy:9092")
  .option("subscribe", "my-raw-event")
  .option("failOnDataLoss", "false")
  .option("startingOffsets", "earliest")
  .option("maxOffsetsPerTrigger", 1000)
  .load()
  .select(from_json($"value".cast("string"), rawEventSchema, Map.empty[String, String])
    .alias("C"))

import com.redislabs.provider.redis._

val sc = spark.sparkContext
val stringRdd = sc.fromRedisKV("PARAMETERS:*")
val lookup_map = stringRdd.collect().toMap
val lookup = udf((key: String) => lookup_map.getOrElse(key, ""))

val curated_df = my_raw_events_df
  .select(
    ...
    $"C.SystemEntryDate".alias("RecordCreateDate")
    , $"C.Profile".alias("ProfileCode")
    , lookup(expr("'PARAMETERS:PROFILE||'||NVL(C.Profile,'')")).alias("ProfileName")
    , $"C.IdentityType"
    , lookup(expr("'PARAMETERS:IdentityType||'||NVL(C.IdentityType,'')")).alias("IdentityTypeName")
    ...
  ).as("C")

import org.apache.spark.sql.streaming.Trigger

val query = curated_df
  .select(to_sr(struct($"*"), "curated_event_sch").alias("value"))
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "aa.bbb.ccc.yyy:9092")
  .option("topic", "curated-event")
  .option("checkpointLocation", "/user/spark/checkPointLocation/xyz")
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .start()

query.awaitTermination()
One option is to not use spark-redis, but rather look up in Redis directly. This can be achieved with the df.mapPartitions function. You can find some examples for Spark DStreams here: https://blog.codecentric.de/en/2017/07/lookup-additional-data-in-spark-streaming/. The idea for Structured Streaming is similar. Be careful to handle the Redis connection properly.
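A rough sketch of that per-partition lookup with plain Jedis is below; the Event/Enriched case classes, the Redis host, and the key layout are assumptions modelled on the PARAMETERS:* pattern from the question:

import org.apache.spark.sql.Dataset
import redis.clients.jedis.Jedis

// Hypothetical typed views of the raw and enriched events
case class Event(profile: String, identityType: String)
case class Enriched(profile: String, profileName: String,
                    identityType: String, identityTypeName: String)

import spark.implicits._ // encoders for the case classes (spark assumed to be in scope)

// events: Dataset[Event] parsed from the Kafka stream (assumed to exist)
def enrich(events: Dataset[Event]): Dataset[Enriched] =
  events.mapPartitions { iter =>
    // one Redis connection per partition; host and port are placeholders
    val jedis = new Jedis("redis-host", 6379)
    val out = iter.map { e =>
      Enriched(
        e.profile,
        Option(jedis.get("PARAMETERS:PROFILE||" + e.profile)).getOrElse(""),
        e.identityType,
        Option(jedis.get("PARAMETERS:IdentityType||" + e.identityType)).getOrElse("")
      )
    }.toList // materialize before closing the connection
    jedis.close()
    out.iterator
  }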
Another solution is to do a stream-static join (see the Spark docs):
Instead of collecting the Redis RDD to the driver, use the Redis DataFrame (see the spark-redis docs) as a static DataFrame to be joined with your stream, so it will look like:
val redisStaticDf = spark.read. ...
val streamingDf = spark.readStream. ...
streamingDf.join(redisStaticDf, ...)
Since the Spark micro-batch execution engine re-evaluates the query execution on each trigger, the Redis DataFrame will fetch the data on each trigger, giving you up-to-date data (if you cache the DataFrame, it won't).
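A slightly more concrete sketch of that idea follows. It assumes the lookup entries are stored as Redis hashes matching PARAMETERS:* with a "value" field (the spark-redis DataFrame API works with hashes rather than the plain strings from the question), and the key.column / join column names are made up for the example:

import org.apache.spark.sql.functions.{col, concat, lit}

// Static side: re-read from Redis on every micro-batch because it is a data-source-based DataFrame
val redisStaticDf = spark.read
  .format("org.apache.spark.sql.redis")
  .option("keys.pattern", "PARAMETERS:*")
  .option("key.column", "lookup_key")   // expose the Redis key itself as a column (assumption)
  .option("infer.schema", "true")
  .load()

// Streaming side: derive the same lookup key the UDF built, then join instead of calling the UDF
val enriched = my_raw_events_df
  .withColumn("profile_key", concat(lit("PARAMETERS:PROFILE||"), col("C.Profile")))
  .join(redisStaticDf, col("profile_key") === col("lookup_key"), "left_outer")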

loading csv file to HBase through Spark

this is simple " how to " question::
We can bring data to Spark environment through com.databricks.spark.csv. I do know how to create HBase table through spark, and write data to the HBase tables manually. But is that even possible to load a text/csv/jason files directly to HBase through Spark? I cannot see anybody talking about it. So, just checking. If possible, please guide me to a good website that explains the scala code in detail to get it done.
Thank you,
There are multiple ways you can do that.
Spark HBase connector (SHC):
https://github.com/hortonworks-spark/shc
You can see a lot of examples at that link.
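For instance, with SHC you can read the CSV into a DataFrame and write it straight to HBase through a catalog. A sketch only: the table name, column family, column mapping, and input path below are made-up placeholders, and the CSV is assumed to have id, name and amount columns:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

val spark = SparkSession.builder().appName("csv-to-hbase").getOrCreate()

// Hypothetical catalog: row key taken from the "id" column, the rest goes into column family "cf"
val catalog =
  s"""{
     |  "table": {"namespace": "default", "name": "payments"},
     |  "rowkey": "key",
     |  "columns": {
     |    "id":     {"cf": "rowkey", "col": "key",    "type": "string"},
     |    "name":   {"cf": "cf",     "col": "name",   "type": "string"},
     |    "amount": {"cf": "cf",     "col": "amount", "type": "string"}
     |  }
     |}""".stripMargin

// Read the CSV (placeholder path), then save the DataFrame to HBase via the SHC data source
val csvDf = spark.read
  .option("header", "true")
  .csv("/path/to/input.csv")

csvDf.write
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "5"))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()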
You can also use Spark core to load the data into HBase using HBaseConfiguration.
Code example:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes

val fileRDD = sc.textFile(args(0), 2)
val transformedRDD = fileRDD.map { line => convertToKeyValuePairs(line) }

val conf = HBaseConfiguration.create()
conf.set(TableOutputFormat.OUTPUT_TABLE, "tableName")
conf.set("hbase.zookeeper.quorum", "localhost:2181")
conf.set("hbase.master", "localhost:60000")
conf.set("fs.default.name", "hdfs://localhost:8020")
conf.set("hbase.rootdir", "/hbase")

val jobConf = new Configuration(conf)
// The RDD carries (ImmutableBytesWritable, Put) pairs, so declare those as the job's output types
jobConf.set("mapreduce.job.output.key.class", classOf[ImmutableBytesWritable].getName)
jobConf.set("mapreduce.job.output.value.class", classOf[Put].getName)
jobConf.set("mapreduce.job.outputformat.class", classOf[TableOutputFormat[ImmutableBytesWritable]].getName)

transformedRDD.saveAsNewAPIHadoopDataset(jobConf)

def convertToKeyValuePairs(line: String): (ImmutableBytesWritable, Put) = {
  val cfDataBytes = Bytes.toBytes("cf")
  val fields = line.split("\\|") // "|" must be escaped, since String.split takes a regex
  val rowkey = Bytes.toBytes(fields(1))
  val put = new Put(rowkey)
  put.add(cfDataBytes, Bytes.toBytes("PaymentDate"), Bytes.toBytes(fields(0)))
  put.add(cfDataBytes, Bytes.toBytes("PaymentNumber"), Bytes.toBytes(fields(1)))
  put.add(cfDataBytes, Bytes.toBytes("VendorName"), Bytes.toBytes(fields(2)))
  put.add(cfDataBytes, Bytes.toBytes("Category"), Bytes.toBytes(fields(3)))
  put.add(cfDataBytes, Bytes.toBytes("Amount"), Bytes.toBytes(fields(4)))
  (new ImmutableBytesWritable(rowkey), put)
}
You can also use this one:
https://github.com/nerdammer/spark-hbase-connector