Kafka Spark Stream save directly to Redis - scala

I am using Scala to get a Kafka stream and want to insert this data directly into Redis. What is the optimal strategy to do so?
val kafkaStream = KafkaUtils.createStream(ssc, "192.168.0.40:2181", "group", topics, StorageLevel.MEMORY_AND_DISK)
Earlier I was trying to use https://github.com/debasishg/scala-redis, but it does not work with Spark, so I had to collect the RDD and then save the records into Redis, which creates a lot of overhead in my project. So I am looking for a solution where I can push this stream of messages directly into Redis while also maintaining the ZScore.
Thanks,
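One common pattern (a minimal sketch, not from the original thread) is to write from the executors inside foreachRDD / foreachPartition instead of collecting to the driver; the Jedis client, host, port and sorted-set key below are assumptions:

import redis.clients.jedis.Jedis

kafkaStream.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // One connection per partition, opened on the executor (host/port are hypothetical)
    val jedis = new Jedis("redis-host", 6379)
    partition.foreach { case (_, message) =>
      // ZADD maintains the sorted-set score; here the score is simply the arrival time
      jedis.zadd("messages", System.currentTimeMillis().toDouble, message)
    }
    jedis.close()
  }
}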

Related

Use cases for using multiple queries for Spark Structured streaming

I have a requirement to stream from multiple Kafka topics [Avro based] and put them into Greenplum with a small modification to the payload.
The Kafka topics are defined as a list in a configuration file, and each Kafka topic will have one target table.
I am looking for a single Spark Structured Streaming application, where an update to the configuration file starts listening to new topics or stops listening to existing ones.
I am looking for help as I am confused about using a single query vs multiple:
val query1 = df.writeStream.start()
val query2 = df.writeStream.start()
spark.streams.awaitAnyTermination()
or
df.writeStream.start().awaitTermination()
Under which use cases should multiple queries be used over a single query?
Apparently, you can use a regex pattern for consuming the data from different Kafka topics.
Let's say you have topic names like "topic-ingestion1" and "topic-ingestion2" - then you can create a regex pattern such as "topic-ingestion.*" to consume data from all matching topics.
Once a new topic gets created that matches your regex pattern, Spark will automatically start streaming data from the newly created topic, as sketched below.
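A minimal sketch of subscribing by pattern with the Structured Streaming Kafka source (the broker address and pattern are assumptions):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("multi-topic-ingestion").getOrCreate()

// Subscribe to every topic matching the pattern; newly created matching topics are picked up automatically
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")  // hypothetical broker
  .option("subscribePattern", "topic-ingestion.*")
  .load()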
Reference:
[https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#consumer-caching]
You can use the "spark.kafka.consumer.cache.timeout" parameter to specify your consumer cache timeout.
From the Spark documentation:
spark.kafka.consumer.cache.timeout - The minimum amount of time a consumer may sit idle in the pool before it is eligible for eviction by the evictor.
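For example, a sketch of setting it on the session (the value is illustrative; it can equally be passed via --conf on spark-submit):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("kafka-consumer-cache")                      // hypothetical app name
  .config("spark.kafka.consumer.cache.timeout", "5m")   // idle time before a pooled consumer is evicted
  .getOrCreate()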
Let's say you have multiple sinks, i.e. you read from Kafka and write into two different locations such as HDFS and HBase - then you can branch your application logic into two writeStreams.
If the sink (Greenplum) supports batch mode of operation, then you can look at the foreachBatch() function from Spark Structured Streaming. It allows you to reuse the same batchDF for both operations, as sketched below.
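A minimal sketch of foreachBatch reusing one micro-batch for two sinks (the output path, JDBC URL and table name are assumptions; credentials are omitted):

import org.apache.spark.sql.DataFrame

df.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    batchDF.persist()  // cache so the same micro-batch can be written twice
    batchDF.write.format("parquet").mode("append").save("/tmp/hdfs-sink")  // e.g. HDFS
    batchDF.write.format("jdbc").mode("append")
      .option("url", "jdbc:postgresql://greenplum-host:5432/db")  // hypothetical Greenplum URL
      .option("dbtable", "target_table")                          // hypothetical table
      .save()
    batchDF.unpersist()
  }
  .start()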

Spark - Get earliest and latest offset of Kafka without opening stream

I am currently using spark-streaming-kafka-0-10_2.11 to connect my Spark application to the Kafka queue. For streams everything works fine. For a specific scenario, however, I just need the whole content of the Kafka queue exactly once - for this I got the suggestion to use KafkaUtils.createRDD instead (SparkStreaming: Read Kafka Stream and provide it as RDD for further processing).
However, for spark-streaming-kafka-0-10_2.11 I cannot figure out how to get the earliest and latest offsets for my Kafka topic, which I need to create the OffsetRange that I have to hand to the createRDD method.
What is the recommended way to get those offsets without opening a stream? Any help would be greatly appreciated.
After reading several discussions I am able to get the earliest or latest offset of a specific partition with:
import kafka.api.{OffsetRequest, PartitionOffsetRequestInfo}
import kafka.common.TopicAndPartition
import kafka.consumer.SimpleConsumer

val consumer = new SimpleConsumer(host, port, timeout, bufferSize, "offsetfetcher")
val topicAndPartition = new TopicAndPartition(topic, initialPartition)
val request = OffsetRequest(Map(topicAndPartition -> PartitionOffsetRequestInfo(OffsetRequest.EarliestTime, 1)))
val offsets = consumer.getOffsetsBefore(request).partitionErrorAndOffsets(topicAndPartition).offsets
offsets.head
but I still do not know how to replicate the behaviour of "--from-beginning" from the kafka-console-consumer.sh CLI with the KafkaUtils.createRDD approach.
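One option (a sketch, not from the original thread) is to query a plain KafkaConsumer for the beginning and end offsets and build the OffsetRanges from those; this assumes kafka-clients 0.10.1+ (where beginningOffsets/endOffsets exist), and the broker address, topic and group id are hypothetical:

import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.OffsetRange

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("group.id", "offset-lookup")

val consumer = new KafkaConsumer[String, String](props)
val partitions = consumer.partitionsFor("mytopic").asScala
  .map(p => new TopicPartition(p.topic, p.partition)).asJava

val earliest = consumer.beginningOffsets(partitions).asScala  // "from beginning"
val latest = consumer.endOffsets(partitions).asScala
consumer.close()

// One OffsetRange per partition, spanning earliest to latest - the whole queue exactly once
val offsetRanges = earliest.map { case (tp, from) =>
  OffsetRange(tp.topic, tp.partition, from, latest(tp))
}.toArray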

Multiple Streams support in Apache Flink Job

My question is regarding the Apache Flink framework.
Is there any way to support more than one streaming source, like Kafka and Twitter, in a single Flink job? Is there any workaround? Can we process more than one streaming source at a time in a single Flink job?
I am currently working with Spark Streaming and this is a limitation there.
Is this achievable with other streaming frameworks like Apache Samza, Storm or NiFi?
A response is much awaited.
Yes, this is possible in Flink and Storm (no clue about Samza or NiFi...)
You can add as many source operators as you want and each can consume from a different source.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
Properties properties = ... // see Flink webpage for more details
DataStream<String> stream1 = env.addSource(new FlinkKafkaConsumer08<>("topic", new SimpleStringSchema(), properties));
DataStream<String> stream2 = env.readTextFile("/tmp/myFile.txt");
DataStream<String> allStreams = stream1.union(stream2);
For Storm, using the low-level API, the pattern is similar. See: An Apache Storm bolt receive multiple input tuples from different spout/bolt
Some solutions have already been covered; I just want to add that in a NiFi flow you can ingest many different sources and process them either separately or together.
It is also possible to ingest a source, and have multiple teams build flows on this without needing to ingest the data multiple times.

Spark Streaming Update on Kafka Direct Stream parameter

I have the following code:
//Set basic spark parameters
val conf = new SparkConf()
.setAppName("Cartographer_jsonInsert")
val sc = new SparkContext(conf)
val ssc = new StreamingContext(sc, Seconds(2))
val messagesDStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, Tuple4[String, Int, Long, String]](
  ssc, getKafkaBrokers, getKafkaTopics("processed"),
  (mmd: MessageAndMetadata[String, String]) => {
    (mmd.topic, mmd.partition, mmd.offset, mmd.message().toString)
  })
getKafkaBrokers and getKafkaTopics call an API that checks the database for specific new topics as we add them to our system. Does the StreamingContext re-evaluate these variables while it is running, so that messagesDStream gets re-created with the new values on each iteration?
It does not look like it does; is there any way to make that happen?
Tathagata Das, one of the creators of Spark Streaming, answered a similar question on the Spark User List regarding modifications of existing DStreams.
Currently Spark Streaming does not support addition/deletion/modification of DStream after the streaming context has been started.
Nor can you restart a stopped streaming context.
Also, multiple spark contexts (and therefore multiple streaming contexts) cannot be run concurrently in the same JVM.
I don't see a straightforward way of implementing this with Spark Streaming, as you have no way of updating your graph; you need much more control than is currently available. Maybe a solution based on Reactive Kafka, the Akka Streams connector for Kafka, would work - or any other streaming solution where you control the source.
Is there any reason you are not using Akka Streams graphs with reactive-kafka (https://github.com/akka/reactive-kafka)? It is very easy to build a reactive stream where the source is given a topic, a flow processes the messages, and a sink writes the result.
I have built a sample application using the same approach: https://github.com/asethia/akka-streaming-graph
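A minimal sketch of such a source -> flow -> sink graph with reactive-kafka (now Alpakka Kafka); the broker address, group id and topic are assumptions, and the sink is just a placeholder:

import akka.actor.ActorSystem
import akka.kafka.scaladsl.Consumer
import akka.kafka.{ConsumerSettings, Subscriptions}
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.Sink
import org.apache.kafka.common.serialization.StringDeserializer

implicit val system: ActorSystem = ActorSystem("kafka-graph")
implicit val materializer: ActorMaterializer = ActorMaterializer()

val consumerSettings =
  ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
    .withBootstrapServers("localhost:9092")  // hypothetical broker
    .withGroupId("processed-group")          // hypothetical group id

Consumer
  .plainSource(consumerSettings, Subscriptions.topics("processed"))
  .map(record => record.value)       // flow: transform each Kafka record
  .runWith(Sink.foreach(println))    // sink: replace with a real sink (database, HDFS, ...)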

Stream the most recent data in cassandra with spark streaming

I continuously have data being written to cassandra from an outside source.
Now, I am using spark streaming to continuously read this data from cassandra with the following code:
val ssc = new StreamingContext(sc, Seconds(5))
val cassandraRDD = ssc.cassandraTable("keyspace2", "feeds")
val dstream = new ConstantInputDStream(ssc, cassandraRDD)
dstream.foreachRDD { rdd =>
println("\n"+rdd.count())
}
ssc.start()
ssc.awaitTermination()
sc.stop()
However, the following line:
val cassandraRDD = ssc.cassandraTable("keyspace2", "feeds")
takes the entire table's data from Cassandra every time, not just the newest data saved into the table.
What I want to do is have spark streaming read only the latest data, ie, the data added after its previous read.
How can I achieve this? I tried to Google this but found very little documentation about it.
I am using spark 1.4.1, scala 2.10.4 and cassandra 2.1.12.
Thanks!
EDIT:
The suggested duplicate question (asked by me) is NOT a duplicate, because it talks about connecting Spark Streaming and Cassandra, while this question is about streaming only the latest data. BTW, streaming from Cassandra IS possible using the code I provided; however, it reads the entire table every time and not just the latest data.
There is some low-level work planned on Cassandra that will allow notifying external systems (an indexer, a Spark stream, etc.) of new mutations incoming to Cassandra; read this: https://issues.apache.org/jira/browse/CASSANDRA-8844