Use cases for using multiple queries in Spark Structured Streaming - Scala

I have a requirement of streaming from multiple Kafka topics [Avro-based] and putting them into Greenplum with a small modification to the payload.
The Kafka topics are defined as a list in a configuration file, and each Kafka topic will have one target table.
I am looking for a single Spark Structured Streaming application, where an update to the configuration file is enough to start listening to a new topic or stop listening to an existing one.
I am looking for help as I am confused about using a single query vs multiple:
val query1 = df.writeStream.start()
val query2 = df.writeStream.start()
spark.streams.awaitAnyTermination()
or
df.writeStream.start().awaitAnyTermination()
Under which use cases should multiple queries be used over a single query?

Apparently, you can use a regex pattern for consuming data from different Kafka topics.
Let's say you have topic names like "topic-ingestion1" and "topic-ingestion2" - then you can create a regex pattern to consume data from all topics starting with "topic-ingestion".
Once a new topic matching your regex pattern gets created, Spark will automatically start streaming data from that newly created topic.
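For instance, a minimal sketch of the subscribePattern option (broker address, pattern, and checkpoint path below are placeholders; Avro payloads would still need to be decoded before writing to Greenplum):

import org.apache.spark.sql.SparkSession

object PatternIngest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pattern-ingest").getOrCreate()

    // Subscribe with a regex instead of a fixed list of topics; topics created
    // later that match the pattern are picked up automatically by the source.
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")   // placeholder broker
      .option("subscribePattern", "topic-ingestion.*")        // regex over topic names
      .load()

    // For illustration only: cast the raw value to a string and print it.
    df.selectExpr("topic", "CAST(value AS STRING) AS value")
      .writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/checkpoints/pattern-ingest")  // placeholder path
      .start()
      .awaitTermination()
  }
}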
Reference:
[https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#consumer-caching]
You can use the parameter "spark.kafka.consumer.cache.timeout" to specify your cache timeout.
From the Spark documentation:
spark.kafka.consumer.cache.timeout - The minimum amount of time a consumer may sit idle in the pool before it is eligible for eviction by the evictor.
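A minimal sketch of setting that property when building the session (the 10-minute value is purely illustrative; the same property can instead be passed via --conf on spark-submit):

import org.apache.spark.sql.SparkSession

object KafkaIngest extends App {
  // "spark.kafka.consumer.cache.timeout" is the property quoted above; it takes
  // a duration value, and "10m" here is just an example.
  val spark = SparkSession.builder()
    .appName("kafka-ingest")
    .config("spark.kafka.consumer.cache.timeout", "10m")
    .getOrCreate()
}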
Let's say you have multiple sinks, where you are reading from Kafka and writing into two different locations like HDFS and HBase - then you can branch your application logic into two writeStreams.
If the sink (Greenplum) supports batch mode of operations, then you can look at the foreachBatch() function from Spark Structured Streaming. It allows you to reuse the same batchDF for both operations, as in the sketch below.
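A minimal sketch of that idea, assuming a plain JDBC write for the Greenplum side (URL, table, credentials and paths are placeholders; the Greenplum-Spark connector could be used instead):

import org.apache.spark.sql.{DataFrame, SparkSession}

object DualSinkApp extends App {
  val spark = SparkSession.builder().appName("dual-sink").getOrCreate()

  val df = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")   // placeholder broker
    .option("subscribe", "topic-ingestion1")                // placeholder topic
    .load()

  // One micro-batch, two sinks: persist the batch so it is not recomputed per sink.
  val writeToSinks: (DataFrame, Long) => Unit = (batchDF, batchId) => {
    batchDF.persist()

    // Sink 1: append the raw batch to HDFS as Parquet (placeholder path).
    batchDF.write.mode("append").parquet("hdfs:///data/raw/topic-ingestion1")

    // Sink 2: JDBC append into Greenplum (placeholder connection details).
    batchDF.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
      .write
      .format("jdbc")
      .option("url", "jdbc:postgresql://greenplum-host:5432/analytics")
      .option("dbtable", "public.topic_ingestion1")
      .option("user", "etl")
      .option("password", "secret")
      .mode("append")
      .save()

    batchDF.unpersist()
  }

  df.writeStream
    .foreachBatch(writeToSinks)
    .option("checkpointLocation", "/tmp/checkpoints/dual-sink")  // placeholder path
    .start()
    .awaitTermination()
}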

Related

Kafka Connect HDFS: flatten array records

We are experimenting with Kafka Connect HDFS and have source data which contains an array of Avro records (referred to as the container record below) which are pushed to topics.
The plan was to use kafka-connect-hdfs to flatten the array in the container record into individual records, write them to HDFS as Avro, and then commit the actual container record.
However, it seems that kafka-connect-hdfs does not support this out of the box.
In an attempt to navigate the source, we found that there is a tight mapping of topic names and offset management, and attempting to transform 1 SinkRecord into multiple SinkRecords deviates from the expected behavior.
So I wanted to verify: is a 1-to-n transformation possible? If so, how?
Kafka Connect supports Single Message Transforms (SMT), which is the area of functionality, if any, in Kafka Connect that would do what you want. However, IIRC SMTs are only for 1:1 input/output record processing, not 1:n - that is, I don't think (but am willing to be corrected) that you can generate additional output records from a single input record.
You probably want to look at Kafka Streams, which would give you the full capability of doing this. You'd write a streams app which subscribes to the source topic, does the necessary array processing, and writes the results back to a new Kafka topic. The new Kafka topic would then be the source topic for your Kafka Connect HDFS sink.
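A rough sketch of that streams app in the Kafka Streams Scala DSL (kafka-streams-scala 2.x import paths; topic names and the split logic are placeholders - a real app would use the Avro serdes and the generated container classes rather than strings):

import java.util.Properties

import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.Serdes._
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}

object FlattenContainerApp extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "flatten-container-records")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")  // placeholder broker

  val builder = new StreamsBuilder()

  builder
    .stream[String, String]("container-topic")                // placeholder input topic
    .flatMapValues(container => container.split(",").toSeq)   // 1 input record -> n output records; real logic would unwrap the Avro array
    .to("flattened-topic")                                     // Kafka Connect HDFS then sinks this topic

  val streams = new KafkaStreams(builder.build(), props)
  streams.start()
  sys.addShutdownHook(streams.close())
}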

Beam / Cloud Dataflow: How to Add Kafka (or PubSub) topics to Running Stream

(How) is it possible to dynamically add or remove topics (Kafka or PubSub) as a source or sink in a running pipeline? Or to have a dynamic pattern as a sink, like is possible with BigQuery table names?
Some background: we have different topics, one per customer, to better facilitate downstream aggregations and also to clean them up / add them on the fly. Kafka is used to be able to backfill calculations over periods that are longer than is possible with PubSub.
The options I have in mind right now are either extending KafkaIO to support this, or updating the pipeline each time a topic is added or removed (meaning there will be some lag in the stream while it's updated). Or maybe I have the wrong design pattern in my head and there are other solutions for this.
You are correct that right now the easiest solution is updating the pipeline.
However, a new API called Splittable DoFn (SDF) is currently in active development; it is already available in the Cloud Dataflow runner in streaming mode and in the Direct runner, and implementation is in progress in Flink and Apex runners.
It makes it possible to do things like "create a PCollection of Kafka topic names and read each of those topics", so you can have one pipeline stage produce names of topics to be read (e.g. the names themselves could arrive over Kafka or Pubsub every time a customer is added, or you could write an SDF to watch the result of a database query returning a list of customers and emit new ones), and another stage reading those topics.
See http://s.apache.org/splittable-do-fn for the design doc of the API, and http://s.apache.org/textio-sdf for an example proposed refactoring of TextIO using this API - you may want to try to modify KafkaIO yourself in a similar fashion.

Using kafka streams to create a table based on elasticsearch events

Is it possible to use Kafka Streams to create a pipeline that reads JSON from a Kafka topic, applies some logic to it, and sends the results to another Kafka topic or somewhere else?
For example, I populate my topic using logs from Elasticsearch. That is pretty easy using a simple Logstash pipeline.
Once I have my logs in the Kafka topic, I want to extract some pieces of information from each log and put them in a sort of "table" with N columns (is Kafka capable of this?) and then put the table somewhere else (another topic or a DB).
I didn't find any example that satisfies my criteria.
thanks
Yes, it's possible.
There is no concept of columns in Kafka or Kafka Streams. However, you typically just define a plain old Java object of your choice with the fields that you want (fields being the equivalent of columns in this case). You produce the output in that format to an output topic (using an appropriately chosen serializer). Finally, if you want to store the result in a relational database, you map the fields to columns, typically using a Kafka Connect JDBC sink:
http://docs.confluent.io/current/connect/connect-jdbc/docs/sink_connector.html
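For illustration, a rough sketch in the Kafka Streams Scala DSL (kafka-streams-scala 2.x imports; topic names, the delimiter, and the field set are made up) - the output topic can then feed the JDBC sink connector:

import java.util.Properties

import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.Serdes._
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}

// The "columns" of the table, modelled as a plain case class.
case class LogRow(timestamp: String, level: String, message: String)

object LogsToTableApp extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "logs-to-table")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")  // placeholder broker

  val builder = new StreamsBuilder()

  builder
    .stream[String, String]("es-logs")   // placeholder topic fed by the logstash pipeline
    .mapValues { raw =>
      // Naive parsing for illustration; a real app would use a JSON library or Avro.
      val parts = raw.split("\\|", 3)
      LogRow(parts.lift(0).getOrElse(""), parts.lift(1).getOrElse(""), parts.lift(2).getOrElse(""))
    }
    .mapValues(row =>
      s"""{"timestamp":"${row.timestamp}","level":"${row.level}","message":"${row.message}"}""")
    .to("parsed-logs")                   // the JDBC sink connector maps these fields to columns

  val streams = new KafkaStreams(builder.build(), props)
  streams.start()
  sys.addShutdownHook(streams.close())
}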

How to use Kafka consumer in spark

I am using spark 2.1 and Kafka 0.10.1.
I want to process the data by reading the entire data of specific topics in Kafka on a daily basis.
For spark streaming, I know that createDirectStream only needs to include a list of topics and some configuration information as arguments.
However, I realized that createRDD would have to include all of the topic, partitions, and offset information.
I want to make batch processing as convenient as streaming in spark.
Is it possible?
I suggest you read this text from Cloudera.
The example shows how to read the data from Kafka exactly once, persisting the offsets in Postgres to take advantage of its ACID guarantees.
I hope that will solve your problem.
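For reference, a minimal sketch of the batch read with createRDD from the spark-streaming-kafka-0-10 module (broker, topic, and the hard-coded offsets are placeholders; in practice the offsets would come from wherever you persist them, e.g. the Postgres table mentioned above):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkContext
import org.apache.spark.streaming.kafka010.{KafkaUtils, LocationStrategies, OffsetRange}

import scala.collection.JavaConverters._

object DailyKafkaBatch {
  def readDay(sc: SparkContext): Unit = {
    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "localhost:9092",             // placeholder broker
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "daily-batch"
    )

    // createRDD needs one OffsetRange per topic-partition:
    // (topic, partition, fromOffset, untilOffset).
    val offsetRanges = Array(
      OffsetRange("my-topic", 0, 0L, 100L),
      OffsetRange("my-topic", 1, 0L, 100L)
    )

    val rdd = KafkaUtils.createRDD[String, String](
      sc, kafkaParams.asJava, offsetRanges, LocationStrategies.PreferConsistent)

    // Each element is a ConsumerRecord; take the values for downstream batch processing.
    rdd.map(_.value()).take(10).foreach(println)
  }
}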

Spark/Spark Streaming in production without HDFS

I have been developing applications using Spark/Spark Streaming, but so far have always used HDFS for file storage. However, I have reached a stage where I am exploring whether it can be done (in production, running 24/7) without HDFS. I tried sifting through the Spark user group but have not found any concrete answer so far. Note that I do use checkpoints and stateful stream processing using updateStateByKey.
Depending on the streaming source (I've been using Kafka), you may not need to use checkpoints etc.
Since Spark 1.3 there has been a direct approach with many benefits. Quoting the documentation:
Simplified Parallelism: No need to create multiple input Kafka streams and union them. With directStream, Spark Streaming will create as many RDD partitions as there are Kafka partitions to consume, which will all read data from Kafka in parallel. So there is a one-to-one mapping between Kafka and RDD partitions, which is easier to understand and tune.
Efficiency: Achieving zero data loss in the first approach required the data to be stored in a Write Ahead Log, which further replicated the data. This is actually inefficient as the data effectively gets replicated twice - once by Kafka, and a second time by the Write Ahead Log. This second approach eliminates the problem as there is no receiver, and hence no need for Write Ahead Logs.
Exactly-once semantics: The first approach uses Kafka's high-level API to store consumed offsets in Zookeeper. This is traditionally the way to consume data from Kafka. While this approach (in combination with write ahead logs) can ensure zero data loss (i.e. at-least-once semantics), there is a small chance some records may get consumed twice under some failures. This occurs because of inconsistencies between data reliably received by Spark Streaming and offsets tracked by Zookeeper. Hence, in this second approach, we use a simple Kafka API that does not use Zookeeper; offsets are tracked only by Spark Streaming within its checkpoints. This eliminates inconsistencies between Spark Streaming and Zookeeper/Kafka, and so each record is received by Spark Streaming effectively exactly once despite failures.
If you are using Kafka, you can find out more under "Approach 2: Direct Approach (No Receivers)" here:
https://spark.apache.org/docs/1.3.0/streaming-kafka-integration.html
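For example, a minimal sketch of that direct approach with the Spark 1.3-era spark-streaming-kafka (0.8) artifact matching the linked docs (broker, topic, and batch interval are placeholders):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DirectKafkaApp extends App {
  val conf = new SparkConf().setAppName("direct-kafka-no-hdfs")
  val ssc = new StreamingContext(conf, Seconds(10))                    // placeholder batch interval

  val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")    // placeholder broker
  val topics = Set("my-topic")                                          // placeholder topic

  // Direct approach: no receiver and no Write Ahead Log; offsets are tracked by
  // Spark Streaming itself (in checkpoints, if you choose to enable them).
  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topics)

  stream.map(_._2).print()   // the DStream elements are (key, value) pairs

  ssc.start()
  ssc.awaitTermination()
}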