Spark Structured Streaming - Processing each row - scala

I am using structured streaming with Spark 2.1.1. I need to apply some business logic to incoming messages (from Kafka source).
essentially, I need to pick up the message, get some key values, look them up in HBase and perform some more biz logic on the dataset. the end result is a string message that needs to be written out to another Kafka queue.
However, since the abstraction for incoming messages is a dataframe (unbounded table - structured streaming), I have to iterate through the dataset received during a trigger through mapPartitions (partitions due to HBase client not being serializable).
During my process, i need to iterate through each row for executing the business process for the same.
Is there a better approach possible that could help me avoid the dataFrame.mapPartitions call? I feel its sequential and iterative !!
Structured streaming basically forces me to generate an output data frame out of my business process, whereas there is none to start with. What other design pattern can I use to achieve my end goal ?
Would you recommend an alternative approach ?

When you talk about working with Dataframes in Spark, speaking very broadly, you can do one of 3 things
a) Generate a Dataframe
b) Transform a data frame
c) Consume a data frame
In structured streaming, a streaming DataFrame is generated using a DataSource. Normally you create sources using methods exposed sparkSession.readStream method. This method returns a DataStreamReader which has several methods for reading from various kinds of input. All of there return a DataFrame. Internally it creates a DataSource. Spark allows you to implement your own DataSource, but they recommend against it, because as of 2.2, the interface is considered experimental
You transform data frames mostly using map or reduce, or using spark SQL. There are different flavors of map (map, mapPartition, mapParititionWithIndex), etc. All of them basically take a row and return a row. Internally Spark does the work of parallelizing the calls to your map method. It partitions the data, spreads it around on executors on the cluster, and calls your map method in the executor. You don't need to worry about parallelism. It's built under the hood. mapParitions is not "sequential". Yes, rows within a partition are executed sequentially, but multiple partitions are executed in parallel. You can easily control the degree of parallelism by partitioning your dataframe. You have 5 partitions, you will have 5 processes running in parallel. You have 200, you can have 200 of them running in parallel if you have 200 cores
Note that there is nothing stopping you from going out to external systems that manage state inside your transformation. However, your transformations should be idempotent. Given a set of input, they should always generate the same output, and leave the system in the same state over time. This can be difficult if you are talking to external systems inside your transformation. Structured Streaming provides at least once guarantee. The means that the same row might be transformed multiple times. So, if you are doing something like adding money to a bank account, you might find that you have added the same amount of money twice to some of the accounts.
Data is consumed by sinks. Normally, you add a sink by calling the format method on a Dataframe and then calling start. StructuredStreaming has a handful of inbuilt sinks which (except for one) are more or less useless.You can create your custom Sink but again it's not recommended because the interface is experimental. The only useful sink is what you would implement. It is called ForEachSink. Spark will call your for each sink with all the rows in your partition. You can do whatever you want with the rows, which includes writing it to Hbase. Note that because of the at least once nature of Structured Streaming, the same row might be fed to your ForEachSink multiple times. You are expected to implement it in an idempotent manner. Also, if you have multiple sinks, data is written to sinks in parallel. You cannot control in what order the sinks are called. It can happen that one sink is getting data from one micro batch while another sink is still processing data for the previous micro batch. Essentially, the Sinks are eventually consistent, not immediately consistent.
Generally, the cleanest way to build your code is to avoid going to outside systems inside your transformations. Your transformations should purely transform data in data frames. If you want data from HBase, get it into a data frame, join it with your streaming data frame, and then transform it. This is because when you go to outside systems, it becomes difficult to scale. You want to scale up your transformations by increasing partitioning on your data frames and adding nodes. However, too many nodes talking to external systems can increase the load on the external systems and cause bottlenecks, Separating transformation from data retrieval allows you to scale them independently.
BUT!!!! there are big buts here......
1) When you talk about Structured streaming, there is no way to implement a Source that can selectively get data from your HBase based on the data in your input. You have to do this inside a map(-like) method. So, IMO, what you have is perfectly fine if the data in Hbase changes or there is a lot of data that you don't want to keep in memory. If your data in HBase is small and unchanging, then it's better to read it into a batch data frame, cache it and then join it with your streaming data frame. Spark will load all the data into its own memory/disk storage, and keep it there. If your data is small and changing very frequently, it's better to read it in a data frame, don't cache it and join it with a streaming data frame. Spark will load the data from HBase every time it runs a micro batch.
2) there is no way to order the execution of 2 separate Sinks. So, if your requirement requires you to write to a database, and write to Kafka, and you want to guarantee that a row in Kafka is written after the row is committed in the database, then the only way to do that is to
a) do both writes in a For each Sink.
b)write to one system in a map-like function and the other in a for each sink
Unfortunately, if you have a requirement that requires you to read data from a streaming source, join it with data from batch source, transform it, write it to database, call an API, get the result from the API and write the result of the API to Kafka, and those operations have to be done in exact order, then the only way you can do this is by implementing sink logic in a transformation component. You have to make sure you keep the logic separate in separate map functions, so you can parallelize them in an optimal manner.
Also, there is no good way to know when a micro-batch is completely processed by your application, especially if you have multiple sinks

try ForeachWriter, In ForeachWriter process() method receives single row from data frame.
and you can process the data as you want. https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/ForeachWriter.html

Related

Category projections using kafka and cassandra for event-sourcing

I'm using Cassandra and Kafka for event-sourcing, and it works quite well. But I've just recently discovered a potentially major flaw in the design/set-up. A brief intro to how it is done:
The aggregate command handler is basically a kafka consumer, which consumes messages of interest on a topic:
1.1 When it receives a command, it loads all events for the aggregate, and replays the aggregate event handler for each event to get the aggregate up to current state.
1.2 Based on the command and businiss logic it then applies one or more events to the event store. This involves inserting the new event(s) to the event store table in cassandra. The events are stamped with a version number for the aggregate - starting at version 0 for a new aggregate, making projections possible. In addition it sends the event to another topic (for projection purposes).
1.3 A kafka consumer will listen on the topic upon these events are published. This consumer will act as a projector. When it receives an event of interest, it loads the current read model for the aggregate. It checks that the version of the event it has received is the expected version, and then updates the read model.
This seems to work very well. The problem is when I want to have what EventStore calls category projections. Let's take Order aggregate as an example. I can easily project one or more read models pr Order. But if I want to for example have a projection which contains a customers 30 last orders, then I would need a category projection.
I'm just scratching my head how to accomplish this. I'm curious to know if any other are using Cassandra and Kafka for event sourcing. I've read a couple of places that some people discourage it. Maybe this is the reason.
I know EventStore has support for this built in. Maybe using Kafka as event store would be a better solution.
With this kind of architecture, you have to choose between:
Global event stream per type - simple
Partitioned event stream per type - scalable
Unless your system is fairly high throughput (say at least 10s or 100s of events per second for sustained periods to the stream type in question), the global stream is the simpler approach. Some systems (such as Event Store) give you the best of both worlds, by having very fine-grained streams (such as per aggregate instance) but with the ability to combine them into larger streams (per stream type/category/partition, per multiple stream types, etc.) in a performant and predictable way out of the box, while still being simple by only requiring you to keep track of a single global event position.
If you go partitioned with Kafka:
Your projection code will need to handle concurrent consumer groups accessing the same read models when processing events for different partitions that need to go into the same models. Depending on your target store for the projection, there are lots of ways to handle this (transactions, optimistic concurrency, atomic operations, etc.) but it would be a problem for some target stores
Your projection code will need to keep track of the stream position of each partition, not just a single position. If your projection reads from multiple streams, it has to keep track of lots of positions.
Using a global stream removes both of those concerns - performance is usually likely to be good enough.
In either case, you'll likely also want to get the stream position into the long term event storage (i.e. Cassandra) - you could do this by having a dedicated process reading from the event stream (partitioned or global) and just updating the events in Cassandra with the global or partition position of each event. (I have a similar thing with MongoDB - I have a process reading the 'oplog' and copying oplog timestamps into events, since oplog timestamps are totally ordered).
Another option is to drop Cassandra from the initial command processing and use Kafka Streams instead:
Partitioned command stream is processed by joining with a partitioned KTable of aggregates
Command result and events are computed
Atomically, KTable is updated with changed aggregate, events are written to event stream and command response is written to command response stream.
You would then have a downstream event processor that copies the events into Cassandra for easier querying etc. (and which can add the Kafka stream position to each event as it does it to give the category ordering). This can help with catch up subscriptions, etc. if you don't want to use Kafka for long term event storage. (To catch up, you'd just read as far as you can from Cassandra and then switch to streaming from Kafka from the position of the last Cassandra event). On the other hand, Kafka itself can store events for ever, so this isn't always necessary.
I hope this helps a bit with understanding the tradeoffs and problems you might encounter.

Can I attach multiple transformers/processors to a single stream in Apache Kafka

In all example I see a simple single transformer/processor topology for Kafka. My doubt is whether we can modularise application logic by breaking down in to multiple transformers/processors applying sequentially to a single input stream.
Please find use case below :
Current application configuration is a single processor containing all processing logic tasks like filtering, validation, application logic, delaying(Kafka is too fast for dbs) and invoke SP/push to down stream.
But we are now planning to decouple all these operations by breaking down each task into separate processors/transformers of Kstream.
Since we are relatively new to Kafka, we are not sure of the pros and cons of this approach especially with respect to Kafka internals like state store/ task scheduling/ multithreading model.
Please share your expert opinions and experiences
Please note that we do not have control over topic, no new topic can be created for this design. The design must be feasible for the existing topic alone.
Kafka Streams allows you to split your logic into multiple processors. Internally, Kafka Streams implements a "depth-first" execution strategy. Thus, each time you call "forward" the output tuple is immediately processed by the downstream processor and "forward" return after downstream processing finished (note, that writing data into a topic and reading it back "breaks" the in-memory pipeline -- thus, when data is written to a topic, there is no guarantee when downstream processor will read and process those records).
If you have state that is shared between multiple processor, you would need to attach the store to all processor that need to access to store. The execution on the store will be single threaded and thus, there should be no performance difference.
As long as you connect processor directly (and not via topics) all processor will be part of the same tasks. Thus, there shouldn't be a performance difference.

How to do df.rdd or df.collect().foreach on streaming dataset?

This is the exception I am getting whenever I am trying to convert it.
val df_col = df.select("ts.user.friends_count").collect.map(_.toSeq)
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
All I am trying to do is replicate the following sql.dataframe operations in structured streaming.
df.collect().foreach(row => droolsCaseClass(row.getLong(0), row.getString(1)))
which is running fine in Dataframes but not in structured streaming.
collect is a big no-no even in Spark Core's RDD world due to the size of the data you may transfer back to the driver's single JVM. It just sets the boundary of the benefits of Spark as after collect you are in a single JVM.
With that said, think about unbounded data, i.e. a data stream, that will never terminate. That's Spark Structured Streaming.
A streaming Dataset is one that is never complete and the data inside varies every time you ask for the content, i.e. the result of executing the structured query over a stream of data.
You simply cannot say "Hey, give me the data that is the content of a streaming Dataset". That does not even make sense.
That's why you cannot collect on a streaming dataset. It is not possible up to Spark 2.2.1 (the latest version at the time of this writing).
If you want to receive the data that is inside a streaming dataset for a period of time (aka batch interval in Spark Streaming or trigger in Spark Structured Streaming) you write the result to a streaming sink, e.g. console.
You can also write your custom streaming sink that does collect.map(_.toSeq) inside addBatch which is the main and only method of a streaming sink. As a matter of fact, console sink does exactly it.
All I am trying to do is replicate the following sql.dataframe
operations in structured streaming.
df.collect().foreach(row => droolsCaseClass(row.getLong(0), row.getString(1)))
which is running fine in Dataframes but not in structured streaming.
The very first solution that comes to my mind is to use foreach sink:
The foreach operation allows arbitrary operations to be computed on the output data.
That, of course, does not mean that this is the best solution. Just one that comes to my mind immediately.

How to use DataFrames within SparkListener?

I've written a CustomListener (deriving from SparkListener, etc...) and it works fine, I can intercept the metrics.
The question is about using the DataFrames within the listener itself, as that assumes the usage of the same Spark Context, however as of 2.1.x only 1 context per JVM.
Suppose I want to write to disk some metrics in json. Doing it at ApplicationEnd is not possible, only at the last jobEnd (if you have several jobs, the last one).
Is that possible/feasible???
I'm trying to measure the perfomance of jobs/stages/tasks, record that and then analyze programmatically. May be that is not the best way?! Web UI is good - but I need to make things presentable
I can force the creation of dataframes upon endJob event, however there are a few errors thrown (basically they refer to not able to propogate events to the listener) and in general I would like to avoid unnecessary manipulations. I want to have a clean set of measurements that I can record and write to disk
SparkListeners should be as fast as possible as a slow SparkListener would block others to receive events. You could use separate threads to release the main event dispatcher thread, but you're still bound to the limitation of having a single SparkContext per JVM.
That limitation is however easily to overcome since you could ask for the current SparkContext using SparkContext.getOrCreate.
I'd however not recommend the architecture. That puts too much pressure on the driver's JVM that should rather "focus" on the application processing (not collecting events that probably it already does for web UI and/or Spark History Server).
I'd rather use Kafka or Cassandra or some other persistence storage to store events to and have some other processing application to consume them (just like Spark History Server works).

Oracle change-data-capture with Kafka best practices

I'm working on a project where we need to stream real-time updates from Oracle to a bunch of systems (Cassandra, Hadoop, real-time processing, etc). We are planing to use Golden Gate to capture the changes from Oracle, write them to Kafka, and then let different target systems read the event from Kafka.
There are quite a few design decisions that need to be made:
What data to write into Kafka on updates?
GoldenGate emits updates in a form of record ID, and updated field. These changes can be writing into Kafka in one of 3 ways:
Full rows: For every field change, emit the full row. This gives a full representation of the 'object', but probably requires making a query to get the full row.
Only updated fields: The easiest, but it's kind of a weird to work with as you never have a full representation of an object easily accessible. How would one write this to Hadoop?
Events: Probably the cleanest format ( and the best fit for Kafka), but it requires a lot of work to translate db field updates into events.
Where to perform data transformation and cleanup?
The schema in the Oracle DB is generated by a 3rd party CRM tool, and is hence not very easy to consume - there are weird field names, translation tables, etc. This data can be cleaned in one of (a) source system, (b) Kafka using stream processing, (c) each target system.
How to ensure in-order processing for parallel consumers?
Kafka allows each consumer to read a different partition, where each partition is guaranteed to be in order. Topics and partitions need to be picked in a way that guarantees that messages in each partition are completely independent. If we pick a topic per table, and hash record to partitions based on record_id, this should work most of the time. However what happens when a new child object is added? We need to make sure it gets processed before the parent uses it's foreign_id
One solution I have implemented is to publish only the record id into Kafka and in the Consumer, use a lookup to the origin DB to get the complete record. I would think that in a scenario like the one described in the question, you may want to use the CRM tool API to lookup that particular record and not reverse engineer the record lookup in your code.
How did you end up implementing the solution ?