Spark: Scheduling Within an Application with scala/java - scala

The doc states that it is possible to schedule multiple jobs from within one Spark Session / context. Can anyone give an example on how to do that? Can I launch the several jobs / Action, within future ? What Execution context should I use? I'm not entirely sure how spark manage that. How the driver or the cluster is aware of the many jobs being submitted from within the same driver. Is there anything that signal spark about it ? If someone has an example that would be great.
Motivation: My data is key-Value based, and has the requirement that for each group associated with a key I need to process them in
batch. In particular, I need to use mapPartition. That's because In each
partition I need to instantiate an non-serializable object for
processing my records.
(1) The fact is, I could indeed, group things using scala collection directly within the partitions, and process each group as a batch.
(2) The other way around, that i am exploring would be to filter the data by keys before end, and launch action/jobs for each of the filtered result (filtered collection). That way no need to group in each partition, and I can just process the all partition as a batch directly. I am assuming that the fair scheduler would do a good job to schedule things evenly between the jobs. If the fair Scheduler works well, i think this solution is more efficient. However I need to test it, hence, i wonder if someone could provide help on how to achieve threading within a spark session, and warn if there is any down side to it.
More over if anyone has had to make that choice/evaluation between the two approach, what was the outcome.
Note: This is a streaming application. each group of record associated with a key needs a specific configuration of an instantiated object, to be processed (imperatively as a batch). That object being non-serializable, it needs to be instantiated per partition

Related

Join a static and a dynamic Kafka source in Flink

Today, I'd like to address a conceptual topic about Flink, rather than a technical.
In our case, we do have two Kafka topics A and B, that need to be joined. The join should always include all elements from topic A, as well as all new elements from topic B. There's 2 possibilities to achieve this: always create a new consumer and start consumption of topic A from beginning, or keep all elements from topic A within a state, once consumed.
Right now, the technological approach is going via joining two DataStreams, which quickly shows us its limits for this use case, as there is no possibility to join streams without a window (fair enough). Elements from topic A are eventually lost, if the window moves on and I got the feeling regularly resetting the consumer would bypass the elaborate logic introduced by Flink.
The other approach I am looking towards right now, would be to use the Table API, it sounds like it's the best fit for this job and actually keeps all the elements in its state for an indefinite amount of time.
However my question: Before going into depths of the Table API, only to notice there is a more elegant way, I'd like to identify, if this is the optimal solution for this matter or if there's an even better fitting Flink concept I am not aware of?
Edit: I forgot to mention: We do not make use of POJOs, but rather keep it generic, which means that the incoming data is identified as Tuple2<K,V>, where K,V are each an instance of GenericRecord. The corresponding schema for Serialization/Deserialization is obtained from the Schema Registry on runtime. I don't know, to which extent the SQL constructs can be a bottleneck in this situation.
Additionally, this remark from the documentation Both tables must have distinct field names makes me doubt a little bit, as we do have the same field names, which we will have to handle somehow, without having huge workarounds.
If A is truly static, then it will be less expensive if you can somehow fully ingest A, either into Flink state or into memory, and then stream B past A -- thereby producing the join results without having to store B.
There are at least a couple of ways to accomplish this with Flink. One is described in this answer, and the other involves using the State Processor API.
With this second approach you would hold A in key-partitioned Flink state. By using the State Processor API you can bootstrap a savepoint that contains the state you want, so that by starting your job from this savepoint, A is already fully loaded and immediately available.
There's a simple example of bootstrapping keyed state in this gist. Once you have created the savepoint, then you need to implement a streaming job that uses it to compute the join -- which can be done with a RichFlatMapFunction.
The other alternative for implementing joins without using the Table API is to simply roll your own with a RichCoFlatMapFunction or a KeyedCoProcessFunction. You will find examples of this in the Flink training. None of those examples really match your requirements, but they give the general flavor. I don't see any advantage to this, however -- if you are going to do a fully dynamic/dynamic join, might as well use the Table API.

Kakfa Streams multiple applications / streams on same node?

Hopefully this is a quick and easy question. Right now I have an application that has two unique tasks on it in the same stream. When the entire application runs, partitions are not balanced across the two tasks which was an issue as one of the tasks requires more resources (memory / cpu)
In order to solve this I created two unique streams with different stream builders in my application and started them separately. By setting it up this way, the partitions were balanced in the way I expected.
kafkaStreams0 = new KafkaStreams(kafkaStreamsBuilder0.build(), streamsProperties0)
kafkaStreams1 = new KafkaStreams(kafkaStreamsBuilder1.build(), streamsProperties1)
kafkaStreams0.start()
kafkaStreams1.start()
I'm giving each of these their own application id in the stream properties. Something about this seems like a hack, but I can't find any documentation about whether this is a valid solution.
As a note: I'd like to avoid splitting these into two applications as I don't want to add the operational overhead.

Understanding Persistent Entities with streams of data

I want to use Lagom to build a data processing pipeline. The first step in this pipeline is a service using a Twitter client to supscribe to a stream of Twitter messages. For each new message I want to persist the message in Cassandra.
What I dont understand is given I model my Aggregare root as a List of TwitterMessages for example, after running for some time this aggregare root will be several gigabytes in size. There is no need to store all the TwitterMessages in memory since the goal of this one service is just to persist each incomming message and then publish the message out to Kafka for the next service to process.
How would I model my aggregate root as Persistent Entitie for a stream of messages without it consuming unlimited resources? Are there any example code showing this usage if Lagom?
Event sourcing is a good default go to, but not the right solution for everything. In your case it may not be the right approach. Firstly, do you need the Tweets persisted, or is it ok to publish them directly to Kafka?
Assuming you need them persisted, aggregates should store in memory whatever they need to validate incoming commands and generate new events. From what you've described, your aggregate doesn't need any data to do that, so your aggregate would not be a list of Twitter messages, rather, it could just be NotUsed. Each time it gets a command it emits a new event for that Tweet. The thing here is, it's not really an aggregate, because you're not aggregating any state, you're just emitting events in response to commands with no invariants or anything. And so, you're not really using the Lagom persistent entity API for what it was made to be used for. Nevertheless, it may make sense to use it in this way anyway, it's a high level API that comes with a few useful things, including the streaming functionality. But there are also some gotchas that you should be aware of, you put all your Tweets in one entity, you limit your throughput to what one core on one node can do sequentially at a time. So maybe you could expect to handle 20 tweets a second, if you ever expect it to ever be more than that, then you're using the wrong approach, and you'll need to at a minimum distribute your tweets across multiple entities.
The other approach would be to simply store the messages directly in Cassandra yourself, and then publish directly to Kafka after doing that. This would be a lot simpler, a lot less mechanics involved, and it should scale very nicely, just make sure you choose your partition key columns in Cassandra wisely - I'd probably partition by user id.

How to use DataFrames within SparkListener?

I've written a CustomListener (deriving from SparkListener, etc...) and it works fine, I can intercept the metrics.
The question is about using the DataFrames within the listener itself, as that assumes the usage of the same Spark Context, however as of 2.1.x only 1 context per JVM.
Suppose I want to write to disk some metrics in json. Doing it at ApplicationEnd is not possible, only at the last jobEnd (if you have several jobs, the last one).
Is that possible/feasible???
I'm trying to measure the perfomance of jobs/stages/tasks, record that and then analyze programmatically. May be that is not the best way?! Web UI is good - but I need to make things presentable
I can force the creation of dataframes upon endJob event, however there are a few errors thrown (basically they refer to not able to propogate events to the listener) and in general I would like to avoid unnecessary manipulations. I want to have a clean set of measurements that I can record and write to disk
SparkListeners should be as fast as possible as a slow SparkListener would block others to receive events. You could use separate threads to release the main event dispatcher thread, but you're still bound to the limitation of having a single SparkContext per JVM.
That limitation is however easily to overcome since you could ask for the current SparkContext using SparkContext.getOrCreate.
I'd however not recommend the architecture. That puts too much pressure on the driver's JVM that should rather "focus" on the application processing (not collecting events that probably it already does for web UI and/or Spark History Server).
I'd rather use Kafka or Cassandra or some other persistence storage to store events to and have some other processing application to consume them (just like Spark History Server works).

Apache-Spark Internal Job Scheduling

I came across the feature in Spark where it allows you to schedule different tasks within a spark context.
I want to implement this feature in a program where I map my input RDD(from a text source) into a key value RDD [K,V] subsequently make a composite key valueRDD [(K1,K2),V] and a filtered RDD containing some specific values.
Further pipeline involves calling some statistical methods from MLlib on both the RDDs and a join operation followed by externalizing the result to disk.
I am trying to understand how will spark's internal fair scheduler handle these operations. I tried reading the job scheduling documentation but got more confused with the concept of pools, users and tasks.
What exactly are the pools, are they certain 'tasks' which can be grouped together or are they linux users pooled into a group
What are users in this context. Do they refer to threads? or is it something like SQL context queries ?
I guess it relates to how are tasks scheduled within a spark context. But reading the documentation makes it seem like we are dealing with multiple applications with different clients and user groups.
Can someone please clarify this?
All the pipelined procedure you described in Paragraph 2:
map -> map -> map -> filter
will be handled in a single stage, just like a map() in MapReduce if it is familiar to you. It's because there isn't a need for repartition or shuffle your data for your make no requirements on the correlation between records, spark would just chain as much transformation as possible into a same stage before create a new one, because it would be much lightweight. More informations on stage separation could be find in its paper: Resilient Distributed Datasets Section 5.1 Job Scheduling.
When the stage get executed, it would be one task set (same tasks running in different thread), and get scheduled simultaneously in spark's perspective.
And Fair scheduler is about to schedule unrelated task sets and not suitable here.