How to use DataFrames within SparkListener? - scala

I've written a CustomListener (deriving from SparkListener, etc...) and it works fine, I can intercept the metrics.
The question is about using the DataFrames within the listener itself, as that assumes the usage of the same Spark Context, however as of 2.1.x only 1 context per JVM.
Suppose I want to write to disk some metrics in json. Doing it at ApplicationEnd is not possible, only at the last jobEnd (if you have several jobs, the last one).
Is that possible/feasible???
I'm trying to measure the perfomance of jobs/stages/tasks, record that and then analyze programmatically. May be that is not the best way?! Web UI is good - but I need to make things presentable
I can force the creation of dataframes upon endJob event, however there are a few errors thrown (basically they refer to not able to propogate events to the listener) and in general I would like to avoid unnecessary manipulations. I want to have a clean set of measurements that I can record and write to disk

SparkListeners should be as fast as possible as a slow SparkListener would block others to receive events. You could use separate threads to release the main event dispatcher thread, but you're still bound to the limitation of having a single SparkContext per JVM.
That limitation is however easily to overcome since you could ask for the current SparkContext using SparkContext.getOrCreate.
I'd however not recommend the architecture. That puts too much pressure on the driver's JVM that should rather "focus" on the application processing (not collecting events that probably it already does for web UI and/or Spark History Server).
I'd rather use Kafka or Cassandra or some other persistence storage to store events to and have some other processing application to consume them (just like Spark History Server works).

Related

How do I make sure that I process one message at a time at most?

I am wondering how to process one message at a time using Googles pub/sub functionality in Go. I am using the official library for this, https://pkg.go.dev/cloud.google.com/go/pubsub#section-readme. The event is being consumed by a service that runs with multiple instances, so any in memory locking mechanism will not work.
I realise that it's an anti-pattern to do this, so let me explain my use-case. Using mongoDB I store an array of objects as an embedded document for each entity. The event being published is modifying parts of this array and saves it. If I receive more than one event at a time and they start processing exactly at the same time, one of the saves will override the other. So I was thinking a solution for this is to make sure that only one message will be processed at a time, and it would be nice to use any built-in functionality in cloud pub/sub to do so. Otherwise I was thinking of implementing some locking mechanism in the DB but i'd like to avoid that.
Any help would be appreciated.
You can imagine 2 things:
You can use ordering key in PubSub. Like that, all the message in relation with the same object will be delivered in order and one by one.
You can use a PUSH subscription to PubSub, to push to Cloud Run or Cloud Functions. With Cloud Run, set the concurrency to 1 (it's by default with Cloud Functions gen1), and set the max instance to 1 also. Like that you can process only one message at a time, all the other message will be rejected (429 HTTP error code) and will be requeued to PubSub. The problem is that you can parallelize the processing as before with ordering key
A similar thing, and simpler to implement, is to use Cloud Tasks instead of PubSub. With Cloud Tasks you can set a rate limit on a queue, and set the maxConcurrentDispatches to 1 (and you haven't to do the same with Cloud Functions max instances or Cloud Run max instances and concurrency)

Spark: Scheduling Within an Application with scala/java

The doc states that it is possible to schedule multiple jobs from within one Spark Session / context. Can anyone give an example on how to do that? Can I launch the several jobs / Action, within future ? What Execution context should I use? I'm not entirely sure how spark manage that. How the driver or the cluster is aware of the many jobs being submitted from within the same driver. Is there anything that signal spark about it ? If someone has an example that would be great.
Motivation: My data is key-Value based, and has the requirement that for each group associated with a key I need to process them in
batch. In particular, I need to use mapPartition. That's because In each
partition I need to instantiate an non-serializable object for
processing my records.
(1) The fact is, I could indeed, group things using scala collection directly within the partitions, and process each group as a batch.
(2) The other way around, that i am exploring would be to filter the data by keys before end, and launch action/jobs for each of the filtered result (filtered collection). That way no need to group in each partition, and I can just process the all partition as a batch directly. I am assuming that the fair scheduler would do a good job to schedule things evenly between the jobs. If the fair Scheduler works well, i think this solution is more efficient. However I need to test it, hence, i wonder if someone could provide help on how to achieve threading within a spark session, and warn if there is any down side to it.
More over if anyone has had to make that choice/evaluation between the two approach, what was the outcome.
Note: This is a streaming application. each group of record associated with a key needs a specific configuration of an instantiated object, to be processed (imperatively as a batch). That object being non-serializable, it needs to be instantiated per partition

Understanding Persistent Entities with streams of data

I want to use Lagom to build a data processing pipeline. The first step in this pipeline is a service using a Twitter client to supscribe to a stream of Twitter messages. For each new message I want to persist the message in Cassandra.
What I dont understand is given I model my Aggregare root as a List of TwitterMessages for example, after running for some time this aggregare root will be several gigabytes in size. There is no need to store all the TwitterMessages in memory since the goal of this one service is just to persist each incomming message and then publish the message out to Kafka for the next service to process.
How would I model my aggregate root as Persistent Entitie for a stream of messages without it consuming unlimited resources? Are there any example code showing this usage if Lagom?
Event sourcing is a good default go to, but not the right solution for everything. In your case it may not be the right approach. Firstly, do you need the Tweets persisted, or is it ok to publish them directly to Kafka?
Assuming you need them persisted, aggregates should store in memory whatever they need to validate incoming commands and generate new events. From what you've described, your aggregate doesn't need any data to do that, so your aggregate would not be a list of Twitter messages, rather, it could just be NotUsed. Each time it gets a command it emits a new event for that Tweet. The thing here is, it's not really an aggregate, because you're not aggregating any state, you're just emitting events in response to commands with no invariants or anything. And so, you're not really using the Lagom persistent entity API for what it was made to be used for. Nevertheless, it may make sense to use it in this way anyway, it's a high level API that comes with a few useful things, including the streaming functionality. But there are also some gotchas that you should be aware of, you put all your Tweets in one entity, you limit your throughput to what one core on one node can do sequentially at a time. So maybe you could expect to handle 20 tweets a second, if you ever expect it to ever be more than that, then you're using the wrong approach, and you'll need to at a minimum distribute your tweets across multiple entities.
The other approach would be to simply store the messages directly in Cassandra yourself, and then publish directly to Kafka after doing that. This would be a lot simpler, a lot less mechanics involved, and it should scale very nicely, just make sure you choose your partition key columns in Cassandra wisely - I'd probably partition by user id.

Apache Flink: changing state parameters at runtime from outside

i'm currently working on a streaming ML pipeline and need exactly once event processing. I was interested by Flink but i'm wondering if there is any way to alter/update the execution state from outside.
The ml algorithm state is kept by Flink and that's ok, but considering that i'd like to change some execution parameters at runtime, i cannot find a viable solution. Basically an external webapp (in GO) is used to tune the parameters and changes should reflect in Flink for the subsequent events.
I thought about:
a shared Redis with pub/sub (as polling for each event would kill throughput)
writing a custom solution in Go :D
...
The state would be kept by key, related to the source of one of the multiple event streams coming in from Kafka.
Thanks
You could use a CoMapFunction/CoFlatMapFunction to achieve what you described. One of the inputs is the normal data input and on the other input you receive state changing commands. This could be easiest ingested via a dedicated Kafka topic.

Storm dynamic topology

Does Storm support dynamic topology? The functionality I want from this is to dynamically change the topology according to the user requirement while the Storm topology is running. For example, when user want to know the top-10 words of a stream, I use the top-10 bolt to process it, when user want to know something else, I use the other bolt to process the stream and 'unplug' the top-10 bolt.
I know it could be done by partition the stream or duplicate the stream and alway running every functionalities and only demo the data we want, or we could shut down the stream and update another topology, but is there a 'hot plug-in' way to do that?
You can't dinamically change a Storm topology's structure, i.e. modify the spouts and bolts wiring. A Storm topology's wiring is always static.
However, you could implement the needed functionality in other ways you already described. IMHO, the best, most logical way would be by running multiple topologies -- in case the data processing differs greatly. But if most of the processing is similar in both cases, just duplicate the source stream and process the data in different branches of the same topology.
It was added on STORM-561, on 03/Jun/15:
https://issues.apache.org/jira/browse/STORM-561
There is no built in way to do this (switch out one bolt for another), but what you can do is write a bolt that executes arbitrary code based on the input it receives. So long as your input and output has the same structure in storm (same tuples emitted), you could theoretically execute whatever you wanted at run time in your bolt. This is especially easy if you build your bolt in Clojure, but it's possible in essentially every language you can use with Storm.
However, this probably doesn't make a lot of sense as most computations you'll want to do involve more than one bolt and lend themselves to passing differently structured tuples around. As schiavuzzi already said in their answer, you're probably better off running multiple topologies if there are multiple, independent computations you'd like to do to a stream.
For hot deployment there is a new streaming platform from eBay.
Jetstream: https://github.com/pulsarIO/jetstream.
It has a built in config management tool and your config sits in mongodb. When user modify the config bean, the tool will publish the notification to zookeeper, the corresponding JetStream applications will be get notified and change the config dynamically