Is checkpointing necessary in spark streaming - scala

I have noticed that spark streaming examples also have code for checkpointing. My question is how important is that checkpointing. If its there for fault tolerance, how often do faults happen in such streaming applications?

It all depends on your use case. For suppose if you are running a streaming job, which just reads data from Kafka and counts the number of records. What would you do if your application crashes after a year or so?
If you don't have a backup/checkpoint, you will have to recompute all the previous one years worth data so you can resume counting.
If you have a backup/checkpoint, you can simply read the checkpoint data and resume instantly.
Or if all you are just doing is having a streaming application which just Reads-Messages-From-Kafka >>> Tranform >>> Insert-to-a-Database, I need not worry about my application crashing. Even if it's crashed, i can simply resume my application without loss of data.
Note: Check-pointing is a process which stores the current state of a spark application.
Coming to the frequency of fault tolerance, you can almost never predict an outage. In companies,
There might be power outage
regular maintainance/upgrading of cluster
hope this helps.

There are two cases:
You are doing stateful operations, such as updateStateByKey, then
you must use checkpointing - every state is saved. Without setting
checkpoint directory, an exception will be thrown.
You are doing only windowed operations - then yes, you can disable checkpointing. However I strongly recommend setting checkpoint directory.
When driver is killed, then you'll loose all your data and progress information. Checkpointing helps you to recover applications from such situations.
Is a failure a normal situation? Of course! Imagine that you've got large cluster, many machines, many components in these machines. If one of these components fails, then your application will also fail. When connection to driver is lost - your application fails. With checkpoiting you can just run application again and it will recover state.

Related

Prediction/Estimation of missing intervals inside Apache Kafka process

Goal is to process raw readings (15min and 1h interval) from external remote meters (assets) in real time.
Process is defined using simple Apache Kafka producer/consumer and multiple Spring Boot microservices to deduplicate messages, transform (map) readings to our system (instead external codes insert internal IDS and similar stuff) and insert in TimescaleDB (extension of PostgreSql).
Everything seems fine, but there is requirement to perform real time prediction/estimation of missing intervals.
Simple example for one meter and 15 minute readings:
On day 1 we got all readings. We process them and have them ingested in our DB.
On day 2 we are missing all readings - so process is not even
started for this meter.
On day 3 we again got all readings - but only for day 3. Now we need
to predict that whole day 2 is missing and create empty readings and
then estimate them by some algorithm (that is not that important
now).
My question here, is there any way or idea how to do this without querying existing database in one of the microservices and checking if something is missing?
Is it possible to check previous messages in Kafka topics and based on that do the prediction/estimation (kafka streams? - I don't get them at all) and is that even smart to do, or there is any other way/idea to do it?
Personal opinion disclaimer
It is not reasonably possible to check previous messages in Kafka Streams. If you are hellbent on doing it, you could probably try to seek messages and re-consume them but Kafka will fight you every step on the way. The mental model is, that you are transforming or aggregating data that comes in in real time. If you need to query something about previous data, you ought to have collected that information when that data was coming through.
What could work (rather well even) is to separate the prediction of missing data from the transformation.
Create two consumers for the stream.
Have one topology (or whatever it is that does your transformations already) transform the data and load it back into Kafka and from there to timescaledb.
Have one topology (or another microservice) that does what is needed to predict missing data. Your usecase of backfilling a missing day could be handled by something like a count based on daily windows
Make that trigger your backfilling either as part of that topology or as a subsequent microservice and load that data to timescaledb as well.
Are you already using Kafka Streams for the transformations? This would be a classical usecase.
The recognition of missing data not so much
As far as I understand it does not require high throughput. More the opposite. You want to know if there is no data.
As far as I understand it latency is not a (main) concern.
Kafka Streams could be useful if you need to take automated action within seconds after data stops coming in. But even then, you could just write throughput metrics and trigger alerts in this case.
Pther than that, it is a very stateful problem and stream processing is at its best if you can treat every message separately reduce them in a "standard" manner like sums or counts.
I got the impression, that a delay of a few hours / a day is not that tragic and currently the backfilling might be done manually. In this case the cot of Kafka Streams would outweigh the benefits.

Stateful Kafka Stream - How to restore state?

If I am running my stream application - appA on Machine A and then I moved it to Machine B; will it remember the earlier state?
When I write simple consumer it remembers the last offset and it gets stored in __consumer_offsets itself on Broker. So no matter where I start the Consumer it will pick up from that place.
Is there such a construct for stateful stream processing applications? If I am calculating the continuous Profit and Loss of my portfolio I need to start from where it was the last run and then start aggregating new transactions to that earlier P&L number. I cannot afford to process all messages again from the start of time. I have been having a hard time in finding an article around this that explains how to solve this problem.
No, it won't remember state unless you move the statestore as well (state.dir configuration).
The changelog topic will need read from the earliest offsets to rebuild the state.
There's presentations about running Kafka Steams in Kubernetes that cover some aspects of this, since Kubernetes can stop and relocate its pods... But kubernetes also has volume management features that may not be available in your scenario.
It might therefore be best to run your job on both machines to start, then you have fault tolerance, high availability with a warm standby replica / partitioned state.

Can Cassandra or ScyllaDB give incomplete data while reading with PySpark if either clusters are left un-repaired forever?

I use both Cassandra and ScyllaDB 3-node clusters and use PySpark to read data. I was wondering if any of them are not repaired forever, is there any challenge while reading data from either if there are inconsistencies in nodes. Will the correct data be read and if yes, then why do we need to repair them?
Yes you can get incorrect data if reapir is not done. It also depends on with what consistency you are reading or writing. Generally in production systems writes are done with (Local_one/Local_quorum) and read with Local_quorum.
If you are writing with weak consistency level, then repair becomes important as some of the nodes might not have got the mutations and while reading those nodes may get selected.
For example if you write with consistency level ONE on a table TABLE1 with a replication of 3. Now it may happen your write was written to NodeA only and NodeB and NodeC might have missed the mutation. Now if you are reading with Consistency level LOCAL_QUORUM, it may happen that NodeB and 'NodeC' get selected and they do not return the written data.
Repair is an important maintenance task for Cassandra which should be done periodically and continuously to keep data in healthy state.
As others have noted in other answers, different consistency levels make repair more or less important for different reasons. So I'll focus on the consistency level that you said in a comment you are using: LOCAL_ONE for reading and LOCAL_QUORUM for writing:
Successfully writing with LOCAL_QUORUM only guarantees that two replicas have been written. If the third replica is temporarily down, and will later come up - at that point one third of the read requests for this data, reads done from only one node (this is what LOCAL_ONE means) will miss the new data! Moreover, there isn't even a guarantee of so-called monotonic consistency - you can get new data in one read (from one node), and the old data in a later read (from another node).
However, it isn't completely accurate that only a repair can fix this problem. Another feature - enabled by default on both Cassandra and Scylla - is called Hinted Handoff - where when a node is down for relatively short time (up to three hours, but also depending on the amount of traffic in that period), other nodes which tried to send it updates remember those updates - and retry the send when the dead node comes back up. If you are faced only with such relatively short downtimes, repair isn't necessary and Hinted Handoff is actually enough.
That being said, Hinted Handoff isn't guaranteed perfect and might miss some inconsistencies. E.g., the node wishing to save a hint might itself be rebooted before it managed to save the hint, or replaced after saving it. So this mechanism isn't completely foolproof.
By the way, there another thing you need to be aware of: If you ever intend to do a repair (e.g., perhaps after some node was down for too long for Hinted Handoff to have worked, or perhaps because a QUORUM read causes a read repair), you must do it at least once every gc_grace_seconds (this defaults to 10 days).
The reason for this statement is the risk of data resurrection by repair which is too infrequent. The thing is, after gc_grace_seconds, the tombstones marking deleted items are removed forever ("garbage collected"). At that point, if you do a repair and one of the nodes happens to have an old version of this data (prior to the delete), the old data will be "resurrected" - copied to all replicas.
In addition to Manish's great answer, I'll just add that read operations run consistency levels higher than *_ONE have a (small...10% default) chance to invoke a read repair. I have seen that applications running at a higher consistency level for reads, will have less issues with inconsistent replicas.
Although, writing at *_QUORUM should ensure that the majority (quorum) of replicas are indeed consistent. Once it's written successfully, data should not "go bad" over time.
That all being said, running periodic (weekly) repairs is a good idea. I highly recommend using Cassandra Reaper to manage repairs, especially if you have multiple clusters.

Do NoSQL datacenter aware features enable fast reads and writes when nodes are distributed across high-latency connections?

We have a data system in which writes and reads can be made in a couple of geographic locations which have high network latency between them (crossing a few continents, but not this slow). We can live with 'last write wins' conflict resolution, especially since edits can't be meaningfully merged.
I'd ideally like to use a distributed system that allows fast, local reads and writes, and copes with the replication and write propagation over the slow connection in the background. Do the datacenter-aware features in e.g. Voldemort or Cassandra deliver this?
It's either this, or we roll our own, probably based on collecting writes using something like
rsync and sorting out the conflict resolution ourselves.
You should be able to get the behavior you're looking for using Voldemort. (I can't speak to Cassandra, but imagine that it's similarly possible using it.)
The key settings in the configuration will be:
replication-factor — This is the total number of times the data is stored. Each put or delete operation must eventually hit this many nodes. A replication factor of n means it can be possible to tolerate up to n - 1 node failures without data loss.
required-reads — The least number of reads that can succeed without throwing an exception.
required-writes — The least number of writes that can succeed without the client getting back an exception.
So for your situation, the replication would be set to whatever number made sense for your redundancy requirements, while both required-reads and required-writes would be set to 1. Reads and writes would return quickly, with a concomitant risk of stale or lost data, and the data would only be replicated to the other nodes afterwards.
I have no experience with Voldemort, so I can only comment on Cassandra.
You can deploy Cassandra to multiple datacenters with an inter-DC latency higher than a few milliseconds (see http://spyced.blogspot.com/2010/04/cassandra-fact-vs-fiction.html).
To ensure fast local reads, you can configure the cluster to replicate your data to a certain number of nodes in each datacenter (see "Network Topology Strategy"). For example, you specify that there should always be two replica in each data center. So even when you lose a node in a data center, you will still be able to read your data locally.
Write requests can be sent to any node in a Cassandra cluster. So for fast writes, your clients would always speak to a local node. The node receiving the request (the "coordinator") will replicate the data to other nodes (in other datacenters) in the background. If nodes are down, the write request will still succeed and the coordinator will replicate the data to the failed nodes at a later time ("hinted handoff").
Conflict resolution is based on a client-supplied timestamp.
If you need more than eventual consistency, Cassandra offers several consistency options (including datacenter-aware options).

How to build a fault-tolerant app in Storm?

The short version of the question: how to build a fail-safe word count program (topology) in Twitter Storm that produces accurate results even when failure occurs? Is that even possible?
Long version: I am studying Twitter Storm and trying to understand how it should be used. I have followed the tutorial and find it a very simple concept. But the word count example outlined in the tutorial is not fault tolerant (because bolts save some data in memory). Saving the same data in back-end DB however leads to double counting if an event is re-submitted to the start of chain (which happens when some of the bolts fail).
Should I see Twitter Storm as real-time platform for producing partially accurate results and still depend on MapReduce to get the accurate ones?
It really depends on what kind of failure your trying to hege against. There are a few things that you can do:
Storm bolts are supposed to ack a tuple only after they have processed it. If you write your spouts and bolts and topology to use this, you can implement an "exactly one time" system which will guarantee accuracy.
Kafka can be a good way to put data into Storm because it uses disk persistance to keep messages around for a long time even after they are consumed. This means you can retrieve them if there's a failure by a consumer down the line.
In general though, it's difficult to guarantee that things are processed exactly once in any streaming system. This is a known problem, and it is a very difficult problem to solve efficiently.
Storm has the concept of transactional topologies. In practice, this means you will want to process items in batches, then commit to your database at the end of the batch, storing the transaction ID in the database alongside a count. This also has the practical benefit of reducing the load on your database with fewer inserts.
Batches are processed in parallel and may be replayed on failure, but are guaranteed to be committed in order. This is important because it makes it safe to write code that fetches the current count row, checks the transaction ID against the one in memory, and if the two differ (meaning it is an uncommitted batch), adding the new count to the existing one and committing that updated count.
See the following link for much more information and code examples:
https://github.com/nathanmarz/storm/wiki/Transactional-topologies