Spark/Spark Streaming in production without HDFS - scala

I have been developing applications using Spark/Spark-Streaming but so far always used HDFS for file storage. However, I have reached a stage where I am exploring if it can be done (in production, running 24/7) without HDFS. I tried sieving though Spark user group but have not found any concrete answer so far. Note that I do use checkpoints and stateful stream processing using updateStateByKey.

Depending on the streaming(I've been using Kafka), you do not need to use checkpoints etc.
Since spark 1.3 they have implemented a direct approach with so many benefits.
Simplified Parallelism: No need to create multiple input Kafka streams
and union-ing them. With directStream, Spark Streaming will create as
many RDD partitions as there is Kafka partitions to consume, which
will all read data from Kafka in parallel. So there is one-to-one
mapping between Kafka and RDD partitions, which is easier to
understand and tune.
Efficiency: Achieving zero-data loss in the first approach required
the data to be stored in a Write Ahead Log, which further replicated
the data. This is actually inefficient as the data effectively gets
replicated twice - once by Kafka, and a second time by the Write Ahead
Log. This second approach eliminate the problem as there is no
receiver, and hence no need for Write Ahead Logs.
Exactly-once semantics: The first approach uses Kafka’s high level API
to store consumed offsets in Zookeeper. This is traditionally the way
to consume data from Kafka. While this approach (in combination with
write ahead logs) can ensure zero data loss (i.e. at-least once
semantics), there is a small chance some records may get consumed
twice under some failures. This occurs because of inconsistencies
between data reliably received by Spark Streaming and offsets tracked
by Zookeeper. Hence, in this second approach, we use simple Kafka API
that does not use Zookeeper and offsets tracked only by Spark
Streaming within its checkpoints. This eliminates inconsistencies
between Spark Streaming and Zookeeper/Kafka, and so each record is
received by Spark Streaming effectively exactly once despite failures.
If you are using Kafka, you can found out more here:
https://spark.apache.org/docs/1.3.0/streaming-kafka-integration.html
Approach 2.

Related

Kafka Streams vs Flink

I wrote an application that reads 100.000 Avro records per second from Kafka topic, aggregates by key, use tumbling windows with 5 different sizes, do some calculation to know the highest, lowest, initial and end value, and write back to another Kafka topic.
This application already exists in Flink, but the source is RSocket in CSV format and the sink is Cassandra. The problem is that the new application is using a lot more CPU and memory. I checked this article and noticed performance is not mentioned.
Am I correct to assume the difference is mostly because of Avro serialisation / deserialisation, or is Flink supposed to be faster for this use case? If the difference is small, I'd prefer Kafka Streams to avoid needing to manage the cluster.
I don't think this question can be answered generally. Both Flink and Kafka Streaming can be tuned to the workload, and small changes in parameters can make a large difference in performance. Generally, there is no fundamental reason why Flink should be a lot faster for such a use case than Kafka Streams. One exception may be repartitioning, which always need to go through the Kafka cluster for Kafka streams and can stay within the cluster for Flink, but as I understand, you are not repartitioning in your use case.
Serialization format may play a large role, however. Some benchmarks that I remember for protobuf (for avro is similar) showed that the size in (Java) memory is 100x larger than the serialized data on the wire. Again, this depends on many things, in particular how nested/complex your schema is. If avro is deserialized to a complex object model, this will cause a significant CPU / memory overhead compared to passing strings around.
However, the only way to tell for certain what is slowing down your use case is profiling it and seeing where the additional resources are spent.
Without benchmarks on your own hardware, or JVM profiling your code, it's hard to say which will be faster.
Flink does invoke more JVM function calls than Kafka Streams, from what I've seen.
Kafka Streams doesn't work well (or at all) with external systems such as RSocket or Cassandra. Therefore, you would still need Flink or some other ETL tool like Kafka Connect (i.e manage a cluster) to get data into a Kafka topic to then process, regardless of framework.
Serialization format shouldn't matter. Flink or Kafka Streams will use the exact same JVM methods from Avro (or any other format) SDK.

Can I use Kafka for multiple independent consumers sequential reads?

I have the following use case:
50 students write their own code which consumes a preloaded dataset, and they will repeat it many times.
They all need to do the same task: read the data in order, and process it.
The dataset is a time series containing 600 million messages, each message is about 1.3KB.
Processing will probably be in Spark, but not mandatory.
The dataset is fixed and ReadOnly.
The data should be read at "reasonable speed" > 30MB/sec for each consumer.
I was thinking of setting kafka cluster with 3+ brokers, 1 topic, and 50 partitions.
My issue with the above plan is that each student (== consumer) must read all the data, regardless of what other consumers do.
Is Kafka a good fit for this? If so, how?
What if I relax the requirement of reading the dataset in order? i.e. a consumer can read the 600M messages in any order.
Is it correct that in this case each consumer will simply pull the full topic (starting with "earliest)?
An alternative is to set an HDFS storage (we use Azure so it's called Storage Account) and simply supply a mount point. However, I do not have control of the throughput in this case.
Throughput calculation:
let's say 25 consumers run concurrently, each reading at 30MB/s -> 750MB/s .
Assuming data is read from disk, and disk rate is 50MB/s, I need to read concurrently from 750/50 = 15 disks.
Does it mean I need to have 15 brokers? I did not see how one broker can allocate partitions to several disks attached to it.
similar posts:
Kafka topic partitions to Spark streaming
How does one Kafka consumer read from more than one partition?
(Spring) Kafka appears to consume newly produced messages out of order
Kafka architecture many partitions or many topics?
Is it possible to read from multiple partitions using Kafka Simple Consumer?
Processing will probably be in Spark, but not mandatory
An alternative is to set an HDFS storage (we use Azure)
Spark can read from Azure Blob Storage, so I suggest you start with that first. You can easily scale up Spark executors in parallel for throughput.
If want to use Kafka, don't base consumption rate on disk speed alone, especially when Kafka can do zero-copy transfers. Use kafka-consumer-perf-test script to test how fast your consumers can go with one partition. Or, better, if your data has some key other than timestamp that you can order by, then use that.
It's not really clear if each "50 students" does the same processing on the data set, or some pre computations can be done, but if so, Kafka Streams KTables can be setup to aggregate some static statistics of the data, if it's all streamed though a topic, that way, you can distribute load for those queries, and not need 50 parallel consumers.
Otherwise, my first thought would be to simply use a TSDB like OpenTSDB, Timescale or Influx, maybe Druid . Which could also be used with Spark, or queried directly.
If you are using Apache Spark 3.0+ there are ways around consumer per partition bound, as it can use more executor threads than partitions are, so it's mostly about how fast your network and disks are.
Kafka stores latest offsets in memory, so probably for your use case most of reads will be from memory.
Desired minimum number of partitions to read from Kafka. By default, Spark has a 1-1 mapping of topicPartitions to Spark partitions consuming from Kafka. If you set this option to a value greater than your topicPartitions, Spark will divvy up large Kafka partitions to smaller pieces. Please note that this configuration is like a hint: the number of Spark tasks will be approximately minPartitions. It can be less or more depending on rounding errors or Kafka partitions that didn't receive any new data.
https://spark.apache.org/docs/3.0.1/structured-streaming-kafka-integration.html

Apache NiFi & Kafka Integration

I am not sure this questions is already addressed somewhere, but I couldn't find a helpful answer anywhere on internet.
I am trying to integrate Apache NiFi with Kafka - consuming data from Kafka using Apache NiFi. Below are few questions that comes to my mind before proceeding with this.
Q-1) The use case that we have is - read data from Kafka real time, parse the data, do some basic validations on the data and later push the data to HBase. I know
Apache NiFi is the right candidate for doing this kind of processing, but how easy it is to build the workflow if the JSON that we are processing is a complex one ? We were
initially thinking of doing the same using Java Code, but later realised this can be done with minimum effort in NiFi. Please note, 80% of data that we are processing from
Kafka would be simple JSONs, but 20% would be complex ones(invovles arrays)
Q-2) The trickiest part while writing Kafka consumer is handling the offset properly. How Apache NiFi will handle offsets while consuming from Kafka topics ? How offsets
would be properly committed in case rebalancing is triggered while processing ? The frameworks like Spring-Kafka provide options to commit the offsets (to some extent) in case
rebalance is triggered in the middle of processing. How NiFi handles this ?
I have deployed a number of pipeline in 3 node NiFi cluster in production, out of which one is similar to your use case.
Q-1) It's very simple and easy to build a pipeline for your use-case. Since you didn't mention the types of tasks involved in processing a json, I'm assuming generic tasks. Generic task involving JSONs can be schema validation which can be achieved using ValidateRecord Processor, transformation using JoltTransformRecord Processor, extraction of attribute values using EvaluateJsonPath, conversion of json to some other format say avro using ConvertJSONToAvro processors etc.
Nifi gives you flexibility to scale each stage/processor in the pipelines independently. For example, if transformation using JoltTransformRecord is time consuming, you can scale it to run N concurrent tasks in each node by configuring Concurrent Tasks under Scheduling tab.
Q-2) As far as ConsumeKafka_2_0 processor is concerned, the offset management is handled by committing the NiFi processor session first and then the Kafka offsets which means we have an at-least once guarantee by default.
When Kafka trigger rebalancing of consumers for a given partition, processor quickly commits(processor session and Kafka offset) whatever it has got and will return the consumer to the pool for reuse.
ConsumeKafka_2_0 handles committing offset when members of the consumer group change or the subscription of the members changes. This can occur when processes die, new process instances are added or old instances come back to life after failure. Also taken care for cases where the number of partitions of subscribed topic is administratively adjusted.

Is there any way to ensure that duplicate records are not inserted in kafka topic?

I have been trying to implement a queuing mechanism using kafka where I want to ensure that duplicate records are not inserted into topic created.
I found that iteration is possible in consumer. Is there any way by which we can do this in producer thread as well?
This is known as exactly-once processing.
You might be interested in the first part of Kafka FAQ that describes some approaches on how to avoid duplication on data production (i.e. on producer side):
Exactly once semantics has two parts: avoiding duplication during data
production and avoiding duplicates during data consumption.
There are two approaches to getting exactly once semantics during data
production:
Use a single-writer per partition and every time you get a network
error check the last message in that partition to see if your last
write succeeded
Include a primary key (UUID or something) in the
message and deduplicate on the consumer.
If you do one of these things, the log that Kafka hosts will be
duplicate-free. However, reading without duplicates depends on some
co-operation from the consumer too. If the consumer is periodically
checkpointing its position then if it fails and restarts it will
restart from the checkpointed position. Thus if the data output and
the checkpoint are not written atomically it will be possible to get
duplicates here as well. This problem is particular to your storage
system. For example, if you are using a database you could commit
these together in a transaction. The HDFS loader Camus that LinkedIn
wrote does something like this for Hadoop loads. The other alternative
that doesn't require a transaction is to store the offset with the
data loaded and deduplicate using the topic/partition/offset
combination.
I think there are two improvements that would make this a lot easier:
Producer idempotence could be done automatically and much more cheaply
by optionally integrating support for this on the server.
The existing
high-level consumer doesn't expose a lot of the more fine grained
control of offsets (e.g. to reset your position). We will be working
on that soon

Reliable usage of DirectKafkaAPI

I am pllaned to develop a reliable streamig application based on directkafkaAPI..I will have one producer and another consumer..I wnated to know what is the best approach to achieve the reliability in my consumer?..I can employ two solutions..
Increasing the retention time of messages in Kafka
Using writeahead logs
I am abit confused regarding the usage of writeahead logs in directkafka API as there is no receiver..but in the documentation it indicates..
"Exactly-once semantics: The first approach uses Kafka’s high level API to store consumed offsets in Zookeeper. This is traditionally the way to consume data from Kafka. While this approach (in combination with write ahead logs) can ensure zero data loss (i.e. at-least once semantics), there is a small chance some records may get consumed twice under some failures. "
so I wanted to know what is the best approach..if it suffices to increase the TTL of messages in kafka or I have to also enable write ahead logs..
I guess it would be good practice if I avoid one of the above since the backup data (retentioned messages, checkpoint files) can be lost and then recovery could face failure..
Direct Approach eliminates the duplication of data problem as there is no receiver, and hence no need for Write Ahead Logs. As long as you have sufficient Kafka retention, messages can be recovered from Kafka.
Also, Direct approach by default supports exactly-once message delivery semantics, it does not use Zookeeper. Offsets are tracked by Spark Streaming within its checkpoints.