I am currently studying distributed stream processing systems, e.g. Storm, Flink and Spark Streaming. I want to implement some applications in these systems and briefly compare them. I wonder whether any companies use these systems for the following scenarios, and at what scale of data stream.
Graph: a large graph may be distributed across multiple machines, and we handle updates (adding or removing vertices or edges) and queries on the graph. So far I can only find streaming graph algorithms for a single machine.
Transaction: exactly-once message delivery is a must. There is a Leaderboard Maintenance Benchmark in S-Store (Meehan, John, et al. "S-store: Streaming meets transaction processing." Proceedings of the VLDB Endowment 8.13 (2015): 2134-2145), but I cannot find how they generated the input data.
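To make the graph scenario concrete, here is a minimal single-process sketch of what "distributed to multiple machines" could look like: each vertex is assigned to a partition by a stable hash, and every update is routed to the partitions that own its endpoints. The class `PartitionedGraph` and its method names are hypothetical, and this is not how Storm, Flink or Spark Streaming implement graph state; it only illustrates the routing idea.

```python
from zlib import crc32


class PartitionedGraph:
    """Toy undirected graph whose vertices are hash-partitioned.

    Each entry in self.partitions stands in for the adjacency map that
    one machine would hold in a real deployment (an assumption made for
    illustration only).
    """

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self.partitions = [{} for _ in range(num_partitions)]

    def owner(self, vertex):
        # crc32 is stable across processes, unlike Python's built-in hash().
        return crc32(str(vertex).encode("utf-8")) % self.num_partitions

    def _adjacency(self, vertex):
        return self.partitions[self.owner(vertex)]

    def add_edge(self, u, v):
        # An edge update touches the partition of each endpoint.
        self._adjacency(u).setdefault(u, set()).add(v)
        self._adjacency(v).setdefault(v, set()).add(u)

    def remove_edge(self, u, v):
        self._adjacency(u).get(u, set()).discard(v)
        self._adjacency(v).get(v, set()).discard(u)

    def remove_vertex(self, u):
        # Drop u, then tell each neighbour's partition to forget it.
        for v in self._adjacency(u).pop(u, set()):
            self._adjacency(v).get(v, set()).discard(u)

    def neighbors(self, u):
        # Query answered by the single partition that owns u.
        return set(self._adjacency(u).get(u, ()))
```

In a real system the calls between partitions would be messages over the network, which is where ordering and exactly-once concerns come in.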
Have a look at Yahoo's benchmark comparing Apache Storm, Apache Spark, Apache Flink, Apache Apex and others.
We use Kafka for messaging and a lot more, but we now have a requirement for some kind of rule engine that processes data based on rules. Does Kafka have any capability like this, or do we have to use a third-party rule engine (e.g. https://camunda.com/dmn/) and integrate it with Kafka?
There is no need to use a third-party rule engine with Apache Kafka. Kafka Streams is part of the project, and to reduce the need to write Java code to express rules there is also ksqlDB (developed by Confluent on top of Kafka Streams), which is based on a subset of ANSI SQL.
While these options are not rule engines per se, they share the same semantics: given an input, output the relevant result based on the computation. The difference is in the how, not in the what. So I think they are decent replacements.
You can always integrate both as well. Several rule engines, such as Drools from Red Hat, are Java-based and thus can easily be called from a Kafka Streams processor. As long as the if-then-else rules run in the same JVM as the Kafka Streams application, you won't have any performance penalties other than a possibly bigger JVM heap.
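As a rough illustration of the "same semantics" point, the sketch below applies if-then-else rules to each record of a stream in the same process that does the stream processing. It is plain Python, not the Kafka Streams or ksqlDB API; `Rule`, `apply_rules`, the `amount` field and the route names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    condition: Callable[[dict], bool]  # the "if" part
    action: Callable[[dict], dict]     # the "then" part


def apply_rules(records, rules):
    """Yield one output per record: the first matching rule wins."""
    for record in records:
        for rule in rules:
            if rule.condition(record):
                yield rule.action(record)
                break
        else:
            # The "else" branch: no rule matched this record.
            yield {**record, "route": "default"}


# Hypothetical rules over records that carry an "amount" field.
RULES = [
    Rule("high_value", lambda r: r["amount"] >= 1000,
         lambda r: {**r, "route": "review"}),
    Rule("refund", lambda r: r["amount"] < 0,
         lambda r: {**r, "route": "refunds"}),
]
```

In a Kafka Streams application the body of `apply_rules` would sit inside a processor or a map/filter step, and the rule evaluation could be delegated to an engine such as Drools.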
In "Big Data Concepts, Theories, and Applications", I found a brief description of MLST and DAG. Still, the description doesn't answer WHY some distributed systems use MLST while others use DAG. What does that mean in terms of performance, latency, multi-user support, etc.?
I noticed that DAG is used by distributed computational engines (e.g. Apache Spark, Storm, Tez) while MLST is used by interactive distributed databases (e.g. Google Dremel, Apache Impala, Presto). Is this always true?
I work on a log centralization project.
I'm working with ELK to collect, aggregate, store and visualize my data. I see that Kafka can be useful for large volumes of data, but I cannot find information on the volume at which it becomes worth using.
10 GB of logs per day? Less, more?
Thanks for your help.
Let's approach this in two ways.
What volumes of data is Kafka suitable for? Kafka is used at large scale (Netflix, Uber, PayPal, Twitter, etc.) and at small scale.
You can start with a cluster of three brokers handling a few MB if you want, and scale out from there as required. 10 GB of data a day would be perfectly reasonable in Kafka, but so would ten times less or ten times more.
What is Kafka suitable for? In the context of your question, Kafka serves as an event-driven integration point between systems. It can be a "dumb" pipeline, but because it persists data, that data can also be reconsumed elsewhere. It also offers native stream processing capabilities and integration with other systems.
If all you are doing is getting logs into Elasticsearch then Kafka may be overkill. But if you wanted to use that log data in another place (e.g. HDFS, S3, etc), or process it for patterns, or filter it for conditions to route elsewhere—then Kafka would be a sensible option to route it through. This talk explores some of these concepts.
In terms of ELK and Kafka specifically, Logstash and Beats can write to Kafka as an output, and there's a Kafka Connect connector for Elasticsearch.
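To put the 10 GB a day figure from the question in perspective, it helps to convert it into an average rate. The sketch below assumes decimal units (1 GB = 10^9 bytes) and a perfectly even load over the day; real log traffic is bursty, so peaks will be higher than this average.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_throughput(bytes_per_day):
    """Average bytes per second for a given daily volume."""
    return bytes_per_day / SECONDS_PER_DAY


daily_volume = 10 * 10**9  # 10 GB per day, decimal units (assumption)

for factor in (0.1, 1, 10):
    rate = average_throughput(daily_volume * factor)
    print(f"{daily_volume * factor / 10**9:g} GB/day is about {rate / 1000:.1f} kB/s")
```

At 10 GB a day the average is roughly 116 kB/s, and even ten times that is only a little over 1 MB/s.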
Disclaimer: I work for Confluent.
I'm designing an event-sourced architecture based around Kafka, and using Flink for stream processing.
One use case will be the querying (filtering and sorting of results) of historical trade data that has passed through the Kafka topic over time. e.g. "Give me all trades in the last 5 years with these attributes, sorted by xx". Total trade history will be around 10m, increasing by say 1m/year.
Is Flink itself the right tool for such historical queries, and able to answer them with reasonable performance (a few seconds)? Or am I better off feeding the events from Kafka into an indexable/queryable data store like MongoDB or an RDBMS, and using that for historical queries?
Doing the former feels like it'll adhere more closely to a Kappa Architecture, whereas resorting to a historical db feels like I'm moving away from that back towards a Lambda architecture.
Flink is well suited to processing historical data from a Kafka topic (or any other data source) thanks to its support for event-time processing, i.e., time-based processing that uses the timestamps in the records rather than the clock of the processing machine (aka processing time).
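The event-time idea can be shown with a small sketch: records are assigned to windows by the timestamp they carry, so replaying old data, or receiving it out of order, gives the same result. This is plain Python rather than the Flink API; the function name and the `ts` field (milliseconds) are assumptions, and the watermarks Flink uses to decide when a window is complete are left out.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_ms):
    """Count events per tumbling window, keyed by each event's own timestamp."""
    counts = defaultdict(int)
    for event in events:
        # The window is derived from the record, never from the wall clock.
        window_start = event["ts"] - (event["ts"] % window_ms)
        counts[window_start] += 1
    return dict(sorted(counts.items()))


# The third event arrives late but still lands in the first window.
events = [{"ts": 1_000}, {"ts": 61_000}, {"ts": 2_000}]
print(tumbling_window_counts(events, window_ms=60_000))
```

With processing time, the late event would have been counted in whatever window happened to be open when it arrived.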
If you only want to perform analytics, you might want to have a look at Flink's SQL support.
We are planning to use REST API calls to ingest data from an endpoint and store the data to HDFS. The REST calls are done in a periodic fashion (daily or maybe hourly).
I've already done Twitter ingestion using Flume, but I don't think using Flume would suit my current use-case because I am not using a continuous data firehose like this one in Twitter, but rather discrete regular time-bound invocations.
The idea I have right now, is to use custom Java that takes care of REST API calls and saves to HDFS, and then use Oozie coordinator on that Java jar.
I would like to hear suggestions or alternatives (if there is something easier than what I'm thinking of right now) about the design and which Hadoop-based component(s) to use for this use case. If you feel I can stick with Flume, then kindly also give me an idea of how to do this.
As stated on the Apache Flume website:
Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.
As you can see, among the features attributed to Flume is the gathering of data. "Pushing-like or emitting-like" data sources are easy to integrate thanks to HttpSource, AvroSource, ThriftSource, etc. In your case, where the data must be, let's say, "actively pulled" from an HTTP-based service, the integration is not so obvious, but it can be done. For instance, by using the ExecSource, which runs a script that gets the data and pushes it to the Flume agent.
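A script for the ExecSource approach could look roughly like the sketch below. The Exec source reads the command's standard output, so the script pulls from the REST endpoint and prints one event per line. The function names are hypothetical, and the sketch assumes the endpoint returns a JSON array of objects; the URL is whatever your service exposes.

```python
import json
import sys
import urllib.request


def fetch_records(url, timeout=30):
    """Pull one batch from the REST endpoint (assumed to return a JSON array)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def to_event_lines(records):
    """Serialize each record as a single line of compact JSON.

    json.dumps escapes embedded newlines, so one record never spans
    two lines (and therefore never becomes two events).
    """
    return [json.dumps(r, separators=(",", ":"), sort_keys=True) for r in records]


def main(url):
    for line in to_event_lines(fetch_records(url)):
        sys.stdout.write(line + "\n")
    sys.stdout.flush()
```

A wrapper would call `main` with the endpoint URL. Since the calls are periodic, either the script loops with a sleep between pulls, or the source is configured to rerun the command; the Exec source documents `restart` and `restartThrottle` settings for that, but check them against your Flume version.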
If you use custom code in charge of pulling the data and writing it into HDFS, such a design will be OK, but you will be missing some interesting built-in Flume characteristics (that you will probably have to implement yourself):
Reliability. Flume has mechanisms to ensure the data is really persisted in the final storage, retrying until it is effectively written. This is achieved through the transaction concept and the use of an internal channel that buffers data both at the input (absorbing load peaks) and at the output (retaining data until it is effectively persisted).
Performance. The use of transactions and the possibility of configuring multiple parallel sinks (data processors) will make your deployment able to deal with really large amounts of data generated per second.
Usability. By using Flume you don't need to deal with the storage details (e.g. the HDFS API). Even if some day you decide to change the final storage, you only have to reconfigure the Flume agent to use the corresponding sink.
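If you go the custom-code route, the reliability point is the one you would most likely have to rebuild. The sketch below shows the general idea under simple assumptions: events wait in a buffer and a batch is removed only after the sink has accepted it, otherwise it is retried. `BufferedChannel` is a hypothetical name, the buffer is in memory only, and this is not Flume's actual channel implementation.

```python
from collections import deque


class BufferedChannel:
    """In-memory buffer that releases events only after a successful write."""

    def __init__(self):
        self._queue = deque()

    def __len__(self):
        return len(self._queue)

    def put(self, event):
        self._queue.append(event)

    def drain(self, sink, batch_size=2, max_attempts=3):
        """Write buffered events in batches; return how many were delivered.

        `sink` is any callable that takes a list of events and raises
        OSError when the write fails (e.g. the storage is unavailable).
        """
        delivered = 0
        while self._queue:
            # Peek at the batch without removing it from the buffer.
            size = min(batch_size, len(self._queue))
            batch = [self._queue[i] for i in range(size)]
            for _ in range(max_attempts):
                try:
                    sink(batch)
                except OSError:
                    continue  # "rollback": the batch is still buffered
                # "commit": the sink accepted the batch, so drop it now.
                for _ in batch:
                    self._queue.popleft()
                delivered += len(batch)
                break
            else:
                # All attempts failed; keep the events for a later drain.
                return delivered
        return delivered
```

Note that a retried batch may be written twice if the sink fails after a partial write, so this gives at-least-once rather than exactly-once delivery.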