Create Kafka stream of events from lines of a single massive file - apache-kafka

In DNA informatics the files are massive (300GB each and biobanks have hundreds of thousands of files) and they need to go through 6 or so lengthy downstream pipelines (hours to weeks). Because I do not work at the company that manufactures the sequencing machines, I do not have access to the data as it is being generated...nor do I write assembly lang.
What I would like to do is transform the lines of text from that 300GB file into stream events, then pass these messages through the 6 pipelines, with Kafka brokers handing off to Spark Streaming between each pipeline.
Is this possible? Is this the wrong use case? It would be nice to rerun single events as opposed to entire failed batches.
Desired Workflow:
------pipe1------
_------pipe2------
__------pipe3------
___------pipe4------
Current Workflow:
------pipe1------
_________________------pipe2------
__________________________________------pipe3------
___________________________________________________------pipe4------
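Concretely, the first step I have in mind would be something like the sketch below (the file path and topic name are just placeholders):

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Properties;
    import java.util.stream.Stream;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class LineEventProducer {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            // One Kafka event per line of the (hypothetical) massive input file.
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
                 Stream<String> lines = Files.lines(Paths.get("/data/sample-1234.txt"))) {
                lines.forEach(line -> producer.send(new ProducerRecord<>("pipe1-input", line)));
            }
        }
    }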

Kafka is not meant for sending files, only relatively small events. Even if you did send a file line-by-line, you would need to know how to put the file back together to process it, so you're effectively doing the same thing as streaming the file through a raw TCP socket.
Kafka has a default maximum message size of 1MB, and while you can increase it, I wouldn't recommend pushing it much beyond the low double-digit MB range.
How can I send large messages with Kafka (over 15MB)?
If you really needed to get data like that through Kafka, the recommended pattern is to put your large files on external storage (HDFS, S3, whatever), then put the URI to the file within the Kafka event, and let consumers deal with reading that datasource.
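A minimal sketch of that pointer-passing pattern, with a made-up topic, key, and URI:

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class FileReferenceProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            // The event carries only a pointer to the big file, not its contents.
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                String uri = "s3://genomics-bucket/sample-1234.fastq.gz"; // hypothetical location
                producer.send(new ProducerRecord<>("new-files", "sample-1234", uri));
            }
        }
    }

Each consumer then pulls the file straight from the object store and runs its pipeline stage, so only a tiny reference event ever goes through Kafka.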
If the files have any structure at all to them (like pages, for example), then you could use Spark and a custom Hadoop InputFormat to serialize those, and process data in parallel that way. Doesn't necessarily have to be through Kafka, though. You could try Apache NiFi, which I hear processes larger files better (maybe not GB, though).

Related

Kafka - different configuration settings

I am going through the documentation, and there seem to be a lot of moving parts with respect to message processing, like exactly-once processing and at-least-once processing, and the settings are scattered here and there. There doesn't seem to be a single place that documents, even roughly, which properties need to be configured for exactly-once processing and at-least-once processing.
I know there are many moving parts involved and it always depends. However, as I mentioned before, what are the settings that need to be configured, at a minimum, to provide exactly-once, at-most-once, and at-least-once processing?
You might be interested in the first part of the Kafka FAQ, which describes some approaches to avoiding duplication during data production (i.e. on the producer side):
Exactly once semantics has two parts: avoiding duplication during data production and avoiding duplicates during data consumption.

There are two approaches to getting exactly once semantics during data production:

1. Use a single-writer per partition and every time you get a network error check the last message in that partition to see if your last write succeeded
2. Include a primary key (UUID or something) in the message and deduplicate on the consumer.

If you do one of these things, the log that Kafka hosts will be duplicate-free. However, reading without duplicates depends on some co-operation from the consumer too. If the consumer is periodically checkpointing its position then if it fails and restarts it will restart from the checkpointed position. Thus if the data output and the checkpoint are not written atomically it will be possible to get duplicates here as well. This problem is particular to your storage system. For example, if you are using a database you could commit these together in a transaction. The HDFS loader Camus that LinkedIn wrote does something like this for Hadoop loads. The other alternative that doesn't require a transaction is to store the offset with the data loaded and deduplicate using the topic/partition/offset combination.
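For what it's worth, on newer Kafka versions (0.11+) the client settings most often associated with each guarantee look roughly like the sketch below. The config keys are real, but the values are illustrative; treat this as a starting point, not an exhaustive or authoritative list:

    import java.util.Properties;

    // Rough sketch of the client-side knobs tied to each delivery guarantee.
    public class DeliverySemanticsConfig {

        // At-most-once: offsets are committed automatically, possibly before
        // the record has actually been processed.
        static Properties atMostOnceConsumer() {
            Properties p = new Properties();
            p.put("enable.auto.commit", "true");
            return p;
        }

        // At-least-once: durable produce, plus manual offset commits made only
        // after the record has been processed.
        static Properties atLeastOnceProducer() {
            Properties p = new Properties();
            p.put("acks", "all");
            p.put("retries", Integer.toString(Integer.MAX_VALUE));
            return p;
        }

        static Properties atLeastOnceConsumer() {
            Properties p = new Properties();
            p.put("enable.auto.commit", "false"); // commit manually after processing
            return p;
        }

        // Exactly-once within Kafka: idempotent, transactional producer plus a
        // consumer that only reads committed data (Kafka Streams bundles this
        // as processing.guarantee=exactly_once).
        static Properties exactlyOnceProducer() {
            Properties p = new Properties();
            p.put("enable.idempotence", "true");
            p.put("acks", "all");
            p.put("transactional.id", "my-producer-1"); // hypothetical id
            return p;
        }

        static Properties exactlyOnceConsumer() {
            Properties p = new Properties();
            p.put("isolation.level", "read_committed");
            p.put("enable.auto.commit", "false");
            return p;
        }
    }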

Avoid Data Loss While Processing Messages from Kafka

I'm looking for the best approach for designing my Kafka consumer. Basically, I would like to know the best way to avoid data loss in case there are any exceptions/errors while processing the messages.
My use case is as below.
a) The reason I am using a SERVICE to process the message is that in the future I am planning to write an ERROR PROCESSOR application, which would run at the end of the day and try to reprocess the failed messages (not all messages, but messages that fail because of dependencies, like a missing parent).
b) I want to make sure there is zero message loss, so I will save the message to a file in case there are any issues while saving the message to the DB.
c) In a production environment there can be multiple instances of the consumer and services running, so there is a high chance that multiple applications will try to write to the same file.
Q-1) Is writing to a file the only option to avoid data loss?
Q-2) If it is the only option, how do I make sure multiple applications can write to the same file and read from it at the same time? Please consider that in the future, once the error processor is built, it might be reading messages from the same file while another application is trying to write to the file.
ERROR PROCESSOR - Our source follows an event-driven mechanism, and there is a high chance that sometimes the dependent event (for example, the parent entity of something) might get delayed by a couple of days. In that case, I want my ERROR PROCESSOR to be able to process the same messages multiple times.
I've run into something similar before. So, diving straight into your questions:
Not necessarily, you could perhaps send those messages back to Kafka in a new topic (let's say - error-topic). So, when your error processor is ready, it could just listen to this error-topic and consume those messages as they come in.
I think this question has been addressed in response to the first one. So, instead of using a file to write to and read from and open multiple file handles to do this concurrently, Kafka might be a better choice as it is designed for such problems.
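A rough sketch of that dead-letter pattern; the topic names and the saveToDb call are placeholders for your service:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class ErrorTopicConsumer {
        public static void main(String[] args) {
            Properties consumerProps = new Properties();
            consumerProps.put("bootstrap.servers", "localhost:9092");
            consumerProps.put("group.id", "message-service");
            consumerProps.put("enable.auto.commit", "false");
            consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            Properties producerProps = new Properties();
            producerProps.put("bootstrap.servers", "localhost:9092");
            producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
                 KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
                consumer.subscribe(Collections.singletonList("incoming-messages"));
                while (true) {
                    for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                        try {
                            saveToDb(record.value());
                        } catch (Exception e) {
                            // Instead of a file, park the failed message on an error
                            // topic for the ERROR PROCESSOR to replay later.
                            producer.send(new ProducerRecord<>("error-topic", record.key(), record.value()));
                        }
                    }
                    consumer.commitSync(); // commit only after the batch has been handled
                }
            }
        }

        private static void saveToDb(String message) { /* your DB write goes here */ }
    }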
Note: The following point is just some food for thought based on my limited understanding of your problem domain. So, you may just choose to ignore this safely.
One more point worth considering on your design for the service component - You might as well consider merging points 4 and 5 by sending all the error messages back to Kafka. That will enable you to process all error messages in a consistent way as opposed to putting some messages in the error DB and some in Kafka.
EDIT: Based on the additional information on the ERROR PROCESSOR requirement, here's a diagrammatic representation of the solution design.
I've deliberately kept the output of the ERROR PROCESSOR abstract for now just to keep it generic.
I hope this helps!
If you don't commit the consumed message before writing to the database, then nothing would be lost while Kafka retains the message. The tradeoff of that would be that if the consumer did commit to the database, but a Kafka offset commit fails or times out, you'd end up consuming records again and potentially have duplicates being processed in your service.
Even if you did write to a file, you wouldn't be guaranteed ordering unless you opened a file per partition and ensured all consumers only ran on a single machine (because you're preserving state there, which isn't fault-tolerant). Deduplication would still need to be handled as well.
Also, rather than writing your own consumer that inserts into a database, you could look into the Kafka Connect framework. For validating messages, you could similarly deploy a Kafka Streams application that filters bad messages out of the input topic and writes the valid ones to a topic that is then sent to the DB.
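For example, a small Kafka Streams app along those lines could look roughly like this (topic names and the validation rule are made up):

    import java.util.Properties;

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    public class MessageValidator {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "message-validator");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> input = builder.stream("incoming-messages");

            // Valid messages continue to the topic the DB sink consumes from;
            // everything else is routed to an error topic for later reprocessing.
            input.filter((key, value) -> isValid(value)).to("db-ready-messages");
            input.filterNot((key, value) -> isValid(value)).to("error-messages");

            new KafkaStreams(builder.build(), props).start();
        }

        private static boolean isValid(String value) {
            return value != null && !value.isEmpty(); // placeholder validation rule
        }
    }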

Multithreading in Apache Beam: Reading Files in Separate Threads

We have a requirement to create separate threads for reading multiple files.
Thread 1 can read File 1 and create a PCollection<String>. Can I execute a ParDo operation in a multithreaded environment and create a PCollection<String, String> from the PCollection<String>?
Thread 2 would complete the same operation as Thread 1, but on a different file, File 2.
Join the output of File 1 and File 2 in the main thread after the Thread 1 and Thread 2 operations are completed.
Could you please tell me whether this is possible and whether it is a recommended approach?
It sounds like what you want can be done with Beam. In the Beam model, you do not define how you want your operations to run, but rather what operations you want to perform; then Beam and the underlying runner take care of managing threads.
That's why you generally shouldn't manage your own threads to read files in Beam. You should use TextIO to read from plain text files, and the TextIO module should read the files in parallel.
There are a few cases when your files will not be able to be read in parallel:
Your files are compressed. This means that each file has to be decompressed and read sequentially, so it cannot be read from different offsets simultaneously.
You have too many files (1000s). If you have thousands or tens of thousands of files, you may want to use TextIO.readAll instead of the normal TextIO implementation, because keeping track of thousands of files that are being read in parallel can overwhelm the system.
Let me know if you are using non-plain-text files or another kind of source.
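To make that concrete, here is a rough sketch with the Beam Java SDK; the file paths and the parsing logic are made up, and Beam plus the runner decide how to parallelize the reads and the ParDo, so no explicit threads are involved:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.DoFn.ProcessElement;
    import org.apache.beam.sdk.transforms.Flatten;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionList;

    public class ReadTwoFiles {
        // Hypothetical parsing: split each line into a key/value pair on the first tab.
        static class ParseLineFn extends DoFn<String, KV<String, String>> {
            @ProcessElement
            public void processElement(ProcessContext c) {
                String[] parts = c.element().split("\t", 2);
                c.output(KV.of(parts[0], parts.length > 1 ? parts[1] : ""));
            }
        }

        public static void main(String[] args) {
            Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

            PCollection<KV<String, String>> file1 = p
                .apply("ReadFile1", TextIO.read().from("/data/file1.txt"))
                .apply("ParseFile1", ParDo.of(new ParseLineFn()));

            PCollection<KV<String, String>> file2 = p
                .apply("ReadFile2", TextIO.read().from("/data/file2.txt"))
                .apply("ParseFile2", ParDo.of(new ParseLineFn()));

            // "Join in the main thread" becomes a Flatten of the two collections.
            PCollectionList.of(file1).and(file2)
                .apply("MergeFiles", Flatten.pCollections());

            p.run().waitUntilFinish();
        }
    }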

use of Mongo/Redis on a full firehose stream

I have been reading up on how DataSift uses different technologies to consume the Twitter firehose, and since I need to follow the same concept, I wanted to get some understanding of the differences between Mongo/Redis and their use in storing realtime data. My understanding is this:
The stream volume is way too high to simply consume and place the data (tweets etc.) into, for example, a bunch of RabbitMQ queues. My concern is the issue of data loss. My current architecture involves connecting to an open stream, consuming the data, and pushing each post or message into a couple of queues in RabbitMQ. The queues hold a copy of each message, with one being the processing queue and one being the storage queue. I then consume each queue: the processing queue work is time-intensive, but my workers keep up fine, and I write all the storage queue contents to files, which works fine.
If my volume were to increase 100x, I am told this current setup would not be able to handle the volume and that using the Mongo/Redis approach would be better. I'm not sure how this would be implemented: would I then consume the stream into Mongo and from there into queues, and why would this be a better approach?

Multiple Actors that write to the same file + rotate

I have written a very simple webserver in Scala (based on Actors). Its purpose is to log events from our frontend server (such as when a user clicks a button or a page is loaded). The file will need to be rotated every 64-100 MB or so, and it will be sent to S3 for later analysis with Hadoop. The amount of traffic will be about 50-100 calls/s.
Some questions that pop into my mind:
How do I make sure that all actors can write to one file in a thread-safe way?
What is the best way to rotate the file after X MB? Should I do this in my code or from the filesystem (and if I do it from the filesystem, how do I then verify that the file isn't in the middle of a write or that the buffer has been flushed)?
One simple method would be to have a single file writer actor that serialized all writes to the disk. You could then have multiple request handler actors that fed it updates as they processed logging events from the frontend server. You'd get concurrency in request handling while still serializing writes to your log file.

Having more than a single actor would open the possibility of concurrent writes, which would at best corrupt your log file. Basically, if you want something to be thread-safe in the actor model, it should be executed on a single actor. Unfortunately, your task is inherently serial at the point you write to disk.

You could do something more involved like merge log files coming from multiple actors at rotation time, but that seems like overkill. Unless you're generating that 64-100MB in a second or two, I'd be surprised if the extra threads doing I/O bought you anything.
Assuming a single writing actor, it's pretty trivial to calculate the amount that has been written since the last rotation, and I don't think tracking it in the actor's internal state versus polling the filesystem would make a difference one way or the other.
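Just to illustrate the single-writer idea (the question is Scala-based, but the pattern is identical), here is a minimal sketch using Akka's Java API, with a made-up file path and message format:

    import java.io.BufferedWriter;
    import java.io.FileWriter;
    import java.io.IOException;

    import akka.actor.AbstractActor;
    import akka.actor.ActorRef;
    import akka.actor.ActorSystem;
    import akka.actor.Props;

    // All log lines funnel through this one actor, so writes to the file are
    // serialized without any explicit locking.
    public class LogWriterActor extends AbstractActor {
        private final BufferedWriter writer;

        public LogWriterActor(String path) throws IOException {
            this.writer = new BufferedWriter(new FileWriter(path, true));
        }

        public static Props props(String path) {
            return Props.create(LogWriterActor.class, () -> new LogWriterActor(path));
        }

        @Override
        public Receive createReceive() {
            return receiveBuilder()
                .match(String.class, line -> {
                    writer.write(line);
                    writer.newLine();
                    writer.flush(); // keep the on-disk size close to what was logged
                })
                .build();
        }

        public static void main(String[] args) {
            ActorSystem system = ActorSystem.create("logging");
            ActorRef logWriter = system.actorOf(LogWriterActor.props("frontend-events.log"), "log-writer");
            // Request handler actors (or plain threads) just fire-and-forget lines at it.
            logWriter.tell("user=42 action=click button=signup", ActorRef.noSender());
        }
    }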
You can use a single actor to write every request coming from different threads; since all of the requests go through this one actor, there will be no concurrency problems.
As for file rolling, if your write requests can be logged line by line, then you can use log4j's or logback's RollingFileAppender. Otherwise, you can write your own, which is easy as long as you remember to lock the file before performing any delete or update operations.
Rolling usually means you rename the older files and the current file to other names and then create a new file with the current file name; that way, you always write to the file with the current file name.
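A bare-bones sketch of that rename-based rolling, assuming a single writer that calls it between writes (the threshold and naming scheme are arbitrary):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;

    // Rename-based rolling: close the current file, rename it with a timestamp,
    // then reopen a fresh file under the original name. Safe only if a single
    // writer calls this between writes, never in the middle of one.
    public class LogRoller {
        private final Path current;
        private final long maxBytes;

        public LogRoller(Path current, long maxBytes) {
            this.current = current;
            this.maxBytes = maxBytes;
        }

        /** Returns the rotated file (ready to ship to S3), or null if no rotation happened. */
        public Path rollIfNeeded() throws IOException {
            if (Files.exists(current) && Files.size(current) >= maxBytes) {
                Path rotated = current.resolveSibling(
                    current.getFileName() + "." + System.currentTimeMillis());
                Files.move(current, rotated, StandardCopyOption.ATOMIC_MOVE);
                return rotated;
            }
            return null;
        }

        public static void main(String[] args) throws IOException {
            LogRoller roller = new LogRoller(Paths.get("frontend-events.log"), 64L * 1024 * 1024);
            Path rotated = roller.rollIfNeeded(); // call from the writer actor between writes
            if (rotated != null) {
                System.out.println("ship to S3: " + rotated);
            }
        }
    }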