Multithreading in Apache Beam: Reading Files in Separate Threads - apache-beam

We have a requirement to create separate threads for reading multiple files.
Thread 1 can read File 1 and create a PCollection<String>. Can I then execute a ParDo operation in a multithreaded environment and create a PCollection<KV<String, String>> from the PCollection<String>?
Thread 2 would perform the same operations as Thread 1, but on a different file, File 2.
The outputs of File 1 and File 2 would then be joined in the main thread after Thread 1 and Thread 2 have completed.
Could you please tell me whether this is possible and whether it is a recommended approach?

It sounds like what you want can be done with Beam. In the Beam model, you do not define how you want your operations to run, but rather what operations you want to perform; then Beam and the underlying runner take care of managing threads.
That's why you generally shouldn't manage your own threads to read files in Beam. You should use TextIO to read from plain text files, and TextIO will read the files in parallel.
There are a few cases when your files will not be able to be read in parallel:
Your files are compressed. A compressed file has to be decompressed sequentially as it is read, so it cannot be read from different offsets simultaneously.
You have too many files (thousands). If you have thousands or tens of thousands of files, you may want to use TextIO.readAll instead of the normal TextIO implementation, because keeping track of thousands of files being read in parallel can overwhelm the system.
Let me know if you are using non-plain-text files, or some other kind of source.
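For illustration, here is a minimal sketch of that approach with the Beam Java SDK. The file glob and the key/value parsing are assumptions; the point is that TextIO and ParDo are parallelized by the runner, with no explicit threads in user code:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class ReadFilesInParallel {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // A single glob covers both files; the runner parallelizes the reads.
    PCollection<String> lines = p.apply(TextIO.read().from("/data/file*.txt"));

    // The ParDo is also parallelized by the runner; no manual threads needed.
    PCollection<KV<String, String>> keyed =
        lines.apply(ParDo.of(new DoFn<String, KV<String, String>>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            // Hypothetical line format: "key,value".
            String[] parts = c.element().split(",", 2);
            c.output(KV.of(parts[0], parts.length > 1 ? parts[1] : ""));
          }
        }));

    p.run().waitUntilFinish();
  }
}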

Related

Create Kafka stream of events from lines of a single massive file

In DNA informatics the files are massive (300 GB each, and biobanks have hundreds of thousands of files) and they need to go through six or so lengthy downstream pipelines (hours to weeks each). Because I do not work at the company that manufactures the sequencing machines, I do not have access to the data as it is being generated...nor do I write assembly language.
What I would like to do is transform the lines of text from that 300 GB file into stream events, then pass these messages through the six pipelines with Kafka brokers handing off to Spark Streaming between each pipeline.
Is this possible? Is this the wrong use case? It would be nice to rerun single events as opposed to entire failed batches.
Desired Workflow:
------pipe1------
_------pipe2------
__------pipe3------
___------pipe4------
Current Workflow:
------pipe1------
_________________------pipe2------
__________________________________------pipe3------
___________________________________________________------pipe4------
Kafka is not meant for sending files, only relatively small events. Even if you did send a file line by line, you would need to know how to put the file back together to process it, so you'd effectively be doing the same thing as streaming the file through a raw TCP socket.
Kafka has a default maximum message size of 1 MB, and while you can increase it, I wouldn't recommend pushing it much beyond the double-digit MB range.
How can I send large messages with Kafka (over 15MB)?
If you really needed to get data like that through Kafka, the recommended pattern is to put your large files on external storage (HDFS, S3, whatever), then put the URI to the file in the Kafka event, and let consumers deal with reading that data source.
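As a rough illustration of that pointer pattern in Java (the broker address, topic name, and S3 URI are all hypothetical):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class FilePointerProducer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
    props.put("key.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      // Send only a pointer to the large file, not its contents; the
      // consumer fetches the data from shared storage itself.
      String uri = "s3://genomics-bucket/sample-123.fastq"; // hypothetical URI
      producer.send(new ProducerRecord<>("sequencing-files", "sample-123", uri));
    }
  }
}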
If the files have any structure at all (pages, for example), then you could use Spark and a custom Hadoop InputFormat to serialize those and process the data in parallel that way. It doesn't necessarily have to be through Kafka, though. You could try Apache NiFi, which I hear handles larger files better (maybe not GB-sized, though).

Spring Batch and Executors Framework

Are these two frameworks used for the same purpose? If not, why, and how are they used in real-world applications? Are there any tutorials for learning them?
Spring Batch is meant for batch processing of files by executing them in a series of jobs. Batch processing could be reading from CSV, XML, or any flat file and writing to a database. Spring Batch provides many ready-made classes to read/write CSV, XML, and databases.
http://www.mkyong.com/tutorials/spring-batch-tutorial/
The Java ExecutorService, on the other hand, is all about spawning multiple threads in a thread pool and executing tasks on them for any purpose, batch or otherwise. It gives you finer control over the threads themselves, though without the transaction management that Spring Batch provides. It has been available since Java 5. There are also several submission methods, depending on whether or not you want to receive a response through a Future object.
http://tutorials.jenkov.com/java-util-concurrent/executorservice.html
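Here is a minimal sketch of that submit-versus-execute distinction, i.e. whether or not you want a response back through a Future (pool size and tasks are arbitrary):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ExecutorDemo {
  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(4);

    // submit() returns a Future when you want a result back...
    Future<Integer> result = pool.submit(() -> 2 + 2);

    // ...while execute() is fire-and-forget.
    pool.execute(() -> System.out.println("no response needed"));

    System.out.println("computed: " + result.get()); // blocks until done
    pool.shutdown();
  }
}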

Filestream Pointers

I need to copy many small file streams into one file stream. Each file stream has its own designated position: FS1 goes first, then FS2, and so on. But when the program is multithreaded, whichever thread finishes processing first adds its stream first, which causes errors.
Is there any way we can define each stream's position so that no matter what order we add them in, they end up in the right place?
I tried creating many placeholder headers beforehand so that the file streams would replace those headers, but searching for those headers just slows the program down.
This question is a continuation of my last one: since the first-processed file stream is copied first, we need to define the location where each one will be copied.
Please refer to this question:
Sequential MT
You can't have multiple threads write to the same file stream at the same time without wrapping it with a synchronization lock, and you would also need to re-seek the stream back and forth depending on which thread needs to write to it at any given moment, so it writes at the correct offset within the file. That is a lot of overhead.
You can, however, have multiple threads use different file streams to write to the same file at the same time, provided the sharing rights between the streams are compatible to allow concurrent writing and preserve data. Pre-size the file to the desired length, then divide up sections of that length amongst the threads as needed. Give each thread its own stream to the target file, first seeking to the appropriate start offset of its assigned section. Then each thread can write to its respective stream normally, without having to synch with the other threads or re-seek its stream. Just make sure each thread does not exceed the bounds of its assigned section within the file so it does not overwrite another thread's data.
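The question appears to concern .NET's FileStream, but the pattern is platform-independent. As a hedged sketch, here is the same idea in Java, where FileChannel's positional write makes the per-thread offset explicit (file name, chunk contents, and pool size are placeholders):

import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class OffsetWriters {
  public static void main(String[] args) throws Exception {
    Path target = Paths.get("combined.bin"); // assumed output file
    byte[][] chunks = { "AAA".getBytes(), "BBB".getBytes(), "CCC".getBytes() };

    // Pre-compute each chunk's start offset so order of completion no
    // longer matters: FS1's bytes always land first, then FS2, and so on.
    long[] offsets = new long[chunks.length];
    for (int i = 1; i < chunks.length; i++) {
      offsets[i] = offsets[i - 1] + chunks[i - 1].length;
    }

    ExecutorService pool = Executors.newFixedThreadPool(chunks.length);
    for (int i = 0; i < chunks.length; i++) {
      final int idx = i;
      pool.execute(() -> {
        // Each thread opens its own channel and writes at its own offset,
        // so there is no shared position and no lock is needed.
        try (FileChannel ch = FileChannel.open(target,
            StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
          ch.write(ByteBuffer.wrap(chunks[idx]), offsets[idx]);
        } catch (Exception e) {
          e.printStackTrace();
        }
      });
    }
    pool.shutdown();
  }
}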

Spring Batch job to process a dynamic set of files in parallel

I am trying to design a Spring Batch job that processes a dynamic set of files in parallel. That is, when the batch job itself is started, the number of files to process is not known; the files become available dynamically. The job should run and continue to process files in parallel as new files arrive, until it has finished processing all of them.
I have gone through the Spring Batch project page, and from my understanding it looks like a multi-threaded step is suitable for my case. But what I am not sure of is whether it can support dynamic availability of the files to be processed.
Any inputs will be highly appreciated.
Thanks and regards,
Priya
You have a couple of options here:
MultiResourceItemReader - This ItemReader wraps a delegate reader such as the FlatFileItemReader and loops through the resources provided (for example, via a wildcard expression).
Partitioning - This option is better for parallel processing of files. Using the MultiResourcePartitioner, you can process the files in parallel with all the restartability and other features you'd normally get with Spring Batch (see the sketch after the link below).
You can read more about partitioning in the documentation here: http://docs.spring.io/spring-batch/trunk/reference/html/scalability.html
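A rough sketch of the partitioning option in Java configuration follows; the input pattern, step names, and grid size are assumptions. Note that MultiResourcePartitioner resolves its resources when the step starts, so files arriving after launch would require a new job execution:

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.partition.support.MultiResourcePartitioner;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.Resource;
import org.springframework.core.io.support.PathMatchingResourcePatternResolver;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

@Configuration
public class PartitionedFileJobConfig {

  @Bean
  public MultiResourcePartitioner partitioner() throws Exception {
    // The glob is resolved at job launch, so whatever files exist at
    // that moment become the partitions.
    MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
    Resource[] files = new PathMatchingResourcePatternResolver()
        .getResources("file:/input/dir/*.csv"); // assumed input location
    partitioner.setResources(files);
    return partitioner;
  }

  @Bean
  public Step masterStep(StepBuilderFactory steps, Step workerStep) throws Exception {
    // One worker step execution per file, run concurrently.
    return steps.get("masterStep")
        .partitioner("workerStep", partitioner())
        .step(workerStep)
        .gridSize(4)
        .taskExecutor(new SimpleAsyncTaskExecutor())
        .build();
  }
}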

Multiple Actors that writes to the same file + rotate

I have written a very simple webserver in Scala (based on actors). Its purpose is to log events from our frontend server (such as a user clicking a button or a page being loaded). The file will need to be rotated every 64-100 MB or so, and it will be sent to S3 for later analysis with Hadoop. The amount of traffic will be about 50-100 calls/s.
Some questions that pop into my mind:
How do I make sure that all actors can write to one file in a thread-safe way?
What is the best way to rotate the file after X MB? Should I do this in my code or from the filesystem? (If I do it from the filesystem, how do I verify that the file isn't in the middle of a write, or that the buffer has been flushed?)
One simple method would be to have a single file-writer actor that serializes all writes to disk. You could then have multiple request-handler actors that feed it updates as they process logging events from the frontend server. You'd get concurrency in request handling while still serializing writes to your log file. Having more than a single actor would open the possibility of concurrent writes, which would at best corrupt your log file.
Basically, if you want something to be thread-safe in the actor model, it should be executed on a single actor. Unfortunately, your task is inherently serial at the point you write to disk. You could do something more involved, like merging log files coming from multiple actors at rotation time, but that seems like overkill. Unless you're generating that 64-100 MB in a second or two, I'd be surprised if extra threads doing I/O bought you anything.
Assuming a single writing actor, it's pretty trivial to calculate the amount that has been written since the last rotation and I don't think tracking in the actor's internal state versus polling the filesystem would make a difference one way or the other.
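To make the single-writer idea concrete without pulling in an actor library, here is a hedged Java sketch that gets the same guarantee from a single-threaded executor; the class and path are illustrative, not the poster's actual code:

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SingleWriterLog {
  // A single-threaded executor serializes all writes, just as a single
  // writer actor would: many producers, exactly one thread touching the file.
  private final ExecutorService writer = Executors.newSingleThreadExecutor();
  private final BufferedWriter out;

  public SingleWriterLog(String path) throws IOException {
    out = Files.newBufferedWriter(Paths.get(path),
        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
  }

  // Called concurrently by many request handlers; each call becomes a
  // task on the writer's queue, so lines never interleave.
  public void log(String line) {
    writer.execute(() -> {
      try {
        out.write(line);
        out.newLine();
        out.flush(); // flush per event so rotation sees accurate sizes
      } catch (IOException e) {
        e.printStackTrace();
      }
    });
  }
}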
You can use a single actor to write every request coming from different threads; since all of the requests go through this one actor, there will be no concurrency problems.
As for rolling the file, if your write requests can be logged line by line, you can resort to log4j's or logback's rolling file appenders. Otherwise, you can write your own, which is easy as long as you remember to lock the file before performing any delete or update operations.
Rolling usually means you rename the current file to another name and then create a new file under the current file name, so you can always keep writing to the file with the current name.
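A minimal sketch of that rename-then-recreate rotation, assuming the single-writer setup above so no file lock is needed (the size threshold and naming scheme are arbitrary):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class Rotation {
  static final long MAX_BYTES = 64L * 1024 * 1024; // rotate around 64 MB

  // Rename the current file aside and let the writer reopen a fresh one.
  // Safe only if called from the single writer thread, after a flush.
  static void rotateIfNeeded(Path current) throws IOException {
    if (Files.exists(current) && Files.size(current) >= MAX_BYTES) {
      Path rotated = Paths.get(current + "." + System.currentTimeMillis());
      Files.move(current, rotated, StandardCopyOption.ATOMIC_MOVE);
      // A separate task can now upload 'rotated' to S3.
    }
  }
}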