Multiple actors that write to the same file + rotate - Scala

I have written a very simple webserver in Scala (based on Actors). The purpose
of it is to log events from our frontend server (such as a user clicking a
button or a page being loaded). The file will need to be rotated every 64-100 MB or so and
it will be sent to S3 for later analysis with Hadoop. The amount of traffic will
be about 50-100 calls/s.
Some questions that pop into my mind:
How do I make sure that all actors can write to one file in a thread-safe way?
What is the best way to rotate the file after X MB? Should I do this
in my code or from the filesystem? (If I do it from the filesystem, how do I then verify
that the file isn't in the middle of a write or that the buffer has been flushed?)

One simple method would be to have a single file-writer actor that serializes all writes to the disk. You could then have multiple request-handler actors that feed it updates as they process logging events from the frontend server. You'd get concurrency in request handling while still serializing writes to your log file.

Having more than a single writer actor would open the possibility of concurrent writes, which would at best corrupt your log file. Basically, if you want something to be thread-safe in the actor model, it should be executed on a single actor. Unfortunately, your task is inherently serial at the point you write to disk.

You could do something more involved, like merging log files coming from multiple actors at rotation time, but that seems like overkill. Unless you're generating that 64-100 MB in a second or two, I'd be surprised if the extra threads doing I/O bought you anything.
Assuming a single writing actor, it's pretty trivial to track the amount that has been written since the last rotation, and I don't think keeping that count in the actor's internal state versus polling the filesystem would make a difference one way or the other.
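For illustration, here is a minimal sketch of that single-writer approach using Akka (the question uses the older Scala Actors library, so the API differs; the message type, file name, and 64 MB threshold are assumptions for the example):

    import java.io.{BufferedWriter, File, FileWriter}
    import akka.actor.{Actor, ActorSystem, Props}

    // Hypothetical message type carrying one frontend event line.
    case class LogLine(line: String)

    class LogWriterActor(path: String, maxBytes: Long) extends Actor {
      private var writer = new BufferedWriter(new FileWriter(path, true)) // append mode
      private var bytesWritten = new File(path).length()

      def receive: Receive = {
        case LogLine(line) =>
          writer.write(line)
          writer.newLine()
          bytesWritten += line.getBytes("UTF-8").length + 1
          if (bytesWritten >= maxBytes) rotate()
      }

      // Close the current file, move it aside (e.g. for upload to S3), start a new one.
      private def rotate(): Unit = {
        writer.close()
        new File(path).renameTo(new File(s"$path.${System.currentTimeMillis()}"))
        writer = new BufferedWriter(new FileWriter(path))
        bytesWritten = 0L
      }

      override def postStop(): Unit = writer.close()
    }

    object LoggingApp extends App {
      val system = ActorSystem("logging")
      val writer = system.actorOf(Props(new LogWriterActor("events.log", 64L * 1024 * 1024)))
      // Any number of request-handler actors can safely send to this single writer.
      writer ! LogLine("""{"event":"click","button":"signup"}""")
    }

Because every write goes through one mailbox, the byte counter and the rotation stay trivially consistent with what is actually on disk.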

You can use a single actor to write every request coming from different threads; since all of the requests go through this actor, there will be no concurrency problems.
As for file rolling, if your write requests can be logged line by line, you can rely on log4j's or logback's RollingFileAppender. Otherwise, you can write your own, which is easy as long as you remember to lock the file before performing any delete or update operations.
Rolling usually means you rename the older files and the current file to other names and then create a new file with the current file name, so you always write to the file with the current name.
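As a concrete illustration of that rename-based scheme, here is a minimal fixed-window roll in Scala (the file names and the window size of 5 are assumptions for the example):

    import java.nio.file.{Files, Paths, StandardCopyOption}

    object LogRoller {
      // Shift older files up by one index, move the current file to index 1,
      // then recreate an empty file under the current name for new writes.
      def roll(base: String, maxIndex: Int = 5): Unit = {
        for (i <- (maxIndex - 1) to 1 by -1) {
          val from = Paths.get(s"$base.$i")
          if (Files.exists(from))
            Files.move(from, Paths.get(s"$base.${i + 1}"), StandardCopyOption.REPLACE_EXISTING)
        }
        val current = Paths.get(base)
        if (Files.exists(current))
          Files.move(current, Paths.get(s"$base.1"), StandardCopyOption.REPLACE_EXISTING)
        Files.createFile(current)
      }
    }

If the writer is a single actor, calling something like LogRoller.roll from inside that actor keeps the rename and the reopening of the file serialized with the writes, so no explicit file lock is needed.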

Related

Create Kafka stream of events from lines of a single massive file

In DNA informatics the files are massive (300 GB each, and biobanks have hundreds of thousands of files) and they need to go through 6 or so lengthy downstream pipelines (hours to weeks). Because I do not work at the company that manufactures the sequencing machines, I do not have access to the data as it is being generated... nor do I write assembly language.
What I would like to do is transform the lines of text from that 300 GB file into stream events, then pass these messages through the 6 pipelines with Kafka brokers handing off to Spark Streaming between each pipeline.
Is this possible? Is this the wrong use case? It would be nice to rerun single events as opposed to entire failed batches.
Desired Workflow:
------pipe1------
_------pipe2------
__------pipe3------
___------pipe4------
Current Workflow:
------pipe1------
_________________------pipe2------
__________________________________------pipe3------
___________________________________________________------pipe4------
Kafka is not meant for sending files, only relatively small events. Even if you did send a file line by line, you would need to know how to put the file back together to process it, and thus you're effectively doing the same thing as streaming the file through a raw TCP socket.
Kafka has a default maximum message size of 1 MB, and while you can increase it, I wouldn't recommend pushing it much beyond double-digit MB sizes.
How can I send large messages with Kafka (over 15MB)?
If you really need to get data like that through Kafka, the recommended pattern is to put your large files on external storage (HDFS, S3, whatever), then put the URI to the file within the Kafka event, and let consumers deal with reading that data source.
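A minimal sketch of that pointer-in-the-event pattern with the plain Kafka producer API (the topic name, broker address, and S3 path are illustrative assumptions):

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
    import org.apache.kafka.common.serialization.StringSerializer

    object FilePointerProducer extends App {
      val props = new Properties()
      props.put("bootstrap.servers", "localhost:9092")
      props.put("key.serializer", classOf[StringSerializer].getName)
      props.put("value.serializer", classOf[StringSerializer].getName)

      val producer = new KafkaProducer[String, String](props)
      // The large file stays in external storage; only its URI travels through Kafka.
      // Consumers are expected to fetch and process the object at this URI themselves.
      val record = new ProducerRecord[String, String](
        "genome-files", "sample-42", "s3://my-bucket/reads/sample-42.fastq.gz")
      producer.send(record).get()   // block until the broker acknowledges
      producer.close()
    }

The events stay tiny regardless of how large the referenced files grow, which keeps you well inside the default message-size limits.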
If the files have any structure at all to them (like pages, for example), then you could use Spark and a custom Hadoop InputFormat to serialize those, and process data in parallel that way. Doesn't necessarily have to be through Kafka, though. You could try Apache NiFi, which I hear processes larger files better (maybe not GB, though).

How to use DataFrames within SparkListener?

I've written a CustomListener (deriving from SparkListener, etc...) and it works fine; I can intercept the metrics.
The question is about using DataFrames within the listener itself, as that assumes the usage of the same SparkContext; however, as of 2.1.x there is only 1 context per JVM.
Suppose I want to write some metrics to disk as JSON. Doing it at ApplicationEnd is not possible, only at the last jobEnd (if you have several jobs, the last one).
Is that possible/feasible?
I'm trying to measure the performance of jobs/stages/tasks, record that, and then analyze it programmatically. Maybe that is not the best way? The web UI is good, but I need to make things presentable.
I can force the creation of DataFrames upon the jobEnd event, however a few errors are thrown (basically they refer to not being able to propagate events to the listener) and in general I would like to avoid unnecessary manipulations. I want to have a clean set of measurements that I can record and write to disk.
SparkListeners should be as fast as possible, as a slow SparkListener would block the others from receiving events. You could use separate threads to release the main event dispatcher thread, but you're still bound to the limitation of having a single SparkContext per JVM.
That limitation is, however, easy to overcome, since you can ask for the current SparkContext using SparkContext.getOrCreate.
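For illustration, a minimal sketch of a listener that reaches the already-running context via SparkContext.getOrCreate and appends one JSON line per finished job (the output path and the metric payload are assumptions; the handler is kept cheap on purpose, since a slow listener delays event delivery to the others):

    import java.nio.file.{Files, Paths, StandardOpenOption}
    import org.apache.spark.SparkContext
    import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd}

    class MetricsListener extends SparkListener {
      override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = {
        // Returns the single SparkContext that is already active in this JVM.
        val sc = SparkContext.getOrCreate()
        val json =
          s"""{"app":"${sc.applicationId}","jobId":${jobEnd.jobId},"endTime":${jobEnd.time}}\n"""
        Files.write(Paths.get("/tmp/job-metrics.json"), json.getBytes("UTF-8"),
          StandardOpenOption.CREATE, StandardOpenOption.APPEND)
      }
    }

    // Register it with spark.extraListeners or sc.addSparkListener(new MetricsListener)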
I'd however not recommend the architecture. It puts too much pressure on the driver's JVM, which should rather "focus" on the application processing (not on collecting events, which it probably already does for the web UI and/or Spark History Server).
I'd rather use Kafka or Cassandra or some other persistent storage to store the events and have some other processing application consume them (just like the Spark History Server works).

Filestream Pointers

I need to copy many small file streams into one file stream. Each file stream has its own designated position, i.e. FS1 goes first, then FS2, and so on. But when multi-threading the program, the thread which finishes processing first writes first, which causes errors.
Is there any way that we can define each stream's position so that no matter in what sequence we add them, they end up in the right place?
I tried this by creating many headers beforehand so that the file streams would replace those headers, but searching for those headers just slows down the program.
This question is a continuation of my last one: the first processed file stream gets copied first, so we need to define the location where each one will be copied.
Please refer to this question:
Sequential MT
You can't have multiple threads write to the same file stream at the same time without wrapping it with a synchronization lock, and you would also need to re-seek the stream back and forth depending on which thread needs to write to it at any given moment, so it writes at the correct offset within the file. That is a lot of overhead.
You can, however, have multiple threads use different file streams to write to the same file at the same time, provided the sharing rights between the streams are compatible to allow concurrent writing and preserve data. Pre-size the file to the desired length, then divide up sections of that length amongst the threads as needed. Give each thread its own stream to the target file, first seeking to the appropriate start offset of its assigned section. Then each thread can write to its respective stream normally, without having to synch with the other threads or re-seek its stream. Just make sure each thread does not exceed the bounds of its assigned section within the file so it does not overwrite another thread's data.
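A minimal JVM sketch of that idea (the question looks like .NET, but the same pattern works with java.nio positional writes): the target file is pre-sized, each worker opens its own channel and writes only inside its assigned section, so no locking or re-seeking of a shared stream is needed. The chunk contents and sizes are illustrative assumptions.

    import java.io.RandomAccessFile
    import java.nio.ByteBuffer
    import java.nio.channels.FileChannel
    import java.nio.file.{Paths, StandardOpenOption}
    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration._

    object ParallelSectionWriter extends App {
      val target = Paths.get("merged.bin")
      val chunks: Seq[Array[Byte]] = Seq("AAAA", "BBBB", "CCCC").map(_.getBytes("UTF-8"))
      val offsets = chunks.scanLeft(0L)(_ + _.length)   // start offset of each section

      // Pre-size the file to the total length so every section already exists.
      val raf = new RandomAccessFile(target.toFile, "rw")
      raf.setLength(offsets.last)
      raf.close()

      val writes = chunks.zip(offsets).map { case (bytes, offset) =>
        Future {
          // Each task gets its own channel and writes at its own absolute offset.
          val ch = FileChannel.open(target, StandardOpenOption.WRITE)
          try ch.write(ByteBuffer.wrap(bytes), offset)  // positional write, no seeking
          finally ch.close()
        }
      }
      Await.result(Future.sequence(writes), 30.seconds)
    }

Completion order no longer matters: whichever task finishes first still lands its bytes at the offset assigned to it.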

How can re-running event handlers in production be done?

In production environments event numbers scale massively; in cases of emergency, how can you re-run all the handlers when it can take days if there are too many?
That depends on which sort of emergency you are describing.
If the nature of your emergency is that your event handlers have fallen massively behind the writers (e.g. your message consumers were blocked, and you now have 48 hours of backlog waiting for you) -- there's not much you can do. If your consumer is parallelizable, you may be able to speed things up by using a data structure like the LMAX Disruptor to support parallel recovery.
(Analog: you decide to introduce a new read model, which requires processing a huge backlog of data to achieve the correct state. There isn't any "answer", except chewing through them all. In some cases, you may be able to create an approximation based on some manageable number of events, while waiting for the real answer to complete, but there's no shortcut to processing all events).
On the other hand, in cases where the history is large but the backlog is manageable (i.e. the write model wasn't producing new events), you can usually avoid needing a full replay.
In the write model: most event-sourced solutions leverage an event store that supports multiple event streams - each aggregate in the write model has a dedicated stream. Massive event numbers usually mean massive numbers of manageable streams. Where that's true, you can just leave the write model alone -- load the entire history on demand.
In cases where that assumption doesn't hold -- a part of the write model that has an extremely large stream, or pieces of the read model that compose events from multiple streams -- the usual answer is snapshotting.
Which is to say, in the healthy system, the handlers persist their state on some schedule, and include in the metadata an identifier that tracks where in the history that snapshot was taken.
To recover, you reload the snapshot, and the identifier. You then start the replay from that point (this assumes you've got an event store that allows you to start the replay from an arbitrary point in the history).
So managing recovery time is simply a matter of tuning the snapshotting interval so that you are never more than your recovery SLA behind "latest". The creation of the snapshots can happen in a completely separate process. (In truth, your persistent snapshot store looks a lot like a persisted read model.)
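A minimal sketch of that snapshot-and-replay recovery in Scala. The EventStore and SnapshotStore traits are hypothetical stand-ins for whatever your infrastructure provides; the key point is that the snapshot carries the position in the history at which it was taken, so replay starts there instead of at zero.

    final case class Event(position: Long, payload: String)
    final case class Snapshot[S](state: S, position: Long)

    trait EventStore {
      // Assumes the store can replay from an arbitrary point in the history.
      def replayFrom(position: Long): Iterator[Event]
    }

    trait SnapshotStore[S] {
      def latest(): Option[Snapshot[S]]
      def save(snapshot: Snapshot[S]): Unit
    }

    class ReadModel[S](events: EventStore, snapshots: SnapshotStore[S],
                       empty: S, applyEvent: (S, Event) => S) {
      def recover(): S = {
        // Fall back to an empty state at position 0 if no snapshot exists yet.
        val start = snapshots.latest().getOrElse(Snapshot(empty, 0L))
        // Only the events after the snapshot position need to be chewed through.
        events.replayFrom(start.position).foldLeft(start.state)(applyEvent)
      }
    }

The snapshotting process itself can run on its own schedule and simply call save with the current state and position, independent of the handlers serving reads.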

Message queues and database inserts

I'm new to message queues and am intrigued by their capabilities and use. I have an idea about how to use one but wonder if it is the best use of this tool. I have an application that picks up and reads spreadsheets and transforms the data into business objects for database storage. My application needs to read and be able to update several hundred thousand records, but I'm running into performance issues holding onto these objects and bulk inserting into the database.
Would having two different applications (one to read the spreadsheets, one to store the records) connected by a message queue be proper utilization of a message queue? Obviously there are some optimizations I need to make in my code, and that is going to be my first step, but I wanted to hear thoughts from those that have used message queues.
It wouldn't be an improper use of the queue, but it's hard to tell if, in your scenario, adding a message queue will have any effect on the performance problems you mentioned. We would need more information.
Are you adding one message to the queue to tell a process to convert a spreadsheet and a second message when the data is ready for loading? Or are you thinking of adding one message per data record? (That might get expensive fast, and probably won't improve performance.)