Filestream Pointers

I need to copy many small file streams into one file stream. Each file stream has its own designated position: FS1 goes first, then FS2, and so on. But when I multi-thread the program, whichever thread finishes processing first writes its stream first, which causes errors.
Is there any way to define each stream's position so that no matter what order we add them in, they end up in the right place?
I tried creating many placeholder headers beforehand so that the file streams would replace those headers, but searching for those headers slows the program down.
This question continues my last one: since the first processed FS is copied first, we need a way to define the location where each one will be copied.
Please refer to this question:
Sequential MT

You can't have multiple threads write to the same file stream at the same time without wrapping it with a synchronization lock, and you would also need to re-seek the stream back and forth depending on which thread needs to write to it at any given moment, so it writes at the correct offset within the file. That is a lot of overhead.
You can, however, have multiple threads use different file streams to write to the same file at the same time, provided the sharing rights between the streams are compatible to allow concurrent writing and preserve data. Pre-size the file to the desired length, then divide up sections of that length amongst the threads as needed. Give each thread its own stream to the target file, first seeking to the appropriate start offset of its assigned section. Then each thread can write to its respective stream normally, without having to synch with the other threads or re-seek its stream. Just make sure each thread does not exceed the bounds of its assigned section within the file so it does not overwrite another thread's data.
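The region-partitioning idea described above can be sketched in Java (the question is about .NET FileStream, but the pattern is the same): pre-size the file, compute each section's start offset, then give each thread its own stream that seeks once to its assigned offset. The class and method names here are illustrative.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Path;

public class RegionWriter {
    // Pre-size the file, then let each thread write its own section through
    // its own stream, so no cross-thread locking or re-seeking is needed.
    public static void writeSections(Path target, byte[][] sections) throws Exception {
        // Compute each section's start offset and the total file length.
        long[] offsets = new long[sections.length];
        long total = 0;
        for (int i = 0; i < sections.length; i++) {
            offsets[i] = total;
            total += sections[i].length;
        }
        // Pre-size the file to its final length.
        try (RandomAccessFile raf = new RandomAccessFile(target.toFile(), "rw")) {
            raf.setLength(total);
        }
        Thread[] threads = new Thread[sections.length];
        for (int i = 0; i < sections.length; i++) {
            final int idx = i;
            threads[i] = new Thread(() -> {
                // Each thread opens its own stream on the same file and
                // seeks once to the start of its assigned section.
                try (RandomAccessFile raf = new RandomAccessFile(target.toFile(), "rw")) {
                    raf.seek(offsets[idx]);
                    raf.write(sections[idx]);
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) t.join();
    }
}
```

Because every thread stays within its own [offset, offset + length) range, the sections land at the right positions regardless of which thread finishes first.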

Related

Vert.x WriteStream setWriteQueueMaxSize

I am new to Vert.x. While reading about WriteStream in the docs at https://vertx.io/docs/vertx-core/java/, I came across the following:
setWriteQueueMaxSize: set the number of object at which the write queue is
considered full, and the method writeQueueFull returns true. Note that,
when the write queue is considered full, if write is called the data will
still be accepted and queued. The actual number depends on the stream
implementation, for Buffer the size represents the actual number of bytes
written and not the number of buffers.
In particular, this statement: "Note that, when the write queue is considered full, if write is called the data will still be accepted and queued." raises a few questions:
(1) Is there any size limitation on writing to a stream? For example, with the event bus, how many messages can be written to it? Does it depend on memory? Suppose I keep writing messages to the event bus and no message is consumed; would that cause an OutOfMemoryError in Java?
(2) If there is a write limitation, where and how can I check the default queue size? For example, if I want to know the default queue size of the Vert.x KafkaProducer, where can I check it?
Any ideas are appreciated.
There are actually three separate questions there, but I'll give it a shot anyway:
Is there any size limitation for writing to stream?
That depends on the stream implementation. But I've yet to see one implementation that actually uses writeQueueMaxSize.
how many messages can be written to event bus? Does it depend on memory?
EventBus is a special case. If nobody is consuming from the EventBus, it will simply drop messages, so in that trivial case an infinite number could be written. But if there are consumers and they're slow, then yes, eventually you'll run out of memory. The EventBus implementation currently doesn't do anything with writeQueueMaxSize.
If there is some limitation for writing, where and how can I check default queue size? Such as, I want to know the default queue size of Vert.x KafkaProducer, where can I check it?
All Vert.x projects are open source, so you'll usually find them on GitHub (or another open-source repository, though I don't remember any that are not on GitHub, actually).
Particularly for Vert.x KafkaProducer, you can see the code here:
https://github.com/vert-x3/vertx-kafka-client/blob/1720d5a6792f70509fd7249a47ab50b930cee7a7/src/main/java/io/vertx/kafka/client/producer/impl/KafkaProducerImpl.java#L217
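The behavior the docs describe, where the queue reports "full" yet a write is still accepted, can be illustrated with a plain-Java model. This is not Vert.x's actual implementation, just a minimal sketch of the documented contract: the limit set by setWriteQueueMaxSize is advisory, which is exactly why an unchecked producer can eventually exhaust memory.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative model (not Vert.x's real code) of the documented WriteStream
// contract: writeQueueFull() turns true past the soft limit, but write()
// still accepts and queues the data rather than rejecting it.
class ModelWriteStream {
    private final Deque<byte[]> queue = new ArrayDeque<>();
    private long queuedBytes = 0;
    private long maxSize = 1024; // soft limit, as set by setWriteQueueMaxSize

    void setWriteQueueMaxSize(long maxSize) { this.maxSize = maxSize; }

    boolean writeQueueFull() { return queuedBytes >= maxSize; }

    void write(byte[] data) {
        // Data is accepted unconditionally; "full" is only advisory.
        // A well-behaved producer checks writeQueueFull() and pauses,
        // resuming from a drain handler once the queue empties out.
        queue.addLast(data);
        queuedBytes += data.length;
    }

    long pendingBytes() { return queuedBytes; }
}
```

In real Vert.x code the producer side of this contract is the familiar pattern of checking writeQueueFull() and registering a drainHandler before resuming.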

When is concurrent execution of two critical sections producing results in some unknown order useful?

Refer to Galvin et al., Operating System Concepts, 8th edition, chapter 6, section 6.9, page 257. It says, "If two critical sections are instead executed concurrently, the result is equivalent to their sequential execution in some unknown order. Although this property is useful in many application domains, in many cases we would like to make sure that a critical section forms a single logical unit of work that either is performed in its entirety or is not performed at all." When is that property useful? Please explain, thanks in advance!
The property is useful (because it increases potential parallelism) when the order that the critical sections are executed is irrelevant.
For a more complex example, let's say you have a thread fetching the next block from a file, a thread compressing the current block, and a thread sending the previously compressed block to a network connection.
In this case there are obvious constraints (you can't compress the current block while it's still being fetched, and you can't send the compressed block to a network connection until it's finished being compressed), but there are also obvious opportunities for parallelism where the order is irrelevant (you don't care if the next block is fetched before or after or while the current block is compressed, you don't care if the current block is compressed before or after or while the previously compressed block is being sent to network, and you don't care if the next block is fetched before or after or while the previously compressed block is being sent to network).
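The fetch/compress/send example above can be sketched as a three-stage pipeline in Java, where bounded queues impose only the necessary ordering constraints while the stages overlap freely on different blocks. The class and method names are illustrative; the list-append stands in for a real network write.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.zip.Deflater;

public class Pipeline {
    private static final byte[] POISON = new byte[0]; // end-of-stream marker

    // Run the three stages concurrently. Queues enforce "compress after fetch"
    // and "send after compress"; everything else proceeds in whatever order.
    public static List<byte[]> run(List<byte[]> blocks) throws InterruptedException {
        BlockingQueue<byte[]> fetched = new ArrayBlockingQueue<>(2);
        BlockingQueue<byte[]> compressed = new ArrayBlockingQueue<>(2);
        List<byte[]> sent = new ArrayList<>();

        Thread fetcher = new Thread(() -> {
            try {
                for (byte[] b : blocks) fetched.put(b); // "read next block"
                fetched.put(POISON);
            } catch (InterruptedException ignored) {}
        });
        Thread compressor = new Thread(() -> {
            try {
                for (byte[] b; (b = fetched.take()) != POISON; ) {
                    Deflater d = new Deflater();
                    d.setInput(b);
                    d.finish();
                    byte[] buf = new byte[b.length + 64];
                    int n = d.deflate(buf);
                    byte[] out = new byte[n];
                    System.arraycopy(buf, 0, out, 0, n);
                    compressed.put(out);
                }
                compressed.put(POISON);
            } catch (InterruptedException ignored) {}
        });
        Thread sender = new Thread(() -> {
            try {
                for (byte[] b; (b = compressed.take()) != POISON; ) {
                    sent.add(b); // stand-in for writing to a network connection
                }
            } catch (InterruptedException ignored) {}
        });
        fetcher.start(); compressor.start(); sender.start();
        fetcher.join(); compressor.join(); sender.join();
        return sent;
    }
}
```

Per-block order is preserved through the queues, but at any instant block N+1 may be fetched while block N is compressed and block N-1 is sent, which is exactly the "equivalent to some sequential order" parallelism the quote describes.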

nondeterminism.njoin: maxQueued and prefetching

Why does njoin prefetch data before processing? It seems like an unnecessary complication, unless it has something to do with how Processes of Processes are merged?
I have a stream that runs effects whenever a new element is generated. I'd like to keep the effects to a minimum, so with an njoin with, say, maxOpen = 4, no more than 4 elements should be generated at the same time (no element should be generated unless it can be processed immediately).
Is there a way to solve this gracefully with njoin? Right now I'm using a bounded queue of "tickets" (an element is generated only after it gets a ticket).
See https://github.com/scalaz/scalaz-stream/issues/274, specifically the comment below from djspiewak.
"From a conceptual level, the problem here is the interface point between the "pull" model of Process and the "push" model that is required for any concurrent stream merging. Both wye and njoin sit at this boundary point and "cheat" by actively pulling on their source processes to fill an inbound queue, pushing results into an outbound queue pending the pull on the output process. (obviously, both wye and njoin make their inbound queues implicit via Actor) For the most part, this works extremely well and it preserves most of the properties that users care about (e.g. propagation of termination, back pressure, etc)."
The second parameter to njoin, maxQueued, bounds the amount of prefetching. If that parameter is 0, there is no limit on the queue size, and thus no limit on the prefetching. The docs for mergeN, which calls njoin, explain the reasoning for this prefetching behavior a bit more: "Internally mergeN keeps small buffer that reads ahead up to n values of A where n equals to number of active source streams. That does not mean that every source process is consulted in this read-ahead cache, it just tries to be as much fair as possible when processes provide their A on almost the same speed." So it seems that njoin is dealing with the problem of what happens when all the sources provide a value at nearly the same time, while trying to prevent any one of the joined streams from crowding out slower streams.
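The "bounded queue of tickets" approach the asker describes can be sketched outside scalaz-stream with a plain semaphore: an element is generated only after a ticket is acquired, and the ticket is returned once processing finishes, so at most maxOpen elements are ever in flight. This is an illustrative Java sketch of the pattern, not njoin's internals; all names are made up.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class Tickets {
    // Bound generation with a pool of "tickets": an element is generated only
    // after a ticket is acquired, and the ticket is returned once the element
    // has been fully processed, so at most maxOpen elements exist at a time.
    public static int maxObservedInFlight(int elements, int maxOpen) throws InterruptedException {
        Semaphore tickets = new Semaphore(maxOpen);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger maxSeen = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(maxOpen * 2);
        for (int i = 0; i < elements; i++) {
            tickets.acquire();                     // wait for a ticket before generating
            int now = inFlight.incrementAndGet();  // element "generated"
            maxSeen.accumulateAndGet(now, Math::max);
            pool.submit(() -> {
                try {
                    Thread.sleep(5);               // simulate processing the element
                } catch (InterruptedException ignored) {
                } finally {
                    inFlight.decrementAndGet();
                    tickets.release();             // processing done: return the ticket
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return maxSeen.get();
    }
}
```

The generator blocks on acquire(), which is exactly the "no element is generated unless it can be processed immediately" property; njoin's maxQueued plays an analogous bounding role on its internal prefetch queue.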

Message queues and database inserts

I'm new to message queues and am intrigued by their capabilities and use. I have an idea about how to use one but wonder if it is the best use of this tool. I have an application that picks up and reads spreadsheets, then transforms the data into business objects for database storage. My application needs to read and be able to update several hundred thousand records, but I'm running into performance issues holding onto these objects and bulk inserting them into the database.
Would having two different applications (one to read the spreadsheets, one to store the records) connected by a message queue be a proper use of a message queue? Obviously there are some optimizations I need to make in my code, and that will be my first step, but I wanted to hear thoughts from those who have used message queues.
It wouldn't be an improper use of the queue, but it's hard to tell whether, in your scenario, adding a message queue will have any effect on the performance problems you mentioned. We would need more information.
Are you adding one message to a queue to tell a process to convert a spreadsheet, and a second message when the data is ready for loading? Or are you thinking of adding one message per data record? (That might get expensive fast, and probably won't improve performance.)
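Whichever side of the queue does the database work, the per-record overhead warned about above is usually tackled by batching: each queue message (or each database round trip, via JDBC's addBatch/executeBatch) carries many records instead of one. A minimal, assumption-laden sketch of the chunking step:

```java
import java.util.ArrayList;
import java.util.List;

public class Batcher {
    // Split records into fixed-size batches so each queue message (or each
    // database round trip) carries many records instead of one. The batch
    // size is a tuning knob; values in the hundreds are a common start.
    public static <T> List<List<T>> toBatches(List<T> records, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < records.size(); i += batchSize) {
            batches.add(new ArrayList<>(
                records.subList(i, Math.min(i + batchSize, records.size()))));
        }
        return batches;
    }
}
```

With several hundred thousand records, turning "one message per record" into "one message per batch of a few hundred" cuts both queue traffic and insert round trips by orders of magnitude.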

Multiple Actors that writes to the same file + rotate

I have written a very simple webserver in Scala (based on Actors). Its purpose is to log events from our frontend server (such as a user clicking a button or a page being loaded). The file will need to be rotated every 64-100 MB or so, and it will be sent to S3 for later analysis with Hadoop. The traffic will be about 50-100 calls/s.
Some questions that pop into my mind:
How do I make sure that all actors can write to one file in a thread-safe way?
What is the best way to rotate the file after X MB? Should I do this in my code or from the filesystem? (If I do it from the filesystem, how do I then verify that the file isn't in the middle of a write, or that the buffer has been flushed?)
One simple method would be to have a single file-writer actor that serializes all writes to disk. You could then have multiple request-handler actors that feed it updates as they process logging events from the frontend server. You'd get concurrency in request handling while still serializing writes to your log file. Having more than a single writer actor would open the possibility of concurrent writes, which would likely corrupt your log file.
Basically, if you want something to be thread-safe in the actor model, it should be executed on a single actor. Unfortunately, your task is inherently serial at the point you write to disk. You could do something more involved, like merging log files coming from multiple actors at rotation time, but that seems like overkill. Unless you're generating that 64-100 MB in a second or two, I'd be surprised if extra threads doing I/O bought you anything.
Assuming a single writing actor, it's pretty trivial to calculate the amount that has been written since the last rotation, and I don't think tracking it in the actor's internal state versus polling the filesystem makes a difference one way or the other.
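The single-writer idea can be sketched outside the actor framework: one dedicated thread drains a queue and is the only thing that touches the file, tracking bytes written and rotating when the threshold is crossed. This is a hedged Java sketch (the original is Scala/Actors); the names, the file-naming scheme, and the tiny threshold are illustrative, and a production threshold would be something like 64L * 1024 * 1024.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class LogWriter implements Runnable {
    private static final String STOP = "\u0000STOP"; // shutdown sentinel
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final Path dir;
    private final long rotateAtBytes;
    private long written = 0;
    private int fileIndex = 0;

    public LogWriter(Path dir, long rotateAtBytes) {
        this.dir = dir;
        this.rotateAtBytes = rotateAtBytes;
    }

    // Any thread may call this; only the writer thread touches the file,
    // which plays the role of the single file-writer actor.
    public void log(String line) { queue.add(line); }

    public void shutdown() { queue.add(STOP); }

    @Override
    public void run() {
        try {
            BufferedWriter out = open();
            for (String line; !(line = queue.take()).equals(STOP); ) {
                byte[] bytes = (line + "\n").getBytes(StandardCharsets.UTF_8);
                if (written + bytes.length > rotateAtBytes && written > 0) {
                    out.close(); // the finished file is now safe to ship to S3
                    out = open();
                }
                out.write(line);
                out.newLine();
                written += bytes.length;
            }
            out.close();
        } catch (IOException | InterruptedException e) {
            throw new RuntimeException(e);
        }
    }

    private BufferedWriter open() throws IOException {
        written = 0;
        Path p = dir.resolve("events." + fileIndex++ + ".log");
        return Files.newBufferedWriter(p, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }
}
```

Because rotation happens on the writer thread between two complete lines, there is never a file closed mid-write or with an unflushed buffer, which answers the second question directly.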
You can use a single actor to handle every write request from the different threads; since all requests go through this one actor, there will be no concurrency problems.
As for file rolling: if your write requests can be logged line by line, you can rely on log4j's or logback's RollingFileAppender. Otherwise, you can write your own, which is easy as long as you remember to lock the file before performing any delete or rename operations.
Rolling usually means renaming the older files and the current file to other names, then creating a new file with the current file name, so you can always write to the file with the current name.