Is it possible to write data with one process, and in the same time read with same process or another one, on same or different GCE machine?
The API is doesn't seem to cover this one:
https://developers.google.com/storage/docs/concepts-techniques#streaming
No, this is not supported. An object is not visible for reads until the writer finishes writing and finalizes the object.
Depending on your specific use case, you might be able to use composite objects to achieve what you need. For example, instead of writing to one large object, you could write smaller chunks of the object to individual objects, then compose them together into a larger object when finished. This would allow a reader to read each component after it's been written.
Related
We are planning to run kafka streams application distributed in two machines. Each instance stores its Ktable data on its own machine.
The challenge we face here is,
We have a million records pushed to Ktable. We need to iterate the
whole Ktable (RocksDB) data and generate the report.
Let's say 500K records stored in each instance. It's not possible to get all records from other instance in a single GET over http
(unless there is any streaming TCP technique available) . Basically
we need two instance data in a single call and generate the report.
Proposed Solution:
We are thinking to have a shared location (state.dir) for these two instances.So that these two instances will store the Ktable data at same directory and the idea is to get all data from a single instance without interactive query by just calling,
final ReadOnlyKeyValueStore<Key, Result> allDataFromTwoInstance =
streams.store("result",
QueryableStoreTypes.<Key, Result>keyValueStore())
KeyValueIterator<Key, ReconResult> iterator = allDataFromTwoInstance.all();
while (iterator.hasNext()) {
//append to excel report
}
Question:
Will the above solution work without any issues? If not, is there any alternative solution for this?
Please suggest. Thanks in Advance
GlobalKTable is the most natural first choice, but it means each node where the global table is defined contains the entire dataset.
The other alternative that comes to mind is indeed to stream the data between the nodes on demand. This makes sense especially if creating the report is an infrequent operation or when the dataset cannot fit a single node. Basically, you can follow the documentation guidelines for querying remote Kafka Streams nodes here:
http://kafka.apache.org/0110/documentation/streams/developer-guide#streams_developer-guide_interactive-queries_discovery
and for RPC use a framework that supports streaming, e.g. akka-http.
Server-side streaming:
http://doc.akka.io/docs/akka-http/current/java/http/routing-dsl/source-streaming-support.html
Consuming a streaming response:
http://doc.akka.io/docs/akka-http/current/java/http/implications-of-streaming-http-entity.html#client-side-handling-of-streaming-http-entities
This will not work. Even if you have a shared state.dir, each instance only loads its own share/shard of the data and is not aware about the other data.
I think you should use GlobalKTable to get a full local copy of the data.
What would you consider the best solution to store via G-WAN Key-Value Store my values in RAM and multi-threaded, and able to be used by all my scripts (from other virtual servers or not) ?
Thank you in advance.
I wish to store different values in different "storages" so as to be able to recover each one via a "key" (type char).
The G-WAN KV store does that (for any type of data: binary too).
Once your application will have millions of concurrent users, one way to speed-up lookups will be to use different G-WAN servers to host either a partitioned data set or a redundant data set (it all depends on the type of your application).
The G-WAN reverse-proxy featuring an elastic load-balancer makes such things this almost transparent for developers.
I do not care that the data is lost when you restart g-wan.
Then you won't have to use a persistant layer like mySQL, etc.
So it would be fine (I think) to have a persistent pointer but I'm not sure that this is the most suitable solution
Look at the persistence.c example for about how to share common data among all worker threads in G-WAN.
But you can avoid that if you are using G-WAN with one single worker thread (./gwan -w 1). One thread is more than enough to start developing and even to operate your application until the point you will need to process more requests.
With one single thread, you can just use a static pointer to your G-WAN KV store (unless different scripts need to access it).
I'm working on a video server, and I want to use a database to keep video files.
Since I only need to store simple video files with metadata I tried to use MongoDB in Java, via its GridFS mechanism to store the video files and their metadata.
However, there are two major features I need, and that I couldn't manage using MongoDB:
I want to be able to add to a previously saved video, since saving a video might be performed in chunks. I don't want to delete the binary I have so far, just append bytes at the end of an item.
I want to be able to read from a video item while it is being written. "Thread A" will update the video item, adding more and more bytes, while "Thread B" will read from the item, receiving all the bytes written by "Thread A" as soon as they are written/flushed.
I tried writing the straightforward code to do that, but it failed. It seems MongoDB doesn't allow multi-threaded access to the binary (even if one thread is doing all the writing), nor could I find a way to add to a binary file - the Java GridFS API only gives an InputStream from an already existing GridFSDBFile, I cannot get an OutputStream to write to it.
Is this possible via MongoDB, and if so how?
If not, do you know of any other DB that might allow this (preferably nothing too complex such as a full relational DB)?
Would I be better off using MongoDB to keep only the metadata of the video files, and manually handle reading and writing the binary data from the filesystem, so I can implement the above requirements on my own?
Thanks,
Al
I've used mongo gridfs for storing media files for a messaging system we built using Mongo so I can share what we ran into.
So before I get into this for your use case scenario I would recommend not using GridFS and actually using something like Amazon S3 (with excellent rest apis for multipart uploads) and store the metadata in Mongo. This is the approach we settled on in our project after first implementing with GridFS. It's not that GridFS isn't great it's just not that well suited for chunking/appending and rewriting small portions of files. For more info here's a quick rundown on what GridFS is good for and not good for:
http://www.mongodb.org/display/DOCS/When+to+use+GridFS
Now if you are bent on using GridFS you need to understand how the driver and read/write concurrency works.
In mongo (2.2) you have one writer thread per schema/db. So this means when you are writing you are essentially locked from having another thread perform an operation. In real life usage this is super fast because the lock yields when a chunk is written (256k) so your reader thread can get some info back. Please look at this concurrency video/presentation for more details:
http://www.10gen.com/presentations/concurrency-internals-mongodb-2-2
So if you look at my two links essentially we can say quetion 2 is answered. You should also understand a little bit about how Mongo writes large data sets and how page faults provide a way for reader threads to get information.
Now let's tackle your first question. The Mongo driver does not provide a way to append data to GridFS. It is meant to be a fire/forget atomic type operation. However if you understand how the data is stored in chunks and how the checksum is calculated then you can do it manually by using the fs.files and fs.chunks methods as this poster talks about here:
Append data to existing gridfs file
So going through those you can see that it is possible to do what you want but my general recommendation is to use a service (such as Amazon S3) that is designed for this type of interaction instead of trying to do extra work to make Mongo fit your needs. Of course you can go to the filesystem directly as well which would be the poor man's choice but you lose redundancy, sharding, replication etc etc that you get with GridFS or S3.
Hope that helps.
-Prasith
Background/Intent:
So I'm going to create an event tracker from scratch and have a couple of ideas on how to do this but I'm unsure of the best way to proceed with the database side of things. One thing I am interested in doing is allowing these events to be completely dynamic, but at the same time to allow for reporting on relational event counters.
For example, all countries broken down by operating systems. The desired effect would be:
US # of events
iOS - # of events that occured in US
Android - # of events that occured in US
CA # of events
iOS - # of events that occured in CA
Android - # of events that occured in CA
etc.
My intent is to be able to accept these event names like so:
/?country=US&os=iOS&device=iPhone&color=blue&carrier=Sprint&city=orlando&state=FL&randomParam=123&randomParam2=456&randomParam3=789
Which means in order to do the relational counters for something like the above I would potentially be incrementing 100+ counters per request.
Assume there will be 10+ million of the above requests per day.
I want to keep things completely dynamic in terms of the event names being tracked and I also want to do it in such a manner that the lookups on the data remains super quick. As such I have been looking into using redis or mongodb for this.
Questions:
Is there a better way to do this then counters while keeping the fields dynamic?
Provided this was all in one document (structured like a tree), would using the $inc operator in mongodb to increment 100+ counters at the same time in one operation be viable and not slow? The upside here being I can retrieve all of the statistics for one 'campaign' quickly in a single query.
Would this be better suited to redis and to do a zincrby for all of the applicable counters for the event?
Thanks
Depending on how your key structure is laid out I would recommend pipelining the zincr commands. You have an easy "commit" trigger - the request. If you were to iterate over your parameters and zincr each key, then at the end of the request pass the execute command it will be very fast. I've implemented a system like you describe as both a cgi and a Django app. I set up a key structure along the lines of this:
YYYY-MM-DD:HH:MM -> sorted set
And was able to process Something like 150000-200000 increments per second on the redis side with a single process which should be plenty for your described scenario. This key structure allows me to grab data based on windows of time. I also added an expire to the keys to avoid writing a db cleanup process. I then had a cronjob that would do set operations to "roll-up" stats in to hourly, daily, and weekly using variants of the aforementioned key pattern. I bring these ideas up as they are ways you can take advantage of the built in capabilities of Redis to make the reporting side simpler. There are other ways of doing it but this pattern seems to work well.
As noted by eyossi the global lock can be a real problem with systems that do concurrent writes and reads. If you are writing this as a real time system the concurrency may well be an issue. If it is an "end if day" log parsing system then it would not likely trigger the contention unless you run multiple instances of the parser or reports at the time of input. With regards to keeping reads fast In Redis, I would consider setting up a read only redis instance slaved off of the main one. If you put it on the server running the report and point the reporting process at it it should be very quick to generate the reports.
Depending on your available memory, data set size, and whether you store any other type of data in the redis instance you might consider running a 32bit redis server to keep the memory usage down. A 32b instance should be able to keep a lot of this type of data in a small chunk of memory, but if running the normal 64 bit Redis isn't taking too much memory feel free to use it. As always test your own usage patterns to validate
In redis you could use multi to increment multiple keys at the same time.
I had some bad experience with MongoDB, i have found that it can be really tricky when you have a lot of writes to it...
you can look at this link for more info and don't forget to read the part that says "MongoDB uses 1 BFGL (big f***ing global lock)" (which maybe already improved in version 2.x - i didn't check it)
On the other hand, i had a good experience with Redis, i am using it for a lot of read / writes and it works great.
you can find more information about how i am using Redis (to get a feeling about the amount of concurrent reads / writes) here: http://engineering.picscout.com/2011/11/redis-as-messaging-framework.html
I would rather use pipelinethan multiif you don't need the atomic feature..
We are using Enterprise Library 5 and the CacheManager that it provides in our web application. Everything seems to be working fine up to the point where we start a heavy load test on the application.
We are caching records from the database using a key based on their ID. We are not requesting from cache one item all the time, sometimes we need to get a list of items from the cache. For this we have a LINQ query that makes a Select(e => CacheManager.GetData(id_from_list)) and returns the list of items from the cache. Most of the time this works fine but in heavy loads the GetData method becomes a bottleneck due to the locking that the cache manager is performing both on read and write operations from cache. Basically only one thread can read data from the cache at one time. We did create several cache managers based on the type of the items - this allows several threads to get data from different cache managers but still the issue remains when heavy loads hit the application (one bottleneck per cache manager) - of course it did improve the application up to some point but not enough.
Did someone else encountered the same problem and did you find a way to overcome this?
NOTE: We tried to actually cache lists of items and compose the key from the ids of the items in the list. This actually solved the problem and the cachemanager.getdata is not a bottleneck anymore ... BUT ... obviously this is not a good solution as we could have each item thousands of times in the cache in a lot of lists.
You may consider adapting the CacheManager to use a read/write lock (which I think is much more suitable for this situation) instead of the exclusive locking that it uses now.
http://msdn.microsoft.com/en-us/library/system.threading.readerwriterlock.aspx
Basically, a read/write lock is appropriate when multiple reader threads need simultaneous access to the data, and only the occurrence of a write will cause incoming readers to block.
These have other problems when put under load, however, such as write starvation. Depending on the read/write lock implementation a write will always wait for all reads to finish first - with a constant stream of reads a write will never have a chance to happen.