Is there any way to save an image to Mongo's GridFS and then asynchronously upload it to S3 in the background?
Perhaps it is possible to chain uploaders?
The problem is this: multiple servers are used, so the image saved to disk and the background process that handles it can end up on different servers.
Also:
1. The file should be removed from GridFS once it has been uploaded to S3.
2. The file should be removed from S3 automatically when the corresponding entity is destroyed.
Thanks.
What does your deployment architecture look like? I'm a little confused by what you mean by "multiple servers": do you mean multiple mongod instances? Your requirements are also a bit confusing. According to requirement 1, once a file is uploaded to S3 it should be removed from GridFS, so a file cannot exist in both S3 and GridFS; requirement 2 therefore seems to contradict the first, i.e., the file shouldn't exist in GridFS in the first place. Are you preserving some files in both GridFS and S3?
If you are running a replica set or sharded cluster, you could create a tailable cursor on the oplog, filtered to your GridFS files collection's namespace (you can also do this on a single node, although it's not recommended). When you see an insert operation (it will look like 'op': 'i'), you could execute a script or do something in your application to grab the file from GridFS and push it to S3. Similarly, when you see a delete operation ('op': 'd'), you could delete the corresponding file from S3.
The beauty of a tailable cursor is that it allows for asynchronous operation: you can have another process, on a different server, monitoring the oplog and performing the appropriate actions.
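As a rough illustration of the idea (not production code), here is a minimal Python sketch using pymongo, assuming a replica set whose oplog is readable at local.oplog.rs and a GridFS bucket in a database called mydb with the default fs prefix; the names and the two handler functions are placeholders:

    # Minimal sketch: tail the oplog and react to GridFS inserts/deletes.
    # Assumes a replica set (local.oplog.rs exists) and a GridFS bucket in
    # "mydb" using the default "fs" prefix; names are placeholders.
    import time

    import gridfs
    from pymongo import MongoClient, CursorType

    client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
    fs = gridfs.GridFS(client.mydb)
    oplog = client.local["oplog.rs"]

    def push_to_s3(file_id):
        # placeholder: read the file from GridFS and upload it to S3
        data = fs.get(file_id).read()
        print("would upload", file_id, len(data), "bytes")

    def delete_from_s3(file_id):
        # placeholder: delete the corresponding S3 object
        print("would delete", file_id)

    while True:
        cursor = oplog.find(
            {"ns": "mydb.fs.files"},              # only the GridFS files collection
            cursor_type=CursorType.TAILABLE_AWAIT,
        )
        for entry in cursor:
            if entry["op"] == "i":                # insert -> new GridFS file
                push_to_s3(entry["o"]["_id"])
            elif entry["op"] == "d":              # delete -> file removed
                delete_from_s3(entry["o"]["_id"])
        time.sleep(1)                             # cursor died; reopen and retry

A real implementation would also remember the last oplog timestamp it processed so it can resume where it left off after a restart.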
I used a temporary variable to store the file in GridFS and wrote a Worker (see this) to perform the asynchronous upload from GridFS to S3.
Hope this helps somebody, thanks.
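The answer above describes a Ruby/CarrierWave-style worker; purely as a language-neutral illustration, here is a minimal Python sketch of what such a worker's job could do (read from GridFS, upload to S3 via boto3, then delete from GridFS, which covers requirement 1). The bucket and database names are placeholders:

    # Minimal sketch of the background job: copy one file from GridFS to S3,
    # then remove it from GridFS (requirement 1). Names are placeholders.
    import boto3
    import gridfs
    from pymongo import MongoClient

    def upload_gridfs_file_to_s3(file_id, bucket="my-image-bucket"):
        db = MongoClient()["mydb"]
        fs = gridfs.GridFS(db)
        grid_out = fs.get(file_id)

        s3 = boto3.client("s3")
        s3.upload_fileobj(grid_out, bucket, str(file_id))  # GridOut is file-like

        fs.delete(file_id)  # drop the GridFS copy once S3 has it

Requirement 2 (removing the file from S3 when the entity is destroyed) would then just be an s3.delete_object(Bucket=..., Key=...) call in the entity's destroy callback.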
Related
I'm currently trying to move my existing MongoDB deployment to the --directoryperdb option to make use of different mounted volumes. In the docs this is achieved via mongodump and mongorestore; however, on a large database with over 50GB of compressed data this takes a really long time, and the indexes need to be completely rebuilt.
Is there a way to move the WiredTiger files within the current /data/db path, the way you would when simply changing the db path? I've tried copying the corresponding collection and index files into their subdirectory, which doesn't work. Creating dummy collections, replacing their files with the old ones, and then running --repair works, but I don't know how long that would take since I only tested it on a collection with a few documents; it also seems very hacky, with a lot of things that could go wrong (for example, data loss).
Any advice on how to do this, or is this something that simply should not be done?
My job is to improve the speed of reading a lot of small files (about 1KB each) from disk and writing them into our database.
I have access to the database's source code, and I can change all the code, from the client to the server.
The database architecture is a simple master-slave, distributed, HDFS-based database similar to HBase. Small files from disk can be inserted into the database, automatically combined into bigger blocks, and then written to HDFS. (Large files can also be split into smaller blocks by the database and then written to HDFS.)
One way to change the client is to increase the number of threads.
I don't have any other ideas. Alternatively, can you suggest how to approach the performance analysis?
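To make the thread-count idea concrete, here is a rough, hypothetical Python sketch of a multi-threaded client loader; insert_one_file is a stand-in for whatever the database's actual client insert call is:

    # Minimal sketch: load many small files with a pool of client threads.
    # insert_one_file() is a stand-in for your database's client API.
    from concurrent.futures import ThreadPoolExecutor
    from pathlib import Path

    def insert_one_file(path: Path) -> None:
        data = path.read_bytes()          # ~1 KB per file
        # client.insert(key=path.name, value=data)   # replace with the real call
        pass

    def load_directory(directory: str, threads: int = 32) -> None:
        files = list(Path(directory).glob("*"))
        with ThreadPoolExecutor(max_workers=threads) as pool:
            # more threads hide per-file latency; measure to find the sweet spot
            list(pool.map(insert_one_file, files))

    if __name__ == "__main__":
        load_directory("/data/small_files", threads=32)

Measuring throughput at a few different thread counts (and batching several files per insert, if the client API allows it) is usually the quickest way to start the performance analysis.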
One way to process such small files is to convert them into a SequenceFile and store that in HDFS, then use that file as the input to a MapReduce job that puts the data into HBase or a similar database.
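A rough sketch of the packing step, assuming PySpark is available (the paths are placeholders): wholeTextFiles reads each small file as a (filename, contents) pair, and saveAsSequenceFile writes those pairs out to HDFS as SequenceFile data for a later MapReduce/HBase load job.

    # Minimal sketch: pack many small files into SequenceFiles on HDFS.
    # Assumes PySpark; the paths are placeholders.
    from pyspark import SparkContext

    sc = SparkContext(appName="pack-small-files")

    # (filename, file contents) pairs, one per small file
    pairs = sc.wholeTextFiles("hdfs:///incoming/small_files/*")

    # write the pairs out as SequenceFile(s) for the downstream MapReduce job
    pairs.saveAsSequenceFile("hdfs:///staging/small_files_packed")

    sc.stop()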
This uses AWS as an example, but it could be any storage/queue setup:
If the files can live on shared storage such as S3, you could add one queue entry per file and then just keep throwing servers at the queue to load the files into the database. At that point the bottleneck becomes the database instead of the client.
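A minimal sketch of that pattern with boto3, assuming an existing S3 bucket and SQS queue (the bucket name and queue URL are placeholders): one producer enqueues a message per S3 key, and any number of worker servers drain the queue and write to the database.

    # Minimal sketch: one queue entry per S3 object, workers drain the queue.
    # Assumes an existing bucket and SQS queue; names/URLs are placeholders.
    import boto3

    s3 = boto3.client("s3")
    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/file-loads"

    def enqueue_all(bucket="incoming-files"):
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket):
            for obj in page.get("Contents", []):
                sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=obj["Key"])

    def worker(bucket="incoming-files"):
        while True:
            resp = sqs.receive_message(QueueUrl=QUEUE_URL,
                                       MaxNumberOfMessages=10,
                                       WaitTimeSeconds=20)
            for msg in resp.get("Messages", []):
                body = s3.get_object(Bucket=bucket, Key=msg["Body"])["Body"].read()
                # write_to_db(msg["Body"], body)   # replace with your DB insert
                sqs.delete_message(QueueUrl=QUEUE_URL,
                                   ReceiptHandle=msg["ReceiptHandle"])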
We have software that generates a large amount of data in a short period of time, which is stored in a single MongoDB database. To increase write performance we are looking into setting up a sharded cluster to handle the incoming data. Because this is all being done on Amazon EC2 instances, we would prefer to consolidate our data from the sharded cluster onto a single persistent server once the process is done, to save on cost. Obviously we can write a Python script that will port the data off the cluster when done, but I am hoping there is a cleaner, more automated method. Once the data has been written, access is read-only and a single server can handle the workload sufficiently. I was looking for some solution combining replica sets and sharding, but that doesn't seem to be the way those work. Any suggestions for how best to implement this architecture?
One way to migrate a MongoDB database with zero downtime is to create a replica set consisting of the old and the new servers and remove the old ones as soon as the new ones have synced. But that doesn't work when the old database is sharded and the new one isn't, because shards are built from replica sets, not the other way around. That means you have to copy the database the old-fashioned way. There are two methods to do this:
The network method: use the command

    db.copyDatabase(<remote_db_name>, <local_db_name>, <remote_host>, <remote_username>, <remote_password>)

on the destination to copy the database from the source over the network.
The file method: Do a mongodump on the source to export the data to a file. Then do a mongorestore on the new server to import it.
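For the file method, a minimal sketch driving the standard mongodump/mongorestore tools from Python (the hosts, database name, and dump directory are placeholders):

    # Minimal sketch of the file method: dump on the source, restore on the target.
    # Hosts, database name, and dump directory are placeholders.
    import subprocess

    DB = "mydb"
    DUMP_DIR = "/tmp/mongo_dump"

    subprocess.run(["mongodump", "--host", "old-server:27017",
                    "--db", DB, "--out", DUMP_DIR], check=True)

    subprocess.run(["mongorestore", "--host", "new-server:27017",
                    "--db", DB, f"{DUMP_DIR}/{DB}"], check=True)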
I need to copy data from one cluster to another. I did some research and found that I could use copyTable, which basically scans and puts data from one cluster to the other.
I also know that I can copy over the whole HDFS volume for HBase. I am wondering whether this works, and whether it performs better than copyTable. (I believe it should perform better, since it copies files without any logical operations.)
Take a look at HBase replication; you can find a short how-to here.
I have a large MongoDB database (100GB) hosted in the cloud (MongoLab or MongoHQ). I would like to run some Map/Reduce tasks on the data to compute some expensive statistics, and I was wondering what the best workflow is for getting this done. Ideally I would like to use Amazon's Map/Reduce services to do this instead of maintaining my own Hadoop cluster.
Does it make sense to copy the data from the database to S3 and then run Amazon Map/Reduce on it? Or are there better ways to get this done?
Also, further down the line I might want to run the queries frequently, say every day, so the data on S3 would need to mirror what is in Mongo; would this complicate things?
Any suggestions/war stories would be super helpful.
Amazon provides a utility called S3DistCp to get data in and out of S3. It is commonly used when running Amazon's EMR product and you don't want to host your own cluster or use up instances to store data: S3 can store all your data, and EMR can read/write data from/to S3.
However, transferring 100GB will take time and if you plan on doing this more than once (i.e. more than a one-off batch job), it will be a significant bottleneck in your processing (especially if the data is expected to grow).
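For instance, an S3DistCp copy can be submitted as a step to an existing EMR cluster; here is a hedged boto3 sketch where the cluster id, bucket name, and HDFS path are all placeholders:

    # Minimal sketch: submit an S3DistCp step to an existing EMR cluster.
    # Cluster id, bucket name, and HDFS path are placeholders.
    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",
        Steps=[{
            "Name": "copy input from S3 to HDFS",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["s3-dist-cp",
                         "--src", "s3://my-input-bucket/dump/",
                         "--dest", "hdfs:///input/"],
            },
        }],
    )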
It looks like you may not need to use S3. Mongo has implemented an adapter for running map-reduce jobs on top of your MongoDB: http://blog.mongodb.org/post/24610529795/hadoop-streaming-support-for-mongodb
This looks appealing since it lets you implement the MR in python/js/ruby.
I think this mongo-hadoop setup would be more efficient than copying 100GB of data out to S3.
UPDATE: An example of using map-reduce with mongo here.
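For reference, here is a minimal sketch of MongoDB's native mapReduce driven from Python, assuming a 3.x-era pymongo where Collection.map_reduce is still available (it was removed in pymongo 4); the database, collection, and field names are made up for illustration. It counts documents per value of a hypothetical category field.

    # Minimal sketch: MongoDB's native mapReduce from pymongo (3.x era).
    # Database, collection, and field names are made up for illustration.
    from bson.code import Code
    from pymongo import MongoClient

    coll = MongoClient()["mydb"]["events"]

    mapper = Code("function () { emit(this.category, 1); }")
    reducer = Code("function (key, values) { return Array.sum(values); }")

    # results are written to the "category_counts" collection
    result = coll.map_reduce(mapper, reducer, "category_counts")

    for doc in result.find():
        print(doc["_id"], doc["value"])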