How to create a slave of large mongodb databases [closed] - mongodb

As we know, MongoDB has a limited oplog.
If I just create a new slave, nothing in the database is synced yet, and the whole database is bigger than any oplog.
So how do I get around this? Does that mean we cannot create a new slave when the database is bigger than the oplog? Does MongoDB have another mechanism for syncing a database besides the oplog?
If so, how exactly is it done?
So what's the problem?

If your database is of reasonable size, and you have a snapshot, you can copy over the files (specified by the --dbpath flag on startup or in the config file) to allow the new replica set member to come online quicker. However, an initial sync may still happen.
Conceptually, the following things happen:
Start up the new member as part of the replica set.
Add it to the rs.conf().
The new member syncs off the closest member (which could be a primary or a secondary), begins pulling data from it (the initial sync), and marks a point in the oplog for its own reference.
The new secondary then applies the oplog from the timestamp it copied from the other replica set member.
If the sync fails, another initial sync (from the very start) will happen. For really large datasets, the sync can take some time.
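As a minimal sketch of the first two steps (hostnames, ports, and paths below are placeholders, not taken from the question):
# Start the new member with the same replica set name as the existing set
mongod --replSet rs0 --dbpath /data/newmember --port 27017 --fork --logpath /var/log/mongod.log
# From the current primary, add the new member; this triggers the initial sync
mongo --host current-primary.example.com:27017 --eval 'rs.add("new-member.example.com:27017")'
You can watch the progress with rs.status() on the primary; the new member stays in a startup/recovering state until the initial sync completes.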
In reply to your questions:
Does that mean we cannot create a new slave when the database is bigger than the oplog?
You can create and add a new member whose data set is bigger than the oplog.
Does MongoDB have another mechanism for syncing a database besides the oplog?
Yes: the initial sync and the file copy mentioned above.

Related

Why am I unable to add some files to my git repository? [closed]

I'm trying to learn how to create a git repository.
I have a working directory where my app resides on the server:
/home/app/abd
Steps I've taken so far:
~/abd$ git init      <-- create the repo
~/abd$ git add ./    <-- stage all the files in my directory
But now I've run into this:
new file: media/root/pics/Volvo_XC90_T8eAWDPlug-InHybridInscription7Passenger.png
new file: media/root/pics/Volvo_XC90_T8eAWDPlug-InHybridInscriptionExpression6Passenger.png
new file: media/root/pics/Volvo_XC90_T8eAWDPlug-InHybridInscriptionExpression7Passenger.png
new file: media/root/pics/Volvo_XC90_T8eAWDPlug-InHybridR-Design7Passenger.png
new file: pga.tar.gz
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git restore <file>..." to discard changes in working directory)
modified: _volumes/pg_db1/pgdata/pg_stat_tmp/db_0.stat
modified: _volumes/pg_db1/pgdata/pg_stat_tmp/db_16384.stat
modified: _volumes/pg_db1/pgdata/pg_stat_tmp/db_24724.stat
modified: _volumes/pg_db1/pgdata/pg_stat_tmp/global.stat
So my question is:
Is this because I'm running Postgres in a container and my website is live?
I don't really want to stop the Postgres instance; I just want to capture all of the current data...and it happens that my volumes are inside the working directory...
There's a general problem here, and it applies to Git as well as to Postgres (or Ingres or MongoDB or, well, pretty much any database you can name, though eventually-consistent distributed databases like Cassandra, in their usual configuration, have built-in ways to deal with this). The general problem is called the CAP theorem.
Backing up a database consists of making a copy, by definition. This in turn represents a distribution of the data in the database: you now have a distributed database.
The CAP theorem tells us that any distributed database must give up at least one of three guarantees: consistency (data retrieved from the database matches, e.g., your bank balance at ATM #1 isn't mysteriously different from your bank balance at ATM #2), availability (your bank balance is available at all ATMs), and partition tolerance (even if the bank's network goes down, you can still use the ATMs).
Git itself can be viewed as software that manipulates a pair of databases in tandem: there's an object database holding the commits and other supporting objects, and a names database holding branch and other names, needed to find the objects. (Git manages its own distributed-ness by giving up consistency across databases: you must use git fetch and/or git push to synchronize them, which you can only do when the two are not partitioned, i.e., the network is up between them.)
Again, any time you make a backup, or for that matter, put something on a network drive, you're creating duplicates: a sort of distributed database. This forces you into a CAP-theorem choice: what will you give up? But any one given database has already made its own choice. Some SQL setups give up nothing, at the cost of not allowing partitions at all: you literally can't copy the database, at least not if it's live. Any copy—any backup—is automatically invalid.
They may give you a "dump" option that produces a consistent snapshot that can be used to restore the data as-of-then. You'll lose subsequent transactions, but you've already agreed that this is the right answer, so you must use this protocol. You don't just back up the database somehow: you invoke the dump operation, and put the dumped data into the backup.
So, any time you have some database software and are looking for ways to make backups for disaster recovery and the like, always check the instructions. How does this database say to do that? Follow those instructions. Don't just use Git, or your OS's "snapshot the tree" operation, or whatever: those may not be reliable.
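For the Postgres-in-a-container case in the question, a hedged sketch of that advice (container, database, and path names below are placeholders, not taken from the post) would be to stop tracking the live volume files and commit a consistent dump instead:
# Stop tracking the live Postgres data files and ignore them going forward
echo "_volumes/" >> .gitignore
git rm -r --cached _volumes/
# Take a consistent dump from the running container and commit that instead
mkdir -p backup
docker exec my_pg_container pg_dump -U postgres mydb > backup/mydb.sql
git add .gitignore backup/mydb.sql
git commit -m "Ignore live data volumes; track a pg_dump snapshot instead"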

CouchDB's replicated DB is a copy or reference to original?

As per http://guide.couchdb.org/draft/replication.html
Replication synchronizes two copies of the same database, allowing
users to have low latency access to data no matter where they are.
These databases can live on the same server or on two different
servers—CouchDB doesn’t make a distinction. If you change one copy of
the database, replication will send these changes to the other copy.
I have the following 2 confusions:
Does this mean that every replicated DB is a new DB/copy of the original DB, or does it refer to the original DB?
Will replication increase the size of the DB?
Note: These confusions are in the context of PouchDB (mobile) to CouchDB (server) interactions. To be more precise, I want to do something like https://stackoverflow.com/a/32206581/2904573
I have also gone through https://stackoverflow.com/a/4766398/2904573 but didn't find my answer there.
As far as I know, the replicated DB is NOT a symbolic link to the original DB; it's a duplicate.
I posted the same question on CouchDB github repo and got the answer.
The replicated DB is a new copy. In the most general terms, it takes
all the documents and attachments in the original DB (called the
source) and puts them into a new DB (called the target).
If the replication is a continuous replication, then the source is also
monitored for any new changes, and, if there are any, those changes are
copied to the target as well.
Ref: https://github.com/apache/couchdb/issues/1494#issuecomment-410933908
Thanks.
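For reference, replication between two databases can be triggered over CouchDB's HTTP API; a minimal sketch, assuming placeholder hosts, credentials, and database names:
# One-shot replication: copies all documents from the source into the target
curl -X POST http://admin:password@localhost:5984/_replicate \
     -H "Content-Type: application/json" \
     -d '{"source": "http://source-host:5984/source_db", "target": "target_db", "create_target": true}'
# Add "continuous": true to keep pushing new changes to the target as they happen
Because the target receives full copies of the documents, its size grows to roughly that of the source, which also answers the second question.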

MongoDB backup, replication, or clone with existing big data

I have one mongo instance running on Amazon. There are 5M docs in a single collection, and about 20 docs/sec of new data coming in. There is no index. My server only has 50G of space, of which 22G is already used.
Now I need to do some data analysis on that data, but because there is no index, when I execute a query the db blocks and can't insert data until I restart the server.
And data keeps coming in, so I worry that the space won't be enough.
What I'm trying to do is build another server, set up a new mongo instance, then copy the data into it. Then add an index on the new one and do the analysis.
What is the best way? Any suggestions?
Probably the best way is to just create an index in the background. This will not block anything and you can then just run the indexed query on your node. Creating an index in the background takes a bit longer but it does prevent the blocking:
db.collection.ensureIndex( { col: 1 }, { background: true } );
See also: http://docs.mongodb.org/manual/reference/method/db.collection.ensureIndex/
If you really want a secondary to do analysis, then you can create a replica set from your existing member. But for that you will have to take MongoDB down - and restart it with the replSet parameter. After starting it with that parameter, you can now add a new replica set member which will sync the data. This synching will also impact performance as lots of data will have to be copied. The primary will also need more disk space now because of the oplog that MongoDB needs to sync secondaries with.
mongodump and mongorestore can also be an option but then the data between the two nodes will not stay in sync. You would have to run the dump+restore each time you want to run analysis on the new data. In that case, a replica set might be better.
A replica set really wants 3 members though, to prevent a split brain in case a node goes down. This can be another data node, but in your case you would probably want to set up an arbiter. If you don't want automatic failover (I don't think you'd need it in this case, as you're just doing analysis), then set up your replica set with two nodes, but make the second (new) one hidden: http://docs.mongodb.org/manual/tutorial/configure-a-hidden-replica-set-member/
To summarize, your options are:
Set up a replica set from this existing member, then add the index on the secondary and do the analysis there.
Take a mongodump, restore it to a new server, and do the analysis there.
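A minimal sketch of the first option (hostnames and paths are placeholders; it involves restarting your existing server, so read the replica set tutorial first):
# Restart the existing mongod with a replica set name and initiate the set
mongod --replSet rs0 --dbpath /data/db --fork --logpath /var/log/mongod.log
mongo --eval 'rs.initiate()'
# Add the new analysis server as a hidden, priority-0 member so it can never become primary
mongo --eval 'rs.add({ _id: 1, host: "analysis-host.example.com:27017", priority: 0, hidden: true })'
# Once it has synced, connect to it directly, build the index there, and run your analysis queries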

MongoDB one way replication

I need some way to push data from client databases to a central database. Basically, there are several instances of MongoDB running on remote machines [clients], and I need some method to periodically update the central mongo database with newly added and modified documents from the clients. Each client must replicate its records to the single central server.
Eg:
If I have 3 mongo instances running on 3 machines, each having 10GB of data, then after the data migration the 4th machine's MongoDB must have 30GB of data. The central MongoDB machine must be periodically updated with the data of all 3 machines. But these 3 machines not only get new documents; existing documents in them may also get updated. I would like the central MongoDB machine to receive these updates as well.
Your desired replication strategy is not formally supported by MongoDB.
A MongoDB replica set consists of a single primary with asynchronous replication to one or more secondary servers in the same replica set. You cannot configure a replica set with multiple primaries or replication to a different replica set.
However, there are a few possible approaches for your use case depending on how actively you want to keep your central server up to date and the volume of data/updates you need to manage.
Some general caveats:
Merging data from multiple standalone servers can create unexpected conflicts. For example, unique indexes would not know about documents created on other servers.
Ideally the data you are consolidating will still be separated by a unique database name per origin server so you don't have strange crosstalk between disparate documents that happen to have the same namespace and _id shared by different origin servers.
Approach #1: use mongodump and mongorestore
If you just need to periodically sync content to your central server, one way to do so is using mongodump and mongorestore. You can schedule a periodic mongodump from each of your standalone instances and use mongorestore to import them into the central server.
Caveats:
There is a --db parameter for mongorestore that allows you to restore into a different database from the original name (if needed)
mongorestore only performs inserts into the existing database (i.e. does not perform updates or upserts). If existing data with the same _id already exists on the target database, mongorestore will not replace it.
You can use mongodump options such as --query to be more selective on data to export (for example, only select recent data rather than all)
If you want to limit the amount of data to dump & restore on each run (for example, only exporting "changed" data), you will need to work out how to handle updates and deletions on the central server.
Given the caveats, the simplest use of this approach would be to do a full dump & restore (i.e. using mongorestore --drop) to ensure all changes are copied.
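A hedged sketch of a single dump-and-restore cycle (host and database names are placeholders):
# Dump one client's database; --query/--collection could narrow this further
mongodump --host client1.example.com --db clientdb --out /backups/client1
# Restore into a per-client database on the central server; --drop replaces any
# existing collections so the central copy matches the latest dump
mongorestore --host central.example.com --db client1 --drop /backups/client1/clientdb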
Approach #2: use a tailable cursor with the MongoDB oplog.
If you need more realtime or incremental replication, a possible approach is creating tailable cursors on the MongoDB replication oplog.
This approach is basically "roll your own replication". You would have to write an application which tails the oplog on each of your MongoDB instances and looks for changes of interest to save to your central server. For example, you may only want to replicate changes for selective namespaces (databases or collections).
A related tool that may be of interest is the experimental Mongo Connector from 10gen labs. This is a Python module that provides an interface for tailing the replication oplog.
Caveats:
You have to implement your own code for this, and learn/understand how to work with the oplog documents
There may be an alternative product which better supports your desired replication model "out of the box".
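As a rough illustration of what the oplog documents look like (not a full tailing loop; the host and namespace below are placeholders):
# Print the five most recent oplog entries for one namespace on a replica set member
mongo --host client1.example.com --quiet --eval '
  var entries = db.getSiblingDB("local").oplog.rs
                  .find({ ns: "clientdb.orders" })
                  .sort({ $natural: -1 })
                  .limit(5)
                  .toArray();
  printjson(entries);
'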
You should be aware that replica sets are the only built-in replication mechanism, and a replica set always means: one primary, multiple secondaries. Writes always go to the primary server. Apparently you want multi-master replication, which is not supported by MongoDB, so you may want to look into a different technology like CouchDB or Couchbase; MongoDB is a poor fit here.
There may be a way since MongoDB 3.6 to achieve your goal: Change Streams.
Change streams allow applications to access real-time data changes without the complexity and risk of tailing the oplog. Applications can use change streams to subscribe to all data changes on a single collection, a database, or an entire deployment, and immediately react to them. Because change streams use the aggregation framework, applications can also filter for specific changes or transform the notifications at will.
There are some configuration options that affect whether you can use Change Streams or not, so please read about them.
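A minimal sketch of watching a collection from the shell (MongoDB 3.6+, the instance must be a replica set member; hostnames and collection names are placeholders):
# Open a change stream and print each change event as it arrives
mongo "mongodb://client-host.example.com:27017/clientdb" --eval '
  var watchCursor = db.orders.watch();
  while (!watchCursor.isExhausted()) {
    if (watchCursor.hasNext()) {
      printjson(watchCursor.next());
    }
  }
'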
Another option is Delayed Replica Set Members.
Because delayed members are a "rolling backup" or a running "historical" snapshot of the data set, they may help you recover from various kinds of human error. For example, a delayed member can make it possible to recover from unsuccessful application upgrades and operator errors including dropped databases and collections.
Hidden Replica Set Members may be another option to consider.
A hidden member maintains a copy of the primary's data set but is invisible to client applications. Hidden members are good for workloads with different usage patterns from the other members in the replica set.
Another option may be to configure a Priority 0 Replica Set Member.
A priority 0 member maintains a copy of the data set but cannot become primary and cannot trigger elections.
I am interested in these options myself, but I haven't decided what approach I will use.

Redis DB export/import [closed]

Does anybody know a good solution for export/import in Redis?
Generally, I need to dump the DB (and edit the dump if needed) from one server and load it onto another one (e.g. localhost).
Maybe some scripts?
Redis supports two persistence formats: RDB and AOF.
RDB is a dump like what you asked for. You can call SAVE (or BGSAVE) to force an RDB snapshot. It will be written to the file named by your dbfilename setting, or dump.rdb in the current working directory if that setting is missing.
More Info:
http://redis.io/topics/persistence
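A hedged sketch of the dump-and-copy approach (hosts and paths are placeholders; check your own dir and dbfilename settings first):
# Force an RDB snapshot on the source and find out where it was written
redis-cli -h source-host SAVE
redis-cli -h source-host CONFIG GET dir
redis-cli -h source-host CONFIG GET dbfilename
# Copy the snapshot into the target's data directory, then restart the target redis
scp source-host:/var/lib/redis/dump.rdb /var/lib/redis/dump.rdb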
If you want a server to load the content from another server, no dump is required. You can use SLAVEOF to sync the data and, once it's up to date, call SLAVEOF NO ONE.
More information on replication can be found in this link: http://redis.io/topics/replication
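A minimal sketch of that replication approach (hosts and ports are placeholders):
# Make the target a replica of the source and wait for the sync to finish
redis-cli -h target-host SLAVEOF source-host 6379
redis-cli -h target-host INFO replication     # wait until master_link_status:up
# Detach the target so it becomes a standalone copy again
redis-cli -h target-host SLAVEOF NO ONE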
You can try my dump util, rdd; it extracts or inserts data into Redis and can split, merge, filter, and rename.