Mongo Import on a Production DB - mongodb

I have a production system which is running with thousands of user profiles. These profiles were added by us at the time of launch after which users have added/updated more data into it.
Now, we are planning to add around 20K more profiles into the system. These 20K Profiles are present in our staging mongo instance.
What would be the best approach to get these profiles on Production instance considering there will be constant read/writes happening on the system? Also, we want to make sure if there is any common user than his profile should not be altered as he might have made some changes to it.
I was thinking of using mongorestore as it guaranties to insert documents and not update. What would be the risk involved?
Thanks

As always - test it first!
If your unhappy with mongorestore's impact on loadthen a custom python script will give you more granular control. This will allow you to put whatever business logic and checks you need and potentially add some flood limitations.

Related

Sync two offline masters when network available

I have a use case where I need to set up two physical stations at a venue. Each station will be running a couple of app servers and a mongodb server.
I can't rely on the venue's internet access so I need my app to be able to work offline and "sync" the dbs every once in a while.
I initially thought about having two masters that would somehow sync with a remote one but TIL that master-master replication is not possible with mongodb.
I've read about the active-active approach, however, that won't let me write to a different shard when offline.
I'm running out of ideas, any recommendation would be greatly appreciated.
------ Update on what I'm trying to achieve:
I'm working with a venue that has two entrances. The idea is to be able to capture some information from people attending the events (name, email, etc). After getting registered we will print a name tag with some of the info.
Everything sounds pretty easy, however, if possible, I would like to not rely on the venue's network (internet). So that's where I started struggling figuring out whats the best approach. I guess what I want is being able to have a remote mongo but if the network goes down somehow keep saving records locally and send them to the remote mongo instance when network is available again.
Extra considerations:
- Events last a couple of days, some people lose their name tag overnight, they should be able to go to either of the entrances and get it reprinted. So we should be able to find their info even if they registered in entrance A but they are asking for a reprint in entrance B.
More questions:
- Am I overthinking it? Maybe venue's network + a 4G/LTE modem as a backup should be enough? I would prefer not relying on it tho.
I believe you're overthinking things. Here's what I would do if faced with a similar situation:
From the description, it doesn't sound like the two sites need to be connected in real time at all. I would create a server on Entry A, another in Entry B, and consolidate their data each day after the day ended if required. This is because:
It's unlikely that one person will register in both sites within a single day. If they lost their tag on that day, I'll just tell them to go back to where they registered earlier and get it reprinted there. Worst case, you'll create a duplicate entry (should be obvious which is the duplicate since no one would lose their tag within seconds) but I would not anticipate hundreds of people all lost their tags within a day.
If the attendee lost their tag overnight, both servers will have synced data and should be able to reprint.
If you're concerned about the venue's Wifi access, just run cables from the server to the printing stations.
Personally, I would argue that the overnight sync is not really needed at all (see the likelihood of people registering twice). I would just collect the data from both servers after the event ended. That is, unless you have specific needs for the combined data from both entries during the 2nd day.
Note: please make sure you're running a minimum of 3-node replica set. Running a standalone instance for prod environment is not recommended. Hardware/disk corruption is a common event.

Share a MongoDB instance between Meteor apps without lag in reactivity?

This question has been asked multiple times, here and here, and the answer to get this working is fairly straight forward: add an environmental variable to your bash_profile and all Meteor instances on your localhost will share that MONGO_URL.
What I've noticed however is that while this may be the case, there's quite a bit of latency in the "reactivity" of Meteor. I've tested this with two very lean Meteor apps, with empty collections. Inserting a document to a collection from one Meteor app, where my second app is querying that same collection and printing out a field from the documents does work, but there's a noticeable lag before it updates. I've ruled out the possibility of the collection insertion being the source of the lag (simple console.log callback on the client of the first app, logging the id of the newly inserted document).
My purpose for having multiple apps (two to be precise) sharing the same MongoDB is to separate an admin panel from a mobile app without going crazy regarding name-spacing and bloat. This configuration works, but I'm not sure it's the "proper" way of accomplishing the task, and it certainly seems to be causing a performance hit.
Any insight into this matter would be appreciated. Thank you!
EDIT: To clarify, the db URL I'm using is on my localhost, and isn't something hosted online.
When you use an external database, by default meteor will use periodic polling (every few seconds) in order to observe any changes. The delay you are experiencing is a result of this polling process. You can remove the delay and reduce your app's CPU usage by taking advantage of meteor's oplog tailing feature. In order to use it you will:
Get access to a mongodb instance with the oplog turned on.
Set the environment variable MONGO_OPLOG_URL so your app(s) can read the oplog.
Personally, I'd recommend compose.io for this. They provide exactly this as part of their basic elastic deployment. See this post for detailed instructions.
For users who wish to connect to the oplog created locally for you, you can obtain the URL via:
MongoInternals.defaultRemoteCollectionDriver().mongo._oplogHandle._oplogUrl
It should end up looking something like mongodb://127.0.0.1:3001/local

MongoDB - Safeguard against .remove() entire database?

I'm using MongoDB as my database, and as a first-time back-end developer the ease with which I can delete an entire database/collection really bothers me.
Simply typing db.collection.remove() removes all records from that collection!
I know that an effective backup strategy should render this a non-issue, but I occasionally do run .remove() on some collections, and I'd hate to type in the wrong collection name by accident and (a) have to go through a backup restore, and (b) lose whatever data I had gathered between the backup and the restore, especially as my app gathers a lot of user data.
Is there any 'safeguard' I can set up my database to use, even if it's just a warning/confirmation that says
"Yo, are you sure you want to remove everything from <collectionname>? Choose: Yes/No"
User roles won't fix your problem. If your account has permissions to delete one user, you could accidentally delete them all. If your account has permissions to update an attribute for one user, you could accidentally update all of your users.
There's a simple fix for this however.
Step 0: Backup your database. And test your backups regularly. And make sure you get alerted if the backup did not run, or errored. Replica sets are not backups. I know this is obvious, but evidentally it's not obvious to everybody.
Step 1: Write a web admin GUI interface for your database. This it will only take a day or two -- and it should be simple enough that a secretary or intern could use it without fear for your data. (If you think this will take a long time, find a framework with more bells and whistles. Your admin console doesn't even need to be written in the same language as your app.)
Step 2: Data migrations (maintenance transformations of your database) should always be run from scripts checked into source control and tested on non-prod beforehand. The script could be as simple as mongo -e "foo.update(blah)", but you should run it as a script to avoid cut-n-paste errors. Ideally, you would even have a checklist for all migrations. (Check that you have a recent backup. Check the database log and system load beforehand. Write a before and after query that will tell you if the migration was successful...)
Step 3: You now no longer need to use the production Mongo console. So don't. It's a useful tool for development, but that's only needed on local development databases.
The above-mentioned Roles might be useful for read-only queries. But you can already do that against the non-master replica set member.
tl;dr: You can go pretty far using cowboy admin techniques, but eventually you're going to figure out that it's better (and not much more work) to automate everything.
There is nothing you can do in the current version to provide this functionality.
In a future version when user defined roles are available you could define a role which allows insert() and update() but not remove() or drop() etc. and therefore make yourself log-in as a different higher-role user, but that's not available in the current (2.4) version.

How to limit the effect of client modifications to production systems

Our shop has developed a few WEB/SMS/DB solution for a dozen client installations. The applications have some real-time performance requirements, and are just good enough to function properly. The problem is that the clients (owners of the production servers) are using the same server/database for customizations that are causing problems with the performance of the applications that we created and deployed.
A few examples of clients' customizations:
Adding large tables with many text datatypes for the columns that get cast to other data types in the queries
No primary keys, indexes, or FK constraints
Use of external scripts that use count(*) from table where id = x, in a loop from the script, to determine how to construct more queries later in the same script. (no bulk actions that the planner can optimize or just do everything in a single pass)
All new code files on the server are created/owned by root, with 0777 permissions
The clients don't take suggestions/criticism well. If we just go ahead and try to port/change the scripts ourselves, the old code can come back, clobbering any changes that we make! Or with out limited knowledge of their use cases, we break functionality while trying to optimize their changes.
My question is this: how can we limit the resources to queries/applications other that what we create and deploy? Are there any pragmatic options in scenarios like this? We prided ourselves in having an OSS solution, but it seems that it's become a liability.
We use PG 8.3 running on a range on Linux Distos. The clients prefer php, but shell scripts, perl, python, and plpgsql are all used on the system in one form or another.
This problem started about two minutes after the first client was given full access to the first computer, and it hasn't gone away since. Anytime someone whose priorities are getting business oriented work done quickly they will be sloppy about it and screw up things for everyone. That's just how things work, because proper design and implementation are harder than cheap hacks. You're not going to solve this problem, all you can do is figure out how to make it easier for the client to work with you than against you. If you do it right, it will look like excellent service rather than nagging.
First off, the database side. There's now way to control query resources in PostgreSQL. The main difficulty is that tools like "nice" control CPU usage, but if the database doesn't fit in RAM it may very well be I/O usage that is killing you. See this developer message summarizing the issues here.
Now, if in fact it's CPU the clients are burning through, you can use two techniques to improve that situation:
Install a C function that changes the process priority (example 1, example 2) and make sure whenever they run something it gets called first (maybe put it into their psql config file, there are other ways).
Write a script that looks for postmaster processes spawned by their userid and renice them, make it run often in cron or as a daemon.
It sounds like your problem isn't the particular query processes they're running, but rather other modifications they're making to the larger structure. There's only one way to cope with that: you have to treat the client like they're an intruder and use the approaches of that portion of the computer security field to detect when they screw things up. Seriously! Install an intrusion detection system like Tripwire on the server (there are better tools, that's just the classic example), and have it alert you when they touch anything. New file that's 0777? Should jump right out of a proper IDS report.
On the database side, you can't directly detect the database being modified usefully. You should do a pg_dump of the schema every day into a file (pg_dumpall -g and pg_dump -s, then diff that against the last one you delivered and again alert you when it's changed. If you manage that this well, the contact with the client turns into "we noticed you changed on the server...what is it you're trying to accomplish with that?" which makes you look like you're really paying attention to them. That can turn into a sales opportunity, and they may stop fiddling with things as much just knowing you're going to catch it immediately.
The other thing you should start doing immediately is install as much version control software as you can on each client box. You should be able to login to each system, run the appropriate status/diff tool for the install, and see what's changed. Get that mailed to you regularly too. Again, this works best if combined with something that dumps the schema as a component to what it manages. Not enough people use serious version control approaches on the code that lives in the database.
That's the main set of technical approaches useful here. The rest of what you've got is a classic consulting client management problem that's far more of a people problem than a computer one. Cheer up, it could be worse--FSM help you if you give them ODBC access and they discover they can write their own queries in Access or something simple like that.

How to apply database updates after deployment?

i know this is an often asked question on these boards. And usually the question has been about how to manage the changes being made to the database before you even get around to deploy them.Mostly the answer has been to script the database and save it under sourcecontrol and then any additional updates are saved as scripts under version control too.(ex. Tool to upgrade SQL Express database after deployment)
my question is when is it best to apply the database updates , in the installer or when the new version first runs and connects to the database? note this is a WinApp that is deployed to customers each have their own databases.
One thing to add to the script: Back up the database (or at least the tables you're changing!) before applying the changes.
As a user I think I'd prefer it happens during the install, and going a little further that the installer can roll itself back in the event of a failure. My thinking here is that if I am installing an update, I'd like to know when the update is done that it actually is done and has succeeded. I don't want a message coming up the next time I run it informing me that something failed and I've potentially lost all my data. I would assume that a system admin would probably also appreciate install time feedback (of course, that doesn't matter if your web app isn't something that will be installed on a network). Also, as ראובן said, backing up the database would be a nice convenience.
You haven't said much about the architecture of the application, but since an installer is involved I assume it's a client/server application.
If you have a server installer, that's where you want to put it, since the database structure is only going to change once. Since the client installers are going to need to know about the change, it would be nice to have a way to detect the database version change, and for the old client to be able to download the client update from the server automatically and apply it.
If you only have a client installer, I still think it's better to put it there (maybe as a custom action that fires off the executable for updating the database). But it really isn't going to matter, because conceptually one installer or first-time user of the new version is going to have to fire off the changes to the database anyway. The database changes are going to put structural locks on the database so, in practical terms, everyone is going to have to be kicked off the system at that time for the database update to be applied.
Of course, this is all BS if it's not client-server.