Fastest way to update large amount of data

Fastest way to update large amount of data - mongodb

I have milions of rows in mongo collection and need to update all of them.
I've written a mongo shell (JS) script like this:
db.Test.find().forEach(function(row) {
// change data and db.Test.save()
});
which (i guess) should be faster then e.g. updating via any language driver due to possible latency between web server and mongo server itself and just because the fact, that driver is "something on the top" and mongo is "something in the basement".
Even though it can update aproximately 2 100 rec./sec on quad-core 2.27GHz processor with 4GB RAM.
As i know mongoimport can handle around 40k rec./sec (on the same machine), i don't think mentioned speed is anything "fast".
Is there any faster way?

There are two possible limiting factors here:
Single write lock: MongoDB only has one write lock, this may be the determining factor.
Disk Access: if data being updated is not actively in memory it will need to be loaded from disk which will cause a slow-down.
Is there any faster way?
The answer here depends on the bottleneck. Try running iostat and mongostat to see where the bottleneck lies. If iostat shows high disk IO, then you're being held back by the disk. If mongostat shows a high "lock%" then you've maxed out access to the global write lock.
If you've maxed out IO, there is no simple code fix. If you've maxed out the write lock, there is no simple code fix. If neither of these is an issue, it may be worth trying another driver.
As i know mongoimport can handle around 40k rec./sec (on the same machine)
This may not be a fair comparison, many people run mongoimport on a fresh database and the data is generally just loaded into RAM.
I would start by checking iostat / mongostat.

Related

Postgres why is swap-usage growing? How to reduce it? - AWS RDS

Having a postgres DB on AWS-RDS the Swap Usage in constantly rising.
Why is it rising? I tried rebooting but it does not sink. AWS writes that high swap usage is "indicative of performance issues"
I am writing data to this DB. CPU and Memory do look healthy:
To be precise i have a
db.t2.micro-Instance and at the moment ~30/100 GB Data in 5 Tables - General Purpose SSD. With the default postgresql.conf.
The swap-graph looks as follows:
Swap Usage warning:

Well It seems that your queries are using a memory volume over your available. So you should look at your queries execution plan and find out largest loads. That queries exceeds the memory available for postgresql. Usually over-much joining (i.e. bad database structure, which would be better denonarmalized if applicable), or lots of nested queries, or queries with IN clauses - those are typical suspects. I guess amazon delivered as much as possible for postgresql.conf and those default values are quite good for this tiny machine.
But once again unless your swap size is not exceeding your available memory and your are on a SSD - there would be not that much harm of it

check the
select * from pg_stat_activity;
and see if which process taking long and how many processes sleeping, try to change your RDS DBparameter according to your need.

Obviously you ran out of memory. db.t2.micro has only 1GB of RAM. You should look in htop output to see which processes takes most of memory and try to optimize memory usage. Also there is nice utility called pgtop (http://ptop.projects.pgfoundry.org/) which shows current queries, number of rows read, etc. You can use it to view your postgress state in realtime. By the way, if you cannot install pgtop you can get just the same information from posgres internal tools - check out documentation of postgres stats collector https://www.postgresql.org/docs/9.6/static/monitoring-stats.html
Actually it is difficult to say what the problem is exactly but db.t2.micro is a very limited instance. You should consider taking a biggier instance especially if you are using postgres in production.

design mongodb to load entire content in memory

I am involved in a project where they get enough RAM to store the entire database in memory. According to the manager, that is what 10Gen recommended. This is counter intuitive. Is that really the way you want to use Mongodb?

It is not counter intuitive... I find it quite intuitive, actually.
In How much faster is the memory usually than the disk? you can read:
(...) memory is only about 6 times faster when you're doing sequential
access (350 Mvalues/sec for memory compared with 58 Mvalues/sec for
disk); but it's about 100,000 times faster when you're doing random
access.
So if you can fit all your data in RAM, it is quite good because you are going to be really fast reading your data.
Regarding MongoDB, from the FAQ's:
It’s certainly possible to run MongoDB on a machine with a small
amount of free RAM.
MongoDB automatically uses all free memory on the machine as its
cache. System resource monitors show that MongoDB uses a lot of
memory, but its usage is dynamic. If another process suddenly needs
half the server’s RAM, MongoDB will yield cached memory to the other
process.
Technically, the operating system’s virtual memory subsystem manages
MongoDB’s memory. This means that MongoDB will use as much free memory
as it can, swapping to disk as needed. Deployments with enough memory
to fit the application’s working data set in RAM will achieve the best
performance.
The problem is that you usually have much more data than memory available. And then you have to go to disk, and disk I/O is slow. Regarding database performance, avoiding full scan queries is key (much more important when accessing to disk). Therefore, if your data set does not fit in memory, you should aim at having indexes for the vast majority of your access patterns and try to fit those indexes in memory:
If you have created indexes for your queries and your working data set
fits in RAM, MongoDB serves all queries from memory.

It all depends on the size of your database. I am guessing that you said your database was actually quite small, otherwise I cannot see how someone at 10gen gave such advice, I mean not even #Stennie gives such advice (he is 10gen by the way).
Even if your database is small I don't see how the manager recommended that. MongoDB does not do memory management of its own as such it does not "pin" data into pages like memcached does or other memory based databases do.
This means that the paging of mongods data can be quite unpredicatable, a.k.a you will spend more time trying to keep things in RAM than paging in data. This is why it is better to just make sure your working set fits and it can loaded with speed, such things are based upon your hardware and queries.
#Stennies comment pretty much sums up the stance you should be taking with MongoDB.

PostgreSQL Performance: keep seldomly used small database in memory while server busy with big database

I have a server with 64GB RAM and PostgreSQL 9.2. On it is one small database "A" with only 4GB which is only queried once an hour or so and one big database "B" with about 60GB which gets queried 40-50x per second!
As expected, Linux and PostgreSQL fill the RAM with the bigger database's data as it is more often accessed.
My problem now is that the queries to the small database "A" are critical and have to run in <500ms. The logfile shows a couple of queries per day that took >3s though. If I execute them by hand they, too, take only 10ms so my indexes are fine.
So I guess that those long runners happen when PostgreSQL has to load chunks of the small databases indexes from disk.
I already have some kind of "cache warmer" script that repeats "SELECT * FROM x ORDER BY y" queries to the small database every second but it wastes a lot of CPU power and only improves the situation a little bit.
Any more ideas how to tell PostgreSQL that I really want that small database "sticky" in memory?

PostgreSQL doesn't offer a way to pin tables in memory, though the community would certainly welcome people willing to work on well thought out, tested and benchmarked proposals for allowing this from people who're willing to back those proposals with real code.
The best option you have with PostgreSQL at this time is to run a separate PostgreSQL instance for the response-time-critical database. Give this DB a big enough shared_buffers that the whole DB will reside in shared_buffers.
Do NOT create a tablespace on a ramdisk or other non-durable storage and put the data that needs to be infrequently but rapidly accessed there. All tablespaces must be accessible or the whole system will stop; if you lose a tablespace, you effectively lose the whole database cluster.

Memory efficient way to import large files and data into MongoDB?

After recently experimenting with MongoDB, I tried a few different methods of importing/inserting large amounts of data into collections. So far the most efficient method I've found is mongoimport. It works perfectly, but there is still overhead. Even after the import is complete, memory isn't made available unless I reboot my machine.
Example:
mongoimport -d flightdata -c trajectory_data --type csv --file trjdata.csv --headerline
where my headerline and data look like:
'FID','ACID','FLIGHT_INDEX','ORIG_INDEX','ORIG_TIME','CUR_LAT', ...
'20..','J5','79977,'79977','20110116:15:53:11','1967', ...
With 5.3 million rows by 20 columns, about 900MB, I end up like this:
This won't work for me in the long run; I may not always be able to reboot, or will eventually run out of memory. What would be a more effective way of importing into MongoDB? I've read about periodic RAM flushing, how could I implement something like with the example above?
Update:
I don't think my case would benefit much from adjusting fsync, syncdelay, or journaling. I'm just curious as to when that would be a good idea, and best practice, even if I was running on high RAM servers.

I'm guessing that memory is being used by mongodb itself, not mongoimport. Mongodb by design tries to keep all of its data into memory and relies on the OS to swap the memory-mapped files out when there's not enough room. So I'd give you two pieces of advice:
Don't worry too much about what your OS is telling you about how much memory is "free" -- a modern well-running OS will generally use every bit of RAM available for something.
If you can't abide by #1, don't run mongodb on your laptop.

Machine hangs when mongoDB db.copyDatabase(...) takes all vailable RAM

When I try to copy a database from one mongoDB server to another (About 100GB) the mongo daemon process takes 99% of the available RAM (Windows 64bit 16GB). As a result the system becomes very slow and sometimes unstable.
Is there any way to avoid it?
MongoDB 2.0.6

Albert.
MongoDB is very much an "in ram" application. Mongo has all of your database memory mapped for usage but normally only the most recently used data will be in RAM (called your working set) and mongo will page out to get any data not in RAM as needed. Normally mongo's behaviour is only to have as much as it needs in RAM, however when you do something like a DB Copy all of the data is needed - thus the mongod consuming all your ram.
There is no ideal solution to this, but if desperately needed you could use WSRM http://technet.microsoft.com/en-us/library/cc732553.aspx to try and limit the amount of RAM consumed by the process. This will have the effect of making the copy take longer and may cause other issues.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse