How can I speed up MongoDB? - mongodb

I have a crawler which consists of 6+ processes. Half of the processes are masters which crawl the web, and when they find jobs they put them into the jobs collection. Most of the time the masters save 100 jobs at once (that is, they get 100 jobs and save each of them separately, as fast as possible).
The second half of the processes are slaves which constantly check whether new jobs of a given type are available for them; if so, a slave marks a job as in_process (this is done using findOneAndUpdate), then processes the job and saves the result in another collection.
Moreover, from time to time the master processes have to read a lot of data from the jobs collection to synchronize.
So, to sum up, there are a lot of read and write operations on the db. When the db was small it worked fine, but now that I have ~700k job records (a job document is small, with 8 fields and proper indexes / compound indexes) my db lags. I can observe it when displaying the "stats" page, which basically executes ~16 count operations with some conditions (on indexed fields).
When the master/slave processes are not running, the stats page displays after 2 seconds. When they are running, the same page takes around 1 minute, and sometimes it does not display at all (timeout).
So how can I make my db handle more requests per second? Do I have to replicate it?
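For concreteness, a minimal sketch of the claim-and-count pattern described above, using the MongoDB Java sync driver; the database, collection and field names ("crawler", "jobs", "type", "status") are assumptions, not the actual schema.

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;

    import static com.mongodb.client.model.Filters.and;
    import static com.mongodb.client.model.Filters.eq;
    import static com.mongodb.client.model.Updates.set;

    public class JobClaimSketch {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> jobs = client.getDatabase("crawler").getCollection("jobs");

                // Slave side: atomically claim one job of a given type. findOneAndUpdate
                // matches and marks a single document in one step, so two slaves cannot
                // claim the same job.
                Document claimed = jobs.findOneAndUpdate(
                        and(eq("type", "parse-page"), eq("status", "new")),
                        set("status", "in_process"));
                System.out.println("claimed: " + claimed);

                // Stats side: one of the ~16 count queries (on indexed fields) that the
                // stats page runs.
                long pending = jobs.countDocuments(and(eq("type", "parse-page"), eq("status", "new")));
                System.out.println("pending: " + pending);
            }
        }
    }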

Related

Spark job running fast after count

I have a Scala job which has a few CDC records (10K records). As part of a merge job it does member matching against another Hive table which has 4 million members, provider matching against another Hive table which has 1 million providers, and a bunch of other stuff (recycle logic/rejection logic), and it usually takes around 30 minutes to complete.
With the same volume I added an action (count(*)) after every module to understand which one takes more time; however, after adding the actions the job completed in 6 minutes. As per best practice we should not use actions frequently, so I don't understand what is making the job run faster. Any link with an explanation would be helpful.
Is it that resources may get released after every module's execution because of the action, and that makes the overall job run faster?
My cluster has 6 nodes with 13-core machines and 64 GB of memory each; however, other processes are also running on it, so it's usually over-utilized.
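For illustration only, a rough sketch (in Java rather than the original Scala) of the "action after every module" instrumentation described above; the table names and join key are placeholders.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class MergeJobTiming {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("merge-job-timing")
                    .enableHiveSupport()
                    .getOrCreate();

            // Placeholder tables standing in for the CDC records and the member table.
            Dataset<Row> cdc = spark.table("cdc_records");
            Dataset<Row> members = spark.table("members");

            // Module 1: member matching.
            Dataset<Row> matched = cdc.join(members,
                    cdc.col("member_id").equalTo(members.col("member_id")), "left");

            // The extra action: count() forces the plan built so far to execute,
            // which is how each module's time was measured.
            long t0 = System.nanoTime();
            long rows = matched.count();
            System.out.printf("member matching: %d rows in %d ms%n",
                    rows, (System.nanoTime() - t0) / 1_000_000);

            // ...provider matching, recycle/rejection logic, each followed by a count()...
            spark.stop();
        }
    }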

Slow transaction processing in PostgreSQL

I have been noticing bad behavior in my PostgreSQL database, but I still can't find any solution or improvement to apply.
The context is simple: let's say I have two tables, CUSTOMERS and ITEMS. On certain days the number of concurrent customers increases, and so do the requests for items, which they can consult, add to or remove amounts from. However, in the APM I can see that each new request is slower than the previous one, pointing at the query response from those tables as the biggest time consumer.
If the normal execution time of the query is about 200 milliseconds, a few moments later it can be about 20 seconds.
I understand the locking behavior in PostgreSQL, as many users can be checking the same item and could even be changing its values, but the response from the database is far too slow.
So I would like to know if there are ways to improve the performance of the database.
I initially used PGTune to get the starting settings and it worked well. I have version 11 with 20 GB of RAM, 4 vCPUs and SAN storage, and the number of simultaneous customers (not sessions) can reach over 500.
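Not a fix in itself, but given the suspicion of lock contention above, one way to confirm it is to look at which backends are waiting on locks while the slowdown is happening. A small JDBC sketch against pg_stat_activity (the connection details are placeholders):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class LockWaitCheck {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:postgresql://localhost:5432/shop", "app", "secret");
                 Statement st = conn.createStatement();
                 // pg_stat_activity reports what each backend is waiting on (9.6+).
                 ResultSet rs = st.executeQuery(
                         "SELECT pid, state, wait_event_type, wait_event, query " +
                         "FROM pg_stat_activity WHERE wait_event_type = 'Lock'")) {
                while (rs.next()) {
                    System.out.printf("pid=%d state=%s waiting on %s/%s: %s%n",
                            rs.getInt("pid"), rs.getString("state"),
                            rs.getString("wait_event_type"), rs.getString("wait_event"),
                            rs.getString("query"));
                }
            }
        }
    }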

Spring Batch partitioning: can the master read the database and pass data to workers?

I am new to Spring Batch and trying to design a new application which has to read 20 million records from a database and process them.
I don't think we can do this with one single Job and Step (sequentially, with one thread).
I was thinking we could do this with partitioning, where the step is divided into a master and multiple workers (each worker is a thread which does its own processing and can run in parallel).
We have to read an existing table which has 20 million records and process them, but this table does not have any auto-generated sequence number; its primary key is something like a 10-digit employer number.
I checked a few sample codes for partitioning where we pass a range to each worker and the worker processes the given range, like worker1 handles 1 to 100 and worker2 handles 101 to 200... but in my case that is not going to work, because we don't have a sequence number to pass as a range to each worker.
With partitioning, can the master read the data from the database (say 1000 records) and pass it to each worker instead of sending a range?
Or, for the above scenario, would you suggest any other, better approach?
In principle, any query that returns result rows in a deterministic order is amenable to partitioning, as in the examples you mentioned, by means of OFFSET and LIMIT options. The ORDER BY may considerably increase the query execution time, although if you order by the table's primary key this effect should be less noticeable, as the table's index will already be ordered. So I would give this approach a try first, as it is the most elegant IMHO.
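A minimal sketch of such a partitioner, assuming a hypothetical table ordered by its 10-digit primary key and a worker-step reader that builds its OFFSET/LIMIT query from the values placed in each ExecutionContext:

    import java.util.HashMap;
    import java.util.Map;

    import org.springframework.batch.core.partition.support.Partitioner;
    import org.springframework.batch.item.ExecutionContext;

    // Each worker step would read its slice with something like:
    //   SELECT ... FROM employee ORDER BY employer_number OFFSET :offset LIMIT :limit
    // (table and column names are placeholders).
    public class OffsetLimitPartitioner implements Partitioner {

        private final long totalRows;  // e.g. ~20 million, ideally obtained via SELECT COUNT(*)
        private final long rangeSize;  // rows per partition

        public OffsetLimitPartitioner(long totalRows, long rangeSize) {
            this.totalRows = totalRows;
            this.rangeSize = rangeSize;
        }

        @Override
        public Map<String, ExecutionContext> partition(int gridSize) {
            // gridSize is ignored here; the number of partitions follows from rangeSize.
            Map<String, ExecutionContext> partitions = new HashMap<>();
            int index = 0;
            for (long offset = 0; offset < totalRows; offset += rangeSize) {
                ExecutionContext context = new ExecutionContext();
                context.putLong("offset", offset);
                context.putLong("limit", Math.min(rangeSize, totalRows - offset));
                partitions.put("partition" + index++, context);
            }
            return partitions;
        }
    }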
Note however that you might run into other problems processing a huge result set straight from a JdbcCursorItemReader, because some RDBMSs (like MySQL) won't be happy with the rate at which you'd be fetching rows interlocked with processing. So depending on the complexity of your processing I would recommend validating the design in that regard early on.
Unfortunately it is not possible to retrieve a partition's entire set of table rows and pass it as a parameter to the worker step as you suggested, because the parameter must not serialize to more than a kilobyte (or something in that order of magnitude).
An alternative would be to retrieve each partition's data and store it somewhere (in a map entry in memory if size allows, or in a file) and pass the reference to that resource in a parameter to the worker step which then reads and processes it.
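A rough sketch of that alternative, assuming the master has already fetched and split the primary keys; only a file path goes into each ExecutionContext, keeping it far below the serialized-size limit, and the worker's reader loads the keys back from the file:

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.springframework.batch.item.ExecutionContext;

    public class FileBackedPartitions {

        // Writes each partition's primary keys (e.g. 10-digit employer numbers) to a
        // temporary file, one per line, and records only the file path in the context.
        public static Map<String, ExecutionContext> toFileBackedContexts(
                List<List<String>> partitionedKeys) throws Exception {
            Map<String, ExecutionContext> contexts = new HashMap<>();
            int index = 0;
            for (List<String> keys : partitionedKeys) {
                Path file = Files.createTempFile("partition-" + index + "-", ".keys");
                Files.write(file, keys);
                ExecutionContext context = new ExecutionContext();
                context.putString("keysFile", file.toString());
                contexts.put("partition" + index++, context);
            }
            return contexts;
        }
    }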

What is MongoDB (WiredTiger) update query's default lock wait time?

I have an embedded sub-document array in a MongoDB document, and multiple users can try to add sub-documents to the array. I use an update ($push) query to add a document to the array, but if multiple users try to add an entry from the UI, how do I make sure the second $push doesn't fail due to a lock held by the first? Only a couple of users would ever add an entry to a single document at the same time, so I don't need to worry about the case where hundreds of users exist. What is the default wait time of an update in WiredTiger, so that the second $push doesn't abort immediately and can take up to 1 second but still completes successfully?
I tried to find the default wait time in the MongoDB and WiredTiger docs; I could find default wait times for transactions, but not for an update query.
Internally, WiredTiger uses an optimistic locking method. This means that when two threads are trying to update the same document, one of them will succeed, and the other will back off. This will manifest as a "write conflict" (see metrics.operation.writeConflicts).
This conflict will be retried transparently, so from a client's point of view, the write will just take longer than usual.
The back-off algorithm waits longer the more conflicts it encounters, starting from 1 ms and capped at 100 ms per wait. So under repeated conflicts, it will eventually wait 100 ms per retry.
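So nothing special is needed on the application side for the $push itself. A minimal sketch with the MongoDB Java sync driver (database, collection and field names are placeholders), where a conflicting concurrent $push simply shows up as extra latency:

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;

    import static com.mongodb.client.model.Filters.eq;
    import static com.mongodb.client.model.Updates.push;

    public class PushEntry {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> docs = client.getDatabase("app").getCollection("parents");

                // Two users issuing this concurrently against the same _id may conflict;
                // the server retries the loser internally, so both $push operations
                // succeed - the second one just takes a little longer.
                docs.updateOne(eq("_id", "parent-1"),
                        push("entries", new Document("user", "u42").append("text", "hello")));
            }
        }
    }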
Having said that, the design of updating a single document from multiple sources will have trouble scaling in the future, for two reasons:
MongoDB has a limit of 16 MB document size, so the array cannot be updated indefinitely.
Locking issues, as you have readily identified.
For #2, in a pathological case, a write can encounter conflict after conflict, waiting 100 ms between retries. There is no cap on the number of retries, so it can potentially wait for minutes. This is a sign that the workload is bottlenecked on a single document, and the app is essentially operating on a single-thread model.
Typically the solution is to not create artificial bottlenecks, but to spread out the work across many different documents, perhaps in a separate collection. This way, concurrency can be maintained.
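For example (illustrative names only), instead of pushing every entry into one parent document's array, each entry can be inserted as its own document in a separate collection keyed by the parent's id, so concurrent writers never touch the same document:

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;

    import static com.mongodb.client.model.Filters.eq;

    public class SpreadOutEntries {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> entries =
                        client.getDatabase("app").getCollection("entries");

                // One small document per entry: inserts never conflict with each other,
                // and the 16 MB document limit is no longer a concern.
                entries.insertOne(new Document("parentId", "parent-1")
                        .append("user", "u42")
                        .append("text", "hello"));

                // Reading a parent's entries back is a simple indexed query on parentId.
                for (Document d : entries.find(eq("parentId", "parent-1"))) {
                    System.out.println(d.toJson());
                }
            }
        }
    }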

Memcache or Queue for Hits Logging

We have a busy website which needs to log 'hits' for certain pages or API endpoints that are visited, to help populate stats, popularity grids, etc. The hits we're logging aren't simple page hits, so we can't use log parsing.
In the past, we've just directly logged to the database with an update query, but under heavy concurrency, this creates a database load that we don't want.
We are currently using Memcache but experiencing some issues with the stats not being quite accurate due to non-atomic updates.
So my question:
Should we continue to use Memcache but improve it with atomic increments:
1) When a page is hit, create a memcache key such as "stats:pageid:3" and atomically increment it on each hit
2) Write a batch script to cycle through all the memcache keys and create a batch update to the database once every 10 minutes
PROS: Fewer database hits, as we're only updating once per page per 10 minutes (with however many hits occurred in that 10-minute period)
CONS: We can atomically increment the individual counters, but we would still need a memcache key to store which page ids have had hits, so we can loop through and log them. This won't be atomic, so when we flush the data to the DB and reset everything, things may linger in this key. We could lose up to 10 minutes of data.
OR Use a queue/task system:
1) When a page is hit, add a job to the task queue
2) The task queue can then be rate limited and process these 'hits' into the database in the background.
PROS: Easy to code, we can scale up queue workers if required.
CONS: We're still hitting the database once per hit, as each task would be processed individually rather than 'summing' up all the hits (a worker that sums before writing, sketched below, would avoid this).
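A sketch of what such a summing worker could look like, using an in-process BlockingQueue as a stand-in for whatever queue system is used and a hypothetical flushToDatabase method for the batched write:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.TimeUnit;

    public class HitAggregator {

        private final BlockingQueue<String> hits = new LinkedBlockingQueue<>();

        // Called from the request path: cheap, non-blocking enqueue of a page id.
        public void recordHit(String pageId) {
            hits.offer(pageId);
        }

        // Background worker: drain for a window, sum per page, then write one batch.
        public void runWorker() throws InterruptedException {
            while (true) {
                Map<String, Long> counts = new HashMap<>();
                long deadline = System.currentTimeMillis() + TimeUnit.MINUTES.toMillis(10);
                while (System.currentTimeMillis() < deadline) {
                    String pageId = hits.poll(1, TimeUnit.SECONDS);
                    if (pageId != null) {
                        counts.merge(pageId, 1L, Long::sum);
                    }
                }
                if (!counts.isEmpty()) {
                    flushToDatabase(counts);  // one UPDATE per page, not one per hit
                }
            }
        }

        // Hypothetical batched write, e.g. "UPDATE stats SET hits = hits + ? WHERE page_id = ?".
        private void flushToDatabase(Map<String, Long> counts) {
            counts.forEach((page, n) -> System.out.println(page + " += " + n));
        }
    }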
Or any other suggestions?
OR: use something designed for recording stats at high traffic levels, such as StatsD & Graphite. The original StatsD is written in JavaScript on top of Node.js, which can be a little complex to set up (though there are easier ways to install it, e.g. with a Docker container), or you can use a work-alike (not using Node.js) that does the same job, such as one written in Go.
I've used the original StatsD and Graphite pair to great effect, and it makes pretty graphs too (this was for tens of millions of events per day).
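For reference, incrementing a StatsD counter is just a small UDP datagram in the line format name:value|c; a minimal sketch, with the default port 8125 and a made-up metric name:

    import java.net.DatagramPacket;
    import java.net.DatagramSocket;
    import java.net.InetAddress;
    import java.nio.charset.StandardCharsets;

    public class StatsdHit {
        public static void main(String[] args) throws Exception {
            // "hits.page.3:1|c" means: increment the counter hits.page.3 by 1.
            byte[] payload = "hits.page.3:1|c".getBytes(StandardCharsets.UTF_8);
            try (DatagramSocket socket = new DatagramSocket()) {
                socket.send(new DatagramPacket(payload, payload.length,
                        InetAddress.getByName("localhost"), 8125));
            }
            // Fire-and-forget: UDP never blocks the request path, and StatsD aggregates
            // the counters before flushing them to Graphite.
        }
    }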