TokyoCabinet write speed too slow - Perl

I have a Perl script (on Ubuntu 12.04 LTS) writing to 26 TCH files. The keys are roughly equally distributed across them. The writes become very slow after 3 million inserts (spread equally over all the files): the speed drops from 240,000 inserts/min at the beginning to 14,000 inserts/min after 3 million inserts. Individually the shard files are no more than 150 MB, and together they come to around 2.7 GB.
I run optimize on each TCH file after every 100K inserts to that file, with bnum set to 4 × the number of records at that point and options set to TLARGE, and I make sure xmsiz is sized to match bnum (as suggested in "Why does tokyo tyrant slow down exponentially even after adjusting bnum?").
Even after this, inserts start out fast and then slowly fall from 240k inserts/min to 14k inserts/min. Could it be caused by holding 26 TCH handles open in a single script? Or is there a configuration setting I'm missing? (Would disabling journaling help? The thread above says journaling only hurts performance once a TCH file grows beyond 3-4 GB, and my shards are under 150 MB each.)
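For reference, each shard is opened and tuned roughly like this (a simplified sketch; the file name, bnum/xmsiz values and the synthetic insert loop are illustrative, not my exact code):

    use strict;
    use warnings;
    use TokyoCabinet;

    # Rough sketch of how one of the 26 shards is opened, tuned and written to.
    my $bnum  = 4 * 1_000_000;           # ~4x the expected record count for this shard
    my $xmsiz = 256 * 1024 * 1024;       # extra mapped memory, in bytes

    my $hdb = TokyoCabinet::HDB->new();
    $hdb->tune($bnum, -1, -1, $hdb->TLARGE);   # bnum, apow, fpow, opts; -1 keeps the default
    $hdb->setxmsiz($xmsiz);

    $hdb->open("shard_a.tch", $hdb->OWRITER | $hdb->OCREAT)
        or die "open error: " . $hdb->errmsg($hdb->ecode());

    my $count = 0;
    for my $i (1 .. 300_000) {
        my ($key, $value) = ("key:$i", "value:$i");    # synthetic records for the sketch
        $hdb->put($key, $value)
            or warn "put error: " . $hdb->errmsg($hdb->ecode());

        # Re-optimize this shard after every 100K inserts to it.
        if (++$count % 100_000 == 0) {
            $hdb->optimize(4 * $count, -1, -1, $hdb->TLARGE)
                or warn "optimize error: " . $hdb->errmsg($hdb->ecode());
        }
    }

    $hdb->close();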

I would turn off journaling and measure what changes.
The cited thread talks about a 2-3 GB tch file, but if you sum the sizes of your 26 tch files, you are in the same league. For the filesystem, the total range of data being written to should be the relevant parameter, not the size of any single file.
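A quick way to see the change from the Perl side is to time the puts in fixed-size batches; a minimal sketch (the batch size and the demo loop are arbitrary):

    use strict;
    use warnings;
    use Time::HiRes qw(time);

    # Print the insert rate for every batch of 100K calls to tick();
    # call tick() once per put() in the real script.
    my ($n, $t0) = (0, time());
    sub tick {
        return if ++$n % 100_000;
        my $now = time();
        printf "%d inserts, %.0f inserts/min in the last batch\n",
            $n, 100_000 / ($now - $t0) * 60;
        $t0 = $now;
    }

    tick() for 1 .. 300_000;   # demo only; in the real script, tick() follows each put()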

Related

What is the minimum number of shards required for Mongo database to store 1 billion documents?

We need to store 1 billion documents of 1 KB each. Each shard is planned to have 8 GB of RAM. The platform is Red Hat OpenShift on Linux.
Initially we had 10 shards for 300 million documents. We started inserting documents at 2,000 inserts/second. Everything went well until 250 million; after that the inserts slowed down drastically to 300-400 per second.
The queries are also taking a long time (more than 1 minute), even though all of them are covered queries (queries that can be answered from the indexes alone).
Hence we assumed that 20 million documents per shard is the optimal value, and that we would need 50 shards on the current hardware to reach 1 billion.
Is this a reasonable estimate, or can we improve it (fewer shards) by tweaking MongoDB parameters for better performance on the current hardware?
There are two compound indexes and one unique index (on a long). Insertion is done with bulk writes (using the unordered option), 10 threads and 200 records per bulk write, issued in JavaScript directly on the mongos. The shard key is nodeId (the prefix of one compound index), which has cardinality up to 10k. For 300 million documents the total index size comes to 45 GB, 40 GB of which is the two compound indexes. Almost 9,500 chunks are distributed across the 10 nodes. One interesting fact: if I increase RAM to 12 GB, the speed rises to 1,500 inserts/sec. Is RAM the limiting factor?
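For reference, each batch is equivalent to an unordered bulk insert like the following (sketched here with the Perl driver rather than the shell; the URI, names and document shape are illustrative):

    use strict;
    use warnings;
    use MongoDB;

    # Sketch of one unordered batch of 200 ~1 KB documents sent through the mongos.
    # Connection string, database/collection names and fields are illustrative.
    my $client = MongoDB::MongoClient->new(host => 'mongodb://mongos-host:27017');
    my $coll   = $client->get_database('mydb')->get_collection('docs');

    my @batch;
    for my $i (1 .. 200) {
        push @batch, {
            nodeId  => int(rand(10_000)),   # shard key, cardinality up to 10k
            payload => 'x' x 1000,          # ~1 KB body
        };
    }

    # ordered => 0 is the "unordered" option: the server continues past
    # individual write errors instead of stopping the whole batch.
    $coll->insert_many(\@batch, { ordered => 0 });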
Update:
Using the mongostat tool, we found that the flush (fsync) takes more than 55 seconds to complete. The MongoDB cluster runs on Kubernetes on the Red Hat OpenShift platform, on a Dell EMC server with NFS (ext4 disk format). Is the problem that the I/O supports only 2 MB/second? It takes 60 seconds to write 2,000 records per second and another 55 seconds to flush completely to disk (during which all DB operations are blocked).
The disk utilization does not even reach 4%.
Have you tried not sharding at all?
There's a common tendency to shard prematurely. I've seen a MongoDB consultant suggest a rule of thumb of not sharding until your total data size is at least 2 TB. Your 1B documents of 1 KB each should come to around 1 TB. While it's only a rule of thumb, it may be worth trying.
If nothing else, it'll be much simpler to design the database without sharding, and performance will be much more predictable.

PostgreSQL: Drawbacks for large wal_keep_segments?

I would like to keep at least 12 hours' worth of WAL segments around to keep replication going through extended network outages (namely long DR tests that my database is not a part of).
I've estimated that I will need to raise my wal_keep_segments from 64 to 1000+
Are there any drawbacks to doing this other than the space it would require, e.g. performance?
I'm considering the archive option as a backup plan for now.
Apart from the disk space (at the default 16 MB per segment, 1000 segments comes to roughly 16 GB), there is no problem with a high wal_keep_segments setting.

MongoDB Disabling Auto Split

We have a process that bulk-ingests around 6 billion documents of 500 bytes each. We find that performance really drops off around the 4-5 billion document mark, with the lion's share of the time spent splitting chunks, which blocks insertions while it happens.
Does it make sense to increase the chunk size from the default of 64 MB and/or to disable auto-splitting altogether as a temporary measure?
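For reference, both knobs live in the config.settings collection (reachable through a mongos); a sketch of changing them with the Perl driver (host and values are illustrative, and the autosplit flag in config.settings requires a MongoDB version that supports it, 3.4+):

    use strict;
    use warnings;
    use boolean;
    use MongoDB;

    # Sketch: raise the sharding chunk size and pause auto-splitting during a
    # bulk load by writing to config.settings through a mongos. Host and values
    # are illustrative; remember to re-enable auto-splitting after the load.
    my $client   = MongoDB::MongoClient->new(host => 'mongodb://mongos-host:27017');
    my $settings = $client->get_database('config')->get_collection('settings');

    # Chunk size in MB (default is 64).
    $settings->update_one(
        { _id => 'chunksize' },
        { '$set' => { value => 256 } },
        { upsert => 1 },
    );

    # Disable auto-splitting (equivalent to sh.disableAutoSplit() in the shell on 3.4+).
    $settings->update_one(
        { _id => 'autosplit' },
        { '$set' => { enabled => false } },
        { upsert => 1 },
    );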

MongoDB upsert operation blocks inconsistently (with syncdelay set to 0)

There is a database with 9 million rows covering 3 million distinct entities. It is loaded into MongoDB every day using the Perl driver. The first load runs smoothly, but from the second load onward the process slows down considerably and blocks for long stretches every now and then.
I initially assumed this was caused by the automatic flushing to disk every 60 seconds, so I tried setting syncdelay to 0 and I tried the nojournal option. I have indexed the fields that are used for the upserts. I have also observed that the blocking is inconsistent and not always at the same point or on the same input line.
I have 17 GB of RAM and enough hard disk space. I am replicating to two servers with one arbiter. I do not have any significant processes running in the background. Is there a solution or explanation for this blocking?
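The upserts themselves look roughly like this (a sketch with the current Perl driver API; the host, names and document shape are illustrative):

    use strict;
    use warnings;
    use MongoDB;

    # Sketch of the daily load: one upsert per input row, keyed on the entity id.
    # Host, database/collection names and fields are illustrative.
    my $client = MongoDB::MongoClient->new(host => 'mongodb://localhost:27017');
    my $coll   = $client->get_database('mydb')->get_collection('entities');

    # The field used in the upsert filter is indexed.
    $coll->indexes->create_one([ entity_id => 1 ]);

    sub load_row {
        my ($entity_id, $row) = @_;
        $coll->update_one(
            { entity_id => $entity_id },          # indexed filter
            { '$set' => { last_row => $row } },
            { upsert => 1 },                      # insert the entity if it is not there yet
        );
    }

    load_row(42, "example row");   # demo call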
UPDATE: The mongostat tool reports in the 'res' column that around 3.6 GB is used.

Can't map file memory-mongo requires 64 bit build for larger datasets

I have a sharded cluster in 3 systems.
While inserting I get the error message:
cant map file memory-mongo requires 64 bit build for larger datasets
I know that a 32-bit build has a size limit of about 2 GB.
I have two questions to ask.
Is the 2 GB limit per system, so that the total capacity is 6 GB with sharding across 3 systems, or is it still only 2 GB?
Even though sharding is set up properly, why is all the data stored on a single system instead of being distributed across the three sharded systems?
Does sharding play any role in increasing the data size limit?
Does chunk size play any vital role in performance?
I would not recommend doing anything with 32-bit MongoDB beyond running it on a development machine where you perhaps cannot run 64-bit. Once you hit the limit the files become unusable.
The documentation states "Use 64 bit for production. This is important as if you hit the mmap size limit (exact limit varies but less than 2GB) you will be unable to write to the database (analogous to a disk full condition)."
Sharding is all about scaling out your data set across multiple nodes so in answer to your question, yes you have increased the possible size of your data set. Remember though that namespaces and indexes also take up space.
You haven't specified where your mongos resides. Where are you seeing the error - on a mongod or on the mongos? I suspect it's a mongod, which would seem to indicate that all your data is going to that one mongod. I believe you need to look at pre-splitting the chunks - http://docs.mongodb.org/manual/administration/sharding/#splitting-chunks (a sketch follows below).
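Pre-splitting just means issuing split commands against the admin database (through the mongos) before loading, so the chunks, and therefore the writes, spread across the shards from the start. A sketch with the Perl driver, assuming a numeric shard key; the namespace, key name and split points are illustrative:

    use strict;
    use warnings;
    use MongoDB;

    # Sketch: pre-split an empty sharded collection at chosen shard-key values.
    # Namespace, key name and split points are illustrative; run against a mongos.
    my $client = MongoDB::MongoClient->new(host => 'mongodb://mongos-host:27017');
    my $admin  = $client->get_database('admin');

    for my $split_point (map { $_ * 100_000 } 1 .. 9) {
        $admin->run_command([
            'split' => 'mydb.mycoll',
            middle  => { user_id => $split_point },
        ]);
    }

    # The balancer (or explicit moveChunk commands) then distributes the
    # resulting chunks across the shards.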
If you have a mongos, what does sh.status() return? Are the chunks spread across all the mongods?
For testing, I'd recommend a chunk size of 1 MB. In production, it's best to stick with the default of 64 MB unless you have a really important reason not to and you really know what you are doing. If the chunk size is too small, you will be performing splits far too often.