MongoDB through PyMongo taking too much RAM

I have a small DB that I use to train my Python ML models. The data is in a collection with 20 million documents. I access the data using this code:
import pymongo
import pandas as pd
from tqdm import tqdm

myclient = pymongo.MongoClient("mongodb://localhost:27017/")
mydb = myclient['local']
collection = mydb['mydataset']

dataset_df = pd.DataFrame()
for company in tqdm(companies):  # companies: my list of company names
    cursor = collection.find({'company': company})
    dataset_part_df = pd.DataFrame(list(cursor))
    dataset_df = pd.concat([dataset_df, dataset_part_df])
    del cursor
The PC that MongoDB and Python are running on has 32 GB of RAM. MongoDB is 5.0.6 and runs as a service on Windows 10.
When the whole dataset is loaded, Python is using 8 GB of RAM, while "MongoDB Database Server" is using 12 GB of RAM (leaving me with very little free RAM to do any data analysis).
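One thing I have also been considering on the Python side is fetching everything in a single query with a projection, instead of concatenating one DataFrame per company (just a sketch; the projection field names are placeholders for whichever columns the models actually need):
# Sketch: one $in query instead of one query per company, plus a projection
# so that fields the models never use are not pulled out of MongoDB at all.
projection = {'_id': 0}  # e.g. add 'company': 1, 'feature_a': 1, ... to whitelist fields instead
cursor = collection.find(
    {'company': {'$in': companies}},
    projection,
    batch_size=10000,  # stream documents in chunks rather than one huge batch
)
dataset_df = pd.DataFrame(list(cursor))  # build the frame once, no repeated concat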
I have already tried setting these parameters in the MongoDB config file:
storage:
  dbPath: G:\MongoDB\Server\5.0\data
  journal:
    enabled: true
  wiredTiger:
    engineConfig:
      cacheSizeGB: 2
      wiredTigerCacheSizeGb: 2
      directoryForIndexes: true
I have also tried setting "wiredTigerCacheSizeGB" instead of "wiredTigerCacheSizeGb", but nothing works; MongoDB just keeps taking that much RAM.
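To check whether the cacheSizeGB value is being picked up at all, I can read back the cache ceiling the running server reports via serverStatus (a sketch, reusing the client from the code above):
# Sketch: ask the running server what WiredTiger cache limit it is actually using.
status = myclient.admin.command('serverStatus')
max_cache_bytes = status['wiredTiger']['cache']['maximum bytes configured']
print('WiredTiger cache limit: %.1f GB' % (max_cache_bytes / 1024 ** 3))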
Thanks a lot in advance to anyone who can help.

Related

Why is query router storing all of the data in mongodb cluster?

I am trying to set up a MongoDB cluster. I have 1 config server, 1 query router, and 2 mongod instances. Here is my script to set up the cluster:
mongod --configsvr --port 27010 --dbpath ~/mongodb/data1
mongos --configdb localhost:27010 --port 27011
mongod --port 27012 --dbpath ~/mongodb/data2
mongod --port 27013 --dbpath ~/mongodb/data3
sh.addShard("localhost:27012")
sh.addShard("localhost:27013")
sh.enableSharding("tags")
db.tweets.ensureIndex( { _id : "hashed" } )
sh.shardCollection("tags.tweets", { "_id": "hashed" } )
In order to insert the data, I am using this script
import json
import sys
import pymongo

connection = pymongo.MongoClient("mongodb://localhost:27011")
db = connection.tags
tweets = db.tweets

def main(jsonfile):
    f = open(jsonfile)
    for line in f.readlines():
        try:
            tweet_dict = json.loads(line)
            result = tweets.insert_one(tweet_dict)
            print result.inserted_id
        except Exception as e:
            print "Unexpected error:", type(e), e
            sys.exit()
My tweets are getting sharded, but all of the tweets I am trying to insert also seem to be stored in the query router. Is this behaviour expected?
The whole point of a cluster is horizontal scalability (i.e. tweets being split among machines), so having all of the tweets accumulate in the query router seems counter-intuitive.
Can anybody explain why this is happening? Why does the query router have all of the tweets I have inserted?
You ask why your inserted tweets "are also getting stored in query router". The short answer is that the only copy of each document is stored on one of the underlying shard servers, and nothing is stored on the query router. The mongos process is not started with a --dbpath parameter, so it has nowhere to store data.
I set up an environment just like yours and then used a Python script similar to yours to connect to the mongos (aka query router) and insert 200 documents into tags.tweets. Now when I connect to the mongos and count the documents in tags.tweets, it finds 200.
$> mongo --port 27011 tags
mongos> db.tweets.count()
200
However, when I run getShardDistribution, it shows 91 docs on the first shard and 109 docs on the second:
mongos> db.tweets.getShardDistribution()
Shard shard0000 at localhost:27301
data : 18KiB docs : 91 chunks : 2
estimated data per chunk : 9KiB
estimated docs per chunk : 45
Shard shard0001 at localhost:27302
data : 22KiB docs : 109 chunks : 2
estimated data per chunk : 11KiB
estimated docs per chunk : 54
Totals
data : 41KiB docs : 200 chunks : 4
Shard shard0000 contains 45.41% data, 45.5% docs in cluster, avg obj size on shard : 210B
Shard shard0001 contains 54.58% data, 54.5% docs in cluster, avg obj size on shard : 211B
The query router works by passing commands to the underlying shard servers and then combining their responses before returning a result to the caller. The count() of 200 returned above is just the sum of a count() done on each shard.
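You can see this for yourself by connecting straight to the two shard mongod processes and counting the documents there; the per-shard counts add up to the total that mongos reports (a sketch, assuming the ports 27012 and 27013 from your setup script):
# Sketch: count tags.tweets on each shard mongod directly, bypassing mongos.
from pymongo import MongoClient

for port in (27012, 27013):
    shard = MongoClient('localhost', port)
    print(port, shard.tags.tweets.count_documents({}))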
There's a lot more information about using MongoDB sharding for horizontal scalability in the sharding documentation here. You might find the section on metadata helpful for your current issue.

Tuning RelStorage and parameters of PostgreSQL on Plone site

I am getting a POSKeyError many times. I think our PostgreSQL parameters are not set adequately; the system changed its storage from MySQL to PostgreSQL, and the error has appeared many times since the change.
Please let me know which specific settings or other points I should look at.
Using version:
Plone 4.3.1
RelStorage 1.5.1 with PostgreSQL on RDS, AWS
shared-blob-dir true (stored on the filesystem)
Plone Quick Upload 1.8.2
Here are some PostgreSQL tune-ups within postgresql.conf:
# shared_buffers and effective_cache_size should be 30%-50%
# of your machine free memory
shared_buffers = 3GB
effective_cache_size = 2GB
checkpoint_segments = 64
checkpoint_timeout = 1h
max_locks_per_transaction = 512
max_pred_locks_per_transaction = 512
# If you know what you're doing you can uncomment and adjust the following values
#cpu_tuple_cost = 0.0030
#cpu_index_tuple_cost = 0.0001
#cpu_operator_cost = 0.0005
And here they are, as explained by Jens W. Klein:
most important: shared_buffers = 3GB (set it to 30%-50% of your machine's free memory)
checkpoint_segments = 64, checkpoint_timeout = 1h (decreases logging overhead)
max_locks_per_transaction = 512, max_pred_locks_per_transaction = 512 (RelStorage needs lots of them)
effective_cache_size = 4GB (adjust to ~50% of your memory)
just for an import you could disable fsync in the config; then it should be really fast, but don't switch off the machine
CPU tweaks: we didn't touch the default values for these, but if you know what you're doing, go for it. Below are some recommended values:
cpu_tuple_cost = 0.0030, cpu_index_tuple_cost = 0.001, cpu_operator_cost = 0.0005 (query-planning optimizations; the defaults are some years old and current CPUs are faster, so these are better estimates, but I don't know how to get "real" values here)
You should also read https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server
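After PostgreSQL has been restarted you can double-check that the values actually took effect, for example with a small script like this (a sketch; psycopg2 and the connection details from the buildout example below are assumed):
# Sketch: confirm the tuned values are what the running server actually uses.
import psycopg2

conn = psycopg2.connect(host='10.11.12.13', dbname='datafs', user='zope', password='secret')
cur = conn.cursor()
for name in ('shared_buffers', 'effective_cache_size', 'max_locks_per_transaction'):
    cur.execute('SHOW ' + name)  # SHOW takes no bound parameters; names are hardcoded above
    print(name, '=', cur.fetchone()[0])
conn.close()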
And here is our buildout.cfg:
[instance1]
recipe = plone.recipe.zope2instance
rel-storage =
    type postgresql
    host 10.11.12.13
    dbname datafs
    user zope
    password secret
    blob-dir /var/sharedblobstorage/blobs
    blob-cache-size 350MB
    poll-interval 0
    cache-servers 10.11.12.14:11211
    cache-prefix datafs

MongoDB Not Responding on db.collection.createIndex()

Currently I'm using MongoDB for my big data project. I have installed MongoDB on a CentOS 7 server with 32 GB RAM connected to 12 TB of NFS storage. So far, these are my database statistics:
web-analyzer 43.933 GB
web-crawler 109.900 GB
web-crawler2 339.788 GB
The problem is that whenever I run createIndex() on my collection, MongoDB always ends up not responding (I cannot execute db.collection.count() or the 'show dbs' / 'show collections' commands), so I terminate the job using CTRL + C. After that I cannot shut down MongoDB with 'kill pid' or 'mongod --shutdown', so I have to reboot my server. Does anyone know the cause of this problem and how to solve it?
This is my 'top' command output for the mongod service:
Before running the query:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2896 mongod 20 0 0.965t 99220 70964 S 0.3 0.3 0:00.33 mongod
After running createIndex():
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2896 mongod 20 0 0.965t 2.465g 2.411g S 2.7 7.9 0:26.03 mongod
Thank you :)
By default, building an index is a blocking operation which locks the database until it completes. When you already have a lot of data in your collection, this can take a very long time. But you can build an index in the background by using the background: true option.
db.collection.createIndex({ keyfield: 1, otherkeyfield: 1 }, { background: true });
It will take longer overall, but the database will keep responding in the meantime.
For more information about creating indexes, please refer to the documentation.
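If you are building the index from Python with PyMongo rather than from the shell, the equivalent call looks roughly like this (a sketch; the database, collection, and field names are placeholders):
# Sketch: the same background index build via PyMongo; names are placeholders.
from pymongo import ASCENDING, MongoClient

collection = MongoClient('mongodb://localhost:27017/')['mydb']['mycollection']
collection.create_index([('keyfield', ASCENDING), ('otherkeyfield', ASCENDING)], background=True)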

How to increase connections in mongodb

I start MongoDB by running a script, startmongo.sh, which starts all of the below in order:
./mongodb1.sh
./mongodb2.sh
./mongod3_arbiter.sh
mongodb1.sh, mongodb2.sh, and mongodb3_arbiter.sh contain, respectively:
mongod --config mongod1.conf
mongod --config mongod2.conf
mongod --config mongod3_arbiter.conf
I want to increase the connection limit to 10000, so I wanted to set ulimit -n 10000.
My question is: do I need to specify this setting in all of the above conf files?
Right now the conf file consists of:
replSet = test
fork = true
port = 27017
dbpath = /mongologs/mongodb3
logpath = /mongologs/mongo/mongodb3
rest = true
Please let me know, thanks in advance.
Hi friend, I think you need to set maxPoolSize.
Please have a look at the MongoDB docs, where it is described like this:
uri.maxPoolSize
The maximum number of connections in the connection pool. The default value is 100.
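From a driver such as PyMongo, for example, you could pass it in the connection string (a sketch; the value 500 is arbitrary):
# Sketch: raise the driver-side connection pool limit via the URI (default is 100).
from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017/?maxPoolSize=500')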
These are set in /etc/limits.conf (Debian-based) or /etc/security/limits.conf (Red Hat-based), depending on which Linux distribution you have.
You are looking for the nofile attribute.
<domain> <type> <item> <value>
* soft nofile 10000
* hard nofile 10000

percona won't start after uncommenting innodb lines in my.cnf

I'm running Percona 5.5 on CentOS 6.3. I'm using the prepackaged "huge" my.cnf which ships with Percona; it matches my server specs pretty well. My database uses InnoDB tables. Reading the my.cnf file, there is a section pertaining to InnoDB:
# Uncomment the following if you are using InnoDB tables
innodb_data_home_dir = /var/lib/mysql
#innodb_data_file_path = ibdata1:2000M;ibdata2:10M:autoextend
#innodb_log_group_home_dir = /var/lib/mysql
# You can set .._buffer_pool_size up to 50 - 80 %
# of RAM but beware of setting memory usage too high
#innodb_buffer_pool_size = 384M
#innodb_additional_mem_pool_size = 20M
# Set .._log_file_size to 25 % of buffer pool size
#innodb_log_file_size = 100M
#innodb_log_buffer_size = 8M
#innodb_flush_log_at_trx_commit = 1
#innodb_lock_wait_timeout = 50
I uncommented the above lines (obviously leaving the actual comments commented), restarted Percona, and got the following message:
Starting MySQL (Percona Server). ERROR! The server quit without updating PID file (/var/lib/mysql/rmdb.pid).
I'm new to managing a database server. What benefit is there to uncommenting the above lines, and why does MySQL crash when I restart after doing so?
Thanks