mongoDB: group by failing while querying from pymongo

Here is what I am doing:
>>> import pymongo
>>> con = pymongo.Connection('localhost',12345)
>>> db = con['staging']
>>> coll = db['contract']
>>> result = coll.group(['asset_id'], None, {'list': []}, 'function(obj, prev) {prev.list.push(obj)}')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "build/bdist.macosx-10.3-fat/egg/pymongo/collection.py", line 908, in group
File "build/bdist.macosx-10.3-fat/egg/pymongo/database.py", line 340, in command
File "build/bdist.macosx-10.3-fat/egg/pymongo/helpers.py", line 126, in _check_command_response
pymongo.errors.OperationFailure: command SON([('group', {'$reduce': Code('function(obj, prev) {prev.list.push(obj)}', {}), 'ns': u'contract', 'cond': None, 'key': {'asset_id': 1}, 'initial': {'list': []}})]) failed: exception: BufBuilder grow() > 64MB
and what I see on mongod logs is following
Wed Nov 16 16:05:55 [conn209] Assertion: 13548:BufBuilder grow() > 64MB 0x10008de9b 0x100008d89 0x100151e72 0x100152712 0x100151954 0x100152712 0x100151954 0x100152712 0x100152e7b 0x100152f0c 0x10013b1d9 0x1003706bf 0x10037204c 0x10034c4d6 0x10034d877 0x100180cc4 0x100184649 0x1002b9e89 0x1002c3f18 0x100433888
0 mongod 0x000000010008de9b _ZN5mongo11msgassertedEiPKc + 315
1 mongod 0x0000000100008d89 _ZN5mongo10BufBuilder15grow_reallocateEv + 73
2 mongod 0x0000000100151e72 _ZN5mongo9Convertor6appendERNS_14BSONObjBuilderESslNS_8BSONTypeERKNS_13TraverseStackE + 2962
3 mongod 0x0000000100152712 _ZN5mongo9Convertor8toObjectEP8JSObjectRKNS_13TraverseStackE + 1682
4 mongod 0x0000000100151954 _ZN5mongo9Convertor6appendERNS_14BSONObjBuilderESslNS_8BSONTypeERKNS_13TraverseStackE + 1652
5 mongod 0x0000000100152712 _ZN5mongo9Convertor8toObjectEP8JSObjectRKNS_13TraverseStackE + 1682
6 mongod 0x0000000100151954 _ZN5mongo9Convertor6appendERNS_14BSONObjBuilderESslNS_8BSONTypeERKNS_13TraverseStackE + 1652
7 mongod 0x0000000100152712 _ZN5mongo9Convertor8toObjectEP8JSObjectRKNS_13TraverseStackE + 1682
8 mongod 0x0000000100152e7b _ZN5mongo9Convertor8toObjectEl + 139
9 mongod 0x0000000100152f0c _ZN5mongo7SMScope9getObjectEPKc + 92
10 mongod 0x000000010013b1d9 _ZN5mongo11PooledScope9getObjectEPKc + 25
11 mongod 0x00000001003706bf _ZN5mongo12GroupCommand5groupESsRKSsRKNS_7BSONObjES3_SsSsPKcS3_SsRSsRNS_14BSONObjBuilderE + 3551
12 mongod 0x000000010037204c _ZN5mongo12GroupCommand3runERKSsRNS_7BSONObjERSsRNS_14BSONObjBuilderEb + 3676
13 mongod 0x000000010034c4d6 _ZN5mongo11execCommandEPNS_7CommandERNS_6ClientEiPKcRNS_7BSONObjERNS_14BSONObjBuilderEb + 1350
14 mongod 0x000000010034d877 _ZN5mongo12_runCommandsEPKcRNS_7BSONObjERNS_10BufBuilderERNS_14BSONObjBuilderEbi + 2151
15 mongod 0x0000000100180cc4 _ZN5mongo11runCommandsEPKcRNS_7BSONObjERNS_5CurOpERNS_10BufBuilderERNS_14BSONObjBuilderEbi + 52
16 mongod 0x0000000100184649 _ZN5mongo8runQueryERNS_7MessageERNS_12QueryMessageERNS_5CurOpES1_ + 10585
17 mongod 0x00000001002b9e89 _ZN5mongo13receivedQueryERNS_6ClientERNS_10DbResponseERNS_7MessageE + 569
18 mongod 0x00000001002c3f18 _ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_8SockAddrE + 1528
19 mongod 0x0000000100433888 _ZN5mongo10connThreadEPNS_13MessagingPortE + 616
Wed Nov 16 16:05:55 [conn209] query staging.$cmd ntoreturn:1 command: { group: { $reduce: CodeWScope( function(obj, prev) {prev.list.push(obj)}, {}), ns: "contract", cond: null, key: { asset_id: 1 }, initial: { list: {} } } } reslen:111 1006ms
I am very new to both pymongo and MongoDB, and don't know how to resolve this. Please help.
Thank you

The relevant part of your stacktrace is:
exception: BufBuilder grow() > 64MB
Basically, MongoDB won't let the result document it builds for group grow past 64MB. See this question for more details (the size limit has been bumped to 64MB since then).
I'm not sure what you're trying to do with that query. It sort of looks like you want to get a list of objects for each asset_id. However, your result is going to grow past capacity because you're never differentiating between objects in your group. Try setting your initial to {'asset_id': '', 'objects': []} and your reduce to function(obj, prev) {prev.asset_id = obj.asset_id; prev.objects.push(obj)}, although there are much more efficient ways of doing this query.
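For reference, a rough pymongo sketch of that adjusted group() call (untested; it keeps your field and collection names, and since it still pushes whole documents into each group, a large collection can still hit the 64MB cap):
result = coll.group(
    ['asset_id'],                      # key to group on
    None,                              # no condition: group over the whole collection
    {'asset_id': '', 'objects': []},   # per-group accumulator
    'function(obj, prev) {'
    '  prev.asset_id = obj.asset_id;'
    '  prev.objects.push(obj);'
    '}'
)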
Alternatively, if you're trying to get all the documents matching an ID, try:
coll.find({'asset_id': whatevs})
If you're trying to get a count of the objects, try this instead:
coll.group(
    ['asset_id'], None, {'asset_id': '', 'count': 0},
    'function(obj, prev) {prev.asset_id = obj.asset_id; prev.count += 1}'
)
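As one of those more efficient routes, a hedged sketch using the aggregation framework (assuming your server and driver are new enough to support it, roughly MongoDB 2.2+ / pymongo 2.3+; the return type differs between pymongo versions):
# Count documents per asset_id on the server side, without building one huge
# JavaScript result object.
pipeline = [{'$group': {'_id': '$asset_id', 'count': {'$sum': 1}}}]
result = coll.aggregate(pipeline)   # dict with a 'result' key on pymongo 2.x, a cursor on 3.x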

Related

Why are my wildcard attributes not being filled in Snakemake?

I am following the tutorial in the documentation (https://snakemake.readthedocs.io/en/stable/tutorial/advanced.html) and have been stuck on the "Step 4: Rule parameter" exercise. I would like to access a float from my config file using a wildcard in my params directive.
I seem to be getting the same error whenever I run snakemake -np in the command line:
InputFunctionException in line 46 of /mnt/c/Users/Matt/Desktop/snakemake-tutorial/Snakefile:
Error:
AttributeError: 'Wildcards' object has no attribute 'sample'
Wildcards:
Traceback:
File "/mnt/c/Users/Matt/Desktop/snakemake-tutorial/Snakefile", line 14, in get_bcftools_call_priors
This is my code so far
import time

configfile: "config.yaml"

rule all:
    input:
        "plots/quals.svg"

def get_bwa_map_input_fastqs(wildcards):
    print(wildcards.__dict__, 1, time.time())  # I have this print as a check
    return config["samples"][wildcards.sample]

def get_bcftools_call_priors(wildcards):
    print(wildcards.__dict__, 2, time.time())  # I have this print as a check
    return config["prior_mutation_rates"][wildcards.sample]

rule bwa_map:
    input:
        "data/genome.fa",
        get_bwa_map_input_fastqs
        #lambda wildcards: config["samples"][wildcards.sample]
    output:
        "mapped_reads/{sample}.bam"
    params:
        rg=r"#RG\tID:{sample}\tSM:{sample}"
    threads: 2
    shell:
        "bwa mem -R '{params.rg}' -t {threads} {input} | samtools view -Sb - > {output}"

rule samtools_sort:
    input:
        "mapped_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam"
    shell:
        "samtools sort -T sorted_reads/{wildcards.sample} "
        "-O bam {input} > {output}"

rule samtools_index:
    input:
        "sorted_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam.bai"
    shell:
        "samtools index {input}"

rule bcftools_call:
    input:
        fa="data/genome.fa",
        bam=expand("sorted_reads/{sample}.bam", sample=config["samples"]),
        bai=expand("sorted_reads/{sample}.bam.bai", sample=config["samples"])
        #prior=get_bcftools_call_priors
    params:
        prior=get_bcftools_call_priors
    output:
        "calls/all.vcf"
    shell:
        "samtools mpileup -g -f {input.fa} {input.bam} | "
        "bcftools call -P {params.prior} -mv - > {output}"

rule plot_quals:
    input:
        "calls/all.vcf"
    output:
        "plots/quals.svg"
    script:
        "scripts/plot-quals.py"
and here is my config.yaml
samples:
    A: data/samples/A.fastq
    #B: data/samples/B.fastq
    #C: data/samples/C.fastq

prior_mutation_rates:
    A: 1.0e-4
    #B: 1.0e-6
I don't understand why my input function call in bcftools_call says that the wildcards object is empty of attributes, yet an almost identical function call in bwa_map has the sample attribute that I want. From the documentation it seems like the wildcards would be propagated before anything is run, so why are they missing?
This is the full output of the commandline call snakemake -np:
{'_names': {'sample': (0, None)}, '_allowed_overrides': ['index', 'sort'], 'index': functools.partial(<function Namedlist._used_attribute at 0x7f91b1a58f70>, _name='index'), 'sort': functools.partial(<function Namedlist._used_attribute at 0x7f91b1a58f70>, _name='sort'), 'sample': 'A'} 1 1628877061.8831172
Job stats:
job count min threads max threads
-------------- ------- ------------- -------------
all 1 1 1
bcftools_call 1 1 1
bwa_map 1 1 1
plot_quals 1 1 1
samtools_index 1 1 1
samtools_sort 1 1 1
total 6 1 1
[Fri Aug 13 10:51:01 2021]
rule bwa_map:
input: data/genome.fa, data/samples/A.fastq
output: mapped_reads/A.bam
jobid: 4
wildcards: sample=A
resources: tmpdir=/tmp
bwa mem -R '#RG\tID:A\tSM:A' -t 1 data/genome.fa data/samples/A.fastq | samtools view -Sb - > mapped_reads/A.bam
[Fri Aug 13 10:51:01 2021]
rule samtools_sort:
input: mapped_reads/A.bam
output: sorted_reads/A.bam
jobid: 3
wildcards: sample=A
resources: tmpdir=/tmp
samtools sort -T sorted_reads/A -O bam mapped_reads/A.bam > sorted_reads/A.bam
[Fri Aug 13 10:51:01 2021]
rule samtools_index:
input: sorted_reads/A.bam
output: sorted_reads/A.bam.bai
jobid: 5
wildcards: sample=A
resources: tmpdir=/tmp
samtools index sorted_reads/A.bam
[Fri Aug 13 10:51:01 2021]
rule bcftools_call:
input: data/genome.fa, sorted_reads/A.bam, sorted_reads/A.bam.bai
output: calls/all.vcf
jobid: 2
resources: tmpdir=/tmp
{'_names': {}, '_allowed_overrides': ['index', 'sort'], 'index': functools.partial(<function Namedlist._used_attribute at 0x7f91b1a58f70>, _name='index'), 'sort': functools.partial(<function Namedlist._used_attribute at 0x7f91b1a58f70>, _name='sort')} 2 1628877061.927639
InputFunctionException in line 46 of /mnt/c/Users/Matt/Desktop/snakemake-tutorial/Snakefile:
Error:
AttributeError: 'Wildcards' object has no attribute 'sample'
Wildcards:
Traceback:
File "/mnt/c/Users/Matt/Desktop/snakemake-tutorial/Snakefile", line 14, in get_bcftools_call_priors
If anyone knows what is going wrong I would really appreciate an explanation. Also, if there is a better way of getting information out of config.yaml into the different directives, I would appreciate those tips as well.
Edit:
I have searched around the internet quite a bit, but have yet to understand this issue.
Wildcards for each rule are based on that rule's output file(s). The rule bcftools_call has one output file (calls/all.vcf), which has no wildcards. Because of this, when get_bcftools_call_priors is called, it throws an exception when it tries to access the unset wildcards.sample attribute.
You should probably set a global prior_mutation_rate in your config file and then access that in the bcftools_call rule:
rule bcftools_call:
    ...
    params:
        prior=config["prior_mutation_rate"],
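A minimal sketch of the matching config.yaml, assuming a single global rate is what you need (the key name prior_mutation_rate is just a suggestion):
samples:
    A: data/samples/A.fastq

prior_mutation_rate: 1.0e-4   # one global rate instead of the per-sample mapping
If you really do need per-sample priors, they would have to be looked up in a rule whose output carries the {sample} wildcard, since that is where wildcards.sample gets its value.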

ServerSelectionTimeoutError: Time out error when connecting to Atlas via pymongo

I'm trying to connect to my Atlas MongoDB database through pymongo. I make the connection and run a basic query just to count the documents, and it times out.
I can run the same connection string on my personal Linux machine (including from a clean Docker container), but could not get it working from the Mac I use for work (neither could my colleagues, and a clean Docker image on that machine didn't work either). If it matters, I'm running pymongo 3.8, installed with pip install pymongo[tls]. I tried downgrading, and also pip install pymongo[tls,srv].
Personal guesses: something to do with a proxy/firewall blocking the connection, maybe? I checked whether the port was open. On the server I whitelisted 0.0.0.0/0, so that shouldn't be the issue.
import pymongo
client = pymongo.MongoClient("mongodb+srv://whatever:yep@cluster0-xxxxx.mongodb.net/test?retryWrites=true")
client.test.matches.count_documents({}) # this blocks and then errors
I get the following error
/usr/local/lib/python3.7/site-packages/pymongo/collection.py in count_documents(self, filter, session, **kwargs)
1693 collation = validate_collation_or_none(kwargs.pop('collation', None))
1694 cmd.update(kwargs)
-> 1695 with self._socket_for_reads(session) as (sock_info, slave_ok):
1696 result = self._aggregate_one_result(
1697 sock_info, slave_ok, cmd, collation, session)
/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/contextlib.py in __enter__(self)
110 del self.args, self.kwds, self.func
111 try:
--> 112 return next(self.gen)
113 except StopIteration:
114 raise RuntimeError("generator didn't yield") from None
/usr/local/lib/python3.7/site-packages/pymongo/mongo_client.py in _socket_for_reads(self, read_preference)
1133 topology = self._get_topology()
1134 single = topology.description.topology_type == TOPOLOGY_TYPE.Single
-> 1135 server = topology.select_server(read_preference)
1136
1137 with self._get_socket(server) as sock_info:
/usr/local/lib/python3.7/site-packages/pymongo/topology.py in select_server(self, selector, server_selection_timeout, address)
224 return random.choice(self.select_servers(selector,
225 server_selection_timeout,
--> 226 address))
227
228 def select_server_by_address(self, address,
/usr/local/lib/python3.7/site-packages/pymongo/topology.py in select_servers(self, selector, server_selection_timeout, address)
182 with self._lock:
183 server_descriptions = self._select_servers_loop(
--> 184 selector, server_timeout, address)
185
186 return [self.get_server_by_address(sd.address)
/usr/local/lib/python3.7/site-packages/pymongo/topology.py in _select_servers_loop(self, selector, timeout, address)
198 if timeout == 0 or now > end_time:
199 raise ServerSelectionTimeoutError(
--> 200 self._error_message(selector))
201
202 self._ensure_opened()
ServerSelectionTimeoutError: cluster0-shard-00-01-eflth.mongodb.net:27017: timed out,cluster0-shard-00-00-eflth.mongodb.net:27017: timed out,cluster0-shard-00-02-eflth.mongodb.net:27017: timed out
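For what it's worth, a minimal sketch of the kind of port check described in the question, assuming the shard hostnames taken from the error message above (substitute your own cluster's nodes):
import socket

# Hosts copied from the ServerSelectionTimeoutError; replace with your own cluster's nodes.
hosts = [
    'cluster0-shard-00-00-eflth.mongodb.net',
    'cluster0-shard-00-01-eflth.mongodb.net',
    'cluster0-shard-00-02-eflth.mongodb.net',
]

for host in hosts:
    try:
        # Plain TCP connection to the MongoDB port with a short timeout.
        sock = socket.create_connection((host, 27017), timeout=5)
        sock.close()
        print(host, 'reachable')
    except OSError as exc:
        # A timeout or refusal here points at a firewall/proxy rather than pymongo itself.
        print(host, 'NOT reachable:', exc)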

mongo: update $push failed with "Resulting document after update is larger than 16777216"

I want to extend a large array using the update(.. $push ..) operation.
Here are the details:
I have a large collection 'A' with many fields. Amongst the fields, I want to extract the values of the 'F' field, and transfer them into one large array stored inside one single field of a document in collection 'B'.
I split the process into steps (to limit the memory used)
Here is the python program:
...
steps = 1000  # number of steps
step = 10000  # each step will handle this number of documents
start = 0
for j in range(steps):
    print('step:', j, 'start:', start)
    project = {'$project': {'_id': 0, 'F': 1}}
    skip = {'$skip': start}
    limit = {'$limit': step}
    cursor = A.aggregate([skip, limit, project], allowDiskUse=True)
    a = []
    for i, o in enumerate(cursor):
        value = o['F']
        a.append(value)
    print('len:', len(a))
    B.update({'_id': 1}, {'$push': {'v': {'$each': a}}})
    start += step
Here is the output of this program:
step: 0 start: 0
step: 1 start: 100000
step: 2 start: 200000
step: 3 start: 300000
step: 4 start: 400000
step: 5 start: 500000
step: 6 start: 600000
step: 7 start: 700000
step: 8 start: 800000
step: 9 start: 900000
step: 10 start: 1000000
Traceback (most recent call last):
File "u_psfFlux.py", line 109, in <module>
lsst[k].update( {'_id': 1}, { '$push': {'v' : { '$each': a } } } )
File "/home/ubuntu/.local/lib/python3.5/site-packages/pymongo/collection.py", line 2503, in update
collation=collation)
File "/home/ubuntu/.local/lib/python3.5/site-packages/pymongo/collection.py", line 754, in _update
_check_write_command_response([(0, result)])
File "/home/ubuntu/.local/lib/python3.5/site-packages/pymongo/helpers.py", line 315, in _check_write_command_response
raise WriteError(error.get("errmsg"), error.get("code"), error)
pymongo.errors.WriteError: Resulting document after update is larger than 16777216
Apparently the $push operation has to fetch the complete array!!! (My expectation was that this operation would always need the same amount of memory, since we always append the same number of values to the array.)
In short, I don't understand why the update/$push operation fails with this error...
Or... is there a way to avoid this unneeded buffering?
Thanks for your suggestion
Christian
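As a hedged aside: the 16777216 bytes in the error is the 16MB BSON document limit, so one common way to stay clear of it is to write each batch to its own "bucket" document instead of pushing everything into a single array. A rough sketch, reusing the loop's names (B, a and j are the target collection, batch list and step counter from the program above; insert_one() assumes pymongo 3.x, older drivers use B.insert(...)):
# Sketch only: one document per batch, so no single document approaches the 16MB limit.
B.insert_one({'_id': j, 'v': a})
Reading the values back then means iterating B.find().sort('_id') and concatenating the 'v' arrays.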

Write performance is decreased when collection is increased

My problem: MongoDB poor write performance on a large collection
Begin: performance is 70,000 docs/sec (empty collection)
Over time, my collection grows larger.
End: performance is < 10,000 docs/sec (0.4 billion docs in the collection)
There are my records below:
[chart] Write Performance with Time
[chart] Write Performance with Collection
Then I created a new collection in the same database
and wrote to both collections at the same time.
The large collection (ns: "rf1.case1") is still slow,
but the new collection (ns: "rf1.case2") is pretty fast.
I have already read MongoDB poor write performance on large collections with 50.000.000 documents plus, but I still have no idea!
Is my configuration wrong?
All 3 servers have the same hardware specification:
CPU: 8 core
Memory: 32GB
Disk: 2TB HDD (7200rpm)
My Scenario:
There are 3 shards (each a replica set):
Server1 : primary(sh01) + secondary(sh03) + arbiter(sh02) + configsvr + mongos
Server2 : primary(sh02) + secondary(sh01) + arbiter(sh03) + configsvr + mongos
Server3 : primary(sh03) + secondary(sh02) + arbiter(sh01) + configsvr + mongos
Mongod sample:
/usr/local/mongodb/bin/mongod --quiet --port 20001 --dbpath $s1 --logpath $s1/s1.log --replSet sh01 --shardsvr --directoryperdb --fork --storageEngine wiredTiger --wiredTigerCollectionBlockCompressor snappy --wiredTigerCacheSizeGB 8
Two collections:
(chunksize = 64MB)
rf1.case1
shard key: { "_id" : "hashed" }
chunks:
host1 385
host2 401
host3 367
too many chunks to print, use verbose if you want to force print
rf1.case2
shard key: { "_id" : "hashed" }
chunks:
host1 11
host2 10
host3 10
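Purely as a hedged diagnostic sketch (standard MongoDB commands via pymongo; the port is taken from the mongod invocation above): one common cause of this pattern is the large collection's indexes outgrowing the 8GB WiredTiger cache, which collStats can show:
import pymongo

# Connect to one shard's mongod directly (port 20001 as in the sample command above).
client = pymongo.MongoClient('localhost', 20001)
stats = client['rf1'].command('collstats', 'case1')
print('totalIndexSize:', stats['totalIndexSize'], 'bytes')
print('storageSize:   ', stats['storageSize'], 'bytes')
If totalIndexSize is well past the configured cache, inserts start paging index data from the 7200rpm disks, which would be consistent with the slowdown described.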

How do I get mongoimport to work with complex json data?

Trying to use the built-in mongoimport utility with MongoDB...
I might be blind but is there a way to import complex json data? For instance, say I need to import instances of the following object: { "bob": 1, "dog": [ 1, 2, 3 ], "beau": { "won": "ton", "lose": 3 } }.
I'm trying the following and it looks like it loads everything into memory but nothing actually gets imported into the db:
$ mongoimport -d test -c testdata -vvvv -file ~/Downloads/jsondata.json
connected to: 127.0.0.1
Tue Aug 10 17:38:38 ns: test.testdata
Tue Aug 10 17:38:38 filesize: 69
Tue Aug 10 17:38:38 got line:{ "bob": 1, "dog": [ 1, 2, 3 ], "beau": { "won": "ton", "lose": 3 } }
imported 0 objects
Any ideas on how to get the json data to actually import into the db?
I did some testing and it looks like you need to have an end-of-line character at the end of the file. Without the end-of-line character the last line is read, but isn't imported.
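Based on that observation, a small sketch of a workaround: append a trailing newline if the file doesn't end with one (the path is the example file from the question):
path = 'jsondata.json'          # e.g. ~/Downloads/jsondata.json from the question
with open(path, 'rb+') as f:
    f.seek(-1, 2)               # look at the last byte of the file
    if f.read(1) != b'\n':      # no end-of-line character at the end?
        f.write(b'\n')          # add one so mongoimport also imports the last line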