mongo: update $push failed with "Resulting document after update is larger than 16777216"

I want to extend a large array using the update(.. $push ..) operation.
Here are the details:
I have a large collection 'A' with many fields. Among the fields, I want to extract the values of the 'F' field and transfer them into one large array stored inside a single field of a document in collection 'B'.
I split the process into steps (to limit the memory used).
Here is the Python program:
...
steps = 1000   # number of steps
step = 10000   # each step will handle this number of documents
start = 0

for j in range(steps):
    print('step:', j, 'start:', start)

    project = {'$project': {'_id': 0, 'F': 1}}
    skip = {'$skip': start}
    limit = {'$limit': step}

    cursor = A.aggregate([skip, limit, project], allowDiskUse=True)

    a = []
    for i, o in enumerate(cursor):
        value = o['F']
        a.append(value)
    print('len:', len(a))

    B.update({'_id': 1}, {'$push': {'v': {'$each': a}}})
    start += step
Here is the output of this program:
step: 0 start: 0
step: 1 start: 100000
step: 2 start: 200000
step: 3 start: 300000
step: 4 start: 400000
step: 5 start: 500000
step: 6 start: 600000
step: 7 start: 700000
step: 8 start: 800000
step: 9 start: 900000
step: 10 start: 1000000
Traceback (most recent call last):
  File "u_psfFlux.py", line 109, in <module>
    lsst[k].update( {'_id': 1}, { '$push': {'v' : { '$each': a } } } )
  File "/home/ubuntu/.local/lib/python3.5/site-packages/pymongo/collection.py", line 2503, in update
    collation=collation)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/pymongo/collection.py", line 754, in _update
    _check_write_command_response([(0, result)])
  File "/home/ubuntu/.local/lib/python3.5/site-packages/pymongo/helpers.py", line 315, in _check_write_command_response
    raise WriteError(error.get("errmsg"), error.get("code"), error)
pymongo.errors.WriteError: Resulting document after update is larger than 16777216
Apparently the $push operation has to fetch the complete array! (My expectation was that the operation would always need the same amount of memory, since we always append the same number of values to the array.)
In short, I don't understand why the update/$push operation fails with this error...
Or... is there a way to avoid this unneeded buffering?
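A sketch of one possible workaround, assuming the values may live in several bucket documents (one per step, keyed by the step counter j) rather than a single ever-growing document:

# Hypothetical bucketing sketch: one document per step keeps each
# document far below the 16 MB BSON limit.
B.update({'_id': j},                         # bucket key derived from the step counter
         {'$push': {'v': {'$each': a}}},
         upsert=True)

Reading the values back would then mean iterating the buckets in _id order rather than reading one single array.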
Thanks for your suggestion
Christian

Related

Does the VSCode problem matcher strip ANSI escape sequences before matching?

I'm making a custom task to run Perl unit tests with yath. The output of that command contains details about failed tests, which I would like to filter and display as problems.
I've written the following matcher for my output.
"problemMatcher": {
"owner": "yath",
"fileLocation": [ "relative", "${workspaceFolder}" ],
"severity": "error",
"pattern": [
{
"regexp": "\\[\\s*FAIL\\s*\\]\\s*job\\s*\\d+\\s*\\+?\\s*(.+)",
"message": 1,
},{
"regexp": "\\(\\s*DIAG\\s*\\)\\s*job\\s*\\d+\\s*\\+?\\s*at (.+) line (\\d+)\\.",
"file": 1,
"line": 2
}
]
}
This is supposed to match two different lines in the following output, which I will present as code for copying, and as a screenshot.
** Defaulting to the 'test' command **
( LAUNCH ) job 1 t/foo.t
( NOTE ) job 1 Seeded srand with seed '20220414' from local date.
[ PASS ] job 1 + passing test
[ FAIL ] job 1 + failing test
( DIAG ) job 1 Failed test 'failing test'
( DIAG ) job 1 at t/foo.t line 57.
[ PLAN ] job 1 Expected assertions: 2
( FAILED ) job 1 t/foo.t
( TIME ) job 1 Startup: 0.30841s | Events: 0.01992s | Cleanup: 0.00417s | Total: 0.33250s
< REASON > job 1 Test script returned error (Err: 1)
< REASON > job 1 Assertion failures were encountered (Count: 1)
The following jobs failed:
+--------------------------------------+-----------------------------------+
| Job ID | Test File |
+--------------------------------------+-----------------------------------+
| e7aee661-b49f-4b60-b815-f420d109457a | t/foo.t |
+--------------------------------------+-----------------------------------+
Yath Result Summary
-----------------------------------------------------------------------------------
Fail Count: 1
File Count: 1
Assertion Count: 2
Wall Time: 0.74 seconds
CPU Time: 0.76 seconds (usr: 0.20s | sys: 0.00s | cusr: 0.49s | csys: 0.07s)
CPU Usage: 103%
--> Result: FAILED <--
But in the terminal it's actually pretty, with colours.
I suspect there are ANSI escape sequences in this output. I could pass a flag to yath to make it not print colours, but I would like to be able to read this output as well, so that isn't ideal.
Do I have to change my pattern to match the escape sequences (I can read the source of the program that prints them, but it's annoying), or are they in fact stripped out and my pattern is wrong, but I can't see where?
Here's the first pattern as a regex101 match, and here's the second.
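For reference, a small Python sketch of what the colour codes probably look like and how to strip them (the regex is a common pattern for ANSI CSI sequences, not something taken from yath or VS Code):

import re

# A typical ANSI CSI sequence looks like "\x1b[31m" (set colour) or "\x1b[0m" (reset).
ANSI_CSI = re.compile(r'\x1b\[[0-9;]*[A-Za-z]')

line = '\x1b[31m[  FAIL  ]\x1b[0m  job  1    + failing test'
print(ANSI_CSI.sub('', line))
# -> [  FAIL  ]  job  1    + failing test

If the colour sequences are interleaved with the tokens (for example inside the brackets), a pattern written against the plain text will not match them.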

Cannot print timestamp from mongodb oplog using pymongo

I have the following code:
import time
import pymongo
from pymongo import MongoClient

connection = MongoClient('core.mongo.com', 27017)
db = connection['admin']

first = db.oplog.rs.find().sort('$natural', pymongo.DESCENDING).limit(-1).next()
ts = first['ts']

while True:
    cursor = db.oplog.find({'ts': {'$gt': ts}}, tailable=True, await_data=True)
    while cursor.alive:
        for doc in cursor:
            ts = doc['ts']
        time.sleep(1)
I get:
Traceback (most recent call last):
  File "tail.py", line 25, in <module>
    ts = first['ts']
  File "/Library/Python/2.7/site-packages/pymongo/cursor.py", line 569, in __getitem__
    "instances" % index)
TypeError: index 'ts' cannot be applied to Cursor instances
How am I supposed to get the latest time-stamp from the oplog of the mongo database?
The following code gives me the last operation on database_name.collection_name:
connection = MongoClient('core.mongo.com', 27017)
db = connection['admin']

oplog_str = str(connection.local.oplog.rs)
print oplog_str

new_query = {'ns': {'$in': ['database_name.collection_name']}}
curr = connection.local.oplog.rs.find(new_query).sort('$natural', pymongo.DESCENDING).limit(-1)

for doc_count, doc in enumerate(curr):
    current_time_stamp = doc['ts'].time
    good_date = datetime.datetime.fromtimestamp(current_time_stamp).ctime()
    print doc_count, good_date
If you want the operation irrespective of the database and collection, just remove new_query from curr.
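For comparison, a sketch of the same idea against the later pymongo 3.x API (CursorType instead of the tailable/await_data keyword arguments; host and port are the ones from the question):

import pymongo
from pymongo import MongoClient

client = MongoClient('core.mongo.com', 27017)
oplog = client.local.oplog.rs

# Newest entry first; its timestamp is the starting point.
last = oplog.find().sort('$natural', pymongo.DESCENDING).limit(1).next()
ts = last['ts']                        # a bson.Timestamp

# Tail everything newer than ts.
cursor = oplog.find({'ts': {'$gt': ts}},
                    cursor_type=pymongo.CursorType.TAILABLE_AWAIT)
for doc in cursor:
    print(doc['ts'].as_datetime(), doc['op'], doc['ns'])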

Difference between 2 strings

I want to compare some strings like this:
Previous -> Present
Something like
path 1 : 100 -> 112 --> 333 --> 500
path 2 : 100 -> 333 --> 500
path 3 : 100 -> 333 --> 500 --> 500
path 4 : 100 -> 112 --> 500
I need to compare path 1 with path 2, get the number that was in path 1 but doesn't exist in path 2, and store it in a database.
Then compare path 2 with path 3 and do the same thing. If the number already exists, increment it; otherwise insert the new number.
I know how to insert into a database and increment if the entry exists. What I don't know is how to loop through all those paths getting those values then deciding whether to insert into the database.
I have done some research, and I have heard of Levenshtein Edit Distance but I can't figure out how I should do it.
Your question appears to be:
Given two lists of numbers, how can I tell which ones in list A aren't in list B?
Hashes are useful for doing set arithmetic.
my @a = ( 100, 112, 333, 500 );
my @b = ( 100, 333, 500 );

my %b = map { $_ => 1 } @b;
my @missing = grep { !$b{$_} } @a;
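The same set arithmetic in Python, for comparison (just a sketch using the numbers from the example paths):

a = [100, 112, 333, 500]   # path 1
b = [100, 333, 500]        # path 2

in_b = set(b)
missing = [x for x in a if x not in in_b]
print(missing)             # [112]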

MongoDB Assertion Error: starting_from == self.__retrieved (pymongo driver)

MongoDB Question:
We're using a sharded replicaset, running pymongo 2.2 against mongo (version: 2.1.1-pre-). We're getting a traceback when a query returns more than one result document.
Traceback (most recent call last):
  File "/usr/lib64/python2.6/threading.py", line 532, in __bootstrap_inner
    self.run()
  File "/opt/DCM/mods/plugin.py", line 25, in run
    self._mod.collect_metrics_dcm()
  File "/opt/DCM/plugins/res.py", line 115, in collect_metrics_dcm
    ms.updateSpecificMetric(metricName, value, timestamp)
  File "/opt/DCM/mods/mongoSaver.py", line 155, in updateSpecificMetric
    latestDoc = self.getLatestDoc(metricName)
  File "/opt/DCM/mods/mongoSaver.py", line 70, in getLatestDoc
    for d in dlist:
  File "/usr/lib64/python2.6/site-packages/pymongo/cursor.py", line 747, in next
    if len(self.__data) or self._refresh():
  File "/usr/lib64/python2.6/site-packages/pymongo/cursor.py", line 698, in _refresh
    self.__uuid_subtype))
  File "/usr/lib64/python2.6/site-packages/pymongo/cursor.py", line 668, in __send_message
    assert response["starting_from"] == self.__retrieved
AssertionError
The code that produces dlist is a simple find(). I've tried reIndex(), no joy. I've tried stopping and starting the mongo server, no joy.
This is easily replicable for me. Any ideas?
OK, so I traced this down a bit, and I have a SOLUTION for this assertion error.
There is a BUG in Mongo. When querying a sharded replica set, Mongo returns an incorrect value for 'starting_from': instead of returning 0 on the first query, it returns the number of records received rather than the offset value. I have a patch for pymongo to protect against this bad info:
File is site-packages/pymongo/cursor.py.
[user@hostname]$ diff cursor.py.orig cursor.py
631,632c631,634
< if not self.__tailable:
< assert response["starting_from"] == self.__retrieved
---
> if ((not self.__tailable) and (self.__retrieved != 0) and (response["starting_from"] != self.__retrieved)):
> from pprint import pformat
> msg = "Server response of 'starting_from' is '%s', but self__retrieved (which is only set to nonzero below here) is '%s'." % (pformat(response), pformat(self.__retrieved))
> assert False, msg
The 'starting_from' comes from helpers.py decoding the response from Mongo:
result["starting_from"] = struct.unpack("<i", response[12:16])[0]
So it's the 12th through the 15th byte of Mongo's response.
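For reference, a sketch of how those offsets line up with the legacy OP_REPLY layout, assuming (as pymongo's helpers do) that the 16-byte standard message header has already been stripped off:

import struct

def unpack_reply(body):
    # body = OP_REPLY message minus the 16-byte standard header
    response_flags  = struct.unpack("<i", body[0:4])[0]
    cursor_id       = struct.unpack("<q", body[4:12])[0]
    starting_from   = struct.unpack("<i", body[12:16])[0]   # the field asserted on
    number_returned = struct.unpack("<i", body[16:20])[0]
    return response_flags, cursor_id, starting_from, number_returned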
This is a bug in the 2.1.1 development release of mongos. See https://jira.mongodb.org/browse/SERVER-5844

mongoDB: group by failing while querying from pymongo

Here is what I am doing:
>>> import pymongo
>>> con = pymongo.Connection('localhost',12345)
>>> db = con['staging']
>>> coll = db['contract']
>>> result = coll.group(['asset_id'], None, {'list': []}, 'function(obj, prev) {prev.list.push(obj)}')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "build/bdist.macosx-10.3-fat/egg/pymongo/collection.py", line 908, in group
  File "build/bdist.macosx-10.3-fat/egg/pymongo/database.py", line 340, in command
  File "build/bdist.macosx-10.3-fat/egg/pymongo/helpers.py", line 126, in _check_command_response
pymongo.errors.OperationFailure: command SON([('group', {'$reduce': Code('function(obj, prev) {prev.list.push(obj)}', {}), 'ns': u'contract', 'cond': None, 'key': {'asset_id': 1}, 'initial': {'list': []}})]) failed: exception: BufBuilder grow() > 64MB
and what I see in the mongod logs is the following:
Wed Nov 16 16:05:55 [conn209] Assertion: 13548:BufBuilder grow() > 64MB 0x10008de9b 0x100008d89 0x100151e72 0x100152712 0x100151954 0x100152712 0x100151954 0x100152712 0x100152e7b 0x100152f0c 0x10013b1d9 0x1003706bf 0x10037204c 0x10034c4d6 0x10034d877 0x100180cc4 0x100184649 0x1002b9e89 0x1002c3f18 0x100433888 0 mongod 0x000000010008de9b _ZN5mongo11msgassertedEiPKc + 315 1 mongod 0x0000000100008d89 _ZN5mongo10BufBuilder15grow_reallocateEv + 73
2 mongod 0x0000000100151e72 _ZN5mongo9Convertor6appendERNS_14BSONObjBuilderESslNS_8BSONTypeERKNS_13TraverseStackE + 2962
3 mongod 0x0000000100152712 _ZN5mongo9Convertor8toObjectEP8JSObjectRKNS_13TraverseStackE + 1682
4 mongod 0x0000000100151954 _ZN5mongo9Convertor6appendERNS_14BSONObjBuilderESslNS_8BSONTypeERKNS_13TraverseStackE + 1652
5 mongod 0x0000000100152712 _ZN5mongo9Convertor8toObjectEP8JSObjectRKNS_13TraverseStackE + 1682
6 mongod 0x0000000100151954 _ZN5mongo9Convertor6appendERNS_14BSONObjBuilderESslNS_8BSONTypeERKNS_13TraverseStackE + 1652
7 mongod 0x0000000100152712 _ZN5mongo9Convertor8toObjectEP8JSObjectRKNS_13TraverseStackE + 1682
8 mongod 0x0000000100152e7b _ZN5mongo9Convertor8toObjectEl + 139
9 mongod 0x0000000100152f0c _ZN5mongo7SMScope9getObjectEPKc + 92
10 mongod 0x000000010013b1d9 _ZN5mongo11PooledScope9getObjectEPKc + 25
11 mongod 0x00000001003706bf _ZN5mongo12GroupCommand5groupESsRKSsRKNS_7BSONObjES3_SsSsPKcS3_SsRSsRNS_14BSONObjBuilderE + 3551
12 mongod 0x000000010037204c _ZN5mongo12GroupCommand3runERKSsRNS_7BSONObjERSsRNS_14BSONObjBuilderEb + 3676
13 mongod 0x000000010034c4d6 _ZN5mongo11execCommandEPNS_7CommandERNS_6ClientEiPKcRNS_7BSONObjERNS_14BSONObjBuilderEb + 1350
14 mongod 0x000000010034d877 _ZN5mongo12_runCommandsEPKcRNS_7BSONObjERNS_10BufBuilderERNS_14BSONObjBuilderEbi + 2151
15 mongod 0x0000000100180cc4 _ZN5mongo11runCommandsEPKcRNS_7BSONObjERNS_5CurOpERNS_10BufBuilderERNS_14BSONObjBuilderEbi + 52
16 mongod 0x0000000100184649 _ZN5mongo8runQueryERNS_7MessageERNS_12QueryMessageERNS_5CurOpES1_ + 10585
17 mongod 0x00000001002b9e89 _ZN5mongo13receivedQueryERNS_6ClientERNS_10DbResponseERNS_7MessageE + 569
18 mongod 0x00000001002c3f18 _ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_8SockAddrE + 1528
19 mongod 0x0000000100433888 _ZN5mongo10connThreadEPNS_13MessagingPortE + 616
Wed Nov 16 16:05:55 [conn209] query staging.$cmd ntoreturn:1 command: { group: { $reduce: CodeWScope( function(obj, prev) {prev.list.push(obj)}, {}), ns: "contract", cond: null, key: { asset_id: 1 }, initial: { list: {} } } } reslen:111 1006ms
I am very new to both pymongo and MongoDB, and don't know how to resolve this. Please help.
Thank you
The relevant part of your stacktrace is:
exception: BufBuilder grow() > 64MB
Basically, Mongo won't let the buffer it is building for this result grow past 64 MB. See this question for more details (the size limit discussed there has since been bumped to 64 MB).
I'm not sure what you're trying to do with that query. It sort of looks like you want to get a list of objects for each asset_id. However, your result is going to grow past capacity because you're never differentiating between objects in your group. Try setting your initial to {'asset_id': '', 'objects': []} and your reduce to function(obj, prev) {prev.asset_id = obj.asset_id; prev.objects.push(obj)}, although there are much more efficient ways of doing this query.
Alternatively, if you're trying to get all the documents matching an ID, try:
coll.find({'asset_id': whatevs})
If you're trying to get a count of the objects, try this instead:
coll.group(
    ['asset_id'], None, {'asset_id': '', 'count': 0},
    'function(obj, prev) {prev.asset_id = obj.asset_id; prev.count += obj.count}'
)
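On newer servers the same per-asset_id count of documents can be written with the aggregation framework instead of the legacy group command (a rough sketch; group was deprecated and later removed):

pipeline = [
    {'$group': {'_id': '$asset_id', 'count': {'$sum': 1}}},
]
for row in coll.aggregate(pipeline):
    print(row)    # e.g. {'_id': <asset_id>, 'count': <n>}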