MongoDB: issues using $lte and $gte

Look at this bizarre result:
list(db.users.find({"produit_up.spec.prix":{"$gte":0, "$lte": 1000}}, {"_id":0,"produit_up":1}))
Out[5]:
[{u'produit_up': [{u'avatar': {u'avctype': u'image/jpeg',
u'orientation': u'portrait',
u'photo': ObjectId('506867863a5f3a0ea84dcd6c')},
u'spec': {u'abus': 0,
u'date': u'2012-09-30',
u'description': u"portable tr\xe8s solide, peu servi, avec batterie d'une autonomie de 3 heures.",
u'id': u'alucaard134901952647',
u'namep': u'nokia 3310',
u'nombre': 1,
u'prix': 1000,
u'tags': [u'portable', u'nokia', u'3310'],
u'vendu': False}},
{u'avatar': {u'avctype': u'image/jpeg',
u'orientation': u'portrait',
u'photo': ObjectId('50686d013a5f3a04a8923b3e')},
u'spec': {u'abus': 0,
u'date': u'2012-09-30',
u'description': u'\u0646\u0628\u064a\u0639 \u0623\u064a \u0641\u0648\u0646 \u062c\u062f\u064a\u062f \u0641\u064a \u0627\u0644\u0628\u0648\u0627\u0637 \u0645\u0639\u0627\u0647 \u0634\u0627\u0631\u062c\u0648\u0631 \u062f\u0648\u0631\u064a\u062c \u064a\u0646',
u'id': u'alucaard134902092967',
u'namep': u'iphone 3gs',
u'nombre': 1,
u'prix': 20000,
u'tags': [u'iphone', u'3gs', u'apple'],
u'vendu': False}},
{u'avatar': {u'avctype': u'image/jpeg',
u'orientation': u'paysage',
u'photo': ObjectId('50686d3e3a5f3a04a8923b40')},
u'spec': {u'abus': 0,
u'date': u'2012-09-30',
u'description': u'vends 206 toutes options 2006 hdi.',
u'id': u'alucaard134902099082',
u'namep': u'peugeot 206',
u'nombre': 1,
u'prix': 500000,
u'tags': [u'voiture', u'206', u'hdi'],
u'vendu': False}}]}]
list(db.users.find({"produit_up.spec.prix":{"$gte":0, "$lte": 100}}, {"_id":0,"produit_up":1}))
Out[6]: []
pymongo.version
Out[8]: '2.3+'
It gives me the same result in the mongo shell:
db.version()
2.2.0

Here is the answer from Bernie Hackett:
You have three values for "produit_up.spec.prix", 1000, 20000, 500000.
Why would you think that {"$gte":0, "$lte": 100} would match any of
those values? 100 is less than all of those values.
The reason that {"$gte":0, "$lte": 1000} returns all three documents
is that they are all subdocuments in an array. Since one of the
subdocuments in the array is matched the entire enclosing document
is a match for your query. Since you did a projection on only
"produit_up", just that array (including all array members) is
returned. Use $elemMatch in MongoDB 2.2 to only return the exact
matching array element.
MongoDB and PyMongo are working as designed here.
To get the behavior I think you're asking for see the $elemMatch operator:
http://docs.mongodb.org/manual/reference/projection/elemMatch/
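To make the explanation above concrete, here is a small pure-Python model of the rule as the answer describes it. It does not talk to a MongoDB server; the helper names (in_range, document_matches, first_matching_element) are made up for illustration, and the document is reduced to the names and prices from the question. It is a simplified sketch of the semantics, not the server's actual implementation.

```python
# Simplified model of the matching rule described in the answer.
# No MongoDB server is involved; all helper names are hypothetical.
doc = {"produit_up": [
    {"spec": {"namep": "nokia 3310", "prix": 1000}},
    {"spec": {"namep": "iphone 3gs", "prix": 20000}},
    {"spec": {"namep": "peugeot 206", "prix": 500000}},
]}


def in_range(price, low, high):
    """True if low <= price <= high (the $gte/$lte pair)."""
    return low <= price <= high


def document_matches(doc, low, high):
    """The whole document matches as soon as one array element is in range."""
    return any(in_range(p["spec"]["prix"], low, high) for p in doc["produit_up"])


def first_matching_element(doc, low, high):
    """Roughly what an $elemMatch projection keeps: the first element in range."""
    for p in doc["produit_up"]:
        if in_range(p["spec"]["prix"], low, high):
            return p
    return None


print(document_matches(doc, 0, 1000))   # one element (prix 1000) is in range
print(document_matches(doc, 0, 100))    # no element is in range
print(first_matching_element(doc, 0, 1000))
```

With a plain projection on "produit_up" the matched document is returned with its full array, which is why all three products showed up for the 0 to 1000 query.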

Related

MongoDB query to compute percentage

I am new to MongoDB and kind of stuck at this query. Any help/guidance will be highly appreciated. I am not able to calculate the percentage in the desired way. There is something wrong with my pipeline where prerequisites of percentage are not computed correctly. Following I provide my unsuccessful attempt along with the desired output.
Single entry in the collection looks like below:
_id : ObjectId("602fb382f060fff5419fd0d1")
time : "2019/05/02 00:00:00"
station_id : 3544
station_name : "Underhill Ave & Pacific St"
station_status : "In Service"
latitude : 40.6804836
longitude : -73.9646795
zipcode : 11238
borough : "Brooklyn"
neighbourhood : "Prospect Heights"
available_bikes : 5
available_docks : 21
The query I am trying to solve is:
Given a station_id (e.g., 522) and a num_hours (e.g., 3) passed as parameters:
- Consider only the measurements where the station_status = “In Service”.
- Consider only the measurements for that concrete “station_id”.
- Compute the percentage of measurements with available_bikes = 0 for each hour of the day (e.g., for the period [8am, 9am) the percentage is 15.06% and for the period [9am, 10am) the percentage is 27.32%).
- Sort the percentage results in decreasing order.
- Return the top “num_hours” documents.
The desired output is:
--- DOCUMENT 0 INFO ---
---------------------------------
hour : 19
percentage : 65.37
total_measurements : 283
zero_bikes_measurements : 185
---------------------------------
--- DOCUMENT 1 INFO ---
---------------------------------
hour : 21
percentage : 64.79
total_measurements : 284
zero_bikes_measurements : 184
---------------------------------
--- DOCUMENT 2 INFO ---
---------------------------------
hour : 00
percentage : 63.73
total_measurements : 284
zero_bikes_measurements : 181
My attempt is:
command_1 = {"$match": {"station_status": "In Service", "station_id": station_id, "available_bikes": 0}}
my_query.append(command_1)
command_2 = {"$group": {"_id": "null", "total_measurements": {"$sum": 1}}}
my_query.append(command_2)
command_3 = {"$project": {"_id": 0,
                          "station_id": 1,
                          "station_status": 1,
                          "hour": {"$substr": ["$time", 11, 2]},
                          "available_bikes": 1,
                          "total_measurements": {"$sum": 1}
                          }
             }
my_query.append(command_3)
command_4 = {"$group": {"_id": "$hour", "zero_bikes_measurements": {"$sum": 1}}}
my_query.append(command_4)
command_5 = {"$project": {"percent": {
    "$multiply": [{"$divide": ["$total_measurements", "$zero_bikes_measurements"]},
                  100]}}}
my_query.append(command_5)
I've taken a look at this and I'm going to offer some sincere advice:
Don't try and do this in an aggregate query. Just go back to basics and pull the numbers out using find()s and then calculate the numbers in python.
If you want to persist with an aggregate query, note that your $match command filters on available_bikes equal to zero, so you never have the total number of measurements and can never compute the percentage. Also, once you have done your first $group, you "lose" your projection: at that point in the pipeline you only have total_measurements and nothing else (comment out commands 3 to 5 to see what I mean).
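As a rough sketch of the "calculate the numbers in Python" route, the function below works on a plain list of measurement dicts shaped like the sample entry in the question. The function name zero_bike_percentages and the tiny sample list are invented for illustration; in practice the list would come from a find() on the collection, which is not shown here.

```python
from collections import defaultdict


def zero_bike_percentages(measurements, station_id, num_hours):
    """Per-hour percentage of in-service measurements with zero bikes."""
    totals = defaultdict(int)  # hour -> number of measurements
    zeros = defaultdict(int)   # hour -> measurements with available_bikes == 0
    for m in measurements:
        if m["station_status"] != "In Service" or m["station_id"] != station_id:
            continue
        hour = m["time"][11:13]  # "2019/05/02 00:00:00" -> "00"
        totals[hour] += 1
        if m["available_bikes"] == 0:
            zeros[hour] += 1
    rows = [
        {
            "hour": hour,
            "percentage": round(100.0 * zeros[hour] / total, 2),
            "total_measurements": total,
            "zero_bikes_measurements": zeros[hour],
        }
        for hour, total in totals.items()
    ]
    rows.sort(key=lambda row: row["percentage"], reverse=True)
    return rows[:num_hours]


# Hypothetical sample data, only to show the call.
sample = [
    {"time": "2019/05/02 08:00:00", "station_id": 522,
     "station_status": "In Service", "available_bikes": 0},
    {"time": "2019/05/02 08:30:00", "station_id": 522,
     "station_status": "In Service", "available_bikes": 3},
    {"time": "2019/05/02 09:00:00", "station_id": 522,
     "station_status": "In Service", "available_bikes": 0},
    {"time": "2019/05/02 09:10:00", "station_id": 522,
     "station_status": "Not In Service", "available_bikes": 0},
]
for row in zero_bike_percentages(sample, station_id=522, num_hours=2):
    print(row)
```

The key difference from the pipeline in the question is that the zero-bikes filter is applied while counting, not before it, so the per-hour totals are still available for the division.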

mongo: update $push failed with "Resulting document after update is larger than 16777216"

I want to extend a large array using the update(.. $push ..) operation.
Here are the details:
I have a large collection 'A' with many fields. Amongst the fields, I want to extract the values of the 'F' field, and transfer them into one large array stored inside one single field of a document in collection 'B'.
I split the process into steps (to limit the memory used)
Here is the python program:
...
steps = 1000 # number of steps
step = 10000 # each step will handle this number of documents
start = 0
for j in range(steps):
    print('step:', j, 'start:', start)
    project = {'$project': {'_id':0, 'F':1} }
    skip = {'$skip': start}
    limit = {'$limit': step}
    cursor = A.aggregate( [ skip, limit, project ], allowDiskUse=True )
    a = []
    for i, o in enumerate(cursor):
        value = o['F']
        a.append(value)
    print('len:', len(a))
    B.update( {'_id': 1}, { '$push': {'v' : { '$each': a } } } )
    start += step
Here is the output of this program:
step: 0 start: 0
step: 1 start: 100000
step: 2 start: 200000
step: 3 start: 300000
step: 4 start: 400000
step: 5 start: 500000
step: 6 start: 600000
step: 7 start: 700000
step: 8 start: 800000
step: 9 start: 900000
step: 10 start: 1000000
Traceback (most recent call last):
File "u_psfFlux.py", line 109, in <module>
lsst[k].update( {'_id': 1}, { '$push': {'v' : { '$each': a } } } )
File "/home/ubuntu/.local/lib/python3.5/site-packages/pymongo/collection.py", line 2503, in update
collation=collation)
File "/home/ubuntu/.local/lib/python3.5/site-packages/pymongo/collection.py", line 754, in _update
_check_write_command_response([(0, result)])
File "/home/ubuntu/.local/lib/python3.5/site-packages/pymongo/helpers.py", line 315, in _check_write_command_response
raise WriteError(error.get("errmsg"), error.get("code"), error)
pymongo.errors.WriteError: Resulting document after update is larger than 16777216
Apparently the $push operation has to fetch the complete array !!! (my expectation was that this operation would always need the same amount of memory since we always append the same amount of values to the array)
In short, I don't understand why the update/$push operation fails with error...
Or... is there a way to avoid this unneeded buffering ?
Thanks for your suggestion
Christian
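A note on the error itself: 16777216 bytes is 16 MB, MongoDB's maximum size for a single BSON document, so the message is about the size of the stored document after the push and not about memory used per step. One possible way around it, offered only as a sketch and not something from the original post, is to spread the values over several smaller documents. The helper name bucket_documents, the bucket size and the integer _id scheme below are all hypothetical.

```python
def bucket_documents(values, bucket_size):
    """Yield documents that each hold at most bucket_size values."""
    bucket = []
    bucket_id = 0
    for value in values:
        bucket.append(value)
        if len(bucket) == bucket_size:
            yield {"_id": bucket_id, "v": bucket}
            bucket = []
            bucket_id += 1
    if bucket:  # last, partially filled bucket
        yield {"_id": bucket_id, "v": bucket}


# Stand-in data; in the question these would be the 'F' values read from A.
docs = list(bucket_documents(range(25), bucket_size=10))
for d in docs:
    print(d["_id"], len(d["v"]))
```

Each yielded document could then be inserted into collection B on its own, with the bucket size chosen so that no single document gets near the 16 MB limit.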

Determining whitespace in Go

From the documentation of Go's unicode package:
func IsSpace
func IsSpace(r rune) bool
IsSpace reports whether the rune is a space character as defined by Unicode's White Space property; in the Latin-1 space this is
'\t', '\n', '\v', '\f', '\r', ' ', U+0085 (NEL), U+00A0 (NBSP).
Other definitions of spacing characters are set by category Z and property Pattern_White_Space.
My question is: What does it mean that "other definitions" are set by the Z category and Pattern_White_Space? Does this mean that calling unicode.IsSpace(), checking whether a character is in the Z category, and checking whether a character is in Pattern_White_Space will all yield different results? If so, what are the differences? And why are there differences?
The IsSpace function will first check if your rune is in the Latin1 char space. If it is, it will use the space characters you listed to determine white-spacing.
If not, isExcludingLatin (http://golang.org/src/unicode/letter.go?h=isExcludingLatin#L170) is called which looks like:
func isExcludingLatin(rangeTab *RangeTable, r rune) bool {
	r16 := rangeTab.R16
	if off := rangeTab.LatinOffset; len(r16) > off && r <= rune(r16[len(r16)-1].Hi) {
		return is16(r16[off:], uint16(r))
	}
	r32 := rangeTab.R32
	if len(r32) > 0 && r >= rune(r32[0].Lo) {
		return is32(r32, uint32(r))
	}
	return false
}
The *RangeTable being passed in is White_Space, which is defined here:
http://golang.org/src/unicode/tables.go?h=White_Space#L6069
var _White_Space = &RangeTable{
	R16: []Range16{
		{0x0009, 0x000d, 1},
		{0x0020, 0x0020, 1},
		{0x0085, 0x0085, 1},
		{0x00a0, 0x00a0, 1},
		{0x1680, 0x1680, 1},
		{0x2000, 0x200a, 1},
		{0x2028, 0x2029, 1},
		{0x202f, 0x202f, 1},
		{0x205f, 0x205f, 1},
		{0x3000, 0x3000, 1},
	},
	LatinOffset: 4,
}
To answer your main question, the IsSpace check is not limited to Latin-1.
EDIT
For clarification: if the character you are testing is not in the Latin-1 charset, the range table lookup is used. The Range16 values in the table represent ranges of 16-bit numbers as {Lo, Hi, Stride}. isExcludingLatin calls is16 with a sub-section of that table (R16) and determines whether the given rune falls in any of the ranges from index LatinOffset onward (LatinOffset is 4 in this case).
So, that is checking these ranges:
{0x1680, 0x1680, 1},
{0x2000, 0x200a, 1},
{0x2028, 0x2029, 1},
{0x202f, 0x202f, 1},
{0x205f, 0x205f, 1},
{0x3000, 0x3000, 1},
These ranges cover the following Unicode code points:
http://www.fileformat.info/info/unicode/char/1680/index.htm
http://www.fileformat.info/info/unicode/char/2000/index.htm
http://www.fileformat.info/info/unicode/char/2001/index.htm
http://www.fileformat.info/info/unicode/char/2002/index.htm
http://www.fileformat.info/info/unicode/char/2003/index.htm
http://www.fileformat.info/info/unicode/char/2004/index.htm
http://www.fileformat.info/info/unicode/char/2005/index.htm
http://www.fileformat.info/info/unicode/char/2006/index.htm
http://www.fileformat.info/info/unicode/char/2007/index.htm
http://www.fileformat.info/info/unicode/char/2008/index.htm
http://www.fileformat.info/info/unicode/char/2009/index.htm
http://www.fileformat.info/info/unicode/char/200a/index.htm
http://www.fileformat.info/info/unicode/char/2028/index.htm
http://www.fileformat.info/info/unicode/char/2029/index.htm
http://www.fileformat.info/info/unicode/char/202f/index.htm
http://www.fileformat.info/info/unicode/char/205f/index.htm
http://www.fileformat.info/info/unicode/char/3000/index.htm
All of the above are considered "white space".
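For readers who want to poke at the lookup, here is a small Python model of the logic described above. It copies the {Lo, Hi, Stride} triples from the White_Space table quoted in the answer and does a plain linear scan; the names in_ranges and is_space are made up for this sketch, and it is not a port of Go's actual is16.

```python
# (lo, hi, stride) triples copied from the White_Space table above.
WHITE_SPACE_R16 = [
    (0x0009, 0x000D, 1),
    (0x0020, 0x0020, 1),
    (0x0085, 0x0085, 1),
    (0x00A0, 0x00A0, 1),
    (0x1680, 0x1680, 1),
    (0x2000, 0x200A, 1),
    (0x2028, 0x2029, 1),
    (0x202F, 0x202F, 1),
    (0x205F, 0x205F, 1),
    (0x3000, 0x3000, 1),
]
LATIN_OFFSET = 4   # the first four ranges lie inside Latin-1
MAX_LATIN1 = 0xFF


def in_ranges(ranges, code_point):
    """True if code_point is hit by one of the (lo, hi, stride) ranges."""
    for lo, hi, stride in ranges:
        if lo <= code_point <= hi and (code_point - lo) % stride == 0:
            return True
    return False


def is_space(ch):
    """Model of the two-step check: Latin-1 fast path, then the table."""
    code_point = ord(ch)
    if code_point <= MAX_LATIN1:
        return ch in "\t\n\v\f\r \u0085\u00a0"
    return in_ranges(WHITE_SPACE_R16[LATIN_OFFSET:], code_point)


print(is_space("\u2003"))  # EM SPACE, inside {0x2000, 0x200a, 1}
print(is_space("\u200b"))  # ZERO WIDTH SPACE, just past that range
```

The stride check matters for tables such as punctuation, where a range like {0x005f, 0x007b, 28} covers only 0x5f and 0x7b and nothing in between.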

How can I see all characters in a unicode category?

I've read the documentation and can't find any examples.
http://golang.org/pkg/unicode/#IsPunct
Is there a place in the documentation that explicitly lists all characters in these categories? I'd like to see what characters are contained in category P or category M.
It's not in the documentation, but you can still read the source code. The categories you're talking about are defined in this file: http://golang.org/src/pkg/unicode/tables.go
For example, the P category is defined this way:
var _P = &RangeTable{
	R16: []Range16{
		{0x0021, 0x0023, 1},
		{0x0025, 0x002a, 1},
		{0x002c, 0x002f, 1},
		{0x003a, 0x003b, 1},
		{0x003f, 0x0040, 1},
		{0x005b, 0x005d, 1},
		{0x005f, 0x007b, 28},
		...
		{0xff5d, 0xff5f, 2},
		{0xff60, 0xff65, 1},
	},
	R32: []Range32{
		{0x10100, 0x10102, 1},
		{0x1039f, 0x103d0, 49},
		{0x10857, 0x1091f, 200},
		...
		{0x12470, 0x12473, 1},
	},
	LatinOffset: 11,
}
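And here is a simple way to print the ones in the 16-bit ranges (the R32 ranges can be walked the same way):
// Assumes "fmt" and "unicode" are imported.
for _, r := range unicode.Punct.R16 {
	// Convert to rune so the counter cannot wrap around at 0xffff.
	for c := rune(r.Lo); c <= rune(r.Hi); c += rune(r.Stride) {
		fmt.Print(string(c))
	}
}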
There are a number of web sites that present an interface to the Unicode character database. For example see the “Punctuation, ...” categories at http://www.fileformat.info/info/unicode/category/.
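If you just want a quick listing and don't mind stepping outside Go, Python's standard unicodedata module exposes the general category of each code point. The sketch below is an alternative to reading tables.go, not what the unicode package does internally, and the Unicode version bundled with your Python may differ from the one Go's tables were generated from. The function name chars_in_category is made up.

```python
import sys
import unicodedata


def chars_in_category(prefix, limit=sys.maxunicode):
    """Characters up to `limit` whose general category starts with `prefix`."""
    return [
        chr(cp)
        for cp in range(limit + 1)
        if unicodedata.category(chr(cp)).startswith(prefix)
    ]


# Restrict to Latin-1 here to keep the output short; drop `limit` for everything.
latin1_punct = chars_in_category("P", limit=0xFF)
print("".join(latin1_punct))
```

Passing "P" selects every punctuation subcategory (Pc, Pd, Ps, and so on), and "M" would do the same for marks.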

How to determine the index size with pymongo?

With MongoDB, I can run db.collection.stats() to find the size, in bytes, of each index.
PyMongo seems to be missing this operation.
Is there a way to find this information with PyMongo?
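import pymongo

# MongoClient replaces the legacy Connection class (removed in PyMongo 3).
client = pymongo.MongoClient('mongodb://localhost')
db = client.test
# collStats reports per-index sizes, in bytes, under 'indexSizes'.
db.command('collStats', 'collection')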
Result:
{
u'count': 2,
u'ns': u'test.test2',
u'ok': 1.0,
u'lastExtentSize': 8192,
u'avgObjSize': 94.0,
u'totalIndexSize': 8176,
u'systemFlags': 1,
u'userFlags': 0,
u'numExtents': 1,
u'nindexes': 1,
u'storageSize': 8192,
u'indexSizes': {u'_id_': 8176},
u'paddingFactor': 1.0,
u'size': 188
}
P.S. test2 in the result is my collection name
http://api.mongodb.org/python/current/api/pymongo/database.html
http://docs.mongodb.org/manual/reference/collection-statistics/
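Once you have the collStats result, the per-index sizes are just a nested dict. A minimal sketch of pulling them out, using a trimmed copy of the result shown above in place of a live call (the helper name index_sizes is made up):

```python
def index_sizes(stats):
    """Return (index_name, size_in_bytes) pairs, largest index first."""
    sizes = stats.get("indexSizes", {})
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)


# Trimmed copy of the collStats output from the answer.
stats = {
    "ns": "test.test2",
    "nindexes": 1,
    "totalIndexSize": 8176,
    "indexSizes": {"_id_": 8176},
}
for name, size in index_sizes(stats):
    print(name, size, "bytes")
```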