How can I find which value caused a bson.errors.InvalidStringData - mongodb

I have a system that reads data from various sources and stores them in MongoDB. The data I receive is already properly encoded in utf-8 or in unicode. Documents are loosely related and vary greatly in schema, if you will.
Every now and then, a document has a field value that is pure binary data, like a JPEG image. I know how to wrap that value in a bson.binary.Binary object to avoid the bson.errors.InvalidStringData exception.
Is there a way to tell which part of a document made the pymongo driver raise a bson.errors.InvalidStringData, or do I have to try to convert each field myself to find it?
(If by chance a binary value happens to be a valid unicode string or UTF-8, it will be stored as a string, and that's OK.)

PyMongo has two BSON implementations: one in Python for portability and one in C for speed. _make_c_string in the Python version will tell you what it failed to encode, but the C version, which is evidently what you're using, does not. You can tell which BSON implementation you have with import bson; bson.has_c(). I've filed PYTHON-533; it'll be fixed soon.
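For example, a quick check (trivial, but it tells you which of the two error messages to expect):

import bson

# True: the C extension is active, with the terse message shown below.
# False: the pure-Python implementation, whose message names the value it failed to encode.
print(bson.has_c())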

(Answering my own question)
You can't tell from the exception, and some rewrite of the driver would be required to support that feature.
The code is in bson/__init__.py. There is a function named _make_c_string that raises InvalidStringData if the string raises a UnicodeError when it is encoded to UTF-8. The same function is used for both keys and values that are strings.
In other words, at that point in the code, the driver does not know whether it is dealing with a key or a value.
The offending data is passed as a raw string to the exception's constructor, but for a reason I don't understand, it does not come out of the driver.
>>> bad['zzz'] = '0\x82\x05\x17'
>>> try:
...     db.test.insert(bad)
... except bson.errors.InvalidStringData as isd:
...     print isd
...
strings in documents must be valid UTF-8
But that does not matter: you would have to look up the keys for that value anyway.
The best way is to iterate over the values, trying to decode them as UTF-8. If a UnicodeDecodeError is raised, wrap the value in a Binary object.
Somewhat like this:
try:
    # this code could deal with other encodings, like latin_1,
    # but that's not the point here
    value.decode('utf-8')
except UnicodeDecodeError:
    value = bson.binary.Binary(str(value))
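Putting the pieces together, a minimal sketch (the sanitise name is mine; it assumes Python 2 byte strings, as in the session above, and does not recurse into nested documents or lists):

import bson.binary

def sanitise(doc):
    # wrap every non-UTF-8 string value in bson.binary.Binary (flat documents only)
    for key, value in doc.items():
        if isinstance(value, str):           # a Python 2 byte string
            try:
                value.decode('utf-8')        # other encodings (e.g. latin_1) could be tried here
            except UnicodeDecodeError:
                doc[key] = bson.binary.Binary(value)
    return doc

db.test.insert(sanitise(bad))                # 'db' and 'bad' as in the session above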

Related

Is empty string value generally allowed by the FIX protocol?

When I look at the definition of a String type in the FIX protocol (e.g. here or here), I don't see a minimum length specified. Is it allowed to use empty strings? One online decoder seems to accept an empty string value (see tag 320); another complains that it's invalid.
The FIX 4.4 specification states the following (emphasis in the original text):
Each message is constructed of a stream of <tag>=<value> fields with a
field delimiter between fields in the stream. Tags are of data type
TagNum. All tags must have a value specified. Optional fields without
values should simply not be specified in the FIX message. A Reject
message is the appropriate response to a tag with no value.
That strongly suggests (but does not unambiguously state) to me that the use of an empty value for a string is invalid. It is unsurprising to me that different FIX implementations might treat this edge case in different ways. So, I think the best approach is to avoid using empty values for strings.
+1 for Ciaran's and Grant's answer/comments. Just want to add something.
I generally suggest looking up things like this in the most current specification, since newer specs have usually been refined/reworded/clarified to eliminate unclear or ambiguous statements from older ones.
The answer is on the very page you link to in your question (emphasis mine, search for "Well-formed field"): https://www.fixtrading.org/standards/tagvalue-online/#field-syntax
A well-formed field has the form:
tag=value<SOH>
A field shall be considered malformed if any of the following occurs as a result of encoding:
the tag is empty
the tag delimiter is missing
the value is empty
the value contains an <SOH> character and the datatype of the field is not data or XMLdata
the datatype of the field is data and the field is not immediately preceded by its associated Length field.
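For illustration, those five rules translate almost mechanically into code. A sketch (the function and its datatype/preceded_by_length parameters are my own simplification; a real engine would know each tag's datatype from its dictionary):

SOH = '\x01'

def is_well_formed(field, datatype=None, preceded_by_length=False):
    # check one tag=value field against the five rules quoted above
    tag, sep, value = field.partition('=')
    if not tag:                                    # the tag is empty
        return False
    if not sep:                                    # the tag delimiter is missing
        return False
    if not value:                                  # the value is empty
        return False
    if SOH in value and datatype not in ('data', 'XMLdata'):
        return False                               # SOH allowed only in data/XMLdata values
    if datatype == 'data' and not preceded_by_length:
        return False                               # data must directly follow its Length field
    return True

print(is_well_formed('320=X'))   # True
print(is_well_formed('320='))    # False: empty value, i.e. the case asked about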

How can I save a string array to PlayerPrefs in Unity?

I have an array and I would like to save it to PlayerPrefs. I heard, I can do this:
PlayerPrefs.SetStringArray('title', anArray);
but for some reason it does not work.
Maybe I'm not using some library like using UnityEngine.PlayerPrefs;?
Can someone help me?
Thanks in advance
You can't. PlayerPrefs doesn't support arrays.
But you could use a special separator and do e.g.
PlayerPrefs.SetString("title", string.Join("###", anArray));
and then for reading use
var anArray = PlayerPrefs.GetString("title").Split(new []{"###"}, StringSplitOptions.None);
Or if you know the content and in particular that a certain character is never used, you could also use a single char e.g.
PlayerPrefs.SetString("title", string.Join("\n", anArray));
and then for reading use
var anArray = PlayerPrefs.GetString("title").Split('\n');
Yes, as TEEBQNE mentioned, there is PlayerPrefsX.cs, which might be the source of the confusion.
I would NOT recommend it though! It simply converts all the different input types into byte[] and from there to Base64 strings.
That might be cool and all for int[], bool[], etc. But for string[] this is absolutely inefficient since the Base64 bytes representation of a string is way longer than the string itself!
It might be a valid alternative though if you cannot rely on your strings' contents and cannot be sure that your separator sequence never actually appears in any of the strings.
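To see that inflation concretely (Python here only for brevity; the ratio is language-independent, since Base64 always turns 3 input bytes into 4 output characters):

import base64

s = "some player title"
encoded = base64.b64encode(s.encode("utf-8")).decode("ascii")
print(len(s), len(encoded))   # 17 vs 24: roughly a third longer, plus padding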

Extracting ByteBuffer from GenericRecord sometimes has extra values

I serialise objects to avro format in spark. These objects include byte arrays (edit: polylines, which are represented as strings). When I inspect the file, the data is correct.
$ java -jar ~/data/avro-tools-1.8.1.jar tojson part-00000.avro | grep 123
{"key":123, "data":{"bytes":"gt_upA`mjrcE{Cw^uBwY"}}
# ^ this example has been simplified for this question
gt_upA`mjrcE{Cw^uBwY is the correct string representation of the byte array.
I then try to deserialise these files in my plain Scala app. Most values are parsed correctly, but sometimes there are extra bytes in the parsed arrays.
val entity: GenericRecord
val byteBuffer = entity.get("data").asInstanceOf[ByteBuffer]
println(new String(byteBuffer.array, "UTF-8"))
Results in gt_upA`mjrcE{Cw^uBwYB. Note the extra trailing B.
I am parsing the files in parallel, and I guess that the ByteBuffer instance is not thread safe and backing arrays are being overwritten.
How should I be parsing these files?
edit: While the question stands, I have since encoded the values as UTF-8 strings directly. It adds additional work when parsing, but avoids the problems with ByteBuffer's inability to be read concurrently.
You can't print arbitrary binary data as UTF-8. Some byte combinations are invalid or ambiguous, so converting them to characters isn't well defined and depends on the library you are using (and also on your terminal settings).
Just print them as hexadecimals instead:
byteBuffer.array.foreach { b => print("%02X".format(b)) }
println()
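The failure mode is easy to reproduce in any language; in Python, for instance (the byte values are arbitrary):

data = b'\x30\x82\x05\x17'

try:
    data.decode('utf-8')
except UnicodeDecodeError as exc:
    print(exc)                               # 0x82 is not a valid UTF-8 start byte

print(''.join('%02X' % b for b in data))    # 30820517 -- always unambiguous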

Traversing BSON binary representation in python?

Rather than deserializing a whole BSON document to a python dict, I would like to traverse it directly, taking advantage of the native traversability of the BSON format[1,2]
Is that possible with any of the python BSON libraries available? I can readily see the methods for getting a dict out, but methods for traversing the binary format don't seem to be apparent.
https://groups.google.com/forum/#!topic/bson/e7aBbwA6bAE
http://bsonspec.org/
This sounds like what you are looking for: https://github.com/bauman/python-bson-streaming
It allows you to stream the BSON rather than loading the whole file into memory.
From the documentation:
from bsonstream import KeyValueBSONInput
from sys import argv

for file in argv[1:]:
    f = open(file, 'rb')
    stream = KeyValueBSONInput(fh=f, fast_string_prematch="something")  # remove fast string match if not needed
    for id, dict_data in stream:
        if id:
            pass  # ...process dict_data...
The problem you have is that to traverse the BSON you must convert it into an iterator, which is in itself an object you must build as a language structure, i.e. a dictionary.
Even with a BSON library, it would still have to convert the document into a traversable object that Python understands, a.k.a. a dict.
However to answer your question: I know of none.
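That said, the element layout documented at bsonspec.org is simple enough to skip through by hand when all you need are the top-level keys. A rough sketch (entirely my own illustration, Python 3, handling only a few common type tags):

import struct

def iter_elements(data):
    # yield (key, type_byte) for the top-level elements of one BSON document
    pos = 4                                     # skip the int32 total-size prefix
    while data[pos] != 0x00:                    # 0x00 terminates the element list
        type_byte = data[pos]
        pos += 1
        end = data.index(b'\x00', pos)          # the key is a NUL-terminated cstring
        key = data[pos:end].decode('utf-8')
        pos = end + 1
        if type_byte == 0x01:                   # double: 8 bytes
            pos += 8
        elif type_byte == 0x02:                 # string: int32 length (incl. NUL) + bytes
            pos += 4 + struct.unpack_from('<i', data, pos)[0]
        elif type_byte == 0x05:                 # binary: int32 length + subtype byte + bytes
            pos += 5 + struct.unpack_from('<i', data, pos)[0]
        elif type_byte in (0x03, 0x04):         # embedded document / array: length includes itself
            pos += struct.unpack_from('<i', data, pos)[0]
        elif type_byte == 0x10:                 # int32
            pos += 4
        elif type_byte == 0x12:                 # int64
            pos += 8
        else:
            raise NotImplementedError('type 0x%02x' % type_byte)
        yield key, type_byte

With pymongo installed, bson.BSON.encode({'a': 1, 's': 'hi'}) gives you raw bytes to try it on.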

Getting UTF-8 Request Parameter Strings in mod_perl2

I'm using mod_perl2 for a website and use CGI::Apache2::Wrapper to get the request parameters for the page (e.g. post data). I've noticed that the string the $req->param("parameter") function returns is a raw byte string rather than a decoded UTF-8 string. If I use the string as-is I can end up with garbled results, so I need to decode it using Encode::decode_utf8(). Is there any way to either get the parameters already decoded into UTF-8 strings or loop through the parameters and safely decode them?
To get the parameters already decoded, we would need to override the behaviour of the underlying class Apache2::Request from libapreq2, thus losing its XS speed advantage. But that is not even straightforwardly possible, as unfortunately we are sabotaged by the CGI::Apache2::Wrapper constructor:
unless (defined $r and ref($r) and ref($r) eq 'Apache2::RequestRec') {
This is wrong OO programming; it should say
… $r->isa('Apache2::RequestRec')
or perhaps forego class names altogether and just test for behaviour (… $r->can('param')).
I say, with those obstacles, it's not worth it. I recommend keeping your existing solution that decodes parameters explicitly. It's clear enough.
To loop over the request parameters, simply do not pass an argument to the param method and you get a list of the names. This is documented (1, 2); please read more carefully.