Traversing BSON binary representation in python? - mongodb

Rather than deserializing a whole BSON document to a Python dict, I would like to traverse it directly, taking advantage of the native traversability of the BSON format [1][2].
Is that possible with any of the python BSON libraries available? I can readily see the methods for getting a dict out, but methods for traversing the binary format don't seem to be apparent.
[1] https://groups.google.com/forum/#!topic/bson/e7aBbwA6bAE
[2] http://bsonspec.org/
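For concreteness, BSON is built to be skippable: the document and every variable-length value carry an int32 length prefix, so top-level keys can be walked without decoding values. Below is a minimal, hypothetical sketch of such a scanner (not from any of the libraries mentioned; it handles only a subset of the type tags defined at bsonspec.org):

import struct

# byte widths of fixed-size BSON types (a subset of the spec)
FIXED_WIDTHS = {0x01: 8, 0x07: 12, 0x08: 1, 0x09: 8, 0x0A: 0,
                0x10: 4, 0x11: 8, 0x12: 8}

def iter_elements(data, offset=0):
    """Yield (key, type_byte, value_offset) for each element of the
    BSON document starting at `offset`, without decoding any values."""
    doc_len = struct.unpack_from("<i", data, offset)[0]
    end = offset + doc_len - 1  # the final byte is the 0x00 terminator
    pos = offset + 4
    while pos < end:
        type_byte = data[pos]
        key_end = data.index(b"\x00", pos + 1)
        key = data[pos + 1:key_end].decode("utf-8")
        pos = key_end + 1
        yield key, type_byte, pos
        # skip over the value using the length information in the format
        if type_byte in FIXED_WIDTHS:
            pos += FIXED_WIDTHS[type_byte]
        elif type_byte in (0x02, 0x0D, 0x0E):  # string-like: int32 length + bytes
            pos += 4 + struct.unpack_from("<i", data, pos)[0]
        elif type_byte in (0x03, 0x04):  # embedded document or array
            pos += struct.unpack_from("<i", data, pos)[0]
        elif type_byte == 0x05:  # binary: int32 length + subtype byte + bytes
            pos += 5 + struct.unpack_from("<i", data, pos)[0]
        else:
            raise ValueError("unhandled BSON type 0x%02x" % type_byte)

# usage sketch: raw = open("dump.bson", "rb").read()
# for key, type_byte, _ in iter_elements(raw): print(key, hex(type_byte))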

This sounds like what you are looking for: https://github.com/bauman/python-bson-streaming
It lets you stream the BSON rather than loading the whole file into memory.
From the documentation:
from bsonstream import KeyValueBSONInput
from sys import argv

for filename in argv[1:]:
    f = open(filename, 'rb')
    # drop fast_string_prematch if you do not need the pre-filter
    stream = KeyValueBSONInput(fh=f, fast_string_prematch="something")
    for id, dict_data in stream:
        if id:
            pass  # ...process dict_data...
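If plain streaming (one decoded document at a time) is enough, the bson package that ships with PyMongo can do this too; a small sketch, assuming a reasonably recent PyMongo and a hypothetical dump.bson file:

import bson  # the bson package bundled with PyMongo

with open("dump.bson", "rb") as f:
    # decode_file_iter decodes one document per iteration instead of
    # reading the whole file into memory at once
    for doc in bson.decode_file_iter(f):
        print(doc.get("_id"))

Each document is still fully deserialized to a dict, though, so this addresses memory use rather than the traversal question itself.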

The problem is that to iterate over a BSON string you must first convert it into a structure the language understands, i.e. a dictionary.
Even with a BSON library, it would still have to convert the document into a traversable object that Python understands, a.k.a. a dict.
However, to answer your question: I know of none.

Related

Non-relational database in statically typed languages (rethinkdb, Scala)

I'm still pretty new to Scala and have hit some kind of typing roadblock.
Non-SQL databases such as Mongo and RethinkDB do not enforce any schema for their tables and manage data in JSON format. I've been struggling to get the Java API for RethinkDB to work from Scala, and there seems to be surprisingly little information on how to actually use the results returned from the database.
Assuming a simple document schema such as this:
{
  "name": "melvin",
  "age": 42,
  "tags": ["solution"]
}
I fail to see how to actually use this data in Scala. After running a query, for example by running something like r.table("test").run(connection), I receive an object over which I can iterate AnyRef objects. In the Python world, this would most likely be a simple dict. How do I convey the structure of this data to Scala, so I can use it in code (e.g., query fields of the returned documents)?
From a quick scan of the docs and code, the Java Rethink client uses Jackson to handle deserialization of the JSON received from the DB into JVM objects. Since by definition every JSON object received is going to be deserializable into a JSON AST (Abstract Syntax Tree: a representation in plain Scala objects of the structure of a JSON document), you could implement a custom Jackson ObjectMapper which, instead of doing the usual Jackson magic with reflection, always deserializes into the JSON AST.
For example, Play JSON defers the actual serialization/deserialization to/from JSON to Jackson: it installs a module into a vanilla ObjectMapper which specially takes care of instances of JsValue, which is the root type of Play JSON's AST. Then something like this should work:
import com.fasterxml.jackson.databind.ObjectMapper
import play.api.libs.json.JsonParserSettings
import play.api.libs.json.jackson.PlayJsonModule

// Use Play JSON's ObjectMapper... best to do this before connecting
RethinkDB.setResultMapper(new ObjectMapper().registerModule(new PlayJsonModule(JsonParserSettings())))
run(connection) returns a Result[AnyRef] in Scala notation. There's an alternative version, run(connection, typeRef), where the second argument specifies a result type; this is passed to the ObjectMapper to ensure that every document will either fail to deserialize or be an instance of that result type:
import play.api.libs.json.JsValue

val result: Result[JsValue] = r.table("table").run(connection, classOf[JsValue])
You can then get the next element from the result as a JsValue and use the usual Play JSON machinery to convert the JsValue into your domain type:
import play.api.libs.json.{Json, OFormat}

case class MyDocument(name: String, age: Int, tags: Seq[String])

object MyDocument {
  implicit val jsonFormat: OFormat[MyDocument] = Json.format[MyDocument]
}

// result is a Result[JsValue]; JsResult.asOpt discards the error detail
val myDoc: Option[MyDocument] = Json.fromJson[MyDocument](result.next()).asOpt
With some enrichments (e.g. implicit classes), a lot of this machinery can be made more transparent in the Scala API.
You could do similar things with the other Scala JSON ASTs (e.g. Circe, json4s), but you might have to implement functionality similar to what Play does with the ObjectMapper yourself.

Extracting ByteBuffer from GenericRecord sometimes has extra values

I serialise objects to Avro format in Spark. These objects include byte arrays (edit: polylines, which are represented as strings). When I inspect the file, the data is correct.
$ java -jar ~/data/avro-tools-1.8.1.jar tojson part-00000.avro | grep 123
{"key":123, "data":{"bytes":"gt_upA`mjrcE{Cw^uBwY"}}
# ^ this example has been simplified for this question
gt_upA`mjrcE{Cw^uBwY is the correct string representation of the byte array.
I then try to deserialise these files in my plain Scala app. Most values are parsed correctly, but sometimes there are extra bytes in the parsed arrays.
val entity: GenericRecord = ??? // obtained from an Avro file reader
val byteBuffer = entity.get("data").asInstanceOf[ByteBuffer]
println(new String(byteBuffer.array, "UTF-8"))
Results in gt_upA`mjrcE{Cw^uBwYB. Note the extra trailing B.
I am parsing the files in parallel, and I guess that the ByteBuffer instance is not thread safe and backing arrays are being overwritten.
How should I be parsing these files?
edit: While the question stands, I have since encoded the values as UTF-8 strings directly. It adds additional work when parsing, but avoids the problems with ByteBuffer's inability to be read concurrently.
You can't print arbitrary binary data as UTF-8. Some byte combinations are invalid or ambiguous, and converting them to characters isn't well defined; it depends on the library you are using (and also on your terminal settings).
Just print the bytes as hexadecimal instead:
// copy only position..limit: array() also exposes any stale bytes past limit()
val bytes = new Array[Byte](byteBuffer.remaining)
byteBuffer.duplicate.get(bytes)
println(bytes.map("%02X".format(_)).mkString)

How can I find which value caused a bson.errors.InvalidStringData

I have a system that reads data from various sources and stores them in MongoDB. The data I receive is already properly encoded in utf-8 or in unicode. Documents are loosely related and vary greatly in schema, if you will.
Every now and then, a document has a field value that is pure binary data, like a JPEG image. I know how to wrap that value in a bson.binary.Binary object to avoid the bson.errors.InvalidStringData exception.
Is there a way to tell which part of a document made the PyMongo driver raise a bson.errors.InvalidStringData, or do I have to try to convert each field to find it?
(If by chance a binary object happens to be a valid unicode string or UTF-8, it will be stored as a string, and that's OK.)
PyMongo has two BSON implementations, one in Python for portability and one in C for speed. _make_c_string in the Python version will tell you what it failed to encode but the C version, which is evidently what you're using, does not. You can tell which BSON implementation you have with import bson; bson.has_c(). I've filed PYTHON-533, it'll be fixed soon.
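A quick sketch of that check (bson.has_c is part of the bson package bundled with PyMongo):

import bson

# True means the C extension is active, which (per the above) raises the
# less informative InvalidStringData message
print(bson.has_c())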
(Answering my own question)
You can't tell from the exception, and some rewrite of the driver would be required to support that feature.
The code is in bson/__init__.py. There is a function named _make_c_string that raises InvalidStringData if the string raises a UnicodeError when being encoded to UTF-8. The same function is used for both keys and values that are strings.
In other words, at this point in the code, the driver does not know whether it is dealing with a key or a value.
The offending data is passed as a raw string to the exception's constructor, but for a reason I don't understand it does not come out of the driver:
>>> bad['zzz'] = '0\x82\x05\x17'
>>> try:
...     db.test.insert(bad)
... except bson.errors.InvalidStringData as isd:
...     print isd
...
strings in documents must be valid UTF-8
But that does not matter: you would have to look up the keys for that value anyway.
The best way is to iterate over the values, trying to decode them as UTF-8. If a UnicodeDecodeError is raised, wrap the value in a Binary object.
Somewhat like this:
try:
    # this code could deal with other encodings, like latin_1,
    # but that's not the point here
    value.decode('utf-8')
except UnicodeDecodeError:
    value = bson.binary.Binary(str(value))
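Putting that together, a minimal sketch (the helper name and the decision to recurse into subdocuments are my own assumptions; Python 2 style, to match the snippets above):

import bson.binary

def wrap_binary_values(doc):
    # return a copy of doc with undecodable byte strings wrapped in
    # bson.binary.Binary so the insert no longer raises InvalidStringData
    fixed = {}
    for key, value in doc.items():
        if isinstance(value, dict):
            fixed[key] = wrap_binary_values(value)  # recurse into subdocuments
        elif isinstance(value, str):
            try:
                value.decode('utf-8')
                fixed[key] = value
            except UnicodeDecodeError:
                fixed[key] = bson.binary.Binary(value)
        else:
            fixed[key] = value  # lists and other containers left as-is
    return fixed

# db.test.insert(wrap_binary_values(bad))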

How do you query for, or insert as, specific types in MongoDB?

I'm using the Perl bindings for MongoDB, and it seems that when I insert numbers, they are sometimes counted as strings. When I add 0 to the number, it gets converted.
Is there a way, with the MongoDB Perl bindings, to specify the data type on insertion?
Is there also a way to query for specific types?
An example type query might look like this, if there were a "$type" function:
db.c.find({ key: { '$type': 'NumberLong' } });
MongoDB does have a $type operator for checking against BSON datatypes: http://www.mongodb.org/display/DOCS/Advanced+Queries#AdvancedQueries-%24type
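Note that $type takes the numeric BSON type code (18 is a 64-bit integer, i.e. NumberLong). A quick illustration using PyMongo, since the server-side operator is the same from any driver (the connection and database names are placeholders):

from pymongo import MongoClient

db = MongoClient().test  # placeholder connection and database

# BSON type codes: 1 = double, 2 = string, 16 = 32-bit int, 18 = 64-bit int
long_docs = db.c.find({'key': {'$type': 18}})
string_docs = db.c.find({'key': {'$type': 2}})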
Unfortunately, in Perl you are subject to the language's semantics: there is no Number datatype in Perl, just Scalar.
If Perl has decided to store your value as a string, that is how it will go into BSON, and $type will likely not match it correctly.
However, it does look like the Perl driver tests whether a string it is saving can be treated as a number, in which case it is stored as one; you may want to test this along with the $type operator.
Use the looks_like_number parameter in the Perl MongoDB driver:
http://search.cpan.org/~friedo/MongoDB-0.702.2/lib/MongoDB/BSON.pm#looks_like_number
Worked like a charm for me

How to create an Array from Iterable in Scala 2.7.7?

I'm using Scala 2.7.7
I'm experiencing difficulties with access to the documentation, so code snippets would be great.
Context
I parse an IP address of 4 or 16 bytes in length. I need an array of bytes to pass into java.net.InetAddress. The result of String.split(separator).map(_.toByte) gives me an instance of Iterable.
I see two ways to solve the problem
use an array 16 bytes in length, fill it from the Iterable, and return just a part of it if not all fields are used (Is there a function to fill an array in 2.7.7? How do I get the part?);
use a dynamic-length container and form an array from it (Which container is suitable?).
My current implementation is published in my other question about memory leaks.
In Scala 2.7, Iterable has a method called copyToArray.
I'd strongly advise you not to use an Array here, unless you have to use a particular library or framework that requires one.
Normally, you'd be better off with a native Scala type:
// assuming str holds the input string
str.split(separator).map(_.toByte).toList
// or
str.split(separator).map(_.toByte).toSeq
Update
Assuming that your original string is a delimited list of hostnames, why not just:
import java.net.InetAddress

val namesStr = "www.sun.com;www.stackoverflow.com;www.scala-tools.com"
val separator = ";"
val addresses = namesStr.split(separator).map(InetAddress.getByName)
That'll give you an iterable of InetAddress instances.