Extracting ByteBuffer from GenericRecord sometimes has extra values - scala

I serialise objects to Avro format in Spark. These objects include byte arrays (edit: polylines, which are represented as strings). When I inspect the file, the data is correct.
$ java -jar ~/data/avro-tools-1.8.1.jar tojson part-00000.avro | grep 123
{"key":123, "data":{"bytes":"gt_upA`mjrcE{Cw^uBwY"}}
# ^ this example has been simplified for this question
gt_upA`mjrcE{Cw^uBwY is the correct string representation of the byte array.
I then try to deserialise these files in my plain Scala app. Most values are parsed correctly, but sometimes there are extra bytes in the parsed arrays.
val entity: GenericRecord
val byteBuffer = entity.get("data").asInstanceOf[ByteBuffer]
println(new String(byteBuffer.array, "UTF-8"))
Results in gt_upA`mjrcE{Cw^uBwYB. Note the extra trailing B.
I am parsing the files in parallel, and I guess that the ByteBuffer instance is not thread-safe and backing arrays are being overwritten.
How should I be parsing these files?
edit: While the question stands, I have since encoded the values as UTF-8 strings directly. It adds additional work when parsing, but avoids the problems with ByteBuffer's inability to be read concurrently.
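For reference, a minimal sketch of that string-based approach, assuming the field is now declared as an Avro string (Avro hands string fields back as org.apache.avro.util.Utf8, so an explicit toString is needed):
val polyline = entity.get("data").toString  // Utf8 -> java.lang.String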

You can't print arbitrary binary data as UTF-8. Some byte sequences are invalid or ambiguous, so converting them to characters isn't well defined and depends on the library you are using (and on your terminal settings).
Just print them as hexadecimal instead:
byteBuffer.array.foreach { b => print("%02X".format(b)) }
println
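Separately: byteBuffer.array returns the entire backing array and ignores the buffer's position and limit, so if Avro hands you a buffer backed by a larger (possibly reused) array, stray trailing bytes like the B above will appear. A minimal sketch that copies only the valid region:
val valid = new Array[Byte](byteBuffer.remaining)
byteBuffer.duplicate.get(valid)  // duplicate() reads the same content without disturbing the original buffer's position
valid.foreach { b => print("%02X".format(b)) }
println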

Related

Why have the InputStreamReader and OutputStreamWriter classes been designed in Java?

I have studied that InputStreamReader is used to convert bytes to characters and OutputStreamWriter converts characters to bytes. But what does this byte-to-character and character-to-byte conversion actually mean? What actually happens internally? Everything is stored as bytes in our system, so can't we simply use the byte stream classes for reading and writing data of any type? Please help me make this concept clear.
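For illustration, a hedged sketch (in Scala, to match the rest of this page) of why the charset matters: the same bytes decode to different characters under different encodings, and that byte-to-character mapping is exactly what InputStreamReader and OutputStreamWriter perform:
val bytes = Array[Byte](0xC3.toByte, 0xA9.toByte)  // the UTF-8 encoding of 'é'
println(new String(bytes, "UTF-8"))       // é
println(new String(bytes, "ISO-8859-1"))  // Ã©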

Traversing BSON binary representation in python?

Rather than deserializing a whole BSON document to a Python dict, I would like to traverse it directly, taking advantage of the native traversability of the BSON format [1,2].
Is that possible with any of the Python BSON libraries available? I can readily see the methods for getting a dict out, but methods for traversing the binary format don't seem to be apparent.
https://groups.google.com/forum/#!topic/bson/e7aBbwA6bAE
http://bsonspec.org/
This sounds like what you are looking for: https://github.com/bauman/python-bson-streaming
It allows you to stream the BSON rather than loading the whole file into memory.
From the documentation:
from bsonstream import KeyValueBSONInput
from sys import argv

for file in argv[1:]:
    f = open(file, 'rb')
    stream = KeyValueBSONInput(fh=f, fast_string_prematch="somthing")  # remove fast string match if not needed
    for id, dict_data in stream:
        if id:
            ...process dict_data...
The problem is that to traverse the BSON you have to turn it into an iterator, which is itself an object you must build as a language structure, i.e. a dictionary.
Even with a BSON library, it would still have to convert the data into a traversable object that Python understands, a.k.a. a dict.
However, to answer your question: I know of none.

How can I find which value caused a bson.errors.InvalidStringData

I have a system that reads data from various sources and stores them in MongoDB. The data I receive is already properly encoded in utf-8 or in unicode. Documents are loosely related and vary greatly in schema, if you will.
Every now and then, a document has a field value that is pure binary data, like a JPEG image. I know how to wrap that value in a bson.binary.Binary object to avoid the bson.errors.InvalidStringData exception.
Is there a way to tell which part of a document made the PyMongo driver raise a bson.errors.InvalidStringData, or do I have to try converting each field to find it?
(+If by chance a binary object happens to be a valid unicode string or utf-8, it will be stored as a string and that's ok)
PyMongo has two BSON implementations, one in Python for portability and one in C for speed. _make_c_string in the Python version will tell you what it failed to encode, but the C version, which is evidently what you're using, does not. You can tell which BSON implementation you have with import bson; bson.has_c(). I've filed PYTHON-533; it'll be fixed soon.
(Answering my own question)
You can't tell from the exception, and some rewrite of the driver would be required to support that feature.
The code is in bson/__init__.py. There is a function named _make_c_string that raises InvalidStringData if encoding the string as UTF-8 throws a UnicodeError. The same function is used for both keys and values that are strings.
In other words, at this point in code, the driver does not know if it is dealing with a key or value.
The offending data is passed as a raw string to the exception's constructor, but for a reason I don't understand, it does not come out of the driver.
>>> bad['zzz'] = '0\x82\x05\x17'
>>> try:
...     db.test.insert(bad)
... except bson.errors.InvalidStringData as isd:
...     print isd
...
strings in documents must be valid UTF-8
But that does not matter: you would have to look up the keys for that value anyway.
The best way is to iterate over the values, trying to decode them in utf-8. If a UnicodeDecodeError is raised, wrap the value in a Binary object.
Somewhat like this:
try:
    # This code could deal with other encodings, like latin_1,
    # but that's not the point here
    value.decode('utf-8')
except UnicodeDecodeError:
    value = bson.binary.Binary(str(value))

How to create an Array from Iterable in Scala 2.7.7?

I'm using Scala 2.7.7
I'm experiencing difficulties with access to the documentation, so code snippets would be great.
Context
I parse an IP address of 4 or 16 bytes in length. I need an array of bytes to pass into java.net.InetAddress. String.split(separator).map(_.toByte) returns me an instance of Iterable.
I see two ways to solve the problem:
use an array 16 bytes in length, fill it from the Iterable, and return just a part of it if not all fields are used (is there a function to fill an array in 2.7.7? How do I get the part?);
use a dynamic-length container and form an array from it (which container is suitable?).
My current implementation is published in my other question about memory leaks.
In Scala 2.7, Iterable has a method called copyToArray.
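For example, a minimal sketch (exact 2.7.7 signatures may differ slightly; the octet values are hypothetical):
val src: Iterable[Byte] = List[Byte](10, 0, 0, 1)  // the parsed address bytes
val dest = new Array[Byte](4)
src.copyToArray(dest, 0)                           // fill the array from the Iterable
val addr = java.net.InetAddress.getByAddress(dest)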
I'd strongly advise you not to use an Array here, unless you have to use a particular library/framework that requires an array.
Normally, you'd be better off with a native Scala type:
String.split(separator).map(_.toByte).toList
//or
String.split(separator).map(_.toByte).toSeq
Update
Assuming that your original string is a delimited list of hostnames, why not just:
val namesStr = "www.sun.com;www.stackoverflow.com;www.scala-tools.com"
val separator = ";"
val addresses = namesStr.split(separator).map(InetAddress.getByName)
That'll give you an iterable of InetAddress instances.

Parsing of binary data with scala

I need to parse some simple binary files. (The files contain n entries, each of which consists of several signed/unsigned integers of different sizes, etc.)
At the moment I do the parsing "by hand". Does somebody know a library which helps with this type of parsing?
Edit: "By hand" means that I get the data byte by byte, sort it into the correct order, and convert it to an Int/Byte etc. Also, some of the data is unsigned.
I've used the sbinary library before and it's very nice. The documentation is a little sparse but I would suggest first looking at the old wiki page as that gives you a starting point. Then check out the test specifications, as that gives you some very nice examples.
The primary benefit of sbinary is that it gives you a way to describe the wire format of each object as a Format object. You can then encapsulate those formatted types in a higher-level Format object, and Scala does all the heavy lifting of looking up that type as long as you've included it in the current scope as an implicit object.
As I say below, I'd now recommend people use scodec instead of sbinary. As an example of how to use scodec, I'll implement how to read a binary representation in memory of the following C struct:
struct ST
{
    long long ll;  // # 0
    int i;         // # 8
    short s;       // # 12
    char ch1;      // # 14
    char ch2;      // # 15
} ST;
A matching Scala case class would be:
case class ST(ll: Long, i: Int, s: Short, ch1: String, ch2: String)
I'm making things a bit easier for myself by just saying we're storing Strings instead of Chars, and I'll say that they are UTF-8 characters in the struct. I'm also not dealing with endianness details or the actual size of the long and int types on this architecture; I'm just assuming that they are 64 and 32 bits respectively.
Scodec parsers generally use combinators to build higher-level parsers from lower-level ones. So below, we'll define a parser which combines an 8-byte value, a 4-byte value, a 2-byte value, a 1-byte value and one more 1-byte value. The result of this combination is a Tuple codec:
val myCodec: Codec[Long ~ Int ~ Short ~ String ~ String] =
  int64 ~ int32 ~ short16 ~ fixedSizeBits(8L, utf8) ~ fixedSizeBits(8L, utf8)
We can then transform this into the ST case class by calling the xmap function on it, which takes two functions: one to turn the Tuple codec into the destination type, and another to turn the destination type back into the Tuple form:
val stCodec: Codec[ST] = myCodec.xmap[ST](
  { case ll ~ i ~ s ~ ch1 ~ ch2 => ST(ll, i, s, ch1, ch2) },
  st => st.ll ~ st.i ~ st.s ~ st.ch1 ~ st.ch2
)
Now, you can use the codec like so:
stCodec.encode(ST(1L, 2, 3.shortValue, "H", "I"))
res0: scodec.Attempt[scodec.bits.BitVector] = Successful(BitVector(128 bits, 0x00000000000000010000000200034849))
res0.flatMap(stCodec.decode)
res1: scodec.Attempt[scodec.DecodeResult[ST]] = Successful(DecodeResult(ST(1,2,3,H,I),BitVector(empty)))
I'd encourage you to look at the Scaladocs and not at the Guide, as there's much more detail in the Scaladocs. The guide is a good start on the very basics, but it doesn't get into the composition part much; the Scaladocs cover that pretty well.
Scala itself doesn't have a binary data input library, but the java.nio package does a decent job. It doesn't explicitly handle unsigned data (neither does Java, so you need to figure out how you want to manage it), but it does have convenience "get" methods that take byte order into account.
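For instance, a minimal sketch (the file name and record layout here are hypothetical; it assumes a little-endian file that starts with an unsigned 16-bit count followed by a signed 32-bit value):
import java.nio.{ByteBuffer, ByteOrder}
import java.nio.file.{Files, Paths}
val bytes = Files.readAllBytes(Paths.get("data.bin"))           // hypothetical input file
val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
val count = buf.getShort & 0xFFFF  // mask to recover the unsigned 16-bit value as an Int
val value = buf.getInt             // signed 32-bit, read with the byte order set above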
I don't know what you mean by "by hand", but using a simple DataInputStream (apidoc here) is quite concise and clear:
val dis = new DataInputStream(yourSource)
dis.readFloat()
dis.readDouble()
dis.readInt()
// and so on
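Two caveats worth noting: DataInputStream always reads big-endian, and for unsigned values it provides widening helpers, e.g.:
val u8  = dis.readUnsignedByte()   // 0..255, returned as an Int
val u16 = dis.readUnsignedShort()  // 0..65535, returned as an Int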
Taken from another SO question: http://preon.sourceforge.net/. It should be a framework for binary encoding/decoding; see if it has the capabilities you need.
If you are looking for a Java-based solution, then I will shamelessly plug Preon. You just annotate the in-memory Java data structure, ask Preon for a Codec, and you're done.
Byteme is a parser combinator library for binary data. You can try it for your tasks.