Why have the InputStreamReader and OutputStreamWriter classes been designed in Java? - inputstreamreader

I studied that InputStreamReader is used to convert bytes to characters and OutputStreamWriter converts characters to bytes. But what does character-to-byte and byte-to-character conversion actually mean here? What actually happens internally? Everything is stored as bytes in our system, so can't we simply use the byte stream classes for reading and writing data of any type? Please help me make this concept clear.
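For concreteness, here is a minimal, self-contained sketch of the two kinds of reads being asked about (the class name and sample string are illustrative only): a raw byte stream hands back the encoded bytes one at a time, while an InputStreamReader applies a charset and reconstructs whole characters, even when one character spans several bytes.

import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class ByteVsCharDemo {
    public static void main(String[] args) throws Exception {
        // "é" is a single character but two bytes (0xC3 0xA9) in UTF-8
        byte[] utf8 = "caf\u00E9".getBytes(StandardCharsets.UTF_8); // "café"

        // A byte stream returns the raw encoded bytes, one at a time
        try (InputStream in = new ByteArrayInputStream(utf8)) {
            int b;
            while ((b = in.read()) != -1) {
                System.out.printf("byte: 0x%02X%n", b);
            }
        }

        // An InputStreamReader decodes those bytes with a charset and
        // returns whole characters, even when one character spans two bytes
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(utf8),
                StandardCharsets.UTF_8)) {
            int c;
            while ((c = reader.read()) != -1) {
                System.out.printf("char: %c%n", (char) c);
            }
        }
    }
}

The decoding step is the part a plain byte stream cannot do for you: it has no idea which charset the bytes were written in.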

Related

Difference between Codec and Parser in LibAV

I am following this link to figure out decoding using the libAV library. In the decode function, it declares a codec and a parser.
codec = avcodec_find_decoder(AV_CODEC_ID_H264);
parser = av_parser_init(AV_CODEC_ID_H264);
What is the difference between the two?
The parser takes a stream of bytes and turns it into a representation in memory, but it does not convert the bytes to pixels. The parser can read things like the resolution, encoding parameters, where frames begin and end, and so on.

Extracting ByteBuffer from GenericRecord sometimes has extra values

I serialise objects to Avro format in Spark. These objects include byte arrays (edit: polylines, which are represented as strings). When I inspect the file, the data is correct.
$ java -jar ~/data/avro-tools-1.8.1.jar tojson part-00000.avro | grep 123
{"key":123, "data":{"bytes":"gt_upA`mjrcE{Cw^uBwY"}}
# ^ this example has been simplified for this question
gt_upA`mjrcE{Cw^uBwY is the correct string representation of the byte array.
I then try to deserialise these files in my plain Scala app. Most values are parsed correctly, but sometimes there are extra bytes in the parsed arrays.
val entity: GenericRecord
val byteBuffer = entity.get("data").asInstanceOf[ByteBuffer]
println(new String(byteBuffer.array, "UTF-8"))
Results in gt_upA`mjrcE{Cw^uBwYB. Note the extra trailing B.
I am parsing the files in parallel, and I guess that the ByteBuffer instance is not thread safe and backing arrays are being overwritten.
How should I be parsing these files?
edit: While the question stands, I have since encoded the values as UTF-8 strings directly. It adds extra work when parsing, but avoids the problems caused by ByteBuffer not being safe to read concurrently.
You can't print arbitrary binary data as UTF-8. Some byte combinations are invalid or ambiguous, converting them to characters isn't well defined, and the result depends on the library you are using (and also on your terminal settings).
Just print them as hexadecimal instead:
byteBuffer.array.foreach { b => print("%02X".format(b)) }
println()
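If you are doing the same thing from plain Java rather than Scala, a similar hex dump could look like the sketch below (class and method names are illustrative). It reads via duplicate()/get(), which only covers the bytes between the buffer's position and limit rather than the whole backing array that array() returns.

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class HexDump {
    // Hex-encode only the readable bytes of the buffer (position..limit),
    // using a duplicate so the caller's position is left untouched
    static String toHex(ByteBuffer buffer) {
        ByteBuffer view = buffer.duplicate();
        StringBuilder sb = new StringBuilder();
        while (view.hasRemaining()) {
            // mask to 0..255 so negative bytes format correctly
            sb.append(String.format("%02X", view.get() & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.wrap("example".getBytes(StandardCharsets.UTF_8));
        System.out.println(toHex(buf)); // 6578616D706C65
    }
}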

How can I find which value caused a bson.errors.InvalidStringData

I have a system that reads data from various sources and stores them in MongoDB. The data I receive is already properly encoded in UTF-8 or Unicode. Documents are loosely related and vary greatly in schema, if you will.
Every now and then, a document has a field value that is pure binary data, like a JPEG image. I know how to wrap that value in a bson.binary.Binary object to avoid the bson.errors.InvalidStringData exception.
Is there a way to tell which part of a document made the pymongo driver raise a bson.errors.InvalidStringData, or do I have to try to convert each field to find it?
(Also: if by chance a binary object happens to be a valid Unicode or UTF-8 string, it will be stored as a string and that's OK.)
PyMongo has two BSON implementations, one in Python for portability and one in C for speed. _make_c_string in the Python version will tell you what it failed to encode, but the C version, which is evidently what you're using, does not. You can tell which BSON implementation you have with import bson; bson.has_c(). I've filed PYTHON-533; it'll be fixed soon.
(Answering my own question)
You can't tell from the exception, and some rewrite of the driver would be required to support that feature.
The code is in bson/__init__.py. There is a function named _make_c_string that raises InvalidStringData when a string raises a UnicodeError as it is encoded to UTF-8. The same function is used for both keys and values that are strings.
In other words, at this point in code, the driver does not know if it is dealing with a key or value.
The offending data is passed as a raw string to the exception's constructor, but for a reason I don't understand, it does not come out of the driver.
>>> bad['zzz'] = '0\x82\x05\x17'
>>> try:
...     db.test.insert(bad)
... except bson.errors.InvalidStringData as isd:
...     print isd
...
strings in documents must be valid UTF-8
But that does not matter: you would have to look up the keys for that value anyway.
The best way is to iterate over the values, trying to decode them as UTF-8. If a UnicodeDecodeError is raised, wrap the value in a Binary object.
Somewhat like this:
try:
    # This code could deal with other encodings, like latin_1,
    # but that's not the point here
    value.decode('utf-8')
except UnicodeDecodeError:
    value = bson.binary.Binary(str(value))

NSCoding and integer arrays

How do you use NSCoding to code (and decode) an array of ten values of primitive type int? The obvious approach is to encode each integer individually (in a for-loop). But what if my array held one million integers? Is there a more satisfying alternative to using a for-loop here?
Edit (after first answer): And decode? (@Justin: I'll then tick your answer.)
If performance is your concern here: CFData/NSData is NSCoding compliant, so just wrap your serialized representation of the array as NSCFData.
Edit to detail encoding/decoding:
Your array of ints will need to be converted to a common endian format (what conversion is needed depends on the machine's endianness), e.g. always store it as little or big endian. During encoding, convert the array to integers in the chosen endianness and hand that to the NSData object, then pass the NSData representation to the NSCoder instance. At decode time, you'll receive an NSData object for the key, and you conditionally convert it to the native endianness of the machine. One set of byte-swapping routines available for OS X and iOS begins with OSSwap*.
Alternatively, see -[NSCoder encodeBytes:voidPtr length:numBytes forKey:key]. This routine also requires the client to swap endianness.

Reading byte stream returned from JavaEE server

We have a JavaEE server and servlets providing data to mobile clients (first JavaME, and soon iPhone). The servlet writes out data using the following code:
DataOutputStream dos = new DataOutputStream(out);
dos.writeInt(someInt);
dos.writeUTF(someString);
... and so on
This data is returned to the client as bytes in the HTTP response body, to reduce the number of bytes transferred.
In the iPhone app, the response payload is loaded into an NSData object. Now, after spending hours and hours trying to figure out how to read the data in the Objective-C application, I'm almost ready to give up, as I haven't found any good way to read the data into an NSInteger and NSString (corresponding to the protocol above).
Would anyone have any pointers on how to read data from a binary protocol written by a Java app? Any help is greatly appreciated!
Thanks!
You'll have to do the demarshalling yourself; fortunately, it's fairly straightforward. Java's DataOutputStream class writes integers in big-endian (network) format. So, to demarshall the integer, we grab 4 bytes and unpack them into a 4-byte integer.
For UTF-8 strings, DataOutputStream first writes a 2-byte value indicating the number of bytes that follow. We read that in, and then read the subsequent bytes. Then, to decode the string, we can use the NSString method initWithBytes:length:encoding: like so:
NSData *data = ...; // this comes from the HTTP request
int length = [data length];
const uint8_t *bytes = (const uint8_t *)[data bytes];
if(length < 4)
; // oops, handle error
// demarshall the big-endian (network order) integer from 4 bytes;
// assembling it with shifts already yields the value in host order,
// so no extra ntohl() conversion is needed afterwards
uint32_t myInt = ((uint32_t)bytes[0] << 24) | (bytes[1] << 16) | (bytes[2] << 8) | (bytes[3]);
// advance to next datum
bytes += 4;
length -= 4;
// demarshall the 2-byte, big-endian string length;
// again, the shifts already produce a host-order value, so ntohs() is not needed
if(length < 2)
; // oops, handle error
uint16_t myStringLen = (uint16_t)((bytes[0] << 8) | (bytes[1]));
bytes += 2;
length -= 2;
// make sure we actually have as much data as we say we have
if(myStringLen > length)
myStringLen = (uint16_t)length;
// demarshall the string
NSString *myString = [[NSString alloc] initWithBytes:bytes length:myStringLen encoding:NSUTF8StringEncoding];
bytes += myStringLen;
length -= myStringLen;
You can (and probably should) write functions to demarshall, so that you don't have to repeat this code for every field you want to demarshall. Also, be extra careful about buffer overflows. You're handling data sent over the network, which you should always distrust. Always verify your data, and always check your buffer lengths.
The main thing is to understand the binary data format itself. It doesn't matter what wrote it, so long as you know what the bytes mean.
As such, the docs for DataOutputStream are your best bet. They specify everything (hopefully) about what the binary data will look like.
Next, I would try to come up with a class on the iPhone which reads the same format into an appropriate data structure. I don't know Objective-C at all, but I'm sure that it can't be too hard to read 4 bytes, know that the first byte is the most significant, and do the appropriate bit-twiddling to get the right kind of integer. (Basically: read a byte, shift it left 8 bits, read the next byte and add it into the result, shift the whole lot left 8 bits, and so on.) There may well be more efficient ways of doing it, but get something that works first. When you've got unit tests around it all, you can move on to optimising it.
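For reference, the read side in Java itself is just the mirror image via DataInputStream, and it documents the exact layout the iPhone code has to reproduce: a 4-byte big-endian int from writeInt, then a 2-byte big-endian length followed by the string bytes from writeUTF (which uses Java's slightly modified UTF-8). A minimal sketch, with illustrative names:

import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

public class ReadBackSketch {
    // Reads a payload that was produced by: dos.writeInt(someInt); dos.writeUTF(someString);
    static void read(byte[] payload) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(payload));
        int someInt = in.readInt();       // 4 bytes, big-endian
        String someString = in.readUTF(); // 2-byte big-endian length, then modified UTF-8 bytes
        System.out.println(someInt + " / " + someString);
    }
}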
Don't forget that Objective-C is just C in a pretty dress--and C excels at this kind of bit-grovelling. To a large extent, you should be able to just define a C struct that looks like your data and cast the pointer to your data into a pointer to that struct. Now, exactly which types to use, and if you need to byte-swap anything, will depend on how Java constructs this stream; that's what you'll need to spend time with Java's documentation for.
Fundamentally, though, this is a design smell. You're having this problem because you made assumptions about your client platform that are no longer valid. If it's an option, I'd recommend you offer a second, more portable interface to the same functions (just adding "WithXML" wrappers or something should suffice). This will save you time if you ever end up porting to another platform that doesn't use Java.
Carl, if you really can't change what the server provides, have a look at this class. It should be the pointer you are looking for. That said: the idea of using native Java serialization as a transport format does not sound like a good one. My first choice would have been JSON. If that's still too big, I would probably rather use something like Thrift or Protocol Buffers. They also provide binary serialization, but in a cross-language manner. (There is also the oldie ASN.1, but that's painful.)