Efficiently pack a list of Longs in Scodec representation - scala

I have a case class with a List[Long] attribute that I am converting into a token using the Scodec library. Right now, it is not efficient (space-wise) because I am using this codec:
listOfN(uint16, int64)
This uses all 64 bits even though my Longs are never more than a few thousand (as of now). Is there a built-in way in the Scodec library to use only as many bits as are actually needed?
Thanks

If your long values are non-negative, try using the vpbcd codec:
listOfN(uint16, vpbcd)
This encodes each value using a variable-length packed binary-coded decimal format.
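To illustrate why a variable-length encoding helps here, the following is a minimal sketch in plain Scala of a base-128 varint (7 data bits per byte, with the high bit as a continuation flag). It is not scodec's implementation and does not reproduce the wire format of vpbcd; the object and method names are hypothetical.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical helper, not part of scodec: base-128 varint for non-negative Longs.
object VarLong {
  // Emit 7 bits per byte, least significant group first.
  // The high bit of each byte signals that more bytes follow.
  def encode(value: Long): Array[Byte] = {
    require(value >= 0, "only non-negative values are supported in this sketch")
    val out = ArrayBuffer.empty[Byte]
    var v = value
    while (v >= 0x80L) {
      out += ((v & 0x7FL) | 0x80L).toByte
      v >>>= 7
    }
    out += v.toByte
    out.toArray
  }

  // Decode one value starting at index 0.
  // Returns the value and the number of bytes consumed.
  def decode(bytes: Array[Byte]): (Long, Int) = {
    var result = 0L
    var shift = 0
    var i = 0
    var done = false
    while (!done) {
      val b = bytes(i) & 0xFF
      result |= (b & 0x7F).toLong << shift
      shift += 7
      i += 1
      done = (b & 0x80) == 0
    }
    (result, i)
  }
}
```

With this scheme, values from 128 to 16383 take 2 bytes each instead of the 8 used by int64.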

Related

Scodec: Using vectorOfN with a vlong field

I am playing around with the Bitcoin blockchain to learn Scala and some useful libraries.
Currently I am trying to decode and encode Blocks with scodec, and my problem is that the vectorOfN function takes its size as an Int. How can I use a long field for the size while still preserving the full value range?
In other words, is there a vectorOfLongN function?
This is my code which would compile fine if I were using vintL instead of vlongL:
object Block {
  implicit val codec: Codec[Block] = {
    ("header" | Codec[BlockHeader]) ::
    (("numTx" | vlongL) >>:~
      { numTx => ("transactions" | vectorOfN(provide(numTx), Codec[Transaction])).hlist })
  }.as[Block]
}
You may assume that appropriate Codecs for the BlockHeader and the Transactions are implemented. Actually, vlong is used as a simplification for this question, because Bitcoin uses its own codec for variable-sized ints.
I'm not a scodec specialist, but common sense suggests this is not possible: Scala's Vector, being a subtype of GenSeqLike, is limited to a length of type Int, and its apply method takes an Int index. As far as I understand, this limitation comes from the underlying JVM platform, where an array cannot have more than Integer.MAX_VALUE elements, i.e. around 2^31 (see also the "Criticism of Java" wiki article). Although Vector could in theory have worked around this limitation, it was not done. So it makes no sense for vectorOfN to support a Long size either.
In other words, if you want something like this, you probably should start from creating your own Vector-like class that does support Long indices working around JVM limitations.
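The Int limit described above can be made explicit: a count read as a Long has to be range-checked before it can serve as a collection size. A minimal sketch in plain Scala (the helper name is hypothetical and not part of scodec):

```scala
object SizeCheck {
  // Hypothetical helper: narrow a Long element count to an Int,
  // or report why it cannot be used as a collection size.
  def toCollectionSize(n: Long): Either[String, Int] =
    if (n < 0) Left(s"negative count: $n")
    else if (n > Int.MaxValue) Left(s"count $n exceeds Int.MaxValue (${Int.MaxValue})")
    else Right(n.toInt)
}
```

A count that fails this check cannot be held in a single Vector, whatever codec produced it.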
You may want to take a look at scodec-stream, which comes in handy when all of your data is not available immediately or does not fit into memory.
Basically, you would use your usual codecs.X and turn it into a StreamDecoder via scodec.stream.decode.many(normal_codec). This way you can work with the data through scodec without the need to load it into memory entirely.
A StreamDecoder then offers methods like decodeInputStream alongside scodec's usual decode.
(I used it a while ago in a slightly different context – parsing data sent by a client to a server – but it looks like it would apply here as well).
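The general idea of decoding one record at a time, instead of materialising the whole collection, can be sketched using only the JDK. This is not the scodec-stream API; the record layout (a big-endian 32-bit id followed by a 64-bit value) and all names are assumptions made for illustration.

```scala
import java.io.{ByteArrayOutputStream, DataInputStream, DataOutputStream, EOFException, InputStream}

object LazyRecords {
  // Hypothetical record type used only in this sketch.
  final case class Record(id: Int, value: Long)

  // Returns an Iterator that reads one Record per call to next(), so only
  // the current record is held in memory. Iteration stops at end of stream;
  // a truncated trailing record is silently dropped in this sketch.
  def records(in: InputStream): Iterator[Record] = {
    val dis = new DataInputStream(in)
    Iterator
      .continually {
        try Some(Record(dis.readInt(), dis.readLong()))
        catch { case _: EOFException => None }
      }
      .takeWhile(_.isDefined)
      .map(_.get)
  }

  // Helper for trying it out: serialise records into a byte array.
  def write(rs: Seq[Record]): Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    val dos = new DataOutputStream(bos)
    rs.foreach { r => dos.writeInt(r.id); dos.writeLong(r.value) }
    dos.flush()
    bos.toByteArray
  }
}
```

Because the result is an Iterator, the number of records is never needed as an Int up front; the caller simply consumes records until the stream ends.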

Serialize [Bit] to NSData in Swift

I'm implementing Huffman coding in Swift, and to benefit from it I need to serialize my one-zero sequences as efficiently as possible. The usual approach in such situations is to use bit arrays. Swift has a Bit type and it is no problem to convert sequences into [Bit], but I haven't found any standard way to get NSData from this array. The only way I see is to create an [Int8] (which can be serialized) and fill in all the bits manually using bitwise operators; in the worst case I will lose 7 bits in the last element.
So is there any standard (ready-to-use) solution for [Bit] serialization?
The simple answer is NO.
Bit in Swift is implemented as public enum Bit : Int, Comparable, RandomAccessIndexType, _Reflectable { ... }. I don't see any advantage in using the Bit type over Int, except the well-defined level of abstraction. The minimal NSData instance uses at least one byte (in reality, the amount of memory used depends on the underlying processor capabilities). Serialization is also just an abstraction: you could serialize your [Bit] as a sequence of words "Bit with binary value One", "Bit with binary value Zero", .... To save your bit sequence to NSData and be able to reconstruct (deserialize) it, you still need to define some kind of 'binary protocol'. At the very least you need to save the number of bits as part of your data. If you use Huffman coding, you need to save your symbol table too ...
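One possible 'binary protocol' of the kind described above is a 32-bit bit count followed by the bits packed eight per byte. The sketch below uses Scala rather than Swift, purely to illustrate the layout; the format and all names are assumptions, not a standard API.

```scala
import java.nio.ByteBuffer

object BitPacking {
  // Layout assumed for this sketch:
  // 4-byte big-endian bit count, then the bits packed MSB-first, 8 per byte.
  def pack(bits: Seq[Boolean]): Array[Byte] = {
    val payload = new Array[Byte]((bits.length + 7) / 8)
    bits.zipWithIndex.foreach { case (bit, i) =>
      if (bit) payload(i / 8) = (payload(i / 8) | (0x80 >>> (i % 8))).toByte
    }
    ByteBuffer.allocate(4 + payload.length).putInt(bits.length).put(payload).array()
  }

  // Reads the bit count first, so the padding bits in the last byte are ignored.
  def unpack(data: Array[Byte]): Vector[Boolean] = {
    val count = ByteBuffer.wrap(data).getInt()
    Vector.tabulate(count) { i =>
      (data(4 + i / 8) & (0x80 >>> (i % 8))) != 0
    }
  }
}
```

The stored count is what makes the up-to-7 padding bits in the final byte harmless on decode.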

Difference between NumberLong(x) and NumberLong("x")

I stored some XML data in a MongoDB database. Some of the data is stored as NumberLong(1234) and some as NumberLong("12353"). Example:
"nodes_id" : [
NumberLong(13879272),
NumberLong(252524625),
NumberLong(252524611),
NumberLong(252524630),
NumberLong(149809404),
NumberLong(605181143),
NumberLong("3068489546"),
NumberLong("3059300418"),
NumberLong(253351454),
NumberLong(253351438),
NumberLong(253348623),
NumberLong(253351472)
]
Is there any difference between the two types?
NumberLong(253351454) only works for numbers that are small enough that they don't need to be ... well, long: the shell must represent them in JS somehow, so it can only represent numbers that fit in a JS number.
For larger numbers, a textual representation is required because there's no large enough data type available, hence NumberLong("3059300418") with 3059300418 > 253351454.
In other words, no, there's no difference. It's just a limitation of the shell, or more generally speaking, of JS and floating point numbers.
Caveat: Don't try to invoke the constructor with a number that is too large, i.e. don't try db.foo.insert({"t" : NumberLong(1234657890132456789)}); Since that number is far too large for a double, it will cause roundoff errors. The number above would be converted to NumberLong("1234657890132456704"), which is obviously wrong.
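The roundoff in the caveat can be reproduced outside the shell, since a JavaScript number is an IEEE 754 double, the same format as a JVM Double. A small Scala sketch (the object and method names are illustrative):

```scala
object DoubleRoundoff {
  // Round-trip a Long through a Double, which is what effectively happens
  // when a bare numeric literal is parsed as a JS number.
  def viaDouble(n: Long): Long = n.toDouble.toLong

  // True when the value survives the round trip unchanged.
  def isExact(n: Long): Boolean = viaDouble(n) == n

  def main(args: Array[String]): Unit = {
    println(viaDouble(3059300418L))          // small enough to be exact
    println(viaDouble(1234657890132456789L)) // loses the low-order digits
  }
}
```

The second call yields 1234657890132456704, matching the value quoted above. Integers up to 2^53 survive the round trip; beyond that, passing the value as a string avoids the conversion.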

NSCoding and integer arrays

How do you use NSCoding to code (and decode) an array of ten values of primitive type int? Encode each integer individually (in a for-loop). But what if my array held one million integers? Is there a more satisfying alternative to using a for-loop here?
Edit (after first answer): And decode? (@Justin: I'll then tick your answer.)
If performance is your concern here: CFData/NSData is NSCoding compliant, so just wrap your serialized representation of the array as NSCFData.
edit to detail encoding/decoding:
Your array of ints will need to be converted to a common endian format (whether a swap is needed depends on the machine's endianness), e.g. always store it as little or big endian. During encoding, convert it to an array of integers in the specified endianness, which is passed to the NSData object. Then pass the NSData representation to the NSCoder instance. At decode, you'll receive an NSData object for the key, and you conditionally convert it to the native endianness of the machine when decoding it. One set of byte-swapping routines available for OS X and iOS begins with OSSwap*.
Alternatively, see -[NSCoder encodeBytes:voidPtr length:numBytes forKey:key]. This routine also requires the client to swap endianness.
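The fixed-endianness step can be illustrated on its own. The sketch below uses Scala and java.nio rather than Cocoa, only to show the byte layout; the choice of little-endian and all names are assumptions. The resulting byte array plays the role of the single blob handed to the archiver.

```scala
import java.nio.{ByteBuffer, ByteOrder}

object IntArrayBlob {
  // Serialise all ints in one pass using a fixed byte order
  // (little-endian is chosen arbitrarily for this sketch).
  def encode(values: Array[Int]): Array[Byte] = {
    val buf = ByteBuffer.allocate(values.length * 4).order(ByteOrder.LITTLE_ENDIAN)
    buf.asIntBuffer().put(values)
    buf.array()
  }

  // Decode with the same fixed byte order,
  // regardless of the machine's native endianness.
  def decode(bytes: Array[Byte]): Array[Int] = {
    val ints = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).asIntBuffer()
    val out = new Array[Int](ints.remaining())
    ints.get(out)
    out
  }
}
```

Encoding one million integers this way is a single bulk copy rather than one keyed call per element.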

Parsing of binary data with scala

I need to parse some simple binary files. (The files contain n entries, each consisting of several signed/unsigned integers of different sizes, etc.)
At the moment I do the parsing "by hand". Does anybody know a library that helps with this type of parsing?
Edit: "By hand" means that I read the data byte by byte, put the bytes into the correct order and convert them to an Int/Byte etc. Also, some of the data is unsigned.
I've used the sbinary library before and it's very nice. The documentation is a little sparse but I would suggest first looking at the old wiki page as that gives you a starting point. Then check out the test specifications, as that gives you some very nice examples.
The primary benefit of sbinary is that it gives you a way to describe the wire format of each object as a Format object. You can then encapsulate those formatted types in a higher level Format object and Scala does all the heavy lifting of looking up that type as long as you've included it in the current scope as an implicit object.
As I say below, I'd now recommend people use scodec instead of sbinary. As an example of how to use scodec, I'll implement how to read a binary representation in memory of the following C struct:
struct ST
{
    long long ll; // # 0
    int i;        // # 8
    short s;      // # 12
    char ch1;     // # 14
    char ch2;     // # 15
} ST;
A matching Scala case class would be:
case class ST(ll: Long, i: Int, s: Short, ch1: String, ch2: String)
I'm making things a bit easier for myself by just saying we're storing Strings instead of Chars and I'll say that they are UTF-8 characters in the struct. I'm also not dealing with endian details or the actual size of the long and int types on this architecture and just assuming that they are 64 and 32 respectively.
Scodec parsers generally use combinators to build higher-level parsers from lower-level ones. So below, we'll define a parser which combines an 8-byte value, a 4-byte value, a 2-byte value, a 1-byte value and one more 1-byte value. The result of this combination is a Tuple codec:
val myCodec: Codec[Long ~ Int ~ Short ~ String ~ String] =
  int64 ~ int32 ~ short16 ~ fixedSizeBits(8L, utf8) ~ fixedSizeBits(8L, utf8)
We can then transform this into the ST case class by calling the xmap function on it which takes two functions, one to turn the Tuple codec into the destination type and another function to take the destination type and turn it into the Tuple form:
val stCodec: Codec[ST] = myCodec.xmap[ST](
  { case ll ~ i ~ s ~ ch1 ~ ch2 => ST(ll, i, s, ch1, ch2) },
  st => st.ll ~ st.i ~ st.s ~ st.ch1 ~ st.ch2
)
Now, you can use the codec like so:
stCodec.encode(ST(1L, 2, 3.shortValue, "H", "I"))
=> res0: scodec.Attempt[scodec.bits.BitVector] = Successful(BitVector(128 bits, 0x00000000000000010000000200034849))
res0.flatMap(stCodec.decode)
=> res1: scodec.Attempt[scodec.DecodeResult[ST]] = Successful(DecodeResult(ST(1,2,3,H,I),BitVector(empty)))
I'd encourage you to look at the Scaladocs and not at the Guide as there's much more detail in the Scaladocs. The guide is a good start at the very basics but it doesn't get into the composition part much but the Scaladocs cover that pretty well.
Scala itself doesn't have a binary data input library, but the java.nio package does a decent job. It doesn't explicitly handle unsigned data--neither does Java, so you need to figure out how you want to manage it--but it does have convenience "get" methods that take byte order into account.
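As a minimal sketch of the java.nio approach, the following reads a few fields from a byte array with an explicit byte order and widens the unsigned values by masking. The field layout (an unsigned 16-bit count, an unsigned 32-bit id and a signed 64-bit timestamp, all little-endian) and the names are invented for illustration.

```scala
import java.nio.{ByteBuffer, ByteOrder}

object NioParse {
  // Hypothetical entry type; unsigned fields are stored in the next wider type.
  final case class Entry(count: Int, id: Long, timestamp: Long)

  def parse(bytes: Array[Byte]): Entry = {
    val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
    val count = buf.getShort() & 0xFFFF   // unsigned 16-bit, widened to Int
    val id = buf.getInt() & 0xFFFFFFFFL   // unsigned 32-bit, widened to Long
    val timestamp = buf.getLong()         // signed 64-bit, used as is
    Entry(count, id, timestamp)
  }
}
```

Each get call advances the buffer position, so the fields are read in the order they appear in the file.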
I don't know what you mean by "by hand", but using a simple DataInputStream is quite concise and clear:
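import java.io.DataInputStream

// yourSource is any java.io.InputStream, e.g. a FileInputStream.
// DataInputStream reads multi-byte values in big-endian order.
val dis = new DataInputStream(yourSource)
dis.readFloat()
dis.readDouble()
dis.readInt()
// and so on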
Taken from another SO question: http://preon.sourceforge.net/ appears to be a framework for binary encoding/decoding. See if it has the capabilities you need.
If you are looking for a Java based solution, then I will shamelessly plug Preon. You just annotate the in memory Java data structure, and ask Preon for a Codec, and you're done.
Byteme is a parser-combinator library for binary data. You could try using it for your task.