Parsing of binary data with scala

Parsing of binary data with scala - scala

I need to parse some simple binary Files. (The files contains n entries which consists of several signed/unsigned Integers of different sizes etc.)
In the moment i do the parsing "by hand". Does somebody know a library which helps to do this type of parsing?
Edit: "By hand" means that i get the Data Byte by Byte sort it in to the correct Order and convert it to an Int/Byte etc. Also some of the Data is unsigned.

I've used the sbinary library before and it's very nice. The documentation is a little sparse but I would suggest first looking at the old wiki page as that gives you a starting point. Then check out the test specifications, as that gives you some very nice examples.
The primary benefit of sbinary is that it gives you a way to describe the wire format of each object as a Format object. You can then encapsulate those formatted types in a higher level Format object and Scala does all the heavy lifting of looking up that type as long as you've included it in the current scope as an implicit object.
As I say below, I'd now recommend people use scodec instead of sbinary. As an example of how to use scodec, I'll implement how to read a binary representation in memory of the following C struct:
struct ST
{
long long ll; // # 0
int i; // # 8
short s; // # 12
char ch1; // # 14
char ch2; // # 15
} ST;
A matching Scala case class would be:
case class ST(ll: Long, i: Int, s: Short, ch1: String, ch2: String)
I'm making things a bit easier for myself by just saying we're storing Strings instead of Chars and I'll say that they are UTF-8 characters in the struct. I'm also not dealing with endian details or the actual size of the long and int types on this architecture and just assuming that they are 64 and 32 respectively.
Scodec parsers generally use combinators to build higher level parsers from lower level ones. So for below, we'll define a parser which combines a 8 byte value, a 4 byte value, a 2 byte value, a 1 byte value and one more 1 byte value. The return of this combination is a Tuple codec:
val myCodec: Codec[Long ~ Int ~ Short ~ String ~ String] =
int64 ~ int32 ~ short16 ~ fixedSizeBits(8L, utf8) ~ fixedSizeBits(8L, utf8)
We can then transform this into the ST case class by calling the xmap function on it which takes two functions, one to turn the Tuple codec into the destination type and another function to take the destination type and turn it into the Tuple form:
val stCodec: Codec[ST] = myCodec.xmap[ST]({case ll ~ i ~ s ~ ch1 ~ ch2 => ST(ll, i, s, ch1, ch2)}, st => st.ll ~ st.i ~ st.s ~ st.ch1 ~ st.ch2)
Now, you can use the codec like so:
stCodec.encode(ST(1L, 2, 3.shortValue, "H", "I"))
res0: scodec.Attempt[scodec.bits.BitVector] = Successful(BitVector(128 bits, 0x00000000000000010000000200034849))
res0.flatMap(stCodec.decode)
=> res1: scodec.Attempt[scodec.DecodeResult[ST]] = Successful(DecodeResult(ST(1,2,3,H,I),BitVector(empty)))
I'd encourage you to look at the Scaladocs and not at the Guide as there's much more detail in the Scaladocs. The guide is a good start at the very basics but it doesn't get into the composition part much but the Scaladocs cover that pretty well.

Scala itself doesn't have a binary data input library, but the java.nio package does a decent job. It doesn't explicitly handle unsigned data--neither does Java, so you need to figure out how you want to manage it--but it does have convenience "get" methods that take byte order into account.

I don't know what you mean with "by hand" but using a simple DataInputStream (apidoc here) is quite concise and clear:
val dis = new DataInputStream(yourSource)
dis.readFloat()
dis.readDouble()
dis.readInt()
// and so on
Taken from another SO question: http://preon.sourceforge.net/, it should be a framework to do binary encoding/decoding.. see if it has the capabilities you need

If you are looking for a Java based solution, then I will shamelessly plug Preon. You just annotate the in memory Java data structure, and ask Preon for a Codec, and you're done.

Byteme is a parser combinators library for doing binary. You can try to use it for your tasks.

Related

How to understand this line of chisel code

I'm in the process of learning chisel and scala language and try to analyse some lines of rocket-chip code.Could anyone try to explain me this line? https://github.com/chipsalliance/rocket-chip/blob/54237b5602a273378e3b35307bb47eb1e58cb9fb/src/main/scala/rocket/RocketCore.scala#L957
I understand what log2Up function is doing but don't understand why that log2Up(n)-1 and 0,were passed like "arguments" to addr which is val of type UInt!?

I could not find where UInt was defined, but if I had to guess, UInt is a class that has an apply method. This is a special method that allows us to use a parenthesis operator on an instance of the class.
For example lets say we have a class called Multiply that defines an apply method.
class Multiply {
def apply(a: Int, b: Int): Int = a * b
}
This allows you to call operator () on any instance of that class. For example:
val mul = new Multiply()
println(mul(5, 6)) // prints 30

What i concluded is that we use addr(log2Up(n)-1, 0) to get addres bits starting from zero up to log2Up(n)-1 bit. Lets take an example.
If we made object of class RegFile in this way
val reg = RegFile(31, 10)
First, memory rf is created. Size of that memory is 31 data of type UInt width of 10, starting from 0 up to 30.
When we compute log2Up(n)-1 we get 4, and we have something like this: addr(4,0). This gives us last five bits of addr. Like #Jack Koenig said in one of the comment above: "Rocket's register file uses a little trick where it reverses the order of the registers physically compared to the RISC-V", that's why we use ~addr. And at least rf(~addr) gives us back what is in that memory location.
This is implemented in this way to provide adequate memory access.
Take a look what whould be if we try to get data from memory location that we don't have in our memory. If method access was called in this way
access(42)
We try to access memory location on 41th place, but we only have 31 memory location(30 is top). 42 binary is 101010. Using what i said above
~addr(log2Up(n)-1,0)
would return us 10101 or 21 in decimal. Because order of registers is reversed this is
10th memory location (we try to access 41th but only have 31, 41 minus 31 is 10).

Scodec: Using vectorOfN with a vlong field

I am playing around with the Bitcoin blockchain to learn Scala and some useful libraries.
Currently I am trying to decode and encode Blocks with SCodec and my problem is that the vectorOfN function takes its size as an Int. How can I use a long field for the size while still preserving the full value range.
In other words is there a vectorOfLongN function?
This is my code which would compile fine if I were using vintL instead of vlongL:
object Block {
implicit val codec: Codec[Block] = {
("header" | Codec[BlockHeader]) ::
(("numTx" | vlongL) >>:~
{ numTx => ("transactions" | vectorOfN(provide(numTx), Codec[Transaction]) ).hlist })
}.as[Block]
}
You may assume that appropriate Codecs for the Blockheader and the Transactions are implemented. Actually, vlong is used as a simplification for this question, because Bitcoin uses its own codec for variable sized ints.

I'm not a scodec specialist but my common sense suggests that this is not possible because Scala's Vector being a subtype of GenSeqLike is limited to have length of type Int and apply that accepts Int index as its argument. And AFAIU this limitation comes from the underlying JVM platform where you can't have an array of size more than Integer.MAX_VALUE i.e. around 2^31 (see also "Criticism of Java" wiki). And although Vector theoretically could have work this limitation around, it was not done. So it makes no sense for vectorOfN to support Long size as well.
In other words, if you want something like this, you probably should start from creating your own Vector-like class that does support Long indices working around JVM limitations.

You may want to take a look at scodec-stream, which comes in handy when all of your data is not available immediately or does not fit into memory.
Basically, you would use your usual codecs.X and turn it into a StreamDecoder via scodec.stream.decode.many(normal_codec). This way you can work with the data through scodec without the need to load it into memory entirely.
A StreamDecoder then offers methods like decodeInputStream along scodec's usual decode.
(I used it a while ago in a slightly different context – parsing data sent by a client to a server – but it looks like it would apply here as well).

Working with opaque types (Char and Long)

I'm trying to export a Scala implementation of an algorithm for use in JavaScript. I'm using #JSExport. The algorithm works with Scala Char and Long values which are marked as opaque in the interoperability guide.
I'd like to know (a) what this means; and (b) what the recommendation is for dealing with this.
I presume it means I should avoid Char and Long and work with String plus a run-time check on length (or perhaps use a shapeless Sized collection) and Int instead.
But other ideas welcome.
More detail...
The kind of code I'm looking at is:
#JSExport("Foo")
class Foo(val x: Int) {
#JSExport("add")
def add(n: Int): Int = x+n
}
...which works just as expected: new Foo(1).add(2) produces 3.
Replacing the types with Long the same call reports:
java.lang.ClassCastException: 1 is not an instance of scala.scalajs.runtime.RuntimeLong (and something similar with methods that take and return Char).

Being opaque means that
There is no corresponding JavaScript type
There is no way to create a value of that type from JavaScript (except if there is an #JSExported constructor)
There is no way of manipulating a value of that type (other than calling #JSExported methods and fields)
It is still possible to receive a value of that type from Scala.js code, pass it around, and give it back to Scala.js code. It is also always possible to call .toString(), because java.lang.Object.toString() is #JSExported. Besides toString(), neither Char nor Long export anything, so you can't do anything else with them.
Hence, as you have experienced, a JavaScript 1 cannot be used as a Scala.js Long, because it's not of the right type. Neither is 'a' a valid Char (but it's a valid String).
Therefore, as you have inferred yourself, you must indeed avoid opaque types, and use other types instead if you need to create/manipulate them from JavaScript. The Scala.js side can convert back and forth using the standard tools in the language, such as someChar.toInt and someInt.toChar.
The choice of which type is best depends on your application. For Char, it could be Int or String. For Long, it could be String, a pair of Ints, or possibly even Double if the possible values never use more than 52 bits of precision.

Embedding XML (and other languages?) in Scala

I'm wondering how the scala.xml library is implemented, to get an Elem-instance out of XML.
So I can write:
val xml = {
<myxml>
Some wired text withoud "'s or code like
import x
x.func()
It's like a normal sting in triple-quotes.
</myxml>
}
xml.text
String =
"
Some text wired withoud "'s or code like
import x
x.func()
It's like a normal sting in triple-quotes.
"
A look at the source code doesn't gave me the insight, how this is achieved.
Is the "XML-detection" a (hard) scala language feature or is it an internal DSL? Because I would like to build my own things like this:
var x = LatexCode {
\sqrt{\frac{a}{b}}
}
x.toString
"\sqrt{\frac{a}{b}}"
or
var y = PythonCode {
>>> import something
>>> something.func()
}
y.toString
"""import something ..."""
y.execute // e.g. passed to python as python-script
or
object o extends PythonCode {
import x
x.y()
}
o.toString
"""import x...."""
I would like to avoid using such things like PythonCode { """import ...""" } as "DSL". And in scala, XML is magically transported to a scala.xml-Class; same with Symbol which I can get with val a = 'symoblname, but in the source code there's no clue how this is implemented.
How can I do something like that on myself, preferably as internal DSL?

XML is a scala language feature (*) - see the SLS, section 1.5.
I think that string interpolation is coming in 2.10, however, which would at least allow you to define your own DSL:
val someLatex = latex"""\sqrt{\frac{a}{b}}}"""
It's an experimental feature explained more fully in the SIP but has been additionally blogged about by the prolific Daniel Sobral. The point of this is (of course) that the correctness of the code in the String can be checked at compile time (well, to the extent possible in an untyped language :-) and your IDE can even help you write it (well, to the extent possible in an untyped language :-( )
(*) - We might expect this to change in the future given the many shortcomings of the implementation. My understanding is that a combination of string interpolation and anti-xml may yet be the one true way.

It's an XML literal, just like "foo" is a string literal, 42 is an integer literal, 12.34 is a floating point literal, 'foo is a symbol literal, (foo) => foo + 1 is a function literal and so on.
Scala has fewer literals than other languages (for example, it doesn't have array literals or regexp literals), but it does have XML literals.
Note that in Scala 2.10 string literals become vastly more powerful by allowing you to intercept and re-interpret them using StringContexts. These more powerful string literals would allow you to implement all of your snippets, including XML without separate language support. It is likely that XML literals will be removed in a future version of the language.

Count all occurrences of a char within a string

Does Scala have a native way to count all occurrences of a character in a string?
If so, how do I do it?
If not, do I need to use Java? If so, how do I do that?
Thanks!

"hello".count(_ == 'l') // returns 2

i don't use Scala or even java but google search for "Scala string" brought me to here
which contains :
def
count (p: (Char) ⇒ Boolean): Int
Counts the number of elements in the string which satisfy a predicate.
p
the predicate used to test elements.
returns
the number of elements satisfying the predicate p.
Definition Classes
TraversableOnce → GenTraversableOnce
Seems pretty straight forward but i dont use Scala so don't know the syntax of calling a member function. May be more overhead than needed this way because it looks like it can search for a sequence of characters. read on a different result page a string can be changed into a sequence of characters and you can probably easily loop through them and increase a counter.

You can also take a higher level approach to look into substring occurrences within another string, by using sliding:
def countSubstring(str: String, sub: String): Int =
str.sliding(sub.length).count(_ == sub)