Taking a substring of a ByteString in Akka/Scala

I'm using Akka to develop a server application and was wondering whether there is a "cleaner" way to get a substring of a ByteString, something like
bytestr.getSubstringAtFor(int start, int len): ByteString
or similar. Right now I'm converting the ByteString to a list, creating another List[Byte], looping over it with a for loop and copying the relevant bytes to my new list, then converting that list of bytes back to a ByteString.
Is there a "cleaner" way to get a substring of a ByteString?

You can use slice to get a contiguous subset of the bytes. It takes a start index that is inclusive and an end index that is exclusive. For instance, if you had a ByteString wrapping the string "foobar" and wanted a ByteString of just "oob", that would look like this:
import akka.util.ByteString
val bs = ByteString("foobar")
val subbs = bs.slice(1, 4) // bytes at indices 1, 2 and 3: "oob"
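If the start-plus-length shape from the question is more convenient, it maps directly onto slice. The sketch below uses a plain Vector[Byte] from the standard library so that it runs without Akka on the classpath. subBytes is a hypothetical helper name, and the assumption is that ByteString, being an IndexedSeq[Byte], accepts the same slice(start, start + len) call.

```scala
import java.nio.charset.StandardCharsets.UTF_8

// Hypothetical helper: start index plus length, expressed via slice.
// slice clamps out-of-range indices instead of throwing.
def subBytes(bytes: Vector[Byte], start: Int, len: Int): Vector[Byte] =
  bytes.slice(start, start + len)

val foobar: Vector[Byte] = "foobar".getBytes(UTF_8).toVector
val oob: Vector[Byte] = subBytes(foobar, 1, 3)
val oobText: String = new String(oob.toArray, UTF_8)
```

Because slice clamps its arguments, asking for more bytes than remain simply returns the tail rather than failing.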

Related

How to get the BSON from a ReactiveMongo BSONDocument in scala?

I have a ReactiveMongo BSONDocument that I want to write to a file. I know there's the BSON format (http://bsonspec.org/spec.html) and I want to write it according to that spec, but I can't find any method call to do this. I've been able to convert it to an array of bytes, but the trouble starts when I convert that to a string, which uses UTF-8 by default.
However, the BSON spec requires a 32-bit number at the beginning. Is there a library that can do this for me? If not, how can I combine a string representing a 32-bit number with a UTF-8 string without losing the encoding of either?
Here's what I've got in Scala:
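import java.nio.charset.Charset
import reactivemongo.bson.BSONDocument
import reactivemongo.bson.buffer.ArrayBSONBuffer
val doc = BSONDocument("data" -> overall)
val buffer = new ArrayBSONBuffer()
BSONDocument.write(doc, buffer)
val bytes = buffer.array // raw BSON bytes
val str = new String(bytes, Charset.forName("UTF8")) // decodes the binary data as text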
For reference, I know in Ruby, we can do something like this, but how do I do the same thing with ReactiveMongo?
bson_data = BSON.serialize({data: arr}).to_s
As indicated in the documentation, you can use BSONDocument.pretty(myDoc).
Note that you are using the deprecated BSON API, which is being removed.
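If the goal is a .bson file rather than a readable dump, one option is to skip the String conversion and write the byte array as-is. This is a minimal sketch using only the JDK. The hand-built array stands in for buffer.array from the question (it is the {"hello": "world"} example document from bsonspec.org), and the temp file name is arbitrary. It also reads the leading int32 length back, which BSON stores little-endian.

```scala
import java.nio.{ByteBuffer, ByteOrder}
import java.nio.file.{Files, Path}

// Stand-in for `buffer.array`: the BSON encoding of {"hello": "world"}.
val bytes: Array[Byte] =
  Array[Byte](0x16, 0, 0, 0, 0x02) ++ "hello".getBytes("UTF-8") ++
    Array[Byte](0, 0x06, 0, 0, 0) ++ "world".getBytes("UTF-8") ++
    Array[Byte](0, 0)

// Write the raw bytes; no charset is involved, so nothing is re-encoded.
val path: Path = Files.write(Files.createTempFile("doc", ".bson"), bytes)

// Read the file back and decode the 4-byte length prefix (little-endian).
val readBack: Array[Byte] = Files.readAllBytes(path)
val declaredLength: Int =
  ByteBuffer.wrap(readBack).order(ByteOrder.LITTLE_ENDIAN).getInt
```

The length prefix is part of the serialized bytes themselves, so it does not need to be prepended separately when the document is written this way.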

Copy all elements in RDD to Array

So, I'm reading data from a JSON file and creating a DataFrame. Usually, I would use
sqlContext.read.json("//line//to//some-file.json")
Problem is that my JSON file isn't consistent. So, for each line in the file, there are 2 JSONs. Each line looks like this
{...data I don't need....}, {...data I need....}
I only need my DataFrame to be formed from the data I need, i.e. the second JSON of each line. So I read each line as a string and substring the part that I need, like so
val lines = sc.textFile(link, 2)
val part = lines.map( x => x.substring(x.lastIndexOf('{')).trim)
I want to get all the elements in 'part' as an Array[String] then turn the Array[String] into one string and make the DataFrame. Like so
val strings = part.collect() // doesn't work
val strings = part.take(1000) // works
val jsonStr = "[".concat(strings.mkString(", ")).concat("]")
The problem is, if I call part.collect(), it doesn't work but if I call part.take(N) it works. However, I'd like to get all my data and not just the first N.
Also, if I try part.take(part.count().toInt) it still doesn't work.
Any ideas?
EDIT
I realized my problem after a good night's sleep. It was a silly mistake on my part: the very last line of my input file had a different format from the rest of the file.
So part.take(N) worked for every N less than part.count(), because it never reached that last line, while part.collect() always did and failed on it.
Thanks for the help though!
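The failure mode described in the edit can be reproduced without Spark. In the sketch below a plain Seq stands in for the RDD, extractLastJson is a hypothetical helper, and the odd final line is assumed to contain no '{' at all (the question does not say what it actually looked like). With the original expression, lastIndexOf returns -1 for such a line and substring throws. Returning an Option lets flatMap drop the line instead.

```scala
import scala.util.Try

// Hypothetical helper: keep the text from the last '{' onward, if there is one.
def extractLastJson(line: String): Option[String] = {
  val idx = line.lastIndexOf('{')
  if (idx >= 0) Some(line.substring(idx).trim) else None
}

// Stand-in for the RDD of lines; the last entry mimics the malformed line.
val sampleLines: Seq[String] = Seq(
  """{"skip": 1}, {"keep": 1}""",
  """{"skip": 2}, {"keep": 2}""",
  "trailing footer"
)

// The original expression fails on the odd line (substring(-1) throws).
val originalOnOddLine: Try[String] =
  Try("trailing footer".substring("trailing footer".lastIndexOf('{')))

// The defensive version skips it and builds the JSON array string.
val parts: Seq[String] = sampleLines.flatMap(extractLastJson)
val jsonStr: String = parts.mkString("[", ", ", "]")
```

Like the original, this relies on the wanted object containing no nested braces, since lastIndexOf('{') would otherwise cut into the middle of it.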

Parsing options that take more than one value with scopt in scala

I am using scopt to parse command line arguments in scala. I want it to be able to parse options with more than one value. For instance, the range option, if specified, should take exactly two values.
--range 25 45
Coming from a Python background, I am basically looking for a way to do the following with scopt instead of Python's argparse:
parser.add_argument("--range", default=None, nargs=2, type=float,
metavar=('start', 'end'),
help=(" Foo bar start and stop "))
I don't think minOccurs and maxOccurs solve my problem exactly, nor does the key:value example in its help.
Looking at the source code, this is not possible. The Read type class has a member tokensToRead, but it does not seem to work when you force it to 2 instead of 1. You will have to file a feature request, I guess, or work around this by using --min 25 --max 45, or --range '25 45' with a custom Read instance that splits the string into two parts. As @roterl noted, this is not a standard way of parsing.
It should be OK as long as your values are delimited by something other than a space...
--range 25-45
... although you need to split them manually. Parse it with something like:
opt[String]('r', "range").action { (x, c) =>
  // Throws a MatchError if x is not of the form <digits>-<digits>
  val rx = "([0-9]+)\\-([0-9]+)".r
  val rx(from, to) = x
  c.copy(from = from.toInt, to = to.toInt)
}
// ...
println(s" Got range ${parsedArgs.from}..${parsedArgs.to}")
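The pattern binding val rx(from, to) = x throws a MatchError when the argument does not look like 25-45. Here is a sketch of a more forgiving parser, using only the standard library. parseRange and its error message are illustrative names, not part of scopt.

```scala
// Matches the whole argument: digits, a dash, digits.
val RangePattern = "([0-9]+)-([0-9]+)".r

// Hypothetical helper: Right((from, to)) on success, Left(message) otherwise.
def parseRange(s: String): Either[String, (Int, Int)] = s match {
  case RangePattern(from, to) => Right((from.toInt, to.toInt))
  case _                      => Left(s"expected START-END, got '$s'")
}

val ok: Either[String, (Int, Int)] = parseRange("25-45")
val bad: Either[String, (Int, Int)] = parseRange("25 45")
```

The Either can then be inspected inside the option's action, so a malformed value becomes a reported error rather than an exception.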

Spark: RDD.saveAsTextFile when using a pair of (K,Collection[V])

I have a dataset of employees and their leave-records. Every record (of type EmployeeRecord) contains EmpID (of type String) and other fields. I read the records from a file and then transform into PairRDDFunctions:
val empRecords = sc.textFile(args(0))
....
val empsGroupedByEmpID = this.groupRecordsByEmpID(empRecords)
At this point, 'empsGroupedByEmpID' is of type RDD[(String, Iterable[EmployeeRecord])]. I transform this into PairRDDFunctions:
val empsAsPairRDD = new PairRDDFunctions[String,Iterable[EmployeeRecord]](empsGroupedByEmpID)
Then I process the records according to the application's logic. Finally, I get an RDD of type RDD[Iterable[EmployeeRecord]]:
val finalRecords: RDD[Iterable[EmployeeRecord]] = <result of a few computations and transformation>
When I try to write the contents of this RDD to a text file using the available API thus:
finalRecords.saveAsTextFile("./path/to/save")
then I find that in the file every record begins with ArrayBuffer(...). What I need is a file with one EmployeeRecord on each line. Is that not possible? Am I missing something?
I have spotted the missing API. It is well...flatMap! :-)
By using flatMap with identity, I can get rid of the Iterable and 'unpack' the contents, like so:
finalRecords.flatMap(identity).saveAsTextFile("./path/to/file")
That solves the problem I have been having.
I also found this post suggesting the same thing. I wish I had seen it a bit earlier.
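The effect of flatMap(identity) can be seen on ordinary Scala collections. In this sketch a Seq stands in for the RDD, and the fields of EmployeeRecord are made up for illustration, since the question does not show them.

```scala
// Hypothetical record type; the question does not show EmployeeRecord's fields.
final case class EmployeeRecord(empId: String, daysOff: Int)

// Stand-in for RDD[Iterable[EmployeeRecord]]: one group per element.
val grouped: Seq[Iterable[EmployeeRecord]] = Seq(
  Seq(EmployeeRecord("E1", 2), EmployeeRecord("E1", 5)),
  Seq(EmployeeRecord("E2", 1))
)

// flatMap(identity) unpacks each group, yielding one record per element.
val flat: Seq[EmployeeRecord] = grouped.flatMap(identity)

// One line per record, analogous to saveAsTextFile calling toString on each element.
val lines: Seq[String] = flat.map(_.toString)
```

Without the flattening step each element is a whole collection, so its toString (for example ArrayBuffer(...)) is what ends up on the line.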

Haskell mongodb: Convert Binary value back to ByteString

This is something simple and stupid that I just can't see.
If a new type is defined:
newtype Binary
Constructors
Binary ByteString
Instances:
Eq Binary
Ord Binary
Read Binary
Show Binary
Typeable Binary
Val Binary
How can I deconstruct the Binary value to get the ByteString back?
If I want to save a binary data into mongodb, say a jpg picture, I am able to construct the Val Binary type out of ByteString read from the filesystem. I then insert it into a document.
When I read the data back from the database and take it out of the document, I end up with the Binary type and I am stuck with it. I cannot get the ByteString back for use with ByteString.writeFile.
So skipping all the connection stuff the flow is like this:
file <- B.readFile "pic.jpg" -- reading file
let doc = ["file" =: (Binary file)] -- constructing a document to be inserted
run $ insert_ "files" doc -- insert the document
r <- run $ fetch (select [] "files") -- get Either Failure Document back from db
let d = either (error . show) id r -- Get the Document out
let f = at "file" d :: Binary -- Get the data out of the document of type Binary
Thank you.
Assuming your newtype looks like this,
newtype Binary = Binary ByteString
then you can simply pattern match on the constructor to get the ByteString back:
unBinary :: Binary -> ByteString
unBinary (Binary s) = s