I'm new to Scala and Akka. I'm using LoggingAdapter to log ByteString messages, and I keep running into a scenario where a decently large message gets truncated in the log output.
I need to see the entire message. Please help me here.
This is because of the logic in the akka.util.ByteString.toString method:
override def toString(): String = {
  val maxSize = 100
  if (size > maxSize)
    take(maxSize).toString + s"... and [${size - maxSize}] more"
  else
    super.toString
}
You can convert the ByteString to a List, and that will print everything:
val myBs: ByteString = ??? //some very long ByteString
println(myBs) //this will get truncated
println(myBs.toList) //but this will not
Alternatively, we can convert the ByteString to an array and then print the individual bytes separated by spaces:
val myArr: Array[Byte] = byteStringMessage.toArray
log.info(s"byte array: ${myArr.mkString(" ")}")
Related
I have a use case in my program where I need to take a file, split it into N roughly equal parts, and upload the parts remotely.
I'd like a function that takes, say, a File and outputs a list of BufferedReaders, which I can then distribute and send to another function that uses some API to store them.
I've seen examples where authors utilize the .lines() method of a BufferedReader:
import java.io.{BufferedReader, ByteArrayInputStream, InputStreamReader}

def splitFile: List[java.util.stream.Stream[String]] = {
  val temp = "Test mocked file contents\nTest"
  val is = new ByteArrayInputStream(temp.getBytes)
  val br = new BufferedReader(new InputStreamReader(is))
  // Chunk the file into two sort-of equal parts.
  // Stream 1
  val test = br.lines().skip(1).limit(1)
  // Stream 2
  val test2 = br.lines().skip(2).limit(1)
  List(test, test2)
}
I suppose the above example works; it's not beautiful, but it works.
My questions:
Is there a way to split a BufferedReader into multiple lists of streams?
I don't control the format of the file, so the contents could potentially be one single long line. Wouldn't that just mean that .lines() loads all of it into a Stream of one element?
It will be much easier to read the file once, and then split it up. You're not really losing anything either, since the file read is a serial operation anyway. From there, just devise a scheme to slice up the list. In this case, I group everything based on its index modulo the number of desired output lists, then pull out the lists. If there are fewer lines than you ask for, it will just place each line in a separate List.
import scala.jdk.CollectionConverters._ // scala.collection.JavaConverters on Scala < 2.13

// br.lines() gives a java.util.stream.Stream[String], so collect it into a Scala List first
val lines: List[String] = br.lines.iterator.asScala.toList
val outputListQuantity: Int = 2
val data: List[List[String]] = lines.zipWithIndex
  .groupBy(_._2 % outputListQuantity)
  .values.map(_.map(_._1)).toList
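For instance, with some hypothetical sample lines, the modulo grouping splits them round-robin (the order of groupBy's values is unspecified, so the outer list may come back in either order):
val sample = List("a", "b", "c", "d", "e")
val split = sample.zipWithIndex
  .groupBy(_._2 % 2)
  .values.map(_.map(_._1)).toList
// yields List("a", "c", "e") and List("b", "d")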
Well, if you don't mind reading the whole stream into memory, it's easy enough (assuming the file contains text - since you are talking about Readers - but it would be the same idea with binary):
import java.io.{BufferedReader, ByteArrayInputStream, InputStreamReader}
import scala.io.Source

Source.fromFile("filename")
  .mkString
  .getBytes
  .grouped(chunkSize)
  .map(chunk => new BufferedReader(new InputStreamReader(new ByteArrayInputStream(chunk))))
But that seems to sorta defeat the purpose: if the file is small enough to be loaded into memory entirely, why bother splitting it to begin with?
So, a more practical solution is a little bit more involved:
import java.io.{ByteArrayInputStream, InputStream}
import scala.collection.AbstractIterator

def splitFile(
  input: InputStream,
  chunkSize: Int
): Iterator[InputStream] = new AbstractIterator[InputStream] {
  var hasNext = true

  def next() = {
    val buffer = new Array[Byte](chunkSize)
    val bytes = input.read(buffer)
    // a short (or -1) read means the underlying stream is exhausted
    hasNext = bytes == chunkSize
    new ByteArrayInputStream(buffer, 0, bytes max 0)
  }
}
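A quick usage sketch, with a hypothetical uploadPart function standing in for the remote API call:
val file = new java.io.FileInputStream("some-big-file")
try {
  splitFile(file, 1024 * 1024).foreach { part =>
    uploadPart(part) // hypothetical: ship each chunk off for storage
  }
} finally {
  file.close()
}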
I'm working on a process of uploading files from S3 to Facebook using Akka. According to the Facebook API docs, files should be uploaded in small parts - chunks. Based on the file size, Facebook gives you information about the byte offsets it expects to receive in the next request.
First I make a GetObjectRequest to S3 via the Java AWS SDK, in order to receive a chunk of the required byte size:
val objChunkReq = new GetObjectRequest(get.s3ObjId.bucketName, get.s3ObjId.key)
objChunkReq.setRange(get.fbUploadSession.from, get.fbUploadSession.to)
Try(s3Client.getObject(objChunkReq)) match {
  case Success(s3ObjChunk) => Right(S3ObjChunk(s3ObjChunk, get.fbUploadSession))
  case Failure(ex)         => Left(S3Exception(ex.getMessage))
}
Then, if the S3 response is successful, I can work with the received chunk as an InputStream and pass it on into the Facebook HTTP request:
private def inputStreamToArrayByte(is: InputStream) = {
  Try {
    val reads: Int = is.read()
    val byteStringBuilder = ByteString.newBuilder
    while (is.read() != -1) {
      byteStringBuilder.asOutputStream.write(reads)
      is.read()
    }
    is.close()
    byteStringBuilder.result()
  }
}
The issue I faced is that the size of s3ObjChunk from the first code snippet is twice the size in bytes of the resulting ByteString from the second snippet:
s3ObjChunk.getObjectMetadata.getContentLength == n
byteStringBuilder.result().length == n / 2
I have two assumptions:
a) I transform the InputStream into ByteString incorrectly
b) The ByteString compresses the InputStream
How to transform an S3 object InputStream into a ByteString correctly?
The n vs n / 2 discrepancy in the resulting output is explained by a bug in the implementation.
is.read() is called twice per loop iteration - once in the condition and once in the body - and neither of those return values is ever written to the output stream. Only the first byte, stored in val reads before the loop, is written repeatedly, so the loop consumes two bytes for every single (wrong) byte it emits.
The implementation should change to something like:
val byteStringBuilder = ByteString.newBuilder
val output = byteStringBuilder.asOutputStream
try {
var reads: Int = is.read() // note "var" instead of "val"
while (reads != -1) {
output.write(reads)
reads = is.read()
}
} finally {
is.close() // should it be here or closed by the caller?
// also close "output"
}
byteStringBuilder.result()
Or, another approach would be to use slightly more idiomatic stream reading with scala.io.Source, for example (note that Source decodes the input as characters using a codec, so this is only safe if the chunk is text rather than arbitrary binary data):
val byteStringBuilder = ByteString.newBuilder
val output = byteStringBuilder.asOutputStream
scala.io.Source.fromInputStream(is).foreach(output.write(_))
byteStringBuilder.result()
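If the chunks can be large, a buffered read avoids the byte-at-a-time overhead. A minimal sketch, assuming akka.util.ByteString's builder (putBytes is part of the ByteStringBuilder API):
val builder = ByteString.newBuilder
val buffer = new Array[Byte](8192)
var read = is.read(buffer)
while (read != -1) {
  builder.putBytes(buffer, 0, read) // append only the bytes actually read
  read = is.read(buffer)
}
is.close()
builder.result()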
I'm trying to implement Huffman compression. After encoding the char into 0 and 1, how do I write it to a file so it'll be compressed? Obviously simply writing the characters 0,1 will only make the file larger.
Let's say I have a string "0101010" which represents bits.
I wish to write the string into file but as in binary, i.e. not the char '0' but the bit 0.
I tried using getBytesArray() on the string, but it seems to make no difference compared with simply writing the string.
Although Sarvesh Kumar Singh's code will probably work, it looks so Javish to me that I think this question needs one more answer. The way I imagine Huffman coding being implemented in Scala is something like this:
import scala.collection._
import scala.io.Source

type Bit = Boolean
type BitStream = Iterator[Bit]
type BitArray = Array[Bit]
type ByteStream = Iterator[Byte]
type CharStream = Iterator[Char]

case class EncodingMapping(charMapping: Map[Char, BitArray], eofCharMapping: BitArray)

def buildMapping(src: CharStream): EncodingMapping = {
  def accumulateStats(src: CharStream): Map[Char, Int] = ???

  def buildMappingImpl(stats: Map[Char, Int]): EncodingMapping = ???

  val stats = accumulateStats(src)
  buildMappingImpl(stats)
}

def convertBitsToBytes(bits: BitStream): ByteStream = {
  bits.grouped(8).map(bits => {
    val res = bits.foldLeft((0.toByte, 0))((acc, bit) => ((acc._1 * 2 + (if (bit) 1 else 0)).toByte, acc._2 + 1))
    // the last byte might be less than 8 bits
    if (res._2 == 8)
      res._1
    else
      (res._1 << (8 - res._2)).toByte
  })
}

def encodeImpl(src: CharStream, mapping: EncodingMapping): ByteStream = {
  val mainData = src.flatMap(ch => mapping.charMapping(ch))
  val fullData = mainData ++ mapping.eofCharMapping
  convertBitsToBytes(fullData)
}

// can be used to encode a String as src, thanks to the StringLike/StringOps extension
def encode(src: Iterable[Char]): (EncodingMapping, ByteStream) = {
  val mapping = buildMapping(src.iterator)
  val encoded = encodeImpl(src.iterator, mapping)
  (mapping, encoded)
}

def wrapClose[A <: java.io.Closeable, B](resource: A)(op: A => B): B = {
  try {
    op(resource)
  }
  finally {
    resource.close()
  }
}

def encodeFile(fileName: String): (EncodingMapping, ByteStream) = {
  // note in real life you probably want to specify the file encoding as well
  val mapping = wrapClose(Source.fromFile(fileName))(buildMapping)
  val encoded = wrapClose(Source.fromFile(fileName))(file => encodeImpl(file, mapping))
  (mapping, encoded)
}
where in accumulateStats you find out how often each Char is present in the src, and in buildMappingImpl (which is the main part of the whole Huffman encoding) you first build a tree from those stats and then use that tree to create a fixed EncodingMapping. eofCharMapping is a mapping for the pseudo-EOF char, as mentioned in one of the comments. Note that the high-level encode methods return both the EncodingMapping and the ByteStream, because in any real-life scenario you want to save both.
The piece of logic specifically being asked about is located in the convertBitsToBytes method. Note that I use Boolean to represent a single bit rather than Char, and thus Iterator[Bit] (effectively Iterator[Boolean]) rather than String to represent a sequence of bits. The implementation is based on the grouped method, which converts a BitStream into a stream of Bits collected into byte-sized groups (except possibly for the last one).
IMHO the main advantage of this stream-oriented approach compared to Sarvesh Kumar Singh's answer is that you don't need to load the whole file into memory at once or store the whole encoded file in memory. Note however that in this case you'll have to read the file twice: the first time to build the EncodingMapping and the second to apply it. Obviously, if the file is small enough, you can load it into memory first and then convert the ByteStream to an Array[Byte] using a .toArray call. But if your file is big, you can stick with the stream-based approach and save the ByteStream into a file using something like .foreach(b => out.write(b))
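A minimal sketch of that last step, reusing the wrapClose helper defined above (the file names are hypothetical):
val (mapping, encoded) = encodeFile("input.txt")
wrapClose(new java.io.FileOutputStream("input.huff")) { out =>
  encoded.foreach(b => out.write(b))
}
// in a real program you'd also persist the mapping, since it's needed for decoding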
I don't think this will help you achieve your Huffman compression goal, but in answer to your question:
string-to-value
Converting a String to the value it represents is pretty easy.
val x: Int = "10101010".foldLeft(0)(_*2 + _.asDigit)
Note: You'll have to check for formatting (only ones and zeros) and overflow (strings too long) before conversion.
value-to-file
There are a number of ways to write data to a file. Here's a simple one.
import java.io.{FileOutputStream, File}
val fos = new FileOutputStream(new File("output.dat"))
fos.write(x)
fos.flush()
fos.close()
Note: You'll want to catch any errors thrown.
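If your bit string is longer than one byte - as a Huffman-encoded message will be - here is a minimal sketch of packing it into bytes before writing (this pads the final group with trailing zeros, which a real decoder has to account for):
val bits = "0101010101010"
val packed: Array[Byte] = bits
  .grouped(8)
  .map(g => g.padTo(8, '0').foldLeft(0)(_ * 2 + _.asDigit).toByte)
  .toArray
fos.write(packed)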
I will specify all the required imports first,
import java.io.{ File, FileInputStream, FileOutputStream}
import java.nio.file.Paths
import scala.collection.mutable.ArrayBuffer
Now, we are going to need the following smaller units to achieve the whole thing,
1 - We need to be able to convert our binary string (e.g. "01010") to an Array[Byte],
def binaryStringToByteArray(binaryString: String) = {
  val byteBuffer = ArrayBuffer.empty[Byte]
  var byteStr = ""
  for (binaryChar <- binaryString) {
    if (byteStr.length < 7) {
      byteStr = byteStr + binaryChar
    }
    else {
      try {
        // try to parse all 8 collected bits as a signed byte...
        val byte = java.lang.Byte.parseByte(byteStr + binaryChar, 2)
        byteBuffer += byte
        byteStr = ""
      }
      catch {
        // ...and on overflow (value > 127) fall back to the 7 bits collected so far
        case ex: java.lang.NumberFormatException =>
          val byte = java.lang.Byte.parseByte(byteStr, 2)
          byteBuffer += byte
          byteStr = "" + binaryChar
      }
    }
  }
  if (!byteStr.isEmpty) {
    val byte = java.lang.Byte.parseByte(byteStr, 2)
    byteBuffer += byte
    byteStr = ""
  }
  byteBuffer.toArray
}
2 - We need to be able to open the file to serve in our little play,
def openFile(filePath: String): File = {
val path = Paths.get(filePath)
val file = path.toFile
if (file.exists()) file.delete()
if (!file.exists()) file.createNewFile()
file
}
3 - We need to be able to write bytes to a file,
def writeBytesToFile(bytes: Array[Byte], file: File): Unit = {
val fos = new FileOutputStream(file)
fos.write(bytes)
fos.close()
}
4 - We need to be able to read bytes back from the file,
def readBytesFromFile(file: File): Array[Byte] = {
val fis = new FileInputStream(file)
val bytes = new Array[Byte](file.length().toInt)
fis.read(bytes)
fis.close()
bytes
}
5 - We need to be able to convert bytes back to our binary string,
def byteArrayToBinaryString(byteArray: Array[Byte]): String = {
  // note: toBinaryString does not zero-pad, so leading zeros inside each byte are lost
  byteArray.map(b => b.toBinaryString).mkString("")
}
Now, we are ready to do everything we want,
// lets say we had this binary string,
scala> val binaryString = "00101110011010101010101010101"
// binaryString: String = 00101110011010101010101010101
// Now, we need to "pad" this with a leading "1" to avoid byte related issues
scala> val paddedBinaryString = "1" + binaryString
// paddedBinaryString: String = 100101110011010101010101010101
// The file which we will use for this,
scala> val file = openFile("/tmp/a_bit")
// file: java.io.File = /tmp/a_bit
// convert our padded binary string to bytes
scala> val bytes = binaryStringToByteArray(paddedBinaryString)
// bytes: Array[Byte] = Array(75, 77, 85, 85)
// write the bytes to our file,
scala> writeBytesToFile(bytes, file)
// read bytes back from file,
scala> val bytesFromFile = readBytesFromFile(file)
// bytesFromFile: Array[Byte] = Array(75, 77, 85, 85)
// so now, we have our padded string back,
scala> val paddedBinaryStringFromFile = byteArrayToBinaryString(bytes)
// paddedBinaryStringFromFile: String = 1001011100110110101011010101
// remove that "1" from the front and we have our binaryString back,
scala> val binaryStringFromFile = paddedBinaryString.tail
// binaryStringFromFile: String = 00101110011010101010101010101
NOTE :: you may have to make a few changes if you want to deal with very large "binary strings" (more than a few million characters long) to improve performance or even to be usable at all. For example, you would need to start using Streams or Iterators instead of Array[Byte].
Ok, so basically, if I am in the console (IntelliJ) and I type FileScramble.getRandomPW, I get an ASCII password. But if I run the same call in the code, I don't. Instead, I get "org.jasypt.exceptions.EncryptionInitializationException: InvalidKeySpecException: Password is not ASCII."
The fact that I've been up and down that block of code so many times leads me to believe that I'm missing something fundamental in the Scala language. The try-catch in the getRandomPW block is never triggered. And, like I said, if I call it from the console, I get only ASCII.
The program is just going to scramble the contents of a file before deletion. It's by no means secure -- it's an exercise. It's me getting familiar with 1) Scala, 2) encryption, and 3) sbt.
So here is the relevant code:
import java.io.{BufferedOutputStream, File, FileOutputStream, InputStream}
import java.nio.ByteBuffer
import java.security.SecureRandom
import org.jasypt.util.binary.BasicBinaryEncryptor
object FileScramble {
val base64chars = ('a' to 'z').union('A' to 'Z').union(0 to 9).union(List('/', '+'))
def byteArrayToBase64(x: java.nio.ByteBuffer) : String = {
// convert to string and filter out anything but base64chars
val nowString = new String(x.array.takeWhile(_ != 0), "UTF-8")
nowString.filter(base64chars.contains(_))
}
def writeBytes( data : Stream[Byte], file : File ) = {
val target = new BufferedOutputStream( new FileOutputStream(file) );
try data.foreach( target.write(_) ) finally target.close;
}
def getRandomPW : String = {
try {
var output : String = ""
while (output.length() < 10) {
// val r = scala.util.Random
val r = SecureRandom.getInstance("SHA1PRNG")
var bytePW : Array[Byte] = new Array[Byte](1000)
r.nextBytes(bytePW)
// get 1000 random bytes into a ByteBuffer
val preString = ByteBuffer.allocate(1000).put(bytePW)
// get a random base 64 password at least 10 chars long
output = byteArrayToBase64(preString)
}
output
}
catch {
case e : Exception => e.getMessage()
}
}
def main( args: Array[String] ): Unit = {
val fileHandle = new java.io.File(args(0))
// https://github.com/liufengyun/scala-bug
val source = scala.io.Source.fromFile(fileHandle, "ISO-8859-1")
// source = new MyInputStream(dataStream)
val byteArray = source.map(_.toByte).toArray
// val byteStream = source.map(_.toByte).toStream
source.close()
var binaryEncryptor = new BasicBinaryEncryptor();
val pw = getRandomPW
println("BEGIN: " + pw + ":END")
binaryEncryptor.setPassword(pw);
val encryptedOut = binaryEncryptor.encrypt(byteArray).toStream
writeBytes(encryptedOut, fileHandle)
}
}
Honestly, I've been up and down the block for a few hours and have not come up with any ideas as to what could be happening. It's by far the biggest head-scratcher I've had recently, to the point that I've asked SO a question for the first time in several years.
Your help is appreciated! I thank you in advance, whether you can help or not.
You have only one small, elusive mistake - when you're trying to add the numeric characters 0 - 9, you should add union('0' to '9') instead of union(0 to 9) - otherwise you're adding non-ASCII characters (unicode values 0 - 9...) and thus getting the (justifiable) exception.
@TzachZohar has it exactly right.
What you might also consider, though, is letting the compiler help you out a bit more by adding your expected type.
val base64anys: Seq[Char] = ('a' to 'z').union('A' to 'Z').union(0 to 9).union(List('/', '+'))
does not compile. So you would have seen the error.
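With the fix applied, the annotated declaration compiles (a quick sketch combining both suggestions):
val base64chars: Seq[Char] = ('a' to 'z').union('A' to 'Z').union('0' to '9').union(List('/', '+'))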
So I have an association that maps a pair of Ints to a Vector[Long], which can be up to size 10000, and I have anywhere from several hundred thousand to a million such entries. I would like to store this in a single file for later processing in Scala.
Clearly storing this in a plain-text format would take way too much space, so I've been trying to figure out how to do it by writing a Byte stream. However, I'm not too sure if this will work, since it seems to me that byteValue() on a Long returns the Byte representation, which is still 4 bytes long, and hence I won't save any space? I do not have much experience working with binary formats.
It seems the Scala standard library had a BytePickle that might have been what I was looking for, but has since been deprecated?
An arbitrary Long is about 19.5 ASCII digits long, but only 8 bytes long, so you'll gain a factor of ~2 in space if you write it in binary. Now, it may be that most of the values are not actually taking all 8 bytes, in which case you could define some compression scheme yourself.
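For instance, a sketch of one simple variable-length scheme (the standard varint idea: 7 payload bits per byte, with the high bit set while more bytes follow) - a hypothetical helper, not part of the file format described below:
def writeVarLong(out: java.io.OutputStream, value: Long): Unit = {
  var v = value
  // emit 7 bits at a time, setting the high bit while more bytes remain
  while ((v & ~0x7FL) != 0L) {
    out.write(((v & 0x7FL) | 0x80L).toInt)
    v >>>= 7
  }
  out.write((v & 0x7FL).toInt)
}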
In any case, you are probably best off writing block data using java.nio.ByteBuffer and friends. Binary data is most efficiently read in blocks, and you might want your file to be randomly accessible, in which case you want your data to look something like so:
<some unique binary header that lets you check the file type>
<int saying how many records you have>
<offset of the first record>
<offset of the second record>
...
<offset of the last record>
<int><int><length of vector><long><long>...<long>
<int><int><length of vector><long><long>...<long>
...
<int><int><length of vector><long><long>...<long>
This is a particularly convenient format for reading and writing using ByteBuffer because you know in advance how big everything is going to be. So you can
val fos = new FileOutputStream(myFileName)
val fc = fos.getChannel // java.nio.channels.FileChannel
val header = ByteBuffer.allocate(28)
header.put("This is my cool header!!".getBytes)
header.putInt(data.length)
header.flip() // flip before writing, or nothing gets written
fc.write(header)
val offsets = ByteBuffer.allocate(8*data.length)
data.foldLeft(28L + 8*data.length){ (n, d) =>
  offsets.putLong(n)
  n + 12 + d.vector.length*8 // the result of each fold step is the next record's offset
}
offsets.flip()
fc.write(offsets)
...
and on the way back in
val fis = new FileInputStream(myFileName)
val fc = fis.getChannel
val header = ByteBuffer.allocate(28)
fc.read(header)
header.flip() // flip before reading the values back out
val hbytes = new Array[Byte](24)
header.get(hbytes)
if (new String(hbytes) != "This is my cool header!!") ???
val nrec = header.getInt
val offsets = ByteBuffer.allocate(8*nrec)
fc.read(offsets)
offsets.flip()
val offsetArray = offsets.getLongs(nrec) // See below!
...
There are some handy methods on ByteBuffer that are absent, but you can add them on with implicits (here for Scala 2.10; with 2.9 make it a plain class, drop the extends AnyVal, and supply an implicit conversion from ByteBuffer to RichByteBuffer):
implicit class RichByteBuffer(val b: java.nio.ByteBuffer) extends AnyVal {
def getBytes(n: Int) = { val a = new Array[Byte](n); b.get(a); a }
def getShorts(n: Int) = { val a = new Array[Short](n); var i=0; while (i<n) { a(i)=b.getShort(); i+=1 } ; a }
def getInts(n: Int) = { val a = new Array[Int](n); var i=0; while (i<n) { a(i)=b.getInt(); i+=1 } ; a }
def getLongs(n: Int) = { val a = new Array[Long](n); var i=0; while (i<n) { a(i)=b.getLong(); i+=1 } ; a }
def getFloats(n: Int) = { val a = new Array[Float](n); var i=0; while (i<n) { a(i)=b.getFloat(); i+=1 } ; a }
def getDoubles(n: Int) = { val a = new Array[Double](n); var i=0; while (i<n) { a(i)=b.getDouble(); i+=1 } ; a }
}
Anyway, the reason to do things this way is that you'll end up with decent performance, which is also a consideration when you have tens of gigabytes of data (which it sounds like you have given hundreds of thousands of vectors of length up to ten thousand).
If your problem is actually much smaller, then don't worry so much about it--pack it into XML or use JSON or some custom text solution (or use DataOutputStream and DataInputStream, which don't perform as well and won't give you random access).
If your problem is actually bigger, you can define two lists of longs; first, the ones that will fit in an Int, say, and then the ones that actually need a full Long (with indices so you know where they are). Data compression is a very case-specific task--assuming you don't just want to use java.util.zip--so without a lot more knowledge about what the data looks like, it's hard to know what to recommend beyond just storing it as a weakly hierarchical binary file as I've described above.
See Java's DataOutputStream. It allows easy and efficient writing of primitive types and Strings to byte streams. In particular, you want something like:
val stream = new DataOutputStream(new FileOutputStream("your_file.bin"))
You can then use the equivalent DataInputStream methods to read from that file to variables again.
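For example, a minimal sketch of writing and re-reading the (Int, Int, Vector[Long]) records described above, assuming the data is held as records: Seq[((Int, Int), Vector[Long])]:
import java.io.{DataInputStream, DataOutputStream, FileInputStream, FileOutputStream}

val records: Seq[((Int, Int), Vector[Long])] = ??? // your data

val out = new DataOutputStream(new FileOutputStream("your_file.bin"))
out.writeInt(records.length)
for (((a, b), vec) <- records) {
  out.writeInt(a)
  out.writeInt(b)
  out.writeInt(vec.length) // store the length so the vector can be read back
  vec.foreach(out.writeLong)
}
out.close()

val in = new DataInputStream(new FileInputStream("your_file.bin"))
val n = in.readInt()
val readBack = Vector.fill(n) {
  val (a, b) = (in.readInt(), in.readInt())
  (a, b) -> Vector.fill(in.readInt())(in.readLong())
}
in.close()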
I used scala-io and scala-arm to write a binary stream of Longs. The libraries themselves are supposed to be a Scala way of doing things, but they are not in the Scala master branch - maybe someone knows why? I use them from time to time.
1) Clone scala-io:
git clone https://github.com/scala-incubator/scala-io.git
Go to the scala-io directory and, in Build.scala, change val scalaVersion to yours
sbt package
2) Clone scala-arm:
git clone https://github.com/jsuereth/scala-arm.git
Go to the scala-arm directory and, in build.scala, change scalaVersion := to yours
sbt package
3) Copy the resulting jars somewhere convenient:
scala-io/core/target/scala-xxx/scala-io-core_xxx-0.5.0-SNAPSHOT.jar
scala-io/file/target/scala-xxx/scala-io-file_xxx-0.5.0-SNAPSHOT.jar
scala-arm/target/scala-xxx/scala-arm_xxx-1.3-SNAPSHOT.jar
4) Start REPL:
scala -classpath "/opt/scala-io/scala-io-core_2.10-0.5.0-SNAPSHOT.jar:
/opt/scala-io/scala-io-file_2.10-0.5.0-SNAPSHOT.jar:
/opt/scala-arm/scala-arm_2.10-1.3-SNAPSHOT.jar"
5) Use :paste to enter the actual code:
import scalax.io._
// create data stream
val EOData = Vector(0xffffffffffffffffL)
val data = List(
(0, Vector(0L,1L,2L,3L))
,(1, Vector(4L,5L))
,(2, Vector(6L,7L,8L))
,(3, Vector(9L))
)
var it = Iterator[Long]()
for (rec <- data) {
it = it ++ Vector(rec._1).iterator.map(_.toLong)
it = it ++ rec._2.iterator
it = it ++ EOData.iterator
}
// write data at once
val out: Output = Resource.fromFile("/tmp/data")
out.write(it)(OutputConverter.TraversableLongConverter)