Decoded Snappy compressed byte arrays have trailing zeros - scala

I am trying to write and read Snappy compressed byte array created from a protobuf from a Hadoop Sequence File.
The array read back from Hadoop has trailing zeros. If the byte array is small and simple, removing the trailing zeros is enough to parse the protobuf back; however, for more complex objects and big sequence files parsing fails.
Byte array example:
val data = Array(1, 2, 6, 4, 2, 1).map(_.toByte)
val distData = sparkContext
  .parallelize(Array.fill(5)(data))
  .map(j => (NullWritable.get(), new BytesWritable(j)))
distData
  .saveAsSequenceFile(file, Some(classOf[SnappyCodec]))
val original = distData.map(kv => kv._2.getBytes).collect()
val decoded = sparkContext
  .sequenceFile[NullWritable, BytesWritable](file)
  .map(kv => kv._2.getBytes.mkString).collect().foreach(println(_))
Output:
original := 126421
decoded := 126421000

This problem stems from BytesWritable.getBytes, which returns a backing array that may be longer than your data. Instead, call copyBytes (as in "Write and read raw byte arrays in Spark - using Sequence File SequenceFile").
See HADOOP-6298: BytesWritable#getBytes is a bad name that leads to programming mistakes for more details.
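A minimal sketch of the fix for the read path above (same pipeline as in the question, only the map changes): copyBytes returns just the valid getLength-sized portion instead of the padded backing array.
val decoded = sparkContext
  .sequenceFile[NullWritable, BytesWritable](file)
  .map(kv => kv._2.copyBytes())   // copies only getLength bytes, so no trailing zeros
  .collect()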

Related

convert ByteArray to String to ByteArray

I want to convert a ByteArray to a String and then convert the String back to a ByteArray, but the values change during the conversion. Can someone help me solve this problem?
person.proto:
syntax = "proto3";
message Person {
  string name = 1;
  int32 age = 2;
}
After sbt compile this gives a case class Person (generated from the proto during compilation).
My MainClass:
val newPerson = Person(
  name = "John Cena",
  age = 44
)
println(newPerson.toByteArray) // prints [B@50da041d
val l = newPerson.toByteArray.toString
println(l) // prints [B@7709e969
val l1 = l.getBytes
println(l1) // prints [B@f44b405
Why did the values change? How do I convert correctly?
[B@... is the format that a JVM byte array's .toString returns: it's just [B (which means "byte array") followed by a hex string analogous to the memory address at which the array resides (I'm deliberately not calling it a pointer, but it's similar; the precise mapping of that hex string to a memory address is JVM-dependent and could be affected by things like which garbage collector is in use). The important thing is that two different arrays with the same bytes in them will have different .toString results. Note that in some places (e.g. the REPL), Scala will instead print something like Array(-127, 0, 0, 1) rather than calling .toString: this may cause confusion.
It appears that toByteArray emits a new array each time it's called. So the first time you call newPerson.toByteArray, you get an array at a location corresponding to 50da041d. The second time you call it you get a byte array with the same contents at a location corresponding to 7709e969, and you save the string [B@7709e969 into the variable l. When you then call getBytes on that string (saving the result in l1), you get a byte array which is an encoding of the string "[B@7709e969" at the location corresponding to f44b405.
So at the locations corresponding to 50da041d and 7709e969 you have two different byte arrays which happen to contain the same elements (those elements being the bytes of the proto representation of newPerson). At the location corresponding to f44b405 you have a byte array whose bytes encode the string "[B@7709e969" in the platform's default character set (which is what String.getBytes with no arguments uses).
Because a proto isn't really a string, there's no general way to get a useful string (depending on what definition of useful you're dealing with). You could try interpreting a byte array from toByteArray as a string with a given character encoding, but there's no guarantee that any given proto will be valid in an arbitrary character encoding.
An encoding which is purely 8-bit, like ISO-8859-1, is guaranteed to at least be decodable from a byte array, but there could be non-printable or control characters, so it's not likely to be that useful:
val iso88591Representation = new String(newPerson.toByteArray, java.nio.charset.StandardCharsets.ISO_8859_1)
Alternatively, you might want a representation like how the Scala REPL will (sometimes) render it:
"Array(" + newPerson.toByteArray.mkString(", ") + ")"

Akka Streams: keep the delimiter on framing stage

I want to split a byte sequence by lines, with a maximum line size:
val f = Source(List(ByteString("a\n")))
  .via(Framing.delimiter(ByteString("\n"), maximumFrameLength = 256))
  .runFold(ByteString())(_ ++ _)
Await.result(f, 3.seconds) should be(ByteString("a\n"))
The delimiter \n will be missing. I want to keep it. Is there a way to do this?
P.S. The issue is described here: https://github.com/akka/akka/issues/19664
P.P.S. Just adding a map stage to the flow and then concatenating each ByteString with a delimiter is not an option, since the data is split not just by the delimiter but also by the chunkSize property, because it actually comes from a file: val source = FileIO.fromPath(path, chunkSize = MAX_BYTES)
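One possible workaround (a sketch, not from the linked issue): re-append the delimiter after the framing stage rather than before it. Framing.delimiter reassembles complete lines across FileIO chunks, so chunkSize is not a problem at that point; the sketch assumes the input's lines are newline-terminated (with allowTruncation = true, a final unterminated line would also get a "\n" appended).
import akka.stream.scaladsl.Framing
import akka.util.ByteString

// Frame by "\n" as before, then put the delimiter back onto each emitted frame.
val keepDelimiter =
  Framing.delimiter(ByteString("\n"), maximumFrameLength = 256, allowTruncation = true)
    .map(_ ++ ByteString("\n"))

// Usage: FileIO.fromPath(path, chunkSize = MAX_BYTES).via(keepDelimiter)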

Spark - read text file, strip off first X and last Y rows using monotonically_increasing_id

I have to read in files from vendors that can potentially get pretty big (multiple GB). These files may have multiple header and footer rows that I want to strip off.
Reading the file in is easy:
val rawData = spark.read
  .format("csv")
  .option("delimiter", "|")
  .option("mode", "PERMISSIVE")
  .schema(schema)
  .load("/path/to/file.csv")
I can add a simple row number using monotonically_increasing_id:
val withRN = rawData.withColumn("aIndex", monotonically_increasing_id())
That seems to work fine.
I can easily use that to strip off header rows:
val noHeader = withRN.filter($"aIndex".geq(2))
but how can I strip off footer rows?
I was thinking about getting the max of the index column, and using that as a filter, but I can't make that work.
val MaxRN = withRN.agg(max($"aIndex")).first.toString
val noFooter = noHeader.filter($"aIndex".leq(MaxRN))
That returns no rows, because MaxRN is a string.
If I try to convert it to a long, that fails:
noHeader.filter($"aIndex".leq(MaxRN.toLong))
java.lang.NumberFormatException: For input string: "[100000]"
How can I use that max value in a filter?
Is trying to use monotonically_increasing_id like this even a viable approach? Is it really deterministic?
This happens because first will return a Row. To access the first element of the row you must do:
val MaxRN = withRN.agg(max($"aIndex")).first.getLong(0)
By converting the Row to a string you get "[100000]", which is of course not a valid Long; that's why the cast fails.
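Putting it together, a minimal sketch of the footer filter built on that value (numFooterRows is a hypothetical count of trailing rows to drop; note that monotonically_increasing_id guarantees increasing, unique ids but not consecutive ones, so this arithmetic assumes the footer rows carry the highest, consecutive ids):
import org.apache.spark.sql.functions.max

val numFooterRows = 2                                        // hypothetical footer count
val maxRN = withRN.agg(max($"aIndex")).first.getLong(0)      // a Long, not a Row
val noFooter = noHeader.filter($"aIndex".leq(maxRN - numFooterRows))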

To split data into good and bad rows and write to output file using Spark program

I am trying to filter the good and bad rows by counting the number of delimiters in a TSV.gz file and write them to separate files in HDFS.
I ran the below commands in spark-shell
Spark Version: 1.6.3
val file = sc.textFile("/abc/abc.tsv.gz")
val data = file.map(line => line.split("\t"))
val good = data.filter(a => a.size == 995)
val bad = data.filter(a => a.size < 995)
When I check the first record, the value is visible in the spark shell:
good.first()
But when I try to write to an output file, I see the records below:
good.saveAsTextFile("good.tsv")
Output in HDFS (top 2 rows):
[Ljava.lang.String;@1287b635
[Ljava.lang.String;@2ef89922
Could you please let me know how to get the required output file in HDFS?
Thanks!
Your final RDD is of type org.apache.spark.rdd.RDD[Array[String]], so the write operation writes each array's default toString (an object reference) instead of the string values.
You should convert each array of strings back into a tab-separated string before saving. Just try:
good.map(item => item.mkString("\t")).saveAsTextFile("goodFile.tsv")
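The same conversion applies to both splits; a minimal sketch (output paths are illustrative):
// Write good and bad rows back out as tab-separated text (paths are examples only).
good.map(_.mkString("\t")).saveAsTextFile("/abc/good_rows")
bad.map(_.mkString("\t")).saveAsTextFile("/abc/bad_rows")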

How to read with Spark constantly updating HDFS directory and split output to multiple HDFS files based on String (row)?

Elaborated scenario: an HDFS directory is "fed" new log data for multiple types of bank account activity.
Each row represents a random activity type, and each row (String) contains the text "ActivityType=<TheTypeHere>".
In Spark-Scala, what's the best approach to read the input file/s in the HDFS directory and output multiple HDFS files, where each ActivityType is written to its own new file?
I adapted my first answer to this statement: the location of the "key" string is random within the parent String; the only thing that is guaranteed is that it contains that sub-string, in this case "ActivityType" followed by some value. The question is really about this. Here goes:
// SO Question
val rdd = sc.textFile("/FileStore/tables/activitySO.txt")
val rdd2 = rdd.map { x =>
  val start = x.indexOfSlice("ActivityType=<") + 14   // 14 == "ActivityType=<".length
  (x.slice(start, x.indexOfSlice(">", start)), x)
}
val df = rdd2.toDF("K", "V")
df.write.partitionBy("K").text("SO_QUESTION2")
Input is:
ActivityType=<ACT_001>,34,56,67,89,90
3,4,4,ActivityType=<ACT_002>,A,1,2
ABC,ActivityType=<ACT_0033>
DEF,ActivityType=<ACT_0033>
The output is 3 files, where the partition key is not ActivityType= but rather ACT_001, etc. The key data is not stripped from the value; it is still there in the String. You can modify that if you want, as well as the output location and format.
You can use MultipleOutputFormat for this. Convert the RDD into key-value pairs such that ActivityType is the key. Spark will create different files for different keys. You can decide, based on the key, where to place the files and what their names will be.
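A hedged sketch of that idea using the old mapred API's MultipleTextOutputFormat with saveAsHadoopFile; the output format class, the key extraction, and the paths are illustrative assumptions, not code from this answer:
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

// Route each record to a file named after its key and drop the key from the output value.
class ActivityTypeOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    s"${key.toString}/$name"
  override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
}

sc.textFile("/FileStore/tables/activitySO.txt")
  .map(line => (line.split("ActivityType=<|>")(1), line))   // crude key extraction
  .saveAsHadoopFile(
    "activity_by_type",                                      // hypothetical output path
    classOf[String], classOf[String],
    classOf[ActivityTypeOutputFormat]
  )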
You can do something like this using RDDs, where I assume you have variable-length records, and then convert to a DataFrame:
val rdd = sc.textFile("/FileStore/tables/activity.txt")
val rdd2 = rdd.map(_.split(","))
  .keyBy(_(0))
val rdd3 = rdd2.map(x => (x._1, x._2.mkString(",")))
val df = rdd3.toDF("K", "V")
//df.show(false)
df.write.partitionBy("K").text("SO_QUESTION")
Input is:
ActivityType=<ACT_001>,34,56,67,89,90
ActivityType=<ACT_002>,A,1,2
ActivityType=<ACT_003>,ABC
I then get 3 files as output, in this case one for each record. It's a bit hard to show here, as I did it in Databricks.
You can adjust your output format and location, etc. partitionBy is the key here.