So I'm using BooPickle to serialize Scala classes before writing them to RocksDB. Given this case class,
case class Key(a: Long, b: Int) {
  def toStringEncoding: String = s"${a}-${b}"
}
I have this implicit class
implicit class KeySerializer(key: Key) {
  def serialize: Array[Byte] =
    Pickle.intoBytes(key.toStringEncoding).array
}
The method toStringEncoding is necessary because BooPickle wasn't serializing the case class in a way that worked well with RocksDB's requirements on key ordering. I then write a bunch of key/value pairs to several SST files and ingest them into RocksDB. However, when I go to look up the keys in the db, they're not found.
If I iterate over all of the keys in the db, I find the keys are successfully written; however, extra bytes end up in the byte representation in the db. For example, if key.serialize outputs something like this
Array[Byte] = Array( 25, 49, 54, 48, 53, 55, 52, 52, 48, 48, 48, 45, 48, 45, 49, 54, 48, 53, 55, 52, 52, 48, 51, 48, 45, 48, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...)
What I'll find in the db is something like this
Array[Byte] = Array( 25, 49, 54, 48, 53, 55, 52, 52, 48, 48, 48, 45, 48, 45, 49, 54, 48, 53, 55, 52, 52, 48, 51, 48, 45, 48, 51, 101, 52, 97, 49, 100, 102, 48, 50, 53, 5, 8, ...)
Extra non-zero bytes replace the zero bytes at the end of the byte array. In addition, the sizes of the byte arrays are different: when I call the serialize method the byte array has size 512, but when I retrieve the key from the db it has size 4112. Anyone know what might be causing this?
I have no experience with RocksDB or BooPickle, but I guess that the problem is in calling ByteBuffer.array: it returns the whole array backing the byte buffer rather than just the relevant part. See, for example, "Gets byte array from a ByteBuffer in java" for how to properly extract the data from a ByteBuffer.
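To see the difference, here is a minimal stand-alone illustration using plain java.nio (nothing BooPickle-specific): array gives the buffer's entire backing array, while remaining tells you how many bytes were actually written.
import java.nio.ByteBuffer

val buf = ByteBuffer.allocate(512)
buf.put(Array[Byte](1, 2, 3))  // write only 3 bytes
buf.flip()                     // prepare the buffer for reading
buf.array().length             // 512, the whole backing array
buf.remaining                  // 3, only the bytes actually written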
The BooPickle docs suggest the following for getting BooPickled data as a byte array:
val data: Array[Byte] = Array.ofDim[Byte](buf.remaining)
buf.get(data)
So in your case it would be something like
def serialize: Array[Byte] = {
  val buf = Pickle.intoBytes(key.toStringEncoding)
  val arr = Array.ofDim[Byte](buf.remaining)
  buf.get(arr)
  arr
}
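For completeness, a sketch of the matching read path, assuming BooPickle's standard Unpickle API; the wrapper and method names here are made up for illustration:
import java.nio.ByteBuffer
import boopickle.Default._

implicit class KeyDeserializer(bytes: Array[Byte]) {
  // wrap the stored bytes and unpickle the string-encoded key
  def deserializeKeyString: String =
    Unpickle[String].fromBytes(ByteBuffer.wrap(bytes))
}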
When the following code is run:
A = ss.sparkContext.parallelize(range(1, 100))
t = 50
B = A.filter(lambda x: x < t)
print(B.count())
t = 10
C = B.filter(lambda x: x > t)
print(C.count())
The output is:
49
0
Which is incorrect, as there are 39 values between 10 and 49. It seems like changing t from 50 to 10 affected the first filter as well: it got re-evaluated, so when both filters are applied consecutively it effectively becomes x < 10 (which would result in 1, 2, 3, 4, 5, 6, 7, 8, 9) followed by x > 10, resulting in an empty rdd.
But when I add debug prints in the code, the result is not what I expect, and I am looking for an explanation:
A = ss.sparkContext.parallelize(range(1, 100))
t = 50
B = A.filter(lambda x: x < t)
print(B.collect())
t = 10
print(B.collect())
print(B.count())
C = B.filter(lambda x: x > t)
print(C.collect())
print(C.count())
The output is:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]
9
[]
0
How come the count is 9 after t = 10, but print(B.collect()) shows the expected rdd with values from 1 to 49? If triggering collect after changing t re-did the filter, then shouldn't collect() show values from 1-9?
I am new to pyspark, I suspect this has to do with spark's lazy operations and caching. Can someone explain what is going on behind the scenes?
Thanks!
Your assumption is correct: the observed behaviour is related to Spark's lazy evaluation of transformations.
When B.count() is executed, Spark simply applies the filter x < t with t = 50 and prints the expected value of 49.
When C.count() is executed, Spark sees two filters in the execution plan of C, namely x < t and x > t. At this point in time t has been set to 10, and no element of the rdd satisfies both conditions of being smaller and larger than 10. Spark ignores the fact that the first filter has already been evaluated: when a Spark action is called, all transformations in the history of the current rdd are executed (unless some intermediate result has been cached, see below).
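A minimal Scala sketch of the same timing effect (assuming a SparkContext named sc): nothing runs when the filter is defined, so the value of t that matters is whatever it is at the moment the action is triggered.
val A = sc.parallelize(1 to 99)
var t = 50
val B = A.filter(_ < t)  // only builds the lineage, nothing is evaluated yet
t = 10
B.count()                // the filter runs now, with t = 10, so the count is 9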
A way to examine this behaviour in a bit more detail is to switch to Scala and print toDebugString for both rdds.1
println(B.toDebugString)
prints
(4) MapPartitionsRDD[1] at filter at SparkStarter.scala:23 []
| ParallelCollectionRDD[0] at parallelize at SparkStarter.scala:19 []
while
println(C.toDebugString)
prints
(4) MapPartitionsRDD[2] at filter at SparkStarter.scala:28 []
| MapPartitionsRDD[1] at filter at SparkStarter.scala:23 []
| ParallelCollectionRDD[0] at parallelize at SparkStarter.scala:19 []
Here we can see that for rdd B one filter is applied and for rdd C two filters are applied.
How to fix the issue?
If the result of the first filter is cached, the expected result is printed out. When t is then changed and the second filter is applied, C.count() only triggers the second filter, based on the cached result of B:
A = ss.sparkContext.parallelize(range(1, 100))
t = 50
B = A.filter(lambda x: x < t).cache()
print(B.count())
t = 10
C = B.filter(lambda x: x > t)
print(C.count())
prints the expected result.
49
39
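Another way to avoid the surprise, sketched here in Scala (and not part of the original answer), is to copy the current value of the mutable variable into an immutable val before defining the filter, so later reassignments cannot change the already-defined lineage:
val A = sc.parallelize(1 to 99)
var t = 50
val threshold = t                // freeze the current value
val B = A.filter(_ < threshold)
t = 10
val C = B.filter(_ > t)
B.count()                        // 49, uses the frozen value 50
C.count()                        // 39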
1 Unfortunately this works only in the Scala version of Spark. PySpark seems to "condense" the output of toDebugString (version 3.1.1).
Suppose I have an RDD of integers that looks like this:
10, 20, 30, 40, 50, 60, 70, 80 ...
(i.e. there is a stream of different integers)
and modify the RDD so it looks like this:
15, 25, 35, 45, 55, 65, 75, 85...
(i.e. each item in the modified RDD is computed from two consecutive items of the original RDD above.)
My question is: In Spark, how do I transform my RDD into a list of differences between RDD items?
You can use the RDD's sliding function (from org.apache.spark.mllib.rdd.RDDFunctions), like below:
import org.apache.spark.mllib.rdd.RDDFunctions._
val rdd=sc.parallelize(List(10, 20, 30, 40, 50, 60, 70, 80))
rdd.sliding(2).map(_.sum/2).collect
//output
res14: Array[Int] = Array(15, 25, 35, 45, 55, 65, 75)
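As an aside, not in the original answer: the sample output 15, 25, 35, ... is the midpoint of each consecutive pair, which is what _.sum/2 computes. If you literally want the differences between consecutive items, the same sliding window can be mapped differently:
rdd.sliding(2).map { case Array(a, b) => b - a }.collect
// Array(10, 10, 10, 10, 10, 10, 10)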
Output:
prime numbers
2
3
()
5
()
7
()
()
I want:
2
3
5
7
My code:
def primeNumber(range: Int): Unit = {
  val primeNumbers: immutable.IndexedSeq[AnyVal] =
    for (number: Int <- 2 to range) yield {
      var isPrime = true
      for (checker: Int <- 2 to Math.sqrt(number).toInt if number % checker == 0 if isPrime) isPrime = false
      if (isPrime) number
    }
  println("prime numbers")
  for (prime <- primeNumbers)
    println(prime)
}
So the underlying problem here is that your yield block effectively returns either an Int or Unit, depending on isPrime. This leads your collection to be of type AnyVal, because that is pretty much the least upper bound that can represent both types. Unit is a type inhabited by only one value, which Scala writes as an empty pair of round brackets, (), so that is what you see in your list.
As Puneeth Reddy V said, you can use collect to filter out all the non-Int values, but I think that is a suboptimal approach (partial functions are often considered a code smell, depending on the style of Scala you write). More idiomatic would be to rethink your loop (such for loops are scarcely used in Scala); this could definitely be done using a foldLeft operation, maybe even something else.
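For example, a sketch of what such a foldLeft-based rewrite could look like (using the same trial-division test as the original code, but without mutable state):
def primeNumbers(range: Int): Vector[Int] =
  (2 to range).foldLeft(Vector.empty[Int]) { (acc, n) =>
    // keep n only if no candidate divisor up to sqrt(n) divides it
    if ((2 to Math.sqrt(n).toInt).exists(n % _ == 0)) acc else acc :+ n
  }

primeNumbers(10)  // Vector(2, 3, 5, 7)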
You can use collect on your output
primeNumbers.collect{
case i : Int => i
}
res2: IndexedSeq[Int] = Vector(2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97)
The reason is that the if/else returns two kinds of value: one is the prime number and the other is empty (Unit). You could return 0 in the else case and filter it out before printing.
scala> def primeNumber(range: Int): Unit ={
|
| val primeNumbers: IndexedSeq[Int] =
|
| for (number :Int <- 2 to range) yield{
|
| var isPrime = true
|
| for(checker : Int <- 2 to Math.sqrt(number).toInt if number%checker==0 if isPrime) isPrime = false
|
| if(isPrime) number
| else
| 0
| }
|
| println("prime numbers" + primeNumbers)
| for(prime <- primeNumbers.filter(_ > 0))
| println(prime)
| }
primeNumber: (range: Int)Unit
scala> primeNumber(10)
prime numbersVector(2, 3, 0, 5, 0, 7, 0, 0, 0)
2
3
5
7
But we should not write the code the way you have written it. You are using mutable variables. Try to write code in an immutable way.
For example
scala> def isPrime(number: Int) =
| number > 1 && !(2 to number - 1).exists(e => number % e == 0)
isPrime: (number: Int)Boolean
scala> def generatePrimes(starting: Int): Stream[Int] = {
| if(isPrime(starting))
| starting #:: generatePrimes(starting + 1)
| else
| generatePrimes(starting + 1)
| }
generatePrimes: (starting: Int)Stream[Int]
scala> generatePrimes(2).take(100).toList
res12: List[Int] = List(2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113, 127, 131, 137, 139, 149, 151, 157, 163, 167, 173, 179, 181, 191, 193, 197, 199, 211, 223, 227, 229, 233, 239, 241, 251, 257, 263, 269, 271, 277, 281, 283, 293, 307, 311, 313, 317, 331, 337, 347, 349, 353, 359, 367, 373, 379, 383, 389, 397, 401, 409, 419, 421, 431, 433, 439, 443, 449, 457, 461, 463, 467, 479, 487, 491, 499, 503, 509, 521, 523, 541)
I would like to convert every single word into decimal ASCII.
for example "RESEP" :
R = 82,
E = 69,
S = 83,
E = 69,
P = 80
My code is:
val LIST_KEYWORD = List("RESEP",
  "DAGING SAPI",
  "DAGING KAMBING")

val RUBAH_BYTE = LIST_KEYWORD.map(_.split(",")).map { baris =>
  baris(0).getBytes
}
And then I get stuck; I don't know what I am supposed to do next.
scala> "RESEP".map(_.toByte)
res1: scala.collection.immutable.IndexedSeq[Byte] = Vector(82, 69, 83, 69, 80)
scala> "RESEP".map(x => x -> x.toByte)
res2: scala.collection.immutable.IndexedSeq[(Char, Byte)] = Vector((R,82), (E,69), (S,83), (E,69), (P,80))
scala> val LIST_KEYWORD = List("RESEP",
| "DAGING SAPI",
| "DAGING KAMBING")
LIST_KEYWORD: List[String] = List(RESEP, DAGING SAPI, DAGING KAMBING)
scala> LIST_KEYWORD.map(_.map(_.toByte))
res3: List[scala.collection.immutable.IndexedSeq[Byte]] = List(Vector(82, 69, 83, 69, 80), Vector(68, 65, 71, 73, 78, 71, 32, 83, 65, 80, 73), Vector(68, 65, 71, 73, 78, 71, 32, 75, 65, 77, 66, 73, 78, 71))
scala> LIST_KEYWORD.map(_.map(x => x -> x.toByte))
res4: List[scala.collection.immutable.IndexedSeq[(Char, Byte)]] = List(Vector((R,82), (E,69), (S,83), (E,69), (P,80)), Vector((D,68), (A,65), (G,71), (I,73), (N,78), (G,71), ( ,32), (S,83), (A,65), (P,80), (I,73)), Vector((D,68), (A,65), (G,71), (I,73), (N,78), (G,71), ( ,32), (K,75), (A,65), (M,77), (B,66), (I,73), (N,78), (G,71)))
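One small aside, not part of the original answer: if the strings might contain non-ASCII characters, mapping each Char to Int avoids the overflow you would get with Byte (which only covers -128 to 127); for plain ASCII the values are the same:
"RESEP".map(_.toInt)  // Vector(82, 69, 83, 69, 80)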
Below is the sample data:
val combineList = List(("A",12),("B",11),("C",12),("D",14),("E",23),("F",12),("D",53),("C",23),("B",12),("A",22),("E",21),("F",12),("C",21),("B",34),("A",34),("G",67),("D",23),("E",21),("F",12),("D",31),("B",41),("E",14),("F",15),("G",18),("A",11),("C",10),("D",9),("A",13),("E",1),("F",14))
and
val X = 98
Now I want the final output as below.
First, group all the values by key:
val groupKey = List(Map("A"->List(12,22,34,11,13)),Map("B"->List(11,12,34,41)),Map("C"->List(12,23,21,10)),Map("D"->List(14,53,23,31,9)),
Map("E"->List(23,21,21,14,1)),Map("F"->List(12,12,12,15,14)),Map("G"->List(67,18)))
Second, subtract each of the grouped values from X (here X is always greater than the list values), so the second output will be:
val substrackValues = List(Map("A"->List(86,76,34,87,85)),Map("B"->List(87,86,34,57)),Map("C"->List(86,75,77,88)),Map("D"->List(84,45,75,31,89)),
Map("E"->List(75,77,77,84,97)),Map("F"->List(86,86,86,15,84)),Map("G"->List(31,80)))
Consider
combineList.groupBy(_._1).mapValues(xs => xs.map(v => X-v._2))
which delivers
Map(E -> List(75, 77, 77, 84, 97), F -> List(86, 86, 86, 83, 84), A -> List(86, 76, 64, 87, 85), G -> List(31, 80), B -> List(87, 86, 64, 57), C -> List(86, 75, 77, 88), D -> List(84, 45, 75, 67, 89))
Note that the embedded maps in groupKey above are singleton maps, which could just as well be represented as tuples of type (String, List[Int]), or better, agglomerated into one map. In the solution proposed here, after grouping by the first tuple element, we transform each element in each list by subtracting it from X.
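If you really do need the intermediate groupKey shape from the question, a list of singleton maps, here is a sketch of producing both steps explicitly (the val names are corrected spellings, not from the original post; note that groupBy does not guarantee key order):
val groupKey = combineList
  .groupBy(_._1)
  .map { case (k, vs) => Map(k -> vs.map(_._2)) }
  .toList

val subtractValues = groupKey.map(_.map { case (k, vs) => k -> vs.map(X - _) })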