How to get the first 100 prime numbers in Scala? I got a result, but it displays a blank () where the number is not prime

output : prime numbers
2
3
()
5
()
7
()
()
I want:
2
3
5
7
import scala.collection.immutable

def primeNumber(range: Int): Unit = {
  val primeNumbers: immutable.IndexedSeq[AnyVal] =
    for (number: Int <- 2 to range) yield {
      var isPrime = true
      for (checker: Int <- 2 to Math.sqrt(number).toInt if number % checker == 0 if isPrime) isPrime = false
      if (isPrime) number
    }
  println("prime numbers")
  for (prime <- primeNumbers)
    println(prime)
}

So the underlying problem here is that your yield block effectively returns either an Int or a Unit, depending on isPrime. This leads your collection to be of type AnyVal, because that is pretty much the least upper bound that can represent both types. Unit is a type inhabited by only one value, which is written in Scala as an empty pair of round brackets, (), so that is what you see in your list.
As Puneeth Reddy V said, you can use collect to filter out all the non-Int values, but I think that is a suboptimal approach (partial functions are often considered a code smell, depending on the style of Scala you write). More idiomatic would be to rethink your loop (such for loops are rarely used in Scala); this could definitely be done with a foldLeft operation, or maybe something else entirely.
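As a minimal sketch of that idea (not from the original answer, and using a hypothetical isPrime helper), a foldLeft version could look like this:

// Sketch: accumulate primes with foldLeft instead of yielding Int-or-Unit
def isPrime(n: Int): Boolean =
  n > 1 && !(2 to math.sqrt(n).toInt).exists(n % _ == 0)

def primeNumbers(range: Int): List[Int] =
  (2 to range).foldLeft(List.empty[Int]) { (acc, n) =>
    if (isPrime(n)) acc :+ n else acc // keep n only when it is prime
  }

primeNumbers(100).foreach(println)

Because the accumulator is always a List[Int], there is no AnyVal or () to deal with.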

You can use collect on your output
primeNumbers.collect{
case i : Int => i
}
res2: IndexedSeq[Int] = Vector(2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97)

The reason is that the if/else returns two kinds of value: one is the prime number and the other is empty (Unit). You could return 0 in the else case and filter it out before printing.
scala> def primeNumber(range: Int): Unit = {
|   val primeNumbers: IndexedSeq[Int] =
|     for (number: Int <- 2 to range) yield {
|       var isPrime = true
|       for (checker: Int <- 2 to Math.sqrt(number).toInt if number % checker == 0 if isPrime) isPrime = false
|       if (isPrime) number
|       else 0
|     }
|   println("prime numbers" + primeNumbers)
|   for (prime <- primeNumbers.filter(_ > 0))
|     println(prime)
| }
primeNumber: (range: Int)Unit
scala> primeNumber(10)
prime numbersVector(2, 3, 0, 5, 0, 7, 0, 0, 0)
2
3
5
7
But we should not write the code the way you have written it: it relies on a mutable variable. Try to write it in an immutable style, for example:
scala> def isPrime(number: Int) =
| number > 1 && !(2 to number - 1).exists(e => number % e == 0)
isPrime: (number: Int)Boolean
scala> def generatePrimes(starting: Int): Stream[Int] = {
| if(isPrime(starting))
| starting #:: generatePrimes(starting + 1)
| else
| generatePrimes(starting + 1)
| }
generatePrimes: (starting: Int)Stream[Int]
scala> generatePrimes(2).take(100).toList
res12: List[Int] = List(2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113, 127, 131, 137, 139, 149, 151, 157, 163, 167, 173, 179, 181, 191, 193, 197, 199, 211, 223, 227, 229, 233, 239, 241, 251, 257, 263, 269, 271, 277, 281, 283, 293, 307, 311, 313, 317, 331, 337, 347, 349, 353, 359, 367, 373, 379, 383, 389, 397, 401, 409, 419, 421, 431, 433, 439, 443, 449, 457, 461, 463, 467, 479, 487, 491, 499, 503, 509, 521, 523, 541)
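On Scala 2.13 and later, Stream is deprecated in favour of LazyList; the same idea, reusing the isPrime defined above, would look roughly like this (a sketch, not part of the original answer):

def generatePrimes(starting: Int): LazyList[Int] =
  if (isPrime(starting)) starting #:: generatePrimes(starting + 1)
  else generatePrimes(starting + 1)

generatePrimes(2).take(100).toList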

Related

Pyspark RDD filter behaviour when filter condition variable is modified

When the following code is run:
A = ss.sparkContext.parallelize(range(1, 100))
t = 50
B = A.filter(lambda x: x < t)
print(B.count())
t = 10
C = B.filter(lambda x: x > t)
print(C.count())
The output is:
49
0
This is incorrect, as there are 39 values between 10 and 49. It seems like changing t from 50 to 10 affected the first filter as well: it got re-evaluated, so when both filters are applied consecutively the pipeline effectively becomes x < 10, which yields 1, 2, 3, 4, 5, 6, 7, 8, 9, followed by x > 10, which results in an empty RDD.
But When I add debug prints in the code the result is not what I expect and I am looking for an explanation:
A = ss.sparkContext.parallelize(range(1, 100))
t = 50
B = A.filter(lambda x: x < t)
print(B.collect())
t = 10
print(B.collect())
print(B.count())
C = B.filter(lambda x: x > t)
print(C.collect())
print(C.count())
The output is:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]
9
[]
0
How come the count is 9 after t = 10, but print(B.collect()) still shows the expected RDD with values from 1 to 49? If triggering collect after changing t re-did the filter, then shouldn't collect() show values from 1-9?
I am new to pyspark, I suspect this has to do with spark's lazy operations and caching. Can someone explain what is going on behind the scenes?
Thanks!
Your assumption is correct: the observed behaviour is related to Spark's lazy evaluation of transformations.
When B.count() is executed, Spark simply applies the filter x < t with t = 50 and prints the expected value of 49.
When C.count() is executed, Spark sees two filters in the execution plan of C, namely x < t and x > t. At this point t has been set to 10, and no element of the RDD satisfies both conditions of being smaller and larger than 10 at the same time. Spark ignores the fact that the first filter has already been evaluated: when a Spark action is called, all transformations in the lineage of the current RDD are executed (unless some intermediate result has been cached, see below).
A way to examine this behaviour in a bit more detail is to switch to Scala and print toDebugString for both RDDs.1
println(B.toDebugString)
prints
(4) MapPartitionsRDD[1] at filter at SparkStarter.scala:23 []
| ParallelCollectionRDD[0] at parallelize at SparkStarter.scala:19 []
while
println(C.toDebugString)
prints
(4) MapPartitionsRDD[2] at filter at SparkStarter.scala:28 []
| MapPartitionsRDD[1] at filter at SparkStarter.scala:23 []
| ParallelCollectionRDD[0] at parallelize at SparkStarter.scala:19 []
Here we can see that for rdd B one filter is applied and for rdd C two filters are applied.
How to fix the issue?
If the result of the first filter is cached, the expected result is printed. When t is then changed and the second filter is applied, C.count() only triggers the second filter, running it against the cached result of B:
A = ss.sparkContext.parallelize(range(1, 100))
t = 50
B = A.filter(lambda x: x < t).cache()
print(B.count())
t = 10
C = B.filter(lambda x: x > t)
print(C.count())
prints the expected result.
49
39
1 Unfortunately this works only in the Scala version of Spark. PySpark seems to "condense" the output of toDebugString (version 3.1.1).
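For reference, the cached version translated to Scala looks roughly like this (a sketch, not part of the original answer; it assumes a local SparkSession named spark):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("filter-demo").master("local[*]").getOrCreate()

var t = 50
val A = spark.sparkContext.parallelize(1 until 100)
val B = A.filter(x => x < t).cache() // cache pins the output of the first filter
println(B.count())                   // 49, and B is now materialised with t = 50
t = 10
val C = B.filter(x => x > t)         // only this filter runs against the cached B
println(C.count())                   // 39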

Extra bytes being added to serialization with BooPickle and RocksDb

So I'm using BooPickle to serialize Scala classes before writing them to RocksDB. To serialize a class,
case class Key(a: Long, b: Int) {
def toStringEncoding: String = s"${a}-${b}"
}
I have this implicit class
implicit class KeySerializer(key: Key) {
def serialize: Array[Byte] =
Pickle.intoBytes(key.toStringEncoding).array
}
The method toStringEncoding is necessary because BooPickle wasn't serializing the case class in a way that worked well with RocksDb's requirements on key ordering. I then write a bunch of key, value pairs to several SST files and ingest them into RocksDb. However when I go to look up the keys from the db, they're not found.
If I iterate over all of the keys in the db, I find that the keys were successfully written; however, extra bytes show up in the byte representation stored in the db. For example, if key.serialize outputs something like this
Array[Byte] = Array( 25, 49, 54, 48, 53, 55, 52, 52, 48, 48, 48, 45, 48, 45, 49, 54, 48, 53, 55, 52, 52, 48, 51, 48, 45, 48, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...)
What I'll find in the db is something like this
Array[Byte] = Array( 25, 49, 54, 48, 53, 55, 52, 52, 48, 48, 48, 45, 48, 45, 49, 54, 48, 53, 55, 52, 52, 48, 51, 48, 45, 48, 51, 101, 52, 97, 49, 100, 102, 48, 50, 53, 5, 8, ...)
Extra non-zero bytes replace the zero bytes at the end of the byte array. In addition, the sizes of the byte arrays are different: when I call the serialize method the size of the byte array is 512, but when I retrieve the key from the db the size is 4112. Does anyone know what might be causing this?
I have no experience with RocksDb or BooPickle, but I guess that the problem is in calling ByteBuffer.array:
it returns the whole array backing the byte buffer rather than just the portion that was actually written.
See e.g. Gets byte array from a ByteBuffer in java
for how to properly extract the data from a ByteBuffer.
The BooPickle docs suggest the following for getting BooPickled data as a byte array:
val data: Array[Byte] = Array.ofDim[Byte](buf.remaining)
buf.get(data)
So in your case it would be something like
def serialize: Array[Byte] = {
val buf = Pickle.intoBytes(key.toStringEncoding)
val arr = Array.ofDim[Byte](buf.remaining)
buf.get(arr)
arr
}
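To sanity-check the fix, a small round trip like the following should show that the copied array contains only the bytes that were actually written and decodes back to the original string (a sketch, assuming boopickle.Default._ is in scope and a heap-backed buffer, as in your serialize; the Key values are arbitrary):

import java.nio.ByteBuffer
import boopickle.Default._

val key = Key(1L, 2)
val buf: ByteBuffer = Pickle.intoBytes(key.toStringEncoding)
println(buf.remaining)     // bytes actually written
println(buf.array.length)  // size of the whole backing array (the 512 you observed)

val bytes = new Array[Byte](buf.remaining)
buf.duplicate.get(bytes)   // copy only the written bytes, leaving buf's position untouched
println(Unpickle[String].fromBytes(ByteBuffer.wrap(bytes))) // prints 1-2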

How to convert String to ASCII using Scala

I would like to convert every single word into decimal ascii.
for example "RESEP" :
R = 82,
E = 69,
S = 83,
E = 69,
P = 80
my code is:
val LIST_KEYWORD = List("RESEP",
"DAGING SAPI",
"DAGING KAMBING")
val RUBAH_BYTE = LIST_KEYWORD.map(_.split(",")).map {
  baris =>
    baris(0).getBytes
}
and then, I get stuck and I don't know what I am supposed to do next.
scala> "RESEP".map(_.toByte)
res1: scala.collection.immutable.IndexedSeq[Byte] = Vector(82, 69, 83, 69, 80)
scala> "RESEP".map(x => x -> x.toByte)
res2: scala.collection.immutable.IndexedSeq[(Char, Byte)] = Vector((R,82), (E,69), (S,83), (E,69), (P,80))
scala> val LIST_KEYWORD = List("RESEP",
| "DAGING SAPI",
| "DAGING KAMBING")
LIST_KEYWORD: List[String] = List(RESEP, DAGING SAPI, DAGING KAMBING)
scala> LIST_KEYWORD.map(_.map(_.toByte))
res3: List[scala.collection.immutable.IndexedSeq[Byte]] = List(Vector(82, 69, 83, 69, 80), Vector(68, 65, 71, 73, 78, 71, 32, 83, 65, 80, 73), Vector(68, 65, 71, 73, 78, 71, 32, 75, 65, 77, 66, 73, 78, 71))
scala> LIST_KEYWORD.map(_.map(x => x -> x.toByte))
res4: List[scala.collection.immutable.IndexedSeq[(Char, Byte)]] = List(Vector((R,82), (E,69), (S,83), (E,69), (P,80)), Vector((D,68), (A,65), (G,71), (I,73), (N,78), (G,71), ( ,32), (S,83), (A,65), (P,80), (I,73)), Vector((D,68), (A,65), (G,71), (I,73), (N,78), (G,71), ( ,32), (K,75), (A,65), (M,77), (B,66), (I,73), (N,78), (G,71)))
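If you want the codes as decimal Ints rather than Bytes, the same approach works with toInt (for plain ASCII input the values are identical); a small sketch:

// Int codes instead of Byte
LIST_KEYWORD.map(_.map(_.toInt))
// List(Vector(82, 69, 83, 69, 80), Vector(68, 65, 71, 73, 78, 71, 32, 83, 65, 80, 73), Vector(68, 65, 71, 73, 78, 71, 32, 75, 65, 77, 66, 73, 78, 71))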

Concatenating multiple lists in Scala

I have a function called generateList and a concat function as follows. It essentially concatenates the lists returned by generateList, with i starting at 24 and ending at 1.
def concat(i: Int, l: List[(String, Int)]): List[(String, Int)] = {
  if (i == 1) l else l ::: concat(i - 1, generateList(signs, i))
}
val all = concat(23, generateList(signs, 24))
I can convert this to tail-recursion. But I am curious if there a scala way of doing this?
There are many ways to do this with the built-in methods available on Scala's Lists.
Here is one approach that uses foldRight:
(1 to 24).foldRight(List.empty[(String, Int)])((i, l) => l ::: generateList(signs, i))
Starting from the range of ints you use to build the separate lists, it concatenates the result of each generateList(signs, i) call onto the accumulator, which starts out as the empty list.
Here is one way to do this:
val signs = ""
def generateList(s: String, n: Int) = n :: n * 2 :: Nil
scala> (24 to 1 by -1) flatMap (generateList(signs, _))
res2: scala.collection.immutable.IndexedSeq[Int] = Vector(24, 48, 23, 46, 22, 44, 21, 42, 20, 40, 19, 38, 18, 36, 17, 34, 16, 32, 15, 30, 14, 28, 13, 26, 12, 24, 11, 22, 10, 20, 9, 18, 8, 16, 7, 14, 6, 12, 5, 10, 4, 8, 3, 6, 2, 4, 1, 2)
What you want to do is map the list with the function x => generateList(signs, x) and then concatenate the results, i.e. flatten the list. This is exactly what flatMap does.
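To make the map-then-flatten point concrete, these two lines produce the same result (using the same toy generateList as above):

(24 to 1 by -1).map(generateList(signs, _)).flatten // map, then flatten
(24 to 1 by -1).flatMap(generateList(signs, _))     // flatMap does both in one step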

How do you group elements from Enumerator[A] to Enumerator[Seq[A]]?

I have elements from an Enumerator[A], and want to group/batch the elements to get an Enumerator[Seq[A]]. Here's code I wrote which groups A to Seq[A], but doesn't produce an Enumerator[Seq[A]].
val batchSize = 1000
dogsEnumerator
  .run(
    Iteratee.fold1[Dog, Vector[Dog]](Future.successful(Vector[Dog]())) {
      (r, c) =>
        if (r.size > batchSize)
          processBatch(r).map(_ => Vector[Dog]())
        else
          Future.successful(r :+ c)
    }.map(_ => ())
  )
This can be done pretty straightforwardly with the help of some of the Enumeratee combinators:
import play.api.libs.iteratee._
def batch[A](n: Int): Enumeratee[A, List[A]] = Enumeratee.grouped(
Enumeratee.take(n) &>> Iteratee.getChunks[A]
)
We can then use this enumeratee to transform any enumerator into a new enumerator of lists:
val intsEnumerator = Enumerator(1 to 40: _*)
intsEnumerator.through(batch(7)).run(Iteratee.foreach(println))
This will print the following:
List(1, 2, 3, 4, 5, 6, 7)
List(8, 9, 10, 11, 12, 13, 14)
List(15, 16, 17, 18, 19, 20, 21)
List(22, 23, 24, 25, 26, 27, 28)
List(29, 30, 31, 32, 33, 34, 35)
List(36, 37, 38, 39, 40)
As expected.