How does mapping on an rdd work in pyspark?

I was learning pyspark when I encountered this.
from pyspark.sql import Row
df = spark.createDataFrame([Row([0,45,63,0,0,0,0]),
Row([0,0,0,85,0,69,0]),
Row([0,89,56,0,0,0,0])],
['features'])
+--------------------+
| features|
+--------------------+
|[0, 45, 63, 0, 0,...|
|[0, 0, 0, 85, 0, ...|
|[0, 89, 56, 0, 0,...|
+--------------------+
sample = df.rdd.map(lambda row: row[0]*2)
sample.collect()
[[0, 45, 63, 0, 0, 0, 0, 0, 45, 63, 0, 0, 0, 0],
[0, 0, 0, 85, 0, 69, 0, 0, 0, 0, 85, 0, 69, 0],
[0, 89, 56, 0, 0, 0, 0, 0, 89, 56, 0, 0, 0, 0]]
My question is: why is row[0] taken as a complete list rather than as one value?
What is the property that gives the above output?

It is taken as a complete list because you gave it as one, and you defined it under a single column, "features".
When you write
df.rdd.map(lambda row: row[0]*2)
row[0] is that whole list, and multiplying a Python list by 2 repeats it, so you are effectively asking Spark "I want all values in this list to occur twice". Hence the output you are getting.
Now, how do you get the individual values in the list?
df = spark.createDataFrame([Row(0,45,63,0,0,0,0),
Row(0,0,0,85,0,69,0),
Row(0,89,56,0,0,0,0)],
['feature1' , 'feature2' , 'feature3' , 'feature4', 'feature5' , 'feature6' , 'feature7'])
This should give you access to the individual values, each in its own dedicated column.
Note: the schema syntax above is only a representation; please refer to the Spark docs for the exact syntax.
Hope this helps :)

Related

Pyspark RDD filter behaviour when filter condition variable is modified

When the following code is run:
A = ss.sparkContext.parallelize(range(1, 100))
t = 50
B = A.filter(lambda x: x < t)
print(B.count())
t = 10
C = B.filter(lambda x: x > t)
print(C.count())
The output is:
49
0
This is incorrect, as there are 39 values between 10 and 49. It seems that changing t from 50 to 10 affected the first filter as well: it got re-evaluated, so when both filters are applied consecutively the first effectively becomes x < 10, which would result in 1, 2, 3, 4, 5, 6, 7, 8, 9, followed by x > 10 resulting in an empty RDD.
But when I add debug prints to the code, the result is not what I expect, and I am looking for an explanation:
A = ss.sparkContext.parallelize(range(1, 100))
t = 50
B = A.filter(lambda x: x < t)
print(B.collect())
t = 10
print(B.collect())
print(B.count())
C = B.filter(lambda x: x > t)
print(C.collect())
print(C.count())
The output is:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]
9
[]
0
How come the count is 9 after t=10, but print(B.collect()) shows the expected RDD with values from 1 to 49? If triggering collect after changing t re-did the filter, then shouldn't collect() show values from 1-9?
I am new to PySpark; I suspect this has to do with Spark's lazy operations and caching. Can someone explain what is going on behind the scenes?
Thanks!
Your assumption is correct: the observed behaviour is related to Spark's lazy evaluation of transformations.
When B.count() is executed, Spark simply applies the filter x < t with t = 50 and prints the expected value of 49.
When C.count() is executed, Spark sees two filters in the execution plan of C, namely x < t and x > t. At this point in time t has been set to 10, and no element of the RDD satisfies both conditions of being smaller and larger than 10. Spark ignores the fact that the first filter has already been evaluated: when a Spark action is called, all transformations in the lineage of the current RDD are executed (unless some intermediate result has been cached, see below).
A way to examine this behaviour in a bit more detail is to switch to Scala and print toDebugString for both RDDs.¹
println(B.toDebugString)
prints
(4) MapPartitionsRDD[1] at filter at SparkStarter.scala:23 []
| ParallelCollectionRDD[0] at parallelize at SparkStarter.scala:19 []
while
println(C.toDebugString)
prints
(4) MapPartitionsRDD[2] at filter at SparkStarter.scala:28 []
| MapPartitionsRDD[1] at filter at SparkStarter.scala:23 []
| ParallelCollectionRDD[0] at parallelize at SparkStarter.scala:19 []
Here we can see that for RDD B one filter is applied, while for RDD C two filters are applied.
How to fix the issue?
If the result of the first filter is cached, the expected result is printed out. When t is then changed and the second filter is applied, C.count() only triggers the second filter on top of the cached result of B:
A = ss.sparkContext.parallelize(range(1, 100))
t = 50
B = A.filter(lambda x: x < t).cache()
print(B.count())
t = 10
C = B.filter(lambda x: x > t)
print(C.count())
prints the expected result.
49
39
¹ Unfortunately this works only in the Scala version of Spark. PySpark seems to "condense" the output of toDebugString (version 3.1.1).
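A side note from me (not part of the answer above): besides caching, you can also freeze the threshold by copying it into an immutable value before defining the filter, so that a later reassignment of t cannot leak into the closure. A minimal Scala sketch, assuming the same data as in the question and a SparkContext sc (e.g. from spark-shell):
val A = sc.parallelize(1 until 100)  // same data as in the question
var t = 50
val tSnapshot = t                    // copy the current value of t into a val
val B = A.filter(_ < tSnapshot)      // the closure captures the snapshot, not the variable
t = 10                               // reassigning t no longer affects B
println(B.count())                   // 49, even though count() runs after the reassignment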

Extra bytes being added to serialization with BooPickle and RocksDb

So I'm using BooPickle to serialize Scala classes before writing them to RocksDB. To serialize a class,
case class Key(a: Long, b: Int) {
  def toStringEncoding: String = s"${a}-${b}"
}
I have this implicit class
implicit class KeySerializer(key: Key) {
  def serialize: Array[Byte] =
    Pickle.intoBytes(key.toStringEncoding).array
}
The method toStringEncoding is necessary because BooPickle wasn't serializing the case class in a way that worked well with RocksDB's requirements on key ordering. I then write a bunch of key/value pairs to several SST files and ingest them into RocksDB. However, when I go to look up the keys from the db, they're not found.
If I iterate over all of the keys in the db, I find the keys are successfully written; however, extra bytes appear in the byte representation stored in the db. For example, if key.serialize outputs something like this
Array[Byte] = Array( 25, 49, 54, 48, 53, 55, 52, 52, 48, 48, 48, 45, 48, 45, 49, 54, 48, 53, 55, 52, 52, 48, 51, 48, 45, 48, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...)
What I'll find in the db is something like this
Array[Byte] = Array( 25, 49, 54, 48, 53, 55, 52, 52, 48, 48, 48, 45, 48, 45, 49, 54, 48, 53, 55, 52, 52, 48, 51, 48, 45, 48, 51, 101, 52, 97, 49, 100, 102, 48, 50, 53, 5, 8, ...)
Extra non-zero bytes replace the zero bytes at the end of the byte array. In addition, the sizes of the byte arrays are different: when I call the serialize method the byte array has size 512, but when I retrieve the key from the db its size is 4112. Does anyone know what might be causing this?
I have no experience with RocksDB or BooPickle, but I guess the problem is in calling ByteBuffer.array:
it returns the whole array backing the byte buffer rather than just the relevant part.
You can look e.g. at Gets byte array from a ByteBuffer in java
for how to properly extract the data from a ByteBuffer.
The BooPickle docs suggest the following for getting BooPickled data as a byte array:
val data: Array[Byte] = Array.ofDim[Byte](buf.remaining)
buf.get(data)
So in your case it would be something like
def serialize: Array[Byte] = {
  val buf = Pickle.intoBytes(key.toStringEncoding)
  val arr = Array.ofDim[Byte](buf.remaining)
  buf.get(arr)
  arr
}
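To see why calling .array over-reports, here is a minimal sketch of mine with a plain java.nio.ByteBuffer, independent of BooPickle and RocksDB (the 512-byte capacity is only an assumed illustration):
import java.nio.ByteBuffer
// Allocate a buffer larger than the payload, write a few bytes, then flip it
// so that position = 0 and limit = number of bytes actually written.
val buf = ByteBuffer.allocate(512)
buf.put(Array[Byte](25, 49, 54))
buf.flip()
// array() exposes the whole 512-byte backing array, zero padding included.
println(buf.array.length)  // 512
// Copying only the remaining bytes yields just the payload.
val data = Array.ofDim[Byte](buf.remaining)
buf.get(data)
println(data.length)       // 3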

Nested Looping over dynamic list of list in scala [duplicate]

Given the following list:
List(List(1,2,3), List(4,5))
I would like to generate all the possible combinations. Using yield, it can be done as follows:
scala> for (x <- l.head; y <- l.last) yield (x,y)
res17: List[(Int, Int)] = List((1,4), (1,5), (2,4), (2,5), (3,4), (3,5))
But the problem I have is that the List[List[Int]] is not fixed; it can grow and shrink in size, so I never know how many for loops I will need in advance. What I would like is to be able to pass that list into a function which will dynamically generate the combinations regardless of the number of lists I have, so:
def generator(x: List[List[Int]]): List[List[Int]]
Is there a built-in library function that can do this? If not, how do I go about doing it? Any pointers and hints would be great.
UPDATE:
The answer by @DNA blows the heap with the following (not so big) nested List structure:
List(
List(0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 255, 260, 265, 270, 275, 280, 285, 290, 295, 300),
List(0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260, 270, 280, 290, 300),
List(0, 20, 40, 60, 80, 100, 120, 140, 160, 180, 200, 220, 240, 260, 280, 300),
List(0, 50, 100, 150, 200, 250, 300),
List(0, 100, 200, 300),
List(0, 200),
List(0)
)
Calling the generator2 function as follows:
generator2(
List(
List(0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 255, 260, 265, 270, 275, 280, 285, 290, 295, 300),
List(0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260, 270, 280, 290, 300),
List(0, 20, 40, 60, 80, 100, 120, 140, 160, 180, 200, 220, 240, 260, 280, 300),
List(0, 50, 100, 150, 200, 250, 300),
List(0, 100, 200, 300),
List(0, 200),
List(0)
)
)
Is there a way to generate the cartesian product without blowing the heap?
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at scala.LowPriorityImplicits.wrapRefArray(LowPriorityImplicits.scala:73)
at recfun.Main$.recfun$Main$$generator$1(Main.scala:82)
at recfun.Main$$anonfun$recfun$Main$$generator$1$1.apply(Main.scala:83)
at recfun.Main$$anonfun$recfun$Main$$generator$1$1.apply(Main.scala:83)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at recfun.Main$.recfun$Main$$generator$1(Main.scala:83)
at recfun.Main$$anonfun$recfun$Main$$generator$1$1.apply(Main.scala:83)
at recfun.Main$$anonfun$recfun$Main$$generator$1$1.apply(Main.scala:83)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at recfun.Main$.recfun$Main$$generator$1(Main.scala:83)
at recfun.Main$$anonfun$recfun$Main$$generator$1$1.apply(Main.scala:83)
at recfun.Main$$anonfun$recfun$Main$$generator$1$1.apply(Main.scala:83)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at recfun.Main$.recfun$Main$$generator$1(Main.scala:83)
at recfun.Main$$anonfun$recfun$Main$$generator$1$1.apply(Main.scala:83)
at recfun.Main$$anonfun$recfun$Main$$generator$1$1.apply(Main.scala:83)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
Here's a recursive solution:
def generator(x: List[List[Int]]): List[List[Int]] = x match {
  case Nil => List(Nil)
  case h :: _ => h.flatMap(i => generator(x.tail).map(i :: _))
}
which produces:
val a = List(List(1, 2, 3), List(4, 5))
val b = List(List(1, 2, 3), List(4, 5), List(6, 7))
generator(a) //> List(List(1, 4), List(1, 5), List(2, 4),
//| List(2, 5), List(3, 4), List(3, 5))
generator(b) //> List(List(1, 4, 6), List(1, 4, 7), List(1, 5, 6),
//| List(1, 5, 7), List(2, 4, 6), List(2, 4, 7),
//| List(2, 5, 6), List(2, 5, 7), List(3, 4, 6),
//| List(3, 4, 7), List(3, 5, 6), List(3, 5, 7))
Update: the second case can also be written as a for comprehension, which may be a little clearer:
def generator2(x: List[List[Int]]): List[List[Int]] = x match {
  case Nil => List(Nil)
  case h :: t => for (j <- generator2(t); i <- h) yield i :: j
}
Update 2: for larger datasets, if you run out of memory, you can use Streams instead (if it makes sense to process the results incrementally). For example:
def generator(x: Stream[Stream[Int]]): Stream[Stream[Int]] =
  if (x.isEmpty) Stream(Stream.empty)
  else x.head.flatMap(i => generator(x.tail).map(i #:: _))
// NB pass in the data as Stream of Streams, not List of Lists
generator(input).take(3).foreach(x => println(x.toList))
>List(0, 0, 0, 0, 0, 0, 0)
>List(0, 0, 0, 0, 0, 200, 0)
>List(0, 0, 0, 0, 100, 0, 0)
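A side note from me: on Scala 2.13 and later, Stream is deprecated in favour of LazyList, so the same idea would look roughly like this:
def generator(x: LazyList[LazyList[Int]]): LazyList[LazyList[Int]] =
  if (x.isEmpty) LazyList(LazyList.empty)
  else x.head.flatMap(i => generator(x.tail).map(i #:: _))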
It feels like your problem can be described in terms of recursion.
If you have n lists of ints, list1 of size m, and list2, ..., listn:
generate the combinations for list2 to listn (so n - 1 lists);
for each of those combinations, generate m new ones, one for each value of list1;
the base case is a list containing a single list of ints: you just split its elements into singleton lists.
So with List(List(1,2), List(3), List(4, 5)),
the result of your recursive call is List(List(3,4), List(3,5)), and for each of those you add 2 combinations: List(1,3,4), List(2,3,4), List(1,3,5), List(2,3,5).
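A minimal sketch of the recursion described above (my own illustration, not the original poster's code):
def combine(lists: List[List[Int]]): List[List[Int]] = lists match {
  // Base case: a single list is split into singleton combinations.
  case last :: Nil => last.map(List(_))
  // Recursive case: build the combinations for the tail, then prepend
  // every value of the head list to each of them.
  case head :: tail => combine(tail).flatMap(c => head.map(_ :: c))
  case Nil => Nil
}
// combine(List(List(1, 2), List(3), List(4, 5)))
// => List(List(1, 3, 4), List(2, 3, 4), List(1, 3, 5), List(2, 3, 5))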
Ezekiel has exactly what I was looking for. This is just a minor tweak of it to make it generic.
def generateCombinations[T](x: List[List[T]]): List[List[T]] = {
  x match {
    case Nil => List(Nil)
    case h :: _ => h.flatMap(i => generateCombinations(x.tail).map(i :: _))
  }
}
Here is another solution based on Ezekiel's. It is more verbose, but it uses tail recursion (stack-safe).
import scala.annotation.tailrec
def generateCombinations[A](in: List[List[A]]): List[List[A]] =
  generate(in, List.empty)
@tailrec
private def generate[A](in: List[List[A]], acc: List[List[A]]): List[List[A]] = in match {
  case Nil => acc
  case head :: tail => generate(tail, generateAcc(acc, head))
}
private def generateAcc[A](oldAcc: List[List[A]], as: List[A]): List[List[A]] = {
  oldAcc match {
    case Nil => as.map(List(_))
    case nonEmptyAcc =>
      for {
        a <- as
        xs <- nonEmptyAcc
      } yield a :: xs
  }
}
I realize this is old, but it seems like no other answer provided the non-recursive solution with fold.
def generator[A](xs: List[List[A]]): List[List[A]] =
  xs.foldRight(List(List.empty[A])) { (next, combinations) =>
    for (a <- next; as <- combinations) yield a +: as
  }
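For example (a small usage check I added, matching the result shown earlier in this thread):
generator(List(List(1, 2, 3), List(4, 5)))
// => List(List(1, 4), List(1, 5), List(2, 4), List(2, 5), List(3, 4), List(3, 5))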

How to get the first 100 prime numbers in Scala, as I got a result but it displays blank where a prime is not found

output : prime numbers
2
3
()
5
()
7
()
()
I want:
2
3
5
7
def primeNumber(range: Int): Unit = {
  val primeNumbers: immutable.IndexedSeq[AnyVal] =
    for (number: Int <- 2 to range) yield {
      var isPrime = true
      for (checker: Int <- 2 to Math.sqrt(number).toInt if number % checker == 0 if isPrime) isPrime = false
      if (isPrime) number
    }
  println("prime numbers")
  for (prime <- primeNumbers)
    println(prime)
}
The underlying problem here is that your yield block will effectively return an Int or a Unit depending on isPrime. This leads your collection to be of type AnyVal, because that is pretty much the least upper bound that can represent both types. Unit is a type inhabited by only one value, which is written in Scala as an empty pair of round brackets, (), and that is what you see in your list.
As Puneeth Reddy V said, you can use collect to filter out all the non-Int values, but I think that is a suboptimal approach (partial functions are often considered a code smell, depending on what style of Scala you write). More idiomatic would be to rethink your loop (such for loops are scarcely used in Scala); this could certainly be done with a foldLeft operation, or maybe something else entirely.
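For illustration, a minimal foldLeft-based sketch (my own addition, not from the answer above); the trial division up to the square root mirrors the original code:
def primeNumbers(range: Int): Vector[Int] =
  (2 to range).foldLeft(Vector.empty[Int]) { (primes, number) =>
    // Keep number only if no value in 2..sqrt(number) divides it.
    val isPrime = (2 to math.sqrt(number).toInt).forall(number % _ != 0)
    if (isPrime) primes :+ number else primes
  }
// primeNumbers(10) == Vector(2, 3, 5, 7)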
You can use collect on your output
primeNumbers.collect{
case i : Int => i
}
res2: IndexedSeq[Int] = Vector(2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97)
The reason is that your if expression returns two types of value: the prime number in one branch and an empty () in the other. You could return 0 in an else branch and filter it out before printing.
scala> def primeNumber(range: Int): Unit ={
|
| val primeNumbers: IndexedSeq[Int] =
|
| for (number :Int <- 2 to range) yield{
|
| var isPrime = true
|
| for(checker : Int <- 2 to Math.sqrt(number).toInt if number%checker==0 if isPrime) isPrime = false
|
| if(isPrime) number
| else
| 0
| }
|
| println("prime numbers" + primeNumbers)
| for(prime <- primeNumbers.filter(_ > 0))
| println(prime)
| }
primeNumber: (range: Int)Unit
scala> primeNumber(10)
prime numbersVector(2, 3, 0, 5, 0, 7, 0, 0, 0)
2
3
5
7
But we should not write the code the way you have written it. You are using mutable variables. Try to write code in an immutable way.
For example
scala> def isPrime(number: Int) =
| number > 1 && !(2 to number - 1).exists(e => number % e == 0)
isPrime: (number: Int)Boolean
scala> def generatePrimes(starting: Int): Stream[Int] = {
| if(isPrime(starting))
| starting #:: generatePrimes(starting + 1)
| else
| generatePrimes(starting + 1)
| }
generatePrimes: (starting: Int)Stream[Int]
scala> generatePrimes(2).take(100).toList
res12: List[Int] = List(2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113, 127, 131, 137, 139, 149, 151, 157, 163, 167, 173, 179, 181, 191, 193, 197, 199, 211, 223, 227, 229, 233, 239, 241, 251, 257, 263, 269, 271, 277, 281, 283, 293, 307, 311, 313, 317, 331, 337, 347, 349, 353, 359, 367, 373, 379, 383, 389, 397, 401, 409, 419, 421, 431, 433, 439, 443, 449, 457, 461, 463, 467, 479, 487, 491, 499, 503, 509, 521, 523, 541)

One hot encoding in RDD in scala

I have user data from the MovieLens ml-100k dataset.
Sample rows are:
1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
4|24|M|technician|43537
5|33|F|other|15213
I have read the data as an RDD as follows:
scala> val user_data = sc.textFile("/home/user/Documents/movielense/ml-100k/u.user").map(x=>x.split('|'))
user_data: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[5] at map at <console>:29
scala> user_data.take(5)
res0: Array[Array[String]] = Array(Array(1, 24, M, technician, 85711), Array(2, 53, F, other, 94043), Array(3, 23, M, writer, 32067), Array(4, 24, M, technician, 43537), Array(5, 33, F, other, 15213))
Encode the distinct professions with zipWithIndex:
scala> val indexed_profession = user_data.map(x=>x(3)).distinct().sortBy[String](x=>x).zipWithIndex()
indexed_profession: org.apache.spark.rdd.RDD[(String, Long)] = ZippedWithIndexRDD[18] at zipWithIndex at <console>:31
scala> indexed_profession.collect()
res1: Array[(String, Long)] = Array((administrator,0), (artist,1), (doctor,2), (educator,3), (engineer,4), (entertainment,5), (executive,6), (healthcare,7), (homemaker,8), (lawyer,9), (librarian,10), (marketing,11), (none,12), (other,13), (programmer,14), (retired,15), (salesman,16), (scientist,17), (student,18), (technician,19), (writer,20))
I want to do one-hot encoding for the Occupation column.
The expected output is:
userId Age Gender Occupation Zipcodes technician other writer
1 24 M technician 85711 1 0 0
2 53 F other 94043 0 1 0
3 23 M writer 32067 0 0 1
4 24 M technician 43537 1 0 0
5 33 F other 15213 0 1 0
How do I achieve this on an RDD in Scala?
I want to perform the operation on the RDD without converting it to a DataFrame.
Any help is appreciated.
Thanks
I did this in the following way:
1) Read the user data:
scala> val user_data = sc.textFile("/home/user/Documents/movielense/ml-100k/u.user").map(x=>x.split('|'))
2) Show 5 rows of data:
scala> user_data.take(5)
res0: Array[Array[String]] = Array(Array(1, 24, M, technician, 85711), Array(2, 53, F, other, 94043), Array(3, 23, M, writer, 32067), Array(4, 24, M, technician, 43537), Array(5, 33, F, other, 15213))
3) Create a map of professions by indexing:
scala> val indexed_profession = user_data.map(x=>x(3)).distinct().sortBy[String](x=>x).zipWithIndex().collectAsMap()
scala> indexed_profession
res35: scala.collection.Map[String,Long] = Map(scientist -> 17, writer -> 20, doctor -> 2, healthcare -> 7, administrator -> 0, educator -> 3, homemaker -> 8, none -> 12, artist -> 1, salesman -> 16, executive -> 6, programmer -> 14, engineer -> 4, librarian -> 10, technician -> 19, retired -> 15, entertainment -> 5, marketing -> 11, student -> 18, lawyer -> 9, other -> 13)
4) Create an encode function which does one-hot encoding of the profession:
scala> def encode(x: String) =
| {
|   var encodeArray = Array.fill(21)(0)
|   encodeArray(indexed_profession.get(x).get.toInt) = 1
|   encodeArray
| }
5) Apply the encode function to the user data:
scala> val encode_user_data = user_data.map{ x => (x(0),x(1),x(2),x(3),x(4),encode(x(3)))}
6) Show the encoded data:
scala> encode_user_data.take(6)
res71: Array[(String, String, String, String, String, Array[Int])] = Array(
  (1,24,M,technician,85711,Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0)),
  (2,53,F,other,94043,Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0)),
  (3,23,M,writer,32067,Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1)),
  (4,24,M,technician,43537,Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0)),
  (5,33,F,other,15213,Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0)),
  (6,42,M,executive,98101,Array(0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)))
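A side note from me on the design: collectAsMap pulls the profession index to the driver, and that map is then captured in the closure of encode; broadcasting it is the more common idiom for a lookup table that every task needs. A rough sketch, reusing the names from the steps above:
val professionIndex = sc.broadcast(
  user_data.map(x => x(3)).distinct().sortBy(x => x).zipWithIndex().collectAsMap())
def encode(p: String): Array[Int] = {
  val arr = Array.fill(professionIndex.value.size)(0)   // one slot per profession
  arr(professionIndex.value(p).toInt) = 1               // set the slot for this profession
  arr
}
val encoded = user_data.map(x => (x(0), x(1), x(2), x(3), x(4), encode(x(3))))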
[My solution is for DataFrames.] The snippet below should help in converting categorical columns to one-hot columns. You have to create a catMap object with column names as keys and the list of categories as values.
var OutputDf = df
for (cat <- catMap.keys) {
  val categories = catMap(cat)
  for (oneHotVal <- categories) {
    OutputDf = OutputDf.withColumn(oneHotVal,
      when(lower(OutputDf(cat)) === oneHotVal, 1).otherwise(0))
  }
}
OutputDf
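As a usage note from me: the snippet relies on the when and lower column functions, and catMap is expected to map each categorical column name to its list of categories. A hypothetical example for the Occupation column from the question above (the column and category names are assumptions for illustration):
import org.apache.spark.sql.functions.{lower, when}
// Hypothetical mapping: one key per categorical column, values are the (lower-case) categories to encode.
val catMap: Map[String, Seq[String]] = Map(
  "Occupation" -> Seq("technician", "other", "writer")
)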