One hot encoding in RDD in Scala

I have user data from the MovieLens ml-100k dataset.
Sample rows:
1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
4|24|M|technician|43537
5|33|F|other|15213
I have read the data into an RDD as follows:
scala> val user_data = sc.textFile("/home/user/Documents/movielense/ml-100k/u.user").map(x=>x.split('|'))
user_data: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[5] at map at <console>:29
scala> user_data.take(5)
res0: Array[Array[String]] = Array(Array(1, 24, M, technician, 85711), Array(2, 53, F, other, 94043), Array(3, 23, M, writer, 32067), Array(4, 24, M, technician, 43537), Array(5, 33, F, other, 15213))
Encode the distinct professions with zipWithIndex:
scala> val indexed_profession = user_data.map(x=>x(3)).distinct().sortBy[String](x=>x).zipWithIndex()
indexed_profession: org.apache.spark.rdd.RDD[(String, Long)] = ZippedWithIndexRDD[18] at zipWithIndex at <console>:31
scala> indexed_profession.collect()
res1: Array[(String, Long)] = Array((administrator,0), (artist,1), (doctor,2), (educator,3), (engineer,4), (entertainment,5), (executive,6), (healthcare,7), (homemaker,8), (lawyer,9), (librarian,10), (marketing,11), (none,12), (other,13), (programmer,14), (retired,15), (salesman,16), (scientist,17), (student,18), (technician,19), (writer,20))
I want to do one-hot encoding for the Occupation column.
The expected output is:
userId  Age  Gender  Occupation  Zipcodes  technician  other  writer
1       24   M       technician  85711     1           0      0
2       53   F       other       94043     0           1      0
3       23   M       writer      32067     0           0      1
4       24   M       technician  43537     1           0      0
5       33   F       other       15213     0           1      0
How do I achieve this on an RDD in Scala? I want to perform the operation on the RDD without converting it to a DataFrame.
Any help is appreciated. Thanks.

I did this in the following way:
1) Read the user data:
scala> val user_data = sc.textFile("/home/user/Documents/movielense/ml-100k/u.user").map(x=>x.split('|'))
2) Show 5 rows of data:
scala> user_data.take(5)
res0: Array[Array[String]] = Array(Array(1, 24, M, technician, 85711), Array(2, 53, F, other, 94043), Array(3, 23, M, writer, 32067), Array(4, 24, M, technician, 43537), Array(5, 33, F, other, 15213))
3) Create a map of professions by indexing:
scala> val indexed_profession = user_data.map(x=>x(3)).distinct().sortBy[String](x=>x).zipWithIndex().collectAsMap()
scala> indexed_profession
res35: scala.collection.Map[String,Long] = Map(scientist -> 17, writer -> 20, doctor -> 2, healthcare -> 7, administrator -> 0, educator -> 3, homemaker -> 8, none -> 12, artist -> 1, salesman -> 16, executive -> 6, programmer -> 14, engineer -> 4, librarian -> 10, technician -> 19, retired -> 15, entertainment -> 5, marketing -> 11, student -> 18, lawyer -> 9, other -> 13)
4) Create an encode function which one-hot encodes the profession. Note that the map built in step 3 is named indexed_profession, so that is what encode must look up:
scala> def encode(x: String) = {
     |   val encodeArray = Array.fill(21)(0) // 21 distinct professions in ml-100k
     |   encodeArray(indexed_profession(x).toInt) = 1
     |   encodeArray
     | }
5) Apply the encode function to the user data:
scala> val encode_user_data = user_data.map{ x => (x(0),x(1),x(2),x(3),x(4),encode(x(3)))}
6) Show the encoded data:
scala> encode_user_data.take(6)
res71: Array[(String, String, String, String, String, Array[Int])] = Array(
  (1,24,M,technician,85711,Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0)),
  (2,53,F,other,94043,Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0)),
  (3,23,M,writer,32067,Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1)),
  (4,24,M,technician,43537,Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0)),
  (5,33,F,other,15213,Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0)),
  (6,42,M,executive,98101,Array(0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)))
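A refinement worth noting (not part of the answer above): encode captures the driver-side map in its closure, so Spark serializes that map with every task. For larger lookup tables you can broadcast it instead, so each executor keeps a single read-only copy. A minimal sketch under that assumption (the names professionIndex, bcIndex, and encodedUserData are new here), reusing user_data from step 1:

// Build the profession -> index map on the driver, then broadcast it.
val professionIndex = user_data.map(_(3)).distinct().sortBy(identity).zipWithIndex().collectAsMap()
val bcIndex = sc.broadcast(professionIndex)
val numProfessions = professionIndex.size // 21 for ml-100k

val encodedUserData = user_data.map { x =>
  val oneHot = Array.fill(numProfessions)(0)
  oneHot(bcIndex.value(x(3)).toInt) = 1 // set the slot for this row's profession
  (x(0), x(1), x(2), x(3), x(4), oneHot)
}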

[My solution is for DataFrames] The code below converts categorical columns to one-hot columns. You have to create a Map object catMap whose keys are column names and whose values are the lists of categories for that column.
import org.apache.spark.sql.functions.{lower, when}

var outputDf = df // df is the input DataFrame
for (cat <- catMap.keys) {
  val categories = catMap(cat)
  for (oneHotVal <- categories) {
    outputDf = outputDf.withColumn(oneHotVal,
      when(lower(outputDf(cat)) === oneHotVal, 1).otherwise(0))
  }
}
outputDf
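For instance, a hypothetical catMap for the question's data, assuming df has an occupation column whose values match the lower-cased category names (since the code compares against lower(outputDf(cat))):

// Hypothetical example: one column, three categories of interest.
val catMap: Map[String, List[String]] =
  Map("occupation" -> List("technician", "other", "writer"))

Each listed category then becomes its own 0/1 column on outputDf.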

Related

Generate Adjacency matrix from a Map

I know this is a lengthy question :) I'm trying to implement Hamiltonian Cycle on a dataset in Scala 2.11; as part of this I'm trying to generate an adjacency matrix from a Map of values.
Explanation:
Keys 0 to 4 are the different cities, so in the allRoads variable below:
0 -> Set(1, 2) means city0 is connected to city1 and city2
1 -> Set(0, 2, 3, 4) means city1 is connected to city0, city2, city3, city4
.
.
I need to generate the adjacency matrix: 1 if a city is connected, 0 otherwise. That is,
for "0 -> Set(1, 2)", I need to generate: Map(0 -> Array(0, 1, 1, 0, 0))
Input:
var allRoads = Map(0 -> Set(1, 2), 1 -> Set(0, 2, 3, 4), 2 -> Set(0, 1, 3, 4), 3 -> Set(2, 4, 1), 4 -> Set(2, 3, 1))
My Code:
val n: Int = 5
val listOfCities = (0 to n-1).toList
var allRoads = Map(0 -> Set(1, 2), 1 -> Set(0, 2, 3, 4), 2 -> Set(0, 1, 3, 4), 3 -> Set(2, 4, 1), 4 -> Set(2, 3, 1))
var adjmat: Array[Int] = Array()
for (i <- 0 until allRoads.size; j <- listOfCities) {
  allRoads.get(i) match {
    case Some(elem) => if (elem.contains(j)) adjmat = adjmat :+ 1 else adjmat = adjmat :+ 0
    case _ => None
  }
}
which outputs:
output: Array[Int] = Array(0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0)
Expected output: something like the following (please suggest if there is a better format to feed into Hamiltonian Cycle):
Map(0 -> Array(0, 1, 1, 0, 0),1 -> Array(1, 0, 1, 1, 1),2 -> Array(1, 1, 0, 1, 1),3 -> Array(0, 1, 1, 0, 1),4 -> Array(0, 1, 1, 1, 0))
Not sure how to store the above output as a Map or a plain 2D array.
Try
val cities = listOfCities.toSet
allRoads.map { case (city, roads) =>
  city -> listOfCities.map(c => if ((cities diff roads).contains(c)) 0 else 1)
}
which outputs
Map(0 -> List(0, 1, 1, 0, 0), 1 -> List(1, 0, 1, 1, 1), 2 -> List(1, 1, 0, 1, 1), 3 -> List(0, 1, 1, 0, 1), 4 -> List(0, 1, 1, 1, 0))
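Since each roads value is already a Set, the diff is not strictly necessary; you can test membership directly. An equivalent, slightly simpler variant (same allRoads and listOfCities as above):

val adjacency = allRoads.map { case (city, roads) =>
  // 1 if there is a road from city to j, 0 otherwise
  city -> listOfCities.map(j => if (roads.contains(j)) 1 else 0)
}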

How does mapping on an RDD work in PySpark?

I was learning PySpark when I encountered this.
from pyspark.sql import Row

df = spark.createDataFrame([Row([0,45,63,0,0,0,0]),
                            Row([0,0,0,85,0,69,0]),
                            Row([0,89,56,0,0,0,0])],
                           ['features'])
+--------------------+
| features|
+--------------------+
|[0, 45, 63, 0, 0,...|
|[0, 0, 0, 85, 0, ...|
|[0, 89, 56, 0, 0,...|
+--------------------+
sample = df.rdd.map(lambda row: row[0]*2)
sample.collect()
[[0, 45, 63, 0, 0, 0, 0, 0, 45, 63, 0, 0, 0, 0],
[0, 0, 0, 85, 0, 69, 0, 0, 0, 0, 85, 0, 69, 0],
[0, 89, 56, 0, 0, 0, 0, 0, 89, 56, 0, 0, 0, 0]]
My question is: why is row[0] taken as the complete list rather than one value?
What property gives the above output?
It is taken as a complete list because you supplied it as one, defined under the single column "features".
When you say
df.rdd.map(lambda row: row[0]*2)
row[0] is that whole list, and * 2 on a Python list repeats it, i.e. concatenates the list with itself. Hence the output you are getting.
Now, how to get individual values in the list:
df = spark.createDataFrame([Row(0,45,63,0,0,0,0),
                            Row(0,0,0,85,0,69,0),
                            Row(0,89,56,0,0,0,0)],
                           ['feature1', 'feature2', 'feature3', 'feature4', 'feature5', 'feature6', 'feature7'])
This should give you access to the individual values, each in a dedicated column.
Note: the schema syntax above is just a sketch; please refer to the Spark docs for the exact syntax.
Hope this helps :)

Scala Split Seq or List by Delimiter

Let's say I have a sequence of ints like this:
val mySeq = Seq(0, 1, 2, 1, 0, -1, 0, 1, 2, 3, 2)
I want to split this by let's say 0 as a delimiter to look like this:
val mySplitSeq = Seq(Seq(0, 1, 2, 1), Seq(0, -1), Seq(0, 1, 2, 3, 2))
What is the most elegant way to do this in Scala?
This works alright:
mySeq.foldLeft(Vector.empty[Vector[Int]]) {
  case (acc, i) if acc.isEmpty => Vector(Vector(i))
  case (acc, 0)                => acc :+ Vector(0)
  case (acc, i)                => acc.init :+ (acc.last :+ i)
}
where 0 (or whatever) is your delimiter.
Efficient O(n) solution
Tail-recursive solution that never appends anything to lists:
def splitBy[A](sep: A, seq: List[A]): List[List[A]] = {
  @annotation.tailrec
  def rec(xs: List[A], revAcc: List[List[A]]): List[List[A]] = xs match {
    case Nil => revAcc.reverse
    case h :: t =>
      if (h == sep) {
        val (pref, suff) = xs.tail.span(_ != sep)
        rec(suff, (h :: pref) :: revAcc)
      } else {
        val (pref, suff) = xs.span(_ != sep)
        rec(suff, pref :: revAcc)
      }
  }
  rec(seq, Nil)
}
val mySeq = List(0, 1, 2, 1, 0, -1, 0, 1, 2, 3, 2)
println(splitBy(0, mySeq))
produces:
List(List(0, 1, 2, 1), List(0, -1), List(0, 1, 2, 3, 2))
It also handles the case where the input does not start with the separator.
For fun: another O(n) solution that works for small integers
This is more of a warning than a solution. Trying to reuse String's split does not result in anything sane:
val mySeq = Seq(0, 1, 2, 1, 0, -1, 0, 1, 2, 3, 2)
val z = mySeq.min
val res = mySeq
  .map(x => (x - z).toChar)
  .mkString
  .split((-z).toChar)
  .map(s => 0 :: s.toList.map(_.toInt + z))
  .toList
  .tail
It will fail if the integers span a range larger than 65535, and it looks pretty insane. Nevertheless, I find it amusing that it works at all:
res: List[List[Int]] = List(List(0, 1, 2, 1), List(0, -1), List(0, 1, 2, 3, 2))
You can use foldLeft:
val delimiter = 0
val res = mySeq.foldLeft(Seq[Seq[Int]]()) {
  case (acc, `delimiter`) => acc :+ Seq(delimiter)
  case (acc, v)           => acc.init :+ (acc.last :+ v)
}
NOTE: This assumes the input necessarily starts with the delimiter.
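If the input may not start with the delimiter, an empty-accumulator guard (as in the first answer above) removes that assumption. A minimal sketch, reusing the delimiter val:

val res = mySeq.foldLeft(Seq[Seq[Int]]()) {
  case (acc, v) if acc.isEmpty => Seq(Seq(v))           // first element, delimiter or not
  case (acc, `delimiter`)      => acc :+ Seq(delimiter) // start a new group
  case (acc, v)                => acc.init :+ (acc.last :+ v)
}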
One more variant using indices and reverse slicing
scala> val s = Seq(0,1, 2, 1, 0, -1, 0, 1, 2, 3, 2)
s: scala.collection.mutable.Seq[Int] = ArrayBuffer(0, 1, 2, 1, 0, -1, 0, 1, 2, 3, 2)
scala> s.indices.filter( s(_)==0).+:(if(s(0)!=0) -1 else -2).filter(_>= -1 ).reverse.map( {var p=0; x=>{ val y=s.slice(x,s.size-p);p=s.size-x;y}}).reverse
res173: scala.collection.immutable.IndexedSeq[scala.collection.mutable.Seq[Int]] = Vector(ArrayBuffer(0, 1, 2, 1), ArrayBuffer(0, -1), ArrayBuffer(0, 1, 2, 3, 2))
It also works if the sequence does not start with the delimiter (thanks to jrook):
scala> val s = Seq(1, 2, 1, 0, -1, 0, 1, 2, 3, 2)
s: scala.collection.mutable.Seq[Int] = ArrayBuffer(1, 2, 1, 0, -1, 0, 1, 2, 3, 2)
scala> s.indices.filter( s(_)==0).+:(if(s(0)!=0) -1 else -2).filter(_>= -1 ).reverse.map( {var p=0; x=>{ val y=s.slice(x,s.size-p);p=s.size-x;y}}).reverse
res174: scala.collection.immutable.IndexedSeq[scala.collection.mutable.Seq[Int]] = Vector(ArrayBuffer(1, 2, 1), ArrayBuffer(0, -1), ArrayBuffer(0, 1, 2, 3, 2))
UPDATE 1:
A more compact version, removing the "reverse" calls above:
scala> val s = Seq(0,1, 2, 1, 0, -1, 0, 1, 2, 3, 2)
s: scala.collection.mutable.Seq[Int] = ArrayBuffer(0, 1, 2, 1, 0, -1, 0, 1, 2, 3, 2)
scala> s.indices.filter( s(_)==0).+:(if(s(0)!=0) -1 else -2).filter(_>= -1 ).:+(s.size).sliding(2,1).map( x=>s.slice(x(0),x(1)) ).toList
res189: List[scala.collection.mutable.Seq[Int]] = List(ArrayBuffer(0, 1, 2, 1), ArrayBuffer(0, -1), ArrayBuffer(0, 1, 2, 3, 2))
scala> val s = Seq(1, 2, 1, 0, -1, 0, 1, 2, 3, 2)
s: scala.collection.mutable.Seq[Int] = ArrayBuffer(1, 2, 1, 0, -1, 0, 1, 2, 3, 2)
scala> s.indices.filter( s(_)==0).+:(if(s(0)!=0) -1 else -2).filter(_>= -1 ).:+(s.size).sliding(2,1).map( x=>s.slice(x(0),x(1)) ).toList
res190: List[scala.collection.mutable.Seq[Int]] = List(ArrayBuffer(1, 2, 1), ArrayBuffer(0, -1), ArrayBuffer(0, 1, 2, 3, 2))
Here is a solution I believe is both short and should run in O(n):
import scala.collection.mutable.ArrayBuffer

def seqSplitter[T](s: ArrayBuffer[T], delimiter: T) =
  (0 +: s.indices.filter(s(_) == delimiter) :+ s.size) // find split locations
    .sliding(2)
    .map(idx => s.slice(idx.head, idx.last)) // extract the slice
    .dropWhile(_.isEmpty)                    // take care of the first element
    .toList
The idea is to take all the indices where the delimiter occurs, slide over them and slice the sequence at those locations. dropWhile takes care of the first element being a delimiter or not.
Here I am putting all the data in an ArrayBuffer to ensure slicing will take O(size_of_slice).
val mySeq = ArrayBuffer(0, 1, 2, 1, 0, -1, 0, 1, 2, 3, 2)
seqSplitter(mySeq, 0)
Gives:
List(ArrayBuffer(0, 1, 2, 1), ArrayBuffer(0, -1), ArrayBuffer(0, 1, 2, 3, 2))
A more detailed complexity analysis
The operations are:
1) Filter the delimiter indices (O(n)).
2) Loop over the list of indices obtained in the previous step (O(num_of_delimiters)).
3) For each pair of indices corresponding to a slice, copy the slice from the array into the final collection (O(size_of_slice)).
The last two steps sum up to O(n).
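As a quick check of the dropWhile behaviour, an input that does not start with the delimiter keeps its leading slice (using the seqSplitter defined above):

seqSplitter(ArrayBuffer(1, 2, 0, 3), 0)
// => List(ArrayBuffer(1, 2), ArrayBuffer(0, 3))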

Garbage collection in the Scala shell

So I have just started using Scala and have the following code to create an IndexedSeq of dummy data called out. The dummy data consists of 20000 tuples, each containing a 36-character unique identifier and a list of 1000 floats.
import scala.util.Random

def uuid = java.util.UUID.randomUUID.toString

def generateRandomList(size: Int): List[Float] = {
  List.fill(size)(Random.nextFloat)
}

val numDimensions = 1000
val numberToWrite = 20000

val out = for (i <- 1 to numberToWrite) yield {
  val randomList = generateRandomList(numDimensions)
  (uuid, randomList) // trying tuples instead
}
But when I run the last statement (just by copying and pasting into the Scala shell) I get the following error:
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.lang.Float.valueOf(Float.java:433)
at scala.runtime.BoxesRunTime.boxToFloat(BoxesRunTime.java:73)
at $anonfun$generateRandomArray$1.apply(<console>:14)
at scala.collection.generic.GenTraversableFactory.fill(GenTraversableFactory.scala:90)
at .generateRandomArray(<console>:14)
at $anonfun$1.apply(<console>:17)
at $anonfun$1.apply(<console>:16)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.Range.foreach(Range.scala:160)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
... 20 elided
This is explained as a Java exception that occurs when most of the time is spent doing garbage collection (GC) [1].
According to [2], a 36-character string should take about 112 bytes. A Float takes 4 bytes, and I have 1000 of them in each inner list, so about 4000 bytes in total. Ignoring the list and tuple overhead, each element of my out IndexedSeq is therefore roughly ~4200 bytes, and having 20000 of them means ~84e6 bytes overall.
With this in mind after the exception I run this (taken from [3]):
scala> val heapSize = Runtime.getRuntime().totalMemory()   // current size of heap in bytes
heapSize: Long = 212860928
scala> val heapMaxSize = Runtime.getRuntime().maxMemory()  // maximum size of heap in bytes; the heap cannot grow beyond this, and any attempt will result in an OutOfMemoryError
heapMaxSize: Long = 239075328
scala> val heapFreeSize = Runtime.getRuntime().freeMemory() // free memory within the heap in bytes; increases after garbage collection, decreases as new objects are created
heapFreeSize: Long = 152842176
Although it seems that my available max heap size is greater than the rough amount of memory I think I need, I try increasing the heap size ([4]) via ./scala -J-Xmx2g. Although this solves my problem, it would be good to know if there is a better way to create this random data that avoids having to increase the memory available to the JVM.
I therefore have these three questions, which I would be grateful if someone could answer:
When does garbage collection occur in Scala, and in particular in the Scala shell? In my commands above, what is there that can get collected, and so why is the GC being called? (Sorry, this second part probably shows my lack of knowledge about the GC.)
Are my rough calculations of the amount of memory I am using approximately valid? (Sure, I expect a bit more overhead for the lists and tuples, but relatively not that much.) If so, why do I run out of memory when my max heap size (239e6 bytes) should cover it? And if not, what extra memory am I using?
Is there a better way to create random data for this? For context, I am trying to create some dummy data that I can parallelise into Spark (using sc.parallelize) and then play around with. (So, to get it to work when I moved to trying it in Spark, I increased the driver memory by setting spark.driver.memory to 2g in my Spark conf, rather than using the -J-Xmx2g flag above.)
Thanks for your help!
Links
[1] Error java.lang.OutOfMemoryError: GC overhead limit exceeded
[2] How much memory does a string use in Java 8?
[3] How to view the current heap size that an application is using?
[4] Increase JVM heap size for Scala?
To answer the REPL-specific part:
https://issues.scala-lang.org/browse/SI-4331
Folks doing big allocations usually prefer Array and Buffer.
Note that there's overhead in List, including boxing the primitive values.
The JVM heap is managed in pools, which you can size relative to each other. But generally speaking:
scala> var x = new Array[Byte](20000000 * 4)
x: Array[Byte] = Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
scala> x = null
x: Array[Byte] = null
scala> x = new Array[Byte](20000000 * 4)
x: Array[Byte] = [B#475530b9
scala> x = null
x: Array[Byte] = null
scala> x = new Array[Byte](20000000 * 4)
java.lang.OutOfMemoryError: Java heap space
... 32 elided
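On the question's third point, here is a sketch (not from the answer above) of the same dummy-data generation using Array[Float] instead of List[Float]: arrays store primitives unboxed, at roughly the 4 bytes per float the question's estimate assumed, whereas each element of a List[Float] costs a boxed java.lang.Float plus a cons cell:

import scala.util.Random

def uuid = java.util.UUID.randomUUID.toString

val numDimensions = 1000
val numberToWrite = 20000

// Array[Float] keeps the floats unboxed, unlike List[Float].
val out = for (i <- 1 to numberToWrite) yield
  (uuid, Array.fill(numDimensions)(Random.nextFloat))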

Array initializing in Scala

I'm new to Scala; I just started learning it today. I would like to know how to initialize an array in Scala.
Example Java code:
String[] arr = { "Hello", "World" };
What is the equivalent of the above code in Scala ?
scala> val arr = Array("Hello","World")
arr: Array[java.lang.String] = Array(Hello, World)
To initialize an array filled with zeros, you can use:
> Array.fill[Byte](5)(0)
Array(0, 0, 0, 0, 0)
This is equivalent to Java's new byte[5].
You can also do more dynamic initialization with fill, e.g.
Array.fill(10){scala.util.Random.nextInt(5)}
==>
Array[Int] = Array(0, 1, 0, 0, 3, 2, 4, 1, 4, 3)
In addition to Vasil's answer: if you have the values given as a Scala collection, you can write
val list = List(1,2,3,4,5)
val arr = Array[Int](list:_*)
println(arr.mkString)
But usually the toArray method is more handy:
val list = List(1,2,3,4,5)
val arr = list.toArray
println(arr.mkString)
If you know the Array's length but not its content, you can use
val length = 5
val temp = Array.ofDim[String](length)
If you want a two-dimensional array but don't know its content, you can use
val row = 5
val column = 3
val temp = Array.ofDim[String](row, column)
Of course, you can change String to another type.
If you already know its content, you can use
val temp = Array("a", "b")
Another way of declaring multi-dimensional arrays:
Array.fill(4,3)("")
res3: Array[Array[String]] = Array(Array("", "", ""), Array("", "", ""), Array("", "", ""), Array("", "", ""))
[Consolidating all the answers]
Initializing 1-D Arrays
// With fixed values
val arr = Array("a", "ab", "c")
// With zero value of the type
val size = 13
val arrWithZeroVal = Array.ofDim[Int](size) //values = 0
val arrBoolWithZeroVal = Array.ofDim[Boolean](size) //values = false
// With default value
val defVal = -1
val arrWithDefVals = Array.fill[Int](size)(defVal)
//With random values
val rand = scala.util.Random
def randomNumberGen: Int = rand.nextInt(5)
val arrWithRandomVals = Array.fill[Int](size){randomNumberGen}
Initializing 2-D/3-D/n-D Arrays
// With zero value of the type
val arr3dWithZeroVal = Array.ofDim[Int](5, 4, 3)
// With default value
val defVal = -1
val arr3dWithDefVal = Array.fill[Int](5, 4, 3)(defVal)
//With random values
val arr3dWithRandomVals = Array.fill[Int](5, 4, 3){randomNumberGen}
Conclusion:
Use Array.ofDim[TYPE](d1, d2, d3...) to get the zero value of the type.
Use Array.fill[TYPE](d1, d2, d3...)(functionWhichReturnsTYPE) otherwise.
Output for reference:
scala> val arr = Array("a", "ab", "c")
arr: Array[String] = Array(a, ab, c)
scala> val size = 13
size: Int = 13
scala> val arrWithZeroVal = Array.ofDim[Int](size) //values = 0
arrWithZeroVal: Array[Int] = Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
scala> val arrBoolWithZeroVal = Array.ofDim[Boolean](size) //values = false
arrBoolWithZeroVal: Array[Boolean] = Array(false, false, false, false, false, false, false, false, false, false, false, false, false)
scala> val defVal = -1
defVal: Int = -1
scala> val arrWithDefVals = Array.fill[Int](size)(defVal)
arrWithDefVals: Array[Int] = Array(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1)
scala> val rand = scala.util.Random
rand: util.Random.type = scala.util.Random$@6e3dd5ce
scala> def randomNumberGen: Int = rand.nextInt(5)
randomNumberGen: Int
scala> val arrWithRandomVals = Array.fill[Int](size){randomNumberGen}
arrWithRandomVals: Array[Int] = Array(2, 2, 3, 1, 1, 3, 3, 3, 2, 3, 2, 2, 0)
scala> val arr3dWithZeroVal = Array.ofDim[Int](5, 4, 3)
arr3dWithZeroVal: Array[Array[Array[Int]]] = Array(Array(Array(0, 0, 0), Array(0, 0, 0), Array(0, 0, 0), Array(0, 0, 0)), Array(Array(0, 0, 0), Array(0, 0, 0), Array(0, 0, 0), Array(0, 0, 0)), Array(Array(0, 0, 0), Array(0, 0, 0), Array(0, 0, 0), Array(0, 0, 0)), Array(Array(0, 0, 0), Array(0, 0, 0), Array(0, 0, 0), Array(0, 0, 0)), Array(Array(0, 0, 0), Array(0, 0, 0), Array(0, 0, 0), Array(0, 0, 0)))
scala> val arr3dWithDefVal = Array.fill[Int](5, 4, 3)(defVal)
arr3dWithDefVal: Array[Array[Array[Int]]] = Array(Array(Array(-1, -1, -1), Array(-1, -1, -1), Array(-1, -1, -1), Array(-1, -1, -1)), Array(Array(-1, -1, -1), Array(-1, -1, -1), Array(-1, -1, -1), Array(-1, -1, -1)), Array(Array(-1, -1, -1), Array(-1, -1, -1), Array(-1, -1, -1), Array(-1, -1, -1)), Array(Array(-1, -1, -1), Array(-1, -1, -1), Array(-1, -1, -1), Array(-1, -1, -1)), Array(Array(-1, -1, -1), Array(-1, -1, -1), Array(-1, -1, -1), Array(-1, -1, -1)))
scala> val arr3dWithRandomVals = Array.fill[Int](5, 4, 3){randomNumberGen}
arr3dWithRandomVals: Array[Array[Array[Int]]] = Array(Array(Array(2, 0, 0), Array(4, 1, 0), Array(4, 0, 0), Array(0, 0, 1)), Array(Array(0, 1, 2), Array(2, 0, 2), Array(0, 4, 2), Array(0, 4, 2)), Array(Array(4, 3, 0), Array(2, 2, 4), Array(4, 0, 4), Array(4, 2, 1)), Array(Array(0, 3, 3), Array(0, 0, 4), Array(4, 1, 3), Array(2, 2, 3)), Array(Array(0, 2, 3), Array(1, 4, 1), Array(1, 3, 3), Array(0, 0, 3)))