I have just started using Scala and have the following code, which creates an IndexedSeq of dummy data called out. The dummy data consists of 20000 tuples, each containing a 36-character unique identifier and a list of 1000 floats.
import scala.util.Random

def uuid = java.util.UUID.randomUUID.toString

def generateRandomList(size: Int): List[Float] = {
  List.fill(size)(Random.nextFloat)
}

val numDimensions = 1000
val numberToWrite = 20000

val out = for (i <- 1 to numberToWrite) yield {
  val randomList = generateRandomList(numDimensions)
  (uuid, randomList) // trying tuples instead
}
But when I run the last statement (just by copying and pasting into the Scala shell) I get the following error:
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.lang.Float.valueOf(Float.java:433)
at scala.runtime.BoxesRunTime.boxToFloat(BoxesRunTime.java:73)
at $anonfun$generateRandomArray$1.apply(<console>:14)
at scala.collection.generic.GenTraversableFactory.fill(GenTraversableFactory.scala:90)
at .generateRandomArray(<console>:14)
at $anonfun$1.apply(<console>:17)
at $anonfun$1.apply(<console>:16)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.Range.foreach(Range.scala:160)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
... 20 elided
This is explained as a Java exception that is thrown when most of the time is being spent doing garbage collection (GC) [1].
According to [2], a 36-character string should take about 112 bytes. A Float takes 4 bytes, and I have 1000 of them in my inner list, so about 4000 bytes in total. Ignoring the list and tuple overhead, each element of my out IndexedSeq should therefore be roughly 4200 bytes, and having 20000 of them means ~84e6 bytes overall.
With this in mind, after the exception I run this (taken from [3]):
scala> val heapSize = Runtime.getRuntime().totalMemory()     // current size of the heap in bytes
heapSize: Long = 212860928

scala> val heapMaxSize = Runtime.getRuntime().maxMemory()    // maximum size of the heap in bytes; the heap cannot grow beyond this, and any attempt results in an OutOfMemoryError
heapMaxSize: Long = 239075328

scala> val heapFreeSize = Runtime.getRuntime().freeMemory()  // free memory within the heap in bytes; increases after garbage collection, decreases as new objects are created
heapFreeSize: Long = 152842176
Although the maximum heap size seems to be greater than the rough amount of memory I think I need, I try increasing the heap size ([4]) via ./scala -J-Xmx2g. Although this solves my problem, it would be good to know whether there is a better way to create this random data that avoids having to increase the memory available to the JVM.
I therefore have these three questions, which I would be grateful if someone could answer:
When does garbage collection occur in Scala, and in particular in the Scala shell? In my commands above, what is there that can get collected, and so why is the GC being called? (Sorry, this second part probably shows my lack of knowledge about the GC.)
Are my rough calculations of the amount of memory I am taking up approximately valid? (I expect a bit more overhead for the list and tuples, but I am assuming it is relatively small.) If so, why do I run out of memory when my max heap size (239e6 bytes) should cover this? And if not, what extra memory am I using?
Is there a better way to create random data for this? For context, I am just trying to create some dummy data that I can parallelise into Spark (using sc.parallelize) and then play around with. (To get it to work when I moved to trying it in Spark, I increased the driver memory by setting spark.driver.memory to 2g in my Spark conf rather than using the -J-Xmx2g option above.)
Thanks for your help!
Links
[1] Error java.lang.OutOfMemoryError: GC overhead limit exceeded
[2] How much memory does a string use in Java 8?
[3] How to view the current heap size that an application is using?
[4] Increase JVM heap size for Scala?
To answer the REPL-specific part:
https://issues.scala-lang.org/browse/SI-4331
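As I understand that ticket (a summary, not a quote from it): every expression you evaluate at the prompt is bound to a fresh resN value, and the wrappers the REPL generates for later lines keep references to the earlier results, so large values built at the prompt tend to stay strongly reachable for the whole session. Roughly:

scala> generateRandomList(1000)   // printed, and also bound to res0 by the REPL
scala> res0.take(3)               // later lines can still refer to res0, so it cannot be collected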
Folks doing big allocations usually prefer Array and Buffer.
Note that there's overhead in List, including boxing the primitive values.
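To make that overhead concrete (rough numbers for a typical 64-bit JVM): in a List[Float] each value is a boxed java.lang.Float (~16 bytes) pointed to by a cons cell (another ~24-32 bytes), so a 1000-element list costs on the order of 40-50 KB rather than 4 KB, and 20000 of them land near a gigabyte before counting the strings. A sketch of the same dummy data built with unboxed arrays instead:

import scala.util.Random

def uuid = java.util.UUID.randomUUID.toString

// Floats stored flat in an Array[Float]: 1000 * 4 bytes per row plus a small
// fixed header, instead of one boxed object plus one cons cell per value.
def generateRandomArray(size: Int): Array[Float] =
  Array.fill(size)(Random.nextFloat)

val out = (1 to 20000).map(_ => (uuid, generateRandomArray(1000)))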
JVM heap is managed in pools, which you can size relative to each other. But generally speaking:
scala> var x = new Array[Byte](20000000 * 4)
x: Array[Byte] = Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
scala> x = null
x: Array[Byte] = null
scala> x = new Array[Byte](20000000 * 4)
x: Array[Byte] = [B#475530b9
scala> x = null
x: Array[Byte] = null
scala> x = new Array[Byte](20000000 * 4)
java.lang.OutOfMemoryError: Java heap space
... 32 elided
Related
I was learning pyspark when I encountered this.
from pyspark.sql import Row
df = spark.createDataFrame([Row([0, 45, 63, 0, 0, 0, 0]),
                            Row([0, 0, 0, 85, 0, 69, 0]),
                            Row([0, 89, 56, 0, 0, 0, 0])],
                           ['features'])
+--------------------+
| features|
+--------------------+
|[0, 45, 63, 0, 0,...|
|[0, 0, 0, 85, 0, ...|
|[0, 89, 56, 0, 0,...|
+--------------------+
sample = df.rdd.map(lambda row: row[0]*2)
sample.collect()
[[0, 45, 63, 0, 0, 0, 0, 0, 45, 63, 0, 0, 0, 0],
[0, 0, 0, 85, 0, 69, 0, 0, 0, 0, 85, 0, 69, 0],
[0, 89, 56, 0, 0, 0, 0, 0, 89, 56, 0, 0, 0, 0]]
My question is: why is row[0] taken as a complete list rather than as one value?
What is the property that gives the above output?
It is taken as a complete list because you gave it as one, and you defined it under a single column, "features".
So when you say
df.rdd.map(lambda row: row[0]*2)
you are asking Spark to repeat that whole list twice (multiplying a Python list by 2 concatenates the list with itself), hence the output you are getting.
Now, how do you get the individual values in the list?
df = spark.createDataFrame([Row(0, 45, 63, 0, 0, 0, 0),
                            Row(0, 0, 0, 85, 0, 69, 0),
                            Row(0, 89, 56, 0, 0, 0, 0)],
                           ['feature1', 'feature2', 'feature3', 'feature4', 'feature5', 'feature6', 'feature7'])
This should give you access to individual values in a dedicated column.
Note: the schema syntax here is just a representation; please refer to the Spark docs for the exact syntax.
Hope this helps :)
I have a simple time series where a switch is turned on and off by an operator. My aim is to label each of the "turned on" phases with a different ID, e.g., the result with column eventID would look like this:
val eventDF = sc.parallelize(List(("2016-05-01 10:00:00", 0, 0),
                                  ("2016-05-01 10:00:30", 0, 0),
                                  ("2016-05-01 10:01:00", 1, 1),
                                  ("2016-05-01 10:01:20", 1, 1),
                                  ("2016-05-01 10:02:10", 1, 1),
                                  ("2016-05-01 10:03:30", 0, 0),
                                  ("2016-05-01 10:04:00", 0, 0),
                                  ("2016-05-01 10:05:20", 0, 0),
                                  ("2016-05-01 10:06:10", 1, 2),
                                  ("2016-05-01 10:06:30", 1, 2),
                                  ("2016-05-01 10:07:00", 1, 2),
                                  ("2016-05-01 10:07:20", 0, 0),
                                  ("2016-05-01 10:08:10", 0, 0),
                                  ("2016-05-01 10:08:50", 0, 0)))
  .toDF("timestamp", "switch", "eventID")
So far I have tried the rank/rangeBetween/lag window functions without any luck; therefore, any hint is appreciated.
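A hedged sketch of one common approach (assuming the Spark 2.x DataFrame API, not an accepted answer from this thread): use lag to flag the rows where switch flips from 0 to 1, then take a running sum of those flags over a time-ordered window, masking the "off" rows back to 0.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag, sum, when}

// A single global window ordered by time; fine for toy data, but a real job
// would want a partitioning column so all rows don't land in one task.
val w = Window.orderBy("timestamp")

val labelled = eventDF
  .drop("eventID")  // the sample column above holds the expected result; recompute it
  .withColumn("isStart",
    when(col("switch") === 1 && lag(col("switch"), 1, 0).over(w) === 0, 1).otherwise(0))
  .withColumn("eventID",
    when(col("switch") === 1, sum(col("isStart")).over(w)).otherwise(0))
  .drop("isStart")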
I'm updating a variable within a for-loop in Scala that I need outside the loop. I tested the code below and got the message "ERROR: undefined". The variable is not empty; inside the loop the values are returned. Thank you.
val example = List(0, 0, 1, 0.7, 10, 2, 5, 7, 4, 1, -9, 0, 0, 0, 0, 3, 3, 0, 0, 0, -80, -6.6, -1, 0)
var b = scala.collection.mutable.MutableList.empty[Double]
var b_val: Double = 0

for (i <- 1 to 24) {
  if (example(i) != 0) { b_val = b_val + example(i) } else { b_val = 0 }
  b += b_val
}
println(b)
You're getting an error because example only has 24 items. Therefore example(i) will throw an IndexOutOfBoundsException when i is 24.
You are trying to add b_val, which is a Double, to an AnyVal, which won't work.
To make the addition work, you need to cast the elements in the List to Double as below, or define example as a List[Double]. Also, your iteration won't work because you are going up to 24, which will blow up since the list is indexed from 0 to 23.
val example = List(0, 0, 1, 0.7, 10, 2, 5, 7, 4, 1, -9, 0, 0, 0, 0, 3, 3, 0, 0, 0, -80, -6.6, -1, 0)
var b = scala.collection.mutable.MutableList.empty[Double]
var b_val: Double = 0
for (i <- 0 until example.length) {
if (example(i) != 0) {
b_val = (b_val + example(i).asInstanceOf[Double])
} else {
b_val = 0
}
b += b_val
}
println(b)
Result is : MutableList(0.0, 0.0, 1.0, 1.7, 11.7, 13.7, 18.7, 25.7, 29.7, 30.7, 21.7, 0.0, 0.0, 0.0, 0.0, 3.0, 6.0, 0.0, 0.0, 0.0, -80.0, -86.6, -87.6, 0.0)
But only you know what you are trying to achieve. A small refactor I would make, to do it in a more Scala-like way:
val example = List(0, 0, 1, 0.7, 10, 2, 5, 7, 4, 1, -9, 0, 0, 0, 0, 3, 3, 0, 0, 0, -80, -6.6, -1, 0)
var b = scala.collection.mutable.MutableList.empty[Double]
var b_val: Double = 0
example.foreach(elem => {
if (elem != 0) {
b_val = b_val + elem.asInstanceOf[Double]
} else {
b_val = 0
}
b += b_val
})
println(b)
I think you are looking for something like this:
val b = example.scanLeft(0.0) {
case (_, 0) => 0
case (l, r) => l + r
}
If you are going to be writing Scala code anyway, learn to do it the Scala way; there is no point otherwise.
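Note that scanLeft also emits the initial accumulator (the 0.0 seed), so the result has one more element than example; drop it with .tail if you want the two to line up:

val b = example.scanLeft(0.0) {
  case (_, 0)   => 0
  case (acc, x) => acc + x
}.tail   // drop the leading 0.0 that scanLeft produces for the seed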
I have user data from the MovieLens ml-100k dataset.
Sample rows are:
1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
4|24|M|technician|43537
5|33|F|other|15213
I have read the data as an RDD as follows:
scala> val user_data = sc.textFile("/home/user/Documents/movielense/ml-100k/u.user").map(x=>x.split('|'))
user_data: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[5] at map at <console>:29
scala> user_data.take(5)
res0: Array[Array[String]] = Array(Array(1, 24, M, technician, 85711), Array(2, 53, F, other, 94043), Array(3, 23, M, writer, 32067), Array(4, 24, M, technician, 43537), Array(5, 33, F, other, 15213))
# encode distinct profession with zipWithIndex -
scala> val indexed_profession = user_data.map(x=>x(3)).distinct().sortBy[String](x=>x).zipWithIndex()
indexed_profession: org.apache.spark.rdd.RDD[(String, Long)] = ZippedWithIndexRDD[18] at zipWithIndex at <console>:31
scala> indexed_profession.collect()
res1: Array[(String, Long)] = Array((administrator,0), (artist,1), (doctor,2), (educator,3), (engineer,4), (entertainment,5), (executive,6), (healthcare,7), (homemaker,8), (lawyer,9), (librarian,10), (marketing,11), (none,12), (other,13), (programmer,14), (retired,15), (salesman,16), (scientist,17), (student,18), (technician,19), (writer,20))
I want to do one-hot encoding for the Occupation column.
Expected output is -
userId Age Gender Occupation Zipcodes technician other writer
1 24 M technician 85711 1 0 0
2 53 F other 94043 0 1 0
3 23 M writer 32067 0 0 1
4 24 M technician 43537 1 0 0
5 33 F other 15213 0 1 0
How do I achieve this on an RDD in Scala?
I want to perform the operation on the RDD without converting it to a DataFrame.
Any help is appreciated.
Thanks
I did this in the following way:
1) Read the user data:
scala> val user_data = sc.textFile("/home/user/Documents/movielense/ml-100k/u.user").map(x=>x.split('|'))
2) Show 5 rows of data:
scala> user_data.take(5)
res0: Array[Array[String]] = Array(Array(1, 24, M, technician, 85711), Array(2, 53, F, other, 94043), Array(3, 23, M, writer, 32067), Array(4, 24, M, technician, 43537), Array(5, 33, F, other, 15213))
3) Create a map of professions by indexing:
scala> val indexed_profession = user_data.map(x=>x(3)).distinct().sortBy[String](x=>x).zipWithIndex().collectAsMap()
scala> indexed_profession
res35: scala.collection.Map[String,Long] = Map(scientist -> 17, writer -> 20, doctor -> 2, healthcare -> 7, administrator -> 0, educator -> 3, homemaker -> 8, none -> 12, artist -> 1, salesman -> 16, executive -> 6, programmer -> 14, engineer -> 4, librarian -> 10, technician -> 19, retired -> 15, entertainment -> 5, marketing -> 11, student -> 18, lawyer -> 9, other -> 13)
4) Create an encode function which does one-hot encoding of the profession:
scala> def encode(x: String) = {
     |   val encodeArray = Array.fill(21)(0)
     |   encodeArray(indexed_profession.get(x).get.toInt) = 1
     |   encodeArray
     | }
5) Apply the encode function to the user data:
scala> val encode_user_data = user_data.map{ x => (x(0),x(1),x(2),x(3),x(4),encode(x(3)))}
6) Show the encoded data:
scala> encode_user_data.take(6)
res71: Array[(String, String, String, String, String, Array[Int])] = Array(
  (1,24,M,technician,85711,Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0)),
  (2,53,F,other,94043,Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0)),
  (3,23,M,writer,32067,Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1)),
  (4,24,M,technician,43537,Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0)),
  (5,33,F,other,15213,Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0)),
  (6,42,M,executive,98101,Array(0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)))
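For a larger lookup table, one optional refinement (a sketch, not part of the steps above) is to broadcast the map so each executor receives a single copy instead of having it shipped with every task closure:

// Broadcast the profession -> index map once.
val professionIndex = sc.broadcast(indexed_profession)

val encode_user_data = user_data.map { x =>
  val arr = Array.fill(21)(0)
  arr(professionIndex.value(x(3)).toInt) = 1
  (x(0), x(1), x(2), x(3), x(4), arr)
}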
[My solution is for DataFrames] The code below should help in converting categorical columns to one-hot columns. You have to create a catMap object with column names as keys and lists of categories as values.
import org.apache.spark.sql.functions.{lower, when}

var OutputDf = df
for (cat <- catMap.keys) {
  val categories = catMap(cat)
  for (oneHotVal <- categories) {
    OutputDf = OutputDf.withColumn(oneHotVal,
      when(lower(OutputDf(cat)) === oneHotVal, 1).otherwise(0))
  }
}
OutputDf
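For instance, for the occupation example above, a hypothetical catMap (column name mapped to the lowercase categories you want as new columns) could look like this:

// Hypothetical input for the loop above: one entry per categorical column.
val catMap: Map[String, Seq[String]] = Map(
  "occupation" -> Seq("technician", "other", "writer")
)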
The FirstExample in the scalaquery-examples project provides an example of a batch insert with the following syntax:
Coffees.insertAll(
  ("Colombian", 101, 7.99, 0, 0),
  ("French_Roast", 49, 8.99, 0, 0),
  ("Espresso", 150, 9.99, 0, 0),
  ("Colombian_Decaf", 101, 8.99, 0, 0),
  ("French_Roast_Decaf", 49, 9.99, 0, 0)
)
How is it possible to pass a dynamically constructed List of tuples to the insertAll method, given that for this example the function definition is:
def insertAll(values: (String, Int, Double, Int, Int)*)(implicit session: org.scalaquery.session.Session): Option[Int]
You can transform your List to variable-length arguments like this:
insertAll(tuplesList.toSeq:_*)
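For example (a minimal sketch; tuplesList is a hypothetical, dynamically built list matching the Coffees row type):

val tuplesList: List[(String, Int, Double, Int, Int)] = List(
  ("Colombian", 101, 7.99, 0, 0),
  ("French_Roast", 49, 8.99, 0, 0)
)

// `: _*` expands the sequence into insertAll's varargs parameter;
// a List is already a Seq, so the .toSeq call is not strictly required.
Coffees.insertAll(tuplesList: _*)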