Scala Iterator ++ blows the stack

I recently noticed a bug causing a StackOverflowError in Scala's Iterator.++, caused by lazy initialization. Here's the code to make the bug appear:
import scala.io.Source

var lines = Source.fromFile("file").getLines()
var line = lines.next()
lines = Array(line).toIterator ++ lines
lines.foreach { println(_) }
System.exit(0)
What I get is
Exception in thread "main" java.lang.StackOverflowError
at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:219)
at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:219)
at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:219)
at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:219)
...
It appears to be caused by this line in the Scala source (scala.collection.Iterator.scala:208):
lazy val rhs: Iterator[A] = that.toIterator
Since rhs is a lazy val, by the time the iterator is actually used, the name "lines" already refers to the concatenated iterator itself. That creates a cyclic reference, which leads to the error.
I noticed this post discussed the problem back in 2013, but it seems it has not been fully fixed. I am running Scala 2.11.8 from the Maven repository.
My question: I can rename the iterator (e.g. "lines2") to avoid this bug, but is that the only way to solve the problem? Using the name "lines" feels more natural, and I'd rather not give it up if possible.

If you want to reload an Iterator using the same var, this appears to work. [Tested on 2.11.7 and 2.12.1]
scala> var lines = io.Source.fromFile("file.txt").getLines()
lines: Iterator[String] = non-empty iterator
scala> var line = lines.next()
line: String = this,that,other,than
scala> lines = Iterator(line +: lines.toSeq:_*)
lines: Iterator[String] = non-empty iterator
scala> lines.foreach(println)
this,that,other,than
here,there,every,where
But it might make more sense to use a BufferedIterator where you can call head on it to peek at the next element without consuming it.
Explanation
lines.toSeq <-- turns the Iterator[String] into a Seq[String] (The REPL will show this as a Stream, but that's because the REPL has to compile and represent each line of input separately.)
line +: lines.toSeq <-- creates a new Seq[String] with line as the first element (i.e. prepended)
(line +: lines.toSeq:_*) <-- turns a single Seq[T] into a parameter list that can be passed to the Iterator.apply() method. som-snytt has cleverly pointed out that this can be simplified to (line +: lines.toSeq).iterator
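In the REPL, that simplified form would look something like this (a small sketch; output as rendered by 2.11/2.12):
scala> lines = (line +: lines.toSeq).iterator
lines: Iterator[String] = non-empty iterator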
BufferedIterator example
scala> var lines = io.Source.fromFile("file.txt").getLines.buffered
lines: scala.collection.BufferedIterator[String] = non-empty iterator
^^^^^^^^^^^^^^^^^^^^^^^^ <-- note the type
scala> lines.head
res5: String = this,that,other,than
scala> lines foreach println
this,that,other,than
here,there,every,where

Simple capture: ++ takes its argument by name, so binding the old iterator to a local val z first means the deferred argument refers to z, which never changes, rather than to lines:
scala> var lines = Iterator.continually("x")
lines: Iterator[String] = non-empty iterator
scala> lines = { val z = lines ; Iterator.single("y") ++ z }
lines: Iterator[String] = non-empty iterator
scala> lines.next
res0: String = y
scala> lines.next
res1: String = x
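Applied to the original example, the same trick reads (a sketch, assuming scala.io.Source is imported):

var lines = Source.fromFile("file").getLines()
var line = lines.next()
lines = { val z = lines; Iterator.single(line) ++ z }
lines.foreach(println)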

Related

UDF to extract String in Scala

I'm trying to extract the last number from this data:
urn:fb:candidateHiringState:(urn:fb:contract:187236028,10342800535)
In this example I'm trying to extract 10342800535 as a string.
This is my code in Scala:
def extractNestedUrn(urn: String): String = {
  val arr = urn.split(":").map(_.trim)
  val nested = arr(3)
  val clean = nested.substring(1, nested.length - 1)
  val subarr = clean.split(":").map(_.trim)
  val res = subarr(3)
  val out = res.split(",").map(_.trim)
  val fin = out(1)
  fin.toString
}
This is run as a UDF, and it throws the following error:
org.apache.spark.SparkException: Failed to execute user defined function
What am I doing wrong?
You can simply use the regexp_extract function. The pattern .*,(\d+) greedily matches up to the last comma and captures the digits after it. Check this:
import org.apache.spark.sql.functions.{col, regexp_extract}

val df = Seq("urn:fb:candidateHiringState:(urn:fb:contract:187236028,10342800535)").toDF("x")
df.show(false)
+-------------------------------------------------------------------+
|x |
+-------------------------------------------------------------------+
|urn:fb:candidateHiringState:(urn:fb:contract:187236028,10342800535)|
+-------------------------------------------------------------------+
df.withColumn("NestedUrn", regexp_extract(col("x"), """.*,(\d+)""", 1)).show(false)
+-------------------------------------------------------------------+-----------+
|x |NestedUrn |
+-------------------------------------------------------------------+-----------+
|urn:fb:candidateHiringState:(urn:fb:contract:187236028,10342800535)|10342800535|
+-------------------------------------------------------------------+-----------+
One reason the org.apache.spark.SparkException: Failed to execute user defined function exception is raised is that an exception was thrown inside your user-defined function.
Analysis
If I try to run your user defined function with the example input you provided, using the code below:
import org.apache.spark.sql.functions.{col, udf}
import sparkSession.implicits._
val dataframe = Seq("urn:fb:candidateHiringState:(urn:fb:contract:187236028,10342800535)").toDF("urn")
def extractNestedUrn(urn: String): String = {
  val arr = urn.split(":").map(_.trim)
  val nested = arr(3)
  val clean = nested.substring(1, nested.length - 1)
  val subarr = clean.split(":").map(_.trim)
  val res = subarr(3)
  val out = res.split(",").map(_.trim)
  val fin = out(1)
  fin.toString
}
val extract_urn = udf(extractNestedUrn _)
dataframe.select(extract_urn(col("urn"))).show(false)
I get this complete stack trace:
Exception in thread "main" org.apache.spark.SparkException: Failed to execute user defined function(UdfExtractionError$$$Lambda$1165/1699756582: (string) => string)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1130)
at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:156)
...
at UdfExtractionError$.main(UdfExtractionError.scala:37)
at UdfExtractionError.main(UdfExtractionError.scala)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 3
at UdfExtractionError$.extractNestedUrn$1(UdfExtractionError.scala:29)
at UdfExtractionError$.$anonfun$main$4(UdfExtractionError.scala:35)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.$anonfun$f$2(ScalaUDF.scala:157)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1127)
... 86 more
The important part of this stack trace is actually:
Caused by: java.lang.ArrayIndexOutOfBoundsException: 3
This is the exception raised when executing your user-defined function code. If we analyse your function code, you split the input on ":" twice. The result of the first split is actually this array:
["urn", "fb", "candidateHiringState", "(urn", "fb", "contract", "187236028,10342800535)"]
and not this array:
["urn", "fb", "candidateHiringState", "(urn:fb:contract:187236028,10342800535)"]
So, if we execute the remaining statements of your function, you get:
val arr = ["urn", "fb", "candidateHiringState", "(urn", "fb", "contract", "187236028,10342800535)"]
val nested = "(urn"
val clean = "ur" // substring(1, nested.length - 1) drops the first and last characters
val subarr = ["ur"]
As the next line asks for the fourth element of subarr, which contains only one element, an ArrayIndexOutOfBoundsException is raised, and Spark then wraps it in a SparkException.
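You can verify the shape of the first split quickly in the REPL (a small sketch):
scala> "urn:fb:candidateHiringState:(urn:fb:contract:187236028,10342800535)".split(":")
res0: Array[String] = Array(urn, fb, candidateHiringState, (urn, fb, contract, 187236028,10342800535))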
Solution
Although the best solution to your problem is clearly the previous answer with regexp_extract, you can correct your user-defined function as below:
def extractNestedUrn(urn: String): String = {
  val arr = urn.split(':') // split on a character instead of a string regexp
  val nested = arr.last // last element of the array, here "187236028,10342800535)"
  val subarr = nested.split(',')
  val res = subarr.last // last element, here "10342800535)"
  val out = res.init // everything except the last character, dropping ')'
  out // no need for .toString, out is already a String
}
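For completeness, wiring the corrected function up exactly as in the analysis above (a sketch reusing the dataframe and imports defined there):

val extract_urn = udf(extractNestedUrn _)
dataframe.select(extract_urn(col("urn"))).show(false) // now shows 10342800535 instead of failing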
However, as said before, the best solution is to use the Spark built-in function regexp_extract, as explained in the first answer. Your code will be easier to understand and more performant.

Scala: adding to a Scala List

I am trying to append to a List[String] based on a condition, but the List comes out empty.
Here is the simple code:
object Mytester {
  def main(args: Array[String]): Unit = {
    val columnNames = List("t01354", "t03345", "t11858", "t1801566", "t180387685", "t015434")
    //println(columnNames)
    val prim = List[String]()
    for (i <- columnNames) {
      if (i.startsWith("t01"))
        println("Printing i : " + i)
      i :: prim :: Nil
    }
    println(prim)
  }
}
Output:
Printing i : t01354
Printing i : t015434
List()
Process finished with exit code 0
The line i :: prim :: Nil creates a new List, but that new List is not saved (i.e. assigned to a variable), so it is thrown away. prim is never changed, and it can't be, because it is a val.
If you want a new List of only those elements that meet a certain condition then filter the list.
val prim: List[String] = columnNames.filter(_.startsWith("t01"))
// prim: List[String] = List(t01354, t015434)
1) Why can't I add to the List?
List is immutable; you have to use a mutable list (called ListBuffer).
definition
scala> val list = scala.collection.mutable.ListBuffer[String]()
list: scala.collection.mutable.ListBuffer[String] = ListBuffer()
add elements
scala> list += "prayagupd"
res3: list.type = ListBuffer(prayagupd)
scala> list += "urayagppd"
res4: list.type = ListBuffer(prayagupd, urayagppd)
print list
scala> list
res5: scala.collection.mutable.ListBuffer[String] = ListBuffer(prayagupd, urayagppd)
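If you need an immutable List again at the end, convert the buffer (a small sketch):
scala> list.toList
res6: List[String] = List(prayagupd, urayagppd)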
2) Filtering a list in Scala
Also, in your case the best approach to solve the problem would be to use List#filter; there is no need for a for loop.
scala> val columnNames = List("t01354", "t03345", "t11858", "t1801566", "t180387685", "t015434")
columnNames: List[String] = List(t01354, t03345, t11858, t1801566, t180387685, t015434)
scala> val columnsStartingWithT01 = columnNames.filter(_.startsWith("t01"))
columnsStartingWithT01: List[String] = List(t01354, t015434)
Related resources
Add element to a list In Scala
filter a List according to multiple contains
In addition to what jwvh explained, note that in Scala you'd usually do what you want as:
val prim = columnNames.filter(_.startsWith("t01"))

Appending Data to List or any other collection Dynamically in Scala [duplicate]

This question already has answers here:
Add element to a list In Scala
(4 answers)
Closed 6 years ago.
I am new to Scala.
Can we add/append data to a List, or any other collection, dynamically in Scala?
I mean, can we add data to a List or any collection using foreach (or any other loop)?
I am trying to do something like below:
import scala.collection.mutable.ListBuffer

var propertyData = sc.textFile("hdfs://ip:8050/property.conf")
var propertyList = new ListBuffer[(String,String)]()
propertyData.foreach { line =>
  var c = line.split("=")
  propertyList.append((c(0), c(1)))
}
And suppose the property.conf file contains:
"spark.shuffle.memoryFraction"="0.5"
"spark.yarn.executor.memoryOverhead"="712"
This compiles fine, but the values are not added to the ListBuffer.
I tried it using Darshan's code from his (updated) question:
import scala.collection.mutable.ListBuffer

val propertyData = List(""""spark.shuffle.memoryFraction"="0.5"""", """"spark.yarn.executor.memoryOverhead"="712" """)
val propertyList = new ListBuffer[(String,String)]()
propertyData.foreach { line =>
  val c = line.split("=")
  propertyList.append((c(0), c(1)))
}
println(propertyList)
It works as expected: it prints to the console:
ListBuffer(("spark.shuffle.memoryFraction","0.5"), ("spark.yarn.executor.memoryOverhead","712" ))
I didn't do it in a Spark context, although I will try that in a few minutes. So, I provided the data in a plain list of Strings. I also changed the var keywords to val, since none of them needs to be a mutable variable, but of course that makes no difference either: the code works whether they are val or var. Be aware, though, that inside a Spark job the closure passed to foreach is serialized and executed on the executors, so each executor appends to its own copy of the buffer and the driver's ListBuffer stays empty, which is exactly the symptom you describe.
See my comment below. But here is idiomatic Spark/Scala code which does behave exactly as you would expect:
import org.apache.spark.{SparkConf, SparkContext}

object ListTest extends App {
  val conf = new SparkConf().setAppName("listtest")
  val sc = new SparkContext(conf)
  val propertyData = sc.textFile("listproperty.conf")
  val propertyList = propertyData map { line =>
    val xs: Array[String] = line.split("""\=""")
    (xs(0), xs(1))
  }
  propertyList foreach (println(_))
}
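Note that on a real cluster the final foreach runs on the executors, so the output lands in the executor logs rather than the driver console; to inspect the result on the driver, collect the RDD first (a sketch):

propertyList.collect().foreach(println)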
Yes, that's possible using mutable collections (see this link), for example:
import scala.collection.mutable
val buffer = mutable.ListBuffer.empty[String]
// add elements
buffer += "a string"
buffer += "another string"
or in a loop:
val buffer = mutable.ListBuffer.empty[Int]
for (i <- 1 to 10) {
  buffer += i
}
You can either use a mutable collection (not functional), or return a new collection (functional and more idiomatic) as below :
scala> val a = List(1,2,3)
a: List[Int] = List(1, 2, 3)
scala> val b = a :+ 4
b: List[Int] = List(1, 2, 3, 4)
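The same idea extends to loops: instead of appending to a buffer, fold (or map) into a new list (a small sketch):
scala> (1 to 5).foldLeft(List.empty[Int])((acc, i) => acc :+ i)
res0: List[Int] = List(1, 2, 3, 4, 5)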

Scala function does not return a value

I think I understand the rules of implicit returns, but I can't figure out why splithead is not being set. This code is run via
val m = new TaxiModel(sc, file)
and then I expect
m.splithead
to give me an array of strings. Note that head is an array of strings.
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
class TaxiModel(sc: SparkContext, dat: String) {
  val rawData = sc.textFile(dat)
  val head = rawData.take(10)
  val splithead = head.slice(1, 11).foreach(splitData)
  def splitData(dat: String): Array[String] = {
    val splits = dat.split("\",\"")
    val split0 = splits(0).substring(1, splits(0).length)
    val split8 = splits(8).substring(0, splits(8).length - 1)
    Array(split0).union(splits.slice(1, 8)).union(Array(split8))
  }
}
foreach just evaluates an expression for its side effects and does not collect any data while iterating; it returns Unit, which is why splithead is not what you expect. You probably need map or flatMap (see the docs here):
head.slice(1,11).map(splitData) // gives you Array[Array[String]]
head.slice(1,11).flatMap(splitData) // gives you Array[String]
Consider also a for comprehension (which desugars in this case into map),
for (s <- head.slice(1,11)) yield splitData(s)
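Applied to the class above, the fix is a one-line change (a sketch):

val splithead = head.slice(1, 11).flatMap(splitData) // Array[String] of all the fields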
Note also that Scala strings are equipped with ordered collection methods, thus
splits(0).substring(1, splits(0).length)
proves equivalent to any of the following
splits(0).drop(1)
splits(0).tail
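A quick REPL check of that equivalence (a sketch):
scala> val s = "hello"
s: String = hello
scala> (s.substring(1, s.length), s.drop(1), s.tail)
res0: (String, String, String) = (ello,ello,ello)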

Why is the iterator evaluated when creating a new Iterable from it?

scala> val myI = new Iterable[Int]{def iterator = Iterator.continually(1)}
java.lang.OutOfMemoryError: Java heap space
  at java.util.Arrays.copyOf(Arrays.java:2882)
  at <snip>
Now, is this expected behavior? I find it somewhat strange and it gets in my way.
This is just the REPL trying too hard to be helpful: it's trying to print out your new Iterable as part of what it does when you return a value. You can either stick it in some container that doesn't print its contents or override toString.
scala> val myI = new Iterable[Int] { def iterator = Iterator.continually(1);
| override def toString = "myI" }
myI: Iterable[Int] = myI
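Alternatively, keep the value behind something the printer won't traverse, e.g. a thunk (a sketch; the exact rendering of the function value varies by Scala version):
scala> val mkI = () => new Iterable[Int] { def iterator = Iterator.continually(1) }
mkI: () => Iterable[Int] = <function0>
scala> mkI().iterator.take(3).toList
res0: List[Int] = List(1, 1, 1)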