Finding lines that start with a digit in Scala using filter() method - scala

I am a Python programmer, and since the Python API was too slow for my Spark application, I decided to port my code to the Spark Scala API to compare the computation times.
I am trying to filter out the lines that start with numeric characters from a huge file using the Scala API in Spark. In my file, some lines have numbers and some have words, and I want only the lines that have numbers.
So, in my Python application, I have these lines.
l = sc.textFile("my_file_path")
l_filtered = l.filter(lambda s: s[0].isdigit())
which works exactly as I want.
This is what I have tried so far.
val l = sc.textFile("my_file_path")
val l_filtered = l.filter(x => x.forall(_.isDigit))
This throws an error saying that Char does not have a forall() function.
I also tried taking the first character of each line using s.take(1) and applying the isDigit() function to it, in the following way.
val l = sc.textFile("my_file_path")
val l_filtered = l.filter(x => x.take(1).isDigit)
and this too...
val l = sc.textFile("my_file_path")
val l_filtered = l.filter(x => x.take(1).Character.isDigit)
This also throws an error.
This is probably a small mistake, but as I am not accustomed to Scala syntax, I am having a hard time figuring it out. Any help would be appreciated.
Edit: As suggested in the answers to this question, I tried writing the function, but I am unable to use it inside the filter() function in my application to apply it to all the lines in the file.

In Scala, indexing uses parentheses () instead of brackets []. The exact translation of your Python code would be this:
val l = sc.textFile("my_file_path")
val l_filtered = l.filter(_(0).isDigit)
A more idiomatic way to extract the first character is the head method:
val l = sc.textFile("my_file_path")
val l_filtered = l.filter(_.head.isDigit)
Both of these methods would fail if your file contains empty lines.
If that's the case, then you probably want this:
val l = sc.textFile("my_file_path")
val l_filtered = l.filter(_.headOption.map(_.isDigit).getOrElse(false))
Update: as curious noted in the comments, map(predicate).getOrElse(false) on an Option can be shortened to exists(predicate):
val l = sc.textFile("my_file_path")
val l_filtered = l.filter(_.headOption.exists(_.isDigit))

You can use regular expressions:
scala> List("1hello","2world","good").filter(_.matches("^[0-9].*$"))
res0: List[String] = List(1hello, 2world)
or you can do it like this, with fewer operations, since the file might contain a huge number of lines to filter:
scala> List("1hello","world").filter(_.headOption.exists(_.isDigit))
res1: List[String] = List(1hello)
Replace the List[String] with your lines RDD l for your case.
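Applied to the RDD from the question, either variant would look roughly like this (a sketch using the placeholder path from the question):
val l = sc.textFile("my_file_path")
// keep only the lines whose first character is a digit
val l_filtered = l.filter(_.matches("^[0-9].*$"))
// or, with fewer per-line operations:
val l_filtered2 = l.filter(_.headOption.exists(_.isDigit))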

Related

Spark - create list of words from text file and the word that comes immediately after it

I'm trying to create a pair rdd of every word from a text file and every word that follows it.
So for instance,
("I'm", "trying"), ("trying", "to"), ("to", "create") ...
It seems like I could almost use the zip function here, if I were able to start with an offset of 1 on the second part.
How can I do this, or is there a better way?
I'm still not quite used to thinking in terms of functional programming here.
You can manipulate the index, then join on the initial pair RDD:
val rdd = sc.parallelize("I'm trying to create a".split(" "))
val el1 = rdd.zipWithIndex().map(l => (-1+l._2, l._1))
val el2 = rdd.zipWithIndex().map(l => (l._2, l._1))
el2.join(el1).map(l => l._2).collect()
Which outputs:
Array[(String, String)] = Array((I'm,trying), (trying,to), (to,create), (create,a))
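An alternative sketch, assuming Spark MLlib is on the classpath, is the sliding method from org.apache.spark.mllib.rdd.RDDFunctions, which produces the consecutive pairs directly:
import org.apache.spark.mllib.rdd.RDDFunctions._

val rdd = sc.parallelize("I'm trying to create a".split(" "))
// sliding(2) yields windows of two consecutive elements, spanning partition boundaries
val pairs = rdd.sliding(2).map { case Array(a, b) => (a, b) }
pairs.collect()
// Array((I'm,trying), (trying,to), (to,create), (create,a))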

Spark scala filter multiple rdd based on string length

I am trying to solve one of the quiz questions; the question is as below.
Write the missing code in the given program to display the expected output, identifying animals that have names with four letters.
Output: Array((4,lion))
Program
val a = sc.parallelize(List("dog","tiger","lion","cat","spider","eagle"),2)
val b = a.keyBy(_.length)
val c = sc.parallelize(List("ant","falcon","squid"),2)
val d = c.keyBy(_.length)
I have tried to write the code in the Spark shell, but I am stuck on the syntax for combining the four RDDs and applying a filter.
How about using the PairRDD lookup method:
b.lookup(4).toArray
// res1: Array[String] = Array(lion)
d.lookup(4).toArray
// res2: Array[String] = Array()
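If the quiz instead expects the two keyed RDDs to be combined into the expected output, one guess at the intended missing line (a sketch; the quiz author may have had something else in mind) is subtractByKey, which keeps the pairs of b whose key does not occur in d:
// keys 3, 5 and 6 also occur in d, so only (4,lion) survives
b.subtractByKey(d).collect
// Array((4,lion))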

Get all possible combinations of 3 values from n possible elements

I'm trying to get the list of all the possible combinations of 3 elements from a list of 30 items. I tried to use the following code, but it fails with an OutOfMemoryError. Is there an alternative approach that is more efficient?
val items = sqlContext.table(SOURCE_DB + "." + SOURCE_TABLE).
  select("item_id").distinct.cache

items.take(1) // Compute cache before join

val itemCombinations = items.select($"item_id".alias("id_A")).
  join(
    items.select($"item_id".alias("id_B")), $"id_A".lt($"id_B")).
  join(
    items.select($"item_id".alias("id_C")), $"id_B".lt($"id_C"))
The approach seems OK but might generate quite some overhead at the query execution level. Given that n is a fairly small number, we could do it using the Scala implementation directly:
val localItems = items.collect
val combinations = localItems.combinations(3)
The result is an iterator that can be consumed one element at a time, without significant memory overhead.
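For example, a sketch of consuming only part of it (the collected items are whatever rows items.collect returns in the question's setup):
val localItems = items.collect
val combinations = localItems.combinations(3)
// the iterator is lazy: this only materializes the first five triples
combinations.take(5).foreach(c => println(c.mkString(", ")))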
Spark Version (edit)
Given the desire to make a Spark version of this, it could be possible to avoid the query planner (assuming that the issue is there) by dropping down to the RDD level. This is basically the same expression as the join in the question:
val items = sqlContext.table(SOURCE_DB + "." + SOURCE_TABLE).select("item_id").rdd
val combinations = items.cartesian(items).filter{case(x,y) => x<y}.cartesian(items).filter{case ((x,y),z) => y<z}
Running equivalent code on my local machine:
val data = List.fill(1000)(scala.util.Random.nextInt(999))
val rdd = sparkContext.parallelize(data)
val combinations = rdd.cartesian(rdd).filter{case(x,y) => x<y}.cartesian(rdd).filter{case ((x,y),z) => y<z}
combinations.count
// res5: Long = 165623528

Merge multiple RDD generated in loop

I am calling a function in Scala which gives an RDD[(Long, Long, Double)] as its output.
def helperfunction(): RDD[(Long, Long, Double)]
I call this function in a loop in another part of the code, and I want to merge all the generated RDDs. The loop calling the function looks something like this:
for (i <- 1 to n) {
  val tOp = helperfunction()
  // merge the generated tOp
}
What I want to do is something similar to what a StringBuilder would do for you in Java when you want to merge strings. I have looked at techniques for merging RDDs, which mostly point to using the union function like this:
RDD1.union(RDD2)
But this requires both RDDs to be generated before taking their union. I thought of initializing a var RDD1 outside the for loop to accumulate the results, but I am not sure how to initialize an empty RDD of type (Long, Long, Double). Also, I am just starting out with Spark, so I am not even sure whether this is the most elegant way to solve this problem.
Instead of using vars, you can use functional programming paradigms to achieve what you want:
val rdd = (1 to n).map(x => helperFunction()).reduce(_ union _)
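If n is large, chaining pairwise unions builds up a long lineage; as a sketch of an alternative, SparkContext.union accepts a whole sequence of RDDs at once:
// n and helperFunction() are the ones from the question
val rdd = sc.union((1 to n).map(_ => helperFunction()))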
Also, if you still need to create an empty RDD, you can do it using:
val empty = sc.emptyRDD[(Long, Long, Double)]
You're correct that this might not be the optimal way to do this, but we would need more info on what you're trying to accomplish with generating a new RDD with each call to your helper function.
You could define one RDD prior to the loop, assign it to a var, then union onto it inside your loop. Here's an example:
val rdd = sc.parallelize(1 to 100)
val rdd_tuple = rdd.map(x => (x.toLong, (x*10).toLong, x.toDouble))
var new_rdd = rdd_tuple
println("Initial RDD count: " + new_rdd.count())
for (i <- 2 to 4) {
  new_rdd = new_rdd.union(rdd_tuple)
}
println("New count after loop: " + new_rdd.count())

Reading Second Value In Line Scala

I'd like to read line of input of the form "x y" from standard input using Scala and assign only y to a var. Here's what I have so far:
val Array(_, t) = readLine.split(" ").map(_.toInt)
This looks pretty ugly though. I tried val t = readLine.split(" ").map(_.toInt)(1), but the compiler complains when I try this. If there's a cleaner solution than using an Array, I'd really appreciate the help. Thanks!
Your solution val Array(_, t) = readLine.split(" ").map(_.toInt) is fine when the string contains valid data.
If you know that the second token is valid, use this:
val t = readLine.split(" ")(1).toInt
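If the input might not contain a valid second token, a defensive variant (a sketch, not part of the original answer) keeps the result as an Option:
import scala.util.Try

// None if there is no second token or it is not a valid integer
val t: Option[Int] = readLine.split(" ").lift(1).flatMap(s => Try(s.toInt).toOption)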