Out of bounds exception in Spark using Scala

I'm a beginner in Scala, and what I'm doing is mapping a dataset into (k, v) pairs where kv(0) and kv(1) are Strings and kv(2) is a list. The code is listed below:
val rdd_q1_bs = rdd_business.map(lines => lines.split('^')).map(kv =>
  (kv(0), (kv(1), kv(2))))
But here's the problem: some lines in the dataset have an empty list for kv(2). So when I use .collect() to gather all the elements, an out of bounds exception can be thrown.
What I'm thinking of is defining a function that checks the length of kv. Is there any simple way I can ignore the exception and keep processing, or replace kv(2) with a String?

The lines => lines.split('^') function suggests that rdd_business is an RDD[String] and that you are splitting each string on ^, which gives you an RDD[Array[String]]. From that you are extracting the elements of the Array using kv(0), kv(1) and kv(2). The exception you are getting occurs because some of the strings in rdd_business may contain only one ^ (or none), so the resulting array has fewer than three elements.
What you can do in such a case is use Try or Option.
import scala.util.Try
val rdd_q1_bs = rdd_business.map(lines => lines.split('^')).map(kv =>
  (kv(0), (kv(1), Try(kv(2)).getOrElse("not found"))))
For better safety you can apply Try or Option to all the elements of the Array:
val rdd_q1_bs = rdd_business.map(lines => lines.split('^')).map(kv =>
  (Try(kv(0)).getOrElse("notFound"), (Try(kv(1)).getOrElse("notFound"), Try(kv(2)).getOrElse("not found"))))
You can proceed the same way for Option as well.
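For reference, a minimal sketch of the Option-based variant (an illustration, not tested against the original data): lift returns Some(element) when the index exists and None when it is out of bounds, so no exception is thrown.
// lift avoids the exception: out-of-range indices simply yield None
val rdd_q1_bs_opt = rdd_business.map(lines => lines.split('^')).map(kv =>
  (kv.lift(0).getOrElse("notFound"),
    (kv.lift(1).getOrElse("notFound"), kv.lift(2).getOrElse("not found"))))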
I hope the answer is helpful.

Related

Char count in string

I'm new to Scala and FP in general and am trying to practice it on a dummy example.
val counts = ransomNote.map(e=>(e,1)).reduceByKey{case (x,y) => x+y}
The following error is raised:
Line 5: error: value reduceByKey is not a member of IndexedSeq[(Char, Int)] (in solution.scala)
The above example looks similar to a starting FP primer on word count; I'd appreciate it if you could point out my mistake.
It looks like you are trying to use a Spark method on a Scala collection. The two APIs have a few similarities, but reduceByKey is not one of them.
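For illustration only: if ransomNote were distributed as a Spark RDD, the original line would work, because reduceByKey is defined on pair RDDs. A minimal sketch, assuming an existing SparkContext named sc:
// distribute the characters, then count occurrences per character with reduceByKey
val ransomRDD = sc.parallelize(ransomNote.toSeq)
val sparkCounts = ransomRDD.map(c => (c, 1)).reduceByKey(_ + _)
sparkCounts.collect()  // Array[(Char, Int)]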
In pure Scala you can do it like this:
val counts =
  ransomNote.foldLeft(Map.empty[Char, Int].withDefaultValue(0)) {
    (counts, c) => counts.updated(c, counts(c) + 1)
  }
foldLeft iterates over the collection from the left, using the empty map of counts as the accumulated state (which returns 0 if no value is found for a key). The function passed as an argument updates that map for each character by incrementing the count found for it.
Note that accessing a map directly (counts(c)) is likely to be unsafe in most situations (since it will throw an exception if no item is found). In this situation it's fine because in this scope I know I'm using a map with a default value. When accessing a map you will more often than not want to use get, which returns an Option. More on that on the official Scala documentation (here for version 2.13.2).
You can play around with this code here on Scastie.
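As a small illustration of the difference between direct access and get (not part of the original answer):
val m = Map('a' -> 2)
m.get('a')            // Some(2)
m.get('z')            // None -- safe lookup returns an Option
m.getOrElse('z', 0)   // 0
// m('z') would throw a NoSuchElementException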
On Scala 2.13 you can use the new groupMapReduce:
ransomNote.groupMapReduce(identity)(_ => 1)(_ + _)
val str = "hello"
val countsMap: Map[Char, Int] = str
  .groupBy(identity)
  .mapValues(_.length)
println(countsMap)
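For reference, with str = "hello" the last snippet prints Map(h -> 1, e -> 1, l -> 2, o -> 1) (entry order may vary); groupMapReduce produces the same kind of Map[Char, Int] in a single pass.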

Why is only one new object created when called multiple times in `map`?

As I understand it, a way to create a new ArrayBuffer with one element is to say
val buffer = ArrayBuffer(element)
or something like this:
val buffer = ArrayBuffer[Option[String]](None)
Suppose x is a collection with 3 elements. I'm trying to create a map that creates a new 1-element ArrayBuffer for each element in x and associates the element in x with the new buffer. (These are intentionally separate mutable buffers that will be modified by threads.) I tried this:
x.map(elem => (elem, ArrayBuffer[Option[String]](None))).toMap
However, I found (using System.identityHashCode) that only one ArrayBuffer was created, and all 3 elements were mapped to the same value.
Why is this happening? I expected that the tuple expression would be evaluated for each element of x and that this would result in a new ArrayBuffer being created for each evaluation of the expression.
What would be a good workaround?
I'm using Scala 2.11.
Update
In the process of creating a reproducible example, I figured out the problem. Here's the example; Source is an interface defined in our application.
def test1(x: Seq[Source]): Unit = {
  val containers = x.map(elem => (elem, ArrayBuffer[Option[String]](None))).toMap
  x.foreach(elem => println(
    s"test1: elem=${System.identityHashCode(elem)} container=${System.identityHashCode(containers(elem))}"))
  x.indices.foreach(n => containers(x(n)).update(0, Some(n.toString)))
  x.foreach(elem => println(s"resulting value: ${containers(elem)(0)}"))
}
What I missed was that for the values of x I was trying to use, the class implementing Source was returning true for equals() for all combinations of values. So the resulting map only had one key-value pair.
Apologies for not figuring this out sooner. I'll delete the question after a while.
I think your problem is the toMap. If all three elements compare equal, you end up with just one entry in the Map (as they all produce the same key).
I played a bit on Scalafiddle (remove the .toMap and you will have 3 ArrayBuffers).
Let me know if I have misunderstood you.
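A quick illustration of that behaviour (not from the original answer): toMap keeps only one entry per distinct key, with later pairs overwriting earlier ones.
val pairs = Seq(("a", 1), ("a", 2), ("a", 3))
pairs.toMap  // Map(a -> 3): duplicate keys collapse into a single entry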
I cannot seem to replicate the issue, for example
val m =
List(Some("a"), Some("b"), Some("c"))
.map(elem => (elem, ArrayBuffer[Option[String]](None)))
.toMap
m(Some("a")) += Some("42")
m
outputs
res2: scala.collection.immutable.Map[Some[String],scala.collection.mutable.ArrayBuffer[Option[String]]] = Map(
Some(a) -> ArrayBuffer(None, Some(42)),
Some(b) -> ArrayBuffer(None),
Some(c) -> ArrayBuffer(None)
)
where we see Some("42") was added to one buffer whilst others were unaffected.

How to work with a Spark RDD to produce or map to another RDD

I have a key/value RDD. I want to "iterate over" the entries in it and create, or map to, another RDD which could have more or fewer entries than the first RDD.
Example:
I have records in Accumulo that represent observations of colors in paintings.
An observation entity/object holds data on the painting name and the colors in the painting.
Observation
public String getPaintingName() {return paintingName;}
public List<String> getObservedColors() {return colorList;}
I pull the observations from Accumulo into my code as an RDD.
val observationRDD: RDD[(Text, Observation)] = getObservationsFromAccumulo();
I want to take this RDD and create an RDD of the form (Color, paintingName), where the key is the color observed and the value is the name of the painting in which the color was observed.
val colorToPaintingRDD: RDD[(String, String)] = observationRDD.somefunction({ case (_, observation) =>
  for (String color : observation.getObservedColors()) {
    // somehow output an entry into a new RDD
    // output/map (color, observation.getPaintingName)
  }
})
I know map can't work, because it's 1 to 1. I thought maybe observationRDD.flatMap(some function), but I can't seem to find any examples of how to do that to create a new, larger or smaller, RDD.
Could someone help me out and tell me if flatMap is correct, and if so give me an example using the example I provided, or tell me if I'm way off base?
Please understand this is just a simple example; it's not the content I'm asking about, it's how one would transform an RDD into an RDD with more or fewer entries.
You should use flatMap and return a List[(String, String)] for each element in the RDD. flatMap will flatten the result and you'll get an RDD[(String, String)].
I didn't try the code, but it would be something like this:
val colorToPaintingRDD: RDD[(String, String)] = observationRDD.flatMap { case (_, observation) =>
  observation.getObservedColors().map(color => (color, observation.getPaintingName))
}
Probably, if the getObservedColors method is in Java, you have to import JavaConversions and convert to a Scala list.
import scala.collection.JavaConversions._
observation.getObservedColors().toList
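As a side note (not from the original answer), JavaConversions is deprecated in recent Scala versions; a sketch of the same transformation with the explicit JavaConverters API, assuming the same Observation type, would be:
import scala.collection.JavaConverters._

val colorToPaintingRDD: RDD[(String, String)] = observationRDD.flatMap { case (_, observation) =>
  // asScala converts the java.util.List to a Scala collection so map is available
  observation.getObservedColors.asScala.map(color => (color, observation.getPaintingName))
}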

Convert RDD[String] to RDD[Row] to Dataframe Spark Scala

I am reading in a file that has many spaces and need to filter out the spaces. Afterwards we need to convert it to a dataframe. Example input below.
2017123 ¦ ¦10¦running¦00000¦111¦-EXAMPLE
My solution to this was the following function which parses out all spaces and trims the file.
def truncateRDD(fileName : String): RDD[String] = {
  val example = sc.textFile(fileName)
  example.map(lines => lines.replaceAll("""[\t\p{Zs}]+""", ""))
}
However, I am not sure how to get it into a dataframe. sc.textFile returns an RDD[String]. I tried the case class way, but the issue is that we have an 800-field schema, and a case class cannot go beyond 22 fields.
I was thinking of somehow converting RDD[String] to RDD[Row] so I can use the createDataFrame function.
val DF = spark.createDataFrame(rowRDD, schema)
Any suggestions on how to do this?
First split/parse your strings into the fields.
rdd.map( line => parse(line)) where parse is some parsing function. It could be as simple as split but you may want something more robust. This will get you an RDD[Array[String]] or similar.
You can then convert to an RDD[Row] with rdd.map(a => Row.fromSeq(a))
From there you can convert to a DataFrame using sqlContext.createDataFrame(rdd, schema), where rdd is your RDD[Row] and schema is your schema StructType.
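Putting those steps together, a minimal sketch; the '¦' delimiter and the generated column names are assumptions based on the example input, not from the original answer:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// parse: strip whitespace (as in truncateRDD) and split each line on the '¦' delimiter
val rowRDD = truncateRDD("yourfilename").map(line => Row.fromSeq(line.split('¦').toSeq))

// schema: one StringType column per field (800 in the question's case)
val schema = StructType((1 to 800).map(i => StructField(s"col$i", StringType, nullable = true)))

val df = spark.createDataFrame(rowRDD, schema)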
In your case, a simple way:
val RowOfRDD = truncateRDD("yourfilename").map(r => Row.fromSeq(r.split('¦')))
How to solve the product arity issue if you are using Scala 2.10?
However, I am not sure how to get it into a dataframe. sc.textFile returns a RDD[String]. I tried the case class way but the issue is we have 800 field schema, case class cannot go beyond 22.
Yes, there are some limitations like product arity, but we can overcome them...
You can do it like the example below for Scala versions earlier than 2.11:
Prepare a case class which extends Product and overrides the following methods:
productArity(): Int: This returns the number of attributes. In our case, it's 33.
productElement(n: Int): Any: Given an index, this returns the attribute. As protection, we also have a default case, which throws an IndexOutOfBoundsException.
canEqual(that: Any): Boolean: This is the last of the three functions, and it serves as a boundary condition when an equality check is being done against the class.
For an example implementation, you can refer to this Student case class, which has 33 fields in it (a sketch of the same pattern is shown below).
An example student dataset description is here.
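A minimal three-field sketch of the pattern described above (illustration only; the referenced Student example applies the same idea to 33 fields, and the class and field names here are made up):
class Record(val id: String, val name: String, val status: String) extends Product with Serializable {

  // productArity: the number of attributes (33 in the Student example, 3 here)
  override def productArity: Int = 3

  // productElement: return the attribute at the given index,
  // throwing IndexOutOfBoundsException for anything out of range
  override def productElement(n: Int): Any = n match {
    case 0 => id
    case 1 => name
    case 2 => status
    case _ => throw new IndexOutOfBoundsException(n.toString)
  }

  // canEqual: boundary condition for equality checks against this class
  override def canEqual(that: Any): Boolean = that.isInstanceOf[Record]
}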

Converting a Scala Map to a List

I have a map that I need to map to a different type, and the result needs to be a List. I have two ways (seemingly) to accomplish what I want, since calling map on a map seems to always result in a map. Assuming I have some map that looks like:
val input = Map[String, List[Int]]("rk1" -> List(1,2,3), "rk2" -> List(4,5,6))
I can either do:
val output = input.map{ case(k,v) => (k.getBytes, v) } toList
Or:
val output = input.foldRight(List[Pair[Array[Byte], List[Int]]]()){ (el, res) =>
  (el._1.getBytes, el._2) :: res
}
In the first example I convert the type, and then call toList. I assume the runtime is something like O(n*2) and the space required is n*2. In the second example, I convert the type and generate the list in one go. I assume the runtime is O(n) and the space required is n.
My question is, are these essentially identical or does the second conversion cut down on memory/time/etc.? Additionally, where can I find information on storage and runtime costs of various Scala conversions?
Thanks in advance.
My favorite way to do this kind of things is like this:
input.map { case (k,v) => (k.getBytes, v) }(collection.breakOut): List[(Array[Byte], List[Int])]
With this syntax, you are passing to map the builder it needs to construct the resulting collection. (Actually, not a builder, but a builder factory. Read more about Scala's CanBuildFroms if you are interested.) collection.breakOut is exactly what you use when you want to change from one collection type to another while doing a map, flatMap, etc. The only downside is that you have to use the full type annotation for it to be effective (here, I used a type ascription after the expression). Then, there's no intermediary collection being built, and the list is constructed while mapping.
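As a side note (not in the original answer), collection.breakOut was removed in Scala 2.13. On 2.13+ an equivalent single-pass conversion, assuming the same input map, goes through an iterator so that only the final List is materialized:
val output: List[(Array[Byte], List[Int])] =
  input.iterator.map { case (k, v) => (k.getBytes, v) }.toList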
Mapping over a view in the first example could cut down on the space requirement for a large map:
val output = input.view.map{ case(k,v) => (k.getBytes, v) } toList