Parallelize collection in spark scala shell - scala

I am trying to parallelize a tuple and I am getting the error below. Please let me know what the error in the syntax below is.
Thank you

The parallelize method needs a Seq. Each item in the Seq will be one record.
def parallelize[T](seq: Seq[T],
numSlices: Int = defaultParallelism)
(implicit arg0: ClassTag[T]): RDD[T]
In your example, you need to add a Seq to wrap the tuple, and in that case the RDD has only ONE record:
scala> val rdd = sc.parallelize(Seq(("100", List("5", "-4", "2", "NA", "-1"))))
rdd: org.apache.spark.rdd.RDD[(String, List[String])] = ParallelCollectionRDD[2] at parallelize at <console>:24
scala> rdd.count
res4: Long = 1
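If the goal is for each tuple to become its own record, a minimal sketch (the variable names and data are made up) is simply to put several tuples inside the Seq:
// Each element of the Seq becomes one record of the RDD
val rdd2 = sc.parallelize(Seq(
  ("100", List("5", "-4", "2")),
  ("200", List("NA", "-1"))
))
rdd2.count   // 2 -- one record per element of the Seq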

Related

Transform Array[Seq[(Int, String)]] to Seq[(Int, String)] in SCALA

I'm pretty new to Scala and I can't find a way to flatten my Array[Seq[(Int, String)]] into one big Seq[(Int, String)] containing the (Int, String) pairs of each Seq[(Int, String)].
Here is a more explicit example:
Array[Seq[(Int, String)]]:
ArrayBuffer((1,a), (1,group), (1,of))
ArrayBuffer((2,following), (2,clues))
ArrayBuffer((3,three), (3,girls))
And here is what I want my Seq[(Int, String)] to look like:
Seq((1,a), (1,group), (1,of), (2,following), (2,clues), (3,three), (3,girls))
You are looking for flatten: val flat: Array[(Int, String)] = originalArray.flatten
If you want it to be a Seq rather than an array (good choice), just tuck a .toSeq at the end: originalArray.flatten.toSeq
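A quick illustration with the sample data above (a sketch, not the asker's actual variables):
val originalArray = Array(
  Seq((1, "a"), (1, "group"), (1, "of")),
  Seq((2, "following"), (2, "clues")),
  Seq((3, "three"), (3, "girls"))
)
// flatten concatenates the inner Seqs in order; toSeq changes the outer collection type
val flat: Seq[(Int, String)] = originalArray.flatten.toSeq
// Seq((1,a), (1,group), (1,of), (2,following), (2,clues), (3,three), (3,girls))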

scala spark reduceByKey with custom function

I want to use reduceByKey, but when I try to use it, it shows an error:
type mismatch; required: Nothing
Question: how can I create a custom function for reduceByKey?
{(key,value)}
key:string
value: map
example:
rdd = {("a", "weight"->1), ("a", "weight"->2)}
expect{("a"->3)}
def combine(x: mutable.map[string,Int],y:mutable.map[string,Int]):mutable.map[String,Int]={
x.weight = x.weithg+y.weight
x
}
rdd.reducebykey((x,y)=>combine(x,y))
Let's say you have an RDD[(K, V)] (a pair RDD, on which PairRDDFunctions are available) and you want to somehow combine values with the same key. You can then use reduceByKey, which expects a function (V, V) => V and gives you back a modified RDD[(K, V)].
Here, your rdd = {("a", "weight"->1), ("a", "weight"->2)} is not real Scala, and similarly the whole combine function is wrong both syntactically and logically (it will not compile). But I am guessing that what you have is something like the following:
val rdd = sc.parallelize(List(
  ("a", "weight" -> 1),
  ("a", "weight" -> 2)
))
This means that your rdd is of type RDD[(String, (String, Int))], so reduceByKey wants a function of type ((String, Int), (String, Int)) => (String, Int):
def combine(x: (String, Int), y: (String, Int)): (String, Int) =
  (x._1, x._2 + y._2)

val rdd2 = rdd.reduceByKey(combine)
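Collecting the result should then give one record per key, with the counts summed (a sketch of the expected output, not actual shell output):
rdd2.collect()   // Array(("a", ("weight", 3)))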
If your problem is something else, please update the question with your real code so that others can actually understand it.

scala: how to rectify "option" type after leftOuterJoin

Given
scala> val rdd1 = sc.parallelize(Seq(("a",1),("a",2),("b",3)))
scala> val rdd2 = sc.parallelize(Seq(("a",5),("c",6)))
scala> val rdd3 = rdd1.leftOuterJoin(rdd2)
scala> rdd3.collect()
res: Array[(String, (Int, Option[Int]))] = Array((a,(1,Some(5))), (a,(2,Some(5))), (b,(3,None)))
We can see the Option[Int] data type in rdd3. Is there a way to rectify this so that the collected result can be Array[(String, (Int, Int))]? Suppose we can specify a default value (e.g. 999) for the None case.
scala> val result = rdd3.collect()
scala> result.map(t => (t._1, (t._2._1, t._2._2.getOrElse(999))))
This should do it.
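Alternatively, the Option can be resolved on the RDD itself before collecting, using mapValues (a sketch of one possible approach):
// mapValues keeps the key and rewrites only the value part of each pair
val rdd4 = rdd3.mapValues { case (left, right) => (left, right.getOrElse(999)) }
rdd4.collect()   // expected: Array((a,(1,5)), (a,(2,5)), (b,(3,999)))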

Scala [Functional]: Find the actual indices of an element within a list that may be repeated

I want to write a method that takes a list of characters and returns a list in which each element is an (indexOfElement, element) tuple.
As you know, we can use indexOf, as follows:
def buggyAttempt (charsList: List[Char]): List[(Int, Char)] = charsList.map(char => (charsList.indexOf(char), char))
This works fine if there is no repetition among the elements. So the question is: how do I deal with a list of repeated characters? For example, if I feed it List('a', 'b', 'c', 'c'), I want to get List((0,a), (1,b), (2,c), (3,c)).
I want to solve this problem in a functional manner, so no mutable variables.
First of all, here is the version of your code that compiles:
def notBuggyAttempt(charsList: List[Char]): List[(Int, Char)] = {
  charsList.map(char => (charsList.indexOf(char), char))
}
This will return the tuples with only the first indices.
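For example, with a repeated element the indexOf version reports the first occurrence's index twice (a sketch of the behaviour):
notBuggyAttempt(List('a', 'b', 'c', 'c'))
// List((0,a), (1,b), (2,c), (2,c)) -- both 'c's get index 2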
To obtain what you want, though, you may use zipWithIndex, which returns a List[(Char, Int)]; then, if you want a List[(Int, Char)], you have to swap the elements:
def getIndexTuples(charsList: List[Char]): List[(Int, Char)] = {
  charsList.zipWithIndex.map(_.swap)
}
Let's say your input is
val input = List('a','b','c','c')
you can get the output using
input.zipWithIndex.collect {
  case (t1, t2) => (t2, t1)
}
Output will be
res0: List[(Int, Char)] = List((0,a), (1,b), (2,c), (3,c))
Use zipWithIndex with map as below:
x.zipWithIndex.map(v=>(v._2,v._1))
In Scala REPL:
scala> val x = List("a", "b", "c", "c")
x: List[String] = List(a, b, c, c)
scala> x.zipWithIndex.map(v=>(v._2,v._1))
res22: List[(Int, String)] = List((0,a), (1,b), (2,c), (3,c))
Using indexOf makes repeated passes over the list (one linear scan per element). If the list is very large, there will be significant performance overhead.
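To make the cost concrete, a rough sketch of the two approaches (sizes and names are illustrative):
val big = List.fill(10000)('x')
// Each indexOf call scans from the head of the list, so this is roughly O(n^2) overall
val quadratic = big.map(c => (big.indexOf(c), c))
// zipWithIndex walks the list once, so this stays O(n)
val linear = big.zipWithIndex.map(_.swap)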

Why does Spark/Scala compiler fail to find toDF on RDD[Map[Int, Int]]?

Why does the following end up with an error?
scala> import sqlContext.implicits._
import sqlContext.implicits._
scala> val rdd = sc.parallelize(1 to 10).map(x => (Map(x -> 0), 0))
rdd: org.apache.spark.rdd.RDD[(scala.collection.immutable.Map[Int,Int], Int)] = MapPartitionsRDD[20] at map at <console>:27
scala> rdd.toDF
res8: org.apache.spark.sql.DataFrame = [_1: map<int,int>, _2: int]
scala> val rdd = sc.parallelize(1 to 10).map(x => Map(x -> 0))
rdd: org.apache.spark.rdd.RDD[scala.collection.immutable.Map[Int,Int]] = MapPartitionsRDD[23] at map at <console>:27
scala> rdd.toDF
<console>:30: error: value toDF is not a member of org.apache.spark.rdd.RDD[scala.collection.immutable.Map[Int,Int]]
rdd.toDF
So what exactly is happening here? toDF can convert an RDD of type (scala.collection.immutable.Map[Int,Int], Int) to a DataFrame, but not one of type scala.collection.immutable.Map[Int,Int]. Why is that?
For the same reason that you cannot use
sqlContext.createDataFrame((1 to 10).map(x => Map(x -> 0)))
If you take a look at the org.apache.spark.sql.SQLContext source you'll find two different implementations of the createDataFrame method:
def createDataFrame[A <: Product : TypeTag](rdd: RDD[A]): DataFrame
and
def createDataFrame[A <: Product : TypeTag](data: Seq[A]): DataFrame
As you can see, both require A to be a subclass of Product. When you call toDF on an RDD[(Map[Int,Int], Int)] it works because Tuple2 is indeed a Product. Map[Int,Int] by itself is not, hence the error.
You can make it work by wrapping Map with Tuple1:
sc.parallelize(1 to 10).map(x => Tuple1(Map(x -> 0))).toDF
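The single wrapped column can also be given a readable name via toDF's varargs form (a sketch, with a made-up column name):
val df = sc.parallelize(1 to 10)
  .map(x => Tuple1(Map(x -> 0)))
  .toDF("m")   // one column "m" of type map<int,int>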
Basically, it is because there is no implicit conversion to create a DataFrame for a Map inside an RDD.
In your first example you are returning a tuple, which is a Product, and for Products there is an implicit conversion:
rddToDataFrameHolder[A <: Product : TypeTag](rdd: RDD[A])
In the second example you have a Map in your RDD, for which there is no implicit conversion.
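Another option (a sketch, with a made-up case class name) is to wrap the Map in a case class, which is also a Product and is therefore picked up by that implicit conversion:
case class Entry(m: Map[Int, Int])
// Entry is a Product, so rddToDataFrameHolder applies and toDF becomes available
val df = sc.parallelize(1 to 10).map(x => Entry(Map(x -> 0))).toDF
// df has a single column "m" of type map<int,int>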