I am new to Spark and Scala. Is there a way in Spark to get the partition ID/number from RDD.mapPartitionsWithIndex, where the function is defined as follows:
def randomint(index: Int, iter: Iterator[T]): Iterator[(Int, T)] = {
...
}
self.mapPartitionsWithIndex(randomint).partitionBy(new randParti(nump)).values
Your naming might be confusing, but the index parameter of the randomint function contains exactly what you are looking for: the partition number.
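A minimal sketch of how this can be used (assuming an existing SparkContext named sc; variable names are illustrative):
val rdd = sc.parallelize(1 to 10, numSlices = 3)

// Tag every element with the number of the partition it lives in
val withPartitionId = rdd.mapPartitionsWithIndex { (index, iter) =>
  iter.map(elem => (index, elem))
}

withPartitionId.collect().foreach(println)  // e.g. (0,1), (0,2), ..., (2,10)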
I am new to Spark programming and have come across a scenario where I need to use a case class in my RDDs (I am also new to case classes).
For example, I have an RDD of tuples of type:
Array[(String,String,String)]
having values like:
Array((20254552,ATM,-5100), (20174649,ATM,5120)........)
Is there any method to convert the above RDD into:
20254552,trans(ATM,-5100)
where trans is a case class?
Yes, you can definitely do that. The following code should help:
val array = Array((20254552,"ATM",-5100), (20174649,"ATM",5120))
val rdd = sparkContext.parallelize(array)
val transedRdd = rdd.map(x => (x._1, trans(x._2, x._3)))
You should define the case class outside your current class:
case class trans(atm : String, num: Int)
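For a quick check, collecting transedRdd should give something like this (output shown for illustration):
transedRdd.collect()
// Array((20254552,trans(ATM,-5100)), (20174649,trans(ATM,5120)))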
I hope it helps
It's not really the answer to your question, but I recommend that you use DataFrames and Datasets as much as possible. Using them brings many benefits: more concise and efficient code, a well-tested framework with optimizations that use less memory, and full use of the Spark engine.
Please refer to A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets for more information about the differences and use cases of RDDs, DataFrames and Datasets.
Using Datasets, the solution to your problem is very simple:
import spark.implicits._
val ds = Seq((20254552,"ATM",-5100), (20174649,"ATM",5120)).toDS()
val transsedds = ds.map(x => (x._1, trans(x._2, x._3)))
As @Ramesh says, you should define the case class outside your current class:
case class trans(atm : String, num: Int)
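If you also want the id inside the result type, one possible variation (the Record case class below is my own illustration, not something from the question) is:
// Hypothetical wrapper case class; like trans, define it outside the enclosing class
case class Record(id: Int, t: trans)

val typedDs = ds.map(x => Record(x._1, trans(x._2, x._3)))  // Dataset[Record]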
Hope it helps.
I am not sure if "type" is the right word to use here, but let's say I have an RDD of the following type:
RDD[(Long, Array[(Long, Double)])]
Now if I have the RDD, how can I find its type (as mentioned above) at runtime?
I basically want to compare two RDDs at runtime to see if they store the same kind of data (the values themselves might be different); is there another way to do it? Moreover, I want to get a cached RDD as an instance of the RDD type using the following code:
sc.getPersistentRDDs(0).asInstanceOf[RDD[(Long, Array[(Long, Double)])]]
where RDD[(Long, Array[(Long, Double)])] has been determined dynamically at runtime based on another RDD of the same type.
So is there a way to get this type at runtime from an RDD?
You can use Scala's TypeTags
import scala.reflect.runtime.universe._
def checkEqualParameters[T1, T2](x: T1, y: T2)(implicit type1: TypeTag[T1], type2: TypeTag[T2]) = {
  type1.tpe.typeArgs == type2.tpe.typeArgs
}
And then compare
checkEqualParameters(rdd1, rdd2)
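For example (a minimal sketch assuming a SparkContext sc and the element types from your question):
val rdd1 = sc.parallelize(Seq((1L, Array((2L, 3.0)))))  // RDD[(Long, Array[(Long, Double)])]
val rdd2 = sc.parallelize(Seq((9L, Array((8L, 7.5)))))  // same element type
val rdd3 = sc.parallelize(Seq(("a", 1)))                // RDD[(String, Int)]

checkEqualParameters(rdd1, rdd2)  // true  -- same type arguments
checkEqualParameters(rdd1, rdd3)  // false -- different type arguments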
How can I sort by two or more columns using the takeOrdered(4)(Ordering[Int]) approach in Spark/Scala?
I can achieve this using sortBy like this:
lines.sortBy(x => (x.split(",")(1).toInt, -x.split(",")(4).toInt)).map(p => println(p)).take(50)
But when I try to sort using the takeOrdered approach, it fails.
tl;dr Do something like this (but consider rewriting your code to call split only once):
lines.map(x => (x.split(",")(1).toInt, -x.split(",")(4).toInt)).takeOrdered(50)
Here is the explanation.
When you call takeOrdered directly on lines, the implicit Ordering that takes effect is Ordering[String] because lines is an RDD[String]. You need to transform lines into a new RDD[(Int, Int)]. Because there is an implicit Ordering[(Int, Int)] available, it takes effect on your transformed RDD.
Meanwhile, sortBy works a little differently. Here is the signature:
sortBy[K](f: (T) ⇒ K, ascending: Boolean = true, numPartitions: Int = this.partitions.length)(implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T]
I know that is an intimidating signature, but if you cut through the noise, you can see that sortBy takes a function that maps your original type to a new type just for sorting purposes and applies the Ordering for that return type if one is in implicit scope.
In your case, you are applying a function to the Strings in your RDD to transform them into a "view" of how Spark should treat them merely for sorting purposes, i.e. as an (Int, Int), and then relying on the fact that the implicit Ordering[(Int, Int)] is available as mentioned.
The sortBy approach allows you to keep lines intact as an RDD[String] and use the mapping just for sorting, while the takeOrdered approach operates on a brand new RDD containing the (Int, Int) pairs derived from the original lines. Which approach is more suitable depends on what you wish to accomplish.
On another note, you probably want to rewrite your code to only split your text once.
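A sketch of what that could look like (it also keeps the original lines in the result, as sortBy did):
val ordered = lines
  .map { line =>
    val cols = line.split(",")            // split each line only once
    ((cols(1).toInt, -cols(4).toInt), line)
  }
  .takeOrdered(50)(Ordering.by(_._1))     // order by the (Int, Int) key only
  .map(_._2)                              // recover the original lines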
You could implement your custom Ordering:
lines.takeOrdered(4)(new Ordering[String] {
  override def compare(x: String, y: String): Int = {
    val xs = x.split(",")
    val ys = y.split(",")
    // primary key: column 1, ascending
    val d1 = xs(1).toInt - ys(1).toInt
    // secondary key: column 4, descending
    if (d1 != 0) d1 else ys(4).toInt - xs(4).toInt
  }
})
Referring to this question : NullPointerException in Scala Spark, appears to be caused be collection type?
Answer states "Spark doesn't support nesting of RDDs (see https://stackoverflow.com/a/14130534/590203 for another occurrence of the same problem), so you can't perform transformations or actions on RDDs inside of other RDD operations."
This code :
val x = sc.parallelize(List(1, 2, 3))

def fun1(n: Int) = {
  fun2(n)
}

def fun2(n: Int) = {
  n + 1
}

x.map(v => fun1(v)).take(1)
prints:
Array[Int] = Array(2)
This is correct.
But doesn't this disagree with "can't perform transformations or actions on RDDs inside of other RDD operations", since a nested action is occurring on an RDD?
No. In the linked question d.filter(...) returns an RDD, so the type of
d.distinct().map(x => d.filter(_.equals(x)))
is RDD[RDD[String]]. This isn't allowed, but it doesn't happen in your code. If I understand the answer correctly, you also can't refer to d or other RDDs inside map, even if you never end up with an RDD[RDD[SomeType]].
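For illustration, this is the shape of code that fails, together with one way to express the same grouping without nesting RDDs (a sketch, not the linked answer's exact fix):
// Fails: the RDD d is referenced inside another RDD's map
// val nested = d.distinct().map(x => d.filter(_.equals(x)))  // would be RDD[RDD[String]]

// One way to express the same idea with a single RDD pipeline:
val grouped = d.map(x => (x, x)).groupByKey()  // RDD[(String, Iterable[String])]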
scala> val two = (1,2)
two: (Int, Int) = (1,2)
scala> val one = (1,)
<console>:1: error: illegal start of simple expression
val one = (1,)
^
scala> val zero = ()
zero: Unit = ()
Is this:
val one = Tuple1(5)
really the most concise way to write a singleton tuple literal in Scala? And does Unit work like an empty tuple?
Does this inconsistency bother anyone else?
really the most concise way to write a singleton tuple literal in Scala?
Yes.
And does Unit work like an empty tuple?
No, since it does not implement Product.
Does this inconsistency bother anyone else?
Not me.
It really is the most concise way to write a tuple with an arity of 1.
In the comments above I see many references to "why Tuple1 is useful".
Tuples in Scala extend the Product trait, which lets you iterate over the tuple members.
One can implement a method that takes a parameter of type Product; in that case, tuples are the only generic way to handle fixed-size groups of values of mixed types without losing type information, and Tuple1 is what you use when there is just one value.
There are other reasons for using Tuple1, but this is the most common use-case that I had.
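A minimal sketch of that use-case (the method and values are illustrative):
// A method that only relies on Product's generic interface
def describe(p: Product): String =
  p.productIterator.mkString("(", ", ", ")")

describe((1, "a", 2.5))  // "(1, a, 2.5)"
describe(Tuple1(42))     // "(42)" -- Tuple1 is the only way to pass a single value here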
I have never seen a single use of Tuple1. Nor can I imagine one.
In Python, where people do use it, tuples are fixed-size collections. Tuples in Scala are not collections; they are cartesian products of types. So an Int x Int is a Tuple2[Int, Int], or (Int, Int) for short. Naturally, a product of a single type is just that type (an Int is an Int), and a product of no types is meaningless.
The previous answers have covered a valid tuple of one element.
For one of zero elements, this code could work:
object tuple0 extends AnyRef with Product {
  def productArity = 0
  def productElement(n: Int) = throw new IllegalStateException("No element")
  def canEqual(that: Any) = false
}