Using Apache Spark to union two lists of tuples - Scala

I'm attempting to union two RDDs:
val u1 = sc.parallelize(List ( ("a" , (1,2)) , ("b" , (1,2))))
val u2 = sc.parallelize(List ( ("a" , ("3")) , ("b" , (2))))
I receive this error:
scala> u1 union u2
<console>:17: error: type mismatch;
found : org.apache.spark.rdd.RDD[(String, Any)]
required: org.apache.spark.rdd.RDD[(String, (Int, Int))]
Note: (String, Any) >: (String, (Int, Int)), but class RDD is invariant in type
T.
You may wish to define T as -T instead. (SLS 4.5)
u1 union u2
^
The String in each of the above tuples is a key.
Is it possible to union these two types?
Once u1 and u2 are unioned I intend to use groupBy to group each item according to its key.

The issue you are facing is actually explained by the compiler: you are trying to union values of type (Int, Int) with values of type Any. The Any comes in as the common superclass of String and Int in this statement: sc.parallelize(List ( ("a" , ("3")) , ("b" , (2)))). This might be an error or might be intended.
In any case, I would try to make the values converge to a common type before the union.
Given that a pair (Int, Int) and a bare Int or String are different types, I'd consider some other container that is easier to transform.
Assuming that the "3" above is actually a 3 (Int):
val au1 = sc.parallelize(List ( ("a" , Array(1,2)) , ("b" , Array(1,2))))
val au2 = sc.parallelize(List ( ("a" , Array(3)) , ("b" , Array(2))))
au1 union au2
org.apache.spark.rdd.RDD[(String, Array[Int])] = UnionRDD[10] at union at <console>:17
res: Array[(String, Array[Int])] = Array((a,Array(1, 2)), (b,Array(1, 2)), (a,Array(3)), (b,Array(2)))
Once u1 and u2 are unioned I intend to use groupBy to group each item according to its key.
If you intend to group both RDDs by key, you may consider using join instead of union. That gets the job done at once:
au1 join au2
res: Array[(String, (Array[Int], Array[Int]))] = Array((a,(Array(1, 2),Array(3))), (b,(Array(1, 2),Array(2))))
If the "3" above is actually a "3" (String): I'd consider to map the values first to a common type. Either all strings or all ints. It will make the data easier to manipulate than having Any as type. Your life will be easier.

If you want to use a (key, value) RDD with any value type (I see you are mixing an (Int, Int), an Int and a String), you can declare the type of your RDD on creation:
val u1: org.apache.spark.rdd.RDD[(String, Any)] = sc.parallelize(List ( ("a" , (1,2)) , ("b" , (1,2))))
val u2: org.apache.spark.rdd.RDD[(String, Any)] = sc.parallelize(List ( ("a" , ("3")) , ("b" , (2))))
Then the union will work, because it is a union between two RDDs of the same type.
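With both declared as RDD[(String, Any)], a quick sketch of the union plus the grouping mentioned in the question (you will have to pattern match on the Any values downstream):
// Compiles because both sides are RDD[(String, Any)].
val grouped = (u1 union u2).groupByKey()   // RDD[(String, Iterable[Any])]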
Hope it helps

Related

Column bind two RDDs in Scala Spark without keys

The two RDDs have the same number of rows.
I am searching for the equivalent of R's cbind().
It seems join() always requires a key.
The closest is the .zip method, with appropriate subsequent .map usage. E.g.:
val rdd0 = sc.parallelize(Seq( (1, (2,3)), (2, (3,4)) ))
val rdd1 = sc.parallelize(Seq( (200,300), (300,400) ))
val zipRdd = (rdd0 zip rdd1).collect
returns:
zipRdd: Array[((Int, (Int, Int)), (Int, Int))] = Array(((1,(2,3)),(200,300)), ((2,(3,4)),(300,400)))
Indeed, no keys are needed; zip only requires both RDDs to have the same number of rows (and partitions).
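For completeness, a small sketch of the "appropriate subsequent .map", flattening each zipped pair into one cbind-style row (the flat tuple shape is just an illustration):
val cbindRdd = (rdd0 zip rdd1).map { case ((k, (a, b)), (c, d)) => (k, a, b, c, d) }
cbindRdd.collect
// Array((1,2,3,200,300), (2,3,4,300,400))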

ReduceByKey for a HashMap-based RDD

I have an RDD A of tuples of the form (key, HashMap[Int, Set[String]]) which I want to convert to a new RDD B of (key, HashMap[Int, Set[String]]), where the latter RDD has unique keys and the value for each key k is the union of all sets for key k in RDD A.
For example,
RDD A
(1,{1->Set(3,5)}), (2,{3->Set(5,6)}), (1,{1->Set(3,4), 7->Set(10, 11)})
will convert to
RDD B
(1, {1->Set(3,4,5), 7->Set(10,11)}), (2, {3->Set(5,6)})
I am not able to formulate a function for this in Scala as I am new to the language. Any help would be appreciated.
Thanks in advance.
cats Semigroup would be a great fit here. Add
spark.jars.packages org.typelevel:cats_2.11:0.9.0
to the configuration and use the combine method:
import cats.implicits._
val rdd = sc.parallelize(Seq(
(1, Map(1 -> Set(3,5))),
(2, Map(3 -> Set(5,6))),
(1, Map(1 -> Set(3,4), 7 -> Set(10, 11)))
))

rdd.reduceByKey(_ combine _)
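If you would rather not add the cats dependency, a minimal hand-rolled sketch of the same merge (mergeMaps is my own helper name, written against the rdd defined above):
// Union the sets of any keys that appear in both maps; keep the rest as-is.
def mergeMaps(a: Map[Int, Set[Int]], b: Map[Int, Set[Int]]): Map[Int, Set[Int]] =
  (a.keySet ++ b.keySet).map { k =>
    k -> (a.getOrElse(k, Set.empty[Int]) ++ b.getOrElse(k, Set.empty[Int]))
  }.toMap

rdd.reduceByKey(mergeMaps _)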

Joining 2 RDDs when one has an Option type as key

I have 2 RDDs I would like to join, which look like this:
val a:RDD[(Option[Int],V)]
val q:RDD[(Int,V)]
Is there any way I can do a left outer join on them?
I have tried this, but it does not work because the key types are different, i.e. Int vs Option[Int]:
q.leftOuterJoin(a)
The natural solution is to convert the Int to Option[Int] so they have the same type.
Following your example:
val a:RDD[(Option[Int],V)]
val q:RDD[(Int,V)]
q.map{ case (k, v) => (Option(k), v) }.leftOuterJoin(a)   // Option(k) rather than Some(k), so the key type is inferred as Option[Int]
If you want to recover the Int type at the output, you can do this:
q.map{ case (k, v) => (Option(k), v) }.leftOuterJoin(a).map{ case (k, v) => (k.get, v) }
Note that you can call .get without any problem, since it is not possible to get None keys there.
One way to do it is to convert the RDDs into DataFrames and join them.
Here is a simple example:
import spark.implicits._
val a = spark.sparkContext.parallelize(Seq(
(Some(3), 33),
(Some(1), 11),
(Some(2), 22)
)).toDF("id", "value1")
val q = spark.sparkContext.parallelize(Seq(
(Some(3), 33)
)).toDF("id", "value2")
q.join(a, a("id") === q("id") , "leftouter").show

Spark Joins with None Values

I am trying to perform a join in Spark knowing that one of my keys on the left does not have a corresponding value in the other RDD.
The documentation says it should perform the join with None as an option if no key is found, but I keep getting a type mismatch error.
Any insight here?
Take these two RDDs:
val rdd1 = sc.parallelize(Array(("test","foo"),("test2", "foo2")))
val rdd2 = sc.parallelize(Array(("test","foo3"),("test3", "foo4")))
When you join them, you have a couple of options. What you do depends on what you want. Do you want an RDD only with the common keys?
val joined = rdd1.join(rdd2)
joined.collect
res1: Array[(String, (String, String))] = Array((test,(foo,foo3)))
If you want keys missing from rdd2 to be filled in with None, use leftOuterJoin:
val leftOuter = rdd1.leftOuterJoin(rdd2)
leftOuter.collect
res2: Array[(String, (String, Option[String]))] = Array((test2,(foo2,None)), (test,(foo,Some(foo3))))
If you want keys missing from either side to be filled in with None, use fullOuterJoin:
val fullOuter = rdd1.fullOuterJoin(rdd2)
fullOuter.collect
res3: Array[(String, (Option[String], Option[String]))] = Array((test2,(Some(foo2),None)), (test3,(None,Some(foo4))), (test,(Some(foo),Some(foo3))))

How to "negative select" columns in spark's dataframe

I can't figure it out, but I guess it's simple. I have a Spark DataFrame df. This df has columns "A", "B" and "C". Now let's say I have an Array containing the names of the columns of this df:
column_names = Array("A","B","C")
I'd like to do a df.select() in such a way, that I can specify which columns not to select.
Example: let's say I do not want to select columns "B". I tried
df.select(column_names.filter(_!="B"))
but this does not work, as
org.apache.spark.sql.DataFrame
cannot be applied to (Array[String])
So, here it says it should work with a Seq instead. However, trying
df.select(column_names.filter(_!="B").toSeq)
results in
org.apache.spark.sql.DataFrame
cannot be applied to (Seq[String]).
What am I doing wrong?
Since Spark 1.4 you can use the drop method:
Scala:
case class Point(x: Int, y: Int)
val df = sqlContext.createDataFrame(Point(0, 0) :: Point(1, 2) :: Nil)
df.drop("y")
Python:
df = sc.parallelize([(0, 0), (1, 2)]).toDF(["x", "y"])
df.drop("y")
## DataFrame[x: bigint]
I had the same problem and solved it this way (oaffdf is a dataframe):
val dropColNames = Seq("col7","col121")
val featColNames = oaffdf.columns.diff(dropColNames)
val featCols = featColNames.map(cn => org.apache.spark.sql.functions.col(cn))
val featsdf = oaffdf.select(featCols: _*)
https://forums.databricks.com/questions/2808/select-dataframe-columns-from-a-sequence-of-string.html
OK, it's ugly, but this quick spark shell session shows something that works:
scala> val myRDD = sc.parallelize(List.range(1,10))
myRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[17] at parallelize at <console>:21
scala> val myDF = myRDD.toDF("a")
myDF: org.apache.spark.sql.DataFrame = [a: int]
scala> val myOtherRDD = sc.parallelize(List.range(1,10))
myOtherRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[20] at parallelize at <console>:21
scala> val myotherDF = myOtherRDD.toDF("b")
myotherDF: org.apache.spark.sql.DataFrame = [b: int]
scala> myDF.unionAll(myotherDF)
res2: org.apache.spark.sql.DataFrame = [a: int]
scala> myDF.join(myotherDF)
res3: org.apache.spark.sql.DataFrame = [a: int, b: int]
scala> val twocol = myDF.join(myotherDF)
twocol: org.apache.spark.sql.DataFrame = [a: int, b: int]
scala> val cols = Array("a", "b")
cols: Array[String] = Array(a, b)
scala> val selectedCols = cols.filter(_!="b")
selectedCols: Array[String] = Array(a)
scala> twocol.select(selectedCols.head, selectedCols.tail: _*)
res4: org.apache.spark.sql.DataFrame = [a: int]
Providing varargs to a function that requires them is treated in other SO questions. The signature of select is there to ensure your list of selected columns is not empty – which makes the conversion from the list of selected columns to varargs a bit more complex.
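To make that concrete, here is a small sketch of the two usual ways to pass a dynamic column list to select (keep is just an illustrative name, df being the question's dataframe):
import org.apache.spark.sql.functions.col

val keep = df.columns.filter(_ != "B")
df.select(keep.head, keep.tail: _*)   // String-based signature: select(col: String, cols: String*)
df.select(keep.map(col): _*)          // Column-based signature: select(cols: Column*)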
For Spark v1.4 and higher, use drop(*cols) -
Returns a new DataFrame without the specified column(s).
Example -
df.drop('age').collect()
For Spark v2.3 and higher you could also do it using colRegex(colName) -
Selects column based on the column name specified as a regex and returns it as Column.
Example-
df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["Col1", "Col2"])
df.select(df.colRegex("`(Col1)?+.+`")).show()
Reference - colRegex, drop
For older versions of Spark, take the list of columns in dataframe, then remove columns you want to drop from it (maybe using set operations) and then use select to pick the resultant list.
val columns = Seq("A","B","C").diff(Seq("B"))
df.select(columns.head, columns.tail: _*)
In pyspark you can do
df.select(list(set(df.columns) - set(["B"])))
Using more than one line you can also do
cols = df.columns
cols.remove("B")
df.select(cols)
It is possible to do it as follows.
It uses Spark's ability to select columns using regular expressions, via the negative look-ahead expression ?!.
In this case the dataframe has columns a, b, c, and the regex excludes column b from the list.
Notice: you need to enable regex column-name lookups with the spark.sql.parser.quotedRegexColumnNames=true session setting, and it requires Spark 2.3+ (a Scala sketch of the same query follows the SQL below).
select `^(?!b).*`
from (
select 1 as a, 2 as b, 3 as c
)
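A Scala sketch of running the same query from a Spark 2.3+ session (assuming spark is the SparkSession), with the setting from the note above enabled:
spark.conf.set("spark.sql.parser.quotedRegexColumnNames", "true")

spark.sql("""
  select `^(?!b).*`
  from (select 1 as a, 2 as b, 3 as c)
""").show()
// keeps columns a and c, drops b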