I have two pair RDDs of type (Int, BreezeDenseMatrix[Double]), and what I want is, when the keys are the same, to subtract their values.
E.g. when i have
RDD_1 : (1, BreezeMatrix_a)
RDD_2: (1, BreezeMatrix_b)
wanted result: (1, BreezeMatrix_a-BreezeMatrix_b)
I tried join, but what is returned is (Int, (BreezeMatrix_a, BreezeMatrix_b)) and I don't know how the second part can be transformed. I can't tell whether it is a set or an array; Spark is not clear about that.
Any other ideas?
Let the result of the join be
joinresult = (Int, (BreezeMatrix_a, BreezeMatrix_b))
then compute
actualresult = joinresult.map(a => (a._1, a._2._1 - a._2._2))
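Equivalently, a pattern match makes the tuple structure explicit. A minimal sketch, assuming RDD_1 and RDD_2 from your example are both RDD[(Int, DenseMatrix[Double])] from Breeze:
import breeze.linalg.DenseMatrix

// join pairs up the matrices that share a key; the value side is a Tuple2
val actualresult = RDD_1.join(RDD_2).map {
  case (key, (a, b)) => (key, a - b) // element-wise Breeze matrix subtraction
}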
val df = sc.parallelize(Seq(("a", 1), ("a", null), ("b", null), ("b", 2), ("b", 3), ("c", 2), ("c", 4), ("c", 3))).toDF("col1", "col2")
The output should be like below.
col1 col2
a null
b null
c 4
I know that I could groupBy on col1 and get the max of col2, which I can do using df.groupBy("col1").agg("col2" -> "max").
But my requirement is: if a null is present for a key, I want to select that record, and if there is no null, I want to select the max of col2.
How can I do this? Can anyone please help me?
As I commented, your use of null makes things unnecessarily problematic, so if you can't work without null in the first place, I think it makes most sense to turn it into something more useful:
val df = sparkContext.parallelize(Seq(("a", 1), ("a", null), ("b", null), ("b", 2), ("b", 3), ("c", 2), ("c", 4), ("c", 3)))
  .mapValues { v =>
    Option(v) match {
      case Some(i: Int) => i
      case _            => Int.MaxValue
    }
  }
  .groupBy(_._1)
  .map { case (k, v) => k -> v.map(_._2).max }
First, I use Option to get rid of null and to move things down the tree from Any to Int so I can enjoy more type safety. I replace null with MaxValue for reasons I'll explain shortly.
Then I groupBy as you did, but then I map over the groups to pair the keys with the max of the values, which will either be one of your original data items or MaxValue where the nulls once were. If you must, you can turn them back into null, but I wouldn't.
There might be a simpler way to do all this, but I like the null replacement with MaxValue, the pattern matching which helps me narrow the types, and the fact I can just treat everything the same afterwards.
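If you really do need the nulls back at the end, a minimal sketch on top of the df value built above (note the value type widens back to Any):
val withNulls = df.map { case (k, v) => (k, if (v == Int.MaxValue) null else v) }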
What I want to do is simple, but I struggle with Scala and RDDs.
The concept is this:
rdd1            rdd2
id  count       id  count
a   2           a   1
b   1           c   5
                d   3
And the result I am searching for is this:
rdd2
id  count
a   3
b   1
c   5
d   3
What I intend to do is perform a full outer join to get common and non-common records, identified by the id field. For now, rdd2 is empty.
rdd1 and rdd2 are:
RDD[(String, org.apache.spark.sql.Row)]
For now, I have the following code:
var rdd3 = rdd1.fullOuterJoin(rdd2).map {
  case (id, (left, right)) =>
    // TODO
}
How can I calculate that sum between RDDs?
If you are doing a fullOuterJoin you get the key and two Options passed into the closure (one Option represents the left side, the other one the right side). So the closure could look like this:
val result = rdd1.fullOuterJoin(rdd2).map {
  case (id, (left, right)) =>
    (id, left.getOrElse(0) + right.getOrElse(0))
}
This applies if your RDD is of type (String, Int).
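Since your RDDs actually hold Rows rather than Ints, you need to pull the count out of each Row first. A hedged sketch, assuming the count sits at index 1 of each Row (adjust the index to your schema):
val result = rdd1.fullOuterJoin(rdd2).map {
  case (id, (left, right)) =>
    // extract the count from each side if present, default to 0 otherwise
    (id, left.map(_.getInt(1)).getOrElse(0) + right.map(_.getInt(1)).getOrElse(0))
}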
I have two RDDs; one RDD has just one column and the other has two. To join the two RDDs on keys I have added a dummy value of 0. Is there any other, more efficient way of doing this using join?
val lines = sc.textFile("ml-100k/u.data")
val movienamesfile = sc.textFile("ml-100k/u.item")
val moviesid = lines.map(x => x.split("\t")).map(x => (x(1), 0))
val test = moviesid.map(x => x._1)
val movienames = movienamesfile.map(x => x.split("\\|")).map(x => (x(0), x(1)))
val joined = movienames.join(moviesid).distinct()
Edit:
Let me convert this question into SQL. Say, for example, I have table1 (movieid) and table2 (movieid, moviename). In SQL we would write something like:
select moviename, movieid, count(1)
from table2 inner join table1 on table1.movieid = table2.movieid
group by ....
Here in SQL table1 has only one column whereas table2 has two, and the join still works. In the same way, can Spark join on keys from both RDDs?
The join operation is defined only on PairwiseRDDs, which are quite different from a relation / table in SQL. Each element of a PairwiseRDD is a Tuple2 where the first element is the key and the second is the value. Both can contain complex objects as long as the key provides a meaningful hashCode.
If you want to think about this in a SQL-ish way, you can consider the key as everything that goes into the ON clause and the value as the selected columns.
SELECT table1.value, table2.value
FROM table1 JOIN table2 ON table1.key = table2.key
While these approaches look similar at first glance and you can express one using the other, there is one fundamental difference: when you look at a SQL table and ignore constraints, all columns belong to the same class of objects, while the key and value in a PairwiseRDD each have a clear meaning.
Going back to your problem: to use join you need both a key and a value. Arguably much cleaner than using 0 as a placeholder would be to use the null singleton, but there is really no way around having some value.
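For example, a quick sketch of that placeholder idea, reusing lines and movienames from your code (the variable names here are illustrative):
// the value is never used after the join, so any cheap placeholder will do
val moviesByIdKey = lines.map(_.split("\t")).map(x => (x(1), null))
val joinedByName = movienames.join(moviesByIdKey)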
For small data you can use filter in a similar way to broadcast join:
val moviesidBD = sc.broadcast(
  lines.map(_.split("\t")).map(x => x(1)).collect.toSet)
movienames.filter { case (id, _) => moviesidBD.value contains id }
but if you really want SQL-ish joins then you should simply use SparkSQL.
// requires import sqlContext.implicits._ (or spark.implicits._ on Spark 2.x) for toDF
val movieIdsDf = lines
  .map(_.split("\t"))
  .map(a => Tuple1(a(1)))
  .toDF("id")
val movienamesDf = movienames.toDF("id", "name")
// Add optional join type qualifier
movienamesDf.join(movieIdsDf, movieIdsDf("id") <=> movienamesDf("id"))
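If you also want the grouped count from the SQL version, a rough sketch on top of the DataFrames above (column names as defined there) could be:
import org.apache.spark.sql.functions.count

movienamesDf
  .join(movieIdsDf, movieIdsDf("id") <=> movienamesDf("id"))
  .groupBy(movienamesDf("id"), movienamesDf("name"))
  .agg(count("*").alias("cnt"))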
On RDDs the join operation is only defined for PairwiseRDDs, so you need to turn the values into a paired RDD. Below is a sample:
val rdd1 = sc.textFile("/data-001/part/")
val rdd_1 = rdd1.map(x => x.split('|')).map(x => (x(0), x(1)))
val rdd2 = sc.textFile("/data-001/partsupp/")
val rdd_2 = rdd2.map(x => x.split('|')).map(x => (x(0), x(1)))
rdd_1.join(rdd_2).take(2).foreach(println)
In Apache Flink, if I join two data sets on one primary key I get a Tuple2 containing the corresponding entry from each of the data sets.
The problem is that applying the map() method to the resulting Tuple2 data set does not really look nice, especially if the entries of both data sets have a high number of features.
Using tuples in both input data sets gets me some code like this:
var in1: DataSet[(Int, Int, Int, Int, Int)] = /* */
var in2: DataSet[(Int, Int, Int, Int)] = /* */
val out = in1.join(in2).where(0, 1, 2).equalTo(0, 1, 2)
.map(join => (join._1._1, join._1._2, join._1._3,
join._1._4, join._1._5, join._2._4))
I would not mind using POJOs or case classes, but I don't see how this would make it better.
Question 1: Is there a nice way to flatten that Tuple2, e.g. using another operator?
Question 2: How do I handle a join of 3 data sets on the same key? It would make the example source even messier.
Thanks for helping.
You can directly apply a join function to each pair of joined elements, for example:
val leftData: DataSet[(String, Int, Int)] = ...
val rightData: DataSet[(String, Int)] = ...
val joined: DataSet[(String, Int, Int)] = leftData
  .join(rightData).where(0).equalTo(0) { (l, r) => (l._1, l._2, l._3 + r._2) }
To answer the second question: Flink handles only binary joins, so a join of three data sets is expressed as two chained binary joins (see the sketch below). However, Flink's optimizer can avoid unnecessary shuffles if you give it a hint about the behavior of your function. Forward field annotations tell the optimizer that certain fields (such as the join key) have not been modified by your join function, which enables reusing existing partitioning and sort orders.
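A rough sketch of the chaining, reusing in1 and in2 from the question and assuming a hypothetical third data set in3 that carries the same three key fields:
// assumes import org.apache.flink.api.scala._ is in scope
val in3: DataSet[(Int, Int, Int, Int)] = /* */

val out3 = in1.join(in2).where(0, 1, 2).equalTo(0, 1, 2) {
    (l, r) => (l._1, l._2, l._3, l._4, l._5, r._4)
  }
  .join(in3).where(0, 1, 2).equalTo(0, 1, 2) {
    (l, r) => (l._1, l._2, l._3, l._4, l._5, l._6, r._4)
  }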
I have created a pair RDD with a map like this:
val b = a.map(x => (x(0), x) )
Here b is of the type
org.apache.spark.rdd.RDD[(Any, org.apache.spark.sql.Row)]
How can I sort the PairRDD within each key using a field from the value row?
After that I want to run a function which processes all the values for each Key in isolation in the previously sorted order. Is that possible? If yes can you please give an example.
Is there any consideration needed for Partitioning the Pair RDD?
Answering only your first question:
val indexToSelect: Int = ??? // points to a sortable type (has an Ordering or is Ordered)
val sorted = b.sortBy(pair => pair._2(indexToSelect))
What this does is select the second element of the pair (pair._2) and, from that Row, pick the field at the appropriate position (the (indexToSelect) call, or more verbosely .apply(indexToSelect)).
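For the second part, processing all the values of each key in isolation and in sorted order, a hedged sketch (assuming the field at indexToSelect is an Int; adjust the getter to your schema):
val processedPerKey = b
  .groupByKey()
  .mapValues { rows =>
    // sort this key's rows locally, then run your per-key processing on them
    val sortedRows = rows.toSeq.sortBy(_.getInt(indexToSelect))
    sortedRows // replace with your processing function
  }
As for partitioning: groupByKey shuffles all values of a key to a single partition, so no special partitioning is needed beforehand, but heavily skewed keys can make individual partitions large.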