What is the result of the join of two RDDs? - pyspark

The element in clickRdd is (h5id,[query]), where h5id is a long number and query is a string; the element in revealRdd is (h5id, [0:id, 1:query, 2:q0, 3:q1, 4:q2, 5:q3, 6:s0, 7:s1, 8:s2, 9:s3] ).
And what is the result of clickJoin = clickRdd.join(revealRdd)? I guess the join key is h5id.
Can anyone give me the content after joining?

The joined RDD will have the values from both RDDs in a tuple, with h5id as the key.
clickJoin.take(1)
[(h5id, ([query], [0:id, 1:query, 2:q0, 3:q1, 4:q2, 5:q3, 6:s0, 7:s1, 8:s2, 9:s3]))]

Related

How to pick rows from a dataframe by comparing with a hashmap

I have two dataframes,
df1
id slt sln elt eln start end
df2
id evt slt sln speed detector
Hashmap
Map(351608084643945 -> List(1544497916,1544497916), 351608084643944 -> List(1544498103,1544498093))
I want to compare the values in the list: if the two values in the list match, then I want the full row from dataframe df1 for that id;
else, the full row from df2 for that id.
Both the dataframes and maps will have distinct and unique id.
If I understand correctly, you want to traverse your hash map and, for each entry, check whether all the values in the list are the same. If the list has identical elements, you want the data from df1 for that key; otherwise from df2. If that is what you want, the code below does it.
hashMap.foreach { x =>
  val key = x._1.toString
  val valueElements = x._2.toList
  if (valueElements.forall(_ == valueElements.head)) {
    // all values in the list match -> take the row from df1 for this id
    df1.filter($"id".equalTo(key)).show()
  } else {
    // values differ -> take the row from df2 for this id
    df2.filter($"id".equalTo(key)).show()
  }
}
Two steps:
Step One: Split the hashmap into two hashmaps: one with the matched entries, the other with the unmatched entries.
Step Two: Join the matched hashmap with df1 on id to get the matched rows from df1, and join the unmatched hashmap with df2 on id to get the rows from df2 (see the sketch below).
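A minimal sketch of that two-step approach, assuming an active SparkSession named spark and that both dataframes carry a compatible id column; the variable names are illustrative:
import spark.implicits._

// Step one: split the map by whether all values in the list are equal.
val (matched, unmatched) = hashMap.partition { case (_, values) =>
  values.forall(_ == values.head)
}

// Step two: turn each key set into a one-column DataFrame and join on id.
val matchedIds   = matched.keys.toSeq.toDF("id")
val unmatchedIds = unmatched.keys.toSeq.toDF("id")

val matchedRows   = df1.join(matchedIds, Seq("id"))   // full rows from df1
val unmatchedRows = df2.join(unmatchedIds, Seq("id")) // full rows from df2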

How to do aggregation on dataframe to get distinct count of column

How do I apply a where condition on a dataframe? For example, I need to groupBy one column and count the distinct values in another column based on a certain where condition, and I need to do this where condition for multiple columns.
I tried the way below. Please let me know how I can do this.
case class testRdd(name:String,id:Int,price:Int)
val Cols = testRdd.toDF().groupBy("id").agg(countDistinct("name").when(col("price") > 0, 1).otherwise(0))
This will not work. Or is there a way to do something like the following? Thanks in advance.
testRdd.toDF().groupBy("id").agg(if(col("price")>0)countDistinct("name"))
Here is an alternative approach to #Robin's answer, namely introducing an additional boolean column to group by:
import org.apache.spark.sql.functions.{countDistinct, when}

df.groupBy($"id", when($"price" > 0, true).otherwise(false).as("positive_price"))
  .agg(
    countDistinct($"name")
  )
  .where($"positive_price")
  .show
testRDD.select("name","id").where($"price">0).distinct.groupBy($"id").agg( count("name")).show
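If the goal is a conditional distinct count over several columns in a single pass, another option is to feed when without an otherwise into the aggregate, so non-matching rows become null and are ignored by countDistinct. A sketch with the same implicits in scope as the answer above, under the assumption that a second numeric column such as qty exists (qty is made up for illustration):

import org.apache.spark.sql.functions.{countDistinct, when}

df.groupBy($"id")
  .agg(
    countDistinct(when($"price" > 0, $"name")).as("names_price_gt_0"),
    countDistinct(when($"qty" > 0, $"name")).as("names_qty_gt_0")
  )
  .show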

Apache Spark Dataframe - Issue with setting up a not-equal join

I have 2 dataframes that I'm joining on multiple columns. The first pair of columns is an equality comparison and the second pair is a not-equals comparison. The code looks like this:
val arule_1w = itemLHS
  .join(itemRHS, itemLHS("CUST_ID") === itemRHS("CUST_ID") && itemLHS("LHS") != itemRHS("RHS"))
The resulting data still has rows where itemLHS("LHS") equals itemRHS("RHS"), which it shouldn't with the not-equal join. It may be user error as well, but all my research tells me that format is correct. All datatypes are string values.
Thanks for your help!
The correct method is =!=, not !=.
Use the syntax below:
itemLHS("LHS") !== itemRHS("RHS")

Join two RDD in spark

I have two RDDs. One RDD has just one column, the other has two columns. To join the two RDDs on keys I have added a dummy value, which is 0. Is there any other, more efficient way of doing this using join?
val lines = sc.textFile("ml-100k/u.data")
val movienamesfile = sc.textFile("Cml-100k/u.item")
val moviesid = lines.map(x => x.split("\t")).map(x => (x(1),0))
val test = moviesid.map(x => x._1)
val movienames = movienamesfile.map(x => x.split("\\|")).map(x => (x(0),x(1)))
val shit = movienames.join(moviesid).distinct()
Edit:
Let me convert this question to SQL. Say for example I have table1 (movieid) and table2 (movieid, moviename). In SQL we write something like:
select moviename, movieid, count(1)
from table2 inner join table1 on table1.movieid = table2.movieid
group by ....
Here in SQL, table1 has only one column whereas table2 has two columns, and still the join works; in the same way, can I join on keys from both RDDs in Spark?
The join operation is defined only on PairwiseRDDs, which are quite different from a relation / table in SQL. Each element of a PairwiseRDD is a Tuple2 where the first element is the key and the second is the value. Both can contain complex objects as long as the key provides a meaningful hashCode.
If you want to think about this in a SQL-ish way, you can consider the key as everything that goes into the ON clause and the value as the selected columns.
SELECT table1.value, table2.value
FROM table1 JOIN table2 ON table1.key = table2.key
While these approaches look similar at first glance and you can express one using the other, there is one fundamental difference: when you look at a SQL table and ignore constraints, all columns belong to the same class of objects, while the key and value in a PairwiseRDD each have a clear meaning.
Going back to your problem: to use join you need both a key and a value. Arguably a cleaner placeholder than 0 would be the null singleton, but there is really no way around needing one.
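A minimal sketch of that placeholder approach, reusing lines and movienames from the question (the variable names below are made up):

// The single-column RDD gets a null placeholder value so it becomes a pair RDD.
val movieIdPairs = lines
  .map(_.split("\t"))
  .map(a => (a(1), null))                     // key = movie id, value = placeholder

val joined = movienames.join(movieIdPairs)    // RDD[(String, (String, Null))]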
For small data you can use filter in a similar way to broadcast join:
val moviesidBD = sc.broadcast(
  lines.map(_.split("\t")).map(_(1)).collect.toSet)  // field 1 is the movie id, matching the question's key

movienames.filter { case (id, _) => moviesidBD.value contains id }
but if you really want SQL-ish joins then you should simply use SparkSQL.
val movieIdsDf = lines
  .map(x => x.split("\t"))
  .map(a => Tuple1(a(1)))    // field 1 is the movie id, matching the question's key
  .toDF("id")
val movienamesDf = movienames.toDF("id", "name")
// Add optional join type qualifier
movienamesDf.join(movieIdsDf, movieIdsDf("id") <=> movienamesDf("id"))
On RDDs the join operation is only defined for PairwiseRDDs, so you need to turn the values into a pair RDD. Below is a sample:
val rdd1 = sc.textFile("/data-001/part/")
val rdd_1 = rdd1.map(_.split('|')).map(x => (x(0), x(1)))

val rdd2 = sc.textFile("/data-001/partsupp/")
val rdd_2 = rdd2.map(_.split('|')).map(x => (x(0), x(1)))

rdd_1.join(rdd_2).take(2).foreach(println)

Order by Value in Spark pairRDD from (Key,Value) where the value is from spark-sql

I have created a map like this -
val b = a.map(x => (x(0), x) )
Here b is of the type
org.apache.spark.rdd.RDD[(Any, org.apache.spark.sql.Row)]
How can I sort the PairRDD within each key using a field from the value row?
After that I want to run a function which processes all the values for each Key in isolation in the previously sorted order. Is that possible? If yes can you please give an example.
Is there any consideration needed for Partitioning the Pair RDD?
Answering only your first question:
val indexToSelect: Int = ??? // points to a sortable type (has an Ordering or is Ordered)
val sorted = rdd.sortBy(pair => pair._2(indexToSelect))
What this does is select the second element of the pair (pair._2), i.e. the Row, and from that Row select the appropriate value ((indexToSelect), or more verbosely: .apply(indexToSelect)).
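A more concrete sketch, using the b RDD from the question and assuming the chosen field is, say, a Long; a typed getter such as getLong gives the compiler an Ordering to sort by (the field index is illustrative):

// sort the (key, Row) pairs by one numeric field of each Row
val sortedByField = b.sortBy { case (_, row) => row.getLong(indexToSelect) }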