search rdd for value from another rdd

search rdd for value from another rdd - scala

I am using Spark + Scala. My rdd1 has customer info i.e. (id, [name, address]). rdd2 has only names of high profile customers. Now I want to find if customer in rdd1 is high profile or not. How can I search one rdd using another? Joining rdd's is not looking like a good solution for me.
My code:
val result = rdd1.map( case (id, customer) =>
customer.foreach ( c =>
rdd2.filter(_ == c._1).count()!=0 ))
Error:
org.apache.spark.SparkException: RDD transformations and actions can only be invoked by the driver, not inside of other transformations;

You have to broadcast one rdd by collecting it. You can broadcast the smaller rdd to improve performance.
val bcastRdd = sc.broadcast(rdd2.collect)
rdd1.map(
case (id, customer) => customer.foreach(c =>
bcastRdd.value.filter(_ == c._1).count()!=0))

You can use the left outer join, to avoid an expensive operation such as the collect (if your RDDs are big)
Also like Daniel pointed out, a broadcast is not necessary.
Here is a snippet that can help to obtain RDD1 with a flag which denotes he is a high profile customer or a low profile customer.
val highProfileFlag = 1
val lowProfileFlag = 0
// Keying rdd 1 by the name
val rdd1Keyed = rdd1.map { case (id, (name, address)) => (name, (id, address)) }
// Keying rdd 2 by the name and adding a high profile flag
val rdd2Keyed = rdd2.map { case name => (name, highProfileFlag) }
// The join you are looking for is the left outer join
val rdd1HighProfileFlag = rdd1Keyed
.leftOuterJoin(rdd2Keyed)
.map { case (name, (id, address), highProfileString) =>
val profileFlag = highProfileString.getOrElse(lowProfileFlag)
(id , (name, address, profileFlag))
}

Related

How do you use the salted key technique to reduce groupBy skew in spark for RDDs

I have lets say rddA = List( ((skewKey, (col1, col2)) ), where skewKey is a number between 1 and 10k assigned to each col1 such that the total distribution of items over all the skewKeys are uniform.
I would like to do an operation grouped on col1, but it's highly skewed so some tasks take forever, so I would like to first repartition to 10k partitions using the skew key.
rddA.groupBy(x => x._1, numPartitions = 10k) // group by skew key first
.flatMap( // remove skewKey
group => {
(
group._2.map(x => x._2)
)
}
)
.groupBy(x => x._1) // then do the actual group computation
.map( // what I actually want to compute
group => {
(
group._1,
some_process(group._2)
)
}
)
Will this give me what I want? two group by operations, one to repartition to the skewKey, and the second to the actual groupBy operation

Spark Enhance Join between Terabytes of Datasets

I have five Hive tables assume the names is A, B, C, D, and E. For each table there is a customer_id as the key for join between them. Also, Each table contains at least 100:600 columns all of them is Parquet format.
Example of one table below:
CREATE TABLE table_a
(
customer_id Long,
col_1 STRING,
col_2 STRING,
col_3 STRING,
.
.
col_600 STRING
)
STORED AS PARQUET;
I need to achieve two points,
Join all of them together with the most optimum way using Spark Scala. I tried to sortByKey before join but still there is a performance bottleneck. I tried to reparation by key before join but the performance is still not good. I tried to increase the parallelism for Spark to make it 6000 with many executors but not able to achieve a good results.
After join I need to apply a separate function for some of these columns.
Sample of the join I tried below,
val dsA = spark.table(table_a)
val dsB = spark.table(table_b)
val dsC = spark.table(table_c)
val dsD = spark.table(table_d)
val dsE = spark.table(table_e)
val dsAJoineddsB = dsA.join(dsB,Seq(customer_id),"inner")

I think in this case the direct join is not the optimal case. You can acheive this task using the below simple way.
First, create case class for example FeatureData with two fields case class FeatureData(customer_id:Long, featureValue:Map[String,String])
Second, You will map each table to FeatureData case class key, [feature_name,feature_value]
Third, You will groupByKey and union all the dataset with the same key.
I the above way it will be faster to union than join. But it need more work.
After that, you will have a dataset with key,map. You will apply the transformation for key, Map(feature_name).
Simple example of the implementation as following:
You will map first the dataset to the case class then you can union all of them. After that you will groupByKey then map it and reduce it.
case class FeatureMappedData(customer_id:Long, feature: Map[String, String])
val dsAMapped = dsA.map(row ⇒
FeatureMappedData(row.customer_id,
Map("featureA" -> row.featureA,
"featureB" -> row.featureB)))
val unionDataSet = dsAMapped union dsBMapped
unionDataSet.groupByKey(_.customer_id)
.mapGroups({
case (eid, featureIter) ⇒ {
val featuresMapped: Map[String, String] = featureIter.map(_.feature).reduce(_ ++ _).withDefaultValue("0")
FeatureMappedData(customer_id, featuresMapped)
}
})

Calculating a variable inside RDD after full outer join in Scala

What I want to do is simpl, but I struggle with Scala and RDDs.
The concept is this:
rdd1 rdd2
id count id count
a 2 a 1
b 1 c 5
d 3
And the result I am searching for is this:
rdd2
id count
a 3
b 1
c 5
d 3
what I intend to do is to perform a full outer join to get common and non common registers, identified by the id field. For now, rdd2, is empty.
rdd1 and rdd2 are:
RDD[(String, org.apache.spark.sql.Row)]
For now, I have the following code:
var rdd3 = rdd1.fullOuterJoin(rdd2).map {
case (id, left, right) =>
// TODO
}
How can I calculate that sum between RDDs?

If you are doing a fullOuterJoin you get the key and two Options passed into the closure (one Option represents the left side, the other one the right side). So the closure could look like this:
val result = rdd1.fullOuterJoin(rdd2).map {
case (id, (left, right)) =>
(id, left.getOrElse(0) + right.getOrElse(0))
}
This applies if your RDD is of type (String, Int).

Why external sorter works in spark

I made processing data code in scala&spark and somehow it's so slow. I guess it's because of 'ExternalSort'. As you can see my code below, There is no reason to sort data but spark did.
I have more than 6,000,000 rows in RDD and try to cluster data with column name 'ID' (which are less than 20 types, so each ID group would be more than 300,000 rows)
I know It's pretty large data but other process were not slow. Any idea of this?
val ListByID = allData.map { x => (x.getAs[String]("ID"), List(x)) }.reduceByKey { (a: List[Row], b: List[Row]) => List(a, b).flatten }
val goalData = ListByID.map({ rowList =>
val list = rowList._2
val ID = rowList._1
val SD = list.head.getAs[String]("SD")
val ANOTEHR_ID_CNT = list.map{ row=> row.getAs[String]("ANOTHER_ID")}.distinct.length
Row(
ID, ID, list.length,
list.count { row => row.getAs[Int]("FLAGA")==1 },
list.count { row => row.getAs[Int]("FLAGB")==1 },
SD, ANOTEHR_ID_CNT)
})

Following part:
allData.map{...}.reduceByKey{ (a: List[Row], b: List[Row]) => List(a, b).flatten }
is just a significantly more expensive implementation of groupByKey. It not only puts more pressure on GC by applying map-side aggregations but may also create huge number of temporary objects. If single group doesn't fit into memory then out-of-memory error is inevitable.
Next you group data and drag all the fields when all you do later is counting. It could be easily handled with simple aggregation.
Reduce by ID and ANOTHER_ID counting FLAGA=1, FLAGB=1 and keeping single SD
Reduce 1. by ID, sum FLAGA=1, FLAGB=1, 1 (distinct ANOTHER_ID), keep arbitrary SD.
Finally if you start with DataFrame why move data to less efficient format at all? With pseudocode:
df.groupBy("ID").agg(
count($"*"),
count(when($"FLAGA" === 1, 1)),
count(when($"FLAGB" === 1, 1))
countDistinct("ANOTHER_ID"),
first("SD")
)

Join two RDD in spark

I have two rdd one rdd have just one column other have two columns to join the two RDD on key's I have add dummy value which is 0 , is there any other efficient way of doing this using join ?
val lines = sc.textFile("ml-100k/u.data")
val movienamesfile = sc.textFile("Cml-100k/u.item")
val moviesid = lines.map(x => x.split("\t")).map(x => (x(1),0))
val test = moviesid.map(x => x._1)
val movienames = movienamesfile.map(x => x.split("\\|")).map(x => (x(0),x(1)))
val shit = movienames.join(moviesid).distinct()
Edit:
Let me convert this question in SQL. Say for example I have table1 (moveid) and table2 (movieid,moviename). In SQL we write something like:
select moviename, movieid, count(1)
from table2 inner join table table1 on table1.movieid=table2.moveid
group by ....
here in SQL table1 has only one column where as table2 has two columns still the join works, same way in Spark can join on keys from both the RDD's.

Join operation is defined only on PairwiseRDDs which are quite different from a relation / table in SQL. Each element of PairwiseRDD is a Tuple2 where the first element is the key and the second is value. Both can contain complex objects as long as key provides a meaningful hashCode
If you want to think about this in a SQL-ish you can consider key as everything that goes to ON clause and value contains selected columns.
SELECT table1.value, table2.value
FROM table1 JOIN table2 ON table1.key = table2.key
While these approaches look similar at first glance and you can express one using another there is one fundamental difference. When you look at the SQL table and you ignore constraints all columns belong in the same class of objects, while key and value in the PairwiseRDD have a clear meaning.
Going back to your problem to use join you need both key and value. Arguably much cleaner than using 0 as a placeholder would be to use null singleton but there is really no way around it.
For small data you can use filter in a similar way to broadcast join:
val moviesidBD = sc.broadcast(
lines.map(x => x.split("\t")).map(_.head).collect.toSet)
movienames.filter{case (id, _) => moviesidBD.value contains id}
but if you really want SQL-ish joins then you should simply use SparkSQL.
val movieIdsDf = lines
.map(x => x.split("\t"))
.map(a => Tuple1(a.head))
.toDF("id")
val movienamesDf = movienames.toDF("id", "name")
// Add optional join type qualifier
movienamesDf.join(movieIdsDf, movieIdsDf("id") <=> movienamesDf("id"))

On RDD Join operation is only defined for PairwiseRDDs, So need to change the value to pairedRDD. Below is a sample
val rdd1=sc.textFile("/data-001/part/")
val rdd_1=rdd1.map(x=>x.split('|')).map(x=>(x(0),x(1)))
val rdd2=sc.textFile("/data-001/partsupp/")
val rdd_2=rdd2.map(x=>x.split('|')).map(x=>(x(0),x(1)))
rdd_1.join(rdd_2).take(2).foreach(println)

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

search rdd for value from another rdd - scala

You have to broadcast one rdd by collecting it. You can broadcast the smaller rdd to improve performance. val bcastRdd = sc.broadcast(rdd2.collect) rdd1.map( case (id, customer) => customer.foreach(c => bcastRdd.value.filter(_ == c._1).count()!=0))

Related

How do you use the salted key technique to reduce groupBy skew in spark for RDDs

Spark Enhance Join between Terabytes of Datasets

Calculating a variable inside RDD after full outer join in Scala

Why external sorter works in spark

Join two RDD in spark

Categories

Resources