Spark Scala: Distance between elements of RDDs [closed] - scala

I have 2 RDDs with time series, like
rdd1.take(5)
[(1, 25.0)
(2, 50.23)
(3, 65.0)
(4, 7.23)
(5, 12.0)]
and
rdd2.take(5)
[(1, 85.0)
(2, 3.23)
(3, 9.0)
(4, 23.23)
(5, 65.0)]
I would like to find the distance between each element of the first RDD and each element of the second, and get the following result:
result.take(5)
[((1,1): (25.0-85.0)**2),
((1,2): (25.0 - 3.23)**2),
.....
((1,5): (25.0 - 65.0)**2),
.....
((2,1): (50.23 - 85.0)**2),
.....
((5,5): (12.0 - 65.0)**2),
]
The number of elements can range from 10,000 to billions.
Please help me.

@Mohit is right: you are looking for the Cartesian product of your two RDDs; then you map over the pairs and compute your distance.
Here is an example:
val rdd1 = sc.parallelize(List((1, 25.0), (2, 50.23), (3, 65.0), (4, 7.23), (5, 12.0)))
val rdd2 = sc.parallelize(List((1, 85.0), (2, 3.23), (3, 9.0), (4, 23.23), (5, 65.0)))
// Pair every element of rdd1 with every element of rdd2, then compute the squared difference
val result = rdd1.cartesian(rdd2).map {
  case ((a, b), (c, d)) => ((a, c), math.pow(b - d, 2))  // ((key1, key2), squared distance)
}
Now, let's see what it looks like:
result.take(10).foreach(println)
# ((1,1),3600.0)
# ((1,2),473.93289999999996)
# ((1,3),256.0)
# ((1,4),3.1328999999999985)
# ((1,5),1600.0)
# ((2,1),1208.9529000000002)
# ((2,2),2209.0)
# ((2,3),1699.9128999999998)
# ((2,4),728.9999999999998)
# ((2,5),218.1529000000001)

What you are looking for is the Cartesian product. This gives you the pairing of each element of RDD1 with each element of RDD2.
Since you are dealing with billion-size datasets, make sure your infrastructure supports it.
A similar question may help you further.
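To get a feel for whether your cluster can cope before materializing the product, a minimal sketch (reusing rdd1 and rdd2 from the example above) is to count how many pairs cartesian will emit:
// Rough sanity check: cartesian() emits count(rdd1) * count(rdd2) pairs,
// which for two billion-element RDDs is far more than most clusters can store.
val expectedPairs = rdd1.count() * rdd2.count()
println(s"cartesian will produce $expectedPairs pairs")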

Related

I am new to Scala. I used to work in PySpark. How can I convert these lines of code into Scala? [closed]

from sklearn.preprocessing import LabelEncoder

# Split off the target column, keep the remaining columns as features
y_train = train_df['country_destination']
train_df.drop(['country_destination', 'id'], axis=1, inplace=True)
x_train = train_df.values

# Encode the string labels as integers
label_encoder = LabelEncoder()
encoded_y_train = label_encoder.fit_transform(y_train)
In the above-mentioned code, I was trying to encode labels and features.
You can do so using StringIndexer:
import org.apache.spark.ml.feature.StringIndexer

// Sample DataFrame with one categorical column
val df = spark.createDataFrame(
  Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c"))
).toDF("id", "category")

// Map each distinct string label to a numeric index (the Spark analogue of sklearn's LabelEncoder)
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")

val indexed = indexer.fit(df).transform(df)
indexed.show()
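The StringIndexer above handles a single column. If you also need to index other categorical columns (the pandas code drops an id and encodes a target, and real data often has several such fields), one option, assuming Spark 2.x or later, is to chain several StringIndexer stages in a Pipeline. This is only a sketch: the column names and the trainDf DataFrame below are hypothetical, so substitute your own.
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.StringIndexer

// Hypothetical column names -- replace with the ones in your own DataFrame (called trainDf here)
val labelIndexer = new StringIndexer()
  .setInputCol("country_destination")
  .setOutputCol("label")
val genderIndexer = new StringIndexer()
  .setInputCol("gender")
  .setOutputCol("genderIndex")

// A single fit/transform pass applies both indexers in order
val pipeline = new Pipeline().setStages(Array[PipelineStage](labelIndexer, genderIndexer))
// val indexedDf = pipeline.fit(trainDf).transform(trainDf)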

How to flatten the results of an RDD.groupBy() from (key, [values]) into (key, values)?

From an RDD of key-value pairs like
[(1, 3), (2, 4), (2, 6)]
I would like to obtain an RDD of tuples like
[(1, 3), (2, 4, 6)]
where the first element of each tuple is the key in the original RDD, and the next element(s) are all values associated with that key in the original RDD
I have tried this
rdd.groupByKey().mapValues(lambda x:[item for item in x]).collect()
which gives
[(1, [3]), (2, [4, 6])]
but it is not quite what I want. I don't manage to "explode" the list of items in each tuple of the result. I have also tried
rdd.groupByKey().map(lambda x: (x[0], *tuple(x[1]))).collect()
The best I came up with is
rdd.groupByKey().mapValues(lambda x:[a for a in x]).map(lambda x: tuple([x[0]]+x[1])).collect()
Could it be made more compact or efficient?

PySpark: How to calculate avg and count in a single groupBy? [duplicate]

This question already has answers here: Multiple Aggregate operations on the same column of a spark dataframe (6 answers). Closed 4 years ago.
I would like to calculate avg and count in a single groupBy statement in PySpark. How can I do that?
df = spark.createDataFrame(
    [(1, 'John', 1.79, 28, 'M', 'Doctor'),
     (2, 'Steve', 1.78, 45, 'M', None),
     (3, 'Emma', 1.75, None, None, None),
     (4, 'Ashley', 1.6, 33, 'F', 'Analyst'),
     (5, 'Olivia', 1.8, 54, 'F', 'Teacher'),
     (6, 'Hannah', 1.82, None, 'F', None),
     (7, 'William', 1.7, 42, 'M', 'Engineer'),
     (None, None, None, None, None, None),
     (8, 'Ethan', 1.55, 38, 'M', 'Doctor'),
     (9, 'Hannah', 1.65, None, 'F', 'Doctor')],
    ['Id', 'Name', 'Height', 'Age', 'Gender', 'Profession'])

# This only shows avg, but I also need count right next to it. How can I do that?
df.groupBy("Profession").agg({"Age": "avg"}).show()
df.show()
Thank you.
For the same column:
from pyspark.sql import functions as F
df.groupBy("Profession").agg(F.mean('Age'), F.count('Age')).show()
If you're able to use different columns:
df.groupBy("Profession").agg({'Age':'avg', 'Gender':'count'}).show()

How to join two Spark RDDs

I have 2 Spark RDDs: the first one contains a mapping between some indices and ids, which are strings, and the second one contains tuples of related indices.
// Mapping from index to id
val ids = spark.sparkContext.parallelize(Array[(Int, String)](
  (1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e"))).toDF("index", "idx")

// Pairs of related indices
val relationships = spark.sparkContext.parallelize(Array[(Int, Int)](
  (1, 3), (2, 3), (4, 5))).toDF("index1", "index2")
I want to somehow join these RDDs (or merge them, or use SQL, or whatever the best Spark practice is) so that in the end I have the related ids instead.
The result of my combined RDD should return:
("a", "c"), ("b", "c"), ("d", "e")
Any idea how I can achieve this operation in an optimal way without loading either of the RDDs into an in-memory map (because in my scenario these RDDs can potentially hold millions of records)?
You can approach this by creating two views from the DataFrames as follows:
relationships.createOrReplaceTempView("relationships");
ids.createOrReplaceTempView("ids");
Next, run the following SQL query, which performs an inner join between the relationships and ids views to generate the required result:
val result = spark.sql("""
  select t.index1, id.idx as index2
  from (
    select id.idx as index1, rel.index2
    from relationships rel
    inner join ids id on rel.index1 = id.index
  ) t
  inner join ids id on id.index = t.index2
""")
result.show()
Another approach uses the DataFrame API directly, without creating views:
relationships.as("rel").
join(ids.as("ids"), $"ids.index" === $"rel.index1").as("temp").
join(ids.as("ids"), $"temp.index2"===$"ids.index").
select($"temp.idx".as("index1"), $"ids.idx".as("index2")).show

Questions about Spark DataFrame operations [duplicate]

This question already has an answer here: get TopN of all groups after group by using Spark DataFrame (1 answer). Closed 5 years ago.
If I create a DataFrame like this:
val df1 = sc.parallelize(List((1, 1), (1, 1), (1, 1), (1, 2), (1, 2), (1, 3), (2, 1), (2, 2), (2, 2), (2, 3))).toDF("key1", "key2")
Then I group by "key1" and "key2", and count "key2".
val df2 = df1.groupBy("key1","key2").agg(count("key2") as "k").sort(col("k").desc)
My question is: how can I filter this DataFrame and keep only the top 2 values of "k" for each "key1"?
If I don't use window functions, how should I solve this problem?
This can be done using a window function, with row_number() (or rank()/dense_rank(), depending on your requirements):
import org.apache.spark.sql.functions.row_number
import org.apache.spark.sql.expressions.Window
df2
.withColumn("rnb", row_number().over(Window.partitionBy($"key1").orderBy($"k".desc)))
.where($"rnb" <= 2).drop($"rnb")
.show()
EDIT:
Here is a solution using RDDs (which does not require a HiveContext):
import org.apache.spark.sql.Row

df2
  .rdd
  .groupBy(_.getAs[Int]("key1"))
  .flatMap { case (_, rows) =>
    rows.toSeq
      .sortBy(_.getAs[Long]("k")).reverse  // highest counts first
      .take(2)                             // keep the top 2 per key1
      .map { case Row(key1: Int, key2: Int, k: Long) => (key1, key2, k) }
  }
  .toDF("key1", "key2", "k")
  .show()