spark scala cartesian product of each element in a column - scala

I have a dataframe like this:
df:
col1  col2
a     [p1, p2, p3]
b     [p1, p4]
The desired output is:
df_out:
col1  col2  col3
p1    p2    a
p1    p3    a
p2    p3    a
p1    p4    b
I did some research and I think converting the dataframe to an RDD and then using flatMap with a cartesian product is the way to go, but I could not combine them.
Thanks,

It looks like you are trying to do combinations rather than a cartesian product. Please check my understanding.
This is in PySpark, but the only Python-specific part is the UDF; the rest is just DataFrame operations.
The process is:
Create the dataframe
Define a UDF that returns all pairs of combinations, ignoring order
Use the UDF to convert the array into an array of structs, one struct per pair
Explode the result to get one row per pair of structs
Select each struct field and the original column 1 into the desired result columns
from itertools import combinations
from pyspark.sql import functions as F

df = spark.createDataFrame([
        ("a", ["p1", "p2", "p3"]),
        ("b", ["p1", "p4"])
    ],
    ["col1", "col2"]
)

# define a udf that takes an array and returns an array of structs of two strings
@F.udf("array<struct<_1: string, _2: string>>")
def combinations_list(x):
    return list(combinations(x, 2))

resultDf = df.select("col1", F.explode(combinations_list(df.col2)).alias("combos"))
resultDf.selectExpr("combos._1 as col1", "combos._2 as col2", "col1 as col3").show()
Result:
+----+----+----+
|col1|col2|col3|
+----+----+----+
|  p1|  p2|   a|
|  p1|  p3|   a|
|  p2|  p3|   a|
|  p1|  p4|   b|
+----+----+----+
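Since the question is tagged Scala, here is a rough Scala sketch of the same idea; Scala's built-in combinations(2) stands in for itertools.combinations. This assumes Spark 2.x with spark.implicits._ in scope:

import org.apache.spark.sql.functions.{col, explode, udf}

val df = Seq(
  ("a", Seq("p1", "p2", "p3")),
  ("b", Seq("p1", "p4"))
).toDF("col1", "col2")

// UDF returning all unordered pairs from the array as (String, String) tuples,
// which Spark encodes as array<struct<_1: string, _2: string>>
val combinationsUdf = udf((xs: Seq[String]) =>
  xs.combinations(2).map { case Seq(a, b) => (a, b) }.toSeq
)

df.select(col("col1"), explode(combinationsUdf(col("col2"))).as("combos"))
  .selectExpr("combos._1 as col1", "combos._2 as col2", "col1 as col3")
  .show()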

Related

pyspark SparseVectors dataframe columns .dot product or any other vectors type column computation using #udf or #pandas_udf

I am trying to compute the .dot product between 2 columns of a given dataframe.
SparseVectors already have this ability in Spark, so I am trying to do it in an easy and scalable way without converting to RDDs or to DenseVectors, but I'm stuck. I have spent the past 3 days trying to find an approach and keep failing: no computation is returned for the 2 vector columns passed from the dataframe. I'm looking for guidance on this, please, because I'm missing something here and I'm not sure what the root cause is.
The approach below works for separate vectors and RDD vectors, but fails when passing dataframe column vectors. To replicate the flow and the issue, please see below. Ideally this computation should happen in parallel, since the real data has billions or more rows (dataframe observations):
from pyspark.ml.linalg import Vectors, SparseVector
from pyspark.sql import Row
df = spark.createDataFrame(
[
[["a","b","c"], SparseVector(4527, {0:0.6363067860791387, 1:1.0888040725098247, 31:4.371858972705023}),SparseVector(4527, {0:0.6363067860791387, 1:2.0888040725098247, 31:4.371858972705023})],
[["d"], SparseVector(4527, {8: 2.729945780576634}), SparseVector(4527, {8: 4.729945780576634})],
], ["word", "i", "j"])
# # dataframe content
df.show()
+---------+--------------------+--------------------+
|     word|                   i|                   j|
+---------+--------------------+--------------------+
|[a, b, c]|(4527,[0,1,31],[0...|(4527,[0,1,31],[0...|
|      [d]|(4527,[8],[2.7299...|(4527,[8],[4.7299...|
+---------+--------------------+--------------------+
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, FloatType

@udf(returnType=ArrayType(FloatType()))
def sim_cos(v1, v2):
    if v1 is not None and v2 is not None:
        return float(v1.dot(v2))
# # calling udf
df = df.withColumn("dotP", sim_cos(df.i, df.j))
# # output after udf
df.show()
+---------+--------------------+--------------------+----------+
|     word|                   i|                   j|      dotP|
+---------+--------------------+--------------------+----------+
|[a, b, c]|(4527,[0,1,31],[0...|(4527,[0,1,31],[0...|      null|
|      [d]|(4527,[8],[2.7299...|(4527,[8],[4.7299...|      null|
+---------+--------------------+--------------------+----------+
Rewriting the udf as a lambda with a scalar FloatType return type does work on Spark 2.4.5: the decorated version above declares returnType=ArrayType(FloatType()) while returning a plain float, which is why it produces null. Posting in case anyone is interested in this approach for PySpark dataframes:
from pyspark.sql import functions as F
from pyspark.sql.functions import col
from pyspark.sql.types import FloatType

# # rewrite udf as lambda function:
sim_cos = F.udf(lambda x, y: float(x.dot(y)), FloatType())
# # executing udf on dataframe
df = df.withColumn("similarity", sim_cos(col("i"), col("j")))
# # end result
df.show()
+---------+--------------------+--------------------+----------+
|     word|                   i|                   j|similarity|
+---------+--------------------+--------------------+----------+
|[a, b, c]|(4527,[0,1,31],[0...|(4527,[0,1,31],[0...| 21.792336|
|      [d]|(4527,[8],[2.7299...|(4527,[8],[4.7299...| 12.912496|
+---------+--------------------+--------------------+----------+
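For anyone doing the same thing from Scala rather than PySpark, a minimal sketch of an equivalent UDF, assuming org.apache.spark.ml.linalg vectors and a dataframe df with vector columns i and j:

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// Dot product of two ml Vector columns; returning an Option makes the result column
// nullable, so rows where either vector is null yield null instead of throwing
val dotUdf = udf((v1: Vector, v2: Vector) =>
  if (v1 != null && v2 != null) Some(v1.dot(v2)) else None
)

val withDot = df.withColumn("similarity", dotUdf(col("i"), col("j")))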

Dynamic dataframe with n columns and m rows

I am reading data from JSON (dynamic schema) and loading it into a dataframe.
Example Dataframe:
scala> import spark.implicits._
import spark.implicits._
scala> val DF = Seq(
(1, "ABC"),
(2, "DEF"),
(3, "GHIJ")
).toDF("id", "word")
DF: org.apache.spark.sql.DataFrame = [id: int, word: string]
scala> DF.show
+------+-----+
|    id| word|
+------+-----+
|     1|  ABC|
|     2|  DEF|
|     3| GHIJ|
+------+-----+
Requirement:
Column count and names can be anything. I want to read the rows in a loop and fetch each column one by one, then process those values in subsequent flows. I need both the column name and the value. I'm using Scala.
Python:
for i, j in df.iterrows():
    print(i, j)
I need the same functionality in Scala, where the column name and value are fetched separately.
Kindly help.
df.iterrows is not from PySpark but from pandas. In Spark, you can use foreach:
import org.apache.spark.sql.Row

DF.foreach { _ match { case Row(id: Int, word: String) => println(id, word) } }
Result:
(2,DEF)
(3,GHIJ)
(1,ABC)
If you don't know the number of columns, you cannot use unapply on Row; then just do:
DF
.foreach(row => println(row))
Result:
[1,ABC]
[2,DEF]
[3,GHIJ]
And operate on each row using its methods, getAs etc.
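If you need both the column name and the value for each row regardless of the schema (the iterrows-style requirement), a minimal sketch along the same lines, assuming the result is small enough to collect to the driver:

// Pair every value with its column name, whatever the schema happens to be.
// collect() pulls the rows to the driver, so only do this for small results;
// otherwise keep the same zipping logic inside foreach/foreachPartition.
val names = DF.columns
DF.collect().foreach { row =>
  names.zip(row.toSeq).foreach { case (name, value) =>
    println(s"$name = $value")
  }
}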

Scala filter out rows where any column2 matches column1

Hi Stackoverflow,
I want to remove all rows in a dataframe where column A matches any of the distinct values in column B. I would expect the code block below to do exactly that, but it seems to also remove rows where column B is null, which is odd since the filter should only consider column A. How can I fix this code so that it removes all rows where column A matches any of the distinct values in column B?
import spark.implicits._
val df = Seq(
  (scala.math.BigDecimal(1), null),
  (scala.math.BigDecimal(2), scala.math.BigDecimal(1)),
  (scala.math.BigDecimal(3), scala.math.BigDecimal(4)),
  (scala.math.BigDecimal(4), null),
  (scala.math.BigDecimal(5), null),
  (scala.math.BigDecimal(6), null)
).toDF("A", "B")
// correct, has 1, 4
val to_remove = df
  .filter(df.col("B").isNotNull)
  .select(df("B"))
  .distinct()
// incorrect, returns 2, 3 instead of 2, 3, 5, 6
val `final` = df.filter(!df.col("A").isin(to_remove.col("B")))
// 4 != 2
assert(4 === `final`.collect().length)
The isin function accepts a varargs list of values. However, in your code you're passing a Dataset[Row]. As per the documentation https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.sql.Column#isin%28scala.collection.Seq%29
it's declared as
def isin(list: Any*): Column
You first need to collect the values into a sequence and then pass that to isin. Please note that this may have performance implications.
scala> val to_remove = df.filter(df.col("B").isNotNull).select(df("B")).distinct().collect.map(_.getDecimal(0))
to_remove: Array[java.math.BigDecimal] = Array(1.000000000000000000, 4.000000000000000000)
scala> val finaldf = df.filter(!df.col("A").isin(to_remove:_*))
finaldf: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [A: decimal(38,18), B: decimal(38,18)]
scala> finaldf.show
+--------------------+--------------------+
|                   A|                   B|
+--------------------+--------------------+
|2.000000000000000000|1.000000000000000000|
|3.000000000000000000|4.000000000000000000|
|5.000000000000000000|                null|
|6.000000000000000000|                null|
+--------------------+--------------------+
Change the filter condition !df.col("A").isin(to_remove.col("B")) to !df.col("A").isin(to_remove.collect.map(_.getDecimal(0)):_*)
Check the code below.
val finaldf = df
  .filter(!df.col("A").isin(to_remove.map(_.getDecimal(0)).collect: _*))
scala> finaldf.show
+--------------------+--------------------+
|                   A|                   B|
+--------------------+--------------------+
|2.000000000000000000|1.000000000000000000|
|3.000000000000000000|4.000000000000000000|
|5.000000000000000000|                null|
|6.000000000000000000|                null|
+--------------------+--------------------+
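Since collecting the distinct values of B to the driver can be costly on large data (as noted above), a minimal alternative sketch using a left_anti join, which keeps the whole filter distributed; the alias B_val is just a name introduced here to avoid self-join ambiguity:

import org.apache.spark.sql.functions.col

// distinct, non-null values of B, renamed to avoid ambiguity in the self-join
val toRemove = df.select(col("B").as("B_val")).where(col("B_val").isNotNull).distinct()

// left_anti keeps only the rows of df whose A has no match in toRemove
val finaldf = df.join(toRemove, df("A") === toRemove("B_val"), "left_anti")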

How can I build a CoordinateMatrix in Spark using a DataFrame?

I am trying to use the Spark implementation of the ALS algorithm for recommendation systems, so I built the DataFrame depicted below, as training data:
|--------------|--------------|--------------|
|    userId    |    itemId    |    rating    |
|--------------|--------------|--------------|
Now, I would like to create a sparse matrix, to represent the interactions between every user and every item. The matrix will be sparse because if there is no interaction between a user and an item, the corresponding value in the matrix will be zero. Thus, in the end, most values will be zero.
But how can I achieve this, using a CoordinateMatrix? I'm saying CoordinateMatrix because I'm using Spark 2.1.1, with python, and in the documentation, I saw that a CoordinateMatrix should be used only when both dimensions of the matrix are huge and the matrix is very sparse.
In other words, how can I get from this DataFrame to a CoordinateMatrix, where the rows would be users, the columns would be items and the ratings would be the values in the matrix?
A CoordinateMatrix is just a wrapper around an RDD of MatrixEntry objects, and a MatrixEntry is just a wrapper around a (long, long, float) tuple. PySpark lets you create a CoordinateMatrix from an RDD of such tuples. If the userId and itemId fields are both IntegerType and the rating is something like a FloatType, then creating the desired matrix is very straightforward.
from pyspark.mllib.linalg.distributed import CoordinateMatrix
cmat=CoordinateMatrix(df.rdd.map(tuple))
It is only slightly more complicated if you have StringTypes for the userId and itemId fields. You would need to index those strings first and then pass the indices to the CoordinateMatrix.
With Spark 2.4.0, here is a whole example that I hope meets your need.
Create the dataframe using a dictionary and pandas:
my_dict = {
    'userId': [1, 2, 3, 4, 5, 6],
    'itemId': [101, 102, 103, 104, 105, 106],
    'rating': [5.7, 8.8, 7.9, 9.1, 6.6, 8.3]
}
import pandas as pd
pd_df = pd.DataFrame(my_dict)
df = spark.createDataFrame(pd_df)
See the dataframe:
df.show()
+------+------+------+
|userId|itemId|rating|
+------+------+------+
|     1|   101|   5.7|
|     2|   102|   8.8|
|     3|   103|   7.9|
|     4|   104|   9.1|
|     5|   105|   6.6|
|     6|   106|   8.3|
+------+------+------+
Create CoordinateMatrix from dataframe:
from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry
coorRDD = df.rdd.map(lambda x: MatrixEntry(x[0], x[1], x[2]))
coorMatrix = CoordinateMatrix(coorRDD)
Now see the data type of result:
type(coorMatrix)
pyspark.mllib.linalg.distributed.CoordinateMatrix

Group By two different keys in two different DataFrames using Spark Scala without join

I am new to Spark Scala and I want to compute a similarity variable using two dataframes or RDDs. I don't have a common key between the two of them; I did a cartesian join, but the joined DF is huge. Is it possible to compute a new variable from both DFs without joining them?
eg:
df1.show
+----+------------+------+
| id1|        food| level|
+----+------------+------+
|id11|       pasta| first|
|id11|       pizza|second|
|id11|   ice cream| first|
|id12|     spanish| first|
|id12|   ice cream|second|
|id13|      fruits| first|
+----+------------+------+
df2.show
+----+---------+
| id2|     food|
+----+---------+
|id21|    pizza|
|id21|   fruits|
|id22|    pasta|
|id22|    pizza|
|id22|ice cream|
+----+---------+
For each id1 from df1, I want to loop over the food values from df2 grouped by id2.
I want to get this output:
+----+----+----------------+
| id1| id2|count_similarity|
+----+----+----------------+
|id11|id21|               1| id11 and id21 have only "pizza" in common
|id11|id22|               3|
|id12|id21|               0|
|id12|id22|               1|
|id13|id21|               1|
|id13|id22|               0|
+----+----+----------------+
Is it possible to compute this using a map statement on RDDs?
Thank you
You can convert both data frames to RDDs, use the cartesian method to calculate the similarity for each id pair, and then reconstruct the data frame:
case class Similarity(id1: String, id2: String, count_similarity: Int)

val rdd1 = df1.rdd.groupBy(_.getString(0)).mapValues(_.map(_.getString(1)).toList)
val rdd2 = df2.rdd.groupBy(_.getString(0)).mapValues(_.map(_.getString(1)).toList)

rdd1.cartesian(rdd2).map {
  case (x, y) => Similarity(x._1, y._1, x._2.intersect(y._2).size)
}.toDF.orderBy("id1").show
+----+----+----------------+
| id1| id2|count_similarity|
+----+----+----------------+
|id11|id22|               3|
|id11|id21|               1|
|id12|id21|               0|
|id12|id22|               1|
|id13|id21|               1|
|id13|id22|               0|
+----+----+----------------+
Would this work for you?
df1.registerTempTable("temp_table_1")
df2.registerTempTable("temp_table_2")
spark.sql(
"""SELECT id1, id2, count(*) AS count_similarity FROM temp_table_1 AS t1
| JOIN temp_table_2 AS t2 ON (t1.food = t2.food)
| GROUP BY id1, id2
| ORDER BY id1, id2""".stripMargin
).show
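Note that the inner join in the SQL above only returns id pairs that share at least one food, so the zero-count rows from the expected output would be missing. Since the original goal was to avoid a huge row-level cartesian join, here is another hedged, DataFrame-only sketch: aggregate the foods per id first and cross join only the aggregated (one row per id) frames, counting the overlap with array_intersect. This assumes Spark 2.4+:

import org.apache.spark.sql.functions.{array_intersect, col, collect_set, size}

// one row per id, with the set of foods as an array column
val foods1 = df1.groupBy("id1").agg(collect_set("food").as("foods1"))
val foods2 = df2.groupBy("id2").agg(collect_set("food").as("foods2"))

// cross join the small aggregated frames and count the shared foods per pair
foods1.crossJoin(foods2)
  .select(col("id1"), col("id2"),
    size(array_intersect(col("foods1"), col("foods2"))).as("count_similarity"))
  .orderBy("id1", "id2")
  .show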