I have a PySpark dataframe like this:
+--------+------+
|numbers1|words1|
+--------+------+
| 1| word1|
| 1| word2|
| 1| word3|
| 2| word4|
| 2| word5|
| 3| word6|
| 3| word7|
| 3| word8|
| 3| word9|
+--------+------+
I want to produce another dataframe that contains all pairs of words within each group. So the result for the above would be:
ID wordA wordB
1 word1 word2
1 word1 word3
1 word2 word3
2 word4 word5
3 word6 word7
3 word6 word8
3 word6 word9
3 word7 word8
3 word7 word9
3 word8 word9
I know I can do this with Python/pandas using the following code:
import pandas as pd
from itertools import combinations
ndf = (df.groupby('ID')['words'].apply(lambda x: list(combinations(x.values, 2)))
         .apply(pd.Series).stack().reset_index(level=0, name='words'))
But now I need to implement this with just the Spark APIs, without the itertools library. How can I rewrite this without combinations, using the DataFrame or RDD API?
Here is my attempt with the DataFrame API.
import pyspark.sql.functions as f
df.join(df.withColumnRenamed('words1', 'words2'), ['numbers1'], 'outer') \
.filter('words1 < words2').show(10, False)
+--------+------+------+
|numbers1|words1|words2|
+--------+------+------+
|1 |word1 |word3 |
|1 |word1 |word2 |
|1 |word2 |word3 |
|2 |word4 |word5 |
|3 |word6 |word9 |
|3 |word6 |word8 |
|3 |word6 |word7 |
|3 |word7 |word9 |
|3 |word7 |word8 |
|3 |word8 |word9 |
+--------+------+------+
Here is a solution using combinations in a UDF. It uses the same logic as the Pandas code you showed.
from itertools import combinations
from pyspark.sql import types as T, functions as F

df_agg = df.groupBy("numbers1").agg(F.collect_list("words1").alias("words_list"))

@F.udf(
    T.ArrayType(
        T.StructType(
            [
                T.StructField("wordA", T.StringType(), True),
                T.StructField("wordB", T.StringType(), True),
            ]
        )
    )
)
def combi(words_list):
    # Return every 2-element combination of the collected words as (wordA, wordB) structs.
    return list(combinations(words_list, 2))

df_agg = df_agg.withColumn("combinations", combi(F.col("words_list")))

new_df = df_agg.withColumn("combination", F.explode("combinations")).select(
    "numbers1",
    F.col("combination.wordA").alias("wordA"),
    F.col("combination.wordB").alias("wordB"),
)
new_df.show()
+--------+------+------+
|numbers1| wordA| wordB|
+--------+------+------+
| 1| word1| word2|
| 1| word1| word3|
| 1| word2| word3|
| 3| word6| word7|
| 3| word6| word8|
| 3| word6| word9|
| 3| word7| word8|
| 3| word7| word9|
| 3| word8| word9|
| 2| word4| word5|
+--------+------+------+
I have a PySpark dataframe-
df1 = spark.createDataFrame([
("u1", 10),
("u1", 20),
("u2", 10),
("u2", 10),
("u2", 30),
],
['user_id', 'var1'])
df1.printSchema()
df1.show(truncate=False)
It looks like-
root
|-- user_id: string (nullable = true)
|-- var1: long (nullable = true)
+-------+----+
|user_id|var1|
+-------+----+
|u1 |10 |
|u1 |20 |
|u2 |10 |
|u2 |10 |
|u2 |30 |
+-------+----+
I want to assign a row index that restarts for each user_id group (sorted in ascending order), with rows within each group ordered by var1 in ascending order.
The desired output should look like-
+-------+----+-----+
|user_id|var1|order|
+-------+----+-----+
|u1 |10 | 1|
|u1 |20 | 2|
|u2 |10 | 1|
|u2 |10 | 2|
|u2 |30 | 3|
+-------+----+-----+
How do I achieve this?
It's just a row number operation:
from pyspark.sql import functions as F, Window
df2 = df1.withColumn(
'order',
F.row_number().over(Window.partitionBy('user_id').orderBy('var1'))
)
df2.show()
+-------+----+-----+
|user_id|var1|order|
+-------+----+-----+
| u1| 10| 1|
| u1| 20| 2|
| u2| 10| 1|
| u2| 10| 2|
| u2| 30| 3|
+-------+----+-----+
I am specifically looking for a flatMap solution to the problem of mocking data columns in a Spark Scala DataFrame, using a data-duplication technique such as a 1-to-many mapping inside flatMap.
My input data is something like this:
+---+----+-----+
|id |name|marks|
+---+----+-----+
|1 |ABCD|12 |
|2 |CDEF|12 |
|3 |FGHI|14 |
+---+----+-----+
and my expected output after doing a 1-to-3 mapping of the id column would be something like this:
+---+----+-----+
|id |name|marks|
+---+----+-----+
|1 |ABCD|12 |
|2 |CDEF|12 |
|3 |FGHI|14 |
|2 |null|null |
|3 |null|null |
|1 |null|null |
|2 |null|null |
|1 |null|null |
|3 |null|null |
+---+----+-----+
Please feel free to let me know if any clarification is needed on the requirements.
Thanks in advance!
I see that you are attempting to generate data with a requirement of re-using values in the ID column.
You can just select the ID column and generate random values and do a union back to your original dataset.
For example:
val data = Seq((1,"asd",15), (2,"asd",20), (3,"test",99)).toDF("id","testName","marks")
+---+--------+-----+
| id|testName|marks|
+---+--------+-----+
| 1| asd| 15|
| 2| asd| 20|
| 3| test| 99|
+---+--------+-----+
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
val newRecords = data.select("id").withColumn("testName", concat(lit("name_"), (rand() * 10).cast(IntegerType).cast(StringType)))
  .withColumn("marks", (rand() * 100).cast(IntegerType))
+---+--------+-----+
| id|testName|marks|
+---+--------+-----+
| 1| name_2| 35|
| 2| name_9| 20|
| 3| name_3| 7|
+---+--------+-----+
val result = data.union(newRecords)  // unionAll is deprecated; union keeps duplicates and behaves the same here
+---+--------+-----+
| id|testName|marks|
+---+--------+-----+
| 1| asd| 15|
| 2| asd| 20|
| 3| test| 99|
| 1| name_2| 35|
| 2| name_9| 20|
| 3| name_3| 7|
+---+--------+-----+
You can run the randomisation portion of the code in a loop and union all of the generated dataframes, as sketched below.
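For example, here is a minimal sketch of that loop (the batch count of 3 and the foldLeft are just one way to write it; it reuses the data DataFrame and the same random-column logic as above):

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Sketch: generate 3 randomised batches from the id column and union them onto the original data.
val result = (1 to 3).foldLeft(data) { (acc, _) =>
  val batch = data.select("id")
    .withColumn("testName", concat(lit("name_"), (rand() * 10).cast(IntegerType).cast(StringType)))
    .withColumn("marks", (rand() * 100).cast(IntegerType))
  acc.union(batch)
}

And since the question specifically asks for a flatMap-based 1-to-many mapping, here is a rough sketch of that variant as well, producing the null-padded duplicates shown in the expected output. It assumes the question's df has columns id: Int, name: String, marks: Int; the tuple/Option encoding is just one way to model the nulls:

import spark.implicits._  // in spark-shell; otherwise use your SparkSession's implicits

// Sketch: each input row yields itself plus 2 duplicates of its id with null name/marks (a 1-to-3 mapping).
val duplicated = df.as[(Int, String, Int)]
  .flatMap { case (id, name, marks) =>
    (id, Option(name), Option(marks)) +: Seq.fill(2)((id, None: Option[String], None: Option[Int]))
  }
  .toDF("id", "name", "marks")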
I am trying to find the unique rows (based on id) that have the maximum length value in a Spark dataframe. Each column is of string type.
The dataframe is like:
+-----+---+----+---+---+
|id | A | B | C | D |
+-----+---+----+---+---+
|1 |toto|tata|titi| |
|1 |toto|tata|titi|tutu|
|2 |bla |blo | | |
|3 |b | c | | d |
|3 |b | c | a | d |
+-----+---+----+---+---+
The expectation is:
+-----+---+----+---+---+
|id | A | B | C | D |
+-----+---+----+---+---+
|1 |toto|tata|titi|tutu|
|2 |bla |blo | | |
|3 |b | c | a | d |
+-----+---+----+---+---+
I can't figure out how to do this easily in Spark...
Thanks in advance
Note: this approach handles any addition or deletion of columns in the DataFrame, without the need for code changes.
It works by first finding the length of all columns (except the first) concatenated together, and then keeping only the row(s) with the maximum length within each id.
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
val output = input.withColumn("rowLength", length(concat(input.columns.toList.drop(1).map(col): _*)))
.withColumn("maxLength", max($"rowLength").over(Window.partitionBy($"id")))
.filter($"rowLength" === $"maxLength")
.drop("rowLength", "maxLength")
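Note that if two rows within the same id group tie for the maximum length, the filter above keeps both of them. If you want exactly one row per id, here is a minimal sketch of a variant using row_number instead of max (same idea; the rn column name is just illustrative):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Sketch: rank rows per id by descending concatenated length and keep only the top one.
val win = Window.partitionBy($"id").orderBy($"rowLength".desc)

val onePerId = input
  .withColumn("rowLength", length(concat(input.columns.toList.drop(1).map(col): _*)))
  .withColumn("rn", row_number().over(win))
  .filter($"rn" === 1)
  .drop("rowLength", "rn")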
An alternative is to collapse each id group directly with collect_set and concat_ws (this works here because, within a group, the shorter rows only contain empty strings or values that already appear in the longest row):
scala> df.show
+---+----+----+----+----+
| id| A| B| C| D|
+---+----+----+----+----+
| 1|toto|tata|titi| |
| 1|toto|tata|titi|tutu|
| 2| bla| blo| | |
| 3| b| c| | d|
| 3| b| c| a| d|
+---+----+----+----+----+
scala> df.groupBy("id").agg(
     |   concat_ws("", collect_set(col("A"))).alias("A"), concat_ws("", collect_set(col("B"))).alias("B"),
     |   concat_ws("", collect_set(col("C"))).alias("C"), concat_ws("", collect_set(col("D"))).alias("D")
     | ).show
+---+----+----+----+----+
| id| A| B| C| D|
+---+----+----+----+----+
| 1|toto|tata|titi|tutu|
| 2| bla| blo| | |
| 3| b| c| a| d|
+---+----+----+----+----+
In my Scala program, I am dealing with the problem of combining results from multiple levels of groupBy.
The dataset that I am using is quite big. As a small sample, I have a dataframe that looks like this:
+---+---+----+-----+-----+
| F| L| Loy|Email|State|
+---+---+----+-----+-----+
| f1| l1|loy1| null| s1|
| f1| l1|loy1| e1| s1|
| f2| l2|loy2| e2| s2|
| f2| l2|loy2| e3| null|
| f1| l1|null| e1| s3|
+---+---+----+-----+-----+
For the first level groupBy I use the following script, to obtain the result based on the same (F, L, Loy) columns:
df.groupBy("F", "L", "Loy").agg(collect_set($"Email").alias("Email"), collect_set($"State").alias("State")).show
The result is like this:
+---+---+----+--------+-----+
| F| L| Loy| Email|State|
+---+---+----+--------+-----+
| f1| l1|null| [e1]| [s3]|
| f2| l2|loy2|[e2, e3]| [s2]|
| f1| l1|loy1| [e1]| [s1]|
+---+---+----+--------+-----+
The problem I am dealing with is how to perform the second-level groupBy, which groups on (F, L, Email), where F and L are Strings and the Email column is an Array[String]. This groupBy should return a result as follows:
+---+---+----+--------+---------+
| F| L| Loy| Email| State|
+---+---+----+--------+---------+
| f1| l1|loy1| [e1]| [s3, s1]|
| f2| l2|loy2|[e2, e3]| [s2]|
+---+---+----+--------+---------+
The main goal is to reduce the number of entries as much as possible by applying groupBy at different levels. I am quite new to Scala and any help would be appreciated :)
Just use concat_ws() with an empty separator, which turns each single-element state array back into a plain string, and then collect_set gives you an array of states again. Check this out.
scala> val df = Seq( ("f1","l1","loy1",null,"s1"),("f1","l1","loy1","e1","s1"),("f2","l2","loy2","e2","s2"),("f2","l2","loy2","e3",null),("f1","l1",null,"e1","s3")).toDF("F","L","loy","email","state")
df: org.apache.spark.sql.DataFrame = [F: string, L: string ... 3 more fields]
scala> df.show(false)
+---+---+----+-----+-----+
|F |L |loy |email|state|
+---+---+----+-----+-----+
|f1 |l1 |loy1|null |s1 |
|f1 |l1 |loy1|e1 |s1 |
|f2 |l2 |loy2|e2 |s2 |
|f2 |l2 |loy2|e3 |null |
|f1 |l1 |null|e1 |s3 |
+---+---+----+-----+-----+
scala> val df2 = df.groupBy("F", "L", "Loy").agg(collect_set($"Email").alias("Email"), collect_set($"State").alias("State"))
df2: org.apache.spark.sql.DataFrame = [F: string, L: string ... 3 more fields]
scala> df2.show(false)
+---+---+----+--------+-----+
|F |L |Loy |Email |State|
+---+---+----+--------+-----+
|f1 |l1 |null|[e1] |[s3] |
|f2 |l2 |loy2|[e2, e3]|[s2] |
|f1 |l1 |loy1|[e1] |[s1] |
+---+---+----+--------+-----+
scala> df2.groupBy("F","L","email").agg(max('loy).as("loy"),collect_set(concat_ws("",'state)).as("state")).show
+---+---+--------+----+--------+
| F| L| email| loy| state|
+---+---+--------+----+--------+
| f2| l2|[e2, e3]|loy2| [s2]|
| f1| l1| [e1]|loy1|[s3, s1]|
+---+---+--------+----+--------+
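As a side note on the stated goal of reducing entries: for the sample shown, a single groupBy on (F, L) produces the same final result in one pass. Whether that matches your real multi-level de-duplication rules is an assumption, so treat this as a sketch:

// Sketch: one-pass alternative for this sample (max ignores the null loy, collect_set drops null emails/states).
df.groupBy("F", "L")
  .agg(
    max($"loy").as("Loy"),
    collect_set($"email").as("Email"),
    collect_set($"state").as("State")
  )
  .show(false)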
I have a DataFrame like the one below.
+---+---+-----+
|uId| Id| sum |
+---+---+-----+
| 3| 1| 1.0|
| 7| 1| 1.0|
| 1| 2| 3.0|
| 1| 1| 1.0|
| 6| 5| 1.0|
+---+---+-----+
Using the above DataFrame, I want to generate the new DataFrame mentioned below.
The sum column should be computed as follows:
For example:
For uid=3 and id=1, my sum column value should be (old sum value * 1 / count of id 1), i.e.
1.0*1/3 = 0.333
For uid=7 and id=1, my sum column value should be (old sum value * 1 / count of id 1), i.e.
1.0*1/3 = 0.333
For uid=1 and id=2, my sum column value should be (old sum value * 1 / count of id 2), i.e.
3.0*1/1 = 3.0
For uid=6 and id=5, my sum column value should be (old sum value * 1 / count of id 5), i.e.
1.0*1/1 = 1.0
My final output should be:
+---+---+---------+
|uId| Id| sum |
+---+---+---------+
| 3| 1| 0.33333|
| 7| 1| 0.33333|
| 1| 2| 3.0 |
| 1| 1| 0.3333 |
| 6| 5| 1.0 |
+---+---+---------+
You can use a Window function to get the count of each id group and then divide the original sum by that count:
import org.apache.spark.sql.expressions.Window
val windowSpec = Window.partitionBy("id")
import org.apache.spark.sql.functions._
df.withColumn("sum", $"sum"/count("id").over(windowSpec))
You should get the final dataframe as:
+---+---+------------------+
|uId|Id |sum |
+---+---+------------------+
|3 |1 |0.3333333333333333|
|7 |1 |0.3333333333333333|
|1 |1 |0.3333333333333333|
|6 |5 |1.0 |
|1 |2 |3.0 |
+---+---+------------------+
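As a design note, the same result can also be computed without a window function, by aggregating the per-id counts separately and joining them back. A minimal sketch (the cnt column name is just illustrative):

import org.apache.spark.sql.functions._

// Sketch: compute per-id counts once, then join and divide the original sum.
val counts = df.groupBy("id").agg(count("id").as("cnt"))

val result = df.join(counts, Seq("id"))
  .withColumn("sum", $"sum" / $"cnt")
  .drop("cnt")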