How can I use frequencies of elements of tuples from another table? - PySpark

I have a somewhat complicated question. I have a table where one column holds pairs of elements (tuples) and the other column holds their frequencies:
Col1 Col2
('A','B') 5
('C','C') 4
('F','D') 8
I also have another table with each individual element and its frequency:
Col3 Col4
'A' 2
'B' 5
'C' 1
'F' 2
'D' 3
I need to build a new column from these frequencies. For each tuple ('A','B') I need the frequency of A, the frequency of B, and the frequency of the tuple itself.
Output:
Col1 new_col
('A','B') 2,5,5
('C','C') 1,1,4
('F','D') 2,3,8

Creation of the data based on your example.
Dataset 1:
b = "Col1 Col2".split()
a = [
    (["A", "B"], 5),
    (["C", "C"], 4),
    (["F", "D"], 8),
]
df1 = spark.createDataFrame(a, b)
df1.show()
+------+----+
|  Col1|Col2|
+------+----+
|[A, B]|   5|
|[C, C]|   4|
|[F, D]|   8|
+------+----+
Dataset 2:
b = "Col3 Col4".split()
a = [
    ["A", 2],
    ["B", 5],
    ["C", 1],
    ["F", 2],
    ["D", 3],
]
df2 = spark.createDataFrame(a, b)
df2.show()
+----+----+
|Col3|Col4|
+----+----+
|   A|   2|
|   B|   5|
|   C|   1|
|   F|   2|
|   D|   3|
+----+----+
Preparation of df1:
df1 = df1.withColumn("value1", df1["col1"].getItem(0)).withColumn(
    "value2", df1["col1"].getItem(1)
)
df1.show()
+------+----+------+------+
|  Col1|Col2|value1|value2|
+------+----+------+------+
|[A, B]|   5|     A|     B|
|[C, C]|   4|     C|     C|
|[F, D]|   8|     F|     D|
+------+----+------+------+
Join of the dataframes:
import pyspark.sql.functions as F

df3 = (
    df1.join(
        df2.alias("value1"), on=F.col("value1") == F.col("value1.col3"), how="left"
    )
    .join(df2.alias("value2"), on=F.col("value2") == F.col("value2.col3"), how="left")
    .select(
        "col1",
        "value1.col4",
        "value2.col4",
        "col2",
    )
)
df3.show()
+------+----+----+----+
|  col1|col4|col4|col2|
+------+----+----+----+
|[A, B]|   2|   5|   5|
|[F, D]|   2|   3|   8|
|[C, C]|   1|   1|   4|
+------+----+----+----+
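To get the single new_col from the question (the "2,5,5" style value), the three frequency columns can be combined after the join. A minimal sketch, assuming the prepared df1 (with value1/value2) and df2 from above; the v1/v2 aliases and the comma-separated string format are my own choices (F.array would give an array column instead):
import pyspark.sql.functions as F

new_df = (
    df1.join(df2.alias("v1"), on=F.col("value1") == F.col("v1.Col3"), how="left")
       .join(df2.alias("v2"), on=F.col("value2") == F.col("v2.Col3"), how="left")
       .select(
           "Col1",
           # frequency of element 1, frequency of element 2, frequency of the pair
           F.concat_ws(",", F.col("v1.Col4"), F.col("v2.Col4"), F.col("Col2")).alias("new_col"),
       )
)
new_df.show()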

Related

Get groups with duplicated values in PySpark

For example, if we have the following dataframe:
df = spark.createDataFrame([['a', 1], ['a', 1],
                            ['b', 1], ['b', 2],
                            ['c', 2], ['c', 2], ['c', 2]],
                           ['col1', 'col2'])
+----+----+
|col1|col2|
+----+----+
|   a|   1|
|   a|   1|
|   b|   1|
|   b|   2|
|   c|   2|
|   c|   2|
|   c|   2|
+----+----+
I want to mark groups based on col1 where values in col2 repeat themselves. I have an idea to find the difference between the group size and the count of distinct values:
window = Window.partitionBy('col1')
df.withColumn('col3', F.count('col2').over(window)).\
    withColumn('col4', F.approx_count_distinct('col2').over(window)).\
    select('col1', 'col2', (F.col('col3') - F.col('col4')).alias('col3')).show()
Maybe you have a better solution. My expected output:
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   a|   1|   1|
|   a|   1|   1|
|   b|   1|   0|
|   b|   2|   0|
|   c|   2|   2|
|   c|   2|   2|
|   c|   2|   2|
+----+----+----+
As you can see, all groups where col3 equals zero have only unique values in col2.
For what you need, you can simply compute the count over groups of col1 and col2:
df = df.withColumn('col3', F.expr('count(*) over (partition by col1,col2) - 1'))
df.show(truncate=False)
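Equivalently, if you prefer the column API to a SQL expression string, the same per-(col1, col2) count can be computed with a window; a minimal sketch, assuming the usual imports:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Rows per (col1, col2) pair; subtracting 1 leaves the number of duplicates.
w = Window.partitionBy('col1', 'col2')
df = df.withColumn('col3', F.count('*').over(w) - 1)
df.show(truncate=False)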

Convert Array to Columns and replace values

I have data of following format:
+-----+---------------+
| name|           Data|
+-----+---------------+
|Alpha|      [A, B, C]|
| Beta|   [A, B, C, D]|
|Gamma|[A, B, C, D, E]|
+-----+---------------+
How can I transform it into this?
+-----+----+-----+-----+-----+-----+
| name|   A|    B|    C|    D|    E|
+-----+----+-----+-----+-----+-----+
|Alpha|   1|    1|    1|    0|    0|
| Beta|   1|    1|    1|    1|    0|
|Gamma|   1|    1|    1|    1|    1|
+-----+----+-----+-----+-----+-----+
Thanks to Jarrod Baker for help with a similar transformation earlier.
Here is the code that I have:
val df = Seq(
  ("Alpha", Array("A", "B", "C")),
  ("Beta", Array("A", "B", "C", "D")),
  ("Gamma", Array("A", "B", "C", "D", "E")),
).toDF("name", "Data")
df.show()
val arrayDataSize = df.withColumn("arr_size", size(col("Data"))).agg(max("arr_size") as "maxSize")
val newDF = df.select(($"name") +: (0 until arrayDataSize.first.getInt(0)).map(i => { ($"Data")(i).contains("A").alias("A") }): _*)
newDF.show()
+-----+----+-----+-----+-----+-----+
| name|   A|    A|    A|    A|    A|
+-----+----+-----+-----+-----+-----+
|Alpha|true|false|false| null| null|
| Beta|true|false|false|false| null|
|Gamma|true|false|false|false|false|
+-----+----+-----+-----+-----+-----+
Thanks in advance for your help.
You can use the RelationalGroupedDataset's pivot method to achieve what you want. To create such a Dataset, you need to use groupBy on a Dataset.
It would look something like this:
import spark.implicits._

val df = Seq(
  ("Alpha", Seq("A", "B", "C")),
  ("Beta", Seq("A", "B", "C", "D")),
  ("Gamma", Seq("A", "B", "C", "D", "E"))
).toDF("name", "Data")

val output = df
  .select(df("name"), explode(col("Data")).alias("Data"))
  .groupBy("name")
  .pivot("Data")
  .count()
output.show()
+-----+---+---+---+----+----+
| name|  A|  B|  C|   D|   E|
+-----+---+---+---+----+----+
| Beta|  1|  1|  1|   1|null|
|Gamma|  1|  1|  1|   1|   1|
|Alpha|  1|  1|  1|null|null|
+-----+---+---+---+----+----+
As you can see, we're first explode-ing our Sequences into separate rows. This allows us to treat each element in each sequence as a separate "entity".
Then, we're using groupBy to get our RelationalGroupedDataset, after which we pivot and count the occurrences.
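Note that the pivoted counts come back as null where an element is absent, while the desired output shows 0. For reference, here is a hedged PySpark sketch of the same explode/pivot idea with the nulls filled in (in Scala the equivalent is appending .na.fill(0) to the chain above); the spark session name is assumed:
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [("Alpha", ["A", "B", "C"]),
     ("Beta",  ["A", "B", "C", "D"]),
     ("Gamma", ["A", "B", "C", "D", "E"])],
    ["name", "Data"],
)

output = (
    df.select("name", F.explode("Data").alias("Data"))
      .groupBy("name")
      .pivot("Data")
      .count()
      .na.fill(0)  # turn the missing combinations into 0 to match the desired output
)
output.show()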

Apache Spark - Scala API - Aggregate on sequentially increasing key

I have a data frame that looks something like this:
val df = sc.parallelize(Seq(
  (3,1,"A"),(3,2,"B"),(3,3,"C"),
  (2,1,"D"),(2,2,"E"),
  (3,1,"F"),(3,2,"G"),(3,3,"G"),
  (2,1,"X"),(2,2,"X")
)).toDF("TotalN", "N", "String")
+------+---+------+
|TotalN|  N|String|
+------+---+------+
|     3|  1|     A|
|     3|  2|     B|
|     3|  3|     C|
|     2|  1|     D|
|     2|  2|     E|
|     3|  1|     F|
|     3|  2|     G|
|     3|  3|     G|
|     2|  1|     X|
|     2|  2|     X|
+------+---+------+
I need to aggregate the strings by concatenating them based on TotalN and the sequentially increasing ID (N). The problem is that there is no unique ID for each aggregation that I can group by. So I need to do something like "for each row look at TotalN, loop through the next N rows and concatenate, then reset".
+------+------+
|TotalN|String|
+------+------+
|     3|   ABC|
|     2|    DE|
|     3|   FGG|
|     2|    XX|
+------+------+
Any pointers much appreciated.
Using Spark 2.3.1 and the Scala API.
Try this:
val df = spark.sparkContext.parallelize(Seq(
  (3, 1, "A"), (3, 2, "B"), (3, 3, "C"),
  (2, 1, "D"), (2, 2, "E"),
  (3, 1, "F"), (3, 2, "G"), (3, 3, "G"),
  (2, 1, "X"), (2, 2, "X")
)).toDF("TotalN", "N", "String")

df.createOrReplaceTempView("data")

val sqlDF = spark.sql(
  """
    | SELECT TotalN d, N, String, ROW_NUMBER() over (order by TotalN) as rowNum
    | FROM data
  """.stripMargin)

sqlDF.withColumn("key", $"N" - $"rowNum")
  .groupBy("key").agg(collect_list('String).as("texts")).show()
The solution is to calculate a grouping variable using the row_number function, which can then be used in a later groupBy.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
var w = Window.orderBy("TotalN")
df.withColumn("GeneratedID", $"N" - row_number.over(w)).show
+------+---+------+-----------+
|TotalN|  N|String|GeneratedID|
+------+---+------+-----------+
|     2|  1|     D|          0|
|     2|  2|     E|          0|
|     2|  1|     X|         -2|
|     2|  2|     X|         -2|
|     3|  1|     A|         -4|
|     3|  2|     B|         -4|
|     3|  3|     C|         -4|
|     3|  1|     F|         -7|
|     3|  2|     G|         -7|
|     3|  3|     G|         -7|
+------+---+------+-----------+
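The remaining step is to group on TotalN and GeneratedID and glue the strings back together. A minimal sketch of that follow-up, written in PySpark for consistency with the rest of this page (collect_list, sort_array and concat_ws all exist in the Scala API as well); it assumes df already carries the GeneratedID column from the window step above:
from pyspark.sql import functions as F

# Collect (N, String) pairs per group and sort them by N so the concatenation
# preserves the original order, since collect_list alone gives no ordering guarantee.
result = (
    df.groupBy("TotalN", "GeneratedID")
      .agg(F.sort_array(F.collect_list(F.struct("N", "String"))).alias("items"))
      .select("TotalN", F.concat_ws("", F.col("items.String")).alias("String"))
)
result.show()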

use RDD list as parameter for dataframe filter operation

I have the following code snippet.
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import *

sc = SparkContext()
spark = SparkSession.builder.appName("test").getOrCreate()

schema = StructType([
    StructField("name", StringType(), True),
    StructField("a", StringType(), True),
    StructField("b", StringType(), True),
    StructField("c", StringType(), True),
    StructField("d", StringType(), True),
    StructField("e", StringType(), True),
    StructField("f", StringType(), True)])

arr = [("Alice", "1", "2", None, "red", None, None),
       ("Bob", "1", None, None, None, None, "apple"),
       ("Charlie", "2", "3", None, None, None, "orange")]

df = spark.createDataFrame(arr, schema)
df.show()
#+-------+---+----+----+----+----+------+
#|   name|  a|   b|   c|   d|   e|     f|
#+-------+---+----+----+----+----+------+
#|  Alice|  1|   2|null| red|null|  null|
#|    Bob|  1|null|null|null|null| apple|
#|Charlie|  2|   3|null|null|null|orange|
#+-------+---+----+----+----+----+------+
Now, I have an RDD which looks like this:
lrdd = sc.parallelize([['a', 'b'], ['c', 'd', 'e'], ['f']])
My goal is to find, for each subset of attributes, the names whose values are null across that entire subset; that is, in the example above:
{'c,d,e': ['Bob', 'Charlie'], 'f': ['Alice']}
I came up with a rather naive solution: collect the list and then cycle through the subsets, querying the dataframe each time.
def build_filter_condition(l):
    return ' AND '.join(["({} is NULL)".format(x) for x in l])

res = {}
for alist in lrdd.collect():
    cond = build_filter_condition(alist)
    p = df.select("name").where(cond)
    if p and p.count() > 0:
        res[','.join(alist)] = p.rdd.map(lambda x: x[0]).collect()
print(res)
This works, but it's highly inefficient.
Consider also that the target attributes schema is something like 10000 attributes, leading to over 600 disjoint lists in lrdd.
So, my question is:
how can I efficiently use the content of a distributed collection as a parameter for querying a SQL dataframe?
Any hint is appreciated.
Thank you very much.
You should reconsider the format of your data. Instead of having so many columns, you should explode them into more rows to allow distributed computation:
import pyspark.sql.functions as psf

df = df.select(
    "name",
    psf.explode(
        psf.array(
            *[psf.struct(
                psf.lit(c).alias("feature_name"),
                df[c].alias("feature_value")
            ) for c in df.columns if c != "name"]
        )
    ).alias("feature")
).select("name", "feature.*")
+-------+------------+-------------+
|   name|feature_name|feature_value|
+-------+------------+-------------+
|  Alice|           a|            1|
|  Alice|           b|            2|
|  Alice|           c|         null|
|  Alice|           d|          red|
|  Alice|           e|         null|
|  Alice|           f|         null|
|    Bob|           a|            1|
|    Bob|           b|         null|
|    Bob|           c|         null|
|    Bob|           d|         null|
|    Bob|           e|         null|
|    Bob|           f|        apple|
|Charlie|           a|            2|
|Charlie|           b|            3|
|Charlie|           c|         null|
|Charlie|           d|         null|
|Charlie|           e|         null|
|Charlie|           f|       orange|
+-------+------------+-------------+
We'll do the same with lrdd but we'll change it a bit first:
subsets = spark\
    .createDataFrame(lrdd.map(lambda l: [l]), ["feature_set"])\
    .withColumn("feature_name", psf.explode("feature_set"))
+-----------+------------+
|feature_set|feature_name|
+-----------+------------+
|     [a, b]|           a|
|     [a, b]|           b|
|  [c, d, e]|           c|
|  [c, d, e]|           d|
|  [c, d, e]|           e|
|        [f]|           f|
+-----------+------------+
Now we can join these on feature_name and keep the (feature_set, name) pairs whose feature_value is exclusively null. If the lrdd table is not too big, you should broadcast it:
df_join = df.join(psf.broadcast(subsets), "feature_name")
res = df_join.groupBy("feature_set", "name").agg(
    psf.count("*").alias("count"),
    psf.sum(psf.isnull("feature_value").cast("int")).alias("nb_null")
).filter("nb_null = count")
+-----------+-------+-----+-------+
|feature_set|   name|count|nb_null|
+-----------+-------+-----+-------+
|  [c, d, e]|Charlie|    3|      3|
|        [f]|  Alice|    1|      1|
|  [c, d, e]|    Bob|    3|      3|
+-----------+-------+-----+-------+
You can always group by feature_set afterwards; a sketch of that final step follows.
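For example, a minimal sketch of that last step, collecting the names per feature_set and reshaping the result into the dictionary from the question (the comma-joined key format is an assumption on my part):
from pyspark.sql import functions as psf

grouped = (
    res.groupBy("feature_set")
       .agg(psf.collect_list("name").alias("names"))
       .withColumn("key", psf.concat_ws(",", "feature_set"))
)
# e.g. {'c,d,e': ['Charlie', 'Bob'], 'f': ['Alice']}
print({row["key"]: row["names"] for row in grouped.collect()})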
You can try this approach.
First, cross join both dataframes:
from pyspark.sql.types import *

lrdd = sc.parallelize([['a', 'b'], ['c', 'd', 'e'], ['f']]) \
    .map(lambda x: ("key", x))
schema = StructType([StructField("K", StringType()),
                     StructField("X", ArrayType(StringType()))])
df2 = spark.createDataFrame(lrdd, schema).select("X")
df3 = df.crossJoin(df2)
Result of the crossJoin:
+-------+---+----+----+----+----+------+---------+
|   name|  a|   b|   c|   d|   e|     f|        X|
+-------+---+----+----+----+----+------+---------+
|  Alice|  1|   2|null| red|null|  null|   [a, b]|
|  Alice|  1|   2|null| red|null|  null|[c, d, e]|
|  Alice|  1|   2|null| red|null|  null|      [f]|
|    Bob|  1|null|null|null|null| apple|   [a, b]|
|Charlie|  2|   3|null|null|null|orange|   [a, b]|
|    Bob|  1|null|null|null|null| apple|[c, d, e]|
|    Bob|  1|null|null|null|null| apple|      [f]|
|Charlie|  2|   3|null|null|null|orange|[c, d, e]|
|Charlie|  2|   3|null|null|null|orange|      [f]|
+-------+---+----+----+----+----+------+---------+
Now filter the rows using a UDF:
from pyspark.sql.functions import udf, struct, collect_list

def foo(data):
    d = list(filter(lambda x: data[x], data['X']))
    print(d)
    if len(d) > 0:
        return False
    else:
        return True

udf_foo = udf(foo, BooleanType())
df4 = df3.filter(udf_foo(struct([df3[x] for x in df3.columns]))).select("name", 'X')
df4.show()
+-------+---------+
|   name|        X|
+-------+---------+
|  Alice|      [f]|
|    Bob|[c, d, e]|
|Charlie|[c, d, e]|
+-------+---------+
Then use groupby and collect_list to get the desired output:
df4.groupby("X").agg(collect_list("name").alias("name")).show()
+---------+--------------+
|        X|          name|
+---------+--------------+
|      [f]|       [Alice]|
|[c, d, e]|[Bob, Charlie]|
+---------+--------------+

How can I get the column names which have top 3 largest values for each row in Pyspark?

Sample Dataframe
id a1 a2 a3 a4 a5 a6
0 5 23 4 1 4 5
1 6 43 2 2 98 43
2 3 56 3 1 23 3
3 2 2 6 3 5 2
4 5 6 7 2 7 5
I need output like this:
top1 top2 top3
a2 a1 a6
a5 a2 a6
from pyspark.sql.functions import col, udf, array, sort_array
from pyspark.sql.types import StringType

df = sc.parallelize([(0, 5, 23, 4, 1, 4, 5),
                     (1, 6, 43, 2, 2, 98, 43),
                     (2, 3, 56, 3, 1, 23, 3),
                     (3, 2, 2, 6, 3, 5, 2),
                     (4, 5, 6, 7, 2, 7, 5)]).\
    toDF(["id", "a1", "a2", "a3", "a4", "a5", "a6"])
df_col = df.columns

df = df.\
    withColumn("top1_val", sort_array(array([col(x) for x in df_col[1:]]), asc=False)[0]).\
    withColumn("top2_val", sort_array(array([col(x) for x in df_col[1:]]), asc=False)[1]).\
    withColumn("top3_val", sort_array(array([col(x) for x in df_col[1:]]), asc=False)[2])

def modify_values(r, max_col):
    l = []
    for i in range(len(df_col[1:])):
        if r[i] == max_col:
            l.append(df_col[i + 1])
    return l

modify_values_udf = udf(modify_values, StringType())

df1 = df.\
    withColumn("top1", modify_values_udf(array(df.columns[1:-3]), "top1_val")).\
    withColumn("top2", modify_values_udf(array(df.columns[1:-3]), "top2_val")).\
    withColumn("top3", modify_values_udf(array(df.columns[1:-3]), "top3_val"))
df1.show()
Output is:
+---+---+---+---+---+---+---+--------+--------+--------+--------+--------+------------+
| id| a1| a2| a3| a4| a5| a6|top1_val|top2_val|top3_val|    top1|    top2|        top3|
+---+---+---+---+---+---+---+--------+--------+--------+--------+--------+------------+
|  0|  5| 23|  4|  1|  4|  5|      23|       5|       5|    [a2]|[a1, a6]|    [a1, a6]|
|  1|  6| 43|  2|  2| 98| 43|      98|      43|      43|    [a5]|[a2, a6]|    [a2, a6]|
|  2|  3| 56|  3|  1| 23|  3|      56|      23|       3|    [a2]|    [a5]|[a1, a3, a6]|
|  3|  2|  2|  6|  3|  5|  2|       6|       5|       3|    [a3]|    [a5]|        [a4]|
|  4|  5|  6|  7|  2|  7|  5|       7|       7|       6|[a3, a5]|[a3, a5]|        [a2]|
+---+---+---+---+---+---+---+--------+--------+--------+--------+--------+------------+
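For reference, on Spark 2.4+ the same idea can be sketched without a Python UDF by sorting an array of (value, column_name) structs and taking the names of the first three entries. This is a sketch under my own assumptions (it recreates the sample data, and it breaks ties by column name instead of returning every tied column, so tied rows will differ slightly from the output above):
from pyspark.sql import functions as F

# Recreate the sample data (assumes an active SparkSession named `spark`).
df = spark.createDataFrame(
    [(0, 5, 23, 4, 1, 4, 5),
     (1, 6, 43, 2, 2, 98, 43),
     (2, 3, 56, 3, 1, 23, 3),
     (3, 2, 2, 6, 3, 5, 2),
     (4, 5, 6, 7, 2, 7, 5)],
    ["id", "a1", "a2", "a3", "a4", "a5", "a6"],
)
value_cols = ["a1", "a2", "a3", "a4", "a5", "a6"]

# Array of (value, column_name) structs, sorted descending by value;
# the names of the three largest entries become top1..top3.
ranked = F.sort_array(
    F.array(*[F.struct(F.col(c).alias("v"), F.lit(c).alias("name")) for c in value_cols]),
    asc=False,
)
df.select(
    "id",
    ranked[0]["name"].alias("top1"),
    ranked[1]["name"].alias("top2"),
    ranked[2]["name"].alias("top3"),
).show()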