Inserting new rows when values are not in a list using PySpark

I have a spark df:
col1 col2 col3 col4
A    a1   null s
A    a2   a2   g
A    null a3   m
B    a2   a2   g
B    a3   a3   g
I want to insert new rows when BOTH col2 and col3 are missing elements from the list below:
list = ["a1", "a2", "a3", "b4", "c7"]
Since in this example both col2 and col3 are missing b4 and c7, I should end up with the following:
col1 col2 col3 col4
A    a1   null s
A    a2   a2   g
A    null a3   m
B    a2   a2   g
B    a3   a3   g
A    b4   b4   k
A    c7   c7   k
B    b4   b4   k
B    c7   c7   k
The last 4 rows are what I want to add (this is an example; the actual df is bigger).
Any thoughts on how this could be coded?

Try this, using window functions and explode logic (Spark 2.4+):
df.show() #sample dataframe
#+----+----+----+----+
#|col1|col2|col3|col4|
#+----+----+----+----+
#| A| a1|null| s|
#| A| a2| a2| g|
#| A|null| a3| m|
#| B| a2| a2| g|
#| B| a3| a3| g|
#+----+----+----+----+
from pyspark.sql import functions as F
from pyspark.sql.window import Window
w=Window().partitionBy("col1")
w1=Window().partitionBy("col1").orderBy(F.lit(1))
(df
 # all col2/col3 values seen per col1 group, flattened into one array
 .withColumn("col1_2", F.flatten(F.collect_list(F.array("col2", "col3")).over(w)))
 .withColumn("list", F.array(*[F.lit(x) for x in list]))
 # values from the list that are missing for this col1 group
 .withColumn("except", F.array_except("list", "col1_2"))
 .withColumn("rowNum", F.row_number().over(w1))
 .withColumn("max", F.max("rowNum").over(w))
 # append the missing values to col2/col3 on the last row of each group
 .withColumn("col2", F.when(F.col("rowNum") == F.col("max"),
                            F.array_union(F.array("col2"), F.col("except")))
                      .otherwise(F.array(F.col("col2"))))
 .withColumn("col3", F.when(F.col("rowNum") == F.col("max"),
                            F.array_union(F.array("col3"), F.col("except")))
                      .otherwise(F.array(F.col("col3"))))
 .select(*[x for x in df.columns])
 # col4 becomes an array: the original value followed by 'k' for each appended value
 .withColumn("col5", F.when(F.size("col2") > 1, F.expr("array_repeat('k', size(col2)-1)"))
                      .otherwise(F.array("col4")))
 .withColumn("col4", F.when(F.size("col2") > 1, F.flatten(F.array(F.array("col4"), "col5")))
                      .otherwise(F.array("col4")))
 .drop("col5")
 # zip the arrays and explode them back into rows
 .withColumn("zipped", F.explode(F.arrays_zip("col2", "col3", "col4")))
 .select("col1", "zipped.*")
 .show())
#+----+----+----+----+
#|col1|col2|col3|col4|
#+----+----+----+----+
#| B| a2| a2| g|
#| B| a3| a3| g|
#| B| a1| a1| k|
#| B| b4| b4| k|
#| B| c7| c7| k|
#| A| a1|null| s|
#| A| a2| a2| g|
#| A|null| a3| m|
#| A| b4| b4| k|
#| A| c7| c7| k|
#+----+----+----+----+
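An alternative, somewhat shorter sketch of the same idea (also Spark 2.4+). Like the answer above, it computes the missing values per col1 group; it assumes, as in the expected output, that col4 for the inserted rows is always the literal 'k'. The name vals is just a copy of the question's list, renamed to avoid shadowing the Python built-in:
from pyspark.sql import functions as F

vals = ["a1", "a2", "a3", "b4", "c7"]

# per col1 group, the values from vals that appear in neither col2 nor col3
new_rows = (df.groupBy("col1")
              .agg(F.array_except(F.array(*[F.lit(x) for x in vals]),
                                  F.flatten(F.collect_list(F.array("col2", "col3"))))
                    .alias("missing"))
              .withColumn("col2", F.explode("missing"))  # one new row per missing value
              .withColumn("col3", F.col("col2"))
              .withColumn("col4", F.lit("k"))
              .select("col1", "col2", "col3", "col4"))

df.unionByName(new_rows).show()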

Related

How can I use frequencies of elements of tuples from another table?

I have a complicated question. I have a column containing pairs (tuples) and another column with their frequencies:
Col1      Col2
('A','B') 5
('C','C') 4
('F','D') 8
I also have another table with the individual elements of the tuples and their frequencies:
Col3 Col4
'A'  2
'B'  5
'C'  1
'F'  2
'D'  3
I need to make a new column from the frequencies. For each tuple (A,B) I need the frequency of A, the frequency of B and the frequency of the tuple.
Output:
Col1      new_col
('A','B') 2,5,5
('C','C') 1,1,4
('F','D') 2,3,8
creation of the data based on your example
dataset 1:
b = "Col1 Col2".split()
a = [
    (["A", "B"], 5),
    (["C", "C"], 4),
    (["F", "D"], 8),
]
df1 = spark.createDataFrame(a, b)
df1.show()
+------+----+
| Col1|Col2|
+------+----+
|[A, B]| 5|
|[C, C]| 4|
|[F, D]| 8|
+------+----+
dataset 2:
b = "Col3 Col4".split()
a = [
    ["A", 2],
    ["B", 5],
    ["C", 1],
    ["F", 2],
    ["D", 3],
]
df2 = spark.createDataFrame(a, b)
df2.show()
+----+----+
|Col3|Col4|
+----+----+
| A| 2|
| B| 5|
| C| 1|
| F| 2|
| D| 3|
+----+----+
preparation of df1
df1 = df1.withColumn("value1", df1["col1"].getItem(0)).withColumn(
    "value2", df1["col1"].getItem(1)
)
df1.show()
+------+----+------+------+
| Col1|Col2|value1|value2|
+------+----+------+------+
|[A, B]| 5| A| B|
|[C, C]| 4| C| C|
|[F, D]| 8| F| D|
+------+----+------+------+
join of the dataframes
from pyspark.sql import functions as F

df3 = (
    df1.join(
        df2.alias("value1"), on=F.col("value1") == F.col("value1.col3"), how="left"
    )
    .join(df2.alias("value2"), on=F.col("value2") == F.col("value2.col3"), how="left")
    .select(
        "col1",
        "value1.col4",
        "value2.col4",
        "col2",
    )
)
df3.show()
+------+----+----+----+
| col1|col4|col4|col2|
+------+----+----+----+
|[A, B]| 2| 5| 5|
|[F, D]| 2| 3| 8|
|[C, C]| 1| 1| 4|
+------+----+----+----+
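If you also want the single new_col from the question (frequency of the first element, frequency of the second element, and frequency of the pair as one comma-separated string), a possible final select on top of the same join could look like the sketch below. It reuses df1 (after the preparation step), df2 and the F import from above; df3_new is just a name used for this sketch:
df3_new = (
    df1.join(df2.alias("value1"), on=F.col("value1") == F.col("value1.col3"), how="left")
       .join(df2.alias("value2"), on=F.col("value2") == F.col("value2.col3"), how="left")
       .select(
           "col1",
           F.concat_ws(
               ",",
               F.col("value1.col4").cast("string"),  # frequency of the first element
               F.col("value2.col4").cast("string"),  # frequency of the second element
               F.col("col2").cast("string"),         # frequency of the pair itself
           ).alias("new_col"),
       )
)
df3_new.show()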

Weighted running total with conditional in Pyspark/Hive

I have product, brand, percentage and price columns. For each row, I want to calculate the sum of the percentage column over the rows above the current row, both for rows with a different brand than the current row and for rows with the same brand. I want to weight them by price: if a product above the current row has a higher price than the current row, I want to down-weight it by multiplying it by 0.8. How can I do this in PySpark or using spark.sql? The answer without the price weighting is here.
import pandas as pd

df = pd.DataFrame({'a': ['a1', 'a2', 'a3', 'a4', 'a5', 'a6'],
                   'brand': ['b1', 'b2', 'b1', 'b3', 'b2', 'b1'],
                   'pct': [40, 30, 10, 8, 7, 5],
                   'price': [0.6, 1, 0.5, 0.8, 1, 0.5]})
df = spark.createDataFrame(df)
What I am looking for:
product brand pct pct_same_brand pct_different_brand
a1      b1    40  null           null
a2      b2    30  null           40
a3      b1    10  32             30
a4      b3    8   null           80
a5      b2    7   24             58
a6      b1    5   40             45
Update:
I have added the data points below to help clarify the problem. As can be seen, the same row's pct can be multiplied by 0.8 when summed for one row and by 1.0 when summed for another.
product brand pct price pct_same_brand pct_different_brand
a1      b1    30  0.6   null           null
a2      b2    20  1.3   null           30
a3      b1    10  0.5   30*0.8         20
a4      b3    8   0.8   null           60
a5      b2    7   0.5   20*0.8         48
a6      b1    6   0.8   30*1 + 10*1    35
a7      b2    5   1.5   20*1 + 7*1     54
Update 2: In the data I provided above, the weight applied within a given row's sum is a single number (0.8 or 1), but the weights can also be mixed (0.8 for some of the preceding rows and 1 for others).
For example, in the data frame below, the multipliers for the last row should be 0.8 for a6 and 1.0 for the rest of brand b1:
df = pd.DataFrame({'a': ['a1', 'a2', 'a3', 'a4', 'a5', 'a6', 'a7', 'a8', 'a9', 'a10'],
                   'brand': ['b1', 'b2', 'b1', 'b3', 'b2', 'b1', 'b2', 'b1', 'b1', 'b1'],
                   'pct': [30, 20, 10, 8, 7, 6, 5, 4, 3, 2],
                   'price': [0.6, 1.3, 0.5, 0.8, 0.5, 0.8, 1.5, 0.5, 0.65, 0.7]})
df = spark.createDataFrame(df)
You can add a weight column to facilitate calculation:
from pyspark.sql import functions as F, Window

df2 = df.withColumn(
    'weight',
    # 0.8 when the current price is at most the previous (higher-pct) same-brand price, else 1.0
    F.when(
        F.col('price') <= F.lag('price').over(
            Window.partitionBy('brand').orderBy(F.desc('pct'))
        ),
        0.8
    ).otherwise(1.0)
).withColumn(
    'pct_same_brand',
    # weighted running total of pct over the preceding rows of the same brand
    F.col('weight') * F.sum('pct').over(
        Window.partitionBy('brand')
              .orderBy(F.desc('pct'))
              .rowsBetween(Window.unboundedPreceding, -1)
    )
).withColumn(
    'pct_different_brand',
    # overall running total minus the unweighted same-brand part;
    # dividing pct_same_brand by weight undoes the weighting
    F.sum('pct').over(
        Window.orderBy(F.desc('pct'))
              .rowsBetween(Window.unboundedPreceding, -1)
    ) - F.coalesce(F.col('pct_same_brand'), F.lit(0)) / F.col('weight')
)
df2.show()
+---+-----+---+-----+------+--------------+-------------------+
| a|brand|pct|price|weight|pct_same_brand|pct_different_brand|
+---+-----+---+-----+------+--------------+-------------------+
| a1| b1| 40| 0.6| 1.0| null| null|
| a2| b2| 30| 1.0| 1.0| null| 40.0|
| a3| b1| 10| 0.5| 0.8| 32.0| 30.0|
| a4| b3| 8| 0.8| 1.0| null| 80.0|
| a5| b2| 7| 1.0| 0.8| 24.0| 58.0|
| a6| b1| 5| 0.5| 0.8| 40.0| 45.0|
+---+-----+---+-----+------+--------------+-------------------+
Output for the edited question:
+---+-----+---+-----+------+--------------+-------------------+
| a|brand|pct|price|weight|pct_same_brand|pct_different_brand|
+---+-----+---+-----+------+--------------+-------------------+
| a1| b1| 30| 0.6| 1.0| null| null|
| a2| b2| 20| 1.3| 1.0| null| 30.0|
| a3| b1| 10| 0.5| 0.8| 24.0| 20.0|
| a4| b3| 8| 0.8| 1.0| null| 60.0|
| a5| b2| 7| 0.5| 0.8| 16.0| 48.0|
| a6| b1| 6| 0.8| 1.0| 40.0| 35.0|
| a7| b2| 5| 1.5| 1.0| 27.0| 54.0|
+---+-----+---+-----+------+--------------+-------------------+
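If you prefer spark.sql, a rough equivalent of the same window logic could look like the sketch below (untested; the temporary view name products is introduced here, and the unweighted same-brand running sum is subtracted directly instead of dividing pct_same_brand by the weight):
df.createOrReplaceTempView("products")
spark.sql("""
    SELECT a, brand, pct, price, weight,
           weight * SUM(pct) OVER (PARTITION BY brand ORDER BY pct DESC
                                   ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS pct_same_brand,
           SUM(pct) OVER (ORDER BY pct DESC
                          ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
             - COALESCE(SUM(pct) OVER (PARTITION BY brand ORDER BY pct DESC
                                       ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), 0)
               AS pct_different_brand
    FROM (
        SELECT *,
               CASE WHEN price <= LAG(price) OVER (PARTITION BY brand ORDER BY pct DESC)
                    THEN 0.8 ELSE 1.0 END AS weight
        FROM products
    ) t
""").show()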
If anyone has a similar question, this worked for me.
Basically, I used an outer join of the dataframe with itself, assigned the weights, and finally used window functions.
df_copy = df.withColumnRenamed('a', 'asin')\
            .withColumnRenamed('brand', 'brandd')\
            .withColumnRenamed('pct', 'pct2')\
            .withColumnRenamed('price', 'price2')

df2 = df.join(df_copy, on=[df.brand == df_copy.brandd], how='outer').orderBy('brand')

df3 = df2.filter(~((df2.a == df2.asin) & (df2.brand == df2.brandd))
                 & (df2.pct <= df2.pct2))
df3 = df3.withColumn('weight', F.when(df3.price2 > df3.price, 0.8).otherwise(1))

df4 = df3.groupBy(['a', 'brand', 'pct', 'price'])\
         .agg(F.sum(df3.pct2 * df3.weight).alias('same_brand_pct'))

df5 = df.join(df4, on=['a', 'brand', 'pct', 'price'], how='left')

df6 = df5.withColumn(
    'pct_same_brand_unscaled',
    F.sum('pct').over(
        Window.partitionBy('brand')
              .orderBy(F.desc('pct'))
              .rowsBetween(Window.unboundedPreceding, -1)
    )
).withColumn(
    'pct_different_brand',
    F.sum('pct').over(
        Window.orderBy(F.desc('pct'))
              .rowsBetween(Window.unboundedPreceding, -1)
    ) - F.coalesce(F.col('pct_same_brand_unscaled'), F.lit(0))
).drop('pct_same_brand_unscaled')
gives:
+---+-----+---+-----+--------------+-------------------+
| a|brand|pct|price|same_brand_pct|pct_different_brand|
+---+-----+---+-----+--------------+-------------------+
| a1| b1| 30| 0.6| null| null|
| a2| b2| 20| 1.3| null| 30|
| a3| b1| 10| 0.5| 24.0| 20|
| a4| b3| 8| 0.8| null| 60|
| a5| b2| 7| 0.5| 16.0| 48|
| a6| b1| 6| 0.8| 40.0| 35|
| a7| b2| 5| 1.5| 27.0| 54|
| a8| b1| 4| 0.5| 38.8| 40|
| a9| b1| 3| 0.65| 48.8| 40|
|a10| b1| 2| 0.7| 51.8| 40|
+---+-----+---+-----+--------------+-------------------+

How to take rows from a dataframe that are not in another dataframe using Spark/Scala

I have a dataframe:
+----+----+
|Col1|col2|
+----+----+
|   A|  A2|
|   A|  A2|
|   B|  b2|
|   B|  b2|
|   C|  c2|
|   D|  d2|
|   E|  e2|
|   F|  f2|
+----+----+
And another dataframe:
+----+----+
|Col1|col2|
+----+----+
|   A|  A2|
|   B|  b2|
|   C|  c2|
+----+----+
I want to have as a result:
+----+----+
|Col1|col2|
+----+----+
|   D|  d2|
|   E|  e2|
|   F|  f2|
+----+----+
I tried this:
df1.join(df2, Seq("col1", "col2"), "left")
but it doesn't work for me. Any ideas? Thank you.
We can use .except or a left join (keeping only the unmatched rows) for this case.
Example:
df.show()
//+----+----+
//|Col1|Col2|
//+----+----+
//| A| A2|
//| A| A2|
//| B| b2|
//| B| b2|
//| C| c2|
//| D| d2|
//| E| e2|
//| F| f2|
//+----+----+
df1.show()
//+----+----+
//|Col1|Col2|
//+----+----+
//| A| A2|
//| B| b2|
//| C| c2|
//+----+----+
df.except(df1).show()
//+----+----+
//|Col1|Col2|
//+----+----+
//| E| e2|
//| F| f2|
//| D| d2|
//+----+----+
df.alias("d1").join(df1.alias("d2"),
(col("d1.Col1")===col("d2.Col1") &&(col("d1.Col2")===col("d2.Col2"))),"left").
filter(col("d2.Col2").isNull).
select("d1.*").
show()
//+----+----+
//|Col1|Col2|
//+----+----+
//| D| d2|
//| E| e2|
//| F| f2|
//+----+----+
You can use except on the two dataframes.
scala> df1.except(df2).show
+----+----+
|Col1|col2|
+----+----+
| E| e2|
| F| f2|
| D| d2|
+----+----+

use RDD list as parameter for dataframe filter operation

I have the following code snippet.
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import *
sc = SparkContext()
spark = SparkSession.builder.appName("test").getOrCreate()
schema = StructType([
    StructField("name", StringType(), True),
    StructField("a", StringType(), True),
    StructField("b", StringType(), True),
    StructField("c", StringType(), True),
    StructField("d", StringType(), True),
    StructField("e", StringType(), True),
    StructField("f", StringType(), True)])
arr = [("Alice", "1", "2", None, "red", None, None),
       ("Bob", "1", None, None, None, None, "apple"),
       ("Charlie", "2", "3", None, None, None, "orange")]
df = spark.createDataFrame(arr, schema)
df.show()
#+-------+---+----+----+----+----+------+
#| name| a| b| c| d| e| f|
#+-------+---+----+----+----+----+------+
#| Alice| 1| 2|null| red|null| null|
#| Bob| 1|null|null|null|null| apple|
#|Charlie| 2| 3|null|null|null|orange|
#+-------+---+----+----+----+----+------+
Now, I have an RDD which looks like this:
lrdd = sc.parallelize([['a', 'b'], ['c', 'd', 'e'], ['f']])
My goal is to find the names for which an entire subset of attributes is null, that is, in the example above:
{'c,d,e': ['Bob', 'Charlie'], 'f': ['Alice']}
Now, I came up with a rather naive solution: collect the list and then cycle through the subsets, querying the dataframe.
def build_filter_condition(l):
    return ' AND '.join(["({} is NULL)".format(x) for x in l])

res = {}
for alist in lrdd.collect():
    cond = build_filter_condition(alist)
    p = df.select("name").where(cond)
    if p and p.count() > 0:
        res[','.join(alist)] = p.rdd.map(lambda x: x[0]).collect()
print(res)
This works, but it's highly inefficient.
Consider also that the target attributes schema is something like 10000 attributes, leading to over 600 disjoint lists in lrdd.
So, my question is:
how can I efficiently use the content of a distributed collection as a parameter for querying a SQL dataframe?
Any hint is appreciated.
Thank you very much.
You should reconsider the format of your data. Instead of having so many columns, you should explode them into more rows to allow distributed computation:
import pyspark.sql.functions as psf

df = df.select(
    "name",
    psf.explode(
        psf.array(
            *[psf.struct(
                psf.lit(c).alias("feature_name"),
                df[c].alias("feature_value")
            ) for c in df.columns if c != "name"]
        )
    ).alias("feature")
).select("name", "feature.*")
+-------+------------+-------------+
| name|feature_name|feature_value|
+-------+------------+-------------+
| Alice| a| 1|
| Alice| b| 2|
| Alice| c| null|
| Alice| d| red|
| Alice| e| null|
| Alice| f| null|
| Bob| a| 1|
| Bob| b| null|
| Bob| c| null|
| Bob| d| null|
| Bob| e| null|
| Bob| f| apple|
|Charlie| a| 2|
|Charlie| b| 3|
|Charlie| c| null|
|Charlie| d| null|
|Charlie| e| null|
|Charlie| f| orange|
+-------+------------+-------------+
We'll do the same with lrdd but we'll change it a bit first:
subsets = spark\
    .createDataFrame(lrdd.map(lambda l: [l]), ["feature_set"])\
    .withColumn("feature_name", psf.explode("feature_set"))
+-----------+------------+
|feature_set|feature_name|
+-----------+------------+
| [a, b]| a|
| [a, b]| b|
| [c, d, e]| c|
| [c, d, e]| d|
| [c, d, e]| e|
| [f]| f|
+-----------+------------+
Now we can join these on feature_name and keep the feature_set and name combinations whose feature_value is exclusively null. If the lrdd table is not too big, you should broadcast it:
df_join = df.join(psf.broadcast(subsets), "feature_name")
res = df_join.groupBy("feature_set", "name").agg(
    psf.count("*").alias("count"),
    psf.sum(psf.isnull("feature_value").cast("int")).alias("nb_null")
).filter("nb_null = count")
+-----------+-------+-----+-------+
|feature_set| name|count|nb_null|
+-----------+-------+-----+-------+
| [c, d, e]|Charlie| 3| 3|
| [f]| Alice| 1| 1|
| [c, d, e]| Bob| 3| 3|
+-----------+-------+-----+-------+
You can always groupBy feature_set afterwards to gather the names per subset.
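For instance, a minimal sketch of that last step (reusing the res dataframe and the psf import from above):
res.groupBy("feature_set") \
   .agg(psf.collect_list("name").alias("names")) \
   .show(truncate=False)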
You can try this approach.
First, cross join both dataframes:
from pyspark.sql.types import *

lrdd = sc.parallelize([['a', 'b'], ['c', 'd', 'e'], ['f']])\
         .map(lambda x: ("key", x))
schema = StructType([StructField("K", StringType()),
                     StructField("X", ArrayType(StringType()))])
df2 = spark.createDataFrame(lrdd, schema).select("X")
df3 = df.crossJoin(df2)
result of crossjoin
+-------+---+----+----+----+----+------+---------+
| name| a| b| c| d| e| f| X|
+-------+---+----+----+----+----+------+---------+
| Alice| 1| 2|null| red|null| null| [a, b]|
| Alice| 1| 2|null| red|null| null|[c, d, e]|
| Alice| 1| 2|null| red|null| null| [f]|
| Bob| 1|null|null|null|null| apple| [a, b]|
|Charlie| 2| 3|null|null|null|orange| [a, b]|
| Bob| 1|null|null|null|null| apple|[c, d, e]|
| Bob| 1|null|null|null|null| apple| [f]|
|Charlie| 2| 3|null|null|null|orange|[c, d, e]|
|Charlie| 2| 3|null|null|null|orange| [f]|
+-------+---+----+----+----+----+------+---------+
Now filter out the rows using a UDF:
from pyspark.sql.functions import udf, struct, collect_list

def foo(data):
    # columns from this row's X subset that are non-null
    d = list(filter(lambda x: data[x], data['X']))
    return len(d) == 0  # keep the row only if every column in X is null

udf_foo = udf(foo, BooleanType())
df4 = df3.filter(udf_foo(struct([df3[x] for x in df3.columns]))).select("name", 'X')
df4.show()
+-------+---------+
| name| X|
+-------+---------+
| Alice| [f]|
| Bob|[c, d, e]|
|Charlie|[c, d, e]|
+-------+---------+
Then use groupby and collect_list to get the desired output
df4.groupby("X").agg(collect_list("name").alias("name")).show()
+--------------+---------+
| name | X|
+--------------+---------+
| [ Alice] | [f]|
|[Bob, Charlie]|[c, d, e]|
+--------------+---------+

How to compare two files using spark?

I want to compare two files and, where records do not match, load the extra/unmatched records into another file.
I need to compare each and every field in both files, and also get the record counts.
Let's say you have two files:
scala> val a = spark.read.option("header", "true").csv("a.csv").alias("a"); a.show
+---+-----+
|key|value|
+---+-----+
| a| b|
| b| c|
+---+-----+
a: org.apache.spark.sql.DataFrame = [key: string, value: string]
scala> val b = spark.read.option("header", "true").csv("b.csv").alias("b"); b.show
+---+-----+
|key|value|
+---+-----+
| b| c|
| c| d|
+---+-----+
b: org.apache.spark.sql.DataFrame = [key: string, value: string]
It is unclear which sort of unmatched records you are looking for, but it is easy to find them by any definition with join:
scala> a.join(b, Seq("key")).show
+---+-----+-----+
|key|value|value|
+---+-----+-----+
| b| c| c|
+---+-----+-----+
scala> a.join(b, Seq("key"), "left_outer").show
+---+-----+-----+
|key|value|value|
+---+-----+-----+
| a| b| null|
| b| c| c|
+---+-----+-----+
scala> a.join(b, Seq("key"), "right_outer").show
+---+-----+-----+
|key|value|value|
+---+-----+-----+
| b| c| c|
| c| null| d|
+---+-----+-----+
scala> a.join(b, Seq("key"), "outer").show
+---+-----+-----+
|key|value|value|
+---+-----+-----+
| c| null| d|
| b| c| c|
| a| b| null|
+---+-----+-----+
If you are looking for the records in b.csv that are not present in a.csv:
scala> val diff = a.join(b, Seq("key"), "right_outer").filter($"a.value" isNull).drop($"a.value")
scala> diff.show
+---+-----+
|key|value|
+---+-----+
| c| d|
+---+-----+
scala> diff.write.csv("diff.csv")