Efficiently calculate top k elements on PySpark GroupedData (not scala) - pyspark

I have a Dataframe of the form:
+---+---+----+
| A| B|dist|
+---+---+----+
| a1| b1| 1.0|
| a1| b2| 2.0|
| a2| b1|10.0|
| a2| b2|10.0|
| a2| b3| 2.0|
| a3| b1|10.0|
+---+---+----+
and, for a fixed max_rank=2, I want to obtain the following one:
+---+---+----+----+
| A| B|dist|rank|
+---+---+----+----+
| a3| b1|10.0| 1|
| a2| b3| 2.0| 1|
| a2| b1|10.0| 2|
| a2| b2|10.0| 2|
| a1| b1| 1.0| 1|
| a1| b2| 2.0| 2|
+---+---+----+----+
The classical method to do this is the following:
from pyspark.sql import Window
from pyspark.sql.functions import rank
from pyspark.sql.types import StructType, StructField, StringType, FloatType

df = sqlContext.createDataFrame(
    [("a1", "b1", 1.), ("a1", "b2", 2.), ("a2", "b1", 10.), ("a2", "b2", 10.), ("a2", "b3", 2.), ("a3", "b1", 10.)],
    schema=StructType([StructField("A", StringType(), True),
                       StructField("B", StringType(), True),
                       StructField("dist", FloatType(), True)]))
win = Window().partitionBy(df['A']).orderBy(df['dist'])
out = df.withColumn('rank', rank().over(win))
out = out.filter('rank <= 2')
However, this solution is inefficient because the Window function requires an orderBy.
Is there another solution for PySpark? For example, a method similar to .top(k, key=--) for RDDs?
I found a similar answer here, but it uses Scala instead of Python.
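For reference, one direction I was considering (just a rough sketch, not benchmarked; the helper names add_row and merge_accs are mine) is to drop to the RDD API and keep only a bounded list of the k smallest distances per key with aggregateByKey, which avoids a full sort inside each partition:

import heapq

max_rank = 2

# keep only the max_rank smallest (B, dist) pairs seen so far for each A
def add_row(acc, row):
    acc.append(row)
    return heapq.nsmallest(max_rank, acc, key=lambda r: r[1])

def merge_accs(acc1, acc2):
    return heapq.nsmallest(max_rank, acc1 + acc2, key=lambda r: r[1])

topk = (df.rdd
          .map(lambda r: (r["A"], (r["B"], r["dist"])))
          .aggregateByKey([], add_row, merge_accs)
          .flatMapValues(lambda pairs: pairs))

Ties are handled differently from rank() here, so it only matches the Window version when there are no duplicate distances within a group, and I don't know whether it is actually faster in practice.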

Related

PySpark - Fill empty strings with '0' if the data type is BIGINT/DOUBLE/Integer

I am trying to fill empty strings with a '0' if the column data type is BIGINT/DOUBLE/Integer in a DataFrame using PySpark.
data = [("James","","Smith","36","M",3000,"1.2"),
("Michael","Rose"," ","40","M",4000,"2.0"),
("Robert","","Williams","42","M",4000,"5.0"),
("Maria","Anne"," ","39","F", ," "),
("Jen","Mary","Brown"," ","F",-1,"")
]
schema = StructType([
StructField("firstname",StringType(),True),
StructField("middlename",StringType(),True),
StructField("lastname",StringType(),True),
StructField("age", StringType(), True),
StructField("gender", StringType(), True),
StructField("salary", IntegerType(), True),
StructField("amount", DoubleType(), True)
])
df = spark.createDataFrame(data=data,schema=schema)
df.printSchema()
I am trying it like this:
df.select(*[
    F.when(F.dtype in ('integertype', 'doubletype') and F.col(column).ishaving(" "), '0')
     .otherwise(F.col(column))
     .alias(column)
    for column in df.columns
]).show()
Expected output:
+---------+----------+--------+---+------+------+------+
|firstname|middlename|lastname|age|gender|salary|amount|
+---------+----------+--------+---+------+------+------+
| James| | Smith| 36| M| 3000| 1.2|
| Michael| Rose| | 40| M| 4000| 2.0|
| Robert| |Williams| 42| M| 4000| 5.0|
| Maria| Anne| | 39| F| 0| 0|
| Jen| Mary| Brown| | F| -1| 0|
+---------+----------+--------+---+------+------+------+
You can utilise reduce to accomplish this; it makes the code cleaner and easier to understand.
Additionally, create a to_fill list of the columns that match your condition, which can be further modified for your scenarios.
Data Preparation
data = [("James","","Smith","36","M",3000,1.2),
("Michael","Rose"," ","40","M",4000,2.0),
("Robert","","Williams","42","M",4000,5.0),
("Maria","Anne"," ","39","F",None,None),
("Jen","Mary","Brown"," ","F",-1,None)
]
schema = StructType([
StructField("firstname",StringType(),True),
StructField("middlename",StringType(),True),
StructField("lastname",StringType(),True),
StructField("age", StringType(), True),
StructField("gender", StringType(), True),
StructField("salary", IntegerType(), True),
StructField("amount", DoubleType(), True)
])
sparkDF = sql.createDataFrame(data=data,schema=schema)
sparkDF.show()
+---------+----------+--------+---+------+------+------+
|firstname|middlename|lastname|age|gender|salary|amount|
+---------+----------+--------+---+------+------+------+
| James| | Smith| 36| M| 3000| 1.2|
| Michael| Rose| | 40| M| 4000| 2.0|
| Robert| |Williams| 42| M| 4000| 5.0|
| Maria| Anne| | 39| F| null| null|
| Jen| Mary| Brown| | F| -1| null|
+---------+----------+--------+---+------+------+------+
Reduce
to_fill = [ c for c,d in sparkDF.dtypes if d in ['int','bigint','double']]
# to_fill --> ['salary','amount']
sparkDF = reduce(
    lambda df, x: df.withColumn(x, F.when(F.col(x).isNull(), 0).otherwise(F.col(x))),
    to_fill,
    sparkDF,
)
sparkDF.show()
+---------+----------+--------+---+------+------+------+
|firstname|middlename|lastname|age|gender|salary|amount|
+---------+----------+--------+---+------+------+------+
| James| | Smith| 36| M| 3000| 1.2|
| Michael| Rose| | 40| M| 4000| 2.0|
| Robert| |Williams| 42| M| 4000| 5.0|
| Maria| Anne| | 39| F| 0| 0.0|
| Jen| Mary| Brown| | F| -1| 0.0|
+---------+----------+--------+---+------+------+------+
You can try this:
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark.sql.types import *
spark = SparkSession.builder.master("local").appName("test").getOrCreate()
data = [("James", "", "Smith", "36", "", 3000, 1.2),
("Michael", "Rose", "", "40", "M", 4000, 2.0),
("Robert", "", "Williams", "42", "M", 4000, 5.0),
("Maria", "Anne", " ", "39", "F", None, None),
("Jen", "Mary", "Brown", " ", "F", -1, None)
]
schema = StructType([StructField("firstname", StringType(), True),StructField("middlename", StringType(), True),StructField("lastname", StringType(), True),StructField("age", StringType(), True),StructField("gender", StringType(), True),StructField("salary", IntegerType(), True),StructField("amount", DoubleType(), True)])
dfa = spark.createDataFrame(data=data, schema=schema)
dfa.show()
def removenull(dfa):
    dfa = dfa.select([trim(col(c)).alias(c) for c in dfa.columns])
    for i in dfa.columns:
        dfa = dfa.withColumn(i, when(col(i) == "", None).otherwise(col(i)))
    return dfa
removenull(dfa).show()
output:
+---------+----------+--------+----+------+------+------+
|firstname|middlename|lastname| age|gender|salary|amount|
+---------+----------+--------+----+------+------+------+
| James| null| Smith| 36| null| 3000| 1.2|
| Michael| Rose| null| 40| M| 4000| 2.0|
| Robert| null|Williams| 42| M| 4000| 5.0|
| Maria| Anne| null| 39| F| null| null|
| Jen| Mary| Brown|null| F| -1| null|
+---------+----------+--------+----+------+------+------+
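If you also want the 0s from the expected output instead of nulls, one possible follow-up (a sketch, assuming the columns are all strings after the trim step above) is to fill the now-null numeric columns:

# after trim(), salary and amount are string columns, so fill their nulls
# with the string "0" to mirror the expected output
removenull(dfa).fillna("0", subset=["salary", "amount"]).show()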

Explode function is increasing job time in Spark DataFrame

I have a DataFrame with one column, arrs, holding an array of size close to 100000.
Now I need to explode this column to get a separate row for each element of the array.
The explode function of spark.sql does the job, but it takes a lot of time.
Is there any alternative to explode that I can try to optimize the job?
dfs.printSchema()
println("Orginal DF")
dfs.show()
//Performing Explode operation
import org.apache.spark.sql.functions.{explode,col}
val opdfs=dfs.withColumn("explarrs",explode(col("arrs"))).drop("arrs")
println("Exploded DF")
opdfs.show()
The expected result is shown below, but I am looking for an alternative to this code that will run the job more efficiently.
Original DF
+----+------+----+--------------------+
|col1| col2|col3| arrs|
+----+------+----+--------------------+
| A|DFtest| K|[1, 2, 3, 4, 5, 6...|
+----+------+----+--------------------+
Exploded DF
+----+------+----+--------+
|col1| col2|col3|explarrs|
+----+------+----+--------+
| A|DFtest| K| 1|
| A|DFtest| K| 2|
| A|DFtest| K| 3|
| A|DFtest| K| 4|
| A|DFtest| K| 5|
| A|DFtest| K| 6|
| A|DFtest| K| 7|
| A|DFtest| K| 8|
| A|DFtest| K| 9|
| A|DFtest| K| 10|
| A|DFtest| K| 11|
| A|DFtest| K| 12|
| A|DFtest| K| 13|
| A|DFtest| K| 14|
| A|DFtest| K| 15|
| A|DFtest| K| 16|
| A|DFtest| K| 17|
| A|DFtest| K| 18|
| A|DFtest| K| 19|
| A|DFtest| K| 20|
+----+------+----+--------+
only showing top 20 rows
You can do the same without explode by using the flatMap method on the DataFrame. For example, if you need to explode an array of integers you can proceed with something like:
import scala.collection.mutable
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import spark.implicits._

val els = Seq(Row(Array(1, 2, 3)))
val df = spark.createDataFrame(spark.sparkContext.parallelize(els), StructType(Seq(StructField("data", ArrayType(IntegerType), false))))
df.show()
It gives:
+---------+
| data|
+---------+
|[1, 2, 3]|
+---------+
Using the DataFrame's flatMap:
df.flatMap(row => row.getAs[mutable.WrappedArray[Int]](0)).show()
+-----+
|value|
+-----+
| 1|
| 2|
| 3|
+-----+
The problem with this is that you need to put the right type of the elements of your array in the getAs function, in addition to the memory overhead. As I said in my comment, there was a bug that has since been fixed: https://issues.apache.org/jira/browse/SPARK-21657
But if you can't upgrade your Spark version, you can try the code above and compare.
If you want to add the other fields to your result you could do something like:
val els = Seq(Row(Array(1, 2, 3), "data1", "data2"), Row(Array(1, 2, 3, 4, 5, 6), "data10", "data20"))
val df = spark.createDataFrame(spark.sparkContext.parallelize(els),
StructType(Seq(StructField("data", ArrayType(IntegerType), false), StructField("data1", StringType, false), StructField("data2", StringType, false))))
df.show()
df.flatMap { row =>
  val arr = row.getAs[mutable.WrappedArray[Int]](0)
  arr.map { el =>
    (row.getAs[String](1), row.getAs[String](2), el)
  }
}.show()
It gives:
+------+------+---+
| _1| _2| _3|
+------+------+---+
| data1| data2| 1|
| data1| data2| 2|
| data1| data2| 3|
|data10|data20| 1|
|data10|data20| 2|
|data10|data20| 3|
|data10|data20| 4|
|data10|data20| 5|
|data10|data20| 6|
+------+------+---+
Maybe it can help.

How to efficiently perform this column operation on a Spark Dataframe?

I have a dataframe as follows:
+---+---+---+
| F1| F2| F3|
+---+---+---+
| x| y| 1|
| x| z| 2|
| x| a| 4|
| x| a| 4|
| x| y| 1|
| t| y2| 6|
| t| y3| 4|
| t| y4| 5|
+---+---+---+
I want to add another column whose value is (the number of distinct (F1, F2) pairs for each distinct value of F3) / (the total number of distinct (F1, F2) pairs).
For example, for the above table, below is the desired new dataframe:
+---+---+---+----+
| F1| F2| F3| F4|
+---+---+---+----+
| t| y4| 5| 1/6|
| x| y| 1| 1/6|
| x| y| 1| 1/6|
| x| z| 2| 1/6|
| t| y2| 6| 1/6|
| t| y3| 4| 2/6|
| x| a| 4| 2/6|
| x| a| 4| 2/6|
+---+---+---+----+
Note: for F3 = 4, there are only 2 distinct (F1, F2) pairs: {(t, y3), (x, a)}. Therefore, for all occurrences of F3 = 4, F4 will be 2 / (the total number of distinct (F1, F2) pairs; here there are 6 such pairs).
How can I achieve the above transformation in Spark Scala?
While trying to solve your problem, I just learnt that you can't use distinct aggregate functions inside a Window over DataFrames.
So what I did was create a temporary DataFrame and join it with the initial one to obtain your desired results:
case class Dog(F1:String, F2: String, F3: Int)
val df = Seq(Dog("x", "y", 1), Dog("x", "z", 2), Dog("x", "a", 4), Dog("x", "a", 4), Dog("x", "y", 1), Dog("t", "y2", 6), Dog("t", "y3", 4), Dog("t", "y4", 5)).toDF
val unique_F1_F2 = df.select("F1", "F2").distinct.count
val dd = df.withColumn("X1", concat(col("F1"), col("F2")))
.groupBy("F3")
.agg(countDistinct(col("X1")).as("distinct_count"))
val final_df = dd.join(df, "F3")
.withColumn("F4", col("distinct_count")/unique_F1_F2)
.drop("distinct_count")
final_df.show
+---+---+---+-------------------+
| F3| F1| F2| F4|
+---+---+---+-------------------+
| 1| x| y|0.16666666666666666|
| 1| x| y|0.16666666666666666|
| 6| t| y2|0.16666666666666666|
| 5| t| y4|0.16666666666666666|
| 4| t| y3| 0.3333333333333333|
| 4| x| a| 0.3333333333333333|
| 4| x| a| 0.3333333333333333|
| 2| x| z|0.16666666666666666|
+---+---+---+-------------------+
I hope this is what you expected!
EDIT: I changed df.count to unique_F1_F2.

Use an RDD list as a parameter for a DataFrame filter operation

I have the following code snippet.
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import *
sc = SparkContext()
spark = SparkSession.builder.appName("test").getOrCreate()
schema = StructType([
StructField("name", StringType(), True),
StructField("a", StringType(), True),
StructField("b", StringType(), True),
StructField("c", StringType(), True),
StructField("d", StringType(), True),
StructField("e", StringType(), True),
StructField("f", StringType(), True)])
arr = [("Alice", "1", "2", None, "red", None, None), \
("Bob", "1", None, None, None, None, "apple"), \
("Charlie", "2", "3", None, None, None, "orange")]
df = spark.createDataFrame(arr, schema)
df.show()
#+-------+---+----+----+----+----+------+
#| name| a| b| c| d| e| f|
#+-------+---+----+----+----+----+------+
#| Alice| 1| 2|null| red|null| null|
#| Bob| 1|null|null|null|null| apple|
#|Charlie| 2| 3|null|null|null|orange|
#+-------+---+----+----+----+----+------+
Now, I have a RDD which is like:
lrdd = sc.parallelize([['a', 'b'], ['c', 'd', 'e'], ['f']])
My goal is to find the names for which a whole subset of attributes is empty (all null), that is, in the example above:
{'c,d,e': ['Bob', 'Charlie'], 'f': ['Alice']}
For now, I came up with a rather naive solution: collect the list and then cycle through the subsets, querying the dataframe.
def build_filter_condition(l):
    return ' AND '.join(["({} is NULL)".format(x) for x in l])

res = {}
for alist in lrdd.collect():
    cond = build_filter_condition(alist)
    p = df.select("name").where(cond)
    if p and p.count() > 0:
        res[','.join(alist)] = p.rdd.map(lambda x: x[0]).collect()
print(res)
This works, but it's highly inefficient.
Consider also that the target attributes schema is something like 10000 attributes, leading to over 600 disjoint lists in lrdd.
So, my question is:
how can I efficiently use the content of a distributed collection as a parameter for querying a SQL DataFrame?
Any hint is appreciated.
Thank you very much.
You should reconsider the format of your data. Instead of having so many columns, you should explode it to get more rows and allow distributed computation:
import pyspark.sql.functions as psf

df = df.select(
    "name",
    psf.explode(
        psf.array(
            *[psf.struct(
                psf.lit(c).alias("feature_name"),
                df[c].alias("feature_value")
            ) for c in df.columns if c != "name"]
        )
    ).alias("feature")
).select("name", "feature.*")
+-------+------------+-------------+
| name|feature_name|feature_value|
+-------+------------+-------------+
| Alice| a| 1|
| Alice| b| 2|
| Alice| c| null|
| Alice| d| red|
| Alice| e| null|
| Alice| f| null|
| Bob| a| 1|
| Bob| b| null|
| Bob| c| null|
| Bob| d| null|
| Bob| e| null|
| Bob| f| apple|
|Charlie| a| 2|
|Charlie| b| 3|
|Charlie| c| null|
|Charlie| d| null|
|Charlie| e| null|
|Charlie| f| orange|
+-------+------------+-------------+
We'll do the same with lrdd but we'll change it a bit first:
subsets = spark\
.createDataFrame(lrdd.map(lambda l: [l]), ["feature_set"])\
.withColumn("feature_name", psf.explode("feature_set"))
+-----------+------------+
|feature_set|feature_name|
+-----------+------------+
| [a, b]| a|
| [a, b]| b|
| [c, d, e]| c|
| [c, d, e]| d|
| [c, d, e]| e|
| [f]| f|
+-----------+------------+
Now we can join these on feature_name and keep the (feature_set, name) combinations whose feature_value is exclusively null. If the lrdd table is not too big, you should broadcast it:
df_join = df.join(psf.broadcast(subsets), "feature_name")
res = df_join.groupBy("feature_set", "name").agg(
    psf.count("*").alias("count"),
    psf.sum(psf.isnull("feature_value").cast("int")).alias("nb_null")
).filter("nb_null = count")
+-----------+-------+-----+-------+
|feature_set| name|count|nb_null|
+-----------+-------+-----+-------+
| [c, d, e]|Charlie| 3| 3|
| [f]| Alice| 1| 1|
| [c, d, e]| Bob| 3| 3|
+-----------+-------+-----+-------+
You can always groupBy feature_set afterwards
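A minimal sketch of that last step (using the res dataframe from above), collecting into the dictionary shape asked for in the question:

# group the surviving (feature_set, name) pairs back by feature_set
out = res.groupBy("feature_set").agg(psf.collect_list("name").alias("names"))

# e.g. {'c,d,e': ['Bob', 'Charlie'], 'f': ['Alice']}
result = {",".join(r["feature_set"]): r["names"] for r in out.collect()}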
You can try this approach.
First, cross join both dataframes:
from pyspark.sql.types import *

lrdd = sc.parallelize([['a', 'b'], ['c', 'd', 'e'], ['f']])\
    .map(lambda x: ("key", x))

schema = StructType([StructField("K", StringType()),
                     StructField("X", ArrayType(StringType()))])

df2 = spark.createDataFrame(lrdd, schema).select("X")
df3 = df.crossJoin(df2)
Result of the cross join:
+-------+---+----+----+----+----+------+---------+
| name| a| b| c| d| e| f| X|
+-------+---+----+----+----+----+------+---------+
| Alice| 1| 2|null| red|null| null| [a, b]|
| Alice| 1| 2|null| red|null| null|[c, d, e]|
| Alice| 1| 2|null| red|null| null| [f]|
| Bob| 1|null|null|null|null| apple| [a, b]|
|Charlie| 2| 3|null|null|null|orange| [a, b]|
| Bob| 1|null|null|null|null| apple|[c, d, e]|
| Bob| 1|null|null|null|null| apple| [f]|
|Charlie| 2| 3|null|null|null|orange|[c, d, e]|
|Charlie| 2| 3|null|null|null|orange| [f]|
+-------+---+----+----+----+----+------+---------+
Now filter out the rows using a udf
from pyspark.sql.functions import udf, struct, collect_list

def foo(data):
    d = list(filter(lambda x: data[x], data['X']))
    print(d)
    if len(d) > 0:
        return False
    else:
        return True

udf_foo = udf(foo, BooleanType())

df4 = df3.filter(udf_foo(struct([df3[x] for x in df3.columns]))).select("name", 'X')
df4.show()
+-------+---------+
| name| X|
+-------+---------+
| Alice| [f]|
| Bob|[c, d, e]|
|Charlie|[c, d, e]|
+-------+---------+
Then use groupby and collect_list to get the desired output
df4.groupby("X").agg(collect_list("name").alias("name")).show()
+---------+--------------+
|        X|          name|
+---------+--------------+
|      [f]|       [Alice]|
|[c, d, e]|[Bob, Charlie]|
+---------+--------------+

How to get a running sum based on two columns using Spark Scala RDD

I have data in an RDD which has 4 columns: geog, product, time and price. I want to calculate the running sum based on geog and time.
The given data and the expected result were shown in the original post but are not reproduced here.
I need this in Spark Scala RDD. I am new to the Scala world; I can achieve this easily in SQL, but I want to do it in Spark Scala RDD, e.g. using map and flatMap.
Thanks in advance for your help.
This is possible by defining a window function:
>>> val data = List(
("India","A1","Q1",40),
("India","A2","Q1",30),
("India","A3","Q1",21),
("German","A1","Q1",50),
("German","A3","Q1",60),
("US","A1","Q1",60),
("US","A2","Q2",25),
("US","A4","Q1",20),
("US","A5","Q5",15),
("US","A3","Q3",10)
)
>>> val df = sc.parallelize(data).toDF("country", "part", "quarter", "result")
>>> df.show()
+-------+----+-------+------+
|country|part|quarter|result|
+-------+----+-------+------+
| India| A1| Q1| 40|
| India| A2| Q1| 30|
| India| A3| Q1| 21|
| German| A1| Q1| 50|
| German| A3| Q1| 60|
| US| A1| Q1| 60|
| US| A2| Q2| 25|
| US| A4| Q1| 20|
| US| A5| Q5| 15|
| US| A3| Q3| 10|
+-------+----+-------+------+
>>> import org.apache.spark.sql.expressions.Window
>>> import org.apache.spark.sql.functions.sum
>>> val window = Window.partitionBy("country").orderBy("part", "quarter")
>>> val resultDF = df.withColumn("agg", sum(df("result")).over(window))
>>> resultDF.show()
+-------+----+-------+------+---+
|country|part|quarter|result|agg|
+-------+----+-------+------+---+
| India| A1| Q1| 40| 40|
| India| A2| Q1| 30| 70|
| India| A3| Q1| 21| 91|
| US| A1| Q1| 60| 60|
| US| A2| Q2| 25| 85|
| US| A3| Q3| 10| 95|
| US| A4| Q1| 20|115|
| US| A5| Q5| 15|130|
| German| A1| Q1| 50| 50|
| German| A3| Q1| 60|110|
+-------+----+-------+------+---+
You can do this using Window functions; please take a look at the Databricks blog post introducing window functions in Spark SQL:
https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
Hope this helps.
Happy Sparking! Cheers, Fokko
I think this will help others too. I tried it with the Scala RDD API.
val fileName_test_1 = "C:\\venkat_workshop\\Qintel\\Data_Files\\test_1.txt"

val rdd1 = sc.textFile(fileName_test_1)
  // parse each line into (geog, product, time, price)
  .map { x =>
    (x.split(",")(0),
     x.split(",")(1),
     x.split(",")(2),
     x.split(",")(3).toDouble)
  }
  // group by (geog, time)
  .groupBy(x => (x._1, x._3))
  .mapValues {
    _.toList
      // sort each group by price, descending
      .sortWith { (a, b) => a._4 > b._4 }
      // running sum of price within each (geog, time) group
      .scanLeft(("", "", "", 0.0, 0.0)) { (a, b) =>
        (b._1, b._2, b._3, b._4, b._4 + a._5)
      }
      .tail // drop the scanLeft seed element
  }
  .flatMapValues(f => f)
  .values