I'm trying to translate the pandas code below to PySpark, but I'm having trouble with two points:
Is there an index in a Spark DataFrame?
How can I group by level=0 like that?
I didn't find anything useful in the documentation. If you have a hint, I'd be really grateful!
df.set_index('var1', inplace=True)
df['varGrouped'] = df.groupby(level=0)['var2'].min()
df.reset_index(inplace=True)
pandas_df.groupby(level=0) groups pandas_df by the first index level (this matters for MultiIndex data). Since your code sets only one index field, it is a simple group-by on the var1 field. The same can be replicated in PySpark with a groupBy() and taking the min of var2.
However, the aggregation result is stored in a new column within the same dataframe, so the number of rows does not decrease. This can be replicated by using min as a window function.
import pyspark.sql.functions as func
from pyspark.sql.window import Window as wd
data_sdf. \
withColumn('grouped_var', func.min('var2').over(wd.partitionBy('var1')))
withColumn helps you add/replace columns.
Here's an example using sample data.
data_sdf.show()
# +---+---+
# | a| b|
# +---+---+
# | 1| 2|
# | 1| 3|
# | 2| 5|
# | 2| 4|
# +---+---+
data_sdf. \
withColumn('grouped_res', func.min('b').over(wd.partitionBy('a'))). \
show()
# +---+---+-----------+
# | a| b|grouped_res|
# +---+---+-----------+
# | 1| 2| 2|
# | 1| 3| 2|
# | 2| 5| 4|
# | 2| 4| 4|
# +---+---+-----------+
Is there an index in a Spark DataFrame?
I think the pandas index has no equivalent in Spark, since Spark is not designed for row-level manipulation.
How can I group by level=0 like that?
Instead of grouping by level, you group directly by the columns that identify the granularity level.
I have 5 input files, say A, B, C, D and E. I want to load these files into a PySpark RDD and do some processing. Finally, I want to save the output in a folder named after the corresponding input file. Is this possible in Spark cluster mode?
Since an RDD/dataframe is essentially a bunch of rows distributed across multiple partitions, you lose track of the origin of the data once it has been read in from multiple sources. So my simple solution is to add a value to each row that tracks its origin. Using the dataframe API:
from pyspark.sql.functions import lit, col
from pyspark.sql import DataFrame
from functools import reduce
fnames = ['file_A.csv','file_B.csv','file_C.csv']
dfs = []
# 1. Read in data from individual sources and assign the filename as a string to a column
for fname in fnames:
    dfs.append(spark.read.format('csv')
               .option("header", "true")
               .load(fname)
               .withColumn('origin', lit(fname))
               )
# 2. Combine data
df = reduce(DataFrame.unionAll,dfs)
# +---+---+---+----------+
# | A| B| C| origin|
# +---+---+---+----------+
# | 1| 1| 1|file_A.csv|
# | 1| 1| 1|file_A.csv|
# | 1| 1| 1|file_A.csv|
# | 0| 0| 0|file_B.csv|
# | 0| 0| 0|file_B.csv|
# | 0| 0| 0|file_B.csv|
# | 2| 2| 2|file_C.csv|
# | 2| 2| 2|file_C.csv|
# | 2| 2| 2|file_C.csv|
# +---+---+---+----------+
# 3. Do processing
# ...
# 4. Subset the combined data by origin and write out each subset to file
for fname in fnames:
    out_fname = '_new.'.join(fname.split('.'))
    df.filter(col('origin') == fname)\
        .write.format('csv')\
        .option('header', True)\
        .save(out_fname)
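One caveat on the output name in step 4: '_new.'.join(fname.split('.')) inserts _new before every dot, so a name with more than one dot gets the suffix more than once. A minimal sketch of an alternative that touches only the final extension, using just the standard library. The helper name with_suffix and the second sample filename are hypothetical and not part of the answer above.

```python
import os

def with_suffix(fname, suffix="_new"):
    """Insert `suffix` before the final extension only (hypothetical helper)."""
    root, ext = os.path.splitext(fname)
    return root + suffix + ext

# 'file_A.csv' is from the answer above; 'data.v2.csv' is a made-up name with two dots
out_names = [with_suffix(f) for f in ["file_A.csv", "data.v2.csv"]]
```

Either way, keep in mind that Spark's writer treats the path passed to save() as a directory holding one or more part files, which fits the "filename as folder name" requirement in the question.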
In PySpark, there's the concept of coalesce(colA, colB, ...) which will, per row, take the first non-null value it encounters from those columns. However, I want coalesce(rowA, rowB, ...) i.e. the ability to, per column, take the first non-null value it encounters from those rows. I want to coalesce all rows within a group or window of rows.
For example, given the following dataset, I want to coalesce rows per category and ordered ascending by date.
+---------+-----------+------+------+
| category| date| val1| val2|
+---------+-----------+------+------+
| A| 2020-05-01| null| 1|
| A| 2020-05-02| 2| null|
| A| 2020-05-03| 3| null|
| B| 2020-05-01| null| null|
| B| 2020-05-02| 4| null|
| C| 2020-05-01| 5| 2|
| C| 2020-05-02| null| 3|
| D| 2020-05-01| null| 4|
+---------+-----------+------+------+
What I should get as the output is...
+---------+-----------+------+------+
| category| date| val1| val2|
+---------+-----------+------+------+
| A| 2020-05-01| 2| 1|
| B| 2020-05-01| 4| null|
| C| 2020-05-01| 5| 2|
| D| 2020-05-01| null| 4|
+---------+-----------+------+------+
First, I'll give the answer. Then, I'll point out the important bits.
from pyspark.sql import Window
from pyspark.sql.functions import col, dense_rank, first
df = ... # dataframe from question description
window = (
    Window
    .partitionBy("category")
    .orderBy(col("date").asc())
)
window_unbounded = (
    window
    .rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)
)
# use `c` as the loop variable so it is not confused with the imported col() function
cols_to_merge = [c for c in df.columns if c not in ["category", "date"]]
merged_cols = [first(c, True).over(window_unbounded).alias(c) for c in cols_to_merge]
df_merged = (
    df
    .select([col("category"), col("date")] + merged_cols)
    .withColumn("rank_col", dense_rank().over(window))
    .filter(col("rank_col") == 1)
    .drop("rank_col")
)
The row-wise analogue to coalesce is the aggregation function first. Specifically, we use first with ignorenulls = True so that we find the first non-null value.
When we use first, we have to be careful about the ordering of the rows it's applied to. Because groupBy doesn't allow us to maintain order within the groups, we use a Window.
The window itself must be unbounded on both ends rather than the default unbounded preceding to current row, else we'll end up with the first aggregation potentially running on subsets of our groups.
After we aggregate over the window, we alias the column back to its original name to keep the column names consistent.
We use a single select statement of cols rather than a for loop with df.withColumn(col, ...) because the select statement greatly reduces the query plan depth. Should you use the looped withColumn, you might hit a stack overflow error if you have too many columns.
Finally, we run dense_rank over our window, this time using the window with the default range, and filter to only the first-ranked rows. We use dense_rank here, but any ranking function that fits your needs will do. Note that dense_rank keeps every row tied on the earliest date, whereas row_number keeps exactly one row per group.
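To make the intended semantics concrete, here is a plain-Python sketch (not Spark code) of what the query computes on the question's sample data: per category, rows are ordered by date, each value column takes its first non-null entry, and the earliest date is kept. The function names first_non_null and coalesce_rows are made up for this illustration.

```python
from itertools import groupby
from operator import itemgetter

# Sample rows from the question: (category, date, val1, val2)
rows = [
    ("A", "2020-05-01", None, 1),
    ("A", "2020-05-02", 2, None),
    ("A", "2020-05-03", 3, None),
    ("B", "2020-05-01", None, None),
    ("B", "2020-05-02", 4, None),
    ("C", "2020-05-01", 5, 2),
    ("C", "2020-05-02", None, 3),
    ("D", "2020-05-01", None, 4),
]

def first_non_null(values):
    """Return the first value that is not None, or None if there is none."""
    return next((v for v in values if v is not None), None)

def coalesce_rows(rows):
    """Per category: keep the earliest date and, for each value column,
    the first non-null value in date order."""
    merged = []
    ordered = sorted(rows, key=itemgetter(0, 1))  # by category, then date
    for category, group in groupby(ordered, key=itemgetter(0)):
        group = list(group)
        value_columns = zip(*(r[2:] for r in group))  # transpose val1, val2, ...
        merged.append(
            (category, group[0][1])
            + tuple(first_non_null(column) for column in value_columns)
        )
    return merged

result = coalesce_rows(rows)
```

The result matches the expected output table in the question, including the null that remains for B's val2 and D's val1.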
I want to group by one column and then find the max of another column, and finally show all the columns of the rows that satisfy this condition. However, my code shows only 2 columns instead of all of them.
# Normal way of creating dataframe in pyspark
sdataframe_temp = spark.createDataFrame([
(2,2,'0-2'),
(2,23,'22-24')],
['a', 'b', 'c']
)
sdataframe_temp2 = spark.createDataFrame([
(4,6,'4-6'),
(5,7,'6-8')],
['a', 'b', 'c']
)
# Concat two different pyspark dataframe
sdataframe_union_1_2 = sdataframe_temp.union(sdataframe_temp2)
sdataframe_union_1_2_g = sdataframe_union_1_2.groupby('a').agg({'b':'max'})
sdataframe_union_1_2_g.show()
output:
+---+------+
| a|max(b)|
+---+------+
| 5| 7|
| 2| 23|
| 4| 6|
+---+------+
Expected output:
+---+------+-----+
|  a|max(b)|    c|
+---+------+-----+
|  5|     7|  6-8|
|  2|    23|22-24|
|  4|     6|  4-6|
+---+------+-----+
You can use a Window function to make it work:
Method 1: Using Window function
import pyspark.sql.functions as F
from pyspark.sql.window import Window
w = Window().partitionBy("a").orderBy(F.desc("b"))
(sdataframe_union_1_2
.withColumn('max_val', F.row_number().over(w) == 1)
.where("max_val == True")
.drop("max_val")
.show())
+---+---+-----+
| a| b| c|
+---+---+-----+
| 5| 7| 6-8|
| 2| 23|22-24|
| 4| 6| 4-6|
+---+---+-----+
Explanation
Window functions are useful when we want to attach a new column to the existing set of columns.
Here, the window partitions the rows by column a (partitionBy('a')) and sorts each partition by column b in descending order (F.desc('b')). This makes the first row in each partition the one with the maximum b.
Then we use F.row_number() to number the rows within each partition and keep only the rows where the row number equals 1.
Finally, we drop the new column since it is not being used after filtering the data frame.
Method 2: Using groupby + inner join
f = sdataframe_union_1_2.groupby('a').agg(F.max('b').alias('b'))
sdataframe_union_1_2.join(f, on=['a','b'], how='inner').show()
+---+---+-----+
| a| b| c|
+---+---+-----+
| 2| 23|22-24|
| 5| 7| 6-8|
| 4| 6| 4-6|
+---+---+-----+
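The two methods differ when several rows in a group share the maximum b: row_number() gives tied rows distinct numbers, so Method 1 keeps exactly one of them, while the inner join in Method 2 keeps every tied row. A plain-Python sketch of that difference (not Spark code). The sample rows, including the tied row (2, 23, '20-23'), and both function names are made up for illustration.

```python
from collections import defaultdict

# (a, b, c) rows; the first two rows tie on the maximum b for a == 2
rows = [(2, 23, "22-24"), (2, 23, "20-23"), (2, 2, "0-2"), (4, 6, "4-6")]

def one_per_group(rows):
    """Like filtering on row_number() == 1: exactly one row per value of a.
    Here the first row seen wins a tie; in Spark the winner is not deterministic."""
    best = {}
    for row in rows:
        a, b, _ = row
        if a not in best or b > best[a][1]:
            best[a] = row
    return sorted(best.values())

def all_max_rows(rows):
    """Like groupby + inner join: every row whose b equals its group's max."""
    max_b = defaultdict(lambda: float("-inf"))
    for a, b, _ in rows:
        max_b[a] = max(max_b[a], b)
    return sorted(r for r in rows if r[1] == max_b[r[0]])

single = one_per_group(rows)
with_ties = all_max_rows(rows)
```

If exactly one row per group is required, Method 1 is the safer choice; if all tied rows should be shown, Method 2 fits better.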
Related question: How to drop columns which have same values in all rows via pandas or spark dataframe?
So I have a pyspark dataframe, and I want to drop the columns where all values are the same in all rows while keeping other columns intact.
However the answers in the above question are only for pandas. Is there a solution for pyspark dataframe?
Thanks
You can apply the countDistinct() aggregation function to each column to get its number of distinct values. A column with a count of 1 has the same value in all rows.
from pyspark.sql.functions import col, countDistinct
# apply countDistinct on each column and collect the single result row as a dict
col_counts = df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns)).collect()[0].asDict()
# select the columns whose distinct count is 1
cols_to_drop = [c for c in df.columns if col_counts[c] == 1]
# drop the selected columns
df.drop(*cols_to_drop).show()
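One thing to watch for: a distinct count skips nulls, so a column holding a single value plus some nulls also gets a count of 1, and an all-null column gets 0 and is not dropped. A plain-Python sketch of the same selection logic (not Spark code) that makes this visible. The function name constant_columns and the sample rows are made up for illustration.

```python
def constant_columns(rows, columns):
    """Return the columns with exactly one distinct non-null value,
    mirroring a per-column COUNT(DISTINCT ...) == 1 check."""
    distinct = {c: set() for c in columns}
    for row in rows:
        for c, value in zip(columns, row):
            if value is not None:  # nulls are not counted as a distinct value
                distinct[c].add(value)
    return [c for c in columns if len(distinct[c]) == 1]

columns = ["a", "b", "c", "d"]
rows = [
    (1, "x", None, None),
    (2, "x", None, None),
    (3, "x", 7, None),
]
to_drop = constant_columns(rows, columns)
```

Here column c is flagged even though it is not the same in all rows, and the all-null column d is kept. If that matters for your data, handle nulls explicitly before dropping.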
You can use the approx_count_distinct function to count the number of distinct elements in a column. If a column has just one distinct value, remove it.
Creating the DataFrame
from pyspark.sql.functions import approx_count_distinct
myValues = [(1,2,2,0),(2,2,2,0),(3,2,2,0),(4,2,2,0),(3,1,2,0)]
df = sqlContext.createDataFrame(myValues,['value1','value2','value3','value4'])
df.show()
+------+------+------+------+
|value1|value2|value3|value4|
+------+------+------+------+
| 1| 2| 2| 0|
| 2| 2| 2| 0|
| 3| 2| 2| 0|
| 4| 2| 2| 0|
| 3| 1| 2| 0|
+------+------+------+------+
Counting the number of distinct elements and converting the result into a dictionary.
count_distinct_df=df.select([approx_count_distinct(x).alias("{0}".format(x)) for x in df.columns])
count_distinct_df.show()
+------+------+------+------+
|value1|value2|value3|value4|
+------+------+------+------+
| 4| 2| 1| 1|
+------+------+------+------+
dict_of_columns = count_distinct_df.toPandas().to_dict(orient='list')
dict_of_columns
{'value1': [4], 'value2': [2], 'value3': [1], 'value4': [1]}
# Store in a list the keys (column names) that have just 1 distinct value.
distinct_columns=[k for k,v in dict_of_columns.items() if v == [1]]
distinct_columns
['value3', 'value4']
Drop the columns that have only one distinct value
df=df.drop(*distinct_columns)
df.show()
+------+------+
|value1|value2|
+------+------+
| 1| 2|
| 2| 2|
| 3| 2|
| 4| 2|
| 3| 1|
+------+------+
I have a requirement to save my resulting Spark SQL dataframe to an Oracle table.
For this I need to generate UPDATE scripts so that I can update the corresponding records in the Oracle table.
I know that the primary key of the Oracle table is the single column 'c1', which is the first column in my dataframe.
val df = sc.parallelize(Seq((1, 10,"a"), (2,20,"b"), (3, 30,"c"),(4, 40,"d"),(5, 50,"e"),(6, 60,"f"))).toDF("c1","c2","c3")
df.show()
+---+---+---+
| c1| c2| c3|
+---+---+---+
| 1| 10| a|
| 2| 20| b|
| 3| 30| c|
| 4| 40| d|
| 5| 50| e|
| 6| 60| f|
+---+---+---+
Since I have 6 records in the dataframe, I need to generate 6 UPDATE scripts like the one below.
UPDATE Oracle_table_name SET c2 = 10,c3='a' WHERE c1 =1
And similarly for c1 = 2,3,4,5,6.
Note: I know that all these c1 values (1 to 6) already exist in the Oracle table, which is why I only want to generate UPDATE statements.
How can I implement this?
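One possible starting point, sketched in plain Python to match the other snippets on this page rather than the Scala used in the question: build one UPDATE statement per row, with every non-key column in the SET clause and the key column in the WHERE clause. The function name build_update is made up, and the rows list stands in for rows collected from the dataframe. The sketch uses Oracle-style named bind variables (:c2, :c3) instead of inlined literals, which is an assumption about how the statements will be executed and avoids quoting problems with string values.

```python
def build_update(table, columns, key, row):
    """Return (sql, binds) for one row, using named bind variables.
    Hypothetical helper; `key` is the primary-key column name."""
    binds = dict(zip(columns, row))
    assignments = ", ".join(f"{c} = :{c}" for c in columns if c != key)
    sql = f"UPDATE {table} SET {assignments} WHERE {key} = :{key}"
    return sql, binds

columns = ["c1", "c2", "c3"]
# stand-in for the first rows of the dataframe in the question
rows = [(1, 10, "a"), (2, 20, "b"), (3, 30, "c")]

statements = [build_update("Oracle_table_name", columns, "c1", r) for r in rows]
```

Each entry pairs the statement text with the values for that row, so the pairs can be handed to whatever database client executes them.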