PySpark: counting the number of occurrences of each distinct value

I think the question is related to: Spark DataFrame: count distinct values of every column
So basically I have a spark dataframe, with column A has values of 1,1,2,2,1
So I want to count how many times each distinct value (in this case, 1 and 2) appears in column A, and print something like
distinct_values | number_of_appearances
1 | 3
2 | 2

I'm just posting this because I think the other answer, with the alias, could be confusing. What you need are the groupBy and count methods:
from pyspark.sql.types import IntegerType

l = [1, 1, 2, 2, 1]
df = spark.createDataFrame(l, IntegerType())
df.groupBy('value').count().show()
+-----+-----+
|value|count|
+-----+-----+
| 1| 3|
| 2| 2|
+-----+-----+
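If you want the column names to match the desired output above, you can rename them after the aggregation. A small sketch (distinct_values and number_of_appearances are simply the labels asked for in the question):
df.groupBy('value').count() \
  .withColumnRenamed('value', 'distinct_values') \
  .withColumnRenamed('count', 'number_of_appearances') \
  .show()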

I am not sure if you are looking for the solution below, but here are my thoughts on it. Suppose you have a dataframe like this.
>>> listA = [(1,'AAA','USA'),(2,'XXX','CHN'),(3,'KKK','USA'),(4,'PPP','USA'),(5,'EEE','USA'),(5,'HHH','THA')]
>>> df = spark.createDataFrame(listA, ['id', 'name','country'])
>>> df.show();
+---+----+-------+
| id|name|country|
+---+----+-------+
| 1| AAA| USA|
| 2| XXX| CHN|
| 3| KKK| USA|
| 4| PPP| USA|
| 5| EEE| USA|
| 5| HHH| THA|
+---+----+-------+
I want to know which distinct country codes appear in this dataframe, printed under alias column names.
import pyspark.sql.functions as func
df.groupBy('country').count() \
  .select(func.col("country").alias("distinct_country"),
          func.col("count").alias("country_count")) \
  .show()
+----------------+-------------+
|distinct_country|country_count|
+----------------+-------------+
| THA| 1|
| USA| 4|
| CHN| 1|
+----------------+-------------+
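If you also want the most frequent countries first, you can sort on the aliased count column; a small sketch building on the snippet above:
df.groupBy('country').count() \
  .select(func.col("country").alias("distinct_country"),
          func.col("count").alias("country_count")) \
  .orderBy(func.desc("country_count")) \
  .show()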
Were you looking for something similar to this?

Related

Get last n items in pyspark

For a dataset like this -
+---+------+----------+
| id| item| timestamp|
+---+------+----------+
| 1| apple|2022-08-15|
| 1| peach|2022-08-15|
| 1| apple|2022-08-15|
| 1|banana|2022-08-14|
| 2| apple|2022-08-15|
| 2|banana|2022-08-14|
| 2|banana|2022-08-14|
| 2| water|2022-08-14|
| 3| water|2022-08-15|
| 3| water|2022-08-14|
+---+------+----------+
Can I use PySpark functions directly to get the last three items each user purchased in the past 5 days? I know a UDF can do that, but I am wondering if any existing function can achieve this.
My expected output is like below, although anything similar is okay too.
id last_three_item
1 [apple, peach, apple]
2 [water, banana, apple]
3 [water, water]
Thanks!
You can use a grouped-aggregate pandas_udf for this.
from pyspark.sql import functions as f
from pyspark.sql.types import ArrayType, StringType

@f.pandas_udf(returnType=ArrayType(StringType()), functionType=f.PandasUDFType.GROUPED_AGG)
def pudf_get_top_3(x):
    # x is a pandas Series holding the items of one id group
    return x.head(3).to_list()

(sdf
    .orderBy("timestamp")   # use f.desc("timestamp") here if "last" should mean most recent first
    .groupby("id")
    .agg(pudf_get_top_3("item").alias("last_three_item"))
    .show())
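If you would rather avoid a UDF, built-in functions can do the same job: collect (timestamp, item) pairs per id, sort them, and slice out three items. A sketch, assuming "last" means the most recent items first (column names follow the example data; a 5-day cutoff would be an extra filter on timestamp beforehand):
from pyspark.sql import functions as F

(sdf
    .withColumn("rec", F.struct("timestamp", "item"))                       # pair each item with its timestamp
    .groupBy("id")
    .agg(F.sort_array(F.collect_list("rec"), asc=False).alias("recs"))      # newest first
    .withColumn("last_three_item", F.expr("slice(transform(recs, r -> r.item), 1, 3)"))
    .drop("recs")
    .show(truncate=False))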

PySpark: Remove leading numbers and full stop from dataframe column

I'm trying to remove numbers and full stops that lead the names of horses in a betting dataframe.
The format is like this:
1.Horse Name
2.Horse Name
I would like the resulting df column to just have the horses name.
I've tried splitting the column at the full stop but am not getting the required result.
import pyspark.sql.functions as F
runners_returns = runners_returns.withColumn('runner_name', F.split(F.col('runner_name'), '.'))
Any help is greatly appreciated
With a Dataframe like the following.
df.show()
+---+-----------+
| ID|runner_name|
+---+-----------+
| 1| 123.John|
| 2| 5.42Anna|
| 3| .203Josh|
| 4| 102Paul|
+---+-----------+
You can remove the leading numbers and periods like this:
import pyspark.sql.functions as F
df = df.withColumn("runner_name",
                   F.regexp_replace('runner_name', r'(^[\d\.]+)', ''))
df.show()
+---+-----------+
| ID|runner_name|
+---+-----------+
| 1| John|
| 2| Anna|
| 3| Josh|
| 4| Paul|
+---+-----------+
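An alternative (a sketch, not part of the original answer) is to capture everything after the leading digits and full stops with regexp_extract:
import pyspark.sql.functions as F

# keep only the part of runner_name that follows any leading digits/periods
df = df.withColumn("runner_name",
                   F.regexp_extract("runner_name", r"^[\d.]*(.*)$", 1))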

Pyspark filter where value is in another dataframe

I have two data frames. I need to filter one to only show values that are contained in the other.
table_a:
+---+----+
|AID| foo|
+---+----+
| 1 | bar|
| 2 | bar|
| 3 | bar|
| 4 | bar|
+---+----+
table_b:
+---+
|BID|
+---+
| 1 |
| 2 |
+---+
In the end I want to filter table_a down to only the IDs that are in table_b, like this:
+--+----+
|ID| foo|
+--+----+
| 1| bar|
| 2| bar|
+--+----+
Here is what I'm trying to do
result_table = table_a.filter(table_b.BID.contains(table_a.AID))
But this doesn't seem to be working. It looks like I'm getting ALL values.
NOTE: I can't add any imports other than from pyspark.sql.functions import col.
You can join the two tables and specify how='left_semi'.
A left semi join returns the rows from the left side of the relation that have a match on the right; the right-hand columns are not included in the result, so the .drop("BID") below is optional.
result_table = (table_a
                .join(table_b, table_a.AID == table_b.BID, how="left_semi")
                .drop("BID"))
result_table.show()
+---+---+
|AID|foo|
+---+---+
| 1|bar|
| 2|bar|
+---+---+
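For reference, the same thing expressed in Spark SQL (a sketch; the temp view names are just illustrative):
table_a.createOrReplaceTempView("table_a")
table_b.createOrReplaceTempView("table_b")
spark.sql("""
    SELECT *
    FROM table_a
    LEFT SEMI JOIN table_b
    ON table_a.AID = table_b.BID
""").show()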
In case you have duplicates or multiple values in the second dataframe and you want to use only the distinct values, the approach below can be useful:
Create the DataFrames:
from pyspark.sql import functions as F

df = spark.createDataFrame([(1, "bar"), (2, "bar"), (3, "bar"), (4, "bar")], ["col1", "col2"])
df_lookup = spark.createDataFrame([(1, 1), (1, 2)], ["id", "val"])
df.show(truncate=True)
df_lookup.show()
+----+----+
|col1|col2|
+----+----+
| 1| bar|
| 2| bar|
| 3| bar|
| 4| bar|
+----+----+
+---+---+
| id|val|
+---+---+
| 1| 1|
| 1| 2|
+---+---+
Get all the unique values of the val column of the second dataframe into a list variable, then filter on it:
df_lookup_var = df_lookup.agg(F.collect_set("val").alias("val")).collect()[0][0]
print(df_lookup_var)
df = df.withColumn("case_col", F.when(F.col("col1").isin(df_lookup_var), F.lit("1")).otherwise(F.lit("0")))
df = df.filter(F.col("case_col") == F.lit("1"))
df.show()
+----+----+--------+
|col1|col2|case_col|
+----+----+--------+
| 1| bar| 1|
| 2| bar| 1|
+----+----+--------+
This should work too; collect the IDs from table_b into a Python list first (a Spark column has no tolist method):
bid_list = [row.BID for row in table_b.select("BID").collect()]
result_table = table_a.where(col("AID").isin(bid_list))

PySpark DataFrame multiply columns based on values in other columns

Pyspark newbie here. I have a dataframe, say,
+------+----+-----+
|    id|mode|count|
+------+----+-----+
|146360| DOS|   30|
|423541| UNO|    3|
+------+----+-----+
I want a dataframe with a new column aggregate equal to count * 2 when mode is 'DOS' and count * 1 when mode is 'UNO':
+------+----+-----+---------+
|    id|mode|count|aggregate|
+------+----+-----+---------+
|146360| DOS|   30|       60|
|423541| UNO|    3|        3|
+------+----+-----+---------+
Appreciate your inputs and also some pointers to best practices :)
Method 1: using pyspark.sql.functions with when:
from pyspark.sql.functions import when, col

df = df.withColumn('aggregate',
                   when(col('mode') == 'DOS', col('count') * 2)
                   .when(col('mode') == 'UNO', col('count') * 1)
                   .otherwise(col('count')))
Method 2: using SQL CASE expression with selectExpr:
df = df.selectExpr("*","CASE WHEN mode == 'DOS' THEN count*2 WHEN mode == 'UNO' THEN count*1 ELSE count END AS aggregate")
The result:
+------+----+-----+---------+
|    id|mode|count|aggregate|
+------+----+-----+---------+
|146360| DOS|   30|       60|
|423541| UNO|    3|        3|
+------+----+-----+---------+
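If more modes get added, chaining when clauses becomes verbose. Another option is to keep the multipliers in a small lookup dataframe and join it in; a sketch (the multipliers table is an assumption for illustration, not part of the question):
from pyspark.sql import functions as F

# hypothetical mode -> multiplier lookup
multipliers = spark.createDataFrame([("DOS", 2), ("UNO", 1)], ["mode", "multiplier"])

df = (df.join(multipliers, on="mode", how="left")
        .withColumn("aggregate", F.col("count") * F.coalesce(F.col("multiplier"), F.lit(1)))
        .drop("multiplier"))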

Exploding pipe separated data in spark

I have a Spark dataframe (input_dataframe); the data in it looks like below:
id value
1 a
2 x|y|z
3 t|u
I want an output_dataframe with the pipe-separated fields exploded; it should look like below:
id value
1 a
2 x
2 y
2 z
3 t
3 u
Please help me achieve the desired result using PySpark. Any help will be appreciated.
We can first split and then explode the value column using built-in functions, as below. Note that split takes a regular expression, so the pipe has to be escaped (here via the character class [|]).
>>> l=[(1,'a'),(2,'x|y|z'),(3,'t|u')]
>>> df = spark.createDataFrame(l,['id','val'])
>>> df.show()
+---+-----+
| id| val|
+---+-----+
| 1| a|
| 2|x|y|z|
| 3| t|u|
+---+-----+
>>> from pyspark.sql import functions as F
>>> df.select('id',F.explode(F.split(df.val,'[|]')).alias('value')).show()
+---+-----+
| id|value|
+---+-----+
| 1| a|
| 2| x|
| 2| y|
| 2| z|
| 3| t|
| 3| u|
+---+-----+
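The same thing can be written in Spark SQL with LATERAL VIEW, which produces the identical result (a sketch; the temp view name is just illustrative):
df.createOrReplaceTempView("input_dataframe")
spark.sql("""
    SELECT id, value
    FROM input_dataframe
    LATERAL VIEW explode(split(val, '[|]')) t AS value
""").show()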