This question already has answers here:
combine text from multiple rows in pyspark
(3 answers)
I have a dataframe as below:
+--------------------+--------------------+
| _id| statement|
+--------------------+--------------------+
| 1| ssssssss|
| 2| ssssssss|
| 3| aaaaaaaa|
| 4| aaaaaaaa|
+--------------------+--------------------+
After using df.dropDuplicates(['statement']), I got this:
+--------------------+--------------------+
| _id| statement|
+--------------------+--------------------+
| 1| ssssssss|
| 3| aaaaaaaa|
+--------------------+--------------------+
But actually, I want to keep the _id values, as below:
+--------------------+--------------------+
| _id| statement|
+--------------------+--------------------+
| 1, 2| ssssssss|
| 3, 4| aaaaaaaa|
+--------------------+--------------------+
How can I do this?
I finally found my answer in combine text from multiple rows in pyspark:
sdf.groupBy('statement').agg(F.collect_list('_id').alias("_id")).show()
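The desired output above shows _id as a comma-separated string rather than an array. A minimal sketch of that variant (casting _id to string so concat_ws accepts it):
from pyspark.sql import functions as F

(sdf.groupBy('statement')
    # collect the ids of each group and join them into "1, 2"-style strings
    .agg(F.concat_ws(', ', F.collect_list(F.col('_id').cast('string'))).alias('_id'))
    .select('_id', 'statement')
    .show())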
For a dataset like -
+---+------+----------+
| id| item| timestamp|
+---+------+----------+
| 1| apple|2022-08-15|
| 1| peach|2022-08-15|
| 1| apple|2022-08-15|
| 1|banana|2022-08-14|
| 2| apple|2022-08-15|
| 2|banana|2022-08-14|
| 2|banana|2022-08-14|
| 2| water|2022-08-14|
| 3| water|2022-08-15|
| 3| water|2022-08-14|
+---+------+----------+
Can I use PySpark functions directly to get the last three items each user purchased in the past 5 days? I know a UDF can do that, but I am wondering if any existing function can achieve this.
My expected output is like below, or anything similar is okay too.
+---+----------------------+
| id|       last_three_item|
+---+----------------------+
|  1| [apple, peach, apple]|
|  2|[water, banana, apple]|
|  3|        [water, water]|
+---+----------------------+
Thanks!
You can use pandas_udf for this.
from pyspark.sql import functions as f
from pyspark.sql.types import ArrayType, StringType

@f.pandas_udf(returnType=ArrayType(StringType()), functionType=f.PandasUDFType.GROUPED_AGG)
def pudf_get_top_3(x):
    # keep the first three items of each group
    return x.head(3).to_list()

# sort newest-first so head(3) picks the most recent items
sdf\
    .orderBy(f.col("timestamp").desc())\
    .groupBy("id")\
    .agg(pudf_get_top_3("item").alias("last_three_item"))\
    .show()
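As an aside, a UDF-free sketch using only built-in functions is also possible (assuming Spark 2.4+ for slice, and leaving out the 5-day filter, which would be a simple where on timestamp):
from pyspark.sql import functions as F

result = (
    sdf.groupBy("id")
       # collect (timestamp, item) pairs and sort them newest-first
       .agg(F.sort_array(F.collect_list(F.struct("timestamp", "item")), asc=False).alias("pairs"))
       # pull out the item field and keep the first three entries
       .withColumn("last_three_item", F.slice(F.col("pairs.item"), 1, 3))
       .select("id", "last_three_item")
)
result.show()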
This question already has answers here:
Process all columns / the entire row in a Spark UDF
(2 answers)
I want to apply a function to all rows of a DataFrame.
Example:
+---+---+---+
|  A|  B|  C|
+---+---+---+
|  1|  3|  5|
|  6|  2|  0|
|  8|  2|  7|
|  0|  9|  4|
+---+---+---+
Myfunction(df)
def Myfunction(df: DataFrame) = {
  // Apply sum of columns on each row
}
Wanted output:
1+3+5 = 9
6+2+0 = 8
...
How can that be done in Scala? I followed this but had no luck.
It's simple. You don't need to write any function for this; all you need to do is create a new column by summing up all the columns you want.
scala> df.show
+---+---+---+
| A| B| C|
+---+---+---+
| 1| 2| 3|
| 1| 2| 4|
| 1| 2| 5|
+---+---+---+
scala> df.withColumn("sum",col("A")+col("B")+col("C")).show
+---+---+---+---+
| A| B| C|sum|
+---+---+---+---+
| 1| 2| 3| 6|
| 1| 2| 4| 7|
| 1| 2| 5| 8|
+---+---+---+---+
Edited:
Well, you can run a map function on each row and get the sum using the row index / field name.
scala> df.map(x=>x.getInt(0) + x.getInt(1) + x.getInt(2)).toDF("sum").show
+---+
|sum|
+---+
| 6|
| 7|
| 8|
+---+
scala> df.map(x=>x.getInt(x.fieldIndex("A")) + x.getInt(x.fieldIndex("B")) + x.getInt(x.fieldIndex("C"))).toDF("sum").show
+---+
|sum|
+---+
| 6|
| 7|
| 8|
+---+
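A more generic variant of the same idea (a sketch, assuming every column is an Int) sums whatever columns the row has:
scala> df.map(row => (0 until row.length).map(row.getInt).sum).toDF("sum").show
+---+
|sum|
+---+
|  6|
|  7|
|  8|
+---+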
Map is the solution if you want to apply a function to every row of a dataframe. For every Row, you can return a tuple, and a new Dataset is made.
This is perfect when working with a Dataset or an RDD, but not really for a Dataframe. For your use case and for a Dataframe, I would recommend just adding a column and using column objects to do what you want.
// Using expr
df.withColumn("TOTAL", expr("A+B+C"))
// Using columns
df.withColumn("TOTAL", col("A")+col("B")+col("C"))
// Using dynamic selection of all columns
df.withColumn("TOTAL", df.colums.map(col).reduce((c1, c2) => c1 + c2))
In that case, you'll be very interested in this question.
UDF is also a good solution and is better explained here.
If you don't want to keep source columns, you can replace .withColumn(name, value) with .select(value.alias(name))
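For instance, a minimal sketch of that select form, reusing the column names from the example above:
import org.apache.spark.sql.functions.col

// keep only the computed total and drop the source columns A, B, C
df.select((col("A") + col("B") + col("C")).alias("TOTAL"))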
I think the question is related to: Spark DataFrame: count distinct values of every column
So basically I have a spark dataframe, whose column A has the values 1, 1, 2, 2, 1.
I want to count how many times each distinct value (in this case, 1 and 2) appears in column A, and print something like:
distinct_values | number_of_appearance
1               | 3
2               | 2
I'll just post this, as I think the other answer with the alias could be confusing. What you need are the groupBy and count methods:
from pyspark.sql.types import IntegerType

l = [1, 1, 2, 2, 1]
df = spark.createDataFrame(l, IntegerType())
df.groupBy('value').count().show()
+-----+-----+
|value|count|
+-----+-----+
| 1| 3|
| 2| 2|
+-----+-----+
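If you also want the most frequent values first, you can sort the result; a small sketch on top of the same dataframe:
import pyspark.sql.functions as F

# sort the value counts, most frequent first
df.groupBy('value').count().orderBy(F.desc('count')).show()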
I am not sure if you are looking for the solution below.
Here are my thoughts on this. Suppose you have a dataframe like this:
>>> listA = [(1,'AAA','USA'),(2,'XXX','CHN'),(3,'KKK','USA'),(4,'PPP','USA'),(5,'EEE','USA'),(5,'HHH','THA')]
>>> df = spark.createDataFrame(listA, ['id', 'name','country'])
>>> df.show();
+---+----+-------+
| id|name|country|
+---+----+-------+
| 1| AAA| USA|
| 2| XXX| CHN|
| 3| KKK| USA|
| 4| PPP| USA|
| 5| EEE| USA|
| 5| HHH| THA|
+---+----+-------+
I want to know which distinct country codes appear in this particular dataframe, and they should be printed with alias column names.
import pyspark.sql.functions as func
df.groupBy('country').count()\
    .select(func.col("country").alias("distinct_country"), func.col("count").alias("country_count"))\
    .show()
+----------------+-------------+
|distinct_country|country_count|
+----------------+-------------+
| THA| 1|
| USA| 4|
| CHN| 1|
+----------------+-------------+
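If instead you only need how many distinct countries there are, rather than the count per country, a small countDistinct sketch does it:
# count the number of distinct values in the country column
df.select(func.countDistinct("country").alias("distinct_countries")).show()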
Were you looking for something similar to this?
This question already has answers here:
Spark unionAll multiple dataframes
(4 answers)
I have an array of data frames called "dataFrames", which looks like this:
dataFrames(0)
+----------+--------------------+---------+-------------+
|Periodo | frutas|freq |prods_qty |
+----------+--------------------+---------+-------------+
| 1|Apple, Watermelon | 1| 2|
| 1|Banana, StrawBerry | 2| 2|
+----------+--------------------+---------+-------------+
dataFrames(1)
+----------+--------------------+---------+-------------+
|Periodo | frutas|freq |prods_qty |
+----------+--------------------+---------+-------------+
| 2|Naranjas, Fresas | 7| 2|
| 2|Pineapple, Apples | 9| 2|
+----------+--------------------+---------+-------------+
Well, I need to get a single dataframe like this:
+----------+--------------------+---------+-------------+
|Periodo | frutas|freq |prods_qty |
+----------+--------------------+---------+-------------+
| 1|Apple, Watermelon | 1| 2|
| 1|Banana, StrawBerry | 2| 2|
| 2|Naranjas, Fresas | 7| 2|
| 2|Pineapple, Apples | 9| 2|
+----------+--------------------+---------+-------------+
For this example the array has two elements, but it could be any size.
Is it possible to achieve this, or do I need to store the dataframes in a Hive table?
Thanks in advance
You can reduce a sequence or array of DataFrames together using unionAll:
val dfs = Array(df1, df2, df3)
val all = dfs.reduce(_ unionAll _)
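Note that since Spark 2.0, unionAll is deprecated in favour of union, so the same reduction becomes:
val all = dfs.reduce(_ union _)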
I have the below table:
+-------+---------+---------+
|movieId|movieName| genre|
+-------+---------+---------+
| 1| example1| action|
| 1| example1| thriller|
| 1| example1| romance|
| 2| example2|fantastic|
| 2| example2| action|
+-------+---------+---------+
What I am trying to achieve is to append the genre values together where the id and name are the same. Like this:
+-------+---------+---------------------------+
|movieId|movieName| genre |
+-------+---------+---------------------------+
| 1| example1| action|thriller|romance |
| 2| example2| action|fantastic |
+-------+---------+---------------------------+
Use groupBy and collect_list to get a list of all items with the same movie name, then combine these into a single string using concat_ws (if the order is important, first use sort_array). A small example with the given sample dataframe:
import org.apache.spark.sql.functions.{collect_list, concat_ws, sort_array}
import spark.implicits._

val df2 = df.groupBy("movieId", "movieName")
  .agg(collect_list($"genre").as("genre"))
  .withColumn("genre", concat_ws("|", sort_array($"genre")))
Gives the result:
+-------+---------+-----------------------+
|movieId|movieName|genre |
+-------+---------+-----------------------+
|1 |example1 |action|romance|thriller|
|2 |example2 |action|fantastic |
+-------+---------+-----------------------+
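For reference, a sketch of the same approach in PySpark, under the same assumptions:
from pyspark.sql import functions as F

df2 = (df.groupBy("movieId", "movieName")
         # sort the collected genres and join them with "|"
         .agg(F.concat_ws("|", F.sort_array(F.collect_list("genre"))).alias("genre")))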