I'm working with a Spark DataFrame and trying to create a new table with an aggregation using groupBy:
Here is a sample of my data:
and this is the desired result:
I tried this code: data.groupBy("id1").agg(countDistinct("id2").alias("id2"), sum("value").alias("value"))
Can anyone help, please? Thank you.
Try using the code below -
from pyspark.sql.functions import *
df = spark.createDataFrame([('id11', 'id21', 1), ('id11', 'id22', 2), ('id11', 'id23', 3), ('id12', 'id21', 2), ('id12', 'id23', 1), ('id13', 'id23', 2), ('id13', 'id21', 8)], ["id1", "id2","value"])
Aggregated Data -
df.groupBy("id1").agg(count("id2"),sum("value")).show()
Output -
+----+----------+----------+
| id1|count(id2)|sum(value)|
+----+----------+----------+
|id11| 3| 6|
|id12| 2| 3|
|id13| 2| 10|
+----+----------+----------+
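If you actually need the number of distinct id2 values per group, as in your original attempt, countDistinct can be aliased the same way; a small sketch reusing the df defined above:
from pyspark.sql.functions import countDistinct, sum as sum_

# count only distinct id2 values per id1 and rename the result columns
df.groupBy("id1").agg(
    countDistinct("id2").alias("id2"),
    sum_("value").alias("value")
).show()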
Here's how to group and aggregate multiple columns with PySpark, giving each aggregate an alias:
import pyspark.sql.functions as F
from pyspark.sql.functions import col
df.groupBy("id1").agg(F.count(col("id2")).alias('id2_count'),
F.sum(col('value')).alias("value_sum")).show()
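groupBy also accepts several columns at once; a minimal sketch, reusing the same df, that groups by both id1 and id2:
import pyspark.sql.functions as F

# the grouping key can be several columns; aggregations work the same way
df.groupBy("id1", "id2").agg(
    F.count("*").alias("row_count"),
    F.sum("value").alias("value_sum")
).show()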
I'm trying in vain to use the PySpark substring function inside a UDF. Below is my code snippet -
from pyspark.sql.functions import substring, udf

def my_udf(my_str):
    try:
        my_sub_str = substring(my_str, 1, 2)
    except Exception:
        pass
    else:
        return my_sub_str

apply_my_udf = udf(my_udf)
df = input_data.withColumn("sub_str", apply_my_udf(input_data.col0))
The sample data is-
ABC1234
DEF2345
GHI3456
But when I print the df, I don't get any value in the new column "sub_str" as shown below -
[Row(col0='ABC1234', sub_str=None), Row(col0='DEF2345', sub_str=None), Row(col0='GHI3456', sub_str=None)]
Can anyone please let me know what I'm doing wrong?
You don't need a UDF to use substring. pyspark.sql.functions.substring builds a Column expression meant for DataFrame operations, so calling it on a plain Python string inside a UDF won't do what you expect, which is why you get None. Here's a cleaner and faster way:
>>> from pyspark.sql import functions as f
>>> df.show()
+-------+
| data|
+-------+
|ABC1234|
|DEF2345|
|GHI3456|
+-------+
>>> df.withColumn("sub_str", f.substring("data", 1, 2)).show()
+-------+-------+
| data|sub_str|
+-------+-------+
|ABC1234| AB|
|DEF2345| DE|
|GHI3456| GH|
+-------+-------+
If you need to use udf for that, you could also try something like:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

input_data = spark.createDataFrame([
    (1, "ABC1234"),
    (2, "DEF2345"),
    (3, "GHI3456")
], ("id", "col0"))

# plain Python slicing inside the UDF
udf1 = udf(lambda x: x[0:2], StringType())
input_data.withColumn('sub_str', udf1('col0')).show()
+---+-------+-------+
| id| col0|sub_str|
+---+-------+-------+
| 1|ABC1234| AB|
| 2|DEF2345| DE|
| 3|GHI3456| GH|
+---+-------+-------+
However, as Mohamed Ali JAMAOUI wrote, you can easily do this without a UDF here.
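Another UDF-free option, if you prefer to stay column-based, is the Column.substr method; a quick sketch against the input_data frame above:
from pyspark.sql.functions import col

# Column.substr(startPos, length) works directly on the column, no UDF needed
input_data.withColumn("sub_str", col("col0").substr(1, 2)).show()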
I want to show all the filtered rows from the combined dataframes that match the condition.
Code:
# I'm including all the imports and setup from the beginning so that future readers can run this end to end.
# Import libraries
from pyspark.sql import SparkSession
from pyspark import SparkContext
import pandas as pd
import numpy as np
# Initiate the session
spark = SparkSession \
    .builder \
    .appName('Operations') \
    .getOrCreate()
# sc = SparkContext()
sc = SparkContext.getOrCreate()
# Create dataframe 1
sdataframe_temp = spark.createDataFrame([
(1,2,'3'),
(2,2,'yes')],
['a', 'b', 'c']
)
# Create Dataframe 2
sdataframe_temp2 = spark.createDataFrame([
(4,6,'yes'),
(5,7,'yes')],
['a', 'b', 'c']
)
# Combine dataframe
sdataframe_union_1_2 = sdataframe_temp.union(sdataframe_temp2)
# Filter out the columns based on respective rules
# I wish to stick with the dataframe method if possible.
sdataframe_temp\
    .filter(sdataframe_union_1_2['c'] == 'yes')\
    .select(['a', 'b'])\
    .show()
Output:
+---+---+
| a| b|
+---+---+
| 2| 2|
+---+---+
Expected output:
+---+---+
|  a|  b|
+---+---+
|  2|  2|
|  4|  6|
|  5|  7|
+---+---+
Can anyone please give some suggestions or improvements?
Here's a way using unionByName:
df = (sdataframe_temp
.unionByName(sdataframe_temp2)
.where("c == 'yes'")
.drop('c'))
df.show()
+---+---+
| a| b|
+---+---+
| 2| 2|
| 4| 6|
| 5| 7|
+---+---+
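The difference matters here: union matches columns by position, while unionByName matches them by name, so unionByName is safer if the two frames were ever built with their columns in a different order. A small illustrative sketch (the frames below are hypothetical, not from the question):
# hypothetical frames with the same columns in a different order
df_a = spark.createDataFrame([(1, 2, 'yes')], ['a', 'b', 'c'])
df_b = spark.createDataFrame([('yes', 4, 6)], ['c', 'a', 'b'])

df_a.unionByName(df_b).show()   # columns are aligned by name
# df_a.union(df_b) would align them purely by position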
You should change the last lines of your code. To use the col function, you need to import it from pyspark.sql.functions:
from pyspark.sql.functions import *
sdataframe_union_1_2\
    .filter(col('c') == 'yes')\
    .select(['a', 'b'])\
    .show()
or
sdataframe_union_1_2\
    .filter(sdataframe_union_1_2['c'] == 'yes')\
    .select(['a', 'b'])\
    .show()
You have to select data from sdataframe_union_1_2, but you are selecting from sdataframe_temp; that's why you are getting only one record.
I have a spark dataframe with multiple columns in it. I want to find out and remove rows which have duplicated values in a column (the other columns can be different).
I tried using dropDuplicates(col_name), but it only drops the duplicate entries while still keeping one record in the dataframe. What I need is to remove all entries that originally had duplicate values.
I am using Spark 1.6 and Scala 2.10.
I would use window functions for this. Let's say you want to remove rows with duplicate id values:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count
import spark.implicits._  // $-syntax; assumes your SparkSession is called spark

df
  .withColumn("cnt", count("*").over(Window.partitionBy($"id")))
  .where($"cnt" === 1)
  .drop($"cnt")
  .show()
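For anyone wanting the same thing from PySpark, a rough equivalent of this window-based approach might look like the following sketch (not part of the original Scala question):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# keep only the rows whose id value occurs exactly once
w = Window.partitionBy("id")
df.withColumn("cnt", F.count("*").over(w)) \
  .where(F.col("cnt") == 1) \
  .drop("cnt") \
  .show()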
This can be done by grouping by the column (or columns) to look for duplicates in and then aggregate and filter the results.
Example dataframe df:
+---+---+
| id|num|
+---+---+
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
| 4| 5|
+---+---+
Grouping by the id column to remove its duplicates (the last two rows):
val df2 = df.groupBy("id")
.agg(first($"num").as("num"), count($"id").as("count"))
.filter($"count" === 1)
.select("id", "num")
This will give you:
+---+---+
| id|num|
+---+---+
| 1| 1|
| 2| 2|
| 3| 3|
+---+---+
Alternatively, it can be done using a join. It will be slower, but if there are a lot of columns there is no need to use first($"num").as("num") for each one to keep them.
val df2 = df.groupBy("id").agg(count($"id").as("count")).filter($"count" === 1).select("id")
val df3 = df.join(df2, Seq("id"), "inner")
I added a killDuplicates() method to the open-source spark-daria library that uses @Raphael Roth's solution. Here's how to use the code:
import com.github.mrpowers.spark.daria.sql.DataFrameExt._
df.killDuplicates(col("id"))
// you can also supply multiple Column arguments
df.killDuplicates(col("id"), col("another_column"))
Here's the code implementation:
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count}

object DataFrameExt {
  implicit class DataFrameMethods(df: DataFrame) {
    def killDuplicates(cols: Column*): DataFrame = {
      df
        .withColumn(
          "my_super_secret_count",
          count("*").over(Window.partitionBy(cols: _*))
        )
        .where(col("my_super_secret_count") === 1)
        .drop(col("my_super_secret_count"))
    }
  }
}
You might want to leverage the spark-daria library to keep this logic out of your codebase.
I need to join a dataframe with a string column to one with array of string so that if one of the values in the array is matched, the rows will join.
I tried this, but I guess it's not supported.
Any other way to do this?
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
val sparkConf = new SparkConf().setMaster("local[*]").setAppName("test")
val spark = SparkSession.builder().config(sparkConf).getOrCreate()
import spark.implicits._
val left = spark.sparkContext.parallelize(Seq(1, 2, 3)).toDF("col1")
val right = spark.sparkContext.parallelize(Seq((Array(1, 2), "Yes"),(Array(3),"No"))).toDF("col1", "col2")
left.join(right,"col1")
Throws:
org.apache.spark.sql.AnalysisException: cannot resolve '(col1 = col1)' due to data type mismatch: differing types in '(col1 = col1)' (int and array).;;
One option is to create a UDF for building your join condition:
import org.apache.spark.sql.functions._
import scala.collection.mutable.WrappedArray
val left = spark.sparkContext.parallelize(Seq(1, 2, 3)).toDF("col1")
val right = spark.sparkContext.parallelize(Seq((Array(1, 2), "Yes"),(Array(3),"No"))).toDF("col1", "col2")
val checkValue = udf {
(array: WrappedArray[Int], value: Int) => array.contains(value)
}
val result = left.join(right, checkValue(right("col1"), left("col1")), "inner")
result.show
+----+------+----+
|col1| col1|col2|
+----+------+----+
| 1|[1, 2]| Yes|
| 2|[1, 2]| Yes|
| 3| [3]| No|
+----+------+----+
The most succinct way to do this is to use the array_contains Spark SQL expression, as shown below. That said, I've compared its performance with that of doing an explode and join (as shown in another answer), and the explode approach seems more performant.
import org.apache.spark.sql.functions.expr
import spark.implicits._
val left = Seq(1, 2, 3).toDF("col1")
val right = Seq((Array(1, 2), "Yes"),(Array(3),"No")).toDF("col1", "col2").withColumnRenamed("col1", "col1_array")
val joined = left.join(right, expr("array_contains(col1_array, col1)")).show
+----+----------+----+
|col1|col1_array|col2|
+----+----------+----+
| 1| [1, 2]| Yes|
| 2| [1, 2]| Yes|
| 3| [3]| No|
+----+----------+----+
Note that you can't use the org.apache.spark.sql.functions.array_contains function directly, as it requires the second argument to be a literal rather than a column expression.
You could use explode on your array column before the join. explode creates a new row for each element in the array:
import org.apache.spark.sql.functions.explode

val rightExploded = right.withColumn("exploded_col", explode(right("col1")))
rightExploded.show()
+------+----+------------+
|  col1|col2|exploded_col|
+------+----+------------+
|[1, 2]| Yes|           1|
|[1, 2]| Yes|           2|
|   [3]|  No|           3|
+------+----+------------+
Then you can easily join with your first dataset.
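For reference, a rough PySpark sketch of the same explode-then-join idea (assuming frames named left and right as in the question):
from pyspark.sql.functions import explode, col

# explode the array, join on the exploded value, then drop the helper column
exploded = right.withColumn("exploded_col", explode(col("col1")))
left.join(exploded, left["col1"] == exploded["exploded_col"]) \
    .drop("exploded_col") \
    .show()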
I'm trying to get the distinct values of a column in a PySpark DataFrame and then save them in a list; at the moment the list contains Row(no_children=0),
but I need only the values, as I will use the list in another part of my code.
So, ideally only all_values=[0,1,2,3,4]
all_values=sorted(list(df1.select('no_children').distinct().collect()))
all_values
[Row(no_children=0),
Row(no_children=1),
Row(no_children=2),
Row(no_children=3),
Row(no_children=4)]
This takes around 15 seconds to run; is that normal?
Thank you very much!
You can use collect_set from the functions module to get a column's distinct values. Here:
from pyspark.sql import functions as F
>>> df1.show()
+-----------+
|no_children|
+-----------+
| 0|
| 3|
| 2|
| 4|
| 1|
| 4|
+-----------+
>>> df1.select(F.collect_set('no_children').alias('no_children')).first()['no_children']
[0, 1, 2, 3, 4]
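Note that collect_set does not guarantee any ordering, so if you need the values sorted, sort the collected list on the driver; a small sketch:
# collect the distinct values, then sort them in plain Python
values = df1.select(F.collect_set('no_children').alias('nc')).first()['nc']
all_values = sorted(values)   # e.g. [0, 1, 2, 3, 4]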
You could do something like this to get only the values (avoiding list as a variable name, since it shadows the built-in):
values = [r.no_children for r in all_values]
values
[0, 1, 2, 3, 4]
Try this:
all_values = df1.select('no_children').distinct().rdd.flatMap(list).collect()