Azure Databricks analyze if the columns names are lower case, using islower() function - pyspark

This is my logic on pyspark:
df2 = spark.sql(f" SELECT tbl_name, column_name, data_type, current_count FROM {database_name}.{tablename}")
query_df = spark.sql(f"SELECT tbl_name, COUNT(column_name) as `num_cols` FROM {database_name}.{tablename} GROUP BY tbl_name")
df_join = df2.join(query_df,['tbl_name'])
Then I want to add another column to the DataFrame, called 'column_case_lower', that indicates whether each column_name value is lower case, using the islower() function.
This is the logic I'm using for the check:
df_join.withColumn("column_case_lower",
when((col("column_name").islower()) == 'true'.otherwise('false'))
-- The error is: unexpected EOF while parsing
expecting something like this:

islower() can't be applied to a Column, because it is a Python str method. Use a UDF instead, as in the code below.
from pyspark.sql.functions import col, udf, when
from pyspark.sql.types import BooleanType
def checkCase(col_value):
    # Pass nulls through instead of failing on None.islower()
    return col_value.islower() if col_value is not None else None
# checkCase returns a bool, so the UDF is declared with BooleanType
checkUDF = udf(checkCase, BooleanType())
df.withColumn("new_col", when(checkUDF(col('column_name')), "True")
    .otherwise("False")).show()
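To see what the UDF computes for each value, here is a minimal plain-Python sketch of the same check, without Spark. The sample names are partly hypothetical (only tbl_name and current_count appear in the query above). Note that str.islower() returns False for strings that contain no cased characters, such as digit-only names, which may or may not be what you want.

```python
def is_lower_case(name):
    """Same check the UDF applies to each column_name value."""
    if name is None:
        # Keep nulls as nulls instead of raising AttributeError
        return None
    return name.islower()

# Hypothetical sample of column_name values
samples = ["tbl_name", "Column_Name", "current_count", "123", "", None]
results = {name: is_lower_case(name) for name in samples}

for name, flag in results.items():
    print(repr(name), "->", flag)
```

Underscores and digits are ignored by islower(); only the cased characters decide the result.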

Related

how to pivot /transpose rows of a column in to individual columns in spark-scala without using the pivot method

Please check below image for the reference to my use case
You can get the same result without using pivot by adding the columns manually, if you know all the names of the new columns:
import org.apache.spark.sql.functions.{col, when}
dataframe
.withColumn("cheque", when(col("ttype") === "cheque", col("tamt")))
.withColumn("draft", when(col("ttype") === "draft", col("tamt")))
.drop("tamt", "ttype")
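As this solution does not trigger a shuffle, it is likely to be faster than pivot. Note that it keeps one output row per input row instead of grouping them.
It can be generalized for the case where you don't know the names of the columns in advance. However, the distinct and collect() below launch an extra job, so you should benchmark to check whether pivot performs better: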
import org.apache.spark.sql.functions.{col, when}
val newColumnNames = dataframe.select("ttype").distinct.collect().map(_.getString(0))
newColumnNames
.foldLeft(dataframe)((df, columnName) => {
df.withColumn(columnName, when(col("ttype") === columnName, col("tamt")))
})
.drop("tamt", "ttype")
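The foldLeft generalization can be illustrated outside Spark. The following plain-Python sketch is not Spark code: rows are modeled as dicts and with_column is a hypothetical helper. It folds over the distinct ttype values and adds one conditional column per value, using the two sample rows from the question.

```python
from functools import reduce

rows = [
    {"tdate": "2020-10-15", "ttype": "draft", "tamt": 5000},
    {"tdate": "2020-10-18", "ttype": "cheque", "tamt": 7000},
]

def with_column(rows, name, fn):
    """Hypothetical stand-in for DataFrame.withColumn on a list of dicts."""
    return [{**row, name: fn(row)} for row in rows]

# Equivalent of select("ttype").distinct.collect()
new_column_names = sorted({row["ttype"] for row in rows})

# Equivalent of foldLeft: start from rows, add one column per distinct value
widened = reduce(
    lambda acc, name: with_column(
        acc, name, lambda row, n=name: row["tamt"] if row["ttype"] == n else None
    ),
    new_column_names,
    rows,
)

# Equivalent of drop("tamt", "ttype")
result = [
    {key: value for key, value in row.items() if key not in ("tamt", "ttype")}
    for row in widened
]
print(result)
```

Each input row still produces exactly one output row, with None in the columns that do not match its ttype.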
Use the groupBy, pivot and agg functions, as in the code below.
Inline comments explain each step.
scala> df.show(false)
+----------+------+----+
|tdate     |ttype |tamt|
+----------+------+----+
|2020-10-15|draft |5000|
|2020-10-18|cheque|7000|
+----------+------+----+
scala> df
.groupBy($"tdate") // group rows by the tdate column
.pivot("ttype",Seq("cheque","draft")) // pivot on ttype; "cheque" and "draft" become the new column names
.agg(first("tamt")) // take the first tamt value in each group
.show(false)
+----------+------+-----+
|tdate     |cheque|draft|
+----------+------+-----+
|2020-10-18|7000  |null |
|2020-10-15|null  |5000 |
+----------+------+-----+
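For comparison with the manual approach, here is a plain-Python sketch of what groupBy + pivot + first does conceptually: one output row per tdate, with one column per listed pivot value. This is only a model of the semantics, not Spark code. The pivot_first helper and the third input row are hypothetical, the latter added to show two rows collapsing into one.

```python
def pivot_first(rows, group_key, pivot_key, value_key, pivot_values):
    """Group by group_key and keep the first value seen per pivot value."""
    table = {}
    seen = set()
    for row in rows:
        group = row[group_key]
        record = table.setdefault(
            group, {group_key: group, **{value: None for value in pivot_values}}
        )
        cell = (group, row[pivot_key])
        if row[pivot_key] in pivot_values and cell not in seen:
            record[row[pivot_key]] = row[value_key]
            seen.add(cell)
    return list(table.values())

rows = [
    {"tdate": "2020-10-15", "ttype": "draft", "tamt": 5000},
    {"tdate": "2020-10-18", "ttype": "cheque", "tamt": 7000},
    {"tdate": "2020-10-15", "ttype": "cheque", "tamt": 1200},  # hypothetical extra row
]

pivoted = pivot_first(rows, "tdate", "ttype", "tamt", ["cheque", "draft"])
print(pivoted)
```

With the extra row, 2020-10-15 ends up as a single record holding both a cheque and a draft amount, which the column-by-column approach would leave as two separate rows.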

How to add prefix and suffix values for a column in spark dataframe using scala [duplicate]

How do we concatenate two columns in an Apache Spark DataFrame?
Is there any function in Spark SQL which we can use?
With raw SQL you can use CONCAT:
In Python
df = sqlContext.createDataFrame([("foo", 1), ("bar", 2)], ("k", "v"))
df.registerTempTable("df")
sqlContext.sql("SELECT CONCAT(k, ' ', v) FROM df")
In Scala
import sqlContext.implicits._
val df = sc.parallelize(Seq(("foo", 1), ("bar", 2))).toDF("k", "v")
df.registerTempTable("df")
sqlContext.sql("SELECT CONCAT(k, ' ', v) FROM df")
Since Spark 1.5.0 you can use concat function with DataFrame API:
In Python :
from pyspark.sql.functions import concat, col, lit
df.select(concat(col("k"), lit(" "), col("v")))
In Scala :
import org.apache.spark.sql.functions.{concat, lit}
df.select(concat($"k", lit(" "), $"v"))
There is also concat_ws function which takes a string separator as the first argument.
Here's how you can do custom naming
import pyspark
from pyspark.sql import functions as sf
sc = pyspark.SparkContext()
sqlc = pyspark.SQLContext(sc)
df = sqlc.createDataFrame([('row11','row12'), ('row21','row22')], ['colname1', 'colname2'])
df.show()
gives,
+--------+--------+
|colname1|colname2|
+--------+--------+
|   row11|   row12|
|   row21|   row22|
+--------+--------+
create new column by concatenating:
df = df.withColumn('joined_column',
sf.concat(sf.col('colname1'),sf.lit('_'), sf.col('colname2')))
df.show()
+--------+--------+-------------+
|colname1|colname2|joined_column|
+--------+--------+-------------+
|   row11|   row12|  row11_row12|
|   row21|   row22|  row21_row22|
+--------+--------+-------------+
One option to concatenate string columns in Spark Scala is using concat.
Null values need to be handled, because if any of the columns is null, concat returns null even if the other columns contain data.
Using concat and withColumn:
val newDf =
df.withColumn(
"NEW_COLUMN",
concat(
when(col("COL1").isNotNull, col("COL1")).otherwise(lit("null")),
when(col("COL2").isNotNull, col("COL2")).otherwise(lit("null"))))
Using concat and select:
val newDf = df.selectExpr("concat(nvl(COL1, ''), nvl(COL2, '')) as NEW_COLUMN")
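With both approaches you will have a NEW_COLUMN whose value is the concatenation of COL1 and COL2 from your original df. Note that the first approach substitutes the literal string "null" for missing values, while the second substitutes an empty string.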
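The null behaviour described above can be modeled in a few lines of plain Python. This is a sketch of the semantics only: the concat, concat_ws and nvl functions below are local stand-ins for the Spark functions of the same name, and the sample row is hypothetical.

```python
def concat(*values):
    """Model of Spark's concat: the result is null if any input is null."""
    if any(value is None for value in values):
        return None
    return "".join(values)

def concat_ws(sep, *values):
    """Model of Spark's concat_ws: null inputs are skipped."""
    return sep.join(value for value in values if value is not None)

def nvl(value, default):
    """Model of nvl: replace null with a default."""
    return default if value is None else value

row = {"COL1": "abc", "COL2": None}  # hypothetical row with one null column

plain = concat(row["COL1"], row["COL2"])
with_null_literal = concat(nvl(row["COL1"], "null"), nvl(row["COL2"], "null"))
with_empty_string = concat(nvl(row["COL1"], ""), nvl(row["COL2"], ""))
with_ws = concat_ws("", row["COL1"], row["COL2"])

print(plain, with_null_literal, with_empty_string, with_ws)
```

The when/otherwise version with lit("null") corresponds to with_null_literal, and the nvl version corresponds to with_empty_string.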
concat(*cols)
v1.5 and higher
Concatenates multiple input columns together into a single column. The function works with strings, binary and compatible array columns.
Eg: new_df = df.select(concat(df.a, df.b, df.c))
concat_ws(sep, *cols)
v1.5 and higher
Similar to concat but uses the specified separator.
Eg: new_df = df.select(concat_ws('-', df.col1, df.col2))
map_concat(*cols)
v2.4 and higher
Used to concat maps, returns the union of all the given maps.
Eg: new_df = df.select(map_concat("map1", "map2"))
Using concat operator (||):
v2.3 and higher
Eg: df = spark.sql("select col_a || col_b || col_c as abc from table_x")
Reference: Spark sql doc
If you want to do it using DF, you could use a udf to add a new column based on existing columns.
val sqlContext = new SQLContext(sc)
case class MyDf(col1: String, col2: String)
//here is our dataframe
val df = sqlContext.createDataFrame(sc.parallelize(
Array(MyDf("A", "B"), MyDf("C", "D"), MyDf("E", "F"))
))
//Define a udf to concatenate two passed in string values
val getConcatenated = udf( (first: String, second: String) => { first + " " + second } )
//use withColumn method to add a new column called newColName
df.withColumn("newColName", getConcatenated($"col1", $"col2")).select("newColName", "col1", "col2").show()
From Spark 2.3 (SPARK-22771), Spark SQL supports the concatenation operator ||.
For example:
val df = spark.sql("select _c1 || _c2 as concat_column from <table_name>")
Here is another way of doing this for pyspark:
#import concat and lit functions from pyspark.sql.functions
from pyspark.sql.functions import concat, lit
#Create your data frame
countryDF = sqlContext.createDataFrame([('Ethiopia',), ('Kenya',), ('Uganda',), ('Rwanda',)], ['East Africa'])
#Use select, concat, and lit functions to do the concatenation
personDF = countryDF.select(concat(countryDF['East Africa'], lit('n')).alias('East African'))
#Show the new data frame
personDF.show()
----------RESULT-------------------------
+------------+
|East African|
+------------+
|   Ethiopian|
|      Kenyan|
|     Ugandan|
|     Rwandan|
+------------+
Here is a suggestion for when you don't know the number or name of the columns in the Dataframe.
val dfResults = dfSource.select(concat_ws(",",dfSource.columns.map(c => col(c)): _*))
Is there Java syntax corresponding to the process below?
val dfResults = dfSource.select(concat_ws(",",dfSource.columns.map(c => col(c)): _*))
In Spark 2.3.0, you may do:
spark.sql( """ select '1' || column_a from table_a """)
In Java you can do this to concatenate multiple columns. The sample code is to provide you a scenario and how to use it for better understanding.
SparkSession spark = JavaSparkSessionSingleton.getInstance(rdd.context().getConf());
Dataset<Row> reducedInventory = spark.sql("select * from table_name")
.withColumn("concatenatedCol",
concat(col("col1"), lit("_"), col("col2"), lit("_"), col("col3")));
class JavaSparkSessionSingleton {
private static transient SparkSession instance = null;
public static SparkSession getInstance(SparkConf sparkConf) {
if (instance == null) {
instance = SparkSession.builder().config(sparkConf)
.getOrCreate();
}
return instance;
}
}
The above code concatenates col1, col2 and col3, separated by "_", to create a column named "concatenatedCol".
In my case, I wanted a pipe ('|') delimited row.
from pyspark.sql import functions as F
df.select(F.concat_ws('|','_c1','_c2','_c3','_c4')).show()
This worked like a hot knife through butter.
use concat method like this:
Dataset<Row> DF2 = DF1
.withColumn("NEW_COLUMN",concat(col("ADDR1"),col("ADDR2"),col("ADDR3"))).as("NEW_COLUMN")
Another way to do it in pySpark using sqlContext...
#Suppose we have a dataframe:
df = sqlContext.createDataFrame([('row1_1','row1_2')], ['colname1', 'colname2'])
# Now we can concatenate columns and assign the new column a name
df = df.select(concat(df.colname1, df.colname2).alias('joined_colname'))
There are built-in functions for concatenation, so there is no need to implement a custom function. Since you mentioned Spark SQL, I am guessing you want to pass it as a declarative command through spark.sql(). If so, you can do it in a straightforward manner with a SQL command like:
SELECT CONCAT(col1, '<delimiter>', col2, ...) AS concat_column_name FROM <table_name>;
Also, from Spark 2.3.0, you can use the || operator along these lines:
SELECT col1 || col2 AS concat_column_name FROM <table_name>;
Here, <delimiter> is your preferred delimiter (it can be an empty space as well) and <table_name> is the temporary or permanent table you are trying to read from.
We can simply use selectExpr as well.
df1.selectExpr("*","upper(_2||_3) as new")
We can use concat() in select method of dataframe
val fullName = nameDF.select(concat(col("FirstName"), lit(" "), col("LastName")).as("FullName"))
Using withColumn and concat
val fullName1 = nameDF.withColumn("FullName", concat(col("FirstName"), lit(" "), col("LastName")))
Using spark.sql concat function
val fullNameSql = spark.sql("select Concat(FirstName, LastName) as FullName from names")
Taken from https://www.sparkcodehub.com/spark-dataframe-concat-column
val newDf =
df.withColumn(
"NEW_COLUMN",
concat(
when(col("COL1").isNotNull, col("COL1")).otherwise(lit("null")),
when(col("COL2").isNotNull, col("COL2")).otherwise(lit("null"))))
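Note: in the Java and Python APIs, isNotNull must be called with parentheses, as "isNotNull()". In Scala it is declared without a parameter list, so the version above is the idiomatic one there. The same expression with parentheses: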
val newDf =
df.withColumn(
"NEW_COLUMN",
concat(
when(col("COL1").isNotNull(), col("COL1")).otherwise(lit("null")),
when(col("COL2").isNotNull(), col("COL2")).otherwise(lit("null"))))

Using PySpark integer column as argument

I am trying to parse a PySpark column which has an "=" sign inside. The two functions I've created for this purpose work individually:
DF=DF.withColumn("findEqual",instr(columnName,"="))
and also when I create a column of Substring
DF=DF.withColumn("parsedString",substring(columnName,2,18))
However, when I combine the two functions:
DF=DF.withColumn("parsedString",substring(columnName,2,instr(columnName,"=")))
I receive an error:
TypeError: int() argument must be a string or a number, not 'Column'
The issue seems to be that "findEqual" isn't seen by PySpark as an integer, rather an "integer object".
Thanks for your help!
You are using functions defined on strings, not on PySpark columns. You can wrap them with udf:
from pyspark.sql.functions import udf
from pyspark.sql.types import *
def instr(x, s):
    return s in x
instr_udf = lambda s: udf(lambda x: instr(x, s), BooleanType())
DF=DF.withColumn("findEqual",instr_udf("=")("columnName"))
and
substring_udf = udf(substring, StringType())
I would not recommend using UDFs when the functions already exist in PySpark:
DF=DF.withColumn("findEqual",DF.columnName.like('%=%'))
DF=DF.withColumn("parsedString",DF.columnName[2:18])
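What the question is after is using the position returned by instr as the length argument of substring. The plain-Python sketch below models the 1-based semantics of the two functions for positive positions only. The helpers are local stand-ins, not the PySpark functions, and the sample value is hypothetical. In Spark itself, one option that should work is an SQL expression string such as expr("substring(columnName, 2, instr(columnName, '=') - 2)"), since SQL's substring accepts column-valued arguments; treat that as an assumption to verify on your version.

```python
def instr(text, sub):
    """1-based position of the first occurrence of sub, or 0 if absent."""
    return text.find(sub) + 1

def substring(text, pos, length):
    """Substring starting at 1-based position pos, at most length characters."""
    start = pos - 1
    return text[start:start + length]

value = "xkey=value"  # hypothetical column value

eq_pos = instr(value, "=")
as_in_question = substring(value, 2, eq_pos)     # length = position of "="
before_equals = substring(value, 2, eq_pos - 2)  # stop just before "="

print(eq_pos, as_in_question, before_equals)
```

Because the third argument is a length rather than an end position, passing the raw instr result overshoots the "=" sign; subtracting the start offset gives the text before it.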

Modifying a column after groupBy in SPARK (using SCALA) [duplicate]

I have a table of two string type columns (username, friend) and for each username, I want to collect all of its friends on one row, concatenated as strings. For example: ('username1', 'friends1, friends2, friends3')
I know MySQL does this with GROUP_CONCAT. Is there any way to do this with Spark SQL?
Before you proceed: this operation is yet another groupByKey. While it has multiple legitimate applications, it is relatively expensive, so be sure to use it only when required.
It is not exactly a concise or efficient solution, but you can use UserDefinedAggregateFunction, introduced in Spark 1.5.0:
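import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._
import org.apache.spark.unsafe.types.UTF8String
object GroupConcat extends UserDefinedAggregateFunction {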
def inputSchema = new StructType().add("x", StringType)
def bufferSchema = new StructType().add("buff", ArrayType(StringType))
def dataType = StringType
def deterministic = true
def initialize(buffer: MutableAggregationBuffer) = {
buffer.update(0, ArrayBuffer.empty[String])
}
def update(buffer: MutableAggregationBuffer, input: Row) = {
if (!input.isNullAt(0))
buffer.update(0, buffer.getSeq[String](0) :+ input.getString(0))
}
def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
buffer1.update(0, buffer1.getSeq[String](0) ++ buffer2.getSeq[String](0))
}
def evaluate(buffer: Row) = UTF8String.fromString(
buffer.getSeq[String](0).mkString(","))
}
Example usage:
val df = sc.parallelize(Seq(
("username1", "friend1"),
("username1", "friend2"),
("username2", "friend1"),
("username2", "friend3")
)).toDF("username", "friend")
df.groupBy($"username").agg(GroupConcat($"friend")).show
## +---------+---------------+
## | username|        friends|
## +---------+---------------+
## |username1|friend1,friend2|
## |username2|friend1,friend3|
## +---------+---------------+
You can also create a Python wrapper as shown in Spark: How to map Python with Scala or Java User Defined Functions?
In practice it can be faster to extract RDD, groupByKey, mkString and rebuild DataFrame.
You can get a similar effect by combining collect_list function (Spark >= 1.6.0) with concat_ws:
import org.apache.spark.sql.functions.{collect_list, concat_ws}
df.groupBy($"username")
.agg(concat_ws(",", collect_list($"friend")).alias("friends"))
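As a reference for what this aggregation is expected to produce, here is a plain-Python sketch of the same grouping, using the username/friend rows from the example above. It only models the semantics: the group_concat helper is hypothetical, and unlike Spark it preserves input order deterministically.

```python
from collections import defaultdict

def group_concat(rows, sep=","):
    """Collect the values of each key into a list, then join them."""
    groups = defaultdict(list)
    for key, value in rows:
        if value is not None:  # collect_list also skips nulls
            groups[key].append(value)
    return {key: sep.join(values) for key, values in groups.items()}

rows = [
    ("username1", "friend1"),
    ("username1", "friend2"),
    ("username2", "friend1"),
    ("username2", "friend3"),
]

friends = group_concat(rows)
print(friends)
```

In Spark the order of the collected values depends on how rows arrive after the shuffle, so it should not be relied on unless the list is sorted explicitly.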
You can try the collect_list function
sqlContext.sql("select A, collect_list(B), collect_list(C) from Table1 group by A")
Or you can register a UDF, something like
sqlContext.udf.register("myzip",(a:Long,b:Long)=>(a+","+b))
and you can use this function in the query
sqlContext.sql("select A,collect_list(myzip(B,C)) from tbl group by A")
In Spark 2.4+ this has become simpler with the help of collect_list() and array_join().
Here's a demonstration in PySpark, though the code should be very similar for Scala too:
from pyspark.sql.functions import array_join, collect_list
friends = spark.createDataFrame(
[
('jacques', 'nicolas'),
('jacques', 'georges'),
('jacques', 'francois'),
('bob', 'amelie'),
('bob', 'zoe'),
],
schema=['username', 'friend'],
)
(
friends
.orderBy('friend', ascending=False)
.groupBy('username')
.agg(
array_join(
collect_list('friend'),
delimiter=', ',
).alias('friends')
)
.show(truncate=False)
)
In Spark SQL the solution is similar:
SELECT
username,
array_join(collect_list(friend), ', ') AS friends
FROM friends
GROUP BY username;
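The output (the order of friends within a row is not guaranteed, since collect_list depends on the row order after the shuffle):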
+--------+--------------------------+
|username|friends                   |
+--------+--------------------------+
|jacques |nicolas, georges, francois|
|bob     |zoe, amelie               |
+--------+--------------------------+
This is similar to MySQL's GROUP_CONCAT() and Redshift's LISTAGG().
Here is a function you can use in PySpark:
import pyspark.sql.functions as F
from pyspark.sql.types import StringType
def group_concat(col, distinct=False, sep=','):
    if distinct:
        collect = F.collect_set(col.cast(StringType()))
    else:
        collect = F.collect_list(col.cast(StringType()))
    return F.concat_ws(sep, collect)
table.groupby('username').agg(group_concat(F.col('friends')).alias('friends'))
In SQL:
select username, concat_ws(',', collect_list(friends)) as friends
from table
group by username
-- the Spark SQL solution with collect_set
SELECT id, concat_ws(', ', sort_array( collect_set(colors))) as csv_colors
FROM (
VALUES ('A', 'green'),('A','yellow'),('B', 'blue'),('B','green')
) as T (id, colors)
GROUP BY id
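The collect_set + sort_array combination removes duplicates and makes the order deterministic. A plain-Python sketch of the same idea, using the VALUES from the query above plus one hypothetical duplicate row to show the de-duplication (the helper name is an assumption, not a Spark function):

```python
def sorted_distinct_concat(rows, sep=", "):
    """Per key: collect distinct values, sort them, and join with sep."""
    groups = {}
    for key, value in rows:
        groups.setdefault(key, set()).add(value)
    return {key: sep.join(sorted(values)) for key, values in groups.items()}

rows = [
    ("A", "green"),
    ("A", "yellow"),
    ("B", "blue"),
    ("B", "green"),
    ("A", "green"),  # hypothetical duplicate, not in the original VALUES
]

csv_colors = sorted_distinct_concat(rows)
print(csv_colors)
```

Sorting before joining is what makes the result stable across runs, which a bare collect_list or collect_set does not provide.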
One way to do it with PySpark < 1.6, which unfortunately doesn't support user-defined aggregate functions:
byUsername = df.rdd.reduceByKey(lambda x, y: x + ", " + y)
and if you want to make it a dataframe again:
sqlContext.createDataFrame(byUsername, ["username", "friends"])
As of 1.6, you can use collect_list and then join the created list:
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
join_ = F.udf(lambda x: ", ".join(x), StringType())
df.groupBy("username").agg(join_(F.collect_list("friend")).alias("friends"))
Language: Scala
Spark version: 1.5.2
I had the same issue and also tried to resolve it using udfs but, unfortunately, this has led to more problems later in the code due to type inconsistencies. I was able to work my way around this by first converting the DF to an RDD then grouping by and manipulating the data in the desired way and then converting the RDD back to a DF as follows:
val df = sc
.parallelize(Seq(
("username1", "friend1"),
("username1", "friend2"),
("username2", "friend1"),
("username2", "friend3")))
.toDF("username", "friend")
+---------+-------+
| username| friend|
+---------+-------+
|username1|friend1|
|username1|friend2|
|username2|friend1|
|username2|friend3|
+---------+-------+
val dfGRPD = df.map(Row => (Row(0), Row(1)))
.groupByKey()
.map{ case(username:String, groupOfFriends:Iterable[String]) => (username, groupOfFriends.mkString(","))}
.toDF("username", "groupOfFriends")
+---------+---------------+
| username| groupOfFriends|
+---------+---------------+
|username1|friend2,friend1|
|username2|friend3,friend1|
+---------+---------------+
Below is Python code that achieves group_concat functionality.
Input Data:
Cust_No,Cust_Cars
1, Toyota
2, BMW
1, Audi
2, Hyundai
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf
import pyspark.sql.functions as F
spark = SparkSession.builder.master('yarn').getOrCreate()
# UDF to join all list elements with "|"
def combine_cars(car_list, sep='|'):
    return sep.join(car_list)
test_udf = udf(combine_cars, StringType())
car_list_per_customer.groupBy("Cust_No").agg(F.collect_list("Cust_Cars").alias("car_list")).select("Cust_No",test_udf("car_list").alias("Final_List")).show(20,False)
Output Data:
Cust_No, Final_List
1, Toyota|Audi
2, BMW|Hyundai
You can also use the Spark SQL function collect_list, then cast the result to string and use regexp_replace to replace the special characters. Note that the last pattern, [^A-Z0-9|], removes everything except upper-case letters, digits and the separator, so adjust it if your values contain other characters.
regexp_replace(regexp_replace(regexp_replace(cast(collect_list((column)) as string), ' ', ''), ',', '|'), '[^A-Z0-9|]', '')
It's an easier way.
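The three nested replacements can be traced with Python's re module. This sketch assumes that casting the collected array to string renders it as "[a, b]", and the input strings are hypothetical. It shows that the final character class strips lower-case letters as well as the brackets.

```python
import re

def flatten_list_string(text):
    """Apply the same three replacements as the nested regexp_replace calls."""
    text = text.replace(" ", "")              # innermost: remove spaces
    text = text.replace(",", "|")             # middle: commas become pipes
    return re.sub(r"[^A-Z0-9|]", "", text)    # outermost: keep only A-Z, 0-9 and |

upper_case = flatten_list_string("[ABC1, XYZ2]")    # hypothetical upper-case values
mixed_case = flatten_list_string("[Toyota, Audi]")  # hypothetical mixed-case values

print(upper_case, mixed_case)
```

For values that are not purely upper-case alphanumerics, the last pattern would need to be widened, for example to allow a-z.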
The functions concat_ws() and collect_list() can be a good alternative, used together with groupBy():
import pyspark.sql.functions as F
df_grp = df.groupby("agg_col").agg(F.concat_ws("#;", F.collect_list(df.time)).alias("time"), F.concat_ws("#;", F.collect_list(df.status)).alias("status"), F.concat_ws("#;", F.collect_list(df.llamaType)).alias("llamaType"))
Sample Output
+-------+------------------+----------------+---------------------+
|agg_col|time              |status          |llamaType            |
+-------+------------------+----------------+---------------------+
|1      |5-1-2020#;6-2-2020|Running#;Sitting|red llama#;blue llama|
+-------+------------------+----------------+---------------------+