Related
I'm trying to add new columns based on the input Name and Date columns as below:
Input:
+------+-----------+
|Name |Date |
+------+-----------+
|PETER |1986-May-29|
+------+-----------+
Expected Output:
+---------+-----------+
|Character| New_Date|
+---------+-----------+
| P|1986-May-29|
| E|1986-May-30|
| T|1986-May-31|
| E|1986-Jun-01|
| R|1986-Jun-02|
+---------+-----------+
df_withchars = df.withColumn("Character", F.explode(F.split('Name','')))\
.filter(F.col('Character') != '')
df_withchars.withColumn('New_Date', (lambda x: F.date_add(x['Date'], 1) for i in range(len(x[0])))).show()
Tried the above code and throwing NameError: name 'x' is not defined
You can use split to create an array from the column and then posexplode to explode this array.
posexplode is similar to explode function but it returns one additional column - the position/index of an item. That means it will give you the number from 0 to 4 in your particular example. You can add this number to the date using date_add function.
Before we start lets import relevant functions:
from pyspark.sql.types import StructField, StructType, DateType, StringType
from pyspark.sql.functions import split, col, date_add, posexplode
from datetime import date
Then create a data frame:
sample_data = [('PETER', date.today())]
sample_schema = StructType([
StructField('Name', StringType(), True),
StructField('Date', DateType(), True),
])
df = spark.createDataFrame(data=sample_data, schema=sample_schema)
Final step is to use split and posexplode functions to get the index of each character, and then add the index to date.
df \
.withColumn('SplittedName', split('Name', "(?!$)")) \
.select('Name', 'Date', posexplode('SplittedName')) \
.withColumn('NewDate', date_add('Date', col('pos'))).show()
Note that in the above code, I've used regex pattern (?!$) (reference). You can use empty string "" as well. However, it will return an additional empty item in the resulting array in SplittedName column, which will requires adding another step just to remove this empty, not relevant item.
The result:
EDIT:
To get the resulting date in the desired format, you just need to use date_format (don't forget to import the function first) as follows:
df \
.withColumn('SplittedName', split('Name', "(?!$)")) \
.select('Name', 'Date', posexplode('SplittedName')) \
.withColumn('NewDate', date_format(date_add('Date', col('pos')), "yyyy-MMM-dd")).show()
New result:
How to create a column with json structure based on other columns of a pyspark dataframe.
For example, I want to achieve the below in pyspark dataframe. I am able to do this on pandas dataframe as below, but how do I do the same on pyspark dataframe
df = {'Address': ['abc', 'dvf', 'bgh'], 'zip': [34567, 12345, 78905], 'state':['VA', 'TN', 'MA']}
df = pd.DataFrame(df, columns = ['Address', 'zip', 'state'])
lst = ['Address', 'zip']
df['new_col'] = df[lst].apply(lambda x: x.to_json(), axis = 1)
Expected output
Assuming your pyspark dataframe is named df, use the struct function to construct a struct, and then use the to_json function to convert it to a json string.
import pyspark.sql.functions as F
....
lst = ['Address', 'zip']
df = df.withColumn('new_col', F.to_json(F.struct(*[F.col(c) for c in lst])))
df.show(truncate=False)
I have a pyspark dataframe, and I wish to get the mean and std for all columns, and rename the columns name and type, what is the easiest way to implement this, currently below is my code:
test_mean=test.groupby('id').agg({'col1': 'mean',
'col2': 'mean',
'col3':'mean'
})
test_std=test.groupby('id').agg({'col1': 'std',
'col2': 'std',
'col3':'std'
})
##rename one columns by one columns
## type cast decimal to float
May I know how to improve it?
Thanks.
You can try with Col experssioons:
from pyspark.sql import functions as F
expr1 = F.std(F.col('col1').cast('integer').alias('col1'))
expr2 = F.std(F.col('col2').cast('integer').alias('col2'))
test \
.groupBy(id) \
.agg(
expr1,
expr2
)
How do we concatenate two columns in an Apache Spark DataFrame?
Is there any function in Spark SQL which we can use?
With raw SQL you can use CONCAT:
In Python
df = sqlContext.createDataFrame([("foo", 1), ("bar", 2)], ("k", "v"))
df.registerTempTable("df")
sqlContext.sql("SELECT CONCAT(k, ' ', v) FROM df")
In Scala
import sqlContext.implicits._
val df = sc.parallelize(Seq(("foo", 1), ("bar", 2))).toDF("k", "v")
df.registerTempTable("df")
sqlContext.sql("SELECT CONCAT(k, ' ', v) FROM df")
Since Spark 1.5.0 you can use concat function with DataFrame API:
In Python :
from pyspark.sql.functions import concat, col, lit
df.select(concat(col("k"), lit(" "), col("v")))
In Scala :
import org.apache.spark.sql.functions.{concat, lit}
df.select(concat($"k", lit(" "), $"v"))
There is also concat_ws function which takes a string separator as the first argument.
Here's how you can do custom naming
import pyspark
from pyspark.sql import functions as sf
sc = pyspark.SparkContext()
sqlc = pyspark.SQLContext(sc)
df = sqlc.createDataFrame([('row11','row12'), ('row21','row22')], ['colname1', 'colname2'])
df.show()
gives,
+--------+--------+
|colname1|colname2|
+--------+--------+
| row11| row12|
| row21| row22|
+--------+--------+
create new column by concatenating:
df = df.withColumn('joined_column',
sf.concat(sf.col('colname1'),sf.lit('_'), sf.col('colname2')))
df.show()
+--------+--------+-------------+
|colname1|colname2|joined_column|
+--------+--------+-------------+
| row11| row12| row11_row12|
| row21| row22| row21_row22|
+--------+--------+-------------+
One option to concatenate string columns in Spark Scala is using concat.
It is necessary to check for null values. Because if one of the columns is null, the result will be null even if one of the other columns do have information.
Using concat and withColumn:
val newDf =
df.withColumn(
"NEW_COLUMN",
concat(
when(col("COL1").isNotNull, col("COL1")).otherwise(lit("null")),
when(col("COL2").isNotNull, col("COL2")).otherwise(lit("null"))))
Using concat and select:
val newDf = df.selectExpr("concat(nvl(COL1, ''), nvl(COL2, '')) as NEW_COLUMN")
With both approaches you will have a NEW_COLUMN which value is a concatenation of the columns: COL1 and COL2 from your original df.
concat(*cols)
v1.5 and higher
Concatenates multiple input columns together into a single column. The function works with strings, binary and compatible array columns.
Eg: new_df = df.select(concat(df.a, df.b, df.c))
concat_ws(sep, *cols)
v1.5 and higher
Similar to concat but uses the specified separator.
Eg: new_df = df.select(concat_ws('-', df.col1, df.col2))
map_concat(*cols)
v2.4 and higher
Used to concat maps, returns the union of all the given maps.
Eg: new_df = df.select(map_concat("map1", "map2"))
Using concat operator (||):
v2.3 and higher
Eg: df = spark.sql("select col_a || col_b || col_c as abc from table_x")
Reference: Spark sql doc
If you want to do it using DF, you could use a udf to add a new column based on existing columns.
val sqlContext = new SQLContext(sc)
case class MyDf(col1: String, col2: String)
//here is our dataframe
val df = sqlContext.createDataFrame(sc.parallelize(
Array(MyDf("A", "B"), MyDf("C", "D"), MyDf("E", "F"))
))
//Define a udf to concatenate two passed in string values
val getConcatenated = udf( (first: String, second: String) => { first + " " + second } )
//use withColumn method to add a new column called newColName
df.withColumn("newColName", getConcatenated($"col1", $"col2")).select("newColName", "col1", "col2").show()
From Spark 2.3(SPARK-22771) Spark SQL supports the concatenation operator ||.
For example;
val df = spark.sql("select _c1 || _c2 as concat_column from <table_name>")
Here is another way of doing this for pyspark:
#import concat and lit functions from pyspark.sql.functions
from pyspark.sql.functions import concat, lit
#Create your data frame
countryDF = sqlContext.createDataFrame([('Ethiopia',), ('Kenya',), ('Uganda',), ('Rwanda',)], ['East Africa'])
#Use select, concat, and lit functions to do the concatenation
personDF = countryDF.select(concat(countryDF['East Africa'], lit('n')).alias('East African'))
#Show the new data frame
personDF.show()
----------RESULT-------------------------
84
+------------+
|East African|
+------------+
| Ethiopian|
| Kenyan|
| Ugandan|
| Rwandan|
+------------+
Here is a suggestion for when you don't know the number or name of the columns in the Dataframe.
val dfResults = dfSource.select(concat_ws(",",dfSource.columns.map(c => col(c)): _*))
Do we have java syntax corresponding to below process
val dfResults = dfSource.select(concat_ws(",",dfSource.columns.map(c => col(c)): _*))
In Spark 2.3.0, you may do:
spark.sql( """ select '1' || column_a from table_a """)
In Java you can do this to concatenate multiple columns. The sample code is to provide you a scenario and how to use it for better understanding.
SparkSession spark = JavaSparkSessionSingleton.getInstance(rdd.context().getConf());
Dataset<Row> reducedInventory = spark.sql("select * from table_name")
.withColumn("concatenatedCol",
concat(col("col1"), lit("_"), col("col2"), lit("_"), col("col3")));
class JavaSparkSessionSingleton {
private static transient SparkSession instance = null;
public static SparkSession getInstance(SparkConf sparkConf) {
if (instance == null) {
instance = SparkSession.builder().config(sparkConf)
.getOrCreate();
}
return instance;
}
}
The above code concatenated col1,col2,col3 seperated by "_" to create a column with name "concatenatedCol".
In my case, I wanted a Pipe-'I' delimited row.
from pyspark.sql import functions as F
df.select(F.concat_ws('|','_c1','_c2','_c3','_c4')).show()
This worked well like a hot knife over butter.
use concat method like this:
Dataset<Row> DF2 = DF1
.withColumn("NEW_COLUMN",concat(col("ADDR1"),col("ADDR2"),col("ADDR3"))).as("NEW_COLUMN")
Another way to do it in pySpark using sqlContext...
#Suppose we have a dataframe:
df = sqlContext.createDataFrame([('row1_1','row1_2')], ['colname1', 'colname2'])
# Now we can concatenate columns and assign the new column a name
df = df.select(concat(df.colname1, df.colname2).alias('joined_colname'))
Indeed, there are some beautiful inbuilt abstractions for you to accomplish your concatenation without the need to implement a custom function. Since you mentioned Spark SQL, so I am guessing you are trying to pass it as a declarative command through spark.sql(). If so, you can accomplish in a straight forward manner passing SQL command like:
SELECT CONCAT(col1, '<delimiter>', col2, ...) AS concat_column_name FROM <table_name>;
Also, from Spark 2.3.0, you can use commands in lines with:
SELECT col1 || col2 AS concat_column_name FROM <table_name>;
Wherein, is your preferred delimiter (can be empty space as well) and is the temporary or permanent table you are trying to read from.
We can simple use SelectExpr as well.
df1.selectExpr("*","upper(_2||_3) as new")
We can use concat() in select method of dataframe
val fullName = nameDF.select(concat(col("FirstName"), lit(" "), col("LastName")).as("FullName"))
Using withColumn and concat
val fullName1 = nameDF.withColumn("FullName", concat(col("FirstName"), lit(" "), col("LastName")))
Using spark.sql concat function
val fullNameSql = spark.sql("select Concat(FirstName, LastName) as FullName from names")
Taken from https://www.sparkcodehub.com/spark-dataframe-concat-column
val newDf =
df.withColumn(
"NEW_COLUMN",
concat(
when(col("COL1").isNotNull, col("COL1")).otherwise(lit("null")),
when(col("COL2").isNotNull, col("COL2")).otherwise(lit("null"))))
Note: For this code to work you need to put the parentheses "()" in the "isNotNull" function. -> The correct one is "isNotNull()".
val newDf =
df.withColumn(
"NEW_COLUMN",
concat(
when(col("COL1").isNotNull(), col("COL1")).otherwise(lit("null")),
when(col("COL2").isNotNull(), col("COL2")).otherwise(lit("null"))))
How do we concatenate two columns in an Apache Spark DataFrame?
Is there any function in Spark SQL which we can use?
With raw SQL you can use CONCAT:
In Python
df = sqlContext.createDataFrame([("foo", 1), ("bar", 2)], ("k", "v"))
df.registerTempTable("df")
sqlContext.sql("SELECT CONCAT(k, ' ', v) FROM df")
In Scala
import sqlContext.implicits._
val df = sc.parallelize(Seq(("foo", 1), ("bar", 2))).toDF("k", "v")
df.registerTempTable("df")
sqlContext.sql("SELECT CONCAT(k, ' ', v) FROM df")
Since Spark 1.5.0 you can use concat function with DataFrame API:
In Python :
from pyspark.sql.functions import concat, col, lit
df.select(concat(col("k"), lit(" "), col("v")))
In Scala :
import org.apache.spark.sql.functions.{concat, lit}
df.select(concat($"k", lit(" "), $"v"))
There is also concat_ws function which takes a string separator as the first argument.
Here's how you can do custom naming
import pyspark
from pyspark.sql import functions as sf
sc = pyspark.SparkContext()
sqlc = pyspark.SQLContext(sc)
df = sqlc.createDataFrame([('row11','row12'), ('row21','row22')], ['colname1', 'colname2'])
df.show()
gives,
+--------+--------+
|colname1|colname2|
+--------+--------+
| row11| row12|
| row21| row22|
+--------+--------+
create new column by concatenating:
df = df.withColumn('joined_column',
sf.concat(sf.col('colname1'),sf.lit('_'), sf.col('colname2')))
df.show()
+--------+--------+-------------+
|colname1|colname2|joined_column|
+--------+--------+-------------+
| row11| row12| row11_row12|
| row21| row22| row21_row22|
+--------+--------+-------------+
One option to concatenate string columns in Spark Scala is using concat.
It is necessary to check for null values. Because if one of the columns is null, the result will be null even if one of the other columns do have information.
Using concat and withColumn:
val newDf =
df.withColumn(
"NEW_COLUMN",
concat(
when(col("COL1").isNotNull, col("COL1")).otherwise(lit("null")),
when(col("COL2").isNotNull, col("COL2")).otherwise(lit("null"))))
Using concat and select:
val newDf = df.selectExpr("concat(nvl(COL1, ''), nvl(COL2, '')) as NEW_COLUMN")
With both approaches you will have a NEW_COLUMN which value is a concatenation of the columns: COL1 and COL2 from your original df.
concat(*cols)
v1.5 and higher
Concatenates multiple input columns together into a single column. The function works with strings, binary and compatible array columns.
Eg: new_df = df.select(concat(df.a, df.b, df.c))
concat_ws(sep, *cols)
v1.5 and higher
Similar to concat but uses the specified separator.
Eg: new_df = df.select(concat_ws('-', df.col1, df.col2))
map_concat(*cols)
v2.4 and higher
Used to concat maps, returns the union of all the given maps.
Eg: new_df = df.select(map_concat("map1", "map2"))
Using concat operator (||):
v2.3 and higher
Eg: df = spark.sql("select col_a || col_b || col_c as abc from table_x")
Reference: Spark sql doc
If you want to do it using DF, you could use a udf to add a new column based on existing columns.
val sqlContext = new SQLContext(sc)
case class MyDf(col1: String, col2: String)
//here is our dataframe
val df = sqlContext.createDataFrame(sc.parallelize(
Array(MyDf("A", "B"), MyDf("C", "D"), MyDf("E", "F"))
))
//Define a udf to concatenate two passed in string values
val getConcatenated = udf( (first: String, second: String) => { first + " " + second } )
//use withColumn method to add a new column called newColName
df.withColumn("newColName", getConcatenated($"col1", $"col2")).select("newColName", "col1", "col2").show()
From Spark 2.3(SPARK-22771) Spark SQL supports the concatenation operator ||.
For example;
val df = spark.sql("select _c1 || _c2 as concat_column from <table_name>")
Here is another way of doing this for pyspark:
#import concat and lit functions from pyspark.sql.functions
from pyspark.sql.functions import concat, lit
#Create your data frame
countryDF = sqlContext.createDataFrame([('Ethiopia',), ('Kenya',), ('Uganda',), ('Rwanda',)], ['East Africa'])
#Use select, concat, and lit functions to do the concatenation
personDF = countryDF.select(concat(countryDF['East Africa'], lit('n')).alias('East African'))
#Show the new data frame
personDF.show()
----------RESULT-------------------------
84
+------------+
|East African|
+------------+
| Ethiopian|
| Kenyan|
| Ugandan|
| Rwandan|
+------------+
Here is a suggestion for when you don't know the number or name of the columns in the Dataframe.
val dfResults = dfSource.select(concat_ws(",",dfSource.columns.map(c => col(c)): _*))
Do we have java syntax corresponding to below process
val dfResults = dfSource.select(concat_ws(",",dfSource.columns.map(c => col(c)): _*))
In Spark 2.3.0, you may do:
spark.sql( """ select '1' || column_a from table_a """)
In Java you can do this to concatenate multiple columns. The sample code is to provide you a scenario and how to use it for better understanding.
SparkSession spark = JavaSparkSessionSingleton.getInstance(rdd.context().getConf());
Dataset<Row> reducedInventory = spark.sql("select * from table_name")
.withColumn("concatenatedCol",
concat(col("col1"), lit("_"), col("col2"), lit("_"), col("col3")));
class JavaSparkSessionSingleton {
private static transient SparkSession instance = null;
public static SparkSession getInstance(SparkConf sparkConf) {
if (instance == null) {
instance = SparkSession.builder().config(sparkConf)
.getOrCreate();
}
return instance;
}
}
The above code concatenated col1,col2,col3 seperated by "_" to create a column with name "concatenatedCol".
In my case, I wanted a Pipe-'I' delimited row.
from pyspark.sql import functions as F
df.select(F.concat_ws('|','_c1','_c2','_c3','_c4')).show()
This worked well like a hot knife over butter.
use concat method like this:
Dataset<Row> DF2 = DF1
.withColumn("NEW_COLUMN",concat(col("ADDR1"),col("ADDR2"),col("ADDR3"))).as("NEW_COLUMN")
Another way to do it in pySpark using sqlContext...
#Suppose we have a dataframe:
df = sqlContext.createDataFrame([('row1_1','row1_2')], ['colname1', 'colname2'])
# Now we can concatenate columns and assign the new column a name
df = df.select(concat(df.colname1, df.colname2).alias('joined_colname'))
Indeed, there are some beautiful inbuilt abstractions for you to accomplish your concatenation without the need to implement a custom function. Since you mentioned Spark SQL, so I am guessing you are trying to pass it as a declarative command through spark.sql(). If so, you can accomplish in a straight forward manner passing SQL command like:
SELECT CONCAT(col1, '<delimiter>', col2, ...) AS concat_column_name FROM <table_name>;
Also, from Spark 2.3.0, you can use commands in lines with:
SELECT col1 || col2 AS concat_column_name FROM <table_name>;
Wherein, is your preferred delimiter (can be empty space as well) and is the temporary or permanent table you are trying to read from.
We can simple use SelectExpr as well.
df1.selectExpr("*","upper(_2||_3) as new")
We can use concat() in select method of dataframe
val fullName = nameDF.select(concat(col("FirstName"), lit(" "), col("LastName")).as("FullName"))
Using withColumn and concat
val fullName1 = nameDF.withColumn("FullName", concat(col("FirstName"), lit(" "), col("LastName")))
Using spark.sql concat function
val fullNameSql = spark.sql("select Concat(FirstName, LastName) as FullName from names")
Taken from https://www.sparkcodehub.com/spark-dataframe-concat-column
val newDf =
df.withColumn(
"NEW_COLUMN",
concat(
when(col("COL1").isNotNull, col("COL1")).otherwise(lit("null")),
when(col("COL2").isNotNull, col("COL2")).otherwise(lit("null"))))
Note: For this code to work you need to put the parentheses "()" in the "isNotNull" function. -> The correct one is "isNotNull()".
val newDf =
df.withColumn(
"NEW_COLUMN",
concat(
when(col("COL1").isNotNull(), col("COL1")).otherwise(lit("null")),
when(col("COL2").isNotNull(), col("COL2")).otherwise(lit("null"))))