Replacing whitespace in all column names in spark Dataframe - scala

I have a Spark DataFrame with whitespace in some of its column names, which has to be replaced with underscores.
I know a single column can be renamed using withColumnRenamed() in Spark SQL, but to rename n columns this function has to be chained n times (to my knowledge).
To automate this, I have tried:
val old_names = df.columns // array of old column names
val new_names = old_names.map { x =>
  if (x.contains(" ")) x.replaceAll("\\s", "_")
  else x
} // array of new column names with whitespace replaced
Now, how do I replace df's header with new_names?

As best practice, you should prefer expressions and immutability.
You should use val and not var as much as possible.
Thus, it's preferable to use the foldLeft operator in this case:
val newDf = df.columns
.foldLeft(df)((curr, n) => curr.withColumnRenamed(n, n.replaceAll("\\s", "_")))

var newDf = df
for (col <- df.columns) {
  newDf = newDf.withColumnRenamed(col, col.replaceAll("\\s", "_"))
}
You can encapsulate it in some method so it won't be too much pollution.

In Python, this can be done by the following code:
# Importing sql types
from pyspark.sql.types import StringType, StructType, StructField
from pyspark.sql.functions import col
# Building a simple dataframe:
schema = StructType([
    StructField("id name", StringType(), True),
    StructField("cities venezuela", StringType(), True)
])
column1 = ['A', 'A', 'B', 'B', 'C', 'B']
column2 = ['Maracaibo', 'Valencia', 'Caracas', 'Barcelona', 'Barquisimeto', 'Merida']
# Dataframe:
df = sqlContext.createDataFrame(list(zip(column1, column2)), schema=schema)
# Alias every column, replacing spaces with underscores, then select them all
exprs = [col(column).alias(column.replace(' ', '_')) for column in df.columns]
df.select(*exprs).show()

You can do the exact same thing in Python:
raw_data1 = raw_data
for col in raw_data.columns:
    raw_data1 = raw_data1.withColumnRenamed(col, col.replace(" ", "_"))

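In Scala, here is another way of achieving the same:
import org.apache.spark.sql.types._
// Rebuild the schema with renamed fields; " " -> "" strips spaces (use "_" to substitute underscores)
val df_with_newColumns = spark.createDataFrame(df.rdd,
  StructType(df.schema.map(s => StructField(s.name.replaceAll(" ", ""),
    s.dataType, s.nullable))))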
Hope this helps !!

Here is the utility we are using.
import org.apache.spark.sql.DataFrame
// Lower-case and trim every column name, replacing spaces with underscores
def columnsStandardise(df: DataFrame): DataFrame =
  df.toDF(df.columns map (_.toLowerCase().trim().replaceAll(" ", "_")): _*)

I also wanted to add this solution:
import re
for each in df.schema.names:
    df = df.withColumnRenamed(each, re.sub(r'\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*','',each.replace(' ', '')))

I have been using the answer given by @kanielc to trim the leading and trailing spaces in the column headers, and that works great when the number of columns is small. I had to load a CSV file with around 600 columns, and the code took a significant amount of time to execute, which did not meet our expectations.
Earlier Code:
val finalSourceTable = intermediateSourceTable.columns
.foldLeft(intermediateSourceTable)((curr, n) => curr.withColumnRenamed(n, n.trim))
Changed Code:
val finalSourceTable = intermediateSourceTable
.toDF(intermediateSourceTable.columns map (_.trim()): _*)
The changed code worked like a charm and it was also fast compared to the earlier code.
Also we are maintaining the immutability by not using var variables.
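The answers above all boil down to two steps: compute the new names from the old ones, then apply them in one go. As a minimal sketch of the first step in plain Python (no Spark needed to run it), the helper below trims each name and collapses every run of whitespace into a single underscore. The function name sanitize_column_names and the collision check are assumptions for illustration, not something the answers above define.

```python
import re

def sanitize_column_names(names):
    """Trim each name and replace every run of whitespace with one underscore."""
    cleaned = [re.sub(r"\s+", "_", name.strip()) for name in names]
    # Two different originals can map to the same cleaned name
    # (e.g. "id name" and "id  name"), so fail loudly rather than
    # end up with ambiguous column names.
    duplicates = sorted({n for n in cleaned if cleaned.count(n) > 1})
    if duplicates:
        raise ValueError(f"column names collide after cleaning: {duplicates}")
    return cleaned

old_names = ["id name", " cities venezuela ", "population"]  # stand-in for df.columns
new_names = sanitize_column_names(old_names)
print(new_names)  # ['id_name', 'cities_venezuela', 'population']
```

With PySpark, the result could then be applied in a single call, for example df.toDF(*sanitize_column_names(df.columns)), which mirrors the toDF approach from the last answer.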


How to add prefix and suffix values for a column in spark dataframe using scala [duplicate]

How do we concatenate two columns in an Apache Spark DataFrame?
Is there any function in Spark SQL which we can use?
With raw SQL you can use CONCAT:
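In Python:
df = sqlContext.createDataFrame([("foo", 1), ("bar", 2)], ("k", "v"))
df.registerTempTable("df")  # make the DataFrame visible to SQL as table "df"
sqlContext.sql("SELECT CONCAT(k, ' ', v) FROM df")
In Scala:
import sqlContext.implicits._
val df = sc.parallelize(Seq(("foo", 1), ("bar", 2))).toDF("k", "v")
df.registerTempTable("df") // make the DataFrame visible to SQL as table "df"
sqlContext.sql("SELECT CONCAT(k, ' ', v) FROM df")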
Since Spark 1.5.0 you can use concat function with DataFrame API:
In Python:
from pyspark.sql.functions import concat, col, lit
df.select(concat(col("k"), lit(" "), col("v")))
In Scala:
import org.apache.spark.sql.functions.{concat, lit}
df.select(concat($"k", lit(" "), $"v"))
There is also the concat_ws function, which takes a string separator as the first argument.
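Here's how you can do custom naming:
import pyspark
from pyspark.sql import functions as sf
sc = pyspark.SparkContext()
sqlc = pyspark.SQLContext(sc)
df = sqlc.createDataFrame([('row11','row12'), ('row21','row22')], ['colname1', 'colname2'])
df.show()
+--------+--------+
|colname1|colname2|
+--------+--------+
|   row11|   row12|
|   row21|   row22|
+--------+--------+
Create a new column by concatenating:
df = df.withColumn('joined_column',
    sf.concat(sf.col('colname1'), sf.lit('_'), sf.col('colname2')))
df.show()
+--------+--------+-------------+
|colname1|colname2|joined_column|
+--------+--------+-------------+
|   row11|   row12|  row11_row12|
|   row21|   row22|  row21_row22|
+--------+--------+-------------+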
One option to concatenate string columns in Spark Scala is using concat.
It is necessary to check for null values, because if one of the columns is null, the result will be null even if the other columns contain data.
Using concat and withColumn:
val newDf =
  df.withColumn(
    "NEW_COLUMN",
    concat(
      when(col("COL1").isNotNull, col("COL1")).otherwise(lit("null")),
      when(col("COL2").isNotNull, col("COL2")).otherwise(lit("null"))))
Using concat and select:
val newDf = df.selectExpr("concat(nvl(COL1, ''), nvl(COL2, '')) as NEW_COLUMN")
With both approaches you will have a NEW_COLUMN whose value is the concatenation of the columns COL1 and COL2 from your original df.
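To make the null behaviour concrete, here is a small plain-Python model of it. This is not Spark code: the three functions only imitate how concat propagates nulls, how nvl substitutes a default, and how concat_ws skips nulls, with None standing in for null. The sample row is made up for illustration.

```python
def concat(*values):
    """Model of Spark's concat: the result is None (null) if any input is None."""
    if any(v is None for v in values):
        return None
    return "".join(str(v) for v in values)

def nvl(value, default):
    """Model of nvl: return the default when the value is None."""
    return default if value is None else value

def concat_ws(sep, *values):
    """Model of Spark's concat_ws: None (null) inputs are skipped."""
    return sep.join(str(v) for v in values if v is not None)

row = {"COL1": "foo", "COL2": None}  # hypothetical row with one null column

print(concat(row["COL1"], row["COL2"]))                    # None
print(concat(nvl(row["COL1"], ""), nvl(row["COL2"], "")))  # foo
print(concat_ws("_", row["COL1"], row["COL2"]))            # foo
```

The first call shows why the guard is needed, the second corresponds to the nvl variant above, and the third shows that concat_ws can be an alternative when dropping null inputs is acceptable.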
concat(*cols)
v1.5 and higher
Concatenates multiple input columns together into a single column. The function works with strings, binary and compatible array columns.
Eg: new_df = df.select(concat(df.a, df.b, df.c))
concat_ws(sep, *cols)
v1.5 and higher
Similar to concat but uses the specified separator.
Eg: new_df = df.select(concat_ws('-', df.col1, df.col2))
map_concat(*cols)
v2.4 and higher
Used to concat maps, returns the union of all the given maps.
Eg: new_df = df.select(map_concat("map1", "map2"))
Using concat operator (||):
v2.3 and higher
Eg: df = spark.sql("select col_a || col_b || col_c as abc from table_x")
Reference: Spark SQL documentation
If you want to do it using DF, you could use a udf to add a new column based on existing columns.
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.udf
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._ // enables the $"col" syntax
case class MyDf(col1: String, col2: String)
//here is our dataframe
val df = sqlContext.createDataFrame(sc.parallelize(
  Array(MyDf("A", "B"), MyDf("C", "D"), MyDf("E", "F"))
))
//Define a udf to concatenate two passed in string values
val getConcatenated = udf( (first: String, second: String) => { first + " " + second } )
//use withColumn method to add a new column called newColName
df.withColumn("newColName", getConcatenated($"col1", $"col2")).select("newColName", "col1", "col2").show()
From Spark 2.3 (SPARK-22771), Spark SQL supports the concatenation operator ||.
For example:
val df = spark.sql("select _c1 || _c2 as concat_column from <table_name>")
Here is another way of doing this for pyspark:
#import concat and lit functions from pyspark.sql.functions
from pyspark.sql.functions import concat, lit
#Create your data frame
countryDF = sqlContext.createDataFrame([('Ethiopia',), ('Kenya',), ('Uganda',), ('Rwanda',)], ['East Africa'])
#Use select, concat, and lit functions to do the concatenation
personDF = countryDF.select(concat(countryDF['East Africa'], lit('n')).alias('East African'))
#Show the new data frame
personDF.show()
+------------+
|East African|
+------------+
|   Ethiopian|
|      Kenyan|
|     Ugandan|
|     Rwandan|
+------------+
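Here is a suggestion for when you don't know the number or names of the columns in the DataFrame.
// dfSource stands for the input DataFrame (the name is a placeholder)
val dfResults = dfSource.select(concat_ws(",", dfSource.columns.map(c => col(c)): _*))
Is there Java syntax corresponding to the process below?
val dfResults = dfSource.select(concat_ws(",", dfSource.columns.map(c => col(c)): _*))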
In Spark 2.3.0, you may do:
spark.sql( """ select '1' || column_a from table_a """)
In Java you can do this to concatenate multiple columns. The sample code is to provide you a scenario and how to use it for better understanding.
// requires: import static org.apache.spark.sql.functions.*;
SparkSession spark = JavaSparkSessionSingleton.getInstance(rdd.context().getConf());
Dataset<Row> reducedInventory = spark.sql("select * from table_name")
    .withColumn("concatenatedCol",
        concat(col("col1"), lit("_"), col("col2"), lit("_"), col("col3")));
class JavaSparkSessionSingleton {
    private static transient SparkSession instance = null;
    public static SparkSession getInstance(SparkConf sparkConf) {
        if (instance == null) {
            instance = SparkSession.builder().config(sparkConf)
                .getOrCreate();
        }
        return instance;
    }
}
The above code concatenates col1, col2 and col3, separated by "_", to create a column named "concatenatedCol".
In my case, I wanted a pipe ('|') delimited row.
from pyspark.sql import functions as F
df.select(F.concat_ws('|','_c1','_c2','_c3','_c4')).show()
This worked like a hot knife through butter.
Use the concat method like this:
Dataset<Row> DF2 = DF1
Another way to do it in PySpark using sqlContext:
from pyspark.sql.functions import concat
# Suppose we have a dataframe:
df = sqlContext.createDataFrame([('row1_1','row1_2')], ['colname1', 'colname2'])
# Now we can concatenate columns and assign the new column a name
df = df.select(concat(df.colname1, df.colname2).alias('joined_colname'))
Indeed, there are some beautiful built-in abstractions that let you accomplish the concatenation without implementing a custom function. Since you mentioned Spark SQL, I am guessing you are trying to pass it as a declarative command through spark.sql(). If so, you can do it in a straightforward manner by passing a SQL command like:
SELECT CONCAT(col1, '<delimiter>', col2, ...) AS concat_column_name FROM <table_name>;
Also, from Spark 2.3.0, you can use commands along the lines of:
SELECT col1 || col2 AS concat_column_name FROM <table_name>;
Here, <delimiter> is your preferred delimiter (it can be an empty space as well) and <table_name> is the temporary or permanent table you are trying to read from.
We can simply use selectExpr as well.
df1.selectExpr("*","upper(_2||_3) as new")
We can use concat() in the select method of a dataframe:
val fullName = nameDF.select(concat(col("FirstName"), lit(" "), col("LastName")).as("FullName"))
Using withColumn and concat
val fullName1 = nameDF.withColumn("FullName", concat(col("FirstName"), lit(" "), col("LastName")))
Using spark.sql concat function
val fullNameSql = spark.sql("select Concat(FirstName, LastName) as FullName from names")
val newDf =
  df.withColumn(
    "NEW_COLUMN",
    concat(
      when(col("COL1").isNotNull, col("COL1")).otherwise(lit("null")),
      when(col("COL2").isNotNull, col("COL2")).otherwise(lit("null"))))
Note: depending on the API you are using, you may need to put the parentheses "()" after "isNotNull". In Java and PySpark it is a regular method call, "isNotNull()"; in Scala it is declared without a parameter list and is written "isNotNull".
val newDf =
  df.withColumn(
    "NEW_COLUMN",
    concat(
      when(col("COL1").isNotNull(), col("COL1")).otherwise(lit("null")),
      when(col("COL2").isNotNull(), col("COL2")).otherwise(lit("null"))))

Spark Dataframe select based on column index

How do I select all the columns of a dataframe that are at certain indexes in Scala?
For example, if a dataframe has 100 columns and I want to extract only columns (10,12,13,14,15), how do I do that?
The following selects all columns from dataframe df whose names are listed in the Array colNames:
df = df.select(colNames.head, colNames.tail: _*)
If there is a similar array colNos, which has
colNos = Array(10,20,25,45)
how do I transform the above to fetch only the columns at those specific indexes?
You can map over columns:
import org.apache.spark.sql.functions.col
df.select(colNos map df.columns map col: _*)
or:
df.select(colNos map (df.columns andThen col): _*)
or:
df.select(colNos map (col _ compose df.columns): _*)
All the methods shown above are equivalent and don't impose a performance penalty. The following mapping:
colNos map df.columns
is just a local Array access (constant-time access for each index), and choosing between the String-based or Column-based variant of select doesn't affect the execution plan:
val df = Seq((1, 2, 3 ,4, 5, 6)).toDF
val colNos = Seq(0, 3, 5)
df.select(colNos map df.columns map col: _*).explain
== Physical Plan ==
LocalTableScan [_1#46, _4#49, _6#51]
df.select("_1", "_4", "_6").explain
== Physical Plan ==
LocalTableScan [_1#46, _4#49, _6#51]
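The index-to-name mapping described above is ordinary local list access, so it can be sketched without Spark at all. The plain-Python helper below is a hypothetical illustration (the name names_at and the bounds check are assumptions, not part of the answer): it turns zero-based positions into column names and reports positions that do not exist.

```python
def names_at(columns, indexes):
    """Return the column names found at the given zero-based positions."""
    out_of_range = [i for i in indexes if not 0 <= i < len(columns)]
    if out_of_range:
        raise IndexError(
            f"no column at position(s) {out_of_range}; "
            f"the frame has {len(columns)} columns"
        )
    return [columns[i] for i in indexes]

columns = ["_1", "_2", "_3", "_4", "_5", "_6"]  # stand-in for df.columns
print(names_at(columns, [0, 3, 5]))  # ['_1', '_4', '_6']
```

In PySpark the resulting names could be passed straight to select, for example df.select(*names_at(df.columns, [0, 3, 5])).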
@user6910411's answer above works like a charm, and the number of tasks and the logical plan are similar to my approach below, BUT my approach is a bit faster.
I would suggest going with column names rather than column numbers. Column names are much safer and much lighter than using numbers. You can use the following solution:
val colNames = Seq("col1", "col2" ...... "col99", "col100")
val selectColNames = Seq("col1", "col3", .... selected column names ... )
val selectCols = selectColNames.map(name => df.col(name))
df = df.select(selectCols: _*)
If you are hesitant to write out all 100 column names, there is a shortcut too:
val colNames = df.schema.fieldNames
Example: Grab first 14 columns of Spark Dataframe by Index using Scala.
import org.apache.spark.sql.functions.col
// Gives array of names by index (first 14 cols for example)
val sliceCols = df.columns.slice(0, 14)
// Maps names & selects columns in dataframe
val subset_df = df.select(sliceCols.map(name => col(name)): _*)
You cannot simply do this (as I tried and failed):
// Gives array of names by index (first 14 cols for example)
val sliceCols = df.columns.slice(0, 14)
// Fails to compile: select has no overload that takes only a String vararg
val subset_df = df.select(sliceCols: _*)
The reason is that you have to convert your Array[String] to an Array[org.apache.spark.sql.Column] for the select to work.
Or wrap it in a function using currying (high five to my colleague for this):
// Subsets a DataFrame to the columns between the beg_val and end_val indexes.
def subset_frame(beg_val:Int=0, end_val:Int)(df: DataFrame): DataFrame = {
  val sliceCols = df.columns.slice(beg_val, end_val)
  df.select(sliceCols.map(name => col(name)): _*)
}
// Get first 25 columns as subsetted dataframe
val subset_df:DataFrame = df_.transform(subset_frame(0, 25))

How to convert all column of dataframe to numeric spark scala?

I loaded a CSV as a dataframe. I would like to cast all columns to float, given that the file is too big to write out all the column names:
val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val df = spark.read.option("header", true).option("inferSchema", "true").csv("C:/Users/mhattabi/Desktop/dataTest2.csv")
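Given this DataFrame as an example, where id is a string column and c0 is an integer column:
val df = sqlContext.createDataFrame(Seq(("0", 0),("1", 1),("2", 0))).toDF("id", "c0")
You can fold over the DataFrame's columns, obtained with .columns:
import org.apache.spark.sql.functions.col
val castedDF = df.columns.foldLeft(df)((current, c) => current.withColumn(c, col(c).cast("float")))
In the new DataFrame every column has type float; castedDF.printSchema() shows the result.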
If you want to exclude some columns from casting, you could do something like this (supposing we want to exclude the column id):
val exclude = Array("id")
val someCastedDF = (df.columns.toBuffer --= exclude).foldLeft(df)((current, c) =>
  current.withColumn(c, col(c).cast("float")))
where exclude is an Array of all columns we want to exclude from casting.
In this new DataFrame, id keeps its string type and c0 is cast to float.
Please note that this may not be the best solution, but it can be a starting point.
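As a possible variation on the same idea, the casts can be expressed as a list of SQL expression strings and applied in a single projection instead of one withColumn call per column. The plain-Python sketch below only builds the strings; the helper name cast_exprs, its parameters and the backtick quoting are assumptions made for illustration, not something the answer above prescribes.

```python
def cast_exprs(columns, target_type="float", exclude=()):
    """Build one SQL expression per column: a cast, or the column itself if excluded."""
    skip = set(exclude)
    exprs = []
    for name in columns:
        # Backtick-quote the name so spaces or dots in it are not parsed as SQL;
        # a literal backtick inside a name is escaped by doubling it.
        quoted = "`" + name.replace("`", "``") + "`"
        if name in skip:
            exprs.append(quoted)
        else:
            exprs.append(f"CAST({quoted} AS {target_type}) AS {quoted}")
    return exprs

print(cast_exprs(["id", "c0"], exclude=["id"]))
# ['`id`', 'CAST(`c0` AS float) AS `c0`']
```

In PySpark the list could then be passed to selectExpr, for example df.selectExpr(*cast_exprs(df.columns, exclude=["id"])).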

How to sum the values of one column of a dataframe in spark/scala

I have a Dataframe that I read from a CSV file with many columns like: timestamp, steps, heartrate etc.
I want to sum the values of each column, for instance the total number of steps on "steps" column.
As far as I can see, I want to use functions of this kind.
But I can't understand how to use the function sum.
When I write the following:
val df = CSV.load(args(0))
val sumSteps = df.sum("steps")
the function sum cannot be resolved.
Do I use the function sum wrongly?
Do I need to use the function map first? And if yes, how?
A simple example would be very helpful! I started writing Scala recently.
You must first import the functions:
import org.apache.spark.sql.functions._
Then you can use them like this:
val df = CSV.load(args(0))
val sumSteps = df.agg(sum("steps")).first.get(0)
You can also cast the result if needed:
val sumSteps: Long = df.agg(sum("steps").cast("long")).first.getLong(0)
For multiple columns (e.g. "col1", "col2", ...), you could get all aggregations at once:
val sums = df.agg(sum("col1").as("sum_col1"), sum("col2").as("sum_col2"), ...).first
For dynamically applying the aggregations, the following options are available:
Applying to all numeric columns at once:
df.groupBy().sum()
Applying to a list of numeric column names:
val columnNames = List("col1", "col2")
df.groupBy().sum(columnNames: _*)
Applying to a list of numeric column names with aliases and/or casts:
val cols = List("col1", "col2")
val sums = cols.map(colName => sum(colName).cast("double").as("sum_" + colName))
df.groupBy().agg(sums.head, sums.tail:_*).show()
If you want to sum all values of one column, it's more efficient to use DataFrame's internal RDD and reduce.
import sqlContext.implicits._
import org.apache.spark.sql.functions._
val df = sc.parallelize(Array(10,2,3,4)).toDF("steps")
df.select(col("steps")).rdd.map(_(0).asInstanceOf[Int]).reduce(_+_)
//res1 Int = 19
Simply apply the aggregation function sum to your column.
Not sure this was around when this question was asked, but df.describe("steps").show() gives count, mean and stddev stats on a column. I think it returns stats for all columns if you just call df.describe().show().
Using a Spark SQL query, just in case it helps anyone!
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf
import org.apache.spark.sql.functions._
import org.apache.spark.SparkContext
val conf = new SparkConf().setMaster("local[2]").setAppName("test")
val spark = SparkSession.builder.config(conf).getOrCreate()
import spark.implicits._ // needed for toDF and for the Long encoder used by map
val df = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5, 6, 7)).toDF("steps")
df.createOrReplaceTempView("steps") // register the table queried below
val sum = spark.sql("select sum(steps) as stepsSum from steps").map(row => row.getAs[Long]("stepsSum")).collect()(0)
println("steps sum = " + sum) //prints 28