spark withColumn value generation from all column values - scala

I want to add a column from all existing column values in the same row.
For example,
col1    col2    ...   coln     col_new
-----   -----         -----    ----------------
True    False   ...   False    "col1-..."
False   True    ...   True     "col2-...-coln"
That is, whenever a value is True, append that column's name to the new column, joined with a "-" separator, and keep doing this through the last column. We don't know in advance how many columns there will be.
How can I achieve this with withColumn() in Spark? (Scala)

If the columns are all of BooleanType then you can write a udf function to get the new column as below
import org.apache.spark.sql.functions._
val columnNames = df.columns
def concatColNames = udf((array: collection.mutable.WrappedArray[Boolean]) => array.zip(columnNames).filter(x => x._1 == true).map(_._2).mkString("-"))
df.withColumn("col_new", concatColNames(array(df.columns.map(col): _*))).show(false)
If the columns are all of StringType then you just need to modify the udf function as below
def concatColNames = udf((array: collection.mutable.WrappedArray[String]) => array.zip(columnNames).filter(x => x._1 == "True").map(_._2).mkString("-"))
You should get what you require
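If you prefer to avoid a udf, here is a minimal sketch using only built-in functions, assuming all columns are of BooleanType (df is the input dataframe):
import org.apache.spark.sql.functions._

// For each column, emit the column name when the value is true, otherwise null;
// concat_ws skips nulls, so only the names of the "true" columns are joined with "-"
val nameIfTrue = df.columns.map(c => when(col(c), lit(c)))
df.withColumn("col_new", concat_ws("-", nameIfTrue: _*)).show(false)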

Related

Scala: How to return column name and value from a dataframe

I am trying to create a function which can scan a dataframe row by row and, for each row, output the non-empty columns together with their column names. The challenge is that I don't know the number of columns or their names in the input dataframe.
A function something like GetNotEmptyCols(InputRow: Row): (Colname:String, ColValue:String)
As sample data, consider the following dataframes.
val DataFrameA = Seq(("tot","","ink"), ("yes","yes",""), ("","","many")).toDF("ColA","ColB","ColC")
val DataFrameB = Seq(("yes",""), ("","")).toDF("ColD","ColE")
I have been trying to get the column values from each Row object but don't know how to do that without the column names. I could extract the column names from the dataframe and pass them to the function as an additional argument, but I am hoping for a better approach, since the Row object should carry the column names and I should be able to extract them from it.
The output I am working to get is something like this:
DataFrameA.foreach{ row => GetNotEmptyCols(row)} gives output
For row1: ("ColA", "tot"), ("ColC", "ink")
For row2: ("ColA","yes"),("ColB","yes")
For row3: ("ColC","many")
DataFrameB.foreach{ row => GetNotEmptyCols(row)} gives output
For row1: ("ColD", "yes")
For row2: ()
Please find below my implementation of GetNotEmptyCols, which takes the row along with the column names -
import org.apache.spark.sql.{Row, SparkSession}
import scala.collection.mutable.ArrayBuffer
object StackoverFlowProblem {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("Test").master("local").getOrCreate()
    import spark.sqlContext.implicits._
    import org.apache.spark.sql.functions._

    val DataFrameA = Seq(("tot","","ink"), ("yes","yes",""), ("","","many")).toDF("ColA","ColB","ColC")
    val DataFrameB = Seq(("yes",""), ("","")).toDF("ColD","ColE")

    // Store the column names in a variable, appending the to-be-added 'index' column as well
    val columns = DataFrameA.columns :+ "index"

    // Use monotonically_increasing_id() to add a row index to the dataframe
    DataFrameA.withColumn("index", monotonically_increasing_id()).foreach(a => GetNotEmptyCols(a, columns))
  }

  def GetNotEmptyCols(inputRow: Row, columns: Array[String]): Unit = {
    val rowIndex = inputRow.getAs[Long]("index")
    val nonEmptyCols = ArrayBuffer[(String, String)]()
    // Iterate over every field except the last one, which is the appended 'index' column
    for (i <- 0 until inputRow.length - 1) {
      val value = inputRow.getAs[String](i)
      if (!value.isEmpty) {
        val name = columns(i)
        nonEmptyCols += Tuple2(name, value)
      }
    }
    println(s"For row $rowIndex: " + nonEmptyCols.mkString(","))
  }
}
This will print the output below for your first dataframe (I have used zero-based indexing for the row numbers) -
For row 0: (ColA,tot),(ColC,ink)
For row 1: (ColA,yes),(ColB,yes)
For row 2: (ColC,many)
I found a solution myself. I can use the getValuesMap method to create a map of column names to values, which I can return and then convert to a list.
def returnNotEmptyCols(inputRow: Row): Map[String, String] = {
  val colValues = inputRow.getValuesMap[String](inputRow.schema.fieldNames)
    .filter(x => x._2 != null && x._2 != "")
  colValues
}

returnNotEmptyCols(rowA1).toList
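A hedged sketch of applying this helper to every row (collecting to the driver, which is fine for small data; the dataframe name follows the question):
// Print the non-empty (column name, value) pairs for every row of DataFrameA
DataFrameA.collect().foreach { row =>
  println(returnNotEmptyCols(row).toList.mkString(","))
}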

How to change column type for a list of dataframe columns

I'm trying to change the type of a list of columns for a Dataframe in Spark 1.6.0.
All the examples I have found so far, however, only cast a single column (df.withColumn) or all the columns in the dataframe:
val castedDF = filteredDf.columns.foldLeft(filteredDf)((filteredDf, c) => filteredDf.withColumn(c, col(c).cast("String")))
Is there any efficient, batch way of doing this for a list of columns in the dataframe?
There is nothing wrong with withColumn* but you can use select if you prefer:
import org.apache.spark.sql.functions.col

val columnsToCast: Set[String]
val outputType: String = "string"

df.select(df.columns map (c =>
  if (columnsToCast.contains(c)) col(c).cast(outputType) else col(c)
): _*)
* The execution plan will be the same for a single select as for chained withColumn calls.
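For illustration, a minimal self-contained sketch of the select approach (the sample dataframe and the contents of columnsToCast are made-up placeholders):
import org.apache.spark.sql.functions.col

val df = Seq((1, 2.0, true)).toDF("x", "y", "z")
val columnsToCast = Set("x", "y")
val outputType = "string"

// Cast only the listed columns and keep the rest untouched; the alias keeps the original names
val casted = df.select(df.columns.map(c =>
  if (columnsToCast.contains(c)) col(c).cast(outputType).as(c) else col(c)
): _*)
casted.printSchema()  // x and y become string, z stays boolean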

Check every column in a spark dataframe has a certain value

Can we check whether every column in a Spark dataframe contains a certain string (for example "Y") using Spark SQL or Scala?
I have tried the following but don't think it is working properly.
df.select(df.col("*")).filter("'*' =='Y'")
Thanks,
Sai
You can do something like this; note that unioning the per-column filters keeps the rows where at least one column contains 'Y' (possibly with duplicates). To require 'Y' in every column, chain the filters instead, as shown in the next answer:
//Get all columns
val columns: Array[String] = df.columns
//For each column, keep the rows with 'Y'
val seqDfs: Seq[DataFrame] = columns.map(name => df.filter(s"$name == 'Y'"))
//Union all the dataframes together into one final dataframe
val output: DataFrame = seqDfs.reduceRight(_ union _)
You can use the dataframe's columns method to get all the column names
val columnNames: Array[String] = df.columns
and then add all the filters in a loop
var filteredDf = df.select(df.col("*"))
for(name <- columnNames) {
  filteredDf = filteredDf.filter(s"$name == 'Y'")
}
or you can build a SQL query string using the same approach; a sketch follows below.
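A hedged sketch of that SQL-string variant, assuming a Spark 2.x SparkSession named spark (the temp view name "t" is also an assumption):
// Register the dataframe as a temp view and build one WHERE clause requiring 'Y' in every column
df.createOrReplaceTempView("t")
val whereClause = df.columns.map(c => s"$c = 'Y'").mkString(" AND ")
spark.sql(s"SELECT * FROM t WHERE $whereClause").show()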
If you want to filter every row in which any of the columns is equal to 1 (or anything else), you can dynamically create a query like this (this answer is in PySpark):
from pyspark.sql.functions import col, lit

cols = [col(c) == lit(1) for c in df.columns]
query = cols[0]
for c in cols[1:]:
    query |= c
df.filter(query).show()
It's a bit verbose, but it is very clear what is happening. A more elegant version would be:
from functools import reduce

res = df.filter(reduce(lambda x, y: x | y, (col(c) == lit(1) for c in df.columns)))
res.show()
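Since the question is about Scala, here is a hedged sketch of the same idea with combined Column predicates (df is assumed):
import org.apache.spark.sql.functions.{col, lit}

// OR the per-column predicates together: keeps rows where at least one column equals "Y"
val anyY = df.columns.map(c => col(c) === lit("Y")).reduce(_ || _)
df.filter(anyY).show()

// AND them together instead to keep only the rows where every column equals "Y"
val allY = df.columns.map(c => col(c) === lit("Y")).reduce(_ && _)
df.filter(allY).show()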

Spark Dataframe select based on column index

How do I select all the columns of a dataframe that has certain indexes in Scala?
For example, if a dataframe has 100 columns and I want to extract only columns 10, 12, 13, 14 and 15, how can I do that?
The following selects all the columns from dataframe df whose names appear in the Array colNames:
df = df.select(colNames.head,colNames.tail: _*)
If there is a similar colNos array, such as
colNos = Array(10,20,25,45)
how do I transform the above df.select to fetch only the columns at those specific indexes?
You can map over columns:
import org.apache.spark.sql.functions.col
df.select(colNos map df.columns map col: _*)
or:
df.select(colNos map (df.columns andThen col): _*)
or:
df.select(colNos map (col _ compose df.columns): _*)
All the methods shown above are equivalent and don't impose a performance penalty. The following mapping:
colNos map df.columns
is just a local Array access (constant-time access for each index), and choosing between the String-based and Column-based variants of select doesn't affect the execution plan:
val df = Seq((1, 2, 3 ,4, 5, 6)).toDF
val colNos = Seq(0, 3, 5)
df.select(colNos map df.columns map col: _*).explain
== Physical Plan ==
LocalTableScan [_1#46, _4#49, _6#51]
df.select("_1", "_4", "_6").explain
== Physical Plan ==
LocalTableScan [_1#46, _4#49, _6#51]
@user6910411's answer above works like a charm, and the number of tasks/logical plan is similar to my approach below. BUT my approach is a bit faster.
So,
I would suggest you go with column names rather than column numbers. Column names are much safer and much lighter than using numbers. You can use the following solution:
val colNames = Seq("col1", "col2" ...... "col99", "col100")
val selectColNames = Seq("col1", "col3", .... selected column names ... )
val selectCols = selectColNames.map(name => df.col(name))
df = df.select(selectCols:_*)
If you are hesitant to write out all 100 column names, there is a shortcut too:
val colNames = df.schema.fieldNames
Example: Grab first 14 columns of Spark Dataframe by Index using Scala.
import org.apache.spark.sql.functions.col
// Gives array of names by index (first 14 cols for example)
val sliceCols = df.columns.slice(0, 14)
// Maps names & selects columns in dataframe
val subset_df = df.select(sliceCols.map(name=>col(name)):_*)
You cannot simply do this (as I tried and failed):
// Gives array of names by index (first 14 cols for example)
val sliceCols = df.columns.slice(0, 14)
// Maps names & selects columns in dataframe
val subset_df = df.select(sliceCols)
The reason is that select does not accept an Array[String] directly; you have to convert the Array[String] into an Array[org.apache.spark.sql.Column] (for example by mapping the names through col) for the selection to work.
Or wrap it in a function using currying (high five to my colleague for this):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Subsets the dataframe to the columns between the beg_val and end_val indexes.
def subset_frame(beg_val: Int = 0, end_val: Int)(df: DataFrame): DataFrame = {
  val sliceCols = df.columns.slice(beg_val, end_val)
  df.select(sliceCols.map(name => col(name)): _*)
}
// Get first 25 columns as subsetted dataframe
val subset_df:DataFrame = df_.transform(subset_frame(0, 25))

Adding a constant value to columns in a Spark dataframe

I have a Spark dataframe like the one below:
id   person   age
1    naveen   24
I want to add a constant "del" to each column value except the last column in the dataframe, like below:
id     person      age
1del   naveendel   24
Can someone help me implement this on a Spark dataframe using Scala?
You can use the lit and concat functions:
import org.apache.spark.sql.functions._
// add suffix to all but last column (would work for any number of cols):
val colsWithSuffix = df.columns.dropRight(1).map(c => concat(col(c), lit("del")) as c)
def result = df.select(colsWithSuffix :+ $"age": _*)
result.show()
// +----+---------+---+
// |id |person |age|
// +----+---------+---+
// |1del|naveendel|24 |
// +----+---------+---+
EDIT: to also accommodate null values, you can wrap the column with coalesce before appending the suffix - replace the line calculating colsWithSuffix with:
val colsWithSuffix = df.columns.dropRight(1)
.map(c => concat(coalesce(col(c), lit("")), lit("del")) as c)
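For completeness, a minimal self-contained sketch of the null-safe variant (the sample rows are made up for illustration):
import org.apache.spark.sql.functions._

// Without coalesce, concat returns null whenever any input column is null
val df = Seq((Some("1"), Option.empty[String], 24), (Some("2"), Some("naveen"), 30))
  .toDF("id", "person", "age")

val colsWithSuffix = df.columns.dropRight(1)
  .map(c => concat(coalesce(col(c), lit("")), lit("del")) as c)
df.select(colsWithSuffix :+ col("age"): _*).show()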