Spark Dataframe select based on column index - scala

How do I select all the columns of a dataframe that has certain indexes in Scala?
For example, if a dataframe has 100 columns and I want to extract only columns (10, 12, 13, 14, 15), how do I do that?
The following selects all columns from dataframe df whose names appear in the Array colNames:
df = df.select(colNames.head, colNames.tail: _*)
If there is a similar array, colNos, which has
colNos = Array(10,20,25,45)
how do I transform the above df.select to fetch only the columns at those specific indexes?

You can map over columns:
import org.apache.spark.sql.functions.col
df.select(colNos map df.columns map col: _*)
or:
df.select(colNos map (df.columns andThen col): _*)
or:
df.select(colNos map (col _ compose df.columns): _*)
All the methods shown above are equivalent and don't impose a performance penalty. The following mapping:
colNos map df.columns
is just a local Array access (constant-time access for each index), and choosing between the String-based and Column-based variants of select doesn't affect the execution plan:
val df = Seq((1, 2, 3, 4, 5, 6)).toDF
val colNos = Seq(0, 3, 5)
df.select(colNos map df.columns map col: _*).explain
== Physical Plan ==
LocalTableScan [_1#46, _4#49, _6#51]
df.select("_1", "_4", "_6").explain
== Physical Plan ==
LocalTableScan [_1#46, _4#49, _6#51]

@user6910411's answer above works like a charm, and the number of tasks/the logical plan is similar to my approach below, BUT my approach is a bit faster.
So,
I would suggest you go with column names rather than column numbers. Column names are much safer and much lighter than using numbers. You can use the following solution:
val colNames = Seq("col1", "col2" ...... "col99", "col100")
val selectColNames = Seq("col1", "col3", .... selected column names ... )
val selectCols = selectColNames.map(name => df.col(name))
df = df.select(selectCols:_*)
If you are hesitant to write out all 100 column names, there is a shortcut too:
val colNames = df.schema.fieldNames
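If you still want to drive the selection by index, you can combine the two ideas: look the names up by position, then select by name (a minimal sketch, reusing the colNos indexes from the question):
// Names at the requested indexes, then the String-based select
val colNos = Seq(10, 20, 25, 45)
val picked = colNos.map(colNames)
df.select(picked.head, picked.tail: _*)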

Example: Grab first 14 columns of Spark Dataframe by Index using Scala.
import org.apache.spark.sql.functions.col
// Gives array of names by index (first 14 cols for example)
val sliceCols = df.columns.slice(0, 14)
// Maps names & selects columns in dataframe
val subset_df = df.select(sliceCols.map(name => col(name)): _*)
You cannot simply do this (as I tried and failed):
// Gives array of names by index (first 14 cols for example)
val sliceCols = df.columns.slice(0, 14)
// Maps names & selects columns in dataframe
val subset_df = df.select(sliceCols)
The reason is that you have to convert the Array[String] to an Array[org.apache.spark.sql.Column] for that form of select to work.
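Note that the conversion is only needed for the Column-based overload; select also has a String varargs overload, so this should work as well:
// String varargs overload of select: no Column conversion required
val subset_df = df.select(sliceCols.head, sliceCols.tail: _*)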
Or wrap it in a function using currying (high five to my colleague for this):
import org.apache.spark.sql.DataFrame
// Subsets a dataframe to the columns between the beg_val and end_val indexes.
def subset_frame(beg_val: Int = 0, end_val: Int)(df: DataFrame): DataFrame = {
  val sliceCols = df.columns.slice(beg_val, end_val)
  df.select(sliceCols.map(name => col(name)): _*)
}
// Get first 25 columns as a subsetted dataframe
val subset_df: DataFrame = df.transform(subset_frame(0, 25))
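The slice-based version only handles contiguous ranges; a hypothetical variant for arbitrary indexes (subsetByIndex is a made-up name) could look like:
// Subset by arbitrary (non-contiguous) column indexes
def subsetByIndex(indexes: Seq[Int])(df: DataFrame): DataFrame =
  df.select(indexes.map(i => col(df.columns(i))): _*)
val picked = df.transform(subsetByIndex(Seq(10, 20, 25, 45)))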

Related

Dynamically generate code having filter, withColumnRenamed and coalesce condition Scala Spark

I have a piece of code which I want to generate dynamically. I want to take the columns below in the form of a list or Seq and perform a filter operation with coalesce inside, plus the drop and withColumnRenamed statements.
Here is the list of columns that I want to accept dynamically (here as a string):
val cols = "a|tmp_a,b|tmp_b"
The code looks something like this:
val df1 = df2.filter(!(coalesce(col("a"), lit(0)) === coalesce(col("tmp_a"), lit(0))) || !(upper(col("b")) === upper(col("tmp_b"))))
  .drop("a")
  .drop("b")
  .withColumnRenamed("tmp_a", "a")
  .withColumnRenamed("tmp_b", "b")
If more columns are added to cols, how can the code be adapted dynamically? New column pairs should use the same filter condition as the "b|tmp_b" above.
Given an input with the pairs of column names, you can create the two types of filter conditions (below, the first column pair uses the first filter pattern and the rest use the second). After the dataframe is filtered, the drop and withColumnRenamed steps can be applied using a foldLeft.
val cols = "a|tmp_a,b|tmp_b,c|tmp_c".split(",").map(_.split("\\|"))
val filterCondHead = !(coalesce(col(cols.head(0)), lit(0)) === coalesce(col(cols.head(1)), lit(0)))
val filterCondTail = cols.tail.map(c => !(upper(col(c(0))) === upper(col(c(1))))).reduce(_ || _)
val df2 = df.filter(filterCondHead || filterCondTail)
val df3 = cols.foldLeft(df2) { case (df, c) =>
  df.drop(c(0)).withColumnRenamed(c(1), c(0))
}
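For reference, a hypothetical toy input matching the snippet above:
// Toy input: each (x, tmp_x) pair, with tmp_* holding the value to keep
val df = Seq((1, 2, "x", "X", "y", "Y"))
  .toDF("a", "tmp_a", "b", "tmp_b", "c", "tmp_c")
// After the filter and foldLeft, df3 contains columns a, b, c
// holding the former tmp_a, tmp_b, tmp_c values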

Create new DataFrame with new rows depending on the value of a column - Spark Scala

I have a DataFrame with the following data:
num_cta | n_lines
110000000000| 2
110100000000| 3
110200000000| 1
With that information, I need to create a new DF with a different number of rows, depending on the value that comes in the n_lines column.
For example, for the first row of my DF (110000000000), the value of the n_lines column is 2. The result would have to be something like the following:
num_cta
110000000000
110000000000
For the whole example DataFrame shown above, the expected result would be:
num_cta
110000000000
110000000000
110100000000
110100000000
110100000000
110200000000
Is there a way to do that, i.e. to multiply a row n times depending on the value of a column?
Regards.
One approach would be to expand n_lines into an array with a UDF and explode it:
import org.apache.spark.sql.functions.{explode, udf}

val df = Seq(
  ("110000000000", 2),
  ("110100000000", 3),
  ("110200000000", 1)
).toDF("num_cta", "n_lines")

def fillArr = udf(
  (n: Int) => Array.fill(n)(1)
)

val df2 = df.withColumn("arr", fillArr($"n_lines")).
  withColumn("a", explode($"arr")).
  select($"num_cta")
df2.show
+------------+
| num_cta|
+------------+
|110000000000|
|110000000000|
|110100000000|
|110100000000|
|110100000000|
|110200000000|
+------------+
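If you are on Spark 2.4 or later, a UDF isn't needed at all: the built-in array_repeat can build the per-row array directly (a minimal sketch, assuming the same df as above):
import org.apache.spark.sql.functions.{array_repeat, explode, lit}

// Build an n_lines-sized array per row, explode it, keep only num_cta
val df3 = df
  .withColumn("rep", explode(array_repeat(lit(1), $"n_lines")))
  .select($"num_cta")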
There is no off-the-shelf way of doing this. However, you can flatMap over the dataframe and, for each row, return a list of num_cta values whose length equals the corresponding n_lines.
Something like
import spark.implicits._

case class Output(num_cta: String)              // output dataframe schema
case class Input(num_cta: String, n_lines: Int) // input dataframe 'df' schema

val result = df.as[Input].flatMap(x =>
  List.fill(x.n_lines)(Output(x.num_cta))
).toDF

Check every column in a spark dataframe has a certain value

Can we check whether every column in a Spark dataframe contains a certain string (for example "Y") using Spark SQL or Scala?
I have tried the following but don't think it is working properly.
df.select(df.col("*")).filter("'*' =='Y'")
Thanks,
Sai
You can do something like this to keep only the rows where every column contains 'Y':
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col

//Get all column names
val columns: Array[String] = df.columns
//For each column, require the value 'Y'
val conditions: Array[Column] = columns.map(name => col(name) === "Y")
//AND all the conditions together and filter once
val output: DataFrame = df.filter(conditions.reduce(_ && _))
You can use the dataframe method columns to get all the column names
val columnNames: Array[String] = df.columns
and then add all filters in a loop
var filteredDf = df
for (name <- columnNames) {
  filteredDf = filteredDf.filter(s"$name == 'Y'")
}
Or you can create a single SQL query using the same approach.
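A sketch of that SQL-string variant (the temp view name t is arbitrary):
// Build one conjunctive WHERE clause over all columns
df.createOrReplaceTempView("t")
val whereClause = df.columns.map(c => s"$c = 'Y'").mkString(" AND ")
val filteredDf = spark.sql(s"SELECT * FROM t WHERE $whereClause")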
If you instead want to keep every row in which any of the columns is equal to 1 (or anything else), you can dynamically create the query like this (this snippet is PySpark):
from pyspark.sql.functions import col, lit

conditions = [col(c) == lit(1) for c in df.columns]
query = conditions[0]
for c in conditions[1:]:
    query |= c
df.filter(query).show()
It's a bit verbose, but it is very clear what is happening. A more elegant version would be:
from functools import reduce

res = df.filter(reduce(lambda x, y: x | y, (col(c) == lit(1) for c in df.columns)))
res.show()
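Since this question is about Scala, a sketch of the same reduce in Scala (assuming the same df):
import org.apache.spark.sql.functions.{col, lit}

// OR together one equality predicate per column, then filter once
val anyMatch = df.columns.map(c => col(c) === lit(1)).reduce(_ || _)
df.filter(anyMatch).show()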

How to add corresponding Integer values in 2 different DataFrames

I have two DataFrames in my code with exactly the same dimensions, let's say 1,000,000 x 50. I need to add the corresponding values in both dataframes. How do I achieve that?
One option would be to add another column with ids, union both DataFrames and then use reduceByKey. But is there any other, more elegant way?
Thanks.
Your approach is good. Another option is to take the two RDDs, zip them together, and then iterate over the pairs to sum the columns and create a new dataframe using either of the original schemas. Note that RDD.zip assumes both RDDs have the same number of partitions and the same number of elements in each partition.
Assuming the data types of all the columns are integer, this code snippet should work. Please note that this has been done in Spark 2.1.0.
import org.apache.spark.sql.{DataFrame, Row}
import spark.implicits._

val a: DataFrame = spark.sparkContext.parallelize(Seq(
  (1, 2),
  (3, 6)
)).toDF("column_1", "column_2")

val b: DataFrame = spark.sparkContext.parallelize(Seq(
  (3, 4),
  (1, 5)
)).toDF("column_1", "column_2")

// Merge rows by zipping the two RDDs and summing column-wise
val rows = a.rdd.zip(b.rdd).map {
  case (rowLeft, rowRight) =>
    val totalColumns = rowLeft.schema.fields.size
    val summedRow = for (i <- 0 until totalColumns) yield rowLeft.getInt(i) + rowRight.getInt(i)
    Row.fromSeq(summedRow)
}
// Create new data frame
val ab: DataFrame = spark.createDataFrame(rows, a.schema) // use any of the schemas
ab.show()
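For the toy inputs above (and assuming the zipped RDDs keep their insertion order), the summed frame should come out as:
+--------+--------+
|column_1|column_2|
+--------+--------+
|       4|       6|
|       4|      11|
+--------+--------+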
Update:
So, I experimented with the performance of my solution vs. yours. I tested with 100000 rows, each row having 50 columns. With your approach there are 51 columns, the extra one being the ID column. On a single machine (no cluster), my solution seems to work a bit faster.
The union and group-by approach takes about 5598 milliseconds,
whereas my solution takes about 5378 milliseconds.
My assumption is that the first solution takes a bit more time because of the union operation on the two dataframes.
Here are the methods which I created for testing the approaches.
def option_1()(implicit spark: SparkSession): Unit = {
  import spark.implicits._
  val a: DataFrame = getDummyData(withId = true)
  val b: DataFrame = getDummyData(withId = true)
  val allData = a.union(b)
  val result = allData.groupBy($"id").agg(allData.columns.collect({ case col if col != "id" => (col, "sum") }).toMap)
  println(result.count())
  // result.show()
}
def option_2()(implicit spark: SparkSession): Unit = {
  val a: DataFrame = getDummyData()
  val b: DataFrame = getDummyData()
  // Merge rows
  val rows = a.rdd.zip(b.rdd).map {
    case (rowLeft, rowRight) =>
      val totalColumns = rowLeft.schema.fields.size
      val summedRow = for (i <- 0 until totalColumns) yield rowLeft.getInt(i) + rowRight.getInt(i)
      Row.fromSeq(summedRow)
  }
  // Create new data frame
  val result: DataFrame = spark.createDataFrame(rows, a.schema) // use either of the schemas
  println(result.count())
  // result.show()
}
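The getDummyData helper is not shown above; a minimal sketch of what such a helper might look like (hypothetical implementation: 100000 rows of 50 integer columns, with an optional row-index id):
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Hypothetical stand-in for the helper used in the benchmarks above
def getDummyData(withId: Boolean = false)(implicit spark: SparkSession): DataFrame = {
  val numCols = 50
  val numRows = 100000
  val rows = spark.sparkContext.parallelize(0 until numRows).map { i =>
    val values = Seq.fill(numCols)(i % 10)
    Row.fromSeq(if (withId) i +: values else values)
  }
  val dataFields = (1 to numCols).map(c => StructField(s"col_$c", IntegerType))
  val fields = if (withId) StructField("id", IntegerType) +: dataFields else dataFields
  spark.createDataFrame(rows, StructType(fields))
}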

Replicate Spark Row N-times

I want to duplicate a Row in a DataFrame, how can I do that?
For example, I have a DataFrame consisting of 1 Row, and I want to make a DataFrame with 100 identical Rows. I came up with the following solution:
var data: DataFrame = singleRowDF
for (i <- 1 until 100) {
  data = data.unionAll(singleRowDF)
}
But this introduces many transformations and it seems my subsequent actions become very slow. Is there another way to do it?
You can add a column with a literal Array value of size 100, then use explode to make each of its elements create its own row; then just get rid of this "dummy" column:
import org.apache.spark.sql.functions._
val result = singleRowDF
  .withColumn("dummy", explode(array((1 to 100).map(lit): _*)))
  .selectExpr(singleRowDF.columns: _*)
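A hypothetical alternative that skips building the literal array: cross join against spark.range (available since Spark 2.1), assuming singleRowDF has no column named id:
// Cross join against a 100-row range, then drop the helper id column
val replicated = singleRowDF.crossJoin(spark.range(100)).drop("id")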
You could pick out the single row, make a list with a hundred elements populated with that row, and convert it back into a dataframe.
import org.apache.spark.sql.DataFrame
val testDf = sc.parallelize(Seq(
  (1, 2, 3), (4, 5, 6)
)).toDF("one", "two", "three")

def replicateDf(n: Int, df: DataFrame) = sqlContext.createDataFrame(
  sc.parallelize(List.fill(n)(df.take(1)(0)).toSeq),
  df.schema)
val replicatedDf = replicateDf(100, testDf)
You could use a flatMap or a for-comprehension, as described here.
I encourage you to use Datasets whenever you can, but if that's not possible, the last example in the link works with DataFrames as well:
val df = Seq(
(0, "Lorem ipsum dolor", 1.0, List("prp1", "prp2", "prp3"))
).toDF("id", "text", "value", "properties")
val df2 = for {
  row <- df
  p <- row.getAs[Seq[String]]("properties")
} yield (row.getAs[Int]("id"), row.getAs[String]("text"), row.getAs[Double]("value"), p)
Also keep in mind that the explode method on Dataset is deprecated (the explode function in org.apache.spark.sql.functions is not), see here.