How to select columns that exist in case classes from DataFrame - scala

Given a spark DataFrame with columns "id", "first", "last", "year"
val df = sc.parallelize(Seq(
  (1, "John", "Doe", 1986),
  (2, "Ive", "Fish", 1990),
  (4, "John", "Wayne", 1995)
)).toDF("id", "first", "last", "year")
and case class
case class IdAndLastName(
  id: Int,
  last: String)
I would like to select only the columns that appear in the case class, namely id and last. In other words, I would like to get the output of df.select("id", "last") by using the case class rather than hardcoding the attribute names. Could you please help me achieve this in a compact way?

You can explicitly create an encoder for the case class (usually this happens implicitly). Then you can get the field names from the encoder's schema and use them in the select statement:
import org.apache.spark.sql.Encoders

val fieldnames = Encoders.product[IdAndLastName].schema.fieldNames
df.select(fieldnames.head, fieldnames.tail: _*).show()
Output:
+---+-----+
| id| last|
+---+-----+
| 1| Doe|
| 2| Fish|
| 4|Wayne|
+---+-----+

Alternatively, you can map the field names directly to Column objects and pass them all at once:
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.functions.col

val cols = Encoders.product[IdAndLastName].schema.fieldNames.map(col)
df.select(cols: _*).show()
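If you also want the result to be typed, a minimal sketch (assuming a SparkSession named spark is in scope) is to combine the projection with as[IdAndLastName]:
import org.apache.spark.sql.Dataset
import spark.implicits._

// select narrows the dataframe to the case-class columns; as[...] binds them to the case class
val ds: Dataset[IdAndLastName] = df.select(cols: _*).as[IdAndLastName]
ds.show()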

Related

How to add a new column to my DataFrame such that values of new column are populated by some other function in scala?

def myFunc(row: Row): String = {
  // process row
  // returns string
}

def appendNewCol(inputDF: DataFrame): DataFrame = {
  inputDF.withColumn("newcol", myFunc(Row))
  inputDF
}
But no new column got created in my case. My myFunc passes the row to a knowledge-base session object, which returns a string after firing rules. Can I do it this way? If not, what is the right way? Thanks in advance.
I saw many StackOverflow solutions using expr(), udf(), sqlfunc(col(...)) and other techniques, but here my newcol is not derived directly from an existing column.
Dataframe:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val myFunc = (r: Row) => { r.getAs[String]("col1") + "xyz" } // example transformation

val testDf = spark.sparkContext.parallelize(Seq(
  (1, "abc"), (2, "def"), (3, "ghi"))).toDF("id", "col1")
testDf.show

val rddRes = testDf
  .rdd
  .map { x =>
    val y = myFunc(x)
    Row.fromSeq(x.toSeq ++ Seq(y))
  }

val newSchema = StructType(testDf.schema.fields ++ Array(StructField("col2", dataType = StringType, nullable = false)))
spark.sqlContext.createDataFrame(rddRes, newSchema).show
Results:
+---+----+
| id|col1|
+---+----+
| 1| abc|
| 2| def|
| 3| ghi|
+---+----+
+---+----+------+
| id|col1| col2|
+---+----+------+
| 1| abc|abcxyz|
| 2| def|defxyz|
| 3| ghi|ghixyz|
+---+----+------+
With Dataset:
import org.apache.spark.sql.Dataset

case class testData(id: Int, col1: String)
case class transformedData(id: Int, col1: String, col2: String)

val test: Dataset[testData] = List(testData(1, "abc"), testData(2, "def"), testData(3, "ghi")).toDS

val transformedData: Dataset[transformedData] = test
  .map { x: testData =>
    val newCol = x.col1 + "xyz"
    transformedData(x.id, x.col1, newCol)
  }
transformedData.show
As you can see, the Dataset version is more readable, and it provides strong typing.
Since I'm unaware of your Spark version, I'm providing both solutions here. However, if you're on Spark >= 1.6, you should look into Datasets. Playing with RDDs is fun, but it can quickly devolve into longer job runs and a host of other issues that you won't foresee.
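For completeness, the same result can often be obtained without dropping to the RDD API by wrapping the row-level function in a UDF and feeding it a struct() of all columns. This is only a sketch of that pattern, and Row-typed UDF inputs can behave differently across Spark versions:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, struct, udf}

// wrap the row-level logic in a UDF; struct() packs all columns into the Row it receives
val myFuncUdf = udf((r: Row) => r.getAs[String]("col1") + "xyz")
testDf.withColumn("col2", myFuncUdf(struct(testDf.columns.map(col): _*))).show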

Dynamic dataframe with n columns and m rows

I am reading data from JSON (dynamic schema) and loading it into a dataframe.
Example Dataframe:
scala> import spark.implicits._
import spark.implicits._
scala> val DF = Seq(
  (1, "ABC"),
  (2, "DEF"),
  (3, "GHIJ")
).toDF("id", "word")
DF: org.apache.spark.sql.DataFrame = [id: int, word: string]
scala> DF.show
+---+----+
| id|word|
+---+----+
|  1| ABC|
|  2| DEF|
|  3|GHIJ|
+---+----+
Requirement:
Column count and names can be anything. I want to loop over the rows to fetch each column one by one, since I need to process those values in subsequent flows. I need both the column name and the value. I'm using Scala.
Python:
for i, j in df.iterrows():
    print(i, j)
I need the same functionality in Scala, and the column name and value should be fetched separately.
Kindly help.
df.iterrows is not from PySpark, but from pandas. In Spark, you can use foreach:
import org.apache.spark.sql.Row

DF
  .foreach { _ match { case Row(id: Int, word: String) => println(id, word) } }
Result:
(2,DEF)
(3,GHIJ)
(1,ABC)
If you don't know the number of columns, you cannot use unapply on Row; in that case, just do:
DF
  .foreach(row => println(row))
Result:
[1,ABC]
[2,DEF]
[3,GHIJ]
and operate on each row using its methods, such as getAs, etc.
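For instance, a minimal sketch of fetching each column name and value separately per row (the column names are captured on the driver before the foreach runs):
val names = DF.columns
DF.foreach { row =>
  names.foreach(n => println(n, row.getAs[Any](n)))
}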

Spark expression: rename the column list after aggregation

I have written the code below to group and aggregate the columns:
val gmList = List("gc1", "gc2", "gc3")
val aList = List("val1", "val2", "val3", "val4", "val5")
val cype = "first"
val exprs = aList.map((_ -> cype)).toMap
df.groupBy(gmList.map(col): _*).agg(exprs).show
but this creates columns with the aggregation name wrapped around every column name, as shown below.
I want to alias first(val1) back to val1, and I want to make this generic as part of exprs:
+---+---+---+-----------+-----------+-----------+-----------+-----------+
|gc1|gc2|gc3|first(val1)|first(val2)|first(val3)|first(val4)|first(val5)|
+---+---+---+-----------+-----------+-----------+-----------+-----------+
One approach would be to alias the aggregated columns to the original column names in a subsequent select. I would also suggest generalizing the single aggregate function (i.e. first) to a list of functions, as shown below:
import org.apache.spark.sql.functions._

val df = Seq(
  (1, 10, "a1", "a2", "a3"),
  (1, 10, "b1", "b2", "b3"),
  (2, 20, "c1", "c2", "c3"),
  (2, 30, "d1", "d2", "d3"),
  (2, 30, "e1", "e2", "e3")
).toDF("gc1", "gc2", "val1", "val2", "val3")
val gmList = List("gc1", "gc2")
val aList = List("val1", "val2", "val3")
// Populate with different aggregate methods for individual columns if necessary
val fList = List.fill(aList.size)("first")
val afPairs = aList.zip(fList)
// afPairs: List[(String, String)] = List((val1,first), (val2,first), (val3,first))
df.
groupBy(gmList.map(col): _*).agg(afPairs.toMap).
select(gmList.map(col) ::: afPairs.map{ case (v, f) => col(s"$f($v)").as(v) }: _*).
show
// +---+---+----+----+----+
// |gc1|gc2|val1|val2|val3|
// +---+---+----+----+----+
// | 2| 20| c1| c2| c3|
// | 1| 10| a1| a2| a3|
// | 2| 30| d1| d2| d3|
// +---+---+----+----+----+
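For instance, to apply a different aggregate to each value column, you only need to vary fList (a hypothetical variation); the renaming select works unchanged since it strips whatever function name surrounds each column:
// hypothetical: a different aggregate method per value column
val fList = List("first", "max", "min")
val afPairs = aList.zip(fList)
// afPairs: List((val1,first), (val2,max), (val3,min))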
You can slightly change the way you are generating the expressions and use the function alias in there:
import org.apache.spark.sql.functions.{col, first}

val aList = List("val1", "val2", "val3", "val4", "val5")
val exprs = aList.map(c => first(col(c)).alias(c))
df.groupBy(gmList.map(col): _*).agg(exprs.head, exprs.tail: _*).show
Here's a more generic version that will work with any aggregate functions and doesn't require naming your aggregate columns up front. Build your grouped df as you normally would, then use:
val colRegex = raw"^.+\((.*?)\)".r
val newCols = df.columns.map(c => col(c).as(colRegex.replaceAllIn(c, m => m.group(1))))
df.select(newCols: _*)
This will extract out only what is inside the parentheses, regardless of what aggregate function is called (e.g. first(val) -> val, sum(val) -> val, count(val) -> val, etc.).
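As a quick usage sketch (aggDf here is an illustrative name for whatever grouped/aggregated dataframe you built):
// hypothetical usage: aggDf is the aggregated dataframe from the question
val aggDf = df.groupBy(gmList.map(col): _*).agg(afPairs.toMap)
val renamed = aggDf.columns.map(c => col(c).as(colRegex.replaceAllIn(c, m => m.group(1))))
aggDf.select(renamed: _*).show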

Selecting several columns from spark dataframe with a list of columns as a start

Assuming that I have a list of spark columns and a spark dataframe df, what is the appropriate snippet of code in order to select a subdataframe containing only the columns in the list?
Something similar to maybe:
var needed_columns: List[Column] = List[Column](new Column("a"), new Column("b"))
df(needed_columns)
I wanted to get the column names and then select them using the following line of code. Unfortunately, the column name appears to be write-only (I could not read it back from the Column).
df.select(needed_columns.head.as(String), needed_columns.tail: _*)
Your needed_columns is of type List[Column], hence you can simply use needed_columns: _* as the arguments for select:
val df = Seq((1, "x", 10.0), (2, "y", 20.0)).toDF("a", "b", "c")
import org.apache.spark.sql.Column
val needed_columns: List[Column] = List(new Column("a"), new Column("b"))
df.select(needed_columns: _*)
// +---+---+
// | a| b|
// +---+---+
// | 1| x|
// | 2| y|
// +---+---+
Note that select takes two types of arguments:
def select(cols: Column*): DataFrame
def select(col: String, cols: String*): DataFrame
If you have a list of column names of String type, you can use the latter select:
val needed_col_names: List[String] = List("a", "b")
df.select(needed_col_names.head, needed_col_names.tail: _*)
Or, you can map the list of Strings to Columns to use the former select:
import org.apache.spark.sql.functions.col
df.select(needed_col_names.map(col): _*)
I understand that you want to select only certain columns from the dataframe, driven by a separate list (A). In the example below, I select the first name and last name columns (fname, lname) using a separate list. Check this out:
scala> val df = Seq((101,"Jack", "wright" , 27, "01976", "US")).toDF("id","fname","lname","age","zip","country")
df: org.apache.spark.sql.DataFrame = [id: int, fname: string ... 4 more fields]
scala> df.columns
res20: Array[String] = Array(id, fname, lname, age, zip, country)
scala> val needed =Seq("fname","lname")
needed: Seq[String] = List(fname, lname)
scala> val needed_df = needed.map( x=> col(x) )
needed_df: Seq[org.apache.spark.sql.Column] = List(fname, lname)
scala> df.select(needed_df:_*).show(false)
+-----+------+
|fname|lname |
+-----+------+
|Jack |wright|
+-----+------+

Scala: How to add a column with the name of the field that changed between two tables

I have two tables with the same schema (A and B), where every unique ID in table A also exists in table B one-to-one. I want to add a column to table B with the name of the column whose value differs between the tables for each row. There is only one difference per row.
For example:
Table A:
{"id1": 1, "id2": "a", "name": "bob", "state": "nj"}
{"id1": 2, "id2": "b", "name": "sue", "state": "ma"}
Table B:
{"id1": 1, "id2": "a", "name": "bob", "state": "fl"}
{"id1": 2, "id2": "b", "name": "susan", "state": "ma"}
After comparing them, I want Table B to look like this:
{"id1": 1, "id2": "a", "name": "bob", "state": "fl", "changed_field": "state"}
{"id1": 2, "id2": "b", "name": "susan", "state": "ma", "changed_field": "name"}
I can't find any functions that do this in Spark Scala's data frames. Is there something that I missed?
EDIT: I am working with hundreds to thousands of columns
Here's a way to achieve this without having to "spell out" the columns, and without a UDF (only using built-in functions):
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import spark.implicits._
// list of columns to compare
val comparableColumns = A.columns.tail // without id
// create Column that would result in the name of the first differing column:
val changedFieldCol: Column = comparableColumns.foldLeft(lit("")) {
  case (result, col) => when(
    result === "", when($"A.$col" =!= $"B.$col", lit(col)).otherwise(lit(""))
  ).otherwise(result)
}
// join by id1, add changedFieldCol, and then select only B's columns:
val result = A.as("A").join(B.as("B"), "id1")
.withColumn("changed_field", changedFieldCol)
.select("id1", comparableColumns.map(c => s"B.$c") :+ "changed_field": _*)
result.show(false)
// +---+---+-----+-----+-------------+
// |id1|id2|name |state|changed_field|
// +---+---+-----+-----+-------------+
// |1 |a |bob |fl |state |
// |2 |b |susan|ma |name |
// +---+---+-----+-----+-------------+
You can compare the fields in a UDF which generates the appropriate string:
import org.apache.spark.sql.functions.udf
import spark.implicits._

val df_a = Seq(
  (1, "a", "bob", "nj"),
  (2, "b", "sue", "ma")
).toDF("id1", "id2", "name", "state")

val df_b = Seq(
  (1, "a", "bob", "fl"),
  (2, "b", "susane", "ma")
).toDF("id1", "id2", "name", "state")

val compareFields = udf((aName: String, aState: String, bName: String, bState: String) => {
  val changedState = if (aState != bState) Some("state") else None
  val changedName = if (aName != bName) Some("name") else None
  Seq(changedName, changedState).flatten.mkString(",")
})
df_b.as("b")
.join(
df_a.as("a"), Seq("id1", "id2")
)
.withColumn("changed_fields",compareFields($"a.name",$"a.state",$"b.name",$"b.state"))
.select($"id1",$"id2",$"b.name",$"b.state",$"changed_fields")
.show()
gives
+---+---+------+-----+--------------+
|id1|id2| name|state|changed_fields|
+---+---+------+-----+--------------+
| 1| a| bob| fl| state|
| 2| b|susane| ma| name|
+---+---+------+-----+--------------+
EDIT:
Here is a more generic version which compares all fields at once:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.struct

val compareFields = udf((a: Row, b: Row) => {
  assert(a.schema == b.schema)
  a.schema
    .indices
    .map(i => if (a.get(i) != b.get(i)) Some(a.schema(i).name) else None)
    .flatten
    .mkString(",")
})
df_b.as("b")
.join(df_a.as("a"), $"a.id1" === $"b.id1" and $"a.id2" === $"b.id2")
.withColumn("changed_fields",compareFields(struct($"a.*"),struct($"b.*")))
.select($"b.id1",$"b.id2",$"b.name",$"b.state",$"changed_fields")
.show()
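Since the question mentions hundreds to thousands of columns, the final select can be made generic as well. A small sketch of that idea:
// build B's column list programmatically instead of spelling the columns out
df_b.as("b")
  .join(df_a.as("a"), $"a.id1" === $"b.id1" and $"a.id2" === $"b.id2")
  .withColumn("changed_fields", compareFields(struct($"a.*"), struct($"b.*")))
  .select(df_b.columns.map(c => $"b.$c") :+ $"changed_fields": _*)
  .show()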