Spark Delta merge: add source column value to target column value - Scala

I want the updated value in the target's column to be the sum of the source value and the target value.
Example:
%scala
import org.apache.spark.sql.functions._
import io.delta.tables._

// Create example Delta table
val dept = Seq(("Finance",10), ("Marketing",20), ("Sales",30), ("IT",40))
val deptColumns = Seq("dept_name","dept_emp_count")
val deptDF = dept.toDF(deptColumns:_*)
deptDF.write.format("delta").mode("overwrite").saveAsTable("dept_table")

// Create example staged dataframe
val staged_df = spark.sql("select * from dept_table").withColumn("dept_emp_count", lit(1))

// How to do this merge?
DeltaTable.forName(spark, "dept_table").as("events")
  .merge(staged_df.as("updates"), "events.dept_name = updates.dept_name")
  .whenMatched()
  .updateExpr(Map(
    "dept_emp_count" -> lit("events.dept_emp_count") + lit("updates.dept_emp_count"))) // How do I write this line?
  .execute()

The value in that update Map is a SQL expression string, so instead of lit("events.dept_emp_count") + lit("updates.dept_emp_count") you just need to write "events.dept_emp_count + updates.dept_emp_count".
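For example, a minimal sketch of the corrected merge, reusing the table and staged dataframe from the question:
DeltaTable.forName(spark, "dept_table").as("events")
  .merge(staged_df.as("updates"), "events.dept_name = updates.dept_name")
  .whenMatched()
  .updateExpr(Map(
    // a plain SQL expression string, evaluated against both aliases
    "dept_emp_count" -> "events.dept_emp_count + updates.dept_emp_count"))
  .execute()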

How to join Dataframe with one-column dataset without using column names in dataset

Let's consider:
val columnNames: Seq[String] = Seq[String]("col_1") // column present in DataFrame df
df.join(right = ds, usingColumns = columnNames) // ds is some dataset that has exactly one column
// The problem is that I don't know the name of this column; I only know that
// df.col("col_1") and ds.col(???) have the same type.
Is it possible to do this join?
You can use something like:
package utils

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object Extensions {
  implicit class DataFrameExtensions(df: DataFrame) {
    // select columns by their zero-based positions rather than by their names
    def selecti(indices: Int*) = {
      val cols = df.columns
      df.select(indices.map(i => col(cols(i))): _*)
    }
  }
}
Then use it to select columns by their position:
import utils.Extensions._
df.selecti(1,2,3)
Assuming that the "col_1" from the first dataframe will always join to the single column in the ds dataframe you can just rename the column in the ds data frame with a single column like below. Then your join using names only need reference "col_1"
// set the name of the column in ds to col_1
val ds2 = ds.toDF("col_1")
You can change the column name of the dataset to col_1:
val result = df.join(ds.withColumnRenamed(ds.columns(0), "col_1"), Seq("col_1"), "right")

Scala: How to return column name and value from a dataframe

I am trying to create a function which can scan a dataframe row by row and, for each row, spit out the non-empty columns and their column names. But the challenge is that I don't know the number of columns or their names in the input dataframe.
A function something like GetNotEmptyCols(InputRow: Row): (Colname:String, ColValue:String)
As sample data, consider the following dataframes.
val DataFrameA = Seq(("tot","","ink"), ("yes","yes",""), ("","","many")).toDF("ColA","ColB","ColC")
val DataFrameB = Seq(("yes",""), ("","")).toDF("ColD","ColE")
I have been trying to get the column values from each Row object, but I don't know how to do that when I don't have the names of the columns. I could extract the column names from the dataframe and pass them to the function as an additional variable, but I am hoping for a better approach, since the Row object should have the column names and I should be able to extract them.
The output I am working to get is something like this:
DataFrameA.foreach{ row => GetNotEmptyCols(row)} gives output
For row1: ("ColA", "tot"), ("ColC", "ink")
For row2: ("ColA","yes"),("ColB","yes")
For row3: ("ColC","many")
DataFrameB.foreach{ row => GetNotEmptyCols(row)} gives output
For row1: ("ColD", "yes")
For row2: ()
Please find below my implementation of GetNotEmptyCols, which takes the row along with the column names:
import org.apache.spark.sql.{Row, SparkSession}
import scala.collection.mutable.ArrayBuffer

object StackoverFlowProblem {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("Test").master("local").getOrCreate()
    import spark.sqlContext.implicits._
    import org.apache.spark.sql.functions._

    val DataFrameA = Seq(("tot","","ink"), ("yes","yes",""), ("","","many")).toDF("ColA","ColB","ColC")
    val DataFrameB = Seq(("yes",""), ("","")).toDF("ColD","ColE")

    // Store column names in a variable, appending the to-be-added column 'index' as well
    val columns = DataFrameA.columns :+ "index"

    // Use monotonically_increasing_id() to add a row index to the dataframe
    DataFrameA.withColumn("index", monotonically_increasing_id()).foreach(a => GetNotEmptyCols(a, columns))
  }

  def GetNotEmptyCols(inputRow: Row, columns: Array[String]): Unit = {
    val rowIndex = inputRow.getAs[Long]("index")
    val numCols = inputRow.length
    val nonEmptyCols = ArrayBuffer[(String, String)]()
    // skip the last field, which is the appended 'index' column
    for (i <- 0 until numCols - 1) {
      val value = inputRow.getAs[String](i)
      if (!value.isEmpty) {
        val name = columns(i)
        nonEmptyCols += Tuple2(name, value)
      }
    }
    println(s"For row $rowIndex: " + nonEmptyCols.mkString(","))
  }
}
This will print the output below for your first dataframe (I have used zero-based indexing for the row numbers):
For row 0: (ColA,tot),(ColC,ink)
For row 1: (ColA,yes),(ColB,yes)
For row 2: (ColC,many)
I found a solution myself. I can use the getValuesMap method to create a map of column names to values, which I can return and then convert to a list.
def returnNotEmptyCols(inputRow: Row): Map[String, String] = {
  inputRow.getValuesMap[String](inputRow.schema.fieldNames)
    .filter(x => x._2 != null && x._2 != "")
}

returnNotEmptyCols(rowA1).toList
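For example, a quick sketch that applies it to every row of the sample dataframe, collecting to the driver (fine for data this small):
DataFrameA.collect().foreach(row => println(returnNotEmptyCols(row).toList))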

dataframe.select, select dataframe columns from file

I am trying to create a child dataframe from a parent dataframe, but I have more than 100 columns to select.
So, in the select statement, can I supply the columns from a file?
val Raw_input_schema=spark.read.format("text").option("header","true").option("delimiter","\t").load("/HEADER/part-00000").schema
val Raw_input_data=spark.read.format("text").schema(Raw_input_schema).option("delimiter","\t").load("/DATA/part-00000")
val filtered_data = Raw_input_data.select(all_cols)
How can I pass the column names from the file into all_cols?
I assume you would read the file from HDFS or from a shared config file? The reason is that on a cluster this code would be executed on individual nodes.
In this case I would approach it with the next piece of code:
import scala.io.Source
import org.apache.spark.sql.functions.col

// read the column names from a local file and turn them into Column objects
val lines = Source.fromFile("somefile.name.csv").getLines
val cols = lines.flatMap(_.split(",")).map(col(_)).toArray
val df3 = df2.select(cols: _*)
Essentially, you just have to provide an array of columns and use the : _* notation to pass them as a variable number of arguments.
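If the column list actually lives on HDFS (as the paths in the question suggest) rather than on the local filesystem, a similar approach can read it with Spark itself; a sketch, assuming a tab-separated header file as in the question:
// read the header file with Spark so HDFS paths work as well
val colNames = spark.read.textFile("/HEADER/part-00000").collect().flatMap(_.split("\t"))
val filtered = Raw_input_data.select(colNames.head, colNames.tail: _*)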
Finally, this worked for me:
val Raw_input_schema=spark.read.format("csv").option("header","true").option("delimiter","\t").load("headerFile").schema
val Raw_input_data=spark.read.format("csv").schema(Raw_input_schema).option("delimiter","\t").load("dataFile")
val filtered_file = sc.textFile("filter_columns_file").map(cols=>cols.split("\t")).flatMap(x=>x).collect().toList
//or
val filtered_file = sc.textFile(filterFile).map(cols=>cols.split("\t")).flatMap(x=>x).collect().toList.map(x => new Column(x))
val final_df=Raw_input_data.select(filtered_file.head, filtered_file.tail: _*)
//or
val final_df = Raw_input_data.select(filtered_file:_*)'

Compare 2 dataframes and filter results based on date column in spark

I have 2 dataframes in spark as mentioned below.
val test = hivecontext.sql("select max(test_dt) as test_dt from abc");
test: org.apache.spark.sql.DataFrame = [test_dt: string]
val test1 = hivecontext.table("testing");
where test1 has columns like id,name,age,audit_dt
I want to compare these 2 dataframes and filter rows from test1 where audit_dt > test_dt. Somehow I am not able to do that. I am able to compare audit_dt with a literal date using the lit function, but I am not able to compare it with another dataframe's column.
I am able to compare against a literal date using the lit function, as mentioned below:
val output = test1.filter(to_date(test1("audit_date")).gt(lit("2017-03-23")))
Max Date in test dataframe is -> 2017-04-26
Data in test1 Dataframe ->
Id,Name,Age,Audit_Dt
1,Rahul,23,2017-04-26
2,Ankit,25,2017-04-26
3,Pradeep,28,2017-04-27
I just need the data for Id=3, since only that row satisfies the greater-than criterion against the max date.
I have already tried the option mentioned below, but it is not working.
val test = hivecontext.sql("select max(test_dt) as test_dt from abc")
val MAX_AUDIT_DT = test.first().toString()
val output = test.filter(to_date(test("audit_date")).gt((lit(MAX_AUDIT_DT))))
Can anyone suggest a way to compare it with the column of dataframe test?
Thanks
You can use a non-equi join, if both columns "test_dt" and "Audit_Dt" are of date type.
/// cast to correct type
import org.apache.spark.sql.functions.to_date
val new_test = test.withColumn("test_dt",to_date($"test_dt"))
val new_test1 = test1.withColumn("Audit_Dt", to_date($"Audit_Dt"))
/// join
new_test1.join(new_test, $"Audit_Dt" > $"test_dt")
.drop("test_dt").show()
+---+-------+---+----------+
| Id| Name|Age| Audit_Dt|
+---+-------+---+----------+
| 3|Pradeep| 28|2017-04-27|
+---+-------+---+----------+
Data
val test1 = sc.parallelize(Seq((1,"Rahul",23,"2017-04-26"),(2,"Ankit",25,"2017-04-26"),
(3,"Pradeep",28,"2017-04-27"))).toDF("Id","Name", "Age", "Audit_Dt")
val test = sc.parallelize(Seq(("2017-04-26"))).toDF("test_dt")
Try with this:
test1.filter(to_date(test1("audit_date")).gt(to_date(test("test_dt"))))
Store the value in a variable and use it in the filter.
val dtValue = test.select("test_dt").first().getString(0)
OR
val dtValue = test.first().getString(0)
Now apply the filter:
val output = test1.filter(to_date(test1("audit_date")).gt(lit(dtValue)))

How to sum the values of one column of a dataframe in spark/scala

I have a Dataframe that I read from a CSV file with many columns like: timestamp, steps, heartrate etc.
I want to sum the values of each column, for instance the total number of steps on "steps" column.
As far as I can see, I want to use these kinds of functions:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
But I can't understand how to use the function sum.
When I write the following:
val df = CSV.load(args(0))
val sumSteps = df.sum("steps")
the function sum cannot be resolved.
Do I use the function sum wrongly?
Do I need to use the map function first? And if yes, how?
A simple example would be very helpful! I started writing Scala recently.
You must first import the functions:
import org.apache.spark.sql.functions._
Then you can use them like this:
val df = CSV.load(args(0))
val sumSteps = df.agg(sum("steps")).first.get(0)
You can also cast the result if needed:
val sumSteps: Long = df.agg(sum("steps").cast("long")).first.getLong(0)
Edit:
For multiple columns (e.g. "col1", "col2", ...), you could get all aggregations at once:
val sums = df.agg(sum("col1").as("sum_col1"), sum("col2").as("sum_col2"), ...).first
Edit2:
For dynamically applying the aggregations, the following options are available:
Applying to all numeric columns at once:
df.groupBy().sum()
Applying to a list of numeric column names:
val columnNames = List("col1", "col2")
df.groupBy().sum(columnNames: _*)
Applying to a list of numeric column names with aliases and/or casts:
val cols = List("col1", "col2")
val sums = cols.map(colName => sum(colName).cast("double").as("sum_" + colName))
df.groupBy().agg(sums.head, sums.tail:_*).show()
If you want to sum all values of one column, it's more efficient to use DataFrame's internal RDD and reduce.
import sqlContext.implicits._
import org.apache.spark.sql.functions._
val df = sc.parallelize(Array(10,2,3,4)).toDF("steps")
df.select(col("steps")).rdd.map(_(0).asInstanceOf[Int]).reduce(_+_)
//res1 Int = 19
Simply apply the aggregation function sum to your column:
df.groupBy().sum("steps").show()
Follow the Documentation http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html
Check out this link also https://www.analyticsvidhya.com/blog/2016/10/spark-dataframe-and-operations/
Not sure this was around when this question was asked, but:
df.describe("columnName").show()
gives count, mean, stddev, min and max stats for that column. It returns stats for all columns if you just call df.describe().show().
Using a Spark SQL query... just in case it helps anyone!
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf().setMaster("local[2]").setAppName("test")
val spark = SparkSession.builder.config(conf).getOrCreate()
import spark.implicits._

// name the column "steps" so it can be referenced in the SQL query
val df = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5, 6, 7)).toDF("steps")
df.createOrReplaceTempView("steps")

val sum = spark.sql("select sum(steps) as stepsSum from steps")
  .map(row => row.getAs[Long]("stepsSum"))
  .collect()(0)
println("steps sum = " + sum) // prints 28