Select values from a dataframe column - scala

I would like to calculate the difference between two values from within the same column. Right now I just want the difference between the last value and the first value, however using last(column) returns a null result. Is there a reason last() would not be returning a value? Is there a way to pass the position of the values I want as variables; ex: the 10th and the 1st, or the 7th and the 6th?
Current code
Using Spark 1.4.0 and Scala 2.11.6
myDF = some dataframe with n rows by m columns
def difference(col: Column): Column = {
last(col)-first(col)
}
def diffCalcs(dataFrame: DataFrame): DataFrame = {
import hiveContext.implicits._
dataFrame.agg(
difference($"Column1"),
difference($"Column2"),
difference($"Column3"),
difference($"Column4")
)
}
When I run diffCalcs(myDF) it returns a null result. If I modify difference to only have first(col), it does return the first value for the four columns. However, if I change it to last(col), it returns null. If I call myDF.show(), I can see that all of columns have Double values on every row, there are no null values in any of the columns.

After updating to Spark 1.5.0, I was able to use the code snippet provided in the question and it worked. That was what ultimately fixed it. Just for completeness, I have included the code that I used after updating the Spark version.
def difference(col:Column): Column = {
last(col)-first(col)
}
def diffCalcs(dataFrame: DataFrame): DataFrame = {
import hiveContext.implicits._
dataFrame.agg(
difference($"Column1").alias("newColumn1"),
difference($"Column2").alias("newColumn2"),
difference($"Column3").alias("newColumn3"),
difference($"Column4").alias("newColumn4")
)
}

Related

Storing Spark DataFrame Value In scala Variable

I need to Check the duplicate filename in my table and if file count is 0 then i need to load a file in my table using sparkSql. I wrote below code.
val s1=spark.sql("select count(filename) from mytable where filename='myfile.csv'") //giving '2'
s1: org.apache.spark.sql.DataFrame = [count(filename): bigint]
s1.show //giving 2 as output
//s1 is giving me the filecount from my table then i need to compare this count value using if statement.
I'm using below code.
val s2=s1.count //not working always giving 1
val s2=s1.head.count() // error: value count is not a member of org.apache.spark.sql.Row
val s2=s1.size //value size is not a member of Unit
if(s1>0){ //code } //value > is not a member of org.apache.spark.sql.DataFrame
can someone please give me a hint how should i do this.How can i get the dataframe value and can use as variable to check the condition.
i.e.
if(value of s1(i.e.2)>0){
//my code
}
You need to extract the value itself. Count will return the number of rows in the df, which is just one row.
So you can keep your original query and extract the value after with first and getInt methods
val s1 = spark.sql("select count(filename) from mytable where filename='myfile.csv'")`
val valueToCompare = s1.first().getInt(0)
And then:
if(valueToCompare>0){
//my code
}
Another option is performing the count outside the query, then the count will give you the desired value:
val s1 = spark.sql("select filename from mytable where filename='myfile.csv'")
if(s1.count>0){
//my code
}
I like the most the second option, but there is no reason other than that i think it is more clear
spark.sql("select count(filename) from mytable where filename='myfile.csv'") returns a dataframe and you need to extract both the first row and the first column of that row. It is much simpler to directly filter the dataset and count the number of rows in Scala:
val s1 = df.filter($"filename" === "myfile.csv").count
if (s1 > 0) {
...
}
where df is the dataset that corresponds to the mytable table.
If you got the table from some other source and not by registering a view, use SparkSession.table() to get a dataframe using the instance of SparkSession that you already have. For example, in Spark shell the pre-set variable spark holds the session and you'll do:
val df = spark.table("mytable")
val s1 = df.filter($"filename" === "myfile.csv").count

Getting max value out of a dataframe column of timestamp in scala/spark

I am working with a spark dataframe where it contains the entire timestamp values from the Column 'IMG_CREATED_DT'.I have used collectAsList() and toString() method to get the values as List and converting in to String. But I am not getting how to fetch the max value out of it.Please guide me on this.
val query_new =s"""(select IMG_CREATED_DT from
${conf.get(UNCAppConstants.DB2_SCHEMA)}.$table)"""
println(query_new)
val db2_op=ConnectionUtilities_v.createDataFrame(src_props,srcConfig.url,query_new)
val t3 = db2_op.select("IMG_CREATED_DT").collectAsList().toString
How to get the max value out of t3.
You can calculate the max value form dataframe itself. Try the following sample.
val t3 = db2_op.agg(max("IMG_CREATED_DT").as("maxVal")).take(1)(0).get(0)

how to get the row corresponding to the minimum value of some column in spark scala dataframe

i have the following code. df3 is created using the following code.i want to get the minimum value of distance_n and also the entire row containing that minimum value .
//it give just the min value , but i want entire row containing that min value
for getting the entire row , i converted this df3 to table for performing spark.sql
if i do like this
spark.sql("select latitude,longitude,speed,min(distance_n) from table1").show()
//it throws error
and if
spark.sql("select latitude,longitude,speed,min(distance_nd) from table180").show()
// by replacing the distance_n with distance_nd it throw the error
how to resolve this to get the entire row corresponding to min value
Before using a custom UDF, you have to register it in spark's sql Context.
e.g:
spark.sqlContext.udf.register("strLen", (s: String) => s.length())
After the UDF is registered, you can access it in your spark sql like
spark.sql("select strLen(some_col) from some_table")
Reference: https://docs.databricks.com/spark/latest/spark-sql/udf-scala.html

change a dataframe row value with dynamic number of columns spark scala

I have a dataframe (contains 10 columns) for which I want to change the value of a row (for the last column only). I have written following code for this:
val newDF = spark.sqlContext.createDataFrame(WRADF.rdd.map(r=> {
Row(r.get(0), r.get(1),
r.get(2), r.get(3),
r.get(4), r.get(5),
r.get(6), r.get(7),
r.get(8), decrementCounter(r))
}), WRADF.schema)
I want to change the value of a row for 10th column only (for which I wrote decrementCounter() function). But the above code only runs for dataframes with 10 columns. I don't know how to convert this code so that it can run for different dataframe (with different number of columns). Any help will be appreciated.
Don't do something like this. Define udf
import org.apache.spark.sql.functions.udf._
val decrementCounter = udf((x: T) => ...) // adjust types and content to your requirements
df.withColumn("someName", decrementCounter($"someColumn"))
I think UDF will be a better choice because it can be applied using the Column name itself.
For more on udf you can take a look here : https://docs.databricks.com/spark/latest/spark-sql/udf-scala.html
For your code just use this :
import org.apache.spark.sql.functions.udf._
val decrementCounterUDF = udf(decrementCounter _)
df.withColumn("columnName", decrementCounterUDF($"columnName"))
What it will does is apply this decrementCounter function on each and every value of column columnName.
I hope this helps, cheers !

Replace Empty values with nulls in Spark Dataframe

I have a data frame with n number of columns and I want to replace empty strings in all these columns with nulls.
I tried using
val ReadDf = rawDF.na.replace("columnA", Map( "" -> null));
and
val ReadDf = rawDF.withColumn("columnA", if($"columnA"=="") lit(null) else $"columnA" );
Both of them didn't work.
Any leads would be highly appreciated. Thanks.
Your first approach seams to fail due to a bug that prevents replace from being able to replace values with nulls, see here.
Your second approach fails because you're confusing driver-side Scala code for executor-side Dataframe instructions: your if-else expression would be evaluated once on the driver (and not per record); You'd want to replace it with a call to when function; Moreover, to compare a column's value you need to use the === operator, and not Scala's == which just compares the driver-side Column object:
import org.apache.spark.sql.functions._
rawDF.withColumn("columnA", when($"columnA" === "", lit(null)).otherwise($"columnA"))