I'm writing methods in Scala that take Column arguments and return a Column. Within them, I need to compare the values of the columns (ranging from integers to dates) using logic similar to the below, but I've been encountering an error message.
The lit() is for example purposes only; in reality I'm passing columns from a DataFrame.select() into a method to do computation, and I need to compare using those columns.
val test1 = lit(3)
val test2 = lit(4)
if (test1 > test2) {
print("tuff")
}
Error message.
Error : <console>:96: error: type mismatch;
found : org.apache.spark.sql.Column
required: Boolean
if (test1 > test2) {
What is the correct way to compare Column objects in Spark? The column documentation lists the > operator as being valid for comparisons.
Edit: Here's a very contrived example of usage, assuming the columns passed into the function are dates that need to be compared for business reasons, with the returned integer value also having some business significance.
someDataFrame.select(
$"SomeColumn",
computedColumn($"SomeColumn1", $"SomeColumn2").as("MyComputedColumn")
)
Where computedColumn would be
def computedColumn(col1 : Column, col2: Column) : Column = {
var returnCol : Column = lit(0)
if (col1 > col2) { // this comparison is the line that fails to compile
returnCol = lit(4)
}
returnCol
}
Except in actual usage there is a lot more if/else logic that needs to happen in computedColumn, with the final result being a returned Column that is added to the select's output.
You can use when to do a conditional comparison:
someDataFrame.select(
$"SomeColumn",
when($"SomeColumn1" > $"SomeColumn2", 4).otherwise(0).as("MyComputedColumn")
)
If you prefer to write a function:
def computedColumn(col1 : Column, col2: Column) : Column = {
when(col1 > col2, 4).otherwise(0)
}
someDataFrame.select(
$"SomeColumn",
computedColumn($"SomeColumn1", $"SomeColumn2").as("MyComputedColumn")
)
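If the real logic involves more branches, when calls can be chained before the final otherwise. Here is a minimal sketch of that shape (the extra equality condition and the return codes are made up for illustration):
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.when

def computedColumn(col1: Column, col2: Column): Column = {
  // each when is one branch of the if/else chain; otherwise is the final else
  when(col1 > col2, 4)
    .when(col1 === col2, 2)
    .otherwise(0)
}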
Table structure -
create table test
(month integer,
year integer,
thresholds decimal(18,2)
);
Static insert for simulation -
insert into test(month,year,thresholds) values(4,2021,100),(5,2021,98),(6,2021,99);
If I query a Postgres database using Anorm, regular queries work fine. However, on adding an aggregate function like max, the RowParser is not able to find the aliased column.
val queryString =
"""select max(month) as monthyear from test
| where (month || '-' || year)
| = {inQuery}""".stripMargin
val inQuery1 = "'5-2021'"
The on call below causes the issue:
val latestInBenchmark = SQL(queryString).on("inQuery" -> inQuery1) // removing the on resolves the problem
logger.info("query for latest period ---> " + latestInBenchmark)
val latestYearMonthInInterval = database.withConnection(implicit conn => {
latestInBenchmark.as(SqlParser.int("monthyear").*)
})
Removing the on rectifies the problem and SqlParser.int(column-name) works as expected.
The issue also does not affect queries that use the count aggregate function.
Error encountered :
(Validated intervals with sorting -> ,Failure(anorm.AnormException: 'monthyear' not found, available columns: monthyear, monthyear))
[error] c.b.ThresholdController - 'monthyear' not found, available columns: monthyear, monthyear
The error you get is a bit misleading, but it means the query either returns a row with a null value or no row at all.
In your case I think the issue is the WHERE clause: you have put single quotes around the value, but Anorm handles quoting itself when the value is bound with .on(...) or with Anorm interpolation.
Thus, replace:
val inQuery1 = "'5-2021'"
By:
val inQuery1 = "5-2021"
I have a list of strings, which represents the names of various columns I want to add together to make another column:
val myCols = List("col1", "col2", "col3")
I want to convert the list to columns, then add the columns together to make a final column. I've looked for a number of ways to do this, and the closest I can come to the answer is:
df.withColumn("myNewCol", myCols.foldLeft(lit(0))(col(_) + col(_)))
I get a compile error saying it expects a String, when all I really want is a Column. What's wrong, and how do I fix it?
Trying it out in spark-shell gives an error that says exactly what the problem is and where:
scala> myCols.foldLeft(lit(0))(col(_) + col(_))
<console>:26: error: type mismatch;
found : org.apache.spark.sql.Column
required: String
myCols.foldLeft(lit(0))(col(_) + col(_))
^
Just think of the first pair that is handed to the function passed to foldLeft: it's lit(0) of type Column together with "col1" of type String. There is no col function that accepts a Column.
Try reduce instead:
myCols.map(col).reduce(_ + _)
From the official documentation of reduce:
Reduces the elements of this collection using the specified associative binary operator.
In other words, the operator op is inserted between consecutive elements:
op(x_1, op(x_2, ..., op(x_{n-1}, x_n)...))
where x_1, ..., x_n are the elements of this collection.
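For example, plugged back into the original expression (df and myCols are the ones from the question; note that reduce throws on an empty list, while the foldLeft variant below with a lit(0) seed does not):
df.withColumn("myNewCol", myCols.map(col).reduce(_ + _))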
Here is how you can add columns dynamically based on the column names in a List. When all the columns are numeric the result is a number. The first argument to foldLeft has the same type as the result, so foldLeft works just as well as reduce here.
val employees = //a dataframe with 2 numeric columns "salary","exp"
val initCol = lit(0)
val cols = Seq("salary","exp")
val col1 = cols.foldLeft(initCol)((x,y) => x + col(y))
employees.select(col1).show()
I'm trying to validate the datatypes of the DataFrame's fields before entering a loop in which I do SQL calculations, but the datatype validation is not going through and execution never gets inside the loop. The operation needs to be performed only on numeric columns.
How can this be solved? Is this the right way to handle datatype validation?
//get datatype of dataframe fields
val datatypes = parquetRDD_subset.schema.fields
//check if datatype of column is String and enter the loop for calculations.
for (val_datatype <- datatypes if val_datatype.dataType =="StringType")
{
val dfs = x.map(field => spark.sql(s"select * from table"))
val withSum = dfs.reduce((x, y) => x.union(y)).distinct()
}
You are comparing dataType to a string, which will never be true (for me the compiler even complains that the two types are unrelated). dataType is an object whose type is a subtype of org.apache.spark.sql.types.DataType.
Try replacing your for with
for (val_datatype <- datatypes if val_datatype.dataType.isInstanceOf[StringType])
In any case, your for loop does nothing but declare the vals; it never does anything with them.
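Since the stated goal is to run the calculation only on numeric columns, one possible sketch (reusing parquetRDD_subset from the question) is to filter the schema fields by NumericType and keep the matching column names:
import org.apache.spark.sql.types.NumericType

// names of the numeric columns; use these to drive the subsequent SQL calculations
val numericCols = parquetRDD_subset.schema.fields
  .filter(_.dataType.isInstanceOf[NumericType])
  .map(_.name)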
I have a data frame with n number of columns and I want to replace empty strings in all these columns with nulls.
I tried using
val ReadDf = rawDF.na.replace("columnA", Map( "" -> null));
and
val ReadDf = rawDF.withColumn("columnA", if($"columnA"=="") lit(null) else $"columnA" );
Neither of them worked.
Any leads would be highly appreciated. Thanks.
Your first approach seems to fail due to a bug that prevents replace from replacing values with nulls, see here.
Your second approach fails because you're confusing driver-side Scala code with executor-side DataFrame instructions: your if-else expression is evaluated once on the driver (not per record). You should replace it with a call to the when function. Moreover, to compare a column's value you need the === operator, not Scala's ==, which merely compares the driver-side Column objects:
import org.apache.spark.sql.functions._
rawDF.withColumn("columnA", when($"columnA" === "", lit(null)).otherwise($"columnA"))
I would like to calculate the difference between two values from within the same column. Right now I just want the difference between the last value and the first value, however using last(column) returns a null result. Is there a reason last() would not be returning a value? Is there a way to pass the position of the values I want as variables; ex: the 10th and the 1st, or the 7th and the 6th?
Current code
Using Spark 1.4.0 and Scala 2.11.6
myDF = some dataframe with n rows by m columns
def difference(col: Column): Column = {
last(col)-first(col)
}
def diffCalcs(dataFrame: DataFrame): DataFrame = {
import hiveContext.implicits._
dataFrame.agg(
difference($"Column1"),
difference($"Column2"),
difference($"Column3"),
difference($"Column4")
)
}
When I run diffCalcs(myDF) it returns a null result. If I modify difference to only use first(col), it does return the first value for the four columns. However, if I change it to last(col), it returns null. If I call myDF.show(), I can see that all of the columns have Double values on every row; there are no null values in any of the columns.
After updating to Spark 1.5.0, the code snippet provided in the question worked; updating the Spark version was what ultimately fixed it. Just for completeness, here is the code I used after updating:
def difference(col:Column): Column = {
last(col)-first(col)
}
def diffCalcs(dataFrame: DataFrame): DataFrame = {
import hiveContext.implicits._
dataFrame.agg(
difference($"Column1").alias("newColumn1"),
difference($"Column2").alias("newColumn2"),
difference($"Column3").alias("newColumn3"),
difference($"Column4").alias("newColumn4")
)
}
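Called exactly as before, the aggregation returns a single row holding the four differences; a minimal usage sketch:
// one row with columns newColumn1 .. newColumn4
diffCalcs(myDF).show()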