First and last methods in Scala Spark

In PySpark, we have:
The first() function returns the first element in the column; when ignoreNulls is set to True, it returns the first non-null element.
The last() function returns the last element in the column; when ignoreNulls is set to True, it returns the last non-null element.
I would like to know whether there are equivalent methods in Scala Spark.
Thank you in advance.

Yes, both are available in Scala Spark, just as in PySpark.
import org.apache.spark.sql.functions
df.select(
  functions.first("col1", ignoreNulls = true),
  functions.last("col2", ignoreNulls = true)
).show(false)

Yes.
A quick look at the documentation gives you first and last: https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/functions$.html#first(columnName:String):org.apache.spark.sql.Column
def first(columnName: String): Column
Aggregate function: returns the first value of a column in a group.
The function by default returns the first values it sees. It will return the first non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned.
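To make the ignoreNulls behaviour described above concrete, here is a minimal plain-Python model of it. This is only a sketch of the semantics over an ordinary list, not Spark's implementation; the names first_value and last_value are hypothetical, and None stands in for a SQL null.

```python
def first_value(values, ignore_nulls=False):
    """Model of first(): the first value seen, or the first non-None one."""
    for v in values:
        if v is not None or not ignore_nulls:
            return v
    return None  # empty input, or every value was None


def last_value(values, ignore_nulls=False):
    """Model of last(): the last value seen, or the last non-None one."""
    result = None
    for v in values:
        if v is not None or not ignore_nulls:
            result = v
    return result


column = [None, 3, 5, None]
print(first_value(column), first_value(column, ignore_nulls=True))
print(last_value(column), last_value(column, ignore_nulls=True))
```

In Spark itself, "first" and "last" depend on row order, which may not be deterministic after a shuffle, so the model only matches when the rows arrive in a known order.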

Spark itself is written in Scala, so the functions exposed through PySpark's pyspark.sql.functions generally have a counterpart in org.apache.spark.sql.functions.

Related

Spark dataframe date_add function with case when not working

I have a Spark DataFrame in which I want to add a number of days to an existing date column, where the number of days depends on a condition.
My code looks something like this:
F.date_add(
    df.transDate,
    F.when(F.col('txn_dt') == '2016-01-11', 9999).otherwise(10)
)
The date_add() function expects an int as its second argument, but my code passes a Column, so it throws an error.
How can I get the value out of the case-when condition?
pyspark.sql.functions.when() returns a Column, which is why your code produces the error TypeError: 'Column' object is not callable.
You can get the desired result by moving the when to the outside, like this:
F.when(
    F.col('txn_dt') == '2016-01-11',
    F.date_add(df.transDate, 9999)
).otherwise(F.date_add(df.transDate, 10))
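As a sanity check on what the rewritten expression should produce for each row, the same rule can be modelled in plain Python with the standard datetime module. This is a sketch only: shifted_trans_date is a hypothetical helper, and it assumes transDate is a date and txn_dt is a 'YYYY-MM-DD' string, as the comparison in the question suggests.

```python
from datetime import date, timedelta


def shifted_trans_date(trans_date: date, txn_dt: str) -> date:
    """Per-row rule: add 9999 days when txn_dt is '2016-01-11', else 10."""
    days = 9999 if txn_dt == "2016-01-11" else 10
    return trans_date + timedelta(days=days)


# A row that does not match the condition gets the default 10 days.
print(shifted_trans_date(date(2016, 1, 1), "2016-01-12"))
# A matching row gets 9999 days.
print(shifted_trans_date(date(2016, 1, 1), "2016-01-11"))
```

Comparing a few rows of the Spark output against this rule is a quick way to confirm that both branches of the when/otherwise are wired to the right offsets.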

Math.abs function in Select statement in Scala

I have the following code in Scala:
val FilteredPSPDF = PSPDF.select("accountname","amount", "currency", "datestamp","orderid","transactiontype")
However, some values in the "amount" column are negative and I need to change them to positive values. Is it possible to apply this arithmetic within the select statement, and if so, how?
There's an abs function available in Spark SQL.
You can either use selectExpr instead of select:
PSPDF.selectExpr("accountname", "abs(amount) as amount", "currency", "datestamp", "orderid", "transactiontype")
or use the overload of select that takes Column arguments:
PSPDF.select($"accountname", abs($"amount").as("amount"), $"currency", $"datestamp", $"orderid", $"transactiontype")
Alternatively, you can use the built-in when and negate functions:
import org.apache.spark.sql.functions._
val FilteredPSPDF = PSPDF.select(
  col("accountname"),
  when(col("amount") < 0, negate(col("amount"))).otherwise(col("amount")).as("amount"),
  col("currency"), col("datestamp"), col("orderid"), col("transactiontype"))

Replace Empty values with nulls in Spark Dataframe

I have a data frame with n number of columns and I want to replace empty strings in all these columns with nulls.
I tried using
val ReadDf = rawDF.na.replace("columnA", Map( "" -> null));
and
val ReadDf = rawDF.withColumn("columnA", if($"columnA"=="") lit(null) else $"columnA" );
Neither of them worked.
Any leads would be highly appreciated. Thanks.
Your first approach seems to fail because of a bug that prevents replace from replacing values with null.
Your second approach fails because you're confusing driver-side Scala code with executor-side DataFrame instructions. Your if-else expression is evaluated once on the driver, not once per record, so you need to replace it with a call to the when function. Moreover, to compare a column's value you need the === operator, not Scala's ==, which just compares the driver-side Column object:
import org.apache.spark.sql.functions._
rawDF.withColumn("columnA", when($"columnA" === "", lit(null)).otherwise($"columnA"))
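The answer above handles a single column, while the question asks about every column. The per-cell rule is the same in each one, and a plain-Python model of it over a list of row dictionaries may help when checking results. This is a sketch of the intended outcome only, not Spark code; blank_to_none is a hypothetical name and None stands in for null.

```python
def blank_to_none(rows):
    """Return new rows in which every empty-string cell is replaced by None."""
    return [
        {name: (None if value == "" else value) for name, value in row.items()}
        for row in rows
    ]


rows = [
    {"columnA": "", "columnB": "x"},
    {"columnA": "y", "columnB": ""},
]
print(blank_to_none(rows))
```

In Spark, the equivalent would be to apply the when/otherwise expression from the answer once per column name rather than only to columnA.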

Removing Blank Strings from a Spark Dataframe

I am trying to remove rows in which a Spark DataFrame column contains blank strings. I originally used val df2 = df1.na.drop(), but it turns out many of these values are encoded as "" rather than null.
I'm stuck using Spark 1.3.1 and also cannot rely on DSL. (Importing spark.implicit_ isn't working.)
Removing things from a dataframe requires filter().
val newDF = oldDF.filter("colName != ''")
Or am I misunderstanding your question?
In case someone doesn't want to drop the records with blank strings, but just wants to convert the blank strings to some constant value:
val newdf = df.na.replace(df.columns,Map("" -> "0")) // to convert blank strings to zero
newdf.show()
You can use this:
df.filter(!($"col_name"===""))
It filters out the rows where the value of "col_name" is "", i.e. a blank string. I'm matching those rows with === and then inverting the match with "!".
I am also new to Spark, so I don't know whether the code below is more complex than necessary, but it works.
Here we register a UDF (using the Java API) that converts blank values to null.
sqlContext.udf().register("convertToNull", (String abc) -> (abc != null && abc.trim().length() > 0 ? abc : null), DataTypes.StringType);
After that, you can use "convertToNull" (which works on strings) in the select clause to turn all blank fields into null, and then use .na().drop().
crimeDataFrame.selectExpr("C0", "convertToNull(C1)", "C2", "C3").na().drop()
Note: you can use the same approach in Scala.
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-udfs.html
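The two-step idea in this answer (turn blank strings into null, then drop rows containing null) can be modelled in plain Python to see which rows survive. This is a sketch under assumptions: rows are tuples of strings or None, the helper names are hypothetical, and None stands in for null.

```python
def convert_to_null(value):
    """Keep a string that has non-whitespace content; otherwise return None."""
    if value is not None and value.strip():
        return value
    return None


def clean_and_drop(rows, column_index):
    """Apply convert_to_null to one column, then drop rows containing None."""
    converted = [
        tuple(convert_to_null(v) if i == column_index else v
              for i, v in enumerate(row))
        for row in rows
    ]
    return [row for row in converted if all(v is not None for v in row)]


rows = [("a", "  ", "c"), ("d", "e", "f"), ("g", "h", None)]
print(clean_and_drop(rows, column_index=1))
```

Note that the drop step removes any row with a null in any column, not only in the converted one, which mirrors calling .na().drop() with no arguments.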

Select values from a dataframe column

I would like to calculate the difference between two values from within the same column. Right now I just want the difference between the last value and the first value, however using last(column) returns a null result. Is there a reason last() would not be returning a value? Is there a way to pass the position of the values I want as variables; ex: the 10th and the 1st, or the 7th and the 6th?
Current code
Using Spark 1.4.0 and Scala 2.11.6
myDF = some dataframe with n rows by m columns
def difference(col: Column): Column = {
  last(col) - first(col)
}
def diffCalcs(dataFrame: DataFrame): DataFrame = {
  import hiveContext.implicits._
  dataFrame.agg(
    difference($"Column1"),
    difference($"Column2"),
    difference($"Column3"),
    difference($"Column4")
  )
}
When I run diffCalcs(myDF), it returns a null result. If I modify difference to only have first(col), it does return the first value for the four columns. However, if I change it to last(col), it returns null. If I call myDF.show(), I can see that all of the columns have Double values on every row; there are no null values in any of the columns.
After updating to Spark 1.5.0, I was able to use the code snippet provided in the question and it worked. That was what ultimately fixed it. Just for completeness, I have included the code that I used after updating the Spark version.
def difference(col: Column): Column = {
  last(col) - first(col)
}
def diffCalcs(dataFrame: DataFrame): DataFrame = {
  import hiveContext.implicits._
  dataFrame.agg(
    difference($"Column1").alias("newColumn1"),
    difference($"Column2").alias("newColumn2"),
    difference($"Column3").alias("newColumn3"),
    difference($"Column4").alias("newColumn4")
  )
}
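For checking the aggregated result against known data, the calculation can be modelled in plain Python on an ordinary list of column values. This sketch also covers the question's idea of passing positions as variables, but it is not Spark code: the function name and the parameters later and earlier are hypothetical, and positions are zero-based list indices.

```python
def difference(values, later=-1, earlier=0):
    """Value at position `later` minus value at position `earlier`.

    The defaults give last minus first.
    """
    return values[later] - values[earlier]


column1 = [1.0, 4.0, 9.5]
print(difference(column1))                      # last minus first
print(difference(column1, later=1, earlier=0))  # 2nd minus 1st
```

With this convention, "the 10th and the 1st" would be later=9 and earlier=0.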