Hash function in spark - scala

I'm trying to add a column to a dataframe, which will contain hash of another column.
I've found this piece of documentation:
https://spark.apache.org/docs/2.3.0/api/sql/index.html#hash
And tried this:
import org.apache.spark.sql.functions._
val df = spark.read.parquet(...)
val withHashedColumn = df.withColumn("hashed", hash($"my_column"))
But what is the hash function used by that hash()? Is that murmur, sha, md5, something else?
The value I get in this column is integer, thus range of values here is probably [-2^(31) ... +2^(31-1)].
Can I get a long value here? Can I get a string hash instead?
How can I specify a concrete hashing algorithm for that?
Can I use a custom hash function?

It is Murmur based on the source code:
/**
* Calculates the hash code of given columns, and returns the result as an int column.
*
* #group misc_funcs
* #since 2.0.0
*/
#scala.annotation.varargs
def hash(cols: Column*): Column = withExpr {
new Murmur3Hash(cols.map(_.expr))
}

If you want a Long hash, in spark 3 there is the xxhash64 function: https://spark.apache.org/docs/3.0.0-preview/api/sql/index.html#xxhash64.
You may want only positive numbers. In this case you may use hash and sum Int.MaxValue as
df.withColumn("hashID", hash($"value").cast(LongType)+Int.MaxValue).show()

Related

Selecting columns of Dataframe in Spark Scala

If you want to select the first column of a dataframe this can be done:
df.select(df.columns(0))
df.columns(0) returns a string, so by giving the name of the column, the select is able to get the column correctly.
Now, suppose I want to select the first 3 columns of the dataset, this is what I would intuitively do:
df.select(df.columns.split(0,3):_*)
The _* operator would pass the array of strings as a varag to my understanding, and it would be the same as passing (df.column(1), df.column(2), df.column(3)) to the select statement. However this doesn't work and it is necessary to do this:
import org.apache.spark.sql.functions.col
df.select(sf.columns.split(0,3).map(i => col(i)):_*))
What is going on?
I think in the question you meant slice instead of split.
And as for your question,
df.columns.slice(0,3):_* is meant to be passed to functions with *-parameters (varargs), i.e. if you call select(columns:_*) then there must be a function defined with varargs, e.g. def select(cols: String*).
But there can only be one such function defined - no overloading here is possible.
Example on why it's not possible to define two different functions with same vararg-parameter declaration:
def select(cols: String*): String = "string"
select() // returns "string"
def select(cols: Column*): Int = 3
select() // now returns 3
And in Spark, that one function is defined not for Strings but for Columns:
def select(cols: Column*)
For Strings, the method is declared like this:
def select(col: String, cols: String*)
I suggest you to stick to Columns, like you do now, but with some syntax sugar:
df.select(df.columns.slice(0,3).map(col):_*))
Or if there's a need to pass column names as Strings, then you can use selectExpr:
df.selectExpr(df.columns.slice(0,3):_*)

Spark: dropping multiple columns using regex

In a Spark (2.3.0) project using Scala, I would like to drop multiple columns using a regex. I tried using colRegex, but without success:
val df = Seq(("id","a_in","a_out","b_in","b_out"))
.toDF("id","a_in","a_out","b_in","b_out")
val df_in = df
.withColumnRenamed("a_in","a")
.withColumnRenamed("b_in","b")
.drop(df.colRegex("`.*_(in|out)`"))
// Hoping to get columns Array(id, a, b)
df_in.columns
// Getting Array(id, a, a_out, b, b_out)
On the other hand, the mechanism seems to work with select:
df.select(df.colRegex("`.*_(in|out)`")).columns
// Getting Array(a_in, a_out, b_in, b_out)
Several things are not clear to me:
what is this backquote syntax in the regex?
colRegex returns a Column: how can it actually represent several columns in the 2nd example?
can I combine drop and colRegex or do I need some workaround?
If you check spark code of colRefex method ... it expects regexs to be passed in the below format
/** the column name pattern in quoted regex without qualifier */
val escapedIdentifier = "`(.+)`".r
/** the column name pattern in quoted regex with qualifier */
val qualifiedEscapedIdentifier = ("(.+)" + """.""" + "`(.+)`").r
backticks(`) are necessary to enclose your regex, otherwise the above patterns will not identify your input pattern.
you can try selecting specific colums which are valid as mentioned below
val df = Seq(("id","a_in","a_out","b_in","b_out"))
.toDF("id","a_in","a_out","b_in","b_out")
val df_in = df
.withColumnRenamed("a_in","a")
.withColumnRenamed("b_in","b")
.drop(df.colRegex("`.*_(in|out)`"))
val validColumns = df_in.columns.filter(p => p.matches(".*_(in|out)$")).toSeq //select all junk columns
val final_df_in = df_in.drop(validColumns:_*) // this will drop all columns which are not valid as per your criteria.
In addition to the workaround proposed by Waqar Ahmed and kavetiraviteja (accepted answer), here is another possibility based on select with some negative regex magic. More concise, but harder to read for non-regex-gurus...
val df_in = df
.withColumnRenamed("a_in","a")
.withColumnRenamed("b_in","b")
.select(df.colRegex("`^(?!.*_(in|out)_).*$`")) // regex with negative lookahead

Spark - Column expression

I came across the following express, I know what does it mean - department("name"). I am curious to know, what it is resolved to. please share your inputs .
department("name") - it is used to refer the column with the name "name". Hope I am correct ? But , what it is resolved to, it seems like auxiliary constructor
From https://spark.apache.org/docs/2.4.5/api/java/index.html?org/apache/spark/sql/DataFrameWriter.html,
// To create Dataset[Row] using SparkSession
val people = spark.read.parquet("...")
val department = spark.read.parquet("...")
people.filter("age > 30")
.join(department, people("deptId") === department("id"))
.groupBy(department("name"), people("gender"))
.agg(avg(people("salary")), max(people("age")))
department("name") is just syntactic sugar for calling apply function:
department.apply("name") which returns Column
from Spark API, Dataset object:
/**
* Selects column based on the column name and returns it as a [[Column]].
*
* #note The column name can also reference to a nested column like `a.b`.
*
* #group untypedrel
* #since 2.0.0
*/
def apply(colName: String): Column = col(colName)

change a dataframe row value with dynamic number of columns spark scala

I have a dataframe (contains 10 columns) for which I want to change the value of a row (for the last column only). I have written following code for this:
val newDF = spark.sqlContext.createDataFrame(WRADF.rdd.map(r=> {
Row(r.get(0), r.get(1),
r.get(2), r.get(3),
r.get(4), r.get(5),
r.get(6), r.get(7),
r.get(8), decrementCounter(r))
}), WRADF.schema)
I want to change the value of a row for 10th column only (for which I wrote decrementCounter() function). But the above code only runs for dataframes with 10 columns. I don't know how to convert this code so that it can run for different dataframe (with different number of columns). Any help will be appreciated.
Don't do something like this. Define udf
import org.apache.spark.sql.functions.udf._
val decrementCounter = udf((x: T) => ...) // adjust types and content to your requirements
df.withColumn("someName", decrementCounter($"someColumn"))
I think UDF will be a better choice because it can be applied using the Column name itself.
For more on udf you can take a look here : https://docs.databricks.com/spark/latest/spark-sql/udf-scala.html
For your code just use this :
import org.apache.spark.sql.functions.udf._
val decrementCounterUDF = udf(decrementCounter _)
df.withColumn("columnName", decrementCounterUDF($"columnName"))
What it will does is apply this decrementCounter function on each and every value of column columnName.
I hope this helps, cheers !

How to use orderby() with descending order in Spark window functions?

I need a window function that partitions by some keys (=column names), orders by another column name and returns the rows with top x ranks.
This works fine for ascending order:
def getTopX(df: DataFrame, top_x: String, top_key: String, top_value:String): DataFrame ={
val top_keys: List[String] = top_key.split(", ").map(_.trim).toList
val w = Window.partitionBy(top_keys(1),top_keys.drop(1):_*)
.orderBy(top_value)
val rankCondition = "rn < "+top_x.toString
val dfTop = df.withColumn("rn",row_number().over(w))
.where(rankCondition).drop("rn")
return dfTop
}
But when I try to change it to orderBy(desc(top_value)) or orderBy(top_value.desc) in line 4, I get a syntax error. What's the correct syntax here?
There are two versions of orderBy, one that works with strings and one that works with Column objects (API). Your code is using the first version, which does not allow for changing the sort order. You need to switch to the column version and then call the desc method, e.g., myCol.desc.
Now, we get into API design territory. The advantage of passing Column parameters is that you have a lot more flexibility, e.g., you can use expressions, etc. If you want to maintain an API that takes in a string as opposed to a Column, you need to convert the string to a column. There are a number of ways to do this and the easiest is to use org.apache.spark.sql.functions.col(myColName).
Putting it all together, we get
.orderBy(org.apache.spark.sql.functions.col(top_value).desc)
Say for example, if we need to order by a column called Date in descending order in the Window function, use the $ symbol before the column name which will enable us to use the asc or desc syntax.
Window.orderBy($"Date".desc)
After specifying the column name in double quotes, give .desc which will sort in descending order.
Column
col = new Column("ts")
col = col.desc()
WindowSpec w = Window.partitionBy("col1", "col2").orderBy(col)