Spark: apply sliding() to each row without UDF - scala

I have a Dataframe with several columns. The i-th column contains strings. I want to apply the string sliding(n) function to each string in the column. Is there a way to do so without using user-defined functions?
Example:
My dataframe is
var df = Seq((0, "hello"), (1, "hola")).toDF("id", "text")
I want to apply the sliding(3) function to each element of column "text" to obtain a dataframe corresponding to
Seq(
(0, ("hel", "ell", "llo")),
(1, ("hol", "ola"))
)
How can I do this?

For Spark version >= 2.4.0, this can be done using the built-in functions array_repeat, transform and substring.
import org.apache.spark.sql.functions.{array_repeat, expr, length}
// Repeat the string so there is one copy per slice (length - 3 + 1 copies)
val repeated_df = df.withColumn("tmp", array_repeat($"text", length($"text") - 3 + 1))
// Get the slices with the transform higher-order function
val res = repeated_df.withColumn("str_slices",
  expr("transform(tmp, (x, i) -> substring(x from i+1 for 3))")
)
res.show(false)
+---+-----+---------------------+---------------+
|id |text |tmp                  |str_slices     |
+---+-----+---------------------+---------------+
|0  |hello|[hello, hello, hello]|[hel, ell, llo]|
|1  |hola |[hola, hola]         |[hol, ola]     |
+---+-----+---------------------+---------------+
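As a variant, the intermediate tmp column can be skipped by generating the slice start positions with sequence; a minimal sketch, assuming every string has length >= 3 as in the example (shorter strings would need a guard):
val res2 = df.withColumn("str_slices",
  // sequence(1, length - 2) yields the start positions 1..n-2; each position is sliced to length 3
  expr("transform(sequence(1, length(text) - 2), i -> substring(text, i, 3))")
)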

Related

Apply a transformation to all the columns with the same data type on Spark

I need to apply a transformation to all the Integer columns of my Data Frame before writing a CSV. The transformation consists of changing the type to String and then transforming the format to the European one (e.g. 1234567 -> "1234567" -> "1.234.567").
Does Spark have any way to apply this transformation to all the Integer columns? I want it to be generic functionality (because I need to write multiple CSVs) instead of hardcoding the columns to transform for each dataframe.
DataFrame has a dtypes method, which returns column names along with their data types: Array[("Column name", "Data Type")].
You can map over this array, applying a different expression to each column based on its data type, and then pass the mapped list to the select method:
import spark.implicits._
import org.apache.spark.sql.functions._
val dataSeq = Seq(
(1246984, 993922, "test_1"),
(246984, 993922, "test_2"),
(246984, 993922, "test_3"))
val df = dataSeq.toDF("int_1", "int_2", "str_3")
df.show
+-------+------+------+
| int_1| int_2| str_3|
+-------+------+------+
|1246984|993922|test_1|
| 246984|993922|test_2|
| 246984|993922|test_3|
+-------+------+------+
val columns = df.dtypes.map {
  case (c, "IntegerType") => regexp_replace(format_number(col(c), 0), ",", ".").as(c)
  case (c, _)             => col(c)
}
val df2 = df.select(columns:_*)
df2.show
+---------+-------+------+
| int_1| int_2| str_3|
+---------+-------+------+
|1.246.984|993.922|test_1|
|  246.984|993.922|test_2|
|  246.984|993.922|test_3|
+---------+-------+------+
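Since the goal is to reuse this before writing several CSVs, the same mapping can be wrapped in a small helper; a sketch (the name europeanIntFormat is illustrative):
import org.apache.spark.sql.DataFrame
def europeanIntFormat(df: DataFrame): DataFrame = {
  val cols = df.dtypes.map {
    case (c, "IntegerType") => regexp_replace(format_number(col(c), 0), ",", ".").as(c)
    case (c, _)             => col(c)
  }
  df.select(cols: _*)
}
// apply right before each CSV write
europeanIntFormat(df).show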

Key corresponding to max value in a spark map column

If I have a spark map column from string to double, is there an easy way to generate a new column with the key corresponding to the maximum value?
I was able to achieve it using collection functions as illustrated below:
import org.apache.spark.sql.functions._
val mockedDf = Seq(1, 2, 3)
.toDF("id")
.withColumn("optimized_probabilities_map", typedLit(Map("foo"->0.34333337, "bar"->0.23)))
val df = mockedDf
.withColumn("optimizer_probabilities", map_values($"optimized_probabilities_map"))
.withColumn("max_probability", array_max($"optimizer_probabilities"))
.withColumn("max_position", array_position($"optimizer_probabilities", $"max_probability"))
.withColumn("optimizer_ruler_names", map_keys($"optimized_probabilities_map"))
.withColumn("optimizer_ruler_name", $"optimizer_ruler_names"( $"max_position"))
However, this solution is unnecessarily long and not very efficient. There is also a possible precision issue since I am comparing doubles when using array_position. I wonder if there is a better way to do this without UDFs, maybe using an expression string.
Since you can use Spark 2.4+, one way is to use the Spark SQL built-in function aggregate, where we iterate through all the map_keys, compare the corresponding map_values with the buffered value acc.val, and update acc.name accordingly:
mockedDf.withColumn("optimizer_ruler_name", expr("""
  aggregate(
    map_keys(optimized_probabilities_map),
    (string(NULL) as name, double(NULL) as val),
    (acc, y) ->
      IF(acc.val is NULL OR acc.val < optimized_probabilities_map[y],
        (y as name, optimized_probabilities_map[y] as val),
        acc),
    acc -> acc.name
  )
""")).show(false)
+---+--------------------------------+--------------------+
|id |optimized_probabilities_map     |optimizer_ruler_name|
+---+--------------------------------+--------------------+
|1  |[foo -> 0.34333337, bar -> 0.23]|foo                 |
|2  |[foo -> 0.34333337, bar -> 0.23]|foo                 |
|3  |[foo -> 0.34333337, bar -> 0.23]|foo                 |
+---+--------------------------------+--------------------+
Another solution would be to explode the map column and then use Window function to get the max value like this:
import org.apache.spark.sql.expressions.Window
val w = Window.partitionBy($"id")
val df = mockedDf.select($"id", $"optimized_probabilities_map", explode($"optimized_probabilities_map"))
.withColumn("max_value", max($"value").over(w))
.where($"max_value" === $"value")
.drop("value", "max_value")
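After the filter, the surviving key column holds the winning key (note that a tie would keep more than one row per id). On Spark 3.0+ there is also the built-in max_by aggregate, which gives a shorter route at the cost of a groupBy; a sketch using the same mockedDf:
mockedDf
  .select($"id", explode($"optimized_probabilities_map"))
  .groupBy("id")
  .agg(expr("max_by(key, value)").as("optimizer_ruler_name"))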

Scala Spark - split vector column into separate columns in a Spark DataFrame

I have a Spark DataFrame with a column of Vector values. The vector values are all n-dimensional, i.e. they all have the same length. I also have a list of column names Array("f1", "f2", "f3", ..., "fn"), each corresponding to one element in the vector.
some_columns... | Features
... | [0,1,0,..., 0]
to
some_columns... | f1 | f2 | f3 | ... | fn
... | 0 | 1 | 0 | ... | 0
What is the best way to achieve this? I thought of one way which is to create a new DataFrame with createDataFrame(Row(Features), featureNameList) and then join with the old one, but it requires spark context to use createDataFrame. I only want to transform the existing data frame. I also know .withColumn("fi", value) but what do I do if n is large?
I'm new to Scala and Spark and couldn't find any good examples for this. I think this can be a common task. My particular case is that I used the CountVectorizer and wanted to recover each column individually for better readability instead of only having the vector result.
One way could be to convert the vector column to an array<double> and then use getItem to extract the individual elements.
import org.apache.spark.sql.functions._
import org.apache.spark.ml._
val df = Seq( (1 , linalg.Vectors.dense(1,0,1,1,0) ) ).toDF("id", "features")
//df: org.apache.spark.sql.DataFrame = [id: int, features: vector]
df.show(false)
//+---+---------------------+
//|id |features             |
//+---+---------------------+
//|1  |[1.0,0.0,1.0,1.0,0.0]|
//+---+---------------------+
// A UDF to convert VectorUDT to ArrayType
val vecToArray = udf( (xs: linalg.Vector) => xs.toArray )
// Add a ArrayType Column
val dfArr = df.withColumn("featuresArr" , vecToArray($"features") )
// Array of element names that need to be fetched
// ArrayIndexOutOfBounds is not checked.
// sizeof `elements` should be equal to the number of entries in column `features`
val elements = Array("f1", "f2", "f3", "f4", "f5")
// Create a SQL-like expression using the array
val sqlExpr = elements.zipWithIndex.map{ case (alias, idx) => col("featuresArr").getItem(idx).as(alias) }
// Extract Elements from dfArr
dfArr.select(sqlExpr : _*).show
//+---+---+---+---+---+
//| f1| f2| f3| f4| f5|
//+---+---+---+---+---+
//|1.0|0.0|1.0|1.0|0.0|
//+---+---+---+---+---+
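If you are on Spark 3.0 or later, the same conversion is available without a hand-written UDF through vector_to_array in spark-mllib; a sketch under that version assumption:
import org.apache.spark.ml.functions.vector_to_array
val dfArr2 = df.withColumn("featuresArr", vector_to_array($"features"))
dfArr2.select(sqlExpr : _*).show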

Process all columns / the entire row in a Spark UDF

For a dataframe containing a mix of string and numeric datatypes, the goal is to create a new features column that is a minhash of all of them.
While this could be done by going through dataframe.rdd, it is expensive to do that when the next step will simply convert the RDD back to a dataframe.
So is there a way to do a udf along the following lines:
val wholeRowUdf = udf( (row: Row) => computeHash(row))
Row is not a Spark SQL datatype of course, so this would not work as shown.
Update/clarification: I realize it is easy to create a full-row UDF that runs inside withColumn. What is not so clear is what can be used inside a Spark SQL statement:
val featurizedDf = spark.sql("select wholeRowUdf( what goes here? ) as features from mytable")
Row is not a Spark SQL datatype of course, so this would not work as shown.
I am going to show that you can use Row to pass all the columns, or selected columns, to a udf function by using the struct built-in function.
First I define a dataframe
val df = Seq(
("a", "b", "c"),
("a1", "b1", "c1")
).toDF("col1", "col2", "col3")
// +----+----+----+
// |col1|col2|col3|
// +----+----+----+
// |a   |b   |c   |
// |a1  |b1  |c1  |
// +----+----+----+
Then I define a function that turns all the elements of a row into a single string separated by , (standing in for your computeHash function):
import org.apache.spark.sql.Row
def concatFunc(row: Row) = row.mkString(", ")
Then I use it in a udf function:
import org.apache.spark.sql.functions._
def combineUdf = udf((row: Row) => concatFunc(row))
Finally I call the udf function with withColumn, using the struct built-in function to combine the selected columns into one column, which is passed to the udf:
df.withColumn("contcatenated", combineUdf(struct(col("col1"), col("col2"), col("col3")))).show(false)
// +----+----+----+-------------+
// |col1|col2|col3|contcatenated|
// +----+----+----+-------------+
// |a |b |c |a, b, c |
// |a1 |b1 |c1 |a1, b1, c1 |
// +----+----+----+-------------+
So you can see that Row can be used to pass the whole row as an argument.
You can even pass all of the columns in a row at once:
val columns = df.columns
df.withColumn("contcatenated", combineUdf(struct(columns.map(col): _*)))
Updated
You can achieve the same with SQL queries too; you just need to register the udf function:
df.createOrReplaceTempView("tempview")
sqlContext.udf.register("combineUdf", combineUdf)
sqlContext.sql("select *, combineUdf(struct(`col1`, `col2`, `col3`)) as concatenated from tempview")
It will give you the same result as above
Now, if you don't want to hardcode the column names, you can select whichever columns you want and build the string dynamically:
val columns = df.columns.map(x => "`"+x+"`").mkString(",")
sqlContext.sql(s"select *, combineUdf(struct(${columns})) as concatenated from tempview")
I hope the answer is helpful
I came up with a workaround: drop the column names into any existing spark sql function to generate a new output column:
concat(${df.columns.tail.mkString(",'-',")}) as Features
In this case the first column in the dataframe is a target and was excluded. That is another advantage of this approach: the actual list of columns may be manipulated.
This approach avoids unnecessary restructuring of the RDD/dataframes.
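For completeness, a hedged sketch of how that fragment could be used end to end (the view name mytable and the val featurizedDf are illustrative):
df.createOrReplaceTempView("mytable")
val featurizedDf = spark.sql(
  s"select concat(${df.columns.tail.mkString(",'-',")}) as Features from mytable"
)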

How to concatenate multiple columns into single column (with no prior knowledge on their number)?

Let say I have the following dataframe:
+----------+-----------+---------+-------+----+
|agentName |original_dt|parsed_dt|user   |text|
+----------+-----------+---------+-------+----+
|qwertyuiop|0          |0        |16102.0|0   |
+----------+-----------+---------+-------+----+
I wish to create a new dataframe with one more column that has the concatenation of all the elements of the row:
+----------+-----------+---------+-------+----+---------------------------+
|agentName |original_dt|parsed_dt|user   |text|newCol                     |
+----------+-----------+---------+-------+----+---------------------------+
|qwertyuiop|0          |0        |16102.0|0   |[qwertyuiop, 0,0, 16102, 0]|
+----------+-----------+---------+-------+----+---------------------------+
Note: This is a just an example. The number of columns and names of them is not known. It is dynamic.
TL;DR Use struct function with Dataset.columns operator.
Quoting the scaladoc of struct function:
struct(colName: String, colNames: String*): Column Creates a new struct column that composes multiple input columns.
There are two variants: string-based for column names or using Column expressions (that gives you more flexibility on the calculation you want to apply on the concatenated columns).
From Dataset.columns:
columns: Array[String] Returns all column names as an array.
Your case would then look as follows:
scala> df.withColumn("newCol",
struct(df.columns.head, df.columns.tail: _*)).
show(false)
+----------+-----------+---------+-------+----+--------------------------+
|agentName |original_dt|parsed_dt|user   |text|newCol                    |
+----------+-----------+---------+-------+----+--------------------------+
|qwertyuiop|0          |0        |16102.0|0   |[qwertyuiop,0,0,16102.0,0]|
+----------+-----------+---------+-------+----+--------------------------+
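The Column-based variant mentioned above leaves room for per-column expressions before combining, e.g. casting everything to string first; a minimal sketch:
df.withColumn("newCol", struct(df.columns.map(c => col(c).cast("string")): _*)).show(false)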
I think this works perfectly for your case; here is an example:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local").appName("test").getOrCreate()
import spark.implicits._

val data = spark.sparkContext.parallelize(
  Seq(("qwertyuiop", 0, 0, 16102.0, 0))
).toDF("agentName", "original_dt", "parsed_dt", "user", "text")

// concat_ws joins all columns with ";", split turns the result back into an array column
val result = data.withColumn("newCol",
  split(concat_ws(";", data.schema.fieldNames.map(c => col(c)): _*), ";"))
result.show(false)
+----------+-----------+---------+-------+----+------------------------------+
|agentName |original_dt|parsed_dt|user   |text|newCol                        |
+----------+-----------+---------+-------+----+------------------------------+
|qwertyuiop|0          |0        |16102.0|0   |[qwertyuiop, 0, 0, 16102.0, 0]|
+----------+-----------+---------+-------+----+------------------------------+
Hope this helped!
In general, you can merge multiple dataframe columns into one using array.
df.select($"*",array($"col1",$"col2").as("newCol")) \\$"*" will capture all existing columns
Here is the one line solution for your case:
df.select($"*",array($"agentName",$"original_dt",$"parsed_dt",$"user", $"text").as("newCol"))
You can use a udf function to concat all the columns into one. All you have to do is define a udf function, pass it the columns you want to concatenate, and call it with the dataframe's withColumn function.
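A minimal sketch of that udf variant (concatUdf and newCol are illustrative names; df is the dataframe from the question, and the columns are cast to string so the array is homogeneous):
import org.apache.spark.sql.functions._
val concatUdf = udf((values: Seq[String]) => values.mkString(","))
df.withColumn("newCol", concatUdf(array(df.columns.map(c => col(c).cast("string")): _*)))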
Or
You can use the concat_ws(java.lang.String sep, Column... exprs) function available for dataframes:
var df = Seq(("qwertyuiop",0,0,16102.0,0))
.toDF("agentName","original_dt","parsed_dt","user","text")
df.withColumn("newCol", concat_ws(",",$"agentName",$"original_dt",$"parsed_dt",$"user",$"text"))
df.show(false)
Will give you output as
+----------+-----------+---------+-------+----+------------------------+
|agentName |original_dt|parsed_dt|user   |text|newCol                  |
+----------+-----------+---------+-------+----+------------------------+
|qwertyuiop|0          |0        |16102.0|0   |qwertyuiop,0,0,16102.0,0|
+----------+-----------+---------+-------+----+------------------------+
That will get you the result you want
There may be syntax errors in my answer. This is useful if you are using Java < 8 and Spark < 2:
String columns = null;
for (String columnName : dataframe.columns()) {
    columns = (columns == null) ? columnName : columns + "," + columnName;
}
// assumes the dataframe has been registered as a temp table named "dataframe"
sqlContext.sql("select *, concat_ws('|', " + columns + ") as complete_record " +
    "from dataframe").show();