use length function in substring in spark - scala

I am trying to use the length function inside a substring function in a DataFrame,
but it gives an error:
val substrDF = testDF.withColumn("newcol", substring($"col", 1, length($"col")-1))
Below is the error:
error: type mismatch;
found : org.apache.spark.sql.Column
required: Int
I am using Spark 2.1.

You can use the expr function, which parses a SQL expression string, so length(value) can appear as an argument:
val data = List("first", "second", "third")
val df = sparkContext.parallelize(data).toDF("value")
val result = df.withColumn("cutted", expr("substring(value, 1, length(value)-1)"))
result.show(false)
output:
+------+------+
|value |cutted|
+------+------+
|first |firs |
|second|secon |
|third |thir |
+------+------+

You could also use $"COLUMN".substr:
val substrDF = testDF.withColumn("newcol", $"col".substr(lit(1), length($"col")-1))
Full example with output:
val testDF = sc.parallelize(List("first", "second", "third")).toDF("col")
val result = testDF.withColumn("newcol", $"col".substr(org.apache.spark.sql.functions.lit(1), length($"col")-1))
result.show(false)
+------+------+
|col |newcol|
+------+------+
|first |firs |
|second|secon |
|third |thir |
+------+------+

You get that error because the signature of substring is
def substring(str: Column, pos: Int, len: Int): Column
The len argument that you are passing is a Column, but it must be an Int.
One option is to implement a simple UDF that drops the last character:
// Option(...) guards against null input values
val dropLast = udf((str: String) => Option(str).map(_.dropRight(1)))
testDF.withColumn("newCol", dropLast($"col"))
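When moving between Spark's substring and a Scala UDF, keep in mind that the indexing conventions differ: Spark's substring takes a 1-based start position and a length, while String.substring takes a 0-based start and an exclusive end index. The plain-Scala sketch below illustrates the mapping. sqlSubstring is a hypothetical helper written only for this illustration (it handles pos >= 1 only) and is not part of Spark.

```scala
// Hypothetical helper that mimics substring(str, pos, len) for pos >= 1.
def sqlSubstring(str: String, pos: Int, len: Int): String = {
  val start = math.min(pos - 1, str.length)                 // 1-based -> 0-based
  val end = math.min(start + math.max(len, 0), str.length)  // exclusive end index
  str.substring(start, end)
}

val words = List("first", "second", "third")

// substring(value, 1, length(value) - 1) expressed with the helper
val viaSqlStyle = words.map(w => sqlSubstring(w, 1, w.length - 1))

// The same result with the idiomatic Scala call
val viaDropRight = words.map(_.dropRight(1))
```

Both lists contain the words without their last character, so dropRight(1) is a reasonable body for a UDF that should match the SQL expression.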

If all you want is to remove the last character of the string, you can also do that without a UDF by using regexp_replace:
testDF.show
+---+----+
| id|name|
+---+----+
| 1|abcd|
| 2|qazx|
+---+----+
testDF.withColumn("newcol", regexp_replace($"name", ".$" , "") ).show
+---+----+------+
| id|name|newcol|
+---+----+------+
| 1|abcd| abc|
| 2|qazx| qaz|
+---+----+------+

You can use the substr method on Column to achieve this.
val substrDF = testDF.withColumn("newcol", 'col.substr(lit(1), length('col)-1))
The first parameter is the 1-based position at which the substring starts, and the second parameter is its length.
substr has two overloads, (startPos: Int, len: Int) and (startPos: Column, len: Column); the call above uses the Column overload, which is why 1 is wrapped in lit.

Related

Spark creating a new column based on a mapped value of an existing column

I am trying to map the values of one column in my dataframe to a new value and put it into a new column using a UDF, but I am unable to get the UDF to accept a parameter that isn't also a column. For example, I have a dataframe dfOriginal like this:
+-----------+-----+
|high_scores|count|
+-----------+-----+
| 9| 1|
| 21| 2|
| 23| 3|
| 7| 6|
+-----------+-----+
And I'm trying to get a sense of the bin the numeric value falls into, so I may construct a list of bins like this:
case class Bin(binMax:BigDecimal, binWidth:BigDecimal) {
val binMin = binMax - binWidth
// only one of the two evaluations can include an "or=", otherwise a value could fit in 2 bins
def fitsInBin(value: BigDecimal): Boolean = value > binMin && value <= binMax
def rangeAsString(): String = {
val sb = new StringBuilder()
sb.append(trimDecimal(binMin)).append(" - ").append(trimDecimal(binMax))
sb.toString()
}
}
And then I want to transform my old dataframe like this to make dfBin:
+-----------+-----+---------+
|high_scores|count|bin_range|
+-----------+-----+---------+
| 9| 1| 0 - 10 |
| 21| 2| 20 - 30 |
| 23| 3| 20 - 30 |
| 7| 6| 0 - 10 |
+-----------+-----+---------+
So that I can ultimately get a count of the instances of the bins by calling .groupBy("bin_range").count().
I am trying to generate dfBin by using the withColumn function with an UDF.
Here's the code with the UDF I am attempting to use:
val convertValueToBinRangeUDF = udf((value:String, binList:List[Bin]) => {
val number = BigDecimal(value)
val bin = binList.find( bin => bin.fitsInBin(number)).getOrElse(Bin(BigDecimal(0), BigDecimal(0)))
bin.rangeAsString()
})
val binList = List(Bin(10, 10), Bin(20, 10), Bin(30, 10), Bin(40, 10), Bin(50, 10))
val dfBin = dfOriginal.withColumn("bin_range", convertValueToBinRangeUDF(col("high_scores"), binList))
But it's giving me a type mismatch:
Error:type mismatch;
found : List[Bin]
required: org.apache.spark.sql.Column
val valueCountsWithBin = valuesCounts.withColumn(binRangeCol, convertValueToBinRangeUDF(col(columnName), binList))
Seeing the definition of an UDF makes me think it should handle the conversion fine, but it's clearly not, any ideas?
The problem is that parameters to an UDF should all be of column type. One solution would be to convert binList into a column and pass it to the UDF similar to the current code.
However, it is simpler to adjust the UDF slightly and turn it into a def. In this way you can easily pass other non-column type data:
def convertValueToBinRangeUDF(binList: List[Bin]) = udf((value:String) => {
val number = BigDecimal(value)
val bin = binList.find( bin => bin.fitsInBin(number)).getOrElse(Bin(BigDecimal(0), BigDecimal(0)))
bin.rangeAsString()
})
Usage:
val dfBin = valuesCounts.withColumn("bin_range", convertValueToBinRangeUDF(binList)($"columnName"))
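The def-that-returns-a-UDF approach works because the returned function closes over binList. A minimal plain-Scala sketch of the same closure pattern, without Spark, might look like the following. SimpleBin and makeBinner are hypothetical names used only for illustration; in Spark the returned function would be wrapped in udf(...).

```scala
// Simplified stand-in for the Bin class above, using Int instead of BigDecimal.
final case class SimpleBin(max: Int, width: Int) {
  val min: Int = max - width
  // Lower bound exclusive, upper bound inclusive, as in fitsInBin.
  def contains(value: Int): Boolean = value > min && value <= max
  def label: String = s"$min - $max"
}

// Takes the non-column argument first and returns a one-argument function
// that closes over it.
def makeBinner(bins: List[SimpleBin]): Int => String =
  value => bins.find(_.contains(value)).map(_.label).getOrElse("no bin")

val bins = List(SimpleBin(10, 10), SimpleBin(20, 10), SimpleBin(30, 10))
val toRange = makeBinner(bins)
val ranges = List(9, 21, 23, 7).map(toRange)
```

The list of bins is fixed when makeBinner is called, so the resulting function only needs the value that comes from the column.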
Try this -
scala> case class Bin(binMax:BigDecimal, binWidth:BigDecimal) {
| val binMin = binMax - binWidth
|
| // only one of the two evaluations can include an "or=", otherwise a value could fit in 2 bins
| def fitsInBin(value: BigDecimal): Boolean = value > binMin && value <= binMax
|
| def rangeAsString(): String = {
| val sb = new StringBuilder()
| sb.append(binMin).append(" - ").append(binMax)
| sb.toString()
| }
| }
defined class Bin
scala> val binList = List(Bin(10, 10), Bin(20, 10), Bin(30, 10), Bin(40, 10), Bin(50, 10))
binList: List[Bin] = List(Bin(10,10), Bin(20,10), Bin(30,10), Bin(40,10), Bin(50,10))
scala> spark.udf.register("convertValueToBinRangeUDF", (value: String) => {
| val number = BigDecimal(value)
| val bin = binList.find( bin => bin.fitsInBin(number)).getOrElse(Bin(BigDecimal(0), BigDecimal(0)))
| bin.rangeAsString()
| })
res13: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))
//-- Testing with one record
scala> val dfOriginal = spark.sql(s""" select "9" as `high_scores`, "1" as count """)
dfOriginal: org.apache.spark.sql.DataFrame = [high_scores: string, count: string]
scala> dfOriginal.createOrReplaceTempView("dfOriginal")
scala> val dfBin = spark.sql(s""" select high_scores, count, convertValueToBinRangeUDF(high_scores) as bin_range from dfOriginal """)
dfBin: org.apache.spark.sql.DataFrame = [high_scores: string, count: string ... 1 more field]
scala> dfBin.show(false)
+-----------+-----+---------+
|high_scores|count|bin_range|
+-----------+-----+---------+
|9 |1 |0 - 10 |
+-----------+-----+---------+
Hope this will help.

How to transform a string column of a dataframe into a column of Array[String] with Apache Spark and Scala

I have a DataFrame with a column 'title_from'.
This column contains a sentence, and I want to transform it into an Array[String]. I have tried something like this, but it does not work:
val newDF = df.select("title_from").map(x => x.split("\\s+"))
How can I achieve this? How can I transform a DataFrame of strings into a DataFrame of Array[String]? I want every row of newDF to be an array of the words from df.
Thanks for any help!
You can use the withColumn function.
import org.apache.spark.sql.functions._
val newDF = df.withColumn("split_title_from", split(col("title_from"), "\\s+"))
.select("split_title_from")
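Since the split pattern is a regular expression, it can help to see what "\\s+" does to a single sentence. The plain-Scala sketch below uses String.split rather than Spark's split function, and the sample sentence is made up for illustration. It also shows that leading whitespace produces an empty first token, so trimming first may be worthwhile.

```scala
val title = "  the quick  brown fox"

// Splitting directly keeps an empty first element because the string
// starts with a separator.
val untrimmed = title.split("\\s+").toList

// Trimming first gives only the words.
val wordsOnly = title.trim.split("\\s+").toList
```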
You can try the following to get the list of all authors:
scala> val df = Seq((1,"a1,a2,a3"), (2,"a1,a4,a10")).toDF("id","author")
df: org.apache.spark.sql.DataFrame = [id: int, author: string]
scala> df.show()
+---+---------+
| id| author|
+---+---------+
| 1| a1,a2,a3|
| 2|a1,a4,a10|
+---+---------+
scala> df.select("author").show
+---------+
| author|
+---------+
| a1,a2,a3|
|a1,a4,a10|
+---------+
scala> df.select("author").flatMap( row => { row.get(0).toString().split(",")}).show()
+-----+
|value|
+-----+
| a1|
| a2|
| a3|
| a1|
| a4|
| a10|
+-----+

How to use Except function with spark Dataframe

I would like to get the differences between two DataFrames, but return only the rows and fields that differ. For example, I have two DataFrames as follows:
val DF1 = Seq(
(3,"Chennai", "rahman",9846, 45000,"SanRamon"),
(1,"Hyderabad","ram",9847, 50000,"SF")
).toDF("emp_id","emp_city","emp_name","emp_phone","emp_sal","emp_site")
val DF2 = Seq(
(3,"Chennai", "rahman",9846, 45000,"SanRamon"),
(1,"Sydney","ram",9847, 48000,"SF")
).toDF("emp_id","emp_city","emp_name","emp_phone","emp_sal","emp_site")
The only difference between these two DataFrames is emp_city and emp_sal in the second row.
Now, I am using the except function, which gives me the entire row as follows:
DF1.except(DF2)
+------+---------+--------+---------+-------+--------+
|emp_id| emp_city|emp_name|emp_phone|emp_sal|emp_site|
+------+---------+--------+---------+-------+--------+
| 1|Hyderabad| ram| 9847| 50000| SF|
+------+---------+--------+---------+-------+--------+
However, I need the output to be like this:
+------+---------+-------+
|emp_id| emp_city|emp_sal|
+------+---------+-------+
| 1|Hyderabad| 50000|
+------+---------+-------+
Which shows the different cells as well as emp_id.
Edit:
If a column has changed, it should appear; if there is no change, it should be hidden or null.
The following should give you the result you are looking for.
DF1.except(DF2).select("emp_id","emp_city","emp_sal")
You should consider the comment from @user238607, as we cannot predict which columns are going to differ.
Still, you can try this workaround.
I'm assuming emp_id is unique.
scala> val diff = udf((col: String, c1: String, c2: String) => if (c1 == c2) "" else col )
scala> DF1.join(DF2, DF1("emp_id") === DF2("emp_id"))
res15: org.apache.spark.sql.DataFrame = [emp_id: int, emp_city: string ... 10 more fields]
scala> res15.withColumn("diffcolumn", split(concat_ws(",",DF1.columns.map(x => diff(lit(x), DF1(x), DF2(x))):_*),","))
res16: org.apache.spark.sql.DataFrame = [emp_id: int, emp_city: string ... 11 more fields]
scala> res16.show(false)
+------+---------+--------+---------+-------+--------+------+--------+--------+---------+-------+--------+---------------------------+
|emp_id|emp_city |emp_name|emp_phone|emp_sal|emp_site|emp_id|emp_city|emp_name|emp_phone|emp_sal|emp_site|diffcolumn |
+------+---------+--------+---------+-------+--------+------+--------+--------+---------+-------+--------+---------------------------+
|3 |Chennai |rahman |9846 |45000 |SanRamon|3 |Chennai |rahman |9846 |45000 |SanRamon|[, , , , , ] |
|1 |Hyderabad|ram |9847 |50000 |SF |1 |Sydney |ram |9847 |48000 |SF |[, emp_city, , , emp_sal, ]|
+------+---------+--------+---------+-------+--------+------+--------+--------+---------+-------+--------+---------------------------+
scala> val diff_cols = res16.select(explode($"diffcolumn")).filter("col != ''").distinct.collect.map(a=>col(a(0).toString))
scala> val exceptOpr = DF1.except(DF2)
scala> exceptOpr.select(diff_cols:_*).show
+-------+---------+
|emp_sal| emp_city|
+-------+---------+
| 50000|Hyderabad|
+-------+---------+
I found this solution, which seems to work fine:
val cols = DF1.columns.filter(_ != "emp_id").toList
val DF3 = DF1.except(DF2)
def mapDiffs(name: String) = when($"l.$name" === $"r.$name", null ).otherwise(array($"l.$name", $"r.$name")).as(name)
val result = DF2.as("l").join(DF3.as("r"), "emp_id").select($"emp_id" :: cols.map(mapDiffs): _*)
It generates the following output:
+------+-------------------+--------+---------+--------------+--------+
|emp_id| emp_city|emp_name|emp_phone| emp_sal|emp_site|
+------+-------------------+--------+---------+--------------+--------+
| 1|[Sydney, Hyderabad]| null| null|[48000, 50000]| null|
+------+-------------------+--------+---------+--------------+--------+
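The per-column logic of mapDiffs can be pictured in plain Scala as follows. This is only a sketch: rows are modelled as maps from column name to value (an assumption made for the illustration), and diffFields is a hypothetical helper, not a Spark function.

```scala
// Each record is modelled as a column-name -> value map.
type Record = Map[String, Any]

// For every non-key column, keep the (left, right) pair when the values
// differ and None otherwise, mirroring
// when(l === r, null).otherwise(array(l, r)).
def diffFields(left: Record, right: Record, key: String): Map[String, Option[(Any, Any)]] =
  (left.keySet - key).map { name =>
    val l = left(name)
    val r = right(name)
    name -> (if (l == r) None else Some((l, r)))
  }.toMap

val fromDF2: Record = Map("emp_id" -> 1, "emp_city" -> "Sydney", "emp_name" -> "ram", "emp_sal" -> 48000)
val fromDF1: Record = Map("emp_id" -> 1, "emp_city" -> "Hyderabad", "emp_name" -> "ram", "emp_sal" -> 50000)

val diffs = diffFields(fromDF2, fromDF1, "emp_id")
```

Only emp_city and emp_sal carry a pair of values; unchanged columns map to None, which corresponds to the null cells in the output above.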

Select the last element of an Array in a DataFrame

I'm working on a project and I'm dealing with some nested JSON data with a complicated schema/data structure. Basically, what I want to do is filter one of the columns in a DataFrame so that only the last element of the array is selected. I'm totally stuck on how to do this. I hope this makes sense.
Below is an example of what I'm trying to accomplish:
val singersDF = Seq(
("beatles", "help,hey,jude"),
("romeo", "eres,mia"),
("elvis", "this,is,an,example")
).toDF("name", "hit_songs")
val actualDF = singersDF.withColumn(
"hit_songs",
split(col("hit_songs"), "\\,")
)
actualDF.show(false)
actualDF.printSchema()
+-------+-----------------------+
|name |hit_songs |
+-------+-----------------------+
|beatles|[help, hey, jude] |
|romeo |[eres, mia] |
|elvis |[this, is, an, example]|
+-------+-----------------------+
root
|-- name: string (nullable = true)
|-- hit_songs: array (nullable = true)
| |-- element: string (containsNull = true)
The end goal for the output would be the following, to select the last "string" in the hit_songs array.
I'm not worried about what the schema would look like afterwards.
+-------+---------+
|name |hit_songs|
+-------+---------+
|beatles|jude |
|romeo |mia |
|elvis |example |
+-------+---------+
You can use the size function to calculate the index of the desired item in the array, and then pass this as the argument of Column.apply (explicitly or implicitly):
import org.apache.spark.sql.functions._
import spark.implicits._
actualDF.withColumn("hit_songs", $"hit_songs".apply(size($"hit_songs").minus(1)))
Or:
actualDF.withColumn("hit_songs", $"hit_songs"(size($"hit_songs").minus(1)))
Since Spark 2.4, you can use element_at, which supports negative indexing, as this documentation quote shows:
element_at(array, index) - Returns element of array at given (1-based) index. If index < 0, accesses elements from the last to the first. Returns NULL if the index exceeds the length of the array.
With that, here's how to get the last element:
import org.apache.spark.sql.functions.element_at
actualDF.withColumn("hit_songs", element_at($"hit_songs", -1))
Reproducible example:
First let's prepare a sample dataframe with an array column:
val columns = Seq("col1")
val data = Seq((Array(1,2,3)))
val rdd = spark.sparkContext.parallelize(data)
val df = rdd.toDF(columns:_*)
which looks like this:
scala> df.show()
+---------+
| col1|
+---------+
|[1, 2, 3]|
+---------+
Then, apply element_at to get the last element as follows:
scala> df.withColumn("last_value", element_at($"col1", -1)).show()
+---------+----------+
| col1|last_value|
+---------+----------+
|[1, 2, 3]| 3|
+---------+----------+
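To make the 1-based and negative indexing rule from the documentation quote concrete, here is a plain-Scala sketch of the same index arithmetic. elementAt is a hypothetical helper written for this illustration; it returns None where the quoted documentation says NULL, and it rejects index 0 on the assumption that 0 is not a valid 1-based index.

```scala
// Hypothetical helper illustrating 1-based indexing with negative indices
// counting from the end of the sequence.
def elementAt[A](xs: Seq[A], index: Int): Option[A] = {
  require(index != 0, "index is 1-based; 0 is not valid")
  val zeroBased = if (index > 0) index - 1 else xs.length + index
  xs.lift(zeroBased) // None when the position is out of range
}

val songs = Seq("help", "hey", "jude")
val first = elementAt(songs, 1)
val last = elementAt(songs, -1)
val tooFar = elementAt(songs, 4)
```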
Here's one approach:
val actualDF = Seq(
("beatles", Seq("help", "hey", "jude")),
("romeo", Seq("eres", "mia")),
("elvis", Seq("this", "is", "an", "example"))
).toDF("name", "hit_songs")
import org.apache.spark.sql.functions._
actualDF.withColumn("total_songs", size($"hit_songs")).
select($"name", $"hit_songs"($"total_songs" - 1).as("last_song"))
// +-------+---------+
// | name|last_song|
// +-------+---------+
// |beatles| jude|
// | romeo| mia|
// | elvis| example|
// +-------+---------+
You are looking for the Spark SQL function slice (available since Spark 2.4).
In Scala, this is slice($"hit_songs", -1, 1)(0), where -1 is the starting position (the last element), 1 is the length, and (0) extracts the first string from the resulting array of exactly one element.
Full Example:
import org.apache.spark.sql.functions._
val singersDF = Seq(
("beatles", "help,hey,jude"),
("romeo", "eres,mia"),
("elvis", "this,is,an,example")
).toDF("name", "hit_songs")
val actualDF = singersDF.withColumn(
"hit_songs",
split(col("hit_songs"), "\\,")
)
val newDF = actualDF.withColumn("last_song", slice($"hit_songs", -1, 1)(0))
display(newDF)
Output:
+---------+------------------------------+-------------+
| name | hit_songs | last_song |
+---------+------------------------------+-------------+
| beatles | ["help","hey","jude"] | jude |
| romeo | ["eres","mia"] | mia |
| elvis | ["this","is","an","example"] | example |
+---------+------------------------------+-------------+
You can also use a UDF like:
val lastElementUDF = udf((array: Seq[String]) => array.lastOption)
actualDF.withColumn("hit_songs", lastElementUDF($"hit_songs"))
array.lastOption returns None for an empty array and Some(value) otherwise, whereas array.last throws an exception if the array is empty.
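The difference between last and lastOption is easy to check on plain Scala sequences, outside of Spark. The sample values below are made up for illustration.

```scala
import scala.util.Try

val songs = Seq("help", "hey", "jude")
val noSongs = Seq.empty[String]

val lastOfSongs = songs.lastOption   // Some("jude")
val lastOfNone = noSongs.lastOption  // None
val unsafe = Try(noSongs.last)       // a Failure, because last throws on an empty Seq
```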

Get minimum value from an Array in a Spark DataFrame column

I have a DataFrame with Arrays.
val DF = Seq(
("123", "|1|2","3|3|4" ),
("124", "|3|2","|3|4" )
).toDF("id", "complete1", "complete2")
.select($"id", split($"complete1", "\\|").as("complete1"), split($"complete2", "\\|").as("complete2"))
|id |complete1|complete2|
+-------------+---------+---------+
| 123| [, 1, 2]|[3, 3, 4]|
| 124| [, 3, 2]| [, 3, 4]|
+-------------+---------+---------+
How do I extract the minimum of each array?
|id |complete1|complete2|
+-------------+---------+---------+
| 123| 1 | 3 |
| 124| 2 | 3 |
+-------------+---------+---------+
I have tried defining a UDF to do this but I am getting an error.
def minArray(a:Array[String]) :String = a.filter(_.nonEmpty).min.mkString
val minArrayUDF = udf(minArray _)
def getMinArray(df: DataFrame, i: Int): DataFrame = df.withColumn("complete" + i, minArrayUDF(df("complete" + i)))
val minDf = (1 to 2).foldLeft(DF){ case (df, i) => getMinArray(df, i)}
java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to [Ljava.lang.String;
Since Spark 2.4, you can use array_min to find the minimum value in an array. To use this function you will first have to cast your arrays of strings to arrays of integers. Casting will also take care of the empty strings by converting them into null values.
DF.select($"id",
array_min(expr("cast(complete1 as array<int>)")).as("complete1"),
array_min(expr("cast(complete2 as array<int>)")).as("complete2"))
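The effect of casting and then taking the minimum can be pictured in plain Scala: values that do not parse as integers are dropped, and the minimum is taken over what remains. minOfParsable is a hypothetical helper for this illustration, and it uses Option where Spark would produce null.

```scala
import scala.util.Try

// Drop values that are not valid integers (such as ""), then take the minimum.
// reduceOption returns None when nothing is left.
def minOfParsable(values: Seq[String]): Option[Int] =
  values.flatMap(v => Try(v.toInt).toOption).reduceOption(_ min _)

val min1 = minOfParsable(Seq("", "1", "2"))
val min2 = minOfParsable(Seq("", "3", "2"))
val minOfEmpty = minOfParsable(Seq(""))
```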
You can define your udf function as below
def minUdf = udf((arr: Seq[String])=> arr.filterNot(_ == "").map(_.toInt).min)
and call it as
DF.select(col("id"), minUdf(col("complete1")).as("complete1"), minUdf(col("complete2")).as("complete2")).show(false)
which should give you
+---+---------+---------+
|id |complete1|complete2|
+---+---------+---------+
|123|1 |3 |
|124|2 |3 |
+---+---------+---------+
Updated
If the array passed to the UDF is empty or contains only empty strings, you will encounter
java.lang.UnsupportedOperationException: empty.min
You should handle that with an if/else condition in the UDF:
def minUdf = udf((arr: Seq[String])=> {
val filtered = arr.filterNot(_ == "")
if(filtered.isEmpty) 0
else filtered.map(_.toInt).min
})
I hope the answer is helpful
Here is how you can do it without using a UDF.
First explode the arrays you got with split(), then group by id and take the min:
import org.apache.spark.sql.types.IntegerType
val DF = Seq(
("123", "|1|2","3|3|4" ),
("124", "|3|2","|3|4" )
).toDF("id", "complete1", "complete2")
.select($"id", split($"complete1", "\\|").as("complete1"), split($"complete2", "\\|").as("complete2"))
.withColumn("complete1", explode($"complete1"))
.withColumn("complete2", explode($"complete2"))
.groupBy($"id").agg(min($"complete1".cast(IntegerType)).as("complete1"), min($"complete2".cast(IntegerType)).as("complete2"))
Output:
+---+---------+---------+
|id |complete1|complete2|
+---+---------+---------+
|124|2 |3 |
|123|1 |3 |
+---+---------+---------+
You don't need a UDF for this; you can use sort_array:
val DF = Seq(
("123", "|1|2","3|3|4" ),
("124", "|3|2","|3|4" )
).toDF("id", "complete1", "complete2")
.select(
$"id",
split(regexp_replace($"complete1","^\\|",""), "\\|").as("complete1"),
split(regexp_replace($"complete2","^\\|",""), "\\|").as("complete2")
)
// now select minimum
DF
.select(
$"id",
sort_array($"complete1")(0).as("complete1"),
sort_array($"complete2")(0).as("complete2")
).show()
+---+---------+---------+
| id|complete1|complete2|
+---+---------+---------+
|123| 1| 3|
|124| 2| 3|
+---+---------+---------+
Note that I removed the leading | before splitting to avoid empty strings in the array
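One caveat with sorting here: the array elements are still strings, so they are ordered lexicographically rather than numerically. That gives the expected result for single-digit values like the ones above, but not in general. The plain-Scala sketch below, with made-up values, shows the difference; converting to integers before taking the minimum avoids it.

```scala
val asStrings = Seq("10", "2", "3")

// String ordering compares character by character, so "10" sorts before "2".
val lexicographicFirst = asStrings.sorted.head

// Converting to Int first gives the numeric minimum.
val numericMin = asStrings.map(_.toInt).min
```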