Iterate boolean comparison over two DataFrames? - scala

I have created two DF sets, one with a generic number list and another with a specific number list. I want to iterate over the first list and compare it to the second list; if GenericList[X] is equal to any number in SpecificNumber list, I want a return of True and if not, False.
I have tried to utilize a if loop, something similar to for ( num <- List ) print (list) if .....
scala> val genericList = List(5,6,7,8,9,10)
scala> val df = genericList.toDF
scala> val specificList = List(5,-3,8)

Try with .exists and .contains functions to check the number.
scala> val genericList = List(5,6,7,8,9,10)
scala> val specificList = List(5,-3,8)
scala> genericList.exists(specificList.contains)
res1: Boolean = true
In Dataframe API:
scala> val genericList = List(5,6,7,8,9,10)
scala> val df = genericList.toDF
scala> val specificList = List(5,-3,8)
scala> df.withColumn("check",'value.isin(specificList:_*)).show()
+-----+-----+
|value|check|
+-----+-----+
| 5| true|
| 6|false|
| 7|false|
| 8| true|
| 9|false|
| 10|false|
+-----+-----+

Related

Pass arguments to a udf from columns present in a list of strings

I have a list of strings which represent column names inside a dataframe.
I want to pass the arguments from these columns to a udf. How can I do it in spark scala ?
val actualDF = Seq(
("beatles", "help|hey jude","sad",4),
("romeo", "eres mia","old school",56)
).toDF("name", "hit_songs","genre","xyz")
val column_list: List[String] = List("hit_songs","name","genre")
// example udf
val testudf = org.apache.spark.sql.functions.udf((s1: String, s2: String) => {
// lets say I want to concat all values
})
val finalDF = actualDF.withColumn("test_res",testudf(col(column_list(0))))
From the above example, I want to pass my list column_list to a udf. I am not sure how can I pass a complete list of string representing column names. Though in case of 1 element I saw I can do it with col(column_list(0))). Please support.
Replace
testudf(col(column_list(0)))
with
testudf(column_list: _*)
This will interpret the list as multiple individual input arguments.
hit_songs is of type Seq[String], You need to change first parameter of your udf to Seq[String].
scala> singersDF.show(false)
+-------+-------------+----------+
|name |hit_songs |genre |
+-------+-------------+----------+
|beatles|help|hey jude|sad |
|romeo |eres mia |old school|
+-------+-------------+----------+
scala> actualDF.show(false)
+-------+----------------+----------+
|name |hit_songs |genre |
+-------+----------------+----------+
|beatles|[help, hey jude]|sad |
|romeo |[eres mia] |old school|
+-------+----------------+----------+
scala> column_list
res27: List[String] = List(hit_songs, name)
Change your UDF like below.
// s1 is of type Seq[String]
val testudf = udf((s1:Seq[String],s2:String) => {
s1.mkString.concat(s2)
})
Applying UDF
scala> actualDF
.withColumn("test_res",testudf(col(column_list.head),col(column_list.last)))
.show(false)
+-------+----------------+----------+-------------------+
|name |hit_songs |genre |test_res |
+-------+----------------+----------+-------------------+
|beatles|[help, hey jude]|sad |helphey judebeatles|
|romeo |[eres mia] |old school|eres miaromeo |
+-------+----------------+----------+-------------------+
Without UDF
scala> actualDF.withColumn("test_res",concat_ws("",$"name",$"hit_songs")).show(false) // Without UDF.
+-------+----------------+----------+-------------------+
|name |hit_songs |genre |test_res |
+-------+----------------+----------+-------------------+
|beatles|[help, hey jude]|sad |beatleshelphey jude|
|romeo |[eres mia] |old school|romeoeres mia |
+-------+----------------+----------+-------------------+

Use rlike with regex column in spark 1.5.1

I want to filter dataframe based on applying regex values in one of the columns to another column.
Example:
Id Column1 RegexColumm
1 Abc A.*
2 Def B.*
3 Ghi G.*
The result of filtering dataframe using RegexColumm should give rows with id 1 and 3.
Is there a way to do this in spark 1.5.1? Don't want to use UDF as this might cause scalability issues, looking for spark native api.
You can convert df -> rdd then by traversing through row we can match the regex and filter out only the matching data without using any UDF.
Example:
import org.apache.spark.sql.types._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
df.show()
//+---+-------+--------+
//| id|column1|regexCol|
//+---+-------+--------+
//| 1| Abc| A.*|
//| 2| Def| B.*|
//| 3| Ghi| G.*|
//+---+-------+--------+
//creating new schema to add new boolean field
val sch = StructType(df.schema.fields ++ Array(StructField("bool_col", BooleanType, false)))
//convert df to rdd and match the regex using .map
val rdd = df.rdd.map(row => {
val regex = row.getAs[String]("regexCol")
val bool = row.getAs[String]("column1").matches(regex)
val bool_col = s"$bool".toBoolean
val newRow = Row.fromSeq(row.toSeq ++ Array(bool_col))
newRow
})
//convert rdd to dataframe filter out true values for bool_col
val final_df = sqlContext.createDataFrame(rdd, sch).where(col("bool_col")).drop("bool_col")
final_df.show(10)
//+---+-------+--------+
//| id|column1|regexCol|
//+---+-------+--------+
//| 1| Abc| A.*|
//| 3| Ghi| G.*|
//+---+-------+--------+
UPDATE:
Instead of .map we can use .mapPartition (map vs mapPartiiton):
val rdd = df.rdd.mapPartitions(
partitions => {
partitions.map(row => {
val regex = row.getAs[String]("regexCol")
val bool = row.getAs[String]("column1").matches(regex)
val bool_col = s"$bool".toBoolean
val newRow = Row.fromSeq(row.toSeq ++ Array(bool_col))
newRow
})
})
scala> val df = Seq((1,"Abc","A.*"),(2,"Def","B.*"),(3,"Ghi","G.*")).toDF("id","Column1","RegexColumm")
df: org.apache.spark.sql.DataFrame = [id: int, Column1: string ... 1 more field]
scala> val requiredDF = df.filter(x=> x.getAs[String]("Column1").matches(x.getAs[String]("RegexColumm")))
requiredDF: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, Column1: string ... 1 more field]
scala> requiredDF.show
+---+-------+-----------+
| id|Column1|RegexColumm|
+---+-------+-----------+
| 1| Abc| A.*|
| 3| Ghi| G.*|
+---+-------+-----------+
You can use like above, I think this is what you are lioking for. Please do let me know if it helps you.

Spark creating a new column based on a mapped value of an existing column

I am trying to map the values of one column in my dataframe to a new value and put it into a new column using a UDF, but I am unable to get the UDF to accept a parameter that isn't also a column. For example I have a dataframe dfOriginial like this:
+-----------+-----+
|high_scores|count|
+-----------+-----+
| 9| 1|
| 21| 2|
| 23| 3|
| 7| 6|
+-----------+-----+
And I'm trying to get a sense of the bin the numeric value falls into, so I may construct a list of bins like this:
case class Bin(binMax:BigDecimal, binWidth:BigDecimal) {
val binMin = binMax - binWidth
// only one of the two evaluations can include an "or=", otherwise a value could fit in 2 bins
def fitsInBin(value: BigDecimal): Boolean = value > binMin && value <= binMax
def rangeAsString(): String = {
val sb = new StringBuilder()
sb.append(trimDecimal(binMin)).append(" - ").append(trimDecimal(binMax))
sb.toString()
}
}
And then I want to transform my old dataframe like this to make dfBin:
+-----------+-----+---------+
|high_scores|count|bin_range|
+-----------+-----+---------+
| 9| 1| 0 - 10 |
| 21| 2| 20 - 30 |
| 23| 3| 20 - 30 |
| 7| 6| 0 - 10 |
+-----------+-----+---------+
So that I can ultimately get a count of the instances of the bins by calling .groupBy("bin_range").count().
I am trying to generate dfBin by using the withColumn function with an UDF.
Here's the code with the UDF I am attempting to use:
val convertValueToBinRangeUDF = udf((value:String, binList:List[Bin]) => {
val number = BigDecimal(value)
val bin = binList.find( bin => bin.fitsInBin(number)).getOrElse(Bin(BigDecimal(0), BigDecimal(0)))
bin.rangeAsString()
})
val binList = List(Bin(10, 10), Bin(20, 10), Bin(30, 10), Bin(40, 10), Bin(50, 10))
val dfBin = dfOriginal.withColumn("bin_range", convertValueToBinRangeUDF(col("high_scores"), binList))
But it's giving me a type mismatch:
Error:type mismatch;
found : List[Bin]
required: org.apache.spark.sql.Column
val valueCountsWithBin = valuesCounts.withColumn(binRangeCol, convertValueToBinRangeUDF(col(columnName), binList))
Seeing the definition of an UDF makes me think it should handle the conversion fine, but it's clearly not, any ideas?
The problem is that parameters to an UDF should all be of column type. One solution would be to convert binList into a column and pass it to the UDF similar to the current code.
However, it is simpler to adjust the UDF slightly and turn it into a def. In this way you can easily pass other non-column type data:
def convertValueToBinRangeUDF(binList: List[Bin]) = udf((value:String) => {
val number = BigDecimal(value)
val bin = binList.find( bin => bin.fitsInBin(number)).getOrElse(Bin(BigDecimal(0), BigDecimal(0)))
bin.rangeAsString()
})
Usage:
val dfBin = valuesCounts.withColumn("bin_range", convertValueToBinRangeUDF(binList)($"columnName"))
Try this -
scala> case class Bin(binMax:BigDecimal, binWidth:BigDecimal) {
| val binMin = binMax - binWidth
|
| // only one of the two evaluations can include an "or=", otherwise a value could fit in 2 bins
| def fitsInBin(value: BigDecimal): Boolean = value > binMin && value <= binMax
|
| def rangeAsString(): String = {
| val sb = new StringBuilder()
| sb.append(binMin).append(" - ").append(binMax)
| sb.toString()
| }
| }
defined class Bin
scala> val binList = List(Bin(10, 10), Bin(20, 10), Bin(30, 10), Bin(40, 10), Bin(50, 10))
binList: List[Bin] = List(Bin(10,10), Bin(20,10), Bin(30,10), Bin(40,10), Bin(50,10))
scala> spark.udf.register("convertValueToBinRangeUDF", (value: String) => {
| val number = BigDecimal(value)
| val bin = binList.find( bin => bin.fitsInBin(number)).getOrElse(Bin(BigDecimal(0), BigDecimal(0)))
| bin.rangeAsString()
| })
res13: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))
//-- Testing with one record
scala> val dfOriginal = spark.sql(s""" select "9" as `high_scores`, "1" as count """)
dfOriginal: org.apache.spark.sql.DataFrame = [high_scores: string, count: string]
scala> dfOriginal.createOrReplaceTempView("dfOriginal")
scala> val dfBin = spark.sql(s""" select high_scores, count, convertValueToBinRangeUDF(high_scores) as bin_range from dfOriginal """)
dfBin: org.apache.spark.sql.DataFrame = [high_scores: string, count: string ... 1 more field]
scala> dfBin.show(false)
+-----------+-----+---------+
|high_scores|count|bin_range|
+-----------+-----+---------+
|9 |1 |0 - 10 |
+-----------+-----+---------+
Hope this will help.

How to append collection as new column to DataFrame with many columns?

I'd like to append (add) a new column to an existing dataframe with multiple columns.
val a = Seq(
("10", "MILLER", "1300", "2017-11-03"),
("30", "Martin", "1250", "2017-11-21")).toDF("dept_no","emp_name","sal","date")
scala> a.show
+-------+--------+----+----------+
|dept_no|emp_name| sal| date|
+-------+--------+----+----------+
| 10| MILLER|1300|2017-11-03|
| 30| Martin|1250|2017-11-21|
+-------+--------+----+----------+
With the above dataframe I'd like to add every element of a collection (be it a regular Scala collection or another single-column DataFrame), e.g.
val lst = List("10", "Susan")
How to add the elements of lst above to the rows of a dataframe (one element per row)?
Let's convert lst to a DataFrame:
val lst = List("10", "Susan").toDF
You can use zip method of RDD:
import org.apache.spark.sql.Row
val data = a.rdd.zip(lst.rdd).map { case (l, r) => Row.fromSeq(l.toSeq ++ r.toSeq) }
import org.apache.spark.sql.types.StructType
val schema = StructType(a.schema.fields ++ lst.schema.fields)
val solution = spark.createDataFrame(data, schema)
scala> solution.show
+-------+--------+----+----------+-----+
|dept_no|emp_name| sal| date|value|
+-------+--------+----+----------+-----+
| 10| MILLER|1300|2017-11-03| 10|
| 30| Martin|1250|2017-11-21|Susan|
+-------+--------+----+----------+-----+

Dataframe replace each row null values with unique epoch time

I have 3 rows in dataframes and in 2 rows, the column id has got null values. I need to loop through the each row on that specific column id and replace with epoch time which should be unique and should happen in dataframe itself. How can it be done?
For eg:
id | name
1 a
null b
null c
I wanted this dataframe which converts null to epoch time.
id | name
1 a
1435232 b
1542344 c
df
.select(
when($"id").isNull, /*epoch time*/).otherwise($"id").alias("id"),
$"name"
)
EDIT
You need to make sure the UDF precise enough - if it is only has millisecond resolution you will see duplicate values. See my example below that clearly illustrates my approach works:
scala> def rand(s: String): Double = Math.random
rand: (s: String)Double
scala> val udfF = udf(rand(_: String))
udfF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,DoubleType,Some(List(StringType)))
scala> res11.select(when($"id".isNull, udfF($"id")).otherwise($"id").alias("id"), $"name").collect
res21: Array[org.apache.spark.sql.Row] = Array([0.6668195187088702,a], [0.920625293516218,b])
Check this out
scala> val s1:Seq[(Option[Int],String)] = Seq( (Some(1),"a"), (null,"b"), (null,"c"))
s1: Seq[(Option[Int], String)] = List((Some(1),a), (null,b), (null,c))
scala> val df = s1.toDF("id","name")
df: org.apache.spark.sql.DataFrame = [id: int, name: string]
scala> val epoch = java.time.Instant.now.getEpochSecond
epoch: Long = 1539084285
scala> df.withColumn("id",when( $"id".isNull,epoch).otherwise($"id")).show
+----------+----+
| id|name|
+----------+----+
| 1| a|
|1539084285| b|
|1539084285| c|
+----------+----+
scala>
EDIT1:
I used milliseconds, then also I get same values. Spark doesn't capture nano seconds in time portion. It is possible that many rows could get the same milliseconds. So your assumption of getting unique values based on epoch would not work.
scala> def getEpoch(x:String):Long = java.time.Instant.now.toEpochMilli
getEpoch: (x: String)Long
scala> val myudfepoch = udf( getEpoch(_:String):Long )
myudfepoch: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,LongType,Some(List(StringType)))
scala> df.withColumn("id",when( $"id".isNull,myudfepoch('name)).otherwise($"id")).show
+-------------+----+
| id|name|
+-------------+----+
| 1| a|
|1539087300957| b|
|1539087300957| c|
+-------------+----+
scala>
The only possibility is to use the monotonicallyIncreasingId, but that values may not be of same length all the time.
scala> df.withColumn("id",when( $"id".isNull,myudfepoch('name)+monotonicallyIncreasingId).otherwise($"id")).show
warning: there was one deprecation warning; re-run with -deprecation for details
+-------------+----+
| id|name|
+-------------+----+
| 1| a|
|1539090186541| b|
|1539090186543| c|
+-------------+----+
scala>
EDIT2:
I'm able to trick the System.nanoTime and get the increasing ids, but they will not be sequential, but the length can be maintained. See below
scala> def getEpoch(x:String):String = System.nanoTime.toString.take(12)
getEpoch: (x: String)String
scala> val myudfepoch = udf( getEpoch(_:String):String )
myudfepoch: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))
scala> df.withColumn("id",when( $"id".isNull,myudfepoch('name)).otherwise($"id")).show
+------------+----+
| id|name|
+------------+----+
| 1| a|
|186127230392| b|
|186127230399| c|
+------------+----+
scala>
Try this out when running in clusters and adjust the take(12), if you get duplicate values.