Spark (scala) - Iterate over DF column and count number of matches from a set of items - scala

So I can now iterate over a column of strings in a dataframe and check whether any of the strings contain any items in a large dictionary (see here, thanks to @raphael-roth and @tzach-zohar). The basic UDF (not including broadcasting the dict list) for that is:
val checkerUdf = udf { (s: String) => dict.exists(s.contains(_)) }
df.withColumn("word_check", checkerUdf($"words")).show()
The next thing I am trying to do is also COUNT the number of matches that occur from the dict set, in the most efficient way possible (I'm dealing with very large datasets and dict files).
I have been trying to use findAllMatchIn in the udf, using both count and map:
val checkerUdf = udf { (s: String) => dict.count(_.r.findAllMatchIn(s)) }
// OR
val checkerUdf = udf { (s: String) => dict.map(_.r.findAllMatchIn(s)) }
But the map version returns a list of iterators (empty and non-empty), and the count version gives a type mismatch (found Iterator, required Boolean). I am not sure how to count the non-empty iterators (count, size and length don't work).
Any idea what I'm doing wrong? Is there a better / more efficient way to achieve what I am trying to do?
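For what it's worth, the regex variant can be made to compile by counting only the dictionary entries whose match iterator is non-empty; a minimal sketch (same dict and df as above, and note that dict entries are treated as regex patterns here):
import org.apache.spark.sql.functions.udf

// Count the dict entries that produce at least one regex match in s.
// If dict words can contain regex metacharacters, quote them first
// (e.g. with java.util.regex.Pattern.quote).
val checkerUdf = udf { (s: String) => dict.count(_.r.findAllMatchIn(s).nonEmpty) }
df.withColumn("word_check", checkerUdf($"words")).show()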

You can just change the answer from your other question a little bit, as
import org.apache.spark.sql.functions._
val checkerUdf = udf { (s: String) => dict.count(s.contains(_)) }
df.withColumn("word_check", checkerUdf($"words")).show()
Given the dataframe as
+---+---------+
|id |words |
+---+---------+
|1 |foo |
|2 |barriofoo|
|3 |gitten |
|4 |baa |
+---+---------+
and the dict set as
val dict = Set("foo","bar","baaad")
you should get the output as
+---+---------+----------+
| id| words|word_check|
+---+---------+----------+
| 1| foo| 1|
| 2|barriofoo| 2|
| 3| gitten| 0|
| 4| baa| 0|
+---+---------+----------+
I hope the answer is helpful
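Since the question also asks for the most efficient approach with a large dict, here is a sketch that combines the same counting UDF with a broadcast of the dictionary (the countUdf name is just illustrative, and spark is assumed to be your SparkSession):
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.sql.functions.udf

// Ship the dictionary to the executors once as a broadcast variable
// instead of serializing it with every task, then count matches per row.
def countUdf(words: Broadcast[Set[String]]) =
  udf { (s: String) => words.value.count(s.contains(_)) }

val dictBroadcast = spark.sparkContext.broadcast(dict)
df.withColumn("word_check", countUdf(dictBroadcast)($"words")).show()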

Related

substring from lastIndexOf in spark scala

I have a column in my dataframe which contains the filename
test_1_1_1_202012010101101
I want to get the string after the last "_" (i.e. after lastIndexOf("_")).
I tried this, and it works:
val timestamp_df =file_name_df.withColumn("timestamp",split(col("filename"),"_").getItem(4))
But I want to make it more generic, so that if in future if the filename can have any number of _ in it, it can split it on the basis of lastIndexOf _
val timestamp_df =file_name_df.withColumn("timestamp", expr("substring(filename, length(filename)-15,17)"))
This also is not generic as the character length can vary.
Can anyone help me in using the lastIndexOf function with withColumn.
You can use the element_at function with split to get the last element of the array.
Example:
df.withColumn("timestamp",element_at(split(col("filename"),"_"),-1)).show(false)
+--------------------------+---------------+
|filename |timestamp |
+--------------------------+---------------+
|test_1_1_1_202012010101101|202012010101101|
+--------------------------+---------------+
You can use substring_index
scala> val df = Seq(("a-b-c", 1),("d-ef-foi",2)).toDF("c1","c2")
df: org.apache.spark.sql.DataFrame = [c1: string, c2: int]
scala> df.show
+--------+---+
| c1| c2|
+--------+---+
| a-b-c| 1|
|d-ef-foi| 2|
+--------+---+
scala> df.withColumn("c3", substring_index(col("c1"), "-", -1)).show
+--------+---+---+
| c1| c2| c3|
+--------+---+---+
| a-b-c| 1| c|
|d-ef-foi| 2|foi|
+--------+---+---+
Per docs: When the last argument "is negative, everything to the right of the final delimiter (counting from the right) is returned"
val timestamp_df =file_name_df.withColumn("timestamp",reverse(split(reverse(col("filename")),"_").getItem(0)))
This works as well.
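For completeness, since the question asks specifically about lastIndexOf, a minimal UDF sketch could look like this (the afterLastUnderscore name is just illustrative). Built-in functions such as element_at/split or substring_index are usually preferable, since they stay inside Catalyst, but the UDF mirrors the lastIndexOf logic the question asks for:
import org.apache.spark.sql.functions.{col, udf}

// Take everything after the last "_"; if there is no "_", lastIndexOf
// returns -1 and the whole string is kept.
val afterLastUnderscore = udf { (s: String) => s.substring(s.lastIndexOf("_") + 1) }

val timestamp_df = file_name_df.withColumn("timestamp", afterLastUnderscore(col("filename")))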

mock spark column functions in scala

My code is using the monotonically_increasing_id function in Scala:
val df = List(("oleg"), ("maxim")).toDF("first_name")
.withColumn("row_id", monotonically_increasing_id)
I want to mock it in my unit test so that it returns integers 0, 1, 2, 3, ...
In my spark-shell it returns the desired result.
scala> df.show
+----------+------+
|first_name|row_id|
+----------+------+
| oleg| 0|
| maxim| 1|
+----------+------+
But in my scala applications the results are different.
How can I mock column functions?
Mocking such a function so that it produces a sequence is not simple. Indeed, Spark is a parallel computing engine, so accessing the data in sequence is complicated.
Here is a solution you could try.
Let's define a function that zips a dataframe:
def zip(df : DataFrame, name : String) = {
  df.withColumn(name, monotonically_increasing_id)
}
Then let's rewrite the function we want to test using this zip function by default:
def fun(df : DataFrame,
        zipFun : (DataFrame, String) => DataFrame = zip) : DataFrame = {
  zipFun(df, "id_row")
}
// let's see what it does
fun(spark.range(5).toDF).show()
+---+----------+
| id| id_row|
+---+----------+
| 0| 0|
| 1| 1|
| 2|8589934592|
| 3|8589934593|
| 4|8589934594|
+---+----------+
It behaves the same as before. Now let's write a new function that uses zipWithIndex from the RDD API. It's a bit tedious because we have to go back and forth between the two APIs.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField}

def zip2(df : DataFrame, name : String) = {
  val rdd = df.rdd.zipWithIndex
    .map { case (row, i) => Row.fromSeq(row.toSeq :+ i) }
  val newSchema = df.schema.add(StructField(name, LongType, false))
  df.sparkSession.createDataFrame(rdd, newSchema)
}
fun(spark.range(5).toDF, zip2).show()
+---+------+
| id|id_row|
+---+------+
| 0| 0|
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
+---+------+
You can adapt zip2, for instance multiplying i by 2, to get what you want.
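A minimal sketch of such an adaptation (the zip2Doubled name is just illustrative, and it reuses the imports shown for zip2):
// Same as zip2, except the generated index is transformed (here doubled)
// before being appended to each row.
def zip2Doubled(df : DataFrame, name : String) = {
  val rdd = df.rdd.zipWithIndex
    .map { case (row, i) => Row.fromSeq(row.toSeq :+ i * 2) }
  val newSchema = df.schema.add(StructField(name, LongType, false))
  df.sparkSession.createDataFrame(rdd, newSchema)
}

fun(spark.range(5).toDF, zip2Doubled).show()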
Based on the answer from @Oli I came up with the following workaround:
val df = List(("oleg"), ("maxim")).toDF("first_name")
.withColumn("row_id", monotonically_increasing_id)
.withColumn("test_id", row_number().over(Window.orderBy("row_id")))
It solves my problem but I'm still interested in mocking column functions.
I mock my Spark functions with this code:
val s = typedLit[Timestamp](Timestamp.valueOf("2021-05-07 15:00:46.394"))
implicit val ds = DefaultAnswer(CALLS_REAL_METHODS)
withObjectMocked[functions.type] {
  when(functions.current_timestamp()).thenReturn(s)
  // spark logic
}

How to append an element to an array column of a Spark Dataframe?

Suppose I have the following DataFrame:
scala> val df1 = Seq("a", "b").toDF("id").withColumn("nums", array(lit(1)))
df1: org.apache.spark.sql.DataFrame = [id: string, nums: array<int>]
scala> df1.show()
+---+----+
| id|nums|
+---+----+
| a| [1]|
| b| [1]|
+---+----+
And I want to add elements to the array in the nums column, so that I get something like the following:
+---+-------+
| id|nums |
+---+-------+
| a| [1,5] |
| b| [1,5] |
+---+-------+
Is there a way to do this using the .withColumn() method of the DataFrame? E.g.
val df2 = df1.withColumn("nums", append(col("nums"), lit(5)))
I've looked through the API documentation for Spark, but can't find anything that would allow me to do this. I could probably use split and concat_ws to hack something together, but I would prefer a more elegant solution if one is possible. Thanks.
import org.apache.spark.sql.functions.{lit, array, array_union}
val df1 = Seq("a", "b").toDF("id").withColumn("nums", array(lit(1)))
val df2 = df1.withColumn("nums", array_union($"nums", lit(Array(5))))
df2.show
+---+------+
| id| nums|
+---+------+
| a|[1, 5]|
| b|[1, 5]|
+---+------+
array_union() was added in the Spark 2.4.0 release on 11/2/2018, 7 months after you asked the question :) see https://spark.apache.org/news/index.html
You can do it using a udf function as
def addValue = udf((array: Seq[Int]) => array ++ Array(5))
df1.withColumn("nums", addValue(col("nums")))
  .show(false)
and you should get
+---+------+
|id |nums |
+---+------+
|a |[1, 5]|
|b |[1, 5]|
+---+------+
Updated
An alternative is to go the Dataset way and use map, as
df1.map(row => add(row.getAs[String]("id"), row.getAs[Seq[Int]]("nums")++Seq(5)))
.show(false)
where add is a case class
case class add(id: String, nums: Seq[Int])
I hope the answer is helpful
If you are, like me, searching for how to do this in a Spark SQL statement, here's how:
%sql
select array_union(array("value 1"), array("value 2"))
You can use array_union to join up two arrays. To be able to use this, you have to turn your value-to-append into an array. Do this by using the array() function.
You can enter a value like array("a string") or array(yourColumn).
Be careful when using Spark's array_union: it removes duplicates, so you will not get the expected results if you have duplicated entries in your array. It also costs at least O(N), so when I used it with an array aggregate it became an O(N^2) operation and took forever for some large arrays.
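If duplicates matter to you, one alternative (not from the original answers) is concat, which also accepts array columns in Spark 2.4+ and keeps duplicate entries; a minimal sketch using the df1 from above:
import org.apache.spark.sql.functions.{array, concat, lit}

// concat on array columns appends without de-duplicating,
// so appending 1 to [1] yields [1, 1] instead of collapsing to [1].
val df3 = df1.withColumn("nums", concat($"nums", array(lit(1))))
df3.show()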

Spark (scala) dataframes - Check whether strings in column contain any items from a set

I'm pretty new to scala and spark and I've been trying to find a solution for this issue all day - it's doing my head in. I've tried 20 different variations of the following code and keep getting type mismatch errors when I try to perform calculations on a column.
I have a spark dataframe, and I wish to check whether each string in a particular column contains any number of words from a pre-defined List (or Set) of words.
Here is some example data for replication:
// sample data frame
val df = Seq(
(1, "foo"),
(2, "barrio"),
(3, "gitten"),
(4, "baa")).toDF("id", "words")
// dictionary Set of words to check
val dict = Set("foo","bar","baaad")
Now, I am trying to create a third column with the result of a comparison, to see whether the strings in the $"words" column contain any of the words in the dict Set. So the result should be:
+---+-----------+-------------+
| id| words| word_check|
+---+-----------+-------------+
| 1| foo| true|
| 2| barrio| true|
| 3| gitten| false|
| 4| baa| false|
+---+-----------+-------------+
First, I tried to see if I could do it natively without using UDFs, since the dict Set will actually be a large dictionary of > 40K words, and as I understand it this would be more efficient than a UDF:
df.withColumn("word_check", dict.exists(d => $"words".contains(d)))
But i get the error:
type mismatch;
found : org.apache.spark.sql.Column
required: Boolean
I have also tried to create a UDF to do this (using both mutable.Set and mutable.WrappedArray to describe the Set - not sure which is correct, but neither works):
val checker: ((String, scala.collection.mutable.Set[String]) => Boolean) = (col: String, array: scala.collection.mutable.Set[String] ) => array.exists(d => col.contains(d))
val udf1 = udf(checker)
df.withColumn("word_check", udf1($"words", dict )).show()
But get another type mismatch:
found : scala.collection.immutable.Set[String]
required: org.apache.spark.sql.Column
If the set were a fixed number, I should be able to use lit(Int) in the expression? But I don't really understand how performing more complex functions on a column by mixing different data types works in Scala.
Any help greatly appreciated, especially if it can be done efficiently (it is a large df of > 5m rows).
Regardless of efficiency, this seems to work:
df.withColumn("word_check", dict.foldLeft(lit(false))((a, b) => a || locate(b, $"words") > 0)).show
+---+------+----------+
| id| words|word_check|
+---+------+----------+
| 1| foo| true|
| 2|barrio| true|
| 3|gitten| false|
| 4| baa| false|
+---+------+----------+
Here's how you'd do it with a UDF:
val checkerUdf = udf { (s: String) => dict.exists(s.contains(_)) }
df.withColumn("word_check", checkerUdf($"words")).show()
The mistake in your implementation is that you've created a UDF expecting two arguments, which means you'd have to pass two Columns when applying it - but dict isn't a Column in your DataFrame but rather a local variable.
If your dict is large, you should not just reference it in your UDF, because the entire dict is sent over the network for every task. I would broadcast your dict in combination with a UDF:
import org.apache.spark.broadcast.Broadcast
def udf_check(words: Broadcast[scala.collection.immutable.Set[String]]) = {
  udf { (s: String) => words.value.exists(s.contains(_)) }
}
df.withColumn("word_check", udf_check(sparkContext.broadcast(dict))($"words"))
alternatively, you could also use a join:
val dict_df = dict.toList.toDF("word")
df
  .join(broadcast(dict_df), $"words".contains($"word"), "left")
  .withColumn("word_check", $"word".isNotNull)
  .drop($"word")
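One thing to keep in mind with the contains-based join (my note, not part of the original answer): a row whose string matches several dictionary words comes back once per match, so you may want to de-duplicate afterwards. A minimal sketch:
// "barriofoo" would match both "bar" and "foo", so the left join yields it twice;
// distinct() collapses those duplicates once the helper column is dropped.
df
  .join(broadcast(dict_df), $"words".contains($"word"), "left")
  .withColumn("word_check", $"word".isNotNull)
  .drop($"word")
  .distinct()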

groupByKey in Spark dataset

Please help me understand the parameter we pass to groupByKey when it is used on a dataset
scala> val data = spark.read.text("Sample.txt").as[String]
data: org.apache.spark.sql.Dataset[String] = [value: string]
scala> data.flatMap(_.split(" ")).groupByKey(l=>l).count.show
In the above code, please help me understand what (l=>l) means in groupByKey(l=>l).
l => l says that the whole string (in your case every word, since you're tokenizing on spaces) will be used as the key. This way you get all occurrences of each word in the same partition and you can count them.
- As you have probably seen in other articles, it is preferable to use reduceByKey in this case so you don't need to collect all values for each key in memory before counting; a sketch follows below.
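A minimal sketch of that reduceByKey alternative, dropping to the RDD underlying the dataset (same data value as in the question):
// Word count via the RDD API: map each word to (word, 1) and sum per key,
// so counts are combined map-side instead of grouping all values first.
val wordCounts = data
  .flatMap(_.split(" "))
  .rdd
  .map(word => (word, 1))
  .reduceByKey(_ + _)

wordCounts.collect().foreach(println)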
Always a good place to start is the API Docs:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset
def groupByKey[K](func: (T) ⇒ K)(implicit arg0: Encoder[K]): KeyValueGroupedDataset[K, T]
(Scala-specific) Returns a KeyValueGroupedDataset where the data is grouped by the given key func.
You need a function that derives your key from the dataset's data.
In your example, your function takes the whole string as is and uses it as the key. A different example would be, for a Dataset[String], to use the first 3 characters of your string as the key instead of the whole string:
scala> val ds = List("abcdef", "abcd", "cdef", "mnop").toDS
ds: org.apache.spark.sql.Dataset[String] = [value: string]
scala> ds.show
+------+
| value|
+------+
|abcdef|
| abcd|
| cdef|
| mnop|
+------+
scala> ds.groupByKey(l => l.substring(0,3)).keys.show
+-----+
|value|
+-----+
| cde|
| mno|
| abc|
+-----+
The group for key "abc" will have 2 values.
Here is the difference in how the key gets transformed, versus (l => l), so you can see it better:
scala> ds.groupByKey(l => l.substring(0,3)).count.show
+-----+--------+
|value|count(1)|
+-----+--------+
| cde| 1|
| mno| 1|
| abc| 2|
+-----+--------+
scala> ds.groupByKey(l => l).count.show
+------+--------+
| value|count(1)|
+------+--------+
| abcd| 1|
| cdef| 1|
|abcdef| 1|
| mnop| 1|
+------+--------+