Spark (scala) dataframes - Check whether strings in column contain any items from a set - scala

I'm pretty new to scala and spark and I've been trying to find a solution for this issue all day - it's doing my head in. I've tried 20 different variations of the following code and keep getting type mismatch errors when I try to perform calculations on a column.
I have a spark dataframe, and I wish to check whether each string in a particular column contains any number of words from a pre-defined List (or Set) of words.
Here is some example data for replication:
// sample data frame
val df = Seq(
  (1, "foo"),
  (2, "barrio"),
  (3, "gitten"),
  (4, "baa")
).toDF("id", "words")
// dictionary Set of words to check
val dict = Set("foo","bar","baaad")
Now, I am trying to create a third column with the result of a check on whether each string in the $"words" column contains any of the words in the dict Set. So the result should be:
+---+------+----------+
| id| words|word_check|
+---+------+----------+
|  1|   foo|      true|
|  2|barrio|      true|
|  3|gitten|     false|
|  4|   baa|     false|
+---+------+----------+
First, I tried to see if I could do it natively without using UDFs, since the dict Set will actually be a large dictionary of > 40K words, and as I understand it this would be more efficient than a UDF:
df.withColumn("word_check", dict.exists(d => $"words".contains(d)))
But I get the error:
type mismatch;
found : org.apache.spark.sql.Column
required: Boolean
I have also tried to create a UDF to do this (using both mutable.Set and mutable.WrappedArray to describe the Set - not sure which is correct, but neither works):
val checker: ((String, scala.collection.mutable.Set[String]) => Boolean) = (col: String, array: scala.collection.mutable.Set[String] ) => array.exists(d => col.contains(d))
val udf1 = udf(checker)
df.withColumn("word_check", udf1($"words", dict )).show()
But I get another type mismatch:
found : scala.collection.immutable.Set[String]
required: org.apache.spark.sql.Column
If the set were a fixed number, I should be able to use lit(Int) in the expression? But I don't really understand how performing more complex functions on a column works in Scala when mixing different data types.
Any help greatly appreciated, especially if it can be done efficiently (it is a large df of > 5m rows).

Regardless of efficiency, this seems to work:
df.withColumn("word_check", dict.foldLeft(lit(false))((a, b) => a || locate(b, $"words") > 0)).show
+---+------+----------+
| id| words|word_check|
+---+------+----------+
|  1|   foo|      true|
|  2|barrio|      true|
|  3|gitten|     false|
|  4|   baa|     false|
+---+------+----------+
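For reference, the foldLeft above just builds one big Column expression, roughly what you would get by chaining the checks by hand (a sketch for the three-word dict, assuming the usual import org.apache.spark.sql.functions._ and spark.implicits._; with a 40K-word dict this becomes a very large expression tree, which may be slow to plan):
// What the foldLeft expands to for dict = Set("foo", "bar", "baaad"):
val wordCheck = lit(false) ||
  (locate("foo", $"words") > 0) ||
  (locate("bar", $"words") > 0) ||
  (locate("baaad", $"words") > 0)
df.withColumn("word_check", wordCheck).show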

Here's how you'd do it with a UDF:
val checkerUdf = udf { (s: String) => dict.exists(s.contains(_)) }
df.withColumn("word_check", checkerUdf($"words")).show()
The mistake in your implementation is that you've created a UDF expecting two arguments, which means you'd have to pass two Columns when applying it - but dict isn't a Column in your DataFrame but rather a local variable.
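If you did want to keep the two-argument shape, one option (a sketch, assuming Spark 2.2+ where typedLit is available) is to wrap the local collection in a literal Column, so that both arguments really are Columns:
import org.apache.spark.sql.functions.{udf, typedLit}

// Hypothetical variant of the two-argument checker; the Set is passed as an array literal
val checker2 = (col: String, words: Seq[String]) => words.exists(col.contains(_))
val udf2 = udf(checker2)
df.withColumn("word_check", udf2($"words", typedLit(dict.toSeq))).show()
Note that this still embeds the whole dictionary in the query plan as a literal, so for a 40K-word set the broadcast approach below is probably the better fit.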

If your dict is large, you should not just reference it in your UDF, because the entire dict is sent over the network for every task. I would broadcast your dict and combine it with a UDF:
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.sql.functions.udf

def udf_check(words: Broadcast[scala.collection.immutable.Set[String]]) = {
  udf { (s: String) => words.value.exists(s.contains(_)) }
}
df.withColumn("word_check", udf_check(spark.sparkContext.broadcast(dict))($"words"))
Alternatively, you could also use a join:
import org.apache.spark.sql.functions.broadcast

val dict_df = dict.toList.toDF("word")
df
  .join(broadcast(dict_df), $"words".contains($"word"), "left")
  .withColumn("word_check", $"word".isNotNull)
  .drop($"word")

Related

Scala/Spark: Checking for null elements in an array column but IntelliJ suggests not to use null?

I have a column called responseTimes which is of arrayType:
ArrayType(IntegerType,true)
I'm trying to add another column to count the number of null or not-set values in this array:
val contains_null = udf((xs: Seq[Integer]) => xs.contains(null))
df.withColumn("totalNulls", when(contains_null(col("responseTimes")),
lit(1)).otherwise(0))
Although this gives me the right output, IntelliJ keeps telling me to avoid the use of null in my UDF which makes me think this is bad. Is there any other way to do it? Also, is it possible without using UDFs?
The reason is simple: it comes down to the rules for Spark UDFs, and to the fact that Spark handles null in its own distributed way. You may also want to look at the array_contains built-in function in Spark SQL.
If UDFs are needed, follow these rules:
Scala code should deal with null values gracefully and shouldn’t error out if there are null values.
Scala code should return None (or null) for values that are unknown, missing, or irrelevant. DataFrames should also use null for values that are unknown, missing, or irrelevant.
Use Option in Scala code and fall back on null if Option becomes a performance bottleneck.
Please refer to this link if you would like to read more: https://mungingdata.com/apache-spark/dealing-with-null/
You can rewrite your UDF to use Option. In Scala, Option(null) gives None, so you can do:
val contains_null = udf((xs: Seq[Integer]) => xs.exists(e => Option(e).isEmpty))
However, if you are using Spark 2.4+, it is more suitable to use Spark built-in functions for this. To check if an array column contains null elements, use exists as suggested by #mck's answer.
If you want to get the count of nulls in the array, you can combine the filter and size functions:
df.withColumn("totalNulls", size(expr("filter(responseTimes, x -> x is null)")))
A better way is probably to use the higher-order function exists to check isnull for each array element:
// sample dataframe
val df = spark.sql("select array(1,null,2) responseTimes union all select array(3,4)")
df.show
+-------------+
|responseTimes|
+-------------+
|      [1,, 2]|
|       [3, 4]|
+-------------+
// check whether there exists null elements in the array
val df2 = df.withColumn("totalNulls", expr("int(exists(responseTimes, x -> isnull(x)))"))
df2.show
+-------------+----------+
|responseTimes|totalNulls|
+-------------+----------+
|      [1,, 2]|         1|
|       [3, 4]|         0|
+-------------+----------+
You can also use array_max together with transform:
val df2 = df.withColumn("totalNulls", expr("int(array_max(transform(responseTimes, x -> isnull(x))))"))
df2.show
+-------------+----------+
|responseTimes|totalNulls|
+-------------+----------+
|      [1,, 2]|         1|
|       [3, 4]|         0|
+-------------+----------+

I need to skip three rows from the dataframe while loading from a CSV file in scala

I am loading my CSV file into a data frame and I can do that, but I need to skip the first three lines of the file.
I tried the .option() command, giving header as true, but it only ignores the first line.
val df = spark.sqlContext.read
  .schema(Myschema)
  .option("header", true)
  .option("delimiter", "|")
  .csv(path)
I thought of giving header as 3 lines but I couldn't find a way to do that.
alternative thought: skip those 3 lines from the data frame
Please help me with this. Thanks in Advance.
A generic way to handle your problem would be to index the dataframe and filter the indices that are greater than 2.
Straightforward approach:
As suggested in another answer, you may try adding an index with monotonically_increasing_id.
df.withColumn("Index",monotonically_increasing_id)
.filter('Index > 2)
.drop("Index")
Yet, that's only going to work if the first 3 rows are in the first partition. Moreover, as mentioned in the comments, this is the case today but this code may break completely with future versions of Spark, and that would be very hard to debug. Indeed, the contract in the API is just "The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive". It is therefore not very wise to assume that they will always start from zero. There might even be other cases in the current version in which that does not work (I'm not sure though).
To illustrate my first concern, have a look at this:
scala> spark.range(4).withColumn("Index",monotonically_increasing_id()).show()
+---+----------+
| id|     Index|
+---+----------+
|  0|         0|
|  1|         1|
|  2|8589934592|
|  3|8589934593|
+---+----------+
We would only remove two rows...
Safe approach:
The previous approach will work most of the time, but to be safe you can use zipWithIndex from the RDD API to get consecutive indices.
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField}

def zipWithIndex(df: DataFrame, name: String): DataFrame = {
  val rdd = df.rdd.zipWithIndex
    .map { case (row, i) => Row.fromSeq(row.toSeq :+ i) }
  val newSchema = df.schema
    .add(StructField(name, LongType, false))
  df.sparkSession.createDataFrame(rdd, newSchema)
}
zipWithIndex(df, "index").where('index > 2).drop("index")
We can check that it's safer:
scala> zipWithIndex(spark.range(4).toDF("id"), "index").show()
+---+-----+
| id|index|
+---+-----+
|  0|    0|
|  1|    1|
|  2|    2|
|  3|    3|
+---+-----+
You can try this option:
df.withColumn("Index", monotonically_increasing_id())
  .filter(col("Index") > 2)
  .drop("Index")
You may try the following, adapting it to your schema.
import org.apache.spark.sql.Row

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
//Read the CSV as a plain text file
val file = sc.textFile("csvfilelocation")
//Remove the first 3 lines
val data = file.mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(3) else iter }
//Create a RowRDD by splitting each line on the "|" delimiter and mapping it to the required fields
val rowRdd = data.map(_.split("\\|")).map(x => Row(x(0), x(1)))
//Create the dataframe by calling sqlContext.createDataFrame with rowRdd and your schema
val df = sqlContext.createDataFrame(rowRdd, schema)

Spark dataframe replace values of specific columns in a row with Nulls

I am facing a problem when trying to replace the values of specific columns of a Spark dataframe with nulls.
I have a dataframe with more than fifty columns of which two are key columns. I want to create a new dataframe with same schema and the new dataframe should have values from the key columns and null values in non-key columns.
I tried the following ways but facing issues:
//old_df is the existing Dataframe
val key_cols = List("id", "key_number")
val non_key_cols = old_df.columns.toList.filterNot(key_cols.contains(_))
val key_col_df = old_df.select(key_cols.head, key_cols.tail:_*)
val non_key_cols_df = old_df.select(non_key_cols.head, non_key_cols.tail:_*)
val list_cols = List.fill(non_key_cols_df.columns.size)("NULL")
val rdd_list_cols = spark.sparkContext.parallelize(Seq(list_cols)).map(l => Row(l:_*))
val list_df = spark.createDataFrame(rdd_list_cols, non_key_cols_df.schema)
val new_df = key_col_df.crossJoin(list_df)
This approach was good when I only had string type columns in old_df. But I have some columns of double type and int type, which throws an error because the RDD is a list of "NULL" strings.
To avoid this I tried making list_df an empty dataframe with the schema of non_key_cols_df, but the result of the crossJoin is an empty dataframe, which I believe is because one of the dataframes is empty.
My requirement is to have the non_key_cols as a single-row dataframe with nulls so that I can do a crossJoin with key_col_df and form the required new_df.
Also, any other easier way to set all columns except the key columns of a dataframe to nulls would resolve my issue. Thanks in advance.
crossJoin is an expensive operation so you want to avoid it if possible.
An easier solution would be to iterate over all non-key columns and insert null with lit(null). Using foldLeft this can be done as follows:
val keyCols = List("id", "key_number")
val nonKeyCols = df.columns.filterNot(keyCols.contains(_))
val df2 = nonKeyCols.foldLeft(df)((df, c) => df.withColumn(c, lit(null)))
Input example:
+---+----------+---+----+
| id|key_number|  c|   d|
+---+----------+---+----+
|  1|         2|  3| 4.0|
|  5|         6|  7| 8.0|
|  9|        10| 11|12.0|
+---+----------+---+----+
will give:
+---+----------+----+----+
| id|key_number|   c|   d|
+---+----------+----+----+
|  1|         2|null|null|
|  5|         6|null|null|
|  9|        10|null|null|
+---+----------+----+----+
Shaido's answer has a small drawback: the column type will be lost.
This can be fixed by using the schema, like this:
val nonKeyCols = df.schema.fields.filterNot(f => keyCols.contains(f.name))
val df2 = nonKeyCols.foldLeft(df)((df, c) => df.withColumn(c.name, lit(null).cast(c.dataType)))
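If you prefer a single projection over a chain of withColumn calls, the same type-preserving idea can be written as one select (a sketch, assuming the same keyCols as above):
import org.apache.spark.sql.functions.{col, lit}

val df3 = df.select(df.schema.fields.map { f =>
  if (keyCols.contains(f.name)) col(f.name)   // keep key columns as-is
  else lit(null).cast(f.dataType).as(f.name)  // null out the rest, keeping the type
}: _*)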

Spark (scala) - Iterate over DF column and count number of matches from a set of items

So I can now iterate over a column of strings in a dataframe and check whether any of the strings contain any items in a large dictionary (see here, thanks to #raphael-roth and #tzach-zohar). The basic udf (not including broadcasting the dict list) for that is:
val checkerUdf = udf { (s: String) => dict.exists(s.contains(_)) }
df.withColumn("word_check", checkerUdf($"words")).show()
The next thing I am trying to do is also COUNT the number of matches that occur from the dict set, in the most efficient way possible (I'm dealing with very large datasets and dict files).
I have been trying to use findAllMatchIn in the udf, using both count and map:
val checkerUdf = udf { (s: String) => dict.count(_.r.findAllMatchIn(s)) }
// OR
val checkerUdf = udf { (s: String) => dict.map(_.r.findAllMatchIn(s)) }
But with count I get a type mismatch (found Iterator, required Boolean), and with map I get back a collection of iterators (empty and non-empty). I am not sure how to count the non-empty iterators (count, size and length don't work).
Any idea what I'm doing wrong? Is there a better / more efficient way to achieve what I am trying to do?
You can just slightly change the answer from your other question, as follows:
import org.apache.spark.sql.functions._
val checkerUdf = udf { (s: String) => dict.count(s.contains(_)) }
df.withColumn("word_check", checkerUdf($"words")).show()
Given the dataframe as
+---+---------+
|id |words    |
+---+---------+
|1  |foo      |
|2  |barriofoo|
|3  |gitten   |
|4  |baa      |
+---+---------+
and the dict Set as
val dict = Set("foo","bar","baaad")
You should have output as
+---+---------+----------+
| id|    words|word_check|
+---+---------+----------+
|  1|      foo|         1|
|  2|barriofoo|         2|
|  3|   gitten|         0|
|  4|      baa|         0|
+---+---------+----------+
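And since the question mentions broadcasting the dict, the counting UDF can be combined with a broadcast variable in the same way as in the earlier answer (a sketch, assuming a SparkSession named spark):
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.sql.functions.udf

// Count matches against a broadcast copy of the dictionary
def count_check(words: Broadcast[Set[String]]) =
  udf { (s: String) => words.value.count(s.contains(_)) }

df.withColumn("word_check", count_check(spark.sparkContext.broadcast(dict))($"words")).show()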
I hope the answer is helpful

How to fetch the value and type of each column of each row in a dataframe?

How can I convert a dataframe to a tuple that includes the datatype for each column?
I have a number of dataframes with varying sizes and types. I need to be able to determine the type and value of each column and row of a given dataframe so I can perform some actions that are type-dependent.
So for example say I have a dataframe that looks like:
+-------+-------+
| foo   | bar   |
+-------+-------+
| 12345 | fnord |
| 42    | baz   |
+-------+-------+
I need to get
Seq(
(("12345", "Integer"), ("fnord", "String")),
(("42", "Integer"), ("baz", "String"))
)
or something similarly simple to iterate over and work with programmatically.
Thanks in advance and sorry for what is, I'm sure, a very noobish question.
If I understand your question correctly, then the following should be your solution.
val df = Seq(
  (12345, "fnord"),
  (42, "baz")
).toDF("foo", "bar")
This creates the dataframe which you already have.
+-----+-----+
|  foo|  bar|
+-----+-----+
|12345|fnord|
|   42|  baz|
+-----+-----+
The next step is to extract the dataType of each field from the schema of the dataFrame into a list.
val fieldTypesList = df.schema.map(struct => struct.dataType)
The next step is to convert the dataframe rows into lists of values via the RDD API and pair each value with its dataType from the list created above.
val dfList = df.rdd.map(row => row.toString().replace("[","").replace("]","").split(",").toList)
val tuples = dfList.map(list => list.zip(fieldTypesList))
Now if we print it (collecting to the driver first)
tuples.collect.foreach(println)
It would give
List((12345,IntegerType), (fnord,StringType))
List((42,IntegerType), (baz,StringType))
Which you can iterate over and work with programmatically
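As a side note, a sketch of an alternative that avoids parsing Row.toString (and so also copes with values that contain commas) is to zip each row's values with the type names from df.dtypes:
// df.dtypes gives (columnName, typeName) pairs, e.g. ("foo", "IntegerType")
val typeNames = df.dtypes.map(_._2)
val typedRows = df.rdd.map(row => row.toSeq.map(v => s"$v").zip(typeNames).toList)
typedRows.collect.foreach(println)
Here the types come back as strings rather than DataType objects, which may or may not be what you need.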