scala dataframe filter array of strings - scala

Spark 1.6.2 and Scala 2.10 here.
I want to filter the spark dataframe column with an array of strings.
val df1 = sc.parallelize(Seq((1, "L-00417"), (3, "L-00645"), (4, "L-99999"),(5, "L-00623"))).toDF("c1","c2")
+---+-------+
| c1| c2|
+---+-------+
| 1|L-00417|
| 3|L-00645|
| 4|L-99999|
| 5|L-00623|
+---+-------+
val df2 = sc.parallelize(Seq((1, "L-1"), (3, "L-2"), (4, "L-3"),(5, "L-00623"))).toDF("c3","c4")
+---+-------+
| c3| c4|
+---+-------+
| 1| L-1|
| 3| L-2|
| 4| L-3|
| 5|L-00623|
+---+-------+
val c2List = df1.select("c2").as[String].collect()
df2.filter(not($"c4").contains(c2List)).show()
I am getting below error.
Unsupported literal type class [Ljava.lang.String; [Ljava.lang.String;@5ce1739c
Can anyone please help to fix this?

First, contains isn't suitable here because you're testing the opposite relationship: you want to check whether c2List contains c4's value, not the other way around.
You can use isin for that. It takes a "repeated argument" (similar to Java's varargs) of the values to match against, so you need to "expand" c2List into a repeated argument, which can be done with the : _* operator:
df2.filter(not($"c4".isin(c2List: _*)))
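For reference, against the data above this should keep only the rows whose c4 value never appears in df1's c2:
// not comes from org.apache.spark.sql.functions
df2.filter(not($"c4".isin(c2List: _*))).show()
// expected: L-1, L-2 and L-3 are not in c2List, while L-00623 is
// +---+----+
// | c3|  c4|
// +---+----+
// |  1| L-1|
// |  3| L-2|
// |  4| L-3|
// +---+----+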
Alternatively, with Spark 1.6 you can use a "left anti join" to join the two dataframes and get only the values in df2 that did NOT match values in df1:
df2.join(df1, $"c2" === $"c4", "leftanti")
Unlike the previous option, this one is not limited to the case where df1 is small enough to be collected to the driver.
Lastly, if you're using an earlier Spark version, you can imitate leftanti using a left join and a filter:
df2.join(df1, $"c2" === $"c4", "left").filter($"c2".isNull).select("c3", "c4")

Related

How to efficiently select dataframe columns containing a certain value in Spark?

Suppose you have a dataframe in Spark (string type) and you want to drop any column that contains "foo". In the example dataframe below, you would drop columns "c2" and "c3" but keep "c1". However, I'd like the solution to generalize to large numbers of columns and rows.
+-------------------+
| c1| c2| c3|
+-------------------+
| this| foo| hello|
| that| bar| world|
|other| baz| foobar|
+-------------------+
My solution is to scan every column in the dataframe and then aggregate the results using the DataFrame API and built-in functions.
So, scanning each column could be done like this (I'm new to Scala, please excuse syntax mistakes):
df = df.select(df.columns.map(c => col(c).like("foo")): _*)
Logically, I would have an intermediate dataframe like this:
+--------------------+
| c1| c2| c3|
+--------------------+
| false| true| false|
| false| false| false|
| false| false| true|
+--------------------+
This would then be aggregated into a single row, from which I can read off which columns need to be dropped.
val exprs = df.columns.map(c => max(c).alias(c))
val drop = df.agg(exprs.head, exprs.tail: _*)
+--------------------+
| c1| c2| c3|
+--------------------+
| false| true| true|
+--------------------+
Now any column containing true can be dropped.
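For what it's worth, a minimal sketch of that last step, assuming the aggregated single-row dataframe is called drop, its columns keep the original names, and df still refers to the original data (flags, colsToDrop and cleaned are just illustrative names):
// pull the single aggregated row back to the driver
val flags = drop.collect()(0)
// names of the columns flagged true, i.e. the ones containing "foo"
val colsToDrop = drop.columns.filter(c => flags.getAs[Boolean](c))
// drop them from the original dataframe
val cleaned = df.drop(colsToDrop: _*)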
My question is: Is there a better way to do this, performance-wise? In this case, does Spark stop scanning a column once it finds "foo"? Does it matter how the data is stored (would Parquet help)?
Thanks, I'm new here, so please tell me how the question can be improved.
Depending on your data (for example, if you have a lot of "foo" values), the code below may perform more efficiently:
val colsToDrop = df.columns.filter { c =>
  !df.where(col(c).like("foo")).limit(1).isEmpty
}
df.drop(colsToDrop: _*)
UPDATE: Removed redundant .limit(1):
val colsToDrop = df.columns.filter { c =>
  !df.where(col(c).like("foo")).isEmpty
}
df.drop(colsToDrop: _*)
Here is an answer that follows your logic (worked out correctly), but I think the other answer is better, if only for posterity and for improving your Scala. I am not sure the other answer is actually more performant, but neither is this one. Whether Parquet would help is difficult to gauge.
The other option is to write a loop on the driver and access every column; then Parquet would be of use due to its columnar storage, statistics and predicate push-down.
import org.apache.spark.sql.functions._
def myUDF = udf((cols: Seq[String], cmp: String) => cols.map(_ == cmp))
val df = sc.parallelize(Seq(
  ("foo", "abc", "sss"),
  ("bar", "fff", "sss"),
  ("foo", "foo", "ddd"),
  ("bar", "ddd", "ddd")
)).toDF("a", "b", "c")
val res = df.select($"*", array(df.columns.map(col): _*).as("colN"))
  .withColumn("colres", myUDF(col("colN"), lit("foo")))
res.show()
res.printSchema()
val n = 3
val res2 = res.select( (0 until n).map(i => col("colres")(i).alias(s"c${i+1}")): _*)
res2.show(false)
val exprs = res2.columns.map( c => max(c).alias(c))
val drop = res2.agg(exprs.head, exprs.tail: _*)
drop.show(false)

Maximum of some specific columns in a spark scala dataframe

I have a dataframe like this.
+---+---+---+---+
| M| c2| c3| d1|
+---+---+---+---+
| 1|2_1|4_3|1_2|
| 2|3_4|4_5|1_2|
+---+---+---+---+
I have to transform this df so that it looks like the one below. Here, c_Max = max(c2, c3) after splitting on _, i.e. each of the c columns (c2 and c3) has to be split on _ and then the max taken.
In the actual scenario I have 50 such columns, i.e. c2, c3, ..., c50, and need to take the max across all of them.
+---+---+---+---+------+
| M| c2| c3| d1|c_Max |
+---+---+---+---+------+
| 1|2_1|4_3|1_2| 4 |
| 2|3_4|4_5|1_2| 5 |
+---+---+---+---+------+
Here is one way using expr and built-in array functions, for Spark >= 2.4.0:
import org.apache.spark.sql.functions.{expr, array_max, array}
val df = Seq(
  (1, "2_1", "3_4", "1_2"),
  (2, "3_4", "4_5", "1_2")
).toDF("M", "c2", "c3", "d1")
// get max c for each c column
val c_cols = df.columns.filter(_.startsWith("c")).map { c =>
  expr(s"array_max(cast(split(${c}, '_') as array<int>))")
}
df.withColumn("max_c", array_max(array(c_cols:_*))).show
Output:
+---+---+---+---+-----+
| M| c2| c3| d1|max_c|
+---+---+---+---+-----+
| 1|2_1|3_4|1_2| 4|
| 2|3_4|4_5|1_2| 5|
+---+---+---+---+-----+
For older versions, use the following code:
val c_cols = df.columns.filter(_.startsWith("c")).map { c =>
  val c_ar = split(col(c), "_").cast("array<int>")
  when(c_ar.getItem(0) > c_ar.getItem(1), c_ar.getItem(0)).otherwise(c_ar.getItem(1))
}
df.withColumn("max_c", greatest(c_cols:_*)).show
Use the greatest function:
val df = Seq(
  (1, "2_1", "3_4", "1_2"),
  (2, "3_4", "4_5", "1_2")
).toDF("M", "c2", "c3", "d1")
// get all `c` columns and split each by `_` to get the values on both sides of the underscore
val c_cols = df.columns.filter(_.startsWith("c"))
  .flatMap { c =>
    Seq(
      split(col(c), "_").getItem(0).cast("int"),
      split(col(c), "_").getItem(1).cast("int")
    )
  }
// apply greatest func
val c_max = greatest(c_cols: _*)
// add new column
df.withColumn("c_Max", c_max).show()
Gives:
+---+---+---+---+-----+
| M| c2| c3| d1|c_Max|
+---+---+---+---+-----+
| 1|2_1|3_4|1_2| 4|
| 2|3_4|4_5|1_2| 5|
+---+---+---+---+-----+
In Spark >= 2.4.0, you can use the array_max function and write code that works even with columns containing more than 2 values. The idea is to start by concatenating all the columns (the concat column). For that, I use concat_ws on an array of all the columns I want to concatenate, which I obtain with array(cols.map(col): _*). Then I split the resulting string to get one big array of strings containing all the values of all the columns. I cast it to an array of ints and call array_max on it.
import org.apache.spark.sql.functions.{array, array_max, col, concat_ws, split}
import org.apache.spark.sql.types.{ArrayType, IntegerType}

val cols = (2 to 50).map("c" + _)
val result = df
  .withColumn("concat", concat_ws("_", array(cols.map(col): _*)))
  .withColumn("array_of_ints", split('concat, "_").cast(ArrayType(IntegerType)))
  .withColumn("c_max", array_max('array_of_ints))
  .drop("concat", "array_of_ints")
In Spark < 2.4, you can define array_max yourself like this:
val array_max = udf((s: Seq[Int]) => s.max)
The previous code does not need to be modified. Note, however, that UDFs can be slower than built-in Spark SQL functions.

Join 2 DataFrame based on lookup within a Column of collections - Spark,Scala

I have 2 dataframes as below,
val x = Seq((Seq(4,5),"XXX"),(Seq(7),"XYX")).toDF("X","NAME")
val y = Seq((5)).toDF("Y")
I want to join the two dataframes by looking up each value from y in the Seq/Array in x.select("X"); if it exists, join the complete row of x with y.
How can I achieve this in Spark?
Cheers!
With Spark 2.4.3 you could use the array_contains function in a SQL expression:
scala> val x = Seq((Seq(4,5),"XXX"),(Seq(7),"XYX")).toDF("X","NAME")
scala> val y = Seq((5)).toDF("Y")
scala> x.join(y,expr("array_contains(X, y)"),"left").show
+------+----+----+
| X|NAME| Y|
+------+----+----+
|[4, 5]| XXX| 5|
| [7]| XYX|null|
+------+----+----+
please confirm that's what you want to achieve?
You can use a UDF for the join, which works for all Spark versions:
val array_contains = udf((arr: Seq[Int], element: Int) => arr.contains(element))
x
.join(y, array_contains($"X",$"Y"),"left")
.show()
Another approach is to explode your array into rows with a new temporary column. If you run the following code:
x.withColumn("temp", explode('X)).show()
it would show:
+------+----+----+
| X|NAME|temp|
+------+----+----+
|[4, 5]| XXX| 4|
|[4, 5]| XXX| 5|
| [7]| XYX| 7|
+------+----+----+
As you can see, you can now just do a join using the temp and Y columns (and then drop temp):
x.withColumn("temp", explode('X))
.join(y, 'temp === 'Y)
.drop('temp)
This may produce duplicate rows if X contains duplicate values. In that case, you'd have to additionally call distinct:
x.withColumn("temp", explode('X))
.distinct()
.join(y, 'temp === 'Y, "left")
.drop('temp)
Since this approach uses Spark-native methods, it will be a little faster than the one using a UDF, but it is arguably less elegant.

How to append an element to an array column of a Spark Dataframe?

Suppose I have the following DataFrame:
scala> val df1 = Seq("a", "b").toDF("id").withColumn("nums", array(lit(1)))
df1: org.apache.spark.sql.DataFrame = [id: string, nums: array<int>]
scala> df1.show()
+---+----+
| id|nums|
+---+----+
| a| [1]|
| b| [1]|
+---+----+
And I want to add elements to the array in the nums column, so that I get something like the following:
+---+-------+
| id|nums |
+---+-------+
| a| [1,5] |
| b| [1,5] |
+---+-------+
Is there a way to do this using the .withColumn() method of the DataFrame? E.g.
val df2 = df1.withColumn("nums", append(col("nums"), lit(5)))
I've looked through the API documentation for Spark, but can't find anything that would allow me to do this. I could probably use split and concat_ws to hack something together, but I would prefer a more elegant solution if one is possible. Thanks.
import org.apache.spark.sql.functions.{lit, array, array_union}
val df1 = Seq("a", "b").toDF("id").withColumn("nums", array(lit(1)))
val df2 = df1.withColumn("nums", array_union($"nums", lit(Array(5))))
df2.show
+---+------+
| id| nums|
+---+------+
| a|[1, 5]|
| b|[1, 5]|
+---+------+
array_union() was added in the Spark 2.4.0 release on 11/2/2018, 7 months after you asked the question :) See https://spark.apache.org/news/index.html
You can do it using a udf function as follows:
def addValue = udf((array: Seq[Int])=> array ++ Array(5))
df1.withColumn("nums", addValue(col("nums")))
.show(false)
and you should get
+---+------+
|id |nums |
+---+------+
|a |[1, 5]|
|b |[1, 5]|
+---+------+
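If the value to append should be a parameter rather than a hard-coded 5, one possible variation of the same idea (addValueN is just an illustrative name):
// UDF that appends a given value to the array column
val addValueN = udf((nums: Seq[Int], n: Int) => nums :+ n)
df1.withColumn("nums", addValueN(col("nums"), lit(5))).show(false)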
Updated
An alternative is to go the Dataset way and use map as follows:
df1.map(row => add(row.getAs[String]("id"), row.getAs[Seq[Int]]("nums")++Seq(5)))
.show(false)
where add is a case class
case class add(id: String, nums: Seq[Int])
I hope the answer is helpful
If you are, like me, searching for how to do this in a Spark SQL statement, here's how:
%sql
select array_union(array("value 1"), array("value 2"))
You can use array_union to join up two arrays. To be able to use this, you have to turn your value-to-append into an array. Do this by using the array() function.
You can enter a value like array("a string") or array(yourColumn).
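Applied to the appending use case from the question, a small sketch via a temp view (assuming a SparkSession named spark and the df1 defined earlier; df1_view is just an illustrative name):
// register the dataframe and append a literal 5 to the nums array in SQL
df1.createOrReplaceTempView("df1_view")
spark.sql("select id, array_union(nums, array(5)) as nums from df1_view").show()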
Be careful when using Spark's array_union: it removes duplicates, so you will not get the expected result if you have duplicate entries in your array. It also costs at least O(N), so when I used it inside an array aggregate it became an O(N^2) operation and took forever for some large arrays.
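A quick way to see that deduplication in any Spark 2.4+ session:
// array_union drops duplicates: this shows [1, 2, 3], not [1, 1, 2, 2, 3]
spark.sql("select array_union(array(1, 1, 2), array(2, 3))").show()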

Spark SQL Dataframe API - build filter condition dynamically

I have two Spark dataframes, df1 and df2:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
|shankar|12121| 28|
| ramesh| 1212| 29|
| suresh| 1111| 30|
| aarush| 0707| 15|
+-------+-----+---+
+------+-----+---+-----+
| eName| eNo|age| city|
+------+-----+---+-----+
|aarush|12121| 15|malmo|
|ramesh| 1212| 29|malmo|
+------+-----+---+-----+
I need to get the non-matching records from df1, based on a number of columns specified in another file.
For example, the column lookup file is something like below:
df1col,df2col
name,eName
empNo, eNo
Expected output is:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
|shankar|12121| 28|
| suresh| 1111| 30|
| aarush| 0707| 15|
+-------+-----+---+
The question is how to build a where condition dynamically for the above scenario, because the lookup file is configurable and might have 1 to n fields.
You can use the except DataFrame method. I'm assuming that the columns to use are in two lists for simplicity. It's necessary that the order of both lists is correct: the columns at the same position in each list will be compared (regardless of column name). After except, use a join to get the missing columns from the first dataframe.
val df1 = Seq(("shankar","12121",28),("ramesh","1212",29),("suresh","1111",30),("aarush","0707",15))
.toDF("name", "empNo", "age")
val df2 = Seq(("aarush", "12121", 15, "malmo"),("ramesh", "1212", 29, "malmo"))
.toDF("eName", "eNo", "age", "city")
val df1Cols = List("name", "empNo")
val df2Cols = List("eName", "eNo")
val tempDf = df1.select(df1Cols.head, df1Cols.tail: _*)
.except(df2.select(df2Cols.head, df2Cols.tail: _*))
val df = df1.join(broadcast(tempDf), df1Cols)
The resulting dataframe will look as wanted:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
| aarush| 0707| 15|
| suresh| 1111| 30|
|shankar|12121| 28|
+-------+-----+---+
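As for the assumption that the column pairs come as two lists: if they have to be read from the lookup file itself, here is one possible sketch (assuming the file is a small local CSV at a hypothetical path lookup.csv, with the df1col,df2col header shown in the question):
// read the mapping file on the driver and split it into the two column lists
val mapping = scala.io.Source.fromFile("lookup.csv").getLines()
  .drop(1)                                 // skip the "df1col,df2col" header
  .map(_.split(",").map(_.trim))
  .collect { case Array(c1, c2) => (c1, c2) }
  .toList
val df1Cols = mapping.map(_._1)
val df2Cols = mapping.map(_._2)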
If you're doing this from a SQL query I would remap the column names in the SQL query itself with something like Changing a SQL column title via query. You could do a simple text replace in the query to normalize them to the df1 or df2 column names.
Once you have that you can diff using something like
How to obtain the difference between two DataFrames?
If you need more columns that wouldn't be used in the diff (e.g. age) you can reselect the data again based on your diff results. This may not be the optimal way of doing it but it would probably work.
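To make that concrete with the DataFrame API, a rough sketch assuming the df1/df2 from above (it simply remaps df2's columns to df1's names before the diff; diffKeys and result are illustrative names):
// 1) remap df2's lookup columns to df1's names, then diff on those columns only
val diffKeys = df1.select("name", "empNo")
  .except(df2.select($"eName".as("name"), $"eNo".as("empNo")))
// 2) join back to df1 to reselect the columns not used in the diff (e.g. age)
val result = df1.join(diffKeys, Seq("name", "empNo"))
result.show()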