How to append an element to an array column of a Spark Dataframe? - scala

Suppose I have the following DataFrame:
scala> val df1 = Seq("a", "b").toDF("id").withColumn("nums", array(lit(1)))
df1: org.apache.spark.sql.DataFrame = [id: string, nums: array<int>]
scala> df1.show()
+---+----+
| id|nums|
+---+----+
|  a| [1]|
|  b| [1]|
+---+----+
And I want to add elements to the array in the nums column, so that I get something like the following:
+---+-------+
| id|   nums|
+---+-------+
|  a|  [1,5]|
|  b|  [1,5]|
+---+-------+
Is there a way to do this using the .withColumn() method of the DataFrame? E.g.
val df2 = df1.withColumn("nums", append(col("nums"), lit(5)))
I've looked through the API documentation for Spark, but can't find anything that would allow me to do this. I could probably use split and concat_ws to hack something together, but I would prefer a more elegant solution if one is possible. Thanks.

import org.apache.spark.sql.functions.{lit, array, array_union}
val df1 = Seq("a", "b").toDF("id").withColumn("nums", array(lit(1)))
val df2 = df1.withColumn("nums", array_union($"nums", lit(Array(5))))
df2.show
+---+------+
| id|  nums|
+---+------+
| a|[1, 5]|
| b|[1, 5]|
+---+------+
array_union() was added in the Spark 2.4.0 release on 11/2/2018, seven months after you asked the question :) See https://spark.apache.org/news/index.html

You can do it using a udf function, as follows:
import org.apache.spark.sql.functions.{col, udf}

def addValue = udf((array: Seq[Int]) => array ++ Array(5))

df1.withColumn("nums", addValue(col("nums")))
  .show(false)
and you should get
+---+------+
|id |nums  |
+---+------+
|a  |[1, 5]|
|b  |[1, 5]|
+---+------+
Updated
An alternative is to go the Dataset way and use map:
df1.map(row => add(row.getAs[String]("id"), row.getAs[Seq[Int]]("nums") ++ Seq(5)))
  .show(false)
where add is a case class
case class add(id: String, nums: Seq[Int])
I hope the answer is helpful

If you are, like me, searching for how to do this in a Spark SQL statement, here's how:
%sql
select array_union(array("value 1"), array("value 2"))
You can use array_union to join up two arrays. To be able to use this, you have to turn your value-to-append into an array. Do this by using the array() function.
You can enter a value like array("a string") or array(yourColumn).
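For reference, a minimal sketch of the same idea with the Scala DataFrame API, applied to the df1 from the question (the otherCol name is just a placeholder):
import org.apache.spark.sql.functions.{array, array_union, col, lit}

// Wrap the value-to-append in array() so both arguments of array_union are arrays
df1.withColumn("nums", array_union(col("nums"), array(lit(5)))).show()

// Appending another column's value instead of a literal (otherCol is hypothetical):
// df1.withColumn("nums", array_union(col("nums"), array(col("otherCol"))))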

Be careful with Spark's array_union: it removes duplicates, so you will not get the expected results if your array contains duplicated entries. It also costs at least O(N), so when I used it inside an array aggregate it became an O(N^2) operation and took forever for some large arrays.
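If duplicates do matter, here is a minimal sketch (assuming Spark 2.4+) of the difference between array_union and concat, which keeps every element:
import org.apache.spark.sql.functions.{array, array_union, concat, lit}
import spark.implicits._  // assumes a SparkSession named spark

val dfDup = Seq("a").toDF("id").withColumn("nums", array(lit(1), lit(5)))

dfDup.select(
  array_union($"nums", lit(Array(5))).as("with_union"),  // [1, 5]    (the extra 5 is dropped)
  concat($"nums", lit(Array(5))).as("with_concat")       // [1, 5, 5] (duplicates preserved)
).show(false)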

Related

How to efficiently select dataframe columns containing a certain value in Spark?

Suppose you have a dataframe in Spark (string type) and you want to drop any column that contains "foo". In the example dataframe below, you would drop columns "c2" and "c3" but keep "c1". However, I'd like the solution to generalize to large numbers of columns and rows.
+-----+---+------+
|   c1| c2|    c3|
+-----+---+------+
| this|foo| hello|
| that|bar| world|
|other|baz|foobar|
+-----+---+------+
My solution is to scan every column in the dataframe then aggregate the results using the dataframe API and built in functions.
So, scanning each column could be done like this (I'm new to Scala, please excuse syntax mistakes):
val scanned = df.select(df.columns.map(c => col(c).like("foo").alias(c)): _*)
Logically, I would have an intermediate dataframe like this:
+-----+-----+-----+
|   c1|   c2|   c3|
+-----+-----+-----+
|false| true|false|
|false|false|false|
|false|false| true|
+-----+-----+-----+
Which would then be aggregated into a single row to read off which columns need to be dropped.
val exprs = scanned.columns.map(c => max(c).alias(c))
val drop = scanned.agg(exprs.head, exprs.tail: _*)
+-----+----+----+
|   c1|  c2|  c3|
+-----+----+----+
|false|true|true|
+-----+----+----+
Now any column containing true can be dropped.
My question is: is there a better way to do this, performance-wise? In this case, does Spark stop scanning a column once it finds "foo"? Does it matter how the data is stored (would Parquet help)?
Thanks, I'm new here, so please tell me how the question can be improved.
Depending on your data, for example if you have a lot of "foo" values, the code below may perform more efficiently:
val colsToDrop = df.columns.filter { c =>
  !df.where(col(c).like("foo")).limit(1).isEmpty
}
df.drop(colsToDrop: _*)
UPDATE: Removed redundant .limit(1):
val colsToDrop = df.columns.filter { c =>
  !df.where(col(c).like("foo")).isEmpty
}
df.drop(colsToDrop: _*)
An answer following your logic (worked out correctly), but I think the other answer is better, more so for posterity and for your improving ability with Scala. I am not sure the other answer is in fact performant, but neither is this one. Not sure if Parquet would help; it is difficult to gauge.
The other option is to write a loop on the driver and access every column one at a time; then Parquet would be of use due to its columnar storage, statistics and predicate pushdown (a rough sketch of that loop follows the code below).
import org.apache.spark.sql.functions._
// For each column value in the row, return whether it equals the comparison string
def myUDF = udf((cols: Seq[String], cmp: String) => cols.map(_ == cmp))

val df = sc.parallelize(Seq(
  ("foo", "abc", "sss"),
  ("bar", "fff", "sss"),
  ("foo", "foo", "ddd"),
  ("bar", "ddd", "ddd")
)).toDF("a", "b", "c")

// Pack all columns into one array column, then flag which entries equal "foo"
val res = df.select($"*", array(df.columns.map(col): _*).as("colN"))
  .withColumn("colres", myUDF(col("colN"), lit("foo")))
res.show()
res.printSchema()
val n = 3
val res2 = res.select( (0 until n).map(i => col("colres")(i).alias(s"c${i+1}")): _*)
res2.show(false)
val exprs = res2.columns.map( c => max(c).alias(c))
val drop = res2.agg(exprs.head, exprs.tail: _*)
drop.show(false)
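And a rough sketch of the driver-side loop mentioned above (it touches one column per pass, which is where Parquet's column pruning and statistics could pay off; it reuses the functions._ import above):
// Sketch only: scan each column in isolation so the read can be pruned to that column
val colsToDrop = df.columns.filter { c =>
  df.select(c).where(col(c).like("foo")).limit(1).count() > 0
}
val cleaned = df.drop(colsToDrop: _*)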

How to get a value from Dataset and store it in a Scala value?

I have a dataframe which looks like this:
scala> avgsessiontime.show()
+-----------------+
|              avg|
+-----------------+
|2.073455735838315|
+-----------------+
I need to store the value 2.073455735838315 in a variable. I tried using
avgsessiontime.collect
but that starts giving me Task not serializable exceptions. So to avoid that I started using foreachPartition. But I don't know how to extract the value 2.073455735838315 into a variable.
scala> avgsessiontime.foreachPartition(x => x.foreach(println))
[2.073455735838315]
But when I do this:
avgsessiontime.foreachPartition(x => for (name <- x) name.get(0))
I get a blank/empty result. Even the length returns empty.
avgsessiontime.foreachPartition(x => for (name <- x) name.length)
I know name is of type org.apache.spark.sql.Row, so it should return both of those results.
You might need:
avgsessiontime.first.getDouble(0)
Here, first extracts the Row object and .getDouble(0) extracts the value from that Row object.
val df = Seq(2.0743).toDF("avg")
df.show
+------+
|   avg|
+------+
|2.0743|
+------+
df.first.getDouble(0)
// res6: Double = 2.0743
scala> val df = spark.range(10)
df: org.apache.spark.sql.Dataset[Long] = [id: bigint]
scala> df.show
+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
+---+
scala> val variable = df.select("id").as[Long].collect
variable: Array[Long] = Array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
In the same way you can extract values of any type, e.g. Double or String; you just need to specify the data type when selecting values from the DataFrame.
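For the original avg column, for example, a one-liner sketch would be:
import spark.implicits._  // for the Double encoder

// Assuming the column is named "avg" as in the question
val avg: Double = avgsessiontime.select("avg").as[Double].first()
// avg: Double = 2.073455735838315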
RDDs and DataFrames/Datasets are distributed in nature, and foreach and foreachPartition are executed on the executors, transforming the DataFrame or RDD on the executors themselves without returning anything. So if you want to return a value to the driver node, you will have to use collect.
Supposing you have a dataframe as
+-----------------+
|avg              |
+-----------------+
|2.073455735838315|
|2.073455735838316|
+-----------------+
doing the following will print all the values, which you can store in a variable too
avgsessiontime.rdd.collect().foreach(x => println(x(0)))
it will print
2.073455735838315
2.073455735838316
Now if you want only the first one then you can do
avgsessiontime.rdd.collect()(0)(0)
which will give you
2.073455735838315
I hope the answer is helpful

Spark: Joining with array

I need to join a dataframe that has a string column to one with an array-of-strings column, so that if one of the values in the array matches, the rows will join.
I tried this, but I guess it's not supported.
Any other way to do this?
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
val sparkConf = new SparkConf().setMaster("local[*]").setAppName("test")
val spark = SparkSession.builder().config(sparkConf).getOrCreate()
import spark.implicits._
val left = spark.sparkContext.parallelize(Seq(1, 2, 3)).toDF("col1")
val right = spark.sparkContext.parallelize(Seq((Array(1, 2), "Yes"),(Array(3),"No"))).toDF("col1", "col2")
left.join(right,"col1")
Throws:
org.apache.spark.sql.AnalysisException: cannot resolve '(col1 = col1)' due to data type mismatch: differing types in '(col1 = col1)' (int and array<int>).;;
One option is to create a UDF for building your join condition:
import org.apache.spark.sql.functions._
import scala.collection.mutable.WrappedArray
val left = spark.sparkContext.parallelize(Seq(1, 2, 3)).toDF("col1")
val right = spark.sparkContext.parallelize(Seq((Array(1, 2), "Yes"),(Array(3),"No"))).toDF("col1", "col2")
val checkValue = udf { (array: WrappedArray[Int], value: Int) =>
  array.contains(value)
}
val result = left.join(right, checkValue(right("col1"), left("col1")), "inner")
result.show
+----+------+----+
|col1|  col1|col2|
+----+------+----+
|   1|[1, 2]| Yes|
|   2|[1, 2]| Yes|
|   3|   [3]|  No|
+----+------+----+
The most succinct way to do this is to use the array_contains Spark SQL expression as shown below. That said, I've compared the performance of this with the performance of doing an explode and join as shown in another answer, and the explode seems more performant.
import org.apache.spark.sql.functions.expr
import spark.implicits._
val left = Seq(1, 2, 3).toDF("col1")
val right = Seq((Array(1, 2), "Yes"),(Array(3),"No")).toDF("col1", "col2").withColumnRenamed("col1", "col1_array")
val joined = left.join(right, expr("array_contains(col1_array, col1)")).show
+----+----------+----+
|col1|col1_array|col2|
+----+----------+----+
|   1|    [1, 2]| Yes|
|   2|    [1, 2]| Yes|
|   3|       [3]|  No|
+----+----------+----+
Note you can't use the org.apache.spark.sql.functions.array_contains function directly as it requires the second argument to be a literal as opposed to a column expression.
You could use explode on your array column before the join. explode creates a new row for each element in the array:
import org.apache.spark.sql.functions.explode

val rightExploded = right.withColumn("exploded_col", explode(right("col1")))
rightExploded.show()
+------+----+------------+
|  col1|col2|exploded_col|
+------+----+------------+
|[1, 2]| Yes|           1|
|[1, 2]| Yes|           2|
|   [3]|  No|           3|
+------+----+------------+
Then you can easily join with your first dataset.
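A minimal sketch of that final join, reusing left from the question and the rightExploded value from above, and dropping the duplicate column:
val joined = left.join(rightExploded, left("col1") === rightExploded("exploded_col"))
  .select(left("col1"), rightExploded("col2"))

joined.show()
// expected rows (order not guaranteed): (1, Yes), (2, Yes), (3, No)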

Pass Array[Seq[String]] to UDF in spark scala

I am new to UDFs in Spark. I have also read the answer here.
Problem statement: I'm trying to do pattern matching on a dataframe column.
Ex: Dataframe
val df = Seq((1, Some("z")), (2, Some("abs,abc,dfg")),
(3,Some("a,b,c,d,e,f,abs,abc,dfg"))).toDF("id", "text")
df.show()
+---+--------------------+
| id|                text|
+---+--------------------+
|  1|                   z|
|  2|         abs,abc,dfg|
|  3|a,b,c,d,e,f,abs,a...|
+---+--------------------+
df.filter($"text".contains("abs,abc,dfg")).count()
// returns 2, as "abs,abc,dfg" exists in the 2nd and 3rd rows
Now I want to do this pattern matching for every row in column $text and add a new column called count.
Result:
+---+--------------------+-----+
| id|                text|count|
+---+--------------------+-----+
|  1|                   z|    1|
|  2|         abs,abc,dfg|    2|
|  3|a,b,c,d,e,f,abs,a...|    1|
+---+--------------------+-----+
I tried to define a udf passing the $text column as Array[Seq[String]], but I am not able to get what I intended.
What I tried so far:
val txt = df.select("text").collect.map(_.toSeq.map(_.toString)) // convert column to Array[Seq[String]]
val valsum = udf((txt: Array[Seq[String]], pattern: String) => txt.count(_ == pattern))
df.withColumn("newCol", valsum(lit(txt), df("text"))).show()
Any help would be appreciated
You will have to know all the elements of the text column, which can be done using collect_list by grouping all the rows of your dataframe into one group. Then just count how many elements of the collected array contain the value of the text column, as in the following code.
import sqlContext.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
import scala.collection.mutable

val df = Seq((1, Some("z")), (2, Some("abs,abc,dfg")), (3, Some("a,b,c,d,e,f,abs,abc,dfg"))).toDF("id", "text")

val valsum = udf((txt: String, array: mutable.WrappedArray[String]) => array.filter(element => element.contains(txt)).size)
df.withColumn("grouping", lit("g"))
.withColumn("array", collect_list("text").over(Window.partitionBy("grouping")))
.withColumn("count", valsum($"text", $"array"))
.drop("grouping", "array")
.show(false)
You should have following output
+---+-----------------------+-----+
|id |text                   |count|
+---+-----------------------+-----+
|1  |z                      |1    |
|2  |abs,abc,dfg            |2    |
|3  |a,b,c,d,e,f,abs,abc,dfg|1    |
+---+-----------------------+-----+
I hope this is helpful.
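For completeness, if the text column is small enough to collect to the driver, a simpler sketch (the variable names here are mine, and it reuses the df and the imports from the code above) avoids the window entirely:
// Collect every text value once, then count how many of them contain each row's text
val allTexts = df.select("text").as[String].collect()
val countMatches = udf((txt: String) => allTexts.count(_.contains(txt)))

df.withColumn("count", countMatches($"text")).show(false)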

scala dataframe filter array of strings

Spark 1.6.2 and Scala 2.10 here.
I want to filter the spark dataframe column with an array of strings.
val df1 = sc.parallelize(Seq((1, "L-00417"), (3, "L-00645"), (4, "L-99999"),(5, "L-00623"))).toDF("c1","c2")
+---+-------+
| c1|     c2|
+---+-------+
|  1|L-00417|
|  3|L-00645|
|  4|L-99999|
|  5|L-00623|
+---+-------+
val df2 = sc.parallelize(Seq((1, "L-1"), (3, "L-2"), (4, "L-3"),(5, "L-00623"))).toDF("c3","c4")
+---+-------+
| c3|     c4|
+---+-------+
|  1|    L-1|
|  3|    L-2|
|  4|    L-3|
|  5|L-00623|
+---+-------+
val c2List = df1.select("c2").as[String].collect()
df2.filter(not($"c4").contains(c2List)).show()`
I am getting below error.
Unsupported literal type class [Ljava.lang.String; [Ljava.lang.String;#5ce1739c
Can anyone please help to fix this?
First, contains isn't suitable because you're looking for the opposite relationship - you want to check if c2List contains c4's value, and not the other way around.
You can use isin for that - which uses "repeated argument" (similar to Java's "varargs") of the values to match, so you'd want to "expand" c2List into a repeated argument, which can be done using the : _* operator:
df2.filter(not($"c4".isin(c2List: _*)))
Alternatively, with Spark 1.6 you can use a "left anti join", to join the two dataframes and get only the values in df2 that did NOT match values in df1:
df2.join(df1, $"c2" === $"c4", "leftanti")
Unlike the previous, this option is not limited to the case where df1 is small enough to be collected.
Lastly, if you're using an earlier Spark version, you can imitate leftanti using a left join and a filter:
df2.join(df1, $"c2" === $"c4", "left").filter($"c2".isNull).select("c3", "c4")