How to merge two columns of a `Dataframe` in Spark into one 2-Tuple? - scala

I have a Spark DataFrame df with five columns. I want to add another column whose values are the tuple of the first and second columns. When using the withColumn() method, I get a mismatch error, because the input is not of type Column but rather (Column, Column). Is there a solution other than running a for loop over the rows?
var dfCol = (col1: Column, col2: Column) => (col1, col2)
val vv = df.withColumn("NewColumn", dfCol(df(df.schema.fieldNames(1)), df(df.schema.fieldNames(2))))

You can use the struct function, which combines the provided columns into a tuple-like struct:
import org.apache.spark.sql.functions.struct
import spark.implicits._ // needed for toDF outside spark-shell

val df = Seq((1, 2), (3, 4), (5, 3)).toDF("a", "b")
df.withColumn("NewColumn", struct(df("a"), df("b"))).show(false)
+---+---+---------+
|a  |b  |NewColumn|
+---+---+---------+
|1  |2  |[1,2]    |
|3  |4  |[3,4]    |
|5  |3  |[5,3]    |
+---+---+---------+
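Note that struct produces a StructType column rather than a Scala tuple. If you later need the values back out, you can address the fields by name; a small sketch continuing the example above (fields of a struct built from plain columns keep those columns' names):
df.withColumn("NewColumn", struct(df("a"), df("b")))
  .select("NewColumn.a", "NewColumn.b")
  .show(false)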

You can also use a user-defined function (UDF) to achieve what you want.
UDF definition
object TupleUDFs {
  import org.apache.spark.sql.functions.udf
  // a type tag is required, as we have a generic udf
  import scala.reflect.runtime.universe.{TypeTag, typeTag}

  def toTuple2[S: TypeTag, T: TypeTag] =
    udf[(S, T), S, T]((x: S, y: T) => (x, y))
}
Usage
df.withColumn(
  "tuple_col", TupleUDFs.toTuple2[Int, Int].apply(df("a"), df("b"))
)
assuming "a" and "b" are the columns of type Int you want to put in a tuple.

You can merge multiple DataFrame columns into one using the array function.
// $"*" will capture all existing columns
df.select($"*", array($"col1", $"col2").as("newCol"))

If you want to merge two DataFrame columns into one column, just:
import org.apache.spark.sql.functions.array
df.withColumn("NewColumn", array("columnA", "columnB"))

Related

Spark Scala UDF to count number of array elements contained in another string column

I have a Spark DataFrame df with 2 columns, say A and B, where A is of array-of-strings type and B is a string.
For each row, I am trying to count how many elements in A are contained in B. The UDF I have written is as follows. I thought it should be easy but it breaks down in the subsequent action step.
val hasAddressInUDF = udf{(s: String, t: Array[String]) => t.filter(word => s.contains(word)).size}
Could anyone help? Thanks.
You should be able to use the Seq.count method for this one: for each element in the sequence, check whether it is a substring of your B column and count the matches.
import spark.implicits._

val df = Seq(
  (Seq("potato", "meat", "car"), "potatoes with meat"),
  (Seq("ice", "cream", "candy"), "pasta with cream"),
  (Seq("crackers", "with", "hummus"), "tasty food")
).toDF("A", "B")
df.show(false)
+------------------------+------------------+
|A                       |B                 |
+------------------------+------------------+
|[potato, meat, car]     |potatoes with meat|
|[ice, cream, candy]     |pasta with cream  |
|[crackers, with, hummus]|tasty food        |
+------------------------+------------------+
def count_in_seq = udf((A: Seq[String], B: String) => A.count(elem => B.contains(elem)))
df.withColumn("counts", count_in_seq($"A", $"B")).show(false)
+------------------------+------------------+------+
|A                       |B                 |counts|
+------------------------+------------------+------+
|[potato, meat, car]     |potatoes with meat|2     |
|[ice, cream, candy]     |pasta with cream  |1     |
|[crackers, with, hummus]|tasty food        |0     |
+------------------------+------------------+------+
Hope this helps!
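If you would rather avoid a UDF and are on Spark 2.4 or later, the same count can be expressed with higher-order SQL functions; an untested sketch (the lambda can reference the row's B column):
import org.apache.spark.sql.functions.expr
// keep only the elements of A that occur in B, then take the size of the result
df.withColumn("counts", expr("size(filter(A, x -> instr(B, x) > 0))")).show(false)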

Convert Array of String column to multiple columns in spark scala

I have a dataframe with following schema:
id : int,
emp_details: Array(String)
Some sample data:
1, Array(empname=xxx,city=yyy,zip=12345)
2, Array(empname=bbb,city=bbb,zip=22345)
This data is in a dataframe, and I need to read emp_details from the array and assign its values to new columns as below, or alternatively split the array into multiple columns named empname, city and zip:
.withColumn("empname", xxx)
.withColumn("city", yyy)
.withColumn("zip", 12345)
Could you please advise how to achieve this using Spark 1.6 with Scala?
Really appreciate your help...
Thanks a lot
You can use withColumn and split to get the required data:
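For reference, a df1 matching the sample data in the question could be built like this (names assumed from the question):
import spark.implicits._
val df1 = Seq(
  (1, Array("empname=xxx", "city=yyy", "zip=12345")),
  (2, Array("empname=bbb", "city=bbb", "zip=22345"))
).toDF("id", "emp_details")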
df1.withColumn("empname", split($"emp_details" (0), "=")(1))
.withColumn("city", split($"emp_details" (1), "=")(1))
.withColumn("zip", split($"emp_details" (2), "=")(1))
Output:
+---+----------------------------------+-------+----+-----+
|id |emp_details                       |empname|city|zip  |
+---+----------------------------------+-------+----+-----+
|1  |[empname=xxx, city=yyy, zip=12345]|xxx    |yyy |12345|
|2  |[empname=bbb, city=bbb, zip=22345]|bbb    |bbb |22345|
+---+----------------------------------+-------+----+-----+
UPDATE:
If the values in the array are not in a fixed order, you can use a UDF to convert the array to a map and use it as follows:
val getColumnsUDF = udf((details: Seq[String]) => {
  val detailsMap = details.map(_.split("=")).map(x => (x(0), x(1))).toMap
  (detailsMap("empname"), detailsMap("city"), detailsMap("zip"))
})
Now use the udf
df1.withColumn("emp",getColumnsUDF($"emp_details"))
.select($"id", $"emp._1".as("empname"), $"emp._2".as("city"), $"emp._3".as("zip"))
.show(false)
Output:
+---+-------+----+-----+
|id |empname|city|zip  |
+---+-------+----+-----+
|1  |xxx    |yyy |12345|
|2  |bbb    |bbb |22345|
+---+-------+----+-----+
Hope this helps!
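The question targets Spark 1.6, but if a newer version (2.4+) is available, the same result can be had without a UDF by joining the array back into a string and parsing it with the built-in str_to_map; an untested sketch:
import org.apache.spark.sql.functions.{col, expr}
df1.withColumn("m", expr("str_to_map(array_join(emp_details, ','), ',', '=')"))
  .select(col("id"),
    col("m").getItem("empname").as("empname"),
    col("m").getItem("city").as("city"),
    col("m").getItem("zip").as("zip"))
  .show(false)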

Process all columns / the entire row in a Spark UDF

For a dataframe containing a mix of string and numeric datatypes, the goal is to create a new features column that is a minhash of all of them.
While this could be done by performing dataframe.toRDD, that is expensive when the next step is simply to convert the RDD back to a dataframe.
So is there a way to do a udf along the following lines:
val wholeRowUdf = udf( (row: Row) => computeHash(row))
Row is not a spark sql datatype of course - so this would not work as shown.
Update/clarification: I realize it is easy to create a full-row UDF that runs inside withColumn. What is not so clear is what can be used inside a spark sql statement:
val featurizedDf = spark.sql("select wholeRowUdf( what goes here? ) as features from mytable")
Row is not a spark sql datatype of course - so this would not work as shown.
I am going to show that you can use Row to pass all the columns, or selected columns, to a udf function by using the struct built-in function.
First I define a dataframe
val df = Seq(
  ("a", "b", "c"),
  ("a1", "b1", "c1")
).toDF("col1", "col2", "col3")
// +----+----+----+
// |col1|col2|col3|
// +----+----+----+
// |a   |b   |c   |
// |a1  |b1  |c1  |
// +----+----+----+
Then I define a function that turns all the elements of a row into one string separated by , (standing in for your computeHash function):
import org.apache.spark.sql.Row
def concatFunc(row: Row) = row.mkString(", ")
Then I use it in a udf function:
import org.apache.spark.sql.functions._
def combineUdf = udf((row: Row) => concatFunc(row))
Finally I call the udf using withColumn, with the struct built-in function combining the selected columns into a single column that is passed to the udf:
df.withColumn("contcatenated", combineUdf(struct(col("col1"), col("col2"), col("col3")))).show(false)
// +----+----+----+------------+
// |col1|col2|col3|concatenated|
// +----+----+----+------------+
// |a   |b   |c   |a, b, c     |
// |a1  |b1  |c1  |a1, b1, c1  |
// +----+----+----+------------+
So you can see that Row can be used to pass the whole row as an argument.
You can even pass all the columns of a row at once:
val columns = df.columns
df.withColumn("contcatenated", combineUdf(struct(columns.map(col): _*)))
Updated
You can achieve the same with sql queries too; you just need to register the udf function:
df.createOrReplaceTempView("tempview")
sqlContext.udf.register("combineUdf", combineUdf)
sqlContext.sql("select *, combineUdf(struct(`col1`, `col2`, `col3`)) as concatenated from tempview")
It will give you the same result as above
Now, if you don't want to hardcode the column names, you can select the columns you want and build the list as a string:
val columns = df.columns.map(x => "`"+x+"`").mkString(",")
sqlContext.sql(s"select *, combineUdf(struct(${columns})) as concatenated from tempview")
I hope the answer is helpful
I came up with a workaround: drop the column names into any existing spark sql function to generate a new output column:
concat(${df.columns.tail.mkString(",'-',")}) as Features
In this case the first column in the dataframe is a target and was excluded. That is another advantage of this approach: the actual list of columns may be manipulated.
This approach avoids unnecessary restructuring of the RDD/dataframes.
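A slightly fuller sketch of that workaround, with the string building spelled out (table and column names are illustrative):
// builds "col2,'-',col3,'-',..." from every column except the first (the target)
val featureCols = df.columns.tail.mkString(",'-',")
val featurizedDf = spark.sql(s"select concat($featureCols) as Features from mytable")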

convert each row of dataframe to a map

I have a dataframe with columns A & B of type String. Let's assume the dataframe below:
+---+---+
|A  |B  |
+---+---+
|1a |1b |
|2a |2b |
+---+---+
I want to add a third column C that contains a map of the A & B columns:
+---+---+--------------+
|A  |B  |C             |
+---+---+--------------+
|1a |1b |{A->1a, B->1b}|
|2a |2b |{A->2a, B->2b}|
+---+---+--------------+
I'm attempting to do it the following way. I have a udf which takes in a dataframe and returns a map:
val test = udf((dataFrame: DataFrame) => {
  val result = new mutable.HashMap[String, String]
  dataFrame.columns.foreach(col => {
    result.put(col, dataFrame(col).asInstanceOf[String])
  })
  result
})
I'm calling this udf in the following way, which throws a RuntimeException because I'm trying to pass a Dataset as a literal:
df.withColumn("C", Helper.test(lit(df.select(df.columns.head, df.columns.tail: _*))))
I don't want to pass df("a"), df("b") to my helper udf, as I want it to work on a generic list of columns that I could select.
any pointers?
map way
You can just use the map built-in function:
import org.apache.spark.sql.functions._
val columns = df.columns
df.withColumn("C", map(columns.flatMap(x => Array(lit(x), col(x))): _*)).show(false)
which should give you
+---+---+---------------------+
|A  |B  |C                    |
+---+---+---------------------+
|1a |1b |Map(A -> 1a, B -> 1b)|
|2a |2b |Map(A -> 2a, B -> 2b)|
+---+---+---------------------+
Udf way
Or you can define your udf as:
//collecting column names to be used in the udf
val columns = df.columns
//definining udf function
import org.apache.spark.sql.functions._
def createMapUdf = udf((names: Seq[String], values: Seq[String]) => names.zip(values).toMap)
//calling udf function
df.withColumn("C", createMapUdf(array(columns.map(x => lit(x)): _*), array(col("A"), col("B")))).show(false)
I hope the answer is helpful
@Ramesh Maharjan - Your answers are already great; mine just makes your UDF answer dynamic as well, by building the value array from the column names using string interpolation.
Column D shows the dynamic version.
df.withColumn("C", createMapUdf(array(columns.map(x => lit(x)): _*),
array(col("A"), col("B"))))
.withColumn("D", createMapUdf(array(columns.map(x => lit(x)): _*),
array(columns.map(x => col(s"$x") ): _* ))).show()

How do I combine two columns in a Spark SchemaRDD containing WrappedArrays into a 3rd column with the combined WrappedArray?

I have a DataFrame with two columns ( "features1" and "features2" ) containing WrappedArrays.
I need to combine the two columns into a third column containing the merged contents of the first two columns as a WrappedArray.
How do I do this?
I'm using Scala not PySpark
Surprisingly, I didn't find any way other than a udf:
def catArray[A](a:Seq[A], b: Seq[A]): Seq[A] = a ++ b
val catArrayUdf = udf { catArray[Int] _ }
Then
scala> sc.parallelize(List((Seq(1,2), Seq(3,4))))
  .toDF("A", "B")
  .withColumn("cat", catArrayUdf('A, 'B))
  .show(false)
+------+------+------------+
|A     |B     |cat         |
+------+------+------------+
|[1, 2]|[3, 4]|[1, 2, 3, 4]|
+------+------+------------+
Maybe there is a shorter way to define the UDF based on ++ though.
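Note that on Spark 2.4 and later the built-in concat function also accepts array columns, so the UDF can be skipped entirely; something along these lines:
import org.apache.spark.sql.functions.concat
df.withColumn("cat", concat($"A", $"B")).show(false)
// +------+------+------------+
// |A     |B     |cat         |
// +------+------+------------+
// |[1, 2]|[3, 4]|[1, 2, 3, 4]|
// +------+------+------------+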