How to add a new column to my DataFrame such that values of new column are populated by some other function in scala? - scala

myFunc(Row): String = {
//process row
//returns string
}
appendNewCol(inputDF : DataFrame) : DataFrame ={
inputDF.withColumn("newcol",myFunc(Row))
inputDF
}
But no new column got created in my case. My myFunc passes this row to a knowledgebasesession object and that returns a string after firing rules. Can I do it this way? If not, what is the right way? Thanks in advance.
I saw many StackOverflow solutions using expr() sqlfunc(col(udf(x)) and other techniques but here my newcol is not derived directly from existing column.

Dataframe:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
val myFunc = (r: Row) => {r.getAs[String]("col1") + "xyz"} // example transformation
val testDf = spark.sparkContext.parallelize(Seq(
(1, "abc"), (2, "def"), (3, "ghi"))).toDF("id", "col1")
testDf.show
val rddRes = testDf
.rdd
.map{x =>
val y = myFunc (x)
Row.fromSeq (x.toSeq ++ Seq(y) )
}
val newSchema = StructType(testDf.schema.fields ++ Array(StructField("col2", dataType =StringType, nullable =false)))
spark.sqlContext.createDataFrame(rddRes, newSchema).show
Results:
+---+----+
| id|col1|
+---+----+
| 1| abc|
| 2| def|
| 3| ghi|
+---+----+
+---+----+------+
| id|col1| col2|
+---+----+------+
| 1| abc|abcxyz|
| 2| def|defxyz|
| 3| ghi|ghixyz|
+---+----+------+
With Dataset:
case class testData(id: Int, col1: String)
case class transformedData(id: Int, col1: String, col2: String)
val test: Dataset[testData] = List(testData(1, "abc"), testData(2, "def"), testData(3, "ghi")).toDS
val transformedData: Dataset[transformedData] = test
.map { x: testData =>
val newCol = x.col1 + "xyz"
transformedData(x.id, x.col1, newCol)
}
transformedData.show
As you can see datasets is more readable, plus provides strong type casting.
Since I'm unaware of your spark version, providing both solutions here. However if you're using spark v>=1.6, you should look into Datasets. Playing with rdd is fun, but can quickly devolve into longer job runs and a host of other issues that you wont foresee

Related

Use rlike with regex column in spark 1.5.1

I want to filter dataframe based on applying regex values in one of the columns to another column.
Example:
Id Column1 RegexColumm
1 Abc A.*
2 Def B.*
3 Ghi G.*
The result of filtering dataframe using RegexColumm should give rows with id 1 and 3.
Is there a way to do this in spark 1.5.1? Don't want to use UDF as this might cause scalability issues, looking for spark native api.
You can convert df -> rdd then by traversing through row we can match the regex and filter out only the matching data without using any UDF.
Example:
import org.apache.spark.sql.types._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
df.show()
//+---+-------+--------+
//| id|column1|regexCol|
//+---+-------+--------+
//| 1| Abc| A.*|
//| 2| Def| B.*|
//| 3| Ghi| G.*|
//+---+-------+--------+
//creating new schema to add new boolean field
val sch = StructType(df.schema.fields ++ Array(StructField("bool_col", BooleanType, false)))
//convert df to rdd and match the regex using .map
val rdd = df.rdd.map(row => {
val regex = row.getAs[String]("regexCol")
val bool = row.getAs[String]("column1").matches(regex)
val bool_col = s"$bool".toBoolean
val newRow = Row.fromSeq(row.toSeq ++ Array(bool_col))
newRow
})
//convert rdd to dataframe filter out true values for bool_col
val final_df = sqlContext.createDataFrame(rdd, sch).where(col("bool_col")).drop("bool_col")
final_df.show(10)
//+---+-------+--------+
//| id|column1|regexCol|
//+---+-------+--------+
//| 1| Abc| A.*|
//| 3| Ghi| G.*|
//+---+-------+--------+
UPDATE:
Instead of .map we can use .mapPartition (map vs mapPartiiton):
val rdd = df.rdd.mapPartitions(
partitions => {
partitions.map(row => {
val regex = row.getAs[String]("regexCol")
val bool = row.getAs[String]("column1").matches(regex)
val bool_col = s"$bool".toBoolean
val newRow = Row.fromSeq(row.toSeq ++ Array(bool_col))
newRow
})
})
scala> val df = Seq((1,"Abc","A.*"),(2,"Def","B.*"),(3,"Ghi","G.*")).toDF("id","Column1","RegexColumm")
df: org.apache.spark.sql.DataFrame = [id: int, Column1: string ... 1 more field]
scala> val requiredDF = df.filter(x=> x.getAs[String]("Column1").matches(x.getAs[String]("RegexColumm")))
requiredDF: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, Column1: string ... 1 more field]
scala> requiredDF.show
+---+-------+-----------+
| id|Column1|RegexColumm|
+---+-------+-----------+
| 1| Abc| A.*|
| 3| Ghi| G.*|
+---+-------+-----------+
You can use like above, I think this is what you are lioking for. Please do let me know if it helps you.

Add new column containing an Array of column names sorted by the row-wise values

Given a dataFrame with a few columns, I'm trying to create a new column containing an array of these columns' names sorted by decreasing order, based on the row-wise values of these columns.
| a | b | c | newcol|
|---|---|---|-------|
| 1 | 4 | 3 |[b,c,a]|
| 4 | 1 | 3 |[a,c,b]|
---------------------
The names of the columns are stored in a var names:Array[String]
What approach should I go for?
Using UDF is most simple way to achieve custom tasks here.
val df = spark.createDataFrame(Seq((1,4,3), (4,1,3))).toDF("a", "b", "c")
val names=df.schema.fieldNames
val sortNames = udf((v: Seq[Int]) => {v.zip(names).sortBy(_._1).map(_._2)})
df.withColumn("newcol", sortNames(array(names.map(col): _*))).show
Something like this can be an approach using Dataset:
case class Element(name: String, value: Int)
case class Columns(a: Int, b: Int, c: Int, elements: Array[String])
def function1()(implicit spark: SparkSession) = {
import spark.implicits._
val df0: DataFrame =
spark.createDataFrame(spark.sparkContext
.parallelize(Seq(Row(1, 2, 3), Row(4, 1, 3))),
StructType(Seq(StructField("a", IntegerType, false),
StructField("b", IntegerType, false),
StructField("c", IntegerType, false))))
val df1 = df0
.flatMap(row => Seq(Columns(row.getAs[Int]("a"),
row.getAs[Int]("b"),
row.getAs[Int]("c"),
Array(Element("a", row.getAs[Int]("a")),
Element("b", row.getAs[Int]("b")),
Element("c", row.getAs[Int]("c"))).sortBy(-_.value).map(_.name))))
df1
}
def main(args: Array[String]) : Unit = {
implicit val spark = SparkSession.builder().master("local[1]").getOrCreate()
function1().show()
}
gives:
+---+---+---+---------+
| a| b| c| elements|
+---+---+---+---------+
| 1| 2| 3|[a, b, c]|
| 4| 1| 3|[b, c, a]|
+---+---+---+---------+
Try something like this:
val sorted_column_names = udf((column_map: Map[String, Int]) =>
column_map.toSeq.sortBy(- _._2).map(_._1)
)
df.withColumn("column_map", map(lit("a"), $"a", lit("b"), $"b", lit("c"), $"c")
.withColumn("newcol", sorted_column_names($"column_map"))

Select a literal based on a column value in Spark

I have a map:
val map = Map("A" -> 1, "B" -> 2)
And I have a DataFrame. a column in the data frame contains the keys in the map. I am trying to select a column in a new DF that has the map values in it based on the key:
val newDF = DfThatContainsTheKeyColumn.select(concat(col(SomeColumn), lit("|"),
lit(map.get(col(ColumnWithKey).toString()).get) as newColumn)
But this is resulting in the following error:
java.lang.RuntimeException: Unsupported literal type class scala.None$ None
I made sure that the column ColumnWithKey has As and Bs only and does not have empty values in it.
Is there another way to get the result I am looking for? Any help would be appreciated.
The Problem in this statement (besides syntax problems)
val newDF = DfThatContainsTheKeyColumn.select(concat(col(SomeColumn), lit("|"),
lit(map.get(col(ColumnWithKey).toString()).get) as newColumn)
is that col(ColumnWithKey) will not take the value of a specific row, but is only given by the schema, i.e. has a constant value.
In your case I would suggest to join your map to your dataframe :
val map = Map("A" -> 1, "B" -> 2)
val df_map = map.toSeq.toDF("key","value")
val DfThatContainsTheKeyColumn = Seq(
"A",
"A",
"B",
"B"
).toDF("myCol")
DfThatContainsTheKeyColumn
.join(broadcast(df_map),$"mycol"===$"key")
.select(concat($"mycol",lit("|"),$"value").as("newColumn"))
.show()
gives
|newColumn|
+---------+
| A|1|
| A|1|
| B|2|
| B|2|
+---------+
You can use case classes to make it easy. This is an example:
Given this input
val givenMap = Map("A" -> 1, "B" -> 2)
import spark.implicits._
val df = Seq(
(1, "A"),
(2, "A"),
(3, "B"),
(4, "B")
).toDF("col_a", "col_b")
df.show()
Above code looks like:
+-----+-----+
|col_a|col_b|
+-----+-----+
| 1| A|
| 2| A|
| 3| B|
| 4| B|
+-----+-----+
givenMap: scala.collection.immutable.Map[String,Int] = Map(A -> 1, B -> 2)
import spark.implicits._
df: org.apache.spark.sql.DataFrame = [col_a: int, col_b: string]
The code that you need will look like:
case class MyInput(col_a: Int, col_b: String)
case class MyOutput(col_a: Int, col_b: String, new_column: Int)
df.as[MyInput].map(row=> MyOutput(row.col_a, row.col_b, givenMap(row.col_b))).show()
With the case classes you can cast your df and use object notation to access to your column values within a .map. Above code will output:
+-----+-----+----------+
|col_a|col_b|new_column|
+-----+-----+----------+
| 1| A| 1|
| 2| A| 1|
| 3| B| 2|
| 4| B| 2|
+-----+-----+----------+
defined class MyInput
defined class MyOutput
You can lookup a map using key from a column as,
val map = Map("A" -> 1, "B" -> 2)
val df = spark.createDataset(Seq("dummy"))
.withColumn("key",lit("A"))
df.map{ row =>
val k = row.getAs[String]("key")
val v = map.getOrElse(k,0)
(k,v)
}.toDF("key", "value").show(false)
Result -
+---+-----+
|key|value|
+---+-----+
|A |1 |
+---+-----+
You can look up a map present inside a column using a literal key using Column.getItem, please see an example below.
val mapKeys = Array("A","B")
val mapValues = Array(1,2)
val df = spark.createDataset(Seq("dummy"))
.withColumn("key",lit("A"))
.withColumn("keys",lit(mapKeys))
.withColumn("values",lit(mapValues))
.withColumn("map",map_from_arrays($"keys",$"values"))
.withColumn("lookUpTheMap",$"map".getItem("A"))
//A dataframe with Map is created.
//A map is looked up using a hard coded String key.
df.show(false)
Result
+-----+---+------+------+----------------+------------+
|value|key|keys |values|map |lookUpTheMap|
+-----+---+------+------+----------------+------------+
|dummy|A |[A, B]|[1, 2]|[A -> 1, B -> 2]|1 |
+-----+---+------+------+----------------+------------+
To look up a map present inside a column based on another column containing the key - you can use an UDF or use map function on the dataframe the way I am showing below.
//A map is looked up using a Column key.
df.map{ row =>
val m = row.getAs[Map[String,Int]]("map")
val k = row.getAs[String]("key")
val v = m.getOrElse(k,0)
(m,k,v)
}.toDF("map","key", "value").show(false)
Result
+----------------+---+-----+
|map |key|value|
+----------------+---+-----+
|[A -> 1, B -> 2]|A |1 |
+----------------+---+-----+
I think a simpler option could be to use typedLit:
val map = typedLit(Map("A" -> 1, "B" -> 2))
val newDF = DfThatContainsTheKeyColumn.select(concat(col(SomeColumn), lit("|"),
map(col(ColumnWithKey))) as newColumn)

Comparing two array columns in Scala Spark

I have a dataframe of format given below.
movieId1 | genreList1 | genreList2
--------------------------------------------------
1 |[Adventure,Comedy] |[Adventure]
2 |[Animation,Drama,War] |[War,Drama]
3 |[Adventure,Drama] |[Drama,War]
and trying to create another flag column which shows whether genreList2 is a subset of genreList1.
movieId1 | genreList1 | genreList2 | Flag
---------------------------------------------------------------
1 |[Adventure,Comedy] | [Adventure] |1
2 |[Animation,Drama,War] | [War,Drama] |1
3 |[Adventure,Drama] | [Drama,War] |0
I have tried this:
def intersect_check(a: Array[String], b: Array[String]): Int = {
if (b.sameElements(a.intersect(b))) { return 1 }
else { return 2 }
}
def intersect_check_udf =
udf((colvalue1: Array[String], colvalue2: Array[String]) => intersect_check(colvalue1, colvalue2))
data = data.withColumn("Flag", intersect_check_udf(col("genreList1"), col("genreList2")))
But this throws error
org.apache.spark.SparkException: Failed to execute user defined function.
P.S.: The above function (intersect_check) works for Arrays.
We can define an udf that calculates the length of the intersection between the two Array columns and checks whether it is equal to the length of the second column. If so, the second array is a subset of the first one.
Also, the inputs of your udf need to be class WrappedArray[String], not Array[String] :
import scala.collection.mutable.WrappedArray
import org.apache.spark.sql.functions.col
val same_elements = udf { (a: WrappedArray[String],
b: WrappedArray[String]) =>
if (a.intersect(b).length == b.length){ 1 }else{ 0 }
}
df.withColumn("test",same_elements(col("genreList1"),col("genreList2")))
.show(truncate = false)
+--------+-----------------------+------------+----+
|movieId1|genreList1 |genreList2 |test|
+--------+-----------------------+------------+----+
|1 |[Adventure, Comedy] |[Adventure] |1 |
|2 |[Animation, Drama, War]|[War, Drama]|1 |
|3 |[Adventure, Drama] |[Drama, War]|0 |
+--------+-----------------------+------------+----+
Data
val df = List((1,Array("Adventure","Comedy"), Array("Adventure")),
(2,Array("Animation","Drama","War"), Array("War","Drama")),
(3,Array("Adventure","Drama"),Array("Drama","War"))).toDF("movieId1","genreList1","genreList2")
Here is the solution converting using subsetOf
val spark =
SparkSession.builder().master("local").appName("test").getOrCreate()
import spark.implicits._
val data = spark.sparkContext.parallelize(
Seq(
(1,Array("Adventure","Comedy"),Array("Adventure")),
(2,Array("Animation","Drama","War"),Array("War","Drama")),
(3,Array("Adventure","Drama"),Array("Drama","War"))
)).toDF("movieId1", "genreList1", "genreList2")
val subsetOf = udf((col1: Seq[String], col2: Seq[String]) => {
if (col2.toSet.subsetOf(col1.toSet)) 1 else 0
})
data.withColumn("flag", subsetOf(data("genreList1"), data("genreList2"))).show()
Hope this helps!
One solution may be to exploit spark array builtin functions: genreList2 is subset of genreList1 if the intersection between the two is equal to genreList2. In the code below a sort_array operation has been added to avoid a mismatch between two arrays with different ordering but same elements.
val spark = {
SparkSession
.builder()
.master("local")
.appName("test")
.getOrCreate()
}
import spark.implicits._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
val df = Seq(
(1, Array("Adventure","Comedy"), Array("Adventure")),
(2, Array("Animation","Drama","War"), Array("War","Drama")),
(3, Array("Adventure","Drama"), Array("Drama","War"))
).toDF("movieId1", "genreList1", "genreList2")
df
.withColumn("flag",
sort_array(array_intersect($"genreList1",$"genreList2"))
.equalTo(
sort_array($"genreList2")
)
.cast("integer")
)
.show()
The output is
+--------+--------------------+------------+----+
|movieId1| genreList1| genreList2|flag|
+--------+--------------------+------------+----+
| 1| [Adventure, Comedy]| [Adventure]| 1|
| 2|[Animation, Drama...|[War, Drama]| 1|
| 3| [Adventure, Drama]|[Drama, War]| 0|
+--------+--------------------+------------+----+
This can also work here and it does not use udf
import spark.implicits._
val data = Seq(
(1,Array("Adventure","Comedy"),Array("Adventure")),
(2,Array("Animation","Drama","War"),Array("War","Drama")),
(3,Array("Adventure","Drama"),Array("Drama","War"))
).toDF("movieId1", "genreList1", "genreList2")
data
.withColumn("size",size(array_except($"genreList2",$"genreList1")))
.withColumn("flag",when($"size" === lit(0), 1) otherwise(0))
.show(false)
Spark 3.0+ (forall)
forall($"genreList2", x => array_contains($"genreList1", x)).cast("int")
Full example:
val df = Seq(
(1, Seq("Adventure", "Comedy"), Seq("Adventure")),
(2, Seq("Animation", "Drama","War"), Seq("War", "Drama")),
(3, Seq("Adventure", "Drama"), Seq("Drama", "War"))
).toDF("movieId1", "genreList1", "genreList2")
val df2 = df.withColumn("Flag", forall($"genreList2", x => array_contains($"genreList1", x)).cast("int"))
df2.show()
// +--------+--------------------+------------+----+
// |movieId1| genreList1| genreList2|Flag|
// +--------+--------------------+------------+----+
// | 1| [Adventure, Comedy]| [Adventure]| 1|
// | 2|[Animation, Drama...|[War, Drama]| 1|
// | 3| [Adventure, Drama]|[Drama, War]| 0|
// +--------+--------------------+------------+----+

Derive multiple columns from a single column in a Spark DataFrame

I have a DF with a huge parseable metadata as a single string column in a Dataframe, lets call it DFA, with ColmnA.
I would like to break this column, ColmnA into multiple columns thru a function, ClassXYZ = Func1(ColmnA). This function returns a class ClassXYZ, with multiple variables, and each of these variables now has to be mapped to new Column, such a ColmnA1, ColmnA2 etc.
How would I do such a transformation from 1 Dataframe to another with these additional columns by calling this Func1 just once, and not have to repeat-it to create all the columns.
Its easy to solve if I were to call this huge function every time to add a new column, but that what I wish to avoid.
Kindly please advise with a working or pseudo code.
Thanks
Sanjay
Generally speaking what you want is not directly possible. UDF can return only a single column at the time. There are two different ways you can overcome this limitation:
Return a column of complex type. The most general solution is a StructType but you can consider ArrayType or MapType as well.
import org.apache.spark.sql.functions.udf
val df = Seq(
(1L, 3.0, "a"), (2L, -1.0, "b"), (3L, 0.0, "c")
).toDF("x", "y", "z")
case class Foobar(foo: Double, bar: Double)
val foobarUdf = udf((x: Long, y: Double, z: String) =>
Foobar(x * y, z.head.toInt * y))
val df1 = df.withColumn("foobar", foobarUdf($"x", $"y", $"z"))
df1.show
// +---+----+---+------------+
// | x| y| z| foobar|
// +---+----+---+------------+
// | 1| 3.0| a| [3.0,291.0]|
// | 2|-1.0| b|[-2.0,-98.0]|
// | 3| 0.0| c| [0.0,0.0]|
// +---+----+---+------------+
df1.printSchema
// root
// |-- x: long (nullable = false)
// |-- y: double (nullable = false)
// |-- z: string (nullable = true)
// |-- foobar: struct (nullable = true)
// | |-- foo: double (nullable = false)
// | |-- bar: double (nullable = false)
This can be easily flattened later but usually there is no need for that.
Switch to RDD, reshape and rebuild DF:
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
def foobarFunc(x: Long, y: Double, z: String): Seq[Any] =
Seq(x * y, z.head.toInt * y)
val schema = StructType(df.schema.fields ++
Array(StructField("foo", DoubleType), StructField("bar", DoubleType)))
val rows = df.rdd.map(r => Row.fromSeq(
r.toSeq ++
foobarFunc(r.getAs[Long]("x"), r.getAs[Double]("y"), r.getAs[String]("z"))))
val df2 = sqlContext.createDataFrame(rows, schema)
df2.show
// +---+----+---+----+-----+
// | x| y| z| foo| bar|
// +---+----+---+----+-----+
// | 1| 3.0| a| 3.0|291.0|
// | 2|-1.0| b|-2.0|-98.0|
// | 3| 0.0| c| 0.0| 0.0|
// +---+----+---+----+-----+
Assume that after your function there will be a sequence of elements, giving an example as below:
val df = sc.parallelize(List(("Mike,1986,Toronto", 30), ("Andre,1980,Ottawa", 36), ("jill,1989,London", 27))).toDF("infoComb", "age")
df.show
+------------------+---+
| infoComb|age|
+------------------+---+
|Mike,1986,Toronto| 30|
| Andre,1980,Ottawa| 36|
| jill,1989,London| 27|
+------------------+---+
now what you can do with this infoComb is that you can start split the string and get more columns with:
df.select(expr("(split(infoComb, ','))[0]").cast("string").as("name"), expr("(split(infoComb, ','))[1]").cast("integer").as("yearOfBorn"), expr("(split(infoComb, ','))[2]").cast("string").as("city"), $"age").show
+-----+----------+-------+---+
| name|yearOfBorn| city|age|
+-----+----------+-------+---+
|Mike| 1986|Toronto| 30|
|Andre| 1980| Ottawa| 36|
| jill| 1989| London| 27|
+-----+----------+-------+---+
Hope this helps.
If your resulting columns will be of the same length as the original one, you can create brand new columns with withColumn function and by applying an udf. After this you can drop your original column, eg:
val newDf = myDf.withColumn("newCol1", myFun(myDf("originalColumn")))
.withColumn("newCol2", myFun2(myDf("originalColumn"))
.drop(myDf("originalColumn"))
where myFun is an udf defined like this:
def myFun= udf(
(originalColumnContent : String) => {
// do something with your original column content and return a new one
}
)
I opted to create a function to flatten one column and then just call it simultaneously with the udf.
First define this:
implicit class DfOperations(df: DataFrame) {
def flattenColumn(col: String) = {
def addColumns(df: DataFrame, cols: Array[String]): DataFrame = {
if (cols.isEmpty) df
else addColumns(
df.withColumn(col + "_" + cols.head, df(col + "." + cols.head)),
cols.tail
)
}
val field = df.select(col).schema.fields(0)
val newCols = field.dataType.asInstanceOf[StructType].fields.map(x => x.name)
addColumns(df, newCols).drop(col)
}
def withColumnMany(colName: String, col: Column) = {
df.withColumn(colName, col).flattenColumn(colName)
}
}
Then usage is very simple:
case class MyClass(a: Int, b: Int)
val df = sc.parallelize(Seq(
(0),
(1)
)).toDF("x")
val f = udf((x: Int) => MyClass(x*2,x*3))
df.withColumnMany("test", f($"x")).show()
// +---+------+------+
// | x|test_a|test_b|
// +---+------+------+
// | 0| 0| 0|
// | 1| 2| 3|
// +---+------+------+
This can be easily achieved by using pivot function
df4.groupBy("year").pivot("course").sum("earnings").collect()