myFunc(Row): String = {
//process row
//returns string
}
appendNewCol(inputDF : DataFrame) : DataFrame ={
inputDF.withColumn("newcol",myFunc(Row))
inputDF
}
But no new column got created in my case. My myFunc passes this row to a knowledgebasesession object and that returns a string after firing rules. Can I do it this way? If not, what is the right way? Thanks in advance.
I saw many StackOverflow solutions using expr() sqlfunc(col(udf(x)) and other techniques but here my newcol is not derived directly from existing column.
Dataframe:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
val myFunc = (r: Row) => {r.getAs[String]("col1") + "xyz"} // example transformation
val testDf = spark.sparkContext.parallelize(Seq(
(1, "abc"), (2, "def"), (3, "ghi"))).toDF("id", "col1")
testDf.show
val rddRes = testDf
.rdd
.map{x =>
val y = myFunc (x)
Row.fromSeq (x.toSeq ++ Seq(y) )
}
val newSchema = StructType(testDf.schema.fields ++ Array(StructField("col2", dataType =StringType, nullable =false)))
spark.sqlContext.createDataFrame(rddRes, newSchema).show
Results:
+---+----+
| id|col1|
+---+----+
| 1| abc|
| 2| def|
| 3| ghi|
+---+----+
+---+----+------+
| id|col1| col2|
+---+----+------+
| 1| abc|abcxyz|
| 2| def|defxyz|
| 3| ghi|ghixyz|
+---+----+------+
With Dataset:
case class testData(id: Int, col1: String)
case class transformedData(id: Int, col1: String, col2: String)
val test: Dataset[testData] = List(testData(1, "abc"), testData(2, "def"), testData(3, "ghi")).toDS
val transformedData: Dataset[transformedData] = test
.map { x: testData =>
val newCol = x.col1 + "xyz"
transformedData(x.id, x.col1, newCol)
}
transformedData.show
As you can see datasets is more readable, plus provides strong type casting.
Since I'm unaware of your spark version, providing both solutions here. However if you're using spark v>=1.6, you should look into Datasets. Playing with rdd is fun, but can quickly devolve into longer job runs and a host of other issues that you wont foresee
I have a dataframe df of columns ("id", "current_date", "days") and I am trying to add the the "days" to "current_date" and create a new dataframe with new column called "new_date" using spark scala function date_add()
val newDF = df.withColumn("new_Date", date_add(df("current_date"), df("days").cast("Int")))
But looks like the function date_add only accepts Int values and not columns. How can get the desired output in such case? Are there any alternative functions i can use to get the desired output?
spark version: 1.6.0
scala version: 2.10.6
No need to use an UDF, you can do it using an SQL expression:
val newDF = df.withColumn("new_date", expr("date_add(current_date,days)"))
A small custom udf can be used to make this date arithmetic possible.
import org.apache.spark.sql.functions.udf
import java.util.concurrent.TimeUnit
import java.util.Date
import java.text.SimpleDateFormat
val date_add = udf((x: String, y: Int) => {
val sdf = new SimpleDateFormat("yyyy-MM-dd")
val result = new Date(sdf.parse(x).getTime() + TimeUnit.DAYS.toMillis(y))
sdf.format(result)
} )
Usage:
scala> val df = Seq((1, "2017-01-01", 10), (2, "2017-01-01", 20)).toDF("id", "current_date", "days")
df: org.apache.spark.sql.DataFrame = [id: int, current_date: string, days: int]
scala> df.withColumn("new_Date", date_add($"current_date", $"days")).show()
+---+------------+----+----------+
| id|current_date|days| new_Date|
+---+------------+----+----------+
| 1| 2017-01-01| 10|2017-01-11|
| 2| 2017-01-01| 20|2017-01-21|
+---+------------+----+----------+
How can I do string.replace("fromstr", "tostr") on a scala dataframe.
As far as I can see withColumnRenamed performs replace on all columns and not just the headers.
withColumnRenamed renames column names only, data remains the same. If you need to change rows context, you can use one of the following:
import sparkSession.implicits._
import org.apache.spark.sql.functions._
val inputDf = Seq("to_be", "misc").toDF("c1")
val resultd1Df = inputDf
.withColumn("c2", regexp_replace($"c1", "^to_be$", "not_to_be"))
.select($"c2".as("c1"))
resultd1Df.show()
val resultd2Df = inputDf
.withColumn("c2", when($"c1" === "to_be", "not_to_be").otherwise($"c1"))
.select($"c2".as("c1"))
resultd2Df.show()
def replace(mapping: Map[String, String]) = udf(
(from: String) => mapping.get(from).orElse(Some(from))
)
val resultd3Df = inputDf
.withColumn("c2", replace(Map("to_be" -> "not_to_be"))($"c1"))
.select($"c2".as("c1"))
resultd3Df.show()
Input dataframe:
+-----+
| c1|
+-----+
|to_be|
| misc|
+-----+
Result dataframe:
+---------+
| c1|
+---------+
|not_to_be|
| misc|
+---------+
You can find the list of available Spark functions there
I have a dataframe of format given below.
movieId1 | genreList1 | genreList2
--------------------------------------------------
1 |[Adventure,Comedy] |[Adventure]
2 |[Animation,Drama,War] |[War,Drama]
3 |[Adventure,Drama] |[Drama,War]
and trying to create another flag column which shows whether genreList2 is a subset of genreList1.
movieId1 | genreList1 | genreList2 | Flag
---------------------------------------------------------------
1 |[Adventure,Comedy] | [Adventure] |1
2 |[Animation,Drama,War] | [War,Drama] |1
3 |[Adventure,Drama] | [Drama,War] |0
I have tried this:
def intersect_check(a: Array[String], b: Array[String]): Int = {
if (b.sameElements(a.intersect(b))) { return 1 }
else { return 2 }
}
def intersect_check_udf =
udf((colvalue1: Array[String], colvalue2: Array[String]) => intersect_check(colvalue1, colvalue2))
data = data.withColumn("Flag", intersect_check_udf(col("genreList1"), col("genreList2")))
But this throws error
org.apache.spark.SparkException: Failed to execute user defined function.
P.S.: The above function (intersect_check) works for Arrays.
We can define an udf that calculates the length of the intersection between the two Array columns and checks whether it is equal to the length of the second column. If so, the second array is a subset of the first one.
Also, the inputs of your udf need to be class WrappedArray[String], not Array[String] :
import scala.collection.mutable.WrappedArray
import org.apache.spark.sql.functions.col
val same_elements = udf { (a: WrappedArray[String],
b: WrappedArray[String]) =>
if (a.intersect(b).length == b.length){ 1 }else{ 0 }
}
df.withColumn("test",same_elements(col("genreList1"),col("genreList2")))
.show(truncate = false)
+--------+-----------------------+------------+----+
|movieId1|genreList1 |genreList2 |test|
+--------+-----------------------+------------+----+
|1 |[Adventure, Comedy] |[Adventure] |1 |
|2 |[Animation, Drama, War]|[War, Drama]|1 |
|3 |[Adventure, Drama] |[Drama, War]|0 |
+--------+-----------------------+------------+----+
Data
val df = List((1,Array("Adventure","Comedy"), Array("Adventure")),
(2,Array("Animation","Drama","War"), Array("War","Drama")),
(3,Array("Adventure","Drama"),Array("Drama","War"))).toDF("movieId1","genreList1","genreList2")
Here is the solution converting using subsetOf
val spark =
SparkSession.builder().master("local").appName("test").getOrCreate()
import spark.implicits._
val data = spark.sparkContext.parallelize(
Seq(
(1,Array("Adventure","Comedy"),Array("Adventure")),
(2,Array("Animation","Drama","War"),Array("War","Drama")),
(3,Array("Adventure","Drama"),Array("Drama","War"))
)).toDF("movieId1", "genreList1", "genreList2")
val subsetOf = udf((col1: Seq[String], col2: Seq[String]) => {
if (col2.toSet.subsetOf(col1.toSet)) 1 else 0
})
data.withColumn("flag", subsetOf(data("genreList1"), data("genreList2"))).show()
Hope this helps!
One solution may be to exploit spark array builtin functions: genreList2 is subset of genreList1 if the intersection between the two is equal to genreList2. In the code below a sort_array operation has been added to avoid a mismatch between two arrays with different ordering but same elements.
val spark = {
SparkSession
.builder()
.master("local")
.appName("test")
.getOrCreate()
}
import spark.implicits._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
val df = Seq(
(1, Array("Adventure","Comedy"), Array("Adventure")),
(2, Array("Animation","Drama","War"), Array("War","Drama")),
(3, Array("Adventure","Drama"), Array("Drama","War"))
).toDF("movieId1", "genreList1", "genreList2")
df
.withColumn("flag",
sort_array(array_intersect($"genreList1",$"genreList2"))
.equalTo(
sort_array($"genreList2")
)
.cast("integer")
)
.show()
The output is
+--------+--------------------+------------+----+
|movieId1| genreList1| genreList2|flag|
+--------+--------------------+------------+----+
| 1| [Adventure, Comedy]| [Adventure]| 1|
| 2|[Animation, Drama...|[War, Drama]| 1|
| 3| [Adventure, Drama]|[Drama, War]| 0|
+--------+--------------------+------------+----+
This can also work here and it does not use udf
import spark.implicits._
val data = Seq(
(1,Array("Adventure","Comedy"),Array("Adventure")),
(2,Array("Animation","Drama","War"),Array("War","Drama")),
(3,Array("Adventure","Drama"),Array("Drama","War"))
).toDF("movieId1", "genreList1", "genreList2")
data
.withColumn("size",size(array_except($"genreList2",$"genreList1")))
.withColumn("flag",when($"size" === lit(0), 1) otherwise(0))
.show(false)
Spark 3.0+ (forall)
forall($"genreList2", x => array_contains($"genreList1", x)).cast("int")
Full example:
val df = Seq(
(1, Seq("Adventure", "Comedy"), Seq("Adventure")),
(2, Seq("Animation", "Drama","War"), Seq("War", "Drama")),
(3, Seq("Adventure", "Drama"), Seq("Drama", "War"))
).toDF("movieId1", "genreList1", "genreList2")
val df2 = df.withColumn("Flag", forall($"genreList2", x => array_contains($"genreList1", x)).cast("int"))
df2.show()
// +--------+--------------------+------------+----+
// |movieId1| genreList1| genreList2|Flag|
// +--------+--------------------+------------+----+
// | 1| [Adventure, Comedy]| [Adventure]| 1|
// | 2|[Animation, Drama...|[War, Drama]| 1|
// | 3| [Adventure, Drama]|[Drama, War]| 0|
// +--------+--------------------+------------+----+
I have a dataframe with two columns one of which (called dist) is a dense vector. How can I convert it back into an array column of integers.
+---+-----+
| id| dist|
+---+-----+
|1.0|[2.0]|
|2.0|[4.0]|
|3.0|[6.0]|
|4.0|[8.0]|
+---+-----+
I tried using several variants of the following udf but it returns a type mismatch error
val toInt4 = udf[Int, Vector]({ (a) => (a)})
val result = df.withColumn("dist", toDf4(df("dist"))).select("dist")
I struggled for a while to get the answer from #ThomasLuechtefeld working. But was running into this very frustrating error:
org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(features_scaled)' due to data type mismatch: argument 1 requires vector type, however, '`features_scaled`' is of vector type.
Turns out I needed to import DenseVector from the ml package instead of the mllib package.
So this worked for me:
import org.apache.spark.ml.linalg.DenseVector
import org.apache.spark.sql.functions._
val vectorToColumn = udf{ (x:DenseVector, index: Int) => x(index) }
myDataframe.withColumn("clusters_scaled",vectorToColumn(col("features_scaled"),lit(0)))
Yes, the only difference is that first line. This should absolutely be a comment, but I don't have the reputation. Sorry!
I think it's easiest to do it by going to the RDD API and then back.
import org.apache.spark.mllib.linalg.DenseVector
import org.apache.spark.sql.DataFrame
import org.apache.spark.rdd.RDD
import sqlContext._
// The original data.
val input: DataFrame =
sc.parallelize(1 to 4)
.map(i => i.toDouble -> new DenseVector(Array(i.toDouble * 2)))
.toDF("id", "dist")
// Turn it into an RDD for manipulation.
val inputRDD: RDD[(Double, DenseVector)] =
input.map(row => row.getAs[Double]("id") -> row.getAs[DenseVector]("dist"))
// Change the DenseVector into an integer array.
val outputRDD: RDD[(Double, Array[Int])] =
inputRDD.mapValues(_.toArray.map(_.toInt))
// Go back to a DataFrame.
val output = outputRDD.toDF("id", "dist")
output.show
You get:
+---+----+
| id|dist|
+---+----+
|1.0| [2]|
|2.0| [4]|
|3.0| [6]|
|4.0| [8]|
+---+----+
In spark 2.0 you can do something like:
import org.apache.spark.mllib.linalg.DenseVector
import org.apache.spark.sql.functions.udf
val vectorHead = udf{ x:DenseVector => x(0) }
df.withColumn("firstValue", vectorHead(df("vectorColumn")))