How can I update one column value in an RDD[Row]? - scala

I use Scala with Spark and want to update the value of one column in an RDD. My data looks like this:
[510116,8042,1,8298,20170907181326,1,3,lineno805]
[510116,8042,1,8152,20170907182101,1,3,lineno805]
[510116,8042,1,8154,20170907164311,1,3,lineno805]
[510116,8042,1,8069,20170907165031,1,3,lineno805]
[510116,8042,1,8061,20170907170254,1,3,lineno805]
[510116,8042,1,9906,20170907171417,1,3,lineno805]
[510116,8042,1,8295,20170907174734,1,3,lineno805]
My Scala code looks like this:
val getSerialRdd: RDD[Row]=……
I want to update the column that contains values such as 20170907181326, keeping only the first 10 characters of each value, so that the data looks like this:
[510116,8042,1,8298,2017090718,1,3,lineno805]
[510116,8042,1,8152,2017090718,1,3,lineno805]
[510116,8042,1,8154,2017090716,1,3,lineno805]
[510116,8042,1,8069,2017090716,1,3,lineno805]
[510116,8042,1,8061,2017090717,1,3,lineno805]
[510116,8042,1,9906,2017090717,1,3,lineno805]
[510116,8042,1,8295,2017090717,1,3,lineno805]
The output should still be of type RDD[Row].
How can I do this?

You can define an update method like this to update a field in the Row:
import org.apache.spark.sql.Row
def update(r: Row): Row = {
val s = r.toSeq
Row.fromSeq((s.take(4) :+ s(4).asInstanceOf[String].take(10)) ++ s.drop(5))
}
rdd.map(update(_)).collect
//res13: Array[org.apache.spark.sql.Row] =
// Array([510116,8042,1,8298,2017090718,1,3,lineno805],
// [510116,8042,1,8152,2017090718,1,3,lineno805],
// [510116,8042,1,8154,2017090716,1,3,lineno805],
// [510116,8042,1,8069,2017090716,1,3,lineno805],
// [510116,8042,1,8061,2017090717,1,3,lineno805],
// [510116,8042,1,9906,2017090717,1,3,lineno805],
// [510116,8042,1,8295,2017090717,1,3,lineno805])
A simpler approach would be to use the DataFrame API and the substring function:
1) Create a data frame from the rdd:
val df = spark.createDataFrame(rdd, rdd.take(1)(0).schema)
// df: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 6 more fields]
2) Use substring to transform the column (positions are 1-based, so start at 1):
import org.apache.spark.sql.functions.substring
df.withColumn("_c4", substring($"_c4", 1, 10)).show
+------+----+---+----+----------+---+---+---------+
| _c0| _c1|_c2| _c3| _c4|_c5|_c6| _c7|
+------+----+---+----+----------+---+---+---------+
|510116|8042| 1|8298|2017090718| 1| 3|lineno805|
|510116|8042| 1|8152|2017090718| 1| 3|lineno805|
|510116|8042| 1|8154|2017090716| 1| 3|lineno805|
|510116|8042| 1|8069|2017090716| 1| 3|lineno805|
|510116|8042| 1|8061|2017090717| 1| 3|lineno805|
|510116|8042| 1|9906|2017090717| 1| 3|lineno805|
|510116|8042| 1|8295|2017090717| 1| 3|lineno805|
+------+----+---+----+----------+---+---+---------+
3) Converting the data frame back to an RDD is easy:
val getSerialRdd = df.withColumn("_c4", substring($"_c4", 1, 10)).rdd
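For reference, here is a minimal self-contained sketch of the DataFrame route, assuming Spark 2.x or later with a local SparkSession. The question does not show a schema, so the column names c0 to c7 are made up for illustration, and all fields are assumed to be strings.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{col, substring}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Assumption: a local session, for experimentation only.
val spark = SparkSession.builder().master("local[*]").appName("update-column").getOrCreate()

// Hypothetical schema: eight string columns named c0..c7.
val schema = StructType((0 to 7).map(i => StructField(s"c$i", StringType, nullable = true)))

// Two sample rows taken from the question.
val rdd: RDD[Row] = spark.sparkContext.parallelize(Seq(
  Row("510116", "8042", "1", "8298", "20170907181326", "1", "3", "lineno805"),
  Row("510116", "8042", "1", "8154", "20170907164311", "1", "3", "lineno805")
))

// Keep the first 10 characters (yyyyMMddHH) of column c4; substring is 1-based.
val getSerialRdd: RDD[Row] = spark.createDataFrame(rdd, schema)
  .withColumn("c4", substring(col("c4"), 1, 10))
  .rdd

// Collect the truncated values to inspect them.
val hours = getSerialRdd.map(_.getString(4)).collect().sorted.toList
```

Passing an explicit schema avoids relying on the rows already carrying one, which rdd.take(1)(0).schema requires.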

In some cases you might want to update a row while keeping its schema:
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
def update(r: Row, i: Int, a: Any): Row = {
val s: Array[Any] = r
.toSeq
.toArray
.updated(i, a)
new GenericRowWithSchema(s, r.schema)
}
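// update takes the row, the field index and the new value; an RDD has no show, so collect instead
rdd.map(r => update(r, 4, r.getString(4).take(10))).collect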

Related

Use rlike with regex column in spark 1.5.1

I want to filter dataframe based on applying regex values in one of the columns to another column.
Example:
Id Column1 RegexColumm
1 Abc A.*
2 Def B.*
3 Ghi G.*
The result of filtering dataframe using RegexColumm should give rows with id 1 and 3.
Is there a way to do this in Spark 1.5.1? I don't want to use a UDF, as this might cause scalability issues; I'm looking for a native Spark API.
You can convert the DataFrame to an RDD, then traverse the rows, match the regex, and keep only the matching rows, all without using a UDF.
Example:
import org.apache.spark.sql.types._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
df.show()
//+---+-------+--------+
//| id|column1|regexCol|
//+---+-------+--------+
//| 1| Abc| A.*|
//| 2| Def| B.*|
//| 3| Ghi| G.*|
//+---+-------+--------+
//creating new schema to add new boolean field
val sch = StructType(df.schema.fields ++ Array(StructField("bool_col", BooleanType, false)))
//convert df to rdd and match the regex using .map
val rdd = df.rdd.map(row => {
val regex = row.getAs[String]("regexCol")
val bool = row.getAs[String]("column1").matches(regex)
val bool_col = s"$bool".toBoolean
val newRow = Row.fromSeq(row.toSeq ++ Array(bool_col))
newRow
})
//convert rdd to dataframe filter out true values for bool_col
val final_df = sqlContext.createDataFrame(rdd, sch).where(col("bool_col")).drop("bool_col")
final_df.show(10)
//+---+-------+--------+
//| id|column1|regexCol|
//+---+-------+--------+
//| 1| Abc| A.*|
//| 3| Ghi| G.*|
//+---+-------+--------+
UPDATE:
Instead of .map we can use .mapPartitions (map vs mapPartitions):
val rdd = df.rdd.mapPartitions(
partitions => {
partitions.map(row => {
val regex = row.getAs[String]("regexCol")
val bool = row.getAs[String]("column1").matches(regex)
val bool_col = s"$bool".toBoolean
val newRow = Row.fromSeq(row.toSeq ++ Array(bool_col))
newRow
})
})
scala> val df = Seq((1,"Abc","A.*"),(2,"Def","B.*"),(3,"Ghi","G.*")).toDF("id","Column1","RegexColumm")
df: org.apache.spark.sql.DataFrame = [id: int, Column1: string ... 1 more field]
scala> val requiredDF = df.filter(x=> x.getAs[String]("Column1").matches(x.getAs[String]("RegexColumm")))
requiredDF: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, Column1: string ... 1 more field]
scala> requiredDF.show
+---+-------+-----------+
| id|Column1|RegexColumm|
+---+-------+-----------+
| 1| Abc| A.*|
| 3| Ghi| G.*|
+---+-------+-----------+
You can filter as shown above; I think this is what you are looking for. Please let me know if it helps.
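Since the question asks for a native API, one option worth trying is the SQL rlike operator through expr, which accepts a column on the right-hand side. This is a minimal sketch assuming Spark 2.x or later with a local SparkSession; that it behaves the same on 1.5.1 (via sqlContext) is an assumption. Note that rlike looks for a match anywhere in the string, whereas String.matches requires the whole string to match, so anchor the patterns with ^ and $ if that difference matters.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

// Assumption: a local session, for experimentation only.
val spark = SparkSession.builder().master("local[*]").appName("rlike-column").getOrCreate()
import spark.implicits._

val df = Seq((1, "Abc", "A.*"), (2, "Def", "B.*"), (3, "Ghi", "G.*"))
  .toDF("id", "Column1", "RegexColumm")

// rlike is evaluated per row, so the pattern can come from another column.
val requiredDF = df.filter(expr("Column1 rlike RegexColumm"))

// Collect the ids of the rows that survived the filter.
val matchedIds = requiredDF.select("id").collect().map(_.getInt(0)).sorted.toList
```

Because the filter stays a Catalyst expression, no rows are deserialized into JVM objects the way they are in the lambda-based filter.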

Spark withColumn - add column using non-Column type variable [duplicate]

This question already has answers here:
How to add a constant column in a Spark DataFrame?
(3 answers)
Closed 4 years ago.
How can I add a column to a data frame from a variable value?
I know that I can create a data frame using .toDF(colName) and that .withColumn is the method to add the column. But, when I try the following, I get a type mismatch error:
val myList = List(1,2,3)
val myArray = Array(1,2,3)
myList.toDF("myList")
.withColumn("myArray", myArray)
Type mismatch, expected: Column, actual: Array[Int]
This compile error is on myArray within the .withColumn call. How can I convert it from an Array[Int] to a Column type?
The error message says exactly what is wrong: the second argument to withColumn() must be a Column (for example, one created with lit() or typedLit()).
Try this:
import org.apache.spark.sql.functions.typedLit
val myList = List(1,2,3)
val myArray = Array(1,2,3)
myList.toDF("myList")
.withColumn("myArray", typedLit(myArray))
:)
Not sure withColumn is what you're actually seeking. You could apply lit() to make myArray conform to the method specs, but the result will be the same array value for every row in the DataFrame:
myList.toDF("myList").withColumn("myArray", lit(myArray)).
show
// +------+---------+
// |myList| myArray|
// +------+---------+
// | 1|[1, 2, 3]|
// | 2|[1, 2, 3]|
// | 3|[1, 2, 3]|
// +------+---------+
If you're trying to merge the two collections column-wise, it's a different transformation from what withColumn offers. In that case you'll need to convert each of them into a DataFrame and combine them via a join.
Now if the elements of the two collections are row-identifying and match each other pair-wise like in your example and you want to join them that way, you can simply join the converted DataFrames:
myList.toDF("myList").join(
myArray.toSeq.toDF("myArray"), $"myList" === $"myArray"
).show
// +------+-------+
// |myList|myArray|
// +------+-------+
// | 1| 1|
// | 2| 2|
// | 3| 3|
// +------+-------+
But if the two collections have elements that aren't joinable and you simply want to merge them column-wise, you'll need compatible row-identifying columns in the two DataFrames to join on. If there are no such row-identifying columns, one approach is to create your own rowIds, as in the following example:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
val df1 = List("a", "b", "c").toDF("myList")
val df2 = Array("x", "y", "z").toSeq.toDF("myArray")
val rdd1 = df1.rdd.zipWithIndex.map{
case (row: Row, id: Long) => Row.fromSeq(row.toSeq :+ id)
}
val df1withId = spark.createDataFrame( rdd1,
StructType(df1.schema.fields :+ StructField("rowId", LongType, false))
)
val rdd2 = df2.rdd.zipWithIndex.map{
case (row: Row, id: Long) => Row.fromSeq(row.toSeq :+ id)
}
val df2withId = spark.createDataFrame( rdd2,
StructType(df2.schema.fields :+ StructField("rowId", LongType, false))
)
df1withId.join(df2withId, Seq("rowId")).show
// +-----+------+-------+
// |rowId|myList|myArray|
// +-----+------+-------+
// | 0| a| x|
// | 1| b| y|
// | 2| c| z|
// +-----+------+-------+
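If both collections are still local (as in the question) and have the same length, a simpler route may be to pair them up before creating the DataFrame at all. This is a sketch assuming Spark 2.x or later with a local SparkSession; note that zip silently truncates to the shorter collection, so check the lengths first.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session, for experimentation only.
val spark = SparkSession.builder().master("local[*]").appName("zip-columns").getOrCreate()
import spark.implicits._

val myList = List("a", "b", "c")
val myArray = Array("x", "y", "z")

// zip pairs elements by position: ("a","x"), ("b","y"), ("c","z").
val zipped = myList.zip(myArray).toDF("myList", "myArray")

// Collect the rows back as tuples to inspect them.
val pairs = zipped.collect().map(r => (r.getString(0), r.getString(1))).toList
```

This avoids the two zipWithIndex passes and the join, at the cost of only working while the data still fits on the driver.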

Spark - How to convert map function output (Row,Row) tuple to one Dataframe

I need to write one scenario in Spark using Scala API.
I am passing a user-defined function to a DataFrame; it processes the rows one by one and returns a tuple (Row, Row) for each. How can I change an RDD[(Row, Row)] into a DataFrame of single Rows? See the code sample below.
Calling the map function:
val df_temp = df_outPut.map { x => AddUDF.add(x,date1,date2)}
UDF definition:
def add(x: Row,dates: String*): (Row,Row) = {
......................
........................
var result1,result2:Row = Row()
..........
return (result1,result2)
Now df_temp is an RDD[(Row, Row)]. My requirement is to turn it into a single RDD[Row] or DataFrame by splitting each tuple so that every element becomes its own record. I appreciate your help.
You can use flatMap to flatten your Row tuples. Say we start from this example RDD:
rddExample.collect()
// res37: Array[(org.apache.spark.sql.Row, org.apache.spark.sql.Row)] = Array(([1,2],[3,4]), ([2,1],[4,2]))
val flatRdd = rddExample.flatMap{ case (x, y) => List(x, y) }
// flatRdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[45] at flatMap at <console>:35
To convert it to a data frame, supply a schema:
import org.apache.spark.sql.types.{StructType, StructField, IntegerType}
val schema = StructType(StructField("x", IntegerType, true)::
StructField("y", IntegerType, true)::Nil)
val df = sqlContext.createDataFrame(flatRdd, schema)
df.show
+---+---+
| x| y|
+---+---+
| 1| 2|
| 3| 4|
| 2| 1|
| 4| 2|
+---+---+

How to add columns into org.apache.spark.sql.Row inside of mapPartitions

I am a newbie at scala and spark, please keep that in mind :)
Actually, I have three questions:
1) How should I define a function to pass to df.rdd.mapPartitions if I want to create a new Row with a few additional columns?
2) How can I add a few columns to a Row object (or create a new one)?
3) How do I create a DataFrame from the resulting RDD?
Thank you in advance.
Usually there should be no need for that and it is better to use UDFs but here you are:
How should I define function to pass it into df.rdd.mapPartitions, if I want to create new Row with few additional columns
It should take an Iterator[Row] and return an Iterator[T], so in your case you should use something like this:
import org.apache.spark.sql.Row
def transformRows(iter: Iterator[Row]): Iterator[Row] = ???
How can I add few columns into Row object(or create a new one)
There are multiple ways of accessing Row values including Row.get* methods, Row.toSeq etc. New Row can be created using Row.apply, Row.fromSeq, Row.fromTuple or RowFactory. For example:
def transformRow(row: Row): Row = Row.fromSeq(row.toSeq ++ Array[Any](-1, 1))
How create DataFrame from created RDD
If you have RDD[Row] you can use SQLContext.createDataFrame and provide schema.
Putting this all together:
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
val df = sc.parallelize(Seq(
(1.0, 2.0), (0.0, -1.0),
(3.0, 4.0), (6.0, -2.3))).toDF("x", "y")
def transformRows(iter: Iterator[Row]): Iterator[Row] = iter.map(transformRow)
val newSchema = StructType(df.schema.fields ++ Array(
StructField("z", IntegerType, false), StructField("v", IntegerType, false)))
sqlContext.createDataFrame(df.rdd.mapPartitions(transformRows), newSchema).show
// +---+----+---+---+
// | x| y| z| v|
// +---+----+---+---+
// |1.0| 2.0| -1| 1|
// |0.0|-1.0| -1| 1|
// |3.0| 4.0| -1| 1|
// |6.0|-2.3| -1| 1|
// +---+----+---+---+
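mapPartitions pays off mainly when some setup should happen once per partition instead of once per row. The sketch below illustrates that with a formatter created once per partition; the formatter and the extra column name label are made up for illustration, and Spark 2.x or later with a local SparkSession is assumed.

```scala
import java.text.{DecimalFormat, DecimalFormatSymbols}
import java.util.Locale
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Assumption: a local session, for experimentation only.
val spark = SparkSession.builder().master("local[*]").appName("map-partitions").getOrCreate()
import spark.implicits._

val df = Seq((1.0, 2.0), (0.0, -1.0), (3.0, 4.0)).toDF("x", "y")

def transformRows(iter: Iterator[Row]): Iterator[Row] = {
  // Created once per partition and reused for every row in it.
  // Locale.ROOT keeps the decimal separator independent of the machine's locale.
  val fmt = new DecimalFormat("0.00", DecimalFormatSymbols.getInstance(Locale.ROOT))
  // Append the formatted value of column x as a new last field.
  iter.map(row => Row.fromSeq(row.toSeq :+ fmt.format(row.getDouble(0))))
}

// Hypothetical extra column "label" holding the formatted x value.
val newSchema = StructType(df.schema.fields :+ StructField("label", StringType, nullable = false))
val result = spark.createDataFrame(df.rdd.mapPartitions(transformRows), newSchema)

val labels = result.select("label").collect().map(_.getString(0)).sorted.toList
```

If there is no per-partition setup, a plain map over the rows does the same job with less ceremony.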

Derive multiple columns from a single column in a Spark DataFrame

I have a DataFrame, let's call it DFA, with a large block of parseable metadata stored as a single string column, ColmnA.
I would like to break ColmnA into multiple columns through a function, ClassXYZ = Func1(ColmnA). This function returns an instance of class ClassXYZ with multiple variables, and each of these variables has to be mapped to a new column, such as ColmnA1, ColmnA2, etc.
How would I do such a transformation from one DataFrame to another with these additional columns by calling Func1 just once, without having to repeat it to create each column?
It's easy to solve if I call this huge function every time I add a new column, but that is what I wish to avoid.
Please advise with working code or pseudocode.
Thanks
Sanjay
Generally speaking, what you want is not directly possible. A UDF can return only a single column at a time. There are two different ways you can overcome this limitation:
Return a column of complex type. The most general solution is a StructType but you can consider ArrayType or MapType as well.
import org.apache.spark.sql.functions.udf
val df = Seq(
(1L, 3.0, "a"), (2L, -1.0, "b"), (3L, 0.0, "c")
).toDF("x", "y", "z")
case class Foobar(foo: Double, bar: Double)
val foobarUdf = udf((x: Long, y: Double, z: String) =>
Foobar(x * y, z.head.toInt * y))
val df1 = df.withColumn("foobar", foobarUdf($"x", $"y", $"z"))
df1.show
// +---+----+---+------------+
// | x| y| z| foobar|
// +---+----+---+------------+
// | 1| 3.0| a| [3.0,291.0]|
// | 2|-1.0| b|[-2.0,-98.0]|
// | 3| 0.0| c| [0.0,0.0]|
// +---+----+---+------------+
df1.printSchema
// root
// |-- x: long (nullable = false)
// |-- y: double (nullable = false)
// |-- z: string (nullable = true)
// |-- foobar: struct (nullable = true)
// | |-- foo: double (nullable = false)
// | |-- bar: double (nullable = false)
This can be easily flattened later but usually there is no need for that.
Switch to RDD, reshape and rebuild DF:
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
def foobarFunc(x: Long, y: Double, z: String): Seq[Any] =
Seq(x * y, z.head.toInt * y)
val schema = StructType(df.schema.fields ++
Array(StructField("foo", DoubleType), StructField("bar", DoubleType)))
val rows = df.rdd.map(r => Row.fromSeq(
r.toSeq ++
foobarFunc(r.getAs[Long]("x"), r.getAs[Double]("y"), r.getAs[String]("z"))))
val df2 = sqlContext.createDataFrame(rows, schema)
df2.show
// +---+----+---+----+-----+
// | x| y| z| foo| bar|
// +---+----+---+----+-----+
// | 1| 3.0| a| 3.0|291.0|
// | 2|-1.0| b|-2.0|-98.0|
// | 3| 0.0| c| 0.0| 0.0|
// +---+----+---+----+-----+
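For the struct-returning UDF approach, flattening afterwards can be done with a star expansion on the struct column. This is a minimal sketch assuming Spark 2.x or later with a local SparkSession, run as a script or in the shell; the data mirrors the example above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

case class Foobar(foo: Double, bar: Double)

// Assumption: a local session, for experimentation only.
val spark = SparkSession.builder().master("local[*]").appName("flatten-struct").getOrCreate()
import spark.implicits._

val df = Seq((1L, 3.0, "a"), (2L, -1.0, "b"), (3L, 0.0, "c")).toDF("x", "y", "z")

// The UDF returns both derived values together as one struct column.
val foobarUdf = udf((x: Long, y: Double, z: String) => Foobar(x * y, z.head.toInt * y))

val flat = df
  .withColumn("foobar", foobarUdf($"x", $"y", $"z"))
  .select($"x", $"y", $"z", $"foobar.*") // expands the struct into columns foo and bar

val columns = flat.columns.toList
```

Depending on the Spark version, the optimizer may re-evaluate the UDF for each extracted field. If the function is expensive, caching the intermediate DataFrame or marking the UDF with asNondeterministic (available from Spark 2.3) are commonly suggested workarounds worth checking against your version.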
Assume that your function produces a sequence of elements that can be parsed from the string, as in the example below:
val df = sc.parallelize(List(("Mike,1986,Toronto", 30), ("Andre,1980,Ottawa", 36), ("jill,1989,London", 27))).toDF("infoComb", "age")
df.show
+------------------+---+
| infoComb|age|
+------------------+---+
|Mike,1986,Toronto| 30|
| Andre,1980,Ottawa| 36|
| jill,1989,London| 27|
+------------------+---+
What you can do with infoComb is split the string and derive more columns from it:
df.select(
expr("(split(infoComb, ','))[0]").cast("string").as("name"),
expr("(split(infoComb, ','))[1]").cast("integer").as("yearOfBorn"),
expr("(split(infoComb, ','))[2]").cast("string").as("city"),
$"age").show
+-----+----------+-------+---+
| name|yearOfBorn| city|age|
+-----+----------+-------+---+
|Mike| 1986|Toronto| 30|
|Andre| 1980| Ottawa| 36|
| jill| 1989| London| 27|
+-----+----------+-------+---+
Hope this helps.
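The same split can also be written with the Column API instead of SQL strings, building the split expression once and reusing it. This is a sketch assuming Spark 2.x or later with a local SparkSession; the data is the example from above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.split

// Assumption: a local session, for experimentation only.
val spark = SparkSession.builder().master("local[*]").appName("split-column").getOrCreate()
import spark.implicits._

val df = Seq(("Mike,1986,Toronto", 30), ("Andre,1980,Ottawa", 36), ("jill,1989,London", 27))
  .toDF("infoComb", "age")

// Build the split expression once and reuse it for each derived column.
val parts = split($"infoComb", ",")

val result = df.select(
  parts.getItem(0).as("name"),
  parts.getItem(1).cast("integer").as("yearOfBorn"),
  parts.getItem(2).as("city"),
  $"age"
)

val years = result.select("yearOfBorn").collect().map(_.getInt(0)).sorted.toList
```

Since split already returns strings, only the year needs a cast.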
If your resulting columns will have the same number of rows as the original one, you can create brand new columns with the withColumn function by applying a UDF. After this you can drop your original column, e.g.:
val newDf = myDf.withColumn("newCol1", myFun(myDf("originalColumn")))
.withColumn("newCol2", myFun2(myDf("originalColumn")))
.drop(myDf("originalColumn"))
where myFun is a UDF defined like this:
def myFun = udf(
(originalColumnContent: String) => {
// do something with your original column content and return a new value;
// trim is only a placeholder so that the lambda returns a String
originalColumnContent.trim
}
)
I opted to create a function to flatten one column and then just call it simultaneously with the udf.
First define this:
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.types.StructType
implicit class DfOperations(df: DataFrame) {
def flattenColumn(col: String) = {
def addColumns(df: DataFrame, cols: Array[String]): DataFrame = {
if (cols.isEmpty) df
else addColumns(
df.withColumn(col + "_" + cols.head, df(col + "." + cols.head)),
cols.tail
)
}
val field = df.select(col).schema.fields(0)
val newCols = field.dataType.asInstanceOf[StructType].fields.map(x => x.name)
addColumns(df, newCols).drop(col)
}
def withColumnMany(colName: String, col: Column) = {
df.withColumn(colName, col).flattenColumn(colName)
}
}
Then usage is very simple:
case class MyClass(a: Int, b: Int)
val df = sc.parallelize(Seq(
(0),
(1)
)).toDF("x")
val f = udf((x: Int) => MyClass(x*2,x*3))
df.withColumnMany("test", f($"x")).show()
// +---+------+------+
// | x|test_a|test_b|
// +---+------+------+
// | 0| 0| 0|
// | 1| 2| 3|
// +---+------+------+
This can be easily achieved by using the pivot function:
df4.groupBy("year").pivot("course").sum("earnings").collect()