Spark - How to convert map function output (Row,Row) tuple to one Dataframe - scala

I need to implement a scenario in Spark using the Scala API.
I am passing a user-defined function over a DataFrame; it processes each row of the data frame one by one and returns a tuple (Row, Row). How can I convert the resulting RDD[(Row, Row)] to a DataFrame of Row? See the code sample below.
**Calling the map function:**
val df_temp = df_outPut.map { x => AddUDF.add(x, date1, date2) }
**UDF definition:**
def add(x: Row, dates: String*): (Row, Row) = {
  ......................
  ........................
  var result1, result2: Row = Row()
  ..........
  return (result1, result2)
}
Now df_temp is an RDD[(Row, Row)]. My requirement is to turn it into a single RDD or DataFrame by splitting each tuple into two records, i.e. an RDD[Row]. I appreciate your help.

You can use flatMap to flatten your Row tuples. Say we start from this example RDD:
rddExample.collect()
// res37: Array[(org.apache.spark.sql.Row, org.apache.spark.sql.Row)] = Array(([1,2],[3,4]), ([2,1],[4,2]))
val flatRdd = rddExample.flatMap{ case (x, y) => List(x, y) }
// flatRdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[45] at flatMap at <console>:35
To convert it to a DataFrame:
import org.apache.spark.sql.types.{StructType, StructField, IntegerType}
val schema = StructType(
  StructField("x", IntegerType, true) ::
  StructField("y", IntegerType, true) :: Nil)
val df = sqlContext.createDataFrame(flatRdd, schema)
df.show
+---+---+
| x| y|
+---+---+
| 1| 2|
| 3| 4|
| 2| 1|
| 4| 2|
+---+---+
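Applied back to the question's df_outPut, a minimal sketch (assuming Spark 1.x, where DataFrame.map returns an RDD, and assuming AddUDF.add returns rows with the same schema as df_outPut; the flatRows and resultDF names are introduced here just for illustration):
// df_outPut.map yields an RDD[(Row, Row)]; flatten each pair into two rows,
// then rebuild a DataFrame reusing the original schema.
val df_temp = df_outPut.map { x => AddUDF.add(x, date1, date2) }
val flatRows = df_temp.flatMap { case (r1, r2) => Seq(r1, r2) }
val resultDF = sqlContext.createDataFrame(flatRows, df_outPut.schema)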

Related

Use rlike with regex column in spark 1.5.1

I want to filter a dataframe by applying a regex stored in one column to the values of another column.
Example:
Id  Column1  RegexColumm
1   Abc      A.*
2   Def      B.*
3   Ghi      G.*
The result of filtering the dataframe using RegexColumm should be the rows with id 1 and 3.
Is there a way to do this in Spark 1.5.1? I don't want to use a UDF as it might cause scalability issues; I'm looking for a Spark-native API.
You can convert the df to an rdd and then, traversing each row, match the regex and keep only the matching data, without using any UDF.
Example:
import org.apache.spark.sql.types._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
df.show()
//+---+-------+--------+
//| id|column1|regexCol|
//+---+-------+--------+
//| 1| Abc| A.*|
//| 2| Def| B.*|
//| 3| Ghi| G.*|
//+---+-------+--------+
//creating new schema to add new boolean field
val sch = StructType(df.schema.fields ++ Array(StructField("bool_col", BooleanType, false)))
//convert df to rdd and match the regex for each row using .map
val rdd = df.rdd.map(row => {
  val regex = row.getAs[String]("regexCol")
  // true when column1 matches the regex held in the same row
  val bool_col = row.getAs[String]("column1").matches(regex)
  Row.fromSeq(row.toSeq ++ Array(bool_col))
})
//convert the rdd back to a dataframe, keep rows where bool_col is true, then drop the helper column
val final_df = sqlContext.createDataFrame(rdd, sch).where(col("bool_col")).drop("bool_col")
final_df.show(10)
//+---+-------+--------+
//| id|column1|regexCol|
//+---+-------+--------+
//| 1| Abc| A.*|
//| 3| Ghi| G.*|
//+---+-------+--------+
UPDATE:
Instead of .map we can use .mapPartitions (map vs mapPartitions):
val rdd = df.rdd.mapPartitions(partitions => {
  partitions.map(row => {
    val regex = row.getAs[String]("regexCol")
    val bool_col = row.getAs[String]("column1").matches(regex)
    Row.fromSeq(row.toSeq ++ Array(bool_col))
  })
})
scala> val df = Seq((1,"Abc","A.*"),(2,"Def","B.*"),(3,"Ghi","G.*")).toDF("id","Column1","RegexColumm")
df: org.apache.spark.sql.DataFrame = [id: int, Column1: string ... 1 more field]
scala> val requiredDF = df.filter(x=> x.getAs[String]("Column1").matches(x.getAs[String]("RegexColumm")))
requiredDF: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, Column1: string ... 1 more field]
scala> requiredDF.show
+---+-------+-----------+
| id|Column1|RegexColumm|
+---+-------+-----------+
| 1| Abc| A.*|
| 3| Ghi| G.*|
+---+-------+-----------+
You can use it like above; I think this is what you are looking for. Please let me know if it helps.

How to append collection as new column to DataFrame with many columns?

I'd like to append (add) a new column to an existing dataframe with multiple columns.
val a = Seq(
("10", "MILLER", "1300", "2017-11-03"),
("30", "Martin", "1250", "2017-11-21")).toDF("dept_no","emp_name","sal","date")
scala> a.show
+-------+--------+----+----------+
|dept_no|emp_name| sal| date|
+-------+--------+----+----------+
| 10| MILLER|1300|2017-11-03|
| 30| Martin|1250|2017-11-21|
+-------+--------+----+----------+
With the above dataframe I'd like to add every element of a collection (be it a regular Scala collection or another single-column DataFrame), e.g.
val lst = List("10", "Susan")
How do I add the elements of lst above to the rows of the dataframe (one element per row)?
Let's convert lst to a DataFrame:
val lst = List("10", "Susan").toDF
You can use the zip method of RDD:
import org.apache.spark.sql.Row
val data = a.rdd.zip(lst.rdd).map { case (l, r) => Row.fromSeq(l.toSeq ++ r.toSeq) }
import org.apache.spark.sql.types.StructType
val schema = StructType(a.schema.fields ++ lst.schema.fields)
val solution = spark.createDataFrame(data, schema)
scala> solution.show
+-------+--------+----+----------+-----+
|dept_no|emp_name| sal| date|value|
+-------+--------+----+----------+-----+
| 10| MILLER|1300|2017-11-03| 10|
| 30| Martin|1250|2017-11-21|Susan|
+-------+--------+----+----------+-----+
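Note that RDD.zip assumes both RDDs have the same number of partitions and the same number of elements per partition; that holds here because lst.toDF produces exactly one row per row of a. If that assumption is shaky, a more defensive sketch is to index both sides with zipWithIndex and join on the index (the withRowId helper and rowId column are introduced here just for illustration):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}
// Attach a positional index to each side, then join the two frames on it.
def withRowId(df: org.apache.spark.sql.DataFrame) = spark.createDataFrame(
  df.rdd.zipWithIndex.map { case (r, i) => Row.fromSeq(r.toSeq :+ i) },
  StructType(df.schema.fields :+ StructField("rowId", LongType, false)))
val solution2 = withRowId(a).join(withRowId(lst), "rowId").drop("rowId")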

Spark withColumn - add column using non-Column type variable [duplicate]

This question already has answers here:
How to add a constant column in a Spark DataFrame?
(3 answers)
Closed 4 years ago.
How can I add a column to a data frame from a variable value?
I know that I can create a data frame using .toDF(colName) and that .withColumn is the method to add the column. But, when I try the following, I get a type mismatch error:
val myList = List(1,2,3)
val myArray = Array(1,2,3)
myList.toDF("myList")
.withColumn("myArray", myArray)
Type mismatch, expected: Column, actual: Array[Int]
This compile error is on myArray within the .withColumn call. How can I convert it from an Array[Int] to a Column type?
The error message says exactly what is wrong: you need to pass a Column (or a lit()) as the second argument to withColumn().
Try this:
import org.apache.spark.sql.functions.typedLit
val myList = List(1,2,3)
val myArray = Array(1,2,3)
myList.toDF("myList")
.withColumn("myArray", typedLit(myArray))
:)
Not sure withColumn is what you're actually seeking. You could apply lit() to make myArray conform to the method specs, but the result will be the same array value for every row in the DataFrame:
myList.toDF("myList").withColumn("myArray", lit(myArray)).
show
// +------+---------+
// |myList| myArray|
// +------+---------+
// | 1|[1, 2, 3]|
// | 2|[1, 2, 3]|
// | 3|[1, 2, 3]|
// +------+---------+
If you're trying to merge the two collections column-wise, it's a different transformation from what withColumn offers. In that case you'll need to convert each of them into a DataFrame and combine them via a join.
Now if the elements of the two collections are row-identifying and match each other pair-wise like in your example and you want to join them that way, you can simply join the converted DataFrames:
myList.toDF("myList").join(
myArray.toSeq.toDF("myArray"), $"myList" === $"myArray"
).show
// +------+-------+
// |myList|myArray|
// +------+-------+
// | 1| 1|
// | 2| 2|
// | 3| 3|
// +------+-------+
But in case the two collections have elements that aren't join-able and you simply want to merge them column-wise, you'll need compatible row-identifying columns in the two dataframes to join on. And if there are no such row-identifying columns, one approach is to create your own rowIds, as in the following example:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
val df1 = List("a", "b", "c").toDF("myList")
val df2 = Array("x", "y", "z").toSeq.toDF("myArray")
val rdd1 = df1.rdd.zipWithIndex.map {
  case (row: Row, id: Long) => Row.fromSeq(row.toSeq :+ id)
}
val df1withId = spark.createDataFrame(rdd1,
  StructType(df1.schema.fields :+ StructField("rowId", LongType, false))
)
val rdd2 = df2.rdd.zipWithIndex.map {
  case (row: Row, id: Long) => Row.fromSeq(row.toSeq :+ id)
}
val df2withId = spark.createDataFrame(rdd2,
  StructType(df2.schema.fields :+ StructField("rowId", LongType, false))
)
df1withId.join(df2withId, Seq("rowId")).show
// +-----+------+-------+
// |rowId|myList|myArray|
// +-----+------+-------+
// | 0| a| x|
// | 1| b| y|
// | 2| c| z|
// +-----+------+-------+

How to add header and column to dataframe spark?

I have a dataframe to which I want to add a header and a first column manually. Here is the dataframe:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val df = spark.read.option("header",true).option("inferSchema",true).csv("C:\\gg.csv").cache()
The content of the dataframe:
12,13,14
11,10,5
3,2,45
The expected output is
define,col1,col2,col3
c1,12,13,14
c2,11,10,5
c3,3,2,45
What you want to do is:
df.withColumn("columnName", column) // here "columnName" should be "define" for you
Now you just need to create that column, for example as in the sketch below.
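One way to build such a column with plain DataFrame functions (a sketch, not from the original answer; it assumes the c1, c2, ... labels only need to follow the read order, and the w and labelled names are introduced here just for illustration):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{concat, lit, monotonically_increasing_id, row_number}
// Number the rows 1..n in read order and prefix each number with "c".
// Note: a window with no partitioning moves all rows to a single partition.
val w = Window.orderBy(monotonically_increasing_id())
val labelled = df.withColumn("define",
  concat(lit("c"), row_number().over(w).cast("string")))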
Here is a solution that depends on Spark 2.4:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.Row
//First off, the dataframe needs to be loaded with the expected schema
val spark = SparkSession.builder().appName("my-spark-app").getOrCreate()
val schema = new StructType()
  .add("col1", IntegerType, true)
  .add("col2", IntegerType, true)
  .add("col3", IntegerType, true)
val df = spark.read.format("csv").schema(schema).load("C:\\gg.csv").cache()
val rddWithId = df.rdd.zipWithIndex
// Prepend a "define" column of type String (zipWithIndex starts at 0, hence c0, c1, c2 below; use index + 1 for c1, c2, c3)
val newSchema = StructType(Array(StructField("define", StringType, false)) ++ df.schema.fields)
val dfZippedWithId = spark.createDataFrame(rddWithId.map {
  case (row, index) => Row.fromSeq(Array("c" + index) ++ row.toSeq)
}, newSchema)
// Show results
dfZippedWithId.show
Displays:
+------+----+----+----+
|define|col1|col2|col3|
+------+----+----+----+
| c0| 12| 13| 14|
| c1| 11| 10| 5|
| c2| 3| 2| 45|
+------+----+----+----+
This is a mix of the documentation here and this example.

How to add columns into org.apache.spark.sql.Row inside of mapPartitions

I am a newbie at scala and spark, please keep that in mind :)
Actually, I have three questions:
How should I define a function to pass into df.rdd.mapPartitions if I want to create a new Row with a few additional columns?
How can I add a few columns to a Row object (or create a new one)?
How do I create a DataFrame from the resulting RDD?
Thank you in advance.
Usually there should be no need for that and it is better to use UDFs, but here you are:
How should I define a function to pass into df.rdd.mapPartitions if I want to create a new Row with a few additional columns?
It should take an Iterator[Row] and return an Iterator[T], so in your case you should use something like this:
import org.apache.spark.sql.Row
def transformRows(iter: Iterator[Row]): Iterator[Row] = ???
How can I add a few columns to a Row object (or create a new one)?
There are multiple ways of accessing Row values, including the Row.get* methods, Row.toSeq, etc. A new Row can be created using Row.apply, Row.fromSeq, Row.fromTuple or RowFactory. For example:
def transformRow(row: Row): Row = Row.fromSeq(row.toSeq ++ Array[Any](-1, 1))
How do I create a DataFrame from the resulting RDD?
If you have an RDD[Row] you can use SQLContext.createDataFrame and provide a schema.
Putting this all together:
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
val df = sc.parallelize(Seq(
  (1.0, 2.0), (0.0, -1.0),
  (3.0, 4.0), (6.0, -2.3))).toDF("x", "y")
def transformRows(iter: Iterator[Row]): Iterator[Row] = iter.map(transformRow)
val newSchema = StructType(df.schema.fields ++ Array(
StructField("z", IntegerType, false), StructField("v", IntegerType, false)))
sqlContext.createDataFrame(df.rdd.mapPartitions(transformRows), newSchema).show
// +---+----+---+---+
// | x| y| z| v|
// +---+----+---+---+
// |1.0| 2.0| -1| 1|
// |0.0|-1.0| -1| 1|
// |3.0| 4.0| -1| 1|
// |6.0|-2.3| -1| 1|
// +---+----+---+---+
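For completeness, since the extra values here are constants, the same result can be had without dropping to the RDD API at all, in line with the answer's opening remark; a minimal sketch with plain column expressions (no UDF even required for constants, and viaColumns is a name introduced here just for illustration):
import org.apache.spark.sql.functions.lit
// Add the same z and v columns as constant column expressions.
val viaColumns = df.withColumn("z", lit(-1)).withColumn("v", lit(1))
viaColumns.show()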