How to add corresponding Integer values in 2 different DataFrames - scala

I have two DataFrames in my code with exact same dimensions, let's say 1,000,000 X 50. I need to add corresponding values in both dataframes. How to achieve that.
One option would be to add another column with ids, union both DataFrames and then use reduceByKey. But is there any other more elegent way?
Thanks.

Your approach is good. Another option can be two take the RDD and zip those together and then iterate over those to sum the columns and create a new dataframe using any of the original dataframe schemas.
Assuming the data types for all the columns are integer, this code snippets should work. Please note that, this has been done in spark 2.1.0.
import spark.implicits._
val a: DataFrame = spark.sparkContext.parallelize(Seq(
(1, 2),
(3, 6)
)).toDF("column_1", "column_2")
val b: DataFrame = spark.sparkContext.parallelize(Seq(
(3, 4),
(1, 5)
)).toDF("column_1", "column_2")
// Merge rows
val rows = a.rdd.zip(b.rdd).map{
case (rowLeft, rowRight) => {
val totalColumns = rowLeft.schema.fields.size
val summedRow = for(i <- (0 until totalColumns)) yield rowLeft.getInt(i) + rowRight.getInt(i)
Row.fromSeq(summedRow)
}
}
// Create new data frame
val ab: DataFrame = spark.createDataFrame(rows, a.schema) // use any of the schemas
ab.show()
Update:
So, I tried to experiment with the performance of my solution vs yours. I tested with 100000 rows and each row has 50 columns. In case of your approach it has 51 columns, the extra one is for the ID column. In a single machine(no cluster), my solution seems to work a bit faster.
The union and group by approach takes about 5598 milliseconds.
Where as my solution takes about 5378 milliseconds.
My assumption is the first solution takes a bit more time because of the union operation of the two dataframes.
Here are the methods which I created for testing the approaches.
def option_1()(implicit spark: SparkSession): Unit = {
import spark.implicits._
val a: DataFrame = getDummyData(withId = true)
val b: DataFrame = getDummyData(withId = true)
val allData = a.union(b)
val result = allData.groupBy($"id").agg(allData.columns.collect({ case col if col != "id" => (col, "sum") }).toMap)
println(result.count())
// result.show()
}
def option_2()(implicit spark: SparkSession): Unit = {
val a: DataFrame = getDummyData()
val b: DataFrame = getDummyData()
// Merge rows
val rows = a.rdd.zip(b.rdd).map {
case (rowLeft, rowRight) => {
val totalColumns = rowLeft.schema.fields.size
val summedRow = for (i <- (0 until totalColumns)) yield rowLeft.getInt(i) + rowRight.getInt(i)
Row.fromSeq(summedRow)
}
}
// Create new data frame
val result: DataFrame = spark.createDataFrame(rows, a.schema) // use any of the schemas
println(result.count())
// result.show()
}

Related

Modify udf to display values beyond 99999 in databricks spark scala

created a dataset with below schema
org.apache.spark.sql.Dataset[Records] = [value: string, RowNo: int]
Here value field is fixed length position which I would like to convert it to individual columns and add RowNo as last column using a UDF.
def ReadFixWidthFileWithRDD(SrcFileType:String, rdd: org.apache.spark.rdd.RDD[(String, String)], inputFileLength: Int = 6): DataFrame = {
val postapendSchemaRowNo=StructType(Array(StructField("RowNo", StringType, true)))
val inputLength =List(inputFileLength)
val FileInfoList = FixWidth_Dictionary.get(SrcFileType).toList
val fileSchema = FileInfoList(0)._1
val fileColumnSize = FileInfoList(0)._2
val fileSchemaWithFileName = StructType(fileSchema++postapendSchemaRowNo)
val fileColumnSizeWithFileNameLength = fileColumnSize:::inputLength
val data = rdd
var retDF = spark.createDataFrame(data.map{ x =>;
lsplit(fileColumnSizeWithFileNameLength,x._1+x._2)},fileSchemaWithFileName )
retDF
}
Now in the above function, I want to use a dataset instead of Rdd, as my RowNo is not displaying values beyond 99999.
can someone suggest an alternative
I got the solution.
I had created a Hashkey and associated sequence number into a dataframe.
The hashkey is also associated with a dataframe as well.
I joined those two after splitting the fixed length position.

Spark Dataframe select based on column index

How do I select all the columns of a dataframe that has certain indexes in Scala?
For example if a dataframe has 100 columns and i want to extract only columns (10,12,13,14,15), how to do the same?
Below selects all columns from dataframe df which has the column name mentioned in the Array colNames:
df = df.select(colNames.head,colNames.tail: _*)
If there is similar, colNos array which has
colNos = Array(10,20,25,45)
How do I transform the above df.select to fetch only those columns at the specific indexes.
You can map over columns:
import org.apache.spark.sql.functions.col
df.select(colNos map df.columns map col: _*)
or:
df.select(colNos map (df.columns andThen col): _*)
or:
df.select(colNos map (col _ compose df.columns): _*)
All the methods shown above are equivalent and don't impose performance penalty. Following mapping:
colNos map df.columns
is just a local Array access (constant time access for each index) and choosing between String or Column based variant of select doesn't affect the execution plan:
val df = Seq((1, 2, 3 ,4, 5, 6)).toDF
val colNos = Seq(0, 3, 5)
df.select(colNos map df.columns map col: _*).explain
== Physical Plan ==
LocalTableScan [_1#46, _4#49, _6#51]
df.select("_1", "_4", "_6").explain
== Physical Plan ==
LocalTableScan [_1#46, _4#49, _6#51]
#user6910411's answer above works like a charm and the number of tasks/logical plan is similar to my approach below. BUT my approach is a bit faster.
So,
I would suggest you to go with the column names rather than column numbers. Column names are much safer and much ligher than using numbers. You can use the following solution :
val colNames = Seq("col1", "col2" ...... "col99", "col100")
val selectColNames = Seq("col1", "col3", .... selected column names ... )
val selectCols = selectColNames.map(name => df.col(name))
df = df.select(selectCols:_*)
If you are hesitant to write all the 100 column names then there is a shortcut method too
val colNames = df.schema.fieldNames
Example: Grab first 14 columns of Spark Dataframe by Index using Scala.
import org.apache.spark.sql.functions.col
// Gives array of names by index (first 14 cols for example)
val sliceCols = df.columns.slice(0, 14)
// Maps names & selects columns in dataframe
val subset_df = df.select(sliceCols.map(name=>col(name)):_*)
You cannot simply do this (as I tried and failed):
// Gives array of names by index (first 14 cols for example)
val sliceCols = df.columns.slice(0, 14)
// Maps names & selects columns in dataframe
val subset_df = df.select(sliceCols)
The reason is that you have to convert your datatype of Array[String] to Array[org.apache.spark.sql.Column] in order for the slicing to work.
OR Wrap it in a function using Currying (high five to my colleague for this):
// Subsets Dataframe to using beg_val & end_val index.
def subset_frame(beg_val:Int=0, end_val:Int)(df: DataFrame): DataFrame = {
val sliceCols = df.columns.slice(beg_val, end_val)
return df.select(sliceCols.map(name => col(name)):_*)
}
// Get first 25 columns as subsetted dataframe
val subset_df:DataFrame = df_.transform(subset_frame(0, 25))

Dropping constant columns in a csv file

i would like to drop columns which are constant in a dataframe , here what i did , but i see that it tooks some much time to do it , special while writing the dataframe into the csv file , please any help to optimize the code to take less time
val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val df = spark.read.option("inferSchema", "true").option("header", "false").csv("D:\\ProcessDataSet\\anis_data\\Set _1Mud Pumps_Merged.csv")
val aggregations = df.drop("DateTime").columns.map(c => stddev("c").as(c))
val df2 = df.agg(aggregations.head, aggregations.tail: _*)
val columnsToKeep: Seq[String] = (df2.first match {
case r : Row => r.toSeq.toArray.map(_.asInstanceOf[Double])
}).zip(df.columns)
.filter(_._1 != 0) // your special condition is in the filter
.map(_._2) // keep just the name of the column
// select columns with stddev != 0
val finalResult = df.select(columnsToKeep.head, columnsToKeep.tail : _*)
finalResult.write.option("header",true).csv("D:\\ProcessDataSet\\dataWithoutConstant\\Set _1Mud Pumps_MergedCleaned.csv")
}
I think there is no much room left for optimization. You are doing the right thing.
Maybe what you can try is to cache() your dataframe df.
df is used in two separate Spark actions so it is loaded twice.
Try :
...
val df = spark.read.option("inferSchema", "true").option("header", "false").csv("D:\\ProcessDataSet\\anis_data\\Set _1Mud Pumps_Merged.csv")
df.cache()
val aggregations = df.drop("DateTime").columns.map(c => stddev("c").as(c))
...

Merge multiple Dataframes into one Dataframe in Spark [duplicate]

I have two DataFrame a and b.
a is like
Column 1 | Column 2
abc | 123
cde | 23
b is like
Column 1
1
2
I want to zip a and b (or even more) DataFrames which becomes something like:
Column 1 | Column 2 | Column 3
abc | 123 | 1
cde | 23 | 2
How can I do it?
Operation like this is not supported by a DataFrame API. It is possible to zip two RDDs but to make it work you have to match both number of partitions and number of elements per partition. Assuming this is the case:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, LongType}
val a: DataFrame = sc.parallelize(Seq(
("abc", 123), ("cde", 23))).toDF("column_1", "column_2")
val b: DataFrame = sc.parallelize(Seq(Tuple1(1), Tuple1(2))).toDF("column_3")
// Merge rows
val rows = a.rdd.zip(b.rdd).map{
case (rowLeft, rowRight) => Row.fromSeq(rowLeft.toSeq ++ rowRight.toSeq)}
// Merge schemas
val schema = StructType(a.schema.fields ++ b.schema.fields)
// Create new data frame
val ab: DataFrame = sqlContext.createDataFrame(rows, schema)
If above conditions are not met the only option that comes to mind is adding an index and join:
def addIndex(df: DataFrame) = sqlContext.createDataFrame(
// Add index
df.rdd.zipWithIndex.map{case (r, i) => Row.fromSeq(r.toSeq :+ i)},
// Create schema
StructType(df.schema.fields :+ StructField("_index", LongType, false))
)
// Add indices
val aWithIndex = addIndex(a)
val bWithIndex = addIndex(b)
// Join and clean
val ab = aWithIndex
.join(bWithIndex, Seq("_index"))
.drop("_index")
In Scala's implementation of Dataframes, there is no simple way to concatenate two dataframes into one. We can simply work around this limitation by adding indices to each row of the dataframes. Then, we can do a inner join by these indices. This is my stub code of this implementation:
val a: DataFrame = sc.parallelize(Seq(("abc", 123), ("cde", 23))).toDF("column_1", "column_2")
val aWithId: DataFrame = a.withColumn("id",monotonicallyIncreasingId)
val b: DataFrame = sc.parallelize(Seq((1), (2))).toDF("column_3")
val bWithId: DataFrame = b.withColumn("id",monotonicallyIncreasingId)
aWithId.join(bWithId, "id")
A little light reading - Check out how Python does this!
What about pure SQL ?
SELECT
room_name,
sender_nickname,
message_id,
row_number() over (partition by room_name order by message_id) as message_index,
row_number() over (partition by room_name, sender_nickname order by message_id) as user_message_index
from messages
order by room_name, message_id
I know the OP was using Scala but if, like me, you need to know how to do this in pyspark then try the Python code below. Like #zero323's first solution it relies on RDD.zip() and will therefore fail if both DataFrames don't have the same number of partitions and the same number of rows in each partition.
from pyspark.sql import Row
from pyspark.sql.types import StructType
def zipDataFrames(left, right):
CombinedRow = Row(*left.columns + right.columns)
def flattenRow(row):
left = row[0]
right = row[1]
combinedVals = [left[col] for col in left.__fields__] + [right[col] for col in right.__fields__]
return CombinedRow(*combinedVals)
zippedRdd = left.rdd.zip(right.rdd).map(lambda row: flattenRow(row))
combinedSchema = StructType(left.schema.fields + right.schema.fields)
return zippedRdd.toDF(combinedSchema)
joined = zipDataFrames(a, b)

How to zip two (or more) DataFrame in Spark

I have two DataFrame a and b.
a is like
Column 1 | Column 2
abc | 123
cde | 23
b is like
Column 1
1
2
I want to zip a and b (or even more) DataFrames which becomes something like:
Column 1 | Column 2 | Column 3
abc | 123 | 1
cde | 23 | 2
How can I do it?
Operation like this is not supported by a DataFrame API. It is possible to zip two RDDs but to make it work you have to match both number of partitions and number of elements per partition. Assuming this is the case:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, LongType}
val a: DataFrame = sc.parallelize(Seq(
("abc", 123), ("cde", 23))).toDF("column_1", "column_2")
val b: DataFrame = sc.parallelize(Seq(Tuple1(1), Tuple1(2))).toDF("column_3")
// Merge rows
val rows = a.rdd.zip(b.rdd).map{
case (rowLeft, rowRight) => Row.fromSeq(rowLeft.toSeq ++ rowRight.toSeq)}
// Merge schemas
val schema = StructType(a.schema.fields ++ b.schema.fields)
// Create new data frame
val ab: DataFrame = sqlContext.createDataFrame(rows, schema)
If above conditions are not met the only option that comes to mind is adding an index and join:
def addIndex(df: DataFrame) = sqlContext.createDataFrame(
// Add index
df.rdd.zipWithIndex.map{case (r, i) => Row.fromSeq(r.toSeq :+ i)},
// Create schema
StructType(df.schema.fields :+ StructField("_index", LongType, false))
)
// Add indices
val aWithIndex = addIndex(a)
val bWithIndex = addIndex(b)
// Join and clean
val ab = aWithIndex
.join(bWithIndex, Seq("_index"))
.drop("_index")
In Scala's implementation of Dataframes, there is no simple way to concatenate two dataframes into one. We can simply work around this limitation by adding indices to each row of the dataframes. Then, we can do a inner join by these indices. This is my stub code of this implementation:
val a: DataFrame = sc.parallelize(Seq(("abc", 123), ("cde", 23))).toDF("column_1", "column_2")
val aWithId: DataFrame = a.withColumn("id",monotonicallyIncreasingId)
val b: DataFrame = sc.parallelize(Seq((1), (2))).toDF("column_3")
val bWithId: DataFrame = b.withColumn("id",monotonicallyIncreasingId)
aWithId.join(bWithId, "id")
A little light reading - Check out how Python does this!
What about pure SQL ?
SELECT
room_name,
sender_nickname,
message_id,
row_number() over (partition by room_name order by message_id) as message_index,
row_number() over (partition by room_name, sender_nickname order by message_id) as user_message_index
from messages
order by room_name, message_id
I know the OP was using Scala but if, like me, you need to know how to do this in pyspark then try the Python code below. Like #zero323's first solution it relies on RDD.zip() and will therefore fail if both DataFrames don't have the same number of partitions and the same number of rows in each partition.
from pyspark.sql import Row
from pyspark.sql.types import StructType
def zipDataFrames(left, right):
CombinedRow = Row(*left.columns + right.columns)
def flattenRow(row):
left = row[0]
right = row[1]
combinedVals = [left[col] for col in left.__fields__] + [right[col] for col in right.__fields__]
return CombinedRow(*combinedVals)
zippedRdd = left.rdd.zip(right.rdd).map(lambda row: flattenRow(row))
combinedSchema = StructType(left.schema.fields + right.schema.fields)
return zippedRdd.toDF(combinedSchema)
joined = zipDataFrames(a, b)