Spark will process the data in parallel, but not the operations. In my DAG I want to call a function per column like
Spark processing columns in parallel the values for each column could be calculated independently from other columns. Is there any way to achieve such parallelism via spark-SQL API? Utilizing window functions Spark dynamic DAG is a lot slower and different from hard coded DAG helped to optimize the DAG by a lot but only executes in a serial fashion.
An example which contains a little bit more information can be found https://github.com/geoHeil/sparkContrastCoding
The minimum example below:
val df = Seq(
(0, "A", "B", "C", "D"),
(1, "A", "B", "C", "D"),
(0, "d", "a", "jkl", "d"),
(0, "d", "g", "C", "D"),
(1, "A", "d", "t", "k"),
(1, "d", "c", "C", "D"),
(1, "c", "B", "C", "D")
).toDF("TARGET", "col1", "col2", "col3TooMany", "col4")
val inputToDrop = Seq("col3TooMany")
val inputToBias = Seq("col1", "col2")
val targetCounts = df.filter(df("TARGET") === 1).groupBy("TARGET").agg(count("TARGET").as("cnt_foo_eq_1"))
val newDF = df.toDF.join(broadcast(targetCounts), Seq("TARGET"), "left")
newDF.cache
def handleBias(df: DataFrame, colName: String, target: String = target) = {
val w1 = Window.partitionBy(colName)
val w2 = Window.partitionBy(colName, target)
df.withColumn("cnt_group", count("*").over(w2))
.withColumn("pre2_" + colName, mean(target).over(w1))
.withColumn("pre_" + colName, coalesce(min(col("cnt_group") / col("cnt_foo_eq_1")).over(w1), lit(0D)))
.drop("cnt_group")
}
val joinUDF = udf((newColumn: String, newValue: String, codingVariant: Int, results: Map[String, Map[String, Seq[Double]]]) => {
results.get(newColumn) match {
case Some(tt) => {
val nestedArray = tt.getOrElse(newValue, Seq(0.0))
if (codingVariant == 0) {
nestedArray.head
} else {
nestedArray.last
}
}
case None => throw new Exception("Column not contained in initial data frame")
}
})
Now I want to apply my handleBias function to all the columns, unfortunately, this is not executed in parallel.
val res = (inputToDrop ++ inputToBias).toSet.foldLeft(newDF) {
(currentDF, colName) =>
{
logger.info("using col " + colName)
handleBias(currentDF, colName)
}
}
.drop("cnt_foo_eq_1")
val combined = ((inputToDrop ++ inputToBias).toSet).foldLeft(res) {
(currentDF, colName) =>
{
currentDF
.withColumn("combined_" + colName, map(col(colName), array(col("pre_" + colName), col("pre2_" + colName))))
}
}
val columnsToUse = combined
.select(combined.columns
.filter(_.startsWith("combined_"))
map (combined(_)): _*)
val newNames = columnsToUse.columns.map(_.split("combined_").last)
val renamed = columnsToUse.toDF(newNames: _*)
val cols = renamed.columns
val localData = renamed.collect
val columnsMap = cols.map { colName =>
colName -> localData.flatMap(_.getAs[Map[String, Seq[Double]]](colName)).toMap
}.toMap
values for each column could be calculated independently from other columns
While it is true it doesn't really help your case. You can generate a number of independent DataFrames, each one with its own additions, but it doesn't mean you can automatically combine this into a single execution plan.
Each application of handleBias shuffles your data twice and output DataFrames don't have the same data distribution as the parent DataFrame. This is why when you fold over the list of columns each addition has to be performed separately.
Theoretically you could design a pipeline which can be expressed (with pseudocode) like this:
add unique id:
df_with_id = df.withColumn("id", unique_id())
compute each df independently and convert to wide format:
dfs = for (c in columns)
yield handle_bias(df, c).withColumn(
"pres", explode([(pre_name, pre_value), (pre2_name, pre2_value)])
)
union all partial results:
combined = dfs.reduce(union)
pivot to convert from long to wide format:
combined.groupBy("id").pivot("pres._1").agg(first("pres._2"))
but I doubt it is worth all the fuss. The process you use is extremely heavy as it is and requires a significant network and disk IO.
If number of total levels (sum count(distinct x)) for x in columns)) is relatively low you can try to compute all statistics with a single pass using for example aggregateByKey with Map[Tuple2[_, _], StatCounter] otherwise consider downsampling to the level where you can compute statistics locally.
Related
I have around 20-25 list of columns from conf file and have to aggregate first Notnull value. I tried the function to pass the column list and agg expr from reading the conf file.
I was able to get first function but couldn't find how to specify first with ignoreNull value as true.
The code that I tried is
def groupAndAggregate(df: DataFrame, cols: List[String] , aggregateFun: Map[String, String]): DataFrame = {
df.groupBy(cols.head, cols.tail: _*).agg(aggregateFun)
}
val df = sc.parallelize(Seq(
(0, null, "1"),
(1, "2", "2"),
(0, "3", "3"),
(0, "4", "4"),
(1, "5", "5"),
(1, "6", "6"),
(1, "7", "7")
)).toDF("grp", "col1", "col2")
//first
groupAndAggregate(df, List("grp"), Map("col1"-> "first", "col2"-> "COUNT") ).show()
+---+-----------+-----------+
|grp|first(col1)|count(col2)|
+---+-----------+-----------+
| 1| 2| 4|
| 0| | 3|
+---+-----------+-----------+
I need to get 3 as a result in place of null.
I am using Spark 2.1.0 and Scala 2.11
Edit 1:
If I use the following function
import org.apache.spark.sql.functions.{first,count}
df.groupBy("grp").agg(first(df("col1"), ignoreNulls = true), count("col2")).show()
I get my desired result. Can we pass the ignoreNulls true for first function in Map?
I have been able to achieve this by creating a list of Columns and passing it to agg function of groupBy. The earlier approach had an issue where i was not able to name the columns as the agg function was not returning me the order of columns in the output DF, i have renamed the columns in the list itself.
import org.apache.spark.sql.functions._
def groupAndAggregate(df: DataFrame): DataFrame = {
val list: ListBuffer[Column] = new ListBuffer[Column]()
try {
val columnFound = getAggColumns(df) // function to return a Map[String, String]
val agg_func = columnFound.entrySet().toList.
foreach(field =>
list += first(df(columnFound.getOrDefault(field.getKey, "")),ignoreNulls = true).as(field.getKey)
)
list += sum(df("col1")).as("watch_time")
list += count("*").as("frequency")
val groupColumns = getGroupColumns(df) // function to return a List[String]
val output = df.groupBy(groupColumns.head, groupColumns.tail: _*).agg(
list.head, list.tail: _*
)
output
} catch {
case e: Exception => {
e.printStackTrace()}
null
}
}
I think you should use na operator and drop all the nulls before you do aggregation.
na: DataFrameNaFunctions Returns a DataFrameNaFunctions for working with missing data.
drop(cols: Array[String]): DataFrame Returns a new DataFrame that drops rows containing any null or NaN values in the specified columns.
The code would then look as follows:
df.na.drop("col1").groupBy(...).agg(first("col1"))
That will impact count so you'd have to do count separately.
I have two DataFrames in my code with exact same dimensions, let's say 1,000,000 X 50. I need to add corresponding values in both dataframes. How to achieve that.
One option would be to add another column with ids, union both DataFrames and then use reduceByKey. But is there any other more elegent way?
Thanks.
Your approach is good. Another option can be two take the RDD and zip those together and then iterate over those to sum the columns and create a new dataframe using any of the original dataframe schemas.
Assuming the data types for all the columns are integer, this code snippets should work. Please note that, this has been done in spark 2.1.0.
import spark.implicits._
val a: DataFrame = spark.sparkContext.parallelize(Seq(
(1, 2),
(3, 6)
)).toDF("column_1", "column_2")
val b: DataFrame = spark.sparkContext.parallelize(Seq(
(3, 4),
(1, 5)
)).toDF("column_1", "column_2")
// Merge rows
val rows = a.rdd.zip(b.rdd).map{
case (rowLeft, rowRight) => {
val totalColumns = rowLeft.schema.fields.size
val summedRow = for(i <- (0 until totalColumns)) yield rowLeft.getInt(i) + rowRight.getInt(i)
Row.fromSeq(summedRow)
}
}
// Create new data frame
val ab: DataFrame = spark.createDataFrame(rows, a.schema) // use any of the schemas
ab.show()
Update:
So, I tried to experiment with the performance of my solution vs yours. I tested with 100000 rows and each row has 50 columns. In case of your approach it has 51 columns, the extra one is for the ID column. In a single machine(no cluster), my solution seems to work a bit faster.
The union and group by approach takes about 5598 milliseconds.
Where as my solution takes about 5378 milliseconds.
My assumption is the first solution takes a bit more time because of the union operation of the two dataframes.
Here are the methods which I created for testing the approaches.
def option_1()(implicit spark: SparkSession): Unit = {
import spark.implicits._
val a: DataFrame = getDummyData(withId = true)
val b: DataFrame = getDummyData(withId = true)
val allData = a.union(b)
val result = allData.groupBy($"id").agg(allData.columns.collect({ case col if col != "id" => (col, "sum") }).toMap)
println(result.count())
// result.show()
}
def option_2()(implicit spark: SparkSession): Unit = {
val a: DataFrame = getDummyData()
val b: DataFrame = getDummyData()
// Merge rows
val rows = a.rdd.zip(b.rdd).map {
case (rowLeft, rowRight) => {
val totalColumns = rowLeft.schema.fields.size
val summedRow = for (i <- (0 until totalColumns)) yield rowLeft.getInt(i) + rowRight.getInt(i)
Row.fromSeq(summedRow)
}
}
// Create new data frame
val result: DataFrame = spark.createDataFrame(rows, a.schema) // use any of the schemas
println(result.count())
// result.show()
}
I have a dataframe created which holds the join of 2 tables.
I want to compare each field of table1 to that of table2 (Schema is same)
Columns in Table A = colA1, colB1, colC1 , ...
Columns in Table B = colA2, colB2, colC2, ...
So, I need to filter out the data which satisfies the condition
(colA1 = colA2) AND (colB1 = colB2) AND (colC1 = colC2) and so on.
Since my table has a lot of fields, I tried to build a similar exp.
val filterCols = Seq("colA","colB","colC")
val sq = '"'
val exp = filterCols.map({ x => s"(join_df1($sq${x}1$sq) === join_df1($sq${x}2$sq))" }).mkString(" && ")
Resultant Exp : res28: String = (join_df1("colA1") === join_df1("colA2")) && (join_df1("colB1") === join_df1("colB2")) && (join_df1("colC1") === join_df1("colC2"))
Now when i try to substitute it to the dataframe, it throws me an error.
join_df1.filter($exp)
I am not sure whether I am doing it right .I need to find a way to substitute my expression and filter out value.
Any help is appreciated.
Thanks in advance
This is not valid SQL. Try:
val df = Seq(
("a", "a", "b", "b", "c", "c"),
("a", "A", "b", "B", "c", "C")).toDF("a1", "a2", "b1", "b2", "c1", "c2")
val filterCols = Seq("A", "B", "C")
val exp = filterCols.map(x => s"${x}1 = ${x}2").mkString(" AND ")
df.where(exp)
The generated output of reducebykey is an ShuffledRDD with key-value both as array of multiple field. I need to extract all the fields and write to a hive table.
Below is the code which I was trying:
sqlContext.sql(s"select SUBS_CIRCLE_ID,SUBS_MSISDN,EVENT_START_DT,RMNG_NW_OP_KEY, ACCESS_TYPE FROM FACT.FCT_MEDIATED_USAGE_DATA")
val USAGE_DATA_Reduce = USAGE_DATA.map{ USAGE_DATA => ((USAGE_DATA.getShort(0), USAGE_DATA.getString(1),USAGE_DATA.getString(2)),
(USAGE_DATA.getInt(3), USAGE_DATA.getInt(4)))}.reduceByKey((x, y) => (math.min(x._1, y._1), math.max(x._2,y._2)))
The final output what I am expecting is all the five fields as:
SUBS_CIRCLE_ID,SUBS_MSISDN,EVENT_START_DT, MINVAL, MAXVAL
So that it can be directly inserted to hive table
If you mean:
Given a RDD[(TupleN, TupleM)], how do I map each record's elements of both key and value tuples into a single concatenated string?
Here's a simplified version, you should be able extrapolate this to solve your problem:
val keyValueRdd = sc.parallelize(Seq(
(1, "key1") -> (10, "value1", "A"),
(2, "key2") -> (20, "value2", "B"),
(3, "key3") -> (30, "value3", "C")
))
val asStrings: RDD[String] = keyValueRdd.map {
case ((k1, k2), (v1, v2, v3)) => List(k1, k2, v1, v2, v3).mkString(",")
}
asStrings.foreach(println)
// prints:
// 3,key3,30,value3,C
// 2,key2,20,value2,B
// 1,key1,10,value1,A
I have the following data:
val RDDApp = sc.parallelize(List("A", "B", "C"))
val RDDUser = sc.parallelize(List(1, 2, 3))
val RDDInstalled = sc.parallelize(List((1, "A"), (1, "B"), (2, "B"), (2, "C"), (3, "A"))).groupByKey
val RDDCart = RDDUser.cartesian(RDDApp)
I want to map this data so that I have an RDD of tuples with (userId, Boolean if the letter is given for user). I thought I found a solution with this:
val results = RDDCart.map (entry =>
(entry._1, RDDInstalled.lookup(entry._1).contains(entry._2))
)
If I call results.first, I get org.apache.spark.SparkException: SPARK-5063. I see the problem with the Action within the Mapping function but do not know how I can work around it so that I get the same result.
Just join and mapValues:
RDDCart.join(RDDInstalled).mapValues{case (x, xs) => xs.toSeq.contains(x)}