Spark Scala: column-level mismatches between two DataFrames

I have two DataFrames:
val df1 = Seq((1, "1", "6"), (2, "10", "8"), (3, "6", "4")).toDF("id", "value1", "value2")
val df2 = Seq((1, "1", "6"), (2, "5", "4"), (3, "3", "1")).toDF("id", "value1", "value2")
I want to compute the column-level differences between them. The output should look like this:
id,value1_df1,value1_df2,diff_value1,value2_df1,value2_df2,diff_value2
1,1,1,0,6,6,0
2,10,5,5,8,4,4
3,6,3,3,4,1,3
I have hundreds of such columns and the column names are dynamic, so the difference between same-named columns of the two DataFrames has to be computed generically.

Maybe this will help:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.IntegerType
val spark = SparkSession.builder.appName("Test").master("local[*]").getOrCreate()
import spark.implicits._
var df1 = Seq((1, "1", "6"), (2, "10", "8"), (3, "6", "4")).toDF("id", "value1", "value2")
var df2 = Seq((1, "1", "6"), (2, "5", "4"), (3, "3", "1")).toDF("id", "value1", "value2")
// Cast every column to Int (the value columns are strings) so they can be subtracted
df1.columns.foreach(column => {
  df1 = df1.withColumn(column, df1.col(column).cast(IntegerType))
})
df2.columns.foreach(column => {
  df2 = df2.withColumn(column, df2.col(column).cast(IntegerType))
})
// Rename the key columns so they stay distinguishable after the join
df1 = df1.withColumnRenamed("id", "df1_id")
df2 = df2.withColumnRenamed("id", "df2_id")
df1.show()
df2.show()
At this point you have two DataFrames with the same value columns (value1, value2, and so on):
df1:
+------+------+------+
|df1_id|value1|value2|
+------+------+------+
|     1|     1|     6|
|     2|    10|     8|
|     3|     6|     4|
+------+------+------+
df2:
+------+------+------+
|df2_id|value1|value2|
+------+------+------+
|     1|     1|     6|
|     2|     5|     4|
|     3|     3|     1|
+------+------+------+
Now join them on the id:
var df3 = df1.alias("df1").join(df2.alias("df2"), $"df1.df1_id" === $"df2.df2_id")
Finally, take every column of df1 except the id (it is important that both DataFrames have the same columns) and add a new column holding the difference:
// tail skips the first column, which is assumed to be the id
df1.columns.tail.foreach(col => {
  val new_col_name = s"${col}-diff"
  val df_a_col = s"df1.${col}"
  val df_b_col = s"df2.${col}"
  df3 = df3.withColumn(new_col_name, df3.col(df_a_col) - df3.col(df_b_col))
})
df3.show()
Result:
+------+------+------+------+------+------+-----------+-----------+
|df1_id|value1|value2|df2_id|value1|value2|value1-diff|value2-diff|
+------+------+------+------+------+------+-----------+-----------+
|     1|     1|     6|     1|     1|     6|          0|          0|
|     2|    10|     8|     2|     5|     4|          5|          4|
|     3|     6|     4|     3|     3|     1|          3|          3|
+------+------+------+------+------+------+-----------+-----------+
This is the result. It is dynamic, so it works for however many valueX columns you add.
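For comparison, the same join-and-subtract idea can probably be written without mutable variables, producing the column layout the question asked for. This is only a sketch: the names left, right, outputCols and diffDf, the aliases a and b, and the app name are made up for illustration, and it assumes every column other than id is a numeric string. It is meant to be pasted into spark-shell or a Scala worksheet.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.appName("ColumnDiff").master("local[*]").getOrCreate()
import spark.implicits._

val left = Seq((1, "1", "6"), (2, "10", "8"), (3, "6", "4")).toDF("id", "value1", "value2")
val right = Seq((1, "1", "6"), (2, "5", "4"), (3, "3", "1")).toDF("id", "value1", "value2")

// Assumption: every column except the join key is a value column to compare.
val valueCols = left.columns.filter(_ != "id")

// For each value column emit the left value, the right value and their integer difference.
val outputCols = col("a.id") +: valueCols.flatMap { c =>
  Seq(
    col(s"a.$c").as(s"${c}_df1"),
    col(s"b.$c").as(s"${c}_df2"),
    (col(s"a.$c").cast("int") - col(s"b.$c").cast("int")).as(s"diff_$c")
  )
}

// Inner join on id, then select all generated columns in one pass.
val diffDf = left.alias("a")
  .join(right.alias("b"), col("a.id") === col("b.id"))
  .select(outputCols: _*)

diffDf.orderBy("id").show()
```

Building the whole column list first and calling select once keeps the query plan small, which may matter with hundreds of columns; repeated withColumn calls add one projection per column.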

Related

Spark: map columns of a dataframe to their ID of the distinct elements

I have the following dataframe of two columns of string type A and B:
val df = (
  spark
    .createDataFrame(
      Seq(
        ("a1", "b1"),
        ("a1", "b2"),
        ("a1", "b2"),
        ("a2", "b3")
      )
    )
).toDF("A", "B")
I create maps between the distinct elements of each column and a set of integers:
val mapColA = (
  df
    .select("A")
    .distinct
    .rdd
    .zipWithIndex
    .collectAsMap
)
val mapColB = (
  df
    .select("B")
    .distinct
    .rdd
    .zipWithIndex
    .collectAsMap
)
Now I want to create new columns in the DataFrame by applying those maps to their corresponding columns. For one map only, this would be
df.select("A").map(x=>mapColA.get(x)).show()
However, I don't understand how to apply each map to its corresponding column and create two new columns (e.g. with withColumn). The expected result would be
val result = (
  spark
    .createDataFrame(
      Seq(
        ("a1", "b1", 1, 1),
        ("a1", "b2", 1, 2),
        ("a1", "b2", 1, 2),
        ("a2", "b3", 2, 3)
      )
    )
).toDF("A", "B", "idA", "idB")
Could you help me?
If I understood correctly, this can be achieved using dense_rank:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.dense_rank
val df2 = df.withColumn("idA", dense_rank().over(Window.orderBy("A")))
  .withColumn("idB", dense_rank().over(Window.orderBy("B")))
df2.show
+---+---+---+---+
|  A|  B|idA|idB|
+---+---+---+---+
| a1| b1|  1|  1|
| a1| b2|  1|  2|
| a1| b2|  1|  2|
| a2| b3|  2|  3|
+---+---+---+---+
If you want to stick with your original code, you can make some modifications. Note that zipWithIndex numbers the elements from 0, in whatever order the distinct values happen to arrive, so the ids differ from the dense_rank result:
val mapColA = df.select("A").distinct().rdd.map(r => r.getAs[String](0)).zipWithIndex.collectAsMap
val mapColB = df.select("B").distinct().rdd.map(r => r.getAs[String](0)).zipWithIndex.collectAsMap
val df2 = df.map(r => (
  r.getAs[String](0),
  r.getAs[String](1),
  mapColA.get(r.getAs[String](0)),
  mapColB.get(r.getAs[String](1))
)).toDF("A", "B", "idA", "idB")
df2.show
+---+---+---+---+
|  A|  B|idA|idB|
+---+---+---+---+
| a1| b1|  1|  2|
| a1| b2|  1|  0|
| a1| b2|  1|  0|
| a2| b3|  0|  1|
+---+---+---+---+
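Since the question asked specifically about withColumn, here is a rough sketch of applying a collected map through a UDF. The helper idMap and the names lookupA, lookupB and withIds are hypothetical. The sketch sorts the distinct values before numbering them from 1, which is an assumption made so that the ids are stable and match the expected result, and it only makes sense when the distinct values fit in driver memory.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder.appName("MapColumnsToIds").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(("a1", "b1"), ("a1", "b2"), ("a1", "b2"), ("a2", "b3")).toDF("A", "B")

// Hypothetical helper: collect the distinct values of one string column,
// sort them for a stable order, and assign 1-based ids.
def idMap(column: String): Map[String, Long] =
  df.select(column).distinct().as[String].collect().sorted
    .zipWithIndex
    .map { case (value, index) => value -> (index + 1L) }
    .toMap

val mapColA = idMap("A")
val mapColB = idMap("B")

// The UDFs return Option[Long], so a value missing from the map becomes null.
val lookupA = udf((a: String) => mapColA.get(a))
val lookupB = udf((b: String) => mapColB.get(b))

val withIds = df
  .withColumn("idA", lookupA(col("A")))
  .withColumn("idB", lookupB(col("B")))

withIds.show()
```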

Group by after group by spark

I have a dataframe with 4 columns: col1, col2, col3 and col4. I need to:
Group the dataframe by the key columns col1 and col2.
Then group by each of the other columns (col3 and col4) and display the counts of their values.
Input
col1 col2 col3 col4
1 1 2 4
1 1 2 4
1 1 3 5
Output
col1 col2 col_name col_value cnt
1 1 col3 2 2
1 1 col3 3 1
1 1 col4 4 2
1 1 col4 5 1
Is this possible?
This is a case for a melt-like operation. You can use the implementation provided by ahue as an answer to How to melt Spark DataFrame?. The melt method called below is the helper defined in that answer.
val df = Seq(
  (1, 1, 2, 4), (1, 1, 2, 4), (1, 1, 3, 5)
).toDF("col1", "col2", "col3", "col4")
df.melt(
  Seq("col1", "col2"), Seq("col3", "col4"), "col_name", "col_value"
).groupBy("col1", "col2", "col_name", "col_value").count.show
// +----+----+--------+---------+-----+
// |col1|col2|col_name|col_value|count|
// +----+----+--------+---------+-----+
// |   1|   1|    col3|        3|    1|
// |   1|   1|    col4|        5|    1|
// |   1|   1|    col4|        4|    2|
// |   1|   1|    col3|        2|    2|
// +----+----+--------+---------+-----+
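For readers without the linked answer at hand, a melt helper along those lines might look like the sketch below. It is written as a plain function rather than a method on DataFrame; the function name melt, its parameter names and the temporary column _vars_and_vals are illustrative, and it assumes all value columns share one type.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, col, explode, lit, struct}

val spark = SparkSession.builder.appName("MeltSketch").master("local[*]").getOrCreate()
import spark.implicits._

// Turn each value column into a (name, value) struct, pack the structs into an
// array, and explode the array so every input row yields one row per value column.
def melt(df: DataFrame, idVars: Seq[String], valueVars: Seq[String],
         varName: String, valueName: String): DataFrame = {
  val varsAndVals = array(
    valueVars.map(c => struct(lit(c).as(varName), col(c).as(valueName))): _*
  )
  val outputCols = idVars.map(col) ++
    Seq(varName, valueName).map(n => col("_vars_and_vals").getField(n).as(n))
  df.withColumn("_vars_and_vals", explode(varsAndVals)).select(outputCols: _*)
}

val df = Seq((1, 1, 2, 4), (1, 1, 2, 4), (1, 1, 3, 5)).toDF("col1", "col2", "col3", "col4")

val counts = melt(df, Seq("col1", "col2"), Seq("col3", "col4"), "col_name", "col_value")
  .groupBy("col1", "col2", "col_name", "col_value")
  .count()

counts.show()
```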
Here's one approach which should work for arbitrary numbers of key columns and value columns (note that the sample dataset has been expanded for illustration purposes):
val df = Seq(
  (1, 1, 2, 4, 6),
  (1, 1, 2, 4, 7),
  (1, 1, 3, 5, 7)
).toDF("col1", "col2", "col3", "col4", "col5")
import org.apache.spark.sql.functions._
val keyCols = Seq("col1", "col2")
val valCols = Seq("col3", "col4", "col5")
// One grouped DataFrame per value column, all with the same output schema
val dfList = valCols.map( c => {
  val grpCols = keyCols :+ c
  df.groupBy(grpCols.head, grpCols.tail: _*).agg(count(col(c)).as("cnt")).
    select(keyCols.map(col) :+ lit(c).as("col_name") :+ col(c).as("col_value") :+ col("cnt"): _*)
} )
dfList.reduce(_ union _).show
// +----+----+--------+---------+---+
// |col1|col2|col_name|col_value|cnt|
// +----+----+--------+---------+---+
// |   1|   1|    col3|        3|  1|
// |   1|   1|    col3|        2|  2|
// |   1|   1|    col4|        4|  2|
// |   1|   1|    col4|        5|  1|
// |   1|   1|    col5|        6|  1|
// |   1|   1|    col5|        7|  2|
// +----+----+--------+---------+---+
We can use groupBy and union to achieve this.
import org.apache.spark.sql.functions.{col, count, lit}
val x = Seq((1, 1, 2, 4), (1, 1, 2, 4), (1, 1, 3, 5)).toDF("col1", "col2", "col3", "col4")
// Counts per distinct col3 value within each (col1, col2) group
val y = x.groupBy("col1", "col2", "col3").
  agg(count(col("col3")).alias("cnt")).
  withColumn("col_name", lit("col3")).
  select(col("col1"), col("col2"), col("col_name"), col("col3").alias("col_value"), col("cnt"))
// Same for col4
val z = x.groupBy("col1", "col2", "col4").
  agg(count(col("col4")).alias("cnt")).
  withColumn("col_name", lit("col4")).
  select(col("col1"), col("col2"), col("col_name"), col("col4").alias("col_value"), col("cnt"))
y.union(z).show()

Grouping by values on a Spark Dataframe

I'm working on a Spark dataframe containing this kind of data:
A,1,2,3
B,1,2,3
C,1,2,3
D,4,2,3
I want to aggregate this data on the last three columns, so the output would be:
ABC,1,2,3
D,4,2,3
How can I do it in Scala? (This is not a big dataframe, so performance is secondary here.)
As mentioned in the comments, you can first use groupBy to group on the last three columns and then use concat_ws on the first column. Here is one way of doing it:
import org.apache.spark.sql.functions.{collect_list, concat_ws}
// create your original DF
val df = Seq(("A", 1, 2, 3), ("B", 1, 2, 3), ("C", 1, 2, 3), ("D", 4, 2, 3)).toDF("col1", "col2", "col3", "col4")
df.show
// output
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|   A|   1|   2|   3|
|   B|   1|   2|   3|
|   C|   1|   2|   3|
|   D|   4|   2|   3|
+----+----+----+----+
// group by "col2", "col3", "col4", collect "col1" into a list,
// then convert the list to a string
df.groupBy("col2", "col3", "col4")
  .agg(collect_list("col1").as("col1"))
  // you can change the string separator via the first argument of concat_ws
  .select(concat_ws("", $"col1") as "col1", $"col2", $"col3", $"col4").show
// output
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|   D|   4|   2|   3|
| ABC|   1|   2|   3|
+----+----+----+----+
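One caveat with this approach: collect_list makes no promise about element order after a shuffle, so on a larger dataset the concatenated string could come out as, say, BAC. If alphabetical order is what you want, a variant along these lines might help. It is a sketch, with the name grouped and the app name chosen only for illustration, and it sorts the collected list with sort_array before joining it.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, concat_ws, sort_array}

val spark = SparkSession.builder.appName("GroupConcat").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(("A", 1, 2, 3), ("B", 1, 2, 3), ("C", 1, 2, 3), ("D", 4, 2, 3))
  .toDF("col1", "col2", "col3", "col4")

// Collect col1 per group, sort the list so the result is deterministic,
// then join the elements with an empty separator.
val grouped = df.groupBy("col2", "col3", "col4")
  .agg(concat_ws("", sort_array(collect_list("col1"))).as("col1"))
  .select("col1", "col2", "col3", "col4")

grouped.show()
```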
Alternatively, you can map each row to its key (here c2, c3, c4) and then concatenate the values with reduceByKey. The final map formats each row as needed. It should look something like the following:
val data = sc.parallelize(List(
  ("A", "1", "2", "3"),
  ("B", "1", "2", "3"),
  ("C", "1", "2", "3"),
  ("D", "4", "2", "3")))
val res = data.map{ case (c1, c2, c3, c4) => ((c2, c3, c4), String.valueOf(c1)) }
  .reduceByKey((x, y) => x + y)
  .map(v => v._2.toString + "," + v._1.productIterator.toArray.mkString(","))
  .collect

Understanding pivot and agg

I have the following columns in DataFrame df:
c_id p_id type values
278230 57371100 11 1
278230 57371100 12 1
...
I execute the following code and expect to see columns 11_total and 12_total:
df
  .groupBy($"c_id",$"p_id")
  .pivot("type")
  .agg(sum("values") as "total")
  .na.fill(0)
  .show()
Instead, I get columns 11 and 12:
+-----------+----------+---+---+
|       c_id|      p_id| 11| 12|
+-----------+----------+---+---+
|     278230|  57371100|  0|  1|
|     337790|  72031970|  3|  0|
|     320710|  71904400|  0|  1|
Why?
That's because Spark combines the pivot value with the aggregation alias (as in 11_total) only when there are multiple aggregations. With a single aggregation, the columns are named after the pivot values alone:
val df = Seq(
  (278230, 57371100, 11, 1),
  (278230, 57371100, 12, 2),
  (337790, 72031970, 11, 1),
  (337790, 72031970, 11, 2),
  (337790, 72031970, 12, 3)
).toDF("c_id", "p_id", "type", "values")
df.groupBy($"c_id", $"p_id").pivot("type").
  agg(sum("values").as("total")).
  show
// +------+--------+---+---+
// |  c_id|    p_id| 11| 12|
// +------+--------+---+---+
// |337790|72031970|  3|  3|
// |278230|57371100|  1|  2|
// +------+--------+---+---+
df.groupBy($"c_id", $"p_id").pivot("type").
  agg(sum("values").as("total"), max("values").as("max")).
  show
// +------+--------+--------+------+--------+------+
// |  c_id|    p_id|11_total|11_max|12_total|12_max|
// +------+--------+--------+------+--------+------+
// |337790|72031970|       3|     2|       3|     3|
// |278230|57371100|       1|     1|       2|     2|
// +------+--------+--------+------+--------+------+
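If you still want names like 11_total with only one aggregation, one option is to rename the pivoted columns afterwards. The following is a minimal sketch under the assumption that every column other than the grouping keys came from the pivot; keyCols, pivoted and renamed are names introduced here for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum}

val spark = SparkSession.builder.appName("PivotRename").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(
  (278230, 57371100, 11, 1),
  (278230, 57371100, 12, 2),
  (337790, 72031970, 11, 1),
  (337790, 72031970, 11, 2),
  (337790, 72031970, 12, 3)
).toDF("c_id", "p_id", "type", "values")

val keyCols = Seq("c_id", "p_id")

// Single aggregation: the pivoted columns are named "11" and "12".
val pivoted = df.groupBy(keyCols.map(col): _*).pivot("type").agg(sum("values"))

// Append the suffix to every column that is not a grouping key.
val renamed = pivoted.columns
  .filter(c => !keyCols.contains(c))
  .foldLeft(pivoted)((acc, c) => acc.withColumnRenamed(c, s"${c}_total"))

renamed.show()
```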

How to calculate connections of the node in Spark 2

I have the following DataFrame df:
val df = Seq(
(1, 0, 1, 0, 0), (1, 4, 1, 0, 4), (2, 2, 1, 2, 2),
(4, 3, 1, 4, 4), (4, 5, 1, 4, 4)
).toDF("from", "to", "attr", "type_from", "type_to")
+-----+-----+----+---------------+---------------+
|from |to   |attr|type_from      |type_to        |
+-----+-----+----+---------------+---------------+
|    1|    0|   1|              0|              0|
|    1|    4|   1|              0|              4|
|    2|    2|   1|              2|              2|
|    4|    3|   1|              4|              4|
|    4|    5|   1|              4|              4|
+-----+-----+----+---------------+---------------+
I want to count the number of ingoing and outgoing links for each node only when the type of from node is the same as the type of to node (i.e. the values of type_from and type_to).
The cases when to and from are equal should be excluded.
This is how I calculate the number of outgoing links, based on this answer proposed by user8371915:
val outgoing = df
  .where($"type_from" === $"type_to" && $"from" =!= $"to")
  .groupBy($"from" as "nodeId", $"type_from" as "type")
  .agg(count("*") as "numOutLinks")
outgoing.show()
Of course, I can repeat the same calculation for the incoming links and then join the results:
val incoming = df
  .where($"type_from" === $"type_to" && $"from" =!= $"to")
  .groupBy($"to" as "nodeId", $"type_to" as "type")
  .agg(count("*") as "numInLinks")
val df_result = outgoing.join(incoming, Seq("nodeId", "type"), "rightouter")
But is there a shorter solution?
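One possibly shorter route, offered as a sketch rather than a definitive answer, is to filter once, list each valid edge twice (once per endpoint, tagged with a direction) and count both directions in a single pivot. The names valid, endpoints, linkCounts and the direction column are assumptions introduced here. Unlike a right outer join, this keeps nodes that have links in only one direction.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder.appName("NodeLinks").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, 0, 1, 0, 0), (1, 4, 1, 0, 4), (2, 2, 1, 2, 2),
  (4, 3, 1, 4, 4), (4, 5, 1, 4, 4)
).toDF("from", "to", "attr", "type_from", "type_to")

// Keep only edges between distinct nodes of the same type.
val valid = df.where($"type_from" === $"type_to" && $"from" =!= $"to")

// Each valid edge contributes one "out" row for its source and one "in" row for its target.
val endpoints = valid
  .select($"from".as("nodeId"), $"type_from".as("type"), lit("out").as("direction"))
  .union(valid.select($"to".as("nodeId"), $"type_to".as("type"), lit("in").as("direction")))

// Pivot on direction so each node gets an "in" and an "out" count; missing counts become 0.
val linkCounts = endpoints
  .groupBy("nodeId", "type")
  .pivot("direction", Seq("in", "out"))
  .count()
  .na.fill(0)

linkCounts.show()
```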