How to calculate connections of the node in Spark 2 - scala

I have the following DataFrame df:
val df = Seq(
(1, 0, 1, 0, 0), (1, 4, 1, 0, 4), (2, 2, 1, 2, 2),
(4, 3, 1, 4, 4), (4, 5, 1, 4, 4)
).toDF("from", "to", "attr", "type_from", "type_to")
+-----+-----+----+---------------+---------------+
|from |to |attr|type_from |type_to |
+-----+-----+----+---------------+---------------+
| 1| 0| 1| 0| 0|
| 1| 4| 1| 0| 4|
| 2| 2| 1| 2| 2|
| 4| 3| 1| 4| 4|
| 4| 5| 1| 4| 4|
+-----+-----+----+---------------+---------------+
I want to count the number of ingoing and outgoing links for each node only when the type of from node is the same as the type of to node (i.e. the values of type_from and type_to).
The cases when to and from are equal should be excluded.
This is how I calculate the number of outgoing links based on this answer proposed by user8371915.
df
.where($"type_from" === $"type_to" && $"from" =!= $"to")
.groupBy($"from" as "nodeId", $"type_from" as "type")
.agg(count("*") as "numLinks")
.na.fill(0)
.show()
Of course, I can repeat the same calculation for the incoming links and then join the results. But is there any shorter solution?
df2
.where($"type_from" === $"type_to" && $"from" =!= $"to")
.groupBy($"to" as "nodeId", $"type_to" as "type")
.agg(count("*") as "numLinks")
.na.fill(0)
.show()
val df_result = df.join(df2, Seq("nodeId", "type"), "rightouter")

Related

Spark scala column level mismatches from 2 dataframes

I have 2 dataframes
val df1 = Seq((1, "1","6"), (2, "10","8"), (3, "6","4")).toDF("id", "value1","value2")
val df2 = Seq((1, "1","6"), (2, "5","4"), (4, "3","1")).toDF("id", "value1","value2")
and i want to find the difference of column level
output should look like
id,value1_df1,value1_df2,diff_value1,value2_df1,value_df2,diff_value2
1, 1 ,1 , 0 , 6 ,6 ,0
2, 10 ,5 , 5 , 8 ,4 ,4
3, 6 ,3 , 1 , 4 ,1 ,3
like wise i have 100's of column and want to compute difference between same column in 2 dataframes columns are dynamic
Maybe this will help:
val spark = SparkSession.builder.appName("Test").master("local[*]").getOrCreate();
import spark.implicits._
var df1 = Seq((1, "1", "6"), (2, "10", "8"), (3, "6", "4")).toDF("id", "value1", "value2")
var df2 = Seq((1, "1", "6"), (2, "5", "4"), (3, "3", "1")).toDF("id", "value1", "value2")
df1.columns.foreach(column => {
df1 = df1.withColumn(column, df1.col(column).cast(IntegerType))
})
df2.columns.foreach(column => {
df2 = df2.withColumn(column, df2.col(column).cast(IntegerType))
})
df1 = df1.withColumnRenamed("id", "df1_id")
df2 = df2.withColumnRenamed("id", "df2_id")
df1.show()
df2.show()
so till now you have two dataframes with value_x,value_y,value_z and going on ...
df1:
+------+------+------+
|df1_id|value1|value2|
+------+------+------+
| 1| 1| 6|
| 2| 10| 8|
| 3| 6| 4|
+------+------+------+
df2:
+------+------+------+
|df2_id|value1|value2|
+------+------+------+
| 1| 1| 6|
| 2| 5| 4|
| 3| 3| 1|
+------+------+------+
Now we are gonna join them base on id:
var df3 = df1.alias("df1").join(df2.alias("df2"), $"df1.df1_id" === $"df2.df2_id")
and last, we gonna take all columns on df1/df2 (* Its important that they will have the same columns) - without the id, and create a new column of the diff:
df1.columns.tail.foreach(col => {
val new_col_name = s"${col}-diff"
val df_a_col = s"df1.${col}"
val df_b_col = s"df2.${col}"
df3 = df3.withColumn(new_col_name, df3.col(df_a_col) - df3.col(df_b_col))
})
df3.show()
Result:
+------+------+------+------+------+------+-----------+-----------+
|df1_id|value1|value2|df2_id|value1|value2|value1-diff|value2-diff|
+------+------+------+------+------+------+-----------+-----------+
| 1| 1| 6| 1| 1| 6| 0| 0|
| 2| 10| 8| 2| 5| 4| 5| 4|
| 3| 6| 4| 3| 3| 1| 3| 3|
+------+------+------+------+------+------+-----------+-----------+
This is the result, and it`s dynamic so you can add valueX you want.

How to apply conditional counts (with reset) to grouped data in PySpark?

I have PySpark code that effectively groups up rows numerically, and increments when a certain condition is met. I'm having trouble figuring out how to transform this code, efficiently, into one that can be applied to groups.
Take this sample dataframe df
df = sqlContext.createDataFrame(
[
(33, [], '2017-01-01'),
(33, ['apple', 'orange'], '2017-01-02'),
(33, [], '2017-01-03'),
(33, ['banana'], '2017-01-04')
],
('ID', 'X', 'date')
)
This code achieves what I want for this sample df, which is to order by date and to create groups ('grp') that increment when the size column goes back to 0.
df \
.withColumn('size', size(col('X'))) \
.withColumn(
"grp",
sum((col('size') == 0).cast("int")).over(Window.orderBy('date'))
).show()
This is partly based on Pyspark - Cumulative sum with reset condition
Now what I am trying to do is apply the same approach to a dataframe that has multiple IDs - achieving a result that looks like
df2 = sqlContext.createDataFrame(
[
(33, [], '2017-01-01', 0, 1),
(33, ['apple', 'orange'], '2017-01-02', 2, 1),
(33, [], '2017-01-03', 0, 2),
(33, ['banana'], '2017-01-04', 1, 2),
(55, ['coffee'], '2017-01-01', 1, 1),
(55, [], '2017-01-03', 0, 2)
],
('ID', 'X', 'date', 'size', 'group')
)
edit for clarity
1) For the first date of each ID - the group should be 1 - regardless of what shows up in any other column.
2) However, for each subsequent date, I need to check the size column. If the size column is 0, then I increment the group number. If it is any non-zero, positive integer, then I continue the previous group number.
I've seen a few way to handle this in pandas, but I'm having difficulty understanding the applications in pyspark and the ways in which grouped data is different in pandas vs spark (e.g. do I need to use something called UADFs?)
Create a column zero_or_first by checking whether the size is zero or the row is the first row. Then sum.
df2 = sqlContext.createDataFrame(
[
(33, [], '2017-01-01', 0, 1),
(33, ['apple', 'orange'], '2017-01-02', 2, 1),
(33, [], '2017-01-03', 0, 2),
(33, ['banana'], '2017-01-04', 1, 2),
(55, ['coffee'], '2017-01-01', 1, 1),
(55, [], '2017-01-03', 0, 2),
(55, ['banana'], '2017-01-01', 1, 1)
],
('ID', 'X', 'date', 'size', 'group')
)
w = Window.partitionBy('ID').orderBy('date')
df2 = df2.withColumn('row', F.row_number().over(w))
df2 = df2.withColumn('zero_or_first', F.when((F.col('size')==0)|(F.col('row')==1), 1).otherwise(0))
df2 = df2.withColumn('grp', F.sum('zero_or_first').over(w))
df2.orderBy('ID').show()
Here' the output. You can see that column group == grp. Where group is the expected results.
+---+---------------+----------+----+-----+---+-------------+---+
| ID| X| date|size|group|row|zero_or_first|grp|
+---+---------------+----------+----+-----+---+-------------+---+
| 33| []|2017-01-01| 0| 1| 1| 1| 1|
| 33| [banana]|2017-01-04| 1| 2| 4| 0| 2|
| 33|[apple, orange]|2017-01-02| 2| 1| 2| 0| 1|
| 33| []|2017-01-03| 0| 2| 3| 1| 2|
| 55| [coffee]|2017-01-01| 1| 1| 1| 1| 1|
| 55| [banana]|2017-01-01| 1| 1| 2| 0| 1|
| 55| []|2017-01-03| 0| 2| 3| 1| 2|
+---+---------------+----------+----+-----+---+-------------+---+
I added a window function, and created an index within each ID. Then I expanded the conditional statement to also reference that index. The following seems to produce my desired output dataframe - but I am interested in knowing if there is a more efficient way to do this.
window = Window.partitionBy('ID').orderBy('date')
df \
.withColumn('size', size(col('X'))) \
.withColumn('index', rank().over(window).alias('index')) \
.withColumn(
"grp",
sum(((col('size') == 0) | (col('index') == 1)).cast("int")).over(window)
).show()
which yields
+---+---------------+----------+----+-----+---+
| ID| X| date|size|index|grp|
+---+---------------+----------+----+-----+---+
| 33| []|2017-01-01| 0| 1| 1|
| 33|[apple, orange]|2017-01-02| 2| 2| 1|
| 33| []|2017-01-03| 0| 3| 2|
| 33| [banana]|2017-01-04| 1| 4| 2|
| 55| [coffee]|2017-01-01| 1| 1| 1|
| 55| []|2017-01-03| 0| 2| 2|
+---+---------------+----------+----+-----+---+

Group by after group by spark

I have a dataframe with 4 columns co1, col2, col3 and col4. I need to:
Group dataframe based on key col1 and col2
Then group other columns like col3 and col4 and display counts for col3 and col4.
Input
col1 col2 col3 col4
1 1 2 4
1 1 2 4
1 1 3 5
Output
col1 col2 col_name col_value cnt
1 1 col3 2 2
1 1 col3 3 1
1 1 col4 4 2
1 1 col4 5 1
Is this possible?
This the case for melt like operation. You can use implementation provided by ahue as an answer to How to melt Spark DataFrame?.
val df = Seq(
(1, 1, 2, 4), (1, 1, 2, 4), (1, 1, 3, 5)
).toDF("col1", "col2", "col3", "col4")
df.melt(
Seq("col1", "col2"), Seq("col3", "col4"), "col_name", "col_value"
).groupBy("col1", "col2", "col_name", "col_value").count.show
// +----+----+--------+---------+-----+
// |col1|col2|col_name|col_value|count|
// +----+----+--------+---------+-----+
// | 1| 1| col3| 3| 1|
// | 1| 1| col4| 5| 1|
// | 1| 1| col4| 4| 2|
// | 1| 1| col3| 2| 2|
// +----+----+--------+---------+-----+
Here's one approach which should work for aribitrary numbers of key-columns and value-columns (Note that the sample dataset has been expanded for illustration purpose):
val df = Seq(
(1, 1, 2, 4, 6),
(1, 1, 2, 4, 7),
(1, 1, 3, 5, 7)
).toDF("col1", "col2", "col3", "col4", "col5")
import org.apache.spark.sql.functions._
val keyCols = Seq("col1", "col2")
val valCols = Seq("col3", "col4", "col5")
val dfList = valCols.map( c => {
val grpCols = keyCols :+ c
df.groupBy(grpCols.head, grpCols.tail: _*).agg(count(col(c)).as("cnt")).
select(keyCols.map(col) :+ lit(c).as("col_name") :+ col(c).as("col_value") :+ col("cnt"): _*)
} )
dfList.reduce(_ union _).show
// +----+----+--------+---------+---+
// |col1|col2|col_name|col_value|cnt|
// +----+----+--------+---------+---+
// | 1| 1| col3| 3| 1|
// | 1| 1| col3| 2| 2|
// | 1| 1| col4| 4| 2|
// | 1| 1| col4| 5| 1|
// | 1| 1| col5| 6| 1|
// | 1| 1| col5| 7| 2|
// +----+----+--------+---------+---+
We can use groupBy and union to achieve this.
val x = Seq((1, 1,2,4),(1, 1,2,4),(1, 1,3,5)).toDF("col1", "col2", "col3", "col4")
val y = x.groupBy("col1", "col2","col3").
agg(count(col("col3")).alias("cnt")).
withColumn("col_name", lit("col3")).
select(col("col1"), col("col2"), col("col_name"), col("col3").alias("col_value"), col("cnt"))
val z = x.groupBy("col1", "col2","col4").
agg(count(col("col4")).alias("cnt")).
withColumn("col_name", lit("col4")).
select(col("col1"), col("col2"), col("col_name"), col("col4").alias("col_value"), col("cnt"))
y.union(z).show()

Understanding pivot and agg

I have the following columns in DataFrame df:
c_id p_id type values
278230 57371100 11 1
278230 57371100 12 1
...
I execute the following code and expect to see columns 11_total and 12_total:
df
.groupBy($"c_id",$"p_id")
.pivot("type")
.agg(sum("values") as "total")
.na.fill(0)
.show()
Instead, I get columns 11 and 12:
+-----------+----------+---+---+
| c_id| p_id| 11| 12|
+-----------+----------+---+---+
| 278230| 57371100| 0| 1|
| 337790| 72031970| 3| 0|
| 320710| 71904400| 0| 1|
Why?
That's because Spark appends aliases to the pivot column values only when there are multiple aggregations for clarity:
val df = Seq(
(278230, 57371100, 11, 1),
(278230, 57371100, 12, 2),
(337790, 72031970, 11, 1),
(337790, 72031970, 11, 2),
(337790, 72031970, 12, 3)
)toDF("c_id", "p_id", "type", "values")
df.groupBy($"c_id", $"p_id").pivot("type").
agg(sum("values").as("total")).
show
// +------+--------+---+---+
// | c_id| p_id| 11| 12|
// +------+--------+---+---+
// |337790|72031970| 3| 3|
// |278230|57371100| 1| 2|
// +------+--------+---+---+
df.groupBy($"c_id", $"p_id").pivot("type").
agg(sum("values").as("total"), max("values").as("max")).
show
// +------+--------+--------+------+--------+------+
// | c_id| p_id|11_total|11_max|12_total|12_max|
// +------+--------+--------+------+--------+------+
// |337790|72031970| 3| 2| 3| 3|
// |278230|57371100| 1| 1| 2| 2|
// +------+--------+--------+------+--------+------+

Fill Nan with mean of the row in Scala-Spark

I have an RDD with 6 columns, where the last 5 columns might contain NaNs. My intention is to replace the NaNs with the average value of the rest of the last 5 values of the row which are not Nan. For instance, having this input:
1, 2, 3, 4, 5, 6
2, 2, 2, NaN, 4, 0
3, NaN, NaN, NaN, 6, 0
4, NaN, NaN, 4, 4, 0
The output should be:
1, 2, 3, 4, 5, 6
2, 2, 2, 2, 4, 0
3, 3, 3, 3, 6, 0
4, 3, 3, 4, 4, 0
I know how to fill those NaNs with the average value of the column transforming the RDD to DataFrame:
var aux1 = df.select(df.columns.map(c => mean(col(c))) :_*)
var aux2 = df.na.fill(/*get values of aux1*/)
My question is, how can you do this operation but instead of filling the NaN with the column average, fill it with an average of the values of a subgroup of the row?
You can do this by defining a function to get the mean, and another function to fill nulls in a row.
Given the DF you presented:
val df = sc.parallelize(List((Some(1),Some(2),Some(3),Some(4),Some(5),Some(6)),(Some(2),Some(2),Some(2),None,Some(4),Some(0)),(Some(3),None,None,None,Some(6),Some(0)),(Some(4),None,None,Some(4),Some(4),Some(0)))).toDF("a","b","c","d","e","f")
We need a function to get the mean of a Row:
import org.apache.spark.sql.Row
def rowMean(row: Row): Int = {
val nonNulls = (0 until row.length).map(i => (!row.isNullAt(i), row.getAs[Int](i))).filter(_._1).map(_._2).toList
nonNulls.sum / nonNulls.length
}
And another to fill nulls in a Row:
def rowFillNulls(row: Row, fill: Int): Row = {
Row((0 until row.length).map(i => if (row.isNullAt(i)) fill else row.getAs[Int](i)) : _*)
}
Now we can first compute each row mean:
val rowWithMean = df.map(row => (row,rowMean(row)))
And then fill it:
val result = sqlContext.createDataFrame(rowWithMean.map{case (row,mean) => rowFillNulls(row,mean)}, df.schema)
Finally view before and after...
df.show
+---+----+----+----+---+---+
| a| b| c| d| e| f|
+---+----+----+----+---+---+
| 1| 2| 3| 4| 5| 6|
| 2| 2| 2|null| 4| 0|
| 3|null|null|null| 6| 0|
| 4|null|null| 4| 4| 0|
+---+----+----+----+---+---+
result.show
+---+---+---+---+---+---+
| a| b| c| d| e| f|
+---+---+---+---+---+---+
| 1| 2| 3| 4| 5| 6|
| 2| 2| 2| 2| 4| 0|
| 3| 3| 3| 3| 6| 0|
| 4| 3| 3| 4| 4| 0|
+---+---+---+---+---+---+
This will work for any width DF with Int columns. You can easily update this to other datatypes, even non-numeric (hint, inspect the df schema!)
A bunch of imports:
import org.apache.spark.sql.functions.{col, isnan, isnull, round, when}
import org.apache.spark.sql.Column
A few helper functions:
def nullOrNan(c: Column) = isnan(c) || isnull(c)
def rowMean(cols: Column*): Column = {
val sum = cols
.map(c => when(nullOrNan(c), lit(0.0)).otherwise(c))
.fold(lit(0.0))(_ + _)
val count = cols
.map(c => when(nullOrNan(c), lit(0.0)).otherwise(lit(1.0)))
.fold(lit(0.0))(_ + _)
sum / count
}
A solution:
val mean = round(
rowMean(df.columns.tail.map(col): _*)
).cast("int").alias("mean")
val exprs = df.columns.tail.map(
c => when(nullOrNan(col(c)), mean).otherwise(col(c)).alias(c)
)
val filled = df.select(col(df.columns(0)) +: exprs: _*)
Well, this is a fun little problem - I will post my solution, but I will definitely watch and see if someone comes up with a better way of doing it :)
First I would introduce a couple of udfs:
val avg = udf((values: Seq[Integer]) => {
val notNullValues = values.filter(_ != null).map(_.toInt)
notNullValues.sum/notNullValues.length
})
val replaceNullWithAvg = udf((x: Integer, avg: Integer) => if(x == null) avg else x)
which I would then apply to the DataFrame like this:
dataframe
.withColumn("avg", avg(array(df.columns.tail.map(s => df.col(s)):_*)))
.select('col1, replaceNullWithAvg('col2, 'avg) as "col2", replaceNullWithAvg('col3, 'avg) as "col3", replaceNullWithAvg('col4, 'avg) as "col4", replaceNullWithAvg('col5, 'avg) as "col5", replaceNullWithAvg('col6, 'avg) as "col6")
This will get you what you are looking for, but arguably not the most sophisticated code I have ever put together...