Sum the Distance in Apache-Spark dataframes - scala

The following code produces a DataFrame in which each column holds three values, as shown below.
import org.graphframes._
import org.apache.spark.sql.DataFrame
val v = sqlContext.createDataFrame(List(
("1", "Al"),
("2", "B"),
("3", "C"),
("4", "D"),
("5", "E")
)).toDF("id", "name")
val e = sqlContext.createDataFrame(List(
("1", "3", 5),
("1", "2", 8),
("2", "3", 6),
("2", "4", 7),
("2", "1", 8),
("3", "1", 5),
("3", "2", 6),
("4", "2", 7),
("4", "5", 8),
("5", "4", 8)
)).toDF("src", "dst", "property")
val g = GraphFrame(v, e)
val paths: DataFrame = g.bfs.fromExpr("id = '1'").toExpr("id = '5'").run()
paths.show()
val df=paths
df.select(df.columns.filter(_.startsWith("e")).map(df(_)) : _*).show
The output of the above code is given below:
+-------+-------+-------+
| e0| e1| e2|
+-------+-------+-------+
|[1,2,8]|[2,4,7]|[4,5,8]|
+-------+-------+-------+
In the above output, each column holds three values, which can be interpreted as follows.
e0: source 1, destination 2, distance 8
e1: source 2, destination 4, distance 7
e2: source 4, destination 5, distance 8
Basically, e0, e1, and e2 are the edges. I want to sum the third element of each column, i.e. add up the distance of each edge to get the total distance. How can I achieve this?

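It can be done like this:
import org.apache.spark.sql.functions.col
// Build a single Column that adds up the "property" field of every edge struct.
val total = df.columns.filter(_.startsWith("e"))
  .map(c => col(s"$c.property")) // or col(c).getItem("property")
  .reduce(_ + _)
df.withColumn("total", total)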
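For anyone who wants to try the struct-field approach without GraphFrames installed, here is a minimal sketch. It assumes a local SparkSession, and the flat column names (src0, dst0, p0, ...) are made up purely to build stand-in edge structs that look like the BFS output.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct}

val spark = SparkSession.builder().master("local[1]").appName("sum-edge-property").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the BFS result: one row, three flat columns per edge.
val flat = Seq(("1", "2", 8, "2", "4", 7, "4", "5", 8))
  .toDF("src0", "dst0", "p0", "src1", "dst1", "p1", "src2", "dst2", "p2")

// Pack each triple into a struct column e0, e1, e2 with fields src, dst, property.
val edgeCols = (0 to 2).map { i =>
  struct(col(s"src$i").as("src"), col(s"dst$i").as("dst"), col(s"p$i").as("property")).as(s"e$i")
}
val df = flat.select(edgeCols: _*)

// Same technique as in the answer: one Column per edge, reduced with +.
val total = df.columns.filter(_.startsWith("e")).map(c => col(s"$c.property")).reduce(_ + _)
val totalDistance = df.withColumn("total", total).select("total").as[Int].head()
```

Because the sum is an ordinary Column expression, it is evaluated per row, so it should also work when the BFS returns several paths.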

I would make a collection of the columns to sum and then use a foldLeft on a UDF:
scala> val df = Seq((Array(1,2,8),Array(2,4,7),Array(4,5,8))).toDF("e0", "e1", "e2")
df: org.apache.spark.sql.DataFrame = [e0: array<int>, e1: array<int>, e2: array<int>]
scala> df.show
+---------+---------+---------+
| e0| e1| e2|
+---------+---------+---------+
|[1, 2, 8]|[2, 4, 7]|[4, 5, 8]|
+---------+---------+---------+
scala> val colsToSum = df.columns
colsToSum: Array[String] = Array(e0, e1, e2)
scala> val accLastUDF = udf((acc: Int, col: Seq[Int]) => acc + col.last)
accLastUDF: org.apache.spark.sql.UserDefinedFunction = UserDefinedFunction(<function2>,IntegerType,List(IntegerType, ArrayType(IntegerType,false)))
scala> df.withColumn("dist", colsToSum.foldLeft(lit(0))((acc, colName) => accLastUDF(acc, col(colName)))).show
+---------+---------+---------+----+
| e0| e1| e2|dist|
+---------+---------+---------+----+
|[1, 2, 8]|[2, 4, 7]|[4, 5, 8]| 23|
+---------+---------+---------+----+
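If you are on Spark 2.4 or later, the same result can probably be had without a UDF, since element_at with a negative index counts from the end of the array. This is only a sketch under that version assumption, using the same array-typed test data as above (the real BFS columns are structs, not arrays).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, element_at}

val spark = SparkSession.builder().master("local[1]").appName("sum-last-element").getOrCreate()
import spark.implicits._

val df = Seq((Array(1, 2, 8), Array(2, 4, 7), Array(4, 5, 8))).toDF("e0", "e1", "e2")

// element_at(column, -1) returns the last element of an array column (Spark 2.4+).
val dist = df.columns.map(c => element_at(col(c), -1)).reduce(_ + _)
val withDist = df.withColumn("dist", dist)
val lastSum = withDist.select("dist").as[Int].head()
```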

Related

How can I capitalize specific words in a spark column?

I am trying to capitalize some words in a column in my spark dataframe. The words are all in a list.
val wrds = List("usa", "gb")
val dF = List(
(1, "z",3, "Bob lives in the usa"),
(4, "t", 2, "gb is where Beth lives"),
(5, "t", 2, "ogb")
).toDF("id", "name", "thing", "country")
I would like to have an output of
val dF = List(
(1, "z",3, "Bob lives in the USA"),
(4, "t", 2, "GB is where Beth lives"),
(5, "t", 2, "ogb")
).toDF("id", "name", "thing", "country")
It seems I have to do a string split on the column and then capitalize based on if that part of a string is present in the value. I am mostly struggling with row 3 where I do not want to capitalize ogb even though it does contain gb. Could anyone point me in the right direction?
import org.apache.spark.sql.functions._
val words = Array("usa","gb")
val df = List(
(1, "z",3, "Bob lives in the usa"),
(4, "t", 2, "gb is where Beth lives"),
(5, "t", 2, "ogb")
).toDF("id", "name", "thing", "country")
val replaced = words.foldLeft(df){
case (adf, word) =>
adf.withColumn("country", regexp_replace($"country", "(\\b" + word + "\\b)", word.toUpperCase))
}
replaced.show
Output:
+---+----+-----+--------------------+
| id|name|thing| country|
+---+----+-----+--------------------+
| 1| z| 3|Bob lives in the USA|
| 4| t| 2|GB is where Beth ...|
| 5| t| 2| ogb|
+---+----+-----+--------------------+
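The reason "ogb" is left alone is the \b word boundary: there is no boundary between "o" and "g", so the pattern does not match inside a longer word. A plain-Scala sketch of the same fold, without Spark, shows this. The helper name capitalizeWords is made up for illustration, and the quoting calls are an extra precaution for words that contain regex metacharacters.

```scala
import java.util.regex.{Matcher, Pattern}

val words = Seq("usa", "gb")

// Replace each listed word with its upper-case form, but only where it
// appears as a whole word (delimited by \b word boundaries).
def capitalizeWords(text: String): String =
  words.foldLeft(text) { (acc, word) =>
    acc.replaceAll(
      "\\b" + Pattern.quote(word) + "\\b",
      Matcher.quoteReplacement(word.toUpperCase))
  }

val r1 = capitalizeWords("Bob lives in the usa")
val r2 = capitalizeWords("gb is where Beth lives")
val r3 = capitalizeWords("ogb")
```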

Finding size of distinct array column

I am using Scala and Spark to create a dataframe. Here's my code so far:
val df = transformedFlattenDF
.groupBy($"market", $"city", $"carrier").agg(count("*").alias("count"), min($"bandwidth").alias("bandwidth"), first($"network").alias("network"), concat_ws(",", collect_list($"carrierCode")).alias("carrierCode")).withColumn("carrierCode", split(($"carrierCode"), ",").cast("array<string>")).withColumn("Carrier Count", collect_set("carrierCode"))
The column carrierCode becomes an array column. The data is present as follows:
CarrierCode
1: [12,2,12]
2: [5,2,8]
3: [1,1,3]
I'd like to create a column that counts the number of distinct values in each array. I tried using collect_set; however, it gives me an error saying "grouping expressions sequence is empty". Is it possible to find the number of distinct values in each row's array? In our example, the result would be a column like this:
Carrier Count
1: 2
2: 3
3: 2
collect_set is an aggregate function, so it should be applied within your groupBy/agg step:
val df = transformedFlattenDF.groupBy($"market", $"city", $"carrier").agg(
count("*").alias("count"), min($"bandwidth").alias("bandwidth"),
first($"network").alias("network"),
concat_ws(",", collect_list($"carrierCode")).alias("carrierCode"),
size(collect_set($"carrierCode")).as("carrier_count") // <-- ADDED `collect_set`
).
withColumn("carrierCode", split(($"carrierCode"), ",").cast("array<string>"))
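A small self-contained sketch of the same idea, on made-up data with only a market and a carrierCode column (a simplification of the question's schema), assuming a local SparkSession:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_set, size}

val spark = SparkSession.builder().master("local[1]").appName("collect-set-in-agg").getOrCreate()
import spark.implicits._

// Hypothetical flattened input: one row per (market, carrierCode) observation.
val flat = Seq(("m1", "12"), ("m1", "2"), ("m1", "12"), ("m2", "5"))
  .toDF("market", "carrierCode")

// collect_set is used inside agg, where a grouping is defined.
val counted = flat.groupBy($"market")
  .agg(size(collect_set($"carrierCode")).as("carrier_count"))

val byMarket = counted.as[(String, Int)].collect().toMap
```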
If you don't want to change the existing groupBy-agg code, you can create a UDF like in the following example:
import org.apache.spark.sql.functions._
val codeDF = Seq(
Array("12", "2", "12"),
Array("5", "2", "8"),
Array("1", "1", "3")
).toDF("carrier_code")
def distinctElemCount = udf( (a: Seq[String]) => a.toSet.size )
codeDF.withColumn("carrier_count", distinctElemCount($"carrier_code")).
show
// +------------+-------------+
// |carrier_code|carrier_count|
// +------------+-------------+
// | [12, 2, 12]| 2|
// | [5, 2, 8]| 3|
// | [1, 1, 3]| 2|
// +------------+-------------+
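On Spark 2.4 or later the UDF may not be needed at all, because array_distinct and size are built in. A sketch under that version assumption, on the same sample data:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array_distinct, size}

val spark = SparkSession.builder().master("local[1]").appName("distinct-array-size").getOrCreate()
import spark.implicits._

val codeDF = Seq(
  Array("12", "2", "12"),
  Array("5", "2", "8"),
  Array("1", "1", "3")
).toDF("carrier_code")

// Deduplicate each row's array, then take its length.
val counted = codeDF.withColumn("carrier_count", size(array_distinct($"carrier_code")))
val counts = counted.select("carrier_count").as[Int].collect().toSeq
```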
Without a UDF, converting to an RDD and back to a DataFrame (included for posterity):
import org.apache.spark.sql.functions._
val df = sc.parallelize(Seq(
("A", 2, 100, 2), ("F", 7, 100, 1), ("B", 10, 100, 100)
)).toDF("c1", "c2", "c3", "c4")
val x = df.select("c1", "c2", "c3", "c4").rdd.map(x => (x.get(0), List(x.get(1), x.get(2), x.get(3))) )
val y = x.map {case (k, vL) => (k, vL.toSet.size) }
// Manipulate back to your DF, via conversion, join, what not.
y.collect
Returns:
res15: Array[(Any, Int)] = Array((A,2), (F,3), (B,2))
The solution above is better; as stated, this one is included more for posterity.
You can also use a UDF, like this:
//Input
df.show
+-----------+
|CarrierCode|
+-----------+
|1:[12,2,12]|
| 2:[5,2,8]|
| 3:[1,1,3]|
+-----------+
//udf
// Strip the surrounding brackets before splitting; otherwise "[12" and "12]" are counted as distinct values.
val countUDF = udf { (str: String) => val strArr = str.split(":"); strArr(0) + ":" + strArr(1).stripPrefix("[").stripSuffix("]").split(",").distinct.length.toString }
df.withColumn("Carrier Count",countUDF(col("CarrierCode"))).show
//Sample Output:
+-----------+-------------+
|CarrierCode|Carrier Count|
+-----------+-------------+
|1:[12,2,12]| 1:2|
| 2:[5,2,8]| 2:3|
| 3:[1,1,3]| 3:2|
+-----------+-------------+

How to replicate an element in Spark dataframe in Scala?

Suppose I have a DataFrame:
val testDf = sc.parallelize(Seq(
(1,2,"x", Array(1,2,3,4)))).toDF("one", "two", "X", "Array")
+---+---+---+------------+
|one|two| X| Array|
+---+---+---+------------+
| 1| 2| x|[1, 2, 3, 4]|
+---+---+---+------------+
I want to replicate the single elements, let's say 4 times, in order to achieve a single row DataFrame with each field as an array of four elements. The desired output would be:
+------------+------------+------------+------------+
| one| two| X| Array|
+------------+------------+------------+------------+
|[1, 1, 1, 1]|[2, 2, 2, 2]|[x, x, x, x]|[1, 2, 3, 4]|
+------------+------------+------------+------------+
You can use the built-in array function to replicate a column of your choice n times.
Below is proof-of-concept code.
import org.apache.spark.sql.functions._
val replicate = (n: Int, colName: String) => array((1 to n).map(s => col(colName)):_*)
val replicatedCol = Seq("one", "two", "X").map(s => replicate(4, s).as(s))
val cols = col("Array") +: replicatedCol
val testDf = sc.parallelize(Seq(
(1,2,"x", Array(1,2,3,4)))).toDF("one", "two", "X", "Array").select(cols:_*)
testDf.show(false)
+------------+------------+------------+------------+
|Array |one |two |X |
+------------+------------+------------+------------+
|[1, 2, 3, 4]|[1, 1, 1, 1]|[2, 2, 2, 2]|[x, x, x, x]|
+------------+------------+------------+------------+
In case you want a different n for each column:
val testDf = sc.parallelize(Seq(
(1,2,"x", Array(1,2,3,4)))).toDF("one", "two", "X", "Array").select(replicate(2, "one").as("one"), replicate(3, "X").as("X"), replicate(4, "two").as("two"), $"Array")
testDf.show(false)
+------+---------+------------+------------+
|one |X |two |Array |
+------+---------+------------+------------+
|[1, 1]|[x, x, x]|[2, 2, 2, 2]|[1, 2, 3, 4]|
+------+---------+------------+------------+
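On Spark 2.4 or later there is also a built-in array_repeat function that does the replication directly. The following is a sketch under that version assumption, with a local SparkSession and the question's single-row data:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array_repeat, col}

val spark = SparkSession.builder().master("local[1]").appName("array-repeat").getOrCreate()
import spark.implicits._

val testDf = Seq((1, 2, "x", Array(1, 2, 3, 4))).toDF("one", "two", "X", "Array")

// array_repeat(column, n) builds an array holding the column value n times.
val replicated = testDf.select(
  array_repeat(col("one"), 4).as("one"),
  array_repeat(col("two"), 4).as("two"),
  array_repeat(col("X"), 4).as("X"),
  col("Array")
)

val ones = replicated.select("one").as[Seq[Int]].head()
val xs = replicated.select("X").as[Seq[String]].head()
```

The count is an ordinary argument, so a different n per column works the same way as with the replicate helper above.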
Well, here is my solution:
First declare the columns you want to replicate:
val columnsToReplicate = List("one", "two", "X")
Then define the replication factor and the udf to perform it:
val replicationFactor = 4
val replicate = (s:String) => {
for {
i <- 1 to replicationFactor
} yield s
}
val replicateudf = functions.udf(replicate)
Then just perform a foldLeft over the DataFrame's columns, applying the UDF whenever the column name belongs to your list of desired column names:
testDf.columns.foldLeft(testDf)((acc, colname) => if (columnsToReplicate.contains(colname)) acc.withColumn(colname, replicateudf(acc.col(colname))) else acc)
Output:
+------------+------------+------------+------------+
| one| two| X| Array|
+------------+------------+------------+------------+
|[1, 1, 1, 1]|[2, 2, 2, 2]|[x, x, x, x]|[1, 2, 3, 4]|
+------------+------------+------------+------------+
Note: You need to import this object:
import org.apache.spark.sql.functions
EDIT:
Variable replicationFactor as suggested in comments:
val mapColumnsToReplicate = Map("one"->4, "two"->5, "X"->6)
val replicateudf2 = functions.udf ((s: String, replicationFactor: Int) =>
for {
i <- 1 to replicationFactor
} yield s
)
testDf.columns.foldLeft(testDf)((acc, colname) => if (mapColumnsToReplicate.keys.toList.contains(colname)) acc.withColumn(colname, replicateudf2($"$colname", functions.lit(mapColumnsToReplicate(colname)))) else acc)
Output with those values above:
+------------+---------------+------------------+------------+
| one| two| X| Array|
+------------+---------------+------------------+------------+
|[1, 1, 1, 1]|[2, 2, 2, 2, 2]|[x, x, x, x, x, x]|[1, 2, 3, 4]|
+------------+---------------+------------------+------------+
You can use explode and groupBy/collect_list:
val testDf = sc.parallelize(
Seq((1, 2, "x", Array(1, 2, 3, 4)),
(3, 4, "y", Array(1, 2, 3)),
(5,6, "z", Array(1)))
).toDF("one", "two", "X", "Array")
testDf
.withColumn("id",monotonically_increasing_id())
.withColumn("tmp", explode($"Array"))
.groupBy($"id")
.agg(
collect_list($"one").as("cl_one"),
collect_list($"two").as("cl_two"),
collect_list($"X").as("cl_X"),
first($"Array").as("Array")
)
.select(
$"cl_one".as("one"),
$"cl_two".as("two"),
$"cl_X".as("X"),
$"Array"
)
.show()
+------------+------------+------------+------------+
| one| two| X| Array|
+------------+------------+------------+------------+
| [5]| [6]| [z]| [1]|
|[1, 1, 1, 1]|[2, 2, 2, 2]|[x, x, x, x]|[1, 2, 3, 4]|
| [3, 3, 3]| [4, 4, 4]| [y, y, y]| [1, 2, 3]|
+------------+------------+------------+------------+
This solution has the advantage that it does not rely on constant array sizes.

spark convert spark-SQL to RDD API

Spark SQL is pretty clear to me. However, I am just getting started with Spark's RDD API. As the question "spark apply function to columns in parallel" points out, this should allow me to get rid of slow shuffles for
def handleBias(df: DataFrame, colName: String, target: String = this.target) = {
val w1 = Window.partitionBy(colName)
val w2 = Window.partitionBy(colName, target)
df.withColumn("cnt_group", count("*").over(w2))
.withColumn("pre2_" + colName, mean(target).over(w1))
.withColumn("pre_" + colName, coalesce(min(col("cnt_group") / col("cnt_foo_eq_1")).over(w1), lit(0D)))
.drop("cnt_group")
}
}
In pseudocode: for each column of df, call handleBias(column).
So a minimal data frame is loaded up
val input = Seq(
(0, "A", "B", "C", "D"),
(1, "A", "B", "C", "D"),
(0, "d", "a", "jkl", "d"),
(0, "d", "g", "C", "D"),
(1, "A", "d", "t", "k"),
(1, "d", "c", "C", "D"),
(1, "c", "B", "C", "D")
)
val inputDf = input.toDF("TARGET", "col1", "col2", "col3TooMany", "col4")
but fails to map correctly
val rdd1_inputDf = inputDf.rdd.flatMap { x => {(0 until x.size).map(idx => (idx, x(idx)))}}
rdd1_inputDf.toDF.show
It fails with
java.lang.ClassNotFoundException: scala.Any
An example can be found at https://github.com/geoHeil/sparkContrastCoding and, for the problem outlined in this question, at https://github.com/geoHeil/sparkContrastCoding/blob/master/src/main/scala/ColumnParallel.scala
When you call .rdd on a DataFrame you get an RDD[Row] which is not strongly typed. If you want to be able to map over the elements you will need to pattern match over Row:
scala> val input = Seq(
| (0, "A", "B", "C", "D"),
| (1, "A", "B", "C", "D"),
| (0, "d", "a", "jkl", "d"),
| (0, "d", "g", "C", "D"),
| (1, "A", "d", "t", "k"),
| (1, "d", "c", "C", "D"),
| (1, "c", "B", "C", "D")
| )
input: Seq[(Int, String, String, String, String)] = List((0,A,B,C,D), (1,A,B,C,D), (0,d,a,jkl,d), (0,d,g,C,D), (1,A,d,t,k), (1,d,c,C,D), (1,c,B,C,D))
scala> val inputDf = input.toDF("TARGET", "col1", "col2", "col3TooMany", "col4")
inputDf: org.apache.spark.sql.DataFrame = [TARGET: int, col1: string ... 3 more fields]
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row
scala> val rowRDD = inputDf.rdd
rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[3] at rdd at <console>:27
scala> val typedRDD = rowRDD.map{case Row(a: Int, b: String, c: String, d: String, e: String) => (a,b,c,d,e)}
typedRDD: org.apache.spark.rdd.RDD[(Int, String, String, String, String)] = MapPartitionsRDD[20] at map at <console>:29
scala> typedRDD.keyBy(_._1).groupByKey.foreach{println}
[Stage 7:> (0 + 0) / 4]
(0,CompactBuffer((A,B,C,D), (d,a,jkl,d), (d,g,C,D)))
(1,CompactBuffer((A,B,C,D), (A,d,t,k), (d,c,C,D), (c,B,C,D)))
Otherwise you can use a typed Dataset:
scala> val ds = input.toDS
ds: org.apache.spark.sql.Dataset[(Int, String, String, String, String)] = [_1: int, _2: string ... 3 more fields]
scala> ds.rdd
res2: org.apache.spark.rdd.RDD[(Int, String, String, String, String)] = MapPartitionsRDD[8] at rdd at <console>:30
scala> ds.rdd.keyBy(_._1).groupByKey.foreach{println}
[Stage 0:> (0 + 0) / 4]
(0,CompactBuffer((0,A,B,C,D), (0,d,a,jkl,d), (0,d,g,C,D)))
(1,CompactBuffer((1,A,B,C,D), (1,A,d,t,k), (1,d,c,C,D), (1,c,B,C,D)))
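As for the flatMap from the question itself: Row.apply returns Any, so the RDD has element type (Int, Any), and as far as I can tell toDF cannot derive an encoder for Any, which would explain the scala.Any error. One possible workaround, sketched here with a local SparkSession and a shortened input, is to convert each value to a String first. The column names colIdx and value are made up.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("row-flatmap").getOrCreate()
import spark.implicits._

val inputDf = Seq(
  (0, "A", "B", "C", "D"),
  (1, "A", "B", "C", "D")
).toDF("TARGET", "col1", "col2", "col3TooMany", "col4")

// String.valueOf turns every cell (including nulls) into a String,
// giving an RDD[(Int, String)] that toDF can encode.
val pairs = inputDf.rdd.flatMap { row =>
  (0 until row.size).map(idx => (idx, String.valueOf(row(idx))))
}
val pairsDf = pairs.toDF("colIdx", "value")
val pairCount = pairsDf.count()
```

This loses the original column types, so it is only suitable when treating every value as a string is acceptable.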

Selection of Edges in GraphFrames

I am applying BFS using GraphFrames in Scala. How can I sum the edge weights of the selected shortest path?
I have the following code:
import org.graphframes._
import org.apache.spark.sql.DataFrame
val v = sqlContext.createDataFrame(List(
("1", "Al"),
("2", "B"),
("3", "C"),
("4", "D"),
("5", "E")
)).toDF("id", "name")
val e = sqlContext.createDataFrame(List(
("1", "3", 5),
("1", "2", 8),
("2", "3", 6),
("2", "4", 7),
("2", "1", 8),
("3", "1", 5),
("3", "2", 6),
("4", "2", 7),
("4", "5", 8),
("5", "4", 8)
)).toDF("src", "dst", "property")
val g = GraphFrame(v, e)
val paths: DataFrame = g.bfs.fromExpr("id = '1'").toExpr("id = '5'").run()
paths.show()
The output of the above code is:
+------+-------+-----+-------+-----+-------+-----+
| from| e0| v1| e1| v2| e2| to|
+------+-------+-----+-------+-----+-------+-----+
|[1,Al]|[1,2,8]|[2,B]|[2,4,7]|[4,D]|[4,5,8]|[5,E]|
+------+-------+-----+-------+-----+-------+-----+
But I need output like this:
+----+-------+-----------+---------+
| |source |Destination| Distance|
+----+-------+-----------+---------+
| e0 | 1 | 2 | 8 |
+----+-------+-----------+---------+
| e1 | 2 | 4 | 7 |
+----+-------+-----------+---------+
| e2 | 4 | 5 | 8 |
+----+-------+-----------+---------+
Unlike the above example, my graph is huge, so it might actually return a large number of edges.