Selection of Edges in GraphFrames - scala

I am applying BFS using GraphFrames in Scala. How can I sum the edge weights of the selected shortest path?
I have the following code:
import org.graphframes._
import org.apache.spark.sql.DataFrame

// Vertices: (id, name)
val v = sqlContext.createDataFrame(List(
  ("1", "Al"),
  ("2", "B"),
  ("3", "C"),
  ("4", "D"),
  ("5", "E")
)).toDF("id", "name")

// Edges: (src, dst, property), where property holds the edge weight
val e = sqlContext.createDataFrame(List(
  ("1", "3", 5),
  ("1", "2", 8),
  ("2", "3", 6),
  ("2", "4", 7),
  ("2", "1", 8),
  ("3", "1", 5),
  ("3", "2", 6),
  ("4", "2", 7),
  ("4", "5", 8),
  ("5", "4", 8)
)).toDF("src", "dst", "property")

val g = GraphFrame(v, e)
val paths: DataFrame = g.bfs.fromExpr("id = '1'").toExpr("id = '5'").run()
paths.show()
Output of the above code:
+------+-------+-----+-------+-----+-------+-----+
| from| e0| v1| e1| v2| e2| to|
+------+-------+-----+-------+-----+-------+-----+
|[1,Al]|[1,2,8]|[2,B]|[2,4,7]|[4,D]|[4,5,8]|[5,E]|
+------+-------+-----+-------+-----+-------+-----+
But I need output like this:
+----+--------+-------------+----------+
|    | Source | Destination | Distance |
+----+--------+-------------+----------+
| e0 |      1 |           2 |        8 |
| e1 |      2 |           4 |        7 |
| e2 |      4 |           5 |        8 |
+----+--------+-------------+----------+
Unlike this small example, my real graph is huge, so the BFS result might contain a large number of edge columns.
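One way to get that shape (a sketch of my own, not from the original post): take every struct column whose name starts with "e", select its src, dst and property fields, and union the per-edge frames into rows. Field names follow the BFS output shown above.

import org.apache.spark.sql.functions._

// Each edge column produced by BFS (e0, e1, ...) is a struct
// with fields (src, dst, property).
val edgeCols = paths.columns.filter(_.startsWith("e"))

// Build one DataFrame per edge column (one row per path), then stack them.
val perEdge = edgeCols
  .map { c =>
    paths.select(
      lit(c).as("edge"),
      col(s"$c.src").as("source"),
      col(s"$c.dst").as("destination"),
      col(s"$c.property").as("distance"))
  }
  .reduce(_ union _)

perEdge.show()

Since every edge column becomes its own select plus a union, a very long path yields a wide query plan; if that becomes a problem, packing the structs into an array with array(...) and calling explode keeps the plan to a single select.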

Related

How can I capitalize specific words in a spark column?

I am trying to capitalize some words in a column in my Spark DataFrame. The words are all in a list.
val wrds = List("usa", "gb")
val dF = List(
  (1, "z", 3, "Bob lives in the usa"),
  (4, "t", 2, "gb is where Beth lives"),
  (5, "t", 2, "ogb")
).toDF("id", "name", "thing", "country")
I would like the output to be:
val dF = List(
  (1, "z", 3, "Bob lives in the USA"),
  (4, "t", 2, "GB is where Beth lives"),
  (5, "t", 2, "ogb")
).toDF("id", "name", "thing", "country")
It seems I have to split the string in the column and then capitalize each piece based on whether it appears in the list. I am mostly struggling with row 3, where I do not want to capitalize ogb even though it contains gb. Could anyone point me in the right direction?
import org.apache.spark.sql.functions._

val words = Array("usa", "gb")
val df = List(
  (1, "z", 3, "Bob lives in the usa"),
  (4, "t", 2, "gb is where Beth lives"),
  (5, "t", 2, "ogb")
).toDF("id", "name", "thing", "country")

// Replace each word with its upper-cased form, but only where it appears
// as a whole word (\b marks a word boundary).
val replaced = words.foldLeft(df) {
  case (adf, word) =>
    adf.withColumn("country", regexp_replace($"country", "(\\b" + word + "\\b)", word.toUpperCase))
}
replaced.show
Output:
+---+----+-----+--------------------+
| id|name|thing| country|
+---+----+-----+--------------------+
| 1| z| 3|Bob lives in the USA|
| 4| t| 2|GB is where Beth ...|
| 5| t| 2| ogb|
+---+----+-----+--------------------+
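The \b word boundaries are what keep row 3 intact: gb matches only when it stands alone as a word, so the gb inside ogb is left untouched.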

Combine two datasets based on value

I have the following two datasets:
val dfA = Seq(
  ("001", "10", "Cat"),
  ("001", "20", "Dog"),
  ("001", "30", "Bear"),
  ("002", "10", "Mouse"),
  ("002", "20", "Squirrel"),
  ("002", "30", "Turtle")
).toDF("Package", "LineItem", "Animal")
val dfB = Seq(
  ("001", "", "X", "A"),
  ("001", "", "Y", "B"),
  ("002", "", "X", "C"),
  ("002", "", "Y", "D"),
  ("002", "20", "X", "E")
).toDF("Package", "LineItem", "Flag", "Category")
I need to join them with specific conditions:
a) There is always a row in dfB with the X flag and an empty LineItem; its Category is the default for that Package in dfA
b) When dfB specifies a LineItem, the default Category should be overwritten with the Category associated with that LineItem
Expected output:
+---------+----------+----------+----------+
| Package | LineItem | Animal   | Category |
+---------+----------+----------+----------+
| 001     | 10       | Cat      | A        |
| 001     | 20       | Dog      | A        |
| 001     | 30       | Bear     | A        |
| 002     | 10       | Mouse    | C        |
| 002     | 20       | Squirrel | E        |
| 002     | 30       | Turtle   | C        |
+---------+----------+----------+----------+
I spent some time on this today, but I have no idea how it could be accomplished. I'd appreciate your assistance.
Thanks!
You can use two joins plus a when clause:
// First join: Categories tied to a specific LineItem (the overrides).
val dfC = dfA
  .join(dfB, dfB.col("Flag") === "X" && dfA.col("LineItem") === dfB.col("LineItem") && dfA.col("Package") === dfB.col("Package"))
  .select(dfA.col("Package").as("priorPackage"), dfA.col("LineItem").as("priorLineItem"), dfB.col("Category").as("priorCategory"))
  .as("dfC")

// Second join: default Categories (empty LineItem), then prefer the override when present.
val dfD = dfA
  .join(dfB, dfB.col("LineItem") === "" && dfB.col("Flag") === "X" && dfA.col("Package") === dfB.col("Package"), "left_outer")
  .join(dfC, dfA.col("LineItem") === dfC.col("priorLineItem") && dfA.col("Package") === dfC.col("priorPackage"), "left_outer")
  .select(
    dfA.col("Package"),
    dfA.col("LineItem"),
    dfA.col("Animal"),
    when(dfC.col("priorCategory").isNotNull, dfC.col("priorCategory")).otherwise(dfB.col("Category")).as("Category")
  )
dfD.show()
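As a side note, the when/otherwise expression at the end could equivalently be written as coalesce(dfC.col("priorCategory"), dfB.col("Category")), which reads directly as "take the override if present, otherwise the default".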
This should work for you:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._

val dfA = Seq(
  ("001", "10", "Cat"),
  ("001", "20", "Dog"),
  ("001", "30", "Bear"),
  ("002", "10", "Mouse"),
  ("002", "20", "Squirrel"),
  ("002", "30", "Turtle")
).toDF("Package", "LineItem", "Animal")

val dfB = Seq(
  ("001", "", "X", "A"),
  ("001", "", "Y", "B"),
  ("002", "", "X", "C"),
  ("002", "", "Y", "D"),
  ("002", "20", "X", "E")
).toDF("Package", "LineItem", "Flag", "Category")

val result = {
  dfA.as("a")
    .join(dfB.where('Flag === "X").as("b"), $"a.Package" === $"b.Package" and ($"a.LineItem" === $"b.LineItem" or $"b.LineItem" === ""), "left")
    .withColumn("anyRowsInGroupWithBLineItemDefined", first(when($"b.LineItem" =!= "", lit(true)), ignoreNulls = true).over(Window.partitionBy($"a.Package", $"a.LineItem")).isNotNull)
    .where(!$"anyRowsInGroupWithBLineItemDefined" or ($"anyRowsInGroupWithBLineItemDefined" and $"b.LineItem" =!= ""))
    .select($"a.Package", $"a.LineItem", $"a.Animal", $"b.Category")
}
result.orderBy($"a.Package", $"a.LineItem").show(false)
// +-------+--------+--------+--------+
// |Package|LineItem|Animal |Category|
// +-------+--------+--------+--------+
// |001 |10 |Cat |A |
// |001 |20 |Dog |A |
// |001 |30 |Bear |A |
// |002 |10 |Mouse |C |
// |002 |20 |Squirrel|E |
// |002 |30 |Turtle |C |
// +-------+--------+--------+--------+
The "tricky" part is calculating whether or not there are any rows with LineItem defined in dfB for a given Package, LineItem in dfA. You can see how I perform this calculation in anyRowsInGroupWithBLineItemDefined which involves the use of a window function. Other than that, it's just a normal SQL programming exercise.
Also want to note that this code should be more efficient than the other solution as here we only shuffle the data twice (during join and during window function) and only read in each dataset once.

Scala - Filter data in DF for each ID based on the transactions in another DF

Problem Overview:
Dataset 1: users have multiple rows, each associated with some transaction ID
Dataset 2: each user has a row for every transaction ID in the database
What I'd like to do is remove from Dataset 2 any transaction that the user already has in Dataset 1.
Example:
Dataset 1:
id trans_id
1 a
1 b
1 c
2 c
2 d
2 e
2 f
Dataset 2:
id trans_id score
1 a 0.3
1 b 0.4
1 c 0.5
1 d 0.1
1 e 0.2
1 f 0.5
2 a 0.1
2 b 0.5
2 c 0.6
2 d 0.8
2 e 0.9
2 f 0.2
Final Dataset:
id trans_id score
1 d 0.1
1 e 0.2
1 f 0.5
2 a 0.1
2 b 0.5
I'm attempting to do this in Scala (Python is my language of choice) and I'm a little lost. If I were working with just one ID, I could use the isin function, but I'm not sure how to do it for all of the IDs.
Any help would be much appreciated.
The simplest way might be to use a left_anti join:
val df1 = Seq(
  (1, "a"), (1, "b"), (1, "c"),
  (2, "c"), (2, "d"), (2, "e"), (2, "f")
).toDF("id", "trans_id")

val df2 = Seq(
  (1, "a", 0.3), (1, "b", 0.4), (1, "c", 0.5), (1, "d", 0.1), (1, "e", 0.2), (1, "f", 0.5),
  (2, "a", 0.1), (2, "b", 0.5), (2, "c", 0.6), (2, "d", 0.8), (2, "e", 0.9), (2, "f", 0.2)
).toDF("id", "trans_id", "score")
df2.join(df1, Seq("id", "trans_id"), "left_anti").show
// +---+--------+-----+
// | id|trans_id|score|
// +---+--------+-----+
// | 1| d| 0.1|
// | 1| e| 0.2|
// | 1| f| 0.5|
// | 2| a| 0.1|
// | 2| b| 0.5|
// +---+--------+-----+
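A left_anti join keeps exactly those rows of the left side (df2) that have no match in the right side (df1) on the join keys, which is precisely the "remove what the user already has" semantics described above; it is the DataFrame counterpart of SQL's NOT EXISTS.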

Spark GraphFrames: How to use the component ID in connectedComponents

I'm trying to find all the connected components (in this example, 4 is connected to 100, 2 is connected to 200, etc.). I used val g2 = GraphFrame(v2, e2) and val result2 = g2.connectedComponents.run(), which returns nodes with a component ID. My problem is: how do I use this ID to see all the connected nodes? How do I find out which node this ID belongs to? Many thanks, I'm quite new to this.
val v2 = sqlContext.createDataFrame(List(
  ("a", 1),
  ("b", 2),
  ("c", 3),
  ("d", 4),
  ("e", 100),
  ("f", 200),
  ("g", 300),
  ("h", 400)
)).toDF("id", "nodes")
val e2 = sqlContext.createDataFrame(List(
  (4, 100, "friend"),
  (2, 200, "follow"),
  (3, 300, "follow"),
  (4, 400, "follow"),
  (1, 100, "follow"),
  (1, 400, "friend")
)).toDF("src", "dst", "relationship")
In this example I expect to see the connections below:
+---+---+
|  4|400|
|  4|100|
|  1|400|
|  1|100|
+---+---+
This is what the result shows now: (1,1), (2,2), (3,1), (4,1), (100,1), (200,2), (300,3), (400,1). How do I see all the connections?
You have declared "a", "b", "c"... to be your graph's node ids, but later used 1, 2, 3... as node ids to define the edges.
You should make the numbers 1, 2, 3... the node ids when creating the vertices DataFrame, by naming that column "id":
val v2 = sqlContext.createDataFrame(List(
  ("a", 1),
  ("b", 2),
  ("c", 3),
  ("d", 4),
  ("e", 100),
  ("f", 200),
  ("g", 300),
  ("h", 400)
)).toDF("nodes", "id")
That should give you the desired results.
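To then actually list the connections, a minimal sketch (my own addition, not part of the original answer) is to rebuild the graph with the corrected vertices and group the connectedComponents result by the component column it adds. Depending on your GraphFrames version you may need to set a checkpoint directory first.

import org.apache.spark.sql.functions._

val g2 = GraphFrame(v2, e2)
val result2 = g2.connectedComponents.run()

// result2 has one row per vertex plus a "component" column; vertices that
// share a component value are connected, directly or transitively.
result2
  .groupBy("component")
  .agg(collect_list("id").as("members"))
  .show(false)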

Sum the Distance in Apache Spark DataFrames

The following code produces a DataFrame in which each column holds three values, as shown below.
import org.graphframes._
import org.apache.spark.sql.DataFrame

val v = sqlContext.createDataFrame(List(
  ("1", "Al"),
  ("2", "B"),
  ("3", "C"),
  ("4", "D"),
  ("5", "E")
)).toDF("id", "name")

val e = sqlContext.createDataFrame(List(
  ("1", "3", 5),
  ("1", "2", 8),
  ("2", "3", 6),
  ("2", "4", 7),
  ("2", "1", 8),
  ("3", "1", 5),
  ("3", "2", 6),
  ("4", "2", 7),
  ("4", "5", 8),
  ("5", "4", 8)
)).toDF("src", "dst", "property")

val g = GraphFrame(v, e)
val paths: DataFrame = g.bfs.fromExpr("id = '1'").toExpr("id = '5'").run()
paths.show()

val df = paths
df.select(df.columns.filter(_.startsWith("e")).map(df(_)): _*).show
Output of the above code:
+-------+-------+-------+
| e0| e1| e2|
+-------+-------+-------+
|[1,2,8]|[2,4,7]|[4,5,8]|
+-------+-------+-------+
In the above output each column holds three values, which can be interpreted as follows:
e0: source 1, destination 2, distance 8
e1: source 2, destination 4, distance 7
e2: source 4, destination 5, distance 8
Basically, e0, e1, and e2 are the edges. I want to sum the third element of each column, i.e. add the distances of the edges to get the total distance. How can I achieve this?
It can be done like this:
import org.apache.spark.sql.functions.col

val total = df.columns.filter(_.startsWith("e")) // all edge columns
  .map(c => col(s"$c.property"))                 // or col(c).getItem("property")
  .reduce(_ + _)                                 // add the distances together
df.withColumn("total", total)
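With the paths DataFrame above, this appends a total column containing 23 (8 + 7 + 8).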
I would make a collection of the columns to sum and then use a foldLeft on a UDF:
scala> val df = Seq((Array(1,2,8),Array(2,4,7),Array(4,5,8))).toDF("e0", "e1", "e2")
df: org.apache.spark.sql.DataFrame = [e0: array<int>, e1: array<int>, e2: array<int>]
scala> df.show
+---------+---------+---------+
| e0| e1| e2|
+---------+---------+---------+
|[1, 2, 8]|[2, 4, 7]|[4, 5, 8]|
+---------+---------+---------+
scala> val colsToSum = df.columns
colsToSum: Array[String] = Array(e0, e1, e2)
scala> val accLastUDF = udf((acc: Int, col: Seq[Int]) => acc + col.last)
accLastUDF: org.apache.spark.sql.UserDefinedFunction = UserDefinedFunction(<function2>,IntegerType,List(IntegerType, ArrayType(IntegerType,false)))
scala> df.withColumn("dist", colsToSum.foldLeft(lit(0))((acc, colName) => accLastUDF(acc, col(colName)))).show
+---------+---------+---------+----+
| e0| e1| e2|dist|
+---------+---------+---------+----+
|[1, 2, 8]|[2, 4, 7]|[4, 5, 8]| 23|
+---------+---------+---------+----+