Spark GraphFrames: how to use the component ID in connectedComponents (Scala)

I'm trying to find all the connected components (in this example, 4 is connected to 100, 2 is connected to 200, etc.). I used
val g2 = GraphFrame(v2, e2)
val result2 = g2.connectedComponents.run()
and that returns the nodes with a component ID. My problem is: how do I use this ID to see all the connected nodes? How do I find out which nodes an ID belongs to? Many thanks, I'm quite new to this.
val v2 = sqlContext.createDataFrame(List(
  ("a", 1),
  ("b", 2),
  ("c", 3),
  ("d", 4),
  ("e", 100),
  ("f", 200),
  ("g", 300),
  ("h", 400)
)).toDF("nodes", "id")
val e2 = sqlContext.createDataFrame(List(
  (4, 100, "friend"),
  (2, 200, "follow"),
  (3, 300, "follow"),
  (4, 400, "follow"),
  (1, 100, "follow"),
  (1, 400, "friend")
)).toDF("src", "dst", "relationship")
In this example I expect to see the connections below:
+----+----+
|   4| 400|
|   4| 100|
|   1| 400|
|   1| 100|
+----+----+
This is what the result shows now:
(1,1), (2,2), (3,1), (4,1), (100,1), (200,2), (300,3), (400,1)
How do I see all the connections?

You have declared "a", "b", "c", ... in one vertex column, but your edges use 1, 2, 3, ... as node ids. GraphFrames identifies each vertex by the column named "id", so the numeric values 1, 2, 3, ... must be in that column so that they match the src and dst values in the edges DataFrame:
val v2 = sqlContext.createDataFrame(List(
  ("a", 1),
  ("b", 2),
  ("c", 3),
  ("d", 4),
  ("e", 100),
  ("f", 200),
  ("g", 300),
  ("h", 400)
)).toDF("nodes", "id")
That should give you the desired results.
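Once the ids line up, here is a minimal sketch (not part of the original answer) of how to read the connectedComponents output: group the result by its component column (named "component" in current GraphFrames versions) to list every node that belongs to the same connected component.
import org.apache.spark.sql.functions.collect_list
// Assumes result2 is the DataFrame returned by g2.connectedComponents.run()
result2
  .groupBy("component")
  .agg(collect_list("id").alias("connected_nodes"))
  .show()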

Related

How can I capitalize specific words in a spark column?

I am trying to capitalize some words in a column in my Spark DataFrame. The words are all in a list.
val wrds = List("usa", "gb")
val dF = List(
  (1, "z", 3, "Bob lives in the usa"),
  (4, "t", 2, "gb is where Beth lives"),
  (5, "t", 2, "ogb")
).toDF("id", "name", "thing", "country")
I would like to have an output of
val dF = List(
  (1, "z", 3, "Bob lives in the USA"),
  (4, "t", 2, "GB is where Beth lives"),
  (5, "t", 2, "ogb")
).toDF("id", "name", "thing", "country")
It seems I have to split the string in the column and then capitalize each part based on whether it is present in the list. I am mostly struggling with row 3, where I do not want to capitalize "ogb" even though it contains "gb". Could anyone point me in the right direction?
import org.apache.spark.sql.functions._
val words = Array("usa", "gb")
val df = List(
  (1, "z", 3, "Bob lives in the usa"),
  (4, "t", 2, "gb is where Beth lives"),
  (5, "t", 2, "ogb")
).toDF("id", "name", "thing", "country")
// Replace each word with its upper-case form, but only when it appears as a whole word
val replaced = words.foldLeft(df) {
  case (adf, word) =>
    adf.withColumn("country", regexp_replace($"country", "(\\b" + word + "\\b)", word.toUpperCase))
}
replaced.show
Output:
+---+----+-----+--------------------+
| id|name|thing| country|
+---+----+-----+--------------------+
| 1| z| 3|Bob lives in the USA|
| 4| t| 2|GB is where Beth ...|
| 5| t| 2| ogb|
+---+----+-----+--------------------+
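The \b word boundaries in the regex are what keep "ogb" untouched: "gb" is replaced only when it stands alone as a word.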

Scala - Filter data in DF for each ID based on the transactions in another DF

Problem Overview:
Dataset 1: users have multiple rows, each associated with some transaction ID.
Dataset 2: each user has a row for every transaction ID in the database.
What I'd like to do is remove any transaction in Dataset 2 that a user has in Dataset 1.
Example:
Dataset 1:
id trans_id
1 a
1 b
1 c
2 c
2 d
2 e
2 f
Dataset 2:
id trans_id score
1 a 0.3
1 b 0.4
1 c 0.5
1 d 0.1
1 e 0.2
1 f 0.5
2 a 0.1
2 b 0.5
2 c 0.6
2 d 0.8
2 e 0.9
2 f 0.2
Final Dataset:
id trans_id score
1 d 0.1
1 e 0.2
1 f 0.5
2 a 0.1
2 b 0.5
I'm attempting to do this in Scala (Python is my language of choice) and I'm a little lost. If I were working with just one ID, I could use the isin function, but I'm not sure how to do this for all of the IDs.
Any help would be much appreciated.
The simplest way might be to use a left_anti join:
val df1 = Seq(
  (1, "a"), (1, "b"), (1, "c"),
  (2, "c"), (2, "d"), (2, "e"), (2, "f")
).toDF("id", "trans_id")
val df2 = Seq(
  (1, "a", 0.3), (1, "b", 0.4), (1, "c", 0.5), (1, "d", 0.1), (1, "e", 0.2), (1, "f", 0.5),
  (2, "a", 0.1), (2, "b", 0.5), (2, "c", 0.6), (2, "d", 0.8), (2, "e", 0.9), (2, "f", 0.2)
).toDF("id", "trans_id", "score")
df2.join(df1, Seq("id", "trans_id"), "left_anti").show
// +---+--------+-----+
// | id|trans_id|score|
// +---+--------+-----+
// | 1| d| 0.1|
// | 1| e| 0.2|
// | 1| f| 0.5|
// | 2| a| 0.1|
// | 2| b| 0.5|
// +---+--------+-----+
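The left_anti join keeps only the rows of df2 whose (id, trans_id) combination has no match in df1, which is exactly the "remove transactions the user already has" semantics, with no per-user isin needed.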

group data in pyspark and get the topn data in each group

I have some data, which may be shown simply as:
conf = SparkConf().setMaster("local[*]").setAppName("test")
sc = SparkContext(conf=conf).getOrCreate()
spark = SparkSession(sparkContext=sc).builder.getOrCreate()
rdd = sc.parallelize([(1, 10), (3, 11), (1, 8), (1, 12), (3, 7), (3, 9)])
data = spark.createDataFrame(rdd, ['x', 'y'])
data.show()
def f(x):
    y = sorted(x, reverse=True)[:2]
    return y
h_f = udf(f, IntegerType())
h_f = spark.udf.register("h_f", h_f)
data.groupBy('x').agg({"y": h_f}).show()
But it fails with AttributeError: 'function' object has no attribute '_get_object_id'. How can I get the top n items in each group?
Assuming you are looking for the top n 'y' elements within each group of 'x':
from pyspark.sql import functions as F
rdd = sc.parallelize([(1, 10), (3, 11), (1, 8), (1, 12), (3, 7), (3, 9)])
df = spark.createDataFrame(rdd, ['x', 'y'])
df.show()
# Collect the y values per x, sort them in descending order, then take the first two
df_g = df.groupBy('x').agg(F.collect_list('y').alias('y'))
df_g = df_g.withColumn('y_sorted', F.sort_array('y', asc=False))
df_g.withColumn('y_slice', F.slice(df_g.y_sorted, 1, 2)).show()
Output
+---+-----------+-----------+--------+
| x| y| y_sorted| y_slice|
+---+-----------+-----------+--------+
| 1|[10, 8, 12]|[12, 10, 8]|[12, 10]|
| 3| [11, 7, 9]| [11, 9, 7]| [11, 9]|
+---+-----------+-----------+--------+
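As a side note (my reading of the original error, not stated in the answer): the dict form of agg() expects the string name of a built-in aggregate function, so passing a registered UDF object into it is most likely what raises the AttributeError, and a plain Python UDF is not an aggregate function in any case. Also, F.slice is only available in Spark 2.4+; on older versions you can take the first two elements of the collected list with a small UDF instead.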

Spark [Scala]: Checking if all the Rows of a smaller DataFrame exists in the bigger DataFrame

I got two DataFrames, with the same schema (but +100 columns):
Small size: 1000 rows
Bigger size: 90000 rows
How do I check that every row in 1 exists in 2? What is the "Spark way" of doing this? Should I use map and deal with it at the Row level, or should I join and then use some sort of comparison with the smaller DataFrame?
You can use except, which returns all rows of the first dataset that are not present in the second:
smaller.except(bigger).isEmpty
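except compares complete rows, so this works here because both DataFrames share the same schema; an empty result means every row of the smaller DataFrame is present in the bigger one.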
You can also inner join the DataFrames and compare counts to check whether there is a difference.
def isIncluded(smallDf: DataFrame, biggerDf: DataFrame): Boolean = {
  val keys = smallDf.columns.toSeq
  val joinedDf = smallDf.join(biggerDf, keys) // you might want to broadcast smallDf for performance
  joinedDf.count == smallDf.count
}
However, I think the except method is clearer. I'm not sure about the performance (it might just be a join underneath).
I would probably do it with a join.
A left outer join keeps every row of the small DataFrame; rows with no match in the large DataFrame end up with nulls in the columns coming from it, so you just check whether that set of unmatched rows is empty (see the sketch after the output below).
Code:
val seq1 = Seq(
  ("A", "abc", 0.1, 0.0, 0),
  ("B", "def", 0.15, 0.5, 0),
  ("C", "ghi", 0.2, 0.2, 1),
  ("D", "jkl", 1.1, 0.1, 0),
  ("E", "mno", 0.1, 0.1, 0)
)
val seq2 = Seq(
  ("A", "abc", "a", "b", "?"),
  ("C", "ghi", "a", "c", "?")
)
val df1 = ss.sparkContext.makeRDD(seq1).toDF("cA", "cB", "cC", "cD", "cE")
val df2 = ss.sparkContext.makeRDD(seq2).toDF("cA", "cB", "cH", "cI", "cJ")
df2.join(df1, df1("cA") === df2("cA"), "leftOuter").show
Output:
+---+---+---+---+---+---+---+---+---+---+
| cA| cB| cH| cI| cJ| cA| cB| cC| cD| cE|
+---+---+---+---+---+---+---+---+---+---+
| C|ghi| a| c| ?| C|ghi|0.2|0.2| 1|
| A|abc| a| b| ?| A|abc|0.1|0.0| 0|
+---+---+---+---+---+---+---+---+---+---+
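A minimal follow-up sketch, not part of the original answer: to find the rows of the small DataFrame that have no match in the large one, keep the same left outer join and filter on nulls in the columns coming from df1.
val unmatched = df2
  .join(df1, df1("cA") === df2("cA"), "leftOuter")
  .filter(df1("cA").isNull) // rows of df2 with no counterpart in df1
unmatched.count == 0 // true when every row of df2 appears in df1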

Sum the distance in Apache Spark DataFrames

The following code gives a DataFrame having three values in each column, as shown below.
import org.graphframes._
import org.apache.spark.sql.DataFrame
val v = sqlContext.createDataFrame(List(
  ("1", "Al"),
  ("2", "B"),
  ("3", "C"),
  ("4", "D"),
  ("5", "E")
)).toDF("id", "name")
val e = sqlContext.createDataFrame(List(
  ("1", "3", 5),
  ("1", "2", 8),
  ("2", "3", 6),
  ("2", "4", 7),
  ("2", "1", 8),
  ("3", "1", 5),
  ("3", "2", 6),
  ("4", "2", 7),
  ("4", "5", 8),
  ("5", "4", 8)
)).toDF("src", "dst", "property")
val g = GraphFrame(v, e)
val paths: DataFrame = g.bfs.fromExpr("id = '1'").toExpr("id = '5'").run()
paths.show()
val df = paths
df.select(df.columns.filter(_.startsWith("e")).map(df(_)): _*).show
Output of the above code is given below:
+-------+-------+-------+
| e0| e1| e2|
+-------+-------+-------+
|[1,2,8]|[2,4,7]|[4,5,8]|
+-------+-------+-------+
In the above output, we can see that each column has three values and they can be interpreted as follows.
e0: source 1, destination 2, distance 8
e1: source 2, destination 4, distance 7
e2: source 4, destination 5, distance 8
Basically e0, e1, and e2 are the edges. I want to sum the third element of each column, i.e. add the distance of each edge to get the total distance. How can I achieve this?
It can be done like this:
import org.apache.spark.sql.functions.col
val total = df.columns.filter(_.startsWith("e"))
  .map(c => col(s"$c.property")) // or col(c).getItem("property")
  .reduce(_ + _)
df.withColumn("total", total)
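For the BFS result shown above this adds 8 + 7 + 8, so the total column contains 23.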
I would make a collection of the columns to sum and then use a foldLeft on a UDF:
scala> val df = Seq((Array(1,2,8),Array(2,4,7),Array(4,5,8))).toDF("e0", "e1", "e2")
df: org.apache.spark.sql.DataFrame = [e0: array<int>, e1: array<int>, e2: array<int>]
scala> df.show
+---------+---------+---------+
| e0| e1| e2|
+---------+---------+---------+
|[1, 2, 8]|[2, 4, 7]|[4, 5, 8]|
+---------+---------+---------+
scala> val colsToSum = df.columns
colsToSum: Array[String] = Array(e0, e1, e2)
scala> val accLastUDF = udf((acc: Int, col: Seq[Int]) => acc + col.last)
accLastUDF: org.apache.spark.sql.UserDefinedFunction = UserDefinedFunction(<function2>,IntegerType,List(IntegerType, ArrayType(IntegerType,false)))
scala> df.withColumn("dist", colsToSum.foldLeft(lit(0))((acc, colName) => accLastUDF(acc, col(colName)))).show
+---------+---------+---------+----+
| e0| e1| e2|dist|
+---------+---------+---------+----+
|[1, 2, 8]|[2, 4, 7]|[4, 5, 8]| 23|
+---------+---------+---------+----+