Combine two datasets based on value - scala

I have following two datasets:
val dfA = Seq(
  ("001", "10", "Cat"),
  ("001", "20", "Dog"),
  ("001", "30", "Bear"),
  ("002", "10", "Mouse"),
  ("002", "20", "Squirrel"),
  ("002", "30", "Turtle")
).toDF("Package", "LineItem", "Animal")

val dfB = Seq(
  ("001", "", "X", "A"),
  ("001", "", "Y", "B"),
  ("002", "", "X", "C"),
  ("002", "", "Y", "D"),
  ("002", "20", "X", "E")
).toDF("Package", "LineItem", "Flag", "Category")
I need to join them with specific conditions:
a) There is always a row in dfB with the X flag and an empty LineItem; its Category should be the default Category for the Package from dfA
b) When there is a LineItem specified in dfB, the default Category should be overwritten with the Category associated with this LineItem
Expected output:
+---------+----------+----------+----------+
| Package | LineItem | Animal   | Category |
+---------+----------+----------+----------+
| 001     | 10       | Cat      | A        |
| 001     | 20       | Dog      | A        |
| 001     | 30       | Bear     | A        |
| 002     | 10       | Mouse    | C        |
| 002     | 20       | Squirrel | E        |
| 002     | 30       | Turtle   | C        |
+---------+----------+----------+----------+
I spent some time on it today, but I can't figure out how it could be accomplished. I'd appreciate your assistance.
Thanks!

You can use two joins plus a when clause:
val dfC = dfA
  .join(dfB, dfB.col("Flag") === "X" && dfA.col("LineItem") === dfB.col("LineItem") && dfA.col("Package") === dfB.col("Package"))
  .select(dfA.col("Package").as("priorPackage"), dfA.col("LineItem").as("priorLineItem"), dfB.col("Category").as("priorCategory"))
  .as("dfC")

val dfD = dfA
  .join(dfB, dfB.col("LineItem") === "" && dfB.col("Flag") === "X" && dfA.col("Package") === dfB.col("Package"), "left_outer")
  .join(dfC, dfA.col("LineItem") === dfC.col("priorLineItem") && dfA.col("Package") === dfC.col("priorPackage"), "left_outer")
  .select(
    dfA.col("Package"),
    dfA.col("LineItem"),
    dfA.col("Animal"),
    when(dfC.col("priorCategory").isNotNull, dfC.col("priorCategory")).otherwise(dfB.col("Category")).as("Category")
  )

dfD.show()

This should work for you:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
val dfA = Seq(
  ("001", "10", "Cat"),
  ("001", "20", "Dog"),
  ("001", "30", "Bear"),
  ("002", "10", "Mouse"),
  ("002", "20", "Squirrel"),
  ("002", "30", "Turtle")
).toDF("Package", "LineItem", "Animal")

val dfB = Seq(
  ("001", "", "X", "A"),
  ("001", "", "Y", "B"),
  ("002", "", "X", "C"),
  ("002", "", "Y", "D"),
  ("002", "20", "X", "E")
).toDF("Package", "LineItem", "Flag", "Category")

val result = {
  dfA.as("a")
    .join(dfB.where('Flag === "X").as("b"), $"a.Package" === $"b.Package" and ($"a.LineItem" === $"b.LineItem" or $"b.LineItem" === ""), "left")
    .withColumn("anyRowsInGroupWithBLineItemDefined", first(when($"b.LineItem" =!= "", lit(true)), ignoreNulls = true).over(Window.partitionBy($"a.Package", $"a.LineItem")).isNotNull)
    .where(!$"anyRowsInGroupWithBLineItemDefined" or ($"anyRowsInGroupWithBLineItemDefined" and $"b.LineItem" =!= ""))
    .select($"a.Package", $"a.LineItem", $"a.Animal", $"b.Category")
}
result.orderBy($"a.Package", $"a.LineItem").show(false)
// +-------+--------+--------+--------+
// |Package|LineItem|Animal |Category|
// +-------+--------+--------+--------+
// |001 |10 |Cat |A |
// |001 |20 |Dog |A |
// |001 |30 |Bear |A |
// |002 |10 |Mouse |C |
// |002 |20 |Squirrel|E |
// |002 |30 |Turtle |C |
// +-------+--------+--------+--------+
The "tricky" part is calculating whether or not there are any rows with LineItem defined in dfB for a given Package, LineItem in dfA. You can see how I perform this calculation in anyRowsInGroupWithBLineItemDefined which involves the use of a window function. Other than that, it's just a normal SQL programming exercise.
Also want to note that this code should be more efficient than the other solution as here we only shuffle the data twice (during join and during window function) and only read in each dataset once.
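For reference, here is a more compact sketch of the same idea (not from either answer above): split dfB into per-Package defaults and per-LineItem overrides, then coalesce. It assumes the same dfA, dfB, and the functions/implicits imports used above; treat it as an illustration rather than a drop-in replacement.
// Defaults: the X row with an empty LineItem, one per Package
val defaults = dfB
  .filter($"Flag" === "X" && $"LineItem" === "")
  .select($"Package", $"Category".as("defaultCategory"))

// Overrides: X rows that name a specific LineItem
val overrides = dfB
  .filter($"Flag" === "X" && $"LineItem" =!= "")
  .select($"Package", $"LineItem", $"Category".as("lineCategory"))

dfA
  .join(defaults, Seq("Package"), "left")
  .join(overrides, Seq("Package", "LineItem"), "left")
  .select($"Package", $"LineItem", $"Animal",
    coalesce($"lineCategory", $"defaultCategory").as("Category"))
  .show()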


Pyspark group by collect list, to_json and pivot

Summary: Combining multiple rows to columns for a user
Input DF:

+----+----------+------+------+------+------+
| Id | group    | A1   | A2   | B1   | B2   |
+----+----------+------+------+------+------+
| 1  | Alpha    | 1    | 2    | null | null |
| 1  | AlphaNew | 6    | 8    | null | null |
| 2  | Alpha    | 7    | 4    | null | null |
| 2  | Beta     | null | null | 3    | 9    |
+----+----------+------+------+------+------+
Note: The group values are dynamic
Expected Output DF:

+----+----------+----------+-------------+-------------+---------+---------+
| Id | Alpha_A1 | Alpha_A2 | AlphaNew_A1 | AlphaNew_A2 | Beta_B1 | Beta_B2 |
+----+----------+----------+-------------+-------------+---------+---------+
| 1  | 1        | 2        | 6           | 8           | null    | null    |
| 2  | 7        | 4        | null        | null        | 3       | 9       |
+----+----------+----------+-------------+-------------+---------+---------+
Attempted Solution:
I thought of making a JSON of the non-null columns for each row, then doing a group by and collect_list of maps. Then I can explode the JSON to get the expected output.
But I am stuck at the stage of a nested JSON. Here is my code:
vcols = df.columns[2:]
df\
    .withColumn('json', F.to_json(F.struct(*vcols)))\
    .groupby('id')\
    .agg(
        F.to_json(
            F.collect_list(
                F.create_map('group', 'json')
            )
        ).alias('json')
    )
+----+---------------------------------------------------+
| Id | json                                              |
+----+---------------------------------------------------+
| 1  | [{Alpha: {A1:1, A2:2}}, {AlphaNew: {A1:6, A2:8}}] |
| 2  | [{Alpha: {A1:7, A2:4}}, {Beta: {B1:3, B2:9}}]     |
+----+---------------------------------------------------+
What I am trying to get:
+----+-----------------------------------------------------------+
| Id | json                                                      |
+----+-----------------------------------------------------------+
| 1  | [{Alpha_A1:1, Alpha_A2:2, AlphaNew_A1:6, AlphaNew_A2:8}]  |
| 2  | [{Alpha_A1:7, Alpha_A2:4, Beta_B1:3, Beta_B2:9}]          |
+----+-----------------------------------------------------------+
I'd appreciate any help. I'm also trying to avoid UDFs, as my true dataframe's shape is quite big.
There's definitely a better way to do this, but I continued your to_json experiment.
Using UDFs:
After you get something like [{Alpha: {A1:1, A2:2}}, {AlphaNew: {A1:6, A2:8}}], you could create a UDF to flatten the dict. But since it's a JSON string, you'll have to parse it into a dict and then back to JSON again.
After that you would like to explode and pivot the table, but that's not possible with JSON strings, so you have to use F.from_json with a defined schema. That will give you a MapType which you can explode and pivot.
Here's an example:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from collections.abc import MutableMapping
import json
from pyspark.sql.types import (
    ArrayType,
    IntegerType,
    MapType,
    StringType,
)


def flatten_dict(d, parent_key="", sep="_"):
    items = []
    for k, v in d.items():
        new_key = parent_key + sep + k if parent_key else k
        if isinstance(v, MutableMapping):
            items.extend(flatten_dict(v, new_key, sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)


def flatten_groups(data):
    result = []
    for item in json.loads(data):
        result.append(flatten_dict(item))
    return json.dumps(result)


if __name__ == "__main__":
    spark = SparkSession.builder.master("local").appName("Test").getOrCreate()

    data = [
        (1, "Alpha", 1, 2, None, None),
        (1, "AlphaNew", 6, 8, None, None),
        (2, "Alpha", 7, 4, None, None),
        (2, "Beta", None, None, 3, 9),
    ]
    columns = ["Id", "group", "A1", "A2", "B1", "B2"]
    df = spark.createDataFrame(data, columns)

    vcols = df.columns[2:]
    df = (
        df.withColumn("json", F.struct(*vcols))
        .groupby("id")
        .agg(F.to_json(F.collect_list(F.create_map("group", "json"))).alias("json"))
    )

    # Flatten groups
    flatten_groups_udf = F.udf(lambda x: flatten_groups(x))
    schema = ArrayType(MapType(StringType(), IntegerType()))
    df = df.withColumn("json", F.from_json(flatten_groups_udf(F.col("json")), schema))

    # Explode and pivot
    df = df.select(F.col("id"), F.explode(F.col("json")).alias("json"))
    df = (
        df.select("id", F.explode("json"))
        .groupby("id")
        .pivot("key")
        .agg(F.first("value"))
    )
At the end, the dataframe looks like:
+---+-----------+-----------+--------+--------+-------+-------+
|id |AlphaNew_A1|AlphaNew_A2|Alpha_A1|Alpha_A2|Beta_B1|Beta_B2|
+---+-----------+-----------+--------+--------+-------+-------+
|1 |6 |8 |1 |2 |null |null |
|2 |null |null |7 |4 |3 |9 |
+---+-----------+-----------+--------+--------+-------+-------+
Without UDFs:
vcols = df.columns[2:]
df = (
    df.withColumn("json", F.to_json(F.struct(*vcols)))
    .groupby("id")
    .agg(
        F.collect_list(
            F.create_map(
                "group", F.from_json("json", MapType(StringType(), IntegerType()))
            )
        ).alias("json")
    )
)
df = df.withColumn("json", F.explode(F.col("json")).alias("json"))
df = df.select("id", F.explode(F.col("json")).alias("root", "value"))
df = df.select("id", "root", F.explode(F.col("value")).alias("sub", "value"))
df = df.select(
    "id", F.concat(F.col("root"), F.lit("_"), F.col("sub")).alias("name"), "value"
)
df = df.groupBy(F.col("id")).pivot("name").agg(F.first("value"))
Result:
+---+-----------+-----------+--------+--------+-------+-------+
|id |AlphaNew_A1|AlphaNew_A2|Alpha_A1|Alpha_A2|Beta_B1|Beta_B2|
+---+-----------+-----------+--------+--------+-------+-------+
|1 |6 |8 |1 |2 |null |null |
|2 |null |null |7 |4 |3 |9 |
+---+-----------+-----------+--------+--------+-------+-------+
I found a slightly better way than the json approach:
- Stack the input dataframe's value columns A1, A2, B1, B2, ... as rows, so the structure looks like id, group, sub, value, where sub holds the column name (A1, A2, B1, B2, ...) and value holds the associated value
- Filter out the rows that have value as null
- Now we are able to pivot by the group; since the null-value rows are removed, we won't have the initial issue of the pivot creating extra columns
import pyspark.sql.functions as F
data = [
    (1, "Alpha", 1, 2, None, None),
    (1, "AlphaNew", 6, 8, None, None),
    (2, "Alpha", 7, 4, None, None),
    (2, "Beta", None, None, 3, 9),
]
columns = ["id", "group", "A1", "A2", "B1", "B2"]
df = spark.createDataFrame(data, columns)
# Value columns that need to be stacked
vcols = df.columns[2:]
expr_str = ', '.join([f"'{i}', {i}" for i in vcols])
expr_str = f"stack({len(vcols)}, {expr_str}) as (sub, value)"
df = df\
    .selectExpr("id", "group", expr_str)\
    .filter(F.col("value").isNotNull())\
    .select("id", F.concat("group", F.lit("_"), "sub").alias("group"), "value")\
    .groupBy("id")\
    .pivot("group")\
    .agg(F.first("value"))
df.show()
Result:
+---+-----------+-----------+--------+--------+-------+-------+
| id|AlphaNew_A1|AlphaNew_A2|Alpha_A1|Alpha_A2|Beta_B1|Beta_B2|
+---+-----------+-----------+--------+--------+-------+-------+
| 1| 6| 8| 1| 2| null| null|
| 2| null| null| 7| 4| 3| 9|
+---+-----------+-----------+--------+--------+-------+-------+

Compare two data frame row by row in pyspark

I am trying to compare two data frames row by row, such that if any mismatch is found it is printed in the format shown below.
Example:
data = [("James", "M", 60000), ("Michael", "M", 70000),
("Robert", None, 400000), ("Maria", "F", 500000),
("Jen", "", None),(None,None,None)]
columns = ["name", "gender", "salary"]
source_df = spark.createDataFrame(data=data, schema=columns)
source_df.show()
+-------+------+------+
| name|gender|salary|
+-------+------+------+
| James| M| 60000|
|Michael| M| 70000|
| Robert| null|400000|
| Maria| F|500000|
| Jen| | null|
| null| null| null|
+-------+------+------+
data1 = [("Anurag", "M", 70000), ("Michael", "M", 70000),
("Sunil", None, 900000), ("Maria", "F", 500000),
("Jen", "", None),(None,None,None)]
columns = ["name_1", "gender_1", "salary_1"]
target_df = spark.createDataFrame(data=data1, schema=columns)
target_df.show()
+-------+--------+--------+
| name_1|gender_1|salary_1|
+-------+--------+--------+
| Anurag| M| 70000|
|Michael| M| 70000|
| Sunil| null| 900000|
| Maria| F| 500000|
| Jen| | null|
| null| null| null|
+-------+--------+--------+
So, here I have to iterate over the rows of the 1st dataframe (e.g. James|M|60000), compare each with the corresponding row of the 2nd dataframe (Anurag|M|70000), and so on, and print the output in a formatted way if any mismatch is found,
e.g.:
1st df: 'James|M|60000'
2nd df: 'Anurag|M|70000'
output: Mismatch: name-->James,name_1--> Anurag
salary-->60000, salary_1--> 70000..so on..
Please let me know if any other info is required.
I'm very new to PySpark, so I need your help. Thanks in advance!
The code below is working fine for me.
from pyspark.sql.functions import concat, monotonically_increasing_id, udf, col
from pyspark.sql import SparkSession
import findspark

findspark.init()
findspark.find()


def print_mismatch(row):
    output = ""
    for i in range(len(source_cols)):
        output += f"Row Index--> {row['id']}, "
        if row[source_cols[i]] != row[target_cols[i]]:
            output += f"{source_cols[i]}--> {row[source_cols[i]]}, {target_cols[i]}--> {row[target_cols[i]]} "
    return output


spark = SparkSession \
    .builder \
    .appName("SparkExample") \
    .getOrCreate()

data = [("James", "M", 60000), ("Michael", "M", 70000),
        ("Robert", None, 400000), ("Maria", "F", 500000),
        ("Jen", "", None), (None, None, None)]
columns = ["name", "gender", "salary"]
source_df = spark.createDataFrame(data=data, schema=columns)
rdd_df = source_df.rdd.zipWithIndex()
source_df = rdd_df.toDF().select(col("_1.*"), col("_2").alias("id"))

data1 = [("Anurag", "M", 70000), ("Michael", "M", 70000),
         ("Sunil", None, 900000), ("Maria", "F", 500000),
         ("Jen", "", None), (None, None, None)]
columns = ["name_1", "gender_1", "salary_1"]
target_df = spark.createDataFrame(data=data1, schema=columns)
rdd_df = target_df.rdd.zipWithIndex()
target_df = rdd_df.toDF().select(col("_1.*"), col("_2").alias("id1"))

final_df = source_df.join(target_df, source_df.id == target_df.id1)

source_cols = source_df.columns
target_cols = target_df.columns
final_columns = source_cols + target_cols

df = spark.createDataFrame([], schema=final_df.schema)
for i in range(len(source_cols)):
    final = final_df.filter(
        final_df[f'{source_cols[i]}'] != final_df[f'{target_cols[i]}'])
    df = df.union(final)
    # final.collect()

final_df.show()
# df.show()

df_rdd = df.rdd
df_rdd.map(print_mismatch).collect()
You can achieve the same as follows:
from pyspark.sql import SparkSession
from pyspark.sql.functions import concat, monotonically_increasing_id, udf, col


def print_mismatch(row):
    output = ""
    for i in range(len(source_cols)):
        if row[source_cols[i]] != row[target_cols[i]]:
            output += f"{source_cols[i]}--> {row[source_cols[i]]}, {target_cols[i]}--> {row[target_cols[i]]}"
    print(output)


spark = SparkSession \
    .builder \
    .appName("SparkExample") \
    .getOrCreate()

data = [("James", "M", 60000), ("Michael", "M", 70000),
        ("Robert", None, 400000), ("Maria", "F", 500000),
        ("Jen", "", None), (None, None, None)]
columns = ["name", "gender", "salary"]
source_df = spark.createDataFrame(data=data, schema=columns)
rdd_df = source_df.rdd.zipWithIndex()
source_df = rdd_df.toDF().select(col("_1.*"), col("_2").alias("id"))
# source_df = source_df.withColumn("id", monotonically_increasing_id())
source_df.show()

data1 = [("Anurag", "M", 70000), ("Michael", "M", 70000),
         ("Sunil", None, 900000), ("Maria", "F", 500000),
         ("Jen", "", None), (None, None, None)]
columns = ["name_1", "gender_1", "salary_1"]
target_df = spark.createDataFrame(data=data1, schema=columns)
rdd_df = target_df.rdd.zipWithIndex()
target_df = rdd_df.toDF().select(col("_1.*"), col("_2").alias("id"))
# target_df = target_df.withColumn("id", monotonically_increasing_id())
target_df.show()

final_df = source_df.join(target_df, source_df.id == target_df.id)
final_df.show()

source_cols = source_df.columns
target_cols = target_df.columns

final_df.foreach(lambda row: print_mismatch(row))
name--> James, name_1--> Anurag salary--> 60000, salary_1--> 70000
name--> Robert, name_1--> Sunil salary--> 400000, salary_1--> 900000
Run it with: spark-submit --master local spark_combining_df.py
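As a rough alternative sketch (not from either answer above), the mismatch message can also be built as a column on the joined final_df instead of being printed row by row on the executors. The column pairs below are hard-coded assumptions matching the example data.
from pyspark.sql import functions as F

pairs = [("name", "name_1"), ("gender", "gender_1"), ("salary", "salary_1")]

# One "col--> x, col_1--> y" fragment per mismatching pair; concat_ws drops the
# null fragments produced when a pair matches (null-safe comparison).
mismatch = F.concat_ws(
    " ",
    *[
        F.when(
            ~F.col(a).eqNullSafe(F.col(b)),
            F.concat_ws(
                "",
                F.lit(f"{a}--> "), F.col(a).cast("string"),
                F.lit(f", {b}--> "), F.col(b).cast("string"),
            ),
        )
        for a, b in pairs
    ],
)

final_df.withColumn("mismatch", mismatch) \
    .filter(F.col("mismatch") != "") \
    .select("name", "name_1", "mismatch") \
    .show(truncate=False)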

Spark:join condition with Array (nullable ones)

I have 2 DataFrames and would like to join them and then filter the data: I want to drop the rows where OrgTypeToExclude matches, with respect to each TransactionId.
In a single word, TransactionId is my join condition and OrgTypeToExclude is my exclude condition. Sharing a simple example here:
import org.apache.spark.sql.functions.expr
import spark.implicits._
val jsonstr ="""{
"id": "3b4219f8-0579-4933-ba5e-c0fc532eeb2a",
"Transactions": [
{
"TransactionId": "USAL",
"OrgTypeToExclude": ["A","B"]
},
{
"TransactionId": "USMD",
"OrgTypeToExclude": ["E"]
},
{
"TransactionId": "USGA",
"OrgTypeToExclude": []
}
]
}"""
val df = Seq((1, "USAL","A"),(4, "USAL","C"), (2, "USMD","B"),(5, "USMD","E"), (3, "USGA","C")).toDF("id", "code","Alp")
val json = spark.read.json(Seq(jsonstr).toDS).select("Transactions.TransactionId","Transactions.OrgTypeToExclude")
df.printSchema()
json.printSchema()
df.join(json,$"code"<=> $"TransactionId".cast("string") && !exp("array_contains(OrgTypeToExclude, Alp)") ,"inner" ).show()
-- Expected output
id  code    Alp
4   "USAL"  "C"
2   "USMD"  "B"
3   "USGA"  "C"
Thanks,
Manoj.
Transactions is an array type and you are accessing TransactionId and OrgTypeToExclude on it, so you will be getting multiple arrays.
Instead, just explode the root-level Transactions array and extract the struct values, i.e. OrgTypeToExclude and TransactionId; the next steps will be easy.
Please check the code below.
scala> val jsonstr ="""{
|
| "id": "3b4219f8-0579-4933-ba5e-c0fc532eeb2a",
| "Transactions": [
| {
| "TransactionId": "USAL",
| "OrgTypeToExclude": ["A","B"]
| },
| {
| "TransactionId": "USMD",
| "OrgTypeToExclude": ["E"]
| },
| {
| "TransactionId": "USGA",
| "OrgTypeToExclude": []
| }
| ]
| }"""
jsonstr: String =
{
"id": "3b4219f8-0579-4933-ba5e-c0fc532eeb2a",
"Transactions": [
{
"TransactionId": "USAL",
"OrgTypeToExclude": ["A","B"]
},
{
"TransactionId": "USMD",
"OrgTypeToExclude": ["E"]
},
{
"TransactionId": "USGA",
"OrgTypeToExclude": []
}
]
}
scala> val df = Seq((1, "USAL","A"),(4, "USAL","C"), (2, "USMD","B"),(5, "USMD","E"), (3, "USGA","C")).toDF("id", "code","Alp")
df: org.apache.spark.sql.DataFrame = [id: int, code: string ... 1 more field]
scala> val json = spark.read.json(Seq(jsonstr).toDS).select(explode($"Transactions").as("Transactions")).select($"Transactions.*")
json: org.apache.spark.sql.DataFrame = [OrgTypeToExclude: array<string>, TransactionId: string]
scala> df.show(false)
+---+----+---+
|id |code|Alp|
+---+----+---+
|1 |USAL|A |
|4 |USAL|C |
|2 |USMD|B |
|5 |USMD|E |
|3 |USGA|C |
+---+----+---+
scala> json.show(false)
+----------------+-------------+
|OrgTypeToExclude|TransactionId|
+----------------+-------------+
|[A, B] |USAL |
|[E] |USMD |
|[] |USGA |
+----------------+-------------+
scala> df.join(json, (df("code") === json("TransactionId") && !array_contains(json("OrgTypeToExclude"), df("Alp"))), "inner").select("id","code","alp").show(false)
+---+----+---+
|id |code|alp|
+---+----+---+
|4 |USAL|C |
|2 |USMD|B |
|3 |USGA|C |
+---+----+---+
scala>
First, it looks like you overlooked the fact that Transactions is also an array, which we can use explode to deal with:
val json = spark.read.json(Seq(jsonstr).toDS)
.select(explode($"Transactions").as("t")) // deal with Transactions array first
.select($"t.TransactionId", $"t.OrgTypeToExclude")
Also, array_contains wants a value rather than a column as its second argument. I'm not aware of a version that supports referencing a column, so we'll make a udf:
val arr_con = udf { (a: Seq[String], v: String) => a.contains(v) }
We can then modify the join condition like so:
df.join(json, $"code" <=> $"TransactionId" && ! arr_con($"OrgTypeToExclude", $"Alp"), "inner").show()
And the expected result:
scala> df.join(json, $"code" <=> $"TransactionId" && ! arr_con($"OrgTypeToExclude", $"Alp"), "inner").show()
+---+----+---+-------------+----------------+
| id|code|Alp|TransactionId|OrgTypeToExclude|
+---+----+---+-------------+----------------+
| 4|USAL| C| USAL| [A, B]|
| 2|USMD| B| USMD| [E]|
| 3|USGA| C| USGA| []|
+---+----+---+-------------+----------------+
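A small aside, offered as a sketch rather than part of either answer: in the Scala API, array_contains also accepts a Column as its value argument (the first answer above relies on exactly this), so the same join can be written without a UDF, assuming the df and json DataFrames defined earlier.
import org.apache.spark.sql.functions.array_contains

df.join(json,
    $"code" <=> $"TransactionId" && !array_contains($"OrgTypeToExclude", $"Alp"),
    "inner")
  .select("id", "code", "Alp")
  .show()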

Spark: create a sessionId based on timestamp

I would like to do the following transformation. Given a data frame that records whether a user is logged in, my aim is to create a sessionId for each record based on the timestamp and a pre-defined value TIMEOUT = 20.
A session period is defined as: [first record --> first record + TIMEOUT]
For instance, the original DataFrame would look like the following:
scala> val df = sc.parallelize(List(
  ("user1", 0),
  ("user1", 3),
  ("user1", 15),
  ("user1", 22),
  ("user1", 28),
  ("user1", 41),
  ("user1", 45),
  ("user1", 85),
  ("user1", 90)
)).toDF("user_id", "timestamp")
df: org.apache.spark.sql.DataFrame = [user_id: string, timestamp: int]
+-------+---------+
|user_id|timestamp|
+-------+---------+
|user1 |0 |
|user1 |3 |
|user1 |15 |
|user1 |22 |
|user1 |28 |
|user1 |41 |
|user1 |45 |
|user1 |85 |
|user1 |90 |
+-------+---------+
The goal is:
+-------+---------+----------+
|user_id|timestamp|session_id|
+-------+---------+----------+
|user1 |0 | 0 |-> first record (session 0: period [0->20])
|user1 |3 | 0 |
|user1 |15 | 0 |
|user1 |22 | 1 |-> 22 not in [0->20]->new session(period 22->42)
|user1 |28 | 1 |
|user1 |41 | 1 |
|user1 |45 | 2 |-> 45 not in [22->42]->newsession(period 45->65)
|user1 |85 | 3 |
|user1 |90 | 3 |
+-------+---------+----------+
Is there any elegant solution to this problem, preferably in Scala?
Thanks in advance!
This may not be an elegant solution, but it worked for the given data format.
sc.parallelize(List(
  ("user1", 0),
  ("user1", 3),
  ("user1", 15),
  ("user1", 22),
  ("user1", 28),
  ("user1", 41),
  ("user1", 45),
  ("user1", 85),
  ("user1", 90))).toDF("user_id", "timestamp").map { x =>
  val userId = x.getAs[String]("user_id")
  val timestamp = x.getAs[Int]("timestamp")
  val session = timestamp / 20
  (userId, timestamp, session)
}.toDF("user_id", "timestamp", "session").show()
Result
You can change timestamp / 20 according to your need.
Please see my code. There are two issues here:
1. I think the performance is not good.
2. I use "user_id" to join; if this doesn't meet your requirement, you can add a new column with the same value to timeSetFrame and newSessionSec.
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf
import scala.util.control.Breaks

var newSession = ss.sparkContext.parallelize(List(
  ("user1", 0), ("user1", 3), ("user1", 15), ("user1", 22),
  ("user1", 28), ("user1", 41), ("user1", 45), ("user1", 85),
  ("user1", 90))).zipWithIndex().toDF("tmp", "index")

val getUser_id = udf((s: Row) => {
  s.getString(0)
})
val gettimestamp = udf((s: Row) => {
  s.getInt(1)
})

val newSessionSec = newSession.withColumn("user_id", getUser_id($"tmp"))
  .withColumn("timestamp", gettimestamp($"tmp")).drop("tmp") //.show()

val timeSet: Array[Int] = newSessionSec.select("timestamp").collect().map(s => s.getInt(0))
val timeSetFrame = ss.sparkContext.parallelize(Seq(("user1", timeSet))).toDF("user_id", "tset")
val newSessionThird = newSessionSec.join(timeSetFrame, Seq("user_id"), "outer") // .show

val getSessionID = udf((ts: Int, aa: Seq[Int]) => {
  var result = 0
  var begin = 0
  val loop = new Breaks
  loop.breakable {
    for (time <- aa) {
      if (time > (begin + 20)) {
        begin = time
        result += 1
      }
      if (time == ts) {
        loop.break
      }
    }
  }
  result
})

newSessionThird.withColumn("sessionID", getSessionID($"timestamp", $"tset")).drop("tset", "index").show()
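For completeness, here is a minimal alternative sketch (not from the answers above), assuming the question's df, TIMEOUT = 20, spark.implicits._ in scope as in the REPL, and that each user's timestamps fit comfortably into a single collect_list: sort the timestamps per user, scan them once to assign session ids based on the session start plus TIMEOUT, then explode back into rows.
import org.apache.spark.sql.functions._

val TIMEOUT = 20

// Scan sorted timestamps; open a new session whenever a timestamp falls
// outside [sessionStart, sessionStart + TIMEOUT].
val assignSessions = udf { (ts: Seq[Int]) =>
  var sessionId = -1
  var sessionStart = Int.MinValue
  ts.map { t =>
    if (sessionId < 0 || t > sessionStart + TIMEOUT) {
      sessionId += 1
      sessionStart = t
    }
    (t, sessionId)
  }
}

df.groupBy("user_id")
  .agg(sort_array(collect_list($"timestamp")).as("ts"))
  .select($"user_id", explode(assignSessions($"ts")).as("rec"))
  .select($"user_id", $"rec._1".as("timestamp"), $"rec._2".as("session_id"))
  .show()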

Selection of Edges in GraphFrames

I am applying BFS using GraphFrames in Scala. How can I sum the edge weights of the selected shortest path?
I have the following code:
import org.graphframes._
import org.apache.spark.sql.DataFrame
val v = sqlContext.createDataFrame(List(
  ("1", "Al"),
  ("2", "B"),
  ("3", "C"),
  ("4", "D"),
  ("5", "E")
)).toDF("id", "name")

val e = sqlContext.createDataFrame(List(
  ("1", "3", 5),
  ("1", "2", 8),
  ("2", "3", 6),
  ("2", "4", 7),
  ("2", "1", 8),
  ("3", "1", 5),
  ("3", "2", 6),
  ("4", "2", 7),
  ("4", "5", 8),
  ("5", "4", 8)
)).toDF("src", "dst", "property")
val g = GraphFrame(v, e)
val paths: DataFrame = g.bfs.fromExpr("id = '1'").toExpr("id = '5'").run()
paths.show()
The output of the above code is:
+------+-------+-----+-------+-----+-------+-----+
| from| e0| v1| e1| v2| e2| to|
+------+-------+-----+-------+-----+-------+-----+
|[1,Al]|[1,2,8]|[2,B]|[2,4,7]|[4,D]|[4,5,8]|[5,E]|
+------+-------+-----+-------+-----+-------+-----+
But I need the output like this:
+----+--------+-------------+----------+
|    | source | Destination | Distance |
+----+--------+-------------+----------+
| e0 | 1      | 2           | 8        |
| e1 | 2      | 4           | 7        |
| e2 | 4      | 5           | 8        |
+----+--------+-------------+----------+
Unlike the example above, my graph is huge; it might actually return a large number of edges.
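One possible direction, offered as a sketch rather than a definitive answer: the BFS result keeps each traversed edge in columns named e0, e1, e2, ..., each a struct carrying src, dst, and the edge attributes (property here), so those columns can be unpivoted into rows and then aggregated.
import org.apache.spark.sql.functions._

val edgeCols = paths.columns.filter(_.startsWith("e"))

// One row per traversed edge: (edge, source, Destination, Distance)
val edges = edgeCols.map { c =>
  paths.select(
    lit(c).as("edge"),
    col(s"$c.src").as("source"),
    col(s"$c.dst").as("Destination"),
    col(s"$c.property").as("Distance"))
}.reduce(_ union _)

edges.show()

// Total weight of the selected path
edges.agg(sum("Distance").as("total_distance")).show()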