Scala - Spark: build a Graph (GraphX) from vertices and edges DataFrames

I have two DataFrames with these schemas:
edges
|-- src: string (nullable = true)
|-- dst: string (nullable = true)
|-- relationship: struct (nullable = false)
| |-- business_id: string (nullable = true)
| |-- normalized_influence: double (nullable = true)
vertices
|-- id: string (nullable = true)
|-- state: boolean (nullable = true)
To build a graph, I converted these DataFrames as follows:
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
import scala.util.hashing.MurmurHash3

case class Relationship(business_id: String, normalized_influence: Double)
case class MyEdge(src: String, dst: String, relationship: Relationship)

val edgesRDD: RDD[Edge[Relationship]] = communityEdgeDF.as[MyEdge].rdd.map { edge =>
  Edge(
    MurmurHash3.stringHash(edge.src).toLong,
    MurmurHash3.stringHash(edge.dst).toLong,
    edge.relationship
  )
}

case class MyVertex(id: String, state: Boolean)

val verticesRDD: RDD[(VertexId, (String, Boolean))] = communityVertexDF.as[MyVertex].rdd.map { vertex =>
  (
    MurmurHash3.stringHash(vertex.id).toLong,
    (vertex.id, vertex.state)
  )
}

val graphX = Graph(verticesRDD, edgesRDD)
This is part of the output of the vertices:
res6: Array[(org.apache.spark.graphx.VertexId, (String, Boolean))] = Array((1874415454,(KRZALzi0ZgrGYyjZNg72_g,false)), (1216259959,(JiFBQ_-vWgJtRZEEruSStg,false)), (-763896211,(LZge-YpVL0ukJVD2nw5sag,false)), (-2032982683,(BHP3LVkTOfh3w4UIhgqItg,false)), (844547135,(JRC3La2fiNkK0VU7qZ9vyQ,false))
and this is part of the output of the edges:
res3: Array[org.apache.spark.graphx.Edge[Relationship]] = Array(Edge(-268040669,1495494297,Relationship(cJWbbvGmyhFiBpG_5hf5LA,0.0017532149785518423)), Edge(-268040669,-125364603,Relationship(cJWbbvGmyhFiBpG_5hf5LA,0.0017532149785518423))
But doing this:
graphX.vertices.collect
I get this incorrect output:
Array((1981723824,null), (-333497649,null), (-597749329,null), (451246392,null), (-1287295481,null), (1013727024,null), (-194805089,null), (1621180464,null), (1874415454,(KRZALzi0ZgrGYyjZNg72_g,false)), (1539311488,null)
What is the problem? Did I build the Graph incorrectly?
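For reference, GraphX's Graph.apply also takes an optional default vertex attribute: any vertex ID that appears in the edge RDD but has no entry in the vertex RDD is materialized with that default, and when no default is supplied it falls back to null, which is what the output above shows. A minimal sketch using the RDDs built above (the ("unknown", false) placeholder is just an illustrative choice):
// Supplying an explicit default makes such "edge-only" vertices easy to spot
// instead of showing up as null attributes.
val graphX = Graph(verticesRDD, edgesRDD, defaultVertexAttr = ("unknown", false))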

Related

Create a new column in a dataset using a case class structure

Assuming the following schema for a table - Places:
root
|-- place_id: string (nullable = true)
|-- street_address: string (nullable = true)
|-- city: string (nullable = true)
|-- state_province: string (nullable = true)
|-- postal_code: string (nullable = true)
|-- country: string (nullable = true)
|-- neighborhood: string (nullable = true)
val places is of type Dataset[Row]
and I have the following case class:
case class csm(
city: Option[String] = None,
stateProvince: Option[String] = None,
country: Option[String] = None
)
How would I go about altering this dataset, or creating a new one, so that it has the following schema:
root
|-- place_id: string (nullable = true)
|-- street_address: string (nullable = true)
|-- subpremise: string (nullable = true)
|-- city: string (nullable = true)
|-- state_province: string (nullable = true)
|-- postal_code: string (nullable = true)
|-- country: string (nullable = true)
|-- neighborhood: string (nullable = true)
|-- csm: struct (nullable = true)
| |-- city: string (nullable = true)
| |-- state_province: string (nullable = true)
| |-- country: string (nullable = true)
I've been looking into withColumn methods, and they seem to require UDFs. The challenge is that I have to manually specify the columns, which is easy for this use case, but as my problem scales it will be difficult to maintain them manually.
Used this as a reference: https://intellipaat.com/community/16433/how-to-add-a-new-struct-column-to-a-dataframe
In your case class declaration you have a stateProvince parameter, but in your dataframe the column is named state_province.
I'm not sure whether that's a typo, so first, a quick-and-dirty (not thoroughly tested) camelCase to snake_case converter, just in case:
def normalize(x: String): String =
  "([a-z])([A-Z])".r.replaceAllIn(x, m => s"${m.group(1)}_${m.group(2).toLowerCase}")
Next, let's get the parameters of a case class:
val case_class_params = Seq[csm]().toDF.columns
And with this, we can now get columns for our case class struct:
val csm_cols = case_class_params.map(x => col(normalize(x)))
val df2 = df.withColumn("csm", struct(csm_cols:_*))
+--------+--------------+---------+--------------+-----------+------------+-----------+----------------------------------------+
|place_id|street_address|city |state_province|postal_code|country |neghborhood|csm |
+--------+--------------+---------+--------------+-----------+------------+-----------+----------------------------------------+
|123 |str_addr |some_city|some_province |some_zip |some_country|NA |{some_city, some_province, some_country}|
+--------+--------------+---------+--------------+-----------+------------+-----------+----------------------------------------+
root
|-- place_id: string (nullable = true)
|-- street_address: string (nullable = true)
|-- city: string (nullable = true)
|-- state_province: string (nullable = true)
|-- postal_code: string (nullable = true)
|-- country: string (nullable = true)
|-- neghborhood: string (nullable = true)
|-- csm: struct (nullable = false)
| |-- city: string (nullable = true)
| |-- state_province: string (nullable = true)
| |-- country: string (nullable = true)
Nested case classes
Seq[csm]().toDF.columns won't give you nested columns. For that some basic schema traversal is required. E.g., one way to do it, adapted from here:
def flatten(schema: StructType): Seq[String] =
  schema.fields.flatMap { field =>
    field.dataType match {
      case structType: StructType =>
        flatten(structType)
      case _ =>
        field.name :: Nil
    }
  }
case class StateProvince(
  stateProvince: Option[String] = None,
  country: Option[String] = None)

case class csm(
  city: Option[String] = None,
  state: StateProvince
)
val case_class_params = flatten(Seq[csm]().toDF.schema)
// case_class_params: Seq[String] = ArraySeq(city, stateProvince, country)
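From there, the same struct-building step as before can be reused over the flattened names. This is only a sketch; note it produces a flat csm struct, which is what the target schema in the question asks for:
// Reusing normalize, col and struct from the earlier snippet over the flattened field names.
val csm_cols = case_class_params.map(name => col(normalize(name)))
val df2 = df.withColumn("csm", struct(csm_cols: _*))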
case class Source(
place_id: Option[String],
street_address: Option[String],
city: Option[String],
state_province: Option[String],
postal_code: Option[String],
country: Option[String],
neighborhood: Option[String]
)
case class Csm(
city: Option[String] = None,
stateProvince: Option[String] = None,
country: Option[String] = None
)
case class Result(
place_id: Option[String],
street_address: Option[String],
subpremise: Option[String],
city: Option[String],
state_province: Option[String],
postal_code: Option[String],
country: Option[String],
neighborhood: Option[String],
csm: Csm
)
import spark.implicits._
val sourceDF = Seq(
Source(
Some("s-1-1"),
Some("s-1-2"),
Some("s-1-3"),
Some("s-1-4"),
Some("s-1-5"),
Some("s-1-6"),
Some("s-1-7")
),
Source(
Some("s-2-1"),
Some("s-2-2"),
Some("s-2-3"),
Some("s-2-4"),
Some("s-2-5"),
Some("s-2-6"),
Some("s-2-7")
)
).toDF()
val resultDF = sourceDF
  .map(r => {
    Result(
      Some(r.getAs[String]("place_id")),
      Some(r.getAs[String]("street_address")),
      Some("set your value"),
      Some(r.getAs[String]("city")),
      Some(r.getAs[String]("state_province")),
      Some(r.getAs[String]("postal_code")),
      Some(r.getAs[String]("country")),
      Some(r.getAs[String]("neighborhood")),
      Csm(
        Some(r.getAs[String]("city")),
        Some(r.getAs[String]("state_province")),
        Some(r.getAs[String]("country"))
      )
    )
  })
  .toDF()
resultDF.printSchema()
// root
// |-- place_id: string (nullable = true)
// |-- street_address: string (nullable = true)
// |-- subpremise: string (nullable = true)
// |-- city: string (nullable = true)
// |-- state_province: string (nullable = true)
// |-- postal_code: string (nullable = true)
// |-- country: string (nullable = true)
// |-- neighborhood: string (nullable = true)
// |-- csm: struct (nullable = true)
// | |-- city: string (nullable = true)
// | |-- stateProvince: string (nullable = true)
// | |-- country: string (nullable = true)
resultDF.show(false)
// +--------+--------------+--------------+-----+--------------+-----------+-------+------------+---------------------+
// |place_id|street_address|subpremise |city |state_province|postal_code|country|neighborhood|csm |
// +--------+--------------+--------------+-----+--------------+-----------+-------+------------+---------------------+
// |s-1-1 |s-1-2 |set your value|s-1-3|s-1-4 |s-1-5 |s-1-6 |s-1-7 |[s-1-3, s-1-4, s-1-6]|
// |s-2-1 |s-2-2 |set your value|s-2-3|s-2-4 |s-2-5 |s-2-6 |s-2-7 |[s-2-3, s-2-4, s-2-6]|
// +--------+--------------+--------------+-----+--------------+-----------+-------+------------+---------------------+
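One small, optional refinement to the map above: wrapping getAs in Option(...) instead of Some(...) maps SQL NULLs to None rather than Some(null), for example:
Option(r.getAs[String]("place_id"))  // None when the column is NULL, instead of Some(null)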

search in schema of a dataframe using pyspark

I have a set of dataframes, dfs, with different schemas, for example:
root
|-- A_id: string (nullable = true)
|-- b_cd: string (nullable = true)
|-- c_id: integer (nullable = true)
|-- d_info: struct (nullable = true)
| |-- eid: string (nullable = true)
| |-- oid: string (nullable = true)
|-- l: array (nullable = true)
| |-- m: struct (containsNull = true)
| | |-- n: string (nullable = true)
| | |-- o: string (nullable = true)
..........
I want to check whether, for example, "oid" appears in one of the columns (here, under the d_info column). How can I search inside the schemas of a set of dataframes and distinguish them? PySpark or Scala suggestions are both helpful. Thank you.
A dictionary/map of [node, root-to-node path] entries can be created for a DataFrame's StructType (including nested StructTypes) using a recursive function.
val df = spark.read.json("nested_data.json")
val path = searchSchema(df.schema, "n", "root")
def searchSchema(schema: StructType, key: String, path: String): String = {
  val paths = scala.collection.mutable.Map[String, String]()
  addPaths(schema, path, paths)
  paths(key)
}

def addPaths(schema: StructType, path: String, paths: scala.collection.mutable.Map[String, String]): Unit = {
  for (field <- schema.fields) {
    val _path = s"$path.${field.name}"
    paths += (field.name -> _path)
    field.dataType match {
      case structType: StructType => addPaths(structType, _path, paths)
      case arrayType: ArrayType   => addPaths(arrayType.elementType.asInstanceOf[StructType], _path, paths)
      case _ => // do nothing
    }
  }
}
Input and output
Input = {"A_id":"A_id","b_cd":"b_cd","c_id":1,"d_info":{"eid":"eid","oid":"oid"},"l":[{"m":{"n":"n1","o":"01"}},{"m":{"n":"n2","o":"02"}}]}
Output = Map(n -> root.l.m.n, b_cd -> root.b_cd, d_info -> root.d_info, m -> root.l.m, oid -> root.d_info.oid, c_id -> root.c_id, l -> root.l, o -> root.l.m.o, eid -> root.d_info.eid, A_id -> root.A_id)
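So looking up "oid", as in the question, resolves to its path under d_info; a usage sketch against the map above:
searchSchema(df.schema, "oid", "root")  // "root.d_info.oid"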

UnFlatten Dataframe to a specific structure

I have a flat dataframe (df) with the structure as below:
root
|-- first_name: string (nullable = true)
|-- middle_name: string (nullable = true)
|-- last_name: string (nullable = true)
|-- title: string (nullable = true)
|-- start_date: string (nullable = true)
|-- end_Date: string (nullable = true)
|-- city: string (nullable = true)
|-- zip_code: string (nullable = true)
|-- state: string (nullable = true)
|-- country: string (nullable = true)
|-- email_name: string (nullable = true)
|-- company: struct (nullable = true)
|-- org_name: string (nullable = true)
|-- company_phone: string (nullable = true)
|-- partition_column: string (nullable = true)
And I need to convert this dataframe into a structure like (as my next data will be in this format):
root
|-- firstName: string (nullable = true)
|-- middleName: string (nullable = true)
|-- lastName: string (nullable = true)
|-- currentPosition: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- title: string (nullable = true)
| | |-- startDate: string (nullable = true)
| | |-- endDate: string (nullable = true)
| | |-- address: struct (nullable = true)
| | | |-- city: string (nullable = true)
| | | |-- zipCode: string (nullable = true)
| | | |-- state: string (nullable = true)
| | | |-- country: string (nullable = true)
| | |-- emailName: string (nullable = true)
| | |-- company: struct (nullable = true)
| | | |-- orgName: string (nullable = true)
| | | |-- companyPhone: string (nullable = true)
|-- partitionColumn: string (nullable = true)
So far I have implemented this:
case class IndividualCompany(orgName: String,
companyPhone: String)
case class IndividualAddress(city: String,
zipCode: String,
state: String,
country: String)
case class IndividualPosition(title: String,
startDate: String,
endDate: String,
address: IndividualAddress,
emailName: String,
company: IndividualCompany)
case class Individual(firstName: String,
middleName: String,
lastName: String,
currentPosition: Seq[IndividualPosition],
partitionColumn: String)
val makeCompany = udf((orgName: String, companyPhone: String) => IndividualCompany(orgName, companyPhone))
val makeAddress = udf((city: String, zipCode: String, state: String, country: String) => IndividualAddress(city, zipCode, state, country))
val makePosition = udf((title: String, startDate: String, endDate: String, address: IndividualAddress, emailName: String, company: IndividualCompany)
=> List(IndividualPosition(title, startDate, endDate, address, emailName, company)))
val selectData = df.select(
  col("first_name").as("firstName"),
  col("middle_name").as("middleName"),
  col("last_name").as("lastName"),
  makePosition(col("job_title"),
    col("start_date"),
    col("end_Date"),
    makeAddress(col("city"),
      col("zip_code"),
      col("state"),
      col("country")),
    col("email_name"),
    makeCompany(col("org_name"),
      col("company_phone"))).as("currentPosition"),
  col("partition_column").as("partitionColumn")
).as[Individual]
selectData.printSchema()
selectData.show(10)
I can see a proper schema generated for selectData, but the last line, where I try to fetch some actual data, fails with an error saying it failed to execute a user-defined function.
org.apache.spark.SparkException: Failed to execute user defined function(anonfun$4: (string, string, string, struct<city:string,zipCode:string,state:string,country:string>, string, struct<orgName:string,companyPhone:string>) => array<struct<title:string,startDate:string,endDate:string,address:struct<city:string,zipCode:string,state:string,country:string>,emailName:string,company:struct<orgName:string,companyPhone:string>>>)
Is there any better way to achieve this?
The problem here is that a UDF can't take IndividualAddress and IndividualCompany directly as input. These are represented as structs in Spark, and to use them in a UDF the correct input type is Row. That means you need to change the declaration of makePosition to:
val makePosition = udf((title: String,
                        startDate: String,
                        endDate: String,
                        address: Row,
                        emailName: String,
                        company: Row)
Inside the UDF you now need to use e.g. address.getAs[String]("city") to access the struct's fields, and to use the case class as a whole you need to construct it again.
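As a sketch, assuming the makeAddress and makeCompany UDFs from the question (so the struct field names match the case class fields), the Row-based version could look like this:
import org.apache.spark.sql.Row

val makePosition = udf((title: String,
                        startDate: String,
                        endDate: String,
                        address: Row,
                        emailName: String,
                        company: Row) =>
  List(IndividualPosition(
    title,
    startDate,
    endDate,
    IndividualAddress(
      address.getAs[String]("city"),
      address.getAs[String]("zipCode"),
      address.getAs[String]("state"),
      address.getAs[String]("country")),
    emailName,
    IndividualCompany(
      company.getAs[String]("orgName"),
      company.getAs[String]("companyPhone")))))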
The easier and better alternative would be to do everything in a single udf as follows:
val makePosition = udf((title: String,
                        startDate: String,
                        endDate: String,
                        city: String,
                        zipCode: String,
                        state: String,
                        country: String,
                        emailName: String,
                        orgName: String,
                        companyPhone: String) =>
  Seq(
    IndividualPosition(
      title,
      startDate,
      endDate,
      IndividualAddress(city, zipCode, state, country),
      emailName,
      IndividualCompany(orgName, companyPhone)
    )
  )
)
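With this version the nested makeAddress and makeCompany calls disappear and the select just passes the raw columns through. A sketch, assuming the column names from the flat schema at the top of the question:
val selectData = df.select(
  col("first_name").as("firstName"),
  col("middle_name").as("middleName"),
  col("last_name").as("lastName"),
  makePosition(
    col("title"),
    col("start_date"),
    col("end_Date"),
    col("city"),
    col("zip_code"),
    col("state"),
    col("country"),
    col("email_name"),
    col("org_name"),
    col("company_phone")).as("currentPosition"),
  col("partition_column").as("partitionColumn")
).as[Individual]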
I had a similar requirement.
What I did was create a typed user-defined aggregation that produces a List of elements.
import org.apache.spark.sql.{Encoder, TypedColumn}
import org.apache.spark.sql.expressions.Aggregator
import scala.collection.mutable
object ListAggregator {
  private type Buffer[T] = mutable.ListBuffer[T]

  /** Returns a column that aggregates all elements of type T in a List. */
  def create[T](columnName: String)
               (implicit listEncoder: Encoder[List[T]], listBufferEncoder: Encoder[Buffer[T]]): TypedColumn[T, List[T]] =
    new Aggregator[T, Buffer[T], List[T]] {
      override def zero: Buffer[T] =
        mutable.ListBuffer.empty[T]

      override def reduce(buffer: Buffer[T], elem: T): Buffer[T] =
        buffer += elem

      override def merge(b1: Buffer[T], b2: Buffer[T]): Buffer[T] =
        if (b1.length >= b2.length) b1 ++= b2 else b2 ++= b1

      override def finish(reduction: Buffer[T]): List[T] =
        reduction.toList

      override def bufferEncoder: Encoder[Buffer[T]] =
        listBufferEncoder

      override def outputEncoder: Encoder[List[T]] =
        listEncoder
    }.toColumn.name(columnName)
}
Now you can use it like this.
import org.apache.spark.sql.SparkSession
val spark =
  SparkSession
    .builder
    .master("local[*]")
    .getOrCreate()

import spark.implicits._

final case class Flat(id: Int, name: String, age: Int)
final case class Grouped(age: Int, users: List[(Int, String)])

val data =
  List(
    (1, "Luis", 21),
    (2, "Miguel", 21),
    (3, "Sebastian", 16)
  ).toDF("id", "name", "age").as[Flat]

val grouped =
  data
    .groupByKey(flat => flat.age)
    .mapValues(flat => (flat.id, flat.name))
    .agg(ListAggregator.create(columnName = "users"))
    .map(tuple => Grouped(age = tuple._1, users = tuple._2))
// grouped: org.apache.spark.sql.Dataset[Grouped] = [age: int, users: array<struct<_1:int,_2:string>>]
grouped.show(truncate = false)
// +---+------------------------+
// |age|users |
// +---+------------------------+
// |16 |[[3, Sebastian]] |
// |21 |[[1, Luis], [2, Miguel]]|
// +---+------------------------+

Array of String to Array of Struct in Scala + Spark

I am currently using Spark and Scala 2.11.8
I have the following schema:
root
|-- partnumber: string (nullable = true)
|-- brandlabel: string (nullable = true)
|-- availabledate: string (nullable = true)
|-- descriptions: array (nullable = true)
| |-- element: string (containsNull = true)
I am trying to use UDF to convert it to the following:
root
|-- partnumber: string (nullable = true)
|-- brandlabel: string (nullable = true)
|-- availabledate: string (nullable = true)
|-- description: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- value: string (nullable = true)
| | |-- code: string (nullable = true)
| | |-- cost: int (nullable = true)
So source data looks like this:
[WrappedArray(a abc 100,b abc 300)]
[WrappedArray(c abc 400)]
I need to use " " (space) as a delimiter, but I don't know how to do this in Scala.
def convert(product: Seq[String]): Seq[Row] = {
  ???
}
I am fairly new to Scala, so can someone guide me on how to construct this type of function?
Thanks.
I do not know if I understand your problem right, but map could be your friend.
case class Row(a: String, b: String, c: Int)

val value = List(List("a", "abc", 123), List("b", "bcd", 321))

value map {
  case List(a: String, b: String, c: Int) => Row(a, b, c)
}
if you have to parse it first:
val value2 = List("a b 123", "c d 345")

value2 map { s =>
  val split = s.split(" ")
  Row(split(0), split(1), split(2).toInt)
}
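To apply the same idea inside Spark, here is a sketch that wraps the parsing in a UDF returning a Seq of a case class, which Spark encodes as an array of structs. It assumes df is the DataFrame from the question and that every description string really is "value code cost" separated by single spaces, as in the sample data:
import org.apache.spark.sql.functions.{col, udf}

case class Description(value: String, code: String, cost: Int)

// Parses one "value code cost" string per array element.
val toDescriptions = udf { (descriptions: Seq[String]) =>
  descriptions.map { s =>
    val parts = s.split(" ")
    Description(parts(0), parts(1), parts(2).toInt)
  }
}

val converted = df.withColumn("description", toDescriptions(col("descriptions")))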

How can I create a Spark DataFrame from a nested array of struct element?

I have read a JSON file into Spark. This file has the following structure:
scala> tweetBlob.printSchema
root
|-- related: struct (nullable = true)
| |-- next: struct (nullable = true)
| | |-- href: string (nullable = true)
|-- search: struct (nullable = true)
| |-- current: long (nullable = true)
| |-- results: long (nullable = true)
|-- tweets: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- cde: struct (nullable = true)
...
...
| | |-- cdeInternal: struct (nullable = true)
...
...
| | |-- message: struct (nullable = true)
...
...
What I would ideally want is a DataFrame with columns "cde", "cdeInternal", "message"... as shown below
root
|-- cde: struct (nullable = true)
...
...
|-- cdeInternal: struct (nullable = true)
...
...
|-- message: struct (nullable = true)
...
...
I have managed to use "explode" to extract elements from the "tweets" array into a column called "tweets":
scala> val tweets = tweetBlob.select(explode($"tweets").as("tweets"))
tweets: org.apache.spark.sql.DataFrame = [tweets: struct<cde:struct<author:struct<gender:string,location:struct<city:string,country:string,state:string>,maritalStatus:struct<evidence:string,isMarried:string>,parenthood:struct<evidence:string,isParent:string>>,content:struct<sentiment:struct<evidence:array<struct<polarity:string,sentimentTerm:string>>,polarity:string>>>,cdeInternal:struct<compliance:struct<isActive:boolean,userProtected:boolean>,tracks:array<struct<id:string>>>,message:struct<actor:struct<displayName:string,favoritesCount:bigint,followersCount:bigint,friendsCount:bigint,id:string,image:string,languages:array<string>,link:string,links:array<struct<href:string,rel:string>>,listedCount:bigint,location:struct<displayName:string,objectType:string>,objectType:string,postedTime...
scala> tweets.printSchema
root
|-- tweets: struct (nullable = true)
| |-- cde: struct (nullable = true)
...
...
| |-- cdeInternal: struct (nullable = true)
...
...
| |-- message: struct (nullable = true)
...
...
How can I select all columns inside the struct and create a DataFrame out of it? Explode does not work on a struct if my understanding is correct.
Any help is appreciated.
One possible way to handle this is to extract the required information from the schema. Let's start with some dummy data:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._
case class Bar(x: Int, y: String)
case class Foo(bar: Bar)
val df = sc.parallelize(Seq(Foo(Bar(1, "first")), Foo(Bar(2, "second")))).toDF
df.printSchema
// root
// |-- bar: struct (nullable = true)
// | |-- x: integer (nullable = false)
// | |-- y: string (nullable = true)
and a helper function:
def children(colname: String, df: DataFrame) = {
  val parent = df.schema.fields.filter(_.name == colname).head
  val fields = parent.dataType match {
    case x: StructType => x.fields
    case _ => Array.empty[StructField]
  }
  fields.map(x => col(s"$colname.${x.name}"))
}
Finally the results:
df.select(children("bar", df): _*).printSchema
// root
// |-- x: integer (nullable = true)
// |-- y: string (nullable = true)
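Applied to the exploded tweets DataFrame from the question, the same helper should produce the desired cde/cdeInternal/message columns. A sketch, reusing the tweets value created with explode above:
tweets.select(children("tweets", tweets): _*).printSchema
// root
// |-- cde: struct (nullable = true)
// |-- cdeInternal: struct (nullable = true)
// |-- message: struct (nullable = true)
// ...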
You can use:
df
  .select(explode(col("path_to_collection")).as("collection"))
  .select(col("collection.*"))
Example:
scala> val json = """{"name":"Michael", "schools":[{"sname":"stanford", "year":2010}, {"sname":"berkeley", "year":2012}]}"""
scala> val inline = sqlContext.read.json(sc.parallelize(json :: Nil)).select(explode(col("schools")).as("collection")).select(col("collection.*"))
scala> inline.printSchema
root
|-- sname: string (nullable = true)
|-- year: long (nullable = true)
scala> inline.show
+--------+----+
| sname|year|
+--------+----+
|stanford|2010|
|berkeley|2012|
+--------+----+
Or you can use the SQL function inline:
scala> val json = """{"name":"Michael", "schools":[{"sname":"stanford", "year":2010}, {"sname":"berkeley", "year":2012}]}"""
scala> sqlContext.read.json(sc.parallelize(json :: Nil)).registerTempTable("tmp")
scala> val inline = sqlContext.sql("SELECT inline(schools) FROM tmp")
scala> inline.printSchema
root
|-- sname: string (nullable = true)
|-- year: long (nullable = true)
scala> inline.show
+--------+----+
| sname|year|
+--------+----+
|stanford|2010|
|berkeley|2012|
+--------+----+
scala> import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.DataFrame
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._
scala> case class Bar(x: Int, y: String)
defined class Bar
scala> case class Foo(bar: Bar)
defined class Foo
scala> val df = sc.parallelize(Seq(Foo(Bar(1, "first")), Foo(Bar(2, "second")))).toDF
df: org.apache.spark.sql.DataFrame = [bar: struct<x: int, y: string>]
scala> df.printSchema
root
|-- bar: struct (nullable = true)
| |-- x: integer (nullable = false)
| |-- y: string (nullable = true)
scala> df.select("bar.*").printSchema
root
|-- x: integer (nullable = true)
|-- y: string (nullable = true)