How to map a DataFrame to an EdgeRDD - Scala

I have a DataFrame like:
val data = sc.parallelize(Array((1,10,10,7,7),(2,7,7,7,8),(3, 5,5,6,8))).toDF("id","col1","col2","col3","col4")
What I want to do is create an EdgeRDD where two ids share a link if they have the same value in at least one of the columns:
id  col1  col2  col3  col4
1   10    10    7     7
2   7     7     7     8
3   5     5     6     8
Then nodes 1 and 2 have an undirected link 1--2, because they share a common value in col3.
For the same reason, nodes 2 and 3 share an undirected link because they share a common value in col4.
I know how to solve this in an ugly way (but I have way too many columns to adopt this strategy in my real case):
val data2 = data
  .withColumnRenamed("id", "idd")
  .withColumnRenamed("col1", "col1d")
  .withColumnRenamed("col2", "col2d")
  .withColumnRenamed("col3", "col3d")
  .withColumnRenamed("col4", "col4d")

val res = data.join(data2, data("id") < data2("idd")
  && (data("col1") === data2("col1d")
  || data("col2") === data2("col2d")
  || data("col3") === data2("col3d")
  || data("col4") === data2("col4d")))
//> res : org.apache.spark.sql.DataFrame = [id: int, col1: int, col2: int, col3: int, col4: int, idd: int, col1d: int, col2d: int, col3d: int, col4d: int]
res.show //> +---+----+----+----+----+---+-----+-----+-----+-----+
         //| | id|col1|col2|col3|col4|idd|col1d|col2d|col3d|col4d|
         //| +---+----+----+----+----+---+-----+-----+-----+-----+
         //| |  1|  10|  10|   7|   7|  2|    7|    7|    7|    8|
         //| |  2|   7|   7|   7|   8|  3|    5|    5|    6|    8|
         //| +---+----+----+----+----+---+-----+-----+-----+-----+
val links = EdgeRDD.fromEdges(res.map(row => Edge(row.getAs[Int]("id").toLong, row.getAs[Int]("idd").toLong, "indirect")))
//> links : org.apache.spark.graphx.impl.EdgeRDDImpl[String,Nothing] = EdgeRDDImpl[27] at RDD at EdgeRDD.scala:42
links.foreach(println) //> Edge(1,2,indirect)
                       //| Edge(2,3,indirect)
How can I solve this when there are many more columns?

Do you mean something like this?
val expr = data.columns.diff(Seq("id"))
.map(c => data(c) === data2(s"${c}d"))
.reduce(_ || _)
data.join(data2, data("id") < data2("idd") && expr)
You can also use aliases
import org.apache.spark.sql.functions.col
val expr = data.columns.diff(Seq("id"))
.map(c => col(s"d1.$c") === col(s"d2.$c"))
.reduce(_ || _)
data.alias("d1").join(data.alias("d2"), col("d1.id") < col("d2.id") && expr)
You can easily follow each of these with a simple select ($ is equivalent to col but requires importing sqlContext.implicits.StringToColumn):
.select($"id".cast("long"), $"idd".cast("long"))
or
.select($"d1.id".cast("long"), $"d2.id".cast("long"))
and a pattern matching:
.rdd.map { case Row(src: Long, dst: Long) => Edge(src, dst, "indirect") }
Just note that logical disjunctions like this one cannot be optimized and are expanded to a Cartesian product followed by a filter. If you want to avoid that, you can approach this problem in a different way.
Let's start by reshaping the data from wide to long:
import org.apache.spark.sql.functions.{explode, array, struct, lit}

val expr = explode(array(data.columns.tail.map(
  c => struct(lit(c).alias("column"), col(c).alias("value"))
): _*))

val long = data.withColumn("tmp", expr)
  .select($"id", $"tmp.column", $"tmp.value")
This will give us a DataFrame with the following schema:
long.printSchema
// root
// |-- id: integer (nullable = false)
// |-- column: string (nullable = false)
// |-- value: integer (nullable = false)
With data in this shape you have multiple choices, including an optimized join:
val pairs = long.as("long1")
  .join(long.as("long2"),
    $"long1.column" === $"long2.column" &&  // Optimized
    $"long1.value" === $"long2.value" &&    // Optimized
    $"long1.id" < $"long2.id"               // Not optimized - filtered after sort-merge join
  )
  // Select only ids
  .select($"long1.id".alias("src"), $"long2.id".alias("dst"))
  // And keep distinct pairs
  .distinct
pairs.show
// +---+---+
// |src|dst|
// +---+---+
// | 1| 2|
// | 2| 3|
// +---+---+
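To connect this back to the original goal, the pairs DataFrame can be turned into an EdgeRDD the same way as in the question (a minimal sketch using the GraphX types already imported there):

import org.apache.spark.graphx.{Edge, EdgeRDD}
import org.apache.spark.sql.Row

// src and dst are still integers here, so cast them to Long vertex ids
val links = EdgeRDD.fromEdges(pairs.rdd.map {
  case Row(src: Int, dst: Int) => Edge(src.toLong, dst.toLong, "indirect")
})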
This can be further improved by using different hashing techniques to avoid the large number of records generated by explode.
You can also think about this problem as a bipartite graph, where observations belong to one category of nodes and property-value pairs to another.
sealed trait CustomNode
case class Record(id: Long) extends CustomNode
case class Property(name: String, value: Int) extends CustomNode
With this as a starting point you can use long to generate edges of the following type:
Record -> Property
and solve this problem using GraphX directly by searching for paths like
Record -> Property <- Record
Hint: Collect neighbors for each property and propagate back.
As before, you should consider using hashing or buckets to limit the number of generated Property nodes.
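One possible sketch of that hint (my own, not part of the original answer): give each property a vertex id derived from hashing, build the Record -> Property graph, collect the Record in-neighbours of every Property node, and propagate them back as Record--Record edges. The negative-id trick below is only an assumption to keep property ids disjoint from record ids; with many properties you would want a collision-safe id scheme.

import org.apache.spark.graphx.{Edge, EdgeDirection, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Assumption: record ids are non-negative, so negative ids are free for properties.
// hashCode is only a sketch - it can collide, which would merge distinct properties.
def propertyId(column: String, value: Int): VertexId =
  -math.abs((column, value).hashCode.toLong) - 1L

val recordToProperty: RDD[Edge[Unit]] = long.rdd.map { r =>
  Edge(r.getInt(0).toLong, propertyId(r.getString(1), r.getInt(2)), ())
}

val graph = Graph.fromEdges(recordToProperty, defaultValue = ())

// For every Property vertex collect the Record ids pointing at it (its in-neighbours)
// and emit them back as Record--Record edges.
val indirectLinks = graph
  .collectNeighborIds(EdgeDirection.In)
  .flatMap { case (_, records) =>
    for (a <- records; b <- records if a < b) yield Edge(a, b, "indirect")
  }
  .distinct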

Related

Replace null values in each DataFrame row with a unique epoch time

I have 3 rows in a dataframe, and in 2 of the rows the column id has null values. I need to loop through each row for that specific column id and replace the nulls with an epoch time, which should be unique and should happen in the dataframe itself. How can it be done?
For eg:
id      | name
1       | a
null    | b
null    | c
I wanted this dataframe, which converts null to epoch time:
id      | name
1       | a
1435232 | b
1542344 | c
df
  .select(
    when($"id".isNull, /* epoch time */).otherwise($"id").alias("id"),
    $"name"
  )
EDIT
You need to make sure the UDF is precise enough - if it only has millisecond resolution you will see duplicate values. See my example below, which illustrates that the approach works:
scala> def rand(s: String): Double = Math.random
rand: (s: String)Double
scala> val udfF = udf(rand(_: String))
udfF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,DoubleType,Some(List(StringType)))
scala> res11.select(when($"id".isNull, udfF($"id")).otherwise($"id").alias("id"), $"name").collect
res21: Array[org.apache.spark.sql.Row] = Array([0.6668195187088702,a], [0.920625293516218,b])
Check this out
scala> val s1:Seq[(Option[Int],String)] = Seq( (Some(1),"a"), (null,"b"), (null,"c"))
s1: Seq[(Option[Int], String)] = List((Some(1),a), (null,b), (null,c))
scala> val df = s1.toDF("id","name")
df: org.apache.spark.sql.DataFrame = [id: int, name: string]
scala> val epoch = java.time.Instant.now.getEpochSecond
epoch: Long = 1539084285
scala> df.withColumn("id",when( $"id".isNull,epoch).otherwise($"id")).show
+----------+----+
| id|name|
+----------+----+
| 1| a|
|1539084285| b|
|1539084285| c|
+----------+----+
scala>
EDIT1:
Even using milliseconds I get the same values. Spark doesn't capture nanoseconds in the time portion, and many rows can get the same millisecond, so your assumption of getting unique values based on epoch time will not hold.
scala> def getEpoch(x:String):Long = java.time.Instant.now.toEpochMilli
getEpoch: (x: String)Long
scala> val myudfepoch = udf( getEpoch(_:String):Long )
myudfepoch: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,LongType,Some(List(StringType)))
scala> df.withColumn("id",when( $"id".isNull,myudfepoch('name)).otherwise($"id")).show
+-------------+----+
| id|name|
+-------------+----+
| 1| a|
|1539087300957| b|
|1539087300957| c|
+-------------+----+
scala>
The only possibility is to use monotonicallyIncreasingId, but those values may not be of the same length every time.
scala> df.withColumn("id",when( $"id".isNull,myudfepoch('name)+monotonicallyIncreasingId).otherwise($"id")).show
warning: there was one deprecation warning; re-run with -deprecation for details
+-------------+----+
| id|name|
+-------------+----+
| 1| a|
|1539090186541| b|
|1539090186543| c|
+-------------+----+
scala>
EDIT2:
I'm able to trick System.nanoTime into providing increasing ids. They will not be sequential, but the length can be maintained. See below:
scala> def getEpoch(x:String):String = System.nanoTime.toString.take(12)
getEpoch: (x: String)String
scala> val myudfepoch = udf( getEpoch(_:String):String )
myudfepoch: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))
scala> df.withColumn("id",when( $"id".isNull,myudfepoch('name)).otherwise($"id")).show
+------------+----+
| id|name|
+------------+----+
| 1| a|
|186127230392| b|
|186127230399| c|
+------------+----+
scala>
Try this out when running on clusters, and adjust the take(12) if you get duplicate values.
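If the replacement only has to be unique (rather than exactly the current epoch), a minimal sketch that avoids a UDF altogether is to combine the built-in unix_timestamp and monotonically_increasing_id functions - this is my own suggestion, not part of the answers above:

import org.apache.spark.sql.functions.{when, unix_timestamp, monotonically_increasing_id}

// unix_timestamp() is the current epoch in seconds; adding monotonically_increasing_id()
// keeps the filled-in values unique even when many rows fall into the same second.
val filled = df.withColumn(
  "id",
  when($"id".isNull, unix_timestamp() + monotonically_increasing_id()).otherwise($"id")
)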

How to add a column collection based on the maximum and minimum values in a dataframe

I've got this DataFrame
val for_df = Seq((5,7,"5k-7k"),(4,8,"4k-8k"),(6,12,"6k-2k")).toDF("min","max","salary")
I want to convert 5k-7k to 5,6,7 and 4k-8k to 4,5,6,7,8.
Original DataFrame and desired DataFrame: (shown in the original post; the Input and Result tables in the last answer below reproduce them.) My attempt:
a.select("min","max","salary")
  .as[(Integer,Integer,String)]
  .map { case (min, max, salary) =>
    (min, max, salary.split("-").flatMap(x => {
      for (i <- 0 to x.length - 1) yield i
    }))
  }.toDF("1","2","3").show()
You need to create a UDF to expand the limits. The following UDF will convert 5k-7k to 5,6,7 and 4k-8k to 4,5,6,7,8, and so on:
import org.apache.spark.sql.functions._

val inputDF = sc.parallelize(List((5,7,"5k-7k"),(4,8,"4k-8k"),(6,12,"6k-12k"))).toDF("min","max","salary")

val extendUDF = udf((str: String) => {
  val nums = str.replace("k","").split("-").map(_.toInt)
  (nums(0) to nums(1)).toList.mkString(",")
})

val output = inputDF.withColumn("salary_level", extendUDF($"salary"))
Output:
scala> output.show
+---+---+------+----------------+
|min|max|salary| salary_level|
+---+---+------+----------------+
| 5| 7| 5k-7k| 5,6,7|
| 4| 8| 4k-8k| 4,5,6,7,8|
| 6| 12|6k-12k|6,7,8,9,10,11,12|
+---+---+------+----------------+
You can easily do this with a udf.
import org.apache.spark.sql.functions.udf

// The following defines a udf in Spark which creates a list as per your requirement.
val makeRangeLists = udf( (min: Int, max: Int) => List.range(min, max + 1) )

val input = sc.parallelize(List((5,7,"5k-7k"),
  (4,8,"4k-8k"),(6,12,"6k-12k"))).toDF("min","max","salary")

// Create a new column using the UDF and pass the max and min columns.
input.withColumn("salary_level", makeRangeLists($"min", $"max")).show
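If you are on Spark 2.4 or later, the built-in sequence function produces the same array column without a UDF (an alternative sketch, not part of the original answer):

import org.apache.spark.sql.functions.{sequence, col}

// sequence(min, max) yields an array column [min, min+1, ..., max]
input.withColumn("salary_level", sequence(col("min"), col("max"))).show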
Here is one quick option with a UDF:
import org.apache.spark.sql.functions

val toSalary = functions.udf((value: String) => {
  val array = value.filterNot(_ == 'k').split("-").map(_.trim.toInt).sorted
  val (startSalary, endSalary) = (array.headOption, array.tail.headOption)
  (startSalary, endSalary) match {
    case (Some(s), Some(e)) => (s to e).toList.mkString(",")
    case _ => ""
  }
})

for_df.withColumn("salary_level", toSalary($"salary")).drop("salary")
Input
+---+---+------+
|min|max|salary|
+---+---+------+
| 5| 7| 5k-7k|
| 4| 8| 4k-8k|
| 6| 12| 6k-2k|
+---+---+------+
Result
+---+---+------------+
|min|max|salary_level|
+---+---+------------+
| 5| 7| 5,6,7|
| 4| 8| 4,5,6,7,8|
| 6| 12| 2,3,4,5,6|
+---+---+------------+
First you remove the k and split the string on the dash. Then you take the start and end salary and build a range between them.

In Spark SQL, how do you register and use a generic UDF?

In my project, I want to implement an ADD(+) function, but my parameters may be LongType, DoubleType, or IntType. I use sqlContext.udf.register("add", XXX), but I don't know how to write XXX so that it works as a generic function.
You can create a generic UDF by creating a StructType with struct($"col1", $"col2") that holds your values and have your UDF work off of this. It gets passed into your UDF as a Row object, so you can do something like this:
val multiAdd = udf[Double, Row](r => {
  var n = 0.0
  // Sum every field of the struct, converting each numeric type to Double.
  // (Types other than Long/Int/Double/Float would throw a MatchError.)
  r.toSeq.foreach(n1 => n = n + (n1 match {
    case l: Long => l.toDouble
    case i: Int => i.toDouble
    case d: Double => d
    case f: Float => f.toDouble
  }))
  n
})
val df = Seq((1.0,2),(3.0,4)).toDF("c1","c2")
df.withColumn("add", multiAdd(struct($"c1", $"c2"))).show
+---+---+---+
| c1| c2|add|
+---+---+---+
|1.0| 2|3.0|
|3.0| 4|7.0|
+---+---+---+
You can even do interesting things like take a variable number of columns as input. In fact, our UDF defined above already does that:
val df = Seq((1, 2L, 3.0f,4.0),(5, 6L, 7.0f,8.0)).toDF("int","long","float","double")
df.printSchema
root
|-- int: integer (nullable = false)
|-- long: long (nullable = false)
|-- float: float (nullable = false)
|-- double: double (nullable = false)
df.withColumn("add", multiAdd(struct($"int", $"long", $"float", $"double"))).show
+---+----+-----+------+----+
|int|long|float|double| add|
+---+----+-----+------+----+
| 1| 2| 3.0| 4.0|10.0|
| 5| 6| 7.0| 8.0|26.0|
+---+----+-----+------+----+
You can even add a hard-coded number into the mix:
df.withColumn("add", multiAdd(struct(lit(100), $"int", $"long"))).show
+---+----+-----+------+-----+
|int|long|float|double| add|
+---+----+-----+------+-----+
| 1| 2| 3.0| 4.0|103.0|
| 5| 6| 7.0| 8.0|111.0|
+---+----+-----+------+-----+
If you want to use the UDF in SQL syntax, you can do:
sqlContext.udf.register("multiAdd", (r: Row) => {
  var n = 0.0
  r.toSeq.foreach(n1 => n = n + (n1 match {
    case l: Long => l.toDouble
    case i: Int => i.toDouble
    case d: Double => d
    case f: Float => f.toDouble
  }))
  n
})
df.registerTempTable("df")
// Note that 'int' and 'long' are column names
sqlContext.sql("SELECT *, multiAdd(struct(int, long)) as add from df").show
+---+----+-----+------+----+
|int|long|float|double| add|
+---+----+-----+------+----+
| 1| 2| 3.0| 4.0| 3.0|
| 5| 6| 7.0| 8.0|11.0|
+---+----+-----+------+----+
This works too:
sqlContext.sql("SELECT *, multiAdd(struct(*)) as add from df").show
+---+----+-----+------+----+
|int|long|float|double| add|
+---+----+-----+------+----+
| 1| 2| 3.0| 4.0|10.0|
| 5| 6| 7.0| 8.0|26.0|
+---+----+-----+------+----+
I don't think you can register a generic UDF.
If we take a look at the signature of the register method
(actually, it's just one of the 22 register overloads, used for UDFs with one argument, the others are equivalent):
def register[RT: TypeTag, A1: TypeTag](name: String, func: Function1[A1, RT]): UserDefinedFunction
We can see that it's parameterized with an A1: TypeTag type - the TypeTag means that at the time of registration we must have evidence of the actual type of the UDF's argument. So passing a generic function func without typing it explicitly can't compile.
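For example, a minimal sketch of what that restriction means in practice (my own illustration, not from the answer): a generic method cannot be registered as-is, but concrete instantiations can.

def add[T](x: T, y: T)(implicit num: Numeric[T]): T = num.plus(x, y)

// sqlContext.udf.register("add", add _)   // does not compile: no TypeTag evidence for T
sqlContext.udf.register("addDouble", (x: Double, y: Double) => add(x, y))
sqlContext.udf.register("addLong", (x: Long, y: Long) => add(x, y))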
For your case, you might be able to take advantage of Spark's ability to cast numeric types automatically - write a UDF for Doubles only, and you can also apply it to Ints (the output would be Double, though):
sqlContext.udf.register("add", (i: Double) => i + 1)
// creating a table with Double and Int types:
sqlContext.createDataFrame(Seq((1.5, 4), (2.2, 5))).registerTempTable("table1")
// applying UDF to both types:
sqlContext.sql("SELECT add(_1), add(_2) FROM table1").show()
// output:
// +---+---+
// |_c0|_c1|
// +---+---+
// |2.5|5.0|
// |3.2|6.0|
// +---+---+

What is going wrong with `unionAll` of Spark `DataFrame`?

Using Spark 1.5.0 and given the following code, I expect unionAll to union DataFrames based on their column name. In the code, I'm using some FunSuite for passing in SparkContext sc:
object Entities {
  case class A (a: Int, b: Int)
  case class B (b: Int, a: Int)

  val as = Seq(
    A(1,3),
    A(2,4)
  )
  val bs = Seq(
    B(5,3),
    B(6,4)
  )
}

class UnsortedTestSuite extends SparkFunSuite {
  configuredUnitTest("The truth test.") { sc =>
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._
    val aDF = sc.parallelize(Entities.as, 4).toDF
    val bDF = sc.parallelize(Entities.bs, 4).toDF
    aDF.show()
    bDF.show()
    aDF.unionAll(bDF).show
  }
}
Output:
+---+---+
| a| b|
+---+---+
| 1| 3|
| 2| 4|
+---+---+
+---+---+
| b| a|
+---+---+
| 5| 3|
| 6| 4|
+---+---+
+---+---+
| a| b|
+---+---+
| 1| 3|
| 2| 4|
| 5| 3|
| 6| 4|
+---+---+
Why does the result contain intermixed "b" and "a" columns, instead of aligning the columns based on column names? Sounds like a serious bug!?
It doesn't look like a bug at all. What you see is standard SQL behavior, and every major RDBMS, including PostgreSQL, MySQL, Oracle and MS SQL, behaves exactly the same way. You'll find SQL Fiddle examples linked with the names.
To quote PostgreSQL manual:
In order to calculate the union, intersection, or difference of two queries, the two queries must be "union compatible", which means that they return the same number of columns and the corresponding columns have compatible data types
Column names, excluding the first table in the set operation, are simply ignored.
This behavior comes directly from relational algebra, where the basic building block is a tuple. Since tuples are ordered, a union of two sets of tuples is equivalent (ignoring duplicate handling) to the output you get here.
If you want to match using names you can do something like this
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
def unionByName(a: DataFrame, b: DataFrame): DataFrame = {
  val columns = a.columns.toSet.intersect(b.columns.toSet).map(col).toSeq
  a.select(columns: _*).unionAll(b.select(columns: _*))
}
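For example, applied to the aDF and bDF from the question (a usage sketch; the output column order follows the set intersection and may vary):

unionByName(aDF, bDF).show
// +---+---+
// |  a|  b|
// +---+---+
// |  1|  3|
// |  2|  4|
// |  3|  5|
// |  4|  6|
// +---+---+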
To check both names and types it should be enough to replace columns with:
a.dtypes.toSet.intersect(b.dtypes.toSet).map { case (c, _) => col(c) }.toSeq
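Assembled into a full helper, that variant would look roughly like this (a sketch of the same idea; the name unionByNameAndType is mine):

def unionByNameAndType(a: DataFrame, b: DataFrame): DataFrame = {
  val columns = a.dtypes.toSet.intersect(b.dtypes.toSet).map { case (c, _) => col(c) }.toSeq
  a.select(columns: _*).unionAll(b.select(columns: _*))
}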
This issue is addressed in Spark 2.3, which adds unionByName support on Dataset.
https://issues.apache.org/jira/browse/SPARK-21043
No issues/bugs here - if you look at your case class B closely, this becomes clear.
Case class A --> you specified the order (a, b), and
case class B --> you specified the order (b, a) ---> so the output follows that order, as expected:
case class A (a: Int, b: Int)
case class B (b: Int, a: Int)
thanks,
Subbu
Use unionByName:
Excerpt from the documentation:
def unionByName(other: Dataset[T]): Dataset[T]
The difference between this function and union is that this function resolves columns by name (not by position):
val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
val df2 = Seq((4, 5, 6)).toDF("col1", "col2", "col0")
df1.unionByName(df2).show
// output:
// +----+----+----+
// |col0|col1|col2|
// +----+----+----+
// |   1|   2|   3|
// |   6|   4|   5|
// +----+----+----+
As discussed in SPARK-9813, it seems like as long as the data types and number of columns are the same across frames, the unionAll operation should work. Please see the comments for additional discussion.

Derive multiple columns from a single column in a Spark DataFrame

I have a DataFrame with a huge parseable metadata string as a single column; let's call the DataFrame DFA and the column ColmnA.
I would like to break this column, ColmnA, into multiple columns through a function, ClassXYZ = Func1(ColmnA). This function returns a class ClassXYZ with multiple variables, and each of these variables now has to be mapped to a new column, such as ColmnA1, ColmnA2 etc.
How would I do such a transformation from one DataFrame to another with these additional columns by calling Func1 just once, without having to repeat it to create all the columns?
It's easy to solve if I were to call this huge function every time to add a new column, but that is what I wish to avoid.
Kindly advise with working or pseudo code.
Thanks
Sanjay
Generally speaking, what you want is not directly possible. A UDF can return only a single column at a time. There are two different ways you can overcome this limitation:
Return a column of complex type. The most general solution is a StructType, but you can consider ArrayType or MapType as well.
import org.apache.spark.sql.functions.udf

val df = Seq(
  (1L, 3.0, "a"), (2L, -1.0, "b"), (3L, 0.0, "c")
).toDF("x", "y", "z")

case class Foobar(foo: Double, bar: Double)

val foobarUdf = udf((x: Long, y: Double, z: String) =>
  Foobar(x * y, z.head.toInt * y))

val df1 = df.withColumn("foobar", foobarUdf($"x", $"y", $"z"))
df1.show
// +---+----+---+------------+
// | x| y| z| foobar|
// +---+----+---+------------+
// | 1| 3.0| a| [3.0,291.0]|
// | 2|-1.0| b|[-2.0,-98.0]|
// | 3| 0.0| c| [0.0,0.0]|
// +---+----+---+------------+
df1.printSchema
// root
// |-- x: long (nullable = false)
// |-- y: double (nullable = false)
// |-- z: string (nullable = true)
// |-- foobar: struct (nullable = true)
// | |-- foo: double (nullable = false)
// | |-- bar: double (nullable = false)
This can be easily flattened later but usually there is no need for that.
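For the record, a one-line flattening sketch (my addition, using Spark's struct-star expansion):

// foobar is expanded into top-level foo and bar columns
df1.select($"x", $"y", $"z", $"foobar.*").show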
Switch to RDD, reshape and rebuild DF:
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

def foobarFunc(x: Long, y: Double, z: String): Seq[Any] =
  Seq(x * y, z.head.toInt * y)

val schema = StructType(df.schema.fields ++
  Array(StructField("foo", DoubleType), StructField("bar", DoubleType)))

val rows = df.rdd.map(r => Row.fromSeq(
  r.toSeq ++
  foobarFunc(r.getAs[Long]("x"), r.getAs[Double]("y"), r.getAs[String]("z"))))

val df2 = sqlContext.createDataFrame(rows, schema)
df2.show
// +---+----+---+----+-----+
// | x| y| z| foo| bar|
// +---+----+---+----+-----+
// | 1| 3.0| a| 3.0|291.0|
// | 2|-1.0| b|-2.0|-98.0|
// | 3| 0.0| c| 0.0| 0.0|
// +---+----+---+----+-----+
Assume that after your function there will be a sequence of elements; here is an example:
val df = sc.parallelize(List(("Mike,1986,Toronto", 30), ("Andre,1980,Ottawa", 36), ("jill,1989,London", 27))).toDF("infoComb", "age")
df.show
+------------------+---+
| infoComb|age|
+------------------+---+
|Mike,1986,Toronto| 30|
| Andre,1980,Ottawa| 36|
| jill,1989,London| 27|
+------------------+---+
Now what you can do with this infoComb is to start splitting the string and get more columns:
df.select(
  expr("(split(infoComb, ','))[0]").cast("string").as("name"),
  expr("(split(infoComb, ','))[1]").cast("integer").as("yearOfBorn"),
  expr("(split(infoComb, ','))[2]").cast("string").as("city"),
  $"age"
).show
+-----+----------+-------+---+
| name|yearOfBorn| city|age|
+-----+----------+-------+---+
|Mike| 1986|Toronto| 30|
|Andre| 1980| Ottawa| 36|
| jill| 1989| London| 27|
+-----+----------+-------+---+
Hope this helps.
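A small variation (my own sketch, not from the answer) splits the string only once using the split function from org.apache.spark.sql.functions and then indexes into the resulting array column:

import org.apache.spark.sql.functions.split

val parts = split($"infoComb", ",")
df.select(
  parts(0).as("name"),
  parts(1).cast("integer").as("yearOfBorn"),
  parts(2).as("city"),
  $"age"
).show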
If your resulting columns will be of the same length as the original one, you can create brand new columns with the withColumn function and by applying a udf. After this you can drop your original column, e.g.:
val newDf = myDf.withColumn("newCol1", myFun(myDf("originalColumn")))
  .withColumn("newCol2", myFun2(myDf("originalColumn")))
  .drop(myDf("originalColumn"))
where myFun is a udf defined like this:
def myFun = udf(
  (originalColumnContent: String) => {
    // do something with your original column content and return a new one, for example:
    originalColumnContent.toUpperCase
  }
)
I opted to create a function that flattens one column and then just call it together with the udf.
First define this:
implicit class DfOperations(df: DataFrame) {

  def flattenColumn(col: String) = {
    def addColumns(df: DataFrame, cols: Array[String]): DataFrame = {
      if (cols.isEmpty) df
      else addColumns(
        df.withColumn(col + "_" + cols.head, df(col + "." + cols.head)),
        cols.tail
      )
    }

    val field = df.select(col).schema.fields(0)
    val newCols = field.dataType.asInstanceOf[StructType].fields.map(x => x.name)

    addColumns(df, newCols).drop(col)
  }

  def withColumnMany(colName: String, col: Column) = {
    df.withColumn(colName, col).flattenColumn(colName)
  }
}
Then usage is very simple:
case class MyClass(a: Int, b: Int)

val df = sc.parallelize(Seq(
  (0),
  (1)
)).toDF("x")

val f = udf((x: Int) => MyClass(x*2, x*3))
df.withColumnMany("test", f($"x")).show()
// +---+------+------+
// | x|test_a|test_b|
// +---+------+------+
// | 0| 0| 0|
// | 1| 2| 3|
// +---+------+------+
This can be easily achieved by using the pivot function:
df4.groupBy("year").pivot("course").sum("earnings").collect()
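df4 is not defined in that answer, so here is a self-contained sketch of what that pivot call does on hypothetical data (the year/course/earnings columns are my own example, not from the thread):

// Hypothetical input: one row per (year, course) combination
val df4 = Seq(
  (2012, "dotNET", 10000),
  (2012, "Java",   20000),
  (2013, "dotNET",  5000)
).toDF("year", "course", "earnings")

// pivot turns the distinct course values into columns, one output row per year
df4.groupBy("year").pivot("course").sum("earnings").show()
// +----+-----+------+
// |year| Java|dotNET|
// +----+-----+------+
// |2012|20000| 10000|
// |2013| null|  5000|
// +----+-----+------+
// (row order may vary)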