PySpark exception with GraphFrames - pyspark

I am building a simple Network Graph with PySpark and GraphFrames (running on Google Dataproc)
vertices = spark.createDataFrame([
("a", "Alice", 34),
("b", "Bob", 36),
("c", "Charlie", 30),
("d", "David", 29),
("e", "Esther", 32),
("f", "Fanny", 36),
("g", "Gabby", 60)],
["id", "name", "age"])
edges = spark.createDataFrame([
("a", "b", "friend"),
("b", "c", "follow"),
("c", "b", "follow"),
("f", "c", "follow"),
("e", "f", "follow"),
("e", "d", "friend"),
("d", "a", "friend"),
("a", "e", "friend")
], ["src", "dst", "relationship"])
g = GraphFrame(vertices, edges)
Then, I try to run `label progation'
result = g.labelPropagation(maxIter=5)
But I get the following error:
Py4JJavaError: An error occurred while calling o164.run.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 19.0 failed 4 times, most recent failure: Lost task 0.3 in stage 19.0 (TID 829, cluster-network-graph-w-12.c.myproject-bi.internal, executor 2): java.lang.ClassNotFoundException: org.graphframes.GraphFrame$$anonfun$5
It looks like the package 'GraphFrame' isn't available - but only if I run label propagation. How can I fix it?

I have solved using the following parameters
import pyspark
from pyspark.sql import SparkSession
conf = pyspark.SparkConf().setAll([('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-latest.jar'),
('spark.jars.packages', 'graphframes:graphframes:0.7.0-spark2.3-s_2.11')])
spark = SparkSession.builder \
.appName('testing bq')\
.config(conf=conf) \
.getOrCreate()

Seems like it is a known issue of graphframes in google Dataproc.
Create a python file and add the following lines and then run it:
from setuptools import setup
setup(name='graphframes',
version='0.5.10',
packages=['graphframes', 'graphframes.lib']
)
You can visit this for details:
https://github.com/graphframes/graphframes/issues/238, https://github.com/graphframes/graphframes/issues/172

Related

INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER

I am very new to pyspark and getting below error, even if drop all date related columns or selecting only one column. Date format stored in my data frame like "". Can anyone please suggest changes I could made in dataframe to resolve this/date formats supported by new parser.
It's working if I set "spark.sql.legacy.timeParserPolicy" to "LEGACY"
[INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER] You may get a different result due to the upgrading to Spark >= 3.0:
Caused by: DateTimeParseException: Text '1/1/2023 3:57:22 AM' could not be parsed at index 0
org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 15.0 failed 4 times, most recent failure: Lost task 4.3 in stage 15.0 (TID 355) (10.139.64.5 executor 0): org.apache.spark.SparkUpgradeException: [INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER] You may get a different result due to the upgrading to Spark >= 3.0:
Fail to parse '1/1/2023 3:57:22 AM' in the new parser. You can set "spark.sql.legacy.timeParserPolicy" to "LEGACY" to restore the behavior before Spark 3.0, or set to "CORRECTED" and treat it as an invalid datetime string.
Example:
#spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
from pyspark.sql.functions import *
from pyspark.sql import functions as F
emp = [(1, "AAA", "dept1", 1000, "12/22/2022 3:11:44 AM"),
(2, "BBB", "dept1", 1100, "12/22/2022 3:11:44 AM"),
(3, "CCC", "dept1", 3000, "12/22/2022 3:11:44 AM"),
(4, "DDD", "dept1", 1500, "12/22/2022 3:11:44 AM"),
(5, "EEE", "dept2", 8000, "12/22/2022 3:11:44 AM"),
(6, "FFF", "dept2", 7200, "12/22/2022 3:11:44 AM"),
(7, "GGG", "dept3", 7100, "12/22/2022 3:11:44 AM"),
(8, "HHH", "dept3", 3700, "12/22/2022 3:11:44 PM"),
(9, "III", "dept3", 4500, "12/22/2022 3:11:44 PM"),
(10, "JJJ", "dept5", 3400,"12/22/2022 3:11:44 PM")]
empdf = spark.createDataFrame(emp, ["id", "name", "dept", "salary",
"date"])
#empdf.printSchema()
df = empdf.withColumn("date", F.to_timestamp(col("date"),
"MM/dd/yyyy hh:mm:ss a"))
df.show(12,False)
Thanks a lot, in Advance
Just wanted to share update on this, After updating df = empdf.withColumn("date", F.to_timestamp(col("date"), "MM/dd/yyyy hh:mm:ss a")) to df = empdf.withColumn("date", F.to_timestamp(col("date"), "M/d/yyyy h:m:s a")) it's working for me.

Grouping by an alternating sequence in Spark

I have a set of data where I can identify an alternating sequence. However, I want to group this data into one chunk leaving all the other data unchanged. That is, where ever flickering is occurring in id, I want to overwrite that group of id with the single id that makes since to the order. As a small example, consider
val dataDF = Seq(
("a", "silom", 3, 1),
("a", "silom", 2, 2),
("a", "silom", 1, 3),
("a", "silom", 0, 4), // flickering; id=0
("a", "silom", 1, 5), // flickering; id=0
("a", "silom", 0, 6), // flickering; id=0
("a", "silom", 1, 7),
("a", "silom", 2, 8),
("a", "silom", 3, 9),
("a", "silom", 4, 10),
("a", "silom", 3, 11), // flickering and so on
("a", "silom", 4, 12),
("a", "silom", 3, 13),
("a", "silom", 4, 14),
("a", "silom", 5, 15)
).toDF("user", "cat", "id", "time_sec")
val resultDataDF = Seq(
("a", "silom", 3, 1),
("a", "silom", 2, 2),
("a", "silom", 1, 3),
("a", "silom", 0, 15), // grouped by flickering summing on time_sec
("a", "silom", 1, 7),
("a", "silom", 2, 8),
("a", "silom", 3, 9),
("a", "silom", 4, 60),
("a", "silom", 5, 15). // grouped by flickering summing on time_sec
).toDF("user", "cat", "id", "time_sec")
Now a more realistic MWE. In this case, we can have multiple users and cat; unfortunately, this approach doesnt use the dataframe API and needs to collect data to the driver. This isn't scalable and needs to recursively call getGrps by dropping the length of the returned array indices.
How can I implement this using the dataframe API so as not to need collect the data to the driver which would be impossible due to size? Also, if there is a better way to do this, what would that be?
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.types.DoubleType
import scala.collection.mutable.WrappedArray
val dataDF = Seq(
("a", "silom", 3, 1),
("a", "silom", 2, 2),
("a", "silom", 1, 3),
("a", "silom", 0, 4),
("a", "silom", 1, 5),
("a", "silom", 0, 6),
("a", "silom", 1, 7),
("a", "silom", 2, 8),
("a", "silom", 3, 9),
("a", "silom", 4, 10),
("a", "silom", 3, 11),
("a", "silom", 4, 12),
("a", "silom", 3, 13),
("a", "silom", 4, 14),
("a", "silom", 5, 15),
("a", "suk", 18, 1),
("a", "suk", 19, 2),
("a", "suk", 20, 3),
("a", "suk", 21, 4),
("a", "suk", 20, 5),
("a", "suk", 21, 6),
("a", "suk", 0, 7),
("a", "suk", 1, 8),
("a", "suk", 2, 9),
("a", "suk", 3, 10),
("a", "suk", 4, 11),
("a", "suk", 3, 12),
("a", "suk", 4, 13),
("a", "suk", 3, 14),
("a", "suk", 5, 15),
("b", "silom", 4, 1),
("b", "silom", 3, 2),
("b", "silom", 2, 3),
("b", "silom", 1, 4),
("b", "silom", 0, 5),
("b", "silom", 1, 6),
("b", "silom", 2, 7),
("b", "silom", 3, 8),
("b", "silom", 4, 9),
("b", "silom", 5, 10),
("b", "silom", 6, 11),
("b", "silom", 7, 12),
("b", "silom", 8, 13),
("b", "silom", 9, 14),
("b", "silom", 10, 15),
("b", "suk", 11, 1),
("b", "suk", 12, 2),
("b", "suk", 13, 3),
("b", "suk", 14, 4),
("b", "suk", 13, 5),
("b", "suk", 14, 6),
("b", "suk", 13, 7),
("b", "suk", 12, 8),
("b", "suk", 11, 9),
("b", "suk", 10, 10),
("b", "suk", 9, 11),
("b", "suk", 8, 12),
("b", "suk", 7, 13),
("b", "suk", 6, 14),
("b", "suk", 5, 15)
).toDF("user", "cat", "id", "time_sec")
val recastDataDF = dataDF.withColumn("id", $"id".cast(DoubleType))
val category = recastDataDF.select("cat").distinct.collect.map(x => x(0).toString)
val data = recastDataDF
.select($"*" +: category.map(
name =>
lag("id", 1).over(
Window.partitionBy("user", "cat").orderBy("time_sec")
)
.alias(s"lag_${name}_id")): _*)
.withColumn("sequencing_diff", when($"cat" === "silom", ($"lag_silom_id" - $"id").cast(DoubleType))
.otherwise(($"lag_suk_id" - $"id")))
.drop("lag_silom_id", "lag_suk_id")
.withColumn("rn", row_number.over(Window.partitionBy("user", "cat").orderBy("time_sec")).cast(DoubleType))
.withColumn("zipped", array("user", "cat", "sequencing_diff", "rn", "id"))
// non dataframe API approach (not scalable)
// needs to collect data to driver to process
val iterTuples = data.select("zipped").collect.map(x => x(0).asInstanceOf[WrappedArray[Any]]).map(x => x.toArray)
val shifted: Array[Array[Any]] = iterTuples.drop(1)
val combined = iterTuples
.zipAll(shifted, Array("", "", Double.NaN, Double.NaN, Double.NaN), Array("", "", Double.NaN, Double.NaN, Double.NaN))
val testArr = combined.map{
case (data0, data1) =>
if(data1(3).toString.toDouble > 2 && data0(3).toString.toDouble > 2 && data1(0) == data0(0) && data1(1) == data0(1)) {
if(data0(2) != data1(2) && data0(2).toString.toDouble + data1(2).toString.toDouble == 0) {
(data1(0), data1(1), data1(3), data0(4))
}
else ("", "", Double.NaN, Double.NaN)
}
else ("", "", Double.NaN, Double.NaN)
}
.filter(t => t._1 != "" && t._2 != "" && t._3 == t._3 && t._4 == t._4) // fast NaN removal
val typeMappedArray = testArr.map(x => (x._1.toString, x._2.toString, x._3.toString.toDouble, x._4.toString.toDouble))
def getGrps(arr: Array[(String, String, Double, Double)]): (Array[Double], Double, String, String) = {
if(arr.nonEmpty) {
val user = arr.take(1)(0)._1
val cat = arr.take(1)(0)._2
val rowNum = arr.take(1)(0)._3
val keepID = arr.take(1)(0)._4
val newArr = arr.drop(1)
val rowNums = (Array(rowNum)) ++ newArr.zipWithIndex.map{
case (tups, idx) =>
if(rowNum + idx + 1 == tups._3) {
rowNum + 1 + idx
}
else Double.NaN
}
.filter(v => v == v)
(rowNums, keepID, user, cat)
}
else (Array(Double.NaN), Double.NaN, "", "")
}
// after overwriting, this would allow me to group by user, cat, id to sum the time
getGrps(typeMappedArray) // returns rows number to overwrite, value to overwrite id with, user, cat
res0: (Array(5.0, 6.0, 7.0),0.0,a,silom)
getGrps(typeMappedArray.drop(3))
res1: (Array(11.0, 12.0, 13.0, 14.0),4.0,a,silom)
A second approach using collect_list but this relies on getGrps working recursively which I cannot get working properly. Here is the code I have so far with a modified getGrps for the the collect_list minus recursive.
val data = recastDataDF
.select($"*" +: category.map(
name =>
lag("id", 1).over(
Window.partitionBy("user", "cat").orderBy("time_sec")
)
.alias(s"lag_${name}_id")): _*)
.withColumn("sequencing_diff", when($"cat" === "silom", ($"lag_silom_id" - $"id").cast(DoubleType))
.otherwise(($"lag_suk_id" - $"id")))
.drop("lag_silom_id", "lag_suk_id")
.withColumn("rn", row_number.over(Window.partitionBy("user", "cat").orderBy("time_sec")).cast(DoubleType))
.withColumn("id_rn", array($"id", $"rn", $"sequencing_diff"))
.groupBy($"user", $"cat").agg(collect_list($"id_rn").alias("array_data"))
// collect one row to develop how the UDF would work
val testList = data.where($"user" === "a" && $"cat" === "silom").select("array_data").collect
.map(x => x(0).asInstanceOf[WrappedArray[WrappedArray[Any]]])
.map(x => x.toArray)
.head
.map(x => (x(0).toString.toDouble, x(1).toString.toDouble, x(2).asInstanceOf[Double]))
// this code would be in the UDF; that is, we would pass array_data to the UDF
scala.util.Sorting.stableSort(testList, (e1: (Double, Double, Double), e2: (Double, Double, Double)) => e1._2 < e2._2)
val shifted: Array[(Double, Double, Double)] = testList.drop(1)
val combined = testList
.zipAll(shifted, (Double.NaN, Double.NaN, Double.NaN), (Double.NaN, Double.NaN, Double.NaN))
val testArr = combined.map{
case (data0, data1) =>
if(data0._3 != data1._3 && data0._2 > 1) {
(data0._2, data0._1)
}
else (Double.NaN, Double.NaN)
}
.filter(t => t._1 == t._1 && t._1 == t._1)
// called inside the UDF
def getGrps2(arr: Array[(Double, Double)]): (Array[Double], Double) = {
// no need for user or cat
if(arr.nonEmpty) {
val rowNum = arr.take(1)(0)._1
val keepID = arr.take(1)(0)._2
val newArr = arr.drop(1)
val rowNums = (Array(rowNum)) ++ newArr.zipWithIndex.map{
case (tups, idx) =>
if(rowNum + idx + 1 == tups._1) {
rowNum + 1 + idx
}
else Double.NaN
}
.filter(v => v == v)
(rowNums, keepID)
}
else (Array(Double.NaN), Double.NaN)
}
We would .withColumn("data_to_update", udf) and the data_to_update column would be a WrappedArray[Tuple2[Array[Double], Double]] with row_numbers to id to overwrite. The result for user a, cat silom would be
WrappedArray((Array(4.0, 5.0, 6.0),0.0), (Array(10.0, 11.0, 12.0, 13.0),4.0))
The array pieces are row numbers and the Double is id to update those rows with
The following recursive method applied in a UDF operating on the array_data column will create the desired results
def getGrps(arr: Array[(Double, Double)]): Array[(Array[Double], Double)] = {
def returnAlternatingIDs(arr: Array[(Double, Double)],
altIDs: Array[(Array[Double], Double)]): Array[(Array[Double], Double)] = arr match {
case arr if arr.nonEmpty =>
val rowNum = arr.take(1)(0)._1
val keepID = arr.take(1)(0)._2
val newArr = arr.drop(1)
val rowNums = (Array(rowNum)) ++ newArr.zipWithIndex.map{
case (tups, idx) =>
if(rowNum + idx + 1 == tups._1) {
rowNum + 1 + idx
}
else {
Double.NaN
}
}
.filter(v => v == v)
val updateArray = altIDs ++ Array((rowNums, keepID))
returnAlternatingIDs(arr.drop(rowNums.length), updateArray)
case _ => altIDs
}
returnAlternatingIDs(arr, Array((Array(Double.NaN), Double.NaN))).drop(1)
}
The return value for the first collect_list is Array((Array(5.0, 6.0, 7.0),0.0), (Array(11.0, 12.0, 13.0, 14.0),4.0)) as desired.
Full UDF
val identifyFlickeringIDs: UserDefinedFunction = udf {
(colArrayData: WrappedArray[WrappedArray[Double]]) =>
val newArray: Array[(Double, Double, Double)] = colArrayData.toArray
.map(x => (x(0).toDouble, x(1).toDouble, x(2).toDouble))
// sort array by rn via less than relation
stableSort(newArray, (e1: (Double, Double, Double), e2: (Double, Double, Double)) => e1._2 < e2._2)
val shifted: Array[(Double, Double, Double)] = newArray.toArray.drop(1)
val combined = newArray
.zipAll(shifted, (Double.NaN, Double.NaN, Double.NaN), (Double.NaN, Double.NaN, Double.NaN))
val parsedArray = combined.map{
case (data0, data1) =>
if(data0._3 != data1._3 && data0._2 > 1 && data0._3 + data1._3 == 0) {
(data0._2, data0._1)
}
else (Double.NaN, Double.NaN)
}
.filter(t => t._1 == t._1 && t._1 == t._1)
getGrps(parsedArray).filter(data => data._1.length > 1)
}

spark dataframe : finding employees who is having salary more than the average salary of the organization

I am trying to run a test spark/scala code to find employees who is having salary more than the avarage salary with a test data using below spark dataframe . But this is failing while executing :
Exception in thread "main" java.lang.UnsupportedOperationException: Cannot evaluate expression: avg(input[4, double, false])
What might be the correct syntax to achieve this ?
val dataDF20 = spark.createDataFrame(Seq(
(11, "emp1", 2, 45, 1000.0),
(12, "emp2", 1, 34, 2000.0),
(13, "emp3", 1, 33, 3245.0),
(14, "emp4", 1, 54, 4356.0),
(15, "emp5", 2, 76, 56789.0)
)).toDF("empid", "name", "deptid", "age", "sal")
val condition1 : Column = col("sal") > avg(col("sal"))
val d0 = dataDF20.filter(condition1)
println("------ d0.show()----", d0.show())
You can get this done in two steps:
val avgVal = dataDF20.select(avg($"sal")).take(1)(0)(0)
dataDF20.filter($"sal" > avgVal).show()
+-----+----+------+---+-------+
|empid|name|deptid|age| sal|
+-----+----+------+---+-------+
| 15|emp5| 2| 76|56789.0|
+-----+----+------+---+-------+

Spark- GraphFrames How to use the component ID in connectedComponents

I'm trying to find all the connected components(in this example, 4 is connected to 100, 2 is connected to 200 etc.) I used val g2 = GraphFrame(v2, e2)
val result2 = g2.connectedComponents.run() and that returns nodes with a component ID. My problem is, how do I use this ID to see all the connected nodes? How to find out which node this id belongs to? Many thanks. I'm quite new to this.
val v2 = sqlContext.createDataFrame(List(
("a",1),
("b", 2),
("c", 3),
("d", 4),
("e", 100),
("f", 200),
("g", 300),
("h", 400)
)).toDF("nodes", "id")
val e2= sqlContext.createDataFrame(List(
(4,100, "friend"),
(2, 200, "follow"),
(3, 300, "follow"),
(4, 400, "follow"),
(1, 100, "follow"),
(1,400, "friend")
)).toDF("src", "dst", "relationship")
In this example I'm expected to see the connections below
----+----+
| 4| 400|
| 4| 100|
| 1| 400|
| 1| 100|
This is what the result shows now
(1,1),(2,2),(3,1),(4,1), (100,1) (200,2) (300,3)(400,1). How do I see all the connections?
You have declared "a", "b", "c"... to be your graph's node ids, but later used 1, 2, 3... as node ids to define edges.
You should change the node ids to the numbers: 1,2,3.. while creating the vertices dataframe, by naming that column as "id" :
val v2 = sqlContext.createDataFrame(List(
("a",1),
("b", 2),
("c", 3),
("d", 4),
("e", 100),
("f", 200),
("g", 300),
("h", 400)
)).toDF("nodes", "id")
That should give you the desired results.

spark1.6.2 with scala2.10.6 No TypeTag available

Im trying to run the KMeans case from here.
This is my code:
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName(this.getClass.getName).setMaster("local[10]")//.set("spark.sql.warehouse.dir", "file:///")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
// Crates a DataFrame
val dataset: DataFrame = sqlContext.createDataFrame(Seq(
(1, Vectors.dense(0.0, 0.0, 0.0)),
(2, Vectors.dense(0.1, 0.1, 0.1)),
(3, Vectors.dense(0.2, 0.2, 0.2)),
(4, Vectors.dense(9.0, 9.0, 9.0)),
(5, Vectors.dense(9.1, 9.1, 9.1)),
(6, Vectors.dense(9.2, 9.2, 9.2))
)).toDF("id", "features")
// Trains a k-means model
val kmeans = new KMeans()
.setK(2)
.setFeaturesCol("features")
.setPredictionCol("prediction")
val model = kmeans.fit(dataset)
// Shows the result
println("Final Centers: ")
model.clusterCenters.foreach(println)}
The Error follow:
Information:2016/9/19 0019 下午 3:36 - Compilation completed with 1 error and 0 warnings in 2s 454ms
D:\IdeaProjects\de\src\main\scala\com.te\KMeansExample.scala
Error:Error:line (18)No TypeTag available for (Int, org.apache.spark.mllib.linalg.Vector)
val dataset: DataFrame = sqlContext.createDataFrame(Seq(
some details:
1. When I run this with spark1.6.2 and scala 2.10.6.it is compile fail and show the Error above.But when change the scala version to 2.11.0 .it's run OK.
2. I run this code in Hue which submit this job to my Cluster by Livy, and my Cluster build with Spark1.6.2 and scala2.10.6
Can anybody help me ? Thanks
I am not very sure about the cause of this problem but I think that it is because scala reflection in older versions of scala was not able to work out the TypeTag of yet not inferred function parameters.
In this case,
val dataset: DataFrame = sqlContext.createDataFrame(Seq(
(1, Vectors.dense(0.0, 0.0, 0.0)),
(2, Vectors.dense(0.1, 0.1, 0.1)),
(3, Vectors.dense(0.2, 0.2, 0.2)),
(4, Vectors.dense(9.0, 9.0, 9.0)),
(5, Vectors.dense(9.1, 9.1, 9.1)),
(6, Vectors.dense(9.2, 9.2, 9.2))
)).toDF("id", "features")
The parameter Seq((1, Vectors.dense(0.0, 0.0, 0.0)),.....) is being seen by Scala the first time and hence its type is still not inferred by the system. And hence scala reflection can not work out the associated TypeTag.
So... my guess is that if you just move that out.. allow scala to infer the type... it will work.
val vectorSeq = Seq(
(1, Vectors.dense(0.0, 0.0, 0.0)),
(2, Vectors.dense(0.1, 0.1, 0.1)),
(3, Vectors.dense(0.2, 0.2, 0.2)),
(4, Vectors.dense(9.0, 9.0, 9.0)),
(5, Vectors.dense(9.1, 9.1, 9.1)),
(6, Vectors.dense(9.2, 9.2, 9.2))
)
val dataset: DataFrame = sqlContext.createDataFrame(vectorSeq).toDF("id", "features")