How to connect to Neo4j in Spark worker nodes? - scala

I need to fetch a small subgraph inside a Spark map function. I have tried AnormCypher and neo4j-spark-connector, but neither works. AnormCypher throws a java IOException (I build the connection inside a mapPartitions function and test against a localhost server), and neo4j-spark-connector causes the "Task not serializable" exception below.
Is there a good way to get a subgraph (or to connect to a graph database such as Neo4j) from a Spark worker node?
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:792)
at ....
My code snippet using neo4j-spark-connector 2.0.0-m2:
val neo = Neo4j(sc) // this runs on the driver
// this is called from a map function
def someFunctionToBeMapped(p: List[Long]) = {
  val metaGraph = neo.cypher("match p = (a:TourPlace) -[r:could_go_to] -> (b:TourPlace)" +
    "return a.id ,r.distance, b.id").loadRowRdd.map( row => ((row(0).asInstanceOf[Long],row(2).asInstanceOf[Long]), row(1).asInstanceOf[Double]) ).collect().toList
The AnormCypher code is:
def partitionMap(partition: Iterator[List[Long]]) = {
  import org.anormcypher._
  import play.api.libs.ws._
  // Provide an instance of WSClient
  val wsclient = ning.NingWSClient()
  // Setup the Rest Client
  // Need to add the Neo4jConnection type annotation so that the default
  // Neo4jConnection -> Neo4jTransaction conversion is in the implicit scope
  implicit val connection: Neo4jConnection = Neo4jREST("127.0.0.1", 7474, "neo4j", "000000")(wsclient)
  //
  // Provide an ExecutionContext
  implicit val ec = scala.concurrent.ExecutionContext.global
  val res = partition.filter( placeList => {
    val startPlace = Cypher("match p = (a:TourPlace) -[r:could_go_to] -> (b:TourPlace)" +
      "return p")().flatMap( row => row.data )
  })
  wsclient.close()
  res
}

I used Spark in standalone mode and was able to connect to the Neo4j database.
Versions used:
spark 2.1.0
neo4j-spark-connector 2.1.0-m2
My code:
val sparkConf = new SparkConf().setAppName("Neo4j").setMaster("local")
val sc = new SparkContext(sparkConf)
println("***Getting Started ****")
val neo = Neo4j(sc)
val rdd = neo.cypher("MATCH (n) RETURN id(n) as id").loadDataFrame
println(rdd.count)
Spark submit:
spark-submit --class package.classname --jars pathofneo4jsparkconnectoryJAR --conf spark.neo4j.bolt.password=***** targetJarFile.jar
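On the "Task not serializable" exception from the question: Spark has to serialize the function passed to map or mapPartitions, together with everything that function captures, before it can ship it to the workers. One likely cause in the question's snippet is that someFunctionToBeMapped refers to neo, an object created on the driver. The following is a minimal sketch of that check in plain Scala, with no Spark dependency. DriverOnlyClient is a hypothetical stand-in for a driver-side handle, and ClosureCheck and its methods are illustrative names, not part of Spark or the Neo4j connector.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for a driver-side handle; deliberately NOT Serializable.
class DriverOnlyClient {
  def lookup(id: Long): String = s"node-$id"
}

object ClosureCheck {
  // True if Java serialization accepts the object. Spark performs a similar
  // check on a closure before sending it to the workers.
  def isSerializable(obj: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
      true
    } catch {
      case _: NotSerializableException => false
    }

  // Captures a client created outside the function, so the client would have
  // to be serialized together with the function.
  def capturing(): Long => String = {
    val client = new DriverOnlyClient
    (id: Long) => client.lookup(id)
  }

  // Creates the client inside the function body, as one would do at the top
  // of a mapPartitions function, so nothing non-serializable is captured.
  def selfContained(): Long => String =
    (id: Long) => {
      val client = new DriverOnlyClient
      client.lookup(id)
    }
}
```

Under these assumptions, ClosureCheck.isSerializable(ClosureCheck.capturing()) should be false and ClosureCheck.isSerializable(ClosureCheck.selfContained()) should be true. A separate observation on the AnormCypher snippet: Iterator.filter is lazy, so wsclient.close() runs before any element is processed and before any query is sent, which may explain the IOException.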

Related

Why does UDF throw NotSerializableException in streaming queries?

I use Spark 2.4.3 for a Structured Streaming application (readStream from Azure Event Hubs / writeStream to CosmosDB). There are some transformation steps for the data, and one step is to make a lookup into CosmosDB for some validation and adding some more fields.
//messagesF13 contains PersonHashCode,....
...
val messagesF14 = messagesF13.withColumn("LookupData", getHData($"PersonHashCode"))
//messagesF14.printSchema()
messagesF14.writeStream.outputMode("append").format("console").option("truncate", false).start().awaitTermination()
The code for getHData is copied below:
case class PersonHolder( id: String,
                         Person_uid: String,
                         Person_seq: Integer)
val getHData= udf ( (hash256: String) => {
  val queryStmt = s""" SELECT *
    FROM c
    WHERE c.Person_uid ='$hash256'"""
  val readConfig = Config(Map("Endpoint" -> "https://abc-cosmos.documents.azure.com:443/",
    "Masterkey" -> "ABCABC==",
    "Database" -> "person-data",
    "preferredRegions" -> "East US;",
    "Collection" -> "tmp-persons",
    "query_custom" -> queryStmt,
    "SamplingRatio" -> "1.0"))
  val coll = spark.sqlContext.read.cosmosDB(readConfig)
  coll.createOrReplaceTempView("c")
  val q3 = queryStmt + " AND c.Person_seq = 0"
  val df3 = spark.sql(q3)
  if (df3.head(1).isEmpty){
    null //None
  }
  else {
    val y = df3.select($"id",$"Person_uid",$"Person_seq")
    val y1 = y.as[PersonHolder].collectAsList
    y1.get(0)
  }
}
)
It does not work; the (well-known) error is:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure:
Task not serializable: java.io.NotSerializableException: com.microsoft.azure.eventhubs.ConnectionStringBuilder
What are the possible workarounds/solutions for avoiding the error?
Thank you in advance for some links/GitHub code/docs!
"It does not work"
And it won't. Sorry.
User-defined functions (UDFs) run on executors, where there is no spark.sqlContext. Both spark and sqlContext are uninitialized on executors.
"one step is to make a lookup into CosmosDB for some validation and adding some more fields."
That's a classic join, especially given this code in the getHData UDF:
val coll = spark.sqlContext.read.cosmosDB(readConfig)
You should simply do the following:
val coll = spark.sqlContext.read.cosmosDB(readConfig)
val messagesF14 = messagesF13.join(coll).where(...)

Integration tests for a simple Spark application

I need to write some unit and integration tests for a small research project. I am using a simple Spark application that reads data from a file and outputs the number of characters in the file. I am using ScalaTest for the unit tests, but I could not come up with integration tests for this project. According to the project flow, I need to execute the unit tests, package a jar file, and then execute the integration tests using this jar file. I have a data file as a resource for the tests. Should I package this file with the source code, or should I put it in a separate location? What kinds of integration tests can I write for this application?
The simple Spark application looks like this:
object SparkExample {
  def readFile(sparkContext: SparkContext, fileName: String) = {
    sparkContext.textFile(fileName)
  }
  def mapStringToLength(data: RDD[String]) = {
    data.map(fileData => fileData.length)
  }
  def printIntFileData(data: RDD[Int]) = {
    data.foreach(fileString =>
      println(fileString.toString)
    )
  }
  def printFileData(data: RDD[String]) = {
    data.foreach(fileString =>
      println(fileString)
    )
  }
  def main(args: Array[String]) {
    val spark = SparkSession
      .builder
      .master("local[*]")
      .appName("TestApp")
      .getOrCreate()
    val dataFromFile = readFile(spark.sparkContext, args(0))
    println("\nAll the data:")
    val dataToInt = mapStringToLength(dataFromFile)
    printFileData(dataFromFile)
    printIntFileData(dataToInt)
    spark.stop()
  }
}
The unit tests I wrote:
class SparkExampleTest extends FunSuite with BeforeAndAfter with Matchers{
  val master = "local"
  val appName = "TestApp"
  var sparkContext: SparkContext = _
  val fileContent = "This is the text only for the test purposes. There is no sense in it completely. This is the test of the Spark Application"
  val fileName = "src/test/resources/test_data.txt"
  val noPathFileName = "test_data.txt"
  val errorFileName = "test_data1.txt"
  before {
    val sparkSession = SparkSession
      .builder
      .master(master)
      .appName(appName)
      .getOrCreate()
    sparkContext = sparkSession.sparkContext
  }
  test("SparkExample.readFile"){
    assert(SparkExample.readFile(sparkContext, fileName).collect() sameElements Array(fileContent))
  }
  test("SparkExample.mapStringToLength"){
    val stringLength = fileContent.length
    val rdd = sparkContext.makeRDD(Array(fileContent))
    assert(SparkExample.mapStringToLength(rdd).collect() sameElements Array(stringLength))
  }
  test("SparkExample.mapStringToLength Negative"){
    val stringLength = fileContent.length
    val rdd = sparkContext.makeRDD(Array(fileContent + " "))
    // Compare contents: `!=` on arrays compares references and is always true here
    assert(!(SparkExample.mapStringToLength(rdd).collect() sameElements Array(stringLength)))
  }
  test("SparkExample.readFile does not throw Exception"){
    noException should be thrownBy SparkExample.readFile(sparkContext, fileName).collect()
  }
  test("SparkExample.readFile throws InvalidInputException without filePath"){
    an[InvalidInputException] should be thrownBy SparkExample.readFile(sparkContext, noPathFileName).collect()
  }
  test("SparkExample.readFile throws InvalidInputException with wrong filename"){
    an[InvalidInputException] should be thrownBy SparkExample.readFile(sparkContext, errorFileName).collect()
  }
}
Spark Testing Base is the way to go; it is basically a lightweight embedded Spark for your tests. It probably sits more on the "integration tests" side of things than on the unit-test side, but you can still track code coverage etc., e.g. with scoverage.
https://github.com/holdenk/spark-testing-base

Spark Graphx : class not found error on EMR cluster

I am trying to process hierarchical data using GraphX Pregel, and the code I have works fine on my local machine.
But when I run it on my Amazon EMR cluster, it gives me an error:
java.lang.NoClassDefFoundError: Could not initialize class
What could be the reason for this? I know the class is in the jar file, since the code runs fine locally and there is no build error.
I have included the GraphX dependency in the pom file.
Here is the snippet of code where the error is thrown:
def calcTopLevelHierarcy (vertexDF: DataFrame, edgeDF: DataFrame): RDD[(Any, (Int, Any, String, Int, Int))] =
{
  val verticesRDD = vertexDF.rdd
    .map { x => (x.get(0), x.get(1), x.get(2)) }
    .map { x => (MurmurHash3.stringHash(x._1.toString).toLong, (x._1.asInstanceOf[Any], x._2.asInstanceOf[Any], x._3.asInstanceOf[String])) }
  // create the edge RDD (top-down relationship)
  val EdgesRDD = edgeDF.rdd.map { x => (x.get(0), x.get(1)) }
    .map { x => Edge(MurmurHash3.stringHash(x._1.toString).toLong, MurmurHash3.stringHash(x._2.toString).toLong, "topdown") }
  // create the graph from the vertex and edge RDDs
  val graph = Graph(verticesRDD, EdgesRDD).cache()
  //val pathSeperator = """/"""
  //initialize id,level,root,path,iscyclic, isleaf
  val initialMsg = (0L, 0, 0.asInstanceOf[Any], List("dummy"), 0, 1)
  val initialGraph = graph.mapVertices((id, v) => (id, 0, v._2, List(v._3), 0, v._3, 1, v._1))
  val hrchyRDD = initialGraph.pregel(initialMsg, Int.MaxValue, EdgeDirection.Out)(setMsg, sendMsg, mergeMsg)
  //build the path from the list
  val hrchyOutRDD = hrchyRDD.vertices.map { case (id, v) => (v._8, (v._2, v._3, pathSeperator + v._4.reverse.mkString(pathSeperator), v._5, v._7)) }
  hrchyOutRDD
}
I was able to narrow it down to the line that causes the error:
val hrchyRDD = initialGraph.pregel(initialMsg, Int.MaxValue, EdgeDirection.Out)(setMsg, sendMsg, mergeMsg)
I had this exact same issue: I was able to run the code in spark-shell, but it failed when executed with spark-submit. Here's an example of the code I was trying to execute (it looks like it's the same as yours)
The error that pointed me to the right solution was:
org.apache.spark.SparkException: A master URL must be set in your configuration
In my case, I was getting that error because I had defined the SparkContext outside the main function:
object Test {
  val sc = SparkContext.getOrCreate
  val sqlContext = new SQLContext(sc)
  def main(args: Array[String]) {
    ...
  }
}
I was able to solve it by moving SparkContext and sqlContext inside the main function, as described in this other post.
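A likely explanation for why this shows up as "NoClassDefFoundError: Could not initialize class" on the cluster: values declared in the body of a Scala object are initialized when the object is first accessed, inside the JVM's class initializer. If that initializer throws (for example because SparkContext.getOrCreate finds no master URL on an executor), the JVM marks the class as unusable, and later accesses fail without repeating the original cause. The sketch below reproduces this in plain Scala. BrokenHolder and InitDemo are hypothetical names, and the thrown exception only imitates the Spark message.

```scala
// Hypothetical stand-in for `object Test { val sc = SparkContext.getOrCreate ... }`.
object BrokenHolder {
  // Evaluated once, when the object is first accessed; here it always fails.
  val resource: String =
    throw new IllegalStateException("A master URL must be set in your configuration")
}

object InitDemo {
  // Accesses BrokenHolder and returns the class name of whatever is thrown.
  def attempt(): String =
    try {
      "ok: " + BrokenHolder.resource
    } catch {
      case t: Throwable => t.getClass.getName
    }
}
```

When both objects are compiled as ordinary top-level objects, the first call to InitDemo.attempt() should typically report java.lang.ExceptionInInitializerError (which wraps the original exception) and later calls java.lang.NoClassDefFoundError. Creating the context inside main avoids running that initializer on the executors at all.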

Spark Execution for twitter Streaming

Hi, I'm new to Spark and Scala. I'm trying to stream some tweets through Spark Streaming with the following code:
object TwitterStreaming {
  def main(args: Array[String]): Unit = {
    if (args.length < 1) {
      System.err.println("WrongUsage: PropertiesFile, [<filters>]")
      System.exit(-1)
    }
    StreamingExamples.setStreamingLogLevels()
    val myConfigFile = args(0)
    val batchInterval_s = 1
    val fileConfig = ConfigFactory.parseFile(new File(myConfigFile))
    val appConf = ConfigFactory.load(fileConfig)
    // Set the system properties so that Twitter4j library used by twitter stream
    // can use them to generate OAuth credentials
    System.setProperty("twitter4j.oauth.consumerKey", appConf.getString("consumerKey"))
    System.setProperty("twitter4j.oauth.consumerSecret", appConf.getString("consumerSecret"))
    System.setProperty("twitter4j.oauth.accessToken", appConf.getString("accessToken"))
    System.setProperty("twitter4j.oauth.accessTokenSecret", appConf.getString("accessTokenSecret"))
    val sparkConf = new SparkConf().setAppName("TwitterStreaming").setMaster(appConf.getString("SPARK_MASTER"))//local[2]
    val ssc = new StreamingContext(sparkConf, Seconds(batchInterval_s)) // creating spark streaming context
    val stream = TwitterUtils.createStream(ssc, None)
    val tweet_data = stream.map(status => TweetData(status.getId, "#" + status.getUser.getScreenName, status.getText.trim()))
    tweet_data.foreachRDD(rdd => {
      println(s"A sample of tweets I gathered over ${batchInterval_s}s: ${rdd.take(10).mkString(" ")} (total tweets fetched: ${rdd.count()})")
    })
  }
}
case class TweetData(id: BigInt, author: String, tweetText: String)
My Error:
Exception in thread "main" com.typesafe.config.ConfigException$WrongType:/WorkSpace/InputFiles/application.conf: 5: Cannot concatenate object or list with a non-object-or-list, ConfigString("local") and SimpleConfigList([2]) are not compatible
at com.typesafe.config.impl.ConfigConcatenation.join(ConfigConcatenation.java:116)
Can anyone check the code and tell me what I'm doing wrong?
If your config file contains:
SPARK_MASTER=local[2]
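Change it to the following, so that the brackets are read as part of a string instead of as a list concatenated onto the string "local":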
SPARK_MASTER="local[2]"

Spark Streaming using Scala to insert to Hbase Issue

I am trying to read records from Kafka messages and put them into HBase. Although the Scala script runs without any issue, the inserts are not happening. Please help me.
Input:
rowkey1,1
rowkey2,2
Here is the code which I am using:
object Blaher {
  def blah(row: Array[String]) {
    val hConf = new HBaseConfiguration()
    val hTable = new HTable(hConf, "test")
    val thePut = new Put(Bytes.toBytes(row(0)))
    thePut.add(Bytes.toBytes("cf"), Bytes.toBytes("a"), Bytes.toBytes(row(1)))
    hTable.put(thePut)
  }
}
object TheMain extends Serializable{
  def run() {
    val ssc = new StreamingContext(sc, Seconds(1))
    val topicmap = Map("test" -> 1)
    val lines = KafkaUtils.createStream(ssc,"127.0.0.1:2181", "test-consumer-group",topicmap).map(_._2)
    val words = lines.map(line => line.split(",")).map(line => (line(0),line(1)))
    val store = words.foreachRDD(rdd => rdd.foreach(Blaher.blah))
    ssc.start()
  }
}
TheMain.run()
From the API doc for HTable's flushCommits() method: "Executes all the buffered Put operations". You should call this at the end of your blah() method. It looks like the puts are currently being buffered but either never executed or executed at some arbitrary later time.