AuthenticationException Neo4j Spark streaming - scala

I am using a Kafka producer and a Spark consumer. I want to pass some data through the topic as an array to the consumer and execute a Neo4j query with this data as parameters. For now, I want to test the query with a fixed set of data.
The problem is that when I try to run my consumer I get an exception:
org.neo4j.driver.v1.exceptions.AuthenticationException: Unsupported authentication token, scheme 'none' is only allowed when auth is disabled.
Here is my main method with Spark and Neo4j configs:
def main(args: Array[String]) {
  val sparkSession = SparkSession
    .builder()
    .appName("KafkaSparkStreaming")
    .master("local[*]")
    .getOrCreate()
  val sparkConf = sparkSession.conf

  val streamingContext = new StreamingContext(sparkSession.sparkContext, Seconds(3))
  streamingContext.sparkContext.setLogLevel("ERROR")

  val neo4jLocalConfig = ConfigFactory.parseFile(new File("configs/local_neo4j.conf"))
  sparkConf.set("spark.neo4j.bolt.url", neo4jLocalConfig.getString("neo4j.url"))
  sparkConf.set("spark.neo4j.bolt.user", neo4jLocalConfig.getString("neo4j.user"))
  sparkConf.set("spark.neo4j.bolt.password", neo4jLocalConfig.getString("neo4j.password"))

  val arr = Array("18731", "41.84000015258789", "-87.62999725341797")
  execNeo4jSearchQuery(arr, sparkSession.sparkContext)

  streamingContext.start()
  streamingContext.awaitTermination()
}
And this is the method in which I run my query:
def execNeo4jSearchQuery(data: Array[String], sc: SparkContext) = {
  println("Id: " + data(0) + ", Lat: " + data(1) + ", Lon: " + data(2))
  val neo = Neo4j(sc)
  val sqlContext = new SQLContext(sc)
  val query = "MATCH (m:Member)-[mtg_r:MT_TO_MEMBER]->(mt:MemberTopics)-[mtt_r:MT_TO_TOPIC]->(t:Topic), (t1:Topic)-[tt_r:GT_TO_TOPIC]->(gt:GroupTopics)-[tg_r:GT_TO_GROUP]->(g:Group)-[h_r:HAS]->(e:Event)-[a_r:AT]->(v:Venue) WHERE mt.topic_id = gt.topic_id AND distance(point({ longitude: {lon}, latitude: {lat}}),point({ longitude: v.lon, latitude: v.lat })) < 4000 AND mt.member_id = {id} RETURN g.group_name, e.event_name, v.venue_name"
  val df = neo.cypher(query).params(Map("lat" -> data(1).toDouble, "lon" -> data(2).toDouble, "id" -> data(0).toInt))
    .partitions(4).batch(25)
    .loadDataFrame
}
I checked the query and it works fine in Neo4j. So what may be causing this exception?

I have researched and tried various options, which helped me find an answer. As I understand it, this exception happens because I do not set the Spark config parameters for Neo4j properly. The solution is to provide the Neo4j settings as part of the SparkSession configuration itself, so that the underlying SparkConf already has all the Neo4j attributes set up when the session is created.
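A minimal sketch of that fix, reusing the config file and property keys from the question (the Neo4j properties go on the builder, so they are part of the configuration before the context is created, instead of being set on sparkSession.conf afterwards):

import java.io.File
import com.typesafe.config.ConfigFactory
import org.apache.spark.sql.SparkSession

val neo4jLocalConfig = ConfigFactory.parseFile(new File("configs/local_neo4j.conf"))

// Pass the Neo4j connection settings to the builder so they end up in the
// SparkConf that the SparkContext (and the Neo4j connector) actually sees.
val sparkSession = SparkSession
  .builder()
  .appName("KafkaSparkStreaming")
  .master("local[*]")
  .config("spark.neo4j.bolt.url", neo4jLocalConfig.getString("neo4j.url"))
  .config("spark.neo4j.bolt.user", neo4jLocalConfig.getString("neo4j.user"))
  .config("spark.neo4j.bolt.password", neo4jLocalConfig.getString("neo4j.password"))
  .getOrCreate()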

Related

Output is not showing, spark scala

The output shows the schema, but the output of the SQL query is not visible. I don't understand what I am doing wrong.
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object ex_1 {

  def parseLine(line: String): (String, String, Int, Int) = {
    val fields = line.split(" ")
    val project_code = fields(0)
    val project_title = fields(1)
    val page_hits = fields(2).toInt
    val page_size = fields(3).toInt
    (project_code, project_title, page_hits, page_size)
  }

  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.ERROR)
    val sc = new SparkContext("local[*]", "Weblogs")
    val lines = sc.textFile("F:/Downloads_F/pagecounts.out")
    val parsedLines = lines.map(parseLine)
    println("hello")

    val spark = SparkSession
      .builder
      .master("local")
      .getOrCreate

    import spark.implicits._
    val RDD1 = parsedLines.toDF("project", "page", "pagehits", "pagesize")
    RDD1.printSchema()
    RDD1.createOrReplaceTempView("logs")

    val min1 = spark.sql("SELECT * FROM logs WHERE pagesize >= 4733")
    val results = min1.collect()
    results.foreach(println)
    println("bye")
    spark.stop()
  }
}
As confirmed in the comments, using the show method displays the result of spark.sql(..).
Since spark.sql returns a DataFrame, calling show is the ideal way to display the data. Calling collect, as you were doing previously, is not advised:
Running collect requires moving all the data into the application's driver process, and doing so on a very large dataset can crash the driver process with OutOfMemoryError.
..
..
val min1 = spark.sql("SELECT * FROM logs WHERE pagesize >= 4733")
// where `false` prevents the output from being truncated.
min1.show(false)
println("bye")
spark.stop()
Even if your DataFrame is empty, you will still see a table output including the column names (i.e. the schema), whereas .collect() and println would print nothing in this scenario.
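For example, with the schema above, an empty result would still render roughly as:

+-------+----+--------+--------+
|project|page|pagehits|pagesize|
+-------+----+--------+--------+
+-------+----+--------+--------+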

How to perform Unit testing on Spark Structured Streaming?

I would like to know about the unit testing side of Spark Structured Streaming. My scenario is: I am getting data from Kafka, consuming it with Spark Structured Streaming, and applying some transformations on top of the data.
I am not sure how I can test this using Scala and Spark. Can someone tell me how to do unit testing in Structured Streaming using Scala? I am new to streaming.
tl;dr Use MemoryStream to add events and a memory sink for the output.
The following code should help to get started:
import org.apache.spark.sql.execution.streaming.MemoryStream

implicit val sqlCtx = spark.sqlContext
import spark.implicits._

val events = MemoryStream[Event]
val sessions = events.toDS
assert(sessions.isStreaming, "sessions must be a streaming Dataset")

// use sessions event stream to apply required transformations
val transformedSessions = ...

val streamingQuery = transformedSessions
  .writeStream
  .format("memory")
  .queryName(queryName)
  .option("checkpointLocation", checkpointLocation)
  .outputMode(queryOutputMode)
  .start

// Add events to MemoryStream as if they came from Kafka
val batch = Seq(
  eventGen.generate(userId = 1, offset = 1.second),
  eventGen.generate(userId = 2, offset = 2.seconds))
val currentOffset = events.addData(batch)

streamingQuery.processAllAvailable()
events.commit(currentOffset.asInstanceOf[LongOffset])

// check the output
// The output is in queryName table
// The following code simply shows the result
spark
  .table(queryName)
  .show(truncate = false)
So, I tried to implement the answer from @Jacek, but I couldn't figure out how to create the eventGen object, nor how to test a small streaming application that writes data to the console. I am also using MemoryStream, and here I show a small working example.
The class that I am testing is:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.streaming.StreamingQuery
import org.apache.spark.sql.{DataFrame, SparkSession, functions}
object StreamingDataFrames {

  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .appName(StreamingDataFrames.getClass.getSimpleName)
      .master("local[2]")
      .getOrCreate()

    val lines = readData(spark, "socket")
    val streamingQuery = writeData(lines)
    streamingQuery.awaitTermination()
  }

  def readData(spark: SparkSession, source: String = "socket"): DataFrame = {
    val lines: DataFrame = spark.readStream
      .format(source)
      .option("host", "localhost")
      .option("port", 12345)
      .load()
    lines
  }

  def writeData(df: DataFrame, sink: String = "console", queryName: String = "calleventaggs", outputMode: String = "append"): StreamingQuery = {
    println(s"Is this a streaming data frame: ${df.isStreaming}")

    val shortLines: DataFrame = df.filter(functions.length(col("value")) >= 3)

    val query = shortLines.writeStream
      .format(sink)
      .queryName(queryName)
      .outputMode(outputMode)
      .start()
    query
  }
}
I test only the writeData method. That is why I split the query into two methods.
Here is the Spec to test the class. I use a SharedSparkSession class to facilitate opening and closing the Spark context, as shown here.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.{LongOffset, MemoryStream}
import org.github.explore.spark.SharedSparkSession
import org.scalatest.funsuite.AnyFunSuite
class StreamingDataFramesSpec extends AnyFunSuite with SharedSparkSession {

  test("spark structured streaming can read from memory socket") {
    // We can import sql implicits
    implicit val sqlCtx = sparkSession.sqlContext
    import sqlImplicits._

    val events = MemoryStream[String]
    val queryName: String = "calleventaggs"

    // Add events to MemoryStream as if they came from Kafka
    val batch = Seq(
      "this is a value to read",
      "and this is another value"
    )
    val currentOffset = events.addData(batch)

    val streamingQuery = StreamingDataFrames.writeData(events.toDF(), "memory", queryName)
    streamingQuery.processAllAvailable()
    events.commit(currentOffset.asInstanceOf[LongOffset])

    val result: DataFrame = sparkSession.table(queryName)
    result.show

    streamingQuery.awaitTermination(1000L)

    assertResult(batch.size)(result.count)

    val values = result.take(2)
    assertResult(batch(0))(values(0).getString(0))
    assertResult(batch(1))(values(1).getString(0))
  }
}
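The SharedSparkSession helper itself is not reproduced above; a minimal sketch of what such a trait might look like, assuming ScalaTest and a local session (the names sparkSession and sqlImplicits match how the spec uses them, everything else is illustrative):

import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterAll, Suite}

// Hypothetical SharedSparkSession: one local SparkSession shared by the
// whole suite and stopped when the suite finishes.
trait SharedSparkSession extends BeforeAndAfterAll { self: Suite =>

  lazy val sparkSession: SparkSession = SparkSession.builder()
    .appName("shared-test-session")
    .master("local[2]")
    .getOrCreate()

  // Exposes the session's implicits under the name the spec imports.
  lazy val sqlImplicits = sparkSession.implicits

  override def afterAll(): Unit = {
    try sparkSession.stop()
    finally super.afterAll()
  }
}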

How do I stream data to Neo4j using Spark

I am trying to write streaming data to Neo4j using Spark and am having some problems (I am very new to Spark).
I have tried setting up a stream of word counts and can write this to Postgres using a custom ForeachWriter as in the example here. So I think that I understand the basic flow.
I have then tried to replicate this and send the data to Neo4j instead, using the neo4j-spark-connector. I am able to send data to Neo4j using the example in the Zeppelin notebook here. So I've tried to transfer this code across to the ForeachWriter, but I've hit a problem: the sparkContext is not available in the ForeachWriter, and from what I have read it shouldn't be passed in, because it runs on the driver while the foreach code runs on the executors. Can anyone help with what I should do in this situation?
Sink.scala:
val spark = SparkSession
  .builder()
  .appName("Neo4jSparkConnector")
  .config("spark.neo4j.bolt.url", "bolt://hdp1:7687")
  .config("spark.neo4j.bolt.password", "pw")
  .getOrCreate()

import spark.implicits._

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

val words = lines.as[String].flatMap(_.split(" "))
val wordCounts = words.groupBy("value").count()
wordCounts.printSchema()

val writer = new Neo4jSink()

import org.apache.spark.sql.streaming.ProcessingTime
val query = wordCounts
  .writeStream
  .foreach(writer)
  .outputMode("append")
  .trigger(ProcessingTime("25 seconds"))
  .start()

query.awaitTermination()
Neo4jSink.scala:
class Neo4jSink() extends ForeachWriter[Row] {

  def open(partitionId: Long, version: Long): Boolean = {
    true
  }

  def process(value: Row): Unit = {
    val word = ("Word", Seq("value"))
    val word_count = ("WORD_COUNT", Seq.empty)
    val count = ("Count", Seq("count"))
    Neo4jDataFrame.mergeEdgeList(sparkContext, value, word, word_count, count)
  }

  def close(errorOrNull: Throwable): Unit = {
  }
}
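One way to sidestep the missing SparkContext (a sketch, not taken from this thread) is to skip the connector inside the writer and talk to Neo4j through the plain Bolt driver: open a driver and session per partition in open(), run the write in process(), and close both in close(). The URL and password below mirror Sink.scala above; the username and the MERGE statement are illustrative guesses.

import org.apache.spark.sql.{ForeachWriter, Row}
import org.neo4j.driver.v1.{AuthTokens, Driver, GraphDatabase, Session, Values}

class Neo4jDriverSink extends ForeachWriter[Row] {

  @transient private var driver: Driver = _
  @transient private var session: Session = _

  def open(partitionId: Long, version: Long): Boolean = {
    // One connection per partition, created on the executor.
    driver = GraphDatabase.driver("bolt://hdp1:7687", AuthTokens.basic("neo4j", "pw"))
    session = driver.session()
    true
  }

  def process(value: Row): Unit = {
    // Illustrative upsert of a single word-count row.
    session.run(
      "MERGE (w:Word {value: {value}}) SET w.count = {count}",
      Values.parameters(
        "value", value.getAs[String]("value"),
        "count", Long.box(value.getAs[Long]("count"))))
  }

  def close(errorOrNull: Throwable): Unit = {
    if (session != null) session.close()
    if (driver != null) driver.close()
  }
}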

Spark job hanging up on an EC2 m4.10x machine

I have the code below that launches a Spark job. When I am working with fewer than 40 files (the maximum number of cores on my machine) the parallelize works fine; however, when I work with more files than that, it runs into trouble. Any advice, please.
object Cleanup extends Processor {

  def main(args: Array[String]): Unit = {
    val fileSeeker = new TelemetryFileSeeker("Config")
    val files = fileSeeker.searchFiles(bucketName, urlPrefix, "2018-01-01T00:00:00.000Z", "2018-04-30T00:00:00.000Z")
      .filter(_.endsWith(".gz"))
      .map(each => (each, each.slice(0, each.lastIndexOf("/")))).slice(0, 100)
    if (files.nonEmpty) {
      println("Number of Files" + files.length)
      sc.parallelize(files).map(each => changeFormat(each)).collect()
    }
  }

  def changeFormat(file: (String, String)): Unit = {
    val fileProcessor = new Processor("Config", sparksession)
    val uuid = java.util.UUID.randomUUID.toString
    val tempInput = "inputfolder" + uuid
    val tempOutput = "outputfolder" + uuid
    val inpaths = Paths.get(tempInput)
    val outpaths = Paths.get(tempOutput)
    if (Files.notExists(inpaths)) Files.createDirectory(inpaths)
    if (Files.notExists(outpaths)) Files.createDirectory(outpaths)

    val downloadedFiles = fileProcessor.downloadAndUnzip(bucketName, List(file._1), tempInput)
    val parsedFiles = fileProcessor.parseCSV(downloadedFiles)
    parsedFiles.select(
      "pa1",
      "pa2",
      "pa3"
    ).withColumn("pa4", lit(0.0)).write.mode(SaveMode.Overwrite).format(CSV_FORMAT)
      .option("codec", "org.apache.hadoop.io.compress.GzipCodec").save(tempOutput)

    val processedFiles = new File(tempOutput).listFiles.filter(_.getName.endsWith(".gz"))
    val filesNames = processedFiles.map(_.getName).toList
    val filesPaths = processedFiles.map(_.getPath).toList
    fileProcessor.cleanUpRemote(bucketName, "new/" + file._2, filesNames)
    fileProcessor.uploadFiles(bucketName, "new/" + file._2, filesPaths)
    fileProcessor.cleanUpLocal(tempInput, tempOutput)
    val remoteFiles = fileProcessor.checkRemote(bucketName, "new/" + file._2, filesNames)
    logger.info("completed " + file._1)
  }
}
The Spark config is below:
lazy val spark = SparkSession
  .builder()
  .appName("Project")
  .config("spark.master", "local[*]")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .config("spark.executor.memory", "5g")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .enableHiveSupport()
  .getOrCreate()
FYI: each call to the parseCSV function downloads one file into a temporary folder and creates a DataFrame from that folder. The files are 1 GB in size. Also, I am trying to run this using java -cp jar class.
While I couldn't figure out the exact issue, I bypassed it by passing only 38 files at a time, using the List grouped method.
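A minimal sketch of that workaround, assuming the same files list and changeFormat method as in the question:

// Work through the files in groups of 38 so each parallelize call stays
// within the number of cores available on the machine.
files.grouped(38).foreach { batch =>
  sc.parallelize(batch).map(each => changeFormat(each)).collect()
}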

Creating a broadcast variable with SparkSession? Spark 2.0

Is it possible to create broadcast variables with the sparkContext provided by SparkSession? I keep getting an error on sc.broadcast; however, in a different project, when using the SparkContext from org.apache.spark.SparkContext, I have no problems.
import org.apache.spark.sql.SparkSession
object MyApp {

  def main(args: Array[String]) {
    val spark = SparkSession.builder()
      .appName("My App")
      .master("local[*]")
      .getOrCreate()

    val sc = spark.sparkContext
      .setLogLevel("ERROR")

    val path = "C:\\Boxes\\github-archive\\2015-03-01-0.json"
    val ghLog = spark.read.json(path)
    val pushes = ghLog.filter("type = 'PushEvent'")
    pushes.printSchema()
    println("All events: " + ghLog.count)
    println("Only pushes: " + pushes.count)
    pushes.show(5)

    val grouped = pushes.groupBy("actor.login").count()
    grouped.show(5)

    val ordered = grouped.orderBy(grouped("count").desc)
    ordered.show(5)

    import scala.io.Source.fromFile
    val fileName = "ghEmployees.txt"
    val employees = Set() ++ (
      for {
        line <- fromFile(fileName).getLines()
      } yield line.trim
    )

    val bcEmployees = sc.broadcast(employees)
  }
}
Or is it a problem of using a Set() instead of a Seq object?
Thanks for any help.
Edit:
I keep getting a "cannot resolve symbol broadcast" error message in IntelliJ.
After compiling I get an error of:
Error:(47, 28) value broadcast is not a member of Unit
val bcEmployees = sc.broadcast(employees)
^
Your sc variable has type Unit because, according to the docs, setLogLevel has return type Unit. Do this instead:
val sc: SparkContext = spark.sparkContext
sc.setLogLevel("ERROR")
It is important to keep track of the types of your variables to catch errors earlier.
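With that change sc.broadcast resolves; a short sketch of how the broadcast set might then be used on the executors (the UDF and the filter on pushes are illustrative, not part of the original answer):

import org.apache.spark.SparkContext
import org.apache.spark.sql.functions.{col, udf}

val sc: SparkContext = spark.sparkContext
sc.setLogLevel("ERROR")

// broadcast resolves now because sc is a SparkContext, not Unit
val bcEmployees = sc.broadcast(employees)

// Illustrative use: keep only the pushes made by known employees.
val isEmployee = udf((login: String) => bcEmployees.value.contains(login))
pushes.filter(isEmployee(col("actor.login"))).show(5)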