log error from catch block to cosmos db - spark - scala

Objective: Retrieve objects from an S3 bucket using a GET API call, write the retrieved objects to Azure Data Lake, and, in case of errors such as 404s (object not found), write the error message to Cosmos DB.
"my_dataframe" consists of a column (s3ObjectName) with object names like:
s3ObjectName
a1.json
b2.json
c3.json
d4.json
e5.json
// retry function that writes the error to Cosmos DB in the event of failure
def retry[T](n: Int)(fn: => T): T = {
  Try {
    return fn
  } match {
    case Success(x) => x
    case Failure(t: Throwable) => {
      Thread.sleep(1000)
      if (n > 1) {
        retry(n - 1)(fn)
      } else {
        val loggerDf = Seq((t.toString)).toDF("Description")
          .withColumn("Type", lit("Failure"))
          .withColumn("id", uuid())
        loggerDf.write.format("cosmos.oltp").options(ExceptionCfg).mode("APPEND").save()
        throw t
      }
    }
  }
}
//execute s3 get api call
my_dataframe.rdd.foreachPartition(partition => {
  val creds = new BasicAWSCredentials(AccessKey, SecretKey)
  val clientRegion: Regions = Regions.US_EAST_1
  val s3client = AmazonS3ClientBuilder.standard()
    .withRegion(clientRegion)
    .withCredentials(new AWSStaticCredentialsProvider(creds))
    .build()
  partition.foreach(x => {
    retry(2) {
      val objectKey = x.getString(0)
      val i = s3client.getObject(s3bucket_name, objectKey).getObjectContent
      val inputS3String = IOUtils.toString(i, "UTF-8")
      val filePath = s"${data_lake_file_path}"
      val file = new File(filePath)
      val fileWriter = new FileWriter(file)
      val bw = new BufferedWriter(fileWriter)
      bw.write(inputS3String)
      bw.close()
      fileWriter.close()
    }
  })
})
When the above is executed, it results in the following error:
Caused by: java.lang.NullPointerException
The error occurs in the retry function when it tries to create the DataFrame loggerDf and write it to Cosmos DB.
Is there another way to write the error messages to Cosmos DB?

Maybe this isn't a good time to use Spark. There is already Hadoop tooling that accomplishes this type of S3 file transfer without Spark.
If you still feel Spark is the correct tool:
Split this into a reporting problem and a data transfer problem.
Create and test a list of the files to see if they're valid. Write a UDF that does the dirty work of producing a DataFrame of good/bad files.
Report the files that aren't valid (to Cosmos).
Transfer the files that are valid (a sketch of this split follows below).
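For example, a minimal sketch of that split, reusing the credentials, bucket, and ExceptionCfg names from the question (building an S3 client per row inside the UDF is inefficient, but it keeps the sketch short):
import org.apache.spark.sql.functions.{col, expr, lit, udf}
// Hypothetical validation UDF: returns true if the object exists in S3.
val objectExists = udf((key: String) => {
  val s3 = AmazonS3ClientBuilder.standard()
    .withRegion(Regions.US_EAST_1)
    .withCredentials(new AWSStaticCredentialsProvider(new BasicAWSCredentials(AccessKey, SecretKey)))
    .build()
  s3.doesObjectExist(s3bucket_name, key)
})
val checked = my_dataframe.withColumn("exists", objectExists(col("s3ObjectName"))).cache()
// Report the invalid files to Cosmos DB -- this write runs on the driver,
// so it avoids the NPE caused by creating DataFrames inside foreachPartition.
checked.filter(!col("exists"))
  .select(col("s3ObjectName").as("Description"))
  .withColumn("Type", lit("Failure"))
  .withColumn("id", expr("uuid()")) // Spark SQL's uuid(); the question's uuid() helper works too
  .write.format("cosmos.oltp").options(ExceptionCfg).mode("APPEND").save()
// Transfer only the valid files, e.g. with the foreachPartition logic from the question.
val validFiles = checked.filter(col("exists"))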

If you want to write errors to Cosmos DB you'll need to use an "out of band" method to initiate the connection from the executors (think: initiating a JDBC connection from inside the partition.foreach).
As a lower standard, if you only want to know that a failure happened, you could use accumulators. They aren't made for logging, but they do help transfer information from executors to the driver, which would let you write something back to Cosmos from the driver. They were really intended to simply count whether something happened (and they can double-count if a task is retried, so they're not perfect). They can technically transfer information back to the driver, but should only be used for countable things. If this type of failure is extremely irregular, that's likely fine; if it happens a lot, they're not suitable.
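A minimal sketch of that accumulator-to-driver approach, assuming the spark session, my_dataframe, and ExceptionCfg from the question (the S3 GET and data lake write are elided as comments):
import org.apache.spark.sql.functions.{expr, lit}
import org.apache.spark.util.CollectionAccumulator
import scala.util.{Failure, Success, Try}
// Collect failure messages on the executors.
val failures: CollectionAccumulator[String] = spark.sparkContext.collectionAccumulator[String]("s3Failures")
my_dataframe.rdd.foreachPartition(partition => {
  // ... build the S3 client as in the question ...
  partition.foreach(x => {
    val objectKey = x.getString(0)
    Try {
      // ... GET the object and write it to the data lake, as in the question ...
    } match {
      case Success(_) => ()
      case Failure(t) => failures.add(s"$objectKey: ${t.toString}")
    }
  })
})
// Back on the driver, write the collected errors to Cosmos DB in one batch.
import scala.collection.JavaConverters._
import spark.implicits._
val errors = failures.value.asScala.toSeq
if (errors.nonEmpty) {
  errors.toDF("Description")
    .withColumn("Type", lit("Failure"))
    .withColumn("id", expr("uuid()"))
    .write.format("cosmos.oltp").options(ExceptionCfg).mode("APPEND").save()
}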

Related

Scala: get data from scylla using spark

Scala/Spark newbie here. I have inherited some old code which I have refactored and am trying to use to retrieve data from Scylla. The code looks like:
val TEST_QUERY = s"SELECT user_id FROM test_table WHERE name = ? AND id_type = 'test_type';"
var selectData = List[Row]()
dataRdd.foreachPartition {
  iter => {
    // Build up a cluster that we can connect to
    // Start a session with the cluster by connecting to it.
    val cluster = ScyllaConnector.getCluster(clusterIpString, scyllaPreferredDc, scyllaUsername, scyllaPassword)
    var batchCounter = 0
    val session = cluster.connect(tableConfig.keySpace)
    val preparedStatement: PreparedStatement = session.prepare(TEST_QUERY)
    iter.foreach {
      case (test_name: String) => {
        // Get results
        val testResults = session.execute(preparedStatement.bind(test_name))
        if (testResults != null) {
          val testResult = testResults.one()
          if (testResult != null) {
            val user_id = testResult.getString("user_id")
            selectData ::= Row(user_id, test_name)
          }
        }
      }
    }
    session.close()
    cluster.close()
  }
}
println("Head is =======> ")
println(selectData.head)
The above does not return any data and fails with a null pointer exception because the selectData list is empty, although there is definitely data that matches the select statement. I feel like the way I'm doing it is not correct, but I can't figure out what needs to change to fix it, so any help is much appreciated.
PS: The whole idea of using a list to keep the results is so that I can use that list to create a dataframe. I'd be grateful if you could point me in the right direction here.
If you look at the definition of the foreachPartition function, you will see that it by definition can't return anything, because its return type is Unit (void); the list is only modified in the executors' copies of the closure, so the driver-side selectData stays empty.
Anyway, this is a very bad way of querying data from Cassandra/Scylla from Spark. For that purpose the Spark Cassandra Connector exists, and it should work with Scylla as well because of the protocol compatibility.
To read a dataframe from Cassandra just do:
spark.read
  .format("cassandra")
  .option("keyspace", "ksname")
  .option("table", "tab")
  .load()
Documentation is quite detailed, so just read it.
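If you then need the original per-key lookup (user_id by name, as in TEST_QUERY), here is a minimal sketch, assuming the lookup keys live in a hypothetical namesDf DataFrame with a single name column (the question's dataRdd converted to a DataFrame):
import spark.implicits._
// Load the table through the connector (fully-qualified format name;
// keyspace/table/columns are the ones implied by the question's query).
val users = spark.read
  .format("org.apache.spark.sql.cassandra")
  .option("keyspace", "ksname")
  .option("table", "test_table")
  .load()
  .filter($"id_type" === "test_type")
  .select($"name", $"user_id")
// Join the lookup keys against the loaded table instead of issuing point queries.
val selectData = namesDf.join(users, Seq("name"))
selectData.show()
For very large tables, the connector's joinWithCassandraTable (RDD API) avoids scanning the whole table, but the plain DataFrame join above is the simplest starting point.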

Using Spark on Dataproc, how to write to GCS separately from each partition?

Using Spark on GCP Dataproc, I successfully write an entire RDD to GCS like so:
rdd.saveAsTextFile(s"gs://$path")
The products are files for each partition in the same path.
How do I write files for each partition (with a unique path based on information from the partition)?
Below is an invented, non-working, wishful code example:
rdd.mapPartitionsWithIndex(
  (i, partition) => {
    partition.write(path = s"gs://partition_$i", data = partition_specific_data)
  }
)
When I call the function below from within the partition, on my Mac it writes to local disk; on Dataproc I get an error that gs is not recognized as a valid path.
def writeLocally(filePath: String, data: Array[Byte], errorMessage: String): Unit = {
  println("Juicy Platform")
  val path = new Path(filePath)
  var ofos: Option[FSDataOutputStream] = null
  try {
    println(s"\nTrying to write to $filePath\n")
    val conf = new Configuration()
    conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
    // conf.addResource(new Path("/home/hadoop/conf/core-site.xml"))
    println(conf.toString)
    val fs = FileSystem.get(conf)
    val fos = fs.create(path)
    ofos = Option(fos)
    fos.write(data)
    println(s"\nWrote to $filePath\n")
  }
  catch {
    case e: Exception =>
      logError(errorMessage, s"Exception occurred writing to GCS:\n${ExceptionUtils.getStackTrace(e)}")
  }
  finally {
    ofos match {
      case Some(i) => i.close()
      case _ =>
    }
  }
}
This is the error:
java.lang.IllegalArgumentException: Wrong FS: gs://path/myFile.json, expected: hdfs://cluster-95cf-m
If running on a Dataproc cluster, you shouldn't need to explicitly populate "fs.gs.impl" in the Configuration; a new Configuration() should already contain the necessary mappings.
The main problem here is that val fs = FileSystem.get(conf) is using the fs.defaultFS property of the conf; it has no way of knowing whether you wanted to get a FileSystem instance specific to HDFS or to GCS. In general, in Hadoop and Spark, a FileSystem instance is fundamentally tied to a single URL scheme; you need to fetch a scheme-specific instance for each different scheme, such as hdfs:// or gs:// or s3://.
The simplest fix to your problem is to always use Path.getFileSystem(Configuration) as opposed to FileSystem.get(Configuration). And make sure your path is fully-qualified with the scheme:
...
val path = new Path("gs://bucket/foo/data")
val fs = path.getFileSystem(conf)
val fos = fs.create(path)
ofos = Option(fos)
fos.write(data)
...
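Putting that together with the per-partition wish from the question, a minimal sketch (the bucket and file layout are made-up placeholders; on Dataproc the default Configuration already resolves gs://):
import java.nio.charset.StandardCharsets
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
val written = rdd.mapPartitionsWithIndex { (i, partition) =>
  val conf = new Configuration()  // created inside the task, so nothing needs to be serialized
  val path = new Path(s"gs://my-bucket/output/partition_$i.txt")  // hypothetical bucket/path
  val fs = path.getFileSystem(conf)  // scheme-specific FileSystem (GCS here)
  val out = fs.create(path, true)  // overwrite if the file already exists
  try {
    partition.foreach { record =>
      out.write((record.toString + "\n").getBytes(StandardCharsets.UTF_8))
    }
  } finally {
    out.close()
  }
  Iterator.single(path.toString)
}
written.collect().foreach(println)  // an action forces the lazy mapPartitionsWithIndex to run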

log from spark udf to driver

I have a simple UDF in Databricks used in Spark. I can't use println or log4j or the like because the output ends up on the executors, and I need it on the driver. I have a very simple log setup:
var logMessage = ""

def log(msg: String){
  logMessage += msg + "\n"
}

def writeLog(file: String){
  println("start write")
  println(logMessage)
  println("end write")
}

def warning(msg: String){
  log("*WARNING* " + msg)
}

val CleanText = (s: Int) => {
  log("I am in this UDF")
  s + 2
}

sqlContext.udf.register("CleanText", CleanText)
How can I get this to function properly and log to the driver?
The closest mechanism in Apache Spark to what you're trying to do is accumulators. You can accumulate the log lines on the executors and access the result on the driver:
// create a collection accumulator using the spark context:
val logLines: CollectionAccumulator[String] = sc.collectionAccumulator("log")

// log function adds a line to accumulator
def log(msg: String): Unit = logLines.add(msg)

// driver-side function can print the log using accumulator's *value*
def writeLog() {
  import scala.collection.JavaConverters._
  println("start write")
  logLines.value.asScala.foreach(println)
  println("end write")
}

val CleanText = udf((s: Int) => {
  log(s"I am in this UDF, got: $s")
  s + 2
})

// use UDF in some transformation:
Seq(1, 2).toDF("a").select(CleanText($"a")).show()

writeLog()
// prints:
// start write
// I am in this UDF, got: 1
// I am in this UDF, got: 2
// end write
BUT: this isn't really recommended, especially not for logging purposes. If you log on every record, this accumulator would eventually crash your driver with an OutOfMemoryError, or just slow you down horribly.
Since you're using Databricks, I would check what options they support for log aggregation, or simply use the Spark UI to view the executor logs.
You can't... unless you want to go crazy and make some sort of Logback appender that sends logs over the network or something like that.
The code for the UDF will be run on all your executors when you evaluate a data frame. So, you might have 2000 hosts running it and each of them will log to their own location; that's how Spark works. The driver isn't the one running the code so it can't be logged to.
You can use YARN log aggregation to pull all the logs from the executors for later analysis, though.
You could probably also, with some work, write to a Kafka stream or something creative like that, and read the logs back off the stream later.

EsHadoopException: Could not write all entries for bulk operation Spark Streaming

I want to traverse the stream of data, run a query on it, and return the results, which should be written into Elasticsearch. I tried to use the mapPartitions method to create the connection to the database; however, I get the following error, which indicates that the partition returns nothing to the RDD (I guess some action should be added after the transformations):
org.elasticsearch.hadoop.EsHadoopException: Could not write all entries for bulk operation [10/10]. Error sample (first [5] error messages)
What can be changed in the code to get the data into the RDD and send it to Elasticsearch without any trouble?
Also, I had a variant of the solution for this problem with flatMap in foreachRDD; however, it creates a connection to the database for each RDD, which is not efficient in terms of performance.
This is the code for streaming data processing:
wordsArrays.foreachRDD(rdd => {
  rdd.mapPartitions { part => {
      val neo4jConfig = neo4jConfigurations.getNeo4jConfig(args(1))
      part.map(
        data => {
          val recommendations = execNeo4jSearchQuery(neo4jConfig, data)
          val calendarTime = Calendar.getInstance.getTime
          val recommendationsMap = convertDataToMap(recommendations, calendarTime)
          recommendationsMap
        })
    }
  }
}.saveToEs("rdd-timed/output")
)
The problem was that I tried to convert the iterator directly into an Array, although it holds multiple rows of my records. That is why Elasticsearch was not able to map this collection of records to the defined single-record schema.
Here is the code that works properly:
wordsArrays.foreachRDD(rdd => {
  rdd.mapPartitions { partition => {
      val neo4jConfig = neo4jConfigurations.getNeo4jConfig(args(1))
      val result = partition.map(data => {
        val recommendations = execNeo4jSearchQuery(neo4jConfig, data)
        val calendarTime = Calendar.getInstance.getTime
        convertDataToMap(recommendations, calendarTime)
      }).toList.flatten
      result.iterator
    }
  }.saveToEs("rdd-timed/output")
})

Inserting into Cassandra with Akka Streams

I'm learning Akka Streams and as an exercise I would like to insert logs into Cassandra. The issue is that I could not manage to make the stream insert logs into the database.
I naively tried the following:
object Application extends AkkaApp with LogApacheDao {
  // The log file is read line by line
  val source: Source[String, Unit] = Source.fromIterator(() => scala.io.Source.fromFile(filename).getLines())

  // Each line is converted to an ApacheLog object
  val flow: Flow[String, ApacheLog, Unit] = Flow[String]
    .map(rawLine => {
      rawLine.split(",") // implicit conversion Array[String] -> ApacheLog
    })

  // Log objects are inserted to Cassandra
  val sink: Sink[ApacheLog, Future[Unit]] = Sink.foreach[ApacheLog] { log => saveLog(log) }

  source.via(flow).to(sink).run()
}
saveLog() is defined in LogApacheDao like this (I omitted the column values for clearer code):
val session = cluster.connect()
session.execute(s"CREATE KEYSPACE IF NOT EXISTS $keyspace WITH replication = {'class':'SimpleStrategy', 'replication_factor':1};")
session.execute(s"DROP TABLE IF EXISTS $keyspace.$table;")
session.execute(s"CREATE TABLE $keyspace.$table (...)")
val preparedStatement = session.prepare(s"INSERT INTO $keyspace.$table (...) VALUES (...);")
def saveLog(logEntry: ApacheLog) = {
  val stmt = preparedStatement.bind(...)
  session.executeAsync(stmt)
}
The conversion from Array[String] to ApacheLog when entering the sink happens without issue (verified with println). Also, the keyspace and table are both created, but when the execution comes to saveLog, it seems that something is blocking and no insertion is made.
I do not get any errors, but the Cassandra driver core (3.0.0) keeps giving me:
Connection[/172.17.0.2:9042-1, inFlight=0, closed=false] was inactive for 30 seconds, sending heartbeat
Connection[/172.17.0.2:9042-2, inFlight=0, closed=false] heartbeat query succeeded
I should add that I use a dockerized Cassandra.
Try using the Cassandra connector in Alpakka.
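A minimal sketch of what that could look like, assuming the akka-stream-alpakka-cassandra module is on the classpath and reusing the session, preparedStatement, source, and flow from the question (the bound column values stay omitted, and an ActorSystem named system plus a materializer from AkkaApp are assumed to be in scope):
import akka.Done
import akka.stream.alpakka.cassandra.scaladsl.CassandraSink
import akka.stream.scaladsl.Sink
import com.datastax.driver.core.{BoundStatement, PreparedStatement, Session}
import scala.concurrent.{ExecutionContext, Future}

implicit val cassandraSession: Session = session        // driver session from LogApacheDao
implicit val ec: ExecutionContext = system.dispatcher   // ActorSystem assumed from AkkaApp

// Turn an ApacheLog into a BoundStatement (column values omitted, as in the question).
val statementBinder: (ApacheLog, PreparedStatement) => BoundStatement =
  (log, stmt) => stmt.bind(...)

// Sink that writes each element to Cassandra with bounded parallelism (2 here).
val cassandraSink: Sink[ApacheLog, Future[Done]] =
  CassandraSink[ApacheLog](2, preparedStatement, statementBinder)

// Unlike fire-and-forget executeAsync inside Sink.foreach, the materialized
// Future[Done] completes only once all inserts have finished (or failed).
val done: Future[Done] = source.via(flow).runWith(cassandraSink)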