foreachPartition method of RDD is not working in GCP cluster - scala

I am uploading data from a Spark job through API calls. The third-party API has a payload limit of 5 MB per call, so I accumulate records into the request body until the limit is reached, to minimize the number of API calls. I do this inside the foreachPartition method of the RDD, with numbered log messages added for analysis. The code runs fine when I run the Spark job locally on my machine (the APIs are called and the data is uploaded), but it does not behave the same way on a GCP Dataproc cluster: the data is not uploaded through the APIs, so I believe the code inside foreachPartition is not being called.
When running locally I can see all the log messages (1, 2, 3, 4, 5, 6, 7), but when running the job on the GCP cluster I can see only a few of them (1, 3).
Sample code is below for reference.
I would appreciate suggestions for making it work on the GCP cluster as well.
def exportData(
    client: ApiClient,
    batchSize: Int
): Unit = {
  val exportableDataRdd = getDataToUpload() //this is rdd of type RDD[UserDataObj]
  logger.info(s"1. exportableDataRdd count:${exportableDataRdd.count}") //not a good practice to call count here but calling just for debugging
  exportableDataRdd.foreachPartition { iterator =>
    logger.info(s"2. perPartition iteration")
    perPartitionMethod(client, iterator, batchSize)
  }
  logger.info(s"3. Data export completed")
}

def perPartitionMethod(
    client: ApiClient,
    iterator: Iterator[UserDataObj],
    batchSize: Int
): Unit = {
  logger.info(s"4. Inside perPartition")
  iterator.grouped(batchSize).foreach { userDataGroup =>
    val payLoadLimit = 5000000 //5 MB
    val groupSize = userDataGroup.size
    var counter = 0
    var batchedUsersData = Seq[UserDataObj]()
    // foreach instead of map: the body runs only for its side effects
    userDataGroup.foreach { user =>
      counter = counter + 1
      // serialize the candidate batch to measure the payload size it would produce
      val curUsersDataSet = batchedUsersData :+ user
      val body = Map[String, Any]("data" -> curUsersDataSet.map(_.toMap))
      val apiPayload = Serialization.write(body)
      val size = apiPayload.getBytes().length
      if (size > payLoadLimit) {
        val usersToUpload = batchedUsersData
        logger.info(s"5. API called with batch size: ${usersToUpload.size}")
        uploadDataThroughAPI(usersToUpload, client) //method to upload data through API
        batchedUsersData = Seq[UserDataObj](user)
      } else {
        batchedUsersData = batchedUsersData :+ user
      }
      //upload left out data
      if (counter == groupSize && batchedUsersData.size > 0) {
        uploadDataThroughAPI(batchedUsersData, client)
        logger.info(s"6. API called with batch size: ${batchedUsersData.size}")
      }
    }
  }
  logger.info(s"7. perPartition completed")
}
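For reference, the accumulate-until-the-limit step described in the question can be expressed without Spark at all, which makes it easy to test on its own. This is only a minimal sketch: it assumes that the summed encoded size of the records is a good enough stand-in for the real payload size (it ignores the JSON envelope and separators, so a real limit would need some headroom). `batchBySize` and `utf8Size` are hypothetical names, not part of the original code.

```scala
import java.nio.charset.StandardCharsets

// Hypothetical helper: lazily groups records so that the summed size of each
// batch stays within limitBytes.
def batchBySize[A](records: Iterator[A], limitBytes: Int)(sizeOf: A => Int): Iterator[Vector[A]] =
  new Iterator[Vector[A]] {
    private val source = records.buffered
    def hasNext: Boolean = source.hasNext
    def next(): Vector[A] = {
      val batch = Vector.newBuilder[A]
      var used = 0
      var count = 0
      // Always take at least one record, so a single oversized record
      // ends up alone in its batch instead of stalling the iterator.
      while (source.hasNext && (count == 0 || used + sizeOf(source.head) <= limitBytes)) {
        val record = source.next()
        used += sizeOf(record)
        count += 1
        batch += record
      }
      batch.result()
    }
  }

// Assumed size function for the demo: UTF-8 length of a string record.
val utf8Size: String => Int = _.getBytes(StandardCharsets.UTF_8).length

val batches =
  batchBySize(Iterator("aaaa", "bbbb", "cc", "dddddddddddd", "e"), limitBytes = 10)(utf8Size).toList
```

Inside foreachPartition, each batch would then be handed to uploadDataThroughAPI. Note also that code inside foreachPartition runs on the executors, so its log output goes to the executor logs rather than the driver log, which may be why messages 2 and 4 to 7 do not show up next to 1 and 3 on the cluster.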

Related

Calling elocation geocode API returns empty in Spark

I have an odd problem when calling the eLocation geocoding API via Spark: I always get an empty body, even for addresses that I know will return a coordinate. I am developing a geocoding app using Spark (2.3.3) and Scala, and I use scalaj to call the REST API. The code which calls the API is as follows:
def getGeoCoderLocation(sc: SparkSession, req: String, url: String, proxy_host: String, proxy_port: String, response_format: String): scala.collection.Map[String, (String, String)] = {
import sc.implicits._
val httpresponse = Http(url).proxy(proxy_host, proxy_port.toInt).postForm.params(("xml_request", req), ("format", response_format)).asString
println(httpresponse.body)
println(httpresponse.contentType.getOrElse(""))
println(httpresponse.headers)
println(httpresponse)
if(!httpresponse.contentType.getOrElse("").contains("text/html")) {
val body = httpresponse.body
val httpresponse_body = parseJSON(Option(body).getOrElse("[{\"x\":, \"y\":}]"))
val location = for (it <- 0 until httpresponse_body.length) yield {
(Option(httpresponse_body(it)(0).x).getOrElse("").toString, Option(httpresponse_body(it)(0).y).getOrElse("").toString, it)
}
val locDF = location.toDF(Seq("LONGITUDE", "LATITUDE", "row"): _*)//.withColumn("row", monotonically_increasing_id())
locDF.show(20, false)
locDF.rdd.map { r => (Option(r.get(2)).getOrElse("").toString, (Option(r.get(0)).getOrElse("").toString, Option(r.getString(1)).getOrElse("").toString)) }.collectAsMap()
}
else {
val locDF = Seq(("","","-")).toDF(Seq("LONGITUDE", "LATITUDE", "row"): _*)//.withColumn("row", monotonically_increasing_id())
locDF.show(20, false)
locDF.rdd.map { r => (Option(r.get(2)).getOrElse("").toString, (Option(r.get(0)).getOrElse("").toString, Option(r.getString(1)).getOrElse("").toString)) }.collectAsMap()
}
}
Where
url = http://elocation.oracle.com/elocation/lbs
proxy_host = (ip of proxy)
proxy_port = (port number)
req = "<?xml version=\"1.0\" standalone=\"yes\"?>\n<geocode_request vendor=\"elocation\">\n\t<address_list>\n\t\t|<list of requests>|\n\t</address_list>\n</geocode_request>"
response_format = JSON
So when I print the body, it is always [{}] (i.e. an empty JSON array) when I run my app in Spark. When I run the same request without spark-submit (e.g. java -jar test.jar), I get a proper array of JSON objects.
Is there a setting in Spark which blocks the app from receiving REST responses? We are using Cloudera 5.16.x.
I have also tried setting the proxy information using --conf "spark.executor.extraJavaOptions=-Dhttp.proxyHost=(ip) -Dhttp.proxyPort=(port) -Dhttps.proxyHost=(ip) -Dhttps.proxyPort=(port)" but then I get:
Exception in thread "main" org.apache.hadoop.security.KerberosAuthException: Login failure for user: (principal) from keytab (keytab) javax.security.auth.login.LoginException: Cannot locate KDC
Please help, as I don't know where to look to solve this; I have never encountered it before.
OK, found the cause: the payload was actually empty, which is why eLocation kept returning a blank response.
Lesson of the day: check your payload.
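One way to catch this earlier is to refuse to build a request when there is nothing to geocode. The following is only a sketch: `geocodeRequest` is a hypothetical helper, the envelope is copied from the question, and the `<entry-1/>` strings are placeholders rather than real eLocation request elements.

```scala
// Hypothetical helper: wraps already-built request entries in the XML envelope
// from the question, and reports an error instead of sending an empty payload.
def geocodeRequest(entries: Seq[String]): Either[String, String] = {
  val nonBlank = entries.map(_.trim).filter(_.nonEmpty)
  if (nonBlank.isEmpty) Left("geocode request has no address entries")
  else Right(
    "<?xml version=\"1.0\" standalone=\"yes\"?>\n" +
      "<geocode_request vendor=\"elocation\">\n\t<address_list>\n\t\t" +
      nonBlank.mkString("\n\t\t") +
      "\n\t</address_list>\n</geocode_request>"
  )
}

// Placeholder entries for illustration only.
val empty = geocodeRequest(Seq("", "  "))
val ok = geocodeRequest(Seq("<entry-1/>", "<entry-2/>"))
```

The caller can then log the Left case and skip the HTTP call, instead of debugging an empty response later.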

Alpakka S3 connector stream won't handle the load, throwing akka.stream.BufferOverflowException

I have an akka-http service and I am trying out the Alpakka S3 connector for uploading files. Previously I was using a temporary file and then uploading it with the Amazon SDK. That approach required some adjustments to make the Amazon SDK more Scala-like, but it could handle even 1000 requests at once. Throughput wasn't amazing, but all of the requests went through eventually. Here is the code before the changes, without Alpakka:
```
path("uploadfile") {
withRequestTimeout(20.seconds) {
storeUploadedFile("csv", tempDestination) {
case (metadata, file) =>
val uploadFuture = upload(file, file.toPath.getFileName.toString)
onComplete(uploadFuture) {
case Success(_) => complete(StatusCodes.OK)
case Failure(_) => complete(StatusCodes.FailedDependency)
}
}
}
}
}
case class S3UploaderException(msg: String) extends Exception(msg)
def upload(file: File, key: String): Future[String] = {
val s3Client = AmazonS3ClientBuilder.standard()
.withCredentials(new DefaultAWSCredentialsProviderChain())
.withRegion(Regions.EU_WEST_3)
.build()
val promise = Promise[String]()
val listener = new ProgressListener() {
override def progressChanged(progressEvent: ProgressEvent): Unit = {
(progressEvent.getEventType: @unchecked) match {
case ProgressEventType.TRANSFER_FAILED_EVENT => promise.failure(S3UploaderException(s"Uploading a file with a key: $key"))
case ProgressEventType.TRANSFER_COMPLETED_EVENT |
ProgressEventType.TRANSFER_CANCELED_EVENT => promise.success(key)
}
}
}
val request = new PutObjectRequest("S3_BUCKET", key, file)
request.setGeneralProgressListener(listener)
s3Client.putObject(request)
promise.future
}
```
When I changed this to use the Alpakka connector, the code looked much nicer, as we can just connect the ByteSource and the Alpakka Sink together. However, this approach cannot handle such a big load. When I execute 1000 requests at once (10 KB files), fewer than 10% go through and the rest fail with this exception:
akka.stream.alpakka.s3.impl.FailedUpload: Exceeded configured
max-open-requests value of [32]. This means that the request queue of
this pool
(HostConnectionPoolSetup(bargain-test.s3-eu-west-3.amazonaws.com,443,ConnectionPoolSetup(ConnectionPoolSettings(4,0,5,32,1,30
seconds,ClientConnectionSettings(Some(User-Agent: akka-http/10.1.3),10
seconds,1
minute,512,None,WebSocketSettings(,ping,Duration.Inf,akka.http.impl.settings.WebSocketSettingsImpl$$$Lambda$4787/1279590204@4d809f4c),List(),ParserSettings(2048,16,64,64,8192,64,8388608,256,1048576,Strict,RFC6265,true,Set(),Full,Error,Map(If-Range
-> 0, If-Modified-Since -> 0, If-Unmodified-Since -> 0, default -> 12, Content-MD5 -> 0, Date -> 0, If-Match -> 0, If-None-Match -> 0,
User-Agent ->
32),false,true,akka.util.ConstantFun$$$Lambda$4534/1539966798@69c23cd4,akka.util.ConstantFun$$$Lambda$4534/1539966798@69c23cd4,akka.util.ConstantFun$$$Lambda$4535/297570074@6b426c59),None,TCPTransport),New,1
second),akka.http.scaladsl.HttpsConnectionContext@7e0f3726,akka.event.MarkerLoggingAdapter@74f3a78b)))
has completely filled up because the pool currently does not process
requests fast enough to handle the incoming request load. Please retry
the request later. See
http://doc.akka.io/docs/akka-http/current/scala/http/client-side/pool-overflow.html
for more information.
Here is what the summary of a Gatling test looks like:
---- Response Time Distribution ----------------------------------------
t < 800 ms 0 ( 0%)
800 ms < t < 1200 ms 0 ( 0%)
t > 1200 ms 90 ( 9%)
failed 910 ( 91%)
When I execute 100 simultaneous requests, half of them fail, so the result is still not satisfactory.
This is the new code:
```
path("uploadfile") {
withRequestTimeout(20.seconds) {
extractRequestContext { ctx =>
implicit val materializer = ctx.materializer
extractActorSystem { actorSystem =>
fileUpload("csv") {
case (metadata, byteSource) =>
val uploadFuture = byteSource.runWith(S3Uploader.sink("s3FileKey")(actorSystem, materializer))
onComplete(uploadFuture) {
case Success(_) => complete(StatusCodes.OK)
case Failure(_) => complete(StatusCodes.FailedDependency)
}
}
}
}
}
}
def sink(s3Key: String)(implicit as: ActorSystem, m: Materializer) = {
val regionProvider = new AwsRegionProvider {
def getRegion: String = Regions.EU_WEST_3.getName
}
val settings = new S3Settings(MemoryBufferType, None, new DefaultAWSCredentialsProviderChain(), regionProvider, false, None, ListBucketVersion2)
val s3Client = new S3Client(settings)(as, m)
s3Client.multipartUpload("S3_BUCKET", s3Key)
}
```
The complete code with both endpoints can be seen here
I have a couple of questions.
1) Is this a feature? Is this what we can call a backpressure?
2) If I would like this code to behave like the old approach with a temporary file (no failed requests and all of them finish at some point) what do I have to do? I was trying to implement a queue for the stream (link to the source below), but this made no difference. The code can be seen here.
(* DISCLAIMER * I am still a scala newbie trying to quickly understand akka streams and find some workaround for the issue. There are big chances that there is something simple wrong in this code. * DISCLAIMER *)
It's a backpressure feature.
The message says "Exceeded configured max-open-requests value of [32]". In the config, max-open-requests is set to 32 by default.
Streaming is meant for working with large amounts of data, not for handling a very large number of requests per second.
The Akka developers had to pick some default for max-open-requests. They chose 32 without knowing what the pool would be used for: sending 1000 32 KB files at once, or 1000 1 GB files? They can't know. But they still want apps running on the defaults (which is probably what 80% of people use) to behave gracefully and safely, so they had to limit the processing capacity.
You asked for 1000 uploads "now", but I am pretty sure the Amazon SDK did not send 1000 files simultaneously either. It used some kind of queue, which may be a good approach for you too if you have many small files to upload.
But it is perfectly fine to tune it for your case!
If you know that your machine and the target can handle more simultaneous connections, you can raise the value.
Also, for a lot of HTTP calls, use a cached host connection pool.
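The limit itself lives under akka.http.host-connection-pool.max-open-requests in the Akka HTTP configuration. The queueing idea can be illustrated without Akka at all. The sketch below is not the Alpakka API: it uses a plain fixed-size thread pool, where `fakeUpload` and `maxInFlight` are made-up names, to show how excess requests can wait in a queue and all finish eventually instead of failing.

```scala
import java.util.concurrent.Executors
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// At most maxInFlight uploads run at once; the pool's internal queue is
// unbounded, so the remaining requests wait instead of being rejected.
val maxInFlight = 4
val pool = Executors.newFixedThreadPool(maxInFlight)
implicit val uploadEc: ExecutionContext = ExecutionContext.fromExecutorService(pool)

val inFlight = new AtomicInteger(0)
val peak = new AtomicInteger(0)

// Hypothetical stand-in for a real upload: records the concurrency it observes.
def fakeUpload(key: String): String = {
  val now = inFlight.incrementAndGet()
  peak.updateAndGet(p => math.max(p, now))
  Thread.sleep(5)
  inFlight.decrementAndGet()
  key
}

// 100 "simultaneous" requests: all complete, never more than maxInFlight at a time.
val results = Await.result(
  Future.traverse((1 to 100).toList)(i => Future(fakeUpload(s"file-$i"))),
  30.seconds
)
pool.shutdown()
```

The trade-off is the same as with the temporary-file approach from the question: nothing fails, but latency grows with the length of the queue.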

Apache Spark: how to cancel job in code and kill running tasks?

I am running a Spark application (version 1.6.0) on a Hadoop cluster with Yarn (version 2.6.0) in client mode. I have a piece of code that runs a long computation, and I want to kill it if it takes too long (and then run some other function instead).
Here is an example:
val conf = new SparkConf().setAppName("TIMEOUT_TEST")
val sc = new SparkContext(conf)
val lst = List(1,2,3)
// setting up an infinite action
val future = sc.parallelize(lst).map(while (true) _).collectAsync()
try {
Await.result(future, Duration(30, TimeUnit.SECONDS))
println("success!")
} catch {
case _:Throwable =>
future.cancel()
println("timeout")
}
// sleep for 1 hour to allow inspecting the application in yarn
Thread.sleep(60*60*1000)
sc.stop()
The timeout is set for 30 seconds, but of course the computation is infinite, and so Awaiting on the result of the future will throw an Exception, which will be caught and then the future will be canceled and the backup function will execute.
This all works perfectly well, except that the canceled job doesn't terminate completely: when looking at the web UI for the application, the job is marked as failed, but I can see there are still running tasks inside.
The same thing happens when I use SparkContext.cancelAllJobs or SparkContext.cancelJobGroup. The problem is that even though I manage to get on with my program, the running tasks of the canceled job are still hogging valuable resources (which will eventually slow me down to a near stop).
To sum things up: How do I kill a Spark job in a way that will also terminate all running tasks of that job? (as opposed to what happens now, which is stopping the job from running new tasks, but letting the currently running tasks finish)
UPDATE:
After a long time ignoring this problem, we found a messy but efficient little workaround. Instead of trying to kill the appropriate Spark Job/Stage from within the Spark application, we simply logged the stage ID of all active stages when the timeout occurred, and issued an HTTP GET request to the URL presented by the Spark Web UI used for killing said stages.
I don't know if this answers your question.
My need was to kill jobs that hang for too long (my jobs extract data from Oracle tables, but for some unknown reason the connection occasionally hangs forever).
After some study, I came to this solution:
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import org.apache.spark.JobExecutionStatus
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

val MAX_JOB_SECONDS = 100
val statusTracker = sc.statusTracker
val sparkListener = new SparkListener() {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    val jobId = jobStart.jobId
    // Watchdog: poll the job once per second and cancel it when the time budget runs out.
    Future {
      var c = MAX_JOB_SECONDS
      var running = true
      while (c > 0 && running) {
        Thread.sleep(1000)
        c = c - 1
        // getJobInfo returns an Option; None means the job is no longer tracked.
        running = statusTracker.getJobInfo(jobId).exists(_.status() == JobExecutionStatus.RUNNING)
      }
      if (c <= 0) sc.cancelJob(jobId)
    }
  }
}
sc.addSparkListener(sparkListener)
try {
  val df = spark.sql("SELECT * FROM VERY_BIG_TABLE") // just an example of a long-running job
  println(df.count)
} catch {
  case exc: org.apache.spark.SparkException =>
    if (exc.getMessage.contains("cancelled")) throw new Exception("Job forcibly cancelled")
    else throw exc
  case ex: Throwable =>
    println(s"Another exception: $ex")
} finally {
  sc.removeSparkListener(sparkListener)
}
For the sake of future visitors: Spark has included a task reaper since 2.0.3 (enabled with spark.task.reaper.enabled), which addresses this scenario (more or less) as a built-in solution.
Note that it can eventually kill an Executor if the task is not responsive.
Moreover, some built-in Spark data sources have been refactored to be more responsive to Spark.
For the 1.6.0 version, Zohar's solution is a "messy but efficient" one.
According to setJobGroup:
"If interruptOnCancel is set to true for the job group, then job cancellation will result in Thread.interrupt() being called on the job's executor threads."
So the anonymous function in your map must be interruptible, like this:
val future = sc.parallelize(lst).map(while (!Thread.interrupted) _).collectAsync()
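The effect of an interrupt on such a loop can be seen outside Spark with a plain thread. This is only a sketch of the mechanism, with a made-up `worker` thread standing in for an executor thread. It assumes the job group was set with interruptOnCancel = true, as in the quoted documentation.

```scala
import java.util.concurrent.CountDownLatch
import java.util.concurrent.atomic.AtomicLong

val started = new CountDownLatch(1)
val iterations = new AtomicLong(0)

// Stand-in for a task body: spins until its thread is interrupted.
val worker = new Thread(() => {
  started.countDown()
  while (!Thread.currentThread().isInterrupted) iterations.incrementAndGet()
})

worker.start()
started.await()    // make sure the loop is actually running
worker.interrupt() // what cancellation does to the executor thread when interruptOnCancel is true
worker.join(5000)
val stopped = !worker.isAlive
```

A loop that never checks the interrupt flag (such as `while (true)`) would keep running after `interrupt()`, which matches the lingering tasks described in the question.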

Gatling Feeders - creating new instances

The following code works as expected: for each iteration, the next value from valueFeed is popped and written to the output.csv file.
class TestSimulation extends Simulation {
val valueFeed = csv("input.csv")
val writer = {
val fos = new java.io.FileOutputStream("output.csv")
new java.io.PrintWriter(fos, true)
}
val scn = scenario("Test Sim")
.repeat(2) {
feed(valueFeed)
.exec(session => {
writer.println(session("value").as[String])
session
})
}
setUp(scn.inject(constantUsersPerSec(1) during (10 seconds)))
}
When the feed creation is inlined in the feed method, the behaviour is still exactly the same:
class TestSimulation extends Simulation {
val writer = {
val fos = new java.io.FileOutputStream("output.csv")
new java.io.PrintWriter(fos, true)
}
val scn = scenario("Test Sim")
.repeat(2) {
feed(csv("input.csv"))
.exec(session => {
writer.println(session("value").as[String])
session
})
}
setUp(scn.inject(constantUsersPerSec(1) during (10 seconds)))
}
Since the feed creation is not extracted, I would expect each iteration to create its own feed instance rather than use the same feed.
Why, then, does the behaviour imply that the same feed is being used, so that the first value from the input file is not always written to the output?
Example input file (data truncated, tested with more lines to prevent empty feeder exception):
value
1
2
3
4
5
Because csv(...) is in fact a FeederBuilder, which is called once to produce the feeder used within the scenario.
The Gatling DSL defines builders. These are executed only once at startup, so even when you inline the call you get a feeder shared between all users, because the same (and only) builder is used to create all the users.
If you want each user to have its own copy of the data, you can't use the .feed method, but you can get all the records and use other looping constructs to iterate through them:
val records = csv("foo.csv").records
foreach(records, "record") {
  exec(flattenMapIntoAttributes("${record}"))
}
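The "executed only once" point can be illustrated with plain Scala. The sketch below is an analogy, not Gatling's implementation: `csvLike` and `feedLike` are hypothetical stand-ins for csv(...) and feed(...).

```scala
import java.util.concurrent.atomic.AtomicInteger

val builderCalls = new AtomicInteger(0)

// Hypothetical stand-in for csv("input.csv"): each call creates a fresh feeder.
def csvLike(): Iterator[String] = {
  builderCalls.incrementAndGet()
  Iterator("1", "2", "3", "4", "5")
}

// Hypothetical stand-in for feed(...): its argument is evaluated once, when the
// scenario is built, and the returned step is what runs for every user.
def feedLike(feeder: Iterator[String]): () => String = () => feeder.next()

val step = feedLike(csvLike())         // "inlined", yet evaluated only here
val seenByUsers = List.fill(3)(step()) // three simulated users share one feeder
```

Inlining changes where the expression is written, not how often it is evaluated: the scenario definition is an ordinary value that is built once.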

continuously fetch database results with scalaz.stream

I'm new to Scala and extremely new to scalaz. Through a different Stack Overflow answer and some handholding, I was able to use scalaz.stream to implement a Process that continuously fetches Twitter API results. Now I'd like to do the same thing for the Cassandra DB where the Twitter handles are stored.
The code for fetching the twitter results is here:
def urls: Seq[(Handle,URL)] = {
Await.result(
getAll(connection).map { List =>
List.map(twitterToGet =>
(twitterToGet.handle, urlBoilerPlate + twitterToGet.handle + parameters + twitterToGet.sinceID)
)
},
5 seconds)
}
val fetchUrl = channel.lift[Task, (Handle, URL), Fetched] {
url => Task.delay {
val finalResult = callTwitter(url)
if (finalResult.tweets.nonEmpty) {
connection.updateTwitter(finalResult)
} else {
println("\n" + finalResult.handle + " does not have new tweets")
}
s"\ntwitter Fetch & database update completed"
}
}
val P = Process
val process =
(time.awakeEvery(3.second) zipWith P.emitAll(urls))((b, url) => url).
through(fetchUrl)
val fetched = process.runLog.run
fetched.foreach(println)
What I'm planning to do is use
def urls: Seq[(Handle,URL)] = {
to continuously fetch Cassandra results (with an awakeEvery) and send them off to an actor to run the above twitter fetching code.
My question is: what is the best way to implement this with scalaz.stream? Note that I'd like it to get ALL the database results, then wait for a delay before getting ALL the database results again. Should I use the same architecture as the Twitter fetching code above? If so, how would I create a channel.lift that doesn't require input? Is there a better way in scalaz.stream?
Thanks in advance
Got this working today. The cleanest way to do it would be to emit the database results as a stream and attach a sink to the end of the stream to do the twitter processing. What I actually have is a bit more complex as it retrieves the database results continuously and sends them off to an actor for the twitter processing. The style of retrieving the results follows my original code from my question:
val connection = new simpleClient(conf.getString("cassandra.node"))
implicit val threadPool = new ScheduledThreadPoolExecutor(4)
val system = ActorSystem("mySystem")
val twitterFetch = system.actorOf(Props[TwitterFetch], "twitterFetch")
def myEffect = channel.lift[Task, simpleClient, String]{
connection: simpleClient => Task.delay{
val results = Await.result(
getAll(connection).map { List =>
List.map(twitterToGet =>
(twitterToGet.handle, urlBoilerPlate + twitterToGet.handle + parameters + twitterToGet.sinceID)
)
},
5 seconds)
println("Query Successful, results= " +results +" at " + format.print(System.currentTimeMillis()))
twitterFetch ! fetched(connection, results)
s"database fetch completed"
}
}
val P = Process
val process =
(time.awakeEvery(3.second).flatMap(_ => P.emit(connection).
through(myEffect)))
val fetching = process.runLog.run
fetching.foreach(println)
Some notes:
I had asked about using channel.lift without input, but it became clear that the input should be the cassandra connection.
The line
val process =
(time.awakeEvery(3.second).flatMap(_ => P.emit(connection).
through(myEffect)))
was changed from zipWith to flatMap because I wanted to retrieve the results continuously instead of once.
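The difference can be sketched with plain Scala iterators. This is an analogy rather than scalaz.stream code: `ticks` stands in for awakeEvery, and `fetchAll` is a hypothetical stand-in for the Cassandra query.

```scala
import java.util.concurrent.atomic.AtomicInteger

val queries = new AtomicInteger(0)

// Hypothetical stand-in for the Cassandra query; counts how often it runs.
def fetchAll(): List[String] = {
  queries.incrementAndGet()
  List("handleA", "handleB")
}

// Stand-in for awakeEvery: four ticks.
def ticks: Iterator[Int] = Iterator.from(1).take(4)

// zip: fetchAll() is evaluated once, and the result ends when the shorter side runs out.
val zipped = ticks.zip(fetchAll().iterator).map { case (_, url) => url }.toList
val queriesAfterZip = queries.get

// flatMap: fetchAll() is evaluated again on every tick.
val flatMapped = ticks.flatMap(_ => fetchAll()).toList
val queriesAfterFlatMap = queries.get
```

With zipping, the database is queried once and the stream stops after the last handle. With flatMap, every tick triggers a fresh query, which is the "get ALL the results, wait, get ALL the results again" behaviour asked for in the question.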