RDD collect OOM issues - scala

I am trying to parse the input RDD into a map and broadcast it, but I am running into memory issues on EMR. Below is the code:
val data = sparkContext.textFile("inputPath")
val result = sparkContext.broadcast(data
  .map(x => {
    val row = x.split(",")
    if (row.length == 4) {
      if (row(3) == null) {
        (row(0).toString, "0")
      } else {
        (row(0).toString, row(3).toString)
      }
    } else {
      (row(0).toString, "0")
    }
  }).collect.toMap)
rdd.map(other operations).filter(x => result.value.contains(x))
It fails at the filter operation with:
22/08/30 03:52:59 ERROR Client: Application diagnostics message: User class threw exception: java.lang.OutOfMemoryError
Any insights would be helpful. The input data is less than 10 GB, and I have set spark.driver.maxResultSize=40g and spark.driver.memory=45g.
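For what it's worth, a broadcast built from collect requires the whole map to fit on the driver and then on every executor. Below is a rough, untested sketch of an alternative that keeps the lookup data distributed and filters through a join instead of collecting; otherRdd is a hypothetical stand-in for rdd.map(other operations), keyed by the same field as the lookup.

// sketch only: keep the lookup as a pair RDD and filter with an inner join
// instead of collecting the data to the driver and broadcasting it
val lookup = sparkContext.textFile("inputPath")
  .map(_.split(",", -1))
  .map(row => (row(0), if (row.length == 4 && row(3).nonEmpty) row(3) else "0"))

// keep only the keys that exist in the lookup data
val kept = otherRdd
  .map(x => (x, ()))
  .join(lookup)
  .keys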

Related

foreachPartition method of RDD is not working in GCP cluster

I am trying to upload data in a Spark job through API calls, where the API has a payload limit of 5 MB per call (a third-party API limitation). I accumulate data to build the API body up to the payload limit, in order to minimize the number of API calls. I am doing this inside the foreachPartition method of an RDD, with some log statements added for analysis. This code runs completely fine when I run the Spark job locally on my machine (the APIs get called and the data gets uploaded), but it does not behave the same way on the GCP cluster. When running the job on a GCP Dataproc cluster, the data is not uploaded through the APIs, so I believe the code inside foreachPartition is not getting called.
When running locally I can see all the log messages (1, 2, 3, 4, 5, 6, 7), but when running the job in the GCP cluster I can see only a few of them (1, 3).
Sample code is below for reference.
I would appreciate your suggestions to make it run in the GCP cluster as well.
def exportData(
    client: ApiClient,
    batchSize: Int
): Unit = {
  val exportableDataRdd = getDataToUpload() //this is rdd of type RDD[UserDataObj]
  logger.info(s"1. exportableDataRdd count:${exportableDataRdd.count}") //not a good practice to call count here but calling just for debugging
  exportableDataRdd.foreachPartition { iterator =>
    logger.info(s"2. perPartition iteration")
    perPartitionMethod(client, iterator, batchSize)
  }
  logger.info(s"3. Data export completed")
}
def perPartitionMethod(
    client: ApiClient,
    iterator: Iterator[UserDataObj],
    batchSize: Int
): Unit = {
  logger.info(s"4. Inside perPartition")
  iterator.grouped(batchSize).foreach { userDataGroup =>
    val payLoadLimit = 5000000 //5 MB
    val groupSize = userDataGroup.size
    var counter = 0
    var batchedUsersData = Seq[UserDataObj]()
    userDataGroup.map { user =>
      counter = counter + 1
      val curUsersDataSet = batchedUsersData :+ user
      val body = Map[String, Any]("data" -> curUsersDataSet.map(_.toMap))
      val apiPayload = Serialization.write(body)
      val size = apiPayload.getBytes().length
      if (size > payLoadLimit) {
        val usersToUpload = batchedUsersData
        logger.info(s"5. API called with batch size: ${usersToUpload.size}")
        uploadDataThroughAPI(usersToUpload, client) //method to upload data through API
        batchedUsersData = Seq[UserDataObj](user)
      } else {
        batchedUsersData = batchedUsersData :+ user
      }
      //upload left out data
      if (counter == groupSize && batchedUsersData.size > 0) {
        uploadDataThroughAPI(batchedUsersData, client)
        logger.info(s"6. API called with batch size: ${batchedUsersData.size}")
      }
    }
  }
  logger.info(s"7. perPartition completed")
}
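One way to tell whether the partitions are actually being processed on the cluster (as opposed to the executor-side logger output simply not appearing in the driver's log, which is where messages 1 and 3 come from) is to count the work with an accumulator, since accumulator values are reported back to the driver. A minimal, hypothetical sketch under that assumption; the names processedRecords and exportableDataRdd are illustrative only:

// debugging sketch: use an accumulator to confirm the partition code runs on executors
val processedRecords = sparkContext.longAccumulator("processedRecords")

exportableDataRdd.foreachPartition { iterator =>
  iterator.foreach { user =>
    processedRecords.add(1)
    // ... batching and API upload as in perPartitionMethod ...
  }
}

// accumulator values are read back on the driver after the action completes
logger.info(s"processed ${processedRecords.value} records across all partitions")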

MongoDB reactive template transactions

I've been using MongoDB in my open source project for more than a year now, and recently I decided to try out transactions. After writing some tests for methods that use transactions, I found that they throw some strange exceptions, and I can't figure out what the problem is. I have a delete method that uses a custom coroutine context and a mutex:
open suspend fun delete(photoInfo: PhotoInfo): Boolean {
  return withContext(coroutineContext) {
    return@withContext mutex.withLock {
      return@withLock deletePhotoInternalInTransaction(photoInfo)
    }
  }
}
It then calls a method that executes some deletion:
//FIXME: doesn't work in tests
//should be called from within locked mutex
private suspend fun deletePhotoInternalInTransaction(photoInfo: PhotoInfo): Boolean {
  check(!photoInfo.isEmpty())

  val transactionMono = template.inTransaction().execute { txTemplate ->
    return@execute photoInfoDao.deleteById(photoInfo.photoId, txTemplate)
      .flatMap { favouritedPhotoDao.deleteFavouriteByPhotoName(photoInfo.photoName, txTemplate) }
      .flatMap { reportedPhotoDao.deleteReportByPhotoName(photoInfo.photoName, txTemplate) }
      .flatMap { locationMapDao.deleteById(photoInfo.photoId, txTemplate) }
      .flatMap { galleryPhotoDao.deleteByPhotoName(photoInfo.photoName, txTemplate) }
  }.next()

  return try {
    transactionMono.awaitFirst()
    true
  } catch (error: Throwable) {
    logger.error("Could not delete photo", error)
    false
  }
}
Here I have five operations that delete data from five different documents. Here is an example of one of the operations:
open fun deleteById(photoId: Long, template: ReactiveMongoOperations = reactiveTemplate): Mono<Boolean> {
  val query = Query()
    .addCriteria(Criteria.where(PhotoInfo.Mongo.Field.PHOTO_ID).`is`(photoId))

  return template.remove(query, PhotoInfo::class.java)
    .map { deletionResult -> deletionResult.wasAcknowledged() }
    .doOnError { error -> logger.error("DB error", error) }
    .onErrorReturn(false)
}
I want the whole operation to fail if any of the deletions fails, so I use a transaction.
Then I have some tests for a handler that uses this delete method:
@Test
fun `photo should not be uploaded if could not enqueue static map downloading request`() {
  val webClient = getWebTestClient()
  val userId = "1234235236"
  val token = "fwerwe"

  runBlocking {
    Mockito.`when`(remoteAddressExtractorService.extractRemoteAddress(any())).thenReturn(ipAddress)
    Mockito.`when`(banListRepository.isBanned(Mockito.anyString())).thenReturn(false)
    Mockito.`when`(userInfoRepository.accountExists(userId)).thenReturn(true)
    Mockito.`when`(userInfoRepository.getFirebaseToken(Mockito.anyString())).thenReturn(token)
    Mockito.`when`(staticMapDownloaderService.enqueue(Mockito.anyLong())).thenReturn(false)
  }

  kotlin.run {
    val packet = UploadPhotoPacket(33.4, 55.2, userId, true)
    val multipartData = createTestMultipartFile(PHOTO1, packet)

    val content = webClient
      .post()
      .uri("/v1/api/upload")
      .contentType(MediaType.MULTIPART_FORM_DATA)
      .body(BodyInserters.fromMultipartData(multipartData))
      .exchange()
      .expectStatus().is5xxServerError
      .expectBody()

    val response = fromBodyContent<UploadPhotoResponse>(content)
    assertEquals(ErrorCode.DatabaseError.value, response.errorCode)
    assertEquals(0, findAllFiles().size)

    runBlocking {
      assertEquals(0, galleryPhotoDao.testFindAll().awaitFirst().size)
      assertEquals(0, photoInfoDao.testFindAll().awaitFirst().size)
    }
  }
}
@Test
fun `photo should not be uploaded when resizeAndSavePhotos throws an exception`() {
  val webClient = getWebTestClient()
  val userId = "1234235236"
  val token = "fwerwe"

  runBlocking {
    Mockito.`when`(remoteAddressExtractorService.extractRemoteAddress(any())).thenReturn(ipAddress)
    Mockito.`when`(banListRepository.isBanned(Mockito.anyString())).thenReturn(false)
    Mockito.`when`(userInfoRepository.accountExists(userId)).thenReturn(true)
    Mockito.`when`(userInfoRepository.getFirebaseToken(Mockito.anyString())).thenReturn(token)
    Mockito.`when`(staticMapDownloaderService.enqueue(Mockito.anyLong())).thenReturn(true)
    Mockito.doThrow(IOException("BAM"))
      .`when`(diskManipulationService).resizeAndSavePhotos(any(), any())
  }

  kotlin.run {
    val packet = UploadPhotoPacket(33.4, 55.2, userId, true)
    val multipartData = createTestMultipartFile(PHOTO1, packet)

    val content = webClient
      .post()
      .uri("/v1/api/upload")
      .contentType(MediaType.MULTIPART_FORM_DATA)
      .body(BodyInserters.fromMultipartData(multipartData))
      .exchange()
      .expectStatus().is5xxServerError
      .expectBody()

    val response = fromBodyContent<UploadPhotoResponse>(content)
    assertEquals(ErrorCode.ServerResizeError.value, response.errorCode)
    assertEquals(0, findAllFiles().size)

    runBlocking {
      assertEquals(0, galleryPhotoDao.testFindAll().awaitFirst().size)
      assertEquals(0, photoInfoDao.testFindAll().awaitFirst().size)
    }
  }
}
@Test
fun `photo should not be uploaded when copyDataBuffersToFile throws an exception`() {
  val webClient = getWebTestClient()
  val userId = "1234235236"
  val token = "fwerwe"

  runBlocking {
    Mockito.`when`(remoteAddressExtractorService.extractRemoteAddress(any())).thenReturn(ipAddress)
    Mockito.`when`(banListRepository.isBanned(Mockito.anyString())).thenReturn(false)
    Mockito.`when`(userInfoRepository.accountExists(userId)).thenReturn(true)
    Mockito.`when`(userInfoRepository.getFirebaseToken(Mockito.anyString())).thenReturn(token)
    Mockito.`when`(staticMapDownloaderService.enqueue(Mockito.anyLong())).thenReturn(true)
    Mockito.doThrow(IOException("BAM"))
      .`when`(diskManipulationService).copyDataBuffersToFile(Mockito.anyList(), any())
  }

  kotlin.run {
    val packet = UploadPhotoPacket(33.4, 55.2, userId, true)
    val multipartData = createTestMultipartFile(PHOTO1, packet)

    val content = webClient
      .post()
      .uri("/v1/api/upload")
      .contentType(MediaType.MULTIPART_FORM_DATA)
      .body(BodyInserters.fromMultipartData(multipartData))
      .exchange()
      .expectStatus().is5xxServerError
      .expectBody()

    val response = fromBodyContent<UploadPhotoResponse>(content)
    assertEquals(ErrorCode.ServerDiskError.value, response.errorCode)
    assertEquals(0, findAllFiles().size)

    runBlocking {
      assertEquals(0, galleryPhotoDao.testFindAll().awaitFirst().size)
      assertEquals(0, photoInfoDao.testFindAll().awaitFirst().size)
    }
  }
}
Usually the first test passes, and the following two fail with the following exception:
17:09:01.228 [Thread-17] ERROR com.kirakishou.photoexchange.database.dao.PhotoInfoDao - DB error
org.springframework.data.mongodb.UncategorizedMongoDbException: Command failed with error 24 (LockTimeout): 'Unable to acquire lock '{8368122972467948263: Database, 1450593944826866407}' within a max lock request timeout of '5ms' milliseconds.' on server 192.168.99.100:27017.
And then:
Caused by: com.mongodb.MongoCommandException: Command failed with error 246 (SnapshotUnavailable): 'Unable to read from a snapshot due to pending collection catalog changes; please retry the operation. Snapshot timestamp is Timestamp(1545661357, 23). Collection minimum is Timestamp(1545661357, 24)' on server 192.168.99.100:27017.
And:
17:22:36.951 [Thread-16] WARN reactor.core.publisher.FluxUsingWhen - Async resource cleanup failed after cancel
com.mongodb.MongoCommandException: Command failed with error 251 (NoSuchTransaction): 'Transaction 1 has been aborted.' on server 192.168.99.100:27017.
Sometimes two of them pass and the last one fails.
It looks like only the first transaction succeeds and any following ones fail, and I guess the reason is that I have to manually close it (or the ClientSession). But I can't find any info on how to close transactions/sessions. Here is one of the few examples I could find where transactions are used with the reactive template, and I don't see them doing anything additional to close the transaction/session.
Or maybe it's because I'm mocking a method to throw an exception inside the transaction? Maybe it's not being closed in this case?
The client sessions/transactions are closed properly; however, it appears that the index creation in the tests acquires a global lock, which causes the next transaction's lock request to fall behind and wait before timing out.
Basically, you have to manage your index creation so that it doesn't interfere with transactions from the client.
One quick fix would be to increase the lock timeout by running the command below in the mongo shell:
db.adminCommand( { setParameter: 1, maxTransactionLockRequestTimeoutMillis: 50 } )
In production you can look at the transaction error label and retry the operation.
More here: https://docs.mongodb.com/manual/core/transactions-production-consideration/#pending-ddl-operations-and-transactions
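For illustration, a rough sketch of that retry-on-error-label idea, using the MongoDB Java driver's error labels on MongoException. It is written in Scala to match the other snippets in this collection, while the project in the question is Kotlin, so treat it purely as a sketch of the pattern; runTransaction is a hypothetical stand-in for whatever executes your transactional work.

import com.mongodb.MongoException

// sketch: retry the transactional work while the server reports the
// TransientTransactionError label (such as the lock timeout above), up to maxRetries times
def withTransactionRetry[T](maxRetries: Int)(runTransaction: () => T): T =
  try runTransaction()
  catch {
    case e: MongoException
        if e.hasErrorLabel(MongoException.TRANSIENT_TRANSACTION_ERROR_LABEL) && maxRetries > 0 =>
      withTransactionRetry(maxRetries - 1)(runTransaction)
  }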
You could check the connection options according to your driver:
val connection = MongoConnection(List("localhost"))
val db = connection.database("plugin")
...
connection.askClose()
You could look into the askClose() method; hope this is helpful.

ScalaTest asserting multiple futures using AsyncFunSuiteLike

I've been trying to write a test that uses a mock HTTP server for the responses and a function that returns a Future[String], or an exception if the HTTP server response isn't 200.
I'm trying to write the test without using Awaits, using AsyncFunSuiteLike instead.
However, the following test seems impossible to get right without doing it synchronously:
test("Error responses") {
  Future.sequence {
    NanoHTTPD.Response.Status.values().toList.filter(status => status.getRequestStatus >= 400).map {
      status => {
        httpService.setStatusCode(status)
        val responseBody = s"Request failed with status $status"
        httpService.setResponseContent(responseBody)
        val errorMessage = s"Error response (${status.getRequestStatus}) from http service: $responseBody"

        recoverToExceptionIf[ServiceException] {
          myObject.httpCall("123456")
        }.map {
          ex => assert(ex.getMessage === errorMessage)
        }
      }
    }
  }.map(assertions => assert(assertions.forall(_ == Succeeded)))
}
Basically, the problem is that by the time the Futures run, NanoHTTPD has been set to the last value set in the map, so all the ex.getMessage values are the same. If I run those status codes one by one I do get the desired results, but is there a way to perform all of this in one single async test?
From the looks of it, NanoHTTPD is stateful, so you have a race between the .set... calls and the .httpCall.
If you can spin up a new httpService within each Future, then you should be able to parallelize the tests (unless the state in question would be shared across instances, in which case you're likely out of luck).
So you'd have something like (replace Status with the type of status in your code and HTTPService with the type of httpService):
// following code composed on the fly and not run through the compiler...
def spinUpHTTPService(status: Status, body: String): Future[HTTPService] = Future {
  // insert the code outside of the test which creates httpService
  httpService.setStatusCode(status)
  httpService.setResponseContent(body)
  httpService
}

test("Error responses") {
  Future.sequence(
    NanoHTTPD.Response.Status.values().toList.filter(status => status.getRequestStatus >= 400).map { status =>
      val responseBody = s"Request failed with status $status"
      spinUpHTTPService(status, responseBody)
        .flatMap { httpService =>
          val errorMessage = s"Error response (${status.getRequestStatus}) from http service: $responseBody"
          recoverToExceptionIf[ServiceException] {
            myObject.httpCall("123456")
          } map {
            ex => assert(ex.getMessage === errorMessage)
          }
        } // Future.flatMap
    } // List.map
  ).map { assertions => assert(assertions.forall(_ == Succeeded)) }
}

OrientDB multithread Concurrent Modification Exception and other errors

I am writing an application that writes data to a graph in OrientDB (v 2.2.3). The graph is something like the following: I have threads that add vertices to C vertices, and each C vertex has an independent thread which is responsible for adding D vertices with their edges.
Each thread works in a separate transaction. I have been getting various errors and exceptions like the following:
com.orientechnologies.orient.core.exception.OStorageException: Error on commit
at com.orientechnologies.orient.client.remote.OStorageRemote.baseNetworkOperation(OStorageRemote.java:253)
at com.orientechnologies.orient.client.remote.OStorageRemote.networkOperation(OStorageRemote.java:189)
at com.orientechnologies.orient.client.remote.OStorageRemote.commit(OStorageRemote.java:1271)
at com.orientechnologies.orient.core.tx.OTransactionOptimistic.doCommit(OTransactionOptimistic.java:549)
at com.orientechnologies.orient.core.tx.OTransactionOptimistic.commit(OTransactionOptimistic.java:109)
at com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx.commit(ODatabaseDocumentTx.java:2665)
at com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx.commit(ODatabaseDocumentTx.java:2634)
at com.tinkerpop.blueprints.impls.orient.OrientTransactionalGraph.commit(OrientTransactionalGraph.java:175)
at JSONManager$.commitGrap2(JSONManager.scala:371)
at JSONManager$$anonfun$main$2$$anon$1.run(JSONManager.scala:87)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
.....
Caused by: java.util.ConcurrentModificationException
at java.util.LinkedHashMap$LinkedHashIterator.nextNode(LinkedHashMap.java:711)
at java.util.LinkedHashMap$LinkedValueIterator.next(LinkedHashMap.java:739)
at com.orientechnologies.orient.client.remote.OStorageRemote$28.execute(OStorageRemote.java:1284)
at com.orientechnologies.orient.client.remote.OStorageRemote$28.execute(OStorageRemote.java:1271)
at com.orientechnologies.orient.client.remote.OStorageRemote$2.execute(OStorageRemote.java:192)
at com.orientechnologies.orient.client.remote.OStorageRemote.baseNetworkOperation(OStorageRemote.java:224)
... 12 more
UPDATE code:
val t: Runnable = new Runnable {
  override def run(): Unit = {
    graph = factory.getTx
    saveDUnits(dUnit, graph)
    commitGrap(graph)
    graph.shutdown()
  }
}
pool.execute(t)
def commitGrap(graph: OrientGraph): Unit = {
  var retryCount = 0
  while (retryCount < 10) {
    try {
      graph.commit()
      retryCount = 11
    } catch {
      case e: Exception =>
        println("Commit Error")
        e.printStackTrace()
        var sleepTime = 50
        if (retryCount > 5) {
          sleepTime = 6000
        }
        Thread.sleep(sleepTime)
    }
    retryCount = retryCount + 1
  }
}
Finally I found the mistake I made: the problem was in creating the OrientGraphFactory instance. The non-thread-safe factory is created like the following:
var factory: OrientGraphFactory = new OrientGraphFactory("remote:106.140.20.233/test", "root", "123")
The thread-safe factory is created like the following:
var factory: OrientGraphFactory = new OrientGraphFactory("remote:106.140.20.233/test", "root", "123").setupPool(1, 20)
I had missed adding .setupPool(1, 20).
That's it.
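To make the fix concrete, here is a minimal sketch of how the pooled factory would typically be shared across the worker threads, with each thread borrowing and releasing its own transactional graph. saveDUnits, dUnit and pool are the same placeholders as in the code above; this is an illustration, not tested code.

// one pooled, shared factory for the whole application (thread safe)
val factory: OrientGraphFactory =
  new OrientGraphFactory("remote:106.140.20.233/test", "root", "123").setupPool(1, 20)

// each worker borrows a graph from the pool, commits, and releases it
val worker: Runnable = new Runnable {
  override def run(): Unit = {
    val graph = factory.getTx // borrow a transactional graph from the pool
    try {
      saveDUnits(dUnit, graph)
      graph.commit()
    } finally {
      graph.shutdown() // return the graph/connection to the pool
    }
  }
}
pool.execute(worker)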

continuously fetch database results with scalaz.stream

I'm new to Scala and extremely new to scalaz. Through a different Stack Overflow answer and some handholding, I was able to use scalaz.stream to implement a Process that would continuously fetch Twitter API results. Now I'd like to do the same thing for the Cassandra DB where the Twitter handles are stored.
The code for fetching the Twitter results is here:
def urls: Seq[(Handle, URL)] = {
  Await.result(
    getAll(connection).map { list =>
      list.map(twitterToGet =>
        (twitterToGet.handle, urlBoilerPlate + twitterToGet.handle + parameters + twitterToGet.sinceID)
      )
    },
    5 seconds)
}

val fetchUrl = channel.lift[Task, (Handle, URL), Fetched] {
  url => Task.delay {
    val finalResult = callTwitter(url)
    if (finalResult.tweets.nonEmpty) {
      connection.updateTwitter(finalResult)
    } else {
      println("\n" + finalResult.handle + " does not have new tweets")
    }
    s"\ntwitter Fetch & database update completed"
  }
}

val P = Process

val process =
  (time.awakeEvery(3.second) zipWith P.emitAll(urls))((b, url) => url).
    through(fetchUrl)

val fetched = process.runLog.run
fetched.foreach(println)
What I'm planning to do is use
def urls: Seq[(Handle, URL)] = {
to continuously fetch Cassandra results (with an awakeEvery) and send them off to an actor that runs the Twitter-fetching code above.
My question is: what is the best way to implement this with scalaz.stream? Note that I'd like it to get ALL the database results, then have a delay before getting ALL the database results again. Should I use the same architecture as the Twitter-fetching code above? If so, how would I create a channel.lift that doesn't require input? Is there a better way in scalaz.stream?
Thanks in advance.
Got this working today. The cleanest way to do it would be to emit the database results as a stream and attach a sink to the end of the stream to do the Twitter processing (a rough sketch of that shape is at the end of this answer). What I actually have is a bit more complex, as it retrieves the database results continuously and sends them off to an actor for the Twitter processing. The style of retrieving the results follows my original code from the question:
val connection = new simpleClient(conf.getString("cassandra.node"))
implicit val threadPool = new ScheduledThreadPoolExecutor(4)
val system = ActorSystem("mySystem")
val twitterFetch = system.actorOf(Props[TwitterFetch], "twitterFetch")

def myEffect = channel.lift[Task, simpleClient, String] {
  connection: simpleClient => Task.delay {
    val results = Await.result(
      getAll(connection).map { list =>
        list.map(twitterToGet =>
          (twitterToGet.handle, urlBoilerPlate + twitterToGet.handle + parameters + twitterToGet.sinceID)
        )
      },
      5 seconds)

    println("Query Successful, results= " + results + " at " + format.print(System.currentTimeMillis()))
    twitterFetch ! fetched(connection, results)
    s"database fetch completed"
  }
}

val P = Process

val process =
  (time.awakeEvery(3.second).flatMap(_ => P.emit(connection).
    through(myEffect)))

val fetching = process.runLog.run
fetching.foreach(println)
Some notes:
I had asked about using channel.lift without input, but it became clear that the input should be the Cassandra connection.
The line
val process =
  (time.awakeEvery(3.second).flatMap(_ => P.emit(connection).
    through(myEffect)))
changed from zipWith to flatMap because I wanted to retrieve the results continuously instead of once.
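For reference, a rough sketch of the sink-based shape mentioned at the top of this answer: emit the query results as a stream and attach a sink that does the Twitter processing. This is untested, assumes the same scalaz.stream version and helpers (getAll, connection, urlBoilerPlate, parameters), and processTwitter is a hypothetical stand-in for the Twitter-fetching code.

// sketch only: stream the database results and push each batch into a sink
val fetchHandles: Process[Task, Seq[(Handle, URL)]] =
  time.awakeEvery(3.second).flatMap { _ =>
    Process.eval(Task.delay {
      Await.result(
        getAll(connection).map { list =>
          list.map(t => (t.handle, urlBoilerPlate + t.handle + parameters + t.sinceID))
        },
        5 seconds)
    })
  }

// a Sink is just a stream of effectful functions; Process.constant builds one
val twitterSink: Sink[Task, Seq[(Handle, URL)]] =
  Process.constant { (results: Seq[(Handle, URL)]) =>
    Task.delay(results.foreach(processTwitter))
  }

fetchHandles.to(twitterSink).run.run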