Building an inverted index exceeds the Java heap size - Scala

This might be a very special case, but after banging my head against it for a while I thought I'd get help from the Stack Overflow community.
I am building an inverted index for a large data set (one day's worth of data from a large system). The inverted index is built as a map-reduce job on Hadoop, and the index itself is built with Scala. The structure of the inverted index is as follows: {key:"New", ProductID:[1,2,3,4,5,...]}; these records get written into Avro files.
During this process I run into a Java heap size issue. I think the reason is that terms like "New", as shown above, contain a large number of product IDs. I have a rough idea where the issue could be in my Scala code:
// requires: import scala.collection.JavaConverters._ (for .asJava)
def toIndexedRecord(ids: List[Long], token: String): IndexRecord = {
  val javaList = ids.map(l => l: java.lang.Long).asJava // need to convert from Scala Long to java.lang.Long
  new IndexRecord(token, javaList)
}
And this is how I use this method (it gets used in many locations, but the same code structure and logic get used):
val titles = textPipeDump.map(vf => (vf.itemId, normalizer.customNormalizer(vf.title + " " + vf.subTitle).trim))
  .flatMap {
    case (id, title) =>
      val ss = title.split("\\s+")
      ss.map(word => (word, List(id)))
  }
  .filter(f => f._2.nonEmpty)
  .group
  .sum
  .map {
    case (token, ids) =>
      toIndexedRecord(ids, token)
  }
textPipeDump is a Scalding MultipleTextLineFiles object:
case class MultipleTextLineFiles(p : String*) extends FixedPathSource(p:_*) with TextLineScheme
I have a case class to split and grab the fields that I want from that text line, and that's the object ss.
Here is my stack trace:
Exception in thread "IPC Client (47) connection to /127.0.0.1:55977 from job_201306241658_232590" java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.io.IOUtils.closeStream(IOUtils.java:226)
at org.apache.hadoop.ipc.Client$Connection.close(Client.java:903)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:800)
28079664 [main] ERROR cascading.flow.stream.TrapHandler - caught Throwable, no trap available, rethrowing
cascading.pipe.OperatorException: [WritableSequenceFile(h...][com.twitter.scalding.GroupBuilder$$anonfun$1.apply(GroupBuilder.scala:189)] operator Every failed executing operation: MRMAggregator[decl:'value']
at cascading.flow.stream.AggregatorEveryStage.receive(AggregatorEveryStage.java:136)
at cascading.flow.stream.AggregatorEveryStage.receive(AggregatorEveryStage.java:39)
at cascading.flow.stream.OpenReducingDuct.receive(OpenReducingDuct.java:49)
at cascading.flow.stream.OpenReducingDuct.receive(OpenReducingDuct.java:28)
at cascading.flow.hadoop.stream.HadoopGroupGate.run(HadoopGroupGate.java:90)
at cascading.flow.hadoop.FlowReducer.reduce(FlowReducer.java:133)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:520)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:421)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1178)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.OutOfMemoryError: Java heap space
at scala.collection.mutable.ListBuffer.$plus$eq(ListBuffer.scala:168)
at scala.collection.mutable.ListBuffer.$plus$eq(ListBuffer.scala:45)
at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:176)
at scala.collection.immutable.List.$colon$colon$colon(List.scala:127)
at scala.collection.immutable.List.$plus$plus(List.scala:193)
at com.twitter.algebird.ListMonoid.plus(Monoid.scala:86)
at com.twitter.algebird.ListMonoid.plus(Monoid.scala:84)
at com.twitter.scalding.KeyedList$$anonfun$sum$1.apply(TypedPipe.scala:264)
at com.twitter.scalding.MRMAggregator.aggregate(Operations.scala:279)
at cascading.flow.stream.AggregatorEveryStage.receive(AggregatorEveryStage.java:128)
... 12 more
When I execute the map-reduce job on a small data set I don't get the error. That means that as the data grows, the number of items/product IDs that I index for words like "New" or "old" gets bigger, which causes the heap size to overflow.
So, the question is: how can I avoid the Java heap size overflow and accomplish this task?

Related

Get a spark Column from a spark Row

I am new to Scala and Spark, and so I am struggling with a map function I am trying to create.
The map function over the DataFrame works on a Row (org.apache.spark.sql.Row).
I have been loosely following this article.
val rddWithExceptionHandling = filterValueDF.rdd.map { row: Row =>
  val parsed = Try(from_avro(???, currentValueSchema.value, fromAvroOptions)) match {
    case Success(parsedValue) => List(parsedValue, null)
    case Failure(ex) => List(null, ex.toString)
  }
  Row.fromSeq(row.toSeq.toList ++ parsed)
}
The from_avro function wants to accept a Column (org.apache.spark.sql.Column), but I don't see a way in the docs to get a Column from a Row.
I am fully open to the idea that I may be doing this whole thing wrong.
Ultimately my goal is to parse the bytes coming in from a Structured Stream.
Parsed records get written to Delta Table A and the failed records to another Delta Table B.
For context the source table looks as follows:
Edit - from_avro returning null on "bad record"
There have been a few comments saying that from_avro returns null if it fails to parse a "bad record". By default, from_avro uses mode FAILFAST, which throws an exception if parsing fails. If one sets the mode to PERMISSIVE, an object in the shape of the schema is returned, but with all properties being null (also not particularly useful...). Link to the Apache Avro Data Source Guide - Spark 3.1.1 Documentation.
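For reference, a minimal sketch of how the mode could be passed, assuming the open-source org.apache.spark.sql.avro.functions.from_avro overload that takes an options map (fixedValue and currentValueSchema come from the question's code; the SparkSession named spark is an assumption):

import scala.collection.JavaConverters._
import org.apache.spark.sql.avro.functions.from_avro
import spark.implicits._ // assumes a SparkSession named `spark`, for the $"..." column syntax

// "FAILFAST" (the default) throws on a bad record; "PERMISSIVE" yields an all-null struct instead
val fromAvroOptions = Map("mode" -> "PERMISSIVE").asJava

val parsedValueCol =
  from_avro($"fixedValue", currentValueSchema.value, fromAvroOptions).as("parsedValue")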
Here is my original command:
val parsedDf = filterValueDF.select($"topic",
  $"partition",
  $"offset",
  $"timestamp",
  $"timestampType",
  $"valueSchemaId",
  from_avro($"fixedValue", currentValueSchema.value, fromAvroOptions).as('parsedValue))
If there are ANY bad rows the job is aborted with org.apache.spark.SparkException: Job aborted.
A snippet of the log of the exception:
Caused by: org.apache.spark.SparkException: Malformed records are detected in record parsing. Current parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'.
at org.apache.spark.sql.avro.AvroDataToCatalyst.nullSafeEval(AvroDataToCatalyst.scala:111)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:732)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$2(FileFormatWriter.scala:291)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1615)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:300)
... 10 more
Suppressed: java.lang.NullPointerException
at shaded.databricks.org.apache.hadoop.fs.azure.NativeAzureFileSystem$NativeAzureFsOutputStream.write(NativeAzureFileSystem.java:1099)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at org.apache.parquet.hadoop.util.HadoopPositionOutputStream.write(HadoopPositionOutputStream.java:50)
at shaded.parquet.org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:145)
at shaded.parquet.org.apache.thrift.transport.TTransport.write(TTransport.java:107)
at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.writeByteDirect(TCompactProtocol.java:482)
at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.writeByteDirect(TCompactProtocol.java:489)
at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.writeFieldBeginInternal(TCompactProtocol.java:252)
at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.writeFieldBegin(TCompactProtocol.java:234)
at org.apache.parquet.format.InterningProtocol.writeFieldBegin(InterningProtocol.java:74)
at org.apache.parquet.format.FileMetaData$FileMetaDataStandardScheme.write(FileMetaData.java:1184)
at org.apache.parquet.format.FileMetaData$FileMetaDataStandardScheme.write(FileMetaData.java:1051)
at org.apache.parquet.format.FileMetaData.write(FileMetaData.java:949)
at org.apache.parquet.format.Util.write(Util.java:222)
at org.apache.parquet.format.Util.writeFileMetaData(Util.java:69)
at org.apache.parquet.hadoop.ParquetFileWriter.serializeFooter(ParquetFileWriter.java:757)
at org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:750)
at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:135)
at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:165)
at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseResources(FileFormatDataWriter.scala:58)
at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.abort(FileFormatDataWriter.scala:84)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$3(FileFormatWriter.scala:297)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1626)
... 11 more
Caused by: java.lang.ArithmeticException: Unscaled value too large for precision
at org.apache.spark.sql.types.Decimal.set(Decimal.scala:83)
at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:577)
at org.apache.spark.sql.avro.AvroDeserializer.createDecimal(AvroDeserializer.scala:308)
at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$newWriter$16(AvroDeserializer.scala:177)
at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$newWriter$16$adapted(AvroDeserializer.scala:174)
at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$1(AvroDeserializer.scala:336)
at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$1$adapted(AvroDeserializer.scala:332)
at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$2(AvroDeserializer.scala:354)
at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$2$adapted(AvroDeserializer.scala:351)
at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$converter$3(AvroDeserializer.scala:75)
at org.apache.spark.sql.avro.AvroDeserializer.deserialize(AvroDeserializer.scala:89)
at org.apache.spark.sql.avro.AvroDataToCatalyst.nullSafeEval(AvroDataToCatalyst.scala:101)
... 16 more
In order to get a specific column from the Row object you can use either row.get(i) with the index, or row.getAs[T]("columnName") with the column name. Here you can check the details of the Row class.
Then your code would look like this:
val rddWithExceptionHandling = filterValueDF.rdd.map { row: Row =>
  val binaryFixedValue = row.getSeq[Byte](6) // or row.getAs[Seq[Byte]]("fixedValue")
  val parsed = Try(from_avro(binaryFixedValue, currentValueSchema.value, fromAvroOptions)) match {
    case Success(parsedValue) => List(parsedValue, null)
    case Failure(ex) => List(null, ex.toString)
  }
  Row.fromSeq(row.toSeq.toList ++ parsed)
}
Although in your case you don't really need to go into the map function at all, because inside it you have to work with primitive Scala types, whereas from_avro works with the DataFrame API. This is the reason you can't call from_avro directly from map: instances of the Column class can be used only in combination with the DataFrame API, i.e. df.select($"c1"), where c1 is an instance of Column. In order to use from_avro as you initially intended, just type:
filterValueDF.select(from_avro($"fixedValue", currentValueSchema))
As @mike already mentioned, if from_avro fails to parse the Avro content it will return null. Finally, if you want to separate the succeeded rows from the failed ones, you could do something like:
val includingFailuresDf = filterValueDF.select(
    from_avro($"fixedValue", currentValueSchema) as "avro_res")
  .withColumn("failed", $"avro_res".isNull)

val successDf = includingFailuresDf.where($"failed" === false)
val failedDf = includingFailuresDf.where($"failed" === true)
Please be aware that the code was not tested.
From what I understand you just need to fetch a column from a row. You can probably do that by getting the column value at a specific index using row.get().

How to avoid OutOfMemoryError in a small ArrayBuffer in the scope of a small function?

The function scanFolder() was running, but sometimes the exception below is produced.
object MyClass {
  // ... etc
  val fs = FileSystem.get(new Configuration())
  // ... etc
  val dbs = scanFolder(warehouse)
  val dbs_prod = dbs.filter( s => db_regex.findFirstIn(s).isDefined )
  for (db <- dbs_prod)
    for (t <- scanFolder(db)) {
      var parts = 0; var size = 0L
      fs.listStatus( new Path(t) ).foreach( p => {
        parts = parts + 1
        try { // can lose partition-file during loop, producing "file not found"
          size = size + fs.getContentSummary(p.getPath).getLength
        } catch { case _: Throwable => }
      }) // p loop, partitions
      allVals.append( s"('${cutPath(db)}','${cutPath(t)}',${parts},${size},'$dbHoje')" )
      if (trStrange_count > 0) contaEstranhos += 1
    }

  def scanFolder(thePath: String, toCut: Boolean = false): ArrayBuffer[String] = {
    val lst = ArrayBuffer[String]()
    fs.listStatus( new Path(thePath) ).foreach(
      x => lst.append( cutPath(x.getPath.toString, toCut) )
    )
    lst.sorted
  }
}
The error:
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
at java.lang.StringBuilder.append(StringBuilder.java:136)
at org.apache.hadoop.fs.Path.<init>(Path.java:109)
at org.apache.hadoop.fs.Path.<init>(Path.java:93)
...
(edit after more tests)
I am using Scala v2.11 (and save the data after the loops using Spark v2.2).
I changed the code, eliminating all scanFolder() calls, to avoid the use of ArrayBuffer. It now uses the iterator fs.listStatus( new Path(x) ).foreach( ...code... ) directly in the second loop.
... The program ran for ~30 minutes, producing messages like:
Exception in thread "LeaseRenewer:spdq#TLVBRPRDK" java.lang.OutOfMemoryError: Java heap space
Exception in thread "refresh progress" java.lang.OutOfMemoryError: Java heap space
Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: Java heap space
Exception in thread "dispatcher-event-loop-23" java.lang.OutOfMemoryError: Java heap space
Exception in thread "dispatcher-event-loop-39" java.lang.OutOfMemoryError: Java heap space
Final error message, stopping the program:
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "SparkListenerBus"
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
19/10/16 18:42:14 WARN DefaultPromise: An exception was thrown by org.apache.spark.network.server.TransportRequestHandler$$Lambda$16/773004452.operationComplete()
java.lang.OutOfMemoryError: Java heap space
19/10/16 18:42:14 WARN DefaultChannelPipeline: An exception 'java.lang.OutOfMemoryError: Java heap space' [enable DEBUG level for full stacktrace] was thrown by a user handler's exceptionCaught() method while handling the following exception:
java.lang.OutOfMemoryError: Java heap space
I don't have the answer to your specific question. But I'd like to give some recommendations which could be helpful:
prefer a stream style of processing, i.e. avoid loops. Use map, filter, fold, etc. Don't use global state like lst and allVals in your code snippet (see the sketch after this list).
avoid unnecessary sorting. scanFolder ends with sorting for no clear reason.
consider adding more heap memory for your task; this will reduce GC pressure.
use a JVM profiler to narrow down the specific piece of code greedy for memory allocations. In your case you won't need that, as it is pretty obvious. But in trickier cases it can help a lot.
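For illustration, a minimal, untested sketch of that stream style applied to the question's loop (it reuses fs, warehouse, db_regex, cutPath and dbHoje from the question; the ls helper and the error handling are assumptions, not the answerer's code):

import org.apache.hadoop.fs.Path

// List the children of a path as strings (no mutable buffer, no sorting).
def ls(path: String): Seq[String] =
  fs.listStatus(new Path(path)).map(_.getPath.toString).toSeq

// Build all the rows in one expression instead of appending to a global ArrayBuffer.
val allVals: Seq[String] =
  ls(warehouse)
    .filter(db => db_regex.findFirstIn(db).isDefined)
    .flatMap { db =>
      ls(db).map { t =>
        val partitions = fs.listStatus(new Path(t))
        val size = partitions.map { p =>
          try fs.getContentSummary(p.getPath).getLength
          catch { case _: Throwable => 0L } // partition file may disappear during the loop
        }.sum
        s"('${cutPath(db)}','${cutPath(t)}',${partitions.length},$size,'$dbHoje')"
      }
    }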
After @SergeyRomanovsky's good clues I solved the problem. Here is a fragment that I used first with spark-shell (sshell in my terminal):
Add memory with the most popular directives: sshell --driver-memory 12G --executor-memory 24G
Remove the innermost (and problematic) loop, reducing it to parts = fs.listStatus( new Path(t) ).length and enclosing it in a try block.
Add one more try block to run the innermost loop after the .length try succeeds.
Reduce the ArrayBuffer[] variables to a minimum, removing the old scanFolder().
Complete fragment:
// ... val allVals = ArrayBuffer[String]()
// for loop:
var parts = -1
var size = -1L
try { // can lose partition-file during loop, producing "file not found"
  val pp = fs.listStatus( new Path(t) )
  parts = pp.length
  pp.foreach( p => {
    try { // timeouts and other filesystem bugs
      size = size + fs.getContentSummary(p.getPath).getLength
    } catch { case _: Throwable => }
  }) // p loop, partitions
} catch { case _: Throwable => }
allVals.append( s"('${dbCut}','${cutPath(t)}',$parts,$size,'$dbHoje')" )
PS: for item 1, when setting the memory in code rather than on the command line, use SparkSession.builder()... .config("spark.executor.memory", "24G").config("spark.driver.memory", "12G")
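A minimal sketch of that programmatic configuration (the application name is an assumption; note that spark.driver.memory generally has to be set before the driver JVM starts, e.g. via spark-submit, so the builder setting may not take effect in every deployment):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("warehouse-scan")              // assumed name, not from the original post
  .config("spark.driver.memory", "12G")   // mirrors --driver-memory 12G
  .config("spark.executor.memory", "24G") // mirrors --executor-memory 24G
  .getOrCreate()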

Next steps for debugging lmdbjni access violation

I am using this fork of LMDBjni: https://github.com/deephacks/lmdbjni to form the backend of a medium-sized database project in Scala.
I've been hitting an EXCEPTION_ACCESS_VIOLATION (0xc0000005) in the JNI code for this LMDB, and would like to know whether there is anything obvious I'm doing wrong, or what the next steps for debugging should be. I'm not exactly sure what I'm looking for, so I'm going to list as much information about what's happening as I can and hope the symptoms make sense to somebody.
The access violation occurs in a Database with a single key mapping to a single 8-byte value, on approximately the 4000th access to that database (the exact number appears to be the same with each run), suggesting that this is a deterministic problem.
I believe I only have one thread accessing the database at a time, and regardless, in my understanding, as the operation is wrapped in a transaction, concurrent accesses should not matter anyway.
By looking through stack traces and printing values, I've traced the issue to this generic construction I wrote for building transactions.
My code that causes the issue is below; the crash occurs in the marked db.get() call:
def transactionalGetAndSet[A](
    key: Key,
    db: Database
)(
    compute: A => LMDBEither[A]
)(
    implicit sa: Storeable[A],
    env: Env
): LMDBEither[A] = {
  import org.fusesource.lmdbjni.Transaction

  // get a new transaction
  val tx: Transaction = instance.env.createWriteTransaction()
  println("tx = " + tx + " id = " + tx.getId)

  // get the key as an Array[Byte]. This is done by converting the key to a base64 string
  // and then converting that to bytes (so arbitrary objects can be made into keys)
  val k = key.render
  println("Key = " + key + " Rendered = " + new String(k))

  // instantiate a result value, so there is something if it fails
  var res: LMDBEither[A] = NoResult.left // initialise the result as a failure to begin with

  try {
    res = for { // this for-comprehension chains together operations that return LMDBEithers into one LMDBEither
      bytes <- LMDBEither(db.get(tx, k)) // error occurs in this Database.get() call
      _ = println("bytes = " + bytes)
      a <- sa.fromBytes(safeRetrieve(bytes)) // sa is effectively a marshaller/unmarshaller object which converts Vector[Byte] => LMDBEither[A]
      _ = println("a = " + a)
      res <- compute(a) // get the next value for the value at the key
      _ = println("res = " + res)
      _ <- LMDBEither(db.put(tx, k, sa.toBytes(res).toArray))
    } yield a // effectively, if all these steps worked, res == Right(a)
    res // return the result
  } finally {
    // make sure you either commit or rollback to avoid resource leaks
    if (res.isRight) tx.commit() // if the result is not an error (i.e. Either.isRight is true)
    else tx.abort()
    tx.close()
  }
}
Where LMDBEither[A] is an alias for Either[E, A] for a specific error type E, and LMDBEither(x) is a function that lifts an expression that might throw exceptions during execution into an LMDBEither, catching any exceptions.
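For readers, a minimal sketch of what such an alias and lifting function could look like; the enclosing object name and the use of Throwable as the error type are assumptions standing in for the project's real E and error mapping:

object LMDBSketch {
  // Sketch only: Throwable stands in for the project's real error type E.
  type LMDBEither[A] = Either[Throwable, A]

  object LMDBEither {
    // Lift an expression that may throw into an LMDBEither, catching any exception.
    def apply[A](expr: => A): LMDBEither[A] =
      try Right(expr)
      catch { case e: Throwable => Left(e) }
  }
}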
The function safeRetrieve converts a possibly-null Array[Byte] into a definitely-not-null Vector[Byte], as follows:
private def safeRetrieve(bytes: Array[Byte]): Vector[Byte] =
  Option(bytes).fold(Vector[Byte]()) { // if the array is null, return an empty vector; otherwise copy the array into a vector
    arr =>
      println("Vector = " + arr.toVector)
      arr.toVector
  }
To the best of my knowledge, this does not modify the memory where the array is stored (LMDB's protected memory)
The values printed up to and including the crash are as follows:
tx = org.fusesource.lmdbjni.Transaction#391cec1f id = 15104
Key = Vector(Objects) Rendered = 84507411390877848991196161
#
# A fatal error has been detected by the Java Runtime Environment:
#
# EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x000000018002453f, pid=10220, tid=7268
#
# JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 1.8.0_60-b27)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode windows-amd64 compressed oops)
# Problematic frame:
# C [lmdbjni-64-0-7710432736670562378.4+0x2453f]
#
# Failed to write core dump. Minidumps are not enabled by default on client versions of Windows
#
# An error report file with more information is saved as:
# C:\dev\PartIIProject\hs_err_pid10220.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
Which is more evidence that the mdb_get() fails.
The full contents of the referenced file are here: https://pastebin.com/v6AmFBjq
Again, I would be extremely grateful for the small chance that anyone could point me in the right direction. What next steps should I be taking?

SparkSQL performance issue with collect method

We are currently facing a performance issue in a Spark SQL application written in Scala. The application flow is as follows:
The Spark application reads a text file from an input HDFS directory.
Creates a data frame on top of the file by programmatically specifying the schema. This dataframe is an exact replication of the input file, kept in memory, and has around 18 columns.
var eqpDF = sqlContext.createDataFrame(eqpRowRdd, eqpSchema)
Creates a filtered dataframe from the first data frame constructed in step 2. This dataframe contains the unique account numbers, obtained with the help of the distinct keyword.
var distAccNrsDF = eqpDF.select("accountnumber").distinct().collect()
Using the two dataframes constructed in steps 2 & 3, we get all the records which belong to one account number and do some JSON parsing logic on top of the filtered data.
var filtrEqpDF =
eqpDF.where("accountnumber='" + data.getString(0) + "'").collect()
Finally, the JSON-parsed data is put into an HBase table.
Here we are facing performance issues while calling the collect method on top of the data frames. Because collect fetches all the data onto a single node and then does the processing, we lose the benefit of parallel processing.
Also, in the real scenario we can expect around 10 billion records of data, so collecting all those records onto the driver node might crash the program itself due to memory or disk space limitations.
I don't think the take method can be used in our case, since it fetches only a limited number of records at a time. We have to get all the unique account numbers from the whole data set, so I am not sure whether take will suit our requirements.
I would appreciate any help to avoid calling collect and any other best practices to follow. Code snippets/suggestions/git links will be very helpful if anyone has faced a similar issue.
Code snippet
Code snippet
val eqpSchemaString = "acoountnumber ....."

val eqpSchema = StructType(eqpSchemaString.split(" ").map(fieldName =>
  StructField(fieldName, StringType, true)));

val eqpRdd = sc.textFile(inputPath)
val eqpRowRdd = eqpRdd.map(_.split(",")).map(eqpRow => Row(eqpRow(0).trim, eqpRow(1).trim, ....))

var eqpDF = sqlContext.createDataFrame(eqpRowRdd, eqpSchema);
var distAccNrsDF = eqpDF.select("accountnumber").distinct().collect()

distAccNrsDF.foreach { data =>
  var filtrEqpDF = eqpDF.where("accountnumber='" + data.getString(0) + "'").collect()
  var result = new JSONObject()
  result.put("jsonSchemaVersion", "1.0")
  val firstRowAcc = filtrEqpDF(0)
  //Json parsing logic
  {
    .....
    .....
  }
}
The approach usually taken in this kind of situation is:
Instead of collect, invoke foreachPartition: foreachPartition applies a function to each partition (represented by an Iterator[Row]) of the underlying DataFrame separately (the partition being the atomic unit of parallelism of Spark)
the function will open a connection to HBase (thus making it one per partition) and send all the contained values through this connection
This means that every executor opens a connection (which is not serializable but lives within the boundaries of the function, and thus does not need to be sent across the network) and independently sends its contents to HBase, without any need to collect all the data on the driver (or on any one node, for that matter).
It looks like you are reading a CSV file, so probably something like the following will do the trick:
spark.read.csv(inputPath). // Using DataFrameReader but your way works too
  foreachPartition { rows =>
    val conn = ??? // Create HBase connection
    for (row <- rows) { // Loop over the iterator
      val data = parseJson(row) // Your parsing logic
      ??? // Use 'conn' to save 'data'
    }
  }
You can avoid collect in your code if you have a large data set.
collect returns all the elements of the dataset as an array to the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
It can also cause the driver to run out of memory, because collect() fetches the entire RDD/DF onto a single machine.
I have just edited your code, which should work for you.
var distAccNrsDF = eqpDF.select("accountnumber").distinct()
distAccNrsDF.foreach { data =>
  var filtrEqpDF = eqpDF.where("accountnumber='" + data.getString(0) + "'")
  var result = new JSONObject()
  result.put("jsonSchemaVersion", "1.0")
  val firstRowAcc = filtrEqpDF(0)
  //Json parsing logic
  {
    .....
    .....
  }
}

Method for reducing memory load of Spark program

I have a Spark program which calculates relations between users, i.e. it receives a data set of type:
RDD[(java.lang.Long, Map[(String, String), Integer])]
Where the Long is a timestamp and the map holds a score for each tuple of two users. The program should run some function over the scores and return the following type:
Map[String, Map[java.lang.Long, java.lang.Double]]
Where the String is the first String in the tuple, and the map holds the results of the function per timeslot.
In my case I have around 2000 users, so the maps I receive are quite big (2000^2 entries per timeslot), and the results also rely on the previous timeslot's results.
I am running the program locally and getting "GC overhead limit exceeded". I increased the heap memory to 14 GB using -Xmx14G in the VM arguments (I see the Java process occupying more than 12 GB of memory), but it didn't help.
Currently implemented method
I have tried several directions to decrease the memory consumption and currently came up with the following idea: since every timestamp relies only on the previous one, I will collect every timeslot separately and keep the previous results on the driver. In this manner I will run calculations only on part of the data, and hopefully it will not crash the program.
The code:
def calculateScorePerTimeslot(distancesPerTimeslotRDD: RDD[(java.lang.Long, Map[(String, String), Integer])]): Map[String, Map[java.lang.Long, java.lang.Double]] = {
  var distancesPerTimeslotVarRDD = distancesPerTimeslotRDD.groupBy(_._1).sortBy(_._1)
  println("Start collecting all the results - cache the data!!")
  distancesPerTimeslotVarRDD.cache()
  println("Caching all the data has completed!")
  while (!distancesPerTimeslotVarRDD.isEmpty()) {
    val dataForTimeslot: (java.lang.Long, Iterable[(java.lang.Long, Map[(String, String), Integer])]) = distancesPerTimeslotVarRDD.first()
    println("Retrieved data for timeslot: " + dataForTimeslot._1)
    //Code which is irrelevant for question - logic
    println("Removing timeslot: " + dataForTimeslot._1)
    distancesPerTimeslotVarRDD = distancesPerTimeslotVarRDD.filter(t => !t._1.equals(dataForTimeslot._1))
    println("Filtering has complete! - without: " + dataForTimeslot._1)
  }
}
Summary: Basically, the idea is to extract one timeslot at a time, process it, and save the results on the driver. In this manner I try to reduce the amount of data passed to collect.
Reason I write this post
Unfortunately, this doesn't help me and the program still dies. My question is: does this manner of taking the first() item of an RDD and then filtering it out have the effect of iterating over the items of the RDD? Are there other, better ideas to tackle this kind of problem (ideas which don't require increasing the memory or moving to a real distributed cluster)?
Firstly, RDD[(java.lang.Long, Map[(String, String), Integer])] uses more memory than RDD[(java.lang.Long, Array[(String, String, Integer)])]. You'll save some memory if you can use the latter.
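As an illustration of that first suggestion, a hedged sketch of the shape conversion (the helper name is hypothetical; it relies on Spark's standard mapValues for pair RDDs):

import org.apache.spark.rdd.RDD

// Hypothetical helper: flatten each per-timestamp Map into an Array of (user1, user2, score) triples.
def toArrayRelations(
    rdd: RDD[(java.lang.Long, Map[(String, String), Integer])]
): RDD[(java.lang.Long, Array[(String, String, Integer)])] =
  rdd.mapValues(_.map { case ((u1, u2), score) => (u1, u2, score) }.toArray)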
Secondly, your loop is pretty inefficient in caching data. Always call unpersist on any RDD you no longer need.
distancesPerTimeslotVarRDD.cache()
var rddSize = distancesPerTimeslotVarRDD.count()
println("Caching all the data has completed!")

while (rddSize > 0) {
  val prevRDD = distancesPerTimeslotVarRDD
  val dataForTimeslot = distancesPerTimeslotVarRDD.first()
  println("Retrieved data for timeslot: " + dataForTimeslot._1)
  // Code which is irrelevant for answer - logic
  println("Removing timeslot: " + dataForTimeslot._1)
  // Cache the new value of distancesPerTimeslotVarRDD
  distancesPerTimeslotVarRDD = distancesPerTimeslotVarRDD.filter(t => !t._1.equals(dataForTimeslot._1)).cache()
  // Force calculation so we can throw away previous iteration value
  rddSize = distancesPerTimeslotVarRDD.count()
  println("Filtering has complete! - without: " + dataForTimeslot._1)
  // Get rid of previously cached RDD
  prevRDD.unpersist(false)
}
Thirdly, you can try using the Kryo serializer, though this sometimes makes things worse. You have to configure the serializer and replace cache with persist(StorageLevel.MEMORY_ONLY_SER).
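A minimal sketch of that configuration, assuming the job is built from a plain SparkConf (the application name and the demo RDD are illustrative assumptions):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("score-per-timeslot") // assumed name
  .setMaster("local[*]")            // the question runs the program locally
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)

// Demo RDD standing in for distancesPerTimeslotVarRDD; cache it in serialized form
// instead of calling plain cache():
val demo = sc.parallelize(Seq((1L, Map(("a", "b") -> 1))))
demo.persist(StorageLevel.MEMORY_ONLY_SER)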