Scope of 'spark.driver.maxResultSize' - scala

I'm running a Spark job to aggregate data. I have a custom data structure called a Profile, which basically contains a mutable.HashMap[Zone, Double]. I want to merge all profiles that share a given key (a UUID), with the following code:
def merge = (up1: Profile, up2: Profile) => { up1.addWeights(up2); up1 }
val aggregated = dailyProfiles
  .aggregateByKey(new Profile(), 3200)(merge, merge).cache()
Curiously, Spark fails with the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 116318 tasks (1024.0 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
The obvious solution is to increase "spark.driver.maxResultSize", but two things puzzle me.
It seems too much of a coincidence that I get 1024.0 reported as bigger than 1024.0.
All the documentation and help I found when googling this particular error and configuration parameter indicates that it affects functions that bring a value back to the driver (say take() or collect()), but I'm not bringing ANYTHING back to the driver; I'm just reading from HDFS, aggregating, and saving back to HDFS.
Does anyone know why I'm getting this error?
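For reference, the setting involved is an ordinary driver configuration value (the default is 1g, and 0 removes the limit). A minimal sketch of raising it; the 2g figure is purely illustrative and the app name is hypothetical:
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative only: raise the cap on the total size of serialized task results
// that the driver will accept. Default is 1g; 0 means unlimited.
val conf = new SparkConf()
  .setAppName("profile-aggregation")        // hypothetical app name
  .set("spark.driver.maxResultSize", "2g")
val sc = new SparkContext(conf)
The same value can be passed on the command line with --conf spark.driver.maxResultSize=2g.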

Yes, it's failing because the values you see in the exception message are rounded to one decimal place, while the comparison itself happens in bytes.
The serialized output must be more than 1024.0 MB but still small enough to round down to 1024.0 MB (i.e. less than roughly 1024.05 MB).
Check the Apache Spark code snippets added below; it's very interesting, and very rare to hit this error. :)
Here totalResultSize and maxResultSize are both Long values holding sizes in bytes, and the comparison totalResultSize > maxResultSize is exact. The message, however, uses the rounded strings produced by Utils.bytesToString().
// TaskSetManager.scala
def canFetchMoreResults(size: Long): Boolean = sched.synchronized {
  totalResultSize += size
  calculatedTasks += 1
  if (maxResultSize > 0 && totalResultSize > maxResultSize) {
    val msg = s"Total size of serialized results of ${calculatedTasks} tasks " +
      s"(${Utils.bytesToString(totalResultSize)}) is bigger than spark.driver.maxResultSize " +
      s"(${Utils.bytesToString(maxResultSize)})"
    logError(msg)
    abort(msg)
    false
  } else {
    true
  }
}
Apache Spark 1.3 - source
// Utils.scala
def bytesToString(size: Long): String = {
  val TB = 1L << 40
  val GB = 1L << 30
  val MB = 1L << 20
  val KB = 1L << 10

  val (value, unit) = {
    if (size >= 2*TB) {
      (size.asInstanceOf[Double] / TB, "TB")
    } else if (size >= 2*GB) {
      (size.asInstanceOf[Double] / GB, "GB")
    } else if (size >= 2*MB) {
      (size.asInstanceOf[Double] / MB, "MB")
    } else if (size >= 2*KB) {
      (size.asInstanceOf[Double] / KB, "KB")
    } else {
      (size.asInstanceOf[Double], "B")
    }
  }
  "%.1f %s".formatLocal(Locale.US, value, unit)
}
Apache Spark 1.3 - source
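To see the "coincidence" concretely, here is a small standalone sketch that reimplements just the MB branch of the formatting above (not Spark's private Utils, only the same arithmetic) and shows two byte counts that compare as different but print identically:
import java.util.Locale

// Standalone reimplementation of the MB branch of bytesToString, for illustration only.
def mbString(sizeInBytes: Long): String =
  "%.1f MB".formatLocal(Locale.US, sizeInBytes.toDouble / (1L << 20))

val maxResultSize   = 1L << 30       // 1g = 1073741824 bytes
val totalResultSize = (1L << 30) + 1 // one byte over the limit

println(totalResultSize > maxResultSize) // true  -- the exact byte-level comparison that aborts the job
println(mbString(totalResultSize))       // 1024.0 MB
println(mbString(maxResultSize))         // 1024.0 MB -- the rounded strings used in the message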

Related

Scala specific -- does an immutable.Set consume significantly more memory than a mutable.Set?

Using Scala 2.12.8 on Java HotSpot 17.0.1, I have a large number of instances of an object which contains code like this:
var tmpSet: mutable.Set[SomeType] = mutable.Set[SomeType]()
lazy val finalSet: immutable.Set[SomeType] = { val tmp = tmpSet.toSet; tmpSet = null; tmp }
During initialization, the logic builds the tmpSet in all of the objects. Then runs this:
for(i <- 0 until numberOfInstances ) instance(i).finalSet.size
to force the conversion to an immutable.Set which will be used for all further processing.
Before the conversion, with an -Xmx14G parameter, about 4.5G of memory has been consumed (for all the tmpSets). Running the conversion always throws an OOM. I have placed traces of memory use at points within the for(...) loop and can see memory usage steadily increasing until the OOM.
Any idea what is happening here? Even if the GC is disabled and does not recover any of the tmpSet instances that have been set to null, there should still be enough RAM -- unless an immutable.Set takes far more memory than the equivalent mutable.Set.
WITHDRAWING this question. I wrote a testbed (below, as an answer) to mimic this situation and it does NOT show this behaviour -- so it must be some other problem within my codebase.
Testbed code to mimic the situation, and it does not exhibit the problem.
Note: I had trouble pasting the code because of some special symbols (the < characters in particular), and I have also increased the indentation.
/** Test case to explore OOM issue when converting a large number of mutable.Set's to immutable.Set's
* The 'lazy val finalSet' logic converts each mutable.Set into an immutable.Set, releasing the mutable.Set in the process.
*
* The original expectation was that this would consume roughly the same amount of memory as before the conversion!!!
**/
import scala.collection.mutable
import scala.collection.immutable
case class TestSetOOM(numStrings:Int) {
import TestSetOOM._
// tmpSet is loaded with sizeOfSet unique strings when this class is initialized
var tmpSet:mutable.Set[String] = mutable.Set[String]()
// finalSet is an immutable.Set initialized when finalSet.size is invoked. tmpSet is converted, and then released
lazy val finalSet:immutable.Set[String] = { val tmp = tmpSet.toSet; tmpSet = null; tmp}
// Executes at initialization
for(i <- 0 until numStrings) tmpSet += getString
}
object TestSetOOM {
/////// The basic parameters controlling the tests ////////
val numInstances = 10000 // How many instances of TestSetOOM to create
val sizeOfString = 1000 // How large to make each test string (+ 16 for the number)
val sizeOfSet = 1000 // How many test strings to add to the 'tmpSet' within each instance
val gcEveryN = 1000 // During the conversion phase, request a GC & pause every N instances processed
// NOTE: Would REALLY love to have a method to FORCE a complete, deep GC for use
// in exactly test routines such as this!!!
val pauseMillis = 100 // Number of milliseconds to pause during the gdEveryN, can be zero for no pause
// The hope is that if we pause the main thread then the JVM may actually run the GC
val reportAsMB = true // True == show memory reports as MB, false as GBs
//////// ... end of basic parameters ... ////////
val baseData = numInstances.toLong * sizeOfSet * (sizeOfString + 16)
def main(ignored:Array[String]):Unit = {
var instances:List[TestSetOOM] = Nil // Create numInstances and prepend to this list
for(i <- 0 until numInstances) instances ::= TestSetOOM(sizeOfSet)
runtime.gc
if(pauseMillis > 0) Thread.sleep(pauseMillis)
ln("")
ln(f"Instances: $numInstances%,d, Size Of mutable.Set: $sizeOfSet%,d, Size of String: ${sizeOfString + 16}%,d == ${MBorGB(baseData)} base data size")
ln("")
ln(s" BASELINE -- $memUsedStr -- after initialization of all test data")
var dummy = 0L
var cnt = 0
instances.foreach { item =>
dummy += item.finalSet.size // Forces the conversion
cnt += 1
if(gcEveryN > 0 && (cnt % gcEveryN) == 0){
runtime.gc
if(pauseMillis > 0) Thread.sleep(pauseMillis)
ln(f"$cnt%,11d converted -- $memUsedStr ")
}
}
ln("")
ln(s" FINAL -- $memUsedStr")
}
val runtime = Runtime.getRuntime
// Get a memory report in either MBs or GBs
def memUsedStr = { val max = runtime.maxMemory
val ttl = runtime.totalMemory
val free = runtime.freeMemory
val used = ttl - free
s"Memory -- Max: ${MBorGB(max)}, Total: ${MBorGB(ttl)}, Used: ${MBorGB(used)}"
}
def MBorGB(amount:Long) = { val amt = amount / (if(reportAsMB) 1000000 else 1000000000L)
val as = if(reportAsMB) "MB" else "GB"
f"$amt%,9d $as"
}
// Generate strings with leading unique numbers as test data so the compiler cannot
// recognize as equal and memoize just one if they were all the same!
val emptyString = "X" * sizeOfString
var numString = 0L
def getString = { numString += 1
f"${numString}%,16d$emptyString"
}
def ln(str:String) = println(str)
}
Output of execution
Instances: 10,000, Size Of mutable.Set: 1,000, Size of String: 1,016 == 10,160 MB base data size
BASELINE -- Memory -- Max: 15,032 MB, Total: 15,032 MB, Used: 10,746 MB -- after initialization of all test data
1,000 converted -- Memory -- Max: 15,032 MB, Total: 15,032 MB, Used: 10,783 MB
2,000 converted -- Memory -- Max: 15,032 MB, Total: 15,032 MB, Used: 10,825 MB
3,000 converted -- Memory -- Max: 15,032 MB, Total: 15,032 MB, Used: 10,867 MB
4,000 converted -- Memory -- Max: 15,032 MB, Total: 15,032 MB, Used: 10,909 MB
5,000 converted -- Memory -- Max: 15,032 MB, Total: 15,032 MB, Used: 10,951 MB
6,000 converted -- Memory -- Max: 15,032 MB, Total: 15,032 MB, Used: 10,992 MB
7,000 converted -- Memory -- Max: 15,032 MB, Total: 15,032 MB, Used: 11,034 MB
8,000 converted -- Memory -- Max: 15,032 MB, Total: 15,032 MB, Used: 11,076 MB
9,000 converted -- Memory -- Max: 15,032 MB, Total: 15,032 MB, Used: 11,117 MB
10,000 converted -- Memory -- Max: 15,032 MB, Total: 15,032 MB, Used: 11,158 MB
FINAL -- Memory -- Max: 15,032 MB, Total: 15,032 MB, Used: 11,162 MB

AWS Deequ Checks Error: isGreaterThanOrEqualTo is not a member of com.amazon.deequ.VerificationRunBuilder

I run the following code in a Databricks notebook with the com.amazon.deequ:deequ:2.0.0-spark-3.1 library to check data quality on input data, and I get error messages saying that certain functions are not a member of com.amazon.deequ.VerificationRunBuilder. Where do checks such as isGreaterThanOrEqualTo, hasDataType, and hasMinLength exist? I did check https://github.com/awslabs/deequ/blob/master/src/main/scala/com/amazon/deequ/checks/Check.scala and they do exist in there.
%scala
import com.amazon.deequ.{VerificationSuite, VerificationResult}
import com.amazon.deequ.VerificationResult.checkResultsAsDataFrame
import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}
import com.amazon.deequ.suggestions.{ConstraintSuggestionRunner, Rules}
import com.amazon.deequ.constraints.Constraint;
val verificationResult: VerificationResult = { VerificationSuite()
// data to run the verification on
.onData(df)
// define a data quality check
.addCheck(
Check(CheckLevel.Error, "unitTest")
//.hasSize(_ >= 2) // at least 2 rows
.hasMax("prem_amt", _ <= 2000) // max is 2000
.hasMin("prem_amt", _ >= 1000) // min is 1000
//.hasCompleteness("pol_nbr", _ >= 0.95) // 95%+ non-null policy numbers
.isNonNegative("prem_amt")) // should not contain negative values
.hasMinLength("pol_nbr", _ <= 8) // minimum length is 8
.hasMaxLength("pol_nbr", _ <= 8) // maximum length is 8
.hasDataType("trans_eff_dt", ConstrainableDataTypes.Date)
.isGreaterThanOrEqualTo("trans_eff_dt","pol_eff_dt")
// compute metrics and verify check conditions
.run()
}
// convert check results to a Spark data frame
val resultDataFrame = checkResultsAsDataFrame(spark, verificationResult)
resultDataFrame.show(truncate=false)
VerificationResult.successMetricsAsDataFrame(spark, verificationResult).show(truncate=false)
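A likely reading of the error, given the snippet above: the parenthesis after .isNonNegative("prem_amt")) closes the addCheck(Check(...)) call, so the calls that follow (.hasMinLength, .hasMaxLength, .hasDataType, .isGreaterThanOrEqualTo) are invoked on the VerificationRunBuilder rather than on the Check, which is exactly the type named in the error message. A hedged sketch of the chaining with the parenthesis moved, keeping the question's checks as written (ConstrainableDataTypes would also need to be imported from Deequ's constraints package):
// Sketch only: every check method stays chained on the Check instance,
// and the parenthesis that ends addCheck(...) comes after the last of them.
val verificationResult: VerificationResult = VerificationSuite()
  .onData(df)
  .addCheck(
    Check(CheckLevel.Error, "unitTest")
      .hasMax("prem_amt", _ <= 2000)
      .hasMin("prem_amt", _ >= 1000)
      .isNonNegative("prem_amt")
      .hasMinLength("pol_nbr", _ <= 8)
      .hasMaxLength("pol_nbr", _ <= 8)
      .hasDataType("trans_eff_dt", ConstrainableDataTypes.Date)
      .isGreaterThanOrEqualTo("trans_eff_dt", "pol_eff_dt"))
  .run()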

Spark iterate over dataframe rows, cells

(Spark beginner) I wrote the code below to iterate over the rows and columns of a data frame (Spark 2.4.0 + Scala 2.12).
I have computed the row and cell counts as a sanity check.
I was surprised to find that the method returns 0, even though the counters are incremented during the iteration.
To be precise, while the code is running it prints messages showing that it has found:
rows 10, 20, ..., 610 - as expected;
cells 100, 200, ..., 15800 - as expected.
After the iteration is done, it prints "Found 0 cells", and returns 0.
I understand that Spark is a distributed processing engine, and that code is not executed exactly as written - but how should I think about this code?
The row/cell counts were just a sanity check; in reality I need to loop over the data and accumulate some results, but how do I prevent Spark from zeroing out my results as soon as the iteration is done?
def processDataFrame(df: sql.DataFrame): Int = {
  var numRows = 0
  var numCells = 0
  df.foreach { row =>
    numRows += 1
    if (numRows % 10 == 0) println(s"Found row $numRows") // prints 10,20,...,610
    row.toSeq.foreach { c =>
      if (numCells % 100 == 0) println(s"Found cell $numCells") // prints 100,200,...,15800
      numCells += 1
    }
  }
  println(s"Found $numCells cells") // prints 0
  numCells
}
Spark has accumulator variables, which give you exactly this kind of counting functionality in a distributed environment. You can use the simple built-in numeric accumulators, and a custom accumulator type can also be implemented quite easily. The reason your counters come back as 0 is that the closure passed to foreach is serialized and shipped to the executors, so each executor increments its own copy of numRows and numCells; the copies held by the driver are never updated.
In your code, changing the counting variables to accumulators as below will give you the correct result.
val numRows = sc.longAccumulator("numRows Accumulator") // the name is only for debugging and the Spark UI
val numCells = sc.longAccumulator("numCells Accumulator")
df.foreach { row =>
  numRows.add(1)
  // note: reading .value inside a task only reflects that task's local count
  if (numRows.value % 10 == 0) println(s"Found row ${numRows.value}")
  row.toSeq.foreach { c =>
    if (numCells.value % 100 == 0) println(s"Found cell ${numCells.value}")
    numCells.add(1)
  }
}
println(s"Found ${numCells.value} cells") // read on the driver: prints the correct total
numCells.value
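If the goal is simply to accumulate results rather than to observe side effects, a hedged alternative is to express the computation with Spark's own operations and let only the final aggregate come back to the driver. A minimal sketch for the sanity-check numbers above:
// Sketch: compute the row and cell counts without any mutable state.
// The counts come back to the driver as ordinary return values.
def countRowsAndCells(df: org.apache.spark.sql.DataFrame): (Long, Long) = {
  val numRows  = df.count()                  // executed on the cluster, result returned to the driver
  val numCells = numRows * df.columns.length // every row has the same number of columns
  (numRows, numCells)
}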

Async reading of large Cassandra table using Scala / Phantom-DSL

I have an issue reading a table containing >800k rows. I need to read the rows from top to bottom in order to process them.
I use Scala and Phantom for the purpose.
Here is how my table looks.
CREATE TABLE raw (
id uuid PRIMARY KEY,
b1 text,
b2 timestamp,
b3 text,
b4 text,
b5 text
) WITH bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
So far I've tried to read the table using:
def getAllRecords : Future[Seq[row]] = select.fetch
or the fancier Play Enumerator combined with an Iteratee:
def getAllRecords : Enumerator = select.fetchEnumerator
Neither of these works; it seems like Cassandra / the driver / my program always tries to read all records upfront. What am I missing here?
Have you tried reviewing the code in the bigger read tests?
class IterateeBigReadPerformanceTest extends BigTest with ScalaFutures {

  it should "read the correct number of records found in the table" in {
    val counter: AtomicLong = new AtomicLong(0)
    val result = TestDatabase.primitivesJoda.select
      .fetchEnumerator run Iteratee.forEach {
        r => counter.incrementAndGet()
      }

    result.successful {
      query => {
        info(s"done, reading: ${counter.get}")
        counter.get() shouldEqual 2000000
      }
    }
  }
}
This is not something that will read your records upfront. In fact, we have tests that run for more than an hour to guarantee sufficient GC pauses, no GC overhead, that permgen/metaspace pressure stays within bounds, and so on.
If anything has indeed changed, it is only in error, but this should still work.
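Adapted to the question's raw table, the same streaming pattern might look roughly like this. This is a sketch only: database.raw stands in for the Phantom table object mapped to the raw CQL table (the question's own model class isn't shown), and the Phantom/Play iteratee imports are the same ones the test above relies on:
import java.util.concurrent.atomic.AtomicLong
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
// plus the same Phantom / Play iteratee imports used in IterateeBigReadPerformanceTest above

def processAllRecords(): Future[Long] = {
  val counter = new AtomicLong(0)

  // fetchEnumerator streams rows page by page instead of materializing the whole table,
  // and the Iteratee consumes each record as it arrives.
  val done = database.raw.select.fetchEnumerator run Iteratee.forEach { record =>
    // process one record here, top to bottom
    counter.incrementAndGet()
  }

  done.map(_ => counter.get())
}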

Executors run longer than timeout value

Here is a Scala code segment. I set the timeout to 100 milliseconds. Out of 10,000 loops, 106 of them run for more than 100 milliseconds without throwing an exception; the largest even takes 135 milliseconds. Any reason why this is happening?
for (j <- 0 to 10000) {
  total += 1
  val executor = Executors.newSingleThreadExecutor
  val result = executor.submit[Int](new Callable[Int] {
    def call = try {
      Thread.sleep(95)
      for (i <- 0 to 1000000) {}
      4
    } catch {
      case e: Exception =>
        exception1 += 1
        5
    }
  })
  try {
    val t1 = Calendar.getInstance.getTimeInMillis
    result.get(100, TimeUnit.MILLISECONDS)
    val t2 = Calendar.getInstance.getTimeInMillis
    println("timediff = " + (t2 - t1).toString)
  } catch {
    case e: Exception => exception2 += 1
  }
}
Firstly, if you're running on Windows, you should be aware that the timer resolution is around 15.6 milliseconds.
Secondly, your empty loop of 1M iterations is quite likely to be removed by the compiler and, more importantly, can't be interrupted by any timeout.
The way a thread sleep works is that the thread asks the OS to wake it up after the given time; that's also how the timeout in the result.get call works. You're relying on the thread that does this being scheduled at the exact time your timeout expires, which of course it may not be. Then there is the fact that you have 10,000 threads for it to wake up, which it can't do all at exactly the same time.
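If the goal is to actually stop the work at the deadline rather than just stop waiting for it, one option (a sketch only, shown for a single task for brevity) is to cancel the future on timeout and make the busy loop cooperate by checking the interrupt flag, since only Thread.sleep responds to interruption on its own:
import java.util.concurrent.{Callable, Executors, TimeUnit, TimeoutException}

val executor = Executors.newSingleThreadExecutor()
val result = executor.submit[Int](new Callable[Int] {
  def call(): Int = {
    Thread.sleep(95) // throws InterruptedException if the task is cancelled while sleeping
    var i = 0
    while (i < 1000000 && !Thread.currentThread().isInterrupted) {
      i += 1 // cooperative check: a plain loop is never stopped by a timeout on get()
    }
    4
  }
})

try {
  result.get(100, TimeUnit.MILLISECONDS)
} catch {
  case _: TimeoutException =>
    result.cancel(true) // interrupt the worker instead of letting it run to completion
} finally {
  executor.shutdown()
}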