I am running into a "OutOfMemoryError: Requested array size exceeds VM limit" error when running my Scala Spark job.
I'm running this job on an AWS EMR cluster with the following makeup:
Master: 1 m4.4xlarge 32 vCore, 64 GiB memory
Core: 1 r3.4xlarge 32 vCore, 122 GiB memory
The version of Spark I'm using is 2.2.1 on EMR release label 5.11.0.
I'm running my job in a spark shell with the following configurations:
spark-shell --conf spark.driver.memory=40G
--conf spark.driver.maxResultSize=25G
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer
--conf spark.kryoserializer.buffer.max=2000
--conf spark.rpc.message.maxSize=2000
--conf spark.dynamicAllocation.enabled=true
What I'm attempting to do with this job is convert a one-column dataframe of objects into a one-row dataframe that contains a list of those objects.
The objects are as follows:
case class Properties (id: String)
case class Geometry (`type`: String, coordinates: Seq[Seq[Seq[String]]])
case class Features (`type`: String, properties: Properties, geometry: Geometry)
And my dataframe schema is as follows:
root
|-- geometry: struct (nullable = true)
| |-- type: string (nullable = true)
| |-- coordinates: array (nullable = true)
| | |-- element: array (containsNull = true)
| | | |-- element: array (containsNull = true)
| | | | |-- element: string (containsNull = true)
|-- type: string (nullable = false)
|-- properties: struct (nullable = false)
| |-- id: string (nullable = true)
I'm converting it to a list and adding it to a one row dataframe like so:
val x = Seq(df.collect.toList)
final_df.withColumn("features", typedLit(x))
I don't run into any issues when creating this list and it's pretty quick. However, there seems to be a limit to the size of this list when I try to write it out by doing either of the following:
final_df.first
final_df.write.json(s"s3a://<PATH>/")
I've also tried converting the list to a dataframe as follows, but it never seems to finish.
val x = Seq(df.collect.toList)
val y = x.toDF
The largest list I've been able to get this dataframe to work with had 813318 Features objects, each of which contains a Geometry object holding a list of 33 elements, for a total of 29491869 elements.
Attempting to write pretty much any list larger than that gives me the following stacktrace when running my job.
# java.lang.OutOfMemoryError: Requested array size exceeds VM limit
# -XX:OnOutOfMemoryError="kill -9 %p"
# Executing /bin/sh -c "kill -9 33028"...
os::fork_and_exec failed: Cannot allocate memory (12)
18/03/29 21:41:35 ERROR FileFormatWriter: Aborting job null.
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:73)
at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter.write(UnsafeArrayWriter.java:217)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_1$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply1_1$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.execution.LocalTableScanExec$$anonfun$unsafeRows$1.apply(LocalTableScanExec.scala:41)
at org.apache.spark.sql.execution.LocalTableScanExec$$anonfun$unsafeRows$1.apply(LocalTableScanExec.scala:41)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at org.apache.spark.sql.execution.LocalTableScanExec.unsafeRows$lzycompute(LocalTableScanExec.scala:41)
at org.apache.spark.sql.execution.LocalTableScanExec.unsafeRows(LocalTableScanExec.scala:36)
at org.apache.spark.sql.execution.LocalTableScanExec.rdd$lzycompute(LocalTableScanExec.scala:48)
at org.apache.spark.sql.execution.LocalTableScanExec.rdd(LocalTableScanExec.scala:48)
at org.apache.spark.sql.execution.LocalTableScanExec.doExecute(LocalTableScanExec.scala:52)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:173)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:166)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:166)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:166)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:145)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
I've tried making a million configuration changes, including throwing both more driver and executor memory at this job, but to no avail. Is there any way around this? Any ideas?
The problem is here
val x = Seq(df.collect.toList)
When you call collect on a dataframe, it sends all of the dataframe's data to the driver. So if your dataframe is big, this will cause the driver to run out of memory.
Note that out of all the memory you assign, the heap the driver can actually use is generally only about 30% (if not changed). So what is happening is that the driver is choking on the volume of data brought in by the collect operation.
The dataframe may look small on disk, but that is because the data is stored there in serialized form. When you collect, Spark materializes the dataframe and stores the data as JVM objects, which causes a huge memory blow-up (generally 5-7x).
I would recommend removing the collect part and using the df dataframe directly, because I reckon
val x = Seq(df.collect.toList) and df are essentially the same.
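For illustration, removing the collect here would reduce the write step to something like the sketch below (the s3a path is just the placeholder from the question); all of the work then stays on the executors instead of funnelling through the driver.

// Write the original dataframe out directly instead of collecting it to the driver first.
df.write.json("s3a://<PATH>/")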
Well, there is a dataframe aggregation function that does what you want without doing a collect on the driver. For example if you wanted to collect all "feature" columns by key: df.groupBy($"key").agg(collect_list("feature")), or if you really wanted to do that for the whole dataframe without grouping: df.agg(collect_list("feature")).
However I wonder why you'd want to do that, when it seems easier to work with a dataframe with one row per object than a single row containing the entire result. Even using the collect_list aggregation function I wouldn't be surprised if you still run out of memory.
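That said, a rough sketch of the whole-dataframe variant, using the column names from the printed schema in the question, might look like the following (the aggregation itself stays distributed, but the single resulting row can still hit the same size limits noted above):

import org.apache.spark.sql.functions.{col, collect_list, struct}

// Aggregate the whole dataframe into one row holding an array of Features-like structs,
// without ever collecting the rows to the driver.
val featuresDf = df.agg(
  collect_list(struct(col("type"), col("properties"), col("geometry"))).as("features")
)
featuresDf.write.json("s3a://<PATH>/")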
Related
I have a simple df with 2 columns, as shown below,
+------------+---+
|file_name |id |
+------------+---+
|file1.csv |1 |
|file2.csv |2 |
+------------+---+
root
|-- file_name: string (nullable = true)
|-- id: string (nullable = true)
I wish to add a third column containing the count() from each file specified in the file_name column.
These are large files so I wish to go for a Spark based approach for getting the count() from each file.
Assuming originalDF is the above df,
I have tried:
dfWithCounts = originalDF.withColumn("counts", lit(spark.read.csv(lit(col('file_name'))).count))
but this throws an error:
Column is not iterable
Is there a way I can achieve this?
I'm using Spark 2.4.
You can't run a Spark job from within another Spark job. Assuming the file list is not huge, you can collect originalDF to the driver and spawn individual jobs from there to count the lines.
val dfWithCounts = originalDF.collect.map { r =>
  (r.getString(0), r.getString(1), spark.read.csv(r.getString(0)).count)   // id is a string column per the schema above
}.toSeq.toDF("file_name", "id", "count")
Optionally you can use Scala parallel collections to run these jobs in parallel.
val dfWithCounts = originalDF.collect.par.map { r =>
  (r.getString(0), r.getString(1), spark.read.csv(r.getString(0)).count)
}.toSeq.seq.toDF("file_name", "id", "count")
I have a Spark structured streaming job which gets records from Kafka (10,000 as maxOffsetsPerTrigger). I get all those records using Spark's readStream method. This dataframe has a column named "key".
I need something like string(set(all values in the 'key' column)) so that I can use this string in a query to ElasticSearch.
I have already tried df.select("key").collect().distinct() but it throws an exception:
collect() is not supported with structured streaming.
Thanks.
EDIT:
DATAFRAME:
+-------+-------------------+----------+
| key| ex|new column|
+-------+-------------------+----------+
| fruits| [mango, apple]| |
|animals| [cat, dog, horse]| |
| human|[ram, shyam, karun]| |
+-------+-------------------+----------+
SCHEMA:
root
|-- key: string (nullable = true)
|-- ex: array (nullable = true)
| |-- element: string (containsNull = true)
|-- new column: string (nullable = true)
STRING I NEED:
'["fruits", "animals", "human"]'
You cannot apply collect on a streaming dataframe. streamingDf below refers to the dataframe read from Kafka.
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StringType

val query = streamingDf
  .select(col("key").cast(StringType))
  .writeStream
  .format("console")
  .start()
query.awaitTermination()
This will print your data to the console. To write the data to an external sink, you have to provide an implementation of ForeachWriter; a rough sketch of that pattern is shown below.
For reference, see the linked example, in which data is streamed from Kafka, read by Spark, and eventually written to Cassandra.
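As a rough sketch of that pattern (the sink calls below are placeholders, not the actual ElasticSearch or Cassandra client code), a ForeachWriter has three callbacks that run on the executors for every micro-batch:

import org.apache.spark.sql.{ForeachWriter, Row}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StringType

val keyWriter = new ForeachWriter[Row] {
  def open(partitionId: Long, epochId: Long): Boolean = true   // open a client/connection here
  def process(row: Row): Unit = {
    val key = row.getString(0)
    println(key)                                               // placeholder: send `key` to the external sink instead
  }
  def close(errorOrNull: Throwable): Unit = ()                 // close the client/connection here
}

streamingDf
  .select(col("key").cast(StringType))
  .writeStream
  .foreach(keyWriter)
  .start()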
Hope it helps.
For such a use case, I'd recommend using the foreachBatch operator:
foreachBatch(function: (Dataset[T], Long) ⇒ Unit): DataStreamWriter[T]
Sets the output of the streaming query to be processed using the provided function. This is supported only in the micro-batch execution modes (that is, when the trigger is not continuous).
In every micro-batch, the provided function will be called with (i) the output rows as a Dataset and (ii) the batch identifier.
The batchId can be used to deduplicate and transactionally write the output (that is, the provided Dataset) to external systems. The output Dataset is guaranteed to be exactly the same for the same batchId (assuming all operations are deterministic in the query).
Quoting the official documentation (with a few modifications):
The foreachBatch operation allows you to apply arbitrary operations and writing logic on the output of a streaming query.
foreachBatch allows arbitrary operations and custom logic on the output of each micro-batch.
And in the same official documentation you can find sample code showing that your use case can be handled fairly easily:
streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  batchDF.select("key").distinct().collect()
}.start()
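Building the string shown in the edit above from each micro-batch might then look like the following sketch, which continues the batchDF from the block above (the ElasticSearch call itself is left out as a placeholder):

// Inside the foreachBatch function: collect this batch's distinct keys on the driver
// and format them as '["fruits", "animals", "human"]'.
val keys = batchDF.select("key").distinct().collect().map(_.getString(0))
val keyString = keys.mkString("[\"", "\", \"", "\"]")
// keyString can now be interpolated into the ElasticSearch query.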
I am trying to fix an out-of-memory issue I am seeing in my Spark setup, and at this point I am unable to reach a concrete conclusion about why I am seeing it. I always see this issue when writing a dataframe to Parquet or Kafka. My dataframe has 5000 rows. Its schema is
root
|-- A: string (nullable = true)
|-- B: string (nullable = true)
|-- C: string (nullable = true)
|-- D: array (nullable = true)
| |-- element: string (containsNull = true)
|-- E: array (nullable = true)
| |-- element: string (containsNull = true)
|-- F: double (nullable = true)
|-- G: array (nullable = true)
| |-- element: double (containsNull = true)
|-- H: integer (nullable = true)
|-- I: double (nullable = true)
|-- J: double (nullable = true)
|-- K: array (nullable = true)
| |-- element: double (containsNull = false)
Of these, column G can have a cell size of up to 16 MB. My dataframe's total size is about 10 GB, partitioned into 12 partitions. Before writing, I attempt to create 48 partitions out of this using repartition(), but the issue appears even if I write without repartitioning. At the time of this exception, I have only one dataframe cached, with a size of about 10 GB. My driver has 19 GB of free memory and the 2 executors have 8 GB of free memory each. The Spark version is 2.1.0.cloudera1 and the Scala version is 2.11.8.
I have the below settings:
spark.driver.memory 35G
spark.executor.memory 25G
spark.executor.instances 2
spark.executor.cores 3
spark.driver.maxResultSize 30g
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max 1g
spark.rdd.compress true
spark.rpc.message.maxSize 2046
spark.yarn.executor.memoryOverhead 4096
The exception traceback is
Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError
at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:991)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:918)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:765)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:764)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at org.apache.spark.scheduler.DAGScheduler.submitWaitingChildStages(DAGScheduler.scala:764)
at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1228)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1647)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Any insights?
We finally found the issue. We were running k-fold logistic regression in Scala on a 5000-row dataframe with k = 4. After classification we ended up with 4 test-output dataframes of 1250 rows each, each split into at least 200 partitions, so in total we had more than 800 partitions for 5000 rows of data. The code would then repartition this data to 48 partitions. Our system couldn't handle this repartition, probably because of the shuffle it triggers. To fix it, we repartitioned each fold's output dataframe to a smaller number of partitions (instead of doing it on the combined dataframe), and that resolved the issue.
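A minimal sketch of that fix, assuming foldOutputs holds the four per-fold test dataframes (the names and partition counts here are illustrative, not the original code):

import org.apache.spark.sql.DataFrame

// Repartition each 1250-row fold output to a handful of partitions first,
// then combine them, instead of repartitioning the >800-partition union afterwards.
def combineFolds(foldOutputs: Seq[DataFrame]): DataFrame =
  foldOutputs
    .map(_.repartition(12))   // e.g. 4 folds x 12 partitions = 48 partitions overall
    .reduce(_ union _)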
I have a spark Dataframe df with the following schema:
root
|-- features: array (nullable = true)
| |-- element: double (containsNull = false)
I would like to create a new Dataframe where each row will be a Vector of Doubles and expecting to get the following schema:
root
|-- features: vector (nullable = true)
So far I have the following piece of code (influenced by this post: Converting Spark Dataframe(with WrappedArray) to RDD[labelPoint] in scala), but I fear something is wrong with it because it takes a very long time to compute even for a reasonable number of rows.
Also, if there are too many rows the application will crash with a heap space exception.
import scala.collection.mutable
import org.apache.spark.ml.linalg.{Vector, Vectors}

val clustSet = df.rdd.map(r => {
  val arr = r.getAs[mutable.WrappedArray[Double]]("features")
  val features: Vector = Vectors.dense(arr.toArray)
  features
}).map(Tuple1(_)).toDF()
I suspect that the instruction arr.toArray is not a good Spark practice in this case. Any clarification would be very helpful.
Thank you!
It's because .rdd has to deserialize the objects from Spark's internal in-memory format, and that is very time consuming.
It's fine to use .toArray - you are operating at the row level, not collecting everything to the driver node.
You can do this very easily with a UDF:
import org.apache.spark.ml.linalg._
import org.apache.spark.sql.functions.udf

val convertUDF = udf((array: Seq[Double]) => {
  Vectors.dense(array.toArray)
})

val withVector = dataset
  .withColumn("features", convertUDF('features))
Code is from this answer: Convert ArrayType(FloatType,false) to VectorUTD
However, the author of that question didn't ask about the differences.
I am importing a Postgres database into Spark. I know that I can partition on import, but that requires that I have a numeric column (I don't want to use the value column because it's all over the place and doesn't maintain order):
df = spark.read.format('jdbc').options(url=url, dbtable='tableName', properties=properties).load()
df.printSchema()
root
|-- id: string (nullable = false)
|-- timestamp: timestamp (nullable = false)
|-- key: string (nullable = false)
|-- value: double (nullable = false)
Instead, I am converting the dataframe into an rdd (of enumerated tuples) and trying to partition that instead:
rdd = df.rdd.flatMap(lambda x: enumerate(x)).partitionBy(20)
Note that I used 20 because I have 5 workers with 4 cores each in my cluster, and 5*4=20.
Unfortunately, the following command still takes forever to execute:
result = rdd.first()
Therefore I am wondering whether my logic above makes sense. Am I doing anything wrong? From the web UI, it looks like the workers are not being used.
Since you already know you can partition by a numeric column, this is probably what you should do. Here is the trick. First, let's find the minimum and maximum epoch:
url = ...
properties = ...
min_max_query = """(
SELECT
CAST(min(extract(epoch FROM timestamp)) AS bigint),
CAST(max(extract(epoch FROM timestamp)) AS bigint)
FROM tablename
) tmp"""
min_epoch, max_epoch = spark.read.jdbc(
url=url, table=min_max_query, properties=properties
).first()
and use it to query the table:
numPartitions = ...
query = """(
SELECT *, CAST(extract(epoch FROM timestamp) AS bigint) AS epoch
FROM tablename) AS tmp"""
spark.read.jdbc(
url=url, table=query,
lowerBound=min_epoch, upperBound=max_epoch + 1,
column="epoch", numPartitions=numPartitions, properties=properties
).drop("epoch")
Since this splits the data into ranges of the same size, it is relatively sensitive to data skew, so you should use it with caution.
You could also provide a list of disjoint predicates as a predicates argument.
predicates= [
"id BETWEEN 'a' AND 'c'",
"id BETWEEN 'd' AND 'g'",
... # Continue to get full coverage and the desired number of predicates
]
spark.read.jdbc(
url=url, table="tablename", properties=properties,
predicates=predicates
)
The latter approach is much more flexible and can address certain issues with non-uniform data distribution but requires more knowledge about the data.
Using partitionBy fetches the data first and then performs a full shuffle to get the desired number of partitions, so it is relatively expensive.