Spark "Task serialization failed: java.lang.OutOfMemoryError" - scala

I am trying to write 300 rows to Parquet files; each row contains a BinaryType column roughly 14,000,000 bytes long.
As a result I get this exception:
org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.OutOfMemoryError java.lang.OutOfMemoryError at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
But it works when I write 100 or 200 rows with the same values.
spark-submit configuration:
--driver-memory 20G
--num-executors 100
--executor-memory 20G
--executor-cores 3
--driver-cores 3
--conf spark.driver.maxResultSize=10G
--conf spark.executor.memoryOverhead=10G
--conf spark.driver.memoryOverhead=10G
--conf spark.network.timeout=1200000
--conf spark.memory.offHeap.enabled=true
--conf spark.sql.files.maxRecordsPerFile=2
I really can't understand how ~14 MB per row can throw an OutOfMemoryError when writing. If I run show() on the final DataFrame, it works without errors.
Using persist(StorageLevel.MEMORY_AND_DISK_SER) on the final DataFrame also didn't help.
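For context, here is a minimal sketch of the kind of write involved (the DataFrame name `df` and the output path are placeholders, not my exact code). Repartitioning before the write limits each task to a couple of the ~14 MB rows, which is the same intent as spark.sql.files.maxRecordsPerFile=2 but applied per task rather than per file:
```
// Sketch only: `df` holds the 300 rows with the large BinaryType column.
// ~150 partitions means roughly 2 rows (about 28 MB of payload) per task,
// so no single task has to buffer a large slice of the data at once.
df.repartition(150)
  .write
  .mode("overwrite")
  .parquet("/tmp/large_binary_parquet")   // placeholder output path
```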

Related

Error while connecting snowflake to glue using custom jdbc connector and connection?

I am trying to connect AWS Glue to Snowflake using a custom JDBC connector and connection. However, after creating the connection and running my job, when I call the toDF() method to convert the DynamicFrame to a PySpark DataFrame I get the following error:
File "/tmp/Snowflake_Test_Job.py", line 22, in <module>
df1 = Snowflake_New_Connector_node1.toDF()
File "/opt/amazon/lib/python3.7/site-packages/awsglue/dynamicframe.py", line 147, in toDF
return DataFrame(self._jdf.toDF(self.glue_ctx._jvm.PythonUtils.toSeq(scala_options)), self.glue_ctx)
File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
return_value = get_return_value(
File "/opt/amazon/lib/python3.7/site-packages/pyspark/sql/utils.py", line 190, in deco
return f(*a, **kw)
File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o95.toDF.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) (179.39.34.49 executor 2): org.apache.spark.util.TaskCompletionListenerException: null
Previous exception in task: Glue ETL Marketplace: JDBC driver failure, driverClassName: net.snowflake.client.jdbc.SnowflakeDriver, error message: No suitable driver found for rendered JDBC URL
The URL I am using follows the Snowflake documentation.
I am using the following JDBC jar file: snowflake-jdbc-3.13.22.jar.
I am not sure what is going wrong here. If someone has faced a similar problem and found a solution, please share.
I tried creating the connection with other JAR files as well, but I get the same error. I also tried creating the connection from the Glue job code, with the same result. The expected outcome is that the query executes on Snowflake.
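One way to check whether the driver/URL pair itself is the problem, independently of the Glue connection definition, is to load the same jar with Spark's plain JDBC reader. This is only a sketch in Scala; the account, credentials, and table name are placeholders, and it assumes snowflake-jdbc-3.13.22.jar is on the classpath (e.g. via --jars):
```
// Sketch: if this also reports "No suitable driver", the URL/driver pairing
// is at fault rather than the Glue connection definition.
val probe = spark.read
  .format("jdbc")
  .option("url", "jdbc:snowflake://<account>.snowflakecomputing.com/")  // placeholder account
  .option("driver", "net.snowflake.client.jdbc.SnowflakeDriver")
  .option("dbtable", "MY_TABLE")                                        // placeholder table
  .option("user", "USER")                                               // placeholder credentials
  .option("password", "PASSWORD")
  .load()
probe.show(5)
```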

rdd to df conversion using toDF() is giving error

My code is:
```
sample1 = df_pat_jour_status_other_occurances.rdd.map(lambda x: (x.lh_pat_id, x.src_key, x.Journey_Status)).toDF()
type(sample1)
```
```
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1010.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1010.0 (TID 32154, LTIN214271.cts.com, executor driver): org.apache.spark.SparkException: Python worker failed to connect back.
```
Cause:
The workers (slaves) and the driver are using different Python versions in your case.
Fix:
Install Python 3.8 on the workers.
Modify the spark/conf/spark-env.sh file and add PYSPARK_PYTHON=/usr/local/lib/python3.8 so every node uses the same interpreter.

spark - java heap space issue - ExecutorLostFailure - container exited with status 143

I am reading strings that are more than 100K bytes long and splitting them into columns based on fixed widths; I end up with close to 16K columns split out of each string.
While writing to Parquet I am using the code below:
import spark.implicits._

val rdd1 = spark.sparkContext.textFile("file1")

// Slice one fixed-width record into substrings according to the column widths.
def substrString(line: String, colLengths: Seq[Int]): Seq[String] = {
  var now = 0
  val collector = new Array[String](colLengths.length)
  for (k <- 0 until colLengths.length) {
    collector(k) = line.substring(now, now + colLengths(k))
    now = now + colLengths(k)
  }
  collector.toSeq
}

// ColLengthSeq is read from another schema file and holds the column widths.
val StringArray = rdd1.map(line => substrString(line, ColLengthSeq))

// column_seq is a Seq[String] with the 16K column names; ColCount = 16000.
StringArray.toDF("StringCol")
  .select((0 until ColCount).map(j => $"StringCol"(j).as(column_seq(j))): _*)
  .write.mode("overwrite").parquet("C:\\home\\")
I am running this on YARN with 16 GB executor memory and 20 executors.
The file size is 4 GB.
I am getting the following error:
Lost task 113.0 in stage 0.0 (TID 461, gsta32512.foo.com): ExecutorLostFailure (executor 28 exited caused by one of the running tasks) Reason:
Container marked as failed:
container_e05_1472185459203_255575_01_000183 on host: gsta32512.foo.com. Exit status: 143. Diagnostics:
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal
When I check the status in the UI it shows:
java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: GC overhead limit exceeded
Please advise on performance tuning of the above code and on spark-submit parameter optimization.
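For comparison, a sketch of one alternative that is sometimes gentler on memory with very wide schemas: build a Row per record against an explicit schema instead of projecting 16K expressions out of a single array column. This is only a sketch, reusing the substrString helper, rdd1, ColLengthSeq and column_seq from the code above; the output path is a placeholder:
```
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Sketch: one Row per fixed-width record with an explicit 16K-column schema,
// instead of selecting 16K expressions out of one array column.
val schema = StructType(column_seq.map(name => StructField(name, StringType, nullable = true)))
val rowRdd = rdd1.map(line => Row.fromSeq(substrString(line, ColLengthSeq)))

spark.createDataFrame(rowRdd, schema)
  .write.mode("overwrite").parquet("C:\\home\\wide_rows")   // placeholder output path
```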

How to catch an exception that occurred on a spark worker?

import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.regression.LabeledPoint

// Case is the input RDD of documents (sequences of terms).
val HTF = new HashingTF(50000)
val Tf = Case.map(row => HTF.transform(row)).cache()
val Idf = new IDF().fit(Tf)

try {
  Idf.transform(Tf).map(x => LabeledPoint(1, x))
} catch {
  case ex: Throwable =>
    println(ex.getMessage)
}
Code like this isn't working.
HashingTF and IDF belong to org.apache.spark.mllib.feature.
I'm still getting an exception that says
org.apache.spark.SparkException: Failed to get broadcast_5_piece0 of broadcast_5
I cannot see any of my files in the error log; how do I debug this?
It seems that the worker ran out of memory.
Immediate temporary fix:
Run the application without caching: just remove .cache().
How to debug:
The Spark UI probably has the complete exception details.
Check the stage details.
Check the logs and the thread dump in the Executors tab.
If you find multiple exceptions or errors, try to resolve them in sequence.
Most of the time, resolving the first error will resolve the subsequent ones.
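As background for why the try/catch in the question never fires: transformations such as map are lazy, so a failure on a worker only reaches the driver when an action forces evaluation. A sketch of catching it at that point, using scala.util.Try (the count() action and the variable names are just for illustration):
```
import scala.util.{Failure, Success, Try}

// The map itself is lazy; any SparkException is thrown on the driver
// only when an action (here, count) forces the job to run.
val labeled = Idf.transform(Tf).map(x => LabeledPoint(1, x))

Try(labeled.count()) match {
  case Success(n)  => println(s"Labeled $n points")
  case Failure(ex) => println(s"Job failed: ${ex.getMessage}")
}
```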

GCS Connector Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found

We are trying to run Hive queries on HDP 2.1 using the GCS connector. It was working fine until yesterday, but since this morning our jobs have randomly started failing. When we restart them manually they work fine. I suspect it has something to do with the number of parallel Hive jobs running at a given point in time.
Below is the error message:
vertexId=vertex_1407434664593_37527_2_00, diagnostics=[Vertex Input: audience_history initializer failed., java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found]
DAG failed due to vertex failure. failedVertices:1 killedVertices:0
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask
Any help will be highly appreciated.
Thanks!