Spark job which writes data frame to HDFS is aborted at FileFormatWriter.scala:196 - scala

I am trying to store a data frame to HDFS using the following Spark Scala code.
All the columns in the data frame are nullable = true
Intermediate_data_final.coalesce(100).write
.option("header", value = true)
.option("compression", "bzip2")
.mode(SaveMode.Append)
.csv(path)
But I am getting this error:
2019-08-08T17:22:21.108+0000: [GC (Allocation Failure) [PSYoungGen: 979968K->34277K(1014272K)] 1027111K->169140K(1473536K), 0.0759544 secs] [Times: user=0.61 sys=0.18, real=0.07 secs]
2019-08-08T17:22:32.032+0000: [GC (Allocation Failure) [PSYoungGen: 1014245K->34301K(840192K)] 1149108K->263054K(1299456K), 0.0540687 secs] [Times: user=0.49 sys=0.13, real=0.05 secs]
Job aborted.
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)
Can anybody please help me with this?

Not the fix for your problem, I'm afraid, but if anyone is getting this issue in PySpark, I managed to solve it by separating the query execution and the write execution into separate commands.
df.select(foo).filter(bar).map(baz).write.parquet(out_path)
The above would fail with this error message (for a 3.5 GB dataframe), but the version below worked fine:
x = df.select(foo).filter(bar).map(baz)
x.write.parquet(out_path)
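For the Scala writer in the original question, a rough equivalent of this split might look like the sketch below. Note that simply assigning the transformed DataFrame to a val does not force execution (Spark is lazy), so this sketch also persists and materializes it first; that extra step is an assumption on my part, not part of the answer above. Intermediate_data_final and path come from the question.

import org.apache.spark.sql.SaveMode

// Sketch only: stage the transformed result before writing it.
val staged = Intermediate_data_final.coalesce(100).persist()
staged.count() // forces evaluation of the transformations before the write starts

staged.write
  .option("header", value = true)
  .option("compression", "bzip2")
  .mode(SaveMode.Append)
  .csv(path)

staged.unpersist()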

Related

Druid ingestion problem - memory allocation

I am having issues on the ingestion side.
I have around 290,000 segments.
After a few days of perfect ingestion processing, Druid starts failing to ingest.
The errors I am getting are:
GC (Allocation Failure) [PSYoungGen: 1588944K->210544K(1736192K)] 2383728K->1021504K(2595328K), 0.0384942 secs] [Times: user=0.81 sys=0.03, real=0.04 secs]
[Full GC (Ergonomics) [PSYoungGen: 210544K->76192K(1736192K)] [ParOldGen: 810960K->858732K(2012160K)] 1021504K->934924K(3748352K), [Metaspace: 91328K->91306K(1132544K)], 0.7209968 secs] [Times: user=17.12 sys=0.90, real=0.72 secs]
[GC (Allocation Failure) [PSYoungGen: 1461664K->84288K(1727488K)] 2320396K->948612K(3739648K), 0.0516311 secs] [Times: user=1.31 sys=0.05, real=0.05 secs]
[GC (Allocation Failure) [PSYoungGen: 1469760K->90160K(1751040K)] 2334084K->960068K(3763200K), 0.0242025 secs] [Times: user=0.41 sys=0.08, real=0.03 secs]
[GC (Allocation Failure) [PSYoungGen: 1495600K->93888K(1499648K)] 2365508K->969628K(3511808K), 0.0283021 secs] [Times: user=0.53 sys=0.05, real=0.03 secs]
[GC (Allocation Failure) [PSYoungGen: 1499328K->100265K(1754624K)] 2375068K->981617K(3766784K), 0.0258001 secs] [Times: user=0.46 sys=0.02, real=0.03 secs]
[GC (Allocation Failure) [PSYoungGen: 1513897K->103760K(1517568K)] 2395249K->990416K(3529728K), 0.0244750 secs] [Times: user=0.50 sys=0.05, real=0.03 secs]
[GC (Allocation Failure) [PSYoungGen: 1517392K->111120K(1757696K)] 2404048K->1002608K(3769856K), 0.0281102 secs] [Times: user=0.60 sys=0.04, real=0.03 secs]
[GC (Allocation Failure) [PSYoungGen: 1543184K->117936K(1747456K)] 2434672K->1014504K(3759616K), 0.0252510 secs] [Times: user=0.60 sys=0.08, real=0.03 secs]
[GC (Allocation Failure) [PSYoungGen: 1550000K->119120K(1770496K)] 2446568K->1021056K(3782656K), 0.0747994 secs] [Times: user=1.04 sys=0.01, real=0.08 secs]
[GC (Allocation Failure) [PSYoungGen: 1586512K->127088K(1760256K)] 2488448K->1033752K(3772416K), 0.0293217 secs] [Times: user=0.53 sys=0.24, real=0.02 secs]
[GC (Allocation Failure) [PSYoungGen: 1594480K->134816K(1791488K)] 2501144K->1045960K(3803648K), 0.0366369 secs] [Times: user=0.79 sys=0.05, real=0.04 secs]
[GC (Allocation Failure) [PSYoungGen: 1648800K->142384K(1781760K)] 2559944K->1058048K(3793920K), 0.0445734 secs] [Times: user=0.98 sys=0.11, real=0.05 secs]
[GC (Allocation Failure) [PSYoungGen: 1656368K->146752K(1816576K)] 2572032K->1066928K(3828736K), 0.0641577 secs] [Times: user=0.86 sys=0.02, real=0.06 secs]
[GC (System.gc()) [PSYoungGen: 273231K->143044K(1807360K)] 1193408K->1067716K(3819520K), 0.1127108 secs] [Times: user=0.88 sys=0.00, real=0.11 secs]
[Full GC (System.gc()) [PSYoungGen: 143044K->0K(1807360K)] [ParOldGen: 924672K->1061214K(2012160K)] 1067716K->1061214K(3819520K), [Metaspace: 91350K->91350K(1132544K)], 2.7635234 secs] [Times: user=40.86 sys=0.43, real=2.76 secs]
[GC (System.gc())
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00000007be200000, 9437184, 0) failed; error='Cannot allocate memory' (errno=12)
There is insufficient memory for the Java Runtime Environment to continue.
Native memory allocation (mmap) failed to map 9437184 bytes for committing reserved memory.
An error report file with more information is saved as:
/opt/druid/hs_err_pid23725.log
2021-08-02 15:57:02,759 main DEBUG AsyncLogger.ThreadNameStrategy=CACHED
[GC (Metadata GC Threshold) [PSYoungGen: 193994K->15015K(305664K)] 193994K->15103K(1005056K), 0.0359705 secs] [Times: user=0.61 sys=0.12, real=0.04 secs]
[Full GC (Metadata GC Threshold) [PSYoungGen: 15015K->0K(305664K)] [ParOldGen: 88K->13777K(699392K)] 15103K->13777K(1005056K), [Metaspace: 20552K->20552K(1067008K)], 0.0417042 secs] [Times: user=0.71 sys=0.00, real=0.04 secs]
[GC (Metadata GC Threshold) [PSYoungGen: 227685K->17760K(397824K)] 241462K->31545K(1097216K), 0.0317489 secs] [Times: user=0.55 sys=0.09, real=0.03 secs]
[Full GC (Metadata GC Threshold) [PSYoungGen: 17760K->0K(397824K)] [ParOldGen: 13785K->21057K(699392K)] 31545K->21057K(1097216K), [Metaspace: 34143K->34143K(1079296K)], 0.0494124 secs] [Times: user=0.70 sys=0.03, real=0.05 secs]
[GC (Allocation Failure) [PSYoungGen: 379904K->19134K(542208K)] 400961K->40199K(1241600K), 0.0226037 secs] [Times: user=0.11 sys=0.01, real=0.02 secs]
[GC (Metadata GC Threshold) [PSYoungGen: 208984K->11023K(580096K)] 230049K->32097K(1279488K), 0.0193193 secs] [Times: user=0.27 sys=0.01, real=0.02 secs]
[Full GC (Metadata GC Threshold) [PSYoungGen: 11023K->0K(580096K)] [ParOldGen: 21073K->19920K(904192K)] 32097K->19920K(1484288K), [Metaspace: 57781K->57781K(1101824K)], 0.1619206 secs] [Times: user=3.53 sys=0.03, real=0.16 secs]
[GC (Allocation Failure) [PSYoungGen: 536576K->39126K(580096K)] 556496K->59054K(1484288K), 0.0221522 secs] [Times: user=0.39 sys=0.05, real=0.02 secs]
[GC (Allocation Failure) [PSYoungGen: 573446K->36851K(705024K)] 593374K->111081K(1609216K), 0.0349221 secs] [Times: user=0.91 sys=0.23, real=0.03 secs]
Finished peon task
Heap
 PSYoungGen total 705024K, used 437257K [0x0000000740000000, 0x0000000773100000, 0x00000007c0000000)
  eden space 668160K, 59% used [0x0000000740000000,0x00000007587059b8,0x0000000768c80000)
  from space 36864K, 99% used [0x000000076d480000,0x000000076f87ccc0,0x000000076f880000)
  to space 73728K, 0% used [0x0000000768c80000,0x0000000768c80000,0x000000076d480000)
 ParOldGen total 904192K, used 74230K [0x0000000640000000, 0x0000000677300000, 0x0000000740000000)
  object space 904192K, 8% used [0x0000000640000000,0x000000064487db20,0x0000000677300000)
 Metaspace used 85948K, capacity 87888K, committed 88064K, reserved 1126400K
  class space used 10616K, capacity 11184K, committed 11264K, reserved 1048576K

spark scala 'take(10)' operation is taking too long

I have the following code in my super-simple Spark Scala application:
...
val t3 = System.currentTimeMillis
println("VertexRDD created in " + (t3 - t2) + " ms")
vertRDD.cache
val t4 = System.currentTimeMillis
println("VertexRDD size : "+vertRDD.partitions.size)
println("VertexRDD cached in " + (t4 - t3) + " ms")
vertRDD.take(10).foreach(println)
println("VertexRDD size : "+vertRDD.partitions.size)
...
I submit my app to EMR Apache Spark cluster with command
spark-submit --deploy-mode cluster --master yarn --num-executors 4 --executor-memory 6g --driver-memory 6g --class com.****.TestSpark s3://****.jar
Regarding vertRDD: in total there are 250k records (I'm reading them from a database; it is about 25 MB of data).
As you can see from the code, I'm caching the RDD a few lines before calling this line (#175) below:
vertRDD.take(10).foreach(println) - line #175 of my app
When I look into the Spark history I can see that memory and the other parameters are underutilized - barely 60 MB used against several gigabytes available when this line is executed - and, abnormally, it always runs for more than 15 minutes; in some cases it even fails to finish and the cluster becomes 'terminated with errors'.
The EMR cluster I'm running has 1 m5.2xlarge master and 4 m5.2xlarge core nodes, and it still fails in many cases! I can't understand what is going on!
UPD: After digging in the EMR console I can see that most of the time is spent in garbage collection,
and I also see that one of the 2 workers has been stopped by YARN; this is the log from it:
2021-02-12T20:17:01.404+0000: [GC (Allocation Failure) [PSYoungGen: 126976K->9341K(147968K)] 126976K->9357K(486912K), 0.0076611 secs] [Times: user=0.04 sys=0.00, real=0.01 secs]
2021-02-12T20:17:02.068+0000: [GC (Allocation Failure) [PSYoungGen: 136317K->9547K(147968K)] 136333K->9579K(486912K), 0.0079604 secs] [Times: user=0.03 sys=0.02, real=0.01 secs]
2021-02-12T20:17:02.317+0000: [GC (Metadata GC Threshold) [PSYoungGen: 80014K->8203K(147968K)] 80046K->8243K(486912K), 0.0047442 secs] [Times: user=0.02 sys=0.00, real=0.00 secs]
2021-02-12T20:17:02.321+0000: [Full GC (Metadata GC Threshold) [PSYoungGen: 8203K->0K(147968K)] [ParOldGen: 40K->7927K(195584K)] 8243K->7927K(343552K), [Metaspace: 20290K->20290K(1067008K)], 0.0239302 secs] [Times: user=0.10 sys=0.01, real=0.02 secs]
2021-02-12T20:17:02.885+0000: [GC (Allocation Failure) [PSYoungGen: 126976K->4351K(195584K)] 134903K->12286K(391168K), 0.0042397 secs] [Times: user=0.00 sys=0.00, real=0.01 secs]
2021-02-12T20:17:03.438+0000: [GC (Allocation Failure) [PSYoungGen: 195327K->9196K(258560K)] 203262K->17139K(454144K), 0.0076206 secs] [Times: user=0.02 sys=0.01, real=0.01 secs]
2021-02-12T20:17:03.511+0000: [GC (Metadata GC Threshold) [PSYoungGen: 45869K->4857K(301568K)] 53813K->12800K(497152K), 0.0045228 secs] [Times: user=0.02 sys=0.00, real=0.01 secs]
2021-02-12T20:17:03.515+0000: [Full GC (Metadata GC Threshold) [PSYoungGen: 4857K->0K(301568K)] [ParOldGen: 7943K->10963K(274944K)] 12800K->10963K(576512K), [Metaspace: 33870K->33868K(1079296K)], 0.0268540 secs] [Times: user=0.09 sys=0.00, real=0.02 secs]
2021-02-12T20:17:04.638+0000: [GC (Allocation Failure) [PSYoungGen: 289792K->11772K(301568K)] 300755K->24419K(576512K), 0.0113583 secs] [Times: user=0.03 sys=0.01, real=0.01 secs]
2021-02-12T20:17:07.984+0000: [GC (Metadata GC Threshold) [PSYoungGen: 273980K->14305K(448000K)] 286626K->27278K(722944K), 0.0115704 secs] [Times: user=0.05 sys=0.01, real=0.02 secs]
2021-02-12T20:17:07.995+0000: [Full GC (Metadata GC Threshold) [PSYoungGen: 14305K->0K(448000K)] [ParOldGen: 12972K->23489K(372736K)] 27278K->23489K(820736K), [Metaspace: 53854K->52909K(1099776K)], 0.1044483 secs] [Times: user=0.57 sys=0.02, real=0.10 secs]
2021-02-12T20:17:10.207+0000: [GC (Allocation Failure) [PSYoungGen: 433664K->16376K(462848K)] 457153K->62952K(835584K), 0.0293058 secs] [Times: user=0.17 sys=0.02, real=0.03 secs]
2021-02-12T20:17:12.893+0000: [GC (Allocation Failure) [PSYoungGen: 462840K->27642K(481280K)] 509416K->328728K(854016K), 0.2258796 secs] [Times: user=1.57 sys=0.22, real=0.23 secs]
2021-02-12T20:17:13.119+0000: [Full GC (Ergonomics) [PSYoungGen: 27642K->0K(481280K)] [ParOldGen: 301086K->317625K(916480K)] 328728K->317625K(1397760K), [Metaspace: 63821K->63816K(1110016K)], 1.6353318 secs] [Times: user=10.11 sys=0.08, real=1.64 secs]
2021-02-12T20:17:15.068+0000: [GC (Allocation Failure) [PSYoungGen: 453632K->75168K(579584K)] 771257K->523874K(1496064K), 0.0906250 secs] [Times: user=0.59 sys=0.13, real=0.09 secs]
2021-02-12T20:17:15.514+0000: [GC (Allocation Failure) [PSYoungGen: 528800K->2329K(671232K)] 977506K->451043K(1587712K), 0.0152511 secs] [Times: user=0.11 sys=0.00, real=0.01 secs]
2021-02-12T20:17:15.945+0000: [GC (Allocation Failure) [PSYoungGen: 543001K->76277K(669696K)] 991715K->983751K(1586176K), 0.1116201 secs] [Times: user=0.54 sys=0.35, real=0.12 secs]
2021-02-12T20:17:16.057+0000: [Full GC (Ergonomics) [PSYoungGen: 76277K->0K(669696K)] [ParOldGen: 907474K->523576K(1430528K)] 983751K->523576K(2100224K), [Metaspace: 65321K->65321K(1110016K)], 0.9539858 secs] [Times: user=7.26 sys=0.01, real=0.95 secs]
2021-02-12T20:17:17.427+0000: [GC (Allocation Failure) [PSYoungGen: 540672K->7657K(679936K)] 1064248K->531242K(2110464K), 0.0102141 secs] [Times: user=0.06 sys=0.00, real=0.01 secs]
2021-02-12T20:17:17.914+0000: [GC (Allocation Failure) [PSYoungGen: 637929K->102391K(760832K)] 1161514K->1215807K(2191360K), 0.1063215 secs] [Times: user=0.58 sys=0.20, real=0.10 secs]
2021-02-12T20:17:18.020+0000: [Full GC (Ergonomics) [PSYoungGen: 102391K->0K(760832K)] [ParOldGen: 1113416K->454233K(1679872K)] 1215807K->454233K(2440704K), [Metaspace: 65779K->65764K(1112064K)], 0.0906173 secs] [Times: user=0.39 sys=0.00, real=0.09 secs]
2021-02-12T20:17:18.733+0000: [GC (Allocation Failure) [PSYoungGen: 630272K->17588K(888832K)] 1084505K->471830K(2568704K), 0.0175248 secs] [Times: user=0.03 sys=0.01, real=0.02 secs]
2021-02-12T20:17:19.399+0000: [GC (Allocation Failure) [PSYoungGen: 778420K->29288K(900608K)] 1232662K->483537K(2580480K), 0.0225306 secs] [Times: user=0.05 sys=0.03, real=0.02 secs]
2021-02-12T20:17:20.012+0000: [GC (Allocation Failure) [PSYoungGen: 790120K->18446K(962560K)] 1244369K->472704K(2642432K), 0.0210335 secs] [Times: user=0.04 sys=0.01, real=0.02 secs]
2021-02-12T20:17:20.738+0000: [GC (Allocation Failure) [PSYoungGen: 866830K->18574K(975360K)] 1321088K->472840K(2655232K), 0.0235178 secs] [Times: user=0.07 sys=0.01, real=0.02 secs]
2021-02-12T20:17:21.412+0000: [GC (Allocation Failure) [PSYoungGen: 866958K->31878K(1034240K)] 1321224K->486152K(2714112K), 0.0243945 secs] [Times: user=0.04 sys=0.04, real=0.03 secs]
2021-02-12T20:17:22.599+0000: [GC (Allocation Failure) [PSYoungGen: 964742K->53206K(1047040K)] 1419016K->507488K(2726912K), 0.0283320 secs] [Times: user=0.08 sys=0.03, real=0.03 secs]
2021-02-12T20:17:23.132+0000: [GC (Allocation Failure) [PSYoungGen: 986070K->23551K(1113088K)] 1440352K->477840K(2792960K), 0.0177533 secs] [Times: user=0.06 sys=0.00, real=0.02 secs]
2021-02-12T20:17:23.604+0000: [GC (Allocation Failure) [PSYoungGen: 1037311K->28486K(1121280K)] 1491600K->482783K(2801152K), 0.0183161 secs] [Times: user=0.03 sys=0.03, real=0.02 secs]
2021-02-12T20:17:24.024+0000: [GC (Allocation Failure) [PSYoungGen: 1042246K->36085K(1196032K)] 1496543K->490390K(2875904K), 0.0191460 secs] [Times: user=0.04 sys=0.03, real=0.02 secs]
2021-02-12T20:17:24.584+0000: [GC (Allocation Failure) [PSYoungGen: 1139957K->50496K(1199616K)] 1594262K->504809K(2879488K), 0.0207042 secs] [Times: user=0.05 sys=0.01, real=0.02 secs]
2021-02-12T20:17:25.046+0000: [GC (Allocation Failure) [PSYoungGen: 1154368K->47787K(1273344K)] 1608681K->502108K(2953216K), 0.0271859 secs] [Times: user=0.07 sys=0.03, real=0.02 secs]
2021-02-12T20:17:25.520+0000: [GC (Allocation Failure) [PSYoungGen: 1225899K->50015K(1271296K)] 1680220K->504344K(2951168K), 0.0199173 secs] [Times: user=0.06 sys=0.01, real=0.02 secs]
2021-02-12T20:17:26.012+0000: [GC (Allocation Failure) [PSYoungGen: 1228127K->28438K(1347584K)] 1682456K->482776K(3027456K), 0.0222568 secs] [Times: user=0.04 sys=0.02, real=0.03 secs]
2021-02-12T20:17:26.519+0000: [GC (Allocation Failure) [PSYoungGen: 1290518K->21046K(1350656K)] 1744856K->475392K(3030528K), 0.0208783 secs] [Times: user=0.04 sys=0.01, real=0.02 secs]
2021-02-12T20:17:27.004+0000: [GC (Allocation Failure) [PSYoungGen: 1283126K->51072K(1436672K)] 1737472K->505426K(3116544K), 0.0248668 secs] [Times: user=0.06 sys=0.03, real=0.03 secs]
2021-02-12T20:17:27.523+0000: [GC (Allocation Failure) [PSYoungGen: 1401216K->49452K(1437184K)] 1855570K->503966K(3117056K), 0.0230231 secs] [Times: user=0.07 sys=0.00, real=0.03 secs]
2021-02-12T20:17:28.038+0000: [GC (Allocation Failure) [PSYoungGen: 1399596K->42078K(1528832K)] 1854110K->496648K(3208704K), 0.0247465 secs] [Times: user=0.06 sys=0.02, real=0.02 secs]
2021-02-12T20:17:28.670+0000: [GC (Allocation Failure) [PSYoungGen: 1491038K->24493K(1531392K)] 1945608K->479087K(3211264K), 0.0582659 secs] [Times: user=0.15 sys=0.00, real=0.06 secs]
2021-02-12T20:17:29.633+0000: [GC (Allocation Failure) [PSYoungGen: 1473453K->31079K(1612800K)] 1928047K->486008K(3292672K), 0.0336889 secs] [Times: user=0.05 sys=0.02, real=0.04 secs]
2021-02-12T20:17:30.843+0000: [GC (Allocation Failure) [PSYoungGen: 1575783K->46063K(1622528K)] 2030712K->501032K(3302400K), 0.0422580 secs] [Times: user=0.09 sys=0.01, real=0.04 secs]
2021-02-12T20:17:32.433+0000: [GC (Allocation Failure) [PSYoungGen: 1590767K->24292K(1703424K)] 2045736K->480558K(3383296K), 0.0506315 secs] [Times: user=0.08 sys=0.02, real=0.05 secs]
2021-02-12T20:17:34.324+0000: [GC (Allocation Failure) [PSYoungGen: 1659108K->24958K(1710592K)] 2115374K->481281K(3390464K), 0.0576808 secs] [Times: user=0.13 sys=0.00, real=0.06 secs]
Heap
PSYoungGen total 1710592K, used 1467342K [0x0000000740000000, 0x00000007b2400000, 0x00000007c0000000)
eden space 1634816K, 88% used [0x0000000740000000,0x0000000798093f40,0x00000007a3c80000)
from space 75776K, 32% used [0x00000007a3c80000,0x00000007a54dfb78,0x00000007a8680000)
to space 73728K, 0% used [0x00000007adc00000,0x00000007adc00000,0x00000007b2400000)
ParOldGen total 1679872K, used 456322K [0x0000000640000000, 0x00000006a6880000, 0x0000000740000000)
object space 1679872K, 27% used [0x0000000640000000,0x000000065bda0b40,0x00000006a6880000)
Metaspace used 71040K, capacity 76834K, committed 76948K, reserved 1116160K
class space used 9093K, capacity 9852K, committed 9876K, reserved 1048576K
I'm still baffled as to why it can't process 25 MB of data...
It seems like you have a huge number of JVM objects in your tasks. The answer would be in two parts (a combined spark-submit example follows below):
Decrease parallelism by passing --executor-cores 4 and increase memory with --executor-memory 8g
Pass additional JVM params to your driver and executors to switch the GC to G1:
--conf "spark.executor.extraJavaOptions=-XX:+UseG1GC"
--conf "spark.driver.extraJavaOptions=-XX:+UseG1GC"
Make sure you run on YARN:
--master yarn
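Put together with the submit command from the question, these suggestions might look roughly like this (a sketch only; the masked class and jar names are copied from the question as-is):
spark-submit --deploy-mode cluster --master yarn \
  --num-executors 4 --executor-cores 4 \
  --executor-memory 8g --driver-memory 6g \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC" \
  --conf "spark.driver.extraJavaOptions=-XX:+UseG1GC" \
  --class com.****.TestSpark s3://****.jar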
Actions like take or count trigger DAG execution, which takes some time. You can do the following things to reduce that time:
Cache or persist intermediate results (see the sketch after this list)
If the data is small, run it on a single machine
Use CloudWatch to monitor your EMR cluster and check the available YARN memory and container-pending ratio during the run; these indicate whether your job is lacking resources.
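As a rough sketch of the first point for the question's RDD: cache()/persist() are lazy, so the cache is only populated by the first action. Running a cheap action such as count() pays that cost once, so that the later take(10) reads from cached partitions. vertRDD comes from the question; the storage level here is my assumption.

import org.apache.spark.storage.StorageLevel

vertRDD.persist(StorageLevel.MEMORY_AND_DISK) // lazy: nothing is cached yet
val total = vertRDD.count()                   // first action materializes the cache
println(s"VertexRDD count: $total")
vertRDD.take(10).foreach(println)             // now served from cached partitions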

Zeppelin exception with Spark Basic Features

I teach a class on Scala and Spark. I've been demonstrating Zeppelin for five years now (and using it for a bit longer than that).
For the last couple of years, whenever I demonstrate Zeppelin, using the distribution out-of-the-box, I can only show the Spark Basic Features Notebook. When I do this, all of the paragraphs come up as they should. If I try to change the age in the age field, or simply try to re-run any of the paragraphs, I get an exception.
I repeat: this is out-of-the-box. I downloaded the 0.9.0-preview2 version, started the daemon, and opened the provided notebook. Any ideas? I'm on a MacBook Pro with OS 10.15.7. I also have Spark spark-3.0.1-bin-hadoop2.7 installed.
Here's the error that I get:
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.zeppelin.spark.SparkSqlInterpreter.internalInterpret(SparkSqlInterpreter.java:105)
at org.apache.zeppelin.interpreter.AbstractInterpreter.interpret(AbstractInterpreter.java:47)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:110)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:776)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:668)
at org.apache.zeppelin.scheduler.Job.run(Job.java:172)
at org.apache.zeppelin.scheduler.AbstractScheduler.runJob(AbstractScheduler.java:130)
at org.apache.zeppelin.scheduler.ParallelScheduler.lambda$runJobInScheduler$0(ParallelScheduler.java:39)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.sql.AnalysisException: Table or view not found: bank; line 2 pos 5;
'Sort ['age ASC NULLS FIRST], true
+- 'Aggregate ['age], ['age, count(1) AS value#4L]
+- 'Filter ('age < 30)
+- 'UnresolvedRelation [bank]
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:106)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:92)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:177)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:176)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:176)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:176)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:176)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:176)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:176)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:176)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:176)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:176)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:92)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:89)
at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:130)
at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:156)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:153)
at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:68)
at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:133)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:133)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:68)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:66)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:58)
at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:650)
... 15 more

java.lang.IllegalArgumentException: Illegal sequence boundaries Spark

I am using Azure Databricks and Scala. I want to show() a DataFrame, but I get an error that I cannot understand and would like to solve. The lines of code I have are:
println("----------------------------------------------------------------Printing schema")
df.printSchema()
println("----------------------------------------------------------------Printing dataframe")
df.show()
println("----------------------------------------------------------------Error before")
The standard output is the following; the message "----------------------------------------------------------------Error before" does not appear.
> ----------------------------------------------------------------Printing schema
> root
> |-- processed: integer (nullable = false)
> |-- processDatetime: string (nullable = false)
> |-- executionDatetime: string (nullable = false)
> |-- executionSource: string (nullable = false)
> |-- executionAppName: string (nullable = false)
>
> ----------------------------------------------------------------Printing dataframe
> 2020-02-18T14:19:00.069+0000: [GC (Allocation Failure) [PSYoungGen: 1497248K->191833K(1789440K)] 2023293K->717886K(6063104K), 0.0823288 secs] [Times: user=0.18 sys=0.02, real=0.09 secs]
> 2020-02-18T14:19:40.823+0000: [GC (Allocation Failure) [PSYoungGen: 1637209K->195574K(1640960K)] 2163262K->721635K(5914624K), 0.0483384 secs] [Times: user=0.17 sys=0.00, real=0.05 secs]
> 2020-02-18T14:19:44.843+0000: [GC (Allocation Failure) [PSYoungGen: 1640950K->139092K(1809920K)] 2167011K->665161K(6083584K), 0.0301711 secs] [Times: user=0.11 sys=0.00, real=0.03 secs]
> 2020-02-18T14:19:50.910+0000: Track exception: Job aborted due to stage failure: Task 59 in stage 62.0 failed 4 times, most recent failure: Lost task 59.3 in stage 62.0 (TID 2672, 10.139.64.6, executor 1): java.lang.IllegalArgumentException: Illegal sequence boundaries: 1581897600000000 to 1581811200000000 by 86400000000
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage23.processNext(Unknown Source)
> at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$15$$anon$2.hasNext(WholeStageCodegenExec.scala:659)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
> at org.apache.spark.scheduler.Task.doRunTask(Task.scala:139)
> at org.apache.spark.scheduler.Task.run(Task.scala:112)
> at org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1526)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
>
> Driver stacktrace:.
> 2020-02-18T14:19:50.925+0000: Track message: Process finished with exit code 1. Metric: Writer. Value: 1.0.
It's hard to know exactly without seeing your code, but I had a similar error and the other answer (about int being out of range) led me astray.
The java.lang.IllegalArgumentException you are getting is confusing but is actually quite specific:
Illegal sequence boundaries: 1581897600000000 to 1581811200000000 by 86400000000
This error is complaining that you are using the sequence() Spark SQL function and telling it to go from 1581897600000000 to 1581811200000000 by 86400000000. It's easy to miss because of the big numbers, but this is an instruction to go from a larger number to a smaller number by a positive increment - e.g., from 12 to 6 by 3.
This is not allowed according to the Databricks documentation:
start - an expression. The start of the range.
stop - an expression. The end of the range (inclusive).
step - an optional expression. The step of the range. By default step is 1 if start is less than or equal to stop, otherwise -1. For temporal sequences it's 1 day and -1 day respectively. If start is greater than stop then the step must be negative, and vice versa.
Additionally, I believe the other answer's focus on the int column is misleading. The large numbers mentioned in the illegal sequence error look like they are coming from a date column. You don't have any DateType columns but your string columns are named like date columns; presumably you are using them in a sequence function and they are getting coerced into dates.
You can get this error when you attempt to
sequence(start_date, end_date, [interval])
on a table in which some start_dates are less than their end_dates and others are greater.
When applying this function, all of the date ranges should run in the same direction (all ascending or all descending), not mixed.
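As a minimal illustration of this in Spark Scala (a sketch; the session setup, column names, and sample rows are assumptions, not taken from the question):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.master("local[*]").appName("sequence-demo").getOrCreate()
import spark.implicits._

val df = Seq(
  ("2020-02-15", "2020-02-17"), // start <= stop: fine with a +1 day step
  ("2020-02-17", "2020-02-16")  // start > stop: illegal with a +1 day step
).toDF("start_date", "end_date")

// Throws java.lang.IllegalArgumentException: Illegal sequence boundaries ...
// for the second row, because the step is positive while start > stop.
df.select(expr("sequence(to_date(start_date), to_date(end_date), interval 1 day)")).show(false)

// One way around it: only build the sequence when the bounds are ordered.
df.select(
  when(to_date(col("start_date")) <= to_date(col("end_date")),
    expr("sequence(to_date(start_date), to_date(end_date), interval 1 day)"))
    .alias("dates")
).show(false)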
Your schema is expecting an int, and an int in Java has a range of -2,147,483,648 to +2,147,483,647.
So I would change the schema from int to long.

Unable to write spark dataframe to a parquet file format to C drive in PySpark

I am using the following command to try to write a Spark (2.4.4, using an Anaconda 3 Jupyter Notebook) dataframe to a parquet file in PySpark, and I get a very strange error message that I cannot resolve. I would appreciate any insights anyone has.
df.write.mode("overwrite").parquet("test/")
Error message is as follows:
--------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-37-2b4a1d75a5f6> in <module>()
1 # df.write.partitionBy("AB").parquet("C:/test.parquet",mode='overwrite')
----> 2 df.write.mode("overwrite").parquet("test/")
3 # df.write.mode('SaveMode.Overwrite').parquet("C:/test.parquet")
C:\spark-2.4.4-bin-hadoop2.7\python\pyspark\sql\readwriter.py in parquet(self, path, mode, partitionBy, compression)
841 self.partitionBy(partitionBy)
842 self._set_opts(compression=compression)
--> 843 self._jwrite.parquet(path)
844
845 #since(1.6)
C:\spark-2.4.4-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py in __call__(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
1259 for temp_arg in temp_args:
C:\spark-2.4.4-bin-hadoop2.7\python\pyspark\sql\utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
C:\spark-2.4.4-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling o862.parquet.
: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:566)
at sun.reflect.GeneratedMethodAccessor114.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 52.0 failed 1 times, most recent failure: Lost task 0.0 in stage 52.0 (TID 176, localhost, executor driver): java.io.IOException: (null) entry in command string: null chmod 0644 C:\Users\583621\OneDrive - Booz Allen Hamilton\Personal\Teaching\PySpark Essentials for Data Scientists\PySpark DataFrame Essentials\test\_temporary\0\_temporary\attempt_20191206164455_0052_m_000000_176\part-00000-2cd01dbe-9e3f-44a5-88e1-e904822024c2-c000.snappy.parquet
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:770)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:866)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:849)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:733)
at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:225)
at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:209)
at org.apache.hadoop.fs.RawLocalFileSystem.createOutputStreamWithMode(RawLocalFileSystem.java:307)
at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:296)
at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:328)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.<init>(ChecksumFileSystem.java:398)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:461)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:440)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:892)
at org.apache.parquet.hadoop.util.HadoopOutputFile.create(HadoopOutputFile.java:74)
at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:248)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:390)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:151)
at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:120)
at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:108)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:236)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:167)
... 32 more
Caused by: java.io.IOException: (null) entry in command string: null chmod 0644 C:\Users\583621\OneDrive - Booz Allen Hamilton\Personal\Teaching\PySpark Essentials for Data Scientists\PySpark DataFrame Essentials\test\_temporary\0\_temporary\attempt_20191206164455_0052_m_000000_176\part-00000-2cd01dbe-9e3f-44a5-88e1-e904822024c2-c000.snappy.parquet
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:770)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:866)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:849)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:733)
at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:225)
at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:209)
at org.apache.hadoop.fs.RawLocalFileSystem.createOutputStreamWithMode(RawLocalFileSystem.java:307)
at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:296)
at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:328)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.<init>(ChecksumFileSystem.java:398)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:461)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:440)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:892)
at org.apache.parquet.hadoop.util.HadoopOutputFile.create(HadoopOutputFile.java:74)
at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:248)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:390)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:151)
at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:120)
at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:108)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:236)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
... 1 more
You need to set a Hadoop home.
You can get the WINUTILS.EXE binary from a Hadoop redistribution; there is a repository of these for some Hadoop versions on GitHub.
Then either:
1) Set the environment variable %HADOOP_HOME% to point to the directory above the bin dir containing WINUTILS.EXE, or
2) Configure it in code:
import sys
import os

# Point HADOOP_HOME at your local Hadoop directory (the one whose bin folder
# contains winutils.exe); the paths below are just examples.
os.environ['HADOOP_HOME'] = "C:/Mine/Spark/hadoop-2.6.0"
sys.path.append("C:/Mine/Spark/hadoop-2.6.0/bin")
Hope this helps!