Pyspark dataframe not grouping by correctly - pyspark

I've had an application running successfully for months. Recently it started to fail because a method that takes in a pyspark DataFrame, groups it, and outputs the response somehow outputs a corrupt dataframe.
Here's example code of what I'm doing when everything fails:
from pyspark.sql.functions import sum, avg
group_by = pyspark_df_in.groupBy("dimension1", "dimension2", "dimension3")
pyspark_df_out = group_by.agg(sum("metric1").alias("MyMetric1"), sum("metric2").alias("MyMetric2")
If I print print(pyspark_df_in.head(1)), I correctly get the first row of my dataset. After grouping by a few dimensions and printing print(pyspark_df_out.head(2)), I get the following error. I get a similar error trying to do basically anything with this new grouped by dataset (and I know that the group by should result in data as I've eyeballed it to make sure).
19/10/08 15:13:00 WARN TaskSetManager: Stage 9 contains a task of very large size (282 KB). The maximum recommended task size is 100 KB.
19/10/08 15:13:00 ERROR Executor: Exception in task 1.0 in stage 9.0 (TID 40)
java.util.NoSuchElementException
at java.util.ArrayList$Itr.next(ArrayList.java:862)
at org.apache.arrow.vector.VectorLoader.loadBuffers(VectorLoader.java:76)
at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:61)
at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$2.nextBatch(ArrowConverters.scala:167)
at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$2.<init>(ArrowConverters.scala:144)
at org.apache.spark.sql.execution.arrow.ArrowConverters$.fromBatchIterator(ArrowConverters.scala:143)
at org.apache.spark.sql.execution.arrow.ArrowConverters$$anonfun$3.apply(ArrowConverters.scala:203)
at org.apache.spark.sql.execution.arrow.ArrowConverters$$anonfun$3.apply(ArrowConverters.scala:201)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
19/10/08 15:13:00 ERROR Executor: Exception in task 2.0 in stage 9.0 (TID 41)
java.util.NoSuchElementException
at java.util.ArrayList$Itr.next(ArrayList.java:862)
at org.apache.arrow.vector.VectorLoader.loadBuffers(VectorLoader.java:76)
at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:61)
at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$2.nextBatch(ArrowConverters.scala:167)
at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$2.<init>(ArrowConverters.scala:144)
at org.apache.spark.sql.execution.arrow.ArrowConverters$.fromBatchIterator(ArrowConverters.scala:143)
at org.apache.spark.sql.execution.arrow.ArrowConverters$$anonfun$3.apply(ArrowConverters.scala:203)
at org.apache.spark.sql.execution.arrow.ArrowConverters$$anonfun$3.apply(ArrowConverters.scala:201)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
19/10/08 15:13:00 WARN TaskSetManager: Lost task 1.0 in stage 9.0 (TID 40, localhost, executor driver): java.util.NoSuchElementException
at java.util.ArrayList$Itr.next(ArrayList.java:862)
at org.apache.arrow.vector.VectorLoader.loadBuffers(VectorLoader.java:76)
at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:61)
at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$2.nextBatch(ArrowConverters.scala:167)
at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$2.<init>(ArrowConverters.scala:144)
at org.apache.spark.sql.execution.arrow.ArrowConverters$.fromBatchIterator(ArrowConverters.scala:143)
at org.apache.spark.sql.execution.arrow.ArrowConverters$$anonfun$3.apply(ArrowConverters.scala:203)
at org.apache.spark.sql.execution.arrow.ArrowConverters$$anonfun$3.apply(ArrowConverters.scala:201)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
19/10/08 15:13:00 ERROR TaskSetManager: Task 1 in stage 9.0 failed 1 times; aborting job
19/10/08 15:13:00 WARN TaskSetManager: Lost task 0.0 in stage 9.0 (TID 39, localhost, executor driver): TaskKilled (Stage cancelled)
19/10/08 15:13:00 WARN TaskSetManager: Lost task 3.0 in stage 9.0 (TID 42, localhost, executor driver): TaskKilled (Stage cancelled)
19/10/08 15:13:00 WARN TaskSetManager: Lost task 10.0 in stage 9.0 (TID 49, localhost, executor driver): TaskKilled (Stage cancelled)
19/10/08 15:13:00 WARN TaskSetManager: Lost task 7.0 in stage 9.0 (TID 46, localhost, executor driver): TaskKilled (Stage cancelled)
19/10/08 15:13:00 WARN TaskSetManager: Lost task 6.0 in stage 9.0 (TID 45, localhost, executor driver): TaskKilled (Stage cancelled)
19/10/08 15:13:00 WARN TaskSetManager: Lost task 11.0 in stage 9.0 (TID 50, localhost, executor driver): TaskKilled (Stage cancelled)
19/10/08 15:13:00 WARN TaskSetManager: Lost task 9.0 in stage 9.0 (TID 48, localhost, executor driver): TaskKilled (Stage cancelled)
19/10/08 15:13:00 WARN TaskSetManager: Lost task 8.0 in stage 9.0 (TID 47, localhost, executor driver): TaskKilled (Stage cancelled)
19/10/08 15:13:00 WARN TaskSetManager: Lost task 4.0 in stage 9.0 (TID 43, localhost, executor driver): TaskKilled (Stage cancelled)
19/10/08 15:13:00 WARN TaskSetManager: Lost task 5.0 in stage 9.0 (TID 44, localhost, executor driver): TaskKilled (Stage cancelled)
Info about my environment:
Spark Context version = 2.4.3
Python version = 3.7
OS = Linux CentOS 7
Has anyone run into this issue or have any ideas how I can debug/fix it?

Related

Delta Lake: File Not Found Exception

I am using Delta Lake to perform merge operation, for which I am trying to convert my Parquet files to delta format which are partitioned over time:
val source = spark.read.parquet("s3a://data-lake/source/")
source
.write
.option("maxRecordsPerFile",20000)
.mode("overwrite")
.partitionBy("time")
//.option("fs.s3a.committer.name", "partitioned") (I even tried using s3a committers)
.format("delta")
.save("s3a://data-lake/target/")
The data is over around 250G and my spark configs are:
spark.cores.max 420
spark.default.parallelism 10000
spark.delta.logStore.class org.apache.spark.sql.delta.storage.S3SingleDriverLogStore
spark.driver.extraJavaOptions -Xms20g
spark.driver.memory 28g
spark.executor.cores 2
spark.executor.memory 28G
In the logs it shows File Not Found Error and eventually kills executors after running for some time:
20/05/16 11:51:02 WARN TaskSetManager: Lost task 208.0 in stage 2.0 (TID 3294, 172.16.145.25, executor 14): org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:257)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.FileNotFoundException: No such file or directory: s3a://data-lake/target/time=20190101/part-00208-2c9b5ddd-f2c1-4b8c-9d77-eacb0055ff82.c000.snappy.parquet
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:167)
... 53 more
Caused by: org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:257)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I tried saving it as parquet format, it does work as expected. But it takes significantly more time to process.
If it's already in parquet, you can copy the folder in s3 to the target place without spark, and initialize it as a delta table with the following line:
import io.delta.DeltaTable
DeltaTable.forPath("s3a://data-lake/target/"))
More info here.

ConnectionLoss for /hbase/meta-region-server

We encountered an error when run spark-submit command on AWS EMR cluster.
Our spark job previously was running on HDP cluster without any additional configuration in a scala code. When we start the job all executors throw errors about connection to Zookeeper node that don't exist on core nodes of EMR.
Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 8, ip-10-239-130-44.ec2.internal, executor 3): org.apache.hadoop.hbase.client.RetriesExhaustedException: Can't get the location for replica 0
at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:332)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:153)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:58)
at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithoutRetries(RpcRetryingCallerImpl.java:192)
at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:269)
at org.apache.hadoop.hbase.client.ClientScanner.loadCache(ClientScanner.java:437)
at org.apache.hadoop.hbase.client.ClientScanner.nextWithSyncCache(ClientScanner.java:312)
at org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:597)
at org.apache.spark.sql.execution.datasources.hbase.HBaseTableScanRDD$$anon$2.hasNext(HBaseTableScan.scala:286)
at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:217)
at scala.collection.Iterator$ConcatIterator.hasNext(Iterator.scala:192)
at org.apache.spark.sql.execution.datasources.hbase.HBaseTableScanRDD$$anon$3.hasNext(HBaseTableScan.scala:314)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at org.apache.spark.sql.execution.columnar.CachedRDDBuilder$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:142)
at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/meta-region-server
at org.apache.hadoop.hbase.client.ConnectionImplementation.get(ConnectionImplementation.java:2009)
at org.apache.hadoop.hbase.client.ConnectionImplementation.locateMeta(ConnectionImplementation.java:785)
at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegion(ConnectionImplementation.java:752)
at org.apache.hadoop.hbase.client.ConnectionImplementation.relocateRegion(ConnectionImplementation.java:735)
at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegionInMeta(ConnectionImplementation.java:927)
at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegion(ConnectionImplementation.java:755)
at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:325)
... 46 more
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/meta-region-server
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient$ZKTask$1.exec(ReadOnlyZKClient.java:189)
at org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient.run(ReadOnlyZKClient.java:323)
... 1 more
The Spark command:
spark-submit \
--conf spark.yarn.appMasterEnv.SPARK_HOME=/usr/lib/spark \
--conf spark.executor.memory=4g \
--conf spark.executor.cores=2 \
--class app_name \
--master yarn \
--deploy-mode client \
--jars ... -s hbase -i 13181_qatopilot -o /home/hadoop/s/d 2> last_ja-datascience.log
It seems that spark executor tried to connect on it's localhost to the port 2181 when actually Zookeeper node don't exist on executor:
20/02/02 09:00:19 WARN ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
20/02/02 09:00:19 WARN ReadOnlyZKClient: 0x19a6f47f to localhost:2181 failed for get of /hbase/hbaseid, code = CONNECTIONLOSS, retries = 28
20/02/02 09:00:20 INFO ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
20/02/02 09:00:20 WARN ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)

Can't Consume Messages When Using Kafka v.0.10.0.x

I'm using kafka v.0.10.2 on my cluster.
I can produce message fine using v.0.8.x and v0.10.2
BUT when consuming messages using client v0.10.0.x I'm having errors below;
WARN [ConsumerFetcherThread-console-consumer-myconsumer-0-1002], Error in fetch kafka.consumer.ConsumerFetcherThread$FetchRequest#16090d7a. Possible cause: java.nio.BufferUnderflowException (kafka.consumer.ConsumerFetcherThread)
ok,now my kafka.clien is v.0.8.x
but ihave a new problem
6 15:07:13 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, hadoop11, executor 10): org.apache.spark.SparkException: Task failed while writing rows
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:151)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: kafka.common.KafkaException: String exceeds the maximum size of 32767.
at kafka.api.ApiUtils$.shortStringLength(ApiUtils.scala:73)
at kafka.api.TopicData$.headerSize(FetchResponse.scala:107)
at kafka.api.TopicData.<init>(FetchResponse.scala:113)
at kafka.api.TopicData$.readFrom(FetchResponse.scala:103)
at kafka.api.FetchResponse$$anonfun$4.apply(FetchResponse.scala:170)
at kafka.api.FetchResponse$$anonfun$4.apply(FetchResponse.scala:169)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.Range.foreach(Range.scala:160)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
at kafka.api.FetchResponse$.readFrom(FetchResponse.scala:169)
at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:135)
at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.fetchBatch(KafkaRDD.scala:196)
at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:212)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1126)
at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1132)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
... 8 more
what show i do
String exceeds the maximum size of 32767.
Tell me about my final plan
I upgraded the version to 0.10.2

Scala error with gbt classifier FileNotFoundException

I have a problem when I fit my model, the df_train DataFrame already have the features vectors and the label vector, but I get this error, and I don't have any idea of what is going on. I tried everything but still I don't know what happens. Maybe it is a problem of memory with Spark or something else.
I try something else, I run the same script in a cluster of Hadoop and works fine, but when I run it in my local machine appears that error
val gbt = new GBTClassifier().setFeaturesCol("features").setLabelCol("label").setMaxIter(50)
val model = gbt.fit(df_train)
java.io.FileNotFoundException: C:\Users\USER\AppData\Local\Temp\blockmgr-634044d0-c0c8-43c2-b383-aa4c4a1f4f32\36\temp_shuffle_54d1a4b0-8738-4f54-becf-39d8c0d7a129 (El sistema no puede encontrar la ruta especificada)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at org.apache.spark.storage.DiskBlockObjectWriter$$anonfun$revertPartialWritesAndClose$2.apply$mcV$sp(DiskBlockObjectWriter.scala:217)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1386)
at org.apache.spark.storage.DiskBlockObjectWriter.revertPartialWritesAndClose(DiskBlockObjectWriter.scala:214)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.stop(BypassMergeSortShuffleWriter.java:237)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
18/03/21 14:27:52 ERROR BypassMergeSortShuffleWriter: Error while deleting file C:\Users\USER\AppData\Local\Temp\blockmgr-634044d0-c0c8-43c2-b383-aa4c4a1f4f32\36\temp_shuffle_54d1a4b0-8738-4f54-becf-39d8c0d7a129
18/03/21 14:27:52 ERROR Executor: Exception in task 0.0 in stage 710.0 (TID 706)
java.io.FileNotFoundException: C:\Users\USER\AppData\Local\Temp\blockmgr-634044d0-c0c8-43c2-b383-aa4c4a1f4f32\36\temp_shuffle_54d1a4b0-8738-4f54-becf-39d8c0d7a129 (El sistema no puede encontrar la ruta especificada)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:103)
at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116)
at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:237)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
18/03/21 14:27:52 WARN TaskSetManager: Lost task 0.0 in stage 710.0 (TID 706, localhost, executor driver): java.io.FileNotFoundException: C:\Users\USER\AppData\Local\Temp\blockmgr-634044d0-c0c8-43c2-b383-aa4c4a1f4f32\36\temp_shuffle_54d1a4b0-8738-4f54-becf-39d8c0d7a129 (El sistema no puede encontrar la ruta especificada)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:103)
at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116)
at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:237)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
18/03/21 14:27:52 ERROR TaskSetManager: Task 0 in stage 710.0 failed 1 times; aborting job
18/03/21 14:27:52 INFO TaskSchedulerImpl: Removed TaskSet 710.0, whose tasks have all completed, from pool
18/03/21 14:27:52 INFO TaskSchedulerImpl: Cancelling stage 710
18/03/21 14:27:52 INFO DAGScheduler: ShuffleMapStage 710 (flatMap at RandomForest.scala:921) failed in 0,044 s due to Job aborted due to stage failure: Task 0 in stage 710.0 failed 1 times, most recent failure: Lost task 0.0 in stage 710.0 (TID 706, localhost, executor driver): java.io.FileNotFoundException: C:\Users\USER\AppData\Local\Temp\blockmgr-634044d0-c0c8-43c2-b383-aa4c4a1f4f32\36\temp_shuffle_54d1a4b0-8738-4f54-becf-39d8c0d7a129 (El sistema no puede encontrar la ruta especificada)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:103)
at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116)
at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:237)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
18/03/21 14:27:52 INFO DAGScheduler: Job 706 failed: collectAsMap at RandomForest.scala:928, took 0,045648 s

Random Forest Classifier - OutOfMemoryError - PySpark

I am using Random Forest Classifier for a text dataset(approx 60Mb).
I am making my features from hashing TF like -
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures")
features = hashingTF.transform(removeStopwords)
rb = RandomForestClassifier().setFeaturesCol("features").setLabelCol("labels")
After that I am making a pipeline and creating a model using my training data. But when I run my code I am getting following error -
18/02/09 14:31:03 WARN TaskSetManager: Stage 9 contains a task of very large size (42550 KB). The maximum recommended task size is 100 KB.
18/02/09 14:47:20 ERROR Executor: Exception in task 0.0 in stage 10.0 (TID 10)
java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
at org.apache.spark.storage.ShuffleBlockFetcherIterator$$anonfun$5.apply(ShuffleBlockFetcherIterator.scala:390)
at org.apache.spark.storage.ShuffleBlockFetcherIterator$$anonfun$5.apply(ShuffleBlockFetcherIterator.scala:390)
at org.apache.spark.util.io.ChunkedByteBufferOutputStream.allocateNewChunkIfNeeded(ChunkedByteBufferOutputStream.scala:87)
at org.apache.spark.util.io.ChunkedByteBufferOutputStream.write(ChunkedByteBufferOutputStream.scala:75)
at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply$mcJ$sp(Utils.scala:342)
at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:327)
at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:327)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337)
at org.apache.spark.util.Utils$.copyStream(Utils.scala:348)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:395)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:154)
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:41)
at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:89)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:105)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
18/02/09 14:47:20 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker for task 10,5,main]
java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
at org.apache.spark.storage.ShuffleBlockFetcherIterator$$anonfun$5.apply(ShuffleBlockFetcherIterator.scala:390)
at org.apache.spark.storage.ShuffleBlockFetcherIterator$$anonfun$5.apply(ShuffleBlockFetcherIterator.scala:390)
at org.apache.spark.util.io.ChunkedByteBufferOutputStream.allocateNewChunkIfNeeded(ChunkedByteBufferOutputStream.scala:87)
at org.apache.spark.util.io.ChunkedByteBufferOutputStream.write(ChunkedByteBufferOutputStream.scala:75)
at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply$mcJ$sp(Utils.scala:342)
at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:327)
at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:327)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337)
at org.apache.spark.util.Utils$.copyStream(Utils.scala:348)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:395)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:154)
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:41)
at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:89)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:105)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
18/02/09 14:47:20 WARN TaskSetManager: Lost task 0.0 in stage 10.0 (TID 10, localhost, executor driver): java.lang.OutOfMemoryError: Java heap space
But when I change this line -
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=2000) then Random Classifier is working.
Do any body know what I need to do to solve this error because default numFeatures for HashingTF is 2^20 and I am restricting it to only 2000 and this reduces my accuracy.
Same code I tried with Naive Bayes and LinearSVC it is working fine without limiting numFeatures in HashingTF but my accuracy is coming only 60%. So I want to check with RFC to increase accuracy.
I even tried to increase the heap size by this -
export _JAVA_OPTIONS=-Xmx4096m
But this is also not helping me.