HBase MemStore and BlockCache exceeds the threshold - scala

I want to connect Spark jobs to my remote HBase. I pass the following jars:
--jars spark-hbase-connector_2.10-1.0.3.jar,hbase-shaded-protobuf-3.4.1.jar,hbase-0.90.3.jar
So I am importing protobuf, the HBase connector (https://github.com/nerdammer/spark-hbase-connector), and HBase itself. However I get:
java.lang.RuntimeException: Current heap configuration for MemStore
and BlockCache exceeds the threshold required for successful cluster
operation. The combined value cannot exceed 0.8. Please check the
settings for hbase.regionserver.global.memstore.upperLimit and
hfile.block.cache.size in your configuration.
In the hbase-0.90.3.jar, the hbase-default.xml specifies
hbase.regionserver.global.memstore.upperLimit = 0.4
(Maximum size of all memstores in a region server before new updates are blocked and flushes are forced. Defaults to 40% of heap.)
and
hfile.block.cache.size = 0.2
(Percentage of maximum heap (-Xmx setting) to allocate to block cache used by HFile/StoreFile. Default of 0.2 means allocate 20%. Set to 0 to disable.)
0.4 + 0.2 is not more than 0.8. Is there anything else that can cause this error?
In my code I specify
sparkConf.set("spark.hbase.host", "remoteHost")
The ZooKeeper port in hbase-default.xml is 2181, and it is the same in my HBase on the remote server.
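A quick way to see which defaults are actually picked up from the jars on the driver classpath is to load the client-side HBase configuration and print the two values the check complains about. This is a minimal diagnostic sketch, assuming the bundled hbase-default.xml/hbase-site.xml are on the classpath:
import org.apache.hadoop.hbase.HBaseConfiguration

// loads hbase-default.xml / hbase-site.xml from whatever jars are on the classpath
val hbaseConf = HBaseConfiguration.create()
val memstore   = hbaseConf.getFloat("hbase.regionserver.global.memstore.upperLimit", -1f)
val blockCache = hbaseConf.getFloat("hfile.block.cache.size", -1f)
println(s"memstore upper limit = $memstore, block cache = $blockCache, sum = ${memstore + blockCache}")
// if the printed sum is above 0.8, the effective configuration is coming from a
// different hbase-default.xml than the one inspected inside hbase-0.90.3.jar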
Thanks in advance

Related

how to resolve Java heap space error in kafka stream on deployment environment using kubernetes

I am working on Kafka streaming using Docker. I have deployed this kafka-stream module as a Docker image on a Kubernetes pod. When I started writing data to the Kafka topics, it wrote a few records, but after some time it started showing multiple errors.
Each Kafka topic has 6 partitions, and the replication factor is 3.
Below are the errors:
org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for test.aggregator.s3-1:316987 ms has passed since batch creation
The broker is either slow or in bad state (like not having enough replicas) in responding the request, or the connection to broker was interrupted sending the request or receiving the response.
Consider overwriting `max.block.ms` and /or `delivery.timeout.ms` to a larger value to wait longer for such scenarios and avoid timeout errors
Exception handler choose to CONTINUE processing in spite of this error but written offsets would not be recorded. (org.apache.kafka.streams.processor.internals.RecordCollectorImpl:221)
Heartbeat thread failed due to unexpected error (org.apache.kafka.clients.consumer.internals.AbstractCoordinator:1392)
java.lang.OutOfMemoryError: Java heap space
at org.apache.kafka.common.protocol.types.Schema.read(Schema.java:101) ~[org.apache.kafka-kafka-clients-2.6.0.jar:?]
at org.apache.kafka.common.protocol.types.Schema.read(Schema.java:27) ~[org.apache.kafka-kafka-clients-2.6.0.jar:?]
at org.apache.kafka.common.protocol.types.CompactArrayOf.read(CompactArrayOf.java:84) ~[org.apache.kafka-kafka-clients-2.6.0.jar:?]
at org.apache.kafka.common.protocol.types.Schema.read(Schema.java:114) ~[org.apache.kafka-kafka-clients-2.6.0.jar:?]
at org.apache.kafka.common.protocol.types.Schema.read(Schema.java:27) ~[org.apache.kafka-kafka-clients-2.6.0.jar:?]
at org.apache.kafka.common.protocol.types.CompactArrayOf.read(CompactArrayOf.java:84) ~[org.apache.kafka-kafka-clients-2.6.0.jar:?]
at org.apache.kafka.common.protocol.types.Schema.read(Schema.java:114) ~[org.apache.kafka-kafka-clients-2.6.0.jar:?]
at org.apache.kafka.common.protocol.ApiKeys.parseResponse(ApiKeys.java:325) ~[org.apache.kafka-kafka-clients-2.6.0.jar:?]
at org.apache.kafka.clients.NetworkClient.parseStructMaybeUpdateThrottleTimeMetrics(NetworkClient.java:720) ~[org.apache.kafka-kafka-clients-2.6.0.jar:?]
at org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:834) ~[org.apache.kafka-kafka-clients-2.6.0.jar:?]
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:553) ~[org.apache.kafka-kafka-clients-2.6.0.jar:?]
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:265) ~[org.apache.kafka-kafka-clients-2.6.0.jar:?]
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.pollNoWakeup(ConsumerNetworkClient.java:306) ~[org.apache.kafka-kafka-clients-2.6.0.jar:?]
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$HeartbeatThread.run(AbstractCoordinator.java:1321) [org.apache.kafka-kafka-clients-2.6.0.jar:?]
[2022-07-20 10:38:10,977] ERROR [kafka-producer-network-thread | stream-consumer-f7fb108f-6d7c-4736-a69f-a885a3eddc47-StreamThread-2-producer] Uncaught exception in thread 'kafka-producer-network-thread | stream-consumer-f7fb108f-6d7c-4736-a69f-a885a3eddc47-StreamThread-2-producer': (org.apache.kafka.common.utils.KafkaThread:49)
java.lang.OutOfMemoryError: Java heap space
Detected that the thread is being fenced. This implies that this thread missed a rebalance and dropped out of the consumer group. Will close out all assigned tasks and rejoin the consumer group. (org.apache.kafka.streams.processor.internals.StreamThread:572)
org.apache.kafka.streams.errors.TaskMigratedException: Consumer committing offsets failed, indicating the corresponding thread is no longer part of the group; it means all tasks belonging to this thread should be migrated.
at org.apache.kafka.streams.processor.internals.TaskManager.commitOffsetsOrTransaction(TaskManager.java:1009) ~[org.apache.kafka-kafka-streams-2.6.0.jar:?]
at org.apache.kafka.streams.processor.internals.TaskManager.commit(TaskManager.java:962) ~[org.apache.kafka-kafka-streams-2.6.0.jar:?]
at org.apache.kafka.streams.processor.internals.StreamThread.maybeCommit(StreamThread.java:851) ~[org.apache.kafka-kafka-streams-2.6.0.jar:?]
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:714) [org.apache.kafka-kafka-streams-2.6.0.jar:?]
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:551) [org.apache.kafka-kafka-streams-2.6.0.jar:?]
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:510) [org.apache.kafka-kafka-streams-2.6.0.jar:?]
Caused by: org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator$OffsetCommitResponseHandler.handle(ConsumerCoordinator.java:1251) ~[org.apache.kafka-kafka-clients-2.6.0.jar:?]
Please suggest what I am missing here;
the same module works fine when I run it in my local environment.
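The log itself suggests raising max.block.ms and/or delivery.timeout.ms for the Streams-internal producer, and the OutOfMemoryError points at the pod's JVM heap (set via the container's -Xmx, not via Kafka config). A minimal sketch of how those producer overrides could be applied; the application id and bootstrap servers are placeholders, and the timeout values are illustrative:
import java.util.Properties
import org.apache.kafka.clients.producer.ProducerConfig
import org.apache.kafka.streams.StreamsConfig

val props = new Properties()
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "stream-consumer")   // placeholder application id
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092")     // placeholder brokers
// forward larger timeouts to the Streams-internal producer
props.put(StreamsConfig.producerPrefix(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG), "600000")
props.put(StreamsConfig.producerPrefix(ProducerConfig.MAX_BLOCK_MS_CONFIG), "180000")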

Spark creating index in elasticsearch results in OutOfMemoryError

In a Spark job we are creating an index in Elasticsearch from a file of ~120 GB. The file is divided into 3830 partitions. After the job starts putting data into the Elasticsearch index, garbage-collector messages appear in the Elasticsearch logs, ending with java.lang.OutOfMemoryError: Java heap space.
[WARN ][monitor.jvm] [gc][old][209][6] duration [57.6s], collections [1]/[57.9s], total [57.6s]/[3.7m], memory [24.2gb]->[18.7gb]/[24.9gb], all_pools {[young] [8.6mb]->[10.3mb]/[532.5mb]}{[survivor] [66.5mb]->[0b]/[66.5mb]}{[old] [24.2gb]->[18.7gb]/[24.3gb]}
java.lang.OutOfMemoryError: Java heap space
I restarted the Elasticsearch nodes and ran the job again, but the result was the same.
The following Elasticsearch properties are set in the Spark config:
conf.set("es.nodes", "xx.xx.xxx.xx:9200")
conf.set("es.scroll.size", "10000")
conf.set("es.index.auto.create", "true")
We are using Spark 1.6 and Elasticsearch 2.0.2. There are three Elasticsearch nodes in the cluster, each with 25 GB of heap space.
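One knob that often matters here is the bulk-request size elasticsearch-hadoop uses when writing; es.scroll.size applies to reads (scrolling), so it should not be driving the write-side memory. A hedged sketch of shrinking the write batches so each bulk request puts less pressure on the 25 GB ES heaps; the exact values are illustrative, not tuned:
import org.apache.spark.SparkConf

val conf = new SparkConf()
conf.set("es.nodes", "xx.xx.xxx.xx:9200")
conf.set("es.index.auto.create", "true")
// smaller bulk requests mean less transient heap pressure per request on the ES data nodes
conf.set("es.batch.size.entries", "500")   // docs per bulk request (elasticsearch-hadoop default: 1000)
conf.set("es.batch.size.bytes", "512kb")   // bytes per bulk request (default: 1mb)
// each of the 3830 tasks issues its own bulk requests, so coalescing the dataset to
// fewer partitions before saving also reduces the number of concurrent writers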

Spark Application - High "Executor Computing Time"

I have a Spark application that has now been running for 46 hours. While the majority of its jobs complete within 25 seconds, specific jobs take hours. Some details are provided below:
Task Time   Shuffle Read      Shuffle Write
7.5 h       2.2 MB / 257402   2.9 MB / 128601
There are other similar task times, of course, with values of 11.3 h, 10.6 h, 9.4 h, etc., each of them spending the bulk of the activity time on "rdd at DataFrameFunctions.scala:42.". The stage details reveal that the time is spent by the executor on "Executor Computing Time". This executor runs on DataNode 1, where the CPU utilization is quite normal, about 13%. The other boxes (4 more worker nodes) also have very nominal CPU utilization.
When the Shuffle Read is within 5000 records, this is extremely fast and completes within 25 seconds, as stated previously. Nothing is appended to the logs (Spark/Hadoop/HBase), nor is anything noticed at the /tmp or /var/tmp locations that would indicate disk-related activity is in progress.
I am clueless about what is going wrong and have been struggling with this for quite some time now. The versions of the software used are as follows:
Hadoop : 2.7.2
Zookeeper : 3.4.9
Kafka : 2.11-0.10.1.1
Spark : 2.1.0
HBase : 1.2.6
Phoenix : 4.10.0
Some configurations from the spark-defaults file:
spark.eventLog.enabled true
spark.eventLog.dir hdfs://SDCHDPMAST1:8111/data1/spark-event
spark.history.fs.logDirectory hdfs://SDCHDPMAST1:8111/data1/spark-event
spark.yarn.jars hdfs://SDCHDPMAST1:8111/user/appuser/spark/share/lib/*.jar
spark.driver.maxResultSize 5G
spark.deploy.zookeeper.url SDCZKPSRV01
spark.executor.memory 12G
spark.driver.memory 10G
spark.executor.heartbeatInterval 60s
spark.network.timeout 300s
Is there any way I can reduce the time spent on "Executor Computing time"?
The job is operating on a skewed dataset. Because of the skew, these jobs take longer than expected.
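A quick way to confirm that kind of skew is to look at per-partition record counts of the dataset feeding the slow stage. This is a minimal sketch; df is a placeholder for whatever DataFrame backs the stage at DataFrameFunctions.scala:42:
// df: placeholder for the DataFrame feeding the slow stage
// a single huge partition in the output confirms key skew
val partitionSizes = df.rdd
  .mapPartitionsWithIndex { (idx, it) => Iterator((idx, it.size)) }
  .collect()
  .sortBy(-_._2)

partitionSizes.take(10).foreach { case (idx, n) => println(s"partition $idx -> $n records") }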

Timeout Exception in Apache-Spark during program Execution

I am running a Bash script on macOS. The script calls a Spark method, written in Scala, a large number of times; I am currently trying to call this Spark method 100,000 times using a for loop.
The code exits with the following exception after running a small number of iterations, around 3,000.
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518)
at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547)
at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1877)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
Exception in thread "dag-scheduler-event-loop" 16/11/22 13:37:32 WARN NioEventLoop: Unexpected exception in the selector loop.
java.lang.OutOfMemoryError: Java heap space
at io.netty.util.internal.MpscLinkedQueue.offer(MpscLinkedQueue.java:126)
at io.netty.util.internal.MpscLinkedQueue.add(MpscLinkedQueue.java:221)
at io.netty.util.concurrent.SingleThreadEventExecutor.fetchFromScheduledTaskQueue(SingleThreadEventExecutor.java:259)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:346)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
java.lang.OutOfMemoryError: Java heap space
at java.util.regex.Pattern.compile(Pattern.java:1047)
at java.lang.String.replace(String.java:2180)
at org.apache.spark.util.Utils$.getFormattedClassName(Utils.scala:1728)
at org.apache.spark.storage.RDDInfo$$anonfun$1.apply(RDDInfo.scala:57)
at org.apache.spark.storage.RDDInfo$$anonfun$1.apply(RDDInfo.scala:57)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.storage.RDDInfo$.fromRdd(RDDInfo.scala:57)
at org.apache.spark.scheduler.StageInfo$$anonfun$1.apply(StageInfo.scala:87)
Can someone please help? Is this error caused by the large number of calls to the Spark method?
It's an RpcTimeoutException, so spark.network.timeout (spark.rpc.askTimeout) could be tuned to larger-than-default values in order to handle complex workloads. You can start with these values and adjust them according to your workload.
Please see the latest documentation:
spark.network.timeout 120s Default timeout for all network
interactions. This config will be used in place of
spark.core.connection.ack.wait.timeout,
spark.storage.blockManagerSlaveTimeoutMs,
spark.shuffle.io.connectionTimeout, spark.rpc.askTimeout or
spark.rpc.lookupTimeout if they are not configured.
Also consider increasing the executor memory (spark.executor.memory), and, most importantly, review your code to check whether it is a candidate for further optimization.
Solution (the value 600 is based on your requirements):
set by SparkConf: conf.set("spark.network.timeout", "600s")
set by spark-defaults.conf: spark.network.timeout 600s
set when calling spark-submit: --conf spark.network.timeout=600s
The stack trace above also shows a Java heap space OOM error, so try increasing the memory and running again. As for the timeout, it is an RPC timeout, so you can set spark.network.timeout to a value that suits your needs.
Please increase the executor memory so that the OOM goes away, or change the code so that your RDD does not have a big memory footprint:
--executor-memory 3G
Just increase spark.executor.heartbeatInterval to 20s; that is what the error message points at.
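For example, a minimal sketch in the same SparkConf style as the earlier answer; the Spark docs recommend keeping spark.network.timeout well above the heartbeat interval, and the values here are illustrative:
import org.apache.spark.SparkConf

val conf = new SparkConf()
// raise the heartbeat interval, keeping the network timeout comfortably larger
conf.set("spark.executor.heartbeatInterval", "20s")
conf.set("spark.network.timeout", "600s")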
You are seeing this issue due to the executor memory.
Try increasing the memory (e.g. doubling it) so the containers don't time out while waiting on the remaining containers.
For posterity: I was getting similar errors, but changing memory/timeout settings was not helping at all.
In my case the problem was that somebody was calling socket.setdefaulttimeout in a library function that I was calling before creating the Spark session. setdefaulttimeout affected all new sockets created after that point, including the socket that Spark used to communicate with YARN, so that connection would time out unexpectedly.
Needless to say, don't do this.

is there a way to optimize spark sql code?

updated:
I'm using Spark SQL 1.5.2. I am trying to read many Parquet files and to filter and aggregate rows; there are ~35M rows stored in ~30 files in my HDFS, and it takes more than 10 minutes to process.
val logins_12 = sqlContext.read.parquet("events/2015/12/*/login")
val l_12 = logins_12
  .where("event_data.level >= 90")
  .select("pid", "timestamp", "event_data.level")
  .withColumn("event_date", to_date(logins_12("timestamp")))
  .drop("timestamp")
  .toDF("pid", "level", "event_date")
  .groupBy("pid", "event_date")
  .agg(Map("level" -> "max"))
  .toDF("pid", "event_date", "level")
l_12.first()
My Spark runs on a two-node cluster with 8 cores and 16 GB RAM each; the Scala output makes me think the computation runs in just one thread:
scala> x.first()
[Stage 1:=======> (50 + 1) / 368]
When I try count() instead of first(), it looks like two threads are doing the computation, which is still fewer than I was expecting, as there are ~30 files that could be processed in parallel:
scala> l_12.count()
[Stage 4:=====> (34 + 2) / 368]
I'm starting the Spark console with 14g for the executor and 4g for the driver in yarn-client mode:
./bin/spark-shell -Dspark.executor.memory=14g -Dspark.driver.memory=4g --master yarn-client
my default config for spark:
spark.executor.memory 2g
spark.logConf true
spark.eventLog.dir maprfs:///apps/spark
spark.eventLog.enabled true
spark.sql.hive.metastore.sharedPrefixes com.mysql.jdbc,org.postgresql,com.microsoft.sqlserver,oracle.jdbc,com.mapr.fs.shim.LibraryLoader,com.mapr.security.JNISecurity,com.mapr.fs.jni
spark.executor.extraClassPath
spark.yarn.historyServer.address http://test-01:18080
there are 200 partitions of the rdd
scala> logins_12.rdd.partitions.size
res2: Int = 368
scala> l_12.rdd.partitions.size
res0: Int = 200
is there a way to optimize this code?
thanks
Both behaviors are more or less expected. Spark is rather lazy: not only does it not execute transformations unless you trigger an action, it can also skip tasks if they are not required for the output. Since first requires only a single element, it can compute only one partition. That is most likely the reason why you see only one running thread at some point.
Regarding the second issue, it is most likely a matter of configuration. Assuming there is nothing wrong with the YARN configuration (I don't use YARN, but yarn.nodemanager.resource.cpu-vcores looks like a possible source of the problem), it is most likely a matter of Spark defaults. As you can read in the Configuration guide, spark.executor.cores on YARN is set to 1 by default. Two workers give two running threads.
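For example, a sketch of requesting more parallelism explicitly when launching the shell on YARN; the flags are the standard spark-shell/spark-submit resource options, and the numbers are illustrative, to be sized against the 8-core / 16 GB nodes and the YARN container limits:
./bin/spark-shell --master yarn-client --num-executors 2 --executor-cores 4 --executor-memory 12g --driver-memory 4g
With spark.executor.cores above 1 per executor, several of the ~30 input files can then be read in parallel.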