Kafka Connect running out of heap space. Already setting `-Xmx12g`

My Kafka Connect sink is running out of heap space. There are other threads on this, such as "Kafka Connect running out of heap space", where the issue was simply running with the default memory settings.
Previously, raising the heap to 2g fixed my issue. However, when I added a new sink, the heap error came back. I raised Xmx to 12g, and I still get the error.
In my systemd service file, I have:
Environment="KAFKA_HEAP_OPTS=-Xms512m -Xmx12g"
I'm still getting the heap errors even with a very high Xmx setting. I also lowered my flush.size to 1000, which I thought would help. FYI, this connector is targeting 11 different Kafka topics. Does that impose unique memory demands?
How can I fix or diagnose further?
FYI, this is with Kafka 0.10.2.1 and Confluent Platform 3.2.2. Do more recent versions provide any improvements here?
java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
at io.confluent.connect.s3.storage.S3OutputStream.<init>(S3OutputStream.java:67)
at io.confluent.connect.s3.storage.S3Storage.create(S3Storage.java:197)
at io.confluent.connect.s3.format.avro.AvroRecordWriterProvider$1.write(AvroRecordWriterProvider.java:67)
at io.confluent.connect.s3.TopicPartitionWriter.writeRecord(TopicPartitionWriter.java:393)
at io.confluent.connect.s3.TopicPartitionWriter.write(TopicPartitionWriter.java:197)
at io.confluent.connect.s3.S3SinkTask.put(S3SinkTask.java:173)
at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:429)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:250)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:179)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:148)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:139)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:182)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
[2018-03-13 20:31:46,398] ERROR Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerSinkTask:450)
[2018-03-13 20:31:46,401] ERROR Task avro-s3-sink-0 threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask:141)
org.apache.kafka.connect.errors.ConnectException: Exiting WorkerSinkTask due to unrecoverable exception.
at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:451)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:250)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:179)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:148)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:139)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:182)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Currently, the memory requirements of the S3 connector depend on the number of outstanding partitions and on s3.part.size: each topic partition with an open upload buffers s3.part.size bytes on the heap. Try setting s3.part.size to 5 MB (the minimum allowed). The default is 25 MB.
Also read here for a more detailed explanation of sizing suggestions:
https://github.com/confluentinc/kafka-connect-storage-cloud/issues/29
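To put rough numbers on that (a hedged estimate, not a measurement): with the default 25 MB part size, 11 topics with, say, 10 partitions each works out to roughly 11 × 10 × 25 MB ≈ 2.7 GB of upload buffers alone, before any other worker overhead. A minimal sketch of the relevant sink settings, assuming a standalone .properties connector config (the name and topic list are placeholders):
# illustrative connector name and topic list
name=avro-s3-sink
connector.class=io.confluent.connect.s3.S3SinkConnector
# list your 11 topics here
topics=topic1,topic2,topic3
flush.size=1000
# each open topic partition buffers this many bytes on the heap;
# 5242880 (5 MB) is the minimum allowed part size, down from the 25 MB default
s3.part.size=5242880
With the part size reduced, the worst-case buffer footprint drops by a factor of five for the same number of open partitions.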

Firstly, I know nothing about Kafka.
However, as a general rule, when a process hits some kind of capacity limit and you can't raise that limit, you must throttle the process somehow. I suggest you explore the possibility of a periodic pause, maybe a sleep of 10 milliseconds every 100 milliseconds, something like that.
Another thing you can try is to pin your Kafka process to one specific CPU. This can sometimes have surprisingly beneficial effects.

Related

IOError(Stalefile) exception being thrown by Kafka Streams RocksDB

When running my stateful Kafka Streams applications I'm coming across various RocksDB Disk I/O Stalefile exceptions. The exception only occurs when I have at least one KTable implementation, and it happens at various times. I've tried countless times to reproduce it but haven't been able to.
App/Environment details:
Runtime: Java
Kafka library: org.apache.kafka:kafka-streams:2.5.1
Deployment: OpenShift
Volume type: NFS
RAM: 2000 - 8000 MiB
CPU: 200 Millicores to 2 Cores
Threads: 1
Partitions: 1 - many
Exceptions encountered:
Caused by: org.apache.kafka.streams.errors.ProcessorStateException: Error while getting value for key from
at org.apache.kafka.streams.state.internals.RocksDBStore.get(RocksDbStore.java:301)
Caused by: org.apache.kafka.streams.errors.ProcessorStateException: Error restoring batch to store
at org.apache.kafka.streams.state.internals.RocksDBStore$RocksDBBatchingRestoreCallback.restoreAll(RocksDbStore.java:636)
Caused by: org.apache.kafka.streams.errors.ProcessorStateException: Error while range compacting during restoring
at org.apache.kafka.streams.state.internals.RocksDBStore$SingleColumnFamilyAccessor.toggleDbForBulkLoading(RocksDbStore.java:616)
Caused by: org.apache.kafka.streams.errors.ProcessorStateException: Error while executing flush from store
at org.apache.kafka.streams.state.internals.RocksDBStore.flush(RocksDbStore.java:616)
Apologies for not being able to post the entire stack trace, but all of the above exceptions seem to reference the org.rocksdb.RocksDBException: IOError(Stalefile) exception.
Additional info:
Using a persisted state directory
Kafka topic settings are created with defaults
Running a single instance on a single thread
Exception is raised during gets and writes
Exception is raised when consuming valid data
Exception also occurs on internal repartition topics
I'd really appreciate any help and please let me know if I can provide any further information.
If you are using a POSIX file system, this error means the file system is returning ESTALE ("stale file handle"). See the description of that error code at https://man7.org/linux/man-pages/man3/errno.3.html. Since your state directory sits on an NFS volume, stale NFS file handles are the likely cause.
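If you can move the RocksDB state directory off the NFS volume (for example onto an emptyDir or a local persistent volume in your OpenShift pod), the stale-handle errors usually stop. A minimal sketch of the relevant Kafka Streams settings, assuming a pod-local mount path (the application id, servers, and path are illustrative):
# illustrative application id and bootstrap servers
application.id=my-streams-app
bootstrap.servers=broker:9092
# keep the RocksDB state stores on local (non-NFS) storage so the OS never returns ESTALE for them
state.dir=/var/lib/kafka-streams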

io.debezium.DebeziumException: The db history topic or its content is fully or partially missing

I am facing frequent issues related to the db history topic, which is created by the connector itself. There is a temporary workaround (changing the name of the db history topic), which I have tried, but it is not a good long-term way to handle this. Also, the topic's retention.bytes is set to -1.
This is the error stack.
ERROR WorkerSourceTask{id=cdcit.ventures.sandbox.streamdomain.streamsubdomain.order-filter-0} Task threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask)
io.debezium.DebeziumException: The db history topic or its content is fully or partially missing. Please check database history topic configuration and re-execute the snapshot.
at io.debezium.relational.HistorizedRelationalDatabaseSchema.recover(HistorizedRelationalDatabaseSchema.java:47)
at io.debezium.connector.sqlserver.SqlServerConnectorTask.start(SqlServerConnectorTask.java:87)
at io.debezium.connector.common.BaseSourceTask.start(BaseSourceTask.java:101)
at org.apache.kafka.connect.runtime.WorkerSourceTask.execute(WorkerSourceTask.java:213)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:184)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:234)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
[2020-09-04 19:12:26,445] ERROR WorkerSourceTask{id=cdcit.ventures.sandbox.streamdomain.streamsubdomain.order-filter-0} Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask)
You must use a dedicated database history topic per connector; the topic must not be used by more than one connector.
Please change the value of the "name" parameter in the connector.properties config to a new name, so the connector gets its own history topic (a configuration sketch follows below).
Thanks.
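A minimal sketch of the relevant properties, assuming the Debezium SQL Server connector from the stack trace (the name, server, and topic values are placeholders); the key point is that database.history.kafka.topic belongs to this one connector only:
# illustrative connector name; renaming it is what the answer above suggests
name=order-filter-connector-v2
connector.class=io.debezium.connector.sqlserver.SqlServerConnector
database.history.kafka.bootstrap.servers=broker:9092
# this topic must be dedicated to exactly one connector and never shared
database.history.kafka.topic=dbhistory.order-filter-v2
# if the history topic content was lost, a fresh snapshot has to be taken
snapshot.mode=initial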

JDBC Sink connector throwing java.sql.BatchUpdateException

I started a JDBC sink connector some weeks ago. Everything was fine until the logs started to throw this error:
[2019-06-27 11:35:44,121] WARN Write of 500 records failed, remainingRetries=10 (io.confluent.connect.jdbc.sink.JdbcSinkTask:68)
java.sql.BatchUpdateException: [Teradata JDBC Driver] [TeraJDBC 16.20.00.10] [Error 1338] [SQLState HY000] A failure occurred while executing a PreparedStatement batch request. Details of the failure can be found in the exception chain that is accessible with getNextException.
at com.teradata.jdbc.jdbc_4.util.ErrorFactory.makeBatchUpdateException(ErrorFactory.java:149)
at com.teradata.jdbc.jdbc_4.util.ErrorFactory.makeBatchUpdateException(ErrorFactory.java:138)
at com.teradata.jdbc.jdbc_4.TDPreparedStatement.executeBatchDMLArray(TDPreparedStatement.java:276)
at com.teradata.jdbc.jdbc_4.TDPreparedStatement.executeBatch(TDPreparedStatement.java:2754)
at io.confluent.connect.jdbc.sink.BufferedRecords.flush(BufferedRecords.java:99)
at io.confluent.connect.jdbc.sink.BufferedRecords.add(BufferedRecords.java:78)
at io.confluent.connect.jdbc.sink.JdbcDbWriter.write(JdbcDbWriter.java:62)
at io.confluent.connect.jdbc.sink.JdbcSinkTask.put(JdbcSinkTask.java:66)
at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:429)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:250)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:179)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:148)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:139)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:182)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I have already tried lowering the batch.size property, even as low as 100, and it is still failing.
Added connector status:
{"name":"teradata-sink-K_C_OSUSR_DGL_DFORM_I1-V2",
"connector":{
"state":"RUNNING",
"worker_id":"10.28.148.64:41029"},
"tasks":[{"state":"FAILED","trace":"org.apache.kafka.connect.errors.ConnectException: Exiting WorkerSinkTask due to unrecoverable exception.
org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:451)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:250)
org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:179)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:148)
org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:139)
org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:182)
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
java.util.concurrent.FutureTask.run(FutureTask.java:266)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)\n","id":0,"worker_id":"10.28.148.64:41029"}]}
I faced a similar issue while trying to write to Teradata from Spark using JDBC. Apparently this happens with Teradata only: it occurs when you try to write to one table using multiple JDBC connections in parallel. In my case, Spark was forking multiple JDBC connections to Teradata while writing the data.
Your options are:
Restrict your application from making multiple JDBC connections; you can monitor the sessions from Teradata Viewpoint if you have access.
Try providing a JDBC connection with TYPE=FASTLOAD and see if it works (see the sketch below).
Or go for TPT (Teradata Parallel Transporter), which handles this issue internally.
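A minimal sketch of what that looks like in the sink config (host, database, and credentials are placeholders); TYPE=FASTLOAD is a Teradata JDBC URL parameter, and tasks.max=1 is the Kafka Connect way of restricting the sink to a single writer connection:
# illustrative Teradata host and database
connection.url=jdbc:teradata://td-host/DATABASE=my_db,TYPE=FASTLOAD
connection.user=my_user
connection.password=my_password
# a single task means a single JDBC connection writing to the table
tasks.max=1
batch.size=500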

Kafka can not delete old log segments on Windows

I have run into an issue with Kafka on Windows where it attempts to delete log segments but cannot, because another process has access to the files. This is caused by Kafka holding the files open itself and then trying to delete files it has open. The stack trace is below for reference.
I have found two JIRA issues filed on this problem: https://issues.apache.org/jira/browse/KAFKA-1194 and https://issues.apache.org/jira/browse/KAFKA-2170. The first was logged against version 0.8.1 and the second against version 0.10.1.
I have personally tried versions 0.10.1 and 0.10.2; neither has the bug fixed.
My question is: does anyone know of a patch that can fix this issue, or whether the Kafka maintainers have a fix that will be rolling out soon?
Thanks.
kafka.common.KafkaStorageException: Failed to change the log file suffix from to .deleted for log segment 6711351
at kafka.log.LogSegment.kafkaStorageException$1(LogSegment.scala:340)
at kafka.log.LogSegment.changeFileSuffixes(LogSegment.scala:342)
at kafka.log.Log.kafka$log$Log$$asyncDeleteSegment(Log.scala:981)
at kafka.log.Log.kafka$log$Log$$deleteSegment(Log.scala:971)
at kafka.log.Log$$anonfun$deleteOldSegments$1.apply(Log.scala:673)
at kafka.log.Log$$anonfun$deleteOldSegments$1.apply(Log.scala:673)
at scala.collection.immutable.List.foreach(List.scala:381)
at kafka.log.Log.deleteOldSegments(Log.scala:673)
at kafka.log.Log.deleteRetentionSizeBreachedSegments(Log.scala:717)
at kafka.log.Log.deleteOldSegments(Log.scala:697)
at kafka.log.LogManager$$anonfun$cleanupLogs$3.apply(LogManager.scala:474)
at kafka.log.LogManager$$anonfun$cleanupLogs$3.apply(LogManager.scala:472)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at kafka.log.LogManager.cleanupLogs(LogManager.scala:472)
at kafka.log.LogManager$$anonfun$startup$1.apply$mcV$sp(LogManager.scala:200)
at kafka.utils.KafkaScheduler$$anonfun$1.apply$mcV$sp(KafkaScheduler.scala:110)
at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:57)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.file.FileSystemException: c:\kafka-logs\kafka-logs\metric-values-0\00000000000006711351.log -> c:\kafka-logs\kafka-logs\metric-values-0\00000000000006711351.log.deleted: The process cannot access the file because it is being used by another process.
at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86)
at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
at sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:387)
at sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.java:287)
at java.nio.file.Files.move(Files.java:1395)
at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:711)
at org.apache.kafka.common.record.FileRecords.renameTo(FileRecords.java:210)
... 28 more
Suppressed: java.nio.file.FileSystemException: c:\kafka-logs\kafka-logs\metric-values-0\00000000000006711351.log -> c:\kafka-logs\kafka-logs\metric-values-0\00000000000006711351.log.deleted: The process cannot access the file because it is being used by another process.
at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86)
at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
at sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:301)
at sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.java:287)
at java.nio.file.Files.move(Files.java:1395)
at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:708)
... 29 more
I'm having a similar issue running Kafka locally; the Kafka server seems to stop when it fails to delete a log file. To avoid this, I had to increase the log retention so the logs are not auto-deleted.
# The minimum age of a log file to be eligible for deletion due to age
log.retention.hours=500
Setting the retention to a large number of hours avoids this when running locally; in production on a Linux-based system this should not happen.
If you do need to delete the log files, delete them manually from the log directory, then restart Kafka.

Timeout Exception in Apache-Spark during program Execution

I am running a Bash script on macOS. This script calls a Spark method written in Scala a large number of times; currently I am trying to call it 100,000 times using a for loop.
The code exits with the following exception after running a small number of iterations, around 3,000.
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518)
at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547)
at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1877)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
Exception in thread "dag-scheduler-event-loop" 16/11/22 13:37:32 WARN NioEventLoop: Unexpected exception in the selector loop.
java.lang.OutOfMemoryError: Java heap space
at io.netty.util.internal.MpscLinkedQueue.offer(MpscLinkedQueue.java:126)
at io.netty.util.internal.MpscLinkedQueue.add(MpscLinkedQueue.java:221)
at io.netty.util.concurrent.SingleThreadEventExecutor.fetchFromScheduledTaskQueue(SingleThreadEventExecutor.java:259)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:346)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
java.lang.OutOfMemoryError: Java heap space
at java.util.regex.Pattern.compile(Pattern.java:1047)
at java.lang.String.replace(String.java:2180)
at org.apache.spark.util.Utils$.getFormattedClassName(Utils.scala:1728)
at org.apache.spark.storage.RDDInfo$$anonfun$1.apply(RDDInfo.scala:57)
at org.apache.spark.storage.RDDInfo$$anonfun$1.apply(RDDInfo.scala:57)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.storage.RDDInfo$.fromRdd(RDDInfo.scala:57)
at org.apache.spark.scheduler.StageInfo$$anonfun$1.apply(StageInfo.scala:87)
Can someone help, please? Is this error caused by the large number of calls to the Spark method?
It's an RpcTimeoutException, so spark.network.timeout (spark.rpc.askTimeout) could be tuned to larger-than-default values in order to handle a complex workload. You can start with these values and adjust according to your workload.
Please see the latest documentation:
spark.network.timeout (default 120s): Default timeout for all network interactions. This config will be used in place of spark.core.connection.ack.wait.timeout, spark.storage.blockManagerSlaveTimeoutMs, spark.shuffle.io.connectionTimeout, spark.rpc.askTimeout or spark.rpc.lookupTimeout if they are not configured.
Also consider increasing executor memory (spark.executor.memory), and most importantly review your code to check whether it is a candidate for further optimization.
Solution: the value 600s below is based on your requirements; it can be set in any of the following ways:
set by SparkConf: conf.set("spark.network.timeout", "600s")
set by spark-defaults.conf: spark.network.timeout 600s
set when calling spark-submit: --conf spark.network.timeout=600s
The stack trace above also shows a Java heap space OutOfMemoryError, so try increasing the memory and running it again. As for the timeout, it is an RPC timeout, so you can set spark.network.timeout to a value that suits your needs.
Please increase the executor memory so that the OOM goes away, or change your code so that your RDD doesn't have a big memory footprint, for example:
--executor-memory 3G
Just increase spark.executor.heartbeatInterval to 20s; that is the setting the error message points to.
You are seeing this issue due to executor memory pressure.
Try doubling the executor memory so the containers don't time out while waiting on the remaining containers.
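Pulling the suggestions above together, a spark-defaults.conf sketch with illustrative starting values (not tuned recommendations); note that Spark expects spark.executor.heartbeatInterval to stay significantly smaller than spark.network.timeout:
# illustrative starting points; adjust to your workload
spark.network.timeout              600s
# must stay well below spark.network.timeout
spark.executor.heartbeatInterval   20s
spark.executor.memory              3g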
For posterity: I was getting similar errors, but changing memory/timeout settings was not helping at all.
In my case the problem was that somebody was calling socket.setdefaulttimeout in a library function that I was calling before creating the Spark session. setdefaulttimeout affected all new sockets created after that point, including the socket that Spark used to communicate with YARN, so that connection would time out unexpectedly.
Needless to say, don't do this.