Dataflow-Apache beam insert into ReplicatedReplacingMergeTree - apache-beam

I have a ReplicatedReplacingMergeTree table (table1) with three materialized views built on it, all of the same engine type. I am inserting streaming data into table1 from Dataflow (Apache Beam). This worked when table1 was a plain ReplacingMergeTree; after switching to the replicated engine I get the error below. What is the problem?
2020-02-24 17:07:37.365 IST Error message from worker: ru.yandex.clickhouse.except.ClickHouseException: ClickHouse exception, code: 306, host: 35.202.46.77, port: 8123; Code: 306, e.displayText() = DB::Exception: Stack size too large. Stack address: 0x7feae15fe000, frame address: 0x7feae19fe210, stack size: 4197872, maximum stack size: 8392704 (version 19.17.4.11 (official build))
ru.yandex.clickhouse.except.ClickHouseExceptionSpecifier.specify(ClickHouseExceptionSpecifier.java:58)
ru.yandex.clickhouse.except.ClickHouseExceptionSpecifier.specify(ClickHouseExceptionSpecifier.java:28)
ru.yandex.clickhouse.ClickHouseStatementImpl.checkForErrorAndThrow(ClickHouseStatementImpl.java:875)
ru.yandex.clickhouse.ClickHouseStatementImpl.sendStream(ClickHouseStatementImpl.java:851)
ru.yandex.clickhouse.Writer.send(Writer.java:106)
ru.yandex.clickhouse.Writer.send(Writer.java:141)
ru.yandex.clickhouse.ClickHouseStatementImpl.sendRowBinaryStream(ClickHouseStatementImpl.java:764)
ru.yandex.clickhouse.ClickHouseStatementImpl.sendRowBinaryStream(ClickHouseStatementImpl.java:758)
org.apache.beam.sdk.io.clickhouse.ClickHouseIO$WriteFn.flush(ClickHouseIO.java:427)
org.apache.beam.sdk.io.clickhouse.ClickHouseIO$WriteFn.finishBundle(ClickHouseIO.java:403)
Caused by: java.lang.Throwable: Code: 306, e.displayText() = DB::Exception: Stack size too large. Stack address: 0x7feae15fe000, frame address: 0x7feae19fe210, stack size: 4197872, maximum stack size: 8392704 (version 19.17.4.11 (official build))
ru.yandex.clickhouse.except.ClickHouseExceptionSpecifier.specify(ClickHouseExceptionSpecifier.java:53)
ru.yandex.clickhouse.except.ClickHouseExceptionSpecifier.specify(ClickHouseExceptionSpecifier.java:28)
ru.yandex.clickhouse.ClickHouseStatementImpl.checkForErrorAndThrow(ClickHouseStatementImpl.java:875)
ru.yandex.clickhouse.ClickHouseStatementImpl.sendStream(ClickHouseStatementImpl.java:851)
ru.yandex.clickhouse.Writer.send(Writer.java:106)
ru.yandex.clickhouse.Writer.send(Writer.java:141)
ru.yandex.clickhouse.ClickHouseStatementImpl.sendRowBinaryStream(ClickHouseStatementImpl.java:764)
ru.yandex.clickhouse.ClickHouseStatementImpl.sendRowBinaryStream(ClickHouseStatementImpl.java:758)
org.apache.beam.sdk.io.clickhouse.ClickHouseIO$WriteFn.flush(ClickHouseIO.java:427)
org.apache.beam.sdk.io.clickhouse.ClickHouseIO$WriteFn.finishBundle(ClickHouseIO.java:403)
org.apache.beam.sdk.io.clickhouse.AutoValue_ClickHouseIO_WriteFn$DoFnInvoker.invokeFinishBundle(Unknown Source)
org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.finishBundle(SimpleDoFnRunner.java:232)
org.apache.beam.runners.dataflow.worker.SimpleParDoFn.finishBundle(SimpleParDoFn.java:423)
org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.finish(ParDoOperation.java:56)
org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:85)
org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1350)
org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.access$1100(StreamingDataflowWorker.java:152)
org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker$7.run(StreamingDataflowWorker.java:1073)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
ru.yandex.clickhouse.except.ClickHouseException: ClickHouse exception, code: 306, host: 35.202.46.77, port: 8123; Code: 306, e.displayText() = DB::Exception: Stack size too large. Stack address: 0x7fe5abdfc000, frame address: 0x7fe5ac1fc210, stack size: 4197872, maximum stack size: 8392704 (version 19.17.4.11 (official build))
ru.yandex.clickhouse.except.ClickHouseExceptionSpecifier.specify(ClickHouseExceptionSpecifier.java:58)
ru.yandex.clickhouse.except.ClickHouseExceptionSpecifier.specify(ClickHouseExceptionSpecifier.java:28)
ru.yandex.clickhouse.ClickHouseStatementImpl.checkForErrorAndThrow(ClickHouseStatementImpl.java:875)
ru.yandex.clickhouse.ClickHouseStatementImpl.sendStream(ClickHouseStatementImpl.java:851)
ru.yandex.clickhouse.Writer.send(Writer.java:106)
ru.yandex.clickhouse.Writer.send(Writer.java:141)
ru.yandex.clickhouse.ClickHouseStatementImpl.sendRowBinaryStream(ClickHouseStatementImpl.java:764)
ru.yandex.clickhouse.ClickHouseStatementImpl.sendRowBinaryStream(ClickHouseStatementImpl.java:758)
org.apache.beam.sdk.io.clickhouse.ClickHouseIO$WriteFn.flush(ClickHouseIO.java:427)
org.apache.beam.sdk.io.clickhouse.ClickHouseIO$WriteFn.finishBundle(ClickHouseIO.java:403)
Caused by: java.lang.Throwable: Code: 306, e.displayText() = DB::Exception: Stack size too large. Stack address: 0x7fe5abdfc000, frame address: 0x7fe5ac1fc210, stack size: 4197872, maximum stack size: 8392704 (version 19.17.4.11 (official build))
ru.yandex.clickhouse.except.ClickHouseExceptionSpecifier.specify(ClickHouseExceptionSpecifier.java:53)
ru.yandex.clickhouse.except.ClickHouseExceptionSpecifier.specify(ClickHouseExceptionSpecifier.java:28)
ru.yandex.clickhouse.ClickHouseStatementImpl.checkForErrorAndThrow(ClickHouseStatementImpl.java:875)
ru.yandex.clickhouse.ClickHouseStatementImpl.sendStream(ClickHouseStatementImpl.java:851)
ru.yandex.clickhouse.Writer.send(Writer.java:106)
ru.yandex.clickhouse.Writer.send(Writer.java:141)
ru.yandex.clickhouse.ClickHouseStatementImpl.sendRowBinaryStream(ClickHouseStatementImpl.java:764)
ru.yandex.clickhouse.ClickHouseStatementImpl.sendRowBinaryStream(ClickHouseStatementImpl.java:758)
org.apache.beam.sdk.io.clickhouse.ClickHouseIO$WriteFn.flush(ClickHouseIO.java:427)
org.apache.beam.sdk.io.clickhouse.ClickHouseIO$WriteFn.finishBundle(ClickHouseIO.java:403)
org.apache.beam.sdk.io.clickhouse.AutoValue_ClickHouseIO_WriteFn$DoFnInvoker.invokeFinishBundle(Unknown Source)
org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.finishBundle(SimpleDoFnRunner.java:232)
org.apache.beam.runners.dataflow.worker.SimpleParDoFn.finishBundle(SimpleParDoFn.java:423)
org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.finish(ParDoOperation.java:56)
org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:85)
org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1350)
org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.access$1100(StreamingDataflowWorker.java:152)
org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker$7.run(StreamingDataflowWorker.java:1073)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
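
The figures in the error message can be checked by hand. The following is a minimal sketch, assuming the server's stack grows downward from `stack address + maximum stack size`; that layout is an assumption, although the reported numbers fit it exactly. `stack_usage` is a hypothetical helper name, not part of ClickHouse or its JDBC driver.

```python
def stack_usage(stack_address: int, frame_address: int, max_stack_size: int) -> int:
    """Bytes of stack in use, assuming the stack grows downward
    from the top of the region (stack_address + max_stack_size)."""
    stack_top = stack_address + max_stack_size
    return stack_top - frame_address


# Values copied from the first exception in the log above.
MAX_STACK = 8392704
used = stack_usage(0x7FEAE15FE000, 0x7FEAE19FE210, MAX_STACK)
fraction = used / MAX_STACK

print(used)                # 4197872, the "stack size" the server reports
print(round(fraction, 4))  # 0.5002
```

Read this way, the insert fails once the server thread has used just over half of its roughly 8 MiB stack. That points at a deep call chain on the ClickHouse server while it processes the insert, rather than at the Beam worker; the three materialized views may contribute, but the log alone does not show that.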

Related

File domain patron error when running spark streaming

I hit this error after my application had been running for several hours.
My Spark application reads a stream from a streaming Hudi table (a Hudi table that is constantly updated) and writes to a Parquet file. A second stream reads that same Parquet file and writes to another Hudi table. The flow is as follows:
Hudi -> stream 1 -> parquet -> stream 2 -> hudi
The error appears when stream 2 reads from the Parquet file. The underlying storage is OneFS.
User class threw exception: org.apache.spark.sql.streaming.StreamingQueryException: Failed to get file domain patron for path /path/_temporary. Error: Name: _temporary Status: STATUS_OBJECT_NAME_NOT_FOUND
=== Streaming Query ===
Identifier: [id = 72e6b29c-a641-47ff-82fc-ccd8146a4226, runId = 22af0796-7cb4-4599-b29f-ee95bda27cb3]
Current Committed Offsets: {FileStreamSource[hdfs://path]: {"logOffset":60}}
Current Available Offsets: {FileStreamSource[hdfs://path]: {"logOffset":60}}
Current State: ACTIVE
Thread State: RUNNABLE
Logical Plan:
FileStreamSource[hdfs://path]
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:356)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:244)
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): Failed to get file domain patron for path path/_temporary. Error: Name: _temporary Status: STATUS_OBJECT_NAME_NOT_FOUND
at org.apache.hadoop.ipc.Client.call(Client.java:1476)
at org.apache.hadoop.ipc.Client.call(Client.java:1413)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy10.getListing(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:578)
at sun.reflect.GeneratedMethodAccessor54.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy11.getListing(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:2086)
at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:944)
at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:927)
at org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:872)
at org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:868)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.listLocatedStatus(DistributedFileSystem.java:886)
at org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:1696)
at org.apache.spark.util.HadoopFSUtils$.listLeafFiles(HadoopFSUtils.scala:220)
at org.apache.spark.util.HadoopFSUtils$.$anonfun$parallelListLeafFilesInternal$1(HadoopFSUtils.scala:95)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at scala.collection.TraversableLike.map(TraversableLike.scala:238)
at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
at scala.collection.AbstractTraversable.map(Traversable.scala:108)
at org.apache.spark.util.HadoopFSUtils$.parallelListLeafFilesInternal(HadoopFSUtils.scala:85)
at org.apache.spark.util.HadoopFSUtils$.parallelListLeafFiles(HadoopFSUtils.scala:69)
at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$.bulkListLeafFiles(InMemoryFileIndex.scala:158)
at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.listLeafFiles(InMemoryFileIndex.scala:131)
at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.refresh0(InMemoryFileIndex.scala:94)
at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init>(InMemoryFileIndex.scala:66)
at org.apache.spark.sql.execution.streaming.FileStreamSource.allFilesUsingInMemoryFileIndex(FileStreamSource.scala:248)
at org.apache.spark.sql.execution.streaming.FileStreamSource.fetchAllFiles(FileStreamSource.scala:301)
at org.apache.spark.sql.execution.streaming.FileStreamSource.fetchMaxOffset(FileStreamSource.scala:128)
at org.apache.spark.sql.execution.streaming.FileStreamSource.latestOffset(FileStreamSource.scala:325)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$3(MicroBatchExecution.scala:394)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:357)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:355)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:68)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$2(MicroBatchExecution.scala:385)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
at scala.collection.immutable.Map$Map1.foreach(Map.scala:128)
at scala.collection.TraversableLike.map(TraversableLike.scala:238)
at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
at scala.collection.AbstractTraversable.map(Traversable.scala:108)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$1(MicroBatchExecution.scala:382)
at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.withProgressLocked(MicroBatchExecution.scala:613)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.constructNextBatch(MicroBatchExecution.scala:378)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:
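
The trace shows the file source's directory listing descending into `_temporary` and failing because that directory no longer exists. `_temporary` is where Hadoop-style output committers stage in-progress files before removing it on commit, so it can vanish between being seen and being listed. Below is a minimal stdlib sketch of a listing that tolerates this race. It is plain Python for illustration, not Spark's or Hadoop's code, and `list_leaf_files` is a hypothetical name.

```python
import os
import tempfile
from typing import List


def list_leaf_files(root: str) -> List[str]:
    """Recursively list files under root, skipping names that start with
    '_' or '.' and ignoring directories that disappear mid-listing."""
    leaves: List[str] = []
    try:
        entries = sorted(os.listdir(root))
    except FileNotFoundError:
        # The directory was removed after its parent was listed.
        return leaves
    for name in entries:
        if name.startswith(("_", ".")):
            continue  # e.g. _temporary, _SUCCESS, hidden files
        path = os.path.join(root, name)
        if os.path.isdir(path):
            leaves.extend(list_leaf_files(path))
        else:
            leaves.append(path)
    return leaves


with tempfile.TemporaryDirectory() as base:
    os.makedirs(os.path.join(base, "_temporary", "0"))
    open(os.path.join(base, "part-0000.parquet"), "w").close()
    found = [os.path.basename(p) for p in list_leaf_files(base)]
    # Listing a directory that is already gone yields nothing instead of raising.
    missing = list_leaf_files(os.path.join(base, "already-deleted"))

print(found)    # ['part-0000.parquet']
print(missing)  # []
```

In the pipeline described, a comparable effect might be had by keeping in-progress output out of the directory that stream 2 watches. That is a suggestion based on the trace, not something the post confirms.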

java.lang.NoSuchMethodError: org.apache.kafka.common.protocol.Readable.readArray([B)V

We started getting the exception below after upgrading spring-kafka to 2.8.9 and kafka-clients to 3.0.1. Please suggest a fix.
laris-default-group-id] Request joining group due to: consumer pro-actively leaving the group
2022-11-04--16-41-17-047 [T: U: D: Tx:/ URI: M:] [org.springframework.kafka.KafkaListenerEndpointContainer#9-0-C-1] ERROR org.springframework.kafka.listener.KafkaMessageListenerContainer - Stopping container due to an Error
java.lang.NoSuchMethodError: org.apache.kafka.common.protocol.Readable.readArray([B)V
at org.apache.kafka.common.message.SyncGroupResponseData.read(SyncGroupResponseData.java:173)
at org.apache.kafka.common.message.SyncGroupResponseData.<init>(SyncGroupResponseData.java:102)
at org.apache.kafka.common.requests.SyncGroupResponse.parse(SyncGroupResponse.java:61)
at org.apache.kafka.common.requests.AbstractResponse.parseResponse(AbstractResponse.java:135)
at org.apache.kafka.common.requests.AbstractResponse.parseResponse(AbstractResponse.java:109)
at org.apache.kafka.clients.NetworkClient.parseResponse(NetworkClient.java:720)
at org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:865)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:560)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:265)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:236)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:215)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:427)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:366)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:511)
at org.apache.kafka.clients.consumer.KafkaConsumer.updateAssignmentMetadataIfNeeded(KafkaConsumer.java:1262)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1233)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1166)
at brave.kafka.clients.TracingConsumer.poll(TracingConsumer.java:93)
at brave.kafka.clients.TracingConsumer.poll(TracingConsumer.java:87)
at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.pollConsumer(KafkaMessageListenerContainer.java:1529)
at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.doPoll(KafkaMessageListenerContainer.java:1519)
at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.pollAndInvoke(KafkaMessageListenerContainer.java:1343)
at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.run(KafkaMessageListenerContainer.java:1255)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.lang.Thread.run(Thread.java:748)
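
The signature in the error, `readArray([B)V`, is a JVM method descriptor. It says the calling class was compiled against a `Readable.readArray` that takes a `byte[]` and returns `void`, while the `Readable` class loaded at runtime has no such method. That pattern often points to mixed Kafka jar versions on the classpath, although the post does not confirm it. A small sketch that decodes such descriptors follows; `describe` is a hypothetical helper that handles only primitives, arrays, and object types.

```python
PRIMITIVES = {
    "B": "byte", "C": "char", "D": "double", "F": "float",
    "I": "int", "J": "long", "S": "short", "Z": "boolean", "V": "void",
}


def _read_type(desc: str, pos: int):
    """Decode one type starting at pos; return (java_name, next_pos)."""
    dims = 0
    while desc[pos] == "[":  # each '[' adds one array dimension
        dims += 1
        pos += 1
    if desc[pos] == "L":  # object type: Lpkg/Class;
        end = desc.index(";", pos)
        name = desc[pos + 1:end].replace("/", ".")
        pos = end + 1
    else:
        name = PRIMITIVES[desc[pos]]
        pos += 1
    return name + "[]" * dims, pos


def describe(method: str, desc: str) -> str:
    """Render a JVM method descriptor as a Java-like signature."""
    close = desc.index(")")
    params, pos = [], 1  # position 0 is the opening parenthesis
    while pos < close:
        name, pos = _read_type(desc, pos)
        params.append(name)
    ret, _ = _read_type(desc, close + 1)
    return f"{ret} {method}({', '.join(params)})"


print(describe("readArray", "([B)V"))  # void readArray(byte[])
```

With the signature decoded, a reasonable next step is to check which jar each of `SyncGroupResponseData` and `Readable` is loaded from at runtime, for example by inspecting the resolved dependency tree for more than one kafka-clients version.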

Python Apache beam SDK: ReadFromKafka can't consume the data (error)?

Running env:
OS: ubuntu 20.04
kafka version: 2.12-2.0.1
apache-beam library version: apache-beam==2.32.0
Procedure:
shell 1: Run the code below
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io.external.kafka import ReadFromKafka

pipeline_options = PipelineOptions(["--runner=DirectRunner"])

def run():
    with beam.Pipeline(options=pipeline_options) as p:
        _ = (
            p
            | 'ReadData' >> ReadFromKafka(
                consumer_config={"bootstrap.servers": "localhost:9092"},
                topics=["my-first-topic"],
            )
            | 'PrintData' >> beam.Map(print)
        )

if __name__ == "__main__":
    run()
output of shell 1:
WARNING:apache_beam.runners.interactive.interactive_environment:You have limited Interactive Beam features since your ipython kernel is not connected to any notebook frontend.
WARNING:root:Make sure that locally built Python SDK docker image has Python 3.8 interpreter.
2.32.0: Pulling from apache/beam_java11_sdk
Digest: sha256:a45f89584071950d371966abf910869c456179ab54c7b5213e3f4e2a54bd2753
Status: Image is up to date for apache/beam_java11_sdk:2.32.0
docker.io/apache/beam_java11_sdk:2.32.0
shell 2:
$ cd kafka_2.12-2.0.1/bin && ./kafka-console-producer.sh --topic "my-first-topic" --broker-list localhost:9092
>2
>3
>4
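
The console producer invocation above sends each typed line as a record value with no key, so the key reaches the consumer as null. The innermost cause in the trace further down, `cannot encode a null byte[]`, is consistent with a byte-array coder being handed that null key, though the post does not confirm which field was null. The sketch below is a toy model of the difference between a strict byte coder and a nullable wrapper. These are plain Python functions written for illustration, not Beam's coder classes, and the byte layout is invented.

```python
def encode_bytes(value) -> bytes:
    """Strict coder: 4-byte length prefix plus payload; rejects None."""
    if value is None:
        raise ValueError("cannot encode a null byte[]")
    return len(value).to_bytes(4, "big") + value


def encode_nullable(value) -> bytes:
    """Nullable wrapper: one marker byte, then the strict encoding if present."""
    if value is None:
        return b"\x00"
    return b"\x01" + encode_bytes(value)


def encode_record(key, value, key_coder=encode_bytes) -> bytes:
    """Encode a (key, value) pair; the value coder is always strict here."""
    return key_coder(key) + encode_bytes(value)


# A record shaped like the console producer's output: no key, value b"2".
try:
    encode_record(None, b"2")
    strict_result = "ok"
except ValueError as err:
    strict_result = str(err)

nullable_result = encode_record(None, b"2", key_coder=encode_nullable)

print(strict_result)    # cannot encode a null byte[]
print(nullable_result)  # b'\x00\x00\x00\x00\x012'
```

Under that reading, giving every record a key on the producer side would avoid the failure; the console producer's `parse.key` and `key.separator` properties exist for this purpose.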
output of shell 1:
WARNING:root:Make sure that locally built Python SDK docker image has Python 3.8 interpreter.
2.32.0: Pulling from apache/beam_java11_sdk
Digest: sha256:a45f89584071950d371966abf910869c456179ab54c7b5213e3f4e2a54bd2753
Status: Image is up to date for apache/beam_java11_sdk:2.32.0
docker.io/apache/beam_java11_sdk:2.32.0
ERROR:root:severity: ERROR
timestamp {
seconds: 1630485467
nanos: 764000000
}
message: "Client failed to deque and process the value"
trace: "org.apache.beam.sdk.util.UserCodeException: java.lang.IllegalArgumentException: Unable to encode element \'org.apache.beam.sdk.io.kafka.KafkaRecord#4c9edf30\' with coder \'KafkaRecordCoder(ByteArrayCoder,ByteArrayCoder)\'.\n\tat org.apache.beam.sdk.util.UserCodeException.wrap(UserCodeException.java:39)\n\tat org.apache.beam.fn.harness.FnApiDoFnRunner.outputTo(FnApiDoFnRunner.java:1683)\n\tat org.apache.beam.fn.harness.FnApiDoFnRunner.access$2500(FnApiDoFnRunner.java:139)\n\tat org.apache.beam.fn.harness.FnApiDoFnRunner$NonWindowObservingProcessBundleContext.outputWithTimestamp(FnApiDoFnRunner.java:2205)\n\tat org.apache.beam.fn.harness.FnApiDoFnRunner$ProcessBundleContextBase.output(FnApiDoFnRunner.java:2374)\n\tat org.apache.beam.sdk.transforms.DoFnOutputReceivers$WindowedContextOutputReceiver.output(DoFnOutputReceivers.java:78)\n\tat org.apache.beam.sdk.transforms.MapElements$1.processElement(MapElements.java:142)\n\tat org.apache.beam.sdk.transforms.MapElements$1$DoFnInvoker.invokeProcessElement(Unknown Source)\n\tat org.apache.beam.fn.harness.FnApiDoFnRunner.processElementForParDo(FnApiDoFnRunner.java:750)\n\tat org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:266)\n\tat org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:218)\n\tat org.apache.beam.fn.harness.FnApiDoFnRunner.outputTo(FnApiDoFnRunner.java:1680)\n\tat org.apache.beam.fn.harness.FnApiDoFnRunner.access$2500(FnApiDoFnRunner.java:139)\n\tat org.apache.beam.fn.harness.FnApiDoFnRunner$WindowObservingProcessBundleContext.outputWithTimestamp(FnApiDoFnRunner.java:2092)\n\tat org.apache.beam.sdk.transforms.DoFnOutputReceivers$WindowedContextOutputReceiver.outputWithTimestamp(DoFnOutputReceivers.java:87)\n\tat org.apache.beam.sdk.io.kafka.ReadFromKafkaDoFn.processElement(ReadFromKafkaDoFn.java:378)\n\tat 
org.apache.beam.sdk.io.kafka.ReadFromKafkaDoFn$DoFnInvoker.invokeProcessElement(Unknown Source)\n\tat org.apache.beam.fn.harness.FnApiDoFnRunner.processElementForWindowObservingSizedElementAndRestriction(FnApiDoFnRunner.java:1048)\n\tat org.apache.beam.fn.harness.FnApiDoFnRunner.access$1000(FnApiDoFnRunner.java:139)\n\tat org.apache.beam.fn.harness.FnApiDoFnRunner$4.accept(FnApiDoFnRunner.java:637)\n\tat org.apache.beam.fn.harness.FnApiDoFnRunner$4.accept(FnApiDoFnRunner.java:632)\n\tat org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:266)\n\tat org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:218)\n\tat org.apache.beam.fn.harness.BeamFnDataReadRunner.forwardElementToConsumer(BeamFnDataReadRunner.java:221)\n\tat org.apache.beam.sdk.fn.data.DecodingFnDataReceiver.accept(DecodingFnDataReceiver.java:43)\n\tat org.apache.beam.sdk.fn.data.DecodingFnDataReceiver.accept(DecodingFnDataReceiver.java:25)\n\tat org.apache.beam.fn.harness.data.QueueingBeamFnDataClient$ConsumerAndData.accept(QueueingBeamFnDataClient.java:316)\n\tat org.apache.beam.fn.harness.data.QueueingBeamFnDataClient.drainAndBlock(QueueingBeamFnDataClient.java:219)\n\tat org.apache.beam.fn.harness.control.ProcessBundleHandler.processBundle(ProcessBundleHandler.java:329)\n\tat org.apache.beam.fn.harness.control.BeamFnControlClient.delegateOnInstructionRequestType(BeamFnControlClient.java:140)\n\tat org.apache.beam.fn.harness.control.BeamFnControlClient$InboundObserver.lambda$onNext$0(BeamFnControlClient.java:110)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\n\tat java.base/java.lang.Thread.run(Thread.java:829)\nCaused by: java.lang.IllegalArgumentException: Unable to encode element 
\'org.apache.beam.sdk.io.kafka.KafkaRecord#4c9edf30\' with coder \'KafkaRecordCoder(ByteArrayCoder,ByteArrayCoder)\'.\n\tat org.apache.beam.sdk.coders.Coder.getEncodedElementByteSize(Coder.java:300)\n\tat org.apache.beam.sdk.coders.Coder.registerByteSizeObserver(Coder.java:291)\n\tat org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$SampleByteSizeDistribution.tryUpdate(PCollectionConsumerRegistry.java:385)\n\tat org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:259)\n\tat org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:218)\nCaused by: org.apache.beam.sdk.coders.CoderException: cannot encode a null byte[]\n\tat org.apache.beam.sdk.coders.ByteArrayCoder.encode(ByteArrayCoder.java:63)\n\tat org.apache.beam.sdk.coders.ByteArrayCoder.encode(ByteArrayCoder.java:56)\n\tat org.apache.beam.sdk.coders.ByteArrayCoder.encode(ByteArrayCoder.java:41)\n\tat org.apache.beam.sdk.coders.KvCoder.encode(KvCoder.java:72)\n\tat org.apache.beam.sdk.coders.KvCoder.encode(KvCoder.java:63)\n\tat org.apache.beam.sdk.io.kafka.KafkaRecordCoder.encode(KafkaRecordCoder.java:70)\n\tat org.apache.beam.sdk.io.kafka.KafkaRecordCoder.encode(KafkaRecordCoder.java:40)\n\tat org.apache.beam.sdk.coders.Coder.getEncodedElementByteSize(Coder.java:297)\n\tat org.apache.beam.sdk.coders.Coder.registerByteSizeObserver(Coder.java:291)\n\tat org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$SampleByteSizeDistribution.tryUpdate(PCollectionConsumerRegistry.java:385)\n\tat org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:259)\n\tat org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:218)\n\tat org.apache.beam.fn.harness.FnApiDoFnRunner.outputTo(FnApiDoFnRunner.java:1680)\n\tat 
org.apache.beam.fn.harness.FnApiDoFnRunner.access$2500(FnApiDoFnRunner.java:139)\n\tat org.apache.beam.fn.harness.FnApiDoFnRunner$NonWindowObservingProcessBundleContext.outputWithTimestamp(FnApiDoFnRunner.java:2205)\n\tat org.apache.beam.fn.harness.FnApiDoFnRunner$ProcessBundleContextBase.output(FnApiDoFnRunner.java:2374)\n\tat org.apache.beam.sdk.transforms.DoFnOutputReceivers$WindowedContextOutputReceiver.output(DoFnOutputReceivers.java:78)\n\tat org.apache.beam.sdk.transforms.MapElements$1.processElement(MapElements.java:142)\n\tat org.apache.beam.sdk.transforms.MapElements$1$DoFnInvoker.invokeProcessElement(Unknown Source)\n\tat org.apache.beam.fn.harness.FnApiDoFnRunner.processElementForParDo(FnApiDoFnRunner.java:750)\n\tat org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:266)\n\tat org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:218)\n\tat org.apache.beam.fn.harness.FnApiDoFnRunner.outputTo(FnApiDoFnRunner.java:1680)\n\tat org.apache.beam.fn.harness.FnApiDoFnRunner.access$2500(FnApiDoFnRunner.java:139)\n\tat org.apache.beam.fn.harness.FnApiDoFnRunner$WindowObservingProcessBundleContext.outputWithTimestamp(FnApiDoFnRunner.java:2092)\n\tat org.apache.beam.sdk.transforms.DoFnOutputReceivers$WindowedContextOutputReceiver.outputWithTimestamp(DoFnOutputReceivers.java:87)\n\tat org.apache.beam.sdk.io.kafka.ReadFromKafkaDoFn.processElement(ReadFromKafkaDoFn.java:378)\n\tat org.apache.beam.sdk.io.kafka.ReadFromKafkaDoFn$DoFnInvoker.invokeProcessElement(Unknown Source)\n\tat org.apache.beam.fn.harness.FnApiDoFnRunner.processElementForWindowObservingSizedElementAndRestriction(FnApiDoFnRunner.java:1048)\n\tat org.apache.beam.fn.harness.FnApiDoFnRunner.access$1000(FnApiDoFnRunner.java:139)\n\tat org.apache.beam.fn.harness.FnApiDoFnRunner$4.accept(FnApiDoFnRunner.java:637)\n\tat 
org.apache.beam.fn.harness.FnApiDoFnRunner$4.accept(FnApiDoFnRunner.java:632)\n\tat org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:266)\n\tat org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:218)\n\tat org.apache.beam.fn.harness.BeamFnDataReadRunner.forwardElementToConsumer(BeamFnDataReadRunner.java:221)\n\tat org.apache.beam.sdk.fn.data.DecodingFnDataReceiver.accept(DecodingFnDataReceiver.java:43)\n\tat org.apache.beam.sdk.fn.data.DecodingFnDataReceiver.accept(DecodingFnDataReceiver.java:25)\n\tat org.apache.beam.fn.harness.data.QueueingBeamFnDataClient$ConsumerAndData.accept(QueueingBeamFnDataClient.java:316)\n\tat org.apache.beam.fn.harness.data.QueueingBeamFnDataClient.drainAndBlock(QueueingBeamFnDataClient.java:219)\n\tat org.apache.beam.fn.harness.control.ProcessBundleHandler.processBundle(ProcessBundleHandler.java:329)\n\tat org.apache.beam.fn.harness.control.BeamFnControlClient.delegateOnInstructionRequestType(BeamFnControlClient.java:140)\n\tat org.apache.beam.fn.harness.control.BeamFnControlClient$InboundObserver.lambda$onNext$0(BeamFnControlClient.java:110)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\n\tat java.base/java.lang.Thread.run(Thread.java:829)\n"
instruction_id: "bundle_116"
log_location: "org.apache.beam.fn.harness.data.QueueingBeamFnDataClient"
thread: "31"
ERROR:root:severity: ERROR
timestamp {
seconds: 1630485467
nanos: 770000000
}
message: "Exception while trying to handle InstructionRequest bundle_116"
trace: "org.apache.beam.sdk.util.UserCodeException: java.lang.IllegalArgumentException: Unable to encode element \'org.apache.beam.sdk.io.kafka.KafkaRecord#4c9edf30\' with coder \'KafkaRecordCoder(ByteArrayCoder,ByteArrayCoder)\'.
...
/home/newdisk/miniconda3/lib/python3.8/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py in _run_bundle(self, runner_execution_context, bundle_context_manager, data_input, data_output, input_timers, expected_timer_output, bundle_manager)
767 expected_timer_output)
768
--> 769 result, splits = bundle_manager.process_bundle(
770 data_input, data_output, input_timers, expected_timer_output)
771 # Now we collect all the deferred inputs remaining from bundle execution.
/home/newdisk/miniconda3/lib/python3.8/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py in process_bundle(self, inputs, expected_outputs, fired_timers, expected_output_timers, dry_run)
1118
1119 if result.error:
-> 1120 raise RuntimeError(result.error)
1121
1122 if result.process_bundle.requires_finalization:
RuntimeError: org.apache.beam.sdk.util.UserCodeException: java.lang.IllegalArgumentException: Unable to encode element 'org.apache.beam.sdk.io.kafka.KafkaRecord#4c9edf30' with coder 'KafkaRecordCoder(ByteArrayCoder,ByteArrayCoder)'.
at org.apache.beam.sdk.util.UserCodeException.wrap(UserCodeException.java:39)
at org.apache.beam.fn.harness.FnApiDoFnRunner.outputTo(FnApiDoFnRunner.java:1683)
at org.apache.beam.fn.harness.FnApiDoFnRunner.access$2500(FnApiDoFnRunner.java:139)
at org.apache.beam.fn.harness.FnApiDoFnRunner$NonWindowObservingProcessBundleContext.outputWithTimestamp(FnApiDoFnRunner.java:2205)
at org.apache.beam.fn.harness.FnApiDoFnRunner$ProcessBundleContextBase.output(FnApiDoFnRunner.java:2374)
at org.apache.beam.sdk.transforms.DoFnOutputReceivers$WindowedContextOutputReceiver.output(DoFnOutputReceivers.java:78)
at org.apache.beam.sdk.transforms.MapElements$1.processElement(MapElements.java:142)
at org.apache.beam.sdk.transforms.MapElements$1$DoFnInvoker.invokeProcessElement(Unknown Source)
at org.apache.beam.fn.harness.FnApiDoFnRunner.processElementForParDo(FnApiDoFnRunner.java:750)
at org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:266)
at org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:218)
at org.apache.beam.fn.harness.FnApiDoFnRunner.outputTo(FnApiDoFnRunner.java:1680)
at org.apache.beam.fn.harness.FnApiDoFnRunner.access$2500(FnApiDoFnRunner.java:139)
at org.apache.beam.fn.harness.FnApiDoFnRunner$WindowObservingProcessBundleContext.outputWithTimestamp(FnApiDoFnRunner.java:2092)
at org.apache.beam.sdk.transforms.DoFnOutputReceivers$WindowedContextOutputReceiver.outputWithTimestamp(DoFnOutputReceivers.java:87)
at org.apache.beam.sdk.io.kafka.ReadFromKafkaDoFn.processElement(ReadFromKafkaDoFn.java:378)
at org.apache.beam.sdk.io.kafka.ReadFromKafkaDoFn$DoFnInvoker.invokeProcessElement(Unknown Source)
at org.apache.beam.fn.harness.FnApiDoFnRunner.processElementForWindowObservingSizedElementAndRestriction(FnApiDoFnRunner.java:1048)
at org.apache.beam.fn.harness.FnApiDoFnRunner.access$1000(FnApiDoFnRunner.java:139)
at org.apache.beam.fn.harness.FnApiDoFnRunner$4.accept(FnApiDoFnRunner.java:637)
at org.apache.beam.fn.harness.FnApiDoFnRunner$4.accept(FnApiDoFnRunner.java:632)
at org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:266)
at org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:218)
at org.apache.beam.fn.harness.BeamFnDataReadRunner.forwardElementToConsumer(BeamFnDataReadRunner.java:221)
at org.apache.beam.sdk.fn.data.DecodingFnDataReceiver.accept(DecodingFnDataReceiver.java:43)
at org.apache.beam.sdk.fn.data.DecodingFnDataReceiver.accept(DecodingFnDataReceiver.java:25)
at org.apache.beam.fn.harness.data.QueueingBeamFnDataClient$ConsumerAndData.accept(QueueingBeamFnDataClient.java:316)
at org.apache.beam.fn.harness.data.QueueingBeamFnDataClient.drainAndBlock(QueueingBeamFnDataClient.java:219)
at org.apache.beam.fn.harness.control.ProcessBundleHandler.processBundle(ProcessBundleHandler.java:329)
at org.apache.beam.fn.harness.control.BeamFnControlClient.delegateOnInstructionRequestType(BeamFnControlClient.java:140)
at org.apache.beam.fn.harness.control.BeamFnControlClient$InboundObserver.lambda$onNext$0(BeamFnControlClient.java:110)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.IllegalArgumentException: Unable to encode element 'org.apache.beam.sdk.io.kafka.KafkaRecord#4c9edf30' with coder 'KafkaRecordCoder(ByteArrayCoder,ByteArrayCoder)'.
at org.apache.beam.sdk.coders.Coder.getEncodedElementByteSize(Coder.java:300)
at org.apache.beam.sdk.coders.Coder.registerByteSizeObserver(Coder.java:291)
at org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$SampleByteSizeDistribution.tryUpdate(PCollectionConsumerRegistry.java:385)
at org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:259)
at org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:218)
Caused by: org.apache.beam.sdk.coders.CoderException: cannot encode a null byte[]
at org.apache.beam.sdk.coders.ByteArrayCoder.encode(ByteArrayCoder.java:63)
at org.apache.beam.sdk.coders.ByteArrayCoder.encode(ByteArrayCoder.java:56)
at org.apache.beam.sdk.coders.ByteArrayCoder.encode(ByteArrayCoder.java:41)
at org.apache.beam.sdk.coders.KvCoder.encode(KvCoder.java:72)
at org.apache.beam.sdk.coders.KvCoder.encode(KvCoder.java:63)
at org.apache.beam.sdk.io.kafka.KafkaRecordCoder.encode(KafkaRecordCoder.java:70)
at org.apache.beam.sdk.io.kafka.KafkaRecordCoder.encode(KafkaRecordCoder.java:40)
at org.apache.beam.sdk.coders.Coder.getEncodedElementByteSize(Coder.java:297)
at org.apache.beam.sdk.coders.Coder.registerByteSizeObserver(Coder.java:291)
at org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$SampleByteSizeDistribution.tryUpdate(PCollectionConsumerRegistry.java:385)
at org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:259)
at org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:218)
at org.apache.beam.fn.harness.FnApiDoFnRunner.outputTo(FnApiDoFnRunner.java:1680)
at org.apache.beam.fn.harness.FnApiDoFnRunner.access$2500(FnApiDoFnRunner.java:139)
at org.apache.beam.fn.harness.FnApiDoFnRunner$NonWindowObservingProcessBundleContext.outputWithTimestamp(FnApiDoFnRunner.java:2205)
at org.apache.beam.fn.harness.FnApiDoFnRunner$ProcessBundleContextBase.output(FnApiDoFnRunner.java:2374)
at org.apache.beam.sdk.transforms.DoFnOutputReceivers$WindowedContextOutputReceiver.output(DoFnOutputReceivers.java:78)
at org.apache.beam.sdk.transforms.MapElements$1.processElement(MapElements.java:142)
at org.apache.beam.sdk.transforms.MapElements$1$DoFnInvoker.invokeProcessElement(Unknown Source)
at org.apache.beam.fn.harness.FnApiDoFnRunner.processElementForParDo(FnApiDoFnRunner.java:750)
at org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:266)
at org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:218)
at org.apache.beam.fn.harness.FnApiDoFnRunner.outputTo(FnApiDoFnRunner.java:1680)
at org.apache.beam.fn.harness.FnApiDoFnRunner.access$2500(FnApiDoFnRunner.java:139)
at org.apache.beam.fn.harness.FnApiDoFnRunner$WindowObservingProcessBundleContext.outputWithTimestamp(FnApiDoFnRunner.java:2092)
at org.apache.beam.sdk.transforms.DoFnOutputReceivers$WindowedContextOutputReceiver.outputWithTimestamp(DoFnOutputReceivers.java:87)
at org.apache.beam.sdk.io.kafka.ReadFromKafkaDoFn.processElement(ReadFromKafkaDoFn.java:378)
at org.apache.beam.sdk.io.kafka.ReadFromKafkaDoFn$DoFnInvoker.invokeProcessElement(Unknown Source)
at org.apache.beam.fn.harness.FnApiDoFnRunner.processElementForWindowObservingSizedElementAndRestriction(FnApiDoFnRunner.java:1048)
at org.apache.beam.fn.harness.FnApiDoFnRunner.access$1000(FnApiDoFnRunner.java:139)
at org.apache.beam.fn.harness.FnApiDoFnRunner$4.accept(FnApiDoFnRunner.java:637)
at org.apache.beam.fn.harness.FnApiDoFnRunner$4.accept(FnApiDoFnRunner.java:632)
at org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:266)
at org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:218)
at org.apache.beam.fn.harness.BeamFnDataReadRunner.forwardElementToConsumer(BeamFnDataReadRunner.java:221)
at org.apache.beam.sdk.fn.data.DecodingFnDataReceiver.accept(DecodingFnDataReceiver.java:43)
at org.apache.beam.sdk.fn.data.DecodingFnDataReceiver.accept(DecodingFnDataReceiver.java:25)
at org.apache.beam.fn.harness.data.QueueingBeamFnDataClient$ConsumerAndData.accept(QueueingBeamFnDataClient.java:316)
at org.apache.beam.fn.harness.data.QueueingBeamFnDataClient.drainAndBlock(QueueingBeamFnDataClient.java:219)
at org.apache.beam.fn.harness.control.ProcessBundleHandler.processBundle(ProcessBundleHandler.java:329)
at org.apache.beam.fn.harness.control.BeamFnControlClient.delegateOnInstructionRequestType(BeamFnControlClient.java:140)
at org.apache.beam.fn.harness.control.BeamFnControlClient$InboundObserver.lambda$onNext$0(BeamFnControlClient.java:110)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
This looks related to the key_deserializer and value_deserializer arguments of ReadFromKafka, so I tried changing them:
# key_deserializer="org.apache.kafka.common.serialization.StringSerializer",
# value_deserializer="org.apache.kafka.common.serialization.StringSerializer",
But that raised another error:
RuntimeError: java.lang.RuntimeException: Failed to build transform beam:external:java:kafkaio:typedwithoutmetadata:v1 from spec urn: "beam:external:java:kafkaio:typedwithoutmetadata:v1"
payload: "\n\213\002\n\035\n\017consumer_config\032\n*\b\n\002\020\a\022\002\020\a\n\020\n\006topics\032\006\032\004\n\002\020\a\n\026\n\020key_deserializer\032\002\020\a\n\030\n\022value_deserializer\032\002\020\a\n\027\n\017start_read_time\032\004\b\001\020\004\n\027\n\017max_num_records\032\004\b\001\020\004\n\025\n\rmax_read_time\032\004\b\001\020\004\n\037\n\031commit_offset_in_finalize\032\002\020\b\n\026\n\020timestamp_policy\032\002\020\a\022$6a700b0b-2839-492d-8629-9b3268d90919\022\272\001\t\002p\000\000\000\000\001\021bootstrap.servers\016localhost:9092\000\000\000\001\016my-first-topic6org.apache.kafka.common.serialization.StringSerializer6org.apache.kafka.common.serialization.StringSerializer\000\016ProcessingTime"
What's wrong with them? Anything that I'm missing?
I had this issue as well. It turned out the smoking gun was this line in the traceback:
Caused by: org.apache.beam.sdk.coders.CoderException: cannot encode a null byte[]
What it's saying is that there's a null value somewhere and it can't encode it. For me, it turned out I was sending key=None in my kafka-python producer.send call, and changing the key to an encoded string fixed the issue. For example:
with open("../data/sample_data.json") as fp:
    for line in fp.readlines():
        producer.send(KAFKA_TOPIC, key=str.encode("foo"), value=str.encode(line))
You can also set the key_serializer and value_serializer in the KafkaProducer object.
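As a rough sketch of that second option: kafka-python's KafkaProducer accepts key_serializer and value_serializer callables, so the encoding and a guard against null keys can live in one place. The helper names and the "default" fallback key below are illustrative assumptions, not part of the original setup, and the producer wiring is shown only as a comment.

```python
from typing import Optional

# Hypothetical fallback key; pick whatever makes sense for your topic.
DEFAULT_KEY = "default"


def serialize_key(key: Optional[str]) -> bytes:
    """Encode the key as UTF-8, substituting the fallback when the key is None."""
    return (key if key is not None else DEFAULT_KEY).encode("utf-8")


def serialize_value(value: str) -> bytes:
    """Encode the value as UTF-8; refuse None instead of sending a null payload."""
    if value is None:
        raise ValueError("value must not be None")
    return value.encode("utf-8")


# Assumed wiring with kafka-python (not executed here):
#   producer = KafkaProducer(
#       bootstrap_servers="localhost:9092",
#       key_serializer=serialize_key,
#       value_serializer=serialize_value,
#   )
#   producer.send(KAFKA_TOPIC, key="foo", value=line)
```

With serializers configured this way, producer.send can be given plain strings and the str.encode calls at each call site are no longer needed.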
A heads up: even after fixing the null read there are still problems with reading from Kafka in Python outside of the Dataflow Runner. See: https://issues.apache.org/jira/browse/BEAM-11998
I had the same problem. As far as I could figure out, ReadFromKafka expects the data in key:value format. My solution was to start kafka-console-producer.sh with the options:
--property "parse.key=true" --property "key.separator=:"
and enter the data in the format key:value (e.g. name:peter instead of peter).
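To illustrate what those two properties do, here is a minimal Python sketch of the split, assuming the console producer cuts each line at the first occurrence of the separator. The function name is hypothetical and this is not the producer's actual code.

```python
from typing import Tuple


def split_keyed_line(line: str, separator: str = ":") -> Tuple[bytes, bytes]:
    """Split 'key<sep>value' at the first separator and return both parts as bytes."""
    key, found, value = line.partition(separator)
    if not found:
        # Without a separator there is no key to send along with the value.
        raise ValueError(f"no key separator {separator!r} in line: {line!r}")
    return key.encode("utf-8"), value.encode("utf-8")


print(split_keyed_line("name:peter"))  # (b'name', b'peter')
```

The point is only that every record ends up with a non-null key, which is what the KafkaRecordCoder in the traceback needs.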

Map and transform methods with using serializable class on DStream elements

I am writing an app that extracts data from logs, so I implemented a function (Listing 1) that takes a string as a parameter and extracts the valuable information from it (regexes: Listing 2). I wanted this method to be shippable to other workers, so I implemented it in a serializable class.
I have a problem applying this method to DStreams. Here is my stream-mining solution:
def streamMinner(): Unit = {
  val ssc = new StreamingContext(sc, Seconds(2))
  val logsStream = ssc.textFileStream("logs/")
  // Does not work
  val extractLogs = logsStream.map( log => new Matcher().matchLog(log))
  extractLogs.print(1)
  // Works
  // val words = logsStream.transform( rdd => rdd.map( log => matchLog(log)))
  // words.print()
  ssc.start()
  ssc.awaitTermination()
}
The problem is in the line where every element of logsStream is mapped with a new object of the Matcher class (new Matcher().matchLog(log)).
Apache Spark gave me the errors below:
ERROR YarnScheduler: Lost executor 2 on host1: Container marked as failed: container_e743_1499728610705_0043_01_000003 on host: host1. Exit status: 50. Diagnostics: Exception from container-launch.
Container id: container_e743_1499728610705_0043_01_000003
Exit code: 50
Stack trace: ExitCodeException exitCode=50:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:600)
at org.apache.hadoop.util.Shell.run(Shell.java:511)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:783)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:303)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 50
ERROR YarnScheduler: Lost executor 5 on host2: Container marked as failed: container_e743_1499728610705_0043_01_000006 on host: host2. Exit status: 50. Diagnostics: Exception from container-launch.
Container id: container_e743_1499728610705_0043_01_000006
Exit code: 50
Stack trace: ExitCodeException exitCode=50:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:600)
at org.apache.hadoop.util.Shell.run(Shell.java:511)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:783)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:303)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 50
...
ERROR YarnScheduler: Lost executor 6 ...
ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
17/07/11 09:41:09 ERROR JobScheduler: Error running job streaming job 1499758850000 ms.0
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, host1): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_e743_1499728610705_0043_01_000007 on host: host1. Exit status: 50. Diagnostics: Exception from container-launch.
Container id: container_e743_1499728610705_0043_01_000007
Exit code: 50
Stack trace: ExitCodeException exitCode=50:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:600)
at org.apache.hadoop.util.Shell.run(Shell.java:511)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:783)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:303)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 50
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1433)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1421)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1420)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1420)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:801)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:801)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:801)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1642)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1601)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1590)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:622)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1856)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1869)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1882)
at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1335)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:323)
at org.apache.spark.rdd.RDD.take(RDD.scala:1309)
at org.apache.spark.streaming.dstream.DStream$$anonfun$print$2$$anonfun$foreachFunc$5$1.apply(DStream.scala:768)
at org.apache.spark.streaming.dstream.DStream$$anonfun$print$2$$anonfun$foreachFunc$5$1.apply(DStream.scala:767)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:49)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:227)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:227)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:227)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:226)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, host1): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_e743_1499728610705_0043_01_000007 on host: host1. Exit status: 50. Diagnostics: Exception from container-launch.
Container id: container_e743_1499728610705_0043_01_000007
Exit code: 50
Stack trace: ExitCodeException exitCode=50:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:600)
at org.apache.hadoop.util.Shell.run(Shell.java:511)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:783)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:303)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 50
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1433)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1421)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1420)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1420)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:801)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:801)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:801)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1642)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1601)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1590)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:622)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1856)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1869)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1882)
at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1335)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:323)
at org.apache.spark.rdd.RDD.take(RDD.scala:1309)
at org.apache.spark.streaming.dstream.DStream$$anonfun$print$2$$anonfun$foreachFunc$5$1.apply(DStream.scala:768)
at org.apache.spark.streaming.dstream.DStream$$anonfun$print$2$$anonfun$foreachFunc$5$1.apply(DStream.scala:767)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:49)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:227)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:227)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:227)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:226)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
scala> 17/07/11 09:41:11 ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from host3/11.11.11.11:11111 is closed
17/07/11 09:41:11 ERROR YarnScheduler: Lost executor 4 on host3: Slave lost
17/07/11 09:41:12 ERROR TransportClient: Failed to send RPC 7741519719369750843 to host3/11.11.11.11:11111: java.nio.channels.ClosedChannelException
java.nio.channels.ClosedChannelException
17/07/11 09:41:12 ERROR YarnScheduler: Lost executor 1 on host2: Slave lost
17/07/11 09:41:12 ERROR TransportClient: Failed to send RPC 7734757459881277232 to host3//11.11.11.11:11111: java.nio.channels.ClosedChannelException
java.nio.channels.ClosedChannelException
17/07/11 09:41:12 ERROR YarnScheduler: Lost executor 3 on host4: Slave lost
17/07/11 09:41:12 ERROR TransportClient: Failed to send RPC 5816053641531447955 to host3//11.11.11.11:11111: java.nio.channels.ClosedChannelException
java.nio.channels.ClosedChannelException
17/07/11 09:41:12 ERROR YarnScheduler: Lost executor 7 on host2: Slave lost
17/07/11 09:41:13 ERROR TransportClient: Failed to send RPC 8774007142277591342 to host3/11.11.11.11:11111: java.nio.channels.ClosedChannelException
java.nio.channels.ClosedChannelException
17/07/11 09:41:13 ERROR YarnScheduler: Lost executor 8 on host1: Slave lost
17/07/11 09:41:19 ERROR YarnScheduler: Lost executor 1 on host3: Container marked as failed: container_e743_1499728610705_0043_02_000002 on host: host3. Exit status: 50. Diagnostics: Exception from container-launch.
Container id: container_e743_1499728610705_0043_02_000002
Exit code: 50
Stack trace: ExitCodeException exitCode=50:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:600)
at org.apache.hadoop.util.Shell.run(Shell.java:511)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:783)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:303)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 50
When I comment out the lines:
val extractLogs = logsStream.map( log => new Matcher().matchLog(log))
extractLogs.print(1)
and uncomment the lines:
// val words = logsStream.transform( rdd => rdd.map( log => matchLog(log)))
// words.print()
everything works fine. My question is: why? I'm afraid that the solution that works may not be parallelized on the cluster, because the method matchLog is not serializable. Has anyone had a similar problem, or does anyone know how to deal with it?
Listing 1:
case class logValues2(time_stamp: String, action: String, protocol: String, connection_id: String, src_ip: String, dst_ip: String, src_port: String, dst_port: String, duration: String, bytes: String, user: String) extends Serializable
class Matcher extends Serializable {
  // Extracts the log fields from one raw log line using the regexes from Listing 2.
  def matchLog(x: String): logValues2 = {
    val time_stamp = time_stamp_reg.findAllIn(x).mkString(",")
    val action = action_reg.findAllIn(x).mkString(",")
    val protocol = protocol_reg.findAllIn(x).mkString(",")
    val connection_id = connection_id_reg.findAllIn(x).mkString(",")
    // The first matched IP is the source; the second, if present, is the destination.
    val ips = ips_reg.findAllIn(x).mkString(" ").split(" ")
    val src_ip = ips(0)
    val dst_ip = if (ips.length > 1) ips(1) else " "
    // Ports follow the same source/destination order as the IPs.
    val ports = ports_reg.findAllIn(x).mkString(" ").split(" ")
    val src_port = ports(0)
    val dst_port = if (ports.length > 1) ports(1) else " "
    val duration = duration_reg.findAllIn(x).mkString(",")
    val bytes = bytes_reg.findAllIn(x).mkString(",")
    val user = user_reg.findAllIn(x).mkString(",")
    logValues2(time_stamp, action, protocol, connection_id, src_ip, dst_ip, src_port, dst_port, duration, bytes, user)
  }
}
The above method is also implemented separately (not within the Matcher class).
UPDATE:
My regular expressions (Listing 2):
val time_stamp_reg = """^.*?(?=\s\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\s%)""".r
val action_reg = """((?<=:\s)\w{4,10}(?=\s\w{2})|(?<=\w\s)(\w{7,9})(?=\s[f]))""".r
val protocol_reg = """(?<=[\w:]\s)(\w+)(?=\s[cr])""".r
val connection_id_reg = """(?<=\w\s)(\d+)(?=\sfor)""".r
val ips_reg = """(?<=[\d\w][:\s])(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})(?=\/\d+|\z| \w)""".r
val ports_reg = """(?<=\d\/)(\d{1,6})(?=\z|[\s(])""".r
val duration_reg = """(?<=duration\s)(\d{1,2}:\d{1,2}:\d{1,2})(?=\s|\z)""".r
val bytes_reg = """(?<=bytes\s)(\d+)(?=\s|\z)""".r
val user_reg = """(?<=\\\\)(\d+)(?=\W)""".r

nutch2.0 with cassandra

Exception in thread "main" org.apache.gora.util.GoraException: java.io.IOException
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:167)
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:135)
at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:75)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:214)
at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68)
at org.apache.nutch.crawl.Crawler.run(Crawler.java:136)
at org.apache.nutch.crawl.Crawler.run(Crawler.java:250)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawler.main(Crawler.java:257)
Caused by: java.io.IOException
at org.apache.gora.cassandra.store.CassandraStore.initialize(CassandraStore.java:88)
at org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:102)
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:161)
... 8 more
Caused by: java.lang.NullPointerException
at org.apache.gora.cassandra.store.CassandraMapping.<init>(CassandraMapping.java:117)
at org.apache.gora.cassandra.store.CassandraMappingManager.get(CassandraMappingManager.java:84)
at org.apache.gora.cassandra.store.CassandraClient.initialize(CassandraClient.java:84)
at org.apache.gora.cassandra.store.CassandraStore.initialize(CassandraStore.java:85)
... 10 more
I just ran Nutch 2.0 on Cassandra. The above is the output of the crawl, and the output of TestGoraStorage is as follows:
Starting!
Exception in thread "main" org.apache.gora.util.GoraException: java.io.IOException
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:167)
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:135)
at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:75)
at org.apache.nutch.storage.TestGoraStorage.main(TestGoraStorage.java:204)
Caused by: java.io.IOException
at org.apache.gora.cassandra.store.CassandraStore.initialize(CassandraStore.java:88)
at org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:102)
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:161)
... 3 more
Caused by: java.lang.NullPointerException
at org.apache.gora.cassandra.store.CassandraMapping.<init>(CassandraMapping.java:117)
at org.apache.gora.cassandra.store.CassandraMappingManager.get(CassandraMappingManager.java:84)
at org.apache.gora.cassandra.store.CassandraClient.initialize(CassandraClient.java:84)
at org.apache.gora.cassandra.store.CassandraStore.initialize(CassandraStore.java:85)
... 5 more
I can connect to Cassandra with cassandra-cli, and I just checked out Nutch from svn.
Here is the effective config in gora.properties:
gora.datastore.default=org.apache.gora.cassandra.store.CassandraStore
gora.sqlstore.jdbc.driver=org.hsqldb.jdbc.JDBCDriver
gora.sqlstore.jdbc.url=jdbc:hsqldb:hsql://210.44.138.8/nutchtest
gora.sqlstore.jdbc.user=sa
gora.sqlstore.jdbc.password=
gora.cassandrastore.servers=210.44.138.8:9160
and the config in gora-cassandra-mapping:
<keyspace name="webpage" cluster="My Cluster" host="210.44.138.8">
<family name="p"/>
<family name="f"/>
<family name="sc" type="super"/>
</keyspace>
210.44.138.8 is a node of my cluster, and the name of the cluster is "My Cluster".
More info: the firewall is turned off, and I run it in Eclipse. I would be very grateful for any help.
I'm not sure if I had the exact same problem, but I found that in the gora-cassandra-mapping.xml file I had to add a keyspace attribute (keyspace="ks1") to the class element:
<keyspace name="ks1" cluster="My Cluster" host="1.2.3.4">
...
</keyspace>
<class keyspace="ks1" keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage">
...
</class>
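A quick way to catch this kind of mismatch before running Nutch is to check that every class element names a keyspace that is actually declared in the mapping file. The following is a minimal Python sketch using only the standard library; the function name is hypothetical, and the gora-orm root element in the sample is an assumption about the file layout.

```python
import xml.etree.ElementTree as ET
from typing import List


def undeclared_keyspaces(mapping_xml: str) -> List[str]:
    """Return names of <class> elements whose keyspace attribute is missing or undeclared."""
    root = ET.fromstring(mapping_xml)
    # Names of all declared keyspaces, ignoring keyspace elements without a name.
    declared = {ks.get("name") for ks in root.iter("keyspace")} - {None}
    return [
        cls.get("name", "<unnamed>")
        for cls in root.iter("class")
        if cls.get("keyspace") not in declared
    ]


# Sample mapping; the <gora-orm> root element is an assumption.
SAMPLE = """
<gora-orm>
  <keyspace name="ks1" cluster="My Cluster" host="1.2.3.4">
    <family name="p"/>
  </keyspace>
  <class keyspace="ks1" keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage"/>
</gora-orm>
"""

print(undeclared_keyspaces(SAMPLE))  # []
```

An empty list means every class points at a declared keyspace; any class name in the result is a candidate for a missing or misspelled keyspace attribute.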