Spark Avro throws exception on file write: NoSuchMethodError - scala

Any attempt to write a file in Avro format fails with the stack trace below.
We are using Spark 2.4.3 (with user-provided Hadoop) and Scala 2.12, and we load the Avro package at runtime with either spark-shell:
spark-shell --packages org.apache.spark:spark-avro_2.12:2.4.3
or spark-submit:
spark-submit --packages org.apache.spark:spark-avro_2.12:2.4.3 ...
The Spark session reports that the Avro package loaded successfully.
... in either case, the moment we attempt to write any data in Avro format, like:
df.write.format("avro").save("hdfs:///path/to/outputfile.avro")
or with a select:
df.select("recordidstring").write.format("avro").save("hdfs:///path/to/outputfile.avro")
... produces the same stack trace (this copy is from spark-shell):
java.lang.NoSuchMethodError: org.apache.avro.Schema.createUnion([Lorg/apache/avro/Schema;)Lorg/apache/avro/Schema;
at org.apache.spark.sql.avro.SchemaConverters$.toAvroType(SchemaConverters.scala:185)
at org.apache.spark.sql.avro.SchemaConverters$.$anonfun$toAvroType$1(SchemaConverters.scala:176)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at org.apache.spark.sql.types.StructType.foreach(StructType.scala:99)
at org.apache.spark.sql.avro.SchemaConverters$.toAvroType(SchemaConverters.scala:174)
at org.apache.spark.sql.avro.AvroFileFormat.$anonfun$prepareWrite$2(AvroFileFormat.scala:119)
at scala.Option.getOrElse(Option.scala:138)
at org.apache.spark.sql.avro.AvroFileFormat.prepareWrite(AvroFileFormat.scala:118)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:103)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:170)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:676)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
We are able to write other formats (text-delimited, json, ORC, parquet) without any trouble.
We are using HDFS (Hadoop v3.1.2) as the filestore.
I have experimented with different versions of the Avro package (e.g. the 2.11 build, and lower versions), which either raise the same error or fail to load entirely due to incompatibility. This error occurs with all of Python, Scala (using the shell or spark-submit) and Java (using spark-submit).
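A sanity check worth running from spark-shell on the problem cluster (a rough diagnostic sketch, on the assumption that an older Avro jar, such as the 1.7.x bundled with a Hadoop distribution, could be shadowing the one spark-avro needs) is to print which Avro jar the driver actually loaded and whether that jar has the varargs createUnion overload named in the stack trace (as far as I know, it was added in Avro 1.8.0):
import org.apache.avro.Schema

// Which avro jar actually won the classpath race on the driver
println(classOf[Schema].getProtectionDomain.getCodeSource.getLocation)

// Does that jar have createUnion(Schema...), the overload the write path calls?
try {
  classOf[Schema].getMethod("createUnion", classOf[Array[Schema]])
  println("varargs createUnion present (Avro 1.8.0 or later)")
} catch {
  case _: NoSuchMethodException =>
    println("varargs createUnion missing (likely an older Avro, e.g. 1.7.x, on the classpath)")
}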
There appears to be an open issue on the apache.org JIRA for this, but it is a year old now without any resolution. I've bumped that issue, but I'm also wondering whether the community has a fix. Any help much appreciated.

I had the same exception on a recent Spark version (3.1.2). When I added the following dependencies to the pom, it disappeared.
<properties>
    ....
    <spark.version>3.1.2</spark.version>
    <avro.version>1.10.2</avro.version>
</properties>
<dependencies>
    ....
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-hive_2.12</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-avro_2.12</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.avro</groupId>
        <artifactId>avro</artifactId>
        <version>${avro.version}</version>
    </dependency>
</dependencies>
It seems you are missing required dependencies on the classpath of the environment where you are launching your application.

Based on a comment in the linked issue, you should specify Avro at version 1.8.0 or higher, something like this:
spark-submit --packages org.apache.spark:spark-avro_2.12:2.4.3,org.apache.avro:avro:1.9.2 ...
(You might want to try the packages in the other order, too.)

I met the same error as yours, but after I updated my Spark version to 2.4.4 (Scala 2.11 build) the problem disappeared.

This issue appears to be specific to the configuration of our local cluster: single-node builds of HDFS (locally on Windows, on other Linux machines, etc.) allow Avro writes without any trouble. We will rebuild the problem cluster, but I'm confident the issue is a bad configuration on that cluster only. Solution: rebuild.

Related

Flink: Adding flink-sql-connector-kafka to fat-jar

I use Flink SQL (version 1.11) and would like to process data from Kafka. For this I wrote a job from the Scala template and added the dependencies to pom.xml.
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka_${scala.binary.version}</artifactId>
    <version>${flink.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-sql-connector-kafka_${scala.binary.version}</artifactId>
    <version>${flink.version}</version>
</dependency>
When I want to run the job in the cluster, I get the following error:
Caused by: org.apache.flink.table.api.ValidationException: Could not find any factory for identifier 'kafka' that implements 'org.apache.flink.table.factories.DynamicTableSinkFactory' in the classpath.
Available factory identifiers are:
blackhole
print
If I add the flink-sql-connector-kafka jar to the /lib folder it works, but then I can't use the SQL client, because the SQL client loads the connector once from its own lib folder while it is already loaded in the cluster. Then the following error appears:
java.lang.ClassCastException: cannot assign instance of org.apache.commons.collections.map.LinkedMap to field org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.pendingOffsetsToCommit of type org.apache.commons.collections.map.LinkedMap in instance of org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
How can I add the flink-sql-connector-kafka to the fat-jar? Or should these SQL connectors rather be added to the /lib folder?
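In case it helps to narrow this down, a small check (a rough sketch, assuming Flink 1.11's org.apache.flink.table.factories.Factory SPI) can list which factory identifiers are actually visible from inside the packaged job; 'kafka' should show up if the connector made it onto the classpath correctly:
import java.util.ServiceLoader
import scala.collection.JavaConverters._
import org.apache.flink.table.factories.Factory

// Enumerate the factories registered via META-INF/services; if only
// 'blackhole' and 'print' appear, the connector's service entries were
// most likely not merged into the fat jar.
for (factory <- ServiceLoader.load(classOf[Factory]).iterator().asScala)
  println(factory.factoryIdentifier())
If only the built-in identifiers show up, the usual suspect is the shade/assembly configuration overwriting META-INF/services files instead of merging them (for the Maven shade plugin, the ServicesResourceTransformer typically handles this).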

Could not instantiate the executor. Make sure a planner module is on the classpath

For a Scala project I use the StreamTableEnvironment, and when running my code in IntelliJ everything works fine. However, when I try to export my project to a jar (I create a fat jar using sbt assembly), no suitable table factory can be found. I've looked inside the jar and the classes it needs are included. Here is the complete stack trace:
Exception in thread "main" org.apache.flink.table.api.TableException: Could not instantiate the executor. Make sure a planner module is on the classpath
at org.apache.flink.table.api.scala.internal.StreamTableEnvironmentImpl$.lookupExecutor(StreamTableEnvironmentImpl.scala:328)
at org.apache.flink.table.api.scala.internal.StreamTableEnvironmentImpl$.create(StreamTableEnvironmentImpl.scala:284)
at org.apache.flink.table.api.scala.StreamTableEnvironment$.create(StreamTableEnvironment.scala:366)
at org.tudelft.plugins.SQLService$.setupEnv(SQLService.scala:40)
at org.tudelft.plugins.SQLStage.main(SQLStage.scala:19)
at org.codefeedr.stages.OutputStage.transform(OutputStage.scala:45)
at org.codefeedr.pipeline.Pipeline.$anonfun$startMock$1(Pipeline.scala:240)
at org.codefeedr.pipeline.Pipeline.$anonfun$startMock$1$adapted(Pipeline.scala:238)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at org.codefeedr.pipeline.Pipeline.startMock(Pipeline.scala:238)
at org.tudelft.Main$.main(Main.scala:34)
at org.tudelft.Main.main(Main.scala)
Caused by: org.apache.flink.table.api.NoMatchingTableFactoryException: Could not find a suitable table factory for 'org.apache.flink.table.delegation.ExecutorFactory' in
the classpath.
Reason: No factory implements 'org.apache.flink.table.delegation.ExecutorFactory'.
The following properties are requested:
class-name=org.apache.flink.table.executor.StreamExecutorFactory
streaming-mode=true
The following factories have been considered:
at org.apache.flink.table.factories.TableFactoryService.filterByFactoryClass(TableFactoryService.java:243)
at org.apache.flink.table.factories.TableFactoryService.filter(TableFactoryService.java:186)
at org.apache.flink.table.factories.TableFactoryService.findAllInternal(TableFactoryService.java:172)
at org.apache.flink.table.factories.TableFactoryService.findAll(TableFactoryService.java:126)
at org.apache.flink.table.factories.ComponentFactoryService.find(ComponentFactoryService.java:48)
at org.apache.flink.table.api.scala.internal.StreamTableEnvironmentImpl$.lookupExecutor(StreamTableEnvironmentImpl.scala:312)
... 16 more
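For what it's worth, one way to check whether the planner's factories survived the sbt-assembly merge is to list the TableFactory implementations visible from inside the assembled jar (a rough sketch; it assumes this Flink version still discovers factories through the java.util.ServiceLoader mechanism that TableFactoryService relies on):
import java.util.ServiceLoader
import scala.collection.JavaConverters._
import org.apache.flink.table.factories.TableFactory

// List every TableFactory registered via META-INF/services on the classpath;
// if the planner's StreamExecutorFactory is missing here, the service files
// were probably dropped or overwritten when the fat jar was assembled.
for (factory <- ServiceLoader.load(classOf[TableFactory]).iterator().asScala)
  println(factory.getClass.getName)
When building with sbt-assembly, META-INF/services files typically need MergeStrategy.concat (or similar) so that factory registrations from several jars are merged rather than one overwriting the others.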
Similar to, but slightly different from, what the OP needed: I had unit tests using the Table API that were failing with the same error message, even though the same pipeline worked fine when submitted to a real Flink cluster.
org.apache.flink.table.api.TableException: Could not instantiate the executor. Make sure a planner module is on the classpath
The solution was to add:
<dependency>
    <!-- this is needed to use the Table API from unit tests -->
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-table-planner_${scala.binary.version}</artifactId>
    <version>${flink.version}</version>
    <scope>test</scope>
</dependency>
I believe flink-table-planner-blink is no longer available in recent versions (I'm using 1.15); it has instead been replaced by flink-table-planner.
Probably a bit late to answer, but I had the same issue and the solution was to add the Blink planner dependency (original answer here: https://issues.apache.org/jira/browse/FLINK-14031). It worked for me:
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-table-planner-blink_2.11</artifactId>
    <version>${flink.version}</version>
</dependency>

Kafka streams on Windows 7 looking for rocksdb's dll

I have a Kafka Streams program. It runs on a 64-bit Windows machine with a standalone Kafka server running on it. The Java version is Java 8.
In the code, the pom has dependencies on the Kafka clients and streams APIs, and the versions are the latest, i.e. 0.10.2.
Whenever I run the streams app, it looks for RocksDB's DLL file in my user home and fails.
Has anyone faced the same issue? The same code runs fine on a different 64-bit Windows machine.
pom
<dependencies>
    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka-clients</artifactId>
        <version>0.10.2.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka-streams</artifactId>
        <version>0.10.2.1</version>
    </dependency>
</dependencies>
Stacktrace
Exception in thread "StreamThread-1" java.lang.UnsatisfiedLinkError: C:\Users\abcd\AppData\Local\Temp\librocksdbjni8989756873626302713.dll: Can't find dependent libraries
at java.lang.ClassLoader$NativeLibrary.load(Native Method)
at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1941)
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1824)
at java.lang.Runtime.load0(Runtime.java:809)
at java.lang.System.load(System.java:1086)
at org.rocksdb.NativeLibraryLoader.loadLibraryFromJar(NativeLibraryLoader.java:78)
at org.rocksdb.NativeLibraryLoader.loadLibrary(NativeLibraryLoader.java:56)
at org.rocksdb.RocksDB.loadLibrary(RocksDB.java:64)
at org.rocksdb.RocksDB.<clinit>(RocksDB.java:35)
at org.rocksdb.Options.<clinit>(Options.java:22)
at org.apache.kafka.streams.state.internals.RocksDBStore.openDB(RocksDBStore.java:117)
at org.apache.kafka.streams.state.internals.Segment.openDB(Segment.java:38)
at org.apache.kafka.streams.state.internals.Segments.getOrCreateSegment(Segments.java:76)
at org.apache.kafka.streams.state.internals.RocksDBSegmentedBytesStore.put(RocksDBSegmentedBytesStore.java:73)
at org.apache.kafka.streams.state.internals.ChangeLoggingSegmentedBytesStore.put(ChangeLoggingSegmentedBytesStore.java:55)
at org.apache.kafka.streams.state.internals.MeteredSegmentedBytesStore.put(MeteredSegmentedBytesStore.java:101)
at org.apache.kafka.streams.state.internals.RocksDBWindowStore.put(RocksDBWindowStore.java:110)
at org.apache.kafka.streams.state.internals.RocksDBWindowStore.put(RocksDBWindowStore.java:102)
at org.apache.kafka.streams.kstream.internals.KStreamJoinWindow$KStreamJoinWindowProcessor.process(KStreamJoinWindow.java:65)
at org.apache.kafka.streams.processor.internals.ProcessorNode$1.run(ProcessorNode.java:48)
at org.apache.kafka.streams.processor.internals.StreamsMetricsImpl.measureLatencyNs(StreamsMetricsImpl.java:188)
at org.apache.kafka.streams.processor.internals.ProcessorNode.process(ProcessorNode.java:134)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:83)
at org.apache.kafka.streams.kstream.internals.KStreamFilter$KStreamFilterProcessor.process(KStreamFilter.java:44)
at org.apache.kafka.streams.processor.internals.ProcessorNode$1.run(ProcessorNode.java:48)
at org.apache.kafka.streams.processor.internals.StreamsMetricsImpl.measureLatencyNs(StreamsMetricsImpl.java:188)
at org.apache.kafka.streams.processor.internals.ProcessorNode.process(ProcessorNode.java:134)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:83)
at org.apache.kafka.streams.processor.internals.SourceNode.process(SourceNode.java:70)
at org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:197)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:627)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:361)
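To separate this from the streams topology itself, a bare-bones check (a rough sketch, written in Scala here for brevity) is to force rocksdbjni to extract and load its native library in a plain JVM on the affected machine:
import org.rocksdb.RocksDB

object RocksDbLoadCheck {
  def main(args: Array[String]): Unit = {
    // Extracts librocksdbjni*.dll to the temp directory and loads it; on the
    // broken machine this should throw the same UnsatisfiedLinkError.
    RocksDB.loadLibrary()
    println("RocksDB native library loaded OK")
  }
}
On Windows, "Can't find dependent libraries" from rocksdbjni usually means a required system runtime (e.g. the Microsoft Visual C++ redistributable) is missing on that particular machine, which would also explain why the same code runs on another 64-bit Windows box.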

UnsatisfiedLinkError When sending compressed (snappy) message to kafka

In my Java web app I'm sending messages to Kafka.
I would like to compress my messages before sending it so I'm setting in my producer properties:
props.put("compression.codec", "2");
As I understand "2" stands for snappy, but when sending a message I'm getting:
java.lang.UnsatisfiedLinkError: org.xerial.snappy.SnappyNative.maxCompressedLength(I)I
at org.xerial.snappy.SnappyNative.maxCompressedLength(Native Method)
at org.xerial.snappy.Snappy.maxCompressedLength(Snappy.java:316)
at org.xerial.snappy.SnappyOutputStream.<init>(SnappyOutputStream.java:79)
at org.xerial.snappy.SnappyOutputStream.<init>(SnappyOutputStream.java:66)
at kafka.message.SnappyCompression.<init>(CompressionUtils.scala:61)
at kafka.message.CompressionFactory$.apply(CompressionUtils.scala:82)
at kafka.message.CompressionUtils$.compress(CompressionUtils.scala:109)
at kafka.message.MessageSet$.createByteBuffer(MessageSet.scala:71)
at kafka.message.ByteBufferMessageSet.<init>(ByteBufferMessageSet.scala:44)
at kafka.producer.async.DefaultEventHandler$$anonfun$3.apply(DefaultEventHandler.scala:94)
at kafka.producer.async.DefaultEventHandler$$anonfun$3.apply(DefaultEventHandler.scala:82)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:95)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:95)
at scala.collection.Iterator$class.foreach(Iterator.scala:772)
at scala.collection.mutable.HashTable$$anon$1.foreach(HashTable.scala:157)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:190)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:45)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:95)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
at scala.collection.mutable.HashMap.map(HashMap.scala:45)
at kafka.producer.async.DefaultEventHandler.serialize(DefaultEventHandler.scala:82)
at kafka.producer.async.DefaultEventHandler.handle(DefaultEventHandler.scala:44)
at kafka.producer.async.ProducerSendThread.tryToHandle(ProducerSendThread.scala:116)
at kafka.producer.async.ProducerSendThread$$anonfun$processEvents$3.apply(ProducerSendThread.scala:95)
at kafka.producer.async.ProducerSendThread$$anonfun$processEvents$3.apply(ProducerSendThread.scala:71)
at scala.collection.immutable.Stream.foreach(Stream.scala:526)
at kafka.producer.async.ProducerSendThread.processEvents(ProducerSendThread.scala:70)
at kafka.producer.async.ProducerSendThread.run(ProducerSendThread.scala:41)
To resolve it I tried adding the snappy dependency to my pom:
<dependency>
    <groupId>org.xerial.snappy</groupId>
    <artifactId>snappy-java</artifactId>
    <version>${snappy-version}</version>
    <scope>provided</scope>
</dependency>
and adding the jar to my Jetty server under /lib/ext,
but I'm still getting this error.
If I set "0" instead of "2" in the "compression.codec" property, I do not get the exception, as expected.
What should I do in order to be able to use snappy compression?
This is my snappy version (should I use a different one?):
1.1.0.1
I'm deploying my app on Jetty 8.1.9, which runs on Ubuntu 12.10.
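As a side note, a quick isolation test (a rough sketch) that calls snappy-java directly, outside of Kafka but with the same classpath, can confirm whether the native snappy library loads at all; it exercises the same maxCompressedLength native method that appears at the top of the stack trace:
import org.xerial.snappy.Snappy

object SnappyLoadCheck {
  def main(args: Array[String]): Unit = {
    // The first Snappy call triggers loading of the native library; an
    // UnsatisfiedLinkError here reproduces the problem without Kafka.
    println(Snappy.maxCompressedLength(1024))
  }
}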
<dependency>
    <groupId>org.xerial.snappy</groupId>
    <artifactId>snappy-java</artifactId>
    <version>1.1.1.3</version>
</dependency>
I had the same issue, and the dependency above solved my problem. The jar contains native libraries for all operating systems. Below is my development environment:
JDK version: 1.7.0_76
Kafka version: 2.10-0.8.2.1
Zookeeper version: 3.4.6

java.lang.NoClassDefFoundError: scala/reflect/ClassManifest

I am getting an error when trying to run an example on Spark. Can anybody please let me know what changes I need to make to my pom.xml to run programs with Spark?
Currently Spark only works with Scala 2.9.3. It does not work with later versions of Scala. I saw the error you describe when I tried to run the SparkPi example with SCALA_HOME pointing to a 2.10.2 installation. When I pointed SCALA_HOME at a 2.9.3 installation instead, things worked for me. Details here.
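A quick way to confirm which Scala ends up on the runtime classpath (a small sketch; mixed Scala versions are the usual cause of this error) is to print the library version and the location of the scala-library jar from inside the running application:
// Scala library version the application is actually running with
println(scala.util.Properties.versionString)
// Where scala-library was loaded from (Option lives in scala-library)
println(classOf[Option[_]].getProtectionDomain.getCodeSource.getLocation)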
You should add a dependency on scala-reflect to your Maven build:
<dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-reflect</artifactId>
    <version>2.10.2</version>
</dependency>
I ran into the same issue using the Scala-Redis 2.9 client (incompatible with Scala 2.10), and including a dependency on scala-reflect does not help. Indeed, scala-reflect is packaged as its own jar, but it does not include the missing class, which has been deprecated since Scala 2.10.0 (see this thread).
The correct answer is to point to an installation of Scala which includes this class. (In my case, using the Scala-Redis client, McNeill's answer helped: I pointed to Scala 2.9.3 using SBT and everything worked as expected.)
In my case, the error was raised in Kafka's API. Changing the dependency from
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka_2.9.2</artifactId>
    <version>0.8.1.1</version>
</dependency>
to
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka_2.10</artifactId>
    <version>1.6.1</version>
</dependency>
fixed the problem.