Apache Beam UnsupportedOperationException on upgrade from 2.29.0 to 2.32.0 - apache-beam

I have a pipeline that has been working fine for a year on version 2.29.0. However, this week we can't build anymore because a dependency it relied on was removed from the Red Hat Maven repository, so I upgraded our pipeline to Beam 2.32.0. Our pipeline uses the SparkRunner, and our Spark version is 3.2.0. The new Beam version, however, throws an exception. Below is the error I receive plus the stack trace. Does anybody have any idea what I need to configure or change to get it working again?
java.lang.UnsupportedOperationException: Found StateId annotations on org.apache.beam.sdk.transforms.GroupIntoBatches$GroupIntoBatchesDoFn, but DoFn cannot yet be used with state in the SparkRunner.
at org.apache.beam.runners.spark.translation.TranslationUtils.rejectStateAndTimers(TranslationUtils.java:271)
at org.apache.beam.runners.spark.translation.streaming.StreamingTransformTranslator$9.evaluate(StreamingTransformTranslator.java:418)
at org.apache.beam.runners.spark.translation.streaming.StreamingTransformTranslator$9.evaluate(StreamingTransformTranslator.java:409)
at org.apache.beam.runners.spark.SparkRunner$Evaluator.doVisitTransform(SparkRunner.java:449)
at org.apache.beam.runners.spark.SparkRunner$Evaluator.visitPrimitiveTransform(SparkRunner.java:438)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:593)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:585)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:585)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:585)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:585)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:585)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:585)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:585)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.access$500(TransformHierarchy.java:240)
at org.apache.beam.sdk.runners.TransformHierarchy.visit(TransformHierarchy.java:214)
at org.apache.beam.sdk.Pipeline.traverseTopologically(Pipeline.java:469)
at org.apache.beam.runners.spark.translation.streaming.SparkRunnerStreamingContextFactory.call(SparkRunnerStreamingContextFactory.java:88)
at org.apache.beam.runners.spark.translation.streaming.SparkRunnerStreamingContextFactory.call(SparkRunnerStreamingContextFactory.java:46)
at org.apache.spark.streaming.api.java.JavaStreamingContext$.$anonfun$getOrCreate$1(JavaStreamingContext.scala:628)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.streaming.StreamingContext$.getOrCreate(StreamingContext.scala:841)
at org.apache.spark.streaming.api.java.JavaStreamingContext$.getOrCreate(JavaStreamingContext.scala:627)
at org.apache.spark.streaming.api.java.JavaStreamingContext.getOrCreate(JavaStreamingContext.scala)
at org.apache.beam.runners.spark.SparkRunner.run(SparkRunner.java:180)
at org.apache.beam.runners.spark.SparkRunner.run(SparkRunner.java:96)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:323)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:309)

Based on the Beam Capability Matrix and this answer to "Apache beam StateSpec in spark", stateful processing is not supported in streaming pipelines on the SparkRunner.
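For context, GroupIntoBatches is implemented on top of a stateful DoFn (the GroupIntoBatchesDoFn named in the error), so any streaming pipeline that applies it, directly or through a transform that uses it internally, hits this check. A minimal Java sketch of the kind of usage that triggers the rejection (an illustrative fragment, not the asker's code):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.GroupIntoBatches;
import org.apache.beam.sdk.values.KV;

public class GroupIntoBatchesSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    p.apply(Create.of(KV.of("key", "a"), KV.of("key", "b"), KV.of("key", "c")))
        // GroupIntoBatches expands to a stateful DoFn, which the streaming SparkRunner
        // rejects in TranslationUtils.rejectStateAndTimers with the exception above.
        .apply(GroupIntoBatches.ofSize(2));
    p.run().waitUntilFinish();
  }
}

If your pipeline never calls GroupIntoBatches explicitly, it is most likely being pulled in by one of the IOs or composite transforms you use, which is why the upgrade alone can surface the error. Per the Capability Matrix, the options are to avoid the stateful transform or to run the streaming pipeline on a runner that supports state.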

Related

java.io.InvalidClassException: org.apache.flink.api.common.operators.ResourceSpec; incompatible types for field cpuCores

I am using Beam version 1.24 with a Flink session cluster version 1.11 and beam-runners-flink-1.9. When I run the job with the remote FlinkRunner in streaming mode, I get the following error. Any insights will be appreciated. I couldn't include the entire stack trace, as Stack Overflow wouldn't let me post it in full.
Thanks,
Rao.
Caused by: java.io.InvalidClassException: org.apache.flink.api.common.operators.ResourceSpec; incompatible types for field cpuCores
This was due to a Flink/Beam version incompatibility; using the Beam Flink runner artifact that matches the cluster's Flink version solved the issue. Here is the Maven snippet:
<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-runners-flink-1.11</artifactId>
  <version>2.25.0</version>
</dependency>

Spark + Kafka Integration error. NoClassDefFoundError: org/apache/spark/sql/internal/connector/SimpleTableProvider

I am using Kafka 2.5.0 and Spark 3.0.0, and I'm trying to import some data from Kafka into Spark. The following code snippet gives me an error:
spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092").option("subscribe", "topic1").load()
The error I get says
java.lang.NoClassDefFoundError: org/apache/spark/sql/internal/connector/SimpleTableProvider
This error is usually caused by a Spark/Kafka dependency conflict.
Make sure the Kafka connector artifact matches both your Spark version and the Scala version Spark was built with; you can check the supported Scala version in the Maven repository if you haven't yet.
If the error still occurs, share more details such as the groupId, artifactId, and version you are using.
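For example, with Spark 3.0.0 built against Scala 2.12, the matching Structured Streaming Kafka connector would look like the dependency below (a hedged example; adjust the Scala suffix and version to your own build):

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql-kafka-0-10_2.12</artifactId>
  <version>3.0.0</version>
</dependency>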

ClassNotFoundException while creating Spark Session

I am trying to create a SparkSession in a unit test case using the code below:
val spark = SparkSession.builder.appName("local").master("local").getOrCreate()
but while running the tests, I get the following error:
java.lang.ClassNotFoundException: org.apache.hadoop.fs.GlobalStorageStatistics$StorageStatisticsProvider
I have tried to add the dependency but to no avail. Can someone point out the cause and the solution to this issue?
This can happen for one of two reasons.
1. You may have incompatible versions of the Spark and Hadoop stacks. For example, HBase 0.9 is incompatible with Spark 2.0; mismatches like this result in class-not-found or method-not-found exceptions.
2. You may have multiple versions of the same library on the classpath because of dependency hell. Run the dependency tree (see the command below) to make sure this is not the case.
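If you build with Maven, the tree can be printed and filtered for the suspect artifacts, for example (a generic suggestion, with the Hadoop group used purely as an illustration):

mvn dependency:tree -Dincludes=org.apache.hadoop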

NoSuchMethod exception in Flink when using dataset with custom object array

I have a problem with Flink:
java.lang.NoSuchMethodError: org.apache.flink.api.java.typeutils.ObjectArrayTypeInfo.getInfoFor(Lorg/apache/flink/api/common/typeinfo/TypeInformation;)Lorg/apache/flink/api/java/typeutils/ObjectArrayTypeInfo;
at LowLevel.FlinkImplementation.FlinkImplementation$$anon$6.<init>(FlinkImplementation.scala:28)
at LowLevel.FlinkImplementation.FlinkImplementation.<init>(FlinkImplementation.scala:28)
at IRLogic.GmqlServer.<init>(GmqlServer.scala:15)
at it.polimi.App$.main(App.scala:20)
at it.polimi.App.main(App.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
...
The line causing the problem is this one:
implicit val regionTypeInformation =
api.scala.createTypeInformation[FlinkDataTypes.FlinkRegionType]
In FlinkRegionType I have an Array of custom objects.
I developed the app with the Maven plugin in the IDE and everything works fine, but when I move to the Flink version I downloaded from the website I get the error above.
I am using Flink 0.9.
I thought some library might be missing, but I am using Maven to handle everything. Moreover, going through the code of ObjectArrayTypeInfo.java, it doesn't seem to be the problem.
A NoSuchMethodError commonly indicates a version mismatch between the libraries a Flink program was compiled against and the system the program is executed on, especially if the same code works in an IDE setup where the compile and execution libraries are the same.
In that case, you should check the versions of the Flink dependencies, for example in the Maven POM file.
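For example, if the cluster is a stock Flink 0.9.0 download, the application's POM should pull the same version of the Flink artifacts. A hedged sketch using the 0.9-era artifact names (verify them against your actual setup):

<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-scala</artifactId>
  <version>0.9.0</version>
</dependency>
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-clients</artifactId>
  <version>0.9.0</version>
</dependency>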

How to fix akka version compatibility issues?

I was thinking of using Spark and Redis together with SBT.
It runs fine if I comment out the Spark dependency; if I include the Spark dependency, I get:
Exception in thread "main" java.lang.NoSuchMethodError: akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContextExecutor;
at redis.RedisClientActorLike.<init>(Redis.scala:31)
at redis.RedisClient.<init>(Redis.scala:69)
I have no issues when I do not include rediscala. When I do include rediscala, I get weird errors about Akka.
How do I get around this?
It appears that those versions of Spark and rediscala are using incompatible versions of Akka. Spark 1.1.0 uses Akka 2.2.3, and rediscala 1.3.1 uses Akka 2.3.4. There are changes between Akka 2.2.x and 2.3.x that cause issues, and your project currently has both as transitive dependencies.
You either need to downgrade rediscala to 1.2 (which uses Akka 2.2.x) or upgrade Spark to 1.2-snapshot (which uses Akka 2.3.x), as in the sketch below.
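A hedged build.sbt sketch of the downgrade path, assuming rediscala's old published coordinates ("com.etaty.rediscala" %% "rediscala"); verify the exact groupId and version on Maven Central before using it:

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.1.0",
  // pinned to 1.2 (assumed coordinates) so both libraries agree on Akka 2.2.x
  "com.etaty.rediscala" %% "rediscala" % "1.2"
)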