Hadoop 3 gcs-connector doesn't work properly with the latest version of Spark 3 in standalone mode - Scala

I wrote a simple Scala application which reads a parquet file from a GCS bucket. The application uses:
JDK 17
Scala 2.12.17
Spark SQL 3.3.1
gcs-connector of hadoop3-2.2.7
The connector is taken from Maven and imported via sbt (the Scala build tool). I'm not using the latest version, 2.2.9, because of this issue.
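In sbt, the dependency is declared roughly as follows (a sketch; the exact line in the build may differ):
libraryDependencies += "com.google.cloud.bigdataoss" % "gcs-connector" % "hadoop3-2.2.7"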
The application works perfectly in local mode, so I tried to switch to standalone mode.
These are the steps I took:
Downloaded Spark 3.3.1 from here
Started the cluster manually as described here
I tried to run the application again and got this error:
[error] Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
[error] at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2688)
[error] at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3431)
[error] at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
[error] at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
[error] at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
[error] at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
[error] at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
[error] at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
[error] at org.apache.parquet.hadoop.util.HadoopInputFile.fromStatus(HadoopInputFile.java:44)
[error] at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:44)
[error] at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readParquetFootersInParallel$1(ParquetFileFormat.scala:484)
[error] ... 14 more
[error] Caused by: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
[error] at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2592)
[error] at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2686)
[error] ... 24 more
Somehow it cannot find the connector's file system class: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
My Spark configuration is pretty basic:
spark.app.name = "Example app"
spark.master = "spark://YOUR_SPARK_MASTER_HOST:7077"
spark.hadoop.fs.defaultFS = "gs://YOUR_GCP_BUCKET"
spark.hadoop.fs.gs.impl = "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"
spark.hadoop.fs.AbstractFileSystem.gs.impl = "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS"
spark.hadoop.google.cloud.auth.service.account.enable = true
spark.hadoop.google.cloud.auth.service.account.json.keyfile = "src/main/resources/gcp_key.json"
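For reference, the same settings map directly onto the SparkSession builder when set programmatically; a minimal sketch (not necessarily how the app actually loads them):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Example app")
  .master("spark://YOUR_SPARK_MASTER_HOST:7077")
  // spark.hadoop.* properties are forwarded to the Hadoop Configuration
  .config("spark.hadoop.fs.defaultFS", "gs://YOUR_GCP_BUCKET")
  .config("spark.hadoop.fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
  .config("spark.hadoop.fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
  .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
  .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile", "src/main/resources/gcp_key.json")
  .getOrCreate()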

I've found out that the Maven version of the GCS Hadoop connector is missing dependencies internally.
I fixed it by either:
downloading the connector from https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage and providing it to the Spark configuration on startup (but this is not recommended for production use, as the site clearly states), or
providing the missing dependencies for the connector myself.
For the second option, I unpacked the GCS Hadoop connector jar file, looked for the pom.xml, copied its dependencies into a new standalone pom.xml, and downloaded them using the mvn dependency:copy-dependencies -DoutputDirectory=/path/to/pyspark/jars/ command.
Here is the example pom.xml that I created; please note I am using the 2.2.9 version of the connector:
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <name>TMP_PACKAGE_NAME</name>
  <description>
    jar dependencies of gcs hadoop connector
  </description>
  <!-- 'com.google.oauth-client:google-oauth-client:jar:1.34.1' -->
  <groupId>TMP_PACKAGE_GROUP</groupId>
  <artifactId>TMP_PACKAGE_NAME</artifactId>
  <version>0.0.1</version>
  <dependencies>
    <dependency>
      <groupId>com.google.cloud.bigdataoss</groupId>
      <artifactId>gcs-connector</artifactId>
      <version>hadoop3-2.2.9</version>
    </dependency>
    <dependency>
      <groupId>com.google.api-client</groupId>
      <artifactId>google-api-client-jackson2</artifactId>
      <version>2.1.0</version>
    </dependency>
    <dependency>
      <groupId>com.google.guava</groupId>
      <artifactId>guava</artifactId>
      <version>31.1-jre</version>
    </dependency>
    <dependency>
      <groupId>com.google.oauth-client</groupId>
      <artifactId>google-oauth-client</artifactId>
      <version>1.34.1</version>
    </dependency>
    <dependency>
      <groupId>com.google.cloud.bigdataoss</groupId>
      <artifactId>util</artifactId>
      <version>2.2.9</version>
    </dependency>
    <dependency>
      <groupId>com.google.cloud.bigdataoss</groupId>
      <artifactId>util-hadoop</artifactId>
      <version>hadoop3-2.2.9</version>
    </dependency>
    <dependency>
      <groupId>com.google.cloud.bigdataoss</groupId>
      <artifactId>gcsio</artifactId>
      <version>2.2.9</version>
    </dependency>
    <dependency>
      <groupId>com.google.auto.value</groupId>
      <artifactId>auto-value-annotations</artifactId>
      <version>1.10.1</version>
      <scope>runtime</scope>
    </dependency>
    <dependency>
      <groupId>com.google.flogger</groupId>
      <artifactId>flogger</artifactId>
      <version>0.7.4</version>
    </dependency>
    <dependency>
      <groupId>com.google.flogger</groupId>
      <artifactId>google-extensions</artifactId>
      <version>0.7.4</version>
    </dependency>
    <dependency>
      <groupId>com.google.flogger</groupId>
      <artifactId>flogger-system-backend</artifactId>
      <version>0.7.4</version>
    </dependency>
    <dependency>
      <groupId>com.google.code.gson</groupId>
      <artifactId>gson</artifactId>
      <version>2.10</version>
    </dependency>
  </dependencies>
</project>
Hope this helps.

This is caused by the fact that Spark uses an old Guava library version and you used a non-shaded GCS connector jar. To make it work, you just need to use the shaded GCS connector jar from Maven, for example: https://repo1.maven.org/maven2/com/google/cloud/bigdataoss/gcs-connector/hadoop3-2.2.9/gcs-connector-hadoop3-2.2.9-shaded.jar
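For an sbt build like the one in the question, that could look roughly like this (untested sketch using the shaded classifier):
libraryDependencies += "com.google.cloud.bigdataoss" % "gcs-connector" % "hadoop3-2.2.9" classifier "shaded"
// optionally call .intransitive() on the module if the unshaded transitive
// dependencies cause conflicts (an assumption; depends on your build)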

Related

MyBatis-Guice fails to initialize with java 17

I am getting the following exception when starting MyBatis with Java 17.
java.lang.NoSuchMethodError: 'void org.mybatis.guice.AbstractMyBatisModule.bindInterceptor(com.google.inject.matcher.Matcher, com.google.inject.matcher.Matcher, org.aopalliance.intercept.MethodInterceptor[])'
Maven dependencies:
<dependency>
  <groupId>org.mybatis</groupId>
  <artifactId>mybatis</artifactId>
  <version>3.5.11</version>
</dependency>
<dependency>
  <groupId>org.mybatis</groupId>
  <artifactId>mybatis-guice</artifactId>
  <version>3.18</version>
</dependency>
<dependency>
  <groupId>com.google.inject</groupId>
  <artifactId>guice</artifactId>
  <version>5.1.0</version>
</dependency>
I tried downgrading to mybatis-guice version 3.12 but it did not help.
It works in IntelliJ but does not work on a standalone server.
The issue there was a third-party dependency that was pulling in guice no_aop 4.0.1. Once excluded, it worked fine.
Verify there are no conflicting Guice dependencies using:
mvn dependency:tree
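For example, to narrow the output to Guice artifacts only (one possible filter, adjust as needed):
mvn dependency:tree -Dincludes=com.google.inject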

SAP jscoverage-plugin does not work on new Java/JDK

I have successfully run the automated tests of my UI5 app locally. When I tried to generate the Cobertura coverage report with:
mvn verify -f pom.xml com.sap.ca:jscoverage-plugin:1.36.0:jscoverage
I got the following error.
[ERROR] Failed to execute goal
com.sap.ca:jscoverage-plugin:1.36.0:jscoverage (default-cli) on
project sap.support.sccd: Execution default-cli of goal
com.sap.ca:jscoverage-plugin:1.36.0:jscoverage failed: A required
class was missing while executing
com.sap.ca:jscoverage-plugin:1.36.0:jscoverage:
javax/xml/bind/JAXBException
The javax.xml.bind classes have not shipped with the JDK for a long time now. Why does the SAP jscoverage-plugin still use them? Is there any way to get rid of this issue?
I have tried all of the following dependencies, with versions 2.3.2 and 3.0.0, in my pom:
<dependency>
  <groupId>org.glassfish.jaxb</groupId>
  <artifactId>jaxb-runtime</artifactId>
  <version>2.3.2</version>
  <scope>runtime</scope>
</dependency>
<dependency>
  <groupId>jakarta.xml.bind</groupId>
  <artifactId>jakarta.xml.bind-api</artifactId>
  <version>2.3.2</version>
</dependency>
<dependency>
  <groupId>com.sun.xml.bind</groupId>
  <artifactId>jaxb-impl</artifactId>
  <version>2.3.2</version>
  <scope>runtime</scope>
</dependency>
However, they don't solve the problem.
How can I mitigate the issue and proceed with Cobertura coverage report generation with this plugin?

Exception in thread "main" java.lang.IllegalAccessError: class org.apache.spark.storage.StorageUtils$

Hi, I am trying to run Spark on my local laptop.
I created a Maven project in IntelliJ IDEA, and in my main class I have the single line below. When I try to run the project I get the error below:
val spark = SparkSession.builder().master("local").getOrCreate()
21/11/02 18:02:35 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
Exception in thread "main" java.lang.IllegalAccessError: class org.apache.spark.storage.StorageUtils$ (in unnamed module #0x34e9fd99) cannot access class sun.nio.ch.DirectBuffer (in module java.base) because module java.base does not export sun.nio.ch to unnamed module #0x34e9fd99
at org.apache.spark.storage.StorageUtils$.(StorageUtils.scala:213)
at org.apache.spark.storage.BlockManagerMasterEndpoint.(BlockManagerMasterEndpoint.scala:110)
at org.apache.spark.SparkEnv$.$anonfun$create$9(SparkEnv.scala:348)
at org.apache.spark.SparkEnv$.registerOrLookupEndpoint$1(SparkEnv.scala:287)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:336)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:191)
at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:277)
at org.apache.spark.SparkContext.(SparkContext.scala:460)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2690)
at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:949)
at scala.Option.getOrElse(Option.scala:201)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:943)
at Main$.main(Main.scala:8)
at Main.main(Main.scala)
My dependencies in the pom:
<dependencies>
  <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.13</artifactId>
    <version>3.2.0</version>
  </dependency>
  <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.13</artifactId>
    <version>3.2.0</version>
  </dependency>
  <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-hive -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.11</artifactId>
    <version>2.1.3</version>
    <scope>provided</scope>
  </dependency>
</dependencies>
Any idea how to resolve this problem?
At the time of writing this answer, Spark does not support Java 17 - only Java 8/11 (source: https://spark.apache.org/docs/latest/).
In my case uninstalling Java 17 and installing Java 8 (e.g. OpenJDK 8) fixed the problem and I started using Spark on my laptop.
UPDATE: As of Spark 3.3.0, Spark runs on Java 8/11/17, Scala 2.12/2.13, Python 3.7+ and R 3.5+.
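If upgrading Spark or downgrading Java is not an option, a workaround that is frequently reported for this specific error on Java 17 (an unsupported combination for Spark 3.2, so verify it for your setup) is to add a VM option such as the following to the IntelliJ run configuration:
--add-exports=java.base/sun.nio.ch=ALL-UNNAMED
Additional --add-exports/--add-opens flags may be needed for other internal JDK packages.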

java.lang.NoSuchMethodError: org.jboss.logging.Logger.debugf(Ljava/lang/String;I)V Exception while migrating from 12.1.3 to 12.2

I am migrating from JDeveloper 12.1.3 to 12.2.0. There were some problems during compilation, but they were resolved by modifying the classpath.
Now, when I go to deploy, it shows me an exception:
"java.lang.NoSuchMethodError: org.jboss.logging.Logger.debugf(Ljava/lang/String;I)V"
I am using the following dependencies:
<dependency>
  <groupId>org.hibernate</groupId>
  <artifactId>hibernate-core</artifactId>
  <version>5.2.0.Final</version>
  <type>jar</type>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>org.hibernate.common</groupId>
  <artifactId>hibernate-commons-annotations</artifactId>
  <version>5.0.1.Final</version>
  <type>jar</type>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>org.hibernate.javax.persistence</groupId>
  <artifactId>hibernate-jpa-2.1-api</artifactId>
  <version>1.0.0.Final</version>
  <type>jar</type>
  <scope>provided</scope>
</dependency>
Apart from these dependencies, I also looked into the "C:\oracle_home12c\wlserver\modules" folder and found an "org.jboss.logging.jboss-logging.jar" file there.
I replaced it with the "jboss-logging-3.3.0.Final" jar from the Maven repository.
I also looked inside the "jboss-logging-3.3.0.Final" jar file. There is a Logger class file, but it does not contain the debugf method with that signature.
Exception type:
Caused by: java.lang.NoSuchMethodError: org.jboss.logging.Logger.debugf(Ljava/lang/String;I)V
at org.hibernate.internal.NamedQueryRepository.checkNamedQueries(NamedQueryRepository.java:149)
at org.hibernate.internal.SessionFactoryImpl.checkNamedQueries(SessionFactoryImpl.java:764)
at org.hibernate.internal.SessionFactoryImpl.<init>(SessionFactoryImpl.java:495)
at org.hibernate.boot.internal.SessionFactoryBuilderImpl.build(SessionFactoryBuilderImpl.java:444)
at org.hibernate.jpa.boot.internal.EntityManagerFactoryBuilderImpl.build(EntityManagerFactoryBuilderImpl.java:802)
at org.hibernate.jpa.HibernatePersistenceProvider.createContainerEntityManagerFactory(HibernatePersistenceProvider.java:135)
at weblogic.persistence.BasePersistenceUnitInfo.initializeEntityManagerFactory(BasePersistenceUnitInfo.java:611)
Can someone help me get rid of this problem?
Thanks in advance.
If you are using the GlassFish server, the problem might be with the libraries provided by GlassFish. This error is caused by the use of an incompatible version.
Just create a glassfish-web.xml file in the WEB-INF directory. The contents of the file are shown below:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE glassfish-web-app PUBLIC "-//GlassFish.org//DTD GlassFish Application Server 3.1 Servlet 3.0//EN" "http://glassfish.org/dtds/glassfish-web-app_3_0-1.dtd">
<glassfish-web-app>
 <class-loader delegate="false"/>
</glassfish-web-app>
This ensures that GlassFish does not load its internal libraries, but the libraries from your project.
I don't know how JDeveloper class loading works, but maybe it picks up the wrong version of jboss-logging, which doesn't have the method? Try removing it from the modules directory, and make sure that your app bundles jboss-logging in the .war. It should be brought in by Hibernate's dependencies; if not, add it to the pom.xml.

How to run a spark example program in Intellij IDEA

First, on the command line, from the root of the downloaded Spark project, I ran:
mvn package
It was successful.
Then an IntelliJ project was created by importing the Spark pom.xml.
In the IDE the example class appears fine: all of the libraries are found. This can be viewed in the screenshot.
However, when attempting to run main(), a ClassNotFoundException on SparkContext occurs.
Why can IntelliJ not simply load and run this Maven-based Scala program? And what can be done as a workaround?
As one can see below, the SparkContext looks fine in the IDE, but it is not found when attempting to run.
The test was run by right-clicking inside main()
and selecting Run GroupByTest.
It gives
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/SparkContext
at org.apache.spark.examples.GroupByTest$.main(GroupByTest.scala:36)
at org.apache.spark.examples.GroupByTest.main(GroupByTest.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.SparkContext
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 7 more
Here is the run configuration:
The Spark lib isn't in your classpath.
Execute sbt/sbt assembly,
and afterwards include "/assembly/target/scala-$SCALA_VERSION/spark-assembly*hadoop*-deps.jar" in your project.
This may help: IntelliJ-Runtime-error-tt11383. Change the module dependencies from provided to compile. This works for me.
You need to add the Spark dependency. If you are using Maven, just add these lines to your pom.xml:
<dependencies>
  ...
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_${scala.binary.version}</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
  </dependency>
  ...
</dependencies>
This way you'll have the dependency for compiling and testing purposes but not in the "jar-with-dependencies" artifact.
But if you want to execute the whole application in a standalone cluster running from your IntelliJ, you can add a Maven profile to add the dependency with compile scope, like this:
<properties>
  <scala.binary.version>2.11</scala.binary.version>
  <spark.version>1.2.1</spark.version>
  <spark.scope>provided</spark.scope>
</properties>
<profiles>
  <profile>
    <id>local</id>
    <properties>
      <spark.scope>compile</spark.scope>
    </properties>
    <dependencies>
      <!--<dependency>-->
        <!--<groupId>org.apache.hadoop</groupId>-->
        <!--<artifactId>hadoop-common</artifactId>-->
        <!--<version>2.6.0</version>-->
      <!--</dependency>-->
      <!--<dependency>-->
        <!--<groupId>com.hadoop.gplcompression</groupId>-->
        <!--<artifactId>hadoop-gpl-compression</artifactId>-->
        <!--<version>0.1.0</version>-->
      <!--</dependency>-->
      <dependency>
        <groupId>com.hadoop.gplcompression</groupId>
        <artifactId>hadoop-lzo</artifactId>
        <version>0.4.19</version>
      </dependency>
    </dependencies>
    <activation>
      <activeByDefault>false</activeByDefault>
      <property>
        <name>env</name>
        <value>local</value>
      </property>
    </activation>
  </profile>
</profiles>
<dependencies>
  <!-- SPARK DEPENDENCIES -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_${scala.binary.version}</artifactId>
    <version>${spark.version}</version>
    <scope>${spark.scope}</scope>
  </dependency>
</dependencies>
I also added an option to my application to start a local cluster if --local is passed:
private def sparkContext(appName: String, isLocal: Boolean): SparkContext = {
  val sparkConf = new SparkConf().setAppName(appName)
  if (isLocal) {
    sparkConf.setMaster("local")
  }
  new SparkContext(sparkConf)
}
Finally, you have to enable the "local" profile in IntelliJ in order to get the proper dependencies. Just go to the "Maven Projects" tab and enable the profile.
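The same profile can also be activated from the command line, for example:
mvn package -Plocal
(or, given the activation property defined above, with -Denv=local).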