Flink Scala Missing Import - scala

In my Flink project I cannot find certain libraries for connectors (specifically I need to ingest a CSV once and read several TBs of parquet data in either batch or streaming mode). I think I have all the required packages, but I am still getting:
[ERROR] import org.apache.flink.connector.file.src.FileSource
[ERROR] ^
[ERROR] C:\Users\alias\project\...\MyFlinkJob.scala:46: error: not found: type FileSource
My pom.xml is rather large, but I think I have the relevant dependencies:
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-parquet</artifactId>
    <version>1.15.2</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-filesystem_${scala.binary.version}</artifactId>
    <version>1.11.6</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-hadoop-bulk_2.12</artifactId>
    <version>1.14.6</version>
</dependency>
I am using the following versions:
<scala.version>2.12.16</scala.version>
<scala.binary.version>2.12</scala.binary.version>
<log4j.version>2.17.1</log4j.version>
<flink.version>1.15.1</flink.version>
Do I need a different import path for Scala than Java?
I wish the Flink documentation included the imports in its example code snippets, as I spent a long time trying to figure out the imports. What are the recommended ._ imports?
I've looked through the symbols in the package but didn't find FileSystem. I looked for tutorials and example projects showing how to read/listen to Parquet and CSV files with Flink. I made some progress that way, but for the few examples I found in Scala (not Java) that use Parquet files as a source, the imports still didn't work even after adding their dependencies and running mvn clean install.

I tried using GitHub's advanced search to find a public Scala project using FileSource and eventually found one with the following dependency:
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-files</artifactId>
    <version>${project.version}</version>
</dependency>
This package was missing on index.scala-lang.org, where I thought I should be looking for dependencies (this is my first Scala project, so I assumed that was the place to find packages, like PyPI in Python). It seems that MVN Repository may be a better place to look.

Flink 1.15 has a Scala-free classpath, which has resulted in a number of Flink artifacts no longer having a Scala suffix. You can read all about it in the dedicated Flink blog on this topic: https://flink.apache.org/2022/02/22/scala-free.html
You can also see in that blog how you can use any Scala version with Flink instead of being limited to Scala 2.12.6.
TL;DR: you should use the Java APIs in your application. The Scala APIs will also be deprecated as of Flink 1.17.
Last but not least: don't mix & match Flink versions. That won't work.
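For reference, a minimal Scala sketch against the Java DataStream API (assuming Flink 1.15.x with flink-connector-files and flink-streaming-java on the classpath; the input path and job name are made up for illustration) showing the imports that were missing:

import org.apache.flink.api.common.eventtime.WatermarkStrategy
import org.apache.flink.connector.file.src.FileSource
import org.apache.flink.connector.file.src.reader.TextLineInputFormat
import org.apache.flink.core.fs.Path
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment

object MyFlinkJob {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Bounded source over a hypothetical CSV file, read line by line.
    val source = FileSource
      .forRecordStreamFormat(new TextLineInputFormat(), new Path("/data/input.csv"))
      .build()

    val lines = env.fromSource(source, WatermarkStrategy.noWatermarks[String](), "file-source")
    lines.print()

    env.execute("file-source-example")
  }
}

For Parquet, flink-parquet's ParquetColumnarRowInputFormat is a BulkFormat and can be passed to FileSource.forBulkFileFormat instead; calling monitorContinuously(...) on the builder switches the source to continuously watch the directory in streaming mode.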

Related

Why is adding the org.apache.spark.avro dependency mandatory to read/write Avro files in Spark 2.4 when I'm using com.databricks.spark.avro?

I tried to run my Spark/Scala 2.3.0 code on a Cloud Dataproc 1.4 cluster where Spark 2.4.8 is installed. I ran into an error when reading Avro files. Here's my code:
sparkSession.read.format("com.databricks.spark.avro").load(input)
This code failed as expected. Then I added this dependency to my pom.xml file:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-avro_2.11</artifactId>
    <version>2.4.0</version>
</dependency>
That made my code run successfully, and this is the part I don't understand: I'm still using the com.databricks.spark.avro module in my code. Why did adding the org.apache.spark.avro dependency solve my problem, given that I'm not really using it in my code?
I was expecting that I will need to change my code to something like this:
sparkSession.read.format("avro").load(input)
This is a historical artifact: Spark Avro support was initially added by Databricks in their proprietary Spark runtime as the com.databricks.spark.avro format. When Avro support was later added to open-source Spark as the avro format, support for the com.databricks.spark.avro format was retained for backward compatibility, controlled by the spark.sql.legacy.replaceDatabricksSparkAvro.enabled property being set to true:
If it is set to true, the data source provider com.databricks.spark.avro is mapped to the built-in but external Avro data source module for backward compatibility.
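In other words, both of the following work in Spark 2.4 once spark-avro is on the classpath (a sketch reusing the question's sparkSession and input; the legacy property defaults to true, so setting it explicitly is optional):

// Built-in short name for the external Avro module:
val dfAvro = sparkSession.read.format("avro").load(input)

// Legacy Databricks name, kept working by the compatibility mapping
// controlled by spark.sql.legacy.replaceDatabricksSparkAvro.enabled (default: true):
val dfLegacy = sparkSession.read.format("com.databricks.spark.avro").load(input)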

How to fix the issue with Java 9 Modularity after adding dependency on mongock?

We are developing a Spring Boot application that uses MongoDB as storage, and we want to add the DB migration tool mongock to our project.
In pom.xml I added a new dependency:
<dependency>
    <groupId>com.github.cloudyrock.mongock</groupId>
    <artifactId>mongock-spring</artifactId>
    <version>3.3.2</version>
</dependency>
And IntelliJ IDEA advised me to add the following lines to module-info.java:
requires mongock.spring;
requires mongock.core;
After that I am no longer able to build the project; I get the following error:
Module 'com.acme.project-name' reads package 'com.github.cloudyrock.mongock' from both 'mongock.spring' and 'mongock.core'
I do not know a lot about Java 9 modularity, which is why I am stuck resolving this issue. Please advise.
If it's worth solving this issue, one could upgrade to the latest release of the artifact.
<dependency>
    <groupId>com.github.cloudyrock.mongock</groupId>
    <artifactId>mongock-spring</artifactId>
    <version>4.0.1.alpha</version>
</dependency>
You can understand what the issue means over this and this Q&A.
If you analyze the issue, you will quickly notice that with version 3.3.2 two artifacts are brought in as dependencies under external libraries: mongock-spring and mongock-core. Further, if you look at the JARs, you will see that their package structure is the same (i.e. both have classes within com.github.cloudyrock.mongock), and that split package is the reason for the conflict you see when both are introduced on the module path.
Edit: Extended discussions to be moved over to #mongock/issues/212.

Error through remote Spark Job: java.lang.IllegalAccessError: class org.apache.hadoop.hdfs.web.HftpFileSystem

Problem
I am trying to run a remote Spark Job through IntelliJ with a Spark HDInsight cluster (HDI 4.0). In my Spark application I am trying to read an input stream from a folder of parquet files from Azure blob storage using Spark's Structured Streaming built in readStream function.
The code works as expected when I run it on a Zeppelin notebook attached to the HDInsight cluster. However, when I deploy my Spark application to the cluster, I encounter the following error:
java.lang.IllegalAccessError: class org.apache.hadoop.hdfs.web.HftpFileSystem cannot access its superinterface org.apache.hadoop.hdfs.web.TokenAspect$TokenManagementDelegator
Subsequently, I am unable to read any data from blob storage.
The little information I found online suggested that this is caused by a version conflict between Spark and Hadoop. The application is run with Spark 2.4 prebuilt for Hadoop 2.7.
Fix
To fix this, I ssh into each head and worker node of the cluster and manually downgrade the Hadoop dependencies from 3.1.x to 2.7.3 to match the version in my local spark/jars folder. After doing this, I am able to deploy my application successfully. Downgrading the cluster from HDI 4.0 is not an option, as it is the only cluster version that supports Spark 2.4.
Summary
To summarize, could the issue be that I am using a Spark download prebuilt for Hadoop 2.7? Is there a better way to fix this conflict instead of manually downgrading the Hadoop versions on the cluster's nodes or changing the Spark version I am using?
After troubleshooting some of the methods I had previously attempted, I came across the following fix:
In my pom.xml I excluded the hadoop-client dependency automatically imported by the spark-core jar. This dependency was version 2.6.5, which conflicted with the cluster's version of Hadoop. Instead, I import the version I require.
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_${scala.version.major}</artifactId>
    <version>${spark.version}</version>
    <exclusions>
        <exclusion>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>${hadoop.version}</version>
</dependency>
After making this change, I encountered the error java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0. Further research revealed this was due to a problem with the Hadoop configuration on my local machine. Per this article's advice, I modified the winutils.exe version I had under C://winutils/bin to be the version I required and also added the corresponding hadoop.dll. After making these changes, I was able to successfully read data from blob storage as expected.
TLDR
The issue was the auto-imported hadoop-client dependency, which was fixed by excluding it and adding the new winutils.exe and hadoop.dll under C://winutils/bin.
This no longer required downgrading the Hadoop versions within the HDInsight cluster or changing my downloaded Spark version.
Problem:
I was facing the same issue while running a fat jar built with Hadoop 2.7 and Spark 2.4 on a cluster with Hadoop 3.x. I was using the Maven Shade plugin.
Observation:
While building the fat jar, it was including org.apache.hadoop:hadoop-hdfs:jar:2.6.5, which contains the class org.apache.hadoop.hdfs.web.HftpFileSystem. This was causing the problem on Hadoop 3.
Solution:
I excluded this jar while building the fat jar, and the issue was resolved.

Using the AWS Java SDK in Scala

I'm modifying some parts of Spark core, which is written in Scala. For that, I want to call the AWS Java API. As far as I know, it is possible to import Java libraries in Scala code, since there are already Java library calls and imports in the Scala code, like this:
import java.util.concurrent.{ScheduledFuture, TimeUnit}
Here they are importing some built-in Java libraries, but I want to import the AWS Java SDK. Their official documentation says that to use the SDK we should add the dependency to the project's pom.xml file so the project can be built with mvn:
<dependencies>
    <dependency>
        <groupId>com.amazonaws</groupId>
        <artifactId>aws-java-sdk</artifactId>
        <version>1.11.106</version>
    </dependency>
</dependencies>
I'm wondering whether this is enough. Can I now import AWS Java classes in Spark's Scala source code?
Can I now import AWS Java classes in Spark's Scala source code?
Yes
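As a brief illustration (a sketch only; it assumes aws-java-sdk 1.11.x is on the classpath and the bucket listing is just an example), the Java SDK classes can be imported from Scala like any other Java library:

import com.amazonaws.services.s3.AmazonS3ClientBuilder

object ListBuckets {
  def main(args: Array[String]): Unit = {
    // Build a client using the default credential/region provider chain.
    val s3 = AmazonS3ClientBuilder.defaultClient()
    s3.listBuckets().forEach(bucket => println(bucket.getName))
  }
}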

How to build a Spark application using Scala IDE and Maven?

I'm new to Scala, Spark and Maven and would like to build the Spark application described here. It uses the Mahout library.
I have Scala IDE installed and would like to use Maven to build the dependencies (the Mahout library as well as the Spark libraries). I couldn't find a good tutorial to get started. Could someone help me figure it out?
First, try compiling a simple application with Maven in Scala IDE. The key to a Maven project is the directory structure and pom.xml. Although I don't use Scala IDE, this document seems helpful.
http://scala-ide.org/docs/tutorials/m2eclipse/
The next step is to add a dependency on Spark in pom.xml; you can follow this document.
http://blog.cloudera.com/blog/2014/04/how-to-run-a-simple-apache-spark-app-in-cdh-5/
For the latest versions of the Spark and Mahout artifacts, you can check here:
http://mvnrepository.com/artifact/org.apache.spark
http://mvnrepository.com/artifact/org.apache.mahout
Hope this helps.
You need the following tools to get started (based on recent availability):
Scala IDE for Eclipse – download the latest version of Scala IDE from here.
Scala version – 2.11 (make sure the Scala compiler is set to this version as well).
Spark version – 2.2 (provided as a Maven dependency).
winutils.exe – for running in a Windows environment, you need the Hadoop binaries in Windows format. winutils provides that, and we need to set the hadoop.home.dir system property to the bin path containing winutils.exe. You can download winutils.exe here and place it at a path like c:/hadoop/bin/winutils.exe.
Then you can define the Spark Core dependency in your project's Maven pom.xml to get started:
<dependency> <!-- Spark dependency -->
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.2.0</version>
    <scope>provided</scope>
</dependency>
And in your Java/Scala class, define this property to run in your local environment on Windows:
System.setProperty("hadoop.home.dir", "c://hadoop//");
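Putting it together, a minimal word-count sketch (the object name and input path are hypothetical; it assumes the spark-core 2.2.0 dependency above and Scala 2.11):

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Only needed on Windows, so Spark can find winutils.exe.
    System.setProperty("hadoop.home.dir", "c://hadoop//")

    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val counts = sc.textFile("input.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.collect().foreach(println)
    sc.stop()
  }
}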
More details and full setup details can be found here.