Using AWS Java SDK in Scala

I'm modifying some parts of Spark core, which is written in Scala. To do that, I want to call the AWS Java API. As far as I know, it is possible to import Java libraries in Scala code, since the Scala sources already contain Java imports like this:
import java.util.concurrent.{ScheduledFuture, TimeUnit}
Here they are importing some built-in Java libraries, but I want to import the AWS Java SDK. The official documentation says that to use the SDK we should add the dependency to the project's pom.xml file so the project can be built with mvn:
<dependencies>
  <dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>aws-java-sdk</artifactId>
    <version>1.11.106</version>
  </dependency>
</dependencies>
I'm wondering whether this is enough. Can I now import AWS Java classes in the Spark Scala source code?

Can I now import AWS Java classes in Spark Scala source code?
Yes. Once the dependency is on the compile classpath, you can import the AWS SDK classes just like any other Java classes.
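To make that concrete, here is a minimal sketch of importing and calling the SDK from Scala, assuming aws-java-sdk is on the compile classpath; the bucket name is hypothetical and credentials/region come from the default provider chain:

import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.ListObjectsV2Request

import scala.collection.JavaConverters._

object S3ListSketch {
  def main(args: Array[String]): Unit = {
    // Client is configured from the default credential/region provider chain
    val s3 = AmazonS3ClientBuilder.defaultClient()
    // List the first page of keys in a hypothetical bucket
    val request = new ListObjectsV2Request().withBucketName("my-example-bucket")
    val result = s3.listObjectsV2(request)
    result.getObjectSummaries.asScala.foreach(summary => println(summary.getKey))
  }
}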

Related

Flink Scala Missing Import

In my Flink project I cannot find certain libraries for connectors (specifically I need to ingest a CSV once and read several TBs of parquet data in either batch or streaming mode). I think I have all the required packages, but I am still getting:
[ERROR] C:\Users\alias\project\...\MyFlinkJob.scala:46: error: not found: type FileSource
[ERROR] import org.apache.flink.connector.file.src.FileSource
[ERROR] ^
My pom.xml is rather large, but I think I have the relevant dependencies:
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-parquet</artifactId>
  <version>1.15.2</version>
</dependency>
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-connector-filesystem_${scala.binary.version}</artifactId>
  <version>1.11.6</version>
</dependency>
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-hadoop-bulk_2.12</artifactId>
  <version>1.14.6</version>
</dependency>
I am using the following versions:
<scala.version>2.12.16</scala.version>
<scala.binary.version>2.12</scala.binary.version>
<log4j.version>2.17.1</log4j.version>
<flink.version>1.15.1</flink.version>
Do I need a different import path in Scala than in Java?
I wish the Flink documentation included the imports in its example code snippets, as I spent a long time trying to figure them out. What are the recommended ._ imports?
I've looked through the symbols in the package but didn't find FileSystem. I looked for tutorials and example projects showing how to read or listen to Parquet and CSV files with Flink. I made some progress this way, but for the few Scala (not Java) examples I found that use Parquet files as a source, the imports still didn't work even after adding their dependencies and running mvn clean install.
I tried using GitHub's advanced search to find a public Scala project using FileSource and eventually found one with the following dependency:
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-connector-files</artifactId>
  <version>${project.version}</version>
</dependency>
This package was missing from index.scala-lang.org, where I thought I should be looking for dependencies (this is my first Scala project, so I assumed it was the place to find packages, like PyPI in Python). It seems that MVN Repository may be a better place to look.
Flink 1.15 has a Scala-free classpath, which has resulted in a number of Flink artifacts no longer having a Scala suffix. You can read all about it in the dedicated Flink blog on this topic: https://flink.apache.org/2022/02/22/scala-free.html
That blog also shows how you can use any Scala version with Flink instead of being limited to Scala 2.12.6.
TL;DR: you should use the Java APIs in your application. The Scala APIs will also be deprecated as of Flink 1.17.
Last but not least: don't mix and match Flink versions. That won't work.
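To illustrate using the Java API from Scala, here is a minimal sketch, assuming flink-connector-files and flink-streaming-java 1.15.x are on the classpath; the input path is hypothetical and TextLineInputFormat stands in for whatever format you actually need:

import org.apache.flink.api.common.eventtime.WatermarkStrategy
import org.apache.flink.connector.file.src.FileSource
import org.apache.flink.connector.file.src.reader.TextLineInputFormat
import org.apache.flink.core.fs.Path
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment

object FileSourceSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // Bounded file source over a hypothetical input directory, read line by line
    val source = FileSource
      .forRecordStreamFormat(new TextLineInputFormat(), new Path("/data/input"))
      .build()
    env
      .fromSource(source, WatermarkStrategy.noWatermarks[String](), "file-source")
      .print()
    env.execute("file-source-sketch")
  }
}

For Parquet input, the stream formats shipped in the flink-parquet module should plug into the same builder.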

Not able to compile a library extending from mllib

I am working on an ML project using Apache Spark and Maven. I created two libraries for the project: one called "rmml", which extends the Spark mllib library by adding a new FactorizationMachine algorithm to "org.apache.spark.mllib.regression", and another called "dataprocess", which uses this new algorithm.
In IntelliJ on my laptop I am able to call and run the FM algorithm fine from "dataprocess", and I am able to compile "rmml". However, when I try to compile the "dataprocess" library I hit the error "error: object FactorizationMachine is not a member of package org.apache.spark.mllib.regression". I am not a Java developer, so I am having a hard time figuring this out. Any help would be great, thanks!
This is the pom of the "dataprocess" library that imports "rmml":
<dependency>
  <groupId>com.something</groupId>
  <artifactId>rmml</artifactId>
  <version>1.0-SNAPSHOT</version>
</dependency>
This is the pom of the "rmml" project:
<groupId>com.something</groupId>
<artifactId>rmml</artifactId>
<version>1.0-SNAPSHOT</version>
And here is the class path and file path of the "rmml" project.
As Luis suggested, the fix was to publish "rmml" to the local repository.
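For reference, a minimal sketch of that publish step with Maven is to run the following from the rmml project root:

mvn clean install

The install phase copies the built rmml artifact into the local ~/.m2 repository, where the dataprocess build can then resolve it.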

How to build a Spark application using Scala IDE and Maven?

I'm new to Scala, Spark and Maven and would like to build the Spark application described here. It uses the Mahout library.
I have Scala IDE installed and would like to use Maven to build the dependencies (which are the Mahout library as well as the Spark lib). I couldn't find a good tutorial to start with. Could someone help me figure it out?
First, try compiling a simple application with Maven in Scala IDE. The key to a Maven project is the directory structure and pom.xml. Although I don't use Scala IDE, this document seems helpful.
http://scala-ide.org/docs/tutorials/m2eclipse/
The next step is to add a dependency on Spark in pom.xml; you can follow this document.
http://blog.cloudera.com/blog/2014/04/how-to-run-a-simple-apache-spark-app-in-cdh-5/
For the latest versions of the Spark and Mahout artifacts, you can check here:
http://mvnrepository.com/artifact/org.apache.spark
http://mvnrepository.com/artifact/org.apache.mahout
Hope this helps.
You need the following tools to get started (based on recent availability):
Scala IDE for Eclipse – download the latest version of Scala IDE from here.
Scala version 2.11 – make sure the Scala compiler is set to this version as well.
Spark version 2.2 – provided as a Maven dependency.
winutils.exe – for running in a Windows environment you need the Hadoop binaries in Windows format. winutils provides that, and we need to set the hadoop.home.dir system property to the bin path inside which winutils.exe is present. You can download winutils.exe here and place it at a path like c:/hadoop/bin/winutils.exe.
And you can define the Spark Core dependency in your Maven pom.xml to get started:
<dependency> <!-- Spark dependency -->
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>2.2.0</version>
  <scope>provided</scope>
</dependency>
And in your Java/Scala class, define this property to run in your local environment on Windows:
System.setProperty("hadoop.home.dir", "c://hadoop//");
More details and the full setup can be found here.
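To tie it together, here is a minimal, hypothetical Spark application for this setup, assuming spark-core_2.11 2.2.0 is available on the classpath:

import org.apache.spark.{SparkConf, SparkContext}

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    // On Windows, point Hadoop at the directory containing bin/winutils.exe
    System.setProperty("hadoop.home.dir", "c:/hadoop/")
    val conf = new SparkConf().setAppName("word-count-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val counts = sc
      .parallelize(Seq("a b a", "b c"))
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.collect().foreach(println)
    sc.stop()
  }
}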

Scala and persistence framework version incompatible

I am trying to use the Slick and Squeryl frameworks for data persistence with Scala. I don't want to use the Play framework, just the persistence framework, but when I import the Slick (or Squeryl) jar file, I encounter the issue below:
slick_2.10.1-2.0.0-M1.jar of <project_name> build path is cross-compiled with an incompatible version of Scala (2.10.1). In case this report is mistaken, this check can be disabled in the compiler preference page.
I am using the Scala jar (2.11.6) under the Scala plugin in Eclipse, and I can run a simple Scala application. I can also access a MySQL DBMS with JDBC. The problem appears when I import the Slick (or Squeryl) jar files. Is it because the framework does not support Scala 2.11? Is downgrading the Scala version the solution? If so, can anyone point me in the right direction on how to downgrade the Scala version under the Eclipse Scala plugin? Thank you very much.
If you are using Scala 2.11, you need to use this dependency for Slick:
<dependency>
  <groupId>com.typesafe.slick</groupId>
  <artifactId>slick_2.11</artifactId>
  <version>3.0.0</version>
</dependency>
The previous answer should resolve your issue with Slick. If you'd like to use Squeryl, the dependency should be:
<dependency>
  <groupId>org.squeryl</groupId>
  <artifactId>squeryl_2.11</artifactId>
  <version>0.9.6-RC3</version>
</dependency>
Or, if you want to use 0.9.5:
<dependency>
  <groupId>org.squeryl</groupId>
  <artifactId>squeryl_2.11</artifactId>
  <version>0.9.5-7</version>
</dependency>
Libraries in Scala are only binary compatible with the minor version of Scala they were compiled against. You'll see that in these examples the correct Scala version is appended to the artifact ID with an underscore.
If you have the option to use SBT instead of Maven, I would recommend it. SBT can choose the proper Scala suffix for you when you reference a dependency with %% like the following:
libraryDependencies += "org.squeryl" %% "squeryl" % "0.9.6-RC3"

Why does json4s need a Scala compiler as a runtime dependency

I've discovered that using json4s native
<dependency>
  <groupId>org.json4s</groupId>
  <artifactId>json4s-native_2.10</artifactId>
  <version>3.2.9</version>
</dependency>
brings in the scalap and scala-compiler dependencies.
Why does it need it?
Does it generate code on the fly at runtime?
Why doesn't it use macros that do this processing at compile time?
The json4s people answered me in this issue with the following:
Because we need to read the byte code to find out information about scala primitives. This is more necessary on 2.9 than it is on 2.10
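To make that concrete, here is a minimal sketch of the kind of case-class extraction that relies on this runtime reflection, assuming json4s-native is on the classpath:

import org.json4s._
import org.json4s.native.JsonMethods._

case class Person(name: String, age: Int)

object ExtractionSketch {
  // DefaultFormats drives the reflective extraction of case classes
  implicit val formats: Formats = DefaultFormats

  def main(args: Array[String]): Unit = {
    val person = parse("""{"name":"Ada","age":36}""").extract[Person]
    println(person) // Person(Ada,36)
  }
}

The extract call inspects the Person constructor at runtime, which is where scalap (and, on older Scala versions, the compiler) comes in.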