Not able to compile a library extending MLlib - Scala

I am working on an ML project using Apache Spark and Maven. I created two libraries for the project: one called "rmml", which extends the Spark MLlib library by adding a new FactorizationMachine algorithm to "org.apache.spark.mllib.regression", and another called "dataprocess", which uses this new algorithm.
In IntelliJ on my laptop, I can call and run the FM algorithm fine in "dataprocess", and I can compile "rmml". However, I hit an error when trying to compile the "dataprocess" library: "error: object FactorizationMachine is not a member of package org.apache.spark.mllib.regression". I am not a Java developer, so I am having a hard time figuring this out. Any help would be great, thanks!
This is the pom of the "dataprocess" library that imports "rmml":
<dependency>
    <groupId>com.something</groupId>
    <artifactId>rmml</artifactId>
    <version>1.0-SNAPSHOT</version>
</dependency>
This is the pom of the "rmml" project:
<groupId>com.something</groupId>
<artifactId>rmml</artifactId>
<version>1.0-SNAPSHOT</version>
And here is the class path and file path of the "rmml" project

As Luis suggested, the fix was to publish "rmml" to the local Maven repository.
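In practice that means installing the "rmml" artifact into the local Maven repository before building "dataprocess". Assuming a standard Maven layout for both projects, something like:

mvn clean install   # run in the rmml project directory; installs com.something:rmml:1.0-SNAPSHOT into ~/.m2/repository
mvn clean package   # then run in the dataprocess project directory, which can now resolve the rmml dependency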

Related

Flink Scala Missing Import

In my Flink project I cannot find certain libraries for connectors (specifically I need to ingest a CSV once and read several TBs of parquet data in either batch or streaming mode). I think I have all the required packages, but I am still getting:
[ERROR] import org.apache.flink.connector.file.src.FileSource
[ERROR] ^
[ERROR] C:\Users\alias\project\...\MyFlinkJob.scala:46: error: not found: type FileSource
My POM.xml is rather large, but I think I have the relevant imports:
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-parquet</artifactId>
    <version>1.15.2</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-filesystem_${scala.binary.version}</artifactId>
    <version>1.11.6</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-hadoop-bulk_2.12</artifactId>
    <version>1.14.6</version>
</dependency>
I am using the following versions:
<scala.version>2.12.16</scala.version>
<scala.binary.version>2.12</scala.binary.version>
<log4j.version>2.17.1</log4j.version>
<flink.version>1.15.1</flink.version>
Do I need a different import path for Scala than Java?
I wish the Flink documentation included the imports in its example code snippets, as I spent a long time trying to figure out the imports. What are the recommended ._ imports?
I've looked through the symbols in the package but didn't find FileSource. I looked for different tutorials and example projects showing how to read/listen to Parquet and CSV files with Flink. I made some progress this way, but in the few Scala (not Java) examples I found that use Parquet files as a source, the imports still didn't work even after adding their dependencies and running mvn clean install.
I tried using GitHub's advanced search to find a public Scala project using FileSource and eventually found one with the following dependency:
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-files</artifactId>
    <version>${project.version}</version>
</dependency>
This package was missing on index.scala-lang.org, where I thought I should be looking for dependencies (this is my first Scala project, so I assumed that was the place to find packages, like PyPI in Python). It seems that MVN Repository may be a better place to look.
Flink 1.15 has a Scala-free classpath, which has resulted in a number of Flink artifacts no longer having a Scala suffix. You can read all about it in the dedicated Flink blog on this topic: https://flink.apache.org/2022/02/22/scala-free.html
You can also see in that blog how you can use any Scala version with Flink instead of being limited to Scala 2.12.6.
TL;DR: you should use the Java APIs in your application. The Scala APIs will also be deprecated as of Flink 1.17.
Last but not least: don't mix and match Flink versions. That won't work.
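For reference, here is a minimal sketch of using the Java FileSource API from Scala for the CSV/text part, assuming flink-connector-files is on the classpath at the same ${flink.version} as everything else; the object name, input path and the TextLineInputFormat reader (which I believe ships with flink-connector-files in 1.15) are illustrative choices, not taken from the question:

import org.apache.flink.api.common.eventtime.WatermarkStrategy
import org.apache.flink.connector.file.src.FileSource
import org.apache.flink.connector.file.src.reader.TextLineInputFormat
import org.apache.flink.core.fs.Path
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment

object FileSourceSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Build a FileSource from the Java connector API; the path is a placeholder.
    val source = FileSource
      .forRecordStreamFormat(new TextLineInputFormat(), new Path("/path/to/input.csv"))
      .build()

    // Read the file as a stream of lines and print them.
    env
      .fromSource(source, WatermarkStrategy.noWatermarks[String](), "csv-file-source")
      .print()

    env.execute("FileSource sketch")
  }
}

For Parquet, the analogous entry point is FileSource.forBulkFileFormat with a Parquet BulkFormat from flink-parquet.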

Using the AWS Java SDK in Scala

I'm modifying some parts of Spark core, which is written in Scala. For that, I want to call the AWS Java API. As far as I know, it is possible to import Java libraries in Scala code, since the Scala code already contains Java library calls and imports like this:
import java.util.concurrent.{ScheduledFuture, TimeUnit}
Here they are importing some built-in Java libraries, but I want to import the AWS Java SDK. In the official documentation, they say that to use the SDK we should add the dependency to the project's pom.xml file so the project can be built with mvn:
<dependencies>
    <dependency>
        <groupId>com.amazonaws</groupId>
        <artifactId>aws-java-sdk</artifactId>
        <version>1.11.106</version>
    </dependency>
</dependencies>
I'm wondering whether this is enough. Can I now import AWS Java classes in spark Scala source code?
Can I now import AWS Java classes in spark Scala source code?
Yes
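For example, with that dependency on the classpath, something like this compiles and runs from Scala (the region and the bucket listing are just placeholders to show the Java builder API being called from Scala):

import com.amazonaws.regions.Regions
import com.amazonaws.services.s3.{AmazonS3, AmazonS3ClientBuilder}
import scala.collection.JavaConverters._

object S3FromScalaSketch {
  def main(args: Array[String]): Unit = {
    // Build an S3 client via the Java SDK builder; pick your own region/credentials setup.
    val s3: AmazonS3 = AmazonS3ClientBuilder.standard()
      .withRegion(Regions.US_EAST_1)
      .build()

    // List bucket names just to confirm the Java API is usable from Scala.
    s3.listBuckets().asScala.foreach(bucket => println(bucket.getName))
  }
}

Scala can call any Java class directly, so the only requirement is that the SDK jar ends up on the compile and runtime classpath via Maven.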

Dependency error while setting up spark with java in eclipse

I am new to Spark and Java. I was trying to set up the Spark environment in Eclipse using a Maven dependency. I am using Java 1.8 with Scala 2.11.7. I have created a Scala project in Eclipse and added a Maven dependency.
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.1.0</version>
</dependency>
Now I am getting the error: "Failure to transfer org.spark-project.spark:unused:jar:1.0.0 from https://repo.maven.apache.org/maven2 was cached in the local repository, resolution will not be reattempted until the update interval of central has elapsed or updates are forced."
I am new too. These are my steps for working with Eclipse, Scala 2.11, Maven and Spark:
1. Create a Maven project.
2. Use this POM as a basic orientation: https://github.com/entradajuan/Spark0/blob/master/Spark0/pom.xml
3. Convert the project to Scala: right button -> Configure -> Convert to Scala.
4. Then create a package in src/main/java.
5. Create a Scala object with its def main in the new package.
So far I always start the same way. I don't use Java in my projects.
It also works fine when running the Maven build "clean install" to get the jar.
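As a concrete example of step 5, here is a minimal sketch of a Scala object with a def main that runs Spark locally (the object and app names are arbitrary, and the API matches the spark-core_2.11:2.1.0 dependency from the question):

import org.apache.spark.{SparkConf, SparkContext}

object HelloSpark {
  def main(args: Array[String]): Unit = {
    // Local mode so the job runs directly from the IDE.
    val conf = new SparkConf().setAppName("HelloSpark").setMaster("local[*]")
    val sc = new SparkContext(conf)
    println(s"Count: ${sc.parallelize(1 to 100).count()}")
    sc.stop()
  }
}

As for the "Failure to transfer ... was cached in the local repository" message itself, forcing a dependency update (for example mvn clean install -U) or deleting the stale entry under ~/.m2/repository usually clears the cached failure.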

Access CQ5 project bundles on same instance

I have two project bundles on my local CQ/AEM server. Project A contains some Java util class methods that can be used in project B as well.
While developing, how do I import my project A classes in project B to access those methods, so that I do not have to duplicate them?
I tried adding a dependency in my project B bundle pom.xml as below. Is this correct?
<dependency>
    <groupId>com.project-a</groupId>
    <artifactId>cq-project-a</artifactId>
    <version>1.0-SNAPSHOT</version>
</dependency>
I get a missing artifact error for this:
"Missing artifact com.project-a:cq-project-a:jar:1.0-SNAPSHOT"
Please suggest how the import can be done.
Thanks
I guess you forgot to build project A using mvn install. The dependency will be looked up in your local Maven repo.
This solution may fix your issue: update the pom.xml of project A, and make sure the groupId, artifactId, version and packaging tags look like this:
<groupId>com.project-a</groupId>
<artifactId>cq-project-a</artifactId>
<version>1.0-SNAPSHOT</version>
<packaging>bundle</packaging>
Then run mvn clean install on project A, then run mvn clean install one more time on project B. I applied this to my last project; I hope it works for you.
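In other words, the build order assumed here is roughly (directory names are placeholders):

cd project-a && mvn clean install    # installs com.project-a:cq-project-a:1.0-SNAPSHOT into the local repo
cd ../project-b && mvn clean install # can now resolve the dependency declared above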

How to build spark application using Scala IDE and Maven?

I'm new to Scala, Spark and Maven and would like to build the Spark application described here. It uses the Mahout library.
I have Scala IDE installed and would like to use Maven to build the dependencies (the Mahout library as well as the Spark library). I couldn't find a good tutorial to get started. Could someone help me figure it out?
First, try compiling a simple application with Maven in Scala IDE. The key to a Maven project is the directory structure and pom.xml. Although I don't use Scala IDE, this document seems helpful:
http://scala-ide.org/docs/tutorials/m2eclipse/
The next step is to add a dependency on Spark in pom.xml; you can follow this document:
http://blog.cloudera.com/blog/2014/04/how-to-run-a-simple-apache-spark-app-in-cdh-5/
For the latest versions of the Spark and Mahout artifacts, you can check here:
http://mvnrepository.com/artifact/org.apache.spark
http://mvnrepository.com/artifact/org.apache.mahout
Hope this helps.
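For orientation, the conventional Maven layout for a Scala project looks roughly like this (project and package names are just examples):

my-spark-app/
  pom.xml
  src/
    main/
      scala/
        com/example/MySparkApp.scala
    test/
      scala/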
You need the following tools to get started (based on recent availability):
Scala IDE for Eclipse – download the latest version of Scala IDE from here.
Scala version – 2.11 (make sure the Scala compiler is set to this version as well).
Spark version 2.2 (provided as a Maven dependency).
winutils.exe – for running in a Windows environment, you need Hadoop binaries in Windows format. winutils provides that, and we need to set the hadoop.home.dir system property to the bin path inside which winutils.exe is present. You can download winutils.exe here and place it at a path like c:/hadoop/bin/winutils.exe.
And you can define the Spark Core dependency in your project's Maven pom.xml to get started:
<dependency> <!-- Spark dependency -->
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.2.0</version>
    <scope>provided</scope>
</dependency>
And in your Java/Scala class, define this property to run in your local environment on Windows:
System.setProperty("hadoop.home.dir", "c://hadoop//");
More details and the full setup can be found here.
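Putting the pieces together, a minimal sketch of a Scala entry point for the setup above (the object name, sample data and the Hadoop path are examples, not part of the original answer):

import org.apache.spark.{SparkConf, SparkContext}

object SparkOnWindows {
  def main(args: Array[String]): Unit = {
    // Point Hadoop at the directory that contains bin/winutils.exe (adjust the path as needed).
    System.setProperty("hadoop.home.dir", "c:/hadoop/")

    val conf = new SparkConf().setAppName("SparkOnWindows").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // A tiny word count over in-memory data to verify the setup.
    val words = sc.parallelize(Seq("spark", "scala", "maven", "spark"))
    words.map(w => (w, 1)).reduceByKey(_ + _).collect().foreach(println)

    sc.stop()
  }
}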