The error java.lang.ClassNotFoundException when running spark-submit - scala

I want to run spark-submit for my Scala Spark application. These are the steps I took:
1) execute Maven Clean and Package from IntelliJ IDEA to get myTest.jar
2) execute the following spark-submit command:
spark-submit --name 28 --master local[2] --class org.test.consumer.TestRunner \
/usr/tests/test1/target/myTest.jar \
$arg1 $arg2 $arg3 $arg4 $arg5
This is the TestRunner object that I want to run:
package org.test.consumer

import org.test.consumer.kafka.KafkaConsumer

object TestRunner {

  def main(args: Array[String]) {
    val Array(zkQuorum, group, topic1, topic2, kafkaNumThreads) = args
    val processor = new KafkaConsumer(zkQuorum, group, topic1, topic2)
    processor.run(kafkaNumThreads.toInt)
  }

}
But the spark-submit command fails with the following message:
java.lang.ClassNotFoundException: org.test.consumer.TestRunner
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:225)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:686)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
I don't really understand why the object TestRunner cannot be found if the package is specified correctly... Does it have something to do with using an object instead of a class?
UPDATE:
The project structure (the folder scala is currently marked as Sources):
/usr/tests/test1
    .idea
    src
        main
            docker
            resources
            scala
                org
                    test
                        consumer
                            kafka
                                KafkaConsumer.scala
                            TestRunner.scala
        test
    target
    pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.test.abc</groupId>
    <artifactId>consumer</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>2.11.8</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>1.6.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka_2.11</artifactId>
            <version>1.6.2</version>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.module</groupId>
            <artifactId>jackson-module-scala_2.11</artifactId>
            <version>2.7.5</version>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-core</artifactId>
            <version>2.7.5</version>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-databind</artifactId>
            <version>2.7.5</version>
        </dependency>
        <dependency>
            <groupId>org.sedis</groupId>
            <artifactId>sedis_2.11</artifactId>
            <version>1.2.2</version>
        </dependency>
        <dependency>
            <groupId>com.lambdaworks</groupId>
            <artifactId>jacks_2.11</artifactId>
            <version>2.3.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>1.6.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib-local_2.11</artifactId>
            <version>2.0.0</version>
        </dependency>
        <dependency>
            <groupId>com.github.nscala-time</groupId>
            <artifactId>nscala-time_2.11</artifactId>
            <version>2.12.0</version>
        </dependency>
    </dependencies>
</project>

@FiofanS, the problem is in your directory structure.
Maven uses a convention-over-configuration policy. This means that, by default, Maven expects you to follow the set of rules it has defined. For example, it expects all your code to be in the src/main/java directory (see the Maven Standard Directory Structure). But your code is not in src/main/java; it is in src/main/scala, and by default Maven does not treat src/main/scala as a source location.
Although Maven expects you to follow the rules it has defined, it doesn't enforce them; it also provides ways to configure things based on your preference.
In your case, you have to explicitly instruct Maven to treat src/main/scala as an additional source location.
To do this, you will have to use the Maven Build Helper Plugin.
Add the below piece of code within the <project>...</project> tag in your pom.xml
<build>
    <plugins>
        <plugin>
            <groupId>org.codehaus.mojo</groupId>
            <artifactId>build-helper-maven-plugin</artifactId>
            <version>1.7</version>
            <executions>
                <execution>
                    <id>add-source</id>
                    <phase>generate-sources</phase>
                    <goals>
                        <goal>add-source</goal>
                    </goals>
                    <configuration>
                        <sources>
                            <source>src/main/scala</source>
                        </sources>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
This should solve your problem.
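Once the plugin is in place, it is worth confirming that the class actually made it into the jar before re-running spark-submit. A quick sanity check, assuming the jar path from your question:

mvn clean package
jar tf /usr/tests/test1/target/myTest.jar | grep TestRunner

If the packaging worked, the output should list org/test/consumer/TestRunner.class (and org/test/consumer/TestRunner$.class, since the Scala compiler emits an extra class for an object).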

Related

Failed to load class error due to seemingly fine code

I have been intermittently receiving the "Failed to load class [main class]" error from seemingly working code and would like to understand why.
[ec2-user@ip-xxx localtesting]$ spark-submit --master yarn --num-executors 1 --executor-cores 1 --class com.monster.SanitycheckForNewLine /home/ec2-user/localtesting/NextGenETL-testing.jar
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/share/aws/emr/emrfs/lib/slf4j-log4j12-1.7.12.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/share/aws/redshift/jdbc/redshift-jdbc42-1.2.37.1061.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Error: Failed to load class com.monster.SanitycheckForNewLine.
22/06/07 20:32:14 INFO ShutdownHookManager: Shutdown hook called
22/06/07 20:32:14 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-8150c3fa-715f-4475-ac15-b9881168c4ee
In this case, the specific line that is causing the error is in a property file named service_type.properties and the property itself is:
from=SCHEMA_NAME.SERVICE_TYPE ST
Now my code works when I set the property in question to a random string (I used the character as a test). Unfortunately, I have tried manually typing out the line, but the error persists (meaning I am not looking at a special-char lookalike issue), and I checked the newline character.
At this point, I am trying to understand both why this issue is occurring and how to more easily find the random lines that have seemingly no issues but break at runtime (I have seen this issue appear in the Scala code itself too; I am just using the property file as an easy example).
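One generic way to hunt for invisible characters in a file like this (a debugging sketch assuming GNU cat and grep, not something from the original setup) is to make non-printing characters visible and to scan for non-ASCII bytes:

# show tabs (^I), line endings ($), and other non-printing characters
cat -A service_type.properties

# list any lines containing a non-ASCII byte
grep -nP '[^\x00-\x7F]' service_type.properties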
Regarding my environment, I am running Maven with the following pom.xml:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId><.../></groupId>
    <artifactId><.../></artifactId>
    <version><.../></version>

    <properties>
        <scala.version>2.12.2</scala.version>
        <scala.binary.version>2.12</scala.binary.version>
        <scala.plugin.version>2.12.2</scala.plugin.version>
        <aws.sdk.bom.version>2.17.140</aws.sdk.bom.version>
        <maven.compiler.source>11</maven.compiler.source>
        <maven.compiler.target>11</maven.compiler.target>
    </properties>

    <dependencyManagement>
        <dependencies>
            <!-- https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-bom -->
            <dependency>
                <groupId>software.amazon.awssdk</groupId>
                <artifactId>bom</artifactId>
                <version>${aws.sdk.bom.version}</version>
                <type>pom</type>
                <scope>import</scope>
            </dependency>
        </dependencies>
    </dependencyManagement>

    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_${scala.binary.version}</artifactId>
            <version>2.4.4</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_${scala.binary.version}</artifactId>
            <version>2.4.4</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/com.amazon.emr/emr-dynamodb-hadoop -->
        <dependency>
            <groupId>com.amazon.emr</groupId>
            <artifactId>emr-dynamodb-hadoop</artifactId>
            <version>4.2.0</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>com.amazonaws</groupId>
            <artifactId>aws-java-sdk-dynamodb</artifactId>
            <version>1.12.44</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-ssm -->
        <!--<dependency>
            <groupId>com.amazonaws</groupId>
            <artifactId>aws-java-sdk-ssm</artifactId>
            <version>1.12.44</version>
        </dependency>-->
        <!-- <dependency>
            <groupId>software.amazon.awssdk</groupId>
            <artifactId>aws-sdk-java</artifactId>
            <version>2.17.15</version>
        </dependency>-->
        <dependency>
            <groupId>software.amazon.awssdk</groupId>
            <artifactId>dynamodb</artifactId>
        </dependency>
        <dependency>
            <groupId>software.amazon.awssdk</groupId>
            <artifactId>s3</artifactId>
        </dependency>
        <dependency>
            <groupId>software.amazon.awssdk</groupId>
            <artifactId>ssm</artifactId>
        </dependency>
        <dependency>
            <groupId>software.amazon.awssdk</groupId>
            <artifactId>secretsmanager</artifactId>
        </dependency>
        <dependency>
            <groupId>software.amazon.awssdk</groupId>
            <artifactId>aws-xml-protocol</artifactId>
        </dependency>
        <dependency>
            <groupId>software.amazon.awssdk</groupId>
            <artifactId>sts</artifactId>
        </dependency>
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
            <version>4.5.13</version>
        </dependency>
        <dependency>
            <groupId>com.google.code.gson</groupId>
            <artifactId>gson</artifactId>
            <version>2.8.2</version>
        </dependency>
    </dependencies>

    <build>
        <sourceDirectory>src/main/scala</sourceDirectory>
        <testSourceDirectory>src/test/scala</testSourceDirectory>
        <plugins>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>4.6.1</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
                <configuration>
                    <scalaVersion>${scala.version}</scalaVersion>
                </configuration>
            </plugin>
            <.../>
        </plugins>
    </build>

    <scm>
        <.../>
    </scm>
</project>
SanitycheckForNewLine.scala:
import com.monster.utilities.constants.{info, log}

object SanitycheckForNewLine {
  def main(args: Array[String]): Unit = {
    log(info, "I'm sane!")
  }
}

Why "java.lang.ClassNotFoundException: Failed to find data source: kinesis" with spark-streaming-kinesis-asl dependency?

My setup:
Scala: 2.11.8
Spark: 2.3.0.cloudera4
I have already added this to my pom.xml:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kinesis-asl_2.11</artifactId>
    <version>2.3.0</version>
</dependency>
However, when I run my Spark Streaming code to consume data from Kinesis, it returns:
Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: kinesis.
I got a similar error when consuming data from Kafka and solved it by specifying the dependent jar in the submit command, but that doesn't seem to work this time:
sudo -u hdfs spark2-submit --packages org.apache.spark:spark-streaming-kinesis-asl_2.11:2.3.0 --class com.package.newkinesis --master yarn sparktest-1.0-SNAPSHOT.jar
How to address this issue? Any help is appreciated.
My code:
val spark = SparkSession
  .builder.master("local[4]")
  .appName("SpeedTester")
  .config("spark.driver.memory", "3g")
  .getOrCreate()

val kinesis = spark.readStream
  .format("kinesis")
  .option("streamName", kinesisStreamName)
  .option("endpointUrl", kinesisEndpointUrl)
  .option("initialPosition", "TRIM_HORIZON")
  .option("awsAccessKey", awsAccessKeyId)
  .option("awsSecretKey", awsSecretKey)
  .load()

kinesis.writeStream.format("console").start().awaitTermination()
My full pom.xml:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.netease</groupId>
    <artifactId>sparktest</artifactId>
    <version>1.0-SNAPSHOT</version>
    <inceptionYear>2008</inceptionYear>

    <properties>
        <scala.version>2.11.8</scala.version>
    </properties>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.2.1</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <includes>
                                <include>org/apache/spark/*</include>
                            </includes>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <scope>provided</scope>
            <version>2.3.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <scope>provided</scope>
            <version>2.3.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <scope>provided</scope>
            <version>2.3.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
            <version>2.3.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka-clients</artifactId>
            <version>2.1.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kinesis-asl_2.11</artifactId>
            <version>2.3.0</version>
        </dependency>
    </dependencies>
</project>
tl;dr It won't work.
You are using the spark-streaming-kinesis-asl_2.11 dependency, which belongs to the older DStream-based Spark Streaming API, together with the new Spark Structured Streaming API (spark.readStream), hence the exception.
You have to find a compatible Spark Structured Streaming data source for AWS Kinesis, which is not officially supported by the Apache Spark project.
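Alternatively, if you want to stay on the spark-streaming-kinesis-asl_2.11 dependency, you have to consume the stream through the DStream API that library actually provides, rather than through spark.readStream. Here is a minimal sketch of that approach (the region name, batch interval, and checkpoint application name are assumptions, not values from the question):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.{KinesisInitialPositions, KinesisInputDStream}

// reuse the SparkSession from the question for its underlying SparkContext
val ssc = new StreamingContext(spark.sparkContext, Seconds(10))

val stream = KinesisInputDStream.builder
  .streamingContext(ssc)
  .streamName(kinesisStreamName)
  .endpointUrl(kinesisEndpointUrl)
  .regionName("us-east-1")                                    // assumed region
  .initialPosition(new KinesisInitialPositions.TrimHorizon())
  .checkpointAppName("SpeedTester")                           // also names the DynamoDB checkpoint table
  .checkpointInterval(Seconds(10))
  .storageLevel(StorageLevel.MEMORY_AND_DISK_2)
  .build()

// each Kinesis record arrives as Array[Byte]
stream.map(bytes => new String(bytes, "UTF-8")).print()

ssc.start()
ssc.awaitTermination()

As an example of the third-party route, the qubole/kinesis-sql connector registers a kinesis Structured Streaming source that works with the spark.readStream code as written.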

Log4j warning while using axis2-wsdl2code-maven-plugin to generate SOAP client

I am using axis2-wsdl2code-maven-plugin to generate my SOAP service client. The plugin itself works correctly and generates a correct SOAP client, but I get the following warning in the console with every build:
log4j:WARN No appenders could be found for logger (org.apache.axiom.locator.DefaultOMMetaFactoryLocator).
log4j:WARN Please initialize the log4j system properly.
I know I need to configure the Log4j properties, but I haven't found any working way to do this in the context of axis2-wsdl2code-maven-plugin...
This is my pom.xml file:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <artifactId>powerauth-java-client-axis</artifactId>
    <version>0.13.0</version>
    <name>powerauth-java-client-axis</name>
    <description>PowerAuth 2.0 Service Client - Axis</description>

    <parent>
        <groupId>io.getlime.security</groupId>
        <artifactId>powerauth-parent</artifactId>
        <version>0.13.0</version>
        <relativePath>../pom.xml</relativePath>
    </parent>

    <dependencies>
        <dependency>
            <groupId>org.apache.axis2</groupId>
            <artifactId>axis2</artifactId>
            <version>1.6.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.axis2</groupId>
            <artifactId>axis2-adb</artifactId>
            <version>1.6.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.axis2</groupId>
            <artifactId>axis2-transport-http</artifactId>
            <version>1.6.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.axis2</groupId>
            <artifactId>axis2-transport-local</artifactId>
            <version>1.6.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.ws.commons.axiom</groupId>
            <artifactId>axiom-api</artifactId>
            <version>1.2.20</version>
        </dependency>
        <dependency>
            <groupId>org.apache.ws.commons.axiom</groupId>
            <artifactId>axiom-impl</artifactId>
            <version>1.2.20</version>
        </dependency>
        <dependency>
            <groupId>org.apache.ws.security</groupId>
            <artifactId>wss4j</artifactId>
            <version>1.6.19</version>
        </dependency>
        <dependency>
            <groupId>wsdl4j</groupId>
            <artifactId>wsdl4j</artifactId>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <!-- tag::wsdl[] -->
            <plugin>
                <groupId>org.apache.axis2</groupId>
                <artifactId>axis2-wsdl2code-maven-plugin</artifactId>
                <version>1.6.4</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>wsdl2code</goal>
                        </goals>
                        <configuration>
                            <packageName>io.getlime.powerauth.soap</packageName>
                            <wsdlFile>${basedir}/src/main/resources/soap/wsdl/service.wsdl</wsdlFile>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
            <!-- end::wsdl[] -->
        </plugins>
    </build>
</project>
This is actually a bug. The fix for AXIS2-5364 added a dependency on log4j to axis2-wsdl2code-maven-plugin. The problem is that some of the code executed by the plugin uses Commons Logging and therefore started to log through log4j. That generates the warning you have seen, because in a Maven environment log4j isn't configured.
What the plugin should do is redirect its logs to SLF4J, because that API is supported by recent Maven versions; the -X option (which enables debug logging on the Maven command line) would then work for these logs as well.
This problem will be addressed in AXIS2-5827.
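Until then, one possible way to quiet the warning (a workaround sketch that relies on log4j 1.x's standard log4j.configuration system property, not on anything axis2-wsdl2code-maven-plugin documents) is to point the Maven JVM at a minimal log4j configuration of your own:

# /path/to/log4j.properties -- hypothetical path, any minimal config will do
log4j.rootLogger=ERROR, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.SimpleLayout

export MAVEN_OPTS="-Dlog4j.configuration=file:///path/to/log4j.properties"
mvn clean generate-sources

Since the plugin runs inside the Maven JVM, the property is visible to it and the "no appenders" warning should go away.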

How to write my own job in samza

Recently I have been trying to do some stream processing work on the Samza framework. I have deployed the hello-samza example successfully. However, when I try to write my own job, I have no idea where to start.
I have read this document, but I still can't get the point. Can anyone help me with:
1) What should my code's structure be (source code, lib code, and configuration)?
2) Which directory should my code be put in?
3) What other work do I need to do to get my code to run?
Your suggestions will help me a lot. Many thanks!
It is very simple to build your own jobs. First, get hello-samza:
git clone https://git.apache.org/samza-hello-samza.git hello-samza
The next step is to set up the system with this command:
bin/grid bootstrap
Make sure everything is running correctly with jps.
The next step, if you are building your project inside hello-samza, is to remove the apache-rat-plugin from pom.xml.
Once you have removed it, you can add a Java job file in the src folder (MyTask.java) and a .properties file inside the config directory (MyTask.properties).
This is a sample empty job (MyTask.java):
package com.samza;

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class MyTask implements StreamTask {
    private static final SystemStream OUTPUT_STREAM = new SystemStream("kafka", "topicOut");

    public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
                        TaskCoordinator coordinator) throws Exception {
        // Do something useful
    }
}
Don't forget to write the .properties file as well.
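A minimal MyTask.properties for the job above might look like the following (a sketch modelled on the hello-samza example configs; the input topic name and the serde choice are assumptions):

# config/MyTask.properties
job.factory.class=org.apache.samza.job.yarn.YarnJobFactory
job.name=my-task
yarn.package.path=file://${basedir}/target/${project.artifactId}-${pom.version}-dist.tar.gz

task.class=com.samza.MyTask
task.inputs=kafka.topicIn

systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory
systems.kafka.consumer.zookeeper.connect=localhost:2181/
systems.kafka.producer.bootstrap.servers=localhost:9092

serializers.registry.string.class=org.apache.samza.serializers.StringSerdeFactory
systems.kafka.samza.msg.serde=string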
If your code compiles without errors, build it with Maven:
mvn clean package
mkdir -p deploy/samza
tar -xvf ./samza-job-package/target/samza-job-package-0.10.0-dist.tar.gz -C deploy/samza
After that, once your server is up (if it is not, you can start it with ./bin/grid start all), you can deploy your job with:
deploy/samza/bin/run-job.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=file://$PWD/deploy/samza/config/MyTask.properties
and consume the result with the Kafka console consumer:
deploy/kafka/bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic outTopic
If you follow the Hello Samza instructions, you will have a fully functioning Zookeeper, Kafka, and Yarn/Samza cluster running on your local computer. With that project there are the Wikipedia feed related tasks that you can run to test things.
However, like you, I had some trouble coming up with the proper directory structure and build settings for new tasks (without the cluster management stuff). So, I created hello-samza-base by stripping out everything unnecessary for building new tasks outside of hello-samza. I included instructions in the README on building new tasks.
As far as deployment goes, that is a bit more complex. Do some reading on creating Zookeeper, Kafka, and Yarn clusters.
I created Samza jobs starting from a Maven Eclipse project. The dependencies for version 0.9.2 were loaded in the pom.xml file with this content (I had some version issues, so you may have some work to do there):
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.acio.samza</groupId>
    <artifactId>samzafroga</artifactId>
    <version>0.0.1</version>
    <name>samzafroga</name>

    <dependencies>
        <dependency>
            <groupId>org.apache.samza</groupId>
            <artifactId>samza-api</artifactId>
        </dependency>
        <dependency>
            <groupId>org.apache.samza</groupId>
            <artifactId>samza-core_2.10</artifactId>
        </dependency>
        <dependency>
            <groupId>org.apache.samza</groupId>
            <artifactId>samza-log4j</artifactId>
        </dependency>
        <dependency>
            <groupId>org.apache.samza</groupId>
            <artifactId>samza-shell</artifactId>
        </dependency>
        <dependency>
            <groupId>org.apache.samza</groupId>
            <artifactId>samza-yarn_2.10</artifactId>
            <exclusions>
                <exclusion>
                    <artifactId>jdk.tools</artifactId>
                    <groupId>jdk.tools</groupId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>org.apache.samza</groupId>
            <artifactId>samza-kv_2.10</artifactId>
        </dependency>
        <dependency>
            <groupId>org.apache.samza</groupId>
            <artifactId>samza-kv-rocksdb_2.10</artifactId>
        </dependency>
        <dependency>
            <groupId>org.apache.samza</groupId>
            <artifactId>samza-kafka_2.10</artifactId>
        </dependency>
        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka_2.10</artifactId>
        </dependency>
        <dependency>
            <groupId>org.schwering</groupId>
            <artifactId>irclib</artifactId>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
        </dependency>
        <dependency>
            <groupId>org.codehaus.jackson</groupId>
            <artifactId>jackson-jaxrs</artifactId>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
        </dependency>
        <dependency>
            <groupId>org.eclipse.jetty</groupId>
            <artifactId>jetty-server</artifactId>
            <version>${jettyVersion}</version>
        </dependency>
        <dependency>
            <groupId>org.eclipse.jetty</groupId>
            <artifactId>jetty-webapp</artifactId>
            <version>${jettyVersion}</version>
        </dependency>
    </dependencies>

    <properties>
        <!-- maven specific properties -->
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <samza.version>0.8.0</samza.version>
        <jettyVersion>7.6.16.v20140903</jettyVersion>
    </properties>

    <repositories>
        <repository>
            <id>apache-releases</id>
            <url>https://repository.apache.org/content/groups/public</url>
        </repository>
        <repository>
            <id>scala-tools.org</id>
            <name>Scala-tools Maven2 Repository</name>
            <url>https://oss.sonatype.org/content/groups/scala-tools</url>
        </repository>
    </repositories>

    <pluginRepositories>
        <pluginRepository>
            <id>scala-tools.org</id>
            <name>Scala-tools Maven2 Repository</name>
            <url>http://scala-tools.org/repo-releases</url>
        </pluginRepository>
    </pluginRepositories>

    <build>
        <pluginManagement>
            <plugins>
                <plugin>
                    <groupId>org.apache.rat</groupId>
                    <artifactId>apache-rat-plugin</artifactId>
                    <version>0.9</version>
                    <configuration>
                        <excludes>
                            <exclude>*.patch</exclude>
                            <exclude>**/target/**</exclude>
                            <exclude>*.json</exclude>
                            <exclude>.vagrant/**</exclude>
                            <exclude>.git/**</exclude>
                            <exclude>*.md</exclude>
                            <exclude>docs/**</exclude>
                            <exclude>config/**</exclude>
                            <exclude>bin/**</exclude>
                            <exclude>.gitignore</exclude>
                            <exclude>**/.cache/**</exclude>
                            <exclude>deploy/**</exclude>
                            <exclude>**/.project</exclude>
                        </excludes>
                    </configuration>
                </plugin>
            </plugins>
        </pluginManagement>
        <plugins>
            <!-- plugin to build the tar.gz file filled with examples -->
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>2.3</version>
                <configuration>
                    <descriptors>
                        <descriptor>src/assembly/bin.xml</descriptor>
                    </descriptors>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

    <dependencyManagement>
        <dependencies>
            <dependency>
                <groupId>org.apache.samza</groupId>
                <artifactId>samza-api</artifactId>
                <version>${samza.version}</version>
            </dependency>
            <dependency>
                <groupId>org.apache.samza</groupId>
                <artifactId>samza-core_2.10</artifactId>
                <version>${samza.version}</version>
            </dependency>
            <dependency>
                <groupId>org.apache.samza</groupId>
                <artifactId>samza-log4j</artifactId>
                <version>${samza.version}</version>
            </dependency>
            <dependency>
                <groupId>org.apache.samza</groupId>
                <artifactId>samza-shell</artifactId>
                <version>${samza.version}</version>
            </dependency>
            <dependency>
                <groupId>org.apache.samza</groupId>
                <artifactId>samza-yarn_2.10</artifactId>
                <version>${samza.version}</version>
            </dependency>
            <dependency>
                <groupId>org.apache.samza</groupId>
                <artifactId>samza-kv_2.10</artifactId>
                <version>${samza.version}</version>
            </dependency>
            <dependency>
                <groupId>org.apache.samza</groupId>
                <artifactId>samza-kv-rocksdb_2.10</artifactId>
                <version>${samza.version}</version>
            </dependency>
            <dependency>
                <groupId>org.apache.samza</groupId>
                <artifactId>samza-kafka_2.10</artifactId>
                <version>${samza.version}</version>
            </dependency>
            <dependency>
                <groupId>org.apache.kafka</groupId>
                <artifactId>kafka_2.10</artifactId>
                <version>0.8.2.0</version>
            </dependency>
            <dependency>
                <groupId>org.schwering</groupId>
                <artifactId>irclib</artifactId>
                <version>1.10</version>
            </dependency>
            <dependency>
                <groupId>org.slf4j</groupId>
                <artifactId>slf4j-api</artifactId>
                <version>1.6.2</version>
            </dependency>
            <dependency>
                <groupId>org.slf4j</groupId>
                <artifactId>slf4j-log4j12</artifactId>
                <version>1.6.2</version>
            </dependency>
            <dependency>
                <groupId>org.codehaus.jackson</groupId>
                <artifactId>jackson-jaxrs</artifactId>
                <version>1.8.5</version>
            </dependency>
            <dependency>
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-hdfs</artifactId>
                <version>2.6.0</version>
            </dependency>
        </dependencies>
    </dependencyManagement>
</project>
The basic code of a job is this one:
package xxxx;

import java.util.Map;

import org.apache.samza.config.Config;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

public class Redirect implements StreamTask {
    private final SystemStream OUTPUT_STREAM = new SystemStream("kafka", "samzaout");

    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        String msg = (String) envelope.getMessage();

        // Transformation
        String outmsg = "xxx-" + msg + "-xxx";

        collector.send(new OutgoingMessageEnvelope(OUTPUT_STREAM, outmsg));
    }
}
Once you have it compiled, you need to package it into a jar file and place it in a location accessible to all the Samza nodes, on a web server or HDFS.
Reference this package from the properties file you will have to create to launch the job; look for examples on the project web page, or see the fragment below.
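For instance, the job-specific entries in that properties file could look like this (a hypothetical fragment; the topic name and package URL are placeholders, not values from this answer):

job.name=redirect
task.class=xxxx.Redirect
task.inputs=kafka.sometopic
yarn.package.path=hdfs://namenode:8020/samza/samzafroga-0.0.1-dist.tar.gz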
Read that document some more, look at the hello-samza example some more, and if you deployed it to YARN read about it some more. All the answers you're looking for are there.
There are three jobs in hello-samza. Pick one and follow it: configuration, launch scripts, etc.
Here is how the hello-samza page launches the wikipedia-feed job:
deploy/samza/bin/run-job.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=file://$PWD/deploy/samza/config/wikipedia-feed.properties
The properties file shows where the compiled job/task code is among other things. The source code for the wikipedia-feed job/task is here:
https://github.com/apache/samza-hello-samza/blob/master/src/main/java/samza/examples/wikipedia/task/WikipediaFeedStreamTask.java
Just modify this job, or copy and modify, to get yours going.

Groovy Eclipse compiler plugin for maven not compiling anything

I'm trying to understand the correct setup of a Java and Groovy project in Maven, compiling the source files with the groovy-eclipse-compiler.
According to the plugin site, if you have files in src/main/java and src/test/java, the compiler should find both Java and Groovy files by default.
However, setting up the build like the example leaves target/classes empty.
<build>
    ...
    <plugins>
        <plugin>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>2.3.2</version>
            <configuration>
                <compilerId>groovy-eclipse-compiler</compilerId>
                <!-- set verbose to be true if you want lots of uninteresting messages -->
                <!-- <verbose>true</verbose> -->
            </configuration>
            <dependencies>
                <dependency>
                    <groupId>org.codehaus.groovy</groupId>
                    <artifactId>groovy-eclipse-compiler</artifactId>
                    <version>2.7.0-01</version>
                </dependency>
            </dependencies>
        </plugin>
        ...
    </plugins>
</build>
The only way I found to make it work was to use the build-helper-maven-plugin. This is the option used by the archetype too:
mvn archetype:generate \
-DarchetypeGroupId=org.codehaus.groovy \
-DarchetypeArtifactId=groovy-eclipse-quickstart \
-DarchetypeVersion=2.5.2-SNAPSHOT \
-DgroupId=foo \
-DartifactId=bar \
-Dversion=1 \
-DinteractiveMode=false \
-DarchetypeRepository=https://nexus.codehaus.org/content/repositories/snapshots/
So, is the page outdated? Isn't it enough to have Java files in the src folder?
I'm able to get this to work. Here is what I did:
1) create the archetype as you did above
2) edit the pom.xml to the file below
3) mv everything from src/main/groovy to src/main/java, and do the same for the tests
4) remove the empty groovy dirs
5) rename all *.java files to *.groovy
6) mvn clean compile
I get some expected compile errors due to differences in Java vs. Groovy syntax, but all Groovy files are recognized.
Does this work for you? If not what is different?
Here is the pom file I am using:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>foo</groupId>
    <artifactId>bar</artifactId>
    <version>1</version>
    <name>bar Groovy Eclipse Maven Java App</name>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>2.3.2</version>
                <configuration>
                    <compilerId>groovy-eclipse-compiler</compilerId>
                </configuration>
                <dependencies>
                    <dependency>
                        <groupId>org.codehaus.groovy</groupId>
                        <artifactId>groovy-eclipse-compiler</artifactId>
                        <version>2.7.0-01</version>
                        <exclusions>
                            <exclusion>
                                <groupId>org.codehaus.groovy</groupId>
                                <artifactId>groovy-eclipse-batch</artifactId>
                            </exclusion>
                        </exclusions>
                    </dependency>
                    <dependency>
                        <groupId>org.codehaus.groovy</groupId>
                        <artifactId>groovy-eclipse-batch</artifactId>
                        <version>2.1.1-01</version>
                    </dependency>
                </dependencies>
            </plugin>
        </plugins>
    </build>

    <dependencies>
        <dependency>
            <groupId>org.codehaus.groovy</groupId>
            <artifactId>groovy-all</artifactId>
            <version>2.1.1</version>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.8.2</version>
            <scope>test</scope>
        </dependency>
    </dependencies>
</project>
EDIT
Since I misunderstood the actual question, and you do want to keep the files separate, there are two options:
1) use the build-helper-maven-plugin as you have done
2) use the groovy-eclipse-compiler mojo:
<plugin>
    <groupId>org.codehaus.groovy</groupId>
    <artifactId>groovy-eclipse-compiler</artifactId>
    <version>2.7.0-01</version>
    <extensions>true</extensions>
</plugin>
This is all described on the groovy-eclipse-compiler page.