Scala on Eclipse: reading a CSV as a DataFrame throws java.lang.ArrayIndexOutOfBoundsException

Trying to read a simple CSV file and load it into a DataFrame throws a java.lang.ArrayIndexOutOfBoundsException.
As I am new to Scala I may have missed something trivial; however, a thorough search on both Google and Stack Overflow turned up nothing.
The code is the following:
import org.apache.spark.sql.SparkSession
object TransformInitial {
def main(args: Array[String]): Unit = {
val session = SparkSession.builder.master("local").appName("test").getOrCreate()
val df = session.read.format("csv").option("header", "true").option("inferSchema", "true").option("delimiter",",").load("data_sets/small_test.csv")
df.show()
}
}
small_test.csv is as simple as possible:
v1,v2,v3
0,1,2
3,4,5
Here is the actual pom of this Maven project:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>Scala_tests</groupId>
<artifactId>Scala_tests</artifactId>
<version>0.0.1-SNAPSHOT</version>
<build>
<sourceDirectory>src</sourceDirectory>
<resources>
<resource>
<directory>src</directory>
<excludes>
<exclude>**/*.java</exclude>
</excludes>
</resource>
</resources>
<plugins>
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.8.0</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
</configuration>
</plugin>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
</plugins>
</build>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.12</artifactId>
<version>2.4.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.12</artifactId>
<version>2.4.0</version>
</dependency>
</dependencies>
</project>
Executing the code throws the following java.lang.ArrayIndexOutOfBoundsException:
18/11/09 12:03:31 INFO FileSourceStrategy: Pruning directories with:
18/11/09 12:03:31 INFO FileSourceStrategy: Post-Scan Filters: (length(trim(value#0, None)) > 0)
18/11/09 12:03:31 INFO FileSourceStrategy: Output Data Schema: struct<value: string>
18/11/09 12:03:31 INFO FileSourceScanExec: Pushed Filters:
18/11/09 12:03:31 INFO CodeGenerator: Code generated in 413.859722 ms
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 10582
at com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.accept(BytecodeReadingParanamer.java:563)
at com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.access$200(BytecodeReadingParanamer.java:338)
at com.thoughtworks.paranamer.BytecodeReadingParanamer.lookupParameterNames(BytecodeReadingParanamer.java:103)
at com.thoughtworks.paranamer.CachingParanamer.lookupParameterNames(CachingParanamer.java:90)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.getCtorParams(BeanIntrospector.scala:44)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1(BeanIntrospector.scala:58)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1$adapted(BeanIntrospector.scala:58)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:241)
at scala.collection.Iterator.foreach(Iterator.scala:929)
at scala.collection.Iterator.foreach$(Iterator.scala:929)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1417)
at scala.collection.IterableLike.foreach(IterableLike.scala:71)
at scala.collection.IterableLike.foreach$(IterableLike.scala:70)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:241)
at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:238)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.findConstructorParam$1(BeanIntrospector.scala:58)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$19(BeanIntrospector.scala:176)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:32)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:29)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:191)
at scala.collection.TraversableLike.map(TraversableLike.scala:234)
at scala.collection.TraversableLike.map$(TraversableLike.scala:227)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:191)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$14(BeanIntrospector.scala:170)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$14$adapted(BeanIntrospector.scala:169)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:389)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:241)
at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:238)
at scala.collection.immutable.List.flatMap(List.scala:352)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.apply(BeanIntrospector.scala:169)
at com.fasterxml.jackson.module.scala.introspect.ScalaAnnotationIntrospector$._descriptorFor(ScalaAnnotationIntrospectorModule.scala:22)
at com.fasterxml.jackson.module.scala.introspect.ScalaAnnotationIntrospector$.fieldName(ScalaAnnotationIntrospectorModule.scala:30)
at com.fasterxml.jackson.module.scala.introspect.ScalaAnnotationIntrospector$.findImplicitPropertyName(ScalaAnnotationIntrospectorModule.scala:78)
at com.fasterxml.jackson.databind.introspect.AnnotationIntrospectorPair.findImplicitPropertyName(AnnotationIntrospectorPair.java:467)
at com.fasterxml.jackson.databind.introspect.POJOPropertiesCollector._addFields(POJOPropertiesCollector.java:351)
at com.fasterxml.jackson.databind.introspect.POJOPropertiesCollector.collectAll(POJOPropertiesCollector.java:283)
at com.fasterxml.jackson.databind.introspect.POJOPropertiesCollector.getJsonValueMethod(POJOPropertiesCollector.java:169)
at com.fasterxml.jackson.databind.introspect.BasicBeanDescription.findJsonValueMethod(BasicBeanDescription.java:223)
at com.fasterxml.jackson.databind.ser.BasicSerializerFactory.findSerializerByAnnotations(BasicSerializerFactory.java:348)
at com.fasterxml.jackson.databind.ser.BeanSerializerFactory._createSerializer2(BeanSerializerFactory.java:210)
at com.fasterxml.jackson.databind.ser.BeanSerializerFactory.createSerializer(BeanSerializerFactory.java:153)
at com.fasterxml.jackson.databind.SerializerProvider._createUntypedSerializer(SerializerProvider.java:1203)
at com.fasterxml.jackson.databind.SerializerProvider._createAndCacheUntypedSerializer(SerializerProvider.java:1157)
at com.fasterxml.jackson.databind.SerializerProvider.findValueSerializer(SerializerProvider.java:481)
at com.fasterxml.jackson.databind.SerializerProvider.findTypedValueSerializer(SerializerProvider.java:679)
at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:107)
at com.fasterxml.jackson.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:3559)
at com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:2927)
at org.apache.spark.rdd.RDDOperationScope.toJson(RDDOperationScope.scala:52)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:142)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:339)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3384)
at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2545)
at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:3365)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3365)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2545)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2759)
at org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.infer(CSVDataSource.scala:232)
at org.apache.spark.sql.execution.datasources.csv.CSVDataSource.inferSchema(CSVDataSource.scala:68)
at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:63)
at org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:183)
at scala.Option.orElse(Option.scala:289)
at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:180)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at TransformInitial$.main(TransformInitial.scala:9)
at TransformInitial.main(TransformInitial.scala)
For the record, the Eclipse version is 2018-09 (4.9.0).
I've hunted for special characters in the CSV with cat -A. It yielded nothing.
I'm out of options; something trivial must be missing, but I can't put a finger on it.

I'm not sure exactly what is causing your error, since the code works for me. It could be related to the version of the Scala compiler that you are using, since your Maven file gives no information about that.
I have posted my complete solution, which uses SBT, to GitHub. To execute the code, you'll need to install SBT, cd to the root folder of the checked-out source, then run the following command:
$ sbt run
BTW, I changed your code to take advantage of more idiomatic Scala conventions, and also used the csv function to load your file. The new Scala code looks like this:
import org.apache.spark.sql.SparkSession
// Extending App is more idiomatic than writing a "main" function.
object TransformInitial
extends App {
val session = SparkSession.builder.master("local").appName("test").getOrCreate()
// As of Spark 2.0, it's easier to read CSV files.
val df = session.read.option("header", "true").option("inferSchema", "true").csv("data_sets/small_test.csv")
df.show()
// Shutdown gracefully.
session.stop()
}
Note that I also removed the redundant delimiter option.

Downgrading the Scala version to 2.11 fixed it for me:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.4.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.4.0</version>
</dependency>
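The `_2.11` or `_2.12` suffix on a Spark artifact names the Scala binary version it was built for, so it has to agree with the Scala library the project actually runs on. The following is only a minimal sketch for checking that, assuming Scala 2.x; `ScalaVersionCheck` and its helper names are made up for illustration and are not part of Spark or Maven.

```scala
import scala.util.Properties

// Hypothetical helper, not part of any library.
object ScalaVersionCheck {
  // Reduce a full version such as "2.12.7" to its binary version "2.12".
  def binaryVersion(full: String): String =
    full.split('.').take(2).mkString(".")

  // Extract the text after the last underscore of an artifact id,
  // e.g. "spark-sql_2.11" gives Some("2.11"); ids without a suffix give None.
  def artifactSuffix(artifactId: String): Option[String] = {
    val idx = artifactId.lastIndexOf('_')
    if (idx >= 0 && idx < artifactId.length - 1) Some(artifactId.substring(idx + 1))
    else None
  }

  // True when the artifact suffix agrees with the given Scala version.
  def matches(artifactId: String, fullScalaVersion: String): Boolean =
    artifactSuffix(artifactId).contains(binaryVersion(fullScalaVersion))

  def main(args: Array[String]): Unit = {
    // Version of the scala-library that is on the runtime classpath.
    val running = Properties.versionNumberString
    println(s"Running on Scala $running")
    Seq("spark-core_2.11", "spark-sql_2.11").foreach { id =>
      println(s"$id matches the running Scala version: ${matches(id, running)}")
    }
  }
}
```

Running it from inside the IDE project shows which Scala library the IDE really puts on the classpath, which may differ from what the POM suggests.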

Related

Spark Scala read csv file using s3a

I am trying to read a csv (native) file from an S3 bucket using a locally running Spark - Scala. I am able to read the file using the http protocol but I intend to use the s3a protocol.
Below is the configuration setup before the call.
val awsId = System.getenv("AWS_ACCESS_KEY_ID")
val awsKey = System.getenv("AWS_SECRET_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3a.access.key", awsId)
sc.hadoopConfiguration.set("fs.s3a.secret.key", awsKey)
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.set("fs.s3a.aws.credentials.provider","org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider");
sc.hadoopConfiguration.set("com.amazonaws.services.s3.enableV4", "true")
sc.hadoopConfiguration.set("fs.s3a.endpoint", "us-east-1.amazonaws.com")
sc.hadoopConfiguration.set("fs.s3a.impl.disable.cache", "true")
Read the file and print the first 5 rows from the rdd/dataframe
val fileAPath = Files.s3aPath(Files.input);
println(s"reading file s3: $fileAPath")
// s3a://bucket-name/dataSets/policyoutput.csv
val df = sc.textFile(fileAPath);
df.take(5).foreach(println);
I am getting the below exception
Exception in thread "main" com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: FD92FDC175C64AA2, AWS Error Code: null, AWS Error Message: Bad Request, S3 Extended Request ID: IuloUEASgqnY4lrSMpbyJpwgFfCFbttxuxmJ9hGHMUgZTbO/UR/YyDgjix+3rBe0Y4MQHPzNvhA=
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:154)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:258)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:194)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1333)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.take(RDD.scala:1327)
Any help or direction for further investigation will be much appreciated.
Thanks
For anyone else struggling with this: I had to update the version of hadoop-client.
Additionally, the links below were quite helpful:
https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html
https://disqus.com/by/cfeduke/?utm_source=reply&utm_medium=email&utm_content=comment_author
http://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region
pom details below
<properties>
<spark.version>2.2.0</spark.version>
<hadoop.version>2.8.0</hadoop.version>
</properties>
<dependencies>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.11 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>${hadoop.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-aws</artifactId>
<version>${hadoop.version}</version>
</dependency>
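The configuration in the question reads the credentials with System.getenv, which returns null when a variable is not set. As a hedged sketch only, the lookup can be made explicit before anything is passed to the Hadoop configuration. `S3aSettings` and `fromEnv` are hypothetical names; the two `fs.s3a.*` keys are the ones used in the question.

```scala
// Hypothetical helper, not part of Spark or hadoop-aws.
object S3aSettings {
  // Environment variable name paired with the Hadoop key it feeds.
  private val required = Seq(
    "AWS_ACCESS_KEY_ID"     -> "fs.s3a.access.key",
    "AWS_SECRET_ACCESS_KEY" -> "fs.s3a.secret.key"
  )

  // Build the fs.s3a.* settings from an environment map.
  // Returns Left with the names of missing or empty variables
  // instead of silently passing null values on.
  def fromEnv(env: Map[String, String]): Either[Seq[String], Map[String, String]] = {
    val missing = required.collect {
      case (envVar, _) if !env.get(envVar).exists(_.nonEmpty) => envVar
    }
    if (missing.nonEmpty) Left(missing)
    else Right(required.map { case (envVar, key) => key -> env(envVar) }.toMap)
  }
}

// Possible use with the SparkContext `sc` from the question:
//   S3aSettings.fromEnv(sys.env) match {
//     case Right(settings) => settings.foreach { case (k, v) => sc.hadoopConfiguration.set(k, v) }
//     case Left(missing)   => sys.error(s"Missing environment variables: ${missing.mkString(", ")}")
//   }
```

This does not address the 400 response itself; it only rules out unset credentials as one possible cause before looking at library versions.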

Fuse Error deploying bundle IndexOutOfBoundsException

I am new to Fuse. I am trying to start a simple REST service.
I am using JBoss Fuse 6.3.
The bundle installs, but I cannot start it without an error.
After installing the bundle, it appears as "Active", but with the tag "Failure".
The log prints the following error:
12:29:17,046 | ERROR | Thread-53 | BlueprintContainerImpl | 23 - org.apache.aries.blueprint.core - 1.4.5 | Unable to start blueprint container for bundle null/0.0.0
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.rangeCheck(Unknown Source)[:1.8.0_151]
at java.util.ArrayList.get(Unknown Source)[:1.8.0_151]
at org.apache.aries.blueprint.container.BlueprintContainerImpl.readDirectives(BlueprintContainerImpl.java:214)
at org.apache.aries.blueprint.container.BlueprintContainerImpl.doRun(BlueprintContainerImpl.java:296)
at org.apache.aries.blueprint.container.BlueprintContainerImpl.run(BlueprintContainerImpl.java:270)
at org.apache.aries.blueprint.container.BlueprintExtender.createContainer(BlueprintExtender.java:294)
at org.apache.aries.blueprint.container.BlueprintExtender.createContainer(BlueprintExtender.java:263)
This is my code:
Pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.fusesource.example</groupId>
<artifactId>rest-service</artifactId>
<version>0.0.1-SNAPSHOT</version>
<packaging>jar</packaging>
<name>rest-service</name>
<url>http://maven.apache.org</url>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>3.8.1</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.servicemix.specs</groupId>
<artifactId>org.apache.servicemix.specs.jsr311-api-1.1.1</artifactId>
<version>1.9.0</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.servicemix</groupId>
<artifactId>servicemix-http</artifactId>
<version>2013.01</version>
</dependency>
<dependency>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
<version>1.2.16</version>
</dependency>
</dependencies>
<build>
<defaultGoal>install</defaultGoal>
<plugins>
<plugin>
<groupId>org.apache.felix</groupId>
<artifactId>maven-bundle-plugin</artifactId>
<version>2.3.6</version>
<extensions>true</extensions>
<configuration>
<instructions>
<Bundle-SymbolicName>${project.artifactId}</Bundle-SymbolicName>
<Import-Package>* </Import-Package>
</instructions>
</configuration>
</plugin>
</plugins>
</build>
</project>
Blueprint.xml:
<?xml version = "1.0" encoding = "UTF-8"?>
<blueprint xmlns = "http://www.osgi.org/xmlns/blueprint/v1.0.0"
xmlns:xsi = "http://www.w3.org/2001/XMLSchema-instance"
xmlns:jaxrs = "http://cxf.apache.org/blueprint/jaxrs"
xmlns:cxf="http://cxf.apache.org/blueprint/core"
xsi:schemaLocation = "http://www.osgi.org/xmlns/blueprint/v1.0.0 http://www.osgi.org/xmlns/blueprint/v1.0.0/blueprint.xsd
http://cxf.apache.org/blueprint/jaxrs http://cxf.apache.org/schemas/blueprint/jaxrs.xsd
http://cxf.apache.org/blueprint/core http://cxf.apache.org/schemas/blueprint/core.xsd">
<jaxrs:server id="servicios" address="/serviciosPrueba">
<jaxrs:serviceBeans>
<ref component-id="miServicio" />
</jaxrs:serviceBeans>
</jaxrs:server>
<bean id="miServicio" class="com.rest.Servicio" />
</blueprint>
Java:
package com.rest;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;
@Path("/servicioPrueba")
public class Servicio {
@GET
@Path("/getData")
@Produces(MediaType.APPLICATION_JSON)
public String getUser() {
String response = "This is standard response from REST";
return response;
}
}
Thank you very much.
In pom.xml, please change
<packaging>jar</packaging>
to
<packaging>bundle</packaging>
Otherwise the bundle will be missing the necessary OSGi headers.
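One way to confirm whether a built jar carries the OSGi headers is to read its manifest directly. The sketch below is an illustration only, using the standard java.util.jar API; `OsgiHeaderCheck` is a hypothetical name. The "bundle null/0.0.0" in the log above would be consistent with a jar that has no Bundle-SymbolicName header, though that is an inference rather than something the log states.

```scala
import java.io.File
import java.util.jar.JarFile

// Hypothetical helper, not part of Fuse, Karaf, or the maven-bundle-plugin.
object OsgiHeaderCheck {
  // Return the Bundle-SymbolicName header of a jar, if its manifest has one.
  def bundleSymbolicName(jar: File): Option[String] = {
    val jarFile = new JarFile(jar)
    try {
      for {
        manifest <- Option(jarFile.getManifest) // null when the jar has no manifest
        name     <- Option(manifest.getMainAttributes.getValue("Bundle-SymbolicName"))
      } yield name
    } finally jarFile.close()
  }

  def main(args: Array[String]): Unit =
    args.headOption match {
      case Some(path) =>
        println(bundleSymbolicName(new File(path)).getOrElse("no Bundle-SymbolicName header found"))
      case None =>
        println("usage: OsgiHeaderCheck <path-to-jar>")
    }
}
```

Pointing it at the jar under target/ before and after switching the packaging to bundle should show the header appearing.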
I resolved the second error by adding specific package versions to the Import-Package instruction in pom.xml:
<configuration>
<instructions>
<Bundle-SymbolicName>${project.artifactId}</Bundle-SymbolicName>
<Import-Package>
javax.ws.rs;version="[2.0,3)",
javax.ws.rs.core;version="[2.0,3)",
*
</Import-Package>
</instructions>
</configuration>

“value $ is not a member of StringContext” - Missing Scala plugin?

I'm using Maven with the Scala archetype, and I'm getting this error:
“value $ is not a member of StringContext”
I already tried to add several things in pom.xml, but nothing worked very well...
My code:
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}
// To see less warnings
import org.apache.log4j._
Logger.getLogger("org").setLevel(Level.ERROR)
// Start a simple Spark Session
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().getOrCreate()
// Prepare training and test data.
val data = spark.read.option("header","true").option("inferSchema","true").format("csv").load("USA_Housing.csv")
// Check out the Data
data.printSchema()
// See an example of what the data looks like
// by printing out a Row
val colnames = data.columns
val firstrow = data.head(1)(0)
println("\n")
println("Example Data Row")
for(ind <- Range(1,colnames.length)){
println(colnames(ind))
println(firstrow(ind))
println("\n")
}
////////////////////////////////////////////////////
//// Setting Up DataFrame for Machine Learning ////
//////////////////////////////////////////////////
// A few things we need to do before Spark can accept the data!
// It needs to be in the form of two columns
// ("label","features")
// This will allow us to join multiple feature columns
// into a single column of an array of feature values
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
// Rename Price to label column for naming convention.
// Grab only numerical columns from the data
val df = data.select(data("Price").as("label"),$"Avg Area Income",$"Avg Area House Age",$"Avg Area Number of Rooms",$"Area Population")
// An assembler converts the input values to a vector
// A vector is what the ML algorithm reads to train a model
// Set the input columns from which we are supposed to read the values
// Set the name of the column where the vector will be stored
val assembler = new VectorAssembler().setInputCols(Array("Avg Area Income","Avg Area House Age","Avg Area Number of Rooms","Area Population")).setOutputCol("features")
// Use the assembler to transform our DataFrame to the two columns
val output = assembler.transform(df).select($"label",$"features")
// Create a Linear Regression Model object
val lr = new LinearRegression()
// Fit the model to the data
// Note: Later we will see why we should split
// the data first, but for now we will fit to all the data.
val lrModel = lr.fit(output)
// Print the coefficients and intercept for linear regression
println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")
// Summarize the model over the training set and print out some metrics!
// Explore this in the spark-shell for more methods to call
val trainingSummary = lrModel.summary
println(s"numIterations: ${trainingSummary.totalIterations}")
println(s"objectiveHistory: ${trainingSummary.objectiveHistory.toList}")
trainingSummary.residuals.show()
println(s"RMSE: ${trainingSummary.rootMeanSquaredError}")
println(s"MSE: ${trainingSummary.meanSquaredError}")
println(s"r2: ${trainingSummary.r2}")
and this is my pom.xml:
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>test</groupId>
<artifactId>outrotest</artifactId>
<version>1.0-SNAPSHOT</version>
<name>${project.artifactId}</name>
<description>My wonderfull scala app</description>
<inceptionYear>2015</inceptionYear>
<licenses>
<license>
<name>My License</name>
<url>http://....</url>
<distribution>repo</distribution>
</license>
</licenses>
<properties>
<maven.compiler.source>1.6</maven.compiler.source>
<maven.compiler.target>1.6</maven.compiler.target>
<encoding>UTF-8</encoding>
<scala.version>2.11.5</scala.version>
<scala.compat.version>2.11</scala.compat.version>
</properties>
<dependencies>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib_2.11</artifactId>
<version>2.0.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.0.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.0.2</version>
</dependency>
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-csv_2.11</artifactId>
<version>1.5.0</version>
</dependency>
<!-- Test -->
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.11</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.specs2</groupId>
<artifactId>specs2-junit_${scala.compat.version}</artifactId>
<version>2.4.16</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.specs2</groupId>
<artifactId>specs2-core_${scala.compat.version}</artifactId>
<version>2.4.16</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.scalatest</groupId>
<artifactId>scalatest_${scala.compat.version}</artifactId>
<version>2.2.4</version>
<scope>test</scope>
</dependency>
</dependencies>
<build>
<sourceDirectory>src/main/scala</sourceDirectory>
<testSourceDirectory>src/test/scala</testSourceDirectory>
<plugins>
<plugin>
<!-- see http://davidb.github.com/scala-maven-plugin -->
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.2.0</version>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
<configuration>
<args>
<!--<arg>-make:transitive</arg>-->
<arg>-dependencyfile</arg>
<arg>${project.build.directory}/.scala_dependencies</arg>
</args>
</configuration>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-surefire-plugin</artifactId>
<version>2.18.1</version>
<configuration>
<useFile>false</useFile>
<disableXmlReport>true</disableXmlReport>
<!-- If you have classpath issue like NoDefClassError,... -->
<!-- useManifestOnlyJar>false</useManifestOnlyJar -->
<includes>
<include>**/*Test.*</include>
<include>**/*Suite.*</include>
</includes>
</configuration>
</plugin>
</plugins>
</build>
</project>
I have no idea how to fix it. Does anybody have any idea?
Add this import and it will work:
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._ // << add this
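The reason the import matters is that `$"..."` is ordinary string-interpolator syntax: the compiler looks for a method named `$` on StringContext, and that method only exists once an implicit class providing it is in scope. The sketch below reproduces the mechanism in plain Scala with stand-in types, so it runs without Spark. `InterpolatorSketch`, its `ColumnName`, and its `implicits` object are hypothetical simplifications, not Spark's real classes.

```scala
// Plain-Scala illustration; every name here is a stand-in, not Spark's API.
object InterpolatorSketch {
  // Stand-in for a column type.
  final case class ColumnName(name: String)

  // Stand-in for the `implicits` member of a session object.
  object implicits {
    // Adds a `$` method to StringContext, which is what lets $"..." compile.
    implicit class StringToColumn(val sc: StringContext) {
      def $(args: Any*): ColumnName = ColumnName(sc.s(args: _*))
    }
  }

  // Without this import, the definition below fails to compile with
  // "value $ is not a member of StringContext".
  import implicits._

  val income: ColumnName = $"Avg Area Income"
}
```

In real Spark code the import has to come from a stable session value (for example `import spark.implicits._` after `val spark = ...`), which is why it is placed right after the session is created.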
You can use the col function instead; just import it like this:
import org.apache.spark.sql.functions.col
And then change $"column" to col("column").
Hope it helps.
@Apurva's answer initially worked for me, in that the error vanished from IntelliJ.
But it then resulted in "Could not find implicit value for spark" during the sbt compile phase.
I found a strange workaround: importing implicits._ from the SparkSession referenced by the DataFrame instead of from the one obtained by getOrCreate:
import df.sparkSession.implicits._
where df is a DataFrame.
This could be because my code was placed inside a case class that received an implicit val spark: SparkSession parameter, but I'm not really sure why this fix worked for me.
I'm using Spark 1.6. The above answers are great but unfortunately don't work in 1.6.
The way I solved it was by using df.col("column-name"):
val df = df_mid
.withColumn("dt", date_format(df_mid.col("timestamp"), "yyyy-MM-dd"))
.filter("dt != 'null'")

java.lang.NoSuchMethodError: scala.reflect.api.JavaUniverse.runtimeMirror

java.lang.NoSuchMethodError: scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaMirrors$JavaMirror;
at org.elasticsearch.spark.serialization.ReflectionUtils$.org$elasticsearch$spark$serialization$ReflectionUtils$$checkCaseClass(ReflectionUtils.scala:42)
at org.elasticsearch.spark.serialization.ReflectionUtils$$anonfun$checkCaseClassCache$1.apply(ReflectionUtils.scala:84)
It seems to be a Scala version incompatibility, but according to the Spark documentation, Spark 2.1.0 and Scala 2.11.8 should be fine together.
Below is my pom.xml. This is just a test for Spark writing to Elasticsearch with es-hadoop, and I have no idea how to solve this exception.
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>cn.jhTian</groupId>
<artifactId>sparkLink</artifactId>
<version>0.0.1-SNAPSHOT</version>
<packaging>jar</packaging>
<name>${project.artifactId}</name>
<description>My wonderfull scala app</description>
<inceptionYear>2015</inceptionYear>
<licenses>
<license>
<name>My License</name>
<url>http://....</url>
<distribution>repo</distribution>
</license>
</licenses>
<properties>
<encoding>UTF-8</encoding>
<scala.version>2.11.8</scala.version>
<scala.compat.version>2.11</scala.compat.version>
</properties>
<repositories>
<repository>
<id>ainemo</id>
<name>xylink</name>
<url>http://10.170.209.180:8081/nexus/content/groups/public/</url>
</repository>
</repositories>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.1.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.6.4</version><!-- 2.64 -->
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<!--<dependency>-->
<!--<groupId>org.scala-lang</groupId>-->
<!--<artifactId>scala-compiler</artifactId>-->
<!--<version>${scala.version}</version>-->
<!--</dependency>-->
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-reflect</artifactId>
<version>${scala.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>2.6.4</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.1.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
<version>2.1.0</version>
</dependency>
<dependency>
<groupId>com.google.protobuf</groupId>
<artifactId>protobuf-java</artifactId>
<version>3.1.0</version>
</dependency>
<dependency>
<groupId>org.elasticsearch</groupId>
<artifactId>elasticsearch-hadoop</artifactId>
<version>5.3.0 </version>
</dependency>
<!-- Test -->
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.10</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.specs2</groupId>
<artifactId>specs2-core_${scala.compat.version}</artifactId>
<version>2.4.16</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.scalatest</groupId>
<artifactId>scalatest_${scala.compat.version}</artifactId>
<version>2.2.4</version>
<scope>test</scope>
</dependency>
</dependencies>
</project>
This is my code:
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._
/**
* Created by jhTian on 2017/4/19.
*/
object EsWrite {
def main(args: Array[String]) {
val sparkConf = new SparkConf()
.set("es.nodes", "1.1.1.1")
.set("es.port", "9200")
.set("es.index.auto.create", "true")
.setAppName("es-spark-demo")
val sc = new SparkContext(sparkConf)
val job1 = Job("C开发工程师","http://job.c.com","c公司","10000")
val job2 = Job("C++开发工程师","http://job.c++.com","c++公司","10000")
val job3 = Job("C#开发工程师","http://job.c#.com","c#公司","10000")
val job4 = Job("Java开发工程师","http://job.java.com","java公司","10000")
val job5 = Job("Scala开发工程师","http://job.scala.com","java公司","10000")
// val numbers = Map("one" -> 1, "two" -> 2, "three" -> 3)
// val airports = Map("arrival" -> "Otopeni", "SFO" -> "San Fran")
// val rdd=sc.makeRDD(Seq(numbers,airports))
val rdd=sc.makeRDD(Seq(job1,job2,job3,job4,job5))
rdd.saveToEs("job/info")
sc.stop()
}
}
case class Job(jobName:String, jobUrl:String, companyName:String, salary:String)
Generally, a NoSuchMethodError means the caller was compiled against a different version of a library than the one found on the classpath at runtime (or you have multiple versions on the classpath).
In your case, I'd guess that es-hadoop is built against a different version of Scala. I haven't used Maven in a little while, but the command you need to get some useful info is mvn dependency:tree. Use the output to see which version of Scala es-hadoop is built with, and then configure your project to use the same Scala version.
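If the dependency tree alone doesn't settle it, you can also ask the JVM which jar a class was actually loaded from. Below is a minimal sketch in plain Java; the class name ClasspathCheck and the two class names probed in main are only examples I picked for illustration (scala.Predef lives in scala-library, and org.elasticsearch.spark.rdd.EsSpark is my assumption for a class shipped by the ES Spark connector), so swap in whatever class your stack trace mentions.

```java
import java.security.CodeSource;

public class ClasspathCheck {

    // Returns the jar or directory a class was loaded from.
    // JDK classes loaded by the bootstrap loader have no code source.
    static String locationOf(Class<?> cls) {
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        return src == null ? "<bootstrap>" : String.valueOf(src.getLocation());
    }

    // Same lookup by name, without initializing the class.
    static String locationOf(String className) {
        try {
            ClassLoader loader = ClasspathCheck.class.getClassLoader();
            return locationOf(Class.forName(className, false, loader));
        } catch (ClassNotFoundException | LinkageError e) {
            return "<not on classpath>";
        }
    }

    public static void main(String[] args) {
        // Example class names only; pass your own as arguments to override.
        String[] names = args.length > 0
                ? args
                : new String[] {"scala.Predef", "org.elasticsearch.spark.rdd.EsSpark"};
        for (String name : names) {
            System.out.println(name + " -> " + locationOf(name));
        }
    }
}
```

Run it with the same classpath as the failing job. If the jar name printed for the Scala library carries a different Scala version than the suffix of your Spark artifacts, that is the mismatch to fix.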
To get stable/reproducible builds I'd recommend using something like the maven-enforcer-plugin:
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-enforcer-plugin</artifactId>
<version>1.4.1</version>
<executions>
<execution>
<id>enforce</id>
<configuration>
<rules>
<dependencyConvergence />
</rules>
</configuration>
<goals>
<goal>enforce</goal>
</goals>
</execution>
</executions>
</plugin>
It can be annoying initially, but once you have all your dependencies sorted out you shouldn't get issues like this anymore.
Use a dependency like this:
<dependency>
<groupId>org.elasticsearch</groupId>
<artifactId>elasticsearch-spark-20_2.11</artifactId>
<version>5.2.2</version>
</dependency>
This is for Spark 2.0 and Scala 2.11.
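The _2.11 at the end of the artifactId is the Scala binary version the artifact was built for, and every Scala-based dependency in the pom should carry the same suffix. As a rough sketch (the class name ScalaSuffixCheck and its helper are made up for illustration, and the artifact list is just copied from the pom in the question plus the one suggested here), a check could look like this:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ScalaSuffixCheck {

    // Matches a trailing Scala binary version such as "_2.11".
    private static final Pattern SUFFIX = Pattern.compile("_(\\d+\\.\\d+)$");

    // Collects the distinct Scala suffixes found in the given artifactIds.
    // Artifacts without a suffix (plain Java libraries) are ignored.
    static Set<String> scalaVersions(List<String> artifactIds) {
        Set<String> versions = new TreeSet<>();
        for (String id : artifactIds) {
            Matcher m = SUFFIX.matcher(id);
            if (m.find()) {
                versions.add(m.group(1));
            }
        }
        return versions;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList(
                "spark-streaming_2.11",
                "spark-streaming-kafka-0-8_2.11",
                "elasticsearch-spark-20_2.11",
                "protobuf-java");
        Set<String> versions = scalaVersions(ids);
        if (versions.size() <= 1) {
            System.out.println("Scala suffixes are consistent: " + versions);
        } else {
            System.out.println("Mixed Scala versions: " + versions);
        }
    }
}
```

This only looks at artifact names, so it won't catch a mismatched scala-library pulled in transitively; mvn dependency:tree is still the authoritative view.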

Compilation error using maven: "annotations are not supported in -source 1.3"

I created a simple RESTful web service project using Eclipse. When I try to compile it, I get the following error.
I'm not sure why annotations are not supported; I did the same thing using the NetBeans IDE,
and it worked fine without any issue.
[INFO] Compilation failure
C:\Users\gopc\workspace\RestfulService\src\restfu\Hello.java:[18,1] error: annotations are not supported in -source 1.3
[INFO] ------------------------------------------------------------------------
[DEBUG] Trace
org.apache.maven.BuildFailureException: Compilation failure
C:\Users\gopc\workspace\RestfulService\src\restfu\Hello.java:[18,1] error: annotations are not supported in -source 1.3
at org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeGoals(DefaultLifecycleExecutor.java:715)
at org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeGoalWithLifecycle(DefaultLifecycleExecutor.java:556)
at org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeGoal(DefaultLifecycleExecutor.java:535)
at org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeGoalAndHandleFailures(DefaultLifecycleExecutor.java:387)
at org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeTaskSegments(DefaultLifecycleExecutor.java:348)
at org.apache.maven.lifecycle.DefaultLifecycleExecutor.execute(DefaultLifecycleExecutor.java:180)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:328)
at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:138)
at org.apache.maven.cli.MavenCli.main(MavenCli.java:362)
at org.apache.maven.cli.compat.CompatibleMain.main(CompatibleMain.java:60)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.codehaus.classworlds.Launcher.launchEnhanced(Launcher.java:315)
at org.codehaus.classworlds.Launcher.launch(Launcher.java:255)
at org.codehaus.classworlds.Launcher.mainWithExitCode(Launcher.java:430)
at org.codehaus.classworlds.Launcher.main(Launcher.java:375)
Caused by: org.apache.maven.plugin.CompilationFailureException: Compilation failure
C:\Users\gopc\workspace\RestfulService\src\restfu\Hello.java:[18,1] error: annotations are not supported in -source 1.3
This is my pom file structure:
POM file
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>RestfulService</groupId>
<artifactId>RestfulService</artifactId>
<version>0.0.1-SNAPSHOT</version>
<build>
<sourceDirectory>${basedir}/src</sourceDirectory>
<outputDirectory>${basedir}/build/classes</outputDirectory>
<pluginManagement>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.0</version>
<configuration>
<source>1.6</source>
<target>1.6</target>
</configuration>
</plugin>
</plugins>
</pluginManagement>
</build>
<dependencyManagement>
<dependencies>
<dependency>
<groupId>jersey-server</groupId>
<artifactId>jersey-server</artifactId>
<version>1.4</version>
<scope>system</scope>
<systemPath>${basedir}/lib/jersey-server-1.4.jar</systemPath>
</dependency>
<dependency>
<groupId>javax-ws</groupId>
<artifactId>javax-ws</artifactId>
<version>1.4</version>
<scope>system</scope>
<systemPath>${basedir}/lib/javax.ws.rs.jar</systemPath>
</dependency>
</dependencies>
</dependencyManagement>
</project>
Hello.java
package restfu;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;
// Plain old Java object: it does not extend a class or implement
// an interface.
// The class registers its methods for the HTTP GET request using the @GET annotation.
// Using the @Produces annotation, it defines that it can deliver several MIME types:
// text, XML and HTML.
// The browser requests the HTML MIME type by default.
// Sets the path to base URL + /hello
@Path("/hello")
public class Hello {
// This method is called if TEXT_PLAIN is requested
@GET
@Produces(MediaType.TEXT_PLAIN)
public String sayPlainTextHello() {
return "Hello Jersey";
}
// This method is called if XML is requested
@GET
//@Produces(MediaType.TEXT_XML)
@Produces("application/xml")
public String sayXMLHello() {
return "<?xml version=\"1.0\"?>" + "<hello> Hello Jersey" + "</hello>";
}
// This method is called if HTML is requested
@GET
@Produces(MediaType.TEXT_HTML)
public String sayHtmlHello() {
return "<html> " + "<title>" + "Hello Jersey" + "</title>"
+ "<body><h1>" + "Hello Jersey" + "</h1></body>" + "</html> ";
}
// This method is called if JSON is requested
@GET
//@Produces("application/json")
@Produces(MediaType.APPLICATION_JSON)
public String sayJsonHello() {
return "{ \"HTML\": { \"title\": \"Hello Jersey\", \"body\": { \"h1\": \"Hello Jersey\" }}}";
}
}
Enhance your pom with the following:
<project>
[...]
<build>
[...]
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.0</version>
<configuration>
<source>1.5</source>
<target>1.5</target>
</configuration>
</plugin>
</plugins>
[...]
</build>
[...]
</project>
You need to set the source level to at least 1.5, since annotations were introduced in Java 5; 1.6 would be better.
I would suggest not changing the default locations of the source code. It is better to move your source code into Maven's default location, src/main/java, and remove this configuration from your pom:
<build>
<sourceDirectory>${basedir}/src</sourceDirectory>
<outputDirectory>${basedir}/target/classes</outputDirectory>
</build>
Add Java version 1.5 or above in pluginManagement (if possible, in the parent pom):
<pluginManagement>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.0</version>
<configuration>
<source>1.5</source>
<target>1.5</target>
</configuration>
</plugin>
</plugins>
</pluginManagement>
Thanks. I created the lib folder in the basedir and copied the library files into it, then added the dependencies as follows. That resolved the compilation problem,
and the build created the jar file and the class files.
<dependencies>
<dependency>
<groupId>jersey-server</groupId>
<artifactId>jersey-server</artifactId>
<version>1.4</version>
<scope>system</scope>
<systemPath>${basedir}/lib/jersey-server-1.4.jar</systemPath>
</dependency>
<dependency>
<groupId>javax-ws</groupId>
<artifactId>javax-ws</artifactId>
<version>1.4</version>
<scope>system</scope>
<systemPath>${basedir}/lib/javax.ws.rs.jar</systemPath>
</dependency>