Spark Scala read csv file using s3a - scala

I am trying to read a csv (native) file from an S3 bucket using Spark (Scala) running locally. I can read the file over the http protocol, but I want to use the s3a protocol.
Below is the configuration set up before the call:
val awsId = System.getenv("AWS_ACCESS_KEY_ID")
val awsKey = System.getenv("AWS_SECRET_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3a.access.key", awsId)
sc.hadoopConfiguration.set("fs.s3a.secret.key", awsKey)
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.set("fs.s3a.aws.credentials.provider","org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider");
sc.hadoopConfiguration.set("com.amazonaws.services.s3.enableV4", "true")
sc.hadoopConfiguration.set("fs.s3a.endpoint", "us-east-1.amazonaws.com")
sc.hadoopConfiguration.set("fs.s3a.impl.disable.cache", "true")
Read the file and print the first 5 rows from the rdd/dataframe
val fileAPath = Files.s3aPath(Files.input);
println(s"reading file s3 $fileAPath")
// s3a://bucket-name/dataSets/policyoutput.csv
val df = sc.textFile(fileAPath);
df.take(5).foreach(println);
I am getting the exception below:
Exception in thread "main" com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: FD92FDC175C64AA2, AWS Error Code: null, AWS Error Message: Bad Request, S3 Extended Request ID: IuloUEASgqnY4lrSMpbyJpwgFfCFbttxuxmJ9hGHMUgZTbO/UR/YyDgjix+3rBe0Y4MQHPzNvhA=
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:154)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:258)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:194)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1333)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.take(RDD.scala:1327)
Any help / direction for further investigation will be much appreciated.
Thanks

For anyone else struggling with this: I had to update the version of hadoop-client.
The links below were also quite helpful:
https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html
https://disqus.com/by/cfeduke/?utm_source=reply&utm_medium=email&utm_content=comment_author
http://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region
POM details below:
<properties>
<spark.version>2.2.0</spark.version>
<hadoop.version>2.8.0</hadoop.version>
</properties>
<dependencies>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.11 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>${hadoop.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-aws</artifactId>
<version>${hadoop.version}</version>
</dependency>
</dependencies>
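The configuration in the question passes the result of System.getenv straight into the Hadoop configuration, so an unset environment variable is easy to miss. Below is a minimal sketch in plain Java (no Spark or Hadoop on the classpath) that collects the same fs.s3a.* keys in one map and fails fast when a credential is missing. The class name S3aSettings, the helper requirePresent, and the endpoint value s3.amazonaws.com are assumptions for illustration and do not come from the original post.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class S3aSettings {

    // Builds the fs.s3a.* key/value pairs used in the question.
    static Map<String, String> build(String accessKey, String secretKey, String endpoint) {
        requirePresent("AWS_ACCESS_KEY_ID", accessKey);
        requirePresent("AWS_SECRET_ACCESS_KEY", secretKey);

        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("fs.s3a.access.key", accessKey);
        settings.put("fs.s3a.secret.key", secretKey);
        settings.put("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
        settings.put("fs.s3a.endpoint", endpoint);
        return settings;
    }

    // Hypothetical helper: rejects null or blank values with a readable message.
    private static void requirePresent(String name, String value) {
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalStateException(name + " is not set");
        }
    }

    public static void main(String[] args) {
        Map<String, String> settings = build(
                System.getenv("AWS_ACCESS_KEY_ID"),
                System.getenv("AWS_SECRET_ACCESS_KEY"),
                "s3.amazonaws.com"); // assumed endpoint, adjust for your region
        // Print only the keys so the secret never reaches the console.
        settings.keySet().forEach(System.out::println);
    }
}
```

In the Scala driver, each entry could then be applied with sc.hadoopConfiguration.set(key, value), as in the question.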

Related

trouble with google text-to-speech and mysql-connector-java 8.0.19

I'm a newbie using Google Text-to-Speech. The API works fine with Java 1.8, but when I add the MySQL Connector driver to my pom.xml file I get a warning and an error just from executing the QuickStart demo. Here is my Eclipse console output:
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by com.google.protobuf.UnsafeUtil (file:/C:/Users/LOGISPORT/.m2/repository/com/google/protobuf/protobuf-java/3.6.1/protobuf-java-3.6.1.jar) to field java.nio.Buffer.address
WARNING: Please consider reporting this to the maintainers of com.google.protobuf.UnsafeUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Exception in thread "grpc-default-executor-0" java.lang.NoSuchMethodError: 'boolean com.google.protobuf.GeneratedMessageV3.isStringEmpty(java.lang.Object)'
at com.google.cloud.texttospeech.v1.VoiceSelectionParams.getSerializedSize(VoiceSelectionParams.java:328)
at com.google.protobuf.CodedOutputStream.computeMessageSizeNoTag(CodedOutputStream.java:916)
at com.google.protobuf.CodedOutputStream.computeMessageSize(CodedOutputStream.java:668)
at com.google.cloud.texttospeech.v1.SynthesizeSpeechRequest.getSerializedSize(SynthesizeSpeechRequest.java:352)
at io.grpc.protobuf.lite.ProtoInputStream.available(ProtoInputStream.java:108)
at io.grpc.internal.MessageFramer.getKnownLength(MessageFramer.java:205)
at io.grpc.internal.MessageFramer.writePayload(MessageFramer.java:137)
at io.grpc.internal.AbstractStream.writeMessage(AbstractStream.java:65)
at io.grpc.internal.ForwardingClientStream.writeMessage(ForwardingClientStream.java:37)
at io.grpc.internal.DelayedStream$6.run(DelayedStream.java:283)
at io.grpc.internal.DelayedStream.drainPendingCalls(DelayedStream.java:182)
at io.grpc.internal.DelayedStream.access$100(DelayedStream.java:44)
at io.grpc.internal.DelayedStream$4.run(DelayedStream.java:148)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
at java.base/java.lang.Thread.run(Thread.java:832)
Is there an incompatibility between the MySQL connector and Google Text-to-Speech?
Here is a brief view of my code:
public static void main(String... args ) throws Exception {
String jsonPath = "accueil-mamoudzou.json";
ConnectBase();
String num = String.valueOf(Integer.valueOf("00013"));
CredentialsProvider credentialsProvider = FixedCredentialsProvider.create(ServiceAccountCredentials.fromStream(new FileInputStream(jsonPath)));
TextToSpeechSettings settings = TextToSpeechSettings.newBuilder().setCredentialsProvider( credentialsProvider).build();
System.out.println("Settings créer ... Lancement de la traduction.");
try (TextToSpeechClient textToSpeechClient = TextToSpeechClient.create(settings)){
SynthesisInput input = SynthesisInput.newBuilder().setText("Le numéro "+num+" est demandé à la porte 45. Merci").build();
VoiceSelectionParams voice =
VoiceSelectionParams.newBuilder()
.setName("fr-FR-Wavenet-E")
.setLanguageCode("fr-FR")
.setSsmlGender(SsmlVoiceGender.FEMALE)
.build();
AudioConfig audioConfig =
AudioConfig.newBuilder().setAudioEncoding(AudioEncoding.LINEAR16).build();
SynthesizeSpeechResponse response =
textToSpeechClient.synthesizeSpeech(input, voice, audioConfig);
ByteString audioContents = response.getAudioContent();
// Write the response to the output file.
try (OutputStream out = new FileOutputStream("output.wav")) {
out.write(audioContents.toByteArray());
System.out.println("Audio content written to file output.wav");
out.close();
}
playSound();
}
}
and my pom.xml
<dependencies>
<!-- https://mvnrepository.com/artifact/mysql/mysql-connector-java -->
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>8.0.19</version>
</dependency>
<dependency>
<groupId>com.google.cloud</groupId>
<artifactId>google-cloud-texttospeech</artifactId>
<version>2.1.5</version>
</dependency>
<!-- https://mvnrepository.com/artifact/com.google.auth/google-auth-library-appengine -->
<dependency>
<groupId>com.google.auth</groupId>
<artifactId>google-auth-library-appengine</artifactId>
<version>1.6.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/com.google.cloud/google-cloud-storage -->
<dependency>
<groupId>com.google.cloud</groupId>
<artifactId>google-cloud-storage</artifactId>
<version>2.4.5</version>
</dependency>
</dependencies>
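Try adding the Google Cloud libraries BOM to the <dependencyManagement> section of your pom.xml (Maven only honors the import scope there):
<dependencyManagement>
<dependencies>
<dependency>
<groupId>com.google.cloud</groupId>
<artifactId>libraries-bom</artifactId>
<version>25.0.0</version>
<type>pom</type>
<scope>import</scope>
</dependency>
</dependencies>
</dependencyManagement>
The BOM is documented as addressing errors such as NoSuchMethodError, and it manages the protobuf version directly (among other things).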
Source: https://cloud.google.com/java/docs/bom
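The console output in the question shows protobuf-java-3.6.1.jar being loaded while the generated Text-to-Speech class calls a method that this version does not have, which suggests two dependencies are asking for different protobuf versions. One way to confirm which jar actually wins at runtime is to ask the JVM where a class was loaded from. The following is a minimal sketch; the class name ClassOrigin and the method locate are hypothetical names chosen for illustration.

```java
import java.security.CodeSource;

class ClassOrigin {

    // Returns the jar or directory a class was loaded from,
    // or a short note when that cannot be determined.
    static String locate(String className) {
        try {
            // 'false' avoids running static initializers of the inspected class.
            Class<?> type = Class.forName(className, false, ClassOrigin.class.getClassLoader());
            CodeSource source = type.getProtectionDomain().getCodeSource();
            if (source == null || source.getLocation() == null) {
                return "no code source (JDK class)";
            }
            return source.getLocation().toString();
        } catch (ClassNotFoundException | LinkageError e) {
            return "not on classpath";
        }
    }

    public static void main(String[] args) {
        // Class name taken from the stack trace in the question.
        System.out.println(locate("com.google.protobuf.GeneratedMessageV3"));
    }
}
```

Run with the project's classpath, this prints the path of the protobuf jar in use. Running mvn dependency:tree -Dincludes=com.google.protobuf should additionally show which dependency pulls in each version.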

Scala on eclipse : reading csv as dataframe throw a java.lang.ArrayIndexOutOfBoundsException

Reading a simple CSV file and loading it into a DataFrame throws a java.lang.ArrayIndexOutOfBoundsException.
As I am new to Scala I may have missed something trivial; however, a thorough search on both Google and Stack Overflow turned up nothing.
The code is the following:
import org.apache.spark.sql.SparkSession
object TransformInitial {
def main(args: Array[String]): Unit = {
val session = SparkSession.builder.master("local").appName("test").getOrCreate()
val df = session.read.format("csv").option("header", "true").option("inferSchema", "true").option("delimiter",",").load("data_sets/small_test.csv")
df.show()
}
}
small_test.csv is as simple as possible:
v1,v2,v3
0,1,2
3,4,5
Here is the actual pom of this Maven project:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>Scala_tests</groupId>
<artifactId>Scala_tests</artifactId>
<version>0.0.1-SNAPSHOT</version>
<build>
<sourceDirectory>src</sourceDirectory>
<resources>
<resource>
<directory>src</directory>
<excludes>
<exclude>**/*.java</exclude>
</excludes>
</resource>
</resources>
<plugins>
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.8.0</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
</configuration>
</plugin>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
</plugins>
</build>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.12</artifactId>
<version>2.4.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.12</artifactId>
<version>2.4.0</version>
</dependency>
</dependencies>
</project>
Executing the code throws the following:
java.lang.ArrayIndexOutOfBoundsException:
18/11/09 12:03:31 INFO FileSourceStrategy: Pruning directories with:
18/11/09 12:03:31 INFO FileSourceStrategy: Post-Scan Filters: (length(trim(value#0, None)) > 0)
18/11/09 12:03:31 INFO FileSourceStrategy: Output Data Schema: struct<value: string>
18/11/09 12:03:31 INFO FileSourceScanExec: Pushed Filters:
18/11/09 12:03:31 INFO CodeGenerator: Code generated in 413.859722 ms
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 10582
at com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.accept(BytecodeReadingParanamer.java:563)
at com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.access$200(BytecodeReadingParanamer.java:338)
at com.thoughtworks.paranamer.BytecodeReadingParanamer.lookupParameterNames(BytecodeReadingParanamer.java:103)
at com.thoughtworks.paranamer.CachingParanamer.lookupParameterNames(CachingParanamer.java:90)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.getCtorParams(BeanIntrospector.scala:44)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1(BeanIntrospector.scala:58)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1$adapted(BeanIntrospector.scala:58)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:241)
at scala.collection.Iterator.foreach(Iterator.scala:929)
at scala.collection.Iterator.foreach$(Iterator.scala:929)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1417)
at scala.collection.IterableLike.foreach(IterableLike.scala:71)
at scala.collection.IterableLike.foreach$(IterableLike.scala:70)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:241)
at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:238)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.findConstructorParam$1(BeanIntrospector.scala:58)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$19(BeanIntrospector.scala:176)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:32)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:29)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:191)
at scala.collection.TraversableLike.map(TraversableLike.scala:234)
at scala.collection.TraversableLike.map$(TraversableLike.scala:227)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:191)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$14(BeanIntrospector.scala:170)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$14$adapted(BeanIntrospector.scala:169)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:389)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:241)
at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:238)
at scala.collection.immutable.List.flatMap(List.scala:352)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.apply(BeanIntrospector.scala:169)
at com.fasterxml.jackson.module.scala.introspect.ScalaAnnotationIntrospector$._descriptorFor(ScalaAnnotationIntrospectorModule.scala:22)
at com.fasterxml.jackson.module.scala.introspect.ScalaAnnotationIntrospector$.fieldName(ScalaAnnotationIntrospectorModule.scala:30)
at com.fasterxml.jackson.module.scala.introspect.ScalaAnnotationIntrospector$.findImplicitPropertyName(ScalaAnnotationIntrospectorModule.scala:78)
at com.fasterxml.jackson.databind.introspect.AnnotationIntrospectorPair.findImplicitPropertyName(AnnotationIntrospectorPair.java:467)
at com.fasterxml.jackson.databind.introspect.POJOPropertiesCollector._addFields(POJOPropertiesCollector.java:351)
at com.fasterxml.jackson.databind.introspect.POJOPropertiesCollector.collectAll(POJOPropertiesCollector.java:283)
at com.fasterxml.jackson.databind.introspect.POJOPropertiesCollector.getJsonValueMethod(POJOPropertiesCollector.java:169)
at com.fasterxml.jackson.databind.introspect.BasicBeanDescription.findJsonValueMethod(BasicBeanDescription.java:223)
at com.fasterxml.jackson.databind.ser.BasicSerializerFactory.findSerializerByAnnotations(BasicSerializerFactory.java:348)
at com.fasterxml.jackson.databind.ser.BeanSerializerFactory._createSerializer2(BeanSerializerFactory.java:210)
at com.fasterxml.jackson.databind.ser.BeanSerializerFactory.createSerializer(BeanSerializerFactory.java:153)
at com.fasterxml.jackson.databind.SerializerProvider._createUntypedSerializer(SerializerProvider.java:1203)
at com.fasterxml.jackson.databind.SerializerProvider._createAndCacheUntypedSerializer(SerializerProvider.java:1157)
at com.fasterxml.jackson.databind.SerializerProvider.findValueSerializer(SerializerProvider.java:481)
at com.fasterxml.jackson.databind.SerializerProvider.findTypedValueSerializer(SerializerProvider.java:679)
at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:107)
at com.fasterxml.jackson.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:3559)
at com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:2927)
at org.apache.spark.rdd.RDDOperationScope.toJson(RDDOperationScope.scala:52)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:142)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:339)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3384)
at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2545)
at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:3365)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3365)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2545)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2759)
at org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.infer(CSVDataSource.scala:232)
at org.apache.spark.sql.execution.datasources.csv.CSVDataSource.inferSchema(CSVDataSource.scala:68)
at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:63)
at org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:183)
at scala.Option.orElse(Option.scala:289)
at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:180)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at TransformInitial$.main(TransformInitial.scala:9)
at TransformInitial.main(TransformInitial.scala)
For the record, the Eclipse version is 2018-09 (4.9.0).
I've hunted for special characters in the CSV with cat -A. It yielded nothing.
I'm out of options; something trivial must be missing, but I can't put a finger on it.
I'm not sure exactly what is causing your error, since the code works for me. It could be related to the version of the Scala compiler that you are using, since there's no information about that in your Maven file.
I have posted my complete solution, using SBT, to GitHub. To execute the code, you'll need to install SBT, cd to the checked-out source's root folder, then run the following command:
$ sbt run
BTW, I changed your code to take advantage of more idiomatic Scala conventions, and also used the csv function to load your file. The new Scala code looks like this:
import org.apache.spark.sql.SparkSession
// Extending App is more idiomatic than writing a "main" function.
object TransformInitial
extends App {
val session = SparkSession.builder.master("local").appName("test").getOrCreate()
// As of Spark 2.0, it's easier to read CSV files.
val df = session.read.option("header", "true").option("inferSchema", "true").csv("data_sets/small_test.csv")
df.show()
// Shutdown gracefully.
session.stop()
}
Note that I also removed the redundant delimiter option.
Downgrading the Scala version to 2.11 fixed it for me.
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.4.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.4.0</version>
</dependency>

Kafka-spark Streaming processing jobs synchronically

I'm trying a simple test in which I use Kafka Connect and Spark.
I wrote a custom Kafka Connect connector that creates this source record:
SourceRecord sr = new SourceRecord(null,
null,
destTopic,
Schema.STRING_SCHEMA,
cleanPath);
In Spark I receive the message like this:
val kafkaConsumerParams = Map[String, String](
"metadata.broker.list" -> prop.getProperty("kafka_host"),
"zookeeper.connect" -> prop.getProperty("zookeeper_host"),
"group.id" -> prop.getProperty("kafka_group_id"),
"schema.registry.url" -> prop.getProperty("schema_registry_url"),
"auto.offset.reset" -> prop.getProperty("auto_offset_reset")
)
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaConsumerParams, topicsSet)
val ds = messages.foreachRDD(rdd => {
val toPrint = rdd.map(t => {
val file_path = t._2
val startTime = DateTime.now()
Thread.sleep(1000 * 60)
1
}).sum()
LogUtils.getLogger(classOf[DeviceManager]).info(" toPrint = " + toPrint +" (number of flows calculated)")
})
}
When I use the connector to send multiple messages to the desired topic (in my test it has 6 partitions), the sleeping task receives all the messages but processes them synchronously (one after another) instead of asynchronously.
When I create a simple test producer, the sleeps run asynchronously.
I also created 2 simple consumers and tried both the connector and a producer; in both cases the tasks were consumed asynchronously, which means my problem lies in the way Spark receives the messages sent from the connector.
I can't figure out why the tasks do not behave the same way as they do when I send the messages from a producer.
I even printed the records Spark receives, and they are exactly the same.
Record sent by the producer:
1: {partition=2, offset=11, value=something, key=null}
2: {partition=5, offset=9, value=something2, key=null}
Record sent by the connector:
1: {partition=3, offset=9, value=something, key=null}
The versions used in my project are:
<scala.version>2.11.7</scala.version>
<confluent.version>4.0.0</confluent.version>
<kafka.version>1.0.0</kafka.version>
<java.version>1.8</java.version>
<spark.version>2.0.0</spark.version>
dependencies
<dependency>
<groupId>io.confluent</groupId>
<artifactId>kafka-avro-serializer</artifactId>
<version>${confluent.version}</version>
</dependency>
<dependency>
<groupId>io.confluent</groupId>
<artifactId>kafka-schema-registry-client</artifactId>
<version>${confluent.version}</version>
</dependency>
<dependency>
<groupId>org.apache.avro</groupId>
<artifactId>avro</artifactId>
<version>1.8.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka_2.11</artifactId>
<version>1.6.3</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-graphx_2.11</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>com.datastax.spark</groupId>
<artifactId>spark-cassandra-connector_2.11</artifactId>
<version>2.0.0-RC1</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.8.0</version>
</dependency>
<dependency>
<groupId>io.confluent</groupId>
<artifactId>kafka-avro-serializer</artifactId>
<version>${confluent.version}</version>
<scope>${global.scope}</scope>
</dependency>
<dependency>
<groupId>io.confluent</groupId>
<artifactId>kafka-connect-avro-converter</artifactId>
<version>${confluent.version}</version>
<scope>${global.scope}</scope>
</dependency>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>connect-api</artifactId>
<version>${kafka.version}</version>
</dependency>
We cannot run Spark-Kafka streaming jobs asynchronously, but we can run them in parallel, as Kafka consumers do. To do that, set the following configuration on the SparkConf:
sparkConf.set("spark.streaming.concurrentJobs","4")
Its default value is "1", but it can be overridden with a higher value.
I hope this helps!
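To see why a value of "1" makes batches run one after another, it can help to look at a plain thread pool, which is only an analogy here and not Spark's actual scheduler code. The sketch below runs the same sleeping jobs on a pool of one thread and on a pool of four, and reports the highest number of jobs that were running at the same time. The class name PoolSizeDemo and the method peakConcurrency are hypothetical names for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

class PoolSizeDemo {

    // Runs 'jobs' sleeping tasks on a pool of 'poolSize' threads and returns
    // the highest number of tasks that were running at the same time.
    static int peakConcurrency(int poolSize, int jobs) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        List<Future<?>> futures = new ArrayList<>();
        try {
            for (int i = 0; i < jobs; i++) {
                futures.add(pool.submit(() -> {
                    int now = running.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max);
                    try {
                        Thread.sleep(100); // stands in for the work done per job
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        running.decrementAndGet();
                    }
                }));
            }
            for (Future<?> future : futures) {
                future.get(); // wait for every job to finish
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        System.out.println("pool of 1: peak = " + peakConcurrency(1, 6));
        System.out.println("pool of 4: peak = " + peakConcurrency(4, 6));
    }
}
```

With a single thread the peak can never exceed one, which mirrors the sequential behavior described in the question; a larger pool allows several jobs to overlap, up to the pool size.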

java.lang.ClassNotFoundException: io.jsonwebtoken.Jwts when using JJWT JSON Web Token

When I am trying to use JJWT from Stormpath, it is throwing a run time Exception java.lang.ClassNotFoundException: io.jsonwebtoken.Jwts. I am using Jersey2 embedded on GlassFish 4.1; here is the code that is throwing the exception:
private String issueToken(String login) {
Key key = keyGenerator.generateKey();
//Key key = MacProvider.generateKey();
String jwtToken = Jwts.builder()
.setIssuer(uriInfo.getAbsolutePath().toString())
//.setIssuer("http://trustyapp.com/")
.setSubject(login)
.setIssuedAt(new Date())
.setExpiration(toDate(LocalDateTime.now().plusMinutes(15L)))
.signWith(SignatureAlgorithm.HS512, key)
.compact();
logger.info("#### generating token for a key : " + jwtToken + " - " + key);
return jwtToken;
}
I have imported io.jsonwebtoken.Jwts and my pom.xml has :
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-annotations</artifactId>
<version>2.8.2</version>
<scope>compile</scope>
</dependency>
I also tried it without the above dependency, in case the dependency below, which is in my pom.xml, is enough:
<dependency>
<groupId>io.jsonwebtoken</groupId>
<artifactId>jjwt</artifactId>
<version>0.7.0</version>
<scope>compile</scope>
</dependency>
I tried the recommendations from this and this, but they did not work. Please help.
The problem is solved after adding the following dependencies into my pom.xml:
<dependency>
<groupId>org.glassfish.jersey.core</groupId>
<artifactId>jersey-common</artifactId>
<version>${version.jersey}</version>
</dependency>
<dependency>
<groupId>org.glassfish.jersey.containers</groupId>
<artifactId>jersey-container-jdk-http</artifactId>
<version>${version.jersey}</version>
</dependency>
<dependency>
<groupId>org.glassfish.jersey.core</groupId>
<artifactId>jersey-client</artifactId>
<version>${version.jersey}</version>
</dependency>
<dependency>
<groupId>org.glassfish.jersey.core</groupId>
<artifactId>jersey-server</artifactId>
<version>${version.jersey}</version>
</dependency>
<dependency>
<groupId>org.glassfish.jersey.containers</groupId>
<artifactId>jersey-container-servlet</artifactId>
<version>${version.jersey}</version>
</dependency>
I had assumed that these dependencies were not required, since I am using the Jersey 2 that is embedded in the GlassFish 4.1.1 server.

Scala to connect in HIVE via JDBC - HDP

I'm trying to connect to Hive (in the Hortonworks sandbox) and I'm receiving the message below:
Exception in thread "main" java.sql.SQLException: No suitable driver found for jdbc:hive2://sandbox.hortonworks.com:10000/default
Maven dependencies:
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.10</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.10</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
</dependencies>
Code:
// **** SetMaster is Local only to test *****
// Set context
val sparkConf = new SparkConf().setAppName("process").setMaster("local")
val sc = new SparkContext(sparkConf)
val hiveContext = new HiveContext(sc)
// Set HDFS
System.setProperty("HADOOP_USER_NAME", "hdfs")
val hdfsconf = SparkHadoopUtil.get.newConfiguration(sc.getConf)
hdfsconf.set("fs.defaultFS", "hdfs://sandbox.hortonworks.com:8020")
val hdfs = FileSystem.get(hdfsconf)
// Set Hive Connector
val url = "jdbc:hive2://sandbox.hortonworks.com:10000/default"
val user = "username"
val password = "password"
hiveContext.read.format("jdbc").options(Map("url" -> url,
"user" -> user,
"password" -> password,
"dbtable" -> "tablename")).load()
You need to have the Hive JDBC driver on your application classpath:
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-jdbc</artifactId>
<version>1.2.1</version>
<scope>provided</scope>
</dependency>
Also, specify driver explicitly in options:
"driver" -> "org.apache.hive.jdbc.HiveDriver"
However, it's better to skip JDBC and use the native Spark integration with Hive, since it makes it possible to use the Hive metastore. See http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
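The "No suitable driver found" message is raised by java.sql.DriverManager when none of the registered drivers accepts the URL. A quick way to check this before involving Spark is to ask each registered driver directly. The following is a minimal sketch using only the JDK; the class name JdbcDriverCheck and the method hasDriverFor are hypothetical names for illustration.

```java
import java.sql.Driver;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.Collections;

class JdbcDriverCheck {

    // True if any driver registered with DriverManager claims the given URL.
    static boolean hasDriverFor(String url) {
        for (Driver driver : Collections.list(DriverManager.getDrivers())) {
            try {
                if (driver.acceptsURL(url)) {
                    return true;
                }
            } catch (SQLException e) {
                // A driver that cannot parse the URL simply does not match.
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // URL taken from the question.
        String url = "jdbc:hive2://sandbox.hortonworks.com:10000/default";
        System.out.println(url + " -> driver found: " + hasDriverFor(url));
    }
}
```

If this reports false when run with the application's runtime classpath, the hive-jdbc jar is most likely not on that classpath, which is worth checking when the dependency is declared with provided scope.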