FileTailSource throws null pointer exception - scala

I wrote the following Alpakka code
val fs = FileSystems.getDefault
val resource = getClass.getResource("countrycapital.csv")
val source = FileTailSource.lines(Paths.get(resource.toURI), maxLineSize = 8092, pollingInterval = 10000 seconds)
I have a file called "countrycapital.csv" in the src/main/resources folder.
My objective is to build a Akka Streams Source for this file and process all the records of this file.
when I run this code. the list line throws an exception
[error] (run-main-0) java.lang.NullPointerException
java.lang.NullPointerException
at com.abhi.ElasticAkkaStreams$.delayedEndpoint$com$abhi$ElasticAkkaStreams$1(ElasticAkkaStreams.scala:37)
at com.abhi.ElasticAkkaStreams$delayedInit$body.apply(ElasticAkkaStreams.scala:28)
at scala.Function0.apply$mcV$sp(Function0.scala:34)
at scala.Function0.apply$mcV$sp$(Function0.scala:34)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App.$anonfun$main$1$adapted(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:389)
I have imported the following libraries in my build.sbt
"com.lightbend.akka" %% "akka-stream-alpakka-elasticsearch" % "0.13",
"com.lightbend.akka" %% "akka-stream-alpakka-csv" % "0.13",
"com.lightbend.akka" %% "akka-stream-alpakka-file" % "0.13"

OK. I solved this myself
I didn't have a / when picking the file path.
This works without the null pointer exception
val resource = getClass.getResource("/countrycapital.csv")

Related

Spark with Scala read json and insert the data in MongoDB

I tray to read json file and insert the data in mongodb and this my code
Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
val spark = SparkSession
.builder()
.master("local[*]")
.appName("SparkByExamples.com")
.config("spark.mongodb.input.uri","mongodb://doss:doss123#localhost:27017/spark?authSource=test")
.config("spark.mongodb.output.uri","mongodb://doss:doss123#localhost:27017/spark?authSource=test")
.getOrCreate()
val dataset = spark.read
.option("multiline",true)
.json("src/main/scala/tes.json")
dataset.write.format("com.mongodb.spark.sql.DefaultSource")
.option("database","jsonDB")
.option("collection","jsonData")
.mode("append").save()
but I get a error:
Exception in thread "main" java.lang.ClassNotFoundException: Failed
to find data source: com.mongodb.spark.sql.DefaultSource. Please find
packages at https://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.errors.QueryExecutionErrors$.failedToFindDataSourceError(QueryExecutionErrors.scala:587)
at
org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:675)
at
org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:725)
at
org.apache.spark.sql.DataFrameWriter.lookupV2Provider(DataFrameWriter.scala:864)
at
org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:256)
at
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:247)
at Main$.main(Main.scala:33) at Main.main(Main.scala) Caused by:
java.lang.ClassNotFoundException:
com.mongodb.spark.sql.DefaultSource.DefaultSource at
java.net.URLClassLoader.findClass(URLClassLoader.java:387) at
java.lang.ClassLoader.loadClass(ClassLoader.java:418) at
sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:355) at
java.lang.ClassLoader.loadClass(ClassLoader.java:351) at
org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$5(DataSource.scala:661)
at scala.util.Try$.apply(Try.scala:213) at
org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$4(DataSource.scala:661)
at scala.util.Failure.orElse(Try.scala:224) at
org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:661)
... 6 more
i dont know what the problem and the build.sbt file:
// https://mvnrepository.com/artifact/org.apache.spark/spark-sql
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.3.1"
// https://mvnrepository.com/artifact/org.mongodb.spark/mongo-spark-connector
libraryDependencies += "org.mongodb.spark" % "mongo-spark-connector" % "10.0.5"
pleas help me

Can we create a table using spark.sql api from the IDE

I'm on IntelliJ and my spark session looks like this -
val spark = SparkSession.builder()
.appName("Spark SQL")
.config("spark.master", "local")
.config("spark.sql.warehouse.dir", "src/main/resources/warehouse") //to create a user-defined warehouse for storing tables
.config("spark.network.timeout" , "10000000s")//to avoid Heartbeat exception
.getOrCreate()
While I can create a database using
spark.sql("create database newdb")
This creates a directory under src/main/resources/warehouse
However when I attempt to create a table using the same manner
spark.sql("create table testing(id int, name string)"), it fails
Exception in thread "main" org.apache.spark.sql.AnalysisException: Hive support is required to CREATE Hive TABLE (AS SELECT);;
'CreateTable `testing`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, ErrorIfExists
Thereafter, I did add enableHiveSupport() while creating spark session, but it also leads me to this exception
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
Can anyone help me with this?
EDIT
This is my sbt
name := "spark-essentials"
version := "0.1"
scalaVersion := "2.12.10"
val sparkVersion = "3.0.0-preview"
val vegasVersion = "0.3.11"
val postgresVersion = "42.2.2"
resolvers ++= Seq(
"bintray-spark-packages" at "https://dl.bintray.com/spark-packages/maven",
"Typesafe Simple Repository" at "https://repo.typesafe.com/typesafe/simple/maven-releases",
"MavenRepository" at "https://mvnrepository.com"
)
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion,
"org.apache.spark" %% "spark-sql" % sparkVersion,
"org.apache.spark" %% "spark-hive" % sparkVersion,
// logging
"org.apache.logging.log4j" % "log4j-api" % "2.4.1",
"org.apache.logging.log4j" % "log4j-core" % "2.4.1",
// postgres for DB connectivity
"org.postgresql" % "postgresql" % postgresVersion
)
EDIT:
below is the stack trace
20:41:08 WARN ObjectStore:568 - Failed to get database default, returning NoSuchObjectException
20:41:08 WARN Hive:168 - Failed to access metastore. This class should not accessed in runtime.
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1236)
at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
at org.apache.hadoop.hive.ql.metadata.Hive.<clinit>(Hive.java:166)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:203)
at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:127)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:300)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:421)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:314)
at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:68)
at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:67)
at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$databaseExists$1(HiveExternalCatalog.scala:221)
at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:221)
at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:147)
at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:137)
at org.apache.spark.sql.internal.SharedState.globalTempViewManager$lzycompute(SharedState.scala:170)
at org.apache.spark.sql.internal.SharedState.globalTempViewManager(SharedState.scala:165)
at org.apache.spark.sql.hive.HiveSessionStateBuilder.$anonfun$catalog$2(HiveSessionStateBuilder.scala:56)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager$lzycompute(SessionCatalog.scala:92)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager(SessionCatalog.scala:92)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupRelation(SessionCatalog.scala:741)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveRelations$$lookupTableFromCatalog(Analyzer.scala:781)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.resolveRelation(Analyzer.scala:725)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$6.applyOrElse(Analyzer.scala:765)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$6.applyOrElse(Analyzer.scala:757)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsUp$3(AnalysisHelper.scala:90)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsUp$1(AnalysisHelper.scala:90)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsUp(AnalysisHelper.scala:86)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsUp$(AnalysisHelper.scala:84)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUp(LogicalPlan.scala:29)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsUp$2(AnalysisHelper.scala:87)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:376)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:214)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:374)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsUp$1(AnalysisHelper.scala:87)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsUp(AnalysisHelper.scala:86)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsUp$(AnalysisHelper.scala:84)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUp(LogicalPlan.scala:29)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsUp$2(AnalysisHelper.scala:87)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:376)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:214)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:374)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsUp$1(AnalysisHelper.scala:87)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsUp(AnalysisHelper.scala:86)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsUp$(AnalysisHelper.scala:84)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUp(LogicalPlan.scala:29)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:757)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:694)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:130)
at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
at scala.collection.immutable.List.foldLeft(List.scala:89)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:127)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:119)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:119)
at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:168)
at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:162)
at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:122)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:98)
at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:98)
at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:146)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:145)
at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:66)
at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:63)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:63)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:55)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:95)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:607)
at part4sql.SparkSql$.delayedEndpoint$part4sql$SparkSql$1(SparkSql.scala:29)
at part4sql.SparkSql$delayedInit$body.apply(SparkSql.scala:7)
at scala.Function0.apply$mcV$sp(Function0.scala:39)
at scala.Function0.apply$mcV$sp$(Function0.scala:39)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)
at scala.App.$anonfun$main$1$adapted(App.scala:80)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.App.main(App.scala:80)
at scala.App.main$(App.scala:78)
at part4sql.SparkSql$.main(SparkSql.scala:7)
at part4sql.SparkSql.main(SparkSql.scala)
Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1523)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:86)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024)
at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1234)
... 94 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521)
... 100 more
Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:src/main/resources/warehouse
Last line in the stacktrace --
Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:src/main/resources/warehouse
-- hints that the proper way to set spark.sql.warehouse.dir is to supply an absolute path for metastore directory, something like file:///<project_root_dir>/src/main/resources/warehouse.
try including /usr/hdp/current/spark-client/conf/hive-site.xml or respective hive-site file while you submit spark.
Also you can try including below configuration when submitting spark job
--conf "spark.sql.catalogImplementation=hive" \

java.lang.NoSuchMethodError: backtype.storm.spout.MultiScheme.deserialize([B)Ljava/lang/Iterable;

I am using storm:0.9.7 , scala-library:2.11.7 ,kafka-clients:0.9.0.0 , kafka : kafka_2.11
RuntimeError:
java.lang.NoSuchMethodError: backtype.storm.spout.MultiScheme.deserialize([B)Ljava/lang/Iterable;
at storm.kafka.KafkaUtils.generateTuples(KafkaUtils.java:209) ~[stormjar.jar:0.3-SNAPSHOT]
at storm.kafka.PartitionManager.next(PartitionManager.java:131) ~[stormjar.jar:0.3-SNAPSHOT]
at storm.kafka.KafkaSpout.nextTuple(KafkaSpout.java:141) ~[stormjar.jar:0.3-SNAPSHOT]
at com.aexp.fastdata.core.builder.spout.KafkaExtractorSpout.nextTuple(KafkaExtractorSpout.java:112) ~[stormjar.jar:0.3-SNAPSHOT]
at backtype.storm.daemon.executor$fn__4615$fn__4630$fn__4661.invoke(executor.clj:610) ~[storm-core-0.10.0-mapr-1611.jar:0.10.0-mapr-1611]
at backtype.storm.util$async_loop$fn__544.invoke(util.clj:479) [storm-core-0.10.0-mapr-1611.jar:0.10.0-mapr-1611]
at clojure.lang.AFn.run(AFn.java:22) [clojure-1.6.0.jar:?]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_141]
you need to implement the following dependency in your build
"org.scala-lang" % "scala-library" % "2.11.7"

Exception when using the saveToPhoenix method to load/save a RDD on Hbase

I would like to use the apache-phoenix framework.
The problem is that I keep having an exception telling me that the class HBaseConfiguration can't be found.
Here is the code I want to use:
import org.apache.spark.SparkContext
import org.apache.spark.sql._
import org.apache.phoenix.spark._
// Load INPUT_TABLE
object MainTest2 extends App {
val sc = new SparkContext("local", "phoenix-test")
val sqlContext = new SQLContext(sc)
val df = sqlContext.load("org.apache.phoenix.spark", Map("table" -> "INPUT_TABLE",
"zkUrl" -> "localhost:3888"))
}
Here is the SBT I'm using :
name := "spark-to-hbase"
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies ++= Seq(
"org.apache.hadoop" % "hadoop-mapreduce-client-core" % "2.3.0",
"org.apache.phoenix" % "phoenix-core" % "4.11.0-HBase-1.3",
"org.apache.spark" % "spark-core_2.11" % "2.1.1",
"org.apache.spark" % "spark-sql_2.11" % "2.1.1",
"org.apache.phoenix" % "phoenix-spark" % "4.11.0-HBase-1.3"
)
And here is the exception :
Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/hadoop/hbase/HBaseConfiguration at
org.apache.phoenix.query.ConfigurationFactory$ConfigurationFactoryImpl$1.call(ConfigurationFactory.java:49)
at
org.apache.phoenix.query.ConfigurationFactory$ConfigurationFactoryImpl$1.call(ConfigurationFactory.java:46)
at
org.apache.phoenix.util.PhoenixContextExecutor.call(PhoenixContextExecutor.java:76)
at
org.apache.phoenix.util.PhoenixContextExecutor.callWithoutPropagation(PhoenixContextExecutor.java:91)
at
org.apache.phoenix.query.ConfigurationFactory$ConfigurationFactoryImpl.getConfiguration(ConfigurationFactory.java:46)
at
org.apache.phoenix.jdbc.PhoenixDriver.initializeConnectionCache(PhoenixDriver.java:151)
at
org.apache.phoenix.jdbc.PhoenixDriver.(PhoenixDriver.java:142)
at
org.apache.phoenix.jdbc.PhoenixDriver.(PhoenixDriver.java:69)
at org.apache.phoenix.spark.PhoenixRDD.(PhoenixRDD.scala:43)
at
org.apache.phoenix.spark.PhoenixRelation.schema(PhoenixRelation.scala:52)
at
org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:40)
at
org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:389)
at
org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
at
org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
at org.apache.spark.sql.SQLContext.load(SQLContext.scala:965) at
MainTest2$.delayedEndpoint$MainTest2$1(MainTest2.scala:9) at
MainTest2$delayedInit$body.apply(MainTest2.scala:6) at
scala.Function0$class.apply$mcV$sp(Function0.scala:34) at
scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:76) at
scala.App$$anonfun$main$1.apply(App.scala:76) at
scala.collection.immutable.List.foreach(List.scala:381) at
scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.App$class.main(App.scala:76) at
MainTest2$.main(MainTest2.scala:6) at MainTest2.main(MainTest2.scala)
Caused by: java.lang.ClassNotFoundException:
org.apache.hadoop.hbase.HBaseConfiguration at
java.net.URLClassLoader.findClass(URLClassLoader.java:381) at
java.lang.ClassLoader.loadClass(ClassLoader.java:424) at
sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331) at
java.lang.ClassLoader.loadClass(ClassLoader.java:357) ... 26 more
I've already tried to change the HADOOP_CLASSPATH in hadoop-env.sh like it is said in this previous post.
What can I do to overcome this problem?
I found a solution to my problem. As the exception says, my compiler isn't able to find the class HBaseConfiguration. HBaseConfiguration is used inside org.apache.hadoop.hbase library and so is needed to compile. I noticed that the HBaseConfiguration class wasn't present in the org.apache.hadoop library as I thought. For the hbase 1.3.1 version installed on my PC computer, I managed to find this class in the hbase-common-1.3.1 jar located in my HBASE_HOME/lib folder.
Then I include this dependency in my built.SBT :
"org.apache.hbase" % "hbase-common" % "1.3.1"
And the Exception was gone.

Using S3 (Frankfurt) with Spark

Anyone is using s3 on Frankfurt using hadoop/spark 1.6.0?
I am trying to store the result of a job on s3, my dependencies are declared as follows:
"org.apache.spark" %% "spark-core" % "1.6.0" exclude("org.apache.hadoop", "hadoop-client"),
"org.apache.spark" %% "spark-sql" % "1.6.0",
"org.apache.hadoop" % "hadoop-client" % "2.7.2",
"org.apache.hadoop" % "hadoop-aws" % "2.7.2"
I have set the following configuration:
System.setProperty("com.amazonaws.services.s3.enableV4", "true")
sc.hadoopConfiguration.set("fs.s3a.endpoint", ""s3.eu-central-1.amazonaws.com")
When calling saveAsTextFile on my RDD it starts ok, saving everything on S3. However after some time when it is transferring from _temporary to the final output result it yields the error:
Exception in thread "main" com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: XXXXXXXXXXXXXXXX, AWS Error Code: SignatureDoesNotMatch, AWS Error Message: The request signature we calculated does not match the signature you provided. Check your key and signing method., S3 Extended Request ID: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX=
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
at com.amazonaws.services.s3.AmazonS3Client.copyObject(AmazonS3Client.java:1507)
at com.amazonaws.services.s3.transfer.internal.CopyCallable.copyInOneChunk(CopyCallable.java:143)
at com.amazonaws.services.s3.transfer.internal.CopyCallable.call(CopyCallable.java:131)
at com.amazonaws.services.s3.transfer.internal.CopyMonitor.copy(CopyMonitor.java:189)
at com.amazonaws.services.s3.transfer.internal.CopyMonitor.call(CopyMonitor.java:134)
at com.amazonaws.services.s3.transfer.internal.CopyMonitor.call(CopyMonitor.java:46)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
If I use hadoop-client from spark package it not even start the transfer. The error occurs randomly, sometimes it works and sometimes don't.
In case you are using pyspark, the following worked for me
aws_profile = "your_profile"
aws_region = "eu-central-1"
s3_bucket = "your_bucket"
# see https://github.com/jupyter/docker-stacks/issues/127#issuecomment-214594895
os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages=org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell"
# If this doesn't work you might have to delete your ~/.ivy2 directory to reset your package cache.
# (see https://github.com/databricks/spark-redshift/issues/244#issuecomment-239950148)
import pyspark
sc=pyspark.SparkContext()
# see https://github.com/databricks/spark-redshift/issues/298#issuecomment-271834485
sc.setSystemProperty("com.amazonaws.services.s3.enableV4", "true")
# see https://stackoverflow.com/questions/28844631/how-to-set-hadoop-configuration-values-from-pyspark
hadoop_conf=sc._jsc.hadoopConfiguration()
# see https://stackoverflow.com/questions/43454117/how-do-you-use-s3a-with-spark-2-1-0-on-aws-us-east-2
hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("com.amazonaws.services.s3.enableV4", "true")
hadoop_conf.set("fs.s3a.access.key", access_id)
hadoop_conf.set("fs.s3a.secret.key", access_key)
# see https://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region
hadoop_conf.set("fs.s3a.endpoint", "s3." + aws_region + ".amazonaws.com")
sql=pyspark.sql.SparkSession(sc)
path = s3_bucket + "your_file_on_s3"
dataS3=sql.read.parquet("s3a://" + path)
Please try to set the values below:
System.setProperty("com.amazonaws.services.s3.enableV4", "true")
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoopConf.set("com.amazonaws.services.s3.enableV4", "true")
hadoopConf.set("fs.s3a.endpoint", "s3." + region + ".amazonaws.com")
please set the region where that bucket is located, in my case it was: eu-central-1
and add dependency into gradle or in some other way:
dependencies {
compile 'org.apache.hadoop:hadoop-aws:2.7.2'
}
hope it will help.
Inspired from the others answers, running the following directly in pyspark shell produced the desired output for me:
sc.setSystemProperty("com.amazonaws.services.s3.enableV4", "true") # fails without this
hc=sc._jsc.hadoopConfiguration()
hc.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hc.set("com.amazonaws.services.s3.enableV4", "true")
hc.set("fs.s3a.endpoint", end_point)
hc.set("fs.s3a.access.key",access_key)
hc.set("fs.s3a.secret.key",secret_key)
data = sc.textFile("s3a://bucket/file")
data.take(3)
Choose your endpoint at: list of endpoints
I was able to fetch data from Asia Pacific (Mumbai)(ap-south-1) which is a Version 4 only region.