hbase-spark load data raises NullPointerException (Scala)

I want to load data from HBase with Spark SQL. I am using the official hbase-spark example, but it raises a NullPointerException.
My build.sbt file is:
name := "proj_1"
version := "0.1"
scalaVersion := "2.11.12"
libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.11" % "2.3.1",
  "org.apache.spark" % "spark-sql_2.11" % "2.3.1",
  "org.apache.spark" % "spark-mllib_2.11" % "2.3.1",
  "org.apache.spark" % "spark-streaming_2.11" % "2.3.1",
  "org.apache.spark" % "spark-hive_2.11" % "2.3.1",
  "org.elasticsearch" % "elasticsearch-hadoop" % "6.4.0",
  "org.apache.hadoop" % "hadoop-core" % "2.6.0-mr1-cdh5.15.1",
  "org.apache.hbase" % "hbase" % "2.1.0",
  "org.apache.hbase" % "hbase-server" % "2.1.0",
  "org.apache.hbase" % "hbase-common" % "2.1.0",
  "org.apache.hbase" % "hbase-client" % "2.1.0",
  "org.apache.hbase" % "hbase-spark" % "2.1.0-cdh6.x-SNAPSHOT"
)
resolvers += "Cloudera Repository" at "https://repository.cloudera.com/artifactory/cloudera-repos/"
resolvers += "clojars" at "https://clojars.org/repo"
resolvers += "conjars" at "http://conjars.org/repo"
resolvers += "Apache HBase" at "https://repository.apache.org/content/repositories/releases"
The code that fails is:
def withCatalog(cat: String): DataFrame = {
  sqlContext
    .read
    .options(Map(HBaseTableCatalog.tableCatalog -> cat))
    .format("org.apache.hadoop.hbase.spark")
    .option("zkUrl", "127.0.0.1:2181/chen_test")
    .load()
}
val df = withCatalog(catalog)
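For reference, catalog here is a catalog JSON in the format used by the official example; the table, column family and column names below are placeholders rather than my real ones:
def catalog = s"""{
  |"table":{"namespace":"default", "name":"table1"},
  |"rowkey":"key",
  |"columns":{
  |"col0":{"cf":"rowkey", "col":"key", "type":"string"},
  |"col1":{"cf":"cf1", "col":"col1", "type":"string"}
  |}
|}""".stripMargin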
Exception info is:
Exception in thread "main" java.lang.NullPointerException
at org.apache.hadoop.hbase.spark.HBaseRelation.<init>(DefaultSource.scala:139)
at org.apache.hadoop.hbase.spark.DefaultSource.createRelation(DefaultSource.scala:70)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:340)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:164)
at hbase_test$.withCatalog$1(hbase_test.scala:57)
at hbase_test$.main(hbase_test.scala:59)
at hbase_test.main(hbase_test.scala)
How do I fix it? Can you help me?

Ran into this problem recently. Suggest you try this:
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.spark.HBaseContext

val conf = HBaseConfiguration.create()
conf.addResource(new Path("/path/to/hbase-site.xml"))
new HBaseContext(sc, conf) // "sc" is the SparkContext you created earlier.
The last expression matters even though its result is discarded: constructing the HBaseContext registers the HBase configuration so the data source can pick it up when load() is called. I found this quite by accident while scanning HBase's codebase.
Hope it helps.
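For completeness, a minimal sketch of how the fix slots in just before the read from the question (the hbase-site.xml path is an assumption; adjust it for your cluster):
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.spark.HBaseContext

val conf = HBaseConfiguration.create()
conf.addResource(new Path("/etc/hbase/conf/hbase-site.xml")) // assumed location of hbase-site.xml

// Constructing the HBaseContext before the first read registers the HBase
// configuration with the connector; without it the data source has no
// configuration to hand to HBaseRelation, hence the NullPointerException.
new HBaseContext(sc, conf)

val df = withCatalog(catalog) // the helper from the question now loads without the NPE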

Related

Unable to write files after scala and spark upgrade

My project was previously using Scala 2.11.12, which I have upgraded to 2.12.10, and the Spark version has been upgraded from 2.4.0 to 3.1.2. See the build.sbt file below for the rest of the project dependencies and versions:
scalaVersion := "2.12.10"
val sparkVersion = "3.1.2"
libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion % "provided"
libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion % "provided"
libraryDependencies += "org.apache.spark" %% "spark-hive" % sparkVersion % "provided"
libraryDependencies += "org.xerial.snappy" % "snappy-java" % "1.1.4"
libraryDependencies += "org.scalactic" %% "scalactic" % "3.0.8"
libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.8" % "test, it"
libraryDependencies += "com.holdenkarau" %% "spark-testing-base" % "3.1.2_1.1.0" % "test, it"
libraryDependencies += "com.github.pureconfig" %% "pureconfig" % "0.12.1"
libraryDependencies += "com.typesafe" % "config" % "1.3.2"
libraryDependencies += "org.pegdown" % "pegdown" % "1.1.0" % "test, it"
libraryDependencies += "com.github.scopt" %% "scopt" % "3.7.1"
libraryDependencies += "com.github.pathikrit" %% "better-files" % "3.8.0"
libraryDependencies += "com.typesafe.scala-logging" %% "scala-logging" % "3.9.2"
libraryDependencies += "com.amazon.deequ" % "deequ" % "2.0.0-spark-3.1" excludeAll (
ExclusionRule(organization = "org.apache.spark")
)
libraryDependencies += "net.liftweb" %% "lift-json" % "3.4.0"
libraryDependencies += "com.crealytics" %% "spark-excel" % "0.13.1"
The app builds fine after the upgrade, but it is unable to write files to the filesystem, which was working fine before the upgrade. I haven't made any code changes to the write logic.
The relevant portion of code that writes to the files is shown below.
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils // Hadoop's IOUtils: copyBytes(in, out, conf, close)

val inputStream = getClass.getResourceAsStream(resourcePath)
val conf = spark.sparkContext.hadoopConfiguration
val fs = FileSystem.get(conf)
val output = fs.create(new Path(outputPath))
// Copy the classpath resource to the target path and close both streams.
IOUtils.copyBytes(inputStream, output.getWrappedStream, conf, true)
Could it be that IOUtils is not compatible with the new Scala/Spark versions?

object hbase is not a member of package org.apache.hadoop

I am trying to use the HBase API in my Scala project, but I get an error when I try:
import org.apache.hadoop.hbase
The error is "object hbase is not a member of package org.apache.hadoop"
I am using sbt 1.3.12 to build my project; this is part of the build.sbt:
scalaVersion := "2.11.8"
libraryDependencies ++= Seq(
  "org.apache.hadoop" % "hadoop-common" % "2.7.3",
  "org.apache.hadoop" % "hadoop-client" % "2.7.3",
  "org.apache.hbase" % "hbase-common" % "1.2.1",
  "org.apache.hbase" % "hbase-client" % "1.2.1",
  "org.apache.hbase" % "hbase-protocol" % "1.2.1",
  "org.apache.hbase" % "hbase-server" % "1.2.1"
)
Do you know how to solve the issue?
It is probably caused by the stray "val hbaseVersion =" sitting in the middle of your build.sbt. Try removing it:
scalaVersion := "2.11.8"
libraryDependencies ++= Seq(
  "org.apache.hadoop" % "hadoop-common" % "2.7.3",
  "org.apache.hadoop" % "hadoop-client" % "2.7.3",
  "org.apache.hbase" % "hbase-common" % "1.2.1",
  "org.apache.hbase" % "hbase-client" % "1.2.1",
  "org.apache.hbase" % "hbase-protocol" % "1.2.1",
  "org.apache.hbase" % "hbase-server" % "1.2.1"
)
Code run at Scastie.
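If a shared version value was the intent, it still works as long as it is a complete top-level statement rather than a dangling fragment inside the settings; a sketch (the val name is only illustrative):
scalaVersion := "2.11.8"

// Define the shared version as its own top-level statement.
val hbaseVersion = "1.2.1"

libraryDependencies ++= Seq(
  "org.apache.hadoop" % "hadoop-common" % "2.7.3",
  "org.apache.hadoop" % "hadoop-client" % "2.7.3",
  "org.apache.hbase" % "hbase-common" % hbaseVersion,
  "org.apache.hbase" % "hbase-client" % hbaseVersion,
  "org.apache.hbase" % "hbase-protocol" % hbaseVersion,
  "org.apache.hbase" % "hbase-server" % hbaseVersion
)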

Not able to register RDD as TempTable

I am using IntelliJ and trying to get data from a MySQL database and then write it into a Hive table.
However, I am not able to register my RDD as a temp table. The error is "Cannot Resolve Symbol registerTempTable".
I know this issue is due to some missing imports, but I am not able to find out which ones.
I have been stuck on this issue for quite a long time and have tried all the options/answers available on Stack Overflow.
Below is my code:
import java.sql.Driver
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.JdbcRDD
import java.sql.{Connection, DriverManager, ResultSet}
import org.apache.spark.sql.hive.HiveContext
object JdbcRddExample {
  def main(args: Array[String]): Unit = {
    val url = "jdbc:mysql://localhost:3306/retail_db"
    val username = "retail_dba"
    val password = "cloudera"
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val hiveContext = new HiveContext(sc)
    import hiveContext.implicits._
    Class.forName("com.mysql.jdbc.Driver").newInstance
    val conf = new SparkConf().setAppName("JDBC RDD").setMaster("local[2]").set("spark.executor.memory","1g")
    val sc = new SparkContext(conf)
    val myRDD = new JdbcRDD( sc, () => DriverManager.getConnection(url,username,password) ,
      "select department_id,department_name from departments limit ?,?",
      0,999999999,1, r => r.getString("department_id") + ", " + r.getString("department_name"))
    myRDD.registerTempTable("My_Table") // error: Not able to resolve registerTempTable
    sqlContext.sql("use my_db")
    sqlContext.sql("Create table my_db.depts (department_id INT, department_name String")
My build.sbt (I believe I have included all the necessary artifacts):
name := "JdbcRddExample"
version := "0.1"
scalaVersion := "2.11.12"
// https://mvnrepository.com/artifact/org.apache.spark/spark-core
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.1"
// https://mvnrepository.com/artifact/org.apache.spark/spark-streaming
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.3.1" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.3.1"
// https://mvnrepository.com/artifact/org.apache.spark/spark-hive
libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.3.1" % "provided"
// https://mvnrepository.com/artifact/org.apache.spark/spark-streaming
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.3.1" % "provided"
// https://mvnrepository.com/artifact/com.typesafe.scala-logging/scala-logging
libraryDependencies += "com.typesafe.scala-logging" %% "scala-logging" % "3.7.1"
// https://mvnrepository.com/artifact/org.apache.spark/spark-core
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.1"
libraryDependencies += "org.apache.logging.log4j" % "log4j-api" % "2.11.0"
libraryDependencies += "org.apache.logging.log4j" % "log4j-core" % "2.11.0"
// https://mvnrepository.com/artifact/org.apache.spark/spark-sql
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.1"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.3.1",
  "org.apache.spark" %% "spark-sql" % "2.3.1",
  "org.apache.spark" %% "spark-mllib" % "2.3.1",
  "mysql" % "mysql-connector-java" % "5.1.12"
)
// https://mvnrepository.com/artifact/org.apache.spark/spark-hive
libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.3.1" % "provided"
// https://mvnrepository.com/artifact/org.apache.spark/spark-sql
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.1"
Please point me to the exact imports I am missing, or suggest an alternative approach. As I mentioned, I have tried all the solutions I could find and nothing has worked so far.
To use Spark SQL you need a DataFrame rather than an RDD; an RDD simply does not have a registerTempTable method.
You can work around this quickly by converting the RDD to a DataFrame (see, for example, How to convert rdd object to dataframe in spark). However, it is better to use Spark SQL's built-in JDBC data source to read from MySQL directly. Sample code:
val dfDepartments = sqlContext.read.format("jdbc")
  .option("url", url)
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "(select department_id,department_name from departments) t")
  .option("user", username)
  .option("password", password)
  .load()
dfDepartments.createOrReplaceTempView("My_Table")
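Since the goal was to land the data in Hive, the DataFrame can then be written out directly; a rough sketch, assuming Hive support is enabled (e.g. a HiveContext or a SparkSession built with enableHiveSupport) and using the table name from the question:
// Persist the JDBC-backed DataFrame as a Hive table instead of a manual CREATE TABLE.
dfDepartments.write
  .mode("overwrite")
  .saveAsTable("my_db.depts")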

Unable to create spark-warehouse directory using spark-2.3.0

I want to create a project that uses both Akka and Spark, and I have added the Akka dependencies along with several others. Will these dependencies cause any problems when using Spark?
I have the following sbt file:
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-core" % "2.8.7"
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-databind" % "2.8.7"
dependencyOverrides += "com.fasterxml.jackson.module" % "jackson-module-scala_2.11" % "2.8.7"
lazy val commonSettings = Seq(
  organization := "com.bitool.analytics",
  scalaVersion := "2.11.12",
  libraryDependencies ++= Seq(
    "org.scala-lang.modules" %% "scala-async" % "0.9.6",
    "com.softwaremill.macwire" %% "macros" % "2.3.0",
    "com.softwaremill.macwire" %% "macrosakka" % "2.3.0",
    "com.typesafe.akka" %% "akka-http" % "10.0.6",
    "io.swagger" % "swagger-jaxrs" % "1.5.19",
    "com.github.swagger-akka-http" %% "swagger-akka-http" % "0.9.1",
    "io.circe" %% "circe-generic" % "0.8.0",
    "io.circe" %% "circe-literal" % "0.8.0",
    "io.circe" %% "circe-parser" % "0.8.0",
    "io.circe" %% "circe-optics" % "0.8.0",
    "org.scalafx" %% "scalafx" % "8.0.144-R12",
    "org.scalafx" %% "scalafxml-core-sfx8" % "0.4",
    "org.apache.spark" %% "spark-core" % "2.3.0",
    "org.apache.spark" %% "spark-sql" % "2.3.0",
    "org.apache.spark" %% "spark-hive" % "2.3.0",
    "org.scala-lang" % "scala-xml" % "2.11.0-M4",
    "mysql" % "mysql-connector-java" % "6.0.5"
  )
)

lazy val root = (project in file(".")).
  settings(commonSettings: _*).
  settings(
    name := "BITOOL-1.0"
  )

ivyScala := ivyScala.value map {
  _.copy(overrideScalaVersion = true)
}

fork in run := true
and below is my Spark code:
private val warehouseLocation = new File("spark-warehouse").getAbsolutePath
val conf = new SparkConf()
conf.setMaster("local[4]")
conf.setAppName("Bitool")
conf.set("spark.sql.warehouse.dir", warehouseLocation)
val SPARK = SparkSession
.builder().config(conf).enableHiveSupport()
.getOrCreate()
val SPARK_CONTEXT = SPARK.sparkContext
When I try to execute this, it creates the metastore_db folder, but the spark-warehouse folder is not created.
This directory is not created by getOrCreate. You can check this in the Spark source code: getOrCreate delegates its work to SparkSession.getOrCreate, which is just a setter. All the internal tests and CliSuite prematurely initialize the directory with a snippet like val warehousePath = Utils.createTempDir().
Instead, in actual user code you have to perform at least one data-modifying operation to materialize the warehouse directory. Try running something like the following right after your code and then check the warehouse directory on disk again:
import SPARK.implicits._
import SPARK.sql
sql("DROP TABLE IF EXISTS test")
sql("CREATE TABLE IF NOT EXISTS test (key INT, value STRING) USING hive")

java.lang.NoSuchMethodError: org.apache.http.conn.ssl.SSLConnectionSocketFactory

How can I list the file names of all Parquet files in an S3 directory on Amazon?
I found this approach:
val s3 = AmazonS3ClientBuilder.standard.build()
var objs = s3.listObjects("bucketname", "directory")
val summaries = objs.getObjectSummaries()
while (objs.isTruncated()) {
  objs = s3.listNextBatchOfObjects(objs)
  summaries.addAll(objs.getObjectSummaries())
}
val listOfFiles = summaries.toArray
But it throws the error:
java.lang.NoSuchMethodError: org.apache.http.conn.ssl.SSLConnectionSocketFactory
I added the dependency for httpclient 4.5.2 as indicated in many answers, but I still get the same error.
I also have these dependencies, excluding commons-httpclient from the Spark artifacts:
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion exclude("commons-httpclient", "commons-httpclient"),
  "org.apache.spark" %% "spark-mllib" % sparkVersion exclude("commons-httpclient", "commons-httpclient"),
  "org.sedis" %% "sedis" % "1.2.2",
  "org.scalactic" %% "scalactic" % "3.0.0",
  "org.scalatest" %% "scalatest" % "3.0.0" % "test",
  "com.github.nscala-time" %% "nscala-time" % "2.14.0",
  "com.amazonaws" % "aws-java-sdk-s3" % "1.11.53",
  "org.apache.httpcomponents" % "httpclient" % "4.5.2",
  "net.java.dev.jets3t" % "jets3t" % "0.9.3",
  "org.apache.hadoop" % "hadoop-aws" % "2.6.0",
  "com.github.scopt" %% "scopt" % "3.3.0"
)
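For what it is worth, once the dependency conflict is resolved, the same listing loop can be narrowed to just the Parquet keys; a sketch with placeholder bucket and prefix names:
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import scala.collection.JavaConverters._

val s3 = AmazonS3ClientBuilder.standard.build()
var objs = s3.listObjects("bucketname", "directory")
val summaries = objs.getObjectSummaries
while (objs.isTruncated) {
  objs = s3.listNextBatchOfObjects(objs)
  summaries.addAll(objs.getObjectSummaries)
}
// Keep only the keys that point to Parquet files.
val parquetKeys = summaries.asScala.map(_.getKey).filter(_.endsWith(".parquet"))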