Unable to create spark-warehouse directory using spark-2.3.0 - scala

I want to create a project with Akka and Spark. I added the Spark dependencies along with several others. Will these extra dependencies have any effect on using Spark?
This is my sbt file:
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-core" % "2.8.7"
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-databind" % "2.8.7"
dependencyOverrides += "com.fasterxml.jackson.module" % "jackson-module-scala_2.11" % "2.8.7"
lazy val commonSettings = Seq(
  organization := "com.bitool.analytics",
  scalaVersion := "2.11.12",
  libraryDependencies ++= Seq(
    "org.scala-lang.modules" %% "scala-async" % "0.9.6",
    "com.softwaremill.macwire" %% "macros" % "2.3.0",
    "com.softwaremill.macwire" %% "macrosakka" % "2.3.0",
    "com.typesafe.akka" %% "akka-http" % "10.0.6",
    "io.swagger" % "swagger-jaxrs" % "1.5.19",
    "com.github.swagger-akka-http" %% "swagger-akka-http" % "0.9.1",
    "io.circe" %% "circe-generic" % "0.8.0",
    "io.circe" %% "circe-literal" % "0.8.0",
    "io.circe" %% "circe-parser" % "0.8.0",
    "io.circe" %% "circe-optics" % "0.8.0",
    "org.scalafx" %% "scalafx" % "8.0.144-R12",
    "org.scalafx" %% "scalafxml-core-sfx8" % "0.4",
    "org.apache.spark" %% "spark-core" % "2.3.0",
    "org.apache.spark" %% "spark-sql" % "2.3.0",
    "org.apache.spark" %% "spark-hive" % "2.3.0",
    "org.scala-lang" % "scala-xml" % "2.11.0-M4",
    "mysql" % "mysql-connector-java" % "6.0.5"
  )
)
lazy val root = (project in file(".")).
  settings(commonSettings: _*).
  settings(
    name := "BITOOL-1.0"
  )
ivyScala := ivyScala.value map {
  _.copy(overrideScalaVersion = true)
}
fork in run := true
And below is my Spark code:
private val warehouseLocation = new File("spark-warehouse").getAbsolutePath
val conf = new SparkConf()
conf.setMaster("local[4]")
conf.setAppName("Bitool")
conf.set("spark.sql.warehouse.dir", warehouseLocation)
val SPARK = SparkSession
.builder().config(conf).enableHiveSupport()
.getOrCreate()
val SPARK_CONTEXT = SPARK.sparkContext
When I try to execute this, it creates the metastore_db folder, but the spark-warehouse folder is not created.

This directory is not created by getOrCreate. You can check this in the Spark source code: getOrCreate delegates its actions to SparkSession.getOrCreate, which is just a setter. All the internal tests and CliSuite prematurely initialize the directory with a snippet like val warehousePath = Utils.createTempDir().
Instead, in actual user code you have to perform at least one data modification operation to materialize your warehouse directory. Try running something like the following right after your code and then check the warehouse directory on disk again:
import SPARK.implicits._
import SPARK.sql
sql("DROP TABLE IF EXISTS test")
sql("CREATE TABLE IF NOT EXISTS test (key INT, value STRING) USING hive")

Related

`java.lang.NoSuchMethodError: cats.FlatMap.map2` in runtime when using `.sequence`

I'm getting the following runtime error after migrating from cats v1.1.0 to v1.4.0; the error arises in the places where .sequence (cats.Traverse) is used.
The code looks like:
import cats.implicits._
import cats.effect.IO
List(1, 2, 3).map(x => IO(...)).sequence
java.lang.NoSuchMethodError: cats.FlatMap.map2$(Lcats/FlatMap;Ljava/lang/Object;Ljava/lang/Object;Lscala/Function2;)Ljava/lang/Object;
at cats.effect.IOLowPriorityInstances$IOEffect.map2(IO.scala:765)
...
Here is my build.sbt:
organization := "org.xxx"
name := "yyy"
version := "0.0.1"
scalaVersion := "2.12.9"
resolvers += Resolver.bintrayRepo("hseeberger", "maven")
resolvers ++= Seq("Sonatype Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots/")
lazy val doobieVersion = "0.5.3"
lazy val akkaHttpVersion = "10.1.1"
lazy val akkaVersion = "2.5.12"
lazy val catsVersion = "1.4.0"
lazy val circeVersion = "0.9.3"
lazy val doobieDeps = Seq(
"org.tpolecat" %% "doobie-core" % doobieVersion,
"org.tpolecat" %% "doobie-postgres" % doobieVersion,
"org.tpolecat" %% "doobie-scalatest" % doobieVersion,
"org.tpolecat" %% "doobie-hikari" % doobieVersion
)
lazy val catsDeps = Seq(
"org.typelevel" %% "cats-effect" % catsVersion,
"org.typelevel" %% "cats-core" % catsVersion
)
lazy val otherDeps = Seq(
"com.github.pureconfig" %% "pureconfig" % "0.9.1",
"org.scorexfoundation" %% "scrypto" % "2.1.1",
"de.heikoseeberger" %% "akka-http-circe" % "1.20.1",
"org.scalaj" %% "scalaj-http" % "2.4.0",
"org.flywaydb" % "flyway-core" % "5.1.1",
"com.github.blemale" %% "scaffeine" % "2.5.0",
("org.scorexfoundation" %% "sigma-state" % "master-2b4b07a1-SNAPSHOT")
.exclude("ch.qos.logback", "logback-classic")
.exclude("org.scorexfoundation", "scrypto"),
)
lazy val circeDeps = Seq(
"io.circe" %% "circe-core" % circeVersion,
"io.circe" %% "circe-parser" % circeVersion,
"io.circe" %% "circe-generic" % circeVersion
)
libraryDependencies ++= (otherDeps ++ doobieDeps ++ catsDeps ++ loggingDeps ++ akkaDeps ++ circeDeps ++ testDeps)
I've tried running it in different ways (IDEA, sbt run) and on different platforms; the result is always the same.
What could it be caused by?
Try changing the versions to:
"org.typelevel" %% "cats-effect" % "1.4.0",
"org.typelevel" %% "cats-core" % "1.6.1"
Your project seems to work with them.
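If it helps, one way to express that in the build.sbt from the question is to give the two libraries separate version values, since cats-core and cats-effect are versioned independently; a sketch of the relevant fragment:
// cats-core and cats-effect have independent release lines, so track them separately
lazy val catsVersion       = "1.6.1"
lazy val catsEffectVersion = "1.4.0"

lazy val catsDeps = Seq(
  "org.typelevel" %% "cats-core"   % catsVersion,
  "org.typelevel" %% "cats-effect" % catsEffectVersion
)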

Cannot import classes from one module to other one - Scala

I have created a project with 3 different modules. The first one is called http and the second algebra. I have wired them together in the sbt file, but when I want to use classes from algebra in http I cannot import them, because the modules do not see each other.
This is my sbt file:
lazy val commonSettings = Seq(
  libraryDependencies ++= Seq(
    "org.typelevel" %% "cats-core" % CatsVersion,
    "org.typelevel" %% "cats-effect" % "1.2.0",
    "org.typelevel" %% "cats-tagless-macros" % "0.2.0",
    "org.typelevel" %% "cats-mtl-core" % "0.5.0",
  )
)
lazy val root = project.in(file(".")).aggregate(http, domain, algebra)
  .settings(commonSettings)
  .settings(libraryDependencies ++= Seq(
    "org.tpolecat" %% "doobie-core" % DoobieVersion,
    "org.tpolecat" %% "doobie-h2" % DoobieVersion,
    "org.tpolecat" %% "doobie-scalatest" % DoobieVersion,
    "org.tpolecat" %% "doobie-hikari" % DoobieVersion,
  ))
lazy val http = (project in file("http"))
  .dependsOn(algebra)
  .settings(commonSettings)
  .settings(
    name := "my-http",
    libraryDependencies ++= Seq(
      "io.circe" %% "circe-generic" % CirceVersion,
      "io.circe" %% "circe-literal" % CirceVersion,
      "io.circe" %% "circe-generic-extras" % CirceVersion,
      "io.circe" %% "circe-parser" % CirceVersion,
      "io.circe" %% "circe-java8" % CirceVersion,
      "io.circe" %% "circe-config" % CirceConfigVersion,
      "org.http4s" %% "http4s-blaze-server" % Http4sVersion,
      "org.http4s" %% "http4s-circe" % Http4sVersion,
      "org.http4s" %% "http4s-dsl" % Http4sVersion,
    ))
lazy val domain = project.in(file("domain"))
lazy val algebra = (project in file("algebra"))
  .settings(commonSettings)
  .settings(
    name := "my-algebra",
  )
I tried to refresh all projects but it did not work.
class MyRoutes[F[_]: Effect](services: MyService[F]) extends Http4sDsl[F]{...}
The MyRoutes class is in the http module and MyService is in the algebra module. The error is "Cannot find declaration to go to" on MyService.
How can I fix it?
OK, I solved this problem. It was my own stupid mistake: I had not marked the directory containing MyService as a source root. Because of this, the http module could not see the class.
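For anyone hitting the same thing: with the standard layout (MyService under algebra/src/main/scala and MyRoutes under http/src/main/scala) and http.dependsOn(algebra) as in the build file above, the import resolves normally; a small sketch with hypothetical paths and package names:
// http/src/main/scala/com/example/http/MyRoutes.scala (hypothetical)
package com.example.http

import cats.effect.Effect
import org.http4s.dsl.Http4sDsl
import com.example.algebra.MyService // visible because http.dependsOn(algebra)

class MyRoutes[F[_]: Effect](services: MyService[F]) extends Http4sDsl[F]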

hbase-spark load data raises NullPointerException (Scala)

I want to load data from HBase with Spark SQL. I used the official hbase-spark example and it raises a NullPointerException.
My build.sbt file is:
name := "proj_1"
version := "0.1"
scalaVersion := "2.11.12"
libraryDependencies ++= Seq(
"org.apache.spark" % "spark-core_2.11" % "2.3.1",
"org.apache.spark" % "spark-sql_2.11" % "2.3.1",
"org.apache.spark" % "spark-mllib_2.11" % "2.3.1",
"org.apache.spark" % "spark-streaming_2.11" % "2.3.1",
"org.apache.spark" % "spark-hive_2.11" % "2.3.1",
"org.elasticsearch" % "elasticsearch-hadoop" % "6.4.0",
"org.apache.hadoop" % "hadoop-core" % "2.6.0-mr1-cdh5.15.1",
"org.apache.hbase" % "hbase" % "2.1.0",
"org.apache.hbase" % "hbase-server" % "2.1.0",
"org.apache.hbase" % "hbase-common" % "2.1.0",
"org.apache.hbase" % "hbase-client" % "2.1.0",
"org.apache.hbase" % "hbase-spark" % "2.1.0-cdh6.x-SNAPSHOT"
)
resolvers += "Cloudera Repository" at "https://repository.cloudera.com/artifactory/cloudera-repos/"
resolvers += "clojars" at "https://clojars.org/repo"
resolvers += "conjars" at "http://conjars.org/repo"
resolvers += "Apache HBase" at "https://repository.apache.org/content/repositories/releases"
The failing code is:
def withCatalog(cat: String): DataFrame = {
  sqlContext
    .read
    .options(Map(HBaseTableCatalog.tableCatalog -> cat))
    .format("org.apache.hadoop.hbase.spark")
    .option("zkUrl", "127.0.0.1:2181/chen_test")
    .load()
}
val df = withCatalog(catalog)
Exception info is:
Exception in thread "main" java.lang.NullPointerException
at org.apache.hadoop.hbase.spark.HBaseRelation.<init> (DefaultSource.scala:139)
at org.apache.hadoop.hbase.spark.DefaultSource.createRelation(DefaultSource.scala:70)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:340)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:164)
at hbase_test$.withCatalog$1(hbase_test.scala:57)
at hbase_test$.main(hbase_test.scala:59)
at hbase_test.main(hbase_test.scala)
How do I fix it? Can you help me?
Ran into this problem recently. Suggest you try this:
import org.apache.hadoop.fs.Path
val conf = HBaseConfiguration.create()
conf.addResource(new Path("/path/to/hbase-site.xml"))
new HBaseContext(sc, conf) // "sc" is the SparkContext you created earlier.
The last expression introduces a stable value into the environment; I found this quite accidentally while scanning HBase's codebase.
Hope it helps.
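For completeness, a sketch of how the pieces fit together with the code from the question (the hbase-site.xml path is an assumption; adjust it to your installation). Constructing the HBaseContext before load() supplies the configuration whose absence triggers the NullPointerException in HBaseRelation:
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.spark.datasources.HBaseTableCatalog
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("hbase_test").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val hbaseConf = HBaseConfiguration.create()
hbaseConf.addResource(new Path("/path/to/hbase-site.xml")) // assumption: adjust to your setup
new HBaseContext(sc, hbaseConf) // registers the stable context used by the connector

def withCatalog(cat: String): DataFrame =
  spark.read
    .options(Map(HBaseTableCatalog.tableCatalog -> cat))
    .format("org.apache.hadoop.hbase.spark")
    .load()

// val df = withCatalog(catalog) // "catalog" is the table catalog JSON from the official example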

Not able to register RDD as TempTable

I am using IntelliJ and trying to get data from a MySQL DB and then write it into a Hive table.
However, I am not able to register my RDD as a temp table. The error is "Cannot resolve symbol registerTempTable".
I know that this issue is due to some missing imports, but I am not able to find out which ones.
I have been stuck with this issue for quite a long time and tried all the options / answers available on stack overflow.
Below is my code:
import java.sql.Driver
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.JdbcRDD
import java.sql.{Connection, DriverManager, ResultSet}
import org.apache.spark.sql.hive.HiveContext
object JdbcRddExample {
  def main(args: Array[String]): Unit = {
    val url = "jdbc:mysql://localhost:3306/retail_db"
    val username = "retail_dba"
    val password = "cloudera"
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val hiveContext = new HiveContext(sc)
    import hiveContext.implicits._
    Class.forName("com.mysql.jdbc.Driver").newInstance
    val conf = new SparkConf().setAppName("JDBC RDD").setMaster("local[2]").set("spark.executor.memory", "1g")
    val sc = new SparkContext(conf)
    val myRDD = new JdbcRDD(sc, () => DriverManager.getConnection(url, username, password),
      "select department_id,department_name from departments limit ?,?",
      0, 999999999, 1, r => r.getString("department_id") + ", " + r.getString("department_name"))
    myRDD.registerTempTable("My_Table") // error: Not able to resolve registerTempTable
    sqlContext.sql("use my_db")
    sqlContext.sql("Create table my_db.depts (department_id INT, department_name String")
My sbt file (I believe I have included all the artifacts):
name := "JdbcRddExample"
version := "0.1"
scalaVersion := "2.11.12"
// https://mvnrepository.com/artifact/org.apache.spark/spark-core
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.1"
// https://mvnrepository.com/artifact/org.apache.spark/spark-streaming
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.3.1" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.3.1"
// https://mvnrepository.com/artifact/org.apache.spark/spark-hive
libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.3.1" % "provided"
// https://mvnrepository.com/artifact/org.apache.spark/spark-streaming
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.3.1" % "provided"
// https://mvnrepository.com/artifact/com.typesafe.scala-logging/scala-logging
libraryDependencies += "com.typesafe.scala-logging" %% "scala-logging" % "3.7.1"
// https://mvnrepository.com/artifact/org.apache.spark/spark-core
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.1"
libraryDependencies += "org.apache.logging.log4j" % "log4j-api" % "2.11.0"
libraryDependencies += "org.apache.logging.log4j" % "log4j-core" % "2.11.0"
// https://mvnrepository.com/artifact/org.apache.spark/spark-sql
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.1"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.3.1",
"org.apache.spark" %% "spark-sql" % "2.3.1",
"org.apache.spark" %% "spark-mllib" % "2.3.1",
"mysql" % "mysql-connector-java" % "5.1.12"
)
// https://mvnrepository.com/artifact/org.apache.spark/spark-hive
libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.3.1" % "provided"
// https://mvnrepository.com/artifact/org.apache.spark/spark-sql
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.1"
Please point me to the exact imports that I am missing, or is there an alternative way? As I mentioned before, I have tried all the suggested solutions and nothing has worked so far.
To use Spark SQL, you need a DataFrame rather than an RDD; an RDD simply does not have a registerTempTable method.
You can quickly work around this by converting the RDD to a DataFrame (see, for example, How to convert rdd object to dataframe in spark). But it is recommended to use Spark SQL's JDBC data source to read from the database directly, as in the examples here. Sample code:
val dfDepartments = sqlContext.read.format("jdbc")
  .option("url", url)
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "(select department_id,department_name from departments) t")
  .option("user", username)
  .option("password", password)
  .load()
dfDepartments.createOrReplaceTempView("My_Table")
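And since the goal was to land the data in Hive, you can write the DataFrame out directly afterwards; a sketch, assuming the session has Hive support enabled and the my_db database already exists:
// Requires a Hive-enabled session, e.g. SparkSession.builder().enableHiveSupport().getOrCreate()
dfDepartments.write
  .mode("overwrite")
  .saveAsTable("my_db.depts") // table name taken from the CREATE TABLE in the question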

sbt subproject cannot find its dependencies

I have a project tree consisting of three projects: A, B and C.
B depends on A, and C depends on both A and B.
A and B are checked out in C's lib/ and both build fine using sbt compile
However, when I compile C, the build of B fails, complaining that it cannot find certain types/packages:
import org.scalatra.sbt._
import sbt.Keys._
import sbt._
object NwbApiBuild extends Build {
  val Organization = "org.nwb"
  val Name = "NWB API"
  val Version = "0.1.0-SNAPSHOT"
  val ScalaVersion = "2.10.3"
  val ScalatraVersion = "2.3.0"
  lazy val active_slick = Project(
    "active-slick",
    base = file("lib/active-slick")
  )
  lazy val slick_auth = Project(
    "slick-auth",
    base = file("lib/slick-auth")
  )
  lazy val project = Project(
    "root",
    file("."),
    settings = Defaults.defaultSettings ++ ScalatraPlugin.scalatraWithJRebel ++ Seq(
      organization := Organization,
      name := Name,
      version := Version,
      scalaVersion := ScalaVersion,
      resolvers += Classpaths.typesafeReleases,
      libraryDependencies ++= Seq(
        "org.scalatra" %% "scalatra" % ScalatraVersion,
        "org.scalatra" %% "scalatra-specs2" % ScalatraVersion % "test",
        "ch.qos.logback" % "logback-classic" % "1.0.6" % "runtime",
        "org.eclipse.jetty" % "jetty-webapp" % "8.1.8.v20121106" % "container",
        "org.eclipse.jetty.orbit" % "javax.servlet" % "3.0.0.v201112011016" % "container;provided;test" artifacts (Artifact("javax.servlet", "jar", "jar")),
        "com.typesafe.slick" %% "slick" % "2.0.2",
        "mysql" % "mysql-connector-java" % "5.1.31",
        "joda-time" % "joda-time" % "2.3",
        "org.joda" % "joda-convert" % "1.5",
        "com.github.tototoshi" %% "slick-joda-mapper" % "1.1.0",
        "org.json4s" %% "json4s-native" % "3.2.10",
        "org.json4s" %% "json4s-jackson" % "3.2.7",
        "c3p0" % "c3p0" % "0.9.1.2"
      )
    )
  ) aggregate(active_slick, slick_auth) dependsOn(active_slick, slick_auth)
}
where slick-auth has the build file
import org.scalatra.sbt._
name := "slick-auth"
version := "0.0.1-SNAPSHOT"
scalaVersion := "2.10.3"
val ScalatraVersion = "2.3.0"
lazy val active_slick = Project(
  "active-slick",
  base = file("lib/active-slick")
)
lazy val root = Project(
  "root",
  file("."),
  settings = Defaults.defaultSettings ++ ScalatraPlugin.scalatraSettings ++ Seq(
    libraryDependencies ++= Seq(
      "com.typesafe.slick" %% "slick" % "2.0.2",
      "org.slf4j" % "slf4j-nop" % "1.6.4",
      "org.scalatest" %% "scalatest" % "2.2.0" % "test",
      "org.scalatra" %% "scalatra" % ScalatraVersion,
      "org.scalatra" %% "scalatra-specs2" % ScalatraVersion % "test",
      "ch.qos.logback" % "logback-classic" % "1.0.6" % "runtime",
      "org.eclipse.jetty" % "jetty-webapp" % "8.1.8.v20121106" % "container",
      "org.eclipse.jetty.orbit" % "javax.servlet" % "3.0.0.v201112011016" % "container;provided;test" artifacts (Artifact("javax.servlet", "jar", "jar")),
      "com.typesafe.slick" %% "slick" % "2.0.2",
      "joda-time" % "joda-time" % "2.3",
      "org.joda" % "joda-convert" % "1.5",
      "com.github.tototoshi" %% "slick-joda-mapper" % "1.1.0",
      "org.json4s" %% "json4s-native" % "3.2.10",
      "org.json4s" %% "json4s-jackson" % "3.2.7",
      "c3p0" % "c3p0" % "0.9.1.2"
    )
  )
).aggregate(active_slick).dependsOn(active_slick)
and active_slick:
name := "active-slick"
version := "0.0.1-SNAPSHOT"
scalaVersion := "2.10.3"
libraryDependencies ++= Seq(
"com.typesafe.slick" %% "slick" % "2.0.2",
"org.slf4j" % "slf4j-nop" % "1.6.4",
"org.scalatest" %% "scalatest" % "2.2.0" % "test",
"com.h2database" % "h2" % "1.3.166" % "test"
)
If you want to use another project as a source dependency (rather than its published binary) you can use project references. There are two kinds of references: ProjectRef, or a simpler version of ProjectRef, which is RootProject.
You should change your build definition to reference slick_auth as
lazy val slick_auth = RootProject(file("lib/slick-auth"))
and active_slick as
lazy val active_slick = RootProject(file("lib/active-slick"))
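Putting both references together, the top of the root build definition might look roughly like this (a sketch; the remaining organization, name, version and libraryDependencies settings stay as in the question):
import org.scalatra.sbt._
import sbt.Keys._
import sbt._

object NwbApiBuild extends Build {
  // Source dependencies: sbt builds these from lib/ instead of resolving binary artifacts
  lazy val active_slick = RootProject(file("lib/active-slick"))
  lazy val slick_auth = RootProject(file("lib/slick-auth"))

  lazy val project = Project(
    "root",
    file("."),
    settings = Defaults.defaultSettings ++ ScalatraPlugin.scalatraWithJRebel ++ Seq(
      scalaVersion := "2.10.3"
      // ...plus the other settings from the original build definition
    )
  ) aggregate(active_slick, slick_auth) dependsOn(active_slick, slick_auth)
}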