Set up sbt for use with HBase - scala

I'm trying to work with HBase from Spark/Scala using sbt. I followed the instructions, replacing the version with 1.2.1, but it seems my machine cannot resolve the dependencies.
Below is my .sbt/repositories file:
[repositories]
local
sbt-releases-repo: http://repo.typesafe.com/typesafe/ivy-releases/, [organization]/[module]/(scala_[scalaVersion]/)(sbt_[sbtVersion]/)[revision]/[type]s/[artifact](-[classifier]).[ext]
sbt-plugins-repo: http://repo.scala-sbt.org/scalasbt/sbt-plugin-releases/, [organization]/[module]/(scala_[scalaVersion]/)(sbt_[sbtVersion]/)[revision]/[type]s/[artifact](-[classifier]).[ext]
maven-central: http://repo1.maven.org/maven2/
concurrent-maven: http://conjars.org/repo/
I'm using IntelliJ, and it reports HBase as an unresolved dependency; when I type org.apache.hadoop., hbase does not appear in the completion list, even though it should.
Am I missing a repo or resolver?

I figured it out: if you can use one of the CDH or HDP versions (which works out fine in my case, because we use HDP), then you just have to add the corresponding vendor repos as resolvers.
Then, in build.sbt, you use the version that matches your Hadoop distro. If you happen to use vanilla HBase, you probably have to publish it to your local repo; I haven't gone that route, though.
And yes, I was right: the libraries reside in org.apache.hadoop.hbase.
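For illustration, here is a minimal build.sbt sketch of that setup. The resolver URL and the HDP-flavoured version string are placeholders (check your vendor repo and the exact version your distro ships), and note that the Maven groupId is org.apache.hbase even though the Java packages live under org.apache.hadoop.hbase:
// Hedged sketch: resolver URL and version strings are illustrative placeholders.
resolvers += "Hortonworks Releases" at "http://repo.hortonworks.com/content/repositories/releases/"

libraryDependencies ++= Seq(
  // groupId is org.apache.hbase; the classes themselves live in org.apache.hadoop.hbase
  "org.apache.hbase" % "hbase-client" % "1.1.2.2.6.0.3-8", // placeholder HDP build of HBase
  "org.apache.hbase" % "hbase-common" % "1.1.2.2.6.0.3-8"
)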

Related

Databricks - java.lang.NoClassDefFoundError: org/json/JSONException

We can't figure out the following issue: we are trying to use Apache Hudi to save data to storage. The problem is that when we upload a fat jar which includes the org.json package among its dependencies, the df.save() call fails with
java.lang.NoClassDefFoundError: org/json/JSONException
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeCreateTable(SemanticAnalyzer.java:10847)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genResolvedParseTree(SemanticAnalyzer.java:10047)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10128)
at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:209)
at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:227)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:424)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:308)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1122)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1170)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1059)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049)
at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLs(HoodieHiveClient.java:384)
at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLUsingHiveDriver(HoodieHiveClient.java:367)
at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:357)
at org.apache.hudi.hive.HoodieHiveClient.createTable(HoodieHiveClient.java:262)
at org.apache.hudi.hive.HiveSyncTool.syncSchema(HiveSyncTool.java:176)
at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:130)
at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:94)
at org.apache.hudi.HoodieSparkSqlWriter$.org$apache$hudi$HoodieSparkSqlWriter$$syncHive(HoodieSparkSqlWriter.scala:321)
at org.apache.hudi.HoodieSparkSqlWriter$$anonfun$metaSync$2.apply(HoodieSparkSqlWriter.scala:363)
at org.apache.hudi.HoodieSparkSqlWriter$$anonfun$metaSync$2.apply(HoodieSparkSqlWriter.scala:359)
Even if I go to the cluster libraries and explicitly add this dependency, it still fails on save. On the other hand, when I just create a new JSONException("hello") in my notebook, everything seems to work fine. What could cause this behaviour? Thanks
This is probably because the jar is not making its way to the executor nodes; try addJar (https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html#addJar-java.lang.String-).
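For instance, a minimal sketch from a notebook (the jar path is a hypothetical placeholder for wherever your fat jar is stored):
// Hedged sketch: register the fat jar with the SparkContext so executors can load it.
// The path below is a placeholder, not a real location from the question.
spark.sparkContext.addJar("/dbfs/FileStore/jars/my-fat-assembly.jar")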
What version of Hudi are you using? There is a problem with JSON in version 0.6.0, and there is an open issue for it. I suggest you use version 0.5.2 for now.
It turns out that the problem was a classpath mismatch between the metastore service and the Spark process, because they run in separate JVMs. The problem was fixed with an init script that downloads the jar into the classpath folder.

What is the equivalent of ScalaInstance.jars.classpath in sbt 1.3.3?

I am trying to move a scala project from sbt 1.2.7 to 1.3.3 and there are a lot of errors. One of them is
value jars is not a member of sbt.internal.inc.ScalaInstance
scalaInstance.map( _.jars.classpath).value
The ScalaInstance class can be found in the Zinc repo, and it does not have jars as a member in the latest master. What would be the equivalent code now?
You can use .allJars instead.
I'm not sure if the .classpath extension method will work, so you might need to tweak that.
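A minimal sketch of the migrated expression, assuming the surrounding task only needs the plain jar files (in the newer ScalaInstance, allJars returns an Array[File]):
// sbt 1.3.x: ScalaInstance no longer exposes `jars`; use `allJars` instead.
// If the old `.classpath` enrichment no longer applies, converting to a Seq[File]
// here (and wrapping with Attributed downstream, if needed) is one way to tweak it.
scalaInstance.map(_.allJars.toSeq).value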

Why does sbt keep downloading my snapshot dependencies?

I have an SBT project that depends on two snapshot dependencies. Every time I build it, it goes off to the remote repository to fetch the dependencies. This is true even if I set offline := true.
When I look at how it is trying to resolve the local dependencies, the build is saying it is looking in "local", i.e., ~/.ivy2/local/... -- which is a nonexistent directory.
The jars are in ~/.ivy2/cache/... and this is where SBT downloads them when it pulls the dependencies from the remote server.
I have searched my .sbt and .scala build files and the string "local" does not appear in them in connection with a repository or cache.
SBT is at version 0.13.11 building against scala 2.11.8.
Why is SBT doing this, and how can I get it to see the cached jars?
If you want to prevent SBT from trying to download from the official repositories, you could simply create a file project/offline-repositories:
[repositories]
mirror-central: file:////nexus/central
mirror-maven-central-org: file:////nexus/maven-central-org
...
(/nexus/central and /nexus/maven-central-org should contain a (partial) mirror of what you need offline)
Then call sbt with the sbt.repository.config property configured:
-Dsbt.override.build.repos=true \
-Dsbt.repository.config=./project/offline-repositories
For Reference:
http://www.scala-sbt.org/0.13/docs/Proxy-Repositories.html
How to prevent SBT from trying to download from official repositories?
EDIT
If you want to use your ~/.m2 cache:
[repositories]
mirror-central: file:////home/XXXXX/.m2/repository
mirror-maven-central-org: file:////home/XXXXX/.m2/repository
...
This apparently happened because my Ivy cache contained a file named ~/.ivy2/cache/com.xxx/xxx-utils/ivy-2.3.2-SNAPSHOT.xml.original, which the build was trying and failing to parse. I'm not sure where this file came from; conceivably it was put there manually ages ago.

Installing SBT on Win 7 64 bit

I want to install Apache Spark for testing purposes. For that, I found out that Scala and sbt are necessary. I downloaded the Scala MSI and installed it. For installing sbt I tried various methods but was unable to do so. Can someone tell me what I am doing wrong? What I did is:
Install Scala msi
Download sbt msi and install it.
Set SBT_HOME and the PATH variable to the location where sbt is extracted. Then I opened cmd to check my sbt version using sbt sbt-version, and I get the following error:
unresolved dependency: org.fusesource.jansi#jansi;1.11: not found
Error during sbt execution: Error retrieving required libraries (see C:\Users\ashish-b\.sbt\boot\update.log for complete log)
Error: Could not retrieve jansi 1.11
What's wrong here?
I saw this issue as well when connecting to the internet via a corporate proxy. In this case, sbt couldn't download its dependencies.
We work with a proxy Maven repository for dependencies. Configure sbt to use a proxy repo.
Our sbt repositories file looks like this:
[repositories]
local
local-maven: file:///C:/data/maven_repo/
aaa-ext-ivy-proxy: http://nexus-ext.company.net:8081/nexus/content/groups/ivy-public/, [organization]/[module]/(scala_[scalaVersion]/)(sbt_[sbtVersion]/)[revision]/[type]s/[artifact](-[classifier]).[ext]
aaa-ext-maven-proxy: http://nexus-ext.company.net:8081/nexus/content/groups/public/
aaa-int-maven-repo: http://nexus-int.company.net:8081/nexus/content/groups/public/
Or you can configure the proxy server directly for SBT; see this question.

Using SBT on a remote node without internet access via SSH

I am trying to write a Spark program with Scala on a remote machine, but that machine has no internet access. Since I am using the pre-built version for Hadoop, I am able to run the pre-compiled examples:
[user#host spark-0.7.2]$ ./run spark.examples.LocalPi
but I can't compile anything that references spark on the machine:
$ scalac PiEstimate.scala
PiEstimate.scala:1: error: not found: object spark
import spark.SparkContext
^
Normally, I would use SBT to take care of any dependencies, but the machine does not have internet access, and tunneling internet through SSH is not possible.
Is it possible to compile an SBT project on a remote machine that has no internet access? Or how could I manually link the Spark dependencies to the Scala compiler?
If you're compiling your Spark program through scalac, you'll have to add Spark's jars to scalac's classpath; I think this should work:
scalac -classpath "$SPARK_HOME/target/scala-*/*.jar" PiEstimate.scala
I know this is an old post, but I had to deal with this issue recently. I solved it by removing the dependencies from my .sbt file and adding the Spark jar (spark-home/assembly/target/scala-2.10/spark-[...].jar) under the my-project-dir/lib directory. You can also point to it using unmanagedBase := file("/path/to/jars/"). Then I could use sbt package as usual.
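For completeness, a minimal build.sbt sketch of that unmanaged-jar approach (the project name, Scala version, and jar location are illustrative placeholders, not values from the original post):
// Hedged sketch: rely on unmanaged jars instead of declaring a Spark dependency.
name := "pi-estimate"            // placeholder project name
scalaVersion := "2.10.4"         // placeholder; match the Scala version of your Spark build

// Jars dropped into <project>/lib are picked up automatically; unmanagedBase lets you
// point sbt at a different directory of jars instead.
unmanagedBase := baseDirectory.value / "lib"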