java.lang.NoClassDefFoundError: Could not initialize class XXXXXXXX in Scala Spark

I have written Scala Spark code for my project (the IDE is IntelliJ). It runs fine locally, but shows this error when running on an AWS EMR cluster.
It was failing at the commented line below:
var join_sql="select ipfile.id,ipfile.col1,opfile.col2 from ipfile join opfile on ipfile.id=opfile.id"
var df1=Operation.spark.sql(join_sql)
df1.createOrReplaceTempView("df1")
var df2 = df1.groupBy("col1","col2").count()
df2.createOrReplaceTempView("df2")
df2=Operation.spark.sql("select * from df2 order by count desc")
print("count : ",df2.count())
try {
df2.foreach(t => {
impact=t.getAs[Long]("impact").toString // Job was aborting at this particular line
m1 = t.getAs[String]("col1")
m2=t.getAs[String]("col2")
print("m1" + "m2" )
})
When I built the jar with sbt assembly and ran it in local mode, it worked fine, but when I built the jar for yarn-client and ran it in cluster mode, it failed with this error.
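A common cause of "Could not initialize class" on executors, and a sketch of a workaround (not a confirmed fix for this exact job, since the full Operation object is not shown): impact, m1 and m2 are driver-side vars, so referencing them inside foreach forces Spark to serialize and re-initialize the enclosing object on every executor. Because df2 is already a small aggregate, it can be collected to the driver and iterated there instead:

val rows = df2.collect() // hypothetical rework: bring the small aggregate to the driver
rows.foreach { t =>
  val m1  = t.getAs[String]("col1")
  val m2  = t.getAs[String]("col2")
  val cnt = t.getAs[Long]("count") // groupBy(...).count() names this column "count"
  println(s"$m1 $m2 $cnt")
}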

Related

Setup of Scala/Flink project using Bazel

I am trying to set up a simple Flink application from scratch using Bazel. I've bootstrapped the project by running
sbt new tillrohrmann/flink-project.g8
and after that I added some files so that Bazel takes control of the build (i.e., migrating from sbt). This is what the WORKSPACE looks like:
# WORKSPACE
load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")
skylib_version = "1.0.3"
http_archive(
    name = "bazel_skylib",
    sha256 = "1c531376ac7e5a180e0237938a2536de0c54d93f5c278634818e0efc952dd56c",
    type = "tar.gz",
    url = "https://mirror.bazel.build/github.com/bazelbuild/bazel-skylib/releases/download/{}/bazel-skylib-{}.tar.gz".format(skylib_version, skylib_version),
)
rules_scala_version = "5df8033f752be64fbe2cedfd1bdbad56e2033b15"
http_archive(
    name = "io_bazel_rules_scala",
    sha256 = "b7fa29db72408a972e6b6685d1bc17465b3108b620cb56d9b1700cf6f70f624a",
    strip_prefix = "rules_scala-%s" % rules_scala_version,
    type = "zip",
    url = "https://github.com/bazelbuild/rules_scala/archive/%s.zip" % rules_scala_version,
)
# Stores Scala version and other configuration
# 2.12 is the default version; other versions can be used by passing them explicitly:
load("@io_bazel_rules_scala//:scala_config.bzl", "scala_config")
scala_config(scala_version = "2.12.11")
load("@io_bazel_rules_scala//scala:scala.bzl", "scala_repositories")
scala_repositories()
load("@io_bazel_rules_scala//scala:toolchains.bzl", "scala_register_toolchains")
scala_register_toolchains()
load("@io_bazel_rules_scala//scala:scala.bzl", "scala_library", "scala_binary", "scala_test")
# optional: setup ScalaTest toolchain and dependencies
load("@io_bazel_rules_scala//testing:scalatest.bzl", "scalatest_repositories", "scalatest_toolchain")
scalatest_repositories()
scalatest_toolchain()
load("//vendor:workspace.bzl", "maven_dependencies")
maven_dependencies()
load("//vendor:target_file.bzl", "build_external_workspace")
build_external_workspace(name = "vendor")
and this is the BUILD file
package(default_visibility = ["//visibility:public"])
load("@io_bazel_rules_scala//scala:scala.bzl", "scala_library", "scala_test")
scala_library(
    name = "job",
    srcs = glob(["src/main/scala/**/*.scala"]),
    deps = [
        "@vendor//vendor/org/apache/flink:flink_clients",
        "@vendor//vendor/org/apache/flink:flink_scala",
        "@vendor//vendor/org/apache/flink:flink_streaming_scala",
    ],
)
I'm using bazel-deps for vendoring the dependencies (put in the vendor folder). I have this in my dependencies.yaml file:
options:
  buildHeader: [
    "load(\"@io_bazel_rules_scala//scala:scala_import.bzl\", \"scala_import\")",
    "load(\"@io_bazel_rules_scala//scala:scala.bzl\", \"scala_library\", \"scala_binary\", \"scala_test\")",
  ]
  languages: [ "java", "scala:2.12.11" ]
  resolverType: "coursier"
  thirdPartyDirectory: "vendor"
  resolvers:
    - id: "mavencentral"
      type: "default"
      url: https://repo.maven.apache.org/maven2/
  strictVisibility: true
  transitivity: runtime_deps
  versionConflictPolicy: highest
dependencies:
  org.apache.flink:
    flink:
      lang: scala
      version: "1.11.2"
      modules: [clients, scala, streaming-scala] # provided
    flink-connector-kafka:
      lang: java
      version: "0.10.2"
    flink-test-utils:
      lang: java
      version: "0.10.2"
For downloading the dependencies, I'm running
bazel run //:parse generate -- --repo-root ~/Projects/bazel-flink-scala --sha-file vendor/workspace.bzl --target-file vendor/target_file.bzl --deps dependencies.yaml
This runs just fine, but then when I try to build the project
bazel build //:job
I'm getting this error
Starting local Bazel server and connecting to it...
ERROR: Traceback (most recent call last):
File "/Users/salvalcantara/Projects/me/bazel-flink-scala/WORKSPACE", line 44, column 25, in <toplevel>
build_external_workspace(name = "vendor")
File "/Users/salvalcantara/Projects/me/bazel-flink-scala/vendor/target_file.bzl", line 258, column 91, in build_external_workspace
return build_external_workspace_from_opts(name = name, target_configs = list_target_data(), separator = list_target_data_separator(), build_header = build_header())
File "/Users/salvalcantara/Projects/me/bazel-flink-scala/vendor/target_file.bzl", line 251, column 40, in list_target_data
"vendor/org/apache/flink:flink_clients": ["lang||||||scala:2.12.11","name||||||//vendor/org/apache/flink:flink_clients","visibility||||||//visibility:public","kind||||||import","deps|||L|||","jars|||L|||//external:jar/org/apache/flink/flink_clients_2_12","sources|||L|||","exports|||L|||","runtimeDeps|||L|||//vendor/commons_cli:commons_cli|||//vendor/org/slf4j:slf4j_api|||//vendor/org/apache/flink:force_shading|||//vendor/com/google/code/findbugs:jsr305|||//vendor/org/apache/flink:flink_streaming_java_2_12|||//vendor/org/apache/flink:flink_core|||//vendor/org/apache/flink:flink_java|||//vendor/org/apache/flink:flink_runtime_2_12|||//vendor/org/apache/flink:flink_optimizer_2_12","processorClasses|||L|||","generatesApi|||B|||false","licenses|||L|||","generateNeverlink|||B|||false"],
Error: dictionary expression has duplicate key: "vendor/org/apache/flink:flink_clients"
ERROR: error loading package 'external': Package 'external' contains errors
INFO: Elapsed time: 3.644s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (0 packages loaded)
Why is that? Can anyone help? It would be great to have detailed instructions and project templates for Flink/Scala applications using Bazel. I've put everything together in the following repo: https://github.com/salvalcantara/bazel-flink-scala; feel free to send a PR.

How to integrate PostgreSQL with Corda 3.0 or with Corda 4.0?

I tried to configure PostgreSQL as the node's database in Corda 3.0 and in Corda 4.0.
I have added the following to the build.gradle file (Testdb1 is the database name; I have also tried with postgres):
node {
    ...
    // this part I have added
    extraConfig = [
        jarDirs: ['path'],
        'dataSourceProperties': [
            'dataSourceClassName': 'org.postgresql.ds.PGSimpleDataSource',
            '"dataSource.url"'     : 'jdbc:postgresql://127.0.0.1:5432/Testdb1',
            '"dataSource.user"'    : 'postgres',
            '"dataSource.password"': 'admin#123'
        ],
        'database': [
            'transactionIsolationLevel': 'READ_COMMITTED'
        ]
    ]
    // till here
}
and this part is in the reference.conf file:
dataSourceProperties = {
    dataSourceClassName = org.postgresql.ds.PGSimpleDataSource
    dataSource.url = "jdbc:postgresql://127.0.0.1:5432/Testdb1"
    dataSource.user = postgres
    dataSource.password = "admin#123"
}
database = {
    transactionIsolationLevel = "READ_COMMITTED"
}
jarDirs = ["path"]
I got the following error while deploying the nodes:
FAILURE: Build failed with an exception.
* What went wrong:
Execution failed for task ':java-source:deployNodes'.
The node-info-gen.log file showed a CAPSULE EXCEPTION. I then updated my JDK to 8u191 but still got the same error.
I have gone through the following references to get this working:
https://docs.corda.net/node-database.html ,
https://github.com/corda/corda/issues/4037 ,
How can the Corda node be extended to work with databases other than H2?
You need to add those properties to node.conf in each of your Corda nodes, after you run "deployNodes".
After you add these properties to the node.conf file, just run the Corda jar; it will start automatically. But before that you need to create the tables (the migration to other databases is already covered in the Corda documentation).
I have added the following to the .conf file of each node and to one reference.conf file. I have granted the user postgres all the privileges mentioned in the Corda documentation:
https://docs.corda.r3.com/node-database.html
(Previously I used the postgresql-42.2.5.jar file, but that didn't work, so I used a downgraded version of it, postgresql-42.1.4.jar;
the jar files can be downloaded from https://jdbc.postgresql.org/download.html.)
After the nodes are deployed successfully, add the following:
dataSourceProperties = {
    dataSourceClassName = org.postgresql.ds.PGSimpleDataSource
    dataSource.url = "jdbc:postgresql://127.0.0.1:5432/Testdb1"
    dataSource.user = postgres
    dataSource.password = "admin#123"
}
database = {
    transactionIsolationLevel = "READ_COMMITTED"
}
jarDirs = ["path"]
(path = the jar file's location). After adding this configuration, run the runnodes.bat file.

Connecting to DSE Graph running on a Docker Container results into No host found

I am running my DSE Graph inside a Docker container with this command.
docker run -e DS_LICENSE=accept -p 9042:9042 --name my-dse -d datastax/dse-server -g -k -s
In my Scala code I am referencing it as follows:
// imports assumed from the DSE Java driver, Guava, TinkerPop and gremlin-scala
import com.datastax.driver.dse.DseCluster
import com.datastax.driver.dse.graph.{GraphOptions, SimpleGraphStatement}
import com.datastax.dse.graph.internal.DseRemoteConnection
import com.google.common.collect.ImmutableMap
import gremlin.scala._
import org.apache.tinkerpop.gremlin.structure.util.empty.EmptyGraph

object GrDbConnection {
  val dseCluster = DseCluster.builder()
    .addContactPoints("127.0.0.1").withPort(9042)
    .build()
  val graphName = "graphName"
  val graphOptions = new GraphOptions()
    .setGraphName(graphName)
  var graph: ScalaGraph = null
  try {
    val session = dseCluster.connect()
    // The following uses the DSE graph schema API, which is currently only supported by the string-based
    // execution interface. Eventually there will be a programmatic API for making schema changes, but until
    // then this needs to be used.
    // Create graph
    session.executeGraph("system.graph(name).ifNotExists().create()", ImmutableMap.of("name", graphName))
    // Clear the schema to drop any existing data and schema
    session.executeGraph(new SimpleGraphStatement("schema.clear()").setGraphName(graphName))
    // Note: typically you would not want to use development mode and allow scans, but it is good for convenience
    // and experimentation during development.
    // Enable development mode and allow scans
    session.executeGraph(new SimpleGraphStatement("schema.config().option('graph.schema_mode').set('development')")
      .setGraphName(graphName))
    session.executeGraph(new SimpleGraphStatement("schema.config().option('graph.allow_scan').set('true')")
      .setGraphName(graphName))
    // Create a ScalaGraph from a remote Traversal Source using withRemote
    // See: http://tinkerpop.apache.org/docs/current/reference/#connecting-via-remotegraph for more details
    val connection = DseRemoteConnection.builder(session)
      .withGraphOptions(graphOptions)
      .build()
    graph = EmptyGraph.instance().asScala
      .configure(_.withRemote(connection))
  } finally {
    dseCluster.close()
  }
}
Then, in one of my controllers, I use this to invoke a query against the DSE graph:
def test = Action {
  val r = GrDbConnection.graph.V().count()
  print(r.iterate())
  Ok
}
This returns an error
play.api.http.HttpErrorHandlerExceptions$$anon$1: Execution exception[[NoHostAvailableException: All host(s) tried for query failed (no host was tried)]]
at play.api.http.HttpErrorHandlerExceptions$.throwableToUsefulException(HttpErrorHandler.scala:251)
at play.api.http.DefaultHttpErrorHandler.onServerError(HttpErrorHandler.scala:178)
at play.core.server.AkkaHttpServer$$anonfun$1.applyOrElse(AkkaHttpServer.scala:363)
at play.core.server.AkkaHttpServer$$anonfun$1.applyOrElse(AkkaHttpServer.scala:361)
at scala.concurrent.Future.$anonfun$recoverWith$1(Future.scala:413)
at scala.concurrent.impl.Promise.$anonfun$transformWith$1(Promise.scala:37)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
at akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:91)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (no host was tried)
at com.datastax.driver.core.RequestHandler.reportNoMoreHosts(RequestHandler.java:221)
at com.datastax.driver.core.RequestHandler.access$1000(RequestHandler.java:41)
at com.datastax.driver.core.RequestHandler$SpeculativeExecution.findNextHostAndQuery(RequestHandler.java:292)
at com.datastax.driver.core.RequestHandler.startNewExecution(RequestHandler.java:109)
at com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:89)
at com.datastax.driver.core.SessionManager.executeAsync(SessionManager.java:124)
at com.datastax.driver.dse.DefaultDseSession.executeGraphAsync(DefaultDseSession.java:123)
at com.datastax.dse.graph.internal.DseRemoteConnection.submitAsync(DseRemoteConnection.java:74)
at org.apache.tinkerpop.gremlin.process.remote.traversal.step.map.RemoteStep.promise(RemoteStep.java:89)
at org.apache.tinkerpop.gremlin.process.remote.traversal.step.map.RemoteStep.processNextStart(RemoteStep.java:65)
It turns out all I needed to do was to remove
finally {
  dseCluster.close()
}
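That makes sense: the object initializer was closing the cluster immediately after building the remote connection, so by the time the controller ran a traversal there were no live hosts left. A minimal sketch of the resulting shape (the shutdown helper is hypothetical; wire it to your framework's stop hook), keeping the cluster open for the application's lifetime:

object GrDbConnection {
  val dseCluster = DseCluster.builder()
    .addContactPoints("127.0.0.1").withPort(9042)
    .build()
  val session = dseCluster.connect()
  val graph: ScalaGraph = EmptyGraph.instance().asScala
    .configure(_.withRemote(
      DseRemoteConnection.builder(session)
        .withGraphOptions(new GraphOptions().setGraphName("graphName"))
        .build()))
  // Hypothetical helper: call this from an application lifecycle/stop hook, not at init time.
  def shutdown(): Unit = dseCluster.close()
}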

elastic4s NoNodeAvailableException when connecting through TcpClient.transport

I am trying to get my hands on elastic4s using one of the samples given
here,
but I keep getting the exception below when trying to connect through TcpClient.transport:
Exception in thread "main" NoNodeAvailableException[None of the configured nodes are available: [{#transport#-1}{IFyYWnE_S4aHRVxT9v60LQ}{dockerhost}{192.168.99.100:9300}]]
I am trying to connect to an Elasticsearch instance running on Docker; the Elasticsearch version is 2.3.4.
Here is my code, with the dependencies below it.
import com.sksamuel.elastic4s.{ElasticClient, ElasticsearchClientUri, TcpClient}
import org.elasticsearch.action.support.WriteRequest.RefreshPolicy
import org.elasticsearch.common.settings.Settings
import com.sksamuel.elastic4s.ElasticDsl._

object Main extends App {
  //val settings = Settings.builder().put("cluster.name", "elasticsearch").build()
  val client = TcpClient.transport(ElasticsearchClientUri("elasticsearch://dockerhost:9300"))
  client.execute {
    bulk(
      indexInto("myindex" / "mytype").fields("country" -> "Mongolia", "capital" -> "Ulaanbaatar"),
      indexInto("myindex" / "mytype").fields("country" -> "Namibia", "capital" -> "Windhoek")
    ).refresh(RefreshPolicy.WAIT_UNTIL)
  }.await
  val result = client.execute {
    search("myindex").matchQuery("capital", "ulaanbaatar")
  }.await
  println(result.hits.head.sourceAsString)
  client.close()
}
build.gradle:
compile group: 'com.sksamuel.elastic4s', name: 'elastic4s-core_2.11', version: '5.4.9'
compile group: 'com.sksamuel.elastic4s', name: 'elastic4s-tcp_2.11', version: '5.4.9'
compile group: 'com.sksamuel.elastic4s', name: 'elastic4s-http_2.11', version: '5.4.9'
compile group: 'com.sksamuel.elastic4s', name: 'elastic4s-streams_2.11', version: '5.4.9'
Any help regarding this issue would be appreciated.
I am asked this question a lot, and 99% of the time the answer is one of the following:
1. The cluster name is not the default (elasticsearch), and therefore it must be specified in the connection string.
2. The server is not set up to listen outside of localhost. See https://www.elastic.co/blog/elasticsearch-unplugged
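For the first case, here is a minimal sketch of passing a non-default cluster name explicitly ("docker-cluster" is a placeholder for whatever cluster.name your Docker image uses, and the Settings-based transport overload is assumed from elastic4s 5.x, as the commented-out settings line in the question hints):

// Replace "docker-cluster" with the value of cluster.name from your elasticsearch.yml.
val settings = Settings.builder().put("cluster.name", "docker-cluster").build()
val client = TcpClient.transport(settings, ElasticsearchClientUri("elasticsearch://dockerhost:9300"))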

Cannot connect to Cassandra using Astyanax

I am running the latest version of Cassandra on my Mac inside Docker. I have installed Cassandra DevCenter on my Mac, and I can connect to Cassandra using DevCenter and easily query my tables.
http://i.imgur.com/48MztBM.png
http://i.imgur.com/jPKblly.png
I can see that all the required ports for Cassandra are also open (as shown below).
CONTAINER ID   IMAGE              COMMAND                  CREATED         STATUS         PORTS                                                                                                      NAMES
e698e61e3bac   cassandra:latest   "/docker-entrypoint.s"   9 minutes ago   Up 9 minutes   0.0.0.0:7000->7000/tcp, 0.0.0.0:7199->7199/tcp, 0.0.0.0:9042->9042/tcp, 0.0.0.0:9160->9160/tcp, 7001/tcp   cassandra
I also logged in to my Cassandra container and executed the following commands:
root#e698e61e3bac:/# nodetool -h 127.0.0.1 enablethrift
root#e698e61e3bac:/# nodetool -h 127.0.0.1 statusthrift
running
However, when I run the code below
// imports assumed for completeness (Astyanax and standard library)
import java.util.UUID

import com.netflix.astyanax.AstyanaxContext
import com.netflix.astyanax.connectionpool.NodeDiscoveryType
import com.netflix.astyanax.connectionpool.impl.{ConnectionPoolConfigurationImpl, CountingConnectionPoolMonitor}
import com.netflix.astyanax.impl.AstyanaxConfigurationImpl
import com.netflix.astyanax.model.ColumnFamily
import com.netflix.astyanax.serializers.{StringSerializer, UUIDSerializer}
import com.netflix.astyanax.thrift.ThriftFamilyFactory

import scala.collection.JavaConversions._

val configImpl = new AstyanaxConfigurationImpl()
configImpl.setDiscoveryType(NodeDiscoveryType.RING_DESCRIBE)
configImpl.setCqlVersion("3.4.2")
configImpl.setTargetCassandraVersion("3.7")

val poolConfig = new ConnectionPoolConfigurationImpl("MyConnectionPool")
poolConfig.setPort(9160)
poolConfig.setMaxConnsPerHost(1)
poolConfig.setSeeds("192.168.1.169")

val context = new AstyanaxContext.Builder()
  .forCluster("localhost")
  .forKeyspace("movielens_small")
  .withAstyanaxConfiguration(configImpl)
  .withConnectionPoolConfiguration(poolConfig)
  .withConnectionPoolMonitor(new CountingConnectionPoolMonitor())
  .buildKeyspace(ThriftFamilyFactory.getInstance())
context.start()

val keyspace = context.getClient()
val cf = new ColumnFamily[UUID, String]("cf", UUIDSerializer.get, StringSerializer.get())
val result = keyspace.prepareQuery(cf).withCql("select name from movies").execute()
val data = result.getResult.getRows()
for {
  row <- data
  col <- row.getColumns
} {
  println(col)
}
context.shutdown()
I get the following error
07:32:10,606 INFO ConnectionPoolMBeanManager:53 - Registering mbean: com.netflix.MonitoredResources:type=ASTYANAX,name=MyConnectionPool,ServiceType=connectionpool
07:32:10,625 INFO CountingConnectionPoolMonitor:194 - AddHost: 192.168.1.169
07:32:10,748 INFO CountingConnectionPoolMonitor:194 - AddHost: 172.17.0.2
07:32:10,748 INFO CountingConnectionPoolMonitor:205 - RemoveHost: 192.168.1.169
[error] (run-main-0) com.netflix.astyanax.connectionpool.exceptions.PoolTimeoutException: PoolTimeoutException: [host=172.17.0.2(172.17.0.2):9160, latency=2003(2003), attempts=1]Timed out waiting for connection
com.netflix.astyanax.connectionpool.exceptions.PoolTimeoutException: PoolTimeoutException: [host=172.17.0.2(172.17.0.2):9160, latency=2003(2003), attempts=1]Timed out waiting for connection
at com.netflix.astyanax.connectionpool.impl.SimpleHostConnectionPool.waitForConnection(SimpleHostConnectionPool.java:231)
at com.netflix.astyanax.connectionpool.impl.SimpleHostConnectionPool.borrowConnection(SimpleHostConnectionPool.java:198)
at com.netflix.astyanax.connectionpool.impl.RoundRobinExecuteWithFailover.borrowConnection(RoundRobinExecuteWithFailover.java:84)
at com.netflix.astyanax.connectionpool.impl.AbstractExecuteWithFailoverImpl.tryOperation(AbstractExecuteWithFailoverImpl.java:117)
at com.netflix.astyanax.connectionpool.impl.AbstractHostPartitionConnectionPool.executeWithFailover(AbstractHostPartitionConnectionPool.java:352)
at com.netflix.astyanax.thrift.AbstractThriftCqlQuery.execute(AbstractThriftCqlQuery.java:41)
at com.abhi.CassandraScanner$.delayedEndpoint$com$abhi$CassandraScanner$1(CassandraScanner.scala:41)
at com.abhi.CassandraScanner$delayedInit$body.apply(CassandraScanner.scala:19)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App.$anonfun$main$1$adapted(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:378)
at scala.App.main(App.scala:76)
at scala.App.main$(App.scala:74)
I was able to solve the problem. The problem was with the NodeDiscoveryType: I changed my code to NodeDiscoveryType.NONE and it worked perfectly. However, I still don't understand why it's necessary for the NodeDiscoveryType to be NONE and why it won't work with RING_DESCRIBE.
Hopefully someone can elaborate further.
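One plausible explanation, inferred from the connection pool log above (an inference, not a confirmed answer): with RING_DESCRIBE the driver asks Cassandra for its ring members and replaces the reachable seed 192.168.1.169 with the container-internal address 172.17.0.2, which is not routable from the host, so the pool times out waiting for a connection. NodeDiscoveryType.NONE keeps the configured seed. The fix is the single line:

// Use only the explicitly configured seed(s) instead of ring-discovered,
// container-internal addresses such as 172.17.0.2.
configImpl.setDiscoveryType(NodeDiscoveryType.NONE)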