Trying to load JanusGraph with local Cassandra backend and Elasticsearch - Scala

I am using Spark to load JanusGraph with a Cassandra backend and Elasticsearch, both running locally.
import org.apache.commons.configuration.BaseConfiguration
import org.janusgraph.core.JanusGraphFactory

val conf = new BaseConfiguration()
conf.setProperty("storage.backend", "cassandrathrift")
conf.setProperty("storage.hostname", "127.0.0.1")
conf.setProperty("index.search.backend", "elasticsearch")
conf.setProperty("index.search.hostname", "127.0.0.1")
val gr = JanusGraphFactory.open(conf)
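For context, the cassandrathrift and elasticsearch backends live in separate JanusGraph adapter modules that must be on the Spark application's classpath. A minimal build.sbt sketch of what this snippet assumes (artifact names follow the JanusGraph 0.x layout; the version is an assumption to match your installation):

libraryDependencies ++= Seq(
  // Core graph API (JanusGraphFactory); pulls in commons-configuration for BaseConfiguration
  "org.janusgraph" % "janusgraph-core"      % "0.2.2",
  // Contains org.janusgraph.diskstorage.cassandra.thrift.CassandraThriftStoreManager (storage.backend=cassandrathrift)
  "org.janusgraph" % "janusgraph-cassandra" % "0.2.2",
  // Elasticsearch index backend (index.search.backend=elasticsearch)
  "org.janusgraph" % "janusgraph-es"        % "0.2.2"
)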
However, I am getting this error:
main" java.lang.IllegalArgumentException: Could not find implementation class: org.janusgraph.diskstorage.cassandra.thrift.CassandraThriftStoreManager
at org.janusgraph.util.system.ConfigurationUtil.instantiate(ConfigurationUtil.java:61)
at org.janusgraph.diskstorage.Backend.getImplementationClass(Backend.java:477)
at org.janusgraph.diskstorage.Backend.getStorageManager(Backend.java:409)
at org.janusgraph.graphdb.configuration.GraphDatabaseConfiguration.<init>(GraphDatabaseConfiguration.java:1376)
at org.janusgraph.core.JanusGraphFactory.open(JanusGraphFactory.java:164)
at org.janusgraph.core.JanusGraphFactory.open(JanusGraphFactory.java:133)
at org.janusgraph.core.JanusGraphFactory.open(JanusGraphFactory.java:113)
at Test2$$anonfun$main$1.apply$mcVI$sp(make_graph3.scala:177)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
at Test2$.main(make_graph3.scala:162)
at Test2.main(make_graph3.scala)
Caused by: java.lang.ClassNotFoundException: org.janusgraph.diskstorage.cassandra.thrift.CassandraThriftStoreManager
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:338)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at org.janusgraph.util.system.ConfigurationUtil.instantiate(ConfigurationUtil.java:56)
... 10 more
I ran jps in the terminal, and this is the output:
32689 NailgunRunner
44609 GremlinServer
32162
44917 Jps
44326 CassandraDaemon
44744 Launcher
44509 Elasticsearch
So both Cassandra and Elasticsearch are running. What could be the issue?
Thanks in advance.

Related

Load data from Redshift using Spark and Scala in an EMR

I am trying to connect to Redshift using Spark with Scala in Zeppelin on an EMR cluster. I used the spark-redshift library, but it doesn't work. I have tried many solutions, and I don't know why it gives an error:
val df = spark.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://xx:xx/xxxx?user=xxx&password=xxx")
  .option("tempdir", path)
  .option("query", sql_query)
  .load()
java.lang.ClassNotFoundException: Failed to find data source:
com.databricks.spark.redshift. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:657)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
... 51 elided
Caused by: java.lang.ClassNotFoundException: com.databricks.spark.redshift.DefaultSource
at scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
at scala.util.Try.orElse(Try.scala:84)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:634)
... 53 more
Should I import something first? Or maybe do some configuration?
In order to run specific modules within EMR, you must add those modules to your cluster (they are not there automatically).
Your error is saying that it cannot find the modules.
Take a look at https://aws.amazon.com/blogs/big-data/powering-amazon-redshift-analytics-with-apache-spark-and-amazon-machine-learning/
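For example, in Zeppelin one common way to add the spark-redshift connector is to load it as a dynamic dependency before the Spark interpreter starts. A hedged sketch; the Maven coordinates and version are assumptions and must match your Scala/Spark build:

// Run in a Zeppelin %dep paragraph, before any Spark code has executed in the notebook;
// z is the Zeppelin context, and z.load resolves the artifact from Maven for the interpreter.
z.load("com.databricks:spark-redshift_2.11:3.0.0-preview1")

Alternatively, the same coordinates can be set on the Spark interpreter's spark.jars.packages property, or passed with --packages when jobs are submitted to the cluster.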

How to specify the data source in spark.read.format when using the DataDirect JDBC driver of Greenplum (greenplum.jar) to read a Greenplum table?

I am trying to read data from a table in Greenplum using Spark. I wrote the code as below:
val yearDF = spark.read.format("io.pivotal.greenplum.spark.GreenplumRelationProvider")
  .option("url", connectionUrl)
  .option("server.port", "5432")
  .option("dbtable", "tablename")
  .option("dbschema", "schemaname")
  .option("user", devUserName)
  .option("password", devPassword)
  .option("partitionColumn", "employeeLoc")
  .option("partitions", 450)
  .load()
  .where("period_year=2017 and period_num=12")
  .select(gpColSeq map col: _*)
  .withColumn(flagCol, lit(0))
I am using greenplum.jar, which provides the DataDirect JDBC driver to read data from a Greenplum table using Spark.
I am using the following spark-submit command:
SPARK_MAJOR_VERSION=2 spark-submit --class com.partition.source.YearPartition --master=yarn --conf spark.ui.port=4090 --driver-class-path /home/hdpuser/jars/greenplum.jar,/home/hdpuser/jars/postgresql-42.1.4.jar --conf spark.jars=/home/hdpuser/jars/greenplum.jar,/home/hdpuser/jars/postgresql-42.1.4.jar --executor-cores 3 --executor-memory 13G --keytab /home/hdpuser/hdpuser.keytab --principal hdpuser#devuser.COM --files /usr/hdp/current/spark2-client/conf/hive-site.xml,testconnection.properties --name Splinter --conf spark.executor.extraClassPath=/home/hdpuser/jars/greenplum.jar,/home/hdpuser/jars/postgresql-42.1.4.jar splinter_2.11-0.1.jar
When I submit the jar, I see the exception:
Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: io.pivotal.greenplum.spark.GreenplumRelationProvider. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:553)
at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:89)
at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:89)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:304)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
at com.partition.source.YearPartition$.prepareFinalDF$1(YearPartition.scala:154)
at com.partition.source.YearPartition$.main(YearPartition.scala:181)
at com.partition.source.YearPartition.main(YearPartition.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:782)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: io.pivotal.greenplum.spark.GreenplumRelationProvider.DefaultSource
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$22$$anonfun$apply$14.apply(DataSource.scala:537)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$22$$anonfun$apply$14.apply(DataSource.scala:537)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$22.apply(DataSource.scala:537)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$22.apply(DataSource.scala:537)
at scala.util.Try.orElse(Try.scala:84)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:537)
I understand that this is due to using io.pivotal.greenplum.spark.GreenplumRelationProvider in the data source statement, i.e.
spark.read.format("io.pivotal.greenplum.spark.GreenplumRelationProvider")
I tried "io.pivotal.greenplum.spark.GreenplumRelationProvider" & "greenplum" but both result in the same exception which is:
Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source:
io.pivotal.greenplum.spark.GreenplumRelationProvider. Please find
packages at http://spark.apache.org/third-party-projects.html
I am unable to figure out what I should give as the data source in spark.read.format while using the DataDirect JDBC jar greenplum.jar.
Could anyone let me know how I can fix this problem?
What version of the greenplum-spark connector are you using?
You should be able to specify the custom JDBC driver in the "driver" option; refer to http://greenplum-spark.docs.pivotal.io/160/using_the_connector.html#use_custom_jdbcdriver.
You can specify the data source as follows:
spark.read.format("greenplum")

How to include a JDBC jar in Spark using Maven

I have a Spark (2.1.0) job that uses the Postgres JDBC driver as described here: https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases
I'm using the DataFrame writer like this:
val jdbcURL = s"jdbc:postgresql://${config.pgHost}:${config.pgPort}/${config.pgDatabase}?user=${config.pgUser}&password=${config.pgPassword}"
val connectionProperties = new Properties()
connectionProperties.put("user", config.pgUser)
connectionProperties.put("password", config.pgPassword)
dataFrame.write.mode(SaveMode.Overwrite).jdbc(jdbcURL, tableName, connectionProperties)
I'm successfully including the JDBC driver from https://jdbc.postgresql.org/download/postgresql-42.1.1.jar by downloading it manually and using --jars postgresql-42.1.1.jar --driver-class-path postgresql-42.1.1.jar.
However, I'd prefer to not have to download it first.
I've unsuccessfully tried --jars https://jdbc.postgresql.org/download/postgresql-42.1.1.jar, but that fails with:
Exception in thread "main" java.io.IOException: No FileSystem for scheme: http
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:364)
at org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:480)
at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11$$anonfun$apply$8.apply(Client.scala:600)
at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11$$anonfun$apply$8.apply(Client.scala:599)
at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:74)
at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11.apply(Client.scala:599)
at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11.apply(Client.scala:598)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:598)
at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:868)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:170)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1154)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1213)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I have also tried:
including "org.postgresql" % "postgresql" % "42.1.1" in my build.sbt file
spark-submit options: --repositories https://mvnrepository.com/artifact --packages org.postgresql:postgresql:42.1.1
spark-submit options: --repositories https://mvnrepository.com/artifact --conf "spark.jars.packages=org.postgresql:postgresql:42.1.1"
These each fail the same way:
17/08/01 13:14:49 ERROR yarn.ApplicationMaster: User class threw exception: java.sql.SQLException: No suitable driver
java.sql.SQLException: No suitable driver
at java.sql.DriverManager.getDriver(DriverManager.java:315)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:83)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:34)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:53)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:426)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
at org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:446)
You can copy the JDBC jar file to the jars folder in the Spark directory and deploy your application with spark-submit without the --jars option.
Specify the driver option, like you do user and password, with the JDBC driver class.
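For example, with the Properties object from the question, the driver class is passed alongside the credentials. A minimal sketch of the pattern described above:

import java.util.Properties
import org.apache.spark.sql.SaveMode

val connectionProperties = new Properties()
connectionProperties.put("user", config.pgUser)
connectionProperties.put("password", config.pgPassword)
// Name the driver class explicitly so Spark does not depend on DriverManager discovering it on the executors
connectionProperties.put("driver", "org.postgresql.Driver")

dataFrame.write.mode(SaveMode.Overwrite).jdbc(jdbcURL, tableName, connectionProperties)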

Hive error: Exception in thread "main" java.lang.NoClassDefFoundError: scala/collection/Iterable

I am facing an error when I try to query a table in Hive when Hive uses Spark as its execution engine. For example, when I run:
select count(*) from ma_table;
I get this:
Exception in thread "main" java.lang.NoClassDefFoundError: scala/collection/Iterable
at org.apache.hadoop.hive.ql.parse.spark.GenSparkProcContext.<init>(GenSparkProcContext.java:163)
at org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.generateTaskTree(SparkCompiler.java:195)
at org.apache.hadoop.hive.ql.parse.TaskCompiler.compile(TaskCompiler.java:267)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10947)
at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:246)
at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:250)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:477)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1242)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1384)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1171)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1161)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:232)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:183)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:399)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:776)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:714)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:641)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:234)
at org.apache.hadoop.util.RunJar.main(RunJar.java:148)
Caused by: java.lang.ClassNotFoundException: scala.collection.Iterable
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 23 more
After some research, I tried adding this to my .bashrc (and sourced it ;) ):
for f in ${HIVE_LIB}/*.jar; do
  CLASSPATH=${CLASSPATH}:$f;
done
for f in ${SPARK_HOME}/jars/*.jar; do
  CLASSPATH=${CLASSPATH}:$f;
done
for f in ${SCALA_HOME}/lib/*.jar; do
  CLASSPATH=${CLASSPATH}:$f;
done
I checked and I have all these jars in the CLASSPATH and still get the error.
I am using Hive 2.1, Hadoop 2.8 and Spark 2.1.
Any ideas? Thanks a lot in advance!
After coming back to it, I found that the Scala jar files weren't in Hive's lib directory.
If you are facing something similar, have a look here:
https://www.linkedin.com/pulse/hive-spark-configuration-common-issues-mohamed-k

NoClassDefFoundError raised by Spark when master is set to yarn-client

I have a simple Spark (version 1.4.1) application written in Scala that consumes data from a Kinesis stream. If I run the application using the spark-submit command with the master set to local[*], everything works fine. If I choose yarn-client as the master, I get the following exception:
15/11/24 14:22:09 ERROR ReceiverTracker: Deregistered receiver for stream 1: Error starting receiver 1 - java.lang.NoClassDefFoundError: org/joda/time/format/DateTimeFormat
at com.amazonaws.auth.AWS4Signer.<clinit>(AWS4Signer.java:44)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at java.lang.Class.newInstance(Class.java:442)
at com.amazonaws.auth.SignerFactory.createSigner(SignerFactory.java:119)
at com.amazonaws.auth.SignerFactory.lookupAndCreateSigner(SignerFactory.java:105)
at com.amazonaws.auth.SignerFactory.getSigner(SignerFactory.java:78)
at com.amazonaws.AmazonWebServiceClient.computeSignerByServiceRegion(AmazonWebServiceClient.java:307)
at com.amazonaws.AmazonWebServiceClient.computeSignerByURI(AmazonWebServiceClient.java:280)
at com.amazonaws.AmazonWebServiceClient.setEndpoint(AmazonWebServiceClient.java:160)
at com.amazonaws.services.kinesis.AmazonKinesisClient.setEndpoint(AmazonKinesisClient.java:2102)
at com.amazonaws.services.kinesis.AmazonKinesisClient.init(AmazonKinesisClient.java:216)
at com.amazonaws.services.kinesis.AmazonKinesisClient.<init>(AmazonKinesisClient.java:202)
at com.amazonaws.services.kinesis.AmazonKinesisClient.<init>(AmazonKinesisClient.java:175)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.Worker.<init>(Worker.java:106)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.Worker.<init>(Worker.java:92)
at org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:133)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:125)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:109)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:308)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:300)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: org.joda.time.format.DateTimeFormat
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 31 more
Of course, I have created a fat jar using the sbt assembly plugin that includes the spark-streaming-kinesis-asl_2.10 library, which has joda-time-2.9.1.jar as a dependency. I listed the files contained in my fat jar and the class is present. To be sure of its presence, I also tried to use DateTimeFormat from the main class and had no problem.
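For reference, a minimal sketch of the sbt setup described above (plugin wiring, merge strategy, and exact library versions are assumptions):

// build.sbt (sketch), built with the sbt-assembly plugin
scalaVersion := "2.10.5"

libraryDependencies ++= Seq(
  // Provided by the cluster at runtime
  "org.apache.spark" %% "spark-streaming" % "1.4.1" % "provided",
  // Not bundled with Spark, so it goes into the fat jar; pulls in the AWS SDK and joda-time transitively
  "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.4.1"
)

assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case _                             => MergeStrategy.first
}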
I hope that someone can help me solve this problem.
Thanks.
I suggest checking the classpath entries in the Spark "Application Detail UI" => "Environment" tab to see whether any joda-time entries show up there.