JanusGraph DynamoDB backend exceptions when committing data to the database - scala

Hello: I am using the JanusGraph with DynamoDB example from https://github.com/awslabs/dynamodb-janusgraph-storage-backend
I am connecting to JanusGraph through Spark (Scala) with the Gremlin Scala framework. Everything works when I use Cassandra as the backend, but when I switch to DynamoDB I start getting backend exception errors.
My conf looks like this:
val conf = new BaseConfiguration
conf.setProperty("gremlin.graph","org.janusgraph.core.JanusGraphFactory")
conf.setProperty("storage.write-time","1 ms")
conf.setProperty("storage.read-time","1 ms")
conf.setProperty("storage.backend","com.amazon.janusgraph.diskstorage.dynamodb.DynamoDBStoreManager")
conf.setProperty("storage.dynamodb.client.signing-region","us-east-1")
conf.setProperty("storage.dynamodb.client.endpoint","http://127.0.0.1:8000")
val graph = JanusGraphFactory.open(conf)
I can connect to DynamoDB fine, but when I start inserting data, I run into backend exceptions.
Below is part of the error log
ERROR org.janusgraph.graphdb.database.StandardJanusGraph - Could not commit transaction [1] due to storage exception in system-commit
org.janusgraph.core.JanusGraphException: Could not execute operation due to backend exception
at org.janusgraph.diskstorage.util.BackendOperation.execute(BackendOperation.java:57)
at org.janusgraph.diskstorage.keycolumnvalue.cache.CacheTransaction.persist(CacheTransaction.java:95)
at org.janusgraph.diskstorage.keycolumnvalue.cache.CacheTransaction.flushInternal(CacheTransaction.java:143)
at org.janusgraph.diskstorage.keycolumnvalue.cache.CacheTransaction.commit(CacheTransaction.java:200)
at org.janusgraph.diskstorage.BackendTransaction.commit(BackendTransaction.java:150)
at org.janusgraph.graphdb.database.StandardJanusGraph.commit(StandardJanusGraph.java:703)
at org.janusgraph.graphdb.transaction.StandardJanusGraphTx.commit(StandardJanusGraphTx.java:1363)
at org.janusgraph.graphdb.tinkerpop.JanusGraphBlueprintsGraph$GraphTransaction.doCommit(JanusGraphBlueprintsGraph.java:272)
at org.apache.tinkerpop.gremlin.structure.util.AbstractTransaction.commit(AbstractTransaction.java:105)
at $line81.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1$$anonfun$apply$1.apply(:84)
at $line81.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1$$anonfun$apply$1.apply(:80)
Any idea what is going on here? I am pretty new to DynamoDB; this was working fine with Cassandra.

How do you know you are connected? I think you must provide credentials in your config. For example:
conf.setProperty("storage.dynamodb.client.credentials.class-name", "com.amazonaws.auth.BasicAWSCredentials")
conf.setProperty("storage.dynamodb.client.credentials.constructor-args", "ACCESS_KEY,SECRET_KEY")
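Putting the two credential properties together with the original config, a minimal sketch might look like the following (property names as in the awslabs dynamodb-janusgraph-storage-backend README; ACCESS_KEY/SECRET_KEY are placeholders, and DynamoDB Local typically accepts dummy values):

```scala
import org.apache.commons.configuration.BaseConfiguration
import org.janusgraph.core.JanusGraphFactory

// Sketch only: ACCESS_KEY/SECRET_KEY are placeholders.
val conf = new BaseConfiguration
conf.setProperty("gremlin.graph", "org.janusgraph.core.JanusGraphFactory")
conf.setProperty("storage.backend", "com.amazon.janusgraph.diskstorage.dynamodb.DynamoDBStoreManager")
conf.setProperty("storage.dynamodb.client.signing-region", "us-east-1")
conf.setProperty("storage.dynamodb.client.endpoint", "http://127.0.0.1:8000")
conf.setProperty("storage.dynamodb.client.credentials.class-name", "com.amazonaws.auth.BasicAWSCredentials")
conf.setProperty("storage.dynamodb.client.credentials.constructor-args", "ACCESS_KEY,SECRET_KEY")

val graph = JanusGraphFactory.open(conf)
```

Separately, note that storage.read-time and storage.write-time are JanusGraph's backend operation timeouts (defaults are on the order of seconds); values of "1 ms" as in the question are almost guaranteed to time out on commit, so raising them, or dropping those two lines to keep the defaults, is worth trying if credentials turn out not to be the issue.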

Related

How to speed up writing into Impala from Talend

I'm using Talend Open Studio for Big Data (7.3.1), and I write files from various sources to Cloudera Impala (Cloudera QuickStart 5.13), but that takes too much time and writes only ~3300 rows/s.
Is there a way to raise the write rate to ~10000-100000 rows/s or even higher?
Am I using the wrong approach for the load?
Or do I need to configure Impala/Talend better?
Any advice is welcome!
UPDATE
I installed the Impala JDBC driver, but the output component does not appear to be configured for Impala.
Error:
Exception in component tDBOutput_1 (db_2_impala)
org.talend.components.api.exception.ComponentException: UNEXPECTED_EXCEPTION:{message=[Cloudera]ImpalaJDBCDriver ERROR processing query/statement. Error Code: 0, SQL state: TStatus(statusCode:ERROR_STATUS, sqlState:HY000, errorMessage:AnalysisException: Impala does not support modifying a non-Kudu table: algebra_db.source_data_textfile_2
), Query: DELETE FROM algebra_db.source_data_textfile_2.} at org.talend.components.jdbc.CommonUtils.newComponentException(CommonUtils.java:583)
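The AnalysisException shows the Talend component issuing DELETE FROM before loading, which Impala only supports on Kudu tables. Configuring tDBOutput's table action so it does not delete (e.g. insert only), or clearing the target yourself with TRUNCATE TABLE, avoids the error. A hedged JDBC sketch (the driver class name and connection details are assumptions based on the Cloudera Impala JDBC41 driver; host and table are placeholders):

```scala
import java.sql.DriverManager

// Sketch only: Impala supports TRUNCATE TABLE and INSERT on text-backed tables,
// but DELETE only on Kudu tables, so clear the target with TRUNCATE instead of
// letting the output component issue DELETE FROM.
Class.forName("com.cloudera.impala.jdbc41.Driver") // assumes the Cloudera JDBC41 driver jar
val conn = DriverManager.getConnection("jdbc:impala://quickstart.cloudera:21050/algebra_db")
try {
  val stmt = conn.createStatement()
  stmt.execute("TRUNCATE TABLE algebra_db.source_data_textfile_2")
  stmt.close()
} finally {
  conn.close()
}
```

On the throughput question: row-by-row JDBC INSERTs are inherently slow in Impala; the usual way to reach much higher rates is to write the data as files to HDFS and issue LOAD DATA (or INSERT ... SELECT from a staging table) rather than streaming rows through the JDBC output component.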

AWS Glue ETL MongoDB Connection String Error

Issue using MongoDB with AWS Glue: I've created a connection to the database (using the MongoDB connection option) and run a crawler against it, and it all worked fine. But when I try to use this as a data source in a basic ETL job (script: Glue version 2.0, Python version 3), it throws the exception
py4j.protocol.Py4JJavaError: An error occurred while calling o70.getDynamicFrame.
: java.lang.RuntimeException: Mongo/DocumentDB connection URL is not supported
Has anyone had any success using MongoDb as a datasource in glue ETL jobs?
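That RuntimeException is typically thrown when the URI the job hands to Glue does not match the plain mongodb://host:port form it expects, for example when credentials are embedded in the URL; keeping the URI bare and passing username/password as separate options is the usual workaround. A hedged Scala sketch (option names as documented for Glue's "mongodb" connectionType; all values are placeholders, and glueContext is assumed to be the job's existing GlueContext):

```scala
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.JsonOptions

// Sketch only: host, database, collection and credentials are placeholders.
// The "uri" deliberately contains no credentials; they go in separate options.
def readMongo(glueContext: GlueContext) = {
  val options = JsonOptions(
    """{
      |  "uri": "mongodb://my.mongo.host:27017",
      |  "database": "my_database",
      |  "collection": "my_collection",
      |  "username": "my_user",
      |  "password": "my_password"
      |}""".stripMargin)
  glueContext
    .getSourceWithFormat(connectionType = "mongodb", options = options)
    .getDynamicFrame()
}
```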

Spark job dataframe write to Oracle using jdbc failing

When writing a Spark dataframe to an Oracle database (Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit), the Spark job fails with the exception java.sql.SQLRecoverableException: IO Error: The Network Adapter could not establish the connection. The Scala code is
dataFrame.write.mode(SaveMode.Append).jdbc("jdbc:oracle:thin:@" + ipPort + ":" + sid, table, props)
Already tried setting the properties below for the JDBC connection, but it hasn't worked.
props.put("driver", "oracle.jdbc.OracleDriver")
props.setProperty("testOnBorrow","true")
props.setProperty("testOnReturn","false")
props.setProperty("testWhileIdle","false")
props.setProperty("validationQuery","SELECT 1 FROM DUAL")
props.setProperty("autoReconnect", "true")
Based on the earlier search results, it seems that the connection is opened initially but is being killed by the firewall after some idle time. The connection URL is verified and is working as the select queries work fine. Need help in getting this resolved.
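One thing to check: testOnBorrow, testWhileIdle, validationQuery and the like are connection-pool (DBCP-style) settings, and autoReconnect is a MySQL driver flag; Oracle's thin driver ignores all of them. What the thin driver does honor are its own timeout properties, which at least make the firewall-killed connections fail fast instead of hanging. A hedged sketch (property names from the Oracle thin JDBC driver; the timeout values, host, and SID are placeholders):

```scala
import java.util.Properties

// Sketch only: timeout values and connection details are placeholders.
val props = new Properties()
props.put("driver", "oracle.jdbc.OracleDriver")
// Fail fast if the TCP connection cannot be established (ms):
props.setProperty("oracle.net.CONNECT_TIMEOUT", "10000")
// Abort reads that hang after a firewall silently drops an idle connection (ms):
props.setProperty("oracle.jdbc.ReadTimeout", "60000")

// Note the thin-driver URL uses '@': jdbc:oracle:thin:@host:port:sid
val ipPort = "db.example.com:1521" // placeholder
val sid = "ORCL"                   // placeholder
val url = s"jdbc:oracle:thin:@$ipPort:$sid"
// dataFrame.write.mode(SaveMode.Append).jdbc(url, table, props)
```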

Streamsets DC and Crate exception. ERROR: SQLParseException: line 1:13: no viable alternative at input 'CHARACTERISTICS'

I am trying to connect to Crate as a StreamSets Data Collector pipeline origin (JDBC Consumer). However, I get this error: "JDBC_00 - Cannot connect to specified database: com.streamsets.pipeline.api.StageException: JDBC_06 - Failed to initialize connection pool: com.zaxxer.hikari.pool.PoolInitializationException: Exception during pool initialization: ERROR: SQLParseException: line 1:13: no viable alternative at input 'CHARACTERISTICS'"
Why am I getting this error? The Crate JDBC driver version is 2.1.5 and the StreamSets Data Collector version is 2.4.0.0.
@gashey already solved the issue: within StreamSets DC, uncheck Enforce Read-only Connection on the Advanced tab of the JDBC Query Consumer configuration
(see https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/crateio/hBexxel2KQw/kU34mrsJBgAJ).
We will update the StreamSets documentation with the workaround: https://crate.io/docs/tools/streamsets/

Cannot connect to remote MongoDB from EMR cluster with spark-shell

I'm trying to connect to a remote Mongo database from an EMR cluster. The following code is executed with the command spark-shell --packages com.stratio.datasource:spark-mongodb_2.10:0.11.2:
import com.stratio.datasource.mongodb._
import com.stratio.datasource.mongodb.config._
import com.stratio.datasource.mongodb.config.MongodbConfig._
val builder = MongodbConfigBuilder(Map(Host -> List("[IP.OF.REMOTE.HOST]:3001"), Database -> "meteor", Collection ->"my_target_collection", ("user", "user_name"), ("database", "meteor"), ("password", "my_password")))
val readConfig = builder.build()
val mongoRDD = sqlContext.fromMongoDB(readConfig)
Spark-shell responds with the following error:
16/07/26 15:44:35 INFO SparkContext: Starting job: aggregate at MongodbSchema.scala:47
16/07/26 15:44:45 WARN DAGScheduler: Creating new stage failed due to exception - job: 1
com.mongodb.MongoTimeoutException: Timed out after 10000 ms while waiting to connect. Client view of cluster state is {type=Unknown, servers=[{address=[IP.OF.REMOTE.HOST]:3001, type=Unknown, state=Connecting, exception={java.lang.IllegalArgumentException: response too long: 1347703880}}]
at com.mongodb.BaseCluster.getDescription(BaseCluster.java:128)
at com.mongodb.DBTCPConnector.getClusterDescription(DBTCPConnector.java:394)
at com.mongodb.DBTCPConnector.getType(DBTCPConnector.java:571)
at com.mongodb.DBTCPConnector.getReplicaSetStatus(DBTCPConnector.java:362)
at com.mongodb.Mongo.getReplicaSetStatus(Mongo.java:446)
.
.
.
After reading for a while, a few answers here on SO and in other forums state that the java.lang.IllegalArgumentException: response too long: 1347703880 error might be caused by a faulty Mongo driver. Based on that, I started executing spark-shell with updated drivers, like so:
spark-shell --packages com.stratio.datasource:spark-mongodb_2.10:0.11.2 --jars casbah-commons_2.10-3.1.1.jar,casbah-core_2.10-3.1.1.jar,casbah-query_2.10-3.1.1.jar,mongo-java-driver-2.13.0.jar
Of course, before this I downloaded the JARs and stored them in the same directory from which spark-shell was executed. Nonetheless, with this approach spark-shell answers with the following cryptic error message:
Exception in thread "dag-scheduler-event-loop" java.lang.NoClassDefFoundError: com/mongodb/casbah/query/dsl/CurrentDateOp
at com.mongodb.casbah.MongoClient.apply(MongoClient.scala:218)
at com.stratio.datasource.mongodb.partitioner.MongodbPartitioner.isShardedCollection(MongodbPartitioner.scala:78)
It is worth mentioning that the target MongoDB is a Meteor Mongo database, which is why I'm trying to connect with [IP.OF.REMOTE.HOST]:3001 instead of using port 27017.
What might be the issue? I've followed many tutorials but all of them seem to have the MongoDB in the same host, allowing them to declare localhost:27017 in the credentials. Is there something I'm missing?
Thanks for the help!
I ended up using MongoDB's official Java driver instead. This was my first experience with Spark and the Scala programming language, so I wasn't very familiar with the idea of using plain Java JARs yet.
The solution
I downloaded the necessary JARs and stored them in the same directory as the job file, which is a Scala file. So the directory looked something like:
/job_directory
|--job.scala
|--bson-3.0.1.jar
|--mongodb-driver-3.0.1.jar
|--mongodb-driver-core-3.0.1.jar
Then, I start spark-shell as follows to load the JARs and their classes into the shell environment:
spark-shell --jars "mongodb-driver-3.0.1.jar,mongodb-driver-core-3.0.1.jar,bson-3.0.1.jar"
Next, I execute the following to load the source code of the job into spark-shell:
:load job.scala
Finally I execute the main object in my job like so:
MainObject.main(Array())
As for the code inside MainObject, it is merely what the tutorial states:
val mongo = new MongoClient(IP_OF_REMOTE_MONGO , 27017)
val db = mongo.getDB(DB_NAME)
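For completeness, the two lines above can be wrapped into a minimal self-contained object. This is only a sketch against mongo-java-driver 3.0.1 (the host, database, and collection names are placeholders; getDB is the legacy API that this driver version still exposes):

```scala
import com.mongodb.MongoClient

// Sketch only: host, port and names are placeholders.
object MainObject {
  def main(args: Array[String]): Unit = {
    val mongo = new MongoClient("IP.OF.REMOTE.MONGO", 27017)
    try {
      val db = mongo.getDB("meteor")                          // legacy DB API, present in 3.0.x
      val collection = db.getCollection("my_target_collection")
      println(s"documents: ${collection.count()}")
    } finally {
      mongo.close()                                           // always release the connection
    }
  }
}
```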
Hopefully this will help future readers and spark-shell/Scala beginners!