error writing to mongodb from pig - mongodb

I'am trying to use the mongo hadoop connector with pig or streaming to load/store data from mongodb. using pig i have following problem:
$cat process.pig
REGISTER /usr/hdp/2.2.4.2-2/hadoop/lib/mongo-java-driver-3.0.2.jar
REGISTER /usr/hdp/2.2.4.2-2/hadoop/lib/mongo-hadoop-core-1.4.0.jar
REGISTER /usr/hdp/2.2.4.2-2/hadoop/lib/mongo-hadoop-pig-1.4.0.jar
SET mapreduce.map.speculative false
SET mapreduce.reduce.speculative false
SET mapreduce.fileoutputcommitter.marksuccessfuljobs false
SET mongo.auth.uri 'mongodb://hadoop:password#127.0.0.1:27017/admin'
raw = LOAD 'mongodb://hadoop:password#127.0.0.1:27017/hadoop.collection'
USING com.mongodb.hadoop.pig.MongoLoader('id:chararray, t:chararray, c_s:map[]');
writing the data into a bson file with
STORE raw
INTO 'file:///tmp/pig_without_limit_bson'
USING com.mongodb.hadoop.pig.BSONStorage('id');
works and i'am able to import the file with mongorestore.
writing to mongodb with
STORE raw
INTO 'mongodb://hadoop:password#127.0.0.1:27017/hadoop.out'
USING com.mongodb.hadoop.pig.MongoInsertStorage('id:chararray, t:chararray', 'id');
does not work and produces following error:
Input(s):
Failed to read data from "mongodb://hadoop:password#127.0.0.1:27017/hadoop.collection"
Output(s):
Failed to produce result in "mongodb://hadoop:password#127.0.0.1:27017/hadoop.out"
$cat pig.log
Error: java.lang.IllegalStateException: state should be: open
at com.mongodb.assertions.Assertions.isTrue(Assertions.java:70)
at com.mongodb.connection.BaseCluster.selectServer(BaseCluster.java:79)
at com.mongodb.binding.ClusterBinding$ClusterBindingConnectionSource.<init>(ClusterBinding.java:75)
at com.mongodb.binding.ClusterBinding$ClusterBindingConnectionSource.<init>(ClusterBinding.java:71)
at com.mongodb.binding.ClusterBinding.getWriteConnectionSource(ClusterBinding.java:68)
at com.mongodb.operation.OperationHelper.withConnection(OperationHelper.java:175)
at com.mongodb.operation.MixedBulkWriteOperation.execute(MixedBulkWriteOperation.java:141)
at com.mongodb.operation.MixedBulkWriteOperation.execute(MixedBulkWriteOperation.java:72)
at com.mongodb.Mongo.execute(Mongo.java:745)
at com.mongodb.Mongo$2.execute(Mongo.java:728)
at com.mongodb.DBCollection.executeBulkWriteOperation(DBCollection.java:1968)
at com.mongodb.DBCollection.executeBulkWriteOperation(DBCollection.java:1962)
at com.mongodb.BulkWriteOperation.execute(BulkWriteOperation.java:98)
at com.mongodb.hadoop.output.MongoOutputCommitter.commitTask(MongoOutputCommitter.java:133)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.commitTask(PigOutputCommitter.java:356)
at org.apache.hadoop.mapred.Task.commit(Task.java:1163)
at org.apache.hadoop.mapred.Task.done(Task.java:1025)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:345)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Pig Stack Trace
---------------
ERROR 0: java.io.IOException: No FileSystem for scheme: mongodb
org.apache.pig.backend.executionengine.ExecException: ERROR 0: java.io.IOException: No FileSystem for scheme: mongodb
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:535)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:280)
at org.apache.pig.PigServer.launchPlan(PigServer.java:1390)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1375)
at org.apache.pig.PigServer.execute(PigServer.java:1364)
at org.apache.pig.PigServer.executeBatch(PigServer.java:415)
at org.apache.pig.PigServer.executeBatch(PigServer.java:398)
at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:171)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:234)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
at org.apache.pig.Main.run(Main.java:495)
at org.apache.pig.Main.main(Main.java:170)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.io.IOException: No FileSystem for scheme: mongodb
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2607)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2614)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2653)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2635)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.pig.StoreFunc.cleanupOnFailureImpl(StoreFunc.java:193)
at org.apache.pig.StoreFunc.cleanupOnFailure(StoreFunc.java:161)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:526)
... 18 more
However, if using the limit operator (even if limiting to enormous figures) all documents are saved into mongodb.
raw_limited = limit raw 1000000;
STORE raw_limited
INTO 'mongodb://hadoop:password#127.0.0.1:27017/hadoop.out'
USING com.mongodb.hadoop.pig.MongoInsertStorage('id:chararray, t:chararray', 'id');
results in
Input(s):
Successfully read 100 records (638 bytes) from:
Output(s):
Successfully stored 100 records (18477 bytes) in:
$mongo hadoop
>> db.out.count()
100
why is that and how can it be fixed? did i miss something?

Seems to be a bug in the mongo java driver.
It works, if using version 3.0.4 of the mongo java driver.

Related

Spark fails to read from Elasticsearch/Opensearch. Invalid map received dynamic_date_formats

Hi I'm trying using scala 2.11.12, spark 2.3.0 and elasticsearch-spark-20 7.7.0 to read from an OpenSearch 1.3.4 Index with the following code:
spark.read.format("org.elasticsearch.spark.sql")
.load("myIndex")
.filter('Timestamp === lit(dateToRead))
But I get this error
22/08/17 15:30:42 ERROR EventManager$: Unexpected error retrieving offsets. Bailing out...
Exception in thread "main" org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: invalid map received dynamic_date_formats=[yyyy-MM-dd HH:mm:ss||yyyy-MM-dd'T'HH:mm:ss.SSS||yyyy-MM-dd||yyyy-MM-dd'T'HH||yyyy-MM-dd'T'HH:mm]
at org.elasticsearch.hadoop.serialization.dto.mapping.FieldParser.parseField(FieldParser.java:146)
at org.elasticsearch.hadoop.serialization.dto.mapping.FieldParser.parseMapping(FieldParser.java:88)
at org.elasticsearch.hadoop.serialization.dto.mapping.FieldParser.parseIndexMappings(FieldParser.java:69)
at org.elasticsearch.hadoop.serialization.dto.mapping.FieldParser.parseMappings(FieldParser.java:40)
at org.elasticsearch.hadoop.rest.RestClient.getMappings(RestClient.java:321)
at org.elasticsearch.hadoop.rest.RestClient.getMappings(RestClient.java:307)
at org.elasticsearch.hadoop.rest.RestRepository.getMappings(RestRepository.java:293)
at org.elasticsearch.spark.sql.SchemaUtils$.discoverMappingAndGeoFields(SchemaUtils.scala:103)
at org.elasticsearch.spark.sql.SchemaUtils$.discoverMapping(SchemaUtils.scala:91)
at org.elasticsearch.spark.sql.ElasticsearchRelation.lazySchema$lzycompute(DefaultSource.scala:229)
at org.elasticsearch.spark.sql.ElasticsearchRelation.lazySchema(DefaultSource.scala:229)
at org.elasticsearch.spark.sql.ElasticsearchRelation$$anonfun$schema$1.apply(DefaultSource.scala:233)
at org.elasticsearch.spark.sql.ElasticsearchRelation$$anonfun$schema$1.apply(DefaultSource.scala:233)
at scala.Option.getOrElse(Option.scala:121)
at org.elasticsearch.spark.sql.ElasticsearchRelation.schema(DefaultSource.scala:233)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:431)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
at com.MyCalass$$anonfun$myMethod2$1.apply(MyCalass.scala:130)
at com.MyCalass$$anonfun$myMethod2$1.apply(MyCalass.scala:126)
at scala.util.Try$.apply(Try.scala:192)
at com.MyCalass$.myMethod2(MyCalass.scala:126)
at com.MyCalass$.myMethod(MyCalass.scala:55)
at com.MyApp$.MyApp$$myMethod(MyApp.scala:107)
at com.MyApp$$anonfun$main$2.apply(MyApp.scala:86)
at com.MyApp$$anonfun$main$2.apply(MyApp.scala:76)
at scala.Option.fold(Option.scala:158)
at com.MyApp$.main(MyApp.scala:76)
at com.MyApp.main(MyApp.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Command exiting with ret '1'
I've set the dynamic date mapping in opensearch. And also I am able to write to the index with the correct mapping, but when I try to read, it fails.
I found out the problem, basically The elasticsearch connector is not working properly and it tries to use ES 1.3.4 instead of Opensearch 1.3.4, to solve this problem add compatibility.override_main_response_version : true to your opensearch.yml file.

AWS GLUE: Cassandra connection using SSL is not working

I wanted to connect to Cassandra using Spark, when trying to connect Cassandra using the default port it is working, but when I try accessing it via SSL the job fails, below is the code:
val spark: SparkSession = SparkSession.builder()
.config("spark.cassandra.connection.host","server.abc")
.config("spark.cassandra.connection.port","9142")
.config("spark.cassandra.connection.ssl.enabled",true)
.config("spark.cassandra.connection.ssl.trustStore.path","s3:/dev-code/certs/trust.jks")
.config("spark.cassandra.connection.ssl.trustStore.password","mypass")
.config("spark.cassandra.auth.username","myuser")
.config("spark.cassandra.auth.password","userpass")
.appName("CassandraIntegration").getOrCreate()
FYI: it has access to the S3 bucket, I am able to read the CSV file from the same location. Also, both the ports are enabled 9042 and 9142. Closed 9042 and kept only 9142 port still the error persists.
Below is the error:
ERROR [main] glue.ProcessLauncher (Logging.scala:logError(94)): Exception in User Class
java.io.IOException: Failed to open native connection to Cassandra at {server.abc:9142} :: Error instantiating class com.datastax.oss.driver.internal.core.ssl.DefaultSslEngineFactory (specified by advanced.ssl-engine-factory.class): Cannot initialize SSL Context
at com.datastax.spark.connector.cql.CassandraConnector$.createSession(CassandraConnector.scala:173)
at com.datastax.spark.connector.cql.CassandraConnector$.$anonfun$sessionCache$1(CassandraConnector.scala:161)
at com.datastax.spark.connector.cql.RefCountedCache.createNewValueAndKeys(RefCountedCache.scala:32)
at com.datastax.spark.connector.cql.RefCountedCache.syncAcquire(RefCountedCache.scala:69)
at com.datastax.spark.connector.cql.RefCountedCache.acquire(RefCountedCache.scala:57)
at com.datastax.spark.connector.cql.CassandraConnector.openSession(CassandraConnector.scala:81)
at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:103)
at com.datastax.spark.connector.datasource.CassandraCatalog$.com$datastax$spark$connector$datasource$CassandraCatalog$$getMetadata(CassandraCatalog.scala:455)
at com.datastax.spark.connector.datasource.CassandraCatalog$.getTableMetaData(CassandraCatalog.scala:421)
at org.apache.spark.sql.cassandra.DefaultSource.getTable(DefaultSource.scala:68)
at org.apache.spark.sql.cassandra.DefaultSource.inferSchema(DefaultSource.scala:72)
at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.getTableFromProvider(DataSourceV2Utils.scala:81)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:296)
at scala.Option.map(Option.scala:230)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:266)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:226)
at MyCsvToCassandrsJob$.main(csv-to-cassanra-job:63)
at MyCsvToCassandrsJob.main(csv-to-cassanra-job-job)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.amazonaws.services.glue.SparkProcessLauncherPlugin.invoke(ProcessLauncher.scala:47)
at com.amazonaws.services.glue.SparkProcessLauncherPlugin.invoke$(ProcessLauncher.scala:47)
at com.amazonaws.services.glue.ProcessLauncher$$anon$1.invoke(ProcessLauncher.scala:75)
at com.amazonaws.services.glue.ProcessLauncher.launch(ProcessLauncher.scala:123)
at com.amazonaws.services.glue.ProcessLauncher$.main(ProcessLauncher.scala:29)
at com.amazonaws.services.glue.ProcessLauncher.main(ProcessLauncher.scala)
Caused by: java.lang.IllegalArgumentException: Error instantiating class com.datastax.oss.driver.internal.core.ssl.DefaultSslEngineFactory (specified by advanced.ssl-engine-factory.class): Cannot initialize SSL Context
at com.datastax.oss.driver.internal.core.util.Reflection.buildFromConfig(Reflection.java:253)
at com.datastax.oss.driver.internal.core.util.Reflection.buildFromConfig(Reflection.java:108)
at com.datastax.oss.driver.internal.core.context.DefaultDriverContext.buildSslEngineFactory(DefaultDriverContext.java:414)
at com.datastax.oss.driver.internal.core.context.DefaultDriverContext.lambda$new$4(DefaultDriverContext.java:279)
at com.datastax.oss.driver.internal.core.util.concurrent.LazyReference.get(LazyReference.java:55)
at com.datastax.oss.driver.internal.core.context.DefaultDriverContext.getSslEngineFactory(DefaultDriverContext.java:733)
at com.datastax.oss.driver.internal.core.context.DefaultDriverContext.buildSslHandlerFactory(DefaultDriverContext.java:470)
at com.datastax.oss.driver.internal.core.util.concurrent.LazyReference.get(LazyReference.java:55)
at com.datastax.oss.driver.internal.core.context.DefaultDriverContext.getSslHandlerFactory(DefaultDriverContext.java:799)
at com.datastax.oss.driver.internal.core.session.DefaultSession$SingleThreaded.init(DefaultSession.java:348)
at com.datastax.oss.driver.internal.core.session.DefaultSession$SingleThreaded.access$1100(DefaultSession.java:300)
at com.datastax.oss.driver.internal.core.session.DefaultSession.lambda$init$0(DefaultSession.java:146)
at com.datastax.oss.driver.shaded.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
at com.datastax.oss.driver.shaded.netty.util.concurrent.PromiseTask.run(PromiseTask.java:106)
at com.datastax.oss.driver.shaded.netty.channel.DefaultEventLoop.run(DefaultEventLoop.java:54)
at com.datastax.oss.driver.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at com.datastax.oss.driver.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at com.datastax.oss.driver.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalStateException: Cannot initialize SSL Context
at com.datastax.oss.driver.internal.core.ssl.DefaultSslEngineFactory.<init>(DefaultSslEngineFactory.java:74)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at com.datastax.oss.driver.internal.core.util.Reflection.buildFromConfig(Reflection.java:246)
... 18 more
Caused by: java.nio.file.NoSuchFileException: s3:/dev-code/certs/trust.jks
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
at java.nio.file.Files.newByteChannel(Files.java:361)
at java.nio.file.Files.newByteChannel(Files.java:407)
at java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:384)
at java.nio.file.Files.newInputStream(Files.java:152)
at com.datastax.oss.driver.internal.core.ssl.DefaultSslEngineFactory.buildContext(DefaultSslEngineFactory.java:119)
at com.datastax.oss.driver.internal.core.ssl.DefaultSslEngineFactory.<init>(DefaultSslEngineFactory.java:72)
... 23 more
Big help if there is any workaround for this problem.
At the bottom of your error message, I see this:
NoSuchFileException: s3:/dev-code/certs/trust.jks
Alex is right, in that you need to provide a path to that file that the Spark connector can actually get to. From the looks of it, S3 won't work here.
Added the .jks s3 file into "Referenced files path" of Glue Job and then just try to access just provide the file name. As the file will be automatically be placed under /tmp folder. But it will still not solve the issue.
From the this website, I understood that we need to provide all the default values as well:
Below is my final code:
val spark: SparkSession = SparkSession.builder()
.config("spark.cassandra.connection.host","server.abc")
.config("spark.cassandra.connection.port","9142")
.config("spark.cassandra.connection.ssl.enabled",true)
.config("spark.cassandra.connection.ssl.enabledAlgorithms", "TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA")
.config("spark.cassandra.connection.ssl.trustStore.path","trust.jks")
.config("spark.cassandra.connection.ssl.trustStore.password","mypass")
.config("spark.cassandra.connection.ssl.trustStore.type","JKS")
.config("spark.cassandra.connection.ssl.protocol","TLS")
.config("spark.cassandra.auth.username","myuser")
.config("spark.cassandra.auth.password","userpass")
.appName("CassandraIntegration").getOrCreate()

Hive Table Creation Using MongoDB Hadoop Driver

I am trying to connect from a Hive Database to a collection in MongoDB using a driver (jars) provided on the wiki site. Here are the steps I did: -
I created a collection in MongoDB called "Diamond" under a database called "Moe" and it has got 20 documents:
I wanted to connect from Hive via the Hadoop MongoDB Driver and view these documents via Hive.
I have both MongoDB and Hive installed on the same server and configured. However I don't see any variable called the HIVE_CLASPATH I wonder where that is.
So I installed 3 divers on the server: -
mongo-hadoop-core-1.5.2.jar;
mongo-hadoop-hive-1.5.2.jar;
mongo-java-driver-3.0.0.jar;
Now, I connect to Hive, and then add these 2 jar's to my classpath by the following commands: - (they get added successfully)
add jar /hadoopgdc/hadoop-2.6.0/share/hadoop/common/lib/mongo-hadoop-hive-1.5.2.jar;
add jar /hadoopgdc/hadoop-2.6.0/share/hadoop/common/lib/mongo-hadoop-core-1.5.2.jar;
add jar /hadoopgdc/hadoop-2.6.0/share/hadoop/common/lib/mongo-java-driver-3.0.0.jar;
Now I create a table in HIVE: -
CREATE TABLE Diamond
(
carat DOUBLE,
cut STRING,
color STRING,
clarity STRING,
depth DOUBLE,
table DOUBLE,
price DOUBLE,
xcord DOUBLE,
ycord DOUBLE,
zcord DOUBLE
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"carat":"carat","cut":"cut",
"color":"color", "clarity":"clarity", "depth":"depth", "table":"table",
"price":"price", "xcord":"x", "ycord":"y", "zcord":"z"}')
TBLPROPERTIES('mongo.uri'='mongodb://localhost:27017/Moe.Diamond');
However when I execute the above command in Hive I get the error below: -
java.lang.NoClassDefFoundError: com/mongodb/util/JSON
at com.mongodb.hadoop.hive.BSONSerDe.initialize(BSONSerDe.java:110)
at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:210)
at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:268)
at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:261)
at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:587)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:573)
at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:3784)
at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:256)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:155)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1355)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1139)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:945)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:259)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:216)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:756)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:614)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: java.lang.ClassNotFoundException: com.mongodb.util.JSON
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 23 more
FAILED: Execution Error, return code -101 from
org.apache.hadoop.hive.ql.exec.DDLTask
I have tried the following: -
- placing the jars in every possible directory with no effect
- The class that is supposed to be missing, is pretty much present in the jar file.
- oh yes and the MongoStorageHandler class is very much in the jar.
I am done breaking my head with this !! If anyone can shed some light on what I could do to alleviate my anxiety, it would be great.
Thanks again.
Mario
I identified what the issue was. To connect from HIVE to MongoDB, the MongoDb Driver uses invokes a java class in a hive jar library
**
java.lang.ClassNotFoundException:
org.apache.hadoop.hive.ql.hooks.PreExecutePrinter
**
Now this class is supposed to be found in the jar file - hive-exec-0.11.0.1.3.2.0-111.jar. However it is available only in more recent versions of HIVE and not older ones.
It is not available in 0.11.0.1.3.2.0-111 but is visibly detectable in 0.13.0.2.1.7.0-784.
The solution here was to connect to a version of HIVE that is supported by the driver. MongoDB does state that its driver supports a certain version of Hadoop, but doesn't drill down to the individual Application (HIVE / SQOOP).

Hadoop TotalOrderPartitioner

I am trying to use total order partioner in hadoop with following code:
job.setNumReduceTasks(4);
Path partitionFile = new Path(args[1]);
InputSampler.Sampler sampler = new InputSampler.RandomSampler(0.1,3,1)
TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),partitionFile);
InputSampler.writePartitionFile(job, sampler);
job.setPartitionerClass(TotalOrderPartitioner.class);
FileInputFormat.addInputPath(job, new Path(otherArgs[0));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[2]));
This code while runing throwing exception as follows:
Exception running child : java.lang.IllegalArgumentException: Can't read partitions file
at org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner.setConf(TotalOrderPartitioner.java:116)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:677)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:746)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Caused by: java.io.FileNotFoundException: File file:/tmp/hadoop-sarang/nm-local-dir/usercache/sarang/appcache/application_1417956066584_0001/container_1417956066584_0001_01_000005/_partition.lst does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:520)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:398)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1749)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1773)
at org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner.readPartitions(TotalOrderPartitioner.java:301)
at org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner.setConf(TotalOrderPartitioner.java:88)
... 10 more
I also found this error when I was using hadoop mapreduce and the mapreduce service had not been installed and started. After installing mapreduce and starting it, the exception disappeared.

MongoDB hadoop connector fails to query on mongo hive table

I am using MongoDB hadoop connector to query mongoDB using hive table in hadoop.
I am able to execute
select * from mongoDBTestHiveTable;
But when I try to execute following query
select id from mongoDBTestHiveTable;
it throws following exception.
Following class exist in hive lib folder.
Exception stacktrace:
Diagnostic Messages for this Task:
Error: java.io.IOException: Cannot create an instance of InputSplit class = com.mongodb.hadoop.hive.input.HiveMongoInputFormat$MongoHiveInputSplit:Class com.mongodb.hadoop.hive.input.HiveMongoInputFormat$MongoHiveInputSplit not found
at org.apache.hadoop.hive.ql.io.HiveInputFormat$HiveInputSplit.readFields(HiveInputFormat.java:147)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:71)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:42)
at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:370)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:402)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Caused by: java.lang.ClassNotFoundException: Class com.mongodb.hadoop.hive.input.HiveMongoInputFormat$MongoHiveInputSplit not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1626)
at org.apache.hadoop.hive.ql.io.HiveInputFormat$HiveInputSplit.readFields(HiveInputFormat.java:144)
... 10 more
Container killed by the ApplicationMaster.
Please advice.
You also need to add mongo-hadoop-* and well as mongo driver jars to MR1/MR2 classpath on all workers