Import MongoDB data into Hive Error: Splitter implementation is incompatible - mongodb

I'm trying to import MongoDB data into Hive.
The JAR versions that I have used are:
ADD JAR /root/HDL/mongo-java-driver-3.4.2.jar;
ADD JAR /root/HDL/mongo-hadoop-hive-2.0.2.jar;
ADD JAR /root/HDL/mongo-hadoop-core-2.0.2.jar;
And my cluster versions are
Ambari - Version 2.6.0.0, HDFS 2.7.3, Hive 1.2.1000, HBase 1.1.2, Tez 0.7.0
MongoDB Server version:- 3.6.5
Hive Script:-
CREATE TABLE sampletable
( ID STRING,
EmpID STRING,
BeginDate DATE,
EndDate DATE,
Time TIMESTAMP,
Type STRING,
Location STRING,
Terminal STRING)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"ID":"_id","EmpID":"emp_id","BeginDate":"begin_date","EndDate":"end_date","Time":"time","Type":"time_event_type","Location":"location","Terminal":"terminal"}')
TBLPROPERTIES('mongo.uri'='mongodb://username:password@10.10.170.43:27017/testdb.testtable');
Output:-
hive> select * from sampletable;
OK
Failed with exception java.io.IOException:java.io.IOException: Failed to aggregate sample documents. Note that this Splitter implementation is incompatible with MongoDB versions prior to 3.2.
Please suggest how I can solve this.
Thanks,
Mohan V

Try setting the MongoDB input split size in the same Hive session, before running the query:
set mongo.input.split_size=50;

Related

Delta Lake 1.1.0 location in Spark local mode does not work properly

I've updated some ETLs to Spark 3.2.1 and Delta Lake 1.1.0. After doing this my local tests started to fail. After some debugging, I found that when I create an empty table with a specified location, it is registered in the metastore with a prefix.
Let's say I try to create a table in the bronze DB with spark-warehouse/users as the specified location:
spark.sql("""CREATE DATABASE IF NOT EXISTS bronze""")
spark.sql("""CREATE TABLE bronze.users (
| name string,
| active boolean
|)
|USING delta
|LOCATION 'spark-warehouse/users'""".stripMargin)
I end up with spark-warehouse/bronze.db/spark-warehouse/users registered in the metastore, but with the actual files in spark-warehouse/users! This makes any query to the table fail.
I generated a sample repository: https://github.com/adrianabreu/delta-1.1.0-table-location-error-example/blob/master/src/test/scala/example/HelloSpec.scala
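Not a confirmed fix, but one workaround to try is resolving the location to an absolute path before creating the table, so the path recorded in the metastore and the path where Delta writes the files cannot diverge. A minimal sketch, assuming the same local spark session as in the example above (the absolute-path workaround is an assumption, not documented Delta behaviour):

// Sketch: make the table location absolute up front so the metastore entry
// and the actual Delta files refer to the same directory.
val usersLocation = new java.io.File("spark-warehouse/users").getAbsolutePath

spark.sql("CREATE DATABASE IF NOT EXISTS bronze")
spark.sql(s"""CREATE TABLE bronze.users (
             | name string,
             | active boolean
             |)
             |USING delta
             |LOCATION '$usersLocation'""".stripMargin)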

Spark MongoDB Connector unable to df.join - Unspecialised MongoConfig

Using the latest MongoDB connector for Spark (v10) and trying to join two dataframes yields the following unhelpful error.
Py4JJavaError: An error occurred while calling o64.showString.
: java.lang.UnsupportedOperationException: Unspecialised MongoConfig. Use `mongoConfig.toReadConfig()` or `mongoConfig.toWriteConfig()` to specialize
at com.mongodb.spark.sql.connector.config.MongoConfig.getDatabaseName(MongoConfig.java:201)
at com.mongodb.spark.sql.connector.config.MongoConfig.getNamespace(MongoConfig.java:196)
at com.mongodb.spark.sql.connector.MongoTable.name(MongoTable.java:99)
at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation.name(DataSourceV2Relation.scala:66)
at org.apache.spark.sql.execution.datasources.v2.V2ScanRelationPushDown$$anonfun$pushDownFilters$1.$anonfun$applyOrElse$2(V2ScanRelationPushDown.scala:65)
The PySpark code is simply pulling in two tables and running a join:
dfa = spark.read.format("mongodb").option("uri", "mongodb://127.0.0.1/people.contacts").load()
dfb = spark.read.format("mongodb").option("uri", "mongodb://127.0.0.1/people.accounts").load()
dfa.join(dfb, 'PKey').count()
SQL gives the same error:
dfa.createOrReplaceTempView("usr")
dfb.createOrReplaceTempView("ast")
spark.sql("SELECT count(*) FROM ast JOIN usr on usr._id = ast._id").show()
Document structures are flat.
Have you tried using the latest version (10.0.2) of mongo-spark-connector? You can find it here.
I had a similar problem and solved it by replacing 10.0.1 with 10.0.2.
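For completeness, a minimal Scala sketch of the version-pinning suggestion above; the Maven coordinate and the v10 configuration keys are assumptions to check against the connector documentation for your Spark and Scala versions:

import org.apache.spark.sql.SparkSession

// Assumed coordinate: verify the exact artifact name / Scala suffix on Maven Central.
val spark = SparkSession.builder()
  .appName("mongo-join")
  .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector:10.0.2")
  .config("spark.mongodb.read.connection.uri", "mongodb://127.0.0.1/people")
  .getOrCreate()

val dfa = spark.read.format("mongodb")
  .option("database", "people").option("collection", "contacts").load()
val dfb = spark.read.format("mongodb")
  .option("database", "people").option("collection", "accounts").load()

dfa.join(dfb, "PKey").count()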

error in initSerDe : java.lang.ClassNotFoundException class org.apache.hive.hcatalog.data.JsonSerDe not found

I am trying to read data from a Hive table using Spark SQL (Scala) and it is throwing the following error:
ERROR hive.log: error in initSerDe: java.lang.ClassNotFoundException Class org.apache.hive.hcatalog.data.JsonSerDe not found
java.lang.ClassNotFoundException: Class org.apache.hive.hcatalog.data.JsonSerDe not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2255)
at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:392)
at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:274)
at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:256)
at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:607)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$7.apply(HiveClientImpl.scala:358)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$7.apply(HiveClientImpl.scala:355)
at scala.Option.map(Option.scala:146)
The Hive table is stored as:
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
I added /opt/cloudera/parcels/CDH/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar using :require /opt/cloudera/parcels/CDH/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar and can see "Added to classpath".
I also tried adding the JAR file in SparkSession.config(). Neither of them worked. I checked some answers on Stack Overflow, but they did not help resolve my issue.
CREATE EXTERNAL TABLE `test.record`(
`test_id` string COMMENT 'from deserializer',
`test_name` string COMMENT 'from deserializer',
`processed_datetime` timestamp COMMENT 'from deserializer'
)
PARTITIONED BY (
`filedate` date)
ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
I am expecting to read the data from the Hive table and store it in a DataFrame, so that
var tempDF = sql("SELECT * FROM test.record WHERE filedate = '2019-06-03' LIMIT 5")
tempDF.show()
should work.
One quick way to solve this is to copy the JAR file to Spark.
The source file, hive-hcatalog-core-3.1.2.jar, is in the Hive lib directory; copy it into the jars directory under the Spark directory.
I also tried modifying the hive.aux.jars.path config in hive-site.xml, but it doesn't work. If anyone knows of a configuration for Spark to load extra jars, please comment.
Add the required JAR as follows rather than copying it onto all the nodes in the cluster:
In conf/spark-defaults.conf, add the following configs:
spark.driver.extraClassPath /fullpath/hive-hcatalog-core-3.1.2.jar
spark.executor.extraClassPath /fullpath/hive-hcatalog-core-3.1.2.jar
Or in spark-sql, execute an ADD JAR statement before querying:
ADD JAR /fullpath/hive-hcatalog-core-3.1.2.jar
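If you are in a Scala session (spark-shell or a Spark application) rather than the spark-sql CLI, the same ADD JAR approach can be issued through spark.sql before querying the table. A minimal sketch; /fullpath is a placeholder for wherever hive-hcatalog-core actually lives on your cluster:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("read-json-serde-table")
  .enableHiveSupport()
  .getOrCreate()

// Register the SerDe jar with the session before the table's deserializer is loaded.
spark.sql("ADD JAR /fullpath/hive-hcatalog-core-3.1.2.jar")

val tempDF = spark.sql("SELECT * FROM test.record WHERE filedate = '2019-06-03' LIMIT 5")
tempDF.show()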

export data from mongo to hive

My input: a collection ("demo1") in MongoDB (version 3.4.4)
My output: my data imported into a database in Hive ("demo2") (version 1.2.1.2.3.4.7-4)
Purpose: create a connector between MongoDB and Hive
Error:
Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. com/mongodb/util/JSON
I tried two solutions, following these steps (but the error remains):
1) I create a local collection in MongoDB (via Robomongo) connected to Docker.
2) I upload these versions of the JARs and add them in Hive:
ADD JAR /home/.../mongo-hadoop-hive-2.0.2.jar;
ADD JAR /home/.../mongo-hadoop-core-2.0.2.jar;
ADD JAR /home/.../mongo-java-driver-3.4.2.jar;
Unfortunately the error doesn't change. I am unsure which versions are right for my export, so I also upload and try these:
ADD JAR /home/.../mongo-hadoop-hive-1.3.0.jar;
ADD JAR /home/.../mongo-hadoop-core-1.3.0.jar;
ADD JAR /home/.../mongo-java-driver-2.13.2.jar;
3) I create an external table
CREATE EXTERNAL TABLE demo2
(
id INT,
name STRING,
password STRING,
email STRING
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH
SERDEPROPERTIES('mongo.columns.mapping'='{"id":"_id","name":"name","password":"password","email":"email"}')
TBLPROPERTIES('mongo.uri'='mongodb://localhost:27017/local.demo1');
Error returned in Hive:
Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. com/mongodb/util/JSON
How can I resolve this problem?
Copying the correct jar files (mongo-hadoop-core-2.0.2.jar, mongo-hadoop-hive-2.0.2.jar, mongo-java-driver-3.2.2.jar) on ALL the nodes of the cluster did the trick for me.
Other points to take care of:
Follow all steps mentioned here religiously - https://github.com/mongodb/mongo-hadoop/wiki/Hive-Usage#installation
Adhere to the requirements given here - https://github.com/mongodb/mongo-hadoop#requirements
Other useful links
https://github.com/mongodb/mongo-hadoop/wiki/FAQ#i-get-a-classnotfoundexceptionnoclassdeffounderror-when-using-the-connector-what-do-i-do
https://groups.google.com/forum/#!topic/mongodb-user/xMVoTSePgg0

HiveContext setting in scala+spark project to access existing HDFS

I am trying to access my existing Hadoop setup from my Spark + Scala project.
Spark Version 1.4.1
Hadoop 2.6
Hive 1.2.1
From the Hive console I am able to create a table and access it without any issue, and I can also see the same table from the Hadoop URL.
The problem is that when I try to create the table from the project, the system shows this error:
ERROR Driver: FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.DDLTask.
MetaException(message:file:/user/hive/warehouse/src is not a directory
or unable to create one)
Following is the code I wrote:
Imports:
import org.apache.spark._
import org.apache.spark.sql.hive._
Code:
val sparkContext = new SparkContext("local[2]", "HiveTable")
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sparkContext)
hiveContext.setConf("hive.metastore.warehouse.dir", "hdfs://localhost:54310/user/hive/warehouse")
hiveContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
Edit:
what if, instead of creating a table, I had to execute an insert statement like:
hiveContext.sql("INSERT INTO TABLE default.src SELECT 'username','password' FROM foo;")
Any help to resolve this issue would be highly appreciated.
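Not a definitive answer, but the file:/user/hive/warehouse/src path in the error suggests the HiveContext is not picking up the existing cluster configuration and is falling back to a local warehouse. A minimal sketch, assuming the cluster's hive-site.xml (plus core-site.xml/hdfs-site.xml) have been copied into Spark's conf directory so the metastore and HDFS settings are visible before the first DDL statement runs:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Assumes hive-site.xml / core-site.xml from the existing setup are on Spark's
// classpath (e.g. in $SPARK_HOME/conf), so the warehouse directory and fs.defaultFS
// resolve to HDFS rather than the local file system.
val sparkConf = new SparkConf().setMaster("local[2]").setAppName("HiveTable")
val sparkContext = new SparkContext(sparkConf)
val hiveContext = new HiveContext(sparkContext)

hiveContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")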