export data from mongo to hive - mongodb

My input: a collection ("demo1") in MongoDB (version 3.4.4).
My output: the data imported into a Hive database ("demo2") (version 1.2.1.2.3.4.7-4).
Purpose: create a connector between MongoDB and Hive.
Error:
Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. com/mongodb/util/JSON
I tried two solutions following these steps (but the error remains):
1) I created a local collection in MongoDB (via Robomongo) connected to Docker.
2) I uploaded these versions of the jars and added them in Hive:
ADD JAR /home/.../mongo-hadoop-hive-2.0.2.jar;
ADD JAR /home/.../mongo-hadoop-core-2.0.2.jar;
ADD JAR /home/.../mongo-java-driver-3.4.2.jar;
Unfortunately the error doesn't change. Since I wasn't sure which versions were right for my export, I also tried these:
ADD JAR /home/.../mongo-hadoop-hive-1.3.0.jar;
ADD JAR /home/.../mongo-hadoop-core-1.3.0.jar;
ADD JAR /home/.../mongo-java-driver-2.13.2.jar;
3) I created an external table:
CREATE EXTERNAL TABLE demo2
(
id INT,
name STRING,
password STRING,
email STRING
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH
SERDEPROPERTIES('mongo.columns.mapping'='{"id":"_id","name":"name","password":"password","email":"email"}')
TBLPROPERTIES('mongo.uri'='mongodb://localhost:27017/local.demo1');
Error returned in Hive:
Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. com/mongodb/util/JSON
How can I resolve this problem?

Copying the correct jar files (mongo-hadoop-core-2.0.2.jar, mongo-hadoop-hive-2.0.2.jar, mongo-java-driver-3.2.2.jar) on ALL the nodes of the cluster did the trick for me.
Other points to take care of:
Follow all steps mentioned here religiously - https://github.com/mongodb/mongo-hadoop/wiki/Hive-Usage#installation
Adhere to the requirements given here - https://github.com/mongodb/mongo-hadoop#requirements
Other useful links
https://github.com/mongodb/mongo-hadoop/wiki/FAQ#i-get-a-classnotfoundexceptionnoclassdeffounderror-when-using-the-connector-what-do-i-do
https://groups.google.com/forum/#!topic/mongodb-user/xMVoTSePgg0

Related

How to speed up writing into Impala from Talend

I'm using Talend Open Studio for Big Data (7.3.1), and I write files from various sources to Cloudera Impala (Cloudera QuickStart 5.13), but it takes too much time and only writes ~3300 rows/s (take a look at the pictures).
Is there a way to raise the write rate to ~10000-100000 rows/s or even higher?
Am I using the wrong approach for the load?
Or do I need to configure Impala/Talend better?
Any advice is welcome!
UPDATE
I installed the Impala JDBC driver:
But the OutputFile does not look like it is configured for Impala:
Error:
Exception in component tDBOutput_1 (db_2_impala)
org.talend.components.api.exception.ComponentException: UNEXPECTED_EXCEPTION:{message=[Cloudera]ImpalaJDBCDriver ERROR processing query/statement. Error Code: 0, SQL state: TStatus(statusCode:ERROR_STATUS, sqlState:HY000, errorMessage:AnalysisException: Impala does not support modifying a non-Kudu table: algebra_db.source_data_textfile_2
), Query: DELETE FROM algebra_db.source_data_textfile_2.} at org.talend.components.jdbc.CommonUtils.newComponentException(CommonUtils.java:583)

error in initSerDe : java.lang.ClassNotFoundException class org.apache.hive.hcatalog.data.JsonSerDe not found

I am trying to read data from a Hive table using Spark SQL (Scala) and it is throwing the following error:
ERROR hive.log: error in initSerDe: java.lang.ClassNotFoundException Class org.apache.hive.hcatalog.data.JsonSerDe not found
java.lang.ClassNotFoundException: Class org.apache.hive.hcatalog.data.JsonSerDe not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2255)
at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:392)
at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:274)
at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:256)
at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:607)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$7.apply(HiveClientImpl.scala:358)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$7.apply(HiveClientImpl.scala:355)
at scala.Option.map(Option.scala:146)
The Hive table is stored as:
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
I added the /opt/cloudera/parcels/CDH/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar using :require /opt/cloudera/parcels/CDH/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar and can see "Added to classpath".
I also tried to add the JAR file in SparkSession.config(). Neither of these worked. I checked some answers on Stack Overflow, but they did not help resolve my issue.
CREATE EXTERNAL TABLE `test.record`(
`test_id` string COMMENT 'from deserializer',
`test_name` string COMMENT 'from deserializer',
`processed_datetime` timestamp COMMENT 'from deserializer'
)
PARTITIONED BY (
`filedate` date)
ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
I am expecting to read the data from the Hive table and be able to store it in a DataFrame, so that
var tempDF =sql("SELECT * FROM test.record WHERE filedate = '2019-06-03' LIMIT 5")
tempDF.show()
should work
One quick way to solve this is to copy the jar file to Spark: the source file is hive-hcatalog-core-3.1.2.jar from the Hive lib directory; copy it into the jars directory under the Spark directory.
I also tried to modify the hive.aux.jars.path config in hive-site.xml, but it doesn't work. If anyone knows of a configuration for Spark to load extra jars, please comment.
Add the required jar as follows rather than copying it onto all the nodes in the cluster:
In conf/spark-defaults.conf, add the following configs:
spark.driver.extraClassPath /fullpath/hive-hcatalog-core-3.1.2.jar
spark.executor.extraClassPath /fullpath/hive-hcatalog-core-3.1.2.jar
Or, in spark-sql, execute an ADD JAR statement before querying:
ADD JAR /fullpath/hive-hcatalog-core-3.1.2.jar
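The same ADD JAR route also works from a Spark application rather than the spark-sql shell. A minimal sketch (the object name is arbitrary and the jar path is the same /fullpath placeholder used above; adjust both to your cluster):

import org.apache.spark.sql.SparkSession

object ReadJsonSerdeTable extends App {
  // Hive support is needed so spark.sql() can resolve tables in the Hive metastore
  val spark = SparkSession.builder()
    .appName("read-json-serde-table")
    .enableHiveSupport()
    .getOrCreate()

  // Same statement the answer above runs in spark-sql, issued through the API
  // before the table is touched, so the JsonSerDe class can be found
  spark.sql("ADD JAR /fullpath/hive-hcatalog-core-3.1.2.jar")

  val tempDF = spark.sql("SELECT * FROM test.record WHERE filedate = '2019-06-03' LIMIT 5")
  tempDF.show()

  spark.stop()
}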

Import MongoDB data into Hive Error: Splitter implementation is incompatible

I'm trying to import MongoDB data into Hive.
The jar versions that I have used are:
ADD JAR /root/HDL/mongo-java-driver-3.4.2.jar;
ADD JAR /root/HDL/mongo-hadoop-hive-2.0.2.jar;
ADD JAR /root/HDL/mongo-hadoop-core-2.0.2.jar;
And my cluster versions are
Ambari - Version 2.6.0.0, HDFS 2.7.3, Hive 1.2.1000, HBase 1.1.2, Tez 0.7.0
MongoDB Server version:- 3.6.5
Hive script:
CREATE TABLE sampletable
( ID STRING,
EmpID STRING,
BeginDate DATE,
EndDate DATE,
Time TIMESTAMP,
Type STRING,
Location STRING,
Terminal STRING)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"ID":"_id","EmpID":"emp_id","BeginDate":"begin_date","EndDate":"end_date","Time":"time","Type":"time_event_type","Location":"location","Terminal":"terminal"}')
TBLPROPERTIES('mongo.uri'='mongodb://username:password@10.10.170.43:27017/testdb.testtable');
Output:
hive> select * from sampletable;
OK
Failed with exception java.io.IOException:java.io.IOException: Failed to aggregate sample documents. Note that this Splitter implementation is incompatible with MongoDB versions prior to 3.2.
Please suggest how I can solve this.
Thanks,
Mohan V
set mongo.input.split_size=50;

Cannot run tests on h2 in-memory database, rather it runs on PostgreSQL

(I have multiple related questions, so I have highlighted them in bold.)
I have a play app.
play: 2.6.19
scala: 2.12.6
h2: 1.4.197
postgresql: 42.2.5
play-slick/play-slick-evolutions: 3.0.1
slick-pg: 0.16.3
I am adding a test for a DAO, and I believe it should run on an H2 in-memory database that is created when the tests start and cleared when they end.
However, my test always runs on the PostgreSQL database I configure and use:
# application.conf
slick.dbs.default.profile="slick.jdbc.PostgresProfile$"
slick.dbs.default.db.driver="org.postgresql.Driver"
slick.dbs.default.db.url="jdbc:postgresql://localhost:5432/postgres"
Here is my test test/dao/TodoDAOImplSpec.scala.
package dao
import play.api.inject.guice.GuiceApplicationBuilder
import play.api.test.{Injecting, PlaySpecification, WithApplication}
class TodoDAOImplSpec extends PlaySpecification {
val conf = Map(
"slick.dbs.test.profile" -> "slick.jdbc.H2Profile$",
"slick.dbs.test.db.driver" -> "org.h2.Driver",
"slick.dbs.test.db.url" -> "jdbc:h2:mem:test;MODE=PostgreSQL;DB_CLOSE_DELAY=-1;DATABASE_TO_UPPER=FALSE"
)
val fakeApp = new GuiceApplicationBuilder().configure(conf).build()
//val fakeApp = new GuiceApplicationBuilder().configure(inMemoryDatabase()).build()
//val fakeApp = new GuiceApplicationBuilder().configure(inMemoryDatabase("test")).build()
"TodoDAO" should {
"returns current state in local pgsql table" in new WithApplication(fakeApp) with Injecting {
val todoDao = inject[TodoDAOImpl]
val result = await(todoDao.index())
result.size should_== 0
}
}
}
For fakeApp, I tried all three, but none of them work as expected: my test still runs against my local PostgreSQL table (in which there are 3 todo items), so the test fails.
What I have tried/found:
First, inMemoryDatabase() simply returns a Map("db.<name>.driver" -> "org.h2.Driver", "db.<name>.url" -> "jdbc:h2:mem:play-test-xxx"), which looks very similar to my own conf map. However, there are 2 main differences:
inMemoryDatabase uses db.<name>.xxx while my conf map uses slick.dbs.<name>.db.xxx. Which one should be correct?
Second, renaming the conf map's keys to "slick.dbs.default.profile", "slick.dbs.default.db.driver" and "slick.dbs.default.db.url" throws an error:
[error] p.a.d.e.DefaultEvolutionsApi - Unknown data type: "status_enum"; SQL statement:
ALTER TABLE todo ADD COLUMN status status_enum NOT NULL [50004-197] [ERROR:50004, SQLSTATE:HY004]
cannot create an instance for class dao.TodoDAOImplSpec
caused by #79bg46315: Database 'default' is in an inconsistent state!
This finding is interesting: is it related to my use of the PostgreSQL ENUM type and slick-pg? (See the slick-pg issue with H2.) Does it mean this is the right configuration for running H2 in-memory tests? If so, the question becomes how to fake a PostgreSQL ENUM in H2.
Third, following this thread, I run sbt '; set javaOptions += "-Dconfig.file=conf/application-test.conf"; test' with a test configuration file conf/application-test.conf:
include "application.conf"
slick.dbs.default.profile="slick.jdbc.H2Profile$"
slick.dbs.default.db.driver="org.h2.Driver"
slick.dbs.default.db.url="jdbc:h2:mem:test;MODE=PostgreSQL;DB_CLOSE_DELAY=-1;DATABASE_TO_UPPER=FALSE"
Not surprisingly, I get the same error as in the second trial.
It seems to me that the second and third trials point in the right direction (I will work on this). But why must the name be set to default? Is there any better approach?
In Play the default database is named default. You could, however, change that to any other database name you want, but then you need to add the database name as well. For example, I want to have a comment database that has the user table:
CREATE TABLE comment.User(
id int(250) NOT NULL AUTO_INCREMENT,
username varchar(255),
comment varchar(255),
PRIMARY KEY (id));
Then I need the configuration to connect to it (add it to the application.conf file):
db.comment.url="jdbc:mysql://localhost/comment"
db.comment.username=admin-username
db.comment.password="admin-password"
You could have a test database for your testing, as mentioned above, and use it within your test.
Database tests locally: why not run the same database locally as you have in production? The production data is not there, and running the tests locally does not touch the production database, so why do you need an extra database?
Inconsistent state: this happens when the SQL you wrote changes the state of the current database, for example by creating a new table or deleting one.
Also, status_enum is obviously not recognized as a MySQL type. Try the commands you want to use in the MySQL console if you are not sure about them.
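To make the injected DAO actually use H2, the override has to target whichever database the DAO is bound to; a play-slick DAO that injects DatabaseConfigProvider uses default unless it is explicitly bound to another name, which is why keys under slick.dbs.test.* are silently ignored. A minimal sketch of the asker's second trial written out that way (it does not solve the status_enum evolution problem, which still needs an H2-compatible type or a separate test evolution):

package dao

import play.api.inject.guice.GuiceApplicationBuilder
import play.api.test.{Injecting, PlaySpecification, WithApplication}

class TodoDAOImplH2Spec extends PlaySpecification {
  // Override the *default* Slick database: keys under slick.dbs.test.* would create
  // a separate database named "test" that the injected TodoDAOImpl never looks at.
  val testConf = Map(
    "slick.dbs.default.profile" -> "slick.jdbc.H2Profile$",
    "slick.dbs.default.db.driver" -> "org.h2.Driver",
    "slick.dbs.default.db.url" ->
      "jdbc:h2:mem:test;MODE=PostgreSQL;DB_CLOSE_DELAY=-1;DATABASE_TO_UPPER=FALSE"
  )
  val fakeApp = new GuiceApplicationBuilder().configure(testConf).build()

  "TodoDAO" should {
    "see an empty table in the in-memory database" in new WithApplication(fakeApp) with Injecting {
      val todoDao = inject[TodoDAOImpl]
      val result = await(todoDao.index())
      result.size should_== 0
    }
  }
}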

Getting DB2 SQL Error while firing a select query using Worklight 6.1.0?

I am trying to connect to DB2 on my local LAN using Worklight 6.1.0 and firing a SELECT query to look up whether data exists, but I am getting the error below:
{
"errors": [
"Runtime: DB2 SQL Error: SQLCODE=-204, SQLSTATE=42704, SQLERRMC=DATABASE_NAME.REGISTRATION, DRIVER=3.58.82.\nPerformed query:\nSELECT * FROM DATABASE_NAME.registration where DATABASE_NAME.registration.Mob_No = ?"
],
"info": [
],
"isSuccessful": false,
"warnings": [
]
}
My SQL adapter configuration looks like this:
<connectionPolicy xsi:type="sql:SQLConnectionPolicy">
<!-- Example for using a JNDI data source, replace with actual data source name -->
<!-- <dataSourceJNDIName>java:/data-source-jndi-name</dataSourceJNDIName> -->
<!-- Example for using MySQL connector, do not forget to put the MySQL connector library in the project's lib folder -->
<dataSourceDefinition>
<driverClass>com.ibm.db2.jcc.DB2Driver</driverClass>
<url>jdbc:db2://172.21.11.129:50000/MOBILEDB</url>
<user>db2admin</user>
<password>Newuser123</password>
</dataSourceDefinition>
</connectionPolicy>
And the JS file which has the procedure looks like this:
var selectStatement1 = "SELECT * FROM DATABASE_NAME.registration where DATABASE_NAME.registration.Mob_No = ?";
var procStmt1 = WL.Server.createSQLStatement(selectStatement1);
function registrationLookup(mobile){
WL.Logger.debug("Inside registrationLookup");
return WL.Server.invokeSQLStatement(
{
preparedStatement : procStmt1,
parameters : [mobile]
}
);
}
I did some research about connecting DB2 with Worklight and learned that I need to put the data below in the worklight.properties file:
wl.db.username=db2admin
wl.db.type=DB2
wl.db.password=Newuser123
wl.db.driver=com.ibm.db2.jcc.DB2Driver
But after adding it, I am not able to deploy the adapter and the error says 'db2admin' does not exist, so I have skipped this step in the context of the current question. Going through the error that I get without adding this worklight.properties data, it seems to me that it is an 'Object doesn't exist' case as per http://www-01.ibm.com/support/docview.wss?uid=swg21613531, or that the user table does not exist. Any suggestion would be helpful.
NOTE:
My IP address is 172.21.11.125, from where I am invoking the adapter for DB2.
The DB2 instance is running on 172.21.11.129, port 50000.
I have already added db2jcc_license_cu_9.5.jar and db2jcc_9.5.jar to server/lib. Their names had '_9.5' appended, which I removed from both jars, keeping only db2jcc_license_cu.jar and db2jcc.jar.
The error message is saying that your SQL statement is invalid, and I therefore infer that your connection to the DB is fine.
To diagnose this, first run the SQL using the DB2 command line or other tools. My guess is you mean
LARSEN.registration
whereas you are saying DATABASE_NAME.registration
I got the answer to my own question, and it's a really interesting one. Thanks to https://stackoverflow.com/users/2260967/glen-misquith (Glen Misquith).
The problem was with how the SQL query is framed, which is different from what I did with MySQL.
Adapter's JavaScript file:
var selectStatement4 = "UPDATE \"LARSEN\".\"registration\" SET \"LARSEN\".\"registration\".\"Pass\"=?, \"LARSEN\".\"registration\".\"Re_Pass\"=? WHERE \"User_Name\" = ?";
var procStmt5 = WL.Server.createSQLStatement(selectStatement4);
function updatePassword(username,pass,repass){
WL.Logger.debug("Inside updatePassword "+username+" "+pass+" "+repass);
return WL.Server.invokeSQLStatement(
{
preparedStatement : procStmt5,
parameters : [pass,repass,username]
}
);
}
It is quite a strange thing that the identifiers need to be wrapped in escaped double quotes in the SQL statement while preparing it.
I would really like to understand this behavior of DB2. Also, I can't directly write 'SELECT * FROM schema.table_name' and have to write precisely the column names from which the data needs to be fetched.
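The behavior itself is standard SQL identifier folding: DB2 upper-cases unquoted identifiers, so a table or column created with double quotes around a mixed-case name (e.g. "registration", "Mob_No") can only be referenced by quoting it again; the backslashes in the adapter are just JavaScript string escaping for those quotes. A minimal standalone sketch outside Worklight, reusing the connection details from the adapter configuration above (the mobile number is only a placeholder):

import java.sql.DriverManager

object Db2QuotingCheck extends App {
  Class.forName("com.ibm.db2.jcc.DB2Driver")
  val conn = DriverManager.getConnection(
    "jdbc:db2://172.21.11.129:50000/MOBILEDB", "db2admin", "Newuser123")
  try {
    // Unquoted:  SELECT * FROM LARSEN.registration     -> DB2 looks up LARSEN.REGISTRATION
    // Quoted:    SELECT * FROM "LARSEN"."registration" -> matches the exact mixed-case name
    val stmt = conn.prepareStatement(
      """SELECT * FROM "LARSEN"."registration" WHERE "Mob_No" = ?""")
    stmt.setString(1, "9999999999") // placeholder value
    val rs = stmt.executeQuery()
    while (rs.next()) println(rs.getString("User_Name"))
  } finally {
    conn.close()
  }
}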