How to import data from PostgreSQL to Hadoop?

I'm just a beginner in Hadoop and one of my colleagues asked me for help in migrating some PostgreSQL tables to Hadoop. Since I don't have much experience with PostgreSQL (I know databases though), I am not sure what the best way to do this migration would be. One of my ideas was to export the tables as JSON data and then process them from Hadoop, as in this example: http://www.codeproject.com/Articles/757934/Apache-Hadoop-for-Windows-Platform. Are there better ways to import data (tables & databases) from PostgreSQL to Hadoop?

Sqoop (http://sqoop.apache.org/) is a tool made precisely for this. Go through the documentation; Sqoop provides the best and easiest way to transfer your data.

Use the command below; it works for me.
sqoop import --driver=org.postgresql.Driver --connect jdbc:postgresql://localhost/your_db --username you_user --password your_password --table employees --target-dir /sqoop_data -m 1
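If you need to move a whole database rather than a single table, Sqoop also ships an import-all-tables tool. A rough sketch (connection details are placeholders; tables without a primary key need --autoreset-to-one-mapper or -m 1):
sqoop import-all-tables --driver=org.postgresql.Driver --connect jdbc:postgresql://localhost/your_db --username you_user --password your_password --warehouse-dir /sqoop_data --autoreset-to-one-mapper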

Related

How to migrate RethinkDb into MongoDb?

My application is using RethinkDB. Everything is running fine, but a new requirement means the DB needs to be migrated to MongoDB.
Is this possible? How do I migrate the tables/collections, data, indexes, etc?
What about blob types, auto-increment ids?
Thanks!
Is this possible? How do I migrate the tables/collections, data, indexes, etc?
One way to migrate data from RethinkDB to MongoDB is to export data from RethinkDB using the rethinkdb dump command and then use mongoimport to import it into MongoDB. For example:
rethinkdb dump -e dbname.tableName
This would generate an archive file:
rethinkdb_dump_<datetime>.tar.gz
After uncompressing the archive file, you can then use mongoimport as below:
mongoimport --jsonArray --db dbName --collection tableName ./rethinkdb_dump_<datetime>/dbName/tableName.json
Unfortunately, for the indexes the formats used by RethinkDB and MongoDB are quite different. The index definitions are stored within the same archive:
./rethinkdb_dump_<datetime>/dbName/tableName.info
You can, however, write a Python script to read the info file and use the MongoDB Python driver (PyMongo) to create the indexes in MongoDB; see the create_indexes() method for more information.
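As a rough illustration, a script along these lines could rebuild simple indexes; the exact layout of the .info file (an "indexes" key carrying the index names) is an assumption, so check the actual file before relying on it:
import json
from pymongo import ASCENDING, IndexModel, MongoClient
# Path to the table info file produced by `rethinkdb dump` (placeholder path).
info_path = './rethinkdb_dump/dbName/tableName.info'
with open(info_path) as f:
    info = json.load(f)
client = MongoClient('mongodb://localhost:27017')
collection = client['dbName']['tableName']
# Assumption: the .info file lists secondary indexes under an "indexes" key.
# Only simple single-field indexes can be recreated mechanically; compound or
# expression indexes need manual translation.
index_names = [idx['index'] if isinstance(idx, dict) else idx
               for idx in info.get('indexes', [])]
models = [IndexModel([(name, ASCENDING)]) for name in index_names]
if models:
    collection.create_indexes(models)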
One of the reasons for suggesting Python is that RethinkDB also has an official Python client driver, so technically you can also skip the export stage and write a script that connects your RethinkDB to MongoDB directly.
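A minimal sketch of that direct approach, assuming the rethinkdb and pymongo packages, default local ports, and placeholder database/table names:
from pymongo import MongoClient
from rethinkdb import RethinkDB  # older drivers use `import rethinkdb as r` instead
r = RethinkDB()
conn = r.connect('localhost', 28015)  # default RethinkDB port
target = MongoClient('mongodb://localhost:27017')['dbName']['tableName']
# Stream documents out of RethinkDB and write them to MongoDB in batches.
batch = []
for doc in r.db('dbName').table('tableName').run(conn):
    # RethinkDB's primary key is "id"; map it onto MongoDB's "_id".
    if 'id' in doc:
        doc['_id'] = doc.pop('id')
    batch.append(doc)
    if len(batch) >= 1000:
        target.insert_many(batch)
        batch = []
if batch:
    target.insert_many(batch)
conn.close()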

Is there any function in IBM Netezza similar to MySQL triggers?

As you can see from the title, I want to know whether there is anything similar to MySQL's trigger function. What I actually want to do is import data from IBM Netezza databases using Sqoop's incremental mode. Below is the Sqoop script I'm going to use.
sqoop job --create dhjob01 -- import --connect jdbc:netezza://10.100.3.236:5480/TEST \
--username admin --password password \
--table testm \
--incremental lastmodified \
--check-column 'modifiedtime' --last-value '1995-07-18' \
--target-dir /user/dhlee/nz_sqoop_test \
-m 1
As the official Sqoop documentation says, I can gather data from RDBMSs in incremental mode by creating a Sqoop import job and executing it repeatedly.
Anyway, the point is that I need something like a MySQL trigger so that the modified date is updated whenever tables in Netezza are updated. And if you have any other good ideas for gathering the data incrementally, please tell me. Thank you.
Unfortunately there isn't anything similar to triggers available. I would recommend modifying the relevant UPDATE commands to also set a column to CURRENT_TIMESTAMP.
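For example, reusing the check column from the question (the other column name and the WHERE clause are only illustrative):
UPDATE testm SET some_column = 'new value', modifiedtime = CURRENT_TIMESTAMP WHERE id = 42;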
In Netezza you have something even better:
- Deleted records are still visible: http://dwgeek.com/netezza-recover-deleted-rows.html/
- the INSERT and DELETE transaction IDs (TXIDs) are rising numbers (and visible on all records, as described in the link above)
- updates are really a delete plus an insert
Can you follow me?
[Screenshot: the result I got after inserting and deleting some rows]
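If you want to experiment with this, the statements involved look roughly like the following; the show_deleted_records setting and the hidden createxid/deletexid columns are taken from the linked article, so verify them on your own system:
SET show_deleted_records = TRUE;
SELECT createxid, deletexid, testm.* FROM testm;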

IBM WCS and DB2 : Want to export all catentries data from one DB and import into another DB

Basically I have two environments, production and QA. On the QA DB the data is not the same as on production, so my QA team is unable to test properly. I therefore want to import all catentries/product-related data into the QA DB from production. I have searched a lot but have not found any solution for this.
Maybe I need to find all product-related tables, export them one by one, and then import them into the dev DB, but I'm not sure.
Can anyone please guide me on this? How can I do this activity following best practices?
I am using DB2.
The WebSphere Commerce data model is documented, which will help you identify all related tables. You can then use the DB2 utility db2move to export (and later load) those tables in one shot. For example,
db2move yourdb export -sn yourschema -tn catentry,catentrel,catentdesc,catentattr
Be sure to list all tables you need, separated by commas with no spaces. You can specify patterns to match table names:
db2move yourdb export -sn yourschema -tn "catent*,listprice"
db2move will create a file db2move.lst that lists all extracted tables, so you can load all data with:
db2move yourQAdb load -lo replace
running from the same directory.

Can we get the postgres db dump using SQLAlchemy?

Is it possible to take a Postgres database dump (pg_dump) using SQLAlchemy? I can get the dump using pg_dump, but I am doing all other DB operations using SQLAlchemy and thus want to know if this dump operation is also possible through SQLAlchemy. Any suggestion or link would be of great help.
Thanks,
Tara Singh
pg_dump is a system command, so I do not think you can get a Postgres database dump using SQLAlchemy.
SQLAlchemy does not provide anything like pg_dump. You could probably mimic it with a bunch of queries, but it would be painful.
The easier way is to use pg_dump itself inside a Python script with os.system or subprocess.call, as in the sketch below.
If it's for regular backups, also have a look at the Safekeep project, which talks to your databases on your behalf.
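A minimal sketch of that approach, assuming pg_dump is on the PATH and using placeholder connection details and output path (the password can come from the PGPASSWORD environment variable or ~/.pgpass):
import subprocess
# Call pg_dump from Python; SQLAlchemy stays in charge of the regular queries.
rc = subprocess.call([
    'pg_dump',
    '--host', 'localhost',
    '--port', '5432',
    '--username', 'postgres',
    '--format', 'custom',      # custom format, restorable later with pg_restore
    '--file', '/tmp/mydb.dump',
    'mydb',
])
if rc != 0:
    raise RuntimeError('pg_dump failed with exit code %d' % rc)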

Delete and restore a Postgres database in Sikuli

I have used the code below to delete a Postgres database. My issue is that I am unable to find a query that would restore the database. Please provide your assistance. Thank you!!
from __future__ import with_statement
import sys
from sikuli import *
from com.ziclix.python.sql import zxJDBC
load("C:\\Test\\SikuliX\\postgresql-9.4.1207.jre6")
connection2 = zxJDBC.connect('jdbc:postgresql://127.0.0.1/?stringtype=unspecified', 'postgres', 'pswd#123', 'org.postgresql.Driver')
connection2.autocommit = True
curs = connection2.cursor()
curs.execute('DROP DATABASE IF EXISTS sampledb')
curs.execute( < I need query to restore database>)
There is no "command" to restore a PostgreSQL database. Usually a backup consists of a very large bunch of commands that piece by piece recreate a database that is mostly identical to what you backed up before. Your only possibility is to use the shell command pg_restore.
Oh, and your question has nothing to do with sikuli-ide. The problem is programming language agnostic.
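If you really need a true restore from inside the script, here is a minimal sketch of shelling out to pg_restore; it assumes a custom-format dump taken earlier with pg_dump, and the dump path and credentials are placeholders:
import subprocess
# Recreate the dropped database from a custom-format dump (path is a placeholder).
# --create re-issues CREATE DATABASE, so we connect to the existing "postgres" db.
subprocess.call([
    'pg_restore',
    '--host', '127.0.0.1',
    '--username', 'postgres',
    '--create',
    '--dbname', 'postgres',
    'C:\\Test\\SikuliX\\sampledb.dump',
])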
I found a workaround for this: instead of restoring, we can copy the database:
curs.execute('CREATE DATABASE [newdb] WITH TEMPLATE [originaldb] OWNER dbuser')
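For illustration, the same idea with the placeholders filled in; the template name below is an assumption, standing for a pristine copy kept around for this purpose, and the template database must have no other active connections while it is copied:
curs.execute('CREATE DATABASE sampledb WITH TEMPLATE sampledb_template OWNER postgres')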