Will sqoop export create duplicates when the number of mappers is higher than the number of blocks in the source hdfs location?
My source HDFS directory has 24 million records, and when I do a Sqoop export to a Postgres table it somehow creates duplicate records. I have set the number of mappers to 24. There are 12 blocks in the source location.
Any idea why Sqoop is creating duplicates?
Sqoop Version: 1.4.5.2.2.9.2-1
Hadoop Version: Hadoop 2.6.0.2.2.9.2-1
Sqoop command used:
sqoop export -Dmapred.job.queue.name=queuename \
--connect jdbc:postgresql://ServerName/database_name \
--username USER --password PWD \
--table Tablename \
--input-fields-terminated-by "\001" --input-null-string "\\\\N" --input-null-non-string "\\\\N" \
--num-mappers 24 -m 24 \
--export-dir $3/penet_baseline.txt -- --schema public;
bagavathi, you mentioned that duplicate rows were seen in the target table, that adding a PK constraint failed due to a PK violation, and that the source does not have duplicate rows. One possible scenario is that your target table already contained records, perhaps from a previous incomplete Sqoop job. Please check whether the target table has keys that are also present in the source.
One workaround for this scenario is to use the parameter "--update-mode allowinsert". In your command, add the parameters --update-key <key-column> --update-mode allowinsert. This ensures that if the key is already present in the table the record is updated, and if the key is not present Sqoop performs an insert.
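For example, a hedged sketch of the same export with upsert enabled (here "id" is only a placeholder for your actual key column, and support for allowinsert varies by connector):
sqoop export -Dmapred.job.queue.name=queuename \
--connect jdbc:postgresql://ServerName/database_name \
--username USER --password PWD \
--table Tablename \
--update-key id --update-mode allowinsert \
--input-fields-terminated-by "\001" --input-null-string "\\\\N" --input-null-non-string "\\\\N" \
--num-mappers 24 \
--export-dir $3/penet_baseline.txt -- --schema public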
No, Sqoop does not export records twice, and this has nothing to do with the number of mappers relative to the number of blocks.
Look at the pg_bulkload connector of Sqoop for faster data transfer between HDFS and Postgres.
The pg_bulkload connector is a direct connector for exporting data into PostgreSQL. It uses pg_bulkload, so users benefit from its functionality, such as fast exports that bypass shared buffers and WAL, flexible handling of error records, and ETL features with filter functions.
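A minimal sketch of invoking it, assuming the connection-manager class name given in the Sqoop docs (org.apache.sqoop.manager.PGBulkloadManager) and that pg_bulkload is installed on the worker nodes:
sqoop export \
--connect jdbc:postgresql://ServerName/database_name \
--username USER --password PWD \
--connection-manager org.apache.sqoop.manager.PGBulkloadManager \
--table Tablename \
--export-dir $3/penet_baseline.txt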
By default, sqoop-export appends new rows to a table; each input record is transformed into an INSERT statement that adds a row to the target database table. If your table has constraints (e.g., a primary key column whose values must be unique) and already contains data, you must take care to avoid inserting records that violate these constraints. The export process will fail if an INSERT statement fails. This mode is primarily intended for exporting records to a new, empty table intended to receive these results.
If you have used Sqoop incremental mode, there may be some duplicate records on HDFS. Before running the export to Postgres, collect all unique records (based on the max of a date or timestamp column) into one table and then do the export.
I think this should work.
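For instance, a hedged sketch of deduplicating in Hive before the export, keeping only the newest row per key and writing it into a pre-created table (all table and column names here are placeholders):
hive -e "
-- keep only the most recent row for each id, based on the timestamp column
INSERT OVERWRITE TABLE penet_baseline_dedup
SELECT t.id, t.col1, t.col2, t.modified_ts
FROM (
  SELECT b.*, ROW_NUMBER() OVER (PARTITION BY b.id ORDER BY b.modified_ts DESC) AS rn
  FROM penet_baseline b
) t
WHERE t.rn = 1;
"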
Related
I have a Java program exporting selected data from several tables of a Postgres database. The logic is:
set the isolation level to TRANSACTION_REPEATABLE_READ
select data from table 1 and export them to an external file
select data from table 2 and export them
commit
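(For illustration, the same pattern sketched as a psql session rather than JDBC; database, table, and file names are placeholders.)
psql -d database_name <<'SQL'
BEGIN ISOLATION LEVEL REPEATABLE READ;
\copy (SELECT * FROM table1) TO 'table1_export.csv' CSV
\copy (SELECT * FROM table2) TO 'table2_export.csv' CSV
COMMIT;
SQL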
While the program is running, if I run pg_dump to backup the database, pg_dump will be blocked until my program is finished.
I am using REPEATABLE_READ to make sure my exported data is consistent, i.e. not affected by other concurrent transactions. Any suggestions on how to get a consistent set of data without blocking pg_dump?
Thanks
As you can see in the title, I want to know whether there is something similar to MySQL's trigger function. What I actually want to do is import data from IBM Netezza databases using Sqoop incremental mode. Below is the Sqoop script I'm going to use.
sqoop job --create dhjob01 -- import --connect jdbc:netezza://10.100.3.236:5480/TEST \
--username admin --password password \
--table testm \
--incremental lastmodified \
--check-column 'modifiedtime' --last-value '1995-07-18' \
--target-dir /user/dhlee/nz_sqoop_test \
-m 1
As the official Sqoop documentation says, I can gather data from RDBMSs in incremental mode by making a Sqoop import job and executing it repeatedly.
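For example, after the job above has been created, each later incremental pull is just a re-execution of the saved job (the stored last-value is updated automatically):
sqoop job --exec dhjob01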
Anyway, the point is that I need something like a MySQL trigger so that the modified date is updated whenever tables in Netezza are updated. And if you have any good idea for how I can gather the data incrementally, please tell me. Thank you.
Unfortunately there isn't anything similar to triggers available. I would recommend modifying the relevant UPDATE commands to include setting a column to CURRENT_TIMESTAMP.
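For example, a hedged sketch of such an UPDATE, run here through the nzsql client (the column names and the predicate are illustrative):
nzsql -d TEST -c "UPDATE testm SET some_col = 'new value', modifiedtime = CURRENT_TIMESTAMP WHERE id = 42;"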
In Netezza you have something even better:
- Deleted records are still possible to see: http://dwgeek.com/netezza-recover-deleted-rows.html/
- The INSERT and DELETE TXIDs are rising numbers (and visible on all records, as described in the link above).
- Updates are really a delete plus an insert.
Can you follow me?
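A hedged sketch of inspecting this, assuming the nzsql client and that the session setting and hidden columns (rowid, createxid, deletexid) are as described in the linked article:
nzsql -d TEST <<'SQL'
set show_deleted_records = true;
select rowid, createxid, deletexid, testm.* from testm;
SQL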
[screenshot of the query results]
This is the screen shot that I've got after I inserted and deleted some rows.
Is there any way to exclude some partitioned tables while dumping data using pg_dump.exe on the Windows command line?
I have tried the following pg_dump flag to exclude them, but it's not working:
-T schema_name.table.*
I found the solution. Since all my partitions start with "log_history", I added the flag -T schema_name.log_history* to exclude them while dumping data.
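For reference, a hedged sketch of the full command (host, user, database, and output file names are placeholders; the pattern is quoted to keep the * intact):
pg_dump.exe -h localhost -U postgres -d my_database -T "schema_name.log_history*" -f dump.sql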
Is there a way to define a database file for a data range in one table in PostgreSQL? I need to move data between PostgreSQL databases and am thinking that if I use a database file for the movement (rather than SQL statements) it will be faster. Will file movement be faster than SQL INSERT queries, and is this a good solution?
Copying a full database with Postgres is possible, but copying per table will require smarter replication.
For copying the full database you can do something like this:
psql -c "select pg_start_backup('backup label', true);"
rsync -av --exclude pg_xlog --exclude postgresql.conf --exclude postgresql.pid data remote_server:data
psql -c 'select pg_stop_backup();'
Copying a part of a database is somewhat more complicated. Several options are available for that: previously Slony was the recommended one, but that project is inactive these days. Right now your options are Foreign Data Wrappers, PL/Proxy, or one of the options mentioned in the Postgres wiki: https://wiki.postgresql.org/wiki/Replication,_Clustering,_and_Connection_Pooling#Clustering
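For the per-table case, a minimal sketch of the Foreign Data Wrapper route using postgres_fdw (server, database, table, and credential names are placeholders):
psql -d target_db <<'SQL'
CREATE EXTENSION IF NOT EXISTS postgres_fdw;
CREATE SERVER src_server FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'source-host', dbname 'source_db');
CREATE USER MAPPING FOR CURRENT_USER SERVER src_server
    OPTIONS (user 'src_user', password 'src_password');
IMPORT FOREIGN SCHEMA public LIMIT TO (my_table) FROM SERVER src_server INTO public;
-- materialise the remote rows locally
CREATE TABLE my_table_copy AS SELECT * FROM my_table;
SQL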
I have a database containing a very large table including binary data which I want to update on a remote machine, once a day. Rather than dumping the entire table, transferring and recreating it on the remote machine, I want to dump only the updates, then use that dump to update the remote machine.
I already understand that I can produce the dump file as follows.
mysqldump -u user --password=pass --quick --result-file=dump_file \
--where "Updated >= one_day_ago" \
database_name table_name
1) Does the resulting "restore" on the remote machine
mysql -u user --password=pass database_name < dump_file
only update the necessary rows? Or are there other adverse effects?
2) Is this the best way to do this? (I am unable to pipe to the server directly using the --host option.)
If you only dump rows where the WHERE clause is true, you get a .sql file that contains only the rows you want to update, so you will never get duplicate values with the current export options. However, simply inserting these into the other database will not work: you have to use the command-line parameter --replace. Otherwise, if your dump contains a row with id 6 in table table1 and you try to import it into your other database, you'll get a duplicate-key error if a row with id = 6 already exists there. With --replace, newer values overwrite the older ones, which can only happen when a newer row exists (according to your WHERE clause).
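For reference, a hedged sketch combining these flags (adding --no-create-info so the restore does not drop and recreate the table; that extra flag is my assumption, not part of the original command):
mysqldump -u user --password=pass --quick --replace --no-create-info \
--where "Updated >= one_day_ago" \
--result-file=dump_file database_name table_name
mysql -u user --password=pass database_name < dump_file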
So to quickly answer:
1) Yes, this will restore correctly on the remote machine, but only if you dumped using --replace (the restore then applies the latest version of each row in your dump file).
2) I am not entirely sure whether you can pipe backups. According to this website you can, but I have never tried it.