When I run this command in the shell, it works fine:
sqoop import --incremental append --check-column id_civilstatus --last-value -1
--connect jdbc:postgresql://somehost/somedb --username someuser
--password-file file:///passfile.txt --table sometable --direct -m 3
--target-dir /jobs/somedir -- --schema someschema
But when I try to save it as a job:
sqoop job --create myjob -- import --incremental append --check-column id_civilstatus
--last-value -1 --connect jdbc:postgresql://somehost/somedb --username someuser
--password-file file:///passfile.txt --table sometable --direct -m 3
--target-dir /jobs/somedir -- --schema someschema
Then I execute:
sqoop job --exec myjob
I get this error message:
PSQLException: ERROR: relation "sometable" does not exist
This error occurs because 'sometable' does not exist in the default schema.
Why doesn't sqoop job take the schema parameter? Am I missing something?
Thanks
You can specify or change the default schema by passing "?currentSchema=myschema" in the JDBC connection string (see the PostgreSQL JDBC driver documentation for more detail):
jdbc:postgresql://localhost:5432/mydatabase?currentSchema=myschema
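Applied to the job from the question, that would look roughly like this (a sketch of the suggestion above; only the connect string changes, and I have not verified how it interacts with --direct):
sqoop job --create myjob -- import --incremental append --check-column id_civilstatus \
--last-value -1 --connect "jdbc:postgresql://somehost/somedb?currentSchema=someschema" \
--username someuser --password-file file:///passfile.txt --table sometable --direct -m 3 \
--target-dir /jobs/somedir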
You don't need to mention the schema separately. You can either keep it in the JDBC URL (I'm not sure whether the Postgres JDBC URL has that option or not) or add it in the table option itself, something like the below:
--table schemaName.tableName
Use the following as your JDBC URL
jdbc:postgresql://somehost/somedb/someschema
and remove --schema someschema from the Sqoop Statement.
I found a way to make this work:
sqoop job --exec myjob -- -- --schema someschema
Related
I am trying to transfer a 35 GB table from AWS RDS Postgres to Hive, but when I try a full-table load it takes a long time and the execution eventually stops, so I decided to load it incrementally.
Schema (all columns are varchar except where noted below):
twoway_id
twoway_seq =>int
guid
twoway_section_cd
transmit_host_nm
receipt_host_nm
service_id
msg_id
msg_no
vin_id
twoway_dt
status_result_cd
company_cd
limit_count_yn
response_status_cd
create_user_id
create_app_id
create_tmsmp =>timestamp
update_user_id
update_app_id
update_tmsmp =>timestamp
kshsjsjsj 320393682 IN K 02 TMU ISS CMM GPI 14 0 20201800230936 FAIL 02 Y 500 ISS ISS 2020-12-02 17:36:36.447 ISS ISS 2020-12-02 17:36:36.462
326403236
Sqoop query: this query works perfectly for a full table load.
sqoop import --connect "jdbc:postgresql://hostname:5432/db_core_k"\
--query 'SELECT * FROM db_core.service_twoway_ifo_202012 where 1=1 AND $CONDITIONS'\
--m 1 --target-dir "/user/hive/warehouse/db_core.db/service_twoway_ifo_202112"\
--username test --password test001 \
--hive-import --hive-table db_core.service_twoway_ifo_202012\
--hive-drop-import-delims\
--hive-overwrite --hs2-url jdbc:hive2://hivehostname:10000/default;
I tried the following, but it is not working:
sqoop import --connect "jdbc:postgresql://hostname:5432/db_core_k"\
--query 'SELECT * FROM db_core.service_twoway_ifo_202012 where create_tmsmp like '2020-12-01%' AND $CONDITIONS'\
--m 1 --target-dir "/user/hive/warehouse/db_core.db/service_twoway_ifo_202112"\
--username test --password test001 \
--hive-import --hive-table db_core.service_twoway_ifo_202012\
--hive-drop-import-delims\
--hive-overwrite --hs2-url jdbc:hive2://hivehostname:10000/default;
Error:
21/12/29 16:02:43 ERROR manager.SqlManager: Error executing statement: org.postgresql.util.PSQLException:
ERROR: operator does not exist: integer % boolean
Hint: No operator matches the given name and argument types. You might need to add explicit type casts.
This is also not working:
sqoop import --connect "jdbc:postgresql://hostname:5432/db_core_k \
--table `db_core.service_twoway_ifo_202112`\
--m 1 --target-dir "/user/hive/warehouse/db_core.db/service_twoway_ifo_202112"\
--username test --password test001 \
--where "create_tmsmp < 2020-12-04 04:51:26.150"\
--append
Please also help me with the incremental load query; I am facing syntax errors there as well.
Incremental import arguments:
Argument Description
--check-column (col) Specifies the column to be examined when determining which rows to import.
--incremental (mode) Specifies how Sqoop determines which rows are new. Legal values for mode include append and lastmodified.
--last-value (value) Specifies the maximum value of the check column from the previous import.
I think the way you are passing --where needs to be changed as shown below. You need to use single quotes around strings.
sqoop import --connect "jdbc:postgresql://hostname:5432/db_core_k" \
--table db_core.service_twoway_ifo_202112 \
--m 1 --target-dir "/user/hive/warehouse/db_core.db/service_twoway_ifo_202112" \
--username test --password test001 \
--where "create_tmsmp < TO_TIMESTAMP('2020-12-04 04:51:26','YYYY-MM-DD HH:MI:SS')"
or
sqoop import --connect "jdbc:postgresql://hostname:5432/db_core_k" \
--query "SELECT * FROM db_core.service_twoway_ifo_202012 where create_tmsmp < TO_TIMESTAMP('2020-12-04 04:51:26','YYYY-MM-DD HH:MI:SS') AND \$CONDITIONS" \
--m 1 --target-dir "/user/hive/warehouse/db_core.db/service_twoway_ifo_202112" \
--username test --password test001 \
--hive-import --hive-table db_core.service_twoway_ifo_202012
Please change the position of \$CONDITIONS if the above doesn't work.
Now you can implement incremental Sqoop imports as below.
NEW ROWS: you can use --incremental append to append new records. Identify a column that can be used to determine brand-new records, such as a primary key, and calculate max() of that column in Hive (I assumed it to be 1000). Load the data using the Sqoop command below.
sqoop import --connect "jdbc:postgresql://hostname:5432/db_core_k" \
--table db_core.service_twoway_ifo_202112 \
--m 1 --target-dir "/user/hive/warehouse/db_core.db/service_twoway_ifo_202112" \
--username test --password test001 \
--check-column pk_col \
--incremental append \
--last-value 1000
So the above Sqoop command will append any row from Postgres where pk_col > 1000.
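To get the starting --last-value, one way to read that max() from Hive is via beeline against the HiveServer2 URL used earlier in the thread (a sketch; pk_col is just the placeholder column name used above):
beeline -u jdbc:hive2://hivehostname:10000/default -e "SELECT MAX(pk_col) FROM db_core.service_twoway_ifo_202012;"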
EXISTING ROWS: similarly, you may want to bring modified rows from the source. You first need to calculate max() of create_tmsmp and then use it in the statement below (I assumed the max to be 2020-12-04 04:51:26).
sqoop import --connect "jdbc:postgresql://hostname:5432/db_core_k" \
--table db_core.service_twoway_ifo_202112 \
--m 1 --target-dir "/user/hive/warehouse/db_core.db/service_twoway_ifo_202112" \
--username test --password test001 \
--check-column create_tmsmp \
--incremental lastmodified \
--last-value "2020-12-04 04:51:26"
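One caveat, which also comes up in a later question below: re-running a lastmodified import into a target directory that already exists requires either --append or --merge-key <key-column>, otherwise the job fails with a FileAlreadyExistsException. A sketch with --append added (same placeholder names as above):
sqoop import --connect "jdbc:postgresql://hostname:5432/db_core_k" \
--table db_core.service_twoway_ifo_202112 \
--m 1 --target-dir "/user/hive/warehouse/db_core.db/service_twoway_ifo_202112" \
--username test --password test001 \
--check-column create_tmsmp \
--incremental lastmodified \
--append \
--last-value "2020-12-04 04:51:26"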
I wrote the following Sqoop command:
sqoop import --connect jdbc:mysql://localhost/export --username root --password cloudera --table cust --create-hive-table --fields-terminated-by ' ' --hive-table default.cust -m 1
Then I could not find the table in the default database, but the file appeared in /user/cloudera/cust.
Use --hive-import, and --hive-overwrite if you want to overwrite an existing table. You can also mention the --target-dir.
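A sketch of the corrected command, assuming the same connection details as in the question (only --hive-import is added; on re-runs you could swap --create-hive-table for --hive-overwrite):
sqoop import --connect jdbc:mysql://localhost/export --username root --password cloudera \
--table cust --hive-import --create-hive-table --fields-terminated-by ' ' \
--hive-table default.cust -m 1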
I am trying to import a table from Postgresql to a Parquet file on HDFS.
Here is what I do:
sqoop import \
--connect "jdbc:postgresql://pg.foo.net:5432/bar" \
--username user_me --password $PASSWORD \
--table foo.bar \
--target-dir /user/me/bar \
--as-parquetfile
and I get
INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "foo.bar" AS t LIMIT 1
ERROR manager.SqlManager: Error executing statement: org.postgresql.util.PSQLException: ERROR: relation "foo.bar" does not exist
SELECT t.* FROM "foo.bar" AS t LIMIT 1 indeed does not work, but SELECT t.* FROM foo.bar AS t LIMIT 1 does. So the problem is that the table name is quoted. I tried supplying the --table argument in different ways, but with no effect.
How do I work around this?
EDIT
As the docs you linked state, there is a --schema argument. For some reason it is not mentioned in sqoop help import.
Another weird thing is that
--table bar --schema foo
still does not work, but
--table bar -- --schema foo
does.
Anyway, it works now. Thanks for linking the relevant docs section!
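For reference, the full command that ended up working looks roughly like this (same details as above, with the connector-specific argument after the extra --):
sqoop import \
--connect "jdbc:postgresql://pg.foo.net:5432/bar" \
--username user_me --password $PASSWORD \
--table bar \
--target-dir /user/me/bar \
--as-parquetfile \
-- --schema foo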
The table name is bar, foo is the name of the schema.
According to the docs, you should do it like this:
sqoop import \
(...)
--table bar \
--schema foo
(...)
According to the documentation you need to specify the schema separately:
sqoop import \
--connect "jdbc:postgresql://pg.foo.net:5432/bar" \
--username user_me --password $PASSWORD \
--table bar \
--schema foo \
--target-dir /user/me/bar \
--as-parquetfile
Is there any way to mention different schema when exporting data to postgresql using Sqoop?
Based on http://sqoop.apache.org/docs/1.4.4/SqoopUserGuide.html, I need to use -- --schema, which is pretty weird, and it doesn't work.
I tried to use --schema as well, but still same result.
-- --schema works with list-tables command but not with export command.
Any help will be highly appreciated.
It finally worked. In order to use "-- --schema", we need to provide that option at the very end, not in the middle. So this one will work:
--connect jdbc:postgresql://xxx/abcd --username xxx --password xxx --table xxx --input-fields-terminated-by '\001' --input-lines-terminated-by '\n' --num-mappers 8 --input-null-string '\\N' --input-null-non-string '\\N' --export-dir /user/hadoop/xxx -- --schema stage
Whereas this one will not work:
--connect jdbc:postgresql://xxx/abcd --username xxx --password xxx -- --schema stage --table xxx --input-fields-terminated-by '\001' --input-lines-terminated-by '\n' --num-mappers 8 --input-null-string '\\N' --input-null-non-string '\\N' --export-dir /user/hadoop/xxx
You might naturally put the schema option before the table option, but that will not work. It would have been great if this information were included in the Sqoop documentation.
Yes, using -- --schema at the end of the sqoop export statement worked fine.
By default the schema "dbo" is picked up in sqoop export.
The --schema parameter must be separated from the rest of the parameters with an extra set of dashes (i.e. -- ) and the --schema parameter must come last.
Refer: Sqoop Export Cookbook
I’m using Sqoop v1.4.2 to do incremental imports with jobs. The jobs are:
--create job_1 -- import --connect <CONNECT_STRING> --username <UNAME> --password <PASSWORD> -m <MAPPER#> --split-by <COLUMN> --target-dir <TARGET_DIR> --table <TABLE> --check-column <COLUMN> --incremental append --last-value 1
NOTES:
Incremental type is append
Job creation is successful
Job execution is successful for repeated times
Can see new rows being imported in HDFS
--create job_2 -- import --connect <CONNECT_STRING> --username <UNAME> --password <PASSWORD> -m <MAPPER#> --split-by <COLUMN> --target-dir <TARGET_DIR> --table <TABLE> --check-column <COLUMN> --incremental lastmodified --last-value 1981-01-01
NOTES:
Incremental type is lastmodified
Job creation is successful; the table name is different from the one used in job_1
Job execution is successful ONLY FOR FIRST TIME
Can see rows being imported for first execution in HDFS
Subsequent job executions fail with the following error:
ERROR security.UserGroupInformation: PriviledgedActionException as:<MY_UNIX_USER>(auth:SIMPLE) cause:org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory <TARGET_DIR_AS_SPECIFIED_IN_job_2> already exists
ERROR tool.ImportTool: Encountered IOException running import job: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory <TARGET_DIR_AS_SPECIFIED_IN_job_2> already exists
at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:872)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:476)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:506)
at org.apache.sqoop.mapreduce.ImportJobBase.runJob(ImportJobBase.java:141)
at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:202)
at org.apache.sqoop.manager.SqlManager.importTable(SqlManager.java:465)
at org.apache.sqoop.manager.MySQLManager.importTable(MySQLManager.java:108)
at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:403)
at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:476)
at org.apache.sqoop.tool.JobTool.execJob(JobTool.java:228)
at org.apache.sqoop.tool.JobTool.run(JobTool.java:283)
at org.apache.sqoop.Sqoop.run(Sqoop.java:145)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:181)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:220)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:229)
at org.apache.sqoop.Sqoop.main(Sqoop.java:238)
at com.cloudera.sqoop.Sqoop.main(Sqoop.java:57)
If you want to execute job_2 again and again, you need to use --incremental lastmodified --append:
sqoop job --create job_2 -- import --connect <CONNECT_STRING> --username <UNAME> \
--password <PASSWORD> --table <TABLE> --incremental lastmodified --append \
--check-column <COLUMN> --last-value "2017-11-05 02:43:43" \
--target-dir <TARGET_DIR> -m 1
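After recreating the job this way, each subsequent run is simply:
sqoop job --exec job_2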