Cloudera-Sqoop import with/without --hive-import - import

I'm trying to do an activity whereas i'll import a table from MSSQL then export to MSSQL again in another database for the sake of testing sqoop1. So far, my imports are successful. My concern is regarding the export, if i import a table without --hive-import option, i'll be able to export it successfully. But if i include --hive-import option, sqoop wont be able to export it and prompts an error:
17/04/02 23:08:20 ERROR sqoop.Sqoop: Got exception running Sqoop:
org.kitesdk.data.DatasetIOException: Unable to load descriptor
file:hdfs://quickstart.cloudera:8020/user/hive/warehouse/customer/.metadata/descriptor.properties
for dataset:customer org.kitesdk.data.DatasetIOException: Unable to
load descriptor
file:hdfs://quickstart.cloudera:8020/user/hive/warehouse/customer/.metadata/descriptor.properties
for dataset:customer
As per checking, there's a difference in the metadata with --hive-imports. Imports with --hive-import parameter only does not have the required metadata:
Supplier/.metadata/descriptor.properties
My question is, is it possible to import a table in sqoop with --as-parquetfile and --hive-import option then be able to export it also?
here's my sample import and export code for referrence:
sqoop export \
--connect "jdbc:sqlserver://192.168.1.23;database=SqoopDB;schema=dbo;" \
--username sa \
--password Password1 \
--export-dir /user/hive/warehouse/customer \
--table customer
sqoop import \
--connect "jdbc:sqlserver://192.168.1.23;database=SourceDB;schema=dbo" \
--username sa \
--password Password1 \
--table Customer \
--as-parquetfile \
--hive-import \
--hive-overwrite \
-m 1

Related

sqoop query is giving error for the Import command

I am trying to transfer 35 GB table from aws rds postgres to hive but when I try to full table it take much time and after long time execution get stop. so I decide to load incremental way.
schema:All are in varchar except mentioned below.
twoway_id
twoway_seq =>int
guid
twoway_section_cd
transmit_host_nm
receipt_host_nm
service_id
msg_id
msg_no
vin_id
twoway_dt
status_result_cd
company_cd
limit_count_yn
response_status_cd
create_user_id
create_app_id
create_tmsmp =>timestamp
update_user_id
update_app_id
update_tmsmp =>timestamp
kshsjsjsj 320393682 IN K 02 TMU ISS CMM GPI 14 0 20201800230936 FAIL 02 Y 500 ISS ISS 2020-12-02 17:36:36.447 ISS ISS 2020-12-02 17:36:36.462
326403236
sqoop query: This query is working perfectly for full table load.
sqoop import --connect "jdbc:postgresql://hostname:5432/db_core_k"\
--query 'SELECT * FROM db_core.service_twoway_ifo_202012 where 1=1 AND $CONDITIONS'\
--m 1 --target-dir "/user/hive/warehouse/db_core.db/service_twoway_ifo_202112"\
--username test --password test001 \
--hive-import --hive-table db_core.service_twoway_ifo_202012\
--hive-drop-import-delims\
--hive-overwrite --hs2-url jdbc:hive2://hivehostname:10000/default;
tried but not working
sqoop import --connect "jdbc:postgresql://hostname:5432/db_core_k"\
--query 'SELECT * FROM db_core.service_twoway_ifo_202012 where create_tmsmp like '2020-12-01%' AND $CONDITIONS'\
--m 1 --target-dir "/user/hive/warehouse/db_core.db/service_twoway_ifo_202112"\
--username test --password test001 \
--hive-import --hive-table db_core.service_twoway_ifo_202012\
--hive-drop-import-delims\
--hive-overwrite --hs2-url jdbc:hive2://hivehostname:10000/default;
Error:
21/12/29 16:02:43 ERROR manager.SqlManager: Error executing statement: org.postgresql.util.PSQLException:
ERROR: operator does not exist: integer % boolean
Hint: No operator matches the given name and argument types. You might need to add explicit type casts.
This is also not working
sqoop import --connect "jdbc:postgresql://hostname:5432/db_core_k \
--table `db_core.service_twoway_ifo_202112`\
--m 1 --target-dir "/user/hive/warehouse/db_core.db/service_twoway_ifo_202112"\
--username test --password test001 \
--where "create_tmsmp < 2020-12-04 04:51:26.150"\
--append
Also help me on incremental load query. I am also facing syntax error issue.
Incremental import arguments:
Argument Description
--check-column (col) Specifies the column to be examined when determining which rows to import.
--incremental (mode) Specifies how Sqoop determines which rows are new. Legal values for mode include append and lastmodified.
--last-value (value) Specifies the maximum value of the check column from the previous import.
I think the way you are passing --where need to be changed like below. You need to use single quotes around strings.
sqoop import --connect "jdbc:postgresql://hostname:5432/db_core_k \
--table `db_core.service_twoway_ifo_202112`\
--m 1 --target-dir "/user/hive/warehouse/db_core.db/service_twoway_ifo_202112"\
--username test --password test001 \
--where " create_tmsmp < TO_TIMESTAMP('2020-12-04 04:51:26','YYYY-MM-DD HH:MI:SS' )"\
or
sqoop import --connect "jdbc:postgresql://hostname:5432/db_core_k"\
--query "SELECT * FROM db_core.service_twoway_ifo_202012 where create_tmsmp < TO_TIMESTAMP('2020-12-04 04:51:26','YYYY-MM-DD HH:MI:SS') AND \\\$CONDITIONS"\
--m 1 --target-dir "/user/hive/warehouse/db_core.db/service_twoway_ifo_202112"\
--username test --password test001 \
--hive-import --hive-table db_core.service_twoway_ifo_202012\
Pls change position of \$CONDITIONS if above doesn't work.
Now you can implement Incremental sqoop like below.
NEW ROWS : you can use --incremental append to append new records. identify a column which can be used to determine brand new record - like primary key. And also calculate max() of that column in hive(i assumed it to be 1000). Load those data using below sqoop.
sqoop import --connect "jdbc:postgresql://hostname:5432/db_core_k \
--table `db_core.service_twoway_ifo_202112`\
--m 1 --target-dir "/user/hive/warehouse/db_core.db/service_twoway_ifo_202112"\
--username test --password test001 \
--check-column pk_col \
--incremental append \
--last-value 1000
So this above Sqoop will append any row where pk_col > 10000 from postgres.
EXISTING ROWS : Similarly you may want to bring modified rows from source. Then use below SQL. You need to first calculate max() of create_tmsmp and then use it in below statement(i assumed max to be 2020-12-04 04:51:26).
sqoop import --connect "jdbc:postgresql://hostname:5432/db_core_k \
--table `db_core.service_twoway_ifo_202112`\
--m 1 --target-dir "/user/hive/warehouse/db_core.db/service_twoway_ifo_202112"\
--username test --password test001 \
--check-column create_tmsmp \
--incremental lastmodified \
--last-value "'2020-12-04 04:51:26'"

No manager for connect string

As soon as my EMR-Cluster was ready to be run.
I started facing some issues when listing databases and importing sqoop
Apparently, sqoop has been installed normally and it is working normally when I type "sqoop help" in Linux terminal.
using sqoop help
as you can see, the command could be recognized normally.
However, if I try out the sqoop import command, this one cannot be and it faces an error:
sqoop import \
--connect jdbc:postgres://sportsdb.cxri########.us-east-2.rds.amazonaws.com/SportsDB \
--username postgres \
--password mypassword \
--table addresses --target-dir s3://sqoop-table-from-rds-to-s3/sqoop-table/ -m 1 --fields-terminated-by '\t' --lines-terminated-by ','
sqoop import
The same goes to the second one, which is "sqoop list-databases" as shown below:
sqoop list-databases \
--connect jdbc:postgres://sportsdb.cxri########.us-east-2.rds.amazonaws.com \
--username postgres \
--password mypassword
sqoop list-databases
they don't really works and anything happens ;/
I also downloaded jar and put into /usr/lib/sqoop/lib/ where is the jar files on sqoop
To do it I run these two follow commands:
1) wget -O postgresql-jdbc.jar https://jdbc.postgresql.org/download/postgresql-42.3.1.jar
2) sudo mv postgresql-jdbc.jar /usr/lib/sqoop/lib/
Jar file added to sqoop/lib
Someone else has a suggestion about what can be done in order to fix this issue?
The issue could be solved after following a tip received about a typo.
Then, I just changed the word postgres for postgresql as follows:
sqoop list-databases \
--connect jdbc:postgresql://sportsdb.cxri########.us-east-2.rds.amazonaws.com \
--username postgres \
--password mypassword
That is it. The issue was fixed just adjusting something pretty simple

Hive import is not compatible with importing into AVRO format

I have the following codes:
sqoop import --connect jdbc:mysql://localhost/export \
--username root \
--password cloudera \
--table cust \
--hive-import \
--create-hive-table \
--fields-terminated-by ' ' \
--hive-table default.cust \
--target-dir /user/hive/warehouse/cust \
--compression-codec org.apache.org.io.compress.GzipCodec \
--as-avrodatafile \
-m 1
got the following error, please help.
Hive import is not compatible with importing into AVRO format.
Currently, Sqoop does not support to import AVRO format directly into a HIVE table, so as a workaround you can import into HDFS and create a EXTERNAL TABLE in HIVE
Step 1 : IMPORT into hdfs
sqoop import --connect jdbc:mysql://localhost/export \
--username root --password cloudera
--table cust \
--target-dir /user/hive/warehouse/cust \
--compression-codec org.apache.org.io.compress.GzipCodec \
--as-avrodatafile -m 1
This import will create a schema file in the current directory (Linux) with an extension .avsc .Copy this file to some location in HDFS (PATH_TO_THE_COPIED_SCHEMA).
Step 2: Create an external table in HIVE like
CREATE EXTERNAL TABLE cust
STORED AS AVRO
LOCATION 'hdfs:///user/hive/warehouse/cust'
TBLPROPERTIES ('avro.schema.url'='hdfs:///PATH_TO_THE_COPIED_SCHEMA/cust.avsc');

Sqoop export is successful but destination postgres table is empty

I am trying to export the table from hdfs to postgres
Below is the query which I used for export:
sqoop export --connect jdbc:postgresql:hostname:5432/postgresDB --username user --password password --input-fields-terminated-by '\001' --fields-terminated-by ',' --table customer --export-dir /hdfs/location/customer --input-null-string '\\N' --input-null-non-string '\\N' --direct --update-key customer_id
The sqoop query completes with success message. Please see the screenshot below:
But when I query the table, I am not finding any data.
Any help is appreciated. Thanks in advance.
sqoop export --connect jdbc:postgresql:hostname:5432/postgresDB \
--username user \
--password password \
--input-fields-terminated-by '\001' \
--fields-terminated-by ',' \
--table customer \
--export-dir /hdfs/location/customer \
--input-null-string '\\N' --input-null-non-string '\\N' \
--direct \
--update-mode allowinsert
This worked ..
For me, this worked after I added the schema name for my table name.
-- -- schema my_schema

Sqoop: how switch off Prepared Statements?

I use Sqoop 1.4.5-cdh5.4.2 and Postgresql.
If Sqoop connects directly to the database - all right.
But need use Sqoop over pgbouncer, and I have problem with this.
In pgbouncer you can not do prepared statements transaction mode.
... connect command:
sqoop import \
--connect "$db_name" \
--username "$db_user" \
--password "$db_pass" \
--direct \
--hive-import \
--hive-table "$hive_schema.$t" \
--hive-overwrite \
--num-mappers 10 \
--fetch-size 100000 \
--split-by "object_id" \
--target-dir "/user/$hive_schema/$t" \
--table "$t"
... and error:
org.postgresql.util.PSQLException: ERROR: prepared statement "S_3" already exists
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2270)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1998)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:255)
at org.postgresql.jdbc2.AbstractJdbc2Connection.executeTransactionCommand(AbstractJdbc2Connection.java:791)
at org.postgresql.jdbc2.AbstractJdbc2Connection.commit(AbstractJdbc2Connection.java:815)
at org.apache.sqoop.manager.SqlManager.getColumnInfoForRawQuery(SqlManager.java:315)
at org.apache.sqoop.manager.SqlManager.getColumnTypesForRawQuery(SqlManager.java:241)
at org.apache.sqoop.manager.SqlManager.getColumnTypes(SqlManager.java:227)
at org.apache.sqoop.hive.TableDefWriter.getCreateTableStmt(TableDefWriter.java:126)
at org.apache.sqoop.hive.HiveImport.importTable(HiveImport.java:188)
at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:514)
at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:605)
at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)
at org.apache.sqoop.Sqoop.main(Sqoop.java:236)
Add prepareThreshold=0 to the connection string
SQOOP don't work with pgbouncer and transaction pool :(