Cloudera-Sqoop import with/without --hive-import

I'm testing Sqoop 1 by importing a table from MSSQL and then exporting it back to MSSQL into a different database. So far, my imports succeed. My concern is the export: if I import a table without the --hive-import option, I can export it successfully. But if I include --hive-import, Sqoop cannot export it and reports this error:
17/04/02 23:08:20 ERROR sqoop.Sqoop: Got exception running Sqoop:
org.kitesdk.data.DatasetIOException: Unable to load descriptor
file:hdfs://quickstart.cloudera:8020/user/hive/warehouse/customer/.metadata/descriptor.properties
for dataset:customer org.kitesdk.data.DatasetIOException: Unable to
load descriptor
file:hdfs://quickstart.cloudera:8020/user/hive/warehouse/customer/.metadata/descriptor.properties
for dataset:customer
On checking, the metadata differs when --hive-import is used. Only the imports done with --hive-import are missing the required metadata file:
Supplier/.metadata/descriptor.properties
My question: is it possible to import a table with Sqoop using both --as-parquetfile and --hive-import, and still be able to export it afterwards?
Here are my sample export and import commands for reference:
sqoop export \
--connect "jdbc:sqlserver://192.168.1.23;database=SqoopDB;schema=dbo;" \
--username sa \
--password Password1 \
--export-dir /user/hive/warehouse/customer \
--table customer
sqoop import \
--connect "jdbc:sqlserver://192.168.1.23;database=SourceDB;schema=dbo" \
--username sa \
--password Password1 \
--table Customer \
--as-parquetfile \
--hive-import \
--hive-overwrite \
-m 1

Related

sqoop query is giving error for the Import command

I am trying to transfer a 35 GB table from AWS RDS PostgreSQL to Hive. When I try to load the full table, it takes a long time and the job eventually stops, so I decided to load it incrementally.
Schema: all columns are varchar except where noted below.
twoway_id
twoway_seq =>int
guid
twoway_section_cd
transmit_host_nm
receipt_host_nm
service_id
msg_id
msg_no
vin_id
twoway_dt
status_result_cd
company_cd
limit_count_yn
response_status_cd
create_user_id
create_app_id
create_tmsmp =>timestamp
update_user_id
update_app_id
update_tmsmp =>timestamp
kshsjsjsj 320393682 IN K 02 TMU ISS CMM GPI 14 0 20201800230936 FAIL 02 Y 500 ISS ISS 2020-12-02 17:36:36.447 ISS ISS 2020-12-02 17:36:36.462
326403236
Sqoop command: this one works perfectly for a full table load.
sqoop import --connect "jdbc:postgresql://hostname:5432/db_core_k"\
--query 'SELECT * FROM db_core.service_twoway_ifo_202012 where 1=1 AND $CONDITIONS'\
--m 1 --target-dir "/user/hive/warehouse/db_core.db/service_twoway_ifo_202112"\
--username test --password test001 \
--hive-import --hive-table db_core.service_twoway_ifo_202012\
--hive-drop-import-delims\
--hive-overwrite --hs2-url jdbc:hive2://hivehostname:10000/default;
I tried this, but it does not work:
sqoop import --connect "jdbc:postgresql://hostname:5432/db_core_k"\
--query 'SELECT * FROM db_core.service_twoway_ifo_202012 where create_tmsmp like '2020-12-01%' AND $CONDITIONS'\
--m 1 --target-dir "/user/hive/warehouse/db_core.db/service_twoway_ifo_202112"\
--username test --password test001 \
--hive-import --hive-table db_core.service_twoway_ifo_202012\
--hive-drop-import-delims\
--hive-overwrite --hs2-url jdbc:hive2://hivehostname:10000/default;
Error:
21/12/29 16:02:43 ERROR manager.SqlManager: Error executing statement: org.postgresql.util.PSQLException:
ERROR: operator does not exist: integer % boolean
Hint: No operator matches the given name and argument types. You might need to add explicit type casts.
This does not work either:
sqoop import --connect "jdbc:postgresql://hostname:5432/db_core_k \
--table `db_core.service_twoway_ifo_202112`\
--m 1 --target-dir "/user/hive/warehouse/db_core.db/service_twoway_ifo_202112"\
--username test --password test001 \
--where "create_tmsmp < 2020-12-04 04:51:26.150"\
--append
Please also help me with the incremental load command. I am getting a syntax error there as well.
Incremental import arguments:
Argument | Description
--check-column (col) | Specifies the column to be examined when determining which rows to import.
--incremental (mode) | Specifies how Sqoop determines which rows are new. Legal values for mode include append and lastmodified.
--last-value (value) | Specifies the maximum value of the check column from the previous import.

I think the way you pass --where needs to change, as shown below. String and timestamp literals need single quotes around them.
sqoop import --connect "jdbc:postgresql://hostname:5432/db_core_k" \
--table db_core.service_twoway_ifo_202112 \
-m 1 --target-dir "/user/hive/warehouse/db_core.db/service_twoway_ifo_202112" \
--username test --password test001 \
--where "create_tmsmp < TO_TIMESTAMP('2020-12-04 04:51:26','YYYY-MM-DD HH:MI:SS')"
or
sqoop import --connect "jdbc:postgresql://hostname:5432/db_core_k" \
--query "SELECT * FROM db_core.service_twoway_ifo_202012 where create_tmsmp < TO_TIMESTAMP('2020-12-04 04:51:26','YYYY-MM-DD HH:MI:SS') AND \$CONDITIONS" \
-m 1 --target-dir "/user/hive/warehouse/db_core.db/service_twoway_ifo_202112" \
--username test --password test001 \
--hive-import --hive-table db_core.service_twoway_ifo_202012
Please change the position of \$CONDITIONS if the above doesn't work.
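To see why the original --query failed, it can help to print what the shell actually hands to Sqoop. The sketch below needs no Sqoop installation and only compares two quoting styles. The variable names broken_query and fixed_query are illustrative. The unquoted date in the first result seems consistent with the "integer % boolean" error, although the exact parse on the PostgreSQL side is an assumption.

```shell
# Asker's form: the inner single quotes end and restart the shell string,
# so the quotes around the date never reach Sqoop or PostgreSQL.
broken_query='SELECT * FROM db_core.service_twoway_ifo_202012 where create_tmsmp like '2020-12-01%' AND $CONDITIONS'

# Double-quoted form: single quotes are kept literally, and \$ stops the
# shell from expanding $CONDITIONS before Sqoop sees it.
fixed_query="SELECT * FROM db_core.service_twoway_ifo_202012 where create_tmsmp < TO_TIMESTAMP('2020-12-04 04:51:26','YYYY-MM-DD HH:MI:SS') AND \$CONDITIONS"

printf 'broken: %s\n' "$broken_query"
printf 'fixed:  %s\n' "$fixed_query"
```

In the first line of output the date appears without quotes. In the second, the quoted timestamp and the literal $CONDITIONS token are both intact.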
Now you can implement incremental imports as shown below.
NEW ROWS: use --incremental append to append new records. Identify a column that marks a brand-new record, such as a primary key, and compute max() of that column in Hive (I assumed it to be 1000). Then load the data with the Sqoop command below.
sqoop import --connect "jdbc:postgresql://hostname:5432/db_core_k" \
--table db_core.service_twoway_ifo_202112 \
-m 1 --target-dir "/user/hive/warehouse/db_core.db/service_twoway_ifo_202112" \
--username test --password test001 \
--check-column pk_col \
--incremental append \
--last-value 1000
The Sqoop command above appends every row from Postgres where pk_col > 1000.
EXISTING ROWS: similarly, you may want to bring over modified rows from the source. In that case use the command below. First compute max() of create_tmsmp and then use it as the last value (I assumed the max to be 2020-12-04 04:51:26).
sqoop import --connect "jdbc:postgresql://hostname:5432/db_core_k" \
--table db_core.service_twoway_ifo_202112 \
-m 1 --target-dir "/user/hive/warehouse/db_core.db/service_twoway_ifo_202112" \
--username test --password test001 \
--check-column create_tmsmp \
--incremental lastmodified \
--last-value "'2020-12-04 04:51:26'"
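One way to avoid computing max() by hand before every run is a saved Sqoop job. The Sqoop user guide describes saved jobs as recording the last imported value and reusing it on the next execution. The sketch below only prints the commands by default (DRY_RUN=1) so they can be reviewed first. The job name twoway_incr, the password file path, and pk_col are placeholders, not names from the original setup.

```shell
# Sketch: saved Sqoop job for an incremental append import.
# With DRY_RUN=1 (the default here) commands are printed, not executed.
DRY_RUN=${DRY_RUN:-1}
JOB_NAME=twoway_incr                    # hypothetical job name
PASSWORD_FILE=/user/test/.pg_password   # hypothetical HDFS path to a password file

run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Create the job once; --last-value is only the starting point.
create_job() {
  run sqoop job --create "$JOB_NAME" -- import \
    --connect "jdbc:postgresql://hostname:5432/db_core_k" \
    --username test --password-file "$PASSWORD_FILE" \
    --table db_core.service_twoway_ifo_202112 \
    --target-dir /user/hive/warehouse/db_core.db/service_twoway_ifo_202112 \
    -m 1 \
    --check-column pk_col \
    --incremental append \
    --last-value 1000
}

# Run the job on each load; Sqoop is expected to update the stored last value.
exec_job() {
  run sqoop job --exec "$JOB_NAME"
}

create_job
exec_job
```

Setting DRY_RUN=0 runs the commands for real, which assumes sqoop is on the PATH and the password file exists.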

No manager for connect string

As soon as my EMR cluster was ready, I started running into issues when listing databases and importing with Sqoop.
Sqoop appears to be installed correctly, and it responds normally when I type "sqoop help" in the Linux terminal.
The command is recognized normally.
However, when I try the sqoop import command, it fails with an error:
sqoop import \
--connect jdbc:postgres://sportsdb.cxri########.us-east-2.rds.amazonaws.com/SportsDB \
--username postgres \
--password mypassword \
--table addresses --target-dir s3://sqoop-table-from-rds-to-s3/sqoop-table/ -m 1 --fields-terminated-by '\t' --lines-terminated-by ','
The same happens with the second command, "sqoop list-databases", shown below:
sqoop list-databases \
--connect jdbc:postgres://sportsdb.cxri########.us-east-2.rds.amazonaws.com \
--username postgres \
--password mypassword
Neither command works; nothing happens.
I also downloaded the PostgreSQL JDBC driver jar and put it into /usr/lib/sqoop/lib/, where Sqoop keeps its jar files.
To do this, I ran the following two commands:
1) wget -O postgresql-jdbc.jar https://jdbc.postgresql.org/download/postgresql-42.3.1.jar
2) sudo mv postgresql-jdbc.jar /usr/lib/sqoop/lib/
Does anyone have a suggestion about how to fix this issue?

The issue was solved after I received a tip about a typo in the connect string.
I just changed the word postgres to postgresql, as follows:
sqoop list-databases \
--connect jdbc:postgresql://sportsdb.cxri########.us-east-2.rds.amazonaws.com \
--username postgres \
--password mypassword
That is it. The issue was fixed by a very simple adjustment.
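Since the fix came down to the scheme at the start of the connect string, a small pre-flight check can catch the same typo before Sqoop is invoked. This is a sketch only: check_pg_connect_string is a hypothetical helper, example-host is a placeholder, and it assumes Sqoop picks its PostgreSQL manager from the jdbc:postgresql: prefix.

```shell
# Returns 0 when the connect string uses the PostgreSQL JDBC scheme,
# non-zero (with a hint on stderr) otherwise.
check_pg_connect_string() {
  case "$1" in
    jdbc:postgresql://*)
      echo "ok: $1"
      ;;
    jdbc:postgres://*)
      echo "typo: use jdbc:postgresql:// instead of jdbc:postgres://" >&2
      return 1
      ;;
    *)
      echo "unrecognized connect string: $1" >&2
      return 1
      ;;
  esac
}

check_pg_connect_string "jdbc:postgresql://example-host/SportsDB"
check_pg_connect_string "jdbc:postgres://example-host/SportsDB" || true
```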

Hive import is not compatible with importing into AVRO format

I have the following command:
sqoop import --connect jdbc:mysql://localhost/export \
--username root \
--password cloudera \
--table cust \
--hive-import \
--create-hive-table \
--fields-terminated-by ' ' \
--hive-table default.cust \
--target-dir /user/hive/warehouse/cust \
--compression-codec org.apache.org.io.compress.GzipCodec \
--as-avrodatafile \
-m 1
I got the following error. Please help.
Hive import is not compatible with importing into AVRO format.

Currently, Sqoop does not support importing Avro format directly into a Hive table. As a workaround, you can import into HDFS and create an EXTERNAL TABLE in Hive.
Step 1: Import into HDFS
sqoop import --connect jdbc:mysql://localhost/export \
--username root --password cloudera \
--table cust \
--target-dir /user/hive/warehouse/cust \
--compression-codec org.apache.hadoop.io.compress.GzipCodec \
--as-avrodatafile -m 1
This import creates a schema file with the extension .avsc in the current local (Linux) directory. Copy this file to some location in HDFS (PATH_TO_THE_COPIED_SCHEMA).
Step 2: Create an external table in Hive, like this:
CREATE EXTERNAL TABLE cust
STORED AS AVRO
LOCATION 'hdfs:///user/hive/warehouse/cust'
TBLPROPERTIES ('avro.schema.url'='hdfs:///PATH_TO_THE_COPIED_SCHEMA/cust.avsc');
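The copy between Step 1 and Step 2 can be sketched as follows. The function only prints the hdfs commands so they can be checked before running. The directory /user/cloudera/schemas is an assumed stand-in for PATH_TO_THE_COPIED_SCHEMA, and cust.avsc is the schema file name used in the table definition above.

```shell
SCHEMA_FILE=cust.avsc                    # written by the import into the local working directory
HDFS_SCHEMA_DIR=/user/cloudera/schemas   # hypothetical; stands in for PATH_TO_THE_COPIED_SCHEMA

# Print the commands that create the target directory, upload the schema,
# and list the uploaded file.
print_schema_upload_commands() {
  printf 'hdfs dfs -mkdir -p %s\n' "$HDFS_SCHEMA_DIR"
  printf 'hdfs dfs -put -f %s %s/\n' "$SCHEMA_FILE" "$HDFS_SCHEMA_DIR"
  printf 'hdfs dfs -ls %s/%s\n' "$HDFS_SCHEMA_DIR" "$SCHEMA_FILE"
}

print_schema_upload_commands
```

Once the printed commands look right, they can be piped to sh on a machine where the hdfs client is available. The avro.schema.url property in the table definition then has to point at the same directory.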

Sqoop export is successful but destination postgres table is empty

I am trying to export a table from HDFS to PostgreSQL.
Below is the command I used for the export:
sqoop export --connect jdbc:postgresql:hostname:5432/postgresDB --username user --password password --input-fields-terminated-by '\001' --fields-terminated-by ',' --table customer --export-dir /hdfs/location/customer --input-null-string '\\N' --input-null-non-string '\\N' --direct --update-key customer_id
The Sqoop command completes with a success message.
But when I query the table, it contains no data.
Any help is appreciated. Thanks in advance.

sqoop export --connect jdbc:postgresql:hostname:5432/postgresDB \
--username user \
--password password \
--input-fields-terminated-by '\001' \
--fields-terminated-by ',' \
--table customer \
--export-dir /hdfs/location/customer \
--input-null-string '\\N' --input-null-non-string '\\N' \
--direct \
--update-mode allowinsert
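This worked. With only --update-key, Sqoop issues UPDATE statements, so rows that do not already exist in the target table are never inserted and an empty table stays empty.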

For me, this worked after I added the schema name for my table, passed as a connector-specific argument at the end of the command:
-- --schema my_schema

Sqoop: how to switch off Prepared Statements?

I use Sqoop 1.4.5-cdh5.4.2 with PostgreSQL.
When Sqoop connects directly to the database, everything works.
But I need to run Sqoop through pgbouncer, and that is where the problem appears.
pgbouncer does not support prepared statements in transaction pooling mode.
... connect command:
sqoop import \
--connect "$db_name" \
--username "$db_user" \
--password "$db_pass" \
--direct \
--hive-import \
--hive-table "$hive_schema.$t" \
--hive-overwrite \
--num-mappers 10 \
--fetch-size 100000 \
--split-by "object_id" \
--target-dir "/user/$hive_schema/$t" \
--table "$t"
... and error:
org.postgresql.util.PSQLException: ERROR: prepared statement "S_3" already exists
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2270)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1998)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:255)
at org.postgresql.jdbc2.AbstractJdbc2Connection.executeTransactionCommand(AbstractJdbc2Connection.java:791)
at org.postgresql.jdbc2.AbstractJdbc2Connection.commit(AbstractJdbc2Connection.java:815)
at org.apache.sqoop.manager.SqlManager.getColumnInfoForRawQuery(SqlManager.java:315)
at org.apache.sqoop.manager.SqlManager.getColumnTypesForRawQuery(SqlManager.java:241)
at org.apache.sqoop.manager.SqlManager.getColumnTypes(SqlManager.java:227)
at org.apache.sqoop.hive.TableDefWriter.getCreateTableStmt(TableDefWriter.java:126)
at org.apache.sqoop.hive.HiveImport.importTable(HiveImport.java:188)
at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:514)
at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:605)
at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)
at org.apache.sqoop.Sqoop.main(Sqoop.java:236)

Add prepareThreshold=0 to the connection string
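A minimal sketch of how the parameter could be appended, assuming the connect string is held in the db_name variable as in the question. add_jdbc_param is a hypothetical helper and the host, port, and database in the example URL are placeholders. prepareThreshold is a PostgreSQL JDBC driver setting; whether it is enough for pgbouncer in transaction mode is not confirmed here.

```shell
# Append a key=value parameter to a JDBC URL, using "?" for the first
# parameter and "&" when the URL already has a query part.
add_jdbc_param() {
  case "$1" in
    *\?*) printf '%s&%s\n' "$1" "$2" ;;
    *)    printf '%s?%s\n' "$1" "$2" ;;
  esac
}

db_name="jdbc:postgresql://pgbouncer-host:6432/mydb"   # hypothetical connect string
db_name=$(add_jdbc_param "$db_name" "prepareThreshold=0")
printf '%s\n' "$db_name"
```

The resulting value can then be passed to --connect "$db_name" exactly as in the original command.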

Sqoop doesn't work with pgbouncer in transaction pooling mode :(
