Importing PostgreSQL arrays to Hive

I've been using Sqoop to move data between PostgreSQL tables and Hive. But apparently Sqoop does not support the PostgreSQL array type.

PostgreSQL has a function called array_to_string that you can use to convert an array into a delimited string.
To illustrate, here is the table in PostgreSQL:
=# select * from albums;
 id  | album_id  | names
-----+-----------+-------
 123 | {1,2,3,4} | test
(1 row)
=#
As you can see, album_id is an array type, specifically an array of integers.
Now, to import this from my database called mydb I use the following command:
sqoop import --connect jdbc:postgresql://localhost:5432/mydb \
--query "select id, array_to_string(album_id,',','*'), names \
from albums where \$CONDITIONS" \
--split-by id \
--target-dir albums
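For reference, in array_to_string(album_id, ',', '*') the second argument is the element delimiter and the third is the text substituted for NULL elements, so the array lands in HDFS as a flat delimited string. A quick sanity check in psql:
=# SELECT array_to_string(ARRAY[1, 2, NULL, 4], ',', '*');
 array_to_string
-----------------
 1,2,*,4
(1 row)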
After that, you can create an external Hive table with the following parameters:
collection.delim $
field.delim ,
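Note that the delimiter passed to array_to_string has to match collection.delim and differ from field.delim, so with the properties above the query would serialize the array with '$' rather than ','. A minimal sketch of the table definition, where the column types and HDFS location are assumptions based on the example:
CREATE EXTERNAL TABLE albums (
  id INT,
  album_id ARRAY<INT>,
  names STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  COLLECTION ITEMS TERMINATED BY '$'
LOCATION '/user/hive/albums';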

Related

Sqoop query is giving an error for the import command

I am trying to transfer a 35 GB table from AWS RDS PostgreSQL to Hive, but when I try a full-table load it takes a long time and eventually the execution stops. So I decided to load it incrementally.
Schema (all columns are varchar except where noted):
twoway_id
twoway_seq =>int
guid
twoway_section_cd
transmit_host_nm
receipt_host_nm
service_id
msg_id
msg_no
vin_id
twoway_dt
status_result_cd
company_cd
limit_count_yn
response_status_cd
create_user_id
create_app_id
create_tmsmp =>timestamp
update_user_id
update_app_id
update_tmsmp =>timestamp
Sample row:
kshsjsjsj 320393682 IN K 02 TMU ISS CMM GPI 14 0 20201800230936 FAIL 02 Y 500 ISS ISS 2020-12-02 17:36:36.447 ISS ISS 2020-12-02 17:36:36.462
326403236
Sqoop query (this works perfectly for a full-table load):
sqoop import --connect "jdbc:postgresql://hostname:5432/db_core_k" \
--query 'SELECT * FROM db_core.service_twoway_ifo_202012 where 1=1 AND $CONDITIONS' \
--m 1 --target-dir "/user/hive/warehouse/db_core.db/service_twoway_ifo_202112" \
--username test --password test001 \
--hive-import --hive-table db_core.service_twoway_ifo_202012 \
--hive-drop-import-delims \
--hive-overwrite --hs2-url jdbc:hive2://hivehostname:10000/default
I tried this, but it is not working:
sqoop import --connect "jdbc:postgresql://hostname:5432/db_core_k" \
--query 'SELECT * FROM db_core.service_twoway_ifo_202012 where create_tmsmp like '2020-12-01%' AND $CONDITIONS' \
--m 1 --target-dir "/user/hive/warehouse/db_core.db/service_twoway_ifo_202112" \
--username test --password test001 \
--hive-import --hive-table db_core.service_twoway_ifo_202012 \
--hive-drop-import-delims \
--hive-overwrite --hs2-url jdbc:hive2://hivehostname:10000/default
Error:
21/12/29 16:02:43 ERROR manager.SqlManager: Error executing statement: org.postgresql.util.PSQLException:
ERROR: operator does not exist: integer % boolean
Hint: No operator matches the given name and argument types. You might need to add explicit type casts.
This is also not working:
sqoop import --connect "jdbc:postgresql://hostname:5432/db_core_k" \
--table db_core.service_twoway_ifo_202112 \
--m 1 --target-dir "/user/hive/warehouse/db_core.db/service_twoway_ifo_202112" \
--username test --password test001 \
--where "create_tmsmp < 2020-12-04 04:51:26.150" \
--append
Also, please help me with the incremental load query; I am facing a syntax error there too.
Incremental import arguments:
Argument | Description
--check-column (col) | Specifies the column to be examined when determining which rows to import.
--incremental (mode) | Specifies how Sqoop determines which rows are new. Legal values for mode include append and lastmodified.
--last-value (value) | Specifies the maximum value of the check column from the previous import.
I think the way you are passing --where needs to be changed as shown below; you need to use single quotes around string literals. (In your second attempt, the inner quotes around 2020-12-01% actually terminate and reopen the outer single-quoted string, so the value reaches PostgreSQL unquoted and gets parsed as an arithmetic expression, which is where the "operator does not exist: integer % boolean" error comes from.)
sqoop import --connect "jdbc:postgresql://hostname:5432/db_core_k" \
--table db_core.service_twoway_ifo_202112 \
--m 1 --target-dir "/user/hive/warehouse/db_core.db/service_twoway_ifo_202112" \
--username test --password test001 \
--where "create_tmsmp < TO_TIMESTAMP('2020-12-04 04:51:26','YYYY-MM-DD HH:MI:SS')"
or
sqoop import --connect "jdbc:postgresql://hostname:5432/db_core_k" \
--query "SELECT * FROM db_core.service_twoway_ifo_202012 where create_tmsmp < TO_TIMESTAMP('2020-12-04 04:51:26','YYYY-MM-DD HH:MI:SS') AND \$CONDITIONS" \
--m 1 --target-dir "/user/hive/warehouse/db_core.db/service_twoway_ifo_202112" \
--username test --password test001 \
--hive-import --hive-table db_core.service_twoway_ifo_202012
Please change the position of \$CONDITIONS if the above doesn't work.
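For reference, the shell quoting rule behind this: inside single quotes the shell leaves $CONDITIONS alone, but inside double quotes the dollar sign must be escaped so the shell does not expand it as a variable:
--query 'SELECT ... WHERE $CONDITIONS'     (single quotes: no escaping needed)
--query "SELECT ... WHERE \$CONDITIONS"    (double quotes: escape the $)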
Now you can implement incremental Sqoop loads as below.
NEW ROWS: you can use --incremental append to append new records. Identify a column that can be used to detect brand-new records, such as the primary key, and calculate max() of that column in Hive (I assumed it to be 1000). Then load the data using the Sqoop command below.
sqoop import --connect "jdbc:postgresql://hostname:5432/db_core_k" \
--table db_core.service_twoway_ifo_202112 \
--m 1 --target-dir "/user/hive/warehouse/db_core.db/service_twoway_ifo_202112" \
--username test --password test001 \
--check-column pk_col \
--incremental append \
--last-value 1000
So the Sqoop command above will append any row from PostgreSQL where pk_col > 1000.
EXISTING ROWS: similarly, you may want to bring modified rows over from the source. In that case, first calculate max() of create_tmsmp and then use it in the statement below (I assumed the max to be 2020-12-04 04:51:26).
sqoop import --connect "jdbc:postgresql://hostname:5432/db_core_k" \
--table db_core.service_twoway_ifo_202112 \
--m 1 --target-dir "/user/hive/warehouse/db_core.db/service_twoway_ifo_202112" \
--username test --password test001 \
--check-column create_tmsmp \
--incremental lastmodified \
--last-value "2020-12-04 04:51:26"
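Before each incremental run you would compute the --last-value in Hive along these lines; pk_col is a placeholder for whatever key column you choose, and the table name is taken from the question:
-- for --incremental append
SELECT MAX(pk_col) FROM db_core.service_twoway_ifo_202012;
-- for --incremental lastmodified
SELECT MAX(create_tmsmp) FROM db_core.service_twoway_ifo_202012;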

Validating the data inside a Postgres table

I have launched a Postgres container. By injecting script.sql at the Docker entrypoint, I created a database, schema, and tables, and inserted data into them. The Docker logs say that all table creation and data insertion succeeded.
But how can I validate the data insertion? The commands below didn't help:
List of relations
 Schema |   Name    | Type  |  Owner
--------+-----------+-------+----------
 my_db  | users     | table | postgres
 my_db  | audit_log | table | postgres
 my_db  | config    | table | postgres
(3 rows)
my_db=# SELECT * FROM my_db.users
my_db-# SELECT * FROM users
my_db-# SELECT * FROM my_db.users;
What is wrong here? Please help.
You should use
\c my_db
to connect to your database, and then:
SELECT * FROM users;
to query the table in that database.
It seems that you understood this part, but everything has to be done from the psql command line, i.e. inside the Docker container. Note also the my_db-# continuation prompt in your session: psql was still waiting for input because the statement had no terminating semicolon, so make sure every statement ends with ;.
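Alternatively, you can run the check from the host in one shot (the container name my_postgres is a placeholder):
docker exec my_postgres psql -U postgres -d my_db -c 'SELECT count(*) FROM my_db.users;'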

Sqoop import not working with Hive Parquet

Change data capture in Sqoop-Hive Import
I am trying to do change data capture using Sqoop, but when I add --as-parquetfile to my Sqoop import command it fails. After removing it from the command, the import works and lands the data in text format in the Hive table, but I want it in a Parquet Hive table.
I want to do an update operation on my data.
I have written the command below:
sqoop import --connect "myoracleconnectiondetails" \
--username myuser --password mypasswd \
--query 'select * from test_table where $CONDITIONS' \
--hive-import --hive-database test_dase \
--hive-table test_dase.test_table --null-string 'NULL' \
--null-non-string '-99999' --target-dir mydir/full path \
--split-by mycol --incremental append \
--merge-key could --as-parquetfile -m 10
I get this error:
Got exception running Sqoop: org.kitesdk.data.ValidationException: Dataset name test_dase.test_table is not alphanumeric (plus '_')

Sqoop + Postgresql: how to prevent quotes around table name

I am trying to import a table from Postgresql to a Parquet file on HDFS.
Here is what I do:
sqoop import \
--connect "jdbc:postgresql://pg.foo.net:5432/bar" \
--username user_me --password $PASSWORD \
--table foo.bar \
--target-dir /user/me/bar \
--as-parquetfile
and I get
INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "foo.bar" AS t LIMIT 1
ERROR manager.SqlManager: Error executing statement: org.postgresql.util.PSQLException: ERROR: relation "foo.bar" does not exist
SELECT t.* FROM "foo.bar" AS t LIMIT 1 indeed does not work, but SELECT t.* FROM foo.bar AS t LIMIT 1 does. So the problem is that the table name is quoted as a whole. I tried supplying the --table argument in different ways, but with no effect.
How can I work around this?
EDIT
As the docs you linked state, there is a --schema argument. For some reason it is not mentioned in sqoop help import.
Another weird thing is that
--table bar --schema foo
still does not work, but
--table bar -- --schema foo
does.
Anyway, it works now. Thanks for linking the relevant docs section!
The table name is bar, foo is the name of the schema.
According to the docs, you should do it like this:
sqoop import \
(...)
--table bar \
--schema foo
(...)
According to the documentation you need to specify the schema separately:
sqoop import \
--connect "jdbc:postgresql://pg.foo.net:5432/bar" \
--username user_me --password $PASSWORD \
--table bar \
--schema foo \
--target-dir /user/me/bar \
--as-parquetfile
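Putting the answers and the question's edit together: the tool-specific --schema argument has to come after a lone -- separator, so a full working invocation would look like this (host, user, and paths as in the question):
sqoop import \
--connect "jdbc:postgresql://pg.foo.net:5432/bar" \
--username user_me --password $PASSWORD \
--table bar \
--target-dir /user/me/bar \
--as-parquetfile \
-- --schema foo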

Generate DDL programmatically on Postgresql

How can I generate the DDL of a table programmatically on Postgresql? Is there a system query or command to do it? Googling the issue returned no pointers.
Use pg_dump with these options:
pg_dump -U user_name -h host database -s -t table_or_view_names -f table_or_view_names.sql
Description:
-s or --schema-only: dump only the DDL / object definitions (schema), without data.
-t or --table: dump only the tables (or views or sequences) matching the given name.
Examples:
-- dump the DDL of each table elon built:
$ pg_dump -U elon -h localhost -s -t spacex -t tesla -t solarcity -t boring > companies.sql
Sorry if this is off topic. Just want to help anyone who googles "psql dump ddl" and ends up in this thread.
You can use the pg_dump command to dump the contents of the database (both schema and data). The --schema-only switch will dump only the DDL for your table(s).
Why would shelling out to psql not count as "programmatically?" It'll dump the entire schema very nicely.
Anyhow, you can get data types (and much more) from the information_schema (8.4 docs referenced here, but this is not a new feature):
=# select column_name, data_type from information_schema.columns
-# where table_name = 'config';
    column_name     | data_type
--------------------+-----------
 id                 | integer
 default_printer_id | integer
 master_host_enable | boolean
(3 rows)
The answer is to check the source code for pg_dump and follow the switches it uses to generate the DDL. Somewhere inside the code there's a number of queries used to retrieve the metadata used to generate the DDL.
Here is a good article on how to get the meta information from information schema,
http://www.alberton.info/postgresql_meta_info.html.
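As a small taste of that approach, here is a minimal sketch that assembles a bare-bones CREATE TABLE statement from information_schema; it ignores constraints, defaults, and indexes, and the schema and table names are illustrative:
SELECT 'CREATE TABLE ' || table_name || ' ('
       || string_agg(column_name || ' ' || data_type, ', ' ORDER BY ordinal_position)
       || ');'
FROM information_schema.columns
WHERE table_schema = 'public'
  AND table_name = 'config'
GROUP BY table_name;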
I saved four functions that partially mock up the behaviour of pg_dump -s, based on the \d+ metacommand. Usage would be something like:
\pset format unaligned
select get_ddl_t(schemaname,tablename) as "--" from pg_tables where tableowner <> 'postgres';
Of course, you have to create the functions first.
Working sample here at rextester.