Is there a way to set the PostgreSQL optimizer in Sqoop?

I am trying to run a Sqoop job to ingest data from PostgreSQL into HDFS and I am stuck at one point.
Sqoop appends " AND (1=0)" to the end of my "WHERE" clause to fetch metadata just before the ingestion:
sqoop import \
--connect jdbc:postgresql://randomtexthere.com:5432/test \
--username user \
-P \
--query \
"
SELECT *
FROM table1 pr
INNER JOIN table2 fr
  ON pr.id = fr.id
WHERE fr.another_id > 12345 AND fr.another_id < 123456 AND \$CONDITIONS
" \
--hcatalog-database test \
--hcatalog-storage-stanza "STORED AS PARQUET" \
--hcatalog-table table1 \
--split-by id
Once the above command runs, the query never completes (both in Sqoop and in DBeaver).
However, the query does work once I run SET OPTIMIZER = ON (in DBeaver):
SET OPTIMIZER = ON;
SELECT *
FROM table1 pr
INNER JOIN
table2 fr
ON pr.id = fr.id
WHERE fr.another_id > 12345 AND fr.another_id < 123456 AND (1=0);
I am looking for a way to set this optimizer parameter in my Sqoop session.
Is there a way to do it?

Passing the optimizer parameter in the JDBC connection string solves the issue:
--connect jdbc:postgresql://randomtexthere.com:5432/test?optimizer=true

Related

Fixing orphaned sequences in Postgres 12

I'm trying to run pg_dump and it's failing due to an orphaned sequence.
pg_dump -U db --format=custom --compress=0 db
pg_dump: error: query to get data of sequence "non_existing_table_id_seq" returned 0 rows (expected 1)
https://wiki.postgresql.org/wiki/Fixing_Sequences
The above wiki page has some snippets which can be used to fix this issue; the last snippet does work to display the orphaned sequences.
select ns.nspname as schema_name, seq.relname as seq_name
from pg_class as seq
join pg_namespace ns on (seq.relnamespace=ns.oid)
where seq.relkind = 'S'
and not exists (select * from pg_depend where objid=seq.oid and deptype='a')
order by seq.relname;
 schema_name |             seq_name
-------------+------------------------------------
 public      | non_existing_table_id_seq
 public      | another_non_existing_table_id_seq
(2 rows)
The command which should fix this issue doesn't run, because column d.adsrc does not exist; it seems to have been removed in Postgres 12.
https://stackoverflow.com/a/58798028/1891184 says I can replace d.adsrc with pg_get_expr(d.adbin, d.adrelid). That runs, but the issue still remains.
Other than this, the database is working fine.
How can I either fix or remove the offending sequences in order to let pg_dump work?
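As a hedged sketch (this answer is not in the original thread): since the diagnostic query above already lists the orphaned sequences by schema and name, the "remove" option the question asks about is simply to drop them, after confirming nothing still uses them. Using the same database and user as the pg_dump command:
psql -U db -d db -c "DROP SEQUENCE IF EXISTS public.non_existing_table_id_seq;"
psql -U db -d db -c "DROP SEQUENCE IF EXISTS public.another_non_existing_table_id_seq;"
After that, pg_dump should no longer try to read data for the missing sequences.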

DB2 Scheduled Trigger

I'm new to triggers and I want to ask about the proper way to create a trigger (or any better method) to duplicate the contents of table T4 into table T5 at a specified date and time.
For example, on the 1st day of every month at 23:00, I want to duplicate the contents of T4 table to T5 table.
Can anyone please advise what's the best method?
Thank you.
CREATE TRIGGER TRIG1
AFTER INSERT ON T4
REFERENCING NEW AS NEW
FOR EACH ROW
BEGIN ATOMIC
  INSERT INTO T5 VALUES (NEW.B, NEW.A);
END
It can be done with the Administrative Task Scheduler feature instead of cron. Here is a sample script:
#!/bin/sh
db2set DB2_ATS_ENABLE=YES
db2stop
db2start
db2 -v "drop db db1"
db2 -v "create db db1"
db2 -v "connect to db1"
db2 -v "CREATE TABLESPACE SYSTOOLSPACE IN IBMCATGROUP MANAGED BY AUTOMATIC STORAGE EXTENTSIZE 4"
db2 -v "create table s1.t4 (c1 int)"
db2 -v "create table s1.t5 (c1 int)"
db2 -v "insert into s1.t4 values (1)"
db2 -v "create procedure s1.copy_t4_t5() language SQL begin insert into s1.t5 select * from s1.t4; end"
db2 -v "CALL SYSPROC.ADMIN_TASK_ADD ('ATS1', CURRENT_TIMESTAMP, NULL, NULL, '0,10,20,30,40,50 * * * *', 'S1', 'COPY_T4_T5',NULL , NULL, NULL )"
date
It will create a task called 'ATS1' that calls the procedure s1.copy_t4_t5 every 10 minutes (at 01:00, 01:10, 01:20, and so on). You may need to run the following after executing the script:
db2 -v "connect to db1"
Then, after some time, run the following to see whether the t5 table has rows as expected:
db2 -v "select * from s1.t5"
For your case, the 5th parameter would be replaced with '0 23 1 * *'.
It represents 'minute hour day_of_month month weekday', so the procedure will be called on the 1st day of every month at 23:00.
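For instance, here is a hedged sketch of registering the same procedure on that schedule; the task name 'ATS2' is arbitrary and the other arguments mirror the ADMIN_TASK_ADD call in the script above:
# register a monthly task: 1st day of each month at 23:00
db2 -v "CALL SYSPROC.ADMIN_TASK_ADD ('ATS2', CURRENT_TIMESTAMP, NULL, NULL, '0 23 1 * *', 'S1', 'COPY_T4_T5', NULL, NULL, NULL)"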
For more information on how to modify an existing task, delete a task, or review task status, see:
Administrative Task Scheduler routines and views
https://www.ibm.com/support/knowledgecenter/en/SSEPGG_11.1.0/com.ibm.db2.luw.sql.rtn.doc/doc/c0061223.html
Also, here is a good article about it:
[DB2 LUW] Sample administrative task scheduler ADMIN_TASK_ADD and ADMIN_TASK_REMOVE usage
https://www.ibm.com/support/pages/node/1140388?lang=en
Hope this helps.

Passing Variables to Query via Beeline

I need help understanding why my hivevar values are not being substituted in my query.
This is my beeline statement in a shell script:
Start_Date="20180423"
End_Date="20180424"
beeline -u 'jdbc:hive2://#####/default;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2;principal=######;' -f ${my_queries}/Count_Query --showHeader=false --outputformat=csv2 --silent=false --hivevar start_date=$Start_Date --hivevar end_date=$End_Date 1>${my_data}/Data_File 2>${my_log}/Log_File
The Query
use sample_db;
select count(*) from sample_table where data_dt>=${start_date} and data_dt<${end_date};
When I look at the data file, which contains a dump of the executed query, the variables have not been replaced with their values:
0: jdbc:hive2://####> use sample_db;
0: jdbc:hive2://####> select count(*) from sample_table where data_dt>=${start_date} and data_dt<${end_date};
The issue is in the following part:
**--hivevar start_date=$Start_Date --hivevar end_date=$End_Date**
Remove the ** and you are good to go.
Shell Script.
Start_Date="20180423"
End_Date="20180424"
beeline_cmd="beeline -u 'jdbc:hive2://#####/default;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2;principal=######;' --showHeader=false --outputformat=csv2 --silent=false"
${beeline_cmd} -f ${my_queries}/Count_Query --hivevar start_date=${Start_Date} --hivevar end_date=${End_Date} 1>${my_data}/Data_File 2>${my_log}/Log_File
Hive query (unchanged from the question):
use sample_db;
select count(*) from sample_table where data_dt>=${start_date} and data_dt<${end_date};
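As a hedged aside (not part of the original answer), a quick way to sanity-check the values from the shell is to inline the statement with -e instead of -f, so that the shell itself substitutes the dates:
# run the same count inline; ${Start_Date} and ${End_Date} are expanded by the shell
${beeline_cmd} -e "select count(*) from sample_db.sample_table where data_dt>=${Start_Date} and data_dt<${End_Date};"
If this returns the expected count, the remaining difference is only in how the hivevar values reach the query file.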

pg_dump failed "backup : cache lookup failed"

We had pg_dump fail with "backup : cache lookup failed for type 174104". Here is the log entry:
2016-01-06 03:08:46.572
EST,"postgres","aabbcc",13840,"[local]",568cf5bd.3610,3,"SELECT",2016-01-06
03:08:45 PST,2/24331,0,ERROR,XX000,"cache lookup failed for type
174104",,,,,,"SELECT proretset, prosrc, probin,
pg_catalog.pg_get_function_arguments(oid) AS funcargs,
pg_catalog.pg_get_function_identity_arguments(oid) AS funciargs,
pg_catalog.pg_get_function_result(oid) AS funcresult, proiswindow,
provolatile, proisstrict, prosecdef, proleakproof, proconfig, procost,
prorows, (SELECT lanname FROM pg_catalog.pg_language WHERE oid =
prolang) AS lanname FROM pg_catalog.pg_proc WHERE oid =
'174103'::pg_catalog.oid",,,"pg_dump"
I tried the following, which did not help:
1) bouncing the DB
2) VACUUM FULL
3) REINDEX of all the tables (pg_dump still failed afterwards)
pg_basebackup does work, though. The data directory is on a local RAID file system. I also ran pg_dump on all tables one by one but did not spot anything weird.
Any idea is appreciated :)
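As a hedged diagnostic sketch (not from the original thread): the error means pg_dump hit a function in pg_catalog.pg_proc (OID 174103) whose argument or return type (OID 174104) no longer exists in pg_type. A first step is usually to identify that function and confirm the type really is missing, using the database and user shown in the log:
# which function does pg_dump stumble over?
psql -U postgres -d aabbcc -c "SELECT p.oid, p.proname, n.nspname FROM pg_catalog.pg_proc p JOIN pg_catalog.pg_namespace n ON n.oid = p.pronamespace WHERE p.oid = 174103;"
# does the referenced type still exist? (0 rows confirms the dangling reference)
psql -U postgres -d aabbcc -c "SELECT oid, typname FROM pg_catalog.pg_type WHERE oid = 174104;"
Once the function is identified, dropping or recreating it is a common way to let pg_dump continue, but that is a judgment call for the specific database.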

DB2 CLI result output

When running command-line queries in MySQL you can optionally use '\G' as a statement terminator; instead of the result set columns being listed horizontally across the screen, it lists each column vertically, with the corresponding data to the right. Is there a way to do the same or a similar thing with the DB2 command-line utility?
Example regular MySQL result
mysql> select * from tagmap limit 2;
+----+---------+--------+
| id | blog_id | tag_id |
+----+---------+--------+
| 16 |       8 |      1 |
| 17 |       8 |      4 |
+----+---------+--------+
Example Alternate MySQL result:
mysql> select * from tagmap limit 2\G
*************************** 1. row ***************************
id: 16
blog_id: 8
tag_id: 1
*************************** 2. row ***************************
id: 17
blog_id: 8
tag_id: 4
2 rows in set (0.00 sec)
Obviously, this is much more useful when the columns are large strings, or when there are many columns in a result set, but this demonstrates the formatting better than I can probably explain it.
I don't think such an option is available with the DB2 command line client. See http://www.dbforums.com/showthread.php?t=708079 for some suggestions. For a more general set of information about the DB2 command line client you might check out the IBM DeveloperWorks article DB2's Command Line Processor and Scripting.
A little bit late, but I found this post when searching for an option to retrieve only the selected data.
db2 -x <query> returns just the result, without headers. More options can be found here: https://www.ibm.com/docs/en/db2/11.1?topic=clp-options
Example:
[db2inst1@a21c-db2 db2]$ db2 -n select postschemaver from files.product
POSTSCHEMAVER
--------------------------------
147.3
1 record(s) selected.
[db2inst1@a21c-db2 db2]$ db2 -x select postschemaver from files.product
147.3
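As a hedged aside for the original \G question: since -x strips the header, one way to approximate MySQL's vertical output is to transpose each row in the shell, spelling the column names out by hand. This assumes a table with columns like the tagmap example exists in DB2 and that an active db2 connection is open in the session:
# print each selected row vertically, one "column: value" pair per line
db2 -x "select id, blog_id, tag_id from tagmap fetch first 2 rows only" | \
while read -r id blog_id tag_id; do
  printf '%s\n' '*** row ***' "id:      $id" "blog_id: $blog_id" "tag_id:  $tag_id"
done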
The DB2 command-line utility always displays data in tabular format, i.e. rows horizontally and columns vertically. It does not support any other output format the way the \G statement terminator does for MySQL. But yes, you can store column-organized data in DB2 tables when DB2_WORKLOAD=ANALYTICS is set:
db2 => connect to coldb
Database Connection Information
Database server        = DB2/LINUXX8664 10.5.5
SQL authorization ID   = BIMALJHA
Local database alias   = COLDB
db2 => create table testtable (c1 int, c2 varchar(10)) organize by column
DB20000I The SQL command completed successfully.
db2 => insert into testtable values (2, 'bimal'),(3, 'kumar')
DB20000I The SQL command completed successfully.
db2 => select * from testtable
C1          C2
----------- ----------
          2 bimal
          3 kumar
2 record(s) selected.
db2 => terminate
DB20000I The TERMINATE command completed successfully.