DB2 (AS/400 / iSeries) column value description

I found a query online which helps me get all the column names along with the column description, which I have pasted below.
Select
SYSTEM_TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, varchar(COLUMN_TEXT, 50) As COLUMN_DESC
From QSYS2.syscolumns
WHERE TABLE_NAME = 'xxxxxx' AND SYSTEM_TABLE_SCHEMA = 'yyyyyy'
Is there a query that gives the column value descriptions? I am attaching a sample of the desired output below. Column [Item Code] has three values: A, B, C -> I want the corresponding description for each of those values.
Desired output (attached as an image): each Item Code value (A, B, C) with its corresponding description.

It depends on the structure of your data. SQL does not know the descriptions of the code values; that is data contained in your database. So if you have a table that contains the descriptions of the codes, then you can get them with a join. If you do not have such a table, then you cannot get that information. Here is an example of how this might work for you.
create table master (
id integer primary key,
name varchar(128) not null,
code varchar(10) not null);
create table codes (
id varchar(10) primary key,
description varchar(128) not null);
insert into master
values (1, 'test1', 'A'),
(2, 'test2', 'B'),
(3, 'test3', 'C'),
(4, 'test4', 'A'),
(5, 'test5', 'B');
insert into codes
values ('A', 'Code 1'),
('B', 'Code 2'),
('C', 'Code 3');
SELECT master.id, master.name, master.code, codes.description
FROM master
JOIN codes on master.code = codes.id;
|ID|NAME |CODE|DESCRIPTION|
|--|-----|----|-----------|
|1 |test1|A |Code 1 |
|2 |test2|B |Code 2 |
|3 |test3|C |Code 3 |
|4 |test4|A |Code 1 |
|5 |test5|B |Code 2 |

If your database is properly built, then there should be a referential (aka foreign key) constraint defined between your table 'XXXXXX' and the 'codes' table that jmarkmurphy provided an example of.
There are various system catalogs which will show that constraint (see the query sketch after this list).
IBM i:
- qsys2.syscst
- qsys2.syscstcol
- qsys2.syscstdep
- qsys2.syskeycst
ODBC/JDBC:
- sysibm.SQLFOREIGNKEYS
ANSI/ISO:
- qsys2.REFERENTIAL_CONSTRAINTS
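For example, a minimal sketch (not from the original answer) that checks QSYS2.SYSCST for foreign key constraints on the table from the question; the schema and table names 'YYYYYY' and 'XXXXXX' are placeholders:
-- Hedged sketch: list any foreign key constraints defined on YYYYYY/XXXXXX
SELECT CONSTRAINT_SCHEMA, CONSTRAINT_NAME, CONSTRAINT_TYPE, TABLE_SCHEMA, TABLE_NAME
FROM QSYS2.SYSCST
WHERE TABLE_SCHEMA = 'YYYYYY'
  AND TABLE_NAME = 'XXXXXX'
  AND CONSTRAINT_TYPE = 'FOREIGN KEY';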
Unfortunately, many legacy applications on Db2 for i don't have such constraints defined.
It's also possible that the "codes" table doesn't exist either, and that the descriptions are simply hard-coded in various programs.

Related

Postgresql + psycopg: Bulk Insert large data with POSTGRESQL function call

I am working with a large, very large, amount of very simple data (point clouds). I want to insert this data into a simple table in a PostgreSQL database using Python.
An example of the insert statement I need to execute is as follows:
INSERT INTO points_postgis (id_scan, scandist, pt) VALUES (1, 32.656, ST_MakePoint(1.1, 2.2, 3.3));
Note the call to the Postgresql function ST_MakePoint in the INSERT statement.
I must call this billions (yes, billions) of times, so obviously I must insert the data into PostgreSQL in a more optimized way. There are many strategies to bulk insert data, as this article presents in a very good and informative way (insertmany, copy, etc.):
https://hakibenita.com/fast-load-data-python-postgresql
But no example shows how to do these inserts when you need to call a function on the server-side. My question is: how can I bulk INSERT data when I need to call a function on the server-side of a Postgresql database using psycopg?
Any help is greatly appreciated! Thank you!
Please note that using a CSV doesn't make much sense because my data is huge.
Alternatively, I already tried to fill a temp table with simple columns for the 3 inputs of the ST_MakePoint function and then, after all the data is in this temp table, call an INSERT/SELECT. The problem is that this takes a lot of time, and the amount of disk space I need for this is nonsensical.
The most important thing, in order to do this within a reasonable time and with minimum effort, is to break this task down into component parts so that you can take advantage of different Postgres features separately.
Firstly, you will want to create the table minus the geometry transformation, such as:
create table temp_table (
id_scan bigint,
scandist numeric,
pt_1 numeric,
pt_2 numeric,
pt_3 numeric
);
Since we do not add any indexes or constraints, this will most likely be the fastest way to get the "raw" data into the RDBMS.
The best way to do this would be with the COPY command, which you can use either from Postgres directly (if you have sufficient access) or via the Python interface using cursor.copy_expert: https://www.psycopg.org/docs/cursor.html#cursor.copy_expert
Here is example code to achieve this:
import psycopg2

# connection parameters (target_host, target_usr, target_db, target_pw) are placeholders
iconn_string = "host={0} user={1} dbname={2} password={3} sslmode={4}".format(target_host, target_usr, target_db, target_pw, "require")
iconn = psycopg2.connect(iconn_string)
import_cursor = iconn.cursor()
csv_filename = '/path/to/my_file.csv'
copy_sql = "COPY temp_table (id_scan, scandist, pt_1, pt_2, pt_3) FROM STDIN WITH CSV HEADER DELIMITER ',' QUOTE '\"' ESCAPE '\\' NULL AS 'null'"

# stream the file straight into temp_table
with open(csv_filename, mode='r', encoding='utf-8', errors='ignore') as csv_file:
    import_cursor.copy_expert(copy_sql, csv_file)
iconn.commit()
The next step is to efficiently create the table you want from the existing raw data. You will be able to populate your actual target table with a single SQL statement and let the RDBMS do its magic.
Once the data is in the RDBMS, it makes sense to optimize it a little and add an index or two if applicable (preferably a primary key or unique index, to speed up the transformation).
This will be dependent on your data / use case, but something like this should help:
alter table temp_table add primary key (id_scan); --if unique
-- or
create index idx_temp_table_1 on temp_table(id_scan); --if not unique
To move data from raw into your target table:
with temp_t as (
select id_scan, scandist, ST_MakePoint(pt_1, pt_2, pt_3) as pt from temp_table
)
INSERT INTO points_postgis (id_scan, scandist, pt)
SELECT temp_t.id_scan, temp_t.scandist, temp_t.pt
FROM temp_t;
This selects all the data from the temp table and transforms it in one go.
A second option is similar: you can load all the raw data into points_postgis directly, keeping it separated into three temp columns. Then use alter table points_postgis add column pt geometry; and follow up with an update: update points_postgis set pt = ST_MakePoint(pt_1, pt_2, pt_3); finally, remove the temp columns with alter table points_postgis drop column pt_1, drop column pt_2, drop column pt_3; (spelled out in the sketch below).
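A sketch of that second option, assuming points_postgis was created with three temporary numeric columns pt_1, pt_2, pt_3 holding the raw coordinates:
-- add the geometry column, transform in place, then drop the helper columns
alter table points_postgis add column pt geometry;
update points_postgis set pt = ST_MakePoint(pt_1, pt_2, pt_3);
alter table points_postgis drop column pt_1, drop column pt_2, drop column pt_3;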
The main takeaway is that the most performant option is not to concentrate on the final table state, but to break the work down into easily achievable chunks. Postgres will easily handle both the import of billions of rows and the transformation afterwards.
Some simple examples using a function that generates a UPC-A barcode with check digit (a sketch of such a function follows below):
1. Using execute_batch. execute_batch has a page_size argument that allows you to batch the inserts into multi-statement commands. By default this is set to 100, which inserts 100 rows at a time; you can bump it up to make fewer round trips to the server.
2. Using just execute and selecting data from another table.
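Both examples assume a server-side function upc_check_digit() already exists in the database; it is not shown here, so the following is only a hedged sketch of one possible plpgsql implementation that matches the sample output (the company prefix '744835' is inferred from that output):
-- Hypothetical helper: build a 12-digit UPC-A from a fixed prefix plus a 5-digit suffix
CREATE OR REPLACE FUNCTION upc_check_digit(suffix varchar)
RETURNS varchar
LANGUAGE plpgsql
AS $$
DECLARE
    base  varchar := '744835' || suffix;  -- 11 digits: company prefix + item suffix
    total integer := 0;
    chk   integer;
BEGIN
    FOR i IN 1..11 LOOP
        IF i % 2 = 1 THEN
            total := total + 3 * substr(base, i, 1)::integer;  -- odd positions weighted by 3
        ELSE
            total := total + substr(base, i, 1)::integer;
        END IF;
    END LOOP;
    chk := (10 - total % 10) % 10;  -- UPC-A check digit
    RETURN base || chk::varchar;
END;
$$;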
import psycopg2
from psycopg2.extras import execute_batch
con = psycopg2.connect(dbname='test', host='localhost', user='postgres',
port=5432)
cur = con.cursor()
cur.execute('create table import_test(id integer, suffix_val varchar, upca_val varchar)')
con.commit()
# Input data as a list of tuples. Means some data is duplicated.
input_list = [(1, '12345', '12345'), (2, '45278', '45278'), (3, '61289', '61289')]
# page_size (default 100) controls how many inserts are grouped into each round trip
execute_batch(cur, 'insert into import_test values(%s, %s, upc_check_digit(%s))', input_list)
con.commit()
select * from import_test ;
id | suffix_val | upca_val
----+------------+--------------
1 | 12345 | 744835123458
2 | 45278 | 744835452787
3 | 61289 | 744835612891
# Input data as list of dicts and using named parameters to avoid duplicating data.
input_list_dict = [{'id': 50, 'suffix_val': '12345'}, {'id': 51, 'suffix_val': '45278'}, {'id': 52, 'suffix_val': '61289'}]
execute_batch(cur, 'insert into import_test values(%(id)s, %(suffix_val)s, upc_check_digit(%(suffix_val)s))', input_list_dict)
con.commit()
select * from import_test ;
id | suffix_val | upca_val
----+------------+--------------
1 | 12345 | 744835123458
2 | 45278 | 744835452787
3 | 61289 | 744835612891
50 | 12345 | 744835123458
51 | 45278 | 744835452787
52 | 61289 | 744835612891
# Create a table with values to be used for inserting into final table
cur.execute('create table input_vals (id integer, suffix_val varchar)')
con.commit()
execute_batch(cur, 'insert into input_vals values(%s, %s)', [(100, '76234'), (101, '92348'), (102, '16235')])
con.commit()
cur.execute('insert into import_test select id, suffix_val, upc_check_digit(suffix_val) from input_vals')
con.commit()
select * from import_test ;
id | suffix_val | upca_val
-------+------------+--------------
1 | 12345 | 744835123458
2 | 45278 | 744835452787
3 | 61289 | 744835612891
12345 | 12345 | 744835123458
45278 | 45278 | 744835452787
61289 | 61289 | 744835612891
100 | 76234 | 744835762343
101 | 92348 | 744835923485
102 | 16235 | 744835162358

How to apply Translate function on all rows within a column in postgresql

In a dataset I have, there is a column that contains numbers like 83.420, 43.317, 149.317, ..., and this column is stored as a string. The dot in the numbers doesn't represent a decimal point, i.e., the number 83.420 is really 83420, etc.
One way to remove this dot from the numbers in this column is to use the TRANSLATE function as follows:
SELECT translate('83.420', '.', '')
which returns 83420. But how can I apply this function to all the rows in the dataset?
I tried this, however, I failed:
SELECT translate(SELECT num_column FROM my_table, '.', '')
I get the error SQL Error [42601]: ERROR: syntax error at end of input.
Any idea how I can apply the translate function to the entire column? Or is there a better approach than translate?
Just apply the function to the column; you can even cast the result to a numeric type (here integer) like this:
SELECT translate(num_column, '.', '')::integer from the_table;
-- average:
SELECT avg(translate(num_column, '.', '')::integer) from the_table;
or use replace
SELECT replace(num_column, '.', '')::integer from the_table;
-- average:
SELECT avg(replace(num_column, '.', '')::integer) from the_table;
Please note that storing numbers as formatted text is a (very) bad idea. Use a native numeric type instead.
Two options.
Set up table:
create table string_conv(id integer, num_column varchar);
insert into string_conv values (1, '83.420'), (2, '43.317'), (3, '149.317');
select * from string_conv ;
id | num_column
----+------------
1 | 83.420
2 | 43.317
3 | 149.317
First option leave as string field:
update string_conv set num_column = translate(num_column, '.', '');
select * from string_conv ;
id | num_column
----+------------
1 | 83420
2 | 43317
3 | 149317
The above changes the value format in place. It means, though, that if new data comes in with the old format, 'XX.XXX', those values would have to be converted.
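For example, a hypothetical new value arriving in the old format could be converted as it is inserted (the value '55.123' is made up for illustration):
insert into string_conv values (4, translate('55.123', '.', ''));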
Second option convert to integer column:
truncate string_conv ;
insert into string_conv values (1, '83.420'), (2, '43.317'), (3, '149.317');
alter table string_conv alter COLUMN num_column type integer using translate(num_column, '.', '')::int;
select * from string_conv ;
id | num_column
----+------------
1 | 83420
2 | 43317
3 | 149317
\d string_conv
Table "public.string_conv"
Column | Type | Collation | Nullable | Default
------------+---------+-----------+----------+---------
id | integer | | |
num_column | integer | | |
This option changes the format of the values and the type of the column they are stored in. The issue here is that from then on new values would need to be compatible with the new type. This would mean changing the input data from 'XX.XXX' to 'XXXXX'.

How to break apart a column that includes keys and values into separate columns in Postgres

I am new to Postgres and basically have no experience. I have a table with a column that contains keys and values. I need to write a query which returns a table with all the columns of the table plus additional columns, with each key as a column name and its value underneath.
My input is like:
id | name|message
12478| A |{img_type:=png,key_id:=f235, client_status:=active, request_status:=open}
12598| B |{img_type:=none,address_id:=c156, client_status:=active, request_status:=closed}
output will be:
id |name| Img_type|Key_id|address_id|Client_status|Request_status
12478| A | png |f235 |NULL |active | open
12598| B | none |NULL |c156 |active | closed
Any help would be greatly appreciated.
The only thing I can think of is a regular expression to extract the key/value pairs.
select id, name,
(regexp_match(message, '(img_type:=)([^,}]+),{0,1}'))[2] as img_type,
(regexp_match(message, '(key_id:=)([^,}]+),{0,1}'))[2] as key_id,
(regexp_match(message, '(client_status:=)([^,}]+),{0,1}'))[2] as client_status,
(regexp_match(message, '(request_status:=)([^,}]+),{0,1}'))[2] as request_status
from the_table;
regexp_match returns an array of matches. As the regex contains two groups (one for the "key" and one for the "value"), the [2] takes the second element of the array.
This is quite expensive and error prone (e.g. if any of the values contains a ',' you would need to deal with quoted values). If you have any chance to change the application that stores the value, you should seriously consider storing a proper JSON value instead, e.g.
{"img_type": "png", "key_id": "f235", "client_status": "active", "request_status": "open"}
Then you can use e.g. message ->> 'img_type' to retrieve the value for the key img_type.
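For illustration, if message were converted to jsonb, the desired output could be produced roughly like this (a sketch, not part of the original answer):
select id,
       name,
       message ->> 'img_type'       as img_type,
       message ->> 'key_id'         as key_id,
       message ->> 'address_id'     as address_id,
       message ->> 'client_status'  as client_status,
       message ->> 'request_status' as request_status
from the_table;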
You might also want to consider a properly normalized table, where each of those keys is a real column.
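Such a normalized table might look roughly like this (a sketch; the table name request_log is hypothetical and the columns are inferred from the sample data):
create table request_log (
    id             integer primary key,
    name           text not null,
    img_type       text,
    key_id         text,
    address_id     text,
    client_status  text,
    request_status text
);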
I can do it with a function.
I am not sure about the performance, but here is my suggestion:
CREATE TYPE log_type AS (img_type TEXT, key_id TEXT, address_id TEXT, client_status TEXT, request_status TEXT);
CREATE OR REPLACE FUNCTION populate_log(data TEXT)
RETURNS log_type AS
$func$
DECLARE
r log_type;
BEGIN
select x.* into r
from
(
select
json_object(array_agg(array_data)) as json_data
from (
select unnest(string_to_array(trim(unnest(string_to_array(substring(populate_log.data, '[^{}]+'), ','))), ':=')) as array_data
) d
) d2,
lateral json_to_record(json_data) as x(img_type text, key_id text, address_id text, client_status text, request_status text);
RETURN r;
END
$func$ LANGUAGE plpgsql;
with log_data (id, name, message) as (
values
(12478, 'A', '{img_type:=png,key_id:=f235, client_status:=active, request_status:=open}'),
(12598, 'B', '{img_type:=none,address_id:=c156, client_status:=active, request_status:=closed}')
)
select id, name, l.*
from log_data, lateral populate_log(message) as l;
What you finally write in your query will be something like this; imagine that the data is in a table named log_data:
select id, name, l.*
from log_data, lateral populate_log(message) as l;
I suppose that the message column is text. In Postgres it might instead be an array; in that case you have to remove some of the conversions, i.e. string_to_array(substring(populate_log.data)) -> populate_log.data

Get more than 2 columns distinct data using group by in postgresql

CREATE TABLE "spiral_articleinfo" (
"id" integer NOT NULL PRIMARY KEY,
"url" varchar(255) NOT NULL UNIQUE,
"articletitle" varchar(1000) NOT NULL,
"articleimage" text,
"articlecontent" text NOT NULL,
"articlehtml" text NOT NULL,
"articledate" date,
"articlecategory" varchar(1000) NOT NULL,
"s_articlecategory" varchar(1000),
"articlesubcategory" varchar(1000) NOT NULL,
"articledomain" varchar(255),
"domainrank" integer NOT NULL,
"articletags" text NOT NULL,
"articleprice" decimal NOT NULL,
"contenthash" varchar(255) UNIQUE
);
This is the table schema. I want to get the distinct articlecategory for every article present, grouped by articledomain.
I did:
select distinct articledomain, articlecategory
from spiral_articleinfo
group by articledomain
order by articledomain;
The error is:
ERROR: column "spiral_articleinfo.articlecategory" must appear in
the GROUP BY clause or be used in an aggregate function.
How do I get articlecategories?
When I run this query:
select distinct articledomain, articlecategory, articlesubcategory
from spiral_articleinfo;
I get:
articledomain | articlecategory | articlesucategory
androidadvices.com| Android Apps | Books & Reference
androidadvices.com| Android Apps | Business
I want the result to be one row per domain, followed by its articlecategory and articlesubcategory values.
Sample Output:
articledomain | articlecategory | articlesucategory
androidadvices.com | Android Apps | Books & Reference, Business
www.womensweb.in | Home|Social Lens, Crime, Law, Safety
suitcaseofstories.in| Himachal|Ladakh, People, Street Photography
You can use the string_agg() function along with GROUP BY to achieve the desired output.
With the following as an example:
create table foow (articledomain text,articlecategory text,articlesucategory text);
insert into foow values ('androidadvices.com','Android Apps','Books & Reference');
insert into foow values ('androidadvices.com','Android Apps','Business');
and the select statement should be
select articledomain
,articlecategory
,string_agg(articlesucategory,',')
from foow
group by articledomain, articlecategory
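If you instead want a single row per domain, as in the sample output, a hedged variant aggregates both columns (same example foow table):
select articledomain,
       string_agg(distinct articlecategory, ',')   as articlecategory,
       string_agg(distinct articlesucategory, ',') as articlesucategory
from foow
group by articledomain;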

Postgres and Word Clouds

I would like to know if it's possible to create a Postgres function to scan some table rows and create a table that contains WORD and AMOUNT (frequency). My goal is to use this table to create a word cloud.
There is a simple way, but it can be slow (depending on your table size). You can split your text into an array:
SELECT string_to_array(lower(words), ' ') FROM the_table;
With those arrays, you can use unnest to aggregate them:
WITH words AS (
SELECT unnest(string_to_array(lower(words), ' ')) AS word
FROM the_table
)
SELECT word, count(*) FROM words
GROUP BY word;
This is a simple way of doing it, but it has some issues: for example, it only splits words on spaces, not on punctuation marks (see the variant below).
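A hedged variant that also splits on punctuation uses regexp_split_to_table instead of string_to_array (same placeholder names the_table and words):
-- split on runs of non-word characters, then drop empty tokens
WITH words AS (
SELECT regexp_split_to_table(lower(words), '\W+') AS word
FROM the_table
)
SELECT word, count(*)
FROM words
WHERE word <> ''
GROUP BY word;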
Another, and probably better, option is to use PostgreSQL full text search.
Late to the party, but I also needed this and wanted to use full text search, which conveniently removes HTML tags.
So basically you convert the text to a tsvector and then use ts_stat:
select word, nentry
from ts_stat($q$
select to_tsvector('simple', '<div id="main">a b c <b>b c</b></div>')
$q$)
order by nentry desc
Result:
|word|nentry|
|----|------|
|c |2 |
|b |2 |
|a |1 |
But this does not scale well, so here is what I ended up with:
Setup:
-- table with a gist index on the tsvector column
create table wordcloud_data (
html text not null,
tsv tsvector not null
);
create index on wordcloud_data using gist (tsv);
-- trigger to update the tsvector column
create trigger wordcloud_data_tsvupdate
before insert or update on wordcloud_data
for each row execute function tsvector_update_trigger(tsv, 'pg_catalog.simple', html);
-- a view for the wordcloud
create view wordcloud as select word, nentry from ts_stat('select tsv from wordcloud_data') order by nentry desc;
Usage:
-- insert some data
insert into wordcloud_data (html) values
('<div id="id1">aaa</div> <b>bbb</b> <i attribute="ignored">ccc</i>'),
('<div class="class1"><span>bbb</span> <strong>ccc</strong> <pre>ddd</pre></div>');
After that your wordcloud view should look like this:
|word|nentry|
|----|------|
|ccc |2 |
|bbb |2 |
|ddd |1 |
|aaa |1 |
Bonus feature:
Replace simple with, for example, english, and Postgres will strip out stop words and do stemming for you.
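For example, a hedged sketch of the same ts_stat query using the english configuration (the sample sentence is made up for illustration):
-- the 'english' configuration drops stop words and stems the remaining words
select word, nentry
from ts_stat($q$
select to_tsvector('english', 'The quick brown foxes are jumping over the lazy dogs')
$q$)
order by nentry desc
The result would then contain stemmed words such as fox, jump, and dog, with stop words like "the" and "are" removed.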