Insert/Update / Merge / Dimension Lookup/Update using Pentaho - PostgreSQL

There is a table 'TICKETS' in PostgreSQL. I run an ETL job in Pentaho to populate this table.
There is also a GUI on which a user makes changes and the result is reflected in this table.
The fields in the table are :
"OID" Char(36) <------ **PRIMARY KEY**
, "CUSTOMER" VARCHAR(255)
, "TICKETID" VARCHAR(255)
, "PRIO_ORIG" CHAR(36)
, "PRIO_COR" CHAR(36)
, "CATEGORY" VARCHAR(255)
, "OPENDATE_ORIG" TIMESTAMP
, "OPENDATE_COR" TIMESTAMP
, "TTA_ORIG" TIMESTAMP
, "TTA_COR" TIMESTAMP
, "TTA_DUR" DOUBLE PRECISION
, "MTTA_TARGET" DOUBLE PRECISION
, "TTA_REL_ORIG" BOOLEAN
, "TTA_REL_COR" BOOLEAN
, "TTA_DISCOUNT_COR" DOUBLE PRECISION
, "TTA_CHARGE_COR" DOUBLE PRECISION
, "TTR_ORIG" TIMESTAMP
, "TTR_COR" TIMESTAMP
, "TTR_DUR" DOUBLE PRECISION
, "MTTR_TARGET" DOUBLE PRECISION
, "TTR_REL_ORIG" BOOLEAN
, "TTR_REL_COR" BOOLEAN
, "TTR_DISCOUNT_COR" DOUBLE PRECISION
, "TTR_CHARGE_COR" DOUBLE PRECISION
, "COMMENT" VARCHAR(500)
, "USER" CHAR(36)
, "MODIFY_DATE" TIMESTAMP
, "CORRECTED" BOOLEAN
, "OPTIMISTICLOCKFIELD" INTEGER
, "GCRECORD" INTEGER
, "ORIGINATOR" Char(36)
I want to update the table when the combination of TICKETID + ORIGINATOR + CUSTOMER is the same; otherwise, an insert should be performed.
How should I do this in Pentaho? Is the Dimension Lookup/Update step suitable for this, or will the Insert/Update step do the job?
Any help would be much appreciated. Thanks in advance.

Eugene Lisitsky's suggestion is good practice: you may hard-wire it in the database constraints and let PostgreSQL do the job.
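A minimal sketch of that constraint-based route (assuming PostgreSQL 9.5+ for ON CONFLICT; the constraint name and sample values are illustrative, not from the original question):
-- Declare the business key once, then let PostgreSQL decide insert vs. update:
ALTER TABLE "TICKETS"
  ADD CONSTRAINT tickets_business_key UNIQUE ("TICKETID", "ORIGINATOR", "CUSTOMER");

INSERT INTO "TICKETS" ("OID", "TICKETID", "ORIGINATOR", "CUSTOMER", "CATEGORY", "MODIFY_DATE")
VALUES ('a1b2c3d4-0000-0000-0000-000000000001', 'T-1', 'SYS-A', 'ACME', 'Incident', now())
ON CONFLICT ("TICKETID", "ORIGINATOR", "CUSTOMER")
DO UPDATE SET "CATEGORY"    = EXCLUDED."CATEGORY",
              "MODIFY_DATE" = EXCLUDED."MODIFY_DATE";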
For a PDI solution: your table does not look like a Slowly Changing Dimension, so the Insert/Update step covers your need.
If you want to use the Dimension Lookup/Update step, you need to alter the table into the Pentaho SCD format: add a version column plus valid_from_date/valid_upto_date columns (with PDI the alter is a one-button operation).
After that, when a new row comes in, the TICKETID + ORIGINATOR + CUSTOMER combination is looked up in the table, and if found, that row receives validity_upto = now(). At the same time, a row with version + 1 is created, valid from now() to the end of time.
The (main) pro is that you can retrieve the state of the table as it was at any date in the past with a simple WHERE your_date BETWEEN validity_from AND validity_upto.
The (main) con is that you have to alter the table, which may have some impact on the GUIs (plural).
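For example, a point-in-time lookup against the SCD version of the table might look like this (a sketch; the validity column names follow the answer above and are illustrative):
-- State of one ticket as of a chosen date
SELECT *
FROM "TICKETS"
WHERE "TICKETID" = 'T-1'
  AND "ORIGINATOR" = 'SYS-A'
  AND "CUSTOMER" = 'ACME'
  AND DATE '2015-06-30' BETWEEN validity_from AND validity_upto;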

Related

postgresql group by datetime in join query

I have 2 tables in my PostgreSQL TimescaleDB database (version 12.06) that I am trying to query with an inner join.
Tables' structure:
CREATE TABLE currency(
    id serial PRIMARY KEY,
    symbol TEXT NOT NULL,
    name TEXT NOT NULL,
    quote_asset TEXT
);
CREATE TABLE currency_price (
    currency_id integer NOT NULL,
    dt timestamp WITHOUT TIME ZONE NOT NULL,
    open NUMERIC NOT NULL,
    high NUMERIC NOT NULL,
    low NUMERIC NOT NULL,
    close NUMERIC,
    volume NUMERIC NOT NULL,
    PRIMARY KEY (currency_id, dt),
    CONSTRAINT fk_currency FOREIGN KEY (currency_id) REFERENCES currency(id)
);
The query I'm trying to make is:
SELECT currency_id AS id, symbol, MAX(close) AS close, DATE(dt) AS date
FROM currency_price
JOIN currency ON currency.id = currency_price.currency_id
GROUP BY currency_id, symbol, date
LIMIT 100;
Basically, it returns all the rows that exist in the currency_price table. I know that Postgres doesn't allow selecting columns without an aggregate function or without including them in the GROUP BY clause. So if I don't include the dt column in my select query, I receive the expected results, but if I include it, the output shows a row for every single day of each currency, while I only want the max value for each currency and to filter them by various dates afterwards.
I'm very inexperienced with SQL in general.
Any suggestions to solve this would be very appreciated.
There are several ways to do it; the easiest one that comes to mind is using window functions.
select *
from (
    select currency_id, symbol, close, dt,
           row_number() over (partition by currency_id, symbol
                              order by close desc, dt desc) as rr
    from currency_price
    join currency on currency.id = currency_price.currency_id
    where dt::date = '2021-06-07'
) q1
where rr = 1
General documentation on window functions:
https://www.postgresql.org/docs/9.5/functions-window.html
This approach also works with standard aggregate functions like SUM, AVG, MAX, MIN, and others.
Some examples: https://www.postgresqltutorial.com/postgresql-window-function/
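For instance, a small sketch of the same idea with MAX used as a window function (each row keeps its own close plus the per-currency maximum for that day):
-- max_close is attached to every row of the partition instead of collapsing rows
SELECT currency_id, symbol, dt, close,
       MAX(close) OVER (PARTITION BY currency_id) AS max_close
FROM currency_price
JOIN currency ON currency.id = currency_price.currency_id
WHERE dt::date = '2021-06-07';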

Speed up heavy UPDATE..FROM..WHERE PostgreSQL query

I have 2 big tables
CREATE TABLE "public"."linkages" (
    "supplierid" integer NOT NULL,
    "articlenumber" character varying(32) NOT NULL,
    "article_id" integer,
    "vehicle_id" integer
);
CREATE INDEX "__linkages_datasupplierarticlenumber" ON "public"."__linkages" USING btree ("datasupplierarticlenumber");
CREATE INDEX "__linkages_supplierid" ON "public"."__linkages" USING btree ("supplierid");
having 215 000 000 records, and
CREATE TABLE "public"."article" (
    "id" integer DEFAULT nextval('tecdoc_article_id_seq') NOT NULL,
    "part_number" character varying(32),
    "supplier_id" integer,
    CONSTRAINT "tecdoc_article_part_number_supplier_id" UNIQUE ("part_number", "supplier_id")
) WITH (oids = false);
having 5 500 000 records.
I need to update linkages.article_id according to article.part_number and article.supplier_id, like this:
UPDATE linkages
SET article_id = article.id
FROM article
WHERE
    linkages.supplierid = article.supplier_id AND
    linkages.articlenumber = article.part_number;
But it is too heavy. I tried it, but it ran for a day with no result, so I terminated it.
I need to do this update only once, to normalize my table structure for using foreign keys in the Django ORM. How can I resolve this issue?
Thanks a lot!
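One commonly suggested direction for a one-off backfill like this (a sketch only, not an answer from the original posting; it assumes the UNIQUE constraint on (part_number, supplier_id) already gives article a usable index): run the update in batches so each transaction stays small, for example one supplier at a time driven by a script.
-- Illustrative batch: update a single supplier's slice of the 215M-row table
-- per statement and loop over supplier ids externally.
UPDATE linkages
SET article_id = article.id
FROM article
WHERE linkages.supplierid = 12345
  AND linkages.supplierid = article.supplier_id
  AND linkages.articlenumber = article.part_number;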

T-SQL data type for 39-digit number?

I'm using the free IPv6 data table from Ip2Location. The table schema is defined as:
CREATE TABLE [ip2location].[dbo].[ip2location_db11_ipv6](
    [ip_from] char(39) NOT NULL,
    [ip_to] char(39) NOT NULL,
    [country_code] nvarchar(2) NOT NULL,
    [country_name] nvarchar(64) NOT NULL,
    [region_name] nvarchar(128) NOT NULL,
    [city_name] nvarchar(128) NOT NULL,
    [latitude] float NOT NULL,
    [longitude] float NOT NULL,
    [zip_code] nvarchar(30) NOT NULL,
    [time_zone] nvarchar(8) NOT NULL
) ON [PRIMARY]
GO
CREATE INDEX [ip_from] ON [ip2location].[dbo].[ip2location_db11_ipv6]([ip_from]) ON [PRIMARY]
GO
CREATE INDEX [ip_to] ON [ip2location].[dbo].[ip2location_db11_ipv6]([ip_to]) ON [PRIMARY]
GO
Queries against the table are done by converting an IPv6 address to an IP number. The provided search query is:
SELECT *
FROM ip2location_db11_ipv6
WHERE #IpNumber BETWEEN ip_from AND ip_to
My average query time based on a few random look ups was ~1.2 seconds. While there are 3,458,959 records in the table, this seems a bit slow to me (is it? I'm by no means a SQL guru). My first thought was to make the ip_from and ip_to columns a numeric data type, but the maximum value is 58569107375850417935858934690443427840 (39 digits), which falls outside of the maximum range for a DECIMAL type. Is there anything that can be done to improve the query time for this?
You may try to use BINARY(17); a 39-digit decimal number fits into 17 bytes. See the binary and varbinary types documentation.
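A rough sketch of what that could look like (the _bin table name, the @IpNumberBin variable, and the assumption that the application converts the decimal IP number to a fixed-length big-endian value before loading and querying are all illustrative):
-- Fixed-length big-endian binary values compare byte by byte, so BETWEEN and
-- the two range indexes keep working.
CREATE TABLE [ip2location].[dbo].[ip2location_db11_ipv6_bin](
    [ip_from] binary(17) NOT NULL,
    [ip_to] binary(17) NOT NULL,
    [country_code] nvarchar(2) NOT NULL,
    [country_name] nvarchar(64) NOT NULL
    -- remaining columns as in the original table
)
GO
-- ::1 as a 17-byte big-endian value; the conversion itself would be done in
-- application code, as with #IpNumber above
DECLARE @IpNumberBin binary(17) = 0x0000000000000000000000000000000001
SELECT *
FROM ip2location_db11_ipv6_bin
WHERE @IpNumberBin BETWEEN ip_from AND ip_to
GO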

My SQL query runs slow, need to optimise

I'm having issues with the below query for my iPhone app. When the app runs the query it takes quite a while to return the result, maybe around a second or so. I was wondering if the query can be optimised in any way? I'm using the FMDB framework to process all my SQL.
select pd.discounttypeid, pd.productdiscountid, pd.quantity, pd.value, p.name, p.price, pi.path
from productdeals as pd, product as p, productimages as pi
where pd.productid = 53252
and pd.discounttypeid == 8769
and pd.productdiscountid = p.parentproductid
and pd.productdiscountid = pi.productid
and pi.type = 362
order by pd.id
limit 1
My statements are below for the tables:
CREATE TABLE "ProductImages" (
    "ProductID" INTEGER,
    "Type" INTEGER,
    "Path" TEXT
)
CREATE TABLE "Product" (
    "ProductID" INTEGER PRIMARY KEY,
    "ParentProductID" INTEGER,
    "levelType" INTEGER,
    "SKU" TEXT,
    "Name" TEXT,
    "BrandID" INTEGER,
    "Option1" INTEGER,
    "Option2" INTEGER,
    "Option3" INTEGER,
    "Option4" INTEGER,
    "Option5" INTEGER,
    "Price" NUMERIC,
    "RRP" NUMERIC,
    "averageRating" INTEGER,
    "publishedDate" DateTime,
    "salesLastWeek" INTEGER
)
CREATE TABLE "ProductDeals" (
    "ID" INTEGER,
    "ProductID" INTEGER,
    "DiscountTypeID" INTEGER,
    "ProductDiscountID" INTEGER,
    "Quantity" INTEGER,
    "Value" INTEGER
)
Do you have indexes on the foreign key columns (productimages.productid and product.parentproductid), and on the columns you use to find the right product deal (productdeals.productid and productdeals.discounttypeid)? If not, that could be the cause of the poor performance.
You can create them like this:
CREATE INDEX idx_images_productid ON productimages(productid);
CREATE INDEX idx_products_parentid ON product(parentproductid);
CREATE INDEX idx_deals ON productdeals(productid, discounttypeid);
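To check whether SQLite actually uses them, you can prefix the original query with EXPLAIN QUERY PLAN (a quick sanity check you can run in the sqlite3 shell; the query body is the one from the question):
-- With the indexes in place this should report SEARCH ... USING INDEX steps
-- rather than full table SCAN steps.
EXPLAIN QUERY PLAN
select pd.discounttypeid, pd.productdiscountid, pd.quantity, pd.value, p.name, p.price, pi.path
from productdeals as pd, product as p, productimages as pi
where pd.productid = 53252
  and pd.discounttypeid = 8769
  and pd.productdiscountid = p.parentproductid
  and pd.productdiscountid = pi.productid
  and pi.type = 362
order by pd.id
limit 1;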
The query below could help reduce the execution time; moreover, try to create indexes on the right fields to speed up your query.
select pd.discounttypeid, pd.productdiscountid, pd.quantity, pd.value, p.name, p.price, pi.path
from productdeals pd
join product p on pd.productdiscountid = p.parentproductid
join productimages pi on pd.productdiscountid = pi.productid
where pd.productid = 53252
  and pd.discounttypeid = 8769
  and pi.type = 362
order by pd.id
limit 1
Thanks

An empty row with null-like values in not-null fields

I'm using postgresql 9.0 beta 4.
After inserting a lot of data into a partitioned table, I found a weird thing. When I query the table, I can see an empty row with null-like values in 'not-null' fields.
That weird query result is shown below.
The 689th row is empty. The first 3 fields (stid, d, ticker) make up the primary key, so they should not be null. The query I used is this:
select * from st_daily2 where stid=267408 order by d
I can even do the group by on this data.
select stid, date_trunc('month', d) ym, count(*) from st_daily2
where stid=267408 group by stid, date_trunc('month', d)
The GROUP BY result still has the empty row.
The 1st row is empty.
But if I query where stid or d is null, it returns nothing.
Is this a bug in PostgreSQL 9.0 beta 4, or some data corruption?
EDIT :
I added my table definition.
CREATE TABLE st_daily
(
    stid integer NOT NULL,
    d date NOT NULL,
    ticker character varying(15) NOT NULL,
    mp integer NOT NULL,
    settlep double precision NOT NULL,
    prft integer NOT NULL,
    atr20 double precision NOT NULL,
    upd timestamp with time zone,
    ntrds double precision
)
WITH (
    OIDS=FALSE
);
CREATE TABLE st_daily2
(
    CONSTRAINT st_daily2_pk PRIMARY KEY (stid, d, ticker),
    CONSTRAINT st_daily2_strgs_fk FOREIGN KEY (stid)
        REFERENCES strgs (stid) MATCH SIMPLE
        ON UPDATE CASCADE ON DELETE CASCADE,
    CONSTRAINT st_daily2_ck CHECK (stid >= 200000 AND stid < 300000)
)
INHERITS (st_daily)
WITH (
    OIDS=FALSE
);
The data in this table is simulation results. Multiple multithreaded simulation engines written in C# insert data into the database using Npgsql.
psql also shows the empty row.
You'd better leave a posting at http://www.postgresql.org/support/submitbug
Some questions:
- Could you show us the table definitions and constraints for the partitions?
- How did you load your data?
- Do you get the same result when using another tool, like psql?
The answer to your problem may very well lie in your first sentence:
I'm using postgresql 9.0 beta 4.
Why would you do that? Upgrade to a stable release. Preferably the latest point-release of the current version.
This is 9.1.4 as of today.
I got to the same point: "what in the heck is that blank value?"
No, it's not a NULL, it's a -infinity.
To filter for such a row use:
WHERE
    CASE WHEN mytestcolumn = '-infinity'::timestamp
           OR mytestcolumn = 'infinity'::timestamp
         THEN NULL ELSE mytestcolumn
    END IS NULL
instead of:
WHERE mytestcolumn IS NULL
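A quick way to confirm that the special value exists and is not NULL (a minimal sketch):
SELECT '-infinity'::timestamp AS ts,
       '-infinity'::timestamp IS NULL AS ts_is_null;  -- ts_is_null is false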