database design for new system but legacy dependency - postgresql

We are planning a new project (a complete relaunch) of a web application in PHP (Symfony 2) and PostgreSQL. Currently we use PHP and MySQL (MyISAM). -> webapp
The current and the new webapp depend on another system (.NET) including a database (MS SQL 8 / 2000), which will not be modified (neither changed nor merged with our database) anytime soon, because there is a complex workflow around the whole megillah. -> legacy system
BTW: the biggest table has 27 million rows in total.
Most of the data/tables will be transferred multiple times per day from the legacy database to the webapp database. For the new webapp we have already redesigned most of the database schema, so we now have an almost normalised schema (the schema of the legacy database is massively redundant and really messy).
Currently the transfer job tries to insert the data first. When that fails with the specific duplicate-key error code, we know the row is already there and do an update instead. This is done for performance (no SELECT before UPDATE).
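For illustration, a minimal PL/pgSQL sketch of that insert-first pattern (table, column, and function names are placeholders, not our real schema):

CREATE OR REPLACE FUNCTION transfer_row(p_id text, p_payload text) RETURNS void AS $$
BEGIN
    -- optimistic insert: assume the row is new
    INSERT INTO webapp_table (id, payload) VALUES (p_id, p_payload);
EXCEPTION WHEN unique_violation THEN
    -- duplicate key: the row already exists, so update it instead
    UPDATE webapp_table SET payload = p_payload WHERE id = p_id;
END;
$$ LANGUAGE plpgsql;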
For the new webapp schema we still want to use the same primary IDs as in the legacy database. But there are some problems, one of them: some tables have primary keys that look like integers, but aren't. Most of the rows have integer values like 123456, but some rows contain a character, like 123456P32.
Now there are two options for the new schema:
Use string type for PK and risk performance issues
Use integer type for PK and make a conversion
The conversion could look like this (character based):
legacy   new
------   ---
0        10
1        11
2        12
.        ..
9        19
a        20
b        21
.        ..
y        45
z        46
A        50   (not 47, so the uppercase range starts at a 'clean' second digit)
B        51
.        ..
Z        76
The legacy PK 123 would be converted into 111213, so the length doubles. Another example: 123A9 -> 1112135019. Because every character maps to exactly two digits, the value can also be converted back.
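To illustrate, a rough PL/pgSQL sketch of such an encoder (the offsets follow the table above, digits from 10, lowercase from 20, uppercase from 50; the exact values are illustrative and may need adjusting):

CREATE OR REPLACE FUNCTION legacy_key_to_int(p_key text) RETURNS numeric AS $$
DECLARE
    c      text;
    code   int;
    digits text := '';
BEGIN
    FOR i IN 1 .. length(p_key) LOOP
        c := substr(p_key, i, 1);
        code := CASE
            WHEN ascii(c) BETWEEN ascii('0') AND ascii('9') THEN 10 + ascii(c) - ascii('0')
            WHEN ascii(c) BETWEEN ascii('a') AND ascii('z') THEN 20 + ascii(c) - ascii('a')
            WHEN ascii(c) BETWEEN ascii('A') AND ascii('Z') THEN 50 + ascii(c) - ascii('A')
        END;
        IF code IS NULL THEN
            RAISE EXCEPTION 'unexpected character % in legacy key %', c, p_key;
        END IF;
        digits := digits || code;
    END LOOP;
    RETURN digits::numeric;
END;
$$ LANGUAGE plpgsql IMMUTABLE STRICT;

-- SELECT legacy_key_to_int('123');    --> 111213
-- SELECT legacy_key_to_int('123A9');  --> 1112135019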
My first doubt was that the sparse PKs would cause performance issues, but with a b-tree (self-balancing) index, which is the default index type in Postgres, it should be fine.
What do you think? Do you have any experience with similar systems with legacy dependencies?

PostgreSQL performance with text PK isn't that bad — I'd go with it for simplicity.
You didn't tell us how long these keys can be. With your conversion, an ordinary integer would be enough only for a 4-character key, and a bigint only for a 9-character key (a 5-character key already becomes 10 digits, up to 7676767676, which exceeds the integer maximum of 2147483647).

Use CREATE DOMAIN to isolate the proposed data types. Then build and test a prototype. You're lucky; you have no shortage of valid test data.
create domain legacy_key as varchar(15) not null;

create table your_first_table (
    new_key_name legacy_key primary key
    -- other columns go here.
);
To test a second database using integer keys, dump the schema, change that one line (and the name of the database if you want to have them both at the same time), and reload.
create domain legacy_key as bigint not null;
You should think hard about storing the legacy system's primary keys exactly as they are. Nothing to debug: great peace of mind. If you must convert, be careful with values like '1234P45'. If that letter happens to be an E or a D, some applications will interpret it as indicating an exponent.
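For example, a quick demonstration of the pitfall:

SELECT '1234E5'::float;   -- 123400000: the key silently became an exponent
SELECT '1234P5'::float;   -- ERROR: invalid input syntax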
You shouldn't have performance problems due to key length if you're using varchar() keys of 10 or 15 characters, especially with version 9.2. Read the documentation about indexes before you start. PostgreSQL supports more kinds of indexes than most people realize.

Binary to binary cast with JSONb

How to avoid the unnecessary CPU cost?
See this historic question with failure tests. Example: j->'x' is a JSONb value representing a number and j->'y' a boolean. From the first versions of JSONb (released in 2014 with 9.4) until today (6 years!), with PostgreSQL v12, it seems that we need to enforce a double conversion:
Discard the j->'x' "binary JSONb number" information and transform it into the printable string j->>'x'; likewise discard the j->'y' "binary JSONb boolean" information and transform it into the printable string j->>'y'.
Parse the string to obtain a "binary SQL float" by casting: (j->>'x')::float AS x; parse the string to obtain a "binary SQL boolean" by casting: (j->>'y')::boolean AS y.
Is there no syntax or optimized function that would let a programmer enforce a direct conversion?
I don't see one in the guide... Or was it never implemented: is there a technical barrier to it?
NOTES about a typical scenario where we need it
(responding to comments)
Imagine a scenario where your system needs to store many, many small datasets (a real example!) with minimal disk usage, managing them all with centralized control/metadata/etc. JSONb is a good solution, and offers at least 2 good alternatives for storage in the database:
Metadata (with a schema descriptor) and the whole dataset in an array of arrays;
Separating metadata and dataset rows into two tables.
(and variations where the metadata is translated into a cache of text[], etc.) Alternative 1, the monolithic one, is the best for the "minimal disk usage" requirement, and faster for full information retrieval. Alternative 2 can be the choice for random access or partial retrieval, when the table Alt2_DatasetLine also has more columns, like time, for time series.
You can create all the SQL VIEWs in a separate schema, for example:
CREATE VIEW mydatasets.t1234 AS
  SELECT (j->>'d')::date AS d, j->>'t' AS t, (j->>'b')::boolean AS b,
         (j->>'i')::int AS i, (j->>'f')::float AS f
  FROM (
    SELECT jsonb_array_elements(j_alldata) AS j
    FROM Alt1_AllDataset   -- or FROM Alt2...
    WHERE dataset_id = 1234
  ) t;
And all the CREATE VIEWs can be automated by running the SQL string dynamically... we can reproduce the above "stable schema casting" with simple formatting rules extracted from the metadata:
SELECT string_agg( CASE
         WHEN x[2] != 'text' THEN format(E'(j->>\'%s\')::%s AS %s', x[1], x[2], x[1])
         ELSE format(E'j->>\'%s\' AS %s', x[1], x[1])
       END, ',' ) AS x2
FROM (
  SELECT regexp_split_to_array(trim(x), '\s+') AS x
  FROM regexp_split_to_table('d date, t text, b boolean, i int, f float', ',') t1(x)
) t2;
... It's a "real life scenario", and this (apparently ugly) model is surprisingly fast for small-traffic applications. And there are other advantages besides the disk usage reduction: flexibility (you can change a dataset schema without changing the SQL schema) and scalability (2, 3, ... 1 billion different datasets on the same table).
Returning to the question: imagine a dataset with ~50 or more columns; the SQL VIEW would be faster if PostgreSQL offered a "binary to binary" cast.
Short answer: No, there is no better way to extract a jsonb number in PostgreSQL than (for example)
CAST(j ->> 'attr' AS double precision)
A JSON number happens to be stored as PostgreSQL numeric internally, so that wouldn't work "directly" anyway. But there is no reason in principle why there could not be a more efficient way to extract such a value as numeric.
So, why don't we have that?
Nobody has implemented it. That is often an indication that nobody thought it worth the effort. I personally think that this would be a micro-optimization: if you want to go for maximum efficiency, you extract that column from the JSON and store it directly as a column in the table.
It is not necessary to modify the PostgreSQL source to do this. It is possible to write your own C function that does exactly what you envision. If many people thought this was beneficial, I'd expect that somebody would already have written such a function.
PostgreSQL has just-in-time compilation (JIT). So if an expression like this is evaluated for a lot of rows, PostgreSQL will build executable code for that on the fly. That mitigates the inefficiency and makes it less necessary to have a special case for efficiency reasons.
It might not be quite as easy as it seems for many data types. JSON standard types don't necessarily correspond to PostgreSQL types in all cases. That may seem contrived, but look at this recent thread on the Hackers mailing list that deals with the differences between the numeric types of JSON and PostgreSQL.
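One illustrative mismatch (my own example, not from that thread): PostgreSQL's numeric accepts NaN, while plain JSON has no way to represent it:

SELECT 'NaN'::numeric;        -- fine in PostgreSQL
SELECT '{"x": NaN}'::jsonb;   -- ERROR: invalid input syntax for type json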
None of the above are reasons that such a feature could never exist; I just wanted to give reasons why we don't have it.

Changing GenerationType.IDENTITY to GenerationType.SEQUENCE (and back) PostgreSQL

We had a database with entities using GenerationType.IDENTITY. There were some technical issues with it, and I replaced it with GenerationType.SEQUENCE because I misunderstood what SEQUENCE does. Now our tables are all numbered in nice +1 increments up to the point where the usage of SEQUENCE started -> from there the numbering is shared across all tables and quickly spirals to huge numbers, with huge gaps in each table, since it is shared across the whole system.
When I reverted to IDENTITY, it resumed numbering where it had stopped when IDENTITY was last used. Can IDENTITY produce a number that collides with a number from SEQUENCE (and what would happen if it did)? (Sadly I cannot really test it, because the last IDENTITY entry is, for example, 9000 and the first SEQUENCE entry is 25 million; that's because there is a lot of addition/deletion in the DB over a month.)
Is there any way to maybe start numbering IDENTITY from the last SEQUENCE value then?
Thanks for any help.
The difference between IDENTITY and SEQUENCE, at least in PostgreSQL, is that IDENTITY creates a sequence per table, while SEQUENCE uses one global sequence for all the tables, as you noticed.
Assuming you have an entity named MyEntity (or my_entity as the table name) with the IDENTITY strategy, there should be a sequence named something like my_entity_id_sequence. The SEQUENCE strategy's sequence might be named (this depends on the JPA implementation) something like hibernate_sequence.
If you can access the database, for example with psql, you can just check the IDENTITY sequence and set its value:
ALTER SEQUENCE my_entity_id_sequence RESTART WITH your_value;
where your_value can be obtained, for example, with
SELECT max(id)+1 FROM my_entity;
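Since ALTER SEQUENCE does not accept a subquery, one way to do both steps in a single statement is setval() (same hypothetical names as above):

SELECT setval('my_entity_id_sequence',
              (SELECT coalesce(max(id), 0) + 1 FROM my_entity),
              false);  -- false: the next nextval() returns exactly this value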

How to implement a high performing non incremental ID in postgresql? [duplicate]

I would like to replace some of the sequences I use for ids in my PostgreSQL db with my own custom-made id generator. The generator would produce a random number with a check digit at the end. So this:
SELECT nextval('customers')
would be replaced by something like this:
SELECT get_new_rand_id('customer')
The function would then return a numerical value such as: [1-9][0-9]{9} where the last digit is a checksum.
The concerns I have are:
How do I make the thing atomic
How do I avoid returning the same id twice (this would be caught by trying to insert it into a column with a unique constraint, but by then it's too late, I think)
Is this a good idea at all?
Note1: I do not want to use uuid, since the value is to be communicated to customers and 10 digits are far simpler to communicate than the 36-character uuid.
Note2: The function would rarely be called with SELECT get_new_rand_id(); it would instead be assigned as the default value on the id column, in place of nextval().
EDIT: Ok, good discussion below! Here are some explanations of why:
So why would I over-complicate things this way? The purpose is to hide the primary key from the customers.
I give each new customer a unique customerId (a serial number generated in the db). Since I communicate that number to the customer, it is a fairly simple task for my competitors to monitor my business (there are other numbers, such as invoice nr and order nr, that have the same properties). It is this monitoring I would like to make a little bit harder (note: not impossible, but harder).
Why the check digit?
Before there was any talk of hiding the serial nr, I added a check digit to the order nr since there were clumsy fingers at some points in production, and my thought was that this would be a good practice to keep in the future.
After reading the discussion I can certainly see that my approach is not the best way to solve my problem, but I have no other good idea of how to solve it, so please help me out here.
Should I add an extra column where I put the id I expose to the customer and keep the serial as primary key?
How can I generate the id to expose in a sane and efficient way?
Is the checkdigit necessary?
For generating unique and random-looking identifiers from a serial, using ciphers might be a good idea. Since their output is bijective (there is a one-to-one mapping between input and output values), you will not have any collisions, unlike with hashes. This also means your identifiers don't have to be as long as hashes.
Most cryptographic ciphers work on 64-bit or larger blocks, but the PostgreSQL wiki has an example PL/pgSQL procedure for a "non-cryptographic" cipher function that works on (32-bit) int type. Disclaimer: I have not tried using this function myself.
To use it for your primary keys, run the CREATE FUNCTION call from the wiki page, and then on your empty tables do:
ALTER TABLE foo ALTER COLUMN foo_id SET DEFAULT pseudo_encrypt(nextval('foo_foo_id_seq')::int);
And voila!
pg=> insert into foo (foo_id) values(default);
pg=> insert into foo (foo_id) values(default);
pg=> insert into foo (foo_id) values(default);
pg=> select * from foo;
   foo_id
------------
 1241588087
 1500453386
 1755259484
(3 rows)
I added my comment to your question and then realized that I should have explained myself better... My apologies.
You could have a second key, not the primary key, that is visible to the user. That key could use the primary key as the seed for the hash function you describe and be the one you use for lookups. That key would be generated by a trigger on insert (which is much simpler than trying to ensure atomicity of the operation), and that is the key you share with your clients, never the PK.
I know there is debate (albeit, I can't understand why) over whether PKs should be invisible to user applications or not. Modern database design practice, and my personal experience, all seem to suggest that PKs should NOT be visible to users. Users tend to attach meaning to them and, over time, that is a very bad thing, regardless of whether the key has a check digit or not.
Your joins will still be done using the PK. This other generated key is just the one used for client lookups. It is the face; the PK is the guts.
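A rough sketch of that arrangement (all names are hypothetical; pseudo_encrypt is the wiki function mentioned in the other answer, standing in for whatever hash or cipher you pick; note that a BEFORE INSERT row trigger is the simplest way to fill the column in the same statement):

CREATE TABLE customers (
    id        serial PRIMARY KEY,  -- internal key: used for joins, never exposed
    public_id bigint UNIQUE,       -- the key you hand to clients
    name      text
);

CREATE OR REPLACE FUNCTION set_public_id() RETURNS trigger AS $$
BEGIN
    -- derive the exposed key from the primary key value
    NEW.public_id := pseudo_encrypt(NEW.id::int);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER customers_set_public_id
    BEFORE INSERT ON customers
    FOR EACH ROW EXECUTE PROCEDURE set_public_id();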
Hope that helps.
Edit: FWIW, there is little to be said about "right" or "wrong" in database design. Sometimes it boils down to a choice. I think the choice you face will be better served by leaving the PK alone and creating a secondary key - just that.
I think you are way over-complicating this. Why not let the database do what it does best and let it take care of atomicity and ensuring that the same id is not used twice? Why not use a PostgreSQL SERIAL type and get an autogenerated surrogate primary key, just like an integer IDENTITY column in SQL Server or DB2? Use that on the column instead. Plus, it will be faster than your user-defined function.
I concur regarding hiding this surrogate primary key and using an exposed secondary key (with a unique constraint on it) to look up clients in your interface.
Are you using a sequence because you need a unique identifier across several tables? This is usually an indication that you need to rethink your table design, and those several tables should perhaps be combined into one, with an autogenerated surrogate primary key.
Also see here
How you generate the random and unique ids is a useful question - but you seem to be making a counterproductive assumption about when to generate them!
My point is that you do not need to generate these ids at the time of creating your rows, because they are essentially independent of the data being inserted.
What I do is pre-generate random ids for future use; that way I can take my own sweet time and absolutely guarantee they are unique, and there's no processing to be done at the time of the insert.
For example I have an orders table with order_id in it. This id is generated on the fly when the user enters the order, incrementally 1,2,3 etc forever. The user does not need to see this internal id.
Then I have another table - random_ids with (order_id, random_id). I have a routine that runs every night which pre-loads this table with enough rows to more than cover the orders that might be inserted in the next 24 hours. (If I ever get 10000 orders in one day I'll have a problem - but that would be a good problem to have!)
This approach guarantees uniqueness and takes any processing load away from the insert transaction and into the batch routine, where it does not affect the user.
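A sketch of such a nightly top-up in SQL terms (the pool table and the batch size are my assumptions, and ON CONFLICT requires PostgreSQL 9.5+):

CREATE TABLE IF NOT EXISTS random_id_pool (
    random_id bigint PRIMARY KEY,
    order_id  integer   -- NULL until the id is handed out to an order
);

-- Generate 10-digit candidates; collisions with existing rows are skipped,
-- and DISTINCT removes duplicates within the batch itself.
INSERT INTO random_id_pool (random_id)
SELECT DISTINCT (1000000000 + floor(random() * 9000000000))::bigint
FROM generate_series(1, 20000)
ON CONFLICT (random_id) DO NOTHING;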
Your best bet would probably be some form of hash function, and then a checksum added to the end.
If you're not using this too often (you do not have a new customer every second, do you?) then it is feasible to just generate a random number and then try to insert the record. Just be prepared to retry the insert with another number when it fails with a unique-constraint violation.
I'd use numbers 100000 to 999999 (900000 possible numbers of the same length) and a check digit using the UPC or ISBN-10 algorithm. 2 check digits would be better, though, as they'll eliminate 99% of human errors instead of 90%.
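The check digit itself is only a few lines of PL/pgSQL; here is a sketch of a UPC-style calculation (strict UPC weights positions from the right; for fixed-length ids, weighting from the left as done here works just as well):

CREATE OR REPLACE FUNCTION check_digit(base text) RETURNS int AS $$
DECLARE
    s int := 0;
BEGIN
    FOR i IN 1 .. length(base) LOOP
        -- UPC-style weights: 3 on odd positions, 1 on even positions
        s := s + substr(base, i, 1)::int * CASE WHEN i % 2 = 1 THEN 3 ELSE 1 END;
    END LOOP;
    RETURN (10 - s % 10) % 10;
END;
$$ LANGUAGE plpgsql IMMUTABLE STRICT;

-- SELECT '123456' || check_digit('123456');  -- append the digit to the base id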

ETL Process when and how to add in Foreign Keys T-SQL SSIS

I am in the early stages of creating a Data Warehouse based loosely on the Kimball methodology.
I am currently investigating my source data. I understand that adding a primary key (not a natural key) will then allow me to make the connections between the facts and dimensions.
Sounds like a silly question but how exactly is this done? Are there any good articles that run through this process?
I would imagine we bring in all of the dimensions first. And when the fact data is brought over, a lookup is performed that "pushes" the foreign key into the fact table? At what point is this done? Within SSIS, what is the "best practice" method? Is this all done in one package, for example?
Is that roughly how it happens?
In this case, do we have to be particularly careful in what order we load our data, or else we could be loading facts for which there is no corresponding dimension?
I would imagine we bring in all of the dimensions first. And when the fact data is brought over, a lookup is performed that "pushes" the foreign key into the fact table? At what point is this done? Within SSIS, what is the "best practice" method? Is this all done in one package, for example?
It would depend on your schema and table design.
Assuming it's a star schema and the FK is based on the data value itself:
DIM1 <- FACT1 -> DIM2
 ^
 |
FACT2 -> DIM3
you'll first fill DIM1 and DIM2 before inserting into FACT1, as you would need the FKs (and likewise DIM1 and DIM3 before FACT2).
Assuming it's a snowflake schema:
DIM1_1
  ^
  |
DIM1 <- FACT1 -> DIM2
you'll first fill DIM1_1, then DIM1 and DIM2, before inserting into FACT1.
Assuming the FK relation is based on something else (mostly a number) instead of the data value itself (kind of an optimization when dealing with huge amounts of data and/or strings as dimension values), you won't need to wait until you insert the data into the DIM table. I'm sure that sounds very confusing :), so I'll try to explain briefly. The steps involved would be something like this (assume a simple star schema with 2 tables, FACT1 and DIMENSION1):
Extract FACT and DIMENSION values from the data set you are processing.
Generate a unique number based on the DIMENSION's value (which, say, is a string), using a reproducible algorithm (e.g. SHA-1: given the same string, it always yields the same number).
Insert into FACT1 table, the number and FACT values.
Insert into DIMENSION1 table, the number and DIMENSION values.
Steps 3 & 4 can be done in parallel, as long as there is NO constraint in place. A join on a numeric column is more efficient than one on a string.
And there is no need to store the mapping for #2 because it's reproducible (just ensure you pick the right algo).
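In T-SQL terms, step 2 could look like this sketch (my illustration, not from the answer; it keeps 8 of SHA-1's 20 bytes, accepting the small collision risk):

DECLARE @dim_value nvarchar(200) = N'some dimension value';

-- The same input always yields the same bytes, so no mapping table is needed.
SELECT CAST(SUBSTRING(HASHBYTES('SHA1', @dim_value), 1, 8) AS bigint) AS dim_key;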
Obviously this can be extended for snowflake schema and/or multiple dimensions.
HTH

DB2 Auto generated Column / GENERATED ALWAYS pros and cons over sequence

Earlier we were using 'GENERATED ALWAYS' for generating the values for a primary key. But now it is suggested that, instead of using 'GENERATED ALWAYS', we should use a sequence for populating the value of the primary key. What do you think can be the reason for this change? Is this just a matter of choice?
Earlier Code:
CREATE TABLE SCH.TAB1
(TAB_P INTEGER NOT NULL GENERATED ALWAYS AS IDENTITY (START WITH 1, INCREMENT BY 1, NO CACHE),
.
.
);
Now it is
CREATE TABLE SCH.TAB1
(TAB_P INTEGER,
.
.
);
Now, while inserting, the value for TAB_P is generated via a sequence.
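For illustration, the sequence-based variant might look like this (the sequence name is hypothetical):

CREATE SEQUENCE SCH.TAB1_SEQ AS INTEGER START WITH 1 INCREMENT BY 1 NO CACHE;

INSERT INTO SCH.TAB1 (TAB_P)
VALUES (NEXT VALUE FOR SCH.TAB1_SEQ);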
I tend to use identity columns more than sequences, but I'll compare the two for you.
Sequences can generate numbers for any purpose, while an identity column is strictly attached to a column in a table.
Since a sequence is an independent object, it can generate numbers for multiple tables (or anything else), and is not affected when any table is dropped. When a table with an identity column is dropped, there is no memory of what value was last assigned by that identity column.
A table can have only one identity column, so if you want to record multiple sequential numbers into different columns in the same table, sequence objects can handle that.
The most common requirement for a sequential number generator in a database is to assign a technical key to a row, which is handled well by an identity column. For more complicated number generation needs, a sequence object offers more flexibility.
This is probably to handle ids in case there are lots of deletes on the table.
For example, in the case of identity, if your ids are
1
2
3
Now if you delete record 3, your table will have
1
2
And then if you insert a new record, the ids will be
1
2
4
As opposed to this, if you are not using an identity column and are generating the id in code, then after the delete you can calculate the id for the new insert as max(id) + 1, so the ids will be in order:
1
2
3
I can't think of any other reason why an identity column should not be used.
Here's something I found on the publib site:
Comparing IDENTITY columns and sequences
While there are similarities between IDENTITY columns and sequences, there are also differences. The characteristics of each can be used when designing your database and applications.
An identity column has the following characteristics:
An identity column can be defined as part of a table only when the table is created. Once a table is created, you cannot alter it to add an identity column. (However, existing identity column characteristics might be altered.)
An identity column automatically generates values for a single table.
When an identity column is defined as GENERATED ALWAYS, the values used are always generated by the database manager. Applications are not allowed to provide their own values during the modification of the contents of the table.
A sequence object has the following characteristics:
A sequence object is a database object that is not tied to any one table.
A sequence object generates sequential values that can be used in any SQL or XQuery statement.
Since a sequence object can be used by any application, there are two expressions used to control the retrieval of the next value in the specified sequence and the value generated previous to the statement being executed. The PREVIOUS VALUE expression returns the most recently generated value for the specified sequence for a previous statement within the current session. The NEXT VALUE expression returns the next value for the specified sequence. The use of these expressions allows the same value to be used across several SQL and XQuery statements within several tables.
While these are not all of the characteristics of these two items, these characteristics will assist you in determining which to use depending on your database design and the applications using the database.
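A quick illustration of those two expressions in DB2 (using the hypothetical sequence from the sketch above):

VALUES NEXT VALUE FOR SCH.TAB1_SEQ;      -- generates and returns the next value
VALUES PREVIOUS VALUE FOR SCH.TAB1_SEQ;  -- returns the value most recently generated in this session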
I don't know why anyone would EVER use an identity column rather than a sequence.
Sequences accomplish the same thing and are far more straightforward. Identity columns are much more of a pain, especially when you want to do unloads and loads of the data to other environments. I'm not going to go into all the differences, as that information can be found in the manuals, but I can tell you that the DBAs almost always have to get involved anytime a user wants to migrate data from one environment to another when a table with an identity column is involved, because it can get confusing for the users. We have no issues when a sequence is used. We allow the users to update any schema objects, so they can alter their sequences if they need to.