PostgreSQL - random primary key

I need a primary key for a PostgreSQL table. The ID should be a number of about 20 digits.
I am a beginner with databases and have not worked with PostgreSQL before. I found some examples for a random ID, but those examples used characters and I need only an integer.
Can anyone help me solve this problem?

I'm guessing you actually mean random 20-digit numbers, because a random number between 1 and 20 would rapidly repeat and cause collisions.
What you need probably isn't actually a random number; it's a number that appears random while actually being a non-repeating pseudo-random sequence. Otherwise your inserts will randomly fail when there's a collision.
When I wanted to do something like this a while ago I asked the pgsql-general list, and got a very useful piece of advice: use a Feistel cipher over a normal sequence. See this useful wiki example. Credit to Daniel Vérité for the implementation.
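The function from that wiki page looks roughly like the following (a sketch reproduced from memory; take the canonical version from the wiki itself):

CREATE OR REPLACE FUNCTION pseudo_encrypt(value int) RETURNS int AS $$
DECLARE
    l1 int;
    l2 int;
    r1 int;
    r2 int;
    i  int := 0;
BEGIN
    -- split the input into two 16-bit halves
    l1 := (value >> 16) & 65535;
    r1 := value & 65535;
    -- three Feistel rounds with a fixed round function
    WHILE i < 3 LOOP
        l2 := r1;
        r2 := l1 # ((((1366 * r1 + 150889) % 714025) / 714025.0) * 32767)::int;
        l1 := l2;
        r1 := r2;
        i := i + 1;
    END LOOP;
    -- recombine the (swapped) halves; this is a permutation of the 32-bit input space,
    -- so distinct inputs always map to distinct outputs
    RETURN ((r1 << 16) + l1);
END;
$$ LANGUAGE plpgsql STRICT IMMUTABLE;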
Example:
postgres=# SELECT n, pseudo_encrypt(n) FROM generate_series(1,20) n;
n | pseudo_encrypt
----+----------------
1 | 1241588087
2 | 1500453386
3 | 1755259484
4 | 2014125264
5 | 124940686
6 | 379599332
7 | 638874329
8 | 898116564
9 | 1156015917
10 | 1410740028
11 | 1669489846
12 | 1929076480
13 | 36388047
14 | 295531848
15 | 554577288
16 | 809465203
17 | 1066218948
18 | 1326999099
19 | 1579890169
20 | 1840408665
(20 rows)
These aren't 20 digits, but you can pad them out by multiplying and truncating the result, or you can modify the Feistel cipher function to produce larger values.
To use this for key generation, just write:
CREATE SEQUENCE mytable_id_seq;
CREATE TABLE mytable (
id bigint primary key default pseudo_encrypt(nextval('mytable_id_seq')::int),
....
);
ALTER SEQUENCE mytable_id_seq OWNED BY mytable.id;

Related

Why does selecting a real type only print out max 6 digits in Postgres?

I have a Postgres table which has a column of real type.
When I query the values from this column (e.g. select amount from my_table), it rounds values larger than 100,000 and completely ignores the decimal part. For numbers larger than 1,000,000 it displays them using scientific notation. For numbers smaller than 100,000 it truncates the decimal part so that at most 6 significant digits are printed overall (e.g. 1,000.12345 becomes 1,000.12).
It is only when I cast the value to double precision (using CAST(amount AS double precision)) that it starts to behave as I'd expect and prints out all the stored decimal digits.
Does anyone have an idea why Postgres behaves in this manner?
The answer is likely in PostgreSQL source code postgres/src/fe_utils/print.c.
Comments for format_numeric_locale say:
/*
* Format a numeric value per current LC_NUMERIC locale setting
*
* Returns the appropriately formatted string in a new allocated block,
* caller must free.
*
* setDecimalLocale() must have been called earlier.
*/
See also comments in setDecimalLocale.
It's also possible that some reasons can be found in SQL standards.
OK but with PG 12.2 there are some formatting differences between real and double precision:
# select x, x::double precision from t order by 1;
x | x
---------------+-------------
123456 | 123456
999999 | 999999
1e+06 | 1000000
1.000001e+06 | 1000001
1.234567e+06 | 1234567
1.2345671e+06 | 1234567.125
1.234568e+06 | 1234568
9.999999e+06 | 9999999
9.999999e+06 | 9999999
1e+08 | 100000000
1.2345679e+09 | 1234567936
1.2345679e+09 | 1234567936
(12 rows)
And it seems that you can have 7 decimal digits of precision.
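Tangentially (my own aside, not from the answer above): if the goal is only to see more digits without casting, the extra_float_digits setting controls how many significant digits are emitted for real/double precision output; its semantics changed in PostgreSQL 12, so treat this as a sketch to verify on your version:

-- pre-12: raises real output from 6 to up to 9 significant digits;
-- 12+: any positive value selects the shortest exact representation (the default behaviour)
SET extra_float_digits = 3;
SELECT 1234567.125::real;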

Relational database design to represent similarity between rows of the same table

For background purposes: I'm using PostgreSQL with SQLAlchemy (Python).
Given a table of unique references as such:
references_table
-----------------------
id | reference_code
-----------------------
1 | CODEABCD1
2 | CODEABCD2
3 | CODEWXYZ9
4 | CODEPOIU0
...
In a typical scenario, I would have a separate items table:
items_table
-----------------------
id | item_descr
-----------------------
1 | `Some item A`
2 | `Some item B`
3 | `Some item C`
4 | `Some item D`
...
In such a typical scenario, the many-to-many relationship between references and items is set in a junction table:
references_to_items
-----------------------
ref_id (FK) | item_id (FK)
-----------------------
1 | 4
2 | 1
3 | 2
4 | 1
...
In that scenario, it is easy to model and obtain all references that are associated with the same item; for instance, item 1 has references 2 and 4, as per the table above.
However, in my scenario, there is no items_table. But I would still want to model the fact that some references refer to the same (non-represented) item.
I see a possibility to model that via a many-to-many junction table as such (associating FKs of the references table):
reference_similarities
-----------------------
ref_id (FK) | ref_id_similar (FK)
-----------------------
2 | 4
2 | 8
2 | 9
...
Where references with ID 2, 4, 8 and 9 would be considered 'similar' for the purposes of my data model.
However, the inconvenience here is that such a model requires choosing one reference (above, id=2) as a 'pivot', to which multiple others can be declared 'similar' in the reference_similarities table. Ref 2 is similar to 4 and ref 2 is similar to 8 ==> thus 4 is similar to 8.
So the question is: is there a better design that doesn't involve having a 'pivot' FK as above?
Ideally, I would store the 'similarity' as an Array of FKs as such:
reference_similarities
------------------------
id | ref_ids (Array of FKs)
------------------------
1 | [2, 4, 8, 9]
2 | [1, 3, 5]
...but I understand from https://dba.stackexchange.com/questions/60132/foreign-key-constraint-on-array-member that it is currently not possible to have foreign keys in PostgreSQL arrays. So I'm trying to figure out a better design for this model.
I understand that you want to group references into a set, and be able to query the set from any item in it.
You can use a hash function to hash a set, then use the hash as the pivot value.
For example, the set of values (2, 4, 8, 9) would be hashed like this:
hash = (((31*1 + 2)*31 + 4)*31 + 8)*31 + 9
You can refer to Arrays.hashCode in Java to see how a list of values is hashed:
int result = 1;
for (Object element : a)
    result = 31 * result + (element == null ? 0 : element.hashCode());
Table reference_similarities:
reference_similarities
-----------------------
ref_id (FK) | hash_value
-----------------------
2 | hash(2, 4, 8, 9) = 987204
4 | 987204
8 | 987204
9 | 987204
To query the set, you first look up the hash_value for a given ref_id, and then get all ref_ids that share that hash_value (see the sketch below).
The drawback of this solution is that every time you add a new value to a set, you have to rehash the set.
Another option is to write a function in Python that produces a unique hash_value when creating a new set.
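A minimal sketch of this approach in PostgreSQL (the similarity table from above, with hypothetical types; the hash values come from whatever hashing function you choose):

-- each reference belongs to exactly one similarity group, identified by its hash
CREATE TABLE reference_similarities (
    ref_id     integer PRIMARY KEY REFERENCES references_table (id),
    hash_value bigint  NOT NULL
);

-- all references similar to reference 2: look up its group hash, then the group's members
SELECT s2.ref_id
FROM reference_similarities s1
JOIN reference_similarities s2 USING (hash_value)
WHERE s1.ref_id = 2;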

Postgres Query for Beginners

OK, I deleted my previous post and will try this again. I don't know the topic well and I'm not sure whether this needs a loop, a stored function, or something else to get what I'm looking for. Here's sample data and expected output.
I have a single table A. The table has the following fields: date created, unique person key, type, location.
I need a Postgres query that says: for any given month (parameter, based on date created) and a given location (parameter, based on the location field), provide the fields below where the unique person key is duplicated within +/- 30 days of the date created in the given month, for the same type but across all locations.
Example Data
Date Created | Unique Person | Type | Location
---------------------------------------------------
2/5/2017 | 1 | Admit | Hospital1
2/6/2017 | 2 | Admit | Hospital2
2/15/2017 | 1 | Admit | Hospital2
2/28/2017 | 3 | Admit | Hospital2
3/3/2017 | 2 | Admit | Hospital1
3/15/2017 | 3 | Admit | Hospital3
3/20/2017 | 4 | Admit | Hospital1
4/1/2017 | 1 | Admit | Hospital2
Output for the month of March for Hospital1:
DateCreated| UniquePerson | Type | Location | +-30days | OtherLoc.
------------------------------------------------------------------------
3/3/2017 | 2 | Admit| Hospital1 | 2/6/2017 | Hospital2
Output for the month of March for Hospital2:
None, because no one was seen at Hospital2 in March
Output for the month of March for Hospital3:
DateCreated| UniquePerson | Type | Location | +-30days | otherLoc.
------------------------------------------------------------------------
3/15/2017 | 3 | Admit| Hospital3 | 2/28/2017 | Hospital2
Version 1
I would use a WITH clause. Please notice that I've added a column id that is a primary key, to simplify the query. It's just to prevent the rows from being matched with themselves.
WITH x AS (
    SELECT
        id,
        date_created,
        unique_person_id,
        type,
        location
    FROM a
    WHERE
        location = 'Hospital1' AND
        date_trunc('month', date_created) = date_trunc('month', '2017-03-01'::date)
)
SELECT
    x.date_created,
    x.unique_person_id,
    x.type,
    x.location,
    a.date_created AS "+-30days",
    a.location AS other_location
FROM x
JOIN a USING (unique_person_id, type)
WHERE
    x.id != a.id AND
    abs(x.date_created - a.date_created) <= 30;
Now a little bit of explanation:
First we select, let's say, the reference data with a WITH clause. Think of it as a temporary table that we can reference in the main query. In the given example it could be a "main visit" in the given hospital.
Then we join the "main visits" with other visits of the same person and type (JOIN condition) that happen within a date difference of 30 days (WHERE condition).
Notice that the WITH query contains the limits you want to check (location and date). I use the date_trunc function, which truncates the date to the specified precision (a month in this case).
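For instance, date_trunc('month', ...) maps any day in March to the first of March:

SELECT date_trunc('month', date '2017-03-15');   -- 2017-03-01 00:00:00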
Version 2
As @Laurenz Albe suggested, there is no special need to use a WITH clause. Right, so here is a second version.
SELECT
    x.date_created,
    x.unique_person_id,
    x.type,
    x.location,
    a.date_created AS "+-30days",
    a.location AS other_location
FROM a AS x
JOIN a USING (unique_person_id, type)
WHERE
    x.location = 'Hospital1' AND
    date_trunc('month', x.date_created) = date_trunc('month', '2017-03-01'::date) AND
    x.id != a.id AND
    abs(x.date_created - a.date_created) <= 30;
This version is shorter than the first one but, in my opinion, the first is easier to understand. I don't have a big enough set of data to test, and I wonder which one runs faster (the query planner shows similar values for both).

Table size with page layout

I'm using PostgreSQL 9.2 on Oracle Linux Server release 6.3.
According to the storage layout documentation, a page layout holds:
PageHeaderData (24 bytes)
n item pointers (to index items / table items), AKA ItemIdData (4 bytes each)
free space
n items
special space
I tested it to try to come up with a formula that estimates the anticipated table size... (TOAST can be ignored here.)
postgres=# \d t1;
             Table "public.t1"
  Column   |          Type          |          Modifiers
-----------+------------------------+------------------------------
 code      | character varying(8)   | not null
 name      | character varying(100) | not null
 act_yn    | character(1)           | not null default 'N'::bpchar
 desc      | character varying(100) | not null
 org_code1 | character varying(3)   |
 org_cole2 | character varying(10)  |
postgres=# insert into t1 values(
'11111111',   -- 8 characters
'1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111',   -- 100 characters
'Y',
'1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111',   -- 100 characters
'111',
'1111111111');
postgres=# select * from pgstattuple('t1');
table_len | tuple_count | tuple_len | tuple_percent | dead_tuple_count | dead_tuple_len | dead_tuple_percent | free_space | free_percent
-----------+-------------+-----------+---------------+------------------+----------------+--------------------+------------+--------------
8192 | 1 | 252 | 3.08 | 1 | 252 | 3.08 | 7644 | 93.31
(1 row)
Why is tuple_len 252 instead of 249? (That is, "222 bytes for all columns' maximum lengths" PLUS "27 bytes of tuple header followed by an optional null bitmap, an optional object ID field, and the user data".)
Where do the 3 extra bytes come from?
Is there something wrong with my formula?
Your calculation is off at several points.
Storage size of varchar, text (and character!) is, quoting the manual:
The storage requirement for a short string (up to 126 bytes) is 1 byte
plus the actual string, which includes the space padding in the case
of character. Longer strings have 4 bytes of overhead instead of 1.
Long strings are compressed by the system automatically, so the
physical requirement on disk might be less.
Bold emphasis mine, to address the question in the comment.
The HeapTupleHeader occupies 23 bytes. But each tuple ("item" - row or index entry) also has an item identifier at the start of the data page pointing to it, which adds up to the mentioned 27 bytes. The distinction is relevant because actual user data begins at a multiple of MAXALIGN from the start of each item, and the item identifier counts against neither this offset nor the actual "tuple size".
1 byte of padding due to data alignment (multiple of 8), which is used for the NULL bitmap in this case.
No padding for type varchar (but the additional byte mentioned above)
So, the actual calculation (with all columns filled to the maximum) is:
  23  -- heap tuple header
+  1  -- NULL bitmap (or padding if row has NO null values)
+  9  -- code      varchar(8):   8 + 1-byte header
+101  -- name      varchar(100): 100 + 1
+  2  -- act_yn    character(1): 1 + 1
+101  -- desc      varchar(100): 100 + 1
+  4  -- org_code1 varchar(3):   3 + 1
+ 11  -- org_cole2 varchar(10):  10 + 1
------
 252 bytes
+  4  -- item identifier at page start
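As a cross-check (a sketch of my own, assuming you can install the pageinspect extension), you can look at the item identifiers and the tuple header size directly:

CREATE EXTENSION IF NOT EXISTS pageinspect;

-- item pointers on the first page of t1: lp_len is the stored tuple length
-- (252 here, matching pgstattuple), and t_hoff is the aligned header size
-- (23-byte header + 1 byte for the NULL bitmap/padding = 24)
SELECT lp, lp_off, lp_len, t_hoff
FROM heap_page_items(get_raw_page('t1', 0));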
Related:
Does not using NULL in PostgreSQL still use a NULL bitmap in the header?
Calculating and saving space in PostgreSQL

Check if field value is in a list of strings in SSRS report

I'm using SSRS (VS2008) and creating a report of work orders. In the detail line of the report table, I have the following columns (with some fake data)
WONUM | A | B | Hours
ABC123 | 3 | 0 | 3
SPECIAL| 0 | 6 | 6
DEF456 | 5 | 0 | 5
GHI789 | 4 | 0 | 4
OTHER | 0 | 2 | 2
As you can kind of see, all work orders have a work order number (WONUM) as well as a total # of hours (HOURS). I need to put the hours into either column A or column B based on WONUM. I have a list of specifically named work orders (in the example, they would be "SPECIAL" and "OTHER") which would cause the HOURS value to be put in column B. If the WONUM is NOT a special named one, then it goes in column A. Here's what I WANTED to put as the expression for column A and column B:
Column A: =IIF(Fields!WONUM.Value IN ("SPECIAL","OTHER"), 0, Fields!Hours.Value)
Column B: =IIF(Fields!WONUM.Value IN ("SPECIAL","OTHER"), Fields!Hours.Value, 0)
But as you're probably aware, Fields!WONUM.Value IN ("SPECIAL","OTHER") is not a valid method of doing this! What is the best way to make this work? I cannot flag it in the SQL query in any other way for other reasons so it must be done in the table.
Thanks in advance for any and all help!
Try this, using the InStr() function (the first expression is for column A, the second for column B):
=IIF(InStr(Fields!WONUM.Value,"SPECIAL")>0 OR InStr(Fields!WONUM.Value,"OTHER")>0, 0, Fields!Hours.Value)
=IIF(InStr(Fields!WONUM.Value,"SPECIAL")>0 OR InStr(Fields!WONUM.Value,"OTHER")>0, Fields!Hours.Value, 0)
If it's just the two WONUMs then you can do this:
Column A:
=IIF((Fields!WONUM.Value <> "SPECIAL") AND (Fields!WONUM.Value <> "OTHER"), Fields!Hours.Value, 0)
Column B:
=IIF((Fields!WONUM.Value = "SPECIAL") OR (Fields!WONUM.Value = "OTHER"), Fields!Hours.Value, 0)
or use the same formula in each column for consistency and swap the field/0 at the end.