Can I use "char" as a replacement for Enum in PostgreSQL? - postgresql

I have an ever-growing table (~20 million rows per year).
The table has a status column, which is an enum type with 10 states like:
ENQUEUED, DELIVERED, FAILED, ...
Since the Postgres enum type takes 4 bytes of space, I decided to alter the column to something that needs less space, like smallint, character(1), or "char".
enum takes 4 bytes
smallint takes 2 bytes
character(1) takes 2 bytes
"char" takes 1 byte
I want to use "char" with these values:
E, D, F, ...
because "char" takes less space than smallint or character(1).
Is it a good idea to use "char" as a space-saving replacement for enum?
Any ideas would be greatly appreciated.

Related

Postgres large numeric value operations

I am trying some operations on a large numeric value, such as 2^89.
The Postgres numeric data type can store up to 131072 digits to the left of the decimal point and 16383 digits to the right.
I tried some thing like this and it worked:
select 0.037037037037037037037037037037037037037037037037037037037037037037037037037037037037037037037037037::numeric;
But when I put some operator, it rounds off values after 14 digits.
select (2^89)::numeric(40,0);
numeric
-----------------------------
618970019642690000000000000
(1 row)
I know the value from elsewhere is:
>>> 2**89
618970019642690137449562112
Why this strange behavior? It is not letting me enter values beyond 14 digits of precision into the database.
insert into x select (2^89-1)::numeric;
select * from x;
x
-----------------------------
618970019642690000000000000
(1 row)
Is there any way to circumvent this?
Thanks in advance.
bb23850
You should not cast the result, but one operand of the operation, to make clear that this is a numeric operation and not a floating-point one:
select (2^89::numeric);
Otherwise PostgreSQL resolves ^ for the two plain number literals to its double precision variant. In that case the result is double precision, too, which carries only about 15 significant decimal digits and so is not exact at that size. Your cast is a cast of that already-inaccurate result, so it cannot work.
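The precision loss can be reproduced outside the database. This is a minimal sketch in Python (used only as a stand-in: the same IEEE 754 double precision format underlies PostgreSQL's float8), showing that an odd 89-bit integer cannot survive a round trip through a double:

```python
# 2**89 - 1 needs 89 significant bits, but an IEEE 754 double
# (PostgreSQL's double precision / float8) has only a 53-bit mantissa.
exact = 2**89 - 1          # Python ints are arbitrary precision

# Converting to float rounds to the nearest representable double,
# which happens to be exactly 2**89 here.
rounded = float(exact)

print(exact)               # 618970019642690137449562111
print(int(rounded))        # 618970019642690137449562112
print(rounded == 2**89)    # True: the low bits were lost in the float
```

This is the same effect as in the question: once the value has passed through a double, no cast to numeric can bring the lost digits back.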

Convert a varchar column to integer in Redshift

Is there a way in Amazon Redshift to convert a varchar column (with values such as A, B, D, M) to integer (1 for A, 2 for B, 3 for C... and so on)? I know Teradata has something like ASCII(), but that doesn't work in Redshift.
Note: My goal is to convert the varchar columns to numbers in my query and compare those two columns to see if the numbers are the same or different.
demo:db<>fiddle
Postgres:
SELECT
ascii(upper(t.letter)) - 64
FROM
table t
Explanation:
upper() converts the input to capital letters (to handle the different ASCII values of capital and lower-case letters)
ascii() converts the letters to their ASCII codes. The capital letters begin at number 65.
subtracting 64 shifts the result from the ASCII starting point of 65 down to 1
Redshift:
The ascii() function is marked as deprecated on Redshift (https://docs.aws.amazon.com/redshift/latest/dg/c_SQL_functions_leader_node_only.html)
So one possible (and more pragmatic) solution is to take a fixed alphabet string and return the index of a given letter:
SELECT
letter,
strpos('ABCDEFGHIJKLMNOPQRSTUVWXYZ', upper(t.letter))
FROM
table t
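Both mappings are easy to sanity-check outside the warehouse. Here is a minimal Python sketch of the two approaches (the function names are made up for illustration; strpos in SQL is 1-based and returns 0 when the letter is absent, which `find() + 1` mirrors):

```python
ALPHABET = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

def letter_to_number_ascii(letter: str) -> int:
    # Mirrors ascii(upper(letter)) - 64: 'A' (ASCII 65) -> 1, 'Z' -> 26.
    return ord(letter.upper()) - 64

def letter_to_number_strpos(letter: str) -> int:
    # Mirrors strpos('ABC...', upper(letter)): 1-based index, 0 if absent.
    return ALPHABET.find(letter.upper()) + 1

print(letter_to_number_ascii('a'))    # 1
print(letter_to_number_strpos('M'))   # 13
print(letter_to_number_strpos('?'))   # 0  (not found, like strpos)
```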

Alphanumeric Sorting in PostgreSQL

I have this table with a character varying column in Postgres 9.6:
id | column
------------
1 |IR ABC-1
2 |IR ABC-2
3 |IR ABC-10
I have seen some solutions that typecast the column as bytea:
select * from table order by column::bytea;
But it always results to:
id | column
------------
1 |IR ABC-1
2 |IR ABC-10
3 |IR ABC-2
I don't know why '10' always comes before '2'. How do I sort this table, assuming the basis for ordering is the last whole number in the string, regardless of the characters before that number?
When sorting character data types, collation rules apply - unless you work with locale "C", which sorts characters by their byte values. Applying collation rules may or may not be desirable. It makes sorting more expensive in any case. If you want to sort without collation rules, don't cast to bytea; use COLLATE "C" instead:
SELECT * FROM table ORDER BY column COLLATE "C";
However, this does not yet solve the problem with numbers in the string you mention. Split the string and sort the numeric part as number.
SELECT *
FROM table
ORDER BY split_part(column, '-', 2)::numeric;
Or, if all your numbers fit into bigint or even integer, use that instead (cheaper).
I ignored the leading part because you write:
... the basis for ordering is the last whole number of the string, regardless of what the character before that number is.
Related:
Alphanumeric sorting with PostgreSQL
Split comma separated column data into additional columns
What is the impact of LC_CTYPE on a PostgreSQL database?
Typically, it's best to save distinct parts of a string in separate columns as proper respective data types to avoid any such confusion.
And if the leading string is identical for all columns, consider just dropping the redundant noise. You can always use a VIEW to prepend a string for display, or do it on-the-fly, cheaply.
As mentioned in the comments, split the string and cast the integer part:
select *
from table
cross join lateral regexp_split_to_array(column, '-') AS r(a)
order by a[1], a[2]::integer;
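The split-and-cast ordering can be sketched in Python to see why it fixes the '10'-before-'2' problem: sort by the text before the last hyphen, then by the numeric value after it (a sketch of the same logic, not the SQL itself; `natural_key` is a made-up helper name):

```python
rows = ['IR ABC-1', 'IR ABC-2', 'IR ABC-10']

def natural_key(value: str):
    # Split on the last '-' so the trailing number sorts numerically,
    # mirroring split_part(column, '-', 2)::numeric in the answer.
    prefix, _, number = value.rpartition('-')
    return (prefix, int(number))

print(sorted(rows))                   # lexicographic: '-10' before '-2'
print(sorted(rows, key=natural_key))  # numeric: '-2' before '-10'
```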

T-SQL surprising DATALENGTH values of char and nchar

SELECT DATALENGTH('źźźź') -- 4
SELECT DATALENGTH(CONVERT(char, 'źźźź')) -- 30
SELECT DATALENGTH(CONVERT(nchar, 'źźźź')) -- 60
SELECT DATALENGTH(CONVERT(varchar, 'źźźź')) -- 4
SELECT DATALENGTH(CONVERT(nvarchar, 'źźźź')) -- 8
I know that char is a non-Unicode type, but nchar actually IS a Unicode type.
Yes - and what's your question?
If you don't define a length in a CAST or CONVERT, then 30 characters is the system default.
So this
SELECT DATALENGTH(CONVERT(char, 'źźźź'))
is equivalent to
SELECT DATALENGTH(CONVERT(char(30), 'źźźź'))
and since the CHAR (and NCHAR) datatypes are always padded to their defined length, you get 30 characters and thus 30 (char) and 60 (nchar) bytes length.
All perfectly clear and well documented - see the MSDN documentation on CAST and CONVERT:
length
Is an optional integer that specifies the length of the target data type. The default value is 30.
When you use the variable-length types varchar or nvarchar instead, only as many characters as the string really contains are stored - therefore you get 4 characters, and thus 4 and 8 bytes of length.
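The byte counts follow directly from padding and encoding width, which can be illustrated with a Python analogy (an analogy only: `ljust` stands in for CHAR's space padding, UTF-16-LE for NCHAR/NVARCHAR's two-bytes-per-character storage, and ISO-8859-2 for a single-byte code page that contains ź):

```python
s = 'źźźź'

# char(30): padded with spaces to 30 single-byte characters.
print(len(s.ljust(30)))                        # 30 characters
# nchar(30): 30 characters at 2 bytes each in UTF-16.
print(len(s.ljust(30).encode('utf-16-le')))    # 60 bytes
# varchar: only the 4 characters actually present (1 byte each here).
print(len(s.encode('iso-8859-2')))             # 4 bytes
# nvarchar: 4 characters at 2 bytes each.
print(len(s.encode('utf-16-le')))              # 8 bytes
```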

Is it possible to store a 1 byte number in Postgres?

I have 7 8-bit integer values per record that I want to store in Postgres. Pg doesn't offer a single-byte integer type; smallint, at 2 bytes, is the smallest integer datatype. Is there any way I can store my 7 8-bit numbers and save on space?
Would an array type with a 7 element array be more compact? Or, should I make a binary representation of my 7 numbers (for example, using pack in Perl) and store that in a single bytea field?
Any other suggestions?
Given that the overhead for any row in PostgreSQL is 23 bytes (HeapTupleHeaderData), if you really care about small amounts of space this much you've probably picked the wrong way to store your data.
Regardless, since all the more complicated types have their own overhead (bytea adds four bytes of overhead, for example; bit strings 5 to 8), the only way to accomplish what you're looking for is to use a bigint (8 bytes), numerically shifting each value and OR-ing together the result. You can do this using the bit string operations to make the code easier (compute as a bit string, then cast to bigint before storing) or just manually multiply/add if you want better performance. For example, here's how you store two bytes together in a two-byte value and then get them back again:
int2 = 256 * byte1 + byte2
byte1 = int2 / 256
byte2 = int2 % 256
You can extend the same idea into storing 7 of them that way. Retrieval overhead is still going to be terrible, but you will have actually saved some space in the process. But not very much relative to just the row header.
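A sketch of that shift-and-OR packing extended to all seven values, in Python for the sake of a runnable illustration (`pack7`/`unpack7` are made-up names; seven bytes need only 56 bits, so the packed value fits comfortably in a signed bigint):

```python
def pack7(bytes7):
    # Fold seven values in [0, 255] into one integer: each step
    # shifts the accumulator left by 8 bits and ORs in the next byte.
    n = 0
    for b in bytes7:
        n = (n << 8) | b
    return n

def unpack7(n):
    # Reverse: peel off the low 8 bits seven times, then restore order.
    out = [(n >> (8 * i)) & 0xFF for i in range(7)]
    return list(reversed(out))

values = [7, 0, 255, 128, 1, 42, 200]
packed = pack7(values)
print(packed < 2**63)             # True: fits in a signed bigint
print(unpack7(packed) == values)  # True: round-trips exactly
```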
"char"
This is a one-byte type in PostgreSQL that holds values in the range [-128, 127]. From the docs:
The type "char" (note the quotes) is different from char(1) in that it only uses one byte of storage. It is internally used in the system catalogs as a simplistic enumeration type.
You can map inputs in the range [0, 255] into the storable range [-128, 127] by subtracting 128 before you write to the database, and adding 128 back when you read.
-- works
SELECT (-128)::"char", 127::"char";
-- generates out of range
SELECT (-129)::"char";
SELECT 128::"char";
-- Shifts to unsigned range.
-- If you're going to be using "char"
-- review the results of this query!
SELECT
x::int AS "inputUnsigned",
chr(x) AS "extendedASCII",
-- this is the "char" types representation for that input.
signed::"char" AS "charRepresentation",
signed AS "inputUnsignedToSigned",
signed+128 AS "inputUnsignedToSignedToUnsigned"
FROM generate_series(1,255) AS gs(x)
-- Here we map the input in the range of [0,255] to [-128,127]
CROSS JOIN LATERAL ( VALUES (x::int-128) )
AS v(signed);
Small excerpt of the output
inputUnsigned | extendedASCII | charRepresentation | inputUnsignedToSigned | inputUnsignedToSignedToUnsigned
---------------+---------------+--------------------+-----------------------+---------------------------------
....
190 | ¾ | > | 62 | 190
191 | ¿ | ? | 63 | 191
192 | À | # | 64 | 192
193 | Á | A | 65 | 193
194 | Â | B | 66 | 194
195 | Ã | C | 67 | 195
196 | Ä | D | 68 | 196
...
We use generate_series(1,255) rather than starting at 0 because chr(0) throws an error: you can't generate or output the ASCII NUL character (PostgreSQL uses NUL-terminated C strings internally).
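The bias trick round-trips cleanly, which is worth verifying before relying on it. A minimal Python sketch of the write-side subtraction and read-side addition (`to_signed`/`to_unsigned` are made-up helper names):

```python
def to_signed(u):
    # Write side: map unsigned [0, 255] into "char"'s storable [-128, 127].
    assert 0 <= u <= 255
    return u - 128

def to_unsigned(s):
    # Read side: undo the bias.
    return s + 128

# Every value in [0, 255] maps into the storable range and round-trips.
stored = [to_signed(u) for u in range(256)]
print(min(stored), max(stored))                                 # -128 127
print(all(to_unsigned(to_signed(u)) == u for u in range(256)))  # True
```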
pguint Extension
Pguint is an extension that provides two one byte representations,
int1 (signed)
uint1 (unsigned)
There is the pg_catalog."char" type (usually written with the quotes, as "char"), which uses only 1 byte to store its value:
select pg_column_size( 'A' );
pg_column_size
----------------
2
(1 row)
select pg_column_size( 'A'::"char" );
pg_column_size
----------------
1
(1 row)
Will you ever look up records using these values?
If yes - use normal datatypes like int4 (or even int8 if you're on a 64-bit architecture).
If not - first ask yourself: what is the point of storing these values in Pg? You can use bytea (complicated I/O) or bit strings (even more complicated I/O), but what is the point? How many billions of records are you going to have? Did you actually check that a smaller datatype uses less space (hint: it doesn't - check it; there are data alignment issues involved)? Are you working under the impression that a smaller datatype is faster? (It isn't. It's actually more complex to compare two int2 values than two int4 values on a 32-bit architecture.)
You will want to look into the bytea data type, referenced here: http://www.postgresql.org/docs/8.4/interactive/datatype-binary.html
There are also bit string data types: http://www.postgresql.org/docs/8.4/interactive/datatype-bit.html
You first asked about 7 bytes, but now 6. Six 8-bit values correspond exactly to the size of a MAC address and PostgreSQL's built-in macaddr type. You can insert those bytes using MAC syntax, e.g. A1-B2-C3-D4-E5-F6.
I haven't tested them myself, but there are extensions for this, e.g. http://pgxn.org/dist/tinyint/.