Why does PostgreSQL consider NULL boundaries in range types to be distinct from infinite boundaries? - postgresql

Just to preface, I'm not asking what the difference is between a NULL boundary and an infinite boundary - that's covered in this other question. Rather, I'm asking why PostgreSQL makes a distinction between NULL and infinite boundaries when (as far as I can tell) they function exactly the same.
I started using PostgreSQL's range types recently, and I'm a bit confused by what NULL values in range types are supposed to mean. The documentation says:
The lower bound of a range can be omitted, meaning that all values less than the upper bound are included in the range, e.g., (,3]. Likewise, if the upper bound of the range is omitted, then all values greater than the lower bound are included in the range. If both lower and upper bounds are omitted, all values of the element type are considered to be in the range.
This suggests to me that omitted boundaries in a range (the equivalent of passing NULL boundaries to a range type's constructor) should be considered infinite. However, PostgreSQL makes a distinction between NULL boundaries and infinite boundaries. The documentation continues:
You can think of these missing values [in a range] as +/-infinity, but they are special range type values and are considered to be beyond any range element type's +/-infinity values.
This is puzzling. "beyond infinity" doesn't make sense, as the entire point of infinite values is that nothing can be greater than +infinity or less than -infinity. That doesn't break "element in range"-type checks, but it does introduce an interesting case for primary keys that I think most people wouldn't expect. Or at least, I didn't expect it.
Suppose we create a basic table whose sole field is a daterange, which is also the PK:
CREATE TABLE public.range_test
(
    id daterange NOT NULL,
    PRIMARY KEY (id)
);
Then we can populate it with the following data with no problem:
INSERT INTO range_test VALUES (daterange('-infinity','2021-05-21','[]'));
INSERT INTO range_test VALUES (daterange(NULL,'2021-05-21','[]'));
Selecting all the data reveals we have these two tuples:
[-infinity,2021-05-22)
(,2021-05-22)
So the two tuples are distinct, or there would have been a primary key violation. But again, NULL boundaries and infinite boundaries work exactly the same when we're dealing with the actual elements that make up the range. For example, there is no date value X for which X <@ '[-infinity,2021-05-22)'::daterange and X <@ '(,2021-05-22)'::daterange return different results. This makes sense because NULL values can't have a type of date, so they can't even be compared to the range (and PostgreSQL even converted the inclusive boundary on the lower NULL bound in daterange(NULL,'2021-05-21','[]') to an exclusive boundary, (,2021-05-22) to be doubly sure). But why are two ranges that are identical in every practical way considered distinct?
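To make that concrete, here is a quick check (any sample date works) showing that containment agrees for both ranges even though the ranges themselves compare as unequal:

SELECT '2020-01-01'::date <@ '[-infinity,2021-05-22)'::daterange AS in_inf_range,        -- t
       '2020-01-01'::date <@ '(,2021-05-22)'::daterange          AS in_null_range,       -- t
       '[-infinity,2021-05-22)'::daterange = '(,2021-05-22)'::daterange AS ranges_equal; -- f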
When I was still in school, I remember overhearing some discussion about the difference between "unknown" and "doesn't exist" - two people who were smarter than me were talking about that in the context of why NULL values often cause issues, and that replacing the singular NULL with separate "unknown" and "doesn't exist" values might solve those issues, but the discussion was over my head at the time. Thinking about this weird feature made me think of that discussion. So is the distinction between "unknown" and "doesn't exist" the reason why PostgreSQL treats NULL and +/-infinity as distinct? If so, why are ranges the only types that allow for that distinction in PostgreSQL? And if not, why does PostgreSQL treat functionally-equivalent values as distinct?

Rather, I'm asking why PostgreSQL makes a distinction between NULL and infinite boundaries when (as far as I can tell) they function exactly the same.
But they do not. NULL is a syntactical convenience when used as the bound of a range, while -infinity / infinity are actual values in the domain of the range's element type. Abstract values meaning lesser / greater than any other value, but values nonetheless (which can be included or excluded).
Also, NULL works for any range type, while most data types don't have special values like -infinity / infinity. Take integer and int4range for example.
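A quick illustration (not from the original thread) of that asymmetry:

SELECT int4range(NULL, 10);   -- (,10): an unbounded (NULL) bound works for any range type
SELECT '-infinity'::date;     -- works: date has real infinite values ...
SELECT 'infinity'::integer;   -- ... but integer does not: ERROR: invalid input syntax for type integer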
For a better understanding, consider the thread in pgsql-general that a_horse provided:
https://www.postgresql.org/message-id/flat/OrigoEmail.bf5.ac6ff6ffeb116aec.13fc29939e0%40prod2#c9fabdc670211364636b733a79a04712
This makes sense because NULL values can't have a type of date, so they can't even be compared to the range
Every data type can be NULL, even domains that are explicitly NOT NULL. See:
Why does PostgreSQL allow NULLs in domains that prohibit NULL?
That includes date, of course (like Adrian commented):
test=> SELECT NULL::date, pg_typeof(NULL::date);
 date | pg_typeof
------+-----------
      | date
(1 row)
But trying to discuss NULL as a value (when used as the bound of a range) is a misleading approach to begin with. It's not a value.
... (and PostgreSQL even converted the inclusive boundary on the lower NULL bound in daterange(NULL,'2021-05-21','[]') to an exclusive boundary, (,2021-05-22) to be doubly sure).
Again, NULL is not treated as a value in the domain of the range. It just serves as convenient syntax to say: "unbounded". No more than that.
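To see the server itself tell the two cases apart, compare lower() and lower_inf() for the rows from the question (a small check, not part of the original answer):

SELECT id, lower(id), lower_inf(id) FROM range_test;
--            id           |   lower   | lower_inf
-- ------------------------+-----------+-----------
--  [-infinity,2021-05-22) | -infinity | f           <- a real date value as bound
--  (,2021-05-22)          |           | t           <- omitted bound: lower() is NULL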

Related

Right way to use data type 'F' in SELECT-OPTIONS?

I want to have a SELECT-OPTIONS field in ABAP with the data type FLTP, which is basically a float. But this is not possible using SELECT-OPTIONS.
I tried to use PARAMETERS instead, which solved this issue. But now, of course, I get no results when using this parameter value in the WHERE clause when selecting.
So on the one hand I can't use data type 'F', but on the other hand I get no results. Is there any way out of this dilemma?
Checking floating point values for exact equality is a bad idea. It works in some edge-cases (like 0), but often it does not work. The reason is that not every value the user can express in decimal notation can also be expressed as a floating point value. So the values get rounded internally and now you get inequality where you would expect equality. Check the website "What Every Programmer Should Know About Floating-Point Arithmetic" for more information on this phenomenon.
So offering a SELECT-OPTION or a single PARAMETER to SELECT floating point values out of a table might be a bad idea.
What I would recommend instead is to have the user state a range between two values, with both fields obligatory:
PARAMETERS:
  p_from TYPE f OBLIGATORY,
  p_to   TYPE f OBLIGATORY.

SELECT somdata
  FROM table
  WHERE floatfield >= @p_from AND floatfield <= @p_to
  INTO TABLE @DATA(result).
But another solution you might want to consider is whether float is really the appropriate data type for your situation. When the table is a Z-table, you might want to consider changing the type of that field to a packed number or one of the decfloat flavors, as those will cause you far fewer surprises.

PostgreSQL queries treat ints as string datatypes

I store the following rows in my table ('DataScreen') under a JSONB column ('Results')
{"Id":11,"Product":"Google Chrome","Handle":3091,"Description":"Google Chrome"}
{"Id":111,"Product":"Microsoft Sql","Handle":3092,"Description":"Microsoft Sql"}
{"Id":22,"Product":"Microsoft OneNote","Handle":3093,"Description":"Microsoft OneNote"}
{"Id":222,"Product":"Microsoft OneDrive","Handle":3094,"Description":"Microsoft OneDrive"}
Here, in these JSON objects, "Id" and "Handle" are integer properties, while the others are string properties.
When I query my table like below
Select Results->>'Id' From DataScreen
order by Results->>'Id' ASC
I get incorrect results because PostgreSQL treats everything as text and hence orders the values as text, not as integers.
Hence it gives the result as
11,111,22,222
instead of
11,22,111,222.
I don't want to use explicit casting to retrieve them, like below:
Select Results->>'Id' From DataScreen order by CAST(Results->>'Id' AS INT) ASC
because I will not be sure of the datatype of the column, since the JSON structure is dynamic and the keys and values may change next time; the same could then happen with another JSON that has integer and string keys.
I want integers in the JSON structure of the JSONB column to be treated as integers only, not as text (strings).
How do I write my query so that Id and Handle are retrieved as integer values and not as strings, without explicit casting?
I think your assumptions about the Id field don't make sense. You said:
(a) Either id contains integers only or
(b) it contains strings and integers.
I'd say,
If (a) then numerical ordering is correct.
If (b) then lexical ordering is correct.
But if (a) for some time and then (b) then the correct order changes, too. And that doesn't make sense. Imagine:
For the current database you expect the order 11,22,111,222. Then you add a row
{"Id":"aa","Product":"Microsoft OneDrive","Handle":3095,"Description":"Microsoft OneDrive"}
and suddenly the correct order of the other rows changes to 11,111,22,222,aa. That sudden change is what bothers me.
So I would either expect a lexical ordering ab initio, or restrict my Id field to integers and use explicit casting.
Every other option I can think of is just not practical. You could, for example, create a custom < and > implementation for your id field which results in 11,111,22,222,aa. ("Order all integers by numerical value and all strings by lexical order and put all integers before the strings").
But that is a lot of work (it involves a custom data type, a custom cast function and a custom operator function) and yields some counterintuitive results, e.g. 11,111,22,222,0a,1a,2a,aa (note the position of 0a and so on. They come after 222).
Hope that helps ;)
If Id is always an integer, you can cast it in the select list and just use ORDER BY 1:
select (Results->>'Id')::int From DataScreen order by 1 ASC
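Another option worth testing, assuming Id is stored as a JSON number in every row (as in the sample data): order by the jsonb value itself with the single-arrow operator. jsonb comparison sorts numbers numerically (and sorts all strings before all numbers), so no cast of the text form is needed:

SELECT Results->>'Id'
FROM DataScreen
ORDER BY Results->'Id' ASC;  -- -> keeps the jsonb type, so numbers compare as numbers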

Is there a MAX_INT constant in Postgres?

In Java I can say Integer.MAX_VALUE to get the largest number that the int type can hold.
Is there a similar constant/function in Postgres? I'd like to avoid hard-coding the number.
Edit: the reason I am asking is this. There is a legacy table with an ID of type integer, backed by a sequence. There are a lot of rows coming into this table. I want to calculate how much time is left before the integer runs out, so I need to know "how many IDs are left" divided by "how fast we are spending them".
There's no constant for this, but I think it's more reasonable to hard-code the number in Postgres than it is in Java.
In Java, the philosophical goal is for Integer to be an abstract value, so it makes sense that you'd want to behave as if you don't know what the max value is.
In Postgres, you're much closer to the bare metal and the definition of the integer type is that it is a 4-byte signed integer.
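If you do hard-code it, a quick sanity check of the 4-byte bound (just an illustration; there is no built-in constant involved):

SELECT 2147483647::int;   -- the largest value a 4-byte signed integer can hold
SELECT 2147483648::int;   -- ERROR:  integer out of range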
There is a legacy table with an ID of type integer, backed by a sequence.
In that case, you can get the max value of the sequence by:
select seqmax from pg_sequence where seqrelid = 'your_sequence_name'::regclass;
This might be better than using MAX_INT, because the sequence may have been created or altered with a specific max value that is different from MAX_INT.
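For the "how many IDs are left" arithmetic, here is a rough sketch against the pg_sequences view (PostgreSQL 10+); the schema and sequence name are placeholders you would need to adjust:

SELECT max_value - COALESCE(last_value, 0) AS ids_left
FROM pg_sequences
WHERE schemaname = 'public'
  AND sequencename = 'your_sequence_name';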

Change varchar to boolean in PostgreSQL

I've started working on a project where there is a fairly large table (about 82,000,000 rows) that I think is very bloated. One of the fields is defined as:
consistency character varying NOT NULL DEFAULT 'Y'::character varying
It's used as a boolean; the values should always be either 'Y' or 'N'.
Note: there is no check constraint, etc.
I'm trying to come up with reasons to justify changing this field. Here is what I have:
It's being used as a boolean, so make it that. Explicit is better than implicit.
It will protect against coding errors, because right now anything that can be converted to text will go in there blindly.
Here are my question(s).
What about size/storage? The db is UTF-8. So, I think there really isn't much of a savings in that regard. It should be 1 byte for a boolean, but also 1 byte for a 'Y' in UTF-8 (at least that's what I get when I check the length in Python). Is there any other storage overhead here that would be saved?
Query performance? Will Postgres get any performance gains for a WHERE clause of "=TRUE" vs. "='Y'"?
PostgreSQL (unlike Oracle) has a fully-fledged boolean type. Generally, a "yes/no flag" should be boolean. That's the appropriate type!
What about size/storage?
A boolean column occupies 1 byte on disk.
The manual says about text or character varying:
the storage requirement for a short string (up to 126 bytes) is 1 byte
plus the actual string
That's at least 2 bytes for a single character.
Actual storage is more complicated than that. There is some fixed overhead per table, page, and row; there is special NULL storage; and some types require data alignment. See:
How many records can I store in 5 MB of PostgreSQL on Heroku?
Encoding UTF8 doesn't make any difference here. Basic ASCII-characters are bit-compatible with other encodings like LATIN-1.
In your case, according to your description, you should keep the NOT NULL constraint you already have - independent of the data type.
Query performance?
Will be slightly better in any case with boolean. Besides being smaller, the logic for boolean is simpler and varchar or text are also generally burdened with COLLATION rules. But don't expect much for something that simple.
Instead of:
WHERE consistency = 'Y'
You could write:
WHERE consistency = true
But rather simplify to just:
WHERE consistency
No further evaluation needed.
Change type
Transforming your table is simple:
ALTER TABLE tbl ALTER consistency TYPE boolean
USING CASE consistency WHEN 'Y' THEN true ELSE false END;
This CASE expression folds everything that is not TRUE ('Y') to FALSE. The NOT NULL constraint just stays.
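One caveat, assuming the column keeps its DEFAULT 'Y' from the question: the USING expression is not applied to the default, so the type change can fail with "default for column ... cannot be cast automatically to type boolean". A sketch of the workaround is to drop the default first and re-create it afterwards:

ALTER TABLE tbl ALTER consistency DROP DEFAULT;
ALTER TABLE tbl ALTER consistency TYPE boolean
USING CASE consistency WHEN 'Y' THEN true ELSE false END;
ALTER TABLE tbl ALTER consistency SET DEFAULT true;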
Neither storage size nor query performance will be significantly better when switching from a single-character VARCHAR to a BOOLEAN. Although you are right that it's technically cleaner to use boolean when you are talking about a binary value, the cost to change is probably significantly higher than the benefit. If you're worried about correctness, then you could put a check constraint on the column, for example:
ALTER TABLE tablename ADD CONSTRAINT consistency CHECK (consistency IN ('Y', 'N'));
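One practical note for a table of roughly 82 million rows: adding that constraint as written checks every existing row while holding a lock on the table. If that is a concern, a sketch of the two-step variant (same names as above) is:

ALTER TABLE tablename ADD CONSTRAINT consistency CHECK (consistency IN ('Y', 'N')) NOT VALID;
-- Validate later; this still scans the table but takes a weaker lock,
-- so normal reads and writes can continue in the meantime.
ALTER TABLE tablename VALIDATE CONSTRAINT consistency;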

How to query Cassandra by date range

I have a Cassandra ColumnFamily (0.6.4) that will have new entries from users. I'd like to query Cassandra for those new entries so that I can process that data in another system.
My sense was that I could use a TimeUUIDType as the key for my entry, and then query on a KeyRange that starts either with "" as the startKey, or whatever the lastStartKey was. Is this the correct method?
How does get_range_slice actually create a range? Doesn't it have to know the data type of the key? There's no declaration of the data type of the key anywhere. In the storage_conf.xml file, you declare the type of the columns, but not of the keys. Is the key assumed to be of the same type as the columns? Or does it do some magic sniffing to guess?
I've also seen reference implementations where people store TimeUUIDType in columns. However, this seems to have scale issues as this particular key would then become "hot" since every change would have to update it.
Any pointers in this case would be appreciated.
When sorting data, only the column keys are important. The data stored is of no consequence, and neither is the auto-generated timestamp. The CompareWith attribute is important here. If you set CompareWith to UTF8Type, then the keys will be interpreted as UTF8Type. If you set CompareWith to TimeUUIDType, then the keys are automatically interpreted as timestamps. You do not have to specify the data type. Look at the SlicePredicate and SliceRange definitions on this page: http://wiki.apache.org/cassandra/API. This is a good place to start. You might also find this article useful: http://www.sodeso.nl/?p=80. In the third part or so, the author talks about slicing ranges in his queries.
Doug,
Writing to a single column family can sometimes create a hot spot if you are using an Order-Preserving Partitioner, but not if you are using the default Random Partitioner (unless a subset of users create vastly more data than all other users!).
If you sorted your rows by time (using an Order-Preserving Partitioner) then you are probably even more likely to create hotspots, since you will be adding rows sequentially and a single node will be responsible for each range of the keyspace.
Columns and Keys can be of any type, since the row key is just the first column.
Virtually, the cluster is a circular hash key ring, and keys get hashed by the partitioner to get distributed around the cluster.
Beware of using dates as row keys, however: even the randomization of the default RandomPartitioner is limited, and you could end up cluttering your data.
What's more, if that date is changing, you would have to delete the previous row since you can only do inserts in C*.
Here is what we know:
A slice range is a range of columns in a row with a start value and an end value; this is used mostly for wide rows, as columns are ordered. Known column names defined in the CF are indexed, however, so they can be retrieved by specifying their names.
A key slice is a key associated with the sliced column range, as returned by Cassandra.
The equivalent of a WHERE clause uses secondary indexes; you may use inequality operators there, but there must be at least ONE equality clause in your statement (also see https://issues.apache.org/jira/browse/CASSANDRA-1599).
Using a key range is ineffective with a Random Partitioner, as the MD5 hash of your key doesn't keep lexical ordering.
What you want to use is a Column Family based index using a Wide Row:
CompositeType(TimeUUID | UserID)
In order for this not to become hot, add a first meaningful key (a "shard key") that would split the data across nodes, such as the user type or the region.
Having more data than necessary in Cassandra is not a problem; it's how it is designed. So ask yourself "what do I need to query?" and then design a Column Family for it, rather than trying to fit everything into one CF like you would in an RDBMS.
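For what it's worth, in modern CQL terms (which did not exist in 0.6.4) the wide-row design described above might look roughly like the following sketch; the table and column names are hypothetical:

CREATE TABLE events_by_shard (
    shard      text,      -- e.g. region or user type, spreads writes across nodes
    event_time timeuuid,  -- clustering column: rows are time-ordered within a shard
    user_id    uuid,
    payload    text,
    PRIMARY KEY (shard, event_time)
);

-- "New entries since the last poll" then becomes a slice on one partition:
SELECT * FROM events_by_shard
WHERE shard = 'eu'
  AND event_time > maxTimeuuid('2021-05-21 00:00:00+0000');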