I have a PostgreSQL database containing mostly numeric data. Accuracy matters, so fields are typically Numeric(7,3), because the largest expected non-error value is around 1300. I am importing some data in which a few records exceed the 10^4 limit because of sensor faults, so I have applied an error code to these values using:
UPDATE schema.table1 SET fieldname1 = 3333 WHERE fieldname1 > 3333;
(i.e. if an operator sees 3333 in the data, it is known to be an erroneous value because it is outside the normally expected range, and known to be human-applied because such a digit sequence is unlikely to occur naturally.)
The success of this query was confirmed by:
SELECT MAX(fieldname1) FROM schema.table1;
which returns 3333 for all fields which previously had values exceeding 3333.
This initial tidying of data was done in a temporary table with the Fields defined as Numeric (10,3), and I now need to import the data into the main table, using:
INSERT INTO schema.table2
(DateTime, Fieldname1)
SELECT
DISTINCT ON (DateTime)
DateTime,
Fieldname1
FROM Schema.table1
WHERE DateTime IS NOT NULL;
But I get an error message saying 'datatype Numeric(7,3) cannot contain values exceeding 10^4' (or words to that effect).
As an experiment I tried redefining the datatypes in the temporary table as Numeric (7,3). This worked for most of the fields, but for a few fields I got the 'datatype Numeric(7,3) cannot contain values exceeding 10^4' message, implying that there is still data >10^4 despite the MAX(Fieldname1) command returning 3333.
Tried VACUUM and ANALYZE to clean up the tables, no cigar.
Is there a known issue here? Am I going about this the wrong way?
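One thing the MAX() check above cannot rule out is a large negative value: Numeric(7,3) overflows when the absolute value, after rounding to three decimals, reaches 10^4, and neither the UPDATE ... WHERE fieldname1 > 3333 nor MAX() would ever touch such a row. Whether the sensor data actually contains negatives is an assumption here, not something stated in the question. A minimal Python sketch of the range rule, with made-up readings and a hypothetical helper name:

```python
from decimal import Decimal, ROUND_HALF_UP

def fits_numeric(value, precision=7, scale=3):
    """Return True if value fits NUMERIC(precision, scale) after rounding to scale."""
    quantum = Decimal(1).scaleb(-scale)            # 0.001 for scale 3
    rounded = Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_UP)
    limit = Decimal(10) ** (precision - scale)     # 10^4 for (7,3)
    return abs(rounded) < limit                    # the sign does not matter

# Hypothetical readings: positive faults were already capped at 3333.
readings = [Decimal("1299.125"), Decimal("3333"), Decimal("-99999.000")]

print(max(readings))                               # 3333
print([r for r in readings if not fits_numeric(r)])  # [Decimal('-99999.000')]
```

If that turns out to be the cause, checking MIN(fieldname1), or filtering on abs(fieldname1) >= 10000, in the temporary table should show the offending rows.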
I have a table with about 200 million rows and multiple columns of data type DECIMAL(p,s) with varying precisions and scales.
Now, as far as I understand, DECIMAL(p,s) is a fixed-size column, with a size depending on the precision, see:
https://learn.microsoft.com/en-us/sql/t-sql/data-types/decimal-and-numeric-transact-sql?view=sql-server-ver16
Now, when altering the table and changing a column from DECIMAL(15,2) to DECIMAL(19,6), for example, I would have expected there to be almost no work for SQL Server to do, as the number of bytes required to store the value is the same. Yet the alter itself takes a long time, so what exactly does the server do when I execute the ALTER statement?
Also, is there any benefit (other than having constraints on a column) to storing a DECIMAL(15,2) instead of, for example, a DECIMAL(19,2)? It seems to me the storage requirements would be the same, but I could store larger values in the latter.
Thanks in advance!
The precision and scale of a decimal / numeric type matters considerably.
As far as SQL Server is concerned, decimal(15,2) is a different data type to decimal(19,6), and is stored differently. You therefore cannot make the assumption that just because the overall storage requirements do not change, nothing else does.
SQL Server stores decimal data types in byte-reversed (little-endian) format, with the scale being the first incrementing value. Changing the definition can therefore require the data to be rewritten: SQL Server will use an internal worktable to safely convert the data and update the values on every page.
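To make this concrete, here is a rough Python sketch. The byte sizes are the ones from the documentation page linked in the question. The scaled-integer view of a stored value is a simplified model assumed for illustration, not a description of the exact on-disk layout, and both function names are made up:

```python
from decimal import Decimal

def decimal_storage_bytes(precision):
    """Storage size by precision, per the documentation table linked in the question."""
    if not 1 <= precision <= 38:
        raise ValueError("precision must be between 1 and 38")
    if precision <= 9:
        return 5
    if precision <= 19:
        return 9
    if precision <= 28:
        return 13
    return 17

def scaled_integer(value, scale):
    """Simplified model: the integer obtained by shifting the point `scale` places."""
    return int(Decimal(value).scaleb(scale))

# Same size on disk ...
print(decimal_storage_bytes(15), decimal_storage_bytes(19))      # 9 9
# ... but the same number maps to a different integer once the scale changes.
print(scaled_integer("123.45", 2), scaled_integer("123.45", 6))  # 12345 123450000
```

Under that model, going from scale 2 to scale 6 changes the stored representation of every value even though the column width stays at 9 bytes, which fits with the rewrite described above. It also suggests that DECIMAL(15,2) to DECIMAL(19,2) is a different case from (15,2) to (19,6), although whether SQL Server can skip the rewrite there is not something this sketch can tell you.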
The environment for this question is PostgreSQL 9.6.5 on AWS RDS.
The question is about an optimal schema design and batch update strategy for a table with 300 million rows containing the following logical data model:
id: primary key, string up to 40 characters long
code: integer 1-999
year: integer year
flags: variable number (1000+) each associated with a name, new flags added over time. Ideally, a flag should be thought of as having three values: absent (null), on (true/1) and off (false/0). It is possible, at the cost of additional updates (see below), to treat a flag as a simple bit (on or off, no absent). "On" values are typically very sparse: < 1/1000.
Queries typically involve boolean expressions on the presence or absence of one or more flags (by name) with code and year occasionally involved also.
The data is updated in batch via Apache Spark, i.e., updates can be represented as flat file(s), e.g., in COPY format, or as SQL operations. Only one update is active at any one time. Updates to code and year are very infrequent. Updates to flags affect 1-5% of rows per update (3-15 million rows). It is possible for the update rows to include all flags and their values, just the "on" flags to be updated or just the flags whose values have changed. In the former case, Spark would need to query the data to get the current values of flags.
There will be a small read load during updates.
The question is about an optimal schema and associated update strategy to support the query & updates as described.
Some comments from research so far:
Using 1,000+ boolean columns would create a very efficient row representation but, in addition to some DDL complexity, would require 1,000+ indexes.
Bit strings would be great if there were a way to index individual bits. Also, they do not offer a good way to represent absent flags. Using this approach would require maintaining a lookup table between flag names and bit IDs. Merging updates, if needed, works with bitwise OR (|; note that || concatenates bit strings), though, given PostgreSQL's MVCC, there doesn't seem to be much benefit to updating just the flags as opposed to replacing an entire row.
JSONB fields offer indexing. They also offer null representation but that comes at a cost: all flags that are "off" would need to be explicitly set, which would make the fields quite large. If we ignore null representation, JSONB fields would be relatively small. To further shrink them, we could use short 1-3 character field names with a lookup table. Same comments re: merging as with bit strings.
tsvector/tsquery: have no experience with this data type but, in theory, seems to be an exact representation of a set of "on" flags by name. Must use a lookup table mapping flag names to tokens with the additional requirement to ensure there are no collisions due to stemming.
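To illustrate the bit-string option from the comments above, here is a small Python sketch that uses a plain integer as the bit set. The flag names and the FLAG_BITS lookup table are made up for the example; in the database this mapping would live in its own table.

```python
# Hypothetical lookup table from flag name to bit position.
FLAG_BITS = {"reviewed": 0, "archived": 1, "exported": 2}

def encode(on_flags):
    """Build a bit mask with one bit set per "on" flag name."""
    mask = 0
    for name in on_flags:
        mask |= 1 << FLAG_BITS[name]
    return mask

def has_flag(mask, name):
    """Test a single flag by name."""
    return bool(mask & (1 << FLAG_BITS[name]))

def merge(current, update):
    """Merge an update that only carries newly "on" flags (bitwise OR)."""
    return current | update

row = encode(["reviewed"])
row = merge(row, encode(["exported"]))
print(has_flag(row, "reviewed"), has_flag(row, "archived"), has_flag(row, "exported"))
```

Note that a cleared bit cannot distinguish "off" from "absent", which is the limitation mentioned above, and that an OR merge can only turn flags on.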
Don't store the flags in the main table.
Assuming that the main table is called data, define something like the following:
CREATE TABLE flag_names (
id smallint PRIMARY KEY,
name text NOT NULL
);
CREATE TABLE flag (
flagname_id smallint NOT NULL REFERENCES flag_names(id),
data_id text NOT NULL REFERENCES data(id),
value boolean NOT NULL,
PRIMARY KEY (flagname_id, data_id)
);
If a new flag is created, insert a new row in flag_names.
If a flag is set to TRUE or FALSE, insert or update a row in the flag table.
Join flag with data to test if a certain flag is set.
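Here is a small runnable sketch of the idea. It uses Python's built-in sqlite3 module purely as a stand-in so that it runs without a server; the column types are SQLite's, and the sample rows, flag names and helper functions are invented for the example. The three-valued logic falls out naturally: a row in flag with TRUE, a row with FALSE, or no row at all (absent).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE data (id TEXT PRIMARY KEY, code INTEGER, year INTEGER);
    CREATE TABLE flag_names (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE flag (
        flagname_id INTEGER NOT NULL REFERENCES flag_names(id),
        data_id TEXT NOT NULL REFERENCES data(id),
        value BOOLEAN NOT NULL,
        PRIMARY KEY (flagname_id, data_id)
    );
    INSERT INTO data VALUES ('a', 1, 2017), ('b', 2, 2017), ('c', 3, 2016);
    INSERT INTO flag_names VALUES (1, 'reviewed'), (2, 'archived');
    INSERT INTO flag VALUES (1, 'a', 1), (1, 'b', 0), (2, 'a', 1);
""")

def rows_with_flag(conn, flag_name, value=True):
    """Ids of data rows where the named flag is explicitly on (or off)."""
    sql = """
        SELECT d.id FROM data d
        JOIN flag f ON f.data_id = d.id
        JOIN flag_names n ON n.id = f.flagname_id
        WHERE n.name = ? AND f.value = ?
        ORDER BY d.id
    """
    return [row[0] for row in conn.execute(sql, (flag_name, int(value)))]

def rows_without_flag(conn, flag_name):
    """Ids of data rows where the named flag is absent (no row in flag)."""
    sql = """
        SELECT d.id FROM data d
        WHERE NOT EXISTS (
            SELECT 1 FROM flag f
            JOIN flag_names n ON n.id = f.flagname_id
            WHERE f.data_id = d.id AND n.name = ?
        )
        ORDER BY d.id
    """
    return [row[0] for row in conn.execute(sql, (flag_name,))]

print(rows_with_flag(conn, "reviewed"))         # ['a']
print(rows_with_flag(conn, "reviewed", False))  # ['b']
print(rows_without_flag(conn, "reviewed"))      # ['c']
```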
I have a table in Postgres with a JSONB column, each row of the table contains a large JSONB object (~4500 keys, JSON string is around 110 KB in a txt file). I want to query these rows and get the entire JSONB object.
The query is fast -- when I run EXPLAIN ANALYZE, or omit the JSONB column, it returns in 100-300 ms. But when I execute the full query, it takes on the order of minutes. The exact same query on a previous version of the data was also fast (each JSONB was about half as large).
Some notes:
This ends up in Python (via SQLAlchemy/psycopg2). I'm worried that the query executor is converting JSONB to JSON, then it gets encoded to text for transfer over the wire, then gets JSON encoded again on the Python end.
Is this correct? If so how could I mitigate this issue? When I select the JSONB column as ::text, the query is roughly twice as fast.
I only need a small subset of the JSON (around 300 keys or 6% of keys). I tried methods of filtering the JSON output in the query but they caused a substantial further performance hit -- it ended up being faster to return the entire object.
This is not necessarily a solution, but here is an update:
By casting the JSON column to text in the Postgres query, I was able to substantially cut down query execution and results fetching on the Python end.
On the Python end, doing json.loads for every single row in the result set brings me to the exact timing as using the regular query. However, with the ujson library I was able to obtain a significant speedup. The performance of casting to text in the query, then calling ujson.loads on the python end is roughly 3x faster than simply returning JSON in the query.
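For reference, the Python side of that looks roughly like the sketch below. It assumes the query already selects the column as ::text, so each row arrives as a plain string. ujson is an optional third-party package, so the sketch falls back to the standard json module when it is not installed. The function name and the sample keys are made up.

```python
import json

try:
    import ujson as fast_json   # third-party, optional
except ImportError:
    fast_json = json            # standard-library fallback

def decode_subset(raw_text, wanted_keys):
    """Parse one row's JSON text and keep only the keys the caller needs."""
    doc = fast_json.loads(raw_text)
    return {key: doc[key] for key in wanted_keys if key in doc}

# One row as it would come back from a "SELECT col::text ..." query.
raw = '{"a": 1, "b": 2, "c": 3}'
print(decode_subset(raw, ["a", "c", "z"]))   # {'a': 1, 'c': 3}
```

This still parses the whole document for every row; it only trims what is kept afterwards, so the saving comes from the faster parser rather than from the filtering.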
I store the following rows in my table ('DataScreen') under a JSONB column ('Results')
{"Id":11,"Product":"Google Chrome","Handle":3091,"Description":"Google Chrome"}
{"Id":111,"Product":"Microsoft Sql","Handle":3092,"Description":"Microsoft Sql"}
{"Id":22,"Product":"Microsoft OneNote","Handle":3093,"Description":"Microsoft OneNote"}
{"Id":222,"Product":"Microsoft OneDrive","Handle":3094,"Description":"Microsoft OneDrive"}
In these JSON objects, "Id" and "Handle" are integer properties and the others are string properties.
When I query my table like below
Select Results->>'Id' From DataScreen
order by Results->>'Id' ASC
I get the wrong results because the ->> operator returns text, so PostgreSQL orders the values as text and not as integers.
Hence it gives the result as
11,111,22,222
instead of
11,22,111,222.
I don't want to use explicit casting to retrieve like below
Select Results->>'Id' From DataScreen order by CAST(Results->>'Id' AS INT) ASC
because I cannot be sure of the data type of each key: the JSON structure is dynamic, and the keys and values may change next time. The same could then happen with another JSON document that has both integer and string values.
I want integers in the JSON structure of a JSONB column to be treated as integers, not as text (strings).
How do I write my query so that Id and Handle are retrieved as integer values and not as strings, without explicit casting?
I think your assumptions about the id field don't make sense. You said,
(a) Either id contains integers only or
(b) it contains strings and integers.
I'd say,
If (a) then numerical ordering is correct.
If (b) then lexical ordering is correct.
But if (a) for some time and then (b) then the correct order changes, too. And that doesn't make sense. Imagine:
For the current database you expect the order 11,22,111,222. Then you add a row
{"Id":"aa","Product":"Microsoft OneDrive","Handle":3095,"Description":"Microsoft OneDrive"}
and suddenly the correct order of the other rows changes to 11,111,22,222,aa. That sudden change is what bothers me.
So I would either expect a lexical ordering ab initio, or restrict my id field to integers and use explicit casting.
Every other option I can think of is just not practical. You could, for example, create a custom < and > implementation for your id field which results in 11,22,111,222,aa ("order all integers by numerical value and all strings by lexical order, and put all integers before the strings").
But that is a lot of work (it involves a custom data type, a custom cast function and a custom operator function) and yields some counterintuitive results, e.g. 11,22,111,222,0a,1a,2a,aa (note the position of 0a and so on: they come after 222).
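To show what that rule does, here it is as a sort key in Python rather than as a PostgreSQL operator. This is only a sketch of the ordering as stated in the parentheses (integers by numerical value, strings lexically, integers first); the function name and sample ids are made up.

```python
def mixed_key(value):
    """Sort key: integers first (by numeric value), then strings (lexically)."""
    try:
        return (0, int(value), "")
    except (TypeError, ValueError):
        return (1, 0, str(value))

ids = ["aa", "222", "0a", "11", "2a", "111", "1a", "22"]
print(sorted(ids, key=mixed_key))
# ['11', '22', '111', '222', '0a', '1a', '2a', 'aa']
```

The 0a, 1a, 2a entries land after 222 even though they start with a digit, which is the counterintuitive part.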
Hope, that helps ;)
If Id is always an integer, you can cast it in the SELECT list and just use ORDER BY 1:
select (Results->>'Id')::int From DataScreen order by 1 ASC
I just came across a scenario where occasionally (not for all sets of data) I get an "Error: SQL0802 - Data conversion or data mapping error." exception when adding ORDER BY to a simple query. For example, this works:
SELECT
market,
locationCode,
locationName
FROM locations
and the following is failing miserably:
SELECT
market,
locationCode,
locationName
FROM locations
ORDER BY locationName
I'm getting: Error: SQL0802 - Data conversion or data mapping error. (State:S1000, Native Code: FFFFFCDE)
I get the same error if I try to sort by name, or population, or anything really, but only sometimes: when it errors on name or code, it also errors when sorted by any other field in that subset of locations. If it works for a particular subset of locations, then it works for any sort order.
There are no null values in any of the fields; the code and name fields are character fields.
Initially, I got this error when I added ROW_NUMBER column:
ROW_NUMBER() OVER(PARTITION BY market ORDER BY locationCode) as rowNumber
Since then, I have narrowed it down to the failing ORDER BY case. I don't know which direction to go with it. Any thoughts?
Update: there are no blank values in the location name field. Even if I remove all fields in this subset, leave only a 7-digit numeric id and sort by that field, I still get the same error.
WITH locs as (
SELECT id
FROM locations
)
SELECT *
FROM locs
ORDER BY id
I get this error when I SELECT DISTINCT any field from the subset too.
I had/have the exact same situation as described. The error seemed to be random, but would always appear when sorting was added. Although I can't precisely describe the technical details, what I think is occurring is the "randomness" was actually due to the size of the tables, and the size of the cached chunks of returned rows from the query.
The actual cause of the problem is junk values and/or blanks in the key fields used by the join. If there was no sorting, and the first batch of cached results didn't hit the records with the bad fields, the error wouldn't occur at first...but eventually it always did.
And the two things that ALWAYS drew out the error IMMEDIATELY were sorting or paging through the results. That's because in order to sort, it has to hit every one of those key fields, and then cache the complete results. I think. Like I said, I don't know the complete technobabble, but I'm pretty sure that's close in laygeek terms.
I was able to solve the error by force-casting the key columns to integer. I changed the join from this...
FROM DAILYV INNER JOIN BXV ON DAILYV.DAITEM=BXV.BXPACK
...to this...
FROM DAILYV INNER JOIN BXV ON CAST(DAILYV.DAITEM AS INT)=CAST(BXV.BXPACK AS INT)
...and I didn't have to make any corrections to the tables. This is a database that's very old, very messy, and has lots of junk in it. Corrections have been made, but it's a work in progress.
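If you want to track down the junk rather than just cast around it, something like the Python sketch below can help once the key column has been exported as strings. It assumes, as in my case, that the keys are supposed to be plain integers; the function name and sample values are made up.

```python
def find_bad_keys(values):
    """Return (position, value) pairs for keys that are blank or not plain integers."""
    bad = []
    for position, raw in enumerate(values):
        text = "" if raw is None else str(raw).strip()
        # An empty string fails isdigit(), so blanks are reported too.
        if not (text.isascii() and text.isdigit()):
            bad.append((position, raw))
    return bad

# Hypothetical key values pulled from a column such as DAITEM.
keys = ["1001", "   ", "10A2", None, "2002"]
print(find_bad_keys(keys))   # [(1, '   '), (2, '10A2'), (3, None)]
```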