Text fields with finite domain: to factor into sub-tables or not? - postgresql

A new database is to store the log data of a (series of) web servers. The structure of the log records has been converted to this ‘naïve’ general schema:
CREATE TABLE log (
    matchcode  SERIAL PRIMARY KEY,
    stamp      TIMESTAMP WITH TIME ZONE,
    ip         INET,
    bytes      NUMERIC,
    vhost      TEXT,
    path       TEXT,
    user_agent TEXT,
    -- and so on
);
And so on; there are many more fields, but this shows the general principle. The bulk of the data is contained in free-text fields as shown above. Of course, this will make the database rather big in the long run. We are talking about a web server log, so this doesn't come as a big surprise.
The domain of those text fields is limited, though. There is e.g. a very finite set of vhosts that will be seen, a much larger, but still decidedly finite set of paths, user agents and so on. In cases like this, would it be more appropriate to factor the text fields out into sub-tables, and reference them only via identifiers? I am thinking along these lines:
CREATE TABLE vhost ( ident SERIAL PRIMARY KEY, vhost TEXT NOT NULL UNIQUE );
CREATE TABLE path ( ident SERIAL PRIMARY KEY, path TEXT NOT NULL UNIQUE );
CREATE TABLE user_agent ( ident SERIAL PRIMARY KEY, user_agent TEXT NOT NULL UNIQUE );
CREATE TABLE log (
    matchcode  SERIAL PRIMARY KEY,
    stamp      TIMESTAMP WITH TIME ZONE,
    ip         INET,
    bytes      NUMERIC,
    vhost      INTEGER REFERENCES vhost (ident),
    path       INTEGER REFERENCES path (ident),
    user_agent INTEGER REFERENCES user_agent (ident),
    -- and so on
);
I have tried both approaches now. As expected, the second one is much smaller, by a factor of three, give or take. However, querying it becomes significantly slower due to all the joins involved; the difference is about an order of magnitude.
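For illustration, the queries I am comparing look roughly like this (a simplified example, not the exact benchmark):

-- Against the naïve schema:
SELECT stamp, ip, bytes, vhost, path, user_agent
FROM   log
WHERE  stamp >= now() - interval '1 day';

-- Against the factored schema, every text field costs a join:
SELECT l.stamp, l.ip, l.bytes, v.vhost, p.path, ua.user_agent
FROM   log l
JOIN   vhost      v  ON v.ident  = l.vhost
JOIN   path       p  ON p.ident  = l.path
JOIN   user_agent ua ON ua.ident = l.user_agent
WHERE  l.stamp >= now() - interval '1 day';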
From what I understand, the table should be sufficiently normal in both cases. At some later point in the project, there may be additional attributes attached to the various text values (like additional information about each vhost and so on).
The practical considerations are obvious, it’s basically a space/time tradeoff. In the long run, what is considered best practice in such a case? Are there other, perhaps more theoretical implications for such a scenario that I might want to be aware of?

The domain of those text fields is limited, though. There is e.g. a
very finite set of vhosts that will be seen, a much larger, but still
decidedly finite set of paths, user agents and so on. In cases like
this, would it be more appropriate to factor the text fields out into
sub-tables, and reference them only via identifiers?
There are a couple of different ways to look at this kind of problem. Regardless of which way you look at it, appropriate is a fuzzy word.
Constraints
Let's imagine that you create a table and a foreign key constraint to allow the "vhost" column to accept only five values. Can you also constrain the web server to write only those five values to the log file? No, you can't.
You can add some code to insert new virtual hosts into the referenced table. You can even automate that with triggers. But when you do that, you're no longer constraining the values for "vhost". This remains true whether you use natural keys or surrogate keys.
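For example, one way to automate that, sketched here as an upsert helper function rather than a trigger (uses INSERT ... ON CONFLICT, PostgreSQL 9.5+; the function name is made up):

-- Hypothetical helper: map a vhost string to its surrogate key,
-- inserting previously unseen vhosts on the fly.
CREATE OR REPLACE FUNCTION vhost_ident(p_vhost text) RETURNS integer AS $$
    WITH ins AS (
        INSERT INTO vhost (vhost) VALUES (p_vhost)
        ON CONFLICT (vhost) DO NOTHING
        RETURNING ident
    )
    SELECT ident FROM ins
    UNION ALL
    SELECT ident FROM vhost WHERE vhost = p_vhost
    LIMIT 1;
$$ LANGUAGE sql;

-- Usage when loading a log line:
-- INSERT INTO log (stamp, vhost) VALUES (now(), vhost_ident('www.example.com'));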
Data compression with ID numbers
You can also think of this as a problem of data compression. You save space--potentially a lot of space--by using integers as foreign keys to tables of unique text. You might not save time. Queries that require a lot of joins are often slower than queries that just read data directly. You've already seen that.
In your case, which has to do with machine-generated log files, I prefer to store them as they come from the device (web server, flow sensor, whatever) unless there's a compelling reason not to.
In a few cases, I've worked on systems where domain experts have determined that certain kinds of values shouldn't be transferred from a log file to a database. For example, domain experts might decide that negative numbers from a sensor mean the sensor is broken. This kind of constraint is better handled with a CHECK() constraint, but the principle is the same.
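For illustration, such a rule might look like this (table and column names are made up):

CREATE TABLE sensor_reading (
    id      BIGSERIAL PRIMARY KEY,
    stamp   TIMESTAMP WITH TIME ZONE NOT NULL,
    -- Domain rule: negative readings indicate a broken sensor, so reject them.
    reading NUMERIC NOT NULL CHECK (reading >= 0)
);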
Take what the device gives you unless there's a compelling reason not to.

Related

What is the best way to store varying size columns in postgres for language translation?

Let's say I create a table in Postgres to store language translations. Let's say I have a table like EVENT that has multiple columns which need translation to other languages. Rather than adding new columns for each language in EVENT, I would just add new rows in LANGUAGE with the same language_id.
EVENT:

id | EVENT_NAME (fk to LANGUAGE.short) | EVENT_DESCRIPTION (fk to LANGUAGE.long)
---+-----------------------------------+----------------------------------------
 0 | 1                                 | 2

LANGUAGE:

language_id | language_type | short (char varying 200) | med (char varying 50) | long (char varying 2000)
------------+---------------+--------------------------+-----------------------+------------------------------
 1          | english       | game of soccer           | null                  | null
 1          | spanish       | partido de footbol       | null                  | null
 2          | english       | null                     | null                  | A really great game of soccer
 2          | spanish       | null                     | null                  | Un gran partido de footbol
If I want the language specific version I would create a parameterized statement and pass in the language like this:
select e.id, n.short, d.long
from event e, language n, language d
where e.id = 0
  and e.event_name = n.language_id
  and n.language_type = ?
  and e.event_description = d.language_id
  and d.language_type = ?
My first thought was to have just a single column for the translated text, but I would have to make it big enough to hold any translation. Setting it to 2000 when many records will only be 50 characters seemed wrong. Hence I thought I might add different columns with different sizes and just use the appropriate size for the data I'm storing (the event name can be restricted to 50 characters on the front end and the description can be restricted to 2000 characters).
In the language table, only one of the 3 columns (short, med, long) will be set per row. This is just my initial thought, but I'm trying to understand whether this is a bad approach. Does the disk still reserve 2250 characters if I only set the short value? I read a while back that if you do this sort of thing in Oracle, it has to reserve the space for all columns in the disk block; otherwise, if you update the record, it would have to do so dynamically, which could be slow. Is Postgres the same?
It looks like you can specify a character varying type without a precision. Would it be more efficient (space-wise) to define just a single column without specifying the size, or a single column with the size set to 2000?
Just use a single column of data type text. That will perform just as well as character varying(n) in PostgreSQL, because the implementation is exactly the same, minus the length check. PostgreSQL only stores as many characters as the string actually has, so there is no overhead in using text even for short strings.
In the words of the documentation:
There is no performance difference among these three types, apart from increased storage space when using the blank-padded type, and a few extra CPU cycles to check the length when storing into a length-constrained column.
Use text.
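Applied to the table in the question, that boils down to something like this sketch (the single translation column name is made up):

CREATE TABLE language (
    language_id   integer NOT NULL,
    language_type text    NOT NULL,
    -- one text column regardless of length; no padding, no wasted space
    translation   text    NOT NULL,
    PRIMARY KEY (language_id, language_type)
);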
With Postgres, unless you really need to put a hard limit on the text size (and usually you do not), use text.
I see posts saying that varchar without n and text have the same performance but can be slightly slower than varchar(n).
This is incorrect with PostgreSQL 14. text is the most performant.
There is no performance difference among these three types [char, varchar, text], apart from increased storage space when using the blank-padded type [ie. char], and a few extra CPU cycles to check the length when storing into a length-constrained column [ie. varchar, char].
Storage for varchar and text is the same.
The storage requirement for a short string (up to 126 bytes) is 1 byte plus the actual string, which includes the space padding in the case of character. Longer strings have 4 bytes of overhead instead of 1. Long strings are compressed by the system automatically, so the physical requirement on disk might be less. Very long values are also stored in background tables so that they do not interfere with rapid access to shorter column values.
Do you need to reinvent this wheel?
Internationalization and localization tools already exist; they're built into many application frameworks.
Most localization systems work by using a string written in the developer's preferred language (for example, English) and then localizing that string as necessary. Then it becomes a simple hash lookup.
Rather than doing the localization in the database, query the message and then let the application framework localize it. This is simpler and faster and easier to work with and avoids putting additional load on the database.
If you really must do it in the database, here's some notes:
language is a confusing name for your table, perhaps messages or translations or localizations.
Use standard IETF language tags such as en-US.
Any fixed set of text, such as language tags, put into a table for referential integrity.
You could use an enum, but they're awkward to read and change.
Rather than having short, medium, and long columns, consider having one row for each message and a size column. This avoids a lot of null columns.
In fact, don't have "size" at all. It's arbitrary and subjective, as is evidenced by your null columns.
Don't prefix columns with the name of the table, fully qualify them as needed.
Separate the message from its localization for referential integrity.
create table languages (
    id serial primary key,
    tag text unique not null
);

-- This table might seem a bit silly, but it allows tables to refer to
-- a single message and enforce referential integrity.
create table messages (
    id bigserial primary key
);

create table message_localizations (
    -- If a message or language is deleted, so will its localization.
    message_id bigint not null references messages on delete cascade,
    language_id int not null references languages on delete cascade,
    localization text not null,

    -- This serves as both a primary key and enforces one localization
    -- per message and language.
    primary key(message_id, language_id)
);

create table events (
    id bigserial primary key,
    name_id bigint not null references messages,
    description_id bigint not null references messages
);
Then join each message with its localization.
select
    events.id,
    ml1.localization as name,
    ml2.localization as description
from events
-- left join so there is a result even if there is no localization. YMMV.
left join languages lt on lt.tag = 'en-US'
left join message_localizations ml1
    on ml1.message_id = name_id and ml1.language_id = lt.id
left join message_localizations ml2
    on ml2.message_id = description_id and ml2.language_id = lt.id
But, again, you probably want to use an existing localization tool.

Can a primary key in postgres have zero value?

There is one table at my database that have a row with ID equals to 0 (zero).
The primary key is a serial column.
I'm used to seeing sequences starting with 1. So, is there a problem if I keep this ID as zero?
The Serial data type creates integer columns which happen to auto-increment. Hence you should be able to add any integer value to the column (including 0).
From the docs
The type names serial and serial4 are equivalent: both create integer columns.
....(more about Serial) we have created an integer column and arranged for its default values to be assigned from a sequence generator
http://www.postgresql.org/docs/current/static/datatype-numeric.html#DATATYPE-SERIAL
This is presented as an answer because it’s too long for a comment.
You’re actually talking about two things here.
A primary key is a column designated to be the unique identifier for the table. There may be other unique columns, but the primary key is the one you have settled on, possibly because it’s the most stable value. (For example a customer’s email address is unique, but it’s subject to change, and it’s harder to manage).
The primary key can be any common data type, as long as it is guaranteed to be unique. In some cases, the primary key is a natural property of the row data, in which case it is a natural primary key.
In (most?) other cases, the primary key is an arbitrary value with no inherent meaning. In that case it is called a surrogate key.
The simplest surrogate key, the one which I like to call the lazy surrogate key, is a serial number. Technically, it’s not truly surrogate in that there is an inherent meaning in the sequence, but it is otherwise arbitrary.
For PostgreSQL, the data type typically associated with a serial number is integer, and this is implied in the SERIAL type. If you were doing this in MySQL/MariaDB, you might use unsigned integer, which doesn’t have negative values. PostgreSQL doesn’t have unsigned, so the data can indeed be negative.
The point about serial numbers is that they normally start at 1 and increment by 1. In PostgreSQL, you could have set up your own sequence manually (SERIAL is just a shortcut for that), in which case you can start with any value you like, such as 100, 0 or even -100 etc.
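For example, a minimal sketch of doing it manually with a sequence that starts at 0 (the table and sequence names are made up; MINVALUE is needed because ascending sequences default to a minimum of 1):

CREATE SEQUENCE thing_id_seq MINVALUE 0 START WITH 0;

CREATE TABLE thing (
    id integer PRIMARY KEY DEFAULT nextval('thing_id_seq')
);

-- Tie the sequence's lifetime to the column, as SERIAL would:
ALTER SEQUENCE thing_id_seq OWNED BY thing.id;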
To actually give an answer:
A primary key can have any compatible value you like, as long as it’s unique.
A serial number can also have any compatible value, but it is standard practice to start at 1, because that's how we humans count.
Reasons to override the start-at-one principle include:
I sometimes use 0 as a sort of default if a valid row hasn’t been selected.
You might use negative ids to indicate non-standard data, such as for testing or for virtual values; for example a customer with a negative id might indicate an internal allocation.
You might start your real sequence from a higher number and use lower ids for something similar to the point above.
Note that modern versions of PostgreSQL have a preferred standard alternative in the form of GENERATED BY DEFAULT AS IDENTITY. In line with modern SQL trends, it is much more verbose, but it is much more manageable than the old SERIAL.
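A minimal sketch of the identity variant (PostgreSQL 10+; the table name is made up):

CREATE TABLE item (
    id bigint GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY
);

-- The underlying identity sequence can also be told where to start:
--   id bigint GENERATED BY DEFAULT AS IDENTITY (START WITH 0 MINVALUE 0) PRIMARY KEY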

PostgreSQL: How to store UUID value?

I am trying to store UUID value into my table using PostgreSQL 9.3 version.
Example:
create table test
(
uno UUID,
name text,
address text
);
insert into test values(1,'abc','xyz');
Note: How can I store an integer value in a UUID column?
The whole point of UUIDs is that they are automatically generated because the algorithm used virtually guarantees that they are unique in your table, your database, or even across databases. UUIDs are stored as 16-byte datums so you really only want to use them when one or more of the following holds true:
You need to store data indefinitely, forever, always.
When you have a highly distributed system of data generation (e.g. INSERTS in a single "system" which is distributed over multiple machines each having their own local database) where central ID generation is not feasible (e.g. a mobile data collection system with limited connectivity, later uploading to a central server).
Certain scenarios in load-balancing, replication, etc.
If one of these cases applies to you, then you are best off using the UUID as a primary key and have it generated automagically:
CREATE TABLE test (
    uno     uuid PRIMARY KEY DEFAULT uuid_generate_v4(),
    name    text,
    address text
);
Like this you never have to worry about the UUIDs themselves, it is all done behind the scenes. Your INSERT statement would become:
INSERT INTO test (name, address) VALUES ('abc','xyz') RETURNING uno;
With the RETURNING clause obviously optional if you want to use that to reference related data.
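Note that uuid_generate_v4() comes from the uuid-ossp extension, which has to be installed once per database; on PostgreSQL 13 and later, the built-in gen_random_uuid() does the same job without any extension. A sketch:

-- Needed once per database before uuid_generate_v4() is available:
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

-- On PostgreSQL 13+, no extension is required (same table as above,
-- just with the built-in default):
CREATE TABLE test (
    uno     uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    name    text,
    address text
);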
It's not allowed to simply cast an integer into a UUID type. Generally, UUIDs are generated either internally in Postgres (see http://www.postgresql.org/docs/9.3/static/uuid-ossp.html for more detail on ways to do this) or via a client program.
If you just want unique IDs for your table itself, but don't actually need UUIDs (those are geared toward universal uniqueness, across tables and servers, etc.), you can use the serial type, which creates an implicit sequence for you which automatically increments when you INSERT into a table.

postgresql hstore key/value vs traditional SQL performance

I need to develop a key/value backend, something like this:
Table T1: id - PK, Key - string, Value - string
INSERT INTO T1 (Key, Value) VALUES ('String1', 'Value1')
INSERT INTO T1 (Key, Value) VALUES ('String1', 'Value2')
Table T2: id - PK2, id2 - foreign key to T1.id
plus some other data in T2 which references the data in T1 (like users which have those K/V pairs, etc.)
I heard about PostgreSQL hstore with GIN/GiST indexes. Which is better (performance-wise)?
Doing this the traditional way with SQL joins and separate columns (Key/Value)?
Or does PostgreSQL hstore perform better in this case?
The format of the data should be any key=>any value.
I also want to do text matching, e.g. partial searches (LIKE % in SQL, or the hstore equivalent).
I plan to have around 1M-2M entries in it and probably scale at some point.
What do you recommend? Going the traditional SQL way, PostgreSQL hstore, or some other distributed key/value store with persistence?
If it helps, my server is a VPS with 1-2 GB RAM, so not particularly good hardware. I was also thinking of putting a cache layer on top of this, but I think it rather complicates the problem. I just want good performance for 2M entries. Updates will be done often, but searches even more often.
Thanks.
Your question is unclear because you're not clear about your objective.
The key here is the index (pun intended): if you're dealing with a large number of keys, you want to be able to retrieve them with the fewest lookups and without pulling up unrelated data.
The short answer is that you probably don't want to use hstore, but let's look at it in more detail...
Does each id have many key/value pairs (hundreds+)? Don't use hstore.
Will any of your values contain large blocks of text (4kb+)? Don't use hstore.
Do you want to be able to search by keys in wildcard expressions? Don't use hstore.
Do you want to do complex joins/aggregation/reports? Don't use hstore.
Will you update the value for a single key? Don't use hstore.
Multiple keys with the same name under an id? Can't use hstore.
So what's the use of hstore? Well, one good scenario would be if you wanted to hold key/value pairs for an external application where you know you always want to retrieve all key/values and will always save the data back as a block (i.e., it's never edited in place). At the same time you do want some flexibility to be able to search this data, albeit very simply, rather than storing it in, say, a block of XML or JSON. In this case, since the number of key/value pairs is small, you save on space because you're compressing several tuples into one hstore.
Consider this as your table:
CREATE TABLE kv (
    id /* SOME TYPE */ PRIMARY KEY,
    key_name TEXT NOT NULL,
    key_value TEXT,
    UNIQUE(id, key_name)
);
I think the design is poorly normalized. Try something more like this:
CREATE TABLE t1
(
    t1_id serial PRIMARY KEY,
    <other data which depends on t1_id and nothing else>,
    -- possibly an hstore, but maybe better as a separate table
    t1_props hstore
);

-- if properties are done as a separate table:
CREATE TABLE t1_properties
(
    t1_id int NOT NULL REFERENCES t1,
    key_name text NOT NULL,
    key_value text,
    PRIMARY KEY (t1_id, key_name)
);
If the properties are small and you don't need to use them heavily in joins or with fancy selection criteria, an hstore may suffice. Elliot laid out some sensible things to consider in that regard.
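For what it's worth, a minimal sketch of the hstore route on t1, with a GIN index for containment searches (assumes the hstore extension; the index name and sample key/value are made up):

-- hstore ships as a contrib extension and must be installed once per database:
CREATE EXTENSION IF NOT EXISTS hstore;

-- GIN index so WHERE ... @> lookups don't scan the whole table:
CREATE INDEX t1_props_gin ON t1 USING gin (t1_props);

-- Fetch all properties of one row as a single block:
SELECT t1_props FROM t1 WHERE t1_id = 42;

-- Find rows having a given key/value pair:
SELECT t1_id FROM t1 WHERE t1_props @> 'color => red';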
Your reference to users suggests that this is incomplete, but you didn't really give enough information to suggest where those belong. You might get by with an array in t1, or you might be better off with a separate table.

How to solve log slowness with or without NoSql

I am having a problem with log searching speed and disk size.
The log is extremely big: it has about 220 million rows, takes up about 25 gigabytes on disk, and some selects take several minutes to return.
How does it work?
The log is saved in the database using SQL Anywhere, currently version 9, which will soon be migrated to 11 (we tried 12, but due to driver and other problems we went back to 11).
The log consists of two tables (names changed to English so people here can understand):
LogTable
Id, DateTime, User, Url, Action and TableName.
Action is what the user did: insert/delete/update.
TableName is which table in the database was affected.
LogTableFields
Id, LogTable_Id, FieldName, NewValue, OldValue.
LogTable_Id is foreign key from LogTable.
FieldName is the field of the affected table.
It is important to note that NewValue and OldValue are of type varchar, because every kind of field from the other tables (datetime, int, etc.) is recorded there.
Why was it made this way?
Because we must record everything important. The system is made for a state department of traffic (I don't know whether that's the proper term in English, but you get the idea of what this is about), and sometimes they demand some kind of ad hoc report.
Until now, we have produced our reports simply by running SQL selects. However, these take several minutes to complete, even when filtered by datetime. That isn't much of an issue when reports are not requested often.
But they are demanding more and more reports, so it has become necessary to build a proper reporting feature into the software, with nice-looking output. As we never know their needs in advance, we must go back to the log and dig the data out.
Some of the information requested exists only in the log (e.g. which user improperly granted access to a vehicle to someone).
Some ideas suggested so far:
Idea 1:
I did some research and was told to use NoSQL, e.g. CouchDB. But from the little I have read, I feel NoSQL isn't a solution for my problem, though I can't argue why, as I have no experience with it.
Idea 2:
Separate the log tables physically from the database, or move them to a different machine.
Idea 3: Create a mirror of every table with a version field to keep history.
I'd like a macro optimization or architecture change if needed.
This seems like a pretty standard audit table. I'm not sure you need to go to a NoSQL solution for this. 220 million rows will be comfortably handled by most RDBMSs.
It seems that the biggest problem is the table structure. Generally you flatten the table to improve logging speed and normalize it to improve reporting speed. As you can see these are conflicting.
If you were using something like MS SQL, you could build a single flat table for logging performance, then build a simple Analysis Services cube on top of it.
Another option would be to just optimize for reporting assuming you could maintain sufficient logging throughput. To do that, you may want to create a structure like this:
create table LogTable (
    LogTableID int identity(1,1) primary key,
    TableName varchar(100),
    Url varchar(200)
)

create table LogUser (
    LogUserID int identity(1,1) primary key,
    UserName varchar(100)
)

create table LogField (
    LogFieldID int identity(1,1) primary key,
    FieldName varchar(100)
)

create table LogData (
    LogDataID bigint identity(1,1) primary key,
    LogDate datetime,
    LogTableID int references LogTable(LogTableID),
    LogFieldID int references LogField(LogFieldID),
    LogUserID int references LogUser(LogUserID),
    Action char(1), -- U = update, I = insert, D = delete
    OldValue varchar(100),
    NewValue varchar(100)
)
This should still be fast enough to log data quickly, but provide enough performance for reporting. Index design is also important, generally done in order of increasing cardinality, so something like LogData(LogTableID, LogFieldID, LogDate). You can also get fancy with partitioning to allow for parallelized queries.
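As a sketch of that index and the kind of report query it serves (the literal filter values are made up):

-- Composite index ordered by increasing cardinality, as described above:
create index IX_LogData_Table_Field_Date
    on LogData (LogTableID, LogFieldID, LogDate)

-- A typical report query that this index serves well:
select LogDate, LogUserID, OldValue, NewValue
from LogData
where LogTableID = 3
  and LogFieldID = 17
  and LogDate >= '2012-01-01'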
Adding proper indices is going to be the biggest improvement you can make. You don't mention having any indices, so I assume you don't have any. That would make it very slow.
For example, limiting your query to a particular range of DateTime doesn't help at all unless you have an index on DateTime. Without an index, the database still needs to touch nearly all 25GB of data to find the few rows that are in the right time range. But with an index, it could quickly identify the few rows that are in the time range you care about.
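For example, something along these lines, using the column names from the question (index names are made up; exact syntax may vary slightly in SQL Anywhere):

-- Index the filter column so a date-range query touches only the matching rows:
create index idx_logtable_datetime on LogTable ("DateTime")

-- And the foreign key used to join the detail rows:
create index idx_logtablefields_parent on LogTableFields (LogTable_Id)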
In general, you should always ask your database what plan it is using to execute a query that is taking too long. I'm not particularly familiar with Sql Anywhere, but I know it has a Plan Viewer that can do this. You want to identify big sequential scans and put indices on those fields instead.
I doubt you would see a measurable improvement from breaking up the table and using integer foreign keys. To the extent that your queries touch many columns, you'll just end up joining all those tables back together anyway.