postgres store with composite value type, or a better way of attributing an inverted index

postgres store with composite value type, or a better way of attributing an inverted index - postgresql

can't seem to figure out the syntax for populating a hstore with a value of composite type -- note: I do not want to convert a record to a hstore.
select hstore('hello => ROW(1,2)');
I know it's something simple; However, google is not my friend today.
use case : custom inverted index.
The data is modelling an inverted index of lexemes, the composite data types are various probabilities related to the lexemes which I will use to implement document clustering. Does anyone know a better way of doing this ? I'm open to using an external system if it allows attaching attributes to key->posting pairs in the inverted index.
I'd use something external if it had solid support for what I am trying to do, I suspect that sticking 3-10k lexemes per tuple and then doing batch processing on them is gonna be nasty as the whole hstore will have to be parsed and converted.
At the moment my lexemes are in the range of 50-1k per tuple, mostly in the low hundreds, and i'm just doing it for my research algorithms. But there has to be a better way of doing this.

Strings in hstore are double-quoted. hstore only supports text values, nothing else, so you must store other types as their text representations:
SELECT hstore('k => "(1,2)"');
eg:
regress=> SELECT (hstore('k => "(1,2)"')) -> 'k';
?column?
----------
(1,2)
(1 row)
This means you have to cast the values to use them, eg:
regress=> CREATE TYPE pair AS (a integer, b integer);
CREATE TYPE
regress=> SELECT ((hstore('k => "(1,2)"')) -> 'k')::pair;
pair
-------
(1,2)
(1 row)
or using arrays instead to avoid the composite type:
regress=> SELECT ((hstore('k => "{1,2}"')) -> 'k')::integer[];
int4
-------
{1,2}
(1 row)
Arrays are indexed from 1 with the [] operator, eg [1].
This is generally horrendously inefficient because the values must be parsed and converted every single time. It's not really practical to suggest alternatives when you haven't said much about the nature of your problem and why you want hstore in the first place.

Related

Postgres Varchar(n) or text which is better [duplicate]

What's the difference between the text data type and the character varying (varchar) data types?
According to the documentation
If character varying is used without length specifier, the type accepts strings of any size. The latter is a PostgreSQL extension.
and
In addition, PostgreSQL provides the text type, which stores strings of any length. Although the type text is not in the SQL standard, several other SQL database management systems have it as well.
So what's the difference?

There is no difference, under the hood it's all varlena (variable length array).
Check this article from Depesz: http://www.depesz.com/index.php/2010/03/02/charx-vs-varcharx-vs-varchar-vs-text/
A couple of highlights:
To sum it all up:
char(n) – takes too much space when dealing with values shorter than n (pads them to n), and can lead to subtle errors because of adding trailing
spaces, plus it is problematic to change the limit
varchar(n) – it's problematic to change the limit in live environment (requires exclusive lock while altering table)
varchar – just like text
text – for me a winner – over (n) data types because it lacks their problems, and over varchar – because it has distinct name
The article does detailed testing to show that the performance of inserts and selects for all 4 data types are similar. It also takes a detailed look at alternate ways on constraining the length when needed. Function based constraints or domains provide the advantage of instant increase of the length constraint, and on the basis that decreasing a string length constraint is rare, depesz concludes that one of them is usually the best choice for a length limit.

As "Character Types" in the documentation points out, varchar(n), char(n), and text are all stored the same way. The only difference is extra cycles are needed to check the length, if one is given, and the extra space and time required if padding is needed for char(n).
However, when you only need to store a single character, there is a slight performance advantage to using the special type "char" (keep the double-quotes — they're part of the type name). You get faster access to the field, and there is no overhead to store the length.
I just made a table of 1,000,000 random "char" chosen from the lower-case alphabet. A query to get a frequency distribution (select count(*), field ... group by field) takes about 650 milliseconds, vs about 760 on the same data using a text field.

(this answer is a Wiki, you can edit - please correct and improve!)
UPDATING BENCHMARKS FOR 2016 (pg9.5+)
And using "pure SQL" benchmarks (without any external script)
use any string_generator with UTF8
main benchmarks:
2.1. INSERT
2.2. SELECT comparing and counting
CREATE FUNCTION string_generator(int DEFAULT 20,int DEFAULT 10) RETURNS text AS $f$
SELECT array_to_string( array_agg(
substring(md5(random()::text),1,$1)||chr( 9824 + (random()*10)::int )
), ' ' ) as s
FROM generate_series(1, $2) i(x);
$f$ LANGUAGE SQL IMMUTABLE;
Prepare specific test (examples)
DROP TABLE IF EXISTS test;
-- CREATE TABLE test ( f varchar(500));
-- CREATE TABLE test ( f text);
CREATE TABLE test ( f text CHECK(char_length(f)<=500) );
Perform a basic test:
INSERT INTO test
SELECT string_generator(20+(random()*(i%11))::int)
FROM generate_series(1, 99000) t(i);
And other tests,
CREATE INDEX q on test (f);
SELECT count(*) FROM (
SELECT substring(f,1,1) || f FROM test WHERE f<'a0' ORDER BY 1 LIMIT 80000
) t;
... And use EXPLAIN ANALYZE.
UPDATED AGAIN 2018 (pg10)
little edit to add 2018's results and reinforce recommendations.
Results in 2016 and 2018
My results, after average, in many machines and many tests: all the same (statistically less than standard deviation).
Recommendation
Use text datatype, avoid old varchar(x) because sometimes it is not a standard, e.g. in CREATE FUNCTION clauses varchar(x)≠varchar(y).
express limits (with same varchar performance!) by with CHECK clause in the CREATE TABLE e.g. CHECK(char_length(x)<=10). With a negligible loss of performance in INSERT/UPDATE you can also to control ranges and string structure e.g. CHECK(char_length(x)>5 AND char_length(x)<=20 AND x LIKE 'Hello%')

On PostgreSQL manual
There is no performance difference among these three types, apart from increased storage space when using the blank-padded type, and a few extra CPU cycles to check the length when storing into a length-constrained column. While character(n) has performance advantages in some other database systems, there is no such advantage in PostgreSQL; in fact character(n) is usually the slowest of the three because of its additional storage costs. In most situations text or character varying should be used instead.
I usually use text
References: http://www.postgresql.org/docs/current/static/datatype-character.html

In my opinion, varchar(n) has it's own advantages. Yes, they all use the same underlying type and all that. But, it should be pointed out that indexes in PostgreSQL has its size limit of 2712 bytes per row.
TL;DR:
If you use text type without a constraint and have indexes on these columns, it is very possible that you hit this limit for some of your columns and get error when you try to insert data but with using varchar(n), you can prevent it.
Some more details: The problem here is that PostgreSQL doesn't give any exceptions when creating indexes for text type or varchar(n) where n is greater than 2712. However, it will give error when a record with compressed size of greater than 2712 is tried to be inserted. It means that you can insert 100.000 character of string which is composed by repetitive characters easily because it will be compressed far below 2712 but you may not be able to insert some string with 4000 characters because the compressed size is greater than 2712 bytes. Using varchar(n) where n is not too much greater than 2712, you're safe from these errors.

text and varchar have different implicit type conversions. The biggest impact that I've noticed is handling of trailing spaces. For example ...
select ' '::char = ' '::varchar, ' '::char = ' '::text, ' '::varchar = ' '::text
returns true, false, true and not true, true, true as you might expect.

Somewhat OT: If you're using Rails, the standard formatting of webpages may be different. For data entry forms text boxes are scrollable, but character varying (Rails string) boxes are one-line. Show views are as long as needed.

A good explanation from http://www.sqlines.com/postgresql/datatypes/text:
The only difference between TEXT and VARCHAR(n) is that you can limit
the maximum length of a VARCHAR column, for example, VARCHAR(255) does
not allow inserting a string more than 255 characters long.
Both TEXT and VARCHAR have the upper limit at 1 Gb, and there is no
performance difference among them (according to the PostgreSQL
documentation).

The difference is between tradition and modern.
Traditionally you were required to specify the width of each table column. If you specify too much width, expensive storage space is wasted, but if you specify too little width, some data will not fit. Then you would resize the column, and had to change a lot of connected software, fix introduced bugs, which is all very cumbersome.
Modern systems allow for unlimited string storage with dynamic storage allocation, so the incidental large string would be stored just fine without much waste of storage of small data items.
While a lot of programming languages have adopted a data type of 'string' with unlimited size, like C#, javascript, java, etc, a database like Oracle did not.
Now that PostgreSQL supports 'text', a lot of programmers are still used to VARCHAR(N), and reason like: yes, text is the same as VARCHAR, except that with VARCHAR you MAY add a limit N, so VARCHAR is more flexible.
You might as well reason, why should we bother using VARCHAR without N, now that we can simplify our life with TEXT?
In my recent years with Oracle, I have used CHAR(N) or VARCHAR(N) on very few occasions. Because Oracle does (did?) not have an unlimited string type, I used for most string columns VARCHAR(2000), where 2000 was at some time the maximum for VARCHAR, and in all practical purposes not much different from 'infinite'.
Now that I am working with PostgreSQL, I see TEXT as real progress. No more emphasis on the VAR feature of the CHAR type. No more emphasis on let's use VARCHAR without N. Besides, typing TEXT saves 3 keystrokes compared to VARCHAR.
Younger colleagues would now grow up without even knowing that in the old days there were no unlimited strings. Just like that in most projects they don't have to know about assembly programming.

I wasted way too much time because of using varchar instead of text for PostgreSQL arrays.
PostgreSQL Array operators do not work with string columns. Refer these links for more details: (https://github.com/rails/rails/issues/13127) and (http://adamsanderson.github.io/railsconf_2013/?full#10).

If you only use TEXT type you can run into issues when using AWS Database Migration Service:
Large objects (LOBs) are used but target LOB columns are not nullable
Due to their unknown and sometimes large size, large objects (LOBs) require more processing
and resources than standard objects. To help with tuning migrations of systems that contain
LOBs, AWS DMS offers the following options
If you are only sticking to PostgreSQL for everything probably you're fine. But if you are going to interact with your db via ODBC or external tools like DMS you should consider using not using TEXT for everything.

character varying(n), varchar(n) - (Both the same). value will be truncated to n characters without raising an error.
character(n), char(n) - (Both the same). fixed-length and will pad with blanks till the end of the length.
text - Unlimited length.
Example:
Table test:
a character(7)
b varchar(7)
insert "ok " to a
insert "ok " to b
We get the results:
a | (a)char_length | b | (b)char_length
----------+----------------+-------+----------------
"ok "| 7 | "ok" | 2

Alphanumeric sorting without any pattern on the strings [duplicate]

I've got a Postgres ORDER BY issue with the following table:
em_code name
EM001 AAA
EM999 BBB
EM1000 CCC
To insert a new record to the table,
I select the last record with SELECT * FROM employees ORDER BY em_code DESC
Strip alphabets from em_code usiging reg exp and store in ec_alpha
Cast the remating part to integer ec_num
Increment by one ec_num++
Pad with sufficient zeors and prefix ec_alpha again
When em_code reaches EM1000, the above algorithm fails.
First step will return EM999 instead EM1000 and it will again generate EM1000 as new em_code, breaking the unique key constraint.
Any idea how to select EM1000?

Since Postgres 9.6, it is possible to specify a collation which will sort columns with numbers naturally.
https://www.postgresql.org/docs/10/collation.html
-- First create a collation with numeric sorting
CREATE COLLATION numeric (provider = icu, locale = 'en#colNumeric=yes');
-- Alter table to use the collation
ALTER TABLE "employees" ALTER COLUMN "em_code" type TEXT COLLATE numeric;
Now just query as you would otherwise.
SELECT * FROM employees ORDER BY em_code
On my data, I get results in this order (note that it also sorts foreign numerals):
Value
0
0001
001
1
06
6
13
۱۳
14

One approach you can take is to create a naturalsort function for this. Here's an example, written by Postgres legend RhodiumToad.
create or replace function naturalsort(text)
returns bytea language sql immutable strict as $f$
select string_agg(convert_to(coalesce(r[2], length(length(r[1])::text) || length(r[1])::text || r[1]), 'SQL_ASCII'),'\x00')
from regexp_matches($1, '0*([0-9]+)|([^0-9]+)', 'g') r;
$f$;
Source: http://www.rhodiumtoad.org.uk/junk/naturalsort.sql
To use it simply call the function in your order by:
SELECT * FROM employees ORDER BY naturalsort(em_code) DESC

The reason is that the string sorts alphabetically (instead of numerically like you would want it) and 1 sorts before 9.
You could solve it like this:
SELECT * FROM employees
ORDER BY substring(em_code, 3)::int DESC;
It would be more efficient to drop the redundant 'EM' from your em_code - if you can - and save an integer number to begin with.
Answer to question in comment
To strip any and all non-digits from a string:
SELECT regexp_replace(em_code, E'\\D','','g')
FROM employees;
\D is the regular expression class-shorthand for "non-digits".
'g' as 4th parameter is the "globally" switch to apply the replacement to every occurrence in the string, not just the first.
After replacing every non-digit with the empty string, only digits remain.

This always comes up in questions and in my own development and I finally tired of tricky ways of doing this. I finally broke down and implemented it as a PostgreSQL extension:
https://github.com/Bjond/pg_natural_sort_order
It's free to use, MIT license.
Basically it just normalizes the numerics (zero pre-pending numerics) within strings such that you can create an index column for full-speed sorting au naturel. The readme explains.
The advantage is you can have a trigger do the work and not your application code. It will be calculated at machine-speed on the PostgreSQL server and migrations adding columns become simple and fast.

you can use just this line
"ORDER BY length(substring(em_code FROM '[0-9]+')), em_code"

I wrote about this in detail in this related question:
Humanized or natural number sorting of mixed word-and-number strings
(I'm posting this answer as a useful cross-reference only, so it's community wiki).

I came up with something slightly different.
The basic idea is to create an array of tuples (integer, string) and then order by these. The magic number 2147483647 is int32_max, used so that strings are sorted after numbers.
ORDER BY ARRAY(
SELECT ROW(
CAST(COALESCE(NULLIF(match[1], ''), '2147483647') AS INTEGER),
match[2]
)
FROM REGEXP_MATCHES(col_to_sort_by, '(\d*)|(\D*)', 'g')
AS match
)

I thought about another way of doing this that uses less db storage than padding and saves time than calculating on the fly.
https://stackoverflow.com/a/47522040/935122
I've also put it on GitHub
https://github.com/ccsalway/dbNaturalSort

The following solution is a combination of various ideas presented in another question, as well as some ideas from the classic solution:
create function natsort(s text) returns text immutable language sql as $$
select string_agg(r[1] || E'\x01' || lpad(r[2], 20, '0'), '')
from regexp_matches(s, '(\D*)(\d*)', 'g') r;
$$;
The design goals of this function were simplicity and pure string operations (no custom types and no arrays), so it can easily be used as a drop-in solution, and is trivial to be indexed over.
Note: If you expect numbers with more than 20 digits, you'll have to replace the hard-coded maximum length 20 in the function with a suitable larger length. Note that this will directly affect the length of the resulting strings, so don't make that value larger than needed.

What PostgreSQL type is good for stroring array of strings and offering fast lookup afterwards

I am using PostgreSQL 11.9
I have a table containing a jsonb column with arbitrary number of key-values. There is a requirement when we perform a search to include all values from this column as well. Searching in jsonb is quite slow so my plan is to create a trigger which will extract all the values from the jsonb column:
select t.* from app.t1, jsonb_each(column_jsonb) as t(k,v)
with something like this. And then insert the values in a newly created column in the same table so I can use this column for faster searches.
My question is what type would be most suitable for storing the keys and then searchin within them. Currently the search looks like this:
CASE
WHEN something IS NOT NULL
THEN EXISTS(SELECT value FROM jsonb_each(column_jsonb) WHERE value::text ILIKE search_term)
END
where the search_term is what the user entered from the front end.

This is not going to be pretty, and normalizing the data model would be better.
You can define a function
CREATE FUNCTION jsonb_values_to_string(
j jsonb,
separator text DEFAULT ','
) RETURNS text LANGUAGE sql IMMUTABLE STRICT
AS 'SELECT string_agg(value->>0, $2) FROM jsonb_each($1)';
Then you can query like
WHERE jsonb_values_to_string(column_jsonb, '|') ILIKE 'search_term'
and you can define a trigram index on the left hand side expression to speed it up.
Make sure that you choose a separator that does not occur in the data or the pattern...

PostgreSQL Base Types (Scalar Type)

I have a use case where a custom base type in a PostgreSQL database would be very beneficial for dealing with non-linear data. The examples of this include defining using a input and output function to a C function. In my case I would rather just define the inp and out functions using SQL and then using the "LIKE" to inherit everything else from the double precision. Has anyone done this? is it even possible?
Possible example:
-- sample linear to logrithmic functions
CREATE FUNCTION to_linear(anyelement) RETURNS double precision
LANGUAGE SQL
AS
$$
SELECT CASE WHEN $1 > 0 THEN 30 / log($1::double precision) ELSE 0 END
$$;
create function to_log(anyelement) returns double precision
language sql
as $$
select 10^($1::double precision/30.0);
$$;
-- create the base type
create type mylogdata
(
INPUT = to_linear,
OUTPUT = to_log,
LIKE = double precision
) ;
-- sample use in a table definition
CREATE TABLE test_table(
mydata mylogdata
);
What I'm really after is a "sudo" or "partial" base-type to allow for a simple in-out conversions while allowing the existing functions (sum, average, etc...) to work on the inherited type (in this case, double precision); basically avoiding to write/rewrite functions in C.
Thoughts? Ideas? Comments? Not possible? :)
Much Thanks!
On a side note, if we had do go down the 'C' route, I think there could be an opportunity to create a more generic logarithmic scalar/base-type like the Char, Varchar, or Arbitrary Precision Number which could allow for the dynamic declaration of the log base and scale of the non-linear data.
Something like this could a big win for the science community and those of us dealing with "wave" based data like sound, vibration, earth quakes, light, radiation, etc. Here is a sample definition of the base:
Logarithmic(base, scale)
-- Below my idea for use in a table definition
-- Obviously the IN/OUT functions would have to be modified to use the base and scaling
-- as defined ( most likely in C ?? )
CREATE TABLE test_table
(
mydata logarithmic(10, 30)
);
If someone is interested in partnering in creating something like this, let me know.

If you want to write a data type with new type input and output functions, you have to write that in C. No doubt you can reuse a lot of functions from double precision.
But I would go the other way. Rather than having a type that looks like a number, but the known arithmetic operators behave weirdly, define a new set of operators on an existing data type. That can be done in SQL, and it feels more natural to me.
If you create an operator class for your new operators, you can also use them with indexes.

Tsquery return exact matched keyword

I have a query like
select * from mytable where posttext ## to_tsquery('Intelence');
I just want to return results with exact match of the keyword 'Intelence' rather than 'intel', how can I do this in postgresql?
Thanks.

This is not possible with full-text search unless you want to tell PostgreSQL not to stem Intelence at all by changing the text search dictionary. Pg doesn't include the word in the index, only the stems:
regress=> SELECT to_tsvector('english', 'Intelence');
to_tsvector
-------------
'intel':1
(1 row)
You can suppress stemming entirely with the simple dictionary:
regress=> SELECT to_tsvector('simple','Intelence');
to_tsvector
---------------
'intelence':1
(1 row)
but that must be done on the index, you can't do it per-query let alone per search term. So the text cats are bothering me would not match a search for cat in the simple dictionary because of the plural, or bother because the unstemmed words are not the same.
If you want to make individual exceptions you can edit the english dictionary used by tsearch2 and define a custom dictionary with the desired changes, then use that dictionary instead of english in queries where you want the exceptions. Again, you must use the same dictionary for the index creation and the queries, though.
This might land up with you needing multiple fulltext indexes, which is very undesirable from the point of view of slowing down updates/inserts/deletes and from a memory use efficiency perspective.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse