PostgreSQL Text Search Performance

I have been looking into text search (without tsvector) on a varchar field (roughly 10 to 400 characters long) that has the following format:
field,field_a,field_b,field_c,...,field_n
The query I am planning to run is probably similar to:
select * from information_table where fields like '%field_x%'
As there are no spaces in fields, I wonder if there are some performance issues if I run the search across 500k+ rows.
Any insights into this?
Any documentation around performance of varchar and maybe varchar index?
I am not sure if tsvector will work on a full string without spaces. What do you think about this solution? Do you see other solutions that could help improve the performance?
Thanks and I look forward to hearing from you.
R

In general the text search parser will treat commas and spaces the same, so if you want to use FTS, the structure with commas does not pose a problem. pg_trgm also treats commas and spaces the same, so if you want to use that method instead it will also not have a problem due to the commas.
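To illustrate the point about commas, here is a minimal Python sketch that approximates the word-splitting step by treating every non-alphanumeric character as a separator. It is a simplified stand-in, not the actual PostgreSQL parser or pg_trgm code, and `split_words` and the sample strings are hypothetical.

```python
import re

def split_words(text: str) -> list[str]:
    """Split on any run of non-alphanumeric characters and lowercase the result.

    Hypothetical helper: a simplified approximation, not PostgreSQL's parser.
    """
    return [w.lower() for w in re.findall(r"[A-Za-z0-9]+", text)]

comma_separated = "alpha,beta,gamma"
space_separated = "alpha beta gamma"

# Both inputs yield the same word list, so the separator makes no difference here.
print(split_words(comma_separated) == split_words(space_separated))
```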
The performance is going to depend on how popular or rare the tokens in the query are in the body of text. It is hard to generalize that based on one example row and one example query, neither of which looks very realistic. The best way to figure it out would be to run some real queries against real (or at least realistic) data with EXPLAIN (ANALYZE, BUFFERS) and with track_io_timing turned on.

Related

Pattern matching performance issue Postgres

I found that the query below takes a long time; the pattern matching is causing a performance problem in my batch job.
Query:
select a.id, b.code
from table a
left join table b
on a.desc_01 like '%'||b.desc_02||'%';
I have tried the LEFT and STRPOS functions to improve the performance, but I end up losing some data when I apply them.
Any other suggestions, please?
It's not that clear what your data (or structure) really looks like, but your search is performing a contains comparison. That's not the simplest thing to optimize because a standard index, and many matching algorithms, are biased towards the start of the string. When you lead with %, a B-tree can't be used efficiently, because it splits/branches based on the front of the string.
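A rough way to see why the leading % hurts: in the Python sketch below, a sorted list stands in for the ordered keys of a B-tree. It is only an analogy, not how PostgreSQL implements its indexes, and the sample keys and function names are made up for illustration.

```python
from bisect import bisect_left

# A sorted list stands in for the ordered keys of a B-tree index (illustrative data).
keys = sorted(["apple pie", "apple tart", "banana split", "cherry pie", "pineapple cake"])

def prefix_search(sorted_keys, prefix):
    """LIKE 'prefix%': binary-search to the first candidate, then read a contiguous run."""
    start = bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # sorted order guarantees no later key can match
        matches.append(key)
    return matches

def contains_search(sorted_keys, needle):
    """LIKE '%needle%': matches can sit anywhere in the order, so every key is checked."""
    return [key for key in sorted_keys if needle in key]

print(prefix_search(keys, "apple"))
print(contains_search(keys, "apple"))
```

The prefix search stops as soon as the sorted run ends, while the contains search has to visit every key to find "pineapple cake".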
Depending on how you really want to search, have you considered trigram indexes? They're pretty great. Your string gets split into three-letter chunks, which overcomes a lot of the problems with left-anchored text comparison. The reason is simple: every character is now the start of a short, left-anchored chunk. There are traditionally two methods of generating trigrams (n-grams), one with leading padding and one without. Postgres uses padding, which is the better default. I got help with a related question recently that may be relevant to you:
Searching on expression indexes
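To make the trigram idea concrete, here is a small Python sketch. The padding (two leading spaces, one trailing space per word) follows what the pg_trgm documentation describes for a single word. The candidate-then-recheck lookup is a simplified assumption about how such an index can be used, not PostgreSQL's actual code, and the rows and function names are invented.

```python
def trigrams(word: str, pad: bool = True) -> set[str]:
    """Return the set of 3-character chunks of a single word.

    With pad=True the word gets two leading spaces and one trailing space,
    mirroring the padding pg_trgm documents for a word.
    """
    text = f"  {word.lower()} " if pad else word.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}

# Hypothetical single-word rows; the dict plays the role of the trigram index.
rows = ["catalog", "concatenate", "dog"]
index = {row: trigrams(row) for row in rows}

def contains(needle: str) -> list[str]:
    """Emulate LIKE '%needle%': filter by trigrams first, then recheck the real string."""
    wanted = trigrams(needle, pad=False)  # no padding: the needle may sit mid-word
    candidates = [row for row, grams in index.items() if wanted <= grams]
    return [row for row in candidates if needle in row]

print(sorted(trigrams("cat")))
print(contains("cat"))
```

Because "cat" appears as a trigram inside "concatenate", the lookup finds it even though the match is not at the start of the string.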
If you want something more like a keyword match, then full text search might be of help. I had not been using it much because I've got a data set where converting words to "lexemes" doesn't make sense. It turns out that you can tell the parser to use the "simple" dictionary instead, and that gets you a unique word list without any stemming transformations. Here's a recent question on that:
https://dba.stackexchange.com/questions/251177/postgres-full-text-search-on-words-not-lexemes/251185#251185
If that sounds more like what you need, you might also want to get rid of stop/skip/noise words. Here's a thread that I think is a bit clearer than the docs on how to set this up (it's not hard):
https://dba.stackexchange.com/questions/145016/finding-the-most-commonly-used-non-stop-words-in-a-column/186754#186754
The long term answer is to clean up and re-organize your data so you don't need to do this.
Using a pg_trgm index might be the short term answer.
create extension pg_trgm;
create index on a using gin (desc_01 gin_trgm_ops);
How fast this will be is going to depend on what is in b.desc_02.

Fuzzy search on large table

I have a very large PostgreSQL table with 12M names and I would like to show an autocomplete. Previously I used an ILIKE 'someth%' clause, but I'm not really satisfied with it. For example, it doesn't sort by similarity, and any spelling error causes wrong or no results. The field is a string, usually one or two words (in any language). I need a fast response because suggestions are shown live to the user while they are typing (i.e. autocomplete). I cannot restrict the fuzzy match to a subset because all names are equally important. I can also say that most names are different.
I have tried pg_trgm, but even with a GIN index it is very slow. Searching for a name similar to 'html' takes a few milliseconds, but (don't ask me why) other searches like 'htm' take many seconds, e.g. 25 seconds. Other people have also reported performance issues with pg_trgm on large tables.
Is there anything I can do to efficiently show an autocomplete on that field?
Would a full text search engine (e.g. Lucene, Solr) be an appropriate solution? Or would I encounter the same inefficiency?

How to make substring-matching query work fast on a large table?

I have a large table with a text field, and I want to query this table to find records that contain a given substring, using ILIKE. It works perfectly on small tables, but in my case it is a rather time-consuming operation, and I need it to work fast, because I use it in a live-search field on my website. Any ideas would be appreciated...
Check the "Waiting for 9.1 – Faster LIKE/ILIKE" blog post from depesz for a solution using trigrams.
You'd need to use the as-yet-unreleased PostgreSQL 9.1 for this. Your writes would also be much slower, as trigram indexes are huge.
Full text search suggested by user12861 would help only if you're searching for words, not substrings.
You probably want to look into full text indexing. It's a bit complicated, maybe someone else can give a better description, or you might try some links, like this one for example:
http://wiki.postgresql.org/wiki/Full_Text_Indexing_with_PostgreSQL

Is it a good idea to store attributes in an integer column and perform bitwise operations to retrieve them?

In a recent CODE Magazine article, John Petersen shows how to use bitwise operators in TSQL in order to store a list of attributes in one column of a db table.
Article here.
In his example he's using one integer column to hold how a customer wants to be contacted (email,phone,fax,mail). The query for pulling out customers that want to be contacted by email would look like this:
SELECT C.*
FROM dbo.Customers C
,(SELECT 1 AS donotcontact
,2 AS email
,4 AS phone
,8 AS fax
,16 AS mail) AS contacttypes
WHERE ( C.contactmethods & contacttypes.email <> 0 )
AND ( C.contactmethods & contacttypes.donotcontact = 0 )
Afterwards he shows how to encapsulate this into a table function.
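For readers less used to bit masks, the same two tests from the WHERE clause can be sketched outside the database. The Python snippet below reuses the flag values from the query; the `Contact` enum, the customer names, and their stored values are hypothetical.

```python
from enum import IntFlag

class Contact(IntFlag):
    """Flag values copied from the contacttypes subquery above."""
    DO_NOT_CONTACT = 1
    EMAIL = 2
    PHONE = 4
    FAX = 8
    MAIL = 16

# Hypothetical customers: name -> value stored in the contactmethods column.
customers = {
    "Ann": int(Contact.EMAIL | Contact.PHONE),           # 6
    "Bob": int(Contact.EMAIL | Contact.DO_NOT_CONTACT),  # 3
    "Cy": int(Contact.FAX | Contact.MAIL),               # 24
}

def wants_email(contactmethods: int) -> bool:
    """Same logic as the WHERE clause: email bit set and do-not-contact bit clear."""
    return (contactmethods & Contact.EMAIL) != 0 and (contactmethods & Contact.DO_NOT_CONTACT) == 0

print([name for name, methods in customers.items() if wants_email(methods)])
```

Only Ann passes: Bob has the email bit set but also the do-not-contact bit, and Cy has no email bit at all.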
My questions are these:
1. Is this a good idea? Any drawbacks? What problems might I run into using this approach of storing attributes versus storing them in two extra tables (Customer_ContactType, ContactType) and doing a join with the Customer table? I guess one problem might be if my attribute list gets too long. If the column is a 32-bit integer, then my attribute list could hold at most 32 attributes.
2. What is the performance of doing these bitwise operations in queries as you move into the tens of thousands of records? I'm guessing that it would not be any more expensive than any other comparison operation.
If you wish to filter your query based on the value of any of those bit values, then yes this is a very bad idea, and is likely to cause performance problems.
Besides, there simply isn't any need - just use the bit data type.
The reason why using bitwise operators in this way is a bad idea is that SQL Server maintains statistics on various columns in order to improve query performance. For example, if you have an email column, SQL Server can tell roughly what percentage of values in that email column are true and select an appropriate execution plan based on that knowledge.
If, however, you have a flags column, SQL Server will have absolutely no idea how many records in a table match flags & 2 (email); it doesn't maintain these sorts of indexes. Without this sort of information available to it, SQL Server is far more likely to choose a poor execution plan.
And don't forget the maintenance problems this technique would cause. As it is not standard, new devs will probably be confused by the code and not know how to adjust it properly. Errors will abound and be hard to find. It is also hard to write reporting-type queries against. This sort of trick is almost never a good idea from a maintenance perspective. It might look cool and elegant, but in practice it is clunky and hard to work with over time.
One major performance implication is that there will not be a lookup operator for indexes that works in this way. If you said WHERE contact_email=1 there might be an index on that column and the query would use it; if you said WHERE (contact_flags & 1)=1 then it wouldn't.
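One way to picture why an ordered index does not help with a mask test: sort the possible flag values as an index would, then look at where the matches fall. The Python sketch below is only an analogy for a 5-bit flags column, not a description of SQL Server internals, and the helper name is made up.

```python
EMAIL = 2

# All values a 5-bit flags column can hold, in index (sorted) order.
flag_values = list(range(32))

# Values that satisfy (flags & EMAIL) <> 0.
matching = [v for v in flag_values if v & EMAIL]

def count_runs(values):
    """Count how many separate contiguous runs the sorted values form."""
    runs = 0
    previous = None
    for v in values:
        if previous is None or v != previous + 1:
            runs += 1
        previous = v
    return runs

print(matching)
print(count_runs(matching))       # runs needed for the mask predicate
print(count_runs([EMAIL]))        # an equality predicate is a single point
```

The matching values are scattered across the sorted order in many small runs, so no single range seek covers them, whereas an equality test on a dedicated column maps to one spot in the index.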
** One column stores one piece of information only - it's the database way. **
(I didn't see it at first, but Kragen's answer also makes this point, well before mine.)
In opposite order: The best way to know what your performance is going to be is to profile.
This is, most definitely, an "It Depends" question. I personally would never store such things as integers. For one thing, as you mention, there's the conversion factor. For another, at some point you, some other DBA, or someone else is going to have to type:
Select CustomerName, CustomerAddress, ContactMethods, [etc]
From Customer
Where CustomerId = xxxxx
because some data has become corrupt, or because someone entered the wrong data, or something. Having to do a join and/or a function call just to get at that basic information is way more trouble than it's worth, IMO.
Others, however, will probably point to the diversity of your options, or the ability to store multiple value types (email, vs phone, vs fax, whatever) all in the same column, or some other advantage to this approach. So you would really need to look at the problem you're attempting to solve and determine which approach is the best fit.

How can I limit DataSet.WriteXML output to typed columns?

I'm trying to store a lightly filtered copy of a database for offline reference, using ADO.NET DataSets. There are some columns I need not to take with me. So far, it looks like my options are:
Put up with the columns
Get unmaintainably clever about the way I SELECT rows for the DataSet
Hack at the XML output to delete the columns
I've deleted the columns' entries in the DataSet designer. WriteXml still outputs them, to my dismay. If there's a way to limit WriteXml's output to typed rows, I'd love to hear it.
I tried to filter the columns out with careful SELECT statements, but ended up with a ConstraintException I couldn't solve. Replacing one table's query with SELECT * did the trick. I suspect I could solve the exception given enough time. I also suspect it could come back again as we evolve the schema. I'd prefer not to hand such a maintenance problem to my successors.
All told, I think it'll be easiest to filter the XML output. I need to compress it, store it, and later load, decompress, and read it back into a DataSet. Filtering the XML is only one more step and, better yet, will only need to happen once a week or so.
Can I change DataSet's behaviour? Should I filter the XML? Is there some fiendishly simple way I can query pretty much, but not quite, everything without running into ConstraintException? Or is my approach entirely wrong? I'd much appreciate your suggestions.
UPDATE: It turns out I copped ConstraintException for a simple reason: I'd forgotten to delete a strongly typed column from one DataTable. It wasn't allowed to be NULL. When I selected all the columns except that column, the value was NULL, and… and, yes, that's profoundly embarrassing, thank you so much for asking.
It's as easy as Table.Columns.Remove("UnwantedColumnName"). I got the lead from Mehrdad's wonderfully terse answer to another question. I was delighted when Table.Columns turned out to be malleable.