Small Column and large table - Should I use FTS or Like - sql-server-2008-r2

At the moment I am using full-text search (2008 R2) on small columns like 'Client Name', 'PO Number', etc., but I was wondering if it is really worth using FTS on small columns, or whether I could just use 'Like' for searching.
The table has over 11k rows, which is not a lot, but it is growing.
If it is better to use 'Like', then do I have to remove the columns from the catalog?
What is meant by unstructured text data here?
"In contrast to full-text search, the LIKE Transact-SQL predicate works on character patterns only. Also, you cannot use the LIKE predicate to query formatted binary data. Furthermore, a LIKE query against a large amount of unstructured text data is much slower than an equivalent full-text query against the same data. A LIKE query against millions of rows of text data can take minutes to return; whereas a full-text query can take only seconds or less against the same data, depending on the number of rows that are returned. "

If you're already using full text, why change it? LIKE queries may be suitable now but the performance will degrade sharply as your table grows, as stated in the MSDN article you quoted.
If it is better to use 'Like', then do I have to remove the columns from the catalog?
No, the full text catalog has no impact on LIKE queries.
What is meant by unstructured text data here?
Full text can be used on binary formats that SQL Server supports (imagine searching across Word files stored in SQL) and XML data. LIKE cannot do this.
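For context, a rough side-by-side of the two approaches (the table and column names here are made up for illustration; also note that CONTAINS matches on word and prefix boundaries, not arbitrary substrings):

-- LIKE with a leading wildcard cannot use a regular index, so every row is scanned:
SELECT OrderID FROM Orders WHERE ClientName LIKE '%acme%';

-- The full-text version uses the catalog instead and stays fast as the table grows:
SELECT OrderID FROM Orders WHERE CONTAINS(ClientName, '"acme*"');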

Related

What is the best way to store a large set of single-column data in PostgreSQL

I have a massive list of strings in a text file; the file is about 100GB uncompressed.
Each line of the text file is a single word (roughly 50 characters long), with no spaces or punctuation.
The table will be created and populated from this text file once, further updates to the table will not be necessary and if it helps the table can become read only.
The use-case is a function which would look something like this:
/**
 * Search the table and return true if the word exists, false if not.
 */
wordExists(wordToCheck: string): boolean {}
I'm looking for advice here on what would be the best way to store the data to ensure that lookups are as fast and efficient as possible.
I'm not sure if breaking the word up into parts to assist in indexing it would help or not; I'm also not sure if it would help to partition this list.
Anyone have any advice for me?
100GB is fairly large, and assuming each word averages 50 characters in length, this means the single column would hold roughly 2 billion words/records. This is not so large that it is beyond the capability of Postgres.
I suggest creating and populating your table, and then adding a hash index:
CREATE INDEX idx ON yourTable USING HASH (text_col);
Now any query against this table should be able to use the index for very rapid lookup. For example, to see if a word exists, use:
SELECT EXISTS (SELECT 1 FROM yourTable WHERE text_col = 'meatballs');
If you run the explain plan, you should see the hash index being used with a fast lookup time.
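For instance (same assumed names as above; the exact plan text depends on your PostgreSQL version):

EXPLAIN ANALYZE
SELECT EXISTS (SELECT 1 FROM yourTable WHERE text_col = 'meatballs');
-- The plan should reference the hash index (idx) rather than a sequential scan over the table.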

PostgreSQL: to_tsquery starts with and ends with search

Recently, I implemented a PostgreSQL 11 full-text search on a huge table I have in a system to solve the problem of hitting LIKE queries in it. This table has over 200 million rows and querying using to_tsquery worked pretty well for the column of type tsvector.
Now I need the following queries, but reading the documentation I couldn't find out how to do it (or it's there and I didn't understand it, because full-text search is still new to me):
Starts with
Ends with
How can I make the query below return true only if the text starts with "The cat" and ends with "the book", if that's possible in full-text search?
select to_tsvector('The cat is on the book') @@ to_tsquery('Cat')
I implemented a PostgreSQL 11 full-text search on a huge table I have in a system to solve the problem of hitting LIKE queries in it.
How did you do that? FTS doesn't apply for LIKE queries. It applies for FTS queries, such as @@.
You can't directly look for strings starting and ending with certain words. You can use the index to filter on cat and book, then refilter those rows for ones having them in the right place.
select * from whatever where tsv_col @@ to_tsquery('cat & book') and text_col LIKE 'The cat % the book';
If you also want to match something like 'The cathe book' (where the start and end overlap), you would have to do something else, with two separate LIKE conditions.
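A minimal sketch of that two-LIKE variant, reusing the assumed column names from the query above:

select *
from whatever
where tsv_col @@ to_tsquery('cat & book')
  and text_col LIKE 'The cat%'
  and text_col LIKE '%the book';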

Designing a database for storing tags of audio files

I want to build a database which contains all the tags of a collection of audio
files (FLAC, Vorbis, MP3, whatever). I've already sorted out the extraction
(that was the easy part), but now I have some doubts about how to properly
design the database that will contain them.
At the moment I have normalised it like this
as a straightforward 1:m relationship:
file: filename, size, last_modified, …
tags: filename, tag, seq, value
Where filename is the primary key for the file table and ( filename, tag,
seq ) the primary key for the tag table. Some tags do appear more than once;
the seq column is just a number which remembers the exact order of those.
However, with a design like this, extracting meaningful information about the
files becomes a real pain. If I, for example, want just the ARTIST, ALBUM, and
TITLE fields for each track, I already have to join the file and tags tables
three times:
SELECT filename, artist.value, album.value, title.value
FROM file
LEFT OUTER JOIN tags artist USING ( filename )
LEFT OUTER JOIN tags album USING ( filename )
LEFT OUTER JOIN tags title USING ( filename )
WHERE
artist.tag = 'ARTIST'
AND album.tag = 'ALBUM'
AND title.tag = 'TITLE';
It's beyond question that this is not only extremely cumbersome to write, but
is also quite slow because of all those joins. And this is only a simple
example. In effect, all the queries that I eventually want to pose will piece
together all the tags that they need as if they were stored as the columns of a
large table.
I've already thought about not normalising the tags and just keeping them as
columns of the FILE table. But the number of tags is highly variable; some of
the more standard tags like ARTIST and TITLE are almost guaranteed to be
present, some of the more obscure ones are only on some of the files, but I need
to work with them too.
To me it looks like I am trying to do this the wrong way, especially with how the
tags table is "structured". Is there a better way to deal with this kind of data?
For reference: I'm using PostgreSQL.
I gather from this post that my schema above is an EAV model, so it looks like I'm in for quite a hard problem…
But the number of tags is highly variable; some of the more standard tags like ARTIST and TITLE are almost guaranteed to be present, some of the more obscure ones are only on some of the files, but I need to work with them too.
You could have separate tables for the (mostly) guaranteed tags, and use the EAV model for the optional tags.
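A rough sketch of that hybrid layout, assuming PostgreSQL and keeping the column names from the question (purely illustrative, not a finished design):

CREATE TABLE file (
    filename      text PRIMARY KEY,
    size          bigint,
    last_modified timestamptz,
    artist        text,  -- "almost guaranteed" tags promoted to real columns
    album         text,
    title         text
);

CREATE TABLE extra_tags (  -- EAV only for the long tail of optional tags
    filename text REFERENCES file (filename),
    tag      text,
    seq      integer,
    value    text,
    PRIMARY KEY (filename, tag, seq)
);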
Relational databases are designed to join tables. Don't worry about performance problems with joins until you actually have a performance problem. Worry about getting your data relationships correct.
Instead of just sticking with the EAV model and letting the DBMS sort out the resulting jungle of joins, I have found suggestions to store all the tags in a single column as an XML document and query it via XPath when extracting values. PostgreSQL's HSTORE follows basically the same idea.
This way, I get rid of the EAV structure but there are other drawbacks. HSTORE has some rather strict limitations on how large the tag values can be, and XML poses a significant overhead both in storing and parsing.
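For what it's worth, a minimal hstore sketch of that idea (the table name is invented, and a plain hstore map cannot keep the seq ordering of repeated tags):

CREATE EXTENSION IF NOT EXISTS hstore;

CREATE TABLE file_tags (
    filename text PRIMARY KEY,
    tags     hstore
);

SELECT filename, tags -> 'ARTIST' AS artist, tags -> 'TITLE' AS title
FROM file_tags
WHERE tags -> 'ARTIST' = 'Miles Davis';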
In the end, the 'original' queries with all the JOINs are much clearer than the convoluted XML/XPath stuff or the cumbersome string escaping needed for HSTORE. So the suggestion from the accepted answer seems best.

How to improve insert speed with index on text column

I am using a PostgreSQL database for our project and doing some performance testing. We need to insert millions of records with indexed columns. We have 5 columns in the table. With an index on the integer column only, performance is good, but when I created an index on the text column as well, performance dropped to about 1/8th of what it was. My question is: how can I improve performance when inserting data with an index on a text column?
The short answer is: you can't.
It is well known that adding indexes to DB columns is a two-edged sword:
on one (positive) side, it speeds up your read queries
on the other, it adds a performance penalty to insert/update/delete operations, and your data will occupy a little more disk space
A possible solution would be to use a full-text search engine such as Sphinx, which will index the text entities in your DB.

Better performance for SQLite Select Statement

I'm developing an iPhone app where the user types any string into a search bar and presses the search button. After that, a result list should appear.
In my SQLite I have four columns a, b, c, d. Let's say they have the following Values:
Dataset 1:
a: code1
b: report1
c: description1_1
d: description1_2
Dataset 2:
a: code2
b: report2
c: description2_1
d: description2_2
So if the user enters a value of "1_1", then the first dataset will be selected because of column c.
If the user enters a value of: "report" then the first and second dataset will be selected.
As I'm using a database with nearly 60,000 datasets, searching for a partial string is really killing the performance.
Setting an index on all 4 columns would make the SQLite database much too large.
So I didn't use an index at all.
My SELECT statement looks like this:
NSString *sql = [NSString stringWithFormat:@"SELECT * FROM scode WHERE a LIKE '%@%@%@' OR b LIKE '%@%@%@' OR c LIKE '%@%@%@' OR d LIKE '%@%@%@'", wildcard, searchBar.text, wildcard, wildcard, searchBar.text, wildcard, wildcard, searchBar.text, wildcard, wildcard, searchBar.text, wildcard];
Is there any good way to improve the performance of searching for a partial string in all columns?
Thank you and kind regards,
Daniel
You're after Full Text Searching, which SQLite doesn't natively support. I don't have any experience with 3rd party support, but based on results there are a few options.
You answered your own question: create the index on all four columns, and measure the size difference. Considering the storage capacity of the iPhone, you are probably over-weighting the cost of storage.
The rule of thumb with SQLite performance is not to do a query that isn't indexed.
You can see what SQLite is actually doing by creating your database on the Mac with the same schema and running EXPLAIN QUERY PLAN. (There's also EXPLAIN, which is more detailed but less obvious.)
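For example, against the table from the question (the exact plan wording differs between SQLite versions):

EXPLAIN QUERY PLAN
SELECT * FROM scode WHERE c LIKE '%1_1%';
-- A plan along the lines of "SCAN TABLE scode" means the whole table is read for every search.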
You can create a separate table with two columns: a pattern string and a key value (which is used to refer to your data tables). Let's call this table "search_index".
Then, on any change to your data table entries, you update the "search_index" table:
remove rows whose keys refer to changed data table rows
for each column in the data table, take the first X characters of the data and add them to search_index together with the key
You can work out the details yourself, but in this way, you just build your own (partial) search index.
When querying, you can use up to X characters to search in the search_index table alone. If the user types more than X characters you at least have a limited set of data table rows to search in. So you can search those 60k rows easily.
Find a good value for X to balance storage requirements and usability and performance.
EDIT: Looks like you do not want to search only at the beginning of the words? Well, then you should not just use the "first X characters"; instead, split the data into single words and store the full words in search_index. In practice you will still need only around a fourth of the index storage compared to putting an index on all columns, so it's still a good idea to build your own "search_index".
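A sketch of what that word-based search_index could look like in SQLite (the names and the use of rowid as the key are my assumptions, not part of the answer above):

CREATE TABLE search_index (
    pattern TEXT NOT NULL,    -- one word extracted from one of the columns a-d
    key     INTEGER NOT NULL  -- rowid of the corresponding row in scode
);
CREATE INDEX idx_search_pattern ON search_index (pattern);

-- Lookup by full word; the small result set can then be re-checked with LIKE if needed.
SELECT s.*
FROM scode AS s
JOIN search_index AS i ON i.key = s.rowid
WHERE i.pattern = 'report2';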