Sorry if this has already been asked. I couldn't see it in previously asked questions.
I have a table - 'eightks'.
This table contains 1,000,000 text documents.
I only need those that mention the phrase 'other events'. So I am trying to do some text matching and then output the matching documents into a new table.
My current code is:
SELECT * FROM eightks
WHERE to_tsvector(text) @@ to_tsquery('other_events');
When I run this I get the following error:
string is too long for tsvector (2368732 bytes, max 1048575 bytes)
Also, how do I output the matching rows into a new table?
Any help is appreciated.
That's a documented limitation.
The length of a tsvector (lexemes + positions) must be less than 1 megabyte
It might be possible to change the source code and recompile. See ts_type.h. I suspect it won't be simple, though.
You might need to break the documents up into smaller pieces for searching, then combine the pieces for presentation to the user.
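A minimal sketch of the splitting step, done on the client side before the pieces are loaded. The function name split_document and the 500,000-byte default are assumptions for illustration only; the documented limit applies to the resulting tsvector, not to the raw text, so the right piece size has to be found by trial.

```python
def split_document(text: str, max_bytes: int = 500_000) -> list[str]:
    """Split text on whitespace into pieces of at most max_bytes (UTF-8).

    Hypothetical helper: a single word longer than max_bytes still ends up
    alone in an oversized piece.
    """
    pieces: list[str] = []
    current: list[str] = []
    size = 0
    for word in text.split():
        word_size = len(word.encode("utf-8"))
        # One extra byte for the joining space when the piece is non-empty.
        extra = word_size + (1 if current else 0)
        if current and size + extra > max_bytes:
            # The current piece is full: store it and start a new one.
            pieces.append(" ".join(current))
            current, size = [], 0
            extra = word_size
        current.append(word)
        size += extra
    if current:
        pieces.append(" ".join(current))
    return pieces
```

Note that a phrase such as 'other events' can straddle a boundary between two pieces; this sketch does not overlap the pieces, so a real version would need to handle that case.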
As for inserting the rows into another table, you can just insert a correct select statement. Basically . . .
insert into table_name
select ...
You might need to supply column names.
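To see the insert-from-select pattern run end to end, here is a small sketch using Python's built-in sqlite3 module as a stand-in for PostgreSQL. The target table name eightks_other_events, the id column and the sample rows are made up for the example, and a plain LIKE filter replaces the full-text match, which SQLite does not support in that form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE eightks (id INTEGER PRIMARY KEY, text TEXT);
    CREATE TABLE eightks_other_events (id INTEGER PRIMARY KEY, text TEXT);
    INSERT INTO eightks (id, text) VALUES
        (1, 'Item 8.01 Other Events: the company announced a merger'),
        (2, 'Item 2.02 Results of Operations'),
        (3, 'Disclosure of other events and exhibits');
""")

# Naming the columns on both sides keeps the statement valid even if the
# two tables list their columns in a different order.
conn.execute("""
    INSERT INTO eightks_other_events (id, text)
    SELECT id, text
    FROM eightks
    WHERE text LIKE '%other events%'
""")
conn.commit()

copied_ids = [row[0] for row in conn.execute(
    "SELECT id FROM eightks_other_events ORDER BY id")]
print(copied_ids)
```

The same INSERT INTO ... SELECT shape should carry over to PostgreSQL, with the WHERE clause swapped for whatever text-matching condition ends up working.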
I have a massive list of strings in a text file, the file is about 100gb uncompressed.
Each line of the text file is a single word (roughly 50 characters long), with no spaces or punctuation.
The table will be created and populated from this text file once. Further updates to the table will not be necessary, and if it helps, the table can become read-only.
The use-case is a function which would look something like this:
/**
* Search the table and return true if the word exists, false if not.
*/
wordExists(wordToCheck: string): boolean {}
I'm looking for advice here on what would be the best way to store the data to ensure that lookups are as fast and efficient as possible.
I'm not sure whether breaking the word up into parts to assist in indexing would help, and I'm also not sure whether partitioning this list would help.
Anyone have any advice for me?
100 GB is fairly large. Assuming each word averages 50 characters in length, a single column would hold roughly 2 billion words/records. That is not beyond the capability of Postgres.
I suggest creating and populating your table, and then adding a hash index:
CREATE INDEX idx ON yourTable USING HASH (text_col);
Now any query against this table should be able to use the index for very rapid lookup. For example, to see if a word exists, use:
SELECT EXISTS (SELECT 1 FROM yourTable WHERE text_col = 'meatballs');
If you run EXPLAIN on this query, you should see the hash index being used, with a fast lookup time.
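As a rough sketch of how the wordExists function from the question could wrap that EXISTS query, the following uses Python's built-in sqlite3 module as a stand-in for PostgreSQL. The table name words and the sample data are assumptions, and SQLite only offers B-tree indexes, so the index here merely takes the place of the USING HASH index above.

```python
import sqlite3


def word_exists(conn: sqlite3.Connection, word: str) -> bool:
    """Return True if the word is present in the (hypothetical) words table."""
    row = conn.execute(
        "SELECT EXISTS (SELECT 1 FROM words WHERE text_col = ?)", (word,)
    ).fetchone()
    return bool(row[0])


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (text_col TEXT)")
# B-tree index in SQLite; stands in for the PostgreSQL hash index.
conn.execute("CREATE INDEX idx ON words (text_col)")
conn.executemany(
    "INSERT INTO words (text_col) VALUES (?)",
    [("meatballs",), ("spaghetti",)],
)

print(word_exists(conn, "meatballs"), word_exists(conn, "lasagna"))
```

Passing the word as a bound parameter rather than formatting it into the SQL string avoids quoting problems and SQL injection.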
I am working on a solution that involves merging two queries in Power Query to retrieve a single data table back to Excel. The first query is always populated but the other query comes from an ERP and might be empty (empty table) from time to time.
Appending the two queries involves making the header names the same in the two queries before the appending takes place. As the second query sometimes results in an empty table, the error arises in the steps when Power Query is modifying the header names in the second table (it cannot modify the header names as there are no headers).
"Error message: Expression.Error: The column 'PartMtl_Company' of the table wasn't found.
Details: PartMtl_Company" where the PartMtl_Company is the leftmost column in my table.
I am kind of thinking that I would need to evaluate whether the second table is empty and skip the renaming steps if that is the case. I assume merging the populated first table with an empty table would cause no problem and would only result in the first table. I have tried to look around for a suitable M-code but have not come across such.
I'm thinking you might be able to use Table.RowCount to solve this. Something along the lines of:
= if Table.RowCount(Table2) > 0 then...
You would modify the headers only if there is data in the second table. Same goes for the appending of the tables: you would only append if there is data in the second table, since you won't have renamed any headers otherwise.
Thank you Marc! That did the trick.
In the end, I wrote something along the lines of
= if Table.RowCount(Table2) > 0 then... (code that works on a non-empty table) ...else Table2
This returns the empty table if it is empty to begin with. Appending the second table to the first did not throw an error and returned only the first table, as planned.
Given a table named table and a string column named column, I want to search that column for the word word in the following way: exact matches on top, followed by prefix matches, and finally postfix matches.
Currently I have the following solutions:
Solution 1:
select column
from (select column,
case
when column like 'word' then 1
when column like 'word%' then 2
when column like '%word' then 3
end as rank
from table) as ranked
where rank is not null
order by rank;
Solution 2:
select column
from table
where column like 'word'
or column like 'word%'
or column like '%word'
order by case
when column like 'word' then 1
when column like 'word%' then 2
when column like '%word' then 3
end;
Now my question is: which of the two solutions is more efficient, or better yet, is there a solution better than both of them?
Your 2nd solution looks simpler for the planner to optimize, but it is possible that the first one gets the same plan as well.
In the WHERE clause, column like 'word' is not needed, as it is covered by column like 'word%'; keeping it might make the database do two checks instead of one.
But the biggest problem is the third pattern, '%word': a leading wildcard cannot be optimized by an ordinary (B-tree) index.
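Because a prefix pattern like 'word%' already matches the exact string, the filter only needs the prefix and postfix tests, and the ranking can stay in the ORDER BY. Here is a small sketch of that simplified query, run against Python's built-in sqlite3 as a stand-in for PostgreSQL. The names items and name are placeholders (table and column are reserved words), the exact match uses = instead of LIKE, and the word is assumed to contain no % or _ characters.

```python
import sqlite3


def ranked_matches(conn: sqlite3.Connection, word: str) -> list[str]:
    """Exact matches first, then prefix matches, then postfix matches."""
    query = """
        SELECT name
        FROM items
        WHERE name LIKE :prefix OR name LIKE :suffix
        ORDER BY CASE
                     WHEN name = :word THEN 1
                     WHEN name LIKE :prefix THEN 2
                     ELSE 3
                 END,
                 name
    """
    params = {"word": word, "prefix": word + "%", "suffix": "%" + word}
    return [row[0] for row in conn.execute(query, params)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT)")
conn.executemany(
    "INSERT INTO items (name) VALUES (?)",
    [("sword",), ("wordy",), ("other",), ("word",), ("keyword",)],
)

print(ranked_matches(conn, "word"))
```

The secondary sort on name is only there to make the order within each rank deterministic. Note that LIKE is case-insensitive for ASCII in SQLite but case-sensitive in PostgreSQL, so results can differ on mixed-case data.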
So either way, PostgreSQL is going to scan your full table and manually extract the matches. This is going to be slow for 20,000 rows or more.
I recommend exploring fuzzy string matching and full text search; it looks like that is what you're trying to emulate.
Even if you don't want the full power of FTS or fuzzy string matching, you should definitely add the extension "pg_trgm", as it will let you add a GIN index on the column that speeds up LIKE '%word' searches.
https://www.postgresql.org/docs/current/pgtrgm.html
And seriously, have a look at FTS. It does provide ranking. If your requirements are strictly what you described, you can still run the FTS query to "prefilter" and then apply this logic afterwards.
There are tons of introductory articles on PostgreSQL FTS; here's one:
https://www.compose.com/articles/mastering-postgresql-tools-full-text-search-and-phrase-search/
I also wrote a post recently, when I added FTS search to my site:
https://deavid.wordpress.com/2019/05/28/sedice-adding-fts-with-postgresql-was-really-easy/
I have been searching endlessly for the answer to this problem I have been having:
Our team uses a query that returns a dataset with 13 columns. We want to narrow down the results by returning only rows where the string value in column "Actual Collection" appears in the adjacent column "PrvPrComments". Additionally, we want to do the same thing for columns "Actual Manufacturer" and "PrvPrComments". If a string value in either Actual Collection or Actual Manufacturer exists in PrvPrComments, then we want to return that row; if it does not, we want to exclude it.
The tricky part is that PrvPrComments is a column that has long text strings in them and so the query needs to parse through to find and match the string. They also need to be exact matches so "Pillow Perfect" and "pillow" would not be the same thing.
Here is an example posted below. I would want to return rows that contains "cowboy" and "chandelier" because there is a match but not the others:
Example of data
My initial guess would be to write a query that uses Full Text Index and/or contains. Any help would be greatly appreciated and I apologize for not having a foundation code to post here, I'm fairly new to this and am having trouble with where to start.
Thank you
where PrvPrComments like '%' + actualCollection + '%'
If there is not much data, you can use a LIKE expression to return the rows:
WHERE PrvPrComments LIKE '%' + actualCollection + '%'
But if the data is huge and full-text search is not much use, you could add another column as a flag and populate it at insertion time (when PrvPrComments contains actualCollection, set the flag to 1). Later you only need to query the rows whose flag is 1.
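A minimal sketch of how such a flag could be computed at insertion time, written in Python on the application side. The helper names mentions and match_flag are hypothetical, the row is modelled as a plain dict keyed by the column names from the question, and case-insensitive whole-phrase matching is an assumption about what "exact match" should mean here.

```python
import re


def mentions(comments: str, value: str) -> bool:
    """Return True if value occurs in comments as a whole phrase."""
    if not value:
        return False
    # The lookarounds stop the phrase from matching inside a longer word.
    pattern = r"(?<!\w)" + re.escape(value) + r"(?!\w)"
    return re.search(pattern, comments, flags=re.IGNORECASE) is not None


def match_flag(row: dict) -> int:
    """Return 1 if the collection or the manufacturer appears in the comments."""
    comments = row["PrvPrComments"]
    return int(
        mentions(comments, row["Actual Collection"])
        or mentions(comments, row["Actual Manufacturer"])
    )
```

With this, a row whose collection is "Pillow Perfect" gets flag 0 when the comments only mention "pillow", while a row whose collection is "cowboy" gets flag 1 when the comments contain that word.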
I have a simple query which does a GROUP BY on two fields:
@facturas =
SELECT a.CodFactura,
Convert.ToInt32(a.Fecha.ToString("yyyyMMdd")) AS DateKey,
SUM(a.Consumo) AS Consumo
FROM @table_facturas AS a
GROUP BY a.CodFactura, a.DateKey;
@table_facturas has 4100 rows, but the query takes several minutes to finish. Looking at the graph explorer, I see it uses 2500 vertices because I have 2500 unique CodFactura+DateKey rows. I don't know if this is normal ADLA behaviour. Is there any way to reduce the number of vertices and execute this query faster?
First: I am not sure your query actually will compile. You would need the Convert expression in your GROUP BY or do it in a previous SELECT statement.
Secondly: In order to answer your question, we would need to know how the full query is defined. Where does @table_facturas come from? How was it produced?
Without this information, I can only give some wild speculative guesses:
If @table_facturas is coming from an actual U-SQL table, your table is over-partitioned/fragmented. This could be because:
You originally inserted a lot of data with a distribution on the grouping columns, and you have a predicate that reduces the number of rows per partition and/or you do not have up-to-date statistics (run CREATE STATISTICS on the columns).
You ran a lot of INSERT statements, each inserting a small number of rows into the table, thus creating a large number of individual files. This will "scale out" the processing as well. Use ALTER TABLE REBUILD to recompact.
If it is coming from a fileset, you may have too many small files in the input. See if you can merge them into fewer, larger files.
If the above does not help, you can also try hinting a small number of rows in the query that creates @table_facturas by adding OPTION(ROWCOUNT=4000).