Full text search when words have the same start - postgresql

I'm new to full text search engines. I have a table filled with French and English words, but I'm having weird issues when I try to implement the requests. The goal is to build a search engine with autocompletion. Currently the WHERE clauses are made with ILIKE, but I heard that that's not 'scalable'.
Let's say that I have a book table:
CREATE TABLE t_books (
title TEXT
);
INSERT INTO t_books VALUES('Admin');
INSERT INTO t_books VALUES('Administratif');
INSERT INTO t_books VALUES('Adminare'); -- not real French words
INSERT INTO t_books VALUES('Admininids');
INSERT INTO t_books VALUES('Admin2');
When I run this query:
SELECT b_id
FROM (
SELECT title as b_id,
to_tsvector('french', title) as document
FROM t_books
) p_search
WHERE p_search.document @@ to_tsquery('french', 'admin');
With the LIKEs I get all 5 rows, but this query only returns "Admin".
However, it seems to work better with English words (the titles above are a mix of French and English): for example, with the titles 'book', 'booking' and 'booked', looking up 'book' returns all 3 rows.
http://rextester.com/RKNE72921
What is wrong there? Is it a lack of French words in the dictionary?

Related

using 1 WITH statement with multiple (n) INSERTs

I wanted to know if it is possible to create one WITH statement and add multiple data values to another table with a SELECT, or do any equivalent thing.
I have 2 tables
one has data
create table bg_item(
item_id text primary key default 'a'||nextval('bg_item_seq'),
sellerid bigint not null references bg_user(userid),
item_type char(1) not null default 'N', -- NORMAL (public)
item_upload_date date NOT NULL default current_date,
item_name varchar(30) not null,
item_desc text not null
-- further columns (item_specs, item_category, ...) omitted
);
the other has image links:
create table item_images(
img_id bigint primary key default nextval('bg_item_image_seq'),
item_id text not null references bg_item (item_id),
image_link text not null
);
The user can add an item to sell and upload images of it. These images can be 3 or more. When I add the images and complete the item's description and everything from the app, my request goes to the backend, and I want to perform a query that adds the user's item, returns the item's id (which comes from a sequence in PostgreSQL), and uses that id to reference the images that I am inserting.
Currently I was doing this (for 1 image):
WITH ins1 AS (
INSERT INTO bg_item(sellerid,item_type,item_date,item_name,item_desc,item_specs,item_category)
VALUES (1005, 'k',default,'asdf','asdf','asd','asd')
RETURNING item_id
)
INSERT INTO item_images (item_id, image_link)
select item_id,'asdfg.asd.asdf.com' from ins1
(for 3 images)
WITH ins1 AS (
INSERT INTO bg_item(sellerid,item_type,item_date,item_name,item_desc,item_specs,item_category)
VALUES (1005, 'k',default,'asdf','asdf','asd','asd')
RETURNING item_id
)
INSERT INTO item_images (item_id, image_link)
select item_id,'asdfg.asd.asdf.com' from ins1
union all
select item_id,'asdfg.asdaws3f.com' from ins1
union all
select item_id,'asdfg.gooolefnsd.sfsjf.com' from ins1
This would work for 3 images.
So my question is how to do it with n number of images? (as user can upload from 1 to n images)
Can I write a for loop?
A procedure or function?
References:
With and Insert
Sql multiple insert select
I didn't understand Edit 3 (if it is related to my answer) in the above one.
One solution I can think of is to write a procedure to return the item_id and one more procedure to run multiple inserts, but I want a more efficient solution.
If you are going to work with SQL then there is a concept you need to expel from your thoughts: LOOP. As soon as you think it, it is time to rethink. It does not exist in SQL and is not typically needed. SQL works in sets of qualifying things, not individual things.
Now to your issue: it can be done in one statement. You pass your image list as an array of text in the WITH clause, then unnest that array and join to your existing CTE during the INSERT/SELECT:
with images (ilist) as
(
select array['image1','image2','image3','image4','image5']
)
, item (item_id) as
(
insert into bg_item(sellerid,item_type,item_date,item_name,item_desc,item_specs,item_category)
values (1005, 'k',default,'asdf','asdf','asd','asd')
returning item_id
)
insert into item_images (item_id, image_link)
select item_id,unnest (ilist)
from images
join item on true;
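The effect of that final INSERT/SELECT (one returned item_id paired with every element of the unnested array) can be sketched in Python; the id value here is a made-up placeholder, not something the question specifies:

```python
# Hypothetical item id, standing in for the value RETURNING item_id produces
item_id = "a1001"

# The text array passed in the WITH clause
images = ["image1", "image2", "image3", "image4", "image5"]

# "unnest(ilist) ... join item on true" pairs the single returned row
# with every element of the array, i.e. a cross join
rows = [(item_id, img) for img in images]
print(rows)
```

This is why no loop is needed: the set of array elements drives the number of inserted rows.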

Counting the Number of Occurrences of a Multi-Word Phrase in Text with PostgreSQL

I have a problem: I need to count the frequency of a word phrase appearing within a text field in a PostgreSQL database.
I'm aware of functions such as to_tsquery(), and I'm using to_tsquery('simple', 'sample text') to check whether a phrase exists within the text; however, I'm unsure how to count these occurrences accurately.
If the words are contained just once in the string (I am supposing here that your table contains two columns, an id and a text column called my_text):
SELECT
count(id)
FROM
my_table
WHERE
my_text ~* 'the_words_i_am_looking_for'
If the occurrences are more than one per field, this nested query can be used:
SELECT
id,
count(matches) as matches
FROM (
SELECT
id,
regexp_matches(my_text, 'the_words_i_am_looking_for', 'g') as matches
FROM
my_table
) t
GROUP BY 1
The syntax of this function and much more about string pattern matching can be found in the PostgreSQL documentation on pattern matching.
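The per-row counting that regexp_matches(..., 'g') performs can be illustrated in Python, where re.findall plays the role of the 'g' flag (the rows here are made-up sample data):

```python
import re

# (id, my_text) rows, mirroring the hypothetical my_table
rows = [
    (1, "sample text with sample text inside"),
    (2, "no match here"),
    (3, "sample text"),
]

phrase = r"sample text"

# Count non-overlapping occurrences per row, like regexp_matches(..., 'g')
# followed by GROUP BY id / count(matches)
counts = {row_id: len(re.findall(phrase, text)) for row_id, text in rows}
print(counts)  # {1: 2, 2: 0, 3: 1}
```

One difference worth noting: in the SQL version, rows with zero matches disappear from the output entirely, because regexp_matches returns no rows for them.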

Summarize repeated data in a Postgres table

I have a Postgres 9.1 table called ngram_sightings. Each row is a record of seeing an ngram in a document. An ngram can appear multiple times in a given document.
CREATE TABLE ngram_sightings
(
ngram VARCHAR,
doc_id INTEGER
);
I want to summarize this table in another table called ngram_counts.
CREATE TABLE ngram_counts
(
ngram VARCHAR PRIMARY KEY,
-- the number of unique doc_ids for a given ngram
doc_count INTEGER,
-- the count of a given ngram in ngram_sightings
corpus_count INTEGER
);
What is the best way to do this?
ngram_sightings is ~1 billion rows.
Should I create an index on ngram_sightings.ngram first?
Give this a shot!
INSERT INTO ngram_counts (ngram, doc_count, corpus_count)
SELECT
ngram
, count(distinct doc_id) AS doc_count
, count(*) AS corpus_count
FROM ngram_sightings
GROUP BY ngram;
-- EDIT --
Here is a longer version using some temporary tables. First, count how many documents each ngram is associated with. I'm using 'tf' for "term frequency" and 'df' for "document frequency", since you are heading in the direction of tf-idf vectorization, and you may as well use the standard language; it will help with the next few steps.
CREATE TEMPORARY TABLE ngram_df AS
SELECT
ngram
, count(distinct doc_id) AS df
FROM ngram_sightings
GROUP BY ngram;
Now you can create a table for the total count of each ngram.
CREATE TEMPORARY TABLE ngram_tf AS
SELECT
ngram
, count(*) AS tf
FROM ngram_sightings
GROUP BY ngram;
Then join the two on ngram.
CREATE TABLE ngram_tfidf AS
SELECT
tf.ngram
, tf.tf
, df.df
FROM ngram_tf tf
INNER JOIN ngram_df df ON tf.ngram = df.ngram;
At this point, I expect you will be looking up ngrams quite a bit, so it makes sense to index the last table on ngram. Keep me posted!
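As a sanity check, the two aggregates (distinct-document count vs. total occurrence count) can be computed over a toy sightings list in Python; the ngrams and doc ids below are invented for illustration:

```python
from collections import defaultdict

# (ngram, doc_id) pairs, mirroring rows of ngram_sightings
sightings = [
    ("data base", 1), ("data base", 1), ("data base", 2),
    ("full text", 1),
]

docs = defaultdict(set)    # ngram -> distinct doc_ids  (df / doc_count)
totals = defaultdict(int)  # ngram -> total occurrences (tf / corpus_count)
for ngram, doc_id in sightings:
    docs[ngram].add(doc_id)
    totals[ngram] += 1

counts = {g: (len(docs[g]), totals[g]) for g in totals}
print(counts)  # {'data base': (2, 3), 'full text': (1, 1)}
```

Note how "data base" appears twice in doc 1 but only bumps doc_count once, exactly the count(distinct doc_id) vs. count(*) distinction.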

using like tsql with a list of values

In Transact-SQL I have:
DECLARE @phrase nvarchar(max) = 'KeyWord1 KeyWord2 ,KeyWord3 ' -- and maybe more, separated by space, comma or ';' (but mainly by space = it's a phrase)
I have a table Students
Students
(
StudentId bigint,
FullName nvarchar(50),
Article nvarchar(max)
)
I want to filter students by article, bringing those whose article contains a word of @phrase.
Something like:
DECLARE @WordTable TABLE
(
Word nvarchar(50)
)
INSERT INTO @WordTable
SELECT Word of @phrase
SELECT *
FROM Students
WHERE Article LIKE (Word in @phrase)
I would split your string (comma delimited) into a temp table of your word phrases and perform a join to the Students table. From there you can make better use of the data than you would have in string format.
There are plenty of ways of splitting a string into a table:
http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=50648
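Whatever splitter you pick, the logic is just tokenising on space, comma, or semicolon. Sketched in Python, using the phrase from the question:

```python
import re

phrase = "KeyWord1 KeyWord2 ,KeyWord3 "

# Split on any run of spaces, commas, or semicolons and drop empty
# tokens, mimicking what a string-split function feeding the temp
# table would produce
words = [w for w in re.split(r"[ ,;]+", phrase) if w]
print(words)  # ['KeyWord1', 'KeyWord2', 'KeyWord3']
```

Each resulting word becomes one row of the temp table that you then join against.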
Once you have your temp table you can use something like this.
SELECT S.*
FROM Students S (NOLOCK)
JOIN #tmpArticles A
ON S.Article LIKE '%' + A.Article + '%'
A word of caution though: LIKE on %X% has terrible performance, so question your approach if you have a lot of string data.
This problem seems more geared towards a Full Text Search approach (FTS)
http://msdn.microsoft.com/en-us/library/ms142571.aspx

SQL query question

I have a table that has about 6 fields. I want to write a SQL statement that will return all records that do not have "England" in the country field, "English" in the language field, or "english" in the comments field.
What would the SQL query look like?
Well, your question depends a lot on what DBMS you're using and what your table setup looks like. This would be one way to do it in MySQL or T-SQL:
SELECT *
FROM tbl
WHERE country NOT LIKE '%England%' AND language NOT LIKE '%english%'
AND comments NOT LIKE '%english%';
The way you word your question makes it sound like all these fields could contain a lot of text, in which case the above query would be the way to go. However, more likely than not you'd be looking for exact matches in a real database:
SELECT *
FROM tbl
WHERE country != 'England' AND language != 'english'
AND comments NOT LIKE '%english%';
Start with this and modify as necessary:
SELECT *
FROM SixFieldTable
WHERE Country <> 'England'
AND language <> 'english'
AND comments NOT LIKE '%english%'
Hope this helps.
Are you wanting something like
select * from myTableOfMadness
where country <> 'England'
and language <> 'English'
and comments not like '%english%'
Not sure if you want 'and's or 'or's, or all 'not' comparisons. Your sentence structure is somewhat misleading.
The above solutions do not appear to account for possible NULLs in the columns. A condition like
Where country <> 'England'
will erroneously exclude entries where Country is NULL, under default SQL Server connection settings.
Instead, you could try using
IsNull(Country, '') <> 'England'
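The reason is SQL's three-valued logic: NULL <> 'England' evaluates to unknown, not true, so the row is filtered out. A small Python sketch of the IsNull fix, simulating NULL with None (the rows are invented sample data):

```python
rows = [("England",), ("France",), (None,)]

# Without coalescing, the NULL row is excluded, mirroring SQL where
# NULL <> 'England' is unknown and therefore does not qualify
naive = [r for r in rows if r[0] is not None and r[0] != "England"]

# IsNull(Country, '') <> 'England' keeps the NULL row, because the
# comparison now runs against '' instead of NULL
fixed = [r for r in rows if (r[0] or "") != "England"]

print(naive)  # [('France',)]
print(fixed)  # [('France',), (None,)]
```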
To ignore case:
SELECT *
FROM SixFieldTable
WHERE LOWER(Country) <> 'england' AND
LOWER(language) <> 'english' AND
LOWER(comments) NOT LIKE '%english%'
Try This
Select * From table
Where Country Not Like '%England%'
And Language Not Like '%English%'
And comments Not Like '%English%'