Postgresql tsvector structure - postgresql

lo here.
i am trying to utilizing tsvector for counting frequencies of terms.
i think i am almost there but i cannot find a way to obtain terms from tsvector structure.
what I have done is, after creating tsvector column:
select term_tsv, count(*) count from (select unnest(term_tsv) term_tsv from document_tsv) t group by term_tsv order by count desc;
the result is like:
stem_tsv | count
------------------------+-------
(3,{9},{D}) | 1
i am lost for not knowing what kind of expression the parenthesis represents.
can anybody tell me how to extract the term from shell?
thank you.

i figured out that something like the following lists top 10 frequent entries,
which was written in the official manual.
SELECT * FROM ts_stat('SELECT vector FROM apod')
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;
just for record.

Related

sql restriction for join table with string similarity rule

My Db is building from some tables that are similar to each other and share the same column names. The reason is to perform a comparison between data from each resource.
table_A and table_B: id, product_id, capacitor_name, ressitance
It is easy to join tables by product_id and see the comparison,
but I need to compare data between product_id if exists in both tables and if not I want to compare by name similarity and if similarity restricts the result for up to 3 results.
The names most of the time are not equal this is why I'm using a similarity.
SELECT * FROM table_a ta
JOIN table_b tb
ON
ta.product_id = tb.product_id
OR
similarity(ta.name,tb.name) > 0.8
It works fine. But the problem is sometimes I'm getting more data than I need, how can I restrict it? (and moreover, order it by similarity in order to get higher similarity names).
If you want to benefit from an trigram index, you need to use the operator form (%), not the function form. Then you would order on two "columns", the first to be exact matches first, the 2nd to put most similar matches after and in order. And use LIMIT to do the limit. I've assumed you have some WHERE condition to restrict this to just one row of table_a. If not, then your question is not very well formed. To what is this limit supposed to apply? Each what should be limited to just 3?
SELECT * FROM table_a ta
JOIN table_b tb
ON
ta.product_id = tb.product_id
OR
ta.name % tb.name
WHERE ta.id=$1
ORDER BY ta.product_id = tb.product_id desc, similarity(ta.name,tb.name) desc
LIMIT 3

PostgreSQL nested selects

Is there anybody who can help me with making a query with the following functionality:
Let's have a simple statement like:
SELECT relname FROM pg_catalog.pg_class WHERE relkind = 'r';
This will produce a nice result with a single column - the names of all tables.
Now lets imagine that one of the tables has name "table1". If we execute:
SELECT count(*) FROM table1;
we will get the number of rows of the table "table1".
Now the real question - how these two queries can be unified and to have one query, which to give the result of two columns: name of the table and number of rows? Written in pseudo SQL it should be something like this:
SELECT relname, (SELECT count(*) FROM relname::[as table name]) FROM pg_catalog.pg_class WHERE relkind = 'r';
And here is and example - if there are 3 tables in the database and the names are table1, table2 and table 3, and they have respectively 20, 30 and 40 rows, the query result should be like this:
-------------
|relname| rows|
|-------------|
|table1 | 20|
|-------------|
|table2 | 30|
|-------------|
|table3 | 40|
-------------
Thanks to everyone who is willing to help ;-)
P.S. Yes I know that the table name is not schema-qualified ;-) Let's hope that all tables in the database have unique names ;-)
(Corrected typos from rename to relname in last query)
EDIT1: The question is not related to "how can I find the number of rows in a table". What I'm asking is: how to build a query with 2 selects and the second to have as FROM the value of a column from the result of the first select.
EDIT2: As #jdigital suggested I've tried the dynamic querying and it does the job, but can be used only in PL/pgSQL. So it doesn't fit my needs. In additional I tried with PREPARE and EXECUTE statement - yet again it is not working. Anyway - I'll stick with the two queries approach. But I'm damn sure that PostgreSQL is capable of this ....
With PL/pgSQL (postgres SQL Procedural Language), you can execute dynamic queries by building a string and then executing it as SQL. Note that this is postgres-specific, but other databases are likely to have something equivalent. Even more generally, if you are willing to go beyond SQL, you can do this with any programming language (or shell/cmd script).
By the way, you'll get better results searching for "postgres dynamic query" since "nested select" has a different meaning.

Alternative when IN clause is inputed A LOT of values (postgreSQL)

I'm using the IN clause to retrieve places that contains certain tags. For that I simply use
select .. FROM table WHERE tags IN (...)
For now the number of tags I provide in the IN clause is around 500) but soon (in the near future) number tags will probably jump off to easily over 5000 (maybe even more)
I would guess there is some kind of limition in both the size of the query AND in the number values in the IN clause (bonus question for curiosity what is this value?)
So my question is what is a good alternative query that would be future proof even if in the future I would be matching against let's say 10'000 tags ?
ps: I have looked around and see people mentioning "temporary table". I have never used those. How will they be used in my case? Will i need to create a temp table everytime I make a query ?
Thanks,
Francesco
One option is to join this to a values clause
with parms (tag) as (
values ('tag1'), ('tag2'), ('tag3')
)
select t.*
from the_table t
join params p on p.tag = t.tag;
You could create a table using:
tablename
id | tags
----+----------
1 | tag1
2 | tag2
3 | tag3
And then do:
select .. FROM table WHERE tags IN (SELECT * FROM tablename)

SqlAlchemy: count of distinct over multiple columns

I can't do:
>>> session.query(
func.count(distinct(Hit.ip_address, Hit.user_agent)).first()
TypeError: distinct() takes exactly 1 argument (2 given)
I can do:
session.query(
func.count(distinct(func.concat(Hit.ip_address, Hit.user_agent))).first()
Which is fine (count of unique users in a 'pageload' db table).
This isn't correct in the general case, e.g. will give a count of 1 instead of 2 for the following table:
col_a | col_b
----------------
xx | yy
xxy | y
Is there any way to generate the following SQL (which is valid in postgresql at least)?
SELECT count(distinct (col_a, col_b)) FROM my_table;
distinct() accepts more than one argument when appended to the query object:
session.query(Hit).distinct(Hit.ip_address, Hit.user_agent).count()
It should generate something like:
SELECT count(*) AS count_1
FROM (SELECT DISTINCT ON (hit.ip_address, hit.user_agent)
hit.ip_address AS hit_ip_address, hit.user_agent AS hit_user_agent
FROM hit) AS anon_1
which is even a bit closer to what you wanted.
The exact query can be produced using the tuple_() construct:
session.query(
func.count(distinct(tuple_(Hit.ip_address, Hit.user_agent)))).scalar()
Looks like sqlalchemy distinct() accepts only one column or expression.
Another way around is to use group_by and count. This should be more efficient than using concat of two columns - with group by database would be able to use indexes if they do exist:
session.query(Hit.ip_address, Hit.user_agent).\
group_by(Hit.ip_address, Hit.user_agent).count()
Generated query would still look different from what you asked about:
SELECT count(*) AS count_1
FROM (SELECT hittable.user_agent AS hittableuser_agent, hittable.ip_address AS sometable_column2
FROM hittable GROUP BY hittable.user_agent, hittable.ip_address) AS anon_1
You can add some variables or characters in concat function in order to make it distinct. Taking your example as reference it should be:
session.query(
func.count(distinct(func.concat(Hit.ip_address, "-", Hit.user_agent))).first()

tsearch2 word statistics

I have a column with a tsvector in my table. Now I'd like to find out witch words are represented above average, to add these to the stop word list. Is there a function for tsearch2 to list the frequency of all words in the index?
SELECT * FROM ts_stat('SELECT ts_vector_col FROM mytable')
ORDER BY nentry DESC, ndoc DESC, word ;
Gathering Document Statistics