Postgres and Word Clouds - postgresql

I would like to know if it's possible to create a Postgres function that scans some table rows and creates a table containing WORD and AMOUNT (frequency). My goal is to use this table to create a word cloud.

There is a simple way, but it can be slow (depending on your table size). You can split your text into an array:
SELECT string_to_array(lower(words), ' ') FROM your_table;
You can then use unnest to expand those arrays into rows and aggregate them:
WITH words AS (
SELECT unnest(string_to_array(lower(words), ' ')) AS word
FROM your_table
)
SELECT word, count(*) FROM words
GROUP BY word;
This is a simple approach and it has some issues; for example, it only splits words on spaces, not on punctuation marks.
Another, and probably better, option is to use PostgreSQL full text search.
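The limitation mentioned above can be illustrated outside the database. The following is a minimal Python sketch, not part of the original answer; the sample text and the function names are made up for illustration. It compares splitting on a single space with extracting runs of word characters:

```python
import re
from collections import Counter

def count_by_space(text):
    # Mirrors string_to_array(lower(text), ' '): split on single spaces only.
    return Counter(text.lower().split(' '))

def count_by_word_chars(text):
    # Extract runs of letters, digits and underscores, so punctuation is dropped.
    return Counter(re.findall(r'\w+', text.lower()))

sample = "Cloud, cloud. A word cloud!"
by_space = count_by_space(sample)
by_words = count_by_word_chars(sample)
print(by_space)   # 'cloud,', 'cloud.' and 'cloud!' are counted as different words
print(by_words)   # 'cloud' is counted three times
```

Inside Postgres, splitting with regexp_split_to_table on a non-word pattern is one comparable option, although full text search handles more cases.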

Late to the party, but I also needed this and wanted to use full text search, which conveniently removes HTML tags.
So basically you convert the text to a tsvector and then use ts_stat:
select word, nentry
from ts_stat($q$
select to_tsvector('simple', '<div id="main">a b c <b>b c</b></div>')
$q$)
order by nentry desc
Result:
|word|nentry|
|----|------|
|c |2 |
|b |2 |
|a |1 |
But this does not scale well, so here is what I ended up with:
Setup:
-- table with a gist index on the tsvector column
create table wordcloud_data (
html text not null,
tsv tsvector not null
);
create index on wordcloud_data using gist (tsv);
-- trigger to update the tsvector column
create trigger wordcloud_data_tsvupdate
before insert or update on wordcloud_data
for each row execute function tsvector_update_trigger(tsv, 'pg_catalog.simple', html);
-- a view for the wordcloud
create view wordcloud as select word, nentry from ts_stat('select tsv from wordcloud_data') order by nentry desc;
Usage:
-- insert some data
insert into wordcloud_data (html) values
('<div id="id1">aaa</div> <b>bbb</b> <i attribute="ignored">ccc</i>'),
('<div class="class1"><span>bbb</span> <strong>ccc</strong> <pre>ddd</pre></div>');
After that your wordcloud view should look like this:
|word|nentry|
|----|------|
|ccc |2 |
|bbb |2 |
|ddd |1 |
|aaa |1 |
Bonus features:
Replace simple with, for example, english, and Postgres will strip out stop words and do stemming for you.
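Once the view returns rows, the counts still have to be mapped to something a word cloud renderer can use. A minimal Python sketch, assuming the rows have already been fetched as (word, nentry) tuples; the function name font_sizes and the pixel range are hypothetical choices, not part of the answer above:

```python
def font_sizes(rows, min_px=12, max_px=48):
    """Linearly scale each word's count to a font size in pixels."""
    if not rows:
        return {}
    counts = [n for _, n in rows]
    lo, hi = min(counts), max(counts)
    span = hi - lo
    sizes = {}
    for word, n in rows:
        if span == 0:
            # All words are equally frequent: give them all the largest size.
            sizes[word] = max_px
        else:
            sizes[word] = round(min_px + (n - lo) * (max_px - min_px) / span)
    return sizes

# Rows as they would come back from "select word, nentry from wordcloud".
rows = [("ccc", 2), ("bbb", 2), ("ddd", 1), ("aaa", 1)]
sizes = font_sizes(rows)
print(sizes)
```

A linear scale is the simplest choice; with very skewed counts a logarithmic scale may look better.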

Related

Postgresql + psycopg: Bulk Insert large data with POSTGRESQL function call

I am working with large, very large amount of very simple data (point clouds). I want to insert this data into a simple table in a Postgresql database using Python.
An example of the insert statement I need to execute is as follows:
INSERT INTO points_postgis (id_scan, scandist, pt) VALUES (1, 32.656, ST_MakePoint(1.1, 2.2, 3.3));
Note the call to the Postgresql function ST_MakePoint in the INSERT statement.
I must call this billions (yes, billions) of times, so obviously I must insert the data into Postgresql in a more optimized way. There are many strategies for bulk inserting data, as this article presents in a very good and informative way (insertmany, copy, etc.).
https://hakibenita.com/fast-load-data-python-postgresql
But no example shows how to do these inserts when you need to call a function on the server-side. My question is: how can I bulk INSERT data when I need to call a function on the server-side of a Postgresql database using psycopg?
Any help is greatly appreciated! Thank you!
Please note that using a CSV doesn't make much sense because my data is huge.
Alternatively, I already tried filling a temp table with simple columns for the 3 inputs of the ST_MakePoint function and then, once all the data is in this temp table, running an INSERT/SELECT. The problem is that this takes a lot of time, and the amount of disk space I need for it is nonsensical.
The most important thing, in order to do this within a reasonable time and with minimum effort, is to break the task down into component parts, so that you can take advantage of different Postgres features separately.
First, create the table without the geometry transformation, for example:
create table temp_table (
id_scan bigint,
scandist numeric,
pt_1 numeric,
pt_2 numeric,
pt_3 numeric
);
Since we do not add any indexes and constraints, this will be most likely the fastest way to get the "raw" data into the RDBMS.
The best way to do this is the COPY command, which you can use either from Postgres directly (if you have sufficient access) or via the Python interface by using https://www.psycopg.org/docs/cursor.html#cursor.copy_expert
Here is example code to achieve this:
import psycopg2
# target_host, target_usr, target_db and target_pw are placeholders for your connection settings
iconn_string = "host={0} user={1} dbname={2} password={3} sslmode={4}".format(target_host, target_usr, target_db, target_pw, "require")
iconn = psycopg2.connect(iconn_string)
import_cursor = iconn.cursor()
csv_filename = '/path/to/my_file.csv'
copy_sql = "COPY temp_table (id_scan, scandist, pt_1, pt_2, pt_3) FROM STDIN WITH CSV HEADER DELIMITER ',' QUOTE '\"' ESCAPE '\\' NULL AS 'null'"
# stream the file to the server through COPY ... FROM STDIN
with open(csv_filename, mode='r', encoding='utf-8', errors='ignore') as csv_file:
    import_cursor.copy_expert(copy_sql, csv_file)
iconn.commit()
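COPY ... FROM STDIN does not strictly need a file on disk, since copy_expert reads from a file-like object. The sketch below, under the assumption that the points arrive from some Python iterable in memory, builds a small in-memory buffer per chunk. The helper names are hypothetical, and the cursor is assumed to be a psycopg2 cursor:

```python
import io

COPY_SQL = (
    "COPY temp_table (id_scan, scandist, pt_1, pt_2, pt_3) "
    "FROM STDIN WITH (FORMAT csv)"
)

def rows_to_buffer(rows):
    """Serialise numeric rows as CSV text in memory (numbers need no quoting)."""
    buf = io.StringIO()
    for row in rows:
        buf.write(",".join(str(v) for v in row))
        buf.write("\n")
    buf.seek(0)  # rewind so the consumer reads from the start
    return buf

def copy_chunk(cursor, rows):
    """Send one chunk of rows through COPY; `cursor` is assumed to be a psycopg2 cursor."""
    cursor.copy_expert(COPY_SQL, rows_to_buffer(rows))

# Two example rows: (id_scan, scandist, x, y, z)
chunk = [(1, 32.656, 1.1, 2.2, 3.3), (1, 30.5, 4.0, 5.0, 6.0)]
text = rows_to_buffer(chunk).getvalue()
print(text)
```

Calling copy_chunk once per chunk keeps memory use bounded by the chunk size rather than by the size of the whole data set.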
The next step is to create the table you actually want from the raw data. You can build the target table with a single SQL statement and let the RDBMS do its magic.
Once the data is in the RDBMS, it makes sense to optimize it a little and add an index or two if applicable (preferably a primary key or unique index, to speed up the transformation).
This depends on your data and use case, but something like this should help:
alter table temp_table add primary key (id_scan); --if unique
-- or
create index idx_temp_table_1 on temp_table(id_scan); --if not unique
To move data from raw into your target table:
with temp_t as (
select id_scan, scandist, ST_MakePoint(pt_1, pt_2, pt_3) as pt from temp_table
)
INSERT INTO points_postgis (id_scan, scandist, pt)
SELECT temp_t.id_scan, temp_t.scandist, temp_t.pt
FROM temp_t;
This will in one go select all data from the previous table and transform it.
A second option is similar. You can load all the raw data into points_postgis directly, keeping the coordinates in 3 temporary columns. Then add the geometry column, fill it with an update, and drop the temporary columns:
alter table points_postgis add column pt geometry;
update points_postgis set pt = ST_MakePoint(pt_1, pt_2, pt_3);
alter table points_postgis drop column pt_1, drop column pt_2, drop column pt_3;
The main takeaway is that the most performant option is not to concentrate on the final table state, but to break the work down into easily achievable chunks. Postgres will easily handle both the import of billions of rows and their transformation afterwards.
Some simple examples using a function that generates a UPC-A barcode with a check digit:
Using execute_batch. execute_batch has a page_size argument that controls how many statements are joined and sent to the server in one round trip. The default is 100, so 100 inserts are sent at a time. You can increase it to make fewer round trips to the server.
Using just execute and selecting the data from another table.
import psycopg2
from psycopg2.extras import execute_batch
con = psycopg2.connect(dbname='test', host='localhost', user='postgres',
port=5432)
cur = con.cursor()
cur.execute('create table import_test(id integer, suffix_val varchar, upca_val varchar)')
con.commit()
# Input data as a list of tuples. Means some data is duplicated.
input_list = [(1, '12345', '12345'), (2, '45278', '45278'), (3, '61289',
'61289')]
execute_batch(cur, 'insert into import_test values(%s, %s, upc_check_digit(%s))', input_list)
con.commit()
select * from import_test ;
id | suffix_val | upca_val
----+------------+--------------
1 | 12345 | 744835123458
2 | 45278 | 744835452787
3 | 61289 | 744835612891
# Input data as list of dicts and using named parameters to avoid duplicating data.
input_list_dict = [{'id': 50, 'suffix_val': '12345'}, {'id': 51, 'suffix_val': '45278'}, {'id': 52, 'suffix_val': '61289'}]
execute_batch(cur, 'insert into import_test values(%(id)s, %(suffix_val)s, upc_check_digit(%(suffix_val)s))', input_list_dict)
con.commit()
select * from import_test ;
id | suffix_val | upca_val
----+------------+--------------
1 | 12345 | 744835123458
2 | 45278 | 744835452787
3 | 61289 | 744835612891
50 | 12345 | 744835123458
51 | 45278 | 744835452787
52 | 61289 | 744835612891
# Create a table with values to be used for inserting into final table
cur.execute('create table input_vals (id integer, suffix_val varchar)')
con.commit()
execute_batch(cur, 'insert into input_vals values(%s, %s)', [(100, '76234'),
(101, '92348'), (102, '16235')])
con.commit()
cur.execute('insert into import_test select id, suffix_val, upc_check_digit(suffix_val) from input_vals')
con.commit()
select * from import_test ;
id | suffix_val | upca_val
-------+------------+--------------
1 | 12345 | 744835123458
2 | 45278 | 744835452787
3 | 61289 | 744835612891
12345 | 12345 | 744835123458
45278 | 45278 | 744835452787
61289 | 61289 | 744835612891
100 | 76234 | 744835762343
101 | 92348 | 744835923485
102 | 16235 | 744835162358
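Applied to the question's points_postgis table, the same idea can be sketched with execute_values from psycopg2.extras, whose template argument lets each row's placeholders be wrapped in a server-side function call. This is only a sketch, assuming a psycopg2 cursor and points supplied as (id_scan, scandist, x, y, z) tuples; the helper names and the chunk size are hypothetical:

```python
from itertools import islice

INSERT_SQL = "INSERT INTO points_postgis (id_scan, scandist, pt) VALUES %s"
# One row of the VALUES list; the last three parameters feed ST_MakePoint.
ROW_TEMPLATE = "(%s, %s, ST_MakePoint(%s, %s, %s))"

def chunked(iterable, size):
    """Yield lists of at most `size` items so the whole data set never sits in memory."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def insert_points(cursor, points, chunk_size=10000):
    """Insert points chunk by chunk; `cursor` is assumed to be a psycopg2 cursor."""
    # Imported here so the chunking helper can be used without psycopg2 installed.
    from psycopg2.extras import execute_values
    for chunk in chunked(points, chunk_size):
        execute_values(cursor, INSERT_SQL, chunk,
                       template=ROW_TEMPLATE, page_size=len(chunk))

# The chunking helper on its own:
chunks = list(chunked(range(5), 2))
print(chunks)
```

Each chunk becomes a single multi-row INSERT, so the function call runs on the server while the number of round trips stays small.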

How to break apart a column includes keys and values into separate columns in postgres

I am new to Postgres and have basically no experience. I have a table with a column that contains keys and values. I need to write a query that returns all the columns of the table plus additional columns, with each key as the column name and its value underneath.
My input is like:
id    | name | message
12478 | A    | {img_type:=png,key_id:=f235, client_status:=active, request_status:=open}
12598 | B    | {img_type:=none,address_id:=c156, client_status:=active, request_status:=closed}
The output should be:
id    | name | Img_type | Key_id | address_id | Client_status | Request_status
12478 | A    | png      | f235   | NULL       | active        | open
12598 | B    | none     | NULL   | c156       | active        | closed
Any help would be greatly appreciated.
The only thing I can think of, is a regular expression to extract the key/value pairs.
select id, name,
(regexp_match(message, '(img_type:=)([^,}]+),{0,1}'))[2] as img_type,
(regexp_match(message, '(key_id:=)([^,}]+),{0,1}'))[2] as key_id,
(regexp_match(message, '(client_status:=)([^,}]+),{0,1}'))[2] as client_status,
(regexp_match(message, '(request_status:=)([^,}]+),{0,1}'))[2] as request_status
from the_table;
regexp_match returns an array of matches. As the regex contains two groups (one for the "key" and one for the "value"), the [2] takes the second element of the array.
This is quite expensive and error prone (e.g. if any of the values contains a , and you need to deal with quoted values). If you have any chance to change the application that stores the value, you should seriously consider changing your code to store a proper JSON value, e.g.
{"img_type": "png", "key_id": "f235", "client_status": "active", "request_status": "open"}
Then you can use e.g. message ->> 'img_type' to retrieve the value for the key img_type.
You might also want to consider a properly normalized table, where each of those keys is a real column.
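If the application cannot be changed right away, existing values could be converted to JSON once, outside the database. A minimal Python sketch, assuming (as in the sample data) that neither keys nor values contain commas, braces, or the := separator; the function name is hypothetical:

```python
import json

def message_to_dict(message):
    """Parse '{k:=v, k2:=v2}' into a dict of strings."""
    body = message.strip().strip('{}')
    result = {}
    for pair in body.split(','):
        pair = pair.strip()
        if not pair:
            continue  # skip empty fragments, e.g. from a trailing comma
        key, _, value = pair.partition(':=')
        result[key.strip()] = value.strip()
    return result

msg = '{img_type:=png,key_id:=f235, client_status:=active, request_status:=open}'
parsed = message_to_dict(msg)
as_json = json.dumps(parsed)
print(as_json)
```

The resulting string can be stored in a json or jsonb column and queried with the ->> operator.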
I can do it with a function.
I am not sure about the performance, but here is my suggestion:
CREATE TYPE log_type AS (img_type TEXT, key_id TEXT, address_id TEXT, client_status TEXT, request_status TEXT);
CREATE OR REPLACE FUNCTION populate_log(data TEXT)
RETURNS log_type AS
$func$
DECLARE
r log_type;
BEGIN
select x.* into r
from
(
select
json_object(array_agg(array_data)) as json_data
from (
select unnest(string_to_array(trim(unnest(string_to_array(substring(populate_log.data, '[^{}]+'), ','))), ':=')) as array_data
) d
) d2,
lateral json_to_record(json_data) as x(img_type text, key_id text, address_id text, client_status text, request_status text);
RETURN r;
END
$func$ LANGUAGE plpgsql;
with log_data (id, name, message) as (
values
(12478, 'A', '{img_type:=png,key_id:=f235, client_status:=active, request_status:=open}'),
(12598, 'B', '{img_type:=none,address_id:=c156, client_status:=active, request_status:=closed}')
)
select id, name, l.*
from log_data, lateral populate_log(message) as l;
The query you finally write will look something like this, assuming the data is in a table named log_data:
select id, name, l.*
from log_data, lateral populate_log(message) as l;
I assume that the message column is text. In Postgres it might be an array instead; in that case you have to remove some of the conversions: string_to_array(substring(populate_log.data)) -> populate_log.data

Change column name from aggregate function default postgresql

I created a (big) table like so:
create table names_and_pics as (
select e.emp_name, e.dept, max(p.prof_pic)
from employees e
left join profiles p
on e.emp_id = p.emp_id
group by e.emp_name, e.dept );
select * from names_and_pics;
emp_name | dept | max(p.prof_pic)
Dan | IT | 1234.img
Phil | HR | 3344.img
...
Because I forgot to give the 3rd field a name, I now need to rename it to "img_link". The syntax I've been trying is
alter table names_and_pics rename max(p.prof_pic) to img_link;
That gives the following error:
Syntax Error at or near "("
Any ideas how to fix this?
You need to put the column name in double quotes because it contains characters that are not valid in an unquoted identifier:
alter table names_and_pics rename "max(p.prof_pic)" to img_link;
More about quoted identifiers in the manual
http://www.postgresql.org/docs/current/static/sql-syntax-lexical.html#SQL-SYNTAX-IDENTIFIERS
Btw: the parentheses around the select in your create table ... as select statement are useless noise
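The quoting rule itself is mechanical: wrap the name in double quotes and double any double quote inside it. A small Python sketch of that rule for building such a statement from code; the helper name is hypothetical, and in real code a library facility such as psycopg2.sql.Identifier is usually the safer choice:

```python
def quote_ident(name):
    """Return `name` as a double-quoted SQL identifier."""
    # An embedded double quote is escaped by doubling it.
    return '"' + name.replace('"', '""') + '"'

stmt = "alter table names_and_pics rename {} to img_link;".format(
    quote_ident("max(p.prof_pic)")
)
print(stmt)
```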

How to convert tsvector?

A typical and relevant application of tsvector is to query and summarize information about the set of words that occur and their frequency. JSONB is the natural choice (!) for representing the tsvector datatype in these "querying applications". So:
Is there a simple workaround to cast tsvector into JSONB?
Example: counting the global frequency of words over cached tsvectors would be something like this query:
SELECT r.key as word, SUM(r.value) as occurrences
FROM (
SELECT jsonb_each(kx_tsvectot::jsonb) as r FROM terms
) t
GROUP BY 1;
You can use the ts_stat() function, which will give you exactly what you need. It returns the following columns:
word text — the value of a lexeme
ndoc integer — number of documents (tsvectors) the word occurred in
nentry integer — total number of occurrences of the word
An example:
CREATE TABLE t (
tsv TSVECTOR
);
INSERT INTO t VALUES
('word'::TSVECTOR),
('second word'::TSVECTOR),
('third word'::TSVECTOR);
SELECT * FROM
ts_stat('SELECT tsv FROM t');
Result:
word | ndoc | nentry
--------+------+--------
word | 3 | 3
third | 1 | 1
second | 1 | 1
(3 rows)
If you still want to convert it to jsonb, you can convert word from text to jsonb, for example with to_jsonb(word).
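The difference between ndoc and nentry is easy to miss in the example above, because every word occurs once per document. The following Python sketch mimics the two counters on plain lists of words; it only illustrates the definitions, it is not how ts_stat is implemented, and the sample documents are made up:

```python
from collections import Counter

def word_stats(documents):
    """Return {word: (ndoc, nentry)} for documents given as lists of words."""
    ndoc = Counter()
    nentry = Counter()
    for words in documents:
        nentry.update(words)       # every occurrence counts
        ndoc.update(set(words))    # each document counts once per word
    return {w: (ndoc[w], nentry[w]) for w in nentry}

# "word" appears twice in the second document.
docs = [["word"], ["second", "word", "word"], ["third", "word"]]
stats = word_stats(docs)
print(stats)
```

For a word cloud, nentry is usually the relevant number; ndoc is useful when a word repeated many times in one document should not dominate.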

Search inside full search column using certain letters

I want to search inside a full-text search column using only a few letters. I mean:
select "Name","Country","_score" from datatable where match("Country", 'China');
Returns many rows and is ok. My question is, how can I search for example:
select "Name","Country","_score" from datatable where match("Country", 'Ch');
I want to see, China, Chile, etc.
I think that the match_type phrase_prefix could be the answer, but I don't know the correct syntax for using it.
The match predicate supports different match types through the clause using match_type [with (match_parameter = [value])].
So in your example using the phrase_prefix match type:
select "Name","Country","_score" from datatable where match("Country", 'Ch') using phrase_prefix;
gives you your desired results.
See the match predicate documentation: https://crate.io/docs/en/latest/sql/fulltext.html?#match-predicate
If you just need to match the beginning of a string column, you don't need a fulltext analyzed column. You can use the LIKE operator instead, e.g.:
cr> create table names_table (name string, country string);
CREATE OK (0.840 sec)
cr> insert into names_table (name, country) values ('foo', 'China'), ('bar','Chile'), ('foobar', 'Austria');
INSERT OK, 3 rows affected (0.049 sec)
cr> select * from names_table where country like 'Ch%';
+---------+------+
| country | name |
+---------+------+
| Chile | bar |
| China | foo |
+---------+------+
SELECT 2 rows in set (0.037 sec)