Postgres: Retrieve first n words from column

I know that I can run a full-text search in Postgres and get highlighted results with:
select ts_headline('german',content, tq, 'MaxFragments=4, MinWords=5, MaxWords=12,
ShortWord=3, StartSel = <strong>, StopSel = </strong>') as highlight, ...
FROM to_tsquery('german', 'test') tq ...
Is there a similar way to apply the same limits to content itself, i.e. to get up to 12 words directly from the column content?

You could use regular expressions:
SELECT (regexp_match(
regexp_replace(content, '[^\w\s]+', ' ', 'g'),
'^\s*((?:\w+\s+){9}\w+)'
))[1] FROM ...
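That first replaces every run of characters that are neither whitespace nor word characters (letters, digits, underscore) with a space, and then returns the first 10 words. The {9} quantifier plus the final \w+ is what sets the count to 10; note that the result is NULL when the text has fewer than 10 words.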
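To try the pattern outside the database, the same two steps can be sketched in Python. This is only an illustration: the function name first_n_words is made up here, Python's re module stands in for Postgres regular expressions (their \w classes may differ for non-ASCII text), and the quantifier is changed to {0,n-1} so that it returns up to n words instead of exactly n.

```python
import re

def first_n_words(text, n):
    """Return up to the first n words of text, or None if it has no words.

    Hypothetical helper for illustration; assumes n >= 1.
    """
    # Step 1: replace every run of characters that are neither word
    # characters nor whitespace with a single space.
    cleaned = re.sub(r"[^\w\s]+", " ", text)
    # Step 2: skip leading whitespace, then capture up to n-1 words that are
    # followed by whitespace, plus one final word.
    match = re.match(r"\s*((?:\w+\s+){0,%d}\w+)" % (n - 1), cleaned)
    return match.group(1) if match else None

print(first_n_words("one two three four five", 3))  # one two three
print(first_n_words("a-b c d", 3))                  # a b c
print(first_n_words("only two", 3))                 # only two
```

If the same change is wanted in the SQL above, replacing {9} with {0,11} should give up to 12 words; that variant is an assumption of this sketch and was not part of the original answer.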

Related

Square brackets in PgAdmin 4 for null values

In pgAdmin 4, the column value is shown as square brackets [...] instead of an empty value.
The column data type is character(4) and its name is carr_desig_icao_cd. The database is PostgreSQL.
How can I avoid the square brackets? I tried the pgAdmin 4 preferences, but no luck.
Thanks for your help.
Output from psql is as below:
Could it be something to do with converting to_char?
Here are results of my testing.
This query produced no brackets and dots:
select at1.score
This query produced brackets and dots:
select to_char(at1.score, '999')
I noticed when this was downloaded and opened in Excel, there were no brackets but three spaces at the start of the column.
This query removed the brackets and dots in PgAdmin and removed the spaces after downloading to Excel:
select replace(to_char(at1.score, '999'), ' ', '')
Just noting also that this was a sub query as part of a bigger query that looked a bit like this:
select
cm.course_id
, us.user_id
, gm.title
, (select replace(to_char(at1.score * 20, '999'), ' ', '') || '% ' || to_char(at1.attempt_date, 'yyyy-mm-dd') from attempt at1 where at1.pk1 = gg.highest_attempt_pk1)
from
etc (joins of course_main cm, users us, gradebook_main gm, gradebook_grade gg, attempt at)
This screenshot shows before and after

PostgreSQL return last n words

How to return last n words using Postgres.
I have tried using the LEFT function:
SELECT DISTINCT LEFT(name, -4) FROM my_table;
but that works on characters, not words. I want to return the last 3 words.
You can do this using the SUBSTRING() function and regular expressions:
SELECT
SUBSTRING(name FROM '((\S+\s+){0,3}\S+$)')
FROM my_table
This has been explained here: How can I match the last two words in a sentence in PostgreSQL?
\S+ matches a run of non-whitespace characters (a word)
\s+ matches a run of whitespace characters (e.g. one space)
(\S+\s+){0,3} matches zero to three words, each followed by whitespace
\S+$ matches one word at the end of the text
Together these return the last 4 words (or fewer if the text is shorter). For the last 3 words, use {0,2} instead of {0,3}.
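The same pattern can be tried quickly in Python. This is a rough sketch rather than the answer's own code: last_n_words is a hypothetical helper, Python's re module stands in for the Postgres regex engine, and the group is made non-capturing so that only the whole match is returned.

```python
import re

def last_n_words(text, n):
    """Return up to the last n whitespace-separated words, or None.

    Hypothetical helper for illustration; assumes n >= 1.
    """
    # (?:\S+\s+){0,n-1} : up to n-1 words, each followed by whitespace
    # \S+$              : the final word, anchored at the end of the text
    match = re.search(r"((?:\S+\s+){0,%d}\S+)$" % (n - 1), text)
    return match.group(1) if match else None

print(last_n_words("this is the last three words", 3))  # last three words
print(last_n_words("one two", 3))                       # one two
```

Because the last word is anchored with $, text that ends in a space produces no match here, so trimming the input first may be worthwhile.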
One way is to use regexp_split_to_array() to split the string into the words it contains and then put a string back together using the last 3 words in that array.
SELECT coalesce(w.words[array_length(w.words, 1) - 2] || ' ', '')
|| coalesce(w.words[array_length(w.words, 1) - 1] || ' ', '')
|| coalesce(w.words[array_length(w.words, 1)], '')
FROM mytable t
CROSS JOIN LATERAL (SELECT regexp_split_to_array(t."name", ' ') words) w;
RIGHT() should do if you want the last n characters:
SELECT RIGHT('MYCOLUMN', 4); -- returns LUMN
Update:
For words, you can convert the string to an array and then back to a string:
SELECT array_to_string(sentence[(array_length(sentence,1)-3):(array_length(sentence,1))],' ','*')
FROM
(
SELECT regexp_split_to_array('this is the one of the way to get the last four words of the string', E'\\s+') AS sentence
) foo;
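The split, slice, and join idea behind the array approach looks like this in Python. It is a sketch for comparison only: last_words_by_split is a made-up name, and str.strip() is added here so that leading or trailing whitespace does not produce empty words.

```python
import re

def last_words_by_split(sentence, n):
    """Split on runs of whitespace and rejoin the last n words.

    Hypothetical helper for illustration; assumes n >= 1.
    """
    words = re.split(r"\s+", sentence.strip())  # like regexp_split_to_array
    return " ".join(words[-n:])                 # like slicing + array_to_string

sentence = "this is the one of the way to get the last four words of the string"
print(last_words_by_split(sentence, 4))  # words of the string
```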

Postgresql: Remove hyphens and whitespaces

I am currently working on DB data which contains whitespace and hyphens. I searched the net and found the question "Remove/replace special characters in column values?". I tried to follow its answer, but I am still getting hyphens. After playing around with it, I can only remove the whitespace:
import psycopg2 as p  # assumption: p is psycopg2; the original snippet did not show the import

conn_p = p.connect("dbname='p_test' user='postgres' password='postgres' host='localhost'")
conn_t = p.connect("dbname='t_mig1' user='postgres' password='postgres' host='localhost'")
cur_p = conn_p.cursor()
cur_t = conn_t.cursor()
# This is the part in question: it only strips spaces, the hyphens remain
cur_t.execute("SELECT CAST(REGEXP_REPLACE(studentnumber, ' ', '') as integer), firstname, middlename, lastname FROM sprofile")
rows = cur_t.fetchall()
for row in rows:
    print("Inserting ", row[0], row[1], row[2], row[3])
    # Pass the values as query parameters instead of formatting them into the SQL string
    cur_p.execute('INSERT INTO "a_recipient" (id, first_name, middle_name, last_name) VALUES (%s, %s, %s, %s)', (row[0], row[1], row[2], row[3]))
conn_p.commit()  # commit() is a method of the connection, not of the cursor
cur_p.close()
cur_t.close()
What I would like to achieve is if I got a studentnumber of 001-2012-1456, it will be displayed as 000120121456.
To wipe out all characters in a set efficiently use translate. It takes a set of characters to translate into another set of characters. If the other set is empty it deletes them.
test=> select translate('001-2012-145 6', '- ', '');
translate
-------------
00120121456
While translate is simpler and faster for this particular job, it's important to know how to use regexes for other jobs. To do it with regexp_replace, there are two changes you need to make.
First, you have to match the set of - and space, written as the character class [- ].
Then, you have to ask it to replace all occurrences; otherwise it will stop after the first one. That's done with the g flag.
test=> select regexp_replace('001-2012-145 6', '[- ]', '', 'g');
regexp_replace
----------------
00120121456
Here's a tutorial on POSIX regular expressions and character sets.
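Since the question runs the query from Python, here is how the two approaches compare on the client side. This is a sketch of the same idea using Python's own str.translate and re.sub, not the Postgres functions; the variable names are illustrative.

```python
import re

raw = "001-2012-145 6"

# Character-set deletion, comparable to translate(col, '- ', ''):
# the third argument of str.maketrans lists the characters to delete.
DELETE_HYPHEN_AND_SPACE = str.maketrans("", "", "- ")
via_translate = raw.translate(DELETE_HYPHEN_AND_SPACE)

# Regex replacement, comparable to regexp_replace(col, '[- ]', '', 'g'):
# re.sub replaces every occurrence by default.
via_regex = re.sub(r"[- ]", "", raw)

print(via_translate)  # 00120121456
print(via_regex)      # 00120121456
```

Keeping the result as text preserves the leading zeros; converting it to an integer would drop them.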
It's very simple with the built-in translate function.
Example:
select translate('001-2012-145 6', '- ', '');
Output of above command :
00120121456

Padding Fields With White Space

I have the following piece of code in my SELECT statement -
SELECT convert(varchar (24),ra.Reference)
If a result is, say, R0_2 (4 characters), how do you pad it on the right with the remaining 20 spaces to make up 24 characters?
Similarly, if I have a figure of, say, 18.00, I want to add a # to the front, which I know I can achieve with CONCAT.
However, I want this field to be 16 characters wide, with any leading space filled with white space, so this example would look like:
'xxxxxxxxxx#18.00' (where x is a blank space)
Thank you for any advice.
One trick you can use is to just concatenate to the string an amount of padding which is guaranteed to fill the missing spaces. For the case of a string 24 characters long, in your first example, we can concatenate 24 spaces to the end of that string. Then, take the first 24 characters from the left, and the resulting string should be right padded by spaces. Similar logic applies to the other case.
First query:
SELECT LEFT(CONVERT(varchar(24), ra.Reference) + REPLICATE(' ', 24), 24)
FROM yourTable
Second query:
SELECT RIGHT(REPLICATE(' ', 16) + '#' + CONVERT(varchar(16), ra.TotalValue), 16)
FROM yourTable
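The over-pad-then-cut trick is easy to check outside the database. The following Python sketch mirrors the logic of the two queries above (string slicing stands in for LEFT and RIGHT); pad_right and pad_left are hypothetical names, and width is assumed to be at least 1.

```python
def pad_right(value, width):
    """Append width spaces, then keep the first width characters (like LEFT)."""
    return (value + " " * width)[:width]

def pad_left(value, width):
    """Prepend width spaces, then keep the last width characters (like RIGHT)."""
    return (" " * width + value)[-width:]

print(repr(pad_right("R0_2", 24)))   # 'R0_2' followed by 20 spaces
print(repr(pad_left("#18.00", 16)))  # 10 spaces followed by '#18.00'
```

As with LEFT and RIGHT, a value longer than the target width is cut down to that width instead of being left intact.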
You could also use REPLICATE to accurately pad based on the length of text for each cell to ensure it's always 24 characters:
DECLARE @Test1 VARCHAR(24) = 'Test',
@Test2 VARCHAR(24) = 'Longer String'
SELECT CONCAT(@Test1, REPLICATE(N' ', 24 - LEN(@Test1))),
CONCAT(@Test2, REPLICATE(N' ', 24 - LEN(@Test2)))
And for the #....
DECLARE @Number DECIMAL(4,2) = 18.00
SELECT CONCAT(REPLICATE(' ', 15 - LEN(CONVERT(VARCHAR(16), @Number))), '#', @Number)
I used 15 here despite it being 16 characters to account for the addition of the #
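The arithmetic behind the 15 can be spelled out with the example value. This Python sketch only mirrors the calculation; hash_prefixed is a hypothetical name, and the number is assumed to be already formatted as text.

```python
def hash_prefixed(number_text, width=16):
    """Left-pad with spaces so that '#' plus the number fills width characters."""
    # One position is reserved for '#', hence width - 1 (15 for a 16-character field).
    padding = (width - 1) - len(number_text)
    return " " * padding + "#" + number_text

result = hash_prefixed("18.00")
print(repr(result))  # 10 spaces, then '#18.00'
print(len(result))   # 16
```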

Truncating leading zero from the string in postgresql

I'm trying to remove leading zeros from part of an address. Example:
input
1 06TH ST
12 02ND AVE
123 001St CT
expected output
1 6TH ST
12 2ND AVE
123 1St CT
Here is what I have:
update table
set address = regexp_replace(address,'(0\d+(ST|ND|TH))','?????? need help here')
where address ~ '\s0\d+(ST|ND|TH)\s';
many thanks in advance
Assuming that the address always starts with a number/letter part (1234, 1a, 33B), followed by one or more spaces, followed by the part whose leading zeros you want to strip:
select substr(address, 1, strpos(address, ' ')) || ltrim(substr(address, strpos(address, ' ')), ' 0') from table;
or, to update the table:
update table set address = substr(address, 1, strpos(address, ' ')) || ltrim(substr(address, strpos(address, ' ')), ' 0');
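To see what the substr/strpos/ltrim combination does to the sample rows, the same steps can be traced in Python. This is a sketch of the logic only: the function name is made up, and leaving an address without any space unchanged is a choice made here, not something the SQL above specifies.

```python
def strip_zeros_after_first_space(address):
    """Keep everything up to the first space, then strip spaces and zeros
    from the start of the remainder."""
    first_space = address.find(" ")
    if first_space == -1:
        return address  # assumption of this sketch: nothing to do without a space
    head = address[: first_space + 1]          # like substr(address, 1, strpos(address, ' '))
    tail = address[first_space:].lstrip(" 0")  # like ltrim(substr(address, strpos(address, ' ')), ' 0')
    return head + tail

for sample in ["1 06TH ST", "12 02ND AVE", "123 001St CT"]:
    print(strip_zeros_after_first_space(sample))
# 1 6TH ST
# 12 2ND AVE
# 123 1St CT
```

Note that stripping both spaces and zeros also collapses several spaces after the first part into one.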
What you are looking for is back-references in regular expressions:
UPDATE table
SET address = regexp_replace(address, '\m0+(\d+\w+)', '\1', 'g')
WHERE address ~ '\m0+(\d+\w+)'
How the pattern works:
\m matches the beginning of a word, to avoid replacing zeros inside words (e.g. in 101Th)
0+ matches all leading zeros (it is not included in the capturing parentheses, so they are dropped)
\d+ captures the remaining digits
\w+ captures the remaining word characters
A word character is any alphanumeric character or the underscore _.
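The back-reference pattern can be tested on the sample addresses in Python. One difference to keep in mind: Python's re module has no \m, so this sketch uses \b (a word boundary) as a stand-in, which behaves the same here because the next character must be the word character 0. The names are illustrative.

```python
import re

# \b0+      : zeros at the start of a word (dropped, outside the group)
# (\d+\w+)  : remaining digits plus following word characters (kept as \1)
LEADING_ZEROS = re.compile(r"\b0+(\d+\w+)")

def strip_leading_zeros(address):
    """Remove leading zeros from every word that starts with them."""
    return LEADING_ZEROS.sub(r"\1", address)

for sample in ["1 06TH ST", "12 02ND AVE", "123 001St CT", "5 101Th ST"]:
    print(strip_leading_zeros(sample))
# 1 6TH ST
# 12 2ND AVE
# 123 1St CT
# 5 101Th ST
```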