In a word game similar to Ruzzle or Letterpress, where users have to construct words out of a given set of letters:
I keep my dictionary in a simple SQL table:
create table good_words (
word varchar(16) primary key
);
Since the game rounds are very short, I do not want to check each entered word by calling a PHP script that would look the word up in the good_words table.
Instead I'd like to download all possible words by one PHP script call before the round starts - since all letters are known.
My question is: is there a nice SQLish way to find such words?
I.e. I could run a longer-taking script once to add a column to the good_words table holding the same letters as the word column, but sorted alphabetically... But I still can't think of a way to match against it given a set of letters.
And doing the word matching inside a PHP script (vs. inside the database) would probably take too long because of bandwidth: I would have to fetch every row from the database into the PHP script.
Any suggestions or insights please?
Using postgresql-8.4.13 with CentOS Linux 6.3.
UPDATE:
Other ideas I have:
Create a constantly running script (cron job or daemon) which would prefill an SQL table with precompiled letter boards and their possible words - but this still feels like a waste of bandwidth and CPU; I would prefer to solve this inside the database
Add integer columns a, b, ... , z and, whenever I store a word into good_words, store the letter occurrences there. I wonder if it is possible to create an insert trigger in PL/pgSQL for that? (A sketch of this idea follows.)
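A minimal sketch of that per-letter-column idea, trimmed to three columns for brevity (the real table would carry a through z; the table and function names are made up):
create table good_words_counted (
word varchar(16) primary key,
a integer, b integer, c integer -- ... and so on through z in the real table
);
create or replace function fill_letter_counts() returns trigger as $body$
begin
new.word := lower(new.word);
-- count each letter by measuring how much the string shrinks without it
new.a := length(new.word) - length(replace(new.word, 'a', ''));
new.b := length(new.word) - length(replace(new.word, 'b', ''));
new.c := length(new.word) - length(replace(new.word, 'c', ''));
return new;
end;
$body$ language plpgsql;
create trigger fill_letter_counts
before insert on good_words_counted
for each row execute procedure fill_letter_counts();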
Nice question, I upvoted.
What you're after is a list of all possible permutations of the given letters of a given length. As described in the PostgreSQL wiki, you can create a permute function and call it like this (this matches the highlighted letters in your screenshot):
SELECT * FROM permute('{E,R,O,M}'::text[]);
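That function is not built in; here is a minimal recursive sketch with the signature assumed above (it returns only full-length permutations, and the factorial blow-up makes it practical only for a handful of letters):
create or replace function permute(arr text[]) returns setof text[] as $body$
declare
i integer;
rest text[];
perm text[];
begin
if array_length(arr, 1) is null or array_length(arr, 1) <= 1 then
return next arr;
return;
end if;
for i in 1 .. array_length(arr, 1) loop
-- permute the remaining letters and prepend the chosen one
rest := arr[1:i-1] || arr[i+1:array_length(arr, 1)];
for perm in select * from permute(rest) loop
return next array[arr[i]] || perm;
end loop;
end loop;
end;
$body$ language plpgsql;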
Now, to query the good_words use something like:
SELECT gw.word
FROM good_words gw
JOIN permute('{E,R,O,M}'::text[]) s(w) ON gw.word=array_to_string(s.w, '');
This could be a start, except that it doesn't check whether we have enough letters, only whether we have the right letters.
SELECT word from
(select word,generate_series(1,length(word)) as s from good_words) as q
WHERE substring(word,s,1) IN ('t','h','e','l','e','t','t','e','r','s')
GROUP BY word
HAVING count(*)>=length(word);
http://sqlfiddle.com/#!1/2e3a2/3
EDIT:
This query selects only the valid words, though it seems a bit redundant. It's not perfect, but it certainly proves it can be done.
WITH words AS
(SELECT word, substring(word,s,1) as sub from
(select word,generate_series(1,length(word)) as s from good_words) as q
WHERE substring(word,s,1) IN ('t','e','s','e','r','e','r','o','r','e','m','a','s','d','s','s'))
SELECT w.word FROM
(
SELECT word,words.sub,count(DISTINCT s) as cnt FROM
(SELECT s, substring(array_to_string(l, ''),s,1) as sub FROM
(SELECT l, generate_subscripts(l,1) as s FROM
(SELECT ARRAY['t','e','s','e','r','e','r','o','r','e','m','a','s','d','s','s'] as l)
as q)
as q) as let JOIN
words ON let.sub=words.sub
GROUP BY words.word,words.sub) as let
JOIN
(select word,sub,count(*) as cnt from words
GROUP BY word, sub)
as w ON let.word=w.word AND let.sub=w.sub AND let.cnt>=w.cnt
GROUP BY w.word
HAVING sum(w.cnt)=length(w.word);
Fiddle with all possible 3+ letters words (485) for that image: http://sqlfiddle.com/#!1/2fc66/1
Fiddle with 699 words out of which 485 are correct: http://sqlfiddle.com/#!1/4f42e/1
Edit 2:
We can use array operators like so to get a list of words that contain the letters we want:
SELECT word as sub from
(select word,generate_series(1,length(word)) as s from good_words) as q
GROUP BY word
HAVING array_agg(substring(word,s,1)) <@ ARRAY['t','e','s','e','r','e','r','o','r','e','m','a','s','d','s','s'];
So we can use it to narrow down the list of words we need to check.
WITH words AS
(SELECT word, substring(word,s,1) as sub from
(select word,generate_series(1,length(word)) as s from
(
SELECT word from
(select word,generate_series(1,length(word)) as s from good_words) as q
GROUP BY word
HAVING array_agg(substring(word,s,1)) <@ ARRAY['t','e','s','e','r','e','r','o','r','e','m','a','s','d','s','s']
)as q) as q)
SELECT DISTINCT w.word FROM
(
SELECT word,words.sub,count(DISTINCT s) as cnt FROM
(SELECT s, substring(array_to_string(l, ''),s,1) as sub FROM
(SELECT l, generate_subscripts(l,1) as s FROM
(SELECT ARRAY['t','e','s','e','r','e','r','o','r','e','m','a','s','d','s','s'] as l)
as q)
as q) as let JOIN
words ON let.sub=words.sub
GROUP BY words.word,words.sub) as let
JOIN
(select word,sub,count(*) as cnt from words
GROUP BY word, sub)
as w ON let.word=w.word AND let.sub=w.sub AND let.cnt>=w.cnt
GROUP BY w.word
HAVING sum(w.cnt)=length(w.word) ORDER BY w.word;
http://sqlfiddle.com/#!1/4f42e/44
We can use GIN indexes to work on arrays, so we could probably create a table that stores the arrays of letters and make words point to it (act, cat and tact would all point to the array [a,c,t]). That would probably speed things up, but that's up for testing. A sketch follows.
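A minimal sketch of that normalization, assuming sorted, deduplicated letter arrays (all names are made up):
create table letter_sets (
id serial primary key,
letters text[] not null unique -- sorted, deduplicated letters, e.g. {a,c,t}
);
create index letter_sets_letters_gin on letter_sets using gin (letters);
alter table good_words add column letter_set_id integer references letter_sets(id);
-- act, cat and tact would all reference the {a,c,t} row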
Create a table that has entries (id, letter); let n be the number of letters you are querying for.
select id from chartable where letter in ('x', 'y', 'z' /* ... */) group by id having count(*) = n;
OR (for partial matching)
select id, count(letter) as cnt from chartable where letter in ('x', 'y', 'z' /* ... */) group by id order by cnt;
The result of that query has all the word ids that fit the specification. Cache the result in a HashSet and simply do a lookup whenever a word is entered.
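Such a table could be filled straight from good_words; a sketch, assuming the word itself serves as the id:
insert into chartable (id, letter)
select word, substring(word, s, 1)
from (select word, generate_series(1, length(word)) as s from good_words) q;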
You can add a column with the sorted letters formatted like '%a%c%t%'. Then use the query:
select * from table where 'abcttx' like sorted_letters
to find words that can be built from the letters 'abcttx' (note that the input letters must be sorted as well). I don't know about performance, but simplicity probably can't be beaten :) A sketch of filling that column follows.
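A sketch of building that pattern column, assuming a new sorted_letters column (for 'cat' this produces '%a%c%t%'):
alter table good_words add column sorted_letters text;
update good_words gw
set sorted_letters = '%' || array_to_string(array(
select substring(gw.word from i for 1)
from generate_series(1, length(gw.word)) i
order by substring(gw.word from i for 1)
), '%') || '%';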
Here is a query that finds the answers that can be found by walking through adjacent fields.
with recursive
input as (select '{{"t","e","s","e"},{"r","e","r","o"},{"r","e","m","a"},{"s","d","s","s"}}'::text[] as inp),
dxdy as(select * from (values(-1,-1),(-1,0),(-1,1),(0,1),(0,-1),(1,-1),(1,0),(1,1)) as v(dx, dy)),
start_position as(select * from generate_series(1,4) x, generate_series(1,4) y),
work as(select x,y,inp[y][x] as word from start_position, input
union
select w.x + dx, w.y + dy, w.word || inp[w.y+dy][w.x+dx]
from dxdy cross join input cross join work w
inner join good_words gw on gw.word like w.word || '%'
)
select distinct word from work
where exists(select * from good_words gw where gw.word = work.word)
(Other answers don't take adjacency into account.)
SQL Fiddle link: http://sqlfiddle.com/#!1/013cc/14 (note that you need an index with varchar_pattern_ops for the query to be reasonably fast; see below).
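Such an index looks like this (the index name is made up):
create index good_words_word_pattern_idx on good_words (word varchar_pattern_ops);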
Does not work in 8.4; probably 9.1+ only (string_to_array with a NULL separator, which splits a string into characters, was added in 9.1). SQL Fiddle
select word
from (
select unnest(string_to_array(word, null)) c, word from good_words
intersect all
select unnest(string_to_array('TESTREROREMASDSS', null)) c, word from good_words
) s
group by word
having
array_agg(c order by c) =
(select array_agg(c order by c) from unnest(string_to_array(word, null)) a(c))
My own solution is to create an insert trigger, which writes letter frequencies into an array column:
create table good_words (
word varchar(16) primary key,
letters integer[26]
);
create or replace function count_letters() returns trigger as $body$
declare
alphabet varchar[];
i integer;
begin
alphabet := regexp_split_to_array('abcdefghijklmnopqrstuvwxyz', '');
new.word := lower(new.word);
for i in 1 .. array_length(alphabet, 1)
loop
-- raise notice '%: %', i, alphabet[i];
new.letters[i] := length(new.word) - length(replace(new.word, alphabet[i], ''));
end loop;
return new;
end;
$body$ language plpgsql;
create trigger count_letters
before insert on good_words
for each row execute procedure count_letters();
Then I generate a similar array for the random board string tesereroremasdss
and compare both arrays using the array containment operator @>.
Any new ideas or improvements are always welcome!
Related
I have two strings as below:
_var_1 text := '815 PAADLEY ROAD PL';
_var_2 text := 'PAADLEY ROAD PL';
_var_3 text;
I want to merge these two strings into one string and to remove duplicates:
_var_3 := _var_1 || _var_2;
As a result, the variable (_var_3) should contain only - 815 PAADLEY ROAD PL - without duplicates.
Can you advise or help recommend any PostgreSQL feature?
I read the documentation and could not find the necessary string function to solve this problem... I am trying to use regexp_split_to_table but nothing is working.
I tried to use this method, but it's not what I need and the words in the output are mixed up:
WITH ts AS (
SELECT
unnest(
string_to_array('815 PAADLEY ROAD PL PAADLEY ROAD PL', ' ')
) f
)
SELECT
f
FROM ts
GROUP BY f
-- f
-- 815
-- ROAD
-- PL
-- PAADLEY
I assume you want to treat the strings as word lists and then concatenate them as if they were sets being unioned, retaining order. This is basically done by the following SQL:
with splitted (val, input_number, word_number) as (
select v, 1, i
from unnest(regexp_split_to_array('815 PAADLEY 2 ROAD 3 PL',' ')) with ordinality as t(v,i)
union
select v, 2, i
from unnest(regexp_split_to_array('PAADLEY ROAD 4 PL',' ')) with ordinality as t(v,i)
), numbered as (
select val, input_number, word_number, row_number() over (partition by val order by input_number, word_number) as rn
from splitted
)
select string_agg(val,' ' order by input_number, word_number)
from numbered
where rn = 1
string_agg
815 PAADLEY 2 ROAD 3 PL 4
fiddle
However, this is not the kind of task that can be solved in SQL in a smart and elegant way. Moreover, it is not clear from your specification what to do with duplicate words, or whether you want to process multiple input pairs (both requirements would be possible, though SQL is probably not the right tool). At least please provide more sample inputs with expected outputs.
SUMMARY: I have two tables I want to derive info from: family_values (family_name, item_regex) and product_ids (product_id), to be able to update the property family_name in a third.
Here the plan is to grab a json array from the small family_values table and use the column value item_regex to do a test match against the product_id for every row in product_ids.
MORE DETAILS: I am importing static data from CSV into a table of orders. But in evaluating cost of goods and market value, I continuously need to determine the family from a prefix regex match (item_regex from family_values) on the product_id.
On the client this looks like this:
const families = {
FOOBAR: 'Big Ogre',
FOOBA: 'Wood Elf',
FOO: 'Valkyrie'
};
// And to find family, and subsequently COGs and Market Value:
const findFamily = product_id => Object.keys(families).find(f => new RegExp('^' + f).test(product_id));
This is a huge hit for the client, so I made a family_values table in PG with representative columns: family_name, item_regex, cogs, market_value.
Then, product_ids has a list of only the products the app cares about (out of millions). This is actually used with an 'on before' insert trigger to ignore any CSV entries that aren't in the product_ids view. So after that, the product_ids view could be taken out of the equation, because the orders table, after inserting the read-only data, has its own matching product_id. It does NOT have family_name, so I still have the issue of determining that client-side.
PSEUDO CODE: update the family column of orders with family_name from the family_values regex match against orders.product_id
OR update the product_ids table with a new family column and use that with the existing on-insert trigger (used to left-pad zeros and normalize data right now). Now I'm thinking this may be just an update as suggested, but I'm not very good with regex in PG. I'm a PG novice.
PROBLEM: But I'm having a hangup doing what I thought would be like a JS Array.find operation. The family_values have been sorted on item_regex so that the strictest match comes first, and is therefore found first.
For example, with sorting we have:
family_values_array = [
{"family_name": "Big Ogre", "item_regex": "FOOBAR"},
{"family_name": "Wood Elf", "item_regex": "FOOBA"},
{"family_name": "Valkyrie", "item_regex": "FOO"}]
So the comparison of a product_id against ^FOOBA would yield the family "Wood Elf".
SOLUTION:
The solution I finally arrived at simply uses concat to write out a front-anchored pattern. It was so simple in the end. The key line I was missing is:
select * into family_value_row from iol.family_values
where lvl3_id = product_row.lvl3_id and product_row.product_id
like concat(item_regex, '%') limit 1;
Whole function:
create or replace function iol.populate_families () returns void as $$
declare
product_row record;
family_value_row record;
begin
for product_row in
select product_id, lvl3_id from iol.products
loop
-- family_name is what we want after finding the BEST match for a product_id against item_regex
select * into family_value_row from iol.family_values
where lvl3_id = product_row.lvl3_id and product_row.product_id like concat(item_regex, '%') limit 1;
-- update family_name and value columns
update iol.products set
family_name = family_value_row.family_name,
cog_cents = family_value_row.cog_cents,
market_value_cents = family_value_row.market_value_cents
where product_id = product_row.product_id;
end loop;
end;
$$
LANGUAGE plpgsql;
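For what it's worth, the loop could likely be replaced by one set-based statement; a sketch against the same tables, where DISTINCT ON keeps the longest (strictest) matching prefix per product:
update iol.products p
set family_name = fv.family_name,
    cog_cents = fv.cog_cents,
    market_value_cents = fv.market_value_cents
from (
  select distinct on (pr.product_id)
         pr.product_id, f.family_name, f.cog_cents, f.market_value_cents
  from iol.products pr
  join iol.family_values f
    on f.lvl3_id = pr.lvl3_id
   and pr.product_id like concat(f.item_regex, '%')
  order by pr.product_id, length(f.item_regex) desc
) fv
where p.product_id = fv.product_id;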
I have a user table that contains a "skills" column, which is a text array. Given some input array, I would like to find all the users whose skills match (%) one or more of the entries in the input array, ordered by the number of matches (according to the % operator from pg_trgm).
For example, I have Array['java', 'ruby', 'postgres'] and I want users who have these skills ordered by the number of matches (max is 3 in this case).
I tried unnest() with an inner join. It looked like I was getting somewhere, but I still have no idea how to capture the count of the matching array entries. Any ideas on what the structure of the query might look like?
Edit: Details:
Here is what my programmers table looks like:
id | skills
----+-------------------------------
1 | {javascript,rails,css}
2 | {java,"ruby on rails",adobe}
3 | {typescript,nodejs,expressjs}
4 | {auth0,c++,redis}
where skills is a text array.
Here is what I have so far:
SELECT * FROM programmers, unnest(skills) skill_array(x)
INNER JOIN unnest(Array['ruby', 'node', 'express']) search(y)
ON skill_array.x % search.y;
which outputs the following:
id | skills | x | y
----+-------------------------------+---------------+---------
2 | {java,"ruby on rails",adobe} | ruby on rails | ruby
3 | {typescript,nodejs,expressjs} | nodejs | node
3 | {typescript,nodejs,expressjs} | expressjs | express
*Assuming pg_trgm is enabled.
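It can be enabled with:
CREATE EXTENSION IF NOT EXISTS pg_trgm;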
For an exact match between the user skills and the searched skills, you can proceed like this:
You put the searched skills in the target_skills text array
You filter the users from the table user_table whose user_skills array has at least one common element with the target_skills array by using the && operator
For each of the selected users, you select the common skills by using unnest and INTERSECT, and you calculate the number of these common skills
You order the result by the number of common skills DESC
In this process, the users with skill "ruby" will be selected for the target skill "ruby", but not the users with skill "ruby on rails".
This process can be implemented as follows:
SELECT u.user_id
, u.user_skills
, inter.skills
FROM user_table AS u
CROSS JOIN LATERAL
( SELECT array( SELECT unnest(u.user_skills)
INTERSECT
SELECT unnest(target_skills)
) AS skills
) AS inter
WHERE u.user_skills && target_skills
ORDER BY array_length(inter.skills, 1) DESC
or with this variant:
SELECT u.user_id
, u.user_skills
, array_agg(t_skill) AS inter_skills
FROM user_table AS u
CROSS JOIN LATERAL unnest(target_skills) AS t_skill
WHERE u.user_skills && array[t_skill]
GROUP BY u.user_id, u.user_skills
ORDER BY array_length(array_agg(t_skill), 1) DESC -- ORDER BY cannot use the output alias inside an expression, so the aggregate is repeated
This query can be accelerated by creating a GIN index on the user_skills column of the user_table.
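For instance (the index name is made up):
CREATE INDEX user_skills_gin_idx ON user_table USING gin (user_skills);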
For a partial match between the user skills and the target skills (i.e. users with the skill "ruby on rails" must be selected for the target skill "ruby"), you need the pattern-matching operator LIKE or a regular expression. These cannot be applied to text arrays, so you first need to transform your user_skills text array into plain text with the function array_to_string. The query becomes:
SELECT u.user_id
, u.user_skills
, array_agg(t_skill) AS inter_skills
FROM user_table AS u
CROSS JOIN unnest(target_skills) AS t_skill
WHERE array_to_string(u.user_skills, ' ') ~ t_skill
GROUP BY u.user_id, u.user_skills
ORDER BY array_length(array_agg(t_skill), 1) DESC ;
Then you can accelerate the queries by creating the following GIN (or GiST) index :
DROP INDEX IF EXISTS user_skills ;
-- array_to_string() is only STABLE, so wrap it in an IMMUTABLE function first
-- (the query must then call the same wrapper for the index to be used):
CREATE FUNCTION immutable_array_to_string(text[], text) RETURNS text
LANGUAGE sql IMMUTABLE AS $$ SELECT array_to_string($1, $2) $$;
CREATE INDEX user_skills
ON user_table
USING gist (immutable_array_to_string(user_skills, ' ') gist_trgm_ops) ; -- gin_trgm_ops and gist_trgm_ops indexes support the LIKE operator and regular expressions
In any case, managing skills as free text will always be fragile if there are typing errors or if the skills list is not normalized.
I accepted Edouard's answer, but I thought I'd show something else I adapted from it.
CREATE OR REPLACE FUNCTION partial_and_and(list1 TEXT[], list2 TEXT[])
RETURNS BOOLEAN AS $$
SELECT EXISTS(
SELECT * FROM unnest(list1) x, unnest(list2) y
WHERE x % y
);
$$ LANGUAGE SQL IMMUTABLE;
Then create the operator:
CREATE OPERATOR &&% (
LEFTARG = TEXT[],
RIGHTARG = TEXT[],
PROCEDURE = partial_and_and,
COMMUTATOR = &&%
);
And finally, the query:
SELECT p.id, p.skills, array_agg(t_skill) AS inter_skills
FROM programmers AS p
CROSS JOIN LATERAL unnest(Array['ruby', 'java']) AS t_skill
WHERE p.skills &&% array[t_skill]
GROUP BY p.id, p.skills
ORDER BY array_length(array_agg(t_skill), 1) DESC;
As originally written with ORDER BY array_length(inter_skills, 1) DESC, this outputs an error saying column 'inter_skills' does not exist: ORDER BY can reference an output alias only on its own, not inside an expression, so the aggregate has to be repeated. With that change the query works. All credit goes to Edouard.
In short, I am looking for a single recursive query that can perform multiple replaces over one string. I have a notion it can be done, but am failing to wrap my head around it.
Granted, I'd prefer the biz-layer of the application, or even the CLR, to do the replacing, but these are not options in this case.
More specifically, I want to replace the below mess - which is C&P in 8 different stored procedures - with a TVF.
SET @temp = REPLACE(RTRIM(@target), '~', '-')
SET @temp = REPLACE(@temp, '''', '-')
SET @temp = REPLACE(@temp, '!', '-')
SET @temp = REPLACE(@temp, '@', '-')
SET @temp = REPLACE(@temp, '#', '-')
-- 23 additional lines redacted
SET @target = @temp
Here is where I've started:
-- I have a split string TVF called tvf_SplitString that takes a string
-- and a splitter, and returns a table with one row for each element.
-- EDIT: tvf_SplitString returns a two-column table: pos, element, of which
-- pos is simply the row_number of the element.
SELECT REPLACE('A~B!C@D#C!B~A', MM.ELEMENT, '-') TGT
FROM dbo.tvf_SplitString('~-''-!-@-#', '-') MM
Notice I've joined all the offending characters into a single string separated by '-' (knowing that '-' will never be one of the offending characters), which is then split. The result from this query looks like:
TGT
------------
A-B!C@D#C!B-A
A~B!C@D#C!B~A
A~B-C@D#C-B~A
A~B!C-D#C!B~A
A~B!C@D-C!B~A
So, the replace clearly works, but now I want it to be recursive so I can pull the top 1 and eventually come out with:
TGT
------------
A-B-C-D-C-B-A
Any ideas on how to accomplish this with one query?
EDIT: Well, actual recursion isn't necessary if there's another way. I'm pondering the use of a table of numbers here, too.
You can use this in a scalar function. I use it to remove all control characters from some external input.
SELECT @target = REPLACE(@target, invalidChar, '-')
FROM (VALUES ('~'),(''''),('!'),('@'),('#')) AS T(invalidChar)
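Wrapped up as a scalar function, that might look like this (a sketch; the function name is made up):
CREATE FUNCTION dbo.fn_CleanString (@target VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
    -- replace each offending character with '-'
    SELECT @target = REPLACE(@target, invalidChar, '-')
    FROM (VALUES ('~'),(''''),('!'),('@'),('#')) AS T(invalidChar)
    RETURN RTRIM(@target)
END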
I figured it out. I failed to mention that the tvf_SplitString function returns a row number as "pos" (although a subquery assigning row_number could also have worked). With that fact, I could control the cross join between the recursive call and the split.
-- the cast to varchar(max) matches the output of the TVF, otherwise error.
-- The iteration counter is joined to the row number value from the split string
-- function to ensure each iteration only replaces on one character.
WITH XX AS (SELECT CAST('A~B!C@D#C!B~A' AS VARCHAR(MAX)) TGT, 1 RN
UNION ALL
SELECT REPLACE(XX.TGT, MM.ELEMENT, '-'), RN + 1 RN
FROM XX, dbo.tvf_SplitString('~-''-!-@-#', '-') MM
WHERE XX.RN = MM.pos)
SELECT TOP 1 XX.TGT
FROM XX
ORDER BY RN DESC
Still, I'm open to other suggestions.
I am trying to create the following select statement in a stored proc
@dealerids nvarchar(256)
SELECT *
FROM INVOICES as I
WHERE convert(nvarchar(20), I.DealerID) in (@dealerids)
I.DealerID is an INT in the table, and the parameter @dealerids would be formatted such as
(8820, 8891, 8834)
When I run this with the parameters provided I get no rows back. I know these dealer IDs should produce rows, as when I query them individually I get back what I expect.
I think I am doing
WHERE convert(nvarchar(20), I.DealerID) in (@dealerids)
incorrectly. Can anyone point out what I am doing wrong here?
Use a table-valued parameter (new in SQL Server 2008). Set it up by creating the actual table parameter type:
CREATE TYPE IntTableType AS TABLE (ID INTEGER PRIMARY KEY)
Your procedure would then be:
Create Procedure up_TEST
@Ids IntTableType READONLY
AS
SELECT *
FROM ATable a
WHERE a.Id IN (SELECT ID FROM @Ids)
RETURN 0
GO
If you can't use table-valued parameters, see "Arrays and Lists in SQL Server 2005 and Beyond, When Table Value Parameters Do Not Cut it" by Erland Sommarskog; there are many ways to split a string in SQL Server, and that article covers the PROs and CONs of just about every method. In general, you need to create a split function. This is how a split function can be used:
SELECT
*
FROM YourTable y
INNER JOIN dbo.yourSplitFunction(@Parameter) s ON y.ID=s.Value
I prefer the numbers-table approach to splitting a string in T-SQL, but there are numerous ways to split strings in SQL Server; see the previous link, which explains the PROs and CONs of each.
For the Numbers Table method to work, you need to do this one time table setup, which will create a table Numbers that contains rows from 1 to 10,000:
SELECT TOP 10000 IDENTITY(int,1,1) AS Number
INTO Numbers
FROM sys.objects s1
CROSS JOIN sys.objects s2
ALTER TABLE Numbers ADD CONSTRAINT PK_Numbers PRIMARY KEY CLUSTERED (Number)
Once the Numbers table is set up, create this split function:
CREATE FUNCTION [dbo].[FN_ListToTable]
(
@SplitOn char(1) --REQUIRED, the character to split the @List string on
,@List varchar(8000)--REQUIRED, the list to split apart
)
RETURNS TABLE
AS
RETURN
(
----------------
--SINGLE QUERY-- --this will not return empty rows
----------------
SELECT
ListValue
FROM (SELECT
LTRIM(RTRIM(SUBSTRING(List2, number+1, CHARINDEX(@SplitOn, List2, number+1)-number - 1))) AS ListValue
FROM (
SELECT @SplitOn + @List + @SplitOn AS List2
) AS dt
INNER JOIN Numbers n ON n.Number < LEN(dt.List2)
WHERE SUBSTRING(List2, number, 1) = @SplitOn
) dt2
WHERE ListValue IS NOT NULL AND ListValue!=''
);
GO
You can now easily split a CSV string into a table and join on it:
Create Procedure up_TEST
@Ids VARCHAR(MAX)
AS
SELECT * FROM ATable a
WHERE a.Id IN (SELECT ListValue FROM dbo.FN_ListToTable(',',@Ids))
You can't use @dealerids like that; you need to use dynamic SQL, like this:
@dealerids nvarchar(256)
EXEC('SELECT *
FROM INVOICES as I
WHERE convert(nvarchar(20), I.DealerID) in (' + @dealerids + ')')
The downside is that you open yourself up to SQL injection attacks unless you specifically control the data going into @dealerids.
There are better ways to handle this depending on your version of SQL Server, which are documented in this great article.
Split @dealerids into a table, then JOIN:
SELECT *
FROM INVOICES as I
JOIN
ufnSplit(@dealerids) S ON I.DealerID = S.ParsedIntDealerID
Assorted split functions here (I'd probably use a numbers table in this case, for such a small string).