Finding out the hash value of a group of rows - postgresql

I have a table person in my PostgresSQL database, which contains data of different users.
I need to write a test case, which ensures that some routine does modify the data of user 1, and does not modify data of user 2.
For this purpose, I need to
a) calculate a hash code of all rows of user 1 and those of user 2,
b) then perform the operation under test,
c) calculate the hash code again and
d) compare hash codes from steps a) and c).
I found a way to calculate the hash code for a single row:
SELECT md5(CAST((f.*)AS text))
FROM person f;
In order to achieve my goal (find out whether rows of user 2 have been changed), I need to perform a query like this:
SELECT user_id, SOME_AGGREGATE_FUNCTION(md5(CAST((f.*)AS text)))
FROM person f
GROUP BY user_id;
What aggregate function can I use in order to calculate the hash code of a set of rows?
Note: I just want to know whether any rows of user 2 have been changed. I do not want to know, what exactly has changed.

The simplest way - just concat all the string form md5 with string_agg. But to use this aggregate correctly you must specify ORDER BY.
Or use md5(string_agg(md5(CAST((f.*)AS text)),'')) with some ORDER BY - it will change if any field of f.* changes and it is cheap to compare.

An even simpler way to do it
SELECT user_id, md5(textin(record_out(A))) AS hash
FROM person A

Related

Perl : Tracking duplicates

I am trying to figure out what would be the best way to go ahead and locate duplicates in a 5 column csv data. The real data has more than million rows in it.
Following is the content of mentioned 6 columns.
Name, address, city, post-code, phone number, machine number
Data does not have fixed length, data might in certain columns might be missing in certain instances.
I am thinking of using perl to first normalize all the short forms used in names, city and address. Fellow perl enthusiasts from stackoverflow have helped me a lot.
But there would still be a lot of data which would be difficult to match.
So I am wondering is it possible to match content based on "LIKELINESS / SIMILARITY" (eg. google similar to gugl) the likeliness would be required to overcome errors that creeped in while collecting data.
I have 2 tasks in hand w.r.t. the data.
Flag duplicate rows with certain identifier
Mention the percentage match between similar rows.
I would really appreciate if I could get suggestions as to what all possible methods could be employed and which would propbably be best because of their certain merits.
You could write a Perl program to do this, but it will be easier and faster to put it into a SQL database and use that.
Most SQL databases have a way to import CSV. For this answer, I suggest PostgreSQL because it has very powerful string functions which you will need to find your fuzzy duplicates. Create your table with an auto incremented ID column if your CSV data doesn't already have unique IDs.
Once the import is done, add indexes on the columns you want to check for duplicates.
CREATE INDEX name ON whatever (name);
You can do a self-join to look for duplicates in whatever way you like. Here's an example that finds duplicate names.
SELECT id
FROM whatever t1
JOIN whatever t2 ON t1.id < t2.id
WHERE t1.name = t2.name
PostgreSQL has powerful string functions including regexes to do the comparisons.
Indexes will have a hard time working on things like lower(t1.name). Depending on the sorts of duplicates you want to work with, you can add indexes for these transforms (this is a feature of PostgreSQL). For example, if you wanted to search case insensitively you can add an index on the lower-case name. (Thanks #asjo for pointing that out)
CREATE INDEX ON whatever ((lower(name)));
// This will be muuuuuch faster
SELECT id
FROM whatever t1
JOIN whatever t2 ON t1.id < t2.id
WHERE lower(t1.name) = lower(t2.name)
A "likeness" match can be achieved in several ways, a simple one would be to use the fuzzystrmatch functions like metaphone(). Same trick as before, add a column with the transformed row and index it.
Other simple things like data normalization are better done on the data itself before adding indexes and looking for duplicates. For example, trim out and squish extra whitespace.
UPDATE whatever SET name = trim(both from name);
UPDATE whatever SET name = regexp_replace(name, '[[:space:]]+', ' ');
Finally, you can use the Postgres Trigram module to add fuzzy indexing to your table (thanks again to #asjo).

Postgresql full text search on really short documents (filename)

I have a database of filenames in which I'm trying to search using PGs full text search facility. I'm running the search query on a table of filenames, the problem is that the ranking functions are not ranking the results as I'd like them to do. For the sake of argument, let's assume the schema looks like this:
create table files (
id serial primary key,
filename text,
filename_ft tsvector
);
The query that I run looks something like this:
select filename, ts_rank(filename_ft, query)
from files, to_tsquery('simple', 'a|b|c') as query
where query ## name_ft
order by rank desc limit 5;
This will return the 5 results with the highest rank. However, those search queries are coming from another process, and in most cases the queries have some 'garbage' in them. For instance, a query for 'a xxxx' might be executed, where xxxxx is just a bunch of other terms. In most cases this still returns the correct results, because the suffix is simply not in the database.
However, sometimes a query contains some extraneous information that screws with the ranking function. For instance, a query for 'a b c' will return a filename containing the tokens 'b c' as first result, and an exact match on 'a' as second result, my guess this is due to the fact the the first result contains a larger percentage of the actual search tokens.
In most cases (if not all) the most important token appears as the first token in the query, so my question is, is there a way to give the tokens in the query a weight?
is there a way to give the tokens in the query a weight?
Yes, there is. See the documentation; search for "weight".
Whether assigning weights is the right choice is another matter. It sounds to me like you really want to exclude some of the data from the inputs to to_tsvector in index creation and searching, so you just don't include that garbage in the index.

SQL Select rows by comparison of value to aggregated function result

I have a table listing (gameid, playerid, team, max_minions) and I want to get the players within each team that have the lowest max_minions (within each team, within each game). I.e. I want a list (gameid, team, playerid_with_lowest_minions) for each game/team combination.
I tried this:
SELECT * FROM MinionView GROUP BY gameid, team
HAVING MIN(max_minions) = max_minions;
Unfortunately, this doesn't seem to work as it seems to select a random row from the available rows for each (gameid, team) and then does the HAVING comparison. If the randomly selected row doesn't match, it's simply skipped.
Using WHERE won't work either since you can't use aggregate functions within WHERE clauses.
LIMIT won't work since I have many more games and LIMIT limits the total number of rows returned.
Is there any way to do this without adding another table/view that contains (gameid, teamid, MIN(max_minions))?
Example data:
sqlite> SELECT * FROM MinionView;
gameid|playerid|team|champion|max_minions
21|49|100|Champ1|124
21|52|100|Champ2|18
21|53|100|Champ3|303
21|54|200|Champ4|356
21|57|200|Champ5|180
21|58|200|Champ6|21
64|49|100|Champ7|111
64|50|100|Champ8|208
64|53|100|Champ9|8
64|54|200|Champ0|226
64|55|200|ChampA|182
64|58|200|ChampB|15
...
Expected result (I mostly care about playerid, but included champion, max_minions here for better overview):
21|52|100|Champ2|18
21|58|200|Champ6|21
64|53|100|Champ9|8
64|58|200|ChampB|15
...
I'm using Sqlite3 under Python 3.1 if that matters.
This is in SQL Server, hopefully the syntax works for you too:
SELECT
MV.*
FROM
(
SELECT
team, gameid, min(max_minions) as maxmin
FROM
MinionView
GROUP BY
team, gameid
) groups
JOIN MinionView MV ON
MV.team = groups.team
AND MV.gameid = groups.gameid
AND MV.max_minions = groups.maxmin
In words, first you make the usual grouping query (the nested one). At this point you have the min value for each group but you don't know to which row it belongs. For this you join with the original table and match the "keys" (team, game and min) to get the other columns as well.
Note that if a team will have more than one member with the same value for max_minions then all these rows will be selected. If you only want one of them then that's probably a bit more complicated.

what's the utility of array type?

I'm totally newbie with postgresql but I have a good experience with mysql. I was reading the documentation and I've discovered that postgresql has an array type. I'm quite confused since I can't understand in which context this type can be useful within a rdbms. Why would I have to choose this type instead of using a classical one to many relationship?
Thanks in advance.
I've used them to make working with trees (such as comment threads) easier. You can store the path from the tree's root to a single node in an array, each number in the array is the branch number for that node. Then, you can do things like this:
SELECT id, content
FROM nodes
WHERE tree = X
ORDER BY path -- The array is here.
PostgreSQL will compare arrays element by element in the natural fashion so ORDER BY path will dump the tree in a sensible linear display order; then, you check the length of path to figure out a node's depth and that gives you the indentation to get the rendering right.
The above approach gets you from the database to the rendered page with one pass through the data.
PostgreSQL also has geometric types, simple key/value types, and supports the construction of various other composite types.
Usually it is better to use traditional association tables but there's nothing wrong with having more tools in your toolbox.
One SO user is using it for what appears to be machine-aided translation. The comments to a follow-up question might be helpful in understanding his approach.
I've been using them successfully to aggregate recursive tree references using triggers.
For instance, suppose you've a tree of categories, and you want to find products in any of categories (1,2,3) or any of their subcategories.
One way to do it is to use an ugly with recursive statement. Doing so will output a plan stuffed with merge/hash joins on entire tables and an occasional materialize.
with recursive categories as (
select id
from categories
where id in (1,2,3)
union all
...
)
select products.*
from products
join product2category on...
join categories on ...
group by products.id, ...
order by ... limit 10;
Another is to pre-aggregate the needed data:
categories (
id int,
parents int[] -- (array_agg(parent_id) from parents) || id
)
products (
id int,
categories int[] -- array_agg(category_id) from product2category
)
index on categories using gin (parents)
index on products using gin (categories)
select products.*
from products
where categories && array(
select id from categories where parents && array[1,2,3]
)
order by ... limit 10;
One issue with the above approach is that row estimates for the && operator are junk. (The selectivity is a stub function that has yet to be written, and results in something like 1/200 rows irrespective of the values in your aggregates.) Put another way, you may very well end up with an index scan where a seq scan would be correct.
To work around it, I increased the statistics on the gin-indexed column and I periodically look into pg_stats to extract more appropriate stats. When a cursory look at those stats reveal that using && for the specified values will return an incorrect plan, I rewrite applicable occurrences of && with arrayoverlap() (the latter has a stub selectivity of 1/3), e.g.:
select products.*
from products
where arrayoverlap(cat_id, array(
select id from categories where arrayoverlap(parents, array[1,2,3])
))
order by ... limit 10;
(The same goes for the <# operator...)

Stored Procedure in postgresql, multiple queries w/ aggregates

I'm trying to write a store procedure that can take some input parameters (obviously), run multiple queries against those, taking the output from those and doing calculations, and from those calculations and the original queries, outputting a formatted text string like:
Number of Rows for max(Z) matching condition x and y of total rows matching x (x&y/x*100).
To explain the max(Z) bit, this will be the username field, it won't matter which actual entry is picked, because the where clause will filter the results by user id, is there a saner way to do this?
For starters break the code up into multiple procedures. Don't create one procedure that does all of these things.