PostgreSQL: return all rows including duplicates

Some background: my client is requesting a way to find out who is using his computers. Each computer user has a unique barcode attached to their account which ends up in a log file (recording date and time, among other things), but the log file he looks at does not report the residence for these users, which he needs. I have a separate read-only PostgreSQL database that I can search against to find a user's residence using their barcode. I set up a web form with a textarea field to allow the client to input a list of barcodes. I then capture the entire string into a variable and put together a SQL query that looks like this:
SELECT
n.last_name as name,
p.barcode as barcode,
p.home as residence
FROM db.pt_view p JOIN db.pt_record_fullname n ON p.id = n.patron_record_id
WHERE p.barcode IN ('25260045344400','25233423433332','25233423433332', ...)
This works, but the IN operator of course collapses the duplicate barcodes in the list, so each distinct barcode is matched only once. I need all of the barcodes (duplicate or not) to match up to the number of entries in the log file. A barcode can appear many times in the log at different dates and times. Using the query above, I entered 918 barcodes and only got 450 rows back.
I'm a relative PostgreSQL and database noob, so I'm sure there's a better way to handle this and return all of the records (with duplicates). Thanks in advance for any help.

If IN isn't doing what you want, perhaps the on-the-fly capabilities of VALUES as a subquery will do.
SELECT
n.last_name as name, /* don't recommend using reserved word name this way */
p.barcode as barcode,
p.home as residence
FROM db.pt_view p
JOIN db.pt_record_fullname n ON p.id = n.patron_record_id
JOIN (VALUES ('25260045344400'),('25233423433332'),('25233423433332') /* , ... */)
AS codes(barcode)
ON p.barcode=codes.barcode;
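If the barcode list comes straight from the textarea as one string, a variation on the same idea (just a sketch; the $1 placeholder and the newline delimiter are assumptions about how your web form passes the value) is to split it inside the query with unnest, which also preserves duplicates:
SELECT
n.last_name as last_name,
p.barcode as barcode,
p.home as residence
FROM db.pt_view p
JOIN db.pt_record_fullname n ON p.id = n.patron_record_id
JOIN unnest(string_to_array($1, E'\n')) AS codes(barcode)
ON p.barcode = codes.barcode;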

Try a Cartesian product.
If you join the tables without an ON clause (a CROSS JOIN, or simply listing both tables in the FROM clause), you get a Cartesian product of ALL x ALL rows, which you then narrow down with conditions in the WHERE clause.
Something like:
SELECT
n.last_name as name,
p.barcode as barcode,
p.home as residence
FROM db.pt_view p CROSS JOIN db.pt_record_fullname n
WHERE p.barcode IN ('25260045344400','25233423433332','25233423433332', ...)
AND p.id = n.patron_record_id
PS: My English is not good xD

Related

PostgreSQL query - Return all matching search terms for each result row when using an ANY query and LIKE

Essentially what I'm trying to figure out is if there is a way to return all matching search terms in addition to the matched row when running a query that looks up a list of items using ANY or IN. In most cases the search term will exactly match the returned column value but in cases such as text search or with certain extensions like IP4r this is not always the case. In addition, you can have multiple search terms match on a single row.
To make this concrete suppose this is my query:
SELECT id, item_name, description FROM items WHERE description LIKE ANY('{%gaming%, %computer%, %socks%, %men%}');
and it returns the following two rows:
id, item_name, description
1, 'computer', 'super fast gaming computer that will help you win'
5, 'socks', 'These socks are sure to please the men in your family'
What I'd like to know is which original search terms map to the result row that was returned. In other words, I'd like the returned rows to look like this:
id, search_terms, item_name, description
1, '{%gaming%, %computer%}', 'computer', 'super fast gaming computer that will help you win'
5, '{%socks%, %men%}', 'socks', 'These socks are sure to please the men in your family'
Is there a way to efficiently do this in PostgreSQL? In the example above we're using LIKE with strings but in my real-world scenario I'm using the IP4r extension to do IP lookups against CIDR ranges where you can have multiple IP addresses in the same returned CIDR range.
I previously asked this question: PostgreSQL 9.5: Return matching search terms in each result row when using LIKE, which used a CASE statement to almost solve the problem I'm describing here.
The added complexity in the scenario above is that multiple search terms can match a single row (e.g., gaming and computer are both matches for the description "super fast gaming computer that will help you win"). If you use a CASE statement then only the first match in the CASE statement gets set as the search term and you miss any other matching search terms.
Thank you for your help!
This would be a way using VALUES:
SELECT i.id, i.item_name, i.description, m.pat
FROM items AS i
JOIN (VALUES ('%gaming%'), ('%computer%'), ('%socks%'), ('%men%')) AS m(pat)
ON i.description LIKE m.pat;
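And if you want the matching patterns collected into a single row per item, as in the desired output above, the same join can be aggregated (a sketch along the same lines, not tested against your real schema):
SELECT i.id, array_agg(m.pat) AS search_terms, i.item_name, i.description
FROM items AS i
JOIN (VALUES ('%gaming%'), ('%computer%'), ('%socks%'), ('%men%')) AS m(pat)
ON i.description LIKE m.pat
GROUP BY i.id, i.item_name, i.description;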

Perl: Tracking duplicates

I am trying to figure out the best way to locate duplicates in CSV data with the columns listed below. The real data has more than a million rows in it.
These are the columns:
Name, address, city, post-code, phone number, machine number
The fields are not fixed length, and data in certain columns may be missing in some rows.
I am thinking of using Perl to first normalize all the short forms used in the names, cities and addresses. Fellow Perl enthusiasts from Stack Overflow have helped me a lot with that.
But there would still be a lot of data which would be difficult to match.
So I am wondering whether it is possible to match content based on "likeness / similarity" (e.g. "google" is similar to "gugl"); the fuzzy matching would be needed to overcome errors that crept in while the data was collected.
I have two tasks in hand w.r.t. the data:
Flag duplicate rows with a certain identifier.
Report the percentage match between similar rows.
I would really appreciate suggestions as to which methods could be employed, and which would probably be best given their respective merits.
You could write a Perl program to do this, but it will be easier and faster to put it into a SQL database and use that.
Most SQL databases have a way to import CSV. For this answer, I suggest PostgreSQL because it has very powerful string functions which you will need to find your fuzzy duplicates. Create your table with an auto incremented ID column if your CSV data doesn't already have unique IDs.
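For example, a rough sketch of the import step (the table name, file name and column names here are placeholders; adjust them to your CSV):
CREATE TABLE whatever (
    id             serial PRIMARY KEY,
    name           text,
    address        text,
    city           text,
    post_code      text,
    phone_number   text,
    machine_number text
);
-- from psql; assumes the CSV has a header row
\copy whatever (name, address, city, post_code, phone_number, machine_number) from 'data.csv' csv header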
Once the import is done, add indexes on the columns you want to check for duplicates.
CREATE INDEX name ON whatever (name);
You can do a self-join to look for duplicates in whatever way you like. Here's an example that finds duplicate names.
SELECT id
FROM whatever t1
JOIN whatever t2 ON t1.id < t2.id
WHERE t1.name = t2.name
PostgreSQL has powerful string functions including regexes to do the comparisons.
A plain index will have a hard time with expressions like lower(t1.name). Depending on the sorts of duplicates you want to work with, you can add indexes on these expressions (PostgreSQL supports expression indexes). For example, if you want to search case-insensitively you can add an index on the lower-cased name. (Thanks @asjo for pointing that out.)
CREATE INDEX ON whatever ((lower(name)));
-- This will be muuuuuch faster
SELECT id
FROM whatever t1
JOIN whatever t2 ON t1.id < t2.id
WHERE lower(t1.name) = lower(t2.name)
A "likeness" match can be achieved in several ways, a simple one would be to use the fuzzystrmatch functions like metaphone(). Same trick as before, add a column with the transformed row and index it.
Other simple things like data normalization are better done on the data itself before adding indexes and looking for duplicates. For example, trim out and squish extra whitespace.
UPDATE whatever SET name = trim(both from name);
UPDATE whatever SET name = regexp_replace(name, '[[:space:]]+', ' ');
Finally, you can use the Postgres trigram module (pg_trgm) to add fuzzy indexing to your table (thanks again to @asjo).
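A sketch of that, which also gives you a similarity score for your second task (similarity() returns 0 to 1; multiply by 100 for a percentage):
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX ON whatever USING gin (name gin_trgm_ops);
-- pairs of rows whose names are similar, with a score
SELECT t1.id, t2.id, similarity(t1.name, t2.name) AS score
FROM whatever t1
JOIN whatever t2 ON t1.id < t2.id
WHERE t1.name % t2.name;  -- % is true when similarity exceeds the pg_trgm threshold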

field values as columns from master detail in PostgreSQL

I have two tables in a master-detail relation and want to select all the data in a single SELECT query.
table "person":
id;name
1;Allen
2;Bert
3;Chris
table "connectivity":
id;personId;type;value
1;1;phone;+123456789
2;1;mail;allen@allen.allen
3;2;mail;bert@bert.bert
4;3;phone;+987654321
5;3;fax;+987654322
The query output should be something like
person.id;person.name;phone;mail;fax
1;Allen;+123456789;allen@allen.allen;
2;Bert;;bert@bert.bert;
3;Chris;+987654321;;+987654322
Any ideas, possibly without writing a function?
It should dynamically add the columns when the detail table is extended, e.g. when adding a row to the detail table like
6;2;icq;0123456789
My preferred solution would fit into a single SELECT query.
Thanks!
Patrick
It is not possible for a static SQL query to return a dynamically varying set of columns.
The model you are using is called the "entity-attribute-value" (EAV) model. You can google for details on its different implementations.
The only "easy" way (I can think of) to have many dynamic properties per object in SQL is to dump them all into a single structure like HSTORE, JSON(B), BLOB... In this case the output will look like:
id;name;params
1;Allen;{"phone":"+123456789", "email":"allen#allen.allen"};
2;Bert;{"email":"bert#bert.com"};
3;Chris;{"phone":"+987654321", "fax":"+987654322"};
You need a JOIN and a CASE to select the values you need. Something like this:
SELECT person.id,
person.name,
CASE connectivity.type WHEN 'phone' THEN value END AS phone,
CASE connectivity.type WHEN 'mail' THEN value END AS mail,
CASE connectivity.type WHEN 'fax' THEN value END AS fax
FROM person
JOIN connectivity ON person.id = connectivity.personId;
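Note that the join above yields one output row per connectivity row; to collapse them into one row per person, as in the desired output, you can aggregate, for example (a sketch):
SELECT person.id,
       person.name,
       MAX(CASE connectivity.type WHEN 'phone' THEN value END) AS phone,
       MAX(CASE connectivity.type WHEN 'mail' THEN value END) AS mail,
       MAX(CASE connectivity.type WHEN 'fax' THEN value END) AS fax
FROM person
LEFT JOIN connectivity ON person.id = connectivity.personId
GROUP BY person.id, person.name;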
Off topic: don't use a mix of upper and lower case in identifiers; PostgreSQL folds unquoted identifiers to lower case unless you put them between "double quotes".
select
p.name,
phone.value phone,
mail.value mail
from person p
left join connectivity phone on phone.personid = p.id and phone.type = 'phone'
left join connectivity mail on mail.personid = p.id and mail.type = 'mail'
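The fax column from the desired output follows the same pattern, e.g. (a sketch):
select
p.name,
phone.value phone,
mail.value mail,
fax.value fax
from person p
left join connectivity phone on phone.personid = p.id and phone.type = 'phone'
left join connectivity mail on mail.personid = p.id and mail.type = 'mail'
left join connectivity fax on fax.personid = p.id and fax.type = 'fax'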

Transact-SQL Ambiguous column name

I'm having trouble with the 'Ambiguous column name' issue in Transact-SQL, using Microsoft SQL Server 2012 Management Studio.
I've been looking through some of the answers already posted on Stack Overflow, but they don't seem to work for me, and parts of them I simply don't understand or lose the overall picture of.
Executing the following script :
USE CDD
SELECT Artist, Album_title, track_title, track_number, Release_Year, EAN_code
FROM Artists AS a INNER JOIN CD_Albumtitles AS c
ON a.artist_id = c.artist_id
INNER JOIN Track_lists AS t
ON c.title_id = t.title_id
WHERE track_title = 'bohemian rhapsody'
triggers the following error message :
Msg 209, Level 16, State 1, Line 3
Ambiguous column name 'EAN_code'.
Note that this is a CD database with artists' names, album titles and track lists. Both the tables 'CD_Albumtitles' and 'Track_lists' have an EAN_code column, containing identical EAN codes. The EAN code is an important international code used to uniquely identify CD albums, which is why I would like to keep using it.
You need to put the alias in front of all the columns in your select list and your where clause. You're getting that error because one of the columns you're selecting exists in more than one of the joined tables. If you qualify the columns with an alias, SQL Server knows which table each one comes from.
SELECT a.Artist,c.Album_title,t.track_title,t.track_number,c.Release_Year,t.EAN_code
FROM Artists AS a INNER JOIN CD_Albumtitles AS c
ON a.artist_id = c.artist_id
INNER JOIN Track_lists AS t
ON c.title_id = t.title_id
WHERE t.track_title = 'bohemian rhapsody'
So choose one of the source tables, prefixing the field with its alias (or table name):
SELECT Artist,Album_title,track_title,track_number,Release_Year,
c.EAN_code -- or t.EAN_code, which should retrieve the same value
By the way, try to prefix all the fields (in the select, the join, the group by, etc.); it makes maintenance easier.

COUNT(field) returns correct amount of rows but full SELECT query returns zero rows

I have a UDF in my database which basically tries to get a station (e.g. bus/train) based on some input data (geographic location/name/type). Inside this function I try to check if there are any rows matching the given values:
SELECT COUNT(s.id)
INTO firsttry
FROM geographic.stations AS s
WHERE ST_DWithin(s.the_geom, plocation, 0.0017)
  AND s.name <-> pname < 0.8
  AND s.type ~ stype;
The firsttry variable now contains the value 1. If I use the following (slightly extended) SELECT statement I get no results:
RETURN query SELECT
s.id, s.name, s.type, s.the_geom,
similarity(
regexp_replace(s.name::text,'(Hauptbahnhof|Hbf)','Hbf'),
regexp_replace(pname::text,'(Hauptbahnhof|Hbf)','Hbf')
)::double precision AS sml,
st_distance(s.the_geom,plocation) As dist from geographic.stations AS s
WHERE ST_DWithin(s.the_geom,plocation,0.0017) and s.name <-> pname < 0.8
AND s.type ~ stype
ORDER BY dist asc, sml desc LIMIT 1;
the parameters are as follows:
stype = '^railway'
pname = 'Amsterdam Science Park'
plocation = ST_GeomFromEWKT('SRID=4326;POINT(4.9492530 52.3531670)')
The tuple I need to be returned is:
id name type geom (displayed as ST_AsText)
909658;"Amsterdam Sciencepark";"railway_station";"POINT(4.9482893 52.352904)"
The same UDF works fine for a lot of other stations, but this is one (of several) that just won't work. Any suggestions?
P.S. The use of the <-> operator is coming from the pg_trgm module.
Some ideas on how to troubleshoot this:
Break your troubleshooting into steps. Start with the simplest query possible: no aggregates, just the joins and no filters. Then add the filters. Then add the ORDER BY, then add the aggregates. Look at exactly where the change occurs (see the sketch at the end of this answer).
Try reindexing the database.
One possibility that occurs to me is a corrupted index that is used in the second query but not the first. I have seen corrupted indexes in the past; usually they throw errors, but at least in theory they could cause a problem like this.
If this is correct, your query will suddenly return rows if you remove the ORDER BY clause.
If you have a corrupted index, then you need to pay close attention to hardware. Is the RAM ECC? Is the processor overheating? How are your disks doing?
A second possibility is that there is a typo in a join condition or filter clause. Normally this is something I would suspect first, but it is easy enough to weed out index problems by starting there. If removing the ORDER BY doesn't change things, then chances are it is a typo. If you can't find a typo, then try reindexing.
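As a rough illustration of the stepwise approach above, run outside the function with the question's parameter values inlined as literals (a sketch):
-- step 1: geometry filter only
SELECT s.id, s.name
FROM geographic.stations AS s
WHERE ST_DWithin(s.the_geom, ST_GeomFromEWKT('SRID=4326;POINT(4.9492530 52.3531670)'), 0.0017);
-- step 2: add the trigram filter
--   AND s.name <-> 'Amsterdam Science Park' < 0.8
-- step 3: add the type filter
--   AND s.type ~ '^railway'
-- step 4: add the ORDER BY ... LIMIT 1 from the original query,
-- and note exactly which step makes the expected row disappear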