how to "join" two tables in postgres - postgresql

I have two tables containing almost the same things, but they get their data from different sources; in a perfect world they would be identical. In practice, they differ. The goal is to find the matching records and connect them to each other, and then the unmatched records are the result.
first_table:
id1, date1, value1
second_table:
id2, date2, value2
I create third table "joiner":
id1,id2
And now use this spell:
INSERT INTO joiner (SELECT id1,id2 FROM first_table,second_table WHERE value1=value2 and date1=date2 ORDER BY date1,date2,id1,id2);
(sorting is important, because sometimes some packages are missed, so I have to add them later)
And everything would be great, but... sometimes there is more than one record with the same value and date, and there's no way to identify them. The accepted solution is to join the first record from first_table with the first from second_table, the second from first_table with the second from second_table, etc.
And here comes the problem.
Because joiner has a unique key on each column, the insert raises a unique_violation error, since the example result is:
id1|id2
-------
a1| b1
a1| b2
a2| b1
a2| b2
If I use SELECT distinct id1,id2, of course nothing changes, since (a1,b1)!=(a1,b2).
If I use SELECT distinct on (id1) id1,id2 - the result sometimes is:
id1|id2
-------
a1| b1
a2| b1
I tried to use WHERE NOT EXISTS (SELECT 1 FROM first_table f WHERE f.id1<>first_table.id1) AND NOT EXISTS (SELECT 1 FROM second_table s WHERE s.id2<>second_table.id2) - still nothing
I tried to add a function with an EXCEPTION handler, but this is also wrong - it raises the exception, but joiner is still empty...
Any ideas?
update
I don't know why some people voted my question down without any comment. Maybe because it is not clear enough - so especially for them, here is an example:
first_table:
id1, value1, date1
1,10, 2015-03-01
2,11, 2015-03-01
3,10, 2015-03-01
4,14, 2015-03-02
second_table:
id2, value2, date2
1,10, 2015-03-01
2,11, 2015-03-01
3,10, 2015-03-01
4,15, 2015-03-02
expected joiner
id1, id2
1,1
2,2
3,3
As you can see, id1=4 and id2=4 don't have a joiner entry - because the values differ (an auditor needs to manually check and fix them).
And there is a problem with id1=1 and id1=3 - they are identical, so a joiner without uniqueness would look like:
id1, id2
1,1
1,3
2,2
3,1
3,3
Which is wrong.

The solution to your problem is to use row_number() to enumerate the values for common date/value pairs in each table.
Your query can be improved in other ways as well:
When using insert, always list the columns.
Learn to use proper, explicit join syntax. Simple rule: never use a comma in the from clause.
Use table aliases to specify where columns are coming from.
The query is:
INSERT INTO joiner(id1, id2)
SELECT ft.id1, st.id2
FROM (select ft.*, row_number() over (partition by value1, date1 order by id1) as seqnum
      from first_table ft
     ) ft JOIN
     (select st.*, row_number() over (partition by value2, date2 order by id2) as seqnum
      from second_table st
     ) st
     ON ft.value1 = st.value2 and ft.date1 = st.date2 and ft.seqnum = st.seqnum
ORDER BY ft.date1, st.date2, ft.id1, st.id2;
I don't think the order by is important, but I'm leaving it in because you think it is relevant.
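To see how the sequence numbers break the ties, here is a minimal check against the sample data from the question (just a sketch; it assumes first_table and second_table contain exactly the rows listed above):
-- Inspect the tie-breaking sequence numbers for first_table
SELECT ft.id1, ft.value1, ft.date1,
       row_number() OVER (PARTITION BY ft.value1, ft.date1 ORDER BY ft.id1) AS seqnum
FROM first_table ft
ORDER BY ft.date1, ft.value1, seqnum;
-- For value1 = 10 on 2015-03-01 this assigns seqnum 1 to id1 = 1 and seqnum 2 to id1 = 3,
-- so the join on (value, date, seqnum) pairs (1,1) and (3,3), while id1 = 4 stays unmatched.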

Related

Need to split column into rows and columns

I have a table like this:
ID cst
1 string1;3;string2;string3;34;string4;-1;string5;string6;12;string7;5;string8,string9, 65
2 string10;-3;string11;string12;56;string13;6;string14;string15;9
etc.
Now I want to split the cst column into 5 columns and multiple rows.
So like this:
ID C1 C2 C3 C4 C5
1 string1 3 string2 string3 34
1 string4 -1 string5 string6 12
1 string7 5 string8 string9 65
2 string10 -3 string11 string12 56
2 string13 6 string14 string15 9
etc.
How to accomplish this? I am on SQL-server 2017, so I can use the string_split function. The problem with this function is that it produces only one output column...
Preferably I would like to create a UDF that outputs a table. The function would take these input parameters: the string, the separator character, and the number of columns, so the function can be used dynamically with a varying number of columns.
PS: the strings can be of variable length, of course.
Try something along these lines:
Hint: There are some "normal" commas in your sample data.
I suspected these were wrong and used semicolons instead.
If this is wrong, you might use a general REPLACE() to use ";" instead of ",".
Create a declared table to simulate your issue
DECLARE @tbl TABLE(ID INT, cst VARCHAR(1000));
INSERT INTO @tbl(ID,cst)
VALUES(1,'string1;3;string2;string3;34;string4;-1;string5;string6;12;string7;5;string8;string9; 65')
,(2,'string10;-3;string11;string12;56;string13;6;string14;string15;9');
--The query (for almost any version of SQL-Server, find v2017+ as UPDATE below)
WITH cte AS
(
SELECT t.ID
,B.Nr
,A.Casted.value('(/x[sql:column("B.Nr")]/text())[1]','varchar(max)') AS ValueAtPosition
,(B.Nr-1) % 5 AS Position
,(B.Nr-1)/5 AS GroupingKey
FROM @tbl t
CROSS APPLY(SELECT CAST('<x>' + REPLACE(t.cst,';','</x><x>') + '</x>' AS XML)) A(Casted)
CROSS APPLY(SELECT TOP(A.Casted.value('count(x)','int')) ROW_NUMBER() OVER(ORDER BY(SELECT NULL)) FROM master..spt_values) B(Nr)
)
SELECT ID
,GroupingKey
,MAX(CASE WHEN Position=0 THEN ValueAtPosition END) AS C1
,MAX(CASE WHEN Position=1 THEN ValueAtPosition END) AS C2
,MAX(CASE WHEN Position=2 THEN ValueAtPosition END) AS C3
,MAX(CASE WHEN Position=3 THEN ValueAtPosition END) AS C4
,MAX(CASE WHEN Position=4 THEN ValueAtPosition END) AS C5
FROM cte
GROUP BY ID,GroupingKey
ORDER BY ID,GroupingKey;
The idea in short:
We use APPLY to add your string, cast to XML, to the result set. This helps to split the string ("a;b;c" => <x>a</x><x>b</x><x>c</x>).
We use another APPLY to create a tally on the fly with a computed TOP-clause. It will return as many virtual rows as there are elements in the XML
We use sql:column() to grab each element's value by its position and some simple maths to create a grouping key and a running number from 0 to 4 and so on.
We use GROUP BY together with MAX(CASE...) to place the values in the fitting column (old-fashioned pivot or conditional aggregation).
Hint: If you want this fully generic, with a number of columns not known in advance, you cannot use any kind of function or ad-hoc query. You would rather need some kind of dynamic statement creation together with EXEC within a stored procedure.
To be honest: this might be a case of the XY-problem. Such approaches are the wrong idea - at least in almost all situations I can think of.
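That said, if a dynamic column count is truly required despite the caveat above, a rough sketch of the EXEC-based idea could look like the one below. It reuses the OPENJSON splitter from the v2017 update further down, assumes the @tbl variable from the setup is still in scope (same batch), and the names @cols and #split are made up; the split rows go into a temp table first because dynamic SQL cannot see a table variable:
DECLARE @cols INT = 5;   -- desired number of output columns, decided at runtime (assumption)

--Step 1 (static): split each string into (ID, Position, GroupingKey, ValueAtPosition) rows
IF OBJECT_ID('tempdb..#split') IS NOT NULL DROP TABLE #split;

SELECT t.ID
      ,A.[key] % @cols AS Position
      ,A.[key] / @cols AS GroupingKey
      ,A.[value]       AS ValueAtPosition
INTO #split
FROM @tbl t
CROSS APPLY OPENJSON(CONCAT('["',REPLACE(t.cst,';','","'),'"]')) A;

--Step 2 (dynamic): build one MAX(CASE ...) per column and execute the pivot
DECLARE @colList NVARCHAR(MAX) = N'';
DECLARE @i INT = 0;

WHILE @i < @cols
BEGIN
    SET @colList = @colList
        + CASE WHEN @i > 0 THEN N',' ELSE N'' END
        + N'MAX(CASE WHEN Position=' + CAST(@i AS NVARCHAR(10))
        + N' THEN ValueAtPosition END) AS C' + CAST(@i + 1 AS NVARCHAR(10));
    SET @i = @i + 1;
END;

DECLARE @sql NVARCHAR(MAX) =
      N'SELECT ID, GroupingKey, ' + @colList
    + N' FROM #split GROUP BY ID, GroupingKey ORDER BY ID, GroupingKey;';

EXEC sp_executesql @sql;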
UPDATE for SQL-Server 2017+
You are on v2017; this allows for JSON, which is a bit faster at position-safe string splitting. Try this:
SELECT t.ID
,A.*
FROM @tbl t
CROSS APPLY OPENJSON(CONCAT('["',REPLACE(t.cst,';','","'),'"]')) A
The general idea is the same. We transform a string to a JSON-array ("a,b,c" => ["a","b","c"]) and read it with APPLY OPENJSON().
You can perform the same maths on the "key" column and do the rest as above.
Just because it is ready here, this is the full query for v2017+
WITH cte AS
(
SELECT t.ID
,A.[key]+1 AS Nr
,A.[value] AS ValueAtPosition
,A.[key] % 5 AS Position
,A.[key]/5 AS GroupingKey
FROM @tbl t
CROSS APPLY OPENJSON(CONCAT('["',REPLACE(t.cst,';','","'),'"]')) A
)
SELECT ID
,GroupingKey
,MAX(CASE WHEN Position=0 THEN ValueAtPosition END) AS C1
,MAX(CASE WHEN Position=1 THEN ValueAtPosition END) AS C2
,MAX(CASE WHEN Position=2 THEN ValueAtPosition END) AS C3
,MAX(CASE WHEN Position=3 THEN ValueAtPosition END) AS C4
,MAX(CASE WHEN Position=4 THEN ValueAtPosition END) AS C5
FROM cte
GROUP BY ID,GroupingKey
ORDER BY ID,GroupingKey;
The easiest option here honestly might be the following steps:
Write out the current table to a CSV flat file, using semicolon as the separator (which is also the separator for the current cst column).
Then load the CSV using SQL Server's bulk loading tool, again with semicolon as the column separator (see the BULK INSERT sketch after the query below). This will yield a table with 16 columns: ID, and then C1 through C15.
Create a new table (ID, C1, C2, C3, C4, C5)
Then populate the above table using:
INSERT INTO newTable (ID, C1, C2, C3, C4, C5)
SELECT ID, C1, C2, C3, C4, C5 FROM loadedTable UNION ALL
SELECT ID, C6, C7, C8, C9, C10 FROM loadedTable UNION ALL
SELECT ID, C11, C12, C13, C14, C15 FROM loadedTable;
While the above suggestion might seem like a lot of work, SQL Server has poor support for regex and complex string splitting operations, especially on earlier versions. Working directly with your current table might be either not possible or more work than the above.
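For the bulk-load step, a minimal sketch might look like the following; the file path and the table name dbo.loadedTable are made up, and the table must already exist with columns ID and C1 through C15:
--Hypothetical bulk load of the semicolon-separated export
BULK INSERT dbo.loadedTable
FROM 'C:\data\export.csv'
WITH (
    FIELDTERMINATOR = ';',   -- column separator used when the file was written
    ROWTERMINATOR   = '\n',  -- one record per line
    FIRSTROW        = 1      -- assumes no header row
);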

T-SQL select all IDs that have value A and B

I'm trying to find all IDs in TableA that are mentioned by a set of records in TableB, where that set is defined in TableC. I've got to the point where a set of INNER JOINs provides me with the following result:
TableA.ID | TableB.Code
-----------------------
1 | A
1 | B
2 | A
3 | B
I want to select only the IDs where, in this case, there is an entry for both A and B, but where the values A and B are based on another query.
I figured this should be possible with a GROUP BY TableA.ID and HAVING = ALL(Subquery on table C).
But that is returning no values.
Since you did not post your original query, I will assume it is inside a CTE. Assuming this, the query you want is something along these lines:
SELECT ID
FROM cte
WHERE Code IN ('A', 'B')
GROUP BY ID
HAVING COUNT(DISTINCT Code) = 2;
It's an extremely poor question, but you probably need to compare distinct counts against table C:
SELECT a.ID
FROM TableA a
GROUP BY a.ID
HAVING COUNT(DISTINCT a.Code) = (SELECT COUNT(*) FROM TableC)
We're guessing though.
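Combining the two answers, a sketch that drives both the filter and the required count from TableC might look like this (it assumes TableB holds (ID, Code) pairs referencing TableA, and TableC lists the required codes):
--An ID qualifies only when TableB contains a row for every code listed in TableC
SELECT b.ID
FROM TableB b
WHERE b.Code IN (SELECT c.Code FROM TableC c)
GROUP BY b.ID
HAVING COUNT(DISTINCT b.Code) = (SELECT COUNT(DISTINCT c.Code) FROM TableC c);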

Populate random data from another table

update dataset1.test
set column4 = (select column1
from dataset2
order by random()
limit 1
)
I have to update column4 of dataset1 so that each row gets a random entry from a dataset2 column. But so far, with the above query, I get only one random entry repeated in all the rows of dataset1 - they are all the same, whereas I want them to be random.
SETUP
Let's start by assuming your tables and data are the following ones.
Note that I assume that dataset1 has a primary key (it can be a composite one, but, for the sake of simplicity, let's make it an integer):
CREATE TABLE dataset1
(
id INTEGER PRIMARY KEY,
column4 TEXT
) ;
CREATE TABLE dataset2
(
column1 TEXT
) ;
We fill both tables with sample data
INSERT INTO dataset1
(id, column4)
SELECT
i, 'column 4 for id ' || i
FROM
generate_series(101, 120) AS s(i);
INSERT INTO dataset2
(column1)
SELECT
'SOMETHING ' || i
FROM
generate_series (1001, 1020) AS s(i) ;
Sanity check:
SELECT count(DISTINCT column4) FROM dataset1 ;
| count |
| ----: |
| 20 |
Case 1: number of rows in dataset1 <= rows in dataset2
We'll perform a complete shuffling. Values from dataset2 will be used once, and no more than once.
EXPLANATION
In order to make an update that shuffles all the values from column4 in a
random fashion, we need some intermediate steps.
First, for the dataset1, we need to create a list (relation) of tuples (id, rn), that
are just:
(id_1, 1),
(id_2, 2),
(id_3, 3),
...
(id_20, 20)
Where id_1, ..., id_20 are the ids present in dataset1.
They can be of any type, they need not be consecutive, and they can be composite.
For the dataset2, we need to create another list of (column_1,rn), that looks like:
(column1_1, 17),
(column1_2, 3),
(column1_3, 11),
...
(column1_20, 15)
In this case, the second column contains all the values 1 .. 20, but shuffled.
Once we have the two relations, we JOIN them ON ... rn. This, in practice, produces yet another list of tuples with (id, column1), where the pairing has been done randomly. We use these pairs to update dataset1.
THE REAL QUERY
This can all be done (clearly, I hope) by using some CTE (WITH statement) to hold the intermediate relations:
WITH original_keys AS
(
-- This creates tuples (id, rn),
-- where rn increases from 1 to number or rows
SELECT
id,
row_number() OVER () AS rn
FROM
dataset1
)
, shuffled_data AS
(
-- This creates tuples (column1, rn)
-- where rn moves between 1 and number of rows, but is randomly shuffled
SELECT
column1,
-- The next statement is what *shuffles* all the data
row_number() OVER (ORDER BY random()) AS rn
FROM
dataset2
)
-- You update your dataset1
-- with the shuffled data, linking back to the original keys
UPDATE
dataset1
SET
column4 = shuffled_data.column1
FROM
shuffled_data
JOIN original_keys ON original_keys.rn = shuffled_data.rn
WHERE
dataset1.id = original_keys.id ;
Note that the trick is performed by means of:
row_number() OVER (ORDER BY random()) AS rn
The row_number() window function produces as many consecutive numbers as there are rows, starting from 1.
These numbers are randomly shuffled because the OVER clause takes all the data and sorts it randomly.
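To see the shuffle in isolation, you can run just that window function against dataset2 (using the sample data from the setup); every execution returns the same twenty values paired with a freshly shuffled 1..20:
SELECT
    column1,
    row_number() OVER (ORDER BY random()) AS rn
FROM
    dataset2
ORDER BY
    rn ;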
CHECKS
We can check again:
SELECT count(DISTINCT column4) FROM dataset1 ;
| count |
| ----: |
| 20 |
SELECT * FROM dataset1 ;
id | column4
--: | :-------------
101 | SOMETHING 1016
102 | SOMETHING 1009
103 | SOMETHING 1003
...
118 | SOMETHING 1012
119 | SOMETHING 1017
120 | SOMETHING 1011
ALTERNATIVE
Note that this can also be done with subqueries, by simple substitution, instead of CTEs. That might improve performance on some occasions:
UPDATE
dataset1
SET
column4 = shuffled_data.column1
FROM
(SELECT
column1,
row_number() OVER (ORDER BY random()) AS rn
FROM
dataset2
) AS shuffled_data
JOIN
(SELECT
id,
row_number() OVER () AS rn
FROM
dataset1
) AS original_keys ON original_keys.rn = shuffled_data.rn
WHERE
dataset1.id = original_keys.id ;
And again...
SELECT * FROM dataset1;
id | column4
--: | :-------------
101 | SOMETHING 1011
102 | SOMETHING 1018
103 | SOMETHING 1007
...
118 | SOMETHING 1020
119 | SOMETHING 1002
120 | SOMETHING 1016
You can check the whole setup and experiment at dbfiddle here
NOTE: if you do this with very large datasets, don't expect it to be extremely fast. Shuffling a very big deck of cards is expensive.
Case 2: number of rows in dataset1 > rows in dataset2
In this case, values for column4 can be repeated several times.
The easiest possibility I can think of (probably not an efficient one, but easy to understand) is to create a function random_column1, marked as VOLATILE:
CREATE FUNCTION random_column1()
RETURNS TEXT
VOLATILE -- important!
LANGUAGE SQL
AS
$$
SELECT
column1
FROM
dataset2
ORDER BY
random()
LIMIT
1 ;
$$ ;
And use it to update:
UPDATE
dataset1
SET
column4 = random_column1();
This way, some values from dataset2 might not be used at all, whereas others will be used more than once.
dbfiddle here
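To see the repetition, a quick sanity check after the Case 2 update (same sample tables as above) compares the total row count with the number of distinct values; with 20 rows drawn independently, the distinct count will usually be below 20:
SELECT
    count(*) AS total_rows,
    count(DISTINCT column4) AS distinct_values
FROM
    dataset1 ;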
A better approach is to reference the outer table from the subquery. The correlation forces the subquery to be evaluated for every row instead of just once:
update dataset1.test
set column4 = (select
case when dataset1.test.column4 = dataset1.test.column4
then column1 end
from dataset2
order by random()
limit 1
)

Retrieving the rows using join query

I have two tables like this
A: col1, col2 (contains 300k rows)
B: col1, col2 (contains 400k rows)
I need to count the col1 values from table A that match col1 in table B.
I have written a query like this:
select count(distinct ab.col1) from A ab join B bc on(ab.col1=bc.col1)
but this takes too much time
You could try a group by...
Also ensure that col1 is indexed in both tables (a sketch follows the query below).
SELECT COUNT(col1)
FROM
(
    SELECT aa.col1
    FROM A aa JOIN B bb ON aa.col1 = bb.col1
    GROUP BY aa.col1
) AS t
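For the indexing part, a minimal sketch (the index names are made up) would be:
-- plain b-tree indexes on the join column of each table
CREATE INDEX idx_a_col1 ON A (col1);
CREATE INDEX idx_b_col1 ON B (col1);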
It's difficult to answer without you posting more details: did you analyze the tables? Do you have an index on col1 in each table? How many rows are you counting?
That being said, there aren't many potential query plans for your query. You likely have two seq scans that are hash-joined together, which is about the best you can do... If you have a material number of rows, you'll be counting a gazillion rows, and that takes time.
Perhaps you could rewrite the query differently? If every B.col1 is in A.col1, you could get the same result without the join:
select count(distinct col1) from B
If A has low cardinality, it might be faster to rely on exists():
with vals as (
select distinct A.col1 as val from A
)
select count(*) from vals
where exists(select 1 from B where B.col1 = vals.val)
Or, if you know every possible value from A.col1 and it's reasonably small, you could unnest an array without querying A at all:
select count(*) from unnest(Array[val1, val2, ...]) as vals (val)
where exists(select 1 from B where B.col1 = vals.val)
Or vice versa for each of the above, if B is the table that holds the reference values.
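For completeness, a mirrored sketch of the exists() variant for that case, assuming B is the table holding the reference values:
-- same idea with the roles of A and B swapped
with vals as (
  select distinct B.col1 as val from B
)
select count(*) from vals
where exists(select 1 from A where A.col1 = vals.val)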

How to rank in postgres query

I'm trying to rank a subset of data within a table, but I think I am doing something wrong. I cannot find much information about the rank() feature for Postgres; maybe I'm looking in the wrong place. Either way:
I'd like to know the rank of an id that falls within a cluster of a table based on a date. My query is as follows:
select cluster_id,feed_id,pub_date,rank
from (select feed_id,pub_date,cluster_id,rank()
over (order by pub_date asc) from url_info)
as bar where cluster_id = 9876 and feed_id = 1234;
I'm modeling this after the following stackoverflow post: postgres rank
The reason I think I am doing something wrong is that there are only 39 rows in url_info that are in cluster_id 9876, and this query ran for 10 minutes and never came back. (Actually, I re-ran it for quite a while and it returned no results, yet there is a row in cluster 9876 for id 1234.) I'm expecting this will tell me something like "id 1234 was 5th for the criteria given". It will return a relative rank according to my query constraints, correct?
This is postgres 8.4 btw.
By placing the rank() function in the subselect and not specifying a PARTITION BY in the OVER clause or any predicate in that subselect, your query asks for a rank over the entire url_info table ordered by pub_date. This is likely why it ran so long: to rank over all of url_info, Pg must sort the entire table by pub_date, which will take a while if the table is very large.
It appears you want to generate a rank for just the set of records selected by the where clause, in which case all you need to do is eliminate the subselect; the rank function is then implicitly computed over the set of records matching that predicate.
select
cluster_id
,feed_id
,pub_date
,rank() over (order by pub_date asc) as rank
from url_info
where cluster_id = 9876 and feed_id = 1234;
If what you really wanted was the rank within the cluster, regardless of the feed_id, you can rank in a subselect which filters to that cluster:
select ranked.*
from (
select
cluster_id
,feed_id
,pub_date
,rank() over (order by pub_date asc) as rank
from url_info
where cluster_id = 9876
) as ranked
where feed_id = 1234;
Sharing another example of DENSE_RANK() in PostgreSQL.
A sample query to find the top 3 students.
Reference taken from this blog.
Create a table with sample data:
CREATE TABLE tbl_Students
(
StudID INT
,StudName CHARACTER VARYING
,TotalMark INT
);
INSERT INTO tbl_Students
VALUES
(1,'Anvesh',88),(2,'Neevan',78)
,(3,'Roy',90),(4,'Mahi',88)
,(5,'Maria',81),(6,'Jenny',90);
Using DENSE_RANK(), Calculate RANK of students:
WITH cteStud AS
(
SELECT
StudName
,Totalmark
,DENSE_RANK() OVER (ORDER BY TotalMark DESC) AS StudRank
FROM tbl_Students
)
SELECT
StudName
,Totalmark
,StudRank
FROM cteStud
WHERE StudRank <= 3;
The Result:
studname | totalmark | studrank
----------+-----------+----------
Roy | 90 | 1
Jenny | 90 | 1
Anvesh | 88 | 2
Mahi | 88 | 2
Maria | 81 | 3
(5 rows)
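For comparison, rank() on the same data leaves gaps after ties while dense_rank() does not; a quick sketch against the same tbl_Students table:
SELECT
    StudName
    ,TotalMark
    ,RANK()       OVER (ORDER BY TotalMark DESC) AS rnk
    ,DENSE_RANK() OVER (ORDER BY TotalMark DESC) AS dense_rnk
FROM tbl_Students;
-- rank():       90->1, 90->1, 88->3, 88->3, 81->5, 78->6  (gap after each tie)
-- dense_rank(): 90->1, 90->1, 88->2, 88->2, 81->3, 78->4  (no gaps)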