Is there a better way to optimize this operation? - postgresql

The goal is to show only those assets that have, for example, both tags 1 and 2.
The source table holds asset_id - tag_id pairs (inserting the data: (1,1), (1,6), ..., (27,5678), ..., (88,5)). Grouped per asset, the data looks like:
1 - 1,2,3,4,5,6 | 2 - 1,2,3,4,5,6 | 3 - 1,3 | 4 - 1,2,5 | 5 - 1,3 | 6 - 1,2,3,4,5 | 27 - 5678,4321,2,4 | 88 - 5678,4321,3,5
SELECT asset_id, ARRAY_TO_STRING(ARRAY_AGG(tag_id),',') AS ETIQUETA
FROM PUBLIC.asset_tag
GROUP BY asset_id
ORDER BY asset_id;
This one, instead of only returning the assets that have both tags ("1,2,4,6"), also returns assets 3 and 5, which only have tag 1.
SELECT asset_id, ARRAY_TO_STRING(ARRAY_AGG(tag_id),',') AS ETIQUETA
FROM PUBLIC.asset_tag
WHERE asset_tag.tag_id in(1,2)
GROUP BY asset_id
ORDER BY asset_id;
So what I have done is find, separately, the assets that have tag 1 and the assets that have tag 2:
CREATE VIEW joinuno as
(SELECT asset_id, ARRAY_TO_STRING(ARRAY_AGG(tag_id),',') AS ETIQUETA
FROM PUBLIC.asset_tag
WHERE asset_tag.tag_id IN(1)
GROUP BY asset_id
ORDER BY asset_id) ;
CREATE VIEW joindos as
(SELECT asset_id, ARRAY_TO_STRING(ARRAY_AGG(tag_id),',') AS ETIQUETA
FROM PUBLIC.asset_tag
WHERE asset_tag.tag_id IN(2)
GROUP BY asset_id
ORDER BY asset_id) ;
To finally join them together in this:
SELECT cons.asset_id, ARRAY_TO_STRING(ARRAY_AGG(cons.tag_id),',') AS ETIQUETA
FROM PUBLIC.asset_tag AS cons
JOIN joinuno ON cons.asset_id = joinuno.asset_id
JOIN joindos ON cons.asset_id = joindos.asset_id
GROUP BY cons.asset_id
ORDER BY cons.asset_id;
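For the record, the two views and the extra join can usually be collapsed into a single grouped query. A minimal sketch, assuming PostgreSQL and the asset_tag table above:
SELECT asset_id, ARRAY_TO_STRING(ARRAY_AGG(tag_id), ',') AS ETIQUETA
FROM PUBLIC.asset_tag
GROUP BY asset_id
HAVING bool_or(tag_id = 1)  -- the asset has tag 1...
   AND bool_or(tag_id = 2)  -- ...and also tag 2
ORDER BY asset_id;
Each additional required tag just adds one more bool_or condition to the HAVING clause.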


Can you help me to display the latest data in each group

I have these tables:
table1
id category
-------------
1 a
2 b
3 c
table2
id heading category_id
----------------------
1 name 1
2 address 2
3 phone 3
4 email 1
I want to group this table and display the latest data; for that, the following query is what I used:
SELECT news.id,news.image,news.heading,news.description,
news.date,news.category_id,categories.category
FROM `news`
INNER JOIN categories On news.category_id=categories.id
group by category_id
But I didn't get the latest data that I entered.
Try the query below. GROUP BY alone won't pick the newest row (and ORDER BY cannot precede GROUP BY); join against the latest id per category instead:
SELECT tb2.*, tb1.category
FROM table2 AS tb2
JOIN (SELECT category_id, MAX(id) AS max_id
      FROM table2
      GROUP BY category_id) AS latest
  ON latest.max_id = tb2.id
LEFT JOIN table1 AS tb1 ON tb2.category_id = tb1.id;
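On MySQL 8.0+ the same intent can be expressed with a window function. A sketch against the asker's news/categories tables, assuming news.id is auto-incrementing so the highest id per category is the latest:
SELECT id, image, heading, description, date, category_id, category
FROM (
    SELECT news.id, news.image, news.heading, news.description,
           news.date, news.category_id, categories.category,
           -- number the rows within each category, newest first
           ROW_NUMBER() OVER (PARTITION BY news.category_id
                              ORDER BY news.id DESC) AS rn
    FROM news
    INNER JOIN categories ON news.category_id = categories.id
) AS ranked
WHERE rn = 1;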

Populate random data from another table

update dataset1.test
set column4 = (select column1
from dataset2
order by random()
limit 1
)
I have to update column4 of dataset1 so that each row gets a random entry from dataset2's column1. But with the query above I get only one random entry, repeated in all the rows of dataset1, when I want it to be random per row.
SETUP
Let's start by assuming your tables and data are the following ones.
Note that I assume that dataset1 has a primary key (it can be a composite one, but, for the sake of simplicity, let's make it an integer):
CREATE TABLE dataset1
(
id INTEGER PRIMARY KEY,
column4 TEXT
) ;
CREATE TABLE dataset2
(
column1 TEXT
) ;
We fill both tables with sample data:
INSERT INTO dataset1
(id, column4)
SELECT
i, 'column 4 for id ' || i
FROM
generate_series(101, 120) AS s(i);
INSERT INTO dataset2
(column1)
SELECT
'SOMETHING ' || i
FROM
generate_series (1001, 1020) AS s(i) ;
Sanity check:
SELECT count(DISTINCT column4) FROM dataset1 ;
| count |
| ----: |
| 20 |
Case 1: number of rows in dataset1 <= rows in dataset2
We'll perform a complete shuffling. Values from dataset2 will be used once, and no more than once.
EXPLANATION
In order to make an update that shuffles all the values from column4 in a
random fashion, we need some intermediate steps.
First, for the dataset1, we need to create a list (relation) of tuples (id, rn), that
are just:
(id_1, 1),
(id_2, 2),
(id_3, 3),
...
(id_20, 20)
Where id_1, ..., id_20 are the ids present on dataset1.
They can be of any type, they need not be consecutive, and they can be composite.
For the dataset2, we need to create another list of (column_1,rn), that looks like:
(column1_1, 17),
(column1_2, 3),
(column1_3, 11),
...
(column1_20, 15)
In this case, the second column contains all the values 1 .. 20, but shuffled.
Once we have the two relations, we JOIN them ON ... rn. This, in practice, produces yet another list of tuples with (id, column1), where the pairing has been done randomly. We use these pairs to update dataset1.
THE REAL QUERY
This can all be done (clearly, I hope) by using some CTE (WITH statement) to hold the intermediate relations:
WITH original_keys AS
(
-- This creates tuples (id, rn),
-- where rn increases from 1 to the number of rows
SELECT
id,
row_number() OVER () AS rn
FROM
dataset1
)
, shuffled_data AS
(
-- This creates tuples (column1, rn)
-- where rn moves between 1 and number of rows, but is randomly shuffled
SELECT
column1,
-- The next statement is what *shuffles* all the data
row_number() OVER (ORDER BY random()) AS rn
FROM
dataset2
)
-- You update your dataset1
-- with the shuffled data, linking back to the original keys
UPDATE
dataset1
SET
column4 = shuffled_data.column1
FROM
shuffled_data
JOIN original_keys ON original_keys.rn = shuffled_data.rn
WHERE
dataset1.id = original_keys.id ;
Note that the trick is performed by means of:
row_number() OVER (ORDER BY random()) AS rn
The row_number() window function produces as many consecutive numbers as there are rows, starting from 1.
These numbers are randomly shuffled because the OVER clause takes all the data and sorts it randomly.
CHECKS
We can check again:
SELECT count(DISTINCT column4) FROM dataset1 ;
| count |
| ----: |
| 20 |
SELECT * FROM dataset1 ;
id | column4
--: | :-------------
101 | SOMETHING 1016
102 | SOMETHING 1009
103 | SOMETHING 1003
...
118 | SOMETHING 1012
119 | SOMETHING 1017
120 | SOMETHING 1011
ALTERNATIVE
Note that this can also be done with subqueries, by simple substitution, instead of CTEs. That might improve performance on some occasions:
UPDATE
dataset1
SET
column4 = shuffled_data.column1
FROM
(SELECT
column1,
row_number() OVER (ORDER BY random()) AS rn
FROM
dataset2
) AS shuffled_data
JOIN
(SELECT
id,
row_number() OVER () AS rn
FROM
dataset1
) AS original_keys ON original_keys.rn = shuffled_data.rn
WHERE
dataset1.id = original_keys.id ;
And again...
SELECT * FROM dataset1;
id | column4
--: | :-------------
101 | SOMETHING 1011
102 | SOMETHING 1018
103 | SOMETHING 1007
...
118 | SOMETHING 1020
119 | SOMETHING 1002
120 | SOMETHING 1016
You can check the whole setup and experiment at dbfiddle here
NOTE: if you do this with very large datasets, don't expect it to be extremely fast. Shuffling a very big deck of cards is expensive.
Case 2: number of rows in dataset1 > rows in dataset2
In this case, values for column4 can be repeated several times.
The easiest possibility I can think of (probably not an efficient one, but easy to understand) is to create a function random_column1, marked as VOLATILE:
CREATE FUNCTION random_column1()
RETURNS TEXT
VOLATILE -- important!
LANGUAGE SQL
AS
$$
SELECT
column1
FROM
dataset2
ORDER BY
random()
LIMIT
1 ;
$$ ;
And use it to update:
UPDATE
dataset1
SET
column4 = random_column1();
This way, some values from dataset2 might not be used at all, whereas others will be used more than once.
dbfiddle here
A better approach is to reference the outer table from the subquery. Then the subquery has to be evaluated for every row:
update dataset1.test
set column4 = (select
                  -- the self-comparison is always true here (column4 is not NULL);
                  -- its only purpose is to reference the outer row, which forces
                  -- PostgreSQL to re-evaluate the subquery for every row
                  case when dataset1.test.column4 = dataset1.test.column4
                       then column1 end
               from dataset2
               order by random()
               limit 1
              )

Subsetting records that contain multiple values in one column

In my Postgres table, I have two columns of interest: id and name. My goal is to keep only records where an id has more than one value in name. In other words, I would like to keep all records of ids that have multiple values and where at least one of those values is B.
UPDATE: I have tried adding WHERE EXISTS to the queries below but this does not work
The sample data would look like this:
> test
id name
1 1 A
2 2 A
3 3 A
4 4 A
5 5 A
6 6 A
7 7 A
8 2 B
9 1 B
10 2 B
and the output would look like this:
> output
id name
1 1 A
2 2 A
8 2 B
9 1 B
10 2 B
How would one write a query to select only these kinds of records?
Based on your description you would seem to want:
select id, name
from (select t.*, min(name) over (partition by id) as min_name,
max(name) over (partition by id) as max_name
from t
) t
where min_name < max_name;
This can be done using EXISTS:
select id, name
from test t1
where exists (select *
from test t2
where t1.id = t2.id
and t1.name <> t2.name) -- this will select those with multiple names for the id
and exists (select *
from test t3
where t1.id = t3.id
and t3.name = 'B') -- this will select those with at least one b for that id
You want those records where more than one name shows up for their id, right?
This could be formulated in "SQL" as follows:
select * from test t1
where id in (
    select id
    from test t2
    group by id
    having count(distinct name) > 1)
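Both requirements (more than one distinct name per id, and at least one of them being B) can also be combined into one grouped subquery. A sketch, assuming PostgreSQL and the sample table test:
select id, name
from test
where id in (
    select id
    from test
    group by id
    having count(distinct name) > 1   -- more than one distinct name for this id
       and bool_or(name = 'B')        -- and at least one of them is 'B'
)
order by id, name;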

SQL to remove rows with duplicated value while keeping one

Say I have this table
id | data | value
-----------------
1 | a | A
2 | a | A
3 | a | A
4 | a | B
5 | b | C
6 | c | A
7 | c | C
8 | c | C
I want to remove those rows with duplicated value for each data while keeping the one with the min id, e.g. the result will be
id | data | value
-----------------
1 | a | A
4 | a | B
5 | b | C
6 | c | A
7 | c | C
I know a way to do it is to do a union like:
SELECT 1 [id], 'a' [data], 'A' [value] INTO #test UNION SELECT 2, 'a', 'A'
UNION SELECT 3, 'a', 'A' UNION SELECT 4, 'a', 'B'
UNION SELECT 5, 'b', 'C' UNION SELECT 6, 'c', 'A'
UNION SELECT 7, 'c', 'C' UNION SELECT 8, 'c', 'C'
SELECT * FROM #test WHERE id NOT IN (
SELECT MIN(id) FROM #test
GROUP BY [data], [value]
HAVING COUNT(1) > 1
UNION
SELECT MIN(id) FROM #test
GROUP BY [data], [value]
HAVING COUNT(1) <= 1
)
but this solution has to repeat the same GROUP BY twice (consider that the real case is a massive GROUP BY with > 20 columns).
I would prefer a simpler answer with less code, as opposed to complex ones. Is there any more concise way to code this?
Thank you
You can use one of the methods below:
Using WITH CTE:
WITH CTE AS
(SELECT *,RN=ROW_NUMBER() OVER(PARTITION BY data,value ORDER BY id)
FROM TableName)
DELETE FROM CTE WHERE RN>1
Explanation:
This query will select the contents of the table along with a row number RN. And then delete the records with RN >1 (which would be the duplicates).
This Fiddle shows the records which are going to be deleted using this method.
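As an aside, deleting through a CTE like this is SQL Server syntax; PostgreSQL does not allow DELETE to target a CTE. There, the same idea can be phrased with the system column ctid. A sketch, assuming the same TableName:
DELETE FROM TableName
WHERE ctid IN
    (SELECT ctid
     FROM (SELECT ctid,
                  ROW_NUMBER() OVER (PARTITION BY data, value ORDER BY id) AS RN
           FROM TableName) numbered
     WHERE RN > 1);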
Using NOT IN:
DELETE FROM TableName
WHERE id NOT IN
(SELECT MIN(id) as id
FROM TableName
GROUP BY data,value)
Explanation:
With the given example, inner query will return ids (1,6,4,5,7). The outer query will delete records from table whose id NOT IN (1,6,4,5,7).
This fiddle shows the records which are going to be deleted using this method.
Suggestion: use the first method, since it is faster than the second. It also manages to keep only one record when the id field is duplicated for the same data and value.
I want to add a MySQL solution for this query.
Note 1: MySQL prior to version 8.0 doesn't support the WITH clause.
Note 2: the NOT IN method throws this error: "You can't specify target table 'TableName' for update in FROM clause".
So the solution is to wrap the inner select in a derived table:
DELETE FROM TableName WHERE id NOT IN
    (SELECT id
     FROM (SELECT MIN(id) AS id
           FROM TableName
           GROUP BY data, value) AS t1);
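Alternatively, a multi-table DELETE with a self-join sidesteps both the missing WITH support and the "can't specify target table" restriction. A sketch, assuming the same TableName with columns id, data, value:
DELETE t
FROM TableName AS t
JOIN TableName AS keeper
  ON keeper.data = t.data
 AND keeper.value = t.value
 AND keeper.id < t.id;  -- t is a duplicate: a smaller id exists in its group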

SQL Selecting less significant entity from table

I have a problem with a query: from a given result set I need to select the least detailed row, under some conditions.
I have three selects that, after a UNION, return this table:
SELECT A_ID, B_ID, 1 FROM MY_TABLE JOIN MY_TABLE2 ON SPECIFIC CONDITION FOR LEVEL 1
UNION
SELECT A_ID, B_ID, 2 FROM MY_TABLE JOIN MY_TABLE2 ON SPECIFIC CONDITION FOR LEVEL 2
UNION
SELECT A_ID, B_ID, 3 FROM MY_TABLE JOIN MY_TABLE2 ON SPECIFIC CONDITION FOR LEVEL 3
The result can be something like this
1000 100 1
1000 200 2
1000 300 3
From this table the final result should be 1000 100 1
The best-case scenario would be that once a value is found, it is not searched for again in the next selects.
Any ideas?
EDIT:
The one-query solution presented by 'Jeffrey Kemp' works fine. Given this data:
1000 100 1
1000 200 2
1000 300 3
1001 200 2
1001 300 3
result
1000 100 1
1001 200 2
Database: Oracle Database 10g Release 10.2.0.4.0 - 64bit Production
Without knowing the details of your query, this is one option to consider:
SELECT * FROM (
SELECT * FROM (
SELECT A_ID, B_ID, 1 FROM MY_TABLE JOIN MY_TABLE2 ON SPECIFIC CONDITION FOR LEVEL 1
UNION
SELECT A_ID, B_ID, 2 FROM MY_TABLE JOIN MY_TABLE2 ON SPECIFIC CONDITION FOR LEVEL 2
UNION
SELECT A_ID, B_ID, 3 FROM MY_TABLE JOIN MY_TABLE2 ON SPECIFIC CONDITION FOR LEVEL 3
)
ORDER BY 3
) WHERE ROWNUM = 1;
An alternative is to add conditions to the queries to determine if they need to run at all:
SELECT A_ID, B_ID, 1 FROM MY_TABLE JOIN MY_TABLE2 ON SPECIFIC CONDITION FOR LEVEL 1
UNION
SELECT A_ID, B_ID, 2 FROM MY_TABLE JOIN MY_TABLE2 ON SPECIFIC CONDITION FOR LEVEL 2
WHERE NOT EXISTS (SPECIFIC CONDITION FOR LEVEL 1)
UNION
SELECT A_ID, B_ID, 3 FROM MY_TABLE JOIN MY_TABLE2 ON SPECIFIC CONDITION FOR LEVEL 3
WHERE NOT EXISTS (SPECIFIC CONDITION FOR LEVEL 1)
AND NOT EXISTS (SPECIFIC CONDITION FOR LEVEL 2)
Of course, I don't know the nature of your "specific conditions" so I don't know if this will work for you or not.
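Given the edited requirement (keep the lowest level per A_ID rather than a single row overall), an analytic-function variant may be closer to what is needed. A sketch, reusing the placeholder conditions from the question and naming the level column LVL:
SELECT A_ID, B_ID, LVL
FROM (
    SELECT u.A_ID, u.B_ID, u.LVL,
           -- number the rows within each A_ID, lowest level first
           ROW_NUMBER() OVER (PARTITION BY u.A_ID ORDER BY u.LVL) AS RN
    FROM (
        SELECT A_ID, B_ID, 1 AS LVL FROM MY_TABLE JOIN MY_TABLE2 ON SPECIFIC CONDITION FOR LEVEL 1
        UNION
        SELECT A_ID, B_ID, 2 AS LVL FROM MY_TABLE JOIN MY_TABLE2 ON SPECIFIC CONDITION FOR LEVEL 2
        UNION
        SELECT A_ID, B_ID, 3 AS LVL FROM MY_TABLE JOIN MY_TABLE2 ON SPECIFIC CONDITION FOR LEVEL 3
    ) u
)
WHERE RN = 1;
For the edited sample data this returns 1000 100 1 and 1001 200 2, matching the expected result.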