Populate random data from another table - postgresql

update dataset1.test
set column4 = (select column1
from dataset2
order by random()
limit 1
)
I have to update column4 of dataset1 so that each row gets a random entry from the column in dataset2. But so far, with the query above, I get only one random entry repeated across all the rows of dataset1; they are all the same, whereas I want each row to get its own random value.

SETUP
Let's start by assuming your tables and data are the following.
Note that I assume that dataset1 has a primary key (it can be a composite one, but, for the sake of simplicity, let's make it an integer):
CREATE TABLE dataset1
(
    id INTEGER PRIMARY KEY,
    column4 TEXT
);
CREATE TABLE dataset2
(
    column1 TEXT
);
We fill both tables with sample data:
INSERT INTO dataset1
    (id, column4)
SELECT
    i, 'column 4 for id ' || i
FROM
    generate_series(101, 120) AS s(i);
INSERT INTO dataset2
    (column1)
SELECT
    'SOMETHING ' || i
FROM
    generate_series(1001, 1020) AS s(i);
Sanity check:
SELECT count(DISTINCT column4) FROM dataset1 ;
| count |
| ----: |
| 20 |
Case 1: number of rows in dataset1 <= rows in dataset2
We'll perform a complete shuffle: each value from dataset2 will be used at most once, never more.
EXPLANATION
In order to make an update that shuffles all the values from column4 in a
random fashion, we need some intermediate steps.
First, for dataset1, we need to create a list (relation) of tuples (id, rn), which are just:
(id_1, 1),
(id_2, 2),
(id_3, 3),
...
(id_20, 20)
Where id_1, ..., id_20 are the ids present on dataset1.
They can be of any type, they need not be consecutive, and they can be composite.
For dataset2, we need to create another list of tuples (column1, rn) that looks like:
(column1_1, 17),
(column1_2, 3),
(column1_3, 11),
...
(column1_20, 15)
In this case, the second column contains all the values 1 .. 20, but shuffled.
Once we have the two relations, we JOIN them ON ... rn. This, in practice, produces yet another list of tuples with (id, column1), where the pairing has been done randomly. We use these pairs to update dataset1.
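If you want to preview the random pairing before touching any data, you can run the join as a plain SELECT first; a minimal sketch (the two CTE bodies are exactly the ones used in the real query below):
WITH original_keys AS
(
    SELECT id, row_number() OVER () AS rn
    FROM dataset1
)
, shuffled_data AS
(
    SELECT column1, row_number() OVER (ORDER BY random()) AS rn
    FROM dataset2
)
SELECT original_keys.id, shuffled_data.column1
FROM original_keys
JOIN shuffled_data ON original_keys.rn = shuffled_data.rn;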
THE REAL QUERY
This can all be done (clearly, I hope) by using CTEs (a WITH statement) to hold the intermediate relations:
WITH original_keys AS
(
    -- This creates tuples (id, rn),
    -- where rn increases from 1 to the number of rows
    SELECT
        id,
        row_number() OVER () AS rn
    FROM
        dataset1
)
, shuffled_data AS
(
    -- This creates tuples (column1, rn),
    -- where rn moves between 1 and the number of rows, but is randomly shuffled
    SELECT
        column1,
        -- The next expression is what *shuffles* all the data
        row_number() OVER (ORDER BY random()) AS rn
    FROM
        dataset2
)
-- You update your dataset1
-- with the shuffled data, linking back to the original keys
UPDATE
    dataset1
SET
    column4 = shuffled_data.column1
FROM
    shuffled_data
    JOIN original_keys ON original_keys.rn = shuffled_data.rn
WHERE
    dataset1.id = original_keys.id;
Note that the trick is performed by means of:
row_number() OVER (ORDER BY random()) AS rn
The row_number() window function produces as many consecutive numbers as there are rows, starting from 1.
These numbers are randomly shuffled because the OVER clause takes all the data and sorts it randomly.
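To watch the shuffling step in isolation, you can run the window function on its own; each execution yields a different assignment of rn values:
SELECT
    column1,
    row_number() OVER (ORDER BY random()) AS rn
FROM
    dataset2;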
CHECKS
We can check again:
SELECT count(DISTINCT column4) FROM dataset1 ;
| count |
| ----: |
| 20 |
SELECT * FROM dataset1 ;
id | column4
--: | :-------------
101 | SOMETHING 1016
102 | SOMETHING 1009
103 | SOMETHING 1003
...
118 | SOMETHING 1012
119 | SOMETHING 1017
120 | SOMETHING 1011
ALTERNATIVE
Note that this can also be done with subqueries, by simple substitution, instead of CTEs. That might improve performance on some occasions:
UPDATE
    dataset1
SET
    column4 = shuffled_data.column1
FROM
    (SELECT
         column1,
         row_number() OVER (ORDER BY random()) AS rn
     FROM
         dataset2
    ) AS shuffled_data
    JOIN
    (SELECT
         id,
         row_number() OVER () AS rn
     FROM
         dataset1
    ) AS original_keys ON original_keys.rn = shuffled_data.rn
WHERE
    dataset1.id = original_keys.id;
And again...
SELECT * FROM dataset1;
id | column4
--: | :-------------
101 | SOMETHING 1011
102 | SOMETHING 1018
103 | SOMETHING 1007
...
118 | SOMETHING 1020
119 | SOMETHING 1002
120 | SOMETHING 1016
You can check the whole setup and experiment at dbfiddle here
NOTE: if you do this with very large datasets, don't expect it to be extremely fast. Shuffling a very big deck of cards is expensive.
Case 2: number of rows in dataset1 > rows in dataset2
In this case, values for column4 can be repeated several times.
The easiest possibility I can think of (probably not an efficient one, but easy to understand) is to create a function random_column1, marked as VOLATILE:
CREATE FUNCTION random_column1()
RETURNS TEXT
VOLATILE -- important!
LANGUAGE SQL
AS
$$
SELECT
column1
FROM
dataset2
ORDER BY
random()
LIMIT
1 ;
$$ ;
And use it to update:
UPDATE
dataset1
SET
column4 = random_column1();
This way, some values from dataset2 might not be used at all, whereas others will be used more than once.
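As a quick check after this variant (mirroring the sanity checks above), you can count the distinct values; since the picks are made with replacement, the distinct count will typically come back lower than the row count:
SELECT count(*) AS total_rows,
       count(DISTINCT column4) AS distinct_values
FROM dataset1;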
dbfiddle here

Better is to reference the outer table from the subquery. Then the subquery has to be evaluated for every row:
update dataset1.test
set column4 = (select
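-- referencing the outer column below makes the subquery correlated,
-- so it is re-evaluated (and re-shuffled) for each row of dataset1.test: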
case when dataset1.test.column4 = dataset1.test.column4
then column1 end
from dataset2
order by random()
limit 1
)


PostgreSQL select different columns based on condition [duplicate]

I have two queries:
Queries Simplified excluding Joins
Query 1: select ProductName, NumberofProducts (in inventory) from Table1.....;
Query 2: select ProductName, NumberofProductssold from Table2......;
I would like to know how I can get an output as :
ProductName NumberofProducts(in inventory) ProductName NumberofProductsSold
The relationships used for getting the outputs for each query are different.
I need the output this way for my SSRS report.
(I tried the union statement but it doesn't work for the output I want to see.)
Here is an example that does a union between two completely unrelated tables: the Student and the Products table. It generates an output that is 4 columns:
select
FirstName as Column1,
LastName as Column2,
email as Column3,
null as Column4
from
Student
union
select
ProductName as Column1,
QuantityPerUnit as Column2,
null as Column3,
UnitsInStock as Column4
from
Products
Obviously you'll tweak this for your own environment...
I think you are after something like this (using row_number() with a CTE and performing a FULL OUTER JOIN):
Fiddle example
;with t1 as (
select col1,col2, row_number() over (order by col1) rn
from table1
),
t2 as (
select col3,col4, row_number() over (order by col3) rn
from table2
)
select col1,col2,col3,col4
from t1 full outer join t2 on t1.rn = t2.rn
Tables and data:
create table table1 (col1 int, col2 int)
create table table2 (col3 int, col4 int)
insert into table1 values
(1,2),(3,4)
insert into table2 values
(10,11),(30,40),(50,60)
Results:
| COL1 | COL2 | COL3 | COL4 |
---------------------------------
| 1 | 2 | 10 | 11 |
| 3 | 4 | 30 | 40 |
| (null) | (null) | 50 | 60 |
How about,
select
col1,
col2,
null col3,
null col4
from Table1
union all
select
null col1,
null col2,
col4 col3,
col5 col4
from Table2;
The problem is that unless your tables are related, you can't determine how to join them; you'd have to join them arbitrarily, resulting in a Cartesian product:
select Table1.col1, Table1.col2, Table2.col3, Table2.col4
from Table1
cross join Table2
If you had, for example, the following data:
col1 col2
a 1
b 2
col3 col4
y 98
z 99
You would end up with the following:
col1 col2 col3 col4
a 1 y 98
a 1 z 99
b 2 y 98
b 2 z 99
Is this what you're looking for? If not, and you have some means of relating the tables, then you'd need to include that in joining the two tables together, e.g.:
select Table1.col1, Table1.col2, Table2.col3, Table2.col4
from Table1
inner join Table2
on Table1.JoiningField = Table2.JoiningField
That would pull things together for you into however the data is related, giving you your result.
If you mean that both ProductName fields are to have the same value, then:
SELECT a.ProductName,a.NumberofProducts,b.ProductName,b.NumberofProductsSold FROM Table1 a, Table2 b WHERE a.ProductName=b.ProductName;
Or, if you want the ProductName column to be displayed only once,
SELECT a.ProductName,a.NumberofProducts,b.NumberofProductsSold FROM Table1 a, Table2 b WHERE a.ProductName=b.ProductName;
Otherwise, if any row of Table1 can be associated with any row from Table2 (even though I really wonder why anyone would want to do that), you could give this a look.
Old question, but where others use JOIN to combine unrelated queries into rows of one table, this is my solution to combine unrelated queries into one row, e.g.:
select
(select count(*) c from v$session where program = 'w3wp.exe') w3wp,
(select count(*) c from v$session) total,
sysdate
from dual;
which gives the following one-row output:
W3WP TOTAL SYSDATE
----- ----- -------------------
14 290 2020/02/18 10:45:07
(which tells me that our web server currently uses 14 Oracle sessions out of the total of 290 sessions; I log this output without headers in an sqlplus script that runs every so many minutes)
Load each query into a datatable:
http://www.dotnetcurry.com/ShowArticle.aspx?ID=143
load both datatables into the dataset:
http://msdn.microsoft.com/en-us/library/aeskbwf7%28v=vs.80%29.aspx
This is what you can do, assuming that your ProductName columns have common values.
SELECT
Table1.ProductName,
Table1.NumberofProducts,
Table2.ProductName,
Table2.NumberofProductssold
FROM Table1
INNER JOIN Table2
ON Table1.ProductName= Table2.ProductName
Try this:
SELECT table1.ProductName, NumberofProducts, NumberofProductssold
FROM table1
JOIN table2
ON table1.ProductName = table2.ProductName
Try this:
Get the records for CURRENT_MONTH, LAST_MONTH and ALL_TIME, and merge them into a single array:
$analyticsData = $this->user->getMemberInfoCurrentMonth($userId);
$analyticsData1 = $this->user->getMemberInfoLastMonth($userId);
$analyticsData2 = $this->user->getMemberInfAllTime($userId);

foreach ($analyticsData2 as $arr) {
    foreach ($analyticsData1 as $arr1) {
        if ($arr->fullname == $arr1->fullname) {
            $arr->last_send_count = $arr1->last_send_count;
            break;
        } else {
            $arr->last_send_count = 0;
        }
    }
    foreach ($analyticsData as $arr2) {
        if ($arr->fullname == $arr2->fullname) {
            $arr->current_send_count = $arr2->current_send_count;
            break;
        } else {
            $arr->current_send_count = 0;
        }
    }
}

echo "<pre>";
print_r($analyticsData2);
die;

Get the ID of a table and its modulo with respect to the total rows in the same table in Postgres

While trying to map some data to a table, I wanted to obtain the ID of a table and its modulo with respect to the total number of rows in the same table. For example, given this table:
id
--
1
3
10
12
I would like this result:
id | mod
---+----
1 | 1 <- 1 mod 4
3 | 3 <- 3 mod 4
10 | 2 <- 10 mod 4
12 | 0 <- 12 mod 4
Is there an easy way to achieve this dynamically (as in, not counting the rows beforehand or doing it in an atomic way)?
So far I've tried something like this:
SELECT t1.id, t1.id % COUNT(t1.id) mod FROM tbl t1, tbl t2 GROUP BY t1.id;
This works, but you must have the GROUP BY and tbl t2; otherwise it returns 0 for the mod column. That makes sense, because I think it works by multiplying the table by itself, so each ID gets a full copy of the table. I guess for small enough tables this is OK, but I can see how it becomes problematic for larger tables.
Edit: Found another hack-ish way:
WITH total AS (
SELECT COUNT(*) cnt FROM tbl
)
SELECT t1.id, t1.id % t2.cnt mod FROM tbl t1, total t2
It's similar to the previous query, but it "collapses" the multiplication to a single row holding the previously computed count.
You can use the COUNT() window function:
SELECT id,
id % COUNT(*) OVER () mod
FROM tbl;
I'm sure that the optimizer is smart enough to calculate the result of the window function only once.
See the demo.
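For instance, with the sample ids from the question (a quick sketch; the table name tbl is as in the question), the window count is 4, so the result matches the desired output:
CREATE TABLE tbl (id integer);
INSERT INTO tbl (id) VALUES (1), (3), (10), (12);

SELECT id,
       id % COUNT(*) OVER () AS mod
FROM tbl;
-- id | mod
-- ---+----
--  1 |  1
--  3 |  3
-- 10 |  2
-- 12 |  0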

Extract words before and after a specific word

I need to extract the words before and after a word like '%don%' in an ntext column.
table A, column name: Text
Example:
TEXT
where it was done it will retrieve the...
at the end of the trip clare done everything to improve
it is the only one done in these times
I would like the following results:
was done it
clare done everything
one done in
I am using T-SQL; the LEFT and RIGHT functions do not work with the ntext data type of the column containing the text.
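A note on that ntext limitation first, as a sketch: most string functions reject ntext, but casting to nvarchar(max) (which the solution below also does) makes LEFT and RIGHT usable. The table and column names here are the ones from the question:
SELECT LEFT(CAST([Text] AS nvarchar(max)), 50) AS first_50_chars
FROM A; -- LEFT/RIGHT accept nvarchar(max), but not ntext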
As others have said, you can use a string splitting function to split out each word and then return those you require. Using the previously linked DelimitedSplit8K:
CREATE FUNCTION dbo.DelimitedSplit8K
--===== Define I/O parameters
(@pString VARCHAR(8000), @pDelimiter CHAR(1))
--WARNING!!! DO NOT USE MAX DATA-TYPES HERE! IT WILL KILL PERFORMANCE!
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
--===== "Inline" CTE Driven "Tally Table" produces values from 1 up to 10,000...
-- enough to cover VARCHAR(8000)
WITH E1(N) AS (
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
), --10E+1 or 10 rows
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front
-- for both a performance gain and prevention of accidental "overruns"
SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
),
cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)
SELECT 1 UNION ALL
SELECT t.N+1 FROM cteTally t WHERE SUBSTRING(@pString,t.N,1) = @pDelimiter
),
cteLen(N1,L1) AS(--==== Return start and length (for use in substring)
SELECT s.N1,
ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,s.N1),0)-s.N1,8000)
FROM cteStart s
)
--===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.
SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),
Item = SUBSTRING(@pString, l.N1, l.L1)
FROM cteLen l
;
go
declare @t table (t ntext);
insert into @t values('where it was done it will retrieve the...'),('at the end of the trip clare done everything to improve'),('we don''t take donut donations here'),('ending in don');
with t as (select cast(t as nvarchar(max)) as t from @t)
,d as (select t.t
,case when patindex('%don%',s.Item) > 0 then 1 else 0 end as d
,s.ItemNumber as i
,lag(s.Item,1,'') over (partition by t.t order by s.ItemNumber) + ' '
+ s.Item + ' '
+ lead(s.Item,1,'') over (partition by t.t order by s.ItemNumber) as r
from t
cross apply dbo.DelimitedSplit8K(t.t, ' ') as s
)
select t
,r
from d
where d = 1
order by t
,i;
Output:
+---------------------------------------------------------+-----------------------+
| t | r |
+---------------------------------------------------------+-----------------------+
| at the end of the trip clare done everything to improve | clare done everything |
| ending in don | in don |
| we don't take donut donations here | we don't take |
| we don't take donut donations here | take donut donations |
| we don't take donut donations here | donut donations here |
| where it was done it will retrieve the... | was done it |
+---------------------------------------------------------+-----------------------+
And a working example:
http://rextester.com/RND43071

How to split data of one column into multiple columns on the basis of a condition

I have one table with this data:
Category             New data
Cost of equipment    23
Price of equipments  45
Cost of M&C          13
Price of M&C         12
And another table with:
Category
Equipments
M&C
Now I want the data as below:
Category     Cost  Price
Equipments   23    45
M&C          13    12
Can you please help me solve this?
You may try this. A better approach is to change your table design.
Note that while joining I had to use RTRIM to remove the trailing 's' from 'equipments'. I am not aware of any other variations in your data which might not match between the two tables; please change the join conditions appropriately (or use a regexp match instead of ILIKE if they don't).
SQL Fiddle
PostgreSQL 9.6 Schema Setup:
CREATE TABLE Table1
(Category varchar(19), New_data int)
;
INSERT INTO Table1
(Category, New_data)
VALUES
('Cost of equipment', 23),
('Price of equipments', 45),
('Cost of M&C', 13),
('Price of M&C', 12)
;
CREATE TABLE Table2
(Category varchar(10))
;
INSERT INTO Table2
(Category)
VALUES
('Equipments'),
('M&C')
;
Query 1:
WITH t1
AS (
SELECT b.category
,a.new_data
FROM TABLE1 a
INNER JOIN TABLE2 b ON a.Category ILIKE '%cost%' || RTRIM(b.Category, 's') || '%'
)
,t2
AS (
SELECT c.category
,a.new_data
FROM TABLE1 a
INNER JOIN TABLE2 c ON a.Category ILIKE '%price%' || RTRIM(c.Category, 's') || '%'
)
SELECT t1.category
,t1.new_data AS cost
,t2.new_data AS price
FROM t1
INNER JOIN t2 ON t1.category = t2.category
Results:
| category | cost | price |
|------------|------|-------|
| Equipments | 23 | 45 |
| M&C | 13 | 12 |

How to rank in postgres query

I'm trying to rank a subset of data within a table but I think I am doing something wrong. I cannot find much information about the rank() feature for postgres, maybe I'm looking in the wrong place. Either way:
I'd like to know the rank of an id that falls within a cluster of a table based on a date. My query is as follows:
select cluster_id,feed_id,pub_date,rank
from (select feed_id,pub_date,cluster_id,rank()
over (order by pub_date asc) from url_info)
as bar where cluster_id = 9876 and feed_id = 1234;
I'm modeling this after the following stackoverflow post: postgres rank
The reason I think I am doing something wrong is that there are only 39 rows in url_info that are in cluster_id 9876, and this query ran for 10 minutes and never came back. (I actually re-ran it for quite a while and it returned no results, yet there is a row in cluster 9876 for id 1234.) I'm expecting this will tell me something like "id 1234 was 5th for the criteria given". It will return a relative rank according to my query constraints, correct?
This is postgres 8.4 btw.
By placing the rank() function in the subselect and not specifying a PARTITION BY in the OVER clause or any predicate in that subselect, your query is asking to produce a rank over the entire url_info table ordered by pub_date. This is likely why it ran so long: to rank over all of url_info, Pg must sort the entire table by pub_date, which will take a while if the table is very large.
It appears you want to generate a rank for just the set of records selected by the WHERE clause, in which case all you need to do is eliminate the subselect; the rank function then implicitly ranges over the set of records matching that predicate.
select
cluster_id
,feed_id
,pub_date
,rank() over (order by pub_date asc) as rank
from url_info
where cluster_id = 9876 and feed_id = 1234;
If what you really wanted was the rank within the cluster, regardless of the feed_id, you can rank in a subselect which filters to that cluster:
select ranked.*
from (
select
cluster_id
,feed_id
,pub_date
,rank() over (order by pub_date asc) as rank
from url_info
where cluster_id = 9876
) as ranked
where feed_id = 1234;
Sharing another example, of DENSE_RANK() in PostgreSQL: a sample query to find the top 3 students.
Reference taken from this blog:
Create a table with sample data:
CREATE TABLE tbl_Students
(
StudID INT
,StudName CHARACTER VARYING
,TotalMark INT
);
INSERT INTO tbl_Students
VALUES
(1,'Anvesh',88),(2,'Neevan',78)
,(3,'Roy',90),(4,'Mahi',88)
,(5,'Maria',81),(6,'Jenny',90);
Using DENSE_RANK(), Calculate RANK of students:
WITH cteStud AS
(
SELECT
StudName
,Totalmark
,DENSE_RANK() OVER (ORDER BY TotalMark DESC) AS StudRank
FROM tbl_Students
)
SELECT
StudName
,Totalmark
,StudRank
FROM cteStud
WHERE StudRank <= 3;
The Result:
studname | totalmark | studrank
----------+-----------+----------
Roy | 90 | 1
Jenny | 90 | 1
Anvesh | 88 | 2
Mahi | 88 | 2
Maria | 81 | 3
(5 rows)
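For contrast, a sketch with plain RANK() on the same table: RANK() leaves gaps after ties, so Roy and Jenny would rank 1, Anvesh and Mahi 3, and Maria 5, meaning the WHERE StudRank <= 3 filter would then exclude Maria.
SELECT
StudName
,TotalMark
,RANK() OVER (ORDER BY TotalMark DESC) AS StudRank
FROM tbl_Students;
-- studname | totalmark | studrank
-- ---------+-----------+---------
-- Roy      | 90        | 1
-- Jenny    | 90        | 1
-- Anvesh   | 88        | 3
-- Mahi     | 88        | 3
-- Maria    | 81        | 5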