Divide records into groups - quick solution - postgresql

I need to use an UPDATE command to divide rows (selected via a subselect) of a PostgreSQL table into groups; the groups will be identified by an integer value in one of the table's columns. The groups should all be the same size. The source table contains billions of records.
For example, I need to divide 213 selected rows into groups where every group contains 50 records. The result will be:
1 - 50. row => 1
51 - 100. row => 2
101 - 150. row => 3
151 - 200. row => 4
201 - 213. row => 5
There is no problem doing it with a loop (or with PostgreSQL window functions), but I need to do it very efficiently and quickly. I can't use the id as a sequence because there are gaps in these ids.
I had an idea to use a random integer generator and set it as the default value for a row, but that is not usable when I need to adjust the group size.

The query below should display 213 rows with a group number from 0-4; just add 1 if you want 1-5:
SELECT i, (row_number() OVER () - 1) / 50 AS grp
FROM generate_series(1001,1213) i
ORDER BY i;

create temporary sequence s minvalue 0 start with 0;
select *, nextval('s') / 50 grp
from t;
drop sequence s;
I think this has the potential to be faster than @Richard's row_number version, but the difference may not be relevant depending on the specifics.
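To actually apply either numbering with the UPDATE the question asks for, one option is to compute the group numbers in a CTE and join back on the primary key. A minimal sketch, assuming a table t with primary key id and an integer column grp for the group number (both names illustrative):
WITH numbered AS (
    SELECT id,
           (row_number() OVER (ORDER BY id) - 1) / 50 + 1 AS new_grp
    FROM t                       -- replace with the subselect that picks the rows
)
UPDATE t
SET grp = numbered.new_grp       -- grp is the assumed target column
FROM numbered
WHERE t.id = numbered.id;
The ORDER BY inside the window function fixes which rows fall into which group; without it the assignment is valid but nondeterministic.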

Related

Crystal Reports - Where clause in formula to calculate sum of a field from second table

I am trying to get a result in my report which, I believe, requires a where clause; it did not work for me with the Select Expert section.
I have 2 tables. Let's call them table 1 and table 2.
Table 1 contains unique records.
Table 2 contains multiple records for the same uniqueKey as table 1.
There are 3 fields in table 2 that play a role for each uniqueKey from table 1:
QTY_ORD
QTY_SHIPPED
ITEM_CANCEL
Let's assume that for item #1 from table 1 there are 5 records in table 2, and each record has values for the 3 above-mentioned fields. I need to display the SUM of QTY_SHIPPED - QTY_ORD over all the records where ITEM_CANCEL = 0.
It could be that 3 of the records have ITEM_CANCEL = 1 (we can ignore these records), but for the other 2 records where ITEM_CANCEL = 0, I need the SUM of QTY_SHIPPED - the SUM of QTY_ORD.
The current code I have is as follows:
if {current_order1.ITEM_CANCEL} = 0 then
sum({current_order1.QTY_ORD})-sum({current_order1.QTY_SHIPPED}) else
0
But this gives me the sum of ALL the records, including the ones where ITEM_CANCEL = 1.
If I use ITEM_CANCEL = 0 in the Select Expert, it removes ALL the results that have no value in table 2. I even tried the code without the SUM function, but that returned only 1 of the records in table 2 where ITEM_CANCEL = 0, not the total difference of the 2 records in table 2 that I require.
Any suggestions on this?
Start with a detail-level formula (no SUM):
if {current_order1.ITEM_CANCEL} = 0 then {current_order1.QTY_ORD} - {current_order1.QTY_SHIPPED} ELSE 0
Then, SUM that formula at whatever Group or Report levels you require.
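For reference, the same conditional-sum pattern in plain SQL looks like the sketch below; current_order1 comes from the question, while unique_key and net_qty are assumed names for the join key and the result:
SELECT unique_key,                        -- assumed join-key column
       SUM(CASE WHEN ITEM_CANCEL = 0
                THEN QTY_ORD - QTY_SHIPPED
                ELSE 0
           END) AS net_qty                -- illustrative alias
FROM current_order1
GROUP BY unique_key;
The CASE plays the same role as the Crystal if/else: cancelled rows contribute 0 to the sum instead of being filtered out of the report.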

select maximum column name from different table in a database

I am comparing values from different tables to get the COLUMN_NAME holding the MAXIMUM value.
Examples.
These are example tables: Fruit_tb, Vegetables_tb, State_tb, Food_tb.
Under Fruit_tb:
fr_id  fruit_one  fruit_two
1      20         50
Under Vegetables_tb (v = Vegetables):
v_id  v_one  v_two
1     10     9
Under State_tb:
stateid  stateOne  stateTwo
1        70        87
Under Food_tb:
foodid  foodOne  foodTwo
1       10       3
Now here is the scenario: I want to get the COLUMN NAMES of the max or greatest value in each table.
You can maybe find the row which contains the max value of a column. For example:
SELECT fr_id , MAX(fruit_one) FROM Fruit_tb GROUP BY fr_id;
To find the max value of a table:
SELECT fr_id, fruit_one FROM Fruit_tb WHERE fruit_one = (SELECT MAX(fruit_one) FROM Fruit_tb) ORDER BY fr_id DESC LIMIT 1;
A follow-up SO question for the above scenario.
Maybe you can use GREATEST to get the column name which has the max value, but I'm not sure whether you'll be able to retrieve the columns of different tables all at once. You can do something like this to retrieve from a single table:
SELECT CASE GREATEST(`fruit_one`, `fruit_two`)
         WHEN `fruit_one` THEN 'fruit_one'   -- return the name, not the value
         WHEN `fruit_two` THEN 'fruit_two'
       END AS maxcol,
       GREATEST(`fruit_one`, `fruit_two`) AS maxvalue
FROM Fruit_tb;
Maybe this SO question could help you. Hope it helps!
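If one result set across the tables is needed, a hedged sketch (assuming the example schemas above) is to UNION ALL the per-table GREATEST queries:
SELECT 'Fruit_tb' AS table_name,
       CASE GREATEST(fruit_one, fruit_two)
            WHEN fruit_one THEN 'fruit_one' ELSE 'fruit_two' END AS maxcol,
       GREATEST(fruit_one, fruit_two) AS maxvalue
FROM Fruit_tb
UNION ALL
SELECT 'Vegetables_tb',
       CASE GREATEST(v_one, v_two)
            WHEN v_one THEN 'v_one' ELSE 'v_two' END,
       GREATEST(v_one, v_two)
FROM Vegetables_tb;
-- State_tb and Food_tb follow the same pattern.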

Limiting number of rows in contentResolver.query to get any group of rows

How can I pick any group of rows if I know the first and last ID?
Suppose the table has IDs from 1 to 10 and I want contentResolver.query to return the rows 4 to 8, how can I do that?
I searched for that question, but all I found are solutions to return the first n consecutive rows:
Answer1 Answer2
Use the LIMIT and OFFSET clauses:
SELECT * FROM TABLE LIMIT 5 OFFSET 3;
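Applied to the example (rows 4 to 8 of a table whose IDs run 1 to 10), either of these should work; some_table and _id are placeholder names:
-- skip the first 3 rows, then take 5 (rows 4-8 in id order):
SELECT * FROM some_table ORDER BY _id LIMIT 5 OFFSET 3;
-- or, since the first and last IDs are known:
SELECT * FROM some_table WHERE _id BETWEEN 4 AND 8;
With contentResolver.query, the BETWEEN form can be passed through the selection argument.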

TSQL Cursor Alternative to Speed up my query

Row  Status   Time
1    Status1  1383264075
2    Status1  1383264195
3    Status1  1383264315
4    Status2  1383264435
5    Status2  1383264555
6    Status2  1383264675
7    Status2  1383264795
8    Status1  1383264915
9    Status3  1383265035
10   Status3  1383265155
11   Status2  1383265275
12   Status3  1383265395
13   Status1  1383265515
14   Status1  1383265535
15   Status2  1383265615
The [Time] column holds POSIX time
I want to be able to calculate the number of seconds a given [Status] is active for within a given time period, without using CURSORS. If that is the only way then that is fine, as I've already done that.
Using the above sample data extract, how do I calculate how long "Status1" has been active for?
That is, subtract Row1.[Time] from Row4.[Time], subtract Row8.[Time] from Row9.[Time], and subtract Row13.[Time] from Row15.[Time].
Thank you in advance.
Assuming that each row means the specified Status is active from its Time until the next row's Time, one has to calculate the difference between row N and row N+1. One way would be to use a nested query (try it here: SQL Fiddle).
SELECT SUM(Duration) AS Duration
FROM (
    SELECT f.Status, s.Time - f.Time AS Duration
    FROM Table1 f
    JOIN Table1 s ON s.Row = f.Row + 1   -- pair each row with its successor
    WHERE f.Status = 'Status1'
) a
The solution by @erikxiv will work if the Row values have no gaps. If they do have gaps, you could try the following method:
SELECT
    TotalDuration = SUM(next.Time - curr.Time)
FROM
    dbo.atable AS curr
CROSS APPLY
(
    SELECT TOP (1) Time
    FROM dbo.atable
    WHERE Row > curr.Row
    ORDER BY Row ASC
) AS next
WHERE
    curr.Status = 'Status1';
For every row matching the specified status, the correlated subquery in the CROSS APPLY clause will fetch the next Time value based on the ascending order of Row. The current row's time is then subtracted from the next row's time and all the differences are added up using SUM().
Please note that in both solutions it is implied that the order of Row values follows the order of Time values. In other words, ORDER BY Row is assumed to be equivalent to ORDER BY Time or, if Time can have duplicates, to ORDER BY Time, Row.
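As a side note, on SQL Server 2012 or later the same next-row lookup can be written with the LEAD window function; a sketch, assuming the dbo.atable schema above:
SELECT SUM(NextTime - Time) AS TotalDuration
FROM (
    SELECT Status, Time,
           LEAD(Time) OVER (ORDER BY Row) AS NextTime   -- Time of the next row
    FROM dbo.atable
) t
WHERE Status = 'Status1'
  AND NextTime IS NOT NULL;   -- the last row has no successor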

T-SQL Query to process data in batches without breaking groups

I am using SQL 2008 and trying to process the data I have in a table in batches, however, there is a catch. The data is broken into groups and, as I do my processing, I have to make sure that a group will always be contained within a batch or, in other words, that the group will never be split across different batches. It's assumed that the batch size will always be much larger than the group size. Here is the setup to illustrate what I mean (the code is using Jeff Moden's data generation logic: http://www.sqlservercentral.com/articles/Data+Generation/87901)
DECLARE @NumberOfRows INT = 1000,
        @StartValue INT = 1,
        @EndValue INT = 500,
        @Range INT
SET @Range = @EndValue - @StartValue + 1
IF OBJECT_ID('tempdb..#SomeTestTable','U') IS NOT NULL
    DROP TABLE #SomeTestTable;
SELECT TOP (@NumberOfRows)
       GroupID = ABS(CHECKSUM(NEWID())) % @Range + @StartValue
INTO #SomeTestTable
FROM sys.all_columns ac1
CROSS JOIN sys.all_columns ac2
This will create a table with about 435 groups containing between 1 and 7 records each. Now, let's say I want to process these records in batches of 100 records per batch. How can I make sure that my GroupIDs don't get split between different batches? I am fine if each batch is not exactly 100 records; it could be a little more or a little less.
I appreciate any suggestions!
This will result in batches of slightly fewer than 100 entries; it removes all groups that aren't entirely within the selection:
WITH cte AS (
    SELECT TOP 100 *
    FROM (
        SELECT GroupID,
               ROW_NUMBER() OVER (PARTITION BY GroupID ORDER BY GroupID) r
        FROM #SomeTestTable
    ) a
    ORDER BY GroupID, r DESC
)
SELECT c1.GroupID
FROM cte c1
JOIN cte c2
  ON c1.GroupID = c2.GroupID
 AND c2.r = 1
It selects the groups with the lowest GroupIDs, limited to 100 entries, into a common table expression along with a per-group row number. It then uses that row number to throw away any groups that aren't entirely in the selection: row number 1 must be present for a group to be kept, since the row numbers are ordered descending before cutting with TOP.
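Another option is to assign each group a batch number up front, so batches can then be processed by number. A sketch that avoids windowed running sums (not available on SQL 2008); BatchNo and the aliases are illustrative, and batches may slightly exceed 100 rows since a group that starts near the boundary carries over whole:
;WITH g AS (
    SELECT GroupID, COUNT(*) AS cnt
    FROM #SomeTestTable
    GROUP BY GroupID
)
SELECT g1.GroupID,
       1 + (running.total - g1.cnt) / 100 AS BatchNo   -- batch of the group's first row
FROM g AS g1
CROSS APPLY (
    SELECT SUM(g2.cnt) AS total        -- rows up to and including this group
    FROM g AS g2
    WHERE g2.GroupID <= g1.GroupID
) AS running;
Because the batch number is derived from where a group starts in the running total, every row of a group lands in the same batch.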