Add computed value as new column while also removing duplicates - postgresql

Let's say I have the following table playgrounds:
serialnumber  length  breadth  country
1             15      10       Brazil
2             12      11       Chile
3             14      10       Brazil
4             14      10       Brazil
Now, I want to add a column area to the table, that is essentially length*breadth.
Obviously, I can do this update:
UPDATE playgrounds set area = length*breadth where country = 'Brazil';
Using the above statement, I will have to unnecessarily compute length * breadth twice for serial numbers 3 and 4. Is there a way to add a GROUP BY and minimize the number of calculations?
Something like:
UPDATE playgrounds set area = length*breadth where country = 'Brazil'
group by length, breadth;

The first thing to note is that you should not add the area as a column. Data items that happen to be the result of simple arithmetic operations do not need their own column.
The second point is that you don't need to worry about doing a multiplication operation once each for rows 3 and 4. That's almost zero effort for the server.
The third point is that if you are worried about rows 3 and 4, that means they are duplicated, and duplicated data should not be in the database. Consider deleting the duplicates as described here: https://wiki.postgresql.org/wiki/Deleting_duplicates

To answer your question:
Is there a way I could add GROUP BY and minimize the amount of calculations?
SELECT DISTINCT ON (1,2,3)
length, breadth, country, length * breadth AS area
FROM playgrounds
ORDER BY 1, 2, 3, serialnumber;
This takes the row with the smallest serialnumber from each set of duplicates. Detailed explanation:
Select first row in each GROUP BY group?
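If you do end up storing area anyway, the DISTINCT ON result can drive a single UPDATE so that the multiplication is evaluated once per distinct combination. A minimal sketch, assuming the area column already exists:
UPDATE playgrounds p
SET    area = d.area
FROM  (
   SELECT DISTINCT ON (length, breadth, country)
          length, breadth, country, length * breadth AS area
   FROM   playgrounds
   WHERE  country = 'Brazil'
   ) d
WHERE  p.length  = d.length
AND    p.breadth = d.breadth
AND    p.country = d.country;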
But consider e4c5's answer and Pavel's comment first. Don't store functionally dependent values that can be computed on the fly cheaply. Just drop duplicate rows and use a view:
To permanently delete dupes with greater serialnumber:
DELETE FROM playgrounds p
WHERE EXISTS (
SELECT 1
FROM playgrounds
WHERE length = p.length
AND breadth = p.breadth
AND country = p.country
AND serialnumber < p.serialnumber
);
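To preview which rows would be removed before actually deleting anything, the same predicate can be used in a plain SELECT (just a sanity-check sketch):
SELECT *
FROM playgrounds p
WHERE EXISTS (
SELECT 1
FROM playgrounds
WHERE length = p.length
AND breadth = p.breadth
AND country = p.country
AND serialnumber < p.serialnumber
);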
Then:
CREATE VIEW playgrounds_plus AS
SELECT *, length * breadth AS area
FROM playgrounds;
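The view then exposes area as if it were a stored column, for example:
SELECT serialnumber, country, area
FROM playgrounds_plus
WHERE country = 'Brazil';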
Related:
Clean up SQL data before unique constraint

Related

get filtered rows by specific rules in postgres12

Additional info: my table works with 2 lines, a first and a second line; an id always starts with 1 or 2, but sometimes we have to reprocess a row, and its number gets updated.
I have a query that shows a lot of ids.
Usually, my ids start with 1 or 2.
for example:
1210001
2210001
1210002
1210003
2210002
Sometimes these rows are updated for several reasons; when the system updates a row, the first digit gets +2:
1210001
2210001
1210002
1210003
2210002
3210001 from 1210001
4210001 from 2210001
The same id can be updated 2 or 3 times:
1210001
2210001
1210002
1210003
2210002
3210001
4210001
5210001 from 3210001
7210001 from 5210001
How can I query only the last updated version of each id?
1210002
1210003
2210002
4210001
7210001
My table is composed of two working lines, line 1 and line 2.
For example, ids 1210001 and 2210001 are for line 1 and line 2 respectively.
In x21xxxx, the "21" is the year, and the last numbers (xxx0001) are consecutive for each line.
The first number can be odd or even; I am trying to come up with a query that removes the old ids from the result.
Check out db<>fiddle example
A bit of maths helps here.
Given that the lower 6 digits of the "id" field are significant for partial grouping, these lower 6 digits can be obtained by "id" mod 1000000.
The upper digits can then be obtained by integer division of the "id" by 1000000. The upper digits define groups based on whether they are odd or even (parity), so this query selects the latest updated "id" for each line:
select max(id) as last_updated
from t
group by ((id / 1000000) % 2, id % 1000000);
This query groups the rows by the parity of the upper digits (digit positions 7 and above) and by the numeric value of the lower 6 digits, then selects the largest id of each group.
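If you need the whole rows rather than just the ids, DISTINCT ON with the same grouping expressions is one option (a sketch, using the same table t):
select distinct on ((id / 1000000) % 2, id % 1000000) *
from t
order by (id / 1000000) % 2, id % 1000000, id desc;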
If you want to delete the older rows:
delete from t
where id not in
(select max(id) as last_updated
from t
group by ((id / 1000000) % 2, id % 1000000));
Check out db<>fiddle example

Select top x% records

The question is about how to get the top x% of records according to their ratings.
For example I have a table with a few columns, one of which is rating:
rating smallint
The value of rating is always positive.
My goal is to select top x% of entries according to their rating.
For example, for top 20%, if set of selected rows contains ratings like:
1,3,4,4,5,2,7,10,9
Then top 20% would be records with range from 8 to 10 → records with rating 9 and 10.
I implemented it in Django, but it takes 2 calls to the DB, and I believe it can be easily achieved with just one call via SQL in PostgreSQL.
Any ideas how to implement it?
This takes the maximum rating available in the column as the base for the percentage calculation.
Try this workaround:
select * from sample where rating >= (select max(rating) - max(rating)*20/100 from sample)
Demo on fiddle
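The same cutoff can also be written with a single numeric factor, which some find easier to read (a sketch against the same hypothetical sample table; for whole-number ratings it selects the same rows):
select *
from sample
where rating >= (select max(rating) * (1 - 0.20) from sample);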

Pivot Charts in Google Sheets by counting non-numeric data?

I have a dataset that I'd like to summarize in chart form. There are about 30 categories whose counts I'd like to display in a bar chart from about 300+ responses. I think a pivot table is probably the best way to do this, but when I create a pivot table and select multiple columns, each new column added gets entered as a sub-set of a previous column. My data looks something like the following
ID  Country  Age    thingA  thingB  thingC  thingD  thingE  thingF
1   US       5-9            thB             thD             thF
2   FI       5-9    thA                                     thF
3   GA       5-9    thA                                     thF
4   US       10-14                  thC
5   US       10-14          thB                             thF
6   US       15-18
7   BR       5-9    thA
8   US       15-18                          thD             thF
9   FI       10-14  thA
So, I'd like to be able to create an interactive chart that shows the counts of "thing" items; I'd then like to be able to filter based upon demographic data (e.g., Country, Age). Notice that the data is non-numeric, so I have to use a COUNTA to see how many there are in each category.
Is there a simple way to display chart data that summarizes the counts and will allow me to filter based on different criteria?
The QUERY function can summarize the data in the form you want. The fact that you have "thA", "thB", etc., instead of "1" complicates the matter, but one can transform the strings to numeric data on the fly.
Assuming the data you've shown is in the cells A1:I10, the following formula will summarize it:
=query({B2:C10, arrayformula(if(len(D2:I10), 1, 0))}, "select Col1, Col2, count(Col3), sum(Col3), sum(Col4), sum(Col5), sum(Col6), sum(Col7), sum(Col8) group by Col1, Col2", 0)
Explanation:
{B2:C10, arrayformula(if(len(D2:I10), 1, 0))} creates a table where the first two columns are your B,C (Country, Age) and the other six are filled with 1 or 0 depending on whether the cells in D-I are filled or not.
select Col1, Col2, count(Col3), sum(Col3), ... group by Col1, Col2 selects Country, Age, the total count of rows with this Country-Age combination, the number of rows with thingA for this Country-Age combination, etc.
the last argument, 0, indicates there are no header rows in the table passed to the query.
It's possible to give labels to the columns returned by the query, using label: see query language documentation. It would be something like
label Col1 'Country', Col2 'Age', count(Col3) 'Total count', sum(Col3) 'thingA count', ...
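Putting it all together, the formula with labels could look roughly like this (the label texts are just example names):
=query({B2:C10, arrayformula(if(len(D2:I10), 1, 0))}, "select Col1, Col2, count(Col3), sum(Col3), sum(Col4), sum(Col5), sum(Col6), sum(Col7), sum(Col8) group by Col1, Col2 label Col1 'Country', Col2 'Age', count(Col3) 'Total count', sum(Col3) 'thingA count', sum(Col4) 'thingB count', sum(Col5) 'thingC count', sum(Col6) 'thingD count', sum(Col7) 'thingE count', sum(Col8) 'thingF count'", 0)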
Add a Count column to your data with a "1" for every occurrence; this might solve your problem in the Pivot Table. I was just looking for a solution and thought of this. It is working for me now.

SQL Sum and Group By for a running Tally?

I'm completely rewriting my question to simplify it. Sorry if you read the prior version. (The previous version of this question included a very complex query example that created a distraction from what I really need.) I'm using SQL Express.
I have a table of lessons.
LessonID  StudentID  StudentName  LengthInMinutes
1         1          Chuck        120
2         2          George       60
3         2          George       30
4         1          Chuck        60
5         1          Chuck        10
These would be ordered by date. (Of course the actual table is thousands of records with dates and other lesson-related data but this is a simplification.)
I need to query this table such that I get all rows (or a subset of rows by a date range or by student), but I need my query to add a new column we might call PriorLessonMinutes. That is, the sum of all minutes of all lessons for the same student in lessons of PRIOR dates only.
So the query would return:
LessonID  StudentID  StudentName  LengthInMinutes  PriorLessonMinutes
1         1          Chuck        120              0
2         2          George       60               0
3         2          George       30               60   (the sum of Length from row 2 only)
4         1          Chuck        60               120  (the sum of Length from row 1 only)
5         1          Chuck        10               180  (the sum of Length from rows 1 and 4)
In essence, I need a running tally of the sum of prior lesson minutes for each student. Ideally the tally shouldn't include the current row, but if it does, no big deal as I can do subtraction in the code that receives the query.
Further, (and this is important) if I retrieve only a subset of records, (for example by a date range) PriorLessonMinutes must be a sum that considers rows that are NOT returned.
My first idea was to use SUM() and to GROUP BY Student, but that isn't right because unless I'm mistaken it would include a sum of minutes for all rows for each student, including rows that come after the row which aren't relevant to the sum I need.
OPTIONS I'M REJECTING: I could scan through all rows in the code that receives them (although this would force me to retrieve all rows unnecessarily), but that's obviously inefficient. I could also put a real data field in there and populate it, but this too presents problems when other records are deleted or altered.
I have no idea how to put such a query together. Any guidance?
This is a great opportunity to use Windowed Aggregates. The catch is that you need SQL Server 2012 Express, since earlier versions don't support the window-frame syntax used below. If you can get it, then this is the query you are looking for:
select *,
sum(LengthInMinutes)
over (partition by StudentId order by LessonId
rows between unbounded preceding and 1 preceding)
as PriorLessonMinutes
from Lessons
Note that it returns NULLs instead of 0s (zeroes). If you insist on zeroes, use COALESCE function to turn NULLs into zeroes.
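For example, wrapping the windowed sum in COALESCE (the same query, sketched with a zero default):
select *,
       coalesce(sum(LengthInMinutes)
                over (partition by StudentId order by LessonId
                      rows between unbounded preceding and 1 preceding), 0)
       as PriorLessonMinutes
from Lessons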
I suggest using a nested query to limit the number of rows returned:
select * from
(
select *,
sum(LengthInMinutes)
over (partition by StudentId order by LessonId
rows between unbounded preceding and 1 preceding)
as PriorLessonMinutes
from Lessons
) as NestedLessons
where LessonId > 3 -- this is an example of a filter
This way the filter is applied after the aggregation is complete.
Now, if you want to apply a filter that doesn't affect the aggregation (like only querying data for a certain student), you should apply the filter to the inner query, as pruning the rows that don't affect the computation early (like data for other students) will improve the performance.
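For instance, a per-student filter belongs in the inner query, so the window function never has to read the other students' rows (a sketch; the StudentId value is just an example):
select * from
(
    select *,
        sum(LengthInMinutes)
        over (partition by StudentId order by LessonId
              rows between unbounded preceding and 1 preceding)
        as PriorLessonMinutes
    from Lessons
    where StudentId = 1 -- this filter does not change the running tally for student 1
) as NestedLessons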
I feel the following code will serve your purpose. Check it:
select Students.StudentID, Students.First, Students.Last, sum(Lessons.LengthInMinutes) as TotalPriorMinutes
from Lessons, Students
where Lessons.StartDateTime < getdate()
and Lessons.StudentID = Students.StudentID
and StartDateTime >= '20090130 00:00:00' and StartDateTime < '20790101 00:00:00'
group by Students.StudentID, Students.First, Students.Last

SQL Server 2008: Pivot column with no aggregate function workaround

Yes I know, this question has been asked MANY times, but after reading all the posts I found that there wasn't an answer that fits my need. So, here's my question. I would like to take a column of values and pivot them into rows of 6 columns.
I want to take this:

G
081278
12
00123535
John Doe
123456

And turn it into this:

Letter  Date    Code  Ammount   Name      Account
G       081278  12    00123535  John Doe  123456
I have 110000 values in this one column in one table called TempTable. I need all the values displayed because each row is an entity to itself. For instance, There is one unique entry for all of the Letter, Date, Code, Ammount, Name, and Account columns. I understand that the aggregate function is required but is there a workaround that will allow me to get this desired result?
Just use a MAX aggregate
If one row = one column (per group of 6 rows) then MAX of a single value = that row value.
However, the data you've posted is insufficient. I don't see anything to:
associate the 6 rows per group
distinguish whether a row is "Letter" or "Name"
There is no implicit row order or number to rely upon to generate the groups (see the sketch below for what this would look like if there were).
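For illustration only: if the table did have, say, a row number rn that preserves the original order (it doesn't, as posted), the MAX trick could look roughly like this, with rn and val as hypothetical column names:
SELECT
    MAX(CASE WHEN (rn - 1) % 6 = 0 THEN val END) AS Letter,
    MAX(CASE WHEN (rn - 1) % 6 = 1 THEN val END) AS [Date],
    MAX(CASE WHEN (rn - 1) % 6 = 2 THEN val END) AS Code,
    MAX(CASE WHEN (rn - 1) % 6 = 3 THEN val END) AS Ammount,
    MAX(CASE WHEN (rn - 1) % 6 = 4 THEN val END) AS Name,
    MAX(CASE WHEN (rn - 1) % 6 = 5 THEN val END) AS Account
FROM TempTable
GROUP BY (rn - 1) / 6;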
Unfortunately, the maximum number of columns in a SQL Server 2008 SELECT statement is 4,096, as per the MSDN Maximum Capacity documentation.
Instead of using a pivot, you might consider dynamic SQL to get what you want.
Declare @SQLColumns nvarchar(max), @SQL nvarchar(max)

-- build a comma-separated list of quoted values: 'val1','val2',...
select @SQLColumns = (select '''' + ColName + ''',' from TableName for XML Path(''))
-- trim the trailing comma
set @SQLColumns = left(@SQLColumns, len(@SQLColumns) - 1)

set @SQL = 'Select ' + @SQLColumns
exec sp_ExecuteSQL @SQL, N''