I am trying to reduce the size of the data a T-SQL query returns to Reporting Services. For example, let's say we have the following row set:
ID  Country  City
1   Germany  Berlin
2   Germany  Berlin
3   Germany  Berlin
4   Germany  Berlin
5   Germany  Hamburg
6   Germany  Hamburg
7   Germany  Hamburg
8   Germany  Hamburg
9   Germany  Berlin
10  Germany  Berlin
It can easily be transformed to this:
ID  Country  City
1   Germany  Berlin
2   NULL     NULL
3   NULL     NULL
4   NULL     NULL
5   NULL     Hamburg
6   NULL     NULL
7   NULL     NULL
8   NULL     NULL
9   NULL     Berlin
10  NULL     NULL
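For reference, one way to produce this shape in T-SQL (SQL Server 2012 or later) is with LAG; this is a sketch, assuming the source table is called myTable:

SELECT
    ID,
    -- NULL out a value when it repeats the previous row's value
    CASE WHEN Country = LAG(Country) OVER (ORDER BY ID) THEN NULL ELSE Country END AS Country,
    CASE WHEN City = LAG(City) OVER (ORDER BY ID) THEN NULL ELSE City END AS City
FROM myTable;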
As I may have many thousands of duplicated values (and hundreds of columns), I know that transforming the data using NULLs like this dramatically reduces the size of the returned data.
Is it possible to implement an expression which gets the previous row's column value if the current one is NULL?
I want to test whether it will be faster to render the huge data as-is or to work with the smaller data but apply such an expression.
Why not just use GROUP BY?
SELECT MIN(Id) AS min, MAX(Id) AS max, Country, City
FROM myTable
GROUP BY Country, City
Or
SELECT COUNT(Id) AS rowRepeatNum, Country, City
FROM myTable
GROUP BY Country, City
The first approach would work if your Ids were sequential, although I'm not sure whether the Id itself is important.
The second way can give you a number to loop down to generate the repeat rows in your application fairly quickly, and it returns significantly fewer rows.
Can you give me more info on your use case?
Related
I have 2 tables with the exact same number of rows and the same non-repeated id. Because the data comes from 2 sources, I want to keep it as 2 tables and not combine them. I assume the best approach would be to leave the unique id as the primary key and join on it?
SELECT * FROM tableA INNER JOIN tableB ON tableA.id = tableB.id -- id = the shared primary key
The data is used by an application that forces the user to select 1 or many values from 5 drop-downs in cascading order:
select 1 or many values from tableA column1.
select 1 or many values from tableA column2, but filtered by the first selection.
select 1 or many values from tableA column3, but filtered by the second selection, which in turn is filtered by the first.
For example:
pk   Column 1  Column 2  Column 3
123  Doe       Jane      2022-01
234  Doe       Jane      2021-12
345  Doe       John      2022-03
456  Jones     Mary      2022-04
Selecting "Doe" from column1 would limit the second filter to ("Jane","John"). And selecting "Jane" from column2 would filter column3 to ("2022-01","2021-12")
And the last part of the question:
The application has 3 selection options for column3:
picking the exact value (for example "2022-01"), picking the year ("2022"), or picking the quarter that the month falls into ("Q1", which equates to "01", "02", "03").
What would be the best usage of indexes AND/OR additional columns for this scenario?
Volume of data would be 20-100 million rows.
Each filter is in the range of 5-25 distinct values.
Which version of Postgres are you running?
The volume you state is rather daunting for such a use case: populating drop-down boxes from live data in a PG db.
That said, it's possible; Kibana/Elastic even has a filter widget that works exactly this way, for instance.
My guess is you may consider storing the distinct combinations of the search columns in another table, simply to speed up populating the drop-downs. You can maintain that table with triggers on the two main tables. So instead of additional columns/indexes you may end up with an additional table ;)
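A minimal sketch of that idea, assuming Postgres 11+ and hypothetical names (filter_combos for the cache table, tableA as one of the source tables); deletes and updates would need similar handling, or a periodic rebuild:

-- small cache of distinct filter combinations, cheap to query
CREATE TABLE filter_combos (
    column1 text,
    column2 text,
    column3 text,
    PRIMARY KEY (column1, column2, column3)
);

CREATE FUNCTION add_filter_combo() RETURNS trigger AS $$
BEGIN
    INSERT INTO filter_combos (column1, column2, column3)
    VALUES (NEW.column1, NEW.column2, NEW.column3)
    ON CONFLICT DO NOTHING;  -- combination already cached
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_filter_combo
AFTER INSERT ON tableA
FOR EACH ROW EXECUTE FUNCTION add_filter_combo();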
Regarding indexing strategy, and given the hints you stated (AND/OR), I'd say there's no silver bullet. Index the columns that will be queried most often.
Index each column individually, because Postgres (starting from 11, IIRC) can combine multiple indexes to answer conjunctive/disjunctive conditions in WHERE clauses.
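For instance (a sketch with assumed names, and assuming column3 is text like '2022-01'):

CREATE INDEX idx_tablea_column1 ON tableA (column1);
CREATE INDEX idx_tablea_column2 ON tableA (column2);
CREATE INDEX idx_tablea_column3 ON tableA (column3);

-- expression index so "pick the year" lookups such as
-- WHERE left(column3, 4) = '2022' can also use an index
CREATE INDEX idx_tablea_year ON tableA (left(column3, 4));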
Hope this helps
I have a table with employment records. It has the Employee code, status, and the date when the record was updated.
Like this:
Employee  Status   Date
001       termed   01/01/2020
001       rehired  02/02/2020
001       termed   03/03/2020
001       rehired  04/04/2021
Problem: I need to get the length of the period the Employee was working for the company, and if it was less than a year, not display that record.
There could be multiple hire-rehire cycles for each Employee. 10-20 is normal.
So, I'm thinking about two separate selects into two tables, and then looking for the closest date from a hire in table 1 to a termination in table 2. But it seems like an overcomplicated idea.
Is there a better way?
Many approaches, but something like this could work:
SELECT
    Employee,
    SUM(DaysWorked) AS DaysWorked
FROM
(
    -- for each 'termed' row, measure the days to that employee's next
    -- non-'termed' row; if there is none, measure up to today instead
    SELECT
        a1.Employee,
        ISNULL(
            DATEDIFF(DD, a1.[Date],
                (SELECT TOP 1 [Date]
                 FROM aaa a2
                 WHERE a2.Employee = a1.Employee
                   AND a2.[Date] > a1.[Date]
                   AND a2.[Status] <> 'termed'
                 ORDER BY [Date])),
            DATEDIFF(DD, a1.[Date], GETDATE())
        ) AS DaysWorked
    FROM aaa a1
    WHERE a1.[Status] = 'termed'
) Totals
GROUP BY Totals.Employee
HAVING SUM(DaysWorked) >= 365
Also, using CROSS APPLY is an option and perhaps more efficient. In this example, replace 'aaa' with the actual table name. The ISNULL deals with an employee still working.
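For comparison, a window-function sketch (assuming SQL Server 2012+, the same table name aaa, and that each hire/'rehired' row is followed by its matching 'termed' row). It anchors on the hire rows instead, so each diff runs from a hire to the following termination, or to today for a still-active employee:

SELECT
    Employee,
    SUM(DaysWorked) AS DaysWorked
FROM
(
    SELECT
        Employee,
        [Status],
        -- days from this row to the employee's next status change,
        -- or to today if this is the employee's latest row
        DATEDIFF(DD, [Date],
                 ISNULL(LEAD([Date]) OVER (PARTITION BY Employee ORDER BY [Date]),
                        GETDATE())) AS DaysWorked
    FROM aaa
) p
WHERE [Status] <> 'termed'  -- keep only the hire -> termination spans
GROUP BY Employee
HAVING SUM(DaysWorked) >= 365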
I have a query that essentially does counting by group key in KDB, in which I want to treat some of the groups as one for the purpose of this query. A simplified description of what I'm trying to do would be to count orders by customer in a month, where I have a couple of customers in the database that are actually subsidiaries of another customer, and I want to combine the counts of the subsidiaries with their parent organisation. The real scenario is much more complicated than that; without getting into unnecessary detail, suffice it to say that I can't just group by customer and manipulate the results to merge counts after the query is executed - I need the "by" clause of my query to do the merging directly.
In SQL, I would do something like this:
select
    case when customer_id = 1 then 2
         when customer_id = 3 then 4
         else customer_id end as customer_id,
    count(*) as order_count
from orders
group by
    case when customer_id = 1 then 2
         when customer_id = 3 then 4
         else customer_id end
In the above example, customer 1 is a subsidiary of customer 2, customer 3 is a subsidiary of customer 4, and every other customer is treated normally.
Let's say the equivalent code in Q (without the manipulation of group keys) is:
select order_count:count i by customer_id from orders
How would I put in the equivalent select case statement to manipulate the group key? I tried this, but got a rank error:
select order_count:count i by $[customer_id=1;2;customer_id=3;4;customer_id] from orders
I'm terrible at Q so I'm probably making a very simple mistake. Any advice greatly appreciated.
One approach might be to have a dictionary of subsidiaries and use a lookup/re-map in your by clause:
q)dict:1 3!2 4
q)show t:([] order:1+til 10;customer:1+10?6)
order customer
--------------
1     1
2     1
3     6
4     2
5     3
6     4
7     5
8     5
9     3
10    5
q)select order_count:count i by customer^dict[customer] from t
customer| order_count
--------| -----------
2       | 3
4       | 3
5       | 3
6       | 1
You will lose some information about who actually owns the orders though; you'll only know at the parent level.
I have two tables: one containing distinct persons and another containing place names. Every person is coupled to a place name ID, and the place name ID gives more information about the place (for example the name, longitude and latitude).
The place name table is messy: there are a lot of semi-duplicates (names written a bit differently, e.g. London/Londen). For every place name I now also have the 'real' place name via the Google API.
Persons:
ID  Name   Birthplace
1   John   1
2   Sarah  2
3   Jane   3
4   Tom    4
Place names:
ID  PlaceName      GooglePlaceName
1   New York City  New York, NY, USA
2   Amsterdam      Amsterdam, Netherlands
3   Londen         London, UK
4   London         London, UK
So when looking at this data, Jane and Tom are actually from the same place.
I already have a query which gets the duplicate IDs from the place name table:
SELECT id
FROM placenames
WHERE googleplacename IN (
    SELECT googleplacename
    FROM placenames
    GROUP BY googleplacename
    HAVING COUNT(googleplacename) > 1
);
This returns:
ID
3
4
Now I'm wondering if it's possible to update the persons table so Jane and Tom both get the same Birthplace ID (it doesn't matter if it's 3 or 4), and afterwards remove the duplicate rows from the place name table, so that either the place name with ID 3 or the one with ID 4 remains, depending on which one is still referenced from the persons table.
If I'm totally going in the wrong direction by trying to solve this with SQL, I'd also like to know. I'm using Java and Spring to access the database.
Since it doesn't matter which id is used as the replacement, let's take the first id in a list of duplicates.
i.e.
birthplace
3
4
becomes
birthplace
3
3
To do this, first create a mapping of original and replacement id values.
Your select statement already yields the original ids; to that you can add the replacement ids using the window function FIRST_VALUE, partitioned by googleplacename.
Then use this mapping in the FROM clause of an UPDATE persons statement, joining on records where birthplace equals an original_id but not a replacement_id:
UPDATE persons
SET birthplace = replacement_id
FROM (
    SELECT id AS original_id,
           FIRST_VALUE(id) OVER (PARTITION BY googleplacename) AS replacement_id
    FROM placenames
    WHERE googleplacename IN (
        SELECT googleplacename FROM placenames GROUP BY 1 HAVING COUNT(*) > 1
    )
) replacement_table
WHERE birthplace = original_id
  AND birthplace != replacement_id;
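For the second step (removing the now-unreferenced duplicates), a sketch along the same lines. Note that FIRST_VALUE without an ORDER BY picks an arbitrary row per partition, so to be safe add ORDER BY id to both queries so the UPDATE and the DELETE agree on which row survives:

DELETE FROM placenames
WHERE id IN (
    SELECT id
    FROM (
        SELECT id,
               FIRST_VALUE(id) OVER (PARTITION BY googleplacename ORDER BY id) AS keep_id
        FROM placenames
    ) d
    WHERE id <> keep_id
);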
I need a select result to look like this:
UK
Europe
USA
The values are fixed (no table is needed). The order is important, so ORDER BY 1 is not working.
What is the SQL query (as simple as possible) that will build this result?
You could use VALUES lists:
VALUES ('UK'), ('Europe'), ('USA');
column1
---------
UK
Europe
USA
(3 rows)
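Note that without an ORDER BY the row order is technically not guaranteed; if you need it guaranteed, a sketch with an explicit sort key:

SELECT name
FROM (VALUES (1, 'UK'), (2, 'Europe'), (3, 'USA')) AS v(ord, name)
ORDER BY ord;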