The app I am working on is like Flickr, but with a concept of groups. Each group consists of multiple users, and a user can do activities like upload, share, comment, etc. only within their own group.
I am thinking of creating a schema per group, to organize the data under a per-group namespace and manage it easily and efficiently.
Will it have any adverse effect on database backup plans?
Are there any practical limits on the number of schemas per database?
When splitting identically-structured data into separate schemas, you need to be confident that you will never want to query it as a global entity again, because doing so is as cumbersome and anti-SQL as splitting it across different tables within the same schema.
As an example, say you have 100 groups of users, in 100 schemas named group1..group100, each with a photos table.
To get the total number of photos in your system, you'd need to do:
select sum(n) from
(
  select count(*) as n from group1.photos
  UNION ALL  -- UNION ALL, not UNION: plain UNION would merge groups that happen to have equal counts
  select count(*) as n from group2.photos
  UNION ALL
  select count(*) as n from group3.photos
  ...
  UNION ALL
  select count(*) as n from group100.photos
) as all_counts
This query, or a view built from it, also needs to be rebuilt any time a group is added or removed.
This is neither easy nor efficient; it's a programmer's nightmare.
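By contrast, with the conventional single-schema design (a minimal sketch, assuming one shared photos table carrying a group_id column), the same question is a one-line query that never needs rebuilding:

select count(*) from photos;
select group_id, count(*) from photos group by group_id; -- per-group counts, all groups at once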
So I have a complicated query; to simplify, let it be like this:
SELECT
t.*,
SUM(a.hours) AS spent_hours
FROM (
SELECT
person.id,
person.name,
person.age,
SUM(contacts.id) AS contact_count
FROM
person
JOIN contacts ON contacts.person_id = person.id
) AS t
JOIN activities AS a ON a.person_id = t.id
GROUP BY t.id
Such a query works fine in MySQL, but Postgres needs to know that the GROUP BY field is unique, and even though it actually is, in this case I would need to GROUP BY every field returned from the t subquery.
I can do that, but I don't believe it will work efficiently on big data.
I can't JOIN with activities directly in the first query, because a person can have several contacts, which would make the query count the hours of an activity once for every joined contact.
Is there a Postgres way to make this query work? Maybe a way to force Postgres to treat t.id as unique, or some other solution that achieves the same thing the Postgres way?
This query will not work on either database system: there is an aggregate function in the inner query, but you are not grouping by anything (unless you use window functions). Of course, there is a special case for MySQL: you can allow this by disabling sql_mode=only_full_group_by. So MySQL permits the usage because of a server configuration parameter, but you cannot do that in PostgreSQL.
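For reference, a common way to relax that mode for the current MySQL session (shown purely as an illustration):

SET SESSION sql_mode = (SELECT REPLACE(@@sql_mode, 'ONLY_FULL_GROUP_BY', ''));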
I knew MySQL allowed indeterminate grouping, but I honestly never knew how it implemented it... it always seemed imprecise to me, conceptually.
So depending on what that means (I'm too lazy to look it up), you might need one of two possible solutions, or maybe a third.
If your intent is to see all rows (perform the aggregate function but not consolidate/group rows), then you want a window function, invoked with PARTITION BY. Here is a really dumbed-down version based on your query:
SELECT
    t.*,
    SUM(a.hours) OVER (PARTITION BY t.id) AS spent_hours
FROM t
JOIN activities AS a ON a.person_id = t.id
This means you get all records from table t, not one record per t.id. But each row will also contain the sum of the hours across all rows with that value of id.
For example the sum column would look like this:
Name Hours Sum Hours
----- ----- ---------
Smith 20 120
Jones 30 30
Smith 100 120
Whereas a GROUP BY would have listed Smith only once and could not have displayed the Hours column in detail.
If you really did only want one row per t.id, then Postgres will require you to tell it how to determine which row. In the example above for Smith, do you want to see the 20 or the 100?
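For example, a minimal Postgres sketch (my illustration, not from the question, in the same dumbed-down form as above) that keeps, for each id, the joined row with the largest hours value:

SELECT DISTINCT ON (t.id)
    t.*,
    a.hours
FROM t
JOIN activities AS a ON a.person_id = t.id
ORDER BY t.id, a.hours DESC

DISTINCT ON requires the ORDER BY to start with the same expressions; the first row per t.id in that ordering is the one kept.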
There is another possibility, but I think I'll let you reply first. My gut tells me option 1 is what you're after and you want the analytic function.
I have a requirement to transfer data from 2 tables (Table A and Table B) into a new table.
I am using a query to join both A and B tables using an ID column.
Table A and B are archive tables without any indexes. (Millions of records)
Table X and Y are a replica of A and B with good indexes. (Some thousands of records)
Below is the code for my project.
with data as
(
    SELECT a.*, b.* FROM A_archive a
    JOIN B_archive b ON a.transaction_id = b.transaction_id
    UNION
    SELECT x.*, y.* FROM X x
    JOIN Y y ON x.transaction_id = y.transaction_id
)
INSERT INTO Another_Table
(
    columns
)
SELECT * FROM data
ON CONFLICT (transaction_id)
DO UPDATE ...
This whole thing runs in a production environment with nearly 140 million records.
Because of this, the production database takes almost 10 hours to process the data, and it fails.
I also have a distributed job scheduler in AWS that runs this query inside a function to retrieve the latest records every 5 hours. The archive tables store closed-invoice data. A Pega UI will use this table to retrieve data about closed invoices and show it to the customer.
Please suggest something that is a bit more performant.
UNION removes duplicate rows. On big unindexed tables that is an expensive operation. Try UNION ALL if you don't need deduplication; it will save you a s**tton of data shuffling and the comparisons required for deduplication.
Without indexes on your archival tables your JOIN operation will be grossly inefficient. Index, at a minimum, the transaction_id columns you use in your ON clause.
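For example (a sketch; the index names are placeholders of mine):

CREATE INDEX a_archive_transaction_id_idx ON A_archive (transaction_id);
CREATE INDEX b_archive_transaction_id_idx ON B_archive (transaction_id);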
You don't say what you want to do with the resulting table. In many cases you'll be able to use a VIEW rather than a table for your purposes. A VIEW removes the work of creating the derived table. Actually it defers the work to the time of SELECT operations using the derived structure. If your SELECT operations have highly selective WHERE clauses the savings can be astonishing. For this to work well you may need to put appropriate indexes on your archival tables.
You use SELECT * when you could enumerate the columns you need. That certainly puts one redundant column into your result: it generates two copies of transaction_id. It also may generate other redundant or unused data. Always avoid SELECT * in production software unless you know you need it.
Keep this in mind: SQL is declarative, not procedural. You declare (describe) the result you require, and you let the server work out the best way to get it. VIEWs let the server do this work for you in cases like your table combination. It will use the indexes you provide as best it can.
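A sketch of that approach (the view name and non-key columns are placeholders of mine; enumerate your real columns instead of *):

CREATE VIEW combined_invoices AS
SELECT a.transaction_id, a.some_col, b.other_col
FROM A_archive a
JOIN B_archive b ON a.transaction_id = b.transaction_id
UNION ALL
SELECT x.transaction_id, x.some_col, y.other_col
FROM X x
JOIN Y y ON x.transaction_id = y.transaction_id;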
That UNION must be costly: it pretty much builds a temp table in the background containing all the A-B and X-Y records, sorts it (over all fields), and then removes any duplicates. If, as you say, 100+ million records are involved, that's a LOT of sorting, which will most likely spill out to disk.
Keep in mind that you only need that deduplication if duplicates are expected:
- in the result of the JOIN between A and B,
- in the result of the JOIN between X and Y, or
- in the combined result of the two above.
If none of those are expected, just use UNION ALL.
In fact, in that case, why not have one INSERT operation for A-B and another for X-Y? Going by the description, I'd say that whatever is in X-Y should overrule whatever is in A-B anyway, right?
Also, as mentioned by O.Jones, archive tables or not, they should come with at least a (preferably clustered) index on the transaction_id fields you're JOINing on (the same goes for Another_Table, by the way).
All that said, processing 100M records in one transaction IS going to take some time; it's just a lot of data being moved around. But 10 hours does sound excessive indeed.
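A minimal sketch of that two-INSERT idea (column names are placeholders of mine, and I'm assuming the X-Y data should win, so it runs second with DO UPDATE):

-- First pass: archive data; keep existing rows on conflict.
INSERT INTO Another_Table (transaction_id, some_col, other_col)
SELECT a.transaction_id, a.some_col, b.other_col
FROM A_archive a
JOIN B_archive b ON a.transaction_id = b.transaction_id
ON CONFLICT (transaction_id) DO NOTHING;

-- Second pass: current data; overwrite on conflict.
INSERT INTO Another_Table (transaction_id, some_col, other_col)
SELECT x.transaction_id, x.some_col, y.other_col
FROM X x
JOIN Y y ON x.transaction_id = y.transaction_id
ON CONFLICT (transaction_id)
DO UPDATE SET some_col = EXCLUDED.some_col, other_col = EXCLUDED.other_col;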
I have two non-partitioned tables:
q)s:([] date:(2019.07.01;2019.07.01;2019.07.02;2019.07.01;2019.07.05); co:`a`b`f`b`c)
q)t:([] date:(2019.07.01;2019.07.01;2019.07.02;2019.07.01;2019.07.07); co:`a`b`e`b`d)
When I run the query below on the tables above, it works perfectly fine.
q)select distinct co from s,t where date within 2019.07.01 2019.07.02
co
--
a
b
f
e
I also have tables with the same names that are partitioned by date. When I try to run the same query on the partitioned tables, I get the error below:
ERROR: 'par
(trying to update a physically partitioned table)
Why do we get the above error on partitioned tables?
What is the optimal approach to get output similar to what we got from the non-partitioned tables?
One solution for question 2, which feels like brute force, is:
select distinct co from((select distinct co from s where date within 2019.07.01 2019.07.02),select distinct co from t where date within 2019.07.01 2019.07.02)
I'm assuming you are only including the date column in the source tables to assist in queries. A date-partitioned table generates the virtual date column from the HDB structure; you shouldn't include it in the actual table being written to.
Why do we get the above error on partitioned tables?
There is no way to access the data of a partitioned table except through an initial select statement. In this case you are trying to apply the , (join) operator directly to the s and t tables.
What is the optimal approach to get output similar to what we got from the non-partitioned tables?
In general, there may be a trade-off between the table size and the nature and frequency of the operations; sometimes it may be worth bringing the table into memory for frequent joins, or creating a top-level flat table with the relevant subset of data.
If this is just a generalized test case for larger operations, then something along the following lines would be ideal:
distinct raze {select distinct co from x where date within 2019.07.01 2019.07.02} each `s`t
Its performance is not very different from your own query's, however; it's just a bit more succinct.
Expecting hundreds of millions of rows and a write-heavy application.
We need to return SELECT COUNT(*) FROM orders and SELECT SUM(amount) FROM orders quite frequently, and both are too slow to run on every request.
We are thinking about adding a special table called stats with just a single row. It has total_orders and total_amount, which we would increase every time we add a new order. Is this kind of SQL "cache" table a practical solution? What does it mean in terms of write performance?
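For concreteness, a minimal sketch of what we have in mind (assuming PostgreSQL; the trigger wiring is just an illustration):

CREATE TABLE stats (total_orders bigint NOT NULL, total_amount numeric NOT NULL);
INSERT INTO stats VALUES (0, 0);

-- Keep the single stats row in sync on every insert into orders.
CREATE FUNCTION bump_stats() RETURNS trigger AS $$
BEGIN
    UPDATE stats
    SET total_orders = total_orders + 1,
        total_amount = total_amount + NEW.amount;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_bump_stats
AFTER INSERT ON orders
FOR EACH ROW EXECUTE FUNCTION bump_stats();

(Every order insert then updates the same single row, which is where the write-performance question bites.)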
Another option is to use Memcached or Redis, but they can get out of sync and are not persistent. Any other ideas?
Can anyone please help me write a single query combining these two queries?
I am using IBM DB2.
(SELECT
    TABLE1.COLS, TABLE2.COLS, TABLE3.COLS
FROM
    TABLE1, TABLE2, TABLE3, TABLE_PROB
WHERE
    TABLE_PROB.COL = TABLE1.COL AND OTHER_CLAUSE)
UNION
(SELECT
    TABLE1.COLS, TABLE2.COLS, TABLE3.COLS
FROM
    TABLE1, TABLE2, TABLE3, TABLE_PROB1
WHERE
    TABLE_PROB1.COL = TABLE1.COL AND OTHER_CLAUSE)
The two queries before and after the UNION are the same, except that "TABLE_PROB" is replaced by "TABLE_PROB1". No columns are selected from these two tables; they are only used for filtering in the WHERE clause.
Can anyone tell me how to combine both into a single query?
This query can be considered for the following scenario.
There are a few employee details tables which contain the details of all employees.
"TABLE_PROB" contains the list of contract employees and "TABLE_PROB1" contains the list of permanent employees. I need to get the details of both the contract and the permanent employees based on a few criteria.
Since the query has a big WHERE clause and SELECT clause, firing two queries with a UNION increases the cost of the query, so I need to merge them into a single query.
Thanks for the help in advance.
You cannot avoid the UNION because you still have to access both TABLE_PROB and TABLE_PROB1. Depending on your DB2 version, platform, and the system configuration this might perform a bit better:
SELECT
TABLE1.COLS,TBLE2.COLS,TABLE3.COLS
FROM
TABLE1,TABLE2,TABLE3
WHERE
OTHER_CLAUSE
AND
EXISTS (
SELECT 1
FROM TABLE_PROB
WHERE COL=TABLE1.COL
UNION
SELECT 1
FROM TABLE_PROB1
WHERE COL=TABLE1.COL
)
Depending on the contents of TABLE_PROB.COL and TABLE_PROB1.COL, UNION ALL instead of UNION might also prove beneficial.