I have 2 tables, let's call them T_FATHER and T_CHILD, where each father can have multiple children, like so:
T_FATHER
--------------------------
ID - BIGINT, from Generator
T_CHILD
-------------------------------
ID - BIGINT, from Generator
FATHER_ID - BIGINT, Foreign Key
Now I want to add a counter to the T_CHILD table that starts at 1 and increments by 1 for every new child, not globally but per father, like:
ID | FATHER_ID | COUNTER
---+-----------+--------
 1 |         1 |       1
 2 |         1 |       2
 3 |         2 |       1
My initial thought was to create a before-insert trigger that counts how many children are present for the given father and adds 1 for the counter. This should work fine unless there are 2 inserts at the same time, which would end up with the same counter. Chances are high that this never actually happens, but better safe than sorry.
I don't know if it is possible to use a generator, but I don't think so, as there would have to be a generator per father.
My current approach is to use the aforementioned trigger and add a unique index on FATHER_ID + COUNTER, so that only one of two simultaneous inserts goes through. I will have to handle the exception client-side (and reattempt the failed insert).
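Roughly what I have in mind, sketched with placeholder names (a unique constraint standing in for the unique index):
ALTER TABLE T_CHILD ADD CONSTRAINT UQ_CHILD_FATHER_COUNTER UNIQUE (FATHER_ID, COUNTER);

SET TERM ^ ;
CREATE TRIGGER T_CHILD_BI FOR T_CHILD
ACTIVE BEFORE INSERT POSITION 0
AS
DECLARE VARIABLE CNT INTEGER;
BEGIN
    /* count the existing children of this father and take the next number */
    SELECT COUNT(*) FROM T_CHILD
        WHERE FATHER_ID = NEW.FATHER_ID
        INTO :CNT;
    NEW.COUNTER = CNT + 1;
END^
SET TERM ; ^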
Is there a better way to handle this directly in Firebird?
PS: There won't be any deletes on either of the two tables, so this is not an issue.
Even with a generator per FATHER_ID you couldn't use them for this, because generators are not transaction safe. If your transaction is rolled back for whatever reason, the generator will have increased anyway, causing gaps.
If there are no deletes, I think your approach with a unique constraint is valid. I would consider an alternative however.
You could decide not to store the counter as such – storing counters in a database is often a bad idea. Instead, only track the insertion order. For that, a generator is usable, because every new record will have a higher value and gaps won't matter. In fact, you don't need anything but the ID you already have. Determine the numbering when selecting; for every child you want to know how many children there are with the same father but a lower ID. As a bonus, deletes would work normally.
Here's a proof of concept using a nested query:
SELECT ID, FATHER_ID,
       (SELECT 1 + COUNT(*)
        FROM T_CHILD AS OTHERS
        WHERE OTHERS.FATHER_ID = C.FATHER_ID
          AND OTHERS.ID < C.ID) AS COUNTER
FROM T_CHILD AS C
There's also the option of a window function. It has to be wrapped in a derived table so that rows which are ultimately filtered out are still counted:
SELECT * FROM (
    SELECT ID, FATHER_ID,
           ROW_NUMBER() OVER (PARTITION BY FATHER_ID ORDER BY ID) AS COUNTER
    FROM T_CHILD
    -- Filtering that wouldn't affect COUNTER (e.g. WHERE FATHER_ID ... AND ID < ...)
)
-- Filtering that would affect COUNTER (e.g. WHERE ID > ...)
These two options have completely different performance characteristics. Which one, if either at all, is suitable for you depends on your data size and access patterns.
What about trying a computed field combined with the SELECT solution of Thijs van Dien?
CREATE TABLE T_CHILD(
    ID INTEGER,
    FATHER_ID INTEGER,
    COUNTER COMPUTED BY (
        (SELECT 1 + COUNT(*)
         FROM T_CHILD AS OTHERS
         WHERE OTHERS.FATHER_ID = T_CHILD.FATHER_ID
           AND OTHERS.ID < T_CHILD.ID)
    )
);
Alternatively, during the insert you could just do a "SELECT ... COUNT(*) + 1" directly for that field.
But I would probably reconsider adding that field in the first place. It feels like redundant information that could easily be deduced at the moment you need it (for example, by using DENSE_RANK: http://www.firebirdfaq.org/faq343/).
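A quick sketch of what that on-the-fly deduction could look like with DENSE_RANK (assuming Firebird 3 or later, which has window functions built in):
SELECT ID, FATHER_ID,
       DENSE_RANK() OVER (PARTITION BY FATHER_ID ORDER BY ID) AS COUNTER
FROM T_CHILD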
Related
I'm working on a system that needs to be able to find the "state" of an item at a particular time in history. The state is binary (either on or off). In this case it's used to decide where to direct (to which particular "keyspace") a piece of timestamped data, based on the timestamp of the data. I'm having a hard time deciding what the best way to model the data is.
Method 1 is to use the tstzrange with state being implied by the bounds of the range:
create extension btree_gist;
create table core.range_director (
    range tstzrange,
    directee_id text,
    keyspace text,
    -- allow a directee to be directed to multiple keyspaces at once
    exclude using gist (directee_id with =, keyspace with =, range with &&)
);
insert into core.range_director values
('[2021-01-15 00:00:00 -0:00,2021-01-20 00:00:00 -0:00)', 'THING_ID', 'KEYSPACE_1'),
('[2021-01-15 00:00:00 -0:00,)', 'THING_ID', 'KEYSPACE_2');
select keyspace from core.range_director
where directee_id = 'THING_ID' and range_director.range @> '2021-01-15'::timestamptz;
-- returns KEYSPACE_1 and KEYSPACE_2
select keyspace from core.range_director
where directee_id = 'THING_ID' and range_director.range @> '2021-01-21'::timestamptz;
-- returns KEYSPACE_2
Method 2 is to have explicit state changes:
create table core.status_director (
    status_time timestamptz,
    status text,
    directee_id text,
    keyspace text
); -- not sure what pk to use for this method
insert into core.status_director values
('2021-01-15 00:00:00 -0:00','Open','THING_ID','KEYSPACE_1'),
('2021-01-20 00:00:00 -0:00','Closed','THING_ID','KEYSPACE_1'),
('2021-01-15 00:00:00 -0:00','Open','THING_ID','KEYSPACE_2');
select distinct on(keyspace) keyspace, status from core.status_director
where directee_id = 'THING_ID'
and status_time < '2021-01-16'
order by keyspace, status_time desc;
-- returns KEYSPACE_1:Open KEYSPACE_2:Open
select distinct on(keyspace) keyspace, status from core.status_director
where directee_id = 'THING_ID'
and status_time < '2021-01-21'
order by keyspace, status_time desc;
-- returns KEYSPACE_1:Closed, KEYSPACE_2:Open
-- so, client code has to ensure that it only directs to status=Open keyspaces
Maybe there are other methods that would work as well, but these two seem to make the most sense to me. The benefit of the first method is the really easy query, but the downside is that you now have to update rows to close out a state, whereas in the second method you can just insert new states, which seems easier.
The table could conceivably grow into thousands or tens of thousands of rows, but will probably not grow into millions (but does the best method change depending on the expected row count?). I have a couple of similar tables with the same point-in-time "state" queries, so it's really important that I get the model for them right.
My instinct is to go with Method 1, but are there any footguns or performance considerations that I'm not thinking of that would push the use case towards Method 2 (or another method I haven't considered)?
No footguns with Method 1, just great big huge cannons. With that method, how do you determine the current status? You need to scan each status change and toggle the status for each one, or perhaps use something like count(*) % 2, where odd gives one state and even the other. What happens if any row gets deleted, or data is purged, and you no longer know how many state transitions there were? With Method 2 you retrieve the greatest date and directly obtain the status.
For myself I would go with Method 3, that being Method 1 + Method 2. Yes, I would have a date range for the status and the status value itself. That gives me complex historical analysis, since I have the complete history, as well as direct access to the current status at any time.
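A rough sketch of what that could look like, reusing the btree_gist exclusion idea from the question (the table and column names here are made up):
create table core.range_status_director (
    range tstzrange,
    status text,        -- e.g. 'Open' / 'Closed', stored explicitly
    directee_id text,
    keyspace text,
    exclude using gist (directee_id with =, keyspace with =, range with &&)
);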
So after doing a bunch of research on the topic I found that my case is a variation of a "Valid-Time State Table". See ch. 2 and ch. 5 of Developing Time-Oriented Database Applications in SQL by Richard Snodgrass.
The support for these tables isn't great, but it's not terrible either (at least PostgreSQL has tstzranges to work with). Method 1 of my post is largely sufficient; the main wrinkle is maintaining integrity between the state table and other tables.
Since PostgreSQL doesn't have native support for these kinds of temporal tables, you have to build referential integrity yourself. There's a bunch of ways to do this, but for anyone in the future looking for some direction, here is an example of what that might look like for a referential query on two bitemporal tables:
create table a (
    row_id bigserial, -- to track individual rows
    id int,
    pov tstzrange, -- period of validity
    pop tstzrange -- period of presence
);
create table b (
    row_id bigserial,
    id int,
    pov tstzrange,
    pop tstzrange,
    a_id int
);
-- are we good?
with each_pov as (
    select bool_or(a.pov @> b.pov) as ok
    from a
    join b on a.id = b.a_id
        and upper(a.pop) is null
        and upper(b.pop) is null
    group by b.pov
) select coalesce(
    bool_and(each_pov.ok),
    (select count(*) = 0 from b where upper(pop) is null)
) from each_pov;
You can put the query into a constraint trigger on both the main table and the referenced table to get something approaching sequenced referential integrity for the current period of presence.
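For anyone curious, a minimal sketch of what such a constraint trigger could look like (the function and trigger names are made up, and it is shown on table b only; the body just re-runs the integrity query above and raises if it comes back false):
create or replace function check_b_references_a() returns trigger as $$
declare
    is_ok boolean;
begin
    -- same integrity check as the query above, stored into a local variable
    with each_pov as (
        select bool_or(a.pov @> b.pov) as ok
        from a
        join b on a.id = b.a_id
            and upper(a.pop) is null
            and upper(b.pop) is null
        group by b.pov
    )
    select coalesce(
        bool_and(each_pov.ok),
        (select count(*) = 0 from b where upper(pop) is null)
    ) into is_ok from each_pov;

    if not is_ok then
        raise exception 'sequenced referential integrity violated between b and a';
    end if;
    return null;  -- return value is ignored for AFTER triggers
end;
$$ language plpgsql;

create constraint trigger b_references_a
    after insert or update or delete on b
    deferrable initially deferred
    for each row execute function check_b_references_a();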
Is there a way to select rows until some condition is met? That is, a kind of limit, but not limited to N rows; rather, all the rows up to the first non-matching row?
For example, say I have the table:
CREATE TABLE t (id SERIAL PRIMARY KEY, rank INTEGER, value INTEGER);
INSERT INTO t (rank, value) VALUES ( 1, 1), (2, 1), (2,2),(3,1);
that is:
test=# SELECT * FROM t;
id | rank | value
----+------+-------
1 | 1 | 1
2 | 2 | 1
3 | 2 | 2
4 | 3 | 1
(4 rows)
I want to order by rank, and select up until the first row that is over 1.
I.e. SELECT * FROM t ORDER BY rank UNTIL value>1
and I want the first 2 rows back?
One solution is to use a subquery and bool_and:
SELECT * FROM (
    SELECT id, rank, value,
           bool_and(value < 2) OVER (ORDER BY rank, id) AS ok
    FROM t
    ORDER BY rank
) t2
WHERE ok = true
But won't that end up going through all rows, even if I only want a handful?
(Real-world context: I have timestamped events in a table. I can use a lead/lag window query to get the time between two events; I want all events from now going back as long as they happened less than 10 minutes apart. The lead/lag window query complicates things, hence the simplified example here.)
edit: made window-function order by rank, id
What you want is a sort of stop-condition. As far as I am aware there is no such thing in SQL, at least not in PostgreSQL's dialect.
What you can do is use a PL/PgSQL procedure to read rows from a cursor and return them until the stop condition is met. It won't be super fast, but it'll be alright. It's just a FOR loop over a query with an IF expression THEN exit; ELSE return next; END IF;. No explicit cursor is required because PL/PgSQL will use one internally if you FOR loop over a query.
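A minimal sketch of such a function, using the example table t from the question (the function name is made up):
CREATE OR REPLACE FUNCTION rows_until_value_exceeds_one()
RETURNS SETOF t AS $$
DECLARE
    r t%ROWTYPE;
BEGIN
    FOR r IN SELECT * FROM t ORDER BY rank, id LOOP
        IF r.value > 1 THEN
            EXIT;              -- stop condition met, stop fetching
        END IF;
        RETURN NEXT r;         -- otherwise emit the row
    END LOOP;
END;
$$ LANGUAGE plpgsql;

-- usage: SELECT * FROM rows_until_value_exceeds_one();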
Another option is to create a cursor and read chunks of rows from it in the application, then discard part of the last chunk once the stop condition is met.
Either way, a cursor is going to be what you want.
A stop expression wouldn't actually be too hard to implement in PostgreSQL by the way. You'd have to implement a new executor node type, but the new CustomScan support would make that practical to do in an extension. Then you'd just evaluate an expression to decide whether or not to carry on fetching rows.
You can try something such as:
select * from t, (
select rank from t where value = 1 order by "rank" limit 1) x
where t.rank <= x.rank order by rank;
It will make two passes through the first part of the table (which you might be able to cut by creating an index on (rank, value = 1)) but shouldn't evaluate the rest of the table if you have an index on rank.
[If you could have window expressions in WHERE clauses, you could use one to make sure no previous rows had value = 1... but even if this were possible, getting the query evaluator to use it to limit the search would be yet another challenge.]
This may be no better than your solution, since you yourself raised the question, "won't that end up going through all rows?"
I can tell you this: the explain plan is different from that of your solution. I don't know how the guts of PostgreSQL work, but if I were writing a "max" function, I would think it would always be O(n). By contrast, you had an ORDER BY, which is average case O(n log n), worst case O(n^2).
That said, I cannot deny that this will go through all rows:
select * from sandbox.t
where id < (select min (id) from sandbox.t where value > 1)
One thing to clarify, though, is that unless you scan all rows, I'm not sure how you could determine the minimum value. Any time you invoke an aggregate concept across all records, doesn't that mean that you must read all rows?
I need to update thousands of rows in a table. For example, I have 1000 rows with ids 1, 2, .., 1000:
mytable:
| id | value1 | value2 |
| 1 | Null | Null |
| 2 | Null | Null |
...
| 1000 | Null | Null |
Now I need to change the first 10 rows. I can do it like this:
UPDATE mytable SET value1=42, value2=111 WHERE id=1
...
UPDATE mytable SET value1=42, value2=111 WHERE id=10
This requires too many requests and is not very fast, so I decided to try this optimization:
UPDATE mytable SET value1=42 WHERE id in (1, 2, 3.. 10)
UPDATE mytable SET value2=111 WHERE id in (1, 2, 3.. 10)
Note: In this case I could actually write SET value1=42, value2=111, but in real-world applications these sets of ids are not the same: for some rows I need to set value1, for others value2, and for some subset of rows both. Because of that I need two queries.
The problem is that I have a very large number of ids. These queries end up being about 1 MB each!
Q1: Is this the right way to optimize these updates?
Q2: Is it right to send queries that are so large? Can I get a faster update by dividing the query into several smaller parts?
I can't express the set of rows with a WHERE condition; I just have lots of row ids in my program.
Create a TEMPORARY TABLE and populate it with your target ids and new values. Then use UPDATE with a FROM clause to join against that table and do it in a single command.
In general, whenever you have large numbers of ids/values like this, life gets easier if you move them into the database first.
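A rough sketch of what that could look like (the temporary table name is made up and a bigint id is assumed; NULL in the staging table means "leave that column alone", which also covers the case where different rows need different columns set):
CREATE TEMPORARY TABLE tmp_updates (id bigint PRIMARY KEY, value1 int, value2 int);

-- populate it in bulk with a multi-row INSERT or COPY
INSERT INTO tmp_updates (id, value1, value2) VALUES
    (1, 42, 111),
    (2, 42, NULL);

-- apply everything in one statement
UPDATE mytable m
SET value1 = COALESCE(u.value1, m.value1),
    value2 = COALESCE(u.value2, m.value2)
FROM tmp_updates u
WHERE m.id = u.id;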
Q1: Is this the right way to optimize these updates?
It should still be possible to write it as one single query, using the CASE ... WHEN construct:
UPDATE mytable SET
    value1 = CASE
                 WHEN id IN ( 1, 2, 3, 10) THEN 42
                 WHEN id IN (11,12,13, 20) THEN 43
                 ELSE value1
             END,
    value2 = CASE
                 WHEN id IN ( 1, 2, 3, 10) THEN 42
                 WHEN id IN (11,12,13, 20) THEN 43
                 ELSE value2
             END;
etc.
You mentioned that you may have to update rows in multiple spots, and the above lets you do that without a problem in one single query.
Update: I overlooked the fact that speed was your main concern (you said "optimize"), and my answer is not correct in that regard. Using a temporary table as explained in the chosen answer leads to much better performance.
Q2: Is it right to send queries that are so large? Can I get a faster update by dividing the query into several smaller parts?
I don't think that PostgreSQL should have much of a problem handling a large query (even much larger than 1 MB). Remember that SQL DB initialization scripts can be way larger than 1 MB.
Is there a way in Postgres to lock rows until the next select query execution from the same system? And one more thing: there will be no update process on the locked rows.
The scenario is something like this:
If table1 contains data like
id | txt
-------------------
1 | World
2 | Text
3 | Crawler
4 | Solution
5 | Nation
6 | Under
7 | Padding
8 | Settle
9 | Begin
10 | Large
11 | Someone
12 | Dance
If sys1 executes
select * from table1 order by id limit 5;
then it should lock the rows with id 1 to 5 for other systems which are executing select statements concurrently.
Later, if sys1 executes another select query like
select * from table1 where id>10 order by id limit 5;
then the previously locked rows should be released.
I don't think this is possible. You cannot block read-only access to a table (unless that select is done FOR UPDATE).
As far as I can tell, the only chance you have is to use the pg_advisory_lock() function.
http://www.postgresql.org/docs/current/static/functions-admin.html#FUNCTIONS-ADVISORY-LOCKS
But this requires a "manual" release of the locks obtained through it. You won't get an automatic unlocking with that.
To lock the rows you would need something like this:
select pg_advisory_lock(id), *
from
(
    select * from table1 order by id limit 5
) t
(Note the use of the derived table for the LIMIT part. See the manual link I posted for an explanation)
Then you need to store the retrieved IDs and later call pg_advisory_unlock() for each ID.
If each process is always releasing all IDs at once, you could simply use pg_advisory_unlock_all() instead. Then you will not need to store the retrieved IDs.
Note that this will not prevent others from reading the rows using "normal" selects. It will only work if every process that accesses that table uses the same pattern of obtaining the locks.
It looks like you really have a transaction which transcends the borders of your database, and all the changes happen in another system.
My idea is SELECT ... FOR UPDATE NOWAIT to lock the relevant rows, then offload the data into the other system, then roll back to unlock the rows. No two SELECT ... FOR UPDATE queries will select the same rows, and the second select will fail immediately rather than wait and proceed.
But you don't seem to mark offloaded records in any way; I don't see why two non-consecutive selects won't happily select overlapping ranges. So I'd still update the records with a flag and/or a target user name, and would only select records with the flag unset.
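A rough sketch of that combination (the offloaded flag column is hypothetical, not something your table has today):
BEGIN;

-- grab and mark five unhandled rows in one statement
WITH grabbed AS (
    SELECT id
    FROM table1
    WHERE offloaded = false
    ORDER BY id
    LIMIT 5
    FOR UPDATE NOWAIT       -- errors out at once if another session already locked these rows
)
UPDATE table1
SET offloaded = true
FROM grabbed
WHERE table1.id = grabbed.id
RETURNING table1.*;

-- ... hand the returned rows to the other system ...

COMMIT;   -- or ROLLBACK to undo the flag and release the locks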
I tried both select ... for update and pg_try_advisory_lock and managed to get close to my requirement.
/* rows are locked, but LIMIT is the problem */
select * from table1 where pg_try_advisory_lock(id) limit 5;
.
.
$_SESSION['rows'] = $rowcount; // number of rows to process
.
.
/* after each word is processed */
$_SESSION['rows'] -= 1;
.
.
/* and finally unlock the locked rows */
if ($_SESSION['rows'] === 0)
    select pg_advisory_unlock_all() from table1
But there are two problems with this:
1. As LIMIT is applied before the lock, every instance ends up trying to lock the same rows.
2. I'm not sure whether pg_advisory_unlock_all will release only the locks taken by the current instance or the locks of all instances.
I have a table in my database and I want each row in my table to have a unique id and the rows to be numbered sequentially.
For example: I have 10 rows, each with an id, starting from 0 and ending at 9. When I remove a row from the table, let's say row number 5, a "hole" occurs. Afterwards I add more data, but the "hole" is still there.
It is important for me to know the exact number of rows and to have data at every id, so that I can access my table arbitrarily by position.
Is there a way in SQLite to do this? Or do I have to manage the removing and adding of data manually?
Thank you in advance,
Ilya.
It may be worth considering whether you really want to do this. Primary keys usually should not change through the lifetime of the row, and you can always find the total number of rows by running:
SELECT COUNT(*) FROM table_name;
That said, the following trigger should "roll down" every ID number whenever a delete creates a hole:
CREATE TRIGGER sequentialize_ids AFTER DELETE ON table_name FOR EACH ROW
BEGIN
    UPDATE table_name SET id = id - 1 WHERE id > OLD.id;
END;
I tested this on a sample database and it appears to work as advertised. If you have the following table:
id name
1 First
2 Second
3 Third
4 Fourth
And delete where id=2, afterwards the table will be:
id name
1 First
2 Third
3 Fourth
This trigger can take a long time and has very poor scaling properties (it takes longer for each row you delete and each remaining row in the table). On my computer, deleting 15 rows at the beginning of a 1000 row table took 0.26 seconds, but this will certainly be longer on an iPhone.
I strongly suggest that you rethink your design. In my opinion you're asking for trouble in the future (e.g. if you create another table and want to have some relations between the tables).
If you want to know the number of rows just use:
SELECT count(*) FROM table_name;
If you want to access rows in the order of id, just define this field using PRIMARY KEY constraint:
CREATE TABLE test (
id INTEGER PRIMARY KEY,
...
);
and get rows using ORDER BY clause with ASC or DESC:
SELECT * FROM table_name ORDER BY id ASC;
Sqlite creates an index for the primary key field, so this query is fast.
I think that you would be interested in reading about LIMIT and OFFSET clauses.
The best source of information is the SQLite documentation.
If you don't want to take Stephen Jennings's very clever but performance-killing approach, just query a little differently. Instead of:
SELECT * FROM mytable WHERE id = ?
Do:
SELECT * FROM mytable ORDER BY id LIMIT 1 OFFSET ?
Note that OFFSET is zero-based, so you may need to subtract 1 from the variable you're indexing in with.
If you want to reclaim deleted row ids, the VACUUM command or pragma may be what you seek:
http://www.sqlite.org/faq.html#q12
http://www.sqlite.org/lang_vacuum.html
http://www.sqlite.org/pragma.html#pragma_auto_vacuum