What is better in terms of sqlite3 performance: delete unneeded row or set it as not needed? - iphone

I am writing an iPhone application where the user receives multiple messages from different users. These messages are stored in an SQLite3 database. Over time the user might like to delete the received messages from one user, but he will certainly continue to receive new messages from that user after deleting the old ones.
Since retrieving the messages will be done using a SELECT statement, which scenario is better to use when the user would like to delete the messages (in terms of performance):
Scenario 1: DELETE all the old messages normally and continue to retrieve the new ones using a statement like: SELECT Messages FROM TableName WHERE UserID = (?)
Scenario 2: Add a field of type INTEGER to the table (IsDeleted) and, upon the delete request, set this field to 1; afterwards retrieve the new messages using a statement like: SELECT Messages FROM TableName WHERE UserID = (?) AND IsDeleted = 0
One more thing: if scenario 1 is used (a normal DELETE), will this cause any fragmentation of the database file on disk?
Many thanks in advance.

Using scenario 1 is much better: SELECT and DELETE statements perform at a comparable level, and scenario 1 guarantees you won't be left with dangling tuples (unwanted rows) in your database.
If you wish to keep the deleted messages around (for example for backup), then scenario 2 is a must, but you have to take into consideration the growing size of your database, which leads to slower performance over time.
Finally, I would like to add that performing delete operations will not cause fragmentation issues: database engines, SQLite included, reuse freed pages and provide optimization tools (such as SQLite's VACUUM) to compact the file if needed.
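For reference, a minimal sketch of scenario 1 in SQLite, using the table and column names from the question (the VACUUM step is optional and only needed if you want to compact the file on disk):
-- delete all of one user's old messages
DELETE FROM TableName WHERE UserID = ?;
-- new messages are retrieved exactly as before
SELECT Messages FROM TableName WHERE UserID = ?;
-- optional: rebuild the database file and reclaim freed pages
VACUUM;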

It would be a pretty lousy database if DELETE didn't work well. In absence of evidence to the contrary, I'd assume you are safe to delete as normal. The database's entire raison d'être is to make these sorts of operations efficient.

IMHO if you don't use DELETE, over time the DB will get bigger and bigger, making each SELECT less and less efficient.
Therefore I figure that deleting rows that will never be used again is the more efficient option.

Related

TSQL Lock behaviour on Index Creation/Partition Switch

I currently have the use case of inserting a lot of data (3.5 million rows, somewhere around 200 GB) into multiple staging tables and then switching them into the destination tables. Because of the amount of data, we discovered that it is faster to insert the data into an empty heap table, then create the columnstore index so the structure is identical to the destination table, and then switch - all within one transaction.
All the tables are in the same database, but they do not depend on each other, so the ideal case would be to fill table A-Stage and B-Stage at the same time, create the corresponding indexes on them at the same time, and then switch them at the same time.
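For illustration, a minimal sketch of that per-table pattern, assuming hypothetical object names (dbo.TableA_Stage as the staging heap, dbo.SourceA as the data source, dbo.TableA as the destination, partition 1 as the target partition):
BEGIN TRANSACTION;
-- 1. bulk load into the empty staging heap (minimally logged with TABLOCK)
INSERT INTO dbo.TableA_Stage WITH (TABLOCK)
SELECT * FROM dbo.SourceA;
-- 2. build the clustered columnstore index so the structure matches the destination
CREATE CLUSTERED COLUMNSTORE INDEX CCI_TableA_Stage ON dbo.TableA_Stage;
-- 3. switch the staged data into the destination (this is where the SCH-M locks come in)
ALTER TABLE dbo.TableA_Stage SWITCH TO dbo.TableA PARTITION 1;
COMMIT TRANSACTION;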
Obviously, with creating indexes and switching partitions, there are plenty of locks involved. I was curious whether those locks can cause a deadlock at any point, especially when it comes to the sys tables.
All tables involved will get a SCH-M lock, and certain sys tables will also get locked, but from what I can see, they get locked at PAGE/KEY/EXTENT level.
I guess my question is:
Are the sys tables (and other structures I might have missed) stored in a way that lets me alter indexes/partitions without running into locks, as long as these are different tables/objects that do not depend on each other (no foreign keys, for example)? Or will I eventually run into a scenario where table B has to wait for table A to finish before it can even start - or worse, a deadlock?
Thanks in advance!
I tried creating a clustered columnstore index and switching partitions, and saw that certain sys objects couldn't be accessed. I am wondering whether this will cause blocking in the future or whether the locks for different objects will always work out.
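For what it's worth, a hedged sketch of how to see which locks the loading session actually holds during the index build or switch (run it from a second session; 57 is a hypothetical session id):
SELECT resource_type, resource_associated_entity_id, request_mode, request_status
FROM sys.dm_tran_locks
WHERE request_session_id = 57;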

PostgreSQL: Backend processes are active for a long time

Now I am hitting a very big roadblock.
I use PostgreSQL 10 and its new table partitioning.
Sometimes many queries don't return, and at that time many backend processes are active when I check them with pg_stat_activity.
First, I thought these processes were just waiting for locks, but the transactions contain only SELECT statements and no other backend runs any query that requires an ACCESS EXCLUSIVE lock. The queries themselves are fine in terms of plan, and they usually work well. Computer resources (CPU, memory, IO, network) are also not a problem, so these transactions should never conflict. I thoroughly checked the locks of these transactions with pg_locks and pg_blocking_pids(), and I could not find any lock that would make the queries much slower. Many of the active backends hold only ACCESS SHARE locks because they only run SELECTs.
Now I think this phenomenon is not caused by locks but by something related to the new table partitioning.
So, why are many backends active?
Could anyone help me?
Any comments are highly appreciated.
The figure below is a part of the result of pg_stat_activity.
If you want any additional information, please tell me.
EDIT
My query doesn't handle large data. The return type is like this:
uuid UUID
,number BIGINT
,title TEXT
,type1 TEXT
,data_json JSONB
,type2 TEXT
,uuid_array UUID[]
,count BIGINT
Because it has a JSONB column I cannot calculate the exact size, but it is not large JSON.
Normally these queries are moderately fast (around 1.5 s), so that is absolutely not the problem; the phenomenon only happens when other processes are working.
If the statistics were wrong, the queries would always be slow.
EDIT2
This is the stat. There are almost 100 connections, so I couldn't show all of them.
To me it looks like an application problem, not a PostgreSQL one. The active status means that your transaction has not been committed yet.
So why might your application not be sending a commit to the database?
Try to review where your application code opens transactions, reads data, commits, and rolls back.
EDIT:
By the way, to be sure, check resource usage before the problem appears and again when your queries start hanging. Run top and iotop to check whether Postgres really starts eating your CPU or disk like crazy when the problem appears. If not, I would suggest looking for the problem in your application.
Thank you everyone.
I finally solved this problem.
I noticed that a backend process held too many locks. When I executed the query SELECT COUNT(*) FROM pg_locks WHERE pid = <pid>, the result was about 10000.
The parameter max_locks_per_transaction is 64 and max_connections is about 800.
So, if many queries each hold a large number of locks, the lock table in shared memory runs out of space (it can track roughly max_locks_per_transaction * (max_connections + max_prepared_transactions) objects; see the shared memory size calculation inside PostgreSQL if you are interested).
Too many locks were taken when I executed a query like SELECT * FROM (partitioned table). Imagine you have a partitioned table foo with 1000 partitions. If you execute SELECT * FROM foo WHERE partition_id = <id>, the backend process will hold about 1000 table locks (plus index locks). So I changed the query from SELECT * FROM foo WHERE partition_id = <id> to SELECT * FROM foo_<partition_id>, querying the child table directly, and as a result the problem looks solved.
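A minimal sketch of that rewrite, using the hypothetical names from the example above:
-- before: under PostgreSQL 10 the planner locks every partition of foo (about 1000 locks)
SELECT * FROM foo WHERE partition_id = 42;
-- after: query the child table directly, so only that partition (and its indexes) is locked
SELECT * FROM foo_42;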
You say
Sometimes many queries don't return
...the phenomenon only happens when other processes are working. If the statistics were wrong, the queries would always be slow.
Do they not return / run slow when you connect directly to the Postgres instance and run the query, or only when running the queries from the application? The backend processes that are running - are you able to kill them successfully with pg_terminate_backend($PID), or does that have issues? To rule out issues with the statement itself, make sure statement_timeout is set to a reasonable amount to kill off long-running queries. After that is ruled out, perhaps you are running into a case where the application hangs and never allows the send calls from PostgreSQL to finish. To avoid a situation like that, if you are able to (depending on the OS) you can tune the keep-alive time: https://www.postgresql.org/docs/current/runtime-config-connection.html#GUC-TCP-KEEPALIVES-IDLE (the default is 2 hours).
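For example, a hedged sketch of those two knobs (the values are illustrative, not recommendations):
-- cancel any statement in this session that runs longer than 30 seconds
SET statement_timeout = '30s';
-- detect dead client connections sooner than the default of 2 hours
-- (server-wide; set in postgresql.conf or via ALTER SYSTEM, then reload)
ALTER SYSTEM SET tcp_keepalives_idle = 60;
ALTER SYSTEM SET tcp_keepalives_interval = 10;
ALTER SYSTEM SET tcp_keepalives_count = 5;
SELECT pg_reload_conf();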
Let us know if playing with any of that gives any more insight into your issue.
Sorry for the late post. As Konstantin pointed out, this might be because of your application (which is why I asked for your EDIT2). Adding a few pointers:
Table partitioning has no effect on these locks; that is a totally different concept and does not hold up locks in your case.
In your application, check whether the connection is properly closed after read() and inside a finally block (from a Java perspective). I am not sure which application tier you use.
Check whether a SELECT ... FOR UPDATE or any similar statement was recently written erroneously and is causing this.
Check whether any table has grown in size recently and a column used in the queries is not indexed. This is a very important and frequent cause of SELECT statements running for minutes. I'd also suggest using timeouts for SELECT statements in your application. https://www.postgresql.org/docs/9.5/gin-intro.html can give you a head start.
Another thing that looks fishy to me is the JSONB column: maybe your JSONB values are pretty long, or the queries are unnecessarily selecting the JSONB value even when it is not required?
Finally, if you don't need the special features of the JSONB data type, you can use the JSON data type instead, which is faster (sometimes as much as 50x!).
It looks like the pooled connections are not getting closed properly and a few queries might be taking a huge amount of time to respond. As pointed out in other answers, it is a problem with the application and could be a connection leak - most likely pending transactions stacked on top of already pending, unresolved transactions, leading to a number of unclosed transactions.
In addition, PostgreSQL generally has one or more "helper" processes like the stats collector, background writer, autovacuum daemon, walsender, etc., all of which show up as "postgres" instances.
One thing I would suggest: check in which part of the code you initiate the queries. Try a dry run of the queries outside the application and do some benchmarking of query performance.
Secondly, you can set a timeout for certain queries, if not all of them.
Thirdly, you can kill idle transactions after a certain timeout by using:
SET SESSION idle_in_transaction_session_timeout = '5min';
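If you need to clean up sessions that are already stuck, a hedged sketch (the 5-minute threshold is only an example):
-- terminate backends that have been idle inside a transaction for more than 5 minutes
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
AND state_change < now() - interval '5 minutes';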
I hope it might work. Cheers!

Concurrent processes working on a PostgreSQL table

I have a simple procedure where I need to process records of a table, and ideally run multiple instances of the process without processing the same record. The way I've done this with MySQL is fairly common (although I perceive the token field to be more of a hack):
Adding a couple of fields to the table:
CREATE TABLE records (
    id INTEGER PRIMARY KEY AUTO_INCREMENT,
    ...actual fields...
    processed_at DATETIME DEFAULT NULL,
    process_token TEXT DEFAULT NULL
);
And then a simple processing script:
process_salt = md5(rand())  # or something like a process id
def get_record():
    token = md5(microtime + process_salt)
    # claim one unprocessed, unclaimed row with our token
    db.exec("UPDATE records SET process_token = ?
             WHERE processed_at IS NULL AND process_token IS NULL LIMIT 1", token)
    return db.exec("SELECT * FROM records WHERE process_token = ?", token)
while (row = get_record()) is valid:
    # ...do processing on row...
    db.exec("UPDATE records SET processed_at = NOW(), process_token = NULL
             WHERE id = ?", row.id)
I'm implementing such a process in a system which uses a PostgreSQL database. I know Pg could be considered more mature than MySQL with regards to locking thanks to MVCC - can I use row-locking or some other feature in Pg instead of the token field?
This approach will work with PostgreSQL but it'll tend to be pretty inefficient as you're updating each row twice - each update requires two transactions, two commits. The cost of this can be mitigated somewhat by using a commit_delay and possibly disabling synchronous_commit, but it's still not going to be fast unless you have a non-volatile write-back cache on your storage subsystem.
More importantly, because you're committing the first update there is no way to tell the difference between a worker that's still working on the job and a worker that has crashed. You could probably set the token to the worker's process ID if all workers are on the local machine then scan for missing PIDs occasionally but that's cumbersome and race-condition prone, not to mention the problems with pid re-use.
I would recommend that you adopt a real queuing solution that is designed to solve these problems, like ActiveMQ, RabbitMQ, ZeroMQ, etc. PGQ may also be of significant interest.
Doing queue processing in a transactional relational database should be easy, but in practice it's ridiculously hard to do well and get right. Most of the "solutions" that look sensible at a glance turn out to actually serialize all work (so only one of many queue workers is doing anything at any given time) when examined in detail.
You can use SELECT ... FOR UPDATE NOWAIT which will obtain an exclusive lock on the row, or report an error if it is already locked.
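A minimal sketch of that approach, using the records table from the question (the id in the final UPDATE is a placeholder):
BEGIN;
-- claim one unprocessed row; errors immediately if that row is already locked by another worker
SELECT * FROM records
WHERE processed_at IS NULL
LIMIT 1
FOR UPDATE NOWAIT;
-- ...do processing on the row...
UPDATE records SET processed_at = NOW() WHERE id = <id>;
COMMIT;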

IBMDB2 select query for millions of data

I am new to DB2. I want to select around 2 million rows with a single query, in such a way that the first 5000 rows are selected and displayed while the next 5000 rows are fetched in the background, and so on until all of the data has been read.
Please help me with how to write such a query or which function to use.
Sounds like you want what's known as blocking. However, this isn't actually handled (not the way you're thinking of) at the database level - it's handled at the application level. You'd need to specify your platform and programming language for us to help there. Although if you're expecting somebody to actually read 2 million rows, it's going to take a while... At one row a second, that's 23 straight days.
The reason that SQL doesn't really perform this 'natively' is that it's (sort of) less efficient. Also, SQL is (by design) set up to operate over the entire set of data, both conceptually and syntactically.
You can use one of the newer features that brings Oracle/MySQL-style paging (LIMIT/OFFSET) to DB2: https://www.ibm.com/developerworks/mydeveloperworks/blogs/SQLTips4DB2LUW/entry/limit_offset?lang=en
At the same time, you can influence the optimizer by specifying OPTIMIZE FOR n ROWS and FETCH FIRST n ROWS ONLY. If you are only going to read, it is better to add the FOR READ ONLY clause to the query; this increases concurrency because the cursor will not be updatable. Also choose a suitable isolation level; in this case you could even use uncommitted read (WITH UR). A prior LOCK TABLE can also help.
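For illustration, a hedged sketch of keyset-style paging that combines those clauses (schema, table and column names are hypothetical; :last_id is the highest id returned by the previous page, 0 for the first page):
SELECT id, payload
FROM myschema.mytable
WHERE id > :last_id
ORDER BY id
FETCH FIRST 5000 ROWS ONLY
FOR READ ONLY
OPTIMIZE FOR 5000 ROWS
WITH UR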
Do not forget the common practices: an index or clustering index, retrieving only the necessary columns, etc., and always analyze the access plan with the Explain facility.

PostgreSQL to Data-Warehouse: Best approach for near-real-time ETL / extraction of data

Background:
I have a PostgreSQL (v8.3) database that is heavily optimized for OLTP.
I need to extract data from it on a semi-real-time basis (someone is bound to ask what semi-real-time means, and the answer is: as frequently as I reasonably can, but I will be pragmatic; as a benchmark let's say we are hoping for every 15 minutes) and feed it into a data warehouse.
How much data? At peak times we are talking approx 80-100k rows per min hitting the OLTP side, off-peak this will drop significantly to 15-20k. The most frequently updated rows are ~64 bytes each but there are various tables etc so the data is quite diverse and can range up to 4000 bytes per row. The OLTP is active 24x5.5.
Best Solution?
From what I can piece together the most practical solution is as follows:
Create a TRIGGER to write all DML activity to a rotating CSV log file
Perform whatever transformations are required
Use the native DW data pump tool to efficiently pump the transformed CSV into the DW
Why this approach?
TRIGGERs allow selective tables to be targeted rather than being system-wide, the output is configurable (i.e. into a CSV), and they are relatively easy to write and deploy. SLONY uses a similar approach and the overhead is acceptable.
CSV easy and fast to transform
Easy to pump CSV into the DW
Alternatives considered ....
Using native logging (http://www.postgresql.org/docs/8.3/static/runtime-config-logging.html). The problem with this is that it looked very verbose relative to what I needed and was a little trickier to parse and transform. However, it could be faster, as I presume there is less overhead compared to a TRIGGER. It would certainly make the administration easier as it is system-wide, but again, I don't need some of the tables (some are used for persistent storage of JMS messages which I do not want to log).
Querying the data directly via an ETL tool such as Talend and pumping it into the DW ... the problem is the OLTP schema would need to be tweaked to support this, and that has many negative side effects.
Using a tweaked/hacked SLONY - SLONY does a good job of logging and migrating changes to a slave so the conceptual framework is there but the proposed solution just seems easier and cleaner
Using the WAL
Has anyone done this before? Want to share your thoughts?
Assuming that your tables of interest have (or can be augmented with) a unique, indexed, sequential key, then you will get much much better value out of simply issuing SELECT ... FROM table ... WHERE key > :last_max_key with output to a file, where last_max_key is the last key value from the last extraction (0 if first extraction.) This incremental, decoupled approach avoids introducing trigger latency in the insertion datapath (be it custom triggers or modified Slony), and depending on your setup could scale better with number of CPUs etc. (However, if you also have to track UPDATEs, and the sequential key was added by you, then your UPDATE statements should SET the key column to NULL so it gets a new value and gets picked by the next extraction. You would not be able to track DELETEs without a trigger.) Is this what you had in mind when you mentioned Talend?
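As an aside, a hedged sketch of one such extraction run (the table, key column, output path, and key value are hypothetical; the literal stands in for last_max_key from the previous run):
COPY (
    SELECT *
    FROM interesting_table
    WHERE seq_key > 1234567
    ORDER BY seq_key
) TO '/var/tmp/extract_batch.csv' WITH CSV;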
I would not use the logging facility unless you cannot implement the solution above; logging most likely involves locking overhead to ensure log lines are written sequentially and do not overlap/overwrite each other when multiple backends write to the log (check the Postgres source.) The locking overhead may not be catastrophic, but you can do without it if you can use the incremental SELECT alternative. Moreover, statement logging would drown out any useful WARNING or ERROR messages, and the parsing itself will not be instantaneous.
Unless you are willing to parse WALs (including transaction state tracking, and being ready to rewrite the code everytime you upgrade Postgres) I would not necessarily use the WALs either -- that is, unless you have the extra hardware available, in which case you could ship WALs to another machine for extraction (on the second machine you can use triggers shamelessly -- or even statement logging -- since whatever happens there does not affect INSERT/UPDATE/DELETE performance on the primary machine.) Note that performance-wise (on the primary machine), unless you can write the logs to a SAN, you'd get a comparable performance hit (in terms of thrashing filesystem cache, mostly) from shipping WALs to a different machine as from running the incremental SELECT.
If you can maintain a 'checksum table' that contains only the ids and a 'checksum' of each row, you can quickly select not only the new records but also the changed and deleted ones.
The checksum could be any checksum function you like, e.g. crc32.
The new ON CONFLICT clause in PostgreSQL has changed the way I do many updates. I pull the new data (based on a row_update_timestamp) into a temp table, then in one SQL statement INSERT into the target table with ON CONFLICT ... DO UPDATE. If your target table is partitioned then you need to jump through a couple of hoops (i.e. hit the partition table directly). The ETL can happen as you load the temp table (most likely) or in the ON CONFLICT SQL (if trivial). Compared to other "UPSERT" approaches (update, then insert if zero rows, etc.) this shows a huge speed improvement. In our particular DW environment we don't need/want to accommodate DELETEs. Check out the ON CONFLICT docs - it gives Oracle's MERGE a run for its money!
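A minimal sketch of that load pattern, with hypothetical table and column names (dw.customers keyed on a unique customer_id):
-- stage the extracted rows
CREATE TEMP TABLE stage_customers (LIKE dw.customers INCLUDING DEFAULTS);
COPY stage_customers FROM '/var/tmp/customers_delta.csv' WITH CSV;
-- upsert into the target in a single statement
INSERT INTO dw.customers
SELECT * FROM stage_customers
ON CONFLICT (customer_id) DO UPDATE
SET name = EXCLUDED.name,
    row_update_timestamp = EXCLUDED.row_update_timestamp;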