Why is IMAP COPY command permitted to return an untagged EXPUNGE response? - email

This is what IMAP RFC 3501 says:
An EXPUNGE response MUST NOT be sent when no command is in
progress, nor while responding to a FETCH, STORE, or SEARCH
command. This rule is necessary to prevent a loss of
synchronization of message sequence numbers between client and
server.
I think that's because FETCH, STORE and SEARCH are the only commands that use sequence numbers.
But COPY command also uses sequence numbers, so why is it allowed for it to trigger an untagged EXPUNGE?
This is a valid pipelined command sequence:
FETCH + STORE + SEARCH
It is valid because none of them can invalidate the sequence numbers used by the other commands.
This is not valid:
COPY + FETCH
Because COPY can trigger an untagged EXPUNGE, it could invalidate the FETCH command, e.g.
x COPY 50 Drafts
x2 FETCH 20 (FLAGS)
* 20 EXPUNGE
x OK [COPYUID 1675776450 61 2] Copy completed (0.002 + 0.000 + 0.001 secs).
x2 NO [EXPUNGEISSUED] Some of the requested messages no longer exist (0.001 + 0.000 secs).
Here the COPY triggers an untagged EXPUNGE which invalidates the sequence number used in FETCH

Related

What is an updated index configuration of Google Firestore in Datastore mode?

Since Nov 08 2022, 16h UTC, we sometimes get the following DatastoreException with code: UNAVAILABLE, and message:
Query timed out. Please try either limiting the entities scanned, or run with an updated index configuration.
I want to get all Keys of a certain kind of entities. These are returned in batches together with a new cursor. When using the cursor to get the next batch, then the above stated error happens. I am expecting that the query does not time out so fast. (It might be that it takes up to a few seconds until I am requesting the next batch of Keys using the returned cursor, but this never used to be a problem in the past.)
There was no problem before the automatic upgrade to Firestore. Counting entities of a kind also often results in the error DatastoreException: "The datastore operation timed out, or the data was temporarily unavailable."
I am wondering whether I have to make any changes on my side. Does anybody else encounter these problems with Firestore in Datastore mode?
What is meant by "an updated index configuration"?
Thanks
Stefan
I just wanted to follow up here since we were able to do detailed analysis and come up with a workaround. I wanted to record our findings here for posterity's sake.
The root of the problem is queries over large ranges of deleted keys. Given a schema like:
Kind: ExampleKind
Data:
Key              | lastUpdatedMillis
ExampleKind/1040 | 5
ExampleKind/1052 | 0
ExampleKind/1064 | 12
ExampleKind/1065 | 100
ExampleKind/1070 | 42
Datastore automatically generates both ASC and DESC indexes on the lastUpdatedMillis field.
The lastUpdatedMillis ASC index table would have the following logical entries:
Index Key | Entity Key
0         | ExampleKind/1052
5         | ExampleKind/1040
12        | ExampleKind/1064
42        | ExampleKind/1070
100       | ExampleKind/1065
In the workload you've described, there was an operation that did the following:
1. SELECT * FROM ExampleKind WHERE lastUpdatedMillis <= nowMillis()
2. For every ExampleKind entity returned by the query, perform some operation which updates lastUpdatedMillis.
3. Some of the updates may fail, so we repeat the query from step 1 to catch any remaining entities.
When the operation completes, there are large key ranges in the index tables that are deleted, but in the storage system these rows still exist with special deletion markers. They are visible internally to queries, but are filtered out of the results:
Index Key | Entity Key
x         | xxxx
x         | xxxx
x         | xxxx
42        | ExampleKind/1070
...       | ... (and so on)
x         | xxxx
When we repeat the query over this data, if the number of deleted rows is very large (100_000 ... 1_000_000), the storage system may spend the entire operation looking for non-deleted data in this range. Eventually the Garbage Collection and Compaction mechanisms will remove the deleted rows and querying this key range becomes fast again.
A reasonable workaround is to reduce the amount of work the query has to do by restricting the time range on the lastUpdatedMillis field.
For example, instead of scanning the entire range of lastUpdatedMillis < now, we could break up the query into:
(now - 60 minutes) <= lastUpdatedMillis < now
(now - 120 minutes) <= lastUpdatedMillis < (now - 60 minutes)
(now - 180 minutes) <= lastUpdatedMillis < (now - 120 minutes)
This example uses 60-minute ranges, but the specific "chunk size" can be tuned to the shape of your data. These smaller queries will either succeed and find some results, or scan the entire key range and return 0 results; in both scenarios they will complete within the RPC deadline.
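As a rough illustration, here is a minimal Python sketch of that chunking strategy using the google-cloud-datastore client. The kind, field name, and 60-minute chunk size come from the example above; the project setup, the now_millis/oldest_millis arguments, and the keys-only fetch are assumptions for the sketch.
from google.cloud import datastore

client = datastore.Client()  # assumes default project and credentials

CHUNK_MS = 60 * 60 * 1000  # 60-minute chunks; tune to the shape of your data

def fetch_keys_in_chunks(now_millis, oldest_millis):
    """Walk backwards from now over fixed-size lastUpdatedMillis ranges."""
    keys = []
    hi = now_millis
    while hi > oldest_millis:
        lo = max(hi - CHUNK_MS, oldest_millis)
        query = client.query(kind="ExampleKind")
        query.keys_only()
        query.add_filter("lastUpdatedMillis", ">=", lo)
        query.add_filter("lastUpdatedMillis", "<", hi)
        # each small query either finds some entities or scans a bounded
        # key range and returns nothing, staying within the RPC deadline
        keys.extend(entity.key for entity in query.fetch())
        hi = lo
    return keys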
Thank you again for reaching out about this!
A couple of notes:
This deadlining query problem could occur with any kind of query over the index (projection, keys only, full entity, etc.).
Despite what the error message says, no extra index is needed here, nor would one speed up the operation. Datastore's built-in ASC/DESC index over each field already exists for you and is serving this query.

Postgresql sequence: lock strategy to prevent record skipping on a table queue

I have a table that acts like a queue (let's call it queue) and has a sequence from 1..N.
Some triggers insert into this queue (the triggers run inside transactions).
External machines then hold the last sequence number and ask the remote database: give me sequences greater than 10 (for example).
The problem:
In some cases transactions 1 and 2 begin (the numbers are examples), but transaction 2 ends before transaction 1. If, in between, a host has asked the queue for sequences greater than N, the sequences from transaction 1 are skipped.
How to prevent this?
I would proceed like this:
add a column state to the table that you change as soon as you process an entry
get the next entry with
SELECT ... FROM queuetab
WHERE state = 'new'
ORDER BY seq
LIMIT 1
FOR UPDATE SKIP LOCKED;
update state in the row you found and process it
As long as you do the last two actions in a single transaction, that will make sure that you are never blocked, always get the first available entry, and never skip an entry.
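As a rough sketch of that claim-and-process pattern in Python with psycopg2: the queuetab name and the state/seq columns come from the answer above, while the connection string, the payload column, and the process() function are hypothetical placeholders.
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # hypothetical connection string

with conn:  # the claim and the state update commit (or roll back) together
    with conn.cursor() as cur:
        # claim the oldest unprocessed entry; SKIP LOCKED skips rows another
        # worker has already locked instead of blocking on them
        cur.execute(
            """
            SELECT seq, payload FROM queuetab
            WHERE state = 'new'
            ORDER BY seq
            LIMIT 1
            FOR UPDATE SKIP LOCKED
            """
        )
        row = cur.fetchone()
        if row is not None:
            seq, payload = row
            process(payload)  # hypothetical processing step
            cur.execute(
                "UPDATE queuetab SET state = 'done' WHERE seq = %s", (seq,)
            )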

How to update a local table remotely?

I have a large table on a remote server with an unknown number of rows (millions). I'd like to be able to fetch the data in batches of 100,000 rows at a time, update my local table with those fetched rows, and repeat this until all rows have been fetched. Is there a way I can update a local table remotely?
Currently I have a dummy table called t on the server along with the following variables...
t:([]sym:1000000?`A`B`Ab`Ba`C`D`Cd`Dc;id:1+til 1000000)
selector:select from t where sym like "A*"
counter:count selector
divy:counter%100000
divyUP:ceiling divy
and the function below on the client, along with the variable index set to 0 and normTable, an empty copy of the remote table's schema...
index:0
normTable:h"0#t"
batches:{[idx;divy;anty;seltr]
if[not idx=divy;
batch:select[(anty;100000)] from seltr;
`normTable upsert batch;
idx+::1;
divy:divy;
anty+:100000;
seltr:seltr;
batches[idx;divy;anty;seltr]];
idx::0}
I call that function using the following command...
batches[index;h"divyUP";0;h"selector"]
The problem with this approach though is h"selector" fetches all the rows of data at the same time (and multiple times - for each batch of 100,000 that it upserts to my local normTable).
I could move the batches function to the remote server but then how would I update my local normTable remotely?
Alternatively I could break up the rows into batches on the server and then pull each batch individually. But if I don't know how many rows there are how do I know how many variables are required? For example the following would work, but only up to the first 400k rows...
batch1:select[100000] from t where sym like "A*"
batch2:select[100000 100000] from t where sym like "A*"
batch3:select[200000 100000] from t where sym like "A*"
batch4:select[300000 100000] from t where sym like "A*"
Is there a way to set a batchX variable so that it creates a new variable that equals the count of divyUP?
I would suggest a few changes, since you are connecting to a remote server:
Do not run synchronous requests, as that would slow down the server's processing. Make asynchronous requests using callbacks instead.
Do not do a full table scan (the heavy comparison) on each call, especially for the regex match. It is possible that most of the data will still be in cache on the next call, but that is not guaranteed, and it will again impact the server's normal operations.
Do not make data requests in bursts. Either use a timer or make the next data request only when the last batch has arrived.
The approach below is based on these suggestions. It avoids scanning the full table on columns other than the index column (which is lightweight) and makes the next request only when the last batch has arrived.
Create Batch processing function
This function runs on the server, reads a small batch of data from the table using indices, and returns the required data.
q) batch:{[ind;s] ni:ind+s; d:select from t where i within (ind;ni), sym like "A*";
neg[.z.w](`upd;d;$[ni<count t;ni+1;0]) }
It takes two arguments: the starting index and the batch size to work on.
The function then asynchronously calls the upd function on the local machine, passing two arguments:
the data from the current batch request
the table index to start the next batch from (0 is returned once all rows are done, to stop further batch processing)
Create Callback function
The result from the batch processing function arrives in this function.
If the index is greater than 0, there is more data to process and the next batch should start from this index.
q) upd:{[data;ind] t::t,data;if[ind>0;fetch ind]}
Create Main function to start process
q)fetch:{[ind] h (batch;ind;size)}
Finally, open the connection, create the table variable, and run the fetch function.
q) h:hopen `:server:port
q) t:()
q) size:100
q) fetch 0
Note that the above method assumes the server table is static. If it is being updated in real time, changes would be required depending on how the table is updated on the server.
Also, other optimizations can be made depending on the attributes set on the remote table, which can improve performance.
If you're ok sending sync messages it can be simplified to something like:
{[h;i]`mytab upsert h({select from t where i in x};i)}[h]each 0N 100000#til h"count t"
And you can easily change it to control the number of batches (rather than the size) by instead using 10 0N# (that would do it in 10 batches)
Rather than having individual variables, the cut function can split the result of the select into chunks of 100000 rows. Indexing into each element gives a table.
batches:100000 cut select from t where sym like "A*"

What is the best way to stop a process from running until previous query finishes?

I'm querying a large historical database (HDB) to select records by alphabet character, then upserting the selected records to another kdb table...
pullRecords:{[x]select from `ts where sym like x}
pushRecords:{[x]`newTS upsert x}
The actual database contains millions of rows of records. If I were to run this simultaneously for each character it would result in an abort error since it requires more memory than what's available.
The ts and newTS tables, which I set up for testing are below. I also set up a metadata table called metaTable, which has a flag column to signal when the query has finished running...
ts:([]sym:1000000?`A`Ab`B`Bc`C`Ca`X`Xz`Y`Yx`Z`Zy;price:1000000?100.0;num:til 1000000)
newTS:([]sym:`$(); price:`float$(); amt:`int$())
metaTable:([id:`char$()]flag:`boolean$())
I'd like to stop the script from running based on the value of the flag column. If a value of 1b is found, it means the script is locked on that character, and no other characters can run their queries until the lock is reset to 0b. If all values are equal to 0b, then the character checking the flag column acquires the lock (updates the value to 1b) and runs its functions. Once the queries have completed, the lock is reset.
What I'd like to do is the following...
(1) Declare 2 variables.
setflag:1b
resetflag:0b
(2) Check flag column in metaTable and set to 1b if 0b.
if[select flag from metaTable where id like "A*"=resetflag;update flag:setflag from metaTable where id="A";'"Flag set"]
if[select flag from metaTable;'"Flag already set for char "A""]
(2a) The above fails with a type error. I can store the select query in a variable and then index into the variable but this doesn't return the updated value once it's been set.
chkflg:select flag from metaTable
if[chkflg.flag[0];...]
(3) Run pullRecords query, count rows of data pulled for character, run pushRecords query.
if[select flag from metaTable;pulled::pullRecords["A*"];'"Pulling data"]
amt:count pulled
if[select flag from metaTable;pushRecords[pulled];'"Pushing data"]
(4) Check amount of data pulled from ts equals amount of data pushed to newTS. If so, update the flag in metaTable from 1b to 0b. Unlock the script and start process for next character.
if[amt~count select from newTS where sym like "A*";update flag:resetflag from `metaTable where id="A";'"Lock released"]
You are looking for something similar to transaction behavior. You could do this without using metaTable, by using global variables (or variables in namespaces) to serve as a lock.
Below is an example template to set up on the master service (the service that handles the concurrent requests). Modify it according to your setup.
Define 2 global variables: lock (boolean) to serve as the lock and lock_char to store the currently locked character.
q) lock:0b
q) lock_char:""
Define a function which first checks whether the lock can be acquired (lock value = 0b). If yes, it acquires the lock and performs the rest of the operations; otherwise it shows a message and returns.
q) transaction:{[ch] if[lock;show "Currently locked for character: ",lock_char;:0b];
 / else acquire lock and perform other operations
 `lock set 1b; `lock_char set ch; s:ch,"*";
 `newTS upsert t:select from ts where sym like s;
 if[not count[t]=count select from newTS where sym like s;call_rollback_function[]];
 / reset lock
 `lock set 0b; `lock_char set "";
 :1b;
 }
Call function:
q) transaction "A"

Redshift Concurrent Transactions

I'm having issues concurrently writing to a Redshift db. Writing to the db using a single connection works well, but is a little slow, so I am trying to use multiple concurrent connections but it looks like there can only be a single transaction at a time.
I investigated by running the following python script alone, and then running it 4 times simultaneously.
import psycopg2
import time
import os

if __name__ == "__main__":
    rds_conn = psycopg2.connect(host="www.host.com", port="5439", dbname='db_name', user='db_user', password='db_pwd')
    cur = rds_conn.cursor()

    with open("tmp/test.query", 'r') as file:
        query = file.read().replace('\n', '')

    counter = 0
    start_time = time.time()
    try:
        while True:
            cur.execute(query)
            rds_conn.commit()  # first commit location
            print("sent counter: %s" % counter)
            counter += 1
    except KeyboardInterrupt:
        # rds_conn.commit()  # secondary commit location
        total_time = time.time() - start_time
        queries_per_sec = counter / total_time
        print("total queries/sec: %s" % queries_per_sec)
The test.query file being loaded is a ~16.8 MB multi-row insert file that looks a little like:
insert into category_stage values
(default, default, default, default),
(20, default, 'Country', default),
(21, 'Concerts', 'Rock', default);
(Just a lot longer)
The results of the scripts showed:
---------------------------------------------------
| process count | queries/sec | total queries/sec |
---------------------------------------------------
| 1 | 0.1786 | 0.1786 |
---------------------------------------------------
| 8 | 0.0359 | 0.2872 |
---------------------------------------------------
...which is far from the increase I'm looking for. Watching the counters increase across the scripts, there's a clear round-robin pattern where each script waits for the prior script's query to finish.
When the commit is moved from the first commit location to the second commit location (so commits only when the script is interrupted), only one script advances at a time. If that isn't a clear indication of some sort of transaction lock, I don't know what is.
As far as I can tell from searching, there's no document that says we can't have concurrent transactions, so what could the problem be? It's crossed my mind that the query size is so large that only one can be performed at a time, but I would have expected Redshift to have much more than ~17mb per transaction.
In line with Guy's comment, I ended up using a COPY from an S3 bucket. This ended up being an order of magnitude faster, requiring only a single thread to issue the query and letting AWS process the files from S3 in parallel. I used the guide detailed here and managed to insert about 120 GB of data in just over an hour.
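For reference, a minimal sketch of what that COPY-based load might look like from Python. The category_stage table name comes from the question; the connection details, S3 path, IAM role, and CSV/GZIP format are placeholders you would swap for your own setup.
import psycopg2

conn = psycopg2.connect(host="www.host.com", port="5439", dbname="db_name",
                        user="db_user", password="db_pwd")

with conn:  # commit the whole load as a single transaction
    with conn.cursor() as cur:
        # a single COPY statement; Redshift pulls the S3 files in parallel
        # across slices, so splitting the data into several compressed files
        # generally loads faster than one large file
        cur.execute("""
            COPY category_stage
            FROM 's3://my-bucket/staging/'
            IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
            CSV GZIP;
        """)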