What is an updated index configuration of Google Firestore in Datastore mode? - google-cloud-firestore

Since Nov 08 2022, 16h UTC, we sometimes get the following DatastoreException with code: UNAVAILABLE, and message:
Query timed out. Please try either limiting the entities scanned, or run with an updated index configuration.
I want to get all keys of entities of a certain kind. These are returned in batches together with a new cursor. When I use the cursor to get the next batch, the error stated above happens. I expect the query not to time out so quickly. (It may take up to a few seconds until I request the next batch of keys using the returned cursor, but this never used to be a problem in the past.)
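The loop is essentially a keys-only query paged by cursor; a minimal sketch of that pattern (shown here with the Python google-cloud-datastore client purely for illustration, not our exact code) looks like this:

from google.cloud import datastore

client = datastore.Client()

def iter_keys(kind, batch_size=500):
    """Yield the keys of `kind` one batch at a time, resuming from a cursor."""
    cursor = None
    while True:
        query = client.query(kind=kind)
        query.keys_only()
        results = query.fetch(start_cursor=cursor, limit=batch_size)
        page = next(results.pages, None)   # one batch of (keys-only) entities
        if page is None:
            break
        keys = [entity.key for entity in page]
        if not keys:
            break
        yield keys                         # ... processing here may take a few seconds ...
        cursor = results.next_page_token   # cursor used to request the next batch
        if cursor is None:
            break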
There was no problem before the automatic upgrade to Firestore. Counting entities of a kind also often results in the error DatastoreException: "The datastore operation timed out, or the data was temporarily unavailable."
I am wondering whether I have to make any changes on my side. Does anybody else encounter these problems with Firestore in Datastore mode?
What is meant by "an updated index configuration"?
Thanks
Stefan

I just wanted to follow up here since we were able to do detailed analysis and come up with a workaround. I wanted to record our findings here for posterity's sake.
The root of the problem is queries over large ranges of deleted keys. Given a schema like:
Kind: ExampleKind
Data:

Key                 lastUpdatedMillis
ExampleKind/1040    5
ExampleKind/1052    0
ExampleKind/1064    12
ExampleKind/1065    100
ExampleKind/1070    42
Datastore will automatically generate both an ASC and a DESC index on the lastUpdatedMillis field.
The lastUpdatedMillis ASC index table would have the following logical entries:

Index Key    Entity Key
0            ExampleKind/1052
5            ExampleKind/1040
12           ExampleKind/1064
42           ExampleKind/1070
100          ExampleKind/1065
In the workload you've described, there was an operation that did the following:
1. SELECT * FROM ExampleKind WHERE lastUpdatedMillis <= nowMillis()
2. For every ExampleKind entity returned by the query, perform some operation which updates lastUpdatedMillis.
3. Some of the updates may fail, so we repeat the query from step 1 to catch any remaining entities.
When the operation completes, there are large key ranges in the index tables that are deleted, but in the storage system these rows still exist with special deletion markers. They are visible internally to queries, but are filtered out of the results:

Index Key    Entity Key
x            xxxx
x            xxxx
x            xxxx
42           ExampleKind/1070
...          and so on ...
x            xxxx
When we repeat the query over this data, if the number of deleted rows is very large (100,000 to 1,000,000), the storage system may spend the entire operation looking for non-deleted data in this range. Eventually the garbage collection and compaction mechanisms will remove the deleted rows, and querying this key range becomes fast again.
A reasonable workaround is to reduce the amount of work the query has to do by restricting the time range of the lastUpdatedMillis field.
For example, instead of scanning the entire range of lastUpdatedMillis < now, we could break up the query into:
(now - 60 minutes) <= lastUpdatedMillis < now
(now - 120 minutes) <= lastUpdatedMillis < (now - 60 minutes)
(now - 180 minutes) <= lastUpdatedMillis < (now - 120 minutes)
This example uses 60-minute ranges; however, the specific "chunk size" can be tuned to the shape of your data. These smaller queries will either succeed and find some results, or scan the entire key range and return 0 results, but in both scenarios they will complete within the RPC deadline.
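As an illustration, a rough sketch of this chunking with the Python google-cloud-datastore client (the kind and field names follow the example above; the 60-minute chunk size and 24-hour lookback are arbitrary placeholders to tune):

from datetime import datetime, timedelta, timezone
from google.cloud import datastore

client = datastore.Client()

def keys_updated_before_now(kind="ExampleKind",
                            chunk=timedelta(minutes=60),
                            lookback=timedelta(hours=24)):
    """Scan lastUpdatedMillis in fixed-size time windows instead of one open-ended range."""
    now_ms = int(datetime.now(timezone.utc).timestamp() * 1000)
    chunk_ms = int(chunk.total_seconds() * 1000)
    start_ms = now_ms - int(lookback.total_seconds() * 1000)

    keys = []
    upper = now_ms
    while upper > start_ms:
        lower = upper - chunk_ms
        query = client.query(kind=kind)
        query.keys_only()
        query.add_filter("lastUpdatedMillis", ">=", lower)
        query.add_filter("lastUpdatedMillis", "<", upper)
        # Each small window either finds rows quickly or scans only the deletion
        # markers for that window, so it stays within the RPC deadline.
        keys.extend(entity.key for entity in query.fetch())
        upper = lower
    return keys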
Thank you again for reaching out about this!
A couple of notes:
This query-deadline problem could occur with any kind of query over the index (projection, keys-only, full entity, etc.).
Despite what the error message says, no extra index is needed here, nor would one speed up the operation. Datastore's built-in ASC/DESC indexes over each field already exist and are serving this query.

Related

pq: [parent] Data too large

I'm using Grafana to visualize some data stored in CrateDB in different panels.
Some of my boards work correctly, but there are 3 specific boards (created by someone on my work team) that, at certain times of the day, stop showing data (No Data) and display the following error:
db query error: pq: [parent] Data too large, data for [fetch-1] would be [512323840/488.5mb], which is larger than the limit of [510027366/486.3mb], usages [request=0/0b, in_flight_requests=0/0b, query=150023700/143mb, jobs_log=19146608/18.2mb, operations_log=10503056/10mb]
Honestly, I would like to understand what it means, and how I can fix it.
I remain attentive to any help you can give me, and I deeply appreciate the help.
what I tried
17 SQL Statements of the form:
SELECT
time_index AS "time",
entity_id AS metric,
v1_ps
FROM etsm
WHERE
entity_id = 'SM_B3_RECT'
ORDER BY 1,2
for 17 different entities.
what I hope
I hope to receive the data corresponding to each of the SQL statements for its respective graph.
The result
As a result, some of the statements receive no data, along with the warning message I shared:
db query error: pq: [parent] Data too large, data for [fetch-1] would be [512323840/488.5mb], which is larger than the limit of [510027366/486.3mb], usages [request=0/0b, in_flight_requests=0/0b, query=150023700/143mb, jobs_log=19146608/18.2mb, operations_log=10503056/10mb]
As an additional fact, the panel is configured to refresh every 15 minutes, but no matter how many times I manually refresh it, the statements that receive data are different.
Example: I refresh the panel and SQL statements A, B and C get data, while the others don't. I refresh the panel again and SQL statements D, H and J receive data, and the others don't (in a random pattern).
Other additional information:
I have access to the database Grafana is querying, and the data is there.
Your queries have no time condition, so they select and process all records every time, and you are hitting limits of your DB (e.g. the size of processed data). Add a time condition so that only a fraction of the records is returned.

Efficient Way To Update Multiple Records In PostgreSQL

I have to update multiple records in a table in the most efficient way possible, with the least latency and without utilising the CPU extensively. The number of records to update at a time can range from 1 to 1000.
We do not want to lock the database when this update occurs as other services are utilising it.
Note: There are no dependencies generated from this table towards any other table in the system.
After looking in many places, I've narrowed it down to a few ways to do the task:
simple-update: A simple UPDATE query against the table using the already-known ids:
Either multiple UPDATE queries (one query for each individual record), or
Usage of the update ... from clause as mentioned here as a single query (one query for all records) - a sketch of this is shown right after this list
delete-then-insert: First, delete the outdated data, then insert the updated data with new ids (since there is no dependency on records, new ids are acceptable)
insert-then-delete: First, insert the updated records with new ids, then delete the outdated data using the old ids (since there is no dependency on records, new ids are acceptable)
temp-table: First, insert the updated records into a temporary table. Second, update the original table from the temporary table. Finally, remove the temporary table.
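For clarity, this is roughly what I mean by the update ... from variant, sketched here with psycopg2; the accounts, id and balance names are just placeholders for illustration:

import psycopg2
from psycopg2.extras import execute_values

# (id, new_balance) pairs; anywhere from 1 to 1000 at a time
rows = [(1, 10.50), (2, 20.00), (3, 7.25)]

conn = psycopg2.connect("dbname=app user=app")
with conn, conn.cursor() as cur:
    # One round trip: the VALUES list is expanded in place of %s
    execute_values(
        cur,
        """
        UPDATE accounts AS a
        SET    balance = v.balance
        FROM   (VALUES %s) AS v(id, balance)
        WHERE  a.id = v.id
        """,
        rows,
    )
conn.close()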
We must not drop the existing table and create a new one in its place
We must not truncate the existing table because we have a huge number of records that we cannot store in the buffer memory
I'm open to any more suggestions.
Also, what will be the impact of making the update all at once vs doing it in batches of 100, 200 or 500?
References:
https://dba.stackexchange.com/questions/75631
https://dba.stackexchange.com/questions/230257
As mentioned by @Frank Heikens in the comments, I'm sure that different people will have different statistics based on their system design. I did some checks and found some insights to share from one of my development systems.
Configurations of the system used:
AWS
Engine: PostgreSQL
Engine version: 12.8
Instance class: db.m6g.xlarge
Instance vCPU: 4
Instance RAM: 16GB
Storage: 1000 GiB
I used a lambda function and the pg package to write data into a table (default FILLFACTOR) that contains 3,409,304 records.
Both the lambda function and the database were in the same region.
UPDATE 1000 records into the database with a single query
Run    Time taken
1      143.78ms
2      115.277ms
3      98.358ms
4      98.065ms
5      114.78ms
6      111.261ms
7      107.883ms
8      89.091ms
9      88.42ms
10     88.95ms
UPDATE 1000 records into the database with a single query in 2 batches of 500 records concurrently
Run    Time taken
1      43.786ms
2      48.099ms
3      45.677ms
4      40.578ms
5      41.424ms
6      44.052ms
7      42.155ms
8      37.231ms
9      38.875ms
10     39.231ms
DELETE + INSERT 1000 records into the database
Run    Time taken
1      230.961ms
2      153.159ms
3      157.534ms
4      151.055ms
5      132.865ms
6      153.485ms
7      131.588ms
8      135.99ms
9      287.143ms
10     175.562ms
I did not proceed to check for updating records with the help of another buffer table because I had found my answer.
I looked at the database metrics graphs provided by AWS, and it was clear that DELETE + INSERT was more CPU-intensive. From the statistics shared above, DELETE + INSERT also took more time compared to UPDATE.
If updates are done concurrently in batches, yes, updates will be faster, depending on the number of connections (a connection pool is recommended).
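As a rough illustration only (my benchmark actually used a lambda with the pg package, not this code), two batches of 500 could be run concurrently from Python through a connection pool like this; the table and column names are made up:

from concurrent.futures import ThreadPoolExecutor
from psycopg2.extras import execute_values
from psycopg2.pool import ThreadedConnectionPool

pool = ThreadedConnectionPool(minconn=2, maxconn=4, dsn="dbname=app user=app")

def update_batch(rows):
    # rows is a list of (id, new_balance) tuples
    conn = pool.getconn()
    try:
        with conn, conn.cursor() as cur:
            execute_values(
                cur,
                "UPDATE accounts AS a SET balance = v.balance "
                "FROM (VALUES %s) AS v(id, balance) WHERE a.id = v.id",
                rows,
            )
    finally:
        pool.putconn(conn)

rows = [(i, i * 1.5) for i in range(1, 1001)]   # 1000 dummy updates
batches = [rows[:500], rows[500:]]              # 2 batches of 500
with ThreadPoolExecutor(max_workers=2) as executor:
    list(executor.map(update_batch, batches))
pool.closeall()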
Using a buffer table, truncate, and other methods might be more suitable if you need to update almost all the records in a giant table, though I currently do not have metrics to support this. However, for a limited number of records, UPDATE is a fine choice to proceed with.
Also, be mindful that if DELETE + INSERT is not executed properly and fails partway, you might lose records, and if INSERT + DELETE fails, you might end up with duplicate records.

Statistics of all/many tables in FileMaker

I'm writing a kind of summary page for my FileMaker solution.
For this, I have defined a "statistics" table, which uses formula fields with ExecuteSQL to gather info from most tables, such as the number of records, recently changed records, etc.
This strangely takes a long time - around 10 seconds when I have a total of about 20k records in about 10 tables. The same SQL on any database system shouldn't take more than a fraction of a second.
What could the reason be, what can I do about it, and where can I start debugging to figure out what's taking all this time?
The actual code is like this:
SQLAusführen ( "SELECT COUNT(*) FROM " & _Stats::Table ; "" ; "" )
SQLAusführen ( "SELECT SUM(\"some_field_name\") FROM " & _Stats::Table ; "" ; "" )
Where "_Stats" is my statistics table, and it has a string field "Table" where I store the name of the other tables.
So each row in this _Stats table should have the stats for the table named in the "Table" field.
Update: I'm not using FileMaker server, this is a standalone client application.
We can definitely talk about why it may be slow. Usually this has mostly to do with the size and complexity of your schema. That is "usually", as you have found.
Can you use the DDR (Database Design Report) instead? Much will depend on what you are actually doing with this data. Tools like FMPerception will also give you many of the stats you are looking for. Again, it depends on what you are doing with it.
Also, can you post your actual calculation? Is the statistics table using unstored calculations? Is the statistics table related to any of the other tables? These are a couple of things that will affect how ExecuteSQL performs.
One thing to keep in mind: whether it's ExecuteSQL, a Perform Find, or a relationship, it's all the same basic query under the hood. So if it would be slow done one way, it's likely going to be slow with any other directly related approach.
Taking these one at a time:
All records count.
Placing an unstored calc in the target table allows you to get the count of the records through the relationship, without triggering a transfer of all records to the client. You can get the value from the first record in the relationship. It's a super-light way to get that info vs. using Count(), which requires FileMaker to touch every record on the other side.
Sum of Records Matching a Value.
Using a field on the _Stats table with a relationship to the target table will reduce how much work FileMaker has to do to give you an answer.
Then, having a Summary field in the target table to sum the records may prove more efficient than using an aggregate function. The summary field will also only sum the records that match the relationship. (Just don't show that field on any of your layouts if you don't need it.)
ExecuteSQL is fastest when it can just rely on a simple index lookup. Once you get outside of that, it's primarily about testing to find the sweet-spot. Typically, I will use ExecuteSQL for retrieving either a JSON object from a user table, or verifying a single field value. Once you get into sorting and aggregate functions, you step outside of the optimizations of the function.
Also note, if you have an open record (that means you, as the current user), FileMaker Server doesn't know what data you have on the client side, so it sends ALL of the records. That's why I asked if you were using unstored calcs with ExecuteSQL. It can seem slow when you can't control when the calculations fire. Often I will put the updating of that data into a scheduled script.

Spark Delta Table Updates

I am working in the Microsoft Azure Databricks environment using sparksql and pyspark.
I have a delta table on a lake where data is partitioned by, say, file_date. Every partition contains files storing millions of records per day with no primary/unique key. All these records have a "status" column which can either be NULL (if everything looks good for that specific record) or not null (say, if a particular lookup mapping for a particular column is not found). Additionally, my process contains another folder called "mapping" which gets refreshed on a periodic basis, let's say nightly to keep it simple, from which the mappings are read.
On a daily basis, there is a good chance that about 100-200 rows get errored out (status column containing not-null values). From these files, on a daily basis (hence the partition by file_date), a downstream job pulls all the valid records and sends them for further processing, ignoring those 100-200 errored records and waiting for the correct mapping file to be received. The downstream job, in addition to the valid-status records, should also check whether a mapping is now found for the errored records and, if present, take them further down as well (after, of course, updating the data lake with the appropriate mapping and status).
What is the best way to go? Ideally, I would first update the delta table/lake with the correct mapping and update the status column to say "available_for_reprocessing", and my downstream job would pull the valid data for the day plus the "available_for_reprocessing" data and, after processing, update the status back to "processed". But this seems to be super difficult using delta.
I was looking at "https://docs.databricks.com/delta/delta-update.html" and the update example there is just giving an example for a simple update with constants to update, not for updates from multiple tables.
The other, but most inefficient, option is to pull ALL the data (both processed and errored) for, say, the last 30 days, get the mapping for the errored records and write the dataframe back into the delta lake using the replaceWhere option. This is super inefficient, as we are reading everything (hundreds of millions of records) and writing everything back just to process, say, 1000 records at most. If you search for deltaTable = DeltaTable.forPath(spark, "/data/events/") at "https://docs.databricks.com/delta/delta-update.html", the example provided is for very simple updates. Without a unique key, it is impossible to update specific records as well. Can someone please help?
I can use pyspark or sparksql, but I am lost.
If you want to update 1 column ('status') on the condition that all lookups are now correct for rows where they weren't correct before (where 'status' is currently incorrect), I think the UPDATE command along with EXISTS can help you solve this. It isn't mentioned in the update documentation, but it works for both delete and update operations, effectively allowing you to update/delete records on joins.
For your scenario I believe the sql command would look something like this:
UPDATE your_db.table_name AS a
SET status = 'correct'
WHERE EXISTS
(
  SELECT *
  FROM your_db.table_name AS b
  JOIN lookup_table_1 AS t1 ON t1.lookup_column_a = b.lookup_column_a
  JOIN lookup_table_2 AS t2 ON t2.lookup_column_b = b.lookup_column_b
  -- ... add further lookups if needed
  WHERE
    b.status = 'incorrect' AND
    a.lookup_column_a = b.lookup_column_a AND
    a.lookup_column_b = b.lookup_column_b
)
Merge did the trick...
MERGE INTO deptdelta AS maindept
USING updated_dept_location AS upddept
ON upddept.dno = maindept.dno
WHEN MATCHED THEN UPDATE SET maindept.dname = upddept.updated_name, maindept.location = upddept.updated_location
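For anyone who prefers the Python API over SQL, roughly the same merge can be expressed with the DeltaTable API; this is a sketch assuming the same table and column names as the SQL above:

from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

main_dept = DeltaTable.forName(spark, "deptdelta")             # target delta table
updated_dept_location = spark.table("updated_dept_location")   # refreshed mapping data

(
    main_dept.alias("maindept")
    .merge(
        updated_dept_location.alias("upddept"),
        "upddept.dno = maindept.dno",
    )
    .whenMatchedUpdate(set={
        "dname": "upddept.updated_name",
        "location": "upddept.updated_location",
    })
    .execute()
)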

Shuffle ID's on table rails 4 [duplicate]

User.find(:all, :order => "RANDOM()", :limit => 10) was the way I did it in Rails 3.
User.all(:order => "RANDOM()", :limit => 10) is how I thought Rails 4 would do it, but this is still giving me a Deprecation warning:
DEPRECATION WARNING: Relation#all is deprecated. If you want to eager-load a relation, you can call #load (e.g. `Post.where(published: true).load`). If you want to get an array of records from a relation, you can call #to_a (e.g. `Post.where(published: true).to_a`).
You'll want to use the order and limit methods instead. You can get rid of the all.
For PostgreSQL and SQLite:
User.order("RANDOM()").limit(10)
Or for MySQL:
User.order("RAND()").limit(10)
As the random function differs between databases, I would recommend using the following code:
User.offset(rand(User.count)).first
Of course, this is useful only if you're looking for only one record.
If you want to get more than one, you could do something like:
User.offset(rand(User.count) - 10).limit(10)
The - 10 is to ensure you get 10 records in case rand returns a number greater than count - 10.
Keep in mind you'll always get 10 consecutive records.
I think the best solution is really ordering randomly in database.
But if you need to avoid the database-specific random function, you can use the pluck-and-shuffle approach.
For one record:
User.find(User.pluck(:id).shuffle.first)
For more than one record:
User.where(id: User.pluck(:id).sample(10))
I would suggest making this a scope as you can then chain it:
class User < ActiveRecord::Base
scope :random, -> { order(Arel::Nodes::NamedFunction.new('RANDOM', [])) }
end
User.random.limit(10)
User.active.random.limit(10)
While not the fastest solution, I like the brevity of:
User.ids.sample(10)
The .ids method yields an array of User IDs and .sample(10) picks 10 random values from this array.
I strongly recommend this gem for random records; it is specially designed for tables with lots of data rows:
https://github.com/haopingfan/quick_random_records
All other approaches perform badly with a large database, except this gem:
quick_random_records only cost 4.6ms totally.
the accepted answer User.order('RAND()').limit(10) cost 733.0ms.
the offset approach cost 245.4ms totally.
the User.all.sample(10) approach cost 573.4ms.
Note: My table only has 120,000 users. The more records you have, the more enormous the difference of performance will be.
UPDATE:
Perform on table with 550,000 rows
Model.where(id: Model.pluck(:id).sample(10)) cost 1384.0ms
gem: quick_random_records only cost 6.4ms totally
For MySQL this worked for me:
User.order("RAND()").limit(10)
You could call .sample on the records, like: User.all.sample(10)
The answer from @maurimiranda, User.offset(rand(User.count)).first, is not good if we need to get 10 random records, because User.offset(rand(User.count) - 10).limit(10) will return a sequence of 10 records starting from a random position; they are not totally random. So we would need to call that function 10 times to get 10 truly random records.
Besides that, offset is also not good if the random function returns a high value. If your query looks like offset: 10000 and limit: 20, it generates 10,020 rows and throws away the first 10,000 of them, which is very expensive. So calling offset.limit 10 times is not efficient.
So I think that if we just want to get one random user, then User.offset(rand(User.count)).first may be better (at least we can improve it by caching User.count).
But if we want 10 random users or more, then User.order("RAND()").limit(10) should be better.
Here's a quick solution.. currently using it with over 1.5 million records and getting decent performance. The best solution would be to cache one or more random record sets, and then refresh them with a background worker at a desired interval.
Created a random_records_helper.rb file:
module RandomRecordsHelper
  def random_user_ids(n)
    # NOTE: this picks n random ids in 1..User.count, so it assumes ids are
    # contiguous with no gaps from deleted rows.
    user_ids = []
    user_count = User.count
    n.times { user_ids << rand(1..user_count) }
    user_ids
  end
end
In the controller:
@users = User.where(id: random_user_ids(10))
This is much quicker than the .order("RANDOM()").limit(10) method - I went from a 13 sec load time down to 500ms.