GCP datastore sudden extreme data inconsistency (NDB 1.8.0) - google-cloud-firestore

I have a six-month-old Python 3.8 standard GAE project in the europe-west3 region, along with Firestore in Datastore mode.
Whether I use Redis as a global cache or not, I have never had any inconsistency issues. An immediate fetch after a put (the redirect took about 1 second) yielded fresh results, up until last week. I did some benchmarking, and it now takes around 30 seconds for a put to become visible to a global query. It actually behaves much like the Datastore emulator with the consistency parameter set to 0.05.
I have read a lot about Datastore and its eventual consistency here, but as the documentation says, that only applies to the "old" version. The new Firestore in Datastore mode should ensure strong consistency, per this passage:
Eventual consistency: all Datastore queries become strongly consistent.
Am I interpreting this claim wrong?
I have also created a fresh project (same region) with only the essential NDB initialization, and the extreme "lag" is still there.
I'm running out of ideas about what could cause this new behavior. Could it be that the Warsaw datacenter just came online and is causing the issue?
Abstract code with google-cloud-ndb==1.8.0:
class X(ndb.Model):
    foo = ndb.StringProperty()

x = X(foo="a")
x.put()
time.sleep(5)

for y in X.query():  # returns 0 results
    print(y)
If I get the entity by its key, it's there and fresh. It even shows up instantly in the Datastore admin console.

This was also filed as https://github.com/googleapis/python-ndb/issues/666. It turns out Cloud NDB before 1.9.0 was explicitly requesting eventually consistent queries.
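Until the library is upgraded to a version with strongly consistent queries, a small polling helper can paper over the propagation lag. This is a generic sketch, not part of the NDB API; `fetch` and `predicate` are callables you supply yourself:

```python
import time

def wait_for_results(fetch, predicate, timeout=30.0, interval=1.0):
    """Poll an eventually consistent query until `predicate(results)`
    is true or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        results = fetch()
        if predicate(results):
            return results
        if time.monotonic() >= deadline:
            raise TimeoutError("query did not become consistent in time")
        time.sleep(interval)

# Hypothetical usage with the model above (inside an ndb context):
#   rows = wait_for_results(lambda: list(X.query()), lambda r: len(r) >= 1)
```

Note that a key lookup (`key.get()`) is always strongly consistent, so where possible, fetching by key avoids the need for polling entirely.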

Related

Data syncing with pouchdb-based systems client-side: is there a workaround to the 'deleted' flag?

I'm planning to use rxdb + hasura/postgresql on the backend. I'm reading this rxdb page, for example, which right off the bat requires sync-able entities to have a deleted flag.
Q1 (main question)
Is there ANY point at which I can finally hard-delete these entities? What conditions would have to be met? E.g., could I simply use "older than X months" and then force my app to only ever display data less than X months old?
Is such a hard-delete, if possible, best carried out directly in the central DB, since it will be the source of truth? Would there be any repercussions client-side that I'm not foreseeing/understanding?
I foresee the number of deleted records growing rapidly in my app, and I don't want to store all this extra data forever.
Q2 (bonus / just curious)
What is the (algorithmic) basis for needing a 'deleted' flag? Is it just faster to check a flag than to check for the omission of an object from, say, a very large list? I apologize if it's a stupid question :(
Ultimately it comes down to a decision informed by your particular business/product: how long do you want to keep deleted entities in your system? For some applications it's important to always keep a history of deleted things, or even individual revisions to records, stored as a kind of ledger. You'll have to make a judgment call about how long to keep your deleted entities.
I'd recommend also adding a deleted_at column if you haven't already; then you could easily leverage something like Hasura's new Scheduled Triggers functionality to run a recurring job that fully deletes records older than your threshold.
You could also leverage Hasura's permissions system to ensure that soft-deleted rows aren't returned to the client. There are documentation and examples for ways to work with soft deletes in Hasura.
For your second question: it is definitely much faster to check a deleted flag on records than to diff the entire dataset looking for things that are now missing.
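A purge job along those lines can be sketched with plain SQL. The sketch below uses sqlite3 so it runs anywhere, but the same DELETE statement works against Postgres from a Hasura scheduled trigger or a cron job; the table and column names (`items`, `deleted`, `deleted_at`) are invented for illustration:

```python
import sqlite3
from datetime import datetime, timedelta, timezone

def purge_soft_deleted(conn, older_than_days=90):
    """Hard-delete rows that were soft-deleted more than
    `older_than_days` days ago; returns the number of rows purged."""
    cutoff = (datetime.now(timezone.utc) - timedelta(days=older_than_days)).isoformat()
    cur = conn.execute(
        "DELETE FROM items WHERE deleted = 1 AND deleted_at < ?", (cutoff,)
    )
    conn.commit()
    return cur.rowcount
```

Because the app only ever displays data newer than the threshold, no client can still be syncing a row that this job removes.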

LiquiBase and Kubernetes database rolling updates

Let's say I have a database with schema v1 and an application tightly coupled to that schema, i.e. a SQLException is thrown if the records in the database don't match the entity classes.
How should I deploy a change that alters the database schema and then deploy the application without a race condition, i.e. a user querying the app for a field which no longer exists?
This problem actually isn't specific to Kubernetes; it happens in any system with more than one server. Kubernetes just makes it more front-and-center because of how automatic the rollover is. The words "tightly coupled" in your question are a dead giveaway of the real problem here.
That said, the "answer" will depend on which of the following mental models is better for your team:
do not make two consecutive schemas contradictory
use a "maintenance" page that keeps traffic off of the pods until they are fully rolled out
just accept the SQLExceptions and add better retry logic to the consumers
We use the first one, because the Kubernetes rollout is baked into our engineering culture: we know that pod-old and pod-new will be running simultaneously, so schema changes need to be incremental and backward compatible for at minimum one generation of pods.
However, sometimes we accept that the engineering effort to do that costs more than the 500s a specific breaking change will incur, so we cheat: scale the replicas down, roll it out, and warn our monitoring team that there will be exceptions but that they'll blow over. We can do that partly because the client has retry logic built in.
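The first mental model is usually called "expand/contract". A minimal sketch, using sqlite3 for illustration and an invented schema (renaming `users.full_name` to `users.display_name`); in Liquibase, the expand and contract steps would be separate changesets shipped in separate releases:

```python
import sqlite3

# Hypothetical v1 schema, to be migrated without breaking running v1 pods.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Ada Lovelace')")

# Release N ("expand"): add the new column. Old pods keep reading full_name
# and simply ignore display_name; new pods write both columns.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")
conn.execute("UPDATE users SET display_name = full_name WHERE display_name IS NULL")
conn.commit()

# Release N+1 ("contract"), shipped only after no v1 pods remain:
#   ALTER TABLE users DROP COLUMN full_name;
```

At every point in the rollout, both the old and the new pod generation can run against the current schema, so there is no window in which a query hits a missing field.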

My heroku postgres psycopg2 (python) query gets slower and slower each time executed. Any insight?

I have a Python app running on Heroku that uses a standard Heroku Postgres database (the $50 tier). There are four tables in the DB. My app queries for one primary key in the main table based on input from my app's users.
The querying worked great at first, but now it becomes too slow after about 40-50 minutes without restarting my dyno. Queries eventually take 2,000 ms and several seconds to load for my users. I'm newer to programming, and this is my second app. What would make queries get slower over time instead of staying constant? They are so fast at first. What are best practices for psycopg2 within an app to ensure the DB doesn't get hung up? Here is an example of one of the queries (the others have similar syntax throughout the script):
if eventText == "Mc3 my champs":
    user = event.source.user_id
    profile = line_bot_api.get_profile(user)
    name = str(profile.display_name)
    cur = None
    try:
        cur = conn.cursor(cursor_factory=psycopg2.extras.DictCursor)
        # Get the user's information if it exists
        cur.execute(
            """SELECT lineid, summoner_name, champ_data
               FROM prestige_data
               WHERE lineid = %(lineid)s
               LIMIT 1""",
            {"lineid": user},
        )
        rows = cur.fetchall()
        for row in rows:
            champs = row[2]
            prestige = calculate_prestige(champs)
            champs = json.loads(champs)
            champs_sorted = sorted(champs.items(), key=lambda item: item[1], reverse=True)
            listing = "\n".join(map(str, champs_sorted))
            listing = listing.replace("(", "").replace(")", "").replace("'", "")
            msg = listing + "\n" + "---------------------------" + "\n" + name + "\n" + "Prestige:" + str(prestige)
            line_bot_api.reply_message(
                event.reply_token,
                TextSendMessage(text=msg))
            break  # we should only have one result, but we'll stop just in case
        else:
            # The user does not exist in the database already
            msg = "Oops! You need to add some champs first. Try 'Mc3 inputchamp'."
            line_bot_api.reply_message(
                event.reply_token,
                TextSendMessage(text=msg))
    except BaseException:
        if cur is not None:
            conn.rollback()
    finally:
        if cur is not None:
            cur.close()
While I know I did not frame that question well (I've only been programming for a month), I have found one issue potentially causing this that warrants documentation here.
I suspected the concurrency issue occurred when the expected data was not found by a query. In that situation, my conn would not roll back, commit, or close in the above example.
Per psycopg2's documentation, even SELECT queries need to be committed or rolled back, or the transaction stays open. That in turn keeps your Heroku dyno worker focused on the transaction for 30 seconds, causing an H12 error. So make sure you commit or roll back every query, regardless of the outcome, so that you never leave an idle transaction.
My queries are spiffy now, but the issue persists. I'm not sure what slowly but surely idles my waitress workers. I think some ancillary process is started in one of the class modules I've created and runs indefinitely, taking hold of each worker until they are all stuck on a transaction, which leads to H12.
I'd love someone's input if they've had a similar experience. I don't want a cron job rebooting the app every 10 minutes just to keep it functioning.

Entity Framework memory leak in Azure worker role

I'm investigating my worker role memory dumps using WinDbg.
I got the following results by grabbing dumps every half hour from WaWorkerHost.exe locally. The first column is the object count, the second is the size. Also, the most expensive objects in the dump are strings.
35360 3394560 System.Data.Objects.EntitySqlQueryState
40256 3864576 System.Data.Objects.EntitySqlQueryState
45152 4334592 System.Data.Objects.EntitySqlQueryState
I found that class here http://entityframework.codeplex.com/SourceControl/latest#src/EntityFramework/Core/Objects/Internal/EntitySqlQueryState.cs
As you can see, it caches the query string.
Is it possible that Entity Framework caches these objects without ever releasing them?
I found an article showing that NHibernate can:
http://rasmuskl.dk/2008/12/19/a-windbg-debugging-journey-nhibernate-memory-leak/
The worker role automatically restarts every day on the production server when it runs out of RAM.
I'm using Entity Framework 5 and Azure SDK 2.5.
Please help me with this issue; what would you advise?

Issue with Entity Framework 4.2 Code First taking a long time to add rows to a database

I am currently using Entity Framework 4.2 with Code First. I currently have a Windows 2008 application server and a database server running on Amazon EC2. The application server has a Windows Service installed that runs once per day. The service executes the following code:
// returns between 2000-4000 records
var users = userRepository.GetSomeUsers();

// do some work
foreach (var user in users)
{
    var userProcessed = new UserProcessed { User = user };
    userProcessedRepository.Add(userProcessed);
}

// Calls SaveChanges() on DbContext
unitOfWork.Commit();
This code takes a few minutes to run. It also maxes out the CPU on the application server. I have tried the following measures:
Removed the unitOfWork.Commit() call to see if the problem was network-related when the application server talks to the database. This did not change the outcome.
Changed my application server from a medium instance to a high-CPU instance on Amazon to see if it was resource-related. The server no longer maxed out the CPU, and the execution time improved slightly; however, it was still a few minutes.
As a test, I modified the above code to run the loop three times to compare execution times using the same DbContext. Every consecutive loop took longer than the previous one, but that could be related to reusing the same DbContext.
Am I missing something? Is it really possible that something this simple takes minutes to run, even if I don't commit to the database after each loop? Is there a way to speed this up?
Entity Framework (as it stands) isn't really well suited to this kind of bulk operation. Are you able to use one of the bulk insert methods from EC2? Otherwise, you might find that hand-coding the T-SQL INSERT statements is significantly faster. If performance is important, that probably outweighs the benefits of using EF.
My guess is that your ObjectContext is accumulating a lot of entity instances, and SaveChanges seems to have a phase whose running time is linear in the number of loaded entities. This is likely why each loop takes longer and longer.
A way to resolve this is to use multiple, smaller ObjectContexts so that old entity instances get released.
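The "multiple, smaller ObjectContexts" advice boils down to processing the users in fixed-size batches and creating a fresh context (ending with SaveChanges) per batch, so the change tracker never grows large. A language-neutral sketch of the batching itself, shown in Python since the EF-specific part cannot be made runnable here (it is only described in comments):

```python
def chunked(items, size):
    """Yield successive `size`-sized slices of `items`."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Applied to the EF code above, the idea would be (pseudocode):
#   for batch in chunked(users, 500):
#       with new DbContext:            # fresh, small context per batch
#           add a UserProcessed row for each user in batch
#           SaveChanges()              # commits and discards tracked entities
```

The batch size (500 here) is a tuning knob: large enough to amortize round trips, small enough to keep the per-batch change tracker cheap.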