Per-collection read/write monitoring in Firestore [duplicate] - google-cloud-firestore

I am getting very high counts of Entity Writes in my Firestore database.
Write permission on most paths is restricted; writes there are done from a back-end server using the Admin SDK. Only a very few paths allow client writes, and only for users who are authenticated, registered, and joined and approved in a specific group. So the ways to abuse this are apparently thin, yet hard to specifically identify.
The only way I see is to execute a Cloud Function on every write and have the function log the paths somewhere to analyze (sketched below), but that introduces further cost and complexity.
Is there any way or recommendation to monitor/profile where (i.e. the path) and who (UID or any other identity) is performing the writes? There are tools for this for RTDB, but I can't find anything for Firestore.
I am also wondering if there is any way to restrict IPs/users automatically in case of abuse (i.e. a high rate of reads/writes)?
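For illustration, here is a minimal sketch of the write-logging idea above, assuming a first-generation Node.js Cloud Function and a single top-level collection wildcard (the function name is hypothetical, and nested subcollections would need their own triggers):

// Hypothetical sketch: log the path of every write to a top-level collection so
// the entries can be filtered and aggregated in Cloud Logging. This adds one
// function invocation per write, which is the extra cost mentioned above.
import * as functions from "firebase-functions";

export const auditWrites = functions.firestore
  .document("{collectionId}/{docId}")
  .onWrite((_change, context) => {
    functions.logger.info("firestore-write", {
      path: `${context.params.collectionId}/${context.params.docId}`,
      eventType: context.eventType,
      timestamp: context.timestamp,
      // Unlike RTDB triggers, Firestore triggers do not reliably expose the
      // writer's UID, so answering "who" usually means writing the UID into
      // the document itself from the client or backend.
    });
    return null;
  });

This answers the "where" part reasonably cheaply; the "who" part still has to come from data you store yourself.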

What I'm currently doing is going to the Firestore console => Usage => View usage,
which shows usage charts for reads, writes, and deletes.
It's not the same as the profiler, but better than nothing.
I'm also keeping an eye on the video linked below to see if someone provides an answer; people are asking for a profiler there too.
https://www.youtube.com/watch?v=9CObBsjk6Tc

Related

What is the expected behaviour with Firestore cache disabled?

Context:
My users need to see the same data.
Example: An admin user deletes an item from a customer's order. The admin is offline, so the change happens in the cache only. The customer thinks he is still getting his product; the balance due, customer expectations, etc. are obviously all out of sync.
Persistence is turned off:
await Firebase.initializeApp(options: DefaultFirebaseOptions.currentPlatform);
FirebaseFirestore.instance.settings = const Settings(persistenceEnabled: false);
Why can I still delete a document locally, or write to cache?
Are there any other options? Is this setting working as it should?
The only two workarounds I can find from reading articles and other questions are calling callable Cloud Functions, and using a transaction - this is obviously not as fast. And is using transactions this way good practice?
I just don't want to be doing something else and find out I am doing something wrong with this cache setting, as it's not working as expected:
https://firebase.flutter.dev/docs/firestore/usage/
"This functionality is enabled by default, however it can be disabled if needed."
What I would like is for it to just throw when the cache is disabled.
Thank you
Setting persistenceEnabled to false disables the persistent cache that Firestore uses to cache data and keep pending writes between page/app reloads. But even when this is set to false, the Firestore SDK keeps all pending writes, and the data you have active listeners on, in memory. This allows it to keep working over spotty network connections, which are quite common.
There is no way to disable the in-memory cache of pending writes.
Possible workarounds are to either detect the no-network situation yourself, and disable the functionality you don't want to allow, or (as you said already) use a mechanism that inherently requires a connection, such as a transaction or calling a Cloud Function.
All of these lead to a worse user experience though, so I recommend considering whether disabling the operation is really needed, or if there's a way to work with Firestore's model rather than against it.
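As a concrete illustration of the transaction option, here is a minimal sketch using the web modular SDK (the FlutterFire API is analogous; the collection layout and deleteOrderItem function are hypothetical):

// Hypothetical sketch: route a delete through a transaction so it only succeeds
// when the client can reach the server. Transactions cannot be served from the
// local cache, so the "offline admin" scenario surfaces as an error instead of
// a silent cache-only change.
import { initializeApp } from "firebase/app";
import { getFirestore, runTransaction, doc } from "firebase/firestore";

const app = initializeApp({ /* your Firebase config */ });
const db = getFirestore(app);

async function deleteOrderItem(orderId: string, itemId: string): Promise<void> {
  const itemRef = doc(db, "orders", orderId, "items", itemId);
  await runTransaction(db, async (tx) => {
    const snap = await tx.get(itemRef); // server read; rejects when offline
    if (!snap.exists()) return;         // nothing to delete
    tx.delete(itemRef);                 // committed only if the server accepts it
  });
}

A callable Cloud Function gives the same guarantee, with the added benefit of server-side validation, at the cost of an extra hop.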

How to prevent attackers from doing many writes to the firestore database?

I already set security rules for my database; however, there are still some places a user must be allowed to write to, and I am afraid an attacker might do multiple writes to those locations. I can add an authentication check to allow write access, but all they have to do is create an account to gain write access. The only other solution I can think of is only allowing a write to happen every X minutes, if that is possible. Or is there another solution to prevent attackers from making multiple writes?
Posting Doug Stevenson's comment as a community answer for better visibility:
"all they have to do is create an account to gain write access" - yes, that's the way things work. Your rules should determine which users can read and write specific data, or accept that anyone can read and write data simply by creating an account. If you want general rate limiting, you should consider forcing your users through a backend endpoint that determines when each user is allowed to write. You could use security rules for this, but there is no easy expression for it, and you're in for a lot of work to track write rates correctly.

Meteor - Why not just publish all the collection data?

This may be quite an easy question to answer, as it may just be my lack of understanding, but if you have to run the query twice - once on the server and once on the client - why not just publish all the collection data and run a single query on the client?
Obviously I don't mean doing this for the users collection, but if you have a blog Posts collection, wouldn't this be beneficial?
Publish all the post data, then subscribe to it and run whatever query is necessary on the client to get the data you need.
Publishing everything is fine in a development environment, since Meteor adds the autopublish package by default, but it has drawbacks in a production environment. I find these two points to be of importance (a publication sketch follows them):
Security: The idea is to supply only as much data to the client as required. You can never trust the client, and you don't know what the client may use the data for. For your use case of simple blog posts this may not be a serious risk, but it can be a critical risk for an e-commerce application. The last thing you want is a hacker using the data and leveraging a bug in your code to do nasty stuff.
Data overhead: For subscriptions, waitOn is generally used, so templates are not rendered until all the data has been made available to the client. If you have a very large amount of data, that takes considerable time. So it is advised to keep the data at an 'only what you need' level to optimize this time too.
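For instance, a restricted publication might look like this sketch (collection, publication, and field names are hypothetical):

// Hypothetical sketch of a limited Meteor publication: only the fields the post
// list actually renders, and only a bounded number of documents.
import { Meteor } from "meteor/meteor";
import { Mongo } from "meteor/mongo";

export const Posts = new Mongo.Collection("posts");

if (Meteor.isServer) {
  Meteor.publish("posts.recent", function (limit: number = 20) {
    return Posts.find(
      {},
      {
        fields: { title: 1, summary: 1, authorId: 1, createdAt: 1 }, // no full body, no drafts
        sort: { createdAt: -1 },
        limit: Math.min(limit, 100), // never ship the whole collection
      }
    );
  });
}

The client then subscribes to posts.recent and runs whatever query it likes against the few documents it actually received.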

Are "best practices" regarding connection handle re-use and database user design mutually exclusive?

SO says this may be subjective. I'm hoping not - I just can't seem to understand how this works in practice, and it seems like a specific enough technical question with, I hope, a definitive answer.
Context: LAPP stack.
1. I've read that using a single database user as the login for all connections to the database, and handling security yourself from there, is a bad idea. Databases have sufficient security models and it makes sense to use them.
2. Database handles have some resource cost associated with them, hence the existence of Apache::DBI, DBIx::Connector, and DBI::connect_cached(), which re-use a recent connection to a database. Making use of them should make a web app faster by avoiding the cost of connecting to a database.
The reason these seem to be mutually exclusive best practices is that, in my understanding, #1 implies that any database connection will be made with separate per-user credentials, which implies (as Apache::DBI documents) that re-using such connections will likely quickly cause your database backend to run out of connections.
The default maximum number of connections for PostgreSQL is 100.
The default number of server processes allowed for Apache 2 running with the prefork MPM (256) far exceeds that, so it seems Apache::DBI's docs are right.
Thus the question: What do people do then, in practice?
Does this mean people using a LAPP stack generally connect using a single database user, and implement their own security/permissions model? Or does it mean they don't pool connections? Or do they choose between these two strategies based on speed vs security needs if they go with a LAPP stack, and if they need both, go with a desktop app or some other connection model?
Or if these are not, in fact, mutually exclusive strategies, what am I missing in my understanding here?
I've read that using a single database user as the login for all connections to the database, and handling security yourself from there, is a bad idea. Databases have sufficient security models and it makes sense to use them.
You probably misread this, or read it in a highly biased location. A more balanced view is (hopefully) this:
Managing perms (ACL or RBAC or other) at the application level, with the permission data inside the database, is a bloody mess and hard to get right. It can cripple performance, too, if done improperly (think: "select * from table join perms where convoluted_permission_scenario"). Depending on who you ask, you'll get more or less extreme viewpoints, e.g. here's (the very controversial) Zed Shaw: http://vimeo.com/2723800.
Managing perms at the DB level is just as much of a bloody mess. Not all engines implement row-level permissions, and even then there occasionally are leaks. For instance, calling a function in a where clause could (can?) leak rows in Postgres (until a recent version?) if raise gets called. And frankly, if you go past a superficial analysis of what is going on, it basically amounts to the former — just standardized and (usually) in C.
Managing perms at the app level without a database is also a bloody mess. It'll cripple performance no matter what you do from the moment you need to join outside of SQL, unless you're dealing with trivial amounts of data. If you try it, you'll do fine… until your database grows too large and you basically don't.
So, in short: it's a bloody mess no matter where you manage it. Because permissions are a mess. In addition to the casual and idealistic "Joe needs write access to this set of nodes", you also need to cope with more down to earth scenarios such as "John is going off on vacation for Christmas and needs to temporarily delegate his write permissions on this set of nodes to his assistant Jane". Moreover, whichever scenario you do pick, you need to manage read access (which is usually the most frequent) in such a way that it's fast so you can scale. There's no silver bullet.
Moreover, even in the first and last of the above scenarios, it's ideal to have three DB users. One for reads, one for read/writes, and one for schema changes. Most apps don't, because it's yet another bloody mess to configure your ORM that way, hence the typical one DB user per app.
Anyway, getting back to your question: what people do in practice is one or two database users (read vs read/write/modify), implement RBAC or ACL within the database itself, and avoid access restriction logic like the plague on public-facing pages for performance reasons.
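A sketch of the pooled-single-user pattern described here, shown with node-postgres rather than the Perl modules from the question (role, table, and function names are hypothetical):

// Hypothetical sketch: one application-level database role shared by all web
// requests via a pool, with the access check done against an ACL table in the
// application code rather than via per-user database logins.
import { Pool } from "pg";

const pool = new Pool({
  host: "localhost",
  database: "appdb",
  user: "app_readwrite",          // one shared role, as described above
  password: process.env.PGPASSWORD,
  max: 20,                        // well under PostgreSQL's default of 100 connections
});

// Placeholder for the app's own ACL lookup.
async function canViewCustomer(userId: number, customerId: number): Promise<boolean> {
  const { rows } = await pool.query(
    "SELECT 1 FROM customer_acl WHERE user_id = $1 AND customer_id = $2",
    [userId, customerId]
  );
  return rows.length > 0;
}

async function getOrdersFor(requestingUserId: number, customerId: number) {
  if (!(await canViewCustomer(requestingUserId, customerId))) {
    throw new Error("forbidden");
  }
  const { rows } = await pool.query(
    "SELECT id, total, created_at FROM orders WHERE customer_id = $1",
    [customerId]
  );
  return rows;
}

The point is that the pool caps connections for the one shared role, instead of caching a separate connection per end user.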

Caching repeating query results in MongoDB

I am going to build a page that is designed to be "viewed" a lot, but far fewer users will "write" to the database. For example, only 1 in 100 users may post news on my site; the rest will just read it.
In the above case, the same query will be performed 100 times when those users visit my homepage, while the actual database changes little. Actually 99 of those queries are a waste of computer power. Are there any methods that can cache the results of the first query and, when the same query is detected within a short time, deliver the cached result?
I use MongoDB and Tornado. However, some posts say that the MongoDB does not do caching.
Making a static, cached HTML with something like Nginx is not preferred, because I want to render a personalized page by Tornado each time.
I use MongoDB and Tornado. However, some posts say that the MongoDB does not do caching.
I don't know who said that, but MongoDB does have a way to cache queries; in fact, it relies on the OS's LRU page cache, since it does not do memory management itself.
So long as your working set fits in memory without the OS having to page it out or swap constantly, you should be reading this query from memory most of the time. So yes, MongoDB can cache, but technically it doesn't; the OS does.
Actually 99 of those queries are a waste of computer power.
Caching mechanisms for this kind of problem are much the same across most technologies, whether MongoDB or SQL. Of course, this only matters if it is actually a problem; if you ask me, you are probably micro-optimising unless you get Facebook, Google, or YouTube levels of traffic.
Caching is a huge subject in its own right, ranging from caching query results in pre-aggregated MongoDB collections, Memcached, Redis, etc., to caching HTML and other web resources so the server does as little work as possible.
Personally, as I said, your scenario sounds as though you are thinking about the wasted computer power the wrong way. Even if you were to cache this query in another collection or technology, you would probably use about the same amount of power and resources retrieving the result from there as if you just didn't bother. That assumption does, however, depend on you having the right indexes, schema, set-up, etc.
I recommend you read some links on good schema design and index creation:
http://docs.mongodb.org/manual/core/indexes/
https://docs.mongodb.com/manual/core/data-model-operations/#large-number-of-collections
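As a tiny illustration of the index side of that advice (collection and field names are hypothetical, using the Node driver):

// Hypothetical sketch: create the index that a "latest news first" homepage
// query would rely on, e.g. find().sort({ createdAt: -1 }).limit(20).
import { MongoClient } from "mongodb";

async function ensureIndexes(): Promise<void> {
  const client = await MongoClient.connect("mongodb://localhost:27017");
  const posts = client.db("blog").collection("posts");
  await posts.createIndex({ createdAt: -1 }); // keeps the homepage query out of a collection scan
  await client.close();
}

ensureIndexes().catch(console.error);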
Making a static, cached HTML with something like Nginx is not preferred, because I want to render a personalized page by Tornado each time.
Yeah, I think that by worrying about query caching you are prematurely optimising, especially since you don't want to take off what would be 90% of the load on your server each time: loading the page itself.
I would focus on your schema and indexes and then worry about caching if you really need it.
The author of the Motor (MOngo + TORnado) package gives an example of caching his list of categories here: http://emptysquare.net/blog/refactoring-tornado-code-with-gen-engine/
Basically, he defines a global list of categories and queries the database to fill it in; then, whenever he needs the categories in his pages, he checks the list: if it exists, he uses it; if not, he queries again and fills it in. He has it set up to invalidate the list whenever he inserts into the database, but depending on your usage you could create a global timeout variable to keep track of when you need to re-query next. If you're doing something complicated this could get out of hand, but if it's just a list of the most recent posts or something, I think it would be fine.
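Here is that pattern transposed to the Node MongoDB driver as a sketch (the linked post uses Python/Motor; collection and variable names are hypothetical):

// Hypothetical sketch: keep the category list in a module-level variable, refill
// it on a miss or after a timeout, and invalidate it when a new post is inserted,
// mirroring the approach in the linked post.
import { MongoClient, Db } from "mongodb";

const client = new MongoClient("mongodb://localhost:27017");
let db: Db;

let cachedCategories: string[] | null = null;
let cachedAt = 0;
const CACHE_TTL_MS = 60_000; // illustrative "global timeout variable"

async function getCategories(): Promise<string[]> {
  const stale = Date.now() - cachedAt > CACHE_TTL_MS;
  if (cachedCategories === null || stale) {
    // The repeated query only runs on a cache miss.
    cachedCategories = await db.collection("posts").distinct("category");
    cachedAt = Date.now();
  }
  return cachedCategories;
}

async function addPost(post: { title: string; category: string }): Promise<void> {
  await db.collection("posts").insertOne(post);
  cachedCategories = null; // invalidate, as the linked post does on insert
}

async function main(): Promise<void> {
  await client.connect();
  db = client.db("blog");
  console.log(await getCategories()); // first call hits MongoDB
  console.log(await getCategories()); // second call is served from memory
  await addPost({ title: "Hello", category: "news" });
  await client.close();
}

main().catch(console.error);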