How to fetch the continuous list with PostgreSQL in web - postgresql

I am making an API over HTTP that fetches many rows from PostgreSQL with pagination. In ordinary cases, I usually implement such pagination through naive OFFET/LIMIT clause. However, there are some special requirements in this case:
A lot of rows there are so that I believe users cannot reach the end (imagine Twitter timeline).
Pages does not have to be randomly accessible but only sequentially.
API would return a URL which contains a cursor token that directs to the page of continuous chunks.
Cursor tokens have not to exist permanently but for some time.
Its ordering has frequent fluctuating (like Reddit rankings), however continuous cursors should keep their consistent ordering.
How can I achieve the mission? I am ready to change my whole database schema for it!

Assuming it's only the ordering of the results that fluctuates and not the data in the rows, Fredrik's answer makes sense. However, I'd suggest the following additions:
store the id list in a postgresql table using the array type rather than in memory. Doing it in memory, unless you carefully use something like redis with auto expiry and memory limits, is setting yourself up for a DOS memory consumption attack. I imagine it would look something like this:
create table foo_paging_cursor (
cursor_token ..., -- probably a uuid is best or timestamp (see below)
result_ids integer[], -- or text[] if you have non-integer ids
expiry_time TIMESTAMP
);
You need to decide if the cursor_token and result_ids can be shared between users to reduce your storage needs and the time needed to run the initial query per user. If they can be shared, chose a cache window, say 1 or 5 minute(s), and then upon a new request create the cache_token for that time period and then check to see if the results ids have already been calculated for that token. If not, add a new row for that token. You should probably add a lock around the check/insert code to handle concurrent requests for a new token.
Have a scheduled background job that purges old tokens/results and make sure your client code can handle any errors related to expired/invalid tokens.
Don't even consider using real db cursors for this.
Keeping the result ids in Redis lists is another way to handle this (see the LRANGE command), but be careful with expiry and memory usage if you go down that path. Your Redis key would be the cursor_token and the ids would be the members of the list.

I know absolutely nothing about PostgreSQL, but I'm a pretty decent SQL Server developer, so I'd like to take a shot at this anyway :)
How many rows/pages do you expect a user would maximally browse through per session? For instance, if you expect a user to page through a maximum of 10 pages for each session [each page containing 50 rows], you could make take that max, and setup the webservice so that when the user requests the first page, you cache 10*50 rows (or just the Id:s for the rows, depends on how much memory/simultaneous users you got).
This would certainly help speed up your webservice, in more ways than one. And it's quite easy to implement to. So:
When a user requests data from page #1. Run a query (complete with order by, join checks, etc), store all the id:s into an array (but a maximum of 500 ids). Return datarows that corresponds to id:s in the array at positions 0-9.
When the user requests page #2-10. Return datarows that corresponds to id:s in the array at posisions (page-1)*50 - (page)*50-1.
You could also bump up the numbers, an array of 500 int:s would only occupy 2K of memory, but it also depends on how fast you want your initial query/response.
I've used a similar technique on a live website, and when the user continued past page 10, I just switched to queries. I guess another solution would be to continue to expand/fill the array. (Running the query again, but excluding already included id:s).
Anyway, hope this helps!

Related

Firebase analytics - Unity - time spent on a level

is there any possibility to get exact time spent on a certain level in a game via firebase analytics? Thank you so much 🙏
I tried to use logEvents.
The best way to do so would be measuring the time on the level within your codebase, then have a very dedicated event for level completion, in which you would pass the time spent on the level.
Let's get to details. I will use Kotlin as an example, but it should be obvious what I'm doing here and you can see more language examples here.
firebaseAnalytics.setUserProperty("user_id", userId)
firebaseAnalytics.logEvent("level_completed") {
param("name", levelName)
param("difficulty", difficulty)
param("subscription_status", subscriptionStatus)
param("minutes", minutesSpentOnLevel)
param("score", score)
}
Now see how I have a bunch of parameters with the event? These parameters are important since they will allow you to conduct a more thorough and robust analysis later on, answer more questions. Like, Hey, what is the most difficult level? Do people still have troubles on it when the game difficulty is lower? How many times has this level been rage-quit or lost (for that you'd likely need a level_started event). What about our paid players, are they having similar troubles on this level as well? How many people have ragequit the game on this level and never played again? That would likely be easier answer with sql at this point, taking the latest value of the level name for the level_started, grouped by the user_id. Or, you could also have levelName as a UserProperty as well as the EventProperty, then it would be somewhat trivial to answer in the default analytics interface.
Note that you're limited in the number of event parameters you can send per event. The total number of unique parameter names is limited too. As well as the number of unique event names you're allowed to have. In our case, the event name would be level_completed. See the limits here.
Because of those limitations, it's important to name your event properties in somewhat generic way so that you would be able to efficiently reuse them elsewhere. For this reason, I named minutes and not something like minutes_spent_on_the_level. You could then reuse this property to send the minutes the player spent actively playing, minutes the player spent idling, minutes the player spent on any info page, minutes they spent choosing their upgrades, etc. Same idea about having name property rather than level_name. Could as well be id.
You need to carefully and thoughtfully stuff your event with event properties. I normally have a wrapper around the firebase sdk, in which I would enrich events with dimensions that I always want to be there, like the user_id or subscription_status to not have to add them manually every time I send an event. I also usually have some more adequate logging there Firebase Analytics default logging is completely awful. I also have some sanitizing there, lowercasing all values unless I'm passing something case-sensitive like base64 values, making sure I don't have double spaces (so replacing \s+ with " " (space)), maybe also adding the user's local timestamp as another parameter. The latter is very helpful to indicate time-cheating users, especially if your game is an idler.
Good. We're halfway there :) Bear with me.
Now You need to go to firebase and register your eps (event parameters) into cds (custom dimensions and metrics). If you don't register your eps, they won't be counted towards the global cd limit count (it's about 50 custom dimensions and 50 custom metrics). You register the cds in the Custom Definitions section of FB.
Now you need to know whether this is a dimension or a metric, as well as the scope of your dimension. It's much easier than it sounds. The rule of thumb is: if you want to be able to run mathematical aggregation functions on your dimension, then it's a metric. Otherwise - it's a dimension. So:
firebaseAnalytics.setUserProperty("user_id", userId) <-- dimension
param("name", levelName) <-- dimension
param("difficulty", difficulty) <-- dimension (or can be a metric, depends)
param("subscription_status", subscriptionStatus) <-- dimension (can be a metric too, but even less likely)
param("minutes", minutesSpentOnLevel) <-- metric
param("score", score) <-- metric
Now another important thing to understand is the scope. Because Firebase and GA4 are still, essentially just in Beta being actively worked on, you only have user or hit scope for the dimensions and only hit for the metrics. The scope basically just indicates how the value persists. In my example, we only need the user_id as a user-scoped cd. Because user_id is the user-level dimension, it is set separately form the logEvent function. Although I suspect you can do it there too. Haven't tried tho.
Now, we're almost there.
Finally, you don't want to use Firebase to look at your data. It's horrible at data presentation. It's good at debugging though. Cuz that's what it was intended for initially. Because of how horrible it is, it's always advised to link it to GA4. Now GA4 will allow you to look at the Firebase values much more efficiently. Note that you will likely need to re-register your custom dimensions from Firebase in GA4. Because GA4 is capable of getting multiple data streams, of which firebase would be just one data source. But GA4's CDs limits are very close to Firebase's. Ok, let's be frank. GA4's data model is almost exactly copied from that of Firebase's. But GA4 has a much better analytics capabilities.
Good, you've moved to GA4. Now, GA4 is a very raw not-officially-beta product as well as Firebase Analytics. Because of that, it's advised to first change your data retention to 12 months and only use the explorer for analysis, pretty much ignoring the pre-generated reports. They are just not very reliable at this point.
Finally, you may find it easier to just use SQL to get your analysis done. For that, you can easily copy your data from GA4 to a sandbox instance of BQ. It's very easy to do.This is the best, most reliable known method of using GA4 at this moment. I mean, advanced analysts do the export into BQ, then ETL the data from BQ into a proper storage like Snowflake or even s3, or Aurora, or whatever you prefer and then on top of that, use a proper BI tool like Looker, PowerBI, Tableau, etc. A lot of people just stay in BQ though, it's fine. Lots of BI tools have BQ connectors, it's just BQ gets expensive quickly if you do a lot of analysis.
Whew, I hope you'll enjoy analyzing your game's data. Data-driven decisions rock in games. Well... They rock everywhere, to be honest.

Improve Hasura Subscription Performance

we developed a web app that relies on real-time interaction between our users. We use Angular for the frontend and Hasura with GraphQL on Postgres as our backend.
What we noticed is that when more than 300 users are active at the same time we experience crucial performance losses.
Therefore, we want to improve our subscriptions setup. We think that possible issues could be:
Too many subscriptions
too large and complex subscriptions, too many forks in the subscription
Concerning 1. each user has approximately 5-10 subscriptions active when using the web app. Concerning 2. we have subscriptions that are complex as we join up to 6 tables together.
The solutions we think of:
Use more queries and limit the use of subscriptions on fields that are totally necessary to be real-time.
Split up complex queries/subscriptions in multiple smaller ones.
Are we missing another possible cause? What else can we use to improve the overall performance?
Thank you for your input!
Preface
OP question is quite broad and impossible to be answered in a general case.
So what I describe here reflects my experience with optimization of subscriptions - it's for OP to decide is it reflects their situtation.
Short description of system
Users of system: uploads documents, extracts information, prepare new documents, converse during process (IM-like functionalitty), there are AI-bots that tries to reduce the burden of repetitive tasks, services that exchange data with external systems.
There are a lot of entities, a lot of interaction between both human and robot participants. Plus quite complex authorization rules: visibility of data depends on organization, departements and content of documents.
What was on start
At first it was:
programmer wrote a graphql-query for whole data needed for application
changed query to subscription
finish
It was OK for first 2-3 monthes then:
queries became more complex and then even more complex
amount of subscriptions grew
UI became lagging
DB instance is always near 100% load. Even during nigths and weekends. Because somebody did not close application
First we did optimization of queries itself but it did not suffice:
some things are rightfully costly: JOINs, existence predicates, data itself grew significantly
network part: you can optimize DB but just to transfer all needed data has it's cost
Optimization of subscriptions
Step I. Split subscriptions: subscribe for change date, query on change
Instead of complex subscription for whole data split into parts:
A. Subscription for a single field that indicates that entity was changed
E.g.
Instead of:
subscription{
document{
id
title
# other fields
pages{ # array relation
...
}
tasks{ # array relation
...
}
# multiple other array/object relations
# pagination and ordering
}
that returns thousands of rows.
Create a function that:
accepts hasura_session - so that results are individual per user
returns just one field: max_change_date
So it became:
subscription{
doc_change_date{
max_change_date
}
}
Always one row and always one field
B. Change of application logic
Query whole data
Subscribe for doc_change_date
memorize value of max_change_date
if max_change_date changed - requery data
Notes
It's absolutely OK if subscription function sometimes returns false positives.
There is no need to replicate all predicates from source query to subscription function.
E.g.
In our case: visibility of data depends on organizations and departments (and even more).
So if a user of one department creates/modifies document - this change is not visible to user of other department.
But those changes are like ones/twice in a minute per organization.
So for subscription function we can ignore those granularity and calculate max_change_date for whole organization.
It's beneficial to have faster and cruder subscription function: it will trigger refresh of data more frequently but whole cost will be less.
Step II. Multiplex subscriptions
The first step is a crucial one.
And hasura has a multiplexing of subscriptions: https://hasura.io/docs/latest/graphql/core/databases/postgres/subscriptions/execution-and-performance.html#subscription-multiplexing
So in theory hasura could be smart enough and solve your problems.
But if you think "explicit better than implicit" there is another step you can do.
In our case:
user(s) uploads documents
combines them in dossiers
create new document types
converse with other
So subscriptions becames: doc_change_date, dossier_change_date, msg_change_date and so on.
But actually it could be beneficial to have just one subscription: "hey! there are changes for you!"
So instead of multiple subscriptions application makes just one.
Note
We thought about 2 formats of multiplexed subscription:
A. Subscription returns just one field {max_change_date} that is accumulative for all entities
B. Subscription returns more granular result: {doc_change_date, dossier_change_date, msg_change_date}
Right now "A" works for us. But maybe we change to "B" in future.
Step III. What we would do differently with hasura 2.0
That's what we did not tried yet.
Hasura 2.0 allows registering VOLATILE functions for queries.
That allows creation of functions with memoization in DB:
you define a cache for function call presumably in a table
then on function call you first look in cache
if not exists: add values to cache
return result from cache
That allows further optimizations both for subscription functions and query functions.
Note
Actually it's possible to do that without waiting for hasura 2.0 but it requires trickery on postgresql side:
you create VOLATILE function that did real work
and another function that's defined as STABLE that calls VOLATILE function. This function could be registered in hasura
It works but that's trick is hard to recommend.
Who knows, maybe future postgresql versions or updates will make it impossible.
Summary
That's everything that I can say on the topic right now.
Actually I would be glad to read something similar a year ago.
If somebody sees some pitfalls - please comment, I would be glad to hear opinions and maybe alternative ways.
I hope that this explanation will help somebody or at least provoke thought how to deal with subscriptions in other ways.

Is there any speed benefit to performing your own algorithm to scramble IDs for security purposes? [duplicate]

I am planning to implement my own very simple "hashing" formula to add a layer of security to an app with multiple users. My current plan is as follows:
User creates an account at which point an ID is generated on the backend. The ID is run through a formula (let's say ID * 57 + 8926 - 36 * 7, or something equally random). I then send back to the frontend the new user ID and the new "hashed" number and store them in localStorage.
User tries to access a secured area (let's say a settings page so they can change their own settings).
I send the backend two values: their ID and the hashed number. I run the ID through the same formula to check it matches the hashed value I've received. If the check passes, they can get in. So if someone has tried, say, changing their ID in localStorage to get access to another user's settings page, the only way they could achieve that is if they guess what the formula was. They could easily guess a user ID, but guess that the corresponding number is the result of ID * 57 + 8926 - 36 * 7 seems pretty unlikely.
I'm doing this because it would be quicker/cheaper than a db lookup for an actual hashed value... I think? Would it make more sense to use a package to create some kind of primary key/uuid instead of "hashing" my own value and doing a db lookup each time?
Tech stack: React on FE, Python on BE, SQL db.
I see a lot of posts saying "don't roll your own" -- is this absolute?
Yes it is. The reason being that whenever a non-cryptographer tries their hand at developing their own algorithms, they invariably fall into a multitude of pit holes which render the security of the algorithm to next to useless.
Your particular scheme, for example, can be trivially broken given two consecutive ID and "hash" pairs. (It's a simple arithmetic sequence, deriving the formula of an arithmetic sequence given two consecutive values is grade ~6 level math.)
I'm doing this because it would be quicker/cheaper than a db lookup for an actual hashed value...
The performance difference would probably be negligible. Don't worry about it.
If the information is not particularly sensitive, just assign each user a randomly generated 128 bit number. The chances of someone guessing a valid user's number are practically zero.
Two property of real hashes that you are missing with this are
a simple change in the input causes a large change in the output
all hashes have the same length
This could be a problem if a user somehow knows their own id and hash.
With your selfmade hash I could easily find out the hash of a random other user by reverse engeniering the hash.

how to avoid redundent in Yahoo answer api

I have a question about Yahoo answer api. I plan to use (questionSearch, getByCategory, getQuestion, getByUser). For example I used getByCategory to query. Each time I call the function, I can query max 50 questions. However, there are a lot of same questions which have been queried in previous time. So How can I remove this redundent ?
The API doesn't track what it has returned to you previously as its stateless.
This leaves you with two options that I can think of.
1) After you get your data back filter out what you already have. This requires you checking what is displayed and then not displaying duplicated items.
2) Store all ID's you have showing in a list, then adjust your YQL Query so that it provides that list of ID's as ones not to turn. Like:
select * from answers.getbycategory where category_id=2115500137 and type="resolved" and id not in ('20140216060544AA0tCLE', '20140215125452AAcNRTq', '20140215124804AAC1cQl');
The downside of this, is that it could effect performance since your YQL queries will start to take longer and longer to return.

What are some of the advantage/disadvantages of using SQLDataReader?

SqlDataReader is a faster way to process the stored procedure. What are some of the advantage/disadvantages of using SQLDataReader?
I assume you mean "instead of loading the results into a DataTable"?
Advantages: you're in control of how the data is loaded. You can ask for specific data types, and you don't end up loading the whole set of data into memory all at the same time unless you want to. Basically, if you want the data but don't need a data table (e.g. you're going to populate your own kind of collection) you don't get the overhead of the intermediate step.
Disadvantages: you're in control of how the data is loaded, which means it's easier to make a mistake and there's more work to do.
What's your use case here? Do you have a good reason to believe that the overhead of using a normal (or strongly typed) data table is significantly hurting performance? I'd only use SqlDataReader directly if I had a good reason to do so.
The key advantage is obviously speed - that's the main reason you'd choose a SQLDataReader.
One potential disadvantage not already mentioned is that the SQLDataReader is forward only, so you can only go through the records once in sequence - that's one of the things that allows it to be so fast. In many cases that's fine but if you need to iterate over the records more than once or add/edit/delete data you'll need to use one of the alternatives.
It also remains connected until you've worked through all the records and close the reader (of course, you can opt to close it earlier, but then you can't access any of the remaining records). If you're going to perform any lengthy processing on the records as you iterate over them, you may find that you impact other connections to the database.
It depends on what you need to do. If you get back a page of results from the database (say 20 records), it would be better to use a data adapter to fill a DataSet, and bind that to something in the UI.
But if you need to process many records, 1 at a time, use SqlDataReader.
Advantages: Faster, less memory.
Disadvantages: Must remain connected, must remember to close the reader.
The data might not be concluesive and you are not in control of your actions that why the milk man down the road has always got to carry data with him or else they gona get cracked by the data and the policeman will not carry any data because they think that is wrong to keep other people's data and its wrong to do so. There is a girl who lives in Sheffield and she loves to go out and play most the times that she s in the house that is why I dont like to talk to her because her parents and her other fwends got taken to peace gardens thats a place that everyone likes to sing and stay. usually famous Celebs get to hang aroun dthere but there are always top security because we dont want to get skanked down them ends. KK see u now I need 2 go and chill in the west end PEACE!!!£"$$$ Made of MOney MAN$$$$