Should I store an array or individual items in Memcache?

Right now we are storing some query results on Memcache. After investigating a bit more, I've seen that many people save each individual item in Memcache. The benefit of doing this is that they can get these items from Memcache on any other request.
Store an array
$key = 'page.items.20';
if (!($results = $memcache->get($key)))
{
    // cache miss: run the query and cache the whole result set for an hour
    $results = $con->execute('SELECT * FROM table LEFT JOIN .... LIMIT 0,20')->fetchAll();
    $memcache->set($key, $results, 3600); // Memcached-style set($key, $value, $ttl)
}
...
PROS:
Easier
CONS:
If I change an individual item, I have to invalidate every cached query that might contain it (it can be a pain)
I can end up with duplicated data (the same item stored in the results of several queries)
vs
Store each item
$key = 'page.items.20';
if (!($results_ids = $memcache->get($key)))
{
    $results = $con->execute('SELECT * FROM table LEFT JOIN .... LIMIT 0,20')->fetchAll();
    $results_ids = array();
    foreach ($results as $result)
    {
        $results_ids[] = $result['id'];
        // add() only stores the item if it doesn't exist yet (Memcached extension API)
        $memcache->add('item'.$result['id'], $result, 3600);
    }
    // save the list of ids for this query
    $memcache->set($key, $results_ids, 3600);
}
else
{
    // fetch all cached items in one round-trip
    $item_keys = array_map(function ($id) { return 'item'.$id; }, $results_ids);
    $results = $memcache->getMulti($item_keys);
    // get elements which are not cached
    ...
}
...
PROS:
I don't have the same item stored twice in Memcache
Easier to invalidate an item that appears in several query results (I only touch the item that changed)
CONS:
More complicated business logic.
What do you think? Any other PROS or CONS on each way?
Some links
Post explaining the second method in Memcached list
Thread in Memcached Group

Grab stats and try to calculate the hit ratio, or the possible improvement, of caching the full query vs. doing individual item gets in MC. Profiling this kind of code is also very helpful to see how your theory holds up in practice.
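For example, a rough sketch of pulling those numbers with the php Memcached extension (assuming a single server on localhost:11211; the relevant counters are get_hits and get_misses):

$mc = new Memcached();
$mc->addServer('localhost', 11211);
foreach ($mc->getStats() as $server => $stats) {
    $hits   = $stats['get_hits'];
    $misses = $stats['get_misses'];
    $total  = $hits + $misses;
    // hit ratio = hits / (hits + misses)
    printf("%s: hit ratio %.1f%%\n", $server, $total ? 100 * $hits / $total : 0);
}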
It depends on what the query does. If you have a set of users and then want to grab the "top 10 music affinity" with some of those friends, it is worth having both caches:
- each friend (in fact, each user of the site)
- the top 10 query for each user (space is cheaper than CPU time)
But in general it is worth storing in MC all the individual entities that are going to be used frequently (either in the same code execution, in subsequent requests, or by other users). Then, for CPU- or resource-heavy queries and data processing, either cache them in MC as well or delegate them to asynchronous jobs instead of computing them in real time (e.g. a "Top 10 site users" list doesn't need to be real-time; it can be updated hourly or daily).
And of course take into account that if you store individual entities in MC, you have to remove all referential integrity from the DB to be able to reuse them either individually or in groups.

The question is subjective and argumentative...
This depends on your usage pattern. If you're constantly pulling individual nodes by ID, store each one separately.
Also, note that in either case, storing the list isn't all that useful except for the top 20. If you insert/update/delete a node in such a way that the top-20 is no longer valid, you may end up needing to flush the next 20, and so on.
Lastly, keep in mind that it's a cache. If you're using a cache, you're making the underlying statement that it's no big deal if the data you're outputting is slightly stale.

Memcached stores data in chunks of specific sizes (slabs), as explained in the link below.
http://code.google.com/p/memcached/wiki/NewUserInternals
If the items you store are large, there will be fewer chunks of the larger sizes, and the least-recently-used algorithm will push data out even when there is space available in the other chunk sizes, because LRU is applied per chunk size (slab class).
You can decide which implementation to choose based on the data size distribution in memcached.
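If you want to gauge which slab classes your two strategies would land in, a quick approximate check is to compare the serialized size of the full result set against a single row. This is only a rough estimate ($results is the array fetched in the question's code, and the client may serialize or compress values differently):

// approximate payload sizes for the two caching strategies
$fullQuerySize  = strlen(serialize($results));      // the whole result set as one item
$singleItemSize = strlen(serialize($results[0]));   // one row cached on its own
printf("full query: ~%d bytes, single item: ~%d bytes\n", $fullQuerySize, $singleItemSize);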

Efficient row reading with libpq (postgresql)

This is a scalability related question.
We want to read some rows from a table and, after processing some of them, stop the query. The stop criterion is data dependent (we do not know in advance how many rows, or which ones, we are interested in).
This is scalability sensitive when the number of rows of the table grows far beyond the number of rows we really are interested in.
If we use the standard PQexec, all rows are returned and we are forced to consume them (we have to call PQgetResult until it returns NULL). So this does not scale.
We are now trying "row by row" reading.
We first used PQsendQuery and PQsetSingleRowMode. However, we still have to call PQgetResult until it returns NULL.
Our last approach is PQsendQuery with PQsetSingleRowMode, and when we are done we cancel the query as follows:
void CloseRowByRow() {
    PGcancel *c = PQgetCancel(conn);
    char errbuf[256];
    PQcancel(c, errbuf, 256);
    PQfreeCancel(c);
    // drain the remaining results so the connection can be reused
    while (res) {
        PQclear(res);
        res = PQgetResult(conn);
    }
}
This produces some performance benefits but we are wondering if this is the best we can do.
So here comes the question: Is there any other way?
Use DECLARE and FETCH to define and read from a server-side cursor; this is exactly what they are meant for. You would use the standard APIs; FETCH simply lets you retrieve the results in batches of a controlled size. See the examples in the docs for more details.
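A minimal sketch of that approach (the "items" table and its columns are made up for illustration, and error handling is reduced to the bare minimum):

#include <libpq-fe.h>

static void read_until_done(PGconn *conn)
{
    PGresult *res;

    PQclear(PQexec(conn, "BEGIN"));
    PQclear(PQexec(conn, "DECLARE my_cur CURSOR FOR SELECT id, payload FROM items ORDER BY id"));

    int done = 0;
    while (!done) {
        res = PQexec(conn, "FETCH 100 FROM my_cur");    /* next batch of up to 100 rows */
        if (PQresultStatus(res) != PGRES_TUPLES_OK || PQntuples(res) == 0)
            done = 1;                                   /* error or no rows left */
        for (int i = 0; i < PQntuples(res) && !done; i++) {
            /* process PQgetvalue(res, i, 0), ...; set done = 1
               when the data-dependent stop criterion is met */
        }
        PQclear(res);
    }

    PQclear(PQexec(conn, "CLOSE my_cur"));
    PQclear(PQexec(conn, "COMMIT"));
}

The server only produces rows as they are FETCHed, so stopping early does not force the remaining rows to be generated or transferred, and the transaction ends cleanly with CLOSE/COMMIT instead of a cancel.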

Why does response time go up when the number of concurrent requests per second to my Asp.Net Core API goes up

I'm testing an endpoint under load. For 1 request per second, the average response time is around 200ms. The endpoint does a few DB lookups (all read) that are pretty fast and it's async throughout.
However when doing a few hundred requests per second (req/sec), the average response time goes up to over a second.
I've had a look at the best practices guide at:
https://learn.microsoft.com/en-us/aspnet/core/performance/performance-best-practices?view=aspnetcore-2.2
Some suggestions like "Avoid blocking calls" and "Minimize large object allocations" seem like they don't apply since I'm already using async throughout and my response size for a single request is less than 50 KB.
There are a couple though that seem like they might be useful, for example:
https://learn.microsoft.com/en-us/ef/core/what-is-new/ef-core-2.0#high-performance
https://learn.microsoft.com/en-us/aspnet/core/performance/performance-best-practices?view=aspnetcore-2.2#pool-http-connections-with-httpclientfactory
Questions:
Why would the average response time go up with an increased req/sec?
Are the suggestions above that I've marked as 'might be useful' likely to help? I ask because, while I'd like to try them all, I unfortunately have limited time, so I'd like to start with the options that are most likely to help.
Are there any other options worth considering?
I've had a look at these two existing threads, but neither answer my question:
Correlation between requests per second and response time?
ASP.NET Web API - Handle more requests per second
It will be hard to answer your specific issue without access to the code, but the main thing to consider is the size and complexity of the database queries generated by EF. Using async/await will increase the responsiveness of your web server in accepting requests, but the request handling time under load will depend largely on the queries being run, since the database becomes the contention point. You will want to ensure that all queries are as minimal as possible. For example, there is a huge difference between the following three statements:
// 1. Loads the whole table (and the included table) into memory, then filters client-side
var someData = context.SomeTable.Include(x => x.SomeOtherTable)
    .ToList()
    .Where(x => x.SomeCriteriaMethod())
    .ToList();

// 2. Filters in SQL, but still returns complete entity rows
var someData = context.SomeTable.Include(x => x.SomeOtherTable)
    .Where(x => x.SomeField == someField && x.SomeOtherTable.SomeOtherField == someOtherField)
    .ToList();

// 3. Filters in SQL and projects only the columns that are actually needed
var someData = context.SomeTable
    .Where(x => x.SomeField == someField && x.SomeOtherTable.SomeOtherField == someOtherField)
    .Select(x => new SomeViewModel
    {
        SomeTableId = x.SomeTableId,
        SomeField = x.SomeField,
        SomeOtherField = x.SomeOtherTable.SomeOtherField
    }).ToList();
Examples like the first above are extremely inefficient because they load all data from the table and its related tables before filtering any rows. Even though your web server may only pass back a few rows, it has requested everything from the database. These scenarios creep into apps when developers want to filter on a value that EF cannot translate to SQL (such as a method call) and "solve" it by inserting a ToList() call, or as a by-product of poor separation, such as a repository pattern that returns IEnumerable.
The second example is a little better: it avoids the read-everything ToList() call, but it still loads entire rows for data that isn't necessary, which ties up resources on both the database and the web server.
The third example demonstrates refining queries to just return the absolute minimum of data that the consumer needs. This can make better use of indexes and execution plans on the database server.
Other performance pitfalls you can face under load are things like lazy loads. Databases execute a finite number of concurrent requests, so if some of your queries kick off additional lazy-load requests, those execute immediately when there is no load; under load, though, they are queued up along with all the other queries and lazy-load requests, which can hold up data retrieval.
Ultimately you should run a SQL profiler against your database to capture the kinds and number of SQL queries being executed. When executing in a test environment, pay close attention to the read count and CPU cost rather than the total execution time. As a general rule of thumb, queries with a higher read and CPU cost will be far more susceptible to execution-time blow-out under load: they require more resources to run and "touch" more rows, meaning more waiting for row/table locks.
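If the database happens to be SQL Server, one way to see those read/CPU numbers for a single captured statement is to re-run it in SSMS with statistics switched on (shown only as an illustration; other engines have their own equivalents, such as EXPLAIN ANALYZE):

SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- paste the query captured by the profiler here and execute it
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;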
Another thing to watch out for is "heavy" queries in very large data systems that need to touch a lot of rows, such as reports and, in some cases, highly customizable search queries. If these are needed, you should consider planning your database design to include a read-only replica to run reports or large search expressions against, to avoid row-lock scenarios in your primary database that can degrade responsiveness for the typical read and write queries.
Edit: Identifying lazy load queries.
These show up in a profiler where you query against a top level table, but then see a number of additional queries for related tables following it.
For example, say you have a table called Order, with a related table called Product, another called Customer and another called Address for a delivery address. To read all orders for a date range you'd expect to see a query such as:
SELECT [OrderId], [Quantity], [OrderDate], [ProductId], [CustomerId], [DeliveryAddressId] FROM [dbo].[Orders] WHERE [OrderDate] >= '2019-01-01' AND [OrderDate] < '2020-01-01'
Where you just wanted to load Orders and return them.
When the serializer iterates over the fields, it finds a Product, a Customer, and an Address referenced, and by attempting to read those fields it trips lazy loading, resulting in:
SELECT [CustomerId], [Name] FROM [dbo].[Customers] WHERE [CustomerId] = 22
SELECT [ProductId], [Name], [Price] FROM [dbo].[Products] WHERE [ProductId] = 1023
SELECT [AddressId], [StreetNumber], [City], [State], [PostCode] FROM [dbo].[Addresses] WHERE [AddressId] = 1211
If your original query returned 100 Orders, you could see potentially 100x the above set of queries, one set for each order, because a lazy-load hit on one order row will attempt to look up the related Customer by CustomerId, the related Product by ProductId, and the related Address by DeliveryAddressId. This can, and does, get costly. It may not be noticeable in a test environment, but it adds up to a lot of extra queries.
If you eager-load the related entities using .Include(), EF will compose JOIN statements to get the related rows all in one hit, which is considerably faster than fetching each related entity individually. Still, that can result in pulling back a lot of data you don't need. The best way to avoid this extra cost is to leverage projection through Select to retrieve just the columns you need.
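As a rough sketch of that difference, using the hypothetical Order/Customer/Product/Address entities from above (the navigation property names and the fromDate/toDate range variables are assumptions for illustration):

// Eager loading: one query with JOINs, but whole rows come back for every related entity.
var orders = context.Orders
    .Include(o => o.Customer)
    .Include(o => o.Product)
    .Include(o => o.DeliveryAddress)
    .Where(o => o.OrderDate >= fromDate && o.OrderDate < toDate)
    .ToList();

// Projection: one query that returns only the columns the response actually needs.
var orderSummaries = context.Orders
    .Where(o => o.OrderDate >= fromDate && o.OrderDate < toDate)
    .Select(o => new
    {
        o.OrderId,
        o.OrderDate,
        CustomerName = o.Customer.Name,
        ProductName  = o.Product.Name,
        DeliveryCity = o.DeliveryAddress.City
    })
    .ToList();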

Service fabric reliable dictionary performance with 1 million keys

I am evaluating the performance of Service Fabric with a Reliable Dictionary of ~1 million keys. I'm getting fairly disappointing results, so I wanted to check if either my code or my expectations are wrong.
I have a dictionary initialized with
dict = await _stateManager.GetOrAddAsync<IReliableDictionary2<string, string>>("test_"+id);
id is unique for each test run.
I populate it with a list of strings, like
"1-1-1-1-1-1-1-1-1",
"1-1-1-1-1-1-1-1-2",
"1-1-1-1-1-1-1-1-3".... up to 576,000 items. The value in the dictionary is not used, I'm currently just using "1".
It takes about 3 minutes to add all the items to the dictionary. I have to split the transaction to 100,000 at a time, otherwise it seems to hang forever (is there a limit to the number of operations in a transaction before you need to CommitAsync()?)
// take100_000 is the next 100,000 keys from the original list of 576,000
using (var tx = _stateManager.CreateTransaction())
{
    foreach (var tick in take100_000)
    {
        await dict.AddAsync(tx, tick, "1");
    }
    await tx.CommitAsync();
}
After that, I need to iterate through the dictionary to visit each item:
using (var tx = _stateManager.CreateTransaction())
{
    var enumerator = (await dict.CreateEnumerableAsync(tx)).GetAsyncEnumerator();
    try
    {
        while (await enumerator.MoveNextAsync(ct)) // ct is a CancellationToken
        {
            var tick = enumerator.Current.Key;
            // do something with tick
        }
    }
    catch (Exception)
    {
        throw; // rethrow without resetting the stack trace
    }
}
This takes 16 seconds.
I'm not so concerned about the write time; I know it has to be replicated and persisted. But why does it take so long to read? 576,000 17-character string keys should be no more than 11.5 MB in memory, and the values are only a single character and are ignored. Aren't Reliable Collections cached in RAM? Iterating through a regular Dictionary with the same values takes 13 ms.
I then called ContainsKeyAsync 576,000 times on an empty dictionary (in 1 transaction). This took 112 seconds. Trying this on probably any other data structure would take ~0 ms.
This is on a local 1 node cluster. I got similar results when deployed to Azure.
Are these results plausible? Any configuration I should check? Am I doing something wrong, or are my expectations wildly inaccurate? If so, is there something better suited to these requirements? (~1 million tiny keys, no values, persistent transactional updates)
Ok, for what it's worth:
Not everything is stored in memory. To support large Reliable Collections, some values are cached and some reside on disk, which can lead to extra I/O while retrieving the data you request. I've heard a rumor that at some point we may get a chance to adjust the caching policy, but I don't think it has been implemented yet.
You iterate through the data reading records one by one. IMHO, if you issue half a million separate sequential reads against any data source, the outcome won't be very encouraging. I'm not saying that every single MoveNext() results in a separate I/O operation, but I'd say that overall it doesn't look like a single fetch.
It depends on the resources you have. For instance, trying to reproduce your case on my local machine with a single partition and three replicas, I get the records in about 5 seconds on average.
Thinking about a workaround, here is what comes to mind:
Chunking. I tried the same thing, splitting the records into string arrays capped at 10 elements (IReliableDictionary<string, string[]>). It was essentially the same amount of data, but the time went from 5 seconds down to 7 ms. My guess is that if you keep your items below 80 KB, reducing the number of round trips and keeping the LOH small, you should see your performance improve.
Filtering. CreateEnumerableAsync has an overload that allows you to specify a delegate so that values are not retrieved from disk for keys that do not match the filter (see the sketch after this list).
State Serializer. In case you go beyond simple strings, you could develop your own serializer and try to reduce the I/O incurred for your type.
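For reference, a sketch of the filtering idea, using the key-filter overload of CreateEnumerableAsync on the dictionary from the question (the "1-1-" prefix is just an arbitrary example filter):

using (var tx = _stateManager.CreateTransaction())
{
    var enumerable = await dict.CreateEnumerableAsync(
        tx,
        key => key.StartsWith("1-1-"),   // keys that fail the filter never have their values loaded
        EnumerationMode.Unordered);

    var enumerator = enumerable.GetAsyncEnumerator();
    while (await enumerator.MoveNextAsync(ct))
    {
        var tick = enumerator.Current.Key;
        // do something with tick
    }
}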
Hopefully it makes sense.

Efficient way to store and query tree-like hierarchical data

Please see the image here:
https://picasaweb.google.com/108987384888529766314/CS3217Project#5717590602842112850
So, as you can see from the image, we are trying to store hierarchical data in a database. One publisher has many articles, one article has many comments, and so on. Thus, if I use a relational database like SQL Server, I will have a publishers table, an articles table and a comments table. But the comments table will grow very quickly and become very large.
So, is there any alternative that allows me to store and query such tree-like data efficiently? How about NoSQL (MongoDB)?
You can use adjacency lists for hierarchical data. They are efficient and easy to implement, and they also work with MySQL. Here is a link: http://mikehillyer.com/articles/managing-hierarchical-data-in-mysql/.
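As a minimal sketch of what an adjacency list looks like for the comments case (table and column names are illustrative only):

-- each comment points at its parent; NULL means a top-level comment on the article
CREATE TABLE comments (
    comment_id        INT PRIMARY KEY,
    article_id        INT NOT NULL,
    parent_comment_id INT NULL REFERENCES comments(comment_id),
    body              TEXT
);

-- top-level comments of article 7
SELECT * FROM comments WHERE article_id = 7 AND parent_comment_id IS NULL;

-- direct replies to comment 42
SELECT * FROM comments WHERE parent_comment_id = 42;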
Here is a good survey of 8 NoSQL distributed databases and the needs that they fill. Ask yourself:
Do you anticipate you will write more than you read?
Do you anticipate you will need low-latency data access, high concurrency support and high availability?
Do you need dynamic queries?
Do you prefer to define indexes, not map/reduce functions?
Is versioning important?
Do you anticipate you will accumulate occasionally changing data, on which pre-defined queries are to be run?
Do you anticipate rapidly changing data with a foreseeable database size (it should fit mostly in memory)?
Do you anticipate graph-style, rich or complex, interconnected data?
Do you anticipate you will need random, realtime read/write access to BigTable-like data?
Most NoSQL database designs involve a mix of the following techniques:
Embedding - nesting of objects and arrays inside a document
Linking - references between documents
The schema you craft depends on various aspects of your data. One solution to your problem may be the following schema:
db.articles { _id: ARTICLE_ID; publisher: "publisher name"; ... }
db.comments { _id: COMMENT_ID; article_id: ARTICLE_ID; ... }
Here the publisher is embedded in an article document. We can do this because it's unlikely the publisher name will change. It also saves us having to look up publisher details every time we need to access an article.
The comments are stored in their own documents, with each comment linking to an article. To find all comments associated with an article you can run:
db.comments.find({article_id: "My Article ID"})
and to speed things up you could always add an index on "article_id":
db.comments.ensureIndex({article_id:1})
I found this SO post when searching for the same thing. The URL posted by Phpdevpad is a great read for understanding how the Adjacency List Model and the Nested Set Model work and how they compare with each other. The article is very much in favor of the Nested Set Model and explains many drawbacks of the Adjacency List Model; however, I was greatly concerned about the mass updates the nested method would cause.
The main limitation of adjacency lists outlined in the article is that an additional self join is required for each level of depth. However, this limitation is easily overcome with the use of another language (such as PHP) and a recursive function for finding children, as outlined here: http://www.sitepoint.com/hierarchical-data-database/
A snippet from the URL above using the Adjacency List Model:
<?php
// $parent is the parent of the children we want to see
// $level is increased when we go deeper into the tree,
// used to display a nice indented tree
function display_children($parent, $level) {
    // retrieve all children of $parent
    $result = mysql_query('SELECT title FROM tree WHERE parent="'.$parent.'";');
    // display each child
    while ($row = mysql_fetch_array($result)) {
        // indent and display the title of this child
        echo str_repeat(' ', $level).$row['title']."\n";
        // call this function again to display this child's children
        display_children($row['title'], $level + 1);
    }
}

// $node is the name of the node we want the path of
function get_path($node) {
    // look up the parent of this node
    $result = mysql_query('SELECT parent FROM tree WHERE title="'.$node.'";');
    $row = mysql_fetch_array($result);
    // save the path in this array
    $path = array();
    // only continue if this $node isn't the root node
    // (that's the node with no parent)
    if ($row['parent'] != '') {
        // the last part of the path to $node is the name
        // of the parent of $node
        $path[] = $row['parent'];
        // we should add the path to the parent of this node
        // to the path
        $path = array_merge(get_path($row['parent']), $path);
    }
    // return the path
    return $path;
}

display_children('', 0);
Conclusion
As a result I am now convinced that the Adjacency List Model will be far easier to use and manage moving forward.

How to fetch the continuous list with PostgreSQL in web

I am making an API over HTTP that fetches many rows from PostgreSQL with pagination. In ordinary cases, I usually implement such pagination with a naive OFFSET/LIMIT clause. However, there are some special requirements in this case:
There are so many rows that I believe users will never reach the end (imagine a Twitter timeline).
Pages do not have to be randomly accessible, only sequentially.
The API would return a URL containing a cursor token that points to the next page of the continuous stream.
Cursor tokens do not have to exist permanently, only for some time.
The ordering fluctuates frequently (like Reddit rankings); however, an existing cursor should keep a consistent ordering.
How can I achieve this? I am ready to change my whole database schema for it!
Assuming it's only the ordering of the results that fluctuates and not the data in the rows, Fredrik's answer makes sense. However, I'd suggest the following additions:
Store the id list in a PostgreSQL table using the array type rather than in memory. Doing it in memory, unless you carefully use something like Redis with auto-expiry and memory limits, is setting yourself up for a DoS memory-consumption attack. I imagine it would look something like this:
create table foo_paging_cursor (
    cursor_token ...,       -- probably a uuid is best, or a timestamp (see below)
    result_ids integer[],   -- or text[] if you have non-integer ids
    expiry_time TIMESTAMP
);
You need to decide whether the cursor_token and result_ids can be shared between users, to reduce your storage needs and the time needed to run the initial query per user. If they can be shared, choose a cache window, say 1 or 5 minutes, then upon a new request create the cursor_token for that time period and check whether the result ids have already been calculated for that token. If not, add a new row for that token. You should probably add a lock around the check/insert code to handle concurrent requests for a new token.
Have a scheduled background job that purges old tokens/results, and make sure your client code can handle any errors related to expired/invalid tokens.
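To make that concrete, a hedged sketch of how the table might be used (assuming integer ids and a page size of 50; the application then loads the actual rows by those ids and re-orders them to match the stored array):

-- ids for page 1 (array positions 1-50) of a cursor, if it hasn't expired
SELECT result_ids[1:50] AS page_ids
FROM foo_paging_cursor
WHERE cursor_token = $1
  AND expiry_time > now();

-- scheduled purge job (run periodically, e.g. from cron)
DELETE FROM foo_paging_cursor WHERE expiry_time < now();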
Don't even consider using real db cursors for this.
Keeping the result ids in Redis lists is another way to handle this (see the LRANGE command), but be careful with expiry and memory usage if you go down that path. Your Redis key would be the cursor_token and the ids would be the members of the list.
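A sketch of that Redis shape, with a hypothetical token abc123, placeholder ids, a 5-minute expiry and 50 ids per page (RPUSH stores the ids in result order, and each page is one LRANGE slice):

RPUSH cursor:abc123 101 205 97 318
EXPIRE cursor:abc123 300
LRANGE cursor:abc123 0 49
LRANGE cursor:abc123 50 99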
I know absolutely nothing about PostgreSQL, but I'm a pretty decent SQL Server developer, so I'd like to take a shot at this anyway :)
How many rows/pages do you expect a user would maximally browse through per session? For instance, if you expect a user to page through a maximum of 10 pages per session [each page containing 50 rows], you could take that maximum and set up the web service so that when the user requests the first page, you cache 10*50 rows (or just the ids of the rows, depending on how much memory / how many simultaneous users you've got).
This would certainly help speed up your web service, in more ways than one. And it's quite easy to implement, too. So:
When a user requests data from page #1, run the query (complete with order by, join checks, etc.), store all the ids in an array (but a maximum of 500 ids), and return the data rows that correspond to the ids at positions 0-49 in the array.
When the user requests page #2-10, return the data rows that correspond to the ids at positions (page-1)*50 to page*50-1 in the array.
You could also bump up the numbers; an array of 500 ints only occupies about 2 KB of memory, but it also depends on how fast you want your initial query/response to be.
I've used a similar technique on a live website, and when the user continued past page 10 I just switched to queries. I guess another solution would be to continue to expand/fill the array (running the query again, but excluding already-included ids).
Anyway, hope this helps!