How does the load-store queue work in the presence of MSHRs?

I understand the basic working of the load-store queue:
when loads compute their address, they check the store queue for prior stores to the same address; if there is one, they get the data from the most recent such store, otherwise from the write buffer or the data cache.
When stores compute their address, they check the load queue for any load-ordering violations.
My questions are about what happens in the following cases.
In the first case, a load accesses the data cache (despite unresolved store addresses in the store queue) and misses in the L1 data cache. Before the data can be retrieved from the lower levels, one of the store addresses resolves, and that store checks the load queue for violations. The dependent load has already accessed the data cache but hasn't received its value yet because of the long-latency miss. Does the store flag a load violation, or does it do store-to-load forwarding and cancel the value coming back from the cache?
In the second case, when a load misses in the L1 data cache, it is placed in an MSHR so that it does not block the execute stage. When the miss resolves, the MSHR entry for that load holds the destination register and the physical address, so the value can be written into the physical register. But how does the MSHR tell the load queue that the value is available, and in which pipeline stage does this happen?
Also, I have read somewhere that MSHRs store physical addresses while the load-store queue stores virtual addresses. So how does the MSHR communicate with the LSQ?
I haven't found any resources that address these questions.

This is speculative execution in which loads bypass older stores whose addresses are still unresolved. When the older store resolves, the hardware can detect a load-ordering violation. If the probability of address aliasing is low, the speculation is profitable (more throughput), which is typically true for real programs. On detecting a load violation, the machine takes the appropriate step: (a) store-to-load forward, or (b) roll the pipeline back to the resolved store.
For the second question: the returning miss data is handled the same way as a load served by a cache hit (which can take 1-3 cycles for an L1 hit). For example, with reservation stations and a CDB (common data bus), the result is broadcast to all hardware structures that need it, including the corresponding load queue entry.
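To make the bookkeeping concrete, here is a toy software model of the two checks discussed above: what a store does when its address resolves, and how a returning L1 miss marks its load queue entry complete. This is only an illustrative sketch; the field names, the canForwardLate flag and the broadcast step are assumptions, not any specific core's design.

```scala
// Illustrative model only -- not a hardware design.
final case class LoadEntry(
  seqNum: Long,          // program order (smaller = older)
  address: Option[Long], // None until the address is computed
  executed: Boolean,     // has it already read from the store queue / cache?
  value: Option[Long]
)

final case class StoreEntry(seqNum: Long, address: Option[Long], data: Option[Long])

object LsqModel {
  // When a store's address resolves, scan the load queue for *younger* loads
  // to the same address that already executed speculatively.
  def onStoreAddressResolved(
      store: StoreEntry,
      loadQueue: Seq[LoadEntry],
      canForwardLate: Boolean // design choice from the answer: forward vs. squash
  ): Seq[(LoadEntry, String)] =
    loadQueue
      .filter(ld => ld.seqNum > store.seqNum)        // younger than the store
      .filter(ld => ld.address.isDefined && ld.address == store.address)
      .filter(_.executed)                            // already got (or is fetching) a value
      .map { ld =>
        if (canForwardLate && store.data.isDefined)
          // Option (a): forward the store's data and ignore the cache fill.
          (ld.copy(value = store.data), "forwarded from store")
        else
          // Option (b): flag a memory-order violation; the pipeline rolls back
          // to the store and the load re-executes.
          (ld, "violation -> squash and replay")
      }

  // When an L1 miss returns, the MSHR entry identifies the waiting load (and
  // its destination register); conceptually the fill is broadcast like any
  // other result (e.g. on a CDB), and the load queue entry is completed.
  def onMissReturn(loadQueue: Seq[LoadEntry], mshrLoadSeq: Long, data: Long): Seq[LoadEntry] =
    loadQueue.map { ld =>
      if (ld.seqNum == mshrLoadSeq) ld.copy(executed = true, value = Some(data)) else ld
    }
}
```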

Related

Do memory instructions pass through the load-store queue and issue queue in the microarchitecture

What is the difference between the issue queue and the LSQ for memory instructions? Do memory instructions pass through both queues, or only through the LSQ?
If they pass through both queues, what is their order?
I'm assuming you're using Arm-like nomenclature here, so the issue queue is what Intel calls the RS (reservation station), and by "issue" you mean sending a uop that is ready for execution.
The answer is that memory instructions need to pass through both. All instructions need to be issued (except the ones that can be eliminated without execution, for example register moves, zero idioms, nops, etc.). Let's rephrase: all instructions that need to go through an ALU need to go through the issue process first. Memory instructions simply use that step to calculate their addresses.
This is true for loads; for stores there is usually an internal split into store-address and store-data uops, so the store-address uop behaves like a load in that sense and calculates its address during that step.
There is usually a dedicated execution port and dedicated execution units for this, because the address calculation usually follows one of a few specific addressing modes (each architecture has its own set of these). Aside from that, the execution has to follow the same rules as any other operation in the CPU: it needs to have its sources ready and read from the register file or bypassed from an in-flight operation, and it needs to be arbitrated when the execution port is free and prioritized by the same aging rules, so it makes sense that it uses the common path.
Once the memory operation has finished execution, it will be sent to the LSU (load-store unit, or the DCU, data-cache unit on Intel) and perform the actual memory access using the generated address. The LSU pipe will take care of the address translation, TLB lookups, the page walk if needed (though this is sometimes done in a dedicated unit), the address range and property checks, the cache lookup (if cacheable) and sending a miss to the next cache level or memory if needed. It may also trigger prefetches as part of the process.
For a load, when the LSU pipe has completed (which may require multiple passes and wakeups if the data was not available in the L1), the LSU will signal the issue queue again in order to wake up anything that depends on the result.
For a store, the store-address uop may fetch the line into the cache in advance as an optimization, but the actual write to memory usually happens only after retirement (since stores may not be dispatched to memory while speculative, unless you have some tricks to handle that).
It's also worth mentioning that some CPUs try to optimize loads so that they can take the data directly from prior stores instead of fetching it from the cache/memory. This can be done via store-to-load forwarding (very common) or memory renaming (less common). The former is usually handled by the LSU internally, but the latter can be done much earlier and without the LSU (though the LSU pipe is usually still activated to validate the result).
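As a rough mental model of that flow, here is an illustrative Scala sketch; the function names, the flat address arithmetic and the placeholder cache/TLB calls are assumptions, not any real pipeline.

```scala
// A toy model of the flow described above, not any specific CPU.
object MemPipelineModel {
  sealed trait MemUop { def seqNum: Long }
  final case class LoadUop(seqNum: Long, base: Long, offset: Long)      extends MemUop
  final case class StoreAddrUop(seqNum: Long, base: Long, offset: Long) extends MemUop
  final case class StoreDataUop(seqNum: Long, data: Long)               extends MemUop

  // Step 1: "issue" -- once its sources are ready the uop wins an execution
  // port and an AGU computes the (virtual) address, like any ALU operation.
  def agu(base: Long, offset: Long): Long = base + offset

  // Step 2: the LSU pipe -- translation, checks, cache lookup, miss handling.
  // Returns data for loads; a store-address access may only warm the cache.
  def lsuAccess(vaddr: Long, isLoad: Boolean): Option[Long] = {
    val paddr = translate(vaddr)                    // TLB lookup / page walk
    if (isLoad) Some(readCacheOrMemory(paddr))
    else { maybePrefetch(paddr); None }
  }

  // Step 3 (stores only): the write is sent to memory after retirement,
  // once the store is no longer speculative.
  def commitStore(vaddr: Long, data: Long): Unit =
    writeCacheOrMemory(translate(vaddr), data)

  // Placeholders standing in for real hardware structures.
  private def translate(v: Long): Long = v
  private def readCacheOrMemory(p: Long): Long = 0L
  private def writeCacheOrMemory(p: Long, d: Long): Unit = ()
  private def maybePrefetch(p: Long): Unit = ()
}
```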

Scala and Play Framework shared cache between nodes

I have a complex problem and I can't figure out the best way to solve it.
This is the scenario:
I have N servers under a single load balancer and a Database.
All the servers connect to the database
All the servers run the same identical application
I want to implement a cache in order to decrease the response time and reduce to a minimum the HTTP calls Server -> Database.
I implemented it and it works like a charm on a single server... but I need to find a mechanism to update all the other caches on the other servers when the data is no longer valid.
example:
I have server A and server B; both have their own cache.
At the first request from the outside, for example "get user information", server A replies.
Its cache is empty, so it needs to get the information from the database.
The second request goes to B; server B's cache is also empty, so it needs to get the information from the database.
The third request, again on server A: now the data is in the cache, so it replies immediately without a database request.
The fourth request, on server B, is a write request (for example, change the user name); server B can make the change in the database and update its own cache, invalidating the old user.
But server A still has the old, invalid user.
So I need a mechanism for server B to tell server A (or the N other servers) to invalidate/update the data in their caches.
What is the best way to do this in Scala with the Play Framework?
Also, consider that in the future the servers may be geo-redundant, i.e. in different geographical locations, on different networks, served by different ISPs.
It would also be great to update all the other caches when one user is loaded (one server's database request updates every server's cache), so that all the servers are ready for future requests.
I hope I have been clear.
Thanks
Since you're using Play, which already uses Akka under the hood, I suggest using Akka Cluster Sharding. With this, the instances of your Play service would form a cluster (including failure detection, etc.) at startup, and organize between themselves which instance owns a particular user's information.
So proceeding through your requests, the first request to GET /userinfo/:uid hits server A. The request handler hashes uid (e.g. with murmur3: consistent hashing is important) and resolves it to, e.g., shard 27. Since the instances started, this is the first time we've had a request involving a user in shard 27, so shard 27 is created and let's say it gets owned by server A. We send a message (e.g. GetUserInfoFor(uid)) to a new UserInfoActor which loads the required data from the DB, stores it in its state, and replies. The Play API handler receives the reply and generates a response to the HTTP request.
For the second request, it's for the same uid, but hits server B. The handler resolves it to shard 27 and its cluster sharding knows that A owns that shard, so it sends a message to the UserInfoActor on A for that uid which has the data in memory. It replies with the info and the Play API handler generates a response to the HTTP request from the reply.
In this way, all subsequent requests (e.g. the third, the same GET hitting server A) for the user info will not touch the DB, no matter which server they hit.
For the fourth request, which let's say is POST /userinfo/:uid and hits server B, the request handler again hashes the uid to shard 27 but this time, we send, e.g., an UpdateUserInfoFor(uid, newInfo) message to that UserInfoActor on server A. The actor receives the message, updates the DB, updates its in-memory user info and replies (either something simple like Done or the new info). The request handler generates a response from that reply.
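For reference, here's a minimal sketch of that flow, assuming the Akka 2.6 typed Cluster Sharding APIs; the actor and message names (UserInfoActor, GetUserInfoFor, UpdateUserInfoFor) and the DB placeholders are illustrative, not existing code.

```scala
import akka.actor.typed.{ActorRef, ActorSystem, Behavior}
import akka.actor.typed.scaladsl.Behaviors
import akka.cluster.sharding.typed.scaladsl.{ClusterSharding, Entity, EntityTypeKey}

object UserInfoActor {
  final case class UserInfo(uid: String, name: String)

  sealed trait Command
  final case class GetUserInfoFor(replyTo: ActorRef[UserInfo])                       extends Command
  final case class UpdateUserInfoFor(newInfo: UserInfo, replyTo: ActorRef[UserInfo]) extends Command

  val TypeKey: EntityTypeKey[Command] = EntityTypeKey[Command]("UserInfo")

  // One entity per uid; its actor state is the cache for that user.
  def apply(uid: String): Behavior[Command] = Behaviors.setup { _ =>
    var cached: Option[UserInfo] = None // empty until the first DB load
    Behaviors.receiveMessage {
      case GetUserInfoFor(replyTo) =>
        val info = cached.getOrElse(loadFromDb(uid)) // DB is hit only on the first request
        cached = Some(info)
        replyTo ! info
        Behaviors.same
      case UpdateUserInfoFor(newInfo, replyTo) =>
        writeToDb(newInfo)     // update the DB first...
        cached = Some(newInfo) // ...then the in-memory copy
        replyTo ! newInfo
        Behaviors.same
    }
  }

  // Placeholders for your real persistence layer.
  private def loadFromDb(uid: String): UserInfo = UserInfo(uid, "loaded-from-db")
  private def writeToDb(info: UserInfo): Unit   = ()
}

object UserInfoSharding {
  // Every Play instance runs the same init at startup; the cluster decides
  // which node owns which shard, and rebalances on failure.
  def init(system: ActorSystem[_]): Unit =
    ClusterSharding(system).init(Entity(UserInfoActor.TypeKey) { ctx =>
      UserInfoActor(ctx.entityId)
    })

  // In a request handler: no matter which server got the HTTP request, resolve
  // the entity (Akka does the shard hashing) and ask it:
  //   val ref = ClusterSharding(system).entityRefFor(UserInfoActor.TypeKey, uid)
  //   ref.ask(UserInfoActor.GetUserInfoFor(_))   // Future[UserInfo], needs an implicit Timeout
}
```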
This works really well: I've personally seen systems using cluster sharding keep terabytes in memory and operate with consistent single-digit millisecond latency for streaming analytics with interactive queries. Servers crash, and the actors running on the servers get rebalanced to surviving instances.
It's important to note that anything matching your requirements is a distributed system and you're requiring strong consistency, i.e. you're requiring that it be unavailable under a network partition (if B is unable to communicate an update to A, it has no choice but to fail the request). Once you start talking about geo-redundancy and multiple ISPs, you're going to see partitions pretty regularly. The only way to get availability under a network partition is to relax the consistency demand and accept that sometimes the GET will not incorporate the latest PUT/POST/DELETE.
This is probably not something that you want to build yourself. But there are plenty of distributed caches out there that you can use, such as Ehcache or InfiniSpan. I suggest you look into one of those two.

When and how to update a cache

I have a service A and a service B.
Service A is a REST API that stores some relevant information, which service B needs, in a database.
Service B handles a lot of traffic and is constantly consuming messages from a Kafka topic. Each message needs some information from service A, but this information rarely changes; at most it changes once a day.
So, in order to avoid hitting the REST API constantly for information that rarely changes, I'm going to implement a cache. (Not using a cache would also mean querying the DB all the time.) Service B will hit the cache first, and only hit A when it doesn't have the required data.
Here comes the question.
If service A updates its information, I would need to update the cache right away.
What is the best way of doing this?
1) I can implement something in the REST API to let B know that it needs to update its cache, but in terms of separation of concerns and encapsulation, isn't it bad that A knows that B keeps an internal cache? (I think it is.)
2) I can implement polling in B (make B check whether the info changed every X amount of time), or refresh the cache every X amount of time. But this way I risk not getting the updated information right away.
3) Maybe a cache in A for this information? At least I avoid querying the DB, but not hitting the API :/
Is there a better way of handling this?
Thanks!
This is a question of consistency guarantees and it is a core issue in distributed systems.
Your scenario contains three services: A, B and the database.
If B must never ever under any circumstances use stale data, then you have two options:
All reads hit the database (no caching at A or B). Built-in mechanisms at the database, such as the database's internal cache, disk cache and RAID mirroring, might relieve some of the disk I/O bottleneck.
Cache the data at A (or B) and enforce strong consistency between the cache and the database, which means that every write would be done inside a distributed transaction between the database and the cache (or by using some other consensus protocol that provides strong consistency guarantees).
The first option requires no effort and would work fine for a certain workload, but it would become a serious bottleneck if the data ingress at B requires more throughput than the database can sustain.
The second option is quite complex to implement, would slow down data changes, complicate the system and hurt its overall availability: if A goes down then data cannot be changed in the database; if A goes down in the middle of a transaction then the data won't even be available for reading from the database (!)
The good news is that most systems don't need such strong consistency guarantees, and they're OK with using stale data occasionally, under specific circumstances.
If this is the case for your system, then there are several ways of invalidating the cache. Personally I'd go with Jose Martinez's suggestion to use a message queueing system, combined with the Publish/Subscribe pattern: service A would publish a "data changed" message to the pub/sub (the message would include information as to exactly which data item changed), and service B would be a subscriber processing "data changed" messages and invalidating its cached entries as they arrive.
Additional points:
Caching inside B might seem like it can provide strong consistency at first, but the truth is you might need to scale B, so you'll have multiple instances of B, each with its own cache that needs to be invalidated and synchronized.
You may use a whole other service for holding the cached data (Redis, Memcached etc.), which would allow you to split the responsibilities over the cached data (A could invalidate/update it and B could read from it directly), but it won't change the essence of the consistency dilemma.
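A minimal sketch of that pub/sub invalidation flow, with the broker abstracted behind a trait (it could be RabbitMQ, Kafka, Redis pub/sub, etc.); the trait, the topic name and the class names here are illustrative assumptions, not an existing API.

```scala
import scala.collection.concurrent.TrieMap
import scala.collection.mutable

final case class DataChanged(key: String)

// Stand-in for whatever broker you pick.
trait Broker {
  def publish(topic: String, msg: DataChanged): Unit
  def subscribe(topic: String)(handler: DataChanged => Unit): Unit
}

// Service A owns the data: after every successful DB write it announces
// which item changed.
final class ServiceA(db: mutable.Map[String, String], broker: Broker) {
  def update(key: String, value: String): Unit = {
    db.put(key, value)                               // write the source of truth first
    broker.publish("data-changed", DataChanged(key)) // then tell the subscribers
  }
}

// Service B caches lookups and drops an entry as soon as it hears it changed.
final class ServiceB(fetchFromA: String => String, broker: Broker) {
  private val cache = TrieMap.empty[String, String]
  broker.subscribe("data-changed") { msg => cache.remove(msg.key); () }

  def get(key: String): String =
    cache.getOrElseUpdate(key, fetchFromA(key))      // hits A only on a cache miss
}
```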
Adding a third bullet point to #CapnSchwenk's answer...
Have A submit all changes to a message queue, like RabbitMQ. The message queue can handle persistence (in case B is down) and the publish/subscribe implementation. The queue can also carry the new data, so that B doesn't have to query A for it.
Based on this statement: "If the service A updates its information, i would need to update the cache right away", then your 2 choices in my experience would be some form of distributed cache:
Have service A provide a listener mechanism that Service B can subscribe to, to invalidate its own internal cache when data changes;
Implement a distributed caching layer such as ehcache or memcache that both Service A and B are aware of; when Service A updates it writes the new value into the cache and all subscribers are automatically updated
Hope that helps!

Push data to client, how to handle slow clients?

In a push model, where server pushes data to clients, how does one handle clients with low or variable bandwidth?
For example, I receive data from a producer and send the data to my clients (push). What if one of my clients decides to download a Linux ISO? The available bandwidth to this client becomes too little to download my data.
Now when my producer produces data and the server pushes it to the clients, all clients will have to wait until all clients have downloaded the data. This is a problem when there are one or more slow clients with little bandwidth.
I can cache the data to be sent for every client, but because the data size is big this isn't really an option (lots of clients * data size = huge memory requirements).
How is this generally solved? No need for code, just a few thoughts/ideas are already more than welcome.
Now when my producer produces data and the server pushes it to the client, all clients will have to wait until all clients have downloaded the data.
The above shouldn't be the case -- your clients should be able to download asynchronously from each other, with each client maintaining its own independent download state. That is, client A should never have to wait for client B to finish, and vice versa.
I can cache the data to be sent for every client, but because the data size is big this isn't really an option (lots of clients * data size = huge memory requirements).
As Warren said in his answer, this problem can be reduced by keeping only one copy of the data rather than one copy per client. Reference-counting (e.g. via shared_ptr, if you are using C++, or something equivalent in another language) is an easy way to make sure that the shared data is deleted only when all clients are done downloading it. You can make the sharing more fine-grained, if necessary, by breaking up the data into chunks (e.g. instead of all clients holding a reference to a single 800MB linux iso, you could break it up into 800 1MB chunks, so that you can start removing the earlier chunks from memory as soon as all clients have downloaded them, instead of having to hold the entire 800MB of data in memory until every client has downloaded the entire thing)
Of course, that sort of optimization only gets you so far -- e.g. if two clients each request a different 800MB file, then you're liable to end up with 1.6GB of RAM usage for caching, unless you come up with a more clever solution.
Here are some possible approaches you could try (from less complex to more complex). You could try any of these either separately or in combination:
Monitor how much each client's "backlog" is -- that is, keep a count of the amount of data you have cached waiting to send to that client. Also keep track of the number of bytes of cached data your server is currently holding; if that number gets too high, force-disconnect the client with the largest backlog, in order to free up memory. (this doesn't result in a good user experience for the client, of course; but if the client has a buggy or slow connection he was unlikely to have a good user experience anyway. It does keep your server from crashing or swapping itself to death because a single client has a bad connection)
Keep track of how much data your server has cached and waiting to send out. If the amount of data you have cached is too large (for some appropriate value of "too large"), temporarily stop reading from the socket(s) that are pushing the data out to you (or if you are generating your data internally, temporarily stop generating it). Once the amount of cached data gets down to an acceptable level again, you can resume receiving (or generating) more data to push.
(this may or may not be applicable to your use-case) Revise your data model so that instead of being communications-oriented, it becomes state-oriented. For example, if your goal is to update the clients' state to match the state of the data-source, and you can organize the data-source's state into a set of key/value pairs, then you can require that the data-source include a key with each piece of data it sends. Whenever a key/value pair is received from the data-source, simply place that key/value pair into a map (or hash table or some other key/value-oriented data structure) for each client (again, use shared_ptr's or similar here to keep memory usage reasonable). Whenever a given client has drained its queue of outgoing TCP data, remove the oldest item from that client's key/value map, convert it into TCP bytes to send, and add them to the outgoing-TCP-data queue. Repeat as necessary. The advantage of this is that "obsolete" values for a given key are automatically dropped inside the server and therefore never need to be sent to the slow clients; rather, the slow clients will only ever get the "latest" value for that given key. The beneficial consequence is that a given client's maximum "backlog" is limited by the number of keys in the state-model, regardless of how slow or intermittent the client's bandwidth is. Thus a slow client might see fewer updates (per second/minute/hour), but the updates it does see will still be as recent as possible given its bandwidth.
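Here's a minimal sketch of that state-oriented, key/value-coalescing idea from the last point (illustrative only; the class and method names are assumptions, and the actual socket handling is omitted):

```scala
import scala.collection.mutable

final class ClientOutbox {
  // Latest value per key, in arrival order of the *keys*. Overwriting a key
  // replaces its value, so obsolete values are dropped inside the server and
  // the backlog is bounded by the number of keys, not by the client's speed.
  private val pending = mutable.LinkedHashMap.empty[String, Array[Byte]]

  def onUpdateFromSource(key: String, value: Array[Byte]): Unit =
    pending.update(key, value)          // coalesce: only the newest value survives

  // Called whenever this client's outgoing TCP buffer has drained.
  def nextChunkToSend(): Option[(String, Array[Byte])] =
    pending.headOption.map { case (k, v) =>
      pending.remove(k)                 // take the oldest key still pending
      (k, v)
    }

  def backlogKeys: Int = pending.size   // bounded by the size of the state model
}
```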
Cache the data once only, and have each client handler keep track of where it is in the download, all using the same cache. Once all clients have all the data, the cached data can be deleted.

MSI/MESI: How can we get "read miss" in shared state?

In The Cache Memory Book by Jim Handy (the excerpt is below), the author gives a table description of the MESI protocol. The table looks very unclear to me, and unfortunately the text does not help.
The first question (in green on the picture):
Is this right? A data block is in the cache of a CPU, and it is in the shared state, but when the CPU reads it, the CPU gets a read miss.
How is this possible?
The second question (in purple):
Who creates all these messages ("read miss", etc.), and when?
(AFAIK, the system bus just passes along the messages of others.)
And finally, the third question (not on the pic):
Do all these cache coherency protocols (MSI, MESI, MOESI, Firefly, Dragon...) maintain a sequentially consistent memory model?
Are there protocols that maintain other consistency models?
For the first question, while there is a data block (in shared state) in the cache, it is the wrong block (i.e., the tag does not match), so there is a cache miss. On a cache miss, the cache still needs to write back data to memory if the block being replaced is in the modified state.
For the second question, each bus is providing address and request information (read or write) to the cache; the system bus is providing this from a remote processor. The hit or miss is whether the address provided is a hit in the cache. Requests on the system bus are filtered by the remote processor's cache.
On a read by processor 1 (P1), if P1's cache has a hit, no signal needs to be sent to the system bus. On a write by P1, if P1's cache has a hit and the state is exclusive or modified, then no signals need to be sent to the system bus; but if the state in P1's cache is shared, then the other (P0) cache must be told (via the system bus) that P1 is performing a write and that any entry in P0's cache for that address must be invalidated.
(Note that "shared" state does not necessarily mean that the block of memory is present in some other cache. Most snoop-based protocols allow silent [no system bus communication] dropping of cache blocks in shared state.)
(By the way, the MESI protocol given by that book is a bit unusual. It is more common for Exclusive state to be entered by a read miss that found no other caches with the block [this allows silent update to modified on a later write] and for writes not to use write through to memory on a hit that was in shared state.)
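To summarize who sends what on which event, here is a simplified, textbook-style MESI transition function (illustrative only; as noted above, the book's variant differs, and real protocols add more detail):

```scala
object Mesi {
  sealed trait State
  case object Modified  extends State
  case object Exclusive extends State
  case object Shared    extends State
  case object Invalid   extends State

  sealed trait Event
  case object LocalRead  extends Event // this CPU reads
  case object LocalWrite extends Event // this CPU writes
  case object BusRead    extends Event // a read by another CPU is snooped on the bus
  case object BusWrite   extends Event // a write/invalidate by another CPU is snooped

  // Next state for one cache line; writebacks / bus messages noted in comments.
  def next(state: State, ev: Event, othersHaveCopy: Boolean): State =
    (state, ev) match {
      case (Invalid, LocalRead)    => if (othersHaveCopy) Shared else Exclusive // read miss: BusRd
      case (Invalid, LocalWrite)   => Modified   // write miss: read-for-ownership (BusRdX)
      case (Shared, LocalWrite)    => Modified   // hit, but must broadcast an invalidate
      case (Shared, BusWrite)      => Invalid    // another CPU is writing this line
      case (Exclusive, LocalWrite) => Modified   // silent upgrade, no bus traffic
      case (Exclusive, BusRead)    => Shared
      case (Exclusive, BusWrite)   => Invalid
      case (Modified, BusRead)     => Shared     // supply the data / write it back
      case (Modified, BusWrite)    => Invalid    // write back, then invalidate
      case (s, _)                  => s          // remaining cases are hits/no-ops: state unchanged
    }
}
```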
For the third question, cache coherence protocols only address coherence not consistency. The consistency model that the hardware provides determines what kinds of activity can be buffered and completed out of order (in a way that is visible to software). E.g., it helps performance if a write miss does not force any following read hits to wait until the write notification has been seen by all remote caches, but such can violate sequential consistency.
Here is an example how sequential consistency can prevent buffering of writes:
P0 reads "B_old" at location B (cache hit), writes "A_new" to location A (a cache miss), then reads location B.
P1 reads "A_old" at location A (cache hit), writes "B_new" to location B (a cache miss), then reads location A.
If writes are buffered and reads are allowed to proceed without waiting for the preceding write to be recognized by the other cache, then P0's second read of B can read "B_old" (since it will still be a cache hit) and P1's second read of A can read "A_old" (since P0's write has not yet been seen). There is no way to construct a sequential ordering of these memory accesses, so sequential consistency would be violated.
However, if P0's second read of B waited until P0's write to A was recognized by P1 (and P1's second read of A waited until P1's write to B was recognized by P0), then a sequential ordering would be possible.
If P1 saw P0's write to A before its write to B, then sequential ordering is possible, including:
P0.read "B_old", P1.read "A_old", P0.write "A_new", P1.write "B_new", P0.read "B_new", P1.read "A_new"
P0.read "B_old", P1.read "A_old", P0.write "A_new", P0.read "B_old", P1.write "B_new", P1.read "A_new"
Note that hardware can speculate that ordering requirements are not violated and reorder the completion of memory accesses; however, it needs to be able to handle incorrect speculation (e.g., by restoration to a checkpoint). (A classic early paper on this is "Is SC + ILP = RC?" [PDF].)
There is a one-to-many mapping between cache lines and blocks of main memory: a cache line that is shared may not hold the block of memory your program is looking for. What actually happened is that the block of memory your program wants and the block of memory present in the cache, though different, map to the same cache line index. Consecutive blocks of memory are mapped to consecutive cache lines until the cache lines run out, and then the subsequent blocks of memory wrap around and are mapped starting from the first cache line again.
I found an intuitive explanation here: https://medium.com/breaktheloop/direct-mapping-map-cache-and-main-memory-d5e4c1cbf73e
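A minimal sketch of that index/tag split for a direct-mapped cache, with illustrative sizes (64-byte blocks, 1024 lines) chosen just for the example:

```scala
object DirectMapped {
  val blockSize = 64
  val numLines  = 1024

  // Which line an address maps to, and the tag that must match for a hit.
  def index(addr: Long): Long = (addr / blockSize) % numLines
  def tag(addr: Long): Long   = (addr / blockSize) / numLines

  def demo(): Unit = {
    val a = 0x10000L
    val b = a + blockSize.toLong * numLines // same index, different tag
    println(s"index(a)=${index(a)} tag(a)=${tag(a)}")
    println(s"index(b)=${index(b)} tag(b)=${tag(b)}") // index matches, tag differs -> read miss
  }
}
```

Two different addresses can land on the same line (same index) while having different tags; the line may even be in the shared state for the block it currently holds, yet the access is still a read miss because the tag does not match, which is the situation from the question.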