Why is a ledger hashed with the close time of its parent ledger? - xrp

This seems to be unnecessary because a ledger is hashed with its own close time and its parent's digest (which thus incorporates the parent's close time).

There are transactions that need to know the current time in order to execute correctly. For example, a transaction that trades one asset for another needs to ensure it doesn't execute offers that expired.
There are really only two places you could get that time. You could get it from the previous ledger's close time or you could get it from this ledger's close time. The option I chose is to use the previous ledger's close time.
The reason for that choice is that you have to know everything that could affect a transaction result before you can start executing that transaction. Having to know a ledger's close time before you could execute any of the transactions in it would result in significant additional computational expense.
The software runs every transaction when it is received to make sure the transaction will be able to execute and claim a fee. This is necessary to avoid relaying a transaction that won't claim a fee, which would permit free relaying and could lead to denial-of-service attacks. The more similarly the transaction runs when it executes for real, the less disk I/O and additional computation is needed. So you want to feed the transaction the same inputs when you run it for real as when you test it. That means using the parent ledger's close time rather than the actual ledger's close time, which is not known until much later.
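As a rough illustration (a simplified Python sketch, not rippled's actual code), the key point is that the expiry check is driven by a value that is already fixed when the transaction is first received, so the trial execution and the real execution reach the same answer:

from dataclasses import dataclass

@dataclass
class Offer:
    expiration: int  # illustrative: expiry expressed in ledger time units

def is_expired(offer: Offer, parent_close_time: int) -> bool:
    # The check uses the *parent* ledger's close time, which is known
    # before any transaction in the new ledger starts executing.
    return offer.expiration <= parent_close_time

def executable_offers(book, parent_close_time):
    # Both the trial run (when the transaction is received) and the real run
    # (when the ledger is built) are fed the same close time, so they agree
    # on which offers are still live.
    return [o for o in book if not is_expired(o, parent_close_time)]

book = [Offer(expiration=1000), Offer(expiration=2000)]
assert executable_offers(book, 1500) == executable_offers(book, 1500)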
So given that we need the parent's close time, why do we put it in the ledger header? There really isn't a particularly good reason. In practice, you need to have the preceding ledger's header in memory to produce the next ledger anyway. However, not putting the previous ledger's close time in the header would mean that if you only had a single ledger, you wouldn't know what the effective time of the transactions was unless you looked at the previous ledger. That would make it harder to understand what rules were applied during order executions or other operations that require a time.
In summary, the decision to use the previous ledger's close time was made for sound engineering reasons, primarily focused on performance in the critical path where transactions are executed "for real". But the decision to put the close time in the ledger header is really just a mild human convenience, making it easier to know what effective wall time applied to the transactions in a ledger without having to look outside that ledger.

Related

Event Sourcing - How to query inside a command?

We would like to be able to read state inside a command use case.
We could get the state from the event store for the specific aggregate, but what about querying aggregates by field (not id) or performing more complicated queries that are not suited to the event store?
The approach we were thinking was to use our read model for those cases as well and not only for query use cases.
This might be inconsistent, so a solution could be to have the latest version of the aggregate stored in both write/read models, in order to be able to tell if the state is correct or stale.
Does this make sense? And if so, when we need to get state by id, should we use the event store or the read model?
If you want the absolute latest state of an event-sourced aggregate, you're going to have to read the latest snapshot (assuming that you are snapshotting) and then replay events since that snapshot from the event store. You can be aggressive about snapshotting (conceivably even saving a snapshot after every command), but you're giving away some write performance to make the read faster.
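A minimal sketch of that load path, assuming a snapshot store and an event store with the (hypothetical) methods named below:

def load_aggregate(aggregate_id, snapshot_store, event_store, apply):
    # Rebuild the latest state: start from the newest snapshot, if any,
    # then replay every event recorded after that snapshot's version.
    snapshot = snapshot_store.latest(aggregate_id)            # hypothetical API
    if snapshot is not None:
        state, version = snapshot.state, snapshot.version
    else:
        state, version = None, 0
    for event in event_store.events_since(aggregate_id, version):  # hypothetical API
        state = apply(state, event)   # pure function: (state, event) -> new state
        version = event.version
    return state, version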
Updating the read model directly is conceivably possible, though that level of coupling is something that should be considered very carefully. Note also that you will very likely need some sort of two-phase commit to ensure that the read model is only updated when the write model is updated and vice versa. I strongly suggest considering why you're using CQRS/ES in this project, because you are quite possibly undermining that reason by doing this sort of thing.
In general, if you need a query for processing a particular command, it's likely that query will generally be the same, i.e. you don't need free-form query support. In that case, you can often have a read model that's tuned for exactly that query and which only cares about events which could affect that query: often a fairly small subset of the events. The finer-grained the read model, the easier it is to keep in sync (if it ignores 99% of events, for instance, it can't really fall that far behind).
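For example, here is a hypothetical projection (all names invented for illustration) that exists only to answer one command-side question and ignores every event it doesn't care about:

class OverdueInvoicesProjection:
    # Read model tuned for a single query: "which invoices for this
    # customer are overdue?". It handles two event types and drops the rest.

    def __init__(self):
        self._overdue = {}  # customer_id -> set of invoice ids

    def handle(self, event):
        kind = event.get("type")
        if kind == "InvoiceBecameOverdue":
            self._overdue.setdefault(event["customer_id"], set()).add(event["invoice_id"])
        elif kind == "InvoicePaid":
            self._overdue.get(event["customer_id"], set()).discard(event["invoice_id"])
        # every other event type is ignored, which keeps the projection
        # small and cheap to keep nearly up to date

    def overdue_for(self, customer_id):
        return self._overdue.get(customer_id, set())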
Needing to make complex queries as part of command processing could also be a sign that your aggregate boundaries aren't right and could do with a re-examination.
Does this make sense
Maybe. Let's start with
This might be inconsistent
Yup, they might be. So what?
We typically respond to a query by sending an unlocked copy of the answer. In other words, it's possible that the actual information in the write model will change after this response is dispatched but before the response arrives at its destination. The client will be looking at a copy of the answer taken from the past.
So we might reasonably ask how much better it is to get information no more than one minute old compared to information no more than five minutes old. If the difference in value is pennies, then you should probably deploy the five minute version. If the difference is millions of dollars, then you're in a good position to negotiate a real budget to solve the problem.
For processing a command in our own write model, that kind of inconsistency isn't usually acceptable or wise. But neither of the two common answers requires keeping the read and write models synchronized. The most common answer is to just work with the write model alone. The less common answer is to grab a snapshot out of a cache, and then apply any additional events to it to bring it up to date. The latter approach is "just" a performance optimization (first rule: don't).
The variation that trips everyone up is trying to process a command somewhere else, enforcing a consistency rule on our data here. Once again, you need a really clear picture of how valuable the consistency is to the business. If it's really important, that may be a signal that the information in question shouldn't be split into two different piles - you may be working with the wrong underlying data model.
Possibly useful references
Pat Helland Data on the Outside Versus Data on the Inside
Udi Dahan Race Conditions Don't Exist

SQL Server Fetch Slow until after a ClearCache

A user is experiencing very slow performance. I can see the duration of a fetch statement that aligns with the slow response time. There is no blocking or deadlocks, and no waits that are unreasonable.
After I clear the cache on the database, the user immediately sees response time improve from a minute to several seconds. The fetch statement no longer shows a high duration; after a while it settles at several seconds.
Does this recompilation change something about how the fetch is done, or is it something about what is being done within the loop inside the fetch processing?
Since I cannot see the cursor definition, I have to assume it is not a STATIC cursor; otherwise clearing the cache would have had no impact on response time, because with a STATIC cursor all of the fetched data would already be in the temp table and a recompile would change nothing.
If the cursor is any of the other types, KEYSET, DYNAMIC (the default), or FAST_FORWARD (unless it was implicitly converted to STATIC), the recompile triggered by clearing the cache could have changed the execution plan used by the fetch.
That would explain the difference in response time.

Simulating an Oracle sequence with MongoDB

Our domain model deals with sales invoices, each of which has a unique, automatically generated number. When creating an invoice, our SalesInvoiceService retrieves a number from a SalesInvoiceNumberGenerator, creates a SalesInvoice using this number and a few other objects (seller, buyer, issue date, etc.) and stores it through the SalesInvoiceRepository. Since we are using MongoDB as our database, our MongoDbSalesInvoiceNumberGenerator uses a findAndModify command with $inc: 1 on a given InvoicePolicies.nextSalesInvoiceNumber to generate this unique number, similar to what we would do with an Oracle sequence.
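(For reference, a minimal sketch of the generator described above, using pymongo as a stand-in for whatever driver the service actually uses; the database name and the counter document's _id are assumptions:)

from pymongo import MongoClient, ReturnDocument

client = MongoClient()          # connection details are assumptions
db = client["sales"]            # database name is an assumption

def next_sales_invoice_number():
    # Atomically increment and return the counter, roughly what
    # MongoDbSalesInvoiceNumberGenerator does via findAndModify.
    doc = db.InvoicePolicies.find_one_and_update(
        {"_id": "salesInvoice"},                     # counter document id is an assumption
        {"$inc": {"nextSalesInvoiceNumber": 1}},
        upsert=True,
        return_document=ReturnDocument.AFTER,
    )
    return doc["nextSalesInvoiceNumber"]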
This works in normal situations. However, when invoice creation fails because of a broken business rule (e.g. an invalid issue date), an exception is thrown and our InvoicePolicies.nextSalesInvoiceNumber has already been incremented. Obviously, since there is no transaction managing this unit of work, the increment is not rolled back, so we end up with lost invoice numbers. We do offer a manual compensation mechanism to the user, but we would like to avoid this sort of situation in the first place.
How would you deal with this situation? And no, switching to another database is not an option :)
Thanks!
TL;DR: What you want is strict serializability, but you probably won't get it unless you give up concurrency completely (then you even get linearizability, theoretically). Gap-free is easy, but making sure that today's invoice doesn't get a lower number than yesterday's is practically impossible.
This is tricky, or at least, very expensive. That is also true for any other data store, because you'll have to limit the concurrency of the application to guarantee it. Think of an auto-increasing stamp that is passed around in an office, but some office workers lose letters. Tricky... But you can reduce the likelihood.
Generating sequences without gaps is hard when contention is high, and very hard in a distributed system. Keeping a lock for the entire time the invoice is generated is usually not an option, though that would be easy. So let's try that:
Easiest way out: Use a singleton background worker, i.e. a single-threaded process that runs on a single machine. Have it explicitly check whether the current number is really present in the invoice collection. Because it's single-threaded on a single machine, it can't have race conditions. Done, via limiting concurrency.
When allowing concurrency, things get messy:
It might be best to use something like a two-phase commit protocol. Essentially, make the entire invoice creation process a long-running transaction, and store the pending transactions explicitly, i.e. store all numbers that haven't been used yet, but reserved.
Then track the completion status of each and every transaction. If a transaction hasn't finished after some timeout, consider that number available again. It's hard enough to add that to the counter code, but it's possible (check if a timed out transaction is present, otherwise get a new counter value).
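A rough pymongo sketch of that reservation scheme (the collection names, the timeout value, and the counter document are all assumptions, and it glosses over some of the races discussed below):

import datetime
from pymongo import MongoClient, ReturnDocument

client = MongoClient()
db = client["sales"]
RESERVATION_TIMEOUT = datetime.timedelta(minutes=10)   # hard-coded assumption

def reserve_invoice_number():
    # Reuse a timed-out reservation if one exists, otherwise take a new number.
    now = datetime.datetime.utcnow()
    stale = db.invoice_reservations.find_one_and_update(
        {"status": "pending", "reserved_at": {"$lt": now - RESERVATION_TIMEOUT}},
        {"$set": {"reserved_at": now}},
        return_document=ReturnDocument.AFTER,
    )
    if stale is not None:
        return stale["number"]
    counter = db.InvoicePolicies.find_one_and_update(
        {"_id": "salesInvoice"},
        {"$inc": {"nextSalesInvoiceNumber": 1}},
        upsert=True,
        return_document=ReturnDocument.AFTER,
    )
    number = counter["nextSalesInvoiceNumber"]
    db.invoice_reservations.insert_one(
        {"number": number, "status": "pending", "reserved_at": now}
    )
    return number

def mark_invoice_committed(number):
    # Call this only after the invoice has actually been stored.
    db.invoice_reservations.update_one(
        {"number": number}, {"$set": {"status": "committed"}}
    )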
There are several possible failure cases, but they can all be resolved; this is better explained in write-ups of two-phase commit elsewhere on the net. Generally, though, getting the implementation right is hard.
The timeout poses a problem, however, because you need to hard-code an assumption about the time it takes for invoices to be generated. That can be awkward close to day/month/year barriers, since you'll want to avoid creating invoice 12345 in 2015 and 12344 in 2014.
Even this won't guarantee gap-free numbers over limited time intervals: if no further request arrives that could reuse a gap number in the current year, you're facing a problem.
I wonder if something like findAndModify combined with the newer Transactions API could be used to achieve this, and also account for gaps, if it were run within a transaction. I haven't personally tried it, and my project isn't far enough along yet to worry about the billing system, but I would love to be able to use the same database for everything to make things a bit easier to operate.
One problem, I would think, is a potential write bottleneck, but the increment should only take a few milliseconds, and you could probably use a different counter for every jurisdiction or store, the way real-life stores do. The cash register number could be part of it too; in the digital world the "cash register" could be the transaction-processing server the request went to (if, say, you used microservices), so you could round-robin load balance between them. That assumes it uses a per-document lock, which, from my understanding, it possibly does.
The main time I'd worry about this bottleneck is if you had a very popular store, a huge spike around Black Friday, or recurring invoices.
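For anyone who wants to experiment with the transaction idea, a rough pymongo sketch might look like the following. It needs MongoDB 4.0+ on a replica set; the validate step and the collection names are assumptions, and contention on the counter document is exactly the write bottleneck discussed above:

from pymongo import MongoClient, ReturnDocument

client = MongoClient()          # connection details are assumptions
db = client["sales"]

def create_invoice(invoice_doc):
    # Increment the counter and insert the invoice in one transaction,
    # so a failed business rule aborts the increment as well.
    with client.start_session() as session:
        with session.start_transaction():
            counter = db.InvoicePolicies.find_one_and_update(
                {"_id": "salesInvoice"},
                {"$inc": {"nextSalesInvoiceNumber": 1}},
                upsert=True,
                return_document=ReturnDocument.AFTER,
                session=session,
            )
            invoice_doc["number"] = counter["nextSalesInvoiceNumber"]
            validate(invoice_doc)   # hypothetical check; raising here aborts the transaction
            db.salesInvoices.insert_one(invoice_doc, session=session)
    return invoice_doc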

Controlling duration of PostgreSQL lock waits

I have a table called deposits
When a deposit is made, the table is locked, so the query looks something like:
SELECT * FROM deposits WHERE id=123 FOR UPDATE
I assume FOR UPDATE is locking the table so that we can manipulate it without another thread stomping on the data.
The problem occurs though, when other deposits are trying to get the lock for the table. What happens is, somewhere in between locking the table and calling psql_commit() something is failing and keeping the lock for a stupidly long amount of time. There are a couple of things I need help addressing:
Subsequent queries trying to get the lock should fail, I have tried achieving this with NOWAIT but would prefer a timeout method (because it may be ok to wait, just not wait for a 'stupid amount of time')
Ideally I would head this off at the pass, and have my initial query only hold the lock for a certain amount of time, is this possible with postgresql?
Is there some other magic function I can tack onto the query (similar to NOWAIT) which will only wait for the lock for 4 seconds before failing?
Due to the painfully monolithic spaghetti code nature of the code base, its not simply a matter of changing global configs, it kinda needs to be a per-query based solution
Thanks for your help guys, I will keep poking around but I haven't had much luck. Is this a non-existing function of psql, because I found this: http://www.postgresql.org/message-id/40286F1F.8050703#optusnet.com.au
I assume FOR UPDATE is locking the table so that we can manipulate it without another thread stomping on the data.
Nope. FOR UPDATE locks only those rows, so that another transaction that attempts to lock them (with FOR SHARE, FOR UPDATE, UPDATE or DELETE) blocks until your transaction commits or rolls back.
If you want a whole table lock that blocks inserts/updates/deletes you probably want LOCK TABLE ... IN EXCLUSIVE MODE.
Subsequent queries trying to get the lock should fail, I have tried achieving this with NOWAIT but would prefer a timeout method (because it may be ok to wait, just not wait for a 'stupid amount of time')
See the lock_timeout setting. This was added in 9.3 and is not available in older versions.
Crude approximations for older versions can be achieved with statement_timeout, but that can lead to statements being cancelled unnecessarily. If statement_timeout is 1s and a statement waits 950ms on a lock, it might then get the lock and proceed, only to be immediately cancelled by a timeout. Not what you want.
There's no query-level way to set lock_timeout, but you can and should just:
SET LOCAL lock_timeout = '1s';
after you BEGIN a transaction.
Ideally I would head this off at the pass, and have my initial query only hold the lock for a certain amount of time, is this possible with postgresql?
There is a statement timeout, but locks are held at transaction level. There's no transaction timeout feature.
If you're running single-statement transactions you can just set a statement_timeout before running the statement to limit how long it can run for. This isn't quite the same thing as limiting how long it can hold a lock, though, because it might wait 900ms of an allowed 1s for the lock, only actually hold the lock for 100ms, then get cancelled by the timeout.
Is there some other magic function I can tack onto the query (similar to NOWAIT) which will only wait for the lock for 4 seconds before failing?
No. You must:
BEGIN;
SET LOCAL lock_timeout = '4s';
SELECT ....;
COMMIT;
Due to the painfully monolithic spaghetti code nature of the code base, its not simply a matter of changing global configs, it kinda needs to be a per-query based solution
SET LOCAL is suitable, and preferred, for this.
There's no way to do it in the text of the query, it must be a separate statement.
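From application code, a minimal sketch of what that looks like per transaction (using psycopg2 as an example driver; the connection details are assumptions):

import psycopg2

conn = psycopg2.connect("dbname=app")    # connection string is an assumption

with conn:                               # commits on success, rolls back on exception
    with conn.cursor() as cur:
        # SET LOCAL only lasts until the end of this transaction, so other
        # code paths keep whatever global lock_timeout is configured.
        cur.execute("SET LOCAL lock_timeout = '4s'")
        cur.execute("SELECT * FROM deposits WHERE id = %s FOR UPDATE", (123,))
        row = cur.fetchone()
        # ... manipulate the deposit; waiting more than 4s for the lock raises
        # psycopg2.errors.LockNotAvailable (SQLSTATE 55P03)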
The mailing list post you linked to is a proposal for an imaginary syntax that was never implemented (at least in a public PostgreSQL release) and does not exist.
In a situation like this you may want to consider "optimistic concurrency control", often called "optimistic locking". It gives you greater control over locking behaviour at the cost of increased rates of query repetition and the need for more application logic.
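A minimal sketch of the optimistic variant, assuming the deposits table has balance and version columns (the version column is not in the original schema):

import psycopg2

def update_deposit(conn, deposit_id, new_balance):
    # No FOR UPDATE lock is held while the application works; the UPDATE
    # only succeeds if nobody changed the row in the meantime.
    with conn:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT balance, version FROM deposits WHERE id = %s",
                (deposit_id,),
            )
            balance, version = cur.fetchone()
            # ... compute changes in application code, no lock held ...
            cur.execute(
                "UPDATE deposits SET balance = %s, version = version + 1 "
                "WHERE id = %s AND version = %s",
                (new_balance, deposit_id, version),
            )
            if cur.rowcount == 0:
                # someone else updated the row first: retry or report a conflict
                raise RuntimeError("concurrent update detected, retry the operation")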

One big call vs. multiple smaller TSQL calls

I have an ADO.NET/TSQL performance question. We have two options in our application:
1) One big database call with multiple result sets, then in code step through each result set and populate my objects. This results in one round trip to the database.
2) Multiple small database calls.
There is much more code reuse with Option 2 which is an advantage of that option. But I would like to get some input on what the performance cost is. Are two small round trips twice as slow as one big round trip to the database, or is it just a small, say 10% performance loss? We are using C# 3.5 and Sql Server 2008 with stored procedures and ADO.NET.
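To make option 1 concrete: the shape is one stored procedure returning several result sets that the client walks through in a single round trip (in ADO.NET this is SqlDataReader.NextResult). The same idea as a Python/pyodbc sketch, with a hypothetical procedure name:

import pyodbc

conn = pyodbc.connect("DSN=sales")                    # connection details are assumptions
cur = conn.cursor()
cur.execute("{CALL dbo.GetOrderScreenData(?)}", 42)   # hypothetical proc and parameter

orders = cur.fetchall()          # first result set
if cur.nextset():
    customers = cur.fetchall()   # second result set
if cur.nextset():
    products = cur.fetchall()    # third result set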
I would think it in part would depend on when you need the data. For instance if you return ten datasets in one large process, and see all ten on the screen at once, then go for it. But if you return ten datasets and the user may only click through the pages to see three of them then sending the others was a waste of server and network resources. If you return ten datasets but the user really needs to see sets seven and eight only after making changes to sets 5 and 6, then the user would see the wrong info if you returned it too soon.
If you use separate stored procs for each data set called in one master stored proc, there is no reason at all why you can't reuse the code elsewhere, so code reuse is not really an issue in my mind.
It sounds a wee bit obvious, but only send what you need in one call.
For example, we have a "getStuff" stored proc for presentation. The "updateStuff" proc calls "getStuff" proc and the client wrapper method for "updateStuff" expects type "Thing". So one round trip.
Chatty servers are one thing you prevent up front with minimal effort. Then, you can tune the DB or client code as needed... but it's hard to factor out the roundtrips later no matter how fast your code runs. In the extreme, what if your web server is in a different country to your DB server...?
Edit: it's interesting to note the SQL guys (HLGEM, astander, me) saying "one trip" and the client guys saying "multiple, code reuse"...
I am struggling with this problem myself. And I don't have an answer yet, but I do have some thoughts.
Having reviewed the answers given by others to this point, there is still a third option.
In my application, around ten or twelve calls are made to the server to get the data I need. Some of the data fields are varchar(max) and varbinary(max) fields (pictures, large documents, videos and sound files). All of my calls are synchronous - i.e., while the data is being requested, the user (and the client-side program) has no choice but to wait. He may only want to read or view the data, which only makes sense when it is ALL there, not just partially there. The process, I believe, is slower this way, and I am developing an alternative approach based on asynchronous calls to the server from a DLL library which raises events to the client to announce progress. The client is programmed to handle the DLL events and set a variable on the client side indicating which calls have completed. The client program can then do what it must to prepare the data received in call #1 while the DLL proceeds asynchronously to get the data of call #2. When the client is ready to process the data of call #2, it must check the status and wait if necessary (I am hoping this will be a short wait or no wait at all). In this manner, both server- and client-side software get the job done more efficiently.
If you're that concerned with performance, try a test of both and see which performs better.
Personally, I prefer the second method. It makes life easier for the developers, makes code more re-usable, and modularizes things so changes down the road are easier.
I personally like option two for the reason you stated: code reuse
But consider this: for small requests the latency might be longer than what you do with the request. You have to find that right balance.
As the ADO.Net developer, your job is to make the code as correct, clear, and maintainable as possible. This means that you must separate your concerns.
It's the job of the SQL Server connection technology to make it fast.
If you implement a correct, clear, maintainable application that solves the business problems, and it turns out that database access is the major bottleneck that prevents the system from operating within acceptable limits, then, and only then, should you start pursuing ways to fix the problem. This may or may not include consolidating database queries.
Don't optimize for performance until a need arises to do so. This means you should analyze your anticipated usage patterns, determine the typical frequency of use for this process, and determine what user-interface latency will result from the present design. If the user receives feedback from the app in less than a few (2-3) seconds, and this process does not put an inordinate load on server capacity, then don't worry about it. If, on the other hand, the user is waiting an unacceptable amount of time for a response (subjective but definitely measurable), or the server is being overloaded, then it's time to begin optimization. Which optimization techniques will make the most sense, or be the most cost-effective, depends on what your analysis of the issue tells you.
So, in the meantime, focus on maintainability. That means, in your case, code reuse
Personally I would go with 1 larger round trip.
This will definitely be influenced by the exact reusability of the calling code, and how it might be refactored.
But as mentioned, this will depend on your exact situation, where maintainability vs performance could be a factor.