When to truncate strings longer than the storage location allows?

Let's say I have a function that inserts records into a database table with string fields of limited length. In general, where should I truncate strings that are too long for the storage location: in the insert function itself, or at every point in the code where it's called?
(I'm assuming here that truncation of strings that are too long is more desirable than having an exception thrown.)

I think it depends on where the function is and how accessible it is.
If it's a private function that's just part of your own SQL library, you can probably get away with truncating inside the function.
If it's in a library that, say, your whole team at work uses, then you should probably at least validate the string before attempting to insert it.
If it's a public API, then you shouldn't be silently truncating anything: throw a meaningful exception instead.
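To make the public-API case concrete, here is a minimal sketch (the class, method and field names are invented for illustration, not taken from any particular library): a validator that refuses over-long input with a message that tells the caller exactly what to fix, instead of silently truncating.

public final class StringLimits {

    private StringLimits() {
    }

    // Returns the value unchanged if it fits; otherwise throws an exception
    // that names the field and the limit, so the caller can correct the input
    // instead of silently losing data.
    public static String requireFits(String fieldName, String value, int maxLength) {
        if (value != null && value.length() > maxLength) {
            throw new IllegalArgumentException(
                    fieldName + " is " + value.length()
                    + " characters long but the column only allows " + maxLength);
        }
        return value;
    }
}

Private code, by contrast, could get away with a plain value.substring(0, maxLength) where silent truncation is acceptable.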

This should sit in the insert function: it's specific to the database implementation rather than to the calling application. If you later change your data structure, you don't want to have to go back through all the client code to make sure the full string is used.
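As a rough sketch of keeping the truncation inside the insert function itself (the table, column and width below are assumptions made up for illustration), callers pass whatever they have and the data layer enforces the column limit:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class CustomerRepository {

    // Assumed column width; in a real schema this would match e.g. name VARCHAR(50).
    private static final int NAME_MAX_LENGTH = 50;

    private final Connection connection;

    public CustomerRepository(Connection connection) {
        this.connection = connection;
    }

    public void insert(String name) throws SQLException {
        // The truncation lives here, next to the schema it depends on,
        // so client code never needs to know the column width.
        String fitted = name.length() > NAME_MAX_LENGTH
                ? name.substring(0, NAME_MAX_LENGTH)
                : name;
        try (PreparedStatement ps = connection.prepareStatement(
                "INSERT INTO customer (name) VALUES (?)")) {
            ps.setString(1, fitted);
            ps.executeUpdate();
        }
    }
}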

As per Widor, but may I also add:
Your application should ideally be structured so that there is a distinct data layer that separates the rest of your code from the database and its implementation logic.
In high traffic systems you will ideally want to limit the amount of data passing back and forth between the database and your code, hence data validation should be performed by your data layer BEFORE passing it on to your database. It is here that you can raise a meaningful exception for your business logic to handle.
The object data presented by the data layer need bear no relation to what is actually stored in or by the database. For instance it may present a data object class that is actually a composite of data stored in several tables.
The data layer itself can be structured in such a way that it can handle different database implementations.
I have used a factory pattern in the past that has allowed me to switch between SQL, MySQL databases, XML file storage and compiled test data as required at runtime without the need for recompilation.
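As a hedged sketch of that factory idea (all of the type names here are invented for illustration; the original code is not shown in the answer), the rest of the application only ever sees the interface, and the factory picks the implementation at runtime from configuration:

// The rest of the application codes against this interface only.
interface WidgetStore {
    void save(String id, String payload);
    String load(String id);
}

class SqlWidgetStore implements WidgetStore {          // JDBC-backed implementation
    public void save(String id, String payload) { /* write to SQL */ }
    public String load(String id) { return null; /* read from SQL */ }
}

class XmlFileWidgetStore implements WidgetStore {      // XML file storage
    public void save(String id, String payload) { /* write to a file */ }
    public String load(String id) { return null; /* read from a file */ }
}

class InMemoryTestWidgetStore implements WidgetStore { // compiled test data
    public void save(String id, String payload) { /* keep in a map */ }
    public String load(String id) { return null; /* read from the map */ }
}

final class WidgetStoreFactory {
    // The kind is read at runtime (e.g. from a config file), so switching
    // implementations needs no recompilation.
    static WidgetStore create(String kind) {
        switch (kind) {
            case "sql":  return new SqlWidgetStore();
            case "xml":  return new XmlFileWidgetStore();
            case "test": return new InMemoryTestWidgetStore();
            default:     throw new IllegalArgumentException("Unknown store: " + kind);
        }
    }
}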
edit
Your application data layer is the interface between your application code e.g. business logic and GUI, and your database.
The business logic will trigger the data layer to update the database with your string.
In your example the data layer contains your update function.
You can validate the string, truncate it if too long, and then update the database (through stored procedure call or direct write for instance) within that function if you wish.
In reality you'll have many strings that will have to be restricted to the same length, so it is advisable to have the validation performed by a separate function to avoid duplicating code.
Also you may wish to validate/truncate the string and notify the user/calling code of this without writing the data to the database.
Essentially, though, this is performed by your application data layer code, which may be encapsulated within a class library/dll for instance, and is not left to the database to handle nor to the business logic (other than to react to any error event/response fed back).
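A small sketch of that shared validation function (again, the names are invented): it truncates when necessary and tells the caller whether it did, so the data layer can notify the user or calling code without the length rule being duplicated at every call site.

// Result of fitting a string to a column: the value to store and whether it was cut.
final class FittedString {
    final String value;
    final boolean truncated;

    FittedString(String value, boolean truncated) {
        this.value = value;
        this.truncated = truncated;
    }
}

final class StringFitter {

    private StringFitter() {
    }

    // The single place where the length rule lives, reused by every field
    // that shares the same limit.
    static FittedString fit(String input, int maxLength) {
        if (input == null || input.length() <= maxLength) {
            return new FittedString(input, false);
        }
        return new FittedString(input.substring(0, maxLength), true);
    }
}

The data layer calls StringFitter.fit(...) before writing, and can feed an error event or flag back to the business logic whenever truncated is true.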


How many databases or tables are needed for Event Sourcing?

I'm trying to use Event Sourcing, DDD and CQRS.
I can't work out whether I need to create two databases (or tables) (1: JSON, 2: a normalized database) or just one database (JSON only).
And if I do need to create two databases (or tables), do I have to save the data to both (JSON and normalized) atomically, in one transaction, or not?
Best regards
DDD
We're making an assumption here that you fully understand DDD and its implications. Specifically, as it relates to Event Sourcing, it's a matter of defining Aggregate boundaries and the events that become their state.
CQRS
Again we're making an assumption that you fully understand the implications. CQRS merely allows you to write code in vertical slices (i.e. from UI to database) for handling "commands" separately from code that handles "queries". That's all. While it's true that you can then take this further, by storing data in a "read model" that might even be in a different database, let alone table, it's not a requirement of implementing CQRS.
As CQRS pertains to Event Sourcing - it's a good fit because the data model you tend to end up with in Event Sourcing is not conducive to complex queries. It's typically limited to "get the Aggregate by its ID". Therefore having "projections" to store the data in other ways that are more appropriate for querying and loading into UIs is the typical approach.
Event Sourcing
If you implement a Domain Model in such a way that every command handled by an Aggregate (i.e. every use-case/task carried out by a user) generates one or more events, then Event Sourcing is the principle whereby you store that list of events in an append-only style against the Aggregate's ID, rather than storing a snapshot of the Aggregate after the command was successfully handled.
To load an aggregate from the event store, you load all its previous events and replay them in memory on the Aggregate object, again rather than loading a single row/document in as a snapshot/memento.
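A minimal sketch of that load-and-replay step (the aggregate and event types are invented purely for illustration): the aggregate rebuilds its state by applying its stored events in order, oldest first, instead of reading a snapshot.

import java.util.ArrayList;
import java.util.List;

// Hypothetical event types for illustration only.
interface OrderEvent { }
final class OrderCreated implements OrderEvent { }
final class ItemAdded implements OrderEvent {
    final String sku;
    ItemAdded(String sku) { this.sku = sku; }
}

final class Order {
    private final List<String> skus = new ArrayList<>();

    // State changes only ever happen by applying events.
    void apply(OrderEvent event) {
        if (event instanceof ItemAdded) {
            skus.add(((ItemAdded) event).sku);
        }
        // OrderCreated carries no extra state in this toy example.
    }

    // Rehydration: replay the whole history in memory.
    static Order rehydrate(List<OrderEvent> history) {
        Order order = new Order();
        for (OrderEvent event : history) {
            order.apply(event);
        }
        return order;
    }
}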
A document database is therefore an excellent choice for event stores, because a single document represents the event stream for a given Aggregate. However, if you want to store your event streams in SQL, that's fine too, but you might store them in two tables:
create table Aggregate (Id int not null...);
create table AggregateEvent(AggregateId int not null FK..., Version int not null, eventBody nvarchar(max));
The actual event body would typically be the event itself, serialised to a text format like JSON.
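A rough sketch of the append step against those two tables (plain JDBC; the JSON serialisation itself is left to whatever library you prefer, and the unique-constraint remark is an assumption rather than part of the schema above):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

final class SqlEventStore {

    private final Connection connection;

    SqlEventStore(Connection connection) {
        this.connection = connection;
    }

    // Appends one serialised event to the stream. The Version column gives the
    // append-only ordering per aggregate; adding a unique constraint on
    // (AggregateId, Version) would also double as an optimistic concurrency check.
    void append(int aggregateId, int version, String eventJson) throws SQLException {
        String sql = "INSERT INTO AggregateEvent (AggregateId, Version, eventBody) VALUES (?, ?, ?)";
        try (PreparedStatement ps = connection.prepareStatement(sql)) {
            ps.setInt(1, aggregateId);
            ps.setInt(2, version);
            ps.setString(3, eventJson);
            ps.executeUpdate();
        }
    }
}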
Projections and Read Stores
If you take the events generated by the handling of commands by aggregates, and write code that consumes them by writing to a separate data store (SQL, pre-calculated ViewModels, etc), then you can call that a "projection". It's "projecting" the data that's in one shape into another shape fit for a different purpose. The result is a "read store", which you can then query however you need to.
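A bare-bones sketch of a projection (names invented): a consumer handles each event and keeps a flat, query-friendly record up to date in a separate read store, so the UI never has to replay events.

final class OrderSummaryProjection {

    // Assumed abstraction over whatever the read store is (SQL table, document, cache...).
    interface ReadStore {
        void upsertItemCount(int orderId, int itemCount);
    }

    private final ReadStore readStore;

    OrderSummaryProjection(ReadStore readStore) {
        this.readStore = readStore;
    }

    // Called for each "item added" event pulled from the event store; the data
    // is pre-shaped for querying rather than stored as raw events.
    void onItemAdded(int orderId, int newItemCount) {
        readStore.upsertItemCount(orderId, newItemCount);
    }
}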
I can't work out whether I need to create two databases (or tables) (1: JSON, 2: a normalized database) or just one database (JSON only).
It's possible to get by with just an event store and nothing else.
"get by" isn't necessarily pleasant, however. Event stores are, as a rule, really good at "append new information", but not particularly good at "query". Thus, the usual answer is to deploy processes that copy information from your event store to something that has nicer query support.
Do I have to save the data to both databases (JSON and normalized) atomically, in one transaction, or not?
It's a common pattern to update the event storage only, and then later invoke the process to copy the information from the event storage to your query support. Of course, that also means that your queries may end up showing old/out of date information (here is the answer to your question as-of five minutes ago).
If you store your query friendly data model with the event storage (tables in the same relational database, for instance), then you can arrange for at least some of your updates to the query friendly model to be synchronized with the events.
In other words, you get trade offs, not a single cookie cutter pattern that is used everywhere.

How to get the column names and data types returned by a custom query in Postgres?

How do I get the column names and data types returned by a custom query in Postgres? There are built-in functions for tables/views, but not for custom queries. To clarify: I need a Postgres function that takes a SQL string as a parameter and returns the column names and their data types.
I don't think there's any built-in SQL function which does this for you.
If you want to do this purely at the SQL level, the simplest and cheapest way is probably to CREATE TEMP VIEW <name> AS <your_query>, dig the column definitions out of the catalog tables, and drop the view when you're done. However, this can have non-trivial overhead depending on how often you do it (as it needs to write view definitions to the catalogs), it can't be run in a read-only transaction, and it can't be done on a standby server.
The ideal solution, if it fits your use case, is to build a prepared query on the client side and make use of the metadata returned by the server (in the form of a RowDescription message passed as part of the query protocol). Unfortunately, this depends very much on which client library you're using and how much of this information it chooses to expose. For example, libpq will give you access to everything, whereas the JDBC driver limits you to the public methods on its ResultSetMetaData object (though you could probably pull more information from its private fields via reflection, if you're determined enough).
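For example, with the PostgreSQL JDBC driver you can describe a query without executing it by preparing it and asking for its metadata (the connection details and the query below are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSetMetaData;

public class DescribeQuery {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/mydb", "user", "password");
             PreparedStatement ps = conn.prepareStatement(
                "SELECT id, created_at, amount * 1.2 AS gross FROM invoice WHERE id = ?")) {

            // getMetaData() on a prepared statement describes the result shape
            // without running the query.
            ResultSetMetaData meta = ps.getMetaData();
            for (int i = 1; i <= meta.getColumnCount(); i++) {
                System.out.println(meta.getColumnLabel(i) + " : " + meta.getColumnTypeName(i));
            }
        }
    }
}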
If you want a read-only, low-overhead, client-independent solution, then you could also write a server-side C function to prepare and describe the query via SPI. Writing and building C functions comes with a bit of a learning curve, but you can find numerous examples on PGXN, or within Postgres' own contrib modules.

How to handle application death and other mid-operation faults with MongoDB

Since Mongo doesn't have transactions that can be used to ensure that nothing is committed to the database unless it's consistent (non-corrupt) data, if my application dies between making a write to one document and making a related write to another document, what techniques can I use to remove the corrupt data and/or recover in some way?
The greater idea behind NoSQL was to use a carefully modeled data structure for a specific problem, instead of hitting every problem with a hammer. That is also true for transactions, which should be referred to as 'short-lived transactions', because the typical RDBMS transaction hardly helps with 'real', long-lived transactions.
The kind of transaction supported by RDBMSs is often required only because the limited data model forces you to store the data across several tables, instead of using embedded arrays (think of the typical invoice / invoice items examples).
In MongoDB, try to use write-heavy, de-normalized data structures and keep data in a single document which improves read speed, data locality and ensures consistency. Such a data model is also easier to scale, because a single read only hits a single server, instead of having to collect data from multiple sources.
However, there are cases where the data must be read in a variety of contexts and de-normalization becomes unfeasible. In that case, you might want to take a look at Two-Phase Commits or choose a completely different concurrency approach, such as MVCC (in a sentence, that's what the likes of svn, git, etc. do). The latter, however, is hardly a drop-in replacement for RDBMSs, but exposes a completely different kind of concurrency to a higher level of the application, if not the user.
Thinking about this myself, I want to identify some categories of effects:
1. Your operation has only one database save (saving data into one document)
2. Your operation has two database saves (updates, inserts, or deletions), A and B:
2.1. They are independent
2.2. B is required for A to be valid
2.3. They are interdependent (A is required for B to be valid, and B is required for A to be valid)
3. Your operation has more than two database saves
I think this is a full list of the general possibilities. In case 1, you have no problem - one database save is atomic. In case 2.1, same thing: if they're independent, they might as well be two separate operations.
For case 2.2, if you do A first then B, at worst you will have some extra data (B data) that will take up space in your system, but otherwise be harmless. In case 2.3, you'll likely have some corrupt data in the event of a catastrophic failure. And case 3 is just a composition of case 2s.
Some examples for the different cases:
1.0. You change a car document's color to 'blue'
2.1. You change the car document's color to 'red' and the driver's hair color to 'red'
2.2. You create a new engine document and add its ID to the car document
2.3.a. You change your car's 'gasType' to 'diesel', which requires changing your engine to a 'diesel' type engine.
2.3.b. Another example of 2.3: You hitch car document A to another car document B, A getting the "towedBy" property set to B's ID, and B getting the "towing" property set to A's ID
3.0. I'll leave examples of this to your imagination
In many cases, it's possible to turn a 2.3 scenario into a 2.2 scenario. In the 2.3.a example, the car document and engine are separate documents. Let's ignore the possibility of putting the engine inside the car document for this example. It's invalid both to have a diesel engine with non-diesel gas and to have a non-diesel engine with diesel gas, so they both have to change. But it may be valid to have no engine at all and have diesel gas. So you could add a step that makes the whole thing valid at all points: first remove the engine, then replace the gas, then change the type of the engine, and lastly add the engine back onto the car.
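As a sketch of that reordering with the MongoDB Java driver (the collection and field names are the ones invented in the example above), each individual write leaves the data in a state that is valid on its own:

import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Updates;
import org.bson.Document;
import org.bson.types.ObjectId;

final class EngineSwap {

    static void switchToDiesel(MongoCollection<Document> cars,
                               MongoCollection<Document> engines,
                               ObjectId carId,
                               ObjectId engineId) {
        // 1. Detach the engine: a car with no engine is a valid state.
        cars.updateOne(Filters.eq("_id", carId), Updates.unset("engineId"));

        // 2. Change the gas type: valid because the car currently has no engine.
        cars.updateOne(Filters.eq("_id", carId), Updates.set("gasType", "diesel"));

        // 3. Change the engine's type while it is detached.
        engines.updateOne(Filters.eq("_id", engineId), Updates.set("type", "diesel"));

        // 4. Re-attach the engine: the two documents now agree again.
        cars.updateOne(Filters.eq("_id", carId), Updates.set("engineId", engineId));
    }
}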
If you can end up with corrupt data from a 2.3 scenario, you'll want a way to detect the corruption. In example 2.3.b, things might break if one document has the "towing" property but the other document doesn't have a corresponding "towedBy" property. So this might be something to check after a catastrophic failure: find all documents that have "towing" set where the document with the ID in that property doesn't have its "towedBy" set to the right ID. The choices then would be to delete the "towing" property or to set the appropriate "towedBy" property. Both seem equally valid, but it might depend on your application.
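A possible repair scan for example 2.3.b with the MongoDB Java driver (field names as in the example; clearing "towing" rather than setting "towedBy" is just one of the two equally valid options mentioned):

import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Updates;
import org.bson.Document;
import org.bson.types.ObjectId;

final class TowRepair {

    // Finds cars that claim to be towing something the other car doesn't acknowledge.
    static void repair(MongoCollection<Document> cars) {
        for (Document tower : cars.find(Filters.exists("towing"))) {
            ObjectId towedId = tower.getObjectId("towing");
            Document towed = cars.find(Filters.eq("_id", towedId)).first();

            boolean consistent = towed != null
                    && tower.getObjectId("_id").equals(towed.getObjectId("towedBy"));
            if (!consistent) {
                // Drop the dangling reference; setting "towedBy" on the other
                // document would be the alternative fix.
                cars.updateOne(Filters.eq("_id", tower.getObjectId("_id")),
                               Updates.unset("towing"));
            }
        }
    }
}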
In some situations, you might be able to find corrupt data like this, but you won't know what the data was before those things were set. In those cases, setting a default is probably better than nothing. Some types of corruption are better than others (particularly the kind that will cause errors in your application rather than simply incorrect display data).
If the above kind of code analysis or corruption repair becomes unfeasible, or if you want to avoid any data corruption at all, your last resort would be to take mnemosyn's suggestion and implement Two-Phase Commits, MVCC, or something similar that allows you to identify and roll back changes in an indeterminate state.

How can I use ormlite to escape my insert?

I have ormlite integrated into an application I'm working on. Right now I'm trying to build in functionality to easily switch from automatically inserting data to the database to outputting the equivalent collection of insert statements to a file for later use. The data isn't user input but still requires proper escaping to handle basic gotchas like apostrophes.
Ideas I've burned through:
Dao.create() writes to the database directly, so that's a no-go.
QueryBuilder can't handle inserts.
JdbcDatabaseConnection.compileStatement() might work but the amount of setup required is inappropriate.
java.sql.PreparedStatement has a reasonable enough interface (if toString() returns the SQL like I would hope), but it's not compatible with ormlite's connection types.
This should be very easy and if it is, I can't find the right combination of method calls to make it happen.
Right now I'm trying to build in functionality to easily switch from automatically inserting data to the database to outputting the equivalent collection of insert statements to a file for later use.
Interesting. So one hack would be to use the MappedCreate class. The MappedCreate.build(...) method takes a DatabaseType and a TableInfo, which is available from dao.getTableInfo().
The mappedCreate.toString() exposes the generated INSERT statement (with a prefix), which might help, but you would still need to convert the ? arguments into the actual values with escaped quotes. That you would have to do in your own code.
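For the quote-escaping part, a minimal sketch (this is plain Java, not part of ormlite) would be to double single quotes when substituting each ? with its literal value:

final class SqlLiterals {

    private SqlLiterals() {
    }

    // Renders a Java string as a SQL string literal, doubling any single quotes.
    static String quote(String value) {
        if (value == null) {
            return "NULL";
        }
        return "'" + value.replace("'", "''") + "'";
    }
}

So "O'Brien" becomes 'O''Brien', which covers the apostrophe gotcha from the question; anything more exotic (binary data, unusual encodings) would need more care than this.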
Hope this helps somewhat.

Updating last accessed time when separating Commands and Queries

Consider a function: IsWalletValid(walletID). It returns true if the walletID exists in the database, and updates a 'last_accessed_time' field.
A task runs periodically to remove any wallets that have not been accessed for a set period of time.
Seems like an easy solution for what we want to do, but IsWalletValid() has a side effect because it writes to the database.
Should we add an additional 'UpdateLastAccessedTime(walletID)' function? Every time we call IsWalletValid() we will also need to remember to call UpdateLastAccessedTime(walletID).
Do verifying that a wallet is valid and updating its last_accessed_time field need to be transactionally consistent (ACID)? You could use eventual consistency here:
The method IsWalletValid publishes a WalletAccessed event, then an event handler updates last_accessed_time asynchronously.
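A rough sketch of that split (the names are invented, and the single-threaded executor below is just a stand-in for whatever event bus you use): the read path answers immediately, and the timestamp write happens out of band.

import java.time.Instant;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class WalletService {

    // Assumed persistence abstraction.
    interface WalletRepository {
        boolean exists(String walletId);
        void updateLastAccessedTime(String walletId, Instant when);
    }

    private final WalletRepository repository;
    // Stand-in for a message bus: the handler runs asynchronously.
    private final ExecutorService events = Executors.newSingleThreadExecutor();

    WalletService(WalletRepository repository) {
        this.repository = repository;
    }

    // A pure query from the caller's point of view; the side effect is deferred
    // as a WalletAccessed-style event handled later.
    boolean isWalletValid(String walletId) {
        boolean valid = repository.exists(walletId);
        if (valid) {
            Instant accessedAt = Instant.now();
            events.submit(() -> repository.updateLastAccessedTime(walletId, accessedAt));
        }
        return valid;
    }
}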
If last_accessed_time is not used by domain logic to make decisions in any write handling, this could just be a facet of the read-only projection. This seems like the same concern as other, more verbose read-audit concerns. Just because data is being written and maintained doesn't mean that it necessarily needs to be part of the write model of the system. If you did, however, want to implement this as part of the domain, and perhaps store it within the same event store, it could be considered a separate auditing context outside the boundary of the original aggregate being audited.