relational_database vs config_file vs spreadsheet usage - configuration-files

I have heard some genuine arguments for the use of relational database vs spreadsheet before. Relational database provides fast reporting and (relatively speaking) reliable data warehousing,where spreadsheets are lightweight, fast replicating, and easy to float around the organization to different audience. Although I notice the advantages of either, I can rarely distinguish what's better in which scenario, and always end up using database.
In development, it's easy to forget to consider other options when one can place config settings in the database. I've ran into quite a few apps where user menus, work flows and their orders, and constants are defined in the database level. While this is good if these entities were subject to change by end user from application level, it was not the case.
So, what's your take on the roles of databases, config files, and spread sheets?

The old adage is this.
When you use a spreadsheet to solve a problem, you now have two problems.
Database is for records of the business. Long-lasting. Permanent.
Other configuration files are for other configuration information -- not long-lasting business records. Current settings and what-not are not enduring business records, they're part of a specific software configuration that processes the business records.
Spreadsheets are -- well -- they are what they are. Too complex to be a simple, configuration file. Too simple to be a real database.
Since they're (almost) impossible to control, you need one standard, correct, idempotent result in the database. You should be able to rebuild spreadsheets from that controlled source.
Similarly, if you accept a spreadsheet for upload, you have to extract the data, and never refer back to the (almost uncontrollable) source document again.

For me, I want all of the core data to be stored in a database. Two reasons:
to allow adhoc reporting access to the data
to allow applications to share data.
Databases should contain all of the domain data, and occasionally some on-the-fly data (user preferences for example). Relational databases are most popular, but for some apps there are other options.
The config file on the other hand should contain all of the 'parameters' you want to change in the system; the ones that are not changed rapidly (on-the-fly). Config items are flexible, but not easily, and usually not from the interface. If it's a param that you only want the coder to possibly change, that should be right in the code (so no one else has access).
If you want to fiddle with data mining, provide some generic mechanism to download a CSV file with the results of a SQL query, directly into Excel. That way people can fiddle with pivot tables, without having to alter the application's schema.

Spreadsheets are documents, databases are repositories for information, configuration files store rules for how a specific instance of an application should behave. If you think of it that way, it's usually not hard to make a call.

Related

Does NoSQL Fit the Bill for Reporting Software?

I am currently developing a software system that imports and normalizes historical data in various formats (XML, JSON, CSV). As of right now, we are using SQL server, and are looking to find the best replacement for this tool (Postgres or NoSQL). 90% of the time, the (archived/historical/static)data is accessed via a web client, and is used in a READ only format with users picking and choosing canned reports. Changes to the data only occur to update incorrect information .
The replacement DB must be able to store and report on 10s of millions of rows very quickly, and scale across multiple servers with ease (data replication, clustering, etc). There must also be data integrity, so if I update one KPI (lets say Cost per Hr), then all the reports that rely on the KPI will be updated accordingly.
Having no prior experience with NoSQL databases, I am wondering if it is even the right choice to use in a reporting software. We would like to allow for users to create their own custom reports, and that means being able to query any data as opposed to our canned reports, but I don't know if this would throw a wrench in the comparison between SQL vs NoSQL.
There are a few too many variables in the question, to comfortably answer it in entirety, but here's an attempt.
Your choice in SQL vs NoSQL should be based on data structure. Scalability is generally a second-tier concern, and is only slightly easy on some NoSQL platforms, but as always, isn't always free.
If you're looking for 10s of millions of rows 'very quickly' you are seriously testing the limits of what you can do with it. An RDBMS would allow you a plethora of options at the cost of speed, and a NoSQL although quite fast an inputting at that speed, would make you code most of the RDBMS smartness in your application. Chose your poison.
Updating a metric and 'automagically' updating reports is clearly a business-logic smartness, that shouldn't be tied down to platform selection.
PostgreSQL has in the near past, really picked up a lot of arsenal to deal with file formats (JSON et al) and is clearly worth a try (sans easy scalability).
Having said that, you should really investigate Postgres' otherwise forgotten asset, FDWs. You can clearly consider using a NoSQL setup to ingest large unstructured data, and thence utilize postgres' powerful semantics to use that and create a asynchronous yet structured backend for your application. If done well, that could mean the best of both worlds.

Database design: Postgres or EAV to hold semi-structured data

I was given the task to decide whether our stack of technologies is adequate to complete the project we have at hand or should we change it (and to which technologies exactly).
The problem is that I'm just a SQL Server DBA and I have a few days to come up with a solution...
This is what our client wants:
They want a web application to centralize pharmaceutical researches separated into topics, or projects, in their jargon. These researches are sent as csv files and they are somewhat structured as follows:
Project (just a name for the project)
Segment (could be behavioral, toxicology, etc. There is a finite set of about 10 segments. Each csv file holds a segment)
Mandatory fixed fields (a small set of fields that are always present, like Date, subjects IDs, etc. These will be the PKs).
Dynamic fields (could be anything here, but always as a key/pair value and shouldn't be more than 200 fields)
Whatever files (images, PDFs, etc.) that are associated with the project.
At the moment, they just want to store these files and retrieve them through a simple search mechanism.
They don't want to crunch the numbers at this point.
98% of the files have a couple of thousand lines, but there's a 2% with a couple of million rows (and around 200 fields).
This is what we are developing so far:
The back-end is SQL 2008R2. I've designed EAVs for each segment (before anything please keep in mind that this is not our first EAV design. It worked well before with less data.) and the mid-tier/front-end is PHP 5.3 and Laravel 4 framework with Bootstrap.
The issue we are experiencing is that PHP chokes up with the big files. It can't insert into SQL in a timely fashion when there's more than 100k rows and that's because there's a lot of pivoting involved and, on top of that, PHP needs to get back all the fields IDs first to start inserting. I'll explain: this is necessary because the client wants some sort of control on the fields names. We created a repository for all the possible fields to try and minimize ambiguity problems; fields, for instance, named as "Blood Pressure", "BP", "BloodPressure" or "Blood-Pressure" should all be stored under the same name in the database. So, to minimize the issue, the user has to actually insert his csv fields into another table first, we called it properties table. This action won't completely solve the problem, but as he's inserting the fields, he's seeing possible matches already inserted. When the user types in blood, there's a panel showing all the fields already used with the word blood. If the user thinks it's the same thing, he has to change the csv header to the field. Anyway, all this is to explain that's not a simple EAV structure and there's a lot of back and forth of IDs.
This issue is giving us second thoughts about our technologies stack choice, but we have limitations on our possible choices: I only have worked with relational DBs so far, only SQL Server actually and the other guys know only PHP. I guess a MS full stack is out of the question.
It seems to me that a non-SQL approach would be the best. I read a lot about MongoDB but honestly, I think it would be a super steep learning curve for us and if they want to start crunching the numbers or even to have some reporting capabilities,
I guess Mongo wouldn't be up to that. I'm reading about PostgreSQL which is relational and it's famous HStore type. So here is where my questions start:
Would you guys think that Postgres would be a better fit than SQL Server for this project?
Would we be able to convert the csv files into JSON objects or whatever to be stored into HStore fields and be somewhat queryable?
Is there any issues with Postgres sitting in a windows box? I don't think our client has Linux admins. Nor have we for that matter...
Is it's licensing free for commercial applications?
Or should we stick with what we have and try to sort the problem out with staging tables or bulk-insert or other technique that relies on the back-end to do the heavy lifting?
Sorry for the long post and thanks for your input guys, I appreciate all answers as I'm pulling my hair out here :)

How to build local environment with large databases

I have two storages (PostgreSQL, MongoDB) and as I need to develope application locally on my computer (ideally offline), i need data from those storages to be copied to my HDD.
Anyway those are massive databases with around hundreds of gigabytes of data.
I don't need all data stored there, just sample of them to be able to launch my app locally on that data. Both storages have some capable tools for data export (pg_dump, mongodump, mongoexport etc.).
But I don't know how to easily and effectively do the export of small sample of data. Even if I would take the list of all tables/collections and build some whitelist, which would define tables, which should be limited on number of rows, there comes troubles with triggeres, functions, indexes etc.
I don't know about testing for MongoDB, but for PostgreSQL here's what I do.
I follow a pattern while developing against databases that separates the DB side from the app side. For testing the DB side, I have a test schema which includes a single stored procedure that resets all the data in the real schema. This reset is done following the MERGE pattern (delete any records with an unrecognized key, update records that have matching keys but which are changed, and insert missing records). This reset is called before running every unit test. This gives me simple, clear test coverage for stored functions.
For testing code that calls into the database, the database layer is always mocked, so there are never any calls that actually go to the database.
What you are describing suggests to me that you are attempting to mix unit testing with integration testing, and I rather strongly suggest that you don't do that. Integration testing is what happens when you've already proved base functionality and want to prove integration between components and probably also performance, too. For IT, you really need a representative data set on representative hardware. Usually this means a dedicated machine, and using hudson for CI.
The direction you seem to be going in is going to be difficult because, as you've already noticed, it's difficult to handle that volume of data and it's difficult to generate representative data sets (most CI systems actually use production data that's been "cleaned" of sensitive information)
Which is why most of the places I've worked have not gone that way.
Just copy it all. Several hundreds gigabytes is not very much by today's standards — you can buy 2000GB disk for $80.
If you test your code on small sample data then how do you know if your coding will be efficient enough for full database?
Just remember to encrypt it with strong password if it goes out of your company building.

Is a document/NoSQL database a good candidate for storing a balance sheet?

If I were to create a basic personal accounting system (because I'm like that - it's a hobby project about a domain I'm familiar enough with to avoid getting bogged-down in requirements), would a NoSQL/document database like RavenDB be a good candidate for storing the accounts and more importantly, transactions against those accounts? How do I choose which entity is the "document"?
I suspect this is one of those cases were actually a SQL database is the right fit and trying to go NoSQL is the mistake, but then when I think of what little I know of CQRS and event sourcing, I wonder if the entity/document is actually the Account, and the transactions are Events stored against it, and that when these "events" occur, maybe my application also then writes out to a easily queryable read store like a SQL database.
Many thanks in advance.
Personally think it is a good idea, but I am a little biased because my full time job is building an accounting system which is based on CQRS, Event Sourcing, and a document database.
Here is why:
Event Sourcing and Accounting are based on the same principle. You don't delete anything, you only modify. If you add a transaction that is wrong, you don't delete it. You create an offset transaction. Same thing with events, you don't delete them, you just create an event that cancels out the first one. This means you are publishing a lot of TransactionAddedEvent.
Next, if you are doing double entry accounting, recording a transaction is different than the way your view it on a screen (especially in a balance sheet). Hence, my liking for cqrs again. We can store the data using correct accounting principles but our read model can be optimized to show the data the way you want to view it.
In a balance sheet, you want to view all entries for a given account. You don't want to see the transaction because the transaction has two sides. You only want to see the entry that affects that account.
So in your document db you would have an entries collection.
This makes querying very easy. If you want to see all of the entries for an account you just say SELECT * FROM Entries WHERE AccountId = 1. I know that is SQL but everyone understands the simplicity of this query. It just as easy in a document db. Plus, it will be lightning fast.
You can then create a balance sheet with a query grouping by accountid, and setting a restriction on the date. Notice no joins are needed at all, which makes a document db a great choice.
Theory and Architecture
If you dig around in accounting theory and history a while, you'll see that the "documents" ought to be the source documents -- purchase order, invoice, check, and so on. Accounting records are standardized summaries of those usually-human-readable source documents. An accounting transaction is two or more records that hit two or more accounts, tied together, with debits and credits balancing. Account balances, reports like a balance sheet or P&L, and so on are just summaries of those transactions.
Think of it as a layered architecture -- the bottom layer, the foundation, is the source documents. If the source is electronic, then it goes into the accounting system's document storage layer -- this is where a nosql db might be useful. If the source is a piece of paper, then image it and/or file it with an index number that is then stored in the accounting system's document layer. The next layer up is digital records summarizing those documents; each document is summarized by one or more unbalanced transaction legs. The next layer up is balanced transactions; each transaction is composed of two or more of those unbalanced legs. The top layer is the financial statements that summarize those balanced transactions.
Source Documents and External Applications
The source documents are the "single source of truth" -- not the records that describe them. You should always be able to rebuild the entire db from the source documents. In a way, the db is just an index into the source documents in the first place. Way too many people forget this, and write accounting software in which the transactions themselves are considered the source of truth. This causes a need for a whole 'nother storage and workflow system for the source documents themselves, and you wind up with a typical modern corporate mess.
This all implies that any applications that write to the accounting system should only create source documents, adding them to that bottom layer. In practice though, this gets bypassed all the time, with applications directly creating transactions. This means that the source document, rather than being in the accounting system, is now way over there in the application that created the transaction; that is fragile.
Events, Workflow, and Digitizing
If you're working with some sort of event model, then the right place to use an event is to attach a source document to it. The event then triggers that document getting parsed into the right accounting records. That parsing can be done programatically if the source document is already digital, or manually if the source is a piece of paper or an unformatted message -- sounds like the beginnings of a workflow system, right? You still want to keep that original source document around somewhere though. A document db does seem like a good idea for that, particularly if it supports a schema where you can tie the source documents to their resulting parsed and balanced records and vice versa.
You can certainly create such a system.
In that scenario, you have the Account Aggregate, and you also have the TimePeriod Aggregate.
The time period is usually a Month, a Quarter or a Year.
Inside each TimePeriod, you have the Transactions for that period.
That means that loading the current state is very fast, and you have the full log in which you can go backward.
The reason for TimePeriod is that this is usually the boundary in which you actually think about such things.
In this case, a relational database is the most appropriate, since you have relational data (eg. rows and columns)
Since this is just a personal system, you are highly unlikely to have any scale or performance issues.
That being said, it would be an interesting exercise for personal growth and learning to use a document-based DB like RavenDB. Traditionally, finance has always been a very formal thing, and relational databases are typically considered more formal and rigorous than document databases. But, like you said, the domain for this application is under your control, and is fairly straight forward, so complexity and requirements would not get in the way of designing the system.
If it was my own personal pet project, and I wanted to learn more about a new-ish technology and see if it worked in a particular domain, I would go with whatever I found interesting and if it didn't work very well, then I learned something. But, your mileage may vary. :)

What NoSQL solution to choose for domain names database?

I have a project that stores several millions of domain names in database and perform search requests to find if domain is present in DB. The only operation I need - check if given value exists. No range queries, no additional information, nothing.
The number of queries that I make to database is rather big, for example 100'000 per one user session.
I have new database once a day and even it's possible to check what records were deleted and what added - I don't think that it's worth it. So, I am importing database to a new table and point script to a new name.
Looking for solution that can make the whole things faster, as I don't use any SQL features. Name search and import time are important for me.
My server can't store this database in memory, even half of it, so I think some NoSQL solution working from hard drive can help me.
Can you suggest something?
A much smaller and faster solution would be to use Berkeley DB with the key-value pair API. Berkeley DB is a database library that links into your application, so there is no client/server overhead nor separate server to install and manage. Berkeley DB is very straightforward and provides, among several APIs, a simple key-value (NoSQL) API that provides all of the basic data management routines that you would expect to find in a much larger, more complex RDBMS (indexing, secondary indexes, foreign keys), but without the overhead of a SQL engine.
Disclaimer: I am the Product Manager for Berkeley DB, so I am a little biased. That said, it was designed to do exactly what you're asking for -- straightforward, fast, scalable key-value data management without unnecessary overhead.
In fact, there are many "database domain" type application services that use Berkeley DB as their primary data store. Most of the open source and/or commercial LDAP implementations use Berkeley DB (including OpenLDAP, Redhat's LDAP, Sun Directory Server, etc.). Cisco, Juniper, AT&T, Alcatel, Mitel, Motorola and many others use Berkeley DB to manage their They use Berkeley DB for their gateway, authentication, and configuration management systems, They use BDB because it does exactly what they need, it's very fast, scalable and reliable.
You could get by quite nicely with just a Bloom filter if you can accept a very small false positive rate (assuming you use a large enough filter).
On the other hand, you could certainly use Cassandra. It makes heavy use of bloom filters, so asking for something that doesn't exist is quick, and you don't have to worry about false positives. It's designed to handle data sets that do not fit into memory, so performance degredation there is quite smooth.
Importing any amount of data should be quick -- on a normal machine, Cassandra can handle about 15k writes per second.
Many options here. Berkeley DB certainly does the job and is probably one of the simplest solutions. Just as simple: store everything in memcached, then you have the option of splitting the cache of the values across several machines if needed (if query load or data size grows).