Database design: Postgres or EAV to hold semi-structured data

I was given the task of deciding whether our technology stack is adequate to complete the project we have at hand, or whether we should change it (and to which technologies exactly).
The problem is that I'm just a SQL Server DBA and I have a few days to come up with a solution...
This is what our client wants:
They want a web application to centralize pharmaceutical research, separated into topics, or projects in their jargon. The research arrives as csv files that are somewhat structured as follows:
Project (just a name for the project)
Segment (could be behavioral, toxicology, etc. There is a finite set of about 10 segments. Each csv file holds a segment)
Mandatory fixed fields (a small set of fields that are always present, like Date, subject IDs, etc. These will be the PKs).
Dynamic fields (could be anything here, but always as key/value pairs, and there shouldn't be more than 200 fields)
Any files (images, PDFs, etc.) that are associated with the project.
At the moment, they just want to store these files and retrieve them through a simple search mechanism.
They don't want to crunch the numbers at this point.
98% of the files have a couple of thousand lines, but there are 2% with a couple of million rows (and around 200 fields).
This is what we are developing so far:
The back-end is SQL Server 2008 R2. I've designed EAVs for each segment (before anything, please keep in mind that this is not our first EAV design; it worked well before with less data) and the mid-tier/front-end is PHP 5.3 with the Laravel 4 framework and Bootstrap.
The issue we are experiencing is that PHP chokes on the big files. It can't insert into SQL Server in a timely fashion when there are more than 100k rows, because there's a lot of pivoting involved and, on top of that, PHP needs to fetch all the field IDs before it can start inserting.
I'll explain why this is necessary: the client wants some sort of control over the field names. We created a repository of all the possible fields to try and minimize ambiguity problems; fields named, for instance, "Blood Pressure", "BP", "BloodPressure" or "Blood-Pressure" should all be stored under the same name in the database. So, to minimize the issue, the user has to insert his csv fields into another table first (we called it the properties table). This doesn't completely solve the problem, but as he's inserting the fields he sees possible matches already stored: when the user types "blood", a panel shows all the fields already used that contain the word blood, and if he thinks it's the same thing, he has to change the csv header to match the existing field. Anyway, all this is to explain that it's not a simple EAV structure and there's a lot of back and forth of IDs.
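To make the back and forth concrete, here's a rough sketch of the kind of structure I mean (all table and column names are made up for illustration, not our actual schema):

    -- Repository of canonical field names, so "BP" and "Blood Pressure" end up as one property
    CREATE TABLE dbo.Properties (
        PropertyId    INT IDENTITY(1,1) PRIMARY KEY,
        CanonicalName NVARCHAR(200) NOT NULL UNIQUE
    );

    -- One EAV table per segment: the mandatory fixed fields plus one row per dynamic field
    CREATE TABLE dbo.ToxicologyValues (
        ProjectId     INT           NOT NULL,
        SubjectId     NVARCHAR(50)  NOT NULL,
        MeasuredOn    DATE          NOT NULL,
        PropertyId    INT           NOT NULL REFERENCES dbo.Properties (PropertyId),
        PropertyValue NVARCHAR(MAX) NULL,
        PRIMARY KEY (ProjectId, SubjectId, MeasuredOn, PropertyId)
    );

So for one of the big files (a couple of million rows, around 200 dynamic fields), PHP has to resolve 200 PropertyIds and then write on the order of hundreds of millions of EAV rows.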
This issue is giving us second thoughts about our technology stack choice, but we have limitations on our possible choices: I have only worked with relational DBs so far (only SQL Server, actually) and the other guys know only PHP. I guess an MS full stack is out of the question.
It seems to me that a non-SQL approach would be the best. I read a lot about MongoDB, but honestly I think it would be a super steep learning curve for us, and if they want to start crunching the numbers or even to have some reporting capabilities, I guess Mongo wouldn't be up to that. I'm reading about PostgreSQL, which is relational, and its famous hstore type. So here is where my questions start:
Do you guys think that Postgres would be a better fit than SQL Server for this project?
Would we be able to convert the csv files into JSON objects or whatever, to be stored in hstore fields and be somewhat queryable?
Are there any issues with Postgres sitting on a Windows box? I don't think our client has Linux admins. Nor have we, for that matter...
Is its licensing free for commercial applications?
Or should we stick with what we have and try to sort the problem out with staging tables, bulk insert, or another technique that relies on the back-end to do the heavy lifting?
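For reference, the kind of back-end-heavy approach I have in mind for that last option would look roughly like this (the file path, table and column names are just placeholders):

    -- Wide staging table whose columns mirror the csv headers (one such table per file layout)
    CREATE TABLE dbo.StagingRaw (
        SubjectId  NVARCHAR(50),
        MeasuredOn DATE,
        Col001     NVARCHAR(255),
        Col002     NVARCHAR(255)
        -- ... one column per csv header
    );

    -- Let SQL Server ingest the raw file instead of PHP looping over rows
    BULK INSERT dbo.StagingRaw
    FROM 'C:\imports\project_x_toxicology.csv'
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2);

    -- A stored procedure could then UNPIVOT the staging table into the EAV tables
    -- in one set-based pass, resolving PropertyIds with a join instead of round trips.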
Sorry for the long post and thanks for your input guys, I appreciate all answers as I'm pulling my hair out here :)

Related

PostgreSQL: JSON column or one-to-many table for config options

We currently have a table which stores information about users. Some of the columns hold information such as user ID, name, etc., but many other columns (booleans, integers, varchars, etc.) hold configuration options for each user.
This has over time resulted in the width of the table becoming quite big and I think the time has come to migrate this to something new, so I want to remove all the "option"-related columns to a separate data structure.
The typical way of doing this, from my experience, would be to have a new table which would simply have option_id and option_name, and a second new table which would contain user_id, option_id, option_value, for example.
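In other words, something along these lines (the names are only illustrative, and I'm assuming the existing users table keeps user_id as its key):

    CREATE TABLE options (
        option_id   serial PRIMARY KEY,
        option_name text   NOT NULL UNIQUE
    );

    CREATE TABLE user_options (
        user_id      integer NOT NULL REFERENCES users (user_id),
        option_id    integer NOT NULL REFERENCES options (option_id),
        option_value text,
        PRIMARY KEY (user_id, option_id)
    );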
However, a colleague suggested using the new jsonb column type as an alternative, but I don't know if I like the idea of storing relational data in a non-relational way. From a Java point of view, it's pretty much the same as far as I can tell - it'll just be turned into a POJO and then cached on the object.
I should mention that the number of users will be quite low, only going into the thousands, and the number of columns could and will go into the hundreds.
Does anyone have advice on the best way forward here?
Technically, you have already de-normalized your database structure by adding columns to a table that are irrelevant to some of the entities stored therein.
Using JSON is just another way to de-normalize, cramming a bunch of values into a single row-column field. The excellent binary support for JSON in Postgres (the jsonb data type) then lets you index elements within those JSON documents, as a way to quickly access those embedded values. This is quite screwy from a relational point of view, but is handy for some situations.
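A minimal sketch of that route, with invented column and option names:

    -- Single jsonb column holding all of a user's options
    ALTER TABLE users ADD COLUMN options jsonb;

    -- A GIN index lets Postgres find rows by keys/values inside the documents
    CREATE INDEX users_options_gin ON users USING gin (options);

    -- Containment query: every user whose options include dark_mode = true
    SELECT user_id, name
    FROM   users
    WHERE  options @> '{"dark_mode": true}';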
Either approach is commonly done for this kind of problem, and is not necessarily bad. In general, de-normalizing is often a pay-now-or-pay-later kind of solution. But for something like user preferences, there may not be a pay-later penalty, as there often is with most business-oriented problem domains.
Nevertheless, you should consider a normalized database structure.
By the way, this kind of table-structure question might be better asked on the sister site, http://DBA.StackExchange.com/.
I suggest searching Stack Overflow, that DBA site, and the wider Internet for discussions of database design for storing user preferences. Like this.

Calculating and reporting Data Completeness

I have been working on measuring data completeness and creating actionable reports for our HRIS system for some time.
Until now I have used Excel, but now that the requirements for reporting have stabilized and the need for quicker response time has increased, I want to move the work to another level. At the same time I also want more detailed options for distinguishing between different units.
As an example I am looking at missing fields. So for each employee in every company I simply want to count how many fields are missing.
For other fields I am looking to validate data - like birthdays compared to hiring dates, threshold for different values, employee groups compared to responsibility level, and so on.
My question is where to move from here. Is there any language that is better than the others at importing lists, doing evaluations on fields in the lists, and then quantifying the results at company and other levels? I want to be able to extract data from our different systems, then have a program do all the calculations and summarize the findings in some way. (I consider it to be a good learning experience.)
I've done something like this in the past and sort of cheated. I wrote a program that ran nightly, identified missing fields (not required, but necessary for data integrity) and dumped those to an incomplete-record table that was cleared each night before the process ran. I then sent batch emails about the missing element(s) to the responsible group (Payroll/Benefits/Compensation/HR Admin) so the missing data could be added. I used .NET against an Oracle database and sent emails via Lotus Notes, but a similar design should work in just about any environment.
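Stripped down to generic SQL with invented table and column names, the core of that nightly step amounts to something like:

    -- One row per employee per missing or suspicious field
    INSERT INTO incomplete_records (employee_id, problem_field, found_on)
    SELECT employee_id, 'birth_date', CURRENT_DATE FROM employees WHERE birth_date IS NULL
    UNION ALL
    SELECT employee_id, 'hire_date', CURRENT_DATE FROM employees WHERE hire_date IS NULL
    UNION ALL
    -- validation rules, not just missing values: e.g. hired before being born
    SELECT employee_id, 'hire_date_before_birth_date', CURRENT_DATE
    FROM employees WHERE hire_date < birth_date;

From there, the batch emails are just a grouping of incomplete_records by whichever team owns each field.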

Links to files outside a PostgreSQL database

A stupid newbie question: I want to make a PostgreSQL (9.2.2 with PostGIS 2.0.1, on 32-bit Windows XP) database with rasters saved outside the database (I will need the rasters to be accessed from outside the database and they won't be uploaded/migrated frequently, so consistency is not an issue). My problem is: I don't know how to make the links to the rasters (from the database holding the metadata), and I didn't find anything comprehensible enough.
I have found something about foreign data wrappers, but they seem to be intended for data with a table structure, not files like rasters. DATALINK seems better, but I'm afraid it's the same case, plus I'm not sure I understood how to use it. In some of the discussions I've found a mention of symbolic links, but these seem to be something Unix-based, and probably only vaguely related.
I'm sure it must be simple, but I didn't manage to solve it myself.
Databases provide no built-in way to link to outside objects.
I can think of at least 2 approaches:
Save the full path to your files in some metadata table, as an attribute of type text (see the sketch after this list). Don't use it for joining tables in queries, though; an artificial key of an internal numeric type (like integer or bigint) is a better choice for performance reasons;
Name your raster files according to their numeric keys in the database. This approach has a drawback: without the database you will not be able to obtain any useful info about your files.
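A minimal sketch of the first approach (table and column names are just examples):

    CREATE TABLE raster_metadata (
        raster_id   bigserial PRIMARY KEY,  -- use this key for joins, not the path
        file_path   text NOT NULL,          -- e.g. 'C:\rasters\dem_tile_042.tif'
        description text,
        srid        integer
    );

    -- The application reads the path and opens the file itself; Postgres only stores the string
    SELECT file_path FROM raster_metadata WHERE raster_id = 42;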
The path forward depends on the complexity of your system and the chosen optimization techniques.

When To Use What Database? i.e. what parameters should I look at when choosing a database to use?

I am developing a system which has multiple modules,
Social Media User Demography - (Document) - Name, loc, interests, work, education
Social Media User Connections - (Graph) - friends
CRM - (Rows and Columns) - telecom + banking etc
to name a few. I'm pretty sure that I have already crossed millions of records in each one of them.
When I look for a NoSQL database to choose from, I have at least 10 in each category. For document databases, I have a long list right from MongoDB to DjonDB. It's the same case when I look for a graph database, and so on and so forth. I have also seen other key-value store databases, columnar databases, etc., at http://nosql-database.org/.
So I wanted to know: are there any generic rules of thumb that I should follow to choose among these databases? When is a columnar DB the optimized choice, what type of data does a key-value store suit best, etc.
What are the best suited databases for what type of data and why? and most importantly
What are the worst suited databases for what type of data and why?
Thanks
This is a very open question, but I'll give it a shot.
Some things to keep in mind:
Pick a database that has a great (not good, great) community around it. You might find FooDB and it claims to do everything you want, but if no one is contributing to it, then you'll have built your application around a dead technology. You want active contributions, lots of customers with production deployments, and ideally something not in its first version.
Try to find technologies that play well together. For example, Elastic Search, MongoDB, CouchDB, Couchbase all more or less work with JSON. That should help you narrow your choices of technologies.
I wouldn't try to spread myself too thin. Each type of database (graph, document, row/column, key-value pair) has its own learning curve. It takes quite a while to learn how to model data in a denormalized fashion. The more variety you have, the harder it will be to maintain all those different databases.
I don't know why I rarely see this as advice, but I would pick something you actually like developing in. Does the query syntax seem intuitive and fun? If not, you're going to hate developing in it. This isn't the most important factor, but I think it should be considered.

relational_database vs config_file vs spreadsheet usage

I have heard some genuine arguments for the use of relational databases vs spreadsheets before. Relational databases provide fast reporting and (relatively speaking) reliable data warehousing, whereas spreadsheets are lightweight, fast to replicate, and easy to float around the organization to different audiences. Although I notice the advantages of either, I can rarely distinguish what's better in which scenario, and always end up using a database.
In development, it's easy to forget to consider other options when one can place config settings in the database. I've run into quite a few apps where user menus, workflows and their orders, and constants are defined at the database level. While this would be fine if these entities were subject to change by the end user from the application level, that was not the case.
So, what's your take on the roles of databases, config files, and spread sheets?
The old adage is this.
When you use a spreadsheet to solve a problem, you now have two problems.
Database is for records of the business. Long-lasting. Permanent.
Other configuration files are for other configuration information -- not long-lasting business records. Current settings and what-not are not enduring business records, they're part of a specific software configuration that processes the business records.
Spreadsheets are -- well -- they are what they are. Too complex to be a simple configuration file. Too simple to be a real database.
Since they're (almost) impossible to control, you need one standard, correct, idempotent result in the database. You should be able to rebuild spreadsheets from that controlled source.
Similarly, if you accept a spreadsheet for upload, you have to extract the data, and never refer back to the (almost uncontrollable) source document again.
For me, I want all of the core data to be stored in a database. Two reasons:
to allow adhoc reporting access to the data
to allow applications to share data.
Databases should contain all of the domain data, and occasionally some on-the-fly data (user preferences for example). Relational databases are most popular, but for some apps there are other options.
The config file, on the other hand, should contain all of the 'parameters' you want to change in the system; the ones that are not changed rapidly (on the fly). Config items can be changed, but not easily, and usually not from the interface. If it's a param that you only want the coder to possibly change, that should be right in the code (so no one else has access).
If you want to fiddle with data mining, provide some generic mechanism to download a CSV file with the results of a SQL query, directly into Excel. That way people can fiddle with pivot tables, without having to alter the application's schema.
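In Postgres, for instance, that download mechanism can lean on a single statement (the query and names below are placeholders); the application just streams the output to the browser with a csv content type:

    -- Stream any query result as csv, header row included
    COPY (
        SELECT order_date, region, product, quantity, total
        FROM   sales_summary
    ) TO STDOUT WITH (FORMAT csv, HEADER true);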
Spreadsheets are documents, databases are repositories for information, configuration files store rules for how a specific instance of an application should behave. If you think of it that way, it's usually not hard to make a call.