Should I use a database instead of serialized files? [closed] - persistence

I am working on my first real-world application, which keeps track of the medical studies of a medium-sized medical office. The system needs to keep track of doctors, users, patients, study templates and study reports. The purpose of this program is to apply a preformatted study template to any possible study, keep track of each patient's studies, and maintain an easy-to-find file system. Each study report is saved in a specific folder as an HTML file that can be used or printed from Windows directly.
I estimate that at any given time there would be about 20 active doctors, 30 different study templates and 12 users; the patients and study reports are cumulative and will remain active indefinitely. I estimate we are talking about 2,000 new patients and 6,000 new study reports a year.
I have almost completed the job, but initially I chose to store the data in a serialized file and did not consider using a database instead. Now, considering that the data will grow rapidly, I believe I should reconsider and work with a database. Among many other reasons, I am especially concerned about the serialized-file choice because any change I make to a class in the future may conflict with the serialized file and stop me from reopening it. I would appreciate any comments: how large is too large for a file to work with? Is a serialized file acceptable in this case? Please pass along any ideas or comments. Thanks for the help.

Your concern about breaking compatibility with these files is absolutely reasonable.
I solved the same problem in a small inventory project by taking these steps:
Setup of a DB server (MySQL)
Integration of Hibernate into the project
Reimplementation of the serializable classes in a new package using JPA annotations (if the DB schema won't break, add the annotations to the existing classes instead); see the sketch after this list
Generation of the DB schema from the JPA entities
Implementation of an importer for the existing objects (deserialization, conversion, and persisting with referential integrity)
Import and validation of existing data objects
Any required refactoring from the old classes to the new JPA entities throughout the project
Removal of the old classes and their importer (they can slumber in the version-control repository)
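To make the reimplementation step concrete, here is a minimal sketch of what two annotated entities might look like. The class and field names (Patient, StudyReport, templateName, htmlPath) are my own guesses at the domain, not code from either project:

import java.util.ArrayList;
import java.util.List;
import javax.persistence.*;

@Entity
public class Patient {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    private String name;

    // A patient accumulates study reports indefinitely.
    @OneToMany(mappedBy = "patient", cascade = CascadeType.ALL)
    private List<StudyReport> reports = new ArrayList<>();

    protected Patient() { }                 // JPA requires a no-arg constructor
    public Patient(String name) { this.name = name; }
}

@Entity
class StudyReport {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    private String templateName;
    private String htmlPath;                // where the generated HTML report lives

    @ManyToOne
    private Patient patient;

    protected StudyReport() { }             // JPA requires a no-arg constructor
}

With entities like these, the schema-generation step can be driven by Hibernate's hbm2ddl tooling, and the importer is then just a loop that deserializes the old objects, copies their fields into new entities, and persists them through an EntityManager.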

Most people will say that you should use a database regardless. If this is a professional application, you can't risk the data being corrupted, and that is a real possibility, e.g. due to a bug in your code or someone using the program incorrectly.
It is the value of the data, not its size, that matters here. Say the system has been running for a year and the file becomes unusable. Are you going to tell them they should enter all the data again from scratch?
If it's just an exercise, I still suggest you use a database, as you will learn something. A popular choice is Hibernate, and it's CV++. ;)

Related

store/query semi-structured data [closed]

We have an application written in Perl that creates a complex data structure for each of our subscribers (we have more than 4 million subscribers). Each subscriber has some common fields that are present for all of them, and some subscribers are missing some of the other fields.
The data looks like this:
my %subscribers = (
    "user_001" => {
        "name"  => "sam",
        "age"   => "13",
        "color" => ['red', 'blue'],
        "item"  => {
            "old" => ['PC', 'pen'],
            "new" => ['tap', 'car'],
        },
    },
    "user_002" => {
        "name"  => "ali",
        "age"   => "54",
        "color" => ['red', 'null', 'green'],
        "item"  => {
            "old" => ['phone', 'TV'],
        },
    },
    "user_003" => {
        "name" => "foo",
        "age"  => "02",
        "item" => {
            "old" => [''],
        },
    },
    # ...
);
Our real data is messier and more complex.
Now we want to store this data in a DB and then run queries against it, e.g. get the users that have a new 'tap' in item, or whose age is greater than 30.
What we need to know is:
What is the best way to store the data (a MySQL or Oracle DB is not an option); we need something for semi-structured data. And how do we run these queries with performance in mind?
We just need a headline to start our search (and yes, we did our homework using Google ^_^).
BR,
Hosen
It sounds like your dataset is still small and manageable, so you need to be very careful about dismissing traditional database solutions at this early point. You haven't really offered any hard reasons why SQL solutions have been dismissed (new features in recent years are targeted squarely at NoSQL use-cases), so as someone who has trawled through this issue myself in the past (in a large Perl project), I will offer some questions you should ask yourself:
Will the new technology choice become the authoritative data store, or just something you want to bolt-on with minimum changes to help you service queries?
If you just want to quickly bolt on a new API to service queries, NoSQL technologies such as MongoDB (with an excellent Perl driver) become a viable option (and you can slurp in a Perl hash as you've described with very little code). If you only use it as a (possibly read-only) cache, you mitigate all the durability concerns and avoid a lot of expensive data cleaning/validation/normalization effort, getting you to an 80% solution very quickly.
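As an illustration of how little code that takes, here is a sketch using the MongoDB Java driver (the Perl driver exposes the same document model); the connection string, database and collection names are placeholders of my own:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import java.util.Arrays;

import static com.mongodb.client.model.Filters.*;

public class SubscriberSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> subs =
                    client.getDatabase("crm").getCollection("subscribers");

            // Each subscriber hash becomes one document; missing fields are simply absent.
            // Storing age as a number (not the string "13") keeps range queries possible.
            subs.insertOne(new Document("_id", "user_001")
                    .append("name", "sam")
                    .append("age", 13)
                    .append("color", Arrays.asList("red", "blue"))
                    .append("item", new Document("old", Arrays.asList("PC", "pen"))
                                        .append("new", Arrays.asList("tap", "car"))));

            // "subscribers that have a new 'tap' in item, or whose age is over 30"
            for (Document d : subs.find(or(eq("item.new", "tap"), gt("age", 30)))) {
                System.out.println(d.toJson());
            }
        }
    }
}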
If you want something durable to replace your current data storage, it's true that there are options other than a SQL RDBMS. XML stores like eXistDB are very powerful if you already work with XML ecosystems and your data fits the document-object paradigm where XQuery/XPath makes sense (there's even a Perl RPC interface for it). It's worth taking a look at commercial vendors like MarkLogic or EnterpriseDB if you have time pressures and a decent budget. If your data is truly messy and can be efficiently modeled as a graph of entities and relationships, it's tempting to consider things like SparkleDB, Neo4j or Virtuoso; however, in my limited exposure to these things, whilst they have a lot of potential for servicing otherwise impossible or difficult queries/analyses, they make a terrible place to curate and manage your core business data.
What kinds of queries, reports/analyses do you hope to do? This will determine how much data cleaning and normalization effort will be required. Answering this question will help you focus your choice:
If you think you'll end up doing data cleaning/validation/transformation in order to implement your final choice and make the data queryable, you might as well use a traditional SQL database but explore using it in a "NoSQL" way (there's lots of advice/comparison out there).
If you are hoping to avoid doing a lot of data cleaning/validation/normalization due to lack of time or budget, I'm afraid that the more mature XML/RDF/SPARQL solutions will require 10x more engineering effort to design and establish a working system built around the messy data than simply cleaning it properly in the first place.
If you have truly messy, heterogeneous data (especially when you need to continuously import from 3rd parties over which you have no control and you want to avoid constant data-cleaning effort), then leaving your messy data "as-is" lands you in a spectrum of hurt. At one extreme (in terms of cost, but also query power/expressiveness and accuracy) you have the XML/RDF/SPARQL solutions mentioned before. At the cheaper/quicker/simpler end (perhaps too simple in many cases) you have contenders such as MongoDB, Cassandra and CouchDB (this is by no means an exhaustive list, and they have differing levels of Perl support or quality of Perl clients).

Database design: Postgres or EAV to hold semi-structured data

I was given the task to decide whether our stack of technologies is adequate to complete the project we have at hand or should we change it (and to which technologies exactly).
The problem is that I'm just a SQL Server DBA and I have a few days to come up with a solution...
This is what our client wants:
They want a web application to centralize pharmaceutical research studies, separated into topics, or "projects" in their jargon. These studies are sent as CSV files and are somewhat structured as follows:
Project (just a name for the project)
Segment (could be behavioral, toxicology, etc. There is a finite set of about 10 segments; each CSV file holds one segment)
Mandatory fixed fields (a small set of fields that are always present, like date and subject IDs. These will be the PKs).
Dynamic fields (could be anything here, but always as a key/pair value and shouldn't be more than 200 fields)
Whatever files (images, PDFs, etc.) that are associated with the project.
At the moment, they just want to store these files and retrieve them through a simple search mechanism.
They don't want to crunch the numbers at this point.
98% of the files have a couple of thousand lines, but 2% of them have a couple of million rows (and around 200 fields).
This is what we are developing so far:
The back-end is SQL Server 2008 R2. I've designed EAVs for each segment (before anything else, please keep in mind that this is not our first EAV design; it worked well before with less data), and the mid-tier/front-end is PHP 5.3 with the Laravel 4 framework and Bootstrap.
The issue we are experiencing is that PHP chokes on the big files. It can't insert into SQL in a timely fashion when there are more than 100k rows, because there's a lot of pivoting involved and, on top of that, PHP needs to fetch all the field IDs before it can start inserting. I'll explain: this is necessary because the client wants some sort of control over the field names. We created a repository of all the possible fields to try to minimize ambiguity problems; fields named, for instance, "Blood Pressure", "BP", "BloodPressure" or "Blood-Pressure" should all be stored under the same name in the database. So, to minimize the issue, the user has to insert his CSV fields into another table first, which we call the properties table. This doesn't completely solve the problem, but as he's inserting the fields he sees possible matches already inserted: when the user types "blood", a panel shows all the fields already used that contain the word "blood". If the user thinks it's the same thing, he has to change the CSV header to that field. Anyway, all this is to explain that it's not a simple EAV structure and there's a lot of back and forth of IDs.
This issue is giving us second thoughts about our technology stack choice, but we have limitations on our possible choices: I have only worked with relational DBs so far (only SQL Server, actually), and the other guys know only PHP. I guess a full MS stack is out of the question.
It seems to me that a non-SQL approach would be best. I've read a lot about MongoDB, but honestly I think it would be a very steep learning curve for us, and if they want to start crunching the numbers or even have some reporting capabilities, I guess Mongo wouldn't be up to that. I'm now reading about PostgreSQL, which is relational, and its famous hstore type. So here is where my questions start:
Do you guys think Postgres would be a better fit than SQL Server for this project?
Would we be able to convert the CSV files into JSON objects or whatever, store them in hstore fields, and still have them be somewhat queryable? (A sketch follows these questions.)
Are there any issues with Postgres sitting on a Windows box? I don't think our client has Linux admins. Nor do we, for that matter...
Is its licensing free for commercial applications?
Or should we stick with what we have and try to sort the problem out with staging tables, bulk inserts, or another technique that relies on the back-end to do the heavy lifting?
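To give question 2 some shape, here is a rough sketch (not a recommendation) of a table with the fixed columns as real columns and the dynamic CSV fields in an hstore column, driven from JDBC; every table, column and connection detail here is invented for illustration:

import java.sql.Connection;
import java.sql.Date;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class HstoreSketch {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/research", "app", "secret")) {

            try (Statement st = con.createStatement()) {
                st.execute("CREATE EXTENSION IF NOT EXISTS hstore");
                // Fixed, always-present fields as real columns; dynamic fields in hstore.
                st.execute("CREATE TABLE IF NOT EXISTS measurements ("
                        + " study_date date NOT NULL,"
                        + " subject_id text NOT NULL,"
                        + " fields hstore,"
                        + " PRIMARY KEY (study_date, subject_id))");
            }

            // One CSV row: the dynamic key/value pairs go in via hstore(text[], text[]).
            try (PreparedStatement ins = con.prepareStatement(
                    "INSERT INTO measurements VALUES (?, ?, hstore(?, ?))")) {
                ins.setDate(1, Date.valueOf("2014-05-01"));
                ins.setString(2, "SUBJ-042");
                ins.setArray(3, con.createArrayOf("text",
                        new String[] { "Blood Pressure", "Weight" }));
                ins.setArray(4, con.createArrayOf("text",
                        new String[] { "120/80", "71.5" }));
                ins.executeUpdate();
            }

            // The dynamic fields remain queryable with the -> operator.
            try (Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery(
                        "SELECT subject_id, fields -> 'Blood Pressure' AS bp"
                        + " FROM measurements"
                        + " WHERE fields -> 'Blood Pressure' IS NOT NULL")) {
                while (rs.next()) {
                    System.out.println(rs.getString("subject_id") + " " + rs.getString("bp"));
                }
            }
        }
    }
}

Whether that beats fixing the bulk-insert path on SQL Server is a separate question; the sketch is only meant to show what "somewhat queryable" hstore data looks like.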
Sorry for the long post and thanks for your input guys, I appreciate all answers as I'm pulling my hair out here :)

Data Integration [closed]

I have been looking at the data integration methods Global As View (GAV) and Local As View (LAV), but I cannot find any examples of how queries would be formed for these. Could anyone give me examples of how queries are written under GAV and LAV, please?
I am specifically asking about GAV and LAV here.
I know that GAV (Global As View) is described over the data sources and that LAV (Local As View) is described over the mediated schema. However, I am not totally sure what those terms mean, nor how they affect the query produced.
There is a Wikipedia page for GAV, but it has no example of a query, and sadly there isn't a Wikipedia page for LAV.
I think these terms are not widely used in industry; the only references I can see for them appear to arise from academic work. They apply to Enterprise Information Integration (EII), a genre of technology where a client-side reporting or integration layer is placed over existing databases without actually persisting the data into a separate reporting database.
Essentially, 'Global As View' describes the approach where data is transformed into a unified representation before reporting queries are issued. In a data warehouse (where the data is transformed and persisted into a separate database) this view would be the data warehouse tables. An EII tool can do this by issuing queries to the underlying data sources and merging the results into the centralised schema. EII is not a widely used technology, though.
'Local As View' techniques query all the sources individually and then merge the result sets together. Conceptually, this amounts to writing several queries against the different sources that produce result sets in the same format, but source the data from wherever it is found in the underlying systems. The data integration is then done in the reporting layer.
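To make that concrete, here is a toy example (the schemas are my own invention, not taken from any particular system). Suppose the mediated schema has a single relation Film(title, year, studio), and there are two sources: S1(title, year) listing films from studio A and S2(title, year) listing films from studio B.
Under GAV, each mediated relation is defined as a view over the sources, for example Film(t, y, 'A') :- S1(t, y) and Film(t, y, 'B') :- S2(t, y). A user query against Film is answered by unfolding: "titles of films released after 2000" simply expands into the union of the corresponding queries over S1 and S2.
Under LAV, each source is instead described as a view over the mediated schema, for example S1(t, y) :- Film(t, y, 'A'). The same user query now has to be rewritten in terms of the available views (using algorithms such as the bucket algorithm or MiniCon), which is harder to do but makes it much easier to add or remove sources without touching the other mappings.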

Redis, CouchDB or Cassandra? [closed]

What are the strengths and weaknesses of the various NoSQL databases available?
In particular, it seems like Redis is weak when it comes to distributing write load over multiple servers. Is that the case? Is it a big problem? How big does a service have to grow before that could be a significant problem?
The strengths and weaknesses of the NoSQL databases (and also SQL databases) are highly dependent on your use case. For very large projects, performance is king; but for brand-new projects, or projects where time and money are limited, simplicity and time-to-market are probably the most important. For teaching yourself (broadening your perspective, becoming a better, more valuable programmer), perhaps the most important thing is simple, solid fundamental concepts.
What kind of project do you have in mind?
Some strengths and weaknesses, off the top of my head:
Redis
Very simple key-value "global variable server" (see the sketch after this list)
Very simple (some would say "non-existent") query system
Easily the fastest in this list
Transactions
Data set must fit in memory
Immature clustering, with unclear future (I'm sure it'll be great, but it's not yet decided.)
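A tiny sketch of the "global variable server" idea plus a MULTI/EXEC transaction, using the Jedis Java client (the key names are made up):

import redis.clients.jedis.Jedis;
import redis.clients.jedis.Transaction;

public class RedisSketch {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // Plain key-value: effectively a shared global variable.
            jedis.set("site:motd", "hello");
            System.out.println(jedis.get("site:motd"));

            // MULTI/EXEC transaction: both commands are applied atomically.
            Transaction tx = jedis.multi();
            tx.incr("stats:visits");
            tx.sadd("stats:visitors", "user_001");
            tx.exec();
        }
    }
}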
Cassandra
Arguably the most community momentum of the BigTable-like databases
Probably the easiest of this list to manage in big/growing clusters
Support for map/reduce, good for analytics, data warehousing
Multi-datacenter replication
Tunable consistency/availability
No single point of failure
You must know what queries you will run early in the project, to prepare the data shape and indexes
CouchDB
Hands-down the best sync (replication) support, supporting master/slave, master/master, and more exotic architectures
HTTP protocol: browsers and apps can interact directly with the DB, partially or entirely (sync is also done over HTTP); see the sketch after this list
After a brief learning curve, a pretty sophisticated query system using JavaScript and map/reduce
Clustered operation (no SPOF, tunable consistency/availability) is currently a significant fork (BigCouch). It will probably merge into Couch but there is no roadmap.
Similarly, clustering and multi-datacenter replication are theoretically possible (the "exotic" thing I mentioned); however, you must write all of that tooling yourself at this time.
The append-only file format (for both databases and indexes) consumes disk surprisingly quickly, and you must manually run compaction (vacuuming), which makes a full copy of all records in the database. The same is required for each index file. Again, you have to be your own toolsmith.
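To illustrate the HTTP point, here is a sketch that talks to a local CouchDB purely over plain HTTP using Java's built-in HttpClient; the database name and document are made up, and it assumes a local instance that accepts unauthenticated requests:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CouchSketch {
    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();
        String base = "http://localhost:5984";

        // Create a database: PUT /notes
        http.send(HttpRequest.newBuilder(URI.create(base + "/notes"))
                .PUT(HttpRequest.BodyPublishers.noBody()).build(),
                HttpResponse.BodyHandlers.ofString());

        // Create a document: PUT /notes/note-1 with a JSON body
        http.send(HttpRequest.newBuilder(URI.create(base + "/notes/note-1"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString("{\"text\":\"hello couch\"}"))
                .build(),
                HttpResponse.BodyHandlers.ofString());

        // Read it back: GET /notes/note-1
        HttpResponse<String> doc = http.send(
                HttpRequest.newBuilder(URI.create(base + "/notes/note-1")).GET().build(),
                HttpResponse.BodyHandlers.ofString());
        System.out.println(doc.body());
    }
}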
Take a look at http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis. The author does a good job of summing up why you would use one over the other.

Key-Value Stores vs. RDBMs vs. "Cloud" DBs (SDB) [closed]

I'm comfortable in the MySQL space, having designed several apps over the past few years and then continuously refined their performance and scalability. I also have some experience working with memcached to provide application-side speed-ups on frequently queried result sets. And recently I implemented Amazon SDB as my primary "database" for an e-commerce experiment.
To oversimplify, the quick justification I went through in my mind for using the SDB service was that a schema-less database structure would allow me to focus on the logical problem of my project and rapidly accumulate content in my data store. That is, don't worry about setting up and normalizing all possible permutations of a product's attributes beforehand; simply start loading in the products, and SDB will remember everything that is available.
Now that I have managed to get through the first few iterations of my project and need to set up simple interfaces to the data, I am running into issues with things I had taken for granted when working with MySQL, e.g. grouping in SELECT statements and LIMIT syntax to query "items 50 to 100". The ease advantage I gained from SDB's schema-free architecture, I lost to the performance hit of querying/looping over a result set with just over 1,800 items.
Now I'm reading about projects like Tokyo Cabinet that are extending the concept of in-memory key-value stores to provide pseudo-relational functionality at ridiculously faster speeds (14x, I read somewhere).
My question:
Are there some rudimentary guidelines or heuristics that I, as an application designer/developer, can go through to evaluate which DB technology is the most appropriate at each stage of my project?
For example: at a prototyping stage, where the logical/technical unknowns of the application keep the data structure fluid, use SDB.
At a more mature stage, where user deliverables are a priority, use traditional tools where you don't have to spend dev time writing sorting, grouping or pagination logic.
Practical experience with these tools would be very much appreciated.
Thanks SO!
Shaheeb R.
The problems you are finding are why RDBMS specialists view some of the alternative systems with a jaundiced eye. Yes, the alternative systems handle certain specific requirements extremely fast, but as soon as you want to do something else with the same data, the fleetest suddenly becomes the laggard. By contrast, an RDBMS typically manages the variations with greater aplomb; it may not be quite as fast as the fleetest for the specialized workload which the fleetest is micro-optimized to handle, but it seldom deteriorates as fast when called upon to deal with other queries.
The new solutions are not silver bullets.
Compared to traditional RDBMS, these systems make improvements in some aspect (scalability, availability or simplicity) by trading-off other aspects (reduced query capability, eventual consistency, horrible performance for certain operations).
Think of these not as replacements for the traditional database, but as specialized tools for a known, specific need.
Take Amazon SimpleDB, for example: SDB is basically a huge spreadsheet. If that is what your data looks like, then it will probably work well, and its superb scalability and simplicity will save you a lot of time and money.
If your system requires very structured and complex queries but you insist on one of these cool new solutions, you will soon find yourself in the middle of re-implementing an amateurish, ill-designed RDBMS, with all of its inherent problems.
In this respect, if you do not know whether these will suit your needs, I think it is actually better to do your first few iterations in a traditional RDBMS, because it gives you the best flexibility and capability, especially in a single-server deployment under modest load (see the CAP theorem).
Once you have a better idea of what your data will look like and how it will be used, you can then match your needs with an alternative solution.
If you want the simplicity of a cloud-hosted solution but need a relational database, you can check out Amazon Relational Database Service.