Refactoring a NoSql Database Schema - mongodb

I am considering Nosql for a project we are about to start (see https://stackoverflow.com/a/20588134/1838739). In regard to modifying NoSql i am confused about this statement "While the network/graph topology is faster than the Set topology the graph data model once implemented is almost impossible to change." Is this true for Nosql? How flexible is it to modify if changes are needed in a production implemented db.

Graph database are very open for change since the relationships are part of the data and not part of a schema.
To get started I recommend to read a little bit on Cypher (http://www.neo4j.org/learn/cypher or http://docs.neo4j.org/chunked/stable/cypher-query-lang.html). You'll see quickly how easy graphs can be changed.

Related

Auto complete on Large data set

I'm writing a project where I need to do an autocomplete on a data set that has 5 milion objects (schema is different for objects).
My first thought was to do SQL, but since Schema is changing it will not be fast
So I thought about MongoDB.
Two questions:
1 - do you have sample code that's working that I can use?
2- is Mongo the best solution in place? will it be fast? is there another NoSQL database that I can use instead?
If the time is critical and you wish to have the fastest database than Redis may be the database you are looking for. Here is a link to the Auto complete blog post using Redis.
MongoDB is a great database and includes many great feature so it may be a good choice either.

Data mining with postgres in production environment - is there a better way?

There is a web application which is running for a years and during its life time the application has gathered a lot of user data. Data is stored in relational DB (postgres). Not all of this data is needed to run application (to do the business). However form time to time business people ask me to provide reports of this data data. And this causes some problems:
sometimes these SQL queries are long running
quires are executed against production DB (not cool)
not so easy to deliver reports on weekly or monthly base
some parts of data is stored in way which is not suitable for such
querying (queries are inefficient)
My idea (note that I am a developer not the data mining specialist) how to improve this whole process of delivering reports is:
create separate DB which regularly is update with production data
optimize how data is stored
create a dashboard to present reports
Question: But is there a better way? Is there another DB which better fits for such data analysis? Or should I look into modern data mining tools?
Thanks!
Do you really do data mining (as in: classification, clustering, anomaly detection), or is "data mining" for you any reporting on the data? In the latter case, all the "modern data mining tools" will disappoint you, because they serve a different purpose.
Have you used the indexing functionality of Postgres well? Your scenario sounds as if selection and aggregation are most of the work, and SQL databases are excellent for this - if well designed.
For example, materialized views and triggers can be used to process data into a scheme more usable for your reporting.
There are a thousand ways to approach this issue but I think that the path of least resistance for you would be postgres replication. Check out this Postgres replication tutorial for a quick, proof-of-concept. (There are many hits when you Google for postgres replication and that link is just one of them.) Here is a link documenting streaming replication from the PostgreSQL site's wiki.
I am suggesting this because it meets all of your criteria and also stays withing the bounds of the technology you're familiar with. The only learning curve would be the replication part.
Replication solves your issue because it would create a second database which would effectively become your "read-only" db which would be updated via the replication process. You would keep the schema the same but your indexing could be altered and reports/dashboards customized. This is the database you would query. Your main database would be your transactional database which serves the users and the replicated database would serve the stakeholders.
This is a wide topic, so please do your diligence and research it. But it's also something that can work for you and can be quickly turned around.
If you really want try Data Mining with PostgreSQL there are some tools which can be used.
The very simple way is KNIME. It is easy to install. It has full featured Data Mining tools. You can access your data directly from database, process and save it back to database.
Hardcore way is MADLib. It installs Data Mining functions in Python and C directly in Postgres so you can mine with SQL queries.
Both projects are stable enough to try it.
For reporting, we use non-transactional (read only) database. We don't care about normalization. If I were you, I would use another database for reporting. I will desing the tables following OLAP principals, (star schema, snow flake), and use an ETL tool to dump the data periodically (may be weekly) to the read only database to start creating reports.
Reports are used for decision support, so they don't have to be in realtime, and usually don't have to be current. In other words it is acceptable to create report up to last week or last month.

How to migrate existing data managed with sqeryl?

There is a small project of mine reaching its release, based on squeryl - typesafe relational database framework for Scala (JVM based language).
I foresee multiple updates after initial deployment. The data entered in the database should be persisted over them. This is impossible without some kind of data migration procedure, upgrading data for newer DB schema.
Using old data for testing new code also requires compatibility patches.
Now I use automatic schema generation by framework. It seem to be only able create schema from scratch - no data persists.
Are there methods that allow easy and formalized migration of data to changed schema without completely dropping automatic schema generation?
So far I can only see an easy way to add columns: we dump old data, provide default values for new columns, reset schema and restore old data.
How do I delete, rename, change column types or semantics?
If schema generation is not useful for production database migration, what are standard procedures to follow for conventional manual/scripted redeployment?
There have been several discussions about this on the Squeryl list. The consensus tends to be that there is no real best practice that works for everyone. Having an automated process to update your schema based on your model is brittle (can't handle situations like column renames) and can be dangerous in production. Personally, I like the idea of "migrations" where all of your schema changes are written as SQL. There are a few frameworks that help with this and you can find some of them here. Personally, I just use a light wrapper around the psql command line utility to do schema migrations and data loading as it's a lot faster for the latter than feeding in the data over JDBC.

Java ORMs on NoSQL DB like HBase

I have recently started getting familiarized with NoSQL (HBase). I am definitely a noob.
I was investigating about ORMs and high level clients which can be used on HBase and came across a few.
Some ORM libraries like Kundera are providing SQL like data query functionality. I am finding this a little counter intuitive.
Can any one help me understand why we would again need SQL like querying if the whole objective was to move away from it?
Also can anyone comment on your experiences with ORMs for HBase? I looked at a few of them from http://wiki.apache.org/hadoop/SupportingProjects and started looking at Kundera.
Another related question - Does data query with Kundera run map reduce jobs internally?
kundera or Spring data might provide user friendly ORM layer over NoSQL databases, but the underlying entity model still has to be NoSQL friendly. This means that NoSQL users should not blindly follow RDBMS modeling strategies but design ORM entities in such a way so that all NoSQL capabilities can be used.
As a thumb rule, the kundera ORM entities should be designed using query-first strategy where first the queries need to defined so as to create primary keys and also ensuring that relationship model is used as minimal as possible. Querying on random columns and full scans should be avoided and so data might have to be replicated across entities for reducing multiple entity look ups. Also, transactions management needs to be planned. FYI, kundera does not support transactions(beyond single row TX supported by Hbase/Cassandra).
Reason for using Kundera:
1) If looking for SQL like support over HBase. As it is build on top of HBase native API, so it simply transforms these SQL queries in to corresponding GET or PUT method calls.
2) Currently it support HBase-0.20.6 only. Kundera-2.0.6 will enable support for HBase 0-90.x versions.
3) Kundera does not do sometihng out of the box to provide map reduce over SQL like queries. However support for such thing will be provided in Kundera-2.0.6 by enabling support for Hive native queries only!
It is totally JPA compliant, so no need to learn something new. It simply hides complexity at developer level with very minimal effort.
SQL like querying is for developement ease, quick developement, less error prone and reusability ofcourse!
-Vivek

Relation between ER Modelling and Database normalization

How is database normalization related to ER Modelling??
What comes first??
Or should both be implemented at the same time??
I feel modeling should come first in a highly normalized database design.
Creating the model allows you to think through how the tables will relate to one another and also allows you to envision what tables you'll need to use when writing your join queries.
Using a tool such as MySQL Workbench or Toad Data Modeler , depending on your target database vendor, can even generate SQL commands to build the tables, constraints, and indexes directly from the model. This is useful because it ensures the tables are created exactly as you designed them.
Also, when making changes to the model, some tools like those mentioned above will even allow you to "update" your schema by issuing the necessary statements required to do so.
So in short, for a project with more than one table, I'd always model it first. It also makes it easier for developers to understand how the tables function and relate at a glance rather than having to read through DDL to understand it.
Modeling can even be fun!
A model created with MySQL Workbench:
Hope this helps!