How to build local environment with large databases - postgresql

I have two data stores (PostgreSQL and MongoDB), and since I need to develop the application locally on my computer (ideally offline), I need the data from those stores copied to my HDD.
However, these are massive databases, with hundreds of gigabytes of data.
I don't need all the data stored there, just a sample of it, so that I can run my app locally against that data. Both stores have capable tools for data export (pg_dump, mongodump, mongoexport, etc.).
But I don't know how to export a small sample of data easily and efficiently. Even if I took the list of all tables/collections and built a whitelist defining which tables should be limited in number of rows, there would still be trouble with triggers, functions, indexes, etc.

I don't know about testing for MongoDB, but for PostgreSQL here's what I do.
I follow a pattern while developing against databases that separates the DB side from the app side. For testing the DB side, I have a test schema which includes a single stored procedure that resets all the data in the real schema. This reset is done following the MERGE pattern (delete any records with an unrecognized key, update records that have matching keys but which are changed, and insert missing records). This reset is called before running every unit test. This gives me simple, clear test coverage for stored functions.
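As a rough illustration of that reset, here is a minimal sketch in Python. It uses an in-memory SQLite database as a stand-in for PostgreSQL, and the table name, column names, and reference rows are all hypothetical. In the setup described above the same three steps would live in a stored procedure in the test schema instead.

```python
import sqlite3

# Hypothetical reference data the tests expect; dict keys are primary keys.
REFERENCE_ROWS = {1: "alice", 2: "bob", 3: "carol"}


def reset_customers(conn, reference):
    """Bring the customers table back to the reference state (MERGE pattern)."""
    existing = dict(conn.execute("SELECT id, name FROM customers"))
    # 1. Delete records whose key is not in the reference set.
    for key in existing.keys() - reference.keys():
        conn.execute("DELETE FROM customers WHERE id = ?", (key,))
    # 2. Update records whose key matches but whose value has changed.
    for key in existing.keys() & reference.keys():
        if existing[key] != reference[key]:
            conn.execute(
                "UPDATE customers SET name = ? WHERE id = ?", (reference[key], key)
            )
    # 3. Insert records that are missing.
    for key in reference.keys() - existing.keys():
        conn.execute(
            "INSERT INTO customers (id, name) VALUES (?, ?)", (key, reference[key])
        )
    conn.commit()


def make_dirty_db():
    """Create a database in the kind of state a previous test might leave behind."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [(1, "alice"), (2, "changed"), (99, "stray")],
    )
    conn.commit()
    return conn
```

Calling reset_customers(make_dirty_db(), REFERENCE_ROWS) before each test removes the stray row, repairs the changed one, and adds the missing one, so every test starts from the same data.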
For testing code that calls into the database, the database layer is always mocked, so there are never any calls that actually go to the database.
What you are describing suggests to me that you are attempting to mix unit testing with integration testing, and I rather strongly suggest that you don't do that. Integration testing is what happens when you've already proved base functionality and want to prove integration between components, and probably performance as well. For IT, you really need a representative data set on representative hardware. Usually this means a dedicated machine, and using Hudson for CI.
The direction you seem to be going in is going to be difficult because, as you've already noticed, it's difficult to handle that volume of data and it's difficult to generate representative data sets (most CI systems actually use production data that's been "cleaned" of sensitive information).
Which is why most of the places I've worked have not gone that way.

Just copy it all. Several hundred gigabytes is not very much by today's standards; you can buy a 2000 GB disk for $80.
If you test your code on a small data sample, how do you know whether it will be efficient enough for the full database?
Just remember to encrypt it with a strong password if it leaves your company building.


tableau extract vs live

I just need a bit more clarity around Tableau extract vs. live. I have 40 people who will use Tableau and a bunch of custom SQL scripts. If we go down the extract path, will the custom SQL queries run only once, with all instances of Tableau using a single result set, or will each instance of Tableau run the custom SQL separately and cache those results only locally?
There are some aspects of your configuration that aren't completely clear from your question. Tableau extracts are a useful tool: they are essentially a temporary, but persistent, cache of query results. They act much like a materialized view in many respects.
You will usually want to employ your extract in a central location, often on Tableau Server, so that it is shared by many users. That's typical. With some work, you can make each individual Tableau Desktop user have a copy of the extract (say by distributing packaged workbooks). That makes sense in some environments, say with remote disconnected users, but is not the norm. That use case is similar to sending out data marts to analysts each month with information drawn from a central warehouse.
So the answer to your question is that Tableau provides features that you can employ as you choose to best serve your particular use case: either replicated or shared extracts. The trick is then just to learn how extracts work and employ them as desired.
The easiest way to have a shared extract is to publish it to Tableau Server, either embedded in a workbook or separately as a data source (which is then referenced by workbooks). The easiest way to replicate extracts is to export your workbook as a packaged workbook, after first making an extract.
A Tableau data source is the meta data that references an original source, e.g. CSV, database, etc. A Tableau data source can optionally include an extract that shadows the original source. You can refresh or append to the extract to see new data. If published to Tableau Server, you can have the refreshes happen on schedule.
Storing the extract centrally on Tableau Server is beneficial, especially for data that changes relatively infrequently. You can capture the query results, offload work from the database, reduce network traffic and speed your visualizations.
You can further improve performance by filtering (and even aggregating) extracts to have only the data needed to display your viz. Very useful for large data sources like web server logs to do the aggregation once at extract creation time. Extracts can also just capture the results of long running SQL queries instead of repeating them at visualization time.
If you do make aggregated extracts, just be careful that any further aggregation you do in the visualization makes sense. SUMs of SUMs and MINs of MINs are well defined. Averages of averages and the like are not always meaningful.
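To make the warning concrete, here is a small Python sketch with made-up numbers (the rows and field names are hypothetical, and this is plain Python, not anything Tableau runs). It pre-aggregates rows per region the way an aggregated extract might, then aggregates again.

```python
# Hypothetical raw rows: (region, response_time_ms)
rows = [("east", 10), ("east", 20), ("east", 30), ("west", 100)]


def aggregate(rows):
    """Pre-aggregate per region, roughly what an aggregated extract would hold."""
    groups = {}
    for region, value in rows:
        groups.setdefault(region, []).append(value)
    return {
        region: {
            "sum": sum(vals),
            "min": min(vals),
            "count": len(vals),
            "avg": sum(vals) / len(vals),
        }
        for region, vals in groups.items()
    }


extract = aggregate(rows)
values = [value for _, value in rows]

# Re-aggregating sums and minimums gives the same answer as the raw rows.
sum_of_sums = sum(g["sum"] for g in extract.values())  # 160
min_of_mins = min(g["min"] for g in extract.values())  # 10

# Averaging the per-region averages ignores how many rows each region had.
avg_of_avgs = sum(g["avg"] for g in extract.values()) / len(extract)  # 60.0
true_avg = sum(values) / len(values)  # 40.0

# Keeping sum and count in the extract lets you recover the true average.
weighted_avg = sum(g["sum"] for g in extract.values()) / sum(
    g["count"] for g in extract.values()
)  # 40.0
```

The usual way around this is the last line: store the sum and the row count in the extract and divide at visualization time, rather than storing the average itself.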
If you use the extract, then it will behave like a materialized SQL table, so changes made upstream of the Tableau extract will not affect the result until the extract is refreshed.
The extract is used when the data needs to be processed very fast. In this case, a copy of the source data is stored in the Tableau memory engine, so query execution is very fast compared to a live connection. The only problem with this method is that the data won't automatically update when the source data is updated.
The live connection is used when handling real-time data. Here each query is run against the source data, so the performance won't be as good as with the extract.
If you need to work on a static database, use an extract; otherwise use a live connection.
I sense from your question that you are worried about performance, which is why you are wondering whether your users should use a Tableau extract or a live connection.
In my opinion, in both cases (live vs. extract) it all depends on your infrastructure and the size of the table. It makes no sense to make an extract of a huge table that would take hours to download (for example, 1 billion rows and 400 columns).
If all your users are connected directly to a database (not a Tableau Server), you may run into different issues. If the tables they are connecting to are relatively small and your database handles multiple users well, that may be OK. But if your database has to run many resource-intensive queries in parallel, on big tables, on a database that is not optimized for many simultaneous users and is located in a different time zone with high latency, finding a solution will be a nightmare. In the worst-case scenario you may have to change your data structure and update your infrastructure to allow 40 users to access the data simultaneously.

mongo as a main db for a complex project

Does it make sense to use MongoDB in a system with a large number of entities (50+) connected to each other, for example in a CRM? Any "success stories"?
There is a need for intensive writing and fast selection from a large number of records for some kind of analytics system.
It is definitely hard to provide a recommendation for such an open question; however, you can analyze some of the advantages of MongoDB over other databases. Most likely you are considering Mongo as an alternative to a relational database like Oracle or SQL Server.
Its main characteristics are:
Document Oriented Storage: Which basically means you can have a single or multiple documents representing your data structures. One very important thing here is that the schema is dynamic, that is, you can add more attributes without having to change your database. Pretty useful for adding flexibility to your system.
Full index support: We wouldn't expect any less than full support for indices, right?
Replication and High availability; Sharding: Very critical elements for availability, disaster recovery, and to guarantee the ability to grow with your system.
Querying: Again, a pretty critical requirement. Need to make sure you account for the dynamic schema. You will need to consider in your queries that some attributes are not defined for all documents (remember dynamic schema?).
Map/Reduce: Very useful for analytics. Recommended for aggregating large amounts of data. Should be used offline, meaning you don't run a live query against a map/reduce function, otherwise you will be sitting for a while waiting. But it is great for running batch analytics on your system.
GridFS: A great way of storing binary data. Automatically generates MD5s for your files, splits them into chunks, and can add metadata. Your files will stay with your database.
Also, the Geolocation indices are great. You can define lon,lat attributes and do searches on those.
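To illustrate the point about queries and the dynamic schema, here is a minimal sketch in plain Python. The dictionaries stand in for documents in one collection, and the field names are hypothetical. This is not MongoDB driver code, just the kind of defensive handling your query logic will need when an attribute exists on some documents and not on others.

```python
# Hypothetical documents from one collection; not all of them have "email" or "tags".
contacts = [
    {"name": "Ann", "email": "ann@example.com", "tags": ["lead"]},
    {"name": "Bob"},
    {"name": "Cid", "email": None, "tags": []},
]


def with_email(docs):
    """Return documents where "email" is present and not empty."""
    return [doc for doc in docs if doc.get("email")]


def tagged(docs, tag):
    """Return documents carrying the tag; a missing "tags" field counts as no tags."""
    return [doc for doc in docs if tag in doc.get("tags", [])]


names_with_email = [doc["name"] for doc in with_email(contacts)]
names_tagged_lead = [doc["name"] for doc in tagged(contacts, "lead")]
```

Note that "missing" and "present but null" are different states, and you have to decide for each query whether they should be treated the same way.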
Now it is up to you to see whether these features fit your needs, or whether you would rather stay with a well-known relational system.
Before jumping into a solution you should experiment and build some prototypes. You will see very early what challenges you'll have in your design.
Hope this helps.

What NoSQL solution to choose for domain names database?

I have a project that stores several million domain names in a database and performs search requests to find out whether a domain is present in the DB. The only operation I need is to check whether a given value exists. No range queries, no additional information, nothing.
The number of queries that I make to the database is rather big, for example 100'000 per user session.
I get a new database once a day, and even though it's possible to check which records were deleted and which were added, I don't think it's worth it. So I import the database into a new table and point the script to the new name.
I'm looking for a solution that can make the whole thing faster, as I don't use any SQL features. Name search and import time are important for me.
My server can't store this database in memory, even half of it, so I think some NoSQL solution working from hard drive can help me.
Can you suggest something?
A much smaller and faster solution would be to use Berkeley DB with the key-value pair API. Berkeley DB is a database library that links into your application, so there is no client/server overhead nor separate server to install and manage. Berkeley DB is very straightforward and provides, among several APIs, a simple key-value (NoSQL) API that provides all of the basic data management routines that you would expect to find in a much larger, more complex RDBMS (indexing, secondary indexes, foreign keys), but without the overhead of a SQL engine.
Disclaimer: I am the Product Manager for Berkeley DB, so I am a little biased. That said, it was designed to do exactly what you're asking for -- straightforward, fast, scalable key-value data management without unnecessary overhead.
In fact, there are many "database domain" type application services that use Berkeley DB as their primary data store. Most of the open source and/or commercial LDAP implementations use Berkeley DB (including OpenLDAP, Redhat's LDAP, Sun Directory Server, etc.). Cisco, Juniper, AT&T, Alcatel, Mitel, Motorola and many others use Berkeley DB for their gateway, authentication, and configuration management systems. They use BDB because it does exactly what they need; it's very fast, scalable and reliable.
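For a feel of what an embedded key-value store looks like in use, here is a minimal sketch. It deliberately uses Python's standard dbm.dumb module as a stand-in, not Berkeley DB itself (whose API is not shown here), and the file name and helper functions are hypothetical. The shape is the same though: the store is a file opened in-process, each domain is a key, and lookup is a membership test.

```python
import dbm.dumb
import os
import tempfile


def build_store(path, domains):
    """Write each domain as a key into a fresh on-disk store."""
    # Flag "n" always creates a new, empty store at this path.
    with dbm.dumb.open(path, "n") as db:
        for domain in domains:
            db[domain.encode("utf-8")] = b"1"  # the value is a placeholder


def domain_exists(db, domain):
    """Membership check only; the stored value is never read."""
    return domain.encode("utf-8") in db


workdir = tempfile.mkdtemp()
# Hypothetical per-day file name: build a new store for each daily import,
# then point readers at the new name, as the question already does with tables.
store_path = os.path.join(workdir, "domains-today")
build_store(store_path, ["example.com", "example.org"])

with dbm.dumb.open(store_path, "r") as db:
    found = domain_exists(db, "example.com")
    missing = domain_exists(db, "example.net")
```

dbm.dumb is slow and only meant for illustration; the point is the access pattern, with no server process and no SQL layer between the application and the data.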
You could get by quite nicely with just a Bloom filter if you can accept a very small false positive rate (assuming you use a large enough filter).
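A minimal Bloom filter sketch in Python, to show the idea. The class name, the sizing parameters, and the choice of SHA-256 with double hashing are my own assumptions for illustration, not a tuned implementation.

```python
import hashlib


class BloomFilter:
    """Set-membership filter: no false negatives, a tunable rate of false positives."""

    def __init__(self, size_bits, num_hashes):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item):
        # Derive all bit positions from two halves of one SHA-256 digest
        # (double hashing), instead of computing num_hashes separate hashes.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.size_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means definitely absent; True means probably present.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


# Hypothetical sizing: 8 Ki bits and 5 hash positions per item.
domains = BloomFilter(size_bits=8 * 1024, num_hashes=5)
for name in ["example.com", "example.org"]:
    domains.add(name)
```

A False from might_contain is always correct, so the filter can sit in front of a slower on-disk lookup and answer most "not present" queries from memory. The memory cost is a few bits per stored domain, far less than the names themselves.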
On the other hand, you could certainly use Cassandra. It makes heavy use of Bloom filters, so asking for something that doesn't exist is quick, and you don't have to worry about false positives. It's designed to handle data sets that do not fit into memory, so performance degradation there is quite smooth.
Importing any amount of data should be quick -- on a normal machine, Cassandra can handle about 15k writes per second.
Many options here. Berkeley DB certainly does the job and is probably one of the simplest solutions. Just as simple: store everything in memcached, then you have the option of splitting the cache of the values across several machines if needed (if query load or data size grows).

Main Memory DB vs Object DB

I'm currently trying to pick a database vendor.
I'm just seeking some personal opinions from fellow database developers out there.
My question is especially targeted towards people who:
1) have used a Main Memory DB (MMDB) that supports replicating to disk (hybrid) before (e.g. ExtremeDB)
2) have used Versant Object Database and/or Objectivity Database and/or Progress ObjectStore
and the question is really: if you could recommend a database vendor, based on your experience, that would suit my application.
My application is a commercial real-time (read: high-performance) object-oriented C++ GIS kind of app, where we need to do a lot of lat/lon search (i.e. given an area, find all matching targets within the area...R-Tree index).
The types of data that I would like to store in the database are all modeled as objects, and they make use of std::list and std::vector, so naturally an object database seems to make sense. I have read enough articles to convince myself that a traditional RDBMS probably isn't what I'm looking for in terms of:
performance (joins or multiple tables for dynamic-length data)
ease of programming (impedance mismatch)
However, in terms of performance:
Input data is fed into the system at about 40 MB/s.
Hence, the system will also be inserting into the database at a rate of roughly 350 inserts per second (where each object varies from 64 KB to 128 KB).
The database will constantly be searched and updated by multiple threads.
From my understanding, all of the Object DBs I have listed here use cache for storing database objects. ExtremeDB claims that since it's designed especially for memory, it can avoid overhead of caching logic, etc. See more by googling: Main Memory vs. RAM-Disk Databases: A Linux-based Benchmark
So, I'm just a bit confused. Can object DBs be used in a real-time system? Are they as "fast" as an MMDB?
Fundamentally, the difference between an MMDB and an OODB is that the MMDB has the expectation that all of its data is based in RAM, but persisted to disk at some point. An OODB, by contrast, is more conventional in that there's no expectation of the entire DB fitting into RAM.
The MMDB can leverage this by giving up on the requirement that the persisted data must "match" the in-RAM data at all times.
The way anything with persistence is going to work, is that it has to write the data to disk on update in some fashion.
Almost all DBs use some kind of log for this. These logs are basically "raw" pages of data, or perhaps individual transactions, appended to a file. When the file gets "too big", a new file is started.
Once the logs are properly consolidated in to the main store, the logs are discarded (or reused).
Now, a crude, in RAM DB can exist simply by appending transactions to a log file, and when it's restarted, it just loads the log in to RAM. So, in essence, the log file IS the database.
The downside of this technique is the longer and more transactions you have, the bigger your log/DB is, and thus the longer the DB startup time. But, ideally, you can also "snapshot" the current state, which eliminates all of the logs up to date, and effectively compresses them.
In this manner, all the routine operations of the DB have to manage is appending pages to logs, rather than updating other disk pages, index pages, etc. Since, ideally, most systems don't need to "Start up" that often, perhaps start up time is less of an issue.
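Here is a crude sketch of that "the log IS the database" idea in Python. The class and file names are hypothetical, records are stored as JSON lines for readability, and a real engine would also fsync and handle torn writes, which this sketch does not.

```python
import json
import os
import tempfile


class LogBackedStore:
    """All data lives in a dict; durability comes from an append-only log plus snapshots."""

    def __init__(self, directory):
        self.log_path = os.path.join(directory, "store.log")
        self.snapshot_path = os.path.join(directory, "store.snapshot")
        self.data = {}
        self._recover()
        self.log = open(self.log_path, "a", encoding="utf-8")

    def _recover(self):
        # Startup cost: load the latest snapshot, then replay the log written after it.
        if os.path.exists(self.snapshot_path):
            with open(self.snapshot_path, encoding="utf-8") as f:
                self.data = json.load(f)
        if os.path.exists(self.log_path):
            with open(self.log_path, encoding="utf-8") as f:
                for line in f:
                    key, value = json.loads(line)
                    self.data[key] = value

    def put(self, key, value):
        # The only disk work during normal operation is one append to the log.
        self.log.write(json.dumps([key, value]) + "\n")
        self.log.flush()
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)

    def snapshot(self):
        # Maintenance task: write the full state, then empty the log it replaces.
        tmp_path = self.snapshot_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as f:
            json.dump(self.data, f)
        os.replace(tmp_path, self.snapshot_path)
        self.log.close()
        self.log = open(self.log_path, "w", encoding="utf-8")

    def close(self):
        self.log.close()
```

Reads never touch the disk, writes cost one append, and snapshot() can be scheduled for quiet periods to keep the restart time bounded.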
So, in this way, an MMDB can be faster than an OODB, which has a different contract with the disk, maintaining both logs and disk pages. An OODB can be slower even if the entire DB fits into RAM and is properly cached, simply because it incurs disk operations beyond the log operations during normal operation, whereas in an MMDB these operations happen as a "maintenance" task, which can be scheduled during down time and/or quiet time.
As to whether either of these systems can meet your actual performance needs, I can't say.
The back ends of databases (reader and writer processes, caching, lock managing, txn log files, ACID semantics) are the same, so RDBs and OODBs are actually very similar here. The difference is the interface to the application programmer. Is your data model complicated, consisting of lots of classes with real inheritance relationships? Then OO is good. Is it relatively flat and simple? Then go RDB. What is the nature of the relationships? Is it pointer-like and set-like? Then go RDB. Is it more complicated, like (ordered) list, array, map? Then you should go OO. Also, do you have a stand-alone application with no need to integrate with other apps? Then OO is OK. Do you have to share data with other apps (i.e. several apps access the same database)? Then that's a deal-breaker for OO, and you should stick with RDB. Is the schema of your database stable, or do you expect it to evolve frequently? OODBs are bad at schema evolution, so if you expect frequent changes, stick with RDBs.

relational_database vs config_file vs spreadsheet usage

I have heard some genuine arguments for the use of relational databases vs. spreadsheets before. Relational databases provide fast reporting and (relatively speaking) reliable data warehousing, whereas spreadsheets are lightweight, fast to replicate, and easy to pass around the organization to different audiences. Although I see the advantages of each, I can rarely tell which is better in which scenario, and I always end up using a database.
In development, it's easy to forget to consider other options when one can place config settings in the database. I've run into quite a few apps where user menus, workflows and their order, and constants are defined at the database level. While this would be good if these entities were subject to change by end users at the application level, that was not the case.
So, what's your take on the roles of databases, config files, and spreadsheets?
The old adage is this.
When you use a spreadsheet to solve a problem, you now have two problems.
Database is for records of the business. Long-lasting. Permanent.
Other configuration files are for other configuration information -- not long-lasting business records. Current settings and what-not are not enduring business records, they're part of a specific software configuration that processes the business records.
Spreadsheets are -- well -- they are what they are. Too complex to be a simple, configuration file. Too simple to be a real database.
Since they're (almost) impossible to control, you need one standard, correct, idempotent result in the database. You should be able to rebuild spreadsheets from that controlled source.
Similarly, if you accept a spreadsheet for upload, you have to extract the data, and never refer back to the (almost uncontrollable) source document again.
For me, I want all of the core data to be stored in a database. Two reasons:
to allow adhoc reporting access to the data
to allow applications to share data.
Databases should contain all of the domain data, and occasionally some on-the-fly data (user preferences for example). Relational databases are most popular, but for some apps there are other options.
The config file, on the other hand, should contain all of the 'parameters' you want to be able to change in the system; the ones that are not changed rapidly (on-the-fly). Config items can be changed, but not easily, and usually not from the interface. If it's a param that you only want the coder to possibly change, it should be right in the code (so no one else has access).
If you want to fiddle with data mining, provide some generic mechanism to download a CSV file with the results of a SQL query, directly into Excel. That way people can fiddle with pivot tables, without having to alter the application's schema.
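Such a generic export mechanism can be quite small. Here is a sketch in Python, using the standard sqlite3 and csv modules; the function name and the sample table are hypothetical, and with another database you would swap in that database's driver.

```python
import csv
import io
import sqlite3


def query_to_csv(conn, sql, params=()):
    """Run a query and return the result as CSV text with a header row."""
    cursor = conn.execute(sql, params)
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    # Column names come from the query itself, so any SELECT works unchanged.
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)
    return buffer.getvalue()


# Hypothetical table, only here to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)", [("east", 10), ("east", 5), ("west", 7)]
)

report = query_to_csv(
    conn,
    "SELECT region, SUM(amount) AS total FROM orders GROUP BY region ORDER BY region",
)
```

The returned text can be served as a file download and opened directly in Excel. In a real application you would want to restrict this to read-only queries on a read-only connection.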
Spreadsheets are documents, databases are repositories for information, configuration files store rules for how a specific instance of an application should behave. If you think of it that way, it's usually not hard to make a call.