I just need a bit more clarity around Tableau extract vs. live. I have 40 people who will use Tableau and a bunch of custom SQL scripts. If we go down the extract path, will the custom SQL queries only run once, with all instances of Tableau using a single result set, or will each instance of Tableau run the custom SQL separately and only cache those results locally?
There are some aspects of your configuration that aren't completely clear from your question. Tableau extracts are a useful tool - they are essentially a temporary, but persistent, cache of query results. They act similarly to a materialized view in many respects.
You will usually want to employ your extract in a central location, often on Tableau Server, so that it is shared by many users. That's typical. With some work, you can make each individual Tableau Desktop user have a copy of the extract (say by distributing packaged workbooks). That makes sense in some environments, say with remote disconnected users, but is not the norm. That use case is similar to sending out data marts to analysts each month with information drawn from a central warehouse.
So the answer to your question is that Tableau provides features that you can employ as you choose to best serve your particular use case -- either replicated or shared extracts. The trick is then just to learn how extracts work and employ them as desired.
The easiest way to have a shared extract is to publish it to Tableau Server, either embedded in a workbook or separately as a data source (which is then referenced by workbooks). The easiest way to replicate extracts is to export your workbook as a packaged workbook, after first making an extract.
A Tableau data source is the metadata that references an original source, e.g. CSV, database, etc. A Tableau data source can optionally include an extract that shadows the original source. You can refresh or append to the extract to see new data. If published to Tableau Server, you can have the refreshes happen on a schedule.
Storing the extract centrally on Tableau Server is beneficial, especially for data that changes relatively infrequently. You can capture the query results, offload work from the database, reduce network traffic and speed up your visualizations.
You can further improve performance by filtering (and even aggregating) extracts to hold only the data needed to display your viz. This is very useful for large data sources like web server logs, where you can do the aggregation once at extract creation time. Extracts can also simply capture the results of long-running SQL queries instead of repeating them at visualization time.
If you do make aggregated extracts, just be careful that any further aggregation you do in the visualization makes sense. SUMs of SUMs and MINs of MINs are well defined; averages of averages are not always meaningful, as the sketch below shows.
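To make that concrete, here is a minimal Python sketch (the group sizes and values are invented) showing how an average of pre-aggregated averages diverges from the true average:

    # Two groups of pre-aggregated rows, e.g. from an aggregated extract.
    # Group A: 2 rows averaging 10, group B: 8 rows averaging 100.
    groups = [
        {"rows": 2, "avg": 10.0},
        {"rows": 8, "avg": 100.0},
    ]

    # "Average of averages" ignores how many rows each group contains.
    avg_of_avgs = sum(g["avg"] for g in groups) / len(groups)

    # The true average weights each group's average by its row count.
    true_avg = sum(g["avg"] * g["rows"] for g in groups) / sum(g["rows"] for g in groups)

    print(avg_of_avgs, true_avg)  # 55.0 vs 82.0 -- quite different results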
If you use an extract, it will behave like a materialized SQL table: changes upstream of the extract will not influence the results until the extract is refreshed.
An extract is used when the data needs to be queried very quickly. In this case, a copy of the source data is stored in Tableau's data engine, so query execution is very fast compared to a live connection. The only problem with this method is that the data won't automatically update when the source data is updated.
A live connection is used when handling real-time data. Here each query goes to the source database, so the performance won't be as good as with an extract.
If you need to work on a fairly static database, use an extract; otherwise use a live connection.
I get the feeling from your question that you are worried about performance issues, which is why you are wondering whether your users should use a Tableau extract or a live connection.
In my opinion, for both cases (live vs. extract) it all depends on your infrastructure and the size of the tables. It makes no sense to make an extract of a huge table that would take hours to download (for example, 1 billion rows and 400 columns).
If all your users connect directly to the database (not to a Tableau Server), you may run into different issues. If the tables they are connecting to are relatively small and your database handles multiple concurrent users well, that may be OK. But if your database has to run many resource-intensive queries in parallel, on big tables, on a database that is not optimized for many simultaneous users and is located in a different time zone with high latency, it will be a nightmare for you to find a solution. In the worst-case scenario you may have to change your data structures and upgrade your infrastructure to allow 40 users to access the data simultaneously.
I would like to report the database size to myself via email every week, compare it to the week before, and display the growth in megabytes and/or %.
I have everything done besides the comparison.
Imagine this setup:
SQL Server with 100 databases
Now there are plenty of ways to do a comparison. I thought about writing the sizes into XML with PowerShell and later reading them back with a second script that reports to me.
Since I am self-taught in PowerShell I might have gaps here, so I am afraid of missing an easy way.
Does anyone have a nice idea of how to compare the sizes?
I will manage the report and calculation myself later; I just need a good way to do the comparison.
Currently I am on PowerShell 3.0, but I can upgrade to 4.0.
Don't reinvent the wheel. SQL Server already has tools to monitor DB file sizes, and so does Performance Monitor. There are several 3rd-party products available too. Ask your local DBA whether such a system is already in place.
A common practice is to query the server for DB file sizes on, say, a daily basis and store them in a utility table with a timestamp. Calculating change volumes, ratios and whatnot can then be done on the T-SQL side. (Not that it is CPU intensive anyway.)
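For illustration only, here is a minimal Python sketch of that pattern using pyodbc; the utility database/table names (UtilityDB.dbo.DbSizeHistory) and the connection string are assumptions you would replace with your own:

    import pyodbc

    # Placeholder connection string -- adjust driver, server and authentication.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;Trusted_Connection=yes"
    )
    cur = conn.cursor()

    # Snapshot today's size per database into the (assumed) history table,
    # e.g. created with columns: sample_date DATE, db_name SYSNAME, size_mb FLOAT.
    cur.execute("""
        INSERT INTO UtilityDB.dbo.DbSizeHistory (sample_date, db_name, size_mb)
        SELECT CAST(GETDATE() AS DATE), DB_NAME(database_id), SUM(size) * 8.0 / 1024
        FROM sys.master_files
        GROUP BY database_id
    """)
    conn.commit()

    # Growth per database can then be computed on the T-SQL side, e.g. with LAG().
    cur.execute("""
        SELECT db_name, sample_date, size_mb,
               size_mb - LAG(size_mb) OVER (PARTITION BY db_name ORDER BY sample_date) AS growth_mb
        FROM UtilityDB.dbo.DbSizeHistory
        ORDER BY db_name, sample_date
    """)
    for db_name, sample_date, size_mb, growth_mb in cur.fetchall():
        print(db_name, sample_date, size_mb, growth_mb)

sys.master_files reports sizes in 8 KB pages, which is why the query multiplies by 8.0 / 1024 to get megabytes.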
I would create a CSV file for each database, with two columns, and append a row each time the script runs:
Date,Size
27.08.2014,1024
28.08.2014,1040
29.08.2014,1080
Then you can import the CSV file, sort the rows by date, compare the last two sizes, and send the result by mail.
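A minimal Python sketch of that comparison step, assuming a file laid out like the example above (the file name is a placeholder and the mailing part is left out):

    import csv
    from datetime import datetime

    # Read the per-database CSV produced by the collection script.
    with open("MyDatabase.csv", newline="") as f:
        rows = list(csv.DictReader(f))

    # Sort by date (the example uses day.month.year) and take the last two samples.
    rows.sort(key=lambda r: datetime.strptime(r["Date"], "%d.%m.%Y"))
    prev, curr = rows[-2], rows[-1]

    growth_mb = float(curr["Size"]) - float(prev["Size"])
    growth_pct = growth_mb / float(prev["Size"]) * 100

    print("Growth since {}: {:.0f} MB ({:.1f} %)".format(prev["Date"], growth_mb, growth_pct))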
Assume we want to build a simple Google Analytics clone, tracking pageviews. We will place JavaScript on websites to track pageviews.
Can the javascript dump data directly into the database without having to go through a server (preferred)?
We obviously want to dump a lot of data in there. Billions of rows.
Does the database scale easily with as little interference as possible? (DynamoDB's model is perfect: 0 overhead).
Can we do somewhat flexible querying: limit by date, and filter/limit by a number of tags?
Can the javascript dump data directly into the database without having to go through a server (preferred)?
For the databases of which I'm aware, that would require browser clients having write access to the database, such that it would be trivial for an attacker to pollute your database with some simple JavaScript. If that's sufferable, then it's certainly possible. For something like CouchDB or Cloudant, you'd just make the DB globally writable (but not readable or editable), so clients can push events as they occur.
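For example, pushing a pageview event into a world-writable CouchDB database is just an HTTP POST to the database's URL. The sketch below uses Python for brevity, but in your case it would be the equivalent fetch/XHR call from the browser; the host, database name and document fields are all invented:

    import json
    import urllib.request

    # Hypothetical CouchDB endpoint; the "pageviews" database must exist and accept writes.
    url = "http://couch.example.com:5984/pageviews"

    event = {
        "type": "pageview",
        "url": "/pricing",
        "ts": "2013-05-01T12:34:56Z",
        "tags": ["campaign-a", "mobile"],
    }

    req = urllib.request.Request(
        url,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.read())  # CouchDB answers with the new document's id and rev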
Does the database scale easily with as little interference as possible? (DynamoDB's model is perfect: 0 overhead).
Cloudant specifically is built on BigCouch, which the creators built to deal with systems generating petabytes of data per second. So, it scales. It uses Dynamo's concept of quorum to maximize consistency between nodes.
FYI: BigCouch is merging with CouchDB later this year.
Can we do somewhat flexible querying: limit by date, and filter/limit by a number of tags?
CouchDB, BigCouch, and Cloudant all use MapReduce for queries; the views are built incrementally as your data enters the system, so that accessing the results of a MapReduce query occurs in O(log n) time. Each system also provides special methods for streaming information about changes to the database as they occur, which is perfect for a dashboard.
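As a rough sketch of what such a view can look like: a design document whose map function emits each pageview keyed by date, with the built-in _count reduce to total them. The design document is plain JSON, so it can be created from any HTTP client; the names below (the pageviews database, the _design/stats document, the views_by_day view) are invented for illustration:

    import json
    import urllib.request

    design = {
        "_id": "_design/stats",
        "views": {
            "views_by_day": {
                # Map: emit [year, month, day] for every pageview document.
                "map": """function(doc) {
                    if (doc.type === 'pageview') {
                        var d = doc.ts.split('T')[0].split('-');
                        emit([d[0], d[1], d[2]], null);
                    }
                }""",
                # Built-in reduce that counts the emitted rows.
                "reduce": "_count",
            }
        },
    }

    req = urllib.request.Request(
        "http://couch.example.com:5984/pageviews/_design/stats",
        data=json.dumps(design).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    urllib.request.urlopen(req)

    # Query it grouped by day and limited to a date range, e.g.:
    #   GET /pageviews/_design/stats/_view/views_by_day
    #       ?group_level=3&startkey=["2013","05","01"]&endkey=["2013","05","31"]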
I have two storages (PostgreSQL, MongoDB), and as I need to develop the application locally on my computer (ideally offline), I need data from those storages to be copied to my HDD.
However, these are massive databases with around hundreds of gigabytes of data.
I don't need all the data stored there, just a sample of it so I can launch my app locally against that data. Both storages have capable tools for data export (pg_dump, mongodump, mongoexport, etc.).
But I don't know how to easily and effectively export a small sample of the data. Even if I took the list of all tables/collections and built a whitelist defining which tables should be limited to a certain number of rows, there would still be trouble with triggers, functions, indexes, etc.
I don't know about testing for MongoDB, but for PostgreSQL here's what I do.
I follow a pattern while developing against databases that separates the DB side from the app side. For testing the DB side, I have a test schema which includes a single stored procedure that resets all the data in the real schema. This reset is done following the MERGE pattern (delete any records with an unrecognized key, update records that have matching keys but which are changed, and insert missing records). This reset is called before running every unit test. This gives me simple, clear test coverage for stored functions.
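A minimal sketch of how that reset gets wired into the tests, here using pytest and psycopg2; the connection string and the test.reset_data() / public.order_total() names are purely illustrative, not the actual schema:

    import psycopg2
    import pytest

    @pytest.fixture
    def db():
        # Placeholder connection details for a local development database.
        conn = psycopg2.connect("dbname=myapp_test user=postgres")
        with conn.cursor() as cur:
            # Reset the real schema to its known baseline before every test;
            # the MERGE-style logic lives inside this stored function.
            cur.execute("SELECT test.reset_data()")
        conn.commit()
        yield conn
        conn.close()

    def test_order_total(db):
        with db.cursor() as cur:
            cur.execute("SELECT public.order_total(%s)", (42,))  # hypothetical function under test
            assert cur.fetchone()[0] is not None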
For testing code that calls into the database, the database layer is always mocked, so there are never any calls that actually go to the database.
What you are describing suggests to me that you are attempting to mix unit testing with integration testing, and I rather strongly suggest that you don't do that. Integration testing is what happens when you've already proved base functionality and want to prove integration between components, and probably performance as well. For IT, you really need a representative data set on representative hardware. Usually this means a dedicated machine, and using Hudson for CI.
The direction you seem to be going in will be difficult because, as you've already noticed, it's hard to handle that volume of data and hard to generate representative data sets (most CI systems actually use production data that's been "cleaned" of sensitive information).
Which is why most of the places I've worked have not gone that way.
Just copy it all. Several hundred gigabytes is not very much by today's standards: you can buy a 2000 GB disk for $80.
If you test your code on a small data sample, how do you know whether your code will be efficient enough for the full database?
Just remember to encrypt it with a strong password if it leaves your company building.
As part of my work we get approximately 25 TB worth of log files annually; currently they are saved on an NFS-based filesystem. Some are archived as zipped/tar.gz files while others reside in plain text format.
I am looking for alternatives to the NFS-based system. I looked at MongoDB and CouchDB; the fact that they are document-oriented databases seems to make them the right fit. However, the log file content would need to be converted to JSON to be stored in the DB, which is something I am not willing to do. I need to retain the log file content as is.
As for usage, we intend to put a small REST API in front of it and allow people to get a file listing, the latest files, and the ability to retrieve a file.
The proposed solution needs to be some form of distributed database or application-level filesystem where one can store log files and scale horizontally and effectively by adding more machines.
Ankur
Since you don't want querying features, you can use Apache Hadoop.
I believe HDFS and HBase will be a nice fit for this.
You can see a lot of huge-storage stories on the Hadoop 'Powered By' page.
Take a look at Vertica, a columnar database supporting parallel processing and fast queries. Comcast used it to analyze about 15 GB/day of SNMP data, running at an average rate of 46,000 samples per second, using five quad-core HP ProLiant servers. I heard some Comcast operations folks rave about Vertica a few weeks ago; they still really like it. It has some nice data compression techniques and "k-safety redundancy", so they could dispense with a SAN.
Update: One of the main advantages of a scalable analytics database approach is that you can do some pretty sophisticated, quasi-real time querying of the log. This might be really valuable for your ops team.
Have you tried looking at gluster? It is scalable, provides replication and many other features. It also gives you standard file operations so no need to implement another API layer.
http://www.gluster.org/
I would strongly recommend against using a key/value or document-based store for this data (Mongo, Cassandra, etc.). Use a file system. This is because the files are so large, and the access pattern will be linear scan. One problem that you will run into is retention: most of the "NoSQL" storage systems use logical deletes, which means that you have to compact your database to remove deleted rows. You'll also have a problem if your individual log records are small and you have to index each one of them - your index will be very large.
Put your data in HDFS with 2-3 way replication in 64 MB chunks in the same format that it's in now.
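For illustration, loading a day's logs into HDFS with an explicit replication factor and block size might look like the following; the paths are invented, and driving the hdfs CLI from Python is just one convenient way to script it:

    import subprocess

    # Copy a log file into HDFS, requesting 3-way replication and 64 MB blocks.
    # The -D flags are Hadoop generic options; dfs.blocksize is given in bytes.
    subprocess.run(
        [
            "hdfs", "dfs",
            "-D", "dfs.replication=3",
            "-D", "dfs.blocksize=67108864",  # 64 MB
            "-put", "/var/log/app/2013-05-01.log.gz", "/logs/2013/05/01/",
        ],
        check=True,
    )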
If you are to choose a document database:
With CouchDB you can use the attachments API to attach a file as-is to a document; the document itself could contain only metadata (like timestamp, locality, etc.) for indexing. Then you will have a REST API for the documents and the attachments.
A similar approach is possible with Mongo's GridFS, but you would build the API yourself.
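As a rough example of the CouchDB route, sketched here in Python; the server, database name, document id and metadata fields are all invented:

    import json
    import urllib.request

    base = "http://couch.example.com:5984/logs"

    # 1. Create a metadata-only document describing the log file.
    meta = {"host": "web01", "date": "2013-05-01",
            "original_path": "/var/log/app/2013-05-01.log.gz"}
    req = urllib.request.Request(
        base + "/web01-2013-05-01",
        data=json.dumps(meta).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    rev = json.load(urllib.request.urlopen(req))["rev"]

    # 2. Attach the raw, unmodified log file to that document.
    with open("/var/log/app/2013-05-01.log.gz", "rb") as f:
        req = urllib.request.Request(
            base + "/web01-2013-05-01/logfile.gz?rev=" + rev,
            data=f.read(),
            headers={"Content-Type": "application/gzip"},
            method="PUT",
        )
        urllib.request.urlopen(req)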
Also HDFS is a very nice choice.
I have heard some genuine arguments for the use of relational databases vs. spreadsheets before. A relational database provides fast reporting and (relatively speaking) reliable data warehousing, whereas spreadsheets are lightweight, fast to replicate, and easy to float around the organization to different audiences. Although I see the advantages of each, I can rarely tell which is better in which scenario, and I always end up using a database.
In development, it's easy to forget to consider other options when one can place config settings in the database. I've run into quite a few apps where user menus, workflows and their orders, and constants are defined at the database level. While this would be fine if those entities were subject to change by end users from the application level, that was not the case.
So, what's your take on the roles of databases, config files, and spreadsheets?
The old adage is this.
When you use a spreadsheet to solve a problem, you now have two problems.
Database is for records of the business. Long-lasting. Permanent.
Other configuration files are for other configuration information -- not long-lasting business records. Current settings and what-not are not enduring business records, they're part of a specific software configuration that processes the business records.
Spreadsheets are -- well -- they are what they are. Too complex to be a simple, configuration file. Too simple to be a real database.
Since they're (almost) impossible to control, you need one standard, correct, idempotent result in the database. You should be able to rebuild spreadsheets from that controlled source.
Similarly, if you accept a spreadsheet for upload, you have to extract the data, and never refer back to the (almost uncontrollable) source document again.
For me, I want all of the core data to be stored in a database. Two reasons:
to allow ad hoc reporting access to the data
to allow applications to share data.
Databases should contain all of the domain data, and occasionally some on-the-fly data (user preferences for example). Relational databases are most popular, but for some apps there are other options.
The config file, on the other hand, should contain all of the 'parameters' you want to be able to change in the system; the ones that are not changed rapidly (on-the-fly). Config items are changeable, but not easily, and usually not from the interface. If it's a parameter that you only want the coder to be able to change, it should live right in the code (so no one else has access).
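As a small sketch of that split, the slowly-changing parameters could live in a plain config file read at startup; the file name, sections and keys below are only an example:

    import configparser

    # settings.ini might hold the slowly-changing parameters described above, e.g.:
    #   [smtp]
    #   host = mail.example.com
    #   port = 25
    #   [reports]
    #   retention_days = 90
    config = configparser.ConfigParser()
    config.read("settings.ini")

    smtp_host = config.get("smtp", "host")
    retention_days = config.getint("reports", "retention_days")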
If you want to fiddle with data mining, provide some generic mechanism to download a CSV file with the results of a SQL query, directly into Excel. That way people can fiddle with pivot tables, without having to alter the application's schema.
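A minimal sketch of the core of such an export, with sqlite3 standing in for whatever relational database the application actually uses:

    import csv
    import sqlite3  # stand-in for the app's real database driver

    def query_to_csv(conn, sql, out_path):
        """Run a read-only SQL query and write the result set as a CSV file
        that Excel can open directly for pivot tables and so on."""
        cur = conn.cursor()
        cur.execute(sql)
        with open(out_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow([col[0] for col in cur.description])  # header row
            writer.writerows(cur.fetchall())

    # Example usage with an in-memory database and a made-up sales table:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", [("east", 10.0), ("west", 25.0)])
    query_to_csv(conn, "SELECT region, SUM(amount) AS total FROM sales GROUP BY region", "sales.csv")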
Spreadsheets are documents, databases are repositories for information, configuration files store rules for how a specific instance of an application should behave. If you think of it that way, it's usually not hard to make a call.