In other words, can Zeppelin be used as a Tableau replacement at small scale?
I have a new UI/UX design for a reporting dashboard. The data for the dashboard comes from a relational database (SQL Server). The dashboard is to be viewed by ~300 colleagues in my company; perhaps up to ten of them will be viewing it at the same time.
Currently the dashboard is implemented in Kibana, with data imported into Elasticsearch from SQL Server on a regular basis. However, the new design requires certain widgets and data aggregations that go beyond Kibana's dashboarding capabilities. Additionally, my organization wants to migrate this dashboard to a technology that is more familiar to the data scientists who work with us (Kibana isn't considered such).
This report and dashboard could be migrated to Tableau. Tableau is powerful enough to perform the desired data aggregations and present all the desired widgets. However, we can't afford the license costs, though we can invest as much developer time as needed.
I have evaluated a couple of open-source dashboarding tools (Metabase and Superset), and they lack the aggregations and widgets we need. I won't go into details because the question is not about specifics; it is clear that Metabase and Superset are not powerful enough for our needs.
My impression is that Apache Zeppelin is powerful enough, with its support for arbitrary Python code (I would use Pandas for the data aggregations), graphs, and widgets. However, I am not sure whether a single Zeppelin instance can handle a number of concurrent viewers well.
We'd like to build a set of notebooks and make them available to all colleagues in the organization (access control is not an issue; we trust each other). The notebooks will be interactive, with data filters and date range pickers.
It looks like Zeppelin has switchable interpreter isolation modes which we can use to isolate different users' sessions from each other. My question is whether a single t2.large AWS instance hosting Zeppelin can sustain up to ten users viewing a report aggregated over a 300k-row dataset. Also, are there any usability concerns that make multi-user viewing of a reporting dashboard impractical in Zeppelin?
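To be concrete, each notebook paragraph would be something along these lines (a rough sketch; the connection string, table, and column names are made up):

```python
# Sketch of a single Zeppelin %python paragraph, assuming pyodbc and pandas
# are installed on the instance. Table and column names are hypothetical.
import pandas as pd
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=reporting-db;DATABASE=sales;UID=dash;PWD=secret"
)

# Pull the raw rows (roughly 300k) and aggregate them with pandas.
orders = pd.read_sql(
    "SELECT region, order_date, amount FROM dbo.orders",
    conn,
    parse_dates=["order_date"],
)
summary = (
    orders.groupby(["region", pd.Grouper(key="order_date", freq="W")])
          .agg(total=("amount", "sum"), order_count=("amount", "count"))
          .reset_index()
)

# Zeppelin's display system renders TSV output prefixed with %table
# as a built-in table/chart widget.
print("%table\n" + summary.to_csv(sep="\t", index=False))
```

The data filters and date range pickers would presumably map onto Zeppelin's dynamic forms (e.g. z.input / z.select), with the chosen values substituted into the query.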
I see a couple of questions you're asking:
Can Zeppelin replace Tableau on a small scale? That depends on which Tableau features you are using. Every platform has its own feature set, and Tableau has a lot of customization options that you won't find elsewhere. Aim to get as much of your dashboard as possible converted 1:1, then warm everyone up to the idea that it will look and operate a little differently since it's on a different platform.
Can a t2.large hosting Zeppelin sustain up to 10 concurrent users viewing a report aggregated on 300k rows? A t2.large should be more than big enough to run Zeppelin, Tableau, Superset, etc. with 10 concurrent users pulling a report with 300k rows. 300k isn't really that much.
A good way to speed things up and squeeze more concurrent users onto your existing infrastructure is to speed up your data sources, since that is where a lot of the aggregation work happens. Taking a look at your ETLs and trying to aggregate ahead of time can help, as well as making sure your data scientists aren't running massive queries that slow down your database server.
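For example, a scheduled job that rebuilds a small summary table in the source database keeps the heavy lifting out of the dashboard's request path (a sketch; table and column names are illustrative):

```python
# Sketch of a pre-aggregation step run on a schedule, using pyodbc.
# All table and column names are made up for illustration.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=reporting-db;DATABASE=sales;UID=etl;PWD=secret"
)
cur = conn.cursor()

# Rebuild a compact daily summary so dashboards never scan the raw rows.
cur.execute("TRUNCATE TABLE dbo.orders_daily_summary")
cur.execute("""
    INSERT INTO dbo.orders_daily_summary (region, order_day, total_amount, order_count)
    SELECT region, CAST(order_date AS date), SUM(amount), COUNT(*)
    FROM dbo.orders
    GROUP BY region, CAST(order_date AS date)
""")
conn.commit()
```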
I'm new to PipelineDB and have yet to even run it (installation pending ...). But I'm reading over the documentation and I'm totally intrigued.
Apparently, PipelineDB can take a set-based query and mechanically transform it into an incremental representation that processes streams of deltas efficiently, with storage bounded as a function of the continuous view's output.
Is it also supported to run the set-based query as an ordinary set-based query in order to prime a continuous view? It seems to me that upon creation of a continuous view, the initial data would be computed in this traditional way. Also, since continuous views can be truncated, can they then be repopulated (from still-available source tables) without tearing down whatever dependent objects they have to allow a drop/create?
It seems to me that this feature would be critical in many practical scenarios. One easy example would be refreshing occasionally to reset the drift from rounding errors in, say, fractional averages.
Another example is if a bug were discovered and fixed in PipelineDB itself that had caused errors in the data. After the software is patched, the queries based on still-available data ought to be rerun.
Continuous views based entirely on event streams, with no permanent storage, could not be rebuilt that way. I'm not sure what happens if only some of the join sources are ephemeral.
I don't see these topics covered in the docs. Can you explain how these are or aren't a concern?
Thanks!
Jeff from PipelineDB here.
The main answer to your question is covered in the introduction section of the PipelineDB technical docs:
"PipelineDB can dramatically reduce the amount of information that needs to be persisted to disk because only the output of continuous queries is stored. Raw data is discarded once it has been read by the continuous queries that need to read it."
While continuous views only store the output of continuous queries, almost everybody who is using PipelineDB is storing their raw data somewhere cheap like S3. PipelineDB is meant to be the realtime analytics layer that powers things like realtime reporting applications and realtime monitoring & alerting systems, used almost always in conjunction with other systems for data infrastructure.
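For context, a continuous view is defined with ordinary SQL over a stream, and "re-priming" it from archived raw data amounts to replaying that data back into the stream. Roughly along these lines, using psycopg2 with the CREATE STREAM / CREATE CONTINUOUS VIEW syntax from the docs (the stream, view, and archive table names are made up):

```python
# Sketch only: replaying archived raw data into a PipelineDB stream to
# repopulate a continuous view. Names are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=pipeline user=pipeline")
cur = conn.cursor()

# Define the stream and a continuous view; only the view's aggregate
# output is persisted, the raw events are discarded after being read.
cur.execute("CREATE STREAM page_view_stream (url text)")
cur.execute("""
    CREATE CONTINUOUS VIEW views_per_url AS
    SELECT url, count(*) AS views
    FROM page_view_stream
    GROUP BY url
""")

# To prime (or rebuild) the view from raw data you still have -- e.g. an
# archive table loaded from S3 -- replay that data into the stream.
cur.execute("INSERT INTO page_view_stream (url) SELECT url FROM page_view_archive")
conn.commit()
```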
If you're interested in PipelineDB you might also want to check out the new realtime analytics API product we recently rolled out called Stride. The Stride API gives developers the benefit of continuous SQL queries, integrated storage, windowed queries, and other things like realtime webhooks, all without having to manage any underlying data infrastructure, all via a simple HTTP API.
If you have any additional technical questions you can always find our open-source users and dev team hanging out in our gitter chat channel.
I know this question has been asked before at PostgreSQL to Data-Warehouse: Best approach for near-real-time ETL / extraction of data.
But I want to rephrase this question.
I am attempting a real-time data warehouse. The difference between real-time and near real-time is huge.
I find a real-time data warehouse to be event-driven and transactional in approach, while near real-time runs the same application in batch mode but polls the data more frequently. That puts a lot of extra load on the production server and could well kill the production system: being a batch approach, it scans all the tables for changes and picks up the rows that have changed since a cut-off timestamp.
By event-driven I mean something that is specific to the tables that have undergone changes and focuses only on the transactions that are happening right now.
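To be explicit, the batch/polling variant I am describing looks roughly like this (a sketch; the watermark table and source columns are invented):

```python
# Sketch of a near-real-time polling extract driven by a cut-off timestamp
# (watermark). Connection details and table/column names are illustrative.
import pyodbc

src = pyodbc.connect("DSN=production")   # source OLTP database
dwh = pyodbc.connect("DSN=warehouse")    # warehouse staging area

def extract_changed_rows(table, cutoff):
    """Pull only the rows modified since the previous run."""
    cur = src.cursor()
    cur.execute(f"SELECT * FROM {table} WHERE last_modified > ?", cutoff)
    return cur.fetchall()

# The warehouse keeps one watermark per source table, and the job
# re-scans every watched table on each polling cycle.
wcur = dwh.cursor()
wcur.execute("SELECT table_name, cutoff_ts FROM etl_watermarks")
for table_name, cutoff_ts in wcur.fetchall():
    changed = extract_changed_rows(table_name, cutoff_ts)
    # ... load `changed` into staging, then advance the watermark ...
```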
But the source system is an elephant of a system, SAP, which let's assume has 25,000 tables. It is not easy to model that, and not easy to write database triggers on each table to capture every change. I want the impact on the production server to be minimal.
Is there any database-level trigger with which I could capture all changes happening in the database in one place? Is there any way to have that trigger run on a different database server so that the production server goes untouched?
I have not been keeping pace with changes in database technology, and I am sure some nice new technologies have come along to capture these changes easily.
I know of log miners and change data capture, but it would be difficult to filter out the information I need from the redo logs.
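For example, SQL Server's change data capture exposes changes through generated table-valued functions that an extract job can poll, which avoids per-table triggers, though the filtering problem remains (a sketch; the capture instance and bookmark table names are made up):

```python
# Sketch of polling SQL Server change data capture (CDC), assuming CDC has
# already been enabled on the source table. The capture instance name
# (dbo_orders) and the bookmark table are hypothetical.
import pyodbc

conn = pyodbc.connect("DSN=reporting_replica")  # ideally a readable secondary,
                                                # not the production primary
cur = conn.cursor()

# Fetch all changes between the last extracted LSN and the current maximum.
cur.execute("""
    DECLARE @from_lsn binary(10) =
        (SELECT last_lsn FROM etl.cdc_bookmarks WHERE capture_instance = 'dbo_orders');
    DECLARE @to_lsn binary(10) = sys.fn_cdc_get_max_lsn();
    SELECT __$operation, __$start_lsn, *
    FROM cdc.fn_cdc_get_all_changes_dbo_orders(@from_lsn, @to_lsn, 'all');
""")
changes = cur.fetchall()
# ... apply `changes` to the staging area, then advance the bookmark ...
```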
Are there alternative ways to capture database write operations on the fly?
Just for completeness' sake, let us assume the databases are a heterogeneous mix of Oracle, SQL Server, and DB2. But my point is about the concepts we want to develop.
This is a universal problem; every company is looking for an easy-to-implement solution, so a good discussion would benefit everyone.
Don't ever try to access SAP directly. Use the APIs of SAP Data Services (http://help.sap.com/bods). Look for the words "Integrator Guide" on that page for documentation.
This document should give you a good hint about where to look for your data sources (http://wiki.scn.sap.com/wiki/display/EIM/What+Extractors+to+use). Extractors are kind-of-somewhat like views in a DBMS; they abstract all the SAP stuff into something human-readable.
As far as near real-time goes, think in terms of micro-batches. Run your extract jobs every 5 (?) minutes, or longer if necessary.
Check out the Lambda Architecture from Nathan Marz (I provide no link, but you'll find the resources easily). The reference implementation is all Java and NoSQL, but the main ideas apply to classical relational databases as well. In a nutshell: you have two implementations, one real-time but responsible only for a limited time interval; the "long tail" is maintained with a classical, best-practice batch implementation.
The real-time part is always discarded after the batch refresh, which effectively blocks problems in the real-time processing from propagating into the historical data.
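The serving side boils down to a query-time merge of the two layers, something like this (a minimal sketch, not Marz's actual code; the views are plain dictionaries here):

```python
# Minimal sketch of the Lambda Architecture's query-time merge. The batch
# view is rebuilt from raw data on each batch run; the realtime view covers
# only events newer than the last batch run and is discarded afterwards.

def query_metric(key, batch_view, realtime_view):
    """Answer a query by combining the historical and the recent values."""
    historical = batch_view.get(key, 0)   # authoritative "long tail"
    recent = realtime_view.get(key, 0)    # small, recent, disposable delta
    return historical + recent

# Example: page view counts per URL.
batch_view = {"/home": 10_000, "/pricing": 2_500}
realtime_view = {"/home": 42}
print(query_metric("/home", batch_view, realtime_view))  # 10042
```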
As of now I can see only two solutions:
Write services on the source systems. If the source is COBOL, wrap it in services. Put all the services on a service bus and somehow trap when changes happen to the database; how that trap would work still needs to be explored. From the outset, though, it appears to be a very expensive and uncertain proposition, and convincing management to accept a three-year lead time would be difficult. Services are not easy.
Log shipping: this is a trusted database solution. The logs are available on another server, so the production server need not be burdened, and there are a good number of tools available as well. But the spirit doesn't match: the event-driven aspect is missing, so the action at the moment things happen is not captured. I will settle for this.
As Ron pointed out, NEVER TOUCH SAP TABLES DIRECTLY. There are adapters upon adapters for accessing SAP tables; this builds another layer in between, but it is unavoidable. One piece of good news I want to share: a customer did a study of SAP tables and found that only 14% of the tables are actually populated or touched by the SAP system. Even then, 14% of 25,000 tables comes to a huge data model of 2,000+ entities. Also, micro-batches amount to dividing the system into Purchasing, Receivables, Payables, etc., which heads toward a data mart and not an EDW. I want to focus on an enterprise data warehouse.
I am planning a large project that is going to use Neo4j as the main storage to model the data, but I was wondering if I should additionally use something like MySQL as a parallel store to back the user-administration subsystem. That would have the advantage of letting me use something with which I have much more experience (18 years) for the very tedious job of login and account administration. The disadvantage would be the overhead of keeping two databases in sync, since I do plan to have the users also represented as nodes in Neo4j with a small subset of the user information. I know that it depends a lot on the details of the project, but if anybody has experience with a similar setup, it would be very much appreciated.
Rather than building replication, stick to your chosen database & build administration tools and UI. Replication is complex & difficult, admin tools hopefully less so.
If you can't get the data out, at least as CSV for reporting, maybe reconsider your choice of DB.
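For what it's worth, keeping the users solely in Neo4j and writing the tedious admin operations against it directly might look roughly like this (a sketch using the official Python driver; the :User label, its properties, and the password-hashing helpers are assumptions):

```python
# Sketch of user administration directly on Neo4j, using the official
# Python driver. The :User label, its properties, and the hash_password /
# verify_password helpers are illustrative assumptions.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

def create_user(email, password):
    with driver.session() as session:
        session.run(
            "MERGE (u:User {email: $email}) "
            "SET u.password_hash = $pw_hash, u.created_at = timestamp()",
            email=email,
            pw_hash=hash_password(password),  # hypothetical helper
        )

def authenticate(email, password):
    with driver.session() as session:
        record = session.run(
            "MATCH (u:User {email: $email}) RETURN u.password_hash AS pw_hash",
            email=email,
        ).single()
        return record is not None and verify_password(password, record["pw_hash"])
```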
We use MongoDB database add-on on Heroku for our SaaS product. Now that Amazon launched DynamoDB, a cloud database service, I was wondering how that changes the NoSQL offerings landscape?
Specifically for cloud based services or SaaS vendors, how will using DynamoDB be better or worse as compared to say MongoDB? Are there any cost, performance, scalability, reliability, drivers, community etc. benefits of using one versus the other?
For starters, it will be fully managed by Amazon's expert team, so you can bet that it will scale very well with virtually no input from the end user (developer).
Also, since it's built and managed by Amazon, you can assume that they have designed it to work very well with their infrastructure, so performance should be top notch. In addition to being built specifically for their infrastructure, they have chosen to use SSDs as storage, so right from the start disk throughput will be significantly higher than other data stores on AWS that are HDD-backed.
I haven't seen any drivers yet, and I think it's too early to tell how the community will react to this, but I suspect that Amazon will provide drivers for all of the most popular languages, and the community will likely receive this well and in turn create additional drivers and tools.
Using MongoDB through an add-on for Heroku effectively turns MongoDB into a SaaS product as well.
In reality, one would be comparing whatever service a chosen provider offers with what Amazon can offer, rather than comparing one persistence solution to another.
This is very hard to do. Each provider will have varying levels of service at different price points, and some may consider the option of running the database on their own hardware locally for development purposes a welcome one.
I think the key difference to consider is that MongoDB is software you can install anywhere (including at AWS, at another cloud service, or in-house), whereas DynamoDB is a SaaS available exclusively as a hosted service from Amazon (AWS). If you want to retain the option of hosting your application in-house, DynamoDB is not an option. If hosting outside of AWS is not a consideration, then DynamoDB should be your default choice unless very specific features are a higher priority.
There's a table in the following link that summarizes the attributes of DynamoDB and Cassandra:
http://www.datastax.com/dev/blog/amazon-dynamodb
Something that needs improvement in DynamoDB for it to become more usable is the ability to index attributes other than the primary key.
UPDATE 1 (06/04/2013)
On 04/18/2013, Amazon announced support for Local Secondary Indexes, which made DynamoDB f***ing great:
http://aws.amazon.com/about-aws/whats-new/2013/04/18/amazon-dynamodb-announces-local-secondary-indexes/
I have to be honest; I was very excited when I heard about the new DynamoDB, and I did attend the webinar yesterday. However, it's difficult to make a decision right now because everything they said was still very vague; I have no idea what functions are going to be allowed or exposed through their service.
The one thing I do know is that scaling is handled automatically, which is pretty awesome. Yet there are still so many unknowns that it's tough to make a thorough analysis until all the facts are in and we can start using it.
Thus far I still see Mongo working much better for me (personally) in the project I've been working on.
Like most DB decisions, it's really going to come down to a project by project decision of what's best for your need.
I anxiously await more information on the product. For now, though, it is in beta, and I wouldn't jump ship to adopt the latest and greatest only to be a tester :)
I think one of the key differences between DynamoDB and other NoSQL offerings is provisioned throughput: you pay for a specific throughput level on a table, and provided you keep your data well-partitioned, you can always expect that throughput to be met. So as your application load grows, you can scale up and keep your performance more or less constant.
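That throughput is something you declare (and later adjust) per table; with today's SDKs it looks roughly like this (a sketch using boto3; the table and key names are made up):

```python
# Sketch of declaring and adjusting provisioned throughput with boto3.
# The table name and key schema are illustrative.
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Provisioned throughput is declared per table; requests beyond it get throttled.
dynamodb.create_table(
    TableName="events",
    AttributeDefinitions=[
        {"AttributeName": "user_id", "AttributeType": "S"},
        {"AttributeName": "event_time", "AttributeType": "N"},
    ],
    KeySchema=[
        {"AttributeName": "user_id", "KeyType": "HASH"},
        {"AttributeName": "event_time", "KeyType": "RANGE"},
    ],
    ProvisionedThroughput={"ReadCapacityUnits": 50, "WriteCapacityUnits": 25},
)

# Scaling up later is a single call; the table stays available while capacity changes.
dynamodb.update_table(
    TableName="events",
    ProvisionedThroughput={"ReadCapacityUnits": 200, "WriteCapacityUnits": 100},
)
```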
Amazon DynamoDB seems like a pretty decent NoSQL solution. It is fast, and it is pretty easy to use. Other than having an AWS account, there really isn't any setup or maintenance required. The feature set and API are fairly small right now compared to MongoDB/CouchDB/Cassandra, but I would expect them to grow over time as feedback from the developer community is received. Right now, all of the official AWS SDKs include a DynamoDB client.
Pros
Lightning Fast (uses SSDs internally)
Really (really) reliable. (chances of write failures are lower)
Seamless scaling (no need to do manual sharding)
Works as a web service (no servers, no configuration, no installation)
Easily integrated with other AWS features (can store the whole table into S3 or use EMR etc)
Replication is managed internally, so the chance of accidental data loss is negligible.
Cons
Very (very) limited querying.
Scanning is painful (I remember a scan through Java once ran for 6 hours)
Pre-defined throughput, which means a sudden increase beyond the set throughput will be throttled.
Throughput is partitioned as the table is sharded internally (which means that if you provisioned a throughput of 1000 and the table is partitioned in two, and you are reading only the latest data, i.e. from one partition, then your effective read throughput is only 500).
No joins, and only limited indexing allowed (basically 2).
No views, triggers, scripts, or stored procedures.
It's really good as an alternative to session storage in a scalable application. Another good use would be logging/auditing in an extensive system. NOT preferable for a feature-rich application with frequent enhancements or changes.
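As a concrete illustration of the session-storage use case, reads and writes are simple key lookups (a sketch with boto3; the table and attribute names are made up):

```python
# Minimal sketch of DynamoDB as a session store, assuming a "sessions"
# table keyed on session_id. Names are illustrative.
import boto3

sessions = boto3.resource("dynamodb", region_name="us-east-1").Table("sessions")

# Persist a session on write.
sessions.put_item(Item={
    "session_id": "abc123",
    "user_id": "42",
    "cart": ["sku-1", "sku-2"],
})

# Fetch it back on the next request by primary key.
item = sessions.get_item(Key={"session_id": "abc123"}).get("Item")
```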
I'm using an application that is very interactive and is now at the point of requiring a real analytics solution. We generate roughly 2.5-3 million events per month (and growing), and would like to build reports to analyze cohorts of users, funneling, etc. The reports are standard enough that it would seem feasible to use an existing service.
However, given the volume of data, I am worried that a hosted analytics solution like MixPanel will become very expensive very quickly. I've also looked into building a traditional star-schema data warehouse with offline background processes (I know very little about data warehousing).
This is a Ruby application with a PostgreSQL backend.
What are my options, both build and buy, to answer such questions?
Why not build your own?
Check out this open-source project as an example:
http://www.warefeed.com
It is very basic, and you will have to build the data mart features you need for your case.
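If you do end up rolling your own on top of your existing PostgreSQL, the core is usually just an append-only events table plus periodic rollups, roughly like this (a sketch via psycopg2; the schema and the cohort rollup are illustrative):

```python
# Sketch of a minimal roll-your-own events pipeline on PostgreSQL.
# The schema and the cohort rollup below are illustrative.
import psycopg2

conn = psycopg2.connect("dbname=analytics")
cur = conn.cursor()

# Append-only fact table the application writes every event into.
cur.execute("""
    CREATE TABLE IF NOT EXISTS events (
        user_id     bigint      NOT NULL,
        event_name  text        NOT NULL,
        occurred_at timestamptz NOT NULL DEFAULT now()
    )
""")

# Nightly rollup: monthly signup cohorts vs. the months in which they were active.
cur.execute("DROP TABLE IF EXISTS cohort_activity")
cur.execute("""
    CREATE TABLE cohort_activity AS
    SELECT date_trunc('month', f.first_seen)  AS cohort_month,
           date_trunc('month', e.occurred_at) AS active_month,
           count(DISTINCT e.user_id)          AS active_users
    FROM (SELECT user_id, min(occurred_at) AS first_seen
          FROM events GROUP BY user_id) f
    JOIN events e USING (user_id)
    GROUP BY 1, 2
""")
conn.commit()
```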