How can I run visual analytics on historical IoT data on #Bluemix?
There are services like Real-Time Insights and Streaming Analytics for real-time data analytics, but is there a service for historical data analytics and visualization?
There are a few different options depending on your use case, experience and data source. For example:
You can use Jupyter notebooks or RStudio on Data Science Experience. You can use R, Python or Scala to analyse data and create re-runnable reports. You can also use Spark, which is great if your data volumes are large or if you want to use Spark's vast number of connectors to different data sources. This approach is ideal for data scientists with coding experience (see the notebook sketch below).
You can use Watson Analytics if you want to do analysis without data science or coding skills. This environment is more for ad-hoc analysis than for reporting.
If you are looking to do reporting, Cognos has excellent visualization capabilities, and reports can be created by users who don't have coding skills.
There are a number of other options, but in my experience the above three tend to be the most common.
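To make the first option concrete, here is a minimal PySpark sketch of the kind of re-runnable notebook analysis described above. The Parquet path and the deviceId/timestamp/temperature columns are hypothetical; they depend entirely on where and how your IoT platform archives its events.

```python
# Minimal sketch: exploring archived IoT events with Spark in a DSX notebook.
# The object-storage path and column names below are hypothetical; substitute
# whatever your historical archive actually contains.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("iot-history").getOrCreate()

events = spark.read.parquet("swift://iot-archive/events.parquet")

# Hourly average temperature per device: a typical re-runnable report.
hourly = (events
          .groupBy("deviceId", F.window("timestamp", "1 hour"))
          .agg(F.avg("temperature").alias("avg_temp")))

# The aggregated result is small, so .toPandas() can hand it to the
# notebook's plotting libraries for the visualization step.
hourly.orderBy("deviceId", "window").show(10)
```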
In other words, can Zeppelin be used as a Tableau replacement at small scale?
I have a new UI/UX design for a reporting dashboard. Data for the dashboard comes from a relational database (SQL Server). The dashboard is to be viewed by ~300 colleagues in my company, with perhaps up to ten of them viewing it at the same time.
Currently the dashboard is implemented in Kibana, with data being imported into Elasticsearch from SQL Server on a regular basis. However, the new design requires certain widgets and data aggregations that go beyond the dashboarding capabilities of Kibana. Additionally, my organization wants to migrate this dashboard to a technology more familiar to the data scientists who work with us (Kibana isn't considered such).
This report and dashboard could be migrated to Tableau. Tableau is powerful enough to perform the desired data aggregations and present all the desired widgets. However, we can't afford the license costs, though we can invest as much developer time as needed.
I have evaluated a couple of open-source dashboarding tools (Metabase and Superset), and they lack the aggregations and widgets we need. I won't go into details because the question is not about specifics; it is clear that Metabase and Superset are not powerful enough for our needs.
I have the impression that Apache Zeppelin is powerful enough, with its support for arbitrary Python code (I would use Pandas for data aggregations), graphs and widgets. However, I am not sure whether a single Zeppelin instance can support a number of concurrent viewers well.
We'd like to build a set of notebooks and make them available to all colleagues in the organization (access control is not an issue; we trust each other). The notebooks will be interactive, with data filters and date-range pickers.
It looks like Zeppelin has switchable interpreter isolation modes, which we could use to keep different users' sessions isolated from each other. My question is whether a single t2.large AWS instance hosting Zeppelin can sustain up to ten users viewing a report aggregated over a 300k-row dataset. Also, are there any usability concerns that make the idea of multi-user viewing of a reporting dashboard impractical for Zeppelin?
I see a couple of questions you're asking:
Can Zeppelin replace Tableau on a small scale? This depends on which features you are using in Tableau. Every platform has its own set of features that the others may or may not have, and Tableau has a lot of customization options that you won't find elsewhere. Aim to get as much of your dashboard converted 1:1 as you can, then warm everyone up to the idea that it will look and operate a little differently since it's on a different platform.
Can a t2.large hosting Zeppelin sustain up to 10 concurrent users viewing a report aggregated on 300k rows? A t2.large should be more than big enough to run Zeppelin, Tableau, Superset, etc. with 10 concurrent users pulling a report over 300k rows; 300k rows really isn't that much data.
A good way to speed things up and squeeze more concurrent users onto your existing infrastructure is to speed up your data sources. That is where a lot of the aggregation work happens. Looking at your ETLs and aggregating ahead of time can help, as can making sure your data scientists aren't running massive queries that slow down your database server.
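As a rough illustration of that pre-aggregation idea, here is a minimal sketch of the Python you might run in a Zeppelin paragraph: the heavy grouping is pushed down to SQL Server, and only the small aggregated result reaches Pandas. The connection string, the events table and its columns are all hypothetical.

```python
# Minimal sketch: pre-aggregate in the database, render the small result in Pandas.
# The "events" table, its columns, and the DSN are hypothetical; adjust to your schema.
import pandas as pd
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=reporting-db;DATABASE=dash;UID=reader;PWD=secret"
)

# Let SQL Server do the heavy lifting: 300k rows collapse to a few hundred groups.
query = """
    SELECT CAST(event_date AS DATE) AS day, department, COUNT(*) AS n_events
    FROM events
    WHERE event_date BETWEEN ? AND ?
    GROUP BY CAST(event_date AS DATE), department
"""
df = pd.read_sql(query, conn, params=["2018-01-01", "2018-03-31"])

# Any remaining reshaping is cheap in Pandas once the data is pre-aggregated.
pivot = df.pivot(index="day", columns="department", values="n_events").fillna(0)
print(pivot.head())
```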
I am in need of your suggestions for the scenario below:
One of our clients has 8 Postgres DB servers used for OLTP and now wants to generate MIS reports/dashboards integrating the data from all of the servers.
- There are around 100 reports to be generated.
- There would be around 50k rows added to each of these databases.
- The reports are to be generated once a month.
- They run their entire setup on bare metal.
- They don't want to use Hadoop/Spark, since they think the maintenance burden would be too high.
- They want to use open-source tech to accomplish this task.
With all that said, one approach would be to write scripts to bring aggregated data onto one server and then hand-code the reports with front-end JavaScript (a rough sketch of such a script follows below).
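To make that script-based approach concrete, here is a minimal Python sketch that pulls pre-aggregated data from several Postgres servers into one reporting database. The hostnames, the `orders` table and the target `monthly_orders` table are all hypothetical; the real schemas would drive the actual queries.

```python
# Minimal sketch: pull pre-aggregated data from several Postgres OLTP servers
# into one reporting database. Hostnames and table names are hypothetical.
import psycopg2

SOURCE_HOSTS = [f"oltp{i}.example.com" for i in range(1, 9)]  # the 8 OLTP servers

AGG_SQL = """
    SELECT date_trunc('month', created_at) AS month,
           COUNT(*) AS n_orders,
           SUM(total) AS revenue
    FROM orders
    GROUP BY 1
"""

report_conn = psycopg2.connect(host="reporting.example.com", dbname="mis", user="etl")

with report_conn, report_conn.cursor() as out:
    for host in SOURCE_HOSTS:
        src = psycopg2.connect(host=host, dbname="app", user="readonly")
        with src, src.cursor() as cur:
            cur.execute(AGG_SQL)
            for month, n_orders, revenue in cur.fetchall():
                out.execute(
                    "INSERT INTO monthly_orders (source_host, month, n_orders, revenue) "
                    "VALUES (%s, %s, %s, %s)",
                    (host, month, n_orders, revenue),
                )
        src.close()
report_conn.close()
```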
Is there a better approach using ETL tools like Talend or Pentaho?
Which ETL tool would be best suited for this?
Would the community edition of any ETL tool suffice for the above requirements?
I know for a fact that the commercial offering of any of these ETL tools will not be in the budget.
Could you please let me know your views on this?
Thanks in Advance
Deepak
Yes, certainly. I have done similar things successfully a dozen times.
My suggestion is to use Pentaho Data Integration (or Talend) to collect the data in one place and then filter, aggregate and format it. The data volume is not an issue as long as you have a decent server.
For the reports, I suggest producing them with Pentaho Report Designer, so that they can be sent by mail (with Pentaho DI) or distributed with a Pentaho BI server.
You can also build a JavaScript front end with Pentaho CDE.
All of these tools are mature, robust and easy to use; they have community editions and are well supported by the community.
We are at the beginning of an IoT cloud platform project. There are certain well-known components needed for a complete IoT platform solution. One of them is a real-time rule-processing engine, which has to detect when streaming events match rules defined dynamically by end users in a readable format (SQL, Drools if/when/then, etc.).
I am confused because there are lots of products and projects out there (Storm, Spark, Flink, Drools, EsperTech, etc.), so, considering we have a 3-person development team (a junior, a mid-senior and a senior), what would be the best choice?
Choosing one of the stream-processing projects such as Apache Flink and learning it well?
Choosing one of the complete solutions (AWS, Azure, etc.)?
A BRMS (Business Rule Management System) like Drools is mainly built for quickly adapting to changes in business logic, and BRMS products are more mature and stable than stream-processing engines like Apache Storm, Spark Streaming and Flink. Stream-processing engines, in turn, are built for high throughput and low latency. A BRMS may not be able to serve hundreds of millions of events in IoT scenarios and may have difficulty with event-time-based window calculations.
All of these solutions can be run on IaaS providers. On AWS you may also want to take a look at EMR and Kinesis/Kinesis Analytics.
Some use cases I've seen (a toy sketch of dynamic rule evaluation follows this list):
- Stream data directly into FlinkCEP.
- Use a rule engine for fast, low-latency responses, while streaming the same data to Spark for analysis and machine learning.
- Run Drools inside Spark or Flink to hot-deploy user-defined rules.
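To make the "dynamically defined rules" idea concrete, here is a toy Python sketch of matching streaming events against user-defined when/then rules. It only illustrates the concept, not Drools or FlinkCEP themselves, and all names in it are hypothetical.

```python
# Toy sketch of dynamic rule evaluation: user-defined when/then rules are
# matched against a stream of events. Real engines (Drools, FlinkCEP, Esper)
# add Rete networks, time windows, event-time semantics and persistence.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Rule:
    name: str
    when: Callable[[dict], bool]   # condition over a single event
    then: Callable[[dict], None]   # action to fire when it matches

rules: List[Rule] = []

# End users could add or replace rules at runtime ("hot deploy").
rules.append(Rule(
    name="overheat",
    when=lambda e: e.get("sensor") == "temp" and e["value"] > 80.0,
    then=lambda e: print(f"ALERT overheat: {e}"),
))

def process(stream):
    for event in stream:
        for rule in rules:
            if rule.when(event):
                rule.then(event)

process([
    {"sensor": "temp", "value": 75.0},
    {"sensor": "temp", "value": 85.0},   # fires the "overheat" rule
])
```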
Disclaimer: I work for them. But you should check out Losant. It's developer-friendly and super easy to get started with. We also have a workflow engine where you can build custom logic/rules for your application.
Check out the Waylay rules engine, built specifically for real-time IoT data streams.
In the beginning phase, go for a cloud-based IoT platform like Predix, AWS, SAP or Watson for rapid product development and initial learning.
What is the function of a historian in terms of OPC and PLC?
Based upon your tags opc and plc, you're referring to a historian in the SCADA context.
Basically, a historian is a service that collects data from the various devices in a SCADA network and logs it to a database.
A proprietary (time-series) database is normally used, and the marketing typically makes claims such as:
Faster Speeds
In contrast, a plant-wide historian provides a much faster read/write performance over a relational database and “down to the millisecond” resolution for true real-time data. This capability enables better responsiveness by quickly providing the granularity of data needed to analyze and solve intense process applications.
Higher Data Compression
The powerful compression algorithms of Proficy Historian enables you to store years of data easily and securely online, which enhances performance, reduces maintenance and lowers costs. You can configure GE Intelligent Platforms’s Proficy Historian without the active maintenance and back-up routines that a traditional RDB requires. Archives can be automatically created, backed up, and purged—enabling extended use without the need for a database administrator.
These marketing bullet points are often comparing against poor RDBMS implementations. "Faster Speeds" conflates the precision of the timestamp datatype with proper indexing of data in a relational database. "Higher Data Compression" claims are realized by using swinging-door algorithms, which can also be implemented on most RDBMSs; their use is explained in the Chevron whitepaper Data Compression for Process Historians (a small sketch of the swinging-door idea follows below).

A newer trend is to use a classical time-series database for the historical data and a relational database for analysis and reporting. An example of this hybrid configuration is OSIsoft using Microsoft SQL Server for its Analysis Framework, which holds the asset-management hierarchy and relational-style batch history data, shifting from a tag-centric application to an asset-centric application with a relational database at the core.
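For readers unfamiliar with swinging-door compression, here is a minimal, simplified Python sketch of the idea: a sample is archived only when a straight line from the last archived point can no longer represent all intermediate samples within a deviation band. This is an illustration of the concept, not any vendor's implementation.

```python
# Minimal sketch of swinging-door compression. A sample is archived only when
# a line from the last archived point cannot cover all later samples within
# +/- deviation. Simplified for illustration; assumes strictly increasing t.
def swinging_door(samples, deviation):
    """samples: list of (t, v) pairs with strictly increasing t."""
    archived = [samples[0]]
    upper = float("inf")   # steepest allowed slope (upper door)
    lower = float("-inf")  # shallowest allowed slope (lower door)
    prev = samples[0]
    for t, v in samples[1:]:
        t0, v0 = archived[-1]
        upper = min(upper, (v + deviation - v0) / (t - t0))
        lower = max(lower, (v - deviation - v0) / (t - t0))
        if lower > upper:            # doors have closed: the line no longer fits
            archived.append(prev)    # archive the last sample that still fit
            t0, v0 = prev
            upper = (v + deviation - v0) / (t - t0)
            lower = (v - deviation - v0) / (t - t0)
        prev = (t, v)
    if archived[-1] != prev:
        archived.append(prev)        # always keep the final sample
    return archived

# Flat signal with one spike: most points compress away.
data = [(i, 10.0) for i in range(10)] + [(10, 25.0), (11, 10.0)]
print(swinging_door(data, deviation=0.5))   # keeps 4 of 12 points
```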
Popular options are Proficy Historian from General Electric, or OSIsoft's PI Historian.
As an automation integrator, my absolute favorite historian is Wonderware (now called AVEVA). It integrates with their other products (SCADA, ArchestrA, Batch, etc.) to provide integrated data from a manufacturing environment. It can also be integrated with competitors' products.
The Wonderware Historian product has great compression relative to Rockwell's FactoryTalk Historian, and significantly better configuration.
It is quite scalable and, IMHO, has excellent documentation, views and stored procedures for querying, including abundant examples.
They offer tiered historians, a cloud historian and Historian SaaS.
I'm working on an application that is very interactive and is now at the point of requiring a real analytics solution. We generate roughly 2.5-3 million events per month (and growing) and would like to build reports to analyze user cohorts, funnels, etc. The reports are standard enough that it seems feasible to use an existing service.
However, given the volume of data, I am worried that the cost of a hosted analytics solution like Mixpanel will become very expensive very quickly. I've also looked into building a traditional star-schema data warehouse with offline background processes (I know very little about data warehousing).
This is a Ruby application with a PostgreSQL backend.
What are my options, both build and buy, to answer such questions?
Why not build your own?
Check out this open-source project as an example:
http://www.warefeed.com
It is very basic, and you will have to build the data-mart features you need for your case.
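If you do roll your own on top of the existing PostgreSQL backend, a monthly cohort-retention report reduces to one query. Below is a minimal sketch driven from Python; the events table and its user_id/occurred_at columns are hypothetical stand-ins for your real event log.

```python
# Minimal sketch of cohort analysis straight from PostgreSQL.
# Assumes a hypothetical "events" table with user_id and occurred_at columns.
import psycopg2

COHORT_SQL = """
WITH firsts AS (
    SELECT user_id, date_trunc('month', MIN(occurred_at)) AS cohort_month
    FROM events
    GROUP BY user_id
)
SELECT f.cohort_month,
       date_trunc('month', e.occurred_at) AS activity_month,
       COUNT(DISTINCT e.user_id) AS active_users
FROM events e
JOIN firsts f USING (user_id)
GROUP BY 1, 2
ORDER BY 1, 2;
"""

conn = psycopg2.connect(dbname="app", user="analytics")
with conn, conn.cursor() as cur:
    cur.execute(COHORT_SQL)
    # Each row: when the cohort first appeared, the month observed,
    # and how many of that cohort were still active.
    for cohort_month, activity_month, active_users in cur.fetchall():
        print(cohort_month.date(), activity_month.date(), active_users)
conn.close()
```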