The right way to create Pentaho CDE dashboard charts

Pentaho version: BI Server CE 6.1
I'm new to the Pentaho universe and I'm stuck finding documentation on how to create a CDE dashboard. To be clear, I have no idea what the right way to create a CDE dashboard is, but I have tried many things based on tutorials found pretty much everywhere.
What I have done so far
From this data model
I already created a dynamic chart with a "sql over sqljdbc" datasource.
Here is my query (the result is in the picture below):
SELECT (select survey_type from survey where id = pr.form_type) as "form type",
pr.date as "Date",
count(pr.id) as "Form number"
FROM result pr
inner join district pd on pr.district_id=pd.id
inner join departement pdep on pd.departement_id=pdep.id
inner join region pre on pdep.region_id=pre.id
WHERE pre.region_text = ${region}
GROUP by date,form_type
ORDER by date;
Dashboard generated by the query - Form number by date, type and region (set dynamically)
What I want to achieve
I want to build this kind of chart: community.pentaho.com/ctools/ccc/#type=bar&anchor=small-multiple-bars or community.pentaho.com/ctools/ccc/#type=bar&anchor=stacked-bar (sorry, I don't have enough reputation to post more than 2 links) with a "sql over jdbc" datasource.
Can anyone give me an example of a SQL query that achieves that? (Preferably the query given earlier in this post, with some modifications. I tried this, but it does not work as expected:
SELECT (select survey_type from survey where id = pr.form_type) as "form type",
pr.date as "Date",
pre.region_text as region,
count(pr.id) as "Form number"
FROM result pr
inner join district pd on pr.district_id=pd.id
inner join departement pdep on pd.departement_id=pdep.id
inner join region pre on pdep.region_id=pre.id
GROUP by date,form_type,pre.id
ORDER by date;
)
And where can I put the code shown with those examples to preview it in my own Pentaho instance? I need to know how to reproduce it.
What I want to know
The right way to build CDE charts in Pentaho:
How does the query need to be formatted? (How are fields mapped onto the dashboard, what is the maximum number of fields, ...)
What is the difference between MDX queries and SQL queries, and what is each for?
Which of the two (MDX or SQL) is better for building charts?
How can I turn my relational database into a Mondrian cube if I want to use MDX queries? (Or should I redesign the database as a data warehouse using Kettle?)
Thank you for your answers.

First of all, you should realize that you're asking a lot here. Having said that, you've pretty much done what I did when I first started with Pentaho, which was experiment. A lot.
Regarding your questions, I have some links that should help you (if you haven't checked them already):
http://pentaho-bi-suite.blogspot.be/2014/01/inter-panel-communication-in-pentaho.html
http://holowczak.com/getting-started-with-pentaho-community-edition-dashboard-editor-cde/
The first link is a very good blog where I have found several answers regarding dashboards.
The second link is more of an overall tutorial.
There is no general "best way" (apart from applying general best practices, of course) for creating dashboards. I suggest you keep trying (getting to know all of the properties and settings along the way) and find out what method works best for you.
Regarding your questions about MDX and Mondrian, I haven't had much experience in these areas, but as I understand it, MDX queries run against Mondrian cubes, which you define in Pentaho's Mondrian Schema Workbench.
http://mondrian.pentaho.com/documentation/olap.php
I believe this should answer (at least some of) your questions. Trying many different things and experimenting will get you quite far, as you'll pick up plenty of small things one at a time.

I will elaborate a little bit on this.
As dooms stated, you're asking a lot of things here, but I am glad you are trying to create some great dashboards.
To format and tune charts, I remember I had to learn some JavaScript/jQuery.
The difference between SQL and MDX: they are completely different, even though the syntax sometimes looks similar. You use SQL to query relational databases, whereas MDX is used to query cubes. If you don't have cubes in place, you need to use SQL, of course. If you do, you should ask the cube developer to introduce you to this world. Basically, cubes are good at aggregating data and make it easy to interact with it and perform ad-hoc analysis; they are intended to let business analysts explore the data better. I am an MDX fan, but I would recommend you also explore newer alternatives to multidimensional cubes, like tabular models or other in-memory technologies.
The best way to do a chart has nothing to do with MDX or SQL. It depends on where your data is stored. The most important thing is to have a good data model behind it.
Again, depending on your architecture, you should have a multidimensional model in your data mart, without snowflaking if possible. That allows you to build both easy SQL queries and a straightforward cube design. Designing cubes requires some extra skills. I would try to get a clean data model first and then evaluate whether a cube is required.
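To make the "easy SQL on a clean model" point a bit more concrete, here is a minimal sketch using Python's built-in sqlite3. It assumes a cut-down, hypothetical version of the tables from the question (region is attached directly to result, skipping district and departement, and the sample rows are invented). It returns one row per date, form type and region with a count. Which column a CCC chart then treats as series or category depends on the chart options, so the column order here is only an assumption.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with simplified, hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE survey (id INTEGER PRIMARY KEY, survey_type TEXT);
        CREATE TABLE region (id INTEGER PRIMARY KEY, region_text TEXT);
        CREATE TABLE result (id INTEGER PRIMARY KEY, form_type INTEGER,
                             region_id INTEGER, date TEXT);
        INSERT INTO survey VALUES (1, 'household'), (2, 'school');
        INSERT INTO region VALUES (1, 'North'), (2, 'South');
        INSERT INTO result VALUES
            (1, 1, 1, '2016-01-01'),
            (2, 1, 1, '2016-01-01'),
            (3, 2, 1, '2016-01-01'),
            (4, 1, 2, '2016-01-02');
    """)
    return conn


def forms_by_date_type_region(conn):
    """Return (date, form_type, region, form_count) rows, one per group."""
    sql = """
        SELECT pr.date        AS date,
               s.survey_type  AS form_type,
               r.region_text  AS region,
               COUNT(pr.id)   AS form_count
        FROM result pr
        JOIN survey s ON s.id = pr.form_type
        JOIN region r ON r.id = pr.region_id
        -- every non-aggregated column in the SELECT is also in the GROUP BY
        GROUP BY pr.date, s.survey_type, r.region_text
        ORDER BY pr.date, s.survey_type, r.region_text
    """
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    for row in forms_by_date_type_region(build_demo_db()):
        print(row)
```

Keeping the result in this "long" shape (one row per combination plus one measure) is a design choice that tends to work for both plain SQL charts and a later cube design.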
I hope this sheds some light; it is not easy to answer questions as broad as the ones you asked. The important thing is to define the scope of your project.
Kind Regards,


some questions about designing on OrientDB

We were looking for the most suitable database for our innovative “collaboration application”. Sorry, we don’t know how to name it in a generally understood way. In essence, highly complicated relationships among tenants, roles, users, tasks and bills need to be handled effectively.
After reading about five databases (PostgreSQL, MongoDB, CouchDB, ArangoDB and Neo4j), when the words “… relationships among things are more important than things themselves” caught my eye, I made up my mind to dig into OrientDB. Both the design philosophy and the innovative features of OrientDB (multi-model, clusters, OO, native graph, full graph API, SQL-like language, LiveQuery, multi-master, auditing, simple RIDs and version numbers ...) keep intensifying my enthusiasm.
OrientDB has inspired me to rethink and try to model from a totally different viewpoint!
We are now designing the data structure based on OrientDB. However, some questions are puzzling me.
LINK vs. EDGE
Take the case where a CLIENT may place thousands of ORDERs: how should we choose between LINKs and EDGEs to store the relationships? I prefer EDGEs, but they seem to store thousands of ORDER RIDs in the CLIENT record.
Embedded records’ Security
Can an embedded record be authorized independently from its container record?
Record-level Security
How does activating Record-level Security affect the query performance?
I hope I have expressed this clearly. Any advice will be truly appreciated.
LINK vs EDGE
If the relationship carries no properties, you can use a link; if it does, use edges. You really need edges if you need to traverse the relationship in both directions; a linklist can only be followed in one direction (just like a hyperlink on the web), but it avoids the overhead of edges. Edges are the right choice if you need to walk through a graph. Edges require more storage space than a linklist. Another difference: if two records are connected by a link, A --> (link) B, and you delete B, the link does not disappear; it remains but no longer points to anything. It is designed this way because, when you delete a document, finding all the other documents that link to it would mean a full scan of the database, which typically takes ages to complete. The Graph API, with its bi-directional links, is specifically designed to solve this problem, so in general we suggest that customers use it, or else take care to manage link consistency at the application level.
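The dangling-link behaviour described above can be illustrated with a small toy model. This is plain Python, not the OrientDB API: the Store class, the record ids and the orders_link field are all hypothetical and only mimic the idea that a link lives inside one record while an edge is registered on both endpoints.

```python
class Store:
    """Toy record store; a plain-Python model, not the OrientDB API."""

    def __init__(self):
        self.records = {}    # rid -> dict of fields (links live in here)
        self.out_edges = {}  # rid -> set of target rids
        self.in_edges = {}   # rid -> set of source rids

    def add(self, rid, **fields):
        self.records[rid] = dict(fields)
        self.out_edges[rid] = set()
        self.in_edges[rid] = set()

    def add_edge(self, src, dst):
        # An edge is registered on both endpoints, so it can be walked either way.
        self.out_edges[src].add(dst)
        self.in_edges[dst].add(src)

    def delete(self, rid):
        # Edges: both endpoints are known, so cleanup needs no full scan.
        for dst in self.out_edges.pop(rid):
            self.in_edges[dst].discard(rid)
        for src in self.in_edges.pop(rid):
            self.out_edges[src].discard(rid)
        # Links stored inside other records are deliberately left untouched.
        del self.records[rid]


def demo():
    db = Store()
    db.add("client", orders_link=["order1"])  # LINK: rid kept in a field
    db.add("order1")
    db.add("order2")
    db.add_edge("client", "order2")           # EDGE: tracked on both sides
    db.delete("order1")
    db.delete("order2")
    dangling = [r for r in db.records["client"]["orders_link"]
                if r not in db.records]
    return dangling, db.out_edges["client"]
```

After both orders are deleted, demo() reports the link to order1 as dangling, while the edge to order2 has been cleaned up without scanning any other record.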
RECORD - LEVEL SECURITY
With 1 million vertices and an admin user called Luke, a query like: select from where title = ? using a NOTUNIQUE_HASH_INDEX took 0.027 sec to execute.
OrientDB has the concept of users and roles, as well as Record Level Security. It also supports token based authentication, so it's possible to use OrientDB as your primary means of authorizing/authenticating users.
EMBEDDED RECORD'S SECURITY
I've made this example to try to answer your question.
I have this structure:
To access the embedded data, I have to run this command: select prop from User
If I try to access it through the class that defines the embedded type (Car), I get no results at all:
select from Car
UPDATE
OrientDB supports that kind of authorization/authentication, but it works a little differently from your example. For example: if user A, without admin permission, inserts a record, another user B without admin permission can't see that record. A user can see only the records that they have inserted.
Hope it helps

NoSql self join like

I want to understand whether I can do the following with NoSQL in any way. I will take flights as an example.
Let's say I have a table or collection of flights with the following info:
...
{ from:XXX, to:YYY, date:01-01-2016 }
{ from:YYY, to:XXX, date:02-02-2016 }
...
I need to be able to perform something like a self join to find the full route:
{from:XXX, to:YYY, outbound:01-01-2016, inbound:02-02-2016}
The table will contain a lot of from and to locations.
Is it possible to do it with no relational DB?
Is it possible to do it with no relational DB?
That's the wrong question to ask. The idea of NoSQL is to use specialized data stores for specific problems, instead of attempting to solve every problem with the same tool.
Your use case is unclear, however. Depending on what data you use to query, you could simply do two queries and merge the result(s), or use a simple $or query (in MongoDB) to query for paths back and forth. There are dozens of ways to solve this with all kinds of tools, but it depends on the exact problem you want to solve.
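As a rough illustration of the "two queries and merge" option, here is a sketch in plain Python. The list of dicts stands in for whatever the data store returns, and the round_trips name and its filtering rule (return flight strictly after the outbound one) are assumptions, not something given in the question.

```python
from datetime import date


def round_trips(flights, origin, destination):
    """Pair each outbound flight with every later return flight."""
    # "Query" 1: flights in the outbound direction.
    outbound = [f for f in flights
                if f["from"] == origin and f["to"] == destination]
    # "Query" 2: flights in the opposite direction.
    inbound = [f for f in flights
               if f["from"] == destination and f["to"] == origin]
    # Merge step done in the application instead of a database join.
    return [
        {"from": origin, "to": destination,
         "outbound": o["date"], "inbound": i["date"]}
        for o in outbound
        for i in inbound
        if i["date"] > o["date"]
    ]


flights = [
    {"from": "XXX", "to": "YYY", "date": date(2016, 1, 1)},
    {"from": "YYY", "to": "XXX", "date": date(2016, 2, 2)},
]
print(round_trips(flights, "XXX", "YYY"))
```

With a real store, the two list comprehensions would become two lookups (or one $or query), and only the merge would stay in application code.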
The example of flights is not even a good fit for RDBMSs, because this is usually a routing problem where it might be allowed (or necessary) to combine two or more flights for each direction. Neo4j might be the simpler tool for graph problems (note that I'm not saying 'better', because that can mean many things...)
Here is an efficient way to do this in AWS DynamoDB.
Create a table with the following schema:
HashKey: From_City-To_City
RangeKey: Time
So in your case your table would look like:
HashKey RangeKey
XXX-YYY 01-01-2016
YYY-XXX 02-02-2016
Now given a flight from XXX to YYY on 01-01-2016, you can find the return flight by doing a DynamoDB query like this:
HashKey=="YYY-XXX" and RangeKey > "01-01-2016".
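This query should be very efficient because the hash key "YYY-XXX" is already defined and the range key is sorted/indexed. So you can have tons of flight info in your table, but your query execution time should stay (mostly) the same regardless of the growth in table size. Note that the range comparison is only chronological if the dates are stored in a sortable format such as 2016-01-01 (ISO 8601); strings like 01-01-2016 do not sort by date.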
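The key design above can be mimicked in a few lines of plain Python to see why the lookup stays cheap. This is not the DynamoDB API: FlightTable and its methods are hypothetical, a dict plays the role of the hash key and a sorted list plays the role of the range key. Dates are assumed to be ISO 8601 strings so that string order matches date order.

```python
import bisect
from collections import defaultdict


class FlightTable:
    """Toy hash-key + range-key table; not the DynamoDB API."""

    def __init__(self):
        # hash key "FROM-TO" -> sorted list of ISO date strings (range keys)
        self._partitions = defaultdict(list)

    def put(self, origin, destination, iso_date):
        # insort keeps each partition sorted by range key.
        bisect.insort(self._partitions[f"{origin}-{destination}"], iso_date)

    def query_after(self, origin, destination, iso_date):
        """Return range keys strictly greater than iso_date in one partition."""
        keys = self._partitions.get(f"{origin}-{destination}", [])
        # Binary search: cost depends on this partition, not the whole table.
        return keys[bisect.bisect_right(keys, iso_date):]


table = FlightTable()
table.put("XXX", "YYY", "2016-01-01")
table.put("YYY", "XXX", "2016-02-02")
print(table.query_after("YYY", "XXX", "2016-01-01"))
```

The lookup touches only the "YYY-XXX" partition and then binary-searches it, which is the same reason the real query is not expected to slow down much as other routes are added.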

distributing pivot graphs to departments through email efficiently

I work at an institution with a lot of departments and subdivisions. I have an "excel-database" with pivot charts that can show the progress of the different departments and subdivisions, but there are quite a lot of them, and to get through all the graphs (Dep 1 subdivision 1, Dep 1 subdivision 2, etc.) I have to go through quite a few iterations, sending out the graphs for each department and subdivision.
I'm considering creating a macro that selects each option in the pivot chart and then exports it to a Word document, but I don't know if there's an easier way to go, since I guess this will take me quite some time too.
I'm thinking that someone probably has been in the same situation, so if anyone has any suggestions as to how this could be solved efficiently, please let me know.
EDIT:
So as I see it, there are three steps to this question that need solving (steps that are struck through are steps that I know how to do):
Iterate through the pivot table options
Copy the charts to Word OR another Excel file and save
Attach that file to an email and send it to the correct department's address
The general thinking about how to handle a case like yours has changed over the years. Currently I would recommend making the data accessible on an internal website of some kind and allowing each department to generate their own graph on demand. They would then be able to look at the data whenever they wanted and you would not have to send out graphs. See if Google Drive or MS Office365 can do this for you.
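If the charts do still have to go out by mail, the third step from the question could look roughly like this in Python (rather than an Excel macro). It is only a sketch: the addresses, file names and the reports mapping are hypothetical, the exported files are assumed to exist already as bytes, and nothing is actually sent.

```python
from email.message import EmailMessage


def build_department_mails(sender, reports):
    """Build one message per department.

    reports maps a department address to (filename, file_bytes).
    """
    mails = []
    for recipient, (filename, data) in reports.items():
        msg = EmailMessage()
        msg["From"] = sender
        msg["To"] = recipient
        msg["Subject"] = f"Progress chart: {filename}"
        msg.set_content("Please find the latest chart attached.")
        # Attach the exported chart file as a generic binary attachment.
        msg.add_attachment(data, maintype="application",
                           subtype="octet-stream", filename=filename)
        mails.append(msg)
    return mails


mails = build_department_mails(
    "reports@example.org",
    {"dep1@example.org": ("dep1_sub1.docx", b"placeholder bytes")},
)
print(mails[0]["To"], mails[0]["Subject"])
```

Actually sending each message would go through something like smtplib.SMTP(...).send_message(msg), with the server details depending on your own mail setup.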

Introduction to object databases

I'm trying to understand the idea of NoSQL databases, or more precisely, the concept behind the Neo4j graph database. I have experience with SQL databases (MySQL, MS SQL), but the limitations of managing hierarchical data made me want to expand my knowledge. Now I have some questions whose answers I can't find (maybe I don't know what to search for).
Imagine we have a list of the countries in the world. Each country has its GDP for every year, and that GDP is calculated by different sources: the World Bank, the country's government, the CIA, etc. What's the best way to organise the data in this case?
The simplest thing that came to mind is to have a node like this (the values are imaginary):
China:
GDPByWorldBank2012: 999,
GDPByCIA2011: 994,
GDPByGovernment2012: 1102,
In a relational database, I would split the data into three tables: Countries, Sources and Values, where Values holds the GDP value, the year, the id of the country and the id of the source.
Another thing that came to mind is to create nodes for CIA and World Bank, but a Government node looks really weird. Even so, the idea is to have relationships (valueOfGDP):
CIA -> valueOfGDP - {year: 2011, value: 994} -> China
World Bank -> valueOfGDP - {year: 2012, value: 999} -> China
This looks pretty weird to me; what is more, what happens when we add the values for all the years from one source? Would we have multiple relationships, or what?
I'm sorry if my questions are too dumb, and I would be happy if someone could explain this to me or point me to a book/article to read.
Thanks in advance. :)
Your questions are perfectly legitimate, and you're not the only one having difficulty grasping graph modelling at first ;)
It is always easier to start by thinking about the questions you wanna answer with your data before modelling it up front.
Let's imagine you wanna retrieve the GDP of year 2012 computed by CIA for all countries.
A simple way to achieve this is to label country nodes uniformly and give each one a name attribute holding the country name.
Moreover, CIA/WorldBank/Government in this domain are all "sources", so let's label them uniformly as well.
For instance, that could give something like:
(ORGANIZATION {name: "CIA"})-[:HAS_COMPUTED_GDP {year: 2011, value: 994}]->(COUNTRY {name: "China"})
With Cypher Query Language, following this model, you would execute the following query:
START cia = node:nodes(name = "CIA")
MATCH cia-[gdp:HAS_COMPUTED_GDP]->(country)
WHERE gdp.year = 2012
RETURN cia, country, gdp
In this query, I used an index lookup as a starting point (rather than IDs, which are an internal technical notion that shouldn't be used) to retrieve CIA by name and match the relevant subgraph, finally returning CIA, the GDP relationships and their linked countries matching the input constraints.
Although Neo4j is totally schemaless, this does not mean you should necessarily have a totally flexible data model. Having a little structure will always help to make your queries or traversals easier to read.
If you're not familiar with Cypher Query Language (which is not the only way to read or write data into the graph), have a look at the excellent documentation of Neo4J (Cypher: http://docs.neo4j.org/chunked/stable/cypher-query-lang.html, complete: http://docs.neo4j.org/chunked/stable/index.html) and try some queries there: http://console.neo4j.org/!
And to answer your second question, if you wanna add another year of GDP computations, this just boils down to adding a new "HAS_COMPUTED_GDP" relationship between the organizations and the countries, no more, no less.
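To see why "one more relationship per year" is harmless, here is a tiny in-memory model in plain Python. It is not Neo4j or Cypher: each Gdp tuple stands for one HAS_COMPUTED_GDP relationship with its year and value, the gdp_by helper is hypothetical, and the 2012 CIA value is invented purely for the example.

```python
from collections import namedtuple

# One tuple per relationship: source -[HAS_COMPUTED_GDP {year, value}]-> country
Gdp = namedtuple("Gdp", "source country year value")


def gdp_by(edges, source, year):
    """Return {country: value} for one source and one year."""
    return {e.country: e.value
            for e in edges
            if e.source == source and e.year == year}


edges = [
    Gdp("CIA", "China", 2011, 994),
    Gdp("World Bank", "China", 2012, 999),
]
# Adding another year is just one more relationship between the same two nodes.
edges.append(Gdp("CIA", "China", 2012, 1000))  # hypothetical value

print(gdp_by(edges, "CIA", 2012))
```

The query filters on the relationship's properties, so several relationships between the same organization and country do not get in each other's way.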
Hope it helps :)

what-if analysis on Rolap Or Molap & how?

I would like to simulate what-if analysis on an OLAP cube.
For example, I would like to know the impact on departmental resource budgets of moving employees between departments, or the change in manufacturing cost if a product is moved from one factory to another.
So should I use a ROLAP cube (Mondrian) or MOLAP?
I would be grateful if you could give me some examples, tutorials ... ;)
Thank you in advance.
Actually, Mondrian does support "writeback" (via olap4j), so you can do what-if analysis.
Check out Saiku - AFAIK it's the first and only tool to have implemented it so far.
Here is how it works - it's pretty rudimentary:
http://julianhyde.blogspot.co.uk/2009/06/cell-writeback-in-mondrian.html
Martin is close to the point though: it doesn't actually update the raw data, only objects in the cache. But you wouldn't want to update raw data if you were doing what-if analysis anyway!
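The idea of a scenario that sits on top of the data without touching it can be sketched in plain Python. This is not Mondrian's or olap4j's writeback API: the budget_by_department function, the employee tuples and the moves mapping are all hypothetical, and the numbers are made up to mirror the "move an employee between departments" example from the question.

```python
from collections import defaultdict


def budget_by_department(employees, moves=None):
    """Aggregate cost per department, optionally under a what-if scenario.

    employees: list of (name, department, cost) tuples (the raw data).
    moves: optional {name: new_department} overlay; raw data is not modified.
    """
    moves = moves or {}
    totals = defaultdict(int)
    for name, department, cost in employees:
        # The scenario overrides the department only for this calculation.
        totals[moves.get(name, department)] += cost
    return dict(totals)


employees = [("Ann", "Sales", 100), ("Bob", "Sales", 80), ("Cy", "R&D", 120)]

print(budget_by_department(employees))                # baseline
print(budget_by_department(employees, {"Bob": "R&D"}))  # what-if: move Bob
```

The baseline and the scenario are computed from the same untouched list, which is the same principle as keeping the what-if changes in a cache rather than in the underlying tables.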
I would say that Mondrian is an engine for querying an existing database that has a dedicated structure for OLAP (usually some kind of star schema).
It is definitely not something for manipulating (or even changing) data. Since each what-if analysis needs to change data in some way or another, Mondrian is not the tool for it.