What-if analysis on ROLAP or MOLAP, and how? - olap-cube

I would like to simulate what-if analysis on an OLAP cube.
For example, I would like to know the impact on departmental resource budgets of moving employees between departments, or the change in the cost of manufacture if a product is moved from one factory to another.
So should I use a ROLAP cube (Mondrian) or MOLAP?
I would be grateful if you could point me to some examples or tutorials ;)
Thank you in advance

Actually, Mondrian does support "writeback" (via olap4j), so you can do what-if analysis.
Check out Saiku - AFAIK it's the first and only tool to have implemented it so far.
Here is how it works - it's pretty rudimentary:
http://julianhyde.blogspot.co.uk/2009/06/cell-writeback-in-mondrian.html
Martin is close to the point, though: it doesn't actually update the raw data, only objects in the cache. But you wouldn't want to update raw data if you were doing what-if analysis anyway!
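
For the record, here is roughly what that writeback looks like from code. This is a hedged sketch in Kotlin against the olap4j API described in the blog post above; the JDBC URL, cube name, measure and the 150000 value are made-up placeholders, so check the exact method names against the olap4j/Mondrian versions you use.

import org.olap4j.AllocationPolicy
import org.olap4j.OlapConnection
import java.sql.DriverManager

fun whatIfBudget(mondrianJdbcUrl: String) {
    val connection = DriverManager.getConnection(mondrianJdbcUrl)
        .unwrap(OlapConnection::class.java)

    // Writeback happens inside a scenario, so the underlying fact table is never touched.
    connection.setScenario(connection.createScenario())

    val cellSet = connection.createStatement().executeOlapQuery(
        "SELECT {[Measures].[Budget]} ON COLUMNS, {[Department].Members} ON ROWS FROM [Budget]"
    )

    // Overwrite one cell; Mondrian spreads the change over the cell's children.
    cellSet.getCell(0).setValue(150000, AllocationPolicy.EQUAL_ALLOCATION)

    // Re-running the same query on this connection now reflects the simulated value.
}

The important point, as noted above, is that the change lives only in the scenario/cache, which is exactly what you want for what-if analysis.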

I would say that Mondrian is an engine to query an existing database that has a dedicated structure for OLAP (usually some kind of star schema).
It is definitely not something to manipulate (or even change) data. Since every what-if analysis needs to change data in some way or another, Mondrian is not the tool for it.

Related

Firebase analytics - Unity - time spent on a level

Is there any possibility to get the exact time spent on a certain level in a game via Firebase Analytics? Thank you so much 🙏
I tried to use logEvent.
The best way to do so would be measuring the time on the level within your codebase, then having a dedicated event for level completion, in which you would pass the time spent on the level.
Let's get to details. I will use Kotlin as an example, but it should be obvious what I'm doing here, and you can see more language examples here.
// Set user-level dimensions once, outside the event.
firebaseAnalytics.setUserProperty("user_id", userId)

// Log the completion event, passing the time spent as one of its parameters.
firebaseAnalytics.logEvent("level_completed") {
    param("name", levelName)
    param("difficulty", difficulty)
    param("subscription_status", subscriptionStatus)
    param("minutes", minutesSpentOnLevel)
    param("score", score)
}
Now see how I have a bunch of parameters with the event? These parameters are important, since they will allow you to conduct a more thorough and robust analysis later on and answer more questions. Like: hey, what is the most difficult level? Do people still have trouble on it when the game difficulty is lower? How many times has this level been rage-quit or lost (for that you'd likely need a level_started event)? What about our paying players, are they having similar trouble on this level as well? How many people have rage-quit the game on this level and never played again? That last one would likely be easier to answer with SQL at this point, taking the latest value of the level name for level_started, grouped by user_id. Or you could also have levelName as a user property as well as an event property; then it would be somewhat trivial to answer in the default analytics interface.
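For completeness, the level_started counterpart mentioned above could look like this; a hedged sketch reusing the same parameter names and variables (levelName, difficulty, subscriptionStatus) as the level_completed snippet:

firebaseAnalytics.logEvent("level_started") {
    param("name", levelName)
    param("difficulty", difficulty)
    param("subscription_status", subscriptionStatus)
}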
Note that you're limited in the number of event parameters you can send per event. The total number of unique parameter names is limited too, as is the number of unique event names you're allowed to have. In our case, the event name would be level_completed. See the limits here.
Because of those limitations, it's important to name your event properties in a somewhat generic way so that you can efficiently reuse them elsewhere. For this reason, I named the property minutes and not something like minutes_spent_on_the_level. You could then reuse this property to send the minutes the player spent actively playing, minutes the player spent idling, minutes the player spent on any info page, minutes they spent choosing their upgrades, etc. Same idea about having a name property rather than level_name. It could as well be id.
You need to carefully and thoughtfully stuff your event with event properties. I normally have a wrapper around the Firebase SDK, in which I enrich events with dimensions that I always want to be there, like the user_id or subscription_status, so I don't have to add them manually every time I send an event. I also usually have some more adequate logging there, since Firebase Analytics' default logging is completely awful. I also have some sanitizing there: lowercasing all values unless I'm passing something case-sensitive like base64 values, making sure I don't have double spaces (so replacing \s+ with " ", a single space), and maybe also adding the user's local timestamp as another parameter. The latter is very helpful for spotting time-cheating users, especially if your game is an idler.
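To make the wrapper idea concrete, here is a minimal Kotlin sketch of what such a wrapper could look like. The class name, constructor arguments and the client_ts parameter are illustrative choices of mine, not part of the Firebase SDK:

import com.google.firebase.analytics.FirebaseAnalytics
import com.google.firebase.analytics.ktx.logEvent

class AnalyticsTracker(
    private val firebaseAnalytics: FirebaseAnalytics,
    private val userId: String,
    private val subscriptionStatus: String
) {
    fun track(eventName: String, params: Map<String, Any> = emptyMap()) {
        firebaseAnalytics.logEvent(sanitize(eventName)) {
            // Dimensions that should always be present, so call sites don't repeat them.
            param("user_id", userId)
            param("subscription_status", subscriptionStatus)
            // Client-side timestamp, helpful for spotting time-cheating users.
            param("client_ts", System.currentTimeMillis())
            for ((key, value) in params) {
                when (value) {
                    is Long -> param(key, value)
                    is Int -> param(key, value.toLong())
                    is Double -> param(key, value)
                    else -> param(key, sanitize(value.toString()))
                }
            }
        }
    }

    // Lowercase and collapse whitespace; skip this for case-sensitive values like base64.
    private fun sanitize(value: String): String =
        value.lowercase().replace(Regex("\\s+"), " ").trim()
}

Usage would then be something like tracker.track("level_completed", mapOf("name" to levelName, "minutes" to minutesSpentOnLevel)).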
Good. We're halfway there :) Bear with me.
Now you need to go to Firebase and register your eps (event parameters) as cds (custom dimensions and metrics). If you don't register your eps, they won't be counted towards the global cd limit (it's about 50 custom dimensions and 50 custom metrics). You register the cds in the Custom Definitions section of Firebase.
Now you need to know whether each one is a dimension or a metric, as well as the scope of your dimension. It's much easier than it sounds. The rule of thumb is: if you want to be able to run mathematical aggregation functions on it, then it's a metric. Otherwise, it's a dimension. So:
firebaseAnalytics.setUserProperty("user_id", userId) <-- dimension
param("name", levelName) <-- dimension
param("difficulty", difficulty) <-- dimension (or can be a metric, depends)
param("subscription_status", subscriptionStatus) <-- dimension (can be a metric too, but even less likely)
param("minutes", minutesSpentOnLevel) <-- metric
param("score", score) <-- metric
Now another important thing to understand is the scope. Because Firebase and GA4 are still essentially in beta and being actively worked on, you only have user or hit scope for dimensions and only hit scope for metrics. The scope basically indicates how the value persists. In my example, we only need user_id as a user-scoped cd. Because user_id is a user-level dimension, it is set separately from the logEvent function. Although I suspect you can do it there too. Haven't tried, though.
Now, we're almost there.
Finally, you don't want to use Firebase to look at your data. It's horrible at data presentation. It's good at debugging though, because that's what it was intended for initially. Because of how horrible it is, it's always advised to link it to GA4. GA4 will let you look at the Firebase values much more efficiently. Note that you will likely need to re-register your custom dimensions from Firebase in GA4, because GA4 can receive multiple data streams, of which Firebase would be just one data source. But GA4's cd limits are very close to Firebase's. OK, let's be frank: GA4's data model is almost exactly copied from Firebase's, but GA4 has much better analytics capabilities.
Good, you've moved to GA4. Now, GA4 is a very raw, not-officially-beta product, just like Firebase Analytics. Because of that, it's advised to first change your data retention to 12 months and to only use the Explorer for analysis, pretty much ignoring the pre-generated reports. They are just not very reliable at this point.
Finally, you may find it easier to just use SQL to get your analysis done. For that, you can easily copy your data from GA4 to a sandbox instance of BigQuery. It's very easy to do, and it's the best, most reliable known method of using GA4 at this moment. I mean, advanced analysts do the export into BQ, then ETL the data from BQ into a proper storage like Snowflake, or even S3, or Aurora, or whatever you prefer, and then, on top of that, use a proper BI tool like Looker, Power BI, Tableau, etc. A lot of people just stay in BQ though, and that's fine. Lots of BI tools have BQ connectors; it's just that BQ gets expensive quickly if you do a lot of analysis.
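As a hedged illustration of the BQ route: once the GA4 export is enabled, events land in daily events_YYYYMMDD tables, which you can query from anywhere, for example from Kotlin via the google-cloud-bigquery client. The project/dataset IDs below are placeholders, and the parameter names assume the level_completed event from earlier:

import com.google.cloud.bigquery.BigQueryOptions
import com.google.cloud.bigquery.QueryJobConfiguration

fun averageMinutesPerLevel() {
    val bigquery = BigQueryOptions.getDefaultInstance().service
    // Average time spent per level, pulled out of the GA4 event_params array.
    val sql = """
        SELECT
          (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'name') AS level_name,
          AVG((SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'minutes')) AS avg_minutes
        FROM `my-project.analytics_123456789.events_*`
        WHERE event_name = 'level_completed'
        GROUP BY level_name
        ORDER BY avg_minutes DESC
    """.trimIndent()
    val result = bigquery.query(QueryJobConfiguration.newBuilder(sql).build())
    for (row in result.iterateAll()) {
        println("${row.get("level_name").stringValue}: ${row.get("avg_minutes").doubleValue}")
    }
}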
Whew, I hope you'll enjoy analyzing your game's data. Data-driven decisions rock in games. Well... They rock everywhere, to be honest.

How to inject (dynamic?) Parameters in Tableau CustomSQL

I currently try to solve the following issue in Tableau:
In the end, I would like to have a Tableau dashboard where the user can select a Customer, and then can see the Customer's KPIs. Nothing spectacular so far.
To obtain a Customer's KPIs, there is a CustomSQL query with a parameter "CustomerName" (that returns the KPIs for that Customer).
Now the thing:
I don't want to have a hardcoded list of CustomerNames, as it would be possible with Tableau Parameters. Instead, the CustomerNames should be fetched from another datasource. I did not find a way to "link" a Parameter to a DataSource, and/or inject something other than static Parameters into CustomSQL.
My question: is there really no solution for this, or am I just doing something wrong? (I hope so.)
I found this workaround here https://www.interworks.com/de/blog/daustin/2015/12/17/dynamic-parameters-tableau that seems to work, but that looks like... a workaround.
A bit of background info:
I have to stick to using CustomSQL because:
It is not viable for me to calculate all KPIs for all CustomerNames and then filter in Tableau, since the data volume is too big.
It is not viable to replace the CustomSQL with Tableau calculations and filters (already tried that; it ended up with Tableau pulling too much data instead of pushing the work to the database).
I cannot believe that Tableau does not offer a solution here, since the use case seems pretty common.
Do you have some input for me?
Thank you for your help in advance!
Kind Regards
Have you tried using the RAWSQL() functions together with stored functions on the database side? I found them pretty useful when I needed to load a single value that is completely unrelated to the currently used data source.
For example, running a stored function foo which accepts two dates and calculates the sum of something. The syntax should be something like:
rawsql_int(your_db_schema.foo(%1,%2),[startDateFieldTableau],[endDateFieldTableau])
But you can also access the database directly:
rawsql_int("select sum(bar) from sales")
though this is a bit risky.
Drawbacks:
it relies on the current connection (you create a calculated field, after all)
it will not work with an extract (but you are using custom SQL anyway, so I believe you are more into live connections)

The good way to create a Pentaho CDE dashboard

Pentaho version: BI Server CE 6.1
I'm new to the Pentaho universe and I found myself stuck finding documentation on how to create a CDE dashboard. Just to be clear, I have no idea what the good way to create a CDE dashboard is, but I have tried many things based on tutorials found pretty much everywhere.
What I have done so far
From this data model, I already created a dynamic chart with a "sql over sqljdbc" datasource.
Here is my query (and the result, shown in the picture):
SELECT (select survey_type from survey where id = pr.form_type) as "form type",
pr.date as "Date",
count(pr.id) as "Form number"
FROM result pr
inner join district pd on pr.district_id=pd.id
inner join departement pdep on pd.departement_id=pdep.id
inner join region pre on pdep.region_id=pre.id
WHERE pre.region_text = ${region}
GROUP by date,form_type
ORDER by date;
Dashboard generated by the query - Form number by date, type and region (set dynamically)
What I want to achieve
I want to do this kind of chart: community.pentaho.com/ctools/ccc/#type=bar&anchor=small-multiple-bars or community.pentaho.com/ctools/ccc/#type=bar&anchor=stacked-bar (sorry, I don't have enough reputation to post more than 2 links) with a "sql over jdbc" datasource.
Can anyone give me an example of a SQL query to achieve that? (Preferably based on the SQL query given above in this post, with some modifications.) I tried this, but it does not work as expected:
SELECT (select survey_type from survey where id = pr.form_type) as "form type",
pr.date as "Date",
pre.region_text as region,
count(pr.id) as "Form number"
FROM result pr
inner join district pd on pr.district_id=pd.id
inner join departement pdep on pd.departement_id=pdep.id
inner join region pre on pdep.region_id=pre.id
GROUP by date,form_type,pre.id
ORDER by date;
And where can I put the code given behind this example to preview it in my own instance of Pentaho? I need to know how to reproduce it.
What I want to know
The good way to build a CDE chart in Pentaho:
How does the query need to be formatted? (How are fields organised on the dashboard, what is the maximum number of fields, ...)
What is the difference between MDX queries and SQL queries, and what is the purpose of each?
What is the best way to build a chart between those two types (MDX and SQL)?
How can I turn my relational database into a Mondrian cube if I want to use MDX queries (or should I redesign the database as a data warehouse using Kettle)?
Thank you for your answers.
First of all, you should realize that you're asking a lot here. Having said that, you've pretty much done what I did when I first started with Pentaho, which was experiment. A lot.
Regarding your questions, I have some links which should help you (if you haven't checked them already):
http://pentaho-bi-suite.blogspot.be/2014/01/inter-panel-communication-in-pentaho.html
http://holowczak.com/getting-started-with-pentaho-community-edition-dashboard-editor-cde/
The first link is a very good blog on which I have found several answers regarding dashboards.
The second link is more of an overall tutorial.
There is no general "best way" (apart from applying general best practices, of course) for creating dashboards. I suggest you keep trying (getting to know all of the properties and settings along the way) and find out what method works best for you.
Regarding your questions about MDX and Mondrian, I haven't had much experience in these areas, but as I understand it, MDX queries run against Mondrian cubes, which you prepare in Pentaho's Mondrian Schema Workbench.
http://mondrian.pentaho.com/documentation/olap.php
I believe this should answer (at least some of) your questions. Trying many different things and experimenting will get you quite far, as you'll pick up plenty of small things one at a time.
I will elaborate a little bit on this.
As dooms stated, you ask a lot of things here, but I am glad you are trying to create some great dashboards.
In order to format charts and tune them, I remember I had to learn some JavaScript/jQuery.
The difference between SQL and MDX: they are completely different, even when the syntax sometimes looks similar. You use SQL to query relational databases, whereas MDX is used to query cubes. If you don't have cubes in place, you need to use SQL, of course. If you do, you should ask the cube developer to introduce you to this world. Basically, cubes are good at aggregating data and allow you to easily interact and perform ad-hoc analysis; they are intended to let business analysts explore the data more easily. I am an MDX fan, but I would recommend you explore newer alternatives to multidimensional cubes, like tabular models or other in-memory technologies.
The best way to build a chart has nothing to do with MDX or SQL. It depends on where your data is stored. The most important thing is to have a good data model behind it.
Again, depending on your architecture, you should have a multidimensional model in your data mart, without snowflaking if possible. That allows you both to write simple SQL queries and to keep the cube design straightforward. Designing cubes requires some extra skills. I would try to have a clean data model first and then evaluate whether a cube is required.
I hope this gives you some light; it is not easy to answer the broad questions you asked. The important thing is to define the scope of your project.
Kind Regards,

How to implement Association Rules Analysis or Market Basket Analysis from scratch?

I went through numerous articles trying to understand what my first step should be to incorporate associative analysis (maybe Market Basket Analysis) into my system. They all go deep into the implementation of the algorithm, but no one talked about how to store the data in the first place.
I would really appreciate it if someone could give me some starting pointers or article links I can begin with.
The first thing I want to implement is to track user clicks and provide suggestions based on tracked data.
E.g. a user clicked on link A and subsequently on link B and link C. I can track this activity with some associated metadata (user, user organization, user role, etc.).
I do not want it to be limited only to links. In the future, I want to add a number of similar use cases to the system and make it smart. E.g. if a user sets specific values for fields A and B, most likely he/she will set value <bla> for field C.
My system may generate several thousand such data points in a day (e.g. user clicks, field selections, etc.).
Below are my questions:
How should I store my data? Go SQL or NoSQL? (I briefly looked into MongoDB and it looked promising.)
What tool should I use to perform the associative analysis? Are there any open source tools I can use?
It depends. Is your data suitable for NoSQL databases? To answer this question it's better to read about the CAP theorem and its case studies: https://en.wikipedia.org/wiki/CAP_theorem or http://robertgreiner.com/2014/06/cap-theorem-explained/. Sometimes you want Consistency (depending on your data) and Availability, in which case it's better to use a relational database like MySQL (try to read case studies and analyse your data to pick the best tool).
There is a large number of open source libraries, but in my opinion it's better to first read up on some concepts and algorithms. Try searching for the Apriori, ECLAT and FP-GROWTH algorithms and get the concepts behind them; then you can pick a tool or write the code yourself (a minimal sketch in that spirit follows the list of tools below). Some useful tools (depending on your programming language):
Python: https://github.com/asaini/Apriori, https://github.com/enaeseth/python-fp-growth, https://github.com/enaeseth/python-fp-growth/blob/master/fp_growth.py
PHP: https://github.com/sigidhanafi/fp-growth-php
JAVA: https://github.com/goodinges/FP-Growth-Java, http://www.philippe-fournier-viger.com/spmf/
Also you can use Spark: https://spark.apache.org/docs/1.1.1/mllib-guide.html
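
If you want to try the "from scratch" route first, here is a hedged, minimal Apriori-style sketch in Kotlin, just to show the core idea of frequent-itemset mining over click data. The session format (one set of clicked links per user session) and the simplified candidate generation are assumptions; the libraries listed above do this far more efficiently:

// Return every itemset that occurs in at least minSupport transactions.
fun apriori(transactions: List<Set<String>>, minSupport: Int): Map<Set<String>, Int> {
    val frequent = mutableMapOf<Set<String>, Int>()

    // Start with frequent 1-itemsets.
    var candidates: Set<Set<String>> = transactions.flatten()
        .groupingBy { it }.eachCount()
        .filterValues { it >= minSupport }
        .keys.map { setOf(it) }.toSet()

    while (candidates.isNotEmpty()) {
        // Count support for the current candidates and keep the frequent ones.
        val counts = candidates
            .associateWith { c -> transactions.count { it.containsAll(c) } }
            .filterValues { it >= minSupport }
        frequent += counts

        // Grow surviving itemsets by one item to form the next round of candidates.
        val items = counts.keys.flatten().toSet()
        candidates = counts.keys
            .flatMap { itemset -> items.filter { it !in itemset }.map { itemset + it } }
            .toSet()
    }
    return frequent
}

fun main() {
    val sessions = listOf(
        setOf("linkA", "linkB", "linkC"),
        setOf("linkA", "linkB"),
        setOf("linkB", "linkC")
    )
    // Itemsets clicked together in at least 2 sessions, e.g. {linkA, linkB} -> 2.
    println(apriori(sessions, minSupport = 2))
}

Once the frequent itemsets are computed, association rules (A => B with a confidence value) are derived by comparing the support of an itemset with the support of its subsets.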

Best practice to map OrientDB ORecordId onto a RESTful-friendly ID representation

We are looking into OrientDB as our persistence solution behind a RESTful web service, because a graph DB would be a perfect match for our use case. One of the things we have noticed is that entities (both vertices and edges) are uniquely identified by an ORecordId containing '#${clusterId}:${clusterPosition}'. In a RESTful API, based on my personal experience with relational DBs, you typically have several solutions to identify entities uniquely, for example:
UUID's, generated in code and persisted on DB level
Long/Int values, generated on DB level incrementally
etc...
The problem is that the format "#${clusterId}:${clusterPosition}" is not really URL/REST friendly (example: .../api/user/[#${clusterId}:${clusterPosition}]/address). Do you have any advice/experience on how to deal with this, keeping in mind that you need a bi-directional mapping between the ORecordId and the "RestFulFriendlyId"?
Any hints and best practices based on experience would be truly appreciated.
Best regards,
Bart
We're looking into using HashID: http://hashids.org/
There are some minor concerns we still have, but theoretically, HashID should get you a hashed RID which is also convertible back, so it won't take up more storage space (as a UUID would). It will just take a small bit of CPU time.
Please note, this little tool is not in any way a true hash, in the sense of making it very hard to recover the original value. It is more about good obfuscation. If you are at all worried about the RIDs being known, this isn't a proper solution.
Scott
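
To sketch the HashID idea above: with the Java Hashids library (org.hashids:hashids) the mapping is bi-directional because a RID is really just two numbers. A hedged Kotlin sketch; the salt and the RID parsing are illustrative assumptions:

import org.hashids.Hashids

val hashids = Hashids("some-service-specific-salt")

// "#12:345" -> a short, opaque, URL-safe string
fun ridToPublicId(rid: String): String {
    val (cluster, position) = rid.removePrefix("#").split(":").map { it.toLong() }
    return hashids.encode(cluster, position)
}

// ...and back again, so no extra mapping table or UUID column is needed.
fun publicIdToRid(publicId: String): String {
    val (cluster, position) = hashids.decode(publicId).toList()
    return "#$cluster:$position"
}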
Actually, I'd say the RIDs are very RESTful, if you do this:
.../domain.com/other-segments/{cluster}/{position}/...
Since clusters are a "superset" of a specific class (i.e. one class will have one or more clusters), this can be thought of as identifying the target data object by type/record. I'm not sure what backend you're using, but extracting those two URL segments and recombining them to #x:y should be a fairly simple (and maybe mostly automatic) task.
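
For what it's worth, that extraction and recombination really is only a couple of lines in most backends. A hedged Kotlin sketch with hypothetical helper names (framework routing and validation omitted):

// "#12:345" -> "12" / "345", for building .../api/user/12/345/address
fun ridToSegments(rid: String): Pair<String, String> {
    val (cluster, position) = rid.removePrefix("#").split(":")
    return cluster to position
}

// path segments "12" and "345" -> "#12:345" for the OrientDB query
fun segmentsToRid(cluster: String, position: String): String = "#$cluster:$position"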