I have to start moving transactional data into a reporting database, but would like to move towards a more warehouse/data mart design, eventually leveraging SQL Server Analysis Services.
The thing that is being measured is the time between points of a workflow on a piece of work. How would you model that when the things that can happen do not have a specific order? Also, some work won't have all the actions, or might have the same action multiple times.
It makes me want to put the data into a typical relational design, with one table for the piece of work (the key) and a table that has all the actions and times. Is that wrong? The business is going to try to use Tableau for report writing, and I know it can handle all kinds of sources, but again, I would like to move away from a transactional design and into warehousing.
Is the work the dimension, with the actions and times as the facts?
Are there any other good online resources for modeling questions?
Thanks
It may seem like splitting hairs, but you don't want to measure the time between points in a workflow; you need to measure time within a point of a workflow. If you change your perspective, it can become much easier to model.
Your OLTP system will likely capture the timestamp of when the event occurred. When you convert that to OLAP, you should turn that into a start & stop time for each event. While you're at it, calculate the duration, in seconds or minutes, and the occurrence number for the event. If the task was sent to "Design" three times, you should have three design events, numbered 1,2,3.
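As a rough sketch of that shaping step (assuming a pandas staging job; the table layout and column names below are made up, not your schema):

```python
import pandas as pd

# Toy OLTP event log: one row per workflow event, with the timestamp of when it occurred.
events = pd.DataFrame({
    "task_id":  [1, 1, 1, 1],
    "state":    ["Design", "Review", "Design", "Done"],
    "event_ts": pd.to_datetime(["2020-01-02 09:00", "2020-01-03 14:00",
                                "2020-01-04 10:00", "2020-01-06 08:00"]),
}).sort_values(["task_id", "event_ts"])

# The stop time of an event is the start time of the task's next event.
events["start_ts"] = events["event_ts"]
events["stop_ts"] = events.groupby("task_id")["event_ts"].shift(-1)

# Duration in minutes, plus the occurrence number per task and state
# (a task sent to Design twice gets Design 1 and Design 2).
events["duration_min"] = (events["stop_ts"] - events["start_ts"]).dt.total_seconds() / 60
events["occurrence"] = events.groupby(["task_id", "state"]).cumcount() + 1
```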
If you want to know how much time a task spent in design, the cube will sum the duration of all three design events to present a total time. You can also add some calculated measures to determine first time in and last time out.
Having the start & stop times of the task allows you to, for example, find all of the tasks that finished design in January.
If you're looking for an average above the event grain, for example the average time in design across all tasks, you'll need a new calculated measure: total time in design / number of tasks (not events).
Assuming you have more granular states, it is a good idea to define parent states for use in executive reporting. In my company, the operational teams have workflows with 60+ states, but management wanted them rolled up into five summary states. The rollup hierarchy should be part of your workflow states dimension.
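To illustrate both of those points, a small follow-on sketch: the per-task average divides total Design time by the number of distinct tasks, and a summary_state column on the states dimension drives the rollup (the state names and mapping below are invented for illustration):

```python
import pandas as pd

# Event-grain fact rows (one per state visit), as produced by the shaping step above.
facts = pd.DataFrame({
    "task_id":      [1, 1, 2, 2],
    "state":        ["Design", "Review", "Design", "Design"],
    "duration_min": [120.0, 45.0, 60.0, 30.0],
})

# Average time in Design per task = total Design duration / number of distinct tasks (not events).
design = facts[facts["state"] == "Design"]
avg_design_per_task = design["duration_min"].sum() / design["task_id"].nunique()  # 105.0

# Rollup hierarchy carried on the workflow-states dimension.
states_dim = pd.DataFrame({
    "state":         ["Design", "Review"],
    "summary_state": ["Build",  "Build"],
})
time_by_summary = (facts.merge(states_dim, on="state")
                        .groupby("summary_state")["duration_min"].sum())
```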
Hope that helps.
Related
Kind of related, as allowing infinitely late data would also solve my problem.
I have a pipeline linked to a Pub/Sub topic, from which data can arrive very late (days) as well as nearly in real time: I have to perform simple parsing and aggregation tasks on it, based on event time.
As each element holds a timestamp of the event time, my first idea when looking at the programming guide was to use windowing, setting the element timestamp to the event time.
However, as the processing can happen a long time after the event, I would have to allow very late data, which, based on the answer in the first link I posted, seems not to be such a good idea.
An alternative solution I have come up with would be to simply group the data by event time, without marking it as the element timestamp, and then calculate my aggregates based on that key. As the data for a given event time tends to arrive at roughly the same time in the pipeline, I could then just use windows of arbitrary length based on processing time, so that my data is ingested as soon as it is available.
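A rough sketch of that alternative, assuming the Beam Python SDK (the topic path and the element fields "event_time" and "value" are placeholders):

```python
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import trigger, window

def key_by_event_day(element):
    # Use the event time purely as a grouping key; the element itself keeps
    # its Pub/Sub (publish/processing time) timestamp.
    return (element["event_time"][:10], element["value"])

options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    _ = (
        p
        | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/my-topic")
        | "Parse" >> beam.Map(lambda b: json.loads(b.decode("utf-8")))
        | "KeyByEventDay" >> beam.Map(key_by_event_day)
        # Stay in the global window and fire on processing time, so results
        # are emitted shortly after data shows up, however late it is.
        | "FireOnArrival" >> beam.WindowInto(
            window.GlobalWindows(),
            trigger=trigger.Repeatedly(trigger.AfterProcessingTime(60)),
            accumulation_mode=trigger.AccumulationMode.ACCUMULATING)
        | "SumPerEventDay" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )
```

Note that accumulating panes in the global window keep per-key state around indefinitely, so this is a sketch of the idea rather than a production answer.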
This got me wondering: does that mean that windowing is useless as soon as your data arrives a bit later than real time? Would it be possible to start a window when the first element is received and close it a bit afterwards?
I am currently working on a generalized warehouse model containing all processes that take place in warehouse operations. I just started to work with AnyLogic and I cannot figure out how to implement order picking strategies. My current model is able to receive truckloads containing pallets; the pallets are checked, booked, and stored in a racking system. For the outbound processes of picking, packing, and shipping, I created an order containing a single pallet that moves through all the processes. However, a picking process of only single pallets is not really representative of warehouse operations. Therefore, I want to know if it is possible to implement order picking strategies such as batch picking, wave picking, discrete picking, amongst others. I hope someone can help me out.
Kind regards,
Stefan
What is packaged inside standard AnyLogic is just the ability to easily simulate full pallet moves (receive, putaway, picking, shipping) without a lot of programming. To do that you will just use existing AnyLogic objects: RackStore, RackPick, maybe some MoveTo, Queue, etc.
But if you go beyond that and want to build a realistic warehouse with processing at a lower level of the packing structure (pieces, and maybe even finer: layers, blisters, etc.), that will require quite some coding. Depending on your chosen abstraction level, in the extreme case you may even want to code everything exactly as it's coded in your WMS. Or maybe you simplify something, but it is still quite a modelling task for every new process (pick wave, discrete picking, etc.) you want to implement. So the answer to your question: yes, it's possible, but beware of the high effort.
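Setting AnyLogic specifics aside, the kind of logic you end up writing for, say, batch picking is mostly grouping open order lines into pick tours. A toy sketch of that grouping in Python (the zoning rule and capacity limit are invented; in AnyLogic you would express the same idea in Java inside your agents):

```python
from collections import defaultdict

def build_batches(order_lines, max_lines_per_batch=10):
    """order_lines: list of (order_id, sku, qty) tuples; returns a list of pick batches."""
    by_zone = defaultdict(list)
    for line in order_lines:
        zone = line[1][:1]          # toy zoning rule: first character of the SKU
        by_zone[zone].append(line)

    batches = []
    for zone_lines in by_zone.values():
        # Split each zone's lines into batches no larger than the picker's capacity.
        for i in range(0, len(zone_lines), max_lines_per_batch):
            batches.append(zone_lines[i:i + max_lines_per_batch])
    return batches

print(build_batches([("O1", "A-100", 2), ("O2", "A-200", 1), ("O3", "B-300", 5)]))
```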
As a background to this question: I've been using Tableau for some time now, but I've been using code (Python, Swift, etc) as a crutch for getting some of the more complicated things done. My employer is now making me move what I can away from custom code and into retail software packages because it will make things easier to maintain if I get hit by a bus or something.
The scenario: With code, I find it very easy to deal with constantly changing/growing data by using recursion. I know that this isn't something I can do with Tableau, but I've found that for many problems so far there is a "Tableau way" of thinking/doing that can solve a lot of problems. And, I'm not allowed to use Rserve/TabPy.
I have a batch of transactional data that grows every month by about 1.6mil records. What I would like to do is build something in Tableau that can let me track a complicated rolling total across the data without having to do it manually. In my code of choice, it would have been something like:
Import the data into a frame
For every unique date value in the 'transaction date' field, create a new column with that name
Total the number of transactions in each account for that day
Write the data to the applicable column
Move on to the next day
Then create new columns that store the sum total of transactions for that account over all of the 30-day periods available (date through date + 29 days)
Select the max value of the accounts for a customer for those 30-day sums
Dump all of that 30-day data into a new table based on the customer identifier
It's a lot of steps, but with a couple of nice recursive functions, it's done in a snap with a bit of code. Plus, it can handle the data as it changes.
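As a point of reference for what those steps compute, a rough pandas version might look like the following (the column names are made up):

```python
import pandas as pd

# Toy stand-in for the transactional extract.
txns = pd.DataFrame({
    "customer":         ["A", "A", "A", "A", "B"],
    "account":          ["A1", "A1", "A2", "A1", "B1"],
    "transaction_date": pd.to_datetime(["2020-09-01", "2020-09-01", "2020-09-10",
                                        "2020-10-05", "2020-09-03"]),
})

# Transactions per account per day.
daily = (txns.groupby(["customer", "account", "transaction_date"])
             .size().rename("txn_count").reset_index()
             .sort_values("transaction_date"))

# 30-day rolling total per account. pandas labels each window by its end date
# (date - 29 days through date); for the max-per-customer step this gives the
# same answer as the forward-looking (date through date + 29) windows above.
rolling = (daily.set_index("transaction_date")
                .groupby(["customer", "account"])["txn_count"]
                .rolling("30D").sum()
                .reset_index(name="rolling_30d"))

# Max 30-day total across each customer's accounts.
best_30d = rolling.groupby("customer")["rolling_30d"].max().reset_index()
```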
The actual question: How in the world do I approach problems like this within Tableau since my brain goes straight to recursive function land? I can do this manually with Tableau Prep, but it takes manual tweaking every time the data changes. Is there a better way, or is this just not within the realm of what Tableau really does?
*** Edit 10/1/2020: Minor typo fix. ***
Imagine a large organisation with many applications. The applications are not currently integrated to any great extent. There is a new and empty enterprise data warehouse, and it would store all data in a canonical format. The first step is to set up the warehouse and seed it with data from the applications.
I am looking for pros and cons between the following two enterprise integration patterns:
1) Using a combination of integration tools, set up batch jobs to extract, transform, and load data into the warehouse on a periodic interval. Then, as part of the process, integrate the data from the warehouse to the required applications.
2) Using a combination of integration tools, detect changes in real time or in batch and publish them to a service bus (in canonical format). Then, for each required application, subscribe to the messages to integrate them. The data warehouse is just another subscriber to the same messages.
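To make option 2 concrete, here is a toy sketch of the idea (all names are made up, and the in-memory bus stands in for whatever service bus you use): change events are published once in a canonical shape, and the warehouse is just one more subscriber alongside the applications.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class CanonicalEvent:
    entity: str        # e.g. "Customer" in the canonical model
    key: str           # canonical business key
    payload: dict      # attributes already mapped to canonical names
    occurred_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

class Bus:
    """Stand-in for a real service bus (Kafka, Azure Service Bus, etc.)."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, entity, handler):
        self._subscribers[entity].append(handler)

    def publish(self, event):
        for handler in self._subscribers[event.entity]:
            handler(event)

bus = Bus()
bus.subscribe("Customer", lambda e: print("CRM upserts", e.key))   # an application
bus.subscribe("Customer", lambda e: print("DW loads", e.key))      # the warehouse: another subscriber

bus.publish(CanonicalEvent(entity="Customer", key="C-42", payload={"name": "Acme"}))
```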
Thanks in advance.
One aspect that is hard to get right with integration-via-messages is periodic datasets.
Say you have a table in your data warehouse (DW) that contains data partitioned by day. If an ETL job loads that table, you can be sure that if the load job is finished, the respective dataset is complete (unless there's a bug in the job).
Messaging systems, on the other hand, usually don't provide guarantees of timely delivery. So you might get 90% of messages for a particular day by midnight, 8% within the next hour, and the remaining 2% within the next 6 hours (and a few messages might never arrive). In this situation, if you have a job that depends on this data, how can you know that the dataset is ready? You can set an arbitrary cutoff time (e.g. 1 hour past midnight), based on previous experience, SLAs, or some other criteria, after which you consider the dataset complete, but that will by design be an approximation. You will also need some means to detect missing data (because of lost messages) and re-request it from the source.
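One way to make that cutoff explicit is a completeness check per daily partition, something like the sketch below (the counts and the replay mechanism are placeholders for whatever your source and tooling actually provide):

```python
from datetime import date, datetime, time, timedelta

CUTOFF = time(hour=1)  # treat a day as "complete" one hour past midnight

def partition_status(day: date, source_count: int, dw_count: int, now: datetime) -> str:
    """Classify a message-fed daily partition.

    source_count: rows the source system says it emitted for `day`
    dw_count:     rows that actually arrived in the warehouse for `day`
    """
    past_cutoff = now >= datetime.combine(day + timedelta(days=1), CUTOFF)
    if not past_cutoff:
        return "waiting"      # still inside the agreed lateness window
    if dw_count < source_count:
        return "incomplete"   # lost or very late messages: re-request from the source
    return "ready"            # safe to run downstream jobs

print(partition_status(date(2020, 1, 1), 1000, 980, datetime(2020, 1, 2, 2, 0)))  # "incomplete"
```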
This answer talks about similar problems.
Another issue is backfills. Imagine your source sends a backdated message, for example to correct some previously-sent one that belongs to a dataset in the past. Presumably, any consumers of that dataset need to be notified of the change and recompute their results. However, without some additional logic in the DW they might not know about it. With the ETL approach, since you already have dependencies between jobs, if you rerun some job with a backfill date, its dependencies will run automatically, or at least it'll be explicitly known that some consumers are affected.
With these caveats in mind, the messaging approach has some great advantages:
all your systems will be integrated using a uniform approach
the propagation time for your data will potentially be much lower
you won't have to fix ETL jobs that exploded because the data volume has grown past their ability to scale
you won't get SLA violations because your ETL jobs timed out
I guess you are talking about ETL systems and the Mediation (intra-communication) design pattern. I don't know why you have to choose between them; in my current project we combine them.
The ETL solution is implemented as a layer responsible for managing the data integration (via an orchestrator module). It is a single entry point and part of the Pipes and Filters design pattern that we rely on. It is able to perform a variety of tasks of varying complexity on the information that it processes.
On the other hand, Mediation (as in an EAI system) acts as a "broker" between multiple applications. Whenever an interesting event occurs in an application (for instance, new information is created or a new transaction is completed), an integration module in the EAI system is notified. The module then propagates the changes to the other relevant applications.
So, as a bottom line, I can't give you pros and cons for the two, since to me they are a good solution together, and their use depends on your goals, design, etc. But from your description it seems to me that what you have is similar to what I've suggested.
In general, is it better to store the raw data along with pre-calculated values in the database, and concentrate on keeping those pre-calculated values up to date whenever I create, update, or delete a row, using them for display to the user
OR
is it better to store the raw data and calculate the correct display values on-the-fly?
An example (which is pertinent to my project) would be similar to the following:
You have a timer application. In my case it's using Core Data. It's not connected to the web, but is a self-contained app that runs on a computer or mobile device (user's choice). The app stores a raw start time and a raw end time. The application needs to display the duration of the event and the interval at which the events are occurring. Would it be better to store a pre-calculated "duration" time, and even a pre-formatted duration string that will be used for output, or would it be better to calculate the duration on the fly, so to speak, for display?
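As a small illustration of the calculate-on-the-fly option (sketched in Python purely for brevity; the same idea applies to a Core Data model), you would persist only the raw start and end times and derive the duration and its display string at render time:

```python
from datetime import datetime

def duration_seconds(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds()

def duration_label(start: datetime, end: datetime) -> str:
    # Format the derived duration only when it is about to be displayed.
    total = int(duration_seconds(start, end))
    hours, remainder = divmod(total, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"

print(duration_label(datetime(2020, 1, 1, 9, 0, 0), datetime(2020, 1, 1, 10, 30, 5)))  # 01:30:05
```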
The same goes for the interval, although there's another layer involved, because when I create/delete/update a row in the database, I'll have to update the interval for the items that are affected by this. Or is it better to just calculate as the app executes?
For the record, I'm not trying to micro-optimize. I'm trying to figure out the best way to reduce the amount of code I have to maintain. If performance improves as a result, so be it.
Thoughts?
Generally, you would want to avoid computed values in the DB (derived from existing columns/tables) unless profiling absolutely dictates that they are necessary (i.e., the DB is underperforming or too great a load is being placed on the server). This is even more true for formatting of the data, which should almost always be performed on the client side instead of wasting DB server cycles.
Of course, any data that is absolutely mandatory to perform the calculations should be stored in the database.
When you speak of reducing the amount of code you need to maintain, keep in mind that the DBA needs to maintain stored-proc code and table schemas, too. Moving maintenance responsibilities from Developers to DBAs is not eliminating work, it is just shifting it.
Finally, database changes often cascade to many applications, whereas application changes only affect that application.
The only time I store calculated values in a database is if I need them for historical purposes. You'll see this all the time in accounting software.
For example, if I'm dealing with an invoice, I will typically save the calculated invoice total, because the way that total gets calculated may change later on.
I will also sometimes perform the actual calculation on the database server using views.
As with so many other things, "it depends". For your described case, I would lean towards keeping the calculation in code. If you do choose to use the database, you should use a view to dynamically calculate rather than put in a static value. The risk of changing the start time or end time and forgetting to change the duration would be too high otherwise :)
This really depends on whether you want to be pure (keep your data clean) or fast. Compute capacity on the desktop facilitates purity: high-speed cores and large memory spaces make string composition for table cells possible even with large data sets.
However, on the phone, even an iPhone 4, computing a single NSString for a UITableViewCell over a set of 1000 objects takes a noticeable amount of time, and this can affect your user experience.
So, tune the balance for your use case. The duration doesn't sound like it will change, so I would precalculate and store the duration AND the display string (it feels awful from the perspective of a DBA, but it will render fast on the phone).
For the interval, it sounds like you actually need another entity to relate the interval to a set of events. It would then be easy enough to pre-compute/maintain this calculation as well each time the relationship changes (i.e., you add an entity to the relationship and update the interval).