What is the "Tableau" way to deal with changing data? - tableau-api

As a background to this question: I've been using Tableau for some time now, but I've been using code (Python, Swift, etc) as a crutch for getting some of the more complicated things done. My employer is now making me move what I can away from custom code and into retail software packages because it will make things easier to maintain if I get hit by a bus or something.
The scenario: With code, I find it very easy to deal with constantly changing/growing data by using recursion. I know that this isn't something I can do with Tableau, but I've found that for many problems so far there is a "Tableau way" of thinking/doing that can solve a lot of problems. And, I'm not allowed to use Rserve/TabPy.
I have a batch of transactional data that grows every month by about 1.6mil records. What I would like to do is build something in Tableau that can let me track a complicated rolling total across the data without having to do it manually. In my code of choice, it would have been something like:
Import the data into a frame
For every unique date value in the 'transaction date' field, create a new column with that name
Total the number of transactions in each account for that day
Write the data to the applicable column
Move on to the next day
Then create new columns that store the sum total of transactions for that account over all of the 30 day periods available (date through date + 29 days)
Select the max value of the accounts for a customer for those 30-day sums
Dump all of that 30-day data into a new table based on the customer identifier
It's a lot of steps, but with a couple of nice recursive functions, it's done in a snap with a bit of code. Plus, it can handle the data as it changes.
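For reference, the kind of thing I'd normally write looks roughly like this in pandas (not the recursive version, and the column names are just placeholders for whatever the real extract uses):

```python
import pandas as pd

# Placeholder column names: customer_id, account_id, transaction_date.
df = pd.read_csv("transactions.csv", parse_dates=["transaction_date"])
df["transaction_date"] = df["transaction_date"].dt.floor("D")

# Transactions per account per day.
daily = (
    df.groupby(["customer_id", "account_id", "transaction_date"])
      .size()
      .rename("txn_count")
      .reset_index()
      .sort_values("transaction_date")
)

# Trailing 30-day totals per account; the maximum over trailing windows is
# the same as the maximum over "date through date + 29 days" windows.
sum_30d = (
    daily.set_index("transaction_date")
         .groupby(["customer_id", "account_id"])["txn_count"]
         .rolling("30D")
         .sum()
         .rename("sum_30d")
         .reset_index()
)

# Best 30-day total across each customer's accounts.
best_per_customer = sum_30d.groupby("customer_id")["sum_30d"].max().reset_index()
```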
The actual question: How in the world do I approach problems like this within Tableau since my brain goes straight to recursive function land? I can do this manually with Tableau Prep, but it takes manual tweaking every time the data changes. Is there a better way, or is this just not within the realm of what Tableau really does?

Related

Data retention in timescaledb

Trying to wrap my head around timescaledb, but my google-fu is failing me. Most likely because I'm not searching for the correct term.
With RRD tool, old data can be stored as averages, reducing the amount of data being stored.
I can't seem to find out how to do this with timescaledb. I'd like 5 minute resolution for 90 days, but after that, it's pointless to keep all those data points, and I'd like to reduce it to 30 or 60 minute averages for a couple years, then maybe daily averages after that.
Is this something that I can set in the database itself, or is this something I would have to implement in a housekeeping job?
We had the exact same question half a year ago.
The term "Data Retention" is also used by the timescaledb team. It is currently implemented using drop_chunks policies (see their doc here). It's a Enterprise feature but IMHO not (yet) as useful as it could/should be (and it surely does not do what you are looking for).
Let me explain: probably the easiest approach to down-sampling your data is Continuous Aggregates (their doc here). You can quite easily aggregate virtually any numeric value to whatever resolution you desire. However, Continuous Aggregates will be affected by the deletions of the drop_chunks too. Your data is gone.
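For illustration, creating such a continuous aggregate from Python might look roughly like this. Table and column names are invented, and the exact DDL depends on your TimescaleDB version; this assumes a 2.x release, where continuous aggregates are materialized views:

```python
import psycopg2

# Hypothetical hypertable metrics(time, device_id, value) at 5-minute resolution.
# Continuous-aggregate DDL cannot run inside a transaction, hence autocommit.
DDL = """
CREATE MATERIALIZED VIEW metrics_hourly
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', time) AS bucket,
       device_id,
       avg(value) AS avg_value
FROM metrics
GROUP BY bucket, device_id
WITH NO DATA;
"""

conn = psycopg2.connect("dbname=tsdb user=postgres")
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute(DDL)
conn.close()
```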
One workaround would be to create other Hypertables instead. Then, create your own background workers copying the data from the original, hi-res table to these new lo-res Hypertables.
For housekeeping, either use the Data Retention Enterprise feature or create your own background workers.
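If you go the do-it-yourself route, the housekeeping job can be a plain cron-driven script. A minimal sketch (all names invented; the drop_chunks call uses the signature of recent TimescaleDB releases, so check it against your version):

```python
import psycopg2

# Roll anything older than 90 days up into a hand-made lo-res hypertable,
# then drop the old hi-res chunks. Table and column names are illustrative;
# in practice you would also track a high-water mark to avoid re-inserting
# rows that were already rolled up.
ROLLUP = """
INSERT INTO metrics_hourly_archive (bucket, device_id, avg_value)
SELECT time_bucket('1 hour', time), device_id, avg(value)
FROM metrics
WHERE time < now() - INTERVAL '90 days'
GROUP BY 1, 2;
"""

DROP_OLD = "SELECT drop_chunks('metrics', INTERVAL '90 days');"

conn = psycopg2.connect("dbname=tsdb user=postgres")
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute(ROLLUP)
    cur.execute(DROP_OLD)
conn.close()
```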

Reporting of workflow and times

I have to start moving transactional data into a reporting database, but would like to move towards a more warehouse/data mart design, eventually leveraging SQL Server Analytics.
The thing being measured is the time between points of a workflow on a piece of work. How would you model that when the things that can happen do not have a specific order? Also, some work won't have all the actions, or might have the same action multiple times.
It makes me want to put the data into a typical relational design, with one table for the piece of work (the key) and another table that has all the actions and times. Is that wrong? The business is going to try to use Tableau for report writing, and I know it can handle all kinds of sources, but again, I would like to move away from transactional tables and toward warehousing.
The work is the dimension and the actions and times are the facts?
Is there any other good online resources for modeling questions?
Thanks
It may seem like splitting hairs, but you don't want to measure the time between points in a workflow; you need to measure the time within a point of the workflow. If you change your perspective, it can become much easier to model.
Your OLTP system will likely capture the timestamp of when the event occurred. When you convert that to OLAP, you should turn that into a start & stop time for each event. While you're at it, calculate the duration, in seconds or minutes, and the occurrence number for the event. If the task was sent to "Design" three times, you should have three design events, numbered 1,2,3.
If you want to know how much time a task spent in design, the cube will sum the duration of all three design events to present a total time. You can also create calculated measures to determine first time in and last time out.
Having the start and stop times allows you to, for example, find all of the tasks that finished design in January.
If you're looking for an average above the event grain, for example what is the average time in design across all tasks, you'll need to do a new calculated measure using total time in design/# tasks (not events).
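To make that concrete, here is a small pandas sketch (all names and numbers are invented) that turns an event log into start/stop/duration/occurrence facts and then computes the average time in Design per task rather than per event:

```python
import pandas as pd

# Invented example: one OLTP row per state change for a single task.
events = pd.DataFrame({
    "task_id": [101, 101, 101, 101],
    "state": ["Design", "Review", "Design", "Done"],
    "event_time": pd.to_datetime([
        "2024-01-02 09:00", "2024-01-03 14:00",
        "2024-01-04 08:00", "2024-01-05 17:00",
    ]),
}).sort_values(["task_id", "event_time"])

# Each event ends when the next event for the same task begins.
events["start_time"] = events["event_time"]
events["stop_time"] = events.groupby("task_id")["event_time"].shift(-1)
events["duration_min"] = (
    (events["stop_time"] - events["start_time"]).dt.total_seconds() / 60
)

# Occurrence number per (task, state): the third visit to Design gets 3.
events["occurrence"] = events.groupby(["task_id", "state"]).cumcount() + 1

fact = events[["task_id", "state", "occurrence",
               "start_time", "stop_time", "duration_min"]]

# Average time in Design per task (not per event), as described above.
design = fact[fact["state"] == "Design"]
avg_design_minutes_per_task = (
    design["duration_min"].sum() / design["task_id"].nunique()
)
```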
Assuming you have more granular states, it is a good idea to define parent states for use in executive reporting. In my company, the operational teams have workflows with 60+ states, but management wanted them rolled up into five summary states. The rollup hierarchy should be part of your workflow states dimension.
Hope that helps.

Simulating an Oracle sequence with MongoDB

Our domain model deals with sales invoices, each of which has a unique, automatically generated number. When creating an invoice, our SalesInvoiceService retrieves a number from a SalesInvoiceNumberGenerator, creates a SalesInvoice using this number and a few other objects (seller, buyer, issue date, etc.) and stores it through the SalesInvoiceRepository. Since we are using MongoDB as our database, our MongoDbSalesInvoiceNumberGenerator uses a findAndModify command with $inc 1 on a given InvoicePolicies.nextSalesInvoiceNumber to generate this unique number, similar to what we would do with an Oracle sequence.
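In pymongo terms the generator boils down to something like this (collection and field names are approximate):

```python
from pymongo import MongoClient, ReturnDocument

client = MongoClient("mongodb://localhost:27017")
db = client["sales"]

def next_sales_invoice_number():
    # Atomically increment and return the counter; this is the
    # findAndModify/$inc call described above. Names are illustrative.
    doc = db.InvoicePolicies.find_one_and_update(
        {"_id": "salesInvoiceNumber"},
        {"$inc": {"nextSalesInvoiceNumber": 1}},
        upsert=True,
        return_document=ReturnDocument.AFTER,
    )
    return doc["nextSalesInvoiceNumber"]
```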
This works in normal situations. However, when invoice creation fails because of a broken business rule (e.g. an invalid issue date), an exception is thrown and our InvoicePolicies.nextSalesInvoiceNumber has already been incremented. Obviously, since there is no transaction managing this unit of work, the increment is not rolled back, so we end up with lost invoice numbers. We do offer a manual compensation mechanism to the user, but we would like to avoid this sort of situation in the first place.
How would you deal with this situation? And no, switching to another database is not option :)
Thanks!
TL;DR: What you want is strict serializability, but you probably won't get it, unless you give up concurrency completely (then you even get linearizability, theoretically). Gap-free is easy, but making sure that today's invoice doesn't get a lower number than yesterday's is practically impossible.
This is tricky, or at least, very expensive. That is also true for any other data store, because you'll have to limit the concurrency of the application to guarantee it. Think of an auto-increasing stamp that is passed around in an office, but some office workers lose letters. Tricky... But you can reduce the likelihood.
Generating sequences without gaps is hard when contention is high, and very hard in a distributed system. Keeping a lock for the entire time the invoice is generated is usually not an option, though that would be easy. So let's try that:
Easiest way out: Use a singleton background worker, i.e. a single-threaded process that runs on a single machine. Have it explicitly check whether the current number is really present in the invoice collection. Because it's single-threaded on a single machine, it can't have race conditions. Done, via limiting concurrency.
When allowing concurrency, things get messy:
It might be best to use something like a two-phase commit protocol. Essentially, make the entire invoice creation process a long-running transaction, and store the pending transactions explicitly, i.e. store all numbers that have been reserved but not yet used.
Then track the completion status of each and every transaction. If a transaction hasn't finished after some timeout, consider that number available again. It's hard enough to add that to the counter code, but it's possible (check if a timed out transaction is present, otherwise get a new counter value).
There are several possible errors, but they can all be resolved. This is better explained in the link and on the net. Generally, getting the implementation right is hard though.
The timeout poses a problem, however, because you need to hard-code an assumption about the time it takes for invoices to be generated. That can be awkward close to day/month/year barriers, since you'll want to avoid creating invoice 12345 in 2015 and 12344 in 2014.
Even this won't guarantee gap free numbers for limited time intervals: if no more request is made that could use the gap number in the current year, you're facing a problem.
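A very rough pymongo sketch of the reservation idea (all names invented, error handling omitted, not production-ready):

```python
from datetime import datetime, timedelta, timezone
from pymongo import MongoClient, ReturnDocument

client = MongoClient("mongodb://localhost:27017")
db = client["sales"]

# Assumed upper bound on how long invoice creation may take.
RESERVATION_TIMEOUT = timedelta(minutes=5)

def reserve_invoice_number():
    now = datetime.now(timezone.utc)
    # First try to recycle a reservation that timed out without completing.
    stale = db.invoice_number_reservations.find_one_and_update(
        {"status": "pending", "reserved_at": {"$lt": now - RESERVATION_TIMEOUT}},
        {"$set": {"reserved_at": now}},
        return_document=ReturnDocument.AFTER,
    )
    if stale:
        return stale["number"]
    # Otherwise take a fresh number from the counter and record the reservation.
    counter = db.InvoicePolicies.find_one_and_update(
        {"_id": "salesInvoiceNumber"},
        {"$inc": {"nextSalesInvoiceNumber": 1}},
        upsert=True,
        return_document=ReturnDocument.AFTER,
    )
    number = counter["nextSalesInvoiceNumber"]
    db.invoice_number_reservations.insert_one(
        {"number": number, "status": "pending", "reserved_at": now}
    )
    return number

def mark_invoice_committed(number):
    db.invoice_number_reservations.update_one(
        {"number": number}, {"$set": {"status": "committed"}}
    )
```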
I wonder whether combining findAndModify with the newer Transactions API could achieve something like that, and also account for gaps, if it all ran within a transaction. I haven't personally tried it, and my project isn't far enough along to worry about the billing system yet, but I would love to be able to use the same database for everything to make things a bit easier to operate.
One problem, I'd think, is a write bottleneck on the counter, but each increment should only take a few milliseconds, and you could use a different counter for every jurisdiction or store, the way real-life stores do. The cash register number could be part of it too; in the digital world the "register" could be the transaction-processing server the request went to (say, if you used microservices), so you could load-balance round robin between them. That assumes it uses a per-document lock, which from my understanding it probably does.
The main time I'd worry about this bottleneck is with a very popular store, around Black Friday when there's a huge spike, or when generating recurring invoices.
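For what it's worth, the version I have in mind would look roughly like this in pymongo (untested; multi-document transactions require a replica set, and all names are invented):

```python
from pymongo import MongoClient, ReturnDocument

client = MongoClient("mongodb://localhost:27017")  # transactions need a replica set
db = client["sales"]

def create_invoice(invoice_fields):
    # Assumes the counter document and both collections already exist.
    with client.start_session() as session:
        with session.start_transaction():
            counter = db.InvoicePolicies.find_one_and_update(
                {"_id": "salesInvoiceNumber"},
                {"$inc": {"nextSalesInvoiceNumber": 1}},
                return_document=ReturnDocument.AFTER,
                session=session,
            )
            invoice = dict(invoice_fields, number=counter["nextSalesInvoiceNumber"])
            db.salesInvoices.insert_one(invoice, session=session)
            # If a business rule fails and raises here, the whole transaction
            # aborts and the counter increment is rolled back with it.
            return invoice
```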

Calculating and reporting Data Completeness

I have been working on measuring data completeness and creating actionable reports for our HRIS system for some time.
Until now I have used Excel, but now that the reporting requirements have stabilized and the need for quicker response times has increased, I want to take the work to another level. At the same time, I would also like more detailed options for distinguishing between different units.
As an example I am looking at missing fields. So for each employee in every company I simply want to count how many fields are missing.
For other fields I am looking to validate data - like birthdays compared to hiring dates, threshold for different values, employee groups compared to responsibility level, and so on.
My question is where to move from here. Is there any language that is better than the others for importing lists, evaluating fields in those lists, and then quantifying the results at company and other levels? I want to be able to extract data from our different systems, then have a program do all the calculations and summarize the findings in some way. (I consider it to be a good learning experience.)
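For example, the missing-field count and validations I have in mind are essentially this, shown here in pandas with invented column names:

```python
import pandas as pd

# Hypothetical HR extract with the kinds of fields mentioned above.
employees = pd.read_csv("hr_extract.csv", parse_dates=["birth_date", "hire_date"])

required = ["first_name", "last_name", "birth_date", "hire_date",
            "department", "responsibility_level"]

# Completeness: number of missing required fields per employee,
# then summarized per company.
employees["missing_fields"] = employees[required].isna().sum(axis=1)
per_company = employees.groupby("company")["missing_fields"].agg(
    total_missing="sum",
    employees_with_gaps=lambda s: (s > 0).sum(),
)

# Validation examples: hire date must come after birth date, and the
# employee must be at least 16 years old at hiring (threshold is made up).
employees["hired_before_born"] = employees["hire_date"] <= employees["birth_date"]
employees["too_young_at_hire"] = (
    (employees["hire_date"] - employees["birth_date"]).dt.days < 16 * 365
)
```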
I've done something like this in the past and sort of cheated. I wrote a program that ran nightly, identified missing fields (not required by the system, but necessary for data integrity) and dumped them to an incomplete-record table that was cleared each night before the process ran. I then sent batch emails listing the missing element(s) to each responsible group (Payroll/Benefits/Compensation/HR Admin) so the missing data could be added. I used .NET against an Oracle database and sent emails via Lotus Notes, but a similar design should work in just about any environment.
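A present-day Python sketch of that nightly job might look like this; column names, the field-to-team mapping, and the mail settings are all placeholders:

```python
import smtplib
from email.message import EmailMessage

import pandas as pd

# Which team owns which field -- purely illustrative.
FIELD_OWNER = {
    "salary": "Payroll", "benefit_plan": "Benefits",
    "grade": "Compensation", "manager_id": "HR Admin",
}
TEAM_EMAIL = {
    "Payroll": "payroll@example.com", "Benefits": "benefits@example.com",
    "Compensation": "comp@example.com", "HR Admin": "hradmin@example.com",
}

employees = pd.read_csv("hr_extract.csv")

# Rebuild the "incomplete records" list from scratch each night.
incomplete = []
for field, team in FIELD_OWNER.items():
    missing = employees[employees[field].isna()]
    for emp_id in missing["employee_id"]:
        incomplete.append({"employee_id": emp_id, "field": field, "team": team})
incomplete = pd.DataFrame(incomplete)

# One batch email per responsible team.
with smtplib.SMTP("smtp.example.com") as smtp:
    for team, rows in incomplete.groupby("team"):
        msg = EmailMessage()
        msg["Subject"] = f"{len(rows)} missing HR fields for {team}"
        msg["From"] = "hris-completeness@example.com"
        msg["To"] = TEAM_EMAIL[team]
        msg.set_content(rows.to_string(index=False))
        smtp.send_message(msg)
```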

Calculating price drop Apps or Apps gonna free - App Store

I am working on a website that displays all the apps from the App Store. I get the App Store data via their EPF Data Feeds, using the EPF Importer. In that database I get the pricing of each app for every store. The data set contains a huge number of rows, and the table structure looks like this:
application_price
The retail price of an application.

Name             Key   Description
export_date            The date this application was exported, in milliseconds since the UNIX Epoch.
application_id   Y     Foreign key to the application table.
retail_price           Retail price of the application, or null if the application is not available.
currency_code          The ISO3A currency code.
storefront_id    Y     Foreign key to the storefront table.
That is the table I get. My problem is that I cannot find a way to calculate the price reductions and the newly free apps from this particular data set. Does anyone have an idea how I can calculate it?
Any idea or answer will be highly appreciated.
I tried storing the previous data and the current data and then matching them. The problem is that the table itself is too large, and the comparison requires a JOIN that pushes the query execution time past an hour, which I cannot afford. There are approximately 60,000,000 rows in the table.
With these fields you can't directly determine price drops or new application. You'll have to insert these in your own database, and determine the differences from there. In a relational database like MySQL this isn't too complex:
To determine which applications are new, you can add your own column "first_seen", and then query your database for all rows where the first_seen value is no more than a day old.
To calculate price drops you'll have to calculate the difference between the retail_price of the current import, and the previous import.
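To make the comparison concrete, here is a sketch of the diff between two daily imports (file names are placeholders; at 60 million rows you would do this in the database or in chunks, but the logic is the same):

```python
import pandas as pd

# Two daily snapshots of application_price, with the columns shown above.
today = pd.read_csv("application_price_today.csv")
yesterday = pd.read_csv("application_price_yesterday.csv")

merged = today.merge(
    yesterday,
    on=["application_id", "storefront_id"],
    how="left",
    suffixes=("", "_prev"),
)

# New apps: present today, absent from yesterday's import.
new_apps = merged[merged["retail_price_prev"].isna()]

# Price drops, including apps that just went free.
drops = merged[merged["retail_price"] < merged["retail_price_prev"]]
now_free = drops[drops["retail_price"] == 0]
```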
Since you've edited your question, my edited answer:
It seems like you're having storage/performance issues, and you know what you want to achieve. To solve this you'll have to start measuring and debugging: with data sets this large you'll have to make sure you have the correct indexes. Profiling your queries should help you find out whether they do.
And your environment is probably "write once a day, read many times a minute" (I'm guessing you're building a website), so you could speed up the frontend by computing the differences (price drops and new applications) on import, rather than when displaying them on the website.
If you still are unable to solve this, I suggest you open a more specific question, detailing your DBMS, queries, etc, so the real database administrators will be able to help you. 60 million rows are a lot, but with the correct indexes it should be no real trouble for a normal database system.
Compare the table with one you've downloaded the previous day, and note the differences.
Added:
For only 60 million items, and on a contemporary PC, you should be able to store a sorted array of the store id numbers and previous prices in memory, and do an array lookup faster than the data is arriving from the network feed. Mark any differences found and double-check them against the DB in post-processing.
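A sketch of that in-memory approach, using a Python dict rather than a sorted array and with invented file names:

```python
import csv

# Build an in-memory map of (application_id, storefront_id) -> previous price
# from yesterday's import, then stream today's feed against it.
previous = {}
with open("application_price_yesterday.csv", newline="") as f:
    for row in csv.DictReader(f):
        key = (row["application_id"], row["storefront_id"])
        previous[key] = row["retail_price"]

price_drops, new_apps = [], []
with open("application_price_today.csv", newline="") as f:
    for row in csv.DictReader(f):
        key = (row["application_id"], row["storefront_id"])
        old = previous.get(key)
        if old is None:
            new_apps.append(key)                      # never seen before
        elif row["retail_price"] and old and float(row["retail_price"]) < float(old):
            price_drops.append(key)                   # cheaper than yesterday
```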
Actually, I am also playing with these data, and I think the best approach for you is based on how Apple delivers them.
You have two types of data: full and incremental (updated daily). Within the new data from the incremental feed (not nearly as big as the full one), you only need to compare the records that were updated and insert them into another table to record whose pricing has changed.
That way you have a list of records (apps, songs, videos...) updated daily with price changes; just read from the new table you created instead of comparing or joining the various large tables.
Cheers