Calculating and reporting Data Completeness - data-quality

I have been working with measuring the data completeness and creating actionable reports for out HRIS system for some time.
Until now i have used Excel, but now that the requirements for reporting has stabilized and the need for quicker response time has increased i want to move the work to another level. At the same time i also wish there to be more detailed options for distinguishing between different units.
As an example I am looking at missing fields. So for each employee in every company I simply want to count how many fields are missing.
For other fields I am looking to validate data - like birthdays compared to hiring dates, threshold for different values, employee groups compared to responsibility level, and so on.
My question is where to move from here. Is there any language that is better than any of the others when dealing with importing lists, doing evaluations on fields in the lists and then quantify it on company and other levels? I want to be able to extract data from our different systems, then have a program do all calculations and summarize the findings in some way. (I consider it to be a good learning experience.)

I've done something like this in the past and sort of cheated. I wrote a program that ran nightly, identified missing fields (not required but necessary for data integrity) and dumped those to an incomplete record table that was cleared each night before the process ran. I then sent batch emails to each of the different groups responsible for the missing element(s) to the responsible group (Payroll/Benefits/Compensation/HR Admin) so the missing data could be added. I used .Net against and Oracle database and sent emails via Lotus Notes, but a similar design should work on just about any environment.

Related

What is the "Tableau" way to deal with changing data?

As a background to this question: I've been using Tableau for some time now, but I've been using code (Python, Swift, etc) as a crutch for getting some of the more complicated things done. My employer is now making me move what I can away from custom code and into retail software packages because it will make things easier to maintain if I get hit by a bus or something.
The scenario: With code, I find it very easy to deal with constantly changing/growing data by using recursion. I know that this isn't something I can do with Tableau, but I've found that for many problems so far there is a "Tableau way" of thinking/doing that can solve a lot of problems. And, I'm not allowed to use Rserve/TabPy.
I have a batch of transactional data that grows every month by about 1.6mil records. What I would like to do is build something in Tableau that can let me track a complicated rolling total across the data without having to do it manually. In my code of choice, it would have been something like:
Import the data into a frame
For every unique date value in the 'transaction date' field, create a new column with that name
Total the number of transaction in each account for that day
Write the data to the applicable column
Move on to the next day
Then create new columns that store the sum total of transactions for that account over all of the 30 day periods available (date through date + 29 days)
Select the max value of the accounts for a customer for those 30-day sums
Dump all of that 30-day data into a new table based on the customer identifier
It's a lot of steps, but with a couple of nice recursive functions, it's done in a snap with a bit of code. Plus, it can handle the data as it changes.
The actual question: How in the world do I approach problems like this within Tableau since my brain goes straight to recursive function land? I can do this manually with Tableau Prep, but it takes manual tweaking every time the data changes. Is there a better way, or is this just not within the realm of what Tableau really does?
*** Edit 10/1/2020: Minor typo fix. ***

What is the recommended way to create a consolidated data store for large SPSS files that have survey data (having 600–800 columns)?

Hello Everyone I just need your suggestion what is the best way to store the data retrieving from SPSS file and storing into Mongo db or RDBMS or any other .
The data comprises of responses to survey questionnaire which can span upto large number of columns (600-800) depending on the number of questions and other attributes recorded for the respondent and the survey study. Also these surveys are conducted periodically - however it's not necessary that the questions remain exactly the same - these may vary from survey to survey.
The need is to consolidate this data into a uniform structure and enable further analysis over the consolidated data spanning over multiple survey for which again the plan is to use SPSS.
One option I considered was to store data in MongoDB as then there is flexibility on how the schema can be modified across surveys i.e. rigid schema definition part can be avoided. However in this case not sure if SPSS would support working against Mongo
Would be very interested to know if someone has had any experience in this area or could provide some suggestion.
Another thing to consider, if you plan to create generalized jobs that can be run over surveys that are similar but differ in details is to set up a classification system for the variables such as demographic, opinion, economic etc, and assign these using custom attributes when the sav files are created. You can then use these attributes in generalized jobs to determine what to do based on generic properties rather than tying the code to specific variable names.
You can use the SPSSINC SELECT VARIABLES to define macros based on variable properties, including custom attributes and then use those macros in your syntax in place of specific variable names.
We have seen that an approach like this can dramatically reduce the number of different but similar jobs that an organization has to otherwise maintain.

One big and wide table or many not so big for statistics data

I'm writing simplest analytics system for my company. I have about 100 different event types that should be collected per tens of projects. We are not interested in cross-project analytic requests but events have similar types through all projects. I use PostgreSQL as primary storage for this system. Now I should decide which architecture is more preferable.
First architecture is one very big table (in terms of rows count) per project that contains data for all types of events. It will be about 20 or more columns many of them will be nullable. May be it will be used partitioning to split this table by event type but table still be so wide.
Second one architecture is a lot of tables (fairly big in terms of rows count but not so wide) with one table per event type.
I going to retrieve analytic data from this tables using different join queries (self join in case of first architecture). Which one is more preferable and where are pitfalls of them?
UPD. All events have about 10 common attributes. And remain attributes are varied from one event type to another.
In the past, I've had similar situations. With postgres you have a bunch of options.
Depending on how your data is input into the system (all at once/ a little at a time) and the volume of your data per project (hundreds of data points vs millions of data points) and the querying pattern (IE, querying after the data is all in, querying nightly, or reports running constantly throughout), there are many options. One other factor will be IF new project types (with new data point types) are likely to crop up.
First, in your "first architecture" the first question that comes up for me is: Are all the "data points" the same data type (or at least very similar). Are some text and others numeric? Are some numeric and others floats? If so, you're likely to run into issues with rolling up your data without either building a column or a table for every data type.
If all your data is the same datatype, then the first architecture you mentioned might work really well.
The second architecture you mentioned is OK especially if you don't predict having a bunch of new project types coming down the pike anytime soon, otherwise, you'll be constantly modifying the DB, which I prefer to avoid when unnecessary.
A third architecture that you didn't mention is to have a combination of 1 and 2. Basically have 1 table to hold the 10 common attributes and use either 1 or 2 to hold the additional attributes. This would have an advantage, especially if the additional data wasn't that frequently used, or was non-numeric.
Lastly, you could use one of PostgreSQLs "document store" type datatypes. You could store this information in arrays, hstores, or json. Now, this will be fairly inefficient if you're doing a ton of aggregate functions as you might be left calculating the aggregates outside of Pgsql, or at a minimum, running an inefficient query. You could store the 10 common fields in normal fields, and the additional ones as hstore or json.
I didn't ask you, but it'd be nice to know that if each event within a project had more than 1 data point (IE are you logging changes, or just updating data).If your overall table has less than 100,000 rows, it's likely just going to be best to focus on what's easier to maintain and program rather than performance, as small amounts of data are pretty quick regardless of how they're stored.

Database design: Postgres or EAV to hold semi-structured data

I was given the task to decide whether our stack of technologies is adequate to complete the project we have at hand or should we change it (and to which technologies exactly).
The problem is that I'm just a SQL Server DBA and I have a few days to come up with a solution...
This is what our client wants:
They want a web application to centralize pharmaceutical researches separated into topics, or projects, in their jargon. These researches are sent as csv files and they are somewhat structured as follows:
Project (just a name for the project)
Segment (could be behavioral, toxicology, etc. There is a finite set of about 10 segments. Each csv file holds a segment)
Mandatory fixed fields (a small set of fields that are always present, like Date, subjects IDs, etc. These will be the PKs).
Dynamic fields (could be anything here, but always as a key/pair value and shouldn't be more than 200 fields)
Whatever files (images, PDFs, etc.) that are associated with the project.
At the moment, they just want to store these files and retrieve them through a simple search mechanism.
They don't want to crunch the numbers at this point.
98% of the files have a couple of thousand lines, but there's a 2% with a couple of million rows (and around 200 fields).
This is what we are developing so far:
The back-end is SQL 2008R2. I've designed EAVs for each segment (before anything please keep in mind that this is not our first EAV design. It worked well before with less data.) and the mid-tier/front-end is PHP 5.3 and Laravel 4 framework with Bootstrap.
The issue we are experiencing is that PHP chokes up with the big files. It can't insert into SQL in a timely fashion when there's more than 100k rows and that's because there's a lot of pivoting involved and, on top of that, PHP needs to get back all the fields IDs first to start inserting. I'll explain: this is necessary because the client wants some sort of control on the fields names. We created a repository for all the possible fields to try and minimize ambiguity problems; fields, for instance, named as "Blood Pressure", "BP", "BloodPressure" or "Blood-Pressure" should all be stored under the same name in the database. So, to minimize the issue, the user has to actually insert his csv fields into another table first, we called it properties table. This action won't completely solve the problem, but as he's inserting the fields, he's seeing possible matches already inserted. When the user types in blood, there's a panel showing all the fields already used with the word blood. If the user thinks it's the same thing, he has to change the csv header to the field. Anyway, all this is to explain that's not a simple EAV structure and there's a lot of back and forth of IDs.
This issue is giving us second thoughts about our technologies stack choice, but we have limitations on our possible choices: I only have worked with relational DBs so far, only SQL Server actually and the other guys know only PHP. I guess a MS full stack is out of the question.
It seems to me that a non-SQL approach would be the best. I read a lot about MongoDB but honestly, I think it would be a super steep learning curve for us and if they want to start crunching the numbers or even to have some reporting capabilities,
I guess Mongo wouldn't be up to that. I'm reading about PostgreSQL which is relational and it's famous HStore type. So here is where my questions start:
Would you guys think that Postgres would be a better fit than SQL Server for this project?
Would we be able to convert the csv files into JSON objects or whatever to be stored into HStore fields and be somewhat queryable?
Is there any issues with Postgres sitting in a windows box? I don't think our client has Linux admins. Nor have we for that matter...
Is it's licensing free for commercial applications?
Or should we stick with what we have and try to sort the problem out with staging tables or bulk-insert or other technique that relies on the back-end to do the heavy lifting?
Sorry for the long post and thanks for your input guys, I appreciate all answers as I'm pulling my hair out here :)

Is a document/NoSQL database a good candidate for storing a balance sheet?

If I were to create a basic personal accounting system (because I'm like that - it's a hobby project about a domain I'm familiar enough with to avoid getting bogged-down in requirements), would a NoSQL/document database like RavenDB be a good candidate for storing the accounts and more importantly, transactions against those accounts? How do I choose which entity is the "document"?
I suspect this is one of those cases were actually a SQL database is the right fit and trying to go NoSQL is the mistake, but then when I think of what little I know of CQRS and event sourcing, I wonder if the entity/document is actually the Account, and the transactions are Events stored against it, and that when these "events" occur, maybe my application also then writes out to a easily queryable read store like a SQL database.
Many thanks in advance.
Personally think it is a good idea, but I am a little biased because my full time job is building an accounting system which is based on CQRS, Event Sourcing, and a document database.
Here is why:
Event Sourcing and Accounting are based on the same principle. You don't delete anything, you only modify. If you add a transaction that is wrong, you don't delete it. You create an offset transaction. Same thing with events, you don't delete them, you just create an event that cancels out the first one. This means you are publishing a lot of TransactionAddedEvent.
Next, if you are doing double entry accounting, recording a transaction is different than the way your view it on a screen (especially in a balance sheet). Hence, my liking for cqrs again. We can store the data using correct accounting principles but our read model can be optimized to show the data the way you want to view it.
In a balance sheet, you want to view all entries for a given account. You don't want to see the transaction because the transaction has two sides. You only want to see the entry that affects that account.
So in your document db you would have an entries collection.
This makes querying very easy. If you want to see all of the entries for an account you just say SELECT * FROM Entries WHERE AccountId = 1. I know that is SQL but everyone understands the simplicity of this query. It just as easy in a document db. Plus, it will be lightning fast.
You can then create a balance sheet with a query grouping by accountid, and setting a restriction on the date. Notice no joins are needed at all, which makes a document db a great choice.
Theory and Architecture
If you dig around in accounting theory and history a while, you'll see that the "documents" ought to be the source documents -- purchase order, invoice, check, and so on. Accounting records are standardized summaries of those usually-human-readable source documents. An accounting transaction is two or more records that hit two or more accounts, tied together, with debits and credits balancing. Account balances, reports like a balance sheet or P&L, and so on are just summaries of those transactions.
Think of it as a layered architecture -- the bottom layer, the foundation, is the source documents. If the source is electronic, then it goes into the accounting system's document storage layer -- this is where a nosql db might be useful. If the source is a piece of paper, then image it and/or file it with an index number that is then stored in the accounting system's document layer. The next layer up is digital records summarizing those documents; each document is summarized by one or more unbalanced transaction legs. The next layer up is balanced transactions; each transaction is composed of two or more of those unbalanced legs. The top layer is the financial statements that summarize those balanced transactions.
Source Documents and External Applications
The source documents are the "single source of truth" -- not the records that describe them. You should always be able to rebuild the entire db from the source documents. In a way, the db is just an index into the source documents in the first place. Way too many people forget this, and write accounting software in which the transactions themselves are considered the source of truth. This causes a need for a whole 'nother storage and workflow system for the source documents themselves, and you wind up with a typical modern corporate mess.
This all implies that any applications that write to the accounting system should only create source documents, adding them to that bottom layer. In practice though, this gets bypassed all the time, with applications directly creating transactions. This means that the source document, rather than being in the accounting system, is now way over there in the application that created the transaction; that is fragile.
Events, Workflow, and Digitizing
If you're working with some sort of event model, then the right place to use an event is to attach a source document to it. The event then triggers that document getting parsed into the right accounting records. That parsing can be done programatically if the source document is already digital, or manually if the source is a piece of paper or an unformatted message -- sounds like the beginnings of a workflow system, right? You still want to keep that original source document around somewhere though. A document db does seem like a good idea for that, particularly if it supports a schema where you can tie the source documents to their resulting parsed and balanced records and vice versa.
You can certainly create such a system.
In that scenario, you have the Account Aggregate, and you also have the TimePeriod Aggregate.
The time period is usually a Month, a Quarter or a Year.
Inside each TimePeriod, you have the Transactions for that period.
That means that loading the current state is very fast, and you have the full log in which you can go backward.
The reason for TimePeriod is that this is usually the boundary in which you actually think about such things.
In this case, a relational database is the most appropriate, since you have relational data (eg. rows and columns)
Since this is just a personal system, you are highly unlikely to have any scale or performance issues.
That being said, it would be an interesting exercise for personal growth and learning to use a document-based DB like RavenDB. Traditionally, finance has always been a very formal thing, and relational databases are typically considered more formal and rigorous than document databases. But, like you said, the domain for this application is under your control, and is fairly straight forward, so complexity and requirements would not get in the way of designing the system.
If it was my own personal pet project, and I wanted to learn more about a new-ish technology and see if it worked in a particular domain, I would go with whatever I found interesting and if it didn't work very well, then I learned something. But, your mileage may vary. :)