I notice that whenever I run a ParametersVariation model, the built-in database does not update... I have PLE, so there is no way for me to write my own database. I am currently able to pull data from various logs present in the database, but only from a normal simulation run. Is there a way to have the parameters variation write its data to the database after each simulation run?
I am currently running this code in the experiment's "After simulation run" action:
Database myFile = new Database(this, "A DB from Excel", "C:/Users/Downloads/DataExport.xlsx");
ModelDatabase modelDB = getEngine().getModelDatabase();
modelDB.exportToExternalDB("flowchart_stats_time_in_state_log", myFile.getConnection(), "Sheet", false, true);
The export works perfectly, but the data never changes. I confirmed this by exporting the distribution from a histogram that changes with every simulation run: the export always contains the same data that was written to the database by the last standard (non-ParametersVariation) simulation run.
Model log database tables aren't produced for multi-run experiments. It's not specifically stated anywhere, but they're designed more for testing/debugging (single runs of) models.
(Also, notice that the log tables don't have columns specifying a run ID or similar, so there's no way that you would have been able to distinguish rows for different runs anyway, even if there were rows written in multi-run experiments.)
Unfortunately, because they are one of the only ways to 'automatically' produce certain forms of output data (like the contents of datasets or histograms), many people try to use them for that, even though they have a pretty unhelpful 'internal' format. In general you should write to your own internal database tables for any persistent outputs. That also lets you govern whether you store outputs for multiple runs or not, which will require you to calculate some form of unique run ID and use it in a column to differentiate outputs per run, plus have logic or UI elements to determine when the table data is cleared for a new run and when it isn't.
NB: the kinds of data that the model log tables create (like flowchart_stats_time_in_state_log, which you mention) can in virtually all cases be determined and written 'manually' via your own model code. That table in particular records a large amount of detail about what happened in each block, and in any given case it's probably only a fraction of that data (or a simplification/aggregation of it) that you really want or need.
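As a minimal sketch of the 'own table' approach, assuming you have created a table called run_stats in the model's built-in database (via the Database palette) with columns run_id, block_name and mean_time_in_state, that you maintain a hypothetical experiment variable runIndex which you increment in "Before simulation run", and that the model database exposes a plain JDBC Connection via getConnection() in your AnyLogic version, the "After simulation run" code could look roughly like this (adjust the names and the metric to whatever you actually need):
// Hypothetical table "run_stats" (run_id INT, block_name VARCHAR, mean_time_in_state DOUBLE),
// created beforehand in the model's built-in database.
java.sql.Connection conn = getEngine().getModelDatabase().getConnection();
String sql = "INSERT INTO run_stats (run_id, block_name, mean_time_in_state) VALUES (?, ?, ?)";
try (java.sql.PreparedStatement ps = conn.prepareStatement(sql)) {
    ps.setInt(1, runIndex);                    // hypothetical experiment variable incremented per run
    ps.setString(2, "service1");               // the block this row describes
    ps.setDouble(3, root.myMeanTimeInState);   // placeholder: compute the statistic you actually need in the model
    ps.executeUpdate();
} catch (java.sql.SQLException e) {
    traceln("Failed to write run stats: " + e.getMessage());
}
The point is simply that you control the schema, so a run_id column (and whatever clearing logic you want) is trivial to add, which the built-in log tables never give you.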
Related
I am trying to determine column-level lineage between a target table and a number of source tables. The columns that end up in the target table come from one or more of the source tables and may have been transformed by one or more intermediate processes. Trouble is, I have no access to the intermediate processes - all I have are source tables and a target table. I am trying to find out whether there exists a class of solutions or tools for column level (fine-grained) lineage that assumes black-box processes.
The solutions my searches turned up appear to determine lineage from code (e.g. SQL queries), which obviously requires some foresight and pre-integration. I am working with legacy systems and data from different organizations, for which getting access to the transformation processes is just not going to happen. Searches for black-box column-level lineage didn't return anything I consider useful.
I am about to sketch out a custom solution but didn't want to undertake such a huge task without making sure something already exists. It seems unlikely that no one has tackled this problem.
Separately, I would also like to know whether there exist open-source, standalone visualization tools for column-level lineage that was generated by another process.
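The kind of custom solution I have in mind is essentially instance-based (value-overlap) matching: compare the actual values of each target column against every source column and rank candidate lineage edges by overlap. A rough sketch of the idea, with hypothetical inputs and no handling of heavily transformed values:
import java.util.*;

public class ColumnOverlap {
    // Jaccard similarity between the distinct values of two columns.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) return 0.0;
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) inter.size() / union.size();
    }

    // Rank every source column as a lineage candidate for one target column.
    static void rankCandidates(Map<String, Set<String>> sourceColumns, Set<String> targetColumn) {
        sourceColumns.entrySet().stream()
            .sorted((x, y) -> Double.compare(jaccard(y.getValue(), targetColumn),
                                             jaccard(x.getValue(), targetColumn)))
            .forEach(e -> System.out.printf("%s -> %.3f%n",
                                            e.getKey(), jaccard(e.getValue(), targetColumn)));
    }
}
Obviously this only catches columns that are copied through more or less verbatim; anything aggregated or re-encoded by the black-box processes would need fuzzier matching, which is part of why I'm asking whether tooling already exists.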
Here is my issue: I often need to compare the same PostgreSQL tables (or views that depend on them) before and after some ETL code refactoring, to check that my changes introduce no regressions.
Let's say I have some ETL code I want to refactor, which regularly uploads data into a table. Currently, once my modifications are done, I first download the data from PostgreSQL as a .csv file, then empty the table, fill it again using my refactored code, and download the data again. Then I compare the .csv files, for instance with Python in a Jupyter Notebook.
That does not seem like the way to go at all. Notably, it supposes I am the only one using that table during the operation, among so many other problems that I can't list them all here.
Is there a better way to go?
It sounds to me like you have the correct approach. There's no magic to the CSV export operation: whatever tool you use runs a query and formats its resultset into the file. Any other before-and-after comparison operation would have to run the same query.
If you're doing this sort of regression test on an active database, it's probably wise to put some sort of distinctive tag on your test records, maybe prepend ETLTEST- to your customer names, so it's ETLTEST-John Bull. Then you can make your queries handle only your test records. And make sure you do something reliable for ORDER BY.
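To make that concrete, here is a rough Java/JDBC illustration (the table, columns and tag are made up; adapt the query to your schema): the before/after "export" is nothing more than the same deterministic query written to a file twice, once before the refactoring and once after.
import java.io.PrintWriter;
import java.sql.*;

public class ExportTestRows {
    // Dump the tagged test rows, in a stable order, to a simple CSV file.
    static void export(Connection conn, String outFile) throws Exception {
        String sql = "SELECT id, customer_name, amount FROM orders " +
                     "WHERE customer_name LIKE 'ETLTEST-%' ORDER BY id";
        try (PreparedStatement ps = conn.prepareStatement(sql);
             ResultSet rs = ps.executeQuery();
             PrintWriter out = new PrintWriter(outFile)) {
            while (rs.next()) {
                out.println(rs.getLong("id") + "," + rs.getString("customer_name")
                        + "," + rs.getBigDecimal("amount"));
            }
        }
    }
    // Usage: export(conn, "before.csv"), run the refactored ETL, then
    // export(conn, "after.csv") and diff the two files.
}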
Jupyter seems a complex way to diff your CSV files. Most operating systems have lightweight, fast diff tools.
I just need a bit more clarity around Tableau extract vs. live connections. I have 40 people who will use Tableau and a bunch of custom SQL scripts. If we go down the extract path, will the custom SQL queries run only once, with all instances of Tableau sharing a single result set, or will each instance of Tableau run the custom SQL separately and only cache those results locally?
There are some aspects of your configuration that aren't completely clear from your question. Tableau extracts are a useful tool: essentially a temporary, but persistent, cache of query results. They act much like a materialized view in many respects.
You will usually want to employ your extract in a central location, often on Tableau Server, so that it is shared by many users. That's typical. With some work, you can make each individual Tableau Desktop user have a copy of the extract (say by distributing packaged workbooks). That makes sense in some environments, say with remote disconnected users, but is not the norm. That use case is similar to sending out data marts to analysts each month with information drawn from a central warehouse.
So the answer to your question is that Tableau provides features that you can employ as you choose to best serve your particular use case: either replicated or shared extracts. The trick is then just to learn how extracts work and employ them as desired.
The easiest way to have a shared extract is to publish it to Tableau Server, either embedded in a workbook or separately as a data source (which is then referenced by workbooks). The easiest way to replicate extracts is to export your workbook as a packaged workbook after first making an extract.
A Tableau data source is the metadata that references an original source (e.g. a CSV file, a database, etc.). A Tableau data source can optionally include an extract that shadows the original source. You can refresh or append to the extract to see new data. If the data source is published to Tableau Server, you can have the refreshes happen on a schedule.
Storing the extract centrally on Tableau Server is beneficial, especially for data that changes relatively infrequently. You can capture the query results, offload work from the database, reduce network traffic and speed your visualizations.
You can further improve performance by filtering (and even aggregating) extracts to have only the data needed to display your viz. Very useful for large data sources like web server logs to do the aggregation once at extract creation time. Extracts can also just capture the results of long running SQL queries instead of repeating them at visualization time.
If you do make aggregated extracts, just be careful that any further aggregation you do in the visualization makes sense. Sums of sums and mins of mins are well defined. Averages of averages, etc., are not always meaningful (averaging a group of 2 values with a group of 200 values gives both groups equal weight, which is rarely what you want).
If you use an extract, it will behave like a materialized SQL table, so anything that changes upstream of the extract will not influence the result until the extract is refreshed.
An extract is used when the data needs to be processed very fast. In this case, a copy of the source data is stored in Tableau's in-memory data engine, so query execution is very fast compared to a live connection. The only problem with this method is that the data won't automatically update when the source data is updated.
A live connection is used when handling real-time data. Here each query goes against the source data, so performance won't be as good as with an extract.
If you are working with a relatively static database, use an extract; otherwise use a live connection.
It sounds from your question as though you are worried about performance, which is why you are wondering whether your users should use a Tableau extract or a live connection.
In my opinion, for both cases (live vs. extract) it all depends on your infrastructure and the size of the table. It makes no sense to make an extract of a huge table that would take hours to download (for example, 1 billion rows and 400 columns).
In the case where all your users connect directly to the database (not to a Tableau server), you may run into different issues. If the tables they connect to are relatively small and your database handles multiple users well, that may be OK. But if your database has to run many resource-intensive queries in parallel, on big tables, on a database that is not optimized for many concurrent users and is located in a different time zone with high latency, it will be a nightmare for you to find a solution. In the worst-case scenario, you may have to change your data structure and upgrade your infrastructure to allow 40 users to access the data simultaneously.
I have two storages (PostgreSQL, MongoDB), and as I need to develop the application locally on my computer (ideally offline), I need the data from those storages to be copied to my HDD.
However, these are massive databases, with hundreds of gigabytes of data.
I don't need all the data stored there, just a sample, so that I can launch my app locally against that data. Both storages have capable tools for data export (pg_dump, mongodump, mongoexport, etc.).
But I don't know how to easily and effectively export a small sample of the data. Even if I took the list of all tables/collections and built a whitelist defining which tables should be limited to a certain number of rows, I would still run into trouble with triggers, functions, indexes, etc.
I don't know about testing for MongoDB, but for PostgreSQL here's what I do.
I follow a pattern while developing against databases that separates the DB side from the app side. For testing the DB side, I have a test schema which includes a single stored procedure that resets all the data in the real schema. This reset is done following the MERGE pattern (delete any records with an unrecognized key, update records that have matching keys but which are changed, and insert missing records). This reset is called before running every unit test. This gives me simple, clear test coverage for stored functions.
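As a hedged sketch of how that wiring can look (JUnit shown here; the connection details, schema and reset function names are hypothetical), the reset is just one call before each test:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import org.junit.After;
import org.junit.Before;

public class DbFunctionTest {
    private Connection conn;

    @Before
    public void resetData() throws Exception {
        conn = DriverManager.getConnection("jdbc:postgresql://localhost/mydb_test", "test", "test");
        try (Statement st = conn.createStatement()) {
            // Hypothetical stored function in the test schema that MERGEs the
            // real schema back to its known baseline (delete/update/insert).
            st.execute("SELECT test.reset_reference_data()");
        }
    }

    @After
    public void close() throws Exception {
        conn.close();
    }

    // ... tests exercising the stored functions go here ...
}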
For testing code that calls into the database, the database layer is always mocked, so there are never any calls that actually go to the database.
What you are describing suggests to me that you are attempting to mix unit testing with integration testing, and I rather strongly suggest that you don't do that. Integration testing is what happens when you've already proved base functionality and want to prove integration between components, and probably performance too. For IT, you really need a representative data set on representative hardware. Usually this means a dedicated machine, and using Hudson for CI.
The direction you seem to be going in is going to be difficult because, as you've already noticed, it's difficult to handle that volume of data and it's difficult to generate representative data sets (most CI systems actually use production data that's been "cleaned" of sensitive information).
Which is why most of the places I've worked have not gone that way.
Just copy it all. Several hundred gigabytes is not very much by today's standards: you can buy a 2000 GB disk for $80.
If you test your code on a small data sample, how do you know whether your code will be efficient enough for the full database?
Just remember to encrypt it with a strong password if it leaves your company building.
I am importing a file, uploaded by a user on the web, that contains 150K rows and has to be broken up, resulting in about 1.6M items that will be added to the database.
At the moment I add the primary record first, and then add the children using the key that was returned for that first record.
While I can precompile the query and re-use it, I'd like to group the lot of them together but I'm concerned that I won't be able to parameterize the queries at that point.
At the moment I'm only importing at around 300 rows, or about 3,000 queries, per second via the query method.
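For reference, the "parent first, then children with the returned key" pattern looks roughly like this in JDBC terms (the table and column names here are made up; my actual data-access layer may differ); it is this per-row round-tripping that I am trying to speed up:
import java.sql.*;
import java.util.List;

public class RowImporter {
    // One imported file row: a parent plus its child items (hypothetical shape).
    static class ImportRow { String name; List<String> items; }

    static void insertRow(Connection conn, ImportRow row) throws SQLException {
        String parentSql = "INSERT INTO parent_records (name) VALUES (?)";
        String childSql  = "INSERT INTO child_items (parent_id, item_value) VALUES (?, ?)";
        try (PreparedStatement parent = conn.prepareStatement(parentSql, Statement.RETURN_GENERATED_KEYS);
             PreparedStatement child  = conn.prepareStatement(childSql)) {
            parent.setString(1, row.name);
            parent.executeUpdate();
            long parentId;
            try (ResultSet keys = parent.getGeneratedKeys()) {
                keys.next();
                parentId = keys.getLong(1);   // key handed back for the parent record
            }
            for (String item : row.items) {   // roughly 10 child items per imported row
                child.setLong(1, parentId);
                child.setString(2, item);
                child.addBatch();
            }
            child.executeBatch();             // at least the children can be batched per parent
        }
    }
}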
I'm not sure what your constraints are for how you can load the data, but there are a few good avenues for getting good bulk import rates into the database (see Performing Bulk Copy Operations). When working with a data import process, I always found it helpful to break it up into phases:
Import phase - various bulk methodologies are available, depending on your situation
Staging phase - processing work, e.g. data validation, building key relations, data scrubbing, etc.
Final insert into the "live" tables (hopefully a set-based insert)
It can be very efficient to pick up all of the data in the first pass with little to no logic work, and move it to a staging area en masse, either into temp tables or permanent staging tables created for that purpose. Then you can do any processing work on the data to get everything properly structured and cleaned before mass-inserting it into its final home in the live tables. This can also give you a layer of insulation against malicious data or SQL injection attacks by having one or more intermediary steps.
This separation allows the bulk import task to be as fast as you can make it, because there is hopefully very little logic required to do a mass import to a big staging dumping ground. Then you can apply whatever logic is required to slice up the data however it is appropriate. Additionally, if you have several steps that need to be done in the staging phase, you can break this up into however many smaller steps are needed and focus on optimizing the biggest/slowest parts.
In your situation, if there is a way to structure the data in the staging phase so that it matches what is in the live tables, you might be able to insert it as one big set. Being able to build the PK->foreign-key relations in the staging phase before the final insert (as well as handling any other data-processing work) can allow you to go from iterative inserts to one big set-based insert, and set-based is usually a very good thing. That is, of course, if your system/constraints allow you to do so.
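A hedged sketch of that staging flow in JDBC terms (all table and column names below are hypothetical): batch the raw rows into a wide staging table with no logic at all, then do the heavy lifting as set-based statements inside the database.
import java.sql.*;
import java.util.List;

public class StagedImport {
    // Phase 1: dump raw parsed rows into a staging table as fast as possible.
    static void stageRows(Connection conn, List<String[]> rawRows) throws SQLException {
        String sql = "INSERT INTO staging_import (parent_name, item_value) VALUES (?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            int n = 0;
            for (String[] r : rawRows) {
                ps.setString(1, r[0]);
                ps.setString(2, r[1]);
                ps.addBatch();
                if (++n % 1000 == 0) ps.executeBatch();  // flush in chunks
            }
            ps.executeBatch();
        }
    }

    // Phases 2-3: build the key relations and move everything to the live tables set-based.
    static void promote(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement()) {
            // Add de-duplication/conflict handling as appropriate for your schema.
            st.executeUpdate(
                "INSERT INTO parent_records (name) " +
                "SELECT DISTINCT parent_name FROM staging_import");
            st.executeUpdate(
                "INSERT INTO child_items (parent_id, item_value) " +
                "SELECT p.id, s.item_value " +
                "FROM staging_import s JOIN parent_records p ON p.name = s.parent_name");
            st.executeUpdate("TRUNCATE TABLE staging_import");
        }
    }
}
The chunked addBatch/executeBatch keeps the import phase fast with no logic in it, while the two INSERT ... SELECT statements replace the per-row child inserts with set-based work.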
I'm not sure if any of that applies to your situation, or if I'm way off base from what you were asking, but hopefully there is something in there that can be useful.