Compare tables to ensure non-regression in PostgreSQL

Here is my issue: I often need to compare the same PostgreSQL tables (or views that depend on them) before and after some ETL code refactoring, to check that my changes introduce no regressions.
Let's say I have an ETL job I want to refactor, which regularly loads data into a table. Currently, once my modifications are done, I first download the data from PostgreSQL as a .csv file, then empty the table, fill it again using my refactored code, and download the data again. Then I compare the two .csv files, for instance with Python in a Jupyter Notebook.
That does not seem like the way to go at all. Among other things, it assumes I am the only one using the table during the operation, and there are so many other problems I can't list them all here.
Is there a better way to do this?

It sounds to me like you have the correct approach. There's no magic to the CSV export operation: whatever tool you use runs a query and formats its result set into the file. Any other before-and-after comparison would have to run the same query.
If you're doing this sort of regression test on an active database, it's probably wise to put some sort of distinctive tag on your test records, maybe prepend ETLTEST- to your customer names, so it's ETLTEST-John Bull. Then you can make your queries handle only your test records. And make sure you do something reliable for ORDER BY.
Jupyter seems a complex way to diff your .csv files. Most operating systems have lightweight, fast diff tools.
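If you do keep the scripted route, the whole thing boils down to a tiny helper you run before and after the refactored load. Here is a minimal Python sketch of that idea, assuming psycopg2 is available; the table name (customers), the id ordering column and the name column carrying the ETLTEST- tag are placeholders for whatever your schema actually uses:

import filecmp
import psycopg2

def snapshot(dsn, out_path):
    # Dump the tagged test rows of the target table to a CSV file, in a fixed order.
    query = """
        COPY (
            SELECT *
            FROM customers                    -- hypothetical target table
            WHERE name LIKE 'ETLTEST-%'       -- only the tagged test records
            ORDER BY id                       -- deterministic ordering is essential
        ) TO STDOUT WITH CSV HEADER
    """
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur, open(out_path, "w") as f:
        cur.copy_expert(query, f)

# snapshot("dbname=etl user=me", "before.csv")   # before emptying/reloading the table
# ...run the refactored ETL...
# snapshot("dbname=etl user=me", "after.csv")    # after the refactored load
# print("identical" if filecmp.cmp("before.csv", "after.csv", shallow=False) else "different")

Since both snapshots come from exactly the same filtered, ordered query, any byte-level difference between the two files is a real data difference.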

Related

Does parameters variation not update the built-in database?

I notice that whenever I run a ParametersVariation model, the built-in database does not update... I have PLE, so there is no way for me to write my own database. I am currently able to pull data from various logs present in the database, but only from a normal simulation run. Is there a way to have the parameters variation write its data to the database after each simulation run?
I am currently running this code in "After simulation run":
// Connect to the external Excel file
Database myFile = new Database(this, "A DB from Excel", "C:/Users/Downloads/DataExport.xlsx");
// Get the model's built-in database
ModelDatabase modelDB = getEngine().getModelDatabase();
// Export the internal log table to a sheet of the Excel file
modelDB.exportToExternalDB("flowchart_stats_time_in_state_log", myFile.getConnection(), "Sheet", false, true);
The export works perfectly, but the data never changes. This is confirmed by also exporting the distribution of a histogram that changes with every simulation run: the export still contains the same data that was written to the database by the last standard (non-Parameters-Variation) simulation run.
Model log database tables aren't produced for multi-run experiments. It's not specifically stated anywhere, but they're designed more for testing/debugging (single runs of) models.
(Also, notice that the log tables don't have columns specifying a run ID or similar, so there's no way that you would have been able to distinguish rows for different runs anyway, even if there were rows written in multi-run experiments.)
Unfortunately, because they are one of the only ways to 'automatically' produce certain forms of output data (like the contents of datasets or histograms), many people try to use them for that, even though they have a pretty un-useful 'internal' format. In general you should write to your own internal database tables for any persistent outputs. There you can also govern whether you store outputs for multiple runs or not, which will require you to calculate some form of unique run ID and use it in a column to differentiate outputs per run, plus have logic or UI elements to determine when the table data is cleared for a new run and when it isn't.
NB: Note that the kinds of data the model log tables (like flowchart_stats_time_in_state_log which you mention) create can in virtually all cases be determined and created 'manually' via your own model code. That table in particular has a large amount of detail on what's happened in each block and, in any given case, it's probably only a fraction of that data (or a simplification/aggregation of it) that you really want/need.
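Purely to illustrate the run-ID pattern above (this is plain SQL driven from Python's sqlite3, not AnyLogic's database API, and every table and column name here is invented), the layout could look like this:

import sqlite3, uuid

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE my_outputs (
        run_id        TEXT,   -- identifies which run of the experiment produced the row
        block         TEXT,   -- hypothetical output dimensions
        state         TEXT,
        time_in_state REAL
    )
""")

run_id = str(uuid.uuid4())    # generate a fresh ID at the start of every run
conn.execute("INSERT INTO my_outputs VALUES (?, ?, ?, ?)",
             (run_id, "service1", "busy", 12.5))

# Rows for one run can now be selected (or cleared) independently of all other runs.
rows = conn.execute("SELECT * FROM my_outputs WHERE run_id = ?", (run_id,)).fetchall()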

Best way to report the growth of a file using PowerShell?

I would like to report the database sizes to myself via email every week, compare them to the week before, and display the growth in megabytes and/or %.
I have everything done besides the comparison.
Imagine this setup:
SQL Server with 100 databases
Now there are plenty of ways to do a comparison. I thought about writing the sizes into XML with PowerShell and later reading them out with a second script and reporting to myself.
Since I am self-taught in PowerShell I might have gaps here, so I am afraid of missing an easy way.
Does anyone have a nice idea of how to compare the sizes?
The report and calculation I will manage myself later; I just need a good way to do the comparison.
Currently I am on PowerShell 3.0, but I can upgrade to 4.0.
Don't reinvent the wheel. SQL Server already has tools to monitor DB file sizes, and so does Performance Monitor. There are several 3rd-party products available too. Ask your local DBA whether such a system is already in place.
A common practice is to query the server for DB file sizes on, say, a daily basis and store them in a utility DB table with a timestamp. Calculating change volumes, ratios and whatnot can then be done on the T-SQL side. (Not that it is CPU-intensive anyway.)
I would create a CSV file for each database, then write out rows like:
Date,Size
27.08.2014,1024
28.08.2014,1040
29.08.2014,1080
Then you can import the CSV file, sort the rows by date, compare the last two sizes, and send the result by email.
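To make that comparison step concrete, here is the same logic sketched in Python; the steps map directly onto PowerShell's Import-Csv and Sort-Object. The file name is hypothetical and the columns follow the Date,Size example above:

import csv
from datetime import datetime

with open("MyDatabase.csv", newline="") as f:       # one file per database, as suggested
    rows = sorted(csv.DictReader(f),
                  key=lambda r: datetime.strptime(r["Date"], "%d.%m.%Y"))

previous, latest = rows[-2], rows[-1]
growth_mb = float(latest["Size"]) - float(previous["Size"])
growth_pct = 100.0 * growth_mb / float(previous["Size"])
print(f"{previous['Date']} -> {latest['Date']}: +{growth_mb:.0f} MB ({growth_pct:.1f} %)")

With the sample data above that would report the 28.08 to 29.08 step as +40 MB, roughly 3.8 %.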

In OpenEdge, how do you transfer parts of the data in the database in an easy way?

I have a lot of data in 2 different databases and in many different tables that I would like to move from one computer to a few others. The others have the same definitions of the databases. Note that not all the data should be transferred, only some that I define: some tables fully, and some others only partly.
How would I move this data in the easiest way? Dumping each table and loading the many .d files separately is not an easy way. Could you do something similar to the incremental .df file, which contains everything that has to be changed?
Dumping (and loading) entire tables is easy. You can do it from the GUI or by command line. Look at for instance this KnowledgeBase entry about command line dump & load and this about creating scripts for dumping the entire database.
Parts of the data is another story. This is very individual and depends on your database and your application. It's hard for a generic tool to compare data and tell whether a difference depends on changed data, added data or deleted data. Different databases have different kinds of layout, keys and indices.
There are, however, several built-in commands that could help you. For instance:
IMPORT and EXPORT for importing and exporting data to files, streams etc.
Basic import and export
/* Export every record of table foo to a text file */
OUTPUT TO c:\temp\foo.data.
FOR EACH foo NO-LOCK:
    EXPORT foo.
END.
OUTPUT CLOSE.

/* Read the file back in and create the records in the target database */
INPUT FROM c:\temp\foo.data.
REPEAT:
    CREATE foo.
    IMPORT foo.
END.
INPUT CLOSE.
BUFFER-COPY and BUFFER-COMPARE for copying and comparing data between tables (and possibly even databases).
You could also use the built-in commands for doing a "dump" and then manually edit the created files.
Calling Progress built-in commands
You can call the back end that dumps data from Data Administration. That will require you to extract those .p files from its archives and call them manually. It will also require you to change your PROPATH etc., so it's not straightforward. You could also look into modifying the extracted files to your needs. Remember that this might break when upgrading Progress, so store your changes away in separate files.
Look at this Progress KB entry:
Progress KB 15884
The best way for you depends on whether this is a one-time or recurring task, the size and layout of the database, etc.

How to verify that large PostgreSQL databases running different versions have the same data, without dumping

How would I verify that the data in an 8.3 PostgreSQL DB is the same as the data in a 9.0 DB?
When I did a SQL dump of an example table there were many differences, but these were due to 9.0 truncating zeros at the beginning and end of date fields; also, the order of the dump was not fixed. Even though the dump can be sorted with sort (no pun intended), that does not allow validation, as it would lose track of which table each row was part of; the sorted SQL dump would be a meaningless splat of SQL commands with dump settings thrown in for good measure.
count(*) is also not adequate.
I would like to be 100% sure that the data in one is equal to the data in the other, despite the version differences and the way that, at the very least, dates are held in 9.0.
I should add that I have several hundred tables and many hundred GB of data, so I need an automated process like diff DUMPa.sql DUMP2.sql. A SHA of the data (not the format) would be ideal, but one cannot diff binary dumps of PostgreSQL for well-known reasons. I am aware MySQL has a checksum feature, but I'm using PostgreSQL.
First the bad news: there is really no way to address all the concerns you raise without loading all the data into an intermediary program and comparing it directly. This will take time and it will drag your system down load-wise, so my recommendation is to set up some sort of replication and compare replicas.
One thing you might be able to do is to use something like Slony or Bucardo to replicate, and then triggers to move data into secondary child partitions and replicate those onto a consolidated server for comparison. You could then compare within PostgreSQL. This would reduce the load and it would mean your reporting data would be relatively easy to manage compared to other approaches. But all the data is going to have to be loaded and compared line-by-line.
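If you do end up comparing from an intermediary program, one workable shape for it is a per-table checksum over a deterministically ordered stream of rows. Below is a rough Python sketch, assuming psycopg2 and assuming every table has a primary key (called simply id here) to order by; the connection strings and table list are placeholders:

import hashlib
import psycopg2

def table_digest(dsn, table, order_by="id"):
    # Return a SHA-256 over the row values of one table, independent of dump format.
    h = hashlib.sha256()
    with psycopg2.connect(dsn) as conn:
        with conn.cursor(name="streamer") as cur:   # server-side cursor: rows are streamed
            cur.execute(f"SELECT * FROM {table} ORDER BY {order_by}")  # sketch only, no quoting
            for row in cur:
                # Values arrive as parsed Python objects rather than dump text, so any
                # per-version formatting quirks can be normalized here in one place.
                h.update("|".join("" if v is None else str(v) for v in row).encode())
    return h.hexdigest()

# for table in ["customers", "orders"]:             # placeholder table list
#     old = table_digest("host=old83 dbname=prod", table)
#     new = table_digest("host=new90 dbname=prod", table)
#     print(table, "OK" if old == new else "MISMATCH")

Because the digests are built from parsed values rather than dump text, the dump-formatting differences between 8.3 and 9.0 should largely drop out, or can at least be handled in one spot.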

SQLite3: Batch Insert?

I've got some old code on a project I'm taking over.
One of my first tasks is to reduce the final size of the app binary.
Since the contents include a lot of text files (around 10,000 of them), my first thought was to create a database containing them all.
I'm not really used to SQLite and Core Data, so I've got basically two questions:
1 - Is my assumption correct? Should my SQLite file have a smaller size than all of the text files together?
2 - Is there any way of automating the task of getting them all into my newly created database (maybe using some kind of GUI or script), one file per record inside a single table?
I'm still experimenting with Core Data, but I've done a lot of searching already and could not find anything relevant to bringing everything together inside the database file. Doing that manually has already proven to be no easy task!
Thanks.
An alternative to using SQLite might be to use a zip file instead. This is easy to create and will surely save space (and definitely reduce the number of files). There are several implementations for using zip files on the iPhone, e.g. ziparchive or TWZipArchive.
1 - It probably won't be any smaller, but you can compress the files before storing them in the database. Or without the database for that matter.
2 - Sure. It shouldn't be too hard to write a script to do that.
If you're looking for an SQLite bulk insert command to write your script for 2), there isn't one AFAIK. Prepared insert statements in a loop inside a transaction is the best you can do; I imagine it would take only a few seconds (if that) to insert 10,000 records.
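For question 2), a script along these lines would do it. This is just a sketch using Python's built-in sqlite3 module; the folder, file pattern and table schema are made up:

import sqlite3
from pathlib import Path

conn = sqlite3.connect("content.sqlite")
conn.execute("CREATE TABLE IF NOT EXISTS docs (name TEXT PRIMARY KEY, body TEXT)")

files = sorted(Path("texts").glob("*.txt"))          # the ~10,000 source text files
with conn:                                           # one transaction for all inserts
    conn.executemany(
        "INSERT INTO docs (name, body) VALUES (?, ?)",
        ((f.name, f.read_text(encoding="utf-8")) for f in files),
    )
conn.close()

Wrapping all the inserts in a single transaction is what keeps this fast; committing per row would be orders of magnitude slower.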