Multiple datasets (sources) in the same DBT project - postgresql

I am new to DBT and I am facing a design challenge. I have 20+ data integrations, each feeding data into one Postgres DB. The Postgres DB has several tables, for example integration_1_assets, integration_2_assets. These tables all hold assets but have different datasets (different column names, etc.).
Now I would like to create a DBT project to pick up the data for each integration and transform it into one final dim_assets table. There may be one or more data transformations before the final insert into dim_assets. The business requires that we run the transformations for each integration differently, i.e. integration_1 should have its own transformation pipeline independent of the others.
Should I create a DBT project for each integration pipeline, or can I use one?

If you have one postgres database with many tables (one per integration), then I think you should have one DBT project.
The docs describe how to set up a Postgres connection in profiles.yml.
You'll want to configure some sources, something like the following (we'll call it sources.yml):
version: 2
sources:
  - name: assets
    description: "Tables containing assets from various integrations"
    database: your_database
    schema: your_schema
    tables:
      - name: integration_1
        identifier: integration_1_assets
        description: "Assets from integration 1"
      - name: integration_2
        ...
Following best practices for project structure, I would suggest you create a set of 'staging' models that read directly from the sources and perform basic cleaning/transformation:
-- stg_assets_1.sql
SELECT
    col1 AS standard_column_name_1,
    ...
    colN AS standard_column_name_N
FROM
    {{ source('assets', 'integration_1') }}
WHERE
    data_is_good = true
... then via a series of intermediate steps combine the many staging models into a single dim_assets model.
The project layout would look something like this:
models
├── sources
│   └── sources.yml
├── staging
│   ├── stg_models.yml
│   ├── stg_assets_1.sql
...
│   └── stg_assets_N.sql
├── intermediate
│   ├── int_models.yml
│   ├── int_assets_combined_1.sql
...
│   └── int_assets_combined_M.sql
└── final
    ├── final_models.yml
    └── dim_assets.sql
Here the intermediate/final models (or whatever you prefer to call them) would reference the earlier models using {{ ref('stg_assets_1') }} etc.
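For example, a minimal sketch of what the final model could look like, assuming the staging layer has already standardised the column names (asset_id and asset_name here are placeholders, and the union could equally live in an intermediate model):
-- dim_assets.sql (sketch)
SELECT asset_id, asset_name, 'integration_1' AS source_integration
FROM {{ ref('stg_assets_1') }}

UNION ALL

SELECT asset_id, asset_name, 'integration_2' AS source_integration
FROM {{ ref('stg_assets_2') }}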
The other YAML files are your model files, allowing you to document (and test) the models defined in each subdirectory.
Things such as materialisation strategy can be defined in the top level dbt_project.yml (e.g. intermediate models could be ephemeral or views, while your final dim_assets model could be materialised as a table).
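As a hedged sketch of that configuration (the project name my_project is an assumption):
# dbt_project.yml (sketch)
models:
  my_project:
    staging:
      +materialized: view
    intermediate:
      +materialized: ephemeral
    final:
      +materialized: table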

Kustomize structure for different environments and cloud providers

I have a scenario and was wondering the best way to structure it with Kustomize.
Say I have multiple environments: dev, qa, prod
and say I have multiple DCs: OnPrem, AWS, GCP
Let's say each DC above has a dev, qa, prod environment.
I have data that is per environment but also per DC. For example, apply one set of values to the dev overlays, but apply another set if it's AWS.
Is there a way to do this easily without duplication? For example, if it's AWS I want to run an additional container in my pod, and if it's prod I want extra replicas. If it's GCP I want a different image, but if it's prod I still want extra replicas.
The example below will have a lot of duplication. I've read you can do multiple bases. Maybe it makes sense to have an AWS, GCP, and OnPrem base, and then have dev, qa, prod overlays with multiple kustomization files for each?
ie
├── base
│   ├── guestbook-ui-deployment.yaml
│   ├── guestbook-ui-svc.yaml
│   └── kustomization.yaml
└── overlay
    ├── dev
    │   ├── aws
    │   │   ├── guestbook-ui-deployment.yaml
    │   │   └── kustomization.yaml
    │   └── gcp
    │       ├── guestbook-ui-deployment.yaml
    │       └── kustomization.yaml
    └── qa
        ├── aws
        │   ├── guestbook-ui-deployment.yaml
        │   └── kustomization.yaml
        └── gcp
            ├── guestbook-ui-deployment.yaml
            └── kustomization.yaml
I recommend having an overlay for each combination you want to build, e.g.:
└── overlays
    ├── aws-dev
    ├── aws-qa
    └── gcp-dev
Then you can structure in different ways, such as using components:
└── components
    ├── environments
    │   ├── dev
    │   └── qa
    └── providers
        ├── aws
        └── gcp
This makes sense because you usually don't create all combinations of possible environments, but only some that make sense to you.
More documentation: https://github.com/kubernetes-sigs/kustomize/blob/master/examples/components.md
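A hedged sketch of how one combination overlay might wire these together (the patch file name add-aws-sidecar.yaml is an assumption):
# components/providers/aws/kustomization.yaml (sketch)
apiVersion: kustomize.config.k8s.io/v1alpha1
kind: Component
patches:
  - path: add-aws-sidecar.yaml

# overlays/aws-dev/kustomization.yaml (sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
components:
  - ../../components/providers/aws
  - ../../components/environments/dev
Each combination overlay then just lists the base plus the components it needs, so adding, say, aws-prod later is a few-line kustomization rather than another copy of the patches.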

How do I execute a script with Cloudinit, using the nocloud data source

I am trying to automate provisioning of Linux machines (RHEL 8) using cloud-init.
For this purpose I created an ISO with the following content:
[root#]# tree rhel82test1/
rhel82test1/
├── meta-data
├── network-config
├── scripts
│   └── per-instance
│       └── demo.sh
└── user-data
2 directories, 4 files
This iso gets added to the VM during virt-install.
The user-data, meta-data and network-config get applied as expected. However, I expected the demo.sh script to be executed as well. I see in the logs that config-scripts-per-instance gets run, but the script is not executed. It is also not present in /var/lib/cloud/instance/scripts.
What am I doing wrong here? Is this not the correct way to have scripts executed?
See the cloud-init source: this is not implemented for the NoCloud data source.
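If the scripts directory can't ride along in the seed, one commonly used alternative is to inline the script in #cloud-config user-data via write_files and run it with runcmd. A minimal sketch (the path and script body are made up):
#cloud-config
write_files:
  - path: /usr/local/bin/demo.sh
    permissions: '0755'
    content: |
      #!/bin/bash
      echo "demo ran" > /tmp/demo-ran
runcmd:
  - [ /usr/local/bin/demo.sh ]
runcmd entries are executed once per instance by the scripts-user module, which roughly matches the per-instance behaviour intended above.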

Flyway multiple metadata tables in one schema

I'm trying to use Flyway to version the database of a modular application. Each module has its own separate set of tables, and migration scripts that will control the versioning of that set of tables.
Flyway allows me to specify a different metadata table for each module - this way I can version each module independently. When I try to upgrade the application, I run a migration process for each module, each with its own table and set of scripts. Note that these tables are all in the same schema.
However, when I try to migrate my application, the first migration is the only one that works. Subsequent migrations fail with the following exception: org.flywaydb.core.api.FlywayException: Found non-empty schema(s) "public" without metadata table! Use baseline() or set baselineOnMigrate to true to initialize the metadata table.
If I create the metadata table for each module manually, migrations for each module work correctly. Creating the table myself rather than having Flyway create it for me seems like a hack to work around a problem, rather than a solution in itself.
Is this a valid way of managing multiple sets of tables independently, or is there a better way of doing this? Is it a valid approach to create the metadata table myself?
An ideal solution for you would be to split your modules into schemas. This gives you an effective unit of isolation per module and is also a natural fit for modular applications (modules completely isolated and self-managing), rather than dumping everything into a single schema (especially public). For example:
application_database
├── public
├── module_1
│   ├── schema_version
│   ├── m1_t1
│   └── m1_t2
├── module_2
│   ├── schema_version
│   ├── m2_t1
│   └── m2_t2
...
Your second option is to keep using the public schema to host all the tables, but use an individual schema for each schema_version. This is less refactoring effort but certainly a less elegant design than the one above. For example:
application_database
├── public
│   ├── m1_t1
│   ├── m1_t2
│   ├── m2_t1
│   └── m2_t2
├── module_1
│   └── schema_version
├── module_2
│   └── schema_version
...
I think you need to baseline each module before running the migrate. You'll need to pass the table option to override schema_version for each module, e.g. flyway.table=schema_version_module1. As the error message suggests, you can also set baselineOnMigrate, however that is warned against in the docs (https://flywaydb.org/documentation/commandline/migrate).
We are considering a similar approach with another schema_version table to apply and log data fixes that cannot be rolled out to every environment cleanly.
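For illustration, a hedged sketch of what the per-module command-line runs could look like (the locations and table names are assumptions):
# baseline only once per module, before its first migrate
flyway -table=schema_version_module1 -locations=filesystem:sql/module1 baseline
flyway -table=schema_version_module1 -locations=filesystem:sql/module1 migrate

flyway -table=schema_version_module2 -locations=filesystem:sql/module2 baseline
flyway -table=schema_version_module2 -locations=filesystem:sql/module2 migrate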

Perl modules hierarchy and composition

Reading more and more about Perl, I'm having doubts about how I organized my modules on my current project.
I have a main namespace - let's call it "MyProject".
In this project, the base data are graphs in which each object has a class. Objects are linked with relations. Both objects and relations can have attributes.
There is a part of the project where I use a model to validate these graphs.
So I have a Model class that is composed of objects of class Class, Relation and Attribute.
Classes Class, Relation and Attribute are used by class Model only, so it made sense to me to organize the modules as follows:
MyProject::Model
MyProject::Model::Class
MyProject::Model::Relation
MyProject::Model::Attribute
But I'm starting to think that it will make sense only if I ever dare to release parts of my project on CPAN. I think people will believe that Class, Relation and Attribute inherit from Model rather than compose it.
So: shall I reorganize my modules this way:
MyProject::Model
MyProject::Class
MyProject::Relation
MyProject::Attribute
Or maybe indicate that Class, Relation and Attribute are parts of Model by appending their names?
MyProject::Model
MyProject::ModelClass
MyProject::ModelRelation
MyProject::ModelAttribute
My question: What is currently considered best practice for module organization and naming when it comes to composition?
Cross-posted at Perlmonks
Your concern is correct. Typically, packages are named by either inclusion or inheritance. I'm going to make up a more realistic example. Let's say we are building an online shop.
Inclusion
You first pick a namespace for your project. In the business world, you often see namespaces with the company name first, to distinguish proprietary modules from CPAN ones. So the project's namespace could be:
OurCompany::Shop
Then probably the main class or module for the application is called the same, so we have a
OurCompany/Shop.pm
We will have a bunch of things that we need to make an online shop. If our project is MVC, there are controllers and models and stuff like that. So we might have these things:
OurCompany::Shop::Controller::ProductSearch
OurCompany::Shop::Controller::Cart
OurCompany::Shop::Controller::Checkout
OurCompany::Shop::Model::Database
All of those map to modules directly.
OurCompany/Shop/Controller/ProductSearch.pm
OurCompany/Shop/Controller/Cart.pm
OurCompany/Shop/Controller/Checkout.pm
OurCompany/Shop/Model/Database.pm
But there is no OurCompany::Shop::Controller as a base class. That name is just a namespace to sort things into.
Then there are some things that are just there, and get used by OurCompany::Shop, like the Session engine.
OurCompany::Shop::Session
Those go on the first level after the project, unless they are very specific.
Now of course there is some kind of engine behind the session system. Let's say we are fancy and use Redis for our sessions. If we implement the communication ourselves (which we wouldn't because CPAN has done that already), we would stick that implementation into
OurCompany::Shop::Session::Engine::Redis
The only thing that uses this module is OurCompany::Shop::Session under the hood. The main application doesn't even know what engine is used. Maybe you don't have Redis on your development machine, so you are using plain files.
OurCompany::Shop::Session::Engine::File
Both of them are there, they belong to ::Session, but they don't get used by any other part of the system, so we file them away where they belong.
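By contrast with the inheritance case below, a minimal sketch of how ::Session could compose an engine (the constructor, method names, and the way the engine is chosen are all assumptions):
package OurCompany::Shop::Session;
use strict;
use warnings;

# Composition: the session object *has an* engine, it does not inherit from one.
sub new {
    my ($class, %args) = @_;
    my $engine_class = $args{engine_class} || 'OurCompany::Shop::Session::Engine::File';
    (my $file = "$engine_class.pm") =~ s{::}{/}g;
    require $file;                                 # load the chosen engine at runtime
    my $self = { engine => $engine_class->new(%args) };
    return bless $self, $class;
}

# Delegate storage to whatever engine was plugged in
sub load { my ($self, $id) = @_; return $self->{engine}->load($id) }
sub save { my ($self, $id, $data) = @_; return $self->{engine}->save($id, $data) }

1;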
Inheritance
We will also have Products [1]. The base product class could be this:
OurCompany::Shop::Product
And there is a file for it.
OurCompany/Shop/Product.pm
Just that this base product never gets used directly by the shop, other than checking that certain things have to have ::Product in their inheritance tree (that's an isa check). So this is different from ::Session, which gets used directly.
But of course we sell different things, and they have different properties. All of them have prices, but shoes have sizes and hard drives have a capacity. So we create subclasses.
OurCompany::Shop::Product::Shoe
OurCompany::Shop::Product::HardDrive
And those have their own files.
OurCompany/Shop/Product/Shoe.pm
OurCompany/Shop/Product/HardDrive.pm
We might also distinguish between a mechanical HardDrive and an SSD, so we make an ::SSD subclass.
OurCompany::Shop::Product::HardDrive::SSD
OurCompany/Shop/Product/HardDrive/SSD.pm
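A minimal sketch of what such a subclass might look like (the constructor details are assumptions, since the base class isn't shown):
package OurCompany::Shop::Product::Shoe;
use strict;
use warnings;
use parent 'OurCompany::Shop::Product';   # Shoe isa Product

# Add the shoe-specific attribute on top of whatever ::Product provides
sub new {
    my ($class, %args) = @_;
    my $self = $class->SUPER::new(%args);
    $self->{size} = $args{size};
    return $self;
}

sub size { return $_[0]->{size} }

1;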
So basically things are put in the same namespace if they belong to each other. Here's a tree view of our lib.
.
└── OurCompany
    ├── Shop
    │   ├── Controller
    │   │   ├── Cart.pm
    │   │   ├── Checkout.pm
    │   │   └── ProductSearch.pm
    │   ├── Model
    │   │   └── Database.pm
    │   ├── Product
    │   │   ├── HardDrive
    │   │   │   └── SSD.pm
    │   │   ├── HardDrive.pm
    │   │   └── Shoe.pm
    │   ├── Product.pm
    │   ├── Session
    │   │   └── Engine
    │   │       ├── File.pm
    │   │       └── Redis.pm
    │   └── Session.pm
    └── Shop.pm
To sum up:
For your main module, the file name is the last part of the project's package name. Everything else goes under there.
Stuff that belongs together gets grouped under one namespace.
Stuff that inherits from something goes under that something's namespace.
Stuff that only gets used by one specific thing goes under that thing.
Never append things together like MyProject::AB to indicate that it belongs to MyProject::A. Always use namespace separators to organize your namespaces. Consider this, which looks just plain wrong:
OurCompany::ShopProductShoe
[1] It might be that we also have a backoffice application that uses the same product classes. In that case, we might have them as OurCompany::Product, and they would get used by two different projects, OurCompany::Shop and OurCompany::BackOffice.

How to add files to deploy with Qt Creator, not using qmake

I'm using Qt Creator as my IDE for a non-Qt project, cross-compiled for arm-linux, to be deployed to a Raspberry Pi (Qt Creator is a pretty good IDE even when not using Qt!). The project doesn't use qmake to build, so there's no .pro file to modify.
I'd like to add a deployment step where the main executable, and maybe more things in the future, get copied over to the device, ready for testing or debugging. From the IDE, there seems to be no way to add files to be deployed.
All the help pages I've seen say to add something to the INSTALLS variable in your .pro file, but of course, that doesn't apply to me. Is there a way to do this, or is the "custom command" (and writing my own deployment script) my only option?
Qt Creator does not know anything about Raspberry Pi, MCUs and other devices. So yes, you need to write your own script, but it can be easily integrated into Qt Creator. First, if you do not use qmake, then I'll assume you are using a Makefile. If so, write your deployment script as the install target of the Makefile and choose the "local" deployment method in Qt Creator's run settings. Then add a Make deploy step and write install in the Additional arguments text box.
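For instance, a bare-bones sketch of such an install target (binary name, user, host and destination path are assumptions):
# Makefile (sketch): "make install" copies the freshly built binary to the Pi
# (the scp recipe line must be indented with a real tab)
TARGET  = myprog
PI_DEST = pi@raspberrypi.local:/home/pi/bin

install: $(TARGET)
	scp $(TARGET) $(PI_DEST)/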
You may also tune Qt Creator to run something other than the program you just built. For example, you may run a script which logs into the remote RPi and runs what was installed. Another option is not to run anything. For example, I use Qt Creator to develop a program for a bare-metal MCU, so it starts immediately after flashing, which in turn is triggered by make install from Qt Creator's deployment stage. Qt Creator needs to run something locally when you press the Run button, so to stop it bothering me about executables I pointed its Run stage in Run Settings to the /usr/bin/true binary.
If you want to deploy, say, a config folder to the target device:
├── embix.pro
├── main.cpp
├── main.h                          [TARGET DEVICE]
...
├── config                          /etc/embix
│   ├── bbb                         ├── bbb
│   │   └── pin.conf                │   └── pin.conf
│   ├── orangepi0       ------>     ├── orangepi0
│   │   └── pin.conf                │   └── pin.conf
│   └── rpi                         └── rpi
│       └── pin.conf                    └── pin.conf
In the .pro file, do this:
# Default rules for deployment.
target.path = /home/pi/$${TARGET}/bin   # where your binary goes
# New deploy rule called config
myconf.files = ./config/*               # from
myconf.path = /etc/$${TARGET}           # to
!isEmpty(target.path): INSTALLS += target
!isEmpty(myconf.path): INSTALLS += myconf