Working with multiple data warehouses in dbt - amazon-redshift

I'm building an application where each of our clients needs their own data warehouse (for security, compliance, and maintainability reasons). For each client we pull in data from multiple third party integrations and then merge them into a unified view, which we use to perform analytics and report metrics for the data across those integrations. These transformations and all relevant schemas are the same for all clients. We would need this to scale to 1000s of clients.
From what I gather dbt is designed so each project corresponds with one warehouse. I see two options:
Use one project and create a separate environment target for each client (plus maybe a single dev environment). Given that environments aren't designed for this, are there any catches? Will scheduling, orchestrating, or querying the outputs be painful or unscalable for some reason?
profiles.yml:
example_project:
  target: dev
  outputs:
    dev:
      type: redshift
      ...
    client_1:
      type: redshift
      ...
    client_2:
      type: redshift
      ...
    ...
Create multiple projects, and create a shared dbt package containing most of the logic. This seems very unwieldy, since it means maintaining a separate repo for each client, and it is less developer friendly.
profiles.yml:
client_1_project:
  target: dev
  outputs:
    client_1:
      type: redshift
      ...

client_2_project:
  target: dev
  outputs:
    client_2:
      type: redshift
      ...

...
Thoughts?

I think you captured both options.
If you have a single database connection, and your client data is logically separated in that connection, I would definitely pick #2 (one package, many client projects) over #1. Some reasons:
Selecting data from a different source (within a single connection) depending on the target is a bit hacky, and wouldn't scale well to 1000s of clients.
The developer experience for packages isn't so bad. You will want a developer data source, but depending on your business you could maybe get away with using one client's data (or an anonymized version of that). It will be good to keep this developer environment logically separate from any individual client's implementation, and packages allow you to do that.
I would consider generating the client projects programmatically, probably using a Python CLI to set up the required files for each client project, run dbt, and tear the files down afterwards (I'm assuming you're not going to use dbt Cloud and have another orchestrator or compute environment that you control). It's easy to write YAML from Python with pyyaml (each file is just a dict), and your individual projects probably only need separate profiles.yml, sources.yml, and (maybe) dbt_project.yml files. I wouldn't check these generated files for each client into source control -- just check in the script and generate the files you need with each invocation of dbt.
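As a rough sketch of that approach (not a definitive implementation), the generator could write a per-client profiles.yml with pyyaml and then shell out to dbt. The profile name, client list, and connection details below are all made-up placeholders; real credentials should come from a secret store or env_var() rather than being written to disk.

import subprocess
import yaml  # pyyaml

CLIENTS = ["client_1", "client_2"]  # hypothetical; load from your own client registry

def write_profile(client: str, path: str = "profiles.yml") -> None:
    # Each generated file is just a dict serialized to YAML.
    profile = {
        "client_project": {  # must match the `profile:` key in the shared dbt_project.yml
            "target": client,
            "outputs": {
                client: {
                    "type": "redshift",
                    "host": f"{client}.example.redshift.amazonaws.com",  # placeholder
                    "user": "dbt_user",
                    "password": "***",  # placeholder -- inject from a secret store
                    "port": 5439,
                    "dbname": "analytics",
                    "schema": "analytics",
                    "threads": 4,
                }
            },
        }
    }
    with open(path, "w") as f:
        yaml.safe_dump(profile, f, sort_keys=False)

for client in CLIENTS:
    write_profile(client)
    # Run the shared models against this client's warehouse, then move on (or clean up).
    subprocess.run(["dbt", "run", "--profiles-dir", "."], check=True)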
On the other hand, if your clients each have their own physical database with separate connections and credentials, and those databases are absolutely identical, you could get away with #1 (one project, many profiles). The "hardest" parts of that approach would likely be managing secrets and generating/maintaining a list of targets that you could iterate over (ideally in a parallel fashion).
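If you do go with #1, the "iterate over targets, ideally in parallel" piece could be as simple as the following sketch. It assumes every client target is already defined in profiles.yml under names like client_1, client_2, and so on; tune the worker count to whatever your orchestrator and the client clusters can tolerate.

from concurrent.futures import ThreadPoolExecutor
import subprocess

# Hypothetical target names, one per client, as defined in profiles.yml.
TARGETS = [f"client_{i}" for i in range(1, 11)]

def run_for_target(target: str) -> int:
    # The same project runs against a different client warehouse per target.
    return subprocess.run(["dbt", "run", "--target", target]).returncode

with ThreadPoolExecutor(max_workers=4) as pool:
    for target, rc in zip(TARGETS, pool.map(run_for_target, TARGETS)):
        status = "ok" if rc == 0 else f"failed (rc={rc})"
        print(f"{target}: {status}")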

Related

Deploy BizTalk Schema Solution without redeploying dependent Solutions

I have three solutions. One is a schema solution that only has a schema file in it; let's call it the SchemaSolution.
The SchemaSolution is referenced in my other two solutions, because Solution1 creates XML instances of the schema in the SchemaSolution and drops them as self-correlating messages into the MessageBox.
This works magically, but if I want to update (deploy to BizTalk) one of the solutions that references the SchemaSolution, I always have to delete the other solutions first. This is horrible, and I have not been able to find a solution so far.
Is there a (non-hacky) way? I thought about merging all projects into one solution, but that is the worst-case scenario I can imagine for achieving my goal.
How can I deploy a project that is referenced in different solutions without deleting and redeploying everything?
BizTalk 2013R2 in use
No, this is not supported, and it is not recommended to try to hack your way around it (you would definitely need to alter the BizTalk databases, and I don't think Microsoft even allows that).
I can give you 3 options:
Make the SchemaSolution as small as possible, for instance by breaking it down into multiple schema solutions per process, so the chance of needing to change any one solution is smaller. Ideally, such a solution would have one assembly/project per schema, so new schemas can be added without a redeploy.
Another option is to duplicate your schemas into your projects. This is a design choice you could make, but it requires some more work: you need to specify the schemas in your pipelines (otherwise BizTalk doesn't know which one you mean), and you have to apply the same schema changes in multiple projects. The downside is that the schemas are not the same to BizTalk, so you can't use one in another project without a reference.
Your final option is to get rid of the dependency on that schema completely. You can do this by creating your own internal/generic/CDM schema, which ideally would be more robust and less prone to change. This schema would still be referenced by multiple projects, but since you're the one in charge of it, you can anticipate changes and mold it to your liking. Again, ideally, this solution would have one assembly/project per schema, so new schemas can be added without a redeploy.
I have a very similar (if not the same) issue within a solution.
I have a set of integration projects dependent on a simple schema project. If I deploy one integration project, I must deploy the schema project, which means I must deploy all integration projects!
In order to deploy them independently, I simply turned the Redeploy flag from True to False in the schema project's properties (in Visual Studio).
This allows me to redeploy as many other dependent projects as I like without having to delete or mess around. I can deploy a single integration project with no effect on the others.
The only caveat is that when you redeploy, for some reason, VS flags the fact that you have set Redeploy to False on the schema project as an error and says that one of the projects was not deployed.
Not a true error, more of a warning imo.
I have been doing this in BizTalk 2016; I would assume you can do the same in 2013.

Multiple users working remotely on Tally ERP9

I am not very sure whether this is the right forum to ask this question or not.
We have a Tally ERP9 server with multiple licenses. Three of our users work remotely on the same data. We have set up Google Drive for data syncing, but most of the time it causes issues because of the synchronisation process.
What would be the best solution so that multiple users can work on the same data from remote locations?
This is the answer - http://mirror.tallysolutions.com/Downloads/TallyTips/GettingStartedwithDataSynchronisation.pdf
Thanks to #MitaleeRao...
Edit placed here for brevity:
These are 2 points I've noted regarding the Tally architecture:
The database is a flat file in a tree structure, and there are numerous checkpoints at each level for maintaining this inheritance (e.g., a voucher has inventory entries that have stock items, which have units, etc.).
The SOAP XML protocol that Tally uses does not have multi-threading capabilities - i.e., the Tally server will only accept one request and give a response at a time.
The data synchronisation that Tally has introduced probably automates exporting the XML of all masters/vouchers and importing it into the central database (whether on the Tally.NET server or on a local computer with a static IP). Not sure how the Google Drive client works, but I'm assuming it is a variation of the same (i.e., XML-based data export and then import onto a main computer).
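To make the XML export/import mechanism above concrete, here is a minimal sketch of a single export request against Tally's XML-over-HTTP gateway. It assumes the gateway is enabled on the default port 9000 and that the built-in "List of Accounts" report is available in your release, so treat the envelope shape and report name as assumptions to verify, not a definitive API.

import requests

TALLY_URL = "http://localhost:9000"  # assumed default Tally gateway port

# Assumed request envelope for exporting masters as XML; verify the report
# name and envelope shape against your Tally release.
EXPORT_REQUEST = """<ENVELOPE>
  <HEADER>
    <TALLYREQUEST>Export Data</TALLYREQUEST>
  </HEADER>
  <BODY>
    <EXPORTDATA>
      <REQUESTDESC>
        <REPORTNAME>List of Accounts</REPORTNAME>
        <STATICVARIABLES>
          <SVEXPORTFORMAT>$$SysName:XML</SVEXPORTFORMAT>
        </STATICVARIABLES>
      </REQUESTDESC>
    </EXPORTDATA>
  </BODY>
</ENVELOPE>"""

def export_accounts() -> str:
    # Tally processes one request at a time, so callers must serialize access.
    response = requests.post(TALLY_URL, data=EXPORT_REQUEST.encode("utf-8"))
    response.raise_for_status()
    return response.text  # XML payload of masters, ready to import elsewhere

if __name__ == "__main__":
    print(export_accounts()[:500])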

Best practices for parameterizing load of multiple CSV files in Data Factory

I am experimenting with Azure Data Factory to replace some other data-load solutions we currently have, and I'm struggling with finding the best way to organize and parameterize the pipelines to provide the scalability we need.
Our typical pattern is that we build an integration for a particular Platform. This "integration" is essentially the mapping and transform of fields from their data files (CSVs) into our Stage1 SQL database, and by the time the data lands in there, the data types should be set properly and the indexes set.
Within each Platform, we have Customers. Each Customer has their own set of data files that get processed in that Customer context -- within the scope of a Platform, all Customer files follow the same schema (or close to it), but they all get sent to us separately. If you looked at our incoming file store, it might look like (simplified, there are 20-30 source datasets per customer depending on platform):
Platform
    Customer A
        Employees.csv
        PayPeriods.csv
        etc
    Customer B
        Employees.csv
        PayPeriods.csv
        etc
Each customer lands in their own SQL schema. So after processing the above, I should have CustomerA.Employees and CustomerB.Employees tables. (This allows a little bit of schema drift between customers, which does happen on some platforms. We handle it later in our stage 2 ETL process.)
What I'm trying to figure out is:
What is the best way to set up ADF so I can effectively manage one set of mappings per platform, and automatically accommodate any new customers we add to that platform without having to change the pipeline/flow?
My current thinking is to have one pipeline per platform, and one data flow per file per platform. The pipeline has a variable, "schemaname", which is set from the path of the file that triggered it (e.g. "CustomerA"). Then, depending on the file name, a branching conditional fires the right data flow: if it's "employees.csv" it runs one data flow, if it's "payperiods.csv" it loads a different data flow. They would all use the same generic target sink dataset, with the table name parameterized; those parameters are set in the pipeline using the schema variable and the file name from the conditional branch.
Are there any pitfalls to setting it up this way? Am I thinking about this correctly?
This sounds solid. Just be aware that if you define column-specific mappings with expressions that expect those columns to be present, you may get data flow execution failures when those columns are missing from a customer's source files.
The way to protect against that in ADF Data Flows is to use column patterns, which let you define mappings that are generic and more flexible.
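For what it's worth, the parameter derivation described above boils down to logic like the following sketch (plain Python, not ADF expression syntax; the folder layout and the schemaname/tablename parameter names are taken from the question). Inside ADF you would typically derive the same values from the storage event trigger's folder path and file name outputs.

from pathlib import PurePosixPath

def derive_load_parameters(blob_path: str) -> dict:
    """Map an incoming path like 'PlatformX/CustomerA/Employees.csv' to the
    pipeline parameters: target schema (the customer folder) and table (the file stem)."""
    parts = PurePosixPath(blob_path).parts
    platform, customer, filename = parts[0], parts[1], parts[-1]
    return {
        "platform": platform,
        "schemaname": customer,                     # e.g. 'CustomerA'
        "tablename": PurePosixPath(filename).stem,  # e.g. 'Employees'
    }

# derive_load_parameters("PlatformX/CustomerA/Employees.csv")
# -> {'platform': 'PlatformX', 'schemaname': 'CustomerA', 'tablename': 'Employees'}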

Alfresco: Which tables do I need to query to get workflow details

I am trying to generate analytics reports using Highcharts.
For that I need to get values from the Alfresco DB using a Postgres query. Can anyone tell me which tables are related to workflow creation and store all the workflow details?
I got some references from the link below.
http://techogeek.blogspot.in/2015/09/how-retrieve-activiti-workflow-details.html
As Gagravarr suggests, don't hit the database directly; always use the WorkflowService to get the workflow-related data.
ACT_RE_*: RE stands for repository. Tables with this prefix contain static information such as process definitions and process resources (images, rules, etc.).
ACT_RU_*: RU stands for runtime. These are the runtime tables that contain the runtime data of process instances, user tasks, variables, jobs, etc. Activiti only stores the runtime data during process instance execution, and removes the records when a process instance ends. This keeps the runtime tables small and fast.
ACT_ID_*: ID stands for identity. These tables contain identity information, such as users, groups, etc.
ACT_HI_*: HI stands for history. These are the tables that contain historic data, such as past process instances, variables, tasks, etc.
ACT_GE_*: general data, which is used in various use cases.
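That said, if you do run read-only reporting queries against the history tables, a minimal sketch might look like the following. The connection details are placeholders, and the column names (proc_inst_id_, start_time_, etc., with trailing underscores) follow the standard Activiti schema, so verify them against your Alfresco/Activiti version before relying on this.

import psycopg2

# Placeholder connection details -- point these at your Alfresco Postgres DB.
conn = psycopg2.connect(host="localhost", dbname="alfresco",
                        user="alfresco", password="secret")

# Historic process instances joined to their tasks; feed the rows to Highcharts.
QUERY = """
SELECT p.proc_inst_id_,
       p.proc_def_id_,
       p.start_time_,
       p.end_time_,
       t.name_     AS task_name,
       t.assignee_,
       t.duration_ AS task_duration_ms
FROM   act_hi_procinst p
LEFT JOIN act_hi_taskinst t ON t.proc_inst_id_ = p.proc_inst_id_
ORDER BY p.start_time_ DESC;
"""

with conn, conn.cursor() as cur:
    cur.execute(QUERY)
    for row in cur.fetchall():
        print(row)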

Explain the steps of the DB2-COBOL execution process if both are DB2-COBOL programs

How do I run two subprograms from a main program if both are DB2-COBOL programs?
My main program is named 'Mainpgm1', and it calls my subprograms 'subpgm1' and 'subpgm2'; these are called programs, and I prefer static calls only.
Actually, I am now using a package statement instead of a plan and one member, both in 'db2bind' (the bind program), along with one DBRMLIB which has a DSN name.
The main problem is: what changes are needed in 'db2bind' when I bind both DB2-COBOL programs?
Similarly, what changes are needed in 'DB2RUN' (the run program)?
Each program (or subprogram) that contains SQL needs to be pre-processed to create a DBRM. The DBRM is then bound into a PLAN that is accessed by a LOAD module at run time to obtain the correct DB2 access paths for the SQL statements it contains.
You have gone from having all of your SQL in one program to several sub-programs. The basic process remains the same: you need a PLAN to run the program.
DBAs often suggest that if you have several sub-programs containing SQL, you create PACKAGEs for them and then bind the PACKAGEs into a PLAN. What was once a one-step process is now two:
Bind each DBRM into a PACKAGE
Bind the PACKAGEs into a PLAN
What is the big deal with PACKAGES?
Suppose you have 50 sub-programs containing SQL. If you create a DBRM for each of them and then bind all 50 into a PLAN as a single operation, it is going to take a lot of resources to build the PLAN, because every SQL statement in every program needs to be analyzed and access paths created for it. This isn't so bad when all 50 sub-programs are new or have been changed. However, if you have a relatively stable system and want to change one sub-program, you end up re-binding all 50 DBRMs to create the PLAN, even though 49 of the 50 have not changed and will end up using exactly the same access paths. This isn't a very good approach. It is analogous to compiling all 50 sub-programs every time you make a change to any one of them.
However, if you create a PACKAGE for each sub-program, the PACKAGE is what takes the real work to build. It contains all the access paths for its associated DBRM. Now if you change just one sub-program, you only have to rebuild its PACKAGE by re-binding a single DBRM into the PACKAGE collection and then re-bind the PLAN. Binding a set of PACKAGEs (a collection) into a PLAN is a whole lot less resource intensive than binding all the DBRMs in the system.
Once you have a PLAN containing all of the access paths used in your program, just use it. It doesn't matter whether the SQL being executed is from subprogram1 or subprogram2. As long as you have associated the PLAN with the LOAD module that is being run, it should all work out.
Every installation has its own naming conventions and standards for setting up PACKAGEs, COLLECTIONs, and PLANs. You should review these with your database administrator before going much further.
Here is some background information concerning program preparation in a DB2 environment:
Developing your Application