Data integration across multiple databases with a unique incremental SOR_id using Talend

I'm trying to integrate multiple databases using Talend and, in turn, have an SOR_id on each table for auditing purposes. Is it possible to map multiple source tables simultaneously to a destination table that has an auto-incremented SOR_id? Would the rows from each source table get incremental values?
I have approached this another way, as shown in the image, so that my SOR_id can be accounted for.

Related

How to check which tables a table is made from in PySpark

I have a core layer with some tables, and I want to find out which source-layer tables each of them is built from. The core-layer tables are made by joining some of the source-layer tables. I want to generate an Excel sheet from code that shows which source tables each core table is made from.
I am using PySpark on Databricks, and the code that creates the tables is written in notebooks.
Any help on how to approach this would be appreciated.
This is possible when you use Databricks Unity Catalog. It includes a feature called Data Lineage that tracks which tables and columns were used to create a specific table, as well as who consumes it. It also includes a Lineage API that can be used to export the lineage data.
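As a rough sketch of the export step, the snippet below flattens a lineage response into core-table/source-table pairs and renders them as CSV, which Excel opens directly. The endpoint path and the response field names (`upstreams`, `tableInfo`, `catalog_name`, `schema_name`, `name`) are assumptions based on my recollection of the Lineage REST API and should be checked against the Databricks documentation. The sample response and table names are hypothetical, and the HTTP call itself is left out.

```python
import csv
import io

# Assumed endpoint path for the Unity Catalog lineage REST API; verify it
# against the Databricks documentation for your workspace.
LINEAGE_PATH = "/api/2.0/lineage-tracking/table-lineage"


def upstream_rows(core_table, lineage):
    """Flatten one lineage response into (core_table, source_table) pairs."""
    rows = []
    for entry in lineage.get("upstreams", []):
        info = entry.get("tableInfo")
        if not info:
            continue  # this upstream entry carries no table information
        source = ".".join(
            [info["catalog_name"], info["schema_name"], info["name"]]
        )
        rows.append((core_table, source))
    return rows


def to_csv(rows):
    """Render the pairs as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["core_table", "source_table"])
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical response for illustration only; not real lineage data.
sample = {
    "upstreams": [
        {"tableInfo": {"catalog_name": "main", "schema_name": "source", "name": "orders"}},
        {"tableInfo": {"catalog_name": "main", "schema_name": "source", "name": "customers"}},
        {"notebookInfos": [{"notebook_id": 1}]},
    ]
}
rows = upstream_rows("main.core.order_facts", sample)
print(to_csv(rows))
```

In practice you would loop over the core tables, request the lineage for each one, and append the resulting rows to a single sheet.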

Row based database or Column based database

We are working on an audit system where auditors are given access to transactions processed in the last quarter. An auditor performs various analyses on the data to find invalid or erroneous transactions that have some exceptions.
Generally, these analyses require the data to be shown on charts to spot outliers, and sometimes duplicate detection is done based on multiple columns.
Sometimes the exception detection algorithms are quite involved and require multiple processing steps using stored procedures.
Please note that the analysis rarely involves aggregation over a huge number of rows.
Occasionally, auditors can change some data if they find it missing or incorrect.
We are evaluating row-based stores (SQL and NoSQL databases) and column stores (like data warehouse systems).
Is this a use case for a data warehouse or for a row-based store, like NoSQL or some RDBMS?
In short, the requirements are:
- Occasional updates
- Mostly read queries over the last 3 months of data
- Reading data may require several massaging steps, like creating a temp table in step 1, forming a join with another table in the next step, deleting some rows, etc.
Thanks
For your task, it does not really matter how the data is stored. You need to think instead how to create a solid dimensional model, populate it with data properly, and what reporting tools to use.
To give you an example, here are a couple of common setups I've used in my projects:
Microsoft stack setup:
- SQL Server for data storage
- SSIS for data ETL (or write your own stored procedures if you know what you are doing)
- Publish the dimensional model on the same SQL Server. If your data set is large (over a billion records), use SSAS Tabular instead
- Power Pivot or Power BI for interactive reporting, or SSRS for paginated reports
Open-source setup:
- PostgreSQL for data storage
- Use stored procedures and/or Python to process data
- Publish the dimensional model to another PostgreSQL database. If your data is large, publish the dimensional model to Redshift or another columnar database
- Use Tableau or Power BI for interactive reporting, or build your own reporting interface
I think a NoSQL database is the wrong choice here because auditing requires highly structured data.
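To make the point about structured data concrete, here is a minimal sketch of the multi-step analysis described in the question (stage duplicates in a temp table, join back, delete rows) as plain SQL. It uses Python's built-in `sqlite3` only so that it runs anywhere. The `txn` table, its columns, and the sample rows are hypothetical, and the same statements would look much the same on SQL Server or PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical transaction table; column names are illustrative only.
cur.execute(
    "CREATE TABLE txn (id INTEGER PRIMARY KEY, account TEXT, amount REAL, txn_date TEXT)"
)
cur.executemany(
    "INSERT INTO txn (account, amount, txn_date) VALUES (?, ?, ?)",
    [
        ("A1", 100.0, "2023-01-05"),
        ("A1", 100.0, "2023-01-05"),  # duplicate on all three columns
        ("B2", 250.0, "2023-02-10"),
        ("C3", 75.5, "2023-03-01"),
    ],
)

# Step 1: stage the multi-column duplicate keys in a temp table.
cur.execute(
    """
    CREATE TEMP TABLE dup_keys AS
    SELECT account, amount, txn_date, COUNT(*) AS n
    FROM txn
    GROUP BY account, amount, txn_date
    HAVING COUNT(*) > 1
    """
)

# Step 2: join back to the base table to list the suspect rows.
suspects = cur.execute(
    """
    SELECT t.id
    FROM txn AS t
    JOIN dup_keys AS d
      ON t.account = d.account AND t.amount = d.amount AND t.txn_date = d.txn_date
    ORDER BY t.id
    """
).fetchall()

# Step 3: occasional correction, keeping only the first row of each group.
cur.execute(
    """
    DELETE FROM txn
    WHERE id NOT IN (SELECT MIN(id) FROM txn GROUP BY account, amount, txn_date)
    """
)
conn.commit()
remaining = cur.execute("SELECT COUNT(*) FROM txn").fetchone()[0]
print(suspects, remaining)
```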

Build table of tables from other databases in Postgres - (Multiple-Server Parallel Query Execution?)

I am trying to find the best way to build a database relation. I need to create a table that contains data split across tables in different databases. All the tables have exactly the same structure (same number of columns, same names and types).
In a single database, I would create a parent table with partitions. However, the volume of data is too big for a single database, which is why I am trying to split it. From the Postgres documentation, I think what I am trying to do is "Multiple-Server Parallel Query Execution".
At the moment, the only solution I can think of is to build an API over the database addresses and use it to pull data across the network into the main parent database when needed. I also found an external Postgres extension called Citus that might do the job, but I don't know how to implement a unique key across multiple databases (or shards, as Citus calls them).
Is there a better way to do it?
Citus would most likely solve your problem. It lets you use unique keys across shards if the key is the distribution column, or if it is a composite key that contains the distribution column.
You can also use a distributed partitioned table in Citus: a table partitioned on one column (a timestamp, for example) and hash-distributed on another column (like the one you use in your existing approach). Citus handles query parallelization and data collection for you.
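The reason the unique key has to contain the distribution column can be shown with a toy simulation. This is not Citus code: the hash function, the shard count, and the `tenant_id`/`event_id` column names are all hypothetical stand-ins. The idea is that each shard can only check uniqueness against its own rows, so a key is globally unique only if identical keys are always routed to the same shard.

```python
import zlib

NUM_SHARDS = 4  # hypothetical shard count


def shard_for(distribution_value):
    # Stand-in hash; Citus uses its own hashing, this only illustrates routing.
    return zlib.crc32(str(distribution_value).encode()) % NUM_SHARDS


class ShardedTable:
    """Each shard enforces uniqueness only on the rows it holds."""

    def __init__(self):
        self.shards = [set() for _ in range(NUM_SHARDS)]

    def insert(self, tenant_id, event_id):
        # The composite key (tenant_id, event_id) includes the distribution
        # column, so rows with the same key always land on the same shard.
        key = (tenant_id, event_id)
        shard = self.shards[shard_for(tenant_id)]
        if key in shard:
            raise ValueError(f"duplicate key {key}")
        shard.add(key)


table = ShardedTable()
table.insert(1, 100)
table.insert(2, 100)  # same event_id, different tenant: allowed
try:
    table.insert(1, 100)  # same composite key: caught by a local check
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```

A key on `event_id` alone could not be checked this way, because two rows with the same `event_id` but different `tenant_id` values may sit on different shards.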

Loading multiple tables from source into multiple tables in target in Talend

I have around 25 tables with the same structure to load to the target, and they all use the same loading logic. I have prepared one job that does this, but designing it for every table is a long process.
Is there a way to pass the table name as a parameter and load to the target, so that one small job is enough?
I am using Talend Open Studio.
Check my answer to a similar question where I proposed a generic solution for loading a MySQL table to another MySQL table.
You just need to modify the queries that retrieve the tables' metadata (columns) depending on your database type.
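The metadata-driven idea can be sketched outside Talend as follows. This is not the Talend job from the referenced answer, only a Python illustration of the same pattern: the table name is a parameter, and the column list is read from the database's metadata at run time. It uses SQLite so that it is self-contained; the table names and the `copy_table` helper are hypothetical, and `PRAGMA table_info` is the SQLite-specific metadata query that you would swap for your database's equivalent.

```python
import sqlite3


def copy_table(src, dst, table):
    """Copy one table by reading its column list from metadata at run time."""
    # PRAGMA table_info is SQLite-specific; other databases expose columns
    # through information_schema or similar catalog views.
    # The table name is interpolated into SQL, so it must come from a
    # trusted list, never from user input.
    cols = [row[1] for row in src.execute(f'PRAGMA table_info("{table}")')]
    col_list = ", ".join(f'"{c}"' for c in cols)
    placeholders = ", ".join("?" for _ in cols)
    rows = src.execute(f'SELECT {col_list} FROM "{table}"').fetchall()
    dst.executemany(
        f'INSERT INTO "{table}" ({col_list}) VALUES ({placeholders})', rows
    )
    dst.commit()
    return len(rows)


# Hypothetical source and target with identical structures.
src = sqlite3.connect(":memory:")
dst = sqlite3.connect(":memory:")
for name in ("customers", "orders"):
    for conn in (src, dst):
        conn.execute(f'CREATE TABLE "{name}" (id INTEGER, label TEXT)')
    src.execute(f'INSERT INTO "{name}" VALUES (1, ?)', (name,))

# One generic routine handles every table in the list.
counts = {name: copy_table(src, dst, name) for name in ("customers", "orders")}
print(counts)
```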

talend - components vs view (one data source only)

Just wanted to know which will be faster for extracting data from multiple tables (same database). Is it better to create a view that joins all the tables and use it as the input source, or to read all the tables and join them with Talend components?
Regards,
Yin