ETL tool or ad-hoc solutions? - postgresql

I'm designing a data warehouse system, the origin data sources are two: files (hexadecimal format, record structure known) and PostgreSQL database.
The ETL phase has to read the content of the two sources (files and DB) and combining/integrating/cleaning them. After this, loading data into the DW.
For this purpose, is better a tool (for example Talend) or ad-hoc solution (writing ad-hoc routines by using a programming language)?

I would suggest you use the Bulk Loader to get your flat file into DB. This allows you to customize the loading rules and then process/cleanse the resulting data set using regular SQL (no other custom code to write)

Related

Where to use the Azure Data Factory Mapping Data Flow make sense?

My assumptions where MDF might be right fit are as follows:
MDF can be used as a Data Wrangling Tool by end-users
MDF is better suited for SQL Server-based Datawarehouse architectures to load the data into staging or data lake in clean format (prepare the data before loading it to SQL Server DWH and then use a proper ETL tool to do transformations)
If MDF has to be used for light ELT / ETL tasks directly on Data Lake or DWH, it needs customization for complex transformations...
My question would be:
A) Did anyone use Mapping Data Flow in production for option 2 and 3 above?
B) If assumption 3 is valid, would you suggest going for Spark-based transformation or an ETL tool rather than patching the MDF with customizations as new versions might not be compatible with, etc..
I disagree with most of your assumptions. Data Flow is a part of a larger ETL environment, either Data Factory (ADF) or Azure Synapse Pipelines and you really can't separate it from it's host. Data Flow is a UI code generator that executes at runtime as a Spark job. If your end user is a data engineer, then yes Data Flow is a good tool for them.
ADF is a great service for orchestrating data operations. ADF supports all the things you mentioned (SSIS, Notebooks, Stored Procedures, and many more). It also supports Data Flow, which is absolutely a "proper" tool for transformations and has a very rich feature set. In fact, if you are NOT doing transformations, Data Flow is likely overkill for your solution.

How to access gold table in delta lake for web dashboards and other?

I am using the delta lake oss version 0.8.0.
Let's assume we calculated aggregated data and cubes using the raw data and saved the results in a gold table using delta lake.
My question is, is there a well known way to access these gold table data and deliver them to a web dashboard for example?
In my understanding, you need a running spark session to query a delta table.
So one possible solution could be to write a web api, which executes these spark queries.
Also you could write the gold results in a database like postgres to access it, but that seems just duplicating the data.
Is there a known best practice solution?
The real answer depends on your requirements regarding latency, number of requests per second, amount of data, deployment options (cloud/on-prem, where data located - HDFS/S3/...), etc. Possible approaches are:
Have the Spark running in the local mode inside your application - it may require a lot of memory, etc.
Run Thrift JDBC/ODBC server as a separate process, and access data via JDBC/ODBC
Read data directly using the Delta Standalone Reader library for JVM, or via delta-rs library that works with Rust/Python/Ruby

tELTPostgresql* usage issue

I'm trying to use tELTPostgresqlOutput with postgres 9.3 server and this is the result:
With a simple tPostgresqlInput and a tLogRow it works perfectly.
This is not how to use the ELT components. These should be used to do in database server transformations such as creating a star schema table from multiple tables in the same database. This allows you to use the database to do the transformation and avoid reading the data into memory for your job. It's particularly useful when dealing with large datasets that can't be broken down for the transformation.
If you want to transfer data from one database server/vendor to another you will need to use ETL components (pretty much anything not explicitly marked ELT) to read data out of the source database and write it back to the target database.
In this case you should be using a tMSSQLInput component to read in the data you need, a tMap to transform the data in the way you want and a tPostgresqlOutput component to write the data out to the Postgres database.

Importing AccessDB and Oracle directly into MongoDB

I am receiving .dmp and .mdb files from a customer & need to get that data into MongoDB.
Is there any way to straight import these file types into Mongo?
The goal is to programmatically ingest these into mongo in any way I can. The only rule is that customer will not change their method of data delivery, meaning I'm stuck with the .dmp and .mdb files as a source.
Any assistance would be greatly appreciated.
Here are a few options/ideas:
Convert mdb to csv, then use mongoimport --type csv to import into MongoDB.
Use an ETL tool, e.g. Pentaho, Informatica, etc. This will give you much more flexibility for doing any necessary transformation/conversion of data.
Write a custom ETL tool, using libraries that know how to read mdb and dmp files.
You don't mention how you plan to use this data, how many tables are in the database, and how normalized the tables are. Depending on the specifics of your use case, it's very possible that loading the data from Access "as is" will not be a good choice since normalized schemas are not a good fit for MongoDB and MongoDB does not natively support joins. This is where an ETL tool can help, by extracting the source data and transforming it into an appropriate JSON structure.
MongoDB has released ODBC drivers. Go Here MongoDB ODBC Drivers connect MSAccess directly to MongoDB through ODBC. Voila!

Migrating a schema from one database to other

As part of some requirement, I need to migrate a schema from some existing database to a new schema in a different database. Some part of it is already done and now I need to compare the 2 schema and make changes in the new schema as per gap finding.
I am not using a tool and was trying to understand some details using syscat command but could not get much success.
Any pointer on what is the best way to solve this?
Regards,
Ramakant
A tool really is the best way to solve this – IBM Data Studio is free and can compare schemas between databases.
Assuming you are using DB2 for Linux/UNIX/Windows, you can do a rudimentary compare by looking at selected columns in SYSCAT.TABLES and SYSCAT.COLUMNS (for table definitions), and SYSCAT.INDEXES (for indexes). Exporting this data to files and using diff may be the easiest method. However, doing this for more complex structures (tables with range or database partitioning, foreign keys, etc) will become very complex very quickly as this information is spread across a lot of different system catalog tables.
An alternative method would be to extract DDL using the db2look utility. However, you can't specify the order that db2look outputs objects (db2look extracts DDL based on the objects' CREATE_TIME), so you can't extract DDL for an entire schema into a file and expect to use diff to compare. You would need to extract DDL into a separate file for each table.
Use SchemaCrawler for IBM DB2, a free open-source tool that is designed to produce text output that is designed to be diffed. You can get very detailed information about your schema, including view and stored procedure definitions. All of the information that you need will be output in a single file, and can be compared very easily using a standard diff tool.
Sualeh Fatehi, SchemaCrawler
unfortunately as per company policy, cannot use these tools at this point of time. So am writing some program using JDBC to get the details and do some comparison kind of stuff.