Can we generate a data lineage from our DataStage Jobs? - datastage

We're using IBM DataStage 11.7.1
The metadata asset manager was not used in the Project.
Can we generate a data lineage out of the existing and used jobs (knowing that not 100% can be covered)? If yes: how?

Using DataStage alone, you can only generate lineage within a single job. That is, you can answer the questions "show where data flows to" and "show where data comes from" within the context of that one job. You access this functionality by right-clicking the stage you want to ask the question about.
Beyond that, you can generate data lineage more formally using the Information Governance Catalog tool. If you are not using shared metadata resources, and not generating operational metadata when running jobs, then the lineage report will be based on design data only.
If you share the table definitions you use in your jobs into the common metadata repository (from the Repository menu in DataStage Designer), then you will get better lineage results in IGC. If you generate operational metadata when running your jobs then these operational metadata will also be available in lineage reports.
Don't forget that DataStage jobs are not included in lineage by default. You need to mark at least the jobs of interest as "include for lineage" in the Administration page of IGC.

Is azure data factory ELT?

I have a question about Azure Data Factory (ADF). I have read (and heard) contradictory info about ADF being ETL or ELT. So, is ADF ETL? Or is it ELT? To my knowledge, ELT uses the transformation (compute?) engine of the target, whereas ETL uses a dedicated transformation engine. To my knowledge, ADF uses Databricks under the hood, which is really just an on-demand Spark cluster. That Spark cluster is separate from the target. So, that would mean that ADF is ETL. But I'm not confident about this.
Good question.
It all depends on what you use and how you use it.
If it is strictly a copy activity, then it is ELT.
https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-overview
The transform can be a stored procedure (which RDBMS does not matter), with tables as source and destination. If the landing zone is a data lake, then you want to call a Databricks or Synapse notebook. In that case the source is a file and the target is probably a Delta table. Most people love SQL, and Delta tables give you ACID properties.
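To make the copy-then-transform pattern concrete, here is a minimal sketch. It is not ADF code: an in-memory SQLite database stands in for the target RDBMS, and the table names `staging_orders` and `orders` are made up for illustration. The point is only that the rows are loaded unchanged and the target's own SQL engine does the transformation afterwards, which is what makes it ELT.

```python
import sqlite3


def run_elt(rows):
    """Load raw rows unchanged, then transform them inside the target database."""
    conn = sqlite3.connect(":memory:")  # stand-in for the real target RDBMS
    conn.execute("CREATE TABLE staging_orders (id INTEGER, amount_text TEXT)")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")

    # Extract + Load: copy the source rows as they are (the copy activity's job).
    conn.executemany("INSERT INTO staging_orders VALUES (?, ?)", rows)

    # Transform: runs on the target's SQL engine, standing in for a
    # stored procedure call issued after the copy has finished.
    conn.execute(
        "INSERT INTO orders (id, amount) "
        "SELECT id, CAST(TRIM(amount_text) AS REAL) FROM staging_orders "
        "WHERE TRIM(amount_text) <> ''"
    )
    conn.commit()
    result = conn.execute("SELECT id, amount FROM orders ORDER BY id").fetchall()
    conn.close()
    return result
```

Calling `run_elt([(1, " 10.5 "), (2, ""), (3, "7")])` loads all three raw rows into staging and leaves the cleaning (trimming, casting, dropping the empty amount) to the database.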
Now, if you are using a mapping or wrangling data flow, then it is ETL, if the pattern is pure. Of course you can mix and match. Both of these data flows use a Spark engine. It costs money to have big Spark clusters running.
https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-overview
https://learn.microsoft.com/en-us/azure/data-factory/wrangling-tutorial
Here is an article from MSDN.
https://learn.microsoft.com/en-us/azure/architecture/data-guide/relational-data/etl
It covers both old (SSIS) and new (Synapse) technologies.

write apache iceberg table to azure ADLS / S3 without using external catalog

I'm trying to create an iceberg table format on cloud object storage.
In the below image we can see that the Iceberg table format needs a catalog. This catalog stores the current metadata pointer, which points to the latest metadata. The Iceberg quick start doc lists JDBC, Hive Metastore, AWS Glue, Nessie and HDFS as catalogs that can be used.
My goal is to store the current metadata pointer (version-hint.text) along with the rest of the table data (metadata, manifest lists, manifests, Parquet data files) in the object store itself.
With HDFS as the catalog, there’s a file called version-hint.text in
the table’s metadata folder whose contents is the version number of
the current metadata file.
Looking at HDFS as one of the possible catalogs, I should be able to use ADLS or S3 to store the current metadata pointer along with the rest of the data. For example: Spark connecting to ADLS through the ABFSS interface and creating an Iceberg table along with its catalog.
My questions are:
Is it safe to use the version hint file as the current metadata pointer in ADLS/S3? Will I lose any Iceberg features if I do this? This comment from one of the contributors suggests that it's not ideal for production.
The version hint file is used for Hadoop tables, which are named that
way because they are intended for HDFS. We also use them for local FS
tests, but they can't be safely used concurrently with S3. For S3,
you'll need a metastore to enforce atomicity when swapping table
metadata locations. You can use the one in iceberg-hive to use the
Hive metastore.
Looking at comments on this thread, Is version-hint.text file optional?
we iterate through on the possible metadata locations and stop only if
there is not new snapshot is available
Could someone please clarify?
I'm trying to do a POC with Iceberg. At this point the requirement is to be able to write new data from Databricks to the table at least every 10 minutes. This frequency might increase in the future.
The data, once written, will be read by Databricks and Dremio.
I would definitely try to use a catalog other than the HadoopCatalog / hdfs type for production workloads.
As somebody who works on Iceberg regularly (I work at Tabular), I can say that we do think of the hadoop catalog as being more for testing.
The major reason for that, as mentioned in your threads, is that the catalog provides an atomic locking compare-and-swap operation for the current top level metadata.json file. This compare and swap operation allows for the query that's updating the table to grab a lock for the table after doing its work (optimistic locking), write out the new metadata file, update the state in the catalog to point to the new metadata file, and then release that lock.
The lock isn't something that really works out of the box with HDFS / hadoop type catalog. And then it becomes possible for two concurrent actions to write out a metadata file, and then one sets it and the other's work gets erased or undefined behavior occurs as ACID compliance is lost.
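The difference can be sketched in a few lines. This is a toy model, not Iceberg's implementation: `InMemoryCatalog`, `compare_and_swap` and the `v<N>-<writer>.metadata.json` naming are all hypothetical and exist only to show why an atomic pointer swap keeps one writer from silently erasing another's commit.

```python
import threading


class InMemoryCatalog:
    """Hypothetical catalog: maps a table name to its current metadata file."""

    def __init__(self):
        self._pointers = {}
        self._lock = threading.Lock()

    def current(self, table):
        return self._pointers.get(table)

    def compare_and_swap(self, table, expected, new):
        """Atomically move the pointer only if nobody changed it in the meantime."""
        with self._lock:
            if self._pointers.get(table) != expected:
                return False  # another writer committed first
            self._pointers[table] = new
            return True


def commit(catalog, table, writer_id, max_retries=5):
    """Optimistic commit: prepare new metadata, then swap the pointer or retry."""
    for _ in range(max_retries):
        base = catalog.current(table)
        # Derive the next version from the file name the pointer currently holds.
        version = 1 if base is None else int(base.split("-")[0][1:]) + 1
        new_metadata = f"v{version}-{writer_id}.metadata.json"
        if catalog.compare_and_swap(table, base, new_metadata):
            return new_metadata
        # The swap failed, so re-read the pointer and rebase on the winner's state.
    raise RuntimeError("too many concurrent commits")
```

A writer that prepared its metadata against a stale pointer gets `False` back from `compare_and_swap` and has to retry on top of the winner's state. A plain file overwrite on an object store has no equivalent of that check, which is where the lost-update risk comes from.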
If you have an RDS instance or some sort of JDBC database, I would suggest that you consider using that temporarily. There's also the DynamoDB catalog, and if you're using Dremio then Nessie can be used as your catalog as well.
In the next version of Iceberg -- the next major version after 0.14, which will likely be 1.0.0, there is a procedure to register tables into a catalog, which makes it easy to move a table from one catalog to another in a very efficient metadata only operation, such as CALL catalog.system.register_table('$new_table_name', '$metadata_file_location');
So you're not locked into one catalog if you start with something simple like the JDBC catalog and then move onto something else. If you're just working out a POC, you could start with the Hadoop catalog and then move to something like the JDBC catalog once you're more familiar, but it's important to be aware of the potential pitfalls of the hadoop type catalog which does not have the atomic compare-and-swap locking operation for the metadata file that represents the current table state.
There's also an option to provide a locking mechanism to the hadoop catalog, such as zookeeper or etcd, but that's a somewhat advanced feature and would require that you write your own custom lock implementation.
So I still stand by the JDBC catalog as the easiest to get started with, as most people can get an RDBMS from their cloud provider or spin one up pretty easily. Your choice of starting catalog is not something to worry about too much, especially now that you will be able to efficiently move your tables to a new catalog with the code in the current master branch or in the next major Iceberg release.
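For reference, a JDBC-backed catalog is mostly a matter of Spark configuration. The helper below is a sketch that only assembles the properties. The property names are the ones I recall from the Iceberg JDBC catalog documentation, so check them against the Iceberg version you run. The catalog name, JDBC URI, warehouse path and credentials are placeholders you would supply yourself.

```python
def jdbc_catalog_conf(name, jdbc_uri, warehouse, user, password):
    """Build Spark properties for an Iceberg catalog backed by a JDBC database."""
    prefix = f"spark.sql.catalog.{name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.jdbc.JdbcCatalog",
        f"{prefix}.uri": jdbc_uri,          # where the metadata pointers live
        f"{prefix}.warehouse": warehouse,   # where the table files live
        f"{prefix}.jdbc.user": user,
        f"{prefix}.jdbc.password": password,
    }


def as_spark_submit_args(conf):
    """Flatten the properties into --conf arguments for spark-submit or spark-sql."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args
```

The same dictionary can be fed to `SparkSession.builder.config(key, value)` in a notebook. The warehouse can be an `abfss://` or `s3://` path, so the data and metadata files still live in the object store and only the current-metadata pointer lives in the database.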
Looking at comments on this thread, Is version-hint.text file optional?
Yes, the version-hint.text file is used by the hadoop type catalog to attempt to provide an authoritative location where the table's current top-level metadata file is located. So version-hint.text is only found with the hadoop catalog, as other catalogs store the pointer through their own specific mechanism. With the JDBC catalog, a table in an RDBMS instance stores all of the catalog's "version hints"; the same goes for the Hive catalog, which is backed by Hive Metastore (and very typically an RDBMS). Other catalogs include the DynamoDB catalog.
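To illustrate what the hint file does, here is a small sketch that resolves the current metadata file the way the question describes: read the version number from version-hint.text in the table's metadata folder, then pick the matching metadata file. The `v<N>.metadata.json` naming is an assumption about the hadoop catalog's layout, and the demo builds a throwaway local directory rather than touching a real table.

```python
import tempfile
from pathlib import Path


def current_metadata_file(table_dir):
    """Resolve the current metadata file of a hadoop-catalog-style table."""
    metadata_dir = Path(table_dir) / "metadata"
    hint = metadata_dir / "version-hint.text"
    version = int(hint.read_text().strip())  # the file holds only a version number
    # Assumed naming convention: v<N>.metadata.json
    return metadata_dir / f"v{version}.metadata.json"


def demo():
    """Build a fake table layout on local disk and resolve its current metadata."""
    with tempfile.TemporaryDirectory() as tmp:
        metadata_dir = Path(tmp) / "metadata"
        metadata_dir.mkdir()
        for v in (1, 2):
            (metadata_dir / f"v{v}.metadata.json").write_text("{}")
        (metadata_dir / "version-hint.text").write_text("2")
        return current_metadata_file(tmp).name
```

Nothing in this lookup is atomic: a writer has to create the new metadata file and then update the hint as two separate steps, which is why concurrent writers on S3 or ADLS can step on each other.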
If you have more questions, the Apache Iceberg slack is very active.
Feel free to check out the docker-spark-iceberg getting started tutorial (which I helped create), which includes Jupyter notebooks and a docker-compose setup.
It uses the JDBC catalog backed by Postgres. With that, you can get a feel for what the catalog is doing by ssh'ing into the containers and running psql commands, as well as looking at table data on your local machine. There are also some nice tutorials with sample data!
https://github.com/tabular-io/docker-spark-iceberg

Where does using Azure Data Factory Mapping Data Flow make sense?

My assumptions about where MDF might be the right fit are as follows:
1. MDF can be used as a data wrangling tool by end users.
2. MDF is better suited for SQL Server-based data warehouse architectures, to load the data into staging or a data lake in a clean format (prepare the data before loading it into the SQL Server DWH, and then use a proper ETL tool to do the transformations).
3. If MDF has to be used for light ELT / ETL tasks directly on a data lake or DWH, it needs customization for complex transformations...
My question would be:
A) Has anyone used Mapping Data Flow in production for options 2 and 3 above?
B) If assumption 3 is valid, would you suggest going for a Spark-based transformation or an ETL tool rather than patching MDF with customizations, as new versions might not be compatible with them, etc.?
I disagree with most of your assumptions. Data Flow is part of a larger ETL environment, either Data Factory (ADF) or Azure Synapse Pipelines, and you really can't separate it from its host. Data Flow is a UI code generator that executes at runtime as a Spark job. If your end user is a data engineer, then yes, Data Flow is a good tool for them.
ADF is a great service for orchestrating data operations. ADF supports all the things you mentioned (SSIS, Notebooks, Stored Procedures, and many more). It also supports Data Flow, which is absolutely a "proper" tool for transformations and has a very rich feature set. In fact, if you are NOT doing transformations, Data Flow is likely overkill for your solution.

Exporting AnyLogic dataset log files as outputs in Cloud model

I want to upload my AnyLogic model to AnyLogic Cloud through the run configuration process, and for outputs I want to add several dataset log files. How can I do that?
Just add the datasets themselves as model outputs; graphs for them will then be included in the Cloud experiments' dashboards.
The AnyLogic Cloud doesn't currently support having Excel outputs from your models (just Excel inputs), so you can't use AnyLogic-database-dataset-logs exported to Excel. But you can now (since the most recent version) download the outputs from all your model runs in JSON format (which would include the dataset data).
Otherwise you have to do it 'Cloud-natively' instead; e.g., write the relevant output data to some Cloud-based database (like an Amazon database). That obviously requires knowing how to do that in Java (and an AnyLogic Cloud subscription, or a Private Cloud installation, to be able to access external resources).

Can I Extract, Translate, and Compare using Talend?

I am evaluating Talend and I have little experience using the tool. My question is regarding audit. Does Talend offer a way to run with an audit option that extracts the source data, transforms it and then compares the transformed data to the existing target data? I would like a report that tells me which records would have triggered an update if the audit option had not prevented it.
I know that ETLs are triggered on change, but I want to run this audit without regard to changed timestamps. That is, a query would specify which source records to audit.
I have searched the Talend documentation but nothing jumped out at me.
Thank you.
EDIT
Looking outside of Talend, I found where Tim Mitchell wrote:
While I have not yet found an auditing framework with enough
customization to suit me, I have learned that it is possible to audit
ETL processes in a customized way without reinventing the wheel every
time. I keep on hand a common (but never set in stone) set of table
definitions and supporting logic to shorten the path to the following
auditing objectives:
Simple aggregate auditing of row counts, financial data, and other key metrics
Documentation templates for source-to-target mappings and transformation logic
Lightweight auditing of batch-level changes
Full auditing of every change made in the transformation process
So I guess that auditing is on me.
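A hand-rolled version of the dry run described above could look roughly like this. It is a plain-Python sketch, not a Talend feature: `audit_updates`, the `transform` callback and the `id` key column are hypothetical names, and rows are modeled as dictionaries. It transforms the selected source records, compares them to the existing target rows, and reports what a real load would have inserted or updated without writing anything.

```python
def audit_updates(source_rows, target_rows, transform, key="id"):
    """Report which records a load would insert or update, without writing."""
    target_by_key = {row[key]: row for row in target_rows}
    would_insert, would_update = [], []
    for raw in source_rows:
        new = transform(raw)
        old = target_by_key.get(new[key])
        if old is None:
            would_insert.append(new)
            continue
        # Compare only the columns the transformation produces.
        changed = sorted(col for col in new if old.get(col) != new[col])
        if changed:
            would_update.append({"key": new[key], "changed_columns": changed})
    return {
        "source_count": len(source_rows),   # simple aggregate audit figures
        "target_count": len(target_rows),
        "would_insert": would_insert,
        "would_update": would_update,
    }
```

Because the source rows are whatever the caller's query selected, the audit does not depend on changed timestamps. The row counts in the report cover the "simple aggregate auditing" objective from the quoted list, and the per-record entries cover the change-level one.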