How to find which source tables a table was built from in PySpark

I have a core layer containing several tables, and I want to find out which tables in the source layer each of them is built from. The core-layer tables are created by joining some of the source-layer tables. I want to generate an Excel sheet with code that shows which source tables each core table is made from.
I am using PySpark on Databricks, and the code that creates the tables lives in notebooks.
Any help on how to approach this would be appreciated.

This is possible if you use Databricks Unity Catalog. It includes a feature called Data Lineage that tracks which tables and columns were used to create a specific table, as well as who consumes it. It also provides a Lineage API that can be used to export the lineage data.
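As a rough illustration of the export step, here is a minimal sketch that flattens one lineage response into "core table, source table" rows and renders them as CSV, which Excel can open. It makes no network call. The response shape used here (an `upstreams` list whose entries carry a `tableInfo` object with `catalog_name`, `schema_name` and `name`) is an assumption to verify against the Lineage API documentation, and the table names and the helpers `lineage_to_rows` and `rows_to_csv` are hypothetical.

```python
import csv
import io


def lineage_to_rows(core_table, lineage):
    """Flatten one table-lineage response into core/source table pairs.

    ASSUMPTION: `lineage` looks like {"upstreams": [{"tableInfo": {...}}, ...]}.
    """
    rows = []
    for upstream in lineage.get("upstreams", []):
        info = upstream.get("tableInfo")
        if not info:
            # Skip upstream entries that do not describe a table.
            continue
        parts = [info.get(key) for key in ("catalog_name", "schema_name", "name")]
        source = ".".join(part for part in parts if part)
        rows.append({"core_table": core_table, "source_table": source})
    return rows


def rows_to_csv(rows):
    """Render the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["core_table", "source_table"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical response for a hypothetical core table.
sample_response = {
    "upstreams": [
        {"tableInfo": {"catalog_name": "main", "schema_name": "source", "name": "orders"}},
        {"tableInfo": {"catalog_name": "main", "schema_name": "source", "name": "customers"}},
        {"notebookInfos": [{"notebook_id": 1}]},  # no tableInfo, so it is skipped
    ]
}

rows = lineage_to_rows("main.core.orders_enriched", sample_response)
print(rows_to_csv(rows))
```

In practice you would loop over the core tables, fetch the lineage for each one, and append the rows to a single file.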

Related

Multiple table import from html page

I am a beginner with databases and decided on Postgres because I learned it works with Python.
I receive reports as HTML files. Each file yields multiple (100+) tables when parsed with the pandas data frame function; there is no unique ID common to all the tables, and each table has its own unique columns.
Is it possible to import all the tables, merge them into a single table containing ALL the columns, and have each report be a single row in this new table using a built-in PostgreSQL feature, or do I have to develop a data pipeline in Python and add them manually?
I hope my question is clear enough.
Thank you.

Synapse mapping data flow with a parameterized dynamic source needs to import the projection dynamically

I am trying to build a cloud data warehouse in which I have staged the on-prem tables as Parquet files in a data lake.
I implemented a metadata-driven incremental load.
In the above data flow I am trying to implement a merge query, passing the table name as a parameter so that the data flow dynamically locates the respective Parquet files for the full data and the incremental data and then goes through some ETL steps to implement the merge.
The merge query works fine, but I found that the projection is not correct. Because the source files are dynamic, I also want to "import projection" dynamically at runtime, so that the same data flow can be used to implement the merge query for any table.
In the picture, you can see it showing 104 columns (a static projection that was imported at development time). For this table it should actually be 38 columns.
Can I assign the projection dynamically (i.e. at runtime)? If so, how?
Does anyone have any suggestions regarding this?
Thanks,
Muntasir Joarder
Enable schema drift in your source transformation when the metadata changes often. This lets columns be added or removed at runtime.
The source projection displays what was imported at design time, but with schema drift enabled the columns change based on the source schema at runtime.
Refer to this document for more details with examples.

Data integration across multiple databases with a unique incremental SOR_id using Talend

I'm trying to integrate multiple databases using Talend and, in turn, have an SOR_id for each table for auditing purposes. Is it possible to map multiple source tables simultaneously to a destination table that has an auto-incremented SOR_id? Would I get incremental values for the rows of each source table?
I have approached this another way, as shown in the image, so that my SOR_id can be accounted for.

OrientDB Teleporter - Pull only selected columns for Vertex from RDBMS

I am trying to pull data from an Oracle RDBMS and move it to OrientDB using Teleporter. My relational database has multiple columns and maintains E-R relationships. I have two questions:
My first objective is to import only a few columns (those that hold the unique identity and foreign key relations) rather than all the bulky column data. Is there any configuration that lets me do so? Today, include and exclude only work at the level of whole database tables.
My other objective is to keep my graph database in sync with the selected table-column data that I pushed in the previous run. I would also want any additional data that arrives in the RDBMS to appear in my graph database.
You can use this feature, and others, in OrientDB 3.0 through a JSON configuration, but there is no documentation for it yet. Currently, in 2.2.x, you can only configure relationships and edges, as described here:
http://orientdb.com/docs/2.2.x/Teleporter-Import-Configuration.html
In the next 2 weeks all these features will also be available in 2.2.x and well documented, to make the configuration easy to understand.
At the moment you can adopt the following workaround:
Import all the columns of each table into the corresponding vertex as usual.
Drop the properties you are not interested in after each sync. You could write a script that calls the Teleporter execution and then deletes the properties you don't care about from the schema.
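The second step of the workaround could be scripted roughly as follows. This is only a sketch: it builds the `DROP PROPERTY` statements to run after each sync and does not connect to OrientDB or invoke Teleporter. The helper `drop_property_statements` and the class and property names are hypothetical.

```python
def drop_property_statements(imported, wanted):
    """Build OrientDB SQL that drops every imported property not listed in `wanted`.

    imported: {class_name: [all property names created by the import]}
    wanted:   {class_name: [property names to keep]}
    """
    statements = []
    for class_name in sorted(imported):
        keep = set(wanted.get(class_name, ()))
        for prop in sorted(imported[class_name]):
            if prop not in keep:
                statements.append(f"DROP PROPERTY {class_name}.{prop}")
    return statements


# Hypothetical vertex class: keep only the identity and foreign-key columns.
imported = {"Employee": ["id", "dept_id", "photo", "resume"]}
wanted = {"Employee": ["id", "dept_id"]}

for statement in drop_property_statements(imported, wanted):
    print(statement)
```

To my understanding, `DROP PROPERTY` only changes the schema and leaves existing values in the records, so you may also need to clear those values if the goal is to save space.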
I will update here when the alignment with 3.0 and the documentation are complete.

Migrating a schema from one database to other

As part of a requirement, I need to migrate a schema from an existing database to a new schema in a different database. Part of this is already done, and now I need to compare the two schemas and change the new schema based on the gaps I find.
I am not using a tool and was trying to get the details from the SYSCAT views, but without much success.
Any pointers on the best way to solve this?
Regards,
Ramakant
A tool really is the best way to solve this – IBM Data Studio is free and can compare schemas between databases.
Assuming you are using DB2 for Linux/UNIX/Windows, you can do a rudimentary compare by looking at selected columns in SYSCAT.TABLES and SYSCAT.COLUMNS (for table definitions), and SYSCAT.INDEXES (for indexes). Exporting this data to files and using diff may be the easiest method. However, doing this for more complex structures (tables with range or database partitioning, foreign keys, etc) will become very complex very quickly as this information is spread across a lot of different system catalog tables.
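A minimal sketch of that export-and-diff idea, assuming you have already pulled the relevant SYSCAT.COLUMNS rows from each database (for example TABNAME, COLNAME, TYPENAME, LENGTH and NULLS) into Python tuples. The sample rows and the helpers `normalize` and `diff_columns` are hypothetical. The schema name is deliberately left out of each line so that two differently named schemas can still be compared.

```python
import difflib


def normalize(rows):
    """Turn (table, column, type, length, nullable) rows into sorted text lines."""
    return sorted(
        f"{table}.{column} {type_name}({length}) nulls={nullable}"
        for table, column, type_name, length, nullable in rows
    )


def diff_columns(old_rows, new_rows):
    """Return a unified diff of the two column listings, without context lines."""
    return list(
        difflib.unified_diff(
            normalize(old_rows),
            normalize(new_rows),
            fromfile="old_schema",
            tofile="new_schema",
            lineterm="",
            n=0,
        )
    )


# Hypothetical rows exported from the catalog of each database.
old_rows = [
    ("ORDERS", "ID", "INTEGER", 4, "N"),
    ("ORDERS", "NOTE", "VARCHAR", 100, "Y"),
]
new_rows = [
    ("ORDERS", "ID", "INTEGER", 4, "N"),
    ("ORDERS", "NOTE", "VARCHAR", 200, "Y"),
]

for line in diff_columns(old_rows, new_rows):
    print(line)
```

Sorting the lines first makes the comparison independent of the order in which each database returns the rows.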
An alternative method would be to extract DDL using the db2look utility. However, you can't specify the order that db2look outputs objects (db2look extracts DDL based on the objects' CREATE_TIME), so you can't extract DDL for an entire schema into a file and expect to use diff to compare. You would need to extract DDL into a separate file for each table.
Use SchemaCrawler for IBM DB2, a free open-source tool designed to produce text output that can be diffed. You can get very detailed information about your schema, including view and stored procedure definitions. All of the information you need is output in a single file and can be compared very easily using a standard diff tool.
Sualeh Fatehi, SchemaCrawler
Unfortunately, per company policy, I cannot use these tools at this point. So I am writing a program using JDBC to get the details and do the comparison.