How to get schema of Delta table without reading content? - pyspark

I have a Delta table with millions of rows and several columns of various types, including nested structs. I want to create an empty DataFrame clone of the Delta table at runtime, i.e. same schema, no rows.
Can I read the schema without reading any content of the table (so that I can then create an empty DataFrame based on the schema)? I assumed it would be possible, since there are the Delta transaction logs and Delta itself needs to access table schemas quickly.
What I tried:
df.schema - Accessing schema immediately after the delta table load took several minutes as well.
limit(0) - Calling limit(0) immediately after the load still took several minutes.
limit(0).cache() - limit gets sometimes moved around in the plan, so I also tried adding cache to "fix its position".
Are there any other options? Would it be correct to just access the transaction log JSON and read the schema from the latest transaction? (Given that we
Context: I want to add a step into our CI that checks the code and various assumptions around schema before it gets actually run with the data.

When you access the schema of a Delta table, Spark doesn't go through all the data, because Delta stores the schema in the transaction log itself, so df.schema should be enough. However, when the transaction log is accessed, it may take some time to reconstruct the actual schema from the JSON/Parquet files that make up the log. Several minutes is quite strange, though, and you need to dig into the execution plan.
I wouldn't recommend reading the transaction log directly, as its format is an internal detail; in addition, the latest transaction may not contain the schema (it's not put into every log file, only when changes happen).
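As a minimal sketch of the approach described above (take the schema from the loaded table and build an empty DataFrame from it), something like the following could work. It assumes `spark` is an active SparkSession with Delta Lake configured; the function name `empty_clone` and the `table_path` argument are hypothetical.

```python
def empty_clone(spark, table_path):
    """Return a zero-row DataFrame with the same schema as the Delta table."""
    # load() is lazy; .schema only needs the table metadata, not a scan of the rows.
    schema = spark.read.format("delta").load(table_path).schema
    # An empty list plus an explicit schema gives a DataFrame with no rows.
    return spark.createDataFrame([], schema)
```

In a CI step, the returned DataFrame can be passed through the same transformations as the real data, so schema mismatches surface at analysis time without any rows being read.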

Related

Azure Data Factory - Copy Data Upsert only updating a single row at a time

I'm using Data Factory (well, Synapse pipelines) to ingest data from sources into a staging layer. I am using the Copy Data activity with UPSERT. However, I found the performance of incrementally loading large tables particularly slow, so I did some digging.
My incremental load brought in 193k new/modified records from the source. These get stored in the transient staging/landing table that the Copy Data activity creates in the database in the background. In this table it adds a column called BatchIdentifier; however, the batch identifier value is different for every row.
Profiling the load, I can see individual statements issued for each BatchIdentifier, so effectively it's processing the incoming data row by row rather than using a batch process to do the same thing.
I tried setting the sink writeBatchSize property on the Copy Data activity to 10k, but that doesn't make any difference.
Has anyone else come across this, or found a better way to perform a dynamic upsert without having to specify all the columns in advance (which I'm really hoping to avoid)?
This is the SQL statement issued 193k times on my load as an example.
It checks whether the record exists in the target table; if so, it performs an update, otherwise it performs an insert. The logic makes sense, but it's performing this on a row-by-row basis when it could just be done in bulk.
Is your primary key definition in the source the same as in the sink?
I just ran into this same behavior when the keys in the source and destination tables used different columns.
It also appears ADF/Synapse does not use MERGE for upserts, but its own IF EXISTS THEN UPDATE ELSE INSERT logic so there may be something behind the scenes making it select single rows for those BatchId executions.

How can I automatically maintain a dump of modified rows in PostgreSQL

So, I have a PostgreSQL DB. For some chosen tables in that DB I want to maintain a plain dump of the rows when they are modified. Note that this dump is not a recovery or backup dump. It is just a file containing the incremental rows. That is, whenever a row is inserted or updated, I want it appended to this file or to a file in a folder. The idea is to load that folder into something like Hive periodically, so that I can run queries to check previous states of certain rows and columns. These are very high-transaction tables, and the dump does not need to be real time. It can be in batches, every hour. I want to avoid a trigger firing hundreds of times every minute. I am looking for something off the shelf, already available in PostgreSQL. I did some research, but everything is related to PostgreSQL backup, which is not the exact use case.
I have read some links, like https://clarkdave.net/2015/02/historical-records-with-postgresql-and-temporal-tables-and-sql-2011/ and "Implementing history of PostgreSQL table", but these are based on insert/update triggers and create the history table in PostgreSQL itself. I want to avoid both. I cannot keep the history in PostgreSQL, as it will soon be huge. And I do not want to keep writing to files through a trigger firing constantly.

DB2 updated rows since last check

I want to periodically export data from db2 and load it in another database for analysis.
In order to do this, I would need to know which rows have been inserted/updated since the last time I've exported things from a given table.
A simple solution would probably be to add a timestamp to every table and use that as a reference, but I don't have such a TS at the moment, and I would like to avoid adding it if possible.
Is there any other solution for finding the rows which have been added/updated after a given time (or something else that would solve my issue)?
There is an easy option for a timestamp in Db2 (for LUW) called
ROW CHANGE TIMESTAMP
This is managed by Db2 and can be defined as HIDDEN, so existing SELECT * FROM queries will not retrieve the new column, which would otherwise cause extra cost.
Check out the Db2 CREATE TABLE documentation
This functionality was originally added for optimistic locking but can be used for such situations as well.
There is a similar concept for Db2 z/OS - you have to check that out as I have not tried this one.
Of course, there are other ways to solve it, like replication.
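To make the ROW CHANGE TIMESTAMP suggestion more concrete, here is a rough sketch that builds the two statements involved: one to add a hidden, Db2-managed change timestamp column, and one to select the rows changed since the last export. The helper name, the default column name `ROW_CHANGED_AT`, and the exact clause order are assumptions to be verified against the Db2 CREATE TABLE / ALTER TABLE documentation; the table name is interpolated directly, so it must come from a trusted source.

```python
def row_change_timestamp_sql(table, column="ROW_CHANGED_AT"):
    """Build the DDL and the incremental-export query for one table."""
    # Hidden column maintained by Db2 on every insert/update (Db2 for LUW).
    ddl = (
        f"ALTER TABLE {table} ADD COLUMN {column} TIMESTAMP NOT NULL "
        "GENERATED ALWAYS FOR EACH ROW ON UPDATE AS ROW CHANGE TIMESTAMP "
        "IMPLICITLY HIDDEN"
    )
    # The '?' parameter marker is bound to the timestamp of the previous export.
    query = f"SELECT * FROM {table} WHERE ROW CHANGE TIMESTAMP FOR {table} > ?"
    return ddl, query
```

The export job would run the query with the time of the previous run as the parameter and record the new high-water mark afterwards.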
That is not possible if you do not have a timestamp column. With a timestamp, you can know which are new or modified rows.
You can also use the TimeTravel feature, in order to get the new values, but that implies a timestamp column.
Another option is to put the tables in append mode and then get the rows after a given one. However, this option is not reliable after a reorg, and it affects performance and space utilisation.
One possible option is to use SQL replication, but that needs extra tables for staging.
Finally, another option is to read the logs with the db2ReadLog API, but that requires custom development. Simply applying the archived logs to the new database is also possible; however, the database will remain in roll forward pending state.

SSIS or TSQL for SQL/MySQL table comparison

I am new to SSIS and am after some assistance in creating an SSIS package to do a specific task. My data is stored remotely in a MySQL database and is downloaded to a SQL Server 2014 database. What I want to do is create a package where I can enter two dates that are compared against the create date/date modified per record on a number of tables, to give me a snapshot comparing the MySQL data to the SQL Server data, so that I can see whether any rows are missing from my local SQL Server database or need to be updated. Some tables have no dates, so for those I just want a record count of what is missing, if anything, between the two. If this is better achieved through T-SQL, I am happy to hear other suggestions or about sites where something similar has been done.
In relation to your query, Tab:
"Hi Tab, What happens at the moment is our master data is stored in a MySQL database; the data was then downloaded to a SQL Server database as a one-off. I have an SSIS package that uses the MAX ID, which can be found on most of the tables, to work out which records are new, and just downloads or updates them. What I want to do is run separate checks on the tables to make sure that nothing has been missed during the download and everything is in sync. In an ideal world I would like to pass a date range, say a calendar week, into an SSIS package or T-SQL stored procedure, which would then check for any differences between the remote MySQL database tables and the local SQL Server tables. It does not currently have to do anything but identify issues; correcting them may come later, or changes would need to be made to the existing sync package. Hope this makes more sense."
Thanks P
To do this, you need to implement a Type 1 Slowly Changing Dimension type data flow in SSIS. There are a number of ways to do this, including a built in transformation aptly called the Slowly Changing Dimension transformation. Whilst this is easy to set up, it is a pain to maintain and it runs horrendously slowly.
There are numerous ways to set this up using other transformations or even SQL merge statements which are detailed here: https://bennyaustin.wordpress.com/2010/05/29/alternatives-to-ssis-scd-wizard-component/
I would recommend that you use Lookup transformations, as they perform better than the Slowly Changing Dimension transformation while offering better diagnostics and error handling than the better-performing SQL merge statement.
Before you do this you will need to add a Checksum or Hashbytes column to your SQL data for ease of comparison with the incoming MySQL data.
In short, calculate some sort of repeatable checksum as the data is downloaded into your SQL Server, then use this in an SSIS Lookup, matching on the row key, to check for changes. Where the checksum value is different for the same row it needs updating and where there is no matching row key in your SQL Data you need to insert the new row.
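The checksum comparison described above can be sketched outside SSIS as follows. This is only an illustration of the logic, not the SSIS Lookup itself: the function names, the `id` key column, and the choice of SHA-256 over sorted column values are all assumptions.

```python
import hashlib


def row_checksum(row, key="id"):
    """Repeatable SHA-256 over the non-key columns, in sorted column order."""
    payload = "|".join(f"{col}={row[col]!r}" for col in sorted(row) if col != key)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def classify_changes(source_rows, target_checksums, key="id"):
    """Split incoming rows into inserts (no matching key) and updates (checksum differs)."""
    inserts, updates = [], []
    for row in source_rows:
        stored = target_checksums.get(row[key])
        if stored is None:
            inserts.append(row)      # key not present locally: new row
        elif stored != row_checksum(row, key):
            updates.append(row)      # same key, different content: needs updating
    return inserts, updates
```

Here `target_checksums` stands for a mapping from row key to the checksum stored alongside the SQL Server data at download time; rows whose key matches and whose checksum is unchanged are left alone.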

PostgreSQL INSERT - auto-commit mode vs non auto-commit mode

I'm new to PostgreSQL and still learning a lot as I go. My company is using PostgreSQL and we are populating the database with tons of data. The data we collect is quite bulky in nature and is derived from certain types of video footage. For example, data related to about 15 minutes worth of video took me about 2 days to ingest into the database.
My problem is that I have data sets which relate to hours' worth of video, which would take weeks to ingest into the database. I was informed that part of the reason this is taking so long is that PostgreSQL has auto-commit set to true by default, and committing transactions takes a lot of time/resources. I was informed that I could turn auto-commit off, which would speed up the process tremendously. However, my concern is that multiple users are going to be populating this database. Suppose I change the program to commit after, say, every 10 records, and two people are attempting to populate the same table. The first person gets an id, and when he's on, say, record 7, the second person attempts to insert into the same table and is given the same id key. Once the first person commits his changes, the second person's id key will already be used, thus throwing an error.
So what is the best way to insert data into a PostgreSQL database when multiple people are ingesting data at the same time? Is there a way to work around issuing out the same id key to multiple people when inserting data in auto-commit mode?
If the IDs are coming from the serial type or a PostgreSQL sequence (which is used by the serial type), then you never have to worry about two users getting the same ID from the sequence. It simply isn't possible. The nextval() function only ever hands out a given ID a single time.
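Given that, committing in batches is safe as long as the insert leaves the ID column out and lets the database assign it. A minimal sketch, assuming a DB-API 2.0 connection with auto-commit off (for example one from psycopg2) and a hypothetical table `frames` with a serial `id` column and a `label` column:

```python
def insert_in_batches(conn, labels, batch_size=1000, placeholder="%s"):
    """Insert rows, committing once per batch instead of once per row."""
    # The id column is omitted on purpose so the database's sequence assigns it.
    sql = f"INSERT INTO frames (label) VALUES ({placeholder})"
    cur = conn.cursor()
    for start in range(0, len(labels), batch_size):
        batch = labels[start:start + batch_size]
        cur.executemany(sql, [(label,) for label in batch])
        conn.commit()  # one commit per batch
    cur.close()
```

The `placeholder` argument exists only because drivers differ in parameter style (`%s` for psycopg2, `?` for some others); the batch size is a tuning knob, not a recommendation.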