Copy SQL Table with pyodbc - pyspark

I'm trying to copy a SQL table from one database to another database in another server using Databricks. I have heard that one method of doing this is by using pyodbc because I need to read the data from a stored procedure; JDBC does not support reading from stored procedures. I want to use code similar to the one below:
import pyodbc

conn = pyodbc.connect(
    'DRIVER={ODBC Driver 17 for SQL Server};'
    'SERVER=mydatabase.database.azure.net;'
    'DATABASE=AdventureWorks;UID=jonnyFast;'
    'PWD=MyPassword'
)
conn.autocommit = True

# Example getting records back from a stored procedure (could also be a SELECT statement)
cursor = conn.cursor()
execsp = "EXEC GetConfig 'Dev'"
cursor.execute(execsp)

# Get all records
rc = cursor.fetchall()
The question is, once I get the data into the rc variable using pyodbc, should I bother moving the data into a Databricks Dataframe, or should I just push the data out to my destination?

You may not need to convert the data into a DataFrame; you can simply write it straight to the destination. It really depends on the amount of data you're trying to push - if it's a lot, then creating the DataFrame may help because the write will be parallelized (although that may overload the destination server). If it's not much data, just write it to the destination directly.
Also, in this case your worker nodes aren't really used, because all of the processing happens on the driver node - you may consider using a so-called Single Node cluster, but you will need to size the driver node according to your result set.
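If you do go the DataFrame route, the sketch below shows roughly what it could look like. It assumes a Spark session is already available (as on Databricks) and that the target server is reachable over JDBC; the target URL, table, and credentials are placeholders, not details from the question.
# Hypothetical sketch: turn the pyodbc result set into a Spark DataFrame
# and push it to the destination over JDBC.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

columns = [col[0] for col in cursor.description]  # column names from the pyodbc cursor
rows = [tuple(r) for r in rc]                     # pyodbc Row objects -> plain tuples

df = spark.createDataFrame(rows, schema=columns)

(df.write
   .format("jdbc")
   .option("url", "jdbc:sqlserver://target-server.database.windows.net:1433;databaseName=TargetDb")
   .option("dbtable", "dbo.ConfigCopy")
   .option("user", "targetUser")
   .option("password", "targetPassword")
   .mode("append")
   .save())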
P.S. You can also look at the alternatives listed in this answer.

Related

Tables get locked when we try to import table data from a MySQL dump file

I am trying to import a dump file containing the table c_emailnotificationtemplate and its data, which was generated by this command:
mysqldump --host=10.88.129.238 --user=root --password client_1002
c_emailnotificationtemplate --single-transaction
--set-gtid-purged=OFF > c_emailnotificationtemplate.sql
But when I try to import this c_emailnotificationtemplate.sql into my database, the database gets locked: I am not able to perform any queries, and the data does not get inserted into the table.
I tried adding --skip-lock-tables to the command, but it didn't help.
Is there any way I can avoid the locking that happens when I import the SQL file?
Some details:
Database: client_1002
Table name: c_emailnotificationtemplate
DB instance: GCP Cloud SQL
When importing data into a Cloud SQL instance, it is likely that you will encounter long import times, depending on the size of the file you are trying to import.
It's possible for queries to lock the MySQL database, causing all subsequent queries to block or time out. Connect to the database and try to execute this query:
SHOW PROCESSLIST
The first item in the list may be the one holding the lock, which the subsequent items are waiting on. Try to check whether there are any issues with redundancies or data consistency and eliminate those.
Also check the status logs to understand which table data or item is causing the issue, and try fixing that:
SHOW ENGINE INNODB STATUS\G
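If it helps to script these checks, here is a rough diagnostic sketch - it assumes the pymysql package and direct network access to the instance (the password is left as a placeholder). It runs the two statements above and prints the sessions that are stuck waiting on locks:
import pymysql

conn = pymysql.connect(host="10.88.129.238", user="root",
                       password="...", database="client_1002")
try:
    with conn.cursor() as cur:
        cur.execute("SHOW PROCESSLIST")
        for pid, user, host, db, cmd, time_s, state, info in cur.fetchall():
            # Sessions in a "Waiting for ... lock" state are the ones blocked
            # by the import.
            if state and "lock" in state.lower():
                print(f"blocked: id={pid} time={time_s}s state={state} query={info}")

        cur.execute("SHOW ENGINE INNODB STATUS")
        # The status report is a single large text column; its TRANSACTIONS
        # section shows which transaction is holding the locks.
        print(cur.fetchone()[2])
finally:
    conn.close()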
To avoid the locks and stuck operations, and to decrease the time each operation takes, you can use the Cloud SQL import or export functionality with smaller batches of data. Another way to reduce the time the import takes is to reduce the number of connections the database receives while you're importing data into your instance.
Check the Best Practices for SQL Import and Export, and also the import/export documentation, for reference.

How to pass variable in Load command from IBM Object Storage file to Cloud DB2

I am using the command below to load an Object Storage file into the DB2 table NLU_TEMP_2.
CALL SYSPROC.ADMIN_CMD('load from "S3::s3.jp-tok.objectstorage.softlayer.net::
<s3-access-key-id>::<s3-secret-access-key>::nlu-test::practice_nlu.csv"
OF DEL modified by codepage=1208 coldel0x09 method P (2) WARNINGCOUNT 1000
MESSAGES ON SERVER INSERT into DASH12811.NLU_TEMP_2(nlu)');
The command above inserts the 2nd column of the Object Storage file into the nlu column of DASH12811.NLU_TEMP_2.
I want to insert a request_id from a variable as an additional column, so the insert targets DASH12811.NLU_TEMP_2(request_id, nlu).
I read in an article that statement concentrator literals can be used to dynamically pass a value. Please let me know if anyone has an idea of how to use this.
Note: I would be using this query on DB2, not DB2 Warehouse. External tables won't work in DB2.
LOAD does not have any ability to include extra values that are not part of the load file. You can try to work with columns that are generated by default in Db2, but it is not a good solution.
Really, you need to wait until Db2 on Cloud supports external tables.

IBM DB2 and IBM IMS Change Data Capture Capabilities

I'd like to understand whether CDC-enabled IBM IMS segments and IBM DB2 table sources would be able to provide both the before and after snapshot change values (like the Oracle :OLD and :NEW values in a trigger) so that both could be used for further processing.
Note:
We are supposed to retrieve these values through Informatica PowerExchange, process them, and push them to targets.
As of now, we need to know whether we would be able to retrieve both the before-snapshot and after-snapshot values from IBM DB2 and IBM IMS (:OLD and :NEW as in Oracle triggers - not an exactly comparable example, but mentioned just to aid understanding).
Any help is much appreciated, Thanks.
I don't believe CDC captures the before data in the change messages that it compiles from the DBMS log data. Its main purpose is to issue the minimum number of commands needed to replicate the data from one database to another. You'll want to take a snapshot of your replica database prior to processing the change messages if you want to preserve the state of the data so that you can query it.
Alternatively for Db2, it's probably easier to work with the temporal tables feature added in Db2 10 as that allows you to define what changes should drive a snapshot. You can then access the temporal data using a temporal SQL query.
SELECT ... FROM ... <period specification>
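As a rough illustration of that temporal route (this is not from the original answer; the table, its columns, and the connection details are hypothetical, and the ibm_db driver is assumed):
import ibm_db

# Hypothetical sketch: read the rows of a system-period temporal table as
# they were at a given point in time.
conn = ibm_db.connect(
    "DATABASE=SAMPLE;HOSTNAME=db2host;PORT=50000;PROTOCOL=TCPIP;UID=user;PWD=pwd;", "", "")

sql = ("SELECT claim_id, claim "
       "FROM claims FOR SYSTEM_TIME AS OF TIMESTAMP('2023-01-01-00.00.00')")

stmt = ibm_db.exec_immediate(conn, sql)
row = ibm_db.fetch_assoc(stmt)
while row:
    print(row)  # each row reflects the state of the data at that timestamp
    row = ibm_db.fetch_assoc(stmt)
ibm_db.close(conn)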
Example trigger with OLD and NEW referencing:
CREATE TRIGGER danny117
NO CASCADE BEFORE UPDATE ON mylib.myfile
REFERENCING NEW AS N OLD AS O
FOR EACH ROW
-- don't let the claim change and force upper case
-- just do something automatically on update blah...
BEGIN ATOMIC
  SET N.claim = UCASE(O.claim);
END
With respect to PowerExchange 9.1.0 & 9.6:
Before-snapshot data can't be obtained through PowerExchange for a DB2 database. I recently worked on a migration project and thought that, like Oracle CDC which uses SCN numbers, there should be something for DB2 to start the logger from any desired point. But to my surprise, Informatica global support confirmed that before-snapshot data can't be captured by PowerExchange.
They talked about materializing and de-materializing targets, which was outside my knowledge at the time; later I found out they meant exporting and importing the history data.
Even if you have a table with CDC enabled, you can't capture the before-snapshot data from PWX.
The change data is captured from the DB2 logs, which carry a marking for each operation (U/I/D), and that's enough for PowerExchange to proceed.

tELTPostgresql* usage issue

I'm trying to use tELTPostgresqlOutput with a PostgreSQL 9.3 server, but it does not work.
With a simple tPostgresqlInput and a tLogRow it works perfectly.
This is not how the ELT components are meant to be used. They should be used for in-database transformations on the server, such as creating a star-schema table from multiple tables in the same database. This lets the database do the transformation and avoids reading the data into memory in your job. It's particularly useful when dealing with large datasets that can't be broken down for the transformation.
If you want to transfer data from one database server/vendor to another, you will need to use ETL components (pretty much anything not explicitly marked ELT) to read data out of the source database and write it back to the target database.
In this case you should use a tMSSQLInput component to read in the data you need, a tMap to transform the data the way you want, and a tPostgresqlOutput component to write the data out to the Postgres database.

MongoDB into AWS Redshift

We've got a pretty big MongoDB instance with sharded collections. It's reached a point where it's becoming too expensive to rely on MongoDB's query capabilities (including the aggregation framework) for insight into the data.
I've looked around for options to make the data available and easier to consume, and have settled on two promising options:
AWS Redshift
Hadoop + Hive
We want to be able to use a SQL-like syntax to analyze our data, and we want close to real-time access to it (a few minutes of latency is fine; we just don't want to wait for the whole of MongoDB to sync overnight).
As far as I can gather, for option 2, one can use this https://github.com/mongodb/mongo-hadoop to move data over from MongoDB to a Hadoop cluster.
I've looked high and low, but I'm struggling to find a similar solution for getting MongoDB into AWS Redshift. From looking at Amazon articles, it seems like the correct way to go about it is to use AWS Kinesis to get the data into Redshift. That said, I can't find any example of someone that did something similar, and I can't find any libraries or connectors to move data from MongoDB into a Kinesis stream. At least nothing that looks promising.
Has anyone done something like this?
I ended up coding up our own migrator using NodeJS.
I got a bit irritated with answers explaining what Redshift and MongoDB are, so I decided to take the time to share what I had to do in the end.
Timestamped data
Basically we ensure that all our MongoDB collections that we want to be migrated to tables in redshift are timestamped, and indexed according to that timestamp.
Plugins returning cursors
We then code up a plugin for each migration that we want to do from a mongo collection to a redshift table. Each plugin returns a cursor, which takes the last migrated date into account (passed to it from the migrator engine), and only returns the data that has changed since the last successful migration for that plugin.
How the cursors are used
The migrator engine then uses this cursor, and loops through each record.
It calls back to the plugin for each record, to transform the document into an array, which the migrator then uses to create a delimited line which it streams to a file on disk. We use tabs to delimit this file, as our data contained a lot of commas and pipes.
Delimited exports from S3 into a table on redshift
The migrator then uploads the delimited file onto S3, and runs the redshift copy command to load the file from S3 into a temp table, using the plugin configuration to get the name and a convention to denote it as a temporary table.
So for example, if I had a plugin configured with a table name of employees, it would create a temp table with the name of temp_employees.
Now we've got data in this temp table. And the records in this temp table get their ids from the originating MongoDB collection. This allows us to then run a delete against the target table, in our example, the employees table, where the id is present in the temp table. If any of the tables don't exist, it gets created on the fly, based on a schema provided by the plugin. And so we get to insert all the records from the temp table into the target table. This caters for both new records and updated records. We only do soft deletes on our data, so it'll be updated with an is_deleted flag in redshift.
Once this whole process is done, the migrator engine stores a timestamp for the plugin in a Redshift table, in order to keep track of when the migration last ran successfully for it. This value is then passed to the plugin the next time the engine decides it should migrate data, allowing the plugin to use the timestamp in the cursor it needs to provide to the engine.
So in summary, each plugin/migration provides the following to the engine:
A cursor, which optionally uses the last migrated date passed to it from the engine, in order to ensure that only deltas are moved across.
A transform function, which the engine uses to turn each document in the cursor into a delimited string, which gets appended to an export file.
A schema file: a SQL file containing the schema for the table in Redshift.
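Purely to illustrate those steps (the original migrator was written in NodeJS), here is a compressed Python sketch; pymongo, boto3, and psycopg2 are assumed, and every collection, bucket, table, and column name is a hypothetical placeholder:
import csv
import io

import boto3
import psycopg2
from pymongo import MongoClient

def migrate(last_run_ts):
    mongo = MongoClient("mongodb://mongo-host")["appdb"]

    # 1. Cursor over only the documents changed since the last successful run.
    cursor = mongo["employees"].find({"updated_at": {"$gt": last_run_ts}})

    # 2. Transform each document into a tab-delimited line.
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t")
    for doc in cursor:
        writer.writerow([doc["_id"], doc["name"], doc.get("is_deleted", False)])

    # 3. Upload the delimited export to S3.
    boto3.client("s3").put_object(Bucket="migration-bucket",
                                  Key="exports/employees.tsv",
                                  Body=buf.getvalue())

    # 4. COPY into a temp table, then delete + insert into the target table.
    with psycopg2.connect("dbname=warehouse host=redshift-host user=u password=p") as rs:
        with rs.cursor() as cur:
            cur.execute("CREATE TEMP TABLE temp_employees (LIKE employees)")
            cur.execute("COPY temp_employees "
                        "FROM 's3://migration-bucket/exports/employees.tsv' "
                        "IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy' "
                        "DELIMITER '\\t'")
            cur.execute("DELETE FROM employees USING temp_employees "
                        "WHERE employees.id = temp_employees.id")
            cur.execute("INSERT INTO employees SELECT * FROM temp_employees")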
Redshift is a data warehousing product and MongoDB is a NoSQL database. Clearly, they are not replacements for each other; they can co-exist and serve different purposes. The question is how to save and update records in both places.
You can move all of the MongoDB data to Redshift as a one-time activity.
Redshift is not a good fit for real-time writes. For a near-real-time sync to Redshift, you should modify the program that writes into MongoDB so that it also writes to S3 locations. Movement from the S3 locations to Redshift can then be done at a regular interval.
Since MongoDB is a document storage engine, Apache Solr and Elasticsearch can be considered as possible replacements, but they do not support SQL-style querying; they use a different filtering mechanism. For example, with Solr you might need to use the DisMax query parser.
In the cloud, Amazon CloudSearch or Azure Search would also be compelling options to try.
You can now use AWS DMS to migrate data to Redshift easily; it can also replicate ongoing changes in near real time.