I have set up Google DataStream to replicate data from PostgreSQL using CDC. It works fine, but I have noticed that none of the character varying columns are being replicated. I can see them in the source schema, but the destination table that was created doesn't have those columns at all.
I wanted to check whether Large Object replication is supported by AWS DMS when the source and destination DBs are PostgreSQL.
I just used pglogical to replicate a DB which has Large Objects (referenced via OIDs, etc.), and the target DB does not have the LOs.
When I query a table on the destination which uses an OID column:
select id, lo_get(json) from table_1 where id=998877;
ERROR: large object 6698726 does not exist
The json column is of the oid datatype.
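For context, this is roughly how I confirmed on the pglogical target that the OID values point at nothing; a sketch only, assuming psycopg2 and the table/column names from the query above, with placeholder connection details:

```python
# Rough sketch: check whether the OID values stored in table_1.json still have
# a backing large object on a given database. Connection details are placeholders.
import psycopg2

def rows_with_missing_large_objects(dsn):
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        # pg_largeobject_metadata has one row per large object that exists
        cur.execute("""
            SELECT t.id, t.json AS lo_oid
            FROM table_1 t
            LEFT JOIN pg_largeobject_metadata m ON m.oid = t.json
            WHERE m.oid IS NULL
        """)
        return cur.fetchall()

# On the source I would expect this to return nothing; on the pglogical target
# it returns the rows whose OID values were copied without the underlying LOs.
print(rows_with_missing_large_objects("host=target-host dbname=mydb user=me"))
```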
If AWS DMS takes care of it, I will start using it.
Thanks
I have an Azure Data Factory copy activity which loads parquet files into Azure Synapse. The sink is configured as shown below:
After the data load completed, I had a staging table structure like this:
Then I create a temp table based on the staging one, and this worked fine until today, when newly created tables suddenly received the nvarchar(max) type instead of nvarchar(4000):
Temp table creation now fails with an obvious error:
Column 'currency_abbreviation' has a data type that cannot participate in a columnstore index.
Why has the auto-created table definition changed, and how can I get back to the "normal" behavior without nvarchar(max) columns?
I've got exactly the same problem! I'm using a data factory to read CSV files into my Azure data warehouse, and this used to result in nvarchar(4000) columns, but now they are all nvarchar(max). I also get the error
Column xxx has a data type that cannot participate in a columnstore index.
My solution for now is to change my SQL code and use a CAST to change the formats, but there must be a setting in the data factory to get the former results back...
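For reference, my workaround amounts to rebuilding the table from the auto-created staging table with explicit nvarchar(4000) casts, roughly like this; a sketch only, where the connection string and the table/column names are placeholders, and running it from Python via pyodbc is just my choice:

```python
# Sketch of the workaround: rebuild the table from the auto-created staging
# table with explicit CASTs so no nvarchar(max) column ends up in the
# (default clustered columnstore) table. All names are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myworkspace.sql.azuresynapse.net;DATABASE=mydw;UID=loader;PWD=...",
    autocommit=True,  # CTAS is DDL; keep it out of an explicit transaction
)
conn.cursor().execute("""
    CREATE TABLE dbo.currency_tmp
    WITH (DISTRIBUTION = ROUND_ROBIN)
    AS
    SELECT CAST(currency_abbreviation AS nvarchar(4000)) AS currency_abbreviation,
           CAST(currency_name         AS nvarchar(4000)) AS currency_name
    FROM stg.currency;
""")
conn.close()
```

I believe creating the table as a heap instead of the default clustered columnstore would also sidestep the error, but the casts keep the original nvarchar(4000) layout.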
I have a requirement to move data from S3 to Redshift. Currently I am using Glue for the work.
Current Requirement:
Compare the primary key of each record in the Redshift table with the incoming file; if a match is found, close the old record's end date (update it from the high date to the current date) and insert the new one.
If no primary key match is found, insert the new record.
Implementation:
I have implemented it in Glue using pyspark with the following steps:
Created dataframes which will cover three scenarios:
If a match is found, update the existing record's end date to the current date.
Insert the new record into the Redshift table where a primary key match is found.
Insert the new record into the Redshift table where no primary key match is found.
Finally, union all three dataframes into one and write the result to the Redshift table.
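In sketch form, the three scenarios look roughly like this in PySpark; the `pk`/`end_date` column names, the high-date literal and the S3 paths are placeholders for my actual schema, and `current`/`incoming` stand for the existing Redshift table and the new file:

```python
# Condensed sketch of the three scenarios described above. All names/paths
# are placeholders; in the real Glue job `current` comes from the Redshift
# table and `incoming` from the new S3 file.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
current = spark.read.parquet("s3://my-bucket/redshift_unload/dim_table/")
incoming = spark.read.parquet("s3://my-bucket/incoming/dim_table/")

HIGH_DATE = F.lit("9999-12-31").cast("date")
matched_keys = incoming.select("pk").join(current.select("pk"), "pk", "left_semi")

# 1) existing rows whose key matches: close them by setting end_date = today
closed_old = (current.join(matched_keys, "pk", "left_semi")
                     .withColumn("end_date", F.current_date()))

# 2) incoming rows whose key matches: insert as the new active version
new_matched = (incoming.join(matched_keys, "pk", "left_semi")
                       .withColumn("end_date", HIGH_DATE))

# 3) incoming rows with no match at all: plain inserts
new_unmatched = (incoming.join(matched_keys, "pk", "left_anti")
                         .withColumn("end_date", HIGH_DATE))

result = closed_old.unionByName(new_matched).unionByName(new_unmatched)
# `result` is then appended to the Redshift table, which is why the original
# row with the high date is still sitting there next to its closed copy.
```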
With this approach, both the old record (which has the high date value) and the new record (which was updated with the current date value) will be present.
Is there a way to delete the old record with the high date value using PySpark? Please advise.
We have successfully implemented the desired functionality using AWS RDS [PostgreSQL] as the database service and Glue as the ETL service. My suggestion: instead of computing the delta in Spark dataframes, it is a far easier and more elegant solution to create stored procedures and call them from your PySpark Glue job.
[for example: S3 bucket -> staging table -> target table]
In addition, if your logic executes in less than 10 minutes, I would suggest using a Python shell job with external libraries such as psycopg2 / SQLAlchemy for the DB operations.
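As a rough illustration of that pattern in a Glue Python shell job; all names are made up, and the stored procedure is assumed to contain the staging-to-target compare/close/insert logic:

```python
# Sketch: the Python shell job only loads the staging table and then hands the
# compare/close/insert logic to a stored procedure on the database.
# Host, credentials and the procedure name are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-rds-endpoint", dbname="mydb", user="etl_user", password="..."
)
with conn, conn.cursor() as cur:
    # the staging table is assumed to be loaded already (e.g. from the S3 file)
    cur.execute("CALL merge_staging_into_target();")  # PostgreSQL 11+ procedure
conn.close()
```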
I have created a source JDBC connector for a table that has no primary key (the table has columns a, b, c, d, e) and is part of an external database. I have a replica table in my own database, where I created a primary key on columns a, b and c, since those three combined are unique and can serve as the primary key. I am trying to create an upsert sink connector and set pk.fields to a,b,c, but when I launch the sink connector it goes into a degraded state, and I cannot see any useful error in connect.log either. I have set pk.mode to record_value and pk.fields to a,b,c. Can someone please let me know if anything is missing in the setup?
Note: it works if I change the mode to insert and remove pk.fields; pk.mode is still record_value.
Update:
Hi Robin, the source table, AccountDetails, has the columns accNumber, bankABA, bankOrigAccNumber, SpendingLimit and ExpirationDate, and there is no primary key on this table. The target table, AccountInformation, has the same columns but a primary key of (accNumber, bankABA, bankOrigAccNumber), since we need a primary key at the target for use by a different application. I have created a source connector, which is working fine pulling the data once every 24 hours. I am trying to create a sink connector with the mode set to upsert to push the data from the topic to the table, with the primary key mode as record_value and the primary key fields as "accNumber,bankABA,bankOrigAccNumber". When I launch the sink, it goes into a degraded state.
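For comparison, the kind of sink config I would expect for this setup looks roughly like the following; this is only a sketch, the connector class is the Confluent JDBC sink, and the topic name, connection details and auto.create choice are placeholder assumptions:

```python
# Sketch of an upsert JDBC sink with a composite key taken from the record
# value, submitted via the Connect REST API. Hosts, credentials and the topic
# name are placeholders.
import json
import requests

connector = {
    "name": "accountinformation-sink",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
        "topics": "AccountDetails",
        "connection.url": "jdbc:postgresql://target-db:5432/mydb",
        "connection.user": "sink_user",
        "connection.password": "...",
        "table.name.format": "AccountInformation",
        "insert.mode": "upsert",
        "pk.mode": "record_value",
        # the three key fields must exist in the record value with these exact names
        "pk.fields": "accNumber,bankABA,bankOrigAccNumber",
        "auto.create": "false",
    },
}

resp = requests.post(
    "http://connect-host:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
print(resp.status_code, resp.text)
```

When a task fails and the connector shows as degraded, `GET /connectors/<name>/status` usually includes the underlying exception trace, which tends to be more specific than what ends up in connect.log.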
I have a scenario and would like to get an expert opinion on it.
I have to load a Hive table, in partitions, from a relational DB via Spark (Python). I cannot pre-create the Hive table, as I am not sure how many columns there are in the source and they might change in the future, so I have to fetch the data using select * from tablename.
However, I am sure of the partition column and know that it will not change. This column is of the "date" datatype in the source DB.
I am using saveAsTable with the partitionBy option, and the folders are created correctly per the partition column. The Hive table is also getting created.
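For context, the load currently looks roughly like this; a sketch only, where the JDBC URL, credentials, table name and partition column name are placeholders for my actual source:

```python
# Rough sketch of the current load: select * from the source over JDBC and
# write a partitioned Hive table. URL, credentials, table and column names
# are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

df = (spark.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@//source-host:1521/ORCL")
      .option("dbtable", "tablename")   # effectively select * from tablename
      .option("user", "etl_user")
      .option("password", "...")
      .load())

# load_date is the partition column; it comes through with the date type,
# which is what Hive/Impala later complain about
(df.write
   .mode("overwrite")
   .partitionBy("load_date")
   .saveAsTable("mydb.target_table"))
```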
The issue I am facing is that the partition column is of the "date" data type, which is not supported in Hive for partitions. Because of this I am unable to read the data via Hive or Impala queries, as they report that date is not supported as a partition column.
Please note that I cannot typecast the column at the time of issuing the select statement, as I have to do select * from tablename and not select a, b, cast(c as varchar) from table.