How can I add new fields to the schema? We are pulling in records from a SQL database and would like to add a few metadata columns in the pipeline process.
Also, is there a way to use the record duplicator plugin such that the child records have a distinct identifying field? Thanks.
We are executing pipelines one by one in a sequence manager and loading the data into on-premises SQL.
However, we would like to run all the copy activities from a single trigger, which means loading the data for all 15 tables into the on-premises DB. If tomorrow we have to add one more table, we should not have to change the pipeline; we would like dynamic table inserts. Kindly advise.
Thanks to all.
I reproduced the above scenario and got the below results.
Use two Lookup activities: one for your source database and one for the on-prem SQL database.
Here I have used Azure SQL Database for both source and target. You can use your own database with a SQL Server linked service in the lookup.
Use the below query in both lookups to get the list of tables.
SELECT TABLE_NAME
FROM information_schema.tables;
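If you only want user tables and not views or system objects, a narrower variant of the query could be used; the 'dbo' schema filter below is just an assumption for illustration.
-- Narrower variant: only base tables in the dbo schema (adjust to your setup).
SELECT TABLE_NAME
FROM information_schema.tables
WHERE TABLE_TYPE = 'BASE TABLE'
  AND TABLE_SCHEMA = 'dbo';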
Lookup list of source tables:
Use the same query in the second lookup.
List of target tables with another lookup:
Use a Filter activity to get the list of new tables that have not yet been copied to the target.
Items: @activity('sql source lookup').output.value
Condition: @not(contains(activity('on-prem lookup').output.value, item()))
Filter result:
Pass this Value array to a ForEach activity and use a Copy activity inside the ForEach.
Copy Source:
Copy sink:
You can see the new tables are copied to my target. Schedule this pipeline to run every day so that every new table gets copied to the target.
I created a schema and added 1 TB of data to a Druid datasource. Then the log file version was upgraded and two new columns were added. Now I want to add that data to the Druid schema, but I haven't been able to yet.
In order to add a new column to an existing datasource you need to follow the steps below:
Go to the Tasks menu in the Druid console.
From the listed datasources, go to the 'Actions' column at the end of the row for the datasource to which you want to add the column.
There will be a magnifying-glass-like button; click it to copy the existing payload.
Copy the payload into Notepad and add the 2 columns to the "dimensions" array.
Copy the updated payload and submit it via the Submit Supervisor button.
You'll find the new columns in the datasource, which you can verify by querying the datasource in the query section of Druid.
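For example, a quick check in the Druid query view could look like the sketch below; the datasource and column names are placeholders, not the real ones.
-- Minimal verification query with hypothetical names; adjust to your datasource.
SELECT "new_column_1", "new_column_2"
FROM "my_datasource"
LIMIT 10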
My flow is simple and I am just reading a raw file into a SQL table.
At times the raw file contains data corresponding to existing records. I do not want to insert a new record in that case and would only want to update the existing record in the SQL table. The challenge is, there is a 'record creation date' column which I initialize at the time of record creation. The update operation overwrites that column too. I just want to avoid overwriting that column, while updating the other columns from the information coming from the raw file.
So far I have no idea how to do that. Could someone make a recommendation?
I set the creation column to auto-populate with a default in the SQL database itself, and changed my flow to update only the remaining columns. The Talend job now never touches that column. Problem solved.
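For reference, a minimal sketch of that default, assuming SQL Server and hypothetical table/column names; the database fills in the creation date itself, so the job never has to write it.
-- Assumed names: dbo.my_table / record_creation_date; adjust to your schema.
ALTER TABLE dbo.my_table
    ADD CONSTRAINT DF_my_table_created DEFAULT (GETDATE()) FOR record_creation_date;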
Yet another reminder that simplification is underrated. :)
I have the following strategy for full-text search in my web app, which uses PostgreSQL for relational data storage. As an example I will take the Invoices table.
In these tables I have one additional field (ALTER TABLE invoices ADD COLUMN tsv tsvector) against which the full-text search query is run, like this: ... WHERE tsv @@ to_tsquery('query:*') ...
On every full-text search table I have set an update trigger that updates the tsv field on every change to the record. The trigger sets and concatenates data from different fields into the tsv field, sets the right weights, etc.
The data that gets set into the tsv field can also be relational data from other tables. For example, in the invoices table I have a client_id field, but since I want to search invoices by the client name as well, I also include clients.client_name in the invoices.tsv field.
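Roughly, the trigger looks like the sketch below; the column names (invoice_number, clients.client_name) are assumptions, not my actual schema.
-- Rebuilds invoices.tsv from the invoice row plus the related client name.
CREATE OR REPLACE FUNCTION invoices_tsv_update() RETURNS trigger AS $$
DECLARE
    v_client_name text;
BEGIN
    SELECT client_name INTO v_client_name FROM clients WHERE id = NEW.client_id;
    NEW.tsv :=
        setweight(to_tsvector('simple', coalesce(NEW.invoice_number, '')), 'A') ||
        setweight(to_tsvector('simple', coalesce(v_client_name, '')), 'B');
    RETURN NEW;
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER invoices_tsv_trigger
    BEFORE INSERT OR UPDATE ON invoices
    FOR EACH ROW EXECUTE PROCEDURE invoices_tsv_update();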
My question is: what is the best strategy to keep the relational data in the tsv vectors in sync? In the above scenario, if the client name changes, I would need to update the tsv field for every invoice...
Should I set up a cron job that does this every night? It could also be done with triggers, but since my database schema is very large, I am afraid it might get out of control if I have triggers all over the place.
If you add the client's name into the tsv field you will end up with more complexity. You might want to look into materialized views, as mentioned in this article. The trade-off is the speed of showing results versus the need to refresh the view periodically. As of Postgres 9.4 you can refresh a materialized view concurrently.
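A minimal sketch of the materialized-view approach, assuming the same hypothetical column names as above:
-- Precompute the search vector for each invoice, including the client name.
CREATE MATERIALIZED VIEW invoice_search AS
SELECT i.id,
       setweight(to_tsvector('simple', coalesce(i.invoice_number, '')), 'A') ||
       setweight(to_tsvector('simple', coalesce(c.client_name, '')), 'B') AS tsv
FROM invoices i
JOIN clients c ON c.id = i.client_id;

-- A unique index is required to refresh concurrently (Postgres 9.4+).
CREATE UNIQUE INDEX ON invoice_search (id);
REFRESH MATERIALIZED VIEW CONCURRENTLY invoice_search;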
Another thing you could do is create an update trigger on the clients table so that, when a client is updated, it also updates the data in the invoices table.
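For example, a sketch along these lines, assuming the invoices table already has a trigger (like the one described in the question) that rebuilds tsv on every UPDATE of an invoice row:
-- When a client's name changes, touch their invoices so the tsv trigger rebuilds them.
CREATE OR REPLACE FUNCTION clients_name_sync() RETURNS trigger AS $$
BEGIN
    -- A no-op update fires the invoices trigger, which recomputes tsv with the new name.
    UPDATE invoices SET client_id = client_id WHERE client_id = NEW.id;
    RETURN NEW;
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER clients_name_sync_trigger
    AFTER UPDATE OF client_name ON clients
    FOR EACH ROW EXECUTE PROCEDURE clients_name_sync();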
I need to add a new column to an existing table. I know I could create a new model and migrate the data over, but that wouldn't be ideal. Is there any other way?
Unfortunately, the only other way is to use an administration tool for your database to manually update the schema so that it matches the one expected by the new model.
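For instance, the statement the admin tool would ultimately run is an ALTER TABLE along these lines; the table and column names are hypothetical and the exact syntax depends on your database.
-- Adds a nullable column so existing rows remain valid; adjust name and type as needed.
ALTER TABLE my_table ADD COLUMN new_field VARCHAR(255);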