How to remove duplication with Talend Data Preparation?

I would like to remove duplicates with Talend Data Preparation. I have a column named HOURS, and I want to add those hours together while removing the duplicated email and name values. Here is an example of my table:
As you can see, many rows have the same user_name and email but different hours. I want to add the hours together per user_name and email, and at the same time remove the duplicate user_name and email rows.

(I am not really into Data Prep, so perhaps there is an inside solution that I don't know of).
I don't think you can do a GROUP BY with a SUM operation in Talend Data Preparation, as the tool only corrects individual lines of data and can't perform aggregation operations.
You'll be able to sum your data with a tAggregateRow component in Talend Data Integration, after exporting your corrected data from Data Prep.
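For illustration, here is a minimal sketch of the intended aggregation in pandas (the column names user_name, email and HOURS come from the question; in Talend the same grouping and SUM would be configured in tAggregateRow):

```python
import pandas as pd

# Hypothetical sample matching the question: duplicated user_name/email, differing HOURS.
df = pd.DataFrame({
    "user_name": ["alice", "alice", "bob"],
    "email": ["alice@example.com", "alice@example.com", "bob@example.com"],
    "HOURS": [2.5, 3.0, 4.0],
})

# Equivalent of tAggregateRow: group by user_name + email and SUM the HOURS,
# which removes the duplicated user_name/email rows in the same step.
deduped = df.groupby(["user_name", "email"], as_index=False)["HOURS"].sum()
print(deduped)
```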

Related

REST V2 ingest of JSON events to an S3 bucket (avoid duplicates)

I would like to ask you for help.
I am trying to ingest events in JSON from a source using REST API (REST V2 connector) in a raw format.
The source allows me to pass the parameters "take" and "days" in the headers. The "take" parameter specifies how many records to take, and the "days" parameter specifies how old the requested events may be.
The job I have created works fine for data ingestion to a database, where I map fields to columns in the database.
I have tried a million things, and these are the two problems I am currently facing when I try to ingest the events into a bucket or database in raw format:
For mass ingestion: there are no incremental job options available (for a REST V2 source), so I am getting duplicate records and the ingestion never stops.
Is there a way to stop mass ingestion and avoid duplicates once all records are ingested?
For data integration to a DB: each record/event I attempt to ingest has multiple fields. Since I DON'T want to split the records apart (I WANT the entire documents in JSON), I pack everything into an array. The problem is that when I request ten records (or N records), all of them end up in a single row in the table.
Here is what I mean:
TABLE DB:
ROW1: "array packed" JSON1, JSON2 .... JSNO_N...JSON10 "/array packed"
ROW2: empty
This is what I need (each record in a separate row in raw format)
TABLE DB:
ROW1: JSON1
ROW2: JSON2
ROWN: JSON_N
ROW10: JSON_10
I was also trying to accomplish this using a lambda function. The problem with lambda is that I will have to make sure there is no duplicates (Informatica has this cool option "upsert" that allows me to avoid duplicates).
At the end of the day, I don't care if this will be accomplished using data integration, mass ingestion, or lambda and if ingest will be directly into DB or S3. For now, I am trying to find a working solution.
If somebody can come up with some ideas, I will appreciate the help.
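Not a tool-specific answer, but a rough sketch of the Lambda route, assuming each event carries a unique "id" field and using a made-up bucket name: keying every S3 object on that id makes the load idempotent, so re-ingesting the same events overwrites the same objects instead of creating duplicates, and each record lands as its own object rather than one packed array.

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "my-raw-events-bucket"  # hypothetical bucket name

def handler(event, context):
    # event["body"] is assumed to hold the raw JSON array returned by the REST V2 source.
    records = json.loads(event["body"])

    for record in records:
        # Key each object on the event's unique id (assumed field), so re-processing
        # the same event overwrites the same key: no duplicates, one record per object.
        key = f"raw/{record['id']}.json"
        s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(record))

    return {"ingested": len(records)}
```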

SAP incremental data load in Azure Data Factory

I'm trying to implement an Extractor pipeline in ADF, with several Copy Data activities (SAP ERP Table sources). To save some processing time, I'd like to have some deltas (incremental load). What's the best way to implement this?
What I'm trying at the moment is just to use the "RFC table options" in each Copy Data activity. However, this seems to be quite limited (only very simple queries are allowed). Also, each SAP ERP table requires a different query. I found 3 different situations regarding table field formats (see the sketch after this list):
Timestamp in milliseconds (e.g. COVP);
Timestamp in YYYYMMDDHHMMSS (e.g. FAGLFLEXA);
Last change date and last change time, in separate fields (e.g. CATSDB).
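For illustration only, here is a small sketch of building one delta filter string per format from a cutoff datetime; the column names and the exact selection syntax accepted by the RFC table options field are assumptions, not verified against these tables:

```python
from datetime import datetime

def delta_filters(cutoff: datetime) -> dict:
    """One selection string per field format. Column names are placeholders, and the
    selection syntax accepted by the RFC table options field is an assumption here."""
    ts = cutoff.strftime("%Y%m%d%H%M%S")
    return {
        # 1) Timestamp with millisecond precision (COVP-like); assumed here to be
        #    a YYYYMMDDHHMMSS value followed by three millisecond digits.
        "millisecond_timestamp": f"CHANGE_TS_MS GE '{ts}000'",
        # 2) Timestamp stored as YYYYMMDDHHMMSS (FAGLFLEXA-like).
        "yyyymmddhhmmss_timestamp": f"CHANGE_TS GE '{ts}'",
        # 3) Separate last-change date and time fields (CATSDB-like):
        #    filter on the date here and refine on the time downstream if needed.
        "separate_date_and_time": f"CHANGE_DATE GE '{cutoff.strftime('%Y%m%d')}'",
    }

print(delta_filters(datetime(2021, 1, 1)))
```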
Has anyone ever tried this? What would you advise?
Thanks!

ELT pipeline for Mongo

I am trying to get my data into Amazon Redshift using Fivetran, but have some general questions about the ELT/ETL process. My source database is Mongo, and I want to perform deep analysis on the data using a third-party BI tool like Looker, which integrates with SQL. I am new to the ELT/ETL process and was wondering whether it would look like this:
Extract data from Mongo (handled by Fivetran)
Load into Amazon Redshift (handled by Fivetran)
Perform Transformation - This is where my biggest knowledge gap is. I obviously have to convert objects and arrays into compatible SQL types. I can transform all objects by extracting them into columns, and transform all arrays into separate tables. Is this the right idea? Should I design a MySQL schema and write all the transformations according to that schema design?
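For illustration, this is the kind of flattening I mean (a tiny pandas sketch, not tied to any particular tool; the document shape is made up):

```python
import pandas as pd

# A tiny Mongo-style document: a nested object plus an array.
docs = [
    {"_id": 1,
     "customer": {"name": "Ann", "city": "Oslo"},
     "items": [{"sku": "A1", "qty": 2}, {"sku": "B2", "qty": 1}]},
]

# Objects -> columns on the parent table.
customers = pd.json_normalize(docs, sep="_")[["_id", "customer_name", "customer_city"]]

# Arrays -> a separate child table keyed back to the parent.
items = pd.json_normalize(docs, record_path="items", meta=["_id"])

print(customers)
print(items)
```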
As you state, Fivetran will load your data into Redshift, putting individual fields in columns where it can and putting everything else into varchar columns as JSON. At that point you basically have a data lake: all your data in an analytical platform, still in source format, and available for you to do whatever you want with it.
Initially, if you don't know much about your data and just want to investigate it, you can probably leave it as it is. Redshift has SQL functions that allow you to query the elements of a JSON structure so there is no need to build additional tables and more ETL just to allow you to investigate your data - especially as these tables may get thrown away once you understand your data and decide what you want to do with it.
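For example, here is a minimal sketch of querying JSON elements straight out of a varchar column with Redshift's json_extract_path_text, run through a standard Python DB driver; the cluster, table and column names are made up for illustration:

```python
import psycopg2  # any Redshift-compatible DB-API driver works

# Hypothetical table "raw_orders" with a varchar column "payload" holding Fivetran's JSON.
SQL = """
SELECT
    json_extract_path_text(payload, 'customer', 'email') AS customer_email,
    json_extract_path_text(payload, 'total')             AS order_total
FROM raw_orders
LIMIT 10;
"""

conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                        dbname="analytics", user="analyst", password="...", port=5439)
with conn, conn.cursor() as cur:
    cur.execute(SQL)
    for row in cur.fetchall():
        print(row)
```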
If you have proper reporting requirements, then that is the point where you start to design a schema that will support those requirements (I'm not sure why you suggested a MySQL schema, as MySQL is a database product rather than a schema design). Traditionally an analytical schema would be designed as a Kimball dimensional model (facts and dimensions), but the type of schema you decide on will depend on:
The database platform you are using (in your case, Redshift) and the type of structures it works best with e.g. star schema or "flat" tables
The BI tool you are using and how it expects to have data presented to it
For example (and I'm not saying this is a real world example), if Redshift works ok with star schemas but better with flat tables and Looker has to have a star schema then it probably makes more sense to build star schemas in Redshift as this is a single modelling exercise - rather than model flat tables in Redshift and then have to model star schemas in Looker.
Hope this helps?
It depends on how you need the final stage of your data analysis presented, and what the purpose of your data analysis is. As stated by NickW, assuming you need to integrate your data into a BI tool the schema should be adapted according to the tool's data format requirements.
A MongoDB ETL/ELT process might look like this:
Select Connection: Select the configured connection.
Collection Name: Choose the collection using the [database].[collection] format.
If you are pulling data from your authentication database, only the [collection] name is needed. Example: sample.products.
Extract Method:
All: pull all of the data in the collection.
Incremental: pull data by incremental value.
Incremental Attributes: Set the name of the incremental attribute to run by, e.g. UpdateTime.
Incremental Type: Timestamp | Epoch. Choose the type of incremental attribute.
Choose Range:
In Timestamp, choose your date increment range to run by.
In Epoch, choose the value increment range to run by.
If no End Date/Value is entered, the default is the last date/value in the table.
The increment will be managed automatically.
Include End Value: Whether the increment process should include the end value or not.
Interval Chunks: The chunks the data will be pulled in. Split the data by minutes, hours, days, months or years.
Filter: Filter the data to pull. The filter format is MongoDB Extended JSON.
Limit: Limit the rows to pull.
Auto Mapping: You can choose the set of columns you want to bring, add a new column or leave it as it is.
Converting an Entire Key's Data to a STRING
In cases where the data is not what the target expects, such as key names starting with numbers or flexible and inconsistent object data, you can convert attributes to STRING format by setting their data types to STRING in the mapping section.
The conversion applies to any value under that key.
Arrays and objects will be converted to JSON strings.
Use cases:
Here are a few filtering examples:
{"account":{"$oid":"1234567890abcde"}, "datasource": "google", "is_deleted": {"$ne": true}}
date(MODIFY_DATE_START_COLUMN) >=date("2020-08-01")
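As a concrete sketch of the incremental pull described above (the connection string, collection, incremental attribute and date range are made-up examples; chunking and loading into the target are omitted):

```python
from datetime import datetime
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical connection string
collection = client["sample"]["products"]          # [database].[collection] from the settings above

# Incremental extract: pull only documents whose UpdateTime falls in the chosen range,
# optionally combined with a business filter like the examples above.
start, end = datetime(2020, 8, 1), datetime(2020, 9, 1)
query = {
    "UpdateTime": {"$gte": start, "$lt": end},   # incremental attribute + range
    "is_deleted": {"$ne": True},                 # extra filter, as in the example
}

for doc in collection.find(query):
    # Each document would be handed to the load step (e.g. written to the target) here.
    print(doc["_id"])
```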

SSIS or T-SQL for SQL/MySQL table comparison

I am new to SSIS and am after some assistance in creating an SSIS package for a specific task. My data is stored remotely in a MySQL database and is downloaded to a SQL Server 2014 database. What I want to do is create a package where I can enter two dates that are compared against the create date/date modified per record on a number of tables, giving me a snapshot that compares the MySQL data to the SQL Server data, so I can see whether any rows are missing from my local SQL Server database or need to be updated. Some tables have no dates, so for those I just want a record count of what, if anything, is missing between the two. If this is better achieved through T-SQL I am happy to hear other suggestions, or pointers to sites where similar things have been done.
In relation to your query, Tab:
"Hi Tab, what happens at the moment is that our master data is stored in a MySQL database, and the data was downloaded to a SQL Server database as a one-off. I currently have an SSIS package that uses the MAX ID, which can be found on most of the tables, to work out which records are new and then downloads or updates them. What I want to do is run separate checks on the tables to make sure that nothing was missed during the download and that everything is in sync. In an ideal world I would like to pass a date range, say a calendar week, into an SSIS package or T-SQL stored procedure; this would then check for any differences between the remote MySQL database tables and the local SQL Server tables. It does not currently have to do anything but identify issues; correcting them may come later, or changes would be made to the existing sync package. Hope this makes more sense."
Thanks P
To do this, you need to implement a Type 1 Slowly Changing Dimension style data flow in SSIS. There are a number of ways to do this, including a built-in transformation aptly called the Slowly Changing Dimension transformation. Whilst this is easy to set up, it is a pain to maintain and it runs horrendously slowly.
There are numerous ways to set this up using other transformations or even SQL merge statements which are detailed here: https://bennyaustin.wordpress.com/2010/05/29/alternatives-to-ssis-scd-wizard-component/
I would recommend that you use Lookup transformations as they perform better than the Slowly Changing Dimension transformation but offer better diagnostics and error handling than the better performing SQL merge statement.
Before you do this you will need to add a Checksum or Hashbytes column to your SQL data for ease of comparison with the incoming MySQL data.
In short, calculate some sort of repeatable checksum as the data is downloaded into your SQL Server, then use this in an SSIS Lookup, matching on the row key, to check for changes. Where the checksum value is different for the same row it needs updating and where there is no matching row key in your SQL Data you need to insert the new row.
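To make the idea concrete outside of SSIS, here is a rough sketch of the checksum comparison; the key, columns and sample rows are invented, and in the real package the hash would come from CHECKSUM/HASHBYTES and the matching from a Lookup transformation:

```python
import hashlib

def row_hash(row: dict, columns: list) -> str:
    """Repeatable checksum over the comparable columns of a row."""
    joined = "|".join(str(row.get(c, "")) for c in columns)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

# Pretend these came from the MySQL source and the local SQL Server copy,
# already filtered to the requested date range and keyed by the row key.
columns = ["name", "modified_date", "status"]   # hypothetical comparable columns
source_rows = {1: {"name": "A", "modified_date": "2021-01-02", "status": "open"},
               2: {"name": "B", "modified_date": "2021-01-03", "status": "closed"}}
target_rows = {1: {"name": "A", "modified_date": "2021-01-01", "status": "open"}}

missing = [k for k in source_rows if k not in target_rows]
changed = [k for k in source_rows
           if k in target_rows
           and row_hash(source_rows[k], columns) != row_hash(target_rows[k], columns)]

print("rows to insert:", missing)   # no matching key in SQL Server
print("rows to update:", changed)   # key matches but checksum differs
```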

Tableau - How to query large data sources efficiently?

I am new to Tableau, am having performance issues, and need some help. I have a query that joins several large tables. I am using a live data connection to a MySQL DB.
The issue I am having is that Tableau is not applying the filter criteria before asking MySQL for the data. It is essentially doing a SELECT * for my query without applying the filter criteria to the WHERE clause, pulling all the data from the MySQL DB back into Tableau, and then throwing away the unneeded data based on my filter criteria. My two main filter criteria are on account_id and a date range.
I can cleanly get a list of the accounts just by doing a SELECT from my account table to populate the filter list; I then need to know how to apply that selection when Tableau pulls the data for the main query from MySQL.
To apply a filter at the data source first, try using context filters.
Performance can also be improved by using extracts.
I would personally use an extract: go into your MySQL back end and run the query as a CREATE TABLE extract1 AS statement (or whatever you want to call your data table).
When you import this table into Tableau, the workbook effectively already has the SELECT * of your aggregated data, and from here your query efficiency will increase tenfold.
Unfortunately, it will still take a while: Tableau processing time plus the MySQL back-end query time both add up when processing your data.
Try the extracts...
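A minimal sketch of building that pre-filtered extract table directly in MySQL (the connection details, view, table and column names are placeholders):

```python
import mysql.connector  # official MySQL driver for Python

# Connection details below are placeholders.
conn = mysql.connector.connect(host="db.example.com", user="report",
                               password="...", database="app")

# Materialise only the rows Tableau actually needs, so the workbook starts
# from pre-filtered data instead of a SELECT * over the full join.
create_extract = """
CREATE TABLE extract1 AS
SELECT *
FROM   big_joined_view           -- hypothetical view wrapping the multi-table join
WHERE  account_id IN (101, 102)  -- the accounts chosen in the filter list
  AND  created_at >= '2021-01-01'
  AND  created_at <  '2021-02-01'
"""

cur = conn.cursor()
cur.execute(create_extract)
cur.close()
conn.close()
```

If the date range changes often, rebuilding extract1 on a schedule keeps the pre-filtered table up to date.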
I've been struggling with the very same thing. I have found that Tableau extracts aren't any faster than pulling directly from a SQL table. What I have done is create tables within SQL that already contain the filtered data, so the SELECT * returns only the needed data. The downside is that it takes up more space on the server, but that isn't a problem in my case.
For large data sets, Tableau recommends using an extract.
An extract creates a snapshot of the data you are connected to, and processing on this data will be faster than over a live connection.
All the charts and visualizations will load faster, saving you time every time you open the dashboard.
The filters you use on the data set will also work faster with an extract connection, but to get the latest data you have to refresh the extract or schedule a refresh on the server (if you are publishing the report to the server).
There are multiple types of filters available in Tableau, and which to use depends on your application; context filters and global filters can be used to filter the whole data set.