Remove Duplicates + first occurrences - google-cloud-dataprep

Sorry, but does anybody know how I can remove duplicate rows AND the first occurrence in Google Dataprep?
So both rows (the duplicate row + the 1st occurrence) will be deleted?
col1,col2
john,simpson
will,farrell
john,simpson
elon,musk
will be:
col1,col2
will,farrell
elon,musk
Thank you guys!

It's entirely possible that there's a more efficient way for larger datasets, but my mind initially jumps to using grouping.
Conceptually what I'm talking about is using grouping (joins to the same data could also work) as a way to identify which rows have duplicates, then using a separate rule to filter them out.
Here's a proof-of-concept recipe wrangle based on your sample data:
groupby group: col1,col2 value: COUNT() type: flatAgg
filter type: greaterThan col: row_count greaterThan: 1 action: Delete
drop col: row_count action: Drop
(if you paste these into new recipe steps one at a time, it will create them for you)
Note in the above that you don't have to pass a parameter to COUNT() in this case; it will just count the number of rows in each group (similar to COUNT(*) in SQL).
You can also see that I'm using the flatAgg type, which corresponds to the "Group by as new column(s)" option in the Group By step. This is incredibly helpful in scenarios where you have many columns you don't want to re-specify, as you would in a normal Group By (which creates a new table with only your grouping columns).
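If it helps to see the same idea outside Dataprep, here's a rough SQL equivalent of those three steps (a sketch only; people is a made-up table name holding the sample data):
-- "Group by as new column(s)" behaves like a window COUNT per (col1, col2) group.
SELECT col1, col2
FROM (
    SELECT col1, col2,
           COUNT(*) OVER (PARTITION BY col1, col2) AS row_count
    FROM people
) t
WHERE row_count = 1;  -- keep only rows that were never duplicated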
Hope that helps, and happy wrangling!

Related

Efficient way to find ordered string's exact, prefix and postfix match in PostgreSQL

Given a table named table and a string column named column, I want to search for the word word in that column in the following way: exact matches should be on top, followed by prefix matches, and finally postfix matches.
Currently I got the following solutions:
Solution 1:
select column
from (select column,
             case
                 when column like 'word' then 1
                 when column like 'word%' then 2
                 when column like '%word' then 3
             end as rank
      from table) as ranked
where rank is not null
order by rank;
Solution 2:
select column
from table
where column like 'word'
   or column like 'word%'
   or column like '%word'
order by case
             when column like 'word' then 1
             when column like 'word%' then 2
             when column like '%word' then 3
         end;
Now my question is: which of the two solutions is more efficient, or better yet, is there a solution better than both of them?
Your 2nd solution looks simpler for the planner to optimize, but it is possible that the first one gets the same plan as well.
For the WHERE clause, column like 'word' is not needed, as it is covered by column like 'word%'; it might confuse the DB into doing 2 checks instead of one.
But the biggest problem is the third condition (column like '%word'), as it has no way to be optimized by a regular index.
So either way, PostgreSQL is going to scan your full table and manually extract the matches. This is going to be slow for 20,000 rows or more.
I recommend exploring fuzzy string matching and full text search; it looks like that is what you're trying to emulate.
Even if you don't want the full power of FTS or fuzzy string matching, you should definitely add the extension pg_trgm, as it will let you add a GIN index on the column that will speed up LIKE '%word' searches.
https://www.postgresql.org/docs/current/pgtrgm.html
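For example (a minimal sketch, keeping the question's placeholder names "table" and "column", quoted because both are reserved words):
CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- A trigram GIN index lets LIKE '%word' (and '%word%') searches use an index.
CREATE INDEX table_column_trgm_idx ON "table" USING gin ("column" gin_trgm_ops);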
And seriously, have a look at FTS. It does provide ranking. If your requirements are strictly what you described, you can still perform the FTS query to "prefilter" and then apply this ranking logic afterwards.
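As a rough sketch of that prefilter idea (note that FTS works on whole words, so it covers the exact and prefix cases; the '%word' postfix case still needs the trigram index above):
-- 'simple' avoids stemming; 'word:*' matches lexemes starting with "word".
select "column"
from "table"
where to_tsvector('simple', "column") @@ to_tsquery('simple', 'word:*')
order by case
             when "column" like 'word' then 1
             when "column" like 'word%' then 2
             when "column" like '%word' then 3
         end;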
There are tons of introduction articles to PostgreSQL FTS, here's one:
https://www.compose.com/articles/mastering-postgresql-tools-full-text-search-and-phrase-search/
And I even wrote a post recently about adding FTS search to my site:
https://deavid.wordpress.com/2019/05/28/sedice-adding-fts-with-postgresql-was-really-easy/

Sort data within a subquery with another subquery?

I am trying to sort the OUN.note column by using the OUN.outcomeKey, since the way it is working right now is putting the notes in the wrong order (sorting alphabetically). Any idea how to go about this? I've been trying to sort the data using another subquery inside, but I haven't had much luck (I don't have a plethora of experience).
Here's my current query:
SELECT DISTINCT OC.outcomeKey [Outcome Key], OC.outcome [Result],
    STUFF((SELECT ',' + ' ' + OUN.note
           FROM Outcome AS OUT
           JOIN OutcomeNote AS OUN
               ON OUT.outcomeKey = OUN.outcomeKey
           WHERE OUN.outcomeKey = OC.outcomeKey
           GROUP BY OUN.note
           FOR XML PATH ('')), 1, 1, '') [Outcome Note]
FROM Outcome AS OC
Any help or tips would be greatly appreciated! Also, please let me know if any more info is needed.
You may replace the line
GROUP BY OUN.note
with the line
ORDER BY OUN.outcomeKey
Also, because the concatenation starts with ', ', you may want to use 1, 2, '' as the additional arguments of the STUFF function. Otherwise, the values in your [Outcome note] column always start with a space.
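Putting both suggestions together, the query would look something like this (a sketch based on your original query):
SELECT DISTINCT OC.outcomeKey [Outcome Key], OC.outcome [Result],
    STUFF((SELECT ',' + ' ' + OUN.note
           FROM Outcome AS OUT
           JOIN OutcomeNote AS OUN
               ON OUT.outcomeKey = OUN.outcomeKey
           WHERE OUN.outcomeKey = OC.outcomeKey
           ORDER BY OUN.outcomeKey  -- was: GROUP BY OUN.note
           FOR XML PATH ('')), 1, 2, '') [Outcome Note]  -- 1, 2 skips the leading ', '
FROM Outcome AS OC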
Edit:
By the way, sorting the notes by outcomeKey in the subquery that generates the values for the [Outcome Note] column has no effect, since all the notes in each subquery result will have the same outcomeKey value.
But you may sort on any column you want, of course. Perhaps there are other columns in your OutcomeNote table that can serve as a useful sorting column for your outcome notes.
If I misunderstood your question, please provide definitions of the Outcome and OutcomeNote tables, together with a demo population of those tables and the desired/expected query result.
Edit 2:
Starting with SQL Server 2017, Transact-SQL contains a function called STRING_AGG, which seems to be functionally equivalent (more or less) to MySQL's GROUP_CONCAT function. Using this function, your query would become something like this:
SELECT
OUN.outcomeKey [Outcome Key],
OC.outcome [Result],
STRING_AGG(OUN.[Note], ', ') WITHIN GROUP (ORDER BY OUN.outcomeKey) [Outcome Note]
FROM
Outcome AS OC
JOIN OutcomeNote AS OUN ON OUN.outcomeKey = OC.outcomeKey
GROUP BY
OUN.outcomeKey,
OC.outcome;
When using SQL Server 2017 or SQL Azure, this might be a more fitting choice, since it not only makes the query more readable, but it also eliminates the use of the (far less efficient) XML functions in your query.
I too have used the XML functionality for field concatenation (the way you use it) intensively in the past, but I noticed a considerable drop in the performance of my queries (which sometimes contained up to 10 columns with concatenated data). Since then, I tend to go for recursive common table expressions or scalar UDFs with recursion in pre-SQL Server 2017 environments.

Updating the table through tOracleOutput in Talend using an additional SQL query

I have a job where I am getting a flow into tOracleOutput, where I am updating the table. Now I have to update that table using an SQL statement, which I guess is an option in the Advanced settings of tOracleOutput, but I don't know how to use it, or you could say I don't understand the settings properly. I referred to the official documentation but could not understand it. Can anyone explain the fields like Name, SQL expression, Position, and Reference Column in a better way?
The SQL query which I am using is:
update set COL1=SOMETHING1
where COL2=SOMETHING2
Now, value for COL1 is coming from the flow but COL2 is some column in the table which is not coming from the flow.
Have a look at tOracleRow for such a case.
Hope this helps.
TRF
tOracleOutput is helpful when you have a ready data source (a table or file with the same columns as the destination). The more elaborate your query is, the more you should do as TRF said (and use tOracleRow), but here's an example for your question:
the file contains 3 columns,
the destination DB table contains 4 columns, where the 4th is the date of the update (the first 3 are identical to the input),
so you add the destination column's name in Name, put the SQL function for the date (e.g. SYSDATE) in SQL expression, and specify where to put it (Position) relative to a column of your choice (Reference Column).
In my view this helps avoid using tMap for a single additional column when you want to Insert. But you want to Update, in which case the component doesn't offer the additional columns section, plus I don't think you can add the WHERE clause there.
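That's also why, if you do go the tOracleRow route TRF suggests, you would put the whole statement in that component's Query field, roughly like this (a sketch only: MY_TABLE is a hypothetical stand-in for the table name omitted in the question):
-- Hypothetical sketch of the statement tOracleRow would run;
-- MY_TABLE stands in for the unnamed table from the question.
UPDATE MY_TABLE
SET COL1 = SOMETHING1    -- value arriving from the flow
WHERE COL2 = SOMETHING2  -- existing table column, not part of the flow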
Hope it helps

Perl : Tracking duplicates

I am trying to figure out the best way to locate duplicates in 6-column CSV data. The real data has more than a million rows in it.
Following is the content of the mentioned 6 columns.
Name, address, city, post-code, phone number, machine number
The data does not have a fixed length, and data in certain columns might be missing in certain instances.
I am thinking of using Perl to first normalize all the short forms used in names, city, and address. Fellow Perl enthusiasts from Stack Overflow have helped me a lot.
But there would still be a lot of data which would be difficult to match.
So I am wondering: is it possible to match content based on "LIKENESS / SIMILARITY" (e.g. google similar to gugl)? The likeness measure would be required to overcome errors that crept in while collecting the data.
I have 2 tasks in hand w.r.t. the data:
Flag duplicate rows with a certain identifier
Report the percentage match between similar rows
I would really appreciate suggestions as to what possible methods could be employed and which would probably be best because of their particular merits.
You could write a Perl program to do this, but it will be easier and faster to put it into a SQL database and use that.
Most SQL databases have a way to import CSV. For this answer, I suggest PostgreSQL because it has very powerful string functions which you will need to find your fuzzy duplicates. Create your table with an auto incremented ID column if your CSV data doesn't already have unique IDs.
Once the import is done, add indexes on the columns you want to check for duplicates.
CREATE INDEX name ON whatever (name);
You can do a self-join to look for duplicates in whatever way you like. Here's an example that finds duplicate names.
SELECT id
FROM whatever t1
JOIN whatever t2 ON t1.id < t2.id
WHERE t1.name = t2.name
PostgreSQL has powerful string functions including regexes to do the comparisons.
Indexes will have a hard time working on things like lower(t1.name). Depending on the sorts of duplicates you want to work with, you can add indexes for these transforms (this is a feature of PostgreSQL). For example, if you want to search case-insensitively you can add an index on the lower-cased name. (Thanks @asjo for pointing that out.)
CREATE INDEX ON whatever ((lower(name)));
-- This will be muuuuuch faster
SELECT id
FROM whatever t1
JOIN whatever t2 ON t1.id < t2.id
WHERE lower(t1.name) = lower(t2.name)
A "likeness" match can be achieved in several ways, a simple one would be to use the fuzzystrmatch functions like metaphone(). Same trick as before, add a column with the transformed row and index it.
Other simple things like data normalization are better done on the data itself before adding indexes and looking for duplicates. For example, trim out and squish extra whitespace.
UPDATE whatever SET name = trim(both from name);
UPDATE whatever SET name = regexp_replace(name, '[[:space:]]+', ' ');
Finally, you can use the Postgres trigram module (pg_trgm) to add fuzzy indexing to your table (thanks again to @asjo).
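A sketch of what that enables (the % operator and similarity() come with pg_trgm; % uses a similarity threshold that defaults to 0.3):
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX ON whatever USING gin (name gin_trgm_ops);
-- Pairs of rows with similar (not merely equal) names, plus a 0..1 score
-- that also answers the "percentage match" part of the question.
SELECT t1.id, t2.id, similarity(t1.name, t2.name) AS score
FROM whatever t1
JOIN whatever t2 ON t1.id < t2.id AND t1.name % t2.name;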

Tableau: Create a table calculation that sums distinct string values (names) when condition is met

I am getting my data from a denormalized table, where I keep names and actions (among other things). I want to create a calculated field that will return the count of distinct workgroup names, but only when there are more than five actions present in the DB for the given workgroup.
Here's how I did it when I wanted to check whether a certain action had been registered for a workgroup:
WINDOW_SUM(COUNTD(IF [action] = "ADD" THEN [workgroup_name] END))
When I try to do a similar thing with a count, I am getting "Cannot mix aggregate and non-aggregate arguments":
WINDOW_SUM(COUNTD(IF COUNT([Number of Records]) > 5 THEN [workgroup_name] END))
I know that there's a problem with the IF clause, but I don't know how to fix it.
How can I change the IF to be valid? Maybe there's an easier way to do this that I am missing?
EDIT:
(after Inox's response)
I know that my problem is mixing aggregate with non-aggregate fields. I can't use a filter to do it, because I want to use this later as part of a more complicated view; filtering would destroy the whole idea.
No, the problem is mixing aggregated arguments (e.g., sum, count) with non-aggregated ones (e.g., any field used directly). And that's what you're doing by mixing COUNT([Number of Records]) with [workgroup_name].
If your goal is to know how many (unique) workgroup_name values have more than 5 records (which seems to be the idea of your code), I think it's easier to filter and then count.
So first you drag workgroup_name to Filters, go to the Condition tab, and select By field: Number of Records, Count, >, 5.
This way you'll keep only the workgroup_name values that have more than 5 records.
Now you can go with a simple COUNTD(workgroup_name)
EDIT: After clarification
Okay, then you need to add a marker that is fixed in your database, so table calculations won't help you.
By definition, a table calculation depends on the fields that are on the worksheet (and on how you decide to use those fields to partition or address), and it's only calculated AFTER being called in a sheet. That means each time you call the function it will recalculate, and for some analyses you may want to do, the fields you need to make the table calculation correct won't be there.
The same thing applies to aggregations (counts, sums, ...): the aggregation depends, well, on the level of aggregation you have.
In this case it's better to manipulate your data prior to connecting it to Tableau. I don't see a direct way (a single calculated field) that would solve your problem. What can be done is to generate a DB from Tableau (with the aggregation of the number of records for each workgroup_name), then export it to CSV or MDB, and then reconnect it to Tableau. But if you can manipulate your database outside Tableau, that's usually a better solution.
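If your source happens to be a SQL database, the pre-aggregation described above could be as simple as this sketch (actions is a hypothetical table name, with one row per action tagged by its workgroup):
-- Aggregate once in the database, then connect Tableau to the result.
SELECT workgroup_name,
       COUNT(*) AS action_count,
       CASE WHEN COUNT(*) > 5 THEN 1 ELSE 0 END AS has_many_actions  -- the "fixed marker"
FROM actions
GROUP BY workgroup_name;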