Optimizing FuzzyMatch in Talend - talend

I am using Talend to check quality of data where I compare the names of the person of two databases.
One database will have correct names and another database will have corrupted names. What I have to do is compare both names and find correct names from corrupted names.
I am using the tFuzzyMatch component to match the names.
The database which has the correct names has 212000 records.
The database which has the incorrect names has 50000 records.
tFuzzyMatch takes a lot of time to lookup correct names for each corrupted name.
Can anyone help me to optimize tFuzzyMatch to reduce execution time?
My job looks like this:
Please take a look at fuzzy match lookup. It has 3124340 rows.
I would like to speed up Fuzzy Match lookup.

Related

Bulk load causing schema drift in files (adf pipeline or mapping dataflows)

Bit of a challenge here
I have around 45,000 historic .parquet files
partitioned like this yyyy,mm,dd (2021/08/19) in the dd level I have 24 files (one for each hour)
The columns in each day file are pretty wide, anything up to 250 columns. It has increased and decreased over time, hence there being schema drift when trying to load into SQL using mapping dataflows that made the file larger.
Around 200 of those columns I require and I know what they are. I even have them in a schema template. The rest are legacy or unwanted
I'd like to retain the original files in blob as they are, but load files with those 200 columns per file into SQL.
What is the best way to achieve this?
How do I iterate over every file but only take the columns I need?
I tried using a wildcard path
'2021/**/*.parquet'
within mapping dataflows to pick up All files in blob so I don't have to iterate creating multiple clusters or a foreach
I'm not even sure how to handle this or whether it should be a copy activity or a mapping df
both have their benefits but I think I can only use mapping df if I need to transform parts of these files in depth.
should I be combining the months or even years into a single file then trying to read from this files so I can exclude the additional from the columns I want to take into SQL server.
ideally this is a bulk load that need some refinement when it lands.
Thank in advance
Add a data flow to the pipeline and use a Select transformation to choose the columns you wish to propagate. You can create pattern-based rules in the data flow Select transformation to choose the columns that you wish to pick from each file schema.

What are some of the most efficient workflows for processing "big data" (250+ GB) from postgreSQL databases?

I am constructing a script that will be processing well-over 250+ GB of data from a single postgreSQL table. The table's shape is ~ 150 cols x 74M rows (150x74M). My goal is to somehow sift through all the data and make sure that each cell entry meets certain criteria that I will be tasked with defining. After the data has been processed I want to pipeline it into an AWS instance. Here are some scenarios I will need to consider:
How can I ensure that each cell entry meets certain criteria of the column it resides in? For example, all entries in the 'Date' column should be in the format 'yyyy-mm-dd', etc.
What tools/languages are best for handling such large data? I use Python and the Pandas module often for DataFrame manipulation, and am aware of the read_sql function, but I think that this much data will simply take too long to process in Python.
I know how to manually process the data chunk-by-chunk in Python, however I think that this is probably too inefficient and the script could take well over 12 hours.
Simply put or TLDR: I'm looking for a simple, streamlined solution to manipulating and performing QC analysis on postgreSQL data.

Storing user-defined segments JSONB vs separate table?

I want to store user-defined segments. A segment will consist of several different rules. I was thinking I would either create a separate separate table of "Rules" with three columns: attribute name, operator, and value. For example, if a Segment is users in the united states the rule would be "country = US" in their respective columns. A segment can have many rules.
The other option is to store these as JSONB via Postgres in a "Rules" column in the Segment table. I'd follow a similar pattern to the above with an array of rules or something. What are the pros and cons of each method?
Maybe neither one of these is the right approach.
The choice is basically about the way you wish to read the data.
You are better off with JSON if:
you are not going to filter (with a WHERE clause) through the Rules
you do not need to get statistics (i.e. GROUP BY)
you will not imply any constraints on attributes/operators/values
you simply select the values (SELECT ..., Rules)
If you meet these requirements you can store data as JSON, thus eliminating JOINs and subselects, eliminating the overhead of primary key and indexes on Rules, etc.
But if you don't meet these you should store the data in a common relational design - your approach 1.
I would go with the first approach of storing the pieces of data individually in a relational database. It sounds like your data (segments->rules) will always contain the same structure (which is fairly simple), so there isn't a pressing reason to store the data as JSON.
As a side note, I think you will need another column in the "Rules" table, serving as a foreign key to the "Segments" table.
Pros to approach 1:
Data is easy to search and select. Your SQL statements can directly access specific information about the rules (the specific operators, name, value, etc) without having to parse the JSON object for the desired rule.
The above will result in reduced processing time
Only need to parse the JSON once (before the insert)
Cons to approach 1:
Requires parsing of JSON before the insert
Requires multiple inserts per segment
Regarding your last sentence, it is hard to prescribe a database design without knowing more about your intended functionality. For example, if the attribute names have meaning beyond a single segment, you would want to store the attribute names separately and reference them in the Rules table.

Create HDB from Existing Tables

I have a numerous amount of tables stored in memory in KDB. I am hoping to create an HDB of these tables so I can free up memory space. I am a bit confused on the process of creating an HDB - splaying tables, etc. Can someone help me with the process of creating an HDB, and then what needs to be done moving forward - ie to upload whatever new data I have end of day.
Thanks.
There's many ways to create a HDB depending on the scenario. General practices are:
For small tables, just write them as flat/serialised files using
`:/path/to/dbroot/flat set inMemTable;
or
`:/path/to/dbroot/flat upsert inMemTable;
The latter will add new rows while the former overwrites. However since you're trying to free up memory, using flat/serialised won't be all that useful since flat/serialised files will get pulled into memory in full anyway.
For larger tables (10's of millions) that aren't growing too much on a daily basis, you can splay them using set along with .Q.en (enumeration is required when the table is not saved flat/serialised):
`:/path/to/dbroot/splay/ set .Q.en[`:/path/to/dbroot] inMemTable;
or
`:/path/to/dbroot/splay/ upsert .Q.en[`:/path/to/dbroot] inMemTable;
again depending on whether you want to overwrite or add new rows.
For tables that grow on a daily basis and have a natural date separation, you would write as a date-partitioned table. While you can also use set and .Q.en for date partitioned tables (since they are the same as splayed tables, just separated into physical date directories) the easier method might be to use .Q.dpft or dsave if you're using a recent version of kdb. These will do a lot of the work for you.
It's up to you then to maintain the tables, ensure the savedowns occur on a regular basis (usually daily), append to tables if necessary etc etc

realtime querying/aggregating millions of records - hadoop? hbase? cassandra?

I have a solution that can be parallelized, but I don't (yet) have experience with hadoop/nosql, and I'm not sure which solution is best for my needs. In theory, if I had unlimited CPUs, my results should return back instantaneously. So, any help would be appreciated. Thanks!
Here's what I have:
1000s of datasets
dataset keys:
all datasets have the same keys
1 million keys (this may later be 10 or 20 million)
dataset columns:
each dataset has the same columns
10 to 20 columns
most columns are numerical values for which we need to aggregate on (avg, stddev, and use R to calculate statistics)
a few columns are "type_id" columns, since in a particular query we may
want to only include certain type_ids
web application
user can choose which datasets they are interested in (anywhere from 15 to 1000)
application needs to present: key, and aggregated results (avg, stddev) of each column
updates of data:
an entire dataset can be added, dropped, or replaced/updated
would be cool to be able to add columns. But, if required, can just replace the entire dataset.
never add rows/keys to a dataset - so don't need a system with lots of fast writes
infrastructure:
currently two machines with 24 cores each
eventually, want ability to also run this on amazon
I can't precompute my aggregated values, but since each key is independent, this should be easily scalable. Currently, I have this data in a postgres database, where each dataset is in its own partition.
partitions are nice, since can easily add/drop/replace partitions
database is nice for filtering based on type_id
databases aren't easy for writing parallel queries
databases are good for structured data, and my data is not structured
As a proof of concept I tried out hadoop:
created a tab separated file per dataset for a particular type_id
uploaded to hdfs
map: retrieved a value/column for each key
reduce: computed average and standard deviation
From my crude proof-of-concept, I can see this will scale nicely, but I can see hadoop/hdfs has latency I've read that that it's generally not used for real time querying (even though I'm ok with returning results back to users in 5 seconds).
Any suggestion on how I should approach this? I was thinking of trying HBase next to get a feel for that. Should I instead look at Hive? Cassandra? Voldemort?
thanks!
Hive or Pig don't seem like they would help you. Essentially each of them compiles down to one or more map/reduce jobs, so the response cannot be within 5 seconds
HBase may work, although your infrastructure is a bit small for optimal performance. I don't understand why you can't pre-compute summary statistics for each column. You should look up computing running averages so that you don't have to do heavy weight reduces.
check out http://en.wikipedia.org/wiki/Standard_deviation
stddev(X) = sqrt(E[X^2]- (E[X])^2)
this implies that you can get the stddev of AB by doing
sqrt(E[AB^2]-(E[AB])^2). E[AB^2] is (sum(A^2) + sum(B^2))/(|A|+|B|)
Since your data seems to be pretty much homogeneous, I would definitely take a look at Google BigQuery - You can ingest and analyze the data without a MapReduce step (on your part), and the RESTful API will help you create a web application based on your queries. In fact, depending on how you want to design your application, you could create a fairly 'real time' application.
It is serious problem without immidiate good solution in the open source space. In commercial space MPP databases like greenplum/netezza should do.
Ideally you would need google's Dremel (engine behind BigQuery). We are developing open source clone, but it will take some time...
Regardless of the engine used I think solution should include holding the whole dataset in memory - it should give an idea what size of cluster you need.
If I understand you correctly and you only need to aggregate on single columns at a time
You can store your data differently for better results
in HBase that would look something like
table per data column in today's setup and another single table for the filtering fields (type_ids)
row for each key in today's setup - you may want to think how to incorporate your filter fields into the key for efficient filtering - otherwise you'd have to do a two phase read (
column for each table in today's setup (i.e. few thousands of columns)
HBase doesn't mind if you add new columns and is sparse in the sense that it doesn't store data for columns that don't exist.
When you read a row you'd get all the relevant value which you can do avg. etc. quite easily
You might want to use a plain old database for this. It doesn't sound like you have a transactional system. As a result you can probably use just one or two large tables. SQL has problems when you need to join over large data. But since your data set doesn't sound like you need to join, you should be fine. You can have the indexes setup to find the data set and the either do in SQL or in app math.