Pentaho: combination lookup/update does not process all rows in the source - pentaho-spoon

I am using Pentaho Data Integration to do an SCD type 1 transformation. I am using the Combination lookup/update step to generate the surrogate key value (upon insert). The commit size is 100000 and the cache size is 99999. My source table has 19763 rows, but when I run the job to load data into the destination (dimension) table, the Combination lookup/update step only processes 10000 of the 19763 rows, every single time.
How can I get it to process all 19763 records in the source table?

Finally, I found the answer. It's simple. Click on Edit -> Settings -> Miscellaneous -> Nr of rows in rowset and change it from 10000 to the desired number of records coming from the source. For me, the value was set to 10000, and hence only 10000 records were ever written to my destination dimension table. I changed it to a million and now I am getting all 19763 records in my destination table.

It seems you are doing an incremental update. There is a special step, named Merge Rows (Diff), to compare two streams and tell whether each row exists in both streams and whether it has changed.
The two streams, a reference stream (the current data) and a compare stream (the new data), are merged. The rows are merged and flagged as:
identical: the key was found in both streams and the values to compare are identical;
changed: the key was found in both streams but one or more values is different;
new: the key was not found in the reference stream;
deleted: the key was not found in the compare stream.
The two streams must be sorted before they are merged. You can do this in the SQL query, but it is best to add an explicit Sort rows step, because otherwise the process will stop after reading 1000 records, or whatever the rowset limit is (sounds familiar?).
The stream is then directed with a Switch / Case step to the appropriate action.
The "best practice" pattern is as follows, in which I added a first step to get the max date, and a step to build the surrogate key.
This pattern has proven over the years to be much faster. In fact, it avoids the very slow Database lookup step (one full database search per row) and reduces the use of the slow Insert/Update step (three accesses to the database per row: one to fetch the record, one to change the values and one to store it). And the sort (which can be pre-prepared in the stream) is pretty efficient.
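For reference, when both streams come from database tables, the sort can also be expressed directly in the two Table input queries, as long as both use the same ORDER BY on the keys configured in Merge Rows (Diff). A minimal sketch, with illustrative table and column names:

-- Reference stream (current dimension data), sorted on the merge key
SELECT customer_id, name, city
FROM dim_customer
ORDER BY customer_id;

-- Compare stream (incoming data), sorted on the same key
SELECT customer_id, name, city
FROM stg_customer
ORDER BY customer_id;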

Related

Reliable way to poll data from a Postgres table

I want to use a table in a Postgres database as storage for input documents (there will be billions of them).
Documents are continuously being added (using "UPSERT" logic to avoid duplicates), and are rarely removed from the table.
There will be multiple worker apps that should continuously read data from this table, from the first inserted row to the latest, and then poll for new rows as they are inserted, reading each row exactly once.
Also, when a worker's processing algorithm changes, all the data should be reread from the first row. Each app should be able to maintain its own row-processing progress, independent of the other apps.
I'm looking for a way to track the last processed row, so that I can pause and continue polling at any moment.
I can think of these options:
Using an autoincrement field
And then store the autoincrement field value of the last processed row somewhere, to use it in a subsequent query like this:
SELECT * FROM document WHERE id > :last_processed_id LIMIT 100;
But after some research I found that in a concurrent environment, it is possible that rows with lower autoincrement values will become visible to clients LATER than rows with higher values, so some rows could be skipped.
Using a timestamp field
The problem with this option is that timestamps are not unique and could overlap during a high insertion rate, which, once again, leads to skipped rows. Also, adjusting the system time (manually or by NTP) may lead to unpredictable results.
Add a process completion flag to each row
This is the only truly reliable way of doing this I could think of, but there are drawbacks to it, including the need to update each row after it is processed and the extra storage needed for a completion flag field per app; running a new app may also require a DB schema change. This is the last resort for me, and I'd like to avoid it if there are more elegant ways to do this.
I know, the task definition screams that I should use Kafka for this, but the problem with it is that it doesn't allow deleting single messages from a topic, and I need this functionality. Keeping an external list of Kafka records that should be skipped during processing feels very clumsy and inefficient to me. Also, real-time deduplication with Kafka would require some external storage as well.
I'd like to know if there are other, more efficient approaches to this problem using the Postgres DB.
I ended up saving the transaction id with each row and then selecting only the records whose txid is lower than the smallest transaction id still in progress at that moment, like this:
SELECT * FROM document
WHERE ((txid = :last_processed_txid AND id > :last_processed_id) OR txid > :last_processed_txid)
AND txid < pg_snapshot_xmin(pg_current_snapshot())
ORDER BY txid, id
LIMIT 100
This way, even if Transaction #2, which started after Transaction #1, completes faster than the first one, the rows it wrote won't be read by a consumer until Transaction #1 finishes.
Postgres docs state that
xid8 values increase strictly monotonically and cannot be reused in the lifetime of a database cluster
so it should fit my case.
This solution is not that space-efficient, because an extra 8-byte txid field must be saved with each row, and an index on the txid field should be created, but the main benefits over the other methods here are:
The DB schema remains the same when new consumers are added
No updates are needed to mark a row as processed; a consumer only has to keep the id and txid values of the last processed row
System clock drift or adjustment won't lead to rows being skipped
Having the txid for each row helps to query data in insertion order in cases where multiple producers insert rows with ids generated from preallocated pools (for example, Producer 1 currently inserts rows with ids in 1..100, Producer 2 with 101..200, and so on)
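A minimal sketch of how the txid column could be filled and indexed, assuming PostgreSQL 13+ (which introduced xid8 and pg_current_xact_id()); the table layout and names are illustrative:

-- Illustrative layout: txid defaults to the id of the inserting transaction
CREATE TABLE document (
    id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    txid xid8   NOT NULL DEFAULT pg_current_xact_id(),
    body jsonb  NOT NULL
);

-- Index matching the ORDER BY txid, id of the polling query
CREATE INDEX document_txid_id_idx ON document (txid, id);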

Aggregation for a DynamoDB table

I have a DynamoDB table with 10000 entries and I want to compute the sum of a specific attribute over all of these entries.
At the same time, there are lots of updates coming in every second - 10000 updates every second.
The problem is the fact that reading 10000 DDB entries from a single host is very slow.
Q: What is the best way to do this aggregation, keeping in mind that I need the output sum written to another DDB table every second?
The two options I'm currently thinking about are:
Cache in front of DDB table (such as DAX - Dynamo DB Accelerator)
DDB Streams for all of the changes => which get fed into a Kinesis stream => have a host that processes the changes => Writes them to the output table 1/sec
Q: Also, how should the single point of failure be addressed in both options, given that we only have one host performing the aggregation?
Looking forward to hearing some other suggestions and better ways of doing this.

Incremental upload/update to PostgreSQL table using Pentaho DI

I have the following flow in Pentaho Data Integration to read a txt file and map it to a PostgreSQL table.
The first time I run this flow everything goes ok and the table gets populated. However, if later I want to do an incremental update on the same table, I need to truncate it and run the flow again. Is there any method that allows me to only load new/updated rows?
In the PostgreSQL Bulk Load operator, I can only see "Truncate/Insert" options and this is very inefficient, as my tables are really large.
See my implementation:
Thanks in advance!!
Looking around at the possibilities, some users say that the only advantage of the Bulk Loader is performance with very large batches of rows (upwards of millions). But there are ways of countering this.
Try using the Table output step with a batch size ("Commit size" in the step) of 5000, and alter the number of copies executing the step (depending on the number of cores your processor has) to, say, 4 copies (a dual-core CPU with 2 logical cores each). You can alter the number of copies by right-clicking the step in the GUI and setting the desired number.
This will parallelize the output into 4 groups of inserts, of 5000 rows per 'cycle' each. If this causes a memory overload in the JVM, you can easily adapt that and increase the memory usage via the PENTAHO_DI_JAVA_OPTIONS option; simply double the amounts set for Xms (minimum) and Xmx (maximum). Mine is set to "-Xms2048m" "-Xmx4096m".
The only peculiarity I found with this step and PostgreSQL is that you need to specify the Database fields in the step, even if the incoming rows have exactly the same layout as the table.
You are looking for an incremental load. You can do it in two ways.
There is a step called "Insert/Update" that can be used to do the incremental load.
You will have the option to specify key columns to compare. Then, under the fields section, select "Y" for update. Select "N" for the columns you use for the key comparison.
Use Table output and uncheck the "Truncate table" option. While retrieving the data from the source table, use a variable in the where clause: first get the max value from your target table, set that value into a variable and include it in the where clause of your query (see the sketch after this answer).
Edit:
If your data source is a flat file, then, as I said, get the max value (date/int) from the target table and join it with your data. After that, use a Filter rows step to keep only the incremental data.
Hope this helps.
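A minimal sketch of the two queries behind the second approach, assuming a load_date column in both tables and a variable named MAX_LOAD_DATE set by a previous transformation (all names are illustrative):

-- Run first (e.g. in a prior transformation) and store the result in ${MAX_LOAD_DATE}
SELECT COALESCE(MAX(load_date), DATE '1900-01-01') AS max_load_date
FROM target_table;

-- Table input of the main transformation, with "Replace variables in script" enabled
SELECT *
FROM source_table
WHERE load_date > '${MAX_LOAD_DATE}';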

How to optimize tUniqRow and tSortRow

Is it better to put tSortRow before tUniqRow, or vice versa, for the best performance?
And how can tUniqRow be optimized?
Even if I use the "use disk" option, the job crashes.
I'm working on a 3-million-line file.
In order to optimize your job, you can try the following:
Use the "use disk" option on tSortRow with a smaller buffer (the default 1-million-row buffer is too big, so start with a small number of rows, 50k for instance, then increase it in order to get better performance). This will use more (smaller) files on disk, so your job will run slower, but it will consume less memory.
Try a tSortRow (using disk) followed by a tAggregateSortedRow instead of tUniqRow (by specifying the unique columns in the Group by section it acts as a tUniqRow; the columns not part of the unique key must each be specified in the Operations tab using the 'First' function). As it expects the rows to already be sorted, it doesn't sort them in memory first. Note that this component requires you to know the number of rows in your flow beforehand, which you can get from a previous subjob if you're processing your data in multiple steps.
Also, if the columns you're sorting by in tSortRow come from your database table, you can use an ORDER BY clause in your tOracleInput query. This way the sorting will be done on the database side and your job won't consume memory for the sort.
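For example, the query in tOracleInput could look like this (table and column names are illustrative), so the flow arrives downstream already sorted:

-- Sorting is pushed down to the database; the job no longer needs tSortRow memory
SELECT customer_id, first_name, last_name
FROM customers
ORDER BY customer_id, first_name, last_name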

Redshift and ultra wide tables

In an attempt to handle custom fields for specific objects in a multi-tenant dimensional DW, I created an ultra-wide denormalized dimension table (hundreds of columns, at the hard-coded column limit) that Redshift is not liking too much ;).
user1|attr1|attr2...attr500
Even an innocent update query on a single column for a handful of records takes approximately 20 seconds. (Which is kind of surprising, as I would guess it shouldn't be such a problem on a columnar database.)
Any pointers on how to modify the design for better reporting, from the normalized source table (one user has multiple different attributes, one attribute per line) to a denormalized one (one row per user with generic columns, different for each of the tenants)?
Or has anyone tried transposing (pivoting) normalized records into a denormalized view (table) in Redshift? I am worried about performance.
It is probably important to think about how Redshift stores data and then how it implements updates on that data.
Each column is stored in its own sequence of 1MB blocks, and the content of those blocks is determined by the SORTKEY. So, however many rows of the sort key's values fit in 1MB determines how many (and which) values go into the corresponding 1MB block of every other column.
When you ask Redshift to UPDATE a row, it actually writes a new version of the entire block for every column that corresponds to that row - not just the block(s) that change. If you have 1,600 columns, that means updating a single row requires Redshift to write a minimum of 1,600MB of new data to disk.
This issue can be amplified if your update touches many rows that are not located together. I'd strongly suggest choosing a SORTKEY that corresponds closely to the range of data being updated, to minimize the volume of writes (see the sketch below).
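As a rough sketch of that advice (table, column and key choices are purely illustrative, not a recommendation for your schema), sorting by tenant and then user keeps the blocks rewritten by a per-tenant update in a narrow, contiguous range:

-- Illustrative wide dimension table sorted by tenant, then user
CREATE TABLE user_dim (
    tenant_id BIGINT NOT NULL,
    user_id   BIGINT NOT NULL,
    attr1     VARCHAR(256),
    attr2     VARCHAR(256)
    -- ... more attribute columns
)
DISTKEY (user_id)
SORTKEY (tenant_id, user_id);

-- Filtering on the leading sort key column limits how many blocks are rewritten
UPDATE user_dim
SET attr1 = 'new value'
WHERE tenant_id = 42
  AND user_id BETWEEN 1001 AND 1003;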