I have a job in Talend. The goal of the job is to combine 18 tables (2 million records each) into one.
The 18 tables share 2 columns.
While working on it, I ran into these problems:
1) I cannot complete the job when connecting all 18 tables at once (memory error). My job looks like this, but with many more connections:
2) I tried connecting only half of them, but it takes forever (8 hours and still counting), which is not efficient at all:
3) I tried to split the job into several small ones, but still without success. I am stuck here.
Any recommendations on how to optimize this job?
Thanks so much for reading and double thanks for answering.
You can easily have dozens of lookups in your job if you optimize the way they're used. You can do the following to optimize your job:
Instead of having a single tMap with lots of lookups, you can split it into multiple tMaps like this:
           lkp_1      lkp_2      lkp_3              lkp_y
             |          |          |                  |
Source --- tMap_1 --- tMap_2 --- tMap_3 --- ... --- tMap_x --- target
This is not mandatory but this way you can easily modify your lookups.
Next, in order to optimize memory use, you can take a look at the "reload at each row" option of tMap. Instead of using the default "load once", which loads your whole lookup table into memory, you can use "reload at each row" to execute the lookup query for the current row only:
In your lookup query, you have access to a global variable defined in your tMap, for instance : (Integer)globalMap.get("myLookupKey"), in order to filter your data on the database side and only return the value that matches your lookup key.
Here is a detailed example.
There's also the option of "Store temp data" for lookups, which optimizes memory usage, as data from lookup tables is stored on disk instead of memory.
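To make the trade-off between the two lookup modes concrete, here is a minimal sketch in Python rather than Talend. It uses an in-memory SQLite database as a stand-in for the real lookup source; the table `lkp`, its columns, and the function names are hypothetical and only serve the illustration.

```python
import sqlite3


def make_db():
    # Hypothetical lookup table standing in for one of the Talend lookups.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE lkp (id INTEGER PRIMARY KEY, label TEXT)")
    conn.executemany(
        "INSERT INTO lkp VALUES (?, ?)",
        [(i, f"label_{i}") for i in range(1000)],
    )
    return conn


def lookup_load_once(conn, keys):
    # "Load once": read the whole lookup table into memory, then probe it.
    # Fast per row, but memory grows with the size of the lookup table.
    cache = dict(conn.execute("SELECT id, label FROM lkp"))
    return [cache.get(k) for k in keys]


def lookup_per_row(conn, keys):
    # "Reload at each row": one filtered query per main-flow row.
    # Memory stays flat, but every row costs a database round trip.
    out = []
    for k in keys:
        row = conn.execute(
            "SELECT label FROM lkp WHERE id = ?", (k,)
        ).fetchone()
        out.append(row[0] if row else None)
    return out
```

Both functions return the same values; the difference is only where the cost lands (memory versus number of queries), so the per-row mode tends to pay off when the lookup column is indexed on the database side.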
I have the following flow in Pentaho Data Integration to read a txt file and map it to a PostgreSQL table.
The first time I run this flow everything goes ok and the table gets populated. However, if later I want to do an incremental update on the same table, I need to truncate it and run the flow again. Is there any method that allows me to only load new/updated rows?
In the PostgreSQL Bulk Load operator, I can only see "Truncate/Insert" options and this is very inefficient, as my tables are really large.
See my implementation:
Thanks in advance!!
Looking around for possibilities, some users say that the only advantage of the Bulk Loader is performance with very large batches of rows (upwards of millions). But there are ways of countering this.
Try using the Table output step with a batch size ("Commit size" in the step) of 5000, and change the number of copies executing the step (this depends on the number of cores your processor has) to, say, 4 copies (a dual-core CPU with 2 logical cores each). You can change the number of copies by right-clicking the step in the GUI and setting the desired number.
This will parallelize the output into 4 groups of inserts, of 5000 rows per 'cycle' each. If this causes memory overload in the JVM, you can increase the available memory through the PENTAHO_DI_JAVA_OPTIONS option: simply double the amounts set for Xms (minimum) and Xmx (maximum). Mine is set to "-Xms2048m" "-Xmx4096m".
The only peculiarity I found with this step and PostgreSQL is that you need to specify the Database Fields in the step, even if the incoming rows have the exact same layout as the table.
You are looking for an incremental load. You can do it in two ways:
1) There is a step called "Insert/Update" that can be used for incremental loads. You specify the key columns to compare; then, under the fields section, select "Y" for update on the columns you want refreshed and "N" for the columns you selected as comparison keys.
2) Use Table output and uncheck the "Truncate table" option. When retrieving the data from the source table, use a variable in the WHERE clause: first get the max value from your target table, set that value to a variable, and include it in the WHERE clause of your query.
Editing here..
If your data source is a flat file then, as I said, get the max value (date/int) from the target table and join it with your data. After that, use Filter rows to keep only the incremental data.
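The "max value" approach can be sketched outside Pentaho as follows. This is only an illustration in Python with SQLite: the table `target`, its columns `id` and `payload`, and the function name are assumptions, and the sketch only picks up new rows (rows updated in place would need the Insert/Update route instead).

```python
import sqlite3


def incremental_load(conn, source_rows):
    """Insert only the source rows whose id is above the target's current max id.

    source_rows is an iterable of (id, payload) tuples, e.g. parsed from a flat file.
    Returns the number of rows inserted.
    """
    # Step 1: read the high-water mark from the target table (0 if it is empty).
    (watermark,) = conn.execute(
        "SELECT COALESCE(MAX(id), 0) FROM target"
    ).fetchone()

    # Step 2: keep only rows newer than the watermark (the "Filter rows" step).
    new_rows = [row for row in source_rows if row[0] > watermark]

    # Step 3: append them without truncating the table.
    with conn:
        conn.executemany(
            "INSERT INTO target (id, payload) VALUES (?, ?)", new_rows
        )
    return len(new_rows)
```

Running it twice on a growing source inserts only the delta on the second run, which is the behaviour the truncate/insert flow lacks.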
Hope this will help.
Is it better to put tSortRow before tUniqRow, or vice versa, for the best performance?
And how can I optimize tUniqRow?
Even if I use the "disk option", the job crashes.
I'm working on a 3-million-line file.
In order to optimize your job, you can try the following:
Use the option "use disk" on tSortRow with a smaller buffer (the default 1 million rows buffer is too big, so start with a small number of rows, 50k for instance, then increase it in order to get better performance). This will use more (smaller) files on disk, so your job will run slower, but it will consume less memory.
Try with a tSortRow (using disk) and a tAggregateSortedRow instead of tUniqRow (by specifying the unique columns in the Group By section, it acts as a tUniqRow, the columns not part of the unique key must be specified in the Operations tab each using 'First' function). As it expects the rows to already be sorted, it doesn't sort them first in memory. Note that this component requires you to know beforehand the number of rows in your flow, which you can get from a previous subjob if you're processing your data in multiple steps.
Also, if the columns you're sorting by in tSortRow come from your database table, you can use an ORDER BY clause in your tOracleInput. This way the sorting is done on the database side and your job won't consume memory for sorting.
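The reason the sort-then-aggregate combination is lighter on memory can be shown with a small Python sketch (not Talend code; the function name and the tuple-based row layout are assumptions). Once rows arrive sorted by the unique key, duplicates are adjacent, so only the current group needs to be looked at and nothing has to be remembered across groups.

```python
from itertools import groupby
from operator import itemgetter


def unique_sorted(rows, key_cols):
    """Yield the first row of each key group.

    rows must already be sorted by the columns in key_cols (tuple of indexes);
    this mirrors grouping on the unique columns and taking 'First' for the rest.
    """
    key = itemgetter(*key_cols)
    for _, group in groupby(rows, key=key):
        # groupby streams adjacent equal keys, so no set of seen keys is kept.
        yield next(group)
```

For example, `list(unique_sorted(rows, (0, 1)))` keeps one row per distinct pair of values in columns 0 and 1, provided `rows` is sorted on those two columns.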
I have to determine the complexity level (simple/medium/complex etc) of a sql by counting number of occurrences of specific keywords, sub-queries, derived tables, functions etc that constitute the sql. Additionally, I have to syntactically validate the sql.
I searched on the net and found that Perl has 2 classes named SQL::Statement and SQL::Parser which could be leveraged to achieve the same. However, I found that these classes have several limitations (such as CASE WHEN constructs not supported etc).
That being said, is it better to build a custom, concise SQL parser with Lex/Yacc or Flex/Bison instead? Which approach would be better and quicker?
Please share your thoughts on this. Also, can anyone point me to any online resources that discuss this?
Thanks
Teradata has many non ANSI features and you're considering re-implementing the parser for it.
Instead use the database server and put an 'explain' in front of your statements and process the result.
explain select * from dbc.dbcinfo;
1) First, we lock a distinct DBC."pseudo table" for read on a RowHash
to prevent global deadlock for DBC.DBCInfoTbl.
2) Next, we lock DBC.DBCInfoTbl in view dbcinfo for read.
3) We do an all-AMPs RETRIEVE step from DBC.DBCInfoTbl in view
dbcinfo by way of an all-rows scan with no residual conditions
into Spool 1 (group_amps), which is built locally on the AMPs.
The size of Spool 1 is estimated with low confidence to be 432
rows (2,374,272 bytes). The estimated time for this step is 0.01
seconds.
4) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 0.01 seconds.
This will also validate your SQL.
I am new to DB2. I want to select around 2 million rows with a single query that selects and displays the first 5000 rows and then, in the background, selects the next 5000 rows, and so on until the end of the data. Please help me out with how to write this query, or with which function to use.
Sounds like you want what's known as blocking. However, this isn't actually handled (not the way you're thinking of) at the database level - it's handled at the application level. You'd need to specify your platform and programming language for us to help there. Although if you're expecting somebody to actually read 2 million rows, it's going to take a while... At one row a second, that's 23 straight days.
The reason that SQL doesn't really perform this 'natively' is that it's (sort of) less efficient. Also, SQL is (by design) set up to operate over the entire set of data, both conceptually and syntactically.
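Since the platform and language are not specified, here is only a rough sketch of what application-level blocking can look like, written in Python against the standard DB-API cursor interface. SQLite is used as a stand-in for DB2, and the function name and block size are assumptions.

```python
import sqlite3


def fetch_in_blocks(conn, query, block_size=5000):
    """Run one query and yield its result in blocks of at most block_size rows."""
    cur = conn.cursor()
    cur.execute(query)
    while True:
        # fetchmany pulls the next block from the open cursor;
        # an empty list means the result set is exhausted.
        block = cur.fetchmany(block_size)
        if not block:
            break
        yield block
```

The application can display the first block as soon as it arrives and keep pulling the remaining blocks in the background; the query itself is still executed only once.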
You can use one of the newer DB2 features that bring in paging in the style of Oracle or MySQL (LIMIT/OFFSET, per the linked post): https://www.ibm.com/developerworks/mydeveloperworks/blogs/SQLTips4DB2LUW/entry/limit_offset?lang=en
At the same time, you can influence the optimizer with OPTIMIZE FOR n ROWS and FETCH FIRST n ROWS ONLY. If you are only going to read, it is better to add the FOR READ ONLY clause to the query; this increases concurrency, and the cursor will not be updatable. Also choose a suitable isolation level; in this case you could even use uncommitted read (WITH UR). Locking the table beforehand can also help.
Do not forget the common practices like: index or cluster index, retrieve only the necessary columns, etc. and always analyze the access plan via the Explain facility.
I have a solution that can be parallelized, but I don't (yet) have experience with hadoop/nosql, and I'm not sure which solution is best for my needs. In theory, if I had unlimited CPUs, my results should return back instantaneously. So, any help would be appreciated. Thanks!
Here's what I have:
1000s of datasets
dataset keys:
- all datasets have the same keys
- 1 million keys (this may later be 10 or 20 million)
dataset columns:
- each dataset has the same columns
- 10 to 20 columns
- most columns are numerical values for which we need to aggregate on (avg, stddev, and use R to calculate statistics)
- a few columns are "type_id" columns, since in a particular query we may want to only include certain type_ids
web application:
- user can choose which datasets they are interested in (anywhere from 15 to 1000)
- application needs to present: key, and aggregated results (avg, stddev) of each column
updates of data:
- an entire dataset can be added, dropped, or replaced/updated
- would be cool to be able to add columns. But, if required, can just replace the entire dataset.
- never add rows/keys to a dataset - so don't need a system with lots of fast writes
infrastructure:
- currently two machines with 24 cores each
- eventually, want ability to also run this on amazon
I can't precompute my aggregated values, but since each key is independent, this should be easily scalable. Currently, I have this data in a postgres database, where each dataset is in its own partition.
- partitions are nice, since can easily add/drop/replace partitions
- database is nice for filtering based on type_id
- databases aren't easy for writing parallel queries
- databases are good for structured data, and my data is not structured
As a proof of concept I tried out hadoop:
- created a tab separated file per dataset for a particular type_id
- uploaded to hdfs
- map: retrieved a value/column for each key
- reduce: computed average and standard deviation
From my crude proof-of-concept, I can see this will scale nicely, but I can also see that hadoop/hdfs has latency, and I've read that it's generally not used for real-time querying (even though I'm ok with returning results back to users in 5 seconds).
Any suggestion on how I should approach this? I was thinking of trying HBase next to get a feel for that. Should I instead look at Hive? Cassandra? Voldemort?
thanks!
Hive or Pig don't seem like they would help you. Essentially each of them compiles down to one or more map/reduce jobs, so the response cannot be within 5 seconds.
HBase may work, although your infrastructure is a bit small for optimal performance. I don't understand why you can't pre-compute summary statistics for each column. You should look up computing running averages so that you don't have to do heavyweight reduces.
Check out http://en.wikipedia.org/wiki/Standard_deviation
stddev(X) = sqrt(E[X^2] - (E[X])^2)
This implies that you can get the stddev of AB (datasets A and B combined) by computing
sqrt(E[AB^2] - (E[AB])^2), where E[AB^2] is (sum(A^2) + sum(B^2))/(|A|+|B|) and, likewise, E[AB] is (sum(A) + sum(B))/(|A|+|B|).
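As a small illustration of that identity, the sketch below keeps three pre-computed numbers per dataset (count, sum, sum of squares) and merges them on demand. It is written in Python with hypothetical function names, and it computes the population standard deviation; with very large values the subtraction can lose precision, so treat it as a sketch rather than a production routine.

```python
import math


def summarize(values):
    # Per-dataset summary that can be pre-computed once: (count, sum, sum of squares).
    return (len(values), sum(values), sum(v * v for v in values))


def merge(a, b):
    # Summaries of disjoint datasets combine by element-wise addition.
    return tuple(x + y for x, y in zip(a, b))


def stddev(summary):
    # stddev = sqrt(E[X^2] - (E[X])^2), computed from the merged summary.
    n, total, total_sq = summary
    mean = total / n
    # max(..., 0.0) guards against tiny negative values from rounding.
    return math.sqrt(max(total_sq / n - mean * mean, 0.0))
```

With this layout, answering a query over any selection of datasets only needs the stored summaries for each key, not the raw values.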
Since your data seems to be pretty much homogeneous, I would definitely take a look at Google BigQuery - You can ingest and analyze the data without a MapReduce step (on your part), and the RESTful API will help you create a web application based on your queries. In fact, depending on how you want to design your application, you could create a fairly 'real time' application.
It is a serious problem without an immediate good solution in the open source space. In the commercial space, MPP databases like Greenplum/Netezza should do.
Ideally you would need Google's Dremel (the engine behind BigQuery). We are developing an open source clone, but it will take some time...
Regardless of the engine used, I think the solution should include holding the whole dataset in memory; that should give you an idea of what size of cluster you need.
If I understand you correctly and you only need to aggregate on single columns at a time, you can store your data differently for better results.
In HBase that would look something like:
table per data column in today's setup and another single table for the filtering fields (type_ids)
row for each key in today's setup - you may want to think how to incorporate your filter fields into the key for efficient filtering - otherwise you'd have to do a two phase read (
column for each table in today's setup (i.e. few thousands of columns)
HBase doesn't mind if you add new columns and is sparse in the sense that it doesn't store data for columns that don't exist.
When you read a row, you get all the relevant values, on which you can compute the avg. etc. quite easily.
You might want to use a plain old database for this. It doesn't sound like you have a transactional system, so you can probably use just one or two large tables. SQL has problems when you need to join over large data, but since it doesn't sound like you need to join, you should be fine. You can set up indexes to find the datasets and then do the math either in SQL or in the app.