Multiple joins in Pentaho - left-join

I am trying to join 5 flows, in which the first one is the driver and the others are left outer joined to it. I have used Ab Initio in the past, where a single join component lets you specify the kind of join for each input flow. I couldn't find any such step in Pentaho, so I have to rely on Merge Join, which left outer joins only two tables at a time; I then take its result into the next join, and so on. I am planning to do all of that in a single transformation.
What I am worried about is that, since Pentaho runs all the steps in parallel, it might start to run a join that comes much later in the flow without waiting for an earlier join to complete. Is this a valid concern? If so, how do you tackle it in a single transformation?

That's correct: you can only join two steps at a time.
As for your second point: no, the steps do execute in parallel, but your second join will wait for your first join to finish feeding it rows, so you will still get a correct result.
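For intuition, the chain of Merge Join steps computes the row-stream equivalent of nested left joins, roughly like this SQL (table and key names are placeholders, not from the question):
SELECT *
FROM driver d
LEFT JOIN flow2 ON flow2.key = d.key
LEFT JOIN flow3 ON flow3.key = d.key
LEFT JOIN flow4 ON flow4.key = d.key
LEFT JOIN flow5 ON flow5.key = d.key;
Each Merge Join step can only emit a row after it has received the matching rows from both of its inputs, which is why a downstream join naturally waits on the one before it.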

Related

How can I achieve a paginated join, with the first page returned quickly?

I am looking to join multiple big tables in the OLAP layer to power the UI. Since the tables are really large, the response for each join query takes too long; I want results in under 3 seconds. The catch is that I don't want all of the joined data at once, because I only display a small subset of the result in the UI at any given point. Only user interaction requires me to show the next subset of the result.
I am looking for a strategy to build a system where I can run the same join queries, but initially only a small subset is joined and used to power the UI, while the remaining subsets are joined in the background and pulled into the UI when required. Is this the right way to approach really big joins? If so, how can I design such a system?
You can use a WITH HOLD cursor:
START TRANSACTION;
DECLARE c CURSOR WITH HOLD FOR SELECT /* big query */;
FETCH 50 FROM c;
COMMIT;
The COMMIT will take a long time, as it materializes the whole result set, but the FETCH 50 can be reasonably fast (or not, depending on the query).
You can then continue fetching from the cursor; each later page is just another FETCH, as in the sketch below. Don't forget to close the cursor when you are done:
CLOSE c;
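For example, the next UI pages (issued from the same database session, since cursor names are session-local) are just further fetches before the final CLOSE:
FETCH 50 FROM c;   -- rows 51-100
FETCH 50 FROM c;   -- rows 101-150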

Spark SQL (Scala) - What's the most efficient way to join two very large tables that have skewed keys

My current job is to create ETL processes with SparkSQL/Scala using Spark 2.2 with Hive support (all tables are on Hive warehouse/HDFS).
One specific process requires joining a table with 1b unique records with another one of 5b unique records.
The join key is skewed, in the sense that some keys are repeated far more often than others, but our Hive is not configured to skew by that field, nor is it possible to implement that in the current scenario.
Currently I read each table into two separate dataframes and perform a join between them. I tried an inner join and a right outer join on the 5b table to see if there was any performance gain (I'd drop the rows with null records afterwards). I could not notice one, but it could be caused by cluster instability (I am not sure whether a right join requires less shuffling than an inner one).
I have tried filtering the keys of the 1b table on the 5b one by creating a temp view and adding a WHERE clause to the SELECT statement on the 5b table, and still couldn't notice any performance gains (note: it's not possible to collect the unique keys from the 1b table, since that would trigger a memory exception). I have also tried doing the entire thing in one SQL query, but again, no luck.
I've read some people talking about creating a PairRDD and performing partitionBy with a hashPartitioner, but this seems outdated since the release of dataframes. Right now I'm looking for some solid guidance on dealing with this join of two very large datasets.
Edit: there's an answer here that deals with pretty much the same problem I have, but it's 2 years old, and it simply tells me to first join a broadcasted set of the records whose keys repeat a lot, and then perform another join with the rest of the records, unioning the results. Is this still the best approach for my problem? I have skewed keys on both tables.
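One technique that still applies in the dataframe era is key salting: spread each hot key across N artificial sub-keys so no single task receives the whole key. A minimal sketch, assuming dataframes bigDf (5b rows) and smallDf (1b rows) joined on a column joinKey (all names here are placeholders, not from the question):
import org.apache.spark.sql.functions._

val numSalts = 32  // assumption: tune this to the observed skew

// Tag each row of the bigger table with a random salt in [0, numSalts).
val saltedBig = bigDf.withColumn("salt", (rand() * numSalts).cast("int"))

// Replicate each row of the smaller table once per salt value so that
// every (joinKey, salt) pair on the big side finds its match.
val saltedSmall = smallDf.withColumn(
  "salt", explode(array((0 until numSalts).map(lit): _*)))

// Hot keys are now spread across numSalts tasks instead of landing on one.
val joined = saltedBig
  .join(saltedSmall, Seq("joinKey", "salt"))
  .drop("salt")
The trade-off is that the smaller table is replicated numSalts times, so this is usually combined with the split-and-broadcast approach from the older answer: salt only the handful of hottest keys and join the long tail normally.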

PostgreSQL Results Same Explanation on Different Queries

I have some complex queries that produce the same result; the only difference is execution order. For example, one query performs the selection before the join, while the other performs the join first and then the selection. However, when I read the explanation (on the Explain tab in PgAdmin III), both queries have the same diagram.
Why?
I'm not a pro at explaining this with all the correct terminology, but essentially the planner attempts to find the most efficient way to execute the statement. It does this by breaking the query down into simpler sub-statements; just because you write it one way doesn't mean that is the order in which the plan will be executed. It's a bit like precedence in arithmetic (brackets, multiply, divide, etc.).
Certain operations will influence the order of execution, enabling you to "tune" your queries to make them more efficient: http://www.postgresql.org/docs/current/interactive/performance-tips.html
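As a hypothetical illustration (orders and customers are placeholder tables, not from the question), both of the following typically produce the same plan, because the planner pushes the filter below the join either way:
EXPLAIN
SELECT o.*
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE c.country = 'US';

EXPLAIN
SELECT o.*
FROM orders o
JOIN (SELECT * FROM customers WHERE country = 'US') c
  ON c.id = o.customer_id;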

Using a UNION or UNION ALL on two select statements makes them incredibly slower

I have two queries, let's call them Query A and Query B.
Both of these queries run in under a second for the scenario I'm testing; Query A returns 1 result and Query B returns 0 results.
If I union (or union all) these two queries, it takes over a minute to return the (expected) 1 result.
Both queries select the same columns from the same tables. I could potentially rewrite this entire thing without a union by using a highly conditional WHERE clause, but I was trying to get away from doing that.
Any ideas? I'm not sure how much of the exact query and schema I can get away with sharing, but I'm happy to provide what I can.
This is on MSSQL 2008 if it matters to anyone's response.
I would try looking at the execution plans within Management Studio for the individual queries, and then compare those to the execution plan for the query containing the UNION.
If there's that drastic of a difference in the execution times, I would imagine that there's something wrong with the execution plan for the UNION'd query. Identifying what's different will help point you (and maybe us) in the right direction on what the underlying problem is.
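If you prefer text plans to the graphical ones, a sketch of one way to capture them (each SET must sit in its own batch):
SET SHOWPLAN_TEXT ON;
GO
-- Statements submitted now return their estimated plans instead of executing;
-- submit Query A, Query B, and the UNION'd query here and compare the output.
GO
SET SHOWPLAN_TEXT OFF;
GO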
Separate clauses in a UNION that are very similar and hit the same tables can be merged into one query by the optimiser. You can see this by the lack of a UNION operator in the query plan. I've seen similar things before, but rarely.
What you can do is a SELECT ... INTO #temp ... for the first query, followed by an INSERT INTO #temp ... for the second (see the sketch below).
Now, where did I read this...
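A minimal sketch of that workaround, with placeholder table and column names standing in for the two real queries:
-- Query A's rows create the temp table.
SELECT OrderId, Total
INTO #results
FROM Orders
WHERE Status = 'OPEN';

-- Query B's rows are appended to it.
INSERT INTO #results (OrderId, Total)
SELECT OrderId, Total
FROM Orders
WHERE Status = 'PENDING';

-- Equivalent to UNION ALL; add DISTINCT to mimic a plain UNION.
SELECT OrderId, Total FROM #results;

DROP TABLE #results;
Because each SELECT is planned on its own, the two fast single-query plans are preserved instead of being merged into one slow combined plan.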
Are they both doing table scans? It sounds like it might be exceeding cache capacity and you're caching to disk.
Even if they are from the same table, the records would probably lock independently.

sqlite3_enable_shared_cache and sqlite_backup_init slowing execution on iPhone

I have some relatively complex sqlite queries running in my iPhone app, and some are taking way too much time (upwards of 15 seconds). There are only 430 or so records in the database. One thing I've noticed is that opening a new database connection (which I only do once) and stepping through query results (with sqlite3_step()) causes sqlite3_backup_init() and sqlite3_enable_shared_cache() to run, which take 4450ms and 3720ms respectively of processing time throughout the test period. I tried using sqlite3_enable_shared_cache(0); to disable shared caching before opening the database connection, but this seems to have no effect.
Does anyone have any idea how to disable these so I can get some sort of speed improvement?
Well, I suppose this doesn't directly answer the question, but part of the problem was my use of a cross join instead of a left join. That reduced the query time from about 4000ms to 60ms. Also, the backup_init function is no longer called and enable_shared_cache doesn't spin as much.
I fixed my app by replacing an inner join with a left join; my data allows it.
If your data does not allow it, consider adding a WHERE clause:
SELECT * FROM apples a INNER JOIN bananas b ON b.id = a.id;
vs.
SELECT * FROM apples a LEFT JOIN bananas b ON b.id = a.id WHERE b.id IS NOT NULL;