Data Flow left outer Join results in empty data set when executed, but not on debuging - azure-data-factory

We have a Left outer join configured between two data sets. When executed, the Join shows no results, although when debugging it does. Both dataset of the join contains data and the condition is fulfilled, although with a left outer join I would expect at least the content from the first data set.
The dataflow pipeline contains two similar joins and in both we have the same behavior.
The datasets involved in the join contains from 20K to 60K records. However, the flow loads a couple of datasets around 1 million records. We would expect some error if this was related to memory though...

Related

How to use OPTIMIZE ZORDER BY in Databricks

I have two dataframes(from a delta lake table) that do a left join via an id column.
sd1, sd2
%sql
select
a.columnA,
b.columnB,
from sd1 a
left outer join sd2 b
on a.id = b.id
The problem is that my query takes a long time, looking for ways to improve the results I have found OPTIMIZE ZORDER BY Youtube video
according to the video seems to be useful when ordering columns if they are going to be part of the where condition`.
But since the two dataframes use the id in the join condition, could it be interesting to order that column?
spark.sql(f'OPTIMIZE delta.`{sd1_delta_table_path}` ZORDER BY (id)')
the logic that follows in my head is that if we first order that column then it will take less time to look for them to make the match. Is this correct ?
Thanks ind advance
OPTIMIZE ZORDER may help a bit by placing related data together, but it's usefulness may depend on the data type used for ID column. OPTIMIZE ZORDER relies on the data skipping functionality that just gives you min & max statistics, but may not be useful when you have big ranges in your joins.
You can also tune a file sizes, to avoid scanning of too many smaller files.
But from my personal experience, for joins, bloom filters give better performance because they allow to skip files more efficiently than data skipping. Just build bloom filter on the ID column...

Spark union of dataframes does not give counts?

I am trying to union these dataframes ,i used G_ID is not Null or MCOM.T_ID is not null and used trim, the count does not come up ,its running since 1hr. there are only 3 tasks remaining out of 300 tasks.Please suggest how can i debug this ? is null causing issue how can i debug ?
val table1 = spark.sql(""" SELECT trim(C_ID) AS PC_ID FROM ab.CIDS WHERE
_UPDT_TM >= '2020-02-01 15:14:39.527' """)
val table2 = spark.sql(""" SELECT trim(C_ID) AS PC_ID FROM ab.MIDS MCOM INNER
JOIN ab.VD_MBR VDBR
ON Trim(MCOM.T_ID) = Trim(VDBR.T_ID) AND Trim(MCOM.G_ID) = Trim(VDBR.G_ID)
AND Trim(MCOM.C123M_CD) IN ('BBB', 'AAA') WHERE MCOM._UPDT_TM >= '2020-02-01 15:14:39.527'
AND Trim(VDBR.BB_CD) IN ('BBC') """)
var abc=table1.select("PC_ID").union(table2.select("PC_ID"))
even tried this --> filtered = abc.filter(row => !row.anyNull);
It looks like you have a data skew problem. Looking at the "Summary Metrics" it's clear that (at least) three quarters of your partitions are empty, so you are eliminating most of the potential parallelization that spark can provide for you.
Though it will cause a shuffle step (where data gets moved over the network between different executors), a .repartition() will help to balance the data across all of the partitions and create more valid units of work to be spread among the available cores. This would most likely provide a speedup of your count().
As a rule of thumb, you'd likely want to call .repartition() with the parameter set to at least the number of cores in your cluster. Setting it higher will result in tasks getting completed more quickly (it's fun to watch the progress), though adds some management overhead to the overall time the job will take to run. If the tasks are too small (i.e. not enough data per partition), then sometime the scheduler gets confused and won't use the entire cluster either. On the whole, finding the right number of partitions is a balancing act.
You have added alias to the column "C_ID" as "PC_ID". and after that you are looking for "C_ID".
And Union can be performed on same number of columns, your table1 and table2 has different in columns size.
otherwise you will get: org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the same number of columns
Please take care of these two scenario first.

Performance issue in combined where clauses

Question
I would like to know: How can I rewrite/alter my search query/strategy to get an acceptable performance for my end users?
The search
I'm implementing a search for our users, they are provided the ability to search for candidates on our system based on:
A professional group they fall into,
A location + radius,
A full text search.
The query
select v.id
from (
select
c.id,
c.ts_description,
c.latitude,
c.longitude,
g.group
from entities.candidates c
join entities.candidates_connections cc on cc.candidates_id = c.id
join system.groups g on cc.systems_id = g.id
) v
-- Group selection
where v.group = 'medical'
-- Location + radius
and earth_distance(ll_to_earth(v.latitude, v.longitude), ll_to_earth(50.87050439999999, -1.2191283)) < 48270
-- Full text search
and v.ts_description ## to_tsquery('simple', 'nurse | doctor')
;
Data size & benchmarks
I am working with 1.7 million records
I have the 3 conditions in order of impact which were benchmarked in isolation:
Group clause: 3s & reduces to 700k records
Location clause: 8s & reduces to 54k records
Full text clause: 60s+ & reduces to 10k records
When combined they seem to take 71s, which is the full impact of the 3 queries in isolation, my expectation was that when putting all 3 clauses together they would work sequentially i.e on the subset of data from the previous clause therefore the timing should reduce dramatically - but this has not happened.
What I've tried
All join conditions & where clauses are indexed
Notably the ts_description index (GIN) is 2GB
lat/lng is indexed with ll_to_earth() to reduce the impact inline
I nested each where clause into a different subquery in order
Changed the order of all clauses & subqueries
Increased the shared_buffers size to increase the potential cache hits
It seems you do not need to subquery, and it is also a good practice to filter with numeric fields, so, instead of filtering with where v.group = 'medical' for example, create a dictionary and just filter with where v.group = 1
select
DISTINCT c.id,
from entities.candidates c
join entities.candidates_connections cc on cc.candidates_id = c.id
join system.groups g on cc.systems_id = g.id
where tablename.group = 1
and earth_distance(ll_to_earth(v.latitude, v.longitude), ll_to_earth(50.87050439999999, -1.2191283)) < 48270
and v.ts_description ## to_tsquery(0, 1 | 2)
also, use EXPLAIN ANALYSE to see and check your execution plan. These quick tips will help you improve it clearly.
There were some best practice cases that I had not considered, I have subsequently implemented these to gain a substantial performance increase:
tsvector Index Size Reduction
I was storing up to 25,000 characters in the tsvector, this meant that when more complicated full text search queries were used there was just an immense amount of work to do, I reduced this down to 10,000 which has made a big difference and for my use case this is an acceptable trade-off.
Create a Materialised View
I created a materialised view that contains the join, this offloads a little bit of the work, additionally I built my indexes on there and run a concurrent refresh on a 2 hour interval. This gives me a pretty stable table to work with.
Even though my search yields 10k records I end up paginating on the front-end so I only ever bring back up to 100 results on the screen, this allows me to join onto the original table for only the 100 records I'm going to send back.
Increase RAM & utilise pg_prewarm
I increased the server RAM to give me enough space to store my materialised view into, then ran pg_prewarm on my materialised view. Keeping it in memory yielded the biggest performance increase for me, bringing a 2m query down to 3s.

using talend to output data not matched in lookup table

I know how to use talend's tMap component to output matched data in lookup data, however, I don't know how to output these rows that is not matched with data in lookup table. Maybe a simple question to senior user. Thanks all the way.
Regards,
Joe
Two steps are required to gather rejected rows:
On the left hand side you have to set Join Model to Inner Join on the join you want to find rejected rows
On the right hand side set Catch lookup inner join reject to true. This row will get all rejected entries. So you can create one row which gets all found entries and another row which delivers only the rejected rows
Usually this leads to a tMap with two output rows in your job.
in tMap output table there is setting options. Go to that and there you will see couple of options like "Catch lookup inner join reject" & "catch output reject" - you can set them to false/true based on your need. My guess is that you are looking for "Catch lookup inner join reject".

Where clause versus join clause in Spark SQL

I am writing a query to get records from Table A which satisfies a condition from records in Table B. For example:
Table A is:
Name Profession City
John Engineer Palo Alto
Jack Doctor SF
Table B is:
Profession City NewJobOffer
Engineer SF Yes
and I'm interested to get Table c:
Name Profession City NewJobOffer
Jack Engineer SF Yes
I can do this in two ways using where clause or join query which one is faster and why in spark sql?
Where clause to compare the columns add select those records or join on the column itself, which is better?
It's better to provide filter in WHERE clause. These two expressions are not equivalent.
When you provide filtering in JOIN clause, you will have two data sources retrieved and then joined on specified condition. Since join is done through shuffling (redistributing between executors) data first, you are going to shuffle a lot of data.
When you provide filter in WHERE clause, Spark can recognize it and you will have two data sources filtered and then joined. This way you will shuffle less amount of data. What might be even more important is that this way Spark may also be able to do a filter-pushdown, filtering data at datasource level, which means even less network pressure.