Spark union of dataframes does not give counts? - scala

I am trying to union these dataframes. I used G_ID is not null or MCOM.T_ID is not null and applied trim, but the count does not come back; it has been running for an hour and there are only 3 tasks remaining out of 300. Please suggest how I can debug this. Are nulls causing the issue, and how can I check?
val table1 = spark.sql(""" SELECT trim(C_ID) AS PC_ID FROM ab.CIDS WHERE
_UPDT_TM >= '2020-02-01 15:14:39.527' """)
val table2 = spark.sql(""" SELECT trim(C_ID) AS PC_ID FROM ab.MIDS MCOM INNER
JOIN ab.VD_MBR VDBR
ON Trim(MCOM.T_ID) = Trim(VDBR.T_ID) AND Trim(MCOM.G_ID) = Trim(VDBR.G_ID)
AND Trim(MCOM.C123M_CD) IN ('BBB', 'AAA') WHERE MCOM._UPDT_TM >= '2020-02-01 15:14:39.527'
AND Trim(VDBR.BB_CD) IN ('BBC') """)
var abc=table1.select("PC_ID").union(table2.select("PC_ID"))
I even tried this: val filtered = abc.filter(row => !row.anyNull)

It looks like you have a data skew problem. Looking at the "Summary Metrics" it's clear that (at least) three quarters of your partitions are empty, so you are eliminating most of the potential parallelization that spark can provide for you.
Though it will cause a shuffle step (where data gets moved over the network between different executors), a .repartition() will help to balance the data across all of the partitions and create more valid units of work to be spread among the available cores. This would most likely provide a speedup of your count().
As a rule of thumb, you'd likely want to call .repartition() with the parameter set to at least the number of cores in your cluster. Setting it higher will result in tasks getting completed more quickly (it's fun to watch the progress), though it adds some management overhead to the overall time the job takes to run. If the tasks are too small (i.e. not enough data per partition), then sometimes the scheduler gets confused and won't use the entire cluster either. On the whole, finding the right number of partitions is a balancing act.
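As a minimal sketch of that suggestion, assuming the abc DataFrame from the question and a placeholder partition count of 200 (pick at least the number of cores in your cluster):
// Rebalance the skewed/empty partitions before counting so the work
// is spread across all available cores; 200 is only an example value.
val balanced = abc.repartition(200)
println(balanced.count())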

You have aliased the column "C_ID" as "PC_ID", and after that you are looking for "C_ID".
Also, a union can only be performed on the same number of columns; your table1 and table2 differ in column count.
Otherwise you will get: org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the same number of columns
Please take care of these two scenarios first.
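A quick sanity check for both points might look like this (a sketch reusing the names from the question):
// Both sides of a union must expose the same number of columns, in the same order.
import org.apache.spark.sql.functions.col
val left = table1.select("PC_ID")
val right = table2.select("PC_ID")
require(left.columns.length == right.columns.length)
val abc = left.union(right)
// Filter out nulls on the unioned column explicitly instead of row.anyNull.
val filtered = abc.filter(col("PC_ID").isNotNull)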


How to use OPTIMIZE ZORDER BY in Databricks

I have two dataframes (from a Delta Lake table) that do a left join via an id column.
sd1, sd2
%sql
select
a.columnA,
b.columnB
from sd1 a
left outer join sd2 b
on a.id = b.id
The problem is that my query takes a long time. Looking for ways to improve it, I found OPTIMIZE ZORDER BY in a YouTube video.
According to the video, it seems to be useful for ordering by columns that are going to be part of the where condition.
But since the two dataframes use the id in the join condition, could it be worthwhile to order by that column?
spark.sql(f'OPTIMIZE delta.`{sd1_delta_table_path}` ZORDER BY (id)')
The logic in my head is that if we order by that column first, it will take less time to look up the rows to make the match. Is this correct?
Thanks in advance
OPTIMIZE ZORDER may help a bit by placing related data together, but its usefulness may depend on the data type used for the id column. OPTIMIZE ZORDER relies on the data skipping functionality, which just gives you min & max statistics, and that may not be useful when you have big ranges in your joins.
You can also tune file sizes, to avoid scanning too many small files.
But from my personal experience, for joins, bloom filters give better performance because they allow skipping files more efficiently than min/max data skipping. Just build a bloom filter on the id column...
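As a sketch of that last suggestion (Databricks-only SQL; the table name sd1 and the option values are my assumptions, not taken from the question):
// Databricks Delta: build a bloom filter index on the join column.
// It only applies to files written after the index is defined, so
// rewriting/optimizing the table afterwards covers the existing data.
spark.sql("""
  CREATE BLOOMFILTER INDEX ON TABLE sd1
  FOR COLUMNS (id OPTIONS (fpp = 0.1, numItems = 10000000))
""")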

Postgres extended statistics with partitioning

I am using Postgres 13 and have created a table with columns A, B and C. The table is partitioned by A with 2 possible values. Partition 1 contains 100 possible values each for B and C, whereas partition 2 has 100 completely different values for B, and 1 different value for C. I have set the statistics for both columns to maximum so that this definitely doesn't cause any issue.
If I group by B and C on either partition, Postgres estimates the number of groups correctly. However if I run the query against the base table, where I really want it, it estimates what I assume is no functional dependency between A, B and C, i.e. (p1B + p2B) * (p1C + p2C) = 200 * 101, as opposed to the reality of p1B * p1C + p2B * p2C = 10,000 + 100.
I guess I was half expecting it to sum the underlying partitions rather than use the full count of 200 B's and 101 C's that the base table can see. Moreover, if I also add A into the group by then the estimate erroneously doubles further still, as it then thinks that this set will also be duplicated for each value of A.
This all made me think that I need an extended statistic to tell it that A influences either B or C or both. However if I set one on the base partition and analyze, the value in pg_statistic_ext_data->stxdndistinct is null. Whereas if I set it on the partitions themselves, this does appear to work, though isn't particularly useful because the estimation is already correct at this level. How do I go about having Postgres estimate against the base table correctly without having to run the query against all of the partitions and unioning them together?
You can define extended statistics on a partitioned table, but PostgreSQL doesn't collect any data in that case. You'll have to create extended statistics on all partitions individually.
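For example (a sketch; the statistics names are placeholders and partition_1/partition_2 stand for your actual partitions):
-- Define ndistinct/dependencies statistics on each partition, then analyze them
CREATE STATISTICS p1_b_c_stats (ndistinct, dependencies) ON b, c FROM partition_1;
CREATE STATISTICS p2_b_c_stats (ndistinct, dependencies) ON b, c FROM partition_2;
ANALYZE partition_1;
ANALYZE partition_2;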
You can confirm that by querying the collected data after an ANALYZE:
SELECT s.stxrelid::regclass AS table_name,
s.stxname AS statistics_name,
d.stxdndistinct AS ndistinct,
d.stxddependencies AS dependencies
FROM pg_statistic_ext AS s
JOIN pg_statistic_ext_data AS d
ON d.stxoid = s.oid;
There is certainly room for improvement here; perhaps don't allow defining extended statistics on a partitioned table in the first place.
I found that I just had to turn enable_partitionwise_aggregate on to get this to estimate correctly
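For anyone trying the same fix, that is just a planner setting (shown here at session level; it can also be set in postgresql.conf):
SET enable_partitionwise_aggregate = on;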

Performance issue in combined where clauses

Question
I would like to know: How can I rewrite/alter my search query/strategy to get an acceptable performance for my end users?
The search
I'm implementing a search for our users; they can search for candidates on our system based on:
A professional group they fall into,
A location + radius,
A full text search.
The query
select v.id
from (
select
c.id,
c.ts_description,
c.latitude,
c.longitude,
g.group
from entities.candidates c
join entities.candidates_connections cc on cc.candidates_id = c.id
join system.groups g on cc.systems_id = g.id
) v
-- Group selection
where v.group = 'medical'
-- Location + radius
and earth_distance(ll_to_earth(v.latitude, v.longitude), ll_to_earth(50.87050439999999, -1.2191283)) < 48270
-- Full text search
and v.ts_description @@ to_tsquery('simple', 'nurse | doctor')
;
Data size & benchmarks
I am working with 1.7 million records
I have the 3 conditions in order of impact which were benchmarked in isolation:
Group clause: 3s & reduces to 700k records
Location clause: 8s & reduces to 54k records
Full text clause: 60s+ & reduces to 10k records
When combined they take about 71s, which is the full cost of the 3 clauses in isolation. My expectation was that when putting all 3 clauses together they would work sequentially, i.e. each on the subset of data left by the previous clause, so the total time should drop dramatically; but this has not happened.
What I've tried
All join conditions & where clauses are indexed
Notably the ts_description index (GIN) is 2GB
lat/lng is indexed with ll_to_earth() to reduce the impact inline
I nested each where clause into a different subquery in order
Changed the order of all clauses & subqueries
Increased the shared_buffers size to increase the potential cache hits
It seems you do not need the subquery, and it is also good practice to filter on numeric fields. So, instead of filtering with where v.group = 'medical', for example, create a dictionary (lookup) table and just filter with where v.group = 1:
select distinct c.id
from entities.candidates c
join entities.candidates_connections cc on cc.candidates_id = c.id
join system.groups g on cc.systems_id = g.id
where g."group" = 1  -- numeric dictionary value instead of 'medical'
and earth_distance(ll_to_earth(c.latitude, c.longitude), ll_to_earth(50.87050439999999, -1.2191283)) < 48270
and c.ts_description @@ to_tsquery('simple', 'nurse | doctor')
Also, use EXPLAIN ANALYZE to inspect your execution plan. These quick tips should help you improve it.
There were some best practices that I had not considered; I have since implemented them and gained a substantial performance increase:
tsvector Index Size Reduction
I was storing up to 25,000 characters in the tsvector, which meant that more complicated full-text search queries simply had an immense amount of work to do. I reduced this to 10,000 characters, which has made a big difference, and for my use case this is an acceptable trade-off.
Create a Materialised View
I created a materialised view that contains the join. This offloads a little bit of the work; additionally, I built my indexes on it and run a concurrent refresh on a 2-hour interval, which gives me a pretty stable table to work with.
Even though my search yields 10k records, I paginate on the front-end and only ever bring back up to 100 results to the screen, so I join onto the original table for just the 100 records I'm going to send back.
Increase RAM & utilise pg_prewarm
I increased the server RAM to give me enough space to hold my materialised view, then ran pg_prewarm on it. Keeping it in memory yielded the biggest performance increase for me, bringing a 2-minute query down to 3s.
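A rough sketch of that materialised view plus pg_prewarm setup (the view name, the index choices, and the assumption that (id, "group") is unique are mine, not from the original setup):
-- Materialise the join once so searches hit a flat, indexed table
CREATE MATERIALIZED VIEW candidate_search AS
SELECT c.id, c.ts_description, c.latitude, c.longitude, g."group"
FROM entities.candidates c
JOIN entities.candidates_connections cc ON cc.candidates_id = c.id
JOIN system.groups g ON cc.systems_id = g.id;
-- REFRESH ... CONCURRENTLY needs a unique index; assuming (id, "group") is unique
CREATE UNIQUE INDEX ON candidate_search (id, "group");
CREATE INDEX ON candidate_search USING gin (ts_description);
-- Run on a schedule, e.g. every 2 hours
REFRESH MATERIALIZED VIEW CONCURRENTLY candidate_search;
-- Pull the view into shared buffers so the search runs from memory
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('candidate_search');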

Data Lake Analytics - Large vertex query

I have a simple query which does a GROUP BY on two fields:
#facturas =
SELECT a.CodFactura,
Convert.ToInt32(a.Fecha.ToString("yyyyMMdd")) AS DateKey,
SUM(a.Consumo) AS Consumo
FROM #table_facturas AS a
GROUP BY a.CodFactura, a.DateKey;
#table_facturas has 4100 rows, but the query takes several minutes to finish. Looking at the graph explorer I see it uses 2500 vertices, because I have 2500 unique CodFactura+DateKey rows. I don't know if this is normal ADLA behaviour. Is there any way to reduce the number of vertices and execute this query faster?
First: I am not sure your query actually will compile. You would need the Convert expression in your GROUP BY or do it in a previous SELECT statement.
Secondly: In order to answer your question, we would need to know how the full query is defined. Where does #table_facturas come from? How was it produced?
Without this information, I can only give some wild speculative guesses:
If #table_facturas is coming from an actual U-SQL Table, your table is over partitioned/fragmented. This could be because:
you inserted a lot of data originally with a distribution on the grouping columns, and you either have a predicate that reduces the number of rows per partition and/or you do not have up-to-date statistics (run CREATE STATISTICS on the columns).
you did a lot of INSERT statements, each inserting a small number of rows into the table, thus creating a big number of individual files. This will "scale-out" the processing as well. Use ALTER TABLE REBUILD to recompact.
If it is coming from a fileset, you may have too many small files in the input. See if you can merge them into less, larger files.
If the above does not help, you can also try hinting a small number of rows in the query that creates #table_facturas by adding OPTION(ROWCOUNT=4000).

Reducing shuffle disk usage in Spark aggregations

I have a table in Hive which is 100 GB in size. I am trying to group by, count, and store the result as a Hive table. My hard disk has 600 GB of space; by the time the job reaches 70%, all of the disk space is occupied, so my job fails. How can I minimize the shuffle data writes?
hiveCtx.sql("select * from gm.final_orc")
.repartition(300)
.groupBy('col1, 'col2).count
.orderBy('count.desc)
.write.saveAsTable("gm.result")
In cloud-based execution environments adding more disk is usually a very easy and cheap option. If your environment does not allow this and you've verified that your shuffle settings are reasonable, e.g., compression (on by default) has not been changed, then there is only one solution: implement your own staged map-reduce, using the fact that counts can be re-aggregated via sum.
Partition your data in any way that seems fit (by date, by directory, by number of files, etc.)
Perform the counting by col1 and col2 as separate Spark actions.
Re-group and re-aggregate.
Sort.
For simplicity, let's assume that col1 is an integer. Here is how I'd break up processing into 8 separate jobs, re-aggregating their output. If col1 is not an integer, you can hash it or you can use another column.
def splitTableName(i: Int) = s"tmp.gm.result.part-$i"
// Source data
val df = hiveCtx.sql("select col1, col2 from gm.final_orc")
// Number of splits
val splits = 8
// Materialize partial aggregations
val tables = for {
i <- 0 until splits
tableName = splitTableName(i)
// If col1 % splits will create very skewed data, hash it first, e.g.,
// hash(col1) % splits. hash() uses Murmur3.
_ = df.filter('col1 % splits === i)
// repartition only if you need to, e.g., massive partitions are causing OOM
// better to increase the number of splits and/or hash to un-skew skewed data
.groupBy('col1, 'col2).count
.write.saveAsTable(tableName)
} yield hiveCtx.table(tableName)
// Final aggregation
tables.reduce(_ union _)
.groupBy('col1, 'col2)
.agg(sum('count).as("count"))
.orderBy('count.desc)
.write.saveAsTable("gm.result")
// Cleanup temporary tables
(0 until splits).foreach { i =>
hiveCtx.sql(s"drop table ${splitTableName(i)}")
}
If col1 and col2 are so diverse and/or so large that the partial aggregation storage is causing disk space issues then you have to consider one of the following:
Smaller number of splits will generally use less disk space.
Sorting on col1 will help (because of Parquet run length encoding) but that would slow down execution.
Creating splits that are independent in some other way, e.g., finding the distinct values of col1 and partitioning those into groups.
If you are extremely short on disk space you'd have to implement multi-step re-aggregation. The simplest approach is to generate the splits one at a time and keep a running aggregate. The execution would be much slower but it will use a lot less disk space.
Hope this helps!