Efficient way to check if there are NA's in pyspark - pyspark

I have a pyspark dataframe, named df. I want to know if his columns contains NA's, I don't care if it is just one row or all of them. The problem is, my current way to know if there are NA's, is this one:
from pyspark.sql import functions as F
if (df.where(F.isnull('column_name')).count() >= 1):
print("There are nulls")
else:
print("Yey! No nulls")
The issue I see here, is that I need to compute the number of nulls in the whole column, and that is a huge amount of time wasted, because I want the process to stop when it finds the first null.
I thought about this solution but I am not sure it works (because I work in a cluster with a lot of other people so the execution time depends on the multiple jobs other people run in the cluster, so I can't compare the two approaches in even conditions):
(df.where(F.isnull('column_name')).limit(1).count() == 1)
Does adding the limit help ? Are there more efficient ways to achieve this ?

There is no non-exhaustive search for something that isn't there.
We can probably squeeze a lot more performance out of your query for the case where a record with a null value exists (see below), but what about when it doesn't? If you're planning on running this query multiple times, with the answer changing each time, you should be aware (I don't mean to imply that you aren't) that if the answer is "there are no null values in the entire dataframe", then you will have to scan the entire dataframe to know this, and there isn't a fast way to do that. If you need this kind of information frequently and the answer can frequently be "no", you'll almost certainly want to persist this kind of information somewhere, and update it whenever you insert a record that might have null values by checking just that record.
Don't use count().
count() is probably making things worse.
In the count case Spark used wide transformation and actually applies LocalLimit on each partition and shuffles partial results to perform GlobalLimit.
In the take case Spark used narrow transformation and evaluated LocalLimit only on the first partition.
In other words, .limit(1).count() is likely to select one example from each partition of your dataset, before selecting one example from that list of examples. Your intent is to abort as soon as a single example is found, but unfortunately, count() doesn't seem smart enough to achieve that on its own.
As alluded to by the same example, though, you can use take(), first(), or head() to achieve the use case you want. This will more effectively limit the number of partitions that are examined:
If no shuffle is required (no aggregations, joins, or sorts), these operations will be optimized to inspect enough partitions to satisfy the operation - likely a much smaller subset of the overall partitions of the dataset.
Please note, count() can be more performant in other cases. As the other SO question rightly pointed out,
neither guarantees better performance in general.
There may be more you can do.
Depending on your storage method and schema, you might be able to squeeze more performance out of your query.
Since you aren't even interested in the value of the row that was chosen in this case, you can throw a select(F.lit(True)) between your isnull and your take. This should in theory reduce the amount of information the workers in the cluster need to transfer. This is unlikely to matter if you have only a few columns of simple types, but if you have complex data structures, this can help and is very unlikely to hurt.
If you know how your data is partitioned and you know which partition(s) you're interested in or have a very good guess about which partition(s) (if any) are likely to contain null values, you should definitely filter your dataframe by that partition to speed up your query.

Related

How do I efficiently execute large queries?

Consider the following demo schema
trades:([]symbol:`$();ccy:`$();arrivalTime:`datetime$();tradeDate:`date$(); price:`float$();nominal:`float$());
marketPrices:([]sym:`$();dateTime:`datetime$();price:`float$());
usdRates:([]currency$();dateTime:`datetime$();fxRate:`float$());
I want to write a query that gets the price, translated into USD, at the soonest possible time after arrivalTime. My beginner way of doing this has been to create intermediate tables that do some filtering and translating column names to be consistent and then using aj and ajo to join them up.
In this case there would only be 2 intermediate tables. In my actual case there are necessarily 7 intermediate tables and records counts, while not large by KDB standards, are not small either.
What is considered best practice for queries like this? It seems to me that creating all these intermediate tables is resource hungry. An alternative to the intermediate tables is 2 have a very complicated looking single query. Would that actually help things? Or is this consumption of resources just the price to pay?
For joining to the next closest time after an event take a look at this question:
KDB reverse asof join (aj) ie on next quote instead of previous one
Assuming that's what your looking for then you should be able to perform your price calculation either before or after the join (depending on the size of your tables it may be faster to do it after). Ultimately I think you will need two (potentially modified as per above) aj's (rates to marketdata, marketdata to trades).
If that's not what you're looking for then I could give some more specifics although some sample data would be useful.
My thoughts:
The more verbose/readible your code, the better for you to debug later and any future readers/users of your code.
Unless absolutely necessary, I would try and avoid creating 7 copies of the same table. If you are dealing with large tables memory could quickly become a concern. Particularly if the processing takes a long time, you could be creating large memory spikes. I try to keep to updating 1-2 variables at different stages e.g.:
res: select from trades;
res:aj[`ccy`arrivalTime;
res;
select ccy:currency, arrivalTime:dateTime, fxRate from usdRates
]
res:update someFunc fxRate from res;
Sean beat me to it, but aj for a time after/ reverse aj is relatively straight forward by switching bin to binr in the k code. See the suggested answer.
I'm not sure why you need 7 intermediary tables unless you are possibly calculating cross rates? In this case I would typically join ccy1 and ccy2 with 2 ajs to the same table and take it from there.
Although it may be unavoidable in your case if you have no control over the source data, similar column names / greater consistency across schemas is generally better. e.g. sym vs symbol

Why does paginating with offset using PSQL make sense?

I've been looking into pagination (paginate by timestamp) with a PSQL dbms. My approach currently is to build a b+ index to greatly reduce the cost of finding the start of the next chunk. But everywhere I look in tutorials and on NPM modules like express-paginate (https://www.npmjs.com/package/express-paginate), people seem to get chunks using offset one way or the other or fetching all the data anyways but simply sending them in chunks which to me doesn't seem to be a complete optimization that pagination is for.
I can see that they're still making an optimization by lazy loading and streaming the chunks (thus saving bandwidth and any download/processing time on the client-side), but since offset on psql still requires scanning previous rows. In the worst case where a user wants to view all the data, doesn't this approach have a very high server cost since if you have per say n chunks, you're accessing the first chunk n times, the second chunk n-1 times, the third chunk n-2 times, etc. I understand that this is really in terms of IOs so it's not that expensive but it still bothers me?
Am I missing something very obvious here? I feel like I am because there seems to be a lot more established and experienced engineers who seem to be using this approach. I'm guessing there is some part of the equation or mechanism that I'm just missing from my understanding.
No, you understand this quite well.
The reason why so many people and tools still advocate pagination with OFFSET and LIMIT (or FETCH FIRST n ROWS ONLY, to use the standard's language) is that they don't know a lot about databases. It is easy to understand LIMIT and OFFSET even if you the word “index” to you has no other meaning than ”the last pages in a book”.
There is another reason: to implement key set pagination, you must have an ORDER BY clause in your query, that ORDER BY clause has to contain a unique column, and you have to create an index that supports that ordering.
Moreover, your database has to be able to handle conditions like
... WHERE (name, id) > ('last_found', 42)
and support a multi-column index scan for them.
Since many tools strive to support several database systems, they are likely to go for the simple but inefficient method that works with every query on most database systems.

How to split a large data frame and use the smaller parts to do multiple broadcast joins in Spark?

Let's say we have two very large data frames - A and B. Now, I understand if I use same hash partitioner for both RDDs and then do the join, the keys will be co-located and the join might be faster with reduced shuffling (the only shuffling that will happen will be when the partitioner changes on A and B).
I wanted to try something different though - I want to try broadcast join like so -> let's say the B is smaller than A so we pick B to broadcast but B is still a very big dataframe. So, what we want to do is to make multiple data frames out of B and then send each as broadcast to be joined on A.
Has anyone tried this?
To split one data frame into many I am only seeing randomSplit method but that doesn't look so great an option.
Any other better way to accomplish this task?
Thanks!
Has anyone tried this?
Yes, someone already tried that. In particular GoDataDriven. You can find details below:
presentation - https://databricks.com/session/working-skewed-data-iterative-broadcast
code - https://github.com/godatadriven/iterative-broadcast-join
They claim pretty good results for skewed data, however there are three problems you have to consider doing this yourself:
There is no split in Spark. You have to filter data multiple times or eagerly cache complete partitions (How do I split an RDD into two or more RDDs?) to imitate "splitting".
Huge advantage of broadcast is reduction in the amount of transferred data. If data is large, then amount of data to be transferred can actually significantly increase: (Why my BroadcastHashJoin is slower than ShuffledHashJoin in Spark)
Each "join" increases complexity of the execution plan and with long series of transformations things can get really slow on the driver side.
randomSplit method but that doesn't look so great an option.
It is actually not a bad one.
Any other better way to accomplish this task?
You may try to filter by partition id.

ef-core - bool index vs historical table

I´m using aspnet-core, ef-core with sql server. I have an 'order' entity. As I'm expecting the orders table to be large and a the most frequent query would get the active orders only for certain customer (active orders are just a tiny fraction of the whole table) I like to optimize the speed of the query but I can decide from this two approaches:
1) I don't know if this is possible as I haven't done this before, but I was thinking about creating a Boolean column named 'IsActive' and make it an index thus when querying only Active orders would be faster.
2) When an order becomes not active, move the order to another table, i.e HistoricalOrders, thus keeping the orders table small.
Which of the two would have better results?, or none of this is a good solution and a third approach could be suggested?
If you want to partition away cold data then a leading boolean index column is a valid way to do that. That column must be added to all indexes that you want to hot/cold partition. This includes the clustered index. This is quite awkward. The query optimizer requires that you add a dummy predicate where IsActive IN (0, 1) to make it able to still seek on such indexes. Of course, this will now also touch the cold data. So you probably need to know the concrete value of IsActive or try the 1 value first and be sure that it matches 99% of the time.
Depending on the schema this can be impractical. I have never seen a good case for this but I'm sure it exists.
A different way to do that is to use partitioning. Here, the query optimizer is used to probing multiple partitions anyway but again you don't want it to probe cold data. Even if it does not find anything this will pull pages into memory making the partitioning moot.
The historical table idea (e.g. HistoricalOrders) is the same thing in different clothes.
So in order to make this work you need:
Modify all indexes that you are about (likely all), or partition, or create a history table.
Have a way to almost never need to probe the cold partition.
I think (2) kills it for most cases.
Among the 3 solutions I'd probably pick the indexing solution because it is simplest. If you are worried about people making mistakes by writing bad queries all the time, I'd pick a separate table. That makes mistakes hard but makes the code quite awkward.
Note, that many indexes are already naturally partitioned. Indexes on the identity column or on an increasing datetime column are hot at the end and cold elsewhere. An index on (OrderStatus INT, CreateDateTime datetime2) would have one hot spot per order status and be cold otherwise. So those are already solved.
Some further discussion.
Before think about the new table HistoricalOrders,Just create a column name IsActive and test it with your data.You don't need to make it as index column.B'cos Indexes eat up storage and it slows down writes and updates.So we must very careful when we create an index.When you query the data do it as shown below.On the below query where data selection (or filter) is done on the SQL srever side (IQueryable).So it is very fast.
Note : Use AsNoTracking() too.It will boost the perfromnace too.
var activeOrders =_context.Set<Orders>().Where(o => o.IsActive == true).AsNoTracking()
.ToList();
Reference : AsNoTracking()

Merging huge sets (HashSet) in Scala

I have two huge (as in millions of entries) sets (HashSet) that have some (<10%) overlap between them. I need to merge them into one set (I don't care about maintaining the original sets).
Currently, I am adding all items of one set to the other with:
setOne ++= setTwo
This takes several minutes to complete (after several attempts at tweaking hashCode() on the members).
Any ideas how to speed things up?
You can get slightly better performance with Parallel Collections API in Scala 2.9.0+:
setOne.par ++ setTwo
or
(setOne.par /: setTwo)(_ + _)
There are a few things you might wanna try:
Use the sizeHint method to keep your sets at the expected size.
Call useSizeMap(true) on it to get better hash table resizing.
It seems to me that the latter option gives better results, though both show improvements on tests here.
Can you tell me a little more about the data inside the sets? The reason I ask is that for this kind of thing, you usually want something a bit specialized. Here's a few things that can be done:
If the data is (or can be) sorted, you can walk pointers to do a merge, similar to what's done using merge sort. This operation is pretty trivially parallelizable since you can partition one data set and then partition the second data set using binary search to find the correct boundary.
If the data is within a certain numeric range, you can instead use a bitset and just set bits whenever you encounter that number.
If one of the data sets is smaller than the other, you could put it in a hash set and loop over the other dataset quickly, checking for containment.
I have used the first strategy to create a gigantic set of about 8 million integers from about 40k smaller sets in about a second (on beefy hardware, in Scala).