Mergesort with pytables

I have two large tables (too large to fit in memory), each with an id column, that I'd like to merge (join).
Is there a fast way to do that with PyTables? I thought about fully indexing the two tables and using itersorted with this kind of merge algorithm: http://en.literateprograms.org/Merge_sort_%28Python%29 .
But might it be possible to take advantage of the numexpr library to do that faster?
Thanks.
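For illustration, a minimal sketch of that sorted-merge idea, assuming both tables already have a completely sorted index (CSI) on id, created beforehand with table.cols.id.create_csindex(). The file and node names are made up, and this only handles unique ids; duplicate keys would need a buffered many-to-many merge.

import tables

with tables.open_file("a.h5") as fa, tables.open_file("b.h5") as fb:
    it_a = fa.root.table_a.itersorted("id")  # hypothetical node names
    it_b = fb.root.table_b.itersorted("id")

    row_a, row_b = next(it_a, None), next(it_b, None)
    while row_a is not None and row_b is not None:
        if row_a["id"] < row_b["id"]:
            row_a = next(it_a, None)
        elif row_a["id"] > row_b["id"]:
            row_b = next(it_b, None)
        else:
            # Matching ids: emit the joined pair here.
            print(row_a["id"])
            row_a, row_b = next(it_a, None), next(it_b, None)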

How do I efficiently execute large queries?

Consider the following demo schema
trades:([]symbol:`$();ccy:`$();arrivalTime:`datetime$();tradeDate:`date$(); price:`float$();nominal:`float$());
marketPrices:([]sym:`$();dateTime:`datetime$();price:`float$());
usdRates:([]currency:`$();dateTime:`datetime$();fxRate:`float$());
I want to write a query that gets the price, translated into USD, at the soonest possible time after arrivalTime. My beginner way of doing this has been to create intermediate tables that do some filtering and translate column names to be consistent, and then use aj and aj0 to join them up.
In this case there would only be 2 intermediate tables. In my actual case there are necessarily 7 intermediate tables, and the record counts, while not large by KDB standards, are not small either.
What is considered best practice for queries like this? It seems to me that creating all these intermediate tables is resource hungry. The alternative to the intermediate tables is to have a single, very complicated-looking query. Would that actually help things? Or is this consumption of resources just the price to pay?
For joining to the next closest time after an event, take a look at this question:
KDB reverse asof join (aj) ie on next quote instead of previous one
Assuming that's what you're looking for, you should be able to perform your price calculation either before or after the join (depending on the size of your tables, it may be faster to do it after). Ultimately I think you will need two (potentially modified as per the above) ajs: rates to market data, market data to trades.
If that's not what you're looking for, I could give some more specifics, although some sample data would be useful.
My thoughts:
The more verbose/readable your code, the better for you to debug later and for any future readers/users of your code.
Unless absolutely necessary, I would try to avoid creating 7 copies of the same table. If you are dealing with large tables, memory could quickly become a concern, particularly if the processing takes a long time and you could be creating large memory spikes. I try to stick to updating 1-2 variables at different stages, e.g.:
res:select from trades;
res:aj[`ccy`arrivalTime;
  res;
  select ccy:currency, arrivalTime:dateTime, fxRate from usdRates
  ];
res:update someFunc fxRate from res;
Sean beat me to it, but an aj for a time after (a reverse aj) is relatively straightforward: switch bin to binr in the k code. See the suggested answer.
I'm not sure why you need 7 intermediary tables, unless you are perhaps calculating cross rates? In that case I would typically join ccy1 and ccy2 with 2 ajs to the same table and take it from there.
Although it may be unavoidable in your case if you have no control over the source data, similar column names and greater consistency across schemas are generally better, e.g. sym vs symbol.

Hash Table With Adaptive Hash Function

The performance of a particular hash table depends heavily on both the keys and the hash function. Obviously one can improve the performance greatly by trying different hash functions on the incoming elements and picking the one resulting in the fewest collisions. Are there any publications on this subject, exploring methods of selecting such functions dynamically, with or without user guidance?
I doubt there is a formal process to choose the best one. There are too many moving parts, especially when it comes to performance: there is no single "best performance" approach. Is it best latency? Throughput? Memory usage? CPU usage? More reads? More writes? Concurrent access? Etc.
The only sensible way is to run performance tests for your specific code and use cases and choose what works for you.
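To make "run performance tests" concrete, here is a toy Python sketch of the selection idea from the question: count bucket collisions for a few candidate hash functions over a sample of the incoming keys and keep the winner. The candidate functions, bucket count, and key sample are all invented for illustration.

import hashlib

def bucket_collisions(keys, hash_fn, n_buckets=1024):
    # Count keys that land in an already-occupied bucket.
    seen, collisions = set(), 0
    for k in keys:
        b = hash_fn(k) % n_buckets
        if b in seen:
            collisions += 1
        seen.add(b)
    return collisions

candidates = {
    "builtin": hash,
    "md5": lambda k: int.from_bytes(hashlib.md5(str(k).encode()).digest()[:8], "little"),
}
keys = ["user:%d" % (i * 7) for i in range(5000)]  # sample of incoming keys
best = min(candidates, key=lambda name: bucket_collisions(keys, candidates[name]))
print(best)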

How to split a large data frame and use the smaller parts to do multiple broadcast joins in Spark?

Let's say we have two very large data frames, A and B. Now, I understand that if I use the same hash partitioner for both RDDs and then do the join, the keys will be co-located and the join might be faster with reduced shuffling (the only shuffling will happen when the partitioner changes on A and B).
I wanted to try something different, though: I want to try a broadcast join. Say B is smaller than A, so we pick B to broadcast, but B is still a very big data frame. So what we want to do is make multiple data frames out of B and then send each one as a broadcast to be joined with A.
Has anyone tried this?
To split one data frame into many, the only thing I see is the randomSplit method, but that doesn't look like a great option.
Any other better way to accomplish this task?
Thanks!
Has anyone tried this?
Yes, someone has already tried that. In particular, GoDataDriven. You can find the details below:
presentation - https://databricks.com/session/working-skewed-data-iterative-broadcast
code - https://github.com/godatadriven/iterative-broadcast-join
They claim pretty good results for skewed data; however, there are three problems you have to consider when doing this yourself:
There is no split in Spark. You have to filter the data multiple times or eagerly cache complete partitions (How do I split an RDD into two or more RDDs?) to imitate "splitting".
The huge advantage of broadcast is the reduction in the amount of transferred data. But if the broadcast data is large, the total amount of data to be transferred can actually increase significantly: (Why my BroadcastHashJoin is slower than ShuffledHashJoin in Spark)
Each "join" increases the complexity of the execution plan, and with a long series of transformations things can get really slow on the driver side.
randomSplit method but that doesn't look so great an option.
It is actually not a bad one.
Any other better way to accomplish this task?
You may try to filter by partition id, for example as sketched below.
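A rough PySpark sketch of that idea (the data frames and sizes are invented; note how the growing union chain embodies the plan-complexity caveat from point 3):

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast, spark_partition_id

spark = SparkSession.builder.getOrCreate()

# Stand-ins for the question's A and B.
a_df = spark.range(1000000).withColumnRenamed("id", "key")
b_df = spark.range(100000).withColumnRenamed("id", "key")

# Tag each row of B with its partition id and cache, so each
# per-slice filter does not rescan all of B.
b_tagged = b_df.withColumn("pid", spark_partition_id()).cache()
n_parts = b_tagged.rdd.getNumPartitions()

# Broadcast one slice of B at a time and union the partial joins.
result = None
for pid in range(n_parts):
    part = b_tagged.filter(b_tagged.pid == pid).drop("pid")
    joined = a_df.join(broadcast(part), "key")
    result = joined if result is None else result.unionByName(joined)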

Indexing large text files line by line for fast access

I have a very large text file, around 43GB, which I process to generate other files in different forms, and I don't want to set up any databases or indexing search engines.
The data is in the .ttl format:
<http://www.wikidata.org/entity/Q1000> <http://www.w3.org/2002/07/owl#sameAs> <http://nl.dbpedia.org/resource/Gabon> .
<http://www.wikidata.org/entity/Q1000> <http://www.w3.org/2002/07/owl#sameAs> <http://en.dbpedia.org/resource/Gabon> .
<http://www.wikidata.org/entity/Q1001> <http://www.w3.org/2002/07/owl#sameAs> <http://lad.dbpedia.org/resource/Mohandas_Gandhi> .
<http://www.wikidata.org/entity/Q1001> <http://www.w3.org/2002/07/owl#sameAs> <http://lb.dbpedia.org/resource/Mohandas_Karamchand_Gandhi> .
The target is to generate all combinations from all triples that share the same subject;
for example, for the subject Q1000:
<http://nl.dbpedia.org/resource/Gabon> <http://www.w3.org/2002/07/owl#sameAs> <http://en.dbpedia.org/resource/Gabon> .
<http://en.dbpedia.org/resource/Gabon> <http://www.w3.org/2002/07/owl#sameAs> <http://nl.dbpedia.org/resource/Gabon> .
The problem:
The dummy code to start with iterates with O(n^2) complexity, where n is the number of lines of the 43GB text file; needless to say, it would take years to finish.
What I thought of to optimize:
Loading a HashMap[String, Array[Int]] that indexes, for each key, the lines where it appears, and using some library to access the file by line number, for example:
Q1000 | 1,2,433
Q1001 | 2334,323,2124
Drawbacks: the index could be relatively large as well, considering that we would also need another index for access by a specific line number, plus the overhead; I haven't tested the performance of this approach (a rough sketch follows below).
Making a text file for each key, like Q1000.txt, holding all triples that contain the subject Q1000, then iterating over the files one by one and making the combinations.
Drawbacks: this seems to be the fastest and least memory-consuming option, but creating around 10 million files and accessing them will certainly be a problem. Is there an alternative?
I'm using Scala scripts for the task.
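For what it's worth, a rough Python sketch of option 1 (the idea translates to Scala), with one deliberate change: it stores byte offsets rather than line numbers, since a plain file can be seeked to an offset but not to a line. The file name is hypothetical.

from collections import defaultdict

index = defaultdict(list)  # subject -> byte offsets of its lines
with open("wikidata.ttl", "rb") as f:
    offset = 0
    for line in f:
        subj = line.split(b" ", 1)[0]
        index[subj].append(offset)
        offset += len(line)

# Later, read back all triples for one subject:
with open("wikidata.ttl", "rb") as f:
    for off in index[b"<http://www.wikidata.org/entity/Q1000>"]:
        f.seek(off)
        print(f.readline().decode().rstrip())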
Take the 43GB file in chunks that fit comfortably in memory and sort each chunk on the subject. Write the sorted chunks out separately.
Run a merge sort on the chunks (sorted by subject). It's really easy: you have as input iterators over two files, and you write out whichever input is less, then read from that one again (if there's any left).
Now you just need to make one pass through the sorted data to gather the groups of subjects.
Should take O(n) space and O(n log n) time, which for this sort of thing you should be able to afford.
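A minimal Python sketch of this chunk-sort-merge approach (Python rather than Scala, but the structure carries over; the file name and chunk size are invented):

import heapq
from itertools import groupby

CHUNK_LINES = 1000000  # tune to what fits comfortably in memory

def subject(line):
    return line.split(" ", 1)[0]  # first token of an N-Triples line

# 1. Split the big file into subject-sorted chunks on disk.
chunk_paths = []
with open("wikidata.ttl") as src:
    buf = []
    for line in src:
        buf.append(line)
        if len(buf) >= CHUNK_LINES:
            buf.sort(key=subject)
            path = "chunk_%d.ttl" % len(chunk_paths)
            with open(path, "w") as out:
                out.writelines(buf)
            chunk_paths.append(path)
            buf = []
    if buf:
        buf.sort(key=subject)
        path = "chunk_%d.ttl" % len(chunk_paths)
        with open(path, "w") as out:
            out.writelines(buf)
        chunk_paths.append(path)

# 2. Merge the sorted chunks and handle one subject group at a time.
files = [open(p) for p in chunk_paths]
for subj, group in groupby(heapq.merge(*files, key=subject), key=subject):
    triples = list(group)  # all triples sharing this subject
    # ... generate the pairwise combinations here ...
for f in files:
    f.close()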
A possible solution would be to use an existing map-reduce library. After all, your task is exactly what map-reduce is for. Even if you don't parallelize your computation across multiple machines, the main advantage is that it handles the splitting and merging for you.
There is an interesting library, Apache Crunch, with a Scala API. I haven't used it myself, but it looks like it could solve your problem well. Your lines would be split (grouped) according to their subjects, and the combinations then generated within each group.

Using xmlpipe2 with Sphinx

I'm attempting to load large amounts of data directly into Sphinx from Mongo, and currently the best method I've found has been using xmlpipe2.
I'm wondering, however, if there are ways to apply just the updates to the dataset, as a full reindex of hundreds of thousands of records can take a while and be a bit intensive on the system.
Is there a better way to do this?
Thank you!
Use a main plus delta scheme, where all the updates go to a separate, smaller index, as described here:
http://sphinxsearch.com/docs/current.html#delta-updates