Faster stringdist_inner_join matching

I am trying to match company names from two large databases (4,000 and 23,000 records). However, it is taking a long time. Is there any way to speed up this process, for example by parallelizing this code? I have searched here but found nothing relevant yet. I would appreciate any help you could provide. My basic command is:
x <- stringdist_inner_join(tbl.nomatch, tbl.reuters, by = "company", method = "jw")
Thank you very much.
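One general way to speed this up, independent of the library, is to split the smaller table into chunks and run the fuzzy match on each chunk in a separate worker process; the same idea applies to stringdist_inner_join by calling it once per chunk. Below is a minimal sketch of the chunk-and-parallelize idea (in Python for illustration only; difflib's ratio stands in for Jaro-Winkler, and all names and the threshold are hypothetical):

# Chunk-and-parallelize sketch: each worker fuzzy-matches one name from the
# smaller list against the full larger list.
from multiprocessing import Pool
from difflib import SequenceMatcher

reuters_names = ["acme corp", "globex inc", "initech llc"]  # the 23,000-name side

def best_match(name, threshold=0.85):
    # Return (name, best candidate) when the similarity clears the threshold.
    best = max(reuters_names, key=lambda c: SequenceMatcher(None, name, c).ratio())
    score = SequenceMatcher(None, name, best).ratio()
    return (name, best if score >= threshold else None)

if __name__ == "__main__":
    nomatch_names = ["acme corporation", "globex incorporated"]  # the 4,000-name side
    with Pool() as pool:  # one worker per CPU core by default
        matches = pool.map(best_match, nomatch_names)
    print(matches)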

Related

How to parallelize a function the correct way

I currently have a setup where I have a very large Pandas dataframe df that can be filtered into separate batches df_batch_1 = df[df['column'] == <filter 1>], df_batch_2 = df[df['column'] == <filter 2>], etc. In addition, I have a very "heavy" function heavy_function(df_batch) that must do some heavy computation on each dataframe batch df_batch_1, df_batch_2, etc. My plan is to do this heavy lifting on a Databricks (PySpark) cluster, since it runs too slowly on a regular computer.
So far, I am running this using threads on a Databricks cluster like this:
threads = [
    threading.Thread(target=heavy_function, args=(df[df['column'] == f],))
    for f in [<filter 1>, <filter 2>, ...]
]
for t in threads:
    t.start()
I have been told that this is an anti-pattern and that I should find another way of doing it. I hope you can help point me in the right direction. What is the right "pysparkian" way of doing this?
Any help is appreciated!
Best regards,
DK
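For what it's worth, one commonly recommended Spark-native alternative to driver-side threads is to let Spark schedule the work per group with groupBy().applyInPandas(). A minimal sketch, assuming Spark 3.x and that heavy_function maps a pandas DataFrame to a pandas DataFrame with the same schema:

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(df)  # df is the large pandas DataFrame

def heavy_function(batch: pd.DataFrame) -> pd.DataFrame:
    # ... heavy per-batch computation, unchanged ...
    return batch

# Spark runs heavy_function once per distinct 'column' value, distributed
# across the executors instead of competing threads on the driver.
result = (
    sdf.groupBy("column")
       .applyInPandas(heavy_function, schema=sdf.schema)
       .toPandas()
)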

MongoDB: find bulk data (without pagination) efficiently using a large sentence

We have some functionality where we search a large amount of data using a large sentence that the user provides. The provided sentence can have 1000+ words. We are currently using text search to find the data, but it takes a long time and even crashes at some point. We cannot use pagination because we need the entire result set in the front end. How can we fix this issue? Any kind of help is welcome. Thanks in advance.
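For illustration, one way to keep a large $text result set from exhausting memory is to stream it through a cursor in batches rather than materializing everything at once. A minimal PyMongo sketch (collection, field, and batch-size values are hypothetical):

from pymongo import MongoClient, TEXT

coll = MongoClient()["mydb"]["documents"]
coll.create_index([("body", TEXT)])  # $text queries require a text index

sentence = "..."  # the user-provided 1000+ word sentence

cursor = coll.find(
    {"$text": {"$search": sentence}},
    {"body": 1},      # project only the fields the front end needs
    batch_size=1000,  # fetch from the server in chunks
)
for doc in cursor:
    pass  # stream each document onward instead of buffering the whole list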

Can I update values from within ElasticSearch native script plugin?

I'm writing a native plugin for ElasticSearch, and I would like to update a field from within this script. Is there a way?
Context: I'm trying to use the ELK stack to chart differences between documents. The documents are produced continuously from two separate sources.
I have sorted out all the pieces, but this one is the last mile for me. Any help is greatly appreciated.
Never mind. I figured out that it needs an org.elasticsearch.client.Client within the plugin code. Thanks all.

Drools for rating telco records

Has anyone successfully used Drools as a kind of "rating engine" before? What are your experiences?
I'm trying to process a couple of millions of records (of slightly different types) and apply rating/pricing to these records.
Rating would be based on tables or database lookups, as well as chains of if/then/else conditions using the lookup data.
Traditional rating engines don't employ rule mechanisms in ways that I'm comfortable with...
thanks for your help
To provide a slightly more informative response (although your question can't really be answered from the very vague description you've given): your "rating" is just one of many names for what I would call a "classification problem", and that has been solved many times using Drools.
However, this doesn't mean that your problem, with its particular environmental flavour and expected performance (how fast do you need the 2M records processed?), is best solved with Drools - especially when the measure for deciding quality isn't settled. (For instance: is ease of maintenance more important than top efficiency?)
Go ahead and rig up a prototype and run a test to see how it goes. That will give you a more reliable answer than anything else. If someone says that something similar couldn't be done, it could be due to bad rule coding. If someone says that something similar was done successfully, it may not have had one of the quirks of your setup. And so on.
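As a concrete illustration of the lookup-table-plus-rule-chain style of rating the question describes, here is a minimal sketch in plain Python (not Drools DRL; all rates and record fields are hypothetical):

RATE_TABLE = {  # stands in for the database lookup
    ("voice", "peak"): 0.25,
    ("voice", "offpeak"): 0.10,
    ("sms", "any"): 0.05,
}

def rate(record):
    # A chain of if/then/else rules combined with a table lookup.
    if record["type"] == "sms":
        return RATE_TABLE[("sms", "any")]
    elif record["type"] == "voice":
        period = "peak" if 8 <= record["hour"] < 20 else "offpeak"
        return RATE_TABLE[("voice", period)] * record["minutes"]
    else:
        raise ValueError("unrated record type: %s" % record["type"])

print(rate({"type": "voice", "hour": 9, "minutes": 4}))  # 0.25 * 4 = 1.0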

Using xmlpipe2 with Sphinx

I'm attempting to load large amounts of data directly into Sphinx from Mongo, and currently the best method I've found is xmlpipe2.
I'm wondering, however, if there is a way to apply only updates to the dataset, since a full reindex of hundreds of thousands of records can take a while and be fairly intensive on the system.
Is there a better way to do this?
Thank you!
Use the main plus delta scheme, where all updates go to a separate, smaller index, as described here:
http://sphinxsearch.com/docs/current.html#delta-updates
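For illustration, the delta side of that scheme can reuse the same xmlpipe2 approach with a time cutoff, so only recently changed Mongo records are re-indexed while the big main index is left untouched. A minimal PyMongo sketch (collection and field names are hypothetical):

import sys
from datetime import datetime, timedelta
from xml.sax.saxutils import escape
from pymongo import MongoClient

coll = MongoClient()["mydb"]["documents"]
cutoff = datetime.utcnow() - timedelta(hours=1)  # time of the last rebuild

# Emit the xmlpipe2 stream for the delta index: only records changed
# since the cutoff.
sys.stdout.write('<?xml version="1.0" encoding="utf-8"?>\n')
sys.stdout.write('<sphinx:docset>\n')
sys.stdout.write('<sphinx:schema><sphinx:field name="content"/></sphinx:schema>\n')
for doc in coll.find({"updated_at": {"$gt": cutoff}}):
    sys.stdout.write('<sphinx:document id="%d">' % doc["sphinx_id"])
    sys.stdout.write('<content>%s</content>' % escape(doc["content"]))
    sys.stdout.write('</sphinx:document>\n')
sys.stdout.write('</sphinx:docset>\n')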