Search functions inadequately searching large dataframe in R - filtering

First time asking a question on here, and I'm really hoping to get some help. I don't believe this question is out there yet.
I have a dataframe of 7,000,000+ observations, each with 140 variables. I am trying to filter the data down to a smaller cohort using a set of multiple criteria, any of which would allow for inclusion in the smaller, filtered dataset.
I have tried two methods to search my data:
The first strategy uses filter_all() on all variables and searches for my inclusion criteria:
filteredData <- filter_all(rawData, any_vars(. %in% c(criteria1, criteria2, criteria3)))
The second strategy uses which() with a series of OR'd conditions, again trying to identify every row that contains one of my criteria:
filteredData <- rawData[which(rawData$criteria1 == "criteria" | rawData$criteria2 == "criteria" | rawData$criteria3 == "criteria"), ]
Both approaches accurately pull one or two rows meeting these criteria; however, I don't believe all 7,000,000 rows are being searched. I added a row label to my rawData set and saw that the function successfully pulled row #60,192. I am expecting hundreds of rows in the final result and am very confused why only a couple from early on in the dataframe get identified.
My questions:
Do the filter_all() and which() functions have size limits after which they stop searching?
Does anyone have a suggestion on how to filter/search based on multiple criteria on a very large dataset?
Thank you!

Related

`set_sorted` when a dataframe is sorted on multiple columns

I have some panel data in polars. The dataframe is sorted by its id column and then its date column (basically it's a bunch of time series concatenated together).
I've seen that polars has a .set_sorted method for working with expressions. I can of course use pl.col("id").set_sorted(), but I want it to be aware that the frame is actually sorted by both the id and date columns. In pandas, I know the Index has an .is_monotonic_increasing property that is aware of whether all the columns of the Index are sorted; is there a way to do something similar with polars?
Have you tried
df.get_column('id').is_sorted()
and
df.get_column('date').is_sorted()
to see if they're each already known to be sorted?
For instance if I do:
df=pl.DataFrame({'a':[1,1,2,2], 'b':[1,2,3,4]})
df.get_column('a').is_sorted()
df.get_column('b').is_sorted()
Then I get 2 Trues even though I haven't ever told it that the columns are sorted.
In general, I don't think you want to be manually setting columns as sorted. Just sort them and it'll keep track of the fact that they're sorted.
If you do:
df=pl.DataFrame({'a':[1,2,1,2], 'b':[1,3,2,4]})
df.get_column('a').is_sorted()
df.get_column('b').is_sorted()
then you get False twice, as you'd hope. If you then do df = df.sort(['a','b']) and follow it up by checking the sortedness of a and b again, you'll see that it knows they're sorted.
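For reference, here is a minimal sketch of that before-and-after check (assuming a reasonably recent polars version; the toy frame is the second example above):

import polars as pl

# Unsorted toy frame: neither column is in order
df = pl.DataFrame({'a': [1, 2, 1, 2], 'b': [1, 3, 2, 4]})

print(df.get_column('a').is_sorted())  # False
print(df.get_column('b').is_sorted())  # False

# Sort by both columns, then check again
df = df.sort(['a', 'b'])

print(df.get_column('a').is_sorted())  # True
print(df.get_column('b').is_sorted())  # True (b happens to be globally sorted in this toy data)

# The flags polars tracks internally can also be inspected per column, e.g.
# df.get_column('a').flags, though exactly which flags are set after a
# multi-column sort may differ between polars versions.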

Remove rows from search expression solr

I'm trying to search my large dataset for the items whose attribute matches the given function below, but I'm facing a problem here.
The rows parameter only selects the first 300 objects, and the function then filters the matching results, but I'm trying to search the whole index, not just the first few. How can I rewrite this to achieve that?
having(
  select(
    search(myIndex, q="*:*", fl="*", rows=300),
    id,
    dotProduct(ATTRIBUTE, array(4,5,2)) as prod,
    l1norm(array(1,2,3)) as a,
    l1norm(ATTRIBUTE) as b,
    div(prod, add(a, sub(b, prod))) as c
  ),
  and(gteq(c, 5), lteq(c, 8))
)
The simplest would be to increase the number of rows to cover the number of entries in the index.
However if this number is huge, you should probably use the /export request handler instead of a regular select-like handler.
The /export request handler allows a fully sorted result set to be streamed out of Solr using a special rank query parser and response writer. These have been specifically designed to work together to handle scenarios that involve sorting and exporting millions of records.
Depending on your needs, you could also run multiple queries, paginating the results with the start and rows parameters, or, if the number of entries is not known by the client code, use cursorMark.
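As an illustration of the cursorMark option, here is a minimal Python sketch that pages through the whole index from client code; the Solr URL, the page size, and the assumption that id is the collection's uniqueKey are mine, not taken from the question, so adjust them to your setup and apply your scoring/filter logic to the collected documents afterwards:

import requests

# Sketch: page through the entire index with cursorMark instead of a fixed rows=300.
# The URL and the uniqueKey field ("id") are assumptions -- adjust to your deployment.
SOLR_SELECT = "http://localhost:8983/solr/myIndex/select"

params = {
    "q": "*:*",
    "fl": "*",
    "rows": 1000,         # page size per request, not a cap on the total result set
    "sort": "id asc",     # cursorMark requires a sort that includes the uniqueKey
    "cursorMark": "*",
    "wt": "json",
}

docs = []
while True:
    body = requests.get(SOLR_SELECT, params=params).json()
    docs.extend(body["response"]["docs"])
    next_cursor = body["nextCursorMark"]
    if next_cursor == params["cursorMark"]:  # cursor did not advance: no more pages
        break
    params["cursorMark"] = next_cursor

print(len(docs))  # every document is now available for client-side filtering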

Active Record efficient querying on multiple different tables

Let me give a summary of what I've been attempting to do and the efficiency issues I've been running into:
Essentially, I want my users to be able to select parameters to filter data from my database, and then I want the controller to pass along the relevant data that passes those filters.
However, these filters query data from multiple different tables (about 5-6 of them), some of which are quite large (as in 100k+ rows). These tables are all related to what I want to show, e.g. here is a bond that meets so-and-so criteria, which is issued by so-and-so issuer, which must meet these criteria, and so on.
As an end result, I only really need about 100 rows after querying based on the parameters given by the user, but it feels like I need to look at everything in every table because I don't know how strict the filters will be beforehand. E.g. with a starting universe of 100k sets of data, passing filters f1 and f2 of table 1 might leave 90k, but after passing through filter f3 of table 2, filters f4, f5, f6 of table 3, and so on, we might end up with 100 or fewer sets of data that pass these parameters, because the last filters checked might be quite strict.
How can I go about querying through these multiple different tables efficiently?
Doing a join between them seems like it'd yield a time complexity of |T_1||T_2||T_3||T_4||T_5||T_6|, where |T_i| is the size of table i.
On the other hand, just looking through the other tables based off the ids of the ones that pass the previous filter (as in, id 5,7,8 pass filters in T_1, which of those ids then pass filters in T_2, then which of those pass filters in T_3 and so on) looks like it might(?) have time complexity of |T_1| + |T_2| + ... + |T_6|.
I'm relatively new to Ruby on Rails, so I'm not entirely sure of all the tools at my disposal that could help with optimizing this, but at the same time I'm not entirely sure how best to approach this algorithmically.

PostgreSQL - Compare ts_vector fields

I have two tables containing data that comes from two different sources. One field in each table contains the title of a movie, but for some reason out of my control, the titles are not always exactly the same.
So I use ts_vector to get rid of all the minor differences (stop words, plurals and so on).
See an example here: http://sqlfiddle.com/#!17/5ccbc/3
My problem is how to compare the two ts_vectors without taking the positional values into account, only the text content. If I compare the two fields directly, I only get exact matches between values, including the position of each word. The only solution I have found is the strip() function, which removes positions and weights from a tsvector, leaving only the text content.
I was wondering if there is a faster way to compare ts_vectors.
You could create an index on the stripped vector:
create index on tbl1 (strip(ts_title));
create index on tbl2 (strip(ts_title));
But given that your query has to fetch every row of each table, it is unlikely this would serve much of a purpose. Doing a merge join between the precomputed stripped vectors could be faster, but probably not once you include the overhead of building and maintaining the indexes. If the real WHERE clause is more restrictive (selecting only a few rows from one or the other table), then please share the real query.

Data Lake Analytics - Large vertex query

I have a simple query which does a GROUP BY on two fields:
#facturas =
    SELECT a.CodFactura,
           Convert.ToInt32(a.Fecha.ToString("yyyyMMdd")) AS DateKey,
           SUM(a.Consumo) AS Consumo
    FROM #table_facturas AS a
    GROUP BY a.CodFactura, a.DateKey;
#table_facturas has 4100 rows, but the query takes several minutes to finish. Looking at the graph explorer, I see it uses 2500 vertices because there are 2500 unique CodFactura+DateKey combinations. I don't know if this is normal ADLA behaviour. Is there any way to reduce the number of vertices and execute this query faster?
First: I am not sure your query will actually compile. You would need the Convert expression in your GROUP BY, or to do it in a previous SELECT statement.
Secondly: In order to answer your question, we would need to know how the full query is defined. Where does #table_facturas come from? How was it produced?
Without this information, I can only give some wild speculative guesses:
If #table_facturas is coming from an actual U-SQL Table, your table is over partitioned/fragmented. This could be because:
you inserted a lot of data originally with a distribution on the grouping columns, and you either have a predicate that reduces the number of rows per partition and/or you do not have up-to-date statistics (run CREATE STATISTICS on the columns).
you did a lot of INSERT statements, each inserting a small number of rows into the table, thus creating a large number of individual files. This will "scale out" the processing as well. Use ALTER TABLE REBUILD to recompact.
If it is coming from a fileset, you may have too many small files in the input. See if you can merge them into fewer, larger files.
If the above does not help, you can also try hinting a small number of rows in the query that creates #table_facturas by adding OPTION(ROWCOUNT=4000).