iOS Core Data searching large datasets - iPhone

I'm working on an enterprise iOS application for the iPhone that uses Core Data as its persistence store. We have a couple of Core Data entities, and one of them has ~30 fields. My data set exceeds 45,000 of these 30-field entities.
I need to construct a query where my search string can match any of 12 different fields on the object. My NSPredicate is basically a big rat's nest of (property == $A) || (otherproperty == $A) ... etc., where $A is my search string.
We're just starting to tune this for performance, because it's pretty bad. Are there any obvious things we should do? Would it be better to use compound predicates and just have each sub-predicate be (property == $A)?
Thoughts? Thanks in advance.
My (partial) solution
- I watched some WWDC videos on Core Data and basically did what Apple said, which was to create a relationship from my real object to another object representing a search token. This new search-token entity has two fields: token and weight. When I insert my main object into Core Data, I give it a set of these search tokens, one per field I want to query over, and the contents of the token property are normalized (all-lowercase) data. Then my predicate is simply "ANY searchTokens.token BEGINSWITH %@". So now we're searching one table over its indexed field, and we're not doing any expression evaluation on the right side of the WHERE clause. Core Data is really fast searching like this. The only other thing I did was to put an NSTimer in shouldReloadTableForSearchString: to only fire off the search after 1 second of inactivity (see: UISearchDisplayController - wait for N seconds OR for user to press "Search" before conducting search).
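For reference, here is roughly what that looks like in code. This is a minimal sketch: the Item and SearchToken entity names, the searchTokens relationship, and the field list are illustrative stand-ins, not the exact names from my model.
// On insert: create one normalized (lowercased) token per searchable field.
NSArray *fields = @[item.name, item.code, item.city]; // one entry per queried field
for (NSString *field in fields) {
    NSManagedObject *token =
        [NSEntityDescription insertNewObjectForEntityForName:@"SearchToken"
                                      inManagedObjectContext:context];
    [token setValue:[field lowercaseString] forKey:@"token"];
    [[item mutableSetValueForKey:@"searchTokens"] addObject:token];
}

// On search: a single predicate over the token entity's indexed attribute.
NSFetchRequest *request = [NSFetchRequest fetchRequestWithEntityName:@"Item"];
request.predicate = [NSPredicate predicateWithFormat:
    @"ANY searchTokens.token BEGINSWITH %@", [searchText lowercaseString]];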

Using compound predicates will give you the same result as a long string with ||, so no, that won't help.
What might help is to optimize the order of the predicates. They're evaluated left to right, and evaluation stops as soon as the result is known. If you have a long chain of logical ORs, and the first one is true, it doesn't matter what the rest of them look like. Those predicates won't even be evaluated.
So, if you can reasonably expect that some properties are more likely to match than others, move those to the front of the list. If one will be YES almost all of the time, make that the first one.
Also, if any of these predicates are numeric, move those toward the front of the line. Numeric comparisons are much faster than string comparisons, so any numeric check that returns YES will let you skip the following string comparisons.
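For example, with Core Data's predicate format (the attribute names here are made up; the point is just the ordering):
// The cheap numeric test comes first; the string comparisons to its right
// are only evaluated when it fails.
NSPredicate *predicate = [NSPredicate predicateWithFormat:
    @"statusCode == %d OR name ==[c] %@ OR notes CONTAINS[cd] %@",
    42, searchText, searchText];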

Related

Search-engine-like full-text search in PostgreSQL

I have a list of titles and descriptions in a table, indexed in a tsvector column. How can I implement Google-like full-text search functionality in Postgres for these fields? I tried the various functions offered by standard Postgres, like:
to_tsquery('apple | orange') -- apple | orange
This function returns a row as long as it has at least one of these terms, so the most relevant results (those that have both terms) don't come out at the top.
plainto_tsquery('apple orange') -- apple & orange
This function requires all of the terms in the query to be present. But I want results including both apple and orange first, while still having results that include only one of the terms later in the list.
phraseto_tsquery('apple orange') -- apple <> orange
This function only matches apple followed by orange, not vice versa. But for me, orange <> apple is still relevant too.
I also tried websearch_to_tsquery(), but it behaves very similarly to the above functions.
How can I ask Postgres to list the most relevant rows first (those containing most of the terms in the search query, in any order), followed by rows containing fewer of the terms?
to_tsquery('apple | orange') -- apple | orange
This function returns a row as long as it has at least one of these terms, so the most relevant results (those that have both terms) don't come out at the top.
Unless you tell it how to order the rows, rows of a single query are returned in arbitrary order. There is no "top" without an ORDER BY, there is just something which happens to be seen first.
How can I ask Postgres to list the most relevant rows first (those containing most of the terms in the search query, in any order), followed by rows containing fewer of the terms?
Use the | operator, then rank those rows using ts_rank, ts_rank_cd, or a custom ranking function you write yourself. For performance, you might want to use the & operator first, then fall back to | if you don't get enough rows.
The built-in ranking functions don't care about order, but they also don't care about proximity, so they might not do what you want. But writing your own won't be particularly easy, so I'd at least try them out first.
It would be nice if the introduction of websearch_to_tsquery or phraseto_tsquery had also introduced corresponding ranking functions. But since they implemented only ordered proximity, not proximity without order, it is unlikely they would do what you want even if they did exist.
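A minimal sketch of the |-then-rank approach, assuming a table docs with a tsvector column tsv (names are illustrative):
-- Match any term with |, then order by rank so rows containing
-- both terms come out first.
SELECT title, ts_rank(tsv, q) AS rank
FROM docs, to_tsquery('english', 'apple | orange') AS q
WHERE tsv @@ q
ORDER BY rank DESC
LIMIT 20;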

Is there any way for Access 2016 to sort the numbers stored in a Text field as though they were numeric values?

I am working on a database that (hopefully) will end up using a primary key containing both numbers and letters to track lots of agricultural product. Due to the way the weighing of product takes place at more than one facility, I have no option but to keep the same base number and append letters to it to denote split portions of each lot. The problem is that after I create record number 99, record 100 sorts directly after 10 rather than after 99. This makes it difficult to maintain consistency and forces me to replace the alphanumeric lot ID with a strictly numeric value (using AutoNumber as the data type) just to keep things sorted. Either way, I need the alphanumeric lot ID, so having two IDs for the same lot can be confusing for anyone entering values into the form. Is there a way around this that I am just not seeing?
If you're using a query as the data source, you can try sorting by the string converted to a number, something like:
SELECT id, field1, field2, ..
FROM YourTable
ORDER BY CLng(YourAlphaNumericField)
Edit: you may also try the Val function instead of CLng; it should not fail on non-numeric input.
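For example (table and field names are illustrative; Val reads the leading digits, so "100A" sorts as 100, and the secondary sort on the text breaks ties between "100A" and "100B"):
SELECT LotID
FROM Lots
ORDER BY Val(LotID), LotID;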
Why not properly format your key before saving? E.g. "0000099". You will avoid a costly conversion later.
Alternatively, you could use two fields as a composite PK: one with the number (as Long) and one with the location (as String).

T-SQL speed comparison between LEFT() vs. LIKE operator

I'm creating result paging based on the first letter of a certain nvarchar column, rather than the usual paging on the number of results.
Now I'm faced with the choice of whether to filter results using the LIKE operator or the equality (=) operator:
select *
from table
where name like @firstletter + '%'
vs.
select *
from table
where left(name, 1) = @firstletter
I've tried searching the net for a speed comparison between the two, but it's hard to find any results, since most search results relate to LEFT JOINs rather than the LEFT function.
"Left" vs "Like" -- one should always use "Like" when possible where indexes are implemented because "Like" is not a function and therefore can utilize any indexes you may have on the data.
"Left", on the other hand, is function, and therefore cannot make use of indexes. This web page describes the usage differences with some examples. What this means is SQL server has to evaluate the function for every record that's returned.
"Substring" and other similar functions are also culprits.
Your best bet is to measure the performance on real production data rather than trying to guess (or asking us). That's because performance can sometimes depend on the data you're processing, although in this case that seems unlikely (but I don't know for sure, hence why you should check).
If this is a query you will be doing a lot, you should consider another (indexed) column containing the lowercased first letter of name, kept up to date by an insert/update trigger.
This will, at the cost of a minimal storage increase, make the query blindingly fast:
select * from table where name_first_char_lower = @firstletter
That's because most databases are read far more often than they're written, and this amortises the cost of the calculation (done only on writes) across all reads.
It introduces redundant data, but that's okay for performance as long as you understand (and mitigate, as in this suggestion) the consequences and actually need the extra performance.
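In SQL Server specifically, a persisted computed column can play the role of that trigger-maintained column; a sketch with illustrative names:
-- Computed on write, stored, and indexable.
ALTER TABLE dbo.people
    ADD name_first_char_lower AS LOWER(LEFT(name, 1)) PERSISTED;

CREATE INDEX IX_people_name_first_char
    ON dbo.people (name_first_char_lower);

-- The paging filter then becomes an index seek:
SELECT *
FROM dbo.people
WHERE name_first_char_lower = @firstletter;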
I had a similar question, and ran tests on both. Here is my code.
where (VOUCHER like 'PCNSF%'
or voucher like 'PCLTF%'
or VOUCHER like 'PCACH%'
or VOUCHER like 'PCWP%'
or voucher like 'PCINT%')
Returned 1434 rows in 1 min 51 seconds.
vs
where (LEFT(VOUCHER,5) = 'PCNSF'
or LEFT(VOUCHER,5)='PCLTF'
or LEFT(VOUCHER,5) = 'PCACH'
or LEFT(VOUCHER,4)='PCWP'
or LEFT(VOUCHER,5) = 'PCINT')
Returned 1434 rows in 1 min 27 seconds.
My data comes back faster with the LEFT 5. As an aside, my overall query does hit some indexes.
I would always suggest using the LIKE operator when the search column has an index. I tested the above pattern in my production environment with select count(column_name) from table_name where left(column_name,3) = 'AAA' OR left(column_name,3) = 'ABA' OR ... (up to 9 OR clauses). Counting my 7,301,477 records took 4 seconds with LEFT and 1 second with LIKE, i.e. where column_name like 'AAA%' OR column_name like 'ABA%' OR ... (up to 9 LIKE clauses).
Calling a function in the WHERE clause is not best practice. See http://blog.sqlauthority.com/2013/03/12/sql-server-avoid-using-function-in-where-clause-scan-to-seek/
Entity Framework Core users
You can use EF.Functions.Like(columnName, searchString + "%") instead of columnName.StartsWith(...), and you'll get just a LIKE in the generated SQL instead of all this LEFT craziness!
Depending on your needs, you will probably have to preprocess searchString.
See also https://github.com/aspnet/EntityFrameworkCore/issues/7429
This function isn't present in (non-Core) Entity Framework's EntityFunctions, so I'm not sure how to do the equivalent in EF6.

Better performance for SQLite Select Statement

I'm developing an iPhone app where the user types any string into a search bar and presses the search button. After that, a result list should appear.
In my SQLite database I have four columns: a, b, c, d. Let's say they have the following values:
Dataset 1:
a: code1
b: report1
c: description1_1
d: description1_2
Dataset 2:
a: code2
b: report2
c: description2_1
d: description2_2
So if the user enters the value "1_1", the first dataset will be selected, because of column c.
If the user enters the value "report", the first and second datasets will both be selected.
As I'm using a database with nearly 60,000 datasets, searching for a substring really kills performance.
Setting an index on all four columns would make the SQLite database far too large, so I didn't use an index at all.
My Select Statement looks like this:
NSString *sql = [NSString stringWithFormat:@"SELECT * FROM scode WHERE a LIKE '%@%@%@' OR b LIKE '%@%@%@' OR c LIKE '%@%@%@' OR d LIKE '%@%@%@'", wildcard, searchBar.text, wildcard, wildcard, searchBar.text, wildcard, wildcard, searchBar.text, wildcard, wildcard, searchBar.text, wildcard];
Is there any good way to enhance the performance of searching for a substring across all columns?
Thank you and kind regards,
Daniel
You're after full-text searching, which SQLite supports only through its optional FTS extension modules; they may not be enabled in your build. I don't have any experience with third-party support, but based on search results there are a few options.
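If your build does include FTS, here's a sketch using the question's table (FTS4 syntax; note it matches word prefixes such as report*, not arbitrary mid-word substrings such as 1_1):
CREATE VIRTUAL TABLE scode_fts USING fts4(a, b, c, d);

-- Mirror the existing rows into the index once, then keep it in sync on writes.
INSERT INTO scode_fts (rowid, a, b, c, d)
    SELECT rowid, a, b, c, d FROM scode;

-- Token-prefix search across all four columns:
SELECT scode.*
FROM scode
JOIN scode_fts ON scode_fts.rowid = scode.rowid
WHERE scode_fts MATCH 'report*';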
You answered your own question: index all four columns, and measure the size difference. Considering the storage capacity of the iPhone, you're probably out of balance trying to reduce storage.
The rule of thumb with SQLite performance is not to run a query that isn't indexed.
You can see what SQLite is actually doing by creating your database on the Mac with the same schema and running EXPLAIN QUERY PLAN. (There's also EXPLAIN, which is more detailed but less obvious.)
You can create a separate table with two columns: a pattern string and a key value (which refers back to your data table). Let's call this table "search_index".
Then, on any change to your data table entries, you update the "search_index" table:
remove rows whose keys match the changed data table rows
for each column in the data table, take the first X characters of the data and add them to search_index with the key
You can work out the details yourself, but in this way you build your own (partial) search index; a sketch follows below.
When querying, you can use up to X characters to search in the search_index table alone. If the user types more than X characters, you at least have a limited set of data table rows to search within. So you can search those 60k rows easily.
Find a good value for X to balance storage requirements, usability, and performance.
EDIT: It looks like you don't want to search only the beginnings of words? Well, then you should not use just the "first X characters"; instead, split the data into single words and put the full words into search_index. In practice you will still need only around a fourth of the index storage compared to indexing all columns, so it's still worthwhile to build your own "search_index".
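A sketch of what that table and lookup could look like (names are illustrative; COLLATE NOCASE lets the prefix LIKE use the index):
CREATE TABLE search_index (
    pattern TEXT NOT NULL COLLATE NOCASE, -- normalized word or prefix
    key INTEGER NOT NULL                  -- rowid of the matching row in scode
);
CREATE INDEX idx_search_index_pattern ON search_index(pattern);

-- Indexed prefix lookup, then fetch the matching data rows:
SELECT DISTINCT scode.*
FROM scode
JOIN search_index ON search_index.key = scode.rowid
WHERE search_index.pattern LIKE 'report%';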

What's a real-world example of something you would represent with a hash?

I'm just trying to get a grip on when you would need to use a hash and when it might be better to use an array. What kind of real-world object would a hash represent, say, in the case of strings?
I believe a hash is sometimes referred to as a "dictionary", and I think that's a good example in itself. If you want to look up the definition of a word, it's nice to just do something like:
definition['pernicious']
Instead of trying to figure out the correct numeric index that the definition would be stored at.
This answer assumes that by "hash" you're basically just referring to an associative array.
I think you're looking at this from the wrong direction. It is not the object that determines whether you should use a hash, but the manner in which you access it. A common use of a hash is a lookup table: if your objects are strings and you want to check whether they exist in a dictionary, looking them up will (assuming the hash works properly) be O(1). With sorting, the time would instead be O(log n), which may not be acceptable.
Thus, hashes are ideal for use with Dictionaries (hashmaps), sets (hashsets), etc.
They are also a useful way of representing an object without storing the object itself (e.g., for passwords).
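For instance, PHP's built-in password API stores and checks only a hash of the password (a small sketch):
// Store only the hash, never the password itself.
$hash = password_hash("hunter2", PASSWORD_DEFAULT);

// Later, verify a login attempt against the stored hash.
$attempt = "hunter2";
if (password_verify($attempt, $hash)) {
    echo "authenticated";
}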
The phone book - key = name, value = phone number.
I also think of the old World Book Encyclopedias (actual books): each article is "hashed" into a single volume (cat goes in the "C" volume).
Any time you have data that is well served by a 1-to-1 map.
For example, grades in a class:
"John Smith" => "B+"
"Jacob Jenkens" => "C"
etc
In general, hashes are used to find things fast: a hash map can be used to associate one thing with another quickly, and a hash set just stores things "fast".
Please also consider the hash function's complexity and cost when deciding whether a hash container or an ordinary less-than (ordered) container is better: the additional size of the hash value, the time needed to compute a good hash, and the time needed for the 1:1 comparison at the end in case of a hash collision may in fact add up to more than just walking a tree structure of logarithmic depth using the less-than operator.
When you need to associate one variable with another. There isn't a "type limit" to what can be a key/value in a hash.
Hashes have many uses. Aside from cryptographic uses, they are commonly used for quick lookups of information. To get similarly quick lookups using an array, you would need to keep the array sorted and then use a binary search. With a hash you get fast lookup without having to sort. This is why most scripting languages implement hashing under one name or another (dictionaries, et al).
I use one often for a "dictionary" of settings for my app.
Setting | Value
I load them from the database or config file, into hashtable for use by my app.
Works well, and is simple.
One example could be zip code associated with an area, city or any postal address.
A good example is a cache with lots of elements in it. You have some identifier by which you want to look up a value (say a URL, and you want to find the corresponding cached web page). You want these lookups to be as fast as possible and don't want to search through all the stored pages every time some URL is requested. A hash table is a great data structure for a problem like this.
One real-world example I just wrote was adding up the amounts people spent on meals when filing expense reports. I needed a daily total, with no idea how many items would exist on a particular day and no idea what the date range of the expense report would be. There are restrictions on how much a person can expense, with many variables (what city, weekend, etc.).
The hash table was the perfect tool for this. The key was the date; the value was the receipt amount (converted to USD). The receipts could come in any order; I just kept fetching the value for that date and adding to it until the job was done. Displaying was easy as well.
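In code, that accumulation is only a few lines; a PHP sketch with made-up receipt data:
$receipts = [
    ["date" => "2014-03-01", "amount_usd" => 14.50],
    ["date" => "2014-03-01", "amount_usd" => 32.00],
    ["date" => "2014-03-02", "amount_usd" => 9.75],
];

$totals = [];
foreach ($receipts as $receipt) {
    // Key by date; receipts can arrive in any order.
    $date = $receipt["date"];
    $totals[$date] = ($totals[$date] ?? 0) + $receipt["amount_usd"];
}
// $totals is ["2014-03-01" => 46.50, "2014-03-02" => 9.75]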
(PHP code)
$david = new stdClass();
$david->name = "david";
$david->age = 12;
$david->id = 1;
$david->title = "manager";
$joe = new stdClass();
$joe->name = "joe";
$joe->age = 17;
$joe->id = 2;
$joe->title = "employee";
// option 1: let's put users in by index
$users = [];
$users[] = $david;
$users[] = $joe;
// option 2: let's put users in by title
$users = [];
$users[$david->title] = $david;
$users[$joe->title] = $joe;
Now the question: who is the manager?
Answer:
$users["manager"]