Let's say I have a Core Data database for NSPredicate rules.
enum PredicateType: Int {
case beginswith
case endswith
case contains
}
My Database looks like below
+------+-----------+
| Type | Content |
+------+-----------+
| 0 | Hello |
| 1 | end |
| 2 | somevalue |
| 0 | end |
+------+-----------+
I have the content "This is end". How can I query Core Data to check whether any rule matches this content? It should find the second entry in the table:
+------+-----------+
| Type | Content |
+------+-----------+
| 1 | end |
+------+-----------+
but shouldn't find
+------+-----------+
| Type | Content |
+------+-----------+
| 0 | end |
+------+-----------+
Because in this sentence "end" is not at the beginning.
Currently I fetch all the values, create a predicate from each Content and Type, and query the database again, which I believe is a big overhead.
The way you are doing it now is correct. You first need to build your predicates (which in your case is a fairly complex operation that also requires fetching) and then run each predicate to see which one matches.
I wouldn't be so quick to assume that there is a huge overhead with this. If your data set is small (<300) I suspect there would be no problem at all. If you are experiencing problems then (and only then!) you should start optimizing.
If you see the app is running too slowly then use Instruments to see where the issue is. There are two possible places where I could see performance issues: 1) the fetching of all the predicates from the database and 2) the running of all of the predicates.
If you want to make the fetching faster, then I would recommend using an NSFetchedResultsController. While it is generally used to keep data in sync with a table view, it can be used for any data that you want to be current at any time. With the controller you do a single fetch and then it monitors Core Data and keeps itself up to date. Then when you need all of the predicates, instead of doing a fetch you simply access the controller's fetchedObjects property.
If you find that running all the predicates is taking a long time, then you can speed up beginsWith and endsWith with a clever use of binary search. You keep two arrays of custom predicate objects, one sorted alphabetically and the other with all the reversed strings sorted alphabetically. To find which string the content begins with, use indexOfObject:inSortedRange:options:usingComparator: to find the relevant objects. I don't know how you can improve contains. You could see if running string methods on the objects is faster than NSPredicate methods. You could also try running the predicates concurrently on a background thread.
Again, you shouldn't do any of this unless you find that you need to. If your dataset is small, then the way you are doing it now is fine.
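For illustration only, here is a rough sketch of the binary search idea in Python rather than Objective-C or Swift. The helper names (build_index, matching_prefixes, matching_suffixes) are made up for this example, and the standard bisect module stands in for indexOfObject:inSortedRange:options:usingComparator:. It assumes case-sensitive matching and that the rule strings are already in memory.

```python
from bisect import bisect_left


def build_index(patterns):
    """Return the patterns de-duplicated and sorted, ready for binary search."""
    return sorted(set(patterns))


def matching_prefixes(sorted_patterns, content):
    """Return every stored pattern that `content` begins with."""
    hits = []
    # Each prefix of the content is looked up with one binary search.
    for length in range(1, len(content) + 1):
        candidate = content[:length]
        i = bisect_left(sorted_patterns, candidate)
        if i < len(sorted_patterns) and sorted_patterns[i] == candidate:
            hits.append(candidate)
    return hits


def matching_suffixes(sorted_reversed_patterns, content):
    """Same lookup on reversed strings, so suffixes become prefixes."""
    reversed_hits = matching_prefixes(sorted_reversed_patterns, content[::-1])
    return [hit[::-1] for hit in reversed_hits]


# Rules from the question: type 0 = beginswith, type 1 = endswith.
begins_index = build_index(["Hello", "end"])
ends_index = build_index(p[::-1] for p in ["end"])

content = "This is end"
begins_hits = matching_prefixes(begins_index, content)
ends_hits = matching_suffixes(ends_index, content)
print(begins_hits, ends_hits)
```

With the rules from the question, only the endswith rule "end" matches; the beginswith rule with the same text does not. The cost is one binary search per prefix of the content instead of one comparison per rule, so this only pays off when there are many rules.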
Right now I have the following table:
students | classes |
-------------------------------------
Ally | Math |
Ally | English |
Ally | Science |
Kim | Math |
Kim | English |
I am currently building an advanced search feature where you can search by class and get back the students who have those classes. I would like to build a query that returns the students who have Math and English and Science in the classes column, so in the case above it would only return the rows that have Ally in them, since she meets the three-class criteria.
If anyone has any advice I would greatly appreciate it, thank you.
I've renamed your tables and such slightly, but partly cause I'm lazy. Here's what I came up with:
select student from studentclasses where
class in ('Math', 'English', 'Science')
group by student
having count(*) = 3;
See the db-fiddle
The idea is to grab all the student-class rows that match what your search is (basically an OR) and group it by the student so that we can limit by the having clause. We could use >= here, but if count for a particular student gets more than 3, we screwed up the IN :) If there are fewer than 3, then we're missing one class, so not all classes were found for that student.
The only caveats are:
I'm assuming you're using a student ID rather than just first name, and that the first name bit is just to make it easier for us to read, otherwise duplicates will abound.
There are no duplicates of a given class for a particular student. That is, if Kim is in Science twice, then that comes up with 3. In that case, you'll need to use a DISTINCT in there somewhere.
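To make the second caveat concrete, here is a small sketch using Python's built-in sqlite3 module (the question does not say which database is in use, so SQLite is only a stand-in). The table and column names follow the query above, and the duplicate Kim/English row is added on purpose to show why COUNT(DISTINCT class) is safer than COUNT(*).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE studentclasses (student TEXT, class TEXT)")
conn.executemany(
    "INSERT INTO studentclasses VALUES (?, ?)",
    [
        ("Ally", "Math"), ("Ally", "English"), ("Ally", "Science"),
        ("Kim", "Math"), ("Kim", "English"),
        ("Kim", "English"),  # deliberate duplicate row
    ],
)

wanted = ["Math", "English", "Science"]
placeholders = ", ".join("?" for _ in wanted)

# Count distinct classes so a duplicated row cannot fake a match.
query = f"""
    SELECT student
    FROM studentclasses
    WHERE class IN ({placeholders})
    GROUP BY student
    HAVING COUNT(DISTINCT class) = ?
"""
students = [row[0] for row in conn.execute(query, (*wanted, len(wanted)))]
print(students)
```

With COUNT(*) the duplicated row would bring Kim's count to 3 and she would be returned by mistake; with COUNT(DISTINCT class) only Ally comes back.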
I am using Postgres 9.5. If I update certain values of a row and commit, is there any way to fetch the old value afterwards? Is there something like a flashback? It would have to be a selective flashback: I don't want to roll back the entire database, I just need to revert one row.
Short answer - it is not possible.
But for future readers, you can create an array field with historical data that will look something like this:
Column | Type |
----------------+--------------------------+------
value | integer |
value_history | integer[] |
For more info read the docs about arrays
I am trying to limit results by somehow grouping them.
This query attempt should make things clear:
@namee ("Cameras") limit 5| @namee ("Mobiles") limit 5| @namee ("Washing Machine") limit 5| @namee ("Graphic Cards") limit 5
where namee is the column
Basically I am trying to limit results based upon specific criteria.
Is this possible? Is there any alternative way of doing what I want to do?
I am on Sphinx 2.2.9
There is no Sphinx syntax to do this directly.
The easiest would be to just run 4 separate queries directly and 'UNION' them in the application itself. Performance isn't going to be terrible.
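As a rough sketch of that application-side union, in Python: run_query here is a hypothetical callable standing in for whatever client call you use to query Sphinx (it is not a real Sphinx API), and fake_run_query just filters a local list so the example is runnable on its own.

```python
from typing import Callable, Dict, List


def search_per_term(
    run_query: Callable[[str, int], List[dict]],
    terms: List[str],
    limit: int,
) -> Dict[str, List[dict]]:
    """Run one limited query per term and collect the results by term."""
    return {term: run_query(term, limit) for term in terms}


# Stand-in data and query function (hypothetical, for illustration only).
FAKE_ROWS = [
    {"id": i, "namee": name}
    for i, name in enumerate(
        ["Cameras A", "Cameras B", "Cameras C", "Mobiles A", "Washing Machine A"]
    )
]


def fake_run_query(term: str, limit: int) -> List[dict]:
    """Pretend search: substring match on namee, truncated to `limit` rows."""
    return [row for row in FAKE_ROWS if term in row["namee"]][:limit]


results = search_per_term(fake_run_query, ["Cameras", "Mobiles"], 2)

# Flatten the per-term lists into one "unioned" result list.
merged = [row for rows in results.values() for row in rows]
print([row["id"] for row in merged])
```

Each term gets its own limit, which is exactly what the single-query attempt in the question was trying to express.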
... If you REALLY want to do it in Sphinx, you can exploit a couple of tricks to get close, but it gets very complicated.
You would need to create 4 separate indexes (or up to as many terms as you need!). Each has the same data, but with the field called something different (they duplicate each other!). You would also need an attribute on each one (more on why later).
source str1 {
sql_query = SELECT id, namee AS field1, 1 as idx FROM ...
sql_attr_uint = idx
}
source str2 {
sql_query = SELECT id, namee AS field2, 2 as idx FROM ...
sql_attr_uint = idx
}
... etc
Then create a single distributed index over the 4 indexes.
Then you can run a single query to get all the results kinda magically unioned...
MATCH('@@relaxed @field1 ("Cameras") | @field2 ("Mobiles") | @field3 ("Washing Machine") | @field4 ("Graphic Cards")')
(The @@relaxed is important, as the fields are different. The matches must come from different indexes.)
Now to limiting them... Because each keyword match must come from a different index, and each index has a unique attribute value, the attribute identifies which term matched.
In Sphinx there is a nice GROUP N BY, which returns only a certain number of results for each attribute value, so you could do the following (putting all that together):
SELECT *,WEIGHT() AS weight
FROM dist_index
WHERE MATCH('@@relaxed @field1 ("Cameras") | @field2 ("Mobiles") | @field3 ("Washing Machine") | @field4 ("Graphic Cards")')
GROUP 4 BY idx
ORDER BY weight DESC;
simples eh?
(Note it only works if you want the same number of results, here 4, from each index. If you want different limits it is much more complicated!)
Currently I have a table schema that looks like this:
| id | visitor_ids | name |
|----|-------------|----------------|
| 1 | {abc,def} | Chris Houghton |
| 2 | {ghi} | Matt Quinn |
The visitor_ids are all GUIDs, I've just shortened them for simplicity.
A user can have multiple visitor ids, hence the array type.
I have a GIN index created on the visitor_ids field.
I want to be able to lookup users by a visitor id. Currently we're doing this:
SELECT *
FROM users
WHERE visitor_ids && array['abc'];
The above works, but it's really really slow at scale - it takes around 45ms which is ~700x slower than a lookup by the primary key. (Even with the GIN index)
Surely there's got to be a more efficient way of doing this? I've looked around and wasn't able to find anything.
Possible solutions I can think of could be:
The current query is just bad and needs improving
Using a separate user_visitor_ids table
Something smart with special indexes
Help appreciated :)
I tried the second solution - 700x faster. Bingo.
I feel like this is an unsolved problem however, what's the point in adding arrays to Postgres when the performance is so bad, even with indexes?
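For anyone wondering what the second solution looks like, here is a minimal sketch. It uses Python's built-in sqlite3 module purely so the example is runnable; the real setup is Postgres, and the timings quoted above are not reproduced here. The schema is an assumption: it treats each visitor id as belonging to exactly one user, so visitor_id can be the primary key and the lookup becomes a plain B-tree index hit plus a join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    -- Assumption: a visitor id maps to one user, so it can be the key.
    CREATE TABLE user_visitor_ids (
        visitor_id TEXT PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id)
    );
""")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "Chris Houghton"), (2, "Matt Quinn")],
)
conn.executemany(
    "INSERT INTO user_visitor_ids VALUES (?, ?)",
    [("abc", 1), ("def", 1), ("ghi", 2)],
)


def find_user(visitor_id):
    """Look up the user that owns a visitor id via the indexed join table."""
    return conn.execute(
        """
        SELECT u.id, u.name
        FROM users AS u
        JOIN user_visitor_ids AS v ON v.user_id = u.id
        WHERE v.visitor_id = ?
        """,
        (visitor_id,),
    ).fetchone()


print(find_user("def"))
```

If a visitor id could legitimately belong to several users, the primary key would instead need to be the (visitor_id, user_id) pair with a separate index on visitor_id.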
I'm working on an application that has the following use case:
Users upload csv files, which need to be persisted across application restarts
The data in the csv files need to be queried/sorted etc
Users specify the query-able columns in a csv file at the time of uploading the file
The currently proposed solution is:
For small files (much more common), transform the data into xml and store it either as a LOB or in the file system. For querying, slurp the whole data into memory and use something like XQuery
For larger files, create dynamic tables in the database (MySQL), with indexes on the query-able columns
Although we have prototyped this solution and it works reasonably well, it's keeping us from supporting more complex file formats such as XML and JSON. There are also a few more niggling issues with the solution that I won't go into.
Considering the schemaless nature of NoSQL databases, I thought they might be used to solve this problem. I have no practical experience with NoSQL though. My questions are:
1. Is NoSQL well suited for this use case?
2. If so, which NoSQL database?
3. How would we store csv files in the DB (collection of key-value pairs where the column headers make up the keys and the data fields from each row make up the values?)
4. How would we store XML/JSON files with possibly deeply hierarchical structures?
5. How about querying/indexing and other performance considerations? How does that compare to something like MySQL?
Appreciate the responses and thanks in advance!
example csv file:
employee_id,name,address
1234,XXXX,abcabc
001001,YYY,xyzxyz
...
DDL statement:
CREATE TABLE `employees`(
`id` INT(6) NOT NULL AUTO_INCREMENT,
`employee_id` VARCHAR(12) NOT NULL,
`name` VARCHAR(255),
`address` TEXT,
PRIMARY KEY (`id`),
UNIQUE INDEX `EMPLOYEE_ID` (`employee_id`)
);
for each row in csv file
INSERT INTO `employees`
(`employee_id`,
`name`,
`address`)
VALUES (...);
Not really a full answer, but I think I can help on some points.
For number 2, I can at least give this link that helps sorting out NoSQL implementations.
For number 3, using a SQL database (but should fit as well for a NoSQL system), I would represent each column and each row as individual tables, and add a third table with foreign keys to columns and rows, and with the value of the cell. You get a big table with easy filtering.
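Here is a minimal sketch of that three-table layout, using Python's built-in sqlite3 and csv modules as a stand-in for whatever database you pick. The table names (csv_columns, csv_rows, csv_cells) and the index are my own assumptions for the example; the sample data is the csv from the question.

```python
import csv
import io
import sqlite3

CSV_TEXT = "employee_id,name,address\n1234,XXXX,abcabc\n001001,YYY,xyzxyz\n"

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE csv_columns (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE csv_rows (id INTEGER PRIMARY KEY);
    -- One record per cell, pointing at its row and its column.
    CREATE TABLE csv_cells (
        row_id INTEGER NOT NULL REFERENCES csv_rows(id),
        column_id INTEGER NOT NULL REFERENCES csv_columns(id),
        value TEXT,
        PRIMARY KEY (row_id, column_id)
    );
    CREATE INDEX csv_cells_lookup ON csv_cells (column_id, value);
""")

reader = csv.reader(io.StringIO(CSV_TEXT))
header = next(reader)
column_ids = [
    conn.execute("INSERT INTO csv_columns (name) VALUES (?)", (name,)).lastrowid
    for name in header
]
for record in reader:
    row_id = conn.execute("INSERT INTO csv_rows DEFAULT VALUES").lastrowid
    conn.executemany(
        "INSERT INTO csv_cells VALUES (?, ?, ?)",
        [(row_id, cid, val) for cid, val in zip(column_ids, record)],
    )

# Filter: which rows have name = 'YYY'?
matches = conn.execute(
    """
    SELECT cells.row_id
    FROM csv_cells AS cells
    JOIN csv_columns AS cols ON cols.id = cells.column_id
    WHERE cols.name = ? AND cells.value = ?
    """,
    ("name", "YYY"),
).fetchall()
print(matches)
```

Every value is stored as text here, which keeps things like the leading zeros in 001001 intact but means numeric sorting needs a cast at query time.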
For number 4, you need to "represent hierarchical data in a table"
The common approach to this would be to have a table with attributes, and a foreign key to the same table, pointing to the parent, like this for example :
+----+------------+------------+--------+
| id | attribute1 | attribute2 | parent |
+----+------------+------------+--------+
| 0 | potato | berliner | NULL |
| 1 | hello | jack | 0 |
| 2 | hello | frank | 0 |
| 3 | die | please | 1 |
| 4 | no | thanks | 1 |
| 5 | okay | man | 4 |
| 6 | no | ideas | 2 |
| 7 | last | one | 2 |
+----+------------+------------+--------+
Now the problem is that, if you want to get, say, all the child elements of element 1, you'll have to query every item individually to obtain its children. Some other operations are hard, because they need to get a path to the object, traversing many other objects and making extra data queries.
One common workaround to this, and the one I use and prefer, is called modified pre-order tree traversal.
Using this technique, we need an extra layer between the data storage and the application, to fill some extra columns at each structure-altering modification. We will assign to each object three properties : left, right and depth.
The left and right properties will be filled counting each object from the top, traversing all the tree leaves recursively.
This is a vague approximation of the traversal algorithm for left and right (the part with depth can be easily guessed, it is just a few lines to add) :
1. Set the left attribute of the tree root (or of the first tree root if there are many) to 1.
2. Go to its first (or next) child. Set its left attribute to the last number plus one (here, 2).
3. Does it have any children? If yes, go back to number 2. If no, set its right to the last number plus one.
4. Go to the next child, and do the same as in 2.
5. If there are no more children, go to the next child of the parent and do the same as in 2.
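The steps above can be sketched as a short recursive function. This is only an illustration in Python, not the code of any particular library: number_tree is a made-up name, the input is the id-to-parent mapping from the example table, and counting depth from 0 at the root is my own assumption.

```python
def number_tree(parents):
    """Assign (left, right, depth) to every node of a parent-pointer tree."""
    children = {}
    roots = []
    for node, parent in parents.items():
        if parent is None:
            roots.append(node)
        else:
            children.setdefault(parent, []).append(node)

    numbering = {}
    counter = 0

    def visit(node, depth):
        nonlocal counter
        counter += 1
        left = counter  # assigned on the way down
        for child in children.get(node, []):
            visit(child, depth + 1)
        counter += 1  # right is assigned once all children are numbered
        numbering[node] = (left, counter, depth)

    for root in roots:
        visit(root, 0)
    return numbering


# id -> parent, taken from the example table above.
PARENTS = {0: None, 1: 0, 2: 0, 3: 1, 4: 1, 5: 4, 6: 2, 7: 2}
numbering = number_tree(PARENTS)
for node in sorted(numbering):
    print(node, numbering[node])
```

For the eight rows of the example table the root ends up with left 1 and right 16, and a leaf always has right equal to left plus one.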
Here is a picture explaining the result we get :
(source: narod.ru)
Now it is much easier to find all descendants of an object, or all of its ancestors. This can be done with only a single query, using left and right.
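As a sketch of those single queries, here is a small example with Python's built-in sqlite3 module. The column names lft and rgt are my own choice (LEFT and RIGHT are SQL keywords), and the values are what you get by numbering the example table depth-first, starting at 1 for the root.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, lft INTEGER, rgt INTEGER)")
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?, ?)",
    [
        (0, 1, 16), (1, 2, 9), (2, 10, 15), (3, 3, 4),
        (4, 5, 8), (5, 6, 7), (6, 11, 12), (7, 13, 14),
    ],
)


def descendants(node_id):
    """All nodes whose interval lies strictly inside the given node's interval."""
    rows = conn.execute(
        """
        SELECT child.id
        FROM nodes AS child
        JOIN nodes AS parent
          ON child.lft > parent.lft AND child.rgt < parent.rgt
        WHERE parent.id = ?
        ORDER BY child.lft
        """,
        (node_id,),
    )
    return [row[0] for row in rows]


def ancestors(node_id):
    """All nodes whose interval strictly contains the given node's interval."""
    rows = conn.execute(
        """
        SELECT parent.id
        FROM nodes AS parent
        JOIN nodes AS child
          ON parent.lft < child.lft AND parent.rgt > child.rgt
        WHERE child.id = ?
        ORDER BY parent.lft
        """,
        (node_id,),
    )
    return [row[0] for row in rows]


print(descendants(1), ancestors(5))
```

Ordering by lft returns descendants in depth-first order and ancestors from the root downwards, with no recursion in either case.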
What is important when using this is having a good implementation of the layer between the data and the application, handling the left, right and depth attributes. These fields have to be adjusted when :
An object is deleted
An object is added
The parent field of an object is modified
This can be done with a parallel process, using locks. It can also be implemented directly between the data and the application.
See these links for more information about trees :
Managing hierarchies in SQL: MPTT/nested sets vs adjacency lists vs storing paths
MPTT With Django lib
http://www.sitepoint.com/hierarchical-data-database-2/
I personally had great results with django-nonrel and django-mptt the few times I did NoSQL.