I'm having trouble finding information about how to look up records by an index using sequelize/postgres for node.js.
The only documentation of indexes appears to be here: http://sequelizejs.com/documentation#migrations-functions
To illustrate what I'm asking, let's take a simple model where there are Persons, Projects, and Tasks. Each person references a number of assigned tasks, and each project has a number of assigned tasks. Each task has a back-reference to its project and person. We'll assume that each person has only one task per project.
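For concreteness, the models look roughly like this (the field names here are just illustrative, not my real schema):

// A rough sketch of the models described above.
var Sequelize = require('sequelize');
var sequelize = new Sequelize('database', 'username', 'password', { dialect: 'postgres' });

var Person  = sequelize.define('Person',  { name: Sequelize.STRING });
var Project = sequelize.define('Project', { name: Sequelize.STRING });
var Task    = sequelize.define('Task',    { description: Sequelize.STRING });

// Each person and each project has a number of tasks;
// each task has a back-reference to its person and project.
Person.hasMany(Task);
Project.hasMany(Task);
Task.belongsTo(Person);
Task.belongsTo(Project);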
Let's say I have a person and a project, and I need to find whether there is an associated task. I've tried to support this with an index on Task over person/project.
I've found through searches that you can also create indexes through the slightly unintuitive syntax:
global.db.sequelize.getQueryInterface().addIndex(
  'Tasks',
  ['ProjectId', 'PersonId'],
  {
    indexName: 'IndexName',
    indicesType: 'UNIQUE'
  }
);
This seems to work, and the index is created. However, I can't find a reference anywhere in the docs or even on the internet about how to use this index to find the task.
Any suggestions?
You have a fundamental misunderstanding of how an RDBMS is supposed to work.
It is supposed to pick the best indexes for each query based upon the pattern of database access required. This is performed by the "planner" in the RDBMS.
Some terms you will find useful to search against as you use PostgreSQL:
- Primary Key
- Foreign Key
- Constraint (both the above are these)
- EXPLAIN ANALYSE (or ANALYZE depending on your dialect of English)
- http://explain.depesz.com/ - a useful site to colour the above explains
- pg_dump / pg_restore - make sure you can use these tools to backup your database
Finally, make yourself a good hot cup of tea or coffee and sit down and at least skim through the PostgreSQL manuals. At least it will give you an idea of where to find further information.
Good Luck!
True, I'm coming from Cache's database structure, which very few people actually use.
I think the best answer to the question is that you just do the lookup as normal, and PostgreSQL takes care of the rest. Good to know!
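For anyone who finds this later, the lookup really is just the ordinary one; Postgres decides by itself whether to use the unique index. A minimal sketch, assuming the Task model above (depending on your Sequelize version the promise method may be .success() rather than .then()):

// Ordinary lookup; the Postgres planner will use the (ProjectId, PersonId)
// index on Tasks by itself if it considers that the cheapest plan.
// `project` and `person` are previously loaded instances.
Task.find({
  where: { ProjectId: project.id, PersonId: person.id }
}).then(function (task) {
  if (task) {
    console.log('Found existing task', task.id);
  } else {
    console.log('No task for this person/project yet');
  }
});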
Related
I'm planning to build a wiki/resource app which, by itself, makes sense to be using Mongo for. However, the primary purpose of the app is to have associative tables showing connections between individual content items. A majorly simplified example: Odin, Zeus, and Jupiter would form a row within an "Allfather" association. The problem is that these tables could grow indefinitely, and it seems like developing this type of network in Mongo would be a rather complex and frustrating experience.
I've been thinking about using Mongo for the pages and just maintaining a small Postgres database for these associations, but something intuitively feels wrong about that. However, I'm an experienced frontend dev who's just starting to dabble in backend/database work, so I'm not willing to trust my intuitions on databases yet, haha.
Is postgres + mongo a good solution for the above problem, or is this where something like a graph database (which I only learned about yesterday) would come into play?
After spending the last several hours researching further, it does appear that a graph database is the right solution to managing the "association" feature that I'm looking to develop here because the actual relationships will be rather multidimensional in nature.
Furthermore, I've decided to go with ArangoDB, as it merges key-value (i.e. Redis, or Postgres' hstore IIRC), document store (i.e. Mongo's documents or Postgres' JSONB), and graph database functionality. Arango can do joins between documents and, even better, it has a single, unified query language that works across all three types of models. It also has a rather robust tooling environment around it already that seems pretty promising.
I found this youtube video rather enlightening as well if anyone wants a nice introduction to understanding why you might want to use a "multi-model database" like ArangoDB.
I have a materialized view (which is very much a table) where I need to run WHERE ... IN kinds of queries.
The column I want to query (say view_id) definitely has repeated values (15-20 repetitions).
The WHERE ... IN queries would also be very large, i.e. they would contain a lot of view_id values to query.
Should I go ahead and create an index on this column?
Will it give me some performance improvements?
I have another column which would help me create a multi-column (unique) index. Would that be a better option?
With questions such as these on performance, there is no substitute for testing it with your exact case. There's little harm in trying it out (even on a production system, but utilize a test system if you can!), other than perhaps slowing performance until you undo what you did. Postgres makes this kind of tinkering safe.
@tim-biegeleisen's first comment is spot on: with your setup, your cardinality is reduced, but that doesn't mean it's not a win.
In short, try it and see. There is no better answer you will get than what your own dataset and access patterns will give you.
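To make "try it" concrete, here is a minimal sketch using the node-postgres (pg) client; the table, index, and column names are placeholders, and plain psql works just as well:

// A rough, hand-rolled test: build the candidate index, then look at the plan.
// Table, index, and column names are placeholders; adjust to your view.
var pg = require('pg');
var client = new pg.Client('postgres://user:password@localhost/mydb');

client.connect(function (err) {
  if (err) throw err;
  client.query('CREATE INDEX idx_mv_view_id ON my_matview (view_id)', function (err) {
    if (err) throw err;
    client.query(
      'EXPLAIN ANALYZE SELECT * FROM my_matview WHERE view_id IN (1, 2, 3, 4)',
      function (err, result) {
        if (err) throw err;
        result.rows.forEach(function (row) {
          console.log(row['QUERY PLAN']); // look for "Index Scan" vs. "Seq Scan" and the timings
        });
        client.end();
      }
    );
  });
});

If the plan still shows a sequential scan, or the timings don't improve, a DROP INDEX undoes the experiment.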
I would like to create an application which searches for similar documents in its database; e.g. the user uploads a document (text, image, etc.), and I would like to query my application for similar ones.
I have already created the necessary algorithms for the process (fingerprinting, feature extraction, hashing, hash comparison, etc.); I'm looking for a framework that couples all of these.
For example, if I were to implement it in Lucene, I would do the following:
Create a custom "tokenizer" and "stemmer" (~ feature extraction and fingerprinting)
Then add the created elements to the Lucene index
And finally use the MoreLikeThis class to find the similar documents
So basically Lucene might be a good choice, but as far as I know, Lucene is not meant to be a document similarity search engine, but rather a term-based search engine.
My question is: are there any applications/frameworks that might fit the above-mentioned problem?
Thanks,
krisy
UPDATE: It seems like the process I described above is called Content-Based Media (Sound, Image, Video) Retrieval.
There are many projects that use Lucene for this, see: http://wiki.apache.org/lucene-java/PoweredBy (Lire, Alike, etc.), but I still haven't found any dedicated framework ...
Since you're using Lucene, you might take a look at SOLR. I do realize it's not a dedicated framework for your purpose either, but it does add stuff on top of Lucene that comes in quite handy. Given the pluggability of Lucene, its track record and the fact that there are a lot of useful resources out there, SOLR might help you get your job done.
Also, the answer that @mindas pointed to links to a blog post describing the technical details of how to accomplish your goal with SOLR (but you have probably already read that in the meantime).
If I am understanding correctly, you have your own database, and while/after a user uploads you are searching it to see whether the upload is a duplicate, a copy, or a similar document.
If that is the case, the problem domain is very big.
1) For images you must use pattern matching; there are a few papers available about image duplicate finders on the net. Search for them and you will get many options.
2) For documents there is again a characteristic division:
DOC(x)
PDF
TXT
RTF, etc..
Each document type carries different properties; Lucene may help you here, but it is a search engine.
When searching for language patterns there are many things we need to check, since you are searching for similar (not exactly the same) documents.
So a fuzzy language matching program will come in handy.
This requirement is too large for a forum page to explain everything, but I hope this much will do.
I am developing a system which has multiple modules,
Social Media User Demography - (Document) - Name, loc, interests, work, education
Social Media User Connections - (Graph) - friends
CRM - (Rows and Columns) - telecom + banking etc
to name a few. I'm pretty sure that I have already crossed millions of records in each one of them.
When I look for a NoSQL database to choose from, I have at least 10 in each category. For document databases I have a whole list, ranging from MongoDB to DjonDB. It's the same case when I look for a graph database, and so on and so forth. And I have also seen other key-value store databases, columnar databases, etc., to name a few, at http://nosql-database.org/.
So I wanted to know: are there any generic rules of thumb that I should follow to choose among these databases, such as what a columnar DB is optimized for, what type of data a key-value store suits best, etc.?
What are the best suited databases for what type of data and why? and most importantly
What are the worst suited databases for what type of data and why?
Thanks
This is a very open question, but I'll give it a shot.
Some things to keep in mind:
Pick a database that has a great (not good, great) community around it. You might find FooDB and it claims to do everything you want, but if no one is contributing to it, then you'll have built your application around a dead technology. You want active contributions, lots of customers with production deployments, and ideally something not in its first version.
Try to find technologies that play well together. For example, Elastic Search, MongoDB, CouchDB, Couchbase all more or less work with JSON. That should help you narrow your choices of technologies.
I wouldn't try to spread myself too thin. Each type of database (graph, document, row/column, key-value pair) has its own learning curve. It takes quite a while to learn how to model data in a denormalized fashion. The more variety you have, the harder it will be to maintain all those different databases.
I don't know why I rarely see this as advice, but I would pick something you actually like developing in. Does the query syntax seem intuitive and fun? If not, you're going to hate developing in it. This isn't the most important factor, but I think it should be considered.
I am currently using MySQL. I am finding that my schema is getting incredibly complicated, so I am looking for a new db that will suit my needs:
Let's assume I am building a news aggregator (which collects news from multiple websites). I then run algorithms to determine if two news articles from different sites are actually referring to the same topic. I run this algorithm to cluster news together. The relationship is depicted below:
cluster
\--news1
   \--word1
   \--word2
\--news2
   \--word3
\--news3
   \--word1
   \--word3
And then I will apply some magic and determine the importance of each word. Summing all the importance of each word gives me the importance of a news article. Summing the importance of each news article gives me the importance of a cluster.
Note that above the cluster level there are also subgroups (like splits by region, etc.) and categories (like sports, etc.), and I have to determine their importance for a particular day as well.
I have used views in the past to do so, but I realized that views are very slow. So I will normally do an insert into an actual table and index it for better performance. As you can see, this leads to multiple derived tables like (cluster, importance), (news, importance), (words, importance), etc., which can get pretty messy.
Also the "importance" metric will change. It has become increasingly difficult to alter tables, update data (which I am using TRUNCATE TABLE) and then inserting from null.
I am currently looking into something schemaless like MongoDB. I do not need distributedness. I would very much want something that is reasonably fast (and can be indexed) and a lot more flexible than a traditional RDBMS.
NEW
As requested by various people, I will post my usage of this database (these are not actual SQL queries; I hope everyone here can understand them):
TABLE word ( word_id, news_id, word )
TABLE news ( news_id, date, site .. )
TABLE clusters ( cluster_id, cluster_leader, cluster_name, ... )
TABLE mapping_clusters_news( cluster_id, news_id)
TABLE word_importance (word_id, score)
TABLE news_importance (news_id, score)
TABLE cluster_importance( cluster_id, score)
TABLE group_importance( cluster_id, score)
You might notice that the word table has an extra news_id column. This is to correspond to the word_importance table, because the same word can have a different importance in different articles (if you are familiar with tf-idf, this is basically something like that).
All the "importance" table now calculates the importance of each entity by averaging the importance of all the sub-entities below it. This means that Each cluster's importance is determined by all the news inside it, each news's importance is determined by all the words inside it etc.
TYPICAL USAGE:
1) SELECT clusters FROM db THAT HAS word1, word2, word3, .. ORDER BY cluster_importance_score
2) SELECT words FROM db BELONGING TO THE CLUSTER cluster_id=5 ORDER BY word_importance score.
3) SELECT groups ordered by importance score.
As you can see, I am deriving a lot of scores from each layer, and someone has been telling me to use a materialized view for this purpose (which PostgreSQL supports). However, as you can see, this simple schema already consists of 8 tables (my actual db consists of 26 tables of crap like that), which adds so many additional layers of complexity for maintenance.
NOTE THIS IS NOT ABOUT FULL-TEXT SEARCH.
When the schema is getting complicated, a graph database can be a good alternative. As I understand your domain, you have lots of entities related to other entities in different ways. Would it make sense to you to model this as a graph/network of entities? As food for thought I whipped up an example using Neo4j:
Example news-analysis model: http://github.com/neo4j-examples/domain-models/raw/master/news-analysis.png
In a graphdb you can set properties on both nodes and relationships, which could be useful in your case (for instance the number of times a word is used in a news entry could be added to the relationship to that word). BTW, I added an extra is_related relationship between two news items, as I thought that could be interesting as well.
How about db4o?
ORM means "Object-relational mapper". Not using a relational database wouldn't make much sense. I'll pretend you meant "I want to be able to serialize objects".
I don't understand why distributedness is not required. Could you elaborate on that?
Personally, I would recommend Cassandra. It still has reasonably close ties to (by which I mean it is easy to integrate with) Hadoop, which you will probably eventually want for your processing. As an added bonus, there's Telephus, so Cassandra supports Twisted beautifully. Cassandra's method of conflict resolution (currently timestamps, soon-ish vector clocks) might work for your changing metric, as long as you don't mind getting the old value for as long as the metric hasn't been recalculated. Otherwise, you might move up a level and simply store multiple versions of the data with different versions of the metric. That way, if you decide a metric is a bad idea, you don't have to recompute.
Cassandra, unfortunately, does not have something that serializes/deserializes objects very well yet. However, for the thin wrappers you would be writing (essentially structs with a few methods), would writing a fromCassandra @classmethod really be that big a deal?
PostgreSQL may be "schema based", but it kind of feels like you're throwing the baby out with the bathwater. If you don't need a distributed db or a particularly schema-less design (and offhand it doesn't sound like you do, though you appear to think you do), then I'm not sure why you would want MongoDB. Postgres has lots of indexing options, and it sounds like its built-in full text search would be good for you. If you're used to MySQL, where altering tables (you mentioned issues there) can be a nightmare, it's mostly better in Postgres. I'm a fan of both Postgres and MongoDB; it just doesn't sound like there's a good reason to move away from a relational db for data that certainly sounds relational in nature.
In a word, YES, you should probably be looking at something else: Cassandra, Hadoop, MongoDB, something.
MongoDB is basically going to reduce your sample schema to "clusters" and "news", with everything else basically being contained in those two.
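To make that concrete, here is a hypothetical shape for one denormalized "news" document, with its words embedded; the field names are guesses based on the tables in your question:

{
  "site": "example.com",
  "date": "2014-05-01",
  "cluster_id": 5,
  "importance": 0.37,
  "words": [
    { "word": "election", "importance": 0.8 },
    { "word": "poll",     "importance": 0.5 },
    { "word": "turnout",  "importance": 0.2 }
  ]
}

The word_importance and news_importance tables disappear: the scores simply live on the embedded words and on the article itself, and clusters either embed or reference their news items.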
The good news:
This will make it easy to modify fields.
Map-reduce operations are a natural fit for the type of work that you're doing. You perform a map-reduce and then save the results back to the "news" item and all will be well (there's a rough sketch of this at the end of this answer).
The bad news:
It's easy to lose track of the structure of data with something like Mongo. Hadoop and Hive typically force your schema a little more. But in any case, you'll need to write down some form of schema or you'll just drown.
If you plan to do this for some non-trivial amount of data, then you're going to want "horizontal" scalability. MongoDB is "ok" for this, Hadoop is definitely a "leader" for this.
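Regarding the map-reduce point under "the good news", here is a minimal mongo-shell sketch of recomputing each article's importance from its embedded words and writing it back, assuming the document shape sketched earlier (averaging is just a stand-in for whatever metric you actually use):

// Recompute each article's importance as the average of its embedded words'
// importance, then write the score back onto the article (mongo shell).
db.news.mapReduce(
  function () {
    var total = 0;
    this.words.forEach(function (w) { total += w.importance; });
    emit(this._id, { sum: total, count: this.words.length });
  },
  function (key, values) {
    var out = { sum: 0, count: 0 };
    values.forEach(function (v) { out.sum += v.sum; out.count += v.count; });
    return out;
  },
  {
    out: { inline: 1 },
    finalize: function (key, value) { return value.sum / value.count; }
  }
).results.forEach(function (r) {
  db.news.update({ _id: r._id }, { $set: { importance: r.value } });
});

The same pattern rolls news scores up into cluster scores, and when your "importance" formula changes you just rerun the job instead of truncating and re-filling derived tables.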