I'm working my way through a postgres query plan for the first time.
I'm having some trouble because I can't seem to find any documentation that describes what each of the plan nodes is. In many cases the name gives me a reasonable guess, but in several the name of the plan node is too generic for me to have much confidence in it.
Where can I find a list of types of plan nodes, with descriptions of each?
Chapter "56.1. Row Estimation Examples" explaines a lot, take a look.
Adding to Frank's answer, you might also want to peek at:
http://explain.depesz.com/
It reformats the plans in a more readable manner.
I know there are several ways of creating a dataframe in spark.
Using toDF().
Using createDataFrame().
Using spark.read (it can be csv/avro/text/json or any kind of file)
NOTE: There may be other methods apart from the above three. I'd be happy if you mention those as well.
Let's say I'm reading raw data from HDFS and storing it in a DataFrame.
My question is, which of the above methods will give better performance?
I'm a Spark practitioner, so any useful information is highly appreciated.
I normally use spark.read.text / spark.read.csv to create a DataFrame. Kindly suggest which method would be optimal.
This is a very broad question. To define optimal, you must first define a way to rank one method against another:
The quickest?
The easiest to use?
The easiest to read?
...
As you can imagine, this can only be answered on a case-by-case basis. And this is in some ways quite subjective as well.
So instead of answering your question directly, I will talk about a tool with which you can decide these questions for yourself (again on a case-by-case basis).
This tool is the explain method on any type of Dataset/Dataframe. As the docs say:
Prints the physical plan to the console for debugging purposes.
So now, you can have a look at the physical plan for yourself when executing these! You might even get identical physical plans, which would mean that there is no difference at runtime between certain methods. But if the plans are different, you might be able to notice something that will give you a preference toward one or another.
So in your examples, you could do:
...toDF().explain
...createDataFrame(...).explain
spark.read.csv(...).explain
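For a concrete comparison, you could run something like the following rough sketch (assuming a SparkSession named spark, a made-up Record case class and a made-up HDFS path; this only shows where the plans come from, it is not a benchmark):

import org.apache.spark.sql.SparkSession

object PlanComparison {
  case class Record(id: Int, value: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("plan-comparison").getOrCreate()
    import spark.implicits._

    val rows = Seq(Record(1, "a"), Record(2, "b"))

    // 1. toDF() on a local collection
    rows.toDF().explain()

    // 2. createDataFrame() on the same collection
    spark.createDataFrame(rows).explain()

    // 3. spark.read on a file (the path is invented for the example)
    spark.read.option("header", "true").csv("hdfs:///data/records.csv").explain()

    spark.stop()
  }
}

For the first two you would typically see a LocalTableScan, while the file read shows a FileScan node, so comparing the output line by line is usually quite informative.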
Hope this helps!
Has anyone successfully used Drools as a kind of "rating engine" before? What are your experiences?
I'm trying to process a couple of millions of records (of slightly different types) and apply rating/pricing to these records.
Rating would be based on tables or database lookups, as well as long chains of if/then/else conditions using the lookup data.
Traditional rating engines don't employ rule mechanisms in ways that I'm comfortable with...
thanks for your help
To provide a slightly more informative response (although your question can't really be answered based on the very vague description you've given): your "rating" is just one of the many names for what I would call a "classification problem", and that has been solved many times using Drools.
However, this isn't to say that your problem, with its particular environmental flavour and expected performance (how fast do you want the 2M records processed?), can best be solved using Drools - especially when the measure for deciding quality isn't settled. (For instance: is ease of maintenance more important than top efficiency?)
Go ahead and rig up a prototype and run a test to see how it goes. That will give you a more reliable answer than anything else. If someone says that something similar couldn't be done, it could be due to bad rule coding. If someone says that something similar was done successfully, it may not have had one of the quirks of your setup. And so on.
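If it helps to see how small such a prototype can be, here is a very rough harness sketch in Scala using the standard Drools 6+ KIE API. The session name "ratingSession", the RatingRecord fact class and its fields are all invented for the example - the actual rating logic would live in your DRL rules and kmodule.xml, which aren't shown:

import org.kie.api.KieServices
import scala.beans.BeanProperty

object RatingPrototype {
  // Hypothetical fact class; your real record types would replace this.
  case class RatingRecord(@BeanProperty id: Long, @BeanProperty var price: Double)

  def main(args: Array[String]): Unit = {
    val ks = KieServices.Factory.get()
    // Picks up kmodule.xml and the DRL files from the classpath.
    val container = ks.getKieClasspathContainer()
    // The session name is assumed to be defined in kmodule.xml.
    val session = container.newKieSession("ratingSession")
    try {
      val records = Seq(RatingRecord(1L, 0.0), RatingRecord(2L, 0.0))
      records.foreach(r => session.insert(r))
      // The rules in the DRL apply the rating/pricing to the inserted facts.
      session.fireAllRules()
      records.foreach(r => println(s"${r.id} -> ${r.price}"))
    } finally {
      session.dispose()
    }
  }
}

Timing a run like this over a representative sample of your records (and a realistic set of lookup tables) will tell you far more about whether Drools can handle the 2M rows than any general advice.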
I'm having trouble finding information about how to look up records by an index using sequelize/postgres for node.js.
The only documentation of indexes appears to be here: http://sequelizejs.com/documentation#migrations-functions
To illustrate what I'm asking, let's take a simple model where there are Persons, Projects, and Tasks. Each person references a number of assigned tasks, and each project has a number of assigned tasks. Each task has a back-reference to its project and person. We'll assume that each person only has one task per project.
Let's say I have a person and project, and I need to find if there is a task associated. I've tried implementing this through an index on task of person/project.
I've found through searches that you can also create indexes through the slightly unintuitive syntax:
global.db.sequelize.getQueryInterface().addIndex(
  'Tasks',
  ['ProjectId', 'PersonId'],
  {
    // composite unique index on (ProjectId, PersonId)
    indexName: 'IndexName',
    indicesType: 'UNIQUE'
  }
);
This seems to work, and the index is created. However, I can't find a reference anywhere in the docs or even on the internet about how to use this index to find the task.
Any suggestions?
You have a fundamental misunderstanding of how an RDBMS is supposed to work.
It is supposed to pick the best indexes for each query based upon the pattern of database access required. This is performed by the "planner" in the RDBMS.
Some terms you will find useful to search against as you use PostgreSQL:
- Primary Key
- Foreign Key
- Constraint (both the above are these)
- EXPLAIN ANALYSE (or ANALYZE, depending on your dialect of English) - a short sketch of using it follows this list
- http://explain.depesz.com/ - a useful site to colour the above explains
- pg_dump / pg_restore - make sure you can use these tools to backup your database
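If you want to see for yourself which index the planner picks, the EXPLAIN ANALYSE mentioned above is the tool. A minimal sketch, using plain JDBC rather than Sequelize (the connection details and literal values are made up; the table, column and index names follow the example in the question):

import java.sql.DriverManager

object ExplainCheck {
  def main(args: Array[String]): Unit = {
    // Made-up connection details; point this at your own database.
    val conn = DriverManager.getConnection(
      "jdbc:postgresql://localhost:5432/mydb", "me", "secret")
    try {
      // The "normal" lookup, wrapped in EXPLAIN ANALYZE so PostgreSQL
      // reports which plan (and which index, if any) it chose.
      val rs = conn.createStatement().executeQuery(
        """EXPLAIN ANALYZE
          |SELECT * FROM "Tasks"
          |WHERE "ProjectId" = 42 AND "PersonId" = 7""".stripMargin)
      while (rs.next()) println(rs.getString(1)) // one plan line per row
    } finally {
      conn.close()
    }
  }
}

If the composite index is being used, you should see an Index Scan using IndexName in the output; a Seq Scan simply means the planner judged a different path cheaper, which is perfectly normal for small tables.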
Finally, make yourself a good hot cup of tea or coffee and sit down and at least skim through the PostgreSQL manuals. At least it will give you an idea of where to find further information.
Good Luck!
True, I'm coming from Caché's database structure, which very few people actually use.
I think the best answer to the question is that you just do the lookup as normal, and PostgreSQL takes care of the rest. Good to know!
Folks,
I have a bunch of documents (approx 200k) that have a title and an abstract. There is other metadata available for each document, for example category (only one of cooking, health, exercise, etc.) and genre (only one of humour, action, anger, etc.). The metadata is well structured and all of it is available in a MySQL DB.
I need to show our users related documents while they are reading one of these documents on our site. I also need to provide the product managers with weightings for title, abstract and metadata to experiment with for this service.
I am planning to run clustering on top of this data, but am hampered by the fact that all the Mahout clustering examples use either DenseVectors built on top of numbers, or Lucene-based text vectorization.
The examples are either numeric data only or text data only. Has anyone solved this kind of problem before? I have been reading the Mahout in Action book and the Mahout wiki, without much success.
I can do this from first principles - extract all titles and abstracts into a DB, calculate TF-IDF & LLR, treat each word as a dimension and go about this experiment with a lot of code writing. That seems like a longish way to the solution.
That, in a nutshell, is where I am stuck - am I doomed to first principles, or is there a tool/methodology that I somehow missed? I would love to hear from folks out there who have solved a similar problem.
Thanks in advance
You have a text similarity problem here, and I think you're thinking about it correctly. Just follow any example concerning text. Is it really a lot of code? Once you count the words in the docs you're mostly done. Then feed that into whatever clusterer you want. Term extraction is not something you do with Mahout, though there are certainly libraries and tools that are good at it.
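To give a feel for how little code the counting step is, here is a rough plain-Scala sketch (no Mahout APIs; the Doc fields and the weights are invented to mirror the question). Metadata values are folded in as pseudo-terms so the product managers' weightings for title, abstract and metadata all end up in one map per document:

case class Doc(id: Int, title: String, abstractText: String, category: String, genre: String)

object TermCounts {
  // Lowercase and split on non-letters; real code would use a proper analyzer.
  private def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("[^a-z]+").filter(_.nonEmpty).toSeq

  // Weighted term-frequency map for one document.
  def termFrequencies(doc: Doc, titleWeight: Double, abstractWeight: Double,
                      metaWeight: Double): Map[String, Double] = {
    val weighted =
      tokenize(doc.title).map(_ -> titleWeight) ++
      tokenize(doc.abstractText).map(_ -> abstractWeight) ++
      Seq(("category=" + doc.category) -> metaWeight, ("genre=" + doc.genre) -> metaWeight)
    weighted.groupBy(_._1).map { case (term, ws) => term -> ws.map(_._2).sum }
  }
}

// Example:
// val tf = TermCounts.termFrequencies(
//   Doc(1, "Quick healthy meals", "Recipes and tips ...", "cooking", "humour"),
//   titleWeight = 2.0, abstractWeight = 1.0, metaWeight = 3.0)

From maps like these, applying IDF and handing the resulting vectors to whichever clusterer you prefer is exactly the part the standard text examples already cover.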
I'm actually working on something similar, but without the need to distinguish between numeric and text fields.
I have decided to go with the semanticvectors package, which handles the TF-IDF part, the building of the semantic vector space, and the similarity search. It uses a Lucene index.
Please note that you can also use the s-space package if semanticvectors doesn't suit you (if you go down that road of course).
The only caveat I'm facing with this approach is that the indexing part can't be done incrementally. I have to index everything every time a new document is added or an old document is modified. People using semanticvectors say they have very good indexing times, but I don't know how large their corpora are. I'm going to test these issues with the Wikipedia dump to see how fast it can be.
Can anyone recommend for/against the time-travel functions in postgresql's contrib/spi module? Is there an example available anywhere?
Thanks
The argument for time travel would be being able to look at frequently updated tables as they were at an earlier insertion/deletion point - say, a table of stock prices for a firm's investment portfolio.
The argument against would be the extra storage space it eats up.
Here is an Example of use.
See This discussion for an alternative approach to historical reporting.