Which is the intermediate language used by PostgreSQL for query processing and optimization? - postgresql

I'm currently writing a paper on PostgreSQL and I can't find anywhere (including the documentation) which intermediate language is used for query processing and optimization.

There is no intermediate "language" as such.
The SQL is parsed into a parse tree of Node*. This is then passed through the query rewriter, then transformed into a plan tree by the planner/optimizer. You can view these trees using the (documented) options debug_print_parse, debug_print_rewritten and debug_print_plan.
See the source code - src/backend/parser/, src/backend/rewrite and src/backend/optimizer/ in particular, along with src/include/nodes/nodes.h, plannodes.h, parsenodes.h, etc. Note that there are README files in both the optimizer and parser source directories.
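A quick way to see those trees from a psql session, using the documented settings mentioned above (`client_min_messages = log` makes the LOG-level output appear in your session rather than only in the server log):

```sql
SET debug_print_parse = on;       -- raw parse tree
SET debug_print_rewritten = on;   -- tree after the query rewriter
SET debug_print_plan = on;        -- final plan tree
SET client_min_messages = log;    -- show the LOG output in this session

SELECT relname FROM pg_class LIMIT 1;  -- any query; its trees get printed
```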

Related

Term for diff/delta on multiple files or data structures

I would like to know whether there is a proper term to describe "diffing" of / obtaining the delta between multiple files or data structures, such that the resulting "diff" contains first a description of the parts common to all files/structures, then descriptions of how this "base" file/structure must be modified to obtain the individual ones, ideally in a hierarchical fashion if some files/structures are more similar to each other than others.
There are some questions and answers about how to do this with certain tools (e.g. DIFF utility works for 2 files. How to compare more than 2 files at a time?), but as I want to do this for a specific type of data structure (namely JSON), I'm at a loss as to what I should even search for.
This type of problem seems to me like it should be common enough to have a name such as "hierarchical diff" (which however seems to be reserved for 2-way diffs on hierarchical data structures), "commonality finding", or something like that.
I guess a related concept about hierarchical ordering of commonalities and differences is formal concept analysis, but this operates on sets of properties rather than hierarchical data structures and won't help me much.
There are several accepted terms:
Data comparison (or Sequence comparison)
Delta encoding
Delta compression (or Differential compression)
Algorithms:
An O(ND) Difference Algorithm and Its Variations (Eugene Myers)
A technique for isolating differences between files (Paul Heckel)
The String-to-String Correction Problem with Block Moves (Walter Tichy)
Good Wikipedia links
Longest common subsequence problem
Comparison of file comparison tools
diff (Unix utility)
Some implementations
diff-match-patch (Neil Fraser - Google)
jsdifflib
jsondiffpatch
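To make the "delta encoding" idea concrete for the JSON case, here is a minimal sketch (my own illustration, not one of the algorithms above) that factors a set of flat dicts into the common "base" plus per-object deltas; real JSON handling would additionally need to recurse into nested objects and arrays, as jsondiffpatch does:

```python
def factor_out_common(objs):
    """Split flat JSON-like dicts into a common base plus per-object deltas.

    base  : the key/value pairs shared by every dict
    deltas: for each dict, the pairs it adds on top of the base
            (merging base with a delta reconstructs the original dict)
    """
    base = {k: v for k, v in objs[0].items()
            if all(o.get(k) == v for o in objs[1:])}
    deltas = [{k: v for k, v in o.items() if k not in base} for o in objs]
    return base, deltas

configs = [
    {"name": "app", "port": 80},
    {"name": "app", "port": 8080},
    {"name": "app", "port": 80, "debug": True},
]
base, deltas = factor_out_common(configs)
print(base)    # {'name': 'app'}
print(deltas)  # [{'port': 80}, {'port': 8080}, {'port': 80, 'debug': True}]
```

The hierarchical grouping you describe (clustering more-similar structures under shared intermediate bases) would repeat this factoring on subsets of the inputs.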

convert natural language to OCL script

I have some statements in natural language that I have to convert to OCL format. I am a beginner and have written some myself, but failed to convert some advanced-level statements. I have searched a lot, like this, this and this, but failed to find anything exact or similar.
statements that i want to convert:
1. Disable web server directory listing and ensure file metadata (e.g. GIT) and backup files are not present within web roots
2. Log access failures, alert admins on repeated failures
3. Limit API and control access to minimize the harm from automated attacks
4. Block SQL injection through query parameterization.
Please help me, thanks.
Natural language is imprecise and so cannot always be converted to a precise form such as OCL without additional human insight. The converse is possible.
The statements you want to convert appear to come from disciplined fields with established idioms, so you mostly just need to transliterate each idiom into OCL over a suitable model.
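For example, the first statement might be transliterated into invariants over a hypothetical model; every class and attribute name here (WebServer, directoryListingEnabled, webRoot, files, isVcsMetadata, isBackup) is an assumption about your model, not part of any standard:

```ocl
-- sketch only; all model names are hypothetical
context WebServer
inv NoDirectoryListing:
    self.directoryListingEnabled = false
inv NoMetadataOrBackupsInWebRoot:
    self.webRoot.files->forAll(f | not f.isVcsMetadata and not f.isBackup)
```

The other statements follow the same pattern: identify the entities involved (log entries, API endpoints, queries), model them as classes with attributes, then state each requirement as an invariant over those classes.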

How does the built-in Apache Hive hash function work and where can I find that documentation?

I'm working with Apache Hive and need to be certain of how the built-in hash function works. I found this page that lists hash under the Misc. Functions section. It says that hash has been available "As of Hive 0.4".
I would just like to see some documentation on what it's doing exactly. Is it deterministic? Will it always produce the same output given the same input? How many collisions should I expect?
A hash function is deterministic, by definition, cf. https://en.wikipedia.org/wiki/Hash_function#Determinism
So if the implementation of hash() was not deterministic, then it would be a bug, and someone would have noticed!
Caveat: that implementation is subject to change (and bug fixes) hence determinism stands only for a given version of Hive.
Hive is Open Source. Documentation is not bad by Apache standards, but still incomplete. Just inspect the source code => https://github.com/apache/hive
For Hive 2.1 for example:
the hash() function (a UDF in Hive jargon) is defined here
it just calls ObjectInspectorUtils.getBucketHashCode() which calls ObjectInspectorUtils.hashCode() on each argument, then merges its hash into a global "bucket" hash - as defined here
a comment shows that the (crude) hashing method implemented by Hive is derived from String.hashCode()
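To see what "derived from String.hashCode()" means in practice, here is a sketch that reproduces Java's String.hashCode() recurrence (h = 31*h + c over signed 32-bit ints) in Python; Hive's bucket hash combines per-column hashes with the same 31-multiplier recurrence, though the exact per-type hashing is what ObjectInspectorUtils defines in the linked source:

```python
def java_string_hashcode(s):
    """Java's String.hashCode(): h = 31*h + char, on a signed 32-bit int."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF          # wrap like Java's int
    return h - 0x1_0000_0000 if h >= 0x8000_0000 else h  # to signed range

print(java_string_hashcode("abc"))  # 96354, same as "abc".hashCode() in Java
```

Note that this is fully deterministic: the same input always yields the same output, which answers the "will it always produce the same output" part of the question (for a given Hive version, per the caveat above).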
For alternative hashing functions in Hive, see Calculate hash without using exisiting hash fuction in Hive but the answer basically points to the same documentation page that you already found.

Where is the source code that performs access path selection in Postgres?

There must be a part in the query planner of Postgres that is responsible for identifying which index to use based on various information (relation, column name, operator class/family, statistics, etc.).
I know that the source code of Postgres is available online but I would like a direct link to the part that performs the access path selection. The codebase is big and I can't find the relevant part.
The possible index access paths are found in the function create_index_paths in src/backend/optimizer/path/indxpath.c.

How to get column name and data type returned by a custom query in postgres?

How can I get the column names and data types returned by a custom query in Postgres? There are built-in functions for tables/views, but not for custom queries. To clarify: I need a Postgres function that takes an SQL string as a parameter and returns the column names and their data types.
I don't think there's any built-in SQL function which does this for you.
If you want to do this purely at the SQL level, the simplest and cheapest way is probably to CREATE TEMP VIEW AS (<your_query>), dig the column definitions out of the catalog tables, and drop the view when you're done. However, this can have a non-trivial overhead depending on how often you do it (as it needs to write view definitions to the catalogs), can't be run in a read-only transaction, and can't be done on a standby server.
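A sketch of that temp-view approach (the view name tmp_cols and the pg_class query are placeholders for your own query):

```sql
CREATE TEMP VIEW tmp_cols AS
    SELECT oid, relname FROM pg_class;        -- <your_query> goes here

SELECT a.attname                              AS column_name,
       format_type(a.atttypid, a.atttypmod)   AS data_type
FROM   pg_attribute a
WHERE  a.attrelid = 'tmp_cols'::regclass
AND    a.attnum > 0
AND    NOT a.attisdropped
ORDER  BY a.attnum;

DROP VIEW tmp_cols;
```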
The ideal solution, if it fits your use case, is to build a prepared query on the client side, and make use of the metadata returned by the server (in the form of a RowDescription message passed as part of the query protocol). Unfortunately, this depends very much on which client library you're using, and how much of this information it chooses to expose. For example, libpq will give you access to everything, whereas the JDBC driver limits you to the public methods on its ResultSetMetaData object (though you could probably pull more information from its private fields via reflection, if you're determined enough).
If you want a read-only, low-overhead, client-independent solution, then you could also write a server-side C function to prepare and describe the query via SPI. Writing and building C functions comes with a bit of a learning curve, but you can find numerous examples on PGXN, or within Postgres' own contrib modules.