Has anyone done this, or does anyone even know if it's possible, to use NLTK within a Postgres Python stored procedure or trigger?
You can use pretty much any Python library in a PL/Python stored procedure or trigger.
See the PL/Python documentation.
Concepts
The crucial point to understand is that PL/Python is CPython (in PostgreSQL up to and including 9.3, anyway). It uses exactly the same interpreter that the normal standalone Python does; it just loads it as a library into the PostgreSQL backend. With a few limitations (outlined below), if it works with CPython it works with PL/Python.
If you have multiple Python interpreters installed on your system - versions, distributions, 32-bit vs 64-bit etc - you might need to make sure you're installing extensions and libraries into the right one when running distutils scripts, etc, but that's about it.
Since you can load any library available to the system Python there's no reason to think NLTK would be a problem unless you know it requires things like threading that aren't really recommended in a PostgreSQL backend. (Sure enough, I tried it and it "just worked", see below).
One possible concern is that the startup overhead of something like NLTK might be quite big. You probably want to preload PL/Python in the postmaster and import the module in your setup code so it's ready when backends start. The postmaster is the parent process that all the other backends fork() from, so if the postmaster preloads something it's available to the backends with greatly reduced overhead. Test performance either way.
Security
Because you can load arbitrary C libraries via PL/Python and because the Python interpreter has no real security model, plpythonu is an "untrusted" language. Scripts have full and unrestricted access to the system as the postgres user and can fairly simply bypass access controls in PostgreSQL. For obvious security reasons this means that PL/Python functions and triggers may only be created by the superuser, though it's quite reasonable to GRANT normal users the ability to run carefully written functions that were installed by the superuser.
The upside is that you can do pretty much anything you can do in normal Python, keeping in mind that the Python interpreter's lifetime is that of the database connection (session). Threading isn't recommended, but most other things are fine.
PL/Python functions must be written with careful input sanitation, must set search_path when invoking the SPI to run queries, etc. This is discussed more in the manual.
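As a rough illustration of the "careful input" point, here is a minimal sketch of running a query through the SPI with a prepared plan instead of pasting user input into the SQL string. The table public.users, its columns, and the helper name find_ids_by_name are made up for the example. Inside a real PL/Python function plpy is a global; it is passed in as an argument here only so the helper can be tried outside the backend.

```python
def find_ids_by_name(plpy, name):
    """Look up ids by name using a parameterised SPI query.

    plpy is the module PL/Python provides inside a function body.
    public.users is a hypothetical table; it is schema-qualified so the
    lookup does not depend on whatever search_path the caller has set.
    """
    # $1 is bound by the server, so the value is never spliced into SQL text.
    plan = plpy.prepare(
        "SELECT id FROM public.users WHERE name = $1", ["text"]
    )
    rows = plpy.execute(plan, [name])
    return [row["id"] for row in rows]
```

Schema-qualifying names is a complement to setting search_path on the function, not a replacement for it.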
Limitations
Long-running or potentially problematic things like DNS lookups, HTTP connections to remote systems, SMTP mail delivery, etc should generally be done from a helper script using LISTEN and NOTIFY rather than an in-backend job in order to preserve PostgreSQL's performance and avoid hampering VACUUM with lots of long transactions. You can do these things in the backend, it just isn't a great idea.
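A helper script of that kind could look roughly like the sketch below, which uses psycopg2's notification support. It assumes psycopg2 is installed, that a trigger or function on the database side calls NOTIFY with a JSON payload, and that the payload has a "kind" field; the channel name, payload layout, and function names are all invented for the example.

```python
import json
import select


def parse_job(payload):
    """Decode a NOTIFY payload; this sketch assumes the sender used JSON."""
    job = json.loads(payload)
    if "kind" not in job:
        raise ValueError("job payload needs a 'kind' field")
    return job


def listen_forever(dsn, channel, handle_job, timeout=5.0):
    """Wait on LISTEN and pass each decoded notification to handle_job."""
    # Imported here so parse_job can be used without psycopg2 installed.
    import psycopg2
    import psycopg2.extensions

    conn = psycopg2.connect(dsn)
    # Autocommit, so the helper never holds a transaction open while idle.
    conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
    cur = conn.cursor()
    # channel must be a trusted, hard-coded identifier chosen by the caller.
    cur.execute("LISTEN " + channel + ";")
    while True:
        if select.select([conn], [], [], timeout) == ([], [], []):
            continue  # nothing arrived within the timeout; wait again
        conn.poll()
        while conn.notifies:
            note = conn.notifies.pop(0)
            handle_job(parse_job(note.payload))
```

The slow work (DNS, HTTP, SMTP) then happens inside handle_job in the helper process, outside any backend transaction.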
You should avoid creating threads within the PostgreSQL backend.
Don't attempt to load any Python library that'll load the libpq C library. This could cause all sorts of exciting problems with the backend. When talking to PostgreSQL from PL/Python use the SPI routines not a regular client library.
Don't do very long-running things in the backend, you'll cause vacuum problems.
Don't load anything that might load a different version of an already loaded native C library - say a different libcrypto, libssl, etc.
Don't write directly to files in the PostgreSQL data directory, ever.
PL/Python functions run as the postgres system user on the OS, so they don't have access to things like the user's home directory or files on the client side of the connection.
Test result
$ yum install python-nltk
$ psql -U postgres regress
regress=# CREATE LANGUAGE plpythonu;
regress=# CREATE OR REPLACE FUNCTION nltk_word_tokenize(word text) RETURNS text[] AS $$
import nltk
return nltk.word_tokenize(word)
$$ LANGUAGE plpythonu;
regress=# SELECT nltk_word_tokenize('This is a test, it''s going to work fine');
nltk_word_tokenize
-----------------------------------------------
{This,is,a,test,",",it,'s,going,to,work,fine}
(1 row)
So, as I said: Try it. So long as the Python interpreter PostgreSQL is using for plpython has nltk's dependencies installed it will work fine.
Note
PL/Python is CPython, but I'd love to see a PyPy based alternative that can run untrusted code using PyPy's sandbox features.
I have a generic question (not an issue). I am trying to run a big query with a lot of join conditions connecting 15-20 tables. Are there any limitations in ibm_db when running big queries? The query has been running in our production environment for more than 15 years, and I am able to run it in an in-house .NET tool. However, when I run it using ibm_db in PyCharm I keep getting the SQLCODE -905 resource limit error. Is there anything I am missing in how I use ibm_db?
Any insight will be helpful. Thank you for the help.
Most likely this -905 sqlcode has nothing to do with python or ibm_db.
Instead, it is more likely due to how the workload/resource management is configured at the Db2 server. Your question gave zero facts about the target Db2 environment, or about the difference(s) between the execution that works (.net) versus the execution that triggers the limitation.
One specific detail to eliminate is that the account (auth-ID) used for the .net application might be different from the account you use when connecting from python. The Db2 server may be configured to allocate resources based on the user ID (auth-ID) or on some other client-side factor, depending on the Db2 server platform and version.
You can prove that the -905 symptom has nothing to do with python or ibm_db by temporarily eliminating both, for example by submitting the same query from the db2cli tool (or the db2 CLP if your client workstation has it), or by submitting the query from JDBC, as long as you connect with the same account name as you do from python.
Contact your DBA team for details of the configuration of the WLM or RLF or whatever resource management tooling is deployed at the target Db2 subsystem. In addition, use python ibm_db to print out the full details of the exception (including the resource name, limit amount1/2, limit source, as they can also yield more information).
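For the last point, something along these lines should surface what the driver knows about the failure. It is only a sketch: the helper name run_and_report is made up, and the ibm_db module is passed in as an argument so the function can be exercised without a Db2 client installed. It relies on ibm_db.exec_immediate raising on failure and on ibm_db.stmt_error and ibm_db.stmt_errormsg reporting the SQLSTATE and message of the last failed statement.

```python
def run_and_report(driver, conn, sql):
    """Execute sql; return None on success or a dict describing the failure.

    driver is the imported ibm_db module (or a stand-in with the same
    three functions), conn is an open ibm_db connection.
    """
    try:
        driver.exec_immediate(conn, sql)
    except Exception as exc:
        return {
            "sqlstate": driver.stmt_error(),    # SQLSTATE of the failed statement
            "message": driver.stmt_errormsg(),  # message text reported by the driver
            "python_exception": str(exc),
        }
    return None


# Usage, assuming ibm_db is installed and conn_str / big_query are your own:
#   import ibm_db
#   conn = ibm_db.connect(conn_str, "", "")
#   print(run_and_report(ibm_db, conn, big_query))
```

Whatever comes back in the message is worth handing to the DBA team along with the auth-ID you connected as.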
Is it possible to read, write, delete OS files with PL/pgSQL?
Can I run OS commands?
I've seen some examples that you can copy files like CSV but can you read/write/delete OS files? Can you execute OS commands?
No, that's not possible.
PL/pgSQL is a trusted language and as such does not allow access to server resources, let alone running OS commands.
Explanation of "trusted language"
The optional key word TRUSTED specifies that the language does not grant access to data that the user would not otherwise have. Trusted languages are designed for ordinary database users (those without superuser privilege) and allows them to safely create functions and procedures. Since PL functions are executed inside the database server, the TRUSTED flag should only be given for languages that do not allow access to database server internals or the file system
There are some SQL functions available that enable roles with superuser privilege to read files on the server, but that is independent of PL/pgSQL.
If you do want to open up your database server to all kinds of attacks, use a non-trusted language like PL/Python or, if you are really adventurous, PL/sh.
PostgreSQL has some functions to read files in the data directory: pg_read_file and pg_read_binary_file
The “adminpack” extension has a function to write files: pg_file_write
Perhaps you can abuse COPY ... TO PROGRAM to run code on the server.
But the smart thing to do is to write a function in PL/PerlU or PL/Python.
We have an application which uses Cassandra for its database. How should we deploy schema changes in a live production environment?
In development we are just blowing the database away and recreating it with a 'database.cql' script kept in version control. This clearly isn't a solution in production.
In the relational world I would either use a sequence of upgrade scripts and apply them in order, or use a tool to interactively compare the staging and production databases and make the appropriate schema changes.
How do I solve the same problem in Cassandra?
Here's one I've started and have been using for a while.
https://github.com/heartysoft/aedes
It supports multiple environments, and versioning. Since we're Windows based, it's mainly powershell, but there's no reason a bash script couldn't be written to do the equivalent. The powershell script itself is extremely simple. It requires Powershell v3+. Usage is pretty easy:
aedes.ps1 192.168.40.4 [-u username -p password -env dev]
will look for schema files in the ..\schema folder. Schema files are expected to have a n_ prefix. Environment specific files have a .env.cql postfix. So, if the files are:
1_people.dev.cql
1_people.prod.cql
2_people_some_indexes.cql
3_jobs.dev.cql
3_jobs.prod.cql
4_jobs_something_changed.cql
and you run it for prod, then the files ending in .prod.cql and the files with no environment suffix will be applied in order. You can also specify a $start version to control where application starts (e.g. if start is specified as 3, then anything with 1_ and 2_ will be skipped).
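To make the naming rule concrete, here is a small Python sketch of the selection logic as described above. It is not how aedes itself is implemented (that is PowerShell); the function name select_scripts and the regular expression are just one way to express the n_ prefix, the optional .env.cql suffix, and the start version.

```python
import re

# <version>_<anything>[.<env>].cql
_NAME = re.compile(r"^(\d+)_(.+?)(?:\.([A-Za-z0-9]+))?\.cql$")


def select_scripts(filenames, env, start=1):
    """Return the schema files to apply for env, in version order."""
    chosen = []
    for name in filenames:
        match = _NAME.match(name)
        if not match:
            continue  # not a versioned schema file
        version = int(match.group(1))
        file_env = match.group(3)  # None when the file has no env suffix
        if version < start:
            continue  # already applied according to the start version
        if file_env is not None and file_env != env:
            continue  # belongs to a different environment
        chosen.append((version, name))
    return [name for _, name in sorted(chosen)]


files = [
    "1_people.dev.cql", "1_people.prod.cql", "2_people_some_indexes.cql",
    "3_jobs.dev.cql", "3_jobs.prod.cql", "4_jobs_something_changed.cql",
]
print(select_scripts(files, "prod", start=3))
# prints ['3_jobs.prod.cql', '4_jobs_something_changed.cql']
```

Sorting on the numeric version rather than the file name keeps 10_ after 9_.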
It's pretty basic but seems to work quite well. We just have Cassandra downloaded (not installed) on the "applier machine" (which could be your machine, i.e. not part of a cluster) and have cqlsh on the PATH for easier application. Did (and do) have plans for more features, but working nicely as is for the time being.
Since there wasn't an existing tool, I ended up writing one.
It is called cql-migrate, and provides incremental updates to a deployed Cassandra schema.
[update] Since writing this, I have found a couple more options: one for Rails and another for Go.
I'm going to build an extremely small script in Perl for dumping a Sybase database. The problem is that Perl doesn't come with Sybase support preinstalled. I don't have root access to the server, so I can't install any packages and I can't reach the perl folder. The server is not configured for internet access, so I have to deliver the packages "manually" through FTP.
So, my question is whether there is an easy way of doing this. The only library I need is DBI::Sybase or Sybase standalone (maybe I haven't done enough research and don't even need this much?), which means I would love to just be able to put the .pm file there, load it through
use localModule
and then run my small script.
The solution has to work on both Red hat and Solaris if I understood my supervisor correctly.
Best regards
Since you are primarily concerned with dumping the database, and not data retrieval and manipulation, you could probably get by without having to use DBI::Sybase or other perl module that is not preinstalled.
Without more details, it's hard to be very specific, but here's the overview. Your perl script can execute some SQL scripts which can dump the databases.
You can either put the list of databases you wish to dump in a config file (or env file), or you can generate it dynamically by calling isql using the -b option to suppress headers, and nocount to suppress footers, and store the output in an array.
Once you have the list of databases, just loop over them, running another isql command to dump each database.
I have written a program in java that reads .csv files and stores them into a database table. But the performance of the storing operation is very slow. When I use DB2 Command Line Processor there is a drastic change in performance and it's very fast. So, I am trying to customize DB2 Command Line Processor according to my requirement. I searched on Google but I only found topics for how to use it. I would like to get clear on following subjects before I start.
Is "DB2 Command Line Processor" open source?
Which programming language is used?
Is there alternative like DB2 Command Line Processor with open source-code in java?
Is there a way to call DB2 Command Line Processor out of a java program?
It may be worth investigating the Java program. The slow run times may be related to how often you are committing the data, i.e. you may be running in auto-commit mode (committing after every insert).
Committing after every 500 inserts may be a lot faster than committing after every record.
See DB2 autocommit for details on auto-commit.
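The batching idea is the same in any client language. Here is a minimal sketch of it in Python, using the standard library's sqlite3 as a stand-in for DB2 so it runs anywhere; the table items, its columns, and the function name insert_in_batches are invented for the example. In Java the equivalent would be turning auto-commit off on the connection and calling commit() every N rows.

```python
import sqlite3


def insert_in_batches(conn, rows, batch_size=500):
    """Insert rows, committing once per batch_size rows instead of per row.

    Returns the number of commit() calls made.
    """
    commits = 0
    cur = conn.cursor()
    for count, row in enumerate(rows, start=1):
        cur.execute("INSERT INTO items (id, name) VALUES (?, ?)", row)
        if count % batch_size == 0:
            conn.commit()
            commits += 1
    conn.commit()  # commit whatever is left in the final partial batch
    return commits + 1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER, name TEXT)")
rows = [(i, "row%d" % i) for i in range(1200)]
print(insert_in_batches(conn, rows))  # prints 3
```

The right batch size depends on row size and log configuration, so it is worth measuring a few values.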
1) DB2 CLP (command line processor) is part of DB2. It is not open source, and it is included in all editions (Express-C, Express, Workgroup, Extended) and in the Data Server client. The latter is free to download and install on all clients.
2) The best way to use the capabilities of the DB2 CLP is via scripts, such as bash scripts or Windows scripts.
You can also call the db2clp from another program, such as a java application (runtime).
3) There are database shells with open source licences. However, you are mixing two things: a shell, which is normally a terminal where you type commands, and a driver, which is used to query a database from a program you develop yourself.
4) Again, via Runtime, http://docs.oracle.com/javase/6/docs/api/java/lang/Runtime.html
Finally, the best approach is to use a JDBC driver, so that you do things directly rather than through a lot of tiers. Check your Java code; the reading is probably not efficient. Also check the properties of the DB2 Java driver.
One more thing: if you want the fastest option, try using LOAD to insert data into the database. It does not perform any logging. You can call LOAD from a Java application (remember to load the db2 environment before executing any command).