Parallelism or PLINQ with Postgres and Npgsql - postgresql

I have a Posrgres 9.04 database table with over 12,000,000 rows.
I need a program to read each row, do some calculations and lookups (against a 2nd table), then write a new row in a 3rd table with the results of these calculations. When done, the 3rd table will have the same number of rows as the 1st table.
Executing serially on a Core i7 720QM processor takes more than 24 hours. It only taxes one of my 8 cores (4 physical cores, but 8 visible to Windows 7 via HTT).
I want to speed this up with parallelism. I thought I could use PLINQ and Npgsql:
NpgsqlDataReader records = new NpgsqlCommand("SELECT * FROM table", conn).ExecuteReader();
var single_record = from row in records.AsParallel()
select row;
However, I get an error for records.AsParallel(): Could not find an implementation of the query pattern for source type 'System.Linq.ParallelQuery'. 'Select' not found. Consider explicitly specifying the type of the range variable 'row'.
I've done a lot of Google searches, and I'm just coming up more confused. NpgsqlDataReader inherits from System.Data.Common.DbDataReader, which in turn implements IEnumerable, which has the AsParallel extension, so seems like the right stuff is in place to get this working?
It's not clear to me what I could even do to explicitly specify the type of the range variable. It appears that best practice is not to specify this.
I am open to switching to a DataSet, presuming that's PLINQ compatible, but would rather avoid if possible because of the 12,000,000 rows.
Is this even something achievable with Npgsql? Do I need to use Devart's dotConnect for PostgreSQL instead?
UPDATE: Just found http://social.msdn.microsoft.com/Forums/en-US/parallelextensions/thread/2f5ce226-c500-4899-a923-99285ace42ae, which led me to try this:
foreach(IDataRecord arrest in
from row in arrests.AsParallel().Cast <IDataRecord>()
select row)
So far no errors in the IDE, but is this a proper way of constructing this?

This is indeed the solution:
foreach(IDataRecord arrest in
from row in arrests.AsParallel().Cast <IDataRecord>()
select row)
This solution was inspired by what I found at http://social.msdn.microsoft.com/Forums/en-US/parallelextensions/thread/2f5ce226-c500-4899-a923-99285ace42ae#1956768e-9403-4671-a196-8dfb3d7070e3. It's not clear to me why the cast and type specification is needed, but it works.
EDIT: While this doesn't cause syntax or runtime errors, it in fact does not make things run in parallel. Everything is still serialized. See PLINQ on ConcurrentQueue isn't multithreading for a superior solution.

You should consider using Greenplum. It's trivial to accomplish this in a Greenplum database. The free version isn't gimped in any way and it's postgresql at its core.

Related

How to tell PostgreSQL via JDBC that I'm not going to fetch every row of the query result (i.e. how to stream the head of a result set efficiently)?

Quite frequently, I'd like to retrieve only the first N rows from a query, but I don't know in advance what N will be. For example:
try(var stream = sql.selectFrom(JOB)
.where(JOB.EXECUTE_AFTER.le(clock.instant()))
.orderBy(JOB.PRIORITY.asc())
.forUpdate()
.skipLocked()
.fetchSize(100)
.fetchLazy()) {
// Now run as many jobs as we can in 10s
...
}
Now, without adding an arbitrary LIMIT clause, the PG query planner sometimes decides to do a painfully slow sequential table scan for such queries, AFAIK because it thinks I'm going to fetch every row in the result set. An arbitrary LIMIT kind of works for simple cases like this one, but I don't like it at all because:
the limit's only there to "trick" the query planner into doing the right thing, it's not there because my intent is to fetch at most N rows.
when it gets a little more sophisticated and you have multiple such queries that somehow depend on each other, choosing an N large enough to not break your code can be hard. You don't want to be the next person who has to understand that code.
finding out that the query is unexpectedly slow usually happens in production where your tables contain a few million/billion rows. Totally avoidable if only the DB didn't insist on being smarter than the developer.
I'm getting tired of writing detailed comments that explain why the queries have to look like this-n-that (i.e. explain why the query planner screws up if I don't add this arbitrary limit)
So, how do I tell the query planner that I'm only going to fetch a few rows and getting the first row quickly is the priority here? Can this be achieved using the JDBC API/driver?
(Note: I'm not looking for server configuration tweaks that indirectly influence the query planner, like tinkering with random page costs, nor can I accept a workaround like set seq_scan=off)
(Note 2: I'm using jOOQ in the example code for clarity, but under the hood this is just another PreparedStatement using ResultSet.TYPE_FORWARD_ONLY and ResultSet.CONCUR_READ_ONLY, so AFAIK we can rule out wrong statement modes)
(Note 3: I'm not in autocommit mode)
PostgreSQL is smarter than you think. All you need to do is to set the fetch size of the java.sql.Statement to a value different from 0 using the setFetchSize method. Then the JDBC driver will create a cursor and fetch the result set in chunks. Any query planned that way will be optimized for fast fetching of the first 10% of the data (this is governed by the PostgreSQL parameter cursor_tuple_fraction). Even if the query performs a sequential scan of a table, not all the rows have to be read: reading will stop as soon as no more result rows are fetched.
I have no idea how to use JDBC methods with your ORM, but there should be a way.
In case the JDBC fetchSize() method doesn't suffice as a hint to get the behaviour you want, you could make use of explicit server side cursors like this:
ctx.transaction(c -> {
c.dsl().execute("declare c cursor for {0}", dsl
.selectFrom(JOB)
.where(JOB.EXECUTE_AFTER.le(clock.instant()))
.orderBy(JOB.PRIORITY.asc())
.forUpdate()
.skipLocked()
);
try {
JobRecord r;
while ((r = dsl.resultQuery("fetch forward 1 from c")
.coerce(JOB)
.fetchOne()) != null) {
System.out.println(r);
}
}
finally {
c.dsl().execute("close c");
}
});
There's a pending feature request to support the above also in the DSL API (see #11407), but the above example shows that this can still be done in a type safe way using plain SQL templating and the ResultQuery::coerce method

Applying updates to a KDB table in thread safe manner in C

I need to update a KDB table with new/updated/deleted rows while it is being read by other threads. Since writing to K structures while other threads access will not be thread safe, the only way I can think of is to clone the whole table and apply new changes to that. Even to do that, I need to first clone the table, then find a way to insert/update/delete rows from it.
I'd like to know if there are functions in C to:
1. Clone the whole table
2. Delete existing rows
3. Insert new rows easily
4. Update existing rows
Appreciate suggestions on new approaches to the same problem as well.
Based on the comments...
You need to do a set of operations on the KDB database "atomically"
You don't have "control" of the database, so you can't set functions (though you don't actually need to be an admin to do this, but that's a different story)
You have a separate C process that is connecting to the database to do the operations you require. (Given you said you don't have "access" to the database as admin, you can't get KDB to load your C binary to use within-process anyway).
Firstly I'm going to assume you know how to connect to KDB+ and issue via the C API (found here).
All you need to do then is to concatenate your "atomic" operation into a set of statements that you are going to issue in one call from C. For example say you want to update a table and then delete some entry. This is what your call might look like:
{update name:`me from table where name=`you; delete from `table where name=`other;}[]
(Caution: this is just a dummy example, I've assumed your table is in-memory so that the delete operation here would work just fine, and not saved to disk, etc. If you need specific help with the actual statements you require in your use case then that's a different question for this forum).
Notice that this is an anonymous function that will get called immediately on issue ([]). There is the assumption that your operations within the function will succeed. Again, if you need actual q query help it's a different question for this forum.
Even if your KDB database is multithreaded (started with -s or negative port number), it will not let you update global variables inside a peach thread. Therefore your operation should work just fine. But just in case there's something else that could interfere with your new anonymous function, you can wrap the function with protected evaluation.

variable table or column names in a function

I'm trying to search all tables and columns in a database, a la here. The suggested technique is to construct SQL query strings and then EXEC them. This works well, as a stored procedure. (Another example of variable table/column names is here. Again, EXEC is used to execute "dynamic SQL".)
However, my app requires that I do this in a function, not an SP. (Our development framework has trouble obtaining results from an SP.) But in a function, at least on SQL Server 2008 R2, you can't use EXEC; I get this error:
Invalid use of a side-effecting operator 'INSERT EXEC' within a function.
According to the answer to this post, apparently by a Microsoft developer, this is by design; it has nothing to do with the INSERT, only the fact that when you execute dynamically-constructed SQL code, the parser cannot guarantee a lack of side effects. Therefore it won't allow you to create such a function.
So... is there any way to iterate over many tables/columns within a function?
I see from BOL that
The following statements are valid in a function: ...
EXECUTE
statements calling extended stored procedures.
Huh - How could extended SP's be guaranteed side-effect free?
But that doesn't help me anyway:
The extended stored procedure, when it is called from inside a
function, cannot return result sets to the client. Any ODS APIs that
return result sets to the client will return FAIL. The extended stored
procedure could connect back to an instance of SQL Server; however, it
should not try to join the same transaction as the function that
invoked the extended stored procedure.
Since we need the function to return the results of the search, an ESP won't help.
I don't really want to get into extended SP's anyway: incrementing the number of programming languages in the environment would complicate our development environment more than it's worth.
I can think of a few solutions right now, none of which is very satisfactory:
First call an SP that produces the needed data and puts it in a table, then select from the function which merely reads the result from the table; this could be trouble if the search takes a while and two users' searches overlap. Or,
Have the application (not the function) generate a long query naming every table and column name from the db. I wonder if the JDBC driver can handle a query that long. Or,
Have the application (not the function) generate a long series of short queries naming every table and column name from the db. This will make the overall search a lot slower.
Thanks for any suggestions.
P.S. Upon further searching, I stumbled across this question which is closely related. It has no answers.
Update: No longer needed
I think this question is still valid, and we may again have a situation where we need it. However, I don't need an answer anymore for the present problem. After much trial-and-error I managed to get our application framework to retrieve row results from the RDBMS via the JDBC driver from the stored procedure. Therefore getting the thing to work as a function is unnecessary.
But if anyone posts an answer here that helps with the stated problem, I will be happy to upvote and/or accept it as appropriate.
An sp is basically a predefined sql statment with some add ons.
So if you had
PSEUDOCODE
Create SP_DoSomething As
Select * From MyTable
END
And you can't use the SP
Then you just execute the SQL as in "Select * From MyTable"
As for that naff sql code.
For start you could join table to column with a where clause, which would get rid of that line by line if stuff.
Ask another question. Like How could this be improved, there's lots of scope for more attempts than mine.

Catching errors with DBD::Informix

I need to run dynamically constructed queries against Informix IDS 9.x; while WHERE clause is mostly quite simple, Projection clause can be quite complicated with lots of columns and formulas applied to columns. Here is one example:
SELECT ((((table.I_ACDTIME + table.I_ACWTIME + table.I_DA_ACDTIME + table.I_DA_ACWTIME +
table.I_RINGTIME))+(table.I_ACDOTHERTIME + table.I_ACDAUXINTIME +
table.I_ACDAUX_OUTTIME)+(table.I_TAUXTIME + table.I_TAVAILTIME +
table.I_TOTHERTIME)+((table.I_AVAILTIME + table.I_AUXTIME)*
((table.MAX_TOT_PERCENTS/100)/table.MAXSTAFFED)))/(table.INTRVL*60))
FROM table
WHERE ...
The problem arises when some of the fields used contain zeroes; Informix predictably throws division by zero error, but the error message is not very helpful:
DBD::Informix::st fetchrow_arrayref failed:
SQL: -1202: An attempt was made to divide by zero.
In this case, it is desirable to return NULL upon failed calculation. Is there any way to achieve this other than parse Projection clause and enclose each and every division attempt in CASE ... END? I would prefer to use some DBD::Informix magic if it's there.
I don't believe you'll be able to solve this with DBD::Informix or any other database client, without resorting to parsing the SQL and rewriting it. There's no option to just ignore the column with the /0 arithmetic: the whole statement fails when the error is encountered, at the engine level.
If it's any help, you can write the code to avoid /0 as a DECODE rather than CASE ... END, which is a little cleaner, ie:
DECODE(table.MAXSTAFFED, 0, NULL,
((table.MAX_TOT_PERCENTS/100)/table.MAXSTAFFED)))/(table.INTRVL*60)))
DBD::Informix is an interface to the Informix DBMS, and as thin as possible (which isn't anywhere near as thin as I'd like, but that's another discussion). Such behaviour cannot reasonably be mediated by DBD::Informix (or any other DBD driver accessing a DBMS); it must be handled by the DBMS itself.
IDS does not provide a mechanism to yield NULL in lieu of a divide by zero error. It might be a reasonable feature request - but it would not be implemented until the successor version to Informix 11.70 at the earliest.
Note that Informix Dynamic Server (IDS) 9.x is several years beyond the end of its supported life (10.00 is also unsupported).
From experience working with informix I would say you woud be lucky to get that kind of functionallity within IDS (earlier versions of IDS - not much earlier than your version - had barely any string manipulation function nevermind anything complicated.)
I would save yourself the time and generate the calculations against an in memory list.

How to organize parameters for a postgres application

I am working on a postgres application. For the moment I am not sure how to manage application constant parameters best. For example I want to define a threshold variable which I am going to use in several functions.
One idea is making a table "config" and query the variable every time I need them. And for a shortcut wrap the sql query into an other function i.e.: t := get_Config('Threshold');
But in fact I am not really lucky with this. What is the best way to handle custom application configuration parameters? They should be handy in maintainance and I want to avoid querying every time for constants. In oracle for example you could compile constants into package specs. Are there any better ways to deal with such configuration parameters?
I have organized global parameters just the way you describe it for some years now. It seems a bit awkward but it works just fine.
I have got quite a number of those, so I added an integer plus index to my config table and use get_config($my_id) (plus comment) - which is slightly faster but less readable.
OR you can use custom_variable_classes. See:
How to declare variable in PostgreSQL