Experiences with PostgreSQL Java/JDBC Copy API for bulk inserts - postgresql

With version 8.4 PostgreSQL finally integrated a proprietary API into their JDBC driver, which allows stream based inserts and selects. The so called Copy API grants access to COPY TO/COPY FROM SQL commands, which read text data from a stream/reader into one table at a time or write text data to a stream/writer from one table. Constraints and triggers are regarded for insert operations. Basic transformations (delimiter, quotation, null values etc.) are available. The performance gain is quite impressive, which probably is because of less object instantiation and a much simpler protocol between client and server backend.
Has anyone experiences with this API, good or bad. Is it production ready? Are there any pitfalls one has to be aware of? BTW: The fact that it is a proprietary API is a non-issue for me.

The COPY API is present in PostgreSQL C library for at least 6 years. It is very stable.
See: http://www.postgresql.org/docs/9.0/interactive/libpq-copy.html
and http://www.postgresql.org/docs/9.0/interactive/sql-copy.html
JDBC implementation should have same properties, but I haven't used it.
PS. I think there is a misunderstanding when you call this "proprietary". Both protocol specification and server/client/driver source code is free (as in freedom).

Related

Postgresql tuning to be used as DatawareHouse

I am assembling a Business Intelligence solution using the Pentaho software as a BI engine. Within this solution, I had to set up a requirements for a PostgreSQL database server.
The current situation is very easy, since no ETL process is being carried out for data extraction, so the PostgreSQL configuration has not changed it much, and it is practically as it is configured as "factory".
I would like to know what Postgres configuration parameters have to be touched and modified to optimize it as a Datawarehouse. I have seen a lot of documentation, but it is not clear to me at all, since one documentation says that such values have to be modified, and other documentation, other completely different values.
I would like to know just that, if there is a clearer and more precise documentation to optimize a postgres 9.6 to be used as a Pentaho DW.
Thank you very much

Challenges in migrating from IBM DB2 to Netezza

Due to added advantage of high performance and reduction in turnaround time, I am trying to migrate all the data from IBM DB2 to Netezza in my organization.
But what I realized is there is no concept of primary key in Netezza? If true, I can try and take care of these issue by using duplicate removal stage in Datastage.
Also, could you guys please assist me understanding if there are any more constraints that I should consider or challenges I could face for DB2 to Netezza migration?
Netezza does allow you to specify Primary Key and Foreign Key restraints, but they are not enforced. Which is to say that they are purely informational (for bot the user and the optimizer). A well-formed upsert process in ETL is a good way to manage for this.
On the topic of other issues you may face, here are a few thoughts:
Surrogate Keys
Be sure that you generate your surrogate keys either with Netezza's SEQUENCE object, or with a surrogate key generator in your ETL tool. Avoid using ROW_NUMBER for this process as it will most often prevent you from exploiting the parallel nature of the system when used in this way.
Stored Procedures
Stored procedures should avoid row-by-row/cusor-based processing when possible, as this is another case where you may prevent yourself from exploiting the parallel nature of the system.
SQL Extension Functions
If you find that you rely on functions that exists in DB2 that you don't find natively in Netezza, be sure to check what is available in the SQL Extensions Toolkit, which is included with Netezza, but not automatically installed/configured.
MERGE
If you rely on MERGE in your current environment, be aware that you must be on v7.2.1 to use MERGE in Netezza. Otherwise you will have to break it out into an INSERT/UPDATE operation.
Once you load the data in Netezza, one method we have utilized is to create a View to access the data and only expose the view. The view would have the logic inside to remove the duplicates.
Good luck!
Delan

Upsert in Amazon RedShift without Function or Stored Procedures

As there is no support for user defined functions or stored procedures in RedShift, how can i achieve UPSERT mechanism in RedShift which is using ParAccel, a PostgreSQL 8.0.2 fork.
Currently, i'm trying to achieve UPSERT mechanism using IF...THEN...ELSE... statement
e.g:-
IF NOT EXISTS(SELECT...WHERE(SELECT..))
THEN INSERT INTO tblABC() SELECT... FROM tblXYZ
ELSE UPDATE tblABC SET.,.,.,. FROM tblXYZ WHERE...
which is giving me error. As i'm writing this code independently without including it in function or SP's.
So, is there any solution to achieve UPSERT.
Thanks
You should probably read this article on upsert by depesz. You can't rely on SERIALIABLE for this since, AFAIK, ParAccel doesn't support full serializability support like in Pg 9.1+. As outlined in that post, you can't really do what you want purely in the DB anyway.
The short version is that even on current PostgreSQL versions that support writable CTEs it's still hard. On an 8.0 based ParAccel, you're pretty much out of luck.
I'd do a staged merge. COPY the new data to a temporary table on the server, LOCK the destination table, then do an UPDATE ... FROM followed by an INSERT INTO ... SELECT. Doing the data uploads in big chunks and locking the table for the upserts is reasonably in keeping with how Redshift is used anyway.
Another approach is to externally co-ordinate the upserts via something local to your application cluster. Have all your tools communicate via an external tool where they take an "insert-intent lock" before doing an insert. You want a distributed locking tool appropriate to your system. If everything's running inside one application server, it might be as simple as a synchronized singleton object.

Accessing non-public schema in PostgreSQL with Pentaho

Let me start by saying, what I know about Pentaho wouldn't fill up a single paragraph. I'm more knowledgeable about PostgreSQL. I'm working with some contractors that are building a set of monthly reports in Pentaho (v. 4.5) for my company. Some of the data needs to go through a ETL process and get rolled up for reporting purposes. From a dba(ish) point of view, I would like to move these tables into a separate PostgreSQL schema.
I know that Pentaho is often times used with MySQL (which doesn't have schemas) and I'm concerned this might cause problems. I've done some "googlin'" and I don't turn up a lot of hits on the topic, but I did find a closed bug from a few years ago - thus implying that the functionality should be supported.
before I do this, I would like to see if anyone knows of a reason this will fail or be a bad idea. (or if you've done it an it works great, please let me know that, too).
Final notes: I'm using PostgreSQL 9.1.5, and I don't have access to a Pentaho instance to even test this myself. And I'm hoping the good folks in the Stackoverflow community will share their expertise and save me from having to install one and the hours of playing/testing to get an idea of this is a bad idea.
EDIT:
I sort of knew this question was a bit vague, but I was hoping that some one would read it and share any experience they have. So, Let me spell it out more clearly and ask more explicit questions.
I have not done anything. I don't know Pentaho. I don't want to learn Pentaho (not that there is anything wrong with Pentaho... It's just not where my interests are right now). My company hired contractors (I did not hire them). They have experience with Pentaho, but with MySQL. They don't really know anything about PostgreSQL. There are some important difference between PostgreSQL and MySQL. Including the fact that PostgreSQL supports schemas (whereas MySQL uses separate database... similar in concept be behave differently in some ways). Some ORMs (and tools) don't really like this... for example, the Django framework still doesn't really fully support schemas in Postgresql (I know this because I use Python and Django often and my life is much better when I keep things in the "public" schema). Because of my experience with Django and PostgreSQL schemas, I'm a bit leery of moving this data to a new schema.
I do understand that where ever the tables are, they will need permissions to be able to access the data.
My explicit questions:
Do you use Pentaho to access a PostgreSQL database to access tables in schemas other than "public" (the default).
If so, does it just work (no problems)?
If you had problems, would you please be willing to share with me (and the Stackoverflow community) any online resources that helped you? Or would you be willing to detail what you remember here?
Do you know of anything that just won't work correctly? For example, an open bug in Pentaho related to this topic.
Again, it's not your standard kind of question. I'm hoping that someone out there has experience and is willing to share it here and save me from having to spend time setting up a new Pentaho instance and trying to learn Pentaho well enough to test it, etc.
Thanks.
Two paths you can take:
1) What previous post said ("Pentaho steps (table inputs, outputs, etc.) usually allow you to specify a database schema.")
2) In database connection, advanced tab, "The preferred schema name".
If you're working with different schemas, you can create one database connection per schema. With this approach you can leave schema field in input/output steps empty.
We use MS SQL server and I can tell you that Pentaho does struggle with the idea of a schema. Many of their apps allow you to select a schema but Pentaho, like you said, is built to use something like mySQL.
Make you pentaho database user work like it would be working in mySQL.
We made the database user default to dbo then we structured our tables like dbo.dimDimension,
dbo.factFactTable etc. Basically, only use dbo for Pentaho purposes. (Or whatever schema you want to default to.)
I use PDI and PgSQL extensively every day with a bunch of different schemas. It works fine. The only trouble you might run into is Pg's troublesome practice of forcing unquoted identifiers to lower instead of upper case. I soon realized everything was easier when I set the Advanced connection property to "Quote all in database".
Yes, you have to quote everything when you type SQL if PDI doesn't do it for you, but it works quite well. Haven't experimented with forcing all identifiers to lower case, but I expect that would work as well.
And yes, use the "Preferred schema nanme" as well, but be aware that some steps use that option and others don't. You can't, for example, expect it to add schema names to SQL you type into a Table Input step.
The only other issues you might run into are the limits of Pg's JDBC driver. It's not as good as SQL Server's or DB2's, but the only thing I've every had trouble with was sending error rows from a Table Output step to another step when the Table Output step was in batch mode.
Have fun learning PDI. It makes a great complement to your DBA skills.
Brian
Pentaho steps (table inputs, outputs, etc.) usually allow you to specify a database schema.
I did a quick test using PDI and our 8.4 Postgres instance and was able to explore, read from and write to tables in different schemas.
So, I think this is a reasonable direction. Hope this helps.

Jira using enterprise architecture by OfBiz

The 'open for business project' is an enterprise framework.
It so happens Jira uses this, and I was pretty shocked at how much work is involved to pull data for a particular entity (say a issue/bug in Jira's case).
Imagine getting a list of all the issues, it has to first get all the columns (or properties) to display for the table column, then pull in the values for each. For an enterprise solution this sounds like a sub-optimal solution (but I understand how it adds flexibility).
You can read how its used in Jira practically: http://confluence.atlassian.com/display/JIRA/Database+Schema
main site: http://ofbiz.apache.org/docs/entity.html
I'm just confused as to how to list all issues. Meaning, what would the sql queries look like?
Its one thing to pull a single issue, but to get a list you have to do allot of work to get the values. I don't think it can be done with a singl query using joins now can it?
(Disclaimer: I work for Atlassian, but I'm not on the JIRA team)
OFBiz EE is just an abstraction layer for moving between database tables and fancy maps called GenericValues. It has no influence over the database schema itself. Your real issue here seems to be that JIRA's database schema is complicated.
The reason it's complicated is because it has to support a data model where an issue is an arbitrary collection of arbitrary fields, at some point in an arbitrary workflow. The fields themselves can be defined by third-party plugins. It's very hard to produce a friendly-looking RDBMS schema to fit this kind of dynamic data model, and JIRA tries as best it can.
You can get information directly out of the database if you want, the database schema is documented in the link above, or you can go up a layer or twelve of abstraction and talk through one of JIRAs many APIs.
A good place to ask questions about getting data out of JIRA is the forums on http://forums.atlassian.com/
The entity engine used in jira is a database abstraction layer ( with a very rich and easy to use API ) that connects your application with one or more datasources. But the databases are still relational, so you can use SQL if you want to. About the issue info you want to pull I'd say it wouldn't be very easy only with joins. I'd recommend you use the scripting language of the RDBMS ( i.e. PL/SQL, pgPL/SQL ).
SELECT * FROM jiraissue;