Accessing a non-public schema in PostgreSQL with Pentaho

Let me start by saying that what I know about Pentaho wouldn't fill a single paragraph; I'm more knowledgeable about PostgreSQL. I'm working with some contractors who are building a set of monthly reports in Pentaho (v. 4.5) for my company. Some of the data needs to go through an ETL process and get rolled up for reporting purposes. From a DBA(ish) point of view, I would like to move these tables into a separate PostgreSQL schema.
I know that Pentaho is often used with MySQL (which doesn't have schemas), and I'm concerned this might cause problems. I've done some "googlin'" and didn't turn up a lot of hits on the topic, but I did find a closed bug from a few years ago, which implies that the functionality should be supported.
Before I do this, I would like to see if anyone knows of a reason this will fail or be a bad idea (or, if you've done it and it works great, please let me know that, too).
Final notes: I'm using PostgreSQL 9.1.5, and I don't have access to a Pentaho instance to even test this myself. I'm hoping the good folks in the Stack Overflow community will share their expertise and save me from having to install one and spend hours playing/testing just to get an idea of whether this is a bad idea.
EDIT:
I sort of knew this question was a bit vague, but I was hoping that someone would read it and share any experience they have. So, let me spell it out more clearly and ask more explicit questions.
I have not done anything yet. I don't know Pentaho, and I don't want to learn Pentaho (not that there is anything wrong with Pentaho... it's just not where my interests are right now). My company hired contractors (I did not hire them). They have experience with Pentaho, but with MySQL. They don't really know anything about PostgreSQL. There are some important differences between PostgreSQL and MySQL, including the fact that PostgreSQL supports schemas (whereas MySQL uses separate databases... similar in concept, but they behave differently in some ways). Some ORMs (and tools) don't really like this; for example, the Django framework still doesn't fully support schemas in PostgreSQL (I know this because I use Python and Django often, and my life is much better when I keep things in the "public" schema). Because of my experience with Django and PostgreSQL schemas, I'm a bit leery of moving this data to a new schema.
I do understand that wherever the tables are, the reporting user will need permissions to be able to access the data.
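(For completeness, what I have in mind on the permissions side is roughly the following, where "reporting" and "pentaho_user" are placeholder names I've made up:)
-- let the reporting login see the schema and read its tables
GRANT USAGE ON SCHEMA reporting TO pentaho_user;
GRANT SELECT ON ALL TABLES IN SCHEMA reporting TO pentaho_user;
-- and cover tables the ETL creates later
ALTER DEFAULT PRIVILEGES IN SCHEMA reporting GRANT SELECT ON TABLES TO pentaho_user;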
My explicit questions:
Do you use Pentaho against a PostgreSQL database to access tables in schemas other than "public" (the default)?
If so, does it just work (no problems)?
If you had problems, would you please be willing to share with me (and the Stack Overflow community) any online resources that helped you? Or would you be willing to detail what you remember here?
Do you know of anything that just won't work correctly? For example, an open bug in Pentaho related to this topic.
Again, it's not your standard kind of question. I'm hoping that someone out there has experience and is willing to share it here, saving me from having to spend time setting up a new Pentaho instance and learning Pentaho well enough to test this myself.
Thanks.

Two paths you can take:
1) What the previous post said ("Pentaho steps (table inputs, outputs, etc.) usually allow you to specify a database schema.")
2) In the database connection, on the Advanced tab, set "The preferred schema name".
If you're working with different schemas, you can create one database connection per schema. With this approach you can leave the schema field in the input/output steps empty.

We use MS SQL Server, and I can tell you that Pentaho does struggle with the idea of a schema. Many of their apps allow you to select a schema, but Pentaho, like you said, is built to use something like MySQL.
Make your Pentaho database user work like it would in MySQL.
We made the database user default to dbo, then we structured our tables like dbo.dimDimension, dbo.factFactTable, etc. Basically, only use dbo for Pentaho purposes (or whatever schema you want to default to).
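For the PostgreSQL side of the original question, the rough equivalent of defaulting a user to a schema is setting the role's search_path; a minimal sketch, where pentaho_user and reporting are placeholder names:
-- unqualified table names in pentaho_user's sessions now resolve
-- against "reporting" first, then "public"
ALTER ROLE pentaho_user SET search_path = reporting, public;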

I use PDI and PostgreSQL extensively every day with a bunch of different schemas. It works fine. The only trouble you might run into is Pg's troublesome practice of folding unquoted identifiers to lower case instead of upper case. I soon realized everything was easier when I set the Advanced connection property "Quote all in database".
Yes, you have to quote everything when you type SQL if PDI doesn't do it for you, but it works quite well. I haven't experimented with forcing all identifiers to lower case, but I expect that would work as well.
And yes, use the "Preferred schema name" setting as well, but be aware that some steps use that option and others don't. You can't, for example, expect it to add schema names to SQL you type into a Table Input step.
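For instance, SQL typed into a Table Input step has to carry the schema and the quoting yourself; something along these lines (schema, table, and column names here are made up):
SELECT "id", "invoice_date", "Amount"
FROM "reporting"."fact_invoices"
WHERE "invoice_date" >= '2012-01-01';  -- quoting keeps mixed-case names like "Amount" intact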
The only other issues you might run into are the limits of Pg's JDBC driver. It's not as good as SQL Server's or DB2's, but the only thing I've ever had trouble with was sending error rows from a Table Output step to another step when the Table Output step was in batch mode.
Have fun learning PDI. It makes a great complement to your DBA skills.
Brian

Pentaho steps (table inputs, outputs, etc.) usually allow you to specify a database schema.
I did a quick test using PDI and our 8.4 Postgres instance and was able to explore, read from and write to tables in different schemas.
So, I think this is a reasonable direction. Hope this helps.

Related

PostgreSQL tuning to be used as a data warehouse

I am assembling a Business Intelligence solution using the Pentaho software as the BI engine. Within this solution, I had to set up the requirements for a PostgreSQL database server.
The current situation is very simple, since no ETL process is being carried out for data extraction yet, so the PostgreSQL configuration has not been changed much and is practically at its factory defaults.
I would like to know which Postgres configuration parameters have to be touched and modified to optimize it as a data warehouse. I have seen a lot of documentation, but it is not clear to me at all, since one document says certain values have to be modified and another recommends completely different values.
I would just like to know whether there is clearer and more precise documentation on optimizing a Postgres 9.6 instance to be used as a Pentaho DW.
Thank you very much.
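(To make it concrete, the kind of settings the guides keep disagreeing about are the memory and checkpoint parameters; a sketch of the usual suspects via ALTER SYSTEM, where the values are purely illustrative placeholders for a hypothetical 32 GB machine, not recommendations:)
ALTER SYSTEM SET shared_buffers = '8GB';              -- commonly ~25% of RAM; needs a restart
ALTER SYSTEM SET effective_cache_size = '24GB';       -- planner hint: shared_buffers + OS cache
ALTER SYSTEM SET work_mem = '256MB';                  -- per sort/hash node; DWs run few, large queries
ALTER SYSTEM SET maintenance_work_mem = '2GB';        -- faster index builds and VACUUM after loads
ALTER SYSTEM SET max_wal_size = '8GB';                -- fewer checkpoints during bulk loads
ALTER SYSTEM SET checkpoint_completion_target = 0.9;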

How to set up a background worker for a PostgreSQL 11 server?

Recently I've been assigned to migrate part of a database from Oracle to a PostgreSQL environment as a testing experiment. During that process, the major drawback I ran into was the lack of a simple way to implement parallelism, which is required due to multiple design reasons that aren't so relevant here. Recently I discovered background workers (https://www.postgresql.org/docs/11/bgworker.html), which looked like a way to solve my problems.
Yet not quite, as I couldn't easily find any tutorial or example of how to implement one, even for a task as simple as writing debug messages to the logger while the process is running. I've tried some old approaches presented in plugin documentation from version 9.3, but they weren't much help.
I would like to know how to set up those workers properly. Any help would be appreciated.
PS: Also, if some good soul has found a workaround to implement BULK COLLECT for cursors in PostgreSQL, it would be most kind of you to share it.
The documentation for bgworker that you linked to is for writing C code, which is probably not what you want. You can use the pg_background extension, which will do what you want. ora2pg will optionally use pg_background when converting Oracle procedures with the autonomous transaction pragma. The other option is to use dblink to open a connection to the current database.
Neither solution is great, but it's the only way to go if you need to store data in a table whether or not the enclosing transaction succeeds. If you can get by with just putting stuff into the logs, you can use RAISE NOTICE instead.
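A minimal sketch of the dblink variant, assuming a table debug_log that I've invented for illustration (the connection string will normally also need host/user/password):
CREATE EXTENSION IF NOT EXISTS dblink;
BEGIN;
-- the INSERT runs over a separate connection, so it commits on its own
SELECT dblink_exec('dbname=' || current_database(),
                   'INSERT INTO debug_log(msg) VALUES (''step 1 done'')');
ROLLBACK;  -- the row in debug_log survives the rollback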
As far as BULK COLLECT for cursors goes, I'm not sure exactly how you are using them, but set-returning functions may help you. Functions in Postgres can return multiple rows without fiddling with cursors.
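A minimal sketch of such a function, with invented table and column names:
CREATE FUNCTION recent_orders(since_date date)
RETURNS TABLE (id bigint, order_total numeric)
LANGUAGE sql STABLE AS $$
    SELECT order_id, total
    FROM orders
    WHERE created_at >= since_date;
$$;
-- callers just treat it like a table; no cursor handling needed
SELECT * FROM recent_orders('2019-01-01');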

Is there an alternative to temporary tables I can use on a Hot Standby copy?

As suggested, here's a TL;DR bit: I'm looking for an alternative to temporary tables that I could use on a Hot Standby copy of a database. Is there anything, or do I have to re-write everything and try to do it all in subqueries?
When I joined our company last year, our ERP was hosted locally, and although I didn't have admin access to the Postgres database, I at least had read/write access to the tables.
I wrote a number of reports (using the SQL Command option in Crystal Reports) and SQL scripts that use temporary tables. However, we've just migrated to a hosted version of the ERP, and rather than access the live database we have been given access to a Hot Standby copy, mainly due to load-balancing issues.
Unfortunately, the software company didn't warn us that this would be the case, or that it would be read-only access. I found this out when I was testing some scripts and, obviously, anything with a temporary table failed.
I use temporary tables for things like storing dates and bank holiday information, holding temporary calculations and so on.
So I'm looking for an alternative to temporary tables I could use on the Hot Standby copy, or do I have to re-write everything and try and do it all in subqueries?
I’ve looked at using CTE (WITH) but the scope is far too small as I’d need access throughout the script.
Then I thought maybe I could read the data from the Hot Standby but create temporary tables in a different database/schema, but I don’t think that’s viable. If it is I might have to speak to the software house to be given access to another database/schema. postgres_fdw would seem the most likely candidate as you can update the external table but I can't see anywhere about dropping and creating tables.
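(What I'm picturing, roughly, is pointing a small writable database back at the Hot Standby with postgres_fdw and doing my temp-table work there; the server, database, and credential names below are all made up:)
-- run on the separate, writable database
CREATE EXTENSION IF NOT EXISTS postgres_fdw;
CREATE SERVER erp_standby FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'standby.example.com', dbname 'erp');
CREATE USER MAPPING FOR CURRENT_USER SERVER erp_standby
    OPTIONS (user 'report_user', password 'secret');
CREATE SCHEMA erp;
-- IMPORT FOREIGN SCHEMA needs PostgreSQL 9.5 or later on this side
IMPORT FOREIGN SCHEMA public FROM SERVER erp_standby INTO erp;  -- ERP tables become read-only foreign tables
CREATE TEMP TABLE bank_holidays (holiday date);  -- temp tables work here because this database is writable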
I've only been using Postgres since last July, having previously used MSSQL, where I could probably have used a table variable, but I can't find an equivalent for that.
I've tried looking at the Postgres documentation but, to some embarrassment, I do find a lot of documentation hard to follow without relatable examples, so I might well have missed something.
Sorry for the long post!
Thanks

Data Warehousing Postgres

We're considering using SSIS to maintain a PostgreSQL data warehouse. I've used it before between SQL Servers with no problems, but I'm having a lot of difficulty getting it to play nicely with Postgres. I'm using the evaluation version of the OLEDB PGNP data provider (http://www.postgresql.org/about/news.1004).
I wanted to start with something simple like an UPSERT on the fact table (10k-15k rows are updated/inserted daily), but this is proving very difficult (not to mention that I'll want to use surrogate keys in the future).
I've attempted (Link) and (http://consultingblogs.emc.com/jamiethomson/archive/2006/09/12/SSIS_3A00_-Checking-if-a-row-exists-and-if-it-does_2C00_-has-it-changed.aspx), which are effectively the same (except I don't really understand the UNION ALL at the end when I'm trying to upsert). But I run into the same problem with parameters when doing the update using an OLE DB Command, which I tried to overcome using (http://technet.microsoft.com/en-us/library/ms141773.aspx), but that just doesn't seem to work; I get a validation error:
The external columns for complent.... are out of sync with the datasource columns... external column “Param_2” needs to be removed from the external columns.
(This error is repeated for the first two parameters as well. I never came across this using the SQL Server connection, as it supports named parameters.)
Has anyone come across this?
AND:
The fact that this simple task is apparently so difficult to do in SSIS suggests I'm using the wrong tool for the job. Is there a better (and still flexible) way of doing this? Or would another ETL package be better for use between two Postgres databases? Other options include any listed on (http://en.wikipedia.org/wiki/Extract,_transform,_load#Open-source_ETL_frameworks). I could just go and write a load of SQL to do this for me, but I wanted a neat and easily maintainable solution.
I have used the Slowly Changing Dimension wizard for this with good success. It may give you what you are looking for, especially with the wizard:
http://msdn.microsoft.com/en-us/library/ms141715.aspx
The "External Columns Out Of Sync" error: SSIS is case-sensitive. I encountered this issue multiple times and it made me want to pull my hair out.
This simple task is going to take some work either way. SSIS is by no means an enterprise class ETL product yet, but it does give you some quick and easy functionality, and is sufficient for most ETL work. I guess it is also about your level of comfort with it as well.
The SCD wizard is way too slow for what I want. I need to use set-based SQL.
It turned out that a lot of my problems were caused by bugs in the provider.
I opened a forum topic (http://www.pgoledb.com/forum/viewtopic.php?f=4&t=49) and had a useful discussion with the moderator/support/developer person.
Also, Postgres doesn't let you do cross-database queries, so I solved the problem this way:
Data Source from the production DB to a temp table in the archive DB
Run a set-based query between the temp table and the archive table (a sketch is below)
Truncate the temp table
Note that the "temp table" is not actually a temporary table, but a copy of the archive table's schema used to temporarily store data in.
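A sketch of what that set-based step can look like in Postgres, with invented names (staging_fact standing in for the "temp" table, fact_sales for the archive table); this is the classic UPDATE-then-INSERT pattern for versions without ON CONFLICT:
-- update rows that already exist in the archive
UPDATE fact_sales f
SET    amount = s.amount, updated_at = now()
FROM   staging_fact s
WHERE  f.sale_id = s.sale_id;
-- insert the rows that are new
INSERT INTO fact_sales (sale_id, amount, updated_at)
SELECT s.sale_id, s.amount, now()
FROM   staging_fact s
WHERE  NOT EXISTS (SELECT 1 FROM fact_sales f WHERE f.sale_id = s.sale_id);
-- then empty the staging table for the next load
TRUNCATE staging_fact;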
Took a while, but I got there in the end.
"This simple task is going to take some work either way. SSIS is by no means an enterprise class ETL product yet, but it does give you some quick and easy functionality, and is sufficient for most ETL work. I guess it is also about your level of comfort with it as well."
What enterprise ETL solution would you suggest?

Jira using enterprise architecture by OfBiz

The 'Open For Business' project (OFBiz) is an enterprise framework.
It so happens that Jira uses this, and I was pretty shocked at how much work is involved in pulling data for a particular entity (say, an issue/bug in Jira's case).
Imagine getting a list of all the issues: it first has to get all the columns (or properties) to display for the table columns, then pull in the values for each. For an enterprise solution this sounds sub-optimal (but I understand how it adds flexibility).
You can read how it's used in Jira in practice: http://confluence.atlassian.com/display/JIRA/Database+Schema
Main site: http://ofbiz.apache.org/docs/entity.html
I'm just confused as to how to list all issues. Meaning, what would the SQL queries look like?
It's one thing to pull a single issue, but to get a list you have to do a lot of work to get the values. I don't think it can be done with a single query using joins, now can it?
(Disclaimer: I work for Atlassian, but I'm not on the JIRA team)
OFBiz EE is just an abstraction layer for moving between database tables and fancy maps called GenericValues. It has no influence over the database schema itself. Your real issue here seems to be that JIRA's database schema is complicated.
The reason it's complicated is because it has to support a data model where an issue is an arbitrary collection of arbitrary fields, at some point in an arbitrary workflow. The fields themselves can be defined by third-party plugins. It's very hard to produce a friendly-looking RDBMS schema to fit this kind of dynamic data model, and JIRA tries as best it can.
You can get information directly out of the database if you want (the database schema is documented in the link above), or you can go up a layer or twelve of abstraction and talk through one of JIRA's many APIs.
A good place to ask questions about getting data out of JIRA is the forums on http://forums.atlassian.com/
The entity engine used in Jira is a database abstraction layer (with a very rich and easy-to-use API) that connects your application with one or more data sources. But the databases are still relational, so you can use SQL if you want to. As for the issue info you want to pull, I'd say it wouldn't be very easy with joins alone. I'd recommend using the procedural language of the RDBMS (e.g. PL/SQL, PL/pgSQL).
SELECT * FROM jiraissue;
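To give a feel for why a flat list takes work, a rough sketch of pulling issues together with their custom field values might look like the following; the table and column names are from memory and vary between JIRA versions, so treat them as approximate:
SELECT i.id, p.pkey, i.summary, cf.cfname, cfv.stringvalue
FROM jiraissue i
JOIN project p ON p.id = i.project
LEFT JOIN customfieldvalue cfv ON cfv.issue = i.id
LEFT JOIN customfield cf ON cf.id = cfv.customfield;
-- one row per issue per custom field value; collapsing this to one row
-- per issue still needs an aggregation or pivot step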