Using variables for schema and table names in a Redshift query - amazon-redshift

I want to be able to use variable names in Redshift that refer to my DB objects (like schema and table names). Something like...
SET my_schema="schema";
SET my_table="table";
SELECT * from #my_schema.#my_table;
But it looks like Redshift doesn't have such a feature. Is there any possible workaround to achieve this?

There are a few ways you can try to attack this. But first: trying to use a database engine for functions beyond querying the database is a waste of horsepower and the road to DB lock-in. So I'm going to focus on ways to do this before the SQL reaches the database.
The most complete way is to use a front-end system that clients connect to, which in turn connects to the DB. The one I've used in the past is pgbouncer-rr, which pools connections to the DB but also allows the SQL to be modified before it is sent on. This will do what you want, but you will need a computer to perform this work.
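For illustration, the rewrite hook for pgbouncer-rr is just a small piece of Python. The function name and signature below are from memory of the pgbouncer-rr docs, so check its README before relying on them, and the #my_schema / #my_table placeholder syntax and substitution values are only examples:

import re

# Assumed pgbouncer-rr hook: it passes the incoming query text to a Python
# function and forwards whatever string the function returns. Verify the exact
# name and signature against the pgbouncer-rr README.
SUBSTITUTIONS = {
    "#my_schema": "analytics",   # hypothetical schema name
    "#my_table": "daily_sales",  # hypothetical table name
}

def rewrite_query(username, query):
    # Replace each placeholder token with its configured identifier before
    # the query is forwarded to Redshift.
    for token, identifier in SUBSTITUTIONS.items():
        query = re.sub(re.escape(token), identifier, query)
    return query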
If you use the Redshift Data API you could put a Lambda function in series that performs the SQL modifications you desire (but make sure you get your API permissions right). However, I expect it is unlikely that you are looking to move to an API access model.
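As a rough sketch of that idea, the Lambda below substitutes placeholders and submits the rewritten SQL through the Data API with boto3. The cluster, database, user, and substitution values are all placeholders you would replace:

import boto3

# Hypothetical placeholder map; in practice this could come from the event
# payload or from environment variables.
SUBSTITUTIONS = {"#my_schema": "analytics", "#my_table": "daily_sales"}

client = boto3.client("redshift-data")

def handler(event, context):
    sql = event["sql"]
    for token, identifier in SUBSTITUTIONS.items():
        sql = sql.replace(token, identifier)
    # Submit the rewritten statement via the Redshift Data API.
    response = client.execute_statement(
        ClusterIdentifier="my-cluster",  # placeholder
        Database="dev",                  # placeholder
        DbUser="awsuser",                # placeholder
        Sql=sql,
    )
    return {"statement_id": response["Id"]}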
Many SQL workbenches (benches) support variable substitution, so simple replacements in the SQL can be done by the bench. However, this is very dependent on which bench you use and on having all users' benches configured correctly.
Bottom line - if you want something to modify your SQL, do it before it goes to Redshift.
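If the clients are scripts rather than people at a workbench, the simplest "before Redshift" fix is to do the substitution in the client itself. A minimal sketch with psycopg2 (the connection details and names are placeholders); psycopg2's sql module quotes the identifiers safely:

import psycopg2
from psycopg2 import sql

my_schema = "analytics"    # hypothetical schema name
my_table = "daily_sales"   # hypothetical table name

conn = psycopg2.connect(host="my-cluster.example.com", port=5439,
                        dbname="dev", user="awsuser", password="...")  # placeholders

# Build the statement with properly quoted identifiers, then send it on.
query = sql.SQL("SELECT * FROM {}.{}").format(
    sql.Identifier(my_schema), sql.Identifier(my_table))

with conn.cursor() as cur:
    cur.execute(query)
    rows = cur.fetchall()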

Related

Is it possible to evaluate a Postgres expression without connecting to a database?

PostgreSQL has excellent support for evaluating JSONPath expressions against JSON data.
For example, this query returns true because the value of the nested field is indeed "foo".
select '{"header": {"nested": "foo"}}'::jsonb @? '$.header ? (@.nested == "foo")'
Notably this query does not reference any schemas or tables. Ideally, I would like to use this functionality of PostgreSQL without creating or connecting to a full database instance. Is it possible to run PostgreSQL in such a way that it doesn't have schemas or tables, but is still able to evaluate "standalone" queries?
Some other context on the project: we need to evaluate JSONPath expressions against JSON data in both a Postgres database and a Python application. Unfortunately, Python does not have any JSONPath libraries that support enough of the spec to be useful to us.
Ideally, I would like to use this functionality of PostgreSQL without creating or connecting to a full database instance.
Well, it is open source. You can always pull out the source code for this functionality you want and adapt it to compile by itself. But that seems like a large and annoying undertaking, and I probably wouldn't do it. And short of that, no.
Why do you need this? Are you worried about scalability or ease of installation or performance or what? If you are already using PostgreSQL anyway, firing up a dummy connection to just fire some queries at the JSONB engine doesn't seem too hard.
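To make that concrete, here is a minimal sketch of "a dummy connection just to use the JSONB engine" from Python with psycopg2; the DSN is a placeholder and no schemas or tables are ever touched:

import psycopg2

conn = psycopg2.connect("dbname=postgres")  # placeholder DSN; any scratch DB works

doc = '{"header": {"nested": "foo"}}'
path = '$.header ? (@.nested == "foo")'

with conn.cursor() as cur:
    # The expression only references the two parameters, never a table.
    cur.execute("SELECT %s::jsonb @? %s::jsonpath", (doc, path))
    print(cur.fetchone()[0])  # True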

Should I filter data in PostgreSQL or server backend?

I am working on a project that uses GraphQL and PostgreSQL, where we want to select rows from the database whose value is after a certain date. It is currently selecting all data from the database and then filtering it on the server:
.filter(({time}) => moment(time).isAfter(startTime))
However, I would have thought it would be best to do this filtering in the database query, as the full dataset is never used.
Is there any benefit to doing it on the server rather than in the database query?
Barring some unusual edge case (such as other parts of your backend code really needing all the data for some reason), it would definitely be more efficient to filter everything on the Postgres side, via the SQL that is being used to fetch the data in the first place.
This is true for several reasons:
Assuming the table is properly indexed, the filtering will be able to occur much faster within the database.
The unneeded data will not need to be serialized and sent over the wire to the backend, only to then be discarded by the backend's own filtering.
The memory footprint should be reduced on both the Postgres and server end due to needing to process only a portion of the results.
I've not worked with GraphQL myself, but from doing a bit of poking around through its docs, it appears GraphQL often uses other mechanisms in different layers (outside of the database) to try to improve performance.
It would be worth seeing what the actual SQL is that your GraphQL query is generating (that may be possible via a function in GraphQL; it could also be done by enabling certain log settings on the Postgres server and correlating the log output to the query). That may lead to further optimization possibilities if you want to keep things purely GraphQL.
Jumping down to a raw query seems like it would be a good possibility though. Certainly that is something that is often done with ORMs like Django's and ActiveRecord.
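For what it's worth, the raw-query version of the filter is tiny. The thread's backend appears to be Node, but the idea is the same in any client library; this Python/psycopg2 sketch uses a hypothetical events table and time column:

import psycopg2

conn = psycopg2.connect("dbname=app")  # placeholder DSN
start_time = "2024-01-01T00:00:00Z"    # whatever startTime holds

with conn.cursor() as cur:
    # The WHERE clause pushes the filter into Postgres, so only matching rows
    # are serialized and sent over the wire; an index on "time" keeps it fast.
    cur.execute("SELECT id, time, payload FROM events WHERE time > %s",
                (start_time,))
    rows = cur.fetchall()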

Is it a bad idea to let laptops directly perform CRUD operations on databases?

I have developed an Excel add-in that I pitched to my employer's IT department. The add-in creates SELECT, INSERT, DELETE, and UPDATE SQL statements that are sent to a PostgreSQL database and any results (in the case of a SELECT statement) are returned to Excel to report on.
My team has been very impressed with this, but IT said that they don't allow laptops to perform CRUD operations directly on databases. Instead IT has set up certain environments to do this.
Can someone tell me if IT's concern around laptops directly connecting to a database and performing CRUD operations makes sense? Is this a valid concern?
If the laptops, their users and anybody else with access to them, the network connection, and the client software are all trusted, and you can always immediately push an update to the clients when the database structure inevitably changes in the future, then it's OK.
Otherwise it's not. The standard way would be to put some kind of service between the two that acts as a gatekeeper and defines the allowed operations on the database and who is allowed to do them. REST and SOAP (if you're enterprisey) are two popular options. And don't send SQL over the wire in those cases.
With some database engines it might be possible in theory to let the users directly authenticate with the database and use the database's permission model to limit what they can do. For instance you could only allow users to execute certain stored procedures. But in practice that's probably more trouble than it's worth.
To be honest in practice it's probably not OK. That's too many things to trust at once.
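As a toy illustration of the gatekeeper idea above, here is a sketch of a tiny REST endpoint that exposes exactly one allowed operation. Flask and psycopg2 are just one possible stack, and all the names (the orders table, its columns, the DSN) are made up:

from flask import Flask, jsonify
import psycopg2

app = Flask(__name__)

def get_conn():
    return psycopg2.connect("dbname=app")  # placeholder DSN

@app.route("/orders/<int:order_id>", methods=["GET"])
def get_order(order_id):
    # Only this one parameterized query is allowed; no SQL crosses the wire.
    with get_conn() as conn, conn.cursor() as cur:
        cur.execute("SELECT id, status FROM orders WHERE id = %s", (order_id,))
        row = cur.fetchone()
    if row is None:
        return jsonify(error="not found"), 404
    return jsonify(id=row[0], status=row[1])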
Yes, this is a valid concern. Someone could easily inject a SQL command into your database, and might be able to perform an operation that erases the entire database.
Say your software has this coded into it: "SELECT $var1 FROM TEST WHERE $var2", and the user can modify var1 and var2. If they put "date > 10; DROP *" into var2, your statement becomes "SELECT $var1 FROM TEST WHERE date > 10; DROP *;"
It is a little more complicated than that, but you should read up on SQL Injection.
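A minimal sketch of the difference, using Python and psycopg2 with made-up table and column names; the unsafe line is shown commented out, and the safe version passes the value as a parameter so it can only ever be treated as data:

import psycopg2

conn = psycopg2.connect("dbname=test")  # placeholder DSN
user_input = "10; DROP TABLE test; --"  # a hostile value for var2

with conn.cursor() as cur:
    # UNSAFE: string formatting splices the attacker's text into the statement,
    # turning one query into two.
    # cur.execute(f"SELECT col1 FROM test WHERE date > {user_input}")

    # SAFE: a parameterized query sends the value separately; here it would
    # simply fail to compare as a date rather than dropping anything.
    cur.execute("SELECT col1 FROM test WHERE date > %s", (user_input,))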

How does data.stackexchange.com allow queries securely?

https://data.stackexchange.com/ lets me query some (all?) of Stack Exchange's data/tables using arbitrary SQL queries, including parametrization.
What program do they use to do this, and is it published?
I want to create something like this myself (with different data), but am constantly worried that I'll miss an injection attack or set permissions incorrectly.
Obviously, data.stackexchange.com has figured out how to do this securely. How do I replicate what they've done?
This follows up my earlier question: Existing solution to share database data usefully but safely?

Data Warehousing Postgres

We're considering using SSIS to maintain a PostgreSQL data warehouse. I've used it before between SQL Servers with no problems, but I'm having a lot of difficulty getting it to play nicely with Postgres. I'm using the evaluation version of the OLEDB PGNP data provider (http://www.postgresql.org/about/news.1004).
I wanted to start with something simple like UPSERT on the fact table (10k-15k rows are updated/inserted daily), but this is proving very difficult (not to mention I’ll want to use surrogate keys in the future).
I've attempted (Link) and (http://consultingblogs.emc.com/jamiethomson/archive/2006/09/12/SSIS_3A00_-Checking-if-a-row-exists-and-if-it-does_2C00_-has-it-changed.aspx), which are effectively the same approach (except I don't really understand the UNION ALL at the end when I'm trying to upsert). But I run into the same problem with parameters when doing the update using an OLE DB command, which I tried to overcome using (http://technet.microsoft.com/en-us/library/ms141773.aspx), but that just doesn't seem to work; I get a validation error:
The external columns for complent.... are out of sync with the datasource columns... external column “Param_2” needs to be removed from the external columns.
(This error is repeated for the first two parameters as well; I never came across this using the SQL Server connection, as it supports named parameters.)
Has anyone come across this?
AND:
The fact that this simple task is apparently so difficult to do in SSIS suggests I'm using the wrong tool for the job. Is there a better (and still flexible) way of doing this, or would another ETL package be better for use between two Postgres databases? Other options include any listed at (http://en.wikipedia.org/wiki/Extract,_transform,_load#Open-source_ETL_frameworks). I could just go and write a load of SQL to do this for me, but I wanted a neat and easily maintainable solution.
I have used the Slowly Changing Dimension wizard for this with good success. It may give you what you are looking for, especially with the wizard:
http://msdn.microsoft.com/en-us/library/ms141715.aspx
The "external columns out of sync" error: SSIS is case sensitive. I encountered this issue multiple times, and it made me want to pull my hair out.
This simple task is going to take some work either way. SSIS is by no means an enterprise-class ETL product yet, but it does give you some quick and easy functionality and is sufficient for most ETL work. I guess it also comes down to your level of comfort with it.
SCD is way too slow for what I want; I need to use set-based SQL.
It turned out that a lot of my problems were with bugs in the provider.
I opened a forum topic (http://www.pgoledb.com/forum/viewtopic.php?f=4&t=49) and had a useful discussion with the moderator/support/developer person.
Also, Postgres doesn't let you do cross-database queries, so I solved the problem this way:
Data Source from Production DB to a temp Archive DB table
Run a set-based query between the temp table and the archive table
Truncate temp table
Note that the temp table is not actually a temp table, but a copy of the archive table's schema to temporarily store data in. A sketch of the steps is below.
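Roughly, the set-based step looks like this (Python/psycopg2 driving plain SQL; the table and column names are hypothetical). On a modern Postgres you could also collapse the update/insert pair into a single INSERT ... ON CONFLICT:

import psycopg2

conn = psycopg2.connect("dbname=warehouse")  # placeholder DSN

with conn, conn.cursor() as cur:
    # 1. Update rows that already exist in the archive table.
    cur.execute("""
        UPDATE fact_archive a
           SET amount = s.amount
          FROM fact_staging s
         WHERE a.business_key = s.business_key""")
    # 2. Insert rows that are new.
    cur.execute("""
        INSERT INTO fact_archive (business_key, amount)
        SELECT s.business_key, s.amount
          FROM fact_staging s
         WHERE NOT EXISTS (SELECT 1 FROM fact_archive a
                            WHERE a.business_key = s.business_key)""")
    # 3. Clear the staging table for the next load.
    cur.execute("TRUNCATE fact_staging")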
Took a while, but I got there in the end.
"SSIS is by no means an enterprise-class ETL product yet" - what enterprise ETL solution would you suggest?