How does data.stackexchange.com allow queries securely? - sql-injection

https://data.stackexchange.com/ lets me query some (all?) of
Stack Exchange's data/tables using arbitrary SQL queries, including
parameterization.
What program do they use to do this and is it published?
I want to create something like this myself (different data), but am
constantly worried that I'll miss an injection attack or set
permissions incorrectly.
Obviously, data.stackexchange.com has figured out how to do this
securely. How do I replicate what they've done?
This follows up my earlier question: Existing solution to share database data usefully but safely?
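For what it's worth, a common baseline for this kind of public query endpoint is a dedicated read-only role with hard limits, run against a copy of the data rather than the production database. A minimal PostgreSQL sketch (role name, schema, and limits are all hypothetical, and this is not necessarily how Data Explorer itself does it):
CREATE ROLE sandbox_reader LOGIN PASSWORD 'change-me';
GRANT USAGE ON SCHEMA public TO sandbox_reader;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO sandbox_reader;
ALTER ROLE sandbox_reader SET statement_timeout = '10s';
ALTER ROLE sandbox_reader SET default_transaction_read_only = on;
Note that default_transaction_read_only is a session default, not a security boundary; the GRANT/REVOKE surface and running against a disposable replica are what actually contain mistakes.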

Related

Using variables for schema and table names in a Redshift query

I want to be able to use variables in Redshift that refer to my DB objects (like schema and table names). Something like...
SET my_schema="schema";
SET my_table="table";
SELECT * from #my_schema.#my_table;
But it looks like Redshift doesn't have such a feature. Is there any workaround possible to achieve this?
There are a few ways you can try to attack this. But first, a caveat: trying to use a database engine for functions beyond querying the database is a waste of horsepower and the road to DB lock-in. So I'm going to focus on ways to do this before the database.
The most complete way is to use a front-end system that clients connect to and which in turn connects to the db. The one I've used in the past is pgbouncer-rr, which pools connections to the db but also allows for modifications to the SQL before it is sent on. This will do what you want, but you will need a computer to perform this work.
If you use the Redshift Data API you could put a Lambda function in series that performs the SQL modifications you desire (but make sure you get your API permissions right). However, I expect it is unlikely that you are looking to move to an API access model.
Many benches support variable substitution, and simple replacements in the SQL can be done by the bench. However, this is very dependent on which bench you use and on having all users' benches configured correctly.
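For example, psql can connect to Redshift and performs client-side variable substitution before anything is sent to the database (a sketch; schema and table names are hypothetical):
\set my_schema 'public'
\set my_table 'users'
SELECT * FROM :my_schema.:my_table;
Since the substitution happens in the client, Redshift only ever sees the finished SQL text.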
Bottom line - if you want something to modify your SQL, do it before it goes to Redshift.

Can A Postgres Replication Publication And Subscription Exist On The Same Server

I have a request asking for a read-only schema replica for a role in PostgreSQL. After reading documentation and better understanding replication in PostgreSQL, I'm trying to identify whether or not I can create the publisher and subscriber within the same database.
Any thoughts on the best approach without having a second server would be greatly appreciated.
You asked two different questions. Same database? No. Since pub/sub requires tables to have the same name (including schema) on both ends, you would be trying to replicate a table onto itself. Using logical replication plugins other than the built-in one might get around this restriction.
Same server? Yes. You can replicate between two databases of the same instance (but see the note in the docs about some extra hoops you need to jump through) or between two instances on the same host. So whichever of those things you meant by "same server", yes, you can.
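A sketch of the two-databases-on-one-instance setup (names hypothetical). The extra hoop from the docs is that the replication slot must be created by hand, because CREATE SUBSCRIPTION hangs if it tries to create the slot over a connection that points back into its own cluster:
-- connected to the source database "src"
CREATE PUBLICATION my_pub FOR TABLE public.orders;
SELECT pg_create_logical_replication_slot('my_slot', 'pgoutput');
-- connected to the destination database "dst" on the same instance
CREATE SUBSCRIPTION my_sub
    CONNECTION 'host=localhost dbname=src user=postgres'
    PUBLICATION my_pub
    WITH (create_slot = false, slot_name = 'my_slot');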
But it seems like an odd way to do this. If the access is read only, why does it matter whether it is to a replica of the real data or to the real data itself?

Should I filter data in PostgreSQL or server backend?

I am working on a project which uses GraphQL and PostgreSQL, where we want to select rows whose timestamp falls after a certain start time. It is currently selecting all data from the database and then filtering it on the server:
.filter(({time}) => moment(time).isAfter(startTime))
However I would have thought it would be best to do this filtering in the database query as the full dataset is never used.
Is there any benefit to doing it on the server rather than in the database query?
Barring some unusual edge case -- such as other parts of your backend code really needing all the data for some reason -- it would definitely be more efficient to filter everything on the Postgres side via the SQL that is being used to fetch the data in the first place.
This is true for several reasons:
Assuming the table is properly indexed, the filtering will be able to occur much faster within the database (see the sketch after this list).
The unneeded data will not need to be serialized and sent over the wire to the backend, only to then be discarded by the backend's own filtering.
The memory footprint should be reduced on both the Postgres and server end due to needing to process only a portion of the results.
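A minimal sketch of pushing the filter into the query, assuming a hypothetical events table with a time column:
CREATE INDEX events_time_idx ON events (time);
SELECT * FROM events WHERE time > $1;  -- $1 is bound to startTime by the client driver
With the index in place, Postgres can seek straight to the qualifying rows instead of scanning and shipping the whole table.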
I've not worked with GraphQL myself, but from doing a bit of poking around through its docs, it appears GraphQL often uses other mechanisms in different layers (outside of the database) to try to improve performance.
It would be worth seeing what the actual SQL is that your GraphQL query is generating (that may be possible via a function in GraphQL; it could also be done by enabling certain log settings on the Postgres server and correlating the log output to the query). That may lead to further optimization possibilities if you want to keep things purely GraphQL.
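One way to capture the generated SQL on the Postgres side is to turn on statement logging temporarily (this logs every statement, so it is best kept to development environments):
ALTER SYSTEM SET log_statement = 'all';
SELECT pg_reload_conf();
-- run the GraphQL query, inspect the server log, then revert:
ALTER SYSTEM RESET log_statement;
SELECT pg_reload_conf();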
Jumping down to a raw query seems like it would be a good possibility, though. Certainly dropping to raw SQL is something that is often done with ORMs like Django's and ActiveRecord.

Is it a bad idea to let laptops directly perform CRUD operations on databases?

I have developed an Excel add-in that I pitched to my employer's IT department. The add-in creates SELECT, INSERT, DELETE, and UPDATE SQL statements that are sent to a PostgreSQL database and any results (in the case of a SELECT statement) are returned to Excel to report on.
My team has been very impressed with this, but IT said that they don't allow laptops to perform CRUD operations directly on databases. Instead IT has set up certain environments to do this.
Can someone tell me if IT's concern around laptops directly connecting to a database and performing CRUD operations makes sense? Is this a valid concern?
If the laptops, their users and anybody else with access to them, the network connection, and the client software are all trusted, and you can always immediately push an update to the clients when the database structure inevitably changes in the future, then it's OK.
Otherwise it's not. The standard way would be to put some kind of service between the two that acts as a gatekeeper and defines the allowed operations on the database and who is allowed to do them. REST and (if you're enterprisey) SOAP are two popular options. And don't send SQL over the wire in those cases.
With some database engines it might be possible in theory to let the users directly authenticate with the database and use the database's permission model to limit what they can do. For instance you could only allow users to execute certain stored procedures. But in practice that's probably more trouble than it's worth.
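A sketch of that model in PostgreSQL terms (all names hypothetical): revoke direct table access and expose only vetted functions.
-- the function runs with its owner's privileges, not the caller's
CREATE FUNCTION place_order(p_item int, p_qty int) RETURNS void
    LANGUAGE sql SECURITY DEFINER
    AS $$ INSERT INTO orders (item_id, qty) VALUES (p_item, p_qty); $$;
REVOKE ALL ON orders FROM app_user;
REVOKE EXECUTE ON FUNCTION place_order(int, int) FROM PUBLIC;
GRANT EXECUTE ON FUNCTION place_order(int, int) TO app_user;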
To be honest in practice it's probably not OK. That's too many things to trust at once.
Yes, this is a valid concern. If the add-in builds SQL strings from user input, someone could inject an SQL command of their own, possibly one that erases entire tables or the whole database.
Say your software has this coded into it: "SELECT $var1 FROM TEST WHERE $var2", and the user can modify var1 and var2. If they put "date > 10; DROP TABLE TEST; --" into var2, your statement becomes "SELECT $var1 FROM TEST WHERE date > 10; DROP TABLE TEST; --", which is two statements, and the second one destroys the table.
It is a little more complicated than that, but you should read up on SQL Injection.
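The standard fix is to bind user input as parameters so it is always treated as data, never as SQL. You can see the idea in pure SQL with a prepared statement (table and column names hypothetical); note that only values can be bound this way, not identifiers or whole predicates:
PREPARE recent(int) AS SELECT name FROM test WHERE date > $1;
EXECUTE recent(10);  -- the argument can never break out of the query structure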

Accessing non-public schema in PostgreSQL with Pentaho

Let me start by saying, what I know about Pentaho wouldn't fill up a single paragraph. I'm more knowledgeable about PostgreSQL. I'm working with some contractors who are building a set of monthly reports in Pentaho (v. 4.5) for my company. Some of the data needs to go through an ETL process and get rolled up for reporting purposes. From a DBA(ish) point of view, I would like to move these tables into a separate PostgreSQL schema.
I know that Pentaho is often used with MySQL (which doesn't have schemas) and I'm concerned this might cause problems. I've done some "googlin'" and I don't turn up a lot of hits on the topic, but I did find a closed bug from a few years ago - thus implying that the functionality should be supported.
Before I do this, I would like to see if anyone knows of a reason this will fail or be a bad idea. (Or if you've done it and it works great, please let me know that, too.)
Final notes: I'm using PostgreSQL 9.1.5, and I don't have access to a Pentaho instance to even test this myself. I'm hoping the good folks in the Stack Overflow community will share their expertise and save me from having to install one and spend hours of playing/testing to get an idea of whether this is a bad idea.
EDIT:
I sort of knew this question was a bit vague, but I was hoping that someone would read it and share any experience they have. So, let me spell it out more clearly and ask more explicit questions.
I have not done anything. I don't know Pentaho. I don't want to learn Pentaho (not that there is anything wrong with Pentaho... it's just not where my interests are right now). My company hired contractors (I did not hire them). They have experience with Pentaho, but with MySQL. They don't really know anything about PostgreSQL. There are some important differences between PostgreSQL and MySQL, including the fact that PostgreSQL supports schemas (whereas MySQL uses separate databases... similar in concept but behaving differently in some ways). Some ORMs (and tools) don't really like this... for example, the Django framework still doesn't really fully support schemas in PostgreSQL (I know this because I use Python and Django often and my life is much better when I keep things in the "public" schema). Because of my experience with Django and PostgreSQL schemas, I'm a bit leery of moving this data to a new schema.
I do understand that wherever the tables are, the reporting user will need permissions to be able to access the data.
My explicit questions:
Do you use Pentaho against a PostgreSQL database to access tables in schemas other than "public" (the default)?
If so, does it just work (no problems)?
If you had problems, would you be willing to share with me (and the Stack Overflow community) any online resources that helped you? Or would you be willing to detail what you remember here?
Do you know of anything that just won't work correctly? For example, an open bug in Pentaho related to this topic.
Again, it's not your standard kind of question. I'm hoping that someone out there has experience and is willing to share it here and save me from having to spend time setting up a new Pentaho instance and trying to learn Pentaho well enough to test it, etc.
Thanks.
Two paths you can take:
1) What the previous post said ("Pentaho steps (table inputs, outputs, etc.) usually allow you to specify a database schema.")
2) In the database connection, Advanced tab, "The preferred schema name".
If you're working with different schemas, you can create one database connection per schema. With this approach you can leave the schema field in input/output steps empty.
We use MS SQL Server and I can tell you that Pentaho does struggle with the idea of a schema. Many of their apps allow you to select a schema, but Pentaho, like you said, is built to use something like MySQL.
Make your Pentaho database user work like it would in MySQL.
We made the database user default to dbo, then we structured our tables like dbo.dimDimension, dbo.factFactTable, etc. Basically, only use dbo for Pentaho purposes. (Or whatever schema you want to default to.)
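In PostgreSQL, the closest analogue to defaulting a user to dbo is setting the role's default search_path (a sketch; role and schema names are hypothetical):
ALTER ROLE pentaho SET search_path = reporting, public;
With that set, unqualified table names in the SQL Pentaho generates resolve against the reporting schema first.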
I use PDI and PgSQL extensively every day with a bunch of different schemas. It works fine. The only trouble you might run into is Pg's troublesome practice of forcing unquoted identifiers to lower instead of upper case. I soon realized everything was easier when I set the Advanced connection property to "Quote all in database".
Yes, you have to quote everything when you type SQL if PDI doesn't do it for you, but it works quite well. Haven't experimented with forcing all identifiers to lower case, but I expect that would work as well.
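A small illustration of the case folding that the "Quote all in database" option works around (hypothetical table name):
CREATE TABLE "MixedCase" (id int);
SELECT * FROM MixedCase;    -- fails: the unquoted name is folded to mixedcase
SELECT * FROM "MixedCase";  -- works: quoting preserves the case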
And yes, use the "Preferred schema name" option as well, but be aware that some steps use that option and others don't. You can't, for example, expect it to add schema names to SQL you type into a Table Input step.
The only other issues you might run into are the limits of Pg's JDBC driver. It's not as good as SQL Server's or DB2's, but the only thing I've ever had trouble with was sending error rows from a Table Output step to another step when the Table Output step was in batch mode.
Have fun learning PDI. It makes a great complement to your DBA skills.
Brian
Pentaho steps (table inputs, outputs, etc.) usually allow you to specify a database schema.
I did a quick test using PDI and our 8.4 Postgres instance and was able to explore, read from and write to tables in different schemas.
So, I think this is a reasonable direction. Hope this helps.