I am designing a REST API where users can pass in queries using a search query language that I will define.
The language will allow a number of comparison operators such as eq, ne, gt and lt (equals, not equals, greater than, less than).
The language will allow grouping and logical operators AND and OR.
For example, a query about companies may look like the following:
/api/companies?q=(CompanyName eq Microsoft Or CompanyName eq Apple) And State eq California
This should return all companies where the company name equals 'Microsoft' or 'Apple' and the state is California.
This all works fine, except that the system I am writing the API against is extremely flexible and allows almost any character in field values. I must also support custom fields, and those can have special characters in the field name.
Initially my main concern was fields that contain parentheses. I will be converting this query into a SQL Server query, and I need a way to ensure that I do not confuse a parenthesis in a field value with one that is intended for grouping. My second thought was to force field values to be quoted, but I think this will cause similar problems.
I was also considering that there may be a simple approach involving HTML encoding, but I am unable to see exactly how that would work.
What I am looking for is any advice or examples of reasonable approaches to handling a REST search query with such flexible data.
You should use percent encoding to escape characters in your query string, see RFC 3986. This previous StackOverflow post contains some useful background information about URI encoding.
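As a minimal sketch of what that looks like in practice, the snippet below percent-encodes the search expression from the question using Python's standard library. The variable names are only illustrative.

```python
from urllib.parse import quote, unquote

# The search expression from the question, as the client would write it.
expression = "(CompanyName eq Microsoft Or CompanyName eq Apple) And State eq California"

# safe="" makes quote() encode every reserved character, including "/".
encoded = quote(expression, safe="")
url = "/api/companies?q=" + encoded

# The server reverses the encoding before parsing the expression.
decoded = unquote(encoded)
```

Percent encoding only protects the URL layer. Once the expression is decoded, the query language still needs its own quoting rule to tell a parenthesis inside a value from one used for grouping.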
Initially my main concern was fields that contained parentheses. I will be converting this
query into a SQL server query and I need a way to ensure that I do not confuse a parentheses
in a field value with one that is intended for grouping
If this might be a problem then it sounds like your application will be susceptible to SQL injection. You should be escaping any external data before constructing an SQL query.
/api/companies?q=(CompanyName eq Microsoft Or CompanyName eq Apple) And State eq California
Based on this example you could take advantage of the URI query string to better represent your query:
/api/companies?CompanyName=Microsoft%20OR%20Apple&State=California
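A rough sketch of how such a flattened query string could be turned into a parameterized query. The `ALLOWED_FIELDS` whitelist, the `build_query` helper, the `companies` table and the " OR " convention inside a value are all assumptions made for illustration, and sqlite3 stands in for SQL Server only so that the example runs on its own.

```python
import sqlite3
from urllib.parse import parse_qs

# Hypothetical whitelist: only these column names may appear in the SQL text.
ALLOWED_FIELDS = {"CompanyName", "State"}

def build_query(query_string):
    """Turn 'Field=a OR b&Other=c' into SQL with placeholders plus a parameter list."""
    clauses, params = [], []
    for field, values in parse_qs(query_string).items():
        if field not in ALLOWED_FIELDS:
            raise ValueError(f"unknown field: {field}")
        # Assumed convention: ' OR ' inside a value separates alternatives
        # (so a value that itself contains ' OR ' is not supported here).
        options = [v for raw in values for v in raw.split(" OR ")]
        placeholders = ", ".join("?" for _ in options)
        clauses.append(f"{field} IN ({placeholders})")
        params.extend(options)
    sql = "SELECT CompanyName, State FROM companies"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE companies (CompanyName TEXT, State TEXT)")
conn.executemany(
    "INSERT INTO companies VALUES (?, ?)",
    [("Microsoft", "Washington"), ("Apple", "California"), ("Acme (West)", "California")],
)

# The parentheses in 'Acme (West)' are plain data: they never reach the SQL text.
sql, params = build_query("CompanyName=Apple%20OR%20Acme%20(West)&State=California")
rows = conn.execute(sql, params).fetchall()
```

Field names are checked against the whitelist because they cannot be bound as parameters, while every value is bound, so special characters in values cannot change the structure of the statement.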
Here is an example.
http://www.sqlservercentral.com/articles/Full-Text+Search+(2008)/64248/
I have a case where addresses and country names have special characters. For example:
People's Republic of Korea
De'Paul & Choice Street
etc.
This data gets sent as a JSON payload to the backend to be inserted into a JSONB column in Postgres.
The insert statement gets messed up because of the single quote and ends up erroring out.
The front-end developers are saying that they are using popular libraries to get country names etc. and don't want to touch the data. They just want to pass it as is.
Any tips on how to process such data with special characters, especially characters that conflict with JSON formatting, and safely insert it into Postgres?
Your developers are using the popular libraries, whatever they may be, in the wrong fashion. The application is obviously vulnerable to SQL injection, the most popular way to attack a database application.
Use prepared statements, then the problem will go away. If you cannot do that, use the popular library's functions to escape the input string for use as an SQL string literal.
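A minimal sketch of the idea, using the values from the question. Here sqlite3 stands in for PostgreSQL purely so the example is self-contained, and the `addresses` table is made up; a PostgreSQL driver works the same way, although the placeholder style differs (for example `%s` in psycopg).

```python
import json
import sqlite3

# sqlite3 stands in for PostgreSQL here so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE addresses (payload TEXT)")

payload = {"country": "People's Republic of Korea", "street": "De'Paul & Choice Street"}

# The value travels separately from the SQL text, so the quotes need no escaping.
conn.execute("INSERT INTO addresses (payload) VALUES (?)", (json.dumps(payload),))

stored = json.loads(conn.execute("SELECT payload FROM addresses").fetchone()[0])
```

The front end can keep passing the data as is, because the single quotes are never spliced into the statement.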
We're developing a REST API for our platform. Let's say we have organisations and projects, and projects belong to organisations.
After reading this answer, I would be inclined to use numerical IDs in the URL, so that some of the URLs would become (say with a prefix of /api/v1):
/organisations/1234
/organisations/1234/projects/5678
However, we want to use the same URL structure for our front end UI, so that if you type these URLs in the browser, you will get the relevant webpage in the response instead of a JSON file. Much in the same way you see relevant names of persons and organisations in sites like Facebook or Github.
Using this, we could get something like:
/organisations/dutchpainters
/organisations/dutchpainters/projects/nightwatch
It looks like Github actually exposes their API in the same way.
The advantages and disadvantages I can come up with for using names instead of IDs for URL definitions, are the following:
Advantages:
More intuitive URLs for end users
1 to 1 mapping of front end UI and JSON API
Disadvantages:
Have to use unique names
Have to take care of conflict with reserved names, such as count, so later on, you can still develop an API endpoint like /organisations/count and actually get the number of organisations instead of the organisation called count.
Especially the latter one seems to become a potential pain in the rear. Still, after reading this answer, I'm almost convinced to use the string identifier, since it doesn't seem to make a difference from a convention point of view.
My questions are:
Did I miss important advantages / disadvantages of using strings instead of numerical IDs?
Did Github develop their string-based approach after their platform matured, or did they know from the start that it would imply some limitations (like the one I mentioned earlier, it seems that they did not implement such functionality)?
It's common to use a combination of both:
/organisations/1234/projects/5678/nightwatch
where the last part is simply ignored but used to make the url more readable.
In your case, with multiple levels of collections you could experiment with this format:
/organisations/1234/dutchpainters/projects/5678/nightwatch
If somebody writes
/organisations/1234/germanpainters/projects/5678/wanderer
it would still map to the Rembrandt, but that should be OK. This leaves room for editing the names without breaking URLs that are already out there. Also, names don't have to be unique if you don't really need that.
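A small sketch of how a router could treat the IDs as authoritative and ignore the readable names. The `ROUTE` pattern and the `resolve` helper are hypothetical and only cover the URL shapes shown above.

```python
import re

# IDs are authoritative; the name segment after each ID is optional decoration.
ROUTE = re.compile(
    r"^/organisations/(?P<org_id>\d+)(?:/[^/]+)?"
    r"(?:/projects/(?P<project_id>\d+)(?:/[^/]+)?)?/?$"
)

def resolve(path):
    """Return (org_id, project_id) for a path, or None if it does not match."""
    match = ROUTE.match(path)
    if match is None:
        return None
    project_id = match.group("project_id")
    return int(match.group("org_id")), int(project_id) if project_id else None

painting = resolve("/organisations/1234/dutchpainters/projects/5678/nightwatch")
renamed = resolve("/organisations/1234/germanpainters/projects/5678/wanderer")
```

Both paths resolve to the same pair of IDs, which is what lets the names change without invalidating old links.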
Reserved HTTP characters such as “:”, “/”, “?”, “#”, “[”, “]” and “@”: these characters and others are “reserved” in the HTTP protocol to have “special” meaning in the implementation syntax so that they are distinguishable from other data in the URL. If a variable value within the path contains one or more of these reserved characters, then it will break the path and generate a malformed request. You can work around reserved characters in query string parameters by URL encoding them or sometimes by double escaping them, but you cannot in path parameters.
https://www.serviceobjects.com/blog/path-and-query-string-parameter-calls-to-a-restful-web-service
Consecutive numerical IDs are no longer recommended because they make it very easy to guess records in your database, and some might use that to obtain information they should not have access to.
Numerical IDs are used because in the database they have fixed-length storage, which makes indexing easy. For example, INT takes 4 bytes in MySQL and BIGINT takes 8 bytes, so every number has the same length in memory (100 stored as INT has the same length as 200), which makes it very easy to index and search for records.
If you have a lot of entries in the database, then indexing on a VARCHAR field is a bad idea. You should use a fixed-width field like CHAR(32) and pad the difference with spaces, but you then have to add logic in your program to handle the padding when searching the database.
Another idea would be to use slugs, but here you should take into consideration that some records might have the same slug, depending on what you use to form it. https://en.wikipedia.org/wiki/Semantic_URL#Slug
I would recommend using UUIDs since they have the same length and resolve this issue easily.
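For illustration, a short sketch with Python's standard `uuid` module showing the fixed lengths involved. The URL layout simply mirrors the one in the question and the variable names are arbitrary.

```python
import uuid

# uuid4() draws a random 128-bit identifier, so neighbouring records are not guessable.
project_id = uuid.uuid4()

as_text = str(project_id)    # canonical form with hyphens, always 36 characters
as_hex = project_id.hex      # compact form, always 32 characters (fits CHAR(32))
as_bytes = project_id.bytes  # raw form, always 16 bytes

url = f"/organisations/{uuid.uuid4()}/projects/{project_id}"

# Parsing the text from an incoming URL gives back the same identifier.
parsed = uuid.UUID(as_text)
```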
Just curious why I really have to specify a GROUP BY clause when I use a function that requires one (I can't remember the general name of those functions), e.g. SUM().
Because if I use one of those, I have to list every column that doesn't use one in the GROUP BY clause.
Why doesn't SQL just automatically group on all columns that aren't using an aggregate function? It seems redundant, since as soon as I'm using an aggregation I'm grouping on all the other columns that are not using it.
Probably for the same reason a C compiler would not automatically assume and insert a variable declaration if you are using one that has not been previously declared. There are programming languages which do that sort of thing; SQL is not one of them.
Editors, on the other hand, may be aware of this and at least auto-complete functionally dependent parts of the syntax for you. Oracle SQL Developer will by default automatically append a GROUP BY clause as soon as it detects you're writing a select column list that needs it. IMO this is a pain, and I usually keep it turned off, but that is as far as you will get: the IDE/editor level.
Edit: Based on your last comment, there is an option in MySQL (not Microsoft's T-SQL) meant to relax the rule by implementing optional feature T301 of the SQL:1999 standard. I think this is exactly what you're after:
MySQL 5.7.5 and up implements detection of functional dependence. If the ONLY_FULL_GROUP_BY SQL mode is enabled (which it is by default), MySQL rejects queries for which the select list, HAVING condition, or ORDER BY list refer to nonaggregated columns that are neither named in the GROUP BY clause nor are functionally dependent on them.
Source: https://dev.mysql.com/doc/refman/5.7/en/group-by-handling.html
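To make the rule being discussed concrete, here is a small sketch with a made-up `orders` table in which every non-aggregated column of the select list is repeated in GROUP BY. It uses sqlite3 only so that it runs on its own; SQLite itself does not enforce this rule and accepts bare columns, so the example shows the shape of a conforming query, not the error a strict engine would raise.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("alice", "north", 10), ("alice", "north", 5), ("bob", "south", 7)],
)

# Every non-aggregated column in the select list is repeated in GROUP BY.
rows = conn.execute(
    "SELECT customer, region, SUM(amount) FROM orders "
    "GROUP BY customer, region ORDER BY customer"
).fetchall()
```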
Could not find much information on the status of this feature in future versions of T-SQL, though. The only reference is this, with the very cryptic remark that T-SQL would "partially support this feature".
When designing a REST API, following guidance such as 10 Best Practices for Better RESTful API, there seem to be all sorts of ways to provide a query syntax, pagination, selecting fields to return, etc.
For example, some ways to do pagination:
/orders?max=20&start=100
/orders?per_page=20&page=5
Some ways to provide a query interface:
/orders?q=value>20
/orders?q={'value': 'gt 20'}
Are there any standards for how to design an API that offers these features? If not, standards in development or best practice guidelines would be useful.
When researching this for the Watson Discovery and Assistant APIs, we weren't able to find any widely adopted conventions for filtering or paging, although there are many different conventions.
Some considerations for which convention you use:
Do you need compound clauses in your query? If you want to be able to express a > 10 || b < 10, then you need a string syntax or a structured JSON format to represent the more complicated queries. That will likely be a usability challenge for your users, so it is preferable to avoid it if you don't really need the flexibility. In general, the simpler you can keep the requirements, the easier the API will be to learn and use, potentially at the expense of flexibility. For example, if it turns out that the created date is the only field that users actually care about doing inequality filtering on, you could have explicit begin_date and end_date filter parameters instead of allowing inequality comparisons on all fields.
For pagination, do you have frequently changing data? If so, paging by offset may give you unstable results. For example, paging through logs that are actively being created, sorted by most recent, would cause you to see duplicate items. To avoid this, the server can return a token that represents the next page. This token can either be a lookup value or directly encode the information necessary to identify the values of the next item in the potentially changing list. Microsoft's API guidelines contain examples of both token and offset based paging, and are one of many sets of conventions to follow: https://github.com/Microsoft/api-guidelines/blob/vNext/Guidelines.md#98-pagination
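A rough sketch of the second kind of token, one that directly encodes the position of the last item returned. The helper names, the `(created, id)` item shape and the in-memory list are all assumptions for illustration; a real server would apply the same comparison in its database query.

```python
import base64
import json

def encode_token(last_created, last_id):
    """Pack the sort key of the last item on a page into an opaque token."""
    raw = json.dumps({"created": last_created, "id": last_id}).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")

def decode_token(token):
    data = json.loads(base64.urlsafe_b64decode(token.encode("ascii")))
    return data["created"], data["id"]

def next_page(items, token=None, size=2):
    """items are (created, id) tuples sorted newest first."""
    if token is not None:
        position = decode_token(token)
        # Keep only items that sort strictly after the last one already seen.
        items = [item for item in items if item < position]
    page = items[:size]
    has_more = len(items) > size
    return page, (encode_token(*page[-1]) if has_more else None)

logs = [("2024-01-03", 3), ("2024-01-02", 2), ("2024-01-01", 1)]
page1, token = next_page(logs)

# A new log arrives between the two requests.
logs.insert(0, ("2024-01-04", 4))
page2, token2 = next_page(logs, token)
```

With offset paging, the second request (offset 2) would have returned the 2024-01-02 entry again. The token avoids that because it refers to a position in the data and not to a row count.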
I'm trying to create a more advanced query mechanism for REST. Assume I have the following:
GET /data/users
and it returns a list of users. Then to filter the users returned for example I'd say:
GET /data/users?age=30
to get a list of 30-year-old users. Now let's say I want users aged 30 to 40. I'd like to have essentially a set of reusable operators such as:
GET /data/users?greaterThan(age)=30&lessThan(age)=40
The greaterThan and lessThan operators would be reusable on other numeric, date, etc. fields. This would also allow me to add other operators (contains, starts with, ends with, etc.). I'm a REST noob, so I'm not sure if this violates any of the core principles REST follows. Any thoughts?
Alternately, you might simply be better off with optional parameters "minAge" and "maxAge".
Alternative 2: encode the value(s) for parameters to indicate the test to be performed: inequalities, pattern matching etc.
This gets messy no matter what you do for complex boolean expressions. At some point, you almost want to make a document format for the query description itself, but it's hard to think of it as a "GET" anymore.
I would look into setting the value of the query parameter to include syntax for operators and such, something like this for a range of values:
/data/users?age=[30,40]
or
/data/users?age=>30&age=<40
would make it a little easier to read. Just make sure to URL-encode if you are using any reserved characters.
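A minimal sketch of how the second form could be produced and parsed. The `OPERATORS` table and the `parse_filters` helper are hypothetical; the operator names are borrowed from the eq/gt/lt style used elsewhere on this page.

```python
from urllib.parse import parse_qsl, urlencode

# Longest prefixes first so ">=" is not mistaken for ">".
OPERATORS = [(">=", "ge"), ("<=", "le"), (">", "gt"), ("<", "lt")]

def parse_filters(query_string):
    """Return (field, operator, value) triples; a bare value means equality."""
    filters = []
    for field, raw in parse_qsl(query_string):
        for prefix, name in OPERATORS:
            if raw.startswith(prefix):
                filters.append((field, name, raw[len(prefix):]))
                break
        else:
            filters.append((field, "eq", raw))
    return filters

# The client URL-encodes the values, so ">" and "<" travel as %3E and %3C.
query = urlencode([("age", ">30"), ("age", "<40")])
filters = parse_filters(query)
```

The values come back as strings, so converting them to numbers or dates is left to whatever knows the type of each field.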