Underscores vs dashes in HTTP parameter names - forms

I'm familiar with the convention of using hyphens to separate words in URL paths. What about parameter names, such as within a <form>:
<form>
<input name="my_special_field">
</form>
Is that better, or my-special-field? I've seen Google use underscores in Analytics with utm_campaign and other parameter names. Underscores read a little better and allow for the occasional hyphen within the name (field_for_5-16-17). But hyphens are certainly the convention for URL paths.
What's the convention for separating words in an HTTP parameter name?

What's the convention for separating words in an HTTP parameter name?
Well, I think there is none. I hear hyphens perform a bit better SEO-wise. But as long as you are compliant with RFC 3986 (especially section 3.4), everything is okay.
If it really interests you, part of the dilemma is that the query string has never been formalized. There is only a consensus on which characters are supposed to be allowed in it.
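For what it's worth, both separators survive standard tooling untouched. A minimal JavaScript sketch (the field names here are made up):
// Underscore and hyphen are both unreserved per RFC 3986, so neither
// gets percent-encoded by the standard query-string serializer.
const params = new URLSearchParams();
params.set("my_special_field", "a"); // underscore style
params.set("my-special-field", "b"); // hyphen style
console.log(params.toString()); // "my_special_field=a&my-special-field=b"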

Related

Any way to extend org-mode tags to include characters like "-"?

BRIEF: can the characters permitted in org-mode tags be extended in any way?
E.g. to include -, dashes?
DETAIL:
I see
http://orgmode.org/org.html#Tags
Tags are normal words containing letters, numbers, ‘_’, and ‘#’. Tags
must be preceded and followed by a single colon, e.g., ‘:work:’.
I am somewhat surprised that this is not extensible. Is it, and I have missed it?
TODO keywords can include dashes. Occasionally I would like to treat TODOs as interchangeable with tags - e.g. move a TODO to a tag, and vice versa - but this syntax difference gets in the way.
Before I start coding, does anyone know why dashes are not allowed? I conjecture confusion with timestamps.
Not certain, but it is likely because some agenda search strings use "-" as an exclusion operator, and there is not yet a way to escape a literal "-" in them. For example, the match string work-urgent selects entries tagged work but not urgent, so a hyphen inside a tag name would make that expression ambiguous.
I have used hyphens for properties and spaces for todo kw with no problems for my purposes, but I have not tried them in searches.
The new exporter will be included in 8.0 and with it a detailed description of syntax. If you want to file a feature request to allow hyphens, perhaps by allowing escaping in search strings, now is the time to do it.

Hyphen, underscore, or camelCase as word delimiter in URIs?

I'm designing an HTTP-based API for an intranet app. I realize it's a pretty small concern in the grand scheme of things, but: should I use hyphens, underscores, or camelCase to delimit words in the URIs?
Here are my initial thoughts:
camelCase
possible issues if server is case-insensitive
seems to have fairly widespread use in query string keys (http://api.example.com?searchQuery=...), but not in other URI parts
Hyphen
more aesthetically pleasing than the other alternatives
seems to be widely used in the path portion of the URI
never seen a hyphenated query string key in the wild
possibly better for SEO (this may be a myth)
Underscore
potentially easier for programming languages to handle
several popular APIs (Facebook, Netflix, StackExchange, etc.) are using underscores in all parts of the URI.
I'm leaning towards underscores for everything. The fact that most of the big players are using them is compelling (see https://stackoverflow.com/a/608458/360570).
You should use hyphens in a crawlable web application URL. Why? Because the hyphen separates words (so that a search engine can index the individual words), and a hyphen is not a word character. Underscore is a word character, meaning it should be considered part of a word.
Double-click this in Chrome: camelCase
Double-click this in Chrome: under_score
Double-click this in Chrome: hyphen-ated
See how Chrome (I hear Google makes a search engine too) only thinks one of those is two words?
camelCase and underscore also require the user to use the shift key, whereas hyphenated does not.
So if you should use hyphens in a crawlable web application, why would you bother doing something different in an intranet application? One less thing to remember.
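The same split shows up in regular expressions, where \w (the "word character" class) includes underscores but not hyphens. A quick JavaScript check:
// \w matches [A-Za-z0-9_], so an underscore keeps a token together
// while a hyphen breaks it in two, just like a search tokenizer would.
console.log("camelCase".match(/\w+/g));   // ["camelCase"]      (one word)
console.log("under_score".match(/\w+/g)); // ["under_score"]    (one word)
console.log("hyphen-ated".match(/\w+/g)); // ["hyphen", "ated"] (two words)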
The standard best practice for REST APIs is to use hyphens, not camelCase or underscores.
This comes from Mark Masse's "REST API Design Rulebook" from O'Reilly.
In addition, note that Stack Overflow itself uses hyphens in the URL: .../hyphen-underscore-or-camelcase-as-word-delimiter-in-uris
As does WordPress: http://inventwithpython.com/blog/2012/03/18/how-much-math-do-i-need-to-know-to-program-not-that-much-actually
Short Answer:
lower-cased words with a hyphen as separator
Long Answer:
What is the purpose of a URL?
If pointing to an address were the whole answer, a shortened URL would do the job just as well. But if we don't make URLs easy to read and maintain, they won't help developers or maintainers. URLs represent an entity on the server, so they must be named logically.
Google recommends using hyphens
Consider using punctuation in your URLs. The URL http://www.example.com/green-dress.html is much more useful to us than http://www.example.com/greendress.html. We recommend that you use hyphens (-) instead of underscores (_) in your URLs.
Coming from a programming background, one might favor camelCase, a popular choice for naming joint words.
But RFC 3986 defines URLs as case-sensitive in every part except the scheme and host.
Since URLs are case-sensitive, keeping everything lower-cased is always safe and considered a good standard. That takes camelCase out of the running.
Source: https://metamug.com/article/rest-api-naming-best-practices.html#word-delimiters
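As a minimal sketch of that convention, here is a hypothetical slugify helper in JavaScript (the function name and exact rules are mine, not from the cited article):
// Turn an arbitrary resource name into a lower-cased, hyphen-separated segment.
function slugify(name) {
  return name
    .trim()
    .toLowerCase()               // URLs are case-sensitive, so normalize down
    .replace(/[^a-z0-9]+/g, "-") // collapse runs of non-alphanumerics into one hyphen
    .replace(/^-+|-+$/g, "");    // strip leading/trailing hyphens
}
console.log(slugify("Green Dress")); // "green-dress"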
Whilst I recommend hyphens, I shall also postulate an answer that isn't on your list:
Nothing At All
My company's API has URIs like /quotationrequests/, /purchaseorders/ and so on.
Despite you saying it was an intranet app, you listed SEO as a benefit. Google does match the pattern /foobar/ in a URL for a query of ?q=foo+bar
I really hope you do not consider executing a PHP call to any arbitrary string the user passes in to the address bar, as #ServAce85 suggests!
In general, it's not going to have enough of an impact to worry about, particularly since it's an intranet app and not a general-use Internet app. In particular, since it's intranet, SEO isn't a concern, since your intranet shouldn't be accessible to search engines. (and if it is, it isn't an intranet app).
And any framework worth its salt either already has a default way to do this, or makes it fairly easy to change how it deals with multi-word URL components, so I wouldn't worry about it too much.
That said, here's how I see the various options:
Hyphen
The biggest danger with hyphens is that the same character (typically) is also used for subtraction and numerical negation (i.e. minus or negative).
Hyphens feel awkward in URL components. They seem to only make sense at the end of a URL to separate words in the title of an article. Or, for example, the title of a Stack Overflow question that is added to the end of a URL for SEO and user-clarity purposes.
Underscore
Again, they feel wrong in URL components. They break up the flow (and beauty/simplicity) of a URL, since they essentially add a big, heavy apparent space in the middle of a clean, flowing URL.
They tend to blend in with underlines. If you expect your users to copy-paste your URLs into MS Word or other similar text-editing programs, or anywhere else that might pick up on a URL and style it with an underline (like links traditionally are), then you might want to avoid underscores as word separators. Particularly when printed, an underlined URL with underscores tends to look like it has spaces in it instead of underscores.
CamelCase
By far my favorite, since it makes the URLs seem to flow better and doesn't have any of the faults that the previous two options do.
Can be slightly harder to read for people who have a hard time differentiating upper case from lower case, but this shouldn't be much of an issue in a URL, because most "words" should be URL components separated by a / anyway. If you find that you have a URL component that is more than two "words" long, you should probably try to find a better name for that concept.
It does have a possible issue with case sensitivity, but most platforms can be adjusted to be either case-sensitive or case-insensitive. And it's only really an issue in two cases: a) humans typing the URL in, and b) programmers (since we are not human) typing the URL in. Typos are always a problem regardless of case sensitivity, so this is no different from using all one case.
It is recommended to use spinal-case (which is highlighted by RFC 3986); this case is used by Google, PayPal, and other big companies.
source:- https://blog.restcase.com/5-basic-rest-api-design-guidelines/
EDIT: Although the highlight on the RFC is nowhere to be found, the recommendation on spinal case is still valid (as already noted in other answers)
We should use hyphens in web page URLs to convince search engines to index each keyword in the URL separately.
Here's the best of both worlds.
I also "like" underscores; besides all your positive points about them, there is a certain old-school style to them.
So what I do is use underscores, and simply add a small rewrite rule to Apache's .htaccess file to rewrite all underscores to hyphens.
https://yoast.com/apache-rewrite-dash-underscore/
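The linked article's approach boils down to a rule along these lines (a hedged sketch of one possible .htaccess rule, not the article's exact text; test before deploying):
RewriteEngine On
# Each request whose path contains an underscore is 301-redirected with the
# first underscore replaced by a hyphen; the browser follows the redirect
# chain until no underscores remain.
RewriteRule ^([^_]*)_(.*)$ /$1-$2 [R=301,L]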

What Unicode characters are dangerous?

What Unicode characters (more precisely codepoints) are dangerous and should be blacklisted and prohibited for the users to use?
I know that BIDI override characters and the "zero width space" are very prone to make problems, but what others are there?
Thanks
Characters aren’t dangerous: only inappropriate uses of them are.
You might consider reading things like:
Unicode Standard Annex #31: Unicode Identifier and Pattern Syntax
RFC 3454: Preparation of Internationalized Strings (“stringprep”)
It is impossible to guess what you mean by dangerous.
A golden rule in security is to whitelist instead of blacklist: rather than trying to cover all bad characters, it is a much better idea to validate by ensuring the user only uses known good characters.
There are solutions that help you build the large whitelist required for international input. For example, in .NET there is UnicodeCategory.
The idea is that instead of whitelisting thousands of individual characters, the library assigns them to categories like alphanumeric characters, punctuation, control characters, and so on.
Tutorial on whitelisting international characters in .NET
Unicode Regex: Categories
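The same category-based idea is available outside .NET; for example, JavaScript's Unicode property escapes. A sketch, with an allowed set chosen purely for illustration:
// Whitelist by Unicode category instead of enumerating characters:
// \p{L} is any letter, \p{N} any number; everything else is rejected.
const allowed = /^[\p{L}\p{N} _-]+$/u;
console.log(allowed.test("José_García-2")); // true  (letters from any script pass)
console.log(allowed.test("<script>"));      // false (punctuation and symbols fail)
// Caveat: some invisibles (e.g. U+3164 HANGUL FILLER, discussed below) are
// categorized as letters, so pair this with an explicit deny-list.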
'HANGUL FILLER' (U+3164)
Since Unicode 1.1 in 1993, there has been a wide character that renders as empty space.
We can't see it, and we can't copy/paste it on its own, because we can't select it!
It has to be generated, e.g. with the Linux keyboard shortcut Ctrl + Shift + U, then 3164.
It can pretty much 💩 up anything: variables, function names, URLs, file names; it can mimic DNS names, invalidate hash strings, corrupt database entries, blog posts, and logins, and allow fake identical accounts.
DEMO 1: Altering variables
The variable hijaㅤcked contains a Hangul Filler character, while the console.log call references the name without it, so the lookup fails:
const normal = "Hello w488ld"
const hijaㅤcked = "Hello w488ld"
console.log(normal)   // "Hello w488ld"
console.log(hijacked) // ReferenceError: hijacked is not defined
DEMO 2: Hijacking URLs
All three of these URLs lead to xn--stackoverflow-fr16ea.com:
https://stackㅤㅤoverflow.com
https://stackㅤㅤoverflow.com
https://stackㅤㅤoverflow.com
See Unicode Security Considerations Report.
It covers various aspects, from spoofing of rendered strings to dangers of processing UTF encodings in unsafe languages.
U+2800 BRAILLE PATTERN BLANK - a Braille character without any "dots". It looks like a regular "space" but is not classified as one.
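If you do need a deny-list for the invisibles mentioned in this thread, a minimal JavaScript screen might look like this (the character list is illustrative, not exhaustive):
// Zero-width space/joiners, word joiner, BOM, Hangul Filler, Braille blank.
const INVISIBLES = /[\u200B-\u200D\u2060\uFEFF\u3164\u2800]/;
function hasInvisibles(s) {
  return INVISIBLES.test(s);
}
console.log(hasInvisibles("hija\u3164cked")); // true
console.log(hasInvisibles("hijacked"));       // false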

Wikipedia (MediaWiki) URI encoding scheme

How does Wikipedia (or MediaWiki in general) encode page titles in URIs? It's not normal URI encoding, since spaces are replaced with underscores, double quotes are not encoded, and things like that.
http://en.wikipedia.org/wiki/Wikipedia:Naming_conventions_%28technical_restrictions%29 - here you've got some kind of description of what their engine enforces on article names.
They should have something like this in their LocalSettings.php:
$wgArticlePath = '/wiki/$1';
and proper server URI rewrites configuration - they seem to be using Apache (HTTP header), so it's probably mod_rewrite. http://www.mediawiki.org/wiki/Manual:Short_URL
You can also refer to the index.php file for an article on Wikipedia like this: http://en.wikipedia.org/w/index.php?title=Foo%20bar and get redirected by the engine to http://en.wikipedia.org/wiki/Foo_bar. Behind the scenes mod_rewrite translates it into /index.php?title=Foo_bar. For the MediaWiki engine it's the same as if you visited http://en.wikipedia.org/w/index.php?title=Foo_bar - this page doesn't redirect you.
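The rewrite described above would look roughly like this in Apache configuration (a sketch based on this answer's description; the MediaWiki manual linked above has the authoritative recipes):
RewriteEngine On
# Map the pretty /wiki/Foo_bar form onto the real script path.
RewriteRule ^wiki/(.*)$ /w/index.php?title=$1 [L,QSA]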
The process is quite complex and isn't exactly pretty. You need to look at the Title class found in includes/Title.php. You should start with the newFromText method, but the bulk of the logic is in the secureAndSplit method.
Note that (as ever with MediaWiki) the code is not decoupled in the slightest. If you want to replicate it, you'll need to extract the logic rather than simply re-using the class.
The logic looks something like this (a simplified sketch follows the list):
Decode character references (e.g. &eacute; becomes é)
Convert spaces to underscores
Check whether the title is a reference to a namespace or interwiki
Remove hash fragments (e.g. the #Name in Apple#Name)
Remove forbidden characters
Forbid subdirectory links (e.g. ../directory/page)
Forbid triple tilde sequences (~~~) (for some reason)
Limit the size to 255 bytes
Capitalise the first letter
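A toy JavaScript approximation of a few of those steps (purely illustrative; MediaWiki's real logic in Title.php handles far more cases):
// NOT the real algorithm: a simplified sketch of title normalization.
function normalizeTitle(raw) {
  // Decode numeric character references, e.g. "&#233;" becomes "é".
  let t = raw.replace(/&#(\d+);/g, (_, n) => String.fromCodePoint(Number(n)));
  t = t.replace(/ /g, "_");  // spaces become underscores
  t = t.replace(/#.*$/, ""); // drop any hash fragment
  if (t.includes("../")) throw new Error("subdirectory links are forbidden");
  if (t.includes("~~~")) throw new Error("tilde sequences are forbidden");
  return t.charAt(0).toUpperCase() + t.slice(1); // capitalise the first letter
}
console.log(normalizeTitle("foo bar#History")); // "Foo_bar"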
Furthermore, I believe I'm right in saying that quotation marks don't need to be encoded by the original user -- browsers can handle them transparently.
I hope that helps!

Should I use accented characters in URLs?

When one creates web content in languages other than English, the problem of search-engine-optimized and user-friendly URLs emerges.
I'm wondering whether it is best practice to use de-accented letters in URLs -- risking that some words have completely different meanings with and without certain accents -- or whether it is better to stick with non-English characters where appropriate, sacrificing the readability of those URLs in less capable environments (e.g. MSIE, view source).
"Exotic" letters could appear anywhere: in titles of documents, in tags, in user names, etc, so they're not always under the complete supervision of the maintainer of the website.
A possible approach of course would be setting up alternate -- unaccented -- URLs as well which would point to the original destination, but I would like to learn your opinions about using accented URLs as primary document identifiers.
There's no ambiguity here: RFC 3986 says no; that is, URIs cannot contain Unicode characters, only ASCII.
An entirely different matter is how browsers represent encoded characters when displaying a URI, for example some browsers will display a space in a URL instead of '%20'. This is how IDN works too: punycoded strings are encoded and decoded by browsers on the fly, so if you visit café.com, you're really visiting xn--caf-dma.com. What appears to be unicode chars in URLs is really only 'visual sugar' on the part of the browser: if you use a browser that doesn't support IDN or unicode, the encoded version won't work because the underlying definition of URLs simply doesn't support it, so for it to work consistently, you need to % encode.
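You can watch both mechanisms with the WHATWG URL parser that browsers and Node ship:
// The parser punycodes the host and percent-encodes the path as UTF-8.
const u = new URL("https://café.com/éléphant");
console.log(u.hostname); // "xn--caf-dma.com"      (IDN / punycode)
console.log(u.pathname); // "/%C3%A9l%C3%A9phant"  (UTF-8 percent-encoding)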
When faced with a similar problem, I took advantage of URL rewriting to allow such pages to be accessible by either the accented or unaccented character. The actual URL would be something like
http://www.mysite.com/myresume.html
And a rewriting+character translating function allows this reference
http://www.mysite.com/myresumé.html
to load the same resource. So to answer your question, as the primary resource identifier, I confine myself to 0-9, A-Z, a-z and the occasional hyphen.
Considering that URLs with accents often end up looking like this:
http://fr.wikipedia.org/wiki/%C3%89l%C3%A9phant
...which is not that nice... I think we'll still be using de-accented URLs for some time.
Though, things should get better, as accented URLs are now accepted by web browsers, it seems.
The Firefox 3.5 I'm currently using displays the URL the nice way, and not with %stuff, btw; this seems to be "new" since Firefox 3.0 (see Firefox 3: UTF-8 support in location bar); so it's probably not supported in IE 6, at least -- and there are still far too many people using that one :-(
Maybe URLs with no accents don't look the best they could; but, still, people are used to them, and seem to generally understand them quite well.
You should avoid non-ASCII characters in URLs that may be entered manually in the browser by users. It's OK for embedded links pre-encoded by the server.
We found out that browsers can encode the URL in different ways, and it's very hard to figure out which encoding is used. See my question on this issue:
Handling Character Encoding in URI on Tomcat
There are several areas in a full URL, and each one may have different rules.
The protocol is plain ASCII.
The DNS entry is governed by IDN (International Domain Names) rules, and can contain (most) of the Unicode characters.
The path (after the first /), the user name, and the password can again be almost anything. They are escaped (as %XX), but those are just bytes; which encoding those bytes are in is difficult to know (it is interpreted by the HTTP server).
The parameters part (after the first ?) is passed "as is" (after %XX unescaping) to some server-side application (PHP, ASP, JSP, CGI), and how that interprets the bytes is another story.
It is recommended that the path/user/password/arguments be UTF-8, but that's not mandatory, and not everyone respects it.
So you should definitely allow for non-ASCII (we are not in the 80s anymore), but exactly what you do with it might be tricky. Try to use Unicode and stay away from legacy code pages; tag your content with the proper encoding/charset if you can (using meta in HTML, language directives for ASP/JSP, etc.)
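If you control the links, you can guarantee the recommended UTF-8 interpretation by pre-encoding them; a minimal JavaScript sketch:
// encodeURIComponent always emits UTF-8 percent-escapes, so the server
// never has to guess which code page the bytes came from.
const title = "Éléphant";
console.log("http://fr.wikipedia.org/wiki/" + encodeURIComponent(title));
// "http://fr.wikipedia.org/wiki/%C3%89l%C3%A9phant"
console.log(decodeURIComponent("%C3%89l%C3%A9phant")); // "Éléphant"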