Scala variable naming rules with underscore - scala

What are the rules for naming methods and variables in Scala, especially when mixing symbols and letters using _? For instance, why are _a_, a_+, __a, __a__a__a__+ and ___ valid names, while _a_+_a and _a_+_ are not?

It's in the very first section of the Scala Language Specification:
There are three ways to form an identifier. First, an identifier can start with a letter which can be followed by an arbitrary sequence of letters and digits. This may be followed by underscore ‘_’ characters and another string composed of either letters and digits or of operator characters.
It's not entirely clear from this, but the operator characters cannot be followed by anything else. Seen here (the pattern for the end of the identifier):
idrest ::= {letter | digit} [‘_’ op]
_a_+_a and _a_+_ are illegal because they have another letter or underscore following the operator characters. However, they are legal if you surround them with back quotes.
scala> val `_a_+_` = 1
_a_+_: Int = 1
scala> val `_a_+_a` = 1
_a_+_a: Int = 1
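As a rough illustration of the rule above, here is a minimal Python sketch that approximates the plain-identifier pattern with a regular expression. It is restricted to ASCII, the operator-character set `OP` is an assumption based on the usual ASCII operator characters, and the name `is_plain_identifier` is made up for the example. It does not handle reserved words or backquoted identifiers.

```python
import re

# Assumed ASCII operator characters; the real grammar also allows Unicode symbols.
OP = r"[!#%&*+\-/:<=>?@\\^|~]"

# Either letters/digits (with '_' and '$' counted as letters), optionally ending
# in '_' followed by operator characters, or operator characters only.
PLAIN_ID = re.compile(rf"(?:[A-Za-z_$][A-Za-z0-9_$]*(?:_{OP}+)?|{OP}+)")

def is_plain_identifier(name):
    """Return True if name fits the approximate plain-identifier pattern."""
    return PLAIN_ID.fullmatch(name) is not None

for name in ["_a_", "a_+", "__a", "__a__a__a__+", "___", "_a_+_a", "_a_+_"]:
    print(name, is_plain_identifier(name))
```

With this approximation the five valid names from the question are accepted, and the two names that continue after the operator characters are rejected.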

From here:
There are three ways to form an identifier. First, an identifier can
start with a letter which can be followed by an arbitrary sequence of
letters and digits. This may be followed by underscore ‘_’ characters
and another string composed of either letters and digits or of
operator characters. Second, an identifier can start with an operator
character followed by an arbitrary sequence of operator characters.
The preceding two forms are called plain identifiers. Finally, an
identifier may also be formed by an arbitrary string between
back-quotes (host systems may impose some restrictions on which
strings are legal for identifiers). The identifier then is composed of
all characters excluding the backquotes themselves.
You can also see in the link the grammar of the language.

Related

Regex match invalid pattern ios swift 4 [duplicate]

How to rewrite the [a-zA-Z0-9!$* \t\r\n] pattern to match hyphen along with the existing characters ?
The hyphen is usually a normal character in regular expressions. Only if it’s in a character class and between two other characters does it take a special meaning.
Thus:
[-] matches a hyphen.
[abc-] matches a, b, c or a hyphen.
[-abc] matches a, b, c or a hyphen.
[ab-d] matches a, b, c or d (only here the hyphen denotes a character range).
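The four cases above can be checked with a short sketch. Python's `re` module is used here only for illustration (the question is about a different regex flavor), and the helper name `matched_chars` is made up for the example; these particular classes are expected to behave the same way in most flavors.

```python
import re

# Each pair is (character class, characters it is expected to match).
cases = [
    (r"[-]", "-"),
    (r"[abc-]", "abc-"),
    (r"[-abc]", "abc-"),
    (r"[ab-d]", "abcd"),
]

def matched_chars(pattern, candidates="abcd-"):
    """Return the candidate characters that the character class matches."""
    return "".join(ch for ch in candidates if re.fullmatch(pattern, ch))

for pattern, expected in cases:
    print(pattern, matched_chars(pattern) == expected)
```

Only in the last class does the hyphen act as a range operator, so only there is the literal hyphen itself not matched.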
Escape the hyphen.
[a-zA-Z0-9!$* \t\r\n\-]
UPDATE:
Never mind this answer - you can add the hyphen to the group but you don't have to escape it. See Konrad Rudolph's answer instead which does a much better job of answering and explains why.
It’s less confusing to always use an escaped hyphen, so that it doesn't have to be positionally dependent. That’s a \- inside the bracketed character class.
But there’s something else to consider. Some of those enumerated characters should possibly be written differently. In some circumstances, they definitely should.
This comparison of regex flavors says that C♯ can use some of the simpler Unicode properties. If you’re dealing with Unicode, you should probably use the general category \p{L} for all possible letters, and maybe \p{Nd} for decimal numbers. Also, if you want to accommodate all that dash punctuation, not just HYPHEN-MINUS, you should use the \p{Pd} property. You might also want to write that sequence of whitespace characters simply as \s, assuming that’s not too general for you.
All together, that works out to a pattern of [\p{L}\p{Nd}\p{Pd}!$*] to match any one character from that set.
I’d likely use that anyway, even if I didn’t plan on dealing with the full Unicode set, because it’s a good habit to get into, and because these things often grow beyond their original parameters. Now when you lift it to use in other code, it will still work correctly. If you hard‐code all the characters, it won’t.
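To see what the Unicode categories buy you, here is a small sketch of the same membership test. Python's standard `re` module does not support `\p{...}` properties, so the sketch uses `unicodedata.category` instead; the function name `in_extended_class` is an assumption made for the example, and it expects a single character.

```python
import unicodedata

def in_extended_class(ch):
    """Approximate the class of letters, decimal digits, dashes and !$*."""
    category = unicodedata.category(ch)
    # Any category starting with "L" is a letter; Nd is a decimal digit,
    # Pd is dash punctuation (HYPHEN-MINUS, EN DASH, and so on).
    return category.startswith("L") or category in ("Nd", "Pd") or ch in "!$*"

for ch in ["a", "é", "7", "-", "\u2013", "!", " ", "_"]:
    print(repr(ch), in_extended_class(ch))
```

A hard-coded [a-zA-Z0-9-] class would reject "é" and the EN DASH, which is the point the answer makes about code that grows beyond its original parameters.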
[-a-z0-9]+, [a-z0-9-]+ and [a-z-0-9]+ are all equivalent. A hyphen placed between two ranges is treated as a literal character. The regex [a-z0-9-+()]+ also allows a hyphen.
Use "\p{Pd}" without quotes to match any type of hyphen. The '-' character is just one type of hyphen, which also happens to be a special character in regex.
Is this what you are after?
MatchCollection matches = Regex.Matches(mystring, "-");

Sed's regex to eliminate a very specific string

Disclaimer:
I have found several examples in this site that address questions/problems similar to mine, though I was unfortunately not able to figure out the modifications that would need to be introduced to fit my needs.
The "Problem":
I have a list of servers (VMs) that have their UUID embedded as part of the name. I need to get rid of that in order to obtain the "pure/clean" server name. Now, the problem is precisely that: I need to get rid of the UUID (which has a very specific and constant format, more details on this below) and ONLY that, nothing else.
The UUID - as you might already know or have noticed - has a specific and constant format which consists of the following parts:
It starts with a dash (-).
Which is followed by a subset of 8 alphanumeric characters (letters are always lowercase).
Which is followed by a dash (-).
Which is followed by a subset of 4 alphanumeric characters (letters are always lowercase).
Which is followed by a dash (-).
Which is followed by a subset of 4 alphanumeric characters (letters are always lowercase).
Which is followed by a dash (-).
Which is followed by a subset of 4 alphanumeric characters (letters are always lowercase).
Which is followed by a dash (-).
Which is followed by a subset of 12 alphanumeric characters (letters are always lowercase).
Samples of results achieved using "my" "code":
In this case the result is the expected one:
echo PRODSERVER0022-872151c8-1a75-43fb-9b63-e77652931d3f | sed 's/-[a-z0-9]*//g'
PRODSERVER0022
In this case the result is the expected one too:
echo PRODSERVER0022-872151c8-1a75-43fb-9b63-e77652931d3f_OLD | sed 's/-[a-z0-9]*//g'
PRODSERVER0022_OLD
Expected result: PRODSERVER0022-OLD
echo PRODSERVER0022-872151c8-1a75-43fb-9b63-e77652931d3f-OLD | sed 's/-[a-z0-9]*//g'
PRODSERVER0022
Expected result: PRODSERVER00-22
echo PRODSERVER00-22-872151c8-1a75-43fb-9b63-e77652931d3f-old | sed 's/-[a-z0-9]*//g'
PRODSERVER00
I know that, within the sed universe, a . means "any character", while a * means "any number of the preceding character". However, what I would need in this case, as I see it at least, is a way to tell sed to do the replacement only if this specific sequence is present (8 alphanumeric characters [any, but specifically 8, not more, not less]; followed by a dash, then followed by 4 alphanumeric characters [any, but specifically 4, not more, not less], etc..). So, the question would be: Is there a regex construction (or a combination [through piping I guess] of several of them, if it has to be the case) that can achieve the expected results in this case?
Note that even though servers may have additional dashes (-) as part of their names, the resulting substrings will never consist of exactly 8 or exactly 4 characters. They might, however, end up having 12 characters. Such a substring would initially match the last part of the UUID, but it would not be at the end of the string, so there is a way to tell the two 12-character substrings apart (and it will not be a problem at all if there is a regex that can get rid of the UUID as a whole).
Try this to match the UUID.
-[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}
Embed it in the sed command line in the usual way. As Benjamin W. has said, we need to use extended regular expressions.
sed -E 's/-[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}//g'
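For comparison, the same pattern can be tried outside sed. This is a minimal sketch in Python using the identical regex; the function name `strip_uuid` is made up for the example, and the inputs are the samples from the question.

```python
import re

# Same pattern as in the sed command: a dash followed by 8-4-4-4-12 hex digits.
UUID_PART = re.compile(
    r"-[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}"
)

def strip_uuid(name):
    """Remove every embedded UUID (with its leading dash) from name."""
    return UUID_PART.sub("", name)

samples = [
    "PRODSERVER0022-872151c8-1a75-43fb-9b63-e77652931d3f",
    "PRODSERVER0022-872151c8-1a75-43fb-9b63-e77652931d3f_OLD",
    "PRODSERVER0022-872151c8-1a75-43fb-9b63-e77652931d3f-OLD",
]
for sample in samples:
    print(strip_uuid(sample))
```

Because the counts are exact, a trailing -OLD or a dash inside the server name is left alone, which is what the unbounded [a-z0-9]* in the original attempt could not guarantee.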

When are double quotes required to create a KDB/q symbol?

Normally, for simple character strings, a leading backtick does the trick.
Example: `abc
However, if the string has some special characters, such as space, this will not work.
Example: `$"abc def"
Example: `$"BAT-3Kn.BK"
What are the rules when $"" is required?
Simple syntax for symbols can be used when the symbol consists only of alphanumeric characters, dots (.), colons (:), and (non-leading) underscores (_). In addition, slashes (/) are allowed when there is a colon before them. Everything else requires the `$"" syntax.
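The rule as stated can be written down as a small checker. This is only a sketch of the rule in this answer, not something verified against a q interpreter: it assumes "alphanumeric" means ASCII letters and digits, and the function name `needs_cast` is made up for the example.

```python
import string

# Characters the stated rule allows in the simple backtick syntax.
ALLOWED = set(string.ascii_letters + string.digits + ".:_/")

def needs_cast(text):
    """Return True if text would need the cast-from-string form."""
    seen_colon = False
    for index, ch in enumerate(text):
        if ch not in ALLOWED:
            return True      # e.g. space or hyphen
        if ch == "_" and index == 0:
            return True      # leading underscore is not allowed
        if ch == "/" and not seen_colon:
            return True      # slash only after a colon
        if ch == ":":
            seen_colon = True
    return False

for text in ["abc", "abc def", "BAT-3Kn.BK", "a_b", "_abc", ":path/to", "a/b"]:
    print(repr(text), needs_cast(text))
```

Under this reading, the two examples from the question need the cast because of the space and the hyphen respectively.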
The book 'Q for mortals', which is available online, has a section discussing datatypes. For symbols it states:
A symbol can include arbitrary text, including text that cannot be
directly entered from the console – e.g., embedded blanks and special
characters such as back-tick. You can manufacture a symbol from any
text by casting the corresponding list of char to a symbol. (You will
need to escape special characters into the string.) See §6.1.5 for
more on casting.
q)`$"A symbol with blanks and `"
`A symbol with blanks and `
The essential takeaway here is that converting a string to a symbol is required when special characters are involved. In the examples you have given both space " " and hyphen "-" are characters that cannot be directly placed into a symbol type.

What constitutes a Regular Identifier

On page 34 of 70-461 Querying Microsoft SQL Server 2012 it says that an identifier is regular if:
The rules say that the first character must be a letter in the range
A through Z (lower or uppercase), underscore (_), at sign (@), or
number sign (#). Subsequent characters can include letters, decimal
numbers, at sign, dollar sign ($), number sign, or underscore.
However on pg 271 it says:
Even though you can embed special characters such as @, #, and $ in
an identifier for a schema, table, or column name, that action makes
the identifier delimited, no longer regular.
So, to clarify: does having a special character like '$' in an identifier leave it regular or not?
Having $ after the first character is part of the specification that defines a regular identifier and will not require the use of a delimiter.
I found the definition in SQL Server 2008 R2 Identifiers to be clearer than the one from page 34. It is essentially the same as the one on page 271, but with more detail.
Either you have misquoted pg 271 of the book, or your version is different than mine and has an error:
If you embed special characters other than @, #, and $ in an
identifier for a schema, table, or column name, that action makes the
identifier delimited, no longer regular.
Here is a regular expression that will match a string that complies with the definition:
^[\p{Letter}_@#][\p{Letter}\p{Number}_@#$]*$
Regex for flavors without unicode support:
^[a-zA-Z_@#][a-zA-Z\d_@#$]*$
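The ASCII-only variant is easy to try out. The sketch below assumes the first character may be a letter, underscore, at sign (@) or number sign (#), and that later characters may also be digits or a dollar sign ($). The function name `is_regular_identifier` is made up for the example, and the check ignores other conditions such as reserved keywords.

```python
import re

# ASCII-only approximation of the regular-identifier rule.
REGULAR_ID = re.compile(r"[a-zA-Z_@#][a-zA-Z0-9_@#$]*")

def is_regular_identifier(name):
    """Return True if name matches the ASCII regular-identifier pattern."""
    return REGULAR_ID.fullmatch(name) is not None

for name in ["Order$Total", "#temp", "@var", "$Total", "1abc", "my table"]:
    print(repr(name), is_regular_identifier(name))
```

A $ after the first character is accepted, while a leading $, a leading digit, or an embedded space is rejected and would need a delimited identifier.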

valid characters for lisp symbols

First of all, as I understand it, variable identifiers are called symbols in Common Lisp.
I noted that while in languages like C variable identifiers can only be alphanumerics and underscores, Common Lisp allows many more characters to be used, like "*" and (at least Scheme does) "?".
So, what I want to know is: what exactly is the full set of characters that Common Lisp allows in a symbol (or variable identifier, if I'm wrong)? Is that the same for Scheme?
Also, is the set of characters different for function names?
I've been googling, looking in the CLHS, and in Practical Common Lisp, and for the life of me, something must be wrong because I can't seem to find the answer.
A detailed answer is a bit tricky. There is the ANSI standard for Common Lisp. It defines the set of available characters. Basically you can use all those defined characters for symbols. See also Symbols as Tokens.
For example
|Polynom 2 * x ** 3 - 5 * x ** 2 + 10|
is a valid symbol. Note that the vertical bars mark the symbol and do not belong to the symbol name.
Then there are the existing implementations of Common Lisp and their support of various character sets and string types. So several support Unicode (or similar) and allow Unicode characters in symbol names.
LispWorks:
CL-USER 1 > (list 'δ 'ψ 'σ '\|)
(δ ψ σ \|)
[From a Schemer's perspective. Even though some concepts in Scheme and Common Lisp have the same name, it does not mean that they mean the same thing in the two languages.]
First note that symbols and identifiers are two different things.
Symbols can be thought of as strings which support fast equality comparison.
Two symbols s and t are equal (more or less) if they are spelled the same way. The operation string=? needs to loop over the characters in the strings and see if they are all alike. This takes time proportional to the length of the shortest string. Symbols, on the other hand, are automatically (by the runtime system) put into a (typically) hash table. Therefore symbol=? boils down to a simple pointer comparison and is thus very fast. Symbols are often used in cases where one in C would use enumerations.
Symbols are values that can be present at runtime.
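The interning idea can be sketched in a few lines. This is an illustration in Python rather than Scheme, and the names `Symbol`, `_symbol_table` and `intern_symbol` are made up for the example; it is not how any particular Scheme runtime is implemented.

```python
class Symbol:
    """A value identified by its spelling; created only via intern_symbol."""
    def __init__(self, name):
        self.name = name

# Hash table mapping each spelling to its single Symbol object.
_symbol_table = {}

def intern_symbol(name):
    """Return the unique Symbol for name, creating it on first use."""
    symbol = _symbol_table.get(name)
    if symbol is None:
        symbol = Symbol(name)
        _symbol_table[name] = symbol
    return symbol

a = intern_symbol("lambda")
b = intern_symbol("lambda")
c = intern_symbol("soup")
print(a is b)  # identity comparison, no character-by-character loop
print(a is c)
```

The cost of hashing the spelling is paid once when the symbol is interned; every later equality test is an identity check on the object.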
Identifiers are simply names of variables in a program.
Now if said program is to be represented as a Scheme value, one choice would be to use symbols to represent identifiers - but that does not mean symbols are identifiers (or vice versa). A better representation of identifiers (still in Scheme) is syntax objects, which besides the name of the identifier also record where the identifier was read (or constructed). Say you encounter an undefined variable and want to signal where in the program the undefined variable is; then it is very convenient that the source location is part of the representation of the identifier.
Last but not least: what are the legal characters of an identifier? Here it is best to quote chapter and verse from R6RS:
4.2.4 Identifiers
Most identifiers allowed by other programming languages are also
acceptable to Scheme. In general, a sequence of letters, digits, and
“extended alphabetic characters” is an identifier when it begins with
a character that cannot begin a representation of a number object. In
addition, +, -, and ... are identifiers, as is a sequence of letters,
digits, and extended alphabetic characters that begins with the
two-character sequence ->. Here are some examples of identifiers:
lambda q soup
list->vector + V17a
<= a34kTMNs ->-
the-word-recursion-has-many-meanings
Extended alphabetic characters may be used within identifiers as if
they were letters. The following are extended alphabetic characters:
! $ % & * + - . / : < = > ? @ ^ _ ~
Moreover, all characters whose Unicode scalar values are greater than
127 and whose Unicode category is Lu, Ll, Lt, Lm, Lo, Mn, Mc, Me, Nd,
Nl, No, Pd, Pc, Po, Sc, Sm, Sk, So, or Co can be used within
identifiers. In addition, any character can be used within an
identifier when specified via an <inline hex escape>. For
example, the identifier H\x65;llo is the same as the identifier
Hello, and the identifier \x3BB; is the same as the identifier
λ.
Any identifier may be used as a variable or as a syntactic keyword
(see sections 5.2 and 9.2) in a Scheme program. Any identifier may
also be used as a syntactic datum, in which case it represents a
symbol (see section 11.10).
From: http://www.r6rs.org/final/html/r6rs/r6rs-Z-H-7.html#node_sec_4.2.4
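The quoted rule can be approximated for ASCII input with a regular expression. The sketch below is my reading of the rule, not the report's formal grammar: it takes the extended alphabetic characters to be ! $ % & * + - . / : < = > ? @ ^ _ ~ and assumes that `+`, `-`, `.` and `@` are allowed only after the first character, apart from the special cases `+`, `-`, `...` and names starting with `->`. Unicode characters and inline hex escapes are left out, and the name `is_identifier` is made up for the example.

```python
import re

# Assumed character sets (ASCII only).
INITIAL = r"A-Za-z!$%&*/:<=>?^_~"
SUBSEQUENT = INITIAL + r"0-9+\-.@"

IDENTIFIER = re.compile(
    rf"(?:[{INITIAL}][{SUBSEQUENT}]*"   # ordinary identifier
    rf"|\+|-|\.\.\."                    # the special identifiers +, - and ...
    rf"|->[{SUBSEQUENT}]*)"             # identifiers beginning with ->
)

def is_identifier(text):
    """Return True if text fits the approximate ASCII identifier pattern."""
    return IDENTIFIER.fullmatch(text) is not None

examples = ["lambda", "q", "soup", "list->vector", "+", "V17a", "<=",
            "a34kTMNs", "->-", "the-word-recursion-has-many-meanings"]
for text in examples + ["1abc", "+5", "-x"]:
    print(text, is_identifier(text))
```

All ten examples from the quoted section are accepted, while names that begin like a number, such as 1abc or +5, are rejected.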
See Chapter 2 of the CLHS, which describes the reader algorithm in detail. But the simple answer is that if a token isn't a readmacro invocation (section 2.4), and isn't a number or all dots, it defaults to being interpreted as a symbol.