I desperately tried to find out what the symbol '\nquit' is, and I couldn't find any reference on the web.
What I tried to find is a complete list of all of those characters (\n, \p, \0, ...) but I couldn't find any.
cheers usche
Wikipedia has a list of C language escapes here.
As noted in my comment, I believe this represents the newline (linefeed) character \n followed by the word quit (which would be forced by the newline to the beginning of the next line of output). But in that case the string should be "double"-quoted rather than 'single'-quoted.
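A minimal sketch of that reading, written in Python purely for illustration (the language in the question is not stated). Python processes escapes in both quote styles, so a raw string stands in here for the 'single'-quoted behaviour of languages such as Perl or the shell; the variable names are made up for the example.

```python
# Escape processed: one newline character followed by the word "quit".
interpreted = "\nquit"

# Raw string: the backslash and the "n" stay as two separate characters,
# which is what single quotes give you in languages like Perl.
literal = r"\nquit"

print(len(interpreted), repr(interpreted[0]))
print(len(literal), repr(literal[0]))
```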
I found some code with a regex that is claimed to strip the text of any non-ASCII characters.
The code is written in Perl, and the part that does it is:
$sentence =~ tr/\000-\011\013-\014\016-\037\041-\055\173-\377//d;
I want to understand how this regex works, and in order to do this I have used regexr. I found out that \000, \011, \013, \014, \016, \037, \041, \055, \173, \377 denote individual characters such as NUL, TAB, VERTICAL TAB ... But I still do not get why the "-" symbols are used in the regex. Do they really mean "dash symbol" as shown in regexr, or something else? Is this regex really suited for deleting non-ASCII characters?
This isn't really a regex. The dash indicates a character range, like inside a regex character class [a-z].
The expression deletes some ASCII characters too (control characters such as tab, plus the punctuation in \041-\055 and \173-\177), and it spares every character above \377, none of which is ASCII; the full ASCII range would simply be \000-\177.
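To be explicit, the d flag says to delete every matched character that has no counterpart in the replacement list. The replacement list (between the second pair of slashes) is empty here, so every character in the search list is deleted. See further the documentation.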
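To see exactly which characters survive, the ranges can be expanded by hand. The following is only a rough Python analogue of the tr/// statement, not Perl itself; the names RANGES, DELETED and strip_like_tr are made up for this sketch.

```python
# Octal ranges copied from the search list of the tr/// statement.
RANGES = [(0o000, 0o011), (0o013, 0o014), (0o016, 0o037),
          (0o041, 0o055), (0o173, 0o377)]

# Every code point that the statement deletes.
DELETED = {cp for lo, hi in RANGES for cp in range(lo, hi + 1)}

# str.translate removes code points that map to None.
DELETE_TABLE = {cp: None for cp in DELETED}


def strip_like_tr(text: str) -> str:
    """Delete the same code points as the Perl tr///d statement."""
    return text.translate(DELETE_TABLE)


# ASCII code points (0 to 0o177) that are NOT deleted.
kept_ascii = [cp for cp in range(0o200) if cp not in DELETED]

print([chr(cp) for cp in kept_ascii])
print(strip_like_tr("Hello, World!\t{ok}"))
```

The survivors are line feed, carriage return, space, and the run from "." to "z", so commas, exclamation marks, tabs and braces all disappear, while a code point above \377 passes through untouched.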
I found some "funny" characters (e.g. ḓ̵̙͎̖̯̞̜̞̪̠ and •̩̩̩̩̩̩̩̩̩̩) on social media that take up more than one line. At first I thought it was a bug in Firefox, but I tried them in Gedit and LibreOffice Writer and they look the same. So what are they, actually? I am asking about the character encoding and rendering.
I tried to find the characters in GNOME Character Map, but they could not be found.
I tried to check the character codes of both of them in Unicode (probably UTF-8). It seems each of them consists of more than one character. How can one character be more than one character? This is the result from Python.
Character •̩̩̩̩̩̩̩̩̩̩
u'\u2022\u0329\u0329\u0329\u0329\u0329\u0329\u0329\u0329\u0329\u0329\u0329\u0329
\u0329\u0329\u0329\u0329\u0329\u0329\u0329\u0329\u0329\u0329'
Character ḓ̵̙͎̖̯̞̜̞̪̠
u'\u1e13\u0335\u0319\u034e\u0316\u032f\u031e\u031c\u031e\u032a\u0320\u033c\u031e
\u0320\u034e\u033c\u0353\u034b\u036e\u034c\u0346\u0300\u035c\u0345'
U+0329 is COMBINING VERTICAL LINE BELOW. It is a combining character (and so are all the others in there except U+2022 and U+1E13), meaning that it combines with the previous one. What you see here is merely the result of someone stacking way too many combining characters on the same base.
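One way to check this yourself is Python's standard unicodedata module, which reports each code point's name and its combining class (0 for a base character, non-zero for a combining mark). This is a small sketch in Python 3; the sample string is shortened, and the helper names describe and strip_marks are made up here.

```python
import unicodedata

# Shortened sample: a bullet followed by three U+0329 combining marks.
sample = "\u2022" + "\u0329" * 3


def describe(text):
    """Return (code point, Unicode name, combining class) for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.combining(ch))
            for ch in text]


def strip_marks(text):
    """Drop every character whose combining class is non-zero."""
    return "".join(ch for ch in text if unicodedata.combining(ch) == 0)


for row in describe(sample):
    print(row)
print(repr(strip_marks(sample)))
```

Removing everything with a non-zero combining class leaves only the base character, which is why the rendered glyph looks like one character that spills over several lines.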
On our site (which is aimed at highly non-technical people), we let them use Markdown when sending emails. That way, they get nice things like bold, italic, etc. Being non-technical, however, they would never get past the “add two spaces to make newlines actually work” quirk.
For that reason mainly, we are using a variant of GitHub Flavored Markdown.
We mainly borrowed this part:
# in very clear cases, let newlines become <br /> tags
text.gsub!(/^[\w\<][^\n]*\n+/) do |x|
  x =~ /\n{2}/ ? x : (x.strip!; x << "  \n")
end
This works well, but in some cases it doesn’t add the new-lines, and I guess the key to that is the “in very clear cases” part of that comment.
If I interpret it correctly, this is only adding newlines for lines that start with either a word character or a ‘<’.
Does anyone know why that is? Particularly, why ‘<’?
What would be the harm in just adding the two spaces to essentially anything (lines starting with spaces, hyphens, anything)?
'<' character is used at the beginning of a line to quote messages. I guess that is the reason.
The other answer to this question is quite wrong. This has nothing to do with quoting, and the character for markdown quoting is >.
^[\w\<][^\n]*\n+
Let's break the above regex into parts:
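^ anchors the match to the start of a line (in Ruby, ^ matches at every line start, not only at the start of the string).
[\w\<] matches a word character or a literal <. Inside a character class, \< is just an escaped <, not the GNU start-of-word anchor, so the pattern picks up lines that begin with text or, presumably, with an HTML tag.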
[^\n]* matches any length of non-newline characters
\n+ matches one or more newlines. The + is the ordinary greedy "one or more" quantifier, not a possessive one (that would be written ++).
I believe, but am not 100% sure, that this simply is used to set x to a line of text together with its trailing newlines. Then, the heavy work is done with the next line:
x =~ /\n{2}/ ? x : (x.strip!; x << "  \n")
This says: "if x matches the regex \n{2} (that is, contains two consecutive line breaks), leave x as is. Otherwise, strip x and append two spaces and a newline." Two trailing spaces are how Markdown marks a hard line break.
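To watch which lines the pattern picks up, here is a rough Python port of the Ruby snippet. It is a sketch only: Python's re module stands in for Ruby's regex engine, re.MULTILINE is needed to get Ruby's line-based ^, and the function name add_hard_breaks is made up.

```python
import re

# Same pattern as the Ruby code: a line starting with a word character or "<",
# the rest of that line, and all newlines that follow it.
LINE = re.compile(r"^[\w<][^\n]*\n+", re.MULTILINE)


def add_hard_breaks(text: str) -> str:
    def fix(match):
        x = match.group(0)
        # A blank line already ends the paragraph, so leave it alone;
        # otherwise end the line with two spaces (Markdown hard break).
        return x if "\n\n" in x else x.strip() + "  \n"

    return LINE.sub(fix, text)


print(repr(add_hard_breaks("line one\nline two\n\npara\n")))
print(repr(add_hard_breaks("- item\nnext\n")))
```

The second call shows the case from the question: a line starting with a hyphen does not match the character class, so it gets no trailing spaces, while the plain line after it does.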
My assignment is to open and read a file and remove all commas, periods, spaces, and exclamation points from it. Furthermore, I must display the number of occurrences of each word by storing the words in a hash, where the words are the keys and the numbers of occurrences are the values. For example, in a document that says, " Perl Program, Perl Program." Perl and Program are the keys, whereas the values are the n
Words-----Count
Perl------2
Program---2
The instructor already posted the directions, but in them he mentions, "split the line into tokens and store the array". I think I could do this if I knew what tokens were, so could someone explain what tokens are please?
According to Wikipedia
A token is a string of characters, categorized according to the rules
as a symbol (e.g., IDENTIFIER, NUMBER, COMMA).
There is no special meaning of token in Perl.
In this context, a token is most likely a word or symbol delimited by special characters, which here would be all the characters you are supposed to ignore.
That means in your example the tokens you'd have would be (in order)
Perl
Program
Perl
Program
But in another example that wasn't spaced out like
"Perl!ProgramHello,Name.GoodBye>ASFDKLDJ"
The tokens would be
Perl
ProgramHello (even though this is two English words)
Name
GoodBye
ASFDKLDJ
You should clarify with your Professor as to what you have to split the tokens on.
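The assignment itself is in Perl, but the idea is language-independent, so here is a small Python sketch of "split the line into tokens, then count them". It assumes that any run of non-alphanumeric characters acts as a delimiter (which also covers the > in the example above); the function names tokenize and count_words are made up.

```python
import re
from collections import Counter


def tokenize(line):
    """Split a line on runs of non-alphanumeric characters, dropping empties."""
    return [tok for tok in re.split(r"[^A-Za-z0-9]+", line) if tok]


def count_words(lines):
    """Map each token to the number of times it occurs across all lines."""
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return counts


print(tokenize("Perl!ProgramHello,Name.GoodBye>ASFDKLDJ"))
for word, n in count_words([" Perl Program, Perl Program."]).items():
    print(f"{word}------{n}")
```

The Counter plays the role of the Perl hash here: the tokens are the keys and the occurrence counts are the values.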
Starting with some text file with space as a standard word delimiter, the instructions do not say that while removing space and punctuation some other delimiter cannot be substituted.
I am having a problem parsing XML files that contain special characters such as single and double quotes (', "). I am using NSXMLParser's parser:foundCharacters: method to collect characters in my code.
<synctext type = "word" >They raced to the park Arthur pointed to a sign "Whats that say" he asked Zoo said DW Easy as pie</synctext>
When I parse and save the text from the above tag of my XML file, the resulting string appears in GDB as
"\n\t\tThey raced to the park Arthur pointed to a sign \"Whats that say\" he asked Zoo said DW Easy as pie";
Observe there are 2 issues:
1) Unwanted characters at the beginning of the string.
2) The double quotes around Whats that say.
Can anyone please help me get rid of these unwanted characters and read special characters properly?
NSString *trimmed = [string stringByTrimmingCharactersInSet:[NSCharacterSet characterSetWithCharactersInString:@" \n\t"]];
The parser is apparently returning exactly what's in the string. That is, the XML was coded with the starting tag on one line, a newline, two tabs, and the start of the string. And quotes in the string are obviously there in the original (and it's not clear in at least this example why you'd want to delete them).
But if you want these characters gone then you need to post-process the string. You can use Rams' statement to eliminate the newline and tabs, and stringByReplacingOccurrencesOfString:withString: to zap the quotes.
(Note that some XML parsers can be instructed to return strings like this with the leading/trailing stuff stripped, but I'm not sure about this one. The quotes will always be there, though.)
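The same two post-processing steps can be sketched in Python, using the standard xml.etree.ElementTree module as a stand-in for NSXMLParser. This only illustrates the trim-then-replace idea, not the Cocoa API; the sample element assumes, as described above, a newline and two tabs after the opening tag.

```python
import xml.etree.ElementTree as ET

# Sample element laid out as the answer describes: opening tag, newline,
# two tabs, then the text (including its double quotes).
SAMPLE = (
    '<synctext type = "word" >\n\t\tThey raced to the park Arthur pointed to a sign '
    '"Whats that say" he asked Zoo said DW Easy as pie</synctext>'
)

raw = ET.fromstring(SAMPLE).text        # parser returns the text unchanged
trimmed = raw.strip(" \n\t")            # step 1: drop leading/trailing whitespace
unquoted = trimmed.replace('"', "")     # step 2: remove the quotes, if unwanted

print(repr(raw))
print(repr(trimmed))
print(repr(unquoted))
```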