What is a token in Perl?

My assignment is to open and read a file and remove all commas, periods, spaces, and exclamation points from it. Furthermore, I must display the number of occurrences of each word by storing the words as the keys of a hash and the occurrence counts as the values. For example, in a document that says, " Perl Program, Perl Program." Perl and Program are the keys, whereas the values are the n
Words-----Count
Perl------2
Program---2
The instructor already posted the directions, but in them he mentions, "split the line into tokens and store the array". I think I could do this if I knew what tokens were, so could someone explain what tokens are please?

According to Wikipedia:
A token is a string of characters, categorized according to the rules
as a symbol (e.g., IDENTIFIER, NUMBER, COMMA).
There is no special meaning of token in Perl.

In this context, a token is most likely a word or symbol delimited by special characters, which here would be all the characters you are supposed to ignore.
That means in your example the tokens you'd have would be (in order)
Perl
Program
Perl
Program
But in another example that isn't separated by spaces, such as
"Perl!ProgramHello,Name.GoodBye>ASFDKLDJ"
The tokens would be
Perl
ProgramHello (even though this is two English words)
Name
GoodBye
ASFDKLDJ
You should clarify with your professor which characters you are supposed to split the tokens on.
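As a rough illustration of the idea above, here is a minimal sketch in Python rather than Perl (in Perl, the same shape would typically use `split` and a hash). The function names `tokenize` and `count_words` and the default delimiter set are assumptions made for this example, not part of the assignment.

```python
import re
from collections import Counter


def tokenize(text, delimiters=" ,.!"):
    """Split text on any run of the given delimiter characters."""
    # Build a character class such as [\ ,\.!]+ from the delimiter string.
    pattern = "[" + re.escape(delimiters) + "]+"
    # re.split leaves empty strings at the edges when the text starts or
    # ends with a delimiter, so filter those out.
    return [tok for tok in re.split(pattern, text) if tok]


def count_words(text, delimiters=" ,.!"):
    """Map each token (key) to its number of occurrences (value)."""
    return Counter(tokenize(text, delimiters))


if __name__ == "__main__":
    counts = count_words(" Perl Program, Perl Program.")
    print("Words-----Count")
    for word, n in counts.items():
        print(f"{word}------{n}")
```

Note that with only space, comma, period, and exclamation point as delimiters, a string like "GoodBye>ASFDKLDJ" stays a single token. Whether ">" should also split tokens is exactly the kind of detail to confirm with the instructor.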

Assuming the text file uses spaces as the standard word delimiter, note that the instructions do not forbid substituting some other delimiter while you remove the spaces and punctuation.

Why does fasttext yield <\s> as first entry in VSM?

I am using a large German corpus, which I have cleaned of all special characters, numbers, and punctuation marks.
Each line contains one sentence.
Running
fastText/./fasttext skipgram -input input.txt -output output.txt
-minCount 2 -minn 2 -maxn 8 -dim 300 -ws 5
returns a VSM with <\s> as first entry.
As I understand it, there are white spaces left in the document that are interpreted as a token.
Is that correct?
And how can I get rid of them and/or the <\s> in the VSM?
Thank you.
By convention, the fasttext tool converts any newline in the input file to the pseudoword token '</s>' (written <\s> in the question), which represents end-of-string ('EOS').
See the discussion in the Python binding Markdown docs:
https://github.com/facebookresearch/fastText/blob/main/python/README.md#important-preprocessing-data--encoding-conventions
The newline character is used to delimit lines of text. In particular,
the EOS token is appended to a line of text if a newline character is
encountered. The only exception is if the number of tokens exceeds the
MAX_LINE_SIZE constant as defined in the Dictionary header. This means
if you have text that is not separate by newlines, such as the fil9
dataset, it will be broken into chunks with MAX_LINE_SIZE of tokens
and the EOS token is not appended.
The length of a token is the number of UTF-8 characters by considering
the leading two bits of a byte to identify subsequent bytes of a
multi-byte sequence. Knowing this is especially important when
choosing the minimum and maximum length of subwords. Further, the EOS
token (as specified in the Dictionary header) is considered a
character and will not be broken into subwords.
(Though only mentioned in that doc about the Python bindings, it's definitely defined/implemented in the core C++ code, especially the dictionary.cc file.)
To eliminate that word-token, you'd have to strip all newlines from your input file.
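A minimal sketch of that preprocessing step, in Python, might look like the following. The helper name `join_lines` is hypothetical, and collapsing every whitespace run to a single space is an assumption of this sketch, not something the fastText documentation prescribes.

```python
def join_lines(text):
    """Return the text as one line, with no newline characters left."""
    # str.split() with no arguments splits on any whitespace run
    # (spaces, tabs, \n, \r\n), so joining with " " removes every newline.
    return " ".join(text.split())


if __name__ == "__main__":
    sample = "Das ist ein Satz\nDas ist noch ein Satz\n"
    print(join_lines(sample))
```

Keep in mind what the quoted documentation says about this case: without newlines, the input is broken into chunks of MAX_LINE_SIZE tokens, so the one-sentence-per-line boundaries are lost as training context.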

Crystal Reports: attempting to link two tables by matching strings with no luck

As stated in the title, I have two tables I'm attempting to link. Both strings appear to match, but Crystal Reports is not picking it up. The only thing I can think of is that the length of the field is different, even though the strings are the same. Could that cause a discrepancy? If so, how can I correct for it? Thank you.
A difference in string length will prevent a match. If you are using the Trim(string) function, it only removes spaces found at the beginning or end of your string, so the two strings could still have different lengths after using it. You will need another function to capture a substring of the original string. To do this, you can use the Left(string, length) function to ensure both strings are the same length.
If they still do not match, then you may have non-printable characters in one or both of your strings. Carriage Return and Line Feed tend to be the most commonly found non-printable characters. A Carriage Return is represented as Chr(13), while a Line Feed is represented as Chr(10). These are built-in constants similar to those found in VBA and Visual Basic.
You can use a find and replace to remove them with the following formula. It's not a bad idea to also include the Trim and Left functions to ensure you get the best match possible.
Replace(Replace(Left(Trim({YourStringField}), 10),Chr(10), ""),Chr(13), "")
There are a few additional built-in constants you may need to check for if this doesn't work. A Tab is represented as Chr(9), for example. It's very rare for strings to contain the other constants, though. In most cases Carriage Return and Line Feed are the only ones found in plain text. Tabs and the other constants should only be found in rich text and are very rare in string data.
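To make the order of operations in the formula easier to follow, here is a rough Python analogue. It is only an illustration of the logic, not Crystal Reports syntax, and the function name `normalize_key` and the default length of 10 are assumptions carried over from the example formula.

```python
def normalize_key(value, length=10):
    """Mimic Replace(Replace(Left(Trim(x), 10), Chr(10), ""), Chr(13), "")."""
    trimmed = value.strip(" ")      # Trim: leading/trailing spaces only
    cut = trimmed[:length]          # Left(string, length)
    no_lf = cut.replace("\n", "")   # "\n" is chr(10), the Line Feed
    return no_lf.replace("\r", "")  # "\r" is chr(13), the Carriage Return


if __name__ == "__main__":
    a = "  ABC123\r\n"
    b = "ABC123"
    print(a == b)                                # the raw values differ
    print(normalize_key(a) == normalize_key(b))  # the cleaned values match
```

One design point worth noticing: because the truncation happens before the replacements, any CR or LF inside the first 10 characters shortens the final result. Stripping the control characters first and truncating last would be an equally reasonable variation.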

IRC (RFC 1459) message prefix

This question seems fairly pedantic, however it feels reasonably important when trying to follow the RFC. I am trying to write an IRC client and I am using the RFC to follow how the protocol should be written. I came across the section for message prefixes and was slightly confused by what was written.
Each IRC message may consist of up to three main parts: the prefix
(optional), the command, and the command parameters (of which there
may be up to 15). The prefix, command, and all parameters are
separated by one (or more) ASCII space character(s) (0x20).
The presence of a prefix is indicated with a single leading ASCII
colon character (':', 0x3b), which must be the first character of the
message itself. There must be no gap (whitespace) between the colon
and the prefix.
My question concerns the first sentence in the second paragraph: ASCII colon character (':', 0x3b). Since (to my understanding) 0x3b is the ASCII code for a semicolon, does this mean that the prefix may start with either a semicolon or a colon, or is it simply a typo in the document? I'm going ahead with using a colon for now, but my curiosity is nagging away at me.
The colon : (0x3a) is correct.
This is the first erratum listed for RFC 1459.
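For a client author, the practical upshot can be sketched as below. This is a minimal, non-authoritative parser in Python. The function name `parse_irc_message` is made up for this example, and the handling of a trailing parameter introduced by " :" comes from the RFC's message grammar rather than from the passage quoted above.

```python
def parse_irc_message(line):
    """Split a raw IRC line into (prefix, command, params)."""
    line = line.rstrip("\r\n")
    prefix = None
    # A prefix is present only if the very first character is a colon (0x3a).
    if line.startswith(":"):
        prefix, _, line = line[1:].partition(" ")
    line = line.lstrip(" ")
    # A parameter starting with ':' is the trailing one and may contain spaces.
    if " :" in line:
        head, _, trailing = line.partition(" :")
        parts = head.split()
        parts.append(trailing)
    else:
        parts = line.split()
    if not parts:
        raise ValueError("message has no command")
    return prefix, parts[0], parts[1:]


if __name__ == "__main__":
    print(hex(ord(":")), hex(ord(";")))  # colon vs. semicolon code points
    print(parse_irc_message(":nick!user@host PRIVMSG #chan :hello world\r\n"))
```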

How to parse special characters in XML for iPad?

I am having a problem parsing XML files that contain special characters such as single and double quotes (', ") etc. I am using NSXMLParser's parser:foundCharacters: method to collect characters in my code.
<synctext type = "word" >They raced to the park Arthur pointed to a sign "Whats that say" he asked Zoo said DW Easy as pie</synctext>
When I parse and save the text from the above tag of my XML file, the resulting string appears in GDB as
"\n\t\tThey raced to the park Arthur pointed to a sign \"Whats that say\" he asked Zoo said DW Easy as pie";
Observe that there are 2 issues:
1) Unwanted characters at the beginning of the string.
2) The double quotes around Whats that say.
Can anyone please help me get rid of these unwanted characters and read special characters properly?
NSString *trimmed = [string stringByTrimmingCharactersInSet:[NSCharacterSet characterSetWithCharactersInString:@" \n\t"]];
The parser is apparently returning exactly what's in the string. That is, the XML was coded with the starting tag on one line, a newline, two tabs, and the start of the string. And quotes in the string are obviously there in the original (and it's not clear in at least this example why you'd want to delete them).
But if you want these characters gone, then you need to post-process the string. You can use Rams' statement to eliminate the newline and tabs, and stringByReplacingOccurrencesOfString:withString: to zap the quotes.
(Note that some XML parsers can be instructed to return strings like this with the leading/trailing stuff stripped, but I'm not sure about this one. The quotes will always be there, though.)
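The same behavior can be reproduced outside of Cocoa. The sketch below uses Python's standard `xml.etree.ElementTree` instead of NSXMLParser, purely as an analogy, to show that the parser hands back the text exactly as stored and that the cleanup is a separate post-processing step. The shortened sample text and the helper name `clean_text` are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Start tag, then a newline and two tabs before the text, as in the question.
XML_DOC = '<synctext type="word">\n\t\tA sign "Whats that say" he asked</synctext>'


def clean_text(raw):
    """Trim leading/trailing spaces, newlines, and tabs, then drop double quotes."""
    return raw.strip(" \n\t").replace('"', "")


if __name__ == "__main__":
    raw = ET.fromstring(XML_DOC).text
    print(repr(raw))              # whitespace and quotes are preserved verbatim
    print(repr(clean_text(raw)))  # post-processed version
```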

How to use '^@' in Vim scripts?

I'm trying to work around a problem with using ^@ (i.e., <ctrl-@>) characters in Vim scripts. I can insert them into a script, but when the script runs, the line seems to be truncated at the point where a ^@ was located.
My kludgy solution so far is to store a ^@ in a variable, then reference the variable in the script wherever I would have quoted a literal ^@. Can someone tell me what's going on here? Is there a better way around this problem?
That is one reason why I never use raw special character values in scripts. While a literal ^@ does not work, the key notation <C-@> works as expected in mappings, so you may use one of
nnoremap <C-@> {rhs}
nnoremap <Nul> {rhs}
It is strange, but you cannot use <Char-0x0> here. Some notes about the null byte in strings:
Inserting a null byte into a string truncates it: Vim uses old C-style strings that end with a null byte, so a null byte cannot appear inside a string. These strings are very inefficient, so if you want to generate a very large text, try accumulating it into a list of lines (using setline is very fast because a buffer is represented as a list of lines).
Most functions that return list of strings (like readfile, getline(start, end)) or take list of strings (like writefile, setline, append) treat \n (NL) as Null. It is also the internal representation of buffer lines, see :h NL-used-for-Nul.
If you try to insert \n character into the command-line, you will get Null shown (but this is really a newline). If you want to edit a file that has \n in a filename (it is possible on *nix), you will need to prepend newline with backslash.
The byte ctrl-@ is also known as '\0'. Many languages, programs, etc. use it as an "end of string" marker, so it's not surprising that Vim gets confused there. If you must use this byte in the middle of a script string, it sounds like your workaround is a decent one.
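The truncation effect is not specific to Vim. As a small illustration, the Python sketch below uses the standard `ctypes` module to read a byte buffer the way C code would, stopping at the first '\0'. The helper name `c_string_view` and the sample command are assumptions for this example, and this demonstrates the general C-string convention rather than Vim's actual internals.

```python
import ctypes

# In caret notation, a control character is the printable character with
# bit 0x40 flipped, so Ctrl-@ ('@' is 0x40) is the byte 0x00.
NUL = chr(ord("@") ^ 0x40)


def c_string_view(data: bytes) -> bytes:
    """Return the bytes a NUL-terminated C string reader would see."""
    buf = ctypes.create_string_buffer(data)
    # .value stops at the first null byte; .raw would return everything.
    return buf.value


if __name__ == "__main__":
    script_line = b"echo 'before\x00after'"
    print(c_string_view(script_line))  # everything after the NUL is lost
```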