Why does fasttext yield </s> as first entry in VSM?

I am using a large German corpus, which I have cleaned of all special characters/numbers/inter-punctuation signs.
Each line contains one sentence.
Running
fastText/./fasttext skipgram -input input.txt -output output.txt
-minCount 2 -minn 2 -maxn 8 -dim 300 -ws 5
returns a VSM with </s> as the first entry.
As I understand it, there are whitespace characters left in the document that are being interpreted as a token.
Is that correct?
And how can I get rid of them and/or the </s> in the VSM?
Thank you.

By convention the fasttext tool converts any newlines in the input file to the pseudoword token '</s>', representing an end-of-string ('EOS').
See the discussion in the Python binding Markdown docs:
https://github.com/facebookresearch/fastText/blob/main/python/README.md#important-preprocessing-data--encoding-conventions
The newline character is used to delimit lines of text. In particular, the EOS token is appended to a line of text if a newline character is encountered. The only exception is if the number of tokens exceeds the MAX_LINE_SIZE constant as defined in the Dictionary header. This means if you have text that is not separated by newlines, such as the fil9 dataset, it will be broken into chunks with MAX_LINE_SIZE of tokens and the EOS token is not appended.
The length of a token is the number of UTF-8 characters by considering the leading two bits of a byte to identify subsequent bytes of a multi-byte sequence. Knowing this is especially important when choosing the minimum and maximum length of subwords. Further, the EOS token (as specified in the Dictionary header) is considered a character and will not be broken into subwords.
(Though only mentioned in that doc about the Python bindings, it's definitely defined/implemented in the core C++ code, especially the dictionary.cc file.)
To eliminate that word-token, you'd have to strip all newlines from your input file.
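For what it's worth, here is a minimal sketch of that preprocessing step (in Python, which is an assumption about your tooling; the output file name is a placeholder). With newlines replaced by spaces, fastText never sees an end-of-line and so never emits </s>; the text is instead broken into MAX_LINE_SIZE chunks, as the quoted doc describes.
# Sketch: join all sentences into one newline-free stream.
with open("input.txt", encoding="utf-8") as src, \
        open("input_no_newlines.txt", "w", encoding="utf-8") as dst:
    dst.write(src.read().replace("\n", " "))
Note the trade-off: without newlines you lose sentence boundaries, so the context window can span what used to be separate sentences.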

Related

Any subtleties in source map line/column definitions when using source-maps for non-JS languages

I am getting a code generator to produce source maps that track line/column positions based on Unicode code-points, with lines broken by LF, CR, or CRLF. I worry that other line-breaks embedded in comments, and supplementary characters, might cause source-map consumers to disagree about which part of a source text a (line, column) pair references.
Specifically, I am confused by the source-map v3 specification's use of terms like
the zero-based starting line in the original source
the zero-based starting column of the line in the source represented
I imagine that these numbers are used by programmer tools to navigate to/highlight chunks of code in the original source.
Problems due to different line break definitions
JavaScript recognizes CR, LF, CRLF, U+2028, U+2029 as line breaks
Rust recognizes those plus U+0085, or only LF and CRLF, depending on how you slice the grammar.
Java only treats CR, LF, CRLF as line breaks.
Languages that follow Unicode TR #13 might additionally treat VT and FF as line breaks.
So if a source-map generator treats U+0085 (for example) as a newline and a source-map consumer does not, might they disagree about where a (source-line, source-column) pair points?
Problems due to differing column definitions
Older versions of JavaScript defined source text as a sequence of UTF-16 code-units, suggesting that a column count is the number of UTF-16 code-units since the end of the last line-break.
ECMAScript source text is represented as a sequence of characters in the Unicode character encoding, version 2.1 or later, using the UTF-16 transformation format.
But the current spec does not describe source texts in terms of UTF-16:
SourceCharacter::
    any Unicode code point
Are column counts likely to be thrown off if source-map consumers differently treat supplementary characters as occupying one code-point column or two UTF-16 columns?
For example, since '𝄬' is U+1D12C, a supplementary code-point encoded using two UTF-16 code-units, might column counts disagree for a line like
let 𝄬 = "𝄬" /* 𝄬 */ + ""
Is the + symbol at column 20 (zero-indexed by code-point) or column 23 (zero-indexed by UTF-16 code-unit)?
Am I missing something in the specification that clarifies this, or is there a de facto rule used by most source-map producers/consumers?
If these are problems, are there known workarounds or best practices when tracking line/column counts for source languages that translate to JS?
I will probably have to reverse-engineer what implementations like Mozilla's source-map.js or Chrome's dev console do, but thought I'd try and find a spec reference so I know whom to file bugs against and who is correct.
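For what it's worth, the discrepancy in the example above is easy to reproduce. A small sketch (Python, chosen only because its string indices count code points, so both flavors of column are easy to compute):
# The example line from above, with the supplementary character written as an escape.
line = 'let \U0001D12C = "\U0001D12C" /* \U0001D12C */ + ""'
cp_col = line.index('+')                                  # column counted in code points
utf16_col = len(line[:cp_col].encode('utf-16-le')) // 2   # column counted in UTF-16 code units
print(cp_col, utf16_col)                                  # 20 23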

Are newlines in MIME headers using encoded-words legal?

RFC 2047 defines the encoded-words mechanism for encoding non-ASCII characters in MIME documents. It specifies that whitespace characters (spaces and tabs) are not allowed inside an encoded-word.
However, RFC 5322 for parsing email MIME documents specifies that long header lines should be "folded". Should this folding take place before or after encoded-words decoding?
I recently received an email where the encoded-text part of the header had a newline in it, like this:
Header: =?UTF-8?Q?=C3=A5
=C3=A4?=
Would this be valid?
Of course emails can be invalid in lots of exciting ways and the parser needs to handle that, but it's interesting to know the "correct" way. :)
I misread the question and answered as if it concerned a different sort of whitespace. In this case the whitespace appears inside a single encoded-word, not between multiple encoded-words separated by whitespace.
This sort of thing is explicitly disallowed. From the introduction to the format in RFC 2047:
2. Syntax of encoded-words
An 'encoded-word' is defined by the following ABNF grammar. The notation of RFC 822 is used, with the exception that white space characters MUST NOT appear between components of an 'encoded-word'.
And then later on in the same section:
IMPORTANT: 'encoded-word's are designed to be recognized as 'atom's by an RFC 822 parser. As a consequence, unencoded white space characters (such as SPACE and HTAB) are FORBIDDEN within an 'encoded-word'. For example, the character sequence
=?iso-8859-1?q?this is some text?=
would be parsed as four 'atom's, rather than as a single 'atom' (by an RFC 822 parser) or 'encoded-word' (by a parser which understands 'encoded-words'). The correct way to encode the string "this is some text" is to encode the SPACE characters as well, e.g.
=?iso-8859-1?q?this=20is=20some=20text?=
The characters which may appear in 'encoded-text' are further restricted by the rules in section 5.
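As a quick sanity check, Python's email.header module (one convenient implementation, not something from the question) produces exactly this kind of fully-escaped form; Q-encoding also allows the shorthand '_' for SPACE:
from email.header import Header

# Spaces inside the encoded-word are escaped rather than left bare.
print(Header("this is some text", charset="iso-8859-1").encode())
# =?iso-8859-1?q?this_is_some_text?=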
Earlier answer
This sort of thing is explicitly allowed. Headers with MIME words should be 76 characters or less and folded if needed. RFC 822 folding indents the second and any additional lines; RFC 2047 headers are supposed to indent by only one space. The whitespace between the ?= at the end of one line and the =? at the start of the next should be suppressed from the output.
See the example at the bottom of page 12 of the RFC:
encoded form                          displayed as
---------------------------------------------------------------------
(=?ISO-8859-1?Q?a?=                   (ab)
 =?ISO-8859-1?Q?b?=)
Any amount of linear-space-white between 'encoded-word's, even if it includes a CRLF followed by one or more SPACEs, is ignored for the purposes of display.
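You can see that display rule in action with, for example, Python's email.header (an illustration, not part of the original answer); the whitespace between the two encoded-words disappears on decode:
from email.header import decode_header, make_header

raw = '=?ISO-8859-1?Q?a?= =?ISO-8859-1?Q?b?='
print(make_header(decode_header(raw)))  # ab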

grep for emojis in Linux

I am trying to grep across a list of tokens that includes several non-ASCII characters. I want to match only emojis; other characters such as ð or ñ are fine. The Unicode range for emojis appears to be U+1F600-U+1F1FF, but when I search for it using grep this happens:
grep -P "[\x1F6-\x1F1]" contact_names.tokens
grep: range out of order in character class
https://unicode.org/emoji/charts/full-emoji-list.html#1f3f4_e0067_e0062_e0077_e006c_e0073_e007f
You need to specify the code points with their full values (not 1F6 but 1F600) and wrap them in curly braces. In addition, the first value must be smaller than the last value.
So the regex should be "[\x{1F1FF}-\x{1F600}]".
The Unicode range for emoji is, however, more complex than you assumed. The page you referred to does not sort characters by code point, and emoji are placed in many blocks. If you want to cover almost all of them:
grep -P "[\x{1f300}-\x{1f5ff}\x{1f900}-\x{1f9ff}\x{1f600}-\x{1f64f}\x{1f680}-\x{1f6ff}\x{2600}-\x{26ff}\x{2700}-\x{27bf}\x{1f1e6}-\x{1f1ff}\x{1f191}-\x{1f251}\x{1f004}\x{1f0cf}\x{1f170}-\x{1f171}\x{1f17e}-\x{1f17f}\x{1f18e}\x{3030}\x{2b50}\x{2b55}\x{2934}-\x{2935}\x{2b05}-\x{2b07}\x{2b1b}-\x{2b1c}\x{3297}\x{3299}\x{303d}\x{00a9}\x{00ae}\x{2122}\x{23f3}\x{24c2}\x{23e9}-\x{23ef}\x{25b6}\x{23f8}-\x{23fa}]" contact_names.tokens
(The range is borrowed from Suhail Gupta's answer on a similar question)
If you need to allow/disallow specific emoji blocks, see the sequence data on unicode.org. The List of emoji on Wikipedia also shows characters in ordered tables, but it might not list the latest ones.
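If grep -P is not available, roughly the same filter can be written in Python (a sketch covering only the main blocks from the command above; contact_names.tokens is the file from the question):
import re

# Character class covering the larger emoji blocks listed above.
EMOJI_RE = re.compile(
    '[\U0001F300-\U0001F5FF\U0001F600-\U0001F64F\U0001F680-\U0001F6FF'
    '\U0001F900-\U0001F9FF\U0001F1E6-\U0001F1FF\u2600-\u26FF\u2700-\u27BF]'
)

with open('contact_names.tokens', encoding='utf-8') as f:
    for line in f:
        if EMOJI_RE.search(line):
            print(line, end='')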
You could use ugrep as a drop-in replacement for grep to do this:
ugrep "[\x{1F1FF}-\x{1F600}]" contact_names.tokens
ugrep matches Unicode patterns by default (disabled with option -U).
The regular expression syntax is POSIX ERE compliant, extended with Unicode character classes, lazy quantifiers, and negative patterns to skip unwanted pattern matches to produce more precise results.
ugrep searches UTF-encoded input when a UTF BOM (byte order mark) is present, and ASCII and UTF-8 when no UTF BOM is present. Option --encoding permits many other file formats to be searched, such as ISO-8859-1, EBCDIC, and code pages 437, 850, 858, and 1250 to 1258.
ugrep searches text and binary files and produces hexdumps for binary matches.
The Unicode range for emoji is larger than U+1F1FF to U+1F600. See the official Unicode 12 publication: https://unicode.org/emoji/charts-12.0/full-emoji-list.html

What is a token in Perl?

My assignment is to open and read a file and remove all commas, periods, spaces, and exclamation points from it. Furthermore, I must display the number of occurrences of each word by building a hash in which the words are the keys and the occurrence counts are the values. For example, in a document that says "Perl Program, Perl Program.", Perl and Program are the keys, whereas the values are the number of occurrences:
Words-----Count
Perl------2
Program---2
The instructor already posted the directions, but in them he mentions, "split the line into tokens and store the array". I think I could do this if I knew what tokens were, so could someone explain what tokens are please?
According to Wikipedia:
A token is a string of characters, categorized according to the rules as a symbol (e.g., IDENTIFIER, NUMBER, COMMA).
There is no special meaning of token in Perl.
In this context a token is most likely a word or symbol delimited by a special character, which here means all of the characters you are supposed to remove.
That means in your example the tokens you'd have would be (in order)
Perl
Program
Perl
Program
But in another example that isn't spaced out, like
"Perl!ProgramHello,Name.GoodBye>ASFDKLDJ"
The tokens would be
Perl
ProgramHello (even though this is two English words)
Name
GoodBye
ASFDKLDJ
You should clarify with your Professor as to what you have to split the tokens on.
Starting with a text file that uses spaces as the standard word delimiter, the instructions do not say that, while removing spaces and punctuation, some other delimiter cannot be substituted.
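To make the idea concrete, here is the same split-into-tokens-and-count logic as a sketch (in Python rather than Perl, purely for illustration; input.txt is a placeholder):
import re
from collections import Counter

counts = Counter()
with open('input.txt', encoding='utf-8') as f:
    for line in f:
        # Split on the characters to be removed: commas, periods,
        # spaces, and exclamation points.
        counts.update(t for t in re.split(r'[,. !]+', line.strip()) if t)

for word, n in counts.items():
    print(f'{word}\t{n}')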

What is the email subject length limit?

How many characters are allowed to be in the subject line of Internet email?
I had a scan of the RFC for email but could not see specifically how long it was allowed to be.
I have a colleague who wants to programmatically validate it.
If there is no formal limit, what is a good length in practice to suggest?
See RFC 2822, section 2.1.1 to start.
There are two limits that this standard places on the number of characters in a line. Each line of characters MUST be no more than 998 characters, and SHOULD be no more than 78 characters, excluding the CRLF.
As the RFC states later, you can work around this limit (not that you should) by folding the subject over multiple lines.
Each header field is logically a single line of characters comprising the field name, the colon, and the field body. For convenience however, and to deal with the 998/78 character limitations per line, the field body portion of a header field can be split into a multiple line representation; this is called "folding". The general rule is that wherever this standard allows for folding white space (not simply WSP characters), a CRLF may be inserted before any WSP. For example, the header field:
Subject: This is a test
can be represented as:
Subject: This
 is a test
The recommendation for no more than 78 characters in the subject header sounds reasonable. No one wants to scroll to see the entire subject line, and something important might get cut off on the right.
RFC 2822 states that the subject header "has no length restriction", but to produce long headers you need to split them across multiple lines, a process called "folding".
The subject is defined as "unstructured" in RFC 5322. Here are some quotes ([...] indicates omitted text):
3.6.5. Informational Fields
The informational fields are all optional. The "Subject:" and "Comments:" fields are unstructured fields as defined in section 2.2.1, [...]
2.2.1. Unstructured Header Field Bodies
Some field bodies in this specification are defined simply as "unstructured" (which is specified in section 3.2.5 as any printable US-ASCII characters plus white space characters) with no further restrictions. These are referred to as unstructured field bodies. Semantically, unstructured field bodies are simply to be treated as a single line of characters with no further processing (except for "folding" and "unfolding" as described in section 2.2.3).
2.2.3. [...] An unfolded header field has no length restriction and therefore may be indeterminately long.
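As an illustration of folding in practice, Python's email library (one example implementation; the RFC does not mandate any particular tool) folds a long Subject automatically when the message is serialized:
from email.message import EmailMessage

msg = EmailMessage()
msg['Subject'] = 'a rather long subject line ' * 8  # well past 78 characters
print(msg.as_string())  # the Subject: header comes out folded across several lines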
After some testing: if you send an email to an Outlook client, the subject is longer than 77 characters, and it needs to use "=?ISO" encoding inside the subject (in my case because of accents), then Outlook will cut the subject in the middle and mangle everything that comes after, including body text, attachments, etc.
I have several examples like this one:
Subject: =?ISO-8859-1?Q?Actas de la obra N=BA.20100154 (Expediente N=BA.20100182) "NUEVA RED FERROVIARIA.=
TRAMO=20BEASAIN=20OESTE(Pedido=20PC10/00123-125),=20BEASAIN".?=
To:
As you can see, the subject line was cut at character 78 with a "=" followed by two or three line feeds, and then it continued badly with the rest of the subject.
This was reported to me by several customers, all of whom were using Outlook; other email clients handle such subjects fine.
If there is no ISO encoding in the subject it does no harm, but if you add it to be nice to the RFC, then you get this surprise from Outlook. Yet if you don't add the ISO encoding, then iPhone mail will not understand it (and attached files whose names use such characters will not work on iPhones).
Limits in the context of Unicode multi-byte character capabilities
While RFC 5322 defines a limit of 1000 characters (998 + CRLF), it does so in the context of headers limited to ASCII characters only.
RFC 6532 explains how to handle multi-byte Unicode characters. Section 3.4 (Effects on Line Length Limits) states:
Section 2.1.1 of [RFC5322] limits lines to 998 characters and recommends that the lines be restricted to only 78 characters. This specification changes the former limit to 998 octets. (Note that, in ASCII, octets and characters are effectively the same, but this is not true in UTF-8.) The 78-character limit remains defined in terms of characters, not octets, since it is intended to address display width issues, not line-length issues.
So for example, because you are limited to 998 octets, you can't have 998 smiley faces in your subject line as each emoji of this type is 4 octets.
Using PHP to demonstrate:
Run php -a for an interactive terminal.
// Multi-byte string length:
var_export(mb_strlen("\u{0001F602}",'UTF-8'));
// 1
// ASCII string length:
var_export(strlen("\u{0001F602}"));
// 4
// ASCII substring of four octet character:
var_export(substr("\u{0001F602}",0,4));
// '😂'
// ASCII substring of a four-octet character truncated to 3 octets, mutating the character:
var_export(substr("\u{0001F602}",0,3));
// '▒'
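If you ever need to enforce the 998-octet limit yourself, the cut has to land on a character boundary. A sketch of the same point (Python this time; the helper name is made up):
def truncate_to_octets(subject: str, max_octets: int = 998) -> str:
    # Encode, cut at the octet limit, then drop any trailing partial
    # UTF-8 sequence instead of emitting a mangled character.
    return subject.encode('utf-8')[:max_octets].decode('utf-8', errors='ignore')

print(len(truncate_to_octets('\U0001F602' * 300)))  # 249 complete emoji survive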
I don't believe that there is a formal limit here, and I'm pretty sure there isn't any hard limit specified in the RFC either, as you found.
I think that some pretty common limitations for subject lines in general (not just e-mail) are:
80 Characters
128 Characters
256 Characters
Obviously, you want to come up with something that is reasonable. If you're writing an e-mail client, you may want to go with something like 256 characters, and obviously test thoroughly against big commercial servers out there to make sure they serve your mail correctly.
Hope this helps!
What's important is which mechanism you are using to send the email. Most modern libraries (e.g. System.Net.Mail) will hide the folding from you: you just put in a very long subject line without CR, LF, or HTAB. If you start trying to do your own folding, all bets are off and it will start reporting errors. So if you are having this issue, just filter out the CR, LF, and HTAB characters and let the library do the work for you. You can usually also set the encoding text type as a separate field; there's no need for ISO encoding in the subject line.
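For example, that filtering step might look like this (a hypothetical helper in Python rather than .NET; same idea):
import re

def sanitize_subject(subject: str) -> str:
    # Collapse CR, LF, and HTAB runs to single spaces and let the
    # mail library handle any folding itself.
    return re.sub(r'[\r\n\t]+', ' ', subject).strip()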