Why is this LSEP symbol showing up on Chrome and not Firefox or Edge? - unicode

This web page is rendering with these symbols. They are found throughout this website/application but on no other sites. Can anyone tell me:
What this symbol is?
Why it is showing up only in one browser?

That character is U+2028 Line Separator, which is a kind of newline character. Think of it as the Unicode equivalent of HTML’s <br>.
As to why it shows up here: my guess would be that an internal database uses LSEP to not conflict with literal newlines or HTML tags (which might break the database or cause security errors), and either:
The server-side scripts that convert the database to HTML neglected to replace LSEP with <br>,
Chrome just breaks standards by displaying LSEP as a printing (visible) character, or
You have a font installed that displays LSEP as a printing character that only Chrome detects. To figure out which font it is, right click on the offending text and click “Inspect”, then switch to the “Computed” tab on the right-hand panel. At the very bottom you should see a section labeled “Rendered Fonts” which will help you locate the offending font.
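The first possibility above (a server-side script that forgets to convert LSEP) could be handled along these lines. This is only a minimal sketch: the function name separators_to_html is hypothetical, and it assumes the stored text is plain text that should be HTML-escaped before the line separators are turned into <br> tags.

```python
import html

LINE_SEPARATOR = "\u2028"  # U+2028, the character Chrome shows as "LSEP"


def separators_to_html(text: str) -> str:
    """Escape plain text for HTML, then turn each U+2028 into a <br> tag."""
    # Escape first so that the <br> tags we add are not escaped themselves.
    escaped = html.escape(text)
    return escaped.replace(LINE_SEPARATOR, "<br>")


print(separators_to_html("first line\u2028second line & more"))
```

Escaping before replacing is a deliberate choice here: html.escape leaves U+2028 untouched, so the order keeps the inserted tags intact.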
More information on the line separator, excerpted from the Unicode standard, Chapter 5.8, Newline Guidelines (on p. 12 of this PDF):
Line Separator and Paragraph Separator
A paragraph separator—independent of how it is encoded—is used to indicate a
separation between paragraphs. A line separator indicates where a line break
alone should occur, typically within a paragraph. For example:
This is a paragraph with a line separator at this point,
causing the word “causing” to appear on a different line, but not causing
the typical paragraph indentation, sentence breaking, line spacing, or
change in flush (right, center, or left paragraphs).
For comparison, line separators basically correspond to HTML <BR>, and
paragraph separators to older usage of HTML <P> (modern HTML delimits
paragraphs by enclosing them in <P>...</P>). In word processors, paragraph
separators are usually entered using a keyboard RETURN or ENTER; line
separators are usually entered using a modified RETURN or ENTER, such as
SHIFT-ENTER.
A record separator is used to separate records. For example, when exchanging
tabular data, a common format is to tab-separate the cells and to use a CRLF
at the end of a line of cells. This function is not precisely the same as line
separation, but the same characters are often used.
Traditionally, NLF started out as a line separator (and sometimes record
separator). It is still used as a line separator in simple text editors such as
program editors. As platforms and programs started to handle word processing
with automatic line-wrap, these characters were reinterpreted to stand for
paragraph separators. For example, even such simple programs as the Windows
Notepad program and the Mac SimpleText program interpret their platform’s NLF
as a paragraph separator, not a line separator. Once NLF was reinterpreted to
stand for a paragraph separator, in some cases another control character was
pressed into service as a line separator. For example, vertical tabulation VT
is used in Microsoft Word. However, the choice of character for line separator
is even less standardized than the choice of character for NLF. Many Internet
protocols and a lot of existing text treat NLF as a line separator, so an
implementer cannot simply treat NLF as a paragraph separator in all
circumstances.
Further reading:
Unicode Technical Report #13: Newline Guidelines
General Punctuation (U+2000–U+206F) chart PDF
SE: Why are there so many spaces and line breaks in Unicode?
SO: What is unicode character 2028 (LS / Line Separator) used for?
U+2028 on codepoints.net (a misprint there says that U+2028 was added in v. 1.1 of the Unicode standard, which is false; it was added in 1.0)

I found that in WordPress the easiest way to remove "L SEP" and "P SEP" characters is to execute these two SQL queries:
UPDATE wp_posts SET post_content = REPLACE(post_content, UNHEX('e280a9'), '');
UPDATE wp_posts SET post_content = REPLACE(post_content, UNHEX('e280a8'), '');
The JavaScript way (mentioned in some of the answers) can break some things (in my case some modal windows stopped working).
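The hex strings passed to UNHEX in those queries are the UTF-8 encodings of the two separator characters. A quick way to check this, as a small sketch (the helper name utf8_hex is made up for illustration):

```python
LSEP = "\u2028"  # LINE SEPARATOR
PSEP = "\u2029"  # PARAGRAPH SEPARATOR


def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of a character as a lowercase hex string."""
    return ch.encode("utf-8").hex()


print(utf8_hex(LSEP))  # e280a8
print(utf8_hex(PSEP))  # e280a9
```

This assumes the post_content column stores UTF-8 (or utf8mb4) text; with a different column encoding the byte sequences would differ.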

You can use this tool...
http://www.nousphere.net/cleanspecial.php
...to remove all the special characters that Chrome displays.
Steps:
Paste your HTML and Clean using HTML option.
You can manually delete the characters in the editor on this page and see the result.
Paste back your HTML in file and save :)

I recently ran into this issue and tried a number of fixes, but ultimately I had to paste the text into Vim, where there was an extra space I had to delete. I tried a number of HTML cleaners, but none of them worked. Vim was the key!

9999years's answer is great.
If you use Symfony with Twig templates, I would recommend checking for an empty Twig block. In my case it was an empty Twig block with an invisible character inside.
The LSEP character was only displayed on certain devices/browsers.
On the others I had a blank space above the header and could not see any invisible character.
I had to inspect the GET request to see that the value 1f18 was before the opening html tag.
Once I removed the empty Twig block it was gone.
Hope this can help someone one day.

My problem was similar, it was "PSEP" or "P SEP". Similar issue, an invisible character in my file.
I replaced \x{2029} with a normal space. Fixed. This problem only appeared on Windows Chrome. Not on my Mac.

I agree with @Kapil Bathija: basically you can copy and paste your HTML code into http://www.nousphere.net/cleanspecial.php and convert it.
It will convert the special characters for you. Then delete the spaces between the words: wherever you have to press backspace twice, there is an invalid character that can't be translated.
I had the same issue and it worked just fine afterwards.

You can also copy the text, paste it into an HTML editor such as Coda, remove the line break, copy it, and paste it back into your site.
Video here: https://www.loom.com/share/501498afa7594d95a18382f1188f33ce

Looks like my client pasted HTML into WordPress after initially creating it with MS Word. Even deleting the &nbsp; entities and visible spaces did not fix the issue. The extended characters became visible in vi/vim.
If you don't have vi/vim available, try highlighting from 2 chars before the LSEP to 2 chars after the LSEP; delete that chunk, and re-type the correct characters.
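If no editor at hand shows the hidden characters, a short script can report where they are. The following is a rough sketch, not a complete solution: the SUSPECTS set is an assumed, non-exhaustive selection of characters that commonly cause this kind of problem, and find_invisible is a hypothetical helper name.

```python
import unicodedata

# Assumed, non-exhaustive set of characters worth flagging.
SUSPECTS = {"\u2028", "\u2029", "\u00a0", "\u200b", "\ufeff"}


def find_invisible(text: str) -> list[tuple[int, int, str, str]]:
    """Return (line, column, code point, name) for each suspect character."""
    hits = []
    # Split on "\n" only: str.splitlines() would also split on U+2028/U+2029
    # and make the reported line numbers misleading.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in SUSPECTS:
                name = unicodedata.name(ch, "UNKNOWN")
                hits.append((line_no, col, f"U+{ord(ch):04X}", name))
    return hits


sample = "<p>Hello\u2028world</p>\n<p>next\u00a0line</p>"
for hit in find_invisible(sample):
    print(hit)
```

The reported line and column make it easier to delete exactly the offending character instead of retyping the surrounding text.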

Related

Why do I get extra characters (arrow symbol) while fetching text through Tesseract?

Whenever I fetch text in any language, the output has this extra character (arrow symbol), which is not there in the image. I'd like to understand, why it is present, and how to avoid these extra characters in the output.
That's most likely the implicit page separator \f, which Notepad shows as that arrow. For some details on that topic, see: What page separators are used in txt output by Tesseract 4.0.0?
You can try adding -c page_separator="" to your config. You shouldn't see that symbol in your output then. Note that this also disables page breaks entirely.
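If changing the Tesseract configuration is not an option, the form feed can also be handled after the fact. This is a minimal sketch that works on text already returned by the OCR step; both function names are hypothetical, and the OCR call itself is deliberately left out.

```python
FORM_FEED = "\f"  # the page separator that Notepad shows as an arrow


def strip_page_separators(ocr_text: str) -> str:
    """Remove every form feed from the OCR output."""
    return ocr_text.replace(FORM_FEED, "")


def split_pages(ocr_text: str) -> list[str]:
    """Alternatively, use the form feeds to split the output into pages."""
    return [page for page in ocr_text.split(FORM_FEED) if page.strip()]


raw = "Page one text\n\fPage two text\n\f"
print(repr(strip_page_separators(raw)))
print(split_pages(raw))
```

Splitting keeps the page boundaries available, which the page_separator="" setting would discard.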

TinyMCE converting space to &nbsp;

I am using TinyMCE 4. If I insert a space in the textarea between two words or characters and then check the source, the space has been converted to &nbsp;.
I have tried this solution, but it only resolves the issue partially: if I enter a single space between two characters or words, TinyMCE doesn't add &nbsp;, but if I add two consecutive spaces, it turns the second space into &nbsp;.
Any work around on this?
TinyMCE is adding hard spaces when you type multiple spaces into the editor - HTML does not show multiple normal whitespace characters so you can't get (per your example) two spaces between letters with just regular spaces. Using hard spaces for every other space allows content authors to use spaces within content and get a rendered result that matches what they type in the editor.
If you render that HTML without hard spaces there would just be one space between each set of characters regardless of how many spaces you put in the HTML source.
The net result is that the editor is doing what it needs to do to allow you to see multiple spaces.
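The idea can be illustrated with a small sketch. It is not TinyMCE's actual algorithm, only an approximation of the behaviour described above: collapse_like_html mimics how a browser collapses runs of ordinary whitespace, and preserve_spaces (a hypothetical name) turns every second space in a run into a hard space.

```python
import re

NBSP = "&nbsp;"


def collapse_like_html(text: str) -> str:
    """Approximate HTML rendering: a run of ordinary whitespace becomes one space."""
    return re.sub(r"[ \t\n\r\f]+", " ", text)


def preserve_spaces(text: str) -> str:
    """Replace every second space in a run of 2+ spaces with a hard space."""
    def alternate(match: re.Match) -> str:
        run_length = len(match.group(0))
        return "".join(" " if i % 2 == 0 else NBSP for i in range(run_length))

    return re.sub(r" {2,}", alternate, text)


print(collapse_like_html("a  b"))  # a b
print(preserve_spaces("a  b"))     # a &nbsp;b
print(preserve_spaces("a b"))      # a b
```

A single space is left alone, which matches the observation that only the second of two consecutive spaces is converted.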

Unicode converted text isn't shown properly in MS-Word

In a mapping editor, the display is correct after the legacy-to-Unicode conversion for Devanagari text shown using a Unicode font (Arial Unicode MS). However, in MS Word, the display isn't as expected for the same Unicode text in the same font (Arial Unicode MS) or any other Devanagari Unicode font. The expected sequence of Unicode code points is provided as per the documentation. The sequence can be seen in the left-hand table.
Please let me know where I am going wrong.
Thanks for your help!
Does your map have to insert the zero_width_joiner? The halant (virama) by itself is enough to get the half-consonant (for some combinations) and in particular, it may be that Word is using the presence of the ZWJ to keep them separate.
If getting rid of the ZWJ doesn't help, another possibility is that Word may be treating the individual characters of the text string as individual "runs" of text.
If those first 4 characters are not in a single run, this can happen.
[Aside: the way to tell if it's being treated as a single run is to save the document as an XML file, open it with something like Notepad++, and look at the XML "w:t" element (IIRC) associated with these characters. If they're all in separate w:t elements, it means they're in separate runs. In that case, you might need to copy the text from Word to some other tool (e.g. Notepad++) and then copy it from there and paste it back into Word. That might cause it to be imported into Word as a single run.]
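The run check can also be scripted instead of reading the XML by eye. This is a rough sketch under stated assumptions: text_runs is a hypothetical helper, it expects the saved XML as a string, and it matches any element whose local name is "t" regardless of namespace, since the namespace URI depends on which XML format Word was asked to save.

```python
import xml.etree.ElementTree as ET


def text_runs(xml_source: str) -> list[str]:
    """Return the text of every element whose local name is 't' (e.g. w:t)."""
    root = ET.fromstring(xml_source)
    return [
        element.text or ""
        for element in root.iter()
        # ElementTree tags look like "{namespace-uri}t"; keep the local name.
        if element.tag.rsplit("}", 1)[-1] == "t"
    ]


# Tiny hand-written example, not real Word output.
sample = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body><w:p>"
    "<w:r><w:t>ab</w:t></w:r>"
    "<w:r><w:t>c</w:t></w:r>"
    "</w:p></w:body></w:document>"
)
print(text_runs(sample))  # ['ab', 'c']
```

If the characters of one conjunct show up as separate list entries, they are in separate runs.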

How can I strip out "file separator" characters from CSS/text files?

My CSS files have become contaminated with "file separator" characters (AKA "INFORMATION SEPARATOR FOUR" or ALT/028 characters). How can I get rid of them?
This is the character:
http://www.fileformat.info/info/unicode/char/1c/index.htm
Background
I manage a number of .CSS text files that are fairly similar. Unfortunately a number of these files have somehow got "file separator" characters pasted into them. Although they still seem to work in browsers, any file that has one of these characters anywhere within it cannot be indexed by my desktop search utility (X1 Search). This makes them extremely hard for me to manage, because I need to compare CSS files constantly.
[Bizarrely, X1 Search ignores the character if the filename extension is .TXT but fails to index the entire file if the filename extension is .CSS]
Worse this "file separator" character is almost invisible within my text editor (TextPad 7.2). The only way I can detect it is to make spaces and carriage returns visible and then it appears as blank space. Worse still it appears to be impossible to search for using text search.
To make it clear what I mean an example that I have pasted into this page. The "file separator" character is on LineB below
LineA
LineB
LineC
LineD
Is there any way to remove this character from multiple text (in this case CSS) files at once?
NB I do NOT want to remove the whole line, just the one character(!)
Thanks
J
P.S. I am running on Windows7 (x64). I am using TextPad 7.3.
I have eventually managed to answer my own question.
Text Crawler and the use of a regular expression of "\x1c" appears to be the answer.
Fwiw, both Agent Ransack and FileLocator Pro filter out any characters in the ASCII range 0-31 (excluding 0x09 - tab) from the input field.
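For anyone without Text Crawler, the same "\x1c" replacement across many files can be sketched in a few lines. This is only an illustration: strip_file_separator is a hypothetical name, and it assumes the files are ASCII or UTF-8, where the byte 0x1C can only ever be the file separator character. Back up the folder before running anything like this.

```python
from pathlib import Path

FILE_SEPARATOR = b"\x1c"  # U+001C, INFORMATION SEPARATOR FOUR


def strip_file_separator(folder: str, pattern: str = "*.css") -> list[Path]:
    """Remove every 0x1C byte from matching files; return the files changed."""
    changed = []
    for path in sorted(Path(folder).rglob(pattern)):
        data = path.read_bytes()
        if FILE_SEPARATOR in data:
            # Only the single byte is removed; the rest of the line is kept.
            path.write_bytes(data.replace(FILE_SEPARATOR, b""))
            changed.append(path)
    return changed
```

Working on bytes rather than decoded text avoids having to guess each file's encoding, and files without the character are never rewritten.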

Unicode character for marking

We are going to digitize a lot of books. We want to mark the places of line breaks in the original book without influencing the flow of the digital book. Which invisible Unicode character can be used to mark such special places in a raw file?
(\n will be used to indicate the end of a paragraph)
This is a sentence
in the original book that
I want to mark line
break places.
What is the proper character to replace *:
This is a sentence * in the original book that * I want to mark line *break places.
Unicode has no concept of a hidden character that represents a line break in some original but does not cause line break in rendering. Unicode encodes plain text data, and its control characters for line breaks have an effect when plain text is rendered.
What matters here is how the files will be used. If they need to be processable with plain text editors, then you need to decide: either the line breaks are replicated in default rendering, or they are omitted when creating the file. You can’t make them invisible. And different text editors, like Notepad and Emacs, may well use different line control conventions; one program’s end of line is another program’s end of paragraph.
If the files will only be processed by programs that you create, then you can use whatever conventions you like. The most logical one is this:
“Line and Paragraph Separator. The Unicode Standard provides two unambiguous characters,
U+2028 line separator and U+2029 paragraph separator, to separate lines and
paragraphs. They are considered the default form of denoting line and paragraph boundaries
in Unicode plain text. A new line is begun after each line separator. A new paragraph
is begun after each paragraph separator. As these characters are separator codes, it is not necessary either to start the first line or paragraph or to end the last line or paragraph with them. Doing so would indicate that there was an empty paragraph or line following. The paragraph separator can be inserted between paragraphs of text. Its use allows the creation of plain text files, which can be laid out on a different line width at the receiving end. The line separator can be used to indicate an unconditional end of line.”
http://www.unicode.org/versions/Unicode6.1.0/ch16.pdf (pages 6 and 7 in the PDF)
Beware that U+2028 and U+2029 are generally not understood by text editors. They are suitable for storing data in plain text format. When the text is to be rendered, the rendering software has the option of ignoring the original division into lines and treating U+2028 as equivalent to a space, except if preceded by a hyphen (which poses a problem that cannot be resolved without higher level information: a line that ends with “foo-” and is followed by a line beginning with “bar” could represent the word “foobar” as hyphenated for line breaking, or a hyphenated compound “foo-bar” or, in some cases, the combination “foo- bar”).
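For files that are only processed by your own programs, the convention described above could look roughly like this. It is a minimal sketch with hypothetical function names: original line breaks inside a paragraph are stored as U+2028, and the renderer replaces them with a space. The hyphenation ambiguity mentioned above is not handled.

```python
LSEP = "\u2028"  # marks a line break that existed in the printed original


def encode_paragraph(printed_lines: list[str]) -> str:
    """Join the printed lines of one paragraph with U+2028."""
    return LSEP.join(printed_lines)


def reflow(paragraph: str) -> str:
    """Render for reading: ignore the original line division."""
    return paragraph.replace(LSEP, " ")


def printed_lines(paragraph: str) -> list[str]:
    """Recover the line division of the printed original."""
    return paragraph.split(LSEP)


stored = encode_paragraph([
    "This is a sentence",
    "in the original book that",
    "I want to mark line",
    "break places.",
])
print(reflow(stored))
print(printed_lines(stored))
```

Because the marker is a separator rather than a terminator, nothing is added after the last line of the paragraph.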
Use the line feed character (LF, "\n", 0x0A) and/or maybe carriage return (CR, "\r", 0x0D).
I.e., the regular characters for this purpose.