When copying code, U+00a0 (nbsp) characters creating problem with VS Code

When copying code, U+00a0 (nbsp) characters creating problem with VS Code - visual-studio-code

In my Ubuntu machine recently i have installed vs code. But when i try to paste any code copied from php.net website all the spaces and tabs are replaced by "U+00a0 (nbsp) characters"
It looks like this
Currently i'm doing: using Regular Expression (.*) and search for the unicode character with regex [\xa0] and replace by space. Can anyone suggest me how to solve this problem.

Related

Show special characters in VS Code

I have a problem with my VS Code. When trying to modify a file that contains special characters like "á", "ñ", "ó" etc., the special characters are replaced with a question mark. (See image below.)
Although, this can be solved easily from the back of Visual Studio Code, changing the language type to "Windows 1252", because at first it worked for me. But now, even if I change it to that language, the signs are still there.

the files that you opened before you made the changes to the encoding have been auto-overwritten and the original characters were replaced with the unknown-character character

Character Encoding Issue - Characters Being Replaced with Random Characters after Saving in Textarea

I'm working with a third-party company and I'm trying/hoping to determine the cause of a character encoding issue before I bring it up with them.
This company has a custom drag and drop editor for designing websites on their platform. Within the editor they have a Raw HTML widget that I can drag in and add my own content too. The problem is that when I copy HTML from someones old website, using the inspector tool, and paste it into this widget of theirs, all of the apostrophe's & double quotes get replaced with 'jibberish'. I also have the same issue when I try pasting the content into notepad, notepad++, sublime editors and then pasting it into their Raw HTML editor.
Here's a recording of the issue and a few examples:
https://streamable.com/phwn2
Here are the known characters that get replaced and what they get replaced
’ turns into â™
“ turns into âœ
” turns into â
&plus; turns into (a space)
Å turns into Ã…
" stays as "
' stays as '
Does anyone see a pattern with these characters or know what could be the cause of these characters being replaced?

The website probably has UTF-8 encoding, and the company's editor might be using something like Windows-1252 encoding. In your first example, the right single quote has UTF-8 encoding e2 80 99. When each of those bytes is read by a program using Windows-1252, you get "small latin letter a with circumflex" (e2), [undefined] 80 and "trademark" (99). I haven't checked the other transformations. If this is the problem, then you could do a workaround by first converting the copied characters to the destination encoding with iconv, before pasting into the company's editor.

Why is this LSEP symbol showing up on Chrome and not Firefox or Edge?

So this web page is rendering with these symbols and they are found throughout this website/application but on no other sites. Can anyone tell me
What this symbol is?
Why it is showing up only in one browser?

That character is U+2028 Line Separator, which is a kind of newline character. Think of it as the Unicode equivalent of HTML’s .
As to why it shows up here: my guess would be that an internal database uses LSEP to not conflict with literal newlines or HTML tags (which might break the database or cause security errors), and either:
The server-side scripts that convert the database to HTML neglected to replace LSEP with 
Chrome just breaks standards by displaying LSEP as a printing (visible) character, or
You have a font installed that displays LSEP as a printing character that only Chrome detects. To figure out which font it is, right click on the offending text and click “Inspect”, then switch to the “Computed” tab on the right-hand panel. At the very bottom you should see a section labeled “Rendered Fonts” which will help you locate the offending font.
More information on the line separator, excerpted from the Unicode standard, Chapter 5.8, Newline Guidelines (on p. 12 of this PDF):
Line Separator and Paragraph Separator
A paragraph separator—independent of how it is encoded—is used to indicate a
separation between paragraphs. A line separator indicates where a line break
alone should occur, typically within a paragraph. For example:
This is a paragraph with a line separator at this point,
causing the word “causing” to appear on a different line, but not causing
the typical paragraph indentation, sentence breaking, line spacing, or
change in flush (right, center, or left paragraphs).
For comparison, line separators basically correspond to HTML , and
paragraph separators to older usage of HTML (modern HTML delimits
paragraphs by enclosing them in ...). In word processors, paragraph
separators are usually entered using a keyboard RETURN or ENTER; line
separators are usually entered using a modified RETURN or ENTER, such as
SHIFT-ENTER.
A record separator is used to separate records. For example, when exchanging
tabular data, a common format is to tab-separate the cells and to use a CRLF
at the end of a line of cells. This function is not precisely the same as line
separation, but the same characters are often used.
Traditionally, NLF started out as a line separator (and sometimes record
separator). It is still used as a line separator in simple text editors such as
program editors. As platforms and programs started to handle word processing
with automatic line-wrap, these characters were reinterpreted to stand for
paragraph separators. For example, even such simple programs as the Windows
Notepad program and the Mac SimpleText program interpret their platform’s NLF
as a paragraph separator, not a line separator. Once NLF was reinterpreted to
stand for a paragraph separator, in some cases another control character was
pressed into service as a line separator. For example, vertical tabulation VT
is used in Microsoft Word. However, the choice of character for line separator
is even less standardized than the choice of character for NLF. Many Internet
protocols and a lot of existing text treat NLF as a line separator, so an
implementer cannot simply treat NLF as a paragraph separator in all
circumstances.
Further reading:
Unicode Technical Report #13: Newline Guidelines
General Punctuation (U+2000–U+206F) chart PDF
SE: Why are there so many spaces and line breaks in Unicode?
SO: What is unicode character 2028 (LS / Line Separator) used for?
U+2028 on codepoints.net A misprint here says that U+2028 was added in v. 1.1 of the Unicode standard, which is false — it was added in 1.0

I found that in WordPress the easiest way to remove "L SEP" and "P SEP" characters is to execute this two SQL queries:
UPDATE wp_posts SET post_content = REPLACE(post_content, UNHEX('e280a9'), '')
UPDATE wp_posts SET post_content = REPLACE(post_content, UNHEX('e280a8'), '')
The javascript way (mentioned in some of the answers) can break some things (in my case some modal windows stopped working).

You can use this tool...
http://www.nousphere.net/cleanspecial.php
...to remove all the special characters that Chrome displays.
Steps:
Paste your HTML and Clean using HTML option.
You can manually delete the characters in the editor on this page and see the result.
Paste back your HTML in file and save :)

I recently ran into this issue, tried a number of fixes but ultimately I had to paste the text into VIM and there was an extra space I had to delete. I tried a number of HTML cleaners but none of them worked, VIM was the key!

9999years answers is great.
In case you use Symfony with Twig template I would recommend to check for an empty Twig block. In my case it was an empty Twig block with an invisible char inside.
The LSEP char was only displayed on certain device / browser.
On the other I had a blank space above the header and I could not see any invisible char.
I had to inspect the GET request to see that the value 1f18 was before the open html tag.
Once I removed an empty Twig block it was gone.
hope this can help someone one day ...

My problem was similar, it was "PSEP" or "P SEP". Similar issue, an invisible character in my file.
I replaced \x{2029} with a normal space. Fixed. This problem only appeared on Windows Chrome. Not on my Mac.

I agree with #Kapil Bathija - Basically you can copy & paste your HTML code into http://www.nousphere.net/cleanspecial.php and convert it.
Then it will convert the special characters for you - Just remove the spaces in between the words and you will realize you have to press backspace 2x meaning there is an invalid character that can't be translated.
I had the same issue and it worked just fine afterwards.

You can also copy the text, paste it into a HTML editor such as Coda, remove the linebreak, copy it and paste it back into your site.
Video here: https://www.loom.com/share/501498afa7594d95a18382f1188f33ce

Looks like my client pasted HTML into Wordpress after initially creating it with MS-Word. Even deleting the and visible spaces did not fix the issue. The extended characters became visible in vi/vim.
If you don't have vi/vim available, try highlighting from 2 chars before the LSEP to 2 chars after the LSEP; delete that chunk, and re-type the correct characters.

Odd characters in Sublime Text 2: `SOH` and `ACK`

In converting old notes from org syntax to mmd, I used the Clean Text app to remove extra line breaks and non-unicode characters, and to convert to safe line-endings. When pasting the text back into Sublime Text 2, I noticed several odd characters. I don't really care too much about why they're there, I'd just like to know what the characters are, and if they're searchable using a regex?

They are control characters, they don't have a printable representation. I don't know how they ended up in your file.
In a regex, you can search for SOH with \u0001 and ACK with \u0006

I said I didn't care about the reason at the time. I did eventually find out what the problem was the next time this happened. Turned out it had to do with running a LaTeX auto-build tool in the background. It was already being compiled in one shell and hung due to an error. The next time I tried to compile, these control characters started showing up in my editor.
I no longer make assumptions about the number of processes I have running. ps aux | grep <whatever> is your friend.

What made many of the coding websites converting standard " into non standard ”?

This question is about standard double quote " and non-standard double quote “ & ”
Yesterday when I searched for some sample facebook serverfbml codes, and came upon to this
http://mahmudahsan.wordpress.com/2008/11/22/facebook-fbml-rendering-in-iframe-application/
okay so it has got what I want, so I copied the code to my project and run it... bah... lots of errors
Why? Because the site turned the standard double quote " inside his script into “ or ” ,
or single quote from ' into ’
This is not the first time I faced this problem when copying codes from the Internet, and I believe many of the code writers haven't expected that the site turned their single/double quotes into strange ones.
Any explanation to this strange phenomenon ?
edited: I notice the title converted my " into “ & ” too... let me edit it... oh and I failed

At least in the title or in the text, it looks much better to have typographic double quotes (i.e. is more pleasant to the eye). Coding sites should not do this for actual code, i.e. in StackOverflow code that is indented by four spaces. If a double quote in text is converted to typographic, it's fine.
This gets really worse when you paste typographic quotes into a console that tries to display the character and falls back to a standard quote, because the console font does not have a typographic quote. Because then it looks like it's a standard one, but it isn't. Not much you can do about it, other than use a code display plugin on your website that does not change code.

The problem is in the underlying blog engine. Wordpress does that by default, and there is AFAIK no way to turn it off (Without changing the code). Given the fact that there are only relatively few really great blog engines, there may not always be a choice to switch to something "better".
Also in the same category: Fancy dashes, aka. turning - into –

the source shows that the quote char is sometimes ”
that's the quote that is the good looking quote which will cause problem in a program.
i think either the WordPress text editor or storage/retrieval converted the ordinary quote into that one.
You can use the replace function in your program editor to replace those characters.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse