Characters converted to junk when copy-pasted into a text box

Whenever I copy and paste any of the characters mentioned below into a text box
Below are the copied characters (test this in Notepad):
…
”
‘
Below are the typed characters:
...
"
'
then they are converted to junk characters. How can I block this?
When I type those characters from the keyboard it works, but when I copy and paste them they are converted to junk.
How can I detect and delete all of these characters before processing? The user doesn't know about this issue.
I want to delete those characters when the user presses the Submit button.

“ and ‘ are not junk characters. They are perfectly good Unicode characters (U+201C LEFT DOUBLE QUOTATION MARK and U+2018 LEFT SINGLE QUOTATION MARK). Modern applications should be capable of dealing with all Unicode characters; if you can't handle the smart quotes you probably also can't handle accents, Greek, Cyrillic, Chinese or any of the other characters users are likely to want to use. You should concentrate on ensuring that your application supports Unicode, rather than trying to fix this one visible symptom.
Pasting ' and " (ASCII straight quote) characters into a text box should not turn them into non-ASCII ‘smart’ quotes. Where they typically tend to come from is Microsoft Word's misguided ‘AutoCorrect’ feature, which replaces straight quotes with smart quotes as you type. This is an annoyance, but ultimately it's limited to Office and there's not really much you can do about it. Whilst you can manually replace “ and ” with " by doing a trivial string replacement (and how you do that depends on what language/environment you are talking about), you'll also be removing correct usage of those characters, and you won't be fixing all the other sad broken auto-replacements that MS Office does.
The … single-character ellipsis is a slightly different case, and arguably ‘junk’: to Unicode, U+2026 HORIZONTAL ELLIPSIS is a ‘compatibility character’ which is only intended to round-trip nicely to existing encodings that include it as a separate character. Normally three dot characters should be used instead. You can replace compatibility characters by using Unicode normalisation, in particular Normal Form KC. Again, how you access normalisation is something that depends on your programming language/environment. For example in Python, unicodedata.normalize('NFKC', u'…') gives you u'...'.
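As a rough Python 3 sketch of the two approaches described above: NFKC normalisation expands the ellipsis, while the curly quotes have no compatibility mapping and need an explicit replacement. The helper name `straighten_punctuation` and the replacement table are illustrative assumptions, not part of any library.

```python
import unicodedata

# Hypothetical mapping: curly quotes to their straight ASCII counterparts.
SMART_QUOTES = {
    "\u2018": "'",  # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK
    "\u201c": '"',  # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',  # RIGHT DOUBLE QUOTATION MARK
}
_QUOTE_TABLE = str.maketrans(SMART_QUOTES)


def straighten_punctuation(text):
    """Expand compatibility characters (NFKC), then straighten curly quotes."""
    return unicodedata.normalize("NFKC", text).translate(_QUOTE_TABLE)


# NFKC alone leaves the curly quotes untouched but expands the ellipsis.
print(unicodedata.normalize("NFKC", "\u201cWait\u2026\u201d"))
print(straighten_punctuation("\u201cWait\u2026\u201d she said, \u2018no\u2019"))
```

Note that NFKC also rewrites other compatibility characters (ligatures, full-width forms and so on), and the quote replacement removes legitimate uses of those characters, so this is only appropriate if plain ASCII punctuation is really what you want.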

Is a VNC client or server running? Try shutting down all VNC servers and clients, then check whether copy and paste works.

Related

Character Encoding Issue - Characters Being Replaced with Random Characters after Saving in Textarea

I'm working with a third-party company and I'm trying/hoping to determine the cause of a character encoding issue before I bring it up with them.
This company has a custom drag-and-drop editor for designing websites on their platform. Within the editor they have a Raw HTML widget that I can drag in and add my own content to. The problem is that when I copy HTML from someone's old website, using the inspector tool, and paste it into this widget of theirs, all of the apostrophes and double quotes get replaced with gibberish. I also have the same issue when I first paste the content into Notepad, Notepad++ or Sublime Text and then paste it into their Raw HTML editor.
Here's a recording of the issue and a few examples:
https://streamable.com/phwn2
Here are the known characters that get replaced and what they get replaced with:
’ turns into â™
“ turns into âœ
” turns into â
+ turns into (a space)
Å turns into Ã…
" stays as "
' stays as '
Does anyone see a pattern with these characters or know what could be the cause of these characters being replaced?
The website probably uses UTF-8 encoding, and the company's editor might be reading it as something like Windows-1252. In your first example, the right single quote has the UTF-8 encoding e2 80 99. When each of those bytes is read by a program using Windows-1252, you get "small latin letter a with circumflex" (e2), the euro sign (80; in ISO-8859-1 this byte is a non-printing control code instead) and "trademark" (99). I haven't checked the other transformations. If this is the problem, then you could work around it by first converting the copied characters to the destination encoding with iconv before pasting into the company's editor.
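To check the hypothesis above, here is a small Python 3 sketch that reproduces the misreading: it encodes a character as UTF-8 and then decodes those bytes as Windows-1252. The function names are made up for illustration, and the actual editor may well be doing something different.

```python
def misread_as_cp1252(text):
    """Encode as UTF-8, then (wrongly) decode the bytes as Windows-1252."""
    # errors="replace" because a few bytes (0x9D, for example) are
    # unassigned in Python's cp1252 codec and would otherwise raise.
    return text.encode("utf-8").decode("cp1252", errors="replace")


def repair(mojibake):
    """Undo the misreading; only works if no bytes were lost on the way."""
    return mojibake.encode("cp1252").decode("utf-8")


# RIGHT SINGLE QUOTATION MARK, LEFT DOUBLE QUOTATION MARK, A WITH RING ABOVE
for ch in ["\u2019", "\u201c", "\u00c5"]:
    garbled = misread_as_cp1252(ch)
    print(ch, ch.encode("utf-8").hex(), "->", garbled, "->", repair(garbled))
```

The right double quote (e2 80 9d) cannot be round-tripped this way, because 0x9D has no character assigned in Windows-1252.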

What is this character: 🔖 ? Where can I see the similar characters?

🔖
I am not sure whether everyone can see the above character, but I can see it. I got it when I input "booknote" in Chinese on my iPhone. To my surprise, this character seems "platform-insensitive": it can be seen on my phones, in Chrome on my laptop, and even in the macOS terminal.
Is it an ASCII character? I've never seen colorful characters like this before. How long have these been around? And where can I get a list of similar characters?
Here: http://www.unicode.org/charts/nameslist/index.html
You put the character on an HTML page. All characters on an HTML page are from the Unicode character set. Characters that are not in the Unicode character set either soon will be or are too specialized to be of general use.
The Unicode Consortium occasionally publishes a new version of the character set. Since you ask about the kind of character, the common partitions of the character set are blocks, categories, and—stretching a bit—which version the character was added in. Some characters are in a script (for a language writing system), some are not. You see the block and category of 🔖 at http://www.fileformat.info/info/unicode/char/1f516/index.htm.
The Unicode character set is published in text files called the Unicode Character Database (UCD), as well as many supplementary documents and webpages. The data includes important information about usage and relationships. For example, for applicable characters, which character is considered the uppercase form of another in a particular language.
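Many languages ship a copy of the Unicode Character Database. As a minimal Python 3 sketch (assuming the standard `unicodedata` module, which exposes names and categories but not block names), you can look the character up like this:

```python
import unicodedata

ch = "\U0001F516"  # the character from the question

print(f"U+{ord(ch):04X}")            # code point in hexadecimal
print(unicodedata.name(ch))          # official character name
print(unicodedata.category(ch))      # general category ("So" = Symbol, other)
print(ch.isascii())                  # False: ASCII stops at U+007F
print(unicodedata.unidata_version)   # UCD version bundled with this Python
```

The name it reports is BOOKMARK, which is also a usable search term in the code charts linked above.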
To see any character, you have to use a font that presents it. This can be a problem for some characters. There is probably no one font that presents every Unicode character as it was meant to be.
You mentioned ASCII. Although it is used every day in HTTP headers and other specialized and historical applications, ASCII is such a limited character set that it hasn't been in general use for decades.

Why is this LSEP symbol showing up on Chrome and not Firefox or Edge?

So this web page is rendering with these symbols and they are found throughout this website/application but on no other sites. Can anyone tell me
What this symbol is?
Why it is showing up only in one browser?
That character is U+2028 Line Separator, which is a kind of newline character. Think of it as the Unicode equivalent of HTML’s <br>.
As to why it shows up here: my guess would be that an internal database uses LSEP to not conflict with literal newlines or HTML tags (which might break the database or cause security errors), and either:
The server-side scripts that convert the database to HTML neglected to replace LSEP with <br>
Chrome just breaks standards by displaying LSEP as a printing (visible) character, or
You have a font installed that displays LSEP as a printing character that only Chrome detects. To figure out which font it is, right click on the offending text and click “Inspect”, then switch to the “Computed” tab on the right-hand panel. At the very bottom you should see a section labeled “Rendered Fonts” which will help you locate the offending font.
More information on the line separator, excerpted from the Unicode standard, Chapter 5.8, Newline Guidelines (on p. 12 of this PDF):
Line Separator and Paragraph Separator
A paragraph separator—independent of how it is encoded—is used to indicate a
separation between paragraphs. A line separator indicates where a line break
alone should occur, typically within a paragraph. For example:
This is a paragraph with a line separator at this point,
causing the word “causing” to appear on a different line, but not causing
the typical paragraph indentation, sentence breaking, line spacing, or
change in flush (right, center, or left paragraphs).
For comparison, line separators basically correspond to HTML <BR>, and
paragraph separators to older usage of HTML <P> (modern HTML delimits
paragraphs by enclosing them in <P>...</P>). In word processors, paragraph
separators are usually entered using a keyboard RETURN or ENTER; line
separators are usually entered using a modified RETURN or ENTER, such as
SHIFT-ENTER.
A record separator is used to separate records. For example, when exchanging
tabular data, a common format is to tab-separate the cells and to use a CRLF
at the end of a line of cells. This function is not precisely the same as line
separation, but the same characters are often used.
Traditionally, NLF started out as a line separator (and sometimes record
separator). It is still used as a line separator in simple text editors such as
program editors. As platforms and programs started to handle word processing
with automatic line-wrap, these characters were reinterpreted to stand for
paragraph separators. For example, even such simple programs as the Windows
Notepad program and the Mac SimpleText program interpret their platform’s NLF
as a paragraph separator, not a line separator. Once NLF was reinterpreted to
stand for a paragraph separator, in some cases another control character was
pressed into service as a line separator. For example, vertical tabulation VT
is used in Microsoft Word. However, the choice of character for line separator
is even less standardized than the choice of character for NLF. Many Internet
protocols and a lot of existing text treat NLF as a line separator, so an
implementer cannot simply treat NLF as a paragraph separator in all
circumstances.
Further reading:
Unicode Technical Report #13: Newline Guidelines
General Punctuation (U+2000–U+206F) chart PDF
SE: Why are there so many spaces and line breaks in Unicode?
SO: What is unicode character 2028 (LS / Line Separator) used for?
U+2028 on codepoints.net A misprint here says that U+2028 was added in v. 1.1 of the Unicode standard, which is false — it was added in 1.0
I found that in WordPress the easiest way to remove "L SEP" and "P SEP" characters is to execute these two SQL queries:
UPDATE wp_posts SET post_content = REPLACE(post_content, UNHEX('e280a9'), '');
UPDATE wp_posts SET post_content = REPLACE(post_content, UNHEX('e280a8'), '');
The javascript way (mentioned in some of the answers) can break some things (in my case some modal windows stopped working).
You can use this tool...
http://www.nousphere.net/cleanspecial.php
...to remove all the special characters that Chrome displays.
Steps:
Paste your HTML and Clean using HTML option.
You can manually delete the characters in the editor on this page and see the result.
Paste back your HTML in file and save :)
I recently ran into this issue, tried a number of fixes but ultimately I had to paste the text into VIM and there was an extra space I had to delete. I tried a number of HTML cleaners but none of them worked, VIM was the key!
9999years' answer is great.
If you use Symfony with Twig templates, I would recommend checking for an empty Twig block. In my case it was an empty Twig block with an invisible character inside.
The LSEP character was only displayed on certain devices and browsers.
On the others I had a blank space above the header and could not see any invisible character.
I had to inspect the GET request to see that the value 1f18 was before the opening html tag.
Once I removed the empty Twig block it was gone.
Hope this can help someone one day ...
My problem was similar, it was "PSEP" or "P SEP". Similar issue, an invisible character in my file.
I replaced \x{2029} with a normal space. Fixed. This problem only appeared on Windows Chrome. Not on my Mac.
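If the text passes through application code rather than a database, the same cleanup can be sketched in Python 3. The function names below are hypothetical, and replacing the separators with a space is just one choice; a newline may suit your content better.

```python
import re

LINE_SEP = "\u2028"       # LINE SEPARATOR, shown by Chrome as "LSEP"
PARAGRAPH_SEP = "\u2029"  # PARAGRAPH SEPARATOR, shown as "PSEP"
_SEPARATORS = re.compile(f"[{LINE_SEP}{PARAGRAPH_SEP}]")


def find_separators(text):
    """Return (index, code point) pairs for every LSEP/PSEP in text."""
    return [(m.start(), f"U+{ord(m.group()):04X}")
            for m in _SEPARATORS.finditer(text)]


def replace_separators(text, replacement=" "):
    """Replace every LSEP/PSEP with the given replacement string."""
    return _SEPARATORS.sub(replacement, text)


sample = "first line\u2028second line\u2029next paragraph"
print(find_separators(sample))
print(replace_separators(sample))

# The UTF-8 bytes are the hex values used in the SQL queries above.
print(LINE_SEP.encode("utf-8").hex(), PARAGRAPH_SEP.encode("utf-8").hex())
```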
I agree with @Kapil Bathija. Basically you can copy and paste your HTML code into http://www.nousphere.net/cleanspecial.php and convert it.
It will convert the special characters for you. Then remove the spaces in between the words: wherever you have to press backspace twice, there is an invalid character that couldn't be translated.
I had the same issue and it worked just fine afterwards.
You can also copy the text, paste it into a HTML editor such as Coda, remove the linebreak, copy it and paste it back into your site.
Video here: https://www.loom.com/share/501498afa7594d95a18382f1188f33ce
Looks like my client pasted HTML into Wordpress after initially creating it with MS-Word. Even deleting the and visible spaces did not fix the issue. The extended characters became visible in vi/vim.
If you don't have vi/vim available, try highlighting from 2 chars before the LSEP to 2 chars after the LSEP; delete that chunk, and re-type the correct characters.

Can a combining character be used alone in Unicode?

Let's take COMBINING ACUTE ACCENT, for example. Its browser test page does include it alone in the page, but it reacts in a strange way: I can't select it with my mouse, and if I try to interact with it in the DOM inspector, it feels like it's not part of the text at all (there's no before and after this character):
Is a combining character, used alone, still a valid Unicode string?
Or does it have to follow another character?
Yes, a combining character alone is a valid Unicode string (even though its behaviour may be weird without a base character). Section 2.11 of the Unicode Standard emphasises this:
In the Unicode Standard, all sequences of character codes are permitted.
The presentation of such strings is described in D52:
There may be no such base character, such as when a combining character is at the start of text or follows a control or format character [...] In such cases, the combining characters are called isolated combining characters.
With isolated combining characters or when a process is unable to perform graphical combination, a process may present a combining character without graphical combination; that is, it may present it as if it were a base character.
However, if you want to display a combining character by itself, it is recommended that you attach it to a no-break space base character:
Nonspacing combining marks used by the Unicode Standard may be exhibited in apparent
isolation by applying them to U+00A0 NO-BREAK SPACE. This convention might be
employed, for example, when talking about the combining mark itself as a mark, rather
than using it in its normal way in text (that is, applied as an accent to a base letter or in
other combinations).
Also, a dotted circle ◌ (U+25CC) character can be used as a base character.
Source: https://en.wikipedia.org/wiki/Dotted_circle
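A short Python 3 sketch of the options mentioned in this thread, using the standard `unicodedata` module: the mark on its own, attached to a no-break space, attached to a dotted circle, and attached to an ordinary base letter. How each string is drawn depends on the font and renderer.

```python
import unicodedata

ACUTE = "\u0301"  # COMBINING ACUTE ACCENT

print(unicodedata.name(ACUTE))
print(unicodedata.category(ACUTE))   # "Mn": nonspacing mark
print(unicodedata.combining(ACUTE))  # canonical combining class (non-zero)

isolated = ACUTE                     # valid string, but awkward to render
on_nbsp = "\u00a0" + ACUTE           # convention recommended by the standard
on_dotted_circle = "\u25cc" + ACUTE  # dotted circle as a visible placeholder
composed = unicodedata.normalize("NFC", "e" + ACUTE)  # becomes a single "é"

for label, s in [("isolated", isolated), ("nbsp", on_nbsp),
                 ("dotted circle", on_dotted_circle), ("composed", composed)]:
    print(label, len(s), [f"U+{ord(c):04X}" for c in s])
```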

What is the difference between \r and \n?

How are \r and \n different? I think it has something to do with Unix vs. Windows vs. Mac, but I'm not sure exactly how they're different, and which to search for/match in regexes.
They're different characters. \r is carriage return, and \n is line feed.
On "old" printers, \r sent the print head back to the start of the line, and \n advanced the paper by one line. Both were therefore necessary to start printing on the next line.
Obviously that's somewhat irrelevant now, although depending on the console you may still be able to use \r to move to the start of the line and overwrite the existing text.
More importantly, Unix tends to use \n as a line separator; Windows tends to use \r\n as a line separator and Macs (up to OS 9) used to use \r as the line separator. (Mac OS X is Unix-y, so uses \n instead; there may be some compatibility situations where \r is used instead though.)
For more information, see the Wikipedia newline article.
EDIT: This is language-sensitive. In C# and Java, for example, \n always means Unicode U+000A, which is defined as line feed. In C and C++ the water is somewhat muddier, as the meaning is platform-specific. See comments for details.
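On the regex part of the question: one common approach is to match all three conventions at once, trying \r\n first so that a Windows line ending is not counted as two breaks. A minimal Python 3 sketch (the sample string and the pattern name are just illustrations):

```python
import re

print(ord("\r"), ord("\n"))  # 13 10: carriage return and line feed

mixed = "unix\nwindows\r\nold mac\rend"

# Alternation order matters: CRLF must be tried before a lone CR or LF.
NEWLINE = re.compile(r"\r\n|\r|\n")

print(NEWLINE.split(mixed))
print(mixed.splitlines())     # built-in; also splits on other Unicode breaks

normalized = NEWLINE.sub("\n", mixed)  # convert everything to Unix style
print(repr(normalized))
```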
In C and C++, \n is a concept, \r is a character, and \r\n is (almost always) a portability bug.
Think of an old teletype. The print head is positioned on some line and in some column. When you send a printable character to the teletype, it prints the character at the current position and moves the head to the next column. (This is conceptually the same as a typewriter, except that typewriters typically moved the paper with respect to the print head.)
When you wanted to finish the current line and start on the next line, you had to do two separate steps:
move the print head back to the beginning of the line, then
move it down to the next line.
ASCII encodes these actions as two distinct control characters:
\x0D (CR) moves the print head back to the beginning of the line. (Unicode encodes this as U+000D CARRIAGE RETURN.)
\x0A (LF) moves the print head down to the next line. (Unicode encodes this as U+000A LINE FEED.)
In the days of teletypes and early technology printers, people actually took advantage of the fact that these were two separate operations. By sending a CR without following it by a LF, you could print over the line you already printed. This allowed effects like accents, bold type, and underlining. Some systems overprinted several times to prevent passwords from being visible in hardcopy. On early serial CRT terminals, CR was one of the ways to control the cursor position in order to update text already on the screen.
But most of the time, you actually just wanted to go to the next line. Rather than requiring the pair of control characters, some systems allowed just one or the other. For example:
Unix variants (including modern versions of Mac) use just a LF character to indicate a newline.
Old (pre-OSX) Macintosh files used just a CR character to indicate a newline.
VMS, CP/M, DOS, Windows, and many network protocols still expect both: CR LF.
Old IBM systems that used EBCDIC standardized on NL--a character that doesn't even exist in the ASCII character set. In Unicode, NL is U+0085 NEXT LINE, but the actual EBCDIC value is 0x15.
Why did different systems choose different methods? Simply because there was no universal standard. Where your keyboard probably says "Enter", older keyboards used to say "Return", which was short for Carriage Return. In fact, on a serial terminal, pressing Return actually sends the CR character. If you were writing a text editor, it would be tempting to just use that character as it came in from the terminal. Perhaps that's why the older Macs used just CR.
Now that we have standards, there are more ways to represent line breaks. Although extremely rare in the wild, Unicode has new characters like:
U+2028 LINE SEPARATOR
U+2029 PARAGRAPH SEPARATOR
Even before Unicode came along, programmers wanted simple ways to represent some of the most useful control codes without worrying about the underlying character set. C has several escape sequences for representing control codes:
\a (for alert) which rings the teletype bell or makes the terminal beep
\f (for form feed) which moves to the beginning of the next page
\t (for tab) which moves the print head to the next horizontal tab position
(This list is intentionally incomplete.)
This mapping happens at compile-time--the compiler sees \a and puts whatever magic value is used to ring the bell.
Notice that most of these mnemonics have direct correlations to ASCII control codes. For example, \a would map to 0x07 BEL. A compiler could be written for a system that used something other than ASCII for the host character set (e.g., EBCDIC). Most of the control codes that had specific mnemonics could be mapped to control codes in other character sets.
Huzzah! Portability!
Well, almost. In C, I could write printf("\aHello, World!"); which rings the bell (or beeps) and outputs a message. But if I wanted to then print something on the next line, I'd still need to know what the host platform requires to move to the next line of output. CR LF? CR? LF? NL? Something else? So much for portability.
C has two modes for I/O: binary and text. In binary mode, whatever data is sent gets transmitted as-is. But in text mode, there's a run-time translation that converts a special character to whatever the host platform needs for a new line (and vice versa).
Great, so what's the special character?
Well, that's implementation dependent, too, but there's an implementation-independent way to specify it: \n. It's typically called the "newline character".
This is a subtle but important point: \n is mapped at compile time to an implementation-defined character value which (in text mode) is then mapped again at run time to the actual character (or sequence of characters) required by the underlying platform to move to the next line.
\n is different than all the other backslash literals because there are two mappings involved. This two-step mapping makes \n significantly different than even \r, which is simply a compile-time mapping to CR (or the most similar control code in whatever the underlying character set is).
This trips up many C and C++ programmers. If you were to poll 100 of them, at least 99 will tell you that \n means line feed. This is not entirely true. Most (perhaps all) C and C++ implementations use LF as the magic intermediate value for \n, but that's an implementation detail. It's feasible for a compiler to use a different value. In fact, if the host character set is not a superset of ASCII (e.g., if it's EBCDIC), then \n will almost certainly not be LF.
So, in C and C++:
\r is literally a carriage return.
\n is a magic value that gets translated (in text mode) at run-time to/from the host platform's newline semantics.
\r\n is almost always a portability bug. In text mode, this gets translated to CR followed by the platform's newline sequence--probably not what's intended. In binary mode, this gets translated to CR followed by some magic value that might not be LF--possibly not what's intended.
\x0A is the most portable way to indicate an ASCII LF, but you only want to do that in binary mode. Most text-mode implementations will treat that like \n.
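The text-mode versus binary-mode distinction is not unique to C. As an analogy only (this is Python 3's `open()` behaviour, not C's), the sketch below writes "\n" through a text-mode file configured to use CR LF and then inspects the raw bytes; the temporary file name is arbitrary.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")

    # Text mode: each "\n" written is translated to the chosen line ending.
    with open(path, "w", newline="\r\n") as f:
        f.write("one\ntwo\n")

    # Binary mode: the bytes come back exactly as stored on disk.
    with open(path, "rb") as f:
        raw = f.read()

    # Text mode with universal newlines: line endings are mapped back to "\n".
    with open(path, "r", newline=None) as f:
        text = f.read()

print(raw)
print(repr(text))
```

Passing newline="\r\n" forces the Windows convention so the result is the same on every platform; by default Python uses whatever the host platform prefers, which is the run-time mapping described above.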
"\r" => Return
"\n" => Newline or Linefeed
(semantics)
Unix based systems use just a "\n" to end a line of text.
DOS uses "\r\n" to end a line of text.
Some other machines used just a "\r". (Commodore, Apple II, Mac OS prior to OS X, etc..)
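\r moves the cursor back to the start of the current line, so whatever is printed next overwrites the text already there, e.g.
    #include <stdio.h>
    int main(void)
    {
        printf("\nab");  /* newline, then "ab" */
        printf("\bsi");  /* backspace over 'b', then "si": line reads "asi" */
        printf("\rha");  /* back to line start, "ha" overwrites "as" */
        return 0;
    }
On a terminal that honours \b and \r, this produces the following output: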
hai
\n is for new line.
In short \r has ASCII value 13 (CR) and \n has ASCII value 10 (LF).
Mac uses CR as line delimiter (at least, it did before, I am not sure for modern macs), *nix uses LF and Windows uses both (CRLF).
In addition to @Jon Skeet's answer:
Traditionally Windows has used \r\n, Unix \n and Mac \r, however newer Macs use \n as they're unix based.
\r is Carriage Return; \n is New Line (Line Feed) ... depends on the OS as to what each means. Read this article for more on the difference between '\n' and '\r\n' ... in C.
In C# I found they use \r\n in a string.
\r is used for carriage return (ASCII value 13).
\n is used for new line (ASCII value 10).