MicroPython automatically converting carriage returns to newlines

I was modifying some terminal code to make it portable and work nicely with Windows clients when I ran into this issue.
When I press Enter in a Windows terminal, MicroPython (v1.19.1 on a Raspberry Pi Pico) appears to receive two linefeed characters (0x0A + 0x0A) instead of a carriage return followed by a linefeed (0x0D + 0x0A).
Code:
import sys
while True:
    b = sys.stdin.read(1)
    sys.stdout.write("(" + hex(ord(b)) + ")")
Run the code, connect to port using a modified PuTTY configured to send a CRLF, hit enter and you get:
I get similar results when using the Thonny shell or the REPL shell included in the MicroPython plugin for PyCharm.
Am I doing something wrong here or is this a quirk (or bug) with MicroPython?

It turns out this is standard behaviour for Python, so has been implemented in MicroPython.
It's related to PEP 278 – Universal Newline Support, which translates all newlines to a single linefeed "\n".
In this particular case, because I'm reading the characters one at a time, it has no chance to see the CR + LF together and translate the pair to a single LF. Each character is read on its own: the CR is turned into an LF, and the LF passes through unmodified.
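The effect can be reproduced off-device with a small Python sketch. It is only a simulation under stated assumptions: translate_per_char and translate_with_lookahead are hypothetical helper names, not MicroPython functions, and the first one assumes the stream maps every CR to LF with no look-ahead, which matches the behaviour observed above.

```python
def translate_per_char(raw: bytes) -> bytes:
    """Assumed model of the text stream: each CR becomes LF, one byte at a time."""
    return bytes(0x0A if b == 0x0D else b for b in raw)


def translate_with_lookahead(raw: bytes) -> bytes:
    """Universal-newline style translation that can see CR and LF together."""
    return raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


enter_key = b"\r\n"  # what a terminal configured for CRLF sends
print([hex(b) for b in translate_per_char(enter_key)])        # ['0xa', '0xa']
print([hex(b) for b in translate_with_lookahead(enter_key)])  # ['0xa']
```

The first line of output mirrors the two linefeeds seen on the Pico; the second shows what a translator that sees the whole pair would produce.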
You can use sys.stdin.buffer.read(1) to get the unmodified stream, in which case my test code becomes:
import sys
while True:
    b = sys.stdin.buffer.read(1)
    print("(" + hex(ord(b)) + ")")
This produces the expected output:

Related

What is the difference between \n and \r in swift and when to use each? [duplicate]

What’s the difference between \n (newline) and \r (carriage return)?
In particular, are there any practical differences between \n and \r? Are there places where one should be used instead of the other?
In terms of ascii code, it's 3 -- since they're 10 and 13 respectively;-).
But seriously, there are many:
in Unix and all Unix-like systems, \n is the code for end-of-line, \r means nothing special
as a consequence, in C and most languages that somehow copy it (even remotely), \n is the standard escape sequence for end of line (translated to/from OS-specific sequences as needed)
in old Mac systems (pre-OS X), \r was the code for end-of-line instead
in Windows (and many old OSs), the code for end of line is 2 characters, \r\n, in this order
as a (surprising;-) consequence (harking back to OSs much older than Windows), \r\n is the standard line-termination for text formats on the Internet
for electromechanical teletype-like "terminals", \r commands the carriage to go back leftwards until it hits the leftmost stop (a slow operation), \n commands the roller to roll up one line (a much faster operation) -- that's the reason you always have \r before \n, so that the roller can move while the carriage is still going leftwards!-) Wikipedia has a more detailed explanation.
for character-mode terminals (typically emulating even-older printing ones as above), in raw mode, \r and \n act similarly (except both in terms of the cursor, as there is no carriage or roller;-)
In practice, in the modern context of writing to a text file, you should always use \n (the underlying runtime will translate that if you're on a weird OS, e.g., Windows;-). The only reason to use \r is if you're writing to a character terminal (or more likely a "console window" emulating it) and want the next line you write to overwrite the last one you just wrote (sometimes used for goofy "ascii animation" effects of e.g. progress bars) -- this is getting pretty obsolete in a world of GUIs, though;-).
Historically a \n was used to move the carriage down, while the \r was used to move the carriage back to the left side of the page.
Two different characters.
\n is used as an end-of-line terminator in Unix text files
\r Was historically (pre-OS X) used as an end-of-line terminator in Mac text files
\r\n (ie both together) are used to terminate lines in Windows and DOS text files.
Since nobody else mentioned it specifically (are they too young to know/remember?) - I suspect the use of \r\n originated for typewriters and similar devices.
When you wanted a new line while using a multi-line-capable typewriter, there were two physical actions it had to perform: slide the carriage back to the beginning (left, in US) of the page, and feed the paper up one notch.
Back in the days of line printers the only way to do bold text, for example, was to do a carriage return WITHOUT a newline and print the same characters over the old ones, thus adding more ink, thus making it appear darker (bolded). When the mechanical "newline" function failed in a typewriter, this was the annoying result: you could type over the previous line of text if you weren't paying attention.
Two different characters for different Operating Systems. Also this plays a role in data transmitted over TCP/IP which requires the use of \r\n.
\n Unix
\r Mac
\r\n Windows and DOS.
To complete the picture:
In a shell (bash) script, you can use \r to send the cursor back to the start of the line and, of course, \n to move the cursor to a new line.
For example, try:
echo -en "AA--AA" ; echo -en "BB" ; echo -en "\rBB"
The first echo displays AA--AA
The second: AA--AABB
The last: BB--AABB
Don't forget to pass the -en options (-e enables backslash escapes, -n suppresses the trailing newline).
In windows, the \n moves to the beginning of the next line. The \r moves to the beginning of the current line, without moving to the next line. I have used \r in my own console apps where I am testing out some code and I don't want to see text scrolling up my screen, so rather than use \n after printing out some text, of say, a frame rate (FPS), I will printf("%-10d\r", fps); This will return the cursor to the beginning of the line without moving down to the next line and allow me to have other information on the screen that doesn't get scrolled off while the framerate constantly updates on the same line (the %-10 makes certain the output is at least 10 characters, left justified so it ends up padded by spaces, overwriting any old values for that line). It's quite handy for stuff like this, usually when I have debugging stuff output to my console screen.
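The same trick can be sketched in Python. This is a minimal illustration, assuming a console that honours \r; status_line is a hypothetical helper name, and the field width of 10 simply mirrors the %-10d format above.

```python
import sys
import time


def status_line(value: int, width: int = 10) -> str:
    """Left-justify value in a fixed-width field and end with CR instead of LF."""
    return f"{value:<{width}d}\r"


if __name__ == "__main__":
    for fps in (144, 60, 7):
        sys.stdout.write(status_line(fps))  # overwrites the previous value in place
        sys.stdout.flush()
        time.sleep(0.2)
    sys.stdout.write("\n")  # finally move to the next line
```

The padding matters: without it, writing 7 after 144 would leave the stale digits 44 on screen.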
A little history
The \r stands for return, or carriage return, which owes its history to the typewriter. A carriage return moved your carriage all the way to the right so you were typing at the start of the line.
The \n stands for new line; again from typewriter days, you moved down to a new line, but not necessarily to the start of it. That is why some OSes adopted the need for both a \r return followed by a \n newline, as that was the order a typewriter did it in. It also explains why old 8-bit computers had a Return key rather than Enter: the name comes from carriage return, which was familiar.
\r: Carriage R̲eturn CR—returns the carriage to the start of the line
\n: Line feed (N̲ew Line) LF—feed the paper up one line
In the context of screen output:
CR: returns the cursor to the start of the current line
LF: moves the cursor down one line
For example:
Hello,\nworld\r!
is supposed to render on your terminal as:
Hello,
!     world
Some operating systems may break compatibility with the intended behavior, but that doesn't change the answer to the question "Difference between \n and \r?".
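To make the cursor semantics concrete, here is a toy Python model of a screen that treats CR and LF strictly as described above. It is a sketch, not a real terminal emulator: render is a hypothetical name, and scrolling, tabs and line wrapping are ignored.

```python
def render(text: str) -> list:
    """Apply strict CR/LF cursor semantics and return the resulting screen rows."""
    rows = [[]]
    row = col = 0
    for ch in text:
        if ch == "\r":          # CR: back to column 0, same row
            col = 0
        elif ch == "\n":        # LF: down one row, same column
            row += 1
            if row == len(rows):
                rows.append([])
        else:                   # printable: write at the cursor, then advance
            line = rows[row]
            while len(line) <= col:
                line.append(" ")
            line[col] = ch
            col += 1
    return ["".join(line) for line in rows]


for line in render("Hello,\nworld\r!"):
    print(repr(line))
# 'Hello,'
# '!     world'
```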
What’s the difference between \n (newline) and \r (carriage return)?
In particular, are there any practical differences between \n and \r? Are there places where one should be used instead of the other?
I would like to run a short experiment with the escape sequences \n (newline) and \r (carriage return) to illustrate where the difference between them lies.
I know that this question was asked as language-independent. Nonetheless, we need at least one language to carry out the experiment. I've chosen C++, but the experiment should apply to most programming languages.
The program simply prints a sentence to the console on each iteration of a for loop.
Newline program:
#include <iostream>
int main(void)
{
    for(int i = 0; i < 7; i++)
    {
        std::cout << i + 1 <<".Walkthrough of the for-loop \n"; // Notice `\n` at the end.
    }
    return 0;
}
Output:
1.Walkthrough of the for-loop
2.Walkthrough of the for-loop
3.Walkthrough of the for-loop
4.Walkthrough of the for-loop
5.Walkthrough of the for-loop
6.Walkthrough of the for-loop
7.Walkthrough of the for-loop
Note that this result is not guaranteed on every system that runs this C++ code, but it should hold on most modern systems. Read below for more details.
Now, the same program, except that \n is replaced by \r at the end of the printed string.
Carriage return program:
#include <iostream>
int main(void)
{
    for(int i = 0; i < 7; i++)
    {
        std::cout << i + 1 <<".Walkthrough of the for-loop \r"; // Notice `\r` at the end.
    }
    return 0;
}
Output:
7.Walkthrough of the for-loop
Notice the difference? When you use the carriage return escape sequence \r at the end of each printed string, the next iteration's output does not start on the following text line: at the end of each print, the cursor does not jump to the *beginning of the next line.
Instead, the cursor jumps back to the beginning of the line it was already on. The result is that each iteration's output overwrites the previous one.
*Note: A \n does not necessarily jump to the beginning of the following text line. On some, generally older, operating systems the \n newline character only moves down a line and leaves the cursor at the same column rather than at the beginning. That is why they require \r\n to get to the start of the next text line.
This experiment shows the difference between newline and carriage return in the context of output produced by repeated prints.
On the input side, some terminals/consoles may implicitly convert a carriage return into a newline for better portability, compatibility and integrity.
But if you have the choice between the two, or need to use one explicitly, use the one that fits its purpose and keep the two strictly distinct.
Just to add to the confusion, I've been working on a simple text editor using a TextArea element in an HTML page in a browser. In anticipation of compatibility woes with respect to CR/LF, I wrote the code to check the platform, and use whichever newline convention was applicable to the platform.
However, I discovered something interesting when checking the actual characters contained in the TextArea, via a small JavaScript function that generates the hex data corresponding to the characters.
For the test, I typed in the following text:
Hello, World[enter]
Goodbye, Cruel World[enter]
When I examined the text data, the byte sequence I obtained was this:
48 65 6c 6c 6f 2c 20 57 6f 72 6c 64 0a 47 6f 6f 64 62 79 65 2c 20 43
72 75 65 6c 20 57 6f 72 6c 64 0a
Now, most people looking at this, and seeing 0a but no 0d bytes, would think that this output was obtained on a Unix/Linux platform. But, here's the rub: this sequence I obtained in Google Chrome on Windows 7 64-bit.
So, if you're using a TextArea element and examining the text, CHECK the output as I've done above, to make sure what actual character bytes are returned from your TextArea. I've yet to see if this differs on other platforms or other browsers, but it's worth bearing in mind if you're performing text processing via JavaScript, and you need to make that text processing platform independent.
The conventions covered in above posts apply to console output, but HTML elements, it appears, adhere to the UNIX/Linux convention. Unless someone discovers otherwise on a different platform/browser.
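The same kind of check is easy to reproduce on any string you have in hand. Below is a minimal Python sketch (the original test used JavaScript in the browser); hex_bytes and count_line_endings are hypothetical helper names, and UTF-8 is an assumed encoding.

```python
def hex_bytes(text: str, encoding: str = "utf-8") -> str:
    """Return the encoded bytes of text as space-separated lowercase hex pairs."""
    return " ".join(f"{b:02x}" for b in text.encode(encoding))


def count_line_endings(text: str) -> dict:
    """Count CRLF pairs, lone CRs and lone LFs separately."""
    crlf = text.count("\r\n")
    return {
        "CRLF": crlf,
        "CR": text.count("\r") - crlf,
        "LF": text.count("\n") - crlf,
    }


sample = "Hello, World\nGoodbye, Cruel World\n"
print(hex_bytes("Hello, World\n"))  # 48 65 6c 6c 6f 2c 20 57 6f 72 6c 64 0a
print(count_line_endings(sample))   # {'CRLF': 0, 'CR': 0, 'LF': 2}
```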
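#include <stdio.h>
#include <conio.h>  /* getche() and getch(): non-standard, DOS/Windows compilers only */

int main(void)
{
    int countch = 0;
    int countwd = 1;
    char ch = 'a';

    printf("Enter your sentence in lowercase: ");
    while (ch != '\r')  /* loop until the Enter key is read */
    {
        ch = getche();
        if (ch == ' ')
            countwd++;
        else
            countch++;
    }
    printf("\n Words = %d\n", countwd);
    printf("Characters = %d\n", countch - 1);  /* exclude the final '\r' */
    getch();
    return 0;
}
Take this example and try putting \n in place of \r: it will not work. Can you guess why?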

SyntaxError:(unicode error) 'unicodeescape' codec' can't decode bytes in position 0-5: truncated \UXXXXXXXX escape

Using Autokey 95.8, Python 3 version in Linux Mint 19.3 and I have a series of keyboard macros which generate Unicode characters. This example works:
# alt+shift+a = á
import sys
char = "\u00E1"
keyboard.send_keys(char)
sys.exit()
But the attempt to print an mdash [—] generates the following error:
SyntaxError:(unicode error) 'unicodeescape' codec' can't decode bytes in position 0-5: truncated \UXXXXXXXX escape
# alt+shift+- = —
import sys
char = "\u2014"
keyboard.send_keys(char)
sys.exit()
Any idea how to overcome this problem in Autokey is greatly appreciated.
The code you posted above would not generate the error you are getting: "truncated \UXXXXXXXX" refers to an uppercase \U, which must be followed by 8 hex digits. If you put char = "\U2014" in the Python source, you will get that error message (and you probably got it when experimenting with the file in this way).
The sequence char = "\u2014" will create an mdash Unicode character on the Python side, but that does not mean it is possible to send it as a keyboard symbol via AutoKey to the windowing system. That is the point where your program is likely failing (and since there is no programming error, you won't get a Python error message; it just won't work, although AutoKey might be nice and print an appropriate error message in this case).
You'd have to look up how to type an arbitrary Unicode character on your OS configuration (on Linux Mint it should be in the docs for "wayland", I guess), and send that character-composing sequence to AutoKey instead. If there is no such sequence, then find a way to copy the desired character to the window environment's clipboard and send AutoKey the "paste" sequence (usually ctrl + v, but it can vary by app; terminal emulators use ctrl + shift + v, for example).
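The difference between the two escape forms can be checked in plain Python 3, without AutoKey. In this sketch, compiles is a hypothetical helper that only reports whether a piece of source text parses.

```python
def compiles(source: str) -> bool:
    """Return True if the given Python source text compiles without a SyntaxError."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


# \u takes exactly 4 hex digits, \U takes exactly 8; both spell the same character.
print("\u2014" == "\U00002014")          # True

# The doubled backslashes keep the escapes literal inside the source being tested.
print(compiles('char = "\\u2014"'))      # True
print(compiles('char = "\\U2014"'))      # False: truncated \UXXXXXXXX escape
print(compiles('char = "\\U00002014"'))  # True
```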
When you need to emit non-English US characters in AutoKey, you have two choices. The simplest is to put them into the clipboard with clipboard.fill_clipboard(your characters) and paste them into the window using keyboard.send_keys("<ctrl>+v"). This almost always works.
If you need to define a phrase with multibyte characters in it, select the Paste using Clipboard (Ctrl+V) option. (I'm trying to get that to be the default option in a future release.)
The other choice, that I'm still not quite sure of, is directly sending the Unicode escape sequence to the window, letting it convert that into the actual Unicode character. Something like keyboard.send_keys("\U2014"). Assigning that to a variable first, as in the question, creates the actual Unicode character which that API call can't handle correctly.
The problem being that the underlying code for keyboard.send_keys() wants to send keycodes that actually exist on your keyboard or that it can add to an unused key in your layout. Most of the time that doesn't work for anything multibyte.

I need to remove a specific unicode in my existing subtitle text file

I basically work on subtitles, and I have this Arabic file. When I open it in Notepad, right-click and select SHOW UNICODE CONTROL CHARACTERS, it shows me some weird characters on the left of every line. I tried many ways to remove them but failed. I also tried Notepad++, but that failed too.
Notepad ++
SUBTITLE EDIT
EXCEL
WORD
288
00:24:41,960 --> 00:24:43,840
‫أتعلم، قللنا من شأنك فعلاً‬
289
00:24:44,000 --> 00:24:47,120
‫كان علينا تجنيدك لتكون جاسوساً‬
‫مكان (كاي سي)‬
290
00:24:47,280 --> 00:24:51,520
‫لا تعلمون كم أنا سعيد‬
‫لسماع ذلك‬
291
00:24:54,800 --> 00:24:58,160
‫لا تقلق، سيستيقظ نشيطاً غداً‬
292
00:24:58,320 --> 00:25:00,800
‫ولن يتذكر ما حصل‬
‫في الساعات الـ٦‬
The Unicode characters do not show up in the text above. The character is U+202B, which is displayed as a ¶ sign; after googling it, I think it's called PILCROW.
The issue is that the subtitles don't display correctly in the PS4 app.
I need this PILCROW sign to go away. With this website I can see the issue in the file: https://www.soscisurvey.de/tools/view-chars.php
The PILCROW ¶ is used by various software and publishers to show the end of a line in a document. The actual Unicode character does not exist in your file so you can't get rid of it.
The Unicode characters in these lines are 'RIGHT-TO-LEFT EMBEDDING' (code \u202b) and 'POP DIRECTIONAL FORMATTING' (code \u202c). These are used in the text to indicate that the enclosed text should be rendered right-to-left instead of the Western left-to-right direction.
Now, these characters are included as hints to the application displaying the text, rather than actually performing the text reversal, so they can likely be removed without compromising how the text is displayed.
Now, this is a programming Q&A site, but you did not indicate any programming language you are familiar with, at least enough to run a program. So it is very hard to know how to give an answer that is suitable for you.
Python can be used to create a small program to filter such characters from a file, but I am not willing to write a full-fledged GUI program, or a web app that you could run, just as an answer here.
A program that works from the command line just to filter out a few characters is another thing, as it is just a few lines of code.
You have to save the following listing as a file named, say, "fixsubtitles.py", and, in a terminal ("cmd" if you are on Windows), type python3 fixsubtitles.py \path\to\subtitlefile.txt and press Enter.
That, of course, is after installing the Python 3 runtime from http://python.org
(if you are on Mac or Linux, it is usually already installed).
import sys
from pathlib import Path

encoding = "utf-8"
# Translation table that deletes RIGHT-TO-LEFT EMBEDDING and POP DIRECTIONAL FORMATTING
remove_set = str.maketrans("", "", "\u202b\u202c")

if len(sys.argv) < 2:
    print("Usage: python3 fixsubtitles.py [filename]", file=sys.stderr)
    sys.exit(1)

path = Path(sys.argv[1])
data = path.read_text(encoding=encoding)
# Overwrites the original file in place
path.write_text(data.translate(remove_set), encoding=encoding)
print("Done")
You may need to adjust the encoding, as Windows does not always use utf-8 (the files can be in, for example, "cp1256"; if you get a Unicode error when running the program, try using this in place of "utf-8"), and maybe add more characters to the set of characters to be removed; the tool you linked in the question should show you other such characters, if any. Other than that, the program above should work.
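To find out which other invisible formatting characters a file contains before extending the removal set, a short Python sketch like the following may help. format_characters is a hypothetical helper name; it relies on the standard unicodedata module and reports every character in the Unicode "Cf" (format) category, which is an assumption about what you want to inspect.

```python
import unicodedata


def format_characters(text: str) -> list:
    """Return sorted (code point, name) pairs for each distinct format (Cf) character."""
    found = {ch for ch in text if unicodedata.category(ch) == "Cf"}
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in sorted(found)
    ]


sample = "\u202babc\u202c"  # stand-in for one subtitle line
for code, name in format_characters(sample):
    print(code, name)
# U+202B RIGHT-TO-LEFT EMBEDDING
# U+202C POP DIRECTIONAL FORMATTING
```

To use it on a real file, read the text with Path.read_text using the same encoding as in the script above and pass the result to the function.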

Show info about current character in status bar in Sublime Text 2

I'm missing one useful feature that other text editors often offer. In the bottom status bar they show the ASCII and UTF code of the current character, i.e. the character before or after the current position (I'm not sure which). I cannot find a package or a native feature that does that.
Thank you for your help.
I made a plugin for this :)
Create an anyname.py file in your Packages/User/ directory.
import sublime, sublime_plugin, textwrap, unicodedata
class utfcodeCommand(sublime_plugin.EventListener):
    def on_selection_modified(self, view):
        # some test chars = $ €
        sublime.status_message('Copying with pretty format')
        selected = view.substr(view.sel()[0].a)
        char = str(selected)
        view.set_status('Charcode', "ASCII: " + str(ord(selected)) + " UTF: " + str(char.encode("unicode_escape"))[2:-1])
This should show you the ASCII and Unicode code in the status bar of the character to the right of the caret.
Tell me if this works for you, tested with ST3 on Kubuntu Linux 12.04 x64.
Probably won't work on ST2 because of the different Python versions.
Here is one such plugin, it displays the character code in decimal: Show Character Code
Simple Sublime Text plugin for displaying decimal code of the current character in the status bar
Although it shows only the decimal value for the character code
I ran into several issues with the code posted by Sergey Telshevsky in ST2 / Python 2.7:
I got a SyntaxError: Non-ASCII character '\xe2' in file ./display_character_code.py on line 7 because of the # some test chars = $ € - removing this commented out code, or declaring a character encoding at the top of the Python code, e.g. # -*- coding: UTF-8 -*- gets rid of the error. I also got UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' when selecting the sample "€" (because it is not an ASCII character). And even after fixing these, the Unicode key was never displayed; e.g. the status bar showed ASCII: 123 UTF:. So I reworked his example and came up with the following:
import sublime_plugin
class statusCharCodes(sublime_plugin.EventListener):
    def on_selection_modified(self, view):
        selected = view.substr(view.sel()[0].a)
        try:
            ascii = str(ord(selected.encode("ascii"))).zfill(3)
        except:
            ascii = "n/a"
        try:
            utf = "U+" + str(format(ord(selected),"x")).zfill(4).upper()
        except:
            utf = "n/a"
        view.set_status("Charcode", "ASCII: " + ascii + " UTF: " + utf)
Example output:

What is the difference between \r and \n?

How are \r and \n different? I think it has something to do with Unix vs. Windows vs. Mac, but I'm not sure exactly how they're different, and which to search for/match in regexes.
They're different characters. \r is carriage return, and \n is line feed.
On "old" printers, \r sent the print head back to the start of the line, and \n advanced the paper by one line. Both were therefore necessary to start printing on the next line.
Obviously that's somewhat irrelevant now, although depending on the console you may still be able to use \r to move to the start of the line and overwrite the existing text.
More importantly, Unix tends to use \n as a line separator; Windows tends to use \r\n as a line separator and Macs (up to OS 9) used to use \r as the line separator. (Mac OS X is Unix-y, so uses \n instead; there may be some compatibility situations where \r is used instead though.)
For more information, see the Wikipedia newline article.
EDIT: This is language-sensitive. In C# and Java, for example, \n always means Unicode U+000A, which is defined as line feed. In C and C++ the water is somewhat muddier, as the meaning is platform-specific. See comments for details.
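On the regex part of the question, a common approach is to match all three conventions at once. The sketch below uses Python's re module; NEWLINE and split_lines are hypothetical names, and the alternation order matters so that \r\n is consumed as one separator rather than two.

```python
import re

# Try CRLF first, then a lone CR, then a lone LF.
NEWLINE = re.compile(r"\r\n|\r|\n")


def split_lines(text: str) -> list:
    """Split text on Windows, old Mac and Unix line endings alike."""
    return NEWLINE.split(text)


print(split_lines("one\r\ntwo\rthree\nfour"))  # ['one', 'two', 'three', 'four']

# If old Mac files are not a concern, an optional CR before LF is enough.
print(re.split(r"\r?\n", "a\r\nb\nc"))         # ['a', 'b', 'c']
```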
In C and C++, \n is a concept, \r is a character, and \r\n is (almost always) a portability bug.
Think of an old teletype. The print head is positioned on some line and in some column. When you send a printable character to the teletype, it prints the character at the current position and moves the head to the next column. (This is conceptually the same as a typewriter, except that typewriters typically moved the paper with respect to the print head.)
When you wanted to finish the current line and start on the next line, you had to do two separate steps:
move the print head back to the beginning of the line, then
move it down to the next line.
ASCII encodes these actions as two distinct control characters:
\x0D (CR) moves the print head back to the beginning of the line. (Unicode encodes this as U+000D CARRIAGE RETURN.)
\x0A (LF) moves the print head down to the next line. (Unicode encodes this as U+000A LINE FEED.)
In the days of teletypes and early technology printers, people actually took advantage of the fact that these were two separate operations. By sending a CR without following it by a LF, you could print over the line you already printed. This allowed effects like accents, bold type, and underlining. Some systems overprinted several times to prevent passwords from being visible in hardcopy. On early serial CRT terminals, CR was one of the ways to control the cursor position in order to update text already on the screen.
But most of the time, you actually just wanted to go to the next line. Rather than requiring the pair of control characters, some systems allowed just one or the other. For example:
Unix variants (including modern versions of Mac) use just a LF character to indicate a newline.
Old (pre-OSX) Macintosh files used just a CR character to indicate a newline.
VMS, CP/M, DOS, Windows, and many network protocols still expect both: CR LF.
Old IBM systems that used EBCDIC standardized on NL--a character that doesn't even exist in the ASCII character set. In Unicode, NL is U+0085 NEXT LINE, but the actual EBCDIC value is 0x15.
Why did different systems choose different methods? Simply because there was no universal standard. Where your keyboard probably says "Enter", older keyboards used to say "Return", which was short for Carriage Return. In fact, on a serial terminal, pressing Return actually sends the CR character. If you were writing a text editor, it would be tempting to just use that character as it came in from the terminal. Perhaps that's why the older Macs used just CR.
Now that we have standards, there are more ways to represent line breaks. Although extremely rare in the wild, Unicode has new characters like:
U+2028 LINE SEPARATOR
U+2029 PARAGRAPH SEPARATOR
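As one illustration of software that recognises these newer line breaks, Python's built-in str.splitlines treats U+2028, U+2029 and U+0085 as line boundaries alongside CR, LF and CRLF. This is a small sketch of that one API, not a statement about how other languages behave.

```python
text = "a\u2028b\u2029c\x85d\r\ne"

# splitlines() breaks on LINE SEPARATOR, PARAGRAPH SEPARATOR, NEXT LINE and CRLF here.
lines = text.splitlines()
print(lines)  # ['a', 'b', 'c', 'd', 'e']

# Splitting on "\n" alone only sees the LF inside the final CRLF.
naive = text.split("\n")
print(len(naive))  # 2
```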
Even before Unicode came along, programmers wanted simple ways to represent some of the most useful control codes without worrying about the underlying character set. C has several escape sequences for representing control codes:
\a (for alert) which rings the teletype bell or makes the terminal beep
\f (for form feed) which moves to the beginning of the next page
\t (for tab) which moves the print head to the next horizontal tab position
(This list is intentionally incomplete.)
This mapping happens at compile-time--the compiler sees \a and puts whatever magic value is used to ring the bell.
Notice that most of these mnemonics have direct correlations to ASCII control codes. For example, \a would map to 0x07 BEL. A compiler could be written for a system that used something other than ASCII for the host character set (e.g., EBCDIC). Most of the control codes that had specific mnemonics could be mapped to control codes in other character sets.
Huzzah! Portability!
Well, almost. In C, I could write printf("\aHello, World!"); which rings the bell (or beeps) and outputs a message. But if I wanted to then print something on the next line, I'd still need to know what the host platform requires to move to the next line of output. CR LF? CR? LF? NL? Something else? So much for portability.
C has two modes for I/O: binary and text. In binary mode, whatever data is sent gets transmitted as-is. But in text mode, there's a run-time translation that converts a special character to whatever the host platform needs for a new line (and vice versa).
Great, so what's the special character?
Well, that's implementation dependent, too, but there's an implementation-independent way to specify it: \n. It's typically called the "newline character".
This is a subtle but important point: \n is mapped at compile time to an implementation-defined character value which (in text mode) is then mapped again at run time to the actual character (or sequence of characters) required by the underlying platform to move to the next line.
\n is different than all the other backslash literals because there are two mappings involved. This two-step mapping makes \n significantly different than even \r, which is simply a compile-time mapping to CR (or the most similar control code in whatever the underlying character set is).
This trips up many C and C++ programmers. If you were to poll 100 of them, at least 99 will tell you that \n means line feed. This is not entirely true. Most (perhaps all) C and C++ implementations use LF as the magic intermediate value for \n, but that's an implementation detail. It's feasible for a compiler to use a different value. In fact, if the host character set is not a superset of ASCII (e.g., if it's EBCDIC), then \n will almost certainly not be LF.
So, in C and C++:
\r is literally a carriage return.
\n is a magic value that gets translated (in text mode) at run-time to/from the host platform's newline semantics.
\r\n is almost always a portability bug. In text mode, this gets translated to CR followed by the platform's newline sequence--probably not what's intended. In binary mode, this gets translated to CR followed by some magic value that might not be LF--possibly not what's intended.
\x0A is the most portable way to indicate an ASCII LF, but you only want to do that in binary mode. Most text-mode implementations will treat that like \n.
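Python's text layer has an analogous run-time translation, which makes the "\r\n is a portability bug" point easy to demonstrate. This is a sketch of the Python analogue, not of C's stdio: written_bytes is a hypothetical helper, and passing newline="\r\n" to io.TextIOWrapper is used here to stand in for a platform whose newline sequence is CR LF.

```python
import io


def written_bytes(text: str, newline: str) -> bytes:
    """Write text through a text-mode wrapper and return the raw bytes produced."""
    raw = io.BytesIO()
    wrapper = io.TextIOWrapper(raw, encoding="ascii", newline=newline)
    wrapper.write(text)
    wrapper.flush()
    wrapper.detach()  # release raw so it stays open after the wrapper goes away
    return raw.getvalue()


# Text mode on a CRLF platform: "\n" is expanded to CR LF.
print(written_bytes("line\n", "\r\n"))    # b'line\r\n'

# Writing "\r\n" explicitly in text mode yields CR CR LF.
print(written_bytes("line\r\n", "\r\n"))  # b'line\r\r\n'

# newline="" disables translation, comparable to binary mode.
print(written_bytes("line\n", ""))        # b'line\n'
```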
"\r" => Return
"\n" => Newline or Linefeed
(semantics)
Unix based systems use just a "\n" to end a line of text.
Dos uses "\r\n" to end a line of text.
Some other machines used just a "\r". (Commodore, Apple II, Mac OS prior to OS X, etc..)
\r is used to point to the start of a line and can replace the text from there, e.g.
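#include <stdio.h>

int main(void)
{
    printf("\nab");   /* new line, then "ab" */
    printf("\bsi");   /* backspace over 'b', then "si": line reads "asi" */
    printf("\rha");   /* back to column 0, then "ha" overwrites "as": "hai" */
    return 0;
}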
Produces this output:
hai
\n is for new line.
In short \r has ASCII value 13 (CR) and \n has ASCII value 10 (LF).
Mac uses CR as line delimiter (at least, it did before, I am not sure for modern macs), *nix uses LF and Windows uses both (CRLF).
In addition to @Jon Skeet's answer:
Traditionally Windows has used \r\n, Unix \n and Mac \r, however newer Macs use \n as they're unix based.
\r is Carriage Return; \n is New Line (Line Feed) ... depends on the OS as to what each means. Read this article for more on the difference between '\n' and '\r\n' ... in C.
in C# I found they use \r\n in a string.
\r used for carriage return. (ASCII value is 13)
\n used for new line. (ASCII value is 10)