perl replace non UTF-8 characters or binary contents with whitespace

perl replace non UTF-8 characters or binary contents with whitespace - perl

I have a file with non-ascii characters.
$ org od -t c -t x1 -A d tmp.txt
0000000 S o - c a l l e d 217 204 l a b
53 6f 2d 63 61 6c 6c 65 64 f4 8f b1 84 6c 61 62
0000016 e l e d 217 204 p a t t e r n s
65 6c 65 64 f4 8f b1 84 70 61 74 74 65 72 6e 73
0000032 217 204 c a n b e 217 204 u s
f4 8f b1 84 63 61 6e 20 62 65 f4 8f b1 84 75 73
0000048 e d 217 204 w i t h 217 204 s i
65 64 f4 8f b1 84 77 69 74 68 f4 8f b1 84 73 69
0000064 n g l e , 217 204 d o u b l e
6e 67 6c 65 2c 20 f4 8f b1 84 64 6f 75 62 6c 65
0000080 , 217 204 a n d 217 204 t r i
2c 20 f4 8f b1 84 61 6e 64 f4 8f b1 84 74 72 69
0000096 p l e 217 204 b l a n k s .
70 6c 65 f4 8f b1 84 62 6c 61 6e 6b 73 2e
As you can see, \x{f4}\x{8f}\x{b1}\x{84} has several occurrences. I want to replace \x{f4}\x{8f}\x{b1}\x{84} with whitespace. According to this, I try:
s/\x{f4}\x{8f}\x{b1}\x{84}/ /g;
tr/\x{f4}\x{8f}\x{b1}\x{84}/ /;
It doesn't work.
But if I remove this two lines in the script:
use utf8;
use open qw( :std :encoding(UTF-8) );
It works. Why?
I suspect that it is because perl only deals with characters, but \x{f4}\x{8f}\x{b1}\x{84} is not regarded as a character. Is there a way to remove \x{f4}\x{8f}\x{b1}\x{84} or any other binary contents or non UTF-8 characters with perl?

While the file may contain "\x{f4}\x{8f}\x{b1}\x{84}", your string contains "\x{10FC44}" — "\N{U+10FC44}" if you prefer — because you decoded what you read. As such, you'd need
tr/\N{U+10FC44}/ /
It's a private-use Code Point. To replace all 137,468 private-use Code Points, you can use
s/\p{General_Category=Private_Use}/ /g
General_Category can be abbreviated to Gc.
Private_Use can be abbreviated to Co.
General_Category= can be omitted.
So these are equivalent:
s/\p{Gc=Private_Use}/ /g
s/\p{Private_Use}/ /g
s/\p{Co}/ /g
Co makes me think of "control", so maybe it's best to avoid that one. (Controls characters are identified by the Control aka Cc general category.)

Related

How to read currency symbol in a perl script

I have a Perl script where we are reading data from a .csv file which is having some different currency symbol . When we are reading that file and write the content I can see it is printing
Get <A3>50 or <80>50 daily
Actual value is
Get £50 or €50 daily
With Dollar sign it is working fine if there is any other currency code it is not working
I tried
open my $in, '<:encoding(UTF-8)', 'input-file-name' or die $!;
open my $out, '>:encoding(latin1)', 'output-file-name' or die $!;
while ( <$in> ) {
print $out $_;
}
$ od -t x1 input-file-name
0000000 47 65 74 20 c2 a3 35 30 20 6f 72 20 e2 82 ac 35
0000020 30 20 64 61 69 6c 79 0a
0000030
od -t x1 output-file-name
0000000 47 65 74 20 a3 35 30 20 6f 72 20 5c 78 7b 32 30
0000020 61 63 7d 35 30 20 64 61 69 6c 79 0a
0000034
but that is also not helping .Output I am getting
Get \xA350 or \x8050 daily
od -t x1 output-file-name
0000000 47 65 74 20 a3 35 30 20 6f 72 20 5c 78 7b 32 30
0000020 61 63 7d 35 30 20 64 61 69 6c 79 0a
0000034

Unicode Code Point
Glyph
UTF-8
Input File
ISO-8859-1
Output File
U+00A3 POUND SIGN
£
C2 A3
C2 A3
A3
A3
U+20AC EURO SIGN
€
E2 82 AC
E2 82 AC
N/A
5C 78 7B 32 30 61 63 7D
("LATIN1" is an alias for "ISO-8859-1".)
There are no problems with the input file.
£ is correctly encoded in your input file.
€ is correctly encoded in your input file.
As for the output file,
£ is correctly encoded in your output file.
€ isn't found in the latin1 charset, so \x{20ac} is used instead.
Your program is working as expected.
You say you see <A3> instead of £. That's probably because the program you are using is expecting a file encoded using UTF-8, but you provided a file encoded using ISO-8859-1.
You also say you see <80> instead of €. But there's no way you'd see that for the file you provided.

Can someone tell me why I am getting this error, is it because of the spacing (I know quotations matter)? [closed]

Closed. This question is not reproducible or was caused by typos. It is not currently accepting answers.
This question was caused by a typo or a problem that can no longer be reproduced. While similar questions may be on-topic here, this one was resolved in a way less likely to help future readers.
Closed 2 years ago.
Improve this question
create table product(
productid int, description varchar(20)
);
 insert into product ( productid, description ) Values ( 42 , ' tv');
ERROR: column "description" of relation "product" does not exist

As several people pointed out in comments, there are invisible characters (sometimes called "gremlins") in your SQL that make it invalid. Here's a hex dump of the contents (after copying the code from the question, using macOS commands):
$ pbpaste | xxd -g1
00000000: 63 72 65 61 74 65 20 74 61 62 6c 65 20 70 72 6f create table pro
00000010: 64 75 63 74 28 0a 70 72 6f 64 75 63 74 69 64 20 duct(.productid
00000020: 69 6e 74 2c e2 80 a8 0a 64 65 73 63 72 69 70 74 int,....descript
^^ ^^ ^^ ^^^
00000030: 69 6f 6e 20 76 61 72 63 68 61 72 28 32 30 29 0a ion varchar(20).
00000040: 29 3b 0a e2 80 a8 69 6e 73 65 72 74 20 69 6e 74 );....insert int
00000050: 6f 20 70 72 6f 64 75 63 74 20 28 e2 80 a8 70 72 o product (...pr
00000060: 6f 64 75 63 74 69 64 2c e2 80 a8 64 65 73 63 72 oductid,...descr
^^ ^^ ^^ ^^^
00000070: 69 70 74 69 6f 6e 20 29 e2 80 a8 56 61 6c 75 65 iption )...Value
^^ ^^ ^^ ^^^
00000080: 73 20 28 20 34 32 20 2c 20 27 20 74 76 27 29 3b s ( 42 , ' tv');
00000090: 0a 45 52 52 4f 52 3a 20 20 63 6f 6c 75 6d 6e 20 .ERROR: column
000000a0: 22 64 65 73 63 72 69 70 74 69 6f 6e 22 20 6f 66 "description" of
000000b0: 20 72 65 6c 61 74 69 6f 6e 20 22 70 72 6f 64 75 relation "produ
000000c0: 63 74 22 20 64 6f 65 73 20 6e 6f 74 20 65 78 69 ct" does not exi
000000d0: 73 74 st
(Note that xxd represents bytes that don't correspond to printable ASCII characters as "." in the text display on the right. The "."s that correspond to 0a in hex are newline characters.)
The hex codes e2 80 a8 correspond to the UTF-8 encoding of the unicode "line separator" character. I don't know how that character got in there; you'd have to trace back the origin of that code snippet to figure out where they were added.
I'd avoid using TextEdit for source code (and config files, etc) . Instead, I'd recommend using BBEdit or some other code-oriented editor. I think even in BBEdit's free-demo mode it can show (and let you remove) normally-invisible characters by choosing View menu -> Text Display -> Show Invisibles.
You can also remove non-plain-ASCII characters from a text file from the macOS Terminal with:
LC_ALL=C tr -d '\n\t -~' <infile.txt >cleanfile.txt
(Replacing infile.txt and cleanfile.txt with the paths/names of the input file and where you want to store the output.) Warning: do not try to write the cleaned contents back to the original file, that won't work. Also, don't use this to clean anything except plain text files (if the file has any sections that aren't supposed to be text sections, this may mangle those sections). Keep the original file as a backup until you've verified that the "clean" version works right.
You can also "clean" the paste buffer with:
pbpaste | LC_ALL=C tr -d '\n\t -~' | pbcopy
...so just copy the relevant code from your text editor, run that in Terminal, then paste the cleaned version back into the editor.

dd: copy N first bytes of each M-sized block

Given a binary input file made of data blocks which are N bytes long, how can I extract the first M first bytes of each block using dd?
for example, with M=10 and N=8, the data could look like this:
$ M=10
$ head -c $(( M * 5 )) /dev/urandom \
| tee inputfile.bin \
| hexdump -e '"%07.7_Ax\n"' -e "\"%07.7_ax \" ${M}/1 \"%02x \" \"\n\""
0000000 c0 07 5d 59 dc 03 2e 38 49 c4
000000a ca ad 44 6d 09 61 2b 6c 7c ba
0000014 c4 96 c6 73 8b ed 42 cf d9 9c
000001e 49 b7 bb ea 32 dc 35 6a 5c d8
0000028 55 15 a0 aa d5 aa 60 2c 30 de
0000032
and I would like to extract this from the input:
$ N=8
$ hexdump -e '"%07.7_Ax\n"' -e "\"%07.7_ax \" ${N}/1 \"%02x \" \"\n\"" output.bin
0000000 c0 07 5d 59 dc 03 2e 38
0000008 ca ad 44 6d 09 61 2b 6c
0000010 c4 96 c6 73 8b ed 42 cf
0000018 49 b7 bb ea 32 dc 35 6a
0000020 55 15 a0 aa d5 aa 60 2c
0000028

Random symbols in Source window instead of Russian characters in RStudio

I have been googling and stackoverflowing (yes, that is the word now) on how to fix the problem with wrong encoding. However, I could not find the solution.
I am trying to load .Rmd file with UTF-8 encoding which basically has Russian characters in it. They do not show properly. Instead, the code lines in the Source window look like so:
Initially, I created this .Rmd file long ago on my previous laptop. Now, I am using another one and I cannot spot the issue here.
I have already tried to use some Sys.setlocale() commands with no success whatsoever.
I run RStudio on Windows 10.
Edited
This is the output of readBin('raw[1].Rmd', raw(), 10000). Slice from 2075 to 2211:
[2075] 64 31 32 2c 20 71 68 35 20 3d 3d 20 22 d0 a0 d1 9a d0 a0 d0 88 d0 a0
e2 80 93 d0 a0 d0 8e d0 a0 d1 99
[2109] d0 a0 d1 9b d0 a0 e2 84 a2 22 29 3b 20 64 31 32 6d 24 71 68 35 20 3d
20 4e 55 4c 4c 0d 0a 64 31 35 6d
[2143] 20 3d 20 66 69 6c 74 65 72 28 64 31 35 2c 20 74 68 35 20 3d 3d 20 22
d0 a0 d1 9a d0 a0 d0 88 d0 a0 e2
[2177] 80 93 d0 a0 d0 8e d0 a0 d1 99 d0 a0 d1 9b d0 a0 e2 84 a2 22 29 3b 20
64 31 35 6d 24 74 68 35 20 3d 20
Thank you.

Windows doesn't have very good support for UTF-8. Likely your local encoding is something else.
RStudio normally reads files using the system encoding. If that is wrong, you can use "File | Reopen with encoding..." to re-open the file using a different encoding.
Edited to add:
The first line of the sample output looks like UTF-8 encoding with some Cyrillic letters, but not Russian-language text. I decode it as "d12, qh5 == \"РњРЈР–РЎРљ". Is that what RStudio gave you when you re-opened the file, declaring it as UTF-8?

Identifying an unknown data encoding?

I'm trying understand an undocumented API I have discovered, and I can't get past the format of the data that is being returned.
Here is an example of what I get back when I perform a GET on the url I'm looking at:
A+uZL4258wXdnWztlEPJNXtdl3Tu4hRITtW2AUwQHUK5c6BATSBU/XsQEVIttCpI7wrW/oXWiBloT8+cdtUWBag3mzk3cLohKPvi7PWpf7jqCSbjNGh+5Iv5Gb8by2k31kp62sfwZ+i8r/3TA6nGrnJb6edOB7d0c6F34RTFRrrZSeJtiWYXAJ5JeD3yJY+C
At first I thought this was base64 encoded, but that just gives me back gibberish:
echo -n "<above snippet>" | base64 -D
?/???ݝl?C?5Vy??????,?8?s?#M T?{R-?*H?
???ֈhOϜv??7?97p?!(??????? &?4h~???i7?Jz???g輯???Ʈr[??N?ts?w??F??I?m?f?Ix=?%?
When I strip the URL down to just the domain, I get a website with cyrillic text. Maybe the data could be converted to cyrillic somehow?
Does this data format look familiar to you?
I'll continue to keep trying and report back if I make any progress.

This is definitely base64, because of the / and + characters.
When you decode that string using base64, you get this hexdump:
00000000 03 eb 99 2f 8d b9 f3 05 dd 9d 6c ed 94 43 c9 35 |.../......l..C.5|
00000010 7b 5d 97 74 ee e2 14 48 4e d5 b6 01 4c 10 1d 42 |{].t...HN...L..B|
00000020 b9 73 a0 40 4d 20 54 fd 7b 10 11 52 2d b4 2a 48 |.s.#M T.{..R-.*H|
00000030 ef 0a d6 fe 85 d6 88 19 68 4f cf 9c 76 d5 16 05 |........hO..v...|
00000040 a8 37 9b 39 37 70 ba 21 28 fb e2 ec f5 a9 7f b8 |.7.97p.!(.......|
00000050 ea 09 26 e3 34 68 7e e4 8b f9 19 bf 1b cb 69 37 |..&.4h~.......i7|
00000060 d6 4a 7a da c7 f0 67 e8 bc af fd d3 03 a9 c6 ae |.Jz...g.........|
00000070 72 5b e9 e7 4e 07 b7 74 73 a1 77 e1 14 c5 46 ba |r[..N..ts.w...F.|
00000080 d9 49 e2 6d 89 66 17 00 9e 49 78 3d f2 25 8f 82 |.I.m.f...Ix=.%..|
This just looks like 128 bytes of random data. And whenever you call this API URL again, you get a different string, although it starts with the same few characters.
Perhaps you should ask the maintainers of that website how to use their API. Maybe this string is some session ID that you should use in further calls.