How to read a currency symbol in a Perl script

I have a Perl script that reads data from a .csv file containing various currency symbols. When I read that file and write out its content, I can see it prints
Get <A3>50 or <80>50 daily
Actual value is
Get £50 or €50 daily
The dollar sign works fine, but any other currency symbol does not.
I tried
open my $in, '<:encoding(UTF-8)', 'input-file-name' or die $!;
open my $out, '>:encoding(latin1)', 'output-file-name' or die $!;
while ( <$in> ) {
    print $out $_;
}
$ od -t x1 input-file-name
0000000 47 65 74 20 c2 a3 35 30 20 6f 72 20 e2 82 ac 35
0000020 30 20 64 61 69 6c 79 0a
0000030
od -t x1 output-file-name
0000000 47 65 74 20 a3 35 30 20 6f 72 20 5c 78 7b 32 30
0000020 61 63 7d 35 30 20 64 61 69 6c 79 0a
0000034
but that is also not helping. The output I am getting is
Get \xA350 or \x8050 daily

Unicode Code Point   Glyph  UTF-8     Input File  ISO-8859-1  Output File
U+00A3 POUND SIGN    £      C2 A3     C2 A3       A3          A3
U+20AC EURO SIGN     €      E2 82 AC  E2 82 AC    N/A         5C 78 7B 32 30 61 63 7D
("LATIN1" is an alias for "ISO-8859-1".)
There are no problems with the input file.
£ is correctly encoded in your input file.
€ is correctly encoded in your input file.
As for the output file,
£ is correctly encoded in your output file.
€ isn't found in the latin1 charset, so \x{20ac} is used instead.
Your program is working as expected.
You say you see <A3> instead of £. That's probably because the program you are using is expecting a file encoded using UTF-8, but you provided a file encoded using ISO-8859-1.
You also say you see <80> instead of €. But there's no way you'd see that for the file you provided.

Related

Powershell Script does not Write Correct Umlauts - Powershell itself does

I want to dump (and later work with) the paths of the locally changed files in my SVN repository. Problem is, there are umlauts in some filenames (like ä, ö, ü).
When I open a PowerShell window in my local trunk folder, I can do svn status and get the result with correct umlauts ("ü" in this case):
PS C:\trunk> svn status -q
M Std\ClientComponents\Prüfung.xaml
M Std\ClientComponents\Prüfung.xaml.cs
M Std\ClientComponents\PrüfungViewModel.cs
When I do the same in my powershell script, the results are different.
Script "DumpChangedFiles.ps1":
foreach ( $filename in svn status -q )
{
    Write-Host $filename
}
Results:
PS C:\trunk> .\DumpChangedFiles.ps1
M Std\ClientComponents\Pr³fung.xaml
M Std\ClientComponents\Pr³fung.xaml.cs
M Std\ClientComponents\Pr³fungViewModel.cs
Question: Why are the umlauts wrong? How do I get to the correct results?
Hex-Dump:
ef bb bf 4d 20 20 20 20 20 20 20 53 74 64 5c 43 6c 69 65 6e 74 43 6f 6d 70 6f 6e 65 6e 74 73 5c 50 72 c2 b3 66 75 6e 67 2e 78 61 6d 6c 0d 0a 4d 20 20 20 20 20 20 20 53 74 64 5c 43 6c 69 65 6e 74 43 6f 6d 70 6f 6e 65 6e 74 73 5c 50 72 c2 b3 66 75 6e 67 2e 78 61 6d 6c 2e 63 73 0d 0a 4d 20 20 20 20 20 20 20 53 74 64 5c 43 6c 69 65 6e 74 43 6f 6d 70 6f 6e 65 6e 74 73 5c 50 72 c2 b3 66 75 6e 67 56 69 65 77 4d 6f 64 65 6c 2e 63 73
Here's the output of the script DumpChangedFiles.ps1 compared to the output of your desired command:
PS C:\trunk> .\DumpChangedFiles.ps1
M Std\ClientComponents\Pr³fung.xaml
M Std\ClientComponents\Pr³fung.xaml.cs
M Std\ClientComponents\Pr³fungViewModel.cs
PS C:\trunk> $PSDefaultParameterValues['*:Encoding'] = 'utf8'; svn status -q
M Std\ClientComponents\Prüfung.xaml
M Std\ClientComponents\Prüfung.xaml.cs
M Std\ClientComponents\PrüfungViewModel.cs
The output of svn --version is:
PS C:\trunk> svn --version
svn, version 1.14.0 (r1876290)
compiled May 24 2020, 17:07:49 on x86-microsoft-windows
Copyright (C) 2020 The Apache Software Foundation.
This software consists of contributions made by many people;
see the NOTICE file for more information.
Subversion is open source software, see http://subversion.apache.org/
The following repository access (RA) modules are available:
* ra_svn : Module for accessing a repository using the svn network protocol.
- with Cyrus SASL authentication
- handles 'svn' scheme
* ra_local : Module for accessing a repository on local disk.
- handles 'file' scheme
* ra_serf : Module for accessing a repository via WebDAV protocol using serf.
- using serf 1.3.9 (compiled with 1.3.9)
- handles 'http' scheme
- handles 'https' scheme
The following authentication credential caches are available:
* Wincrypt cache in C:\Users\reichert\AppData\Roaming\Subversion
The problem comes from PowerShell ISE: the svn command in your script is executed through PowerShell ISE, which decodes its output with Windows-1252 (or your default Windows locale).
You can use the following to get correct output (check your Windows locale):
[Console]::OutputEncoding = [System.Text.Encoding]::GetEncoding(1252)
foreach ( $filename in svn status -q )
{
    Write-Host $filename
}
A previous unanswered question seems to relate to the same problem with ISE:
Powershell ISE has different codepage from Powershell and I can not change it

Can someone tell me why I am getting this error, is it because of the spacing (I know quotations matter)? [closed]

Closed. This question is not reproducible or was caused by typos. It is not currently accepting answers.
This question was caused by a typo or a problem that can no longer be reproduced. While similar questions may be on-topic here, this one was resolved in a way less likely to help future readers.
Closed 2 years ago.
Improve this question
create table product(
productid int,
description varchar(20)
);

insert into product (
productid,
description )
Values ( 42 , ' tv');
ERROR: column "description" of relation "product" does not exist
As several people pointed out in comments, there are invisible characters (sometimes called "gremlins") in your SQL that make it invalid. Here's a hex dump of the contents (after copying the code from the question, using macOS commands):
$ pbpaste | xxd -g1
00000000: 63 72 65 61 74 65 20 74 61 62 6c 65 20 70 72 6f create table pro
00000010: 64 75 63 74 28 0a 70 72 6f 64 75 63 74 69 64 20 duct(.productid
00000020: 69 6e 74 2c e2 80 a8 0a 64 65 73 63 72 69 70 74 int,....descript
^^ ^^ ^^ ^^^
00000030: 69 6f 6e 20 76 61 72 63 68 61 72 28 32 30 29 0a ion varchar(20).
00000040: 29 3b 0a e2 80 a8 69 6e 73 65 72 74 20 69 6e 74 );....insert int
00000050: 6f 20 70 72 6f 64 75 63 74 20 28 e2 80 a8 70 72 o product (...pr
00000060: 6f 64 75 63 74 69 64 2c e2 80 a8 64 65 73 63 72 oductid,...descr
^^ ^^ ^^ ^^^
00000070: 69 70 74 69 6f 6e 20 29 e2 80 a8 56 61 6c 75 65 iption )...Value
^^ ^^ ^^ ^^^
00000080: 73 20 28 20 34 32 20 2c 20 27 20 74 76 27 29 3b s ( 42 , ' tv');
00000090: 0a 45 52 52 4f 52 3a 20 20 63 6f 6c 75 6d 6e 20 .ERROR: column
000000a0: 22 64 65 73 63 72 69 70 74 69 6f 6e 22 20 6f 66 "description" of
000000b0: 20 72 65 6c 61 74 69 6f 6e 20 22 70 72 6f 64 75 relation "produ
000000c0: 63 74 22 20 64 6f 65 73 20 6e 6f 74 20 65 78 69 ct" does not exi
000000d0: 73 74 st
(Note that xxd represents bytes that don't correspond to printable ASCII characters as "." in the text display on the right. The "."s that correspond to 0a in hex are newline characters.)
The hex codes e2 80 a8 correspond to the UTF-8 encoding of the unicode "line separator" character. I don't know how that character got in there; you'd have to trace back the origin of that code snippet to figure out where they were added.
I'd avoid using TextEdit for source code (and config files, etc.). Instead, I'd recommend BBEdit or some other code-oriented editor. I think even in BBEdit's free-demo mode it can show (and let you remove) normally-invisible characters via View menu -> Text Display -> Show Invisibles.
You can also remove non-plain-ASCII characters from a text file from the macOS Terminal with:
LC_ALL=C tr -cd '\n\t -~' <infile.txt >cleanfile.txt
(Replacing infile.txt and cleanfile.txt with the paths/names of the input file and where you want to store the output.) Warning: do not try to write the cleaned contents back to the original file, that won't work. Also, don't use this to clean anything except plain text files (if the file has any sections that aren't supposed to be text sections, this may mangle those sections). Keep the original file as a backup until you've verified that the "clean" version works right.
You can also "clean" the paste buffer with:
pbpaste | LC_ALL=C tr -cd '\n\t -~' | pbcopy
...so just copy the relevant code from your text editor, run that in Terminal, then paste the cleaned version back into the editor.

Random symbols in Source window instead of Russian characters in RStudio

I have been googling and stackoverflowing (yes, that is the word now) on how to fix the problem with wrong encoding. However, I could not find the solution.
I am trying to load .Rmd file with UTF-8 encoding which basically has Russian characters in it. They do not show properly. Instead, the code lines in the Source window look like so:
Initially, I created this .Rmd file long ago on my previous laptop. Now, I am using another one and I cannot spot the issue here.
I have already tried to use some Sys.setlocale() commands with no success whatsoever.
I run RStudio on Windows 10.
Edited
This is the output of readBin('raw[1].Rmd', raw(), 10000). Slice from 2075 to 2211:
[2075] 64 31 32 2c 20 71 68 35 20 3d 3d 20 22 d0 a0 d1 9a d0 a0 d0 88 d0 a0 e2 80 93 d0 a0 d0 8e d0 a0 d1 99
[2109] d0 a0 d1 9b d0 a0 e2 84 a2 22 29 3b 20 64 31 32 6d 24 71 68 35 20 3d 20 4e 55 4c 4c 0d 0a 64 31 35 6d
[2143] 20 3d 20 66 69 6c 74 65 72 28 64 31 35 2c 20 74 68 35 20 3d 3d 20 22 d0 a0 d1 9a d0 a0 d0 88 d0 a0 e2
[2177] 80 93 d0 a0 d0 8e d0 a0 d1 99 d0 a0 d1 9b d0 a0 e2 84 a2 22 29 3b 20 64 31 35 6d 24 74 68 35 20 3d 20
Thank you.
Windows doesn't have very good support for UTF-8. Likely your locale encoding is something else.
RStudio normally reads files using the system encoding. If that is wrong, you can use "File | Reopen with encoding..." to re-open the file using a different encoding.
Edited to add:
The first line of the sample output looks like UTF-8 encoding with some Cyrillic letters, but not Russian-language text. I decode it as "d12, qh5 == \"РњРЈР–РЎРљ". Is that what RStudio gave you when you re-opened the file, declaring it as UTF-8?

Why does Perl MIME::Base64 insert CR characters before LF characters on decoding Base64-encoded strings when they aren't present in the original data?

Why does the Perl MIME::Base64 module on decoding Base64-encoded strings insert CR characters before LF characters when they are not present in the original data?
Input: a binary described by the following hex string,
14 15 6A 48 E4 15 6A 32 E5 48 46 13 A5 E3 88 43 18 A6 84 E3 51 3A 8A 0A 1A 3E E6 84 A6 1A 16 E8 46 84 A1 2E A3 5E 84 8A 4E 1A 35 E1 35 1E 84 A9 8E 46 54 44
This encodes to the Base64-encoded string:
FBVqSOQVajLlSEYTpeOIQximhONROooKGj7mhKYaFuhGhKEuo16Eik4aNeE1HoSpjkZURA==
My Perl script for decoding is
use MIME::Base64;
my $bin = decode_base64('FBVqSOQVajLlSEYTpeOIQximhONROooKGj7mhKYaFuhGhKEuo16Eik4aNeE1HoSpjkZURA==');
open FH, ">test.bin" or die $!;
print FH $bin;
close FH;
Output: the resulting file 'test.bin' has the following hex string representation,
14 15 6A 48 E4 15 6A 32 E5 48 46 13 A5 E3 88 43 18 A6 84 E3 51 3A 8A 0D 0A 1A 3E E6 84 A6 1A 16 E8 46 84 A1 2E A3 5E 84 8A 4E 1A 35 E1 35 1E 84 A9 8E 46 54 44
Note the additional '0D' byte that has been inserted before '0A', where it was not present in the original data.
I'm using Perl v5.14.2 on Windows 7.
Since you're on Windows, you will need to open that filehandle in binary mode to prevent your line-endings from being munged.
open FH, ">test.bin" or die $!;
binmode FH;
You can do that all at once using IO layers, and also use a lexical filehandle, which is better practice than a package symbol like FH:
open my $fh, '>:raw', 'test.bin' or die $!;
print { $fh } $bin;
For more, check out
perldoc perlio
perldoc perlopentut

Why does encoding, then decoding strings make Arabic characters lose their context?

I'm (belatedly) testing Unicode waters for the first time and am failing to understand why the process of encoding, then decoding an Arabic string is having the effect of separating out the individual characters that the word is made of.
In the example below, the word "ﻟﻠﺒﻴﻊ" comprises 5 individual letters: "ع", "ي", "ب", "ل", "ل", written right to left. Depending on the surrounding context (adjacent letters), the letters change form.
use strict;
use warnings;
use utf8;
binmode( STDOUT, ':utf8' );
use Encode qw< encode decode >;
my $str = 'ﻟﻠﺒﻴﻊ'; # "For sale"
my $enc = encode( 'UTF-8', $str );
my $dec = decode( 'UTF-8', $enc );
my $decoded = pack 'U0W*', map +ord, split //, $enc;
print "Original string : $str\n"; # ل ل ب ي ع
print "Decoded string 1: $dec\n"; # ل ل ب ي ع
print "Decoded string 2: $decoded\n"; # ل ل ب ي ع
ADDITIONAL INFO
When pasting the string to this post, the rendering is reversed so it looks like "ﻊﻴﺒﻠﻟ". I'm reversing it manually to get it to look 'right'. The correct hexdump is given below:
$ echo "ﻟﻠﺒﻴﻊ" | hexdump
0000000 bbef ef8a b4bb baef ef92 a0bb bbef 0a9f
0000010
The output of the Perl script (per ikegami's request):
$ perl unicode.pl | od -t x1
0000000 4f 72 69 67 69 6e 61 6c 20 73 74 72 69 6e 67 20
0000020 3a 20 d8 b9 d9 8a d8 a8 d9 84 d9 84 0a 44 65 63
0000040 6f 64 65 64 20 73 74 72 69 6e 67 20 31 3a 20 d8
0000060 b9 d9 8a d8 a8 d9 84 d9 84 0a 44 65 63 6f 64 65
0000100 64 20 73 74 72 69 6e 67 20 32 3a 20 d8 b9 d9 8a
0000120 d8 a8 d9 84 d9 84 0a
0000127
And if I just print $str:
$ perl unicode.pl | od -t x1
0000000 4f 72 69 67 69 6e 61 6c 20 73 74 72 69 6e 67 20
0000020 3a 20 d8 b9 d9 8a d8 a8 d9 84 d9 84 0a
0000035
Finally (per ikegami's request):
$ grep 'For sale' unicode.pl | od -t x1
0000000 6d 79 20 24 73 74 72 20 3d 20 27 d8 b9 d9 8a d8
0000020 a8 d9 84 d9 84 27 3b 20 20 23 20 22 46 6f 72 20
0000040 73 61 6c 65 22 20 0a
0000047
Perl details
$ perl -v
This is perl, v5.10.1 (*) built for x86_64-linux-gnu-thread-multi
(with 53 registered patches, see perl -V for more detail)
Outputting to file reverses the string: "ﻊﻴﺒﻠﻟ"
QUESTIONS
I have several:
How can I preserve the context of each character while printing?
Why is the original string printed out to screen as individual letters, even though it hasn't been 'processed'?
When printing to file, the word is reversed (I'm guessing this is due to the script's right-to-left nature). Is there a way I can prevent this from happening?
Why does the following not hold true: $str !~ /\P{Bidi_Class: Right_To_Left}/;
Source code returned by StackOverflow (as fetched using wget):
... ef bb 9f ef bb a0 ef ba 92 ef bb b4 ef bb 8a ...
U+FEDF ARABIC LETTER LAM INITIAL FORM
U+FEE0 ARABIC LETTER LAM MEDIAL FORM
U+FE92 ARABIC LETTER BEH MEDIAL FORM
U+FEF4 ARABIC LETTER YEH MEDIAL FORM
U+FECA ARABIC LETTER AIN FINAL FORM
perl output I get from the source code returned by StackOverflow:
... ef bb 9f ef bb a0 ef ba 92 ef bb b4 ef bb 8a 0a
... ef bb 9f ef bb a0 ef ba 92 ef bb b4 ef bb 8a 0a
... ef bb 9f ef bb a0 ef ba 92 ef bb b4 ef bb 8a 0a
U+FEDF ARABIC LETTER LAM INITIAL FORM
U+FEE0 ARABIC LETTER LAM MEDIAL FORM
U+FE92 ARABIC LETTER BEH MEDIAL FORM
U+FEF4 ARABIC LETTER YEH MEDIAL FORM
U+FECA ARABIC LETTER AIN FINAL FORM
U+000A LINE FEED
So I get exactly what's in the source, as I should.
perl output you got:
... d8 b9 d9 8a d8 a8 d9 84 d9 84 0a
... d8 b9 d9 8a d8 a8 d9 84 d9 84 0a
... d8 b9 d9 8a d8 a8 d9 84 d9 84 0a
U+0639 ARABIC LETTER AIN
U+064A ARABIC LETTER YEH
U+0628 ARABIC LETTER BEH
U+0644 ARABIC LETTER LAM
U+0644 ARABIC LETTER LAM
U+000A LINE FEED
OK, so you could have a buggy Perl (one that reverses and changes Arabic characters, and only those), but it's far more likely that your source doesn't contain what you think it does. You need to check what bytes make up your source.
echo output you got:
ef bb 8a ef bb b4 ef ba 92 ef bb a0 ef bb 9f 0a
U+FECA ARABIC LETTER AIN FINAL FORM
U+FEF4 ARABIC LETTER YEH MEDIAL FORM
U+FE92 ARABIC LETTER BEH MEDIAL FORM
U+FEE0 ARABIC LETTER LAM MEDIAL FORM
U+FEDF ARABIC LETTER LAM INITIAL FORM
U+000A LINE FEED
There are significant differences in what you got from perl and from echo, so it's no surprise they show up differently.
Output inspected using:
$ perl -Mcharnames=:full -MEncode=decode_utf8 -E'
say sprintf("U+%04X %s", $_, charnames::viacode($_))
for unpack "C*", decode_utf8 pack "H*", $ARGV[0] =~ s/\s//gr;
' '...'
(Don't forget to swap the bytes of hexdump.)
Maybe something is odd with your shell? If I redirect the output to a file, the result is the same. Please try this out:
use strict;
use warnings;
use utf8;
binmode( STDOUT, ':utf8' );
use Encode qw< encode decode >;
my $str = 'ﻟﻠﺒﻴﻊ'; # "For sale"
my $enc = encode( 'UTF-8', $str );
my $dec = decode( 'UTF-8', $enc );
my $decoded = pack 'U0W*', map +ord, split //, $enc;
open(F1, '>', "original.txt") or die;
open(F2, '>', "decoded.txt") or die;
open(F3, '>', "decoded2.txt") or die;
binmode(F1, ':utf8');
binmode(F2, ':utf8');
binmode(F3, ':utf8');
print F1 "$str\n"; # ل ل ب ي ع
print F2 "$dec\n"; # ل ل ب ي ع
print F3 "$decoded\n";