Splitting an emoji sequence in PowerShell

I have a text box that will be filled with emoji only. No spaces or characters of any kind. I need to split these emoji in order to identify them. This is what I have tried:
function emoji_to_unicode(){
foreach ($emoji in $textbox.Text) {
$unicode = [System.Text.Encoding]::Unicode.GetBytes($emoji)
Write-Host $unicode
}
}
Instead of printing the bytes of each emoji separately, the loop runs just once and prints the codes of all the emoji joined together, as if all the emoji were a single item. I tested with 6 emoji, and instead of getting this:
61 216 7 222
61 216 67 222
61 216 10 222
61 216 28 222
61 216 86 220
60 216 174 223
I'm getting this:
61 216 7 222 61 216 67 222 61 216 10 222 61 216 28 222 61 216 86 220 60 216 174 223
What am I missing?

A string is a single object, so foreach iterates over it just once. Cast it to a character array to iterate over its characters.
foreach ($i in 'hithere') { $i }
hithere
foreach ($i in [char[]]'hithere') { $i }
h
i
t
h
e
r
e
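That doesn't work well for emoji, though. Their code points (U+1F600 and so on) lie above U+FFFF, so UTF-16 stores each one as a surrogate pair of two [char] values, and the cast splits every pair in half: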
foreach ($i in [char[]]'😀😁😂😃😄😅😆') { $i }
� # 16 bit surrogate pairs?
�
�
�
�
�
�
�
�
�
�
�
�
�
So take the chars two at a time and combine each pair. Here's one way to do it, using the surrogate arithmetic described at https://en.wikipedia.org/wiki/Universal_Character_Set_characters#Surrogates (or just call [System.Char]::ConvertToUtf32($emojis, $i)):
$emojis = '😀😁😂😃😄😅😆'
for ($i = 0; $i -lt $emojis.length; $i += 2) {
[System.Char]::IsHighSurrogate($emojis[$i])
0x10000 + ($emojis[$i] - 0xD800) * 0x400 + $emojis[$i+1] - 0xDC00 | % tostring x
# [system.char]::ConvertToUtf32($emojis,$i) | % tostring x # or
$emojis[$i] + $emojis[$i+1]
}
True
1f600
😀
True
1f601
😁
True
1f602
😂
True
1f603
😃
True
1f604
😄
True
1f605
😅
True
1f606
😆
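Note that Unicode in the [System.Text.Encoding]::Unicode.GetBytes() call refers to the UTF-16LE encoding.
Chinese works with a plain [char[]] cast, because these characters fit in a single UTF-16 code unit each.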
[char[]]'嗨，您好'
嗨
，
您
好
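For anyone who wants to check the surrogate arithmetic outside PowerShell, here is a minimal Python sketch (not part of the original answer). It splits text into UTF-16 code units, which is what the [char[]] cast produces, and then recombines surrogate pairs with the same formula used above. The function names utf16_units and code_points_from_utf16 are made up for this illustration.

```python
import struct


def utf16_units(text):
    """Return the UTF-16 code units of text (what [char[]] gives in PowerShell)."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def code_points_from_utf16(units):
    """Combine surrogate pairs into code points; other units pass through unchanged."""
    result = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Same arithmetic as in the PowerShell loop above.
            result.append(0x10000 + (unit - 0xD800) * 0x400 + (units[i + 1] - 0xDC00))
            i += 2
        else:
            result.append(unit)
            i += 1
    return result


emoji_units = utf16_units("😀😁")
chinese_units = utf16_units("嗨，您好")
print([format(u, "x") for u in emoji_units])
print([format(cp, "x") for cp in code_points_from_utf16(emoji_units)])
print(len(chinese_units))
```

The two emoji occupy four code units but only two code points, while the four Chinese characters occupy exactly four code units, which is why the plain cast is enough for them.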
Here it is using UTF-32 encoding, in which every character is exactly 4 bytes long. The loop converts each group of 4 bytes into an Int32 and prints it as hex.
$emoji = '😀😁😂😃😄😅😆'
$utf32 = [System.Text.Encoding]::utf32.GetBytes($emoji)
for($i = 0; $i -lt $utf32.count; $i += 4) {
$int32 = [bitconverter]::ToInt32($utf32[$i..($i+3)],0)
$int32 | % tostring x
}
1f600
1f601
1f602
1f603
1f604
1f605
1f606
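The same fixed-width idea can be sketched in Python, purely as a cross-check of the approach above. The helper name code_points_via_utf32 is hypothetical; it encodes the text as little-endian UTF-32 and reads one unsigned 32-bit integer per 4-byte group.

```python
import struct


def code_points_via_utf32(text):
    """Encode as UTF-32LE and read one 32-bit integer per character."""
    data = text.encode("utf-32-le")
    return [struct.unpack_from("<I", data, offset)[0]
            for offset in range(0, len(data), 4)]


hex_codes = [format(cp, "x") for cp in code_points_via_utf32("😀😁😂")]
print(hex_codes)
```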
Or going the other way, from Int32 to string. Simply casting the Int32 to [char] does not work for code points above 0xFFFF, because a single [char] holds only 16 bits and the result needs a pair of [char]s. Script reference: https://www.powershellgallery.com/packages/Emojis/0.1/Content/Emojis.psm1
for ($i = 0x1f600; $i -le 0x1f606; $i++ ) { [System.Char]::ConvertFromUtf32($i) }
😀
😁
😂
😃
😄
😅
😆
See also How to encode 32-bit Unicode characters in a PowerShell string literal?
EDIT:
PowerShell 7 has a nice EnumerateRunes() method:
$emojis = '😀😁😂😃😄😅😆'
$emojis.enumeraterunes() | % value | % tostring x
1f600
1f601
1f602
1f603
1f604
1f605
1f606

Related

Powershell - [DateTime]::TryParseExact, two apparently identical strings, one works the other doesn't [duplicate]

In powershell I have a date that I expect is a string and another that I created for debugging purposes which is a string.
function ParseDate {
param
(
$inputDate #when debugging this outputs - 04/11/2021 23:00
)
$date = "04/11/2021 23:00" #hard coded for debugging
$parsedDate = [DateTime]::MinValue;
# $date here, which I had for debugging, $inputDate will fail for parse
$parsedSuccessfully = [DateTime]::TryParseExact($date, "dd/MM/yyyy HH:mm", $null, [System.Globalization.DateTimeStyles]::None, [ref] $parsedDate);
return $parsedDate
}
If I parse the hardcoded date ($date) then it works fine, but if I parse the $inputDate it fails and $parsedSuccessfully will be false.
If I output the GetType() on both objects then it returns the same type -
IsPublic IsSerial Name BaseType
-------- -------- ---- --------
True True String System.Object
Is there any way to tell what the difference is between the inputDate and hard coded date as something must be different as one works and the other does not.

See this answer here - https://stackoverflow.com/a/67228497/440760
Yes this was also my question!
Essentially use write-host( $tmp.ToCharArray() | % { [int] $_ }) to show each character of the text as a decimal character code. This clearly shows that there are differences.
In my case -
8206 48 52 47 8206 49 49 47 8206 50 48 50 49 32 8207 8206 49 51 58 49 57 (didn't work)
AND
48 52 47 49 49 47 50 48 50 49 32 49 51 58 49 57 (worked).
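The extra characters are invisible Unicode directional marks, U+200E LEFT-TO-RIGHT MARK (8206) and U+200F RIGHT-TO-LEFT MARK (8207), which can be removed using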
$formattedDateString = $value -replace '[^\p{L}\p{Nd}\:\/\ ]', ''
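As an illustration only, here is a small Python sketch of the same diagnosis and fix. The sample string is rebuilt from the character codes quoted above; the helper names describe_chars and strip_format_chars are invented for this example. Dropping every character in Unicode category Cf (format characters) is one possible filter and an assumption of this sketch, not what the -replace expression does.

```python
import unicodedata
from datetime import datetime

# Character codes of the failing string, as listed above.
codes = [8206, 48, 52, 47, 8206, 49, 49, 47, 8206, 50, 48, 50, 49,
         32, 8207, 8206, 49, 51, 58, 49, 57]
sample = "".join(chr(c) for c in codes)


def describe_chars(text):
    """Pair each character code with its Unicode name to expose invisible characters."""
    return [(ord(ch), unicodedata.name(ch, "UNKNOWN")) for ch in text]


def strip_format_chars(text):
    """Drop format characters (category Cf), such as directional marks."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


cleaned = strip_format_chars(sample)
parsed = datetime.strptime(cleaned, "%d/%m/%Y %H:%M")
print(describe_chars(sample)[0])
print(repr(cleaned), parsed)
```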

I have data in hex dump but don't know the encoding. Eg. 0x91 0x05 = 657

I have some data as a hex dump.
The left-hand side is the decimal value and the right-hand side is the hex dump.
16 = 10
51 = 33
164 = A4 01
388 = 84 03
570 = BA 04
657 = 91 05
1025 = 81 08
246172 = 9C 83 0F
How can I convert any such hex dump to its decimal value?
In Perl, I tried to use ord() but it doesn't work.
Update
I don't know what this encoding is called. It looks like 7-bit data. I tried to build a formula in Excel that looks like this:
DEC = hex2dec(X) + (128^1 * hex2dec(Y-1)) + (128^2 * hex2dec(Z-1)) + ...

What you have is a variable-length encoding. The length is encoded using a form of sentinel value: each byte of the encoded number except the last has its high bit set. The remaining 7 bits of each byte hold the binary representation of the number, in little-endian order (least-significant 7-bit group first).
0xxxxxxx ⇒ 0xxxxxxx
1xxxxxxx 0yyyyyyy ⇒ 00yyyyyy yxxxxxxx
1xxxxxxx 1yyyyyyy 0zzzzzzz ⇒ 000zzzzz zzyyyyyy yxxxxxxx
etc
The following can be used to decode a stream:
use strict;
use warnings;
use feature qw( say );
sub extract_first_num {
$_[0] =~ s/^([\x80-\xFF]*[\x00-\x7F])//
or return;
my $encoded_num = $1;
my $num = 0;
for (reverse unpack 'C*', $encoded_num) {
$num = ( $num << 7 ) | ( $_ & 0x7F );
}
return $num;
}
my $stream_buf = "\x10\x33\xA4\x01\x84\x03\xBA\x04\x91\x05\x81\x08\x9C\x83\x0F";
while ( my ($num) = extract_first_num($stream_buf) ) {
say $num;
}
die("Bad data") if length($stream_buf);
Output:
16
51
164
388
570
657
1025
246172
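For comparison, here is a minimal Python sketch of the same decoding, plus the matching encoder, assuming only non-negative integers. The function names decode_varints and encode_varint are made up for this example. Unlike the Perl version, it accumulates the 7-bit groups from the least-significant end with a growing shift instead of reversing the bytes.

```python
def decode_varints(data):
    """Decode a byte stream of little-endian 7-bit groups (high bit = more bytes follow)."""
    values = []
    num = 0
    shift = 0
    for byte in data:
        num |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7          # continuation bit set: the next byte is more significant
        else:
            values.append(num)  # high bit clear: this was the last byte of the number
            num, shift = 0, 0
    if shift:
        raise ValueError("stream ends in the middle of a number")
    return values


def encode_varint(n):
    """Encode a non-negative integer in the same format."""
    out = bytearray()
    while True:
        low = n & 0x7F
        n >>= 7
        if n:
            out.append(low | 0x80)
        else:
            out.append(low)
            return bytes(out)


stream = bytes.fromhex("1033A4018403BA04910581089C830F")
decoded = decode_varints(stream)
print(decoded)
```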

Why is the output the way it is? -Splitting and chop

I'm having trouble understanding the output of the code below.
1. Why is the output Jo, Al, Ch and Sa? Doesn't chop remove the last character of a string and return that character, so shouldn't the output be n, i, n and y?
2. What is the purpose of the $firstline=0; line in the code?
3. What exactly is happening at these lines?
foreach (@data)
{
($name,$age)=split(//,$_);
print "$name $age \n";
The output of the following code is
Data in file is:
J o
A l
C h
S a
The file contents are:
NAME AGE
John 26
Ali 21
Chen 22
Sally 25
The code:
#!/usr/bin/perl
my ($firstline,
@data,
$data);
open (INFILE,"heading.txt") or die $.;
while (<INFILE>)
{
if ($firstline)
{
$firstline=0;
}
else
{
chop(@data=<INFILE>);
}
print "Data in file is: \n";
foreach (@data)
{
($name,$age)=split(//,$_);
print "$name $age\n";
}
}

There are a few issues with this script, but first I will answer your points.
chop removes the last character of a string and returns the character it removed. In your data file "heading.txt" every line probably ends with \n, so that is what chop removes. The return value of chop is discarded in your script, so the chopped character is never printed. It is always recommended to use chomp instead, which removes only a trailing newline.
You can verify what is the last character of the line by running the command below:
od -bc heading.txt
0000000 116 101 115 105 040 101 107 105 012 112 157 150 156 040 062 066
N A M E A G E \n J o h n 2 6
0000020 012 101 154 151 040 062 061 012 103 150 145 156 040 062 062 012
\n A l i 2 1 \n C h e n 2 2 \n
0000040 123 141 154 154 171 040 062 065 012
S a l l y 2 5 \n
0000051
You can see \n
$firstline serves no purpose because it is never set to 1, so you can remove the if/else block.
The first line loops over the elements of the array @data one by one. The second line splits the current element into individual characters, assigns the first two characters to $name and $age, and discards the rest. The last line prints those two captured characters.
IMO, the second line should split on a space to actually capture the name and the age.
So the final script should look like:
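#!/usr/bin/perl
use strict;
use warnings;
# Three-argument open with a lexical filehandle
open(my $in, '<', 'heading.txt') or die "Can't open heading.txt: $!";
my $header = <$in>;        # read and discard the "NAME AGE" heading line
chomp(my @data = <$in>);   # read the remaining lines and strip their newlines
close($in);
print "Data in file is: \n";
foreach (@data) {
    my ($name, $age) = split(' ', $_);   # split on whitespace
    print "$name $age\n";
}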
Output:
Data in file is:
John 26
Ali 21
Chen 22
Sally 25
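To see why the original script printed "J o", the per-character split can be mimicked in Python. This is only an analogy to the Perl behaviour described above, using one sample line from the file: slicing off the last character stands in for chop, and list() stands in for split(//).

```python
line = "John 26\n"

# Like chop: remove the last character (here the newline).
chopped = line[:-1]

# Like split(//, $_): break the string into single characters. Assigning to
# ($name, $age) keeps only the first two characters and discards the rest.
chars = list(chopped)
name_char, age_char = chars[0], chars[1]
print(name_char, age_char)

# Splitting on whitespace instead yields the two intended fields.
name, age = chopped.split()
print(name, age)
```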

Hex conversion of GUID removes Zeros?

I have a script that reads the byte values of a GUID from the registry and converts them to a hex string. However, the conversion is removing zeros, as my output below shows. These are the registry reads for the values:
$GUIDLocal1 = (Get-ItemProperty "$Reg\Common Api")."UniqueID"
$GUIDLocal2 = (Get-ItemProperty "$Reg\Test\Common Api")."UniqueID"
# This is not code below just info
$GUIDLocal1 is 54 171 225 63 61 204 14 79 168 61 49 246 193 140 121 152
$GUIdlocal2 is 54 171 225 63 61 204 14 79 168 61 49 246 193 140 121 152
ID in Database is 36ABE13F3DCC0E4FA83D31F6C18C7998
$guidinhex 36ABE13F3DCCE4FA83D31F6C18C7998
$guidinhex2 36ABE13F3DCCE4FA83D31F6C18C7998
# This is not code above just info
I am using this code for the conversion
$guidinHex = [string]::Empty
$guidinHex2 = [string]::Empty
$GUIDLocal1 | % { $guidInHEX += '{0:X}' -f [int]$_ }
$GUIDLocal2 | % { $guidInHEX2 += '{0:X}' -f [int]$_ }
ID is the GUID with all {, }, and - removed for ease of viewing.
$GUIDLocal1 and $GUIDLocal2 are the values read from the registry.
I then use the code above to convert them ($guidinHex and $guidinHex2 are the converted values of $GUIDLocal1 and $GUIDLocal2).
The conversion works, but if there is a zero it strips it out, as you can see above. On this machine the GUID actually matches the registry values, but my conversion is skewing the result. I just need to know why, and how to stop the conversion from removing the zeros.
I thought adding [int] would help, but to no avail.

The -f (format) operator lets you format a numeric value as a hexadecimal string with leading zeros. The format specification is {0:Xn} or {0:xn}, where n is the number of digits desired in the string output (padded with zeros if needed). Uppercase X or lowercase x specifies whether you want the hex values A through F to be uppercase or lowercase. Examples:
"{0:X2}" -f 15 # outputs string 0F
"{0:X3}" -f 27 # outputs string 01B
"{0:x4}" -f 254 # outputs string 00fe
...and so forth. For the byte values in the question, use '{0:X2}' so that 14 becomes 0E rather than E. Documentation here:
https://learn.microsoft.com/en-us/dotnet/standard/base-types/standard-numeric-format-strings#XFormatString
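The effect of the missing width is easy to reproduce. Here is a short Python sketch (an illustration, not PowerShell) using the byte values from the question: formatting each byte without a width drops the leading zero of 14 (0x0E), while a two-digit zero-padded format reproduces the ID stored in the database.

```python
guid_bytes = [54, 171, 225, 63, 61, 204, 14, 79,
              168, 61, 49, 246, 193, 140, 121, 152]

# No width: 14 is rendered as "E", so the result is one digit short.
unpadded = "".join(format(b, "X") for b in guid_bytes)

# Width 2 with zero padding: 14 is rendered as "0E".
padded = "".join(format(b, "02X") for b in guid_bytes)

print(unpadded)
print(padded)
```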

Matlab: delete complete line in txt-file if there is a non-ascii-character

I'm currently writing MATLAB code to plot measurement data. Unfortunately there is a hardware problem with the serial communication, and sometimes I receive just gibberish. My code works only for well-defined data, so this gibberish has to be removed. I want something like this pseudocode:
for eachLine
if currentLineContainsNonASCII
delete completeLine
end if
end for
the data is read like this
rawdataInputFilename = 'measurementData.txt';
fileID = fopen(rawdataInputFilename);
% load data as string
DataCell = textscan(fileID,'%s %s %s %s %s %s %s %s %s %s %s %s %s %s %s','HeaderLines', 1);
I was thinking about first creating a new 'clean' file with only ASCII characters and then reading that file with my actual plotting code.
Where I am stuck is how to identify a non-ASCII character and then delete the whole line, not just overwrite that single character.
Here is some example data. The first and third lines are 'clean' and can be handled by the current code. The second line has non-ASCII characters in it and therefore kills my code. Whitespace characters are Windows line feed, tab and space.
61 380 Module03 Slot02 27.01.2015 13:47:13 450 3587 1175 84 101.83 22.30 5.20 1 1
62 386 Module03 Slot03 27.01.2015 13:47:18 450ÆăǳШШ 106.83 22.30 25.20 1 1
63 391 Module03 Slot04 27.01.2015 13:47:24 ERROR dgsf 5643332 103.26 22.40 25.20 1 1

You can just check whether each received character is in the printable ASCII range [32, 126]; otherwise skip it.
The following function will tell you if there is any non-printable character in a given string:
function R = has_non_printable_characters(str)
% Keep only the printable ASCII characters (codes 32 to 126)
str2 = str(31<str & str<127);
% The input contained a non-printable character if the filtered string is shorter
R = (length(str) > length(str2));
end
Note that tab (code 9) falls outside this range, so allow it explicitly if your data uses tabs as separators.
If instead of just skipping the entire string you want to remove the non-printable characters and keep the printable ones, modify the function to return str2. (And change the function name so it matches the new behaviour.)
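The same per-character check can be sketched in Python, shown here only to illustrate the technique on shortened versions of the sample lines. The names has_non_printable and keep_clean_lines are hypothetical, and allowing tab in addition to codes 32 to 126 is an assumption based on the question's remark that the data contains tabs.

```python
def has_non_printable(text, allowed_extra="\t"):
    """True if any character is outside printable ASCII (32..126) and not explicitly allowed."""
    return any(not (32 <= ord(ch) <= 126 or ch in allowed_extra) for ch in text)


def keep_clean_lines(lines):
    """Drop every line that contains at least one disallowed character."""
    return [line for line in lines if not has_non_printable(line)]


sample = [
    "61\t380\tModule03\tSlot02\t450\t101.83",
    "62\t386\tModule03\tSlot03\t450ÆăǳШШ\t106.83",
    "63\t391\tModule03\tSlot04\tERROR dgsf\t103.26",
]
clean = keep_clean_lines(sample)
print(clean)
```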

There are several ways to do it.
Save that to a text file named data.txt:
bla Header bla
61 380 Module03 Slot02 27.01.2015 13:47:13 450 3587 1175 84 101.83 22.30 5.20 1 1
62 386 Module03 Slot03 27.01.2015 13:47:18 450ÆăǳШШ 106.83 22.30 25.20 1 1
63 391 Module03 Slot04 27.01.2015 13:47:24 ERROR dgsf 5643332 103.26 22.40 25.20 1 1
Method 1 (using textscan and cellfun):
Removing the non-ASCII line completely:
fileID = fopen('data.txt'); % open file
DataCell = textscan(fileID,'%s','delimiter','','HeaderLines', 1); % read a complete line of text, ignore the first line
fclose(fileID); % close file
DataCell = DataCell{1}; % there is only one string per line
DataCell(cellfun(@(x) any(x>127),DataCell)) = []; % remove line if there is any non-ASCII in it, adjust that to your liking, e.g. (x>126 | x<32)
celldisp(DataCell)
DataCell{1} =
61 380 Module03 Slot02 27.01.2015 13:47:13 450 3587 1175 84 101.83 22.30 5.20 1 1
DataCell{2} =
63 391 Module03 Slot04 27.01.2015 13:47:24 ERROR dgsf 5643332 103.26 22.40 25.20 1 1
You could now loop over the cell array or, if you like, start all over again with the updated text (e.g. as input to textscan). To do that, join the cells together into one big chunk of text:
strjoin(DataCell','\n')
ans =
61 380 Module03 Slot02 27.01.2015 13:47:13 450 3587 1175 84 101.83 22.30 5.20 1 1
63 391 Module03 Slot04 27.01.2015 13:47:24 ERROR dgsf 5643332 103.26 22.40 25.20 1 1
Method 2 (using regexprep):
This loads the whole text file at once and replaces with an empty string '' every line that contains a character outside a given set.
s = fileread('data.txt');
snew = regexprep(s, '.*[^\w\s.:].*\n', '', 'dotexceptnewline')
snew =
61 380 Module03 Slot02 27.01.2015 13:47:13 450 3587 1175 84 101.83 22.30 5.20 1 1
63 391 Module03 Slot04 27.01.2015 13:47:24 ERROR dgsf 5643332 103.26 22.40 25.20 1 1
The [^\w\s.:] bit basically translates to:
Match any character which is not (the ^ means not):
alphabetic, numeric or underscore (\w)
whitespace (\s)
a dot . or
a colon :
If you want to exclude any other ASCII character, just add it (to within the brackets).
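For comparison, the same whitelist idea can be sketched with Python's re module, on shortened versions of the sample lines. One difference is worth stating as an assumption of this sketch: in Python 3, \w matches Unicode letters by default, so the re.ASCII flag is needed for characters such as Æ to count as unexpected. The function name drop_lines_with_unexpected_chars is made up.

```python
import re

# A line is removed if it contains any character other than ASCII word
# characters, whitespace, a dot, or a colon. "." does not match a newline
# by default, so each match stays within a single line.
UNEXPECTED_LINE = re.compile(r".*[^\w\s.:].*\n", flags=re.ASCII)


def drop_lines_with_unexpected_chars(text):
    return UNEXPECTED_LINE.sub("", text)


line1 = "61 380 Module03 Slot02 27.01.2015 13:47:13 450 101.83\n"
line2 = "62 386 Module03 Slot03 27.01.2015 13:47:18 450ÆăǳШШ 106.83\n"
line3 = "63 391 Module03 Slot04 27.01.2015 13:47:24 ERROR dgsf 103.26\n"
snew = drop_lines_with_unexpected_chars(line1 + line2 + line3)
print(snew)
```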

Here is the code which creates a new txt file without the lines that contain non-ASCII characters:
%% read in via GUI
[inputFilename, inputPathname] = uigetfile('*.txt', ...
    'Pick a .txt file from which you want to remove lines with non ASCII characters.');
if isequal(inputFilename, 0)
    disp('User selected ''Cancel''')
    return; % nothing to process
end
disp(['User selected ', fullfile(inputPathname, inputFilename)])
inputFileID = fopen(fullfile(inputPathname, inputFilename)); % open input file
tempCell = strsplit(inputFilename, '.');
inputFilenameWOextension = cell2mat(tempCell(1));
fileExtension = cell2mat(tempCell(2));
outputFileID = fopen([inputFilenameWOextension, '_ASCIIonly.', fileExtension], 'w'); % overwrite existing file
% get the first line of text
tline = fgetl(inputFileID);
while ischar(tline) % fgetl returns the number -1 at end of file
    % keep only characters below 127 (this also keeps tab)
    tempStr = tline(tline<127);
    %tempStr = tline(31<tline & tline<127); % printable ASCII only
    if length(tempStr) == length(tline)
        % nothing was removed, so the line is clean: copy it to the output
        fprintf(outputFileID, '%s\r\n', tline);
    end
    % get the next line of text
    tline = fgetl(inputFileID);
end
fclose(inputFileID);
fclose(outputFileID);
