Perl pattern matching for a specific line - perl

I'm new to Perl and regular expressions. I need to read a line in an Arabic text file with IBM864 encoding by using a regular expression specific to that line in the file. The line structure is as follows:
16 whitespace characters, 4 Arabic characters, 36 whitespace characters, 3 digits, 2 whitespace characters, and a \n newline character.
Please advise.
Thank you.

Well, on the face of it, what you want is a regex like this:
/ ^ \s{16} \p{Arabic}{4} \s{36} \d{3} \s{2} $ /x

Related

Issue matching Chinese characters in Perl one liner using \p{script=Han}

I'm really stumped by trying to match Chinese characters using a Perl one-liner in zsh. I cannot get \p{script=Han} to match Chinese characters, but \P{script=Han} does.
Task:
I need to change this:
一
<lb/> 二
to this:
<tag ref="一二">一
<lb/> 二</tag>
There could be a variable number of tags, newlines, whitespaces, tabs, alphanumeric characters, digits, etc. between the two Chinese characters. I believe the most efficient and robust way to do this would be to look for something that is not a Chinese character.
My attempted solution:
perl -0777 -pi -e 's/(一)(\P{script=Han}*?)(二)/<tag ref="$1$3">$2<\/tag>/g'
This has the desired effect when applied to the example above.
Problem:
The issue I am having is that \P{script=Han} (or \p{^script=Han}) matches Chinese characters as well.
When I try to match \p{script=Han}, the regex matches nothing despite it being a file full of Chinese characters. When trying to match \P{script=Han}, the regex matches every character in the file.
I don't know why.
This is a problem because, in the following case, the output is not as desired:
一
<lb/> 三二
becomes
<tag ref="一二">一
<lb/> 三二</tag>
I don't want this to be matched at all, just instances where 一 and 二 are separated only by characters that are not Chinese characters.
Can anyone tell me what I'm doing wrong? Or suggest a workaround? Thanks!
When I try to match \p{script=Han}, the regex matches nothing despite it being a file full of Chinese characters.
The problem is that both your script and your input file are UTF-8 encoded, but you do not tell perl so. Unless told otherwise, perl treats both as sequences of single bytes rather than decoding them as UTF-8, so a multi-byte Chinese character never reaches the regex as a single Han character.
To declare that your script is UTF-8 encoded, use the utf8 pragma. To tell perl that all files you open are UTF-8 encoded, use the -CD command-line option. So the following one-liner should solve your problem:
perl -Mutf8 -CD -0777 -pi -e 's/(一)(\P{script=Han}*?)(二)/<tag ref="$1$3">$2<\/tag>/g' file

Unable to delete whitespace from string with tr, sed

I have a file that contains a whitespace character that I'm not able to successfully remove with command-line tools such as tr or sed. Here's the input:
2,  78 ,, 1
6, 74, ,1
and I want the output to look like:
2,78,,1
6,74,,1
Attempts
If I try tr -d "[[:space:]]" the result is 2, 78,,16,74,,1, which leaves a space character and removes the newline.
If I try sed 's/[[:space:]]//g' the result is
2, 78,,1
6,74,,1
which still leaves the space.
I converted the string to hex, and it seems the offending character is a0, but even then the results are not what I'd expect:
sed 's/\xa0//g' yields
2, �78 ,, 1
6, 74, ,1
Question
What is that whitespace character that is not getting caught by the [[:space:]] character class? How can I delete it?
The offending character is a UTF-8-encoded non-breaking space, with hex representation \xc2\xa0. You can remove all spaces, including non-breaking spaces, with
sed -E 's/[[:space:]]|\xc2\xa0//g'
Explanation
-E turns on extended regex to allow the | to represent logical OR
's/pattern/replacement/' substitutes pattern matches with the replacement text (in this case, an empty string), with /g repeating the pattern substitution multiple times per line
[[:space:]] matches most whitespace characters, including spaces and tabs
\xc2\xa0 is the hex code for the UTF-8 non-breaking space
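A minimal sketch of the same idea that avoids the \x escapes, in case the sed at hand does not support them: the two bytes of the non-breaking space are produced with printf in octal and spliced into the sed script. The names nbsp and strip_spaces are made up for this example, and the sample input mirrors the one in the question.

```shell
# The two bytes of a UTF-8 non-breaking space (0xC2 0xA0), written in octal
# so that any POSIX printf accepts them.
nbsp=$(printf '\302\240')

# Hypothetical helper: delete ordinary whitespace and non-breaking spaces per line.
# sed works line by line, so the newlines themselves are kept.
strip_spaces() {
  sed -e 's/[[:space:]]//g' -e "s/$nbsp//g"
}

printf '2,\302\240 78 ,, 1\n6, 74, ,1\n' | strip_spaces
```

With the sample input this should print 2,78,,1 and 6,74,,1 on separate lines.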
The characters you want to remove are the non-printable ones (i.e. the ones not in the [:print:] character class) rather than just the ones in the [:space:] character class:
$ printf 'foo\xc2\xa0bar\n' > file
$ cat file
foo bar
$ tr -dc '[:print:]' < file
foobar$
However, I notice that the equivalent doesn't work in GNU sed or GNU awk, and I don't know why.

Delete line with specific number of characters

I have a specific file (file.txt) with several lines.
How is it possible to delete all lines that do not have 12 characters, using sed?
Use an interval expression to specify the exact number of characters you want to match between the beginning (^) and end ($) of the input record.
sed '/^.\{12\}$/!d' file
Not sure why you would use sed. This is much cleaner in awk:
awk 'length == 12' file.txt
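To compare the two commands on a tiny input, here is a sketch with three made-up lines of 12, 5, and 13 characters; the sample function is hypothetical and stands in for file.txt.

```shell
# Three sample lines of 12, 5, and 13 characters (made up for illustration).
sample() { printf 'abcdefghijkl\nshort\nabcdefghijklm\n'; }

# sed: delete every line that is not exactly 12 characters long.
sample | sed '/^.\{12\}$/!d'

# awk: print only the lines whose length is exactly 12.
sample | awk 'length == 12'
```

Both commands should print only abcdefghijkl. Note that what counts as one character depends on the locale when the input contains multi-byte text.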

regex matching white spaces and non-characters in perl

I'm looking for pattern matching for the following.
White space at the start, followed by characters, then a decimal number like 3.2, and then symbols like $ and #.
For ex: " bash-3.2#"
My code:
while(@wait = $t->waitfor('/^[\s]bash\-3\.2[.] $/i'))
How do I do this?
Thanks,
Sharath
White space at start
^\s
followed by characters
\w+
and then a decimal number like 3.2
-?\d+\.\d+
and then followed by symbols like $ and #.
[\$\#]
So, something like this:
/^\s\w+-?\d+\.\d+[\$\#]/
I assumed that the characters are typical word characters and that the number could be negative. In " bash-3.2#" the optional -? is what matches the hyphen between bash and 3.2.

sed: change word order and replace

I'm trying to replace:
randomtext{{XX icon}}
by
randomtext{{ref-XX}}
..in a file, where XX could be any sequence of 2 or 3 lowercase letters.
I attempted rearranging the word order with awk before replacing "icon" with "ref-" with sed:
awk '{print $2, $1}'
but since there is no space before the first word or after the second one, it messed up the curly brackets:
icon}} {{XX
What is the simplest way to achieve this using sed?
sed 's/{{\([a-z]\{2,3\}\)\sicon/{{ref-\1/'
This one-liner uses the substitute command s/PATTERN/REPLACE/. {{ matches two opening curly braces. \([a-z]\{2,3\}\) captures 2 or 3 lowercase letters. \s matches a whitespace character (a GNU sed extension). icon matches the literal string "icon". The match, that is, {{XX icon, is then replaced with the literal string {{ref- followed by the captured 2 or 3 letters.
Here's a more generic version using the hash sign (#) as the regex delimiter:
sed 's#{{\([^ ]*\) [^}]*#{{ref-\1#'
{{ anchors the regex at the double open curly braces.
\([^ ]*\) captures everything up until a space.
[^}]* eats everything up until a closing curly brace.
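Both substitutions can be checked on a sample line with the sketch below. The variable name line and the language code en are made up for illustration. The first command is a variant of the \s version that uses a literal space instead, on the assumption that the input always has a single plain space before icon; this keeps it usable with sed implementations that lack \s.

```shell
# Sample input line (made up for illustration).
line='randomtext{{en icon}}'

# Variant of the first command with a literal space in place of \s.
printf '%s\n' "$line" | sed 's/{{\([a-z]\{2,3\}\) icon/{{ref-\1/'

# The more generic command, using # as the delimiter.
printf '%s\n' "$line" | sed 's#{{\([^ ]*\) [^}]*#{{ref-\1#'
```

Each command should print randomtext{{ref-en}}. The closing }} is never part of the match, so it is carried over untouched.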