I have two problems converting data using Perl? - perl

I want to convert this input {111,222},{333,444},{555,666} so that the output be like this 111;222;333;444;555;666. The same thing goes for {111,222},{333,444} to 111;222;333;444 and {111,222} to 111;222.
I want to convert this date format from 2019-11-12 15:25:19 to 11/12/2019 15:25:19 and 2019-11-04T15:26:11.000+00:00 to 11/12/2019 15:25:19.

You could use a simple script like this:
#!/usr/bin/perl -pl
s/^\{//;
s/\}$//;
s/\}?,\{?/;/g;
The shebang line says read input lines and print the modified form of each one (-p) and handle line endings automatically (-l).
The first s/// removes a leading {. The second removes a trailing }. The last replaces },{ and , with ; across the line.
You could use a simple script like this (assuming that you really want 2019-11-04T15:26:11.000+00:00 mapped to 11/04/2019 15:26:11 and not the other value which is 8 days bar 52 seconds after the input value):
#!/usr/bin/perl -pl
s%(\d{4})-(\d\d)-(\d\d)[T ](\d\d:\d\d:\d\d).*%$2/$3/$1 $4%;
The shebang line is the same as before. The regex looks for 4 digits, dash, 2 digits, dash, 2 digits, blank or T, an hh:mm:ss string and stray junk and reformats it in the correct sequence.
Between them, they produce the desired results.
Of course, this being Perl, TMTOWTDI — timtoady, or "there's more than one way to do it":
$ cat data
{111,222},{333,444},{555,666}
{111,222},{333,444}
{111,222}
$ cat data.2
2019-11-12 15:25:19
2019-11-04T15:26:11.000+00:00
$ perl -pl -e 's/^\{//; s/\}$//; s/\}?,\{?/;/g; s%(\d{4})-(\d\d)-(\d\d)[T ](\d\d:\d\d:\d\d).*%$2/$3/$1 $4%;' data data.2
111;222;333;444;555;666
111;222;333;444
111;222
11/12/2019 15:25:19
11/04/2019 15:26:11
$

Just transliterating sequences of non-digits to ;:
$data =~ y/0-9/;/cs;

Related

Extract filename from multiple lines in unix

I'm trying to extract the name of the file name that has been generated by a Java program. This Java program spits out multiple lines and I know exactly what the format of the file name is going to be. The information text that the Java program is spitting out is as follows:
ABCASJASLEKJASDFALDSF
Generated file YANNANI-0008876_17.xml.
TDSFALSFJLSDJF;
I'm capturing the output in a variable and then applying a sed operator in the following format:
sed -n 's/.*\(YANNANI.\([[:digit:]]\).\([xml]\)*\)/\1/p'
The result set is:
YANNANI-0008876_17.xml.
However, my problem is that want the extraction of the filename to stop at .xml. The last dot should never be extracted.
Is there a way to do this using sed?
Let's look at what your capture group actually captures:
$ grep 'YANNANI.\([[:digit:]]\).\([xml]\)*' infile
Generated file YANNANI-0008876_17.xml.
That's probably not what you intended:
\([[:digit:]]\) captures just a single digit (and the capture group around it doesn't do anything)
\([xml]\)* is "any of x, m or l, 0 or more times", so it matches the empty string (as above – or the line wouldn't match at all!), x, xx, lll, mxxxxxmmmmlxlxmxlmxlm, xml, ...
There is no way the final period is removed because you don't match anything after the capture groups
What would make sense instead:
Match "digits or underscores, 0 or more": [[:digit:]_]*
Match .xml, literally (escape the period): \.xml
Make sure the rest of the line (just the period, in this case) is matched by adding .* after the capture group
So the regex for the string you'd like to extract becomes
$ grep 'YANNANI.[[:digit:]_]*\.xml' infile
Generated file YANNANI-0008876_17.xml.
and to remove everything else on the line using sed, we surround regex with .*\( ... \).*:
$ sed -n 's/.*\(YANNANI.[[:digit:]_]*\.xml\).*/\1/p' infile
YANNANI-0008876_17.xml
This assumes you really meant . after YANNANI (any character).
You can call sed twice: first in printing and then in replacement mode:
sed -n 's/.*\(YANNANI.\([[:digit:]]\).\([xml]\)*\)/\1/p' | sed 's/\.$//g'
the last sed will remove all the last . at the end of all the lines fetched by your first sed
or you can go for a awk solution as you prefer:
awk '/.*YANNANI.[0-9]+.[0-9]+.xml/{print substr($NF,1,length($NF)-1)}'
this will print the last field (and truncate the last char of it using substr) of all the lines that do match your regex.

Perl one-liner: deleting a line with pattern matching

I am trying to delete bunch of lines in a file if they match with a particular pattern which is variable.
I am trying to delete a line which matches with abc12, abc13, etc.
I tried writing a C-shell script, and this is the code:
**!/bin/csh
foreach $x (12 13 14 15 16 17)
perl -ni -e 'print unless /abc$x/' filename
end**
This doesn't work, but when I use the one-liner without a variable (abc12), it works.
I am not sure if there is something wrong with the pattern matching or if there is something else I am missing.
Yes, it's the fact you're using single quotes. It means that $x is being interpreted literally.
Of course, you're also doing it very inefficiently, because you're processing each file multiple times.
If you're looking to remove lines abc12 to abc17 you can do this all in one go:
perl -n -i.bak -e 'print unless m/abc1[234567]/' filename
Try this
perl -n -i.bak -e 'print unless m/abc1[2-7]/' filename
using the range [2-7] only removes the need to type [234567] which has the effect of saving you three keystrokes.
man 1 bash: Pattern Matching
[...] Matches any one of the enclosed characters. A pair of characters separated by a hyphen denotes a range expression; any character that sorts between those two characters, inclusive, using the current locale's collating sequence and character set, is matched. If the first character following the [ is a ! or a ^ then any character not enclosed is matched.
A - may be matched by including it as the first or last character in the set. A ] may be matched by including it as the first character in the set.

What does \x do in print

I would like to start by saying that I am not familiar with Perl. That being said, I came across this piece of code and I could not figure out what the \x was for in the code below. In addition, I was unsure why nothing was displayed when I ran the following:
perl -e 'print "\x7c\x8e\x04\x08"'
It's not about print: it's about string representation, in which codes represent characters from your character set. For more information you should read Quote and Quote-like Operators and Effects of Character Semantics
In your case the character code is in hex. You should look in your character set table, and you may need to convert to decimal first.
You said "I was unsure why nothing was displayed when I ran the following:"
perl -e 'print "\x7c\x8e\x04\x08"'
That command outputs 4 characters to STDOUT. Each of the characters is specified in hexadecimal. The "\x7c" part will output the vertical bar character |. The other three characters are control characters, so probably wouldn't produce any visible output. If you redirect output to a file, you will end up with a 4 byte file.
It's possible that you're not seeing the vertical bar character because it's being overwritten by your command prompt. Unlike the shell echo or Python's print, Perl's print function does not automatically append a newline to all output. If you want new lines, you can insert them in the string using \n.
\x signifies the start of a hexadecimal character notation.

Using command line to remove text?

I have a huge file that contains lines that follow this format:
New-England-Center-For-Children-L0000392290
Southboro-Housing-Authority-L0000392464
Crew-Star-Inc-L0000391998
Saxony-Ii-Barber-Shop-L0000392491
Test-L0000392334
What I'm trying to do is narrow it down to just this:
New-England-Center-For-Children
Southboro-Housing-Authority
Crew-Star-Inc
Test
Can anyone help with this?
Using GNU awk:
awk -F\- 'NF--' OFS=\- file
New-England-Center-For-Children
Southboro-Housing-Authority
Crew-Star-Inc
Saxony-Ii-Barber-Shop
Test
Set the input and output field separator to -.
NF contains number of fields. Reduce it by 1 to remove the last field.
Using sed:
sed 's/\(.*\)-.*/\1/' file
New-England-Center-For-Children
Southboro-Housing-Authority
Crew-Star-Inc
Saxony-Ii-Barber-Shop
Test
Simple greedy regex to match up to the last hyphen.
In replacement use the captured group and discard the rest.
Version 1 of the Question
The first version of the input was in the form of HTML and parts had to be removed both before and after the desired text:
$ sed -r 's|.*[A-Z]/([a-zA-Z-]+)-L0.*|\1|' input
Special-Restaurant
Eliot-Cleaning
Kennedy-Plumbing
Version 2 of the Question
In the revised question, it is only necessary to remove the text that starts with -L00:
$ sed 's|-L00.*||' input2
New-England-Center-For-Children
Southboro-Housing-Authority
Crew-Star-Inc
Saxony-Ii-Barber-Shop
Test
Both of these commands use a single "substitute" command. The command has the form s|old|new|.
The perl code for this would be: perl -nle'print $1 if(m{-.*?/(.*?-.*?)-})
We can break the Regex down to matching the following:
- for that's between the city and state
.*? match the smallest set of character(s) that makes the Regex work, i.e. the State
/ matches the slash between the State and the data you want
( starts the capture of the data you are interested in
.*?-.*? will match the data you care about
) will close out the capture
- will match the dash before the L####### to give the regex something to match after your data. This will prevent the minimal Regex from matching 0 characters.
Then the print statement will print out what was captured (your data).
awk likes these things:
$ awk -F[/-] -v OFS="-" '{print $(NF-3), $(NF-2)}' file
Special-Restaurant
Eliot-Cleaning
Kennedy-Plumbing
This sets / and - as possible field separators. Based on them, it prints the last_field-3 and last_field-2 separated by the delimiter -. Note that $NF stands for last parameter, hence $(NF-1) is the penultimate, etc.
This sed is also helpful:
$ sed -r 's#.*/(\w*-\w*)-\w*\.\w*</loc>$#\1#' file
Special-Restaurant
Eliot-Cleaning
Kennedy-Plumbing
It selects the block word-word after a slash / and followed with word.word</loc> + end_of_line. Then, it prints back this block.
Update
Based on your new input, this can make it:
$ sed -r 's/(.*)-L\w*$/\1/' file
New-England-Center-For-Children
Southboro-Housing-Authority
Crew-Star-Inc
Saxony-Ii-Barber-Shop
Test
It selects everything up to the block -L + something + end of line, and prints it back.
You can use also another trick:
rev file | cut -d- -f2- | rev
As what you want is every slice of - separated fields, let's get all of them but last one. How? By reversing the line, getting all of them from the 2nd one and then reversing back.
Here's how I'd do it with Perl:
perl -nle 'm{example[.]com/bp/(.*?)/(.*?)-L\d+[.]htm} && print $2' filename
Note: the original question was matching input lines like this:
<loc>http://www.example.com/bp/Lowell-MA/Special-Restaurant-L0000423916.htm</loc>
<loc>http://www.example.com/bp/Houston-TX/Eliot-Cleaning-L0000422797.htm</loc>
<loc>http://www.example.com/bp/New-Orleans-LA/Kennedy-Plumbing-L0000423121.htm</loc>
The -n option tells Perl to loop over every line of the file (but not print them out).
The -l option adds a newline onto the end of every print
The -e 'perl-code' option executes perl-code for each line of input
The pattern:
/regex/ && print
Will only print if the regex matches. If the regex contains capture parentheses you can refer to the first captured section as $1, the second as $2 etc.
If your regex contains slashes, it may be cleaner to use a different regex delimiter ('m' stands for 'match'):
m{regex} && print
If you have a modern Perl, you can use -E to enable modern feature and use say instead of print to print with a newline appended:
perl -nE 'm{example[.]com/bp/(.*?)/(.*?)-L\d+[.]htm} && say $2' filename
This is very concise in Perl
perl -i.bak -lpe's/-[^-]+$//' myfile
Note that this will modify the input file in-place but will keep a backup of the original data in called myfile.bak

Bash/AWK/SED Match and ReWrite a string of numbers (date) in a line

I have a text file with the following contents repeating about 60 times coming from a converted .ics file:
Start Vak
Tijd van: 20120411T093000Z
Tijd tot: 20120411T100000Z
Klas(sen) en Docent(en): VPOS0A1 VPOS0A2 Mariel Kers
Vak: Ex. Verst. beperk.
Lokaal: 7.05
Einde Vak
I want to rewrite the "Tijd van" and "Tijd tot" values to become a good date (in a bash script on a gnu/linux system with awk,sed and grep etc.). I tried to use awk to find it:
awk '/^Tijd.*[:digit:][:digit:]Z$/; { getline; print $0; }' rooster2.txt
and grep:
egrep '/^Tijd(.*)[:digit:][:digit:]Z$/' rooster2.txt
But they both do not even find the line.
What I want is to get that date rewritten to a more bash parsable/feasible time format like the EPOCH or something like 31.04.2012 13:00:00. I do not want to replace or rewrite the whole line, just the specific string! Anything, either tips, examples or links are welcome and very usefull.
Try this (GNU sed):
sed -r 's/(Tijd ...: )(....)(..)(..).(..)(..)(..)./\1 \4.\3.\2 \5:\6:\7/' FILE
There are several issues with your awk code:
While [:digit:] refers to "any digit", you still need another pair of square brackets ([...]) for the character group: [[:digit:]] (Just image you wanted "a,any digit or _" , this would be [a[:digit:]_], the outer square brackets defining the character group.)
The semicolon (;) between your pattern (/.../) and the corresponding action ({...}) separates the two, so you have a pattern with no action, resulting in the standard action {print $0}, and a second action without a pattern, resulting in it being performed for all records (i.e. lines).
The getline asks awk to read the next record (i.e. line) before continuing.
Taking all that together your code does the following:
Print all lines matching /^Tijd.*[:digit:][:digit:]Z$/ (that is none, since [:digit:] translates to "one of :,d,i,g, or t").
Additionally, for all lines: read the next line and print it.
Thus, it will print all but the first line (because that is the only one that is not the next one to any other line).
Assuming you just want to print the lines matching "starting with 'Tijd' and ending with two digits followed by a 'Z'" you could use the following code:
awk '/^Tijd.*[[:digit:]][[:digit:]]Z$/{ print $0; }' rooster2.txt
Since {print $0} is the standard action you could even shorten that to
awk '/^Tijd.*[[:digit:]][[:digit:]]Z$/' rooster2.txt
To solve your actual problem you could use something like the following:
awk '/^Tijd.*[[:digit:]][[:digit:]]Z$/{year=substr($NF,1,4);month=substr($NF,5,2);day=substr($NF,7,2);hour=substr($NF,10,2);min=substr($NF,12,2);sec=substr($NF,14,2);$NF=day"."month"."year" "hour":"min":"sec}1' rooster2.txt
This works as follows:
For records (i.e lines) matching the pattern (/.../), rearrange the last field ($NF) to your needs.
Print all records (i.e. lines) (1 is a pattern matching all records (i.e. lines) with no specified action, resulting in the standard one ({print $0}))
Note that GNU awk also has a strftime function.
However, that needs the timestamp to be in a different format.
If you want to use that you must still rearrange the field, first:
awk -v FORMAT="%c" '/^Tijd.*[[:digit:]][[:digit:]]Z$/{$NF=strftime(FORMAT,mktime(substr($NF,1,4)" "substr($NF,5,2)" "substr($NF,7,2)" "substr($NF,10,2)" "substr($NF,12,2)" "substr($NF,14,2)))}1' rooster2.txt
Now, you just need to adjust FORMAT to your needs to change the format.
See man strftime for details.
As a ruby one-liner; requiring time for Time.parse then replacing
matching regexp. You may look strftime method for formatting time
output.
[slmn#uriel ~]$ ruby -rtime -ne 'puts $_.sub(/(Tijd (van|tot): )(.*)/) { $1 + Time.parse($3).strftime("%D %T") }' < yourfile.txt
Start Vak
Tijd van: 04/11/12 09:30:00
Tijd tot: 04/11/12 10:00:00
Klas(sen) en Docent(en): VPOS0A1 VPOS0A2 Mariel Kers
Vak: Ex. Verst. beperk.
Lokaal: 7.05
Einde Vak