perl regex matched part as output filename

When I have a simple file like
Ann Math 99
Bob Math 100
Ann Chemistry 92
Ann History 78
I may split it into files per person with
awk '{print > $1}' input_filename
However, when the file becomes complex, it is no longer possible to do so unless I use a very complex regex as a field separator. I find that I can extract output filename with some regex, and the following command seems to be able to do what I want for a test with 5 lines:
sed 5q input_filename | perl -nle 'if(/\[([A-Za-z0-9_]+)\]/){open(FH,">","$1"); print FH $_; close FH}'
but the file is large and the command seems to be inefficient. Are there better ways to do it?
The original files look like this:
SOME_VERY_LONG_STUFF[TAG1]SOME_EVEN_LONGER_STUFF
SOME_VERY_LONG_STUFF[TAG2]SOME_EVEN_LONGER_STUFF
SOME_VERY_LONG_STUFF[TAG3]SOME_EVEN_LONGER_STUFF
SOME_VERY_LONG_STUFF[TAG1]SOME_EVEN_LONGER_STUFF
SOME_VERY_LONG_STUFF[TAG3]SOME_EVEN_LONGER_STUFF
...
and I just want to split it into files named TAG1, TAG2, TAG3, ..., where each file contains exactly those lines of the original file that have that tag in the brackets.
Here is the first line of the real file, with small modifications:
Nov 30 18:00:00 something#syslog: [2019-11-30 18:00:00][BattleEnd],{"result":1,"life":[[0,30,30],[1,30,30],[2,30,29],[3,30,29],[4,30,29],[5,28,29],[6,28,21],[7,28,21],[8,28,14],[9,28,14],[10,29,13],[11,21,13],[12,21,13],[13,15,13],[14,16,12],[15,12,12],[16,12,12],[17,9,12],[18,9,12],[19,5,12],[20,5,12],[21,3,12],[22,3,12],[23,1,12],[24,1,10],[25,1,10],[26,1,10],[27,1,10],[28,2,9],[29,-1,9]],"Info":[[160,0],[161,0],[162,0],[163,0],[155,0],[157,0],[158,0],[159,0]],"cards":[11401,11409,11408,12201,12208,10706,12002,10702,12207,12204,12001,12007,12208,10702,12005,10701,12005,11404,10705,10705,12007,11401,10706,12002,12001,12204,10701,12207,11404,11409,11408,12201]}
The tag I want is "BattleEnd". I want to split the log by log source.

EDIT: Since the OP changed the samples, I am adding this code, which is based entirely on the samples shown.
awk -F"[][]" '{print >> ($4);close($4)}' Input_file
Or, if you want to close an output file (to avoid a "too many open files" error) only when the current field differs from the previous one, try the following.
awk -F"[][]" 'prev!=$4{close(prev)} {print >> ($4);prev=$4}' Input_file
Could you please try the following, based on the samples you showed?
awk '
match($0,/[^]]*/){
  val=substr($0,RSTART,RLENGTH)
  sub(/.*\[/,"",val)
  print >> (val)
  close(val)
}
' Input_file
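For the simpler TAG-style lines, where the tag is the first bracketed item, the same idea can be sketched end to end as follows. The sample lines are shortened stand-ins for the question's data, and the scratch directory is only there to keep the demo self-contained; neither comes from the original post.

```shell
# Work in a scratch directory so the per-tag files do not clutter anything.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Shortened stand-ins for the question's lines (hypothetical contents).
printf '%s\n' 'aaa[TAG1]bbb' 'ccc[TAG2]ddd' 'eee[TAG1]fff' 'no tag here' > input_filename

# With "[" and "]" as separators the tag is field 2. Lines without a
# bracketed tag (NF < 3) are skipped. The previous output file is closed
# only when the tag changes, and ">>" appends when a file is reopened.
awk -F'[][]' '
    NF >= 3 {
        if (prev != "" && prev != $2) close(prev)
        print >> ($2)
        prev = $2
    }
' input_filename
```

Because `>>` appends, files left over from an earlier run should be removed first, otherwise the new lines are added to the old ones.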


Improving sed program - conditions

I use this code according to this question.
$ names=(file1.txt file2.txt file3.txt) # Declare array
$ printf 's/%s/a-&/g\n' "${names[@]%.txt}" # Generate sed replacement script
s/file1/a-&/g
s/file2/a-&/g
s/file3/a-&/g
$ sed -f <(printf 's/%s/a-&/g\n' "${names[@]%.txt}") f.txt
TEXT
\connect{a-file1}
\begin{a-file2}
\connect{a-file3}
TEXT
75
How can I add a condition that solves the following problem?
names=(file1.txt file2.txt file3file2.txt)
That is, one file name occurs as part of another file name, so a- gets added more than once.
I tried
sed -f <(printf 's/{%s}/{s-&}/g\n' "${files[@]%.tex}")
but the result is
\input{a-{file1}}
I need to match {%s} and place a- between { and %s.
It's not clear from the question how conflicting input should be resolved. In particular, the code replaces every instance of file1 with a-file1, even inside strings like 'foofile1'.
On the surface, the goal seems to be to change whole tokens only (e.g., foofile1 should not be affected by the file1 substitution). This can be achieved by adding a word-boundary assertion (\b, supported by GNU sed) before and after the file name, which prevents the pattern from matching inside longer file names.
printf 's/\\b%s\\b/a-&/g\n' "${names[@]%.txt}"
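An alternative sketch anchors on the surrounding braces instead of word boundaries, which is what the question's last attempt was aiming for. It captures the name between { and } and re-inserts it after a-, so each name is prefixed exactly once. The file names script.sed and f.txt and the sample contents are assumptions made for the demo, and names containing regex metacharacters would need escaping first.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample input in the style of the question (hypothetical contents).
printf '%s\n' 'TEXT' '\connect{file1}' '\begin{file2}' '\connect{file3file2}' > f.txt

# One sed command per name: match "{name}" as a whole and put "a-" after
# the "{". "\(...\)" is a capture group in basic regular expressions.
for f in file1.txt file2.txt file3file2.txt; do
    printf 's/{\\(%s\\)}/{a-\\1}/g\n' "${f%.txt}"
done > script.sed

result=$(sed -f script.sed f.txt)
printf '%s\n' "$result"
```

Since `{file2}` must match including both braces, it cannot match inside `{file3file2}`, so no name is prefixed twice.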
Since this explanation is too long for a comment, I am adding it as an answer. I am not sure whether my previous answer was clear, but it handles this case: it replaces exact file names only, not names that merely contain another file name.
Let's say the following are the array value and Input_file:
names=(file1.txt file2.txt file3file2.txt)
echo "${names[*]}"
file1.txt file2.txt file3file2.txt
cat file1
TEXT
\connect{file1}
\begin{file2}
\connect{file3}
TEXT
75
Now when we run following code:
awk -v arr="${names[*]}" '
BEGIN{
  FS=OFS="{"
  num=split(arr,array," ")
  for(i=1;i<=num;i++){
    sub(/\.txt/,"",array[i])
    array1[array[i]"}"]
  }
}
$2 in array1{
  $2="a-"$2
}
1
' file1
The output is as follows. Note that file3 is not replaced, since it is not in the array.
TEXT
\connect{a-file1}
\begin{a-file2}
\connect{file3}
TEXT
75

Using command line to remove text?

I have a huge file that contains lines that follow this format:
New-England-Center-For-Children-L0000392290
Southboro-Housing-Authority-L0000392464
Crew-Star-Inc-L0000391998
Saxony-Ii-Barber-Shop-L0000392491
Test-L0000392334
What I'm trying to do is narrow it down to just this:
New-England-Center-For-Children
Southboro-Housing-Authority
Crew-Star-Inc
Test
Can anyone help with this?
Using GNU awk:
awk -F\- 'NF--' OFS=\- file
New-England-Center-For-Children
Southboro-Housing-Authority
Crew-Star-Inc
Saxony-Ii-Barber-Shop
Test
Set the input and output field separator to -.
NF contains number of fields. Reduce it by 1 to remove the last field.
Using sed:
sed 's/\(.*\)-.*/\1/' file
New-England-Center-For-Children
Southboro-Housing-Authority
Crew-Star-Inc
Saxony-Ii-Barber-Shop
Test
Simple greedy regex to match up to the last hyphen.
In replacement use the captured group and discard the rest.
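The same "cut at the last hyphen" idea can also be sketched in plain POSIX shell with parameter expansion. This is an alternative to the awk and sed commands above, not part of them; the function name strip_last_field is made up for illustration.

```shell
# Remove the shortest trailing match of "-*", i.e. everything from the
# last hyphen onward. Reads lines on stdin, writes results to stdout.
strip_last_field() {
    while IFS= read -r line; do
        printf '%s\n' "${line%-*}"
    done
}

result=$(printf '%s\n' \
    'New-England-Center-For-Children-L0000392290' \
    'Saxony-Ii-Barber-Shop-L0000392491' \
    'Test-L0000392334' | strip_last_field)
printf '%s\n' "$result"
```

A shell loop is usually much slower than sed or awk on a huge file, so this is mainly useful when the value is already in a shell variable.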
Version 1 of the Question
The first version of the input was in the form of HTML and parts had to be removed both before and after the desired text:
$ sed -r 's|.*[A-Z]/([a-zA-Z-]+)-L0.*|\1|' input
Special-Restaurant
Eliot-Cleaning
Kennedy-Plumbing
Version 2 of the Question
In the revised question, it is only necessary to remove the text that starts with -L00:
$ sed 's|-L00.*||' input2
New-England-Center-For-Children
Southboro-Housing-Authority
Crew-Star-Inc
Saxony-Ii-Barber-Shop
Test
Both of these commands use a single "substitute" command. The command has the form s|old|new|.
The Perl code for this would be: perl -nle 'print $1 if m{-.*?/(.*?-.*?)-}'
We can break the regex down into the following parts:
- matches the hyphen between the city and the state
.*? match the smallest set of character(s) that makes the Regex work, i.e. the State
/ matches the slash between the State and the data you want
( starts the capture of the data you are interested in
.*?-.*? will match the data you care about
) will close out the capture
- will match the dash before the L####### to give the regex something to match after your data. This will prevent the minimal Regex from matching 0 characters.
Then the print statement will print out what was captured (your data).
awk likes these things:
$ awk -F[/-] -v OFS="-" '{print $(NF-3), $(NF-2)}' file
Special-Restaurant
Eliot-Cleaning
Kennedy-Plumbing
This sets / and - as possible field separators. Based on them, it prints the fields at positions NF-3 and NF-2, separated by the delimiter -. Note that $NF stands for the last field, hence $(NF-1) is the penultimate one, etc.
This sed is also helpful:
$ sed -r 's#.*/(\w*-\w*)-\w*\.\w*</loc>$#\1#' file
Special-Restaurant
Eliot-Cleaning
Kennedy-Plumbing
It selects the block word-word after a slash / and followed with word.word</loc> + end_of_line. Then, it prints back this block.
Update
Based on your new input, this can make it:
$ sed -r 's/(.*)-L\w*$/\1/' file
New-England-Center-For-Children
Southboro-Housing-Authority
Crew-Star-Inc
Saxony-Ii-Barber-Shop
Test
It selects everything up to the block -L + something + end of line, and prints it back.
You can use also another trick:
rev file | cut -d- -f2- | rev
Since what you want is every --separated field except the last one, reverse each line, take everything from the 2nd field on, and then reverse the result back.
Here's how I'd do it with Perl:
perl -nle 'm{example[.]com/bp/(.*?)/(.*?)-L\d+[.]htm} && print $2' filename
Note: the original question was matching input lines like this:
<loc>http://www.example.com/bp/Lowell-MA/Special-Restaurant-L0000423916.htm</loc>
<loc>http://www.example.com/bp/Houston-TX/Eliot-Cleaning-L0000422797.htm</loc>
<loc>http://www.example.com/bp/New-Orleans-LA/Kennedy-Plumbing-L0000423121.htm</loc>
The -n option tells Perl to loop over every line of the file (but not print them out).
The -l option adds a newline onto the end of every print
The -e 'perl-code' option executes perl-code for each line of input
The pattern:
/regex/ && print
Will only print if the regex matches. If the regex contains capture parentheses you can refer to the first captured section as $1, the second as $2 etc.
If your regex contains slashes, it may be cleaner to use a different regex delimiter ('m' stands for 'match'):
m{regex} && print
If you have a modern Perl, you can use -E to enable modern features and use say instead of print to print with a newline appended:
perl -nE 'm{example[.]com/bp/(.*?)/(.*?)-L\d+[.]htm} && say $2' filename
This is very concise in Perl
perl -i.bak -lpe's/-[^-]+$//' myfile
Note that this will modify the input file in place but will keep a backup of the original data in a file called myfile.bak

How to assign number for a repeating pattern

I am doing some calculations using Gaussian. From the Gaussian output file, I need to extract the input structure information. The output file contains more than 800 structure coordinates. What I did so far is collect all the input coordinates using some combination of the grep, awk and sed commands, like so:
grep -A 7 "Input orientation:" test.log | grep -A 5 "C" | awk '/C/{print "structure number"}1' | sed '/--/d' > test.out
This helped me to grep all the input coordinates and insert a line with "structure number". So now I have a file that contains a pattern which is being repeated in a regular fashion. The file is like the following:
structure Number
4.176801 -0.044096 2.253823
2.994556 0.097622 2.356678
5.060174 -0.115257 3.342200
structure Number
4.180919 -0.044664 2.251182
3.002927 0.098946 2.359346
5.037811 -0.103410 3.389953
Here, "structure number" is repeated. I want to append a number to each occurrence in increasing order, like "structure number 1", "structure number 2".
How can I solve this problem?
Thanks for your help in advance.
I am not familiar at all with a program called Gaussian, so I have no clue what the original input looks like. If someone posts an example, I might be able to give an even shorter solution.
However, as far as I understand, the OP is content with the output of his/her code, except that he/she wants to append an increasing number to the lines inserted with awk.
This can be achieved with the following line (adjusting the OP's code):
grep -A 7 "Input orientation:" test.log | grep -A 5 "C" | awk '/C/{print "structure number"++i}1' | sed '/--/d' > test.out
Addendum:
Even without knowing the actual input, I am sure that one can at least get rid of the sed command leaving that piece of work to awk. Also, there is no need to quote a single character grep pattern:
grep -A 7 "Input orientation:" test.log | grep -A 5 C | awk '/C/{print "structure number"++i}!/--/' > test.out
I am not sure since I cannot test, but it should be possible to let awk do the grep's work, too. As a first guess I would try the following:
awk '/Input orientation:/{li=8}!li{next}{--li}/C/{print "structure number"++i;lc=6}!lc{next}{--lc}!/--/' test.log > test.out
While this might be a little bit longer in code it is an awk-only solution doing all the work in one process. If I had input to test with, I might come up with a shorter solution.
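If test.out has already been generated with unnumbered header lines, as in the question's sample, the same counter idea can be applied after the fact. This is a minimal sketch that assumes the header line reads exactly "structure Number"; the function name number_structures is hypothetical.

```shell
number_structures() {
    # Append an increasing counter to each header line; pass other lines through.
    awk '/^structure Number$/ { print $0, ++n; next } 1' "$@"
}

result=$(printf '%s\n' \
    'structure Number' \
    '4.176801 -0.044096 2.253823' \
    'structure Number' \
    '4.180919 -0.044664 2.251182' | number_structures)
printf '%s\n' "$result"
```

Called with a file name, e.g. `number_structures test.out`, it reads that file instead of standard input.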

I want to print a text file in columns

I have a text file which looks something like this:
jdkjf
kjsdh
jksfs
lksfj
gkfdj
gdfjg
lkjsd
hsfda
gadfl
dfgad
[very many lines, that is]
but would rather like it to look like
jdkjf kjsdh
jksfs lksfj
gkfdj gdfjg
lkjsd hsfda
gadfl dfgad
[and so on]
so I can print the text file on a smaller number of pages.
Of course, this is not a difficult problem, but I'm wondering if there is some excellent tool out there for solving problems like these.
EDIT: I'm not looking for a way to remove every other newline from a text file, but rather a tool which interprets text as "pictures" and then lays these out on the page nicely (by writing the appropriate whitespace symbols).
You can use this python code.
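columns = int(input("Enter number of columns: "))
row = []
with open("test.txt") as file:
    for line in file:
        row.append(line.rstrip("\n"))
        if len(row) == columns:
            print(" ".join(row))  # one full output row
            row = []
if row:  # leftover lines when the count is not a multiple of columns
    print(" ".join(row))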
(Since you don't name your operating system, I'll simply assume Linux, Mac OS X or some other Unix...)
Your example looks like it can also be described by the expression "joining 2 lines together".
This can be achieved in a shell (with the help of xargs and awk) -- but only for an input file that is structured like your example (the result always puts 2 words on a line, irrespective of how many words each one contains):
cat file.txt | xargs -n 2 | awk '{ print $1" "$2 }'
This can also be achieved with awk alone (this time it really joins 2 full lines, irrespective of how many words each one contains):
awk '{printf "%s ", $0; getline; print $0}' file.txt
Or use sed --
sed 'N;s#\n# #' < file.txt
Also, xargs could do it:
xargs -L 2 < file.txt
I'm sure other people could come up with dozens of other, quite different methods and commandline combinations...
Caveats: You'll have to test for files with an odd number of lines explicitly. The last input line may not be processed correctly in case of odd number of lines.
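One way to handle the odd-line case explicitly is to hold each odd-numbered line and flush a leftover one at the end. This is a minimal awk sketch, not one of the commands above; the function name join_pairs is made up for illustration.

```shell
join_pairs() {
    # Odd-numbered lines are held back; even-numbered lines are printed
    # together with the held line. A trailing unpaired line is printed alone.
    awk 'NR % 2 { held = $0; next } { print held, $0 } END { if (NR % 2) print held }' "$@"
}

result=$(printf '%s\n' jdkjf kjsdh jksfs lksfj gkfdj | join_pairs)
printf '%s\n' "$result"
```

With the five sample lines, the first four are paired and the fifth appears on a line of its own.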

Bash/AWK/SED Match and ReWrite a string of numbers (date) in a line

I have a text file with the following contents repeating about 60 times coming from a converted .ics file:
Start Vak
Tijd van: 20120411T093000Z
Tijd tot: 20120411T100000Z
Klas(sen) en Docent(en): VPOS0A1 VPOS0A2 Mariel Kers
Vak: Ex. Verst. beperk.
Lokaal: 7.05
Einde Vak
I want to rewrite the "Tijd van" and "Tijd tot" values to become a good date (in a bash script on a gnu/linux system with awk,sed and grep etc.). I tried to use awk to find it:
awk '/^Tijd.*[:digit:][:digit:]Z$/; { getline; print $0; }' rooster2.txt
and grep:
egrep '/^Tijd(.*)[:digit:][:digit:]Z$/' rooster2.txt
But they both do not even find the line.
What I want is to get that date rewritten to a more bash-parsable time format like the epoch or something like 31.04.2012 13:00:00. I do not want to replace or rewrite the whole line, just the specific string! Anything, whether tips, examples or links, is welcome and very useful.
Try this (GNU sed):
sed -r 's/(Tijd ...: )(....)(..)(..).(..)(..)(..)./\1\4.\3.\2 \5:\6:\7/' FILE
There are several issues with your awk code:
While [:digit:] refers to "any digit", you still need another pair of square brackets ([...]) for the character group: [[:digit:]]. (Just imagine you wanted "a, any digit, or _"; this would be [a[:digit:]_], with the outer square brackets defining the character group.)
The semicolon (;) between your pattern (/.../) and the corresponding action ({...}) separates the two, so you have a pattern with no action, resulting in the standard action {print $0}, and a second action without a pattern, resulting in it being performed for all records (i.e. lines).
The getline asks awk to read the next record (i.e. line) before continuing.
Taking all that together your code does the following:
Print all lines matching /^Tijd.*[:digit:][:digit:]Z$/ (that is none, since [:digit:] translates to "one of :,d,i,g, or t").
Additionally, for all lines: read the next line and print it.
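Thus, it prints every second line (lines 2, 4, 6, ...): getline consumes the line it reads, so that line is never processed as a record of its own. With an odd number of lines the last line is printed as well, because the failed getline leaves $0 unchanged.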
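A quick check on numbered lines, assuming any POSIX awk, shows which lines the getline rule actually prints. Each getline consumes the line after the current record, so the main loop never sees that line as a record of its own. The pattern /x/ is a stand-in that matches nothing, like the broken pattern in the question.

```shell
# Four lines: records 1 and 3 each pull in and print the following line.
even=$(printf '%s\n' 1 2 3 4 | awk '/x/; { getline; print $0 }')
printf '%s\n' "$even"

# Three lines: for record 3 the getline hits end of input and leaves $0
# unchanged, so line 3 itself is printed.
odd=$(printf '%s\n' 1 2 3 | awk '/x/; { getline; print $0 }')
printf '%s\n' "$odd"
```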
Assuming you just want to print the lines matching "starting with 'Tijd' and ending with two digits followed by a 'Z'" you could use the following code:
awk '/^Tijd.*[[:digit:]][[:digit:]]Z$/{ print $0; }' rooster2.txt
Since {print $0} is the standard action you could even shorten that to
awk '/^Tijd.*[[:digit:]][[:digit:]]Z$/' rooster2.txt
To solve your actual problem you could use something like the following:
awk '/^Tijd.*[[:digit:]][[:digit:]]Z$/{year=substr($NF,1,4);month=substr($NF,5,2);day=substr($NF,7,2);hour=substr($NF,10,2);min=substr($NF,12,2);sec=substr($NF,14,2);$NF=day"."month"."year" "hour":"min":"sec}1' rooster2.txt
This works as follows:
For records (i.e lines) matching the pattern (/.../), rearrange the last field ($NF) to your needs.
Print all records (i.e. lines) (1 is a pattern matching all records (i.e. lines) with no specified action, resulting in the standard one ({print $0}))
Note that GNU awk also has a strftime function.
However, that needs the timestamp to be in a different format.
If you want to use that you must still rearrange the field, first:
awk -v FORMAT="%c" '/^Tijd.*[[:digit:]][[:digit:]]Z$/{$NF=strftime(FORMAT,mktime(substr($NF,1,4)" "substr($NF,5,2)" "substr($NF,7,2)" "substr($NF,10,2)" "substr($NF,12,2)" "substr($NF,14,2)))}1' rooster2.txt
Now, you just need to adjust FORMAT to your needs to change the format.
See man strftime for details.
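Since the question also asks about epoch time: GNU awk's mktime returns that value directly, but it interprets its argument in the local time zone. The sketch below is a portable-awk alternative that treats the stamp as UTC, as the trailing Z suggests. The function name to_epoch is made up for illustration, and the day-count arithmetic is the usual civil-calendar formula for Gregorian dates, not something taken from the answers above.

```shell
to_epoch() {
    # Convert stamps like 20120411T093000Z (UTC) to seconds since 1970-01-01.
    awk '{
        y = substr($1, 1, 4) + 0;  m = substr($1, 5, 2) + 0;  d = substr($1, 7, 2) + 0
        H = substr($1, 10, 2) + 0; M = substr($1, 12, 2) + 0; S = substr($1, 14, 2) + 0
        if (m <= 2) y--                      # count Jan/Feb as the end of the previous year
        era = int(y / 400)                   # 400-year Gregorian cycles (y >= 0 assumed)
        yoe = y - era * 400                  # year within the cycle
        mp  = (m > 2) ? m - 3 : m + 9        # month index with March = 0
        doy = int((153 * mp + 2) / 5) + d - 1
        doe = yoe * 365 + int(yoe / 4) - int(yoe / 100) + doy
        days = era * 146097 + doe - 719468   # days since 1970-01-01
        printf "%d\n", days * 86400 + H * 3600 + M * 60 + S
    }'
}

result=$(echo 20120411T093000Z | to_epoch)
printf '%s\n' "$result"
```

The function reads one stamp per line on standard input, so the last field of the "Tijd" lines can be piped into it.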
As a Ruby one-liner: require time for Time.parse, then replace the part matched by the regexp. See the strftime method for formatting the time output.
[slmn@uriel ~]$ ruby -rtime -ne 'puts $_.sub(/(Tijd (van|tot): )(.*)/) { $1 + Time.parse($3).strftime("%D %T") }' < yourfile.txt
Start Vak
Tijd van: 04/11/12 09:30:00
Tijd tot: 04/11/12 10:00:00
Klas(sen) en Docent(en): VPOS0A1 VPOS0A2 Mariel Kers
Vak: Ex. Verst. beperk.
Lokaal: 7.05
Einde Vak