Use Perl to extract specific output lines

I'm endeavoring to create a system to generalize rules from input text. I'm using reVerb to create my initial set of rules. Using the following command[*], for instance:
$ echo "Bananas are an excellent source of potassium." | ./reverb -q | tr '\t' '\n' | cat -n
This generates output of the form:
1 stdin
2 1
3 Bananas
4 are an excellent source of
5 potassium
6 0
7 1
8 1
9 6
10 6
11 7
12 0.9999999997341693
13 Bananas are an excellent source of potassium .
14 NNS VBP DT JJ NN IN NN .
15 B-NP B-VP B-NP I-NP I-NP I-NP I-NP O
16 bananas
17 be source of
18 potassium
I'm currently piping the output to a file, which includes the preceding white space and numbers as depicted above.
What I'm really after is just the simple rule at the end, i.e. lines 16, 17 and 18. I've been trying to create a script to extract just that component and write it to a new file in the form of a Prolog clause, i.e. be source of(bananas, potassium).
Is that feasible? Can Prolog rules contain white space like that?
I think I'm locked into getting all that output from reVerb so, what would be the best way to extract the desirable component? With a Perl script? Or maybe sed?
*Later I plan to replace this with a larger input file as opposed to just single sentences.

This seems wasteful. Why not leave the tabs as they are, and use:
$ echo "Bananas are an excellent source of potassium." \
| ./reverb -q | cut --fields=16,17,18
And yes, you can have rules like this in Prolog. See the answer by @mat. You need to know a bit of Prolog before you move on, I guess.
It is easier, however, to just make the string a valid name for a predicate:
be_source_of with underscores instead of spaces
or 'be source of' with spaces, enclosed in single quotes.
You can probably use awk to do what you want with the three fields. See, for example, the printf command in awk. Or you can parse it again from Prolog directly. Both are beyond the scope of your current question, I feel.
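If a small script is acceptable instead of awk, here is a minimal sketch in Python of the "take fields 16 to 18 and print a clause" step. It assumes the tab-separated ReVerb line shown in the question (before tr and cat -n are applied); the names quote_atom and reverb_line_to_fact are made up for this example, and the relation is emitted as a single-quoted atom, as suggested above.

```python
import re

def quote_atom(text):
    """Return text as a Prolog atom; quote it unless it is a plain lowercase identifier."""
    if re.fullmatch(r"[a-z][A-Za-z0-9_]*", text):
        return text
    escaped = text.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"

def reverb_line_to_fact(line):
    """Build relation(arg1, arg2). from fields 16, 17 and 18 of one tab-separated line."""
    fields = line.rstrip("\n").split("\t")
    # Assumption: the normalized arg1, relation and arg2 are fields 16-18 (1-based),
    # exactly as in the sample output in the question.
    arg1, relation, arg2 = fields[15], fields[16], fields[17]
    return f"{quote_atom(relation)}({quote_atom(arg1)}, {quote_atom(arg2)})."

# The sample line from the question, rebuilt with tabs between the 18 fields.
sample = "\t".join([
    "stdin", "1", "Bananas", "are an excellent source of", "potassium",
    "0", "1", "1", "6", "6", "7", "0.9999999997341693",
    "Bananas are an excellent source of potassium .",
    "NNS VBP DT JJ NN IN NN .",
    "B-NP B-VP B-NP I-NP I-NP I-NP I-NP O",
    "bananas", "be source of", "potassium",
])

print(reverb_line_to_fact(sample))  # 'be source of'(bananas, potassium).
```

For a larger input file, the same function could be applied to every line read from standard input, writing one clause per line.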

sed -n ':cycle
$!{N
4,$D
b cycle
}
s/\(.*\)\n\(.*\)\n\(.*\)/\2 (\1,\3)/p' YourFile
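If the numbers are part of the output and not just there for reference, replace the last sed action with the following (cat -n separates the number from the text with a tab, so the pattern accepts any whitespace there):
s/^ *[0-9]\{1,\}[[:space:]]\{1,\}\(.*\)\n *[0-9]\{1,\}[[:space:]]\{1,\}\(.*\)\n *[0-9]\{1,\}[[:space:]]\{1,\}\(.*\)/\2 (\1,\3)/p
This assumes the last 3 lines are the source of your "rules".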

Regarding the Prolog part of the question:
Yes, Prolog facts can contain whitespace like this, with suitable operator declarations present.
For example:
:- op(700, fx, be).
:- op(650, fx, source).
:- op(600, fx, of).
Example query and its result, to let you see the shape of terms that are created with this syntax:
?- write_canonical(be source of(a, b)).
be(source(of(a,b))).
Therefore, with these operator declarations, a fact like:
be source of(a, b).
is exactly the same as stating:
be(source(of(a,b))).
Depending on use cases and other definitions, it may even be an advantage to create this kind of facts (i.e., facts of the form be/1 instead of source_of/2). If this is the only kind of facts you need, you can simply write:
source_of(a, b).
This creates no redundant wrappers and is easier to use.
Or, as Boris suggested, you can use single quotes as in 'be source of'/2.

PCRE Regex - How to return matches with multiline string looking for multiple strings in any order

I need to use Perl-compatible regex to match several strings which appear over multiple lines in a file.
The matches need to appear in any order (server servernameA.company.com followed by servernameZ.company.com followed by servernameD.company.com or any order combination of the three). Note: All matches will appear at the beginning of each line.
In my testing with grep -P, I haven't even been able to produce a match on simple string terms that appear in any order over new lines (even when using the /s and /m modifiers). I am pretty sure from reading I need a look-ahead assertion but the samples I used didn't produce a match for me even after analyzing each bit of the regex to make sure it was relevant to my scenario.
Since I need to support this in Production, I would like an answer that is simple and relatively straight-forward to interpret.
Sample Input
irrelevant_directive = 0
# Comment
server servernameA.company.com iburst
additional_directive = yes
server servernameZ.company.com iburst
server servernameD.company.com iburst
# Additional Comment
final_directive = true
Expectation
The regex should match and return the 3 lines beginning with server (that appear in any order) if and only if there is a perfect match for the strings 'serverA.company.com', 'serverZ.company.com', and 'serverD.company.com' followed by iburst. All 3 strings must be included.
Finally, if the answer (or a very similar form of the answer) can address checking for strings in any order on a single line, that would be very helpful. For example, if I have a single-line string of: preauth param audit=true silent deny=5 severe=false unlock_time=1000 time=20ms and I want to ensure the terms deny=5 and time=20ms appear in any order and if so match.
Thank you in advance for your assistance.
Regarding the main issue [for the secondary question see Casimir et Hippolyte answer] (using x modifier): https://regex101.com/r/mkxcap/5
(?:
(?<a>.*serverA\.company\.com\s+iburst.*)
|(?<z>.*serverZ\.company\.com\s+iburst.*)
|(?<d>.*serverD\.company\.com\s+iburst.*)
|[^\n]*(?:\n|$)
)++
(?(a)(?(z)(?(d)(*ACCEPT))))(*SKIP)(*F)
The matches are now all in the a, z and d capturing groups.
It's not the most efficient (it goes over each line three times, with backtracking...), but the main takeaway is to register the matches with capturing groups and then check that they are defined.
You don't need to use the PCRE features, you can simply write in ERE:
grep -E '.*(\bdeny=5\b.*\btime=20ms\b|\btime=20ms\b.*\bdeny=5\b).*' file
The PCRE approach will be different: (however you can also use the previous pattern)
grep -P '^(?=.*\bdeny=5\b).*\btime=20ms\b.*' file
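Since the question asks for something easy to maintain in production, here is a rough sketch of the same two checks in Python rather than a single PCRE. It is only an illustration: the function names are invented, the host names are taken from the sample input (the question's "Expectation" spells them slightly differently), and the multi-line case is handled line by line instead of with one regex.

```python
import re

# Assumption: host names exactly as they appear in the sample input.
REQUIRED = (
    "servernameA.company.com",
    "servernameZ.company.com",
    "servernameD.company.com",
)

def matching_server_lines(text, required=REQUIRED):
    """Return the 'server <name> iburst' lines in file order,
    but only if every required name is present; otherwise return []."""
    found = {}
    for line in text.splitlines():
        m = re.match(r"server\s+(\S+)\s+iburst\b", line)
        if m and m.group(1) in required:
            found.setdefault(m.group(1), line)
    return list(found.values()) if len(found) == len(required) else []

def has_all_terms(line, terms):
    """Single-line check: one lookahead per term, so the order does not matter."""
    pattern = "^" + "".join(rf"(?=.*\b{re.escape(t)}\b)" for t in terms)
    return re.search(pattern, line) is not None

config = """irrelevant_directive = 0
# Comment
server servernameA.company.com iburst
additional_directive = yes
server servernameZ.company.com iburst
server servernameD.company.com iburst
# Additional Comment
final_directive = true"""

print(matching_server_lines(config))
print(has_all_terms(
    "preauth param audit=true silent deny=5 severe=false unlock_time=1000 time=20ms",
    ["deny=5", "time=20ms"],
))
```

The lookahead chain built in has_all_terms is the same idea as the grep -P pattern above, generalized to any number of terms.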

How can I do that pattern? (pattern with asterisks only)

Qu.17 Write down the program to output the pattern given below using appropriate control structures. Use of control structures is compulsory in this program.
(*****)
(****)
(***)
(**)
(*)
(**)
(***)
(****)
(*****)
edit: have removed probable extra (**)
Sounds like a college assignment to me :)
Break the problem down into its simplest form and write a test to check your program.
Your first test could be something really simple:
can print a single asterisk: (*)
Then build it up from there:
given a starting number of 2, prints 3 lines of two asterisks: (**), (**), (**)
the second line should have only one asterisk: (**), (*), (**)
...
given a starting number x, prints 2x - 1 lines
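Once those tests exist, one possible shape for the program is sketched below in Python (the question does not name a language, so this is only an illustration, and pattern_lines is a made-up name). It counts down from the starting number to 1 and back up again, which gives the 2x - 1 lines mentioned above.

```python
def pattern_lines(n):
    """Return the lines of the pattern: n asterisks down to 1, then back up to n."""
    lines = []
    # Going down: n, n-1, ..., 1
    for count in range(n, 0, -1):
        lines.append("(" + "*" * count + ")")
    # Going back up: 2, 3, ..., n (the single asterisk is not repeated)
    for count in range(2, n + 1):
        lines.append("(" + "*" * count + ")")
    return lines

for line in pattern_lines(5):
    print(line)
```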

Diff command - avoiding monolithic grouping of consecutive differing lines

Playing around with the standard linux diff command, I could not find a way to avoid the following type of grouping in its output (the output listings here assume the unified format)
This question aims at the case that each line differs by little from its counterpart in the other file, and it's more useful to see each line next to its counterpart.
I would like instead of having groups like this show up in the comparison output:
- line 1
- line 2
- line 3
+ line 1 modified
+ line 2 modified
+ line 3 modified
To get this:
- line 1
+ line 1 modified
- line 2
+ line 2 modified
- line 3
+ line 3 modified
Of course, this is a convenience question as this can be accomplished by writing your own code to post-process the diff output, or diverging from the lcs algorithm with your own algorithm. I don't think variants like wdiff etc. would help much, as the plain diff -U0 output format fits my needs very well except for this grouping property, whereas wdiff introduces other aspects that are not optimal for my case.
I'm looking for a command-line way, or a library that can be used in code, not a UI tool.
I was trying to solve this myself. The closest I got was this:
diff -y -W 10000 file1 file2 | grep '|' | sed 's/\s*|\s*/\n/g'
The one issue is that this assumes there are no "white space" differences at the beginning of the lines (or that you don't care about them).
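As for a library that can be used in code, a possible sketch with Python's standard difflib module follows. It is not the diff command itself: it takes the changed blocks reported by difflib.SequenceMatcher and pairs the old and new lines by position within each block, which is an assumption that only suits the "every line differs a little from its counterpart" case described in the question. The function name interleaved_diff is invented here.

```python
import difflib

def interleaved_diff(a_lines, b_lines):
    """Return '- old' / '+ new' lines, with each changed line next to its counterpart."""
    out = []
    matcher = difflib.SequenceMatcher(a=a_lines, b=b_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        old, new = a_lines[i1:i2], b_lines[j1:j2]
        # Assumption: within a changed block, the k-th old line corresponds
        # to the k-th new line.
        paired = min(len(old), len(new))
        for k in range(paired):
            out.append("- " + old[k])
            out.append("+ " + new[k])
        # Lines without a counterpart (pure deletions or insertions) come last.
        out.extend("- " + line for line in old[paired:])
        out.extend("+ " + line for line in new[paired:])
    return out

before = ["line 1", "line 2", "line 3"]
after = ["line 1 modified", "line 2 modified", "line 3 modified"]
print("\n".join(interleaved_diff(before, after)))
```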

Find Duplicate Function names in different files

I have been merging all of source-code files used by various developers/CAD drafters for the past 15 or so years. It appears that everyone worked off the same code base until about 7 years ago, when everyone seems to have made a local copy of all the files and used/edited them locally.
I have successfully/painfully merged all of their files with the same names back together. However, I am finding that sometimes, files with different names contain functions with the same names and parameters. Tools that are expecting one implementation of a function may end up calling a different one depending on which files were loaded when.
Is there a simple way to search all of the files for repeated function names?
For Example, a function looks like this:
(defun MyInStr (SearchIn SearchFor)
...
)
How could I search all files for (defun MyInStr (SearchIn SearchFor)
I would suggest using ctags to generate the TAGS file, then searching it for duplicate lines:
$ ctags -R
$ sort TAGS -o - | uniq -c | grep -v '^ *1 '
The above will produce output like this:
...
3 defun MyInStr (SearchIn SearchFor)
...
which will tell you that MyInStr is re-defined 3 times in the codebase with the identical signature.
You can also extract just the function name using sed, or do more complicated processing of the TAGS file with perl, lisp, python or any other scripting tool.
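If ctags is not available, the scripting route can also work directly on the source files. The sketch below is a rough Python illustration, not a full Lisp parser: it looks for "(defun NAME" with a regular expression, treats names as case-insensitive (an assumption), and will also count definitions that are commented out. The function name find_duplicate_defuns and the file names are hypothetical.

```python
import re
from collections import defaultdict

# Matches "(defun NAME" and captures NAME.
DEFUN_RE = re.compile(r"\(\s*defun\s+([^\s()]+)", re.IGNORECASE)

def find_duplicate_defuns(sources):
    """sources maps file name -> file text.
    Return {function name: [files]} for names defined more than once."""
    seen = defaultdict(list)
    for filename, text in sources.items():
        for name in DEFUN_RE.findall(text):
            # Assumption: function names compare case-insensitively.
            seen[name.lower()].append(filename)
    return {name: files for name, files in seen.items() if len(files) > 1}

# Hypothetical file contents, for illustration only.
sources = {
    "a.lsp": "(defun MyInStr (SearchIn SearchFor)\n  ...\n)\n",
    "b.lsp": "(defun myinstr (a b)\n)\n(defun Other ()\n)\n",
}
print(find_duplicate_defuns(sources))
```

In practice the sources dictionary would be built by reading every source file under the merged code base.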

Find and Replace one list of "words" with another list of "words" pairwise in csh

I am trying to modify some lengthy code. I want to replace all words in list 1 with the words in list 2 (pairwise).
List 1:
Vsap1*(GF/(Kagf+GF))
kdap1*AP1
vsprb
kpc1*pRB*E2F
.
.
List 2:
v1
v2
v3
v4
.
.
In other words, I'd like it to replace all instances of "Vsap1*(GF/(Kagf+GF))" with "v1" (and so on) in the file "code.txt". I have List 1 in a text file ("search_for.txt").
So far, I've been doing something like this:
set search_for=`cat search_for.txt`
set vv=1
foreach reaction $search_for
sed -i s/$reaction/$vv/g code.txt
set vv=$vv+1
end
There are many problems with this code. First, it seems the code can't handle expressions with parentheses (something about "regular expressions"?). Second, I'm not sure my counter is working properly. Third, I haven't even integrated the replace list -- I thought it would be easier to just replace with 1,2,3… instead. Ideally, I would like to replace with v1,v2,v3…
Any help would be greatly appreciated!! I work mainly in Matlab (in which it is hard to deal with strings and such) so I'm not that great at csh.
Best,
Mehdi
awk would probably be better, I think.
set search_for=`cat search_for.txt`
set vindex=1
foreach reaction ${search_for}
ReactionEscaped="`printf \"%s\" \"${reaction}\" | sed 's²[\+*./[]²\\\\&²g'`"
sed -i "s/${ReactionEscaped}/v${vindex}/g" code.txt
@ vindex++
end
I haven't tested this (no system available here), so the line
ReactionEscaped="`printf \"%s\" \"${reaction}\" | sed 's²[\+*./[]²\\\\&²g'`"
will certainly need fine-tuning (because of the doubled \ inside "" and the special meaning of some characters in the first sed pattern). There are many posts on this site about escaping special characters in sed patterns.
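Another way to sidestep the escaping problem altogether is to do the replacement with plain string matching instead of regular expressions. Here is a minimal sketch in Python (not csh), assuming one search term per line as in search_for.txt; the function name replace_pairwise is made up, and terms are applied in list order, so a term that is contained in a later term would need to be listed after it.

```python
def replace_pairwise(text, search_terms, prefix="v"):
    """Replace the i-th search term (1-based) with prefix + i, e.g. v1, v2, ...
    Terms are treated as literal strings, so *, ( and + need no escaping."""
    terms = [t.strip() for t in search_terms if t.strip()]  # drop blank lines
    for index, term in enumerate(terms, start=1):
        text = text.replace(term, f"{prefix}{index}")
    return text

# The first four terms from the question's List 1.
terms = [
    "Vsap1*(GF/(Kagf+GF))",
    "kdap1*AP1",
    "vsprb",
    "kpc1*pRB*E2F",
]
# A hypothetical line of code.txt, for illustration only.
code = "d = Vsap1*(GF/(Kagf+GF)) - kdap1*AP1 + vsprb - kpc1*pRB*E2F"
print(replace_pairwise(code, terms))  # d = v1 - v2 + v3 - v4
```

To apply it to real files, the terms would be read from search_for.txt and the result written back to code.txt.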