Terminal command to find unique pairs where order does not matter - command-line

I have a Python script my_script.py which generates a list of tab-separated pairings between two elements, one for each line:
$ python my_script.py
cat dog
dog wolf
cat dog
pig chicken
dog cat
I am looking to pipe the output of this script into a terminal command of some sort that filters out duplicate combinations, not just duplicate permutations. For duplicate permutations, I can use something like:
$ python my_script.py | sort | uniq
cat dog
dog cat
dog wolf
pig chicken
to remove the duplicate "cat dog".
The problem with this approach is that I am left with both "cat dog" and "dog cat", which for my purposes should be treated as the same (same combination). I know I could write another very simple Python script to perform the kind of filtering I am after, but I wanted to see whether there is an even simpler terminal command that will do the equivalent.

Here's one way using awk:
... | awk -F "\t" '!a[$1,$2]++ && !a[$2,$1]++'
Results:
cat dog
dog wolf
pig chicken
Explanation:
-F "\t" # sets the field (column) separator to a single tab character
!a[$1,$2]++ # adds column one and column two to a pseudo-multidimensional
# array if they haven't already been added to the array
!a[$2,$1]++ # does the same thing, but adds the columns in the opposite
# orientation.
Putting it all together:
For every line of input, the line is printed if and only if the first two fields (in either orientation) don't already exist in the array. awk emulates a multi-dimensional array by joining the subscripts into a single string key, separated by SUBSEP.

Caution: the script above produces no output for lines where $1 == $2. You can test this with:
echo "dog dog" | awk '!a[$1,$2]++ && !a[$2,$1]++'|wc -l
Try this instead:
|awk '{if($1<$2)print $1,$2; else print $2,$1}'|sort|uniq
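Since the question mentions falling back to a small Python script, here is a minimal sketch of that fallback. It assumes tab-separated input with at least two fields per line; the function name unique_unordered_pairs is made up for illustration. Unlike the sort-based pipeline, it keeps the first-seen spelling and the original order, and it still prints lines where both fields are equal.

```python
def unique_unordered_pairs(lines, sep="\t"):
    """Yield each line whose unordered pair of fields has not been seen before."""
    seen = set()
    for line in lines:
        text = line.rstrip("\n")
        fields = text.split(sep)
        if len(fields) < 2:
            continue  # assumption: silently skip lines without two fields
        # frozenset ignores order, so ("cat", "dog") and ("dog", "cat") collide;
        # a line such as "dog<TAB>dog" becomes frozenset({"dog"}) and is kept once.
        key = frozenset(fields[:2])
        if key not in seen:
            seen.add(key)
            yield text
```

To use it as a filter, iterate over sys.stdin and print each yielded line, for example: python my_script.py | python dedupe_pairs.py (the file name dedupe_pairs.py is hypothetical).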

Perl System Grep and Paste

I have a file that looks like this:
Dog
Bulldog
Terrier

Cat
Persian

Ape
Gorilla

Dog
Pitbull
Lab
Shepard
Husky
I want to be able to search for each line containing dog and select everything until the next empty line and put it into a new file.
I want an output file like:
Dog
Bulldog
Terrier
Dog
Pitbull
Lab
Shepard
Husky
I know I can use grep to find the word dog, but how can I use it, or what can I combine it with, so that it grabs everything after the match UNTIL the next empty line and writes it to another file?
I am writing a script in Perl to do this because there are other things I wish to add on that are made easier with Perl. I was going to use system(grep....) to find the word but I wasn't sure what to do after that.
I will also note that I want to be able to do this recursively. I have many files that look like what I had shown and I would like to extract the Dog block from all of them. So it would be something recursive from the directory.
perl -ne 'print if /^Dog/../^$/' file
The .. and ... operators in Perl join two conditionals into a range. The range evaluates true from the line on which the first conditional is true through the line on which the second conditional is true. So you want to print from the time $_ =~ m/^Dog/ is true until $_ =~ m/^$/ is true. The one-liner above is shorthand for that.
The distinction between .. vs ... is not important here because in this case, the conditionals cannot both be true on the same line.
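For readers unfamiliar with the range operator, its behaviour can be sketched as a small state machine. This is an illustration in Python, not how Perl implements it; flip_flop and its parameters are hypothetical names. It mirrors the .. form, where the stop test is also applied to the line that switched the range on.

```python
def flip_flop(lines, start, stop):
    """Yield lines from one where start(line) is true through the next where stop(line) is true."""
    active = False
    for line in lines:
        if not active and start(line):
            active = True          # range switches on at this line
        if active:
            yield line
            if stop(line):
                active = False     # range switches off after this line


def dog_blocks(lines):
    """Lines from each 'Dog' line through the next empty line (assumes newlines already stripped)."""
    return list(flip_flop(lines, lambda s: s.startswith("Dog"), lambda s: s == ""))
```

As with the Perl one-liner, the empty line that ends each block is included in the output.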
If you can use awk, this can be done as follows. Setting the record separator (RS) to the empty string puts awk in paragraph mode, where each block delimited by blank lines is one record. Test whether the block starts with Dog and, if so, let the default action print the block.
awk '/^Dog/' ORS="\n\n" RS="" file
Dog
Bulldog
Terrier
Dog
Pitbull
Lab
Shepard
Husky
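The paragraph-mode idea can also be sketched in Python, which may help if the extraction later has to be embedded in a larger script. This assumes blocks are separated by one or more completely empty lines; blocks_starting_with is a hypothetical helper name.

```python
import re


def blocks_starting_with(text, prefix):
    """Split text into blank-line-separated blocks and keep those that begin with prefix."""
    # One or more empty lines separate blocks, similar to awk with RS="".
    blocks = [b for b in re.split(r"\n{2,}", text.strip("\n")) if b]
    return [b for b in blocks if b.startswith(prefix)]
```

Joining the result with "\n\n" gives output comparable to setting ORS="\n\n" in the awk command.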

grep or awk - how to return line if column 1 and 3 have the same value

I have a tab delimited file and I want the output to have the entire line in my file if values in column 1 are the same as the values in column 3. Having very limited knowledge in perl and linux, this is as close as I came to a solution.
File example
Apple Sugar Apple
Apple Butter Orange
Raisins Flour Orange
Orange Butter Orange
The results would be:
Apple Sugar Apple
Orange Butter Orange
Code:
#!/bin/sh
awk '{
prev=$0; f1=$1; f3=$3;
getline
if ($1 == $3) {
print prev
print
}
}' myfilename
I am sure that there is an easier solution to it. Maybe even a grep or awk on the command line. But that was the only code I could find that seemed to give me my solution.
Thanks!
It's easy with awk:
awk '$1 == $3' myfile
The default action is to print out the record, so if fields 1 and 3 are equal, that's what will happen.
Using awk
awk is the tool for the job:
awk '$1 == $3'
If your fields in the data are strictly tab separated and may contain blanks, then you will need to specify the field separator explicitly:
awk -F'\t' '$1 == $3'
(where the \t represents a tab; you may have to type Tab (or even Control-V Tab) to get it into the string).
Using grep
You can do it with grep, but you don't want to do it with grep:
grep -E '([A-Za-z]+)\t[A-Za-z]+\t\1'
The key part of the regex is the \1, which means 'the same value as the first captured string'.
You might even go through gyrations like this in bash:
grep -E $'([A-Za-z]+)\t[A-Za-z]+\t\\1'
You could simplify life by noting (assuming) there are no spaces within fields:
grep -E '([A-Za-z]+)[[:space:]]+[A-Za-z]+[[:space:]]+\1'
As noted in one of the comments, I didn't put a $ at the end of the search pattern; it would be feasible (though the data would have to be cleaned up to contain tabs and drop trailing blanks), so that 'Good Noise GoodBad' would not be picked up. There are other ways to do it, and you can make the regex more and more complex to handle more possible situations. But those only go to emphasize that the awk solution is better; awk deals with the details automatically.
Using grep:
grep -P "([^\t]+)\t[^\t]+\t\1" inFile
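The backreference technique from the grep answers can be tried out in Python's re module, which makes the effect of anchoring easy to see. This sketch assumes exactly three tab-separated columns; PATTERN and first_equals_third are illustrative names.

```python
import re

# ^ and $ anchor the match so a third column like "GoodBad" cannot match "Good".
PATTERN = re.compile(r"^([^\t]+)\t[^\t]+\t\1$")


def first_equals_third(line):
    """Return True when column 1 and column 3 of a three-column tab-separated line are identical."""
    return PATTERN.match(line.rstrip("\n")) is not None
```

Without the anchors, the pattern would also accept lines where the first column is merely a prefix of the third, which is the situation described for 'Good Noise GoodBad'.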

I want to print a text file in columns

I have a text file which looks something like this:
jdkjf
kjsdh
jksfs
lksfj
gkfdj
gdfjg
lkjsd
hsfda
gadfl
dfgad
[very many lines, that is]
but would rather like it to look like
jdkjf kjsdh
jksfs lksfj
gkfdj gdfjg
lkjsd hsfda
gadfl dfgad
[and so on]
so I can print the text file on a smaller number of pages.
Of course, this is not a difficult problem, but I'm wondering if there is some excellent tool out there for solving problems like these.
EDIT: I'm not looking for a way to remove every other newline from a text file, but rather a tool which interprets text as "pictures" and then lays these out on the page nicely (by writing the appropriate whitespace symbols).
You can use this python code.
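tables = int(input("Enter number of tables "))  # how many items to put on each row
matrix = []
file = open("test.txt")
for line in file:
    matrix.append(line.replace("\n", ""))
    if len(matrix) == tables:
        print(matrix)
        matrix = []
file.close()
if matrix:  # print any leftover items that did not fill a whole row
    print(matrix)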
(Since you don't name your operating system, I'll simply assume Linux, Mac OS X or some other Unix...)
Your example looks like it can also be described by the expression "joining 2 lines together".
This can be achieved in a shell (with the help of xargs and awk) -- but only for an input file that is structured like your example (the result always puts 2 words on a line, irrespective of how many words each one contains):
cat file.txt | xargs -n 2 | awk '{ print $1" "$2 }'
This can also be achieved with awk alone (this time it really joins 2 full lines, irrespective of how many words each one contains):
awk '{printf "%s ", $0; getline; print $0}' file.txt
Or use sed --
sed 'N;s#\n# #' < file.txt
Also, xargs could do it:
xargs -L 2 < file.txt
I'm sure other people could come up with dozens of other, quite different methods and commandline combinations...
Caveats: You'll have to test for files with an odd number of lines explicitly. The last input line may not be processed correctly in case of odd number of lines.
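To address both the layout request in the question's EDIT and the odd-line caveat, here is a minimal Python sketch that pads each item to a common width and fills rows left to right. The function name to_columns and its parameters are assumptions for illustration; a short last row is simply left incomplete.

```python
def to_columns(items, columns=2, gap=2):
    """Lay out items row by row in fixed-width columns and return the lines."""
    if columns < 1:
        raise ValueError("columns must be at least 1")
    # Every cell is as wide as the longest item plus the gap between columns.
    width = max((len(s) for s in items), default=0) + gap
    rows = [items[i:i + columns] for i in range(0, len(items), columns)]
    # rstrip drops the padding after the last cell of each row.
    return ["".join(s.ljust(width) for s in row).rstrip() for row in rows]
```

For a file, read the lines with their newlines stripped, pass them in, and print the returned lines one per row.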

Read word from a file and return next word

Using shell script I want to read a word from text file and return next column word.
For example, my input file looks like this:
AGE1 PERSON1
AGE2 PERSON2
AGE3 PERSON3
AGE4 PERSON4
I have a variable in my sh file that holds the PERSON's name.
I want to read the input text file and get the value of that person's age.
Please help, I'm a beginner in shell scripting.
A slightly simpler solution is:
age=$( awk '$2==name { print $1 }' name="$name" input-file )
Building upon shellter's comment:
age=$(grep "$person_name" people_file.txt | cut -f1 -d' ')
I'll try to explain everything. First, I assume some things (but you can change them in your script):
Your file with the data you entered is called people_file.txt.
The person's name you want to find is in the variable $person_name.
The variable you want to store the result is $age.
Firstly, because we need to run commands to generate the value of the $age variable, we wrap them in $( and ). The shell runs the command (or series of commands) inside and replaces the whole construct with the text the command prints.
We first need to find the line which contains the person's name. For that we use grep: grep regex file. Grep reads file line by line and prints every line that matches the regular expression regex. In our case we can simply search for the person's name directly (assuming it doesn't contain special characters, like the period or an asterisk). Note that we must place the variable between double quotes, otherwise a person's name that has a space in it might be split on the command line so that the first name is used as the regular expression and the surname as the file. If you want to search in a case-insensitive manner (for example, John will find a line with JOHN or john), you can use the -i flag: grep -i regex file. The selected lines are printed by grep to its output, and we send those lines to the input of the next command with the pipe operator |.
Finally, we have a line (or many lines) with the results. Now we must extract the age. The cut command splits each line it reads from the input into fields, and only prints the fields you ask for. In this case, we ask for the first field with the -f1 option. We also specify that the space character is to be used as the delimiter (i.e. the character that separates the fields) with the -d' ' option.
If you have more than one line with the same person's name, we need to pipe the output of grep into a head command, so that we can have only the number of lines we want. We can tell head how many lines we want with the -n N option. So if you only want the first match:
age=$(grep "$person_name" people_file.txt | head -n 1 | cut -f1 -d' ')
Hope this helps a little =)
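One limitation of the grep-based lookup is that it matches the name anywhere in the line, so PERSON1 would also match PERSON10. The exact-match idea can be sketched in Python as follows; it assumes whitespace-separated columns with the age first and the name second, and find_age is a hypothetical helper name.

```python
def find_age(lines, person_name):
    """Return the first column of the first line whose second column equals person_name."""
    for line in lines:
        fields = line.split()
        # Compare the whole field, not a substring, so PERSON1 does not match PERSON10.
        if len(fields) >= 2 and fields[1] == person_name:
            return fields[0]
    return None  # assumption: None signals that the name was not found
```

Called on an open file object, for example find_age(open("people_file.txt"), "PERSON3"), it stops at the first match, much like adding head -n 1 to the pipeline.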
age=`
perl -nle'
BEGIN { $n = shift(@ARGV); }
print $1 if /^(\S+)\s+\Q$n\E$/;
' "$name" file
`
Tested with bash in sh mode.

Extract the part enclosed by a predefined multiline character sequence

Hope the AWK gurus can provide a solution to my problem.
I have a file that goes like this :
cat cat cat cat cat cat dog rat ate dog tit
dog cat dog dog dog rat dog pat ate cat dog
I have to use AWK to extract the pattern between the first occurring c and a d. Starting from the first c, a count should be kept of the number of c's and d's such that, when the counts match, the part between the first c and the matched d is output to a file, including the number of the line on which the matching d occurred.
In this particular example the match occurs on the seventh dog, so the output has to be:
cat cat cat cat cat cat dog rat ate dog tit
dog cat dog dog dog rat d
The match can go beyond just two lines! The output may or may not include the c and the d. The text contains all kinds of characters, including special ones!
In order for the print to occur the count has to be matched .
Thanks in advance for the replies. Suggestions are always welcome .
EDIT : The capture of the pattern between c and d can be compromised as long as the condition is met and the line number of the exit d is obtained :)
A few tips, without giving the full solution:
By default, awk considers each line as a record. The default record separator is RS="\n".
Depending on your version of awk (GNU awk supports this), you may be able to set RS, the record separator, to a regex which matches either c or d. Then, for each record, you can check the RT variable, a gawk extension that holds the text that actually matched RS, so it will contain either c or d. From there, a variable that is incremented on c and decremented on d lets you find the end of the match: it is the point where the variable reaches 0.
You can then use a variable that contains your match so far, and keep concatenating RT and the new record to it, until you're done.
If you need to know the line number of the end of the match, you can set RS to a regex that matches c or d, as before, and additionally matches \n. By maintaining another counter variable that is incremented every time RT tells you that \n has been matched, you'll have your line number.
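The counting logic in these tips can be checked independently of awk. The following Python sketch is not the requested AWK solution; it only illustrates the counter idea on a string, assuming single-character markers and 1-based line numbers. balanced_span is a hypothetical name.

```python
def balanced_span(text, opener="c", closer="d"):
    """Return (span, line_number) where the closers first balance the openers, or None."""
    depth = 0
    start = None
    line_no = 1
    for i, ch in enumerate(text):
        if ch == "\n":
            line_no += 1
        elif ch == opener:
            if start is None:
                start = i          # counting begins at the first opener
            depth += 1
        elif ch == closer and start is not None:
            depth -= 1
            if depth == 0:
                # Span runs from the first opener through the balancing closer.
                return text[start:i + 1], line_no
    return None
```

On the sample from the question, the counter reaches zero at the d of the seventh dog, on line 2.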
Here's a sed solution just for fun:
sed -rne ':r;$!{N;br};s/^[^c]*(.*d)[^d]*$/\1/;:a;h;s/[^cd]//g;' \
-e ':s;s/d(.*)c/c\1d/;ts;s/cd/c\nd/;T;y/c/d/;/^(d+)\n\1$/{g;i -------' \
-e 'p};g;s/d[^d]*d$/d/;ta'
This prints all satisfying sequences from longest to shortest.