left outer join by comparing 2 files - sed

I have 2 files as shown below:
success.txt
amar
akbar
anthony
john
jill
tom
fail.txt
anthony
tom
I want to remove the records from success.txt that match records in fail.txt.
Expected output:
amar
akbar
john
jill

I'd use fgrep - if available - as you're matching fixed strings, so it should be more efficient.
fgrep -v -x -f fail.txt success.txt
You need the -x option to ensure only whole lines are matched; otherwise a fail like tom will match a success like tomas.
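On many modern systems fgrep is just a deprecated alias for grep -F, so you may prefer the equivalent:
grep -F -v -x -f fail.txt success.txt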

An awk one-liner that also keeps the original order of success.txt:
awk 'NR==FNR{a[$0]=1;next;}!($0 in a)' fail.txt success.txt
NR==FNR is true only while the first file (fail.txt) is being read, so its lines are stored as keys of the array a; each line of success.txt is then printed only if it is not in a.

There is a POSIX-standard join(1) program on all modern Unix systems; see man join. Note that join requires both inputs to be sorted, and -v1 prints the lines of the first file that have no match in the second:
$ join -v1 success.txt fail.txt
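Since the example files are not sorted, you would sort them on the fly, at the cost of losing the original order:
$ join -v1 <(sort success.txt) <(sort fail.txt)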

extract some values from file#1 and others from file#2 and print them into file#3

I hope you can help with this question. I have two files, and each one has some lines that I need in a third file. I need to take some entire lines (with values in 5 or 6 columns) from file#1 and others from file#2 and save them in file#3, keeping the line numbering. Example:
File 1
1. mike
2. linda
3. matt
4. eric
5. emma
File 2
1. beth
2. shelly
3. michael
4. andy
5. theo
File 3 (output)
1. mike
2. shelly
3. matt
4. andy
5. emma
So, I need to extract the values of lines 2 and 4 (from file#2) and print them in a third file while keeping the content of lines 1, 3 and 5 from file#1.
I tried this using sed (easy example):
sed -n -e 1p -e 3p -e 5p file1.txt > file3.txt
This will take lines 1, 3 and 5 from my file#1 and print them in file#3, but I don't know how to get the lines from file#2 (2 and 4) and add them into file#3.
Using grep to annotate each line with its file name, then filtering out the unwanted lines:
grep -H '.*' in1 in2 | sed '/in1:[24]/d;/in2:[135]/d;s/[^:]*://' | sort
This relies on every line starting with its own line number, so that the final sort restores the interleaved order.
Output:
1. mike
2. shelly
3. matt
4. andy
5. emma
sed probably isn't a very suitable tool for this. How about
paste in1 in2 | awk -F '\t' '{ print $(1+(1+NR)%2) }'
The Awk variable NR is the current input line number, and the expression NR%2 flip-flops between 1 and 0. A couple of additions turn that into a flip-flop between 1 and 2, after which it's easy to print alternating columns of the paste output.
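The same idea generalizes to any line-selection pattern. A minimal sketch, assuming a hypothetical pick list that says which file each output line should come from ("1 2 1 2 1" reproduces the example above):
paste in1 in2 | awk -F '\t' 'BEGIN { split("1 2 1 2 1", pick, " ") } { print $(pick[NR]) }'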

How to delete certain pattern in a record?

I have a file which has hundreds of records in the format below:
20150416110321|21,VPLA,91974737XXX5|91974737XXX5,404192086271201|404192086271201,SAI-IMEISV,gsn65.xxxxx.com,gsn65.xxxxx.com;1429148977;301814701;11276100,100.XX.199.250|100.XX.199.XXX|,1,SAIOLU-Location,SAIOLU-LG,2,internet|internet,,SAIOLU-SGSNIP,6,AL,AL_F_1_25G40K_2_25G20K_28|KL_BASIC,,UNKNOWN,SAIOLU-MK,UNKNOWN,SAIOLU-MBRUL,SAIOLU-MBRDL,,,,SAI-IMEI,,,,
I want to take only the first part of the pipe-separated data in fields/columns 1-8. How can I do that with awk or sed?
For example:
20150416110321,VPLA,91974737XXX5,404192086271201,SAI-IMEISV,gsn65.xxxxx.com;1429148977;301814701;11276100,100.XX.199.250,1,SAIOLU-Location,SAIOLU-LG,2,internet|internet,,SAIOLU-SGSNIP,6,AL,AL_F_1_25G40K_2_25G20K_28|KL_BASIC,,UNKNOWN,SAIOLU-MK,UNKNOWN,SAIOLU-MBRUL,SAIOLU-MBRDL,,,,SAI-IMEI,,,,
Thanks
You could use awk: loop over the first eight comma-separated fields and delete everything from the first | to the end of each field.
$ awk -F, -v OFS="," '{for(i=1;i<=8;i++)sub(/\|.*/,"",$i)}1' file
20150416110321,VPLA,91974737XXX5,404192086271201,SAI-IMEISV,gsn65.xxxxx.com,gsn65.xxxxx.com;1429148977;301814701;11276100,100.XX.199.250,1,SAIOLU-Location,SAIOLU-LG,2,internet|internet,,SAIOLU-SGSNIP,6,AL,AL_F_1_25G40K_2_25G20K_28|KL_BASIC,,UNKNOWN,SAIOLU-MK,UNKNOWN,SAIOLU-MBRUL,SAIOLU-MBRDL,,,,SAI-IMEI,,,,
sed ':cycle
s/^\(\([^,]*,\)\{0,7\}[^,|]*\)|[^,]*/\1/;t cycle' YourFile
The :cycle label and t command form a loop; each pass removes one | together with everything after it up to (but not including) the next comma, restricted to the first eight comma-separated fields.
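As a minimal sketch of the same loop on a toy record (not the OP's data), restricted to the first two fields so the effect is easy to see:
printf 'a|x,b|y,c|z\n' | sed ':cycle
s/^\(\([^,]*,\)\{0,1\}[^,|]*\)|[^,]*/\1/;t cycle'
This prints a,b,c|z - the | parts are stripped from the first two fields only.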

Should I use cut or awk to extract fields and field substrings?

I have a file with pipe-separated fields. I want to print a subset of field 1 and all of field 2:
cat tmpfile.txt
# 10 chars.|variable length num|text
ABCDEFGHIJ|99|U|HOMEWORK
JIDVESDFXW|8|C|CHORES
DDFEXFEWEW|73|B|AFTER-HOURS
I'd like the output to look like this:
# 6 chars.|variable length num
ABCDEF|99
JIDVES|8
DDFEXF|73
I know how to get fields 1 & 2:
cat tmpfile.txt | awk '{FS="|"} {print $1"|"$2}'
And know how to get the first 6 characters of field 1:
cat tmpfile.txt | cut -c 1-6
I know this is fairly simple, but I can't figure out how to combine the awk and cut commands.
Any suggestions would be greatly appreciated.
You could use awk. Use the substr() function to trim the first field:
awk -F'|' '{print substr($1,1,6),$2}' OFS='|' inputfile
For your input, it'd produce:
ABCDEF|99
JIDVES|8
DDFEXF|73
Using sed, you could say:
sed -r 's/^(.{6})[^|]*([|][^|]*).*/\1\2/' inputfile
to produce the same output.
You could use cut and paste, but then you have to read the file twice, which is a big deal if the file is very large:
paste -d '|' <(cut -c 1-6 tmpfile.txt ) <(cut -d '|' -f2 tmpfile.txt )
Just for another variation: awk -F\| -vOFS=\| '{print $1,$2}' t.in | cut -c 1-6,11-
Also, as tripleee points out, two cuts can do this too: cut -c 1-6,11- t.in | cut -d\| -f 1,2
(Both variants rely on the first field being exactly 10 characters wide.)
I like a combination of cut and sed, but that's just a preference:
cut -f1-2 -d"|" tmpfile.txt|sed 's/\([A-Z]\{6\}\)[A-Z]\{4\}/\1/g'
Result:
# 10 chars.|variable length num
ABCDEF|99
JIDVES|8
DDFEXF|73
Edit: (Removed the useless cat) Thanks!

grep or awk - how to return line if column 1 and 3 have the same value

I have a tab-delimited file and I want the output to contain the entire line if the value in column 1 is the same as the value in column 3. Having very limited knowledge of perl and linux, this is as close as I came to a solution.
File example
Apple Sugar Apple
Apple Butter Orange
Raisins Flour Orange
Orange Butter Orange
The results would be:
Apple Sugar Apple
Orange Butter Orange
Code:
#!/bin/sh
awk '{
prev=$0; f1=$1; f3=$3;
getline
if ($1 == $3) {
print prev
print
}
}' myfilename
I am sure that there is an easier solution to it. Maybe even a grep or awk on the command line. But that was the only code I could find that seemed to give me my solution.
Thanks!
It's easy with awk:
awk '$1 == $3' myfile
The default action is to print out the record, so if fields 1 and 3 are equal, that's what will happen.
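With the example file above, for instance:
$ awk '$1 == $3' myfile
Apple Sugar Apple
Orange Butter Orange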
Using awk
awk is the tool for the job:
awk '$1 == $3'
If your fields in the data are strictly tab separated and may contain blanks, then you will need to specify the field separator explicitly:
awk -F'\t' '$1 == $3'
(where the \t represents a tab; you may have to type an actual Tab, or Control-V Tab, to get it into the string).
Using grep
You can do it with grep, but you don't want to do it with grep:
grep -E '([A-Za-z]+)\t[A-Za-z]+\t\1'
The key part of the regex is the \1, which means 'the same value as the first captured string'. Beware, though: grep -E does not generally interpret \t as a tab, which is why the bash quoting below is needed.
You might even go through gyrations like this in bash, where the $'...' quoting makes the shell turn \t into a literal tab (and \\1 into \1) before grep sees the pattern:
grep -E $'([A-Za-z]+)\t[A-Za-z]+\t\\1'
You could simplify life by noting (assuming) there are no spaces within fields:
grep -E '([A-Za-z]+)[[:space:]]+[A-Za-z]+[[:space:]]+\1'
As noted in one of the comments, I didn't put a $ at the end of the search pattern. Adding one would be feasible (though the data would have to be cleaned up to contain real tabs and drop trailing blanks), so that 'Good Noise GoodBad' would not be picked up. There are other ways to do it, and you can make the regex ever more complex to handle more situations, but that only emphasizes that the awk solution is better: awk deals with these details automatically.
Using grep with Perl-compatible regexes, anchored at both ends so that partial matches like the GoodBad case above are excluded:
grep -P '^([^\t]+)\t[^\t]+\t\1$' inFile

*nix: perform set union/intersection/difference of lists

I sometimes need to compare two text files. Obviously, diff shows the differences, but it also hides the similarities, which is kind of the point.
Suppose I want to do other comparisons on these files: set union, intersection, and subtraction, treating each line as an element in the set.
Are there similarly simple common utilities or one-liners which can do this?
Examples:
a.txt
john
mary
b.txt
adam
john
$> set_union a.txt b.txt
john
mary
adam
$> set_intersection a.txt b.txt
john
$> set_difference a.txt b.txt
mary
Union: sort -u files...
Intersection: sort files... | uniq -d (this assumes no file contains duplicate lines itself; sort -u each file first if that can happen)
Overall difference (elements which are just in one of the files):
sort files... | uniq -u
Mathematical difference (elements which appear only in fileX):
sort files... | uniq -u | sort - <(sort -u fileX ) | uniq -d
The first two commands get all the elements that appear in exactly one file. Then we merge this with the file we're interested in. Command breakdown for sort - <(sort -u fileX ):
The - will process stdin (i.e. the list of all one-file-only elements).
<(...) runs a command and passes a file name (a named pipe or /dev/fd entry) from which the command's output can be read.
So this gives us a mix of all one-file-only elements plus all unique elements of fileX. The duplicates in that mix are then the elements which appear only in fileX.
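With the example files above, for instance, the intersection one-liner gives:
$ sort a.txt b.txt | uniq -d
john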
If you want to get the common lines between two files, you can use the comm utility.
A.txt:
A
B
C
B.txt:
A
B
D
and then using comm will give you:
$ comm <(sort A.txt) <(sort B.txt)
		A
		B
C
	D
In the first column, you have what is in the first file and not in the second.
In the second column, you have what is in the second file and not in the first.
In the third column, you have what is in the both files.
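If you only want one of the sets, suppress the other columns: for example, comm -12 prints only the lines common to both files, and comm -23 only those unique to the first:
$ comm -12 <(sort A.txt) <(sort B.txt)
A
B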
If you don't mind using a bit of Perl, and if your files are small enough that their lines fit in a hash, you could collect each file's lines as the keys of two hashes and do:
# ...get common keys in an array...
my @both_things;
for (keys %from_1) {
    push @both_things, $_ if exists $from_2{$_};
}
# ...put unique things in an array...
my @once_only;
for (keys %from_1) {
    push @once_only, $_ unless exists $from_2{$_};
}
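This assumes %from_1 and %from_2 are already populated; a minimal sketch of that step, with file names assumed from the question's examples:
my (%from_1, %from_2);
open my $fh, '<', 'a.txt' or die "a.txt: $!";
while (<$fh>) { chomp; $from_1{$_} = 1 }
open $fh, '<', 'b.txt' or die "b.txt: $!";
while (<$fh>) { chomp; $from_2{$_} = 1 }
close $fh;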
I can't comment on Aaron Digulla's answer, which despite being accepted does not actually compute the set difference.
The set difference A\B with the given inputs should only return mary, but the accepted answer also incorrectly returns adam.
This answer has an awk one-liner that correctly computes the set difference:
awk 'FNR==NR {a[$0]++; next} !a[$0]' b.txt a.txt
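Equivalently, comm can compute the set difference, at the cost of sorting the inputs first:
comm -23 <(sort a.txt) <(sort b.txt)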