I want TO extract list of names from other bigger file (input), having that name and some additional information associated with that name. My problem is with grep -f option, as it is not matching the exact entries in input file but some other entries that contain similar name.
I tried:
$ grep -f list.txt -A 1 input >output
Following are the format of files;
list.txt
TE_final_35005
TE_final_1040
Input file
>TE_final_10401
ACGTACGTACGTACGT
>TE_final_35005
ACGTACGATCAGT
>TE_final_1040
ACGTACGTACGT
Required output:
>TE_final_35005
ACGTACGATCAGT
>TE_final_1040
ACGTACGTACGT
output I am getting:
>TE_final_10401
ACGTACGTACGTACGT
>TE_final_35005
ACGTACGATCAGT
>TE_final_1040
ACGTACGTACGT
Although TE_final_10401 is not in the list.txt
How I can use ^ in list?
Please help to match the exact value or suggest other ways to do this.
Add the whole word switch (-w):
grep -w -A1 -f list.txt infile
Output:
>TE_final_35005
ACGTACGATCAGT
>TE_final_1040
ACGTACGTACGT
A couple of things, remove the blanks lines from the files first:
sed -i '/^\s*$/d' file list
Then -w is used to match whole words only and -A1 will print the next line after the match:
$ grep -w -A1 -f list file > new_file
$ cat new_file
>TE_final_35005
ACGTACGATCAGT
>TE_final_1040
ACGTACGTACGT
as others have mentioned, adding the -w flag is the cleanest and easiest approach based on your sample data. but since you explicitly asked how you could use ^ in list.txt, here's another option.
to add ^ and/or $ anchors to each line in list.txt:
$ cat list.txt
^>TE_final_35005[ ]*$
^>TE_final_1040[ ]*$
this searches for your patterns at the start of the line, preceded by a > character, and ignores any trailing spaces. then your previous command will work (assuming you either remove those blank lines or change your argument to -A 2).
if you'd like to add these anchors to the list file automatically (and delete any blank lines at the same time), use this awk construct:
awk '{if($0 != ""){print "^>"$0"[ ]*$"}}' list.txt >newlist.txt
or if you prefer sed inplace editing:
sed -i '/^[ ]*$/d;s/\(.*\)/^>\1[ ]*$/g' list.txt
Related
I'm having trouble in getting sed to remove just the specific line I want. Let's say I have a file that looks like this:
testfile
testfile.txt
testfile2
Currently I'm using this to remove the line I want:
sed -i "/$1/d" file
The issue is that with this if I were to give testfile as input it would delete all three lines but I want it to only remove the first line. How do I do this?
With grep
grep -x -F -v -- "$1" file
# or
grep -xFv -- "$1" file
-F is for "fixed strings" -- turns off regex engine.
-x is to match entire line.
-v is for "everything but" the matched line(s).
-- to signal the end of options, in case $1 starts with a hyphen.
To save the file
grep -xFv -- "$1" file | sponge file # `moreutils` package
# or
tmp=$(mktemp)
grep -xFv -- "$1" file > "$tmp" && mv "$tmp" file
So match the whole line.
var=testfile
sed -i '/^'"$var"'$/d' file
# or with " quoting
sed -i "/^$var\$/d" file
You can learn regex with fun online with regex crosswords.
I have a file with many lines, in each line
there is either substring
whatever_blablablalsfjlsdjf;asdfjlds;f/watch?v=yPrg-JN50sw&,whatever_blabla
or
whatever_blablabla"/watch?v=yPrg-JN50sw&" class=whatever_blablablavwhate
I want to extract a substring, like the "yPrg-JN50s" above
the matching pattern is
the 11 characters after the string "/watch?="
how to extract the substring
I hope it is sed, awk in one line
if not, a pn line perl script is also ok
You can do
grep -oP '(?<=/watch\?v=).{11}'
if your grep knows Perl regex, or
sed 's/.*\/watch?v=\(.\{11\}\).*/\1/g'
$ cat file
/watch?v=yPrg-JN50sw&
"/watch?v=yPrg-JN50sw&" class=
$
$ awk 'match($0,/\/watch\?v=/) { print substr($0,RSTART+RLENGTH,11) }' file
yPrg-JN50sw
yPrg-JN50sw
Just with the shell's parameter expansion, extract the 11 chars after "watch?v=":
while IFS= read -r line; do
tmp=${line##*watch?v=}
echo ${tmp:0:11}
done < filename
You could use sed to remove the extraneous information:
sed 's/[^=]\+=//; s/&.*$//' file
Or with awk and sensible field separators:
awk -F '[=&]' '{print $2}' file
Contents of file:
cat <<EOF > file
/watch?v=yPrg-JN50sw&
"/watch?v=yPrg-JN50sw&" class=
EOF
Output:
yPrg-JN50sw
yPrg-JN50sw
Edit accommodating new requirements mentioned in the comments
cat <<EOF > file
<div id="" yt-grid-box "><div class="yt-lockup-thumbnail"><a href="/watch?v=0_NfNAL3Ffc" class="ux-thumb-wrap yt-uix-sessionlink yt-uix-contextlink contains-addto result-item-thumb" data-sessionlink="ved=CAMQwBs%3D&ei=CPTsy8bhqLMCFRR0fAodowXbww%3D%3D"><span class="video-thumb ux-thumb yt-thumb-default-185 "><span class="yt-thumb-clip"><span class="yt-thumb-clip-inner"><img src="//i1.ytimg.com/vi/0_NfNAL3Ffc/mqdefault.jpg" alt="Miniature" width="185" ><span class="vertical-align"></span></span></span></span><span class="video-time">5:15</span>
EOF
Use awk with sensible record separator:
awk -v RS='[=&"]' '/watch/ { getline; print }' file
Note, you should use a proper XML parser for this sort of task.
grep --perl-regexp --only-matching --regexp="(?<=/watch\\?=)([^&]{0,11})"
Assuming your lines have exactly the format you quoted, this should work.
awk '{print substr($0,10,11)}'
Edit: From the comment in another answer, I guess your lines are much longer and complicated than this, in which case something more comprehensive is needed:
gawk '{if(match($0, "/watch\\?v=(\\w+)",a)) print a[1]}'
I have a CSV. I want to edit the 35th field of the CSV and write the change back to the 35th field. This is what I am doing on bash:
awk -F "," '{print $35}' test.csv | sed -i 's/^0/+91/g'
so, I am pulling the 35th entry using awk and then replacing the "0" in the starting position in the string with "+91". This one works perfet and I get desired output on the console.
Now I want this new entry to get written in the file. I am thinking of sed's "in -place" replacement feature but this fetuare needs and input file. In above command, I cannot provide input file because my primary command is awk and sed is taking the input from awk.
Thanks.
You should choose one of the two tools. As for sed, it can be done as follows:
sed -ri 's/^(([^,]*,){34})0([^,]*)/\1+91\3/' test.csv
Not sure about awk, but #shellter's comment might help with that.
The in-place feature of sed is misnamed, as it does not edit the file in place. Instead, it creates a new file with the same name. eg:
$ echo foo > foo
$ ln -f foo bar
$ ls -i foo bar # These are the same file
797325 bar 797325 foo
$ echo new-text > foo # Changes bar
$ cat bar
new-text
$ printf '/new/s//newer\nw\nq\n' | ed foo # Edit foo "in-place"; changes bar
9
newer-text
11
$ cat bar
newer-text
$ ls -i foo bar # Still the same file
797325 bar 797325 foo
$ sed -i s/new/newer/ foo # Does not edit in-place; creates a new file
$ ls -i foo bar
797325 bar 792722 foo
Since sed is not actually editing the file in place, but writing a new file and then renaming it to the old file, you might as well do the same.
awk ... test.csv | sed ... > test.csv.1 && mv test.csv.1 test.csv
There is the misperception that using sed -i somehow avoids the creation of the temporary file. It does not. It just hides the fact from you. Sometimes abstraction is a good thing, but other times it is unnecessary obfuscation. In the case of sed -i, it is the latter. The shell is really good at file manipulation. Use it as intended. If you do need to edit a file in place, don't use the streaming version of ed; just use ed
So, it turned out there are numerous ways to do it. I got it working with sed as below:
sed -i 's/0\([0-9]\{10\}\)/\+91\1/g' test.csv
But this is little tricky as it will edit any entry which matches the criteria. however in my case, It is working fine.
Similar implementation of above logic in perl:
perl -p -i -e 's/\b0(\d{10})\b/\+91$1/g;' test.csv
Again, same caveat as mentioned above.
More precise way of doing it as shown by Lev Levitsky because it will operate specifically on the 35th field
sed -ri 's/^(([^,]*,){34})0([^,]*)/\1+91\3/g' test.csv
For more complex situations, I will have to consider using any of the csv modules of perl.
Thanks everyone for your time and input. I surely know more about sed/awk after reading your replies.
This might work for you:
sed -i 's/[^,]*/+91/35' test.csv
EDIT:
To replace the leading zero in the 35th field:
sed 'h;s/[^,]*/\n&/35;/\n0/!{x;b};s//+91/' test.csv
or more simply:
|sed 's/^\(\([^,]*,\)\{34\}\)0/\1+91/' test.csv
If you have moreutils installed, you can simply use the sponge tool:
awk -F "," '{print $35}' test.csv | sed -i 's/^0/+91/g' | sponge test.csv
sponge soaks up the input, closes the input pipe (stdin) and, only then, opens and writes to the test.csv file.
As of 2015, moreutils is available in package repositories of several major Linux distributions, such as Arch Linux, Debian and Ubuntu.
Another perl solution to edit the 35th field in-place:
perl -i -F, -lane '$F[34] =~ s/^0/+91/; print join ",",#F' test.csv
These command-line options are used:
-i edit the file in-place
-n loop around every line of the input file
-l removes newlines before processing, and adds them back in afterwards
-a autosplit mode – split input lines into the #F array. Defaults to splitting on whitespace.
-e execute the perl code
-F autosplit modifier, in this case splits on ,
#F is the array of words in each line, indexed starting with 0
$F[34] is the 35 element of the array
s/^0/+91/ does the substitution
I would like the option of extracting the following string/data:
/work/foo/processed/25
/work/foo/processed/myproxy
/work/foo/processed/sample
=or=
25
myproxy
sample
But it would help if I see both.
From this output using cut or perl or anything else that would work:
Found 3 items
drwxr-xr-x - foo_hd foo_users 0 2011-03-16 18:46 /work/foo/processed/25
drwxr-xr-x - foo_hd foo_users 0 2011-04-05 07:10 /work/foo/processed/myproxy
drwxr-x--- - foo_hd testcont 0 2011-04-08 07:19 /work/foo/processed/sample
Doing a cut -d" " -f6 will get me foo_users, testcont. I tried increasing the field to higher values and I'm just not able to get what I want.
I'm not sure if cut is good for this or something like perl?
The base directories will remain static /work/foo/processed.
Also, I need the first line Found Xn items removed. Thanks.
You can do a substitution from beginning to the first occurrence of / , (non greedily)
$ your_command | ruby -ne 'print $_.sub(/.*?\/(.*)/,"/\\1") if /\//'
/work/foo/processed/25
/work/foo/processed/myproxy
/work/foo/processed/sample
Or you can find a unique separator (field delimiter) to split on. for example, the time portion is unique , so you can split on that and get the last element. (2nd element)
$ ruby -ne 'print $_.split(/\s+\d+:\d+\s+/)[-1] if /\//' file
/work/foo/processed/25
/work/foo/processed/myproxy
/work/foo/processed/sample
With awk,
$ awk -F"[0-9][0-9]:[0-9][0-9]" '/\//{print $NF}' file
/work/foo/processed/25
/work/foo/processed/myproxy
/work/foo/processed/sample
perl -lanF"\s+" -e 'print #F[-1] unless /^Found/' file
Here is an explanation of the command-line switches used:
-l: remove line break from each line of input, then add one back on print
-a: auto-split each line of input into an #F array
-n: loop through each line of input
-F: the regexp pattern to use for the auto-split (with -a)
-e: the perl code to execute (for each line of input if using -n or -p)
If you want to just output the last portion of your directory path, and the basedir is always '/work/foo/processed', I would do this:
perl -nle 'print $1 if m|/work/foo/processed/(\S+)|' file
Try this out :
<Your Command> | grep -P -o '[\/\.\w]+$'
OR if the directory '/work/foo/processed' is always static then:
<Your Command>| grep -P -o '\/work\/foo\/processed\/.+$'
-o : Show only the part of a matching line that matches PATTERN.
-P : Interpret PATTERN as a Perl regular expression.
In this example, the last word in the input will be matched .
(The word can also contain dot(s)),so file names like 'text_file1.txt', can be matched).
Ofcourse, you can change the pattern, as per your requirement.
If you know the columns will be the same, and you always list the full path name, you could try something like:
ls -l | cut -c79-
which would cut out the 79th character until the end. That might work in this exact case, but I think it would be better to find the basename of the last field. You could easily do this in awk or perl. Respond if this is not what you want and I'll add the awk and perl versions.
take the output of your ls command and pipe it to awk
your command|awk -F'/' '{print $NF}'
your_command | perl -pe 's#.*/##'
My target is to delete line in file only if PATH match the PATH in the file
For example, I need to delete all lines that have /etc/sysconfig PATH from /tmp/file file
more /tmp/file
/etc/sysconfig/network-scripts/ifcfg-lo file1
/etc/sysconfig/network-scripts/ifcfg-lo file2
/etc/sysconfig/network-scripts/ifcfg-lo file3
I write the following Perl code (the perl code integrated in my bash script) in order to delete lines that have "/etc/sysconfig"
export FILE=/etc/sysconfig
perl -i -pe 's/\Q$ENV{FILE}\E// ' /tmp/file
But I get the following after I run the perl code: (in place to get empty lines)
/network-scripts/ifcfg-lo file1
/network-scripts/ifcfg-lo file2
/network-scripts/ifcfg-lo file3
first question:
How to change the perl syntax : perl -i -pe 's/\Q$ENV{FILE }\E// ' in order to delete line that matches the required PATH (/etc/sysconfig)?
second question:
The same as the first question but line will deleted only if PATH match the first field in the file
Example:
/tmp/file before perl edit:
file1 /etc/sysconfig/network-scripts/ifcfg-lo
/etc/sysconfig/network-scripts/ifcfg-lo file2
/etc/sysconfig/network-scripts/ifcfg-lo file3
/tmp/file after perl edit:
file1 /etc/sysconfig/network-scripts/ifcfg-lo
Perl is a fine way to do it. Use the -n switch, not -p.
perl -i -l -n -e'print unless /\Q$ENV{FILE}/' filename
s/pattern/otherpattern/ won't delete entire lines; it will only alter substrings. You need to entirely change your program to delete entire lines. In pseudocode, it would be:
while (read in a line)
{
if (doesn't match)
{
write the line back out unaltered.
}
}
It can still be rewritten as a oneliner though, with knowledge of how continue and redo work in loops: perl -pe '$_ = <> and redo if /Q$ENV{FILE}\E/'
mef#iwlappy:~$ cat /tmp/file
aaaa
/etc/sysconfig/network-scripts/ifcfg-lofile1
/etc/sysconfig/network-scripts/ifcfg-lofile2
/etc/sysconfig/network-scripts/ifcfg-lofile3
aaa
mef#iwlappy:~$ perl -i -pe 's/$ENV{FILE}\E.*//' /tmp/file
mef#iwlappy:~$ cat /tmp/file
aaaa
aaa
You can do a further regexp to remove empty lines with s/^$//
If I were doing this from the command line, I probably wouldn't even use Perl. I'd just use a negated grep:
$ mv old.txt old.bak; grep -v $FILE old.bak > old.txt
Renaming the original file and writing to a new file with the old name is the same thing that perl's -i switch does for you.
If you want to match just the first column, then I might punt to perl so I don't have to use awk or cut. perl's -a switch splits the line on whitespace and puts the results in #F:
$ perl -ai.bak -ne 'print if $F[0] !~ /^\Q$ENV{FILE}/' old.txt
When you think you have it right, you can remove the .bak training wheels that saves a copy of your original file. Or not. I tend to like the safety net.
See perlrun for the details of command-line switches.