There are often times I will grep -n whatever file to find what I am looking for. Say the output is:
1234: whatev 1
5555: whatev 2
6643: whatev 3
If I want to then just extract the lines between 1234 and 5555, is there a tool to do that? For static files I have a script that does wc -l of the file and then does the math to split it out with tail & head but that doesn't work out so well with log files that are constantly being written to.
Try using sed as mentioned on
http://linuxcommando.blogspot.com/2008/03/using-sed-to-extract-lines-in-text-file.html. For example use
sed '2,4!d' somefile.txt
to print from the second line to the fourth line of somefile.txt. (And don't forget to check http://www.grymoire.com/Unix/Sed.html, sed is a wonderful tool.)
The following command will do what you asked for "extract the lines between 1234 and 5555" in someFile.
sed -n '1234,5555p' someFile
If I understand correctly, you want to find a pattern between two line numbers. The awk one-liner could be
awk '/whatev/ && NR >= 1234 && NR <= 5555' file
You don't need to run grep followed by sed.
Perl one-liner:
perl -ne 'if (/whatev/ && $. >= 1234 && $. <= 5555) {print}' file
Line numbers are OK if you can guarantee the position of what you want. Over the years, my favorite flavor of this has been something like this:
sed "/First Line of Text/,/Last Line of Text/d" filename
which deletes all lines from the first matched line to the last match, including those lines.
Use sed -n with "p" instead of "d" to print those lines instead. Way more useful for me, as I usually don't know where those lines are.
Put this in a file and make it executable:
#!/usr/bin/env bash
start=`grep -n $1 < $3 | head -n1 | cut -d: -f1; exit ${PIPESTATUS[0]}`
if [ ${PIPESTATUS[0]} -ne 0 ]; then
echo "couldn't find start pattern!" 1>&2
exit 1
fi
stop=`tail -n +$start < $3 | grep -n $2 | head -n1 | cut -d: -f1; exit ${PIPESTATUS[1]}`
if [ ${PIPESTATUS[0]} -ne 0 ]; then
echo "couldn't find end pattern!" 1>&2
exit 1
fi
stop=$(( $stop + $start - 1))
sed "$start,$stop!d" < $3
Execute the file with arguments (NOTE that the script does not handle spaces in arguments!):
Starting grep pattern
Stopping grep pattern
File path
To use with your example, use arguments: 1234 5555 myfile.txt
Includes lines with starting and stopping pattern.
If I want to then just extract the lines between 1234 and 5555, is
there a tool to do that?
There is also ugrep, a GNU/BSD grep compatible tool but one that offers a -K option (or --range) with a range of line numbers to do just that:
ugrep -K1234,5555 -n '' somefile.log
You can use the usual GNU/BSD grep options and regex patterns (but it also offers a lot more such as -K.)
If you want lines instead of line ranges, you can do it with perl: eg. if you want to get line 1, 3 and 5 from a file, say /etc/passwd:
perl -e 'while(<>){if(++$l~~[1,3,5]){print}}' < /etc/passwd
Related
Given a file, for example:
potato: 1234
apple: 5678
potato: 5432
grape: 4567
banana: 5432
sushi: 56789
I'd like to grep for all lines that start with potato: but only pipe the numbers that follow potato:. So in the above example, the output would be:
1234
5432
How can I do that?
grep 'potato:' file.txt | sed 's/^.*: //'
grep looks for any line that contains the string potato:, then, for each of these lines, sed replaces (s/// - substitute) any character (.*) from the beginning of the line (^) until the last occurrence of the sequence : (colon followed by space) with the empty string (s/...// - substitute the first part with the second part, which is empty).
or
grep 'potato:' file.txt | cut -d\ -f2
For each line that contains potato:, cut will split the line into multiple fields delimited by space (-d\ - d = delimiter, \ = escaped space character, something like -d" " would have also worked) and print the second field of each such line (-f2).
or
grep 'potato:' file.txt | awk '{print $2}'
For each line that contains potato:, awk will print the second field (print $2) which is delimited by default by spaces.
or
grep 'potato:' file.txt | perl -e 'for(<>){s/^.*: //;print}'
All lines that contain potato: are sent to an inline (-e) Perl script that takes all lines from stdin, then, for each of these lines, does the same substitution as in the first example above, then prints it.
or
awk '{if(/potato:/) print $2}' < file.txt
The file is sent via stdin (< file.txt sends the contents of the file via stdin to the command on the left) to an awk script that, for each line that contains potato: (if(/potato:/) returns true if the regular expression /potato:/ matches the current line), prints the second field, as described above.
or
perl -e 'for(<>){/potato:/ && s/^.*: // && print}' < file.txt
The file is sent via stdin (< file.txt, see above) to a Perl script that works similarly to the one above, but this time it also makes sure each line contains the string potato: (/potato:/ is a regular expression that matches if the current line contains potato:, and, if it does (&&), then proceeds to apply the regular expression described above and prints the result).
Or use regex assertions: grep -oP '(?<=potato: ).*' file.txt
grep -Po 'potato:\s\K.*' file
-P to use Perl regular expression
-o to output only the match
\s to match the space after potato:
\K to omit the match
.* to match rest of the string(s)
sed -n 's/^potato:[[:space:]]*//p' file.txt
One can think of Grep as a restricted Sed, or of Sed as a generalized Grep. In this case, Sed is one good, lightweight tool that does what you want -- though, of course, there exist several other reasonable ways to do it, too.
This will print everything after each match, on that same line only:
perl -lne 'print $1 if /^potato:\s*(.*)/' file.txt
This will do the same, except it will also print all subsequent lines:
perl -lne 'if ($found){print} elsif (/^potato:\s*(.*)/){print $1; $found++}' file.txt
These command-line options are used:
-n loop around each line of the input file
-l removes newlines before processing, and adds them back in afterwards
-e execute the perl code
You can use grep, as the other answers state. But you don't need grep, awk, sed, perl, cut, or any external tool. You can do it with pure bash.
Try this (semicolons are there to allow you to put it all on one line):
$ while read line;
do
if [[ "${line%%:\ *}" == "potato" ]];
then
echo ${line##*:\ };
fi;
done< file.txt
## tells bash to delete the longest match of ": " in $line from the front.
$ while read line; do echo ${line##*:\ }; done< file.txt
1234
5678
5432
4567
5432
56789
or if you wanted the key rather than the value, %% tells bash to delete the longest match of ": " in $line from the end.
$ while read line; do echo ${line%%:\ *}; done< file.txt
potato
apple
potato
grape
banana
sushi
The substring to split on is ":\ " because the space character must be escaped with the backslash.
You can find more like these at the linux documentation project.
Modern BASH has support for regular expressions:
while read -r line; do
if [[ $line =~ ^potato:\ ([0-9]+) ]]; then
echo "${BASH_REMATCH[1]}"
fi
done
grep potato file | grep -o "[0-9].*"
I'm trying to extract from a tab delimited file a number that i need to store in a variable. I'm approaching the problem with a regex that thanks to some research online I have been able to built.
The file is composed as follow:
0 0 2500 5000
1 5000 7500 10000
2 10000 12500 15000
3 15000 17500 20000
4 20000 22500 25000
5 25000 27500 30000
I need to extract the number in the second column given a number of the first one. I wrote and tested online the regex:
(?<=5\t).*?(?=\t)
I need the 25000 from the sixth line.
I started working with sed but as you already know, it doesn't like lookbehind and lookahead pattern even with the -E option to enable extended version of regular expressions. I tried also with awk and grep and failed for similar reasons.
Going further I found that perl could be the right command but I'm not able to make it work properly. I'm trying with the command
perl -pe '/(?<=5\t).*?(?=\t)/' | INFO.out
but I admit my poor knowledge and I'm a bit lost.
The next step would be to read the "5" in the regex from a variable so if you already know problems that could rise, please let me know.
No need for lookbehinds -- split each line on space and check whether the first field is 5.
In Perl there is a command-line option convenient for this, -a, with which each line gets split for us and we get #F array with fields
perl -lanE'say $F[1] if $F[0] == 5' data.txt
Note that this tests for 5 numerically (==)
grep supports -P for perl regex, and -o for only-matching, so this works with a lookbehind:
grep -Po '(?<=5\t)\d+' file
That can use a shell variable pretty easily:
VAR=5 && grep -Po "(?<=$VAR\t)\d+"
Or perl -n, to show using s///e to match and print capture group:
perl -lne 's/^5\t(\d+)/print $1/e' file
Why do you need to use a regex? If all you are doing is finding lines starting with a 5 and getting the second column you could use sed and cut, e.g.:
<infile sed -n '/^5\t/p' | cut -f2
Output:
25000
One option is to use sed, match 5 at the start of the string and after the tab capture the digits in a group
sed -En 's/^5\t([[:digit:]]+)\t.*/\1/p' file > INFO.out
The file INFO.out contains:
25000
Using sed
$ var1=$(sed -n 's/^5[^0-9]*\([^ ]*\).*/\1/p' input_file)
$ echo "$var1"
25000
I have 'file1' with (say) 100 lines. I want to use sed or awk to print lines 23, 71 and 84 (for example) to 'file2'. Those 3 line numbers are in a separate file, 'list', with each number on a separate line.
When I use either of these commands, only line 84 gets printed:
for i in $(cat list); do sed -n "${i}p" file1 > file2; done
for i in $(cat list); do awk 'NR==x {print}' x=$i file1 > file2; done
Can a for loop be used in this way to supply line addresses to sed or awk?
This might work for you (GNU sed):
sed 's/.*/&p/' list | sed -nf - file1 >file2
Use list to build a sed script.
You need to do > after the loop in order to capture everything. Since you are using it inside the loop, the file gets overwritten. Inside the loop you need to do >>.
Good practice is to or use > outside the loop so the file is not open for writing during every loop iteration.
However, you can do everything in awk without for loop.
awk 'NR==FNR{a[$1]++;next}FNR in a' list file1 > file2
You have to >>(append to the file) . But you are overwriting the file. That is why, You are always getting 84 line only in the file2.
Try use,
for i in $(cat list); do sed -n "${i}p" file1 >> file2; done
With sed:
sed -n $(sed -e 's/^/-e /' -e 's/$/p/' list) input
given the example input, the inner command create a string like this: `
-e 23p
-e 71p
-e 84p
so the outer sed then prints out given lines
You can avoid running sed/awk in a for/while loop altgether:
# store all lines numbers in a variable using pipe
lines=$(echo $(<list) | sed 's/ /|/g')
# print lines of specified line numbers and store output
awk -v lineS="^($lines)$" 'NR ~ lineS' file1 > out
I would like the option of extracting the following string/data:
/work/foo/processed/25
/work/foo/processed/myproxy
/work/foo/processed/sample
=or=
25
myproxy
sample
But it would help if I see both.
From this output using cut or perl or anything else that would work:
Found 3 items
drwxr-xr-x - foo_hd foo_users 0 2011-03-16 18:46 /work/foo/processed/25
drwxr-xr-x - foo_hd foo_users 0 2011-04-05 07:10 /work/foo/processed/myproxy
drwxr-x--- - foo_hd testcont 0 2011-04-08 07:19 /work/foo/processed/sample
Doing a cut -d" " -f6 will get me foo_users, testcont. I tried increasing the field to higher values and I'm just not able to get what I want.
I'm not sure if cut is good for this or something like perl?
The base directories will remain static /work/foo/processed.
Also, I need the first line Found Xn items removed. Thanks.
You can do a substitution from beginning to the first occurrence of / , (non greedily)
$ your_command | ruby -ne 'print $_.sub(/.*?\/(.*)/,"/\\1") if /\//'
/work/foo/processed/25
/work/foo/processed/myproxy
/work/foo/processed/sample
Or you can find a unique separator (field delimiter) to split on. for example, the time portion is unique , so you can split on that and get the last element. (2nd element)
$ ruby -ne 'print $_.split(/\s+\d+:\d+\s+/)[-1] if /\//' file
/work/foo/processed/25
/work/foo/processed/myproxy
/work/foo/processed/sample
With awk,
$ awk -F"[0-9][0-9]:[0-9][0-9]" '/\//{print $NF}' file
/work/foo/processed/25
/work/foo/processed/myproxy
/work/foo/processed/sample
perl -lanF"\s+" -e 'print #F[-1] unless /^Found/' file
Here is an explanation of the command-line switches used:
-l: remove line break from each line of input, then add one back on print
-a: auto-split each line of input into an #F array
-n: loop through each line of input
-F: the regexp pattern to use for the auto-split (with -a)
-e: the perl code to execute (for each line of input if using -n or -p)
If you want to just output the last portion of your directory path, and the basedir is always '/work/foo/processed', I would do this:
perl -nle 'print $1 if m|/work/foo/processed/(\S+)|' file
Try this out :
<Your Command> | grep -P -o '[\/\.\w]+$'
OR if the directory '/work/foo/processed' is always static then:
<Your Command>| grep -P -o '\/work\/foo\/processed\/.+$'
-o : Show only the part of a matching line that matches PATTERN.
-P : Interpret PATTERN as a Perl regular expression.
In this example, the last word in the input will be matched .
(The word can also contain dot(s)),so file names like 'text_file1.txt', can be matched).
Ofcourse, you can change the pattern, as per your requirement.
If you know the columns will be the same, and you always list the full path name, you could try something like:
ls -l | cut -c79-
which would cut out the 79th character until the end. That might work in this exact case, but I think it would be better to find the basename of the last field. You could easily do this in awk or perl. Respond if this is not what you want and I'll add the awk and perl versions.
take the output of your ls command and pipe it to awk
your command|awk -F'/' '{print $NF}'
your_command | perl -pe 's#.*/##'
I have program output that looks like this (tab delim):
$ ./mycode somefile
0000000000000000000000000000000000 238671
0000000000000000000000000000000001 0
0000000000000000000000000000000002 0
0000000000000000000000000000000003 0
0000000000000000000000000000000010 0
0000000000000000000000000000000011 1548.81
0000000000000000000000000000000012 0
0000000000000000000000000000000013 937.306
What I want to do is on FIRST column only: replace 0 with A, 1 with C, 2 with G, and 3 with T.
Is there a way I can transliterate that output piped directly from "mycode".
Yielding this:
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 238671
...
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACT 937.306
Using Perl:
C:\> ./mycode file | perl -lpe "($x,$y)=split; $x=~tr/0123/ACGT/; $_=qq{$x\t$y}"
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 238671
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC 0
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG 0
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAT 0
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACA 0
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACC 1548.81
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACG 0
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACT 937.306
You can use single quotes in Bash:
$ ./mycode file | perl -lpe '($x,$y)=split; $x=~tr/0123/ACGT/; $_="$x\t$y"'
As #ysth notes in the comments, perl actually provides the command line options -a and -F:
-a autosplit mode with -n or -p (splits $_ into #F)
...
-F/pattern/ split() pattern for -a switch (//'s are optional)
Using those:
perl -lawnF'\t' -e '$,="\t"; $F[0] =~ y/0123/ACGT/; print #F'
It should be possible to do it with sed, put this in a file (you can do it command-line to, with -e, just don't forget those semicolons, or use separate -e for each line). (EDIT: Keep in mind, since your data is tab delimited, it should in fact be a tab character, not a space, in the first s//, make sure your editor doesn't turn it into spaces)
#!/usr/bin/sed -f
h
s/ .*$//
y/0123/ACGT/
G
s/\n[0-3]*//
and use
./mycode somefile | sed -f sedfile
or chmod 755 sedfile and do
./mycode somefile | sedfile
The steps performed are:
copy buffer to hold space (replacing held content from previous line, if any)
remove trailing stuff (from first space to end of line)
transliterate
append contents from hold space
remove the newline (from the append step) and all digits following it (up to the space)
Worked for me on your data at least.
EDIT:
Ah, you wanted a one-liner...
GNU sed
sed -e "h;s/ .*$//;y/0123/ACGT/;G;s/\n[0-3]*//"
or old-school sed (no semicolons)
sed -e h -e "s/ .*$//" -e "y/0123/ACGT/" -e G -e "s/\n[0-3]*//"
#sarathi
\AWK solution for this
awk '{gsub("0","A",$1);gsub("1","C",$1);gsub("2","G",$1);gsub("3","T",$1); print $1"\t"$2}' temp.txt