I've made a small script trying to search through a file looking for all occurrences of specific strings like this: a0002 b0590 c0964
The script goes like this:
#!/bin/sh
#include <stdio.h>
#
while read id;
do
awk {'print $1'} test.trans | grep -e "$id"
done < test.id
To simplify things, I made a stripped down version of the file I'm searching through (test.trans):
"a0001"
"a0002"
"b0586"
"b0587"
"b0588"
"b0589"
"b0590"
"b0591"
"b0852"
"b0952"
"a0002"
"b0587"
"c0952"
"c0964"
"c1783"
"c1786"
"c1787"
I have stored all the relevant search strings in a separate file named test.id which looks like this:
a0002
b0587
b0588
b0589
b0590
b0591
b0852
b0952
c0952
c0964
c1781
The idea is to pass each search string in the test.id file as a variable which is then used by grep to filter out all occurrences in the test.trans file
However, when I run the script, grep only matches some of the strings. When I change the order of the search patterns in the test.id file, the result also changes. What am I doing wrong?
I consider myself a newbie in shell programming, and would appreciate any help.
I don't know the reason for your problem.
But here are some remarks that don't fit in a comment.
If you just want to write a file to stdout (the standard output), you can use cat test.trans instead of awk {'print $1'} test.trans.
But if you want grep to process a file, you don't need to read it with some other tool and pipe it to grep. grep can read the file directly: grep -e "$id" test.trans
If you already use awk, you don't need grep. You can achieve this with awk alone by calling awk '/'"$id"'/ {print $1}' test.trans
grep can also filter on more than one pattern at a time. Instead of your while loop, do
grep -f test.id test.trans
After some experimenting, I found out that it all boiled down to removing the carriage return (\r) from each line in the test.id file. This file was received from a DOS-machine, and my iMac is using UNIX-format which only contains newline (\n)
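The fix described above can be reproduced with a small sketch. It reuses the file names from the question (test.id, test.trans) but builds throwaway copies with made-up contents in a temporary directory. Using tr -d '\r' is only one common way to strip the carriage returns; a tool such as dos2unix would do the same job where it is installed.

```shell
#!/bin/sh
# Sketch: reproduce the CRLF problem and fix it with tr.
# The temporary directory and the sample data are invented for illustration.
workdir=$(mktemp -d)

# A search-pattern file with DOS line endings (\r\n).
printf 'a0002\r\nb0590\r\nc0964\r\n' > "$workdir/test.id"
# A data file with plain UNIX line endings (\n).
printf '"a0001"\n"a0002"\n"b0590"\n"c0964"\n' > "$workdir/test.trans"

# With the \r still attached, each pattern is really "a0002\r" and so on,
# which does not occur anywhere in the data file.
before=$(grep -c -F -f "$workdir/test.id" "$workdir/test.trans")

# Strip the carriage returns, then let grep read all patterns at once.
tr -d '\r' < "$workdir/test.id" > "$workdir/test.id.unix"
after=$(grep -c -F -f "$workdir/test.id.unix" "$workdir/test.trans")

echo "matching lines before: $before, after: $after"
rm -rf "$workdir"
```

The -F option makes grep treat every pattern as a fixed string, which suits plain IDs like these.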
I have several files in which I want to replace a certain word with the name of the file itself.
For example, I have 2 files named test1.txt and test2.txt.
Both files are identical and look like this:
bla1,bla2,temp
bla2,bla3,temp
With sed I want to replace the word temp with the name of the file itself,
so after the sed operation I have 2 different files:
test1.txt, which looks like:
bla1,bla2,test1
bla2,bla3,test1
test2.txt, which looks like:
bla1,bla2,test2
bla2,bla3,test2
So my question: how do I use the actual name of the input file itself as part of the replace command?
sed "s/temp/ ??filename??/ ??? " *.txt
thanks for your suggestions
I'm not sure you can reference the filename using sed, although I could be wrong. You would probably need a shell hack. A better approach to substitute all occurrences of temp with the filename would be the following awk script:
$ awk '{gsub(/temp/,FILENAME)}1' file
use awk, awk has FILENAME variable:
awk '{sub(/temp/,FILENAME)}7' yourfile
awk 'BEGIN{FS=OFS=","} {$NF=FILENAME}1' file
The difference between this and the sub() solutions is that this will work even if the word "temp" exists elsewhere in your file, e.g. if "bla1" contains the word "temperature".
If you need to strip ".txt" from the file name as it appears from your posted desired output, tweak it to:
awk 'BEGIN{FS=OFS=","} {t=FILENAME; sub(/\.txt$/,"",t); $NF=t}1' file
You can probably edit FILENAME itself but I find it best not to mess with the builtin variables if you don't have to.
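Neither awk command above writes its result back to the file. A minimal sketch of one way to do that for several files is shown below. It passes the stripped name in through an awk variable (called name here, an arbitrary choice) instead of touching FILENAME, and it assumes the sample files from the question, recreated in a temporary directory.

```shell
#!/bin/sh
# Sketch: replace the last comma-separated field with the file's base name
# and write the result back to each file.
# File names and contents are sample data modelled on the question.
workdir=$(mktemp -d)
printf 'bla1,bla2,temp\nbla2,bla3,temp\n' > "$workdir/test1.txt"
printf 'bla1,bla2,temp\nbla2,bla3,temp\n' > "$workdir/test2.txt"

for f in "$workdir"/*.txt; do
    name=$(basename "$f" .txt)    # test1.txt becomes test1
    # awk has no portable in-place mode, so write to a temporary file
    # and only replace the original if awk succeeded.
    awk -v name="$name" 'BEGIN { FS = OFS = "," } { $NF = name } 1' "$f" > "$f.tmp" &&
        mv "$f.tmp" "$f"
done

result1=$(cat "$workdir/test1.txt")
result2=$(cat "$workdir/test2.txt")
printf '%s\n' "$result1" "$result2"
rm -rf "$workdir"
```

Because -v processes backslash escapes in the value, file names containing backslashes would need extra care.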
You could do it with a little bit of bash to help you out, if that's available.
find . -name "test*.txt" -type f | awk -F '/' '{print $2;}' | while read file; do sed -i "s|temp|$file|" ./$file; done
That's a kind of hacky adaptation of a script I have to do something similar. It can undoubtedly be shortened.
sed has no internal variable for the file name, so for a generic solution you need a shell loop around it:
for FileName in MyFileShellFilter
do
sed "s|,temp$|,${FileName}|" "${FileName}"
done
Just be careful with the file names used: they normally don't contain \, but they could contain &, which has a special meaning in the replacement part of s//. I use | as the separator to allow / in file names, but for that reason no unescaped | is allowed in a file name (normally there is none).
with xargs:
printf "%s\n" *.txt | xargs -I FILE -L 1 sed 's/temp/FILE/' FILE
The filename cannot have: newlines, slashes, ampersand, single quote.
I'm trying to grab data from HTML output that looks like this:
<strong>Target1NoSpaces</strong><span class="creator"> ....
<strong>Target2 With Spaces</strong><span class="creator"> ....
I'm using a pipe train to whittle down the data to the targets I'm trying to hit. Here's my approach so far:
grep "/strong" output.html | awk '{print $1}'
Grep on "/strong" to get the lines with the targets; that works fine.
Pipe to awk '{print $1}'. That works in case #1 when the target has no spaces, but fails in case #2 when the target has spaces: only the first word is preserved, as below:
<strong>Target1NoSpaces</strong><span
<strong>Target2
Do you have any tips on hitting the target properly, either in my awk or in different command? Anything quick and dirty (grep, awk, sed, perl) would be appreciated.
Try pup, a command line tool for processing HTML. For example:
$ pup 'strong text{}' < file.html
Target1NoSpaces
Target2 With Spaces
To search via XPath, try xpup.
Alternatively, for a well-formed HTML/XML document, try html-xml-utils.
One way using mojolicious and its DOM parser:
perl -Mojo -E '
g("http://your.web")
->dom
->find("strong")
->each( sub { if ( $t = shift->text ) { say $t } } )'
Using Perl regex's look-behind and look-ahead feature in grep. It should be simpler than using awk.
grep -oP "(?<=<strong>).*?(?=</strong>)" file
Output:
Target1NoSpaces
Target2 With Spaces
Add:
This implementation of Perl's regex's multi-matching in Ruby could match values in multiple lines:
ruby -e 'File.read(ARGV.shift).scan(/(?<=<strong>).*?(?=<\/strong>)/m).each{|e| puts "----------"; puts e;}' file
Input:
<strong>Target
A
B
C
</strong><strong>Target D</strong><strong>Target E</strong>
Output:
----------
Target
A
B
C
----------
Target D
----------
Target E
Here's a solution using xmlstarlet
xml sel -t -v //strong input.html
Trying to parse HTML without a real HTML parser is a bad idea. Having said that, here is a very quick and dirty solution to the specific example you provided. It will not work when there is more than one <strong> tag on a line, when the tag runs over more than one line, etc.
awk -F '<strong>|</strong>' '/<strong>/ {print $2}' filename
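As a quick check of the command above, here is a sketch that runs it against the two sample lines from the question. The variable names are arbitrary, and the same limitations apply: one <strong> element per line, not spanning lines.

```shell
#!/bin/sh
# Sketch: split each line on the <strong> and </strong> tags and print
# the text between them. The input is sample data modelled on the question.
html='<strong>Target1NoSpaces</strong><span class="creator"> ....
<strong>Target2 With Spaces</strong><span class="creator"> ....'

# With this field separator, $1 is the (empty) text before <strong>,
# $2 is the text inside the element, and $3 is everything after </strong>.
targets=$(printf '%s\n' "$html" |
    awk -F '<strong>|</strong>' '/<strong>/ { print $2 }')

printf '%s\n' "$targets"
```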
You never need grep with awk and the field separator doesn't have to be whitespace:
$ awk -F'<|>' '/strong/{print $3}' file
Target1NoSpaces
Target2 With Spaces
You should really use a proper parser for this however.
Since you tagged perl
perl -ne 'if(/(?:<strong>)(.*)(?:<\/strong>)/){print $1."\n";}' input.html
I am surprised no one mentions W3C HTML-XML-utils
curl -Ss https://stackoverflow.com/questions/18746957/parsing-html-on-the-command-line-how-to-capture-text-in-strong-strong |
hxnormalize -x |
hxselect -s '\n' strong
output:
<strong class="fc-black-750 mb6">Stack Overflow
for Teams</strong>
<strong>Teams</strong>
To capture only content:
curl -Ss https://stackoverflow.com/questions/18746957/parsing-html-on-the-command-line-how-to-capture-text-in-strong-strong |
hxnormalize -x |
hxselect -s '\n' -c strong
Stack Overflow
for Teams
Teams
I have a CSV. I want to edit the 35th field of the CSV and write the change back to the 35th field. This is what I am doing on bash:
awk -F "," '{print $35}' test.csv | sed -i 's/^0/+91/g'
So, I am pulling the 35th field using awk and then replacing the "0" at the start of the string with "+91". This works perfectly and I get the desired output on the console.
Now I want this new value written back to the file. I am thinking of sed's "in-place" replacement feature, but this feature needs an input file. In the above command, I cannot provide an input file because my primary command is awk and sed is taking its input from awk.
Thanks.
You should choose one of the two tools. As for sed, it can be done as follows:
sed -ri 's/^(([^,]*,){34})0([^,]*)/\1+91\3/' test.csv
Not sure about awk, but @shellter's comment might help with that.
The in-place feature of sed is misnamed, as it does not edit the file in place. Instead, it creates a new file with the same name. eg:
$ echo foo > foo
$ ln -f foo bar
$ ls -i foo bar # These are the same file
797325 bar 797325 foo
$ echo new-text > foo # Changes bar
$ cat bar
new-text
$ printf '/new/s//newer\nw\nq\n' | ed foo # Edit foo "in-place"; changes bar
9
newer-text
11
$ cat bar
newer-text
$ ls -i foo bar # Still the same file
797325 bar 797325 foo
$ sed -i s/new/newer/ foo # Does not edit in-place; creates a new file
$ ls -i foo bar
797325 bar 792722 foo
Since sed is not actually editing the file in place, but writing a new file and then renaming it to the old file, you might as well do the same.
awk ... test.csv | sed ... > test.csv.1 && mv test.csv.1 test.csv
There is the misperception that using sed -i somehow avoids the creation of the temporary file. It does not. It just hides the fact from you. Sometimes abstraction is a good thing, but other times it is unnecessary obfuscation. In the case of sed -i, it is the latter. The shell is really good at file manipulation. Use it as intended. If you do need to edit a file in place, don't use the streaming version of ed; just use ed
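The write-then-rename pattern described above might look like the sketch below when the edit itself is done by awk alone. It assumes a simple CSV without quoted fields that contain commas, and the one-row sample file is generated purely for illustration.

```shell
#!/bin/sh
# Sketch: rewrite only the 35th CSV field, then replace the original file.
# Assumption: no quoted fields containing commas.
workdir=$(mktemp -d)
csv="$workdir/test.csv"

# Build one sample row with 36 fields; field 35 is 09876543210.
awk 'BEGIN {
    for (i = 1; i <= 36; i++)
        printf "%s%s", (i == 35 ? "09876543210" : "f" i), (i < 36 ? "," : "\n")
}' > "$csv"

# Write the edited copy to a temporary file in the same directory, and
# rename it over the original only if awk succeeded.
tmp=$(mktemp "$workdir/test.csv.XXXXXX")
awk 'BEGIN { FS = OFS = "," } NF >= 35 { sub(/^0/, "+91", $35) } 1' "$csv" > "$tmp" &&
    mv "$tmp" "$csv"

field35=$(cut -d, -f35 "$csv")
field36=$(cut -d, -f36 "$csv")
echo "field 35 is now: $field35"
rm -rf "$workdir"
```

Keeping the temporary file in the same directory as the target means the final mv is a rename on one filesystem rather than a copy.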
So, it turned out there are numerous ways to do it. I got it working with sed as below:
sed -i 's/0\([0-9]\{10\}\)/\+91\1/g' test.csv
But this is a little risky, as it will edit any entry that matches the pattern, not only the 35th field. In my case, however, it is working fine.
Similar implementation of above logic in perl:
perl -p -i -e 's/\b0(\d{10})\b/\+91$1/g;' test.csv
Again, same caveat as mentioned above.
A more precise way of doing it, as shown by Lev Levitsky, operates specifically on the 35th field:
sed -ri 's/^(([^,]*,){34})0([^,]*)/\1+91\3/g' test.csv
For more complex situations, I will have to consider using any of the csv modules of perl.
Thanks everyone for your time and input. I surely know more about sed/awk after reading your replies.
This might work for you:
sed -i 's/[^,]*/+91/35' test.csv
EDIT:
To replace the leading zero in the 35th field:
sed 'h;s/[^,]*/\n&/35;/\n0/!{x;b};s//+91/' test.csv
or more simply:
sed 's/^\(\([^,]*,\)\{34\}\)0/\1+91/' test.csv
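If you have moreutils installed, you can simply use the sponge tool. Note that the pipeline has to emit the whole file, not only the 35th field, and that sed must run without -i because it reads the file itself and writes to the pipe:
sed -r 's/^(([^,]*,){34})0([^,]*)/\1+91\3/' test.csv | sponge test.csv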
sponge soaks up the input, closes the input pipe (stdin) and, only then, opens and writes to the test.csv file.
As of 2015, moreutils is available in package repositories of several major Linux distributions, such as Arch Linux, Debian and Ubuntu.
Another perl solution to edit the 35th field in-place:
perl -i -F, -lane '$F[34] =~ s/^0/+91/; print join ",",@F' test.csv
These command-line options are used:
-i edit the file in-place
-n loop around every line of the input file
-l removes newlines before processing, and adds them back in afterwards
-a autosplit mode – split input lines into the @F array. Defaults to splitting on whitespace.
-e execute the perl code
-F autosplit modifier, in this case splits on ,
@F is the array of words in each line, indexed starting with 0
$F[34] is the 35th element of the array
s/^0/+91/ does the substitution
I would like the option of extracting the following string/data:
/work/foo/processed/25
/work/foo/processed/myproxy
/work/foo/processed/sample
=or=
25
myproxy
sample
But it would help if I see both.
From this output using cut or perl or anything else that would work:
Found 3 items
drwxr-xr-x - foo_hd foo_users 0 2011-03-16 18:46 /work/foo/processed/25
drwxr-xr-x - foo_hd foo_users 0 2011-04-05 07:10 /work/foo/processed/myproxy
drwxr-x--- - foo_hd testcont 0 2011-04-08 07:19 /work/foo/processed/sample
Doing a cut -d" " -f6 will get me foo_users, testcont. I tried increasing the field to higher values and I'm just not able to get what I want.
I'm not sure if cut is good for this or something like perl?
The base directories will remain static /work/foo/processed.
Also, I need the first line Found Xn items removed. Thanks.
You can do a substitution from beginning to the first occurrence of / , (non greedily)
$ your_command | ruby -ne 'print $_.sub(/.*?\/(.*)/,"/\\1") if /\//'
/work/foo/processed/25
/work/foo/processed/myproxy
/work/foo/processed/sample
Or you can find a unique separator (field delimiter) to split on. For example, the time portion is unique, so you can split on that and take the last element (here, the 2nd element).
$ ruby -ne 'print $_.split(/\s+\d+:\d+\s+/)[-1] if /\//' file
/work/foo/processed/25
/work/foo/processed/myproxy
/work/foo/processed/sample
With awk,
$ awk -F"[0-9][0-9]:[0-9][0-9]" '/\//{print $NF}' file
/work/foo/processed/25
/work/foo/processed/myproxy
/work/foo/processed/sample
perl -lanF"\s+" -e 'print $F[-1] unless /^Found/' file
Here is an explanation of the command-line switches used:
-l: remove line break from each line of input, then add one back on print
-a: auto-split each line of input into an @F array
-n: loop through each line of input
-F: the regexp pattern to use for the auto-split (with -a)
-e: the perl code to execute (for each line of input if using -n or -p)
If you want to just output the last portion of your directory path, and the basedir is always '/work/foo/processed', I would do this:
perl -nle 'print $1 if m|/work/foo/processed/(\S+)|' file
Try this out :
<Your Command> | grep -P -o '[\/\.\w]+$'
OR if the directory '/work/foo/processed' is always static then:
<Your Command>| grep -P -o '\/work\/foo\/processed\/.+$'
-o : Show only the part of a matching line that matches PATTERN.
-P : Interpret PATTERN as a Perl regular expression.
In this example, the last word in the input will be matched.
(The word can also contain dots, so file names like 'text_file1.txt' can be matched.)
Of course, you can change the pattern as per your requirement.
If you know the columns will be the same, and you always list the full path name, you could try something like:
ls -l | cut -c79-
which would cut out the 79th character until the end. That might work in this exact case, but I think it would be better to find the basename of the last field. You could easily do this in awk or perl. Respond if this is not what you want and I'll add the awk and perl versions.
take the output of your ls command and pipe it to awk
your command|awk -F'/' '{print $NF}'
your_command | perl -pe 's#.*/##'
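Since the question asks to see both forms, here is a sketch that produces the full paths and the last path components from the sample listing. It assumes the path is always the last whitespace-separated field and contains no spaces; the variable names are arbitrary.

```shell
#!/bin/sh
# Sketch: drop the "Found N items" header, then print the full path and
# its last component. The listing is sample text copied from the question.
listing='Found 3 items
drwxr-xr-x - foo_hd foo_users 0 2011-03-16 18:46 /work/foo/processed/25
drwxr-xr-x - foo_hd foo_users 0 2011-04-05 07:10 /work/foo/processed/myproxy
drwxr-x--- - foo_hd testcont 0 2011-04-08 07:19 /work/foo/processed/sample'

# Full paths: the last field of every line that contains a slash.
# The header line has no slash, so it is skipped automatically.
paths=$(printf '%s\n' "$listing" | awk '/\// { print $NF }')

# Base names: strip everything up to and including the last slash.
names=$(printf '%s\n' "$paths" | sed 's|.*/||')

printf '%s\n' "$paths" "$names"
```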
Running the following code gives me the sample data below:
md5deep find * | awk '{ print $1 }'
A sample of the output
/Users/math/Documents/Articles/Number theory: Is a directory
258fe6853b1bfb2d07f512ff6bec52b1
/Users/math/Documents/Articles/Probability and statistics: Is a directory
4811bfb2ad04b9f4318049c01ebb52ef
8aae4ac3694658cf90005dbdea37b4d5
258fe6853b1bfb2d07f512ff6bec52b1
I have tried to filter the rows which contain Is a directory by SED unsuccessfully
md5deep find * | awk '{ print $1 }' | sed s/\/*//g
Its sample output is
/Users/math/Documents/Articles/Number theory: Is a directory
/Users/math/Documents/Articles/Topology: Is a directory
/Users/math/Documents/Articles/useful: Is a directory
How can I filter Out each row which contains "Is a directory" by SED/AWK?
[clarification]
I want to filter out the rows which contain Is a directory.
I have not used the md5deep tool, but I believe those lines are error messages; they would be going to standard error instead of standard out, and so they are going directly to your terminal instead of through the pipe. Thus, they won't be filtered by your sed command. You could filter them by merging your standard error and standard output streams, but
It looks like (I'm not sure because you are missing the backquotes) you are trying to call
md5deep `find *`
and find is returning all of the files and directories.
Some notes on what you might want to do:
It looks like md5deep has a -r for "recursive" option. So, you may want to try:
md5deep -r *
instead of the find command.
If you do wish to use a find command, you can limit it to only files using -type f, instead of files and directories. Also, you don't need to pass * into a find command (which may confuse find if there are files that have names that looks like the options that find understands); passing in . will search recursively through the current directory.
find . -type f
In sed if you wish to use slashes in your pattern, it can be a pain to quote them correctly with \. You can instead choose a different character to delimit your regular expression; sed will use the first character after the s command as a delimiter. Your pattern is also lacking a .; in regular expressions, to indicate one instance of any character you use ., and to indicate "zero or more of the preceding expression" you use *, so .* indicates "zero or more of any character" (this is different from glob patterns, in which * alone means "zero or more of any character").
sed "s|/.*||g"
If you really do want to be including your standard error stream in your standard output, so it will pass through the pipe, then you can run:
md5deep `find *` 2>&1 | awk ...
If you just want to ignore stderr, you can redirect that to /dev/null, which is a special file that just discards anything that goes into it:
md5deep `find *` 2>/dev/null | awk ...
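The difference between the two redirections can be seen without md5deep installed. The sketch below uses a stand-in shell function, fake_md5deep, which is purely hypothetical and only imitates a tool that prints a hash line on standard output and an error on standard error.

```shell
#!/bin/sh
# Sketch: how 2>/dev/null and 2>&1 change what reaches the pipe.
# fake_md5deep is a made-up helper, not part of md5deep.
fake_md5deep() {
    echo "258fe6853b1bfb2d07f512ff6bec52b1  /tmp/file.pdf"
    echo "/tmp/somedir: Is a directory" >&2
}

# Errors discarded: only the hash line reaches awk.
quiet=$(fake_md5deep 2>/dev/null | awk '{ print $1 }')

# Errors merged into standard output: the error line now travels through
# the pipe as well, so a later command can see and filter it.
merged=$(fake_md5deep 2>&1 | grep -c 'Is a directory')

echo "quiet: $quiet"
echo "error lines seen in the pipe: $merged"
```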
In summary, I think the command below will help you with your immediate problem, and the other suggestions listed above may help you if I did not understand what you were looking for:
md5deep -r * | awk '{ print $1 }'
To specifically answer the clarification: how to filter out lines using awk and sed:
awk '/Is a directory/ {next} {print}'
sed '/Is a directory/d'
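For comparison, here is a sketch that applies three line filters to the same sample text: awk, sed with the /pattern/d form, and grep -v. The sample lines are taken from the question and the variable names are arbitrary.

```shell
#!/bin/sh
# Sketch: three ways to drop lines containing "Is a directory".
sample='/Users/math/Documents/Articles/Number theory: Is a directory
258fe6853b1bfb2d07f512ff6bec52b1
4811bfb2ad04b9f4318049c01ebb52ef'

# awk: print only the lines that do not match.
with_awk=$(printf '%s\n' "$sample" | awk '!/Is a directory/')
# sed: delete the lines that match.
with_sed=$(printf '%s\n' "$sample" | sed '/Is a directory/d')
# grep: -v inverts the match.
with_grep=$(printf '%s\n' "$sample" | grep -v 'Is a directory')

printf '%s\n' "$with_sed"
```

All three read standard input, so any of them can sit at the end of a pipeline once the error lines have been merged into standard output.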
Why not use grep instead?
ie,
md5deep find * | grep "Is a directory" | awk '{ print $1 }'
Edit: I just re-read your question and if you want to remove the lines with Is a directory, use the -v flag of grep, ie:
md5deep find * | grep -v "Is a directory" | awk '{ print $1 }'
I'm not intimately familiar with md5deep, but this may do something like what you are trying to do.
find . -type f -exec md5sum {} +