Unix - Split to N files using regexp to name destination file - sed

How do I split a file into N files, using the first 2 characters of each line as the filename?
Ex input file:
AA23409234TEXT
BA23201202Other Text
AA23509234YADA
BA23202202More Text.
C1000000000000000000
Should generate 3 files:
AA.txt
AA23409234TEXT
AA23509234YADA
BA.txt
BA23201202Other Text
BA23202202More Text.
C1.txt
C1000000000000000000
I'm thinking of using a sed script similar to this
/^(..)/w \1
But what that really does is create a file named '\1' instead of the capture group.
Any ideas?

$ awk '{fname=substr($0, 1, 2) ".txt"; print >>fname}' input.txt
Or
$ while IFS= read -r line; do printf '%s\n' "$line" >>"${line:0:2}.txt"; done <input.txt
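A minimal sketch of the awk approach wrapped in a shell function, assuming every line has at least two characters and the output files go in the current directory. The function name split_by_prefix is hypothetical. It closes each file after writing so that an input with many distinct prefixes does not exhaust awk's open-file limit:

```shell
# split_by_prefix FILE: append each line of FILE to "<first two chars>.txt"
# in the current directory. Hypothetical helper name.
split_by_prefix() {
    awk '{
        fname = substr($0, 1, 2) ".txt"   # awk strings are 1-indexed
        print >> fname                     # append, so reopening keeps earlier lines
        close(fname)                       # keep the number of open files at one
    }' "$1"
}
```

Because the files are opened in append mode, remove any old AA.txt, BA.txt, and so on before re-running it on the same input.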

The first thing you need to do is determine all of your file names:
filenames=$(sed 's/\(..\).*/\1/' listOfStrings.txt | sort | uniq)
Then, loop through those filenames
for filename in $filenames
do
sed -n "/^$filename/p" listOfStrings.txt > "$filename.txt"
done
I have not tested this, but I think it should work.

This might work for you:
sed 's/\(..\).*/echo "&" >>\1.txt/' file | sh
or if you have GNU sed:
sed 's/\(..\).*/echo "&" >>\1.txt/e' file

Related

How to do a custom "grep" in linux terminal

Suppose I have file which contains only a text like below:
Test transition to drned-internal-asr9k-rt24711
load drned-internal-asr9k-rt24711
commit**
Now on the terminal if I do
cat filename | grep load
I would get output something like
load drned-internal-asr9k-rt24711
But how can I modify my grep command to get output as
drned-internal-asr9k-rt24711.txt
i.e. remove "load " and add ".txt" at the end. So how to do that??
Maybe not the best solution, but:
grep '^load ' filename | cut -c6- | sed 's/$/.txt/'
cut -c6- keeps everything from the 6th character on, dropping the 5 characters of "load "
sed 's/$/.txt/' appends ".txt" to each output line
This can be achieved with the following:
sed -nr 's/.*load\s+(.*)/\1.txt/p' file.txt
This matches anything after load (plus one or more spaces) and returns it, adding .txt on the end.
awk '{for(i=1;i<NF;i++){if(tolower($i)~/^load$/){print $(i+1) ".txt"}}}' file.txt
This matches next column after load and append .txt to it in output.
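The sed answer above relies on GNU extensions (-r and \s). A sketch of the same idea restricted to POSIX sed, assuming the "load" keyword starts the line as in the sample file; the function name load_targets is hypothetical:

```shell
# load_targets FILE: for each line starting with "load", print the next
# word with ".txt" appended. Hypothetical helper; POSIX sed only.
load_targets() {
    sed -n 's/^load[[:space:]][[:space:]]*\([^[:space:]]*\).*/\1.txt/p' "$1"
}
```

Anchoring on ^load avoids matching lines that merely contain the word somewhere else.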

How to Find & Replace a String Within Files with Find / Grep / Sed

I have a folder of 500 *.INI files that I need to manually edit. Within each INI file, I have the line Source =. I would like that line to become Source = C:\software\{filename}.
For instance, a dx4.ini file would need to be fixed to become: Source = C:\software\dx4
Is there a quick way to do this with Find, Grep, or Sed functions?
You can try with sed
For example
Input file contents:
file.txt
Source =
some lines..
script:
newstring='Source = C:\\software\\dx4'
oldstring='Source ='
sed "s/$oldstring/$newstring/" file.txt > file.tmp && mv file.tmp file.txt
After running the above commands
output:
Source = C:\software\dx4
some lines..
If you want to edit a file in a script, I think ed is the way to go. Combined with a shell for loop:
for file in *.INI; do
base=$(basename "$file" .INI)
ed -s "$file" <<EOF
/^Source =/s/=/= C:\\\\software\\\\$base/
w
EOF
done
(This does assume that filenames will not have newlines or ampersands in their names)
With GNU awk for the 3rd arg to match(), gensub(), and "inplace" editing:
awk -i inplace '
match($0,/(.*Source = C:\\software\\){filename}(.*)/,a) {
fname = gensub(/\..*/,"",1,FILENAME)
$0 = a[1] fname a[2]
}
1' *.INI
The above assumes you're running in a UNIX environment, though your use of the term folder instead of directory and that path starting with C: and containing backslashes make me suspicious. If you're on Windows then save the part between the 2 's (exclusive) in a file named foo.awk and execute it as awk -i inplace -f foo.awk *.INI or however it is you normally execute commands like this in Windows.
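find . -name '*.INI' -type f > stack
while IFS= read -r line
do
base=$(basename "$line" .INI)
sed -i "s#Source =#Source = C:\\\\software\\\\$base#" "${line}"
done < stack
Assuming that a) you have a sed with "-i" (the in-place editing flag, which is not portable across sed implementations) and b) the backslashes survive both the shell's and sed's escaping (hence four backslashes in the double-quoted string for each literal one), I think that will work.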
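A sketch of the same loop without sed -i, writing through a temporary file instead. It assumes the base names contain no "&", "#", or backslash, and that each file has a line starting with "Source =". The function name set_ini_source is hypothetical:

```shell
# set_ini_source DIR: in every *.INI file directly inside DIR, rewrite the line
# starting with "Source =" to "Source = C:\software\<basename>".
# Hypothetical helper; assumes base names contain no "&", "#", or backslash.
set_ini_source() {
    for file in "$1"/*.INI; do
        [ -f "$file" ] || continue          # skip the unexpanded glob if no match
        base=$(basename "$file" .INI)
        # Inside single quotes "\\" reaches sed as two backslashes, which
        # sed's replacement syntax turns into one literal backslash.
        sed 's#^Source =.*#Source = C:\\software\\'"$base"'#' "$file" > "$file.tmp" &&
            mv "$file.tmp" "$file"
    done
}
```

Keeping the sed expression in single quotes and splicing in "$base" separately means only one level of backslash escaping has to be counted.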

Inserting the filename before the first line of a text file

I'm trying to add the filename of a text file as the first line of the same text file. For example, if the file is called test1.txt, then the first line when you open the file should be test1.
Below is what I've done so far. The only problem I have is that the literal text "$file" is being written to the file, not the file name. Any help is appreciated.
for file in *.txt; do
sed -i '1 i\$file' $file;
awk 'sub("$", "\r")' "$file" > "$file"1;
mv "$file"1 "$file";
done
Without concise, testable sample input and expected output it's an untested guess but it SOUNDS like all you need is:
awk -i inplace -v ORS='\r\n' 'FNR==1{print FILENAME}1' *
No shell loop or multiple commands required.
The above uses GNU awk for inplace editing and I'm assuming the sub() in your code was intended to add a \r at the end of every line.
I've just started learning more about sed and awk and put this into a file called insert.sed and sourced it and passed it a file name:
sed -i '1s/^./'$1'\'$'\n/g' $1
In trying it, it seems to work okay:
rent$ cat x.txt
<<< Who are you?
rent$ source insert.sed x.txt
rent$ cat x.txt
x.txt
<< Who are you?
It is cutting off the first character of the first line so I'd have to fix that otherwise it does add the file name to first line.
I'm sure there's better ways of doing it.
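One way to avoid losing the first character is to skip the substitution entirely and write the name followed by the original contents to a temporary file. A minimal sketch, assuming a ".txt" extension that should be stripped; the function name prepend_name is hypothetical:

```shell
# prepend_name FILE: insert FILE's base name (without ".txt") as a new
# first line, leaving the existing lines untouched. Hypothetical helper.
prepend_name() {
    name=$(basename "$1" .txt)
    { printf '%s\n' "$name"; cat "$1"; } > "$1.tmp" && mv "$1.tmp" "$1"
}
```

It can be applied to many files with a loop such as: for f in *.txt; do prepend_name "$f"; done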
If you want test1 on first line, with gnu sed
sed -i '1{x;s/.*/fich=$(ps -p $PPID -o args=);fich=${fich##*\\} };echo ${fich%%.*}/e;G}' test1.txt

how to use the name of the input file in sed replace

I have several files in which I want to replace a certain word with the name of the file itself.
For example, I have 2 files named test1.txt and test2.txt.
Both files are identical and look like
bla1,bla2,temp
bla2,bla3,temp
With sed I want to replace the word temp with the name of the file itself,
so after the sed operation I have 2 different files:
test1.txt, which looks like:
bla1,bla2,test1
bla2,bla3,test1
test2.txt, which looks like:
bla1,bla2,test2
bla2,bla3,test2
So my question is: how do I use the actual name of the input file itself as part of the replace command?
sed "s/temp/ ??filename??/ ??? " *.txt
thanks for your suggestions
I'm not sure you can reference the filename using sed, although I could be wrong. You would probably use a shell hack. A better approach to substitute all occurrences of temp with the filename would be the following awk script:
$ awk '{gsub(/temp/,FILENAME)}1' file
use awk, awk has FILENAME variable:
awk '{sub(/temp/,FILENAME)}7' yourfile
awk 'BEGIN{FS=OFS=","} {$NF=FILENAME}1' file
The difference between this and the sub() solutions is that this will work even if the word "temp" exists elsewhere in your file, e.g. if "bla1" contains the word "temperature".
If you need to strip ".txt" from the file name as it appears from your posted desired output, tweak it to:
awk 'BEGIN{FS=OFS=","} {t=FILENAME; sub(/\.txt$/,"",t); $NF=t}1' file
You can probably edit FILENAME itself but I find it best not to mess with the builtin variables if you don't have to.
You could do it with a little bit of bash to help you out, if that's available.
find . -name "test*.txt" -type f | awk -F '/' '{print $2;}' | while read file; do sed -i "s|temp|$file|" ./$file; done
That's a kind of hacky adaptation of a script I have to do something similar. It can undoubtedly be shortened.
sed has no internal variable for the file name, so you need a surrounding shell loop to supply it for a generic process:
for FileName in MyFileShellFilter
do
cat <> ${FileName} | sed "s|,temp$|,${FileName}|"
done
Just be careful with the file names used: they normally don't contain \ but could contain &, which has a special meaning in s///. I use | as the separator to allow / in file names, so for this reason no unescaped | is allowed in a file name (normally there is none).
with xargs:
printf "%s\n" *.txt | xargs -I FILE -L 1 sed 's/temp/FILE/' FILE
The filename cannot have: newlines, slashes, ampersand, single quote.
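The restrictions on special characters mentioned in the answers above can be relaxed by escaping the name before it is placed in the replacement. A sketch, assuming the word to replace is the last comma-separated field and the ".txt" extension should be dropped; the function name replace_with_name is hypothetical, and file names containing newlines are still not handled:

```shell
# replace_with_name FILE...: replace a trailing ",temp" field on each line
# with the file's base name. Hypothetical helper.
replace_with_name() {
    for file in "$@"; do
        name=$(basename "$file" .txt)
        # Escape what is special in a sed replacement: \ and &, plus the | delimiter.
        safe=$(printf '%s\n' "$name" | sed 's/[\\&|]/\\&/g')
        sed "s|,temp\$|,$safe|" "$file" > "$file.tmp" && mv "$file.tmp" "$file"
    done
}
```

Matching ",temp" only at the end of the line keeps words such as "temperature" elsewhere on the line intact.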

How to extract URL from html source with sed/awk or cut?

I am writing a script that will download an html page source as a file and then read the file and extract a specific URL that is located after a specific code. (it only has 1 occurrence)
Here is a sample that I need matched:
<img id="sample-image" class="photo" src="http://xxxx.com/some/ic/pic_1asda963_16x9.jpg"
The code preceding the URL will always be the same so I need to extract the part between:
<img id="sample-image" class="photo" src="
and the " after the URL.
I tried something with sed like this:
sed -n '\<img\ id=\"sample-image\"\ class=\"photo\"\ src=\",\"/p' test.txt
But it does not work. I would appreciate your suggestions, thanks a lot !
You can use grep like this :
grep -oP '<img\s+id="sample-image"\s+class="photo"\s+src="\K[^"]+' test.txt
or with sed :
sed -nr 's/.*<img\s+id="sample-image"\s+class="photo"\s+src="([^"]+)".*/\1/p' test.txt
or with awk :
awk -F'"' '/<img id="sample-image"/{print $6}' test.txt
If you have GNU grep then you can do something like:
grep -oP "(?<=src=\")[^\"]+(?=\")" test.txt
If you wish to use awk then the following would work:
awk -F\" '{print $(NF-1)}' test.txt
With sed as
echo "$string" | sed 's/.*<img.*src="\([^"]*\)".*/\1/'
A few things about the sed command you are using:
sed -n '\<img\ id=\"sample-image\"\ class=\"photo\"\ src=\",\"/p' test.txt
You don't need to escape the <, " or space. The single quotes prevents the shell from doing word splitting and other stuff on your sed expression.
You are essentially doing this sed -n '/pattern/p' test.txt (except you seem to be missing the opening slash), which says "match this pattern, then print the line which contains the match"; you are not really extracting the URL.
This is minor, but you don't need to match class="photo" since the id already makes the HTML element unique (no two elements share the same id w/in the same HTML).
Here's what I would do
sed -n 's/.*<img id="sample-image".*src="\([^"]*\)".*/\1/p' test.txt
The p flag tells sed to print the line where substitution (s) was performed.
\(pattern\) captures a subexpression which can be accessed via \1, \2, etc. on the right side of s///
The .* at the start of regex is in case there is something else preceding the <img> element on the line (you did mention you are parsing a HTML file)
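The points above can be put together as a small function. This is a sketch, assuming the img tag and its src attribute sit on one line and that id comes before src as in the sample; the function name extract_src is hypothetical:

```shell
# extract_src FILE: print the src URL of the <img id="sample-image"> tag.
# Hypothetical helper; assumes the whole tag is on a single line.
extract_src() {
    # [^>]* keeps the match inside the same tag; [^"]* stops at the closing quote.
    sed -n 's/.*<img id="sample-image"[^>]*src="\([^"]*\)".*/\1/p' "$1"
}
```

Using [^>]* rather than .* between the id and src prevents the pattern from picking up a src attribute that belongs to a later tag on the same line.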