Ctrl+Z Character and EOF Issues With Pipes - perl

I have a huge file provided by a third party, which appears to have been generated in a Windows/DOS-like environment. I noticed that the last line of the file contains a ^Z character when it showed up in my processed output. I added some logic to skip that line, and it worked fine until I changed my code to take input from stdin instead of from a file.
Here is a simpler illustration of this issue. When I do a line count on a single file stream with and without ^Z skipping, it reports the correct values:
unzip -j -p -qq file1.zip | perl -nle 'print' | wc -l
3451
unzip -j -p -qq file2.zip | perl -nle 'print' | wc -l
3451
unzip -j -p -qq file1.zip | perl -nle 'next if /^\cZ/; print' | wc -l
3450
unzip -j -p -qq file2.zip | perl -nle 'next if /^\cZ/; print' | wc -l
3450
Now when I try to process both files at once, I lose one record. I am guessing this has something to do with the ^Z character, but I cannot figure out what to do about it:
unzip -j -p -qq '*.zip' | perl -nle 'print' | wc -l
6901 ## this should have been 6902
unzip -j -p -qq '*.zip' | perl -nle 'next if /^\cZ/; print' | wc -l
6899 ## this should have been 6900
These files are huge (each 20+GB) and they are to be read in groups of 3-6 files, so I wanted to avoid processing them one by one and concatenating later. Any thoughts on how to avoid the ^Z character without running into the above issue?
I am on a Linux machine. Btw, opening the file in vim does not display the last record (i.e., the ^Z), and :set ff=unix did not change this either. So vim reports 3450 lines for the single unzipped file and 6900 for the combined unzipped files.
Thanks!

Since the ^Z isn't followed by a line ending, unzip is producing
file1:1
file1:2
file1:3
^Zfile2:1
file2:2
file2:3
^Z
so you delete the first line of the second file. You could simply remove the ^Z instead of the entire line.
perl -pe's/^\cZ//'
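Applied to your pipeline, that would look something like this (a sketch; the point is just to strip a leading ^Z rather than drop the whole line):
unzip -j -p -qq '*.zip' | perl -pe 's/^\cZ//' | wc -l
This should now report the expected 6900, since the line that begins with ^Z keeps its data and only the control character is removed.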
That said, unzip -a is designed for exactly this situation. Not only will it strip the ^Z for you, it will also fix the line endings if necessary.
$ unzip -j -p -qq z.zip a.txt | od -c
0000000 a b c \r \n d e f \r \n 032
0000013
$ unzip -j -p -qq z.zip b.txt | od -c
0000000 g h i \r \n j k l \r \n 032
0000013
$ unzip -j -p -qq z.zip | od -c
0000000 a b c \r \n d e f \r \n 032 g h i \r \n
0000020 j k l \r \n 032
0000026
$ unzip -j -p -qq -a z.zip | od -c
0000000 a b c \n d e f \n g h i \n j k l \n
0000020
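So, assuming your unzip is Info-ZIP's (where -a is standard), the whole check collapses to something like:
unzip -j -p -qq -a '*.zip' | wc -l
with the ^Z markers stripped and the CRLF line endings converted to LF in one step.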


How to make sed take input from pipe, and insert into a file

Is it possible to pipe the output of the previous command to sed, and have sed use it as input (a pattern or string) when editing a file?
I know if you only use sed, you can use something like
sed -i '1 i\anything' file
But can I do something like
head -1 file1 | sed -i '1 i\OutputFromPreviousCmd' file2
This way, I don't need to manually copy the output and change the sed command every time
Update:
Added the files I meant
head -3 file1.txt
Side A,Age(us),mm:ss.ms_us_ns_ps
84 Vendor Specific, 0000000009096, 0349588242
84 Vendor Specific, 0000000011691, 0349591828
head -3 file2.txt
84 Vendor Specific, 0000000000418, 0349575322
83 Vendor Specific, 0000000002099, 0349575343
83 Vendor Specific, 0000000001628, 0349576662
I'd like to grab the first line of file1 and insert it into file2, so the result should be:
head -3 file2.txt
Side A,Age(us),mm:ss.ms_us_ns_ps
84 Vendor Specific, 0000000000418, 0349575322
83 Vendor Specific, 0000000002099, 0349575343
83 Vendor Specific, 0000000001628, 0349576662
head -1 file1 | sed '1s/^/1i /' | sed -i -f- file2
This takes your one line of output, prepends the sed 1i command, then pipes that sed command stream to sed, using -f- to read the sed script from stdin.
For example:
$ echo bob > bob.txt
$ echo alice | sed '1s/^/1i /' | sed -i -f- bob.txt
$ more bob.txt
alice
bob
This looks like a pure pipeline rather than commands ending in > temp ; mv temp file2, but sed is doing exactly that behind the scenes when -i is used.
This might work for you (GNU sed):
head -1 file1 | sed -i '1e cat /dev/stdin' file2
Insert the first line of file1 into the start of file2.
But why not use cat?
cat <(head -1 file1) file2
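Note that cat writes to stdout, so to actually rewrite file2 you would still go through a temporary file; a minimal sketch (file2.tmp is just an illustrative name):
cat <(head -1 file1) file2 > file2.tmp && mv file2.tmp file2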

Find and replace in UNIX

I have the following string in a file called test.txt:
test.log test1.log test2.log
I want to replace it with
test.log -A test1.log -A test2.log
I tried:
sed -i 's/.log/.log -A/g' test.txt
But the output is
test.log -A test1.log -A test2.log -A
I don't want that appended after the last file name. Can someone help me with this?
If the arguments are separated by spaces and the final argument in the line doesn't have a space after it, you could use this:
$ cat ip.txt
test.log test1.log test2.log
$ sed 's/\.log /&-A /g' ip.txt
test.log -A test1.log -A test2.log
Since . is a regex metacharacter, you have to use \. to match it literally.
& in the replacement section represents the entire portion matched by the search pattern.
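A throwaway example makes the & behaviour easy to see:
$ echo 'foo bar' | sed 's/bar/[&]/'
foo [bar]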
You could also use awk here; it is better suited for field processing and has the added advantage of stripping whitespace at the start/end of the line:
$ awk -v OFS=' -A ' '/\.log/{$1=$1} 1' ip.txt
test.log -A test1.log -A test2.log
The default input field separator (FS) is one or more contiguous whitespace characters, so there is no need to set it.
-v OFS=' -A ' sets space, -A, space as the output field separator (OFS).
/\.log/ matches if the line contains .log.
$1=$1 rebuilds the input record, so that the input FS is replaced by OFS.
1 is the idiomatic way to print the input record.
Note that this solution won't change a line that doesn't contain .log.
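If you have perl handy, a lookahead expresses the same idea; only .log followed by a space gets the suffix (a sketch based on the sample line above):
perl -pe 's/\.log(?= )/.log -A/g' ip.txt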

sed does not recognize -r flag on AIX

Thanks in advance for the help.
I have the following command line that works on Linux.
myfile (extract)
active_instance_count=
aq_tm_processes=1
archive_lag_target=0
audit_file_dest=?/rdbms/audit
audit_sys_operations=FALSE
audit_trail=NONE
background_core_dump=partial
background_dump_dest=/home1/oracle/app/oracle/admin/iopecom/bdump
...
cat myfile |sed -r 's/ {1,}//g'|sed -r 's/\t*//g' |grep -v "^#"|sed -s "/^$/d" |sed =|sed 'N;s/\n/\t/'|sed -r "s/#.*//g" | sed "s/\t/;/g"|sed "s/\t/;/g"|sed -e "s,',\o042,g"
The result will be:
1;O7_DICTIONARY_ACCESSIBILITY=TRUE
2;active_instance_count=
3;aq_tm_processes=1
4;archive_lag_target=0
5;audit_file_dest=?/rdbms/audit
6;audit_sys_operations=FALSE
7;audit_trail=NONE
8;background_core_dump=partial
9;background_dump_dest=/home1/oracle/app/oracle/admin/iopecom/bdump
But I can't figure out how to run the same command on an AIX server.
Help is very welcome.
Regards.
Antonio.
Unless you have a compelling reason to use sed, you could use alternate tools:
awk -v OFS=';' '{print NR,$0}' filename
would produce the desired output.
You could also use perl:
perl -ne 'print "$.;$_"' filename
It appears that your sed expression would skip lines beginning with a #. As such, you could say:
perl -ne '$,=";"; !/^#/ && print ++$i,$_' filename
or something like:
grep -v '^#' filename | awk ...
Reformatting your pipeline:
cat myfile |
sed -r 's/ {1,}//g' | # strip all spaces (1)
sed -r 's/\t*//g' | # strip all tabs (2)
grep -v "^#" | # delete all lines beginning `#` (3)
sed -s "/^$/d" | # delete all empty lines (4)
sed = | # interleave with line numbers (5)
sed 'N;s/\n/\t/' | # join line number and line with `\t` (6)
sed -r "s/#.*//g" | # strip all `#` comments (7)
sed "s/\t/;/g" | # replace all tabs with `;` (8)
sed "s/\t/;/g" | # do it again (9)
sed -e "s,',\o042,g" # replace all ' with " (10)
Boiling that down and using cat -n to provide the line numbers up front gives:
cat -n myfile |
sed "$(print 's/\t/;/')
$(print 's/[ \t]*//g')
s/#.*//g
/^$/d
s/'/\"/g"
which behaves identically unless I'm misreading the AIX docs. The $(...) construction is command substitution: it runs that command and substitutes its output. print would be printf on Linux.
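If you'd rather sidestep sed on AIX entirely, the whole pipeline can be approximated in a single awk call. This is a sketch (the numbering of lines that are only trailing comments may differ slightly from the original pipeline): it strips spaces and tabs, drops # comments and blank lines, numbers what's left, and turns single quotes into double quotes:
awk -v OFS=';' '{ gsub(/[ \t]/, ""); sub(/#.*/, ""); gsub("\047", "\042") } length($0) { print ++n, $0 }' myfile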

Recursively remove trailing characters

I just copied a couple of files from Windows to Unix and they all have ^M at the end of every line. I know how to remove them using vi, but I can only do one file at a time. Is there a way I can do it for all the files in the folder? There are about 60 files, and doing this manually for all of them is time consuming!
I'm open to using other tools as well!
PS: The OS is Solaris
Thanks
For posterity, let's post the solution from within VI. You can remove the Ctrl-M at the end of every line like this:
:%s/^V^M$//
Note that this is what you type, where ^V means Ctrl-V and ^M means Ctrl-M. The idea here is that ^V will "escape" the following ^M, so that you can match it in the substitution regex.
And the % expression means "do this on every line".
Note that this may or may not work in vim, depending on your settings.
But your question asks how to do this in vi, in which you can't easily make a change to multiple files. If you're open to using other tools, please indicate so in your question.
You can use sed on a single file or stream:
$ printf 'one\r\ntwo\r\n' > /tmp/test.txt
$ od -c < /tmp/test.txt
0000000 o n e \r \n t w o \r \n
0000012
$ sed -i'' -e 's/^M$//' /tmp/test.txt
$ od -c < /tmp/test.txt
0000000 o n e \n t w o \n
0000010
$
In this case, in /bin/sh in FreeBSD, I escaped the ^M by ... you guessed it ... using ^V.
When using sed's -i option, you can specify multiple files and they will all be modified in place, perhaps eliminating the need to wrap this in a script. If you want to put this into a script anyway, I recommend you try to do so, and then ask for help if it doesn't work. That's the StackOverflow Way. :-)
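For example, to fix every matching file in one shot (using the same ^V-escaped ^M as above):
$ sed -i'' -e 's/^M$//' /tmp/test*.txt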
Or just use Jonathan's for loop example. You don't need temp files.
UPDATE
If your sed does not have a -i option, then you can still do this pretty easily in a for loop:
[ghoti@pc ~]$ od -c /tmp/test1.txt
0000000 o n e \r \n t w o \r \n
0000012
[ghoti@pc ~]$ for f in /tmp/test*.txt; do sed -e 's/^M$//' "$f" > /tmp/temp.$$ && mv -v /tmp/temp.$$ "$f"; done
/tmp/temp.26687 -> /tmp/test1.txt
/tmp/temp.26687 -> /tmp/test2.txt
[ghoti@pc ~]$ od -c /tmp/test1.txt
0000000 o n e \n t w o \n
0000010
If you don't have a dos2unix or dtou command on your machine, you can use tr instead:
for file in "$#" # LIst of files passed as argument to script
do
tr -d '\015' < "$file" > tmp.$$
cp tmp.$$ "$file"
done
rm tmp.$$
You can add trap commands around that to clean up if you interrupt. Using cp instead of mv preserves owner, permissions, symlinks, hard links.
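A minimal sketch of that trap idea, reusing the same loop (adjust the signal list to taste):
trap 'rm -f tmp.$$; exit 1' HUP INT TERM
for file in "$@"
do
    tr -d '\015' < "$file" > tmp.$$
    cp tmp.$$ "$file"
done
rm -f tmp.$$
trap - HUP INT TERM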
Use the command dos2ux:
dos2ux file > file2

Change multiple files

The following command is correctly changing the contents of 2 files.
sed -i 's/abc/xyz/g' xaa1 xab1
But what I need to do is to change several such files dynamically, and I do not know the file names. I want to write a command that reads all the files in the current directory whose names start with xa and has sed change their contents.
I'm surprised nobody has mentioned the -exec argument to find, which is intended for this type of use-case, although it will start a process for each matching file name:
find . -type f -name 'xa*' -exec sed -i 's/asd/dsg/g' {} \;
Alternatively, one could use xargs, which will invoke fewer processes:
find . -type f -name 'xa*' | xargs sed -i 's/asd/dsg/g'
Or more simply use the + exec variant instead of ; in find to allow find to provide more than one file per subprocess call:
find . -type f -name 'xa*' -exec sed -i 's/asd/dsg/g' {} +
Better yet:
for i in xa*; do
    sed -i 's/asd/dfg/g' "$i"
done
because nobody knows how many files there are, and it's easy to hit command-line length limits.
Here's what happens when there are too many files:
# grep -c aaa *
-bash: /bin/grep: Argument list too long
# for i in *; do grep -c aaa $i; done
0
... (output skipped)
#
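If you're curious what that limit actually is on your system, getconf will report it; the value varies by platform, and the output below is just a typical Linux value:
$ getconf ARG_MAX
2097152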
You could use grep and sed together. This allows you to search subdirectories recursively.
Linux: grep -r -l <old> * | xargs sed -i 's/<old>/<new>/g'
OS X: grep -r -l <old> * | xargs sed -i '' 's/<old>/<new>/g'
For grep:
-r recursively searches subdirectories
-l prints file names that contain matches
For sed:
-i extension (Note: An argument needs to be provided on OS X)
Those commands won't work in the default sed that comes with Mac OS X.
From man 1 sed:
-i extension
Edit files in-place, saving backups with the specified
extension. If a zero-length extension is given, no backup
will be saved. It is not recommended to give a zero-length
extension when in-place editing files, as you risk corruption
or partial content in situations where disk space is exhausted, etc.
I tried
sed -i '.bak' 's/old/new/g' logfile*
and
for i in logfile*; do sed -i '.bak' 's/old/new/g' "$i"; done
Both work fine.
@PaulR posted this as a comment, but people should view it as an answer (and this answer works best for my needs):
sed -i 's/abc/xyz/g' xa*
This will work for a moderate number of files, probably on the order of tens, but probably not on the order of millions.
Another more versatile way is to use find:
sed -i 's/asd/dsg/g' $(find . -type f -name 'xa*')
I'm using find for a similar task. It is quite simple: you have to pass it as an argument to sed like this:
sed -i 's/EXPRESSION/REPLACEMENT/g' `find -name "FILE.REGEX"`
This way you don't have to write complex loops, and it is simple to see which files you are going to change: just run the find before you run sed.
You can take 'xxxx' as the text you search for and replace it with 'yyyy':
grep -Rn 'xxxx' /path | awk -F: '{print $1}' | xargs sed -i 's/xxxx/yyyy/'
There are some good answers above. I thought I'd throw in one more that is succinct and parallelizable, using GNU parallel, which I often prefer to xargs:
parallel sed -i 's/abc/xyz/g' {} ::: xa*
Combine this with the -j N option to run N jobs in parallel.
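For example, something like this keeps four sed jobs running at once:
parallel -j 4 sed -i 's/abc/xyz/g' {} ::: xa*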
If you are able to run a script, here is what I did for a similar situation:
Using a dictionary/hash map (an associative array) and variables for the sed command, we can loop through the array to replace several strings. Including a wildcard in name_pattern allows replacing in place in files matching a pattern (something like name_pattern='File*.txt') in a specific directory (source_dir).
All the changes are written to the logfile in destin_dir.
#!/bin/bash
source_dir=source_path
destin_dir=destin_path
logfile='sedOutput.txt'
name_pattern='File.txt'
echo "--Begin $(date)--" | tee -a "$destin_dir/$logfile"
echo "Source_DIR=$source_dir destin_DIR=$destin_dir "
declare -A pairs=(
    ['WHAT1']='FOR1'
    ['OTHER_string_to replace']='string replaced'
)
for i in "${!pairs[@]}"; do
    j=${pairs[$i]}
    echo "[$i]=$j"
    replace_what=$i
    replace_for=$j
    echo " "
    echo "Replace: $replace_what for: $replace_for"
    find "$source_dir" -name "$name_pattern" | xargs sed -i "s/$replace_what/$replace_for/g"
    find "$source_dir" -name "$name_pattern" | xargs -I{} grep -n "$replace_for" {} /dev/null | tee -a "$destin_dir/$logfile"
done
echo " "
echo "----End $(date)---" | tee -a "$destin_dir/$logfile"
First, the pairs array is declared; each pair defines a replacement, so WHAT1 will be replaced with FOR1, and OTHER_string_to replace will be replaced with string replaced, in the file File.txt. In the loop the array is read: the first member of the pair is retrieved as replace_what=$i and the second as replace_for=$j. The find command searches the directory for the filename (which may contain a wildcard) and the sed -i command performs the replacements in the matching file(s). Finally I added a grep, teed to the logfile, to record the changes made in the file(s).
This worked for me in GNU Bash 4.3 and sed 4.2.2, and is based upon VasyaNovikov's answer for Loop over tuples in bash.
The Silver Searcher Solution
I'm adding another option for those people who don't know about the amazing tool called The Silver Searcher (command line tool is ag).
Note: You can use grep and other tools to do the same thing here, but The Silver Searcher is fantastic :)
TLDR
ag -l 'abc' | xargs sed -i 's/abc/xyz/g'
Install The Silver Searcher
sudo apt install silversearcher-ag # Debian / Ubuntu
sudo pacman -S the_silver_searcher # Arch / EndeavourOS
sudo yum install epel-release the_silver_searcher # RHEL / CentOS
Demo Files
Paste the following into your terminal to create some demonstration files:
mkdir /tmp/food
cd /tmp/food
content="Everybody loves to abc this food!"
echo "$content" > ./milk
echo "$content" > ./bread
mkdir ./fastfood
echo "$content" > ./fastfood/pizza
echo "$content" > ./fastfood/burger
mkdir ./fruit
echo "$content" > ./fruit/apple
echo "$content" > ./fruit/apricot
Using 'ag'
The following ag command will recursively find all the files that contain the string 'abc'. It skips the .git directory and honors .gitignore and other ignore files:
$ ag 'abc'
milk
1:Everybody loves to abc this food!
bread
1:Everybody loves to abc this food!
fastfood/burger
1:Everybody loves to abc this food!
fastfood/pizza
1:Everybody loves to abc this food!
fruit/apple
1:Everybody loves to abc this food!
fruit/apricot
1:Everybody loves to abc this food!
To just list the files that contain the string 'abc', use the -l switch:
$ ag -l 'abc'
bread
fastfood/burger
fastfood/pizza
fruit/apricot
milk
fruit/apple
Changing Multiple Files
Finally, using xargs and sed, we can replace the 'abc' string with another string:
ag -l 'abc' | xargs sed -i 's/abc/eat/g'
In the above command, ag lists all the files that contain the string 'abc'. The xargs command takes those file names and passes them as arguments to the sed command.
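One caveat: plain xargs splits on whitespace, so file names containing spaces will break the pipeline. Both tools speak NUL-delimited lists, which avoids the problem (assuming your ag build has the -0/--print0 option; current releases do):
ag -0 -l 'abc' | xargs -0 sed -i 's/abc/eat/g'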