Merge Files using ksh

I have 2 files in a directory (the files given below are only examples):
File1
abcd
efghi
1234
5678
File2
qwert
werty
poqrs
Desired Output
abcd
efghi
1234
5678
qwert
werty
poqrs
Currently I use the following code to merge the records into one file:
for file in *.txt
do
cat "$file"
echo
done > output.txt
This merges the records as expected, but the total size of the merged file does not match the sum of the sizes of the input files.
For example, if File1 is 120 bytes and File2 is 140 bytes, the merged file comes out as 262 bytes rather than 260.
I guess it is because of the echo statement in the code.
Can anyone tell me if there is another way to merge the data as stated above?
Thanks in advance,
Anand

This will cat the file contents directly into "output.txt" via append (">>"). The original code cats to stdout and then runs echo, which writes one extra newline per file; that is exactly the two-byte difference you are seeing.
for file in *.txt ; do
cat "$file" >> output.txt
done
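If byte-for-byte concatenation is all you need, here is a minimal sketch (output.out is an arbitrary name chosen so the *.txt glob cannot match the output file itself):
# cat concatenates its inputs byte for byte, so the output size equals
# the sum of the input sizes; no extra newlines are added.
cat *.txt > output.out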

Related

How to sort files in a directory into different folders based on their diff result using Perl

I have a group of txt files in a directory; each has a unique name, but many of their contents are exactly the same. I need a good way to sort these txt files into separate folders such that all the files in each particular folder have identical contents. The files need a full diff to confirm they really are identical.
For example, if 6 files have the following property (= means the diff result is the same):
a.txt = b.txt = c.txt
d.txt = e.txt != a.txt
f.txt != (a.txt nor d.txt)
Then, I need these files moved to directories like this:
/folder1/ contains (a.txt, b.txt, c.txt)
/folder2/ contains (d.txt, e.txt)
/folder3/ contains (only f.txt)
Thank you very much!
I wouldn't normally answer a question with no effort shown, but we tend to be a little more lenient with scripts than programs, and I was bored and wanted to refresh my awk skills a bit.
Here are two different ways, using awk and Perl command-line scripts. Each should be entered as a single line. Both were tested with a small set of files.
NOTE: These scripts DO NOT perform the actual operations. It is intended that you redirect the output into a file and then, after carefully verifying that it does what you want, execute that file as a script to perform the moves.
Perl:
for i in *.txt; do echo `sha1sum $i`; done | sort | perl -ne
'BEGIN {$a=1}
($h,$f)=split;
if ($h ne $c) { $c=$h; $d="folder".$a++; print "mkdir $d\n"}
print "mv $f $d\n"'
Awk:
for i in *.txt; do echo `sha1sum $i`; done | sort | awk
'BEGIN {a=1}
$1!=c { c=$1; d="folder" a++; print "mkdir ",d}
{print "mv ",$2," ", d}'
They both use the same initial pipeline: run sha1sum on every file in the current directory, sort by hash value and then invoke either Perl or awk.
You should run the pipeline by itself (omit the last | and the entire awk or perl command) to see what the raw output looks like.
The scripts look for a change in hash value and create a new folder each time it changes, then move the file and subsequent files with matching hashes to the new folder.
Given a set of 7 input files, each containing a single digit:
Filename   Contents
--------   --------
a.txt      1
b.txt      2
c.txt      1
d.txt      1
e.txt      5
f.txt      1
g.txt      5
The raw pipeline output is:
$ for i in *.txt; do echo `sha1sum $i`; done | sort
5d9474c0309b7ca09a182d888f73b37a8fe1362c e.txt
5d9474c0309b7ca09a182d888f73b37a8fe1362c g.txt
7448d8798a4380162d4b56f9b452e2f6f9e24e7a b.txt
e5fa44f2b31c1fb553b6021e7360d07d5d91ff5e a.txt
e5fa44f2b31c1fb553b6021e7360d07d5d91ff5e c.txt
e5fa44f2b31c1fb553b6021e7360d07d5d91ff5e d.txt
e5fa44f2b31c1fb553b6021e7360d07d5d91ff5e f.txt
and the final output is
mkdir folder1
mv e.txt folder1
mv g.txt folder1
mkdir folder2
mv b.txt folder2
mkdir folder3
mv a.txt folder3
mv c.txt folder3
mv d.txt folder3
mv f.txt folder3
BTW, this illustrates a rule you are wise to follow whenever you write scripts that do bulk operations: never have the script do the operation directly; have it write a script that contains the bulk operations you want performed. Upgrade to doing the actual operations only when you are positive the generated script has been fully tested and debugged.
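As a rough sketch of that workflow with the awk version above (moves.sh is just a placeholder name):
# Generate the mkdir/mv commands into a review file instead of running them.
for i in *.txt; do echo `sha1sum $i`; done | sort | awk '
BEGIN {a=1}
$1!=c { c=$1; d="folder" a++; print "mkdir ",d}
{print "mv ",$2," ", d}' > moves.sh
# Inspect moves.sh by hand, then execute it:
sh moves.sh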

Printing a line in a certain percentile of a large text file

I am looking for a single-line command to print a line at a certain percentile of a large text file. My preferred solution is something based on sed, wc -l, and/or head/tail, as I already know how to do it with awk and wc -l. To make it clearer: if my file has 1K lines of text, I need to print, for example, the (95% * 1K)th line of that file.
In bash:
head -`echo scale=0\;$(cat file|wc -l)\*95/100 | bc -l` file | tail -n 1
head -`wc -l file | awk '{print(int(0.95*$1))}'` file | tail -n 1
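A variant built from the tools the question prefers, sed and wc -l (95 is the example percentile; the integer arithmetic truncates):
# Compute the target line number with wc -l and shell arithmetic,
# then print only that line with sed.
n=$(( $(wc -l < file) * 95 / 100 ))
sed -n "${n}p" file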

Delete a line starting with particular String in all 3000 files

I want to delete a particular line in each of 3000 text files.
I have tried using Notepad Plus but it creates a blank line for each matching line.
Sample File Content:
SAMPLE TXT FILE
---------------------
phone number
address
IAM A DEFAULT
city
state
pincode
----------------
Here, IAM A DEFAULT is present in all 3000 files, and it appears exactly once in each.
You don't talk about identifying the files to be processed, but let's assume that you want to remove all IAM A DEFAULT lines from all *.txt files in the current directory.
This Perl one-liner will do that for you. It will also save each original file, e.g. abc.txt as abc.txt.bak:
perl -i.bak -lne 'print unless $_ eq "IAM A DEFAULT"' *.txt
I hope that helps
With GNU sed and GNU bash 4:
shopt -s globstar nullglob
sed -i '/^IAM A DEFAULT/d' **/*.txt
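If you want to see which files would be touched before editing in place, a quick check along these lines works (assumes GNU grep for --include):
# List the files that contain the line, without modifying anything.
grep -rl --include='*.txt' '^IAM A DEFAULT' .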

difference between the content of two files

I have two files, one a subset of the other, and I want to obtain a file which has the contents not common to both. For example:
File1
apple
mango
banana
orange
jackfruit
cherry
grapes
eggplant
okra
cabbage
File2
apple
banana
cherry
eggplant
cabbage
The resultant file, the difference of the above two files:
mango
orange
jackfruit
grapes
okra
Any ideas on this are appreciated.
You can sort the files then use comm:
$ comm -23 <(sort file1.txt) <(sort file2.txt)
grapes
jackfruit
mango
okra
orange
You might also want to use comm -3 instead of comm -23:
-1 suppress lines unique to FILE1
-2 suppress lines unique to FILE2
-3 suppress lines that appear in both files
1. Lines that appear only once, in either file:
cat File1 File2 | sort | uniq -u
2. Lines only in the first file:
cat File1 File2 File2 | sort | uniq -u
3. Lines only in the second file:
cat File1 File1 File2 | sort | uniq -u
Use awk; no sorting is necessary (which reduces overhead):
$ awk 'FNR==NR{f[$0];next} !($0 in f)' File2 File1
mango
orange
jackfruit
grapes
okra
1. Lines not common to both files:
diff --changed-group-format="%<%>" --unchanged-group-format="" file1 file2
2. Lines unique to the first file:
diff --changed-group-format="%<" --unchanged-group-format="" file1 file2
3. Lines unique to the second file:
diff --changed-group-format="%>" --unchanged-group-format="" file1 file2
Hope this works for you.
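Since File2 here is a subset of File1, a grep-based sketch gives the same difference (fixed-string, whole-line, inverted match):
# Print the lines of File1 that do not appear in File2.
grep -Fxvf File2 File1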

How can I show lines in common (reverse diff)?

I have a series of text files for which I'd like to know the lines in common rather than the lines which are different between them. Command line Unix or Windows is fine.
File foo:
linux-vdso.so.1 => (0x00007fffccffe000)
libvlc.so.2 => /usr/lib/libvlc.so.2 (0x00007f0dc4b0b000)
libvlccore.so.0 => /usr/lib/libvlccore.so.0 (0x00007f0dc483f000)
libc.so.6 => /lib/libc.so.6 (0x00007f0dc44cd000)
File bar:
libkdeui.so.5 => /usr/lib/libkdeui.so.5 (0x00007f716ae22000)
libkio.so.5 => /usr/lib/libkio.so.5 (0x00007f716a96d000)
linux-vdso.so.1 => (0x00007fffccffe000)
So, given these two files above, the output of the desired utility would be akin to file1:line_number, file2:line_number == matching text (just a suggestion; I really don't care what the syntax is):
foo:1, bar:3 == linux-vdso.so.1 => (0x00007fffccffe000)
On *nix, you can use comm. The answer to the question is:
comm -1 -2 file1.sorted file2.sorted
# where file1 and file2 are sorted and piped into *.sorted
Here's the full usage of comm:
comm [-1] [-2] [-3] file1 file2
-1 Suppress the output column of lines unique to file1.
-2 Suppress the output column of lines unique to file2.
-3 Suppress the output column of lines duplicated in file1 and file2.
Also note that it is important to sort the files before using comm, as mentioned in the man pages.
I found this answer on a question listed as a duplicate. I find grep to be more administrator-friendly than comm, so if you just want the set of matching lines (useful for comparing CSV files, for instance), simply use:
grep -F -x -f file1 file2
Or the simplified fgrep version:
fgrep -xf file1 file2
Plus, you can use file2* to glob and look for lines in common with multiple files, rather than just two.
Some other handy variations include
-n flag to show the line number of each matched line
-c to only count the number of lines that match
-v to display only the lines in file2 that differ (or use diff).
Using comm is faster, but that speed comes at the expense of having to sort your files first. It isn't very useful as a 'reverse diff'.
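As a sketch of the multi-file case (foo and the bar* glob here are just illustrative names):
# Lines from every file matching bar* that also appear verbatim in foo;
# when more than one file is searched, grep prefixes each match with
# the file it came from.
grep -Fxf foo bar*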
It was asked here before: Unix command to find lines common in two files
You could also try it with Perl (credit goes here):
perl -ne 'print if ($seen{$_} .= @ARGV) =~ /10$/' file1 file2
While file1 is being read, @ARGV still holds one remaining filename, so each line records a "1"; lines from file2 record a "0", so /10$/ matches the first time a line already seen in file1 turns up in file2.
I just learned the comm command from the answers, but I wanted to add something extra: if the files are not sorted, and you don't want to touch the original files, you can feed comm the output of the sort command via process substitution. This leaves the original files intact. It works in Bash, but I can't say about other shells.
comm -1 -2 <(sort file1) <(sort file2)
This can be extended to compare command output, instead of files:
comm -1 -2 <(ls /dir1 | sort) <(ls /dir2 | sort)
The easiest way to do it is:
awk 'NR==FNR{a[$0]++;next} a[$0]' file1 file2
The files do not need to be sorted; $0 compares whole lines.
I think the diff utility itself, using its unified (-U) option, can be used to achieve this effect. Because the first column of diff's output marks whether a line is an addition or a deletion, we can look for the lines that haven't changed.
diff -U1000 file_1 file_2 | grep '^ '
The number 1000 is chosen arbitrarily, big enough to be larger than any single hunk of diff output.
Here's the full, foolproof set of commands:
f1="file_1"
f2="file_2"
lc1=$(wc -l "$f1" | cut -f1 -d' ')
lc2=$(wc -l "$f2" | cut -f1 -d' ')
lcmax=$(( lc1 > lc2 ? lc1 : lc2 ))
diff -U$lcmax "$f1" "$f2" | grep '^ ' | less
# Alternatively, use this grep to ignore the lines starting
# with +, -, and # signs.
# grep -vE '^[+#-]'
If you want to include the lines that are just moved around, you can sort the input before diffing, like so:
f1="file_1"
f2="file_2"
lc1=$(wc -l "$f1" | cut -f1 -d' ')
lc2=$(wc -l "$f2" | cut -f1 -d' ')
lcmax=$(( lc1 > lc2 ? lc1 : lc2 ))
diff -U$lcmax <(sort "$f1") <(sort "$f2") | grep '^ ' | less
In Windows, you can use a PowerShell script with Compare-Object:
compare-object -IncludeEqual -ExcludeDifferent -PassThru (get-content A.txt) (get-content B.txt) > MATCHING.txt | Out-Null #Find matching lines
Compare-Object:
-IncludeEqual without -ExcludeDifferent: everything
-ExcludeDifferent without -IncludeEqual: nothing
Just for information, I made a little tool for Windows doing the same thing as "grep -F -x -f file1 file2" (as I haven't found anything equivalent to this command on Windows).
Here it is:
http://www.nerdzcore.com/?page=commonlines
Usage is "CommonLines inputFile1 inputFile2 outputFile"
Source code is also available (GPL).