Altering multiple text files using grep awk sed perl or something else - matlab

I have multiple text files named split01.txt, split02.txt etc... with the data in the format below: (This is what I have)
I would like to create another file with the data taken from the split01.txt, split02.txt etc... file in the format below: (this is the format I would like to see)
Can this be done in one instance? The reason I ask is that I'm going to be running/calling the command (awk,grep,sed,etc...) from inside of octave/matlab after the initial process has completed creating the audio files.
Here is an example of what I mean by one instance (matlab/octave code):
system(strcat({'split --lines=3600 -d '},dirpathwaveformstmp,fileallplaylistStr,{' '},dirpathwaveformstmp,'allsplit'))
This splits a single file into multiple files with the names allsplit01 allsplit02 etc.. and each file only has a max of 3600 lines.
For those who asked this is creating playlist files for audio files I create with octave/matlab.
Any suggestions?

Here's one way you could do it with awk:
BEGIN {
    print "[playlist]"
    print "NumberOfEntries=" len "\n"
    i = 1
}
{
    # strip the directory part and any semicolons from the line
    gsub(".*/|;", "")
    printf "File%06d=%s\n" , i, $0
    printf "Title%06d=%s\n\n", i, $0
    i++
}
END {
    print "Version 2"
}
Run it like this:
awk -v len=$(wc -l < infile) -f parse.awk infile
Version 2
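For comparison, the same playlist layout can be sketched in Python. This is only a minimal sketch, not part of the answer above: the function name build_playlist is hypothetical, and it assumes each input line holds one file path, possibly containing semicolons that should be dropped, as the gsub call in the awk script suggests.

```python
def build_playlist(lines):
    """Return playlist text in the layout printed by the awk script.

    Assumption: one path per line; ';' characters are dropped and only
    the part after the last '/' is kept, mirroring gsub(".*/|;", "").
    """
    names = []
    for line in lines:
        entry = line.strip().replace(";", "")
        if entry:
            names.append(entry.rsplit("/", 1)[-1])

    out = ["[playlist]", "NumberOfEntries=%d" % len(names), ""]
    for i, name in enumerate(names, start=1):
        out.append("File%06d=%s" % (i, name))
        out.append("Title%06d=%s" % (i, name))
        out.append("")
    out.append("Version 2")
    return "\n".join(out) + "\n"
```

Calling build_playlist(open("split01.txt")) would return the text for one playlist; writing it to a file is left to the caller.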

If you're writing your program in Octave, why don't you do it in Octave as well? The language is not limited to numerical analysis. What you're trying to do can be done quite easily with Octave functions.
filepath = "path for input file"
playlistpath = "path for output file"
## read file and prepare cell array for printing
files = strsplit (fileread (filepath)', "\n");
if (isempty (files{end}))
  files(end) = [];
endif
[~, names, exts] = cellfun (@fileparts, files, "UniformOutput", false);
files = strcat (names, exts);
files(2,:) = files(1,:);
files(4,:) = files(1,:);
files(1,:) = num2cell (1:columns(files))(:);
files(3,:) = num2cell (1:columns(files))(:);
## write playlist
[fid, msg] = fopen (playlistpath, "w");
if (fid < 0)
  error ("Unable to fopen %s for writing: %s", playlistpath, msg);
endif
fprintf (fid, "[playlist]\n");
fprintf (fid, "NumberOfEntries=%i\n", columns (files));
fprintf (fid, "\n");
fprintf (fid, "File%06d=%s\nTitle%06d=%s\n\n", files{:});
fprintf (fid, "Version 2");
if (fclose (fid))
  error ("Unable to fclose file %s with FID %i", playlistpath, fid);
endif


How to get the position of the nth occurrence of a string in a file

I have an XML file that contains all of its data on a single line, with the same string repeated many times.
I want to identify the position of the nth occurrence of a string in that file, so that I can split the single file into multiple files at that position and make it easier to process.
sample data in file:
<id = 1><\id><id = 2><\id><id = 3><\id><id = 4><\id><id = 5><\id><id = 6><\id><id = 7><\id><id = 8><\id><id = 9><\id><id = 10><\id><id = 11><\id>
So I want to split the file based on the id tag. For example, I want to find the position of the 5th occurrence of the id tag and split the file into 3 files in total:
<id = 1><\id><id = 2><\id><id = 3><\id><id = 4><\id>
<id = 5><\id><id = 6><\id><id = 7><\id><id = 8><\id>
<id = 9><\id><id = 10><\id><id = 11><\id>
I tried splitting the one line into multiple lines with a simple sed:
sed 's/></>\n</g' $file > data.txt
Then, with a simple grep, I identified the line number and split the file based on it. This works for smaller files, but some files are 10-20 GB in size, which causes issues.
Is there an easy way to get the position of the nth occurrence of a string in a file, so that I can split a single file into multiple files based on that position?
This might work for you (GNU sed):
sed 's/<\\id>/&\n/4;P;D' file | sed -ne '1~3w file1' -e '2~3w file2' -e '3~3w file3'
In the first sed invocation, split each line into three following the fourth <\id> (BTW should this be </id>?).
Pipe the result to a second sed invocation.
In the second sed invocation, send the first line modulo three to file1, the second line modulo three to file2 and the third line modulo three to file3.
Alternative using split instead of the second invocation of sed:
sed 's/<\\id>/&\n/4;P;D' file | split -a 1 --nume=1 -dn r/3 - file
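The modulo-three distribution that both commands perform can be sketched in Python. This is an illustrative sketch only: round_robin is a hypothetical name, and it works on chunks already held in memory rather than on a pipe.

```python
def round_robin(chunks, k):
    """Distribute chunks over k buckets: chunk i goes to bucket i % k."""
    buckets = [[] for _ in range(k)]
    for i, chunk in enumerate(chunks):
        buckets[i % k].append(chunk)
    return buckets
```

With k = 3, the first, fourth, seventh ... chunks land in the first bucket, which corresponds to file1 in the sed command.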
Here is a gawk script (gawk is the standard awk on most Linux machines) that can do all of the splitting as well:
f : filename prefix
m : number of id elements in file
gawk script:
gawk 'BEGIN{RS="<\\\\id>"}{a=a$0 RT}NR%m==0{print a > f NR;a=""}END{if (a)print a> f (NR-(NR%m-1)) }' f="FileName_" m=5 input.txt
Sample run on given example: f=file_ m=4
gawk 'BEGIN{RS="<\\\\id>"}{a=a$0 RT}NR%m==0{print a > f NR;a=""}END{if (a)print a> f (NR-(NR%m-1)) }' f="file_" m=4 input.xml
<id = 1><\id><id = 2><\id><id = 3><\id><id = 4><\id>
<id = 5><\id><id = 6><\id><id = 7><\id><id = 8><\id>
<id = 9><\id><id = 10><\id><id = 11><\id>
gawk script explanation
BEGIN{RS="<\\\\id>"} # set awk record separator to <\id>
{output = output $0 RT} # accumulate each record in output variable
(NR % chunkSize) == 0 { # when chunkSize records have been read
    # save output variable into a file named filePrefix followed by the record count
    print output > filePrefix NR;
    output = ""; # reset output variable
}
END { # after processing the last record
    if (output) { # there is existing output
        lastChunk = NR-((NR % chunkSize) - 1); # compute the file begin ID
        print output > filePrefix lastChunk; # write output to last file
    }
}
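For files of 10-20 GB, the same chunking idea can be sketched in Python as a generator that reads the file in blocks instead of loading the whole line. This is a sketch under assumptions: split_every_n and bufsize are hypothetical names, the stream is opened in text mode, and the delimiter is the literal <\id> from the question.

```python
def split_every_n(stream, delim, n, bufsize=1 << 16):
    """Yield pieces of stream, each ending right after the n-th delim."""
    pending = ""    # text read so far that has not been emitted yet
    count = 0       # delimiters found in pending
    scan_from = 0   # position in pending where the search resumes
    while True:
        block = stream.read(bufsize)
        if not block:
            break
        pending += block
        while True:
            hit = pending.find(delim, scan_from)
            if hit < 0:
                # A delimiter may be cut in half at the block boundary,
                # so resume just before the possibly partial tail.
                scan_from = max(scan_from, len(pending) - len(delim) + 1)
                break
            scan_from = hit + len(delim)
            count += 1
            if count == n:
                yield pending[:scan_from]
                pending = pending[scan_from:]
                count = 0
                scan_from = 0
    if pending:
        yield pending  # the last, shorter piece
```

Each yielded piece can be written to its own output file; only one piece plus one block is held in memory at a time.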

Compare two CSV files and show only the difference

I have two CSV files:
Time, Object_Name, Carrier_Name, Frequency, Longname
2013-08-05 00:00, Alpha, Aircel, 917.86, Aircel_Bhopal
2013-08-05 00:00, Alpha, Aircel, 915.13, Aircel_Indore
Time, Object_Name, Carrier_Name, Frequency, Longname
2013-08-05 00:00, Alpha, Aircel, 917.86, Aircel_Bhopal
2013-08-05 00:00, Alpha, Aircel, 815.13, Aircel_Indore
These are sample input files; the actual files have many more headers and values, so I cannot hard-code them.
In my expected output I want to keep the first two columns and the last column as they are, since they never change; the comparison should then cover the remaining columns and values.
Expected output:
Time, Object_Name, Frequency, Longname
2013-08-05 00:00, 815.13, Aircel_Indore
How can I do this?
Please look at the links below, there are some examples scripts:
Perl: Compare Two CSV Files and Print out differences
If you are not bound to Perl, here is a solution using AWK:
awk -v FS="," '
function filter_columns()
{
    return sprintf("%s, %s, %s, %s", $1, $2, $(NF-1), $NF);
}
NF !=0 && NR == FNR {
    if (NR == 1) {
        print filter_columns();
    } else {
        memory[line++] = filter_columns();
    }
} NF != 0 && NR != FNR {
    if (FNR == 1) {
        line = 0;
    } else {
        new_line = filter_columns();
        if (new_line != memory[line++]) {
            print new_line;
        }
    }
}' File1.csv File2.csv
This outputs:
Time, Object_Name, Frequency, Longname
2013-08-05 00:00, Alpha, 815.13, Aircel_Indore
Here is the explanation:
# FS = "," makes awk split each line in fields using
# the comma as separator
awk -v FS="," '
# This function selects the columns you want. NF is the
# number of fields. Therefore $NF is the content of
# the last column and $(NF-1) that of the second to last.
function filter_columns()
{
    return sprintf("%s, %s, %s, %s", $1, $2, $(NF-1), $NF);
}
# This block processes just the first file, this is the aim
# of the condition NR == FNR. The condition NF != 0 skips the
# empty lines you have in your file. The block prints the header
# and then save all the other lines in the array memory.
NF !=0 && NR == FNR {
    if (NR == 1) {
        print filter_columns();
    } else {
        memory[line++] = filter_columns();
    }
}
# This block processes just the second file (NR != FNR).
# Since the header has been already printed, it skips the first
# line of the second file (FNR == 1). The block compares each line
# against that one saved in the array memory (the corresponding
# line in the first file). The block prints just the lines
# that do not match.
NF != 0 && NR != FNR {
    if (FNR == 1) {
        line = 0;
    } else {
        new_line = filter_columns();
        if (new_line != memory[line++]) {
            print new_line;
        }
    }
}' File1.csv File2.csv
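The AWK script hard-codes which columns to print. As a hedged alternative, here is a minimal Python sketch that works out the differing columns itself. The name diff_csv and the keep parameter are hypothetical, and it assumes both files have the same header, the same number of rows in the same order, and rows of equal width.

```python
import csv


def diff_csv(old_lines, new_lines, keep=(0, 1, -1)):
    """Return the header and the changed rows, limited to the kept
    columns plus every column that differs in at least one row."""
    def rows(lines):
        # skipinitialspace copes with ", " separators; blank lines are dropped
        return [r for r in csv.reader(lines, skipinitialspace=True) if r]

    old, new = rows(old_lines), rows(new_lines)
    header = new[0]
    width = len(header)
    changed = [(o, n) for o, n in zip(old[1:], new[1:]) if o != n]

    cols = {i % width for i in keep}          # -1 becomes the last column
    for o, n in changed:
        cols.update(i for i in range(width) if o[i] != n[i])
    order = sorted(cols)

    result = [[header[i] for i in order]]
    result.extend([n[i] for i in order] for _, n in changed)
    return result
```

Passing two open file objects (or lists of lines) returns a list of rows that can be written out with csv.writer.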
Answering @IlmariKaronen's questions would clarify the problem much better, but meanwhile I made some assumptions and took a crack at the problem - mainly because I needed an excuse to learn a bit of Text::CSV.
Here's the code:
use strict;
use warnings;
use Text::CSV;
use Array::Compare;
use feature 'say';
open my $in_file, '<', 'infile.csv' or die $!;
open my $exp_file, '<', 'expectedfile.csv' or die $!;
open my $out_diff_file, '>', 'differences.csv' or die $!;
my $text_csv = Text::CSV->new({ allow_whitespace => 1, auto_diag => 1 });
my $line = readline($in_file);
my $exp_line = readline($exp_file);
die 'Different column headers' unless $line eq $exp_line;
# fields() returns the fields of the most recently parsed line
$text_csv->parse($line);
my @headers = $text_csv->fields();
my %all_differing_indices;
# array-of-arrays containing lists of "expected" rows for differing lines
# only columns that differ from the input have values, others are empty
my @all_differing_rows;
my $array_comparer = Array::Compare->new(DefFull => 1);
while (defined($line = readline($in_file))) {
    $exp_line = readline($exp_file);
    if ($line ne $exp_line) {
        $text_csv->parse($line);
        my @in_fields = $text_csv->fields();
        $text_csv->parse($exp_line);
        my @exp_fields = $text_csv->fields();
        my @differing_indices = $array_comparer->compare([@in_fields], [@exp_fields]);
        @all_differing_indices{@differing_indices} = (1) x scalar(@differing_indices);
        my @output_row = ('') x scalar(@exp_fields);
        @output_row[0, 1, @differing_indices, $#exp_fields] = @exp_fields[0, 1, @differing_indices, $#exp_fields];
        $all_differing_rows[$#all_differing_rows + 1] = [@output_row];
    }
}
# sort the differing column indices so the output column order is stable
my @columns_needed = (0, 1, sort { $a <=> $b } keys(%all_differing_indices), $#headers);
$text_csv->combine(@headers[@columns_needed]);
say $out_diff_file $text_csv->string();
for my $row_aref (@all_differing_rows) {
    $text_csv->combine(@{$row_aref}[@columns_needed]);
    say $out_diff_file $text_csv->string();
}
It works for the File1 and File2 given in the question and produces the Expected output (except that the Object_Name 'Alpha' is present in the data line - I'm assuming that's a typo in the question).
"2013-08-05 00:00",Alpha,815.13,Aircel_Indore
I've created a script for it with very powerful linux tools. Link here...
Linux / Unix - Compare Two CSV Files
This project is about comparison of two csv files.
Let's assume that csvFile1.csv has XX columns and csvFile2.csv has YY columns.
The script I wrote compares one (key) column from csvFile1.csv with another (key) column from csvFile2.csv. Each value in the key column of csvFile1.csv is compared with each value in the key column of csvFile2.csv.
If csvFile1.csv has 1,500 rows and csvFile2.csv has 15,000, the total number of combinations (comparisons) will be 22,500,000. So this is a very helpful way to create an Availability Report script, which could for example compare an internal product database with an external (supplier's) product database.
Packages used:
csvcut (cut columns)
csvdiff (compare two csv files)
ssconvert (convert xlsx to csv)
More you can find on my official blog (+example script):

Renaming names in a file using another file without using loops

I have two files:
(one.txt) looks like this:
I have like 10000 more ENST
(two.txt) looks like this:
>ENST001 110
>ENST002 59
and so on for the rest of all ENSTs
I basically would like to replace the ENSTs in the (one.txt) by the combination of the two fields in the (two.txt) so the results will look like this:
I wrote a MATLAB script to do this, but since it loops over all lines in (two.txt) for every ID it takes about 6 hours to finish. I think that with awk, sed, grep, or even perl we could get the result in a few minutes. This is what I did in MATLAB:
frf = fopen('one.txt', 'r');
frp = fopen('two.txt', 'r');
fw = fopen('result.txt', 'w');
while feof(frf) == 0
    line = fgetl(frf);
    first_char = line(1);
    if strcmp(first_char, '>') == 1 % if the line in one.txt starts with > it is the ID
        id_fold = strrep(line, '>', ''); % Remove the > symbol
        frewind(frp) % Rewind two.txt file after each loop
        while feof(frp) == 0
            raw = fgetl(frp);
            scan = textscan(raw, '%s%s');
            id_pos = scan{1}{1};
            pos = scan{2}{1};
            if strcmp(id_fold, id_pos) == 1 % if both ids are the same
                id_new = ['>', id_fold, '_', pos];
                fprintf(fw, '%s\n', id_new);
            end
        end
    else
        fprintf(fw, '%s\n', line); % if the line doesn't start with > print it to results
    end
end
One way using awk. The condition FNR == NR processes the first file in the arguments and saves each number. The second condition processes the second file: when the first field matches a key in the array, it modifies that line by appending the number.
awk '
    FNR == NR {
        data[ $1 ] = $2;
        next    # do not print the lines of two.txt
    }
    FNR < NR && data[ $1 ] {
        $0 = $1 "_" data[ $1 ]
    }
    { print }
' two.txt one.txt
With sed, you can first run a command on two.txt that generates the sed substitution commands you need, and then run those commands on one.txt:
First way
sed "$(sed -n '/>ENST/{s=.*\(ENST[0-9]\+\)\s\+\([0-9]\+\).*=s/\1/\1_\2/;=;p}' two.txt)" one.txt
Second way
If the files are huge, you'll get a "too many arguments" error with the previous way. There is another way that avoids this error. You need to execute all three commands one by one:
sed -n '1i#!/bin/sed -f
/>ENST/{s=.*\(ENST[0-9]\+\)\s\+\([0-9]\+\).*=s/\1/\1_\2/;=;p}' two.txt > script.sed
chmod +x script.sed
./script.sed one.txt
The first command builds the sed script that modifies one.txt as you want. chmod makes this new script executable, and the last command runs it. So each file is read only once. There are no loops.
Note that the first command consists of two lines but is still one command. If you delete the newline character, the script will break. This is because of the i command in sed. You can look up the details in the sed man page.
This Perl solution sends the modified one.txt file to STDOUT.
use strict;
use warnings;
open my $f2, '<', 'two.txt' or die $!;
my %ids;
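while (<$f2>) {
  $ids{$1} = "$1_$2" if /^>(\S+)\s+(\d+)/;
}
open my $f1, '<', 'one.txt' or die $!;
while (<$f1>) {
  # rewrite a header line only when its ID was seen in two.txt
  s/^>(\S+)/exists $ids{$1} ? ">$ids{$1}" : ">$1"/e;
  print;
}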
Turn the problem on its head. In perl I would do something like this:
open(FH1, "one.txt");
open(FH2, "two.txt");
open(RESULT, ">result.txt");
my %data;
while (my $line = <FH2>)
{
    # Delete leading angle bracket
    $line =~ s/^>//;
    # split enst and pos
    my ($enst, $pos) = split(/\s+/, $line);
    # Store POS with ENST as key
    $data{$enst} = $pos;
}
while (my $line = <FH1>)
{
    # Check line for ENST
    if ($line =~ m/^>(ENST\d+)/)
    {
        my $enst = $1;
        # Get pos for ENST
        my $pos = $data{$enst};
        # make new line (double quotes so that \n is a newline)
        $line = '>' . $enst . '_' . $pos . "\n";
    }
    print RESULT $line;
}
This might work for you (GNU sed):
sed -n '/^$/!s|^\(\S*\)\s*\(\S*\).*|s/^\1.*/\1_\2/|p' two.txt | sed -f - one.txt
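All of the answers above rely on the same idea: read two.txt once into a lookup table, then make a single pass over one.txt. A minimal Python sketch of that idea follows. The name rename_headers is hypothetical, and it assumes header lines in one.txt consist of > followed by the ID only.

```python
def rename_headers(fasta_lines, table_lines):
    """Yield the lines of one.txt with '>ID' replaced by '>ID_number'."""
    # Build the lookup table from two.txt: ID -> number
    suffix = {}
    for row in table_lines:
        fields = row.split()
        if len(fields) >= 2:
            suffix[fields[0].lstrip(">")] = fields[1]

    # Single pass over one.txt; each header costs one dictionary lookup
    for line in fasta_lines:
        line = line.rstrip("\n")
        key = line[1:].strip()
        if line.startswith(">") and key in suffix:
            line = ">%s_%s" % (key, suffix[key])
        yield line
```

Headers without an entry in two.txt are passed through unchanged, which is a design choice of this sketch rather than something the question specifies.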
Try this MATLAB solution (no loops):
%# read files as cell array of lines
fid = fopen('one.txt','rt');
C = textscan(fid, '%s', 'Delimiter','\n');
fclose(fid);
C1 = C{1};
fid = fopen('two.txt','rt');
C = textscan(fid, '%s', 'Delimiter','\n');
fclose(fid);
C2 = C{1};
%# use regexp to extract ENST numbers from both files
num = regexp(C1, '>ENST(\d+)', 'tokens', 'once');
idx1 = find(~cellfun(@isempty, num)); %# location of >ENST line
val1 = str2double([num{:}]); %# ENST numbers
num = regexp(C2, '>ENST(\d+)', 'tokens', 'once');
idx2 = find(~cellfun(@isempty, num));
val2 = str2double([num{:}]);
%# construct new header lines from file2
C2(idx2) = regexprep(C2(idx2), ' +','_');
%# replace headers lines in file1 with the new headers
[tf,loc] = ismember(val2,val1);
C1( idx1(loc(tf)) ) = C2( idx2(tf) );
%# write result
fid = fopen('three.txt','wt');
fprintf(fid, '%s\n',C1{:});
fclose(fid);

Splitting a concatenated file based on header text

I have a few very large files which are basically a concatenation of several small files and I need to split them into their constituent files. I also need to name the files the same as the original files.
For example the files QMAX123 and QMAX124 have been concatenated to:
;QMAX123 - Student
... file content ...
;QMAX124 - Course
... file content ...
I need to recreate the file QMAX123 as
;QMAX123 - Student
... file content ...
And QMAX124 as
;QMAX124 - Course
... file content ...
The original file's header ;QMAX<some number> is unique and only appears as a header in the file.
I used the script below to split the content of the files, but I haven't been able to adapt it to get the file names right.
awk '/^;QMAX/{close("file"f);f++}{print $0 > "file"f}' <filename>
So I can either adapt that script to name the file correctly or I can rename the split files created using the script above based on the content of the file, whichever is easier.
I'm currently using cygwin bash (which has perl and awk) if that has any bearing on your answer.
The following Perl should do the trick
use warnings ;
use strict ;
my $F ; #will hold a filehandle
while (<>) {
  if ( / ^ ; (\S+) /x) {
    my $filename = $1 ;
    open $F, '>' , $filename or die "can't open $filename " ;
  }
  next unless defined $F ;
  # the header line itself is written too, as the question asks
  print $F $_ or warn "can't write" ;
}
Note that it discards any input before the first line with a filename (that is what the line next unless defined $F ; does). You may want to generate an error or add a default file instead. Let me know and I can change it.
With Awk, it's as simple as
awk '/^;QMAX/ {filename = substr($1,2)} {print >> filename}' input_file
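The same header-driven split can be sketched in Python. This is a minimal sketch under assumptions: split_concatenated and out_dir are hypothetical names, headers are taken to match ;QMAX followed by digits, and any lines before the first header are skipped.

```python
import os
import re

# Assumed header shape: ';QMAX' followed by digits at the start of a line
HEADER = re.compile(r"^;(QMAX\d+)")


def split_concatenated(lines, out_dir):
    """Write each ';QMAX<number>' section to out_dir/QMAX<number>.

    Returns the list of file names created, in order.
    """
    out = None
    written = []
    try:
        for line in lines:
            match = HEADER.match(line)
            if match:
                if out is not None:
                    out.close()          # finish the previous section
                name = match.group(1)
                out = open(os.path.join(out_dir, name), "w")
                written.append(name)
            if out is not None:
                out.write(line)          # header line is kept in the file
    finally:
        if out is not None:
            out.close()
    return written
```

Only one output file is open at a time, so the number of sections is not limited by the number of open file handles.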

string.find using directory path in Lua

I need to translate this piece of code from Perl to Lua
open(FILE, '/proc/meminfo');
while (<FILE>) {
    if (m/MemTotal/) {
        $mem = $_;
        $mem =~ s/.*:(.*)/$1/;
    } elsif (m/MemFree/) {
        $memfree = $_;
        $memfree =~ s/.*:(.*)/$1/;
    }
}
close(FILE);
So far I've written this
while assert("/proc/meminfo", "r")) do
Currentline = string.find(/proc/meminfo, "m/MemTotal")
if Currentline = m/MemTotal then
Mem = Currentline
Mem = string.gsub(Mem, ".*", "(.*)", 1)
elseif m/MemFree then
Memfree = Currentline
Memfree = string.gsub(Memfree, ".*", "(.*)", 1)
Now, when I try to compile, I get the following error about the second line of my code
luac: Perl to Lua:122: unexpected symbol near '/'
obviously the syntax of using a directory path in string.find is not like how I've written it. 'But how is it?' is my question.
You don't have to stick to Perl's control flow. Lua has a very nice "gmatch" function which allows you to iterate over all possible matches in a string. Here's a function which parses /proc/meminfo and returns it as a table:
function get_meminfo(fn)
    local r={}
    local f=assert(io.open(fn,"r"))
    -- read the whole file into s
    local s=f:read("*a")
    f:close()
    -- now enumerate all occurrences of "SomeName: SomeValue"
    -- and assign the text of SomeName and SomeValue to k and v
    for k,v in string.gmatch(s,"(%w+): *(%d+)") do
        -- Save into table:
        r[k]=v
    end
    return r
end
-- use it
m=get_meminfo("/proc/meminfo")
print(m.MemTotal, m.MemFree)
To iterate a file line by line you can use io.lines.
for line in io.lines("/proc/meminfo") do
    if line:find("MemTotal") then --// Syntactic sugar for string.find(line, "MemTotal")
        --// If logic here...
    elseif line:find("MemFree") then --// I don't quite understand this part in your code; presumably a MemFree check.
        --// Elseif logic here...
    end
end
No need to close the file afterwards.
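For readers who want to check the parsing logic outside Lua, the "Name: Value" extraction can be sketched in Python. This is only an illustration: parse_meminfo is a hypothetical name, and it takes the file contents as a string so it can be tried on sample text.

```python
import re


def parse_meminfo(text):
    """Map each 'Name:   12345 kB' line to an integer value by name."""
    pairs = re.findall(r"^(\w+):\s*(\d+)", text, flags=re.MULTILINE)
    return {name: int(value) for name, value in pairs}
```

On Linux, parse_meminfo(open("/proc/meminfo").read()) would return a dictionary with keys such as MemTotal and MemFree.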