I am not a programmer, but I would like some help removing duplicate lines from a document, keeping only the first occurrence of each line.
I was trying to do this with some text editors (EditPad Pro), but since my file is more than 1 gigabyte, they always freeze and can't complete the operation.
I know Perl is very good at this, but I don't know how to use it, bearing in mind that the file can be over 1 or 2 GB.
example of input lines:
line 1
line 2
line 3
line 1
line 2
line 4
line 1
example of output lines:
line 1
line 2
line 3
line 4
I'm sorry if this is very basic, but I really don't know how to proceed; most of the time I use built-in functions. I hope I'm not annoying anyone with this question.
If you don't mind the lines not being in the original order, you can use this command:
$ sort -u old_file.txt > new_file.txt
sort will sort your file, and the -u option stands for 'unique', which means it will output only the first of each group of identical lines.
Even with very large files, sort may be your best hope.
Preserving the existing order (first time each line is found):
perl -i -wlne'our %uniq; $uniq{$_}++ or print' file.txt
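If a bare one-liner is hard to follow, the same idea can be written out as a small script. This is only a sketch (the file names below are placeholders, not from the question), and note that it keeps one hash entry per distinct line, so memory use grows with the number of unique lines:

#!/usr/bin/perl
use strict;
use warnings;

# Sketch only: adjust these placeholder file names to your own paths.
my $in_name  = 'old_file.txt';
my $out_name = 'new_file.txt';

open my $in,  '<', $in_name  or die "Cannot open $in_name: $!";
open my $out, '>', $out_name or die "Cannot open $out_name: $!";

my %seen;                                        # remembers every line already written
while (my $line = <$in>) {
    print {$out} $line unless $seen{$line}++;    # write only the first occurrence
}

close $in;
close $out;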
This can also be done efficiently using awk (see http://awk.freeshell.org/AwkTips):
awk '!a[$0]++' file.txt
The array element a[$0] is zero the first time a line is seen, so !a[$0]++ is true (and the line is printed) only on that first occurrence.
I'm running Windows and have the GnuWin32 toolkit, which includes sed. Specifically:
C:\TEMP>sed --version
GNU sed version 4.2.1
I have a text file with two sections: A fixed part I want to preserve, and a part that's appended after running a job.
In the file is a unique string that identifies the start of the part that's added, and I'd like to use Gnu sed to isolate only the part of the file that's before the unique string - i.e., so I can append different data to the fixed part each time the job is run.
I know I could keep the fixed portion in a separate file, but that adds complexity and it would be more elegant if I could just reuse the data at the start of the same file.
A long time ago I knew how to set up sed scripts, and I'm sure this can be done with sed, but I've slept since then. :)
Can you please describe how to use sed to display the lines of text in a file up to and not including a specific string?
Example:
line 1 of fixed portion
line 2 of fixed portion
unique string
line 1 of appended portion
line 2 of appended portion
line 3 of appended portion
What I'd like to see as output is:
line 1 of fixed portion
line 2 of fixed portion
I've gotten as far as:
sed -r -n -e "0,/unique string/p"
but that prints the unique string as well.
Thanks in advance.
-Noel
This should work for you:
sed -n '/unique string/q;p' file
It quits processing at unique string. Other lines get printed.
An alternative might be to use a range address like this:
sed -n '1,/unique string/{/unique string/!p}' file
Note that sed includes the range border. We need to exclude unique string from printing.
Furthermore I'm using the -n option which makes sed suppress the output of input lines by default.
One thing: if unique string can contain characters which are also syntax characters in a regex, like ...
test*
... sed might no longer be the right tool for the job, since it can only match regular expressions, not fixed strings.
In that case awk might be the tool of choice:
awk 'index($0, "unique string"){exit}1' file
index($0, "unique string") returns a non-zero value (the position) if the string has been found. In that case we stop processing further input and don't print that line either.
The trailing 1 always evaluates to true and makes awk print all the lines before that condition applies.
Please bear with me, as I'm new to the forums and tried to do my research before posting this. What I'm trying to do is use sed to look through a file and, for any line that contains the words 'CPU Usage', comment out that line and also the 19 lines immediately after it.
Example file.txt
This is some random text CPU USAGE more random text
Line2
Line3
Line4
Line5
etc.
I want sed to find the string of text CPU Usage and comment out that line and the 19 lines following it:
#This is some random text CPU USAGE more random text
#Line2
#Line3
#Line4
#Line5
#etc.
This is what I've been trying, but obviously it is not working, since I'm posting on here asking for help:
sed '/\/(CPU Usage)s/^/#/+18 > File_name
sed: -e expression #1, char 17: unknown command: `^'
I'd like to be able to use this on multiple files. Any help you can provide is much appreciated!
GNU sed has a non-standard extension (okay, it has many non-standard extensions, but there's one that's relevant here) of permitting /pattern/,+N to mean from the line matching pattern to that line plus N.
I'm not quite sure what you expected your sed command to do with the \/ part of the pattern, and you're missing a single quote in what you show, but this does the trick:
sed '/CPU Usage/,+19 s/^/#/'
If you want to overwrite the original files, add -i.bak (or just -i if you don't mind losing your originals).
If you don't have GNU sed, now might be a good time to install it.
This can easily be done with awk
awk '/CPU Usage/ {f=20} f && f-- {$0="#"$0}1' file
When CPU Usage is found, the flag f is set to 20.
While f is non-zero, it is decremented and a # is added in front of the line; the trailing 1 prints every line.
I think this should work; I can't test it, so if anyone finds something wrong just let me know :)
awk '/CPU Usage/{t=1}t{x++;$0="#"$0}x==20{t=0;x=0}1' file
I have a simple ASCII text file with a string on each line, something like
aa1
aa2
ab1
...
with a total of N lines. I know I can use the split command to split it into files with a fixed number of lines each. How do I instead specify the number of files I want and let split decide how many lines go into each one? For example, if the file had 100 lines, I want to be able to specify
split 3 foo.txt
and it would write out three files, xaa, xab and xac, with 33, 33 and 34 lines respectively. Is this even possible? Or do I need to write a custom Perl script for this?
Try doing this:
split -n 3 file
(With GNU split, -n 3 divides the file by bytes; use -n l/3 if you want the split to fall on line boundaries.)
see
man split | less +/'^\s*-n'
There's no option for that[*]
You could use wc to get the number of lines and divide by 3, so it's a few lines of whatever scripting language you want to use.
([*] Update: on Ubuntu the option does exist, and that's what the question was about; -n does not seem to be available in all, or older, split implementations on Linux.)
If your split implementation doesn't accept the -n parameter, you can use this bash function (the arithmetic rounds the lines-per-file count up, so every line is covered):
function split_n() { split -l $((($1+`wc -l <"$2"`-1)/$1)) "$2" "${3:-$2.}"; }
You can invoke it as
split_n 3 file.txt
or
split_n 3 file.txt prefix
Given your comment that you do not have the -n option in your split, here is a slightly hackier approach you could take.
lines=`wc -l < foo.txt`
lines=$((lines/3+1))
split -l $lines foo.txt
If you do this often you could store it in a script by taking in the number of splits and filename as follows:
splits=$1
filename=$2
lines=`wc -l < $filename`
lines=$((lines/$splits+1))
split -l $lines $filename
I have a text file which looks something like this:
jdkjf
kjsdh
jksfs
lksfj
gkfdj
gdfjg
lkjsd
hsfda
gadfl
dfgad
[very many lines, that is]
but would rather like it to look like
jdkjf kjsdh
jksfs lksfj
gkfdj gdfjg
lkjsd hsfda
gadfl dfgad
[and so on]
so I can print the text file on a smaller number of pages.
Of course, this is not a difficult problem, but I'm wondering if there is some excellent tool out there for solving problems like these.
EDIT: I'm not looking for a way to remove every other newline from a text file, but rather a tool which interprets text as "pictures" and then lays these out on the page nicely (by writing the appropriate whitespace symbols).
You can use this Python code:
tables = input("Enter number of tables ")
matrix = []
file = open("test.txt")
for line in file:
    matrix.append(line.replace("\n", ""))     # collect lines without their newlines
    if len(matrix) == int(tables):            # once we have enough for one output row...
        print(" ".join(matrix))               # ...join them with spaces and print
        matrix = []
if matrix:                                    # print any leftover lines at the end
    print(" ".join(matrix))
file.close()
(Since you don't name your operating system, I'll simply assume Linux, Mac OS X or some other Unix...)
Your example looks like it can also be described by the expression "joining 2 lines together".
This can be achieved in a shell (with the help of xargs and awk) -- but only for an input file that is structured like your example (the result always puts 2 words on a line, irrespective of how many words each one contains):
cat file.txt | xargs -n 2 | awk '{ print $1" "$2 }'
This can also be achieved with awk alone (this time it really joins 2 full lines, irrespective of how many words each one contains):
awk '{printf "%s ", $0; getline; print $0}' file.txt
Or use sed --
sed 'N;s#\n# #' < file.txt
Also, xargs could do it:
xargs -L 2 < file.txt
I'm sure other people could come up with dozens of other, quite different methods and commandline combinations...
Caveats: you'll have to handle files with an odd number of lines explicitly; with some of the commands above, the last input line may not be processed correctly in that case.
I have to use Perl on a Windows environment at work, and I need to be able to find out the number of rows that a large csv file contains (about 1.4Gb).
Any idea how to do this with minimum waste of resources?
Thanks
PS This must be done within the Perl script and we're not allowed to install any new modules onto the system.
Do you mean lines or rows? A cell may contain line breaks which would add lines to the file, but not rows. If you are guaranteed that no cells contain new lines, then just use the technique in the Perl FAQ. Otherwise, you will need a proper CSV parser like Text::xSV.
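For reference, the line-counting technique from the Perl FAQ (perlfaq5) reads the file in fixed-size blocks and counts newline characters, so memory use stays flat even for a 1.4 GB file. A rough sketch, with a placeholder file name; keep in mind it counts physical lines, not CSV rows:

#!/usr/bin/perl
use strict;
use warnings;

my $file  = 'big_file.csv';   # placeholder name
my $count = 0;

open my $fh, '<', $file or die "Cannot open $file: $!";
while (sysread $fh, my $buffer, 65536) {
    $count += ($buffer =~ tr/\n//);   # count newlines in each 64 KB block
}
close $fh;

print "$count\n";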
Yes, don't use Perl.
Instead use the simple utility for counting lines: wc.exe.
It's part of a suite of Windows utilities ported from the Unix originals.
http://unxutils.sourceforge.net/
For example;
PS D:\> wc test.pl
12 26 271 test.pl
PS D:\>
Where 12 == number of lines, 26 == number of words, 271 == number of characters.
If you really have to use perl;
D:\>perl -lne "END{print $.;}" < test.pl
12
perl -lne "END { print $. }" myfile.csv
This only reads one line at a time, so it doesn't waste any memory unless each line is enormously long.
This one-liner handles newlines within the rows:
It treats a line with an odd number of quotes as the start or end of a multi-line field.
It assumes that doubled quotes are the way of escaping quotes within a field.
It uses the awesome flip-flop operator.
perl -ne 'BEGIN{$re=qr/^[^"]*(?:"[^"]*"[^"]*)*?"[^"]*$/;}END{print"Count: $t\n";}$t++ unless /$re/../$re/'
Consider:
wc is not going to work. It's awesome for counting lines, but not CSV rows
You should install--or fight to install--Text::CSV or some similar standard package for proper handling.
This may get you there, nonetheless.
EDIT: It slipped my mind that this was windows:
perl -ne "BEGIN{$re=qr/^[^\"]*(?:\"[^\"]*\"[^\"]*)*?\"[^\"]*$/;}END{print qq/Count: $t\n/;};$t++ unless $pq and $pq = /$re/../$re/;"
The weird thing is that The Broken OS' shell interprets && as the OS conditional exec and I couldn't do anything to change its mind!! If I escaped it, it would just pass it that way to perl.
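If you ever do win the fight to install Text::CSV (or the faster Text::CSV_XS), counting real rows becomes straightforward, because getline() understands quoted fields that contain embedded newlines. A minimal sketch, with a placeholder file name:

#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;   # from CPAN; Text::CSV_XS works the same way

my $file = 'big_file.csv';   # placeholder name
my $csv  = Text::CSV->new({ binary => 1 })
    or die "Cannot construct Text::CSV: " . Text::CSV->error_diag;

open my $fh, '<', $file or die "Cannot open $file: $!";

my $rows = 0;
$rows++ while $csv->getline($fh);   # each call consumes one complete CSV row

close $fh;
print "Count: $rows\n";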
Upvote for edg's answer, another option is to install cygwin to get wc and a bunch of other handy utilities on Windows.
I was being idiotic; the simple way to do it in the script is:
open my $extract, '<', $extractFileName
    or die "Cannot read row count of $extractFileName";
my $rowCount = 0;
while (<$extract>) {
    $rowCount++;
}
close $extract;