I manage a computer lab with 40 Ubuntu machines, and I have cobbled together this command to find the total disk usage of files larger than 100M in the students' home directories:
for i in `cat ./lab-machines.txt ` ; do ssh $i "nohup find /home -size +100M -print0 | du --files0-from=- -ch | tail -1 && hostname && ls /home" ; done > lab-disk-usage.txt
The file "lab-machines.txt" contains the hostnames of the computers on a separate line each. The command runs from a server that has been configured with password-less logins into the lab machines for the root user. The output in the file lab-disk-usage.txt contains something like this for every machine (I've inserted comments in parenthesis):
69G total
hostname
student-username (changes)
admin-username (always the same)
lost+found (always the same)
I would like the output to look like this for each machine:
69G hostname student-username
I am not familiar enough with text filtering to get this done in time. Can you help?
try this:
awk -vORS=" " 'NR==1{sub("total","")}NR<=3' file
This joins the first three lines with spaces, dropping the word total from the first; since it keys on NR, it assumes one record per file.
Pipe Output Through tr Command
You might try a simpler solution, such as piping your output through the tr command. For example:
tr -s "\n" ' ' < lab-disk-usage.txt
This assumes there's only one record in the file, though. If you plan on having multiple records, you'll want to filter each record through the tr pipeline first before appending it to the output file. For example:
your_pipeline_commands | tr -s "\n" ' ' > lab-disk-usage.txt
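Applied to the original loop, a hedged sketch (same lab-machines.txt and password-less ssh assumed; ssh -n keeps ssh from draining the hostname list, and the echo restores one newline per host):
while read -r host; do
    ssh -n "$host" "nohup find /home -size +100M -print0 | du --files0-from=- -ch | tail -1 && hostname && ls /home" | tr -s '\n' ' '
    echo
done < ./lab-machines.txt > lab-disk-usage.txt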
Use Perl's Flip-Flop Operator
If you have a set of multi-line records, you'll need to be more clever. Perl offers some advantages over AWK for handling multi-line records, including the flip-flop operator. For example:
perl -ne 'if ( /total/../^lost/ ) {
    chomp $_; print $_ . " "
} else {
    print "\n"
};
END { print "\n" };' lab-disk-usage.txt
Depending on your actual corpus, you may need to tweak the regular expression a bit to get things working right, but on my system it does the right thing.
Corpus for Testing Perl
69G total
hostname
student-username
admin-username
lost+found
69G total
hostname
student-username
admin-username
lost+found
Sample Output from Perl
69G total hostname student-username admin-username lost+found
69G total hostname student-username admin-username lost+found
I've slightly modified your example data:
69G total
host1
jane
admin-username
lost+found
65G total
host2
albert
admin-username
lost+found
This can get turned into a table:
[ghoti@pc ~/tmp]$ awk 'NR%5==1{size=$1} NR%5==2{host=$1} NR%5==3{user=$1; printf("%-8s%-16s%s\n", size, host, user)}' lab-disk-usage.txt
69G     host1           jane
65G     host2           albert
The essential thing here is that we're using the modulo operator (NR%5) to figure out where we are in each set of five lines.
If you can't rely on five lines per set, then please clarify how your input data is structured. There are other ways we can detect record boundaries, like looking for /[0-9]+G total$/, if NR%5 can't be used:
[ghoti@pc ~/tmp]$ awk '/G total$/{size=$1; getline host; getline user; printf("%-8s%-16s%s\n", size, host, user)}' lab-disk-usage.txt
69G     host1           jane
65G     host2           albert
This is basically just an awk version of potong's GNU sed suggestion, which could also be made portable (i.e. not just GNU sed) as:
[ghoti@pc ~/tmp]$ sed -ne '/G total/{s/ .*//;N;N;s/\n/ /g;p;}' lab-disk-usage.txt
69G host1 jane
65G host2 albert
This might work for you (GNU sed):
sed -nr '/ total/{N;N;s/( total\s*)?\n/ /gp}' file
If there are no empty lines between the records you could introduce one first:
awk '/total/{print x}1' file | awk '{print $1,$3,$4}' RS= OFS='\t'
With file contents:
69G total
host1
jane
admin-username
lost+found
65G total
host2
albert
admin-username
lost+found
This produces:
69G host1 jane
65G host2 albert
If there already is an empty line between the records you could skip the part before the pipe and use:
awk '{print $1,$3,$4}' RS= OFS='\t' file
I'm trying to extract from a tab delimited file a number that i need to store in a variable. I'm approaching the problem with a regex that thanks to some research online I have been able to built.
The file is composed as follow:
0 0 2500 5000
1 5000 7500 10000
2 10000 12500 15000
3 15000 17500 20000
4 20000 22500 25000
5 25000 27500 30000
I need to extract the number in the second column given a number of the first one. I wrote and tested online the regex:
(?<=5\t).*?(?=\t)
I need the 25000 from the sixth line.
I started working with sed but, as you already know, it doesn't like lookbehind and lookahead patterns, even with the -E option to enable extended regular expressions. I also tried awk and grep and failed for similar reasons.
Going further I found that perl could be the right command but I'm not able to make it work properly. I'm trying with the command
perl -pe '/(?<=5\t).*?(?=\t)/' | INFO.out
but I admit my poor knowledge and I'm a bit lost.
The next step would be to read the "5" in the regex from a variable so if you already know problems that could rise, please let me know.
No need for lookbehinds -- split each line on space and check whether the first field is 5.
In Perl there is a command-line option convenient for this, -a, with which each line gets split for us and we get the @F array with the fields.
perl -lanE'say $F[1] if $F[0] == 5' data.txt
Note that this tests for 5 numerically (==)
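If the key should instead match as a string (so that, say, 05 is not treated as equal to 5), a minimal variant:
perl -lanE'say $F[1] if $F[0] eq "5"' data.txt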
grep supports -P for perl regex, and -o for only-matching, so this works with a lookbehind:
grep -Po '(?<=5\t)\d+' file
That can use a shell variable pretty easily:
VAR=5 && grep -Po "(?<=$VAR\t)\d+"
Or perl -n, to show using s///e to match and print capture group:
perl -lne 's/^5\t(\d+)/print $1/e' file
Why do you need to use a regex? If all you are doing is finding lines starting with a 5 and getting the second column you could use sed and cut, e.g.:
<infile sed -n '/^5\t/p' | cut -f2
Output:
25000
One option is to use sed, match 5 at the start of the string and after the tab capture the digits in a group
sed -En 's/^5\t([[:digit:]]+)\t.*/\1/p' file > INFO.out
The file INFO.out contains:
25000
Using sed
$ var1=$(sed -n 's/^5[^0-9]*\([^ ]*\).*/\1/p' input_file)
$ echo "$var1"
25000
I would like the option of extracting the following string/data:
/work/foo/processed/25
/work/foo/processed/myproxy
/work/foo/processed/sample
=or=
25
myproxy
sample
But it would help if I see both.
From this output using cut or perl or anything else that would work:
Found 3 items
drwxr-xr-x - foo_hd foo_users 0 2011-03-16 18:46 /work/foo/processed/25
drwxr-xr-x - foo_hd foo_users 0 2011-04-05 07:10 /work/foo/processed/myproxy
drwxr-x--- - foo_hd testcont 0 2011-04-08 07:19 /work/foo/processed/sample
Doing a cut -d" " -f6 will get me foo_users, testcont. I tried increasing the field to higher values and I'm just not able to get what I want.
I'm not sure if cut is good for this or something like perl?
The base directories will remain static /work/foo/processed.
Also, I need the first line Found Xn items removed. Thanks.
You can do a substitution from the beginning of the line up to the first occurrence of / (non-greedy):
$ your_command | ruby -ne 'print $_.sub(/.*?\/(.*)/,"/\\1") if /\//'
/work/foo/processed/25
/work/foo/processed/myproxy
/work/foo/processed/sample
Or you can find a unique separator (field delimiter) to split on. For example, the time portion is unique, so you can split on that and take the last element (the 2nd element).
$ ruby -ne 'print $_.split(/\s+\d+:\d+\s+/)[-1] if /\//' file
/work/foo/processed/25
/work/foo/processed/myproxy
/work/foo/processed/sample
With awk,
$ awk -F"[0-9][0-9]:[0-9][0-9]" '/\//{print $NF}' file
/work/foo/processed/25
/work/foo/processed/myproxy
/work/foo/processed/sample
perl -lanF"\s+" -e 'print #F[-1] unless /^Found/' file
Here is an explanation of the command-line switches used:
-l: remove line break from each line of input, then add one back on print
-a: auto-split each line of input into an @F array
-n: loop through each line of input
-F: the regexp pattern to use for the auto-split (with -a)
-e: the perl code to execute (for each line of input if using -n or -p)
If you want to just output the last portion of your directory path, and the basedir is always '/work/foo/processed', I would do this:
perl -nle 'print $1 if m|/work/foo/processed/(\S+)|' file
Try this out:
<Your Command> | grep -P -o '[\/\.\w]+$'
OR if the directory '/work/foo/processed' is always static then:
<Your Command>| grep -P -o '\/work\/foo\/processed\/.+$'
-o : Show only the part of a matching line that matches PATTERN.
-P : Interpret PATTERN as a Perl regular expression.
In this example, the last word in the input will be matched (the word can also contain dots, so file names like 'text_file1.txt' can be matched). Of course, you can change the pattern as per your requirement.
If you know the columns will be the same, and you always list the full path name, you could try something like:
ls -l | cut -c79-
which would cut from the 79th character to the end of each line. That might work in this exact case, but I think it would be better to find the basename of the last field. You could easily do this in awk or perl. Respond if this is not what you want and I'll add the awk and perl versions.
Take the output of your ls command and pipe it to awk:
your command|awk -F'/' '{print $NF}'
your_command | perl -pe 's#.*/##'
I have a file like this:
1 2 3
4 5 6
7 6 8
9 6 3
4 4 4
What are some one-liners that can output unique elements of the nth column to another file?
EDIT: Here's a list of solutions people gave. Thanks guys!
cat in.txt | cut -d' ' -f 3 | sort -u
cut -c 1 t.txt | sort -u
awk '{ print $2 }' cols.txt | uniq
perl -anE 'say $F[0] unless $h{$F[0]}++' filename
In Perl before 5.10
perl -lane 'print $F[0] unless $h{$F[0]}++' filename
In Perl after 5.10
perl -anE 'say $F[0] unless $h{$F[0]}++' filename
Replace 0 with the column you want to output.
For j_random_hacker, here is an implementation that will use very little memory (but will be slower and requires more typing):
perl -lane 'BEGIN {dbmopen %h, "/tmp/$$", 0600; unlink "/tmp/$$.db" } print $F[0] unless $h{$F[0]}++' filename
dbmopen creates an interface between a DBM file (that it creates or opens) and the hash named %h. Anything stored in %h will be stored on disk instead of in memory. Deleting the file with unlink ensures that it will not stick around after the program is done, but has no effect on the current process (under POSIX, an unlinked file remains accessible through any filehandle that is still open on it).
Corrected: Thank you Mark Rushakoff.
$ cut -c 1 t.txt | sort | uniq
or
$ cut -c 1 t.txt | sort -u
1
4
7
9
Taking the unique values of the third column:
$ cat in.txt | cut -d' ' -f 3 | sort -u
3
4
6
8
cut -d' ' means to separate the input delimited by spaces, and the -f 3 part means take the third field. Finally, sort -u sorts the output, keeping only unique entries.
Say your file is "cols.txt" and you want the unique elements of the second column:
awk '{ print $2 }' cols.txt | uniq
You might find the following article useful for learning more about such utilities:
Simplify data extraction using Linux text utilities
If using awk, there's no need for other commands:
awk '!_[$2]++{print $2}' file
The pattern !_[$2]++ is true only the first time a given value appears in column 2, so each unique value prints once, in input order.
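A hedged generalization that takes the column number as an awk variable (a sketch, not from the original answers):
awk -v c=2 '!seen[$c]++ {print $c}' file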
Some commands in Solaris (such as iostat) report disk related information using disk names such as sd0 or sdd2. Is there a consistent way to map these names back to the standard /dev/dsk/c?t?d?s? disk names in Solaris?
Edit: As Amit points out, iostat -n produces device names such as c0t0d0s0 instead of sd0. But how do I find out that sd0 actually is c0t0d0s0? I'm looking for something that produces a list like this:
sd0=/dev/dsk/c0t0d0s0
...
sdd2=/dev/dsk/c1t0d0s4
...
Maybe I could run iostat twice (with and without -n) and then join up the results and hope that the number of lines and device sorting produced by iostat is identical between the two runs?
Following Amit's idea to answer my own question, this is what I have come up with:
iostat -x|tail -n +3|awk '{print $1}'>/tmp/f0.txt.$$
iostat -nx|tail -n +3|awk '{print "/dev/dsk/"$11}'>/tmp/f1.txt.$$
paste -d= /tmp/f[01].txt.$$
rm /tmp/f[01].txt.$$
Running this on a Solaris 10 server gives the following output:
sd0=/dev/dsk/c0t0d0
sd1=/dev/dsk/c0t1d0
sd4=/dev/dsk/c0t4d0
sd6=/dev/dsk/c0t6d0
sd15=/dev/dsk/c1t0d0
sd16=/dev/dsk/c1t1d0
sd21=/dev/dsk/c1t6d0
ssd0=/dev/dsk/c2t1d0
ssd1=/dev/dsk/c3t5d0
ssd3=/dev/dsk/c3t6d0
ssd4=/dev/dsk/c3t22d0
ssd5=/dev/dsk/c3t20d0
ssd7=/dev/dsk/c3t21d0
ssd8=/dev/dsk/c3t2d0
ssd18=/dev/dsk/c3t3d0
ssd19=/dev/dsk/c3t4d0
ssd28=/dev/dsk/c3t0d0
ssd29=/dev/dsk/c3t18d0
ssd30=/dev/dsk/c3t17d0
ssd32=/dev/dsk/c3t16d0
ssd33=/dev/dsk/c3t19d0
ssd34=/dev/dsk/c3t1d0
The solution is not very elegant (it's not a one-liner), but it seems to work.
One-liner version of the accepted answer (I only have 1 reputation so I can't post a comment):
paste -d= <(iostat -x | awk '{print $1}') <(iostat -xn | awk '{print $NF}') | tail -n +3
Try using the '-n' switch, e.g. 'iostat -n'.
As pointed out in other answers, you can map the device name back to the instance name via the device path and information contained in /etc/path_to_inst. Here is a Perl script that will accomplish the task:
#!/usr/bin/env perl
use strict;
my @path_to_inst = qx(cat /etc/path_to_inst);
map { s/"//g } @path_to_inst;

my ($device, $path, @instances);
for my $line (qx(ls -l /dev/dsk/*s2)) {
    ($device, $path) = (split(/\s+/, $line))[-3, -1];
    $path =~ s|.*/devices(.*):c|$1|;
    @instances =
        map { join("", (split /\s+/)[-1, -2]) }
        grep { /$path/ } @path_to_inst;
    for my $instance (@instances) {
        print "$device $instance\n";
    }
}
I found the following in the Solaris Transition Guide:
"Instance Names
Instance names refer to the nth device in the system (for example, sd20).
Instance names are occasionally reported in driver error messages. You can determine the binding of an instance name to a physical name by looking at dmesg(1M) output, as in the following example.
sd9 at esp2: target 1 lun 1
sd9 is /sbus@1,f8000000/esp@0,800000/sd@1,0
<SUN0424 cyl 1151 alt 2 hd 9 sec 80>
Once the instance name has been assigned to a device, it remains bound to that device.
Instance numbers are encoded in a device's minor number. To keep instance numbers consistent across reboots, the system records them in the /etc/path_to_inst file. This file is read only at boot time, and is currently updated by the add_drv(1M) and drvconf"
So based upon that, I wrote the following script:
for device in /dev/dsk/*s2
do
dpath="$(ls -l $device | nawk '{print $11}')"
dpath="${dpath#*devices/}"
dpath="${dpath%:*}"
iname="$(nawk -v dpath=$dpath '{
if ($0 ~ dpath) {
gsub("\"", "", $3)
print $3 $2
}
}' /etc/path_to_inst)"
echo "$(basename ${device}) = ${iname}"
done
By reading the information directly out of the path_to_inst file, we are allowing for adding and deleting devices, which will skew the instance numbers if you simply count the instances in the /devices directory tree.
I think the simplest way to find the descriptive name for a given instance name is:
# iostat -xn sd0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    4.9    0.2  312.1    1.9  0.0  0.0    3.3    3.5   0   1 c1t1d0
#
The last column shows the descriptive name for the provided instance name.
sd0 and sdd0 are instance names of devices. You can check /etc/path_to_inst to get the mapping from instance names to physical device names, then check the links in /dev/dsk to see which physical device each one points to. It is a 100% sure method, though I don't know how to code it ;)
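A minimal sketch of exactly that method, assuming (as the other answers do) that the symlink target is the last field of ls -l output:
for link in /dev/dsk/*s2; do
    phys=$(ls -l "$link" | awk '{print $NF}')   # symlink target under /devices
    phys=${phys#*devices}; phys=${phys%:*}      # keep only the physical path
    # path_to_inst fields (after stripping quotes): physical-path instance driver
    sed 's/"//g' /etc/path_to_inst |
        awk -v p="$phys" -v l="$link" '$1 == p { print $3 $2 "=" l }'
done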
I found this snippet on the internet some time ago, and it does the trick. This was on Solaris 8:
#!/bin/sh
cd /dev/rdsk
/usr/bin/ls -l *s0 | tee /tmp/d1c |awk '{print "/usr/bin/ls -l "$11}' | \
sh | awk '{print "sd" substr($0,38,4)/8}' >/tmp/d1d
awk '{print substr($9,1,6)}' /tmp/d1c | paste - /tmp/d1d
rm /tmp/d1[cd]
A slight variation to allow for disk names that are longer than 8 characters (encountered when dealing with disk arrays on a SAN):
#!/bin/sh
cd /dev/rdsk
/usr/bin/ls -l *s0 | tee /tmp/d1c | awk '{print "/usr/bin/ls -l "$11}' | \
sh | awk '{print "sd" substr($0,38,4)/8}' >/tmp/d1d
awk '{print substr($9,1,index($9,"s0")-1)}' /tmp/d1c | paste - /tmp/d1d
rm /tmp/d1[cd]
I have a series of text files for which I'd like to know the lines in common rather than the lines which are different between them. Command line Unix or Windows is fine.
File foo:
linux-vdso.so.1 => (0x00007fffccffe000)
libvlc.so.2 => /usr/lib/libvlc.so.2 (0x00007f0dc4b0b000)
libvlccore.so.0 => /usr/lib/libvlccore.so.0 (0x00007f0dc483f000)
libc.so.6 => /lib/libc.so.6 (0x00007f0dc44cd000)
File bar:
libkdeui.so.5 => /usr/lib/libkdeui.so.5 (0x00007f716ae22000)
libkio.so.5 => /usr/lib/libkio.so.5 (0x00007f716a96d000)
linux-vdso.so.1 => (0x00007fffccffe000)
So, given these two files above, the output of the desired utility would be akin to file1:line_number, file2:line_number == matching text (just a suggestion; I really don't care what the syntax is):
foo:1, bar:3 == linux-vdso.so.1 => (0x00007fffccffe000)
On *nix, you can use comm. The answer to the question is:
comm -1 -2 file1.sorted file2.sorted
# where file1 and file2 are sorted and piped into *.sorted
Here's the full usage of comm:
comm [-1] [-2] [-3 ] file1 file2
-1 Suppress the output column of lines unique to file1.
-2 Suppress the output column of lines unique to file2.
-3 Suppress the output column of lines duplicated in file1 and file2.
Also note that it is important to sort the files before using comm, as mentioned in the man pages.
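For example, a minimal sketch that produces the *.sorted files referred to above:
sort file1 > file1.sorted
sort file2 > file2.sorted
comm -12 file1.sorted file2.sorted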
I found this answer on a question listed as a duplicate. I find grep to be more administrator-friendly than comm, so if you just want the set of matching lines (useful for comparing CSV files, for instance) simply use
grep -F -x -f file1 file2
Or the simplified fgrep version:
fgrep -xf file1 file2
Plus, you can use file2* to glob and look for lines in common with multiple files, rather than just two.
Some other handy variations include
-n flag to show the line number of each matched line
-c to only count the number of lines that match
-v to display only the lines in file2 that differ (or use diff).
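On the multi-file point above, one hedged way to get the lines common to three files is to chain the intersections (file3 here is just a placeholder):
grep -Fxf file1 file2 | grep -Fxf - file3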
Using comm is faster, but that speed comes at the expense of having to sort your files first. It isn't very useful as a 'reverse diff'.
It was asked here before: Unix command to find lines common in two files
You could also try with Perl (credit goes here):
perl -ne 'print if ($seen{$_} .= @ARGV) =~ /10$/' file1 file2
While file1 is read, @ARGV (in scalar context) is 1 because file2 is still pending; while file2 is read it is 0. A line whose entry in %seen ends in "10" was therefore seen in file1 and again in file2, so it prints.
I just learned the comm command from the answers, but I wanted to add something extra: if the files are not sorted, and you don't want to touch the original files, you can pipe the output of the sort command. This leaves the original files intact. It works in Bash, but I can't say about other shells.
comm -1 -2 <(sort file1) <(sort file2)
This can be extended to compare command output, instead of files:
comm -1 -2 <(ls /dir1 | sort) <(ls /dir2 | sort)
The easiest way to do it is:
awk 'NR==FNR{a[$1]++;next} a[$1]' file1 file2
The files do not need to be sorted. Note that this keys on the first field only; to compare whole lines, use awk 'NR==FNR{a[$0];next} $0 in a' file1 file2.
I think the diff utility itself, using its unified (-U) option, can be used to achieve the desired effect. Because the first column of diff's output marks whether a line is an addition or a deletion, we can look for lines that haven't changed.
diff -U1000 file_1 file_2 | grep '^ '
The number 1000 is chosen arbitrarily, big enough to be larger than any single hunk of diff output.
Here's the full, foolproof set of commands:
f1="file_1"
f2="file_2"
lc1=$(wc -l "$f1" | cut -f1 -d' ')
lc2=$(wc -l "$f2" | cut -f1 -d' ')
lcmax=$(( lc1 > lc2 ? lc1 : lc2 ))
diff -U$lcmax "$f1" "$f2" | grep '^ ' | less
# Alternatively, use this grep to ignore the lines starting
# with +, -, and # signs.
# grep -vE '^[+#-]'
If you want to include the lines that are just moved around, you can sort the input before diffing, like so:
f1="file_1"
f2="file_2"
lc1=$(wc -l "$f1" | cut -f1 -d' ')
lc2=$(wc -l "$f2" | cut -f1 -d' ')
lcmax=$(( lc1 > lc2 ? lc1 : lc2 ))
diff -U$lcmax <(sort "$f1") <(sort "$f2") | grep '^ ' | less
In Windows, you can use a PowerShell script with CompareObject:
compare-object -IncludeEqual -ExcludeDifferent -PassThru (get-content A.txt) (get-content B.txt) > MATCHING.txt | Out-Null #Find Matching Lines
CompareObject:
IncludeEqual without -ExcludeDifferent: Everything
ExcludeDifferent without -IncludeEqual: Nothing
Just for information, I made a little tool for Windows doing the same thing as "grep -F -x -f file1 file2" (as I haven't found anything equivalent to this command on Windows).
Here it is:
http://www.nerdzcore.com/?page=commonlines
Usage is "CommonLines inputFile1 inputFile2 outputFile"
Source code is also available (GPL).