I am investigating the collision probability of CRC checksums when they are used as hashes. I know how to calculate the collision probability for hash algorithms that are evenly distributed (meaning that for random input data every possible checksum is equally likely).
What I do not know (and couldn't find on the web):
Are CRC checksums generally [not] evenly distributed?
Does the distribution depend on the polynomial?
Does the distribution depend on the input data size?
P.S.: I am aware of the restrictions when using CRCs as hashes, so this is not part of this question.
Aside from malicious intent (you can force any CRC you like by changing bits in the message), CRCs are evenly distributed over all values. The polynomial does not matter, so long as it is a valid CRC polynomial, and the input only needs to be the size of the CRC or larger.
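For reference, once you assume an even distribution, the collision probability for n random, independent inputs follows the usual birthday bound. A minimal sketch of that calculation (the values of n and b below are purely illustrative, not from the question):
# p ~= 1 - exp(-n*(n-1) / 2^(b+1)) for n random inputs and a uniform b-bit hash
$ awk -v n=100000 -v b=32 'BEGIN { printf "p ~= %.4f\n", 1 - exp(-n*(n-1)/2^(b+1)) }'
p ~= 0.6878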
I was also curious about this, so I did some tests using the crc32 command on Linux:
# I am printing each number several times so the data is longer than 32-bits:
$ for N in {000001..999999}; do echo -n $N$N$N$N | crc32 /dev/stdin; done >crcs
# There are no complete (8-character) collisions:
$ cat crcs | sort | uniq -d | wc -l
0
# There are no 7-character collisions:
$ for COL in 1 2; do cat crcs | awk "{print substr(\$1,$COL,7)}" | sort | uniq -d; done | wc -l
0
# There are exactly 32k 6-character collisions:
$ for COL in 1 2 3; do cat crcs | awk "{print substr(\$1,$COL,6)}" | sort | uniq -d; done | wc -l
32768
# Also, the distribution of the letters in each column is *extremely* uniform.
# Each column has results similar to these:
$ cat crcs | awk '{print substr($1,1,1)}' | sort | uniq -c
62440 0
62439 1
62440 2
62440 3
62560 4
62560 5
62560 6
62560 7
62560 8
62560 9
62560 a
62560 b
62440 c
62440 d
62440 e
62440 f
...So my conclusion is that CRC32 does a very good job of evenly distributing the checksums.
So I am a coding newbie and have, for some time, wanted to edit the formatting of my fairly extensive live music library. I have looked around on here and various other resources to get to where I am, but I have hit a snag. I have directories named in the following ways:
02.10.90 | 23 East Caberet - Ardmore, PA
02.16.90 | The Paradise - Boston, MA
and I would like to rename these simply to
1990-02-10 | 23 East Caberet - Ardmore, PA
1990-02-16 | The Paradise - Boston, MA
I have been able to rename the date correctly using:
ls -1 | grep 90 | awk '{print $1}' | awk -F. '{printf "%s-%s-%s\n", "19"$3,$1,$2}' > list1.txt
and then pull the rest of the name using
ls -1 | grep 90 | awk '{first = $1; $1 = ""; print $0}'>list2.txt
So, I have a list of directories ranging from 1990 to 2004 that I would like to apply this to (they are all in different subdirectories, so I don't mind manually changing the "grep 90"). However, from the two separate lists that I generate, I can't figure out how to loop through each row and print "mv original_name list1.txt+list2.txt" so that it would read:
mv 02.10.90 | 23 East Caberet - Ardmore, PA 1990-02-10 | 23 East Caberet - Ardmore, PA
I scanned through many previous posts and couldn't quite figure out the last bit - or better yet, a more elegant solution! Any help is greatly appreciated, thank you in advance!
Don't parse the output of ls (google why). You also don't need grep when you're using awk, nor chains of awk commands, but for this task you wouldn't use any of those commands anyway.
The UNIX command to find files is named find, so start with that. This will find all directories whose names match the given globbing pattern:
find . -type d -name '[0-9][0-9].[0-9][0-9].90 *'
Now that you've found the files you need to do something with them. For your needs IMHO the simplest approach is best and that'd be:
find . -type d -name '[0-9][0-9].[0-9][0-9].90 *' -print0 |
while IFS= read -r -d '' old; do
    path="$(dirname "$old")"
    oldDirName="$(basename "$old")"
    if [[ $oldDirName =~ ([0-9]+)\.([0-9]+)\.([0-9]+)( .*) ]]; then
        newDirName="19${BASH_REMATCH[3]}-${BASH_REMATCH[1]}-${BASH_REMATCH[2]}${BASH_REMATCH[4]}"
        echo mv -- "${path}/${oldDirName}" "${path}/${newDirName}"
    fi
done
The above uses GNU find for -print0 and bash for BASH_REMATCH. Remove the echo once you've debugged it (if necessary) and are happy with what it's going to do.
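Since the directories span 1990 through 2004, the two-digit year also needs a century prefix. A minimal variation on the above, assuming years 90-99 map to 19xx and everything else to 20xx (adjust the cutoff to suit your data):
find . -type d -name '[0-9][0-9].[0-9][0-9].[0-9][0-9] *' -print0 |
while IFS= read -r -d '' old; do
    path="$(dirname "$old")"
    oldDirName="$(basename "$old")"
    if [[ $oldDirName =~ ([0-9]+)\.([0-9]+)\.([0-9]+)( .*) ]]; then
        yy="${BASH_REMATCH[3]}"
        # assumption: 90-99 are the 1990s, everything lower is the 2000s
        if (( 10#$yy >= 90 )); then century=19; else century=20; fi
        newDirName="${century}${yy}-${BASH_REMATCH[1]}-${BASH_REMATCH[2]}${BASH_REMATCH[4]}"
        echo mv -- "${path}/${oldDirName}" "${path}/${newDirName}"
    fi
done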
I've got a busybox system which doesn't have uniq and I'd like to generate a unique list of duplicated lines.
A plain uniq emulated in awk would be:
sort <filename> | awk '!($0 in a){a[$0]; print}'
How can I use awk (or sed for that matter, not perl) to accomplish:
sort <filename> | uniq -d
On a busybox system, you might need to save bytes. ;-)
awk ++a[\$0]==2
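Spelled out with normal quoting (the backslash above is only there so the shell passes $0 to awk unmangled), that is equivalent to:
sort <filename> | awk '++a[$0]==2'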
You could do this (no need to sort first):
awk '{++a[$0]; if(a[$0] == 2) print}'
This might work for you:
# make some test data
seq 25 >/tmp/a
seq 3 3 25 >>/tmp/a
seq 5 5 25 >>/tmp/a
# run old command
sort -n /tmp/a | uniq -d
3
5
6
9
10
12
15
18
20
21
24
25
# run sed command
sort -n /tmp/a |
sed ':a;$bb;N;/^\([^\n]*\)\(\n\1\)*$/ba;:b;/^\([^\n]*\)\(\n\1\)*/{s//\1/;P};D'
3
5
6
9
10
12
15
18
20
21
24
25
Could you please tell me the optimal way to extract the idle value from this line using grep?
CPU states: 0.1% user, 0.1% system, 0.0% nice, 99.8% idle
awk should do the trick:
top -n 1 | grep "idle" | awk '{ print $9 }'
Since the idle-percentage is the ninth value, it's $9.
You can use grep alone:
grep -Po '[0-9.%]+(?= idle)'
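If you want the bare number without the % sign, a small variation (assuming the same top invocation and output format as above) folds the grep into awk:
$ top -n 1 | awk '/idle/ { sub(/%/, "", $9); print $9 }'
99.8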
I have a file like this:
1 2 3
4 5 6
7 6 8
9 6 3
4 4 4
What are some one-liners that can output unique elements of the nth column to another file?
EDIT: Here's a list of solutions people gave. Thanks guys!
cat in.txt | cut -d' ' -f 3 | sort -u
cut -c 1 t.txt | sort -u
awk '{ print $2 }' cols.txt | uniq
perl -anE 'say $F[0] unless $h{$F[0]}++' filename
In Perl before 5.10
perl -lane 'print $F[0] unless $h{$F[0]}++' filename
In Perl after 5.10
perl -anE 'say $F[0] unless $h{$F[0]}++' filename
Replace 0 with the column you want to output.
For j_random_hacker, here is an implementation that will use very little memory (but will be slower and requires more typing):
perl -lane 'BEGIN {dbmopen %h, "/tmp/$$", 0600; unlink "/tmp/$$.db" } print $F[0] unless $h{$F[0]}++' filename
dbmopen creates an interface between a DBM file (which it creates or opens) and the hash named %h. Anything stored in %h will be stored on disk instead of in memory. Deleting the file with unlink ensures that it will not stick around after the program is done, but it has no effect on the current process, since POSIX guarantees that an open file stays accessible until its filehandles are closed.
Corrected: Thank you Mark Rushakoff.
$ cut -c 1 t.txt | sort | uniq
or
$ cut -c 1 t.txt | sort -u
1
4
7
9
Taking the unique values of the third column:
$ cat in.txt | cut -d' ' -f 3 | sort -u
3
4
6
8
cut -d' ' means to separate the input delimited by spaces, and the -f 3 part means take the third field. Finally, sort -u sorts the output, keeping only unique entries.
Say your file is "cols.txt" and you want the unique elements of the second column:
awk '{ print $2 }' cols.txt | uniq
Note that uniq only collapses adjacent duplicates, so sort the column first if equal values aren't already grouped together.
You might find the following article useful for learning more about such utilities:
Simplify data extraction using Linux text utilities
If you're using awk, there's no need for other commands:
awk '!_[$2]++{print $2}' file
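Applied to the sample file from the question (saved as in.txt) and taking the second column:
$ awk '!_[$2]++{print $2}' in.txt
2
5
6
4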
Some commands in Solaris (such as iostat) report disk related information using disk names such as sd0 or sdd2. Is there a consistent way to map these names back to the standard /dev/dsk/c?t?d?s? disk names in Solaris?
Edit: As Amit points out, iostat -n produces device names such as c0t0d0s0 instead of sd0. But how do I find out that sd0 actually is c0t0d0s0? I'm looking for something that produces a list like this:
sd0=/dev/dsk/c0t0d0s0
...
sdd2=/dev/dsk/c1t0d0s4
...
Maybe I could run iostat twice (with and without -n) and then join up the results and hope that the number of lines and device sorting produced by iostat is identical between the two runs?
Following Amit's idea to answer my own question, this is what I have come up with:
iostat -x|tail -n +3|awk '{print $1}'>/tmp/f0.txt.$$
iostat -nx|tail -n +3|awk '{print "/dev/dsk/"$11}'>/tmp/f1.txt.$$
paste -d= /tmp/f[01].txt.$$
rm /tmp/f[01].txt.$$
Running this on a Solaris 10 server gives the following output:
sd0=/dev/dsk/c0t0d0
sd1=/dev/dsk/c0t1d0
sd4=/dev/dsk/c0t4d0
sd6=/dev/dsk/c0t6d0
sd15=/dev/dsk/c1t0d0
sd16=/dev/dsk/c1t1d0
sd21=/dev/dsk/c1t6d0
ssd0=/dev/dsk/c2t1d0
ssd1=/dev/dsk/c3t5d0
ssd3=/dev/dsk/c3t6d0
ssd4=/dev/dsk/c3t22d0
ssd5=/dev/dsk/c3t20d0
ssd7=/dev/dsk/c3t21d0
ssd8=/dev/dsk/c3t2d0
ssd18=/dev/dsk/c3t3d0
ssd19=/dev/dsk/c3t4d0
ssd28=/dev/dsk/c3t0d0
ssd29=/dev/dsk/c3t18d0
ssd30=/dev/dsk/c3t17d0
ssd32=/dev/dsk/c3t16d0
ssd33=/dev/dsk/c3t19d0
ssd34=/dev/dsk/c3t1d0
The solution is not very elegant (it's not a one-liner), but it seems to work.
A one-liner version of the accepted answer (I only have 1 reputation, so I can't post a comment):
paste -d= <(iostat -x | awk '{print $1}') <(iostat -xn | awk '{print $NF}') | tail -n +3
Try using the '-n' switch, e.g. 'iostat -n'.
As pointed out in other answers, you can map the device name back to the instance name via the device path and information contained in /etc/path_to_inst. Here is a Perl script that will accomplish the task:
#!/usr/bin/env perl
use strict;

my @path_to_inst = qx#cat /etc/path_to_inst#;
map { s/"//g } @path_to_inst;

my ($device, $path, @instances);
for my $line (qx#ls -l /dev/dsk/*s2#) {
    ($device, $path) = (split(/\s+/, $line))[-3, -1];
    $path =~ s#.*/devices(.*):c#$1#;
    @instances =
        map { join("", (split /\s+/)[-1, -2]) }
        grep { /$path/ } @path_to_inst;

    for my $instance (@instances) {
        print "$device $instance\n";
    }
}
I found the following in the Solaris Transition Guide:
"Instance Names
Instance names refer to the nth device in the system (for example, sd20).
Instance names are occasionally reported in driver error messages. You can determine the binding of an instance name to a physical name by looking at dmesg(1M) output, as in the following example.
sd9 at esp2: target 1 lun 1
sd9 is /sbus@1,f8000000/esp@0,800000/sd@1,0
<SUN0424 cyl 1151 alt 2 hd 9 sec 80>
Once the instance name has been assigned to a device, it remains bound to that device.
Instance numbers are encoded in a device's minor number. To keep instance numbers consistent across reboots, the system records them in the /etc/path_to_inst file. This file is read only at boot time, and is currently updated by the add_drv(1M) and drvconf"
So based upon that, I wrote the following script:
for device in /dev/dsk/*s2
do
    dpath="$(ls -l $device | nawk '{print $11}')"
    dpath="${dpath#*devices/}"
    dpath="${dpath%:*}"
    iname="$(nawk -v dpath=$dpath '{
        if ($0 ~ dpath) {
            gsub("\"", "", $3)
            print $3 $2
        }
    }' /etc/path_to_inst)"
    echo "$(basename ${device}) = ${iname}"
done
By reading the information directly out of the path_to_inst file, we are allowing for adding and deleting devices, which will skew the instance numbers if you simply count the instances in the /devices directory tree.
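For reference, lines in /etc/path_to_inst have the form "physical device path" instance-number "driver", so for the device from the dmesg example quoted above, the script's print $3 $2 would yield sd9:
"/sbus@1,f8000000/esp@0,800000/sd@1,0" 9 "sd"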
I think the simplest way to find the descriptive name, given an instance name, is:
# iostat -xn sd0
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
4.9 0.2 312.1 1.9 0.0 0.0 3.3 3.5 0 1 c1t1d0
#
The last column shows descriptive name for provided instance name.
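If you want the whole table rather than a single device, one way (just a sketch, reusing the iostat invocations already shown in the accepted answer) is to loop over the instance names and ask iostat for each one:
for inst in $(iostat -x | tail -n +3 | awk '{print $1}'); do
    printf '%s = %s\n' "$inst" "$(iostat -xn "$inst" | awk 'NR > 2 { print $NF }')"
done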
sd0, sdd0 are instance names of devices. You can check /etc/path_to_inst to get the mapping from instance name to physical device name, then check which physical device each link in /dev/dsk points to. It's a 100% sure method, though I don't know how to code it ;)
I found this snippet on the internet some time ago, and it does the trick. This was on Solaris 8:
#!/bin/sh
cd /dev/rdsk
/usr/bin/ls -l *s0 | tee /tmp/d1c |awk '{print "/usr/bin/ls -l "$11}' | \
sh | awk '{print "sd" substr($0,38,4)/8}' >/tmp/d1d
awk '{print substr($9,1,6)}' /tmp/d1c |paste - /tmp/d1d
rm /tmp/d1[cd]
A slight variation to allow for disk names that are longer than 8 characters (encountered when dealing with disk arrays on a SAN)
#!/bin/sh
cd /dev/rdsk
/usr/bin/ls -l *s0 | tee /tmp/d1c | awk '{print "/usr/bin/ls -l "$11}' | \
sh | awk '{print "sd" substr($0,38,4)/8}' >/tmp/d1d
awk '{print substr($9,1,index($9,"s0")-1)}' /tmp/d1c | paste - /tmp/d1d
rm /tmp/d1[cd]