How to determine if files/folders have an ascending numbering pattern? - text-processing

I am trying to determine whether all the files/folders in a directory have an ascending numbering pattern at the same place in their names.
If the numbers were always at a constant position, this would be easy:
ls $HOME/dir
1. Some String
2. Some String- Part 4
3. Some String- Part 5
Here I would simply use something like
ls $HOME/dir | sort -V | grep -Eo '^[0-9]'
The command outputs 1 2 3, so it is easy to conclude that the files/folders have an ascending numbering pattern.
There are two problems here:
The numbers are not necessarily at the start, as they are above.
There can sometimes be unrelated numbers elsewhere in the name.
==========================================
ls $HOME/dir
Lecture 1 - Some String
Lecture 2 - Some String - Part 4
Lecture 3 - Some String - Part 5
Expected Output - 1 2 3
The main thing is that grep should output the numbers only if they appear in ascending order at the very same position in every filename.
==========================================
ls $HOME/dir
1. Some String
Some String - Part 2
Some String - Part 3
For something like this, grep shouldn't output anything at all: even though the names contain ascending numbers, they are not at the same place throughout.
==========================================
PS: The 'Some String' part in all my examples will be different for each file/folder. Only the position of the ascending numbers (if any) being constant is to be considered.
One final example:
ls $HOME/dir
CB) Lecture 1 xyz
CB) Lecture 2 abc-part 8
CB) Lecture 3 pqr-part 9
Expected Output - 1 2 3

Here is one solution using AWK:
printnumbers.awk
BEGIN {
    numberRegex = "[0-9]+([^A-Za-z0-9]|$)"
}
NR == 1 {
    numberPos = match($0, numberRegex)
}
match($0, numberRegex) == numberPos {
    matchedString = substr($0, RSTART, RLENGTH)
    match(matchedString, "[0-9]+")
    result = result substr(matchedString, RSTART, RLENGTH) "\n"
    next
}
{
    result = ""
    exit 1
}
END {
    printf("%s", result)
}
Then run
$ ls $HOME/dir | sort -V | awk -f printnumbers.awk
Edit 2021-05-14
A second approach is to split each line into fields with non-digits as separators. Then each field is either a number or an empty string. For each line we check the fields to see if a sequence of consecutive numbers starting from one is formed.
Here is the logic:
BEGIN {
    FS = "[^0-9]+"
}
{
    for (i = 1; i <= NF; i++) {
        numbersConsecutive[i] = ($i == NR) && ((NR == 1) || numbersConsecutive[i])
    }
    if (NF > numbersConsecutiveLen) {
        numbersConsecutiveLen = NF
    }
}
END {
    consecutiveNumbersFound = 0
    for (i = 1; i <= numbersConsecutiveLen; i++) {
        if (numbersConsecutive[i]) {
            consecutiveNumbersFound = 1
        }
    }
    if (consecutiveNumbersFound) {
        for (i = 1; i <= NR; i++) {
            print i
        }
    }
}
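For readers who want to experiment with the field-splitting idea outside awk, here is a minimal Python sketch of the same logic. It is not the code from the answer: the function names consecutive_column and ascending_numbers are made up for illustration, and it assumes the names are already in the order produced by sort -V.

```python
import re


def consecutive_column(names):
    """Return the index of the digit field that runs 1, 2, 3, ... or None."""
    if not names:
        return None
    # Split on runs of non-digits, like FS = "[^0-9]+" in awk.
    # A name that starts with a non-digit yields an empty first field.
    rows = [re.split(r"[^0-9]+", name) for name in names]
    width = min(len(row) for row in rows)
    for col in range(width):
        if all(row[col] != "" and int(row[col]) == n
               for n, row in enumerate(rows, start=1)):
            return col
    return None


def ascending_numbers(names):
    """Return [1, 2, ..., len(names)] if a consistent numbering exists, else []."""
    if consecutive_column(names) is None:
        return []
    return list(range(1, len(names) + 1))
```

Calling ascending_numbers on the "Lecture 1 ..." listing from the question gives [1, 2, 3], while the mixed listing ("1. Some String", "Some String - Part 2", ...) gives an empty list.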

Related

Using AWK to select records from file1 with obs.# in file 2

I have these two simple test files:
This is fil1, containing the records I want to select from
01
02
07
05
10
20
30
25
This is keepNR, containing the record numbers I want to extract from fil1
1
4
7
What I want are these records from fil1
01 (observation/record # 1)
05 (observation/record # 4)
30 (observation/record # 7)
I am a novice at AWK, but I have tried these programs.
Using this, I can see that the observations are there:
awk 'FNR==NR {a[$1]; next } { for (elem in a) { print "elem=",elem,"FNR=",FNR,"$1=",$1 }} ' keepNR fil1
I had hoped this would work, but I get more than the 3 records I expect:
awk 'FNR==NR {a[$1]; next } { for (elem in a) { if (FNR ~ a[elem]) print elem,FNR,$1; next }} END{ for (elem in a) { print "END:", elem }}' keepNR fil1
1 1 01
1 2 02
1 3 07
1 4 05
1 5 10
1 6 20
1 7 30
1 8 25
I first tried using == instead of ~, but then I get no result at all,
as you can see here:
gg@gg:~/bin/awktest$ awk 'FNR==NR {a[$1]; next } { for (elem in a) { if (FNR == a[elem]) print elem,FNR,$1; next }} ' keepNR fil1
gg@gg:~/bin/awktest$
I have also tried (FNR - a[elem])==0, with no output.
So I have 2 questions:
Why does if (FNR ~ a[elem]) work, but if (FNR == a[elem]) does not?
Why do I get 8 lines of output instead of 3?
If you have a better solution, I would love to see it.
Kind Regards :)
You don't assign to a[$1] so its value is empty: FNR==NR { a[$1]; next }
for (elem in a) sets elem to the keys of a.
if (FNR == a[elem]) compares against the value in a[elem]. The value is empty, so there is no match.
if (FNR ~ a[elem]) tests if FNR matches the empty regex (//), so it always matches.
A simpler method is to test if FNR is a key of a:
awk '
FNR==NR { a[$1]; next }
FNR in a { print "FNR="FNR, "$1="$1 }
' keepNR fil1
which should output:
FNR=1 $1=01
FNR=4 $1=05
FNR=7 $1=30
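The same "is this record number a key?" test can be sketched in Python, which may help if awk arrays are unfamiliar. This is only an illustration; select_records is a hypothetical name, and the inputs are plain lists rather than the keepNR and fil1 files.

```python
def select_records(lines, keep_numbers):
    """Return (record_number, line) pairs for the 1-based record numbers wanted."""
    wanted = set(keep_numbers)  # membership test, like "FNR in a" in awk
    return [(n, line) for n, line in enumerate(lines, start=1) if n in wanted]


fil1 = ["01", "02", "07", "05", "10", "20", "30", "25"]
keep = [1, 4, 7]
for number, record in select_records(fil1, keep):
    print(f"FNR={number} $1={record}")
```

As in the awk version, only membership matters; no value is ever stored against the record number.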

What does the Swift condition if ((status & 0x3F) == 1) { } mean?

if ((status & 0x3F) == 1 ){ }
status is a variable in Swift.
What does this condition mean? What does & do, and what value does (status & 0x3F) return?
& is the bitwise AND operator. It compares the bits of the two operands and sets the corresponding bit to 1 if it is 1 in both operands, or to 0 if either or both are 0.
So this statement:
((status & 0x3F) == 1)
is combining status with 0b111111 (the binary equivalent of 0x3F) and checking if the result is exactly 1. This will only be true if the last 6 bits of status are 0b000001.
In this if:
if( (dtc24_state[2] & 0x8) == 0x8 ) {
    self.haldexABCDTC24State.text = status_str + " - UNKNOWN"
    self.haldexABCDTC24State.textColor = text_color
    active_or_stored_dtc = true
}
dtc24_state is an array of values. The value of dtc24_state[2] is combined with 0x8 (0b1000) and checked against 0x8. This is checking if the 4th bit from the right is set. Nothing else matters. If the 4th bit from the right is set, the if is true and the code block is executed.
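The single-bit test can be tried out quickly in Python, where & behaves the same way on integers. This is just a sketch; bit_is_set is a made-up helper name and the sample values are arbitrary.

```python
def bit_is_set(value, mask):
    """True if every bit that is 1 in mask is also 1 in value."""
    return (value & mask) == mask


# 0x8 is 0b1000, i.e. the 4th bit from the right.
print(bit_is_set(0b1000, 0x8))  # only that bit set
print(bit_is_set(0b1111, 0x8))  # that bit set along with others
print(bit_is_set(0b0111, 0x8))  # every lower bit set, but not that one
```

The first two calls are true and the third is false: the other bits have no influence on the result.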
0x3F is 111111 in binary. So, it means this:
the AND operation is applied to each bit of your number's binary representation.
This cuts off the left part of the number (everything above the lowest six bits), and the result is compared with 1.
e.g.
7777 is 1111001100001; after the AND, this number becomes
100001. So the result is false.
But for 7745 (1111001000001) the AND gives 1, so the result is true.
The rule for the 'and' function: 0 & 0 = 0; 0 & 1 = 0; 1 & 0 = 0; 1 & 1 = 1.
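The two worked numbers can be checked with a few lines of Python, since & on integers follows the same rule. This is only an illustrative sketch; low_six_bits is a hypothetical helper name.

```python
def low_six_bits(status):
    """Keep only the six lowest bits of status (mask 0x3F == 0b111111)."""
    return status & 0x3F


for status in (7777, 7745):
    masked = low_six_bits(status)
    # Show the original value, the masked bits, and the comparison with 1.
    print(status, format(status, "b"), format(masked, "06b"), masked == 1)
```

For 7777 the masked value is 0b100001 (33), so the comparison with 1 fails; for 7745 it is 0b000001, so it succeeds.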

Speed up prime number generation

I have written a program that generates prime numbers. It works well, but I want to speed it up, as it takes quite a while to generate all the prime numbers up to 10000.
var list = [2,3]
var limitation = 10000
var flag = true
var tmp = 0
for (var count = 4 ; count <= limitation ; count += 1 ){
    while(flag && tmp <= list.count - 1){
        if (count % list[tmp] == 0){
            flag = false
        }else if ( count % list[tmp] != 0 && tmp != list.count - 1 ){
            tmp += 1
        }else if ( count % list[tmp] != 0 && tmp == list.count - 1 ){
            list.append(count)
        }
    }
    flag = true
    tmp = 0
}
print(list)
Two simple improvements that will make it fast up through 100,000 and maybe 1,000,000.
All primes except 2 are odd
Start the loop at 5 and increment by 2 each time. This isn't going to speed it up a lot because you are finding the counter example on the first try, but it's still a very typical improvement.
Only search through the square root of the value you are testing
The square root is the point at which you halve the factor space: any factor below the square root is paired with a factor above it, so you only have to check one side. There are far fewer numbers below the square root, so check only the values less than or equal to the square root.
Take 10,000 for example. The square root is 100, so you only have to try the primes up to 100, which is 25 values, instead of the 1,229 primes below 10,000.
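The two improvements (odd candidates only, and stopping at the square root) could look roughly like this. It is a Python sketch rather than Swift, primes_up_to is a made-up name, and math.isqrt assumes Python 3.8 or newer.

```python
import math


def primes_up_to(limit):
    """Trial division by known primes, stopping at the square root."""
    if limit < 2:
        return []
    primes = [2]
    # All primes except 2 are odd, so step through odd candidates only.
    for candidate in range(3, limit + 1, 2):
        root = math.isqrt(candidate)
        is_prime = True
        for p in primes:
            if p > root:
                break  # no divisor found up to the square root
            if candidate % p == 0:
                is_prime = False
                break
        if is_prime:
            primes.append(candidate)
    return primes
```

The early break on p > root is what saves most of the work: for candidates near 10,000 the inner loop stops after at most 25 primes.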
Doing it even faster
Try another method altogether, like a sieve. These methods are much faster but have a higher memory overhead.
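As one example of such a method, here is a minimal Sieve of Eratosthenes in Python. It is a sketch under the assumption that the whole range fits in memory as one flag per number; sieve is just an illustrative name.

```python
import math


def sieve(limit):
    """Return all primes <= limit using the Sieve of Eratosthenes."""
    if limit < 2:
        return []
    is_prime = [True] * (limit + 1)  # one flag per number: the memory cost
    is_prime[0] = is_prime[1] = False
    for n in range(2, math.isqrt(limit) + 1):
        if is_prime[n]:
            # Smaller multiples of n were already crossed out by smaller primes.
            for multiple in range(n * n, limit + 1, n):
                is_prime[multiple] = False
    return [n for n, flag in enumerate(is_prime) if flag]
```

No division is performed at all; the trade-off is the list of limit + 1 flags held in memory.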
In addition to what Nick already explained, you can also easily take advantage of the following property: all primes greater than 3 are congruent to 1 or -1 mod 6.
Because you've already included 2 and 3 in your initial list, you can therefore start with count = 6, test count - 1 and count + 1 and increment by 6 each time.
Below is my first attempt ever at Swift, so pardon the syntax which is probably far from optimal.
var list = [2,3]
var limitation = 10000
var flag = true
var tmp = 0
var max = 0
for(var count = 6 ; count <= limitation ; count += 6) {
    for(var d = -1; d <= 1; d += 2) {
        max = Int(floor(sqrt(Double(count + d))))
        for(flag = true, tmp = 0; flag && list[tmp] <= max; tmp++) {
            if((count + d) % list[tmp] == 0) {
                flag = false
            }
        }
        if(flag) {
            list.append(count + d)
        }
    }
}
print(list)
I've tested the above code on iswift.org/playground with limitation = 10,000, 100,000 and 1,000,000.

If statement inside awk to change a value

I have the following file
...
MODE P E
IMP:P 1 19r 0
IMP:E 1 19r 0
...
SDEF POS= 0 0 14.6 AXS= 0 0 1 EXT=d3 RAD= d4 cell=23 ERG=d1 PAR=2
SI1 L 0.020
SP1 1
SI4 0. 3.401
SI3 0.9
...
NPS 20000000
I want to do the following task
Check if after the sequence ERG= there is a number or a string.
If it's a string, find the sequence SI1 L and change the value after that, using values that the user inputs.
If it's a number, change the number using values that the user inputs.
Note that if after ERG= there is a number, there will be no SI1 L sequence.
For instance number 2 can be accomplished using the following
#! /bin/bash
vals=(0.02 0.03 0.04 0.05)
for val in "${vals[@]}"; do
    awk -vval="$val" '$1=="SI1"{$3=val}1' 20
done
How can the above algorithm be achieved?
#!/bin/bash
val="$@"
awk -v val="$val" '
BEGIN { i=1; split (val,v," ") }
# If it is a string, find the sequence SI1 L and change the value after that, using values that the user inputs
/SDEF POS.*ERG=[a-zA-Z]+/ { flag="y" ; }
/SI1 L/ { if (flag=="y") { $3=v[i]; i++; flag="n"; } }
# If it is a number, change the number using values that the user inputs.
/SDEF POS.*ERG=[0-9]+ / { sub(/ERG=[0-9]*/, "ERG="v[i],$0);i++; }
1
' file
Hints:
If the rule finds ERG= followed by one or more letters ([a-zA-Z]+), it sets the flag.
The /SI1 L/ rule only takes effect if the flag is set. When it does, it unsets the flag again, so that any following /SI1 L/ line is left unchanged.
.* stands for zero or more of any character
[A-Za-z]+ stands for one or more alphabetic characters, in lower or upper case
awk -F '[[:blank:]=]' -v string_value="foo" -v number_value=42 '
    /ERG=/ {
        for (i=1; i<NF; i++)
            if ($i == "ERG") {
                isstring = ($(i+1) ~ /[^[:digit:]]/)
                break
            }
        if (!isstring)
            $(i+1) = number_value
    }
    /SI1 L/ && isstring { $NF = string_value }
    1
' filename
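To make the decision logic easier to follow, here is a rough Python sketch of the same task. It is not a translation of either awk answer: set_energy and is_number are hypothetical names, a "number" is assumed to be anything float() accepts, and, as in the awk versions, the SI1 L line is assumed to come after the SDEF line.

```python
import re

# "ERG=" followed by optional blanks and one non-blank token.
ERG = re.compile(r"(ERG=\s*)(\S+)")


def is_number(token):
    try:
        float(token)
        return True
    except ValueError:
        return False


def set_energy(lines, value):
    """Return a copy of lines with the energy replaced by value."""
    out = []
    erg_is_distribution = False
    for line in lines:
        match = ERG.search(line)
        fields = line.split()
        if match and is_number(match.group(2)):
            # ERG=<number>: replace the number in place, keep the rest as is.
            line = line[:match.start(2)] + str(value) + line[match.end(2):]
        elif match:
            # ERG=<string>, e.g. d1: the value lives on the SI1 L line.
            erg_is_distribution = True
        elif erg_is_distribution and len(fields) >= 3 and fields[:2] == ["SI1", "L"]:
            fields[2] = str(value)
            line = " ".join(fields)
        out.append(line)
    return out
```

To try several values, call set_energy once per value on the original lines and write each result to its own output file.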

Get regions from a file that are part of regions in other file (Without loops)

I have two files:
regions.txt: First column is the chromosome name, second and third are start and end position.
1 100 200
1 400 600
2 600 700
coverage.txt: First column is chromosome name, again second and third are start and end positions, and last column is the score.
1 100 101 5
1 101 102 7
1 103 105 8
2 600 601 10
2 601 602 15
This file is huge: about 15GB, with about 300 million lines.
I basically want to get the mean of all scores in coverage.txt that are in each region in regions.txt.
In other words, start at the first line in regions.txt; if there is a line in coverage.txt which has the same chromosome, with start-coverage >= start-region and end-coverage <= end-region, then save its score to a new array. After searching all of coverage.txt, print the region's chromosome, start, end, and the mean of all scores that have been found.
Expected output:
1 100 200 6.7 which is (5+7+8)/3
1 400 600 0 no match in coverage.txt
2 600 700 12.5 which is (10+15)/2
I built the following MATLAB script, which takes a very long time since I have to loop over coverage.txt many times. I don't know how to write a fast, similar awk script.
My MATLAB script:
fc = fopen('coverage.txt', 'r');
ft = fopen('regions.txt', 'r');
fw = fopen('out.txt', 'w');
while feof(ft) == 0
    linet = fgetl(ft);
    scant = textscan(linet, '%d%d%d');
    tchr = scant{1};
    tx = scant{2};
    ty = scant{3};
    coverages = [];
    frewind(fc);
    while feof(fc) == 0
        linec = fgetl(fc);
        scanc = textscan(linec, '%d%d%d%d');
        cchr = scanc{1};
        cx = scanc{2};
        cy = scanc{3};
        cov = scanc{4};
        if (cchr == tchr) && (cx >= tx) && (cy <= ty)
            coverages = cat(2, coverages, cov);
        end
    end
    covmed = median(coverages);
    fprintf(fw, '%d\t%d\t%d\t%d\n', tchr, tx, ty, covmed);
end
Any suggestions for an alternative using AWK, Perl, etc.? I would also be pleased if someone could teach me how to get rid of all the loops in my MATLAB script.
Thanks
Here is a Perl solution. I use hashes (aka dictionaries) to access the various ranges via the chromosome, thus reducing the number of loop iterations.
This is potentially efficient, as I don't do a full loop over regions.txt on every input line. Efficiency could perhaps be increased further when multithreading is used.
#!/usr/bin/perl
my ($rangefile) = @ARGV;
open my $rFH, '<', $rangefile or die "Can't open $rangefile";
# construct the ranges. The chromosome is used as range key.
my %ranges;
while (<$rFH>) {
    chomp;
    my @field = split /\s+/;
    push @{$ranges{$field[0]}}, [@field[1,2], 0, 0];
}
close $rFH;
# iterate over all the input
while (my $line = <STDIN>) {
    chomp $line;
    my ($chrom, $lower, $upper, $value) = split /\s+/, $line;
    # only loop over ranges with matching chromosome
    foreach my $range (@{$ranges{$chrom}}) {
        if ($$range[0] <= $lower and $upper <= $$range[1]) {
            $$range[2]++;
            $$range[3] += $value;
            last; # break out of foreach early because ranges don't overlap
        }
    }
}
# create the report
foreach my $chrom (sort {$a <=> $b} keys %ranges) {
    foreach my $range (@{$ranges{$chrom}}) {
        my $value = $$range[2] ? $$range[3]/$$range[2] : 0;
        printf "%d %d %d %.1f\n", $chrom, @$range[0,1], $value;
    }
}
Example invocation:
$ perl script.pl regions.txt <coverage.txt >output.txt
Output on the example input:
1 100 200 6.7
1 400 600 0.0
2 600 700 12.5
(because (5+7+8)/3 = 6.66…)
Normally, I would load the files into R and calculate it, but given that one of them is so huge, this would become a problem. Here are some thoughts that might help you solving it.
Consider splitting coverage.txt by chromosomes. This would make the calculations less demanding.
Instead of looping over coverage.txt repeatedly, first read regions.txt fully into memory (I assume it is much smaller). For each region, keep a score and a number.
Process coverage.txt line by line. For each line, you determine the chromosome and the region that this particular stretch belongs to. This will require some footwork, but if regions.txt is not too large, it might be more efficient. Add the score to the score of the region and increment number by one.
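The "regions in memory, coverage streamed once" idea can be sketched as follows. This is a Python illustration rather than R, under several assumptions: mean_scores is a made-up name, the rows are already parsed into tuples of numbers, regions on the same chromosome do not overlap, and a region with no matches reports 0.

```python
from collections import defaultdict


def mean_scores(regions, coverage):
    """regions: (chrom, start, end) tuples; coverage: (chrom, start, end, score) tuples."""
    by_chrom = defaultdict(list)  # chromosome -> regions on it
    totals = {}                   # region -> [sum of scores, count]
    for chrom, start, end in regions:
        key = (chrom, start, end)
        by_chrom[chrom].append(key)
        totals[key] = [0.0, 0]
    # Single pass over the (possibly huge) coverage stream.
    for chrom, start, end, score in coverage:
        for key in by_chrom.get(chrom, ()):
            if key[1] <= start and end <= key[2]:
                totals[key][0] += score
                totals[key][1] += 1
                break  # assumes regions do not overlap
    return [(c, s, e, total / n if n else 0.0)
            for (c, s, e), (total, n) in totals.items()]
```

Because coverage can be any iterable, it can be a generator that parses coverage.txt line by line, so the 15GB file never has to be held in memory.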
An alternative, and the most efficient, way requires both files to be sorted first by chromosome, then by position.
1. Take a line from regions.txt. Record the chromosome and positions. If there is a line remaining from the previous loop, go to 3; otherwise go to 2.
2. Take a line from coverage.txt.
3. Check whether it is within the current region.
   yes: add the score to the region, increment number. Move to 2.
   no: divide score by number, write the current region to output, go to 1.
This last method requires some fine tuning, but will be the most efficient: it goes through each file only once and stores almost nothing in memory.
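One possible shape for that single-pass merge is sketched below in Python. It rests on assumptions not stated in the question: merge_means is a made-up name, both inputs are sorted by chromosome and then by start, chromosome labels compare in that same order, regions do not overlap, and a region without matches reports 0.

```python
def merge_means(regions, coverage):
    """Walk two sorted streams in step and yield (chrom, start, end, mean)."""
    cov_iter = iter(coverage)
    pending = next(cov_iter, None)  # the "line remaining from the previous loop"
    results = []
    for chrom, start, end in regions:
        total, count = 0.0, 0
        while pending is not None:
            c_chrom, c_start, c_end, score = pending
            if (c_chrom, c_start) < (chrom, start):
                # Lies before this region (or straddled the previous one): skip.
                pending = next(cov_iter, None)
            elif c_chrom == chrom and c_end <= end:
                total += score
                count += 1
                pending = next(cov_iter, None)
            else:
                break  # belongs to a later region; keep it pending
        results.append((chrom, start, end, total / count if count else 0.0))
    return results
```

Only one coverage row and one running sum are held at any time, which is what keeps the memory use close to zero.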
Here's one way using join and awk. Run like:
join regions.txt coverage.txt | awk -f script.awk - regions.txt
Contents of script.awk:
FNR==NR && $4>=$2 && $5<=$3 {
    sum[$1 FS $2 FS $3]+=$6
    cnt[$1 FS $2 FS $3]++
    next
}
{
    if ($1 FS $2 FS $3 in sum) {
        printf "%s %.1f\n", $0, sum[$1 FS $2 FS $3]/cnt[$1 FS $2 FS $3]
    }
    else if (NF == 3) {
        print $0 " 0"
    }
}
Results:
1 100 200 6.7
1 400 600 0
2 600 700 12.5
Alternatively, here's the one-liner:
join regions.txt coverage.txt | awk 'FNR==NR && $4>=$2 && $5<=$3 { sum[$1 FS $2 FS $3]+=$6; cnt[$1 FS $2 FS $3]++; next } { if ($1 FS $2 FS $3 in sum) printf "%s %.1f\n", $0, sum[$1 FS $2 FS $3]/cnt[$1 FS $2 FS $3]; else if (NF == 3) print $0 " 0" }' - regions.txt
Here is a simple MATLAB way to bin your coverage into regions:
% extract the regions extents
bins = regions(:,2:3)';
bins = bins(:);
% extract the coverage - only the start is needed
covs = coverage(:,2);
% use histc to place the coverage start into proper regions
% this line counts how many coverages there are in a region
% and assigns them proper region ids.
[h, i]= histc(covs(:), bins(:));
% sum the scores into correct regions (second output of histc gives this)
total = accumarray(i, coverage(:,4), [numel(bins),1]);
% average the score in regions (first output of histc is useful)
avg = total./h;
% remove every second entry - our regions are defined by start/end
avg = avg(1:2:end);
Now this works assuming that the regions are non-overlapping, but I guess that is the case. Also, every entry in the coverage file has to fall into some region.
Also, it is trivial to 'block' this approach over coverages, if you want to avoid reading in the whole file. You only need the bins, your regions file, which presumably is small. You can process the coverages in blocks, incrementally add to total and compute the average in the end.