Why does Perl's Inline::C sort 4.0e-5 after 4.4e-5? - perl

I built a Perl Inline::C module, but the sorting behaves oddly. Does anyone know why it would sort like this? Why isn't 4.0e-5 first?
my $ref = [ 5.0e-5,4.2e-5,4.3e-5,4.4e-5,4.4e-5,4.2e-5,4.2e-5,4.0e-5];
use Inline C => <<'END_OF_C_CODE';
void test(SV* sv, ...) {
I32 i;
I32 arrayLen;
AV* data;
float retval;
SV** pvalue;
Inline_Stack_Vars;
data = (AV*)SvRV(Inline_Stack_Item(0)); /* the argument is an array reference */
/* Determine the length of the array */
arrayLen = av_len(data);
// sort
sortsv(AvARRAY(data),arrayLen+1,Perl_sv_cmp_locale);
for (i = 0; i < arrayLen+1; i++) {
pvalue = av_fetch(data,i,0); /* fetch the scalar located at i .*/
retval = SvNV(*pvalue); /* dereference the scalar into a number. */
printf("%f \n", retval); /* print the number itself, not a new SV */
}
}
END_OF_C_CODE
test($ref);
0.000042
0.000042
0.000042
0.000043
0.000044
0.000044
0.000040
0.000050

Because you are sorting lexically. Try this code:
#!/usr/bin/perl
use strict;
use warnings;
my $ref = [ 5.0e-5,4.2e-5,4.3e-5,4.4e-5,4.4e-5,4.2e-5,4.2e-5,4.0e-5];
print "Perl with cmp\n";
for my $val (sort @$ref) {
printf "%f \n", $val;
}
print "Perl with <=>\n";
for my $val (sort { $a <=> $b } @$ref) {
printf "%f \n", $val;
}
print "C\n";
test($ref);
use Inline C => <<'END_OF_C_CODE';
void test(SV* sv, ...) {
I32 i;
I32 arrayLen;
AV* data;
float retval;
SV** pvalue;
Inline_Stack_Vars;
data = (AV*)SvRV(Inline_Stack_Item(0)); /* the argument is an array reference */
/* Determine the length of the array */
arrayLen = av_len(data);
// sort
sortsv(AvARRAY(data),av_len(data)+1,Perl_sv_cmp_locale);
arrayLen = av_len(data);
for (i = 0; i < arrayLen+1; i++) {
pvalue = av_fetch(data,i,0); /* fetch the scalar located at i .*/
retval = SvNV(*pvalue); /* dereference the scalar into a number. */
printf("%f \n", retval); /* print the number itself, not a new SV */
}
}
END_OF_C_CODE
Of course, lexically 0.00040 is smaller than 0.00042 as well, but you aren't comparing 0.00040 to 0.00042; you are comparing the number 0.00040 converted to a string with the number 0.00042 converted to a string. When a number gets too large or small, Perl's stringifying logic resorts to using scientific notation. So you are sorting the set of strings
"4.2e-05", "4.2e-05", "4.2e-05", "4.3e-05", "4.4e-05", "4.4e-05", "4e-05", "5e-05"
which are properly sorted. Perl happily turns those strings back into their numbers when you ask it to with the %f format in printf. You could stringify the numbers yourself, but since you have stated you want this to be faster, that would be a mistake. You should not be trying to optimize the program before you know where it is slow (premature optimization is the root of all evil*). Write your code, then run Devel::NYTProf against it to find where it is slow. If necessary, rewrite those portions in XS or Inline::C (I prefer XS). You will find that you get more speed out of choosing the right data structure than out of micro-optimizations like this.
* Knuth, Donald. Structured Programming with go to Statements, ACM Journal Computing Surveys, Vol 6, No. 4, Dec. 1974. p.268.
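To see the effect outside Perl, here is a minimal C sketch. It assumes that formatting with "%.15g" approximates Perl's default number stringification and that the locale collation behaves like a plain byte comparison; the function names are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Compare two doubles the way a string sort would: format each with
   "%.15g" (an assumption that approximates Perl's default number
   stringification) and compare the resulting text byte by byte. */
static int cmp_as_string(double a, double b) {
    char sa[32], sb[32];
    int na = snprintf(sa, sizeof sa, "%.15g", a);
    int nb = snprintf(sb, sizeof sb, "%.15g", b);
    assert(na > 0 && na < (int)sizeof sa);
    assert(nb > 0 && nb < (int)sizeof sb);
    return strcmp(sa, sb);
}

/* Compare two doubles by value, like Perl's <=> operator. */
static int cmp_as_number(double a, double b) {
    return (a > b) - (a < b);
}
```

With these, cmp_as_string(4.0e-5, 4.4e-5) is positive because "4e-05" and "4.4e-05" first differ at 'e' versus '.', and 'e' has the larger character code, while cmp_as_number(4.0e-5, 4.4e-5) is negative.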

Perl_sv_cmp_locale is your sorting function, which I suspect does a lexical comparison. Look for a numeric comparison function or write your own.

I have an answer, with help from the people over at http://www.perlmonks.org/?node_id=761015
I ran some profiling (DProf), and it's a 4x improvement in speed:
Total Elapsed Time = 0.543205 Seconds
User+System Time = 0.585454 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
100. 0.590 0.490 100000 0.0000 0.0000 test_inline_c_pkg::percent2
Total Elapsed Time = 2.151647 Seconds
User+System Time = 1.991647 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
104. 2.080 1.930 100000 0.0000 0.0000 main::percent2
Here is the code
use Inline C => <<'END_OF_C_CODE';
#define SvSIOK(sv) ((SvFLAGS(sv) & (SVf_IOK|SVf_IVisUV)) == SVf_IOK)
#define SvNSIV(sv) (SvNOK(sv) ? SvNVX(sv) : (SvSIOK(sv) ? SvIVX(sv) : sv_2nv(sv)))
static I32 S_sv_ncmp(pTHX_ SV *a, SV *b) {
const NV nv1 = SvNSIV(a);
const NV nv2 = SvNSIV(b);
return nv1 < nv2 ? -1 : nv1 > nv2 ? 1 : 0;
}
void test(SV* sv, ...) {
I32 i;
I32 arrayLen;
AV* data;
float retval;
SV** pvalue;
Inline_Stack_Vars;
data = (AV*)SvRV(Inline_Stack_Item(0)); /* the argument is an array reference */
/* Determine the length of the array */
arrayLen = av_len(data);
/* sort ascending (send numerical sort function S_sv_ncmp) */
sortsv(AvARRAY(data),arrayLen+1, S_sv_ncmp);
for (i = 0; i < arrayLen+1; i++) {
pvalue = av_fetch(data,i,0); /* fetch the scalar located at i .*/
retval = SvNV(*pvalue); /* dereference the scalar into a number. */
printf("%f \n", retval); /* print the number itself, not a new SV */
}
}
END_OF_C_CODE

Related

CRC-32 algorithm from HDL to software

I implemented a Galois Linear-Feedback Shift Register in Verilog (and also in MATLAB, mainly to emulate the HDL design). It's been working great, and as of now I use MATLAB to calculate CRC-32 fields, and then include them in my HDL simulations to verify that a data packet has arrived correctly (padding data with CRC-32), which produces good results.
The thing is, I want to be able to calculate the CRC-32 I've implemented in software, because I'll be using a Raspberry Pi to input data through GPIO into my FPGA, and I haven't been able to do so. I've tried this online calculator, using the same parameters, but I never get the same result.
This is the MATLAB code I use to calculate my CRC-32:
N = 74*16;
data = [round(rand(1,N)) zeros(1,32)];
lfsr = ones(1,32);
next_lfsr = zeros(1,32);
for i = 1:length(data)
next_lfsr(1) = lfsr(2);
next_lfsr(2) = lfsr(3);
next_lfsr(3) = lfsr(4);
next_lfsr(4) = lfsr(5);
next_lfsr(5) = lfsr(6);
next_lfsr(6) = xor(lfsr(7),lfsr(1));
next_lfsr(7) = lfsr(8);
next_lfsr(8) = lfsr(9);
next_lfsr(9) = xor(lfsr(10),lfsr(1));
next_lfsr(10) = xor(lfsr(11),lfsr(1));
next_lfsr(11) = lfsr(12);
next_lfsr(12) = lfsr(13);
next_lfsr(13) = lfsr(14);
next_lfsr(14) = lfsr(15);
next_lfsr(15) = lfsr(16);
next_lfsr(16) = xor(lfsr(17), lfsr(1));
next_lfsr(17) = lfsr(18);
next_lfsr(18) = lfsr(19);
next_lfsr(19) = lfsr(20);
next_lfsr(20) = xor(lfsr(21),lfsr(1));
next_lfsr(21) = xor(lfsr(22),lfsr(1));
next_lfsr(22) = xor(lfsr(23),lfsr(1));
next_lfsr(23) = lfsr(24);
next_lfsr(24) = xor(lfsr(25), lfsr(1));
next_lfsr(25) = xor(lfsr(26), lfsr(1));
next_lfsr(26) = lfsr(27);
next_lfsr(27) = xor(lfsr(28), lfsr(1));
next_lfsr(28) = xor(lfsr(29), lfsr(1));
next_lfsr(29) = lfsr(30);
next_lfsr(30) = xor(lfsr(31), lfsr(1));
next_lfsr(31) = xor(lfsr(32), lfsr(1));
next_lfsr(32) = xor(data(i), lfsr(1));
lfsr = next_lfsr;
end
crc32 = lfsr;
Note that I use a padding of 32 zeroes to calculate the CRC-32 in the first place (whatever's left in the LFSR at the end is my CRC-32, and if I do the same replacing the zeroes with this CRC-32, my LFSR is all zeroes at the end too, which means the verification passed).
The polynomial I'm using is the standard for CRC-32: 04C11DB7. See also that the order seems to be reversed, but that's just because it's mirrored to have the input in the MSB. The results of using this representation and a mirrored one are the same when the input is the same, only the result will be also mirrored.
Any ideas would be of great help.
Thanks in advance
Your CRC is not a CRC. The last 32 bits fed in don't actually participate in the calculation, other than being exclusive-or'ed into the result. That is, if you replace the last 32 bits of data with zeros, do your calculation, and then exclusive-or the last 32 bits of data with the resulting "crc32", then you will get the same result.
So you will never get it to match another CRC calculation, since it isn't a CRC.
This code in C replicates your function, where the data bits come from the series of n bytes at p, least significant bit first, and the result is a 32-bit value:
unsigned long notacrc(void const *p, unsigned n) {
unsigned char const *dat = p;
unsigned long reg = 0xffffffff;
while (n) {
for (unsigned k = 0; k < 8; k++)
reg = reg & 1 ? (reg >> 1) ^ 0xedb88320 : reg >> 1;
reg ^= (unsigned long)*dat++ << 24;
n--;
}
return reg;
}
You can immediately see that the last byte of data is simply exclusive-or'ed with the final register value. Less obvious is that the last four bytes are just exclusive-or'ed. This exactly equivalent version makes that evident:
unsigned long notacrc_xor(void const *p, unsigned n) {
unsigned char const *dat = p;
// initial register values
unsigned long const init[] = {
0xffffffff, 0x2dfd1072, 0xbe26ed00, 0x00be26ed, 0xdebb20e3};
unsigned xor = n > 3 ? 4 : n; // number of bytes merely xor'ed
unsigned long reg = init[xor];
while (n > xor) {
reg ^= *dat++;
for (unsigned k = 0; k < 8; k++)
reg = reg & 1 ? (reg >> 1) ^ 0xedb88320 : reg >> 1;
n--;
}
switch (n) {
case 4:
reg ^= *dat++;
case 3:
reg ^= (unsigned long)*dat++ << 8;
case 2:
reg ^= (unsigned long)*dat++ << 16;
case 1:
reg ^= (unsigned long)*dat++ << 24;
}
return reg;
}
There you can see that the last four bytes of the message, or all of the message if it is three or fewer bytes, is exclusive-or'ed with the final register value at the end.
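The claim can also be checked mechanically. The sketch below assumes messages of 4 to 64 bytes and uses hypothetical names: notacrc32 follows the same update rule as notacrc() above, and notacrc32_tail_xor zeroes the last four bytes, runs the same register, and then XORs the original tail into the result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same update rule as the notacrc() function in the answer: shift the
   register eight times, then XOR the next byte into the top eight bits. */
static uint32_t notacrc32(const unsigned char *dat, size_t n) {
    uint32_t reg = 0xffffffffu;
    while (n--) {
        for (unsigned k = 0; k < 8; k++)
            reg = (reg & 1) ? (reg >> 1) ^ 0xedb88320u : reg >> 1;
        reg ^= (uint32_t)*dat++ << 24;
    }
    return reg;
}

/* Hypothetical helper: zero the last four bytes, run the same register,
   then XOR the original last four bytes into the result. */
static uint32_t notacrc32_tail_xor(const unsigned char *dat, size_t n) {
    unsigned char copy[64];
    assert(n >= 4 && n <= sizeof copy);
    memcpy(copy, dat, n);
    memset(copy + n - 4, 0, 4);
    uint32_t tail = (uint32_t)dat[n - 4]
                  | (uint32_t)dat[n - 3] << 8
                  | (uint32_t)dat[n - 2] << 16
                  | (uint32_t)dat[n - 1] << 24;
    return notacrc32(copy, n) ^ tail;
}
```

Both functions return the same value for any message in that size range, because the register update is linear over XOR and a byte entered at the top of the register needs more than 24 further shifts before it can reach bit 0 and influence the polynomial feedback.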
An actual CRC must use all of the input data bits in determining when to exclusive-or the polynomial with the register. The inner part of that last function is what a CRC implementation looks like (though more efficient versions make use of pre-computed tables to process a byte or more at a time). Here is a function that computes an actual CRC:
unsigned long crc32_jam(void const *p, unsigned n) {
unsigned char const *dat = p;
unsigned long reg = 0xffffffff;
while (n) {
reg ^= *dat++;
for (unsigned k = 0; k < 8; k++)
reg = reg & 1 ? (reg >> 1) ^ 0xedb88320 : reg >> 1;
n--;
}
return reg;
}
That one is called crc32_jam because it implements a particular CRC called "JAMCRC". That CRC is the closest to what you attempted to implement.
If you want to use a real CRC, you will need to update your Verilog implementation.
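If the goal is to match a typical online calculator, my guess is that it reports the standard CRC-32 rather than JAMCRC; the two differ only by a final inversion of all 32 bits. A minimal sketch with hypothetical function names, using fixed-width types:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise JAMCRC, following the same steps as crc32_jam() above. */
static uint32_t jamcrc(const unsigned char *dat, size_t n) {
    uint32_t reg = 0xffffffffu;
    assert(dat != NULL || n == 0);
    while (n--) {
        reg ^= *dat++;
        for (unsigned k = 0; k < 8; k++)
            reg = (reg & 1) ? (reg >> 1) ^ 0xedb88320u : reg >> 1;
    }
    return reg;
}

/* Standard CRC-32 is JAMCRC with every bit of the result inverted. */
static uint32_t crc32_standard(const unsigned char *dat, size_t n) {
    return jamcrc(dat, n) ^ 0xffffffffu;
}
```

For the nine ASCII bytes "123456789", the usual check values are 0x340BC6D9 for JAMCRC and 0xCBF43926 for the standard CRC-32, which is a quick way to confirm that a software implementation and a calculator agree on the parameters.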

What is RMAGICAL?

I'm trying to understand some XS code that I inherited. I've been trying to add comments to a section that invokes Perl magic stuff, but I can't find any documentation to help me understand this line:
SvRMAGICAL_off((SV *) myVar);
What is RMAGICAL for? When should one turn it on or off when working with Perl magic variables?
Update
Perlguts Illustrated is very interesting and has a little bit of info on RMAGICAL (the 'R' is for 'random'), but it doesn't say when to mess with it: http://cpansearch.perl.org/src/RURBAN/illguts-0.42/index.html
It's a flag that indicates whether a variable has "clear" magic, magic that should be called when the variable is cleared (e.g. when it's destroyed). It's used by mg_clear which is called when one attempts to do something like
undef %hash;
delete $a[4];
etc
It's derived information calculated by mg_magical that should never be touched. mg_magical will be called to update the flag when magic is added to or removed from a variable. If any of the magic attached to the scalar has a "clear" handler in its Magic Virtual Table, the scalar gets RMAGICAL set. Otherwise, it gets turned off. Effectively, this caches the information to save Perl from repeatedly checking all the magic attached to a scalar for this information.
One example use of clear magic: When a %SIG entry is cleared, the magic removes the signal handler for that signal.
Here's mg_magical:
void
Perl_mg_magical(pTHX_ SV *sv)
{
const MAGIC* mg;
PERL_ARGS_ASSERT_MG_MAGICAL;
PERL_UNUSED_CONTEXT;
SvMAGICAL_off(sv);
if ((mg = SvMAGIC(sv))) {
do {
const MGVTBL* const vtbl = mg->mg_virtual;
if (vtbl) {
if (vtbl->svt_get && !(mg->mg_flags & MGf_GSKIP))
SvGMAGICAL_on(sv);
if (vtbl->svt_set)
SvSMAGICAL_on(sv);
if (vtbl->svt_clear)
SvRMAGICAL_on(sv);
}
} while ((mg = mg->mg_moremagic));
if (!(SvFLAGS(sv) & (SVs_GMG|SVs_SMG)))
SvRMAGICAL_on(sv);
}
}
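To make the caching idea concrete, here is a toy model in plain C. None of these types or names are Perl's; they are stand-ins invented for illustration, and the function only mirrors the control flow of the listing above.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins, not Perl's real types: each flag mirrors one of the
   SVs_GMG / SVs_SMG / SVs_RMG bits. */
enum { TOY_GMG = 1, TOY_SMG = 2, TOY_RMG = 4 };

typedef struct toy_magic {
    int has_get, has_set, has_clear;   /* which handlers the vtable has  */
    int skip_get;                      /* plays the role of MGf_GSKIP    */
    const struct toy_magic *next;      /* plays the role of mg_moremagic */
} toy_magic;

typedef struct toy_sv {
    unsigned flags;                    /* cached, derived information */
    const toy_magic *magic;            /* head of the magic chain     */
} toy_sv;

/* Recompute the cached flags from the magic chain, following the same
   steps as the mg_magical listing. */
static void toy_mg_magical(toy_sv *sv) {
    assert(sv != NULL);
    sv->flags &= ~(unsigned)(TOY_GMG | TOY_SMG | TOY_RMG);
    if (sv->magic == NULL)
        return;                        /* no magic: all three flags off */
    for (const toy_magic *mg = sv->magic; mg != NULL; mg = mg->next) {
        if (mg->has_get && !mg->skip_get) sv->flags |= TOY_GMG;
        if (mg->has_set)                  sv->flags |= TOY_SMG;
        if (mg->has_clear)                sv->flags |= TOY_RMG;
    }
    if (!(sv->flags & (TOY_GMG | TOY_SMG)))
        sv->flags |= TOY_RMG;          /* magic present, but no get/set */
}
```

In this model the flags are always recomputed from the chain, which is why setting or clearing one of them by hand only makes sense if the recomputation is run again afterwards.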
The SVs_RMG flag (which is what SvRMAGICAL tests for and SvRMAGICAL_on/SvRMAGICAL_off sets/clears) means that the variable has some magic associated with it other than a magic getter method (which is indicated by the SVs_GMG flag) and magic setter method (indicated by SVs_SMG).
I'm getting out of my depth, here, but examples of variables where RMAGIC is on include most of the values in %ENV (the ones that are set when the program begins, but not ones you define at run-time), the values in %! and %SIG, and stash values for named subroutines (i.e., in the program
package main;
sub foo { 42 }
$::{"foo"} is RMAGICAL and $::{"bar"} is not). Using Devel::Peek is somewhat, but not totally, enlightening about what this magic might be:
$ /usr/bin/perl -MDevel::Peek -e 'Dump $ENV{HOSTNAME}'
SV = PVMG(0x8003e910) at 0x800715f0
REFCNT = 1
FLAGS = (SMG,RMG,POK,pPOK)
IV = 0
NV = 0
PV = 0x80072790 "localhost"\0
CUR = 10
LEN = 12
MAGIC = 0x800727a0
MG_VIRTUAL = &PL_vtbl_envelem
MG_TYPE = PERL_MAGIC_envelem(e)
MG_LEN = 8
MG_PTR = 0x800727c0 "HOSTNAME"
Here we see that the scalar held in $ENV{HOSTNAME} has an MG_TYPE and MG_VIRTUAL that give you the what, but not the how and why of this variable's magic. On a "regular" magical variable, these are usually (always?) PERL_MAGIC_sv and &PL_vtbl_sv:
$ /usr/bin/perl -MDevel::Peek -e 'Dump $='
SV = PVMG(0x8008e080) at 0x80071de8
REFCNT = 1
FLAGS = (GMG,SMG)
IV = 0
NV = 0
PV = 0
MAGIC = 0x80085aa8
MG_VIRTUAL = &PL_vtbl_sv
MG_TYPE = PERL_MAGIC_sv(\0)
MG_OBJ = 0x80071d58
MG_LEN = 1
MG_PTR = 0x80081ad0 "="
There is one place in the perl source where SvRMAGICAL_off is used -- in perlio.c, in the XS(XS_io_MODIFY_SCALAR_ATTRIBUTES).
XS(XS_io_MODIFY_SCALAR_ATTRIBUTES)
{
dXSARGS;
SV * const sv = SvRV(ST(1));
AV * const av = newAV();
MAGIC *mg;
int count = 0;
int i;
sv_magic(sv, MUTABLE_SV(av), PERL_MAGIC_ext, NULL, 0);
SvRMAGICAL_off(sv);
mg = mg_find(sv, PERL_MAGIC_ext);
mg->mg_virtual = &perlio_vtab;
mg_magical(sv);
Perl_warn(aTHX_ "attrib %" SVf, SVfARG(sv));
for (i = 2; i < items; i++) {
STRLEN len;
const char * const name = SvPV_const(ST(i), len);
SV * const layer = PerlIO_find_layer(aTHX_ name, len, 1);
if (layer) {
av_push(av, SvREFCNT_inc_simple_NN(layer));
}
else {
ST(count) = ST(i);
count++;
}
}
SvREFCNT_dec(av);
XSRETURN(count);
}
where for some reason (again, I'm out of my depth), they want that magic turned off during the mg_find call.

Why is the call to array_view::synchronize() so slow?

I've started experimenting with C++ AMP. I've created a simple test app just to see what it can do, however the results are quite surprising to me. Consider the following code:
#include <amp.h>
#include <cstdint>
#include <cstdio>
#include <new>
#include "Timer.h"
using namespace concurrency;
int main( int argc, char* argv[] )
{
uint32_t u32Threads = 16;
uint32_t u32DataRank = u32Threads * 256;
uint32_t u32DataSize = (u32DataRank * u32DataRank) / u32Threads;
uint32_t* pu32Data = new (std::nothrow) uint32_t[ u32DataRank * u32DataRank ];
for ( uint32_t i = 0; i < u32DataRank * u32DataRank; i++ )
{
pu32Data[i] = 1;
}
uint32_t* pu32Sum = new (std::nothrow) uint32_t[ u32Threads ];
Timer tmr;
tmr.Start();
array< uint32_t, 1 > source( u32DataRank * u32DataRank, pu32Data );
array_view< uint32_t, 1 > sum( u32Threads, pu32Sum );
printf( "Array<> deep copy time: %.6f\n", tmr.Stop() );
tmr.Start();
parallel_for_each(
sum.extent,
[=, &source](index<1> idx) restrict(amp)
{
uint32_t u32Sum = 0;
uint32_t u32Start = idx[0] * u32DataSize;
uint32_t u32End = (idx[0] * u32DataSize) + u32DataSize;
for ( uint32_t i = u32Start; i < u32End; i++ )
{
u32Sum += source[i];
}
sum[idx] = u32Sum;
}
);
double dDuration = tmr.Stop();
printf( "gpu computation time: %.6f\n", dDuration );
tmr.Start();
sum.synchronize();
dDuration = tmr.Stop();
printf( "synchronize time: %.6f\n", dDuration );
printf( "first and second row sum = %u, %u\n", pu32Sum[0], pu32Sum[1] );
tmr.Start();
for ( uint32_t idx = 0; idx < u32Threads; idx++ )
{
uint32_t u32Sum = 0;
for ( uint32_t i = 0; i < u32DataSize; i++ )
{
u32Sum += pu32Data[(idx * u32DataSize) + i];
}
pu32Sum[idx] = u32Sum;
}
dDuration = tmr.Stop();
printf( "cpu computation time: %.6f\n", dDuration );
printf( "first and second row sum = %u, %u\n", pu32Sum[0], pu32Sum[1] );
delete [] pu32Sum;
delete [] pu32Data;
return 0;
}
Note that Timer is a simple timing class using QueryPerformanceCounter. Anyway, the output of the code is the following:
Array<> deep copy time: 0.089784
gpu computation time: 0.000449
synchronize time: 8.671081
first and second row sum = 1048576, 1048576
cpu computation time: 0.006647
first and second row sum = 1048576, 1048576
Why is the call to synchronize() taking so long? Is there a way to get around this? Other than that, the performance of the computation is amazing; however, the synchronize() overhead makes it unusable for me.
It is also possible that I am doing something terribly wrong; if so, please tell me. Thanks in advance.
Function synchronize() is probably taking so long because it is waiting for the actual kernel to complete its work.
From parallel_for_each from amp.h:
Please note that the parallel_for_each executes as if synchronous to the calling code, but in reality, it is asynchronous. I.e. once the parallel_for_each call is made and the kernel has been passed to the runtime, the [code after the parallel_for_each] continues to execute immediately by the CPU thread, while in parallel the kernel is executed by the GPU threads.
So, measuring the time spent in parallel_for_each is not particularly meaningful.
EDIT: The way the algorithm is written, it won't benefit much from GPU acceleration. The read of source[i] is non-coalesced, and so it will be almost 16x slower than a coalesced read. It is possible to coalesce the read by using shared memory, but it is not quite trivial. I'd recommend reading up on GPU programming.
If you just want a simple example that demonstrates the utility of C++ AMP, try matrix multiplication.
Of course, the performance you'll observe also greatly depends on the model of your GPU hardware.
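To illustrate what "non-coalesced" means for the read of source[i], the plain C sketch below only computes which element a given thread touches at a given loop step. The first function is the layout from the question; the second is an interleaved layout, which is a different technique from the shared-memory approach mentioned above and is shown purely as an assumption-laden illustration. Both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Layout used in the question: thread t owns one contiguous chunk, so at
   step i neighbouring threads touch elements a whole chunk apart. */
static size_t chunked_index(size_t t, size_t i, size_t chunk) {
    return t * chunk + i;
}

/* Interleaved alternative: at step i neighbouring threads touch
   neighbouring elements. */
static size_t interleaved_index(size_t t, size_t i, size_t threads) {
    assert(threads > 0 && t < threads);
    return i * threads + t;
}
```

With the question's numbers (16 threads, 1048576 elements per thread), threads 0 and 1 read elements 1048576 apart at every step in the chunked layout and adjacent elements in the interleaved one. Whether that translates into a speedup depends on the hardware, and 16 threads is far too few to keep a GPU busy either way.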
In addition to Igor's response on your specific algorithm, please note that there are multiple incorrect aspects of the way you are measuring C++ AMP performance in general (no runtime initialization exclusion, no discarding of initial JIT, no warmup of data, and the already pointed out assumption of p_f_e being synchronous), so please follow our guidelines here:
http://blogs.msdn.com/b/nativeconcurrency/archive/2011/12/28/how-to-measure-the-performance-of-c-amp-algorithms.aspx

does kyoto cabinet support key range search?

Does Kyoto Cabinet support searching for a range of keys?
If so, what types of keys support range search?
Can I do range search on a long (64bit) key?
Thanks
RG
It supports key prefix queries; however, the efficiency of a prefix query depends on the internal storage structure. If you are using hashdb, it may not be a good idea, as keys and values are scattered around in the underlying file.
Yes, for integers.
B+ tree database supports sequential access in order of the keys, which realizes forward matching search for strings and range search for integers - from docs
Yes you can, you just need a forward jump.
An example using C. It stores 5 records with 64-bit keys (from 1 to 5) and then applies a filter (from 2 to 4):
#include <kclangc.h>
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main(void)
{
KCDB *db;
KCCUR *cur;
char *kbuf;
size_t ksiz, vsiz;
const char *cvbuf;
int64_t i, val, min, max;
int64_t keys[] = {1, 2, 3, 4, 5};
const char *values[] = {"one", "two", "three", "four", "five"};
char i64[8]; /* A buffer to store byte sequences */
/* create the database object */
db = kcdbnew();
/* open the database */
if (!kcdbopen(db, "db64.kct", KCOWRITER | KCOCREATE)) {
fprintf(stderr, "open error: %s\n", kcecodename(kcdbecode(db)));
}
/* store records */
for (i = 0; i < 5; i++) {
memcpy(i64, &keys[i], 8);
if (!kcdbset(db, i64, 8, values[i], strlen(values[i]))) {
fprintf(stderr, "set error: %s\n", kcecodename(kcdbecode(db)));
exit(EXIT_FAILURE);
}
}
/* traverse records */
min = 2;
max = 4;
printf("Range from %" PRId64 " to %" PRId64 "\n", min, max);
memcpy(i64, &min, 8);
cur = kcdbcursor(db);
kccurjumpkey(cur, i64, 8);
while ((kbuf = kccurget(cur, &ksiz, &cvbuf, &vsiz, 1)) != NULL) {
memcpy(&val, kbuf, 8);
if (val > max) {
kcfree(kbuf); /* release the record before leaving the loop */
break;
}
printf("Found %s\n", cvbuf);
kcfree(kbuf);
}
kccurdel(cur);
/* close the database */
if (!kcdbclose(db)) {
fprintf(stderr, "close error: %s\n", kcecodename(kcdbecode(db)));
}
/* delete the database object */
kcdbdel(db);
return 0;
}
LevelDB supports binary keys and ranged queries.
Edit: I forgot to mention that in order for the range query to work, the binary value needs to be packed in a comparable way. For your long example, you need to make sure it's big-endian encoded.
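A minimal sketch of such a packing in C, assuming the store compares keys byte by byte and the keys are unsigned (signed values would additionally need the sign bit flipped). The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Write an unsigned 64-bit key as 8 bytes, most significant byte first,
   so that byte-wise comparison agrees with numeric order. */
static void encode_key_be(uint64_t key, unsigned char out[8]) {
    assert(out != NULL);
    for (int i = 7; i >= 0; i--) {
        out[i] = (unsigned char)(key & 0xff);
        key >>= 8;
    }
}

/* Inverse of encode_key_be. */
static uint64_t decode_key_be(const unsigned char in[8]) {
    uint64_t key = 0;
    assert(in != NULL);
    for (int i = 0; i < 8; i++)
        key = (key << 8) | in[i];
    return key;
}

/* The byte-wise comparison assumed for the store's key ordering. */
static int compare_keys(const unsigned char a[8], const unsigned char b[8]) {
    return memcmp(a, b, 8);
}
```

With this encoding, key 1 compares below key 256. If the same two keys are copied in native little-endian byte order instead, 256 starts with a 0x00 byte and compares below 1, so a range scan would visit them in the wrong order.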

Carefully deleting N items from a "circular" vector (or perhaps just an NSMutableArray)

Imagine a std::vector, say, with 100 things on it (0 to 99) currently. You are treating it as a loop. So the 105th item is index 4; forward 7 from index 98 is 5.
You want to delete N items after index position P.
So, delete 5 items after index 50; easy.
Or 5 items after index 99: you delete index 0 five times (or indices 4 down to 0), noting that index 99 will no longer exist afterwards.
Worst, 5 items after index 97 - you have to deal with both modes of deletion.
What's the elegant and solid approach?
Here's a boring routine I wrote
-(void)knotRemovalHelper:(NSMutableArray*)original
after:(NSInteger)nn howManyToDelete:(NSInteger)desired
{
#define ORCO ((NSInteger)[original count])
static NSInteger kount, howManyUntilLoop, howManyExtraAferLoop;
if ( ... our array is NOT a loop ... )
// trivial, if messy...
{
for ( kount = 1; kount<=desired; ++kount )
{
if ( (nn+1) >= ORCO )
return;
[original removeObjectAtIndex:( nn+1 )];
}
return;
}
else // our array is a loop
// messy, confusing and inelegant. how to improve?
// here we go...
{
howManyUntilLoop = (ORCO-1) - nn;
if ( howManyUntilLoop > desired )
{
for ( kount = 1; kount<=desired; ++kount )
[original removeObjectAtIndex:( nn+1 )];
return;
}
howManyExtraAferLoop = desired - howManyUntilLoop;
for ( kount = 1; kount<=howManyUntilLoop; ++kount )
[original removeObjectAtIndex:( nn+1 )];
for ( kount = 1; kount<=howManyExtraAferLoop; ++kount )
[original removeObjectAtIndex:0];
return;
}
#undef ORCO
}
Update!
InVariant's second answer leads to the following excellent solution. "starting with" is much better than "starting after". So the routine now uses "start with". Invariant's second answer leads to this very simple solution...
N times do if P < currentsize remove P else remove 0
-(void)removeLoopilyFrom:(NSMutableArray*)ra
startingWithThisOne:(NSInteger)removeThisOneFirst
howManyToDelete:(NSInteger)countToDelete
{
// exception if removeThisOneFirst > ra highestIndex
// exception if countToDelete is > ra size
// so easy thanks to Invariant:
for ( NSInteger k = 0; k < countToDelete; ++k )
{
if ( removeThisOneFirst < (NSInteger)[ra count] )
[ra removeObjectAtIndex:removeThisOneFirst];
else
[ra removeObjectAtIndex:0];
}
}
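For readers who don't use Objective-C, the same "remove P if it still exists, else remove 0" rule can be sketched in C on a plain int array. The helper names and the array representation are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Remove the element at index idx from an array of *len ints. */
static void remove_at(int *a, size_t *len, size_t idx) {
    assert(idx < *len);
    memmove(a + idx, a + idx + 1, (*len - idx - 1) * sizeof *a);
    (*len)--;
}

/* Remove count elements starting with index start, wrapping to the
   front once the end of the array has been consumed. */
static void remove_circular(int *a, size_t *len, size_t start, size_t count) {
    assert(start < *len && count <= *len);
    while (count--)
        remove_at(a, len, start < *len ? start : 0);
}
```

Starting from the ten values 0 to 9, remove_circular(a, &len, 8, 4) removes 8 and 9, then 0 and 1, and leaves 2 through 7.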
Update!
Toolbox has pointed out the excellent idea of working to a new array - super KISS.
Here's an idea off the top of my head.
First, generate an array of integers representing the indices to remove. So "remove 5 from index 97" would generate [97,98,99,0,1]. This can be done with the application of a simple modulus operator.
Then, sort this array descending giving [99,98,97,1,0] and then remove the entries in that order.
Should work in all cases.
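A rough C sketch of the index-generation step, with made-up function names, could look like this:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* qsort comparator: larger indices first. */
static int cmp_desc(const void *x, const void *y) {
    size_t a = *(const size_t *)x, b = *(const size_t *)y;
    return (a < b) - (a > b);
}

/* Fill out[0..count-1] with the indices to delete, starting at start and
   wrapping with the modulus operator, then sort them descending so
   that deleting in that order never shifts an index still to be used. */
static void circular_indices(size_t len, size_t start, size_t count,
                             size_t *out) {
    assert(len > 0 && count <= len);
    for (size_t k = 0; k < count; k++)
        out[k] = (start + k) % len;
    qsort(out, count, sizeof *out, cmp_desc);
}
```

For a length of 100, a start of 97 and a count of 5, out ends up holding 99, 98, 97, 1, 0, matching the example above. The caller then removes the entries in that order.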
This solution seems to work, and it copies all remaining elements in the vector only once (to their final destination).
Assume kNumElements, kStartIndex, and kNumToRemove are defined as const size_t values.
vector<int> my_vec(kNumElements);
for (size_t i = 0; i < my_vec.size(); ++i) {
my_vec[i] = i;
}
for (size_t i = 0, cur = 0; i < my_vec.size(); ++i) {
// What is the "distance" from the current index to the start, taking
// into account the wrapping behavior?
size_t distance = (i + kNumElements - kStartIndex) % kNumElements;
// If it's not one of the ones to remove, then we keep it by copying it
// into its proper place.
if (distance >= kNumToRemove) {
my_vec[cur++] = my_vec[i];
}
}
my_vec.resize(kNumElements - kNumToRemove);
There's nothing wrong with two loop solutions as long as they're readable and don't do anything redundant. I don't know Objective-C syntax, but here's the pseudocode approach I'd take:
endIdx = after + howManyToDelete
if (Len <= after + howManyToDelete) //will have a second loop
firstpass = Len - 1 - after; //handle end in the first loop, beginning in second
else
firstpass = howManyToDelete; //the first loop will get them all
for (kount = 0; kount < firstpass; kount++)
remove after+1
for ( ; kount < howManyToDelete; kount++) //if firstpass < howManyToDelete, clean up leftovers
remove 0
This solution doesn't use mod, does the limit calculation outside the loop, and touches the relevant samples once each. The second for loop won't execute if all the samples were handled in the first loop.
The common way to do this in DSP is with a circular buffer. This is just a fixed length buffer with two associated counters:
//make sure BUFSIZE is a power of 2 for quick mod trick
#define BUFSIZE 1024
int CircBuf[BUFSIZE];
int InCtr, OutCtr;
void PutData(int *Buf, int count) {
int srcCtr;
int destCtr = InCtr & (BUFSIZE - 1); // if BUFSIZE is a power of 2, equivalent to and faster than destCtr = InCtr % BUFSIZE
for (srcCtr = 0; (srcCtr < count) && (destCtr < BUFSIZE); srcCtr++, destCtr++)
CircBuf[destCtr] = Buf[srcCtr];
for (destCtr = 0; srcCtr < count; srcCtr++, destCtr++)
CircBuf[destCtr] = Buf[srcCtr];
InCtr += count;
}
void GetData(int *Buf, int count) {
int srcCtr = OutCtr & (BUFSIZE - 1);
int destCtr = 0;
for (destCtr = 0; (srcCtr < BUFSIZE) && (destCtr < count); srcCtr++, destCtr++)
Buf[destCtr] = CircBuf[srcCtr];
for (srcCtr = 0; destCtr < count; srcCtr++, destCtr++)
Buf[destCtr] = CircBuf[srcCtr];
OutCtr += count;
}
int BufferOverflow() {
return ((InCtr - OutCtr) > BUFSIZE);
}
This is pretty lightweight, but effective. And aside from the ctr = BigCtr & (SIZE-1) stuff, I'd argue it's highly readable. The only reason for the & trick is in old DSP environments, mod was an expensive operation so for something that ran often, like every time a buffer was ready for processing, you'd find ways to remove stuff like that. And if you were doing FFT's, your buffers were probably a power of 2 anyway.
These days, of course, you have 1 GHz processors and magically resizing arrays. You kids get off my lawn.
Another method:
N times do {remove entry at index P mod max(ArraySize, P)}
Example:
N=5, P=97, ArraySize=100
1: max(100, 97)=100 so remove at 97%100 = 97
2: max(99, 97)=99 so remove at 97%99 = 97 // array size is now 99
3: max(98, 97)=98 so remove at 97%98 = 97
4: max(97, 97)=97 so remove at 97%97 = 0
5: max(96, 97)=97 so remove at 97%97 = 0
I don't program for the iPhone for now, so I'll imagine a std::vector. It's quite easy, simple and elegant enough:
#include <iostream>
using std::cout;
#include <vector>
using std::vector;
#include <cassert> //no need for using, assert is macro
template<typename T>
void eraseCircularVector(vector<T> & vec, size_t position, size_t count)
{
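assert(count <= vec.size());
if (count == vec.size())
{
vec.clear(); //everything goes; otherwise positionEnd would wrap back onto position
return;
}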
if (count > 0)
{
position %= vec.size(); //normalize position
size_t positionEnd = (position + count) % vec.size();
if (positionEnd < position)
{
vec.erase(vec.begin() + position, vec.end());
vec.erase(vec.begin(), vec.begin() + positionEnd);
}
else
vec.erase(vec.begin() + position, vec.begin() + positionEnd);
}
}
int main()
{
vector<int> values;
for (int i = 0; i < 10; ++i)
values.push_back(i);
cout << "Values: ";
for (vector<int>::const_iterator cit = values.begin(); cit != values.end(); cit++)
cout << *cit << ' ';
cout << '\n';
eraseCircularVector(values, 5, 1); //remains 9: 0,1,2,3,4,6,7,8,9
eraseCircularVector(values, 16, 5); //remains 4: 3,4,6,7
cout << "Values: ";
for (vector<int>::const_iterator cit = values.begin(); cit != values.end(); cit++)
cout << *cit << ' ';
cout << '\n';
return 0;
}
However, you might consider:
creating a new loop_vector class, if you use this kind of functionality enough
using a list if you perform many deletions (or few deletions, other than from the end, which is a simple pop_back, but on a large array)
If your container (NSMutableArray or whatever) is not a list but a vector (i.e. a resizable array), you most definitely don't want to delete items one by one, but as a whole range (e.g. std::vector's erase(begin, end))!
Edit: reacting to a comment, to fully realize what a vector must do if you erase an element other than the last one: it must copy all values after that element (e.g. with 1000 items in the array, erasing the first one means copying (moving) 999 items, which is very costly).
Example:
#include <iostream>
#include <vector>
#include <ctime>
using namespace std;
int main()
{
clock_t start, end;
vector<int> vec;
const int items = 64 * 1024;
cout << "using " << items << " items in vector\n";
for (size_t i = 0; i < items; ++i) vec.push_back(i);
start = clock();
while (!vec.empty()) vec.erase(vec.begin());
end = clock();
cout << "Inefficient method took: "
<< (end - start) * 1.0 / CLOCKS_PER_SEC << " s\n";
for (size_t i = 0; i < items; ++i) vec.push_back(i);
start = clock();
vec.erase(vec.begin(), vec.end());
end = clock();
cout << "Efficient method took: "
<< (end - start) * 1.0 / CLOCKS_PER_SEC << " s\n";
return 0;
}
Produces output:
using 65536 items in vector
Inefficient method took: 1.705 s
Efficient method took: 0 s
Note it's very easy to end up inefficient; have a look at e.g. http://www.cplusplus.com/reference/stl/vector/erase/