Strange results with NEON code on iOS, jump into __ARCLite__load? - iphone

I am attempting to write some NEON code for optimal filling of word arrays on iPhone/iPad. What is so strange about this issue is that the code seems to jump into a function named __ARCLite__load when a NEON instruction assigns a value to q3. Has anyone seen something like this before?
(test_time_asm.s compiled with Xcode 4.6 and the -no-integrated-as flag)
.section __TEXT,__text,regular
.section __TEXT,__textcoal_nt,coalesced
.section __TEXT,__const_coal,coalesced
.section __TEXT,__picsymbolstub4,symbol_stubs,none,16
.text
.align 2
.globl _fill_neon_loop1
.private_extern _fill_neon_loop1
_fill_neon_loop1:
push {r4, r5, r6, r7, lr}
// r0 = wordPtr
// r1 = inWord
// r2 = numWordsToFill
mov r2, #1024
// Load r1 (inWord) into NEON registers
vdup.32 q0, r1
vdup.32 q1, r1
vdup.32 q2, r1
vdup.32 q3, r1 // Stepping into this instruction jumps into __ARCLite__load
NEONFILL16_loop1:
vstm r0!, {d0-d7}
sub r2, r2, #16
cmp r2, #15
bgt NEONFILL16_loop1
mov r0, #0
pop {r4, r5, r6, r7, pc}
.subsections_via_symbols
Single-stepping through the ASM instructions works up until the instruction that assigns to q3. When I step over that instruction, the code seems to jump here:
(gdb) bt
#0 0x0009a568 in __ARCLite__load () at /SourceCache/arclite_iOS/arclite-31/source/arclite.m:529
#1 0x0007b050 in test_time_run_cases () at test_time.h:147
This is really strange and I am really quite at a loss to understand why assigning to a NEON register would cause this. Does NEON use q3 for something special that I am unaware of?
I also tried loading the registers using dN (64-bit registers), with the same result on the assignment to d7.
vdup.32 d0, r1
vdup.32 d1, r1
vdup.32 d2, r1
vdup.32 d3, r1
vdup.32 d4, r1
vdup.32 d5, r1
vdup.32 d6, r1
vdup.32 d7, r1
(later)
After messing around with the suggested changes, I found the root cause of the problem. It was this branch label:
NEONFILL16_loop1:
vstm r0!, {d0-d7}
sub r2, r2, #16
cmp r2, #15
bgt NEONFILL16_loop1
For some reason, the branch label was causing a jump to another location in the code. Replacing the label above with the following fixed the problem:
1:
vstm r0!, {d0-d7}
sub r2, r2, #16
cmp r2, #15
bgt 1b
This could be some quirk in the version of the ASM parser in the clang delivered with Xcode 4.6, but in any case just changing the label fixed it.

q3 has no special role and does not need to be preserved (under the ARM procedure call standard only q4-q7, i.e. d8-d15, are callee-saved). Don't worry about this.
I think auselen is right with his guess. Just looking at the disassembly will make it clear.
Try the version below, though:
.section __TEXT,__text,regular
.section __TEXT,__textcoal_nt,coalesced
.section __TEXT,__const_coal,coalesced
.section __TEXT,__picsymbolstub4,symbol_stubs,none,16
.text
.align 2
.globl _fill_neon_loop1
.private_extern _fill_neon_loop1
_fill_neon_loop1:
// r0 = wordPtr
// r1 = inWord
// r2 = numWordsToFill
mov r2, #1024
// Load r1 (inWord) into NEON registers
vdup.32 q0, r1
vdup.32 q1, r1
vdup.32 q2, r1
vdup.32 q3, r1
subs r2, r2, #16
bxmi lr
NEONFILL16_loop1:
vstm r0!, {d0-d7}
subs r2, r2, #16
bpl NEONFILL16_loop1
mov r0, #0
bx lr
.subsections_via_symbols
I removed the unnecessary register preservation (the push/pop) completely, in addition to the cmp within the loop. (You know, I HAVE to optimize everything :))
If auselen's guess is right, this might change the tracing timing, so that stepping into ARCLite occurs at a later point.
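As a quick sanity check that the rewritten loop stores the same amount of data as the original, the two loop shapes can be modelled in Python. This is only a sketch: the function names are made up for illustration, r2 is treated as a plain integer (32-bit wraparound is ignored), and each loop pass stands for one vstm of 16 words.

```python
def stores_original(n):
    """Count vstm executions in the question's loop (sub / cmp #15 / bgt)."""
    r2, stores = n, 0
    while True:
        stores += 1          # vstm r0!, {d0-d7}
        r2 -= 16             # sub r2, r2, #16
        if not r2 > 15:      # cmp r2, #15 ; bgt falls through
            return stores


def stores_rewritten(n):
    """Count vstm executions in the answer's loop (subs / bxmi, subs / bpl)."""
    r2, stores = n - 16, 0   # subs r2, r2, #16
    if r2 < 0:               # bxmi lr: return before storing anything
        return 0
    while True:
        stores += 1          # vstm r0!, {d0-d7}
        r2 -= 16             # subs r2, r2, #16
        if r2 < 0:           # bpl falls through once the result is negative
            return stores


if __name__ == "__main__":
    for n in (0, 16, 20, 1024):
        print(n, stores_original(n), stores_rewritten(n))
```

Under these assumptions both versions perform 64 stores for the hard-coded count of 1024. They differ only for counts below 16, where the original still stores once and the rewritten version returns immediately.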

Almost every time I've jumped somewhere strange in handwritten ARM code, it's been because I've fumbled the Thumb interworking and the function has been executing in the wrong mode. Consequently the instruction stream looks like garbage to the CPU, and it jumps about randomly until it hurts itself and falls over.
For all labels which are function entrypoints, you should have this assembly directive:
.type _fill_neon_loop1, %function
This tells the linker that when it fixes up BL instructions, or when it computes the address of the function, it should make appropriate adjustments to ensure it's executed in the correct mode.

Related

Find content of C register after execution in microprocessor

Find content of 'C' register after execution of following assembly program
MVI A, 17H
LOOP: RLC
JNC LOOP
MOV C,A
HLT
My doubt is this: since we execute MOV C,A at the end, I expected C to end up with the value of A rather than the result we got after the RLC instruction. But this didn't happen. Why?
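The program can be traced step by step with a small Python model. This is a sketch that assumes the standard 8085 meaning of RLC (rotate the accumulator left by one, with the old bit 7 copied into both bit 0 and the carry flag); the function names are made up for illustration.

```python
def rlc(a):
    """8085 RLC: rotate A left by one; old bit 7 goes to bit 0 and to carry."""
    carry = (a >> 7) & 1
    return ((a << 1) & 0xFF) | carry, carry


def run():
    a = 0x17                     # MVI A,17H
    trace = []
    while True:
        a, cy = rlc(a)           # LOOP: RLC
        trace.append((a, cy))
        if cy:                   # JNC LOOP falls through once carry is set
            break
    c = a                        # MOV C,A copies whatever A holds now
    return c, trace


if __name__ == "__main__":
    c, trace = run()
    for a, cy in trace:
        print(f"A={a:02X}H CY={cy}")
    print(f"C={c:02X}H")
```

In this model A goes 2EH, 5CH, B8H, 71H, and the carry is first set on the fourth rotation, so MOV C,A copies 71H. A is modified by every RLC, so by the time MOV C,A runs it no longer holds 17H.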

Ensure data doesn't cross page boundary

I'm trying to create a switch statement as below, which works well until something crosses a page. The switch destination is auto-generated, which is why it's in another file. 'structure, x' holds the offset (the case switch). In the case below, it will be either $00, $02, $04 or $06.
Is there any way to ensure that returnAddr isn't at $xx00? (Does that actually matter here?) And that the switchlist doesn't cross a page boundary?
lda #>returnAddr
pha
lda #<returnAddr-1
pha
; store where we want to go
lda switchlist+1
pha
lda switchlist
clc
adc structure, x
pha
rts ; make call to the proc in the switchlist
returnAddr:
; ...
rts
and in another file I have (where case_x are function labels)
switchlist:
.word case_1
.word case_2
.word case_3
.word case_4
I was always taught to structure it like this:
LDA #2 ;say we want case_3
ASL
TAX
JSR handleSwitch ;pushes the return address for us.
;return here after we goto the desired case.
handleSwitch:
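;rts adds 1 to the address it pulls, so switchlist entries should be case_N-1
LDA switchlist+1,x ;high byte of the selected entry
pha
LDA switchlist,x ;low byte of the selected entry
pha
rts ;"return" to desired case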
case_1:
rts ;each of these returns to after "jsr handleSwitch"
case_2:
rts
case_3:
rts
case_4:
rts
The key here is that you JSR to your trampoline; inlining it won't have the desired effect. If you build it this way, it doesn't matter if your return address crosses a page boundary.
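To see why a return address at $xx00 matters for the hand-pushed version in the question, here is a small Python model of what RTS does with the two pushed bytes. It is a sketch under two assumptions: RTS resumes at the pulled 16-bit address plus one (standard 6502 behaviour), and the assembler reduces `#<returnAddr-1` to the low byte of returnAddr-1 while `#>returnAddr` takes the high byte of the unadjusted address. The function names are hypothetical.

```python
def rts_target(pushed_hi, pushed_lo):
    """6502 RTS pulls low then high byte and resumes at (hi:lo) + 1."""
    return (((pushed_hi << 8) | pushed_lo) + 1) & 0xFFFF


def push_as_in_question(addr):
    """lda #>addr / pha, lda #<addr-1 / pha: high byte taken before the -1."""
    return (addr >> 8) & 0xFF, (addr - 1) & 0xFF


def push_corrected(addr):
    """Subtract 1 from the full 16-bit address first, then split it."""
    target = (addr - 1) & 0xFFFF
    return target >> 8, target & 0xFF


if __name__ == "__main__":
    for addr in (0x8034, 0x8100):
        wrong = rts_target(*push_as_in_question(addr))
        right = rts_target(*push_corrected(addr))
        print(f"{addr:04X}: question form -> {wrong:04X}, corrected -> {right:04X}")
```

For an address such as $8034 both forms return to the right place. For $8100 the question's form lands on $8200, because the borrow from the low byte never reaches the high byte. Letting JSR push the return address, as in the answer, avoids the issue entirely.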

How to convert 8085 code to Z80 assembly

I have 8085 assembly code for dividing two 8-bit values:
MVI C,FFH
LXI H,1900H
MOV A,M ; A=08
INX H
MOV B,M ; B=3
REPEAT: INR C
SUB B
JNC REPEAT
ADD B
INX H
MOV M,C
INX H
MOV M,A
HLT
If you don't use the special opcodes RIM and SIM that only the 8085 has, the resulting machine code will run in almost all cases on the Z80 without changes. This is the case with your program.
However, if your task is to translate the mnemonics, just do a search-and-replace session. Start with the first one, MVI, and change it to LD, and so on.
You will need to change operands like M to (HL), too, because that is the syntax of the Z80 assembler.
Either way, you need references for both instruction sets to do this.
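The search-and-replace approach can be sketched as a tiny Python translator. This is a hypothetical helper, not a complete tool: it only knows the mnemonics that appear in the program above, passes numeric operands through unchanged, and does not handle trailing comments.

```python
# 8085 register-pair names and their Z80 equivalents.
PAIRS = {"B": "BC", "D": "DE", "H": "HL", "SP": "SP"}


def r8(name):
    """8085 'M' (memory addressed by HL) is written (HL) in Z80 syntax."""
    return "(HL)" if name == "M" else name


def translate(line):
    """Translate one 8085 source line to Z80 syntax, keeping any label."""
    label, colon, rest = line.rpartition(":")
    fields = rest.split(None, 1)
    if not fields:
        return line
    op = fields[0].upper()
    args = [a.strip() for a in fields[1].split(",")] if len(fields) > 1 else []
    if op == "MVI":
        z80 = f"LD {r8(args[0])},{args[1]}"
    elif op == "MOV":
        z80 = f"LD {r8(args[0])},{r8(args[1])}"
    elif op == "LXI":
        z80 = f"LD {PAIRS[args[0]]},{args[1]}"
    elif op == "INX":
        z80 = f"INC {PAIRS[args[0]]}"
    elif op == "INR":
        z80 = f"INC {r8(args[0])}"
    elif op == "SUB":
        z80 = f"SUB {r8(args[0])}"
    elif op == "ADD":
        z80 = f"ADD A,{r8(args[0])}"
    elif op == "JNC":
        z80 = f"JP NC,{args[0]}"
    elif op == "HLT":
        z80 = "HALT"
    else:
        raise ValueError(f"mnemonic not covered by this sketch: {op}")
    return f"{label}{colon} {z80}" if colon else z80


if __name__ == "__main__":
    program = ["MVI C,FFH", "LXI H,1900H", "MOV A,M", "INX H", "MOV B,M",
               "REPEAT: INR C", "SUB B", "JNC REPEAT", "ADD B", "INX H",
               "MOV M,C", "INX H", "MOV M,A", "HLT"]
    for line in program:
        print(translate(line))
```

Each branch is a one-to-one rename, which is why the assembled machine code is the same on both processors for this program.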

Usage of roots and solve in MATLAB

I have an equation which goes like this:
a/k^2 - (w^2-b)/(w^4-k^2*c*w^2-d) - (e*w^2-f)/(w^4-k^2*g*w^2-h) = 0
Here a, b, c, d, e, f, g, h are constants. I want to vary k from, say, 0.6 to 10 in steps of 0.1 and find w. There are two ways to do this in MATLAB.
One way is to convert this equation into the form (something)w^8-(something else)w^6....-(something else again)w^0=0 and then use the 'roots' command in MATLAB (Method 1).
The other way is to define symbolic variables and call 'solve'. With this method you do not need to simplify the expression any further; you can enter it in its original form (Method 2).
Both ways are shown in the script below:
%%% defining values
clear; clc;
a=0.1500;
b=0.20;
c=0.52;
d=0.5;
e=6;
f=30;
g=18;
h=2;
%% Method 1: varying k using roots
tic
i=0;
for k=.6:.1:10
i=i+1;
t8=a;
t7=0;
t6=-(1+e+a*(c+g))*(k^2) ;
t5=0;
t4=(k^2*(b+f+(c*e+g)*k^2)-a*(d+h-c*g*k^4));
t3=0;
t2=k^2*(d*(e+a*g)+h+a*c*h-(c*f+b*g)*k^2);
t1=0;
t0=(a*d*h)-(d*f+b*h)*k^2;
q=[t8 t7 t6 t5 t4 t3 t2 t1 t0];
r(i,:)=roots(q);
end
krho1(:,1)=.6:.1:10;
r_real=real(r);
r_img=imag(r);
dat1=[krho1 r_real(:,1) r_real(:,2) r_real(:,3) r_real(:,4) r_real(:,5) r_real(:,6) r_real(:,7) r_real(:,8)];
fnameout=('stack_using_roots.dat');
fid1=fopen(fnameout,'w+');
fprintf(fid1,'krho\t RR1\t RR2\t RR3\t RR4\t RR5\t RR6\t RR7\t RR8\t \r');
fprintf(fid1,'%6.4f %7.10f %7.10f %7.10f %7.10f %7.10f %7.10f %7.10f %7.10f \n',dat1');
fclose(fid1);
plot(krho1, r_real(:,1),krho1, r_real(:,2),krho1, r_real(:,3),krho1, r_real(:,4),krho1, r_real(:,5),krho1, r_real(:,6),krho1, r_real(:,7),krho1, r_real(:,8))
toc
%% Method 2: varying k using solve
tic
syms w k
i=0;
for k=.6:.1:10
i=i+1;
first=a/k^2;
second=(w^2-b)/(w^4-k^2*c*w^2-d) ;
third=(e*w^2-f)/(w^4-k^2*g*w^2-h);
n(i,:)=double(solve(first-second-third, w));
end
krho1(:,1)=.6:.1:10;
r_real=real(n);
r_img=imag(n);
dat1=[krho1 r_real(:,1) r_real(:,2) r_real(:,3) r_real(:,4) r_real(:,5) r_real(:,6) r_real(:,7) r_real(:,8)];
fnameout=('stack_using_solve.dat');
fid1=fopen(fnameout,'w+');
fprintf(fid1,'krho\t RR1\t RR2\t RR3\t RR4\t RR5\t RR6\t RR7\t RR8\t \r');
fprintf(fid1,'%6.4f %7.10f %7.10f %7.10f %7.10f %7.10f %7.10f %7.10f %7.10f \n',dat1');
fclose(fid1);
figure;
plot(krho1, r_real(:,1),krho1, r_real(:,2),krho1, r_real(:,3),krho1, r_real(:,4),krho1, r_real(:,5),krho1, r_real(:,6),krho1, r_real(:,7),krho1, r_real(:,8))
toc
Method 1 uses the roots command and Method 2 uses symbolic variables and solve. My questions are as follows:
You can see that the plots from the first section appear quickly, whereas the second section takes much longer. Is there any way to increase the speed?
The plots from the two sections look very different, and you may be inclined to believe that I made mistakes in converting (a/k^2)-((w^2-b)/(w^4-k^2*c*w^2-d))-((e*w^2-f)/(w^4-k^2*g*w^2-h)) to (something)w^8-(something else)w^6....-(something else again)w^0. I can assure you that I did it correctly. You can see what really happens if you look at any particular value of krho in both dat files (stack_using_roots and stack_using_solve). For, let's say, krho=3.6, the roots are the same in both dat files, but the order in which they are written differs. That is why the plots look awkward. In short, when using the 'roots' command the solutions are given in an ordered format, whereas when using 'solve' they get shifted randomly. What is really happening? Is there any way to get around this problem?
I have run the program with
i) syms w along with n(i,:)=double(solve(first-second-third==0, w));
ii) syms w k along with n(i,:)=double(solve(first-second-third==0, w));
iii) syms w k along with n(i,:)=double(solve(first-second-third, w));
In all three cases the results seem to be the same. So what do we actually have to define as symbolic? And when should we use the expression '==0', and when not?
Are there any ways to increase the speed?
Several. Some trivial speed improvements would come from defining variables before the loop. The big bottleneck is solve. Unfortunately, there isn't an obvious analytical solution to your problem without knowing k beforehand, so there's no obvious way to pull solve outside the for loop.
In short, while using 'roots' command, the solutions are given in a ordered format, on the other hand while using 'solve', it is getting shifted randomly. Why is that?
It is not really getting "shifted". Your function is symmetric about w = 0. So, for every root r there is another root at -r. Every time you call solve, it gives you the first, second, third then fourth roots, and then the same thing but this time the roots are multiplied by -1.
Sometimes solve chooses to take out -1 as a common factor. In these cases, it first gives you the roots multiplied by -1, then the positive roots. Why it sometimes takes out -1, sometimes doesn't, I don't know, but in your case (since you don't care about the imaginary part) you can fix this by replacing double(solve(first-second-third, w)) with sort(real(double(solve(first-second-third, w)))). The order of the roots won't be the same as in Method 1, but you won't get the weird switching behaviour.
In all these 3 cases, results seem to be same. Then what is the thing that we have to define as symbolic? And when do we use and do not use the expression '==0'?
syms w k vs. syms w doesn't make a difference because you redefine k as a numeric value (0.6, 0.7,... etc). Only w needs to be symbolic.
The reference page for solve dictates how the equation should be specified. If you scroll down to the section regarding the input variable eqns, it states:
"If any elements of eqns are symbolic expressions (without the right side), solve equates the element to 0."
This is why it makes no difference whether you write first-second-third==0 or first-second-third as the first input to solve.
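The reordering effect described above, and why sorting the real parts removes it, can be illustrated with a Python analogue of sort(real(double(...))). This is a sketch only: the root values are made up for illustration and are not solutions of the equation in the question.

```python
def ordered_real_parts(roots):
    """Python analogue of MATLAB's sort(real(double(...)))."""
    return sorted(complex(r).real for r in roots)


# Made-up roots, symmetric about w = 0 as in the answer's explanation.
positive = [0.4, 1.3, 0.5 + 2.0j, 3.1]
negative = [-r for r in positive]

# A solver may return either the positive block or the negated block first.
one_call = positive + negative
another_call = negative + positive

if __name__ == "__main__":
    print(ordered_real_parts(one_call))
    print(ordered_real_parts(another_call))
```

Both calls yield the same sorted list, so each column of the result matrix follows the same root from one k to the next as long as the roots do not cross.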

Undefined result for Ripple Counter

I am writing a test bench for a ripple counter using D flip-flops. My program compiles without errors; however, I get an undefined result. How can I solve this problem?
Here is the code:
module RCounter;
reg d,d2,d3,d4,clk;
wire q,q2,q3,q4;
DFlipFlop a(d,q,clk);
DFlipFlop a1(d2,q2,q);
DFlipFlop a2(d3,q3,q2);
DFlipFlop a3(d4,q4,q3);
initial
begin
clk =1;
d=0;d2=0;d3=0;d4=0;
#2 d=1;d2=~q2; d3=~q3; d4=~q4;
#2 d=0;d2=~q2; d3=~q3; d4=~q4;
#2 d=1;d2=~q2; d3=~q3; d4=~q4;
#2 d=0;d2=~q2; d3=~q3; d4=~q4;
#2 d=1;d2=~q2; d3=~q3; d4=~q4;
#2 d=0;d2=~q2; d3=~q3; d4=~q4;
#2 d=1;d2=~q2; d3=~q3; d4=~q4;
end
always
begin
#2 assign clk = ~ clk;
end
endmodule
What am I doing wrong and how can I solve this?
What you have there is not a ripple counter, and it seems that you don't really understand the boundary between your testbench and your DUT (design under test, or in your case, the 'ripple counter').
What you have is a testbench that is simulating four independent flip flops. If you were simulating a ripple counter, you should have a module called something like 'RCounter', which is instanced inside something else called 'RCounter_TB'. The testbench should only drive the inputs (for a counter, clock and reset); it should not drive the d pins of the individual flops, as the connections between these flops are what you want to test.
Inside the ripple counter module, you define the wired connections between your flip flops. There should not be any # time delays inside this module, because a module does not have a concept of fixed time delays. If you want the d2 pin to be driven from ~q2, then you just assign it as such:
assign d2 = ~q2;
Because in hardware, this is just a wire looping back from the inverted q2 output to d2. It always exists, and it has no concept of time.
As to specifically why you're getting X's on your output, I'll guess that it comes from the flip flop design you posted in your last question.
module DFlipFlop(d,q,clk);
input d,clk;
output q;
assign q = clk?( (d==1)? 1:0) : q;
endmodule
This is not a flip flop, as there is no state retention here; it's just an assign statement with an infinite feedback loop (you essentially just have a wire driving itself).
If you want to model a flip flop, you need to use an always @(posedge clk) block, to imply that you want some state retention. I'll leave it to you to look up how to use an always block to model a flip flop.
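The structure described above (an edge-triggered flop that retains state, with each d fed from the inverted q and each stage clocked by the previous stage's q) can be modelled behaviourally in Python. This is a sketch, not Verilog: the class names are hypothetical, the wiring is taken from the instances in the question, and every flop is assumed to start at 0, which in hardware would need a reset.

```python
class DFlipFlop:
    """Positive-edge-triggered D flip-flop: q changes only on a 0->1 clock edge."""

    def __init__(self):
        self.q = 0               # assumed reset state
        self._last_clk = 0

    def clock(self, clk, d):
        rising = self._last_clk == 0 and clk == 1
        self._last_clk = clk
        if rising:
            self.q = d           # state is retained between rising edges
        return self.q


class RippleCounter:
    """Chain of flops; stage i+1 is clocked by q of stage i, and d = ~q."""

    def __init__(self, width=4):
        self.stages = [DFlipFlop() for _ in range(width)]

    def drive(self, clk):
        """Apply one clock level to the first stage and let it ripple."""
        for ff in self.stages:
            clk = ff.clock(clk, 1 - ff.q)   # returned q clocks the next stage
        return self.value()

    def value(self):
        return sum(ff.q << i for i, ff in enumerate(self.stages))


if __name__ == "__main__":
    counter = RippleCounter()
    for _ in range(5):
        print(counter.drive(1))  # rising edge
        counter.drive(0)         # falling edge, no flop reacts
```

With this wiring the model counts down from 15, because each stage toggles on the rising edge of the previous q. Only the clock is driven from outside the counter; everything else is internal wiring, which is the testbench/DUT split the answer describes.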