I've been experimenting with writing to an external EEPROM over SPI, with mixed success. The data does get shifted out, but in the reverse order. The EEPROM requires a start bit followed by an opcode, which is a 2-bit code for read, write or erase; the start bit and the opcode are combined into one byte. I create a 32-bit unsigned int and bit-shift the values into it. When I transmit it, I see the actual data first, then the SB+opcode, and then the memory address. How do I reverse this so that the opcode goes out first, then the memory address, and then the actual data? As seen in the image below, the data is BCDE, SB+opcode is 07 and the memory address is 3F. The correct sequence should be 07, 3F and then BCDE (I think!).
Here is the code:
uint8_t mem_addr = 0x3F;
uint16_t data = 0xBCDE;
uint32_t write_package = (ERASE << 24 | mem_addr << 16 | data);
while (1)
{
/* USER CODE END WHILE */
/* USER CODE BEGIN 3 */
HAL_SPI_Transmit(&hspi1, &write_package, 2, HAL_MAX_DELAY);
HAL_Delay(10);
}
/* USER CODE END 3 */
It looks as if your SPI interface is set up to process 16-bit halfwords at a time. Therefore it would make sense to break up the data to be sent into 16-bit halfwords too. That would take care of the ordering.
uint8_t mem_addr = 0x3F;
uint16_t data = 0xBCDE;
uint16_t write_package[2] = {
(ERASE << 8) | mem_addr,
data
};
HAL_SPI_Transmit(&hspi1, (uint8_t *)write_package, 2, HAL_MAX_DELAY);
EDIT
Added an explicit cast. As noted in the comments, without the explicit cast it wouldn't compile as C++ code, and it would cause some warnings as C code.
You're packing your information into a 32-bit integer; line 3 of your code decides which bits of data are placed where in the word. To change the order you can replace that line with:
uint32_t write_package = (((uint32_t)data << 16) | ((uint32_t)mem_addr << 8) | (ERASE));
That shifts data 16 bits left into the most significant 16 bits of the word, shifts mem_addr up by 8 bits and ORs it in, and then places ERASE in the least significant bits. The casts to uint32_t keep the shift of a uint16_t from overflowing a signed int.
Your problem is endianness.
By default the STM32 is little-endian, so the lowest byte of the uint32_t is stored at the first address.
If I'm right, this is the declaration of the transmit function you are using:
HAL_StatusTypeDef HAL_SPI_Transmit(SPI_HandleTypeDef *hspi, uint8_t *pData, uint16_t Size, uint32_t Timeout)
It requires a pointer to uint8_t as data (not a pointer to uint32_t), so you should get at least a warning when you compile your code.
If you want to write code that is independent of the endianness in use, you should store your data in an array instead of one "big" variable.
uint8_t write_package[4];
write_package[0] = ERASE;
write_package[1] = mem_addr;
write_package[2] = (data >> 8) & 0xFF;
write_package[3] = (data & 0xFF);
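To see why the original uint32_t went out data-first, the little-endian layout can be modeled on any host. This is only an illustrative sketch: store_le32 and pack_frame are hypothetical helper names, and 0x07 for the start bit plus opcode is simply the value quoted in the question.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Models how a little-endian core stores a 32-bit word: the least
 * significant byte lands at the lowest address. */
void store_le32(uint8_t out[4], uint32_t word)
{
    assert(out != NULL);
    for (size_t i = 0; i < 4; i++) {
        out[i] = (uint8_t)(word >> (8u * i));
    }
}

/* Builds the frame in the order it should appear on the wire, without
 * depending on the host's byte order. */
void pack_frame(uint8_t out[4], uint8_t opcode, uint8_t addr, uint16_t data)
{
    assert(out != NULL);
    out[0] = opcode;                  /* start bit + opcode */
    out[1] = addr;                    /* memory address */
    out[2] = (uint8_t)(data >> 8);    /* data, high byte */
    out[3] = (uint8_t)(data & 0xFFu); /* data, low byte */
}
```

For the word 0x073FBCDE, store_le32 produces DE BC 3F 07. Read back as 16-bit halfwords on a little-endian core, that is 0xBCDE followed by 0x073F, which matches the data-first order described in the question. pack_frame(out, 0x07, 0x3F, 0xBCDE) produces 07 3F BC DE.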
I'm kinda new to the BeagleBone Black world (it runs on an AM335x Cortex-A8 processor), and I would like to use the PRU for fast analog reads at the maximum possible sampling rate.
I would like to read all 7 inputs in a loop form like:
while( n*7 < sampling_rate){ //initial value for n = 0
read(AIN0); //and store it in shared memory(7*n + 0)
read(AIN1); //and store it in shared memory(7*n + 1)
read(AIN2); //and store it in shared memory(7*n + 2)
read(AIN3); //and store it in shared memory(7*n + 3)
read(AIN4); //and store it in shared memory(7*n + 4)
read(AIN5); //and store it in shared memory(7*n + 5)
read(AIN6); //and store it in shared memory(7*n + 6)
n++;
}
so that I can read them from a host program running on the main processor. Any idea how to do so? I tried using existing code called ADCCollector.c from a package named AM335x_pru_package, but I can't figure out how to get all the addresses and values of the registers used.
This is the code I was trying to modify (ADCCollector.p):
.origin 0 // offset of the start of the code in PRU memory
.entrypoint START // program entry point, used by debugger only
#include "ADCCollector.hp"
#define BUFF_SIZE 0x00000fa0 //Total buff size: 4kbyte (each buffer has 2kbyte: 500 pieces of data)
#define HALF_SIZE BUFF_SIZE / 2
#define SAMPLING_RATE 1 //Sampling rate(16khz) //***//16000
#define DELAY_MICRO_SECONDS (1000000 / SAMPLING_RATE) //Delay by sampling rate
#define CLOCK 200000000 // PRU is always clocked at 200MHz
#define CLOCKS_PER_LOOP 2 // loop contains two instructions, one clock each
#define DELAYCOUNT DELAY_MICRO_SECONDS * CLOCK / CLOCKS_PER_LOOP / 1000 / 1000 * 3 //if sampling rate = 98000 --> = 3061.224
.macro DELAY
MOV r10, DELAYCOUNT
DELAY:
SUB r10, r10, 1
QBNE DELAY, r10, 0
.endm
.macro READADC
//Initialize buffer status (0: empty, 1: first buffer is ready, 2: second buffer is ready)
MOV r2, 0
SBCO r2, CONST_PRUSHAREDRAM, 0, 4
INITV:
MOV r5, 0 //Shared RAM address of ADC Saving position
MOV r6, BUFF_SIZE //Counting variable
READ:
//Read ADC from FIFO0DATA
MOV r2, 0x44E0D100
LBBO r3, r2, 0, 4
//Add address counting
ADD r5, r5, 4
//Write ADC to PRU Shared RAM
SBCO r3, CONST_PRUSHAREDRAM, r5, 4
DELAY
SUB r6, r6, 4
MOV r2, HALF_SIZE
QBEQ CHBUFFSTATUS1, r6, r2 //If first buffer is ready
QBEQ CHBUFFSTATUS2, r6, 0 //If second buffer is ready
QBA READ
//Change buffer status to 1
CHBUFFSTATUS1:
MOV r2, 1
SBCO r2, CONST_PRUSHAREDRAM, 0, 4
QBA READ
//Change buffer status to 2
CHBUFFSTATUS2:
MOV r2, 2
SBCO r2, CONST_PRUSHAREDRAM, 0, 4
QBA INITV
//Send event to host program
MOV r31.b0, PRU0_ARM_INTERRUPT+16
HALT
.endm
// Starting point
START:
// Enable OCP master port
LBCO r0, CONST_PRUCFG, 4, 4 //#define CONST_PRUCFG C4 taken from ADCCollector.hp
CLR r0, r0, 4
SBCO r0, CONST_PRUCFG, 4, 4
//C28 will point to 0x00012000 (PRU shared RAM)
MOV r0, 0x00000120
MOV r1, CTPPR_0
ST32 r0, r1
//Init ADC CTRL register
MOV r2, 0x44E0D040
MOV r3, 0x00000005
SBBO r3, r2, 0, 4
//Enable ADC STEPCONFIG 1
MOV r2, 0x44E0D054
MOV r3, 0x00000002
SBBO r3, r2, 0, 4
//Init ADC STEPCONFIG 1
MOV r2, 0x44E0D064
MOV r3, 0x00000001 //continuous mode
SBBO r3, r2, 0, 4
//Read ADC and FIFOCOUNT
READADC
Another question: if I simply change the #define SAMPLING_RATE from 16000 to any other number less than or equal to 200000 in the .p file, will I get that sampling rate, or should I change other things?
Thanks in advance.
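As a rough aid for reasoning about the DELAYCOUNT macro in the listing above, here is a small C sketch that evaluates the same expression for a given SAMPLING_RATE. It assumes the expression is evaluated left to right with integer division (an assumption about the assembler's preprocessor, not something verified here), and delay_count is a hypothetical helper name. It models only the delay loop count, not the ADC's own conversion timing.

```c
#include <assert.h>
#include <stdint.h>

#define PRU_CLOCK_HZ    200000000ULL /* CLOCK in the listing */
#define CLOCKS_PER_LOOP 2ULL         /* CLOCKS_PER_LOOP in the listing */

/* Evaluates DELAYCOUNT for a given SAMPLING_RATE, left to right with
 * integer division, using 64-bit math so the intermediate product
 * cannot overflow. */
uint64_t delay_count(uint32_t sampling_rate)
{
    assert(sampling_rate > 0);
    uint64_t delay_us = 1000000ULL / sampling_rate; /* DELAY_MICRO_SECONDS */
    return delay_us * PRU_CLOCK_HZ / CLOCKS_PER_LOOP / 1000 / 1000 * 3;
}
```

Under these assumptions, delay_count(98000) is 3000 rather than the 3061.224 in the listing's comment, because 1000000 / 98000 truncates to 10. For a SAMPLING_RATE of 1, the intermediate product 1000000 * 200000000 is 2e14, which does not fit in 32 bits.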
I used the C wrappers from libpruio: http://www.freebasic.net/forum/viewtopic.php?f=14&t=22501
and then used this code to get all my ADC values:
#include "stdio.h"
#include "c_wrapper/pruio.h" // include header
#include "sys/time.h"
//! The main function.
int main(int argc, char **argv) {
struct timeval start, now;
long mtime, seconds, useconds;
gettimeofday(&start, NULL);
int i, x = 0; // x counts loop iterations and must start at zero
pruIo *io = pruio_new(PRUIO_DEF_ACTIVE, 0x98, 0, 1); //! create new driver structure
if (pruio_config(io, 1, 0x1FE, 0, 4)){ // upload (default) settings, start IO mode
printf("config failed (%s)\n", io->Errr);}
else {
do {
gettimeofday(&now, NULL);
seconds = now.tv_sec - start.tv_sec;
useconds = now.tv_usec - start.tv_usec;
mtime = ((seconds) * 1000 + useconds/1000.0) + 0.5;
printf("%ld", mtime); // mtime is a long, so use %ld
for(i = 1; i < 9; i++) {
printf(",%d", io->Adc->Value[i]); //0-66504 for 0-1.8v
}
printf("\n");
x++;
}while (mtime < 100);
printf("count: %d \n", x);
pruio_destroy(io); /* destroy driver structure */
}
return 0;
}
In your example you use libpruio in IO mode (synchronous) and therefore you have no control over the sampling rate, since the host CPU doesn't work in real-time.
To get the maximum sampling rate (as mentioned in the OP) you have to use either RB or MM mode. In those modes libpruio buffers the samples in memory and the host can access them asynchronously. See example rb_file.c (or triggers.bas) in the libpruio package.
I'm trying to optimize an image format conversion on iOS using the NEON vector instruction set. I assumed it would map well to NEON because it processes a lot of similar data.
My attempts haven't gone that well, though, achieving only a marginal speedup over the naive C implementation:
for(int i = 0; i < pixelCount; ++i, ++inPixel32) {
const unsigned int r = ((*inPixel32 >> 0 ) & 0xFF);
const unsigned int g = ((*inPixel32 >> 8 ) & 0xFF);
const unsigned int b = ((*inPixel32 >> 16) & 0xFF);
*outPixel16++ = ((r >> 3) << 11) | ((g >> 2) << 5) | ((b >> 3) << 0);
}
1 megapixel image array on iPad 2:
format is [min avg max n=number of timer samples] in milliseconds
C:
[14.446 14.632 18.405 n=1000]ms
NEON:
[11.920 12.032 15.336 n=1000]ms
My attempt at a NEON implementation is below:
int i;
const int pixelsPerLoop = 8;
for(i = 0; i < pixelCount; i += pixelsPerLoop, inPixel32 += pixelsPerLoop, outPixel16 += pixelsPerLoop) {
//Read all r,g,b pixels into 3 registers
uint8x8x4_t rgba = vld4_u8(inPixel32);
//Right-shift r,g,b as appropriate
uint8x8_t r = vshr_n_u8(rgba.val[0], 3);
uint8x8_t g = vshr_n_u8(rgba.val[1], 2);
uint8x8_t b = vshr_n_u8(rgba.val[2], 3);
//Widen b
uint16x8_t r5_g6_b5 = vmovl_u8(b);
//Widen r
uint16x8_t r16 = vmovl_u8(r);
//Left shift into position within 16-bit int
r16 = vshlq_n_u16(r16, 11);
r5_g6_b5 |= r16;
//Widen g
uint16x8_t g16 = vmovl_u8(g);
//Left shift into position within 16-bit int
g16 = vshlq_n_u16(g16, 5);
r5_g6_b5 |= g16;
//Now write back to memory
vst1q_u16(outPixel16, r5_g6_b5);
}
//Do the remainder on normal flt hardware
Code was compiled via LLVM 3.0 into the following (.loc and extra labels removed):
_DNConvert_ARGB8888toRGB565:
push {r4, r5, r7, lr}
mov r9, r1
mov.w r12, #0
add r7, sp, #8
cmp r2, #0
mov.w r1, #0
it ne
movne r1, #1
cmp r0, #0
mov.w r3, #0
it ne
movne r3, #1
cmp.w r9, #0
mov.w r4, #0
it ne
movne r4, #1
tst.w r9, #3
bne LBB0_8
ands r1, r3
ands r1, r4
cmp r1, #1
bne LBB0_8
movs r1, #0
lsr.w lr, r9, #2
cmp.w r1, r9, lsr #2
bne LBB0_9
mov r3, r2
mov r5, r0
b LBB0_5
LBB0_4:
movw r1, #65528
add.w r0, lr, #7
movt r1, #32767
ands r1, r0
LBB0_5:
mov.w r12, #1
cmp r1, lr
bhs LBB0_8
rsb r0, r1, r9, lsr #2
mov.w r9, #63488
mov.w lr, #2016
mov.w r12, #1
LBB0_7:
ldr r2, [r5], #4
subs r0, #1
and.w r1, r9, r2, lsl #8
and.w r4, lr, r2, lsr #5
ubfx r2, r2, #19, #5
orr.w r2, r2, r4
orr.w r1, r1, r2
strh r1, [r3], #2
bne LBB0_7
LBB0_8:
mov r0, r12
pop {r4, r5, r7, pc}
LBB0_9:
sub.w r1, lr, #1
movs r3, #32
add.w r3, r3, r1, lsl #2
bic r3, r3, #31
adds r5, r0, r3
movs r3, #16
add.w r1, r3, r1, lsl #1
bic r1, r1, #15
adds r3, r2, r1
movs r1, #0
LBB0_10:
vld4.8 {d16, d17, d18, d19}, [r0]!
adds r1, #8
cmp r1, lr
vshr.u8 d20, d16, #3
vshr.u8 d21, d17, #2
vshr.u8 d16, d18, #3
vmovl.u8 q11, d20
vmovl.u8 q9, d21
vmovl.u8 q8, d16
vshl.i16 q10, q11, #11
vshl.i16 q9, q9, #5
vorr q8, q8, q10
vorr q8, q8, q9
vst1.16 {d16, d17}, [r2]!
Ltmp28:
blo LBB0_10
b LBB0_4
Full code is available at https://github.com/darknoon/DNImageConvert I would appreciate any help, thanks!
Here you are, a hand-optimized NEON implementation ready for Xcode:
/* IT DOESN'T WORK!!! USE THE NEXT VERSION BELOW.
* BGRA2RGB565.s
*
* Created by Jake "Alquimista" Lee on 11. 11. 1..
* Copyright 2011 Jake Lee. All rights reserved.
*/
.align 2
.globl _bgra2rgb565_neon
.private_extern _bgra2rgb565_neon
// unsigned int * bgra2rgb565_neon(unsigned int * pDst, unsigned int * pSrc, unsigned int count);
//ARM
pDst .req r0
pSrc .req r1
count .req r2
//NEON
blu .req d16
grn .req d17
red .req d18
alp .req d19
rg .req red
gb .req blu
_bgra2rgb565_neon:
pld [pSrc]
tst count, #0x7
movne r0, #0
bxne lr
loop:
pld [pSrc, #32]
vld4.8 {blu, grn, red, alp}, [pSrc]!
subs count, count, #8
vshr.u8 red, red, #3
vext.8 rg, grn, red, #5
vshr.u8 grn, grn, #2
vext.8 gb, blu, grn, #3
vst2.8 {gb, rg}, [pDst]!
bgt loop
bx lr
This version will be many times faster than what you suggested:
increased cache hit rate via PLD
conversion to "long" not necessary
fewer instructions within the loop
There is still some room for optimization, though: you could modify the loop so that it converts 16 pixels per iteration instead of 8.
Then you can schedule the instructions to avoid the two stalls completely (which is simply not possible in the 8-pixels-per-iteration version above) and also benefit from NEON's dual-issue capability.
I didn't do this because it would make the code hard to understand.
It's important to know what VEXT is supposed to do.
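As a reference for what VEXT does, here is a scalar C model of the 64-bit form VEXT.8 Dd, Dn, Dm, #imm. It is only a sketch for reasoning about the instruction; vext8 is a hypothetical helper name, not a NEON intrinsic.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of VEXT.8 Dd, Dn, Dm, #imm on 64-bit (8-byte) registers:
 * the result is bytes imm..7 of n followed by bytes 0..imm-1 of m. */
void vext8(uint8_t d[8], const uint8_t n[8], const uint8_t m[8], unsigned imm)
{
    assert(imm < 8);
    uint8_t tmp[8]; /* temporary, so d may alias n or m */
    for (unsigned i = 0; i < 8; i++) {
        unsigned src = i + imm;
        tmp[i] = (src < 8) ? n[src] : m[src - 8];
    }
    for (unsigned i = 0; i < 8; i++) {
        d[i] = tmp[i];
    }
}
```

Note that VEXT selects whole bytes across lanes; it does not shift bits within a lane.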
Now it's up to you. :)
I verified this code to be properly compiled under Xcode.
Although I'm pretty sure it works correctly as well, I cannot guarantee this since I don't have the test environment.
In case of malfunctioning, please let me know. I'll correct it accordingly then.
cya
==============================================================================
Well, here is the improved version.
Because VSRI also uses its destination register as an input and takes only one other source operand, it was not possible to make the register assignment more robust.
Please check the image format of your source image (the exact byte order of the elements).
If it's not B, G, R, A, which is the default and native one on iOS, your application will suffer heavily from internal conversions by iOS.
If it's absolutely not possible to change this for whatever reason, let me know.
I'll write a new version matching it.
PS: I forgot to remove the underscore at the start of the function prototype. Now it's gone.
/*
* BGRA2RGB565.s
*
* Created by Jake "Alquimista" Lee on 11. 11. 1..
* Copyright 2011 Jake Lee. All rights reserved.
*
* Version 1.1
* - bug fix
*
* Version 1.0
* - initial release
*/
.align 2
.globl _bgra2rgb565_neon
.private_extern _bgra2rgb565_neon
// unsigned int * bgra2rgb565_neon(unsigned int * pDst, unsigned int * pSrc, unsigned int count);
//ARM
pDst .req r0
pSrc .req r1
count .req r2
//NEON
blu .req d16
grn .req d17
red .req d18
alp .req d19
gb .req grn
rg .req red
_bgra2rgb565_neon:
pld [pSrc]
tst count, #0x7
movne r0, #0
bxne lr
.loop:
pld [pSrc, #32]
vld4.8 {blu, grn, red, alp}, [pSrc]!
subs count, count, #8
vsri.8 red, grn, #5
vshl.u8 gb, grn, #3
vsri.8 gb, blu, #3
vst2.8 {gb, rg}, [pDst]!
bgt .loop
bx lr
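The VSRI-based loop above can be checked per pixel with a scalar C model. This is a sketch for reasoning about the bit layout, not a replacement for the assembly; vsri8 and bgra_to_rgb565 are hypothetical helper names, and the returned value assumes the two stored bytes are read back as a little-endian 16-bit pixel.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of VSRI.8 Dd, Dm, #n for one lane: shift m right by n and
 * insert it into d, keeping the top n bits of d. */
uint8_t vsri8(uint8_t d, uint8_t m, unsigned n)
{
    assert(n >= 1 && n <= 8);
    uint8_t inserted_mask = (uint8_t)(0xFFu >> n); /* bits taken from m >> n */
    return (uint8_t)((d & (uint8_t)~inserted_mask) | (m >> n));
}

/* One pixel of the version 1.1 loop: returns the 16-bit value formed by
 * the bytes that vst2.8 {gb, rg} stores (gb low byte, rg high byte). */
uint16_t bgra_to_rgb565(uint8_t blu, uint8_t grn, uint8_t red)
{
    uint8_t rg = vsri8(red, grn, 5);  /* RRRRRGGG: red plus top 3 green bits */
    uint8_t gb = (uint8_t)(grn << 3); /* vshl: green bits 4..2 move to 7..5 */
    gb = vsri8(gb, blu, 3);           /* GGGBBBBB: blue fills the low 5 bits */
    return (uint16_t)((rg << 8) | gb);
}
```

For any 8-bit r, g and b, bgra_to_rgb565(b, g, r) equals ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3), the same packing as the scalar loop in the question.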
If you are on iOS or OS X, then you may be delighted to discover vImageConvert_RGBA8888toRGB565() and friends, in Accelerate.framework. This function rounds the 8-bit values to nearest 565 value.
For even better dithering, the quality of which is nearly indistinguishable from 8-bit color, try vImageConvert_AnyToAny():
vImage_CGImageFormat RGBA8888Format =
{
.bitsPerComponent = 8,
.bitsPerPixel = 32,
.bitmapInfo = kCGBitmapByteOrderDefault | kCGImageAlphaNoneSkipLast,
.colorSpace = NULL, // sRGB or substitute your own in
};
vImage_CGImageFormat RGB565Format =
{
.bitsPerComponent = 5,
.bitsPerPixel = 16,
.bitmapInfo = kCGBitmapByteOrder16Little | kCGImageAlphaNone,
.colorSpace = RGBA8888Format.colorSpace,
};
vImage_Error err = kvImageNoError;
vImageConverterRef converter = vImageConverter_CreateWithCGImageFormat(
&RGBA8888Format, &RGB565Format, NULL, kvImageNoFlags, &err );
err = vImageConvert_AnyToAny( converter, &src, &dest, NULL, kvImageNoFlags );
Either of these approaches will be vectorized and multithreaded for best performance.
You might want to use vld4q_u8() instead of vld4_u8() and adjust the rest of your code accordingly. It's hard to tell where the problem might be, but the assembler doesn't look too bad otherwise.
(I'm not familiar with NEON, nor deeply with the memory system of the iPad 2, but this is what we used to do with 88110 pixel ops, which were an early precursor to today's SIMD extensions.)
How big is the memory latency?
Could you hide it by unrolling the inner loop and running the NEON instructions on the "previous" values while the ARM pulls the "next" values from memory? A brief scan of the NEON manual implies you can run ARM and NEON instructions in parallel.
I don't think converting vld4_u8 to vld4q_u8 would improve performance.
The code seems simple enough. I am not good at ASM, so it would take me some time to look into it deeply.
The NEON code seems simple enough, but I am not quite sure about r5_g6_b5 |= g16 being used instead of vorrq_u16.
Please have a look at the optimization level too. From what I have heard, the optimization level for NEON code goes up to a maximum of 1, so performance may differ if the reference code and the NEON code are built at different default optimization levels.
I don't see anything in the NEON code that could improve on the current version.
I work on a 32-bit Intel Atom and have to port MicroC OS II to it, so there is no code that does any configuration on the Atom (no GDT, no LDT...).
My question is about the state of the 32-bit Atom after a reset: is the Atom in protected mode or not? And, most importantly, how do I check which mode it is in (which registers have to be checked, and how)?
Remark:
CR0.PE = 1 (I checked it); is that enough to prove that the Atom is in protected mode?
************ UPDATE : *****************
/*Read the IDTR*/
sidt (idt_ptr)
/*Read the GDTR*/
sgdt (gdt_ptr)
So I tried to use just the IDT's address to link my ISR into the IDT:
fill_interrupt(ISR_Nbr,(unsigned int) isr33, 0x08, 0x8E);
static void fill_interrupt(unsigned char num, unsigned int base, unsigned short sel, unsigned char flags)
{
unsigned short *Interrupt_Address;
/*address = idt_ptr.base + num * 8 byte*/
Interrupt_Address = (unsigned short *)(idt_ptr.base + num*8);
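*(Interrupt_Address) = base&0xFFFF; /* offset bits 15..0 */
*(Interrupt_Address+1) = sel; /* code segment selector */
*(Interrupt_Address+2) = (flags<<8)&0xFF00; /* reserved zero byte, then type/attribute byte */
*(Interrupt_Address+3) = (base>>16)&0xFFFF; /* offset bits 31..16 */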
}
My ISR is a simple one:
isr33:
nop
nop
cli
push %ebp //save the context to swith back
mov %esp,%ebp
pop %ebp //Return to the calling function
sti
iret //Return from the interrupt (a plain ret would leave CS and EFLAGS on the stack)
Chapter 9 of volume 3 of the Intel Software Developer's Manual says that the reset value of CR0 is 60000010H. As you can see, bit 0, aka PE, is clear.
Regardless, you can setup the descriptor tables in Protected Mode as well as in Real Mode. You just have to be more careful about it.
I suggest you check if the BIOS or OS are setting this bit at a stage before you read it.
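The reset value quoted above can be decoded bit by bit. The sketch below is a minimal illustration: cr0_flag_set is a hypothetical helper, and it works on a CR0 value that has already been read (reading CR0 itself requires privileged code, which is not shown here).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CR0_PE (1u << 0)  /* Protection Enable */
#define CR0_ET (1u << 4)  /* Extension Type */
#define CR0_NW (1u << 29) /* Not Write-through */
#define CR0_CD (1u << 30) /* Cache Disable */
#define CR0_PG (1u << 31) /* Paging */

/* Returns true if the given flag bit is set in a CR0 value. */
bool cr0_flag_set(uint32_t cr0, uint32_t flag)
{
    assert(flag != 0);
    return (cr0 & flag) != 0;
}
```

For the reset value 0x60000010, only CD, NW and ET are set, so cr0_flag_set(0x60000010u, CR0_PE) is false. If the value you read back has bit 0 set, something that ran before your code (for example a BIOS) has already enabled protected mode.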
The Atom implements the x86 instruction set and, as such, should start in real mode for compatibility. I don't have one on hand to test with, though.
Resolved: I use an N450 Atom board that already has a BIOS, and the BIOS configures the board in protected mode.