OracleSolaris 11.2 -- is /usr/kernel/drv/driver.conf required for PCI? - solaris

I'm implementing a small PCI driver for academic purposes, and one thing I'm not clear about if we actually have to provide driver.conf? Different materials which I read (including http://blog.csdn.net/hotsolaris/article/details/1763716), say that for PCI the driver config file is optional, however in my case it seems that pci_config_setup() is successful only with driver.conf provided:
name="mydrv" parent="/pci#0,0/pci8086,2e11"
Then I do:
% add_drv -i 'pciXXXX,YY' mydrv
and it adds in the system with no warning or error messages.
So I assume that some properties of a PCI device can't be derived automatically by the system, e.g. parent bus?
I would appreciate if anybody could shed some light on this. Thanks.

If you look at a random selection of very small files under /kernel/drv for actual physical hardware, you'll see that they almost always only contain the line
ddi_forceattach=1;
Pseudo drivers will have a driver.conf(4) file which reflects their parentage in the system. I really recommend reading that manpage, it goes into good detail about what's required here.

Related

How do I add a missing peripheral register to a STM32 MCU model in Renode?

I am trying out this MCU / SoC emulator, Renode.
I loaded their existing model template under platforms/cpus/stm32l072.repl, which just includes the repl file for stm32l071 and adds one little thing.
When I then load & run a program binary built with STM32CubeIDE and ST's LL library, and the code hits the initial function of SystemClock_Config(), where the Flash:ACR register is being probed in a loop, to observe an expected change in value, it gets stuck there, as the Renode Monitor window is outputting:
[WARNING] sysbus: Read from an unimplemented register Flash:ACR (0x40022000), returning a value from SVD: 0x0
This seems to be expected, not all existing templates model nearly everything out of the box. I also found that the stm32L071 model is missing some of the USARTs and NVIC channels. I saw how, probably, the latter might be added, but there seems to be not a single among the default models defining that Flash:ACR register that I could use as example.
How would one add such a missing register for this particular MCU model?
Note1: For this test, I'm using a STM32 firmware binary which works as intended on actual hardware, e.g. a devboard for this MCU.
Note2:
The stated advantage of Renode over QEMU, which does apparently not emulate peripherals, is also allowing to stick together a more complex system, out of mocked external e.g. I2C and other devices (apparently C# modules, not yet looked into it).
They say "use the same binary as on the real system".
Which is my reason for trying this out - sounds like a lot of potential for implementing systems where the hardware is not yet fully available, and also automatted testing.
So the obvious thing, commenting out a lot of parts in init code, to only test some hardware-independent code while sidestepping such issues, would defeat the purpose here.
If you want to just provide the ACR register for the flash to pass your init, use a tag.
You can either provide it via REPL (recommended, like here https://github.com/renode/renode/blob/master/platforms/cpus/stm32l071.repl#L175) or via RESC.
Assuming that your software would like to read value 0xDEADBEEF. In the repl you'd use:
sysbus:
init:
Tag <0x40022000, 0x40022003> "ACR" 0xDEADBEEF
In the resc or in the Monitor it would be just:
sysbus Tag <0x40022000, 0x40022003> "ACR" 0xDEADBEEF
If you want more complex logic, you can use a Python peripheral, as described in the docs (https://renode.readthedocs.io/en/latest/basic/using-python.html#python-peripherals-in-a-platform-description):
flash: Python.PythonPeripheral # sysbus 0x40022000
size: 0x1000
initable: false
filename: "script_with_complex_python_logic.py"
```
If you really need advanced implementation, then you need to create a complete C# model.
As you correctly mentioned, we do not want you to modify your binary. But we're ok with mocking some parts we're not interested in for a particular use case if the software passes with these mocks.
Disclaimer: I'm one of the Renode developers.

Specman beginner's questions

I am new to Specman.
I have a couple of questions:
I am trying to use the agent methodology. After writing the env,agent,bfm etc - what is the recommended way to create clock and reset? by writing a tb.v (calling the top verilog module) or is there a better way?
How do I link the specman env file to the tb (or maybe its just enough to link the ports of the different specman files with a signals_map to the verilog files?
Most important how do I run the environment with irun?
I was thinking of creating a file listing all the verilog files, e.g. - veri.lst
the specman top shall import all the specman files, e.g - spec_top.e
irun -access +wrc veri.lst spec_top.e
should be ok?
should I mention the top level module in the command?
Should I put the test name in a special way in the command?
Thanks alot for all the help!!
Cadence recommends driving clocks from inside an HDL testbench (i.e. written in Verilog in your case). This is because every time the simulator yields control to Specman to execute it wastes processor time. You want to minimize the number of switches as much as possible.
Linking the env to the TB is done by connecting the Verilog signals of interest to the corresponding Specman ports (using hdl_path()).
W.r.t. running it, there are 2 things to keep in mind. e code can be executed in compiled or in interpreted mode. Also, compiled code is faster, but can't be debugged. You have to tell irun what you want compiled and what you want interpreted:
irun -f veri.lst \
compiled_top.e \
-snload interpreted_top.e
What you typically compile are files which you don't expect to change (verification components that you buy or reuse from other projects, for example). The rest of your files you'd load interpreted to be able to easily debug.
Adding to Tudor's great answer -
First - yes, connecting The e TB to the DUT is done using hdl_path(), and connecting the ports to external. You usually would have one unit designated for the interface, so configuring it would look something like this:
extend signal_map {
// name of the instance of the verilog module you interface
keep hdl_path() == "sub_system_a";
keep bind (sig_clock, external);
// name of the clock signal
keep sig_clock.hdl_path == "clk";
};
Please take a look in the IES release, at the UVM Examples.
They are in
specman/uvm/uvm_examples
For example, check out the specman/uvm/uvm_examples/xserial/e/xserial_collector_h.e:
And about the clock -
Connecting a clock in the e TB to the design is very simple. Something like this -
unit synch {
sig_clock : in simple_port of bit is instance;
keep bind(sig_clock, external);
event clock is rise(sig_clock$) #sim;
// can define also on fall or change
};
Now the clock event can be used as sampling event for TCMs and Temporals. This is a simple fast way for using the clock in the TB.
Another way to use the clock, is more "acceleration ready". In this methodology, you would implement a clock agent in verilog, and it will provide "clock services" to the TB. According to this methodology, the TB will not have any "wait cycles" in it. instead - it will call the Clock Agent task "wait_cycles()" - and wait for indication that required number of clock cycles passed.
This is a rather new methodology, oriented to be Acceleration Ready.
It will be demonstrated in the UVM Examples in next IES release, 15.1.
/efrat

Looking for the best equivalents of prefetch instructions for ia32, ia64, amd64, and powerpc

I'm looking at some slightly confused code that's attempted a platform abstraction of prefetch instructions, using various compiler builtins. It appears to be based on powerpc semantics initially, with Read and Write prefetch variations using dcbt and dcbtst respectively (both of these passing TH=0 in the new optional stream opcode).
On ia64 platforms we've got for read:
__lfetch(__lfhint_nt1, pTouch)
wherease for write:
__lfetch_excl(__lfhint_nt1, pTouch)
This (read vs. write prefetching) appears to match the powerpc semantics fairly well (with the exception that ia64 allows for a temporal hint).
Somewhat curiously the ia32/amd64 code in question is using
prefetchnta
Not
prefetchnt1
as it would if that code were to be consistent with the ia64 implementations (#ifdef variations of that in our code for our (still live) hpipf port and our now dead windows and linux ia64 ports).
Since we are building with the intel compiler I should be able to many of our ia32/amd64 platforms consistent by switching to the xmmintrin.h builtins:
_mm_prefetch( (char *)pTouch, _MM_HINT_NTA )
_mm_prefetch( (char *)pTouch, _MM_HINT_T1 )
... provided I can figure out what temporal hint should be used.
Questions:
Are there read vs. write ia32/amd64 prefetch instructions? I don't see any in the instruction set reference.
Would one of the nt1, nt2, nta temporal variations be preferred for read vs. write prefetching?
Any idea if there would have been a good reason to use the NTA temporal hint on ia32/amd64, yet T1 on ia64?
Are there read vs. write ia32/amd64 prefetch instructions? I don't see any in the instruction set reference.
Some systems support the prefetchw instructions for writes
Would one of the nt1, nt2, nta temporal variations be preferred for read vs. write prefetching?
If the line is exclusively used by the calling thread, it shouldn't matter how you bring the line, both reads and writes would be able to use it. The benefit for prefetchw mentioned above is that it will bring the line and give you ownership on it, which may take a while if the line was also used by another core. The hint level on the other hand is orthogonal with the MESI states, and only affects how long would the prefetched line survive. This matters if you prefetch long ahead of the actual access and don't want to prefetch to get lost in that duration, or alternatively - prefetch right before the access, and don't want the prefetches to thrash your cache too much.
Any idea if there would have been a good reason to use the NTA temporal hint on ia32/amd64, yet T1 on ia64?
Just speculating - perhaps the larger caches and aggressive memory BW are more vulnerable to bad prefetching and you'd want to reduce the impact through the non-temporal hint. Consider that your prefetcher is suddenly set loose to fetch anything it can, you'd end up swamped in junk prefetches that would through away lots of useful cachelines. The NTA hint makes them overrun each other, leaving the rest undamaged.
Of course this may also be just a bug, I can't tell for sure, only whoever developed the compiler, but it might make sense for the reason above.
The best resource I could find on x86 prefetching hint types was the good ol' article What Every Programmer Should Know About Memory.
For the most part on x86 there aren't different instructions for read and write prefetches. The exceptions seem to be those that are non-temporal aligned, where a write can bypass the cache but as far as I can tell, a read will always get cached.
It's going to be hard to backtrack through why the earlier code owners used one hint and not the other on a certain architecture. They could be making assumptions about how much cache is available on processors in that family, typical working set sizes for binaries there, long term control flow patterns, etc... and there's no telling how much any of those assumptions were backed up with good reasoning or data. From the limited background here I think you'd be justified in taking the approach that makes the most sense for the platform you're developing on now, regardless what was done on other platforms. This is especially true when you consider articles like this one, which is not the only context where I've heard that it's really, really hard to get any performance gain at all with software prefetches.
Are there any more details known up front, like typical cache miss ratios when using this code, or how much prefetches are expected to help?

A trivial SYSENTER/SYSCALL question

If a Windows executable makes use of SYSENTER and is executed on a processor implementing AMD64 ISA, what happens? I am both new and newbie to this topic (OSes, hardware/software interaction) but from what I've read I have understood that SYSCALL is the AMD64 equivalent to Intel's SYSENTER. Hopefully this question makes sense.
If you try to use SYSENTER where it is not supported, you'll probably get an "invalid opcode" exception.
Note that this situation is unusual - generally, Windows executables do not directly contain instructions to enter kernel mode.
As far as i know AM64 processors using different type of modes to handle such issues.
SYSENTER works fine but is not that fast.
A very useful site to get started about the different modes:
Wikipedia
They got rid of a bunch of unused functionality when they developed AMD64 extensions. One of the main ones is the elimination of the cs, ds, es, and ss segment registers. Normally loading segment registers is an extremely expensive operation (the CPU has to do permission checks, which could involve multiple memory accesses). Entering kernel mode requires loading new segment register values.
The SYSENTER instruction accelerates this by having a set of "shadow registers" which is can copy directly to the (internal, hidden) segment descriptors without doing any permission checks. The vast majority of the benefit is lost with only a couple of segment registers, so most likely the reasoning for removing the support for the instructions is that using regular instructions for the mode switch is faster.

How to determine SSE prefetch instruction size?

I am working with code which contains inline assembly for SSE prefetch instructions. A preprocessor constant determines whether the instructions for 32-, 64- or 128-bye prefetches are used. The application is used on a wide variety of platforms, and so far I have had to investigate in each case which is the best option for the given CPU. I understand that this is the cache line size. Is this information obtainable automatically? It doesn't seem to be explicitly present in /proc/cpuinfo.
I think your question is related to this question or this one. I think it is clear that - unless you can rely on a OS or library-function - you will want to use the CPUID instruction, but the question then becomes exactly what information you are looking for. - And of course, AMD's and Intel's implementations don't need to agree. This page suggests using Cpuid.1.EBX[15:8] (i.e., BH) for finding out on Intel and function 80000005h on AMD. In addition, on Intel, CPUID.2... seems to contain the relevant information, but it looks like a real pain to parse out the desired information.
I think, from what I've read, both AMD and Intel CPUID instructions will support CPUID.1.EBX[15:8], which returns the size of one cache line in QUADWORDs as used by the CLFLUSH instruction (which isn't present on all processors, so I don't know whether you'll always find something there). So, after executing CPUID.1, you'd have to multiply BH by 8 to get the cache line size in bytes. This hinges on my implicit assumption (please can anyone say whether it is really valid?) that the definition of one cache line size is always the same for CLFLUSH and PREFETCHh instructions.
Also, Intel's manuals states that PREFETCHh is only a hint, but that, if it prefetches anything, it will always be a minimum of 32 bytes.
EDIT1:
Another useful resource (even if not directly answering your question) for the optimised use of PREFETCHh is Intel's optimisation manual here.