IEEE 802.3bt compliant PD with Low Power Consumption - ethernet

I am designing a flow transmitter with Power over Ethernet(PoE) output. I am assuming the power consumption of the device is going to be low (< 11 Watts) for which power demand fall under 802.3at standard.
Now, the customer is demanding flow transmitter's supply designed to be compliant with 802.3bt (class 4, single signature) standard. I am well aware that PSEs are backward compatible.
My doubt is,
If I design transmitter's power supply 802.3bt compliant (which is for high power PDs) and if the total power consumption in run mode to be (let's say) 5 watts, will there be any problem in operation?
The reference design I am using https://www.ti.com/lit/ug/slvub75a/slvub75a.pdf. which is based on TPS2372-4 interface IC.
If any further detail required to answer, please let me know.
Regards,
Harish KS

The PSE type and highest class defines the upper limit of its power supply, it's your choice.
See 802.3bt Clause 145.2.1 - a bt-compliant PSE is either type 3 or 4. It can be limited to any power class, from 1 through 8.
So, a 5 W PSE can only supply class 1 PDs.

Related

Relation between CPI and number of execution units when looking at SIMD intrinsics [duplicate]

This question already has answers here:
latency vs throughput in intel intrinsics
(1 answer)
What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?
(1 answer)
How many CPU cycles are needed for each assembly instruction?
(5 answers)
Closed 10 days ago.
I understand that the term Cycle Per Instruction closely relates to the superscalarity of the processor, a term which I have not fully understood. According to Wikipedia, "...a superscalar processor can execute more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to different execution units on the processor". In the same article, there is a hint that superscalarity is not necessarily related to instruction pipelining, a concept with which I'm fairly familiar.
Now, let's get concrete by taking the example of _mm256_shuffle_ps, which, according to https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#avxnewtechs=AVX,AVX2,FMA, has a CPI of 0.5 for the Alder Lake micro-architecture.
Questions:
Can I assume that there are exactly 2 identical execution units which execute _mm256_shuffle_ps in all Alder Lake chips?
How can a programmer know which separate instructions involve the same executions units?
If there are different numbers of execution units for different instructions (such as _mm256_shuffle_ps), how does the statement "X is a 4-way superscalar processor" make sense, seeing as no one number could describe the distinct multiplicities of each execution unit?
Thanks in advance for the transfer of knowledge.
Superscalar is usually a term you'd apply to CPU's of old, e.g. the original pentium. Back in those days, you'd have two seperate pipes, the U (primary) and V (secondary) pipe, which would allow you to potentially dispatch two instructions at the same time (i.e. it had 2 execution units). It was effectively a way of getting slightly better performance from an in-order processor core (although that came with caveats - e.g. pipeline bubbles could be an issue)
These days processors tend to use Out of Order Execution (OOOE) backed by a larger number of execution units. Alder Lake CPU's have 12 execution units, however those execution units tend to be specialised to some extent - e.g. load/store, pointer arithmetic, SIMD FPU units, etc. That's why you won't see 12 execution units capable of performing a shuffle. It can dispatch 12 micro-ops per cycle, but those ops can't all be the same instruction.
Can I assume that there are exactly 2 identical execution units which execute _mm256_shuffle_ps in all Alder Lake chips?
No, you can't assume that. You can assume that there are two execution units which are capable of executing _mm256_shuffle_ps, but that doesn't mean those two units are identical. For example, we can see there are 3 execution units that can work on 256bit YMM registers, and we can see from the instruction timings that all 3 can perform _mm_add_epi32. However, only 2 can perform _mm_shuffle_ps, and only 1 can perform _mm_div_ps, so they are clearly not the same....
How can a programmer know which separate instructions involve the same executions units?
Unless the manufacturer explicitly states the capabilities of each execution port (sometimes you'll find that info in the technical manual for the CPU), you're pretty much limited to making educated guesses (e.g. the Apple M1)
If there are different numbers of execution units for different instructions (such as _mm256_shuffle_ps), how does the statement "X is a 4-way superscalar processor" make sense, seeing as no one number could describe the distinct multiplicities of each execution unit?
Modern Intel processors are not superscalar, therefore describing them as such makes no sense at all.
Alder Lake is able to dispatch 12 instructions per clock, using Out-Of-Order-Execution. The types of instruction the execution units can handle, is typically geared up to cover a range of common cases. For example, consider this code:
void func(float* r, float* a, float* b) {
// basic integer ops: increment and less-than
for(int i = 0; i < 128; ++i) {
// 2 address manipulation instructions
float* addr_a = a + i * 4;
float* addr_b = b + i * 4;
// 2 load instructions
__m128 A = _mm_load_ps(addr_a);
__m128 B = _mm_load_ps(addr_b);
// an addition
__m128 R = _mm_add_ps(A, B);
// another address manipulation op
float* addr_r = r + i * 4;
// a store instruction
_mm_store_ps(addr_r, R);
}
}
Providing 12 execution units that are all capable of executing an _mm_add_ps instruction doesn't really make any sense. It makes more sense to balance the number of SIMD execution units with all those other common tasks (e.g. address manipulation, looping, etc).

perf power consumption measure: How does it work?

I noticed that perf list now has the option to measure power consumption. You can use it as follows:
$ perf stat -e power/energy-cores/ ./a.out
Performance counter stats for 'system wide':
8.55 Joules power/energy-cores/
0.949871058 seconds time elapsed
How accurate is this measurement, and how does perf estimate the power consumption?
The power/energy-cores/ perf counter is based on an MSR register called MSR_PP0_ENERGY_STATUS, which is part of the Intel RAPL interface (Intel seems to call each individual RAPL MSR a RAPL interface). A complicated model based on system activity events is used to estimate (static and dynamic) energy consumption. The MSR register name has PP0 in it, which refers to power plane 0, which is one of the RAPL domains that contains all the cores of the socket including the private caches of the cores. PP0, however, excludes the last-level cache, the interconnect, the memory controller, the graphics processor, and everything else that is in the uncore. It's impossible to measure the accuracy of MSR_PP0_ENERGY_STATUS because there is no other way to estimate the energy consumption of power plane 0 only.
It's possible to measure the accuracy of other RAPL domains though. These include the Package, DRAM, and PSys domains. For example, the accuracy of the Package domain energy estimation can be measured by comparing against the energy consumption of the whole system (which can be measured using a power meter) and running a workload that keeps the energy consumption of everything outside the package a known constant as much as possible. The accuracy of MSR_PKG_ENERGY_STATUS and MSR_DRAM_ENERGY_STATUS have been measured in different ways by different people on many different processors. You can refer to the recent paper entitled RAPL in Action: Experiences in Using RAPL for Power Measurements for more information, which also includes summaries of previous works. The paper covers Sandy Bridge, Ivy Bridge, Haswell, and Skylake. The conclusion is that MSR_PKG_ENERGY_STATUS and MSR_DRAM_ENERGY_STATUS appear to be accurate on Haswell and Skylake (the implementation has changed on Haswell, see : An Energy Efficiency Feature Survey of the Intel Haswell Processor). But this is not necessarily true on all kinds of workloads, P states, and processors. So the accuracy does not just depend on the microarchitecture.
The RAPL interface is discussed in Section 14.9 of the Intel Manual Volume 3. I noticed there are errors in the section. For example, it says client processors don't support the DRAM domain, which is not true. The client Haswell processor I'm using to write this answer supports the DRAM domain. The section is probably outdated and applies only Sandy Bridge and Ivy Bridge processors. I think it's better to read the datasheet of the processor on which you want to use RAPL.
The power/energy-pkg/ perf counter can be used to measure energy consumption of the package domain. This is the only domain that is known be supported on all Intel processors starting from Sandy Bridge.
On x86 systems, these values are based on RAPL (Running Average Power Limit) - an interface that provides built in CPU energy counters. While originally designed by Intel, AMD also provides a compatible interface on Zen systems.
The accuracy depends on the actual microarchitecture. Originally, RAPL was backed by a model with certain biases. On Intel CPUs since the Haswell architecture, it is based on measurements which are quite accurate. As far as I know there is no good understanding of the accuracy on AMD's Zen RAPL implementation.
One important thing you have to consider is the scope of the measurements. On most systems, only package and DRAM is covered1. So if you need to know how much power / energy your entire system consumes - you usually cannot easily answer that with RAPL.
Also note that RAPL is updated every 1 ms, so short workloads will have significant errors from the update rate.
1 - Skylake Desktop systems can implement a full-system RAPL. It's accuracy depends on the manufacturer.

Data transmission using RF with raspberryPi

I have a project that consisted of transmitting data wirelessly from 15 tractors to a station, the maximum distance between tractor and station is 13 miles. I used a raspberry pi 3 to collect data from tractors. with some research I found that there is no wifi or GSM coverage so the only solution is to use RF communication using VHF. so is that possible with raspberry pi or I must add a modem? if yes, what is the criterion for choosing a modem? and please if you have any other information tell me?
and thank you for your time.
I had a similar issue but possibly a little more complex. I needed to cover a maximum distance of 22 kilometres and I wanted to monitor over 100 resources ranging from breeding stock to fences and gates etc. I too had no GSM access plus no direct line of sight access as the area is hilly and the breeders like the deep valleys. The solution I used was to make my own radio network using cheap radio repeaters. Everything was battery operated and was driven by the receivers powering up the transmitters. This means that the units consume only 40 micro amps on standby and when the transmitters transmit, in my case they consume around 100 to 200 milliamps.
In the house I have a little program that transmits a poll to the receivers every so often and waits for the units to reply. This gives me a big advantage because I can, via the repeater trail (as each repeater, the signal goes through, adds its code to the returning message) actually determine were my stock are.
Now for the big issue, how long do the batteries last? Well each unit has a 18650 battery. For the fence and gate controls this is charged by a small 5 volt solar panel and after 2 years running time I have not changed any of them. For the cattle units the length of time between charges depends solely on how often you poll the units (note each unit has its own code) with one exception (a bull who wants to roam and is a real escape artist) I only poll them once or twice a day and I swap the battery every two weeks.
The frequency I use is 433Mhz and the radio transmitters and receivers are very cheap ( less then 10 cents a pair if you by them in Australia) with a very small Attiny (I think) arduino per unit (around 30 cents each) and a length on wire (34.6cm long as an aerial) for the cattle and 69.2cm for the repeaters. Note these calculations are based on the frequency used i.e. 433Mhz.
As I had to install lots of the repeaters I contacted an organisation in China (sorry they no longer exist) and they created a tiny waterproof and rugged capsule that contained everything, while also improving on the design (range wise while reducing power) at a cost of $220 for 100 units not including batterys. I bought one lot as a test and now between myself and my neighbours we bought another 2000 units for only $2750.
In my case this was paid for in less then three months when during calving season I knew exactly were they were calving and was on site to assist. The first time I used it we saved a mother who was having a real issue.
To end this long message I am not an expert but I had an idea and hired people who were and the repeater approach certainly works over long distances and large areas (42 square kilometres).
Following on from the comments above, I'm not sure where you are located but spectrum around the 400mhz range is licensed in many countries so it would be worth checking exactly what you can use.
If this is your target then this is UHF rather than VHF so if you search for 'Raspberry PI UHF shield' or 'Raspberry PI UHF module' you will find some examples of cheap hardware you can add to your raspberry pi to support communication over these frequencies. Most of the results should include some software examples also.
There are also articles on using the pins on the PI to transmit directly by modulating the voltage them - this is almost certainly going to interfere with other communications so I doubt it would meet your needs.

Compute e^x for float values in System Verilog?

I am building a neural network running on an FPGA, and the last piece of the puzzle is running a sigmoid function in hardware. This is either:
1/(1 + e^-x)
or
(atan(x) + 1) / 2
Unfortunately, x here is a float value (a real value in SystemVerilog).
Are there any tips on how to implement either of these functions in SystemVerilog?
This is really confusing to me since both of these functions are complex, and I don't even know where to begin implementing them due to the added complexity of being float values.
One simpler way for this is to create a memory/array for this function. However that option can be highly inefficient.
x should be the input address for the memory and the value at that location can be the output of the function.
Suppose value of your function is as follow. (This is just an example)
x = 0 => f(0) = 1
x = 1 => f(0) = 2
x = 2 => f(0) = 3
x = 3 => f(0) = 4
So you can create an array for this, which stored the output values.
int a[4] = `{1, 2, 3, 4};
I just finished this by Vivado HLS, which allows you to write circuits in C.
Here is my C code.
#include math.h
void exp(float a[10],b[10])
{
int i;
for(i=0;i<10;i++)
{
b[i] = exp(a[i]);
}
}
But there is a question that it is impossible to create a unsized matrix. Maybe there is another way that I don't know.
As you seem to realize, type real is not synthesizable. you need to operate on the type integer mantissa and type integer exponent separately and combine them when you are done, having tracked the sign. Once you take care of (e^-x), the rest should be straight-forward.
try this page for a quick explanation: https://www.geeksforgeeks.org/floating-point-representation-digital-logic/
and search on "floating point digital design" for more explanations/examples.
Do you really need a floating number for this? Is fixed point sufficient?
Considering (atan(x) + 1) / 2, quite likely the only useful values of x are those where the exponent is fairly small. (if the exponent is large, your answer is pi/2).
atan of a fixed point number can be calculated in HW fairly easily; there are cordic methods (see https://zipcpu.com/dsp/2017/08/30/cordic.html) and direct methods; see for example https://dspguru.com/dsp/tricks/fixed-point-atan2-with-self-normalization/
FPGA design flows in which hardware (FPGA) is targeted generally do not support floating point numbers in the FPGA fabric. Fixed point with limited precision is more commonly used.
A limited precision fixed point approach:
Use Matlab to create an array of samples for your math function such that the largest value is +/- .99999. For 8 bit precision (actually 7 with sign bit), multiply those numbers by 128, round at the decimal point and drop the fractional part. Write those numbers to a text file in 2s complement hex format. In SystemVerilog you can implement a ROM using that text file. Use $readmemh() to read these numbers into a memory style variable (one that has both a packed and unpacked dimension). Link to a tutorial:
https://projectf.io/posts/initialize-memory-in-verilog/.
Now you have a ROM with limited precision samples of your function
Section 21.4 Loading memory array data from a file in the SystemVerilog specification provides the definition for $readmh(). Here is that doc:
https://ieeexplore.ieee.org/document/8299595
If you need floating point one possibility is to use a processor soft core with a floating point unit implemented in FPGA fabric, and run software on that core. The core interface to the rest of the FPGA fabric over a physical bus such as axi4 steaming. See:
https://www.xilinx.com/products/design-tools/microblaze.html to get started.
It is a very different workflow than ordinary FPGA design and uses different tools. C or C++ compiler with math libraries (tan, exp, div, etc) would be used along with the processor core.
Another possibility for fixed point is an FPGA with a hard core processor. Xilinx Zynq is one of them. This is a complex and powerful approach. A free free book provides knowledge on how to use Zynq
http://www.zynqbook.com/.
This workflow is even more complex that soft core approach because the Zynq is a more complex platform (hard processor & FPGA integrated on one chip).
Its pretty hard to implement non-linear functions like that in hardware, and on top of that floating point arithmetic is even more costly. Its definitely better(and recommended) to work with fixed point arithmetic as mentioned in answers before. The number of precision bits in fixed point arithmetic will depend on your result accuracy and the error tolerance.
For hardware implementations, any kind of non-linear function can be approximated as piecewise linear function, and use a ROM based implementation approach as described in previous answers. The number of sample points that you take from the non-linear function determines your accuracy. The more samples you store the better approximation of the function you get. Often in hardware , number of samples you can store can become restricted by the amount of fast/local memory available to you. In this case to optimize the memory resources, you can add a little extra compute resources and perform linear interpolation to calculate the needed values.

Ghz to MIPS? Rough estimate anyone?

From the research I have done so far I learned that there the MIPS is highly dependent upon the application being run, or the language.
But can anyone give me their best guess for a 2.5 Ghz computer in MIPS? Or any other number of Ghz?
C++ if that helps.
MIPS stands for "Million Instructions Per Second", but that value becomes difficult to calculate for modern computers. Many processor architectures (such as x86 and x86_64, which make up most desktop and laptop computers) fall into the CISC category of processors. CISC architectures often contain instructions that perform several different tasks at once. One of the consequences of this is that some instructions take more clock cycles than other instructions. So even if you know your clock frequency (in this case 2.5 gigahertz), the number of instructions run per second depends mostly on which instructions a program uses. For this reason, MIPS has largely fallen out of use as a performance metric.
For some of my many benchmarks, identified in
http://www.roylongbottom.org.uk/
I produce an assembly code listing from which actual assembler instructions used can be calculated (Note that these are not actual micro instructions used by the RISC processors). The following includes %MIPS/MHz calculations based on these and other MIPS assumptions.
http://www.roylongbottom.org.uk/cpuspeed.htm
The results only apply for Intel CPUs. You will see that MIPS results depend on whether CPU, cache or RAM data is being used. For a modern CPU at 2500 MHz, likely MIPS are between 1250 and 9000 using CPU/L1 cache but much less accessing data in RAM. Then there are SSE SIMD integer instructions. Real integer MIPS for simple register based additions are in:
http://www.roylongbottom.org.uk/whatcpu%20results.htm#anchorC2D
Where my 2.4 GHz Core 2 CPU is shown to run at up to 17531 MIPS.
Roy
MIPS officially stands for Million Instructions Per Second but the Hacker's Dictionary defines it as Meaningless Indication of Processor Speed. This is because many companies use the theoretical maximum for marketing which is never achieved in real applications. E.g. current Intel processors can execute up to 4 instructions per cycle. Following this logic at 2.5 GHz it achieves 10,000 MIPS. In real applications, of course, this number is never achieved. Another problem, which slavik already mentions, is that instructions do different amounts of useful work. There are even NOPs, which–by definition–do nothing useful yet contribute to the MIPS rating.
To correct this people began using Dhrystone MIPS in the 1980s. Dhrystone is a synthetical benchmark (i.e. it is not based on a useful program) and one Dhrystone MIPS is defined relative to the benchmark performance of a VAX 11/780. This is only slightly less ridiculous than the definition above.
Today, performance is commonly measured by SPEC CPU benchmarks, which are based on real world programs. If you know these benchmarks and your own applications very well, you can make resonable predictions of performance without actually running your application on the CPU in question.
They key is to understand that performance will vary widely based on a number of characteristics. E.g. there used to be a program called The Many Faces of Go which essentially hard codes knowledge about the Board Game in many conditional if-clauses. The performance of this program is almost entirely determined by the branch predictor. Other programs use hughe amounts of memory that does not fit into any cache. The performance of these programs is determined by the bandwidth and/or latency of the main memory. Some applications may depend heavily on the throughput of floating point instructions while other applications never use any floating point instructions. You get the idea. An accurate prediction is impossible without knowing the application.
Having said all that, an average number would be around 2 instructions per cycle and 5,000 MIPS # 2.5 GHz. However, real numbers can be easily ten or even a hundred times lower.