SIMD intrinsic and memory bus size - How CPU fetches all 128/256 bits in a single memory read? - x86-64

Hello Forum – I have a few similar/related questions about SIMD intrinsics for which I searched online, including Stack Overflow, but did not find good answers, so I'm requesting your help.
Basically I am trying to understand how a 64-bit CPU fetches all 128 bits in a single read and what the requirements are for such an operation.
Would the CPU fetch all 128 bits from memory in a single memory operation, or would it do two 64-bit reads?
Do CPU manufacturers require a certain memory bus width? For example, for a 64-bit CPU, would Intel require a 128-bit bus for memory-bound SSE operations?
Are these operations dependent on memory bus width, the number of channels, and the number of memory modules?

Loads/stores don't go directly to memory (unless you use them on an uncacheable memory region). Even NT stores go into a write-combining fill buffer.
Load/stores go between execution units and L1D cache. CPUs internally have wide data paths from cache to execution units, and from L1D to outer caches. See How can cache be that fast? on electronics.SE, about Intel IvyBridge.
e.g. IvB has 128b data paths between execution units and L1D. Haswell widened that to 256 bits. Unaligned loads/stores have full performance as long as they don't cross a cache-line boundary. Skylake-AVX512 widened that to 512 bits, so it can do 2 64-byte loads and a 64-byte store in a single clock cycle. (As long as data is hot in L1D cache).
AMD CPUs including Ryzen handle 256b vectors in 128b chunks (even in the execution units, unlike Intel after Pentium M). Older CPUs (e.g. Pentium III and Pentium-M) split 128b loads/stores (and vector ALU) into two 64-bit halves because their load/store execution units were only 64 bits wide.
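For reference, here is roughly what single 128-bit and 256-bit loads/stores look like at the source level (a minimal sketch using SSE/AVX intrinsics; the function names are just for illustration). Each intrinsic compiles to one vector load or store instruction, which the core executes as a single load/store uop against L1D, not as a direct DRAM transaction:

    #include <immintrin.h>

    // Two 128-bit loads and two 128-bit stores copy 32 bytes.
    void copy32(float* dst, const float* src) {
        __m128 lo = _mm_loadu_ps(src);        // one 128-bit (16-byte) load
        __m128 hi = _mm_loadu_ps(src + 4);    // second 128-bit load
        _mm_storeu_ps(dst,     lo);           // 128-bit store
        _mm_storeu_ps(dst + 4, hi);
    }

    #ifdef __AVX__
    // With AVX, the same copy is one 256-bit load plus one 256-bit store.
    void copy32_avx(float* dst, const float* src) {
        __m256 v = _mm256_loadu_ps(src);      // one 256-bit (32-byte) load
        _mm256_storeu_ps(dst, v);             // one 256-bit store
    }
    #endif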
The memory controllers are DDR2/3/4. The bus is 64 bits wide, but uses a burst mode with a burst size of 64 bytes (not coincidentally, the size of a cache line).
Being a "64-bit" CPU is unrelated to the width of any internal or external data buses. That terminology did get used for other CPUs in the past, but even P5 Pentium had a 64-bit data bus. (aligned 8-byte load/store is guaranteed atomic as far back as P5, e.g. x87 or MMX.) 64-bit in this case refers to the width of pointers, and of integer registers.
Further reading:
David Kanter's Haswell deep dive compares data-path widths in Haswell vs. SnB cores.
What Every Programmer Should Know About Memory (but note that much of the software-prefetch stuff is obsolete, modern CPUs have better HW prefetchers than Pentium4). Still essential reading, especially if you want to understand how CPUs are connected to DDR2/3/4 memory.
Other performance links in the x86 tag wiki.
Enhanced REP MOVSB for memcpy for more about x86 memory bandwidth. Note especially that single-threaded bandwidth can be limited by max_concurrency / latency, rather than by the DRAM controller, especially on a many-core Xeon (higher latency to L3 / memory).

Related

If a 32 bit address processor can access 4GB how does this processor deal with hard disk of size 500 Gb?

A 32-bit register can store 2^32 different values. The signed range of integer values that can be stored in 32 bits is -2,147,483,648 through 2,147,483,647 (unsigned: 0 through 4,294,967,295). Hence, a processor with 32-bit memory addresses can directly access 4 GiB of byte-addressable memory. So how does this kind of processor deal with a disk larger than 4 GB?
For disks: typically they're not byte-addressable, and the smallest amount that can be read or written (the block size) is 512 bytes or larger (maybe 4096 bytes). Block numbers may also be larger than 32 bits (e.g. 48-bit block numbers).
With 512-byte blocks and 48-bit block numbers (as ATA/SATA use for LBA48) you'd end up with a maximum disk size of 134,217,728 GiB.
Of course the CPU probably (see note) can't directly access any of the data on disk. Software (the file system) has to ask a device driver to fetch the block(s) it wants, and the device driver asks the hardware (disk controller) to copy data between disk and memory. Depending on the OS, this software interface (used by the file system to ask the device driver to read or write blocks) most likely uses 64-bit block numbers (e.g. two 32-bit registers joined together).
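As a rough illustration of "two 32-bit registers joined together" (a hypothetical interface, not any particular OS's driver API):

    #include <cstdint>

    // Hypothetical 32-bit driver interface: the file system passes the block
    // number as two 32-bit halves and the driver reassembles a 64-bit LBA.
    std::uint64_t make_block_number(std::uint32_t high, std::uint32_t low) {
        return (static_cast<std::uint64_t>(high) << 32) | low;
    }

    // Byte offset on disk = block number * block size (512 or 4096 bytes);
    // this can be far larger than the 4 GiB a 32-bit pointer can cover.
    std::uint64_t byte_offset(std::uint64_t block, std::uint32_t block_size) {
        return block * block_size;
    }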
Note: More recently, the possibility of using non-volatile RAM (e.g. https://en.wikipedia.org/wiki/3D_XPoint ) as storage changed things (it is byte addressable and does use physical addresses); but modern hardware is all "64-bit" (with physical addresses that may be 48 bits or larger in practice) so even though the theoretical limit is much smaller it's still large enough in practice (e.g. maybe 200000 GiB).
On x86-64, the main CPU architecture of most desktop computers, disk access is standardized by AHCI (Advanced Host Controller Interface). The computer accesses the hard disk through an AHCI controller which is, in practice, a PCI-Express device compliant with the PCI-Express specification.
With PCI/PCIe, certain ranges of the physical address space are routed by the memory controller to device registers instead of RAM (memory-mapped I/O). When software (the operating system) writes to those uncacheable ranges, the writes go to the registers of the PCI device rather than to memory. This way it can tell devices, including an AHCI controller, to perform operations such as a DMA transfer from the hard disk to RAM and vice versa.
I didn't read the specification in full, but the specification most likely defines 64-bit registers that can be written as two 32-bit words. Typically, software writes the lower half of the register and then the upper half. This allows 32-bit computers to still interact with AHCI. On a 64-bit computer, such a register can be written with a single 64-bit write.
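A minimal sketch of that idea (the struct below is a generic split 64-bit MMIO register, not copied from the AHCI spec; real drivers would use the OS's MMIO accessor functions):

    #include <cstdint>

    // Hypothetical memory-mapped 64-bit register exposed as two 32-bit halves,
    // the way AHCI exposes 64-bit base addresses as a low/high DWORD pair.
    struct SplitReg64 {
        volatile std::uint32_t lo;   // lower 32 bits
        volatile std::uint32_t hi;   // upper 32 bits
    };

    // A 32-bit OS writes the two halves separately...
    void write_from_32bit_os(SplitReg64* reg, std::uint32_t lo, std::uint32_t hi) {
        reg->lo = lo;
        reg->hi = hi;
    }

    // ...while a 64-bit OS can (if the device permits it) do one 64-bit store.
    void write_from_64bit_os(SplitReg64* reg, std::uint64_t value) {
        *reinterpret_cast<volatile std::uint64_t*>(reg) = value;
    }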
A 32-bit computer is thus still able to trigger DMA transfers to and from portions of the hard disk that lie far beyond 4 GB.

Using a cluster of Raspberry Pi 4 as a cluster for number crunching?

So I am currently developing an algorithm in MATLAB that is computationally expensive but is parallel processing friendly. Given that, I have been using the parallel processing library but I am still falling short of my computation time goals.
I am currently running my algorithm on an Intel i7 8086K CPU (6 cores, 12 logical, @4.00 GHz, turbo is 5 GHz).
Here are my questions:
If I was to purchase, let's say, 10 Raspberry Pi 4 SBCs (4 cores @1.5 GHz), could I use my main desktop as the host and the Pis as the clients? (Let us assume I migrate my algorithm to C++ and run it in Ubuntu for now.)
1a. If I was to go through with the build in question 1, will there be a significant upgrade in computation for the ~$500 spent?
1b. If I am not able to use my desktop as the host (I believe this shouldn't be an issue), how many Raspberry Pis would I need to equal my current CPU, or how many would I need to make it advantageous to work on a Pi cluster vs. my computer?
Is it possible to run Windows on the host computer and Linux on the clients (Pis) so that I can continue using MATLAB?
Thanks for your help; any other advice and recommendations are welcome.
Does your algorithm bottleneck on raw FMA / FLOPS throughput? If so then a cluster of weak ARM cores is more trouble than it's worth. I'd expect a used Zen2 machine, or maybe Haswell or Broadwell, could be good if you can find one cheaply. (You'd have to look at core counts, clocks, and FLOPS/$. And whether the problem would still not be memory bottlenecked on an older system with less memory bandwidth.)
If you bottleneck instead on cache misses from memory bandwidth or latency (e.g. cache-unfriendly data layout), there might possibly be something to gain from having more weaker CPUs each with their own memory controller and cache, even if those caches are smaller than your Intel.
Does Matlab use your GPU at all (e.g. via OpenCL)? Your current CPU's peak double (FP64) throughput from the IA cores is 384 GFLOPS, while its integrated GPU is only capable of about 115.2 GFLOPS FP64. For single precision it's 768 GFLOPS from your x86 cores vs. 460.8 GFLOPS from the iGPU. Again, theoretical max throughput, running 2x 256-bit SIMD FMA instructions per clock cycle per core on the CPU.
Upgrading to a beefy GPU could be vastly more effective than a cluster of RPi4. e.g. https://en.wikipedia.org/wiki/FLOPS#Hardware_costs shows that cost per single-precision GFLOP in 2017 was about 5 cents, adding big GPUs to a cheapo CPU. Or 79 cents per double-precision GFLOP.
If your problem is GPU-friendly but Matlab hasn't been using your GPU, look into that. Maybe Matlab has options, or you could use OpenCL from C++.
will there be a significant upgrade in computation for the ~$500 spent?
RPi4 model B has a Broadcom BCM2711 SoC. The CPU is Cortex-A72.
Their cache hierarchy is 32 KB data + 48 KB instruction L1 cache per core, with a 1 MB shared L2 cache. That's weaker than your 4 GHz i7 with 32k L1d + 256k L2 private per core, and a shared 12 MiB L3 cache. But faster cores waste more cycles in the same absolute time waiting for a cache miss, and the ARM chips run their DRAM at a competitive DDR4-2400.
RPi CPUs are not FP powerhouses. There's a large gap in the raw numbers, but with enough of them the throughput does add up.
https://en.wikipedia.org/wiki/FLOPS#FLOPs_per_cycle_for_various_processors shows that Cortex-A72 has peak FPU throughput of 2 double FLOPS per core per cycle, vs. 16 for Intel since Haswell, AMD since Zen2.
Dropping to single precision float improves x86 by a factor of 2, but A72 by a factor of 4. Apparently their SIMD units have lower throughput for FP64 instructions, as well as half the work per SIMD vector. (Some other ARM cores aren't extra slow for double, just the expected 2:1, like Cortex-A57 and A76.)
But all this is peak FLOPS throughput; coming close to that in real code is only achieved with well-tuned code with good computational intensity (lots of work each time the data is loaded into cache, and/or into registers). e.g. a dense matrix multiply is the classic example: O(n^3) FPU work over O(n^2) data, in a way that makes cache-blocking possible. Or Prime95 is another example.
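For instance, a plain dense matrix multiply looks like the naive sketch below: O(n^3) FMAs over O(n^2) data. (Real BLAS libraries block this loop nest so each tile stays hot in cache; this is just to show the shape of a high-computational-intensity kernel.)

    #include <cstddef>

    // Naive ikj-ordered matmul: C += A * B, all n x n row-major matrices.
    // The inner loop is contiguous over B and C, so it vectorizes well.
    void matmul_naive(const double* A, const double* B, double* C, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t k = 0; k < n; ++k) {
                double a = A[i * n + k];
                for (std::size_t j = 0; j < n; ++j)
                    C[i * n + j] += a * B[k * n + j];
            }
    }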
Still, a rough back of the envelope calculation, being generous and assuming sustained non-turbo clocks for the Coffee Lake. (All 6 cores busy running 2x 256-bit FMA instructions per clock makes a lot of heat. That's literally what Prime95 does, so expect that level of power consumption if your code is that efficient.)
6 cores * 4 GHz * 4 elements/vec * 2 vec/cycle = 192 G FMAs/sec = 384 GFLOP/sec on the CFL
4 * 1.5GHz * 2 DP flops / clock = 12 GFLOP / sec per RPi.
With 5x RPi systems, that's 60 GFLOPS added to your existing 384 GFLOP/s.
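The same back-of-the-envelope numbers as a trivial program (the constants are the ones assumed above):

    #include <cstdio>

    int main() {
        // Coffee Lake i7-8086K: 2x 256-bit FMA per core per clock, FP64.
        double cfl = 6.0 /*cores*/ * 4e9 /*Hz*/ * 2 /*FMA/clk*/ * 4 /*doubles/vec*/ * 2 /*FLOP/FMA*/;
        // Cortex-A72: 2 DP FLOPs per core per clock.
        double rpi = 4.0 /*cores*/ * 1.5e9 /*Hz*/ * 2 /*FLOP/clk*/;
        std::printf("i7-8086K peak: %.0f GFLOPS\n", cfl / 1e9);       // 384
        std::printf("one RPi4 peak: %.0f GFLOPS\n", rpi / 1e9);       // 12
        std::printf("5x RPi4:       %.0f GFLOPS\n", 5 * rpi / 1e9);   // 60
        return 0;
    }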
Doesn't sound worth the trouble to manage 5 RPi systems for a fraction of your existing total FP throughput. But again, if your problem has the right kind of parallelism, a GPU can run it much more efficiently. 60 GFLOPS for $500 is not a good deal compared to roughly $50 for the same 60 GFLOPS (FP64) from a high-end (in 2017) video card.
The GPU in an RPi might have some compute capability, but almost certainly not worth it compared to slapping a 500$ discrete GPU into your existing machine if your code is CPU-friendly.
Or your problem might not scale with theoretical max FLOPS, but instead perhaps with cache bandwidth or some other factor.
Is it possible to run Windows on the host computer and Linux on the clients (Pis) so that I can continue using MATLAB?
Zero clue; I'm only considering theoretical best case for efficient machine code running on these CPUs.

Differences between current gen Xeon Processors

What's the actual differences between Xeon W series, Bronze, Silver, Gold and Platinum series?
With earlier generations of Xeons, the E3 parts were single-socket CPUs, whereas E5s could be used in motherboards with two sockets, and E7s supported quad-socket (probably 8-socket too).
However, with the current-generation Xeons, most of the lineup has a scalability of 2S (2 processors in one motherboard).
If Xeon Silver and Xeon Platinum can both be used in a dual-socket motherboard, why would I need a Platinum processor, which is at least 5x more expensive than the Silver part? Unless there are other differences.
What are the differences between the current-gen Xeon processors? I see some differences in cache size. Other than that, I couldn't find anything else.
Gold/Platinum has more cores per socket, and/or higher base or turbo clocks. That's most of what you're paying for.
The extra UPI links that let them work in 4S or higher systems aren't relevant when being used in a 2 socket system, but that's not the only feature. Presumably it's only a small part of the cost. With the change from inclusive L3 cache to non-inclusive, Skylake Xeon and later already need a snoop filter separate from L3 tags even for single-socket, unlike Xeon E5 which just broadcast everything to the other socket. Presumably Xeon-SP's snoop filter can work for filtering snoops to the other socket as well so it didn't need to be a separate feature for 1S vs. 2S.
e.g. the top-end 2nd-gen (Cascade Lake) Intel® Xeon® Platinum 9282 Processor has 56 cores (112 threads), max turbo = 3.8 GHz, base clock = 2.6 GHz, and 77 MB of L3 Cache.
The top-end Silver is Intel® Xeon® Silver 4216: 16c/32t 3.2 GHz turbo, 2.10 GHz base, 22 MB L3 cache.
Despite having almost 4x the cores, sustained and peak turbo clocks are higher on the Platinum. (With a 400W TDP, vs. 100W for the Silver! Less-insane Platinum chips have lower TDPs, e.g. a 32c/64t with 2.3 GHz base / 3.7 GHz turbo is 250W TDP.)
Also, some (all?) Silver / Bronze CPUs only have one AVX512 FMA execution unit so throughput for 512-bit SIMD FP math instructions is reduced, including all FP math and int<->FP conversions, and _mm512_lzcnt_epi32. Look for the # of AVX-512 FMA Unit line on the Ark page for a specific CPU. For integer SIMD, only multiply is affected. (In hardware, SIMD integer multiply uops run on the FMA units.) Shifts, blends, shuffles, add/sub, compare, and boolean all have separate vector ALUs which are 512 bits wide and don't take as much die area as multipliers.
Even that top-end Silver 4216 Cascade Lake has only 1 512-bit FMA unit.
Running AVX2 code there's zero difference. Even AVX512 using only 256-bit vectors is fine. (gcc -march=skylake-avx512 defaults to -mprefer-vector-width=256 because using 512-bit vectors at all reduces max turbo temporarily. It wants to avoid the case where one unimportant 512-bit-vectorized loop gimps the clock speed for the rest of the program that spends most of its time in scalar code.)
But if you're doing heavy AVX-512 FP number crunching you probably want a CPU with 2 FMA units and to compile with 512-bit vectors.
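As an illustration, a loop like the one below auto-vectorizes either way; only the compiler flags decide the vector width (a sketch, so check the generated asm for your own compiler version):

    // g++ -O3 -march=skylake-avx512 axpy.cpp
    //     -> 256-bit vectors by default (-mprefer-vector-width=256)
    // g++ -O3 -march=skylake-avx512 -mprefer-vector-width=512 axpy.cpp
    //     -> 512-bit vectors, which may reduce max turbo while they're in use
    #include <cstddef>

    void axpy(float* y, const float* x, float a, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            y[i] += a * x[i];   // auto-vectorized; typically contracted to FMA
    }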
IDK why you tagged this Xeon Phi; that's a totally different microarchitecture.

Why do x86-64 systems have only a 48 bit virtual address space?

In a book I read the following:
32-bit processors have 2^32 possible addresses, while current 64-bit processors have a 48-bit address space
My expectation was that if it's a 64-bit processor, the address space should also be 2^64.
So I was wondering what is the reason for this limitation?
Because that's all that's needed. 48 bits give you an address space of 256 terabyte. That's a lot. You're not going to see a system which needs more than that any time soon.
So CPU manufacturers took a shortcut. They use an instruction set which allows a full 64-bit address space, but current CPUs just only use the lower 48 bits. The alternative was wasting transistors on handling a bigger address space which wasn't going to be needed for many years.
So once we get near the 48-bit limit, it's just a matter of releasing CPUs that handle the full address space, but it won't require any changes to the instruction set, and it won't break compatibility.
Any answer referring to the bus size and physical memory is slightly mistaken, since OP's question was about virtual address space not physical address space. For example the supposedly analogous limit on some 386's was a limit on the physical memory they could use, not the virtual address space, which was always a full 32 bits. In principle you could use a full 64 bits of virtual address space even with only a few MB of physical memory; of course you could do so by swapping, or for specialized tasks where you want to map the same page at most addresses (e.g. certain sparse-data operations).
I think the real answer is that AMD was just being cheap and hoped nobody would care for now, but I don't have references to cite.
Read the limitations section of the wikipedia article:
A PC cannot contain 4 petabytes of memory (due to the size of current memory chips if nothing else) but AMD envisioned large servers, shared memory clusters, and other uses of physical address space that might approach this in the foreseeable future, and the 52 bit physical address provides ample room for expansion while not incurring the cost of implementing 64-bit physical addresses
That is, there's no point implementing full 64 bit addressing at this point, because we can't build a system that could utilize such an address space in full - so we pick something that's practical for today's (and tomorrow's) systems.
The internal native register/operation width does not need to be reflected in the external address bus width.
Say you have a 64 bit processor which only needs to access 1 megabyte of RAM. A 20 bit address bus is all that is required. Why bother with the cost and hardware complexity of all the extra pins that you won't use?
The Motorola 68000 was like this; 32 bits internally, but with a 24-bit address bus (23 address pins plus byte-select strobes) and a 16-bit data bus. The CPU could access 16 megabytes of RAM, and loading the native data type (32 bits) took two memory accesses (each carrying 16 bits of data).
There is a more severe reason than just saving transistors in the CPU address path: if you increase the size of the address space you need to increase the page size, increase the size of the page tables, or have a deeper page table structure (that is more levels of translation tables). All of these things increase the cost of a TLB miss, which hurts performance.
From my point of view, this results from the page size. Each 4096-byte page holds at most 4096/8 = 512 page-table entries, and 2^9 = 512, so each level of the page tables translates 9 bits of the address. With 4 levels plus the 12-bit page offset: 9 * 4 + 12 = 48.
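Spelled out (assuming x86-64's 4 KiB pages, 8-byte page-table entries, and 4 levels of tables):

    #include <cstdio>

    int main() {
        const int page_bits    = 12;                 // 4096-byte pages -> 12 offset bits
        const int entry_size   = 8;                  // bytes per page-table entry
        const int entries      = 4096 / entry_size;  // 512 entries per table
        const int bits_per_lvl = 9;                  // log2(512)
        const int levels       = 4;                  // PML4 -> PDPT -> PD -> PT
        // 4 * 9 + 12 = 48; a 5th paging level (on newer CPUs) extends this to 57.
        std::printf("virtual address bits = %d\n", levels * bits_per_lvl + page_bits);
        return 0;
    }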
Many people have this misconception, but I promise that if you read this carefully, it will clear things up.
Saying a processor is 32-bit or 64-bit does not mean it must have a 32-bit or 64-bit address bus respectively! I repeat: it DOESN'T.
A 32-bit processor means it has a 32-bit ALU (Arithmetic and Logic Unit), i.e. it can operate on 32-bit binary operands, and similarly a 64-bit processor can operate on 64-bit operands. So whether a processor is 32-bit or 64-bit does NOT signify the maximum amount of memory that can be installed; it only says how large the operands can be. (As an analogy, a 10-digit calculator can produce results of up to 10 digits; it cannot give an 11-digit result. The analogy is in decimal, but the idea is the same.) What you are describing is the address space, i.e. the maximum directly addressable amount of memory (RAM). The maximum possible RAM size is determined by the width of the address bus, not by the width of the data bus or of the ALU on which the 32/64-bit label is based. Yes, if a processor has a 32-bit address bus then it can address 2^32 bytes = 4 GB of RAM (and for a 64-bit address bus it would be 2^64), but calling a processor 32-bit or 64-bit has no direct bearing on that address space; the label depends only on the size of its ALU. Of course the data bus and address bus may happen to be the same size, and then it may seem that a 32-bit processor means 2^32 bytes = 4 GB of accessible memory, but that is only a coincidence and it isn't true in general. For example, the Intel 8086 is a 16-bit processor (it has a 16-bit ALU), so by that reasoning it should only have been able to access 2^16 bytes = 64 KB of memory, yet it can address up to 1 MB thanks to its 20-bit address bus. You can check this if you have any doubts.
I think I have made my point clear. Now coming to your question: since a 64-bit processor doesn't have to have a 64-bit address bus, there is nothing wrong with a 48-bit address bus in a 64-bit processor. They kept the address space smaller to make the design and fabrication cheaper, since nobody is going to use such a big memory (2^64 bytes) and 2^48 bytes is more than enough nowadays.
To answer the original question: There was no need to add more than 48 Bits of PA.
Servers need the maximum amount of memory, so let's try to dig deeper.
1) The largest (commonly used) server configuration is an 8-socket system. An 8S system is nothing but 8 server CPUs connected by a high-speed coherent interconnect (or simply, a high-speed "bus") to form a single node. There are larger clusters out there but they are few and far between; we are talking about commonly used configurations here. Note that in real-world usage, a 2-socket system is one of the most commonly used servers, and 8S is typically considered very high end.
2) The main types of memory used by servers are byte-addressable regular DRAM (e.g. DDR3/DDR4 memory), memory-mapped IO - MMIO (such as memory used by an add-in card), as well as configuration space used to configure the devices present in the system. The first type of memory is usually the biggest (and hence needs the largest number of address bits). Some high-end servers use a large amount of MMIO as well, depending on the actual configuration of the system.
3) Assume each server CPU (socket) can house 16 DDR4 DIMMs, each with a maximum size of 256 GB. (Depending on the server generation, the number of possible DIMMs per socket is actually less than 16, but continue reading for the sake of the example.)
So each socket can theoretically have 16*256GB=4096GB = 4 TB.
For our example 8S system, the DRAM size can be a maximum of 4*8 = 32 TB. This means that
the max number of bits needed to address this DRAM space is 45 (since 32 TB = 2^45 bytes).
We won't go into the details of the other types of memory (MMIO, MMCFG etc), but the point here is that the most "demanding" type of memory for an 8-socket system with the largest types of DDR4 DIMMs available today (256 GB DIMMs) uses only 45 bits.
For an OS that supports 48 bits (WS16 for example), there are (48-45=) 3 remaining bits.
Which means that if we used the lower 45 bits solely for 32 TB of DRAM, we would still have 2^3 = 8 times that much addressable space left over for MMIO/MMCFG, for a total of 256 TB of addressable space.
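The same arithmetic as a quick calculation (using the DIMM sizes assumed above):

    #include <cstdio>
    #include <cmath>

    int main() {
        const double dimm   = 256.0;          // GiB per DDR4 DIMM (assumed max)
        const double socket = 16 * dimm;      // 4096 GiB = 4 TiB per socket
        const double system = 8 * socket;     // 32 TiB for an 8-socket box
        const double bytes  = system * (1ULL << 30);
        // log2(32 TiB) = 45 address bits needed for DRAM alone
        std::printf("DRAM address bits needed: %.0f\n", std::ceil(std::log2(bytes)));
        return 0;
    }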
So, to summarize:
1) 48 bits of Physical address is plenty of bits to support the largest systems of today that are "fully loaded" with copious amounts of DDR4 and also plenty of other IO devices that demand MMIO space. 256TB to be exact.
Note that this 256TB address space (=48bits of physical address) does NOT include any disk drives like SATA drives because they are NOT part of the address map, they only include the memory that is byte-addressable, and is exposed to the OS.
2) CPU hardware may choose to implement 46, 48 or > 48 bits depending on the generation of the server. But another important factor is how many bits does the OS recognize.
Today, WS16 supports 48 bit Physical addresses (=256 TB).
What this means to the user is, even though one has a large, ultra modern server CPU that can support >48 bits of addressing, if you run an OS that only supports 48 bits of PA, then you can only take advantage of 256 TB.
3) All in all, there are two main factors that determine how many address bits (= how much memory capacity) you can take advantage of:
a) How many bits does your CPU hardware support? (This can be determined with the CPUID instruction on Intel CPUs; see the sketch after this list.)
b) What OS version are you running, and how many bits of PA does it recognize/support?
The min of (a,b) will ultimately determine the amount of addressable space your system can take advantage of.
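For point (a), here is a minimal sketch with GCC or Clang: CPUID leaf 0x80000008 reports the supported physical address bits in EAX[7:0] and the linear (virtual) address bits in EAX[15:8].

    #include <cpuid.h>
    #include <cstdio>

    int main() {
        unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
        if (__get_cpuid(0x80000008, &eax, &ebx, &ecx, &edx)) {
            std::printf("physical address bits: %u\n", eax & 0xff);        // e.g. 39, 46, 52
            std::printf("linear address bits:   %u\n", (eax >> 8) & 0xff); // e.g. 48 or 57
        }
        return 0;
    }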
I have written this response without looking into the other responses in detail. Also, I have not delved in detail into the nuances of MMIO, MMCFG and the entirety of the address map construction. But I do hope this helps.
Thanks,
Anand K Enamandram,
Server Platform Architect
Intel Corporation
It's not true that only the low-order 48 bits of a 64 bit VA are used, at least with Intel 64. The upper 16 bits are used, sort of, kind of.
Section 3.3.7.1 Canonical Addressing in the Intel® 64 and IA-32 Architectures Software Developer’s Manual says:
a canonical address must have bits 63 through 48 set to zeros or ones (depending on whether bit 47 is a zero or one)
So bits 47 thru 63 form a super-bit, either all 1 or all 0. If an address isn't in canonical form, the implementation should fault.
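In code, the canonical check amounts to verifying that the top 17 bits are either all zero or all one, i.e. that bits 63..48 are copies of bit 47 (a sketch for 48-bit implementations):

    #include <cstdint>

    // Canonical on 48-bit implementations: bits 63..48 must equal bit 47.
    bool is_canonical48(std::uint64_t va) {
        std::uint64_t top = va >> 47;        // bits 63..47 (17 bits)
        return top == 0 || top == 0x1FFFF;   // all zeros or all ones
    }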
On AArch64, this is different. According to the ARMv8 Instruction Set Overview, it's a 49-bit VA.
The AArch64 memory translation system supports a 49-bit virtual address (48 bits per translation table). Virtual addresses are sign-extended from 49 bits, and stored within a 64-bit pointer. Optionally, under control of a system register, the most significant 8 bits of a 64-bit pointer may hold a "tag" which will be ignored when used as a load/store address or the target of an indirect branch
A CPU is considered "N-bit" mainly based on its data-bus size and on the width of most of its internals (internal architecture): registers, accumulators, the Arithmetic Logic Unit (ALU), instruction set, etc. For example: the good old Motorola 6800 (or Intel 8080) CPU is an 8-bit CPU. It has an 8-bit data bus, an 8-bit internal architecture, and a 16-bit address bus.
Although an N-bit CPU may have some entities of a size other than N. For example, consider the improvements the 6809 made over the 6800 (both of them 8-bit CPUs with an 8-bit data bus). Among the significant enhancements introduced in the 6809 were two 8-bit accumulators (A and B, which could be combined into a single 16-bit register, D), two 16-bit index registers (X and Y) and two 16-bit stack pointers.

Size of microprocessor

I have read that the microprocessor consists of several components, each having same/different "sizes". But what really confuses me is what determines the stated size of a microprocessor as 16-bit, 32-bit or 64-bit...
Is it:
the ALU's capacity?
the size of the data bus?
the size of the address bus?
the "least common denominator" of the above?
or some other factor that I hitherto don't know about?
Generally the bit-size of a processor is the size of its general-purpose registers. This often corresponds to the size of the memory bus and possibly the address bus, but doesn't necessarily.
For example, Intel sold a version of the 386 chip called the 386SX (http://en.wikipedia.org/wiki/Intel_80386#The_i386SX_variant) that internally was a 386 with 32-bit registers, but only had a 16-bit data bus. I think most people would still consider that chip a 32-bit processor rather than a 16-bit processor.
I think I'll have a go. Traditionally I think that "size" meant the width (number of bits) in the register set. On "my" first computer the DEC PDP-8/E the single register available - the accumulator - was 12 bits wide and it was a 12-bit computer, on the PDP-11 the registers were 16 bits wide and it was a 16-bit computer. IBM 370 and VAX had 32-bit registers and were 32-bit computers.
Starting with the 80386 things became difficult. Depending on operating mode, it could present itself as a real-mode 8086, a protected-mode (PM) 80286, or a PM 80386. With the 64-bit processors using AMD64 or x86-64 you have all of the above plus 64-bit PM. So what are they? It should depend on the basic operating mode of the OS that's running on them. Windows NT, 2000, XP-32, Vista-32 and 7-32 make the processor 32-bit; OSes with "64" in the name make it 64-bit.
As to buses and stuff: there are two physical buses on x86 processors (address and data) and two logical address spaces (memory and I/O). Special pins on the processor determine whether an operation is memory or I/O, read or write, and so on. On the 8086/8088 the data and address buses shared the same pins: D0-D15 were multiplexed onto A0-A15 on the 8086 (D0-D7 onto A0-A7 on the 8088), with A16-A19 (A8-A19 on the 8088) being strictly address lines. On the 80286 they were separate (not sure about the 80186/80188), with 24 address and 16 data lines. On the 80386 and 80486 there were 32 lines each for address and data. The 80386SX had the same external configuration as the 80286.
After this buses get complicated. The processors run so fast internally that they are more or less constantly waiting for their caches which in turn are more or less constantly waiting on external RAM. To satisfy the caches' insatiable hunger for data external memory began delivering it in 64 bit wide chunks starting with the Pentium and Pentium MMX that are both 32 bit processors with 32 address lines but with 64 data lines.
With later processors the number of address lines were increased to 36 allowing a total addressable external memory of 64 GB. The processors remained 32-bit internally.
On multi-core processors the hunger for data is even more pronounced, so they may have several sets of address and data buses to facilitate data being crammed into the processor. Desktop processors may have two or three, and server processors three to four. I'm not sure, but I believe some have switched to 128-bit wide data buses.
For modern 64-bit processors it is not feasible to also have 64 address lines since that would allow memory up to 16 billion Gigabytes which is not possible today. Some motherboards allow 128 GB which means that the processor needs at least 37 address lines.
As you can see, address and data buses are no longer really usable to determine processor size. They actually haven't been for the last 25 years or so (since the 80386's operating modes).
In C the int type is supposed to be equivalent to the register width. On AMD64 it isn't because there just isn't that great a need for 64 bit ints: 32 bit ints do just nicely in most cases. The width of a pointer in C on AMD64 is 64 bits though.
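A quick way to see this on an LP64 system such as x86-64 Linux:

    #include <cstdio>

    int main() {
        // On x86-64 Linux/macOS (LP64): int is 4 bytes, long and pointers are 8.
        // On 64-bit Windows (LLP64): int and long are 4 bytes, pointers are 8.
        std::printf("sizeof(int)   = %zu\n", sizeof(int));
        std::printf("sizeof(long)  = %zu\n", sizeof(long));
        std::printf("sizeof(void*) = %zu\n", sizeof(void*));
        return 0;
    }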
Generally this refers to the amount of memory (2^n bytes) that is addressable by the CPU. Usually it's the same as the data bus width, but the hardware may do multiple accesses to retrieve that amount, so it's not 100% guaranteed. Sometimes it also corresponds to the CPU register size, though that too can differ.