Memory Index in Brainfuck Language

I am new to this language.
The command < is used to decrement the memory pointer (the memory index).
The code below adds two single-digit numbers.
According to this program, it first takes input and stores it at memory[0], since the memory index starts at position 0. It then decrements the memory index, making it -1, so it should produce a runtime error. Why does it run successfully on IDEone?
Are the memory cells arranged in a cycle (does the tape wrap around)?
, ;read character and store it in p1
------------------------------------------------ ;return ascii to Dec
< ;move pointer to p2 (second byte)
, ;read character and store it in p2
------------------------------------------------ ;return ascii to Dec
[ ; enter loop
- ; decrement p2
> ; move to p1
+ ; increment p1
< ; move to p2
] ; we exit the loop when the last cell is empty
> ;go back to p1
++++++++++++++++++++++++++++++++++++++++++++++++ ;return Dec to ascii
. ;print p1

That code clearly has a bug: all < should be > and vice versa.
What happens if you try to decrement the pointer past 0 is not defined. Some interpreters crash, some wrap around.
It's best to always assume the interpreter will crash. There is a simple reason for this: some implementations do not bound the tape to 30,000 cells and will keep adding memory as soon as it is requested. Therefore there is no "end" to the tape, so going down from 0 cannot wrap around to the end (because there isn't one).
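To make the two behaviors concrete, here is a minimal sketch of a Brainfuck interpreter in Python (a hypothetical illustration, not how IDEone's interpreter is actually implemented) where going below cell 0 is an explicit policy choice: raise an error, or wrap to the other end of a fixed-size tape.

def run(code, data="", tape_len=30000, wrap=False):
    tape, ptr, pc, inp, out = [0] * tape_len, 0, 0, 0, []
    # Pre-compute matching bracket positions for the loops.
    stack, match = [], {}
    for i, c in enumerate(code):
        if c == "[":
            stack.append(i)
        elif c == "]":
            j = stack.pop()
            match[i], match[j] = j, i
    while pc < len(code):
        c = code[pc]
        if c in "<>":
            ptr += 1 if c == ">" else -1
            if not 0 <= ptr < tape_len:
                if wrap:
                    ptr %= tape_len                                  # wrap-around policy
                else:
                    raise IndexError("pointer moved off the tape")   # crash policy
        elif c == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif c == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif c == ".":
            out.append(chr(tape[ptr]))
        elif c == ",":
            tape[ptr] = ord(data[inp]) if inp < len(data) else 0
            inp += 1
        elif c == "[" and tape[ptr] == 0:
            pc = match[pc]
        elif c == "]" and tape[ptr] != 0:
            pc = match[pc]
        pc += 1
    return "".join(out)

With wrap=True, run("<+") silently increments the last cell, which is the kind of behavior that would let the posted program appear to work; with the default wrap=False it raises an error, which is the safer assumption.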

Well, the answer to your question rests entirely on theory.
First of all, you should know that brainfuck is designed to be Turing-complete.
Turing completeness of brainfuck means that the language by itself can be used to simulate any single-taped Turing machine.
The definition of a Turing machine assumes an infinite tape.
Infinite memory is impossible to get. In order to pretend that the tape is infinite, an implementation should use as much memory as possible.
If memory runs out, it should throw an exception, because once the memory is treated as finite, brainfuck's model no longer makes sense.

It really depends on the interpreter. The original language specifies an array of 30,000 memory cells. Many interpreters choose to wrap around to the last cell when decrementing at cell zero, to avoid throwing an exception.

Related

Why does instruction cache alignment improve performance in set associative cache implementations?

I have a question regarding instruction cache alignment. I've heard that for micro-optimizations, aligning loops so that they fit inside a cache line can slightly improve performance. I don't see why that would do anything.
I understand the concept of cache hits and their importance in computing speed.
But it seems that in set-associative caches, adjacent blocks of code will not be mapped to the same cache set. So if the loop crosses a code-block boundary, the CPU should still get a cache hit, since the adjacent block has not been evicted by execution of the previous block; both blocks are likely to remain cached during the loop.
So all I can figure is that if there is truth in the claim that alignment can help, it must come from some other effect.
Is there a cost in switching cache lines?
Is there a difference between a hit on a new cache line and a hit on the cache line you're currently reading from?
Keeping a whole function (or the hot parts of a function, i.e. the fast path through it) in fewer cache lines reduces I-cache footprint. So it can reduce the number of cache misses, including on startup when most of the cache is cold. Having a loop end before the end of a cache line could give HW prefetching time to fetch the next one.
Accessing any line that's present in L1i cache takes the same amount of time. (Unless your cache uses way-prediction: that introduces the possibility of a "slow hit". See these slides for a mention and brief description of the idea. Apparently MIPS r10k's L2 cache used it, and so did Alpha 21264's L1 instruction cache with "branch target" vs. "sequential" ways in its 2-way associative 64 KiB L1i. Or see any of the academic papers that come up when you google cache way prediction like I did.)
Other than that, the effects aren't so much about cache-line boundaries but rather aligned instruction-fetch blocks in superscalar CPUs. You were correct that the effects are not from things you were considering.
See Modern Microprocessors: A 90-Minute Guide! for an intro to superscalar (and out-of-order) execution.
Many superscalar CPUs do their first stage of instruction fetch using aligned accesses to their I-cache. Let's simplify by considering a RISC ISA with a 4-byte instruction width (see footnote 1) and 4-wide fetch/decode/exec. (e.g. MIPS r10k, although IDK if some of the other stuff I'm going to make up reflects that microarch exactly).
...
.top_of_loop:
insn1 ; at address 16*n + 12
; 16-byte boundary here
insn2 ; at address 16*n + 0
insn3 ; at address 16*n + 4
b .top_of_loop ; at address 16*n + 8
... after loop ; at address 16*n + 12
... after loop ; at address 16*n + 0
Without any kind of loop buffer, the fetch stage has to fetch the loop instructions from I-cache once for every iteration. But this takes a minimum of 2 cycles per iteration because the loop spans two 16-byte aligned fetch blocks: the front-end is not capable of fetching those 16 bytes of instructions in one unaligned fetch.
But if we align the top of the loop, it can be fetched in a single cycle, allowing the loop to run at 1 cycle / iteration if the loop body doesn't have other bottlenecks.
...
nop ; at address 16*n + 12 ; NOP padding for alignment
.top_of_loop: ; 16-byte boundary here
insn1 ; at address 16*n + 0
insn2 ; at address 16*n + 4
insn3 ; at address 16*n + 8
b .top_of_loop ; at address 16*n + 12
... after loop ; at address 16*n + 0
... after loop ; at address 16*n + 4
With a larger loop that's not a multiple of 4 instructions, there's still going to be a partially-wasted fetch somewhere. It's generally best that it's not at the top of the loop, though. Getting more instructions into the pipeline sooner rather than later helps the CPU find and exploit more instruction-level parallelism, for code that isn't purely bottlenecked on instruction fetch.
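To make the arithmetic concrete, here is a tiny Python sketch (a hypothetical model of the simplified 4-wide machine above, not of any real CPU) that counts how many aligned 16-byte fetch blocks a loop body touches; that count is the minimum number of fetch cycles per iteration.

def fetch_blocks(start_addr, n_insns, insn_size=4, block=16):
    # Number of aligned fetch blocks covered by n_insns starting at start_addr.
    end_addr = start_addr + n_insns * insn_size - 1
    return end_addr // block - start_addr // block + 1

# 4-instruction loop starting at offset 12 within a block: spans 2 blocks,
# so at least 2 fetch cycles per iteration.
print(fetch_blocks(16 * 100 + 12, 4))   # -> 2
# The same loop aligned to a 16-byte boundary: 1 block, 1 fetch cycle.
print(fetch_blocks(16 * 101, 4))        # -> 1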
In general, aligning branch targets (including function entry points) by 16 can be a win (at the cost of greater I-cache pressure from lower code density). A useful tradeoff can be padding to the next multiple of 16 if you're within 1 or 2 instructions. e.g. so in the worst case, a fetch block contains at least 2 or 3 useful instructions, not just 1.
This is why the GNU assembler supports .p2align 4,,8 : pad to the next 2^4 boundary if it's 8 bytes away or closer. GCC does in fact use that directive for some targets / architectures, depending on tuning options / defaults.
In the general case for non-loop branches, you also don't want to jump near the end of a cache line. Then you might have another I-cache miss right away.
Footnote 1:
The principle also applies to modern x86 with its variable-width instructions, at least when they have decoded-uop cache misses forcing them to actually fetch x86 machine code from L1I-cache. And applies to older superscalar x86 like Pentium III or K8 without uop caches or loopback buffers (that can make loops efficient regardless of alignment).
But x86 decoding is so hard that it takes multiple pipeline stages, e.g. some just to find instruction boundaries and then to feed groups of instructions to the decoders. Only the initial fetch blocks are aligned, and buffers between stages can hide bubbles from the decoders if pre-decode can catch up.
https://www.realworldtech.com/merom/4/ shows the details of Core2's front-end: 16-byte fetch blocks, same as PPro/PII/PIII, feeding a pre-decode stage that can scan up to 32 bytes and find boundaries between up to 6 instructions IIRC. That then feeds another buffer leading to the full decode stage which can decode up to 4 instructions (5 with macro-fusion of test or cmp + jcc) into up to 7 uops...
Agner Fog's microarch guide has some detailed info about optimizing x86 asm for fetch/decode bottlenecks on Pentium Pro/II vs. Core2 / Nehalem vs. Sandybridge-family, and AMD K8/K10 vs. Bulldozer vs. Ryzen.
Modern x86 doesn't always benefit from alignment. There are effects from code alignment but they're not usually simple and not always beneficial. Relative alignment of things can matter, but usually for things like which branches alias each other in branch predictor entries, or for how uops pack into the uop cache.

Sequence detector independent of cycle

How do you code an FSM that can detect 1010 when the input can stay '1' or '0' for multiple cycles? Typical FSMs detect 1010 patterns over consecutive clock cycles. Is it possible to use the same or a similar FSM to detect 1010 patterns even though '1' can stay '1' for two cycles and '0' can stay '0' for two cycles ...
While, in principle, you can use a very similar sequence-detecting FSM to detect sequences in which each symbol stays on the line for two or more cycles rather than one, this only works if each symbol is on the line for a fixed period. For example, if a 1 or 0 counts as a sequence character only when it stays on the line for exactly two cycles, then you can either clock your FSM with a divided clock so it runs at half the frequency, or expand your sequence to 11001100 (for detecting 1010), as sketched below. This works for any sequence in which a value of 1 or 0 is defined over a fixed number of cycles, i.e. all values are held for a known period of time.
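For instance, here is a rough Python sketch of the fixed-period approach (an illustration only, not synthesizable RTL): a plain consecutive-sample matcher is reused unchanged, and the only thing that changes is that the target is expanded to 11001100.

def matches_somewhere(samples, pattern):
    # Classic fixed-rate detection: slide over the per-cycle samples
    # and report whether the pattern appears as consecutive samples.
    window = []
    for bit in samples:
        window.append(bit)
        if len(window) > len(pattern):
            window.pop(0)
        if window == list(pattern):
            return True
    return False

# 1010 with every symbol held for exactly two clocks == expanded pattern 11001100
line = [0, 1, 1, 0, 0, 1, 1, 0, 0, 1]
print(matches_somewhere(line, (1, 1, 0, 0, 1, 1, 0, 0)))   # -> True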
If, however, you want to detect the sequence with values defined over a variable number of cycles, this is not really possible. To say you have found the sequence 1010 is to say the line became one, then zero, then one, then zero. If the definition of the line becoming one is that it is one for one or more cycles, then it becomes impossible to determine if seeing 1 in the first cycle and then 1 in the second cycle is a single one or two ones. For example, take the short sequence below:
11001100110000
In the fixed case, with 1 symbol every clock (which is what you are familiar with), the sequence is just how it reads: 11001100110000. In the case I described above, with a fixed period of 1 symbol every two clocks, the sequence is now 1010100. However, if we say that a symbol can last a variable number of clocks, the sequence above can be resolved to any number of sequences:
101010,
1101010,
1001010,
1011010,
1010010,
1010110,
1010100,
10101000,
101010000,
11001010,
11011010,
11010010,
...,
11001100110000.
Without defining what a symbol is, it becomes difficult to create an FSM, as any input sequence can quickly permute into a very large number of possible sequences. It thus becomes impossible to determine the sender's intent without more information. Either you need another signal to determine when to sample the line, or some other way of distinguishing what the interpretation is supposed to be.
If you do want to make an FSM that determines whether any of these permutations matches a given sequence, you can do that by boiling your sequence down to the minimum it requires. In our example sequence above (11001100110000), if we wanted to see whether that sequence CAN be interpreted as your sequence (1010), we can find the needed elements by noting that you have to see 1 for some time (i.e., at least one 1), then 0 for some time (at least one 0), then 1 for some time (at least one 1), then 0 for some time (at least one 0). Note that any alternating line would thus match.
In the case of a non-alternating sequence like 11001, we could look at the line sequence (1100110011000) looking for at least 2 ones, then at least 2 zeroes, then at least 1 one. In the FSMs you're familiar with, seeing the pattern 11000 would result in a return arc back to start, i.e. the pattern 11001 was not seen. However, in the variable-length case, the FSM would stay in the "seen 1100" state, as 11000 can still be interpreted as 1100. A sketch of such a machine follows below.
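Here is a small Python sketch of that idea (an illustration under the "boiled down" interpretation above, not production RTL): the state counts how many symbols of the target have been seen, and repeating the symbol just seen keeps the machine in its current state.

def can_be_read_as(samples, target=(1, 0, 1, 0)):
    # state = number of target symbols matched so far
    state = 0
    for bit in samples:
        if state == len(target):
            return True                             # already matched
        if bit == target[state]:
            state += 1                              # next symbol of the pattern
        elif state > 0 and bit == target[state - 1]:
            pass                                    # same symbol held another cycle: stay
        else:
            state = 1 if bit == target[0] else 0    # restart, possibly on this bit
    return state == len(target)

print(can_be_read_as([1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0]))   # -> True
print(can_be_read_as([1, 1, 1, 1, 0, 0, 0, 0]))                     # -> False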
I'm not sure how useful this would be, as the intended sequence sent on the line cannot be determined due to the problem I mentioned: you cannot determine what constitutes a symbol.

Making knnsearch fast when one argument remains constant

I have the following problem.
for i=1:3000
    selectedIndices = (i-1)*100 + (1:100);   % the 100 points of C for this i, as described below
    [~,dist(i,1)]=knnsearch(C(selectedIndices,:),C);
end
Let me explain the code above. C is a huge matrix (300000 x 1984). C(selectedIndices,:) is a 100-row subset of C that depends on the value of i: for i=1 the first 100 points of C are selected, for i=2 it is C(101:200,:), and so on. As you can see, the second argument remains constant.
Is there any way to make this run faster? I have tried the following:
- [~,dist(i,1)]=knnsearch(C,C); % obviously goes out of memory
- Send a bigger chunk of selectedIndices instead of sending just 100. This adds a little bit of post-processing, which I am not worried about. But this doesn't help, since it takes an equivalent amount of time. For example, if I send 100 points of C at a time, it takes 60 seconds; if I send 500, it takes 380 seconds with the post-processing.
- Tried using parfor, so that different sets of selectedIndices are executed in parallel. It doesn't work, possibly because two copies of the big matrix C get created (I am not sure how parfor works), but I am sure the computer becomes very slow, in turn negating the advantage of parfor.
- Haven't tried yet: break both arguments into smaller chunks and process them in parfor. Do you think this will make any difference?
I am open to any suggestion, i.e. if you feel breaking the matrix up in some different way may speed up the computation, do suggest it. In the end I only care about finding the closest point, from a set of points (here each set has 100 points), for each point in C.
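For what it's worth, here is the same loop structure sketched in Python with SciPy (a hypothetical translation with small made-up sizes, not the MATLAB setup above): the search structure is built once per 100-point block, and since the query side (all of C) never changes, the query call is exactly the part that could be split into chunks and handed to parallel workers.

import numpy as np
from scipy.spatial import cKDTree

C = np.random.rand(3000, 50)                 # stand-in for the 300000 x 1984 matrix
block = 100
n_blocks = C.shape[0] // block

dist = np.empty((C.shape[0], n_blocks))
for i in range(n_blocks):
    ref = C[i * block:(i + 1) * block]       # C(selectedIndices,:) for this i
    tree = cKDTree(ref)                      # built once per block
    dist[:, i], _ = tree.query(C, k=1)       # distance from every point of C to this block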

Does MATLAB execute basic array operations in constant space?

I am getting an out of memory error on this line of MATLAB code:
result = (A(1:xmax,1:ymax,1:zmax) .* B(2:xmax+1,2:ymax+1,2:zmax+1) + ...
          A(2:xmax+1,2:ymax+1,2:zmax+1) .* B(1:xmax,1:ymax,1:zmax)) ./ C
where C is another array. This is on 32 bit MATLAB (I can't seem to get the 64 bit version at the moment, which would temporarily fix my problems).
The arrays result, A, B, and C are pre-initialized and never change size. It is then my guess that this computation is not being performed in constant space.
Is this correct? Is there a way to make it run or check if it is running in constant space?
These arrays are of approximate size (250, 250, 250).
If MATLAB does not run this in constant size, does anyone have any experience as to whether Octave or Julia or (insert similar language) does?
edit 1:
I eliminated excess arrays. There are 10 arrays that are 258 x 258 x 338, which corresponds to 1.67 GB. There are a bunch of other variables but they are much smaller. The calculation presented is simplified, the form of the calculation is:
R = (A(3Drange) .* B(3Drange) + A(new_3Drange) .* D(new_3Drange) + . . . ) ./ C
where the ranges generally just differ by a shift of plus or minus 1 or 2.
The output of memory command:
Maximum possible array: 669 MB (7.013e+08 bytes) *
Memory available for all arrays: 1541 MB (1.616e+09 bytes) **
Memory used by MATLAB: 2209 MB (2.316e+09 bytes)
Physical Memory (RAM): 8154 MB (8.550e+09 bytes)
* Limited by contiguous virtual address space available.
** Limited by virtual address space available.
Apparently I should be violating the second line. However, the code runs fine until the first operation that I actually do with the arrays. Perhaps MATLAB is being lazy and not allocating when I type:
A=zeros(xmax+2,ymax+2,zmax+2);
but still telling me in the workspace that the variable is allocated.
This code has worked before with smaller arrays. (edit: but it seems the actual memory size is the problem, not the size of each individual array).
The very curious thing to me is why it does not error during allocation, but instead errors during the first calculation.
edit 2:
I have confirmed that the loop is not running in constant space. About 0.8 GB of memory is allocated during the calculation. Here is an image of resource usage while the command is being executed in a loop:
However, I tried breaking up the computation into multiple lines. I split the computation at each addition and added each part in a new command, treating R as an accumulator. The result is that less memory is allocated at one time, but presumably more often. Here is the picture:
I am still curious as to why MATLAB doesn't want to execute this in constant space. I think it perhaps has something to do with the indexing being shifted - I am planning on investigating it more later and then putting this all together in an answer, but someone may beat me to it, which would be great also. Now, though, I can run the array size I was looking for and can finish my project.
I guess that most of the question has already been answered:
Does it operate in constant space?
No as you verified, it does not.
Why doesn't it operate in constant space?
MATLAB claims to be fast at vectorized matrix operations; not as much emphasis is placed on memory efficiency.
What to do now?
Here are different options; the first one is preferred if possible, the other two are certainly possible.
Make it fit, for example by upgrading to 64-bit MATLAB or by not putting other stuff in your workspace
Work on parts of the matrix, so for example cut it in half
Don't use vectorization at all but write a simple for loop
If you don't vectorize, you will have a minimal-space solution, as sketched below.
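For what it's worth, here is a sketch of the same kind of update in Python/NumPy (one of the "similar languages" the question mentions, with made-up sizes), written so that only one extra temporary array is allocated instead of one per subexpression; a plain triple for loop would need even less extra space, at the cost of speed.

import numpy as np

xmax = ymax = zmax = 50                      # small stand-in sizes
A = np.random.rand(xmax + 2, ymax + 2, zmax + 2)
B = np.random.rand(xmax + 2, ymax + 2, zmax + 2)
C = np.random.rand(xmax, ymax, zmax)

result = np.empty((xmax, ymax, zmax))
tmp = np.empty((xmax, ymax, zmax))           # the only extra temporary

# result = (A(lo) .* B(hi) + A(hi) .* B(lo)) ./ C, built up in place
np.multiply(A[:xmax, :ymax, :zmax], B[1:xmax + 1, 1:ymax + 1, 1:zmax + 1], out=result)
np.multiply(A[1:xmax + 1, 1:ymax + 1, 1:zmax + 1], B[:xmax, :ymax, :zmax], out=tmp)
result += tmp
result /= C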

What's the biggest number in a computer?

Just asked by my 5-year-old kid: what is the biggest number in the computer?
We are not talking about the max number for a specific data type, but the biggest number that a computer can represent.
Infinity is not allowed.
UPDATE: My kid always wants to print it as well, so let's say the computer needs to print this number and the kid needs to know that it's a big number. Of course, in practice we won't print it because there aren't enough trees.
This question is actually a very interesting one which mathematicians have devoted a fair bit of thought to. You can read about it in this article, which is a fascinating and accessible read.
Briefly, a guy named Tibor Rado set out to find some really big, but still well-defined, numbers by defining a sequence called the Busy Beaver numbers. He defined BB(n) to be the largest number of steps any halting Turing machine with n states can take, starting from a blank tape. Note that this sequence is by its very nature not computable, so the numbers themselves, while well-defined, are very difficult to pin down. Here are the first few:
BB(1) = 1
BB(2) = 6
BB(3) = 21
BB(4) = 107
... wait for it ...
BB(5) >= 47,176,870
No one is sure how big exactly BB(5) is, but it is finite. And no one has any idea how big BB(6) and above are. But at least these numbers are completely well-defined mathematically, unlike "the largest number any human has ever thought of, plus one." ;)
So how about this:
The biggest number a computer can represent is the most instructions a program small enough to fit in its available memory can perform before halting.
Squared.
No, wait, cubed. No, raised to the power of itself!
Dammit!
Bits are not numbers. You, as a programmer, give them the meaning you want, possibly numbers.
Now, I decide that 1 represents "the biggest number ever thought by a human plus one".
Errr this is a five year old?
How about something along the lines of: "I'd love to tell you but the number is so big and would take so long to say, I'd die before I finished telling you".
// wait to see
#include <stdio.h>
int main(void)
{
    for (;;)
    {
        printf("9");
    }
}
roughly 2^AVAILABLE_MEMORY_IN_BITS
EDIT: The above is for actually storing a number and treats all media (RAM, HD, cloud etc.) as memory. Subtracting the OS footprint (measured in KB) doesn't make "roughly" less accurate...
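As a worked example of that formula (Python, with a deliberately tiny machine): a computer with only 1 KiB of storage could hold an unsigned integer of up to 8192 bits, i.e. about 2467 decimal digits.

bits = 8 * 1024                  # 1 KiB of storage used as one big unsigned integer
largest = 2 ** bits - 1
print(len(str(largest)))         # -> 2467 decimal digits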
If you want to "represent" a number in a meaningful way, then you probably want to go with what the CPU provides: unsigned 32-bit integers (up to roughly 4 billion) or unsigned 64-bit integers for most computers your kid will come into contact with.
NOTE for talking to 5-year-olds: Often, they just want a factoid. Give him a really big and very accurate number (lots of digits), like 4'294'967'295. Then, once the glazing leaves his eyes, try to see how far you can get with explaining how computers represent numbers.
EDIT #2: I once read this article: Who Can Name the Bigger Number that should provide a whole lot of interesting information for your kid. Obviously he's not your normal five-year-old. So this might get you started in a cool direction about numbers and computation.
The answer to life (and this kid's question): 42
That depends on the datatype you use to represent it. The computer only stores bits (0/1). We, as developers, give the bits meaning (65 can be a number or the letter A).
For example, I can define my datatype to mean 1 x 10^N, where N is unsigned and represented by an array of bits of arbitrary size. The next person can come up with 10 x 10^N, which would be ten times larger than my biggest number.
Sure, there would be gaps, but if you don't need them, that doesn't matter.
Therefore, the question is meaningless since it doesn't have context.
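As a toy illustration of that kind of made-up datatype (a hypothetical sketch in Python):

class PowerOfTen:
    # The stored bits are just N, but the value the datatype *means* is 10**N.
    def __init__(self, n):
        self.n = n               # N can itself be arbitrarily large
    def value(self):
        return 10 ** self.n

print(PowerOfTen(100).value())   # a 101-digit number encoded by the few bits of 100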
Well, I had the same question earlier today, so I thought: why not write a little C++ program to see where the computer is going to stop...
But my laptop wasn't with me in class so I used another one. The number got very big but it never ended; I'll run it again overnight and then I'll share the number.
You can try it, the code is stupid:
#include <stdlib.h>
#include <stdio.h>
int main() {
    int i = 0;
    /* i <= i is always true, so this only stops when i overflows,
       which is undefined behavior for a signed int */
    for (i = 0; i <= i; i++) {
        printf("%i\n", i);
        i++;
    }
}
And let it run till it stops ^^
The size will obviously be limited by the total size of the hard drives you manage to put into your PC. After all, you can store a number in a text file occupying all the disk space.
You can have 4 x 2 TB drives even in a simple box, so around 8 TB available. If you store it as binary, then the biggest number is about 2^64,000,000,000,000.
If your hard drive is 1 TB (8'000'000'000'000 bits), and you would print the number that fits on it on paper as hex digits (nobody would do that, but let's assume), that's 2,000,000,000,000 hex digits.
Each page would contain 4000 hex digits (40 x 100 digits). That's 500,000,000 pages.
Now stack the pages on top of each other (let's say each page is 0.004 inches / 0.1 mm thick); the stack would be about 50 km (about 30 miles) tall.
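A quick arithmetic check of those numbers in Python:

bits = 8 * 10**12                   # 1 TB
hex_digits = bits // 4              # -> 2,000,000,000,000 hex digits
pages = hex_digits // 4000          # 4000 digits per page -> 500,000,000 pages
height_km = pages * 0.1e-3 / 1000   # 0.1 mm per page, converted to km
print(pages, height_km)             # -> 500000000 50.0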
I'll try to give a practical answer.
Common Lisp's number crunching is particularly powerful. It has something called "bignums", which are integers that can be arbitrarily large, limited only by the amount of available memory.
See: http://en.wikibooks.org/wiki/Common_Lisp/Advanced_topics/Numbers#Fixnums_and_Bignums
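Python's built-in integers behave the same way (arbitrary precision, limited only by available memory), so a quick illustration:

n = 2 ** 100000                  # an integer with 30103 decimal digits
print(len(str(n)))               # -> 30103
print((n + 1) - n)               # exact arithmetic, no overflow -> 1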
I don't know much about theory, but as far as I understood from your question, it is: what is the largest number that the computer can represent (and I add: in a reasonable time, and not by printing "9" until the Earth is "eaten by the Sun")? So I had my PC do one simple calculation (in PHP or whatever language): echo pow(2,1023) - resulting in 8.9884656743116E+307. So I guess this is the largest number that my PC can calculate. On the other side, I think the representation of the largest negative number can be: -0,(0)1
Later edit: That computed value was obtained through PHP, but I tried to figure out the largest number my Windows calculator can compute, and it is pow(2, 33219) = 8.2304951207588748764521361245002E+9999. Now I guess this is the largest number my PC can handle.
I think you should be very proud that your 5-year-old is already asking questions like this. And you should continue to encourage that! This is truly amazing! With that said, I would say that ruling out Infinity is thinking incorrectly about what numbers mean in computer memory.
I feel like this way of thinking is a handicap.
Mathematicians will never be able to write out ALL the digits of pi or Euler's number, BUT we FULLY understand those numbers.
Pi, as an example, is perfectly represented by this infinite series: (Pi / 4) = 1 - 1/3 + 1/5 - 1/7 + 1/9 - ...
Just because you literally can't go to infinity or print every single digit in a console means nothing.
You could have printed the symbol representing pi, thereby capturing the infinite series.
Computer Algebra Systems (CAS) represent numbers symbolically all the time. Pi, for instance, may be a symbolic object in memory (the binary in memory does not DIRECTLY represent the number; it represents a "mathematical algorithm" for producing the answer to arbitrary precision).
Then you do some math with it, transforming one expression into the next.
At no point in time did we fail to represent the number COMPLETELY.
At the end, you can do 2 things with this:
A) Evaluate the expression, turning it into a number of some kind (or a matrix, or whatever). BUT this number could very well be an approximation (say, 20 digits of pi).
B) Keep it in its symbolic form for reference. Obviously we don't like staring at symbols, because eventually we need to turn the knobs on the apparatus.
NOTE: sometimes you can get an exact number perfectly represented in memory (like the number 1) by taking limits or going to infinity - not by literally having an infinite number in memory, but by representing it symbolically. Just throw this into Wolfram Alpha: Limit[Exp[-x], x -> Infinity]. It gives you the number 0, which is EXACT.
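For a concrete example of that kind of symbolic representation, here is a small sketch using SymPy in Python (analogous to the Wolfram Alpha query, not the same system):

import sympy as sp

x = sp.Symbol('x')
print(sp.limit(sp.exp(-x), x, sp.oo))   # -> 0, an exact result
print(sp.pi)                            # pi is held as a symbolic object ...
print(sp.N(sp.pi, 30))                  # ... and can be evaluated to any precision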
In short: it was the HUMAN need to have some binary in memory that DIRECTLY represents the number that caused the number to degrade. Symbolically it was perfectly represented. You could design an algorithm that just continues to calculate the next digits of pi or Euler's number, giving you an arbitrary amount of precision (though this is obviously not practical).
I hope this was at least somewhat useful or interesting to you, even if you disagree =)
It depends on how much the computer can handle, although there are times when the computer can handle numbers greater than 2^(bits-1) - 1... For example:
My computer is 64-bit (max signed integer 9223372036854775807); however, the calculator that comes with it can handle numbers of up to 10^9999.
Many supercomputers can exceed these limits, and the one with the most memory (bits) might as well be the one holding the record (the current largest number that can be held by a computer).
Or, if it comes to visually seeing it on a computer, you can just make a program that keeps writing 9 to the monitor without a line break, forming an ever-growing string of 9s. :P
In Chrome, click the three-dot menu, go to More tools > Developer tools, open the Console, and type Number.MAX_VALUE.