Atomically setting a variable without comparing first - atomic

I've been reading up on and experimenting with atomic memory access for synchronization, mainly for educational purposes. Specifically, I'm looking at Mac OS X's OSAtomic* family of functions. Here's what I don't understand: Why is there no way to atomically set a variable instead of modifying it (adding, incrementing, etc.)? OSAtomicCompareAndSwap* is as close as it gets -- but only the swap is atomic, not the whole function itself. This leads to code such as the following not working:
const int N = 100000;
void* threadFunc(void *data) {
int *num = (int *)data;
// Wait for main thread to start us so all spawned threads start
// at the same time.
while (0 == num) { }
for (int i = 0; i < N; ++i) {
OSAtomicCompareAndSwapInt(*num, *num+1, num);
}
}
// called from main thread
void test() {
int num = 0;
pthread_t threads[5];
for (int i = 0; i < 5; ++i) {
pthread_create(&threads[i], NULL, threadFunc, &num);
}
num = 1;
for (int i = 0; i < 5; ++i) {
pthread_join(threads[i], NULL);
}
printf("final value: %d\n", num);
}
When run, this example would ideally produce 500,001 as the final value. However, it doesn't; even when the comparison in OSAtomicCompareAndSwapInt in thread X succeeds, another thread Y can come in set the variable first before X has a chance to change it.
I am aware that in this trivial example I could (and should!) simply use OSAtomicAdd32, in which case the code works. But, what if, for example, I wanted to set a pointer atomically so it points to a new object that another thread can then work with?
I've looked at other APIs, and they seem to be missing this feature as well, which leads me to believe that there is a good reason for it and my confusion is just based on lack of knowledge. If somebody could enlighten me, I'd appreciate it.

I think that you have to check the OSAtomicCompareAndSwapInt result to guarantee that the int was actually set.

Related

How to write a local branch predictor?

I am trying to use runspec test my local branch predictor, but only find a disappointing result.
By now I have tried use a 64 terms LHT, and when the LHT is full, I use FIFO tactics replace a terms in LHT.I don't know if I use a tiny LHT or my improper replacement tactics makes it a terrible precision, anyway it's only 60.9095.
for (int i = 0; i < 1 << HL; i++)
{
if (tag_lht[i] == (addr&(1-(1<<HL))))
{
addr = addr ^ LHT[i].getVal();
goto here;
break;
}
}
index_lht = index_lht%(1<<HL);
tag_lht[index_lht] = (addr&(1-(1<<HL)));
LHT[index_lht] = ShiftReg<2>();
addr = addr ^ LHT[index_lht].getVal();
index_lht++;
here:
for (int i = 0; i < 1 << L; i++)
{
if (tag[i] == (addr))
{
return bhist[i].isTaken();
}
}
index = index % (1 << L);
tag[index] = (addr);
bhist[index].reset();
return bhist[index++].isTaken();
Here I make some explain about the code. bhist is a table store 2-bit status about each branch instructions when the table is full, use FIFO replacement tactics. tag is where the table store address of each instruction. Besides, likely I use tag_lht to store address of each instruction that stored in LHT. Function isTaken() can easily get the predict result.
Thank you all guys, I find that stupid mistake I make, and the code above is correct, but may not seem work prefect. The mistake bellow:
for (int i = 0; i < (1 << L); i++)
{
if (tag[i] == (addr))
{
if (takenActually)
{
LHT[j].shiftIn(1);
bhist[i].increase();
}
else
{
LHT[j].shiftIn(0);
bhist[i].decrease();
}
}
break;
}
But it should be like this:
for (int i = 0; i < (1 << L); i++)
{
if (tag[i] == (addr))
{
if (takenActually)
{
LHT[j].shiftIn(1);
bhist[i].increase();
}
else
{
LHT[j].shiftIn(0);
bhist[i].decrease();
}
break;
}
}
I am so stupid that I waste you helpful people' s time, I spent so much time to figure out why it don't work, at first I thought that wrong variable or argument are used, now I just think I am a careless man.
Again I thank all you ardent fellows. Then I will answer the question with my full code.
PS. wish that my terrible English have not confuse anyone.:)

Bad address error when comparing Strings within BPF

I have an example program I am running here to see if the substring matches the string and then print them out. So far, I am having trouble running the program due to a bad address. I am wondering if there is a way to fix this problem? I have attached the entire code but my problem is mostly related to isSubstring.
#include <uapi/linux/bpf.h>
#define ARRAYSIZE 64
struct data_t {
char buf[ARRAYSIZE];
};
BPF_ARRAY(lookupTable, struct data_t, ARRAYSIZE);
//char name[20];
//find substring in a string
static bool isSubstring(struct data_t stringVal)
{
char substring[] = "New York";
int M = sizeof(substring);
int N = sizeof(stringVal.buf) - 1;
/* A loop to slide pat[] one by one */
for (int i = 0; i <= N - M; i++) {
int j;
/* For current index i, check for
pattern match */
for (j = 0; j < M; j++)
if (stringVal.buf[i + j] != substring[j])
break;
if (j == M)
return true;
}
return false;
}
int Test(void *ctx)
{
#pragma clang loop unroll(full)
for (int i = 0; i < ARRAYSIZE; i++) {
int k = i;
struct data_t *line = lookupTable.lookup(&k);
if (line) {
// bpf_trace_printk("%s\n", key->buf);
if (isSubstring(*line)) {
bpf_trace_printk("%s\n", line->buf);
}
}
}
return 0;
}
My python code here:
import ctypes
from bcc import BPF
b = BPF(src_file="hello.c")
lookupTable = b["lookupTable"]
#add hello.csv to the lookupTable array
f = open("hello.csv","r")
contents = f.readlines()
for i in range(0,len(contents)):
string = contents[i].encode('utf-8')
print(len(string))
lookupTable[ctypes.c_int(i)] = ctypes.create_string_buffer(string, len(string))
f.close()
b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="Test")
b.trace_print()
Edit: Forgot to add the error: It's really long and can be found here: https://pastebin.com/a7E9L230
I think the most interesting part of the error is near the bottom where it mentions:
The sequence of 8193 jumps is too complex.
And a little bit farther down mentions: Bad Address.
The verifier checks all branches in your program. Each time it sees a jump instruction, it pushes the new branch to its “stack of branches to check”. This stack has a limit (BPF_COMPLEXITY_LIMIT_JMP_SEQ, currently 8192) that you are hitting, as the verifier tells you. “Bad Address” is just the translation of kernel's errno value which is set to -EFAULT in that case.
Not sure how to fix it though, you could try:
With smaller strings, or
On a 5.3+ kernel (which supports bounded loops): without unrolling the loop with clang (I don't know if it would help).

How do I DumpLive results of a long running process?

I've tried Observable.Create
waits to finish before showing any results.
Possibly because the example I'm trying to follow is a changing live value, not a changing live collection.
and
ObservableCollection<FileAnalysisResult> fileAnalysisResults = new ObservableCollection<FileAnalysisResult>();
I can't seem to apply because .DumpLive() isn't applicable to an ObservableCollection.
Short answer: use LINQPad's DumpContainer:
var dc = new DumpContainer().Dump();
for (int i = 0; i < 100; i++)
{
dc.Content = i;
Thread.Sleep(100);
}
Long answer: DumpContainer writes to LINQPad's standard HTML results window, so you can see the value change in place while the main thread is blocked, whereas calling DumpLive on an IObservable uses a WPF control to render the updates, so the main thread must remain unblocked in order to see updates as they occur.
It's also possible to dump a WPF or Windows Forms control and update it in place:
var txt = new TextBox().Dump();
for (int i = 0; i < 100; i++)
{
txt.Text = i.ToString();
await Task.Delay(100);
}
Just as with DumpLive, you must be careful not to block the main thread. If you replaced await Task.Delay with Thread.Sleep, you'd block the UI thread and nothing would appear until the end.

Bounded Buffers (Producer Consumer)

In the shared buffer memory problem , why is it that we can have at most (n-1) items in the buffer at the same time.
Where 'n' is the buffer's size .
Thanks!
In an OS development class in college, I had an adjunct teacher that claimed it was impossible to have a software-only solution that could use all N elements in the buffer.
I proved him wrong with something I decided to call the race track solution (inspired by the fact that I like to run track).
On a race track, you are not limited to a 400 meter race; a race can consist of more than one lap. What happens if two runners are neck and neck
in a race? How do you know whether they are tied, or whether one runner has lapped the other? The answer is simple: in a race, we don't monitor a runner's position
on the track; we monitor the distance each runner has traversed. Thus, when two runners are neck and neck, we can disambiguafy between a tie and when one runner has
lapped the other.
So, our algorithm has an N-element array, and manages a 2N race. We don't restart the producer/consumer's counter back to zero until they finish their respective 2N race.
We don't allow the producer to be more than one lap ahead of the consumer, and we don't allow the consumer to be ahead of the producer.
Actually, we only have to monitor the distance between the producer and consumer.
The code is as follows:
Item track[LAP];
int consIdx = 0;
int prodIdx = 0;
void consumer()
{ while(true)
{ int diff = abs(prodIdx - consIdx);
if(0 < diff) //If the consumer isn't tied
{ track[consIdx%LAP] = null;
consIdx = (consIdx + 1) % (2*LAP);
}
}
}
void producer()
{ while(true)
{ int diff = (prodIdx - consIdx);
if(diff < LAP) //If prod hasn't lapped cons
{ track[prodIdx%LAP] = Item(); //Advance on the 1-lap track.
prodIdx = (prodIdx + 1) % (2*LAP);//Advance in the 2-lap race.
}
}
}
It's been a while since I originally solved the problem, so this is according to my best recollection. Hopefully I didn't overlook any bugs.
Hope this helps!
Oops, here's a bug fix:
Item track[LAP];
int consIdx = 0;
int prodIdx = 0;
void consumer()
{ while(true)
{ int diff = prodIdx - consIdx; //When prodIdx wraps to 0 before consIdx,
diff = 0<=diff? diff: diff + (2*LAP); //think in 3 Laps until consIdx wraps to 0.
if(0 < diff) //If the consumer isn't tied
{ track[consIdx%LAP] = null;
consIdx = (consIdx + 1) % (2*LAP);
}
}
}
void producer()
{ while(true)
{ int diff = prodIdx - consIdx;
diff = 0<=diff? diff: diff + (2*LAP);
if(diff < LAP) //If prod hasn't lapped cons
{ track[prodIdx%LAP] = Item(); //Advance on the 1-lap track.
prodIdx = (prodIdx + 1) % (2*LAP);//Advance in the 2-lap race.
}
}
}
Well, theoretically a bounded buffer can hold elements upto its size. But what you are saying could be related to certain implementation quirks like a clean way of figuring out when the buffer is empty/full. This question -> Empty element in array-based bounded buffer deals with a similar thing. See if it helps.
However you can of course have implementations that have all n slots filled up. That's how the bounded buffer problem is defined anyway.

Why threads give different number on my program using ThreadPool?

why Number has different value?
Thx
class Program
{
static DateTime dt1;
static DateTime dt2;
static Int64 number = 0;
public static void Main()
{
dt1 = DateTime.Now;
for (int i = 0; i < 10; i++)
{
ThreadPool.QueueUserWorkItem(new WaitCallback(WorkThread), DateTime.Now);
}
dt2 = DateTime.Now;
Console.WriteLine("***");
Console.ReadLine();
}
public static void WorkThread(object queuedAt)
{
number = 0;
for (Int64 i = 0; i < 2000000; i++)
{
number += i;
}
Console.WriteLine("number is:{0} and time:{1}",number,DateTime.Now - dt1);
}
}
number is being shared between all of your threads, and you're not doing anything to synchronize access to it from each thread. So one thread might not have even started it's i loop (it may or may not have reset number to 0 at this point), while another can be half way through, and another might have finished it's loop completely and be at the Console.WriteLine part.
Here you have 10 threads acting on the static variable number at indeterminate times. One thread could on its 10000 iteration while another could just be beginning execution. And your routine begins by resetting number to 0. This logic would produce interesting results but nothing predictable.
If multiple threads access the same variable all at once, there is a risk of race conditions. A race condition is basically when the operations of the two threads are interwoven such that they interfere with eachother. To add a value to "number", the old value must be read, the sum computed, and the new value set. If those steps are being done by many threads at the same time, the value-setting can overwrite work done by previous threads, and the final result can change. You must use a lock (also called a critical section, mutex, or monitor) to protect the variable so this can't happen.