Lock-free data structures in Zig - Lukas Pietzschmann

Uploader: Zigtoberfest

Published at: 10/30/2025

Views: 2.7K

Video Transcription

Speaker 2

All right.

Hi from me too.

Good to see you all.

Maybe again on the highest university of Germany.

I'm working at Ulm University and we also have like unofficially the highest tram station of Germany.

But I think that's just because nobody else has bothered to check if their tram station is actually higher than ours.

So there's an unofficial A4 paper hanging at the station that just says it's the highest tram station in Germany.

Yeah, thanks for the introduction, David. I'll be talking a bit about synchronization. We already talked about concurrency and async code, and now we will look at how to synchronize access to shared data structures, shared variables,

shared segments of memory.

If you want to follow along I also uploaded the slides already.

Maybe a short anecdote on that.

At last year's Zigtoberfest I also gave a talk and for that I registered zigtoberfest.fun for a single year and of course I forgot to cancel it and now I've got zigtoberfest.fun for another year.

So that I haven't just wasted my money, please go to zigtoberfest.fun, and there you will find these slides and can follow along if you want to.

Great.

Um, so far so good.

For a little motivation, um, who of you has already cooked with, uh, with someone else together?

Yeah, that's most of you.

And I've also done that quite a few times.

Me and a friend group of mine, we do this regularly.

And I hope they don't watch the stream or watch the YouTube video because I actually hate it.

It's always too small of a space, and there's only one stove or only one pan, and then four or five people have to work on the same pan and the same stove, and it always gets confusing.

And then one person is cutting the tomato wrong and another one is doing whatever.

So I don't enjoy this as much.

I do like cooking in general, but I try to keep it for myself.

Sure.

As I said, you need to coordinate access to the shared resource, which in our case is the pan or the stove.

And this is obviously the same with programming.

There we don't have pans or stoves, but we do have shared memory or shared resources that need to be coordinated between threads so there can be safe access for them.

And this is what we will look at today. So, just to get started, a little programming example. This is an incomplete stack implementation where, let me get that pointer, we have a backing data structure, for which I've just chosen an ArrayList, and I implemented the push operation. Traditionally this would just be self.backing.append and then the value.

But in order to see what's going on, I wrote down three more lines, which is roughly what append is implemented with: we start by getting the length of our current data structure, so of all the items, then we modify the next free slot in our data structure, and then we increment the length by one. And with that, our push operation has succeeded.
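
For reference, a minimal sketch of what such an unsynchronized push might look like; the slide used an ArrayList, but a fixed buffer keeps this sketch self-contained, and the names are illustrative:

    const Stack = struct {
        items: [64]i32 = undefined,
        len: usize = 0,

        // Racy push: between reading `len` and writing it back, another
        // thread can interleave and overwrite the same slot.
        pub fn push(self: *Stack, value: i32) void {
            const len = self.len; // 1) get the current length
            self.items[len] = value; // 2) write into the next free slot
            self.len = len + 1; // 3) publish the new length
        }
    };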

So if we now look at how two different threads could access that, I have a blue and a green pointer for that.

The first one gets the length, then sets the new value, and then just gets descheduled for whatever reason.

Now the second thread appears, also gets the length, and now sets its value too.

But since our first thread, the blue one, hadn't updated the length yet, the green thread now just overwrites what the blue thread did in the first place.

And then green just continues updating the length.

And after that blue continues also updating the length.

But in the end, what blue did just didn't have any effect because, as I said, green overwrote it.

This obviously is bad.

And so we need a way to notify blue and say, Hey, please stop.

There's another thread currently accessing this resource.

You need to continue later.

And that's exactly what locks are for.

For this example, I chose possibly the simplest lock out there, the mutex.

And with just three added lines, so the declaration up here and then a lock and an unlock. This unlock would typically be wrapped in a defer statement, but to make it

more visible when it appears, or when it's evaluated or run,

I put it explicitly as the last statement.
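
A minimal sketch of those three added lines, assuming std.Thread.Mutex and the same illustrative stack as above; on the slide the unlock was written out explicitly instead of using defer:

    const std = @import("std");

    const Stack = struct {
        mutex: std.Thread.Mutex = .{},
        items: [64]i32 = undefined,
        len: usize = 0,

        pub fn push(self: *Stack, value: i32) void {
            self.mutex.lock(); // only one thread may enter the critical section
            defer self.mutex.unlock(); // released when push returns
            const len = self.len;
            self.items[len] = value;
            self.len = len + 1;
        }
    };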

So now again, we have our blue and our green thread.

Blue starts by acquiring this lock.

So blue now has the permission to continue.

gets the length, sets the item, and now gets descheduled again and green appears.

Also tries to acquire the lock, but since blue is already in the critical segment of our code, this is where bad stuff can happen.

Green cannot acquire the lock and also gets descheduled, and blue can continue updating the length. And then, as blue is now outside of the critical segment, it can unlock the lock again, which leads to green being able to continue and running through our lines just regularly.

And now green did not overwrite what blue wrote initially.

And we pushed two elements to the stack, and we can all be happy.

So this is what locks are for.

It solved our initial problem.

But as I said, we used the most simple one, the mutex.

There are obviously different solutions for more complex synchronization problems or locking problems.

Just to point out a few, there are, for example, semaphores, with which a thread can also acquire multiple locks, or the semaphore can be locked multiple times, if that makes sense.

There are non-blocking locks, often exposed as tryLock, which does not block the thread but just returns false in case the lock can't be acquired.

There are also read-write locks to differentiate between read operations and write operations.

where multiple read operations can occur concurrently, but if a write operation occurs, then also read operations have to be blocked.

There are also reentrant locks.

I think in Zig the mutex variant that is reentrant is called a recursive mutex, because a thread can then acquire the lock multiple times, which would not work with the standard mutex implementation.

There are also phasers, barriers, and so much more.
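
Several of these exist directly in Zig's standard library; a minimal sketch, assuming current std.Thread APIs (the exact names may differ between Zig versions):

    const std = @import("std");

    var mutex: std.Thread.Mutex = .{};
    var rw_lock: std.Thread.RwLock = .{};
    var semaphore: std.Thread.Semaphore = .{ .permits = 4 };

    fn lockFlavors() void {
        // Non-blocking "try" lock: returns false instead of blocking.
        if (mutex.tryLock()) {
            defer mutex.unlock();
            // ... critical section ...
        }

        // Read-write lock: many concurrent readers, writers are exclusive.
        rw_lock.lockShared();
        defer rw_lock.unlockShared();

        // Semaphore: up to `permits` holders at the same time.
        semaphore.wait();
        defer semaphore.post();
    }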

So as I said, with different problems, you have to choose the right tools.

And what we want to look at today is whether locks are even the right tool at all.

So there are different solutions.

And this is what we will investigate, because locks do have hidden costs.

It's not just, yeah, I throw a mutex on there and everything's fine.

That might be often the case, but it does not need to be the case all of the time.

So there are hidden challenges or hidden costs associated with that.

For example, there can be bottlenecks.

So if we have many threads trying to acquire the same lock, then all but typically one thread need to get descheduled and only one can do work.

And this sometimes just takes a long time before all the other threads can proceed.

There are more failure modes, which we will get into detail in, I think, a few animations.

And it's easy to just think that we solved the problem by adding a lock.

But in practice, as I said, there might be implications which we are then not thinking about anymore.

Because it does look simple, doesn't it?

Just putting on lock and then unlock and then you're finished.

Yeah, that doesn't need to be the case.

So flashbang, I said we will look at a few additional failure modes that locks can introduce in our code base.

There can be problems with contention.

So contention is, as I said, when many threads want to acquire the same lock, and this just won't work.

And then all but one thread needs to be descheduled.

And this just introduces performance degradation, as only one party can continue with the work and all others have to wait. We can also capture this in a more theoretical way. You might know from school or university, or wherever you have learned your computer science, that

we can calculate the speed improvement gained by introducing concurrency in our program.

And there's Amdahl's law for that.

I'll just quickly sketch it out.

You cannot read this, can you?

But is that the fault of the chalk?

Maybe if I try something with more contrast like red.

Now that's also probably not readable, whatever.

And I'm not using the board.

So Amdahl's law is roughly 1 over 1 minus the portion of the runtime that can be parallelized, and so on.

And since the portion of the code that cannot be parallelized is in the denominator, if it gets bigger, then Amdahl's law yields a smaller value, so a lesser

speedup, if that makes sense.
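
Written out, Amdahl's law for the achievable speedup with n threads, where p is the fraction of the program that can run in parallel, is:

    speedup(n) = 1 / ((1 - p) + p / n)

So as the serial fraction (1 - p) grows, the speedup is capped at 1 / (1 - p), no matter how many threads you add.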

So if more of our code has to wait until another part of our code is finished, then we obviously gain less speedup from our concurrency.

And that's bad.

If we do implement concurrency, then we want to gain as much as possible from it.

Then there's starvation.

That's kind of similar.

I just skip over that for time reasons.

Priority inversion is pretty interesting because it often goes unnoticed.

This is when we have, for example, three threads.

One has a low priority, the other one has a medium priority and the third one has a high priority.

The thread with the lowest priority acquires the lock then gets descheduled for whatever reason.

Next the thread that has the high priority does want to acquire the lock but cannot do so as the lower priority thread has already done so and gets also descheduled because it has to wait.

And now the medium thread gets run and preempts the lower priority thread as it has a higher priority.

And H, or the high priority thread, has to wait indefinitely because L just doesn't get scheduled anymore because it has a lower priority than the medium thread.

That's priority inversion.

And as I said, that often gets unnoticed because it's hidden from us.

Lastly, composability.

Composability: when a function acquires a lock itself, it's rarely a good idea to call that function from within a critical section.

And this again leads to just more coarse-grained locks, so we are incentivized to lock a larger critical section, which in turn just makes all those problems even worse. Long story short, as a summary: are locks good? Maybe, maybe not. We will look at an alternative, and this is lock-free programming. So locking without locks, synchronizing without locks.

We already looked at the first part.

The rest of my talk will be more of a theoretical perspective on how this whole concept of lock-free programming works.

Then we will look at how this can be implemented in Zig.

And lastly, if it is a good solution.

So for the internationals here, this is the German eierlegende Wollmilchsau, which

roughly means a jack of all trades.

And we will look at whether lock-free programming is also such an eierlegende Wollmilchsau of programming.

OK, so how can we get rid of locks?

The problem is our critical section.

First, this needs to be guarded.

This needs to be locked.

So now, if our critical section could be so small, so tiny, that there's just no possibility for any other thread to get in there while we are executing this section, then we would be fine.

Then there would be no need for locking.

as there's no possibility for others to get in there.

And this is, in theory, exactly the attribute that atomic operations provide, because we say they are indivisible.

So they either do their work successfully or they don't do it at all.

But you cannot observe any intermediate result of atomic operations.

This is what makes them atomic.

For example, there are atomic loads or atomic stores, stuff like that. And the atomic operation that's most important for us is called compare and swap, or CAS for short. I don't know if anyone has ever said that out loud; it's C-A-S, CAS. This is the workhorse of lock-free programming.

I have noted it down here; it looks like Zig but isn't completely syntactically correct, but it serves the purpose. I use a pointer here: we always get a pointer to some data, we get an expected value, and we get a new value.

And our goal is to set the pointer to the new value and then we return if it succeeded or not.

Often, and that's also the case with Zig, there is no bool that is returned; instead, if the swap didn't succeed, you get back the value that is currently stored,

and if it did succeed, you just get null.

But writing bool was a bit simpler.

So the first step of our compare and swap operation is checking if the value of our pointer currently is the expected value.

If that's not the case, then somebody has modified something and we aren't at the state we expect it to be at.

So we just abort and say false.

We have to try later.

But if it is the expected value, then we can just set the pointer to our new value and return true at the end.

So we can say like, look at this memory address.

If it still contains the value I expect, then and only then, and this is the important part, update it to the new value.
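
Spelled out as a sketch (illustration only: the real compare and swap is a single atomic instruction, not an ordinary function like this):

    fn compareAndSwap(ptr: *i32, expected: i32, new: i32) bool {
        if (ptr.* != expected) return false; // somebody changed it: abort, try later
        ptr.* = new; // still the expected value: update it
        return true; // report success
    }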

This is like when I need a spontaneous meeting with my colleagues, I just pop my head into their room and I'm like, do you have time currently?

And if they say no, then we are kind of here and I abort and say, OK, then I try later.

And if they say yes, then I have a spontaneous meeting with them and I return true and I'm happier.

So this is the workhorse, the compare and swap operation.

And while this does look like a lot of code, we will see later that it is indeed an atomic operation and that it results in just one line. Great. So now we can compose this single operation into a whole loop: the compare-and-swap loop.

This is what powers most lock-free data structures.

And this matches the little meeting story I just told.

So this just indefinitely checks if our compare and swap operation succeeded.

If yes, then fine, we can exit the loop.

If no, then we update our temporary values and retry.

Um, we will see this later also in the example, I put it up.
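
Something like this minimal sketch of a compare-and-swap loop, here incrementing a shared counter; the ordering enum spellings assume a recent Zig (older versions capitalized them, e.g. .Acquire):

    // Optimistically compute the new value; retry if another thread
    // changed `counter` between our load and our compare-and-swap.
    fn increment(counter: *u32) void {
        var current = @atomicLoad(u32, counter, .acquire);
        while (true) {
            const desired = current + 1;
            // @cmpxchgWeak returns null on success, or the freshly observed value on failure.
            if (@cmpxchgWeak(u32, counter, current, desired, .release, .acquire)) |observed| {
                current = observed; // somebody was faster: retry with the new value
            } else {
                break; // swap succeeded
            }
        }
    }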

Great.

And we call this optimistic concurrency because, as I said, we first try and assume that we can set the new value, which is pretty optimistic, and only if that fails do we try again and are again optimistic.

Um, and hope that we do not fail again.

Yeah, this often gets wrapped into atomic variables.

So we had atomic operations; atomic variables just wrap this compare-and-swap loop so that you have, for example, an atomic integer where you can then set new values, add to values, load values atomically, and so on and so forth.

Good.

The cool thing about this lock-free design is that we can reason about different properties of it.

And there are three important properties that we want to look at a bit in detail.

Those are the three levels of freeness,

as they are called, and they describe what guarantees our concurrent data structure gives when we

access it. So obstruction freedom is the lowest guarantee. It just guarantees that if there are no other threads, or if all other threads are currently suspended and not scheduled, then our current thread can access the data structure and can proceed with its operation.

So I tried to model this as a little diagram.

So data is our shared resource.

Red is us, the current thread we're looking at.

And other thread is just the rest of the system or all other threads in our program.

So we first try to access data, but since all other threads or any other threads already are accessing our data, we fail and just return a failure as indicated by the dashed line.

But then all other threads are suspended or aren't scheduled anymore.

We try to access this shared resource again, and now we succeed because there's just nobody else that blocks it.

So this just guarantees progression in isolation and is the weakest guarantee.

Next, there's lock-freeness.

And while this whole paradigm is called lock-free programming, that doesn't mean that every lock-free algorithm is also lock-free in the sense of this slide, which is why lock-free algorithms are also often just called non-blocking algorithms.

But that doesn't sound so exciting.

Non-blocking sounds kind of bad.

So I also call them lock-free algorithm.

And when I'm speaking about lock-freeness, I typically mean the whole paradigm and not only this attribute.

Just so we are speaking of the same thing.

Yeah, lock-freeness is like one step further.

Now it's guaranteed that at least one thread can always make progress.

So even if there are other threads working on our shared data structures, lock-freeness guarantees that at least one thread of our whole system makes progress.

So it is fine if all other threads are just idling or looping forever, as long as at least our thread at the top does make progress.

That is the lock-free attribute.

And this is typically guaranteed by our compare and swap loop we saw earlier.

So whenever we have a construct like this, this is typically lock-free.

And we can reason about that because if this compare and swap operation fails, what does that mean?

That means that another thread has successfully changed the state, so another thread's compare and swap operation was successful.

There's always at least one thread making progress.

I hope that was understandable.

So then the last guarantee, which is the biggest, greatest guarantee, is wait-freeness.

Wait-freeness guarantees that in a finite sequence or number of steps, every thread is guaranteed to make progress.

So this wasn't the case with lock-freeness.

There we can still have a livelock where we just keep looping over and over.

Wait-freeness does provide that guarantee.

So our current thread and all other threads are eventually going to make progress.

And that's not always trivial to achieve.

So there are certain algorithms or certain data structures that let individual threads, for example, reserve a certain spot in memory, which they then can access instantaneously, which would lead to wait-freeness.

Now the chalkboard would come in handy again, but I'll try to explain it just with words. If you think about a queue where we have a head and a tail pointer, just using this compare-and-swap algorithm

could lead to a new element being added while the tail pointer still points to the element before it, because there are two operations that need to succeed: the insertion is one compare and swap, and the update of the tail pointer is another compare and swap. And if this update

did not yet succeed for whatever reason, then another thread currently accessing our queue could help the first thread and set the tail pointer to the new value.

And then the original thread who did the push operation on the queue, its compare and swap would be successful in the next try.

And this also leads to wait-freeness, if different threads help each other to progress.

But as I said, that's not trivial for all data structures.

We have to explicitly take care of that and explicitly think about it, while lock-freeness is way easier to achieve with just this typical compare-and-swap loop.

And those three types of guarantees are what makes lock-free programming so cool, because with locking algorithms you get none of that.

They can do whatever they want, and it's pretty hard to reason about the progression guarantees with typical locks.

Yeah, they don't even provide obstruction freedom, at least by default, which lock-free algorithms do.

Good.

So far, so good.

That's it for the theory of lock-free coding.

So now you hopefully have at least a foundational understanding of what is going on.

The gist of it is we have atomic operations like this compare and swap.

Their critical section is practically non-existent, so no other thread

can observe intermediate values or get in between operations.

They succeed or don't succeed at all.

And this is what makes synchronization possible without the need of locks.

Good.

How can we now implement this in Zig?

This is probably what you all have been waiting for.

And it's pretty easy to be honest.

Because those atomic operations are just a thing that our modern CPUs support.

There are different flavors of those atomic operations.

Sometimes they are called test-and-set, or compare-and-swap, or fetch-and-increment; compare-and-swap I already mentioned.

And they all result in slightly different instructions.

Those are from the x86 instruction set. You notice that they are all preceded by a lock prefix and then the respective atomic operation. We can also look at this quickly in the Compiler Explorer,

where I provided a short example.

I just used compare and swap here, which is abbreviated CMPXCHG, for compare-exchange.

We want to swap a 32-bit integer.

We want to check this pointer.

If it is currently still 0, then we want to change it to 1.

And if we look to the right here, then there is our compare and swap operation in the instruction set, which again is the lock prefix followed by the compare-exchange.

If we would, for example, have a load or store, then there would still be this lock instruction followed by the load or store.

And as I said, this is executed atomically, so no other thread can intervene.
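
A guess at the Compiler Explorer snippet described here (the orderings in particular are assumptions); on x86-64 the compare and swap lowers to a lock cmpxchg instruction:

    // If `value` is currently still 0, atomically change it to 1.
    export fn casExample(value: *u32) bool {
        return @cmpxchgStrong(u32, value, 0, 1, .seq_cst, .seq_cst) == null;
    }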

So lock-free programming is enabled by the hardware itself.

We don't have to build software support for that.

We can just translate functions or instructions directly to those lock and compare and swap operations.

And Zig just provides built-ins for that.

So remember the compare and swap operation I implemented in pseudo-Zig earlier?

This is just this built-in.

It takes a whole lot of parameters or arguments.

So the type of the pointer we want to override, then the pointer itself, the expected value, the new value, and some stuff we will look at later.

And you also notice that it again comes in two flavors.

This is weak and strong.

I think that if you program on modern x86 hardware, there is no difference between them.

But with compare and swap weak, the comparison can spuriously fail: the atomic operation

may report failure even though the value actually still matched the expected one.

I think this is only the case on older ARM instruction sets, so don't worry about that.

I typically use the weak flavor

Firstly, as I said, because it doesn't matter.

And we also, with this compare and swap loop, we natively implement the retry that the strong version would provide us.

So I just opt with the simpler one.

But as I said, don't worry about it if you're programming for modern hardware.

This shouldn't be too different.

Good.

So in our Zig code, we can just use one of those built-ins and be happy, as they will be directly translated to what we saw earlier in the Compiler Explorer.

Then from this built-in, we can now build atomic variables again, just like we did earlier in the pseudo-Zig version.

Zig already provides std.atomic, which contains an atomic Value type, which, for example, has, first of all, an

init function where you can set the initial value, and then fetchAdd, which gets the current value and adds the first argument here.

So we would just increment it.

Those two lines would translate to exactly this single built-in atomic read-modify-write, with the operation here and the value here.

But working with a real type and a function looking thing is way nicer than always writing out this or this.
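
A minimal sketch of those two lines, assuming a recent Zig where the wrapper is called std.atomic.Value (older releases named it std.atomic.Atomic):

    const std = @import("std");

    var counter = std.atomic.Value(u32).init(0);

    fn bump() void {
        // fetchAdd loads the current value and adds 1 in a single atomic step;
        // it compiles down to the built-in atomic read-modify-write operation.
        _ = counter.fetchAdd(1, .acq_rel);
    }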

Right, and we again see that there's another additional parameter, just like we had here earlier. It's the same type, and we will get to know it in a few slides.

Good.

So now the first real working Zig example.

I implemented a stack in a lock-free way.

This is like the classical example you also teach students.

It's called the Treiber stack.

The original paper called it that way.

And for now I implemented the push operation.

So first we start by allocating a new node and initializing it with the provided value; its next pointer is, for now at least, null. And then we start our compare-and-swap loop that we already saw earlier. We use the atomic load operation to access our current head.

Again, this has to be atomic so nobody else can intercept our load.

We then set the next pointer of our new head to the current head.

And then we try to exchange the current top of our data structure with our new head.

But we only do so if the current top is still the top that we recorded here.

So this makes sure that in this line or in between this and that line nobody else modified our current stack.

And if that were the case, then this check would be false and we would just rerun the loop again: getting the old head, setting the pointer again to the now-updated head, and trying to exchange the pointer again.
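
A sketch of what the push on the slide likely looks like; the exact types, field names, and orderings are assumptions based on the description (some Zig versions may prefer std.atomic.Value over raw builtins on an optional pointer):

    const std = @import("std");

    const Node = struct {
        value: i32,
        next: ?*Node,
    };

    const Stack = struct {
        top: ?*Node = null,
        allocator: std.mem.Allocator,

        pub fn push(self: *Stack, value: i32) !void {
            const node = try self.allocator.create(Node);
            node.* = .{ .value = value, .next = null };

            while (true) {
                // Atomically read the current top ...
                const old_head = @atomicLoad(?*Node, &self.top, .acquire);
                // ... link the new node in front of it (not shared yet, so no atomics needed) ...
                node.next = old_head;
                // ... and publish it, but only if top is still what we read above.
                if (@cmpxchgWeak(?*Node, &self.top, old_head, node, .release, .acquire) == null)
                    break; // success
                // Otherwise another thread won the race: loop and try again.
            }
        }
    };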

So now a question for you.

I said the top needs to be loaded atomically and we also use an atomic instruction here.

But why doesn't next need to be atomic?

So shouldn't this also here be an atomic value that we only modify atomically?

Any idea on why I implemented it how I did?

Yeah.

Oh, I didn't quite understand that.

Yeah, exactly.

So Next isn't shared currently.

Next just exists in this new memory space we just allocated, and nobody besides us can currently access this memory.

So Next does not need to be synchronized in any way.

We can just access it directly.

While, of course, Top can be accessed from other threads, which is why we need to

interact with it in an atomic way. Good catch. So, just for completeness, I also implemented pop, which looks pretty much the same. We also have our compare-and-swap loop, now directly here. We atomically load our current head, we temporarily store

the next, so the second element in our stack, and then we try to swap our current top with the next element we just recorded here, but only if it didn't change, so only if it still has the old value.

If this is true, then we exit the loop down here after deallocating the old head.

But if it's false, then we just rerun the whole loop again and try another time.

Remember, optimistic concurrency.
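
And a matching sketch of pop, continuing the assumptions from the push sketch above; note that this naive version ignores the memory-reuse (ABA) issue that comes up later in the talk:

    pub fn pop(self: *Stack) ?i32 {
        while (true) {
            // Atomically read the current head; an empty stack just returns null.
            const old_head = @atomicLoad(?*Node, &self.top, .acquire) orelse return null;
            const next = old_head.next; // temporarily remember the second element
            // Swing top over to `next`, but only if nobody changed it in the meantime.
            if (@cmpxchgWeak(?*Node, &self.top, old_head, next, .release, .acquire) == null) {
                const value = old_head.value;
                self.allocator.destroy(old_head); // deallocate the old head
                return value;
            }
            // CAS failed: another thread pushed or popped first, so try again.
        }
    }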

Good.

All right.

So now let's look at why we need those release and acquire.

And you also had acquire and release and acquire here.

So why do we need those two atomic order arguments?

I also provide or prepared a little intro example for that.

So imagine this is executed by thread one.

This is executed by thread two, while both can access the data variable.

I know, again, not completely syntactically correct, but it's true.

While for us, this looks, or at least at first glance, this looks completely fine and completely normal.

We set data, then set data ready.

So now this loop can exit and use the data.

This can lead to severe problems because no, it doesn't look right.

The compiler as well as the CPU itself can and will probably just reorder those statements.

Because while we clearly see a dependency between this data and this data ready, the compiler and the CPU probably won't see this dependency as it is not explicitly encoded in our program.

So this use of the data could, for example, be moved before this while loop to, I don't know, mask the access time or the load time that data needs, or whatever.

And then our program will already be faulty.

And this is what we obviously want to prevent with lock-free programming.
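
For reference, this is how the flag example can be written so that those reorderings are ruled out; it uses the acquire and release orderings explained in a moment (a sketch using std.atomic.Value, spellings assuming a recent Zig):

    const std = @import("std");

    var data: u32 = 0;
    var data_ready = std.atomic.Value(bool).init(false);

    // Thread 1: write the data, then publish the flag with release ordering,
    // so the write to `data` cannot be moved below the flag store.
    fn producer() void {
        data = 42;
        data_ready.store(true, .release);
    }

    // Thread 2: wait for the flag with acquire ordering, so the read of `data`
    // cannot be moved above the flag load.
    fn consumer() void {
        while (!data_ready.load(.acquire)) {
            std.atomic.spinLoopHint();
        }
        std.debug.print("data = {}\n", .{data});
    }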

As you can imagine, if we go back and look at the Zig code: if, I don't know, the load and the compare-and-swap instructions are interchanged or moved around, then it just doesn't make sense anymore. We are loading and comparing values that aren't current anymore. We need to prevent this, and therefore we have those two additional parameters.

So there are different modes in which we can order instructions. Those parameters prevent certain reorderings; for some cases we want to allow reorderings and for others we do not. So there are different ordering properties, or modes, that we can specify.

I am going into detail on three, but there are, I think, two more.

So there is unordered, which doesn't really provide any guarantees.

I think the Java virtual machine uses that for its shared variables.

There's also monotonic, which just makes sure that operations on the same memory address are always in a predefined order.

But it doesn't guarantee accesses to different memory addresses.

But we will look at acquire and release and acquire-release.

And now I already spoiled everything.

But those are the three main ones or most important ones.

And those are also the three that

power the lock-free stack that I have.

While I am a bit proud of this visualization, because I think it looks pretty nice, it doesn't really fulfill the purpose of explaining everything very well, which I unfortunately only noticed today while driving here to Munich. So there wasn't really time to redo everything, so I'll again try

to explain it briefly, and we will also look at the example again. But just to make it short: acquire is typically used when we want to access variables. So every time we load, or read, memory locations, we use acquire. This can be thought of like acquiring a lock, where we enter a section.

You want to make sure that everything that happens before it, we already captured in our read.

And we cannot move anything after the read.

Release is the exact opposite.

It's mostly used on store or write operations and can be thought of like an unlocking of a lock where everything after should be kept after it and not moved up in our critical section.

This is why the barrier is here and like this can't pass it downwards and here we cannot pass it.

upwards.

Acquire and release are also combinable where we just introduce two barriers where we can't move up and down.

This is for example useful for this atomic integer thing I showed earlier where we had this fetch add.

operation: as this does a load and a write, both in one atomic operation, it would also acquire and release at the same time, and therefore this acquire-release ordering exists.

There's another one that provides even more guarantees.

This would be called sequential consistency, which works pretty much like acquire-release.

So there are also two barriers which we can't cross in both directions.

But sequential consistency would also make sure that

atomic operations always occur in the same order.

So even if we would have multiple operations in here, they would always have a deterministic order to them.

You typically use sequential consistency when you don't have any idea what you should use, as it just makes sure that everything works as intended, but it obviously comes with a performance hit, as the CPU cannot optimize your code as well.

So if we then quickly go back to one of the Zig examples, maybe the push: we had two atomic operations here, the load and the compare and swap.

The load uses acquire because, as I said, load operations typically use acquire: we want to make sure that our compare and swap here always uses the most current top that's loaded there.

Where is this slide again?

Here, with acquire, any memory operation down here cannot move up through the barrier.

In this way, we always load the most current one.

So the compare and swap could not move before the load and access an outdated variable.

And then we have compare and swap itself.

We there use release and acquire.

So release would be used in a successful case.

So if we successfully compare and swap the value, we set a release barrier.

If we unsuccessfully compare and swap, so we have to try again, we again use acquire.

This is because we don't want this load to be moved

before the unsuccessful case, so that in the next iteration we again have the most current version of the loaded value up here.

And the release in the successful condition is used because we also want to signal to other threads

that we are now finished: the top pointer did get updated successfully, so other threads then have access to the most current value again, as they cannot pass this barrier upwards. This is roughly how those three orderings work. But as I said,

It's pretty complex in practice.

And if you're unsure, just throw sequential consistency on there and it will work just fine.

Good.

Until now we have only talked about the upsides of lock-free programming.

We don't have problems with contention, for example.

We don't have deadlocks.

We can have livelocks.

But if we are careful with our implementation, then we can implement it in a lock-free way which gets rid of livelocks.

So I just mentioned good things about lock-freeness.

But indeed, nothing is perfect.

And if lock-free programming were perfect, then why would we have locks anyway?

A pretty common example or a common problem that we can encounter with lock-free programming is the ABA problem.

And to make it less technical and have a little fun example, I generated three images of a person that's waiting in front of a red light.

So this is the first state that the person observes.

The traffic light is currently red, so it's blocked.

It cannot progress.

All right.

Then, I don't know, some time passes.

The light is red for a long time.

The person falls asleep.

But while they fall asleep, while they are descheduled, the light turns green, but without the person in the car noticing it.

And then, after they wake up again, the light turns red again, which they do notice. But for the person in the car, this state and that state are not distinguishable from each other; they do not know that the light turned green in the meantime.

So if you now try to translate this to our stack example, we can think of like a thread wanting to pop the top element.

Maybe I go back to the code.

Here, so thread A wants to pop the top element. It executes this function, but now, for whatever reason, before it tries to execute the compare and exchange for the first time, thread A gets descheduled, because thread B arrives. And thread B

pops. I would have loved to use the board for that. Thread B pops two values. Let me try something. Oops.

Okay, I'll try to explain it with text. So we have A and B and C. Now thread A has stored the top, so A is the top, and thread A stored it so that it can access A.next, which is B, later, to then set the top pointer to B.

Good.

Now thread B arrives and pops A, pops B, while thread A is still suspended and not doing anything, and then thread B pushes an A again.

And this does not have to be logically the same again.

The only thing that's relevant for our compare and swap operation is that this pointer has the same value.

And if you, I don't know, for example use an allocator that reuses memory, it can happen that another allocation for another node eventually ends up in the same spot

that our thread A remembered.

So now we have the problem that thread B is finished.

Our stack now looks like this.

And thread A tries to continue.

So thread A remembered this address.

This address, for whatever reason, now matches the, or we can also call this A prime, matches the address of our current top.

So for thread A, or for its compare and exchange operation, it looks like, whoops,

it looks just like this situation did for our person.

They observe the red light, but they do not know that something changed while they were suspended.

And I think in the stack example, this isn't such a huge problem, but this can have semantic issues because as the state changes,

and the thread doesn't notice it, it will act like the state didn't change, which can easily lead to corruption, especially if we are dealing with pointers. So keep this ABA problem in mind. There's also no cookbook recipe to avoid the ABA problem, so you have to evaluate and think on a case-by-case basis how to avoid it. But three pretty common

ways to avoid it are, for example, using tagged pointers. I would guess most systems are 8-byte aligned; this means that there are at least three bits that are always zero in a pointer value, which we can, for example, use to...

I have slides for that.

So this is the middle part.

We can use those three bits to, for example, store a version number.

And then every time any compare and swap operation succeeds, we just update this version number and compare it in thread A's compare and swap operation.

And then even if the address didn't change,

the tag or version number in the pointer did change, which would be noticed by the compare and swap operation and it wouldn't succeed.
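
A sketch of such a tagged pointer, assuming 8-byte-aligned nodes and reusing the Node type from the stack sketch above; the builtin names assume a recent Zig (older versions used @ptrToInt / @intToPtr):

    // The low three bits of an 8-byte-aligned address are always zero,
    // so they can carry a small version counter ("tag").
    const TAG_MASK: usize = 0b111;

    fn pack(node: *Node, tag: u3) usize {
        return @intFromPtr(node) | @as(usize, tag);
    }

    fn unpackNode(tagged: usize) *Node {
        return @ptrFromInt(tagged & ~TAG_MASK);
    }

    fn bumpTag(tagged: usize) u3 {
        // Bumped on every successful compare and swap, wrapping around.
        return @truncate((tagged & TAG_MASK) + 1);
    }

The stack's top would then be stored as an atomic usize holding the packed value, and every successful compare and swap installs pack(new_node, bumpTag(old)).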

This is also possible without pointer tagging.

So there exists a double compare and swap operation.

But I don't think that's the case for x86.

I think it's only in the ARM instruction set.

But there you can explicitly pass two pointers: our original pointer, with an expected value and a new value,

and now a version pointer, with an expected version and a new version, and it does those two compare and swaps atomically.

But as I said, I don't think that this operation is present in x86, which is why pointer tagging exists.

Another solution might be hazard pointers, but as I'm already five minutes over time, I'm going to skip that.

Great.

So now that you also have learned about a few downsides of lock-free programming, so mainly this ABA problem, we can finally think about whether lock-free programming really is the eierlegende Wollmilchsau of

synchronization.

So, to remember: when are locks good enough?

Locks obviously are pretty simple.

So when performance is not that critical, or there are only a few threads accessing a shared resource a few times, just throw a mutex on there and it's fine.

It's probably easier to understand for such simple cases than using a compare-and-swap loop.

Also, I said in the beginning that there are not only mutexes and that kind of stuff, but there also exist barriers and phasers, which are not only used for concurrent access but which can also coordinate threads in general.

So if you want to make sure that, I don't know, 10 threads stop at one coordinated point, and only when every thread arrives at this point can they continue together.

This would be a prime example for using locks.

So I'm unsure if the Zig standard library implements such barriers, but those aren't too hard to implement by yourself.

This wouldn't really be possible with lock-free programming.

But on the other hand, when is lock-free programming actually good?

So it typically has better performance, especially under oversubscription of threads.

So if there are way more threads than there are CPU cores, the scheduler has to block some threads.

And if it blocks threads that currently hold a lock,

or have acquired a lock,

this just makes things take even longer, since those threads cannot release the lock and have to be rescheduled again to release it.

This is where lock-free programming really shines.

It also is not susceptible to this priority inversion, which is sometimes pretty tough to notice and fix, and also not susceptible to deadlocks, or

well, livelocks maybe, but those can easily be avoided. And especially, we don't have any delays because of descheduling: as we do not have to wait, we can just try again, and that might be enough to then

succeed. Great. That's almost it. TL;DR: locks have real, often hidden problems. Non-blocking algorithms basically just use atomic operations; mainly there's compare and swap, but atomic load and store are also really important. Compare and swap inside a loop is the main building block of the different data structures.

And while they do have nice properties, like those freedom guarantees we saw earlier, they do also have problems. What doesn't have problems? So if I had to put one statement under this talk, it would be: don't be afraid to use locks, but also do not limit yourself to them.

There are real use cases out there where such high performance is required and low latency especially is required that also in practice sometimes log-free programming is used.

I know of at least one banking software provider that implemented their own insanely huge lock-free implementation of a shared queue, which is

pretty funny because it's written in Java.

Who the fuck uses Java to implement performance critical applications?

But it's really cool to see this in practice and to see how much throughput you can achieve with this kind of implementation.

So thanks for attending.

Thanks for listening.

I hope that the technical parts didn't scare you off and you at least got the gist of it.

And now you're free to ask questions.

Speaker 6

Yes.

So, thank you, Lukas.

This time, as it were, we will use a microphone for the questions.

Speaker 3

Okay, thank you.

Speaker 2

Up there, so you have to walk where there is...

Speaker 3

Yeah, so thank you for the great talk.

It was really, really great seeing you explain such hard topics in such a simple way.

When I think back 20 years ago, when I played with spin locks and all that stuff, I think I read that when you use the locked instructions on x86, that they basically pause the cache line synchronization between the cores.

And so when you heavily or aggressively spin in your while loops, you basically, don't you produce side effects on other cores?

Speaker 2

That's pretty technical.

I'm not totally sure or I haven't thought about that.

But you said that cache lines do get discarded with the lock instruction.

Did I get that right?

Speaker 1

I don't.

Speaker 6

I don't.

Speaker 2

Yeah, that could be the case, but I'm not a hundred percent sure about that.

Did you all get the question?

I'm not sure.

Okay.

Otherwise I could have repeated it.

Speaker 6

More questions.

Speaker 4

Hey, thank you.

That was really nice.

I want to ask you a question about your slide where you were discussing acquire semantics with the compare and swap.

The one with, yeah, so the one with the code.

So you mentioned that the load should have, yeah, this one, thank you.

This one, the load should have acquire semantics so that the compare exchange cannot be reordered above the load.

Yes.

But doesn't the compare exchange have a data dependency on old head preventing that from happening?

Speaker 2

Yeah, that's also true.

Speaker 4

But so then my question then is, does it matter to use acquire?

Can you then downgrade to a weaker one?

Speaker 2

Well, we would need to test that, but by just looking at it, yeah, it could be also fine to use monotonic.

Okay.

And to gain, I don't know, one cycle of performance improvement.

Yes.

Speaker 4

Yeah.

Okay, thank you.

That was helpful.

Speaker 6

Maybe one more question before we go into the break.

Speaker 2

I think if you speak loud enough, it might also be picked up by my mic.

Speaker 5

Is it possible to do reallocation in a wait-free way?

Because when I was playing around with wait-free or lock-free or whatever data structures, I was unable to do it.

Speaker 2

Just let me quickly pull up the slide with the different freedoms.

I don't find it.

So the question was if it is possible to implement allocations, a reallocation, in a wait-free way.

I also have not thought about that.

So you, you said that you once experimented with that and, um, yeah, I, I need to pass on that.

I'm sorry.

Right.

I also have a question.

Why the fuck isn't this working?

Did you have lectures in this hall?

Am I just using the board wrong?

I mean, you can't see really anything when I'm writing there.

Mind you...

I had like 10 examples that I would have used the board for.

Well, you know what you have to check next year.