I've got an implementation of a stock exchange that uses the LMAX disruptor pattern in C++ https://github.com/sneilan/stock-exchange
And a basic implementation of the LMAX disruptor as a couple C++ files https://github.com/sneilan/lmax-disruptor-tutorial
I've been looking to rebuild this in rust, however. In the C++ version I reached the point where I had implemented my own websocket protocol, authentication system, SSL, etc. Then I realized that memory management and dependencies are a lot easier in rust, especially for a one-man software project.
I briefly looked over your stock exchange code:
- For memory management, consider switching to std::shared_ptr. It won't slow anything down and will put that concern to rest entirely.
- For sockets, there are FOSS libraries that will outperform your code and save you a ton of headaches dealing with caveats and annoyances. For example, your loop over FD_ISSET is slower than e.g. epoll or kqueue (see the sketch after this list).
- For dependencies, C++ is definitely wilder than other languages. Dependencies are even harder to find than they are to manage. There's a lot of prospective library code, some of it hidden in little forgotten folds of the Internet. Finding it is basically a skill unto itself, one that can pay off handsomely.
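For reference, a minimal sketch of the epoll pattern from the second point (Linux-only, error handling omitted; listen_fd is assumed to be an already-bound, listening TCP socket):

```cpp
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

// Level-triggered epoll loop: no O(n) FD_ISSET scan per wakeup,
// the kernel hands back only the descriptors that are ready.
void event_loop(int listen_fd) {
    int epfd = epoll_create1(0);
    epoll_event ev{};
    ev.events = EPOLLIN;
    ev.data.fd = listen_fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

    epoll_event ready[64];
    for (;;) {
        int n = epoll_wait(epfd, ready, 64, -1);
        for (int i = 0; i < n; ++i) {
            if (ready[i].data.fd == listen_fd) {
                int client = accept(listen_fd, nullptr, nullptr);
                epoll_event cev{};
                cev.events = EPOLLIN;
                cev.data.fd = client;
                epoll_ctl(epfd, EPOLL_CTL_ADD, client, &cev);
            } else {
                char buf[4096];
                ssize_t r = read(ready[i].data.fd, buf, sizeof buf);
                if (r <= 0) close(ready[i].data.fd); // peer gone
                // otherwise hand buf off to the protocol layer
            }
        }
    }
}
```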
I did not know std::shared_ptr would not slow things down. I've learned something new today! :)
Yes, I agree, epoll is a lot better than FD_ISSET.
Maybe I can keep moving with my C++ code but do people still trust C++ projects anymore? My ideal use case is a hobbyist who wants a toy stock exchange to run directly in AWS. I felt that C++ has a lot of bad publicity and if I want anyone to trust/try my code I would have to rebuild it in rust.
Here's how to maximize shared_ptr performance:
- In function signatures, use const references: foo(const std::shared_ptr<bar> &p). This will prevent unnecessary bumps of the refcount.
- If you have an inner loop copying a lot of pointers around, you can dereference the shared_ptrs to raw pointers. This is 100% safe provided that the shared_ptr continues to exist in the meantime. I would consider this an optimization and an edge case, though. (Both points are sketched after this list.)
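A minimal sketch of both points, with bar as a placeholder type:

```cpp
#include <memory>

struct bar { int n = 0; void tick() { ++n; } };

// Point 1: const reference -- no copy, no refcount bump at the call site.
void foo(const std::shared_ptr<bar> &p) {
    p->tick();
}

// Point 2: in a hot loop, work through the raw pointer. Safe as long as
// `owner` outlives the loop and keeps the object alive.
void hot_loop(const std::shared_ptr<bar> &owner, int iterations) {
    bar *raw = owner.get();
    for (int i = 0; i < iterations; ++i)
        raw->tick();   // no atomic refcount traffic per iteration
}
```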
I would say people trust C++ projects at least as much as any other professional language - more so if you prove that you know what you're doing.
This advice doesn't seem quite right to me, and in my codebases I strictly forbid passing shared_ptr by const reference. If you don't need to share ownership of bar, you take a plain const bar &. If you do need to share ownership of bar, you take the shared_ptr by value (both signatures are sketched below).

Why do we pass by value when sharing ownership? Because it allows for move semantics: the caller gets the option to make a copy, which bumps up the reference count, or to avoid any copy whatsoever and transfer ownership without bumping the reference count.

Having said that, shared_ptrs do have their uses, but they are very, very rare, and almost all of our use cases do not expose shared_ptrs in the public API but rather use them as an implementation detail. We use them almost exclusively for things like immutable data structures, copy-on-write semantics, or as part of a lock-free data structure.
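Roughly, the two signatures in question (a sketch; bar is a placeholder type):

```cpp
#include <memory>
#include <utility>

struct bar { /* ... */ };

// No shared ownership: the callee just uses the object.
void foo(const bar &b) { /* use b */ }

// Shared ownership: take the shared_ptr by value, so the caller chooses
// between copying and moving.
void foo(std::shared_ptr<bar> b) { /* store b somewhere */ }

void caller(std::shared_ptr<bar> p) {
    foo(*p);            // use only: no refcount traffic
    foo(p);             // copy: refcount bump, both sides own it
    foo(std::move(p));  // move: ownership transferred, no bump
}
```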
foo(const bar&) is ideal if you precisely wish to bar ownership. If (and in many kinds of projects, invariably it's more like when) you later decide to share ownership, or if nullptr is an option, then it's no good.
foo(std::shared_ptr<bar>) copy-constructs its parameter as part of your function call (bumping the refcount) unless the argument is a temporary or is explicitly moved. It's only ideal if you almost always pass newly instantiated objects.
Pass by const reference is the sweet spot. If you absolutely must minimize the refcount bumps, overload by const reference and by rvalue.
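A sketch of that overload pair (widget and bar are placeholders; the callee captures the pointer either way):

```cpp
#include <memory>
#include <utility>

struct bar { /* ... */ };

class widget {
    std::shared_ptr<bar> held_;
public:
    // lvalue callers: a refcount bump happens only when we actually copy
    void set(const std::shared_ptr<bar> &p) { held_ = p; }

    // rvalue callers: no refcount bump at all, ownership is moved in
    void set(std::shared_ptr<bar> &&p)      { held_ = std::move(p); }
};
```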
As for shared_ptrs being very rare, uh, no. We use them by the truckload. To each their own!
foo(const bar&) is ideal if you precisely wish to bar ownership.
What?
invariably it's more like when) you later decide to share ownership,
shared_ptr shouldn't even be necessary for keeping track of single threaded scope based ownership.
As for shared_ptrs being very rare, uh, no. We use them by the truckload. To each their own!
You might want to look into that, you shouldn't need to count references in single threaded scope based ownership. If you need something to last longer, make its ownership higher scoped.
If something already works it works, but this is not necessary and is avoiding understanding the actual scope of variables.
The context you're missing is in your post's GP. That poster holds a std::shared_ptr<bar> (for whatever perfectly valid reason) and wishes to pass it to foo(). However, he declares it as foo(const bar &) because the callee does not need to share in the ownership of the shared_ptr. That means it gets called as foo(*p.get()).
That's the incorrect assumption that you came to. Obviously if bar is only used for stack-based scope variables, no shared_ptr is needed.
The context you're missing is in your post's GP.
I didn't miss any of that, that's exactly what I thought it meant. I just don't know what you mean by "precisely wish to bar ownership".
That's the incorrect assumption that you came to.
Prove it. In a single threaded program with scope based ownership, that shared_ptr is going to be freed somewhere, so why not just have it exist in that scope as a unique_ptr so the ownership scope is clear?
Obviously if bar is only used for stack-based scope variables, no shared_ptr is needed.
Are you saying I'm wrong then saying the exact thing I just said?
If foo() doesn't need to share ownership now but may need to later, declaring it as foo(const std::shared_ptr<bar> &) instead of foo(const bar &) allows this change without revising any prototypes. However, if we precisely wish to prohibit shared ownership by foo(), we can do so by declaring it as foo(const bar &).
The incorrect assumption that you came to is that we were talking about stack variables. But anyway, here's an example that's both scope-based and single threaded: a std::vector<std::shared_ptr<bar> > v, with an algorithm that selectively adds our pointer to the vector one or more times, and another that duplicates and removes elements according to some ongoing criteria (sketched below). The object gets deleted when no pointer instances are left in the vector.

In practice most code is multithreaded (not that it matters) and most shared_ptrs are held inside other objects (not that it makes a difference either).
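A sketch of that shape (single-threaded, everything in one scope; the add/duplicate/remove criteria are stand-ins):

```cpp
#include <memory>
#include <vector>

struct bar { int id = 0; };

int main() {
    std::vector<std::shared_ptr<bar>> v;   // lives in this scope
    auto p = std::make_shared<bar>();

    // one algorithm selectively adds the pointer one or more times
    for (int i = 0; i < 3; ++i)
        if (i % 2 == 0) v.push_back(p);

    p.reset();  // our local handle is gone; the vector's copies keep bar alive

    // another duplicates and removes elements according to some criterion
    if (!v.empty()) v.push_back(v.front());  // duplicate
    v.erase(v.begin());                      // remove
    v.clear();  // last copies gone here -> bar is destroyed now, not at
                // the end of any particular lexical scope
}
```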
I'm saying you misunderstood and now I clarified again. I'm at the troll-detection threshold, so this is my last clarification. Take care!
If foo() doesn't need to share ownership now but may need to later,
This doesn't make sense. Why would a function in a more narrow scope need to take or share ownership? The variable passed will persist before and after the function call.
The incorrect assumption that you came to is that we were talking about stack variables.
I don't know what this means here. I said scope, you are saying stack. If lifetime is being dictated purely by scope, you don't need shared_ptr, you can just use a unique_ptr at the correct scope.
But anyways, here's an example that's both scope-based and single threaded: std::vector<std::shared_ptr<bar> > v;
This isn't scope based lifetime, this is lifetime based on some other criteria than going out of scope. I will show you with what you wrote:
and another that duplicates and removes elements according to some ongoing criteria.
Not sure if this is what GP was getting at, but in games a shared pointer (or similar custom resource handle) to something like a texture can be very useful - you want to share resources between objects in the scene so that you don't load a thing from disk when it's already in memory, but don't want to hold onto it forever to save RAM.
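For example, a sketch of that kind of cache (names are made up; weak_ptr lets the cache observe a texture without keeping it alive):

```cpp
#include <map>
#include <memory>
#include <string>

struct Texture { /* pixel data, GPU handle, ... */ };

// stand-in for the real loader
std::shared_ptr<Texture> load_from_disk(const std::string &) {
    return std::make_shared<Texture>();
}

class TextureCache {
    std::map<std::string, std::weak_ptr<Texture>> cache_;
public:
    std::shared_ptr<Texture> get(const std::string &path) {
        if (auto t = cache_[path].lock())
            return t;                       // already in memory: share it
        auto t = load_from_disk(path);      // not resident: load it
        cache_[path] = t;                   // remember it, but don't own it
        return t;                           // freed once the last scene object lets go
    }
};
```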
Very true, and in between threads and in other areas too, but this isn't scope based ownership, it's completely separate from the scope hierarchy of the program.
Scope based ownership would be a unique_ptr that frees the heap memory when it goes out of scope (if it isn't moved).
Exactly!
There is at least one use case I can think of: the function may copy the shared_ptr, but you want to avoid touching the reference count for the (frequent) case where it doesn't. This is an edge case, though, and personally I almost never do it.
Additionally: if you care about nullability semantics within your function, then you write foo(const bar*) and pass in bar_ptr.get(), and of course check that the value is != nullptr before dereferencing it.
Otherwise, I'm inclined to agree -- don't pass around smart pointers unless you're actually expressing ownership semantics. Atomics aren't free, ref-counting isn't free, but sometimes that genuinely is the correct abstraction for what you want to do.
One more point: shared ownership should not be used as a replacement for carefully considering your ownership model.
(For readers who might not be as familiar with ownership in the context of memory management: ownership is the notion that an object's lifetime is constrained to a given context (e.g. a scope or a different object -- for instance, a web server would typically own its listening sockets and any of its modules), and using that to provide guarantees that an object will be live in subcontexts. Exclusive ownership (often, in the form of unique_ptr) tends to make those guarantees easier to reason about, as shared ownership requires that you consider every live owning context in order to reason about when an object is destroyed. Circular reference? Congrats, you've introduced a memory leak; better break the cycle with weak_ptr.)
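A minimal illustration of that last point:

```cpp
#include <memory>

struct Child;

struct Parent {
    std::shared_ptr<Child> child;
};

struct Child {
    // std::shared_ptr<Parent> parent;   // cycle: neither object is ever freed
    std::weak_ptr<Parent> parent;        // break it: Child observes, doesn't own
};

int main() {
    auto p = std::make_shared<Parent>();
    p->child = std::make_shared<Child>();
    p->child->parent = p;
    // both objects are destroyed when p goes out of scope
}
```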
Let me preface by noting that I don't necessarily disagree with any part of what you wrote. However, there are design patterns that exceed the guardrails you're thinking of, and those are the patterns that benefit the most from shared_ptr.
Typically, they involve fine- to medium-grained objects, particularly those that have dynamic state (meaning by-value copies are not an option.)
An example might be a FlightAware-like system where each plane has a dynamically-updated position:
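Something along these lines (a sketch reconstructing the names used below; the exact signatures and the Pos type are assumptions):

```cpp
#include <memory>
#include <utility>
#include <vector>

struct Pos { double lat = 0, lon = 0, alt = 0; };

class Plane {
    Pos pos_;
public:
    void UpdatePosition(const Pos &p) { pos_ = p; }  // assume made thread-safe
    Pos  GetPosition() const          { return pos_; }
};

using PlaneVec = std::vector<std::shared_ptr<Plane>>;

class Updater {
    PlaneVec planes_;   // shares ownership of every plane it feeds
public:
    explicit Updater(PlaneVec p) : planes_(std::move(p)) {}
    void Tick(const Pos &p) { for (auto &pl : planes_) pl->UpdatePosition(p); }
};

class View {
    PlaneVec planes_;   // e.g. just the Delta or United flights
public:
    explicit View(PlaneVec p) : planes_(std::move(p)) {}
    // only calls const methods such as GetPosition()
};
```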
Updater routinely calls UpdatePosition(), whereas View only calls const methods on Plane such as GetPosition(). There can be a View for, say, Delta flights and one for United. Let's simplify by assuming that planes are in the sky forever and don't get added or removed.

Destructing Updater doesn't affect Views and vice-versa. Everything is automatically thread-safe as long as the Pos accesses inside each Plane are thread-safe.
The key here is that Plane is fine-grained enough and inconsequential enough for lazy ownership to be ideal.
If planes are around forever, wouldn't you be better off interning them? e.g. having a single global std::vector<Plane> (or std::array<Plane, N>) and passing around offsets in that array? And your PlaneVec would just be a glorified std::vector<size_t> (or int)? I don't see any value in maintaining a reference count if you're never intending to clean up these objects.
(The argument for using int here would be if you always have fewer than 2 billion planes, and so you can store a PlaneVec in less space. size_t is indistinguishable from Plane* in this context; you have the same number of indirections either way.)
As I said, shared ownership has its uses, but most instances I've seen could have been replaced with a different model, which would have made memory leaks and use-after-free less painful to debug.
Definitely not. :) I added that restriction just to sidestep the need to add and remove planes to/from existing Updaters and Views. Besides, you might have an Updater for US flights and one for European ones, and a View might span worldwide Delta flights or just US ones. Updaters and Views might come and go dynamically. The reference counting is key.
In this example, it doesn't matter when Planes get cleaned up, but it does matter that they do. A better alternative than the one you're proposing would be to just use vector<Plane *> and leak the planes, but that's crappy for different reasons (e.g. long-term memory usage and it would bar Plane from, say, RAII-owning its own log file.)
My experience is the opposite. It has to do with the coarseness of the objects involved and the amount of inter-object links. We typically have a vast variety of classes. Many of them have shared_ptr members, resulting in rich graphs.
Many methods capture the shared_ptr parameters by copying them inside other objects. However, many methods just want to call a couple methods on the passed-in object, without capturing it. By standardizing on const shared_ptr &, all calls are alike, and callees can change over time (e.g. from not capturing to capturing.)
What if the callee sometimes wants to get a reference count and sometimes doesn't? In the latter case, your proposed signature forces an unnecessary pair of atomic reference count operations. But if you use a plain reference or raw pointer instead, then foo can't acquire a reference even when it wants to.

You could stick `std::enable_shared_from_this` under `bar`. But `std::enable_shared_from_this` adds a machine word of memory, so you might not want to do that.
If you pass a const shared_ptr<bar> & instead, you incur an extra pointer chase in the callee. Sure, you could pass the dereferenced pointer as a second argument, but then you burn an extra argument register. You can't win, can you?

You can win, actually. Just use https://www.boost.org/doc/libs/1_85_0/libs/smart_ptr/doc/htm... or your favorite intrusive reference-counted smart pointer, not `std::shared_ptr`. If you do, you get the same capabilities that `std::enable_shared_from_this` grants but without any of the downsides.
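For reference, the intrusive approach looks roughly like this with boost::intrusive_ptr (a sketch; the refcount lives inside bar itself, so a raw bar* is enough to mint a new owning handle):

```cpp
#include <atomic>
#include <boost/intrusive_ptr.hpp>

struct bar {
    std::atomic<int> refs{0};
    // ... actual payload ...
};

// boost::intrusive_ptr drives the count through these two hooks
inline void intrusive_ptr_add_ref(bar *p) { p->refs.fetch_add(1, std::memory_order_relaxed); }
inline void intrusive_ptr_release(bar *p) {
    if (p->refs.fetch_sub(1, std::memory_order_acq_rel) == 1)
        delete p;
}

// The callee can take a plain bar* (one register, one pointer chase)
// and still acquire ownership if and when it decides to keep the object.
void foo(bar *b) {
    boost::intrusive_ptr<bar> keep(b);   // bumps the embedded count
    // ...
}
```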
Actually this is usually not the case (assuming of course that caller is holding the original pointer in a shared_ptr<bar> which is the use case we were discussing.)
That shared_ptr<bar> instance is held either on the stack (with address FP + offset or SP + offset) or inside another object (typically 'this' + offset.) To call foo(const shared_ptr<bar> &), the compiler adds the base pointer and offset together, then passes the result of that addition - without dereferencing it.
So as it turns out, you may actually have one fewer pointer chase in the const shared_ptr<bar> & case. For example, if foo() is a virtual method and a specific implementation happens to ignore the parameter, neither the caller nor the callee ever dereference the pointer.
The one exception is if you've already resolved the underlying bar& for an unrelated reason in the caller.
I do agree that intrusive_ptr is nice (and we actually have one codebase that uses something very similar.) However shared_ptr has become the standard idiom, and boost can be a hard sell engineering-wise.
You're overthinking it. Think in cache lines. No matter what the instructions say, with all their fancy addressing modes, foo has to load two cache lines: one holding the shared_ptr and another holding the pointee data. If we instead passed bar* in a register, we'd need to grab only one cache line: the pointee's.
Sure. Maybe the caller already has a fully formed shared_ptr around somewhere but not in cache. Maybe foo often doesn't access the pointee. But how often does this situation arise?
The cache doesn't make a difference here. To clarify: we start with a shared_ptr<bar> instance. It must get dereferenced to be used. It must either be dereferenced by the caller (the const bar & contract) or by the callee (the const shared_ptr<bar> & contract).
If the caller dereferences it, it might turn out to be superfluous if the callee wasn't actually going to use it. In this case const shared_ptr<bar> & is more efficient.
However, if the caller happened to have already dereferenced it prior to the call, one dereferencing would be avoided. In this case const bar & is more efficient.
This is where our misunderstanding is. The caller starts out by only having a shared_ptr. Someone (caller or callee) has to dig the bar * out.
Sure. I've just only rarely encountered the situation you're describing. When it comes up, passing a shared_ptr by const reference is fine --- just so long as it's documented and the code explains "Look, I know this looks like a newbie error, but in context, it's actually an optimization. Here's why..." lest someone "optimize" away the deferred load win.
If you _maybe_ need to share ownership, the second is a little pessimistic - you always increase the ref count.
That is correct and I can see that being a justification for passing a const&, in fact the C++ Core Guidelines agree with you that such a scenario is the only acceptable reason for passing a shared_ptr by const&, although they encourage passing by value, or just passing a const T&.
Obviously the correct way is to accept a templated type and use perfect forwarding. /s /s /s /s /s
Bonus points for using static_cast<T &&>(p) instead of std::forward<T>(p) ;)
That's not true. It does slow things down because it has an atomic access. How slow depends on the platform.
unique_ptr does not slow things down.
std::shared_ptr definitely slows things down. It's non-intrusive, and therefore requires a memory indirection: the reference count lives in a separately allocated control block.
C++ might have a bad reputation, but in many fields the only alternative, in terms of ecosystem, tooling, and tribal knowledge, is C.
Between those two, I'd rather pick the "TypeScript for C" one.
Reference counting definitely slows down tight loops if you are not careful.
The way to avoid that in low latency code is to break the abstraction and operate with the raw pointer in the few areas where this could be a bottleneck.
It is usually not a bottleneck if your code is decently exploiting IPC; an extra addition or subtraction easily gets executed while some other operation is waiting a cycle for some CPU resource.
C++ gets bad publicity only from evangelists of the flavour-of-the-month self-described "successor of C++". They don't have a sales pitch beyond "C++ bad" and that's what they try to milk.
And yet the world runs on C++.
When I did low latency everyone was offloading TCP to dedicated hardware.
They would shut down every single process on the server and bind the trading app to the CPUs during trading hours to ensure nothing interrupted.
Signals in cables travel slower than light, so they would rent server space at the exchange; that way they had direct access to the exchange network and didn't have to traverse miles of cable to send their orders.
They would multicast their traffic and there were separate systems to receive the multicast, log packets, and write orders to databases. There were redundant trading servers that would monitor the multicast traffic so that if they had to take over they would know all of the open positions and orders.
They did all of their testing against simulators - never against live data or even the exchange test systems. They had a petabyte of exchange data they could play back to verify their code worked and to see if tweaks to the algorithm yielded better or worse trading decisions over time.
A solid understanding of the underlying hardware was required, you would make sure network interfaces were arranged in a way they wouldn't cause contention on the PCI bus. You usually had separate interfaces for market data and orders.
All changes were done after exchange hours once trades had been submitted to the back office. The IT department was responsible for reimbursing traders for any losses caused by IT activity - there were shady traders who would look for IT problems and bank them up so they could blame a bad trade on them at some future time.
Any good books/resources you can recommend to learn about the above architectures/techniques?
Some years ago I wrote a gist about HFT/HPC systems patterns (versus OP's C++ patterns) applied to dockerized Redis. Might be dated, but it touches on core isolation/pinning, NUMA/cgroups, and kernel bypass, with some links to go deeper. Nowadays I do it with Kubernetes and Nomad facilities, but same basic ideas:
https://gist.github.com/neomantra/3c9b89887d19be6fa5708bf401...
Nice; reminds me of the Redhat Performance Tuning and Real Time Low Latency Optimization guides.
A few episodes of Signals and Threads, a podcast from Jane Street, go into parts of it.
Thank You.
You don't need to shut down processes on the server. All you have to do is isolate CPU cores and move your workloads onto those cores. That's been a common practice in low latency networking for decades.
I'm not in HFT, but I wouldn't expect that to be enough.
Not only do you want to isolate cores, you want to isolate any shared cache between cores. You do not want your critical data ejected from the cache because a different core sharing the cache has decided it needs that cache. Which of course starts with knowing exactly what CPU you are using since different ones have different cache layouts.
You also don't want those other cores using up precious main memory or IO bandwidth at the moment you need it.
Just to add to your good points: since there's always a faster cache for your working set to not fit in, you can use memory streaming instructions to reduce cache pollution. Depending on the algorithm, increasing cache hit rates can give ridiculous speed-ups.
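For example, on x86 the non-temporal store intrinsics write around the cache (a sketch; dst is assumed to be 16-byte aligned as _mm_stream_ps requires):

```cpp
#include <immintrin.h>
#include <cstddef>

// Fill a large buffer without pulling it through the cache hierarchy,
// leaving the cache to the latency-critical working set.
void fill_streaming(float *dst, std::size_t count, float value) {
    const __m128 v = _mm_set1_ps(value);
    std::size_t i = 0;
    for (; i + 4 <= count; i += 4)
        _mm_stream_ps(dst + i, v);   // non-temporal store, bypasses cache
    for (; i < count; ++i)           // scalar tail
        dst[i] = value;
    _mm_sfence();                    // make the streamed stores visible
}
```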
Correct. I was just pointing out to OP that moving processes is not worthwhile and isolation is how you'd do it
I’ve worked at a few firms and never heard of an IT budget for f-ups. Sounds like a toxic work environment.
Depends on how it's set up. You take a chunk of profits as well if things go well.
Same. That sounds like a way to make that relationship between front office and back office as toxic and unproductive as possible.
A great insightful comment, thank you!
fun fact: the original LMAX was designed for and written in Java.
https://martinfowler.com/articles/lmax.html
I think it made sense at the time. From what I understand, you can make Java run as fast as C++ if you're careful with it and use JIT. However, I have never tried such a thing and my source is hearsay from friends who have worked in financial institutions. Then you get the added benefit of the Java ecosystem.
From my hearsay, you absolutely can, given two things: fewer pointer-chasing data structures, and, most crucially, fewer or no allocations. Pre-allocate arrays of things you need, run ring buffers on them if you have to use a varying number of things.
A fun but practical approach which I again heard (second-hand) to be used is just drowning your code in physical RAM and switching the GC completely off. Have enough RAM to run a trading day, then reboot. The cost is trivial, compared to spending engineering hours on different approaches.
I worked in trading and we did the first one, in C++. We'd load all the instruments (stocks etc.) on startup to preallocate the "universe", and use ring buffers as queues. Instruments don't change during trading hours so restarting daily to pick up the new data is enough.
I saw a Java team do the second one in an order router (a system that connects to various exchanges and routes+translates orders for each exchange's requirements), and they wrote an interesting retrospective doc where they basically said it wasn't worth it - it caused a lot of trouble without giving a significant edge in performance. YMMV! That was around 2012.
I honestly don't know why the real time trading devs don't make their own OS/programming language for this. It's not like they don't have the money.
That is basically what they do when deploying into FPGAs.
Making your own OS or language is hard, if you care about both performance and correctness.
But HFT people do a lot of OS-level hacking, squeezing the last bits of performance from the kernel where the kernel is needed, and/or running parts of the kernel stack (like networking) in userspace, avoiding context switches. CPU core pinning, offloading of everything possible to the network cards, etc, goes without saying.
So literally remove as much allocation as possible and then reboot each day. That makes a lot of sense to me!
Quite fitting in a thread about HFT that has already referenced game development as a parallel universe of similar techniques.
In the feature phone era, mobile phone games were written in Java (well, a subset: Java ME). Practically all games followed the same memory management strategy. They allocated the memory they needed during the early initialisation, and then never again during the actual runtime of the game. That was the only way to retain even a semblance of performance.
None of the Java libs that you use can ever do an allocation -- ever! So you don't really get that much benefit from the Java ecosystem (other than IDEs). You have to audit the code you use to make sure allocations never happen on the critical path.
Fifteen years ago, the USN's DDX software program learned this the hard way when they had a hard real-time requirement in the milliseconds.
In my experience: Allocation is OK, but garbage collection is bad.
I think back then the GC defaulted to potentially running at allocation time.
shared_ptr is a much better solution for garbage collection, one I wish Java had implemented.
I think CPython does reference counting for its memory management, but it still has to run a GC since reference counting does not handle reference cycles.
You say it's easier in Rust, but you still have a complete C++ implementation and not a Rust one. :)
It took me about a year to build all of this stuff in C++. So I imagine since I've had to learn rust, it will probably take me the same amount of time if I can save time with dependencies.
In my experience, once you know the problem really well, yes you're right.
If you are building a complex prototype from scratch, you'll usually spend more time fighting the Rust compiler than trying out alternate design decisions.
I do know the problem domain well. My second iteration in rust already almost has an order book implemented.
Writing code in rust however is very fun! (At least so far lol)
Writing Rust is fun in the sense that solving a puzzle is fun.
Great when that's what you set out to do (rewrite it in Rust^TM); it can get annoying otherwise (solving/researching a problem).
Better than C++? Have you experienced any problems in Rust that C++ instantly solved? Not because I'm pro/anti rust/C++ here, just curious.
Linus said he wouldn't have started Linux if Unix had been ready at that time.
Minix....
Don't know why so many downvotes, probably because I confused Unix with "GNU kernel".
Here's the source (Jan 29, 1992):
https://groups.google.com/g/comp.os.minix/c/wlhw16QWltI/m/P8...

It's not easy to get data structures like this right in C++. There are a couple of problems with your implementation of the queue. Memory accesses can be reordered by both the compiler and the CPU, so you should use std::atomic for your producer and consumer positions to get the barriers described in the original LMAX Disruptor paper. In the get method, you're returning a pointer to the element within the queue after bumping the consumer position (which frees the slot for the producer), so it can get overwritten while the user is accessing it. And then your producer and consumer positions will most likely end up in the same cache line, leading to false sharing.
I did not realize this. Thank you so much for pointing this out. I'm going to take a look.
Yes, it is hard to get these data structures right. I used Martin Fowler's description of the LMAX algorithm which did not mention atomic. https://martinfowler.com/articles/lmax.html I'll check out the paper.
I sincerely doubt the big HFT firms use anything of Fowler’s. Their optimizations are down to making their own hardware. LL is very context dependent and Amdahl’s law applies here.
I have absolutely no idea how this works in Java, but in C++, there are a few reasons you need std::atomic here:
1. You need to make sure that modifying the producer/consumer position is actually atomic. This may end up being the same instruction that the compiler would use for modifying a non-atomic variable, but that will depend on your target architecture and the size of the data type. Without std::atomic, it may also generate multiple instructions to implement that load/store or use an instruction which is non-atomic at the CPU level. See [1] for more information.
2. You're using positions for synchronization between the producer and consumer. When incrementing the reader position, you're basically freeing a slot for the producer, which means that you need to make sure all reads happen before you do it. When incrementing the producer position, you're indicating that the slot is ready to be consumed, so you need to make sure that all the stores to that slot happen before that. Things may go wrong here due to reordering by the compiler or by the CPU [2], so you need to instruct both that a certain memory ordering is required here. Reordering by the compiler can be prevented using a compiler-level memory barrier - asm volatile("" ::: "memory"). Depending on your CPU architecture, you may or may not need to add a memory barrier instruction as well to prevent reordering by the CPU at runtime. The good news is that std::atomic does all that for you if you pick the right memory ordering, and by default, it uses the strongest one (sequentially-consistent ordering). I think in this particular case you could relax the constraints a bit and use memory_order_acquire on the consumer side and memory_order_release on the producer side [3].
[1] https://preshing.com/20130618/atomic-vs-non-atomic-operation...
[2] https://en.wikipedia.org/wiki/Memory_ordering
[3] https://en.cppreference.com/w/cpp/atomic/memory_order
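Putting those points together, here's a minimal single-producer/single-consumer sketch (not the original code; capacity is a power of two, the two positions live on separate cache lines to avoid false sharing, the element is copied out before the consumer index is bumped, and acquire/release is used as in [3]):

```cpp
#include <atomic>
#include <cstddef>

template <typename T, std::size_t N>   // N must be a power of two
class SpscRing {
    static_assert((N & (N - 1)) == 0, "N must be a power of two");
    T buf_[N];
    alignas(64) std::atomic<std::size_t> head_{0};  // written only by the producer
    alignas(64) std::atomic<std::size_t> tail_{0};  // written only by the consumer
public:
    bool push(const T &v) {
        std::size_t h = head_.load(std::memory_order_relaxed);
        if (h - tail_.load(std::memory_order_acquire) == N) return false; // full
        buf_[h & (N - 1)] = v;
        head_.store(h + 1, std::memory_order_release);  // publish the slot
        return true;
    }
    bool pop(T &out) {
        std::size_t t = tail_.load(std::memory_order_relaxed);
        if (head_.load(std::memory_order_acquire) == t) return false;     // empty
        out = buf_[t & (N - 1)];                        // copy out *before* freeing the slot
        tail_.store(t + 1, std::memory_order_release);  // now the producer may reuse it
        return true;
    }
};
```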
Fowler's implementation is written in Java, which has a different memory model from C++. For another example of the Java memory model vs a different language, see Jon Gjengset's port of ConcurrentHashMap to Rust.
I'll have to take a look at the code. Maybe I can integrate it into https://github.com/rburkholder/trade-frame
The LMAX disruptor is a great data structure when you have threads bound to cores and most/all of them are uncontended. It has some terrible pathologies at the tails if you aren't using this pattern. Threads getting descheduled at bad times can really hurt.
SPSC ring buffers are going to be hard to beat for the system you are thinking of, and you can likely also implement work stealing using good old locks if you need it.
Instead of indexing the buffer with a division/modulo, you can mask with the capacity minus one, i.e. replace the modulo with a bitwise AND, saving a bit of computation (see the sketch below). This requires that the size of the ringbuffer is a power of two.

What's more, you get to use sequence numbers over the full range of e.g. uint64_t. Wraparound is automatic. You can easily subtract two sequence numbers, and this will work without a problem even accounting for wraparound. And you won't have to deal with stupid problems like having to leave one empty slot in the buffer because you would otherwise not be able to discern a full buffer from an empty one.
Naturally, you'll still want to be careful that the window of "live" sequence numbers never exceeds the size of your ringbuffer "window".
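A sketch of the indexing trick (capacity must be a power of two; sequence numbers are free-running uint64_t):

```cpp
#include <cstdint>

constexpr std::uint64_t kCapacity = 1u << 16;               // power of two
static_assert((kCapacity & (kCapacity - 1)) == 0, "power of two");

// Instead of:   slot = seq % kCapacity;        (division/modulo)
// use:          slot = seq & (kCapacity - 1);  (single AND)
inline std::uint64_t slot_of(std::uint64_t seq) {
    return seq & (kCapacity - 1);
}

// Sequence numbers run over the full uint64_t range; wraparound is automatic,
// and differences stay correct even across the wrap:
inline std::uint64_t in_flight(std::uint64_t producer_seq, std::uint64_t consumer_seq) {
    return producer_seq - consumer_seq;  // valid as long as it never exceeds kCapacity
}
```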