Last month, I wrote a little about using a toy problem to explore Python’s async/await concurrency model. I’ve been meaning to learn Rust for ages, and this seemed a nice size to experiment with. I ported that program over to Rust, and the whole thing is up on GitHub for your enjoyment.
But I’ve run into an interesting problem: the version using the standard library’s Arc/Mutex concurrency is incredibly slow. A rewrite using the crossbeam library’s atomic data types is much faster, even though I’d expect both to be built on the same primitives. Not only is the Arc/Mutex version slow, but it gets much slower over time, as if something’s leaking badly. I welcome insight into what I could be missing. After all, this is my very first Rust program; perhaps I’m using Arc in a very improper way.
Let’s look at some examples of how the standard Arc/Mutex code differs from crossbeam.
initialization
Initialization is pretty similar between them. The Arc/Mutex version imports some standard libraries and sets up the struct that will be shared between the computer and inspector threads:
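// Arc/Mutex version
use std::f64::consts::PI;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::{Duration, SystemTime};
const TARGET: u32 = 10; // tocks per second the inspector aims for
const GUESS: u32 = 100; // initial ticks (inner-loop bound) per tock
// State shared between the computer and inspector threads.
struct Leibniz {
    x: f64,     // running partial sum; 4 * x approximates PI
    d: f64,     // next odd denominator in the series
    ticks: u32, // inner-loop bound, tuned by the inspector
    tocks: u32, // outer-loop passes since the inspector last reset it
}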
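To make the roles of x and d concrete: each pass of the inner loop adds one positive and one negative term of the Leibniz series pi/4 = 1 - 1/3 + 1/5 - ..., so 4 * x creeps toward pi. The sketch below runs that update in isolation, with no threads. The names `leibniz_pairs` and `pi_error` are made up for illustration and are not in the original program.

```rust
use std::f64::consts::PI;

/// Runs `pairs` iterations of the update used in the post's inner loop
/// and returns the final (x, d). Hypothetical helper, for illustration only.
fn leibniz_pairs(pairs: u32) -> (f64, f64) {
    let mut x = 0.0_f64; // running partial sum
    let mut d = 1.0_f64; // next odd denominator
    for _ in 0..pairs {
        x += 1.0 / d; // add the positive term
        d += 2.0;
        x -= 1.0 / d; // subtract the following negative term
        d += 2.0;
    }
    (x, d)
}

/// Absolute error of the approximation 4 * x against PI.
fn pi_error(x: f64) -> f64 {
    (PI - 4.0 * x).abs()
}
```

Each pass advances d by 4, so after n passes d is 1 + 4n. The series converges slowly: the error shrinks roughly in proportion to 1/n, which is why the program needs so many iterations.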
And the crossbeam version does similarly, but can’t have a central struct:
// crossbeam version
extern crate crossbeam;
use crossbeam::{atomic::AtomicCell, thread};
use std::f64::consts::PI;
use std::thread as sthread;
use std::time::{Duration, SystemTime};
const TARGET: u32 = 10;
const GUESS: u32 = 2500000;
computer
The Arc/Mutex computer is as simple as you could ask for:
// Arc/Mutex version
fn computer(state: Arc<Mutex<Leibniz>>) {
    loop {
        // The guard is held for the whole iteration and dropped at its end.
        let mut s = state.lock().unwrap();
        for _ in 1..s.ticks {
            s.x += 1.0 / s.d;
            s.d += 2.0;
            s.x -= 1.0 / s.d;
            s.d += 2.0;
        }
        s.tocks += 1;
    }
}
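One thing that may be worth testing (an assumption, not a confirmed diagnosis): in the loop above the guard `s` lives until the end of each iteration, so the computer thread holds the mutex almost continuously and takes it again immediately. The standard Mutex makes no fairness promise, so the inspector could be left waiting a long time for its turn. The sketch below does the arithmetic on local copies, the way the crossbeam version does, and locks only briefly to read and to write back. `compute_once` is a hypothetical name, and the `Leibniz` struct is repeated so the block stands alone.

```rust
use std::sync::{Arc, Mutex};

struct Leibniz {
    x: f64,
    d: f64,
    ticks: u32,
    tocks: u32,
}

/// One "tock" of work with the lock held only at the edges.
/// Hypothetical variant of the post's `computer` loop body.
fn compute_once(state: &Arc<Mutex<Leibniz>>) {
    // Short critical section: copy out what the inner loop needs.
    let (mut x, mut d, ticks) = {
        let s = state.lock().unwrap();
        (s.x, s.d, s.ticks)
    }; // guard dropped here, so another thread can take the lock

    // Same range as the original loop: runs ticks - 1 times.
    for _ in 1..ticks {
        x += 1.0 / d;
        d += 2.0;
        x -= 1.0 / d;
        d += 2.0;
    }

    // Short critical section: publish the results.
    let mut s = state.lock().unwrap();
    s.x = x;
    s.d = d;
    s.tocks += 1;
}
```

This relies on the computer being the only writer of x and d; the inspector only changes ticks and tocks, so nothing it writes is overwritten here.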
while the crossbeam computer has to mess around with loads and stores more:
// crossbeam version
fn computer(
    xr: &AtomicCell<f64>,
    dr: &AtomicCell<f64>,
    ticks: &AtomicCell<u32>,
    tocks: &AtomicCell<u32>,
) {
    loop {
        // Copy the shared values into locals; the inner loop touches only these.
        let mut x = xr.load();
        let mut d = dr.load();
        let ticks = ticks.load();
        for _ in 1..ticks {
            x += 1.0 / d;
            d += 2.0;
            x -= 1.0 / d;
            d += 2.0;
        }
        // Publish the results once per tock.
        tocks.fetch_add(1);
        xr.store(x);
        dr.store(d);
    }
}
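For readers without crossbeam at hand, here is a rough std-only stand-in for the loads and stores above. It is a sketch of one way to get an atomic f64 (keeping the bit pattern in an `AtomicU64`), not a description of how AtomicCell is implemented. The `AtomicF64` type, the `tock` function, and the choice of `Ordering::SeqCst` are all assumptions made for the example.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical stand-in for `AtomicCell<f64>`, built on std only.
struct AtomicF64(AtomicU64);

impl AtomicF64 {
    fn new(v: f64) -> Self {
        AtomicF64(AtomicU64::new(v.to_bits()))
    }

    /// Reads the stored bits and reinterprets them as an f64.
    fn load(&self) -> f64 {
        f64::from_bits(self.0.load(Ordering::SeqCst))
    }

    /// Stores the bit pattern of `v`.
    fn store(&self, v: f64) {
        self.0.store(v.to_bits(), Ordering::SeqCst);
    }
}

/// Load once, work on locals, store once: the same shape as `computer`.
fn tock(xr: &AtomicF64, dr: &AtomicF64, ticks: u32) {
    let mut x = xr.load();
    let mut d = dr.load();
    for _ in 1..ticks {
        x += 1.0 / d;
        d += 2.0;
        x -= 1.0 / d;
        d += 2.0;
    }
    xr.store(x);
    dr.store(d);
}
```

The point of the shape is that the shared values are touched twice per tock, no matter how large ticks is.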
inspector
The same is true for the inspector. The Arc/Mutex version is clean to write:
// Arc/Mutex version
fn inspector(state: Arc<Mutex<Leibniz>>) {
    let mut old_d = 1.0;
    let now = SystemTime::now();
    loop {
        thread::sleep(Duration::from_secs(1));
        let mut s = state.lock().unwrap();
        // Halve the work per tock if we missed the target, else grow it by 10%.
        if s.tocks <= TARGET {
            s.ticks /= 2;
        } else {
            s.ticks += s.ticks / 10;
        }
        println!("{:?} {} {} {} {}", now.elapsed().unwrap(), s.ticks, s.tocks, s.d - old_d, PI - 4.0 * s.x);
        old_d = s.d;
        s.tocks = 0;
    }
}
while the crossbeam version has a third of its lines on loads or stores:
// crossbeam version
fn inspector(
    xr: &AtomicCell<f64>,
    dr: &AtomicCell<f64>,
    ticksr: &AtomicCell<u32>,
    tocksr: &AtomicCell<u32>,
) {
    let mut old_d = 1.0;
    let now = SystemTime::now();
    loop {
        sthread::sleep(Duration::from_secs(1));
        let x = xr.load();
        let d = dr.load();
        let tocks = tocksr.load();
        let ticks = ticksr.load();
        if tocks <= TARGET {
            ticksr.store(ticks / 2);
        } else {
            ticksr.store(ticks + ticks / 10);
        }
        println!("{:?} {} {} {} {}", now.elapsed().unwrap(), ticks, tocks, d - old_d, PI - 4.0 * x);
        old_d = d; // remember d so the next line reports the per-second delta
        tocksr.store(0);
    }
}
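The feedback rule shared by both inspectors can be pulled out into a pure function, which makes it easy to check by hand. This is a sketch: `next_ticks` is a made-up name, and `target` is passed in as a parameter rather than read from the TARGET constant.

```rust
/// Returns the ticks value the inspector would set for the next second.
/// Hypothetical helper mirroring the if/else in the post's inspectors.
fn next_ticks(ticks: u32, tocks: u32, target: u32) -> u32 {
    if tocks <= target {
        // At or below the target: halve the work per tock.
        ticks / 2
    } else {
        // Above the target: grow the work per tock by 10%.
        ticks + ticks / 10
    }
}
```

With a target of 10, `next_ticks(100, 5, 10)` gives 50 and `next_ticks(100, 25, 10)` gives 110. Because of integer division, a ticks value below 10 never grows and a value of 1 halves to 0, so the rule may need a floor if ticks ever gets that low.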
setup
The last bit of setup is different: Arc/Mutex requires us to clone the state Arc and move one copy into a new thread:
// Arc/Mutex version
fn main() {
    println!("ARC std version");
    let state = Arc::new(Mutex::new(Leibniz {
        x: 0.0,
        d: 1.0,
        ticks: GUESS,
        tocks: 0,
    }));
    let state_i = Arc::clone(&state);
    thread::spawn(move || computer(state));
    inspector(state_i);
}
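Cloning an Arc copies the pointer and bumps a reference count; it does not copy the data behind it. A small sketch to confirm that both handles see the same mutex-protected value (the function name `shared_counter_demo` is invented for the example):

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Spawns one thread with a clone of the Arc and returns
/// (value seen through the original handle, strong count with two handles).
fn shared_counter_demo() -> (u32, usize) {
    let state = Arc::new(Mutex::new(0_u32));
    let state_t = Arc::clone(&state); // second handle to the same allocation
    let count_with_two_handles = Arc::strong_count(&state);

    // The clone is moved into the thread; the original stays here.
    let handle = thread::spawn(move || {
        *state_t.lock().unwrap() += 1;
    });
    handle.join().unwrap();

    // The increment made through the clone is visible through the original.
    let value = *state.lock().unwrap();
    (value, count_with_two_handles)
}
```

A plain reference would not work with `thread::spawn`, since the spawned closure must own everything it uses; moving a clone in is the usual pattern.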
while crossbeam has us use scoped threads, borrowing each of the atomic values by shared reference (though of course they can still be stored to, thanks to interior mutability):
// crossbeam version
fn main() {
    println!("Atomic crossbeam version");
    let x = AtomicCell::new(0.0);
    let d = AtomicCell::new(1.0);
    let ticks = AtomicCell::new(GUESS);
    let tocks = AtomicCell::new(0);
    thread::scope(|s| {
        s.spawn(|_| {
            computer(&x, &d, &ticks, &tocks);
        });
        inspector(&x, &d, &ticks, &tocks);
    })
    .unwrap();
}
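Since Rust 1.63 the standard library has its own scoped threads, `std::thread::scope`, which allow the same borrow-instead-of-clone pattern without an extra crate. The sketch below uses a std atomic in place of AtomicCell and a bounded loop so that it terminates. `count_tocks` and the `Ordering::Relaxed` choice are assumptions for illustration.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::thread;

/// Spawns a scoped worker that borrows `tocks` and increments it `n` times.
/// Returns the final count once the scope has joined the worker.
fn count_tocks(n: u32) -> u32 {
    let tocks = AtomicU32::new(0);
    thread::scope(|s| {
        // The closure borrows `tocks`; no Arc or clone is needed.
        s.spawn(|| {
            for _ in 0..n {
                tocks.fetch_add(1, Ordering::Relaxed);
            }
        });
        // The scope joins all spawned threads before returning.
    });
    tocks.load(Ordering::Relaxed)
}
```

Two differences from the crossbeam call: the std spawn closure takes no argument, and `thread::scope` returns the closure's value directly, not a Result.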
Having to thread separate AtomicCell values through everywhere is a pain, and it’s not much relieved by putting them in a struct, because you still need extra loads and stores everywhere you access them. I’m consistently worried that I’ll get the ordering between atomic values wrong and cause either a deadlock or a logic error, even though these are almost one-way channels (tocks is incremented by computer and reset to 0 by inspector).
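One concrete place where the ordering of separate operations can bite: the inspector loads tocks and only later stores 0, so any increments that land between those two calls are silently dropped. A single atomic swap reads and resets in one step. The sketch uses `std::sync::atomic::AtomicU32`; crossbeam's AtomicCell also has a `swap` method. `take_tocks` and `total_taken` are hypothetical helper names.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::thread;

/// Reads the current tock count and resets it to 0 as one atomic step,
/// so no increment can fall between the read and the reset.
fn take_tocks(tocks: &AtomicU32) -> u32 {
    tocks.swap(0, Ordering::SeqCst)
}

/// Runs a worker doing `n` increments while the caller repeatedly takes
/// the count; returns the sum of everything taken. Illustration only.
fn total_taken(n: u32) -> u32 {
    let tocks = AtomicU32::new(0);
    let mut total = 0;
    thread::scope(|s| {
        let worker = s.spawn(|| {
            for _ in 0..n {
                tocks.fetch_add(1, Ordering::SeqCst);
            }
        });
        // Drain concurrently with the worker.
        while !worker.is_finished() {
            total += take_tocks(&tocks);
        }
    });
    // Pick up anything added after the last drain.
    total + take_tocks(&tocks)
}
```

Every increment is counted exactly once, so `total_taken(n)` returns n no matter how the two threads interleave. With a separate load and store, some increments could be lost.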
I’ve adjusted everything I can find in the Arc/Mutex version, including the initial and target constants. I am pretty sure I’m supposed to be taking a clone of the Arc rather than a reference to it. I’d love to find a set of profiling tools that would let me see what the Arc/Mutex version is spending its time on. I’d love to have advice from a competent Rustacean on how this code could or should look. Thanks!
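Short of a real profiler, a crude first measurement is to time how long each `lock()` call blocks. A minimal sketch using `std::time::Instant`; `timed_lock` is a hypothetical helper, not part of std.

```rust
use std::sync::{Mutex, MutexGuard};
use std::time::{Duration, Instant};

/// Locks `m` and reports how long the caller waited to get the guard.
/// Hypothetical helper for ad-hoc measurement.
fn timed_lock<T>(m: &Mutex<T>) -> (MutexGuard<'_, T>, Duration) {
    let start = Instant::now();
    let guard = m.lock().unwrap();
    (guard, start.elapsed())
}
```

In the Arc/Mutex inspector, swapping `state.lock().unwrap()` for `timed_lock(&state)` and printing the returned duration would show whether the one-second sleep is followed by a long wait for the mutex.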