Last month, I wrote a little about using a toy problem to explore Python’s async/await concurrency model. I’ve been meaning to learn Rust for ages, and this seemed a nice size to experiment with. I ported that program over to Rust, and the whole thing is up on GitHub for your enjoyment.
But I’ve run into an interesting problem: the version using the standard library’s ARC/Mutex concurrency is incredibly slow. A rewrite using the crossbeam library’s atomic data types is much faster, even though they should be using the same primitives. Not only is the ARC/Mutex version slow, but it gets much slower over time, as if something’s leaking badly. I welcome insight into what I could be missing. After all, this is my very first Rust program; perhaps I’m using ARC in a very improper way.
Let’s look at some examples of how the standard ARC/Mutex code differs from crossbeam.
initialization
Initialization is pretty similar between them. The ARC/Mutex version imports some standard libraries, and sets up the struct that will be shared between the computer and inspector threads:
// ARC/Mutex version
use std::f64::consts::PI;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::{Duration, SystemTime};
const TARGET: u32 = 10;
const GUESS: u32 = 100;
struct Leibniz {
    x: f64,
    d: f64,
    ticks: u32,
    tocks: u32,
}
And the crossbeam version does similarly, but can’t have a central struct:
// crossbeam version
extern crate crossbeam;
use crossbeam::{atomic::AtomicCell, thread};
use std::f64::consts::PI;
use std::thread as sthread;
use std::time::{Duration, SystemTime};
const TARGET: u32 = 10;
const GUESS: u32 = 2500000;
computer
The ARC/Mutex computer is as simple as you could ask for:
// ARC/Mutex version
fn computer(state: Arc<Mutex<Leibniz>>) {
    loop {
        let mut s = state.lock().unwrap();
        for _ in 1..s.ticks {
            s.x += 1.0 / s.d;
            s.d += 2.0;
            s.x -= 1.0 / s.d;
            s.d += 2.0;
        }
        s.tocks += 1;
    }
}
while the crossbeam computer has to mess around with loads and stores more:
// crossbeam version
fn computer(
    xr: &AtomicCell<f64>,
    dr: &AtomicCell<f64>,
    ticks: &AtomicCell<u32>,
    tocks: &AtomicCell<u32>,
) {
    loop {
        let mut x = xr.load();
        let mut d = dr.load();
        let ticks = ticks.load();
        for _ in 1..ticks {
            x += 1.0 / d;
            d += 2.0;
            x -= 1.0 / d;
            d += 2.0;
        }
        tocks.fetch_add(1);
        xr.store(x);
        dr.store(d);
    }
}
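One detail of the ARC/Mutex computer above that may be worth a look: the guard s returned by lock() is bound at the top of the loop body, so the lock is held for the whole batch of ticks and released only at the closing brace, just before the next iteration takes it again. Whether that accounts for the slowdown is not something this sketch settles; it only shows how long a guard holds the lock. The plain u32 is a hypothetical stand-in for Leibniz, and other_thread_can_lock and guard_demo are made-up helper names.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Ask a second thread whether it could take the lock right now.
fn other_thread_can_lock(state: &Arc<Mutex<u32>>) -> bool {
    let clone = Arc::clone(state);
    thread::spawn(move || {
        let ok = clone.try_lock().is_ok();
        ok
    })
    .join()
    .unwrap()
}

// Returns (lockable while the guard is alive, lockable after it is dropped).
fn guard_demo() -> (bool, bool) {
    // Assumption: a bare u32 stands in for the shared Leibniz state.
    let state = Arc::new(Mutex::new(0u32));

    let mut guard = state.lock().unwrap();
    *guard += 1;
    let while_held = other_thread_can_lock(&state);

    // Dropping the guard releases the lock; the same thing happens
    // implicitly when the guard goes out of scope.
    drop(guard);
    let after_drop = other_thread_can_lock(&state);

    (while_held, after_drop)
}

fn main() {
    let (while_held, after_drop) = guard_demo();
    println!("lockable while held: {}, after drop: {}", while_held, after_drop);
}
```

In the computer loop, the window in which the inspector could get in is the gap between one iteration's implicit drop and the next iteration's lock().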
inspector
The same is true for the inspector. The ARC/Mutex version is clean to write:
// ARC/Mutex version
fn inspector(state: Arc<Mutex<Leibniz>>) {
    let mut old_d = 1.0;
    let now = SystemTime::now();
    loop {
        thread::sleep(Duration::from_secs(1));
        let mut s = state.lock().unwrap();
        if s.tocks <= TARGET {
            s.ticks /= 2;
        } else if s.tocks > TARGET {
            s.ticks = s.ticks + s.ticks / 10;
        }
        println!("{:?} {} {} {} {}", now.elapsed().unwrap(), s.ticks, s.tocks, s.d - old_d, PI - 4.0 * s.x);
        old_d = s.d;
        s.tocks = 0;
    }
}
while the crossbeam version has a third of its lines on loads or stores:
// crossbeam version
fn inspector(
    xr: &AtomicCell<f64>,
    dr: &AtomicCell<f64>,
    ticksr: &AtomicCell<u32>,
    tocksr: &AtomicCell<u32>,
) {
    let mut old_d = 1.0;
    let mut now = SystemTime::now();
    loop {
        sthread::sleep(Duration::from_secs(1));
        let x = xr.load();
        let d = dr.load();
        let tocks = tocksr.load();
        let ticks = ticksr.load();
        if tocks <= TARGET {
            ticksr.store(ticks / 2);
        } else if tocks > TARGET {
            ticksr.store(ticks + ticks / 10);
        }
        println!("{:?} {} {} {} {}", now.elapsed().unwrap(), ticks, tocks, d - old_d, PI - 4.0 * x);
        tocksr.store(0);
    }
}
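The feedback rule is the same in both inspectors: halve ticks when at most TARGET batches finished in the last second, otherwise grow it by a tenth. Pulled out as a pure function it is easier to poke at. This is only a sketch; next_ticks is a hypothetical name, not something in the repository.

```rust
// Hypothetical helper: the adjustment both inspectors apply once per second.
fn next_ticks(ticks: u32, tocks: u32, target: u32) -> u32 {
    if tocks <= target {
        ticks / 2 // at most `target` batches finished: halve the batch size
    } else {
        ticks + ticks / 10 // otherwise grow the batch by about 10%
    }
}

fn main() {
    println!("{}", next_ticks(100, 5, 10)); // 50
    println!("{}", next_ticks(100, 11, 10)); // 110
    println!("{}", next_ticks(9, 11, 10)); // 9
}
```

Because this is integer arithmetic, a ticks value below 10 never grows (ticks / 10 is 0), and repeated halving bottoms out at 0, at which point the 1..ticks loop does no work. The else if in the inspectors is equivalent to a plain else.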
setup
The last bit of setup is different: ARC/Mutex requires us to clone the state ARC, and move one copy into a new thread:
// ARC/Mutex version
fn main() {
    println!("ARC std version");
    let state = Arc::new(Mutex::new(Leibniz {
        x: 0.0,
        d: 1.0,
        ticks: GUESS,
        tocks: 0,
    }));
    let state_i = state.clone();
    thread::spawn(move || computer(state));
    inspector(state_i);
}
while crossbeam has us use scoped threads, borrowing each of the atomic values read-only (though of course they can be stored):
// crossbeam version
fn main() {
    println!("Atomic crossbeam version");
    let x = AtomicCell::new(0.0);
    let d = AtomicCell::new(1.0);
    let ticks = AtomicCell::new(GUESS);
    let tocks = AtomicCell::new(0);
    thread::scope(|s| {
        s.spawn(|_| {
            computer(&x, &d, &ticks, &tocks);
        });
        inspector(&x, &d, &ticks, &tocks);
    })
    .unwrap();
}
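The two ownership styles can be put side by side using only the standard library, since std has had its own scoped threads (std::thread::scope) since Rust 1.63. This is a sketch under assumptions, not a port of either program: a single std AtomicU32 stands in for the shared state, and bump, with_arc and with_scope are hypothetical names.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

// Hypothetical stand-in for `computer`: bump the counter `times` times.
fn bump(tocks: &AtomicU32, times: u32) {
    for _ in 0..times {
        tocks.fetch_add(1, Ordering::Relaxed);
    }
}

// Style 1: `thread::spawn` needs 'static data, so the worker owns an Arc clone.
// Returns (number of Arc handles after cloning, final counter value).
fn with_arc() -> (usize, u32) {
    let tocks = Arc::new(AtomicU32::new(0));
    let worker_tocks = Arc::clone(&tocks); // copies the pointer, not the counter
    let handles = Arc::strong_count(&tocks);
    let worker = thread::spawn(move || bump(&worker_tocks, 1000));
    bump(&tocks, 1000);
    worker.join().unwrap();
    (handles, tocks.load(Ordering::Relaxed))
}

// Style 2: scoped threads may borrow locals, because every thread spawned
// in the scope is joined before `scope` returns.
fn with_scope() -> u32 {
    let tocks = AtomicU32::new(0);
    thread::scope(|s| {
        s.spawn(|| bump(&tocks, 1000));
        bump(&tocks, 1000);
    });
    tocks.load(Ordering::Relaxed)
}

fn main() {
    println!("arc: {:?}, scope: {}", with_arc(), with_scope());
}
```

Either way both threads end up looking at the same value. The clone in the first style is the usual way to hand an Arc to a spawned thread, since a plain reference to a local Arc would not satisfy the 'static bound on thread::spawn.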
Having to thread separate AtomicCell values through everywhere is a pain, and it’s not much relieved by putting them in a struct, because you still need extra loads and stores everywhere you access them. I’m consistently worried that I’ll get the ordering between atomic values wrong and cause either a deadlock or a logic error, even though these are almost one-way channels (tocks is incremented by computer and reset to 0 by inspector).
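On that worry, one spot in the inspector does look racy: tocks is read with a load and then reset with a separate store of 0, so any increment that lands between the two is lost. An atomic swap reads and resets in one step. The sketch below uses std's AtomicU32 (crossbeam's AtomicCell has a swap method as well); take_tocks and count_all are hypothetical helper names.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::thread;

// Hypothetical helper: read the counter and reset it in one atomic step,
// so no increment can slip in between the read and the reset.
fn take_tocks(tocks: &AtomicU32) -> u32 {
    tocks.swap(0, Ordering::Relaxed)
}

// Counts every increment exactly once, however the threads interleave.
fn count_all(increments: u32) -> u32 {
    let tocks = AtomicU32::new(0);
    let mut seen = 0;
    thread::scope(|s| {
        let worker = s.spawn(|| {
            for _ in 0..increments {
                tocks.fetch_add(1, Ordering::Relaxed);
            }
        });
        // Drain concurrently, roughly the way the inspector would.
        while !worker.is_finished() {
            seen += take_tocks(&tocks);
        }
        worker.join().unwrap();
        seen += take_tocks(&tocks); // pick up whatever was left
    });
    seen
}

fn main() {
    println!("{}", count_all(100_000));
}
```

With a load followed by store(0) in place of the swap, the total can come up short. For a once-per-second progress report that hardly matters, but the swap costs nothing extra.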
I’ve adjusted everything I can find in the ARC/Mutex version, including the initial and target constants. I am pretty sure I’m not supposed to be taking a reference to the ARC, but a clone. I’d love to find a set of profiling tools that would let me see what the ARC/Mutex version is spending its time on. I’d love to have advice from a competent Rustacean on how this code could or should look. Thanks!