Sniffen Packets

With a name like Sniffen, it's got to smell good.

Confusing ARC delays in Rust

Last month, I wrote a little about using a toy problem to explore Python’s async/await concurrency model. I’ve been meaning to learn Rust for ages, and this seemed a nice size to experiment with. I ported over that program to Rust—and the whole thing is up at Github for your enjoyment.

But I’ve run into an interesting problem: the version using the standard library’s ARC/Mutex concurrency is incredibly slow. A rewrite using the crossbeam library’s atomic data types is much faster—despite that they should be using the same primitives. Not only is the ARC/Mutex version slow, but it gets much slower over time—as if something’s leaking badly. I welcome insight to what I could be missing. After all, this is my very first Rust program; perhaps I’m using ARC in a very improper way.

Let’s look at some examples of how the standard ARC/Mutex code differs from crossbeam.

initialization

Initialization is pretty similar between them. The ARC/Mutex version imports some standard libraries, and sets up the struct that will be shared between the computer and inspector threads:

// ARC/Mutex version

use std::f64::consts::PI;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::{Duration, SystemTime};

const TARGET: u32 = 10;
const GUESS: u32 = 100;

struct Leibniz {
    x: f64,
    d: f64,
    ticks: u32,
    tocks: u32,
}

And the crossbeam version does similarly—but can’t have a central struct:

// crossbeam version

extern crate crossbeam;

use crossbeam::{atomic::AtomicCell, thread};
use std::f64::consts::PI;
use std::thread as sthread;
use std::time::{Duration, SystemTime};

const TARGET: u32 = 10;
const GUESS: u32 = 2500000;

computer

The ARC/Mutex computer is as simple as you could ask for:

// ARC/Mutex version

fn computer(state: Arc<Mutex<Leibniz>>) {
    loop {
        let mut s = state.lock().unwrap();
        for _ in 1..s.ticks {
            s.x += 1.0 / s.d;
            s.d += 2.0;
            s.x -= 1.0 / s.d;
            s.d += 2.0;
        }
        s.tocks += 1;
    }
}

while the crossbeam computer has to mess around with loads and stores more:

// crossbeam version

fn computer(
    xr: &AtomicCell<f64>,
    dr: &AtomicCell<f64>,
    ticks: &AtomicCell<u32>,
    tocks: &AtomicCell<u32>,
) {
    loop {
        let mut x = xr.load();
        let mut d = dr.load();
        let ticks = ticks.load();
        for _ in 1..ticks {
            x += 1.0 / d;
            d += 2.0;
            x -= 1.0 / d;
            d += 2.0;
        }
        tocks.fetch_add(1);
        xr.store(x);
        dr.store(d);
    }
}

inspector

The same is true for the inspector. The ARC/Mutex version is clean to write:

// ARC/Mutex version

fn inspector(state: Arc<Mutex<Leibniz>>) {
    let mut old_d = 1.0;
    let now = SystemTime::now();
    loop {
        thread::sleep(Duration::from_secs(1));
        let mut s = state.lock().unwrap();
        if s.tocks <= TARGET {
            s.ticks /= 2;
        } else if s.tocks > TARGET {
            s.ticks = s.ticks + s.ticks / 10;
        }
        println!("{:?} {} {} {} {}", now.elapsed().unwrap(), s.ticks, s.tocks, s.d - old_d, PI - 4.0 * s.x);
        old_d = s.d;
        s.tocks = 0;
    }
}

while the crossbeam version has a third of its lines on loads or stores:

// crossbeam version

fn inspector(
    xr: &AtomicCell<f64>,
    dr: &AtomicCell<f64>,
    ticksr: &AtomicCell<u32>,
    tocksr: &AtomicCell<u32>,
) {
    let mut old_d = 1.0;
    let mut now = SystemTime::now();
    loop {
        sthread::sleep(Duration::from_secs(1));
        let x = xr.load();
        let d = dr.load();
        let tocks = tocksr.load();
        let ticks = ticksr.load();
        if tocks <= TARGET {
            ticksr.store(ticks / 2);
        } else if tocks > TARGET {
            ticksr.store(ticks + ticks / 10);
        }
        println!("{:?} {} {} {} {}", now.elapsed().unwrap(), ticks, tocks, d - old_d, PI - 4.0 * x);
        tocksr.store(0);
    }
}

setup

The last bit of setup is different: ARC/Mutex requires us to clone the state ARC, and move one copy into a new thread:

// ARC/Mutex version

fn main() {
    println!("ARC std version");
    let state = Arc::new(Mutex::new(Leibniz {
        x: 0.0,
        d: 1.0,
        ticks: GUESS,
        tocks: 0,
    }));
    let state_i = state.clone();
    thread::spawn(move || computer(state));
    inspector(state_i);
}

while crossbeam has us use scoped threads, borrowing each of the atomic values read-only (though of course they can be stored):

// crossbeam version

fn main() {
    println!("Atomic crossbeam version");
    let x = AtomicCell::new(0.0);
    let d = AtomicCell::new(1.0);
    let ticks = AtomicCell::new(GUESS);
    let tocks = AtomicCell::new(0);

    thread::scope(|s| {
        s.spawn(|_| {
            computer(&x, &d, &ticks, &tocks);
        });
        inspector(&x, &d, &ticks, &tocks);
    })
    .unwrap();
}

Having to thread separate AtomicCellvalues through everywhere is a pain—and it’s not much relieved by putting them in a struct, because you still need extra load and stores everywhere you access them. I’m consistently worried that I’ll get the ordering between atomic values wrong and cause either a deadlock or a logic error—even though these are almost one-way channels (tocks is incremented by computer and reset to 0 by inspector).

I’ve adjusted everything I can find in the ARC/Mutex version, including the initial and target constants. I am pretty sure I’m not supposed to be taking a reference to the ARC, but a clone. I’d love to find a set of profiling tools that would let me see what the ARC/Mutex version is spending its time on. I’d love to have advice from a competent Rustacean on how this code could or should look. Thanks!