Learn Rust through linalg

This series of posts is going to be a non-traditional intro to Rustlang, so buckle up! I will give more details when necessary. Codes are available in my github.

Basic linear algebra such as vector inner product and General Matrix-Matrix (GEMM) product are at the heart of computer science so why not learning Rust by implementing them!

Rust is continuing to be loved (see the stackoverflow survey(s)) with bright future!

In part 1, we will go through vector inner product on CPU. Part 2, will be about GEMM in Rust on GPU. Stay tuned!

Rust is a statically typed, modern system language (C level speed) with zero-cost abstraction, no segfaults and free of data race.

To add more; as for a system language, it doesn’t make sense to have a Garbage Collection, Rust memory management is left to the programmer but there’re useful mechanisms to automate it in most cases through for example implementing Drop trait where when a variable goes out of scope it’s data is deallocated and removed. Neat!

Rust type system is an implementation of Affine type systems; that means each data/value has exactly one owner and can be used only once. Ownership is a unique feature of Rust.

Alright! that was a crude intro to Rust. To learn more, see the official Rust book and the docs. I’m assuming you’ve already installed Rust through rustup which provides the great toolchain and undoubtedly is superior to C/C++. (Welcome to the modern world!)

Vector inner product

To implement simple inner product of two equal length vectors we have two basic options. Either Rust’s array of type [f64: N] (where N is a non-negative integer whose size is known at compile time) or Vec(dynamic array). Rust’s array size must be known at compile time, so when it’s not known we can use it behind a pointer &[f64] aka slice.

Rust’s has a variety of different pointers the most common ones are &x (shared reference) and &mut xmutable reference of x. There’re variety of smart pointers available such as Box, Rc (reference-counting), RefCell (internal mutatbility) as well as the raw pointer * , etc.

Let’s go to the code:

Imperative

The simplest impl goes like this

pub fn aslice_dot_naive(a: &[f64], b: &[f64]) -> f64 {
    let mut ret = 0.0;
    for i in 0..a.len() {
        ret += a[i] * b[i];
    }
    ret
}

pub fn vec_dot_naive(a: &Vec<f64>, b: &Vec<f64>) -> f64 {
    let mut ret = 0.0;
    for i in 0..a.len() {
        ret += a[i] * b[i];
    }
    ret
}

fn is a function pointer. let mut ret = 0.0;is mutable variable binding. By default variables are immutable (not specifying mut). See they have the same impls but different signatures.

Functional

Using functional features like zip and map (taking a closure) or fold we can also write

pub fn vec_dot_zip(a: &Vec<f64>, b: &Vec<f64>) -> f64 {
    a.iter().zip(b.iter()).map(|(&x, &y)| x * y).sum()
}

pub fn vec_dot_fold(a: &Vec<f64>, b: &Vec<f64>) -> f64 {
    (0..a.len()).fold(0f64, |sum, i| sum + a[i] * b[i])
}

It’s beautifully functional and doesn’t seem like system programming at all!

Unsafe

Majority of time safe Rust is enough. The unsafe world doesn’t have all sorts of Rust’s guarantees (i.e. no data race, etc.) so must be used with care, otherwise, segfaults (and the rest) will be thrown at your face!

For example, to access the raw parts a Vec in above example, we can use get_unchecked as below

fn vec_dot_unsafe(a: &Vec<f64>, b: &Vec<f64>) -> f64 {
    unsafe {
        (0..a.len()).fold(0f64, |sum, i| sum + a.get_unchecked(i) * b.get_unchecked(i))
    }
}

Notice the unsafe keyword wrapping the unsafe world for us. Unsafe is the main feature for FFI in Rust because other languages are truly unsafe.

Speaking of unsafe and since we’re doing vector dot product, then we can use BLAS or CBLAS instead of all the basic implementations. Fortunately, the bindings are available here.

All we need to do is to add them to cargo.toml

[dependencies]
blas = "0.19"
openblas-src = "0.5"
cblas = "0.1.5"

and we are good to go! Here‘s the dense dot product ddot Netlib BLAS doc

use blas::ddot;
use cblas::ddot as cddot

fn dot_blas(a: &Vec<f64>, b: &Vec<f64>) -> f64 {
    unsafe {
        ddot(a.len() as i32, &a[..], 1, &b[..], 1)
    }
}

fn dot_cblas(a: &Vec<f64>, b: &Vec<f64>) -> f64 {
    unsafe {
        cddot(a.len() as i32, &a[..], 1, &b[..], 1)
    }
}

Data parallel

Inner product is a classic example of map-reduce / data-parallel. There’s an amazing Rayon crate that we can use and it keeps Rust’s premises, most notably here, data race free. It parallelizes our code whenever possible, otherwise falls back to sequential.

use std::ops::Add;

pub fn vec_dot_par_iter_slow(a: &Vec<f64>, b: &Vec<f64>) -> f64 {
    // final sum is the bottleneck
    a.par_iter().zip(b.par_iter()).map(|(&x, &y)| x * y).sum()
}

pub fn vec_dot_par_iter_fast(a: &Vec<f64>, b: &Vec<f64>) -> f64 {
    a.par_iter().zip(b.par_iter()).map(|(&x, &y)| x * y).reduce_with(Add::add).unwrap()
}

Ndarray

Another option is the Ndarray crate which is like Numpy has dot product functionality. We’ll use it with blas feature enabled.

Test

Now to test, we’ll use cargo test. First we create a vector of some given length with elements randomly sampled from standard normal distribution. It goes something like this

#[cfg(test)]
mod tests {

    use std::iter;
    use rand::{Rng, thread_rng};
    use super::*;

    fn randn_vec(n: usize) -> Vec<f64> {
        let mut rng = thread_rng();
        iter::repeat(()).map(|()| rng.gen()).take(n).collect::<Vec<f64>>()
    }

    fn close(x: f64, y: f64) -> bool {
        (x - y).abs() < 1e-8
    }
    

    #[test]
    fn dot() {
        let v = randn_vec(10);
        let a =  aslice_dot_naive(&v[..], &v[..]);
        let b = vec_dot_naive(&v, &v);
        let c = vec_dot_iter(&v, &v);
        assert!(close(a, b));
        assert!(close(b, c));
    }
}

assert! is a macro (bang at the end) which is expanded at compile time and it checks the boolean value. If false, it panics!

Rust’s macro is hygienic unlike C.

We can simply make our own close! macro with its own AST grammer

macro_rules! close {
    ($x:expr, $y:expr) => (assert!(($x - $y).abs() < 1e-8))
}

Since we’re going to test more things and macro are great to prevent DRY, we can write an all_close! macro that compares all floats, recursively.

macro_rules! all_close {
    ($x:expr, $y:expr) => (assert!(($x - $y).abs() < 1e-8);
    ($x:expr, $y:expr, $($ys:expr),+) => (all_close!($y, $($ys),+))
}

Then, can compare values altogether with all_close!(a, b, c); So the complete tests are

#[cfg(test)]
mod tests {

    use std::iter;
    use rand::{Rng, thread_rng};
    use super::*;

    fn randn_vec(n: usize) -> Vec<f64> {
        let mut rng = thread_rng();
        iter::repeat(()).map(|()| rng.gen()).take(n).collect::<Vec<f64>>()
    }

    macro_rules! all_close {
        ($x:expr, $y:expr) => (assert!(($x - $y).abs() < 1e-8));
        ($x:expr, $y:expr, $($ys:expr),+) => (
            all_close!($y, $($ys),+)
        )
    }

    #[test]
    fn dot() {
        let v = randn_vec(10);
        let a =  aslice_dot_naive(&v[..], &v[..]);
        let b = vec_dot_naive(&v, &v);
        let c = vec_dot_zip(&v, &v);
        let d = vec_dot_fold(&v, &v);
        let e = vec_dot_unsafe(&v, &v);
        let f = vec_dot_par_iter_slow(&v, &v);
        let g = vec_dot_par_iter_fast(&v, &v);
        let h = dot_blas(&v, &v);
        let i = dot_cblas(&v, &v);
        all_close!(a, b, c, d, e, f, g, h, i)
    }
}

Then run cargo test --lib vector should pass our tests.

Benchmarks

We can go ahead and perform some microbenchmarks with cargo bench which is available though Rust nightly release.

To switch to rustc nightly, simply run rustup default nightly.

First, we need to add test feature at the root module

#![feature(test)]
extern crate test;

then in tests module above, we can benchmark our impls for a random vector of some length, for example.

use test;

#[bench]
fn bench_aslice_dot_naive(bench: &mut test::Bencher) {
    let v = randn_vec(1000);
    bench.iter(|| {
        aslice_dot_naive(&v[..], &v[..])
    });
}

However, there’s a better option; Criterion crate which preforms the benchmarks in the stable Rust and provides more statistical info (confidence interval, p-values, outliers as well as warm up time).

The results of running our micro-benchmarks for inner product of random vectors length 1000 and 1,000,000 with itself (cargo bench 2>&1 | tee -a bench_logs.txt) on my four year old, Thinkpad T540p 8-cores laptop, running Ubuntu 14.04 are:

Note: Keep the Numpy C-optimized results for float32 in mind (float32 is needed for correct comparison to Ndarray)

# python 3.6 Anaconda
# numpy version 1.14
import numpy as np

a = np.random.randn(1000).astype("float32")
%timeit a.dot(a)
# 573 ns ± 0.909 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

b = np.random.randn(1_000_000).astype("float32")
%timeit b.dot(b)
# 420 µs ± 1.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Log summaries

length 1000

  1. Naive: ~ 795 ns (nano sec)
  2. Parallel: ~ 11 us (slow) and ~ 5 us (fast) 🤔. Though they’re worse than the naive cases due to excessive thread communication overheads. A way to fix them is through explicit divide-and-conquer (forming tree-aggregate pattern) with proper join .
  3. Blas/CBlas: ~ 105 ns
  4. Ndarray-blas: ~ 61 ns 🔥

length 1,000,000

  1. Naive: ~ 887 us (micro sec)
  2. Parallel: ~ 6.5 ms (slow) and ~ 887 us (fast). Still should be improved as explained above.
  3. Blas/CBlas: ~ 358 us
  4. Ndarray-blas: ~ 111 us 🔥

Hats off to bluss’s Ndarray and the notable speedups compared to Numpy!

The complete benchmark logs in more details are available here.

That’s it for now. Stay tuned for more good stuff in part 2.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.