Existing ML/DL ecosystems are huge because they are combinations of High Performance Computing, Mathematical Optimization, Systems and Compiler Engineering, etc. So for the sake of simplicity, if we go by the common breakdown of ML into **traditional ML** vs. **DL** (overlap included), then rusty-machine and rustlearn vs. leaf come to mind. They made very interesting and bold developments for their time, in particular `leaf`, but eventually they were mostly abandoned, because creating a complete open-source ML/DL framework is a huge task which requires:

- Various **language supports** (will get into it in a bit)
- **Mature base** linear algebra and statistics crates
- A **community** of ML specialists who happen to know Rust and are willing to contribute

Dominant existing ML libraries (mostly in Python/Cython or C++) were developed with all of these in place, and Rust will be no exception.

A while ago *Gonzalo* put up a list of HPC requirements, and as of now we can say Rust supports most of the items as language features (stable/unstable) or in crates, and hopefully by the end of this year we will see even more support. Still, const generics (good array support), stable `std::simd`, native GPU support, async, etc. are work-in-progress. Some workarounds and existing solutions are: generic-array (using typenum), packed_simd, and RustaCUDA. For *MPI* there's an MPI binding, and for *OpenMP* there's rayon.

Are we learning yet? is tracking most of the signals in this area, and a simple search over crates.io will tell you that we have a lot of ground to cover, so when it comes to production, Rust is not there yet!

Thanks to *bluss*, who initiated ndarray, and its various contributors, `ndarray` has become the `numpy` of Rust, i.e. the base linear algebra crate (though there is still a lot to be done). Note that this layer is fundamental, and simply wrapping BLAS/BLIS, LAPACK, etc. is not enough!

`ndarray` has become the base of the Rust ML ecosystem that others are building upon, for example, ndarray-linalg and ndarray-stats.

Looking back, it is fair to say people have been, more or less, experimenting with Rust for ML. I think the experimental phase is getting into its final stage, once Rust lands the immediate requirements such as const generics, GATs, `std::simd`, and GPU support. The community is getting bigger, and considering the collective efforts of the authors and contributors of the aforementioned crates, the number of ML specialists and enthusiasts is approaching the point where we can all get together to do interesting things, learning from and assessing existing ecosystems (in particular Python's) to create our own curated Rust ecosystem. I think it is time to create an *ML Working Group*, or at least, for now, if you're interested you can join the rust-ml group to see how things turn out.

This is the area I'm most passionate about. DL *frontiers* are pushing more and more into systems and compilers so that heavier computations, graph-level optimizations, *differentiation* (aka differentiable programming), efficient codegen and kernel generation, etc. happen at *compile time*. Major frontiers are TVM, tensorflow/swift, and pytorch/glow (also PyTorch with a TVM backend). So when it comes to Rust, all these efforts **cannot be ignored**.

Therefore, a (short-term) solution is creating bindings. That's what I did for TVM. Basically, we can train (mostly vision tasks for now) using any DL framework (TensorFlow, PyTorch, MXNet), or bridge some with ONNX, then compile with TVM for a variety of supported hardware, and for *inference* we can use our beloved Rust. I should also mention the existing bindings such as tensorflow/rust and tch-rs. The major problem with these bindings is that they're limited. For example, `tensorflow/rust` does not have the higher abstractions that Python has now, and `tch-rs` is limited as well.

Inference, in particular on edge devices, is one of the hottest areas. Another very interesting project which uses Rust for inference is tract, which has good support for TF and ONNX ops. I should mention that Google's TFLite, Tencent's NCNN and FeatherCNN, Xiaomi's MACE, and Microsoft's ELL are all trying to push their own solutions, but frankly, they're still limited to certain well-known tasks and are painful to use for a variety of other tasks.

You might ask, *how about creating a DL framework in Rust from scratch?* I'd say: first read the source code of any major DL framework and try to catch up on the compiler developments. Then you'll see the pieces are moving fast and haven't yet converged to a relatively complete solution. Though it could work out as a *very* long-term solution, personally I'm not interested right now.

I love Rust because of two main reasons:

- It is very community-driven, offering solutions rarely seen before, while keeping the community healthy: *no-ego* rules and any input is welcome
- The community, and in particular the *leaders*, have *high EQ*, which in my opinion is one of the most neglected cohesive forces in fruitful, long-lasting open-source communities

I would love to see Rust flourishing in the ML/DL domains. There are still areas where it lacks a decent crate, such as *visualization* for ML types of workloads, but my bet is on Rust. I hope this post has cleared up where Rust stands when it comes to ML/DL. For input from other people, please see the rust-ml discussion.

Continuing from the Rust standard library study series, it's time for `VecDeque<T>`. Because of its similarity to `Vec`, there isn't much to say.

> A **double-ended queue** implemented with a **growable ring buffer**. (Rust std doc)

The "default" usage of this type as a queue is to use `push_back` to add to the queue, and `pop_front` to remove from the queue. `extend` and `append` push onto the back in this manner, and iterating over `VecDeque` goes front to back.

- Similar to `Vec`, `VecDeque` has amortized O(1) insert at both ends of the container, but unlike `Vec`, it has O(1) removal from both ends. (Recall from the `Vec` study that removal is strictly O(1), with no shrink factor involved.)
- Similar to `Vec`, indexing is O(1).
Similar to the `Vec<T>` study, here's the stripped-down definition of `VecDeque<T>`:

```rust
struct VecDeque<T> {
    // tail and head are pointers into the buffer. Tail always points
    // to the first element that could be read, head always points
    // to where data should be written.
    // If tail == head the buffer is empty. The length of the ringbuffer
    // is defined as the distance between the two.
    tail: usize,
    head: usize,
    buf: RawVec<T>,
}

// Default Global allocator
struct RawVec<T, A: Alloc = Global> {
    ptr: Unique<T>,
    cap: usize,
    a: A,
}

#[repr(transparent)]
struct Unique<T: ?Sized> {
    pointer: *const T,
    _marker: PhantomData<T>,
}
```
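As a quick illustration of the "default" queue usage described above (a sketch of my own; the values are arbitrary):

```rust
use std::collections::VecDeque;

fn main() {
    // "Default" queue usage: push_back to add, pop_front to remove.
    let mut q: VecDeque<i32> = VecDeque::new();
    q.push_back(1);
    q.push_back(2);
    q.push_front(0); // O(1) at the front too, unlike Vec::insert(0, _)

    // Iteration goes front to back.
    let front_to_back: Vec<i32> = q.iter().copied().collect();
    assert_eq!(front_to_back, vec![0, 1, 2]);

    assert_eq!(q.pop_front(), Some(0));
    assert_eq!(q.pop_back(), Some(2));
    assert_eq!(q.len(), 1);
}
```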

The same `Vec` methods, such as `with_capacity` (note that the ring buffer always *leaves one space empty*), `truncate`, `shrink_to`, etc. exist and follow the same observations as in the `Vec` study. The notable methods are:

- `push_back` and `pop_back`, which involve moving the `head` pointer, and `push_front` and `pop_front`, which involve moving the `tail` pointer.
- `retain`, which acts as a filter method.
- `resize_with`, which takes a `generator: impl FnMut() -> T`.
- `as_slices` (and the mut one), which contains, in order, the content of the `VecDeque`.
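A small sketch of my own exercising these methods:

```rust
use std::collections::VecDeque;

fn main() {
    let mut dq: VecDeque<i32> = (1..=6).collect();

    // retain acts like a filter: keep only the even elements.
    dq.retain(|&x| x % 2 == 0);
    assert_eq!(dq.iter().copied().collect::<Vec<_>>(), vec![2, 4, 6]);

    // as_slices exposes the ring buffer as (front, back) contiguous slices;
    // together, in order, they are the whole content of the VecDeque.
    let (front, back) = dq.as_slices();
    assert_eq!(front.len() + back.len(), dq.len());

    // resize_with grows (or shrinks) using a generator closure.
    let mut n = 0;
    dq.resize_with(5, || { n += 10; n });
    assert_eq!(dq.iter().copied().collect::<Vec<_>>(), vec![2, 4, 6, 10, 20]);
}
```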

That’s it for now!

Continuing from the Rust standard library study series, it's time for `LinkedList<T>`. Note that implementations are taken from *Rust stable v1.33.0*.

> A **doubly-linked** list with **owned nodes**. The `LinkedList` allows pushing and popping elements at either end in **constant time**. Almost always it is better to use `Vec` or `VecDeque` instead of `LinkedList`. In general, array-based containers are **faster**, more **memory efficient** and make better use of CPU **cache**. (Rust std doc)

Note that unlike `Vec<T>`:

- accessing an element by index is O(n), i.e. it needs to iterate linearly over the list.
- `append` is O(1).
- It's interesting how `linked_list::Iter` differs from `linked_list::IterMut`. The invariant of `IterMut` is enforced with `&mut`, and `PhantomData<&'a Node<T>>` ensures soundness (more on `PhantomData` and dropck later).
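A short sketch of the constant-time operations at both ends, including the O(1) `append` splicing (the example is my own):

```rust
use std::collections::LinkedList;

fn main() {
    let mut list: LinkedList<i32> = LinkedList::new();
    // Pushing and popping at either end is O(1).
    list.push_back(2);
    list.push_front(1);
    list.push_back(3);
    assert_eq!(list.iter().copied().collect::<Vec<_>>(), vec![1, 2, 3]);

    // append is O(1): it splices the other list in by relinking pointers,
    // leaving `tail` empty (compare with Vec::append, which moves elements).
    let mut tail: LinkedList<i32> = LinkedList::new();
    tail.push_back(4);
    list.append(&mut tail);
    assert!(tail.is_empty());
    assert_eq!(list.len(), 4);
    assert_eq!(list.pop_back(), Some(4));
}
```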

There's an entire book (which I highly recommend going through in detail if you haven't already) convincing the reader why it's tricky to implement even a *singly*-linked list, and why it's most probably not a good idea for new Rust users!

Because of Rust's affine type system / ownership, it's actually tricky to implement a *doubly*-linked list. The main reason is that a node seemingly needs to have *two owners*, its adjacent nodes. However, that's possible with `NonNull<T>`, which we talked about in the `Vec<T>` study.

Here’s the stripped down definition

```rust
// Note: NonNull<T> does NOT own the referent
#[repr(transparent)] // <-- enforces the same type representation as *const T
struct NonNull<T: ?Sized> {
    pointer: *const T,
}

struct Node<T> {
    next: Option<NonNull<Node<T>>>, // Not Option<Box<Node<T>>>
    prev: Option<NonNull<Node<T>>>,
    element: T,
}

struct LinkedList<T> {
    head: Option<NonNull<Node<T>>>,
    tail: Option<NonNull<Node<T>>>,
    len: usize,
    marker: PhantomData<Box<Node<T>>>, // <-- sound dropck
}
```

**Why not `Option<Box<Node<T>>>`?**

It's probably a good idea to see what the difference would be if we used `Box<Node<T>>` instead. We discussed what `Unique<T>` is and how it's different from `NonNull<T>` previously, but as a quick reminder: `Unique<T>` owns the referent whereas `NonNull<T>` does not, and in fact `Box<T>` (a **pointer type** for **heap allocation**) just wraps `Unique<T>` and provides a new interface for interacting with it.

Let's consider `Node<T>` with `Box` as follows (playpen link):

```rust
struct Node<T> {
    prev: Option<Box<Node<T>>>,
    next: Option<Box<Node<T>>>,
    element: T,
}

fn main() {
    let mut head_node = Node {
        prev: None,
        next: None,
        element: 1,
    };
    let next_node = Node {
        prev: Some(Box::new(head_node)), // <-- head_node is moved here
        next: None,
        element: 2,
    };
    head_node.next = Some(Box::new(next_node)); // Not good!
}
```

This design begs for *Use-After-Free* (UAF) and thus *Undefined Behavior* (UB), which we know we shouldn't push further. However, using a non-owning `NonNull<T>` solves the problem as follows (playpen link):

```rust
use std::ptr::NonNull;

struct Node<T> {
    prev: Option<NonNull<Node<T>>>,
    next: Option<NonNull<Node<T>>>,
    element: T,
}

fn main() {
    let mut head_node = Node {
        prev: None,
        next: None,
        element: 1,
    };
    let next_node = Node {
        prev: NonNull::new(&mut head_node as *mut _),
        next: None,
        element: 2,
    };
    head_node.next = NonNull::new(&next_node as *const _ as *mut _);
}
```

But how can we make sure this is sound, especially when using it in `LinkedList<T>`? More precisely:

**`PhantomData` and dropck?**

I've been trying to understand the deeper relation between `PhantomData` and what makes dropck (and hence `LinkedList<T>`) sound, but couldn't find any clear explanation, so I asked in the users channel and got an amazing, thorough answer which can be generalized to `Vec`, `LinkedList`, etc.

First, two important points that we're going to talk about:

- Variance is a concept related to **generic parameter(s)**.
- Rust has **subtyping** only for **lifetime parameters**.

Things can get confusing because a lifetime parameter is actually a generic parameter, so variance and subtyping are *tied* together.

Variance in Rust is about allowing a lifetime parameter to be ok (i.e. approximated) with a:

- Shorter lifetime: **co**-variance
- Longer lifetime: **contra**-variance
- Neither shorter nor longer lifetime: **in**-variance

These are actually the **assumptions** we need to make so that we can be sure our implementation is **sound**.

*Note*: Wherever you see variance in Rust, by *default* it means covariance.

Now here's a classic example. Try to guess the output first. (Playground link)

```rust
struct MyCell<T> {
    value: T,
}

impl<T: Copy> MyCell<T> {
    fn new(value: T) -> Self {
        MyCell { value }
    }

    fn get(&self) -> T {
        self.value
    }

    fn set(&self, new_value: T) {
        // signature: pub unsafe fn write<T>(dst: *mut T, src: T)
        // Overwrites a memory location with the given value without reading
        // or dropping the old value
        unsafe {
            std::ptr::write(&self.value as *const T as *mut T, new_value);
        }
    }
}

fn foo(rcell: &MyCell<&i32>) {
    let val: i32 = 13;
    rcell.set(&val);
    println!("foo set value: {}", rcell.value);
}

fn main() {
    static X: i32 = 10;
    let cell = MyCell::new(&X);
    foo(&cell);
    println!("end value: {}", cell.value);
}
```

And the output is:

```
foo set value: 13
end value: 32766 // ???
```

If you guessed that the *end value* would be nonsense, you might skip to the end; if it's unsettling and you were hoping the compiler would guide us here, please keep reading.

Well, before going into more details, here's the example using `Cell<T>`. Can you guess the output now? (Run in playground)

```rust
use std::cell::Cell;

fn foo(rcell: &Cell<&i32>) {
    let val: i32 = 13;
    rcell.set(&val);
}

fn main() {
    static X: i32 = 10;
    let cell = Cell::new(&X);
    foo(&cell);
}
```

And it doesn't compile because of:

```
error[E0597]: `val` does not live long enough
 --> src/main.rs:7:15
  |
5 | fn foo(rcell: &Cell<&i32>) {
  |               - let's call the lifetime of this reference `'1`
6 |     let val: i32 = 13;
7 |     rcell.set(&val);
  |     ----------^^^^-
  |     |         |
  |     |         borrowed value does not live long enough
  |     argument requires that `val` is borrowed for `'1`
8 | }
  | - `val` dropped here while still borrowed

error: aborting due to previous error
```

Ok! this is what we expect from the compiler, right?

To understand where the root of the problem is in `MyCell<T>`, let's analyze it using the nomicon's visual representation of lifetimes:

```rust
static X: i32 = 10;
'a: {
    let cell = MyCell::new(&'a X);
    'b: {
        // foo(&'b cell) more or less is:
        let val: i32 = 13; // <-- created here
        'c: {
            rcell.set(&'c val);
            println!("foo set value: {}", rcell.value);
        }
    } // <-- val is dropped here
    println!("end value: {}", cell.value);
}
```

The problem occurs because we've allowed change/mutation with the shorter lifetime `'c` than `'a` (the co-variant assumption). However, this is clearly not sound, because `val` exists in `'b` and is **dropped** at the end of `'b`'s scope; that's why printing `cell.value` yields nonsense (the content of a freed pointer!).

**Claim**: It doesn't matter how you implement `set` for `MyCell<T>`; it'll be unsound.

Pretty powerful claim! The reason is that the definition of `MyCell<T>` doesn't put any restriction on disallowing shorter lifetimes from messing with its value. In other words, we're kind of *forgetting* any particular lifetime constraints, meaning that our `MyCell<T>` is co-variant wrt `T`, where in our case `T` is `&'a i32`, so it is co-variant wrt `'a`.

To be able to make such claims, we need *type-level* knowledge (a more succinct treatment is through type constructors); value-level knowledge is *not* enough. We can be pretty sure these issues have been taken care of for the types provided in the standard library.

**How does `Cell<T>` enforce in-variance?**

Here's the stripped-down definition of `Cell<T>`:

```rust
pub struct Cell<T: ?Sized> {
    value: UnsafeCell<T>,
}

#[lang = "unsafe_cell"] // --> known to the compiler
pub struct UnsafeCell<T: ?Sized> {
    value: T,
}
```

Wait, what? What's the difference? o_0

The answer has deep roots in the compiler, with the attribute `#[lang = "unsafe_cell"]`.

**Can we fix `MyCell<T>` somehow?**

If you find yourself in a situation where you need to make your type in-variant, you can include an in-variant type, such as `Cell<T>` or `UnsafeCell<T>`, in `PhantomData`, for example with `PhantomData<Cell<T>>` (Exercise: can you find other in-variant types?). Check out how it fixes the issue.

I hope you can see how important variance is and how the compiler handles it for us in most cases. A complete understanding of this matter becomes really important when writing unsafe code, for example in FFI.

You might ask: how could using a longer lifetime ever be ok? Well, it's kind of rare in fact. For example, where a `fn(&'static i32)` is expected, it's ok to use a `fn(&'a i32)`, so a function is *contra-variant wrt its arguments* (and co-variant wrt its return type). There's a fourth case: most primitive types are **bi**-variant, meaning they're *both* co-variant and contra-variant.

If you’re interested in knowing more there are some great resources

- Felix Klock presentation
- Rustonomicon.
- Other resources at the end of my presentation here.
- The Variance RFC is excellent but don’t get confused with some historic changes.
- Rust compiler guide.

The upcoming series of blog posts will contain my study of Rust standard library. I’ve partially written some parts of the series in scattered places and want to gather them in one place for better and easier access. I intend to update whenever I discover something interesting/important to remember.

I’m referring to implementations in *Rust stable v1.33.0*.

This post covers some details of std::vec that I found important.

`Vec<T>` is a dynamic array which only grows and never shrinks automatically.

> A **contiguous growable array** type with **heap-allocated** contents. (Rust std doc)

Notice the difference from its counterpart, the *stack-allocated fixed-size* array `[T; N]` (where, at this time, `N` needs to be a specified non-negative integer; const generics will hopefully come soon).

Ok! Let's dig into it. `Vec<T>` contains **(pointer, capacity, length)**.

- The pointer will **never be null**, so it enjoys the *null-pointer optimization*.
- The pointer may *not* actually point to allocated memory, for example with `Vec::new()`, `vec![]`, or `Vec::with_capacity(0)`.

- The capacity of a vector is the amount of space allocated for any future elements that will be added to the vector.
- The length is the number of actual elements pushed/inserted into the vector.
- `Vec` allocates memory iff `mem::size_of::<T>() * capacity() > 0`. So it does *not* allocate for a Zero-Sized Type (ZST), even with positive capacity.
- When the length matches the capacity, `Vec` will (re)allocate by a certain growth factor. This makes insertion *amortized O(1)*. Right now the growth factor is 2. However, compared to other languages such as C++, Java, etc., it doesn't seem to be optimal given any global first-fit allocator. Heuristically, **1.5**, or a number a bit less than the golden ratio, is considered optimal. Here's the related issue, currently open. I found it interesting to dig into!
- How about a shrink factor? For example, if we `pop` half of the elements, would a quarter of the memory be freed? No, actually! Even `pop`ping all the elements leaves the capacity unchanged, leaving a hole on the heap. Therefore, `pop` (from the back) is O(1) and **not** amortized. If you need to free up some memory, use `shrink_to_fit`.
- If you need to use `Vec` for FFI or as a memory-backed collection, be sure to reclaim the memory with `from_raw_parts` and then drop it explicitly.
- If used in FFI and you need to pass it as a pointer, for safety remember to call `shrink_to_fit` or `truncate` to the length prior to passing the pointer (`as_mut_ptr()` or `as_ptr()`), so as not to pass an uninitialized memory buffer.
- The order of elements is always guaranteed to be the same if coerced into a slice.

Here’s the stripped down definition

```rust
struct Vec<T> {
    buf: RawVec<T>,
    len: usize,
}

// Default Global allocator
struct RawVec<T, A: Alloc = Global> {
    ptr: Unique<T>,
    cap: usize,
    a: A,
}

#[repr(transparent)]
struct Unique<T: ?Sized> {
    pointer: *const T,
    _marker: PhantomData<T>,
}
```

- `#[repr(transparent)]` enforces that the `Unique<T>` type representation is the same as `*const T`.
- `Unique<T>` is the covariant version of `*mut T` (wrt `T`) and has *stronger* semantics than `NonNull<T>`.
- Unlike `*mut T`, the pointer must **always be non-null**.
- In fact, `Box<T>` wraps `Unique<T>`, i.e. `struct Box<T: ?Sized>(Unique<T>)`.
- It can be accessed in nightly through `#![feature(ptr_internals)]` and `core::ptr::Unique`.
- If `T` is `Send`/`Sync` then `Unique<T>` is `Send`/`Sync`.
- The presence of the `PhantomData<T>` marker is only important for *rustc dropck* to understand that we logically own a `T`; this is the main difference between `Unique<T>` and `NonNull<T>`, where the latter is defined as

```rust
// NonNull<T> doesn't own the referent whereas Unique<T> does
#[repr(transparent)]
struct NonNull<T: ?Sized> {
    pointer: *const T,
}
```

This series of posts is going to be a *non-traditional intro* to Rustlang, so buckle up! I will give more details when necessary. The code is available on my GitHub.

Basic linear algebra operations such as the vector inner product and General Matrix-Matrix (GEMM) product are at the heart of computer science, so why not learn Rust by implementing them!

Rust continues to be loved (see the Stack Overflow survey(s)) and has a bright future!

In part 1, we will go through the vector inner product on **CPU**. Part 2 will be about GEMM in Rust on **GPU**. Stay tuned!

Rust is a statically typed, modern systems language (with C-level speed), with zero-cost abstractions, no segfaults, and freedom from data races.

To add more: as a systems language, it doesn't make sense to have garbage collection, so Rust memory management is left to the programmer, but there are useful mechanisms to automate it in most cases, for example through the `Drop` *trait*: when a variable goes out of scope, its data is deallocated and removed. Neat!
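A small sketch of `Drop` in action (the `Noisy` type is my own), logging the order in which values are dropped at the end of a scope:

```rust
use std::cell::RefCell;
use std::rc::Rc;

struct Noisy {
    name: &'static str,
    log: Rc<RefCell<Vec<&'static str>>>,
}

impl Drop for Noisy {
    // Called automatically when the value goes out of scope.
    fn drop(&mut self) {
        self.log.borrow_mut().push(self.name);
    }
}

fn main() {
    let log = Rc::new(RefCell::new(Vec::new()));
    {
        let _a = Noisy { name: "a", log: Rc::clone(&log) };
        let _b = Noisy { name: "b", log: Rc::clone(&log) };
        // Locals are dropped in reverse declaration order at scope end.
    }
    assert_eq!(*log.borrow(), vec!["b", "a"]);
}
```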

Rust's type system is an implementation of affine type systems; that means each data/value has exactly **one owner** and can be used at most once. *Ownership* is a unique feature of Rust.

Alright! That was a crude intro to Rust. To learn more, see the official Rust book and the docs. I'm assuming you've already installed Rust through `rustup`, which provides the great toolchain that is undoubtedly superior to C/C++'s. (Welcome to the modern world!)

To implement the simple inner product of two equal-length vectors, we have two basic options: either Rust's array of type `[f64; N]` (where `N` is a non-negative integer whose size is known at compile time) or `Vec` (dynamic array). A Rust array's size must be known at compile time, so when it's not known we can use it behind a *pointer*, `&[f64]`, aka a *slice*.
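A quick sketch of how both fixed-size arrays and `Vec`s coerce to the same slice type, so one function covers both:

```rust
// &[f64] accepts views into both arrays and Vecs.
fn sum(xs: &[f64]) -> f64 {
    xs.iter().sum()
}

fn main() {
    let arr: [f64; 3] = [1.0, 2.0, 3.0]; // size known at compile time
    let vec: Vec<f64> = vec![4.0, 5.0];  // size known only at run time

    assert_eq!(sum(&arr), 6.0);
    assert_eq!(sum(&vec), 9.0);
    // A slice can also view just part of the data:
    assert_eq!(sum(&arr[1..]), 5.0);
}
```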

Rust has a variety of pointers; the most common ones are `&x` (a shared reference) and `&mut x` (a mutable reference to x). There are also a variety of smart pointers available, such as `Box`, `Rc` (reference counting), and `RefCell` (interior mutability), as well as *raw pointers* (`*const T`, `*mut T`), etc.
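A whirlwind sketch of these pointer types (my own example, not from the post's repo):

```rust
use std::cell::RefCell;
use std::rc::Rc;

fn main() {
    // Box: single-owner heap allocation.
    let boxed: Box<i32> = Box::new(41);
    assert_eq!(*boxed + 1, 42);

    // Rc: shared ownership via reference counting.
    let shared = Rc::new(vec![1, 2, 3]);
    let alias = Rc::clone(&shared);
    assert_eq!(Rc::strong_count(&shared), 2);
    assert_eq!(alias.len(), 3);

    // RefCell: interior mutability, with borrow rules checked at run time.
    let cell = RefCell::new(0);
    *cell.borrow_mut() += 10;
    assert_eq!(*cell.borrow(), 10);

    // Raw pointers require unsafe to dereference.
    let x = 7;
    let p: *const i32 = &x;
    unsafe { assert_eq!(*p, 7); }
}
```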

Let's go to the code. The simplest implementations go like this:

```rust
pub fn aslice_dot_naive(a: &[f64], b: &[f64]) -> f64 {
    let mut ret = 0.0;
    for i in 0..a.len() {
        ret += a[i] * b[i];
    }
    ret
}

pub fn vec_dot_naive(a: &Vec<f64>, b: &Vec<f64>) -> f64 {
    let mut ret = 0.0;
    for i in 0..a.len() {
        ret += a[i] * b[i];
    }
    ret
}
```

`fn` defines a function (and `fn(...)` is the *function pointer* type). `let mut ret = 0.0;` is a mutable variable binding; by *default, variables are immutable* (when `mut` is not specified). Note that the two functions have the same implementation but different signatures.

Using functional features like `zip` and `map` (taking a closure) or `fold`, we can also write:

```rust
pub fn vec_dot_zip(a: &Vec<f64>, b: &Vec<f64>) -> f64 {
    a.iter().zip(b.iter()).map(|(&x, &y)| x * y).sum()
}

pub fn vec_dot_fold(a: &Vec<f64>, b: &Vec<f64>) -> f64 {
    (0..a.len()).fold(0f64, |sum, i| sum + a[i] * b[i])
}
```

It’s beautifully functional and doesn’t seem like system programming at all!

The majority of the time, safe Rust is enough. The unsafe world doesn't have all of Rust's guarantees (no data races, etc.), so it must be used with care; otherwise, segfaults (and the rest) will be thrown at your face!

For example, to access the raw parts of a `Vec` in the above example, we can use `get_unchecked` as below:

```rust
fn vec_dot_unsafe(a: &Vec<f64>, b: &Vec<f64>) -> f64 {
    unsafe {
        (0..a.len()).fold(0f64, |sum, i| sum + a.get_unchecked(i) * b.get_unchecked(i))
    }
}
```

Notice the `unsafe` keyword wrapping the unsafe world for us. Unsafe is the main feature behind FFI in Rust, because other languages are truly unsafe.

Speaking of unsafe, and since we're doing the vector dot product, we can use BLAS or CBLAS instead of all the basic implementations. Fortunately, the bindings are available here.

All we need to do is add them to `Cargo.toml`:

```toml
[dependencies]
blas = "0.19"
openblas-src = "0.5"
cblas = "0.1.5"
```

and we are good to go! Here's the dense dot product `ddot` (Netlib BLAS doc):

```rust
use blas::ddot;
use cblas::ddot as cddot;

fn dot_blas(a: &Vec<f64>, b: &Vec<f64>) -> f64 {
    unsafe { ddot(a.len() as i32, &a[..], 1, &b[..], 1) }
}

fn dot_cblas(a: &Vec<f64>, b: &Vec<f64>) -> f64 {
    unsafe { cddot(a.len() as i32, &a[..], 1, &b[..], 1) }
}
```

The inner product is a classic example of map-reduce / data parallelism. There's the amazing Rayon crate that we can use, and it keeps Rust's guarantees, most notably here, *data-race freedom*. It parallelizes our code whenever possible; otherwise it falls back to sequential execution.

```rust
use rayon::prelude::*;
use std::ops::Add;

pub fn vec_dot_par_iter_slow(a: &Vec<f64>, b: &Vec<f64>) -> f64 {
    // final sum is the bottleneck
    a.par_iter().zip(b.par_iter()).map(|(&x, &y)| x * y).sum()
}

pub fn vec_dot_par_iter_fast(a: &Vec<f64>, b: &Vec<f64>) -> f64 {
    a.par_iter().zip(b.par_iter()).map(|(&x, &y)| x * y).reduce_with(Add::add).unwrap()
}
```

Another option is the ndarray crate, which, like Numpy, has `dot` product functionality. We'll use it with the *blas feature enabled*.

Now to test, we'll use `cargo test`. First we create a vector of some given length with randomly sampled elements (note that `rng.gen::<f64>()` samples uniformly from `[0, 1)`, despite the `randn_vec` name). It goes something like this:

```rust
#[cfg(test)]
mod tests {
    use std::iter;
    use rand::{thread_rng, Rng};
    use super::*;

    fn randn_vec(n: usize) -> Vec<f64> {
        let mut rng = thread_rng();
        iter::repeat(()).map(|()| rng.gen()).take(n).collect::<Vec<f64>>()
    }

    fn close(x: f64, y: f64) -> bool {
        (x - y).abs() < 1e-8
    }

    #[test]
    fn dot() {
        let v = randn_vec(10);
        let a = aslice_dot_naive(&v[..], &v[..]);
        let b = vec_dot_naive(&v, &v);
        let c = vec_dot_zip(&v, &v);
        assert!(close(a, b));
        assert!(close(b, c));
    }
}
```

`assert!` is a macro (note the bang at the end) which is expanded at compile time; it checks a boolean value and, if false, panics! Rust's macros are hygienic, unlike C's.

We can simply make our own `close!` macro with its own AST grammar:

```rust
macro_rules! close {
    ($x:expr, $y:expr) => (assert!(($x - $y).abs() < 1e-8))
}
```

Since we're going to test more things, and macros are great for staying DRY, we can write an `all_close!` macro that compares all the floats, recursively:

```rust
macro_rules! all_close {
    ($x:expr, $y:expr) => (assert!(($x - $y).abs() < 1e-8));
    ($x:expr, $y:expr, $($ys:expr),+) => ({
        all_close!($x, $y);
        all_close!($y, $($ys),+)
    })
}
```

Then we can compare all the values together with `all_close!(a, b, c);`. So the complete tests are:

```rust
#[cfg(test)]
mod tests {
    use std::iter;
    use rand::{thread_rng, Rng};
    use super::*;

    fn randn_vec(n: usize) -> Vec<f64> {
        let mut rng = thread_rng();
        iter::repeat(()).map(|()| rng.gen()).take(n).collect::<Vec<f64>>()
    }

    macro_rules! all_close {
        ($x:expr, $y:expr) => (assert!(($x - $y).abs() < 1e-8));
        ($x:expr, $y:expr, $($ys:expr),+) => ({
            all_close!($x, $y);
            all_close!($y, $($ys),+)
        })
    }

    #[test]
    fn dot() {
        let v = randn_vec(10);
        let a = aslice_dot_naive(&v[..], &v[..]);
        let b = vec_dot_naive(&v, &v);
        let c = vec_dot_zip(&v, &v);
        let d = vec_dot_fold(&v, &v);
        let e = vec_dot_unsafe(&v, &v);
        let f = vec_dot_par_iter_slow(&v, &v);
        let g = vec_dot_par_iter_fast(&v, &v);
        let h = dot_blas(&v, &v);
        let i = dot_cblas(&v, &v);
        all_close!(a, b, c, d, e, f, g, h, i)
    }
}
```

Then running `cargo test --lib vector` should pass our tests.

We can go ahead and perform some microbenchmarks with `cargo bench`, which is available through the Rust nightly release.

To switch to nightly `rustc`, simply run `rustup default nightly`.

First, we need to add the test feature at the root module:

```rust
#![feature(test)]
extern crate test;
```

then in the tests module above, we can benchmark our implementations for a random vector of some length, for example:

```rust
use test;

#[bench]
fn bench_aslice_dot_naive(bench: &mut test::Bencher) {
    let v = randn_vec(1000);
    bench.iter(|| aslice_dot_naive(&v[..], &v[..]));
}
```

However, there's a better option: the *Criterion* crate, which performs the benchmarks on stable Rust and provides more statistical info (*confidence intervals*, *p-values*, *outliers*, as well as *warm-up time*).

The results of running our micro-benchmarks for the inner product of random vectors of length 1000 and 1,000,000 with themselves (`cargo bench 2>&1 | tee -a bench_logs.txt`), on my four-year-old, 8-core Thinkpad T540p laptop running Ubuntu 14.04, are below.

Note: keep the Numpy C-optimized results for float32 in mind (float32 is needed for a correct comparison to ndarray):

```python
# python 3.6 Anaconda
# numpy version 1.14
import numpy as np

a = np.random.randn(1000).astype("float32")
%timeit a.dot(a)
# 573 ns ± 0.909 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

b = np.random.randn(1_000_000).astype("float32")
%timeit b.dot(b)
# 420 µs ± 1.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

For length 1000:

- Naive: ~ 795 ns (nanoseconds)
- Parallel: ~ 11 us (slow) and ~ 5 us (fast). They're worse than the naive cases due to excessive thread communication overhead. A way to fix them is through *explicit divide-and-conquer* (forming a tree-aggregation pattern) with proper `join`.
- Blas/CBlas: ~ 105 ns
- Ndarray-blas: ~ 61 ns

For length 1,000,000:

- Naive: ~ 887 us (microseconds)
- Parallel: ~ 6.5 ms (slow) and ~ 887 us (fast). Still should be improved, as explained above.
- Blas/CBlas: ~ 358 us
- Ndarray-blas: ~ 111 us

Hats off to bluss’s Ndarray and the notable speedups compared to Numpy!

The complete benchmark logs in more details are available here.

That’s it for now. Stay tuned for more good stuff in part 2.

Today, Machine Learning/Deep Learning people have been sharing their great excitement over Ali Rahimi's talk at NIPS (from minute 57 onwards). Undoubtedly, it's a great talk and you should check it out if you care about fundamental issues and the lost rigor in Deep Learning results. His talk resonated a lot with me, and for more reasons that I'll try to explain in my own way, while these tweets sum up the hype part well.

While there's no doubt Deep Learning has been an incredible enabler, **AI hype is real** and you can feel its bittersweet taste. It is too naive to think that at this stage Deep Learning will bring us *Artificial General Intelligence*, and portraying ML/DL as in the Terminator movies, coming to extinguish the human race, is irresponsible and idiotic. At this stage, Deep Learning is like a bundle of techniques, and since it is led by the (empirical part of the) Computer Science community, "working code" and seeing empirical results is somehow *the proof*. Moreover, apparently in order to get into ML/DL, you'd only need to know calculus and coding to *be the revolution* and *change the world*! ¯\\_(ツ)_/¯

*My pessimistic side* is saying that what is happening now is that Deep Learning is growing exponentially fast, with a big army of enthusiasts ready to code something up quickly, generate results, publish papers, and attract a lot of attention. These results somehow contribute to building systems for health care, for example, which is sensitive enough that we cannot give more power to Deep Learning until there's a solid,

What is baffling to me is that there are many faculty members who seem happy about this situation, are not addressing the real problems, and are somehow becoming enemies of science. Their ignorance is mind-blowing!

My *optimistic side* is pointing me towards the efforts and initiatives for addressing these issues. Some examples include the introduction of Google's Colaboratory project and now

To finish off, I think another important part is the realm of optimization algorithms, more towards the *Unit Tests for Stochastic Optimization* type of research.

First a short relevant intro about myself, then the overview.

Back in 2014, the first course in machine learning that I took was Andrew Ng's very famous introductory Coursera course. I became very interested, mainly because with simple math and programming, one was able to build models for solving various important tasks. Later, by taking more academic courses, I started considering working in the ML field. It's been quite some time now that I've worked in the ML and Big Data areas. I am also a Deep Learning enthusiast, starting to get a firm grasp of the practical and theoretical aspects of DL. My current position requires changing between two hats frequently: the researcher hat and the software engineer hat.

In short, the research role requires staying up to date with ML research trends and the ability to assess many academic results. The software engineer role requires judging which results are valuable in a *real production environment*, in order to bring them into existing software ecosystems / platforms and ultimately create / enhance ML products.

Given that, I’ve had some previous exposure to DL and training deep NNs, but I want to emphasize **1) going back to basics frequently to close some of the learning gaps** and **2) never missing the opportunity to learn from the key people of the area that you’re working on**. These two were my motivations for taking Andrew Ng’s DL Coursera courses.

At the time of writing this post, 3 out of the 5 courses had launched. I’ll update this post when courses 4 and 5 become available and I finish them. I should also mention that I skipped the lectures of course 1 and watched the lectures of courses 2 and 3 at x1.25 / x1.5 speed.

- Course 1 (Neural networks and deep learning): What I really liked and definitely would recommend: when you want to learn ML/DL, start coding everything at the low level of Numpy and don’t jump into Keras (as opposed to fast.ai‘s approach; I’ll come back to it later). This is exactly how the assignments were designed, though with an inevitable amount of boilerplate code.
- Course 2 (Improving deep neural networks): It gives you the necessary intuitions about improving and tuning deep networks from different perspectives. Again, I liked the practical aspects of implementing various tuning, regularization and optimization techniques in the assignments. The last assignment teaches introductory Tensorflow. I was expecting to get to Tensorflow much sooner, though.
- Course 3 (Structuring machine learning projects): I enjoyed this course more than the first two, mainly because it taught me things that don’t exist in any literature and that are extremely important from both the research and engineering sides. The course offers techniques for
*critical problems that arise when you want to design/architect or assess ML/DL projects, as well as for prioritizing which directions to choose in different scenarios.* - One highlight of the courses that I also enjoyed is the series of interviews with DL heroes, from Hinton to other key researchers.
- I like the fact that Andrew Ng is bringing his own terminologies and notations.
- Assignments were straightforward and nicely designed in Python.
- There are a few typos and solution mismatches in the assignments that will be corrected over time.
- A nitpick in terms of Python coding style: I always advocate keeping PEP 8 in mind; for example, parameters in callables with given values shouldn’t have extra whitespace, i.e. write f(x=1), not f(x = 1).

I finished watching the fast.ai first-course lectures when they were just launched earlier this year, and also their latest course recently. I can understand the value of their approach and why it works for those with less exposure to math; however, for me it was rather disappointing and wasn’t satisfying at all. From early on in part 1, building a cat vs. dog classifier using VGG16 in Keras in a few lines of code showed me how many important details were kept away from me (behind so much abstraction), and it worried me more than it gave me confidence. However, part 2 of their lectures is more appealing to me.

So I think, in the end, it boils down to *whether you know you want to understand things deeply from the start (Andrew Ng’s courses), or you want to go into very simplified applications first, not caring about the details, and then gradually learn some techniques to better understand DL (fast.ai courses)*.

Finally, here is my list of other DL courses/books that I enjoyed:

- CS231n: Convolutional Neural Networks for Visual Recognition
- CS224d: Deep Learning for Natural Language Processing
- Goodfellow et al.’s deep learning book
- Neural network methods in natural language processing (even if you’ve already read the primer before)


Let’s start with the intuitive distributional hypothesis:

> A word is characterized by the company it keeps

or

> linguistic items with similar distributions have similar meanings.

In other words, we expect words used in similar *contexts* to have similar *meanings (semantic relations)*.

The goal is to build a mathematical model such that, given a word (token), we can assign it a real-valued vector (in some space), so that interesting properties of words, such as semantic relations, are preserved; for example, by using the inner product or cosine similarity of vectors as the metric.

Mikolov et al. constructed such vector representations of words from natural language text that preserve semantic relations via simple quantities of vectors, such as their *inner product*. Their *set of models* is called Word2Vec. Intuitively, there are two basic ideas: for a series of words (tokens) in a corpus/*Text* data, and a choice of *context*, i.e. a *window (set)* of words $C(w)$ around a word $w$, either find the probability of a word given its context, that is, $p(w \mid C(w))$, *or* find the probability of observing a context given a word, that is, $p(c \mid w)$. The first model is called Continuous Bag of Words (CBOW) and the second model is called SkipGram. (If you’ve never heard of these terms before, I suggest reading this note.)

One of the well-known tasks that Word2Vec performs well on is the analogy task. Word2Vec captures relations between words of the type *king is to queen as man is to woman*.

Formally, let $v_w$ be a vector representation (embedding) of a word $w$ in some Euclidean vector space. Then

$$v_{\text{king}} - v_{\text{man}} + v_{\text{woman}} \approx v_{\text{queen}}$$

with respect to some similarity measure, and one such similarity measure can be the cosine similarity of vectors, for example.
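As a toy illustration of the mechanics (the 3-d vectors below are made up for the example, not real trained embeddings), the analogy can be answered by a nearest-neighbor search under cosine similarity:

```python
import numpy as np

# Hypothetical toy embeddings, chosen by hand just to show the mechanics.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.9, 0.0]),
    "woman": np.array([0.5, 0.1, 0.8]),
    "queen": np.array([0.9, 0.0, 0.9]),
}

def cosine(u, v):
    """Cosine similarity: inner product of the normalized vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a, b, c):
    """Return the word whose vector is closest to v_a - v_b + v_c."""
    target = emb[a] - emb[b] + emb[c]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

print(analogy("king", "man", "woman"))  # with these toy vectors: queen
```

Real embeddings have hundreds of dimensions and tens of thousands of words, but the lookup is exactly this.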

Given the notations above and the conditional probabilities $p(c \mid w; \theta)$, the goal is to find the parameters $\theta$ of the parametrized model that maximize the corpus probability, i.e.

$$\arg\max_{\theta} \prod_{w \in \text{Text}} \; \prod_{c \in C(w)} p(c \mid w; \theta)$$

There is an independence assumption (reflected in the inner product $\prod_{c \in C(w)}$): given a word, observing the different words in its context are independent events, and those events for each word are themselves independent (reflected in the outer product $\prod_{w \in \text{Text}}$). Note also that contexts are words, and SkipGram models each context given a word *independently*.

One approach to model the conditional probabilities, so as to connect them to the word-vector representation idea, is via the softmax function

$$p(c \mid w; \theta) = \frac{e^{v_c \cdot v_w}}{\sum_{c' \in C} e^{v_{c'} \cdot v_w}}$$

where $v_w, v_c \in \mathbb{R}^d$ are the desired vector representations for $w$ and $c$, and $v_c \cdot v_w$ is the (Euclidean) inner product. Note that $\theta$ is the set of all $v_w$ and $v_c$ for all $w \in V$ and $c \in C$, so there are $|V| \times d + |C| \times d$ parameters, where $d$ is the embedding dimension.

This representation can be considered as a shallow neural network with softmax output.
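The shallow-network view can be sketched in a few lines of Numpy; the vocabulary size, embedding dimension and random vectors below are arbitrary placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4                   # toy vocabulary size and embedding dimension
W = rng.normal(size=(V, d))    # word vectors v_w, one row per word
C = rng.normal(size=(V, d))    # context vectors v_c, one row per context

def softmax(z):
    z = z - z.max()            # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def p_context_given_word(w):
    """p(c | w) for every context c: softmax over the inner products v_c . v_w."""
    return softmax(C @ W[w])

probs = p_context_given_word(3)
print(probs.shape, round(float(probs.sum()), 6))   # (10,) 1.0
```

The expensive part is exactly the denominator: every query touches all $|C|$ context vectors, which motivates the approximations discussed next.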

To address some of the difficulties in training, such as computing the denominator of the softmax, two solutions have been proposed: 1) *hierarchical softmax*, 2) *negative sampling*. In practice, negative sampling is the more favored. To learn more, I recommend reading the Word2Vec Explained paper and the more expanded version in Word2Vec Parameter Learning Explained.

Recall that if two random outcomes $x$ and $y$ are independent, then $p(x, y) = p(x)\,p(y)$. That is, their joint distribution factorizes into their individual distributions. To have a general measure of this phenomenon, given any two (not necessarily independent) random outcomes, we can define their Pointwise Mutual Information as

$$\mathrm{PMI}(x, y) = \log \frac{p(x, y)}{p(x)\,p(y)}$$

In the case of word-context pairs, we can define the PMI matrix whose entries are $\mathrm{PMI}(w, c)$, which is a $|V| \times |C|$ matrix.

One can use Singular Value Decomposition on the PMI matrix to get lower-dimensional representations of words. Let $\mathrm{PMI} = U \Sigma V^{\top}$ be the SVD of the PMI matrix; then, for example, the symmetric factorization $W = U_d \sqrt{\Sigma_d}$ and $C = V_d \sqrt{\Sigma_d}$ (keeping the top $d$ singular values) provides word and context representations, respectively. However, SVD provides the best rank-$d$ approximation with respect to the matrix norm, and in practice this is not enough!
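The whole pipeline fits in a short Numpy sketch; the 3×3 count matrix is made up for the example, and the tiny smoothing constant is only there to avoid log(0):

```python
import numpy as np

# Toy word-context co-occurrence counts #(w, c); rows = words, cols = contexts.
counts = np.array([[4., 1., 0.],
                   [1., 3., 1.],
                   [0., 1., 4.]]) + 1e-12   # smoothing to avoid log(0)

total = counts.sum()
p_wc = counts / total                       # joint p(w, c)
p_w = p_wc.sum(axis=1, keepdims=True)       # marginal p(w)
p_c = p_wc.sum(axis=0, keepdims=True)       # marginal p(c)
pmi = np.log(p_wc / (p_w * p_c))            # PMI(w, c) = log p(w,c)/(p(w)p(c))

# Truncated SVD; the symmetric split gives word and context embeddings.
U, S, Vt = np.linalg.svd(pmi)
d = 2
W_emb = U[:, :d] * np.sqrt(S[:d])           # word representations
C_emb = Vt[:d].T * np.sqrt(S[:d])           # context representations
print(W_emb.shape)                          # (3, 2)
```

`W_emb @ C_emb.T` is then the best rank-2 approximation of the PMI matrix in the Frobenius sense.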

What’s the relation between SkipGram and PMI?

Levy-Goldberg showed that if you take the word embedding matrix $W$ and the context embedding matrix $C$ obtained by SGNS, then $W C^{\top}$ is, in fact, a factorization of the (shifted) PMI matrix. In other words, $W C^{\top} = \mathrm{PMI} - \log k$, where $k$ is the number of negative samples. This result bridges the neural network method with the traditional vector space model of semantics. Another important point is that **the PMI approach suffers from scalability issues, but SGNS is very scalable, indeed.**

Mnih et al. used noise contrastive estimation (NCE) to model the probability of a word-context pair coming from a correct sample vs. an incorrect one. Levy-Goldberg also showed that this approach is equivalent to factorizing the word-context matrix whose entries are the log conditional probabilities $\log p(w \mid c)$, shifted by $\log k$.

Another approach is described in GloVe: Global Vectors for Word Representations. The idea is to relate the *inner product* of the desired vectors to the *ratio* of the context-word conditional probabilities. The optimization objective is

$$\sum_{w, c} f(\#(w, c)) \left( v_w \cdot v_c + b_w + b_c - \log \#(w, c) \right)^2$$

where $\#(w, c)$ is the element of the matrix of word-word co-occurrence counts, $b_w$ and $b_c$ are biases, and $f$ is some weighting function (found empirically) with some desired properties. Although GloVe gives the model more parameters, **its performance is quite similar to SGNS**. In fact, by fixing the biases to be the logarithms of the word and context counts, GloVe also factorizes a shifted version of the PMI matrix.
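A sketch of this objective in Numpy; the weighting function below uses the cap-and-ramp shape from the GloVe paper (with the commonly quoted $x_{\max} = 100$, $\alpha = 3/4$), and all arrays are small made-up toy data:

```python
import numpy as np

def f(x, x_max=100.0, alpha=0.75):
    """GloVe weighting: grows like (x/x_max)^alpha, then caps at 1."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(W, C, b_w, b_c, X):
    """Sum over nonzero counts of f(X_wc) * (v_w.v_c + b_w + b_c - log X_wc)^2."""
    loss = 0.0
    for i, j in zip(*np.nonzero(X)):
        err = W[i] @ C[j] + b_w[i] + b_c[j] - np.log(X[i, j])
        loss += float(f(X[i, j])) * err ** 2
    return loss

rng = np.random.default_rng(1)
X = np.array([[4., 1.], [0., 3.]])          # toy co-occurrence counts
W, C = rng.normal(size=(2, 2)), rng.normal(size=(2, 2))
b_w, b_c = np.zeros(2), np.zeros(2)
print(glove_loss(W, C, b_w, b_c, X) >= 0)   # True: a weighted sum of squares
```

Note that the sum runs only over nonzero counts, which is what makes GloVe trainable on sparse co-occurrence matrices.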

Levy-Goldberg have also provided a much clearer description of the SGNS model, which briefly goes like this:

After taking $\log$ of the corpus probability and plugging in the softmax, maximizing it becomes equivalent to maximizing

$$\sum_{(w, c) \in D} \left( v_c \cdot v_w - \log \sum_{c'} e^{v_{c'} \cdot v_w} \right)$$

where $D$ is the set of all word-context pairs from the text.

The role of Negative Sampling is to approximate the softmax log-probabilities. Basically, to approximate the above objective, we change it to a classification task: given a word-context pair, decide whether it comes from our data $D$ or not. So we need to gather some noise word-context pairs $D'$. Mikolov et al. did this by randomly sampling, from the *smoothed* unigram distribution (i.e. the unigram distribution raised to the power $3/4$), $k$ context noises (in short, *negative samples*) for each word.

Note that without negative samples, the above objective can be maximized *when all word and context vectors become equal with a big enough inner product*. Therefore, with negative sampling, the approximation goes like this:

$$\arg\max_{\theta} \prod_{(w, c) \in D} p(D = 1 \mid w, c) \; \prod_{(w, c) \in D'} p(D = 0 \mid w, c)$$

After some standard simplifications (see Word2Vec Explained), the above objective becomes

$$\sum_{(w, c) \in D} \log \sigma(v_c \cdot v_w) + \sum_{(w, c) \in D'} \log \sigma(-v_c \cdot v_w)$$

where $\sigma(x) = \frac{1}{1 + e^{-x}}$ is the sigmoid function.
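This simplified objective can be evaluated directly; `W` and `C` below are hypothetical random embedding matrices, and the pair lists stand in for the observed pairs and the sampled noise pairs:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_objective(W, C, pos_pairs, neg_pairs):
    """log sigma(v_c . v_w) summed over D, plus log sigma(-v_c . v_w) over D'."""
    pos = sum(np.log(sigmoid(C[c] @ W[w])) for w, c in pos_pairs)
    neg = sum(np.log(sigmoid(-(C[c] @ W[w]))) for w, c in neg_pairs)
    return float(pos + neg)

rng = np.random.default_rng(0)
W, C = rng.normal(size=(5, 3)), rng.normal(size=(5, 3))
D  = [(0, 1), (2, 3)]   # observed (word, context) index pairs
Dn = [(0, 4), (2, 0)]   # sampled noise pairs
print(sgns_objective(W, C, D, Dn) < 0)   # True: a sum of log-probabilities
```

Training then ascends this objective with SGD over the rows of `W` and `C`; each step touches only the vectors of the pairs involved, which is why SGNS scales so well.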

In the presence of $k$ negative samples, for each $(w, c) \in D$ we are computing

$$\log \sigma(v_c \cdot v_w) + k \cdot \mathbb{E}_{c_N \sim P_D} \left[ \log \sigma(-v_{c_N} \cdot v_w) \right]$$

For a given pair $(w, c)$, let $\#(w, c)$ be the number of times it appears in $D$ (think about this like a matrix: $\#(w, c)$ is the entry in row $w$ and column $c$). So the SGNS objective summed with multiplicities becomes

$$\ell = \sum_{w \in V} \sum_{c \in C} \#(w, c) \left( \log \sigma(v_c \cdot v_w) + k \cdot \mathbb{E}_{c_N \sim P_D} \left[ \log \sigma(-v_{c_N} \cdot v_w) \right] \right)$$

Landgraf-Bellay have recently provided another interpretation of the above SGNS objective, showing that it is equivalent to *weighted logistic* PCA. The generalization of this fact is captured through exponential family PCA.

Another interesting idea is the recent work of Avraham-Goldberg, which includes morphological information of words via Part of Speech (POS) tagging: preprocess the text and consider the *(word, POS tag) pair* instead of the word alone. The result is having different vector representations for the same surface form used as different parts of speech.

To understand geometric structures of data, one can look into Topological Data Analysis and its methods, such as Persistent Homology. Zhu gives an introduction to such an approach for Natural Language Processing. In basic algebraic topology (under enough assumptions), the dimension of the zeroth homology group of a topological space (the zeroth Betti number) is the number of connected components, and its first Betti number counts the number of holes. Given some data points, the idea of persistent homology is to *track homology classes along increasing neighborhoods of the data points (simplicial complexes)*.

Recently, Michel et al., using the persistent-homology approach, concluded that such methods don’t have a positive impact on document classification and clustering tasks. They used the Gromov-Hausdorff distance (which is insensitive to isometries) and defined two documents to have the same “geometry” if their GH distance is zero. However, it could be argued that this definition of “geometry” is very limiting and doesn’t capture all the existing structures in document data!

Arora et al., in their work RAND-WALK: A latent variable model approach to word embeddings, using a generative modelling approach, provided more justification for the relations between PMI, Word2Vec and GloVe.

Hashimoto-Alvarez-Melis considered the task of word embedding as *metric recovery*. That is, given a word embedding $w \mapsto x_w$ over a document (where a context is a word as well and there is no separate context embedding, as in SGNS after throwing away the context vectors), one can view the word sequence as a Markov (Gaussian) random walk with transition function

$$p(x_j \mid x_i) \propto e^{-\lVert x_i - x_j \rVert^2}$$

Then, after appropriate normalization, the negative log co-occurrence counts converge, in probability, to the squared distances (up to additive constants) as the corpus grows:

$$-\log C_{ij} \to \lVert x_i - x_j \rVert^2$$

where $C$ is the co-occurrence matrix over our document (the matrix version of the earlier $\#(w, c)$, with contexts as words). This describes a **log-linear relation between co-occurrences and distance.**

The metric recovery holds in a more general setting for random walks over *unweighted directed graphs and data manifolds*. Intuitively,

$$-\log(\text{co-occurrence}) \to (\text{distance/geodesic})^2$$

for some meaning of convergence and distance/geodesic.

We can view words as symbolic data and try to represent their relations with graphs. To learn a representation of symbolic data with hierarchical relations (and a power-law distribution), one well-known approach is embedding the data in a **non-Euclidean space**, such as the $n$-dimensional *Poincaré ball* (or the complex upper half-plane), which is the Euclidean $n$-dimensional open ball equipped with a (non-Euclidean) Riemannian metric. In two closely published papers, Chamberlain et al. studied hyperbolic embeddings for the $2$-dimensional Poincaré disk, and Nickel-Kiela for the general $n$-dimensional Poincaré ball. They examined such embeddings for different types of symbolic data, such as text and network graphs, and showed improvements as well as the ability to capture more information in lower dimensions, because hyperbolic spaces are richer than flat Euclidean spaces.
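For concreteness, the distance in the Poincaré ball model has the closed form $d(u, v) = \operatorname{arcosh}\!\left(1 + 2\,\frac{\lVert u - v \rVert^2}{(1 - \lVert u \rVert^2)(1 - \lVert v \rVert^2)}\right)$, which a few lines of Numpy can compute:

```python
import numpy as np

def poincare_distance(u, v):
    """Hyperbolic distance between two points of the open unit ball."""
    sq = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return float(np.arccosh(1.0 + 2.0 * sq / denom))

origin = np.zeros(2)
near_boundary = np.array([0.999, 0.0])
print(poincare_distance(origin, near_boundary))  # large, though Euclidean dist < 1
```

Distances blow up near the boundary, which is exactly what lets leaves of a large hierarchy be packed near the rim while the root sits near the origin.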

Another line of ideas is related to the encoder-decoder (seq2seq) approach, where either the encoder or the decoder can be any of Convolutional NN or Recurrent NN, such as LSTM or GRU. The basic idea is to encode a sequence of data (sentences) and try to reconstruct it back with the decoder; in the meantime, compact representations are constructed. One such successful approach has been introduced by Kiros et al. in their Skip-Thought Vectors, with a GRU as both encoder and decoder. Given a tuple of contiguous sentences $(s_{i-1}, s_i, s_{i+1})$, let $w_i^t$ be the $t$-th word of sentence $s_i$ and let $x_i^t$ be its embedding. The objective is to maximize the sum of the log-probabilities for the **forward and backward sentences conditioned on the encoder representation**:

$$\sum_t \log P\left(w_{i+1}^t \mid w_{i+1}^{<t}, h_i\right) + \sum_t \log P\left(w_{i-1}^t \mid w_{i-1}^{<t}, h_i\right)$$

where $w_{i+1}^{<t}$ is the sequence of words in sentence $s_{i+1}$ coming before the $t$-th word, and $h_i$ is the hidden state of the encoder GRU.

Thanks for reading this post! I’m still learning and will try to update this post if I find more interesting ideas. If you have any thoughts, please comment below.

When I was studying Algebraic Geometry, the language of categories, objects, morphisms, functors, natural transformations, etc. was essential in understanding modern aspects of the field, as well as modern Mathematics in general. When I switched to Computer Science, I stumbled upon functional languages and Functional Programming (FP) through the Scala programming language (mainly because of Apache Spark), and I loved it. I clearly remember the moment when I realized the connections between the category theory language that I was using in my Math days and its implementations in Computer Science: Functor, Monad, Monoid, etc. It was fascinating and sweet, indeed. I quickly found out that Scala has a purely functional extension called Scalaz that even has a Yoneda object defined. It was a WOW moment! I really enjoyed understanding and applying the Yoneda lemma in my Math days.

It discusses a way for developers to

> use an interactive proof assistant to both implement their system and to state a formal theorem defining what it means for their system to be correct.

It is a way to close the gap between implementation errors and the formal Mathematical theory behind the implementation. The mechanism is important because, when you design a machine learning system, it enables you to

> find implementation errors systematically, without recourse to empirical testing.

Basically, it gives you formal verification of your program. It does so by using the Lean programming language, which is an *interactive proof assistant* as well as a programming language (see the talk by the creator of Lean). I’ll explain more about what an interactive proof assistant does.
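As a minimal illustration (my own toy example, not from the paper) of what an interactive proof assistant checks, here is a tiny Lean statement whose proof term Lean verifies mechanically:

```lean
-- We state a proposition about natural numbers and supply a proof term;
-- Lean type-checks the term, and that type check *is* the verification.
theorem my_zero_add (n : ℕ) : 0 + n = n :=
nat.zero_add n
```

If the proof term did not actually establish the stated proposition, the file would simply fail to compile; correctness is checked by the machine, not by testing.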

As a case study, the author developed a system called Certigrad,

> which allows users to construct arbitrary stochastic computation graphs.

A **stochastic computation graph** is a DAG (computation graph) having deterministic or stochastic computational units.

Here’s an example of a simple variational autoencoder (VAE), a generative model whose computation graph is made stochastic by the normal sampling layer in the middle. (For more about VAEs, see the nice post by fastforwardlabs.)

Assume that you want to develop a backpropagation algorithm for such a loss function. You may also have a sketch of the final derivation. Also assume that we don’t have access to Tensorflow or other deep learning frameworks that do automatic differentiation. You may even have no data! **It’s you and your computer**, and you want to make a formal Mathematical system that does all these things for you. You can be sure that the derivation is formally correct when your system doesn’t fail; it is then *bug-free*. To do so, you’d need to define a long series of things in your system so that it can understand what the above loss function means and how to compute backpropagation.

Basically, you’d need to define the real number type, tensors, functions, the differentiation operator, differentiability, the integration operator, integrability, conditions under which differentiation commutes with integration, the notion of a distribution, sampling from a distribution, computing the expectation (a monadic operation!) and more details, so that the system can understand the meaning of the loss function and what backpropagation is supposed to do. *The novelty is that in Lean you can do so, and it can even lead you interactively through the intermediate steps / lemmas (aka tactics in Lean) to the final machine-checkable formal derivation*.

In conclusion, the authors wrote

… we were able to achieve extremely high confidence that our system was bug-free without needing to think about how all the pieces of the system fit together. In our approach, the computer— not the human—is responsible for ensuring that all the local properties that the developer establishes imply that the overall system is correct.

Given my initial background:

- Scala would really be a good fit if one wanted to create the next Lean-like programming language.
- This paper led me to read and think about type theory, decidability and homotopy type theory, and their applications and contributions to the foundations of modern Mathematics. It is truly fascinating that such abstractions not only pave the road for Mathematicians in various fields such as Algebraic Geometry, Algebraic Topology, etc., but also enable the creation of computer languages to assist us in constructing proofs and verifying computations formally.