**updated 01/08/2017** – added CG solver in reco, adjusted results

As I promised in last post, I’m going to share benchmark of different implementation of matrix factorization with Weighted Alternating Least Squares. User-Item interaction matrix is made from lastfm-360K dataset. Implementations incude:

- My reco R package
- Ben Frederickson implicit python module
- Apache Spark implementation
- Quora’s qmf solver

For the transparency I’ve created repository with all the code. If you will find any flaws plese write me a message.

### Benchmark set-up

- Hardware – Intel Xeon X3470, 4 physical cores, 8 threads (Nehalem architecture). All implementations used 8 threads since I found this adds some performance compared to 4 cores.
- OpenBLAS. All implementations use high-level parallelism, so in order to avoid thread contention I’ve limited BLAS threads to 1:
`export OPENBLAS_NUM_THREADS=1`

- In order to compare apples-to-apples all C-family imlpementations (
`qmf`

,`rect`

,`implicit`

) were compiled with following flags:`-O3 -ffast-math -march=native -msse4.2`

- Spark was compiled with native BLAS support as described in instruction
- I didn’t compare accuracy of implementations. I’m pretty sure about
`reco`

,`implicit`

and`qmf`

, but I’m quite sceptical about Spark. From my experience almost all algorithms from Spark’s MlLib are far from perfect (mildly speaking) - Each implementation run for 3 iterations

### Results

```
library(ggplot2)
library(plotly)
df = read.csv("https://raw.githubusercontent.com/dselivanov/bench-wals/master/results.csv")
g = ggplot(df) +
geom_line(aes(x = rank, y = time, col = solver)) +
geom_point(aes(x = rank, y = time, col = solver))
ggplotly(g, width = 9)
```

## Surprise

Biggest surprise is that Spark’s implementation was comparable to others! On `rank = 128`

it even outperforms `qmf`

. I don’t know exact reason, but may be `qmf`

doesn’t use native BLAS. Also it would be worth to check ranking accuracy of Spark’s results.

## Conclusions

- Ben’s solver based on Conjugate Gradient is the fastest. reco’s Conjugate Gradient only a little bit slower (20% on rank 128).
- reco is fastest among Cholesky solvers.
~~Would be interesting to implement CG solver as well.~~– done. - At small ranks Spark is several times slower than other packages. But with larger values of rank it cathes. Seems for small ranks overhead of calling native routines is substantial.