cs267 hw0 cyclades (1) .pdf
Original filename: cs267-hw0-cyclades (1).pdf
This PDF 1.5 document has been generated by TeX / pdfTeX-1.40.17, and has been sent on pdf-archive.com on 25/01/2017 at 07:56, from IP address 108.205.x.x.
The current document download page has been viewed 243 times.
File size: 84 KB (2 pages).
Privacy: public file
Download original PDF file
CS267 HW0: Cyclades
January 25, 2017
My name is David Winer, and I am a Masters student in EECS. I studied Applied Math
at Brown for undergrad. In this course, I am hoping to learn more about applications of
parallel computing in industry and research, with specific focus in modern data science and
machine learning. I am currently working with Lee Fleming (IEOR) on applications of
natural language processing (NLP) techniques to US patent data.
Cyclades is a general framework for parallelizing a family of machine learning algorithms:
stochastic update (SU) algorithms . It builds off of the work of Hogwild!, an update
scheme that allows for parallel stochastic gradient descent (SGD) updates without the use of
locks . It is more general than Hogwild!, however, since it can be applied to any arbitrary
SU algorithm, not just stochastic gradient descent (e.g., stochastic variance reduced gradient,
SAGA). The trick to the Cyclades approach is allocating related updates u1 , . . . , un to cores
in a way that minimizes conflicts. Given certain sparsity constraints, Cyclades can achieve
nearly linear speedup (while guaranteeing similar convergence behavior as you would expect
from a serial algorithm).
Target platform and performance
The initial Cyclades implementation was developed in C++ and tested on a 72-CPU
machine spread across 4 NUMA nodes. Each node had 18 CPUs and 1 TB of memory. It was
tested on four machine learning tasks (each of which employs an SU algorithm): least squares,
computing the top eigenvector of AT A for an adjacency matrix A, matrix completion, and
semantic word embedding (Word2Vec). Each task was tested with a maximum of 18 threads
(i.e., within one NUMA node). Across all tasks, Cyclades achieved significant speedup
over a serial approach: between 3x (graph eigenvectors) and 12x (word embedding) on 18
threads. Even more importantly, it outperformed Hogwild! solutions to each of these
problems, especially with a larger number of threads (> 2 − 4).
The main limitation associated with this approach is the sparsity constraints required for
Cyclades to achieve meaningful speedup. For example, for a dense machine learning
problem where each update requires accessing every component of the model variable x,
Cyclades would not produce better performance than a serial algorithm.
 X. Pan, M. Lam, S. Tu, D. Papailiopoulos, C. Zhang, M. Jordan, K. Ramchandran, C.
Re, and B. Recht. Cyclades: Conflict-free asynchronous machine learning. NIPS, 2016.
 F. Niu, B. Recht, C. Re, and S. J. Wright. Hogwild! A lock-free approach to parallelizing stochastic gradient descent. NIPS, 2011.