2011 > April > April 18, 2011

Peerasak Wangsom JCSSE2007 (PDF)

File information

Title: Full page photo print
Author: Hong

This PDF 1.6 document has been generated by PScript5.dll Version 5.2 / Acrobat Distiller 7.0 (Windows), and has been sent on pdf-archive.com on 18/04/2011 at 07:58, from IP address 202.6.x.x. The current document download page has been viewed 1000 times.
File size: 635.94 KB (6 pages).
Privacy: public file

File preview

National Joint Conference on Computer Science & Software Engineering (JCSSE2007)
May, 2-4 2007

A Performance Comparison of Matrix Multiplication Algorithms on Chip Multiprocessors

Email: wangsom@hotmail.com, wmrs@cs.tu.ac.th

1. #

Abstract

4 55 (Chip Multiprocessor)
4< = = > ? 4A>B C D4AE=
F?F5G4 =C>
B ? 4 FHI ? 4A
? JK = 4 55DGJE4K=K
I4 ? I4LCF
LC ? 4MHHIF 4 55 N
FF 2, 4 8 ?4A G?
CE4 ?HI ?4AH
C4< 16 ? L?NE=K [1]
4 55 4AFF
G? > F ? 4A? HV
H N V W 4AC
F 4 55 N 4
4 D4M [2]
4F4I 4W [3] ?K=
?HFFH (Measurement) LC LC
CIJ FFF4Y FF VC
HV JK =ACE=HJK=FFFLC L

The challenge of chip multiprocessors (CMPs) is
not only on the technology to put a complete
multiprocessor on a single chip, but also on the
techniques to optimize its performance. This paper
shows how a discrete-event simulation model with
animation could contribute to the study of
performance impacts on such a system.
We
compared the performance between two matrixmultiplication algorithms, Cannon’s and Fox’s, on a
16-core CMP and found that Cannon’s algorithm
show a better performance in terms of both speed and
processor utilization.

!! " " $
!!
! % & ' (
)"*% +,
-%'. !! !&%
)"
"/ !!'(
.
, *" )"
.01 . 2 + ,
!!%
. ( *% % 1

' !! +,
. 1.
3 . & !!& & &!!4 . ! 16 " 1
! . 1 &!!
& +,
.

& )
.
) " $

477

I 4 4< E? E = HN ?
>=HJCAG= HE?D BJ K
? 4A N C BNH
? K = AG= H >= KHI 4 K4 F 4I
4 E= B >VN H N N HV
4Y LC>=HJ ? J
VW4 > 4 55
FFFHJ CD J >
FF4<LCE E=
H NNVWJ > 4 55FF SIMD (Single Instruction Multiple
Data) C 16 ?4A? ?LCFF
>? FF (2D mesh torus) L
4 4A]M C
G 5FFG?> 5VC 4< KH> LC
F LC ? 4<4Y
> FF 4 55 H NN
V W4 F F4 G
5> 4 x 4 32 x 32 FFG?>
? V5VC 4<CF FF E=?
VFF (Cannon_s algorithm) [4]
FF]5 (Fox_s algorithm) [5] N N A
I4E=? VFFN ()
4AB? VFF]5K
5N > ? (>) LC C >>
5 V FF ?
(Slowdown) C G ? V FF]5 HV
=CH4A=?K= V
FF]5 HN F? VFF
N () K=4 ?-

4A (Processor Utilization) fCC=
95 5VC G ? VFF]5DV = 14
??E4> FN4F= >=C
2 VG 5FFG?> >=C
3 FFHJ SIMD CK=K H >=C 4
FF H >=C 5 I4 4A

2. %&
VG 5FFG?>
G4FF N CFF4AK =W
HJHH FFF ?4AFF [6] LFF SIMD 4
F [7] 5VC DV =?HJHH N HJK =E=
VC4 G ?B>=HJCE?
HJ V ? N 4 F F E=
N H NNHV LVW VFF
FF]5 5VC 4< VC
JE4K=KD4M FFG?>
CE4 4<CF?4< VCK=
LNC ?HJ? 4 (Memory
efficient) [4, 5] HV ?H F4FK=F
4 55FF SIMD C? ?
4A?N ?HJ>HJ
2.1 %%
VFF4A
I 5 ? > N 5 K
N >N N J K =
F 4 55 ?4A?
(P) HJ p ??FF p × p >?
FF G 5 A × B = C

478

5 A B > n × n V
FFHF? 5 A B 4<
5? Ai , j Bi , j (0 ≤ i, j < p ) 5VC
> (n / p ) × (n / p ) HJ p FB
HN 5? Ai , j Bi , j HDGBFE=K
?4A? Pi , j
>NC VC H 5?K Pi , j K ?
C I P KDC i I 5 ? Ai , j E4
5= HJ i N I P K C j
I 5 ? Bi , j >VN >= FHJ j N
HCH 5?= Pi , j HE=F 5
? Ai ,( j + i ) mod p B(i + j ) mod p , j HN Pi , j H

J AG?E4 VFF]5
>N 4A N
?F? K VFF
5 A B HDGF? 4< 5? Ai , j
Bi , j BFE=K ?4A? Pi , j
>N C VC V F F ] 5
H? l l 5 ? A > I Pi ,i
( P0,0 , P1,1 ,..., P p −1, p −1 ) E4K = P I ?CG?K
D HN Pi , j J G 5 ?
? 5? A CE=FF Bi , j =BF
?E=K 5? Ci , j
>N C V FF]5 H ?
l l 5 ? A K ? L 5
? A H Pi ,( j + i ) mod p > >N ? = N

G 5?N =BF?E=K 5
? Ci , j
>NC I P H I 5? A E4
5= 1 N I 5? B >VN>= F
1 N HN Pi , j HG 5? A B C
E= F = J A C E = E 4 F 5
? Ci , j J>NC 2 N p − 1 N B
HE=A> 5 C
4AFF> (Tp ) LC K=
VFF I44< N [4]
3

Tp =

2

n
n
+ 4ts p + 4tw
p
p

? E4K = P I ?CG?KD I
5? B >VN>= F 1 N HN Pi , j HG
5? A B CE=F=JAC
E=E4F 5? Ci , j J>NC 2
N p − 1 N BHE=A> 5 C
4AFF> (Tp ) LC K=
VFF]5 I44< N [5]
Tp =

(1)

n3
+ ts p + twn 2
p

(2)

4 > VN (J
HC 1 2) 4<CF? V
FFK=4AB?FF
] 5 LC 4 A F D 4M
FFG?>CE4 [4, 5] NHV
4< E4E= C H J 4I K = F 4
55

LC ts = C=J F? >=G
t w = CK=K? >=G> 1
2.2 %'
V FF]5 I 5
??F?ll (Broadcast) E4K =I
? 4AKD LC J E4

479

AG=K=HV HFDG= > 4A
> V E= ? K = 4Iv FI
LCE LGJFJ >N

3. %)#* SIMD /
FFn (HASE L Hierarchical computer
Architecture design and Simulation Environment)
[8] 4<LC L= FFHJ IE?
?LC (Discrete-Event Simulation Model) LC
VW4 > D4M
5VC KFFn FFHJ D4M
=K=G?? VC LC L
LC?K =AG=K=rK ?>VN N NAG= H
E=JFFHJ SIMD FF 16 ?4A
?FF>? FF (G4C 1) K=K

4.
1 1)
H NNHV J4HFFnC
? K=LC4FF4 >
V G 5 LC V C
F 4 55FF SIMD 16
? ?FF>? FF J?C
E= H FFHJ F F = FF
(Analytical Model) LCL?LCDL>
>=GCE=HFFHJ
4.1 212342&*%)#*
LC L ? >= G C E = H FFHJ N
LCDLE= AG= HJK4A
G 5 C > 2 C E = H FFHJ
4FFFCJH=FF
J ? J ?
(Slowdown) L4AC C>VN>
5I>LCFF> 4 x 4

G4C 1 FFHJ SIMD Fn

FFHJ SIMD Fn CK =
AG=K =4 F ? 1 I J C W5F K =
AG= K = r V C = V W>VN
4AFFFHJ E= K ?
HJ FFnH= E]FV>N
J > FFHJ (Trace file) E= FF
nD E] ? >VN 4<
LCE K =AG=K= B=G IC
>VN ? 4A ( G4C 1) N
1




(Slowdown)
225.0
200.0
175.0
150.0

Cannon SIM

125.0

Cannon Model

100.0

Fox SIM

75.0

Fox Model

50.0
25.0
0.0
4x4

8x8

16x16

32x32



G4C 2 ? > G 5C>

2

http://www.icsa.inf.ed.ac.uk/research/groups/hase/models/

480

4x4 8x8 16x16 32x32

> V >> 5)
??4A F
FFHJ ? Jl D

G4C 2 ? ? 4FF ?
VCE=HFFHJ H
= FF 5VC F? N V ?
? HFFHJ E? ? F ? C E = H
=FF LCF D (t-test)
HV I4E=?4A> N V
FFFHJ SIMD DG=
4ACE=HFFHJ DJE4K=
4<>=GLCVW H?E4E=
4.2
&*
LC4 F F4 > V
G 5F 4 55 AG= HK=
G4FF FF]4F=
4M H H 4M H H L ,

. 1 . 3 . & !! E= ?
VFF FF]5 4MHH
C L %
. 5VC F? 4< >
4 x 4 32 x 32 H > N
4MHH?4AFFFHJ
4.3 *%
5
4A V N F
FFHJ SIMD 16 ?4A3 F?
LC4AG 5> 4 x 4
32 x 32 VFFK=4A 102 21,250 F4 B?FF]5
G? 92 6,016 F JF N NA
D F? 4MHHN N (4

5. 75%

H K=>=GHFFHJ
F? V FFK=
4AB ? V FF]5
>=I4> [4, 5] CI4HK==FF
N N H K= LC E F
FFHJ F? K>NC > N
V5VC 4<>NC= J5NJ V
FF]5K = 4A?LC H
= ?ll 5? A K =I
?4A?CG?D K>C
VFFK= I 5?
A N

!"# $$%
100
80
Speedup SIM

40

Speedup Model

%

60

20

24

x1

02

4

12

56

x5
10



51
2

28

x2

x1

25
6

x6
4

12
8

x3
2

64

32

8

x1
6
16

8x

4x

4

0

G4C 3 4ACB?>

HG4C 3 DV 4ACB
? L? (Speedup) > VFF
LC F F FF]5 H B E= ?
VFF? LC
5> C>VN H ?> N
4MHHC >VNJK =I4E=? >>

3

K=nC 3 J FFF4x F Fedora Core 4.0
FLC Intel Pentium 4 B 3 GHz ?HJ 512 MB
4
DV Fllz > FFHJ (simulation cycle)

481

5C C>VN ?4A
> V FF? V
FF]5 ?? EB
=FF> N V F 5>K l?
? 1,024 x 1,024 HF? V
4AB? VFF]5
LCK=LCE FFFHJ LF
E]F VJ LC H C ?
4ADG K= 5VC DV K=
4 ?4A =F? V
FFN K=4 ?
4AfCC= 95 5VC G ?FF]5
DV = 14 HV I4E= ? V G 5FF F 4 55FF SIMD C ?4A?
16 ??LCFF>? ?FF
? VFF]5 LC4AF
5>E? 1,024 x 1,024
H NNNK = B? VW>N
J FFHJ CLCE ?K =
VWAFCF4 K
4A> VFD4Mf
? JE= ?H>VN 5VC H4<4
? F W AG= A 4 55 AG= =
4W AG= r 4 KL
VC = 4<IJC J F
G 5 FFG? > FD4M
? A
K=4 ?4AHJE4K=LC

4 F ?4AK
E=

6.
9
>>FI Professor Roland Ibbett J F
FFnFF SIMD 5VC AG= HE=Jr?

7. ;**
[1] F.J. Villa, M.E. Acacio and J.M. Garcia, Memory Subsystem
Characterization in a 16-core Snoop-Based ChipMultiprocessor Architecture, In Proceedings of The 2005 on
HPCC, Springer, September 2005, pp. 213-222.
[2] S. Williams, J. Shalf, L. Oliker, S. Kamil, P. Husbands and
K. Yelick, The potential of the cell processor for scientific
computing, In Proceedings of the 3rd conference on
Computing frontiers, 2006, pp. 9-20.
[3] F. Franchetti, Y. Voronenko and M. Puschel, FFT program
generation for shared memory: SMP and multicore, In
Proceedings of the 2006 ACM/IEEE conference on
Supercomputing, 2006.
[4] V. Kumar, A. Grama, A. Gupta and G. Karypis, Introduction
to Parallel Computing design and analysis of algorithms, The
Benjamin/Cummings, 1994.
[5] A. Gupta and V. Kumar, Scalability of Parallel Algorithms
for Matrix Multiplication, In Proceedings of the 1993
International Conference on Parallel Processing, 1993.
[6] M. Krishnan and J. Nieplocha, SRUMMA: a matrix
multiplication algorithm suitable for clusters and scalable
shared memory systems, In Proceedings of IPDPS 2004. pp.
70-80.
[7] P. Bjorstad, F. Manne, T. Sorevik and M. Vajteric, Efficient
matrix multiplication on SIMD computers, SIAM Journal on
Matrix Analysis and Applications, 1992, pp. 386-401.
[8] P.S. Cole, F.W. Howell, R.N. Ibbett and L.M. Williams, A
Hierarchical Computer Architecture Design and Simulation
Environment, ACM Transaction on Modeling and Computer
Simulation, 1998, pp.431-446.

482

Download Peerasak Wangsom JCSSE2007

Peerasak_Wangsom_JCSSE2007.pdf (PDF, 635.94 KB)

Download PDF

Share this file on social networks

Link to this page

Permanent link

Use the permanent link to the download page to share your document on Facebook, Twitter, LinkedIn, or directly with a contact by e-Mail, Messenger, Whatsapp, Line..

Short link

Use the short link to share your document on Twitter or by text message (SMS)

HTML Code

Copy the following HTML code to share your document on a Website or Blog

QR Code to this page

QR Code link to PDF file Peerasak_Wangsom_JCSSE2007.pdf

This file has been shared publicly by a user of PDF Archive.
Document ID: 0000030069.
Report illicit content