This PDF 1.6 document has been generated by PScript5.dll Version 5.2 / Acrobat Distiller 7.0 (Windows), and has been sent on pdf-archive.com on 18/04/2011 at 07:58, from IP address 202.6.x.x.
The current document download page has been viewed 1000 times.
File size: 635.94 KB (6 pages).
Privacy: public file
National Joint Conference on Computer Science & Software Engineering (JCSSE2007)
May, 2-4 2007
A Performance Comparison of Matrix Multiplication Algorithms on Chip Multiprocessors
Email: wangsom@hotmail.com, wmrs@cs.tu.ac.th
1. #
Abstract
4 55 (Chip Multiprocessor)
4< = = > ? 4A>B C D4AE=
F?F5G4 =C>
B ? 4 FHI ? 4A
? JK = 4 55DGJE4K=K
I4 ? I4LCF
LC ? 4MHHIF 4 55 N
FF 2, 4 8 ?4A G?
CE4 ?HI ?4AH
C4< 16 ? L?NE=K [1]
4 55 4AFF
G? > F ? 4A? HV
H N V W 4AC
F 4 55 N 4
4 D4M [2]
4F4I 4W [3] ?K=
?HFFH (Measurement) LC LC
CIJ FFF4Y FF VC
HV JK =ACE=HJK=FFFLC L
The challenge of chip multiprocessors (CMPs) is
not only on the technology to put a complete
multiprocessor on a single chip, but also on the
techniques to optimize its performance. This paper
shows how a discrete-event simulation model with
animation could contribute to the study of
performance impacts on such a system.
We
compared the performance between two matrixmultiplication algorithms, Cannon’s and Fox’s, on a
16-core CMP and found that Cannon’s algorithm
show a better performance in terms of both speed and
processor utilization.
!! " " $
!!
! % & ' (
)"*% +,
-%'. !! !&%
)"
"/ !!'(
.
, *" )"
.01 . 2 + ,
!!%
. ( *% % 1
' !! +,
. 1.
3 . & !!& & &!!4 . ! 16 " 1
! . 1 &!!
& +,
.
& )
.
) " $
477
I 4 4< E? E = HN ?
>=HJCAG= HE?D BJ K
? 4A N C BNH
? K = AG= H >= KHI 4 K4 F 4I
4 E= B >VN H N N HV
4Y LC>=HJ ? J
VW4 > 4 55
FFFHJ CD J >
FF4<LCE E=
H NNVWJ > 4 55FF SIMD (Single Instruction Multiple
Data) C 16 ?4A? ?LCFF
>? FF (2D mesh torus) L
4 4A]M C
G 5FFG?> 5VC 4< KH> LC
F LC ? 4<4Y
> FF 4 55 H NN
V W4 F F4 G
5> 4 x 4 32 x 32 FFG?>
? V5VC 4<CF FF E=?
VFF (Cannon_s algorithm) [4]
FF]5 (Fox_s algorithm) [5] N N A
I4E=? VFFN ()
4AB? VFF]5K
5N > ? (>) LC C >>
5 V FF ?
(Slowdown) C G ? V FF]5 HV
=CH4A=?K= V
FF]5 HN F? VFF
N () K=4 ?-
4A (Processor Utilization) fCC=
95 5VC G ? VFF]5DV = 14
??E4> FN4F= >=C
2 VG 5FFG?> >=C
3 FFHJ SIMD CK=K H >=C 4
FF H >=C 5 I4 4A
2. %&
VG 5FFG?>
G4FF N CFF4AK =W
HJHH FFF ?4AFF [6] LFF SIMD 4
F [7] 5VC DV =?HJHH N HJK =E=
VC4 G ?B>=HJCE?
HJ V ? N 4 F F E=
N H NNHV LVW VFF
FF]5 5VC 4< VC
JE4K=KD4M FFG?>
CE4 4<CF?4< VCK=
LNC ?HJ? 4 (Memory
efficient) [4, 5] HV ?H F4FK=F
4 55FF SIMD C? ?
4A?N ?HJ>HJ
2.1 %%
VFF4A
I 5 ? > N 5 K
N >N N J K =
F 4 55 ?4A?
(P) HJ p ??FF p × p >?
FF G 5 A × B = C
478
5 A B > n × n V
FFHF? 5 A B 4<
5? Ai , j Bi , j (0 ≤ i, j < p ) 5VC
> (n / p ) × (n / p ) HJ p FB
HN 5? Ai , j Bi , j HDGBFE=K
?4A? Pi , j
>NC VC H 5?K Pi , j K ?
C I P KDC i I 5 ? Ai , j E4
5= HJ i N I P K C j
I 5 ? Bi , j >VN >= FHJ j N
HCH 5?= Pi , j HE=F 5
? Ai ,( j + i ) mod p B(i + j ) mod p , j HN Pi , j H
J AG?E4 VFF]5
>N 4A N
?F? K VFF
5 A B HDGF? 4< 5? Ai , j
Bi , j BFE=K ?4A? Pi , j
>N C VC V F F ] 5
H? l l 5 ? A > I Pi ,i
( P0,0 , P1,1 ,..., P p −1, p −1 ) E4K = P I ?CG?K
D HN Pi , j J G 5 ?
? 5? A CE=FF Bi , j =BF
?E=K 5? Ci , j
>N C V FF]5 H ?
l l 5 ? A K ? L 5
? A H Pi ,( j + i ) mod p > >N ? = N
G 5?N =BF?E=K 5
? Ci , j
>NC I P H I 5? A E4
5= 1 N I 5? B >VN>= F
1 N HN Pi , j HG 5? A B C
E= F = J A C E = E 4 F 5
? Ci , j J>NC 2 N p − 1 N B
HE=A> 5 C
4AFF> (Tp ) LC K=
VFF I44< N [4]
3
Tp =
2
n
n
+ 4ts p + 4tw
p
p
? E4K = P I ?CG?KD I
5? B >VN>= F 1 N HN Pi , j HG
5? A B CE=F=JAC
E=E4F 5? Ci , j J>NC 2
N p − 1 N BHE=A> 5 C
4AFF> (Tp ) LC K=
VFF]5 I44< N [5]
Tp =
(1)
n3
+ ts p + twn 2
p
(2)
4 > VN (J
HC 1 2) 4<CF? V
FFK=4AB?FF
] 5 LC 4 A F D 4M
FFG?>CE4 [4, 5] NHV
4< E4E= C H J 4I K = F 4
55
LC ts = C=J F? >=G
t w = CK=K? >=G> 1
2.2 %'
V FF]5 I 5
??F?ll (Broadcast) E4K =I
? 4AKD LC J E4
479
AG=K=HV HFDG= > 4A
> V E= ? K = 4Iv FI
LCE LGJFJ >N
3. %)#* SIMD /
FFn (HASE L Hierarchical computer
Architecture design and Simulation Environment)
[8] 4<LC L= FFHJ IE?
?LC (Discrete-Event Simulation Model) LC
VW4 > D4M
5VC KFFn FFHJ D4M
=K=G?? VC LC L
LC?K =AG=K=rK ?>VN N NAG= H
E=JFFHJ SIMD FF 16 ?4A
?FF>? FF (G4C 1) K=K
4.
1 1)
H NNHV J4HFFnC
? K=LC4FF4 >
V G 5 LC V C
F 4 55FF SIMD 16
? ?FF>? FF J?C
E= H FFHJ F F = FF
(Analytical Model) LCL?LCDL>
>=GCE=HFFHJ
4.1 212342&*%)#*
LC L ? >= G C E = H FFHJ N
LCDLE= AG= HJK4A
G 5 C > 2 C E = H FFHJ
4FFFCJH=FF
J ? J ?
(Slowdown) L4AC C>VN>
5I>LCFF> 4 x 4
G4C 1 FFHJ SIMD Fn
FFHJ SIMD Fn CK =
AG=K =4 F ? 1 I J C W5F K =
AG= K = r V C = V W>VN
4AFFFHJ E= K ?
HJ FFnH= E]FV>N
J > FFHJ (Trace file) E= FF
nD E] ? >VN 4<
LCE K =AG=K= B=G IC
>VN ? 4A ( G4C 1) N
1
(Slowdown)
225.0
200.0
175.0
150.0
Cannon SIM
125.0
Cannon Model
100.0
Fox SIM
75.0
Fox Model
50.0
25.0
0.0
4x4
8x8
16x16
32x32
G4C 2 ? > G 5C>
2
http://www.icsa.inf.ed.ac.uk/research/groups/hase/models/
480
4x4 8x8 16x16 32x32
> V >> 5)
??4A F
FFHJ ? Jl D
G4C 2 ? ? 4FF ?
VCE=HFFHJ H
= FF 5VC F? N V ?
? HFFHJ E? ? F ? C E = H
=FF LCF D (t-test)
HV I4E=?4A> N V
FFFHJ SIMD DG=
4ACE=HFFHJ DJE4K=
4<>=GLCVW H?E4E=
4.2
&*
LC4 F F4 > V
G 5F 4 55 AG= HK=
G4FF FF]4F=
4M H H 4M H H L ,
. 1 . 3 . & !! E= ?
VFF FF]5 4MHH
C L %
. 5VC F? 4< >
4 x 4 32 x 32 H > N
4MHH?4AFFFHJ
4.3 *%
5
4A V N F
FFHJ SIMD 16 ?4A3 F?
LC4AG 5> 4 x 4
32 x 32 VFFK=4A 102 21,250 F4 B?FF]5
G? 92 6,016 F JF N NA
D F? 4MHHN N (4
5. 75%
H K=>=GHFFHJ
F? V FFK=
4AB ? V FF]5
>=I4> [4, 5] CI4HK==FF
N N H K= LC E F
FFHJ F? K>NC > N
V5VC 4<>NC= J5NJ V
FF]5K = 4A?LC H
= ?ll 5? A K =I
?4A?CG?D K>C
VFFK= I 5?
A N
!"# $$%
100
80
Speedup SIM
40
Speedup Model
%
60
20
24
x1
02
4
12
56
x5
10
51
2
28
x2
x1
25
6
x6
4
12
8
x3
2
64
32
8
x1
6
16
8x
4x
4
0
G4C 3 4ACB?>
HG4C 3 DV 4ACB
? L? (Speedup) > VFF
LC F F FF]5 H B E= ?
VFF? LC
5> C>VN H ?> N
4MHHC >VNJK =I4E=? >>
3
K=nC 3 J FFF4x F Fedora Core 4.0
FLC Intel Pentium 4 B 3 GHz ?HJ 512 MB
4
DV Fllz > FFHJ (simulation cycle)
481
5C C>VN ?4A
> V FF? V
FF]5 ?? EB
=FF> N V F 5>K l?
? 1,024 x 1,024 HF? V
4AB? VFF]5
LCK=LCE FFFHJ LF
E]F VJ LC H C ?
4ADG K= 5VC DV K=
4 ?4A =F? V
FFN K=4 ?
4AfCC= 95 5VC G ?FF]5
DV = 14 HV I4E= ? V G 5FF F 4 55FF SIMD C ?4A?
16 ??LCFF>? ?FF
? VFF]5 LC4AF
5>E? 1,024 x 1,024
H NNNK = B? VW>N
J FFHJ CLCE ?K =
VWAFCF4 K
4A> VFD4Mf
? JE= ?H>VN 5VC H4<4
? F W AG= A 4 55 AG= =
4W AG= r 4 KL
VC = 4<IJC J F
G 5 FFG? > FD4M
? A
K=4 ?4AHJE4K=LC
4 F ?4AK
E=
6.
9
>>FI Professor Roland Ibbett J F
FFnFF SIMD 5VC AG= HE=Jr?
7. ;**
[1] F.J. Villa, M.E. Acacio and J.M. Garcia, Memory Subsystem
Characterization in a 16-core Snoop-Based ChipMultiprocessor Architecture, In Proceedings of The 2005 on
HPCC, Springer, September 2005, pp. 213-222.
[2] S. Williams, J. Shalf, L. Oliker, S. Kamil, P. Husbands and
K. Yelick, The potential of the cell processor for scientific
computing, In Proceedings of the 3rd conference on
Computing frontiers, 2006, pp. 9-20.
[3] F. Franchetti, Y. Voronenko and M. Puschel, FFT program
generation for shared memory: SMP and multicore, In
Proceedings of the 2006 ACM/IEEE conference on
Supercomputing, 2006.
[4] V. Kumar, A. Grama, A. Gupta and G. Karypis, Introduction
to Parallel Computing design and analysis of algorithms, The
Benjamin/Cummings, 1994.
[5] A. Gupta and V. Kumar, Scalability of Parallel Algorithms
for Matrix Multiplication, In Proceedings of the 1993
International Conference on Parallel Processing, 1993.
[6] M. Krishnan and J. Nieplocha, SRUMMA: a matrix
multiplication algorithm suitable for clusters and scalable
shared memory systems, In Proceedings of IPDPS 2004. pp.
70-80.
[7] P. Bjorstad, F. Manne, T. Sorevik and M. Vajteric, Efficient
matrix multiplication on SIMD computers, SIAM Journal on
Matrix Analysis and Applications, 1992, pp. 386-401.
[8] P.S. Cole, F.W. Howell, R.N. Ibbett and L.M. Williams, A
Hierarchical Computer Architecture Design and Simulation
Environment, ACM Transaction on Modeling and Computer
Simulation, 1998, pp.431-446.
482
Peerasak_Wangsom_JCSSE2007.pdf (PDF, 635.94 KB)
Use the permanent link to the download page to share your document on Facebook, Twitter, LinkedIn, or directly with a contact by e-Mail, Messenger, Whatsapp, Line..
Use the short link to share your document on Twitter or by text message (SMS)
Copy the following HTML code to share your document on a Website or Blog