Peerasak Wangsom JCSSE2007 (PDF)




File information


Title: Full page photo print
Author: Hong

This PDF 1.6 document has been generated by PScript5.dll Version 5.2 / Acrobat Distiller 7.0 (Windows), and has been sent on pdf-archive.com on 18/04/2011 at 07:58, from IP address 202.6.x.x. The current document download page has been viewed 1000 times.
File size: 635.94 KB (6 pages).
Privacy: public file
















File preview


National Joint Conference on Computer Science & Software Engineering (JCSSE2007)
May, 2-4 2007

 
      
A Performance Comparison of Matrix Multiplication Algorithms on Chip Multiprocessors


       

       
Email: wangsom@hotmail.com, wmrs@cs.tu.ac.th

1.  #

Abstract

 4 55 (Chip Multiprocessor)
4<   =  =    >  ? 4 A>B  C  D4 AE=
F?F5G4  =  C>
B  ?   4 FHI ? 4 A 
?  JK = 4 55DGJE4K=K
I4 ? I4LC F 
LC ? 4MHHIF 4 55 N
FF 2, 4  8 ? 4 A  G?
C E4  ? HI ? 4 AH
 C4< 16 ?  L ?NE=K [1]
 4     55   4 AFF
G? > F  ? 4 A?   HV 
 H  N  V  W  4 AC
 F 4 55 N 4 
4     D4M   [2] 
4F4I  4W [3]  ? K=  
?HFFH (Measurement) LC  LC
 CIJ FFF4Y FF VC
HV JK =ACE=HJK=FFFLC L

The challenge of chip multiprocessors (CMPs) is
not only on the technology to put a complete
multiprocessor on a single chip, but also on the
techniques to optimize its performance. This paper
shows how a discrete-event simulation model with
animation could contribute to the study of
performance impacts on such a system.
We
compared the performance between two matrixmultiplication algorithms, Cannon’s and Fox’s, on a
16-core CMP and found that Cannon’s algorithm
show a better performance in terms of both speed and
processor utilization.

 
  
         
   
 !!     " "    $
 !!
! % &  ' (
)"*%   + ,
-%'. !! !&%
)"
"/ !!'(
.
, *" )"
.01 . 2    + , 
 !!%
.  (  *% % 1 

 '  !!   + ,
.  1.
 3  .  & !!&  & &!!4 . !       16 "   1
!    .  1  &!!
&    + ,
.

& )
.
)    "  $

477

I 4 4<   E? E = HN     ?  
>=HJCAG= HE?D BJ K
 ? 4 A N C BNH
? K = AG= H  >=  KHI 4 K4 F 4I
4    E=  B >VN   H   N  N HV 
4Y LC>=HJ ? J
VW4  >  4 55
FFFHJ CD J >
FF4<LCE E=
 H NNVWJ >  4 55FF SIMD (Single Instruction Multiple
Data) C 16 ? 4 A? ?LCFF
>?   FF (2D mesh torus) L
 4  4 A]M C
G 5FFG?> 5VC 4<  KH> LC
 F LC ? 4<4Y 
> FF 4 55  H NN
V  W4  F  F4    G 
 5> 4 x 4  32 x 32 FFG?>
 ?  V5VC 4<CF FF E=?
 VFF (Cannon_s algorithm) [4]
FF]5 (Fox_s algorithm) [5] N N A
 I4E= ?  VFFN ()
4 AB  ? VFF]5K
  5N  > ? (>) LC  C  >>
  5    V  FF    ?
(Slowdown) C G  ?     V  FF]5 HV 
 =CH4 A= ?K= V
FF]5 HN F ? VFF
N () K=4 ? -

4 A (Processor Utilization) fCC=
95 5VC G  ? VFF]5DV = 14
? ?E4> F N4F=   >=C
2  VG 5FFG?>  >=C
3 FFHJ SIMD CK=K  H  >=C 4
FF  H   >=C 5 I4 4A

2.    %&
 VG 5FFG?> 
G4FF N CFF4 AK =W
HJHH FFF   ? 4 AFF [6] LFF SIMD 4
F  [7] 5VC DV = ?HJHH N HJK =E=
 VC4  G ?B>=HJCE?
HJ     V   ?  N 4  F  F  E=
 N  H NNHV LVW VFF
 FF]5 5VC 4< VC
JE4K=KD4M FFG?>
C E4 4<CF ?4< VCK=
LNC ?  HJ? 4   (Memory
efficient) [4, 5] HV ?H F4FK=F
 4 55FF SIMD C? ? 
4 A?N ?  HJ>HJ
2.1  %%
 VFF4 A 
 I    5 ?  > N    5 K 
  N >N N J K =
F 4 55 ? 4 A?
(P) HJ  p ? ?FF p × p >?
    FF   G    5 A × B = C

478

 5 A  B > n × n  V
FFHF?   5 A  B 4<
 5? Ai , j  Bi , j (0 ≤ i, j < p ) 5VC 
> (n / p ) × (n / p ) HJ  p FB
HN 5? Ai , j  Bi , j HDGBFE =K
? 4 A? Pi , j
>NC VC H 5?K Pi , j K ?
C I  P KD C i I    5 ?   Ai , j E4
 5=  HJ    i N I  P K   C j
I    5 ?   Bi , j >VN  >=  FHJ    j N
 HCH 5?= Pi , j HE=F 5
? Ai ,( j + i ) mod p  B(i + j ) mod p , j HN Pi , j H

J  AG?E4  VFF]5
    >N  4 A N
? F ? K VFF
 5 A  B HDGF? 4< 5? Ai , j
 Bi , j BFE =K ? 4 A? Pi , j
>N     C VC      V   F F ]   5
H?  l l  5 ?   A > I  Pi ,i
( P0,0 , P1,1 ,..., P p −1, p −1 ) E4K = P I ? CG?K
D    HN  Pi , j J G    5 ?  
 ?  5? A CE=FF Bi , j = BF
?E =K 5? Ci , j
>N  C      V  FF]5 H ?
 l l  5 ?   A K ? L    5
?   A H Pi ,( j + i ) mod p > >N  ?   =  N

G 5?N  = BF?E =K 5
? Ci , j
>NC I P H I 5? A E4
 5= 1 N  I 5? B >VN>= F
1 N HN Pi , j HG 5? A  B C
E=  F = J  A   C E = E 4  F   5
? Ci , j J>NC 2 N  p − 1 N B
HE=A>  5 C
 4 AFF> (Tp ) LC  K=
 VFF I44< N [4]
3

Tp =

2

n
n
+ 4ts p + 4tw
p
p

? E4K = P I ? CG?KD    I
 5? B >VN>= F 1 N HN Pi , j HG
 5? A  B CE=F= JAC
E=E4 F 5? Ci , j J>NC 2
N  p − 1 N BHE=A>  5 C
 4 AFF> (Tp ) LC  K=
 VFF]5 I44< N [5]
Tp =

(1)

n3
+ ts p + twn 2
p

(2)

4  >  VN  (J 
HC 1  2) 4<CF ? V
FFK= 4 AB  ?FF
]   5  LC  4    A  F   D  4M      
 FFG?>C E4 [4, 5]  NHV
  4<  E4E= C H J  4I   K = F  4
 55

LC ts =   C=J F? >=G
t w =  CK=K? >=G> 1  
2.2  %'
 V FF]5   I    5
?? F?ll (Broadcast) E4K =I
? 4 AKD     LC  J  E4

479

AG=K=HV  HF DG= > 4 A
>    V  E=   ? K = 4Iv   FI 
LCE LGJFJ >N

3. %)#* SIMD / 
FFn (HASE L Hierarchical computer
Architecture design and Simulation Environment)
[8] 4<LC L= FFHJ   IE?
?LC (Discrete-Event Simulation Model) LC
VW4  > D4M 
5VC KFFn   FFHJ   D4M  
 =K=G??  VC LC L
LC? K =AG=K=rK ?>VN N NAG= H
E=JFFHJ SIMD FF 16 ? 4 A
?FF>?   FF (G4C 1) K=K


4. 
1 1 )
 H NNHV J4HFFnC
?  K=LC4FF4  >
   V  G    5 LC     V  C
 F 4 55FF SIMD 16
?  ?FF>?   FF J?C
E= H FFHJ     F F =  FF
(Analytical Model) LCL ?LCDL>
>=GCE=HFFHJ
4.1 212342&*%)#*
LC  L    ?  >=  G  C E = H FFHJ   N 
LCDLE= AG= HJ K4 A
G    5 C >  2 C E = H FFHJ   
4FFF CJ H=FF
J    ?  J       ?
(Slowdown) L 4 AC C>VN>
 5I>LCFF> 4 x 4

G4C 1 FFHJ SIMD Fn

FFHJ SIMD Fn  CK =
AG=K =4 F ? 1 I  J  C W5F K =
AG= K =  r    V  C =  V  W>VN  
4 AFFFHJ   E= K ?  
HJ FFnH= E]FV>N
J > FFHJ (Trace file) E = FF
nD E] ? >VN 4<
LCE K =AG=K= B=G IC
 >VN ? 4 A ( G4C 1)  N
1

   
     


(Slowdown)
225.0
200.0
175.0
150.0

Cannon SIM

125.0

Cannon Model

100.0

Fox SIM

75.0

Fox Model

50.0
25.0
0.0
4x4

8x8

16x16

32x32

 
 

G4C 2  ? >  G 5C>

2

http://www.icsa.inf.ed.ac.uk/research/groups/hase/models/

480

4x4 8x8 16x16  32x32

>  V >>  5)   
  ? ? 4 A F
FFHJ ? Jl D 

G4C 2  ? ? 4FF ?
  VCE=HFFHJ H
=  FF 5VC F ?  N     V   ?    
? HFFHJ   E?  ?   F ?  C E = H 
=FF LCF  D  (t-test)
HV I4E= ?4 A> N   V
FFFHJ SIMD  DG=  
4 ACE=HFFHJ DJE4K=
4<>=GLCVW H?E4E=
4.2  
&* 
LC4 F  F4    >   V 
G 5F 4 55 AG= HK=
G4FF FF]4F= 
 4M H H  4M H H  L   ,

  .  1  . 3  .  & !!   E=  ?
 VFF FF]5 4MHH
C L %
 . 5VC F? 4< >
4 x 4  32 x 32  H  > N 
4MHH? 4 AFFFHJ
4.3  *%
5 
 4 A  V  N  F
FFHJ SIMD 16 ? 4 A3 F ?
LC4 AG 5> 4 x 4 
32 x 32  VFFK= 4 A 102  21,250 F4 B  ?FF]5
G? 92  6,016 F JF N NA
  D  F ? 4MHHN  N (4

5. 75%
 
H  K=>=GHFFHJ
F ?     V  FFK=  
4 AB  ?     V  FF]5 
>=I4> [4, 5] CI4HK==FF
N N H  K=  LC  E F
FFHJ  F ? K>NC > N 
 V5VC 4<>NC= J5NJ  V
FF]5K = 4 A ?LC H
= ?ll 5? A K =I
? 4 A?CG?D   K>C
 VFFK= I 5?
A  N 
     
!"# $$%
100
80
Speedup SIM

40

Speedup Model

%

60

20

24

x1

02

4

12

56

x5
10

 


51
2

28

x2

x1

25
6

x6
4

12
8

x3
2

64

32

8

x1
6
16

8x

4x

4

0

G4C 3 4 ACB  ?> 

HG4C 3  DV 4 ACB
 ? L? (Speedup) >  VFF
LC    F F FF]5 H B  E= ? 
 VFF?  LC
 5> C>VN H  ? > N
 4MHHC  >VNJK =I4E= ? >>

3

K=n C 3 J FFF4x F  Fedora Core 4.0
FLC Intel Pentium 4  B 3 GHz ?  HJ 512 MB
4
DV Fllz > FFHJ (simulation cycle)

481

 5C C>VN  ? 4 A
>  V FF ? V
FF]5 ?? EB   
=FF> N  V F 5>K l?
?  1,024 x 1,024 HF ?     V 
  4 AB  ? VFF]5
LCK=LCE FFFHJ LF
E]F VJ   LC  H  C ? 
4 ADG  K=  5VC DV   K=
4 ? 4 A = F ?  V
FFN K=4 ? 
4 AfCC= 95 5VC G  ?FF]5
DV = 14 HV I4E= ?  V G 5FF  F 4 55FF SIMD C ? 4 A?
16 ? ?LCFF>?   ?FF
 ? VFF]5 LC4 AF
 5>E?  1,024 x 1,024
 H NNNK = B ? VW>N
J FFHJ CLCE ? K =
VWAFC F4  K
4 A>  VFD4Mf
? JE= ?H>VN 5VC H4<4
?  F W  AG= A    4     55 AG=  =  
4W AG=  r 4 KL  
 VC = 4<IJC J F
G    5  FFG? >  FD4M  
  ? A  
K=4 ? 4 AHJE4K=LC 

  4  F ? 4 AK
E=

6.   
9
>>FI Professor Roland Ibbett J F
FFnFF SIMD 5VC AG= HE=Jr?

7.  ;* *
[1] F.J. Villa, M.E. Acacio and J.M. Garcia, Memory Subsystem
Characterization in a 16-core Snoop-Based ChipMultiprocessor Architecture,€ In Proceedings of The 2005 on
HPCC, Springer, September 2005, pp. 213-222.
[2] S. Williams, J. Shalf, L. Oliker, S. Kamil, P. Husbands and
K. Yelick, The potential of the cell processor for scientific
computing,€ In Proceedings of the 3rd conference on
Computing frontiers, 2006, pp. 9-20.
[3] F. Franchetti, Y. Voronenko and M. Puschel, FFT program
generation for shared memory: SMP and multicore,€ In
Proceedings of the 2006 ACM/IEEE conference on
Supercomputing, 2006.
[4] V. Kumar, A. Grama, A. Gupta and G. Karypis, Introduction
to Parallel Computing design and analysis of algorithms, The
Benjamin/Cummings, 1994.
[5] A. Gupta and V. Kumar, Scalability of Parallel Algorithms
for Matrix Multiplication,€ In Proceedings of the 1993
International Conference on Parallel Processing, 1993.
[6] M. Krishnan and J. Nieplocha, SRUMMA: a matrix
multiplication algorithm suitable for clusters and scalable
shared memory systems,€ In Proceedings of IPDPS 2004. pp.
70-80.
[7] P. Bjorstad, F. Manne, T. Sorevik and M. Vajteric, Efficient
matrix multiplication on SIMD computers,€ SIAM Journal on
Matrix Analysis and Applications, 1992, pp. 386-401.
[8] P.S. Cole, F.W. Howell, R.N. Ibbett and L.M. Williams, A
Hierarchical Computer Architecture Design and Simulation
Environment,€ ACM Transaction on Modeling and Computer
Simulation, 1998, pp.431-446.

482






Download Peerasak Wangsom JCSSE2007



Peerasak_Wangsom_JCSSE2007.pdf (PDF, 635.94 KB)


Download PDF







Share this file on social networks



     





Link to this page



Permanent link

Use the permanent link to the download page to share your document on Facebook, Twitter, LinkedIn, or directly with a contact by e-Mail, Messenger, Whatsapp, Line..




Short link

Use the short link to share your document on Twitter or by text message (SMS)




HTML Code

Copy the following HTML code to share your document on a Website or Blog




QR Code to this page


QR Code link to PDF file Peerasak_Wangsom_JCSSE2007.pdf






This file has been shared publicly by a user of PDF Archive.
Document ID: 0000030069.
Report illicit content