Peerasak Wangsom JCSSE2007 .pdf

File information


Original filename: Peerasak_Wangsom_JCSSE2007.pdf
Title: Full page photo print
Author: Hong

This PDF 1.6 document has been generated by PScript5.dll Version 5.2 / Acrobat Distiller 7.0 (Windows), and has been sent on pdf-archive.com on 18/04/2011 at 05:58, from IP address 202.6.x.x. The current document download page has been viewed 952 times.
File size: 621 KB (6 pages).
Privacy: public file


Download original PDF file


Peerasak_Wangsom_JCSSE2007.pdf (PDF, 621 KB)


Share on social networks



Link to this file download page



Document preview


National Joint Conference on Computer Science & Software Engineering (JCSSE2007)
May, 2-4 2007



A Performance Comparison of Matrix Multiplication Algorithms on Chip Multiprocessors





Email: wangsom@hotmail.com, wmrs@cs.tu.ac.th

1. #

Abstract

4 5 5 (Chip Multiprocessor)
4< = = > ? 4 A > B C D4 A E =
F ? F5G 4 = C >
B ? 4 F HI ? 4 A
? J K = 4 5 5 DG J E4K =K
I4 ? I4 LC F
LC ? 4MHHIF 4 5 5 N
FF 2, 4 8 ? 4 A G?
C E4 ? HI ? 4 A H
C 4< 16 ? L ? N E =K [1]
4 5 5 4 A FF
G? > F ? 4 A ? HV
H N V W 4 A C
F 4 5 5 N 4
4 D 4M [2]
4 F4 I 4 W [3] ? K =
? H FFH (Measurement) LC L C
C I J F FF 4Y FF VC
HV J K =A CE = H J K = F FF LC L

The challenge of chip multiprocessors (CMPs) is
not only on the technology to put a complete
multiprocessor on a single chip, but also on the
techniques to optimize its performance. This paper
shows how a discrete-event simulation model with
animation could contribute to the study of
performance impacts on such a system.
We
compared the performance between two matrixmultiplication algorithms, Cannon’s and Fox’s, on a
16-core CMP and found that Cannon’s algorithm
show a better performance in terms of both speed and
processor utilization.





!! " " $
!!
! % & ' (
)" *% + ,
-%' . !! ! & %
)"
"/ !!'(
.
, *" )"
. 01 . 2 + ,
!!%
. ( *% % 1

' ! ! + ,
. 1 .
3 . & !!& & &!!4 . ! 16 " 1
! . 1 &!!
& + ,
.

& )
.
) " $

477

I 4 4< E ? E = H N ?
>= HJ CAG= H E ? D B J K
? 4 A N C B N H
? K = AG= H >= KH I 4 K 4 F 4 I
4 E = B >VN H N N HV
4Y LC >= HJ ? J
V W 4 > 4 5 5
F FFHJ C D J >
FF 4< LC E E =
H N N V W J > 4 5 5 FF SIMD (Single Instruction Multiple
Data) C 16 ? 4 A ? ? LC FF
>? F F (2D mesh torus) L
4 4 A ]M C
G 5 FF G?> 5VC 4< KH> LC
F LC ? 4< 4Y
> FF 4 5 5 H N N
V W 4 F F4 G
5 > 4 x 4 32 x 32 FF G?>
? V 5VC 4< C F FF E = ?
V FF (Cannon_s algorithm) [4]
FF] 5 (Fox_s algorithm) [5] N N A
I4E = ? V FF N ( )
4 A B ? V FF] 5 K
5 N > ? (>) LC C > >
5 V FF ?
(Slowdown) C G ? V FF] 5 HV
= CH 4 A = ? K = V
FF] 5 H N F ? V FF
N ( ) K =4 ? -

4 A (Processor Utilization) f C C =
95 5VC G ? V FF] 5 DV = 14
? ? E4> F N4 F = >= C
2 V G 5 FF G?> >= C
3 FFHJ SIMD CK =K H >= C 4
F F H >= C 5 I4 4 A

2. % &
V G 5 FF G?>
G4 FF N C FF 4 A K = W
HJ H H F FF ? 4 A FF [6] L FF SIMD 4
F [7] 5VC DV = ? HJ H H N H J K =E =
V C 4 G ? B >= HJ CE ?
H J V ? N 4 F F E =
N H N NHV L V W V FF
FF] 5 5VC 4< V C
J E4K =K D 4M FF G?>
C E4 4< C F ? 4< V CK =
LN C ? HJ ? 4 (Memory
efficient) [4, 5] HV ? H F 4 FK =F
4 5 5 FF SIMD C ? ?
4 A ? N ? HJ > HJ
2.1 % %
V FF 4 A
I 5 ? > N 5 K
N > N N J K =
F 4 5 5 ? 4 A ?
(P) HJ p ? ? FF p × p >?
F F G 5 A × B = C

478

5 A B > n × n V
FF H F? 5 A B 4<
5 ? Ai , j Bi , j (0 ≤ i, j < p ) 5VC
> (n / p ) × (n / p ) HJ p F B
H N 5 ? Ai , j Bi , j H DG BFE =K
? 4 A ? Pi , j
> N C VC H 5 ? K Pi , j K ?
C I P K D C i I 5 ? Ai , j E4
5= HJ i N I P K C j
I 5 ? Bi , j >VN >= F HJ j N
H CH 5 ? = Pi , j H E = F 5
? Ai ,( j + i ) mod p B(i + j ) mod p , j H N Pi , j H

J A G ? E4 V FF] 5
> N 4 A N
? F ? K V FF
5 A B H DG F? 4< 5 ? Ai , j
Bi , j BFE =K ? 4 A ? Pi , j
> N C VC V F F ] 5
H ? l l 5 ? A > I Pi ,i
( P0,0 , P1,1 ,..., P p −1, p −1 ) E4K = P I ? C G?K
D H N Pi , j J G 5 ?
? 5 ? A CE = F F Bi , j = BF
? E =K 5 ? Ci , j
> N C V FF] 5 H ?
l l 5 ? A K ? L 5
? A H Pi ,( j + i ) mod p > > N ? = N

G 5 ? N = BF ? E =K 5
? Ci , j
> N C I P H I 5 ? A E4
5= 1 N I 5 ? B >VN >= F
1 N H N Pi , j H G 5 ? A B C
E = F = J A C E = E 4 F 5
? Ci , j J > N C 2 N p − 1 N B
H E =A > 5 C
4 A FF> (Tp ) LC K =
V FF I4 4< N [4]
3

Tp =

2

n
n
+ 4ts p + 4tw
p
p

? E4K = P I ? C G?K D I
5 ? B >VN >= F 1 N H N Pi , j H G
5 ? A B CE = F = J A C
E =E4 F 5 ? Ci , j J > N C 2
N p − 1 N BH E =A > 5 C
4 A FF> (Tp ) LC K =
V FF] 5 I4 4< N [5]
Tp =

(1)

n3
+ ts p + twn 2
p

(2)

4 > V N ( J
H C 1 2) 4< C F ? V
FF K = 4 A B ? FF
] 5 LC 4 A F D 4M
FF G?> C E4 [4, 5] N HV
4< E4E = C H J 4 I K = F 4
5 5

LC ts = C = J F ? >= G
t w = CK =K ? >= G > 1
2.2 % '
V FF] 5 I 5
? ? F ? ll (Broadcast) E4K = I
? 4 A K D LC J E4

479

AG=K =HV H F DG = > 4 A
> V E = ? K = 4Iv F I
LC E L G J F J > N

3. % )# * SIMD /
FF n (HASE L Hierarchical computer
Architecture design and Simulation Environment)
[8] 4< LC L = FFHJ I E ?
? LC (Discrete-Event Simulation Model) LC
V W 4 > D 4M
5VC K FF n FFHJ D 4M
= K = G? ? VC LC L
LC ? K =AG=K = r K ?>VN N NAG= H
E = J FFHJ SIMD FF 16 ? 4 A
? FF >? F F ( G4 C 1) K =K


4.
1 1 )
H N NHV J 4 H FF n C
? K = LC 4 F F4 >
V G 5 LC V C
F 4 5 5 FF SIMD 16
? ? FF >? F F J ? C
E = H FFHJ F F = FF
(Analytical Model) LC L ? LC DL >
>= G CE =H FFHJ
4.1 2 1 23 42 & *% )# *
LC L ? >= G C E = H FFHJ N
LC DL E = AG= H J K 4 A
G 5 C > 2 C E = H FFHJ
4 F F F C J H = FF
J ? J ?
(Slowdown) L 4 A C C >VN >
5 I > LC F F> 4 x 4

G4 C 1 FFHJ SIMD F n

FFHJ SIMD F n CK =
AG=K =4 F ? 1 I J C W 5 F K =
AG= K = r V C = V W >VN
4 A F FFHJ E = K ?
HJ FF n H = E] F V > N
J > FFHJ (Trace file) E = FF
n D E] ? >VN 4<
LC E K =AG=K = B = G I C
>VN ? 4 A ( G4 C 1) N
1





(Slowdown)
225.0
200.0
175.0
150.0

Cannon SIM

125.0

Cannon Model

100.0

Fox SIM

75.0

Fox Model

50.0
25.0
0.0
4x4

8x8

16x16

32x32




G4 C 2 ? > G 5 C>

2

http://www.icsa.inf.ed.ac.uk/research/groups/hase/models/

480

4x4 8x8 16x16 32x32

> V > > 5 )
? ? 4 A F
FFHJ ? J l D

G4 C 2 ? ? 4 F F ?
V CE =H FFHJ H
= FF 5VC F ? N V ?
? H FFHJ E ? ? F ? C E = H
= FF LC F D (t-test)
HV I4E = ? 4 A > N V
F FFHJ SIMD DG =
4 A CE =H FFHJ D J E4K =
4< >= G LC V W H ? E4E =
4.2
& *
LC 4 F F4 > V
G 5 F 4 5 5 AG= H K =
G4 FF FF ] 4 F =
4M H H 4M H H L ,

. 1 . 3 . & !! E = ?
V FF FF] 5 4MHH
C L %
. 5VC F? 4< >
4 x 4 32 x 32 H > N
4MHH ? 4 A F FFHJ
4.3 *%
5
4 A V N F
FFHJ SIMD 16 ? 4 A 3 F ?
LC 4 A G 5 > 4 x 4
32 x 32 V FF K = 4 A 102 21,250 F4 B ? FF] 5
G? 92 6,016 F J F N NA
D F ? 4MHH N N (4

5. 7 5 %

H K =>= G H FFHJ
F ? V FF K =
4 A B ? V FF] 5
>= I4> [4, 5] C I4H K = = FF
N N H K = LC E F
FFHJ F ? K > N C > N
V 5VC 4< > N C = J 5NJ V
FF] 5 K = 4 A ? LC H
= ? ll 5 ? A K = I
? 4 A ? C G? D K > C
V FF K = I 5 ?
A N

! " # $ $% 
100
80
Speedup SIM

40

Speedup Model

%

60

20

24

x1

02

4

12

56

x5
10




51
2

28

x2

x1

25
6

x6
4

12
8

x3
2

64

32

8

x1
6
16

8x

4x

4

0

G4 C 3 4 A C B ? >

H G4 C 3 DV 4 A C B
? L ? (Speedup) > V FF
LC F F FF] 5 H B E = ?
V FF ? LC
5 > C >VN H ? > N
4MHH C >VN J K = I4E = ? > >

3

K = n C 3 J F FF4x F Fedora Core 4.0
F LC Intel Pentium 4 B 3 GHz ? HJ 512 MB
4
DV F ll z > FFHJ (simulation cycle)

481

5 C C >VN ? 4 A
> V FF ? V
FF] 5 ? ? E B
= FF> N V F 5 > K l?
? 1,024 x 1,024 H F ? V
4 A B ? V FF] 5
LC K = LC E F FFHJ L F
E] F V J LC H C ?
4 A DG K = 5VC DV K =
4 ? 4 A = F ? V
FF N K =4 ?
4 A f C C = 95 5VC G ? FF] 5
DV = 14 HV I4E = ? V G 5 FF F 4 5 5 FF SIMD C ? 4 A ?
16 ? ? LC FF >? ? F F
? V FF] 5 LC 4 A F
5 > E ? 1,024 x 1,024
H N N NK = B ? V W > N
J FFHJ C LC E ? K =
V W A F C F4 K
4 A > V F D 4M f
? J E = ? H >VN 5VC H 4< 4
? F W AG= A 4 5 5 AG= =
4 W AG= r 4 K L
V C = 4< I J C J F
G 5 FF G? > F D 4M
? A
K =4 ? 4 A H J E4K = LC

4 F ? 4 A K
E =

6.
9
> > F I Professor Roland Ibbett J F
FF n FF SIMD 5VC AG= H E = J r ?

7. ; * *
[1] F.J. Villa, M.E. Acacio and J.M. Garcia, Memory Subsystem
Characterization in a 16-core Snoop-Based ChipMultiprocessor Architecture,€ In Proceedings of The 2005 on
HPCC, Springer, September 2005, pp. 213-222.
[2] S. Williams, J. Shalf, L. Oliker, S. Kamil, P. Husbands and
K. Yelick, The potential of the cell processor for scientific
computing,€ In Proceedings of the 3rd conference on
Computing frontiers, 2006, pp. 9-20.
[3] F. Franchetti, Y. Voronenko and M. Puschel, FFT program
generation for shared memory: SMP and multicore,€ In
Proceedings of the 2006 ACM/IEEE conference on
Supercomputing, 2006.
[4] V. Kumar, A. Grama, A. Gupta and G. Karypis, Introduction
to Parallel Computing design and analysis of algorithms, The
Benjamin/Cummings, 1994.
[5] A. Gupta and V. Kumar, Scalability of Parallel Algorithms
for Matrix Multiplication,€ In Proceedings of the 1993
International Conference on Parallel Processing, 1993.
[6] M. Krishnan and J. Nieplocha, SRUMMA: a matrix
multiplication algorithm suitable for clusters and scalable
shared memory systems,€ In Proceedings of IPDPS 2004. pp.
70-80.
[7] P. Bjorstad, F. Manne, T. Sorevik and M. Vajteric, Efficient
matrix multiplication on SIMD computers,€ SIAM Journal on
Matrix Analysis and Applications, 1992, pp. 386-401.
[8] P.S. Cole, F.W. Howell, R.N. Ibbett and L.M. Williams, A
Hierarchical Computer Architecture Design and Simulation
Environment,€ ACM Transaction on Modeling and Computer
Simulation, 1998, pp.431-446.

482


Related documents


peerasak wangsom jcsse2007
cs267 hw0 cyclades 1
proceedings
image processing and applications on cryptography
80211
scigen article

Link to this page


Permanent link

Use the permanent link to the download page to share your document on Facebook, Twitter, LinkedIn, or directly with a contact by e-Mail, Messenger, Whatsapp, Line..

Short link

Use the short link to share your document on Twitter or by text message (SMS)

HTML Code

Copy the following HTML code to share your document on a Website or Blog

QR Code

QR Code link to PDF file Peerasak_Wangsom_JCSSE2007.pdf