Process (thread):
- Abstract entity that performs the tasks assigned to it
- Processes communicate and synchronize to perform their tasks
Some Important Concepts
Task:
- Arbitrary piece of undecomposed work in a parallel computation
- Executed sequentially; concurrency is only across tasks
- E.g., a particle/cell in Barnes-Hut, a ray or ray group in Raytrace
- Fine-grained versus coarse-grained tasks
Main goal: speedup (plus low programming effort and resource needs)
- Speedup(p) = Performance(p) / Performance(1)
- For a fixed problem: Speedup(p) = Time(1) / Time(p)
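For instance, with illustrative numbers (not from the slides): if the sequential run takes 100 s and the 8-processor run takes 20 s, then Speedup(8) = 100/20 = 5.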
[Figure: concurrency profile, plotting degree of concurrency (0 to 1,400) against clock cycle number (roughly cycles 150 to 733)]
Cannot usually divide into serial and parallel part
Orchestration
Includes:
- Naming data
- Structuring communication
- Synchronization
- Organizing data structures and scheduling tasks temporally
Goals:
- Reduce cost of communication and synchronization as seen by processors
- Preserve locality of data reference (incl. data structure organization)
- Schedule tasks to satisfy dependences early
- Reduce overhead of parallelism management
Choices depend a lot on the communication abstraction and the efficiency of its primitives; architects should provide appropriate primitives efficiently (see the barrier sketch below).
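One common orchestration mechanism is a barrier between computation phases. Below is a minimal hedged sketch using POSIX barriers (NTHREADS and the phase1/phase2 placeholders are illustrative, not from the lecture):

#include <pthread.h>

#define NTHREADS 4                          /* illustrative thread count */

static pthread_barrier_t phase_barrier;

/* Worker skeleton: the barrier guarantees every thread finishes
 * phase 1 before any thread begins phase 2. */
static void *worker(void *arg) {
    (void)arg;
    /* phase1();  -- hypothetical independent computation */
    pthread_barrier_wait(&phase_barrier);   /* all threads rendezvous here */
    /* phase2();  -- hypothetical computation that consumes phase-1 results */
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    pthread_barrier_init(&phase_barrier, NULL, NTHREADS);
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    pthread_barrier_destroy(&phase_barrier);
    return 0;
}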
Example: 2-phase calculation
- Sweep over an n-by-n grid and do some independent computation
- Sweep again and add each value to a global sum
- Time for first phase = n^2/p
- Second phase serialized at the global variable, so time = n^2 (see the sketch below)
- Speedup <= 2n^2 / (n^2/p + n^2), or at most 2
- Trick: divide the second phase into two
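A hedged sketch of the naive second phase in C with POSIX threads (N, NTHREADS, and the unit values standing in for phase-1 results are illustrative assumptions): because every addition takes the same lock, all N*N additions execute one at a time.

#include <pthread.h>
#include <stdio.h>

#define N 256                               /* illustrative grid dimension */
#define NTHREADS 4

static double vals[N * N];
static double global_sum = 0.0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Naive phase 2: every addition to the shared sum takes the same lock,
 * so the additions serialize: time ~ n^2, not n^2/p. */
static void *phase2_naive(void *arg) {
    long t = (long)arg;
    for (long i = t * (N * N / NTHREADS); i < (t + 1) * (N * N / NTHREADS); i++) {
        pthread_mutex_lock(&lock);
        global_sum += vals[i];              /* one element at a time, machine-wide */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (long i = 0; i < N * N; i++)
        vals[i] = 1.0;                      /* stand-in for phase-1 results */
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, phase2_naive, (void *)t);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
    printf("sum = %.0f\n", global_sum);     /* expect N*N = 65536 */
    return 0;
}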
Pictorial Depiction
[Figure: work done concurrently versus time for (a) the serial computation (2n^2 work on 1 processor), (b) the version whose second phase is serialized at the global sum (n^2/p concurrent, then n^2 serial), and (c) the private-sum version (n^2/p + n^2/p concurrent, then p serial)]
Concurrency Profiles
- Structured approaches usually work well
- As programmers, we worry about partitioning first
- As architects, we assume the program does a reasonable job of it
Mapping Applications to Multi-Core Architectures
Chapter 2: David E. Culler and Jaswinder Pal Singh, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, 1998
Agenda for Today
- Mapping applications to multi-core architectures
- Parallel programming using threads
  - POSIX multi-threading (a minimal create/join sketch follows this list)
  - Using multi-threading for parallel programming
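As a preview of the POSIX threading covered below, a minimal create/join sketch (NTHREADS is an illustrative choice; compile with e.g. gcc -pthread):

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4                          /* illustrative thread count */

/* Each thread receives its logical id through the void* argument. */
static void *worker(void *arg) {
    printf("hello from thread %ld\n", (long)arg);
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);         /* wait for all workers to finish */
    return 0;
}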
Trick: divide the second phase into two:
- Accumulate into a private sum during the sweep
- Add per-process private sums into the global sum (see the sketch below)
- Parallel time is n^2/p + n^2/p + p, and speedup at best 2n^2 / (2n^2/p + p)
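A compilable sketch of the private-sum trick (N, NTHREADS, and the increment standing in for the "independent computation" are illustrative assumptions): each thread accumulates locally during its sweep and takes the lock only once, so the serialized part shrinks from n^2 additions to p.

#include <pthread.h>
#include <stdio.h>

#define N 1024                              /* illustrative; N % NTHREADS == 0 assumed */
#define NTHREADS 4

static double grid[N][N];
static double global_sum = 0.0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Sweep a block of rows, accumulate into a private sum, then add that
 * single value to the global sum: p lock operations instead of n^2. */
static void *sweep_and_sum(void *arg) {
    long t = (long)arg;
    double my_sum = 0.0;
    for (long i = t * (N / NTHREADS); i < (t + 1) * (N / NTHREADS); i++)
        for (long j = 0; j < N; j++) {
            grid[i][j] += 1.0;              /* stand-in for the independent computation */
            my_sum += grid[i][j];
        }
    pthread_mutex_lock(&lock);
    global_sum += my_sum;                   /* the only serialized step per thread */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, sweep_and_sum, (void *)t);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
    printf("sum = %.0f\n", global_sum);     /* expect N*N = 1048576 */
    return 0;
}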
Steps in Creating a Parallel Program
The four steps: Decomposition, Assignment, Orchestration, Mapping (decomposition and assignment together constitute partitioning)
Assignment
Specifying mechanism to divide work up among processes
- E.g., which process computes forces on which stars, or which rays
- Together with decomposition, also called partitioning
- Goals: balance workload, reduce communication and management cost
- Done via code inspection (parallel loops) or understanding of the application; well-known heuristics exist
- Static versus dynamic assignment (see the sketch below)
- Usually independent of architecture or programming model
- But cost and complexity of using primitives may affect decisions
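To make static versus dynamic assignment concrete, a hedged sketch (the do_task call is a hypothetical placeholder): static assignment would give thread t a fixed block of tasks up front, while dynamic assignment, shown here, lets threads claim tasks from a shared counter, balancing load when task costs vary.

#include <pthread.h>
#include <stdio.h>

#define NTASKS 1000
#define NTHREADS 4

static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static int next_task = 0;                   /* shared work-queue cursor */
static int done[NTHREADS];                  /* per-thread completion counts, to show balance */

/* Dynamic assignment: threads claim the next unclaimed task until
 * none remain. */
static void *dynamic_worker(void *arg) {
    long me = (long)arg;
    for (;;) {
        pthread_mutex_lock(&qlock);
        int t = next_task++;
        pthread_mutex_unlock(&qlock);
        if (t >= NTASKS)
            break;
        done[me]++;                         /* do_task(t); -- hypothetical per-task work */
    }
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, dynamic_worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++) {
        pthread_join(tid[i], NULL);
        printf("thread %d completed %d tasks\n", i, done[i]);
    }
    return 0;
}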
Course Outline
- Introduction
- Multi-threading on multi-core processors
  - Developing parallel applications
  - Introduction to POSIX-based multi-threading
  - Multi-threaded application examples
- Area under the curve is the total work done, or time with 1 processor
- Horizontal extent is a lower bound on time (with infinite processors)
- Speedup is the ratio: [ sum_k f_k * k ] / [ sum_k f_k * ceil(k/p) ]
- Applications for multi-core processors
  - Application layer computing on multi-core
  - Performance measurement and tuning
- Tasks may become available dynamically
- Number of available tasks may vary with time
- i.e., identify concurrency and decide the level at which to exploit it
- Goal: enough tasks to keep processes busy, but not too many
Processor:
- Physical engine on which a process executes
- Processes virtualize the machine to the programmer: first write the program in terms of processes, then map processes to processors (see the pinning sketch below)
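The mapping step is usually left to the OS scheduler, but it can be made explicit. A hedged, Linux-specific sketch using the GNU extension pthread_setaffinity_np (core numbering is platform-dependent):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to one core, making the process-to-processor
 * mapping explicit instead of leaving it to the scheduler. */
static int pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);
}

int main(void) {
    if (pin_to_core(0) != 0)
        return 1;                           /* pinning failed, e.g., core unavailable */
    /* ... this thread's work now runs on core 0 ... */
    return 0;
}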
Decomposition
Break up computation into tasks to be divided among processes
Parallelization
Assumption: a sequential algorithm is given
- Sometimes a very different algorithm is needed, but that is beyond our scope
Pieces of the job:
- Identify work that can be done in parallel
- Partition work and perhaps data among processes
- Manage data access, communication, and synchronization
- Note: work includes computation, data access, and I/O
Programming Multi-Core Processor-based Embedded Systems
A Hands-On Experience on Cavium Octeon-based Platforms
Lecture 2 (Mapping Applications to Multi-core Architectures)
Cavium Univ Program © 2010
Amdahl’s law applies to any overhead, not just limited concurrency
Speedup(p) <= [ sum_{k=1..infinity} f_k * k ] / [ sum_{k=1..infinity} f_k * ceil(k/p) ], where f_k is the time spent with degree of concurrency k
Base case (serial fraction s): Speedup(p) = 1 / ( s + (1-s)/p )
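For instance, with an illustrative serial fraction s = 0.2 and p = 16 processors, speedup <= 1 / (0.2 + 0.8/16) = 4; even with unlimited processors the bound is 1/s = 5.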
[Figure: the sequential computation is decomposed into tasks, tasks are assigned to processes p0-p3, orchestration produces the parallel program, and mapping places the processes onto processors P0-P3]
4 steps: Decomposition, Assignment, Orchestration, Mapping
- Done by the programmer or by system software (compiler, runtime, ...)
- Issues are the same, so assume the programmer does it all explicitly
No. of tasks available at a time is upper bound on achievable speedup
Limited Concurrency: Amdahl’s Law
- Most fundamental limitation on parallel speedup
- If a fraction s of sequential execution is inherently serial, then speedup <= 1/s
- Example: 2-phase calculation