Process (thread):
- Abstract entity that performs the tasks assigned to it
- Processes communicate and synchronize to perform their tasks
Some Important Concepts
Task:
- Arbitrary piece of undecomposed work in a parallel computation
- Executed sequentially; concurrency is only across tasks
- E.g., a particle/cell in Barnes-Hut, a ray or ray group in Raytrace
- Fine-grained versus coarse-grained tasks
Main goal: speedup (plus low programming effort and resource needs)
- Speedup(p) = Performance(p) / Performance(1)
- For a fixed problem: Speedup(p) = Time(1) / Time(p)
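For instance, with illustrative numbers (not from the slides): if the sequential run takes 100 s and the 8-processor run takes 20 s, then Speedup(8) = 100/20 = 5.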
[Figure: concurrency profile, plotting degree of concurrency (0 to 1,400) against clock cycle number (roughly cycles 150 to 733)]
Cannot usually divide into serial and parallel part
Orchestration
Includes:
- Naming data
- Structuring communication
- Synchronization
- Organizing data structures and scheduling tasks temporally
Goals:
- Reduce cost of communication and synchronization as seen by processors
- Preserve locality of data reference (incl. data structure organization)
- Schedule tasks to satisfy dependences early
- Reduce overhead of parallelism management
Choices depend a lot on the communication abstraction and the efficiency of its primitives; architects should provide appropriate primitives efficiently (see the barrier sketch below).
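One common orchestration mechanism is a barrier between computation phases. Below is a minimal hedged sketch using POSIX barriers (NTHREADS and the phase1/phase2 placeholders are illustrative, not from the lecture):

#include <pthread.h>

#define NTHREADS 4                          /* illustrative thread count */

static pthread_barrier_t phase_barrier;

/* Worker skeleton: the barrier guarantees every thread finishes
 * phase 1 before any thread begins phase 2. */
static void *worker(void *arg) {
    (void)arg;
    /* phase1();  -- hypothetical independent computation */
    pthread_barrier_wait(&phase_barrier);   /* all threads rendezvous here */
    /* phase2();  -- hypothetical computation that consumes phase-1 results */
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    pthread_barrier_init(&phase_barrier, NULL, NTHREADS);
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    pthread_barrier_destroy(&phase_barrier);
    return 0;
}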
Example: 2-phase calculation
- Sweep over an n-by-n grid and do some independent computation
- Sweep again and add each value to a global sum
- Time for first phase = n^2/p
- Second phase serialized at the global variable, so time = n^2 (see the sketch below)
- Speedup <= 2n^2 / (n^2/p + n^2), or at most 2
- Trick: divide the second phase into two
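A hedged sketch of the naive second phase in C with POSIX threads (N, NTHREADS, and the unit values standing in for phase-1 results are illustrative assumptions): because every addition takes the same lock, all N*N additions execute one at a time.

#include <pthread.h>
#include <stdio.h>

#define N 256                               /* illustrative grid dimension */
#define NTHREADS 4

static double vals[N * N];
static double global_sum = 0.0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Naive phase 2: every addition to the shared sum takes the same lock,
 * so the additions serialize: time ~ n^2, not n^2/p. */
static void *phase2_naive(void *arg) {
    long t = (long)arg;
    for (long i = t * (N * N / NTHREADS); i < (t + 1) * (N * N / NTHREADS); i++) {
        pthread_mutex_lock(&lock);
        global_sum += vals[i];              /* one element at a time, machine-wide */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (long i = 0; i < N * N; i++)
        vals[i] = 1.0;                      /* stand-in for phase-1 results */
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, phase2_naive, (void *)t);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
    printf("sum = %.0f\n", global_sum);     /* expect N*N = 65536 */
    return 0;
}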
Pictorial Depiction
[Figure: work done concurrently versus time for (a) the serial computation (2n^2 work on 1 processor), (b) the version whose second phase is serialized at the global sum (n^2/p concurrent, then n^2 serial), and (c) the private-sum version (n^2/p + n^2/p concurrent, then p serial)]
Concurrency Profiles
- Structured approaches usually work well
- As programmers, we worry about partitioning first
- As architects, we assume the program does a reasonable job of it
Mapping Applications to Multi-Core Architectures
Chapter 2: David E. Culler and Jaswinder Pal Singh, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, 1998
Agenda for Today
- Mapping applications to multi-core architectures
- Parallel programming using threads
  - POSIX multi-threading (a minimal create/join sketch follows this list)
  - Using multi-threading for parallel programming
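As a preview of the POSIX threading covered below, a minimal create/join sketch (NTHREADS is an illustrative choice; compile with e.g. gcc -pthread):

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4                          /* illustrative thread count */

/* Each thread receives its logical id through the void* argument. */
static void *worker(void *arg) {
    printf("hello from thread %ld\n", (long)arg);
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);         /* wait for all workers to finish */
    return 0;
}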
Trick: divide the second phase into two:
- Accumulate into a private sum during the sweep
- Add per-process private sums into the global sum (see the sketch below)
- Parallel time is n^2/p + n^2/p + p, and speedup at best 2n^2 / (2n^2/p + p)
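A compilable sketch of the private-sum trick (N, NTHREADS, and the increment standing in for the "independent computation" are illustrative assumptions): each thread accumulates locally during its sweep and takes the lock only once, so the serialized part shrinks from n^2 additions to p.

#include <pthread.h>
#include <stdio.h>

#define N 1024                              /* illustrative; N % NTHREADS == 0 assumed */
#define NTHREADS 4

static double grid[N][N];
static double global_sum = 0.0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Sweep a block of rows, accumulate into a private sum, then add that
 * single value to the global sum: p lock operations instead of n^2. */
static void *sweep_and_sum(void *arg) {
    long t = (long)arg;
    double my_sum = 0.0;
    for (long i = t * (N / NTHREADS); i < (t + 1) * (N / NTHREADS); i++)
        for (long j = 0; j < N; j++) {
            grid[i][j] += 1.0;              /* stand-in for the independent computation */
            my_sum += grid[i][j];
        }
    pthread_mutex_lock(&lock);
    global_sum += my_sum;                   /* the only serialized step per thread */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, sweep_and_sum, (void *)t);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
    printf("sum = %.0f\n", global_sum);     /* expect N*N = 1048576 */
    return 0;
}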
Steps in Creating a Parallel Program
The four steps: Decomposition, Assignment, Orchestration, Mapping (decomposition and assignment together constitute partitioning)
Assignment
Specifying mechanism to divide work up among processes
- E.g., which process computes forces on which stars, or which rays
- Together with decomposition, also called partitioning
- Goals: balance workload, reduce communication and management cost
- Done via code inspection (parallel loops) or understanding of the application; well-known heuristics exist
- Static versus dynamic assignment (see the sketch below)
- Usually independent of architecture or programming model
- But cost and complexity of using primitives may affect decisions
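To make static versus dynamic assignment concrete, a hedged sketch (the do_task call is a hypothetical placeholder): static assignment would give thread t a fixed block of tasks up front, while dynamic assignment, shown here, lets threads claim tasks from a shared counter, balancing load when task costs vary.

#include <pthread.h>
#include <stdio.h>

#define NTASKS 1000
#define NTHREADS 4

static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static int next_task = 0;                   /* shared work-queue cursor */
static int done[NTHREADS];                  /* per-thread completion counts, to show balance */

/* Dynamic assignment: threads claim the next unclaimed task until
 * none remain. */
static void *dynamic_worker(void *arg) {
    long me = (long)arg;
    for (;;) {
        pthread_mutex_lock(&qlock);
        int t = next_task++;
        pthread_mutex_unlock(&qlock);
        if (t >= NTASKS)
            break;
        done[me]++;                         /* do_task(t); -- hypothetical per-task work */
    }
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, dynamic_worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++) {
        pthread_join(tid[i], NULL);
        printf("thread %d completed %d tasks\n", i, done[i]);
    }
    return 0;
}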
Course Outline
- Introduction
- Multi-threading on multi-core processors
  - Developing parallel applications
  - Introduction to POSIX-based multi-threading
  - Multi-threaded application examples
- Area under the curve is the total work done, or time with 1 processor
- Horizontal extent is a lower bound on time (with infinite processors)
- Speedup is the ratio: [ sum_k f_k * k ] / [ sum_k f_k * ceil(k/p) ]
- Applications for multi-core processors
  - Application layer computing on multi-core
  - Performance measurement and tuning
- Tasks may become available dynamically
- Number of available tasks may vary with time
- i.e., identify concurrency and decide the level at which to exploit it
- Goal: enough tasks to keep processes busy, but not too many
Processor:
- Physical engine on which a process executes
- Processes virtualize the machine to the programmer: first write the program in terms of processes, then map processes to processors (see the pinning sketch below)
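The mapping step is usually left to the OS scheduler, but it can be made explicit. A hedged, Linux-specific sketch using the GNU extension pthread_setaffinity_np (core numbering is platform-dependent):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to one core, making the process-to-processor
 * mapping explicit instead of leaving it to the scheduler. */
static int pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);
}

int main(void) {
    if (pin_to_core(0) != 0)
        return 1;                           /* pinning failed, e.g., core unavailable */
    /* ... this thread's work now runs on core 0 ... */
    return 0;
}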
Decomposition
Break up computation into tasks to be divided among processes
Parallelization
Assumption: a sequential algorithm is given
- Sometimes a very different algorithm is needed, but that is beyond our scope
Pieces of the job:
- Identify work that can be done in parallel
- Partition work and perhaps data among processes
- Manage data access, communication, and synchronization
- Note: work includes computation, data access, and I/O
Programming Multi-Core Processor-based Embedded Systems
A Hands-On Experience on Cavium Octeon-based Platforms
Lecture 2 (Mapping Applications to Multi-core Architectures)
Cavium Univ Program © 2010
Amdahl’s law applies to any overhead, not just limited concurrency
Speedup(p) <= [ sum_{k=1..infinity} f_k * k ] / [ sum_{k=1..infinity} f_k * ceil(k/p) ], where f_k is the time spent with degree of concurrency k
Base case (serial fraction s): Speedup(p) = 1 / ( s + (1-s)/p )
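For instance, with an illustrative serial fraction s = 0.2 and p = 16 processors, speedup <= 1 / (0.2 + 0.8/16) = 4; even with unlimited processors the bound is 1/s = 5.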
[Figure: the sequential computation is decomposed into tasks, tasks are assigned to processes p0-p3, orchestration produces the parallel program, and mapping places the processes onto processors P0-P3]
4 steps: Decomposition, Assignment, Orchestration, Mapping
- Done by the programmer or by system software (compiler, runtime, ...)
- Issues are the same, so assume the programmer does it all explicitly
No. of tasks available at a time is upper bound on achievable speedup
Limited Concurrency: Amdahl’s Law
- Most fundamental limitation on parallel speedup
- If a fraction s of sequential execution is inherently serial, then speedup <= 1/s
- Example: 2-phase calculation