Application tuning and debugging on Linux
Qianfeng Zhang <frzhang@redhat.com>, Red Hat Software, Beijing
Sleep for Network IO
• Wait for data to be available in the buffer or queue (TCP or UDP)
• Wait for the buffer to be empty or the queue to be unplugged (TCP or UDP)
• Don't ignore the IPC channels (pipe, Unix socket, message queue)
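Where that sleep actually happens can be made explicit with poll(). A minimal sketch, assuming `fd` is an already-connected TCP socket (just the wait-then-read pattern, not a complete program):

```c
/* The task sleeps inside poll() until the socket receive queue has
 * data; the subsequent read() then completes without blocking. */
#include <poll.h>
#include <sys/types.h>
#include <unistd.h>

ssize_t read_when_ready(int fd, char *buf, size_t len)
{
    struct pollfd p = { .fd = fd, .events = POLLIN };

    if (poll(&p, 1, -1) < 0)       /* sleep until data (or error) */
        return -1;
    if (p.revents & POLLIN)
        return read(fd, buf, len); /* data is queued: won't block */
    return -1;
}
```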
NUMA
• Multiple processors & memory are packed as "nodes"; memory in the same node is called local physical memory
• Shared but non-uniform memory architecture: accessing on-node memory is faster than accessing off-node memory
• "On-node" PCI bus controller
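To keep hot data on-node, allocations can be bound explicitly with libnuma. A minimal sketch (link with -lnuma; the choice of node 0 is just an example):

```c
#include <numa.h>     /* numa_available, numa_alloc_onnode, numa_free */
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }

    size_t sz = 64 << 20;                  /* 64 MiB */
    void *buf = numa_alloc_onnode(sz, 0);  /* physical pages on node 0 */
    if (!buf)
        return 1;

    memset(buf, 0, sz);   /* touch the pages so they fault in on node 0 */
    numa_free(buf, sz);
    return 0;
}
```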
• Shorter "Wait for disk IO"
  • Use a faster disk controller
• Shorter "Wait for network IO"
  • Use 10-Gigabit Ethernet
• All of these may cost a lot of money
Deeper considerations to make your application run faster
Virtualized server environment
• Multiple VM guests share the same physical host's resources
• vm_exits are a critical factor causing poor VM guest performance
HA Cluster
• Strong membership infrastructure
• Active-inactive or active-active
• Fail-over delay is the most critical factor to consider
• Active-active gives better fail-over delay and data consistency, but reduces performance during normal operation due to the overhead of state duplication
Why is multi-threaded not necessarily better?
• Implicit lock contention when accessing the file descriptor table in the kernel, even for separate file accesses by different threads
• TLB synchronization caused by mmap()/munmap() in multi-threaded programs (see the sketch below)
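One common mitigation for the mmap()/munmap() cost is to map a scratch buffer once per thread and reuse it, since every munmap() triggers a TLB shootdown on all CPUs running threads of the process. A sketch of the pattern (the buffer size and worker loop are made up for illustration):

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define BUF_SZ (1 << 20)          /* 1 MiB scratch buffer per thread */

static __thread void *scratch;    /* thread-local, mapped exactly once */

static void *get_scratch(void)
{
    if (scratch == NULL) {
        scratch = mmap(NULL, BUF_SZ, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (scratch == MAP_FAILED)
            scratch = NULL;
    }
    return scratch;               /* reused; never munmap()ed per request */
}

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000; i++) {
        void *buf = get_scratch();
        if (buf)
            memset(buf, 0, BUF_SZ);  /* per-request work, no map churn */
    }
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    puts("done");
    return 0;
}
```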
Sources of vm_exits in a guest:
• gettimeofday() causes a vm_exit
• Process switching causes vm_exits
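A common mitigation is timing with clock_gettime() on a monotonic clock: on guests with a vDSO-backed clocksource (e.g. kvm-clock) the call can be satisfied without a vm_exit, though whether it is depends on the hypervisor and clocksource, so verify rather than assume. A minimal sketch (link with -lrt on older glibc such as RHEL6's):

```c
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec ts;

    /* Monotonic clock: good for intervals, immune to wall-clock jumps. */
    if (clock_gettime(CLOCK_MONOTONIC, &ts) == 0)
        printf("%ld.%09ld\n", (long)ts.tv_sec, ts.tv_nsec);
    return 0;
}
```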
• Para-virtualized IO is critical for the overall IO performance of each guest
• The latest hardware support improves memory-management performance and enables direct device access by guests (VPID, EPT, etc.)
• Memory sharing maximizes the number of guests on the same host
• When one thread is in "computing", the other one can be in "disk IO" (see the sketch below)
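A toy sketch of that overlap: one thread sleeps in disk IO while the other keeps a CPU busy. The file path and the dummy workload are placeholders:

```c
#include <pthread.h>
#include <stdio.h>

static void *io_thread(void *arg)
{
    FILE *f = fopen((const char *)arg, "rb");
    char buf[4096];

    if (!f)
        return NULL;
    while (fread(buf, 1, sizeof(buf), f) > 0)
        ;                          /* this thread sleeps in "disk IO" */
    fclose(f);
    return NULL;
}

static void *compute_thread(void *arg)
{
    volatile double x = 0;

    (void)arg;
    for (long i = 1; i < 100000000L; i++)
        x += 1.0 / i;              /* this thread stays in "computing" */
    return NULL;
}

int main(void)
{
    pthread_t io, cpu;

    pthread_create(&io, NULL, io_thread, (void *)"/var/log/messages");
    pthread_create(&cpu, NULL, compute_thread, NULL);
    pthread_join(io, NULL);
    pthread_join(cpu, NULL);
    return 0;
}
```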
Forced Flow Law
• "The overall throughput of the system is determined by the slowest component of the system."
• In a saturated system, one component with low utilization indicates that some other component is saturated.
Modern PC server architecture (SMP & NUMA)
SMP
• Multiple processors share a single system bus
• Shared and uniform memory architecture
• Shared PCI bus
Idle
• Timed idle, or waiting for the signal of an event
First-sight ideas to make the application run faster
• Shorter "Computing"
  • Use a faster CPU and faster physical memory
• Understand the Linux memory system
Clustered server architecture
HPC cluster
• Closely coupled through application programming models like MPI
• Performance is mostly affected by interconnect speed
General conceptions related to performance
How do we decide whether an application performs well?
• Lower latency for a single transaction
• Larger total number of requests or transactions served per unit of time
• Shorter duration for a complex computing task
• How about CPU usage ratio?
Load-balancing cluster
• Add back-end nodes to serve more requests
• The entry server is critical for total performance
• The entry server may work at the kernel level (LVS) or the application level (mod_proxy or mod_jk)
What does the application do in a single execution path?
Computing
• memcpy, memcmp, malloc/free, system calls (address-space switches), allocating kernel data structures, querying a tree or list
• All processor instructions that access memory and processor registers
Summary
• General conceptions related to performance
• Computation-intensive application tuning
• Networking-intensive application tuning
• IO-intensive application tuning
• Optimization in coding
• Important tuning/debugging tools on RHEL6
Sleep for Disk IO
• Wait for completion of a buffered read from disk
• Wait for submission of dirty written data (in case of too many dirty pages or too few free pages)
• Wait for direct IO (read or write)
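The "wait for direct IO" case can be seen with O_DIRECT, which bypasses the page cache and requires aligned buffers and lengths (512 or 4096 bytes depending on the device; 4096 is assumed here, and /tmp/testfile is a placeholder):

```c
#define _GNU_SOURCE              /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    const size_t align = 4096, len = 4096 * 16;
    void *buf;

    if (posix_memalign(&buf, align, len) != 0)
        return 1;

    int fd = open("/tmp/testfile", O_RDONLY | O_DIRECT);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* The task sleeps here until the device completes the transfer;
     * the page cache is not consulted. */
    ssize_t n = read(fd, buf, len);
    printf("read %zd bytes\n", n);

    close(fd);
    free(buf);
    return 0;
}
```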
Quantitative time analysis
• If "computing" takes 10% and "sleep for disk IO" takes 90%, shortening "sleep for disk IO" is more meaningful (see the arithmetic below)
• Optimize the code to shorten "computing"
• Optimize memory accesses to shorten "computing"
• Utilize data caches for fewer disk visits and less "sleep for disk IO"
• Utilize RAID for concurrent disk head accesses
• Apply various techniques for reducing disk head movement
• Parallelize the application to maximize total performance
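A quick Amdahl-style check of the 10%/90% split above (my arithmetic, not from the slides):

```latex
% Halving "sleep for disk IO" (the 90% part):
T_{\text{new}} = 0.1 + \tfrac{0.9}{2} = 0.55
  \;\Rightarrow\; \text{speedup} = \tfrac{1}{0.55} \approx 1.8\times
% Halving "computing" (the 10% part) instead:
T_{\text{new}} = \tfrac{0.1}{2} + 0.9 = 0.95
  \;\Rightarrow\; \text{speedup} = \tfrac{1}{0.95} \approx 1.05\times
```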
Consequence
• A saturated component is a bottleneck
• Reducing the visit count (V_resource) improves system throughput
• Increasing the throughput (X_resource) of the saturated resource improves system throughput (see the formula sketch below)
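The standard queueing-theory form of the law, for reference (notation is mine, not from the slides):

```latex
% X   = system throughput,  X_k = throughput of resource k
% V_k = visit count (visits to resource k per system-level request)
% S_k = service time per visit at resource k
X_k = V_k \cdot X, \qquad U_k = X_k \cdot S_k
% The first resource to reach U_k = 1 saturates and caps the system:
X \le \frac{1}{\max_k V_k S_k}
% Hence reducing V_k or S_k at the bottleneck raises system throughput.
```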
OS Support
• Locks must be scalable
• Memory barriers must be modest
• Memory distance is considered when allocating pages and migrating tasks (for NUMA)

Multi-process and Multi-threaded
• Both are basic scheduling units, represented by a "struct task_struct" in the kernel (Linux uses the "1:1" threading model)
• A multi-process or multi-threaded design is necessary for the application to reach high total performance on an SMP or NUMA architecture
• Threads in one process share the whole virtual address space and all opened files and sockets, but this brings issues with protecting shared data
• Multi-threaded is not necessarily better than multi-process for performance
How about CPU usage ratio?
• A higher CPU usage ratio doesn't mean better performance
• For a specific application, an improved CPU usage ratio may be an indication of improved total performance
• A low CPU usage ratio may be an indication of a bottleneck in some other aspect
• Lower CPU use is a target of "green computing", but not of a single application