With some knowledge of the probable types of applications the multi-core CPU will be used for, a performance analyst may make useful comparisons between multi-core CPUs using benchmarks. Prior sections of this paper have assumed that the performance analyst is examining the impact of a multi-core CPU on a particular application's performance, especially multi-threaded applications.
Here, a different set of tools will allow a more isolated analysis of a particular CPU's performance. For example, the performance analyst may need to compare multiple models of multi-core CPUs to calculate an optimal price-performance tradeoff for an acquisition decision.
The performance analyst might only have a general idea of the types of applications the workstation will run. An industry standard benchmark, such as the benchmarks maintained and published by the Standard Performance Evaluation Corporation (SPEC), provides a relative measure of performance in different types of computations [ Jain91 ]. A practitioner may compile and run the benchmarks themselves or reference existing published results for many different multi-core CPUs [ SPEC ].
Comparing these benchmark scores across multi-core CPUs will consequently provide useful predictions of relative performance if the benchmark is representative of the type of task.
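One simple way to turn published scores into an acquisition comparison is to rank candidates by score per unit cost. The following is a minimal sketch under stated assumptions: the function name, the tuple layout, and the CPU names and figures in the usage lines are hypothetical placeholders, not SPEC results or real prices.

```python
def rank_by_score_per_cost(candidates):
    """Sort (name, benchmark_score, price) tuples by score/price, best first.

    The benchmark score is assumed to be "higher is better" and the price
    is assumed to be positive and in a single currency.
    """
    return sorted(candidates, key=lambda c: c[1] / c[2], reverse=True)


if __name__ == "__main__":
    # Hypothetical placeholder values for illustration only.
    models = [("cpu_a", 100.0, 200.0), ("cpu_b", 150.0, 400.0)]
    for name, score, price in rank_by_score_per_cost(models):
        print(name, score / price)
```

A ratio like this is only as meaningful as the benchmark is representative of the intended workload, which is the caveat stated above.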
CFP tests floating point operations with tests from fluid dynamics, quantum chemistry, physics, weather, speech recognition, and others [ SPEC ]. Though these two benchmarks will give a relative measure of raw CPU power in a given core, the focus of these tests is not parallelized workloads. For the analyst, benchmarks provided by the type of multi-threading package that will be used by a potential application would be more relevant.
In the context of multi-core CPUs, analysts may utilize benchmarking tools to get relative performance information among different multi-core CPU models across varying types of applications. The closer these types of applications are to ones run on the system being analyzed, the more relevant the results. The next subsection will provide examples that use empirical measurements to make performance observations about the multi-core CPU involved.
This subsection will provide an example of how to use measurements to analyze the performance of a multi-threaded application running on a multi-core CPU.
The measurements can validate insights gained from other techniques and lead to new performance observations that otherwise might have been overlooked. This particular example is a numerical solution to Laplace's equation which has multi-threaded code implemented with OpenMP.
This example will investigate the performance of a numerical solution to Laplace's equation. Solutions to Laplace's equation are important in many scientific fields, such as electromagnetism, astronomy, and fluid dynamics [ Laplace's equation ]. Laplace's equation is also used to study heat conduction as it is the steady-state heat equation. This particular code solves the steady state temperature distribution over a rectangular grid of m by n points.
Examining the code of this application provides some initial insight into its expected performance behavior on a multi-core processor. Here, this particular code indicates that it is probably compute-intensive rather than memory-intensive. It visits each point, replacing the current point average with the weighted average of its neighbors. It uses a random walk strategy and continues iterating until the solution reaches a stable state, i. At default values, this would require approximately 18, iterations.
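The per-point update described above can be sketched in a few lines. This is a minimal single-threaded Python illustration of a Jacobi-style sweep, not the OpenMP code measured in this paper. The grid size, the boundary temperature, the unweighted four-neighbor average, and the stopping threshold `epsilon` are all illustrative assumptions.

```python
def jacobi_sweep(u):
    """Return (new_grid, max_change) after one sweep over the interior points.

    Each interior point is replaced by the mean of its four neighbors,
    reading from the old grid `u` and writing into a second grid `w`.
    """
    m, n = len(u), len(u[0])
    w = [row[:] for row in u]  # second matrix; boundary values carry over
    diff = 0.0
    for i in range(1, m - 1):
        for j in range(1, n - 1):
            w[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1])
            diff = max(diff, abs(w[i][j] - u[i][j]))
    return w, diff


def solve(m=8, n=8, boundary=100.0, epsilon=1e-3, max_iters=100_000):
    """Iterate until the largest per-point change falls below epsilon."""
    # Hypothetical setup: fixed boundary temperature, interior starts at zero.
    u = [[boundary if i in (0, m - 1) or j in (0, n - 1) else 0.0
          for j in range(n)] for i in range(m)]
    for iteration in range(1, max_iters + 1):
        u, diff = jacobi_sweep(u)
        if diff < epsilon:
            return u, iteration
    return u, max_iters


if __name__ == "__main__":
    grid, iterations = solve()
    print("iterations:", iterations, "center value:", grid[4][4])
```

The two nested loops over the interior are the part that a threading package such as OpenMP would divide among cores, since each point in a sweep reads only from the old grid.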
The largest memory required is two matrices of m by n double datatypes (8 bytes each), so at the default of a by matrix, the memory requirements are relatively low. Because of the high number of iterations and computations per iteration, compared with relatively low memory requirements, this application is probably more compute-intensive than memory-intensive. Further observations may be made about expected behavior using concepts of parallelism. First, we know that this application is implemented to utilize thread-level parallelization through OpenMP.
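The memory estimate is simple arithmetic: two matrices times m times n times 8 bytes. A minimal sketch follows; the function name is hypothetical, and the 1000 by 1000 grid in the usage lines is an arbitrary illustrative size, not the application's default.

```python
def grid_memory_bytes(m, n, matrices=2, bytes_per_value=8):
    """Bytes needed for `matrices` dense m-by-n grids of 8-byte doubles."""
    return matrices * m * n * bytes_per_value


if __name__ == "__main__":
    # Arbitrary illustrative size, not the default used in the paper.
    total = grid_memory_bytes(1000, 1000)
    print(total, "bytes =", total / 2**20, "MiB")
```

Comparing a figure like this against the last-level cache size of the target CPU gives a first indication of whether the working set is likely to stay cache-resident.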
Examining the code reveals that the initialization of values, calculation loops, and most updates except calculating the current iteration difference compared to epsilon are multi-threaded. Second, thinking about the way the code executes leads to some expected data-level parallelism. During an iteration each point utilizes portions of the same m by n matrix to calculate the average of its neighbors before writing an updated value. The impact of instruction-level parallelism will depend upon the details of the multi-core processor architecture and the type of instruction mix.
This particular example was executed on an Intel Core 2 Quad Q 2. The principal calculation in the code involves the averaging of four double-precision floating point numbers, then storing the result. Based on high-level literature, it would appear that the Intel Core architecture's Wide Dynamic Execution feature would allow each core to execute up to four of these instructions simultaneously [ Doweck06 ].
Using the code and conceptual understanding of thread-, data-, and instruction-level parallelism can lead to useful insights about expected performance. The next step is to use measurement techniques to confirm our expectations and gain further insights. Here we will focus on measuring the impact of thread-level parallelism by measuring execution time while varying the number of threads used by the application. The following results in Table 1, below, were obtained by executing the application on an Intel Q quad core four thread CPU, varying the number of threads, while profiling the application using gprof and measuring the elapsed execution time using the time command.
Table 1: gprof data and execution time for varying thread count on a quad-core CPU.
From these measurements the following observations can be made. First, taking a single-thread execution as the base, we see a significant elapsed execution time improvement by moving to two threads. This validates our expectation regarding thread-level parallelism; because this application employs parallelization for the majority of its routines, performance would improve significantly by assigning additional processor cores to threaded work.
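Speedup and parallel efficiency relative to the single-thread base can be derived directly from elapsed times. The sketch below assumes a mapping from thread count to elapsed seconds. The function name is hypothetical, and the timings in the usage lines are placeholders for illustration, not the measurements reported in Table 1.

```python
def speedup_table(elapsed_by_threads):
    """Return (threads, speedup, efficiency) rows relative to the 1-thread run.

    speedup    = T(1) / T(threads)
    efficiency = speedup / threads
    """
    base = elapsed_by_threads[1]
    rows = []
    for threads in sorted(elapsed_by_threads):
        speedup = base / elapsed_by_threads[threads]
        rows.append((threads, speedup, speedup / threads))
    return rows


if __name__ == "__main__":
    # Placeholder timings in seconds, NOT the values from Table 1.
    placeholder = {1: 40.0, 2: 20.0, 4: 16.0}
    for threads, speedup, efficiency in speedup_table(placeholder):
        print(threads, speedup, efficiency)
```

An efficiency that drops well below 1.0 as threads are added suggests serial sections, synchronization, or uneven work division are limiting the gain.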
This is just one example of how measurements can assist with analyzing the performance of a particular application on a multi-core CPU. If the application exists and can be tested, measurements are a robust technique to make performance observations, validate assumptions and predictions, and gain a greater understanding of the application and multi-core CPU. Performance measurements assist the analyst by quantifying the existing performance of an application. Through profiling tools, the analyst can identify the areas of an application that significantly impact performance, quantify speedups gained by adding threads, determine if work is evenly divided, and gain other important insights.
In the next section, these empirical observations will be supported with analytical models that assist with predicting performance under certain assumptions.
This section introduces analytical techniques for modeling the performance of multi-core CPUs. These techniques generate predicted performance under certain assumptions which would then validate measurements collected [ Jain91 ]. Conversely, measurements can validate the analytical models generated earlier, which is a more common sequence when a system or application does not yet exist. The three subsections introduce Amdahl's law, Gustafson's law, and computational intensity in the context of multi-core CPU performance, with examples for illustration.
This subsection will introduce Amdahl's law in the context of multi-core CPU performance and apply the law to the earlier example application that calculates Laplace's equation.
Though it was conceived in 1967, long before modern multi-core CPUs existed, Amdahl's law is still being used and extended in multi-core processor performance analysis; for example, to analyze symmetric versus asymmetric multi-core designs [ Hill08 ] or to analyze energy efficiency [ Woo08 ]. Amdahl's law describes the expected speedup of an algorithm through parallelization in relationship to the portion of the algorithm that is serial versus parallel [ Amdahl67 ].
The higher the proportion of parallel to serial execution, the greater the possible speedup as the number of processors (cores) increases. This law also expresses the possible speedup when the algorithm is improved. Subsequent authors have summarized Amdahl's textual description with the equations in Figure 4, where f is the fraction of the program that is infinitely parallelizable, n is the number of processors (cores), and S is the speedup of the parallelizable fraction f [ Hill08 ].
Figure 4: Amdahl's law - equations for speedup achieved by parallelization.
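The figure itself is not reproduced in this text. For reference, the form of Amdahl's law commonly given in the literature, written with the symbols f, S, and n defined above, is shown below. It is assumed, not confirmed, that this matches the notation of the original figure.

```latex
\mathrm{Speedup}(f, S) = \frac{1}{(1 - f) + \dfrac{f}{S}}
\qquad\qquad
\mathrm{Speedup}(f, n) = \frac{1}{(1 - f) + \dfrac{f}{n}}
```

The second form follows from the first by setting S = n, that is, by assuming the parallelizable fraction is divided perfectly across n cores. As n grows without bound, the speedup approaches 1 / (1 - f).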
These relatively simple equations have important implications for modern multi-core processor performance, because they place a theoretical bound on how much the performance of a given application may be improved by adding execution cores [ Hill08 ]. The application from section 3. Here, the first value for f , the fraction of the program that is infinitely parallelizable, was chosen to be 0.
For this particular program that value is a pessimistic assumption, as the vast majority of execution time is spent in the parallelized loops, not in the single-threaded allocation or epsilon synchronization. For comparison, an optimistic value for f is also graphed at 0.
The observed values from the prior example for up to 4 cores (processors) are also plotted. From this illustration it is clear that the predicted speedups according to Amdahl's law are pessimistic compared to our observed values, even with an optimistic value for f. At this point it is important to discuss several significant assumptions implicit in these equations.
The primary assumption is that the computation problem size stays constant when cores are added, such that the fraction of parallel to serial execution remains constant [ Hill08 ]. Other assumptions include that the work between cores is and can be evenly divided, and that there is no parallelization overhead [ Hill08 ]. Even with these assumptions, in the context of multi-core processors the analytical model provided by Amdahl's law gives useful performance insights into parallelized applications [ Hill08 ]. For example, if our goal is to accomplish a constant size problem in as little time as possible, within certain resource constraints, the area of diminishing returns in performance can be observed [ Hill08 ].
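The region of diminishing returns can be seen by evaluating the standard form of Amdahl's law for an increasing core count. This is a minimal sketch under the assumptions listed above (constant problem size, evenly divided work, no parallelization overhead). The function names are hypothetical, and f = 0.9 is an arbitrary illustrative value, not one of the values used for this paper's graph.

```python
def amdahl_speedup(f, n):
    """Predicted speedup for parallel fraction f (0..1) on n cores."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / ((1.0 - f) + f / n)


def amdahl_limit(f):
    """Upper bound on speedup as the core count grows without bound."""
    return float("inf") if f == 1.0 else 1.0 / (1.0 - f)


if __name__ == "__main__":
    f = 0.9  # arbitrary illustrative parallel fraction
    for n in (1, 2, 4, 8, 16, 64, 1024):
        print(n, round(amdahl_speedup(f, n), 3))
    print("limit:", amdahl_limit(f))
```

Each doubling of n yields a smaller gain than the previous one, and the curve flattens toward 1 / (1 - f) regardless of how many cores are added.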
Limited analytical data is required to make a prediction, i. However, our analysis goal might instead be to evaluate whether a very large size problem could be accomplished in a reasonable amount of time through parallelization, rather than minimize execution time. The next subsection will discuss Gustafson's law which is focused on this type of performance scenario for multi-core processors [ Hill08 ].
This subsection introduces Gustafson's law in the context of multi-core CPU performance, contrasts it with Amdahl's law, and applies the law to the earlier example for illustration and comparison. Gustafson's law (also known as Gustafson-Barsis' law) follows the argument that Amdahl's law did not adequately represent massively parallel architectures that operate on very large data sets, where smaller scales of parallelism would not provide solutions in tractable amounts of time [ Hill08 ].
Here, the computation problem size changes dramatically with the dramatic increase in processors (cores); it is not assumed that the computation problem size will remain constant. Instead, the fraction of the work that is parallelized approaches one [ Gustafson88 ].
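The contrast with Amdahl's law can be made concrete with the commonly cited form of Gustafson's scaled speedup, (1 - f) + f * n. Here f is taken to be the parallel fraction of the run time as observed on the n-core system, which is a different measurement basis from Amdahl's f. This is a minimal sketch; the function names are hypothetical and f = 0.9 is an arbitrary illustrative value.

```python
def gustafson_speedup(f, n):
    """Scaled speedup when the problem size grows with the core count.

    f is the parallel fraction of run time measured on the n-core system.
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be between 0 and 1")
    return (1.0 - f) + f * n


def amdahl_speedup(f, n):
    """Fixed-size speedup, included only for side-by-side comparison."""
    return 1.0 / ((1.0 - f) + f / n)


if __name__ == "__main__":
    f = 0.9  # arbitrary illustrative parallel fraction
    for n in (1, 4, 16, 64):
        print(n, round(amdahl_speedup(f, n), 2), round(gustafson_speedup(f, n), 2))
```

Under the fixed-size assumption the speedup saturates, while under the scaled-size assumption it grows linearly in n with slope f.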