Monday, April 14, 2008

Optimising Code for Faster Processors

Whenever a new processor is launched, the issue of optimising software applications for that processor arises. In fact, a lack of optimised software often harms a new processor's chances in the market. This was even the case with Intel's Pentium 4.

In many ways, the computer industry is a fast lane, where software is the fuel running cars (read: computers) equipped with faster and better engines (processors). If there is no optimised fuel for the cars to run on, the performance improvement in successive engines can never be gauged.
When the cost of a system is the driving factor behind its purchase, the method of choice for number-crunching applications often turns out to be writing optimised software. Although slower than a customised chip, the software solution has the advantage of being modifiable and reusable: a few simple modifications allow the user to adapt the programme to other needs.
To solve the lack of upgradability of many imaging systems, many people call for an "open system" approach to designing these tools. Most of the materials used to build the equipment would be "off the shelf" components. The system's operation would then be determined by in-house software rather than by the unmodifiable custom-built hardware chip that was the heart of such systems until recently. The hardware costs for the open system approach are lower than for custom-built components, and the software can be created relatively quickly using the vast number of software libraries available to programmers.
An advantage of this technique is that whenever the system is not in use, it can serve other applications such as word processing. Another advantage is that what previously required two different specialised machines can now be done on a single computer, simply by using two different software components and peripherals.

As long as one follows the open system approach, the hardware can be replaced as the old one becomes obsolete. The software can be easily ported from the old station to the new one, or recompiled to fit the architecture of the new host machine.

Writing your own software also has the advantage of fostering code reusability. The code is also easily modifiable if it doesn't fit the needs anymore.

The problem with having people write their own software is that the results will vary a lot depending on the level of knowledge of the programmer. The run-time of a programme is highly dependent on the skills of the programmer and the optimisation techniques used. It can make the difference between a system being too slow for some applications and the same system (with different software) being acceptable for the task at hand.
Code optimisation and data prefetching are two of a multitude of techniques that can enhance the performance of software.

What is Code Optimisation?
It can be defined as writing code so it runs as fast as possible on its host computer. The best way to achieve this goal is to write the code in a low-level language, such as assembly. Assembly language is a non-intuitive way of writing code; in other words, its structure is less "language-like" than that of high-level languages. Because it is non-intuitive, the development time is longer, which drives up the cost of developing the software. Also, very few people are familiar with assembly language. The best of both worlds is to embed assembly language instructions in high-level code. The programmer can then write most of the code in an intuitive high-level language and use assembly only for the small parts where optimisation is required to improve the programme's run-time.
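
As a minimal sketch of the idea, assuming the GCC/Clang inline-assembly syntax on x86 (Visual C++ uses an __asm block instead), a tiny hot spot can be hand-coded while the rest of the programme stays in C:

    #include <stdio.h>

    /* Most of the routine stays in C; only the innermost operation is
       written in assembly. The "0" constraint ties input a to the same
       register as the output, so the add accumulates b into it. */
    static int add_asm(int a, int b)
    {
        int result;
        asm("addl %2, %0"          /* result = a + b, done in registers */
            : "=r"(result)
            : "0"(a), "r"(b));
        return result;
    }

    int main(void)
    {
        printf("%d\n", add_asm(2, 3));   /* prints 5 */
        return 0;
    }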

Most high-level language compilers offer options as to what type of code to generate at compile time; typical choices include optimising for run-time or for code size. An action as simple as checking the "optimise for time" option box can generate notable improvements in the processing time of a programme.
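
As an illustration, assuming the Visual C++ and GCC command-line compilers (exact flag names vary by compiler and version), the speed/size trade-off is selected like this:

    cl /O2 program.c                  (Visual C++: optimise for maximum speed)
    cl /O1 program.c                  (Visual C++: optimise for minimum size)
    gcc -O2 program.c -o program      (GCC: optimise for speed)
    gcc -Os program.c -o program      (GCC: optimise for size)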

Golden Rules of Code Optimisation
Don't optimise as you go: Write your programme without regard to possible optimisations, concentrating instead on making sure that the code is clean, correct, and understandable. If it's too big or too slow when you've finished, then you can consider optimising it.

Remember the 80/20 rule: In many fields, you can get 80% of the result with 20% of the effort (also called the 90/10 rule - it depends on whom you talk to). Whenever you're about to optimise code, find out where that 80% of execution time is going, so you know where to concentrate your effort.

Always run "before" and "after" benchmarks: How else will you know that your optimisations actually made a difference? If your optimised code turns out to be only slightly faster or smaller than the original version, undo your changes and go back to the original, clear code.
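
A minimal sketch of such a benchmark in C, using the standard clock() function; work() here is a hypothetical stand-in for the routine being optimised. Run the same input through the original and the optimised build and compare the two timings:

    #include <stdio.h>
    #include <time.h>

    /* Hypothetical stand-in for the routine under test. */
    static double work(void)
    {
        double s = 0.0;
        long i;
        for (i = 0; i < 10000000L; i++)
            s += (double)i * 0.5;
        return s;
    }

    int main(void)
    {
        clock_t start = clock();            /* "before" timestamp */
        double result = work();
        clock_t stop = clock();             /* "after" timestamp */
        printf("result %.0f, run-time %.3f s\n",
               result, (double)(stop - start) / CLOCKS_PER_SEC);
        return 0;
    }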

Use the right algorithms and data structures: For example, don't use an O(n²) bubblesort algorithm to sort a thousand elements when there's an O(n log n) quicksort available. Similarly, don't store a thousand items in an array that requires an O(n) search when you could use an O(log n) binary tree.
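
For instance, a minimal C sketch that leans on the standard library's qsort (typically O(n log n) on average) instead of a hand-rolled O(n²) bubblesort:

    #include <stdlib.h>

    /* Comparison callback required by the standard library's qsort. */
    static int cmp_int(const void *a, const void *b)
    {
        int x = *(const int *)a, y = *(const int *)b;
        return (x > y) - (x < y);
    }

    /* Prefer the library's sort to a hand-written bubblesort. */
    void sort_items(int *items, size_t n)
    {
        qsort(items, n, sizeof items[0], cmp_int);
    }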

Use efficient loops: Since loops repeat, any inefficiency inside them is compounded. An error as simple as initialising a variable inside the loop when it would have worked just fine outside can increase the run-time dramatically.
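
One common case is loop-invariant work performed on every pass. A minimal C sketch of the fix (the function names are illustrative):

    #include <math.h>

    /* Slow: the loop-invariant scale factor is recomputed every pass. */
    void scale_slow(double *v, int n, double degrees)
    {
        int i;
        for (i = 0; i < n; i++)
            v[i] *= sin(degrees * 3.14159265358979 / 180.0);
    }

    /* Fast: the invariant is computed once, outside the loop. */
    void scale_fast(double *v, int n, double degrees)
    {
        int i;
        double s = sin(degrees * 3.14159265358979 / 180.0);
        for (i = 0; i < n; i++)
            v[i] *= s;
    }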

Define variables that are used at the same time sequentially: Computers must fetch data from memory, and memory is brought into the cache in blocks. If variables that are used together are defined sequentially, there is a good chance that a single fetch will bring them all into the cache. See the next topic, data prefetching, for more information.
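
A minimal sketch: in this hypothetical structure, fields that are used together are declared together, so they are likely to share a cache line and be brought in by a single memory fetch:

    /* Group the data the inner loop touches together in memory. */
    struct particle {
        float x, y, z;       /* position: read together in the inner loop */
        float vx, vy, vz;    /* velocity: updated alongside the position  */
    };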

Do only the necessary input/output: Input/output to peripherals takes time and should be kept to a minimum. A counter that says "XX% complete" is inefficient and should not be used inside a loop; an increase in run-time of an order of magnitude can be expected with such messages. If a message to the user is required, use a general form like "please wait while this processes."
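
A minimal C sketch of this rule; process_item() is a hypothetical stand-in for the per-item work:

    #include <stdio.h>

    /* Hypothetical per-item routine. */
    static void process_item(int i) { (void)i; /* ... real work ... */ }

    /* Per-iteration progress messages can dominate a tight loop's run-time;
       a single "please wait" message does the same job at almost no cost. */
    void process_all(int n)
    {
        int i;
        printf("Please wait while this processes...\n");
        for (i = 0; i < n; i++)
            process_item(i);          /* no I/O inside the loop */
        printf("Done.\n");
    }
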
These rules are not a panacea, but they are a good indication of how well a programme will perform.

Data Prefetching
During the past decade, CPU performance has outpaced that of dynamic RAM, the primary component of main memory. It is now not uncommon for scientific programmes to spend more than half their run-time stalled on memory requests.

Data prefetching is one of the techniques used to reduce or hide the large latency of main-memory accesses. With data prefetching, the memory system brings data into the cache before the processor needs it, overlapping the fetch with ongoing computation.

As an example, suppose a programme executes an FFT of an image of 300 by 400 pixels: at one byte per pixel, that is 300 × 400 = 120,000 bytes, so roughly 120 Kbytes are needed to store the image alone. Now suppose the result of the FFT is multiplied with a filter of the same size and the result is stored in a different array. It becomes obvious that most caches will not be large enough to hold all this data, and the computer will have to keep it in main memory.

The processor executes the calculations, getting the necessary information from the cache until the information cannot be found there. On such a miss, the processor requests the data from the cache controller, which fetches it from main memory. While this fetch is being executed, the processor stalls, wasting precious cycles and increasing the total programme run-time. If it were possible to always have the required data in the cache, the run-time of the programme would improve.

One has to use caution with data prefetching: when one block of data is brought into the cache by a prefetching request, another block will likely need to be evicted. If the evicted data is the data currently required, the processor will have to wait for it to be brought back into the cache.

Because of this, the programme might actually run slower than it would have without the prefetching instruction. Prefetch timing is critical for prefetching to show notable improvements in the run-time of a programme: done too late, the processor will still wait for the data; done too early, the required data might be evicted from the cache before it is used.

Of the prefetching techniques available, let's discuss only software-initiated prefetching. The other technique, hardware-initiated prefetching, won't be discussed here because it involves no programmer or compiler intervention.

With software prefetching, before the processor requires the data, a fetch instruction specifies the required address to the memory system, which forwards the word to the cache. Because the processor does not need the data yet, it can continue computing while the memory system brings the requested data to the cache.

Before you plan to use data prefetching in your next programme, you need to know whether your microprocessor provides a fetch instruction. Also, some compilers have optimisation schemes that include prefetching statements. If you want to include your own prefetching statements, you should limit yourself to loops: predicting the memory access patterns of non-loop code is unreliable and could even lengthen execution time, since a fetch instruction does itself consume processor time.
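
As a sketch of what a software prefetch looks like in a loop, assuming the GCC/Clang __builtin_prefetch builtin (Visual C++ offers a comparable _mm_prefetch intrinsic); AHEAD is a tuning parameter subject to the timing trade-off discussed above:

    /* Prefetch distance: too small and the data arrives late,
       too large and it may be evicted before it is used. */
    #define AHEAD 16

    double sum_array(const double *a, int n)
    {
        double s = 0.0;
        int i;
        for (i = 0; i < n; i++) {
            if (i + AHEAD < n)
                __builtin_prefetch(&a[i + AHEAD], 0, 1);  /* read hint */
            s += a[i];
        }
        return s;
    }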

If the compiler you are using doesn't include prefetching optimisation, designing for data prefetching by hand might not be the best solution: it is unlikely to be profitable, as too much time will be spent designing the code for only marginal improvements in run-time. Finally, let us take a look at some general guidelines for optimising computer code:

Planning

  • Determine the magnitude of the effort required for the port. Gauge how much work is involved by identifying the following items:
    • Identify problem 32-bit code. Compile your 32-bit code with the new optimised compiler. For example, a compiler customised for the Pentium 4 can be downloaded for Visual C++ 6.0 from the Microsoft website.
    • Identify shared components or dependencies. Determine which components in your application originate from other teams and whether those teams plan to develop 32-bit versions of their code.
    • Identify legacy or assembly code. 16-bit Windows-based applications do not run on 64-bit Windows and must be rewritten.
  • Port the entire application, not just portions of it. Although it is possible to port pieces of an application or to limit code to 2G with /LARGEADDRESSAWARE:NO, this strategy trades short-term gain for long-term pain.
  • Find substitutes for technologies that will not be ported. Some technologies, including DAO (Data Access Object) and the Jet Red database engine, will not be ported to 64-bit Windows.
  • Treat your customised software as a separate product release. Even though your Pentium 4-optimised product may share the same code base as your Pentium I, II, or III product, it needs additional testing and may have other release considerations.

Development

  • Ensure that your code can be compiled for both PIII and P4 processors. The new data model was designed to allow applications to be built from a single code base with few modifications.
  • Use the compiler's new optimisation features for best performance. Compile-time code optimisation is more important for the newest IA-32 processors than it was for earlier x86 parts, as the compiler takes on many of the optimisation functions previously handled by the microprocessor. You can maximise the performance of a 32-bit application by using two new optimisation features of the compiler: Profile Guided Optimisation and Whole Program Optimisation. Both features result in longer build times and require the early development of good test scenarios.
    • Profile Guided Optimisation involves a two-step compile process. During the first compile, the code is instrumented to capture the execution behaviour. This information is used during the second compile to guide all optimisation features (see the build-command sketch after this list).
    • Whole Program Optimisation analyses the code in all application files, not just a single one. This approach increases performance in several ways, including better inlining, as well as improved side-effect analysis and custom calling conventions.
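
A minimal sketch of the two-step Profile Guided Optimisation build, using GCC's flags as one concrete illustration (Visual C++ exposes the same idea through its link-time code generation options; exact flag names vary by compiler and version):

    gcc -O2 -fprofile-generate prog.c -o prog    (step 1: instrumented build)
    ./prog < representative_input.dat            (run a good test scenario)
    gcc -O2 -fprofile-use prog.c -o prog         (step 2: profile-guided build)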

Testing

  • Determine whether you'll test 64- or 32-bit code running in WOW64. Some applications include both native 64-bit code and 32-bit code running in WOW64. Investigate this closely while developing a test plan, and decide whether your test tools should be 64-bit, 32-bit, or a combination. You will often need to test both the 64- and 32-bit versions of your application on 64-bit Windows.
  • Test frequently-used 32-bit components. First, recompile your code as 64-bit and test it. Next, fix any problems, recompile as 32-bit, and test again. Finally, recompile as 64-bit and test once more.
  • Test COM and RPC components. Make sure that both 32- and 64-bit COM and RPC components communicate correctly. You may also have to test communications with 16-bit components over a network.
  • Test your 32-bit version on 64-bit Windows. Customers can continue to use 32-bit applications on 64-bit Windows where performance and memory issues are not major considerations.
  • Test different memory configurations. Adding large amounts of memory on the server sometimes exposes previously unnoticed problems in either the application or the operating system.