David Skillicorn - Older Papers
This page attempts to make my papers available on the web without violating copyrights. The papers here are either early versions of papers that have appeared, in which case a hint is given about where they did eventually appear, or drafts, about which I would appreciate comments.
They are divided by topic like this:
BSP Design and Performance
There are three big challenges for mainstream parallel computing: building useful hardware platforms, designing programming models that are effective, and designing a software construction process that builds correctness into software. The first has largely been solved, at least for current technology. The second has been an active area of research for perhaps fifteen years, while work on the third has barely begun. In this chapter, we describe the Bulk Synchronous Parallel (BSP) model which, at present, represents the best compromise among programming models for simplicity, predictability, and performance. We describe the model from a software developer's perspective and show how its high-level structure is used to build efficient implementations. Almost alone among programming models, BSP has an associated cost model so that the performance of programs can be predicted on any target without laborious benchmarking. Some progress towards software construction has also been made in the context of BSP.
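For orientation (this is the standard form of the BSP cost model rather than anything quoted from the chapter), each superstep is charged for its computation, its communication, and its barrier, so a program costs roughly $\sum_i (w_i + h_i g + l)$, where $w_i$ is the largest amount of local work done by any processor in superstep $i$, $h_i$ is the largest number of words sent or received by any processor in that superstep, $g$ is the network's permeability (time per word of continuous traffic), and $l$ is the cost of barrier synchronisation.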
with Stephen Donaldson and Jonathan Hill.
We describe a transport protocol suitable for BSPlib programs running on a cluster of PCs connected by a 100Mbps Ethernet switch. The protocol provides a reliable packet-delivery mechanism that uses global knowledge of a program's communication pattern to maximise the performance of the switch. The performance is comparable to previous low-latency protocols on similar hardware, but the addition of reliability means that this protocol can be directly used by application software. For a modest budget of $US20,000 it is possible to build a machine that outperforms an IBM SP2 on all the NAS benchmarks (BT +80%, SP +70%, MG +9%, and LU +65% improvement), and an SGI Origin 2000 on half (BT +10%, SP -24%, MG +10%, and LU -28%). The protocol has a CPU overhead of 1.5 microsecs for packet download and 3.6 microsecs for upload. Small packets can be communicated through the switch in a pipelined fashion every 21 microsecs. Application-to-application one-way latency is 29 microsecs plus the latency of the switch. A raw link bandwidth of 93Mbps is achieved for 1400-byte packets, and 50Mbps for 128-byte packets. This scales to eight processors communicating at 91Mbps per link, to give a sustained global bandwidth of 728Mbps.
Postscript (Oxford TR-5-98)
European sites may wish to download it from Oxford.
Comment: This paper presents some compelling performance data suggesting that models such as BSP gain an advantage from the implicit global data available to each thread. This knowledge makes some novel protocols possible. What it doesn't explicitly say is that this same argument suggests that LogP is at a performance disadvantage with respect to BSP as well.
(Invited Talk at Europar '98, September 1998. Joint work with Jonathan Hill and Stephen Donaldson.)
Many popular parallel programming models are based on independent threads executing on each processor, with some sort of coordination mechanism to handle communication. Each thread knows very little about what other threads are doing.
In contrast, in models that are based on machine-wide operations, such as skeletons and BSP, each thread has a great deal of explicit and implicit knowledge about what other threads are doing. This knowledge can be exploited to make better decisions about when and how to communicate, which in turn results in improved performance.
We illustrate how global knowledge is used in the design of BSPlib, where the performance gains occur, and how large they are. So much improvement is possible that explicit extra communication might be worth using for models that would not otherwise share thread state.
This talk is loosely based on the paper "BSP Clusters", although it is not nearly so technical.
with Stephen Donaldson and Jonathan Hill.
The BSP cost model measures the cost of communication using a single architectural parameter, g, which measures the permeability of the network to continuous traffic. Architectures such as networks of workstations pose particular problems for high-performance communication because it is hard to achieve high communication throughput, and even harder to do so predictably. Yet both of these are required for BSP to be effective. We present a technique for controlling applied communication load that achieves both. Traffic is presented to the communication network at a rate chosen to maximise throughput and minimise its variance. Significant performance improvements can be achieved compared to unstructured communication over the same transport protocols, as used by, for example, MPI.
with Cristina Boeres and Vinod Rebello,
Keywords: latency, linear model, overhead model, multi-stage scheduling, MSA.
A version appears in the proceedings of Europar'98.
Comment: BSP's requirement that programs use supersteps feels restrictive to programmers in some application domains (interestingly, many `traditional' scientific programmers find supersteps very natural). This paper explores how well an arbitrary task graph can be packed into supersteps, i.e. whether doing so extends its makespan. It turns out that the packing does pretty well.
with M. Danelutto, S. Pelagatti, and A. Zavanella,
We describe the use of the BSP cost model to optimise programs using skeletons or data-parallel operations, in which program components may have multiple implementations. The use of BSP transforms the problem of finding the best implementation choice for each component that minimises overall execution time into a one-dimensional minimisation problem. An algorithm which finds optimal implementations in time linear in the length of the program is given.
Keywords: skeletons, P3L, optimization, optimisation, shortest path, cost models.
A shorter version appears in the proceedings of Europar'98.
Comment: Many skeleton-based approaches face the problem of optimising compositions of skeletons when there are multiple choices for each one. Not much progress has been made because the data arrangements needed to interface between each pair of possible implementations get complicated. This paper explains how the BSP cost model simplifies the problem setting, and then gives a shortest-path algorithm that finds the optimal choices. It's not technically difficult, but it dramatically improves the state of the art.
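To make the comment above concrete, here is a small hypothetical sketch (invented names and cost numbers, not the algorithm as presented in the paper) of the shortest-path idea: pick one implementation per component so that the sum of execution costs and interface (data-rearrangement) costs is minimal, in a single pass that is linear in the number of components.

    # Hypothetical sketch: pick one implementation per program stage so that the
    # sum of execution and interface (data-rearrangement) costs is minimal.
    # All names and cost numbers below are invented for illustration.

    def best_choices(exec_cost, glue_cost):
        """exec_cost[i][a]    : cost of running stage i with implementation a.
        glue_cost[i][a][b]    : cost of converting the data layout produced by
                                implementation a of stage i into the layout
                                expected by implementation b of stage i+1.
        Returns (minimal total cost, chosen implementation index per stage)."""
        best = list(exec_cost[0])            # cheapest way to finish stage 0 via each choice
        back = []                            # back-pointers for recovering the choices
        for i in range(1, len(exec_cost)):
            stage_best, stage_back = [], []
            for b, run in enumerate(exec_cost[i]):
                cands = [best[a] + glue_cost[i - 1][a][b] + run
                         for a in range(len(exec_cost[i - 1]))]
                a_min = min(range(len(cands)), key=cands.__getitem__)
                stage_best.append(cands[a_min])
                stage_back.append(a_min)
            best = stage_best
            back.append(stage_back)
        last = min(range(len(best)), key=best.__getitem__)
        chosen = [last]
        for stage_back in reversed(back):    # walk the back-pointers to the front
            chosen.append(stage_back[chosen[-1]])
        chosen.reverse()
        return best[last], chosen

    # Three stages, two candidate implementations each (costs invented).
    exec_cost = [[10, 14], [20, 12], [8, 9]]
    glue_cost = [[[0, 5], [4, 0]],           # interfaces between stages 0 and 1
                 [[0, 3], [6, 0]]]           # interfaces between stages 1 and 2
    print(best_choices(exec_cost, glue_cost))  # -> (35, [1, 1, 1])

Because the BSP cost model reduces each component's execution cost and each interface's cost to a single number, the whole optimisation collapses to this one-dimensional pass.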
with J.M.D. Hill and S. Donaldson.
The BSP cost model makes a new level of power available for designing parallel algorithms. First, it models the actual behaviour of today's parallel computers, and so can be used to choose appropriate algorithms without completely implementing them. Second, it becomes possible to characterise the range of architecture performance over which a particular algorithm is the best choice. This provides the foundation for developing software that is portable both at the source code level and in its expectation of performance. We illustrate by comparing three possible implementations of broadcast, and show that a two-phase broadcast algorithm outperforms other techniques whenever the size of the data is large relative to the cost of synchronisation, and that broadcasting using trees is never a good technique (despite its continued popularity). We carry out a similar analysis for samplesort, and show that samplesort cannot perform well on networks of workstations unless the network bandwidth exceeds a certain threshold.
This paper appears in the proceedings of Massively Parallel Programming Models '97, London, November 1997, IEEE CS Press.
Comment: This paper gives a nice example of the clarity that the BSP cost model provides, and how this pays off in design. Poor broadcast techniques are still widely used, however. After reading this paper you'll know better.
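For a flavour of the analysis (a sketch using the usual BSP h-relation accounting, not an excerpt from the paper), consider broadcasting $n$ words from one of $p$ processors. Sending the whole message directly to everyone costs about $n(p-1)g + l$; a binary tree costs about $\lceil\log_2 p\rceil(ng + l)$; and the two-phase method, which scatters $n/p$-word pieces and then totally exchanges them, costs about $2\frac{n}{p}(p-1)g + 2l \approx 2ng + 2l$. Once $ng$ is large compared with $l$, the two-phase cost is essentially independent of $p$, while the tree pays its $\log p$ factor on both terms, which is why it never ends up being the best choice.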
Bulk Synchronous Parallelism (BSP) as a parallel model enables accurate costs of parallel programs to be predicted from the program structure and two architectural parameters, $g$, the permeability of the network, and $l$, the time required for barrier synchronisation. Networks such as ATM already play a role in parallel computers built as networks of workstations, and may become the standard mechanism for interconnecting processors at all scales. We present an analytic model for determining the BSP parameters of such architectures. Although the model is simple, there is substantial agreement with measured results where these are known. This represents the first time that these architectural parameters have been determined other than by benchmarking, and suggests that the approach may be serviceable for other wormhole routed networks.
Keywords: parallel computing, interconnection network, performance modelling, total exchange, bulk synchronous parallelism, latency, throughput.
Postscript (Queen's TR97-414)
Comment: This paper is the first analytic determination of BSP parameters. The results obtained agree quite well with those measured by other people. It doesn't look as if ATM is likely to become important in parallel computers, but the modelling techniques used here may be adaptable to some other networks.
The BSP cost model measures the cost of communication using a single architectural parameter, $g$, which measures the permeability of the network to continuous traffic. Some architectures, typically networks of workstations, pose particular problems for high-performance communication because it is hard to achieve high throughput, and even harder to do so predictably. Yet both of these are required for BSP to be effective.
We present a technique for controlling applied communication load that achieves both. Traffic is presented to the communication network at a rate chosen to maximise throughput and minimise its variance. Performance improvements as large as a factor of two over MPI can be achieved.
Keywords: Bulk synchronous parallelism, network performance, throttling, MPI, CSMA/CD, Ethernet, slotting.
Postscript (Oxford TR-40-97)
European sites may wish to download it from Oxford. A shorter version appears in the proceedings of Europar'98.
Comment: It's interesting that even a clunky protocol like TCP/IP can be controlled well enough to match its low-level behaviour to a target network. This is another illustration of how BSP's view of communication as a bulk behaviour of programs pays performance dividends.
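The throttling idea described above can be illustrated with a toy pacing loop (purely illustrative, not the BSPlib transport described in the paper): packets are handed to the network on a fixed cadence chosen as a fraction of the link capacity, rather than as fast as the sender can produce them.

    import time

    def paced_send(packets, send, target_bps, packet_bytes):
        """Toy pacing loop: hand packets to the network on a fixed cadence chosen
        as a fraction of link capacity, instead of as fast as possible.
        `send` is whatever function actually transmits one packet."""
        interval = (packet_bytes * 8) / target_bps   # seconds between packet starts
        next_slot = time.monotonic()
        for pkt in packets:
            now = time.monotonic()
            if now < next_slot:
                time.sleep(next_slot - now)          # wait for our transmission slot
            send(pkt)
            next_slot += interval                    # cadence is fixed, independent of jitter

    # Example: 1400-byte packets at 80% of a 100 Mbit/s link (values invented).
    if __name__ == "__main__":
        sent = []
        paced_send([b"x" * 1400] * 5, sent.append, target_bps=0.8 * 100e6, packet_bytes=1400)
        print(len(sent), "packets sent")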
The cost of communication in message-passing systems can only be computed based on a large number of low-level details. Consequently, the only architectural measure they naturally suggest is a first order one, latency. We show that a second-order property, the standard deviation of the delivery times, is also of interest. Most importantly, the average performance of a large communication system depends not only on the average performance of its components, but also on the standard deviation of these performances. In other words, building a high-performance system requires components that are themselves high-performance, but their performance must also have small variance. We illustrate this effect using distributions of the BSP g parameter. Lower bounds on the communication performance of large systems can be derived from data measured over single links.
Keywords: latency, network performance, Bulk synchronous parallelism, BSP.
Postscript (Oxford TR-39-97)
European sites may wish to download it from Oxford.
A shorter version of this paper appears in the Proceedings of High Performance Computing and Networks (HPCN'98) and was reprinted in Future Generation Computer Systems 1999.
Comment: This paper illustrates the simple but little-appreciated result that having large variance in the delivery times of, e.g., individual links in a network affects not only the variance of the overall system behaviour but also its average. Thus you can tell how bad a large system is going to be just by knowing how its pieces perform. The bottom line is that manufacturers need to pay as much attention to reducing variance as to reducing averages. They don't at the moment.
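A tiny simulation (illustrative only, with an invented Gaussian model of per-link delivery time) makes the point: the exchange finishes when the slowest of $p$ links finishes, so the average completion time rises with the per-link standard deviation even though the per-link mean stays fixed.

    import random

    def mean_completion(p, mu, sigma, trials=20000):
        """Average time for a p-link exchange whose per-link delivery times are
        Gaussian with mean mu and standard deviation sigma; the exchange is only
        finished when its slowest link is."""
        total = 0.0
        for _ in range(trials):
            total += max(random.gauss(mu, sigma) for _ in range(p))
        return total / trials

    if __name__ == "__main__":
        for sigma in (0.1, 1.0, 5.0):
            print("sigma =", sigma, "-> mean completion ~", round(mean_completion(64, 100.0, sigma), 1))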
The Bulk Synchronous Parallel model costs programs using three parameters, the processor speed (s), the network permeability (g), and the superstep overhead (l). This simple model is accurate over a wide variety of applications and parallel computers. However, all real parallel computers exhibit behaviour that is not captured by the cost model.
This paper is an extensive study of the accuracy and stability of g for a wide range of parallel computers. A series of parameterised kernel benchmarks is applied to each communication system, and the resulting values of g tabulated. These values are normally distributed with small standard deviation across all of the architectures investigated.
There will always be some variation between a computer's behaviour in reality and that predicted by the cost model. We extend the cost model with a confidence parameter that describes the standard deviation in g as a function of applied traffic pattern. This can be regarded as providing parameterised kernel benchmarks for each architecture. Programmers can use this information to decide how much discrepancy they are likely to encounter when using the cost model in situations where this is critically important. In general, the results show that the BSP g parameter does accurately predict communication performance for real parallel computers.
Keywords: network performance, performance modelling, Bulk synchronous parallelism, BSP, Silicon Graphics Power Challenge, Silicon Graphics Origin 2000, IBM SP2, Cray T3D, Cray T3E, Intel Pentium Pro, Ethernet, cost modelling.
Postscript (Oxford TR-33-97)
European sites may wish to download it from Oxford.
Comment: Do BSP's ridiculously simple architectural parameters really reflect how machines behave? This paper answers that question by exhaustively analysing performance under a benchmark that applies a wide mixture of communication patterns.
Keywords: BSP, barrier synchronisation, barrier synchronization, total exchange, broadcast, performance modelling
Postscript (Oxford TR-21-96)
This paper appears as Oxford University Computing Laboratory Technical Report 96-21. European sites may wish to download it from Oxford. A version appears in the proceedings of HPCN'97, Springer LNCS1225, 762-771 and is reprinted in the March 1998 issue of Future Generation Computer Systems.
Keywords: barrier synchronisation, barrier synchronization, lock, semaphore, cache coherence, total exchange, hardware barriers, collective communication.
Postscript (Oxford TR-16-96)
This paper appears as Oxford University Computing Laboratory Technical Report 96-16. European sites may wish to download it from Oxford. A short version appears in the proceedings of Euromicro'98, Barcelona, January 1998.
Comment: Barriers have the reputation of being inherently expensive. This prejudice is ably supported by manufacturers who seem unable to design barriers that perform well (this is often because simple OS activities are horribly expensive -- local locks taking hundreds of microseconds are common). In this paper we show how to implement barriers right. For shared memory architectures, this often means tens of microseconds. For distributed memory architectures, things are worse, largely because you can't get enough control of the machine (so it does a context switch you can't avoid, for example).
Since this paper was written, things have changed with the introduction of network interface cards mapped into user process memory. Expect some new results.
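For readers curious about what a sensible shared-memory barrier looks like, here is a textbook sense-reversing barrier (a generic sketch for illustration, not one of the implementations measured in the paper; in Python it shows the structure only, since interpreter locking hides the real costs).

    import threading

    class SenseBarrier:
        """Textbook sense-reversing barrier: the last thread to arrive flips the
        shared sense and wakes everyone; earlier arrivals wait for the flip."""
        def __init__(self, n):
            self.n = n
            self.count = n
            self.sense = False
            self.cond = threading.Condition()

        def wait(self):
            with self.cond:
                my_sense = not self.sense
                self.count -= 1
                if self.count == 0:            # last arrival: reset for the next round
                    self.count = self.n
                    self.sense = my_sense
                    self.cond.notify_all()
                else:
                    while self.sense != my_sense:
                        self.cond.wait()

    if __name__ == "__main__":
        barrier = SenseBarrier(4)
        def worker():
            for _ in range(3):                 # three "supersteps"
                barrier.wait()                 # nobody starts the next step early
        threads = [threading.Thread(target=worker) for _ in range(4)]
        for t in threads: t.start()
        for t in threads: t.join()
        print("all supersteps completed")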
Keywords: Bulk synchronous parallelism, program transformation, cluster computing, dataflow.
Comment: This paper argues that the cost of a barrier is modelled fairly well by pg (i.e. a total exchange). This is indeed true for almost all architectures in use today. With this simplification, some nice BSP transformation rules exist. I've never quite been able to decide what the point of the paper is, however, which is why it's never appeared anywhere.
Keywords: BSP, program transformation, cost measures.
Comment: One of the nicest properties of BSP is that two programs running on the same platform should not, according to the model, interfere with each other's communication. This isn't quite true for architectures like the CRAY T3D and T3E, but it's not a bad assumption for most other architectures.
This makes it possible to work out how best to allocate a program to a compute server so that it is in balance, i.e. so that its communication needs match its compute needs.
Keywords: parallel programming, parallel software development, program transformation, refinement, structure-directed refinement, refinement calculus, weakest precondition, bulk synchronous parallelism.
Postscript (Queen's TR96-400)
A version of this paper appears in the Formal Methods for Parallel Programming and Applications workshop at IPPS/SPDP'98.
Comment: One of the reasons why formal methods are not popular for sequential program development is that each tiny statement in a program is produced by the application of a (possibly complex) law. Things are different in parallel programming, for the application of a single law can produce the entire top-level structure of a program. There seems to be much more hope for cost-effective applications of formal approaches in the parallel setting.
This paper illustrates this, using a small example. The big win will come when someone figures out how (or whether) data refinement can be applied to automatically generate data distributions (and the matching control distributions).
Keywords: parallel programming model, bulk synchronous parallelism, barrier synchronisation, cost modelling, superstep, locality, BSPlib, SGI Powerchallenge, Cray T3D, IBM SP2, Hitachi SR2001, Convex Exemplar, Digital Alpha Farm, Parsytec GC, MPI.
Postscript (Oxford TR-15-96, revised to final form 11th November 1996.)
This paper appears as Oxford University Computing Laboratory Technical Report 96-15. European sites may wish to download it from the BSP Worldwide Page.
An improved version of this paper appears in Scientific Programming, Vol. 6, No. 3, 1997, 249-274.
Comment: This is the definitive introduction to BSP.
Parallel Programming Models
We examine plausible motivations for both using and building computational grids. We find two reasons to use such grids: the existence of a workload in which tasks have deadlines, but the load varies over time; and the existence of an upper limit on cost-effective parallel systems, forcing replication when greater degrees of parallelism are required. We speculate that there may be scope for public grids, in which protecting the integrity of information is not guaranteed, but that there is much larger potential for virtual private grids within organizations. In both cases, the form of markets, execution planning, and pricing is likely to be different from the frictionless markets predicted in the literature.
Postscript (Queen's TR2001-450)
A version of this paper appears in the Workshop on Global and Peer-to-Peer Computing on Large Scale Distributed Systems at CCGrid 2002.
with Freeman Huang
We take the position that large-scale distributed systems are better understood, at all levels, when locality is taken into account. When communication and mobility are clearly separated, it is easier to design, understand, and implement goal-directed agent programs. We present the Spider model of agents to validate our position. Systems contain two kinds of entities: spiders, which represent service providers, and arms, which represent goal-directed agents. Communication, however, takes place only between an arm and the spider at which it is currently located. We present both a formal description of the model using the ambient calculus, and a Java-based implementation.
Postscript (Queen's TR2001-447)
(A shorter version appears in the proceedings of Third International Workshop on Mobile Agents for Telecommunication Applications, Montreal, August 2001.)
with Susanna Pelagatti
The Network of Tasks (NOT) model allows adaptive node programs written in a variety of parallel languages to be connected together in an almost acyclic task graph. The main difference between NOT and other task graphs is that it is designed to make the performance of the graph predictable from knowledge of the performance of the component node programs and the visible structure of the graph. It can therefore be regarded as a coordination language that is transparent about performance. For large-scale computations that are spread across geographically-distributed compute servers, the NOT model helps programmers to plan, assemble, schedule, and distribute their problems.
Submitted to a journal.
Conventional software development does not allow proper design choices to be made when abstract machine interfaces are involved. The best design at one level of abstraction does not necessarily lead to the best design at the next, and subsequent levels. This problem has not been obtrusive in imperative, sequential programming, but becomes much more significant for targets such as superscalar processors, FPGAs, and parallel computers.
We suggest a perspective for software development where the goal of each phase is to produce all potential implementations to be passed to the next phase. We explore the implications of this simple idea, expressing the process of software design as a sequence of traversals of transformation spaces between two abstract machines.
Postscript (Queen's TR1999-432)
We present a new model, the Networks of Tasks (NOT) model, that allows modules from skeleton-like languages to be embedded in static task graphs. The model is designed to provide transparent cost information, so that program designers can accurately predict the execution time performance of their programs as they assemble them. This is done by using an implementation technique called work-based allocation which uses adaptivity of the component node programs to execute the task graph with the same work and communication cost that is visible when the task graph is assembled. The semantics of NOT programs is simple enough that formal methods for developing them are straightforward. A refinement-based calculus for the NOT model is also outlined, and a law for handling residuals is given.
Postscript (Queen's TR1999-427)
Slides of a talk at a Dagstuhl workshop
(Invited Talk at CMPP, Sweden, June 1998).
Parallel software construction is moving beyond an early emphasis on runtime performance as the only goal, towards a more mature view in which other facets of design become as important. However, there are many difficulties that prevent the construction of a complete parallel software development methodology. These include: the variety of possible targets, and the portability issues that this raises; the difficulty of determining costs of execution, and representing them in ways that are simple enough to be useful; and our lack of knowledge about how to integrate transformations into development, which prevents a calculational style from being effective. All of this is further complicated by the fact that architectures themselves are designed pragmatically, and do not necessarily lend themselves to simple, elegant descriptions.
Comment: At the time I thought that, as a community, we hadn't gone very far attacking the problems of formal, cost-based, parallel software design. However, these problems are extremely difficult, so I don't think we've done too badly. I'm cautiously optimistic about a cost model that is confluent with respect to some useful set of program constructors, i.e. makes it possible to break programs into reasonably sized pieces, inside which costs are well-behaved.
ACM Classification: D.1 Programming Techniques, C.4 Performance of Systems, D.3.2 Language Classifications.
General Terms: Languages, Performance, Theory.
Other Keywords: Parallel programming models, parallel programming languages, general-purpose parallel computation, taxonomy, software development methods, object-oriented languages, logic programming languages.
Postscript (Revised October 1996.)
An improved version of this paper appears in Computing Surveys, June 1998.
Comment: We tried to cover all of the interesting models that were around in 1996-97. The presentation depends very strongly on our view of parallel computation and its future, which is not the view often espoused by current numerical/scientific programmers. To that extent, it's quite idiosyncratic.
Hypermedia and Applications
Existing interfaces to large-scale hypermedia such as the world wide web have poor conceptual models and poor rendering of navigational and contextual information. New technologies that make it cheaper to use three-dimensional representations suggest the use of richer conceptual models. We discuss criteria for assessing more powerful conceptual models and design decisions that have to be made to exploit richer interfaces. The Treeworld model is suggested as one attractive example of such a model.
Keywords: conceptual model, navigation, world wide web, large-scale hypermedia, 3-d glasses, search, relevance structuring, hierarchy, focus + context, visualisation, teleportation, Treeworld.
Postscript, html (Queen's TR1999-430)
A version of this paper appears in JUCS.
A description of a two-phase filtering technique that can be used to generate different versions of a document dynamically.
Keywords: Hyperwave, Microcosm, hypermedia courseware, automatic generation of tags, filter.
A short version of this paper appears in the proceedings of Ed-Media'98.
Comment: This paper illustrates the power of interposing filters between a web browser and servers. This allows every document to be processed in any desired way. Examples such as automatic glossary generation, and user-selected content customisation are illustrated. Today you might do this using XML, but there are still attractive aspects to the amount of control the reader gets using filters.
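As a minimal sketch of the interposition idea (hypothetical code, far simpler than the Hyperwave/Microcosm filtering described in the paper), a small HTTP proxy can rewrite every page on its way to the browser, for example to annotate glossary terms; the single glossary entry below is invented.

    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.request import urlopen

    # Invented, single-entry glossary for the sake of the example.
    GLOSSARY = {"superstep": "one compute-then-communicate phase of a BSP program"}

    class FilterProxy(BaseHTTPRequestHandler):
        """Minimal filtering proxy: point the browser at localhost:8080 for plain
        HTTP and every page is rewritten on its way through."""
        def do_GET(self):
            # In proxy mode the request line carries the absolute URL of the page.
            page = urlopen(self.path).read().decode("utf-8", errors="replace")
            for term, gloss in GLOSSARY.items():       # the filtering step
                page = page.replace(term, '<abbr title="%s">%s</abbr>' % (gloss, term))
            body = page.encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("localhost", 8080), FilterProxy).serve_forever()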
with Sumit Varma.
An information management system should reduce the workload of the user. It should also learn about the user's preferences and reduce the effort of managing information. We propose a non-intrusive approach to building information management systems, based on software agents. Software agents are programs that carry out actions on behalf of the user autonomously. We describe the issues involved in designing such systems. As an example, we design BMA (a Bookmark Management Agent). BMA is a non-intrusive, lightweight software agent that manages bookmarks (web browser pointers to favourite sites).
Tests show that BMA is a useful tool for managing bookmarks. This suggests that software agents are a promising approach to information management systems, and demonstrates the feasibility of a non-intrusive agent.
Postscript (TR97-411) (big: 2.5M)
Comment: Agents that run locally and interact with users only via the file system (e.g. by creating html documents) are non-intrusive and unannoying. This paper describes one such agent.
We describe our experiences teaching a first-year programming course using the Hyperwave hypermedia system completely without lectures.
Keywords: Hyperwave, hypermedia, courseware, first-year teaching.
Poster: ACH-ALLC'97 Conference.
Comment: A short overview of our experiences using technology for first-year programming courses.
This paper is a preliminary report of our experiences offering CISC104, a first-year introductory computing course, using hypermedia courseware to completely replace lectures.
Keywords: Hyperwave, technology for learning, distance education.
A version of this paper appears in the proceedings of Ed-Media'98.
Hypermedia technology provides both an opportunity for universities to provide a better learning experience for their students, and a way to cope with funding reductions. Second-generation hypermedia systems make it cost-effective to develop and deliver multimedia courseware, while permitting learning to occur within a community. We illustrate by describing our experiences developing and offering hypermedia courses in the Department of Computing and Information Science at Queen's University. These include an advanced course in computer architecture, and a first-year programming and applications course, in which lectures were replaced by on-line courseware, using the Hyperwave (Hyper-G) system.
Postscript (TR95-386)
Parallelism and Structured Text
We present a theory of trees and tree homomorphisms, modelling structured text archives and operations on them, from which it can be seen that:
Keywords: structured text, categorical data type, software development methodology, parallel algorithms, query evaluation.
Postscript (Queen's TR95-379). A final version of this paper appears in J. Universal Computer Science, Vol.3, No.1, January 1997.
Comment: This paper shows how the homomorphic skeleton approach can be used to design new algorithms for tree problems. Structured documents provide a natural application domain for such algorithms. This paper lays out a collection of algorithms that make sense for structured text, shows how they can be expressed as homomorphisms, and hence how they can be computed efficiently.
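The flavour of the approach can be shown with a toy homomorphism over an invented document tree (a sketch in the spirit of the paper, not its formal development): one function is applied at the text leaves, another combines the results of a section's children, and since every subtree is reduced independently the evaluation parallelises naturally.

    def hom(leaf, node, t):
        """Structured-text homomorphism: `leaf` handles text fragments, `node`
        combines a section's title with the results from its children. Every
        subtree is reduced independently, which is where the parallelism comes from."""
        if t[0] == "text":
            return leaf(t[1])
        _, title, children = t
        return node(title, [hom(leaf, node, c) for c in children])

    # Invented document tree: sections contain text fragments and subsections.
    doc = ("section", "BSP",
           [("text", "supersteps cost w + hg + l"),
            ("section", "Implementation",
             [("text", "BSPlib runs on clusters"),
              ("text", "barriers can be cheap")])])

    # Two queries, each a single homomorphism: word count and full-text search.
    words = hom(lambda s: len(s.split()), lambda _t, rs: sum(rs), doc)
    hits  = hom(lambda s: [s] if "barrier" in s else [],
                lambda _t, rs: [h for r in rs for h in r], doc)
    print(words, hits)   # -> 15 ['barriers can be cheap']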
We present a generalisation of indexes based on regular languages, called indexing languages, chosen to be homomorphic images of languages generated by typical search patterns. Precomputing properties of text strings relative to indexing languages makes it fast to exclude large parts of the text from consideration before executing a direct search.
Postscript (Queen's TR95-383)
Comment: This paper shows how to mix structured text search with a limited kind of indexing that is not unlike signatures.
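A rough analogue of the idea (hypothetical code; the paper's indexing languages are considerably more general than the character signatures used here): precompute, for each block of text, a small homomorphic summary, and use it to discard blocks that cannot possibly contain a match before running the direct search.

    def signature(block):
        """Precomputed property of a block: the set of characters it contains
        (a homomorphic image of the block, so cheap to maintain)."""
        return set(block)

    def search(blocks, pattern):
        sigs = [signature(b) for b in blocks]        # built once, reused across queries
        needed = set(pattern)
        hits = []
        for block, sig in zip(blocks, sigs):
            if needed <= sig and pattern in block:   # most blocks fail the cheap test first
                hits.append(block)
        return hits

    print(search(["bulk synchronous parallel", "tree homomorphism", "barrier cost"], "rrier"))
    # -> ['barrier cost']; the first block is excluded without being scanned at all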
Postscript (Queen's TR95-380)
A much improved version of this paper appears in the Journal of Parallel and Distributed Computing, Vol.39, 1996, 115--125. If at all possible, you should read that version, rather than this.
Comment: The key contribution of this paper is that homomorphisms on trees can be decomposed into homomorphisms on trees of trees. This decomposition is not particularly obvious. Having it means that techniques, such as tree contraction, that have logarithmic running times, can be extended to trees of trees.
Postscript (Queen's TR95-381)
An improved version of this paper appears in Information Processing Letters, Vol.60, No.5, 1996, 231--235.
Comment: A nice fast algorithm based on universal hashing.