The following paper was originally published in the Proceedings of the USENIX Summer 1993 Technical Conference, Cincinnati, Ohio, June 21-25, 1993.

For more information about the USENIX Association contact:
1. Phone: 510 528-8649
2. FAX: 510 548-5738
3. Email: office@usenix.org
4. WWW URL: https://www.usenix.org

Design and Implementation of a Simulation Library using Lightweight Processes

Janche Sang, Ke-hsiung Chung, Vernon Rego
Department of Computer Sciences
Purdue University
West Lafayette, IN 47907

Research supported in part by NSF award CCR-9102331, NATO-900108, ONR-93-1-0233, and ARO-93-G-0045.

Abstract

The \Si/ lightweight-process based system for simulating process interactions is an enhancement to the C programming language in the form of library primitives with sets of predefined data structures. The \Si/ system encapsulates an existing lightweight-process library to provide a discrete-event simulation environment supporting the process view. It was developed as a research testbed for investigating methods which support simulations efficiently. Easy extensions and modifications to the \Si/ system are a major design objective, accomplished through modularity and layering. This paper describes the system, our experiences with its implementation, and its applicability to simulation modeling. We report on performance measurements of different implementations of the simulation scheduler, and of different algorithms for simulating service disciplines.
\section{Introduction} The process view of simulation, developed and refined over the past two decades~\cite{Evans,Franta}, is a view that enables an analyst to model a discrete-event system in terms of interacting processes. A key advantage of the process-interaction approach to simulation modeling is that describing a model through processes and their interactions makes the entire modeling activity, from model design to code debugging and execution, require relatively little effort. This is in comparison to the level of effort required by alternate techniques, such as the event-scheduling simulation method, which offers a lower level of abstraction than the process-interaction method. In the process view of simulation, processes employ a variety of statements to define the flow of entities (transactions, customers, jobs, etc.) through the system. Two or more processes may compete for resources and exchange messages with one another for the purpose of synchronization. The relatively high level of abstraction enjoyed by process-oriented models makes model construction relatively effortless. Another benefit is the added flexibility given by application-level processes, and the ease with which process-based models can be scaled to implement simulations of large systems~\cite{MacDougall}. In this work we describe the design of a process-interaction system based on lightweight processes. \Si/, a system for \S/imulating process \i/nteractions, is a process-oriented discrete-event simulation system. It has been designed to enhance the capabilities of the C programming language through a set of primitives which provide a quasi-parallel programming environment. The primary goal of this work is to develop a research vehicle for conducting experiments and obtaining measurements in empirical evaluations of algorithms used in simulation systems.
Such a research vehicle helps in identifying principles of good design and alternatives which contribute toward efficient, accurate, and reliable simulation software. A secondary goal is to present an interface that is simple and straightforward, yet sophisticated enough to meet the needs of a wide range of applications. The primitives provided in the \Si/ skeleton are sufficient for a variety of tasks, including process manipulation and synchronization, random number generation, and statistics collection, all of which allow an application to be described as fully as possible without application-specific detail. There are two main approaches to designing process-oriented simulation software. One approach is to design a specific simulation language. Several existing languages, such as SIMULA\cite{SIMULA} and GPSS\cite{GPSS}, belong to this category. The other approach is to construct and place simulation primitives on top of an existing language. Examples of this approach can be found in \cite{Bruno,Kriz,CSIM,Sharma}, where Ada, Extended Pascal with coroutines, C, and Modula-2 have been used as host languages. We decided to take the second approach with \Si/ for the following reasons. First, it requires less effort to extend and modify a library than to design and implement a new language. Second, users tend to be more comfortable using a familiar and trusted language than having to learn yet another new language~\cite{Ferrari}. To construct an efficient process-oriented environment, we used so-called lightweight processes, which operate within a single address space. In fundamental structure, a lightweight process is no different from a process; each has its own stack, local variables, and program counter. However, compared to a process, a lightweight process is lighter in terms of the overheads associated with creation, context-switching, interprocess communication, and other routine functions.
We decided to base \Si/ on the Sun Lightweight-Process library (LWP) \cite{LWP}. We chose this as a kernel because of its availability in our research environment, its capacity for reliable context-switching and allocation of protected stacks, and its continued development. Further, the library's potential conformance with POSIX standards promises easy portability, making it all the more attractive. The remainder of the paper is organized as follows. In Section 2 we briefly introduce the Sun Lightweight-Process library. Some design issues are discussed in Section 3, and Section 4 details the implementation of the four major modules in \Si/: process management, process coordination, resource management, and the statistics functions. In Section 5 we illustrate some of the strengths of the \Si/ system, with early measures of performance given in Sections 5.1 and 5.2, a simple but remarkable performance enhancement, easily implementable in \Si/, described in Section 5.3, and a brief comparison of different algorithms for simulating service disciplines given in Section 5.4. A short conclusion is presented in Section 6. \section{Design Issues} The design of the \Si/ system was motivated by three primary objectives, namely, \begin{itemize} \item to provide a simple and effective interface which encapsulates the LWP library, \item to provide modularity that supports easy extensions and modifications, and \item to achieve efficient simulation executions. \end{itemize} The \Si/ system employs the LWP library as its kernel. A layered design enables applications to be created without direct utilization of LWP functions. Therefore, any changes to the LWP library are transparent to the application level. \Si/ was designed with a modular structure to simplify the replacement and modification of specific parts of the simulator. Modularity is well recognized as an important concept in system design because it provides for convenient system maintenance.
This property is invaluable in software systems like \Si/ which are used as research testbeds, frequently undergoing modifications to improve existing algorithms or to add new features which improve system functions and performance. For example, two different approaches are provided in \Si/ to simulate the round-robin service discipline. The modular structure allows us to implement these two algorithms without modifying the system structure. The performance of executing simulations is a major concern in our design. Since simulations are usually time-consuming, the need to significantly reduce the execution time of simulation programs is a critical one. One source of overhead with the potential to lengthen execution times is context-switching between lightweight processes. Although a lightweight process's context-switching overhead is cheaper than that of its heavier UNIX counterpart, it can still contribute significantly to execution time, particularly when the number of context-switches is large. Another source of overhead may be attributed to \Si/'s layered design, which results in a series of function calls. This is unavoidable unless the layers are broken up and combined. But the overhead can be minimized if only a thin layering of sparse code is used in \Si/ to encapsulate LWP primitives. In general, our objectives in designing and building the \Si/ system are to build a layered, modular, and efficient simulation system for process-interaction based simulations. Our eventual goal is the integration of the \Si/ and EcliPSe systems~\cite{Naka,Eclipse}. The latter system provides high-level simulation primitives which enable simulation tasks to be executed in parallel. With such support, applications developed in \Si/ can be conveniently executed on networks of heterogeneous workstations. The requisite layering for such a system is shown in Figure~\ref{layers}.
\section{Implementation Issues} The \Si/ system consists of four major components, namely, a process management component, a process coordination component, a resource management component, and a probability and statistics library. Figure~\ref{modules} depicts the layout of these components and examples of the functions provided within each. \subsection{Module Descriptions} The process creation task in \Si/ is achieved through an invocation of the function {\sl si\_create(func, nargs, arg1,$\cdots$, argn)}, which encapsulates the LWP function {\sl lwp\_create()}. Following the creation of a process, a second invocation, {\sl si\_insert(E)}, saves the reactivation record {\sl E} in the simulation calendar, ensuring that the specified process is reactivated at time {\sl E.clock}. \noindent{\\ \bf Process Management} Transfer of control between specific processes is most easily done with the aid of the {\sl lwp\_yield()} function. However, a problem arises because the LWP library provides its own process scheduler, which schedules processes based on LWP scheduling priority; the scheduling rule always selects a process with the highest priority to run. If more than one such process exists, the scheduler uses a FIFO policy within the priority class. In building \Si/ on top of the LWP library, a potential problem arises when an executing process terminates. At this point, the LWP scheduler will transfer control to the highest LWP priority process, one which is not necessarily the process whose execution is imminent in simulation logic, i.e., one whose activation record in the simulation calendar has the smallest reactivation time. While an apparent solution is to require a mapping between priorities in \Si/ and priorities in LWP, the integer priority format used in LWP precludes this option because \Si/'s priorities are double-precision numbers, representing reactivation (simulation) times.
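To make the varargs encapsulation concrete, the following is a minimal sketch of how a call like {\sl si\_create(func, nargs, arg1,$\cdots$, argn)} might pack its variable arguments into a record for later dispatch. The struct layout, the MAXARGS bound, and the names in the sketch are our illustrative assumptions; \Si/'s actual implementation hands the arguments off to {\sl lwp\_create()}.

```c
#include <stdarg.h>
#include <stdlib.h>

#define MAXARGS 8

/* A record holding a process's entry function and its arguments;
   the layout is our assumption, not Si's actual structure. */
struct proc_rec {
    void (*func)(int argc, long argv[]);
    int   nargs;
    long  argv[MAXARGS];
};

/* Sketch of si_create(func, nargs, arg1, ..., argn): pack the
   variable arguments into a record that a scheduler could later
   place on the new process's stack. */
struct proc_rec *si_create_sketch(void (*func)(int, long[]), int nargs, ...)
{
    struct proc_rec *p = malloc(sizeof *p);
    va_list ap;
    int i;

    p->func = func;
    p->nargs = nargs;
    va_start(ap, nargs);
    for (i = 0; i < nargs && i < MAXARGS; i++)
        p->argv[i] = va_arg(ap, long);
    va_end(ap);
    return p;
}

/* Dispatch: what the scheduler would do when the process first runs. */
void proc_start(struct proc_rec *p) { p->func(p->nargs, p->argv); }

static long last_sum;                    /* observed by the demo entry point */
static void adder(int argc, long argv[]) /* example process body */
{
    long s = 0;
    for (int i = 0; i < argc; i++) s += argv[i];
    last_sum = s;
}
```

A real implementation must also arrange a protected stack for the new process, which is exactly the service the LWP kernel provides.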
Therefore it is necessary to determine an alternate scheme for transferring control between a terminating process and \Si/'s scheduler. Our strategy is to force the LWP scheduler to schedule only a limited number of processes in \Si/. To achieve this, each process is assigned an LWP priority, either a MINPRIO (minimum priority) value or a MAXPRIO (maximum priority) value. This principle is stated as follows: \begin{description} \item[Principle:] At any given instant, at most two processes exist with the highest priority value MAXPRIO. One of these is \Si/'s process scheduler, and the other is a currently executing application-level process. \end{description} With this principle, when a currently executing process terminates, the process scheduler obtains control under the LWP library's own scheduling policy since it is the unique process left with priority value MAXPRIO. This solves the control transfer problem. Figure~\ref{Scheduler} shows the pseudo-code required for the process scheduler. As implied by the principle, an executing process scheduler selects the highest priority process, say process {\sl pid}, from the simulation calendar, using smallest reactivation time as a measure of priority. It raises the LWP priority of process {\sl pid} to MAXPRIO and subsequently transfers control to process {\sl pid}. When the process scheduler is invoked due to process suspension, the scheduler follows exactly the opposite routine. It reduces the LWP priority of the suspended process, say process {\sl pid}, from a value MAXPRIO to a value MINPRIO, prior to saving its reactivation record in the simulation calendar, if necessary. Observe that two distinct processes are created within the {\sl main()} routine shown in Figure~\ref{Scheduler}. The first process executes a {\sl si()} function which the application code treats as a main routine. The second process is the process scheduler whose function it is to execute the {\sl schedule()} routine. 
Both are created with the highest priority value MAXPRIO. The notation \Sip/ is used to emphasize the fact that the scheduler is implemented as a process here. An alternate approach toward implementing transfer of control between processes is through function invocations. To demonstrate this point, we implement a different version of the process scheduler, through a small modification of the function {\sl schedule()}. In the remainder of the text, we use the notation \Sif/ to emphasize the use of function invocations in the implementation of the scheduler. The function {\sl schedule()} in \Sif/ uses a strategy similar to that used in \Sip/ to switch control between processes. Since the scheduler in \Sif/ is no longer a process, an executing application-level process is the only process with the highest priority value MAXPRIO. First, in function {\sl schedule()}, the selection of the next process to execute, say {\sl pid}, is made by accessing the simulation calendar. Following this, process {\sl pid}'s priority is raised to value MAXPRIO, and the currently executing process (i.e., the process seeking suspension) reduces its own priority by invoking function {\sl lwp\_setpri()}. Through this priority reduction scheme, control is transferred to the single remaining highest priority process, i.e., {\sl pid}. The scheme described above allows for transfer of control between a process requiring suspension and a process whose execution is imminent. Unfortunately, a problem arises when an executing process terminates naturally, without either voluntarily or involuntarily requesting suspension. Upon its termination, there is no unique highest-priority process to take control. Left to its own devices the LWP scheduler would give control to the first process in line within the single priority class shared by all other processes. Needless to say, this would result in incorrect simulation logic. 
To get around this problem, we use an additional function called {\sl si\_exit()} which a process must invoke just prior to its termination. The function {\sl si\_exit()} invokes function {\sl schedule()}, which uses function {\sl lwp\_destroy()} to eliminate the currently executing process, effectively a suicide operation. As before, {\sl schedule()} also selects a process whose execution is imminent, so that the invocation of function {\sl lwp\_destroy()} causes control to switch to the new high-priority process. These actions are summarized in the pseudo-code shown in Figure~\ref{Scheduler1}. It is worth mentioning that there is yet another way to solve the ``natural termination" problem. By modifying the return address of a process, control can be transferred to some function specified in the address when the process terminates naturally. This can be accomplished either by modifying {\sl lwp\_create()} to set the address of the function {\sl schedule()} as the return address, or by implementing a new thread creation routine which achieves the same effect~\cite{Gautron}. Since our design goals emphasize high level abstractions and layered design, we chose not to adopt this option.
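Both schedulers manipulate the simulation calendar only through {\sl si\_insert()} and {\sl extract\_min()}. A binary min-heap keyed on reactivation time is one natural realization of the calendar; the sketch below is our illustrative assumption, since the text does not specify \Si/'s actual calendar structure.

```c
#include <stdlib.h>

/* Reactivation record, as in the pseudo-code: which process to
   resume, and at what simulation time. */
struct event { int pid; double clock; };

/* The future-event set as a binary min-heap ordered on clock
   (an assumed representation, for illustration only). */
struct fes { struct event *a; int n, cap; };

struct fes *fes_new(int cap)
{
    struct fes *f = malloc(sizeof *f);
    f->a = malloc(cap * sizeof *f->a);
    f->n = 0; f->cap = cap;
    return f;
}

/* si_insert(E): sift the new record up toward the root. */
void si_insert(struct fes *f, struct event e)
{
    int i = f->n++;
    while (i > 0 && f->a[(i - 1) / 2].clock > e.clock) {
        f->a[i] = f->a[(i - 1) / 2];
        i = (i - 1) / 2;
    }
    f->a[i] = e;
}

/* extract_min: remove and return the record with the smallest
   reactivation time -- the process whose execution is imminent. */
struct event extract_min(struct fes *f)
{
    struct event min = f->a[0], last = f->a[--f->n];
    int i = 0;
    for (;;) {
        int c = 2 * i + 1;                       /* left child */
        if (c >= f->n) break;
        if (c + 1 < f->n && f->a[c + 1].clock < f->a[c].clock)
            c++;                                 /* smaller child */
        if (last.clock <= f->a[c].clock) break;
        f->a[i] = f->a[c];
        i = c;
    }
    f->a[i] = last;
    return min;
}
```

Both operations are O(log n) in the number of pending records, which matters because every {\sl delay()} call touches the calendar.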
\begin{figure} {\singlespace \begin{pseudocode} { \item schedule() \\ \{ \\ \item \+ while(future\_event\_set is not empty) \{ \\ \+ if previously executing process is suspended\\ \+ lwp\_setpri(ppid,MINPRIO); \\ \- E = extract\_min(future\_event\_set); \\ clock = E.clock; \\ lwp\_setpri(E.pid,MAXPRIO); \\ /* Now only E.pid and the scheduler have highest priority.*/\\ si\_yield(E.pid) /* resume execution of process pid */\\ \- \} \\ \- \} \\ \item main()\\ \{ \\ \item \+ int si(); $\cdots$initialization$\cdots$ \\ lwp\_create(si\_tid,si,MAXPRIO, $\cdots$); \\ lwp\_create(sched\_tid,schedule, MAXPRIO, $\cdots$); \\ \- \} \\ } \end{pseudocode} } \caption{Process Scheduling in $Si_p$} \label{Scheduler} \end{figure} \begin{figure} {\singlespace \begin{pseudocode} { \item schedule() \\ \{ \\ \item \+ while(future\_event\_set is not empty) \{ \\ \+ E = extract\_min(future\_event\_set); \\ clock = E.clock; \\ lwp\_setpri(E.pid,MAXPRIO); \\ if the current process is dying\\ \+ lwp\_destroy(cpid); /* suicide*/ \\ /* control transfers to E.pid automatically. */ \\ \- else \+ lwp\_setpri(cpid,MINPRIO); \\ /* control transfers to E.pid automatically. */ \\ \- \- \} \\ \- \} \\ \item main()\\ \{ \\ \item \+ int si(); $\cdots$initialization$\cdots$ \\ lwp\_create(si\_tid,si,MAXPRIO, $\cdots$); \\ \- \} \\ } \end{pseudocode} } \centerline{} \caption{Process Scheduling in $Si_f$} \label{Scheduler1} \end{figure} There is another important function in \Si/ known as the {\sl delay()} function. When an executing process decides to suspend itself for {\sl t} units of simulated time, it invokes the function {\sl delay(t)}. This function inserts the invoking process's reactivation record, including its reactivation instant {\sl clock + t}, into the simulation calendar. Since the invoking process must undergo suspension, the process scheduler selects as the next process to execute that process in the simulation calendar with the lowest reactivation time. 
In the \Sip/ design, transfer of control to the process scheduler is done via the {\sl lwp\_yield(scheduler)} action, while in the \Sif/ design, the scheduler is invoked directly through a call to function {\sl schedule()}. This is described in the pseudo-code shown in Figure~\ref{delay}. \noindent{\\ \bf Process Coordination} The \Si/ system provides two distinct coordination mechanisms to support synchronization between processes. One mechanism is through user-declared events, effected by calling the {\sl wait\_event()} and {\sl set\_event()} primitives. A process is suspended if it invokes function {\sl wait\_event(e)} because it is forced to wait until event {\sl e} occurs. Event {\sl e} is said to occur when some other process invokes function {\sl set\_event(e)}. At this point, all processes waiting for event {\sl e} are reactivated simultaneously. In \Si/, an event {\sl e} is declared to be of type {\sl Event} and initialized by the {\sl create\_event()} function. The other mechanism for process synchronization is through message-passing. There is a predefined data structure called {\sl Mailbox} which is created by the function {\sl create\_mailbox()}. Messages can be sent and received through the mailbox by using functions {\sl si\_send()} and {\sl si\_receive()}, respectively. The function {\sl si\_send(mb,msg)} deposits the message {\sl msg} in the mailbox {\sl mb}. If there is a process awaiting the arrival of this message, {\sl si\_send()} enables the process to access the message and consequently be reactivated. The reverse function {\sl si\_receive(mb,\&msg)} obtains the message from the mailbox. If no message is available, the invoking process is suspended until a message arrives. For simplicity, the size of a message is limited to one word (i.e., the size of an integer or pointer). This is patterned after the design of the XINU system~\cite{XINU}.
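The one-word mailbox semantics can be sketched as a pair of bounded queues, one for pending messages and one for suspended readers. Since this sketch has no scheduler attached, return codes stand in for the suspension and reactivation that \Si/ would perform; the field names, the QCAP bound, and the pid-based waiter queue are our assumptions.

```c
#include <string.h>

#define QCAP 16

/* One-word-message mailbox, after the XINU pattern the text cites.
   Control transfer is elided: si_send() returns the pid of a reader
   to reactivate (or -1), and si_receive() returns 0 when the caller
   would be suspended.  These conventions are illustrative only. */
struct mailbox {
    long msgs[QCAP];    int mh, mn;  /* queued messages     */
    int  waiters[QCAP]; int wh, wn;  /* suspended readers   */
};

void create_mailbox(struct mailbox *mb) { memset(mb, 0, sizeof *mb); }

/* si_send: queue the message; if a reader was waiting, return its
   pid so a scheduler could reactivate it (it then re-issues its
   receive and finds the message).  Returns -1 if no one waited. */
int si_send(struct mailbox *mb, long msg)
{
    mb->msgs[(mb->mh + mb->mn++) % QCAP] = msg;
    if (mb->wn > 0) {
        int pid = mb->waiters[mb->wh];
        mb->wh = (mb->wh + 1) % QCAP; mb->wn--;
        return pid;                  /* reactivate this reader */
    }
    return -1;
}

/* si_receive: take a queued message if one exists (returns 1);
   otherwise record the caller as a waiter -- in Si it would now be
   suspended until a sender arrives (returns 0). */
int si_receive(struct mailbox *mb, int pid, long *msg)
{
    if (mb->mn > 0) {
        *msg = mb->msgs[mb->mh];
        mb->mh = (mb->mh + 1) % QCAP; mb->mn--;
        return 1;
    }
    mb->waiters[(mb->wh + mb->wn++) % QCAP] = pid;
    return 0;
}
```

Restricting messages to one word keeps both queues fixed-size arrays and makes every operation constant-time, which is presumably why the XINU-style design was adopted.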
\begin{figure} {\singlespace \begin{pseudocode} { \item delay(t) \\ \{ \\ \+ E.pid = CurrentPID; \\ E.clock = clock + t;\\ si\_insert(E); \\ lwp\_yield(scheduler); \\ /* or schedule() in $Si_f$ */\\ \- \} \\ } \end{pseudocode} } \caption{The function {\sl delay()}} \label{delay} \end{figure} \noindent{\\ \bf Resource Management} In contrast to processes, which are used to model active components of a system, resources are used to model passive system objects with mutually exclusive access. In other words, processes request access to resources, use these resources for a certain length of time, and finally release them and proceed with different activities. The \Si/ system supports two basic functions for resource access, called the {\sl request(r)} and {\sl release(r)} functions (see Figure~\ref{Resource}). The resource object is declared to be of type {\sl Resource} and initialized by a {\sl create\_resource()} function. When a resource {\sl r} is occupied, other requesting processes must wait in a queue associated with resource {\sl r}. When resource {\sl r} is released, a suspended process at the front of the queue is given permission to resume, with access to {\sl r}. In a real application, a variety of queueing disciplines is possible, including first-in first-out (FIFO), round-robin (RR), processor-sharing (PS), etc. Some, such as the latter two, are more complicated than others. These complex disciplines utilize different rules in selecting the next process to execute, when faced with a choice. To give analysts a common interface to these disciplines, the \Si/ system employs a function {\sl use(r,t)} (patterned after CSIM~\cite{CSIM}) to allow a process to utilize a resource {\sl r} for a given length of time {\sl t}. The queueing discipline is specified as a parameter when resource {\sl r} is initialized.
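The common-interface idea behind {\sl use(r,t)} amounts to a function pointer stored in the resource at creation time. A compact sketch follows; the struct fields and the stub disciplines are our assumptions (\Si/'s real {\sl fcfs()} and {\sl ps()} also suspend the caller, as Figure~\ref{Resource} shows).

```c
/* Minimal resource with a pluggable service-discipline function,
   mirroring the create_resource(fp) / use(r,t) pair.  The fields
   and the stand-in disciplines are illustrative assumptions. */
struct resource;
typedef void (*discipline)(struct resource *r, double t);

struct resource {
    discipline fp;      /* scheduling function bound at creation */
    double busy_time;   /* toy statistic updated by a discipline */
    int    uses;
};

void create_resource(struct resource *r, discipline fp)
{
    r->fp = fp; r->busy_time = 0.0; r->uses = 0;
}

/* use(r,t): one-line dispatch through the stored function pointer,
   giving every discipline the same caller interface. */
void use(struct resource *r, double t) { (*(r->fp))(r, t); }

/* Two stand-in "disciplines" that only record statistics; a real
   fcfs() would request(), delay(), release(), and ps() would enter
   the sharing pool, as in the figure. */
void fcfs_stub(struct resource *r, double t) { r->busy_time += t; r->uses += 1; }
void ps_stub(struct resource *r, double t)   { r->busy_time += t; r->uses += 2; }
```

Because the discipline is resolved through {\sl r->fp}, an analyst can add a new discipline by writing one function and passing it to {\sl create\_resource()}, with no change to the callers of {\sl use()}.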
For example, the statement {\sl r = create\_resource(ps)} binds resource {\sl r} with the {\sl ps} function which is predefined in \Si/ to simulate the PS discipline. This simple interface also provides a level of modularity which lets analysts develop their own queueing disciplines in a fairly effortless manner. \begin{figure} {\singlespace \small \begin{pseudocode} { \item request(r) \\ \{ \\ \+ if (resource r is free) \\ \+ flag r as occupied; \\ \- else \{ \\ \+ compute required statistics; \\ insert current process's pid at tail of queue. \\ lwp\_yield(scheduler); \\ /* or schedule() in $Si_f$ */\\ \- \} \\ \- \} \\ \item release(r)\\ \{ \\ \+ if (waiting queue for resource r is not empty) \{\\ \+ remove process pid from head of queue; \\ compute required statistics; \\ si\_insert(E); /* with E.pid = pid, E.clock = clock */\\ \- \} \\ else \{ \\ \+ compute required statistics; \\ flag resource r as free; \\ \- \} \\ \- \} \\ \item fcfs(r,t) /* first-come, first-served */ \\ \{ \\ \+ request(r); \\ \+ delay(t); \\ \- release(r); \\ \- \} \\ \item ps(r,t) /* processor-sharing */ \\ \{ \\ \+ insert\_into\_pool(r,t); \\ lwp\_yield(scheduler); \\ /* or schedule() in $Si_f$ */\\ /* regain control from scheduler */ \\ delete\_from\_pool(r); \\ \- \}\\ $\cdots$ \\ \item create\_resource(fp) \\ \{ \\ \+ $\cdots$initialization$\cdots$; \\ r->fp = fp ; /* fp is a scheduling function pointer */\\ $\cdots$ \\ \- \} \\ \item use(r,t)\\ \{ \\ \+ (*(r->fp))(r,t); \\ \- \} \\ } \end{pseudocode} } \caption{Resource Management in {\bf ${\cal S}i$}} \label{Resource} \end{figure} \noindent{\\ \bf Probability and Statistics Functions} The \Si/ system provides the necessary random number generation functions, such as {\sl uniform()} for generating deviates from a uniform distribution and {\sl expon(u)} for generating deviates from an exponential distribution with mean {\sl u}. In addition, \Si/ supports two types of statistics.
The first type is a predefined type, involving data that is implicitly associated with a resource and automatically collected, summarized, and reported. For example, applications can use functions such as {\sl util(r)} and {\sl qlen(r)} to obtain the utilization of and queue-size at resource {\sl r}. The second type is a user-defined type, requiring user-tables for statistics collection. For example, an explicit invocation of the function {\sl putstat(t,x)} inserts datum {\sl x} into a user-defined table {\sl t} which is declared with the type {\sl Table}. The sample mean and variance can be obtained by calls to functions {\sl getmean(t)} and {\sl getvar(t)}, respectively. At present, the \Si/ system also supports functions {\sl reset\_resource(r)} and {\sl reset\_table(t)} to clear the statistics-collection fields in resource {\sl r} and table {\sl t}, respectively. There are two advantages to using these functions. The first is that they can be used to eliminate the effects of the start-up transient in simulations. The second is that they can be used in the regenerative simulation method, where independent samples are obtained from independent cycles of a simulated system~\cite{Crane}. \section{Performance Measurements} In this section we first present some performance measurements obtained from the \Si/ system, then a simple performance enhancement which can significantly reduce simulation execution time, and finally some specific algorithms for service disciplines. An important reason for pursuing modular design and layering with the \Si/ system is the potential for easy modification, and thus the use of \Si/ as a testbed for new ideas in simulation. In the following subsections we present two different examples of how this approach has proven beneficial.
The first example is motivated by recognition of the simple fact that simulation execution performance can be significantly enhanced if the scheduler's interaction with the simulation calendar is slightly modified. The second example demonstrates that new computation-based algorithms, which simulate service disciplines more efficiently than direct process-mapped algorithms, are easy to incorporate in a modular \Si/ system. The configuration used in our measurements was a Sun SPARC IPC workstation (15.7 MIPS, 8 MB memory, 64 KB cache) running SunOS 4.1. The measured times presented here are averages, and it should be emphasized that the measured averages are not intended to represent absolute performance but rather relative performance for a particular parameter configuration. Thus the comparison of average times is of more interest than a comparison of raw numerical data. \subsection{Cost of Operations} Table~\ref{CTX} shows a set of measured system overheads for the tasks of process creation and context switching in both the \Sip/ and \Sif/ subsystems. Also included, for the purpose of comparison, are the corresponding overheads in the LWP library. Not surprisingly, the \Sif/ subsystem exhibits less context-switching overhead than its \Sip/ counterpart, primarily due to its use of a function, instead of an additional process, for process scheduling.
\begin{table} \begin{center} \begin{tabular}{|l||r|r|r|} \hline \multicolumn{1}{|c||}{Operations} & \multicolumn{1}{c|}{LWP} & \multicolumn{1}{c|}{\Sip/} & \multicolumn{1}{c|}{\Sif/} \\ \hline {Creation + Deletion} & 660 & 960 & 860 \\ \hline {Process Switches } & 70 & 220 & 170 \\ \hline \end{tabular} \end{center} \caption{Latency of Operations (in microseconds)} \label{CTX} \end{table} \subsection{Benchmark Measurements} A pragmatic approach toward evaluating the performance of the \Si/ system is through the use of benchmark models, using both ease of model development and execution times as indicators of performance. Though our primary interest is in the \Sip/ and \Sif/ subsystems, we have included the CSIM system as a basis of comparison. We chose CSIM because it is a sound C-based process-oriented simulation system that has been gaining an increasing amount of popularity in the modeling of complex computer systems~\cite{Complex}. In addition, using the same language (i.e., C) to realize models makes the coding of equivalent definitions considerably easier. In this subsection, we develop models for a single-server queueing system, and a multiqueueing system for a token ring local area network. Both models have previously been used as benchmarks in comparing simulation systems~\cite{Benchmarks,Sharma}. \bigskip \noindent{\\ \bf Benchmark I: A Single Server Model} The first model is simple, describing a single-server queueing station. The customer arrival process is Poisson, and customer service times are independent, exponentially distributed random variables. We assume that customers are served in their order of arrival (i.e., FIFO discipline). Figure~\ref{MM1} shows the code of this model. In the first experiment, we execute the benchmark program to measure average execution times versus a varying number of simulated customers. 
Each execution is repeated several times, using different random number seeds in each case, and the average execution time is computed. \begin{figure} {\singlespace \begin{pseudocode} { \item \#include \\ \#define NMAX 10000 /* \# simulated cust. */\\ \#define SM 4.0 /* mean service time */\\ \#define IM 5.0 /* mean cust. interarrival time */\\ Resource r; \\ Event done; \\ si() \\ \{ \\ \+ r = create\_resource(); \\ done = create\_event(); \\ si\_create(gen\_cust,0); \\ wait\_event(done); \\ printf("\%f, \%f$\backslash$n", qlen(r), util(r)); \\ \- \} \\ gen\_cust() \\ \{ int k; \\ \+ for(k=0; ; k++) \{ \\ \+ delay(expon(IM)); \\ si\_create(cust,1,k); \\ \- \} \\ \- \} cust(k)\\ \{ \\ \+ request(r); \\ \+ delay(expon(SM)); \\ \- release(r); \\ if(k==NMAX) \\ \+ set\_event(done); \\ \- \- \} } \end{pseudocode} } \caption{A Single Server Queueing Model} \label{MM1} \end{figure} \noindent{\\ \bf Benchmark II: A Multiple Queue Model} The second model is a little more elaborate, utilizing a multiqueue system with a roving server to emulate a token-ring protocol~\cite{Spragins}. The multiqueue system represents $N$ independent computer stations situated on a ring, and the roving server represents the token. Messages made up of packets are generated by each station for transmission to other stations on the ring. A single token is passed unidirectionally from one station to its successor on the ring, to provide stations with a mechanism for conflict-free access to the ring for packet transmissions. A station which acquires the token and has queued packets is allowed to complete transmission of a single packet before relinquishing control of the token to the succeeding station on the ring. It is of some interest to determine queueing characteristics of packets at different stations as a function of ring parameters and station traffic.
The parameters of the model include message interarrival time distributions, and packet transmission time distributions at the different stations on the ring. For convenience, we assume that all interarrival time distributions are identical, each being exponential with mean $1/\lambda$. Also, for convenience, assume that all packet transmission time distributions are identical, each being exponential with mean $1/\mu$. Finally, assume that the token-passing time between consecutive stations on the ring is a small constant, a function of ring delay. A similar setup which uses CSIM to model the token ring network can be found in~\cite{Edwards}. \noindent{\\ \bf Empirical Results and Interpretation} The results of both experiments, given in Tables~\ref{B1} and \ref{B2}, suggest that the two \Si/ subsystems are competitive with CSIM in terms of performance. In particular, the \Sif/ system outperforms CSIM by up to 20\%. We emphasize that this does not imply the \Si/ system is better than CSIM in all respects. Indeed CSIM is a very stable system, developed and used over several years, while \Si/ is still an experimental and evolving system. 
\begin{table} \begin{center} \begin{tabular}{|l|r|r|r|r|r|} \hline &\multicolumn{5}{|c|}{Simulated Customers ($\times 100$)} \\ & 10 & 50 & 100 & 200 & 500 \\ \hline \hline \Sip/ & 1.5 & 7.4 & 14.8 & 29.7 & 73.8 \\ \hline \Sif/ & 1.2 & 6.1 & 12.1 & 24.0 & 60.8 \\ \hline CSIM & 1.5 & 7.6 & 15.1 & 30.4 & 75.9 \\ \hline \end{tabular} \end{center} \caption{Execution time (in seconds) for Benchmark I} \label{B1} \end{table} \begin{table} \begin{center} \begin{tabular}{|l|r|r|r|r|r|} \hline &\multicolumn{5}{|c|}{Simulated Customers ($\times 100$)} \\ & 10 & 50 & 100 & 200 & 500 \\ \hline \hline \Sip/ & 3.5 & 16.7 & 33.5 & 67.3 & 168.5 \\ \hline \Sif/ & 2.1 & 10.0 & 19.9 & 39.7 & 100.3 \\ \hline CSIM & 2.6 & 13.1 & 26.1 & 51.9 & 130.8 \\ \hline \end{tabular} \end{center} \caption{Execution time (in seconds) for Benchmark II} \label{B2} \end{table} It is interesting to observe that the \Sif/ subsystem consistently outperforms the \Sip/ subsystem, attaining as much as a 40\% improvement in execution time for Benchmark II. This is because \Sif/ employs functions to schedule processes instead of dedicating a process to the task; consequently, \Sif/ incurs significantly less context-switching overhead. These results indicate that we have achieved a reasonable level of efficiency, one of the main considerations in our original set of objectives. \subsection{A Scheduling Enhancement} Upon examining the code for the function {\sl delay()} (see Figure~\ref{delay}), it becomes clear that a process reactivation record which has just been inserted into the simulation calendar may well represent the process whose execution is imminent. In such a case, the insertion of this reactivation record is immediately followed by its deletion from the simulation calendar. 
Clearly, the cost of insertion and subsequent deletion can be avoided if the process in question is recognized to be the process whose execution is imminent. Apart from the insertion and deletion savings, unnecessary context-switches between such a process, i.e., one undergoing a potential suspension, and the scheduler may be avoided. Recognizing such a situation entails a comparison operation in which the scheduler determines whether the simulation priority of the process at hand is greater than the simulation priority of the highest-priority process in the simulation calendar. Because this comparison must now be performed for each process scheduled for execution, there is a trade-off between the new scheme (which incurs the additional comparison cost) and the old scheme (which incurs no comparison cost, but suffers avoidable insertion-deletion pairs of operations and context-switches). To obtain a rough assessment of the frequency of such unnecessary insertion-deletion actions on process reactivation records, we measure the ratio of such occurrences to the total number of times that the function {\sl delay()} is invoked. Recall that invocation of {\sl delay()} initiates a process's suspension via a control-switch from the process to the scheduler, followed by a control-switch to a new, or the same, process. Table~\ref{Ratio} contains the percentages of such unnecessary actions for both benchmark programs. Surprisingly, the avoidable cost can be as high as 78\% for the multiqueue model. This simple idea is incorporated in \Si/ through a small modification of the original code in the function {\sl delay(t)}. An additional function, {\sl findmin()}, performs the actual comparison. 
It determines whether the highest-priority process in the simulation calendar, with reactivation record {\sl E} and reactivation time {\sl E.clock}, is to be given control before or after the process that invoked {\sl delay()}, which requires control at simulation time {\sl clock + t}. Clearly, if the quantity {\sl E.clock} is smaller, then the currently executing process must be suspended; otherwise, the savings include an insertion, a deletion, and, in the case of \Sip/, two context-switching actions. In the latter situation, the process continues execution upon an immediate return from the function {\sl delay()}. A succinct description of this scheme is given by the pseudo-code shown in Figure~\ref{Newdelay}. \begin{figure} {\singlespace \begin{pseudocode} { \item delay(t) \\ \{ \\ \item \+ E = findmin(future\_event\_set); \\ if( t+clock < E.clock) \{ \\ \+ /* no need to insert; process continues to execute */ \\ clock += t; \\ return; \\ \- \} \\ else \{ \\ $\cdots$ original code in delay() $\cdots$ \\ \} \\ \- \} \\ } \end{pseudocode} } \caption{Add an extra check to {\sl delay()}} \label{Newdelay} \end{figure} Using the modified {\sl delay(t)} function, we repeat the experiments described above to obtain average execution times for the two benchmark models. The results, shown in Table~\ref{Better}, indicate a significant improvement in performance compared to the previous results (see Tables~\ref{B1} and \ref{B2}). The impact of this modification on the performance of the \Sip/ subsystem is larger than its impact on the \Sif/ subsystem, largely because both process-switching and simulation-calendar overheads are reduced in \Sip/, while only simulation-calendar overheads are reduced in \Sif/. Though the comparison modification narrows the difference in performance between the \Sip/ and \Sif/ subsystems, the \Sif/ subsystem remains a consistently better performer than \Sip/. 
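The check of Figure~\ref{Newdelay} can be made concrete with a small C sketch. This is a simplified model, not \Si/'s actual code: the calendar here stores only reactivation times in a binary min-heap, and the names ({\sl heap\_insert}, {\sl findmin}) and bookkeeping counters are our own illustrative assumptions.

```c
/* Toy simulation calendar: a binary min-heap of reactivation times. */
#define CAP 1024
static double heap[CAP];
static int    hn = 0;

static void heap_insert(double t)
{
    int i = hn++;
    heap[i] = t;
    while (i > 0 && heap[(i - 1) / 2] > heap[i]) {  /* sift up */
        double tmp = heap[i];
        heap[i] = heap[(i - 1) / 2];
        heap[(i - 1) / 2] = tmp;
        i = (i - 1) / 2;
    }
}

static double findmin(void) { return hn ? heap[0] : 1e300; }

static double clock_now = 0.0;
static long   saved = 0, total = 0;

/* Modified delay(): if the caller would wake before every event
 * already in the calendar, skip the insert/extract pair entirely
 * and let the caller continue; otherwise fall through to the
 * normal insertion (and, in the real system, suspension) path. */
static int delay(double t)
{
    total++;
    if (clock_now + t < findmin()) {
        clock_now += t;          /* process continues immediately */
        saved++;
        return 1;                /* insertion avoided */
    }
    heap_insert(clock_now + t);  /* original delay() path */
    return 0;
}
```

With an event pending at time 10.0, a call delay(2.0) at clock 0 returns immediately and advances the clock, while delay(20.0) falls through to the normal insertion path.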
The performance improvement to be had from the additional comparison operation depends on the frequency of the desired property (i.e., that the currently executing process may continue execution without incurring calendar and context-switching overheads) relative to the cost of performing the comparison for every process that invokes the {\sl delay()} function: the more frequently the condition holds, the larger the savings. Using $f$ to denote the frequency with which the condition is true, $C_{test}$ and $C_{overhead}$ to represent the costs of comparison and overhead, respectively, and $r$ the ratio $C_{test}/C_{overhead}$, the enhancement is beneficial whenever \begin{equation} \nonumber C_{overhead} ~ > ~ C_{test} + ( 1 - f) \times C_{overhead} \end{equation} or equivalently, whenever \begin{equation} \nonumber f ~ > ~ \frac{C_{test}}{C_{overhead}} ~ = ~r . \end{equation} In support of this modification, it is known that when scheduling distributions (which dictate how reactivation-times are dispersed in the simulation calendar) are mixture distributions, reactivation-times tend to pile up towards the beginning of the simulation calendar~\cite{Sargent:study2}. Because such scheduling distributions are realistic, the modified scheme is almost certain to yield reduced execution times for most applications. A detailed analysis of the savings given by this method can be found in~\cite{CBI}. 
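To make the trade-off concrete, the expected per-invocation costs of the two schemes can be compared numerically. In the sketch below only the frequencies come from Table~\ref{Ratio}; the cost values and function names are illustrative assumptions:

```c
/* Expected cost per delay() invocation.  Without the check we always
 * pay the insert/extract-plus-switch overhead; with the check we
 * always pay the comparison, plus the overhead only when the
 * continue-immediately condition fails (probability 1 - f). */
static double cost_without_check(double c_overhead)
{
    return c_overhead;
}

static double cost_with_check(double f, double c_test, double c_overhead)
{
    return c_test + (1.0 - f) * c_overhead;
}
```

Normalizing $C_{overhead}$ to 1 and assuming $r = 0.1$, the check pays off at both measured frequencies ($f = 0.28$ and $f = 0.78$) and loses only when $f < r$, in agreement with the inequality above.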
\begin{table} \begin{center} \begin{tabular}{|l|c|} \hline &$saved/total$ \\ \hline Benchmark I & $27802/100000 \approx 28\% $ \\ \hline Benchmark II & $337954/433437 \approx 78\% $ \\ \hline \end{tabular} \end{center} \caption{Percentages of saved Insert and Extract\_min operations} \label{Ratio} \end{table} \begin{table} \small \begin{center} \begin{tabular}{|ll|r|r|r||r|} \hline & &\multicolumn{4}{|c|}{Simulated Customers ($\times 100$)} \\ & & 100 & 200 & 500 & gain \\ \hline \hline Benchmark I & \Sip/ & 13.6 & 27.2 & 67.9 & 8\%\\ \cline{2-6} & \Sif/ & 11.8 & 23.5 & 59.2 & 2\%\\ \hline Benchmark II & \Sip/ & 18.9 & 37.6 & 94.2 & 44\%\\ \cline{2-6} & \Sif/ & 16.9 & 34.0 & 84.7 & 16\%\\ \hline \end{tabular} \end{center} \caption{Execution time (in seconds) for Benchmark I and Benchmark II (improved)} \label{Better} \end{table} \subsection{Round-Robin and Processor-Sharing Algorithms} Service disciplines such as first-in-first-out (FIFO), round-robin (RR), and processor-sharing (PS) are functional parameters of service stations in queue-based simulations. As the systems we design and build grow increasingly complicated in both functionality and description, we are faced with a choice between analytic models that are weak in their assumptions and simulation models that are weak in the conclusiveness of their execution data. In most instances of practical interest we have little choice but to rely on good simulation models to answer questions related to system performance. Hence efficient techniques for implementing simulation algorithms, including algorithms for service disciplines, are useful as simulation execution enhancements. In the round-robin (RR) service discipline, a job is serviced for a single quantum $q$ at a time, sharing the service resource with other jobs undergoing the same service allocation. 
If the remaining service time required by a job exceeds the quantum size $q$, the job's processing is interrupted at the end of its quantum and the job is returned to the rear of the queue, where it awaits the service quantum it will receive in the next round. Instead of taking a naive approach which simulates the round-robin discipline by physically switching control from one process to another, we propose a computational scheme based on predicting the departure instants of serviced jobs leaving the pool of queued jobs. The computed departure instant of a particular job may be invalidated by one or more newly arriving jobs, i.e., jobs arriving after the corresponding departure event is scheduled but before it can occur. This is because the server must now attend to one or more previously unaccounted-for arrivals, and decrease the amount of attention it planned to give jobs already in the system prior to the new arrival(s). Consequently, a scheduled departure event that has been invalidated in this manner must be cancelled, and an updated departure instant, for the same or another job, scheduled in its place. In addition, the remaining service time requirement of each job in the system must be adjusted whenever an arrival event or a departure event occurs. The processor-sharing (PS) discipline schedules jobs as if the server were processing all the jobs in the facility queue simultaneously. That is, each job receives service at a rate inversely proportional to the number of competing jobs in the pool. A naive algorithm for simulating the PS discipline is based on a computation and prediction method which doles out an equal amount of service time to all jobs in the pool. This approach suffers because it is computationally demanding, especially when there are frequent updates of the remaining service time requirements of each job in the pool. 
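The arithmetic core of the computational RR scheme can be sketched as follows. This fragment is our own simplification: it is valid only between arrivals, since an arrival would invalidate the computed instants and force a recomputation, as described above; the function name and data layout are illustrative assumptions.

```c
#define MAXJOBS 64   /* bound on jobs simultaneously queued */

/* Compute departure instants under round-robin with quantum q by
 * arithmetic alone, with no process switches.  rem[] holds each
 * job's remaining service time on entry and is consumed; dep[]
 * receives the departure instants.  Assumes n < MAXJOBS and that
 * no new job arrives before the last departure. */
static void rr_departures(double rem[], double dep[], int n,
                          double q, double now)
{
    int    queue[MAXJOBS];               /* circular FIFO of job ids */
    int    head = 0, tail = 0, i, j;
    double t = now;

    for (i = 0; i < n; i++)              /* initial FIFO order */
        queue[tail++ % MAXJOBS] = i;

    while (head != tail) {
        j = queue[head++ % MAXJOBS];
        if (rem[j] <= q) {               /* final (possibly short) quantum */
            t += rem[j];
            rem[j] = 0.0;
            dep[j] = t;
        } else {                         /* one full quantum, rejoin rear */
            t += q;
            rem[j] -= q;
            queue[tail++ % MAXJOBS] = j;
        }
    }
}
```

For example, two jobs with remaining times 3 and 5 under quantum 2 depart at times 5 and 8, exactly the instants a per-quantum process-switching simulation would produce.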
To alleviate this computational requirement to some extent, we propose a lazy-update algorithm which accumulates the requisite amount of updating in a special variable, instead of directly performing the update on all jobs exhaustively. When the departure time of the earliest job to leave the pool is to be computed, the value resident in the special variable is subtracted from the remaining service time of the next job to depart, yielding its true remaining service time. With this modification, the only operations required for handling the pool are the INSERT and EXTRACTMIN primitives. By combining such infrequent updates with an efficient priority queue data structure, the lazy-update algorithm can be shown to reduce the $O(n^2)$ complexity of the naive algorithm to $O(n \log n)$. We conduct experiments to measure the execution performance of the proposed RR and PS algorithms in comparison to the naive algorithms, using the single server model. Tables~\ref{rr} and~\ref{ps} show that the proposed algorithms perform well compared to the naive algorithms. The RR discipline exhibits larger performance differences between the naive and computational approaches because of the large number of context-switches required by the former. The differences are less pronounced for the PS discipline because both approaches exploit computation. A detailed description of these algorithms and related experiments can be found in~\cite{RRPS}. 
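The lazy-update idea can be sketched in C as follows. This is a minimal model of the scheme under stated assumptions: the variable and function names are ours, and the pool is a flat array with a linear EXTRACTMIN for brevity, whereas a heap-based priority queue would give the $O(n \log n)$ bound cited above.

```c
#define PSMAX 64

/* Lazy-update processor sharing.  Rather than decrementing every
 * job's remaining service time as the server runs, we accumulate
 * the service attained by each resident job in a single variable.
 * A job arriving with demand d is keyed (attained + d); the next
 * departure is the minimum key, reached after (key - attained) * n
 * real time units when n jobs share the server. */
static double key[PSMAX];       /* pool of job keys */
static int    njobs = 0;
static double attained = 0.0;   /* service received per resident job */
static double now_t = 0.0;      /* current simulation time */

static void ps_arrive(double t, double demand)
{
    if (njobs > 0)
        attained += (t - now_t) / njobs;   /* lazy catch-up */
    now_t = t;
    key[njobs++] = attained + demand;      /* single INSERT */
}

/* Advance to, and return, the next departure instant.  Assumes no
 * arrival occurs before that instant; an intervening arrival would
 * require ps_arrive() to be called first. */
static double ps_next_departure(void)
{
    int i, m = 0;
    for (i = 1; i < njobs; i++)            /* EXTRACTMIN (linear here) */
        if (key[i] < key[m]) m = i;
    now_t += (key[m] - attained) * njobs;
    attained = key[m];
    key[m] = key[--njobs];
    return now_t;
}
```

For example, jobs of demand 3 arriving at times 0 and 1 depart at times 5 and 6, matching a direct PS computation: the first job runs alone for one time unit, the two then share the server until the first departs at time 5, and the survivor finishes its last unit alone.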
\begin{table} \small \begin{center} \begin{tabular}{|ll|r|r|r|r|r|} \hline & &\multicolumn{5}{|c|}{Simulated Customers ($\times 100$)} \\ & & 10 & 50 & 100 & 200 & 500 \\ \hline \hline & Naive & 3.5 & 17.2 & 34.1 & 67.8 & 169.5 \\ \cline{2-7} & Computational & 1.7 & 8.2 & 16.3 & 32.8 & 83.0 \\ \hline \end{tabular} \end{center} \caption{Comparison of execution time (in seconds) for RR algorithms} \label{rr} \end{table} \begin{table} \begin{center} \begin{tabular}{|ll|r|r|r|r|r|} \hline & &\multicolumn{5}{|c|}{Simulated Customers ($\times 100$)} \\ & & 10 & 50 & 100 & 200 & 500 \\ \hline \hline & Naive & 1.8 & 8.5 & 16.5 & 34.2 & 86.0 \\ \cline{2-7} & Lazy-Update & 1.5 & 7.4 & 14.9 & 29.3 & 73.2 \\ \hline \end{tabular} \end{center} \caption{Comparison of execution time (in seconds) for PS algorithms} \label{ps} \end{table} \section{Conclusions} Our experiences with the design and implementation of the \Si/ system have been amply rewarding. The support of a very reliable lightweight process library greatly reduced the effort required to build an efficient, experimental simulation test-bed. During the early stages of design and implementation, we faced several design choices which were at times not altogether consistent with our design principles and objectives. For example, the process scheduler could be implemented either by a dedicated process or by function invocation, with each method leading to a different version of \Si/. While the former suffers the overheads typically associated with process switching and control, the latter suffers in terms of application-interface inelegance, in that an extra function call is required of a terminating process. Our experiences suggest that, despite the use of lightweight processes, context-switching costs are not insignificant. 
This is clearly seen in the use of the simple comparison check which eliminates certain unnecessary context switches during simulation execution, improving the performance of both the \Sif/ and \Sip/ subsystems. Furthermore, about two-thirds of all overheads are due to thread creation/deletion-related processing within the LWP library, with the rest due to the layer above the library. Unfortunately, our experimental results show that such overheads can contribute up to 70\% of the total simulation time. Thus, simulation would speed up considerably if thread creation and deletion overheads were reduced. To this end, we have embarked on the design of a simple threads library~\cite{Xtech} which can provide more efficient thread operations and support a simulator's scheduling disciplines. With the \Si/ system we have been successful in developing a software infrastructure which provides an ideal experimental environment. Central to this capability is the modular design philosophy, which unambiguously defines interfaces to the system's functional components. New algorithms can therefore be implemented, tested, and experimented with almost effortlessly, requiring only that interfaces be matched. We have implemented new algorithms in \Si/ to simulate the round-robin and processor-sharing disciplines, and both the ease of algorithm incorporation in \Si/ and the performance of \Si/ with the new algorithms have been excellent. A current disadvantage of the \Si/ system is its lack of portability, owing to its implementation atop a machine-dependent lightweight process library. However, since \Si/ adopts a layered design, and because almost all lightweight process libraries support the universal functions of process creation, switching (i.e., yielding), and deletion, the effort required to port \Si/ to another lightweight process library is minimal. 
This effort need focus only on the process manipulation layer, the innermost layer in the \Si/ system; such a modification is therefore transparent to the application layer. With the advent of the standard POSIX threads package (e.g., \cite{Mueller}), we believe that the \Si/ system will be easily portable to any machine supporting POSIX threads.