The following paper was originally published in the Proceedings of the USENIX Summer 1993 Technical Conference, Cincinnati, Ohio, June 21-25, 1993.

For more information about the USENIX Association contact:
1. Phone: 510 528-8649
2. FAX: 510 548-5738
3. Email: office@usenix.org
4. WWW URL: https://www.usenix.org

Design and Implementation of a Simulation Library using Lightweight Processes

Janche Sang, Ke-hsiung Chung, Vernon Rego
Department of Computer Sciences
Purdue University
West Lafayette, IN 47907

Research supported in part by NSF award CCR-9102331, NATO-900108, ONR-93-1-0233, and ARO-93-G-0045.

Abstract

The \Si/ lightweight-process based system for simulating process interactions is an enhancement to the C programming language in the form of library primitives with sets of predefined data structures. The \Si/ system encapsulates an existing lightweight-process library to provide a discrete-event simulation environment supporting the process view. It was developed as a research testbed for investigating methods which support simulations efficiently. Easy extensions and modifications to the \Si/ system are a major design objective, accomplished through modularity and layering. This paper describes the system, our experiences with its implementation, and its applicability to simulation modeling. We report on performance measurements of different implementations of the simulation scheduler, and of different algorithms for simulating service disciplines.
\section{Introduction} The process view of simulation, developed and refined over the past two decades~\cite{Evans,Franta}, is a view that enables an analyst to model a discrete-event system in terms of interacting processes. A key advantage of the process-interaction approach to simulation modeling is that describing a model through processes and their interactions makes the entire modeling activity, from model design to code debugging and execution, require relatively little effort. This is in comparison to the level of effort required by alternate techniques, such as the event-scheduling simulation method, which offers a lower level of abstraction than the process-interaction method. In the process view of simulation, processes employ a variety of statements to define the flow of entities (transactions, customers, jobs, etc.) through the system. Two or more processes may compete for resources and exchange messages with one another for the purpose of synchronization. The relatively high level of abstraction enjoyed by process-oriented models makes model construction relatively effortless. Another benefit is the added flexibility given by application-level processes, and the ease with which process-based models can be scaled to implement simulations of large systems~\cite{MacDougall}. In this work we describe the design of a process-interaction system based on lightweight processes. \Si/, a system for \S/imulating process \i/nteractions, is a process-oriented discrete-event simulation system. It has been designed to enhance the capabilities of the C programming language through a set of primitives which provide a quasi-parallel programming environment. The primary goal of this work is to develop a research vehicle for conducting experiments and obtaining measurements in empirical evaluations of algorithms used in simulation systems.
Such a research vehicle helps in identifying principles of good design and alternatives which contribute toward efficient, accurate, and reliable simulation software. A secondary goal is to present an interface that is simple and straightforward, yet sophisticated enough to meet the needs of a wide range of applications. The primitives provided in the \Si/ skeleton are sufficient for a variety of tasks, including process manipulation and synchronization, random number generation, and statistics collection, all of which allow an application to be described as fully as possible without application-specific detail. There are two main approaches to designing process-oriented simulation software. One approach is to design a specific simulation language. Several existing languages, such as SIMULA\cite{SIMULA} and GPSS\cite{GPSS}, belong to this category. The other approach is to construct and place simulation primitives on top of an existing language. Examples of this approach can be found in \cite{Bruno,Kriz,CSIM,Sharma}, where Ada, Extended Pascal with coroutines, C, and Modula-2 have been used as host languages. We decided to take the second approach with \Si/ for the following reasons. First, it requires less effort to extend and modify a library than to design and implement a new language. Second, users tend to be more comfortable using a familiar and trusted language than having to learn yet another new language~\cite{Ferrari}. To construct an efficient process-oriented environment, we used so-called lightweight processes, which operate within a single address space. In fundamental structure, a lightweight process is no different from a process; each has its own stack, local variables, and program counter. However, compared to a process, a lightweight process is lighter in terms of the overheads associated with creation, context-switching, interprocess communication, and other routine functions.
We decided to base \Si/ on the Sun Lightweight-Process library (LWP) \cite{LWP}. We chose this as a kernel because of its availability in our research environment, its capacity for reliable context-switching and allocation of protected stacks, and its continued development. Further, the library's potential conformance with POSIX standards promises easy portability, making it all the more attractive. The remainder of the paper is organized as follows. In Section 2 we briefly introduce the Sun Lightweight-Process library. Some design issues are discussed in Section 3, and Section 4 details the implementation of the four major modules in \Si/: process management, process coordination, resource management, and the statistics functions. In Section 5 we illustrate some of the strengths of the \Si/ system, with early measures of performance given in Sections 5.1 and 5.2, a simple but remarkable performance enhancement, easily implementable in \Si/, described in Section 5.3, and a brief comparison of different algorithms for simulating service disciplines given in Section 5.4. A short conclusion is presented in Section 6. \section{Design Issues} The design of the \Si/ system was motivated by three primary objectives, namely, \begin{itemize} \item to provide a simple and effective interface which encapsulates the LWP library, \item to provide modularity that supports easy extensions and modifications, and \item to achieve efficient simulation executions. \end{itemize} The \Si/ system employs the LWP library as its kernel. A layered design enables applications to be created without direct utilization of LWP functions. Therefore, any changes to the LWP library are transparent to the application level. \Si/ was designed with a modular structure to simplify the replacement and modification of specific parts of the simulator. Modularity is well recognized as an important concept in system design because it provides for convenient system maintenance.
This property is invaluable in software systems like \Si/ which are used as research testbeds, frequently undergoing modifications to improve existing algorithms or to add new features which improve system functions and performance. For example, two different approaches are provided in \Si/ to simulate the round-robin service discipline. The modular structure allows us to implement these two algorithms without modifying the system structure. The performance of executing simulations is a major concern in our design. Since simulations are usually time-consuming, the need to significantly reduce the execution time of simulation programs is a critical one. One source of overhead with the potential to lengthen execution times is context-switching between lightweight processes. Although a lightweight process's context-switching overhead is cheaper than that of its heavier UNIX counterpart, it can still contribute significantly to execution time, particularly when the number of context-switches is large. Another source of overhead may be attributed to \Si/'s layered design, which results in a series of function calls. This is unavoidable unless the layers are broken up and combined. But the overhead can be minimized if only a thin layering of sparse code is used in \Si/ to encapsulate LWP primitives. In general, our objectives in designing and building the \Si/ system are to build a layered, modular, and efficient simulation system for process-interaction based simulations. Our eventual goal is the integration of the \Si/ and EcliPSe systems~\cite{Naka,Eclipse}. The latter system provides high-level simulation primitives which enable simulation tasks to be executed in parallel. With such support, applications developed in \Si/ can be conveniently executed on networks of heterogeneous workstations. The requisite layering for such a system is shown in Figure~\ref{layers}.
\section{Implementation Issues} The \Si/ system consists of four major components, namely, a process management component, a process coordination component, a resource management component, and a probability and statistics library. Figure~\ref{modules} depicts the layout of these components and examples of the functions provided within each. \subsection{Module Descriptions} The process creation task in \Si/ is achieved through an invocation of the function {\sl si\_create(func, nargs, arg1,$\cdots$, argn)}, which encapsulates the LWP function {\sl lwp\_create()}. Following the creation of a process, a second invocation, {\sl si\_insert(E)}, saves the reactivation record {\sl E} in the simulation calendar, ensuring that the specified process is reactivated at time {\sl E.clock}. \noindent{\\ \bf Process Management} Transfer of control between specific processes is most easily done with the aid of the {\sl lwp\_yield()} function. However, a problem arises because the LWP library provides its own process scheduler, which schedules processes based on LWP scheduling priority; the scheduling rule always selects a process with the highest priority to run. If more than one such process exists, the scheduler uses a FIFO policy within the priority class. In building \Si/ on top of the LWP library, a potential problem arises when an executing process terminates. At this point, the LWP scheduler will transfer control to the highest LWP priority process, one which is not necessarily the process whose execution is imminent in simulation logic, i.e., one whose activation record in the simulation calendar has the smallest reactivation time. While an apparent solution is to require a mapping between priorities in \Si/ and priorities in LWP, the integer priority format used in LWP precludes this option because \Si/'s priorities are double-precision numbers, representing reactivation (simulation) times.
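To make the varargs encapsulation concrete, the following is a minimal sketch of how a call like {\sl si\_create(func, nargs, arg1,$\cdots$, argn)} might pack its variable arguments into a record for later dispatch. The struct layout, the MAXARGS bound, and the names in the sketch are our illustrative assumptions; \Si/'s actual implementation hands the arguments off to {\sl lwp\_create()}.

```c
#include <stdarg.h>
#include <stdlib.h>

#define MAXARGS 8

/* A record holding a process's entry function and its arguments;
   the layout is our assumption, not Si's actual structure. */
struct proc_rec {
    void (*func)(int argc, long argv[]);
    int   nargs;
    long  argv[MAXARGS];
};

/* Sketch of si_create(func, nargs, arg1, ..., argn): pack the
   variable arguments into a record that a scheduler could later
   place on the new process's stack. */
struct proc_rec *si_create_sketch(void (*func)(int, long[]), int nargs, ...)
{
    struct proc_rec *p = malloc(sizeof *p);
    va_list ap;
    int i;

    p->func = func;
    p->nargs = nargs;
    va_start(ap, nargs);
    for (i = 0; i < nargs && i < MAXARGS; i++)
        p->argv[i] = va_arg(ap, long);
    va_end(ap);
    return p;
}

/* Dispatch: what the scheduler would do when the process first runs. */
void proc_start(struct proc_rec *p) { p->func(p->nargs, p->argv); }

static long last_sum;                    /* observed by the demo entry point */
static void adder(int argc, long argv[]) /* example process body */
{
    long s = 0;
    for (int i = 0; i < argc; i++) s += argv[i];
    last_sum = s;
}
```

A real implementation must also arrange a protected stack for the new process, which is exactly the service the LWP kernel provides.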
Therefore it is necessary to determine an alternate scheme for transferring control between a terminating process and \Si/'s scheduler. Our strategy is to force the LWP scheduler to schedule only a limited number of processes in \Si/. To achieve this, each process is assigned an LWP priority, either a MINPRIO (minimum priority) value or a MAXPRIO (maximum priority) value. This principle is stated as follows: \begin{description} \item[Principle:] At any given instant, at most two processes exist with the highest priority value MAXPRIO. One of these is \Si/'s process scheduler, and the other is a currently executing application-level process. \end{description} With this principle, when a currently executing process terminates, the process scheduler obtains control under the LWP library's own scheduling policy since it is the unique process left with priority value MAXPRIO. This solves the control transfer problem. Figure~\ref{Scheduler} shows the pseudo-code required for the process scheduler. As implied by the principle, an executing process scheduler selects the highest priority process, say process {\sl pid}, from the simulation calendar, using smallest reactivation time as a measure of priority. It raises the LWP priority of process {\sl pid} to MAXPRIO and subsequently transfers control to process {\sl pid}. When the process scheduler is invoked due to process suspension, the scheduler follows exactly the opposite routine. It reduces the LWP priority of the suspended process, say process {\sl pid}, from a value MAXPRIO to a value MINPRIO, prior to saving its reactivation record in the simulation calendar, if necessary. Observe that two distinct processes are created within the {\sl main()} routine shown in Figure~\ref{Scheduler}. The first process executes a {\sl si()} function which the application code treats as a main routine. The second process is the process scheduler whose function it is to execute the {\sl schedule()} routine. 
Both are created with the highest priority value MAXPRIO. The notation \Sip/ is used to emphasize the fact that the scheduler is implemented as a process here. An alternate approach toward implementing transfer of control between processes is through function invocations. To demonstrate this point, we implement a different version of the process scheduler, through a small modification of the function {\sl schedule()}. In the remainder of the text, we use the notation \Sif/ to emphasize the use of function invocations in the implementation of the scheduler. The function {\sl schedule()} in \Sif/ uses a strategy similar to that used in \Sip/ to switch control between processes. Since the scheduler in \Sif/ is no longer a process, an executing application-level process is the only process with the highest priority value MAXPRIO. First, in function {\sl schedule()}, the selection of the next process to execute, say {\sl pid}, is made by accessing the simulation calendar. Following this, process {\sl pid}'s priority is raised to value MAXPRIO, and the currently executing process (i.e., the process seeking suspension) reduces its own priority by invoking function {\sl lwp\_setpri()}. Through this priority reduction scheme, control is transferred to the single remaining highest priority process, i.e., {\sl pid}. The scheme described above allows for transfer of control between a process requiring suspension and a process whose execution is imminent. Unfortunately, a problem arises when an executing process terminates naturally, without either voluntarily or involuntarily requesting suspension. Upon its termination, there is no unique highest-priority process to take control. Left to its own devices the LWP scheduler would give control to the first process in line within the single priority class shared by all other processes. Needless to say, this would result in incorrect simulation logic. 
To get around this problem, we use an additional function called {\sl si\_exit()} which a process must invoke just prior to its termination. The function {\sl si\_exit()} invokes function {\sl schedule()}, which uses function {\sl lwp\_destroy()} to eliminate the currently executing process, effectively a suicide operation. As before, {\sl schedule()} also selects a process whose execution is imminent, so that the invocation of function {\sl lwp\_destroy()} causes control to switch to the new high-priority process. These actions are summarized in the pseudo-code shown in Figure~\ref{Scheduler1}. It is worth mentioning that there is yet another way to solve the ``natural termination" problem. By modifying the return address of a process, control can be transferred to some function specified in the address when the process terminates naturally. This can be accomplished either by modifying {\sl lwp\_create()} to set the address of the function {\sl schedule()} as the return address, or by implementing a new thread creation routine which achieves the same effect~\cite{Gautron}. Since our design goals emphasize high level abstractions and layered design, we chose not to adopt this option.
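Both schedulers manipulate the simulation calendar only through {\sl si\_insert()} and {\sl extract\_min()}. A binary min-heap keyed on reactivation time is one natural realization of the calendar; the sketch below is our illustrative assumption, since the text does not specify \Si/'s actual calendar structure.

```c
#include <stdlib.h>

/* Reactivation record, as in the pseudo-code: which process to
   resume, and at what simulation time. */
struct event { int pid; double clock; };

/* The future-event set as a binary min-heap ordered on clock
   (an assumed representation, for illustration only). */
struct fes { struct event *a; int n, cap; };

struct fes *fes_new(int cap)
{
    struct fes *f = malloc(sizeof *f);
    f->a = malloc(cap * sizeof *f->a);
    f->n = 0; f->cap = cap;
    return f;
}

/* si_insert(E): sift the new record up toward the root. */
void si_insert(struct fes *f, struct event e)
{
    int i = f->n++;
    while (i > 0 && f->a[(i - 1) / 2].clock > e.clock) {
        f->a[i] = f->a[(i - 1) / 2];
        i = (i - 1) / 2;
    }
    f->a[i] = e;
}

/* extract_min: remove and return the record with the smallest
   reactivation time -- the process whose execution is imminent. */
struct event extract_min(struct fes *f)
{
    struct event min = f->a[0], last = f->a[--f->n];
    int i = 0;
    for (;;) {
        int c = 2 * i + 1;                       /* left child */
        if (c >= f->n) break;
        if (c + 1 < f->n && f->a[c + 1].clock < f->a[c].clock)
            c++;                                 /* smaller child */
        if (last.clock <= f->a[c].clock) break;
        f->a[i] = f->a[c];
        i = c;
    }
    f->a[i] = last;
    return min;
}
```

Both operations are O(log n) in the number of pending records, which matters because every {\sl delay()} call touches the calendar.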
\begin{figure} {\singlespace \begin{pseudocode} { \item schedule() \\ \{ \\ \item \+ while(future\_event\_set is not empty) \{ \\ \+ if previously executing process is suspended\\ \+ lwp\_setpri(ppid,MINPRIO); \\ \- E = extract\_min(future\_event\_set); \\ clock = E.clock; \\ lwp\_setpri(E.pid,MAXPRIO); \\ /* Now only E.pid and the scheduler have highest priority.*/\\ si\_yield(E.pid) /* resume execution of process pid */\\ \- \} \\ \- \} \\ \item main()\\ \{ \\ \item \+ int si(); $\cdots$initialization$\cdots$ \\ lwp\_create(si\_tid,si,MAXPRIO, $\cdots$); \\ lwp\_create(sched\_tid,schedule, MAXPRIO, $\cdots$); \\ \- \} \\ } \end{pseudocode} } \caption{Process Scheduling in $Si_p$} \label{Scheduler} \end{figure} \begin{figure} {\singlespace \begin{pseudocode} { \item schedule() \\ \{ \\ \item \+ while(future\_event\_set is not empty) \{ \\ \+ E = extract\_min(future\_event\_set); \\ clock = E.clock; \\ lwp\_setpri(E.pid,MAXPRIO); \\ if the current process is dying\\ \+ lwp\_destroy(cpid); /* suicide*/ \\ /* control transfers to E.pid automatically. */ \\ \- else \+ lwp\_setpri(cpid,MINPRIO); \\ /* control transfers to E.pid automatically. */ \\ \- \- \} \\ \- \} \\ \item main()\\ \{ \\ \item \+ int si(); $\cdots$initialization$\cdots$ \\ lwp\_create(si\_tid,si,MAXPRIO, $\cdots$); \\ \- \} \\ } \end{pseudocode} } \centerline{} \caption{Process Scheduling in $Si_f$} \label{Scheduler1} \end{figure} There is another important function in \Si/ known as the {\sl delay()} function. When an executing process decides to suspend itself for {\sl t} units of simulated time, it invokes the function {\sl delay(t)}. This function inserts the invoking process's reactivation record, including its reactivation instant {\sl clock + t}, into the simulation calendar. Since the invoking process must undergo suspension, the process scheduler selects as the next process to execute that process in the simulation calendar with the lowest reactivation time. 
In the \Sip/ design, transfer of control to the process scheduler is done via the {\sl lwp\_yield(scheduler)} action, while in the \Sif/ design, the scheduler is invoked directly through a call to function {\sl schedule()}. This is described in the pseudo-code shown in Figure~\ref{delay}. \noindent{\\ \bf Process Coordination} The \Si/ system provides two distinct coordination mechanisms to support synchronization between processes. One mechanism is through user-declared events, effected by calling the {\sl wait\_event()} and {\sl set\_event()} primitives. A process is suspended if it invokes function {\sl wait\_event(e)} because it is forced to wait until event {\sl e} occurs. Event {\sl e} is said to occur when some other process invokes function {\sl set\_event(e)}. At this point, all processes waiting for event {\sl e} are reactivated simultaneously. In \Si/, an event {\sl e} is declared to be of type {\sl Event} and initialized by the {\sl create\_event()} function. The other mechanism for process synchronization is through message-passing. There is a predefined data structure called {\sl Mailbox} which is created by the function {\sl create\_mailbox()}. Messages can be sent and received through the mailbox by using functions {\sl si\_send()} and {\sl si\_receive()}, respectively. The function {\sl si\_send(mb,msg)} deposits the message {\sl msg} in the mailbox {\sl mb}. If there is a process awaiting the arrival of this message, {\sl si\_send()} enables the process to access the message and consequently be reactivated. The reverse function {\sl si\_receive(mb,\&msg)} obtains the message from the mailbox. If no message is available, the invoking process is suspended until a message arrives. For simplicity, the size of a message is limited to one word (i.e., the size of an integer or pointer). This is patterned after the design of the XINU system~\cite{XINU}.
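The one-word mailbox semantics can be sketched as a pair of bounded queues, one for pending messages and one for suspended readers. Since this sketch has no scheduler attached, return codes stand in for the suspension and reactivation that \Si/ would perform; the field names, the QCAP bound, and the pid-based waiter queue are our assumptions.

```c
#include <string.h>

#define QCAP 16

/* One-word-message mailbox, after the XINU pattern the text cites.
   Control transfer is elided: si_send() returns the pid of a reader
   to reactivate (or -1), and si_receive() returns 0 when the caller
   would be suspended.  These conventions are illustrative only. */
struct mailbox {
    long msgs[QCAP];    int mh, mn;  /* queued messages     */
    int  waiters[QCAP]; int wh, wn;  /* suspended readers   */
};

void create_mailbox(struct mailbox *mb) { memset(mb, 0, sizeof *mb); }

/* si_send: queue the message; if a reader was waiting, return its
   pid so a scheduler could reactivate it (it then re-issues its
   receive and finds the message).  Returns -1 if no one waited. */
int si_send(struct mailbox *mb, long msg)
{
    mb->msgs[(mb->mh + mb->mn++) % QCAP] = msg;
    if (mb->wn > 0) {
        int pid = mb->waiters[mb->wh];
        mb->wh = (mb->wh + 1) % QCAP; mb->wn--;
        return pid;                  /* reactivate this reader */
    }
    return -1;
}

/* si_receive: take a queued message if one exists (returns 1);
   otherwise record the caller as a waiter -- in Si it would now be
   suspended until a sender arrives (returns 0). */
int si_receive(struct mailbox *mb, int pid, long *msg)
{
    if (mb->mn > 0) {
        *msg = mb->msgs[mb->mh];
        mb->mh = (mb->mh + 1) % QCAP; mb->mn--;
        return 1;
    }
    mb->waiters[(mb->wh + mb->wn++) % QCAP] = pid;
    return 0;
}
```

Restricting messages to one word keeps both queues fixed-size arrays and makes every operation constant-time, which is presumably why the XINU-style design was adopted.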
\begin{figure} {\singlespace \begin{pseudocode} { \item delay(t) \\ \{ \\ \+ E.pid = CurrentPID; \\ E.clock = clock + t;\\ si\_insert(E); \\ lwp\_yield(scheduler); \\ /* or schedule() in $Si_f$ */\\ \- \} \\ } \end{pseudocode} } \caption{The function {\sl delay()}} \label{delay} \end{figure} \noindent{\\ \bf Resource Management} In contrast to processes, which are used to model active components of a system, resources are used to model passive system objects with mutually exclusive access. In other words, processes request access to resources, use these resources for a certain length of time, and finally release them and proceed with different activities. The \Si/ system supports two basic functions for resource access, called the {\sl request(r)} and {\sl release(r)} functions (see Figure~\ref{Resource}). The resource object is declared to be of type {\sl Resource} and initialized by a {\sl create\_resource()} function. When a resource {\sl r} is occupied, other requesting processes must wait in a queue associated with resource {\sl r}. When resource {\sl r} is released, a suspended process at the front of the queue is given permission to resume, with access to {\sl r}. In a real application, a variety of queueing disciplines is possible, including first-in first-out (FIFO), round-robin (RR), processor-sharing (PS), etc. Some, such as the latter two, are more complicated than others. These complex disciplines utilize different rules in selecting the next process to execute, when faced with a choice. To give analysts a common interface to these disciplines, the \Si/ system employs a function {\sl use(r,t)} (patterned after CSIM~\cite{CSIM}) to allow a process to utilize a resource {\sl r} for a given length of time {\sl t}. The queueing discipline is specified as a parameter when resource {\sl r} is initialized.
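The common-interface idea behind {\sl use(r,t)} amounts to a function pointer stored in the resource at creation time. A compact sketch follows; the struct fields and the stub disciplines are our assumptions (\Si/'s real {\sl fcfs()} and {\sl ps()} also suspend the caller, as Figure~\ref{Resource} shows).

```c
/* Minimal resource with a pluggable service-discipline function,
   mirroring the create_resource(fp) / use(r,t) pair.  The fields
   and the stand-in disciplines are illustrative assumptions. */
struct resource;
typedef void (*discipline)(struct resource *r, double t);

struct resource {
    discipline fp;      /* scheduling function bound at creation */
    double busy_time;   /* toy statistic updated by a discipline */
    int    uses;
};

void create_resource(struct resource *r, discipline fp)
{
    r->fp = fp; r->busy_time = 0.0; r->uses = 0;
}

/* use(r,t): one-line dispatch through the stored function pointer,
   giving every discipline the same caller interface. */
void use(struct resource *r, double t) { (*(r->fp))(r, t); }

/* Two stand-in "disciplines" that only record statistics; a real
   fcfs() would request(), delay(), release(), and ps() would enter
   the sharing pool, as in the figure. */
void fcfs_stub(struct resource *r, double t) { r->busy_time += t; r->uses += 1; }
void ps_stub(struct resource *r, double t)   { r->busy_time += t; r->uses += 2; }
```

Because the discipline is resolved through {\sl r->fp}, an analyst can add a new discipline by writing one function and passing it to {\sl create\_resource()}, with no change to the callers of {\sl use()}.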
For example, the statement {\sl r = create\_resource(ps)} binds resource {\sl r} with the {\sl ps} function which is predefined in \Si/ to simulate the PS discipline. This simple interface also provides a level of modularity which lets analysts develop their own queueing disciplines in a fairly effortless manner. \begin{figure} {\singlespace \small \begin{pseudocode} { \item request(r) \\ \{ \\ \+ if (resource r is free) \\ \+ flag r as occupied; \\ \- else \{ \\ \+ compute required statistics; \\ insert current process's pid at tail of queue. \\ lwp\_yield(scheduler); \\ /* or schedule() in $Si_f$ */\\ \- \} \\ \- \} \\ \item release(r)\\ \{ \\ \+ if (waiting queue for resource r is not empty) \{\\ \+ remove process pid from head of queue; \\ compute required statistics; \\ si\_insert(E); /* with E.pid = pid, E.clock = clock */\\ \- \} \\ else \{ \\ \+ compute required statistics; \\ flag resource r as free; \\ \- \} \\ \- \} \\ \item fcfs(r,t) /* first-come, first-served */ \\ \{ \\ \+ request(r); \\ \+ delay(t); \\ \- release(r); \\ \- \} \\ \item ps(r,t) /* processor-sharing */ \\ \{ \\ \+ insert\_into\_pool(r,t); \\ lwp\_yield(scheduler); \\ /* or schedule() in $Si_f$ */\\ /* regain control from scheduler */ \\ delete\_from\_pool(r); \\ \- \}\\ $\cdots$ \\ \item create\_resource(fp) \\ \{ \\ \+ $\cdots$initialization$\cdots$; \\ r->fp = fp ; /* fp is a scheduling function pointer */\\ $\cdots$ \\ \- \} \\ \item use(r,t)\\ \{ \\ \+ (*(r->fp))(r,t); \\ \- \} \\ } \end{pseudocode} } \caption{Resource Management in {\bf ${\cal S}i$}} \label{Resource} \end{figure} \noindent{\\ \bf Probability and Statistics Functions} The \Si/ system provides the necessary random number generation functions, such as {\sl uniform()} for generating deviates from a uniform distribution and {\sl expon(u)} for generating deviates from an exponential distribution with mean {\sl u}. In addition, \Si/ supports two types of statistics.
The first type is a predefined type, involving data that is implicitly associated with a resource and automatically collected, summarized, and reported. For example, applications can use functions such as {\sl util(r)} and {\sl qlen(r)} to obtain the utilization of and queue-size at resource {\sl r}. The second type is a user-defined type, requiring user-tables for statistics collection. For example, an explicit invocation of the function {\sl putstat(t,x)} inserts datum {\sl x} into a user-defined table {\sl t} which is declared with the type {\sl Table}. The sample mean and variance can be obtained by calls to functions {\sl getmean(t)} and {\sl getvar(t)}, respectively. At present, the \Si/ system also supports functions {\sl reset\_resource(r)} and {\sl reset\_table(t)} to clear the statistics-collection fields in resource {\sl r} and table {\sl t}, respectively. There are two advantages to using these functions. The first is that they can be used to eliminate the effects of the start-up transient in simulations. The second is that they can be used in the regenerative simulation method, where independent samples are obtained from independent cycles of a simulated system~\cite{Crane}. \section{Performance Measurements} In this section we first present some performance measurements obtained from the \Si/ system, then a simple performance enhancement which can significantly reduce simulation execution time, and finally some specific algorithms for service disciplines. An important reason for pursuing modular design and layering with the \Si/ system is the potential for easy modification, and thus the use of \Si/ as a testbed for new ideas in simulation. In the following subsections we present two different examples of how this approach has proven beneficial.
The first example is motivated by recognition of the simple fact that simulation execution performance can be significantly enhanced if the scheduler's interaction with the simulation calendar is slightly modified. The second example demonstrates that new computation-based algorithms, which simulate service disciplines more efficiently than direct process-mapped algorithms, are easy to incorporate in a modular \Si/ system. The configuration used in our measurements was a Sun SPARC IPC workstation (15.7 MIPS, 8 MB memory, 64 KB cache) running SunOS 4.1. The measured times presented here are averages, and it should be emphasized that the measured averages are not intended to represent absolute performance but rather relative performance for a particular parameter configuration. Thus the comparison of average times is of more interest than a comparison of raw numerical data. \subsection{Cost of Operations} Table~\ref{CTX} shows a set of measured system overheads for the tasks of process creation and context switching in both the \Sip/ and \Sif/ subsystems. Also included, for the purpose of comparison, are the corresponding overheads in the LWP library. Not surprisingly, the \Sif/ subsystem exhibits less context-switching overhead than its \Sip/ counterpart, primarily due to its use of a function, instead of an additional process, for process scheduling.
\begin{table} \begin{center} \begin{tabular}{|l||r|r|r|} \hline \multicolumn{1}{|c||}{Operations} & \multicolumn{1}{c|}{LWP} & \multicolumn{1}{c|}{\Sip/} & \multicolumn{1}{c|}{\Sif/} \\ \hline {Creation + Deletion} & 660 & 960 & 860 \\ \hline {Process Switches } & 70 & 220 & 170 \\ \hline \end{tabular} \end{center} \caption{Latency of Operations (in microseconds)} \label{CTX} \end{table} \subsection{Benchmark Measurements} A pragmatic approach toward evaluating the performance of the \Si/ system is through the use of benchmark models, using both ease of model development and execution times as indicators of performance. Though our primary interest is in the \Sip/ and \Sif/ subsystems, we have included the CSIM system as a basis of comparison. We chose CSIM because it is a sound C-based process-oriented simulation system that has been gaining an increasing amount of popularity in the modeling of complex computer systems~\cite{Complex}. In addition, using the same language (i.e., C) to realize models makes the coding of equivalent definitions considerably easier. In this subsection, we develop models for a single-server queueing system, and a multiqueueing system for a token ring local area network. Both models have previously been used as benchmarks in comparing simulation systems~\cite{Benchmarks,Sharma}. \bigskip \noindent{\\ \bf Benchmark I: A Single Server Model} The first model is simple, describing a single-server queueing station. The customer arrival process is Poisson, and customer service times are independent, exponentially distributed random variables. We assume that customers are served in their order of arrival (i.e., FIFO discipline). Figure~\ref{MM1} shows the code of this model. In the first experiment, we execute the benchmark program to measure average execution times versus a varying number of simulated customers. 
Each execution is repeated several times, using different random number seeds in each case, and the average execution time is computed. \begin{figure} {\singlespace \begin{pseudocode} { \item \#include \\ \#define NMAX 10000 /* \# simulated cust. */\\ \#define SM 4.0 /* mean service time */\\ \#define IM 5.0 /* mean cust. interarrival time */\\ Resource r; \\ Event done; \\ si() \\ \{ \\ \+ r = create\_resource(); \\ done = create\_event(); \\ si\_create(gen\_cust,0); \\ wait\_event(done); \\ printf("\%f, \%f$\backslash$n", qlen(r), util(r)); \\ \- \} \\ gen\_cust() \\ \{ int k; \\ \+ for(k=0; ; k++) \{ \\ \+ delay(expon(IM)); \\ si\_create(cust,1,k); \\ \- \} \\ \- \} cust(k)\\ \{ \\ \+ request(r); \\ \+ delay(expon(SM)); \\ \- release(r); \\ if(k==NMAX) \\ \+ set\_event(done); \\ \- \- \} } \end{pseudocode} } \caption{A Single Server Queueing Model} \label{MM1} \end{figure} \noindent{\\ \bf Benchmark II: A Multiple Queue Model} The second model is a little more elaborate, utilizing a multiqueue system with a roving server to emulate a token-ring protocol~\cite{Spragins}. The multiqueue system represents $N$ independent computer stations situated on a ring, and the roving server represents the token. Messages made up of packets are generated by each station for transmission to other stations on the ring. A single token is passed unidirectionally from one station to its successor on the ring, to provide stations with a mechanism for conflict-free access to the ring for packet transmissions. A station which acquires the token and has queued packets is allowed to complete transmission of a single packet before relinquishing control of the token to the succeeding station on the ring. It is of some interest to determine queueing characteristics of packets at different stations as a function of ring parameters and station traffic.
The parameters of the model include message interarrival time distributions, and packet transmission time distributions at the different stations on the ring. For convenience, we assume that all interarrival time distributions are identical, each being exponential with mean $1/\lambda$. Also, for convenience, assume that all packet transmission time distributions are identical, each being exponential with mean $1/\mu$. Finally, assume that the token-passing time between consecutive stations on the ring is a small constant, a function of ring delay. A similar setup which uses CSIM to model the token ring network can be found in~\cite{Edwards}. \noindent{\\ \bf Empirical Results and Interpretation} The results of both experiments, given in Tables~\ref{B1} and \ref{B2}, suggest that the two \Si/ subsystems are competitive with CSIM in terms of performance. In particular, the \Sif/ system outperforms CSIM by up to 20\%. We emphasize that this does not imply the \Si/ system is better than CSIM in all respects. Indeed CSIM is a very stable system, developed and used over several years, while \Si/ is still an experimental and evolving system. 
\begin{table} \begin{center} \begin{tabular}{|l|r|r|r|r|r|} \hline &\multicolumn{5}{|c|}{Simulated Customers ($\times 100$)} \\ & 10 & 50 & 100 & 200 & 500 \\ \hline \hline \Sip/ & 1.5 & 7.4 & 14.8 & 29.7 & 73.8 \\ \hline \Sif/ & 1.2 & 6.1 & 12.1 & 24.0 & 60.8 \\ \hline CSIM & 1.5 & 7.6 & 15.1 & 30.4 & 75.9 \\ \hline \end{tabular} \end{center} \caption{Execution time (in seconds) for Benchmark I} \label{B1} \end{table} \begin{table} \begin{center} \begin{tabular}{|l|r|r|r|r|r|} \hline &\multicolumn{5}{|c|}{Simulated Customers ($\times 100$)} \\ & 10 & 50 & 100 & 200 & 500 \\ \hline \hline \Sip/ & 3.5 & 16.7 & 33.5 & 67.3 & 168.5 \\ \hline \Sif/ & 2.1 & 10.0 & 19.9 & 39.7 & 100.3 \\ \hline CSIM & 2.6 & 13.1 & 26.1 & 51.9 & 130.8 \\ \hline \end{tabular} \end{center} \caption{Execution time (in seconds) for Benchmark II} \label{B2} \end{table} It is interesting to observe that the \Sif/ subsystem consistently outperforms the \Sip/ subsystem, attaining as much as a 40\% improvement in execution time for Benchmark II. This is because \Sif/ employs functions to schedule processes instead of dedicating a process to the task; consequently, \Sif/ incurs significantly less context-switching overhead. These results indicate that we have achieved a reasonable level of efficiency, one of the main considerations in our original set of objectives. \subsection{A Scheduling Enhancement} Upon examining the code for the function {\sl delay()} (see Figure~\ref{delay}), it becomes clear that a process reactivation record which has just been inserted into the simulation calendar may well represent the process whose execution is imminent. In such a case, the insertion of this reactivation record is immediately followed by its deletion from the simulation calendar. 
Clearly, the cost of insertion and subsequent deletion can be avoided if the process in question is recognized to be the process whose execution is imminent. Apart from the insertion and deletion savings, unnecessary context-switches between such a process, i.e., one undergoing a potential suspension, and the scheduler may be avoided. Recognizing such a situation entails a comparison operation in which the scheduler determines whether the simulation priority of the process at hand is greater than the simulation priority of the highest-priority process in the simulation calendar. Because this comparison must now be performed for each process scheduled for execution, there is a trade-off between the new scheme (which incurs the additional comparison cost) and the old scheme (which incurs no comparison cost, but suffers avoidable insertion-deletion pairs of operations and context-switches). To obtain a rough assessment of the frequency of such unnecessary insertion-deletion actions on process reactivation records, we measure the ratio of such occurrences to the total number of times that the function {\sl delay()} is invoked. Recall that invocation of {\sl delay()} initiates a process's suspension via a control-switch from the process to the scheduler, followed by a control-switch to a new, or the same, process. Table~\ref{Ratio} contains the percentages of such unnecessary actions for both benchmark programs. Surprisingly, the avoidable cost can be as high as 78\% for the multiqueue model. This simple idea is incorporated in \Si/ through a small modification of the original code in the function {\sl delay(t)}. An additional function, {\sl findmin()}, performs the actual comparison. 
It determines whether the highest-priority process in the simulation calendar, with reactivation record {\sl E} and reactivation time {\sl E.clock}, is to be given control before or after the process that invoked {\sl delay()}, which requires control at simulation time {\sl clock + t}. Clearly, if the quantity {\sl E.clock} is smaller, then the currently executing process must be suspended; otherwise, the savings include an insertion, a deletion, and, in the case of \Sip/, two context-switching actions. In the latter situation, the process continues execution upon an immediate return from the function {\sl delay()}. A succinct description of this scheme is given by the pseudo-code shown in Figure~\ref{Newdelay}. \begin{figure} {\singlespace \begin{pseudocode} { \item delay(t) \\ \{ \\ \item \+ E = findmin(future\_event\_set); \\ if( t+clock < E.clock) \{ \\ \+ /* no need to insert; process continues to execute */ \\ clock += t; \\ return; \\ \- \} \\ else \{ \\ $\cdots$ original code in delay() $\cdots$ \\ \} \\ \- \} \\ } \end{pseudocode} } \caption{Add an extra check to {\sl delay()}} \label{Newdelay} \end{figure} Using the modified {\sl delay(t)} function, we repeat the experiments described above to obtain average execution times for the two benchmark models. The results, shown in Table~\ref{Better}, indicate a significant improvement in performance compared to the previous results (see Tables~\ref{B1} and \ref{B2}). The impact of this modification on the performance of the \Sip/ subsystem is larger than its impact on the \Sif/ subsystem, largely because both process-switching and simulation-calendar overheads are reduced in \Sip/, while only simulation-calendar overheads are reduced in \Sif/. Though the comparison modification narrows the difference in performance between the \Sip/ and \Sif/ subsystems, the \Sif/ subsystem remains a consistently better performer than \Sip/. 
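The check of Figure~\ref{Newdelay} can be made concrete with a small C sketch. This is a simplified model, not \Si/'s actual code: the calendar here stores only reactivation times in a binary min-heap, and the names ({\sl heap\_insert}, {\sl findmin}) and bookkeeping counters are our own illustrative assumptions.

```c
/* Toy simulation calendar: a binary min-heap of reactivation times. */
#define CAP 1024
static double heap[CAP];
static int    hn = 0;

static void heap_insert(double t)
{
    int i = hn++;
    heap[i] = t;
    while (i > 0 && heap[(i - 1) / 2] > heap[i]) {  /* sift up */
        double tmp = heap[i];
        heap[i] = heap[(i - 1) / 2];
        heap[(i - 1) / 2] = tmp;
        i = (i - 1) / 2;
    }
}

static double findmin(void) { return hn ? heap[0] : 1e300; }

static double clock_now = 0.0;
static long   saved = 0, total = 0;

/* Modified delay(): if the caller would wake before every event
 * already in the calendar, skip the insert/extract pair entirely
 * and let the caller continue; otherwise fall through to the
 * normal insertion (and, in the real system, suspension) path. */
static int delay(double t)
{
    total++;
    if (clock_now + t < findmin()) {
        clock_now += t;          /* process continues immediately */
        saved++;
        return 1;                /* insertion avoided */
    }
    heap_insert(clock_now + t);  /* original delay() path */
    return 0;
}
```

With an event pending at time 10.0, a call delay(2.0) at clock 0 returns immediately and advances the clock, while delay(20.0) falls through to the normal insertion path.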
The performance improvement to be had from the additional comparison operation depends on the frequency of the desired property (i.e., that the currently executing process may continue execution without incurring calendar and context-switching overheads) relative to the cost of performing the comparison for every process that invokes the {\sl delay()} function: the more frequently the condition holds, the larger the savings. Using $f$ to denote the frequency with which the condition is true, $C_{test}$ and $C_{overhead}$ to represent the costs of comparison and overhead, respectively, and $r$ the ratio $C_{test}/C_{overhead}$, the enhancement is beneficial whenever \begin{equation} \nonumber C_{overhead} ~ > ~ C_{test} + ( 1 - f) \times C_{overhead} \end{equation} or equivalently, whenever \begin{equation} \nonumber f ~ > ~ \frac{C_{test}}{C_{overhead}} ~ = ~r . \end{equation} In support of this modification, it is known that when scheduling distributions (which dictate how reactivation-times are dispersed in the simulation calendar) are mixture distributions, reactivation-times tend to pile up towards the beginning of the simulation calendar~\cite{Sargent:study2}. Because such scheduling distributions are realistic, the modified scheme is almost certain to yield reduced execution times for most applications. A detailed analysis of the savings given by this method can be found in~\cite{CBI}. 
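To make the trade-off concrete, the expected per-invocation costs of the two schemes can be compared numerically. In the sketch below only the frequencies come from Table~\ref{Ratio}; the cost values and function names are illustrative assumptions:

```c
/* Expected cost per delay() invocation.  Without the check we always
 * pay the insert/extract-plus-switch overhead; with the check we
 * always pay the comparison, plus the overhead only when the
 * continue-immediately condition fails (probability 1 - f). */
static double cost_without_check(double c_overhead)
{
    return c_overhead;
}

static double cost_with_check(double f, double c_test, double c_overhead)
{
    return c_test + (1.0 - f) * c_overhead;
}
```

Normalizing $C_{overhead}$ to 1 and assuming $r = 0.1$, the check pays off at both measured frequencies ($f = 0.28$ and $f = 0.78$) and loses only when $f < r$, in agreement with the inequality above.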
\begin{table} \begin{center} \begin{tabular}{|l|c|} \hline &$saved/total$ \\ \hline Benchmark I & $27802/100000 \approx 28\% $ \\ \hline Benchmark II & $337954/433437 \approx 78\% $ \\ \hline \end{tabular} \end{center} \caption{Percentages of saved Insert and Extract\_min operations} \label{Ratio} \end{table} \begin{table} \small \begin{center} \begin{tabular}{|ll|r|r|r||r|} \hline & &\multicolumn{4}{|c|}{Simulated Customers ($\times 100$)} \\ & & 100 & 200 & 500 & gain \\ \hline \hline Benchmark I & \Sip/ & 13.6 & 27.2 & 67.9 & 8\%\\ \cline{2-6} & \Sif/ & 11.8 & 23.5 & 59.2 & 2\%\\ \hline Benchmark II & \Sip/ & 18.9 & 37.6 & 94.2 & 44\%\\ \cline{2-6} & \Sif/ & 16.9 & 34.0 & 84.7 & 16\%\\ \hline \end{tabular} \end{center} \caption{Execution time (in seconds) for Benchmark I and Benchmark II (improved)} \label{Better} \end{table} \subsection{Round-Robin and Processor-Sharing Algorithms} Service disciplines such as first-in-first-out (FIFO), round-robin (RR), and processor-sharing (PS) are functional parameters of service stations in queue-based simulations. As the systems we design and build grow increasingly complicated in both functionality and description, we are faced with a choice between analytic models that are weak in their assumptions and simulation models that are weak in the conclusiveness of their execution data. In most instances of practical interest we have little choice but to rely on good simulation models to answer questions related to system performance. Hence efficient techniques for implementing simulation algorithms, including algorithms for service disciplines, are useful as simulation execution enhancements. In the round-robin (RR) service discipline, a job is serviced for a single quantum $q$ at a time, sharing the service resource with other jobs undergoing the same service allocation. 
If the remaining service time required by a job exceeds the quantum size $q$, the job's processing is interrupted at the end of its quantum and the job is returned to the rear of the queue, where it awaits the service quantum it will receive in the next round. Instead of taking a naive approach which simulates the round-robin discipline by physically switching control from one process to another, we propose a computational scheme based on predicting the departure instants of serviced jobs leaving the pool of queued jobs. The computed departure instant of a particular job may be invalidated by one or more newly arriving jobs, i.e., jobs arriving after the corresponding departure event is scheduled but before it can occur. This is because the server must now attend to one or more previously unaccounted-for arrivals, and decrease the amount of attention it planned to give jobs already in the system prior to the new arrival(s). Consequently, a scheduled departure event that has been invalidated in this manner must be cancelled, and an updated departure instant, for the same or another job, scheduled in its place. In addition, the remaining service time requirement of each job in the system must be adjusted whenever an arrival event or a departure event occurs. The processor-sharing (PS) discipline schedules jobs as if the server were processing all the jobs in the facility queue simultaneously. That is, each job receives service at a rate inversely proportional to the number of competing jobs in the pool. A naive algorithm for simulating the PS discipline is based on a computation and prediction method which doles out an equal amount of service time to all jobs in the pool. This approach suffers because it is computationally demanding, especially when there are frequent updates of the remaining service time requirements of each job in the pool. 
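The arithmetic core of the computational RR scheme can be sketched as follows. This fragment is our own simplification: it is valid only between arrivals, since an arrival would invalidate the computed instants and force a recomputation, as described above; the function name and data layout are illustrative assumptions.

```c
#define MAXJOBS 64   /* bound on jobs simultaneously queued */

/* Compute departure instants under round-robin with quantum q by
 * arithmetic alone, with no process switches.  rem[] holds each
 * job's remaining service time on entry and is consumed; dep[]
 * receives the departure instants.  Assumes n < MAXJOBS and that
 * no new job arrives before the last departure. */
static void rr_departures(double rem[], double dep[], int n,
                          double q, double now)
{
    int    queue[MAXJOBS];               /* circular FIFO of job ids */
    int    head = 0, tail = 0, i, j;
    double t = now;

    for (i = 0; i < n; i++)              /* initial FIFO order */
        queue[tail++ % MAXJOBS] = i;

    while (head != tail) {
        j = queue[head++ % MAXJOBS];
        if (rem[j] <= q) {               /* final (possibly short) quantum */
            t += rem[j];
            rem[j] = 0.0;
            dep[j] = t;
        } else {                         /* one full quantum, rejoin rear */
            t += q;
            rem[j] -= q;
            queue[tail++ % MAXJOBS] = j;
        }
    }
}
```

For example, two jobs with remaining times 3 and 5 under quantum 2 depart at times 5 and 8, exactly the instants a per-quantum process-switching simulation would produce.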
To alleviate this computational requirement to some extent, we propose a lazy-update algorithm which accumulates the requisite amount of updating in a special variable, instead of directly performing the update on all jobs exhaustively. When the departure time of the earliest job to leave the pool is to be computed, the value resident in the special variable is subtracted from the remaining service time of the next job to depart, yielding its true remaining service time. With this modification, the only operations required for handling the pool are the INSERT and EXTRACTMIN primitives. By combining such infrequent updates with an efficient priority queue data structure, the lazy-update algorithm can be shown to reduce the $O(n^2)$ complexity of the naive algorithm to $O(n \log n)$. We conduct experiments to measure the execution performance of the proposed RR and PS algorithms in comparison to the naive algorithms, using the single server model. Tables~\ref{rr} and~\ref{ps} show that the proposed algorithms perform well compared to the naive algorithms. The RR discipline exhibits larger performance differences between the naive and computational approaches because of the large number of context-switches required by the former. The differences are less pronounced for the PS discipline because both approaches exploit computation. A detailed description of these algorithms and related experiments can be found in~\cite{RRPS}. 
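The lazy-update idea can be sketched in C as follows. This is a minimal model of the scheme under stated assumptions: the variable and function names are ours, and the pool is a flat array with a linear EXTRACTMIN for brevity, whereas a heap-based priority queue would give the $O(n \log n)$ bound cited above.

```c
#define PSMAX 64

/* Lazy-update processor sharing.  Rather than decrementing every
 * job's remaining service time as the server runs, we accumulate
 * the service attained by each resident job in a single variable.
 * A job arriving with demand d is keyed (attained + d); the next
 * departure is the minimum key, reached after (key - attained) * n
 * real time units when n jobs share the server. */
static double key[PSMAX];       /* pool of job keys */
static int    njobs = 0;
static double attained = 0.0;   /* service received per resident job */
static double now_t = 0.0;      /* current simulation time */

static void ps_arrive(double t, double demand)
{
    if (njobs > 0)
        attained += (t - now_t) / njobs;   /* lazy catch-up */
    now_t = t;
    key[njobs++] = attained + demand;      /* single INSERT */
}

/* Advance to, and return, the next departure instant.  Assumes no
 * arrival occurs before that instant; an intervening arrival would
 * require ps_arrive() to be called first. */
static double ps_next_departure(void)
{
    int i, m = 0;
    for (i = 1; i < njobs; i++)            /* EXTRACTMIN (linear here) */
        if (key[i] < key[m]) m = i;
    now_t += (key[m] - attained) * njobs;
    attained = key[m];
    key[m] = key[--njobs];
    return now_t;
}
```

For example, jobs of demand 3 arriving at times 0 and 1 depart at times 5 and 6, matching a direct PS computation: the first job runs alone for one time unit, the two then share the server until the first departs at time 5, and the survivor finishes its last unit alone.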
\begin{table} \small \begin{center} \begin{tabular}{|ll|r|r|r|r|r|} \hline & &\multicolumn{5}{|c|}{Simulated Customers ($\times 100$)} \\ & & 10 & 50 & 100 & 200 & 500 \\ \hline \hline & Naive & 3.5 & 17.2 & 34.1 & 67.8 & 169.5 \\ \cline{2-7} & Computational & 1.7 & 8.2 & 16.3 & 32.8 & 83.0 \\ \hline \end{tabular} \end{center} \caption{Comparison of execution time (in seconds) for RR algorithms} \label{rr} \end{table} \begin{table} \begin{center} \begin{tabular}{|ll|r|r|r|r|r|} \hline & &\multicolumn{5}{|c|}{Simulated Customers ($\times 100$)} \\ & & 10 & 50 & 100 & 200 & 500 \\ \hline \hline & Naive & 1.8 & 8.5 & 16.5 & 34.2 & 86.0 \\ \cline{2-7} & Lazy-Update & 1.5 & 7.4 & 14.9 & 29.3 & 73.2 \\ \hline \end{tabular} \end{center} \caption{Comparison of execution time (in seconds) for PS algorithms} \label{ps} \end{table} \section{Conclusions} Our experiences with the design and implementation of the \Si/ system have been amply rewarding. The support of a very reliable lightweight process library greatly reduced the effort required to build an efficient, experimental simulation test-bed. During the early stages of design and implementation, we faced several design choices which were at times not altogether consistent with our design principles and objectives. For example, the process scheduler could be implemented either by a dedicated process or by function invocation, with each method leading to a different version of \Si/. While the former suffers the overheads typically associated with process switching and control, the latter suffers in terms of application-interface inelegance, in that an extra function call is required of a terminating process. Our experiences suggest that, despite the use of lightweight processes, context-switching costs are not insignificant. 
This is clearly seen in the use of the simple comparison check which eliminates certain unnecessary context switches during simulation execution, improving the performance of both the \Sif/ and \Sip/ subsystems. Furthermore, about two-thirds of all overheads are due to thread creation/deletion-related processing within the LWP library, with the rest due to the layer above the library. Unfortunately, our experimental results show that such overheads can contribute up to 70\% of the total simulation time. Thus, simulation would speed up considerably if thread creation and deletion overheads were reduced. To this end, we have embarked on the design of a simple threads library~\cite{Xtech} which can provide more efficient thread operations and support a simulator's scheduling disciplines. With the \Si/ system we have been successful in developing a software infrastructure which provides an ideal experimental environment. Central to this capability is the modular design philosophy, which unambiguously defines interfaces to the system's functional components. New algorithms can therefore be implemented, tested, and experimented with almost effortlessly, requiring only that interfaces be matched. We have implemented new algorithms in \Si/ to simulate the round-robin and processor-sharing disciplines, and both the ease of algorithm incorporation in \Si/ and the performance of \Si/ with the new algorithms have been excellent. A current disadvantage of the \Si/ system is its lack of portability, owing to its implementation atop a machine-dependent lightweight process library. However, since \Si/ adopts a layered design, and because almost all lightweight process libraries support the universal functions of process creation, switching (i.e., yielding), and deletion, the effort required to port \Si/ to another lightweight process library is minimal. 
This effort need focus only on the process manipulation layer, the innermost layer in the \Si/ system; such a modification is therefore transparent to the application layer. With the advent of the standard POSIX threads package (e.g., \cite{Mueller}), we believe that the \Si/ system will be easily portable to any machine supporting POSIX threads.