Check out the new USENIX Web site. next up previous
Next: A Simple Example Up: Design Previous: Design


Algorithm

The rsync implementation uses a programming construct referred to as a process triangle that defines three processes, one on the source and two on the target (Figure 2). The generator process runs on the target and generates the checksums for the target file ($F_t$) that get sent to the sender process ( $G \Rightarrow S$). The sender scans the source file ($F_s$) for the sums it receives and transmits the results to the receiver process ( $S \Rightarrow R$). The receiver applies the delta commands and notifies the generator if a file needs to be resent ( $R \Rightarrow G$). The generator and receiver processes run independently on separate files concurrently and proceed independently.

Rsync transmits ADD and COPY commands in the order in which they were detected during a sequential scan of the source file. The target applies the commands in the order received. No destination offsets need to be specified for COPY and ADD commands. The algorithm calculates the destination offset from the previous destination offset and the length of the previous command.

With ip-rsync, the commands undergo a reordering step to facilitate corruption-free, in-place reconstruction. As such, the transmission of the ADD and COPY commands in a non-sequential order requires the explicit specification of a destination offset with each command. The destination offset allows commands to be applied in any order at the target. The extra data in the ip-rsync codeword adds a small overhead to the bandwidth requirements.

Ip-rsync takes the following actions to synchronize a file (the bold text indicate where ip-rsync and rsync differ):

Figure 3: An example of a file synchronization and the associated dependency graph. The blocks on the target will move from their original location (bottom) to match their location on the source (top). To paraphrase the graph, $C_3$ must be completed before $C_1$ or $C_2$ since the destination of these blocks will corrupt the source for $C_3$; i.e., perform $C_3$ first so that $C_1$ or $C_2$ do not overwrite its source data.
\begin{figure*}\begin{center}
\input{depgraph.pstex_t}
\end{center}\par\vspace{-10pt}
\par\end{figure*}

  1. Generator: Generate weak and strong checksums for each block in the reference target file.
  2. Generator: Send checksums to the sender.
  3. Sender: Build a hash table from the checksums received.
  4. Sender: Scan the source file for matches. Buffer all COPY and ADD commands.
  5. Sender: Construct a dependency graph among COPY commands.
  6. Sender: Topologically sort the dependency graph, breaking cycles as they are detected.
  7. Sender: Send sorted COPY commands, followed by ADD commands, to the receiver.
  8. Receiver: Apply commands when received. For each command, seek to the given offset, and either copy data or insert the included data.
  9. Receiver: Truncate the file if it decreased in size.
  10. Receiver: Alert the generator of the file's completion.

In detecting and resolving dependencies, ip-rsync sacrifices some concurrency. The sender conducts an analysis of all COPY commands prior to transmitting any commands to the receiver. This causes ip-rsync to wait until all COPY commands are found and analyzed before transmitting data. Traditional rsync overlaps detecting and transmitting matches by sending data to the receiver immediately. The sender detects dependencies among all COPY commands and waits for the discovery of all commands before reordering.


next up previous
Next: A Simple Example Up: Design Previous: Design
2003-04-08