The workload of the global Internet is dominated by the Hypertext Transfer Protocol (HTTP), an application protocol used by World Wide Web clients and servers. Simulation studies of IP networks will require a model of the traffic patterns of the World Wide Web, in order to investigate the effects of this increasingly popular application.
We have developed an empirical model of network traffic produced by HTTP. Instead of relying on server or client logs, our approach is based on packet traces of HTTP conversations.
Through traffic analysis, we have determined statistics and distributions for higher-level quantities such as the size of HTTP files, the number of files per “Web page”, and user browsing behavior. These quantities form a model that can then be used by simulations to mimic World Wide Web network applications.
Our model of HTTP traffic captures logically meaningful parameters of Web client behavior, such as file sizes and “think times”. The traffic traces described in the preceding section provide us with empirical probability distributions describing various components of this behavior. We use these distributions to generate a synthetic workload.
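One common way to draw synthetic values from an empirical distribution is inverse transform sampling on the empirical CDF. The following is a minimal sketch of that idea, not necessarily the procedure used in this study. The class name `EmpiricalDistribution` and the sample values are hypothetical and chosen only for illustration.

```python
import math
import random


class EmpiricalDistribution:
    """Draws values from the empirical distribution of a set of observations."""

    def __init__(self, observations):
        if not observations:
            raise ValueError("need at least one observation")
        self.values = sorted(observations)

    def quantile(self, u):
        """Inverse empirical CDF: smallest observed x with F(x) >= u, 0 <= u <= 1."""
        n = len(self.values)
        index = min(n - 1, max(0, math.ceil(u * n) - 1))
        return self.values[index]

    def sample(self, rng):
        """Map a uniform random number in [0, 1) through the inverse CDF."""
        return self.quantile(rng.random())


# Hypothetical reply lengths in bytes; real values would come from packet traces.
reply_lengths = EmpiricalDistribution([300, 100, 400, 200])
rng = random.Random(42)
synthetic = [reply_lengths.sample(rng) for _ in range(5)]
```

Every synthetic value is one of the observed values, and each observation is drawn with probability proportional to how often it appeared in the trace.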
At the lowest level, our model deals with individual HTTP transfers, each of which consists of a request-reply pair of messages, sent over a single TCP connection. We model both the request length and reply length of HTTP transfers. (Footnote 3: It is shown that it is appropriate to model the first HTTP transfer on a Web page separately from subsequent retrievals for that page. For simplicity, we have postponed discussion of this distinction.) At first glance, it may seem more appropriate for a model of network traffic to deal with the number, size, and interarrival times of TCP segments. However, we note that these quantities are governed by the TCP flow control and congestion control algorithms.
These algorithms depend in part on the latency and effective bandwidth on the path between the client and server. Since this information cannot be known a priori, an accurate packet-level network simulation will depend on a simulation of the actual TCP algorithms. This is in fact the approach taken for other types of TCP bulk transfers in the traffic model described in [10]. In a similar fashion, our model generates transfers which need to be run through TCP’s algorithms; it does not generate packet sizes and interarrivals by itself.
A Web document can consist of multiple files. A server and client may need to employ multiple HTTP transactions, each of which requires a separate TCP connection, to transfer a single document. For example, a document could consist of HTML text [3], which in turn could specify three images to be displayed “inline” in the body of the document. Such a document would require four TCP connections, each carrying one request and one reply.
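The example above can be expressed as a small data structure. This is an illustrative sketch only: the class names `WebFile` and `WebDocument`, the file names, and the byte counts are all hypothetical, and it assumes one TCP connection per HTTP transfer as described in the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WebFile:
    name: str         # hypothetical label, for illustration only
    reply_bytes: int  # hypothetical size of the server's reply


@dataclass
class WebDocument:
    files: List[WebFile]

    def connections_needed(self) -> int:
        # One HTTP transfer per file, and one TCP connection per transfer.
        return len(self.files)


# HTML text plus three inline images, as in the example in the text.
page = WebDocument(files=[
    WebFile("index.html", 4000),
    WebFile("figure1.gif", 12000),
    WebFile("figure2.gif", 9000),
    WebFile("figure3.gif", 15000),
])
print(page.connections_needed())  # prints 4
```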
The next higher level above individual files is naturally the Web document, which we characterize in terms of the number of files needed to represent a document. Between Web page retrievals, the user is generally considering her next action. We admit the difficulty of characterizing user behavior, due to its dependency on human factors beyond the scope of this study.
However, we can model user think time based on our observations. Assuming that users tend to access strings of documents from the same server, we characterize the locality of reference between different Web pages. We therefore define the consecutive document retrievals distribution as the number of consecutive pages that a user will retrieve from a single Web server before moving to a new one.
Finally, the server selection distribution defines the relative popularity of each Web server, in terms of how likely it is that a particular server will be accessed for a set of consecutive document retrievals.
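The layers of the model can be combined into a workload generator along the following lines. This is a rough sketch under stated assumptions rather than the implementation used in this study: the function name `generate_workload`, the dictionary keys, the server names, and all sample values are hypothetical. Drawing uniformly from a list of observations is used here as a simple stand-in for sampling an empirical distribution, and the sketch does not treat the first transfer of a page separately from later ones.

```python
import random


def generate_workload(rng, dists, server_weights, num_server_visits):
    """Return a list of synthetic page retrievals for one simulated user."""
    servers = list(server_weights)
    weights = [server_weights[s] for s in servers]
    pages = []
    for _ in range(num_server_visits):
        # Server selection distribution: relative popularity of each server.
        server = rng.choices(servers, weights=weights, k=1)[0]
        # Consecutive document retrievals from this server.
        for _ in range(rng.choice(dists["consecutive_documents"])):
            # Each file in the document is one request-reply transfer.
            transfers = [
                (rng.choice(dists["request_bytes"]), rng.choice(dists["reply_bytes"]))
                for _ in range(rng.choice(dists["files_per_document"]))
            ]
            pages.append({
                "server": server,
                "transfers": transfers,
                # User think time before the next page retrieval.
                "think_time_s": rng.choice(dists["think_time_s"]),
            })
    return pages


# Hypothetical observations; real values would come from the packet traces.
example_dists = {
    "consecutive_documents": [1, 2, 4],
    "files_per_document": [1, 1, 3, 4],
    "request_bytes": [240, 260, 310],
    "reply_bytes": [1500, 4000, 12000],
    "think_time_s": [2.0, 15.0, 60.0],
}
example_servers = {"server-a": 0.7, "server-b": 0.3}
workload = generate_workload(random.Random(7), example_dists, example_servers, 3)
```

The output is a sequence of transfers and think times, not packets. As the text notes, packet sizes and interarrival times would come from running each transfer through a simulation of TCP.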