The PSC kernel monitoring package allows the visualization of kernel
networking variables, as illustrated in the example below.
The PSC kernel monitor was designed with the following goals in mind:
In order to achieve the goals of backwards-compatibility and the mixing and
matching of kernels with a single application program, the kernel monitoring
data sets are self-describing. This allows newer versions of
the monitor graphing tools to read old data sets, as well as allowing
users to exchange data sets that were collected in kernels with their
own modifications.
The primary component of the project is an in-kernel monitor which uses macros placed in the source code of the TCP stack to record the values of a set of TCP and socket variables into a circular table in the kernel. The macros are placed around certain "events" in the TCP stack, such as a change in the congestion window size, the discovery of a segment loss, or the occurrence of a retransmission timeout. The table can be read by applications with kvm. Another important component is the set of application tools that save the kernel monitor table to a file, or that process the data files.
The pscmonitor application reads the saved "snapshots" from the kernel's circular table using kvm, and saves the data to a binary file. The file is then read by pmgraph which graphs the specified parameters using xgraph [X99]. Pmgraph is also able to convert the binary monitor data file into an ASCII file containing the specified parameters.
In this paper, the layers will be described starting at the top (the display
application) and working down to the kernel data structures.
Pmgraph prints the following usage message:
Usage: pmgraph [options]
Options: (default is show everything in a graph)
  -b  show so_snd.sb_cc
  -c  show cwnd
  -d  print verbose debugging information
  -D  output gaps in the data to stdout
      This allows the detection of wrapping of the table between
      readings by the pscmonitor
  -f  show snd_fack
  -F  show hiwat_fair_share
  -h  print this help message
  -H  show so_snd.sb_hiwat
  -l  show calling location
      Unique identifiers for each snapshot (data sample, or set of
      concurrent datapoints in a graph), to allow tracing through
      the TCP stack
  -m  show mbuf clusters in use (in bytes)
  -M  print maximum amount of system memory used in trace
      If used without any other params, no other output will be
      produced
  -n  show so_snd.sb_net_target
  -p lport.rport
      only include connections specified by local and remote ports
      More than one connection can be specified by using the "-p"
      option multiple times in the command line. Irrelevant
      connections can be ignored this way, or individual connections
      can be studied out of a group
  -P  show graphs of one parameter for all connections
      (default is to show graphs of one connection with all
      parameters)
  -r  show so_rcv.sb_cc
  -s  show snd_max
  -S #c
      scale option c by floating point #
      Allows parameters of smaller magnitude to be graphed with a
      similar magnitude to other parameters on a single graph.
      (ex. -S 1000l graphs the call location with 1000 times the
      recorded value)
  -t  show so_snd.sb_mem_target
  -T  do not graph, but convert binary input file into text sent
      to stdout
      Essentially, perform a binary-to-text conversion of the data
      file, but also support filtering of connections and
      specification of relevant parameters
  -v  show calling values
  -x  produce xgraph files without calling the xgraph program
Note that some of the options in pmgraph are for the display of variables used in experimental TCP implementations.
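The -D option listed above reports gaps in the data. The text does not say how pmgraph finds them, so the following is only a minimal sketch of one possible approach. It assumes that seq_no increases by exactly one for each recorded snapshot and that the saved values are examined in recording order. The function name demo_count_missing is hypothetical and is not part of the PSC tools.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count the snapshots that appear to be missing from a saved trace.
 * Assumption (not confirmed by the source): seq_no increases by exactly
 * one per recorded snapshot, and seq[] holds the saved seq_no values in
 * the order in which they were recorded.
 */
static unsigned long demo_count_missing(const unsigned long *seq, size_t n)
{
    unsigned long missing = 0;
    size_t i;

    assert(seq != NULL || n == 0);
    for (i = 1; i < n; i++) {
        /* Unsigned subtraction stays correct if the counter wraps. */
        unsigned long step = seq[i] - seq[i - 1];

        if (step > 1)
            missing += step - 1;    /* step - 1 snapshots were overwritten */
    }
    return missing;
}
```

A nonzero result would suggest that the kernel table wrapped between two readings, so the reading interval was too long for the table size.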
Samples of the graphical output of pmgraph are included below.
This graph shows the congestion window and the send socket buffer's high
water mark plotted for a single connection. This graph was generated with
the command pmgraph na.9903051452.pm -p 1055.5050 -c -H. (This
TCP connection uses an early version of Rate Halving
[MM97] with Autotuning sender-side socket buffers
[SMM98, PSC98].)
This graph shows cwnd plotted for three concurrent connections. It was
generated with the command pmgraph na.9903051452.pm -P -p 1055.5050
-p 1056.5051 -p 1057.5052 -c. (These TCP connections use an early version
of Rate Halving [MM97].)
The header consists of a number of variable-length records (inspired by the format for TCP header options [RFC793]) that look like the following:
| Kind (2 bytes) | Length (2 bytes) | Data (Variable, Length - 4 bytes) |
The Length field indicates the length of the header record. The minimum header record is 4 bytes, for records with no data.
Kind is one of the following:
| Kind name | Description | Kind value | Record length (bytes) | Data length (bytes) |
|---|---|---|---|---|
| KIND_END | Indicates the end of the header | | 4 | 0 |
| KIND_MODS | Contains the psc_mods flags in the data field (16-bit value padded to 32 bits) | | 8 | 4 |
| KIND_VERSION | Contains the psc_version string for the psc mods | | 12 | 8 |
| KIND_TABLE_ENTRY_SIZE | Length (in bytes) of one snapshot (from tpm_entry_size) | | 8 | 4 |
| KIND_DATA_DEFINITION | Contains a struct tpm_entry_description that describes one variable appearing in all snapshots | | 36 | 32 |
| KIND_ENDIAN | Specifies the byte order of multi-octet data fields | | 8 | 4 |
| KIND_MCLSIZE | Size (in bytes) of an mbuf cluster on the monitored machine | | 8 | 4 |
More information on the header can be found in the appendix.
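Because every record carries its own Length, a reader can step over kinds it does not recognize. The following is a minimal sketch of such a walk, not the pmgraph implementation. The value of DEMO_KIND_END is hypothetical (the real kind values are not given here), demo_header_size is an illustrative name, and fields are read in host byte order for simplicity, whereas a real reader would consult the KIND_ENDIAN record.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_KIND_END   0   /* hypothetical value for KIND_END */
#define DEMO_MIN_RECORD 4   /* Kind (2 bytes) + Length (2 bytes) */

/*
 * Return the total size in bytes of the header that starts at buf,
 * including the terminating end record, or -1 if the header is
 * truncated or a Length field is invalid. Kind and Length are read
 * in host byte order (an assumption made to keep the sketch short).
 */
static long demo_header_size(const unsigned char *buf, size_t len)
{
    size_t off = 0;

    assert(buf != NULL);
    while (len - off >= DEMO_MIN_RECORD) {
        uint16_t kind, reclen;

        memcpy(&kind, buf + off, sizeof kind);
        memcpy(&reclen, buf + off + 2, sizeof reclen);
        if (reclen < DEMO_MIN_RECORD || reclen > len - off)
            return -1;          /* shorter than its own fields, or overruns */
        off += reclen;          /* skip the record, known kind or not */
        if (kind == DEMO_KIND_END)
            return (long)off;   /* end of the header */
    }
    return -1;                  /* ran out of bytes before the end record */
}
```

Skipping by Length rather than by a fixed per-kind size is what lets an older tool tolerate records added by a newer kernel.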
The feature flags are implemented as two kvm-readable variables:
| name | type | declared in | defined in | kvm-accessible |
|---|---|---|---|---|
| psc_version | | | tcp_subr.c | |
| psc_mods | u_int16_t | | tcp_subr.c | |
| name | value | description |
|---|---|---|
| PSC_RENO | | PSC Common kernel without SACK or FACK |
| PSC_SACK | | Selective Acknowledgments |
| PSC_FACK | | Forward Acknowledgements |
| PSC_AUTO | | Autotuning socket buffers |
The table is an array of struct tpm_entry (which is defined in tcp_pscmonitor.h):
struct tpm_entry {
    /* general */
    u_long seq_no;              /* tpm entry number */
    struct timeval time;        /* time of entry */
    u_long callvalue;           /* used for debugging */
    u_long location;            /* unique value identifying
                                   triggering code point */
    /* Connection Specific */
    /* Reno */
    u_int16_t lport;            /* local port number */
    u_int16_t rport;            /* remote port number */
    u_long snd_cwnd;            /* congestion window */
    tcp_seq snd_max;            /* highest sequence number sent */
    /* FACK */
    tcp_seq snd_fack;           /* highest sequence number (s)acked */
    /* Socket */
    /* Socket: Reno */
    u_long sb_hiwat;            /* send socket buffer hi water mark */
    u_long snd_sb_cc;           /* space used in send socket buf */
    u_long rcv_sb_cc;           /* space used in rcv socket buf */
    /* Socket: Autotuning */
    u_long sb_target_hiwat;     /* sb_mem_target for same */
    u_long sb_net_target;       /* sb_net_target for same */
    u_long hiwat_fair_share;    /* per-connection buffer fair share */
    /* System-wide */
    u_long m_clused;            /* m_clusters - m_clfree */
} tpm_table[TPM_ENTRIES];
The entire table may be read by user-level applications using kvm. The less obvious members of the structure are discussed below.
Each atomic snapshot taken by any of the connections in the system is interleaved in a single table with snapshots taken by other connections. The order in which the snapshots were recorded can be determined from the seq_no of each entry, and the time elapsed between entries can be found from the time field. As a practical matter, all the entries from a single connection can be associated with each other using the rport and lport fields. IP addresses are not currently included in the structure, since each connection is identifiable using only the port numbers in nearly all cases. (In a few obscure cases the port numbers alone are not sufficient to distinguish between TCP connections, but we do not encounter these cases in the experiments that interest us. The extensible nature of the kernel monitor architecture should allow the addition of IP addresses, if necessary for other research projects.)
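Separating one connection from the interleaved table can be sketched as a filter on the port pair followed by a sort on seq_no. This is an illustration only: struct demo_entry is a hypothetical subset of struct tpm_entry, demo_select_conn is not a function from the PSC tools, and the ports are compared exactly as stored, with no byte-order conversion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct demo_entry {             /* hypothetical subset of struct tpm_entry */
    unsigned long seq_no;       /* tpm entry number */
    uint16_t lport;             /* local port number */
    uint16_t rport;             /* remote port number */
    unsigned long snd_cwnd;     /* congestion window */
};

/* Order entries by seq_no; avoids subtraction so large values cannot overflow. */
static int demo_cmp_seq(const void *a, const void *b)
{
    unsigned long x = ((const struct demo_entry *)a)->seq_no;
    unsigned long y = ((const struct demo_entry *)b)->seq_no;

    return (x > y) - (x < y);
}

/*
 * Copy the entries that belong to one connection into out[] (which must
 * have room for n entries) and sort them into recording order.
 * Returns the number of entries kept. A wrapped seq_no counter is not
 * handled in this sketch.
 */
static size_t demo_select_conn(const struct demo_entry *in, size_t n,
                               uint16_t lport, uint16_t rport,
                               struct demo_entry *out)
{
    size_t i, kept = 0;

    assert(in != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        if (in[i].lport == lport && in[i].rport == rport)
            out[kept++] = in[i];
    }
    qsort(out, kept, sizeof *out, demo_cmp_seq);
    return kept;
}
```

If IP addresses were added to the structure later, only the comparison inside the loop would need to change.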
struct tpm_entry_description {      /* Describe the current TPM entry */
#define TPM_STRLEN 24               /* Do not change, if to remain
                                       compatible with other versions */
    char name_string[TPM_STRLEN];   /* name of param */
#define TPM_STR_SEQ_NO          "seq_no"
#define TPM_STR_TIME            "time"
#define TPM_STR_CALLVALUE       "callvalue"
#define TPM_STR_LOCATION        "location"
#define TPM_STR_LPORT           "lport"
#define TPM_STR_RPORT           "rport"
#define TPM_STR_SND_CWND        "snd_cwnd"
#define TPM_STR_SND_MAX         "snd_max"
#define TPM_STR_SND_FACK        "snd_fack"
#define TPM_STR_SB_HIWAT        "sb_hiwat"
#define TPM_STR_SND_SB_CC       "snd_sb_cc"
#define TPM_STR_RCV_SB_CC       "rcv_sb_cc"
#define TPM_STR_SB_TARGET_HIWAT "sb_target_hiwat"
#define TPM_STR_M_CLUSED        "m_clused"
#define TPM_STR_LAST            ""
    u_int16_t offset;               /* offset of param into tpm_entry */
    u_char length;                  /* length of param */
    u_char scope;                   /* scope of relevance of param */
#define TPM_SCOPE_PM    0           /* PSC monitor value */
#define TPM_SCOPE_SYS   1           /* System-wide variable */
#define TPM_SCOPE_INDIV 2           /* relevant only to indiv. conn. */
    u_int16_t mask;                 /* Which bits in psc_mods this param
                                       relates to
                                       0 = all */
    u_int16_t flags;                /* Attributes */
#define TPM_F_INT_HOST  0           /* Integer in host order */
#define TPM_F_INT_NET   1           /* Integer in net order, i.e. ports */
#define TPM_F_STRING    2           /* raw bits, such as string or float */
};
The description structure is made available to applications through kvm and is stored with every data file, so that later applications, which may also handle newer table formats, can still read old data files.
The relation between the table and the snapshot structure member descriptions is described below in the Summary.
The above diagram illustrates the relation between the structures described in the previous sections. Tpm_table points to the array of structures (struct tpm_entry) described in the Table section above. Each row is one structure (the fields are laid out horizontally) and is filled in each time a snapshot is recorded. The byte offsets are listed down the left side of the array. Tpm_entry_desc points to a second array of structures (struct tpm_entry_description), described in the Snapshot Structure Member Descriptions section above. This array gives the monitoring system its self-describing quality. Each structure in the second array describes one snapshot member (column) in the tpm_table, including its offset into the structure.
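The practical effect of the description array is that a reader never needs the compiled layout of struct tpm_entry: it looks a member up by name and then uses the recorded offset and length. The following is a minimal sketch of that lookup. It assumes that the array ends with an entry whose name is empty (suggested by TPM_STR_LAST, but not stated), that host-order fields were written on a machine with the reader's byte order, and it uses the hypothetical names demo_desc and demo_read_field with fixed-width types in place of u_int16_t and u_char.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_STRLEN     24      /* same value as TPM_STRLEN */
#define DEMO_F_INT_HOST 0       /* same value as TPM_F_INT_HOST */
#define DEMO_F_INT_NET  1       /* same value as TPM_F_INT_NET */

struct demo_desc {              /* mirrors struct tpm_entry_description */
    char     name_string[DEMO_STRLEN];
    uint16_t offset;            /* offset of param into the row */
    uint8_t  length;            /* length of param in bytes */
    uint8_t  scope;
    uint16_t mask;
    uint16_t flags;
};

/*
 * Find the member called name and read it from one raw row as an
 * integer. Returns 0 on success, -1 if the name is unknown or the
 * member is not an integer of a supported size.
 */
static int demo_read_field(const struct demo_desc *descs,
                           const unsigned char *row,
                           const char *name, uint64_t *out)
{
    const struct demo_desc *d;

    assert(descs != NULL && row != NULL && name != NULL && out != NULL);
    /* Assumed terminator: an entry with an empty name. */
    for (d = descs; d->name_string[0] != '\0'; d++) {
        const unsigned char *p = row + d->offset;
        uint64_t v = 0;
        unsigned i;

        if (strncmp(d->name_string, name, DEMO_STRLEN) != 0)
            continue;
        if (d->length == 0 || d->length > sizeof v)
            return -1;
        if (d->flags == DEMO_F_INT_NET) {
            for (i = 0; i < d->length; i++)     /* most significant byte first */
                v = (v << 8) | p[i];
        } else if (d->flags == DEMO_F_INT_HOST) {
            switch (d->length) {
            case 1: v = p[0]; break;
            case 2: { uint16_t t; memcpy(&t, p, 2); v = t; break; }
            case 4: { uint32_t t; memcpy(&t, p, 4); v = t; break; }
            case 8: memcpy(&v, p, 8); break;
            default: return -1;
            }
        } else {
            return -1;          /* raw bits are not decoded in this sketch */
        }
        *out = v;
        return 0;
    }
    return -1;                  /* no member with that name */
}
```

A kernel that adds or reorders members only has to publish a matching description array; a reader written this way keeps working on both old and new rows.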
| name | defined in |
|---|---|
| struct tpm_entry | tcp_pscmonitor.h |
| struct tpm_entry_description | tcp_pscmonitor.h |
| name | type | declared in | defined in | kvm | description |
|---|---|---|---|---|---|
| tpm_table | struct tpm_entry* | tcp_pscmonitor.h | tcp_pscmonitor.h | | table of snapshots |
| tpm_entry_desc | struct tpm_entry_description* | | tcp_subr.c | | description of columns |
| tpm_index | struct tpm_entry* | tcp_pscmonitor.h | tcp_subr.c | | next row to fill |
| tpm_seq | u_long | tcp_pscmonitor.h | tcp_subr.c | | next seq_no to use |
| tpm_num_entries | u_long | tcp_pscmonitor.h | tcp_subr.c | | num of rows in table |
| tpm_entry_size | u_char | tcp_pscmonitor.h | tcp_subr.c | | octets per row |
| name | defined in | description |
|---|---|---|
| TPM_INDEX_INCR | tcp_pscmonitor.h | Move to next row of table |
| TPM_SNAPSHOT | tcp_pscmonitor.h | Take snapshot and put in table |
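The bodies of these two macros are not reproduced here. The following is only a minimal sketch of how a circular table of this kind could be filled, using the roles listed above for tpm_index (next row to fill) and tpm_seq (next seq_no to use). All demo_ and DEMO_ names, the four-row table size, and the two-member entry are hypothetical simplifications.

```c
#include <assert.h>

#define DEMO_ENTRIES 4          /* hypothetical; stands in for TPM_ENTRIES */

struct demo_entry {             /* hypothetical subset of struct tpm_entry */
    unsigned long seq_no;
    unsigned long snd_cwnd;
};

static struct demo_entry  demo_table[DEMO_ENTRIES];
static struct demo_entry *demo_index = demo_table;  /* next row to fill */
static unsigned long      demo_seq;                  /* next seq_no to use */

/* Move to the next row of the table, wrapping at the end. */
#define DEMO_INDEX_INCR()                                   \
    do {                                                    \
        if (++demo_index == demo_table + DEMO_ENTRIES)      \
            demo_index = demo_table;                        \
    } while (0)

/* Take a snapshot and put it in the table. */
#define DEMO_SNAPSHOT(cwnd)                                 \
    do {                                                    \
        demo_index->seq_no = demo_seq++;                    \
        demo_index->snd_cwnd = (cwnd);                      \
        DEMO_INDEX_INCR();                                  \
    } while (0)

/* Function wrapper so the invariant on demo_index can be checked. */
static void demo_record(unsigned long cwnd)
{
    assert(demo_index >= demo_table &&
           demo_index < demo_table + DEMO_ENTRIES);
    DEMO_SNAPSHOT(cwnd);
}
```

Once the table has wrapped, the oldest surviving row is the one demo_index points to, which is why a reader needs seq_no to recover the recording order.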
struct _pm_mods {       /* make structure a word so subsequent records
                           are word aligned */
    u_int16_t mods;
    u_int16_t pad;
};
struct _pm_tes {        /* make structure a word so subsequent records
                           are word aligned */
    u_int16_t table_entry_size;
    u_int16_t pad;
};
struct header_entry {
    u_int16_t kind;     /* designed so that data begins on a word
                           boundary */
    u_int16_t length;   /* designed so that data begins on a word
                           boundary */
    union {
        struct _pm_mods mods;
        struct _pm_tes tes;
        char version[VERSION_LENGTH];
        struct tpm_entry_description desc;
    } data;
};
#define pm_mods data.mods.mods
#define pm_table_entry_size data.tes.table_entry_size
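Writing one record with these structures could look like the following minimal sketch, which emits a KIND_MODS record of 8 bytes (4 bytes of Kind and Length plus the padded 4-byte data field). The kind value, the demo_ names, and the value 8 for the version length (taken from the 8-byte KIND_VERSION data length) are assumptions, and the description member of the union is left out to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_KIND_MODS      1   /* hypothetical value for KIND_MODS */
#define DEMO_VERSION_LENGTH 8   /* assumed from the KIND_VERSION data length */

struct demo_header_entry {      /* mirrors struct header_entry, minus desc */
    uint16_t kind;
    uint16_t length;
    union {
        struct { uint16_t mods; uint16_t pad; } mods;
        struct { uint16_t table_entry_size; uint16_t pad; } tes;
        char version[DEMO_VERSION_LENGTH];
    } data;
};

/*
 * Write a KIND_MODS record into out[] (capacity cap bytes) and return
 * the record length in bytes. Fields are stored in host byte order.
 */
static size_t demo_put_mods(unsigned char *out, size_t cap, uint16_t mods)
{
    struct demo_header_entry e;
    size_t reclen = offsetof(struct demo_header_entry, data)
                  + sizeof e.data.mods;

    assert(out != NULL && cap >= reclen);
    memset(&e, 0, sizeof e);        /* also zeroes the pad field */
    e.kind = DEMO_KIND_MODS;
    e.length = (uint16_t)reclen;    /* Length covers Kind, Length, and Data */
    e.data.mods.mods = mods;
    memcpy(out, &e, reclen);        /* copy only the bytes this kind uses */
    return reclen;
}
```

Copying only reclen bytes, rather than sizeof e, is what keeps each record at the length given in the Kind table even though the union is sized for its largest member.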
[MM97] Matt Mathis and Jamshid Mahdavi. "TCP Rate-Halving with Bounding
Parameters," http://www.psc.edu/networking/papers/FACKnotes/current/,
Pittsburgh Supercomputing Center, 1997.
[PSC98] "Automatic TCP Buffer Tuning Research" web page, Pittsburgh
Supercomputing Center, 1998. Available at:
http://www.psc.edu/networking/auto.html
[RFC793] "Transmission Control Protocol." IETF RFC 793, September 1981.
Available from: http://www.rfc-editor.org/rfc/rfc793.txt.
[SMM98] J. Semke, J. Mahdavi, and M. Mathis. "Automatic TCP Buffer Tuning,"
ACM Sigcomm '98/Computer Communication Review, Volume 28,
Number 4, October 1998.
[X99] xgraph plotting and graphing tool.
http://www-mash.cs.berkeley.edu/xgraph/, U.C.Berkeley, 1999.