Introduction
In the late 1990s, a research team at Los Alamos National Laboratory (LANL) was developing a tool to help visualize the vast amounts of data produced by the laboratory’s powerful computer networks. The team members, including myself, were looking for ways to make sense of the huge amount of data being generated and to find patterns that might be hidden in the noise. The tool we developed, called “network tomography,” uses ideas from mathematics and statistics to create a map of how information flows through a network. In essence, it allows us to “see” the invisible architecture of the Internet.
Network tomography has since been used to study everything from the structure of social networks like Facebook (Bassett et al., 2011) to the spread of epidemics (Wang et al., 2003). Recently, my colleagues and I have been using it to study how information flows through high-dimensional torus networks: networks in which each node is connected to its nearest neighbors along every dimension, with the links wrapping around at the edges (see Figure 1). These networks are interesting because they are both highly efficient and highly resilient; they can continue functioning even if some nodes are removed or damaged. Torus networks are used in a variety of settings, including large-scale supercomputing and data storage (Blandford et al., 2005; Chen et al., 2007; Foster et al., 2010). Understanding how information flows through these networks is important for optimizing their performance and for designing new algorithms that take advantage of their unique structure.
![Figure 1](https://i.imgur.com/J0vTGiO.png)
*Figure 1: A two-dimensional torus network. Each node is connected to its four nearest neighbors.*
In this article, I will describe how we used network tomography to visualize information flow in high-dimensional torus networks. I will first give an overview of the technique and then show how we applied it to study data traffic in two different kinds of high-dimensional torus networks: those used for supercomputing and those used for data storage. Finally, I will discuss some implications of our results for the design of future high-dimensional torus networks.
Network Tomography Overview
Network tomography is a technique for inferring the underlying structure of a network from measurements of how information flows through it. The basic idea is simple: If you know how information is flowing between pairs of nodes in a network, you can use that information to infer something about the structure of the network itself. For example, if you know that there is a lot of traffic between two nodes that are far apart from each other in terms of physical distance, you can infer that there must be some kind of shortcut between them, perhaps a direct connection or an indirect connection via intermediate nodes. On the other hand, if you know that there is very little traffic between two nodes that are close together in physical distance, you can infer that there must be some kind of barrier between them, perhaps a broken link or a heavily congested link. Network tomography allows us to make these kinds of inferences by using mathematical models to “reverse engineer” the flow of traffic in a network (Crovella & Kolaczyk, 2001; Tufte, 1997).
To see how this works in practice, let’s consider a simple example: Imagine that we want to use network tomography to infer the underlying structure of a small social network consisting of four people (A–D). We make measurements of how often each person communicates with each other person over some period of time—for example, we might measure how many emails each person sends per day or how many phone calls each person makes per week. From these measurements, we can construct a “communication matrix” like the one shown in Figure 2(a).
![Figure 2](https://i.imgur.com/pYSKoPzm.jpg)
*Figure 2: Communication matrices for (a) a small social network and (b) a large social network.*
Each element in this matrix represents the number of communications between two people over our specified period of time—so, for example, we see that person A sent eight emails to person B during our measurement period. We can use this matrix as input to our network tomography algorithm, which then produces an inferred “structure matrix” like the one shown in Figure 2(b). This matrix shows us what our algorithm has inferred about shortcuts and barriers between pairs of people in our social network based on our communication measurements; specifically, it shows us which pairs of people are close together (represented by large values) and which pairs are far apart (represented by small values). For example, we see that our algorithm has inferred that there is no shortcut between persons A and C—that is, there are no intermediary persons through whom A could communicate with C more quickly than by communicating directly with C—but it has inferred that there exists a shortcut between persons B and D via intermediary person C.
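The step from a communication matrix to a structure matrix can be illustrated with a minimal sketch. The inference algorithm itself is not spelled out here, so the code below is only an assumption-laden stand-in: it treats closeness as symmetrized, normalized message counts and scores a two-hop path by its weaker leg. The function names `closeness_matrix` and `best_intermediaries` are hypothetical, and every count except A's eight emails to B is invented for illustration.

```python
import numpy as np

people = ["A", "B", "C", "D"]

# Hypothetical counts: comm[i, j] = messages person i sent to person j.
# Only A -> B = 8 comes from the text; all other values are invented.
comm = np.array([
    [0, 8, 5, 1],   # A
    [6, 0, 3, 1],   # B
    [5, 3, 0, 7],   # C
    [0, 1, 5, 0],   # D
], dtype=float)


def closeness_matrix(comm):
    """Symmetrize the counts and scale them to [0, 1]; larger means closer.

    Assumed model: closeness is proportional to total two-way traffic.
    """
    sym = comm + comm.T
    peak = sym.max()
    return sym / peak if peak > 0 else sym


def best_intermediaries(close, names):
    """Map each pair to the intermediary that beats the direct tie, if any.

    Assumed rule: a two-hop path is only as strong as its weaker leg.
    Pairs whose direct tie is at least as strong as every two-hop path
    are left out of the result.
    """
    found = {}
    n = close.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            best_k, best_score = None, close[i, j]
            for k in range(n):
                if k in (i, j):
                    continue
                score = min(close[i, k], close[k, j])
                if score > best_score:
                    best_k, best_score = k, score
            if best_k is not None:
                found[(names[i], names[j])] = names[best_k]
    return found


close = closeness_matrix(comm)
shortcuts = best_intermediaries(close, people)
for (a, b), via in shortcuts.items():
    print(f"{a}-{b}: shortcut via {via}")
```

With these made-up counts, the sketch finds no shortcut for the pair A and C, and it finds a shortcut between B and D through C, which mirrors the kind of conclusion described above. A real tomography method would replace the weaker-leg rule with a statistical model fitted to the measurements.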
Network Tomography Applied to High-Dimensional Torus Networks
We applied network tomography to study data traffic patterns in two different kinds of torus networks: those typically used for supercomputing and those typically used for storing data. Both types of networks are highly complicated and difficult to understand; however, by using network tomography we were able to gain insights into their internal structure and function.
Supercomputing Torus Networks
Supercomputers typically consist of many hundreds or thousands of individual computers, all interconnected by high-speed networks. In recent years, the architecture of supercomputers has been changing rapidly, moving away from traditional “cluster” configurations in favor of new architectures based on massively parallel processors (MPPs). MPPs were initially developed for use in scientific and engineering applications requiring massive number-crunching capabilities; however, they have since been adapted for use in commercial applications such as financial modeling and weather prediction.
One popular MPP architecture is known as a connected torus because it resembles a multi-dimensional lattice with periodic boundary conditions (see Figure 3). In such an architecture, each computer node is connected to its nearest neighbors, as nodes in a multi-dimensional lattice are, but with the added constraint that the topology is periodic; i.e., if one follows links in the same direction for long enough, one eventually returns to the starting node, because positions wrap around modulo the size of the network along that dimension. This type of boundary condition is mathematically equivalent to connecting the opposite faces of an n-dimensional hypercubic lattice to form an n-dimensional torus, which explains why this architecture is sometimes referred to as a torus network. MPP supercomputers typically consist of hundreds or thousands of processors arranged in a connected torus topology; however, because of their massive parallelism they can be scaled up to even greater sizes if needed.
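The periodic boundary condition described above comes down to modular arithmetic on node coordinates. The following is a minimal sketch, not code from any particular machine: the function names `torus_neighbors` and `torus_distance` and the tuple-of-coordinates node representation are assumptions made for illustration.

```python
def torus_neighbors(coord, dims):
    """Return the nearest neighbors of `coord` on an n-dimensional torus.

    `dims` gives the number of nodes along each axis. The modulo makes
    each axis wrap around, so nodes on an edge still have a neighbor on
    both sides. Assumes every axis has at least 3 nodes; with fewer,
    the two neighbors along that axis coincide.
    """
    neighbors = []
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            nxt = list(coord)
            nxt[axis] = (coord[axis] + step) % size
            neighbors.append(tuple(nxt))
    return neighbors


def torus_distance(a, b, dims):
    """Minimum hop count between nodes a and b, allowing wraparound."""
    total = 0
    for x, y, size in zip(a, b, dims):
        direct = abs(x - y)
        total += min(direct, size - direct)
    return total


# A corner node of a 4 x 4 torus still has four neighbors.
print(torus_neighbors((0, 0), (4, 4)))
# Opposite corners are only two hops apart thanks to the wraparound links.
print(torus_distance((0, 0), (3, 3), (4, 4)))
```

The same two functions work unchanged for higher dimensions: a node in an n-dimensional torus has 2n neighbors, one in each direction along each axis.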
![Figure 3 ](https://i