Simple But Not Easy
Working as a data analyst, one is constantly reminded of the importance of simplicity - the need to boil complicated questions down to single lines, numbers, and answers. The advice is always the same: KISS! Analysts have to strive for simplicity because the job is fundamentally teleological - it's not ars gratia artis, but a task done above all to facilitate better decision-making.
A quote from Sebastian Mallaby's book The Power Law illustrates the point well: as soon as Kevin Efrusy arrived at Accel, at the time an up-and-coming venture capital firm, he was expected to contribute forcefully to meetings:
Nor was it enough to comment usefully; he was required to express a verdict, yes or no, and take responsibility for it. "There is a saying in our business, 'If you are treated like an analyst, you are going to act like an analyst'", Efrusy explained later. An analyst could point out the arguments on both sides of an issue, but that was different from taking a stand, and this difference defined the psychological gulf between being a venture capitalist and not being one.
As Efrusy makes clear, coming to a decision isn't easy by any means. But the difficulty of the job isn’t really the issue for me; it’s more of an epistemic problem. There's some sense that it can't all be that simple - can it? In particular, I often wonder what we lose when we simplify so aggressively - is it the case that all problems have a golden kernel of simple truth at their heart, and that perfection is achieved when there is nothing left to take away?
Richard Whatmore felicitously described the goal of intellectual history as "a more sophisticated sense of past thoughts, an understanding of how they arose and of why different solutions to historical problems made sense, and of the limits to human action in history imposed by the ideological frameworks which people faced in their lives." For Whatmore, there are no iron laws of the past, nor a simple story to be told - only greater nuance and understanding. I particularly like the idea of ideas not as simple unit-ideas, to be pinned down like dried leaves, but rather as contingent but rational responses to particular problems; rational inasmuch as they happened for a reason, even if that was not our reason; one rationality amongst many. The same idea is expressed by David Runciman's appraisal of the great English legal historian FW Maitland, whose work expressed "the English assumption that what works must make sense, rather than that something must make sense if it is to work." I just think that's a really cool idea. Maitland was a genius at lucidly explaining complexity.
Networks are really complicated things; but as analysts, we have to try to make them simple. One of the reasons I find exponential random graph models (ERGMs) so fascinating is the way they embody this contradiction. ERGMs allow us to model the probability of an edge forming between two nodes in a network not simply on the basis of node-level covariates like age, gender, wealth or nationality, but also on the pattern of connections in the network as a whole: the effect that the new tie would have on the counts of small local configurations (or 'motifs') within the network - things like triangles or two-paths (where nodes A and B are connected, and so are B and C). Important ideas about global structure stem from these hyper-local cases.
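To make those counts concrete, here's a minimal sketch - just networkx on a hypothetical four-node toy graph, not an ERGM fit - of the two statistics mentioned above:

```python
# Counting the two local configurations mentioned above on a toy graph.
# (Illustrative only: node names and edges are made up, and no ERGM is fitted.)
import networkx as nx
from itertools import combinations

G = nx.Graph()
G.add_edges_from([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")])

# Triangles: networkx counts each triangle once per member node, so divide by 3.
n_triangles = sum(nx.triangles(G).values()) // 3

# Two-paths: every unordered pair of a node's neighbours forms a two-path
# through it (A-B plus B-C, in the example above).
n_two_paths = sum(len(list(combinations(list(G.neighbors(n)), 2))) for n in G)

print(n_triangles, n_two_paths)  # -> 1 triangle, 5 two-paths
```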
It also feels like ERGMs - and, to be honest, networks in general - are a long way from being perfectly understood. Since fitting an ERGM means working out how every possible tie would ripple through a cascade of other structures, the normalising constant - a sum over every possible graph on the same set of nodes - quickly becomes impossible to compute exactly and hard even to approximate, and so in practice ERGMs can only be used on networks of up to a few hundred nodes.
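For what it's worth, the standard textbook notation makes the problem clear - this isn't specific to my own work: g(y) collects statistics like edge, triangle and two-path counts, and θ weights them:

```latex
P_{\theta}(Y = y) = \frac{\exp\{\theta^{\top} g(y)\}}{\kappa(\theta)},
\qquad
\kappa(\theta) = \sum_{y' \in \mathcal{Y}} \exp\{\theta^{\top} g(y')\}
```

The normalising constant runs over every possible graph on the same nodes - for an undirected network of n nodes, that's 2^(n(n-1)/2) terms - which is why it can't be computed directly.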
As such, the challenge is to find small networks, or subgraphs, which nonetheless encode valuable information. It's a bit like a dimensionality reduction problem - how can we filter out the nodes which don't matter, while maintaining interesting properties of the network as a whole? How can we find some simplicity in this complexity?
The most obvious approach is to select just the most important nodes. If we define importance as node degree - the most gregarious people, the most prolific traders, and so on - we could map the edges between those well-connected nodes. Indeed, by including their original node degree from the full network as a node covariate, we'd hope to maintain some properties of the full network.
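Sketched out in code - with networkx, and a random graph standing in for real data - that naive approach looks something like this:

```python
# Keep only the k best-connected nodes, but record each node's degree in the
# full network before discarding the rest. (G and k are placeholders here.)
import networkx as nx

G = nx.erdos_renyi_graph(1000, 0.01, seed=42)  # stand-in for the real network
k = 100

full_degree = dict(G.degree())
top_nodes = sorted(full_degree, key=full_degree.get, reverse=True)[:k]

subgraph = G.subgraph(top_nodes).copy()
nx.set_node_attributes(
    subgraph,
    {node: full_degree[node] for node in subgraph},  # degree in the *full* network
    name="full_degree",
)
```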
However, doing so loses important detail. One consequence of Duncan Watts and Steven Strogatz's 1998 paper on the 'collective dynamics of small-world networks' is that if we add a few random edges to a ring lattice - where nodes are connected only to their nearest neighbours - those edges function as 'shortcuts'. They have a significant effect on the global structure of the network, rerouting many of the shortest paths between nodes, while remaining almost undetectable at the local level.
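A quick illustration of the effect - using networkx's Watts-Strogatz generator rather than any real data - is to compare a pure ring lattice with a version in which roughly 1% of edges have been rewired into shortcuts:

```python
# Rewiring a small fraction of a ring lattice's edges collapses the average
# shortest path length, while local clustering barely changes.
import networkx as nx

ring = nx.watts_strogatz_graph(n=1000, k=10, p=0.0, seed=1)      # pure ring lattice
rewired = nx.watts_strogatz_graph(n=1000, k=10, p=0.01, seed=1)  # ~1% shortcuts

for label, g in [("ring lattice", ring), ("with shortcuts", rewired)]:
    print(
        label,
        round(nx.average_shortest_path_length(g), 1),  # drops sharply with shortcuts
        round(nx.average_clustering(g), 2),            # stays roughly the same
    )
```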
Some of the nodes omitted from the subset studied with an ERGM might function like these 'shortcuts'. If a node has a high betweenness centrality - that is, a lot of the 'shortest paths' between other nodes flow through it - then omitting it will make the subset unrepresentative of the network as a whole. This highlights why 'importance' in a network can't just mean node degree: betweenness matters, as do metrics like closeness and eigenvector centrality. But it also shows how difficult it is to select a representative sample for analysis.
In the past I've tried to get around the problem by selecting all the nodes from a larger network which rank highly on any of several centrality metrics - so that nodes with a low degree but a high betweenness centrality still get included. However, rather than trying to find a subgraph that's representative of the global structure of the network, it might be better to look for modularity - finding sub-communities of densely connected nodes that have only a few connections to other communities. If a network has a high modularity score, it should be possible to partition it into separate communities by cutting relatively few edges; these communities can then be studied independently, each with its own ERGM, and subsequently compared.
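As a sketch - again networkx, with a placeholder scale-free graph and an arbitrary cutoff - that selection rule is just the union of the top-k nodes under each centrality metric:

```python
# Select every node that ranks in the top k on *any* of several centrality
# metrics, so low-degree but high-betweenness nodes still make the cut.
import networkx as nx

G = nx.barabasi_albert_graph(1000, 3, seed=7)  # stand-in for the real network
k = 50                                         # arbitrary cutoff per metric

metrics = {
    "degree": nx.degree_centrality(G),
    "betweenness": nx.betweenness_centrality(G),
    "closeness": nx.closeness_centrality(G),
    "eigenvector": nx.eigenvector_centrality(G, max_iter=1000),
}

selected = set()
for scores in metrics.values():
    selected.update(sorted(scores, key=scores.get, reverse=True)[:k])

subgraph = G.subgraph(selected).copy()
```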
I started thinking again about this problem after following Tom Forth's work on calculating journeys by public transport across the UK.
Friends fear he's thinking about getting 128GB of RAM. pic.twitter.com/zCmAY1HBeI
— Tom Forth (@thomasforth) November 11, 2021
If one sets out, for instance, to calculate the shortest path between London and other places in the UK, the complexity increases exponentially with travel time. Forth, whose work I really enjoy, takes this as an argument for decentralisation; it's computationally infeasible to plan a public transport system from a central point, and so transport should be devolved to local authorities.
The exponential means that having bigger central teams with more resources and "better" people won't work. No matter how good and how well-resourced the teams are, the exponential will beat them. When working on transport we're just much better working on smaller areas.
— Tom Forth (@thomasforth) November 12, 2021
However, I'm not sure that logic quite follows. While no single model can keep up with exponential growth in complexity, why couldn't several carefully-partitioned models perform an equivalent function - while still being run by a single team? Is it an argument for regional devolution, or for carefully-organised central teams working in parallel?
To move back to network analysis: if we can't use an ERGM with a network of a thousand nodes, it makes sense to divide the network up into lots of different subgraphs. The problem then becomes one of community detection as much as dimensionality reduction. Whereas public transport has a literal 'ground truth' - privatised local bus and rail networks, and the physical distance between nodes, serve to demarcate boundaries - it's not always as easy to draw distinctions within other networks: an online marketplace, for instance, where physical distance is irrelevant and everyone can interact with everyone else.
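A minimal sketch of that partition-then-model workflow, assuming networkx's greedy modularity routine as a stand-in for whichever community-detection method you'd actually choose:

```python
# Partition the network by modularity, then pull each community out as its own
# subgraph, small enough to be modelled separately and compared afterwards.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.connected_caveman_graph(10, 30)  # stand-in: 10 dense clusters of 30 nodes

communities = greedy_modularity_communities(G)
subgraphs = [G.subgraph(c).copy() for c in communities]

print(len(subgraphs), [sg.number_of_nodes() for sg in subgraphs])
```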
Specifically, public transport networks tend to have a hub-and-spoke structure: just think of the railway stations of London and Paris! While some other networks share that structure, many others have a layered, overlapping one. Whereas Mark Granovetter, in his seminal paper The Strength of Weak Ties, conceived of densely connected communities linked by narrow, 'weak' bridges, in the last decade network scientists have stressed the alternative possibility that the overlaps between communities are actually more densely connected than the communities themselves; see this graphic from Jaewon Yang and Jure Leskovec's 2013 paper Overlapping Community Detection at Scale: A Nonnegative Matrix Factorization Approach:
The idea is intuitive if we think about the analogy of a university; while the rugby club and the Faculty of Medicine are separate communities, they're not connected by occasional bridges, as the Granovetter weak-tie model would have it, but by a densely-connected overlap, where almost all the rugby-playing medics know each other. The idea is set out succinctly in another Yang and Leskovec paper, Overlapping Communities Explain Core-Periphery Organization of Networks (2014): "the connectedness of nodes is proportional to the number of shared community memberships, and not just their similarity along a single dimension or community."
The point is almost trite: if our network is characterised by isolated communities linked by weak ties, it will be possible to partition it into communities that can be studied in isolation; that partition can easily be achieved with modularity-based approaches. However, if our network has densely overlapping communities, a modularity-based approach to community detection won't be successful; hence the thrust of Yang and Leskovec's work. Just as importantly, it won't be possible to extract these local structures and study them independently, because the boundary zones, which get sundered by such an approach, are if anything more important than any individual community.
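To make the contrast concrete: Yang and Leskovec fit a dedicated model (BigCLAM), but you can get a rough feel for soft, overlapping membership by factorising the adjacency matrix with off-the-shelf non-negative matrix factorisation, so that every node gets a weight in every community rather than a single label. The sketch below - on networkx's built-in karate club graph, with an assumed two-community structure - is only an illustration of the idea, not their method:

```python
# A rough stand-in for overlapping community detection: factorise the adjacency
# matrix with NMF so each node gets a non-negative weight in every community.
# This is NOT Yang and Leskovec's BigCLAM - just an illustration of 'soft'
# membership.
import networkx as nx
import numpy as np
from sklearn.decomposition import NMF

G = nx.karate_club_graph()
A = nx.to_numpy_array(G)                      # adjacency matrix (non-negative)

model = NMF(n_components=2, init="nndsvd", max_iter=500, random_state=0)
membership = model.fit_transform(A)           # shape: (34 nodes, 2 communities)

# Nodes whose weight is split most evenly across the two communities sit in
# the overlap; rank by the smaller weight divided by the larger.
balance = membership.min(axis=1) / (membership.max(axis=1) + 1e-9)
print(np.argsort(balance)[-5:])               # the five most 'overlapping' nodes
```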
Which networks are which? From my own work, the New Orleans and Maryland slave markets, while co-dependent and linked by important individuals like Bernard Moore Campbell, Isaac Franklin and John Armfield, can't be said to be 'overlapping' because of the sheer geographic and economic distance between the two 'communities' - instead, they're linked by a few interstate traders, who do much more business with their suppliers and customers than they do with each other. Indeed, this might be generally true in some economic networks, where firms that share membership of several sectors tend to be competitors. But, for instance, in the Enlightenment 'Republic of Letters' or modern Twitter, we'd perhaps be more likely to see densely connected overlaps.
Network science might seem a long way from intellectual and legal history, but the same kinds of challenge arise - how much detail can we afford to lose? How much detail dare we lose? Whenever questions like this come up I think of a paragraph from Nietzsche; maybe everyone's right to remind their analysts of the value of a simple Yes and No: