(This is a sort of rambling post that I started in April 2017.)
Modularity and abstraction feature prominently wherever computers are involved. This is meant very broadly: it applies to designing software, using software, integrating software, and to a lot of hardware as well. It applies elsewhere too, and almost certainly originated elsewhere first; however, it seems especially crucial around software.
Definitions, though, are a bit vague (including any in this post). My goal in this post isn't to try to (re)define them, but to explain their essence and expand on a few theses:
- Modularity arises naturally in a wide array of places.
- Modularity and abstraction are intrinsically connected.
- Both are for the benefit of people. This usually doesn't need to be stated, but to echo Paul Graham and probably others: to the computer, it is all the same.
- More specifically, both exist to manage complexity by assigning meaningful information and boundaries that allow people to match a problem to what they can actually think about.
What Are They?
People generally agree that "modularity" is good. The idea that something complex can be designed and understood in terms of smaller, simpler pieces comes naturally to anyone who has built something out of smaller pieces or taken something apart. (This isn't to say that reductionism is the best way to understand everything, but that's another matter.) It runs very deep in the Unix philosophy, which ESR gives a good overview of in The Art of Unix Programming - or, listen to it from Kernighan himself at Bell Labs in 1982.
Tim Berners-Lee gives some practical limitations in Principles of Design and in Modularity: "Modular design hinges on the simplicity and abstract nature of the interface definition between the modules. A design in which the insides of each module need to know all about each other is not a modular design but an arbitrary partitioning of the bits… It is not only necessary to make sure your own system is designed to be made of modular parts. It is also necessary to realize that your own system, no matter how big and wonderful it seems now, should always be designed to be a part of another larger system." Les Hatton, in The role of empiricism in improving the reliability of future software, even did an interesting derivation tying the defect density in software to how it is broken into pieces. The 1972 paper On the Criteria To Be Used in Decomposing Systems into Modules cites a 1970 textbook on why modularity is important in systems programming, but also notes that nothing is said on how to divide a system into modules.
"Abstraction" doesn't have quite the same consensus. In software, it's generally understood that decoupled or loosely-coupled is better than tightly-coupled, but at the same time, "abstraction" can have the connotation of something that gets in the way, adds overhead, and confuses things. Dijkstra, in one of few instances of not being snarky, allegedly said, "Being abstract is something profoundly different from being vague. The purpose of abstraction is not to be vague, but to create a new semantic level in which one can be absolutely precise." Joel Spolsky, in one of few instances of me actually caring what he said, also has a blog post from 2002 on the Law of Leaky Abstractions ("All non-trivial abstractions, to some degree, are leaky.") The principle of least privilege is likewise a thing. So, abstraction too has its practical and theoretical limitations.
How They Relate
I bring these up together because abstractions are the boundaries between modules, and the communication channels (APIs, languages, interfaces, protocols) through which they talk. It need not be a standardized interface or a well-documented boundary, though that helps.
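As a minimal sketch in Python (all the names here - KeyValueStore, DictStore, record_visit - are invented for illustration), the boundary can be as small as an agreed-upon set of method signatures:

```python
from typing import Protocol

class KeyValueStore(Protocol):
    """The abstraction: the only vocabulary the two sides share."""
    def get(self, key: str, default: str = "") -> str: ...
    def put(self, key: str, value: str) -> None: ...

class DictStore:
    """One module behind the boundary; nothing outside it sees the dict."""
    def __init__(self) -> None:
        self._data: dict = {}

    def get(self, key: str, default: str = "") -> str:
        return self._data.get(key, default)

    def put(self, key: str, value: str) -> None:
        self._data[key] = value

def record_visit(store: KeyValueStore, user: str) -> None:
    """A module on the other side, written only against the protocol."""
    store.put(user, str(int(store.get(user, "0")) + 1))

store = DictStore()
record_visit(store, "alice")
print(store.get("alice"))  # "1"
```

Neither side needs to know anything else about the other; the protocol is the whole conversation.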
Available abstractions vary. They vary by, for instance:

- …what language you choose. Consider, for instance, that a language like Haskell contains various abstractions, done largely within the type system, that cannot be expressed in many other languages. Languages like Python, Ruby, or JavaScript have abstractions that are meaningful only in the context of dynamic typing (see the sketch after this list). Some languages more readily permit the creation of new abstractions, and this might lead to a broader range of abstractions implemented in libraries.
- …the operating system and its standard library. What is a process? What is a thread? What is a dynamic library? What is a filesystem? What is a file? What is a block device? What is a socket? What is a virtual machine? What is a bus? What is a commandline?
- …all the other kinds of libraries a language might use, and entire frameworks that cross language boundaries. Consider something like Apache Spark, which deals in abstractions that may be accessed from various languages.
- …the time period. How many of the abstractions named above were around or viable in 1970, 1980, 1990, or 2000? In the opposite direction, when did you last use that lovely standardized protocol, CGI, to let your web application and your web server communicate, use PHIGS to render graphics, or access a large multiuser system via hard-wired terminals?
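To illustrate the dynamic-typing point from the first bullet, here is a hedged Python sketch (first_words is a made-up function). The abstraction is never declared anywhere; any object that yields lines when iterated satisfies it:

```python
import io

def first_words(source):
    # No declared interface at all: anything that yields lines when
    # iterated (an open file, a StringIO, a socket wrapper...) works.
    return [line.split()[0] for line in source if line.strip()]

print(first_words(io.StringIO("alpha one\nbeta two\n")))
# ['alpha', 'beta'] -- an open file object would work identically
```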
As such, the possible ways to modularize things vary too. It may not even be evident that a certain modularization can or should exist until the same thing has been done other ways dozens, hundreds, or maybe thousands of times.
Other terms are related too. "Loosely-coupled" (or loose coupling) and "tightly-coupled" refer to the sort of abstractions sitting between modules, or to whether there even are separate modules. "Decoupling" involves changing the relationship between modules (sometimes, creating them in the first place), typically splitting something into two more sensible pieces separated by a more sensible abstraction. "Factoring out" is really a form of decoupling in which smaller parts of something are turned into a module which the original thing then interfaces with (one canonical example, sketched below, is taking some bits of code that are very similar or identical in many places and moving them into a single function). To say one has "abstracted over" some details implies that a module is handling those details, that the details shouldn't matter, and that what does matter is the abstraction one is using.
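A minimal sketch of that canonical factoring-out example in Python (Item, cart, and cart_total are hypothetical names):

```python
from collections import namedtuple

Item = namedtuple("Item", "price qty")
cart = [Item(2.50, 4), Item(1.25, 2)]

# Before: the same computation duplicated wherever it is needed.
subtotal = sum(i.price * i.qty for i in cart)
tax = sum(i.price * i.qty for i in cart) * 0.07

# After: the duplicated bit becomes a tiny module with one interface.
def cart_total(items):
    return sum(i.price * i.qty for i in items)

subtotal = cart_total(cart)
tax = cart_total(cart) * 0.07
```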
One of Rich Hickey's favorite topics is composition, and with good reason (and you should check out Simple Made Easy regardless). This relates as well: composing things effectively into bigger parts requires that they support some common abstraction.
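A small Python sketch of that idea: each stage below consumes and produces the same abstraction (an iterable of lines), so the stages compose in whatever order makes sense. The stage names are invented for illustration:

```python
# Each stage speaks the same abstraction (an iterable of lines),
# so stages compose freely.
def strip_comments(lines):
    return (l for l in lines if not l.lstrip().startswith("#"))

def nonblank(lines):
    return (l for l in lines if l.strip())

def numbered(lines):
    return (f"{n}: {l}" for n, l in enumerate(lines, 1))

text = ["# header", "", "alpha", "beta"]
for line in numbered(nonblank(strip_comments(text))):
    print(line)   # "1: alpha", "2: beta"
```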
In the same area, Composition over convention is a good read on how frameworks run counter to modularity: they aren't built to behave like modules of a larger system.
The contrasting terms interface and implementation are commonly seen in software, with "implementation" loosely referring to what is inside a module, and "interface" referring to its "outside" boundaries and thus to the abstractions it supports. You'll commonly hear advice about separating interface from implementation, along with various semi-related maxims.
Why?
It has a very pragmatic reason behind it: when something is a module unto itself, presumably it relies on specific abstractions, and it is possible to freely change that module's internal details (provided it still respects the same abstractions), to move it to other contexts (anywhere that provides the same abstractions), and to replace it with other modules (anything that respects the same abstractions).
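A quick illustration using only the Python standard library: json.dump is written against the text-file abstraction, so the module on the other side of that boundary can be swapped freely - an in-memory buffer, an ordinary file, or a compressed file - without json knowing or caring:

```python
import gzip
import io
import json

payload = {"id": 7, "tags": ["a", "b"]}

def save(obj, writable):
    # Written against the text-file abstraction: anything with .write(str).
    json.dump(obj, writable)

save(payload, io.StringIO())          # an in-memory buffer
with open("out.json", "w") as f:
    save(payload, f)                  # an ordinary file
with gzip.open("out.json.gz", "wt") as f:
    save(payload, f)                  # transparently compressed
```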
It also has a more abstract reason: When something is a module unto itself, the way it is designed and implemented usually presents more insight into the fundamentals of the problem it is solving. It contains fewer incidental details, and more essential details.
That's all very practical for people. It reduces the amount of information that they must handle, and it permits them to reason about the behavior of systems that are unknown or even completely hypothetical.
It can also be seen as a contract, which reduces the amount of communication and often the amount of disagreement. I think this is a useful definition too: it conveys the notion that there are multiple parties involved, that they have already agreed on some specific obligations, and that they are free to behave as needed provided that they fulfill those obligations.
Separation of Concerns gets at this same idea and expresses it in terms of "concerns" rather than contracts.
I referred earlier to the abstractions themselves as both boundaries and communications channels, and invoking "communications" raises the related question of what information is being communicated. (For whatever reason, Wikipedia defines a concern in terms of… information).
Some definitions refer directly to information. The abstraction principle, for instance, aims to reduce duplication of information, which fits with don't repeat yourself: the goal is that "a modification of any single element of a system does not require a change in other logically unrelated elements". Encapsulation likewise refers to it via information hiding. Alan Perlis, in his epigrams, had #20: "Wherever there is modularity there is the potential for misunderstanding: Hiding information implies a need to check communication."
Examples
Network stacks, in particular via the OSI 7-layer model, are a good example of all of this. Higher-level protocols can work in a way that disregards lower-level details (most of the time, anyway - bandwidth and latency do sometimes intrude). Lower-level protocols can advance and be replaced without much concern for their higher-level use.
Even the early innovation of packet switching is a great instance of abstracting network and routing details away from communications.
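A small illustration, assuming Python and network access: code written at the transport layer neither knows nor cares what the lower layers are doing.

```python
import socket

# Everything below TCP has been abstracted away: this same code works
# whether the bytes cross Wi-Fi, Ethernet, or the loopback device.
with socket.create_connection(("example.com", 80), timeout=5) as s:
    s.sendall(b"HEAD / HTTP/1.1\r\nHost: example.com\r\n"
              b"Connection: close\r\n\r\n")
    print(s.recv(200).decode("ascii", "replace"))
```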
Disk caches, and memory caches, and most other kinds of caches, work because they still implement the same underlying abstraction (albeit with some minor leakage).
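Python's functools.lru_cache is a handy concrete case of the same thing: the cached function keeps exactly the same call interface, so callers cannot tell (aside from speed) that a cache sits in between:

```python
from functools import lru_cache

@lru_cache(maxsize=256)
def slow_lookup(key: str) -> str:
    # Imagine an expensive fetch here. For a pure function, the cached
    # version has the same signature and the same observable results.
    return key.upper()

slow_lookup("a")   # computed
slow_lookup("a")   # served from the cache; same interface, same result
```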
Even DOS had useful abstractions. Things like DriveSpace/DoubleSpace/Stacker worked well enough because most software that needed files relied on DOS's normal abstractions to access them - so it did not matter to them that the underlying filesystem was actually compressed, or was actually a RAM disk, or was on some obscure SCSI interface. Likewise, for the silliness known as EMS, applications that accessed memory through the EMS abstraction could disregard whether a "real" EMS board was providing that memory, or an expanded memory manager was providing indirect access to some other memory, or even a hard disk was pretending to be memory.
Less-Conventional Examples
One thing I've watched with some interest is when new abstractions emerge (or, perhaps, old ones become more widespread) to solve problems that I wasn't even aware existed.
It really is the future talks about a lot of more recent forms of modularity from the land of devops, most of which were completely unheard-of in, say, 2010. Functional Geekery episode 75 talks about many similar things.
Jupyter Notebook is one of my favorites here. It provides a notebook interface (similar to something like Maple or Mathematica) which:

- allows the notebook to use various different programming languages underneath,
- decouples where the notebook is used from where it is running, since it is implemented as a web application accessed through the browser,
- decouples the presentation of a stored notebook from Jupyter itself by using a JSON-based file format which can be rendered without Jupyter (as GitHub does if you commit a .ipynb file); a minimal sketch of that format follows this list.
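As a rough sketch of that file format (real notebooks carry more metadata, but the shape is just this), here is a minimal .ipynb written with nothing but the standard library:

```python
import json

# The skeleton of an nbformat-4 notebook: a list of cells plus metadata.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "source": ["print('hello')\n"],
            "outputs": [],
        }
    ],
}

with open("minimal.ipynb", "w") as f:
    json.dump(notebook, f, indent=1)
```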
I love notebook interfaces already because they simplify experimenting by handling a lot of things I'd otherwise have to do manually - like saving results and keeping them lined up with the exact code that produced them. Jupyter adds some other use-cases I find marvelous - for instance, I can let the interpreter run on my workstation which has all of the computing power, but I can access it across the Internet from my laptop.
Apache Zeppelin does similar things with different languages; I've just used it much less.
Another favorite of mine is Nix (likewise its cousin Guix). One excellent article, The fundamental problem of programming language package management, doesn't ever mention Nix but explains very well the problems it sets out to solve. To be able to combine nearly all of the programming-language-specific package managers into a single module is a very lofty goal, but Nix appears to do a decent job of it (among other things).
The Lua programming language is noteworthy here. It's written in clean C with minimal dependencies, so it runs nearly anywhere that a C or C++ compiler targets. It's purposely very easy both to embed (i.e. to put inside of a program and use as an extension language, such as for plugins or scripting) and to extend (i.e. to connect with libraries to allow their functionality to be used from Lua). GNU Guile has many of the same properties, I'm told.
We ordinarily think of object systems as something living in the programming language. However, the object system is sometimes made a module that is outside of the programming language, and languages just interact with it. GObject, COM, and XPCOM do this, and to some extent, so does Qt & MOC - and there are probably hundreds of others, particularly if you allow dead ones created during the object-oriented hype of the '90s. This seems to happen in systems where the object hierarchy is in effect "bigger" than the language.
ZeroMQ is another example: a set of cross-language abstractions for communication patterns in a distributed system. I know it's likely not unique, but it is one of the better-known and the first I thought of, and I think their guide is excellent.
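A hedged sketch of ZeroMQ's request-reply pattern using the pyzmq binding (a third-party package; equivalent bindings exist for many languages, which is exactly the cross-language abstraction at issue):

```python
import zmq

ctx = zmq.Context()

# The REP side: binds and waits for requests.
server = ctx.socket(zmq.REP)
server.bind("tcp://127.0.0.1:5555")

# The REQ side: connects and sends a request.
client = ctx.socket(zmq.REQ)
client.connect("tcp://127.0.0.1:5555")

client.send(b"ping")
print(server.recv())   # b'ping'
server.send(b"pong")
print(client.recv())   # b'pong'
```

The same pattern, unchanged, would work with the two sockets in different processes, on different machines, or written in different languages.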
Interestingly, the same iMatix behind ZeroMQ also created GSL and explained its value in Model-Oriented Programming, in which abstraction features heavily. I've not used GSL, and am skeptical of its stated usefulness, but it looks like it is meant to help create compile-time abstractions that likewise sit outside of any particular programming language.
hypothes.is is a curious one that I find fascinating. They're trying to factor out annotation and commenting from something that is handled on a per-webpage basis and turn it into its own module, and I really like what I've seen. However, it does not seem to have caught on much.
The Unix tradition lives on in certain modern tools. jq has proven very useful anytime I've had to mess with JSON data. socat and netcat have saved me numerous times. I'm sure certain people love the fact that Neovim is designed to be seamlessly embedded and extended with plugins. suckless perhaps takes it too far, but gets an honorary mention…
People know that I love Emacs, but I also believe many of the complaints about how large it is. Even though it is basically its own operating system, it has considerable modularity within. The same applies somewhat to Blender, I suppose.
Consider Machine Learning: The High Interest Credit Card of Technical Debt, a paper that anyone working around machine learning should read and re-read regularly. Large parts of the paper are about ways in which machine learning conflicts with proper modularity and abstraction. (However, Neural Networks, Types, and Functional Programming is still a good post and shows some sorts of abstraction that still exist at least in neural networks.)
Even more abstractly: emulators work because so much software respected the abstraction of some specific CPU and hardware platform.
Submitted without further comment: https://github.com/stevemao/left-pad/issues/4