Alex Mason

Softwarrister, friend to animals, beloved by all

Math needs a Makeover

In college, my majors were cognitive science (emphasis in computation) and mathematics (emphasis in computer science). The hardest math courses for me were the ones with the most weird symbols. By “weird” I don’t just mean Greek letters, although lowercase zeta, xi, wau, and sigma all look the same. I mean symbols that are supposed to stand for a word, but don’t offer any clues. I believe they place an unnecessary burden on working memory, which needs to be free to take in the structures behind the symbols. I can accept some weird symbols for brevity and compatibility with older texts, the same way I accept the convention of the negatively charged electron, but there is often too much of it.

Today I’ll explore the problems in detail, and take a first stab at a solution.

Sample text

Look at how this textbook (PDF) describes time series statistics. There are some acceptable symbols: “T” is for “time,” “i” is for “index,” and given the context, I’ll even accept that “epsilon” is for “error.” “Beta” is never described. Here’s formula (1).
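
The formula itself is an image in the PDF, so I can’t paste it; judging by the discussion that follows, it’s presumably something like this (my LaTeX reconstruction, exact subscripts guessed):

y_t = \beta x_t + \epsilon_t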

Formula (1) is not so bad. The reader is understandably assumed to have been taught the “regression model,” and knows that it is a linear model that selects a “beta” coefficient to minimize the error when predicting one random variable from the value of another. It’s how we get the lines that are often drawn over scatter plots. With that context, it’s even acceptable that “Y” corresponds to the y-axis of a graph. It’s just the K-12 “y = ax + b” for drawing a line on a graph, but with “beta” instead of “a,” and “b” being the mean of the error term, which is assumed to be random noise. Even though “beta” was always a mysterious choice throughout my statistics courses, we can move forward confident in our understanding.

Formula (2), above, throws some curveballs. alpha_i, mu_i, and beta_i are “sequences of coefficients” that depend on “parameters.” k, q, and p are never described. In earlier courses, “mu” refers to the mean of a probability distribution, so if you try to use that as a hint for its meaning here, you’ll only confuse yourself. Plus, (t - i) has the shape of a binomial, and the author chose to present each series as a function of t instead of subscripting the index, so you also have to remember that (t - i) is not a separate term to multiply.
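
Reconstructed in LaTeX from that description (the pairing of the bounds k, q, and p with the three sums is my guess, and alpha_0 is taken as 1 so that the rearrangement in formula (3) works out), formula (2) has roughly this shape:

\sum_{i=0}^{p} \alpha_i \, y(t-i) = \sum_{i=0}^{q} \beta_i \, x(t-i) + \sum_{i=0}^{k} \mu_i \, \epsilon(t-i)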

For formula (3), also above, we are invited to remember that phi_i is negative alpha_i, because we subtracted the (sum from 1 to p of alpha_i times y of t minus i) from both sides. You can tell because the tiny 0 under the alpha sum became a tiny 1 under the phi sum.
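
And if that reconstruction is right, formula (3) is the same equation solved for y(t):

y(t) = \sum_{i=1}^{p} \phi_i \, y(t-i) + \sum_{i=0}^{q} \beta_i \, x(t-i) + \sum_{i=0}^{k} \mu_i \, \epsilon(t-i), \qquad \phi_i = -\alpha_i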

Perhaps I’m being unfair, since the above is an appendix meant as a review for people who have already studied the subject. Here is an introductory text (PDF). It starts with nice, easy capital letters, but by page 5 it pulls in gamma_t, meaning the autocovariance function; tau, meaning an integer that gamma_t does not depend on; and rho_t, meaning the autocorrelation function. Suddenly, on page 6, we’re told:
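
That excerpt is also an image, so here’s a stand-in with the same cast of characters (my guess at the kind of formula being shown, not a quote), the sample autocorrelation:

r_\tau = \frac{\sum_{t=1}^{T-\tau} (x_t - \bar{x})(x_{t+\tau} - \bar{x})}{\sum_{t=1}^{T} (x_t - \bar{x})^2}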

X-bar, you’ll naturally recall, is the mean or expected value of the random variable X. Probably. This formula does at least resemble the covariance functions the reader should have seen in past courses. My problem is that the glyph r is rendered here with a branching structure like gamma, just with different lengths, and it’s a micro-smudge away from being tau.

Summary of issues

  • Everything is italicized, making letters feel wishy-washy and priming the mind to see them as hard to grasp
  • A great deal of meaning, definitional and contextual, is crammed into single letters
    • Readability relies on inference from the reader’s knowledge of hidden traditional usages of symbols, and those inferences sometimes conflict or mislead
  • Some symbols are so small that it’s easy to miss changes from one equation to the next
  • Symbols look alike
  • Operators are named after people instead of their behavior
  • Function application can look like multiplication
  • One version of lowercase sigma looks a lot like lowercase zeta, and somewhat like lowercase xi
    • Students taking notes are ambushed with these swirlies, and working out how to draw them takes attention away from important information in the lecture
  • The convention d²x/dy² violates the order of operations
    • If you’re saying d is a standalone factor, so that d·(dx) == d²x, then dy·dy == (dy)², which is different from d·(y²)

Each of these issues is something that programmers and programming languages try to avoid. We even have IDEs that make it easy to look up the definition and other uses of a symbol. Decades ago, a one-, two-, or three-letter variable name saved typing strain and precious memory on a shared machine. But those days are gone. We strive for self-documenting code now: code whose function and variable names explain the process so well that comments aren’t needed.

I should, of course, recognize the value of the traditional format of math expressions:

  • Concise and canonical for reference
  • Common international understanding
  • Saves paper and ink
  • Compatible with older work

We can keep these expressions around as the distillation of the idea, but there is so much to gain from a longer, more verbose construction.

Unpacking

I want to try rewriting equations and formulae from a functional programming angle, because it invites the reader to digest one piece at a time and build the whole thing up with a richly annotated structure. I picked time series statistics because there’s a direct application with datasets, so it would be natural to teach with code alongside the fancy TeX.

How might we express the unwieldy formulae (2) and (3) in Python? There will be compromises, because math notation is mostly declarative, and the variables aren’t given values. Where that happens, I declare them with ... because they are assumed to come from somewhere random. It would be too messy to try to make a class definition with all those variables in the constructor. I also solve for the error series at the beginning, to communicate that x_series and y_series are observed.

# Random variables in time series.
y_series = [...]
x_series = [...]
# Predicted regression coefficients 
x_coeff = [...]
# Derive the regression error for each time step
err_series = [
  y_series[t] - (x_series[t] * x_coeff[t]) for t in range(len(y_series))
]
# The general temporal regression model predicts the Y value
# at a given time step, based on weighted sums of all series
# values from past time steps.
# The coefficient for current Y (where lag is zero) is fixed as 1
# for simplicity.
y_lag_weights = [1, ...]
def WeightedY(time, lag):
  return y_series[time - lag] * y_lag_weights[lag]
x_lag_weights = [...]
def WeightedX(time, lag):
  return x_series[time - lag] * x_lag_weights[lag]
err_lag_weights = [...]
def WeightedError(time, lag):
  return err_series[time - lag] * err_lag_weights[lag]
def SumLaggedX(time, total_lag):
  return sum(WeightedX(time, i) for i in range(total_lag))
def SumLaggedY(time, total_lag):
  return sum(WeightedY(time, i) for i in range(total_lag))
def SumLaggedError(time, total_lag):
  return sum(WeightedError(time, i) for i in range(total_lag))
total_lag_y = ...
total_lag_x = ...
total_lag_err = ...
time = ...
assert SumLaggedY(time, total_lag_y) == SumLaggedX(time, total_lag_x) + SumLaggedError(time, total_lag_err)
# By subtracting the Y-sum from both sides, except for the
# term where lag is zero, we arrive at an equation for y at the given
# time, based on past values of Y, and both current and past values of
# X and Error. 
y_lag_weights = y_lag_weights[1:]
assert y_series[time] == -SumLaggedY(time - 1, total_lag_y - 1) + SumLaggedX(time, total_lag_x) + SumLaggedError(time, total_lag_err)

This is just a starting point, based on a commonly known language. Some clarity is lost, but far more is gained. It also feels more honest about the time the reader should expect to spend integrating the equation into their semantic memory. A proper reimagining of math notation shouldn’t try to be executable source code, but writing the above is a good study exercise. If a student writes additional code to read data and derive the lag weights, they can modify the above code to check the results.
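
Here’s a sketch of that exercise (my own addition, not from the textbook): fitting pure autoregressive lag weights from an observed series with ordinary least squares. The function name fit_ar_weights and the AR(2) toy data are mine, made up for illustration.

import numpy as np

def fit_ar_weights(series, p):
  # Estimate w[1..p] in: series[t] == sum(w[i] * series[t - i]) + error.
  # Each row holds the p values preceding time t, newest first.
  rows = [series[t - p:t][::-1] for t in range(p, len(series))]
  targets = series[p:]
  weights, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
  return weights

# Simulate y[t] = 0.6*y[t-1] - 0.3*y[t-2] + noise, then recover the weights.
rng = np.random.default_rng(0)
y = [0.0, 0.0]
for _ in range(5000):
  y.append(0.6 * y[-1] - 0.3 * y[-2] + rng.normal())
print(fit_ar_weights(np.array(y), 2))  # roughly [0.6, -0.3]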

If someone came across this code as part of a library or application, I think it would be pretty easy to follow, even without the textbook. If I had instead used names like beta, alpha, mu, k, p, and q in place of the ...lag_weights and total_lag... names, the code would be nonsense without a reference. Yet some code really does use variable names like alpha. It all feels very uncritical to me.

Name matters

I started this post because I was mad about operators named after people, like “the Laplacian.” Here, imagine talking about an operator like this:

The Herodotian is typically used to describe an affine combination of two elements of a metric space, an algebraic group, or the complex numbers. It often uses the same symbol as the Tarski operator applied to two strings.

That’s my new “dihydrogen monoxide” trick. It’s the plus sign; congrats if you caught it early.

The reason we don’t call it “the Herodotian” is not that Herodotus didn’t invent it, per se, but that it’s longer than the functional name. By contrast, “divergence of the gradient” is more letters and syllables than “Laplacian,” but that’s no excuse to stop thinking. Let’s break down what this operator is all about and come up with a better name.

It’s been a long time since I took vector calculus, but I remember there are three differential operators for functions in a vector space: gradient, divergence, and curl. Curl is the one that takes the most number crunching, and we don’t have to talk about it today.

The gradient operator ∇f works on scalar functions of a vector space (aka scalar fields), usually 2D or 3D space: f(x,y,z) = w. Here w stands for “whatever” and is a scalar result of some formula in x, y, and z. So ∇f is a vector function (aka vector field) g(x,y,z) = (df/dx, df/dy, df/dz) that points in the direction where the value of w grows the most from each point.

The divergence operator ∇·F, much like a dot product, takes a vector field function and produces a scalar function again.
If F(x, y, z) = (a, b, c), where a, b, and c are like w above, the divergence is da/dx + db/dy + dc/dz. If a vector field is thought of as force arrows at every point in the vector space, then the divergence of the vector field is the total amount of force “pushing away” from each point. For a vector field that represents the force of gravity, for example, the divergence would be an extreme negative value at the center of a star.

So when we take the divergence of the gradient, ∇·∇f, we put in a scalar function and get back a scalar function. Because this was all componentwise, the output is effectively the sum of the second derivative of f for each of x, y, and z, or:

∇·∇f(x,y,z) = d/dx(df/dx) + d/dy(df/dy) + d/dz(df/dz)
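
To make that concrete, here’s a quick numerical check (my own sketch, nobody’s textbook): sample a scalar field on a grid, take its gradient, take the divergence of that, and compare against the second derivatives worked out by hand. The field f(x, y) = x²y + sin(y) is an arbitrary choice.

import numpy as np

# Sample the scalar field f(x, y) = x**2 * y + sin(y) on a grid.
xs = np.linspace(-2.0, 2.0, 201)
ys = np.linspace(-2.0, 2.0, 201)
x, y = np.meshgrid(xs, ys, indexing="ij")
f = x**2 * y + np.sin(y)
step = xs[1] - xs[0]

# Gradient: the vector field (df/dx, df/dy).
df_dx, df_dy = np.gradient(f, step, step)
# Divergence of that gradient: d(df/dx)/dx + d(df/dy)/dy.
div_grad = np.gradient(df_dx, step, axis=0) + np.gradient(df_dy, step, axis=1)

# By hand: d2f/dx2 == 2*y and d2f/dy2 == -sin(y).
by_hand = 2 * y - np.sin(y)
inner = (slice(2, -2), slice(2, -2))  # finite differences are rough at the edges
assert np.allclose(div_grad[inner], by_hand[inner], atol=1e-3)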

What makes that more interesting than, say, the magnitude of the gradient?

I’m looking at the Wikipedia page for the operator, and I feel like I’m being punked. The supposed first usage relates “gravitational potential” (or electrical potential), a scalar field, to the “mass distribution” (or charge distribution), another scalar field. The definition of the potential scalar field is a function whose gradient is the force vector field. To me, the force field is the main one. This StackExchange answer tells me that it’s what we directly measure, too. If the mass distribution is just the divergence of the gravitational force field, of what use is this potential scalar field?

I decided to try to find the first usage of the operator in Laplace’s works. It’s attributed to his five-volume Celestial Mechanics. The latest translation, Mechanism of the Heavens, is more of a revision for the sake of clarity. Laplace didn’t show his work, apparently. There are some second-order derivatives there, but I wasn’t confident in calling any of them the original Laplace operator. Neither the French nor the English used the words “gradient” or “divergence” either, and I’m just not going to read hundreds of pages for this.

I found this website by Jeff Miller very helpful for tracking down the origins of today’s conventions. Laplace actually died 40 years before anyone started using ∇, and 46 years before any known text said “Laplace operator.” That first known text is A Treatise on Electricity and Magnetism by James Clerk Maxwell. On page 14 we get our first clue:

Here we see the Potential Function described again as something that another function is the gradient of. The author credits Laplace, in so many words, with interpreting a vector as the gradient of a scalar field. After explaining that the theory of gravity uses a negative gradient and electromagnetism a positive one, he introduces a ∇ operator attributed to Hamilton. Note that today, we do not call it “the Hamiltonian.”

Is that a discovery? It seems like the exact same thing phrased differently. Maybe the significance is that what was once a procedure was distilled into something that can be treated as a multiplier. This does clarify why the same “del” symbol operates on a scalar, and on vectors with the syntax of a dot product and a cross product: it’s a vector. Maxwell calls it the “space-variation” instead of the “gradient.”

Skimming along, I see an interest in line integrals.

From definitions (1) and (2) above, and changing the d to “c for constant,” this means -dΨ - dΨ - dΨ == -cΨ. What? Maybe I missed an earlier definition that is secretly re-established. The next excerpt is the section just before my first excerpt.

So we’re looking at the projection of r onto the tangent of the curve at each point?

“Exact differential” was a good search term, but the outside sources comport with the substitution in (1) and (2). This is a huge problem for me. They keep pretending that differentiation is a multiplier, but it can’t be commutative or associative if any of this stuff is supposed to hold water, so it’s illegal to just slap it next to terms like it’s any old coefficient. Apparently Leibniz is responsible for this, as well as for the integral sign, another weird symbol that’s his deranged version of an s. If we just used an s, maybe high school students would be less intimidated. There is also apparently a “D-notation” that could explain the D in Maxwell’s exact differential. I have to stop trying to read this garbage.

I mean look at this. The road to enlightenment is paved with weird squiggles that catch on your tires and make a nasty scraping sound until you pull over.

Anyway, at some point Maxwell describes what we now call negative divergence as “convergence,” and according to Miller, he suggested in a letter that the negative of convergence be called “divergence”; that term was first published by William Clifford in 1878.

Returning to the original purpose of this adventure, page 29 holds:

I like the sound of that: “concentration.” It was right there the whole time, three short paragraphs below “Laplace’s Operator,” and later scholars just ignored it. Gradient, divergence, curl, and concentration make sense. They follow a theme.

Now I understand why this is more interesting than the magnitude of the gradient. The gradient is not best understood as a vector evaluated at a point, but as a field in the region of the point, whose divergence gives the strength of the scalar field’s manifold inflection.

From now on, ∇·∇f is the concentration of f: not quite density, and not quite compression.

Forging new paths

I wanted to resolve my gripes about math notation so I could proceed, unimpeded, to read this paper by Alan Turing, The Chemical Basis of Morphogenesis, because I saw this video and wanted a more solid foundation for understanding Turing patterns.

When you know what someone was doing when they wrote up a symbol or formula, you have more power over it. You can mold it to fit your internal system.

I wish I had known about Jeff Miller’s resource back in college, and had the wherewithal to try expressing formulas as if they were programs. But as I continue my self-directed education in the natural sciences, I’ll have new tools to help me.

Were it easy to make up a new notation that makes math more transparent, we would have it already. One improvement at a time can ultimately lead to a comprehensive grammar of typographic math expressions. Maybe one already exists (Peano arithmetic?), used to provide inputs to theorem-proving programs, but nothing like it has caught on for general use. To start, I think we can make better use of brackets and other symbols. We can differentiate function application from multiplication by a grouped quantity, by using a different closure for groups or by requiring the dot symbol; a sketch follows below. Older books have used curly braces for grouping. Maybe one day there will be ISO standards. The foundation of math is consistency, and we must restore that foundation to advance the total knowledge and talent of future cohorts. They need to know that Leibniz can’t push them around anymore.
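
As a toy example of the bracket-and-dot idea (my own sketch, not an existing standard), the two readings that tripped me up in formula (2) become typographically distinct:

y(t - i)          % always function application: the series y at time t - i
y \cdot (t - i)   % always multiplication: y scaled by the binomial t - i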

Cover art: Mansell Collection, Commémoration de la Prise de la Bastille (1901)