Fixing a really ugly model: A post-script to our demo of simple Bayesian inference

In a recent blog (click here) I illustrated how to conduct Bayesian inference by applying it to a toy fishing prediction problem. This blog post hopes to offer an improvement. For a full understanding you'll need to read the original post, although you can skip over the description of my fishing trip. There is revised code too for this blog, which runs in Matlab or Octave; you can get the code here.

The earlier blog was intended to show how easy it is to execute this form of inference and to provide some example code. Although I said the model was not to be taken too seriously, there were some “ugly” features of the model which were very awkward and gave paradoxical results. The purpose of this blog is to provide a neater version of the model and so avoid the inherent contradictions of the previous version.

In the model we tried to infer or predict the number of available fish, denoted v, in the swim which an angler is fishing. The “swim” is the part of the water in front of the lakeside or river bank from which the angler is fishing. We tried to predict v from the time it took from starting to fish until the first “bite”, denoted u. The model assumed that u (a time in minutes or seconds) was a decreasing function of v. This seems reasonable as the more fish available in the swim the less time the angler should have to wait for a bite. The functional relationship was denoted u=g(v).

The blog compared two variants of the function. The first was a non-linear function in which u initially decreased rapidly as v increased, before reaching an asymptote. At the asymptote of u, the value of v was large enough that the time to the first bite reached a minimum of about a minute. Larger values of v then had no further effect on u. This minimum for u is due to the physical constraints of the fishing process which mean that one cannot get the first bite in under 60 seconds.

The other function, by contrast, was linearly decreasing. This means that u decreases at a constant rate as more and more fish are available. The first ugly aspect of the model was the way I tried to deal with the case when v became so large that u would decrease below zero. Obviously it is not possible for times to be negative; indeed, I just noted that in reality u could not drop below 1 minute. The solution was to argue that when fishing a swim for Maxv minutes the maximum number of fish "available" AND that I could catch was Maxv, given that I could catch at most 1 fish per minute. Maxv was set to 45 as my fishing session in the same swim was 45 minutes long. The linearly decreasing function could then be then written as 

u = Maxv + 1 - v

where v ranged from 0 to 45. When v=0 this function gives u=46 (meaning that I would not get any bites in the 45 minute session); when v=Maxv then u=1 (meaning that the first bite would occur as quickly as possible, one minute after beginning to fish).

This is ugly because it is highly artificial. It seems likely that there might be scores of fish in your swim if there was a shoal feeding in the water in front of you. It would thus be more natural to try to infer the total number of fish present, not just the number one could catch in the time available. It is also much more natural, IMO, to replace Maxv in the above expression by Maxu, the maximum amount of time I could wait for my first bite. Thus it is better to write:- 

u = Maxu + 1 - v

The fishing trip was 45 minutes long so with the above expression Maxu could be set at 45 minutes so that, when there are no fish available (v=0), u becomes >45 minutes.

One could deal with controlling u to be always >=1 by using a so-called piecewise linear function for u: the first piece would be decreasing linearly as a function of v and then when u reached a value of 1 the second piece would be flat; u would remain at a value of 1 as v increased. The new code (see link above) makes these changes.

Another important way in which the old model was ugly (or at least overly-rigid) was that I was thinking of the scale for v as literally measuring the number of fish. In fact, it is better to consider the scale for v to be arbitrary, so that a unit change in v does not mean a difference of one fish. At first, this relaxation of the model seems unnatural but it is a reasonable perspective to take. Imagine that I rewrote the above linearly decreasing function as follows:- 

u = Maxu + 1 - k*v

where k is a parameter which controls the slope of linear decrease. This more general form of the function could indeed work with exact numbers of fish, by varying the value of k appropriately. By fixing k to have a value of 1 (as we will continue to do here), we are effectively putting the values of v on an arbitrary scale.

One further feature of the previous modelling needs fixing. Bayesian inference requires one to use a prior distribution. In this case the prior distribution was for v. The next ugly feature was the choice of prior and its impact on the modelling results, especially when using the linear function. I argued that I normally would expect to have about 10 bites in a typical 45 minute session in the swim where I was fishing, with rarely more than 16 bites and rarely fewer than 4. On this basis I used a normal distribution with mean 10 and standard deviation (s.d.) of 3 as the prior for v. Frankly, this now seems a very odd choice for the prior, as well as an ugly one. At 10 bites per 45 minutes, this might logically imply around a 4.5 minute wait until the first bite (assuming bites are uniformly distributed over the session). However, if we plug v=10 into the above expression for we get a value for u of 37 minutes?! Clearly, waiting so long for the first bite is not compatible with having 10 bites in the 45 minute session. The choice of prior was totally inappropriate (at least when working with the linearly decreasing function).

A much better approach would be to estimate the prior for v based on our typical experience with u, the entity we directly experience. So, we can use our typical experience of a 4.5 minute average wait for the first bite in this swim (u=4.5) to infer a prior mean for v. Rearranging the above expression and inputting a value of 4.5 for u we get an estimated prior mean for v=41.5.

To estimate the s.d. of the prior v distribution we can recall that our experience in this swim was that the number of bites in a 45-minute session could be approximated by a normal distribution with mean=10 and s.d.=3.  One s.d. below the mean would therefore give us 7 bites in the 45 minute session. Again assuming bites come at a uniform rate then the first bite might be expected after approximately 6.5 minutes. Plugging this into the function above we get a value one s.d below the mean of v which is 39.5, thereby giving an s.d. =2 for the prior v distribution.

Now, we are almost ready to rerun the revised version of the two models. First, we need to determine, in a comparable fashion, the parameters of the prior distribution for v when using the non-linear function. Using the same process we just went through for the linear function we get a prior mean of v=9 and an s.d. of 3. As this function is non-linear there is a slight asymmetry in the v distribution around the mean, but it is so minor at these values of v we can ignore it. We can see that we need to use very different values for the prior distribution of v in order to achieve our initial expectation of a first bite after 4.5 minutes. The model in the previous blog tried to use the same prior (mean v=10; s.d. =3) for both functions; while these values were a reasonable choice for the non-linear function they were completely inappropriate for the linear function, as we have seen. This mismatch in the prior distribution values for the two functions is a required feature rather than a bug, and we can feel relaxed about making this choice now we have also realised that the scale for v is arbitrary as discussed earlier.

Our data for u from the fishing session was that the first bite came after 1.5 minutes (u=1.5). Below we plot the prior and posterior distributions for v, first for the non-linear model and then for the linear model. Both plots show the characteristic features of Bayesian inference: shrinkage and regularisation. The posterior probability distribution, in either the linear or non-linear model case, is more precise (i.e., shows a lower s.d.) than the prior distribution. This reduction in error of estimation is shrinkage.





Regularisation is shown when the posterior distribution is shifted to occupy a position somewhere between the position indicated by the current observation and the mean (or mode) of the prior. When using the non-linear model the prior mean for v was 9 and, with this function, the observation that u=1.5 minutes on its own implies a value for v of 29. The posterior mean for was 13.1, indicating a strong regularising pull from the prior distribution. We can also see this clearly if we convert the means of these v distributions into equivalent u values. By design, the prior mean for v equated to a u of 4.5 minutes. The posterior mean for v equates to a u value of 3.2 minutes; this is regularised to be closer to the prior mean than it is to the current observation of u=1.5 minutes.

When using the linear function we see much weaker regularisation by the prior. In this case the prior mean for was 41.5 and the observation that u=1.5 minutes on its own implies a value for v of 44.5. The posterior mean for was 44.1. Again we can look at this via the equivalent u values. Once again, by design, the prior mean for v equated to a u of 4.5 minutes. The posterior mean for v equates to a u value of 1.9 minutes; this is obviously very close to the observed value of 1.5 minutes.

A test of these predictions for v is provided by observing the time from the first to the second bite. The posterior mean using the non-linear model gives an expected value of about 3.2 minutes. The posterior mean from the linear model suggests a value of about 1.9 minutes. In the actual fishing session concerned the linear model was far closer to reality, as the second bite came around 1-1.5 minutes after the first. This suggests that the linear model linking u to v is more accurate than the non-linear one. One can iterate Bayesian inference more formally by using the posterior after one inference to act as the prior for the next inference. This process is known as Bayesian updating. We will look at this explicitly in a subsequent blog.

A final question would be to ask why would you want to predict v from u in this situation, especially as we have noted that v is on an arbitrary scale? After all, u is relatively easily observed and v is a latent variable which is not easily measured. Of course, one could use various methods (e.g., sonar or electrofishing etc) to actually measure how many fish there are in a swim after observing u. One could do this across several sessions and see if the relative predictions for v were correct, based on the value for u measured in each session. But this is a toy model that no-one is going to take seriously enough to subject it to such a labour-intensive test.


Conclusions
The issues addressed in this blog post reveal several valuable lessons for modellers.

1) When a model behaves in a paradoxical way then this is a very strong clue that something is awry, and suggests that the model needs re-considering. In my earlier blog post, a seemingly well-fitting model with the linearly decreasing function indicated that, before observing any data, the first bite was expected after more than 35 minutes and yet it was based on a prior distribution that itself was supposedly based on expecting 10 bites in 45 minutes!?

2) After discovering a potential error, or even better before making one, you should think through your model choices very carefully. It is very easy to make a duff choice and convince yourself it is logical and reasonable.

3) For Bayesian models the choice of the prior is very important. 


Comments

Popular posts from this blog

Galton's most important statistical insights

Is it a pike or a perch? It's neither; it's a zander.