Using Bayesian inference to predict fishing success: It was surprisingly effective.

(Since the publication of this blog post, I have updated the toy model presented here with a new version that does not employ the poor choices of the model below. The revisions in the new model are described here.)

Having completed my second book of 2023, the excellent "Bournville" (see review here), I was keen to catch fish species number 2. In fact, my blogging is lagging well behind both my fishing and reading. Here is a list of the books I have read so far this year for the challenge (in case anyone might be interested in a sneak preview of what book reviews are in store).
The mild weather earlier in January 2023 had passed and rural Surrey was in the grip of a bitter cold spell, something that’s generally not conducive to catching fish. In fact, as mentioned in the blog post about the first species of the year (the zander), I carried out a scientific study a few years back which demonstrated that I caught fewer zander when the weather was colder! 

However, days were ticking by, and I needed to get out on the bank again: it was 25th January already. Given the cold weather, I thought I'd try out the most prolific venue that I regularly fish at: Stammerham Farm Lakes. Perhaps the large numbers of fish in these lakes would enable me to beat the chilly weather conditions and catch species number two. 
 
A simple way to think about surprise

In my review of Bournville (see link above), I noted how effectively the author (Jonathan Coe) had used sudden surprising revelations about some of the central characters. I therefore decided that the link from that book to something statistical would be an analysis of how scientists and/or statisticians might think about, and model, the surprisingness of an event. I'll introduce some statistical ideas first, before we see how we might analyse the surprisingness of the results of the fishing trip on January 25th.

A future event might be thought of as surprising if it is unpredicted. This natural and obvious way to think about surprise requires us to consider at least two separate timepoints, although they might be very close to each other in time. The event of interest might (or might not) occur at the second timepoint. At the first timepoint, occurring before the event of interest, an individual's brain might have made a prediction about what happens at timepoint two. For example, perhaps it is strongly predicted that a particular event will occur at that later timepoint. The prediction is made based on any relevant information about the world that was available at timepoint one (and also potentially information carried forward from many previous timepoints). For the moment, we can set aside questions of whether the prediction needs to be made explicitly or intentionally. As discussed below, popular modern theories of how the brain works suggest that it is continually, and largely automatically, making predictions about future events, and then comparing those predictions with what actually occurs. If the brain's prediction turns out to be wrong and the predicted event does not occur, or happens to a lesser extent than was expected, then there will be a certain amount of surprise, related to the extent of the prediction error.

Let's make these ideas more concrete with some everyday examples. First, let's consider the scenario where a predicted event does not happen when it was expected to occur. Imagine a Doomsday cult who predict that the world will end on a particular date and time. The cult members might be huddled together in a bunker as that moment approaches, expecting the end of the world. However, with no sign of the apocalypse at the appointed hour, the members of the cult would presumably be very surprised. It seems reasonable to suggest that an individual cult member's degree of surprise would be proportional to the strength of their belief in the prophecy of the world's end. 

A complementary scenario -- the occurrence of an unpredicted event -- can also create surprise. As a cricket batsman you expect to face a series of balls bowled at you by the opposition team's bowlers. The bowlers are always trying to surprise you to a certain extent by varying exactly how fast they bowl and where the ball will bounce before it reaches you. However, the laws of cricket do not allow the bowler to bowl a ball which does not bounce AND is above waist level when it reaches the batsman. Such an illegal ball is colloquially known as a "beamer". It is therefore always a completely unexpected occurrence -- a big surprise -- when the ball slips out of the bowler's hand and flies directly towards your head or chest without bouncing. You can watch a YouTube video of some of the worst examples from professional cricket. The surprisingness of the event means that the batsman is often unable to take effective evasive action.

 

Surprise and predictions in the brain

Over the last two decades, the popular "predictive coding" theory (and variations of that account) has proposed that our brains actively try to make sense of noisy information gleaned from the world around us by continuously making predictions or inferences based on that information. According to Bogacz (2017), predictive coding suggests that the brain's "cortex infers the most likely properties of stimuli from noisy sensory input. The inference in this model is implemented by a surprisingly simple network of neuron-like nodes. The model is called 'predictive coding’, because some of the nodes in the network encode the differences between inputs and predictions of the network." In other words, the models will compute the prediction errors noted above.

Bogacz's manuscript is a tutorial paper which is intended to help students and researchers understand how to construct a variety of predictive coding models. It is quite technical but has a simple underlying message. It also gives some "toy" examples of model code, in Matlab, at the end of the paper.

The first example Bogacz presents is a straightforward implementation of Bayesian inference and makes no pretence of being at all brain-like. Later in the paper Bogacz shows how brain-like neural models can do the same kind of prediction: the general point is that brains have evolved to make predictions about future events by using neural apparatus to approximate a version of Bayesian inference. In an earlier blog (and associated videos) I introduced the notion of Bayes' theorem and Bayesian statistics. These ideas, of course, lie at the heart of Bayesian inference.


A simple example of Bayesian inference

We will start by sticking to the example used by Bogacz for the first model: he considers trying to predict (or infer) the size of a spherical food item, which has a radius denoted v. In the model, the size prediction is based solely on the intensity of the food's visual image as captured by a light-sensitive receptor cell. The single cell's information is inevitably an error-prone (“noisy”) estimate of the light intensity (denoted u). In the model, the error associated with u is assumed to be normally distributed. It is also assumed that the mean level of estimated light intensity will be a function of the size of the food pellet, denoted g(v). The size of the food pellet directly influences (we might even say “causes”) the level of light intensity as detected by the cell. Bogacz uses g(v) = v² in his example, but that choice is not very important. This relationship means that u, in effect, “contains” some information about v. Bayesian inference allows one to leverage the information in u (the thing you can estimate roughly) to predict v (the thing you want to infer).

Bogacz combines the above relationships into Equation 1 of his paper:

Equation 1:   p(u|v) = f(u; g(v), Σu)

 

Expressed in words, this means that the probability of detecting an image intensity u given (in mathematical notation, | means "given") a food item of actual size v is described by a normal probability density function, or normal pdf, denoted f. We will see below what this pdf looks like. The pdf for u has a mean intensity of g(v) and a standard deviation (s.d.) of Σu. We also have some prior information about the likely size of the food object. This will have been learned through previous experience. Again, it is assumed that this can be expressed using another normal pdf, f. Bogacz writes the prior probability for v in Equation 3, thus:

Equation 3:   p(v) = f(v; vp, Σp)

 

In words this means that the prior probability that the food is of size v is described by a normal pdf with mean of vp and an s.d. of Σp. You will probably be familiar with the bell-shaped normal pdf. Below is an example where vp=3 and Σp=1 (these are values which Bogacz used). This pdf implies that roughly 95% of the values of v will lie between 1 and 5. The scale for v is arbitrary.
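For anyone who wants to see this concretely, here is a minimal Octave/Matlab sketch (my own illustration, not Bogacz's code) which evaluates this prior pdf on a fine grid of v values and checks the "roughly 95% between 1 and 5" claim numerically:

v_p = 3; sigma_p = 1;                      % prior mean and s.d. from Bogacz's example
dv  = 0.01;                                % grid step for the numerical work
v   = 0:dv:8;                              % a grid of candidate food sizes
prior = exp(-(v - v_p).^2 / (2*sigma_p^2)) / sqrt(2*pi*sigma_p^2);   % normal pdf for v
mass_1_to_5 = sum(prior(v >= 1 & v <= 5)) * dv     % comes out at roughly 0.95
plot(v, prior); xlabel('v (food size)'); ylabel('prior probability density');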

 

At this point we have the prior probability of v, p(v), and the likelihood of u given v, p(u|v). Bayes’ theorem tells us that the posterior probability of v given u, p(v|u) is equal to the (prior*likelihood)/evidence. In this context the evidence is the probability of u, p(u). The evidence is sometimes also known as the marginal probability or the marginal likelihood.

Bogacz expressed this in Equation 4 of his paper:

Equation 4:   p(v|u) = p(v) p(u|v) / p(u)

The evidence is just the product of the prior*likelihood summed up across all values of v. In mathematical terms this means integrating the prior*likelihood over all values of v. Bogacz gives this integral explicitly in Equation 5 of his paper.

Equation 5:   p(u) = ∫ p(v) p(u|v) dv

When we run Bogacz’s model using my slightly enhanced version of his code, you get the figure below. In my code, and in Bogacz’s original code snippet, the integration is done numerically. This involves dividing the range of possible values of v into very small intervals, computing the prior*likelihood for each interval, and adding them up. In his paper, Bogacz just gives the posterior probability distribution, shown in red below.
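To make the recipe explicit, here is a compact Octave/Matlab sketch of the same grid-based calculation. It is written from scratch for this blog rather than copied from Bogacz's snippet, and it assumes the same toy values used above: g(v) = v², an observed intensity u = 2, and a prior with vp = 3 and Σp = 1, together with Σu = 1 for the noise on u (the value I believe Bogacz used; feel free to change it).

% Exact Bayesian inference on a grid (a sketch, not Bogacz's own code).
u = 2; sigma_u = 1;                       % observed light intensity and its s.d.
v_p = 3; sigma_p = 1;                     % prior mean and s.d. for the food size
dv = 0.01; v = 0.01:dv:5;                 % fine grid over possible food sizes
normpdf_ = @(x, m, s) exp(-(x - m).^2 ./ (2*s.^2)) ./ sqrt(2*pi*s.^2);
prior      = normpdf_(v, v_p, sigma_p);   % p(v), Equation 3
likelihood = normpdf_(u, v.^2, sigma_u);  % p(u|v) with g(v) = v^2, Equation 1
evidence   = sum(prior .* likelihood) * dv;   % p(u), the integral in Equation 5
posterior  = prior .* likelihood / evidence;  % p(v|u), Equation 4
[~, idx] = max(posterior);
posterior_mode = v(idx)                   % lands just below 1.6 with these settings

The whole calculation is just element-wise multiplication followed by a normalisation, which is why a simple grid approach works so well for a one-dimensional problem like this.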


Plotting the posterior and prior in one figure allows you to see easily two ubiquitous features of Bayesian inference: shrinkage and regularisation. A really good account of regularisation is provided in this blog.

Shrinkage results in the posterior distribution having less error (and thus a smaller standard deviation) than the prior distribution. This is clear in the above figure. We are more confident in our estimate of v (the size of the food pellet) after using the information from the light intensity estimate, u, than we were before (without the light intensity information). A sensible form of inference/prediction must lead to this effect whenever the information being employed in our inference is relevant. In this case the degree of shrinkage will be affected by the accuracy of our estimate of the light intensity. One can alter the s.d. of the light intensity estimates in my code to demonstrate this.
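Continuing with the same toy set-up, here is a short self-contained sketch (again mine, with illustrative Σu values) showing that a noisier intensity estimate gives a wider posterior, i.e. less shrinkage:

% How the noise on the intensity estimate affects shrinkage (sketch).
dv = 0.01; v = 0.01:dv:5;
normpdf_ = @(x, m, s) exp(-(x - m).^2 ./ (2*s.^2)) ./ sqrt(2*pi*s.^2);
prior = normpdf_(v, 3, 1);                          % prior: v_p = 3, sigma_p = 1
for sigma_u = [0.5 1 2]                             % illustrative noise levels for u
    post = prior .* normpdf_(2, v.^2, sigma_u);     % unnormalised posterior for u = 2
    post = post / (sum(post) * dv);                 % normalise on the grid
    post_mean = sum(v .* post) * dv;
    post_sd   = sqrt(sum((v - post_mean).^2 .* post) * dv);
    fprintf('sigma_u = %.1f -> posterior s.d. = %.2f (prior s.d. = 1.00)\n', sigma_u, post_sd);
end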

Using a value of u to estimate v will typically mean that the posterior distribution is shifted, relative to the prior distribution, in the direction of the value of v implied by our estimate of u. Regularisation, in the context of Bayesian inference, means that the extent of this shift will be reduced by the information contained in the prior distribution. Recall that Bogacz used the function u=v*v in his model. He also used the value u=2 to produce the figure above. This would imply that the best guess as to the value of v, predicted from u and ignoring any prior information, would be the square root of u; the square root of 2 is roughly 1.41.

However, Bayesian inference takes account of the information in the prior distribution for v. Bogacz used a mean prior estimate of v=3. The mode (the peak, or most probable, value) of the posterior distribution for v in the above figure is just below 1.6. The posterior mode is often considered our best guess for v after carrying out the Bayesian inference. The value of 1.6 clearly lies somewhere between the prior mean of 3 and the point estimate of v (ignoring the prior distribution) of 1.41. The extent of the regularisation produced will depend on the precision of the prior distribution and on the value and precision of the estimate of u.

We can now move on to consider a fishing inference/prediction problem by adapting Bogacz's code for the above example. My experience is that, whenever someone provides a model in a scientific paper, I never really understand it fully until I have adapted it and changed it to capture an analogous but different problem. This is exactly what I am going to illustrate below. In future blogs we will build up other predictive coding accounts using more brain-like models including ones that directly compute prediction errors.

I will set up the simple fishing inference problem and make it directly analogous to his example about the size of food items. Before doing that, however, I need to briefly describe what happened on the Jan 25th fishing trip.
 
How did the fishing go?
In fact, the fishing trip on Jan 25th was full of surprises. First, when I arrived at Clover Lake, its surface was frozen solid. I haven't often fished the lake in winter before, but this was the first time I had ever seen it with ice on it, let alone completely frozen over. It's quite a big lake and the temperatures overnight hadn't dipped that low (they hovered just below freezing) so it was even more surprising that the ice was still there in the late afternoon. 

In normal conditions, the easiest fish to catch in Clover Lake are gudgeon (Gobio gobio). Normally a river species, these amazing, super-aggressive small fish are everywhere in this lake for some unknown reason. They can often be a pest, as they snatch baits meant for larger species. When fishing Clover Lake in summer I would expect to have a gudgeon eating my bait within 5 minutes or so. When I saw the ice covering the lake, I had no idea how active they might be in midwinter. Nevertheless, I thought that I would gently break the surface of the lake, and put some handfuls of ground bait in the resulting hole. Maybe this would bring the gudgeon into my fishing area (this is known as a "swim" and it’s the area in front of where you are sitting on the bank). I used the butt end of my landing net handle (which is about 2m long) to break the ice very gently. The hole I created was perhaps 1-1.5 metres from the bank and about 1 metre across. At least it was close enough to throw the ground bait accurately into the hole. A few minutes later I lowered my maggot-baited hook through the ice hole to see what would happen. I did once do some proper ice fishing on a huge frozen lake in Finland; that was completely different though and much, much colder.

After about 20-30 minutes I had the first clear signs of interest from a fish. Small "knocks" (i.e., small sudden twitching movements) occurred on my float as the fish on the lake bottom began nudging and tugging at the maggots on my hook. After missing a few bites I eventually caught a gudgeon, about an hour after I arrived at the lake. These tiny fish usually weigh less than 1 ounce (28g). They have a pretty iridescent flank and blackish spots on their tails. See the one I caught in the picture below.


I could have stopped there and declared the gudgeon to be species #2 for 2023. However, I had planned a 2-hour trip and I had just under an hour of light left. So, I thought I would see what the lower lake (called Jenny's Lake) looked like. Maybe I could catch something different from there and give myself a choice of what to declare as species #2. Jenny's Lake is usually full of roach (Rutilus rutilus), although again I had rarely fished it in winter.

I packed up my gear, returned to the car, and reparked by Jenny's Lake. My favourite roach swim was just about the only area of the lake that wasn't completely frozen. At least I didn't have to break any ice this time. I had roughly a 3m x 3m area of unfrozen water in front of me. Were there any active roach (i.e., hungry and feeding) in this swim? How many (if any) might I catch? This scenario and set of questions gave me the idea for a prediction problem to which I could apply the Bayesian inference code from the Bogacz (2017) paper.


Using Bayesian inference to make fishing predictions

Whenever I begin fishing, I am usually childishly excited (I lead a simple life). It stems from a mild thrill of the unknown: am I going to catch anything? Will I even get a bite? What will I catch? The time from when the first cast hits the water until the first bite is, for me, a key indicator. I have always used it as a rough way to predict how many fish are “available” in the swim in front of me. Intuitively, if the first bite comes quickly then my expectation is that there are lots of fish available; if it comes more slowly then there are fewer. The more fish that are available, the more I am likely to catch. This act of prediction while I am fishing can be mapped directly onto Bogacz’s first model, where he tries to use the estimated intensity of a visual image (u) of a piece of food to estimate its size (v). Remember that this is also based on the intuitively reasonable belief that the image intensity is caused by the size of the food, with a function, g, relating u to v; namely, u=g(v).

In the analogous fishing inference problem, I am trying to predict the number of fish available (v) to be caught in my swim from my mental estimate of the time between the first cast and the first bite (u). The number of fish available directly affects the time to the first bite. The mental estimate of the time elapsed is error-prone, as I don’t time it with a watch. This is just like the Bogacz example, where the brain cell’s estimate of the food’s image intensity has error associated with it (it is a “noisy” estimate). If u reflects v, then u can be used to predict v, even if the value of u is known only approximately.

Next, I need to consider the form of the functional relationship, u=g(v), between the noisy time estimation value (u) and the thing I ultimately want to infer (the number of available fish in my swim, v).

There are lots of possibilities so I tried to come up with a plausible and reasonable functional relationship. Later we will vary the function and see if it affects the general pattern of our predictions. An “available” fish in my swim means one that is present in front of me, is hungry, and one I am able to catch. The total number available is therefore limited by the time limits implicit in each cast of my fishing line (a single cast is an attempt to catch a fish).

Each cast involves baiting the hook, throwing out the line, waiting to get a bite, reeling in and, if successfully landed, detaching the fish safely before returning it to the water, and then re-baiting the hook. The minimum period to do all of this is about 1 minute. I had about 45 minutes of roach fishing in this swim before dark, so I could have a maximum of 45 bites in the session and, at best, catch a maximum of 45 fish. The time to the first bite (which will be denoted by u) could come after 1 minute or after any other period up to 45 minutes, or not at all. So, if the first bite occurred after 1 minute there could be a maximum of 45 fish available to be caught in my swim. Obviously there may be more than 45 in the swim in total but in effect there are a maximum of 45 available that I could catch, because of the physical limits in the fishing process just discussed. If no bites came in the 45 minutes this suggests there may be no available fish in my swim.

A simple non-linearly decreasing function between these two limits, for both u and v, can be written as

u = Max(v)/(v+1)

where v ranges from 0 to 45 fish available in the swim and Max(v) = 45. In the left-hand panel of the figure below you can see a plot of this non-linear relationship. The non-linear relationship might be a reasonable function because of a phenomenon called competitive feeding. Anglers love to suggest they catch more fish when competitive feeding is occurring. This effect means that the more fish there are in the swim, the more quickly they might grab the bait in order to beat other fish to it (at least this should occur whenever there is limited food available). Thus, if there were 10 available fish in my swim then the time to the first bite would be faster than a tenth of the time to the first bite if only a solitary fish were available in my swim. Hence, the function is non-linear.

An even simpler alternative would be to make the estimate of the time to the first bite a linearly decreasing function of the number of fish available to catch from the swim. Using this relationship implies that there is no competitive feeding. The linear relationship is shown in the right-hand panel of the figure below. The full Matlab code I used for the simulations in this blog, based on Bogacz’s code snippet, can be found here.
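As a quick visual check, here is a sketch of the two candidate functions (not taken from my full Matlab code). The non-linear form is the one given above; the exact slope and intercept of the linear alternative are an illustrative choice that maps v = 0..45 onto waiting times between roughly 45 and 1 minutes.

% The two candidate functions relating fish available (v) to time to first bite (u).
v = 0:0.1:45;
u_nonlin = 45 ./ (v + 1);            % competitive feeding: u = Max(v)/(v+1)
u_lin    = 45 - (44/45) * v;         % a simple linear decrease (illustrative choice)
subplot(1,2,1); plot(v, u_nonlin);
xlabel('fish available, v'); ylabel('time to first bite, u (mins)'); title('non-linear');
subplot(1,2,2); plot(v, u_lin);
xlabel('fish available, v'); ylabel('time to first bite, u (mins)'); title('linear');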

To run the model I also need to estimate the properties of the prior distribution of v, the number of fish available. Over the past occasions when I had fished this spot, I reckon that I caught an average of around 7 roach every 45 minutes, with the range being between 3 and 11 fish caught. So, if I assume that this quantity is normally distributed then the standard deviation (s.d.) would be around 2. But I usually would catch only 70-75% of the bites, so that would mean there were, on average, roughly 10 fish available to be caught every 45 minutes, with an s.d. of roughly 3. Thus, I set the prior normal pdf parameters to be vp=10 and Σp=3 for the fishing prediction model.
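For what it is worth, this back-of-the-envelope conversion from fish caught to fish available can be written out explicitly. This is only my reconstruction of the arithmetic (the 0.72 catch rate is simply a round number within the 70-75% range mentioned above):

% Converting fish caught per 45 mins into fish available per 45 mins (sketch).
mean_caught = 7;   sd_caught = 2;     % past sessions: average 7, range roughly 3-11
catch_rate  = 0.72;                   % proportion of bites converted into fish caught
v_p     = mean_caught / catch_rate    % roughly 10 fish available
sigma_p = sd_caught   / catch_rate    % roughly 3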

Finally, I need to make an estimate for the error in the estimates of the time to the first bite, u. Recall that I never time these periods accurately and they might range from 1 minute to 45 minutes in the scenario we are trying to model. Once again we will assume that the time estimates are normally distributed. It is probably unreasonable to assume that the error is the same for time estimates of 1 and 45 minutes but we are going to ignore this issue and use a fixed s.d. for the whole range of intervals. I set the standard deviation of u to be 1 minute (Σu=1). Using what we know about normal distributions, this choice of s.d. means that if I estimated that the first bite came after 10 minutes it actually occurred somewhere between 8 and 12 minutes (95% of the time). That seems fairly reasonable to me. One can vary this parameter value to see what effect it has on the predictions. All that remains is to actually see how long it took to get the first bite.

How long did I have to wait for the first bite?

Very soon I had my actual estimate for u. In fact, it was only about 90 seconds before I had my first bite. The roach were clearly in my swim and were feeding and thus available to be caught. We will use the value of u=1.5 minutes to run our Bayesian inference below.

Before running the model, we can ask how surprising this value was in relation to the prior distribution. Our estimated prior mean of 10 fish available converts to a wait of about 4.1 minutes for the first bite (45/(10+1)) using the non-linear function, and to about 35 minutes using the linear function. Clearly, the rapid first bite on this freezing day came surprisingly quickly.

I landed the roach shown in the picture below on the first or second bite. Roach can occasionally grow to be 3 or 4 lbs in the UK but often they are little ones like this. This first roach weighed maybe 1-1.5 ounces.




What did our Bayesian inference model predict?

I applied the model with the non-linearly decreasing function linking u and v, illustrated above, coupled with an estimate for u=1.5 mins, with the s.d. for u and the prior pdf for v set as described above.

The posterior probability distribution for the number of fish available had a mode (the peak, or most probable, value) of 13.3. This is our inference about the likely number of fish available to be caught in 45 minutes, and we can think of it as the most likely number of bites I would have in the session. With my assumption that I catch a fish from three-quarters of the bites in a roach-fishing session, this would lead to a prediction of about 10 fish caught. Given my past experience in this swim, it is very likely that the fish caught would almost all be roach.

I said above that we would look at whether the function linking u and v changes the predictions. So I reran the model with exactly the same settings apart from changing to the linearly decreasing function illustrated above. Now, the posterior mode was 40.9 fish available to catch. With the assumed catch:bite success ratio of 0.75 this would lead me to expect to catch about 30 fish (roach) in the session. We can visually compare these two very different predictions in more detail below, with the non-linear function producing the results shown in the left-hand panel and the linear function the results in the right-hand panel. As well as the marked difference in the location of the peak of the posterior pdf, one can see also that the linear function gives a more precise prediction (the s.d. of the posterior pdf is 0.97) compared with the more smeared out posterior pdf generated by the non-linear function (s.d.=2.15). There is therefore more shrinkage with the linear function than with the non-linear function.
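For anyone who wants to reproduce numbers of this kind, here is a sketch of the fishing version of the grid inference, adapted from the food-size example above. It is not my full Matlab code (linked earlier), and the linear form of g(v) is the illustrative choice described above, so its outputs should only roughly match the values quoted in the text.

% Fishing inference on a grid, for both candidate g(v) functions (sketch).
u_obs = 1.5; sigma_u = 1;                 % estimated time to first bite (mins) and its s.d.
v_p = 10; sigma_p = 3;                    % prior over the number of fish available
dv = 0.01; v = 0:dv:45;                   % grid over possible numbers of fish
normpdf_ = @(x, m, s) exp(-(x - m).^2 ./ (2*s.^2)) ./ sqrt(2*pi*s.^2);
prior = normpdf_(v, v_p, sigma_p);
gs    = {@(v) 45 ./ (v + 1), @(v) 45 - (44/45) * v};    % non-linear, then linear
names = {'non-linear', 'linear'};
for k = 1:2
    post = prior .* normpdf_(u_obs, gs{k}(v), sigma_u); % prior * likelihood
    post = post / (sum(post) * dv);                     % normalise (divide by the evidence)
    [~, idx] = max(post);
    post_mean = sum(v .* post) * dv;
    post_sd   = sqrt(sum((v - post_mean).^2 .* post) * dv);
    fprintf('%s g(v): posterior mode = %.1f fish, s.d. = %.2f\n', names{k}, v(idx), post_sd);
end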

We can see differing amounts of regularisation in the above graphs too. Ignoring the prior information completely, we can use the functions linking u to v to see what the estimated value of u=1.5 minutes to the first bite would predict for v, the number of fish available to be caught. With the non-linear function this gives v = 45/1.5 - 1 = 29 fish available. The regularising influence of the prior distribution was clearly quite strong when using this function in the full model; recall that the non-linear version of the model gave 13.3 as the mode of the posterior distribution for v. With the linear function, the regularising effects of the prior distribution were much weaker and the likelihood function dominated the prediction of the full model. Using the estimate of u=1.5 and ignoring the prior, the linear function converts this to a value of v=44.5 fish available, whereas the full Bayesian inference model with the linear function gave a posterior mode for v of 40.9. The blog linked earlier gives a really nice (if slightly more formal) illustration of regularisation.

How did these predictions match up to what happened? The next period was crazy (in fishing terms), with a bite almost every cast (about every 60-90 seconds), and I caught the majority of them, in line with the 75% success rate assumed above. As it had started to get pretty dark, I had to stop fishing for roach after 45 minutes, by which time I had caught a total of 30. Clearly, the number of fish I caught was a very good fit to the prediction of the linear function model and far more than the 10 fish that were predicted by the non-linear function model. Of course, this model is simply a toy to illustrate how one can carry out Bayesian inference; one shouldn’t take it too seriously.


One last cast
If nothing else, this large number of roach I caught were good for the "total number of fish in 2023" part of the challenge (the target is 365 fish).  However, I decided to have one last cast for something a bit bigger using a worm rather than maggots, and ensuring the bait was resting right on the bottom of the lake. Within 60 seconds a larger fish was landed. It was a common carp (Cyprinus carpio) weighing about 2lbs (1kg). By now, dusk had arrived, and the light was fading quickly. So, the picture of that carp below didn’t come out too clearly. As I can catch a roach more or less anywhere, and a gudgeon almost anytime I come to Stammerham Farm lakes, I decided to nominate the common carp as species #2 for the year.



 
I thought about the best strategy for choosing a species if, as on this trip, I were to catch several species in a particular session. Intuitively, it seems that one should pick the species that one catches least frequently. My choice of carp above fits that strategy: although I catch common carp quite often each year, I catch a lot more roach and gudgeon. However, in the last 7 years since I resumed fishing, I have caught a gudgeon only at Stammerham Farm. So, there is also a logic to picking the gudgeon. However, as long as I am able to make at least one more trip to Stammerham Farm in 2023, I should always be able to catch a gudgeon and nominate that species later in the year. These musings about the “species nomination problem” gave me an idea for a future blog on data simulation. I’m sure you can’t wait!

 

Overall fishing stats?

After this second trip my fishing stats looked like this: total fish for the year=33; total number of species caught=4; number of species satisfying the challenge rules=2. Here is a link to all the fish photos I have taken during the 2023 challenge sessions (including this one), and another link to a spreadsheet with all the gory details. The blogs are lagging well behind, and so the spreadsheet shows that I’ve caught more fish and additional species since this second trip.

 

References

Bogacz, R. (2017). A tutorial on the free-energy framework for modelling perception and learning. Journal of Mathematical Psychology, 76(Pt B), 198–211.

 

 

 

 

Comments

  1. Although the models presented in this blog are for illustration purposes, there is something paradoxical about them that is bugging me. As it is untidy/slightly annoying, I'll need to create a post-script shortly to sort this out. As usual, I still need to check the compatibility of the code with Octave, too.

  2. An Octave-compatible version of the code for the model in this blog can be found at this repository: https://github.com/Alan-Pickering/Bayesian_inference
