Thursday, March 29, 2018

an ancient annal of computer science

Over the last year I have been interested in developing my programming / coding, to get to the point where I can be more confident about sharing my code with other people, and to be able to contribute to general-purpose numerical / statistical software.

As part of this effort I have dipped into The Art of Computer Programming (TAOCP) by Donald Knuth.  The cover says "this multivolume work is widely recognized as the definitive description of classical computer science."  American Scientist listed it as one of the 12 top physical-science monographs of the 20th century, alongside monographs by the likes of Albert Einstein, Bertrand Russell, von Neumann and Wiener - http://web.mnstate.edu/schwartz/centurylist2.html.

I am sure there are many other books that cover similar material at a more introductory level, but I find something exciting about going back to the source and reading an author who was personally involved in fundamental discoveries and developments.

There are also probably more modern accounts of computer programming that better reflect more recent innovations.  Knuth himself encourages readers of TAOCP to look at his more recent work on Literate Programming.  But I also think it is worth dwelling on things that have proven to be useful to a wide range of people over an extended period of time.

I have Volume 1 of the Third Edition of TAOCP, published in 1997, which is already prehistoric in some senses - it was published before Google was founded (1998) and well before Facebook was launched (2004).  However, parts of the book date a lot further back than that - Knuth's advice on how to write complex and lengthy programs was mostly written in 1964!

Here is a summary of that advice (pp. 191-193 of TAOCP Volume 1):

Step 1: Develop a rough sketch of the main top-level program.  Make a list of the subroutines / functions that you will need to write.  "It usually pays to extend the generality of each subroutine a little."
Step 2: Create a first working program, starting from the lowest-level subroutines and working up to the main program.
Step 3: Re-examine your code, starting from the main program and working down, studying for each subroutine all the calls made on it.  Refactor your program and subroutines.

Knuth suggests that at the end of Step 3 "it is often a good idea to scrap everything and start again".  He goes on to say "some of the best computer programs ever written owe much of their success to the fact that all the work was unintentionally lost, at about this stage, and the authors had to begin again." - quite a thought-provoking statement!

Step 4: Check that when you execute your program, everything is taking place as expected, i.e., debugging.  "Many of today's best programmers will devote nearly half their programs to facilitating the debugging process in the other half; the first half, which usually consists of fairly straightforward routines that display relevant information in a readable format, will eventually be thrown away, but the net result is a surprising gain in productivity."
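
To make the shape of this workflow concrete, here is a minimal Python sketch of my own (not Knuth's code, and the function names are made up): a top-level driver written against a couple of low-level subroutines (Steps 1-2), plus a small throwaway routine for displaying intermediate results in a readable format, in the spirit of Step 4.

```python
# A rough sketch of Knuth's workflow, in Python (illustrative only).
# Step 1: sketch the top-level program and list the subroutines it needs.
# Step 2: fill in the subroutines, starting from the lowest level.
# Step 4: add throwaway helpers that display intermediate state readably.

def debug_show(label, value, enabled=True):
    """Throwaway debugging aid: print an intermediate result in readable form."""
    if enabled:
        print(f"[debug] {label}: {value!r}")

def read_numbers(path):
    """Lowest-level subroutine: read whitespace-separated numbers from a file."""
    with open(path) as f:
        return [float(tok) for tok in f.read().split()]

def summarise(values):
    """Mid-level subroutine: compute a few summary statistics."""
    n = len(values)
    return {"n": n, "min": min(values), "max": max(values), "mean": sum(values) / n}

def main(path):
    """Top-level program: sketched first (Step 1), re-examined last (Step 3)."""
    values = read_numbers(path)
    debug_show("first few values", values[:5])
    summary = summarise(values)
    debug_show("summary", summary)
    return summary

if __name__ == "__main__":
    import sys
    main(sys.argv[1])
```

The debug_show calls are exactly the kind of scaffolding Knuth describes: useful while the program is taking shape, and thrown away once it behaves as expected.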

I don't know whether today's best programmers still do this.  I know some pretty good programmers and have been surprised by how much effort they devote to the kind of activity that Knuth is describing.  Personally, I now rely quite a lot on the debugger in Visual Studio, and (indirectly) on compilers, to give me most of the debugging information I need for not much effort.

Friday, March 2, 2018

Pensions for professors

It is not often that universities make front-page news, but the recent strike by university lecturers seems to have got quite a lot of media coverage.

On the surface it looks like quite a straightforward dispute about money.  University vice-chancellors (represented by a body called Universities UK) are proposing to reduce the pensions that university staff will receive in the future.  The reason they are doing this is that existing contributions to the pension fund for universities (the USS) are not expected to cover the cost of future pensions.

One political commentator I have a lot of respect for, Daniel Finkelstein, has said that lecturers are striking against themselves.  He argues that increased contributions from universities to the USS would have a damaging effect on university lecturers: universities would have to pay lecturers lower salaries, employ fewer of them, or both.

He also argues that it would be unfair for the government to increase funding to universities in order to pay generous pensions at a time when the NHS is strapped for cash, prisons seem to be nearing a state of anarchy and universities are already generously funded by students through expensive tuition fees.  A large chunk of these tuition fees may end up being paid by the government if students are unable to pay back their loans.

While I find this line of reasoning quite persuasive, it seems to be predicated on the assumption that there will be an indefinite squeeze on the nation's finances.  As a country we have had around 7 years of government austerity.  Recent news suggests that this austerity has succeeded in reducing the government deficit from around £100bn a year to roughly zero - https://www.ft.com/content/3f7db634-1cac-11e8-aaca-4574d7dabfb6.

So will the squeeze be indefinite or are we approaching the end of it?  Nobody really knows.  As of 12 months ago, the OBR, which produces official forecasts of the government deficit, was still forecasting a large deficit for 2018-19.  But tax receipts have been a lot stronger than expected.  Speaking from personal experience, these things are difficult to forecast!

My view is that economic growth and tax receipts will be stronger than they have been for much of the last 10 years.  As a result, the USS will probably not run out of money and if it does, the government should inject some extra cash to keep it afloat.  There are many competing spending priorities for the government, but I think that attracting and retaining bright people across the public sector is essential.  While there are many who are drawn to the public sector purely with a desire to contribute to society, generous public sector pensions do play a big role in encouraging people to stay.  I think these pensions should continue so that public services can flourish as they ought to.

Friday, December 8, 2017

tools for writing code, life, the universe and everything - can anything beat emacs?

I have recently finished a six-month placement with NAG (the Numerical Algorithms Group), based in Oxford.  One of the things I picked up there was how to use emacs for writing code and editing other text.

Previously I had always written code in programs designed for specific languages, such as RStudio or Matlab.

Emacs is designed to be a more generic tool that, in principle, can be tailored to any kind of text editing, including coding.  As a popular open-source project, emacs has many contributed packages.  I used it mainly for writing code in Fortran, but it has modes for pretty much every widely used programming language.  I also used it for writing LaTeX and for writing / editing To Do lists using Org mode.

Beyond its usefulness as a text editor, emacs has many other functions.  For example it has a shell, which behaves similarly to a command-line terminal but with the useful property that you can treat printed output as you would any other text.  I find myself quite frequently wanting to copy and paste from terminal output, or to search for things, such as error messages.  This is quick and easy in emacs.

So will I ever use anything other than emacs again... for anything?  I think truly hardcore emacs fans do use it for literally everything - email, web browsing, even games (emacs comes with a set of built-in amusements).  But I am not part of that (increasingly exclusive) club.  I find emacs a pain for things that you do infrequently - a shortcut isn't really a shortcut if you have to use Google to remind you what it is!

I think the two main selling points of emacs are (i) anything that you do repeatedly using a mouse, you will be able to do at least as quickly in emacs, (ii) it does great syntax highlighting of pretty much any kind of text.

Thursday, March 30, 2017

Statistics in medicine

Last week I went to the AZ MRC Science Symposium, organised jointly by AstraZeneca and the MRC Biostatistics Unit.  Among a line-up of great speakers was Stephen Senn, who has an impressively encyclopaedic knowledge of statistics and its history, particularly relating to statistics in medicine.  Unfortunately his talk was only half an hour and in the middle of the afternoon when I was flagging a bit, so I came away thinking 'that would have been really interesting if I had understood it.'  In terms of what I remember, he made some very forceful remarks directed against personalised medicine, i.e., giving different treatments to different people based on their demography or genetics.  This was particularly memorable because several other speakers seemed to have great hopes for the potential of personalised medicine to transform healthcare.

His opposition to personalised medicine was based on the following obstacles, which I presume he thinks are insurmountable.

  1. Large sample sizes are needed to test for effects by sub-population.  This makes it much more expensive to run a clinical trial than in the more traditional case where you only test for effects at the population level (a simulation sketch of this point follows the list).
  2. The analysis becomes more complicated when you include variables that cannot be randomized.  Most demographic or genetic variables fall into this category.  He talked about Nelder's theory of general balance, which can apparently account for this in a principled way.  Despite being developed in the 1970s, it has been ignored by a lot of people due to its complexity.
  3. Personalised treatment is difficult to market.  I guess this point is about making things as simple as possible for clinicians.  It is easier to say use treatment X for disease Y, instead of use treatment X_i for disease variant Y_j in sub-population Z_k.
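
To give a rough feel for point 1, here is a small simulation sketch of my own (the effect sizes and sample sizes are made-up assumptions, not taken from any real trial).  It compares the power to detect an overall treatment effect with the power to detect a treatment-by-subgroup interaction of the same magnitude, at the same total sample size.

```python
# Illustrative simulation: why subgroup (interaction) effects need bigger trials.
# All numbers are assumptions chosen for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_per_arm, n_sims, alpha = 200, 2000, 0.05
overall_effect = 0.3      # assumed average treatment effect (in SD units)
interaction_effect = 0.3  # assumed difference in effect between two subgroups

hits_main = hits_inter = 0
for _ in range(n_sims):
    # Half of each arm belongs to subgroup A (0), half to subgroup B (1).
    subgroup = np.tile([0, 1], n_per_arm // 2)
    control = rng.normal(0.0, 1.0, n_per_arm)
    treated = rng.normal(overall_effect + interaction_effect * (subgroup - 0.5), 1.0)

    # Test 1: overall treatment effect (two-sample t-test).
    if stats.ttest_ind(treated, control).pvalue < alpha:
        hits_main += 1

    # Test 2: treatment-by-subgroup interaction (difference of subgroup differences).
    n_cell = n_per_arm // 2
    diff_a = treated[subgroup == 0].mean() - control[subgroup == 0].mean()
    diff_b = treated[subgroup == 1].mean() - control[subgroup == 1].mean()
    z = (diff_b - diff_a) / np.sqrt(4.0 / n_cell)  # four cells, each with variance 1/n_cell
    if abs(z) > stats.norm.ppf(1 - alpha / 2):
        hits_inter += 1

print("Estimated power, overall effect:      ", hits_main / n_sims)
print("Estimated power, subgroup interaction:", hits_inter / n_sims)
```

With these made-up numbers the overall effect is detected most of the time, while an interaction of the same size is detected far less often; the usual rule of thumb is that an interaction contrast needs roughly four times the sample size of an equally sized main-effect contrast.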
Proponents of personalised medicine would argue that all these problems can be solved through the effective use of computers.  For example,
  1. Collecting data from GPs and hospitals may make it possible to analyse large samples of patients without needing to recruit any additional subjects for clinical trials.
  2. There is already a lot of software that automates part or all of complicated statistical analysis.  There is scope for further automation, enabling the more widespread use of complex statistical methodology.
  3. It should be possible for clinicians to have information on personalised effects at their fingertips.  It may even be possible to automate medical prescriptions.
It's difficult to know how big these challenges are.  Some of the speakers at the AZ MRC symposium said things along the lines of 'ask me again in 2030 whether what I'm doing now is a good idea.'   This doesn't exactly inspire confidence, but at least it is an open and honest assessment.

As well as commenting on the future, Stephen Senn has also written a lot about the past.  I particularly like his description of the origins of Statistics in chapter 2 of his book 'Statistical Issues in Drug Development',

Statistics is the science of collecting, analysing and interpreting data.  Statistical theory has its origin in three branches of human activity: first the study of mathematics as applied to games of chance; second, the collection of data as part of the art of governing a country, managing a business or, indeed, carrying out any other human enterprise; and third, the study of errors in measurement, particularly in astronomy.  At first, the connection between these very different fields was not evident but gradually it came to be appreciated that data, like dice, are also governed to a certain extent by chance (consider, for example, mortality statistics), that decisions have to be made in the face of uncertainty in the realms of politics and business no less than at the gaming tables, and that errors in measurement have a random component.  The infant statistics learned to speak from its three parents (no wonder it is such an interesting child) so that, for example, the word statistics itself is connected to the word state (as in country), whereas the words trial and odds come from gambling, and error (I mean the word!) has been adopted from astronomy.

Monday, March 6, 2017

Pushing the boundaries of what we know

I have recently been dipping into a book called 'What we cannot know' by Marcus du Sautoy.  Each chapter looks at a different area of physics.  The fall of a dice is used as a running example to explain things like probability, Newton's Laws, and chaos theory.  There are also chapters on quantum theory and cosmology.  It's quite a wide-ranging book, and I found myself wondering how the author had found time to research all these complex topics, which are quite different from each other.  That is related to one of the messages of the book - that one person cannot know everything that humans have discovered.  It seems like Marcus du Sautoy has had a go at learning everything, and found that even he has limits!

I think the main message of the book is that many (possibly all) scientific fields have some kind of knowledge barrier beyond which it is impossible to pass.  There are fundamental assumptions which, taken to be true, explain empirical phenomena.  The ideal in science (at least for physicists) is to be able to explain a wide range of (or perhaps even all) empirical phenomena from a small set of underlying assumptions.  But science cannot explain why its most fundamental assumptions are true.  They just are.

This raises an obvious question: where is the knowledge barrier?  And how close are we to reaching it?  Unfortunately this is another example of something we probably cannot know.

In my own field of Bayesian computation, I think there are limits to knowledge of a different kind.  In Bayesian computation it is very easy to write down what we want to compute - the posterior distribution.  It is not even that difficult to suggest ways of computing the posterior with arbitrary accuracy.  The problem is that, for a wide range of interesting statistical models, all the methods that have so far been proposed for accurately computing the posterior are computationally intractable.
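
To spell out what 'easy to write down' means here: in the standard notation, with prior π(θ) and likelihood p(y | θ), the posterior is

```latex
\pi(\theta \mid y) \;=\; \frac{p(y \mid \theta)\,\pi(\theta)}{\int p(y \mid \theta')\,\pi(\theta')\,d\theta'}
```

The numerator is usually cheap to evaluate pointwise; it is the normalising integral in the denominator, and expectations taken under the posterior, that become intractable for many interesting models.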

Here are some questions that could (at least in principle) be answered using Bayesian analysis.   What will earth's climate be like in 100 years' time?  Or, given someone's current pattern of brain activity (e.g. EEG or fMRI signal), how likely are they to develop dementia in 10-20 years' time?

These are both questions for which it is unreasonable to expect a precise answer.  There is considerable uncertainty.  I would go further and argue that we do not even know how uncertain we are.  In the case of climate we have a fairly good idea of what the underlying physics is.  The problem is in numerically solving physical models at a resolution that is high enough to be a good approximation to the continuous field equations.  In the case of neuroscience, I am not sure we even know enough about the physics.  For example, what is the wiring diagram (or connectome) for the human brain?  We know the wiring diagram for the nematode worm brain - a relatively recent discovery that required a lot of work.  The human brain is a lot harder!  And even if we do get to the point of understanding the physics well enough, we will come up against the same problem with numerical computation that we have for the climate models.

There is a different route that can be followed to answering these questions, which is to simplify the model so that computation is tractable.  Some people think that global temperature trends are fitted quite well by a straight line (see Nate Silver's book 'The Signal and the Noise').  When it comes to brain disease, if you record brain activity in a large sample of people and then wait 10-20 years to see whether they get the disease, it may be possible to construct a simple statistical model that predicts people's likelihood of getting the disease given their pattern of brain activity.  I went to a talk by Will Penny last week, and he has made some progress in this area using an approach called Dynamic Causal Modelling.
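
As a toy illustration of this 'simplify and fit' route (my own made-up example on synthetic data, not Dynamic Causal Modelling): a logistic regression that predicts a binary disease outcome from a single summary feature of brain activity.

```python
# Toy sketch: predict a disease outcome from one brain-activity summary feature.
# The data are synthetic and the 'true' risk curve is an assumption for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
brain_feature = rng.normal(size=n)                          # e.g. an EEG/fMRI summary statistic
risk = 1.0 / (1.0 + np.exp(-(-2.0 + 1.5 * brain_feature)))  # assumed true risk curve
disease = rng.binomial(1, risk)                             # outcomes observed 10-20 years later

model = LogisticRegression().fit(brain_feature.reshape(-1, 1), disease)
new_person = np.array([[1.2]])
print("Predicted probability of disease:", model.predict_proba(new_person)[0, 1])
```

The point of the simplification is that fitting this model is computationally trivial; the cost is that it ignores almost everything we know about the underlying physiology.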

I see this as a valuable approach, but somewhat limited.  For its success it relies on ignoring things that we know.  Surely by including more of what we know it should be possible to make better predictions?  I am sometimes surprised by how often the answer to this question is 'not really' or 'not by much'.

What is computable with Bayesian analysis is still an open question.  This is both frustrating and motivating.  Frustrating because a lot of things that people try don't work, and we have no guarantee that there are solutions to the problems we are working on.  Motivating because science as a whole has a good track record of making the seemingly unknowable known.

Tuesday, December 6, 2016

Writing tools, silence & parenthood

Collaborative writing tools

I have been working on a paper recently with two co-authors.  It has been a bit of a challenge finding the right pieces of software that will allow us to track edits while remaining in LaTeX.  When I worked in the civil service, Word was the de facto software for producing written documents.  It was a lot better than I thought it would be, and I still think the Track Changes functionality beats everything else I have tried hands down when it comes to collaborative editing.  I also learnt that, using Word, you can produce documents with typesetting that looks professional, if you know what you are doing, and if someone has invested the time in creating a good template for your needs.  However in the last couple of years I have returned to LaTeX, because it is what mathematicians use, and because I find it better for equations, and for references.

In the last few weeks I have been trying out Overleaf.  This is one of a handful of platforms for LaTeX documents with collaboration tools.  As with a lot of good user-friendly pieces of software you have to pay to get the most useful features.  With Overleaf, the free service provides a workable solution.  Overleaf allows you to access your LaTeX documents through a web browser, and multiple people can edit the same online version.  In the free version there are some basic bells and whistles, like being able to archive your work.  I found this a bit confusing at first because I thought it was setting up multiple active versions with some kind of forking process.  However this is not the case.

By combining Overleaf with git I have been able to fork the development process: I can edit one branch on my local computer (using my preferred LaTeX editor and compiler), while another person edits a different branch in the online version, or potentially on another computer.  Using git also makes it easy to create a change log, and visualise differences between different versions, although this doesn't work quite as well for paragraphs of text as it does for code.   Unless you put lots of line breaks into your paragraphs, you can only see which paragraphs have changed, and not which individual sentences have changed.

In the news...

2016 is drawing to a close and it has been a pretty shocking year for a lot of people in terms of national and global news.  In the last few weeks, I have found an increasing tendency for people to be silent - to not want to talk about certain issues any more (you know what I mean - the T word and the B word).  I guess this is partly because some topics have been talked to death, and nothing new is emerging, while a lot of uncertainty remains.  However I also find it a bit worrying that people may no longer be capable of meaningful engagement with people of different opinions and backgrounds.  One thing I have become more convinced of over the last year is that blogs and tweets etc. are not a particularly helpful way of sharing political views (a form of silent outrage!?)  So maybe the less I say here the better, even though I do remain passionately interested in current affairs and am fairly opinionated.

And in other news...

I have a baby boy!  Born 4 weeks ago - both he and my wife are doing well.  In the first 2 weeks I took a break from my PhD, and it was a bit like being on holiday, in that we had a lot of time, and a lot of meals cooked for us (by my wonderful mum).  It hasn't all been plain sailing, but I am now under oath not to share the dark side of parenthood - especially not with non-parents, in case it puts them off!  For the last 2 weeks I have been getting back into my PhD.  It is quite hard finding a schedule that works.  We have a routine where he is supposed to be more active and awake between 5pm and 7pm, so that he sleeps well between 7pm and 7am.  I have been trying to do a bit of work after he is settled in the evening and have found it fairly challenging to be motivated and focused at that time.  I have been wondering whether it would work better to try and get up before him in the mornings.  I guess it will probably be challenging either way.

Tuesday, September 6, 2016

Learning about learning

I recently attended the INCF (International Neuroinformatics Coordinating Facility) short courses and congress in Reading.  It was quite wide-ranging with some people working primarily on MRI imaging, others on modelling of synaptic plasticity and learning algorithms, and quite a few other topics.

One area I was not really aware of before the conference was neuromorphic computing, which is about designing and building computing hardware based on principles of how the brain does computation.  At the INCF short courses, this was presented by Giacomo Indiveri, and I subsequently looked at an introductory article by Steve Furber, who has led the SpiNNaker project:

http://digital-library.theiet.org/content/journals/10.1049/iet-cdt.2015.0171

I am quite impressed by the dedication of people working in this field.  Steve Furber says in his article that SpiNNaker has been 15 years in conception and 10 years in construction.  This is enabling fast simulation of large-scale neural models, such as Spaun.  On a standard computer, Spaun requires 2.5 hours of computation per second of real time.  The system can perform simple cognitive tasks such as reinforcement learning and arithmetic.  SpiNNaker aims to run Spaun in real time.
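
To put the Spaun figure in perspective, a quick back-of-envelope calculation using the numbers quoted above:

```latex
2.5\ \text{hours} = 9000\ \text{seconds}
\quad\Rightarrow\quad
\text{slowdown factor} \approx \frac{9000\ \text{s of computation}}{1\ \text{s of simulated time}} = 9000\times
```

so running Spaun in real time means closing a gap of roughly four orders of magnitude.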

In the next few years, as part of the Human Brain Project, SpiNNaker will be used for larger models, and presumably be tested on progressively more demanding cognitive tasks.  From my perspective, I am interested to see how large-scale neural models of biological intelligence will compare to engineered intelligence systems such as deep neural networks.

Engineered intelligence is free from the constraint of having to be faithful to biology.  This gives it a massive advantage over simulated neural models when it comes to performing tasks.  Ideas from biology have been influential in machine learning and artificial intelligence, but they have been heavily supplemented by numerical analysis and statistical computing.

At the moment many machine learning algorithms require huge amounts of computing power.  So it will be interesting to see whether any new hardware emerges that can bring this down.  It would be cool if state-of-the-art machine learning algorithms that today require the use of a supercomputer could be run on an affordable battery-operated device.  And it will be interesting to see if the new neuromorphic machines that are emerging will drive engineers and scientists to further develop learning algorithms.