Friday, December 8, 2017

tools for writing code, life, the universe and everything - can anything beat emacs?

I have recently finished a 6 month placement with NAG (the Numerical Algorithms Group) based in Oxford.  One of the things I picked up there was how to use emacs for writing code and editing other text.

Previously I have always written code in programs that are designed for specific languages, such as RStudio or Matlab.

Emacs is designed to be a more generic tool that, in principle, can be tailored to any kind of text editing, including coding.  As a popular open source project, emacs has many contributed packages.  I used it mainly for writing code in Fortran, but it has modes for pretty much every widely used programming language.  I also used it for writing LaTeX and for writing and editing To Do lists using Org mode.

Beyond its usefulness as a text editor, emacs has many other functions.  For example it has a shell, which behaves similarly to a command-line terminal but with the useful property that you can treat printed output as you would any other text.  I find myself quite frequently wanting to copy and paste from terminal output, or to search for things such as error messages.  This is quick and easy in emacs.

So will I ever use anything other than emacs again... for anything?  I think truly hardcore emacs fans do use it for literally everything - email, web browsing, even games (emacs comes with built-in amusements).  But I am not part of that (increasingly exclusive) club.  I find emacs a pain for things that you do infrequently - a shortcut isn't really a shortcut if you have to use Google to remind you what it is!

I think the two main selling points of emacs are (i) anything that you do repeatedly using a mouse, you will be able to do at least as quickly in emacs, (ii) it does great syntax highlighting of pretty much any kind of text.

Thursday, March 30, 2017

Statistics in medicine

Last week I went to the AZ MRC Science Symposium organised jointly by AstraZeneca and the MRC Biostatistics Unit.  Among a line-up of great speakers was Stephen Senn, who has an impressively encyclopaedic knowledge of statistics and its history, particularly relating to statistics in medicine.  Unfortunately his talk was only half an hour and in the middle of the afternoon when I was flagging a bit, so I came away thinking 'that would have been really interesting if I had understood it.'  In terms of what I remember, he made some very forceful remarks directed against personalised medicine, i.e., giving different treatments to different people based on their demography or genetics.  This was particularly memorable because several other speakers seemed to have great hopes for the potential of personalised medicine to transform healthcare.

His opposition to personalised medicine was based on the following obstacles, which I presume he thinks are insurmountable.

  1. Large sample sizes are needed to test for effects by sub-population.  This makes it much more expensive to run a clinical trial than the more traditional case where you only test for effects at the population level.
  2. The analysis becomes more complicated when you include variables that cannot be randomized.  Most demographic or genetic variables fall into this category.  He talked about Nelder's theory of general balance, which can apparently account for this in a principled way.  Despite being developed in the 1970s, it has been ignored by a lot of people due to its complexity.
  3. Personalised treatment is difficult to market.  I guess this point is about making things as simple as possible for clinicians.  It is easier to say 'use treatment X for disease Y' than 'use treatment X_i for disease variant Y_j in sub-population Z_k'.
Proponents of personalised medicine would argue that all these problems can be solved through the effective use of computers.  For example,
  1. Collecting data from GPs and hospitals may make it possible to analyse large samples of patients without needing to recruit any additional subjects for clinical trials.
  2. There is already a lot of software that automates part or all of complicated statistical analysis.  There is scope for further automation, enabling the more widespread use of complex statistical methodology.
  3. It should be possible for clinicians to have information on personalised effects at their fingertips.  It may even be possible to automate medical prescriptions.
It's difficult to know how big these challenges are.  Some of the speakers at the AZ MRC symposium said things along the lines of 'ask me again in 2030 whether what I'm doing now is a good idea.'   This doesn't exactly inspire confidence, but at least is an open and honest assessment.

As well as commenting on the future, Stephen Senn has also written a lot about the past.  I particularly like his description of the origins of Statistics in chapter 2 of his book 'Statistical Issues in Drug Development',

Statistics is the science of collecting, analysing and interpreting data.  Statistical theory has its origin in three branches of human activity: first the study of mathematics as applied to games of chance; second, the collection of data as part of the art of governing a country, managing a business or, indeed, carrying out any other human enterprise; and third, the study of errors in measurement, particularly in astronomy.  At first, the connection between these very different fields was not evident but gradually it came to be appreciated that data, like dice, are also governed to a certain extent by chance (consider, for example, mortality statistics), that decisions have to be made in the face of uncertainty in the realms of politics and business no less than at the gaming tables, and that errors in measurement have a random component.  The infant statistics learned to speak from its three parents (no wonder it is such an interesting child) so that, for example, the word statistics itself is connected to the word state (as in country), whereas the words trial and odds come from gambling, and error (I mean the word!) has been adopted from astronomy.

Monday, March 6, 2017

Pushing the boundaries of what we know

I have recently been dipping into a book called 'What we cannot know' by Marcus du Sautoy.  Each chapter looks at a different area of physics.  The fall of a dice is used as a running example to explain things like probability, Newton's Laws, and chaos theory.  There are also chapters on quantum theory and cosmology.  It's quite a wide-ranging book, and I found myself wondering how the author had found time to research all these complex topics, which are quite different from each other.  That is related to one of the messages of the book - that one person cannot know everything that humans have discovered.  It seems like Marcus du Sautoy has had a go at learning everything, and found that even he has limits!

I think the main message of the book is that many (possibly all) scientific fields have some kind of knowledge barrier beyond which it is impossible to pass.  There are fundamental assumptions which, taken as true, explain empirical phenomena.  The ideal in science (at least for physicists) is to be able to explain a wide range of (or perhaps even all) empirical phenomena from a small set of underlying assumptions.  But science cannot explain why its most fundamental assumptions are true.  They just are.

This raises an obvious question: where is the knowledge barrier?  And how close are we to reaching it?  Unfortunately this is another example of something we probably cannot know.

In my own field of Bayesian computation, I think there are limits to knowledge of a different kind.  In Bayesian computation it is very easy to write down what we want to compute - the posterior distribution.  It is not even that difficult to suggest ways of computing the posterior with arbitrary accuracy.  The problem is that, for a wide range of interesting statistical models, all the methods that have so far been proposed for accurately computing the posterior are computationally intractable.
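To make the 'easy to write down' point concrete, here is a minimal sketch of my own (not taken from any particular paper) of the one happy case where the posterior is available in closed form: a conjugate Beta-Binomial model.  For almost any more realistic model no such closed form exists, which is exactly where the computational difficulties begin.

```python
# Toy illustration (my own, not a real analysis): conjugate Beta-Binomial.
# Prior: theta ~ Beta(a, b); data: k successes in n Bernoulli(theta) trials.
# The posterior is Beta(a + k, b + n - k) -- available exactly, no MCMC needed.

def beta_binomial_posterior(a, b, k, n):
    """Return the (alpha, beta) parameters of the posterior Beta distribution."""
    return a + k, b + (n - k)

def posterior_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

if __name__ == "__main__":
    # Uniform prior Beta(1, 1); observe 7 successes in 10 trials.
    alpha, beta = beta_binomial_posterior(1, 1, 7, 10)
    print(alpha, beta)  # prints: 8 4
```

The moment the model departs from conjugacy - a non-linear link, a hierarchical structure, an intractable likelihood - the posterior can still be written down as prior times likelihood, but evaluating it accurately becomes the hard part.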

Here are some questions that could (at least in principle) be answered using Bayesian analysis.  What will earth's climate be like in 100 years' time?  Or, given someone's current pattern of brain activity (e.g. EEG or fMRI signal), how likely are they to develop dementia in 10-20 years' time?

These are both questions for which it is unreasonable to expect a precise answer.  There is considerable uncertainty.  I would go further and argue that we do not even know how uncertain we are.  In the case of climate we have a fairly good idea of what the underlying physics is.  The problem is in numerically solving physical models at a resolution that is high enough to be a good approximation to the continuous field equations.  In the case of neuroscience, I am not sure we even know enough about the physics.  For example, what is the wiring diagram (or connectome) for the human brain?  We know the wiring diagram for the nematode worm brain - a relatively recent discovery that required a lot of work.  The human brain is a lot harder!  And even if we do get to the point of understanding the physics well enough, we will come up against the same problem with numerical computation that we have for the climate models.

There is a different route that can be followed to answering these questions, which is to simplify the model so that computation is tractable.  Some people think that global temperature trends are fitted quite well by a straight line (see Nate Silver's book 'The Signal and the Noise').  When it comes to brain disease, if you record brain activity in a large sample of people and then wait 10-20 years to see whether they get the disease, it may be possible to construct a simple statistical model that predicts people's likelihood of getting the disease given their pattern of brain activity.  I went to a talk by Will Penny last week, and he has made some progress in this area using an approach called Dynamic Causal Modelling.

I see this as a valuable approach, but somewhat limited.  For its success it relies on ignoring things that we know.  Surely by including more of what we know it should be possible to make better predictions?  I am sometimes surprised by how often the answer to this question is 'not really' or 'not by much'.

The question of what is computable with Bayesian analysis is still an open question.  This is both frustrating and motivating.  Frustrating because a lot of things that people try don't work, and we have no guarantee that there are solutions to the problems we are working on.  Motivating because science as a whole has a good track record of making the seemingly unknowable known.

Tuesday, December 6, 2016

Writing tools, silence & parenthood

Collaborative writing tools

I have been working on a paper recently with two co-authors.  It has been a bit of a challenge finding the right pieces of software that will allow us to track edits while remaining in LaTeX.  When I worked in the civil service, Word was the de facto software for producing written documents.  It was a lot better than I thought it would be, and I still think the Track Changes functionality beats everything else I have tried hands down when it comes to collaborative editing.  I also learnt that, using Word, you can produce documents with typesetting that looks professional, if you know what you are doing, and if someone has invested the time in creating a good template for your needs.  However in the last couple of years I have returned to LaTeX, because it is what mathematicians use, and because I find it better for equations, and for references.

In the last few weeks I have been trying out Overleaf.  This is one of a handful of platforms for LaTeX documents with collaboration tools.  As with a lot of good user-friendly pieces of software, you have to pay to get the most useful features, but with Overleaf the free service provides a workable solution.  Overleaf allows you to access your LaTeX documents through a web browser, and multiple people can edit the same online version.  In the free version there are some basic bells and whistles, like being able to archive your work.  I found this a bit confusing at first because I thought it was setting up multiple active versions with some kind of forking process.  However this is not the case.

By combining Overleaf with git I have been able to fork the development process: I can edit one branch on my local computer (using my preferred LaTeX editor and compiler), while another person edits a different branch in the online version, or potentially on another computer.  Using git also makes it easy to create a change log, and visualise differences between versions, although this doesn't work quite as well for paragraphs of text as it does for code.  Unless you put lots of line breaks into your paragraphs, you can only see which paragraphs have changed, and not which individual sentences (though git's --word-diff option can help with this).

In the news...

2016 is drawing to a close, and it has been a pretty shocking year for a lot of people in terms of national and global news.  In the last few weeks, I have noticed an increasing tendency for people to be silent - to not want to talk about certain issues any more (you know what I mean - the T word and the B word).  I guess this is partly because some topics have been talked to death, and nothing new is emerging, while a lot of uncertainty remains.  However I also find it a bit worrying that people may no longer be capable of meaningful engagement with people of different opinions and backgrounds.  One thing I have become more convinced of over the last year is that blogs and tweets etc. are not a particularly helpful way of sharing political views (a form of silent outrage!?)  So maybe the less I say here the better, even though I do remain passionately interested in current affairs and am fairly opinionated.

And in other news...

I have a baby boy!  Born 4 weeks ago - both he and my wife are doing well.  In the first 2 weeks I took a break from my PhD, and it was a bit like being on holiday, in that we had a lot of time, and a lot of meals cooked for us (by my wonderful mum).  It hasn't all been plain sailing, but I am now under oath not to share the dark side of parenthood - especially not with non-parents, in case it puts them off!  For the last 2 weeks I have been getting back into my PhD.  It is quite hard finding a schedule that works.  We have a routine where he is supposed to be more active and awake between 5pm and 7pm, so that he sleeps well between 7pm and 7am.  I have been trying to do a bit of work after he is settled in the evening, and have found it fairly challenging to stay motivated and focused at that time.  I have been wondering whether it would work better to try to get up before him in the mornings.  I guess it will probably be challenging either way.

Tuesday, September 6, 2016

Learning about learning

I recently attended the INCF (International Neuroinformatics Coordinating Facility) short courses and congress in Reading.  It was quite wide-ranging with some people working primarily on MRI imaging, others on modelling of synaptic plasticity and learning algorithms, and quite a few other topics.

One area I was not really aware of before the conference was neuromorphic computing, which is about designing and building computing hardware based on principles of how the brain does computation.  At the INCF short courses, this was presented by Giacomo Indiveri, and I subsequently looked at an introductory article by Steve Furber, who has led the SpiNNaker project.

I am quite impressed by the dedication of people working in this field.  Steve Furber says in his article that SpiNNaker has been 15 years in conception and 10 years in construction.  It is designed to enable fast simulation of large-scale neural models, such as Spaun, a system that can perform simple cognitive tasks such as reinforcement learning and arithmetic.  On a standard computer, Spaun requires 2.5 hours of computation per second of real time; SpiNNaker aims to run it in real time.

In the next few years, as part of the Human Brain Project, SpiNNaker will be used for larger models, and presumably be tested on progressively more demanding cognitive tasks.  From my perspective, I am interested to see how large-scale neural models of biological intelligence will compare to engineered intelligence systems such as deep neural networks.

Engineered intelligence is free from the constraint of having to be faithful to biology.  This gives it a massive advantage over simulated neural models when it comes to performing tasks.  Ideas from biology have been influential in machine learning and artificial intelligence, but they have been heavily supplemented by numerical analysis and statistical computing.

At the moment many machine learning algorithms require huge amounts of computing power, so it will be interesting to see whether any new hardware emerges that can bring this down.  It would be cool if state-of-the-art machine learning algorithms that today require the use of a supercomputer could be run on an affordable battery-operated device.  And it will be interesting to see whether the new neuromorphic machines that are emerging will drive engineers and scientists to further develop learning algorithms.

Monday, August 1, 2016

Summer reading

I have recently been reading 'Grit: The Power of Passion and Perseverance' by Angela Duckworth, which I have found both fascinating and persuasive.  Duckworth is a psychologist, interested in the differences between people who are talented but relatively low achievers and people who are high achievers.  One of the main messages of the book is that talent counts, but effort counts twice.

Determination, persistence, constancy, tenacity, and focus, especially in the face of setbacks and challenges appear to have a much larger effect on what people achieve than natural talent or innate giftedness.

I wish I could say these were all things I possessed in abundance, but I do not think that is the case.  Nevertheless there is cause for hope as grit appears to increase with age.  And perhaps being more aware of the importance of these qualities helps to cultivate them more.

In parallel I have been reading The Pickwick Papers by Charles Dickens, which tells the story of a group of kind-hearted friends who travel around rural 19th-century England, making new friends and getting into various kinds of trouble.  It is quite good fun but, in my opinion, not as well written as some of his later work, such as Great Expectations.  Perhaps a case in point: passion and perseverance on a single goal over a long period of time can lead to great things.

Monday, June 20, 2016

ISBA 2016

I got back from ISBA 2016 at the weekend, having spent a week at the picturesque Forte Village resort in Sardinia.  Last weekend also happened to be when UK astronaut Tim Peake returned from spending 6 months on the International Space Station.  Although I am sure returning from space requires more adjustment than returning from an international conference, I do feel a bit like I have returned from another planet!

I cannot pretend to give an expert's view of the conference since there were many people there with decades more experience than me.  The age distribution of the conference was heavily weighted towards young researchers (perhaps partly as a result of the generous travel support targeted towards this group).  Nevertheless the age distribution was very wide with a fair number of people there in their seventies.  One of these was Adrian Smith, who came to the first of the Valencia meetings, and gave an interesting perspective on how Bayesians have gone from being outsiders and regarded with a high degree of scepticism to being a dominant force in the world of statistical analysis.  A simple illustration of this is the numbers at the conference which have grown from around 70 to around 700 over the course of around 40 years.

One feature of the conference that has remained the same (and perhaps a key ingredient in its continuing success!?) is the cabaret, which features Bayes-inspired entertainment.  The proceedings of the first Valencia meeting printed the song "There's no Theorem like Bayes Theorem" by the distinguished statistician G.E.P. Box, written to the tune of "There's no Business like Show Business".

I would strongly advise against searching for a YouTube rendition of Box's song.  I do not know whether Box was as good a musician as he was a lyricist (and statistician), but his followers certainly seem to have a rather deficient sense of pitch and harmony.

Here are a few reflections on the academic program of ISBA 2016.

A lot of the talks fell into one of two broad categories.  On the one hand, some talks focused on general inference problems, and the development of methodology that should be applicable to a wide range of problems in various application areas.  On the other hand, some talks focused more on a specific application area, and looked at the challenge of adapting quite general statistical ideas to specific research questions.

The presenters I found most stimulating were Adrian Raftery on demography and Sylvia Richardson on single-cell gene expression.  These were both from the second category of talks (i.e. more oriented to a specific application), but both researchers have also done important work in the first category (i.e. development of generally applicable statistical methodology).  For me, their work demonstrates the value of working in both areas.  They both have an impressive ability to identify problems that benefit from Bayesian analysis.  In Adrian Raftery's demography work, the novel application was the quantification of uncertainty in country-specific fertility rates by pooling information across countries through a hierarchical model.  Sylvia Richardson's work on gene expression also used a hierarchical model, but in this case to quantify uncertainty in cell-specific gene expression levels, again by pooling information across cells.  The main reason the Bayesian approach is so effective in these problems is the small amount of data that is available per country (in demography) or per cell (in gene expression).
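The pooling idea behind both pieces of work can be sketched in a few lines.  The following is a toy empirical-Bayes illustration of my own (not Raftery's or Richardson's actual models, and with made-up variance parameters): each group mean is shrunk toward the grand mean, with small groups shrunk hardest, which is precisely why hierarchical models help when there is little data per country or per cell.

```python
# Toy sketch of partial pooling (my own illustration, not the speakers' models).
# Each group's mean is shrunk toward the grand mean; groups with little data
# are shrunk more heavily, so they "borrow strength" from the other groups.

def partial_pool(group_means, group_sizes, within_var, between_var):
    """Shrink each group mean toward the precision-weighted grand mean.

    Shrinkage weight per group: w = between_var / (between_var + within_var / n),
    so a group with small n gets a small w and is pulled strongly toward
    the grand mean, while a large group keeps most of its own estimate.
    """
    total = sum(group_sizes)
    grand = sum(m * n for m, n in zip(group_means, group_sizes)) / total
    pooled = []
    for m, n in zip(group_means, group_sizes):
        w = between_var / (between_var + within_var / n)
        pooled.append(w * m + (1 - w) * grand)
    return pooled
```

For example, a group observed only once is pulled roughly halfway to the grand mean (when the variances are equal), while a group with 100 observations barely moves - the same qualitative behaviour as the hierarchical models above, just without the full posterior machinery.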

Although I found some of the presentations on general methodology quite stimulating (such as Peter Green's keynote lecture on structural learning), there were quite a few presentations which I felt were not well motivated, at least not in a way that I could understand.  One area with quite a few presentations this year was Bayesian variable selection for linear regression models.  In that setting you assume that the data are i.i.d., and the variable selection can be thought of as a kind of model uncertainty, often encoded through the choice of prior.  The reason I am somewhat sceptical about this kind of research is that (i) the linear regression model may not be sufficiently complex to model the true relationships that exist between the variables in the dataset, and (ii) if the linear regression model is appropriate, then the most important predictors for a given response can usually be picked up through a non-Bayesian analysis such as a forward-selection or backward-elimination algorithm.  This is based on my experience of fitting and using regression models as a practitioner in operational research.
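For concreteness, here is a rough sketch of the kind of greedy forward selection I have in mind.  It is a simplified, forward-stagewise-style illustration of my own (a real analysis would refit the full multiple-regression model at each step, e.g. with R's step() or statsmodels): at each step, regress the current residuals on each unused predictor and add the one that most reduces the residual sum of squares.

```python
# Hedged sketch of greedy forward selection (forward-stagewise flavour).
# Pure Python, for illustration only.

def simple_fit(x, r):
    """Least-squares slope and intercept of response r on one predictor x."""
    n = len(x)
    mx = sum(x) / n
    mr = sum(r) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxr = sum((xi - mx) * (ri - mr) for xi, ri in zip(x, r))
    slope = sxr / sxx
    return slope, mr - slope * mx

def forward_select(X, y, n_steps):
    """X: list of predictor columns. Returns predictor indices chosen in order."""
    residual = list(y)
    chosen = []
    for _ in range(n_steps):
        best = None
        # Try each unused predictor against the current residuals.
        for j, col in enumerate(X):
            if j in chosen:
                continue
            slope, intercept = simple_fit(col, residual)
            rss = sum((ri - (slope * xi + intercept)) ** 2
                      for xi, ri in zip(col, residual))
            if best is None or rss < best[0]:
                best = (rss, j, slope, intercept)
        _, j, slope, intercept = best
        chosen.append(j)
        # Remove the chosen predictor's contribution before the next step.
        residual = [ri - (slope * xi + intercept)
                    for xi, ri in zip(X[j], residual)]
    return chosen
```

The point is not that this is a good procedure, but that it is simple, cheap, and in my experience often finds the same handful of strong predictors that a much more elaborate Bayesian variable-selection analysis would.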

To wrap up, I am deeply grateful to all the people who make the Bayesian analysis community what it is: through their research findings, through the hard administrative labour that must go into organising large scientific meetings, and through personal warmth and encouragement.  I hope that it continues to be a vibrant community with vigorous exchanges between scientific applications and mathematical theory.