Preprint Lunacy - Post 4

29/10/2023 - Jim Barrett

This post relates to my preprint lunacy project. You can find the introduction post here. This post covers my progress up to commit #534f2a0.

For the next step in building this thing, I want to shore up the repo a bit with some better practices, since I've been playing pretty fast and loose with coding style so far. I'm starting to get pretty excited about the prospect of getting recommended papers. There are a few steps between now and then. The rough plan is;

  • Add functionality for favouriting papers
  • Build a simple model or heuristic to find papers close to a set of papers
  • Decide on some heuristic for papers being "close enough" to warrant a recommendation
  • A page to present the recommendations

But, before all that fun stuff, time to eat my vegetables. My usual setup for quality control on a repo is formatting, linting and tests. I set up a CI/CD pipeline to run these for every pull request and to block merging unless they pass. I also forbade pushing directly to the main branch.

Backend stylechecks

In Python, I have typically used isort, black and pylint. Since I'm using this project to learn some new stuff, I also want to try another code quality tool, namely mypy.

Fortunately, I already have quite a bit of code and config I can reuse for this, so I directly stole the .isort.cfg and .pylintrc files from other projects. Running isort and black just rearranges what's already there, so they were easy. Pylint threw up a bunch of complaints though, so I went through and fixed those one by one. Mypy was also reasonably straightforward to use, so I configured that, and added the various stub files it was asking for. I then set up a Google Cloud Build trigger to run all of these checks on PRs, and merged in my lovely clean codebase to check everything was working.
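
To give a concrete flavour of the mypy work: the get_formatted_arxiv_api_url function from the previous post uses implicit Optional defaults (arguments typed str = None), which mypy flags out of the box. The fix is just to spell the types out; something like this (a sketch of the kind of change, not necessarily the exact diff);

from typing import List, Optional

def get_formatted_arxiv_api_url(
        search_query: Optional[str] = None,
        id_list: Optional[List[str]] = None,
        start: Optional[int] = None,
        max_results: Optional[int] = None
) -> str:
    ...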

Frontend stylechecks

Stylechecking the frontend ended up being a bit more of a rabbit-hole. I started by trying to configure ESLint for my existing code. I got a bunch of suggestions from the linter, but when I googled to figure out exactly what they were talking about, it appeared that many of them are deprecated in favour of using TypeScript + React, rather than straight JavaScript. This was the push I needed to finally start learning and using TypeScript, since most of the professional JS practitioners I know use it. I thus spent a couple of hours figuring out how best to translate my frontend to TypeScript.

Since I don't have too much code, the translation wasn't too painful. The main issue I ran into was figuring out the correct way of typing props as arguments to React components. For example, starting from my original function signature for the SearchField component;

function SearchField({setPapersInView}) {
    ...
}

I ended up having to write something like;

type Paper = {
  title: string,
  publicationDate: string,
  authorList: string,
  abstract: string
}

function SearchField({setPapersInView}: {setPapersInView : (papers: Paper[]) => void}) {
    ...
}

whilst also changing my usage of the useState declaration in my entrypoint component;

// before
const [papersInView, setPapersInView] = useState([]);

// after
const [papersInView, setPapersInView] = useState<Paper[]>([]);

Finding the right combination of changes, and figuring out which types of brackets to use and where, was the biggest challenge, but once it was all sorted I'm pretty pleased with how the code looks. Of course, the code still isn't really stylechecked. To do that, I used ESLint. It was a massive pain to install, but once I managed it, it gave me a bunch of suggestions to improve things. I fixed those, and added the linting steps to my build pipeline for the repo.

Conclusion

I always get a good feeling doing work like this, kind of like how I imagine craftsmen get from cleaning and organising all of their tools in their workshop. But, now that that's done, on to the really fun stuff.



Preprint Lunacy - Post 3

19/07/2023 - Jim Barrett

This post relates to my preprint lunacy project. You can find the introduction post here. This post covers my progress up to commit #d3917b6.

The next item on the agenda is a frontend. I might be being a bit too ambitious, but I've been intending to learn ReactJS for a long time, and now feels like the perfect opportunity to do so. In this post I'm going to try and put together a simple web interface making use of the work I've done so far. I'll make the backend with Flask, since that's something I'm already very familiar with, and I'll try and make the frontend with React. I'm reserving the right to bail out and revert to Jinja2 at some point during this post 😇.

So, my wishlist for this post will be the following;

  • A search bar where I can input search terms
  • A backend which embeds the search term and returns the best matching papers
  • A display which shows the title, authors and abstract of the best matching papers

Hopefully this will then act as a base for future features I want to add, whilst being sufficiently simple to facilitate my learning of React.

I started by reading this tutorial for setting up a Flask + React project. It looks as though there is no problem with developing the frontend first with some dummy data, and then worrying about the backend and calling into it later. I therefore hopped over to reading the tutorials on the React homepage.

The recommendation in the React tutorials was to build a static version of your page before trying to add in the dynamic content. I thought about how to naturally break down the MVP into components, and came up with the following;

const PAPERS = [{
  "title": "A title",
  "authorList": "A et al",
  "publicationDate": "17th July 1991",
  "abstract": "Lorem ipsum dolor sit amet, consectetur adipiscing elit.  "
},
{
  "title": "Another title",
  "authorList": "B et al",
  "publicationDate": "8th December 2020",
  "abstract": "Lorem ipsum dolor sit amet, consectetur adipiscing elit.  "
}
]

function SearchField() {

  return (
    <>
    <input id="searchBox" type="text" />
    <button>Search</button>
    </>
  )
}

function PaperSummaries( { papers } ) {

  if (papers.length === 0) return <></>

  return papers.map(paper => {
    return (<>
    <PaperSummary paper={paper} />
    <br></br>
    </>)
  })
}

function PaperSummary({ paper }) {
  return (
    <>
    <h2>{paper.title}</h2>
    <h3><i>{paper.publicationDate}</i> - {paper.authorList}</h3>
    <p>{paper.abstract}</p>
    </>
  )
}

export default function MyApp() {
  return (
    <div>
      <h1>Preprint Sanity</h1>
      <SearchField />
      <PaperSummaries papers={PAPERS} />
    </div>
  );
}

I have no particular talent in terms of design, and I'm not too worried about the looks of the site, so the simple, circa Web 1.0 look will do fine for now. At some point I will probably find a good template to work with so that the app becomes responsive etc. But for now, it's time to start putting the dynamics into the site.

Reading further into the React tutorials, I need to think about the statefulness of the page. As I see it, the two stateful parts are what is typed into the search box, and the list of papers displayed on the page. The logic will then go something along the lines of;

  1. The user enters a search term into the search box
  2. The frontend sends this search term to the backend
  3. The backend embeds the search term, finds the most similar papers, and sends these back to the frontend
  4. The frontend displays the matched papers

Before building the backend properly, I decided to make a dummy route to continue building and testing the frontend. The route simply returns a couple of random papers from a list of 4 dummy papers;

@app.route('/random_dummy_paper')
def random_dummy_paper():

    papers = [
        {
            "title": "A title",
            "authorList": "A et al",
            "publicationDate": "17th July 1991",
            "abstract": "Lorem ipsum dolor sit amet, consectetur adipiscing elit.  "
        },
        {
            "title": "Another title",
            "authorList": "B et al",
            "publicationDate": "8th December 2020",
            "abstract": "Lorem ipsum dolor sit amet, consectetur adipiscing elit.  "
        },
        {
            "title": "Yet another title",
            "authorList": "c et al",
            "publicationDate": "3rd February 2016",
            "abstract": "Lorem ipsum dolor sit amet, consectetur adipiscing elit.  "
        },
        {
            "title": "Such title, much paper",
            "authorList": "D et al",
            "publicationDate": "15th October 1955",
            "abstract": "Lorem ipsum dolor sit amet, consectetur adipiscing elit.  "
        }
    ]

    return list(np.random.choice(papers, size=2))

It took me a little while to get my head around handling state properly, and figuring out which component should be responsible for what. I managed to get it working by adding a top level state variable to the app, representing the papers currently in the PaperSummaries component. I then pass the setter for this state variable to the search bar. The search bar then sends the request to the backend and uses the setter to update the visible papers. The changed components look like so;


function SearchField({ setPapersInView }) {

  const fetchPapersForSearchTerm = async () => {
    let searchBox = document.getElementById("searchBox");
    let searchTerm = searchBox.value;

    fetch("/random_dummy_paper")
    .then( (resp) => {
      console.log(resp)
      return resp.json()
    })
    .then((papers) => {
      setPapersInView(papers)
    })

  }

  return (
    <>
    <input id="searchBox" type="text" />
    <button onClick={fetchPapersForSearchTerm} >Search</button>
    </>
  )
}

export default function MyApp() {

  const [papersInView, setPapersInView] = useState([]);

  return (
    <div>
      <h1>Preprint Sanity</h1>
      <SearchField setPapersInView={setPapersInView} />
      <PaperSummaries papers={papersInView} />
    </div>
  );
}

It feels a little bit weird to me how the responsibility for updating the state of the papers component is so far away from that component, but perhaps this is how it's supposed to be with React. I'll see how it feels as I get more used to the framework.

The next step is to actually add the functionality to the backend, i.e., embedding the search term, finding the most similar papers and then returning their details. I realised pretty quickly that I currently don't store a bunch of the information that I want to display on the page (such as the actual text of the abstract, the paper title, publication date, authors etc). I have two options; download, store and then serve this information myself, or hook into the arxiv API to fetch it. I'm thinking that at least in the short term, hooking into the arxiv API is going to be the simplest solution (and might even be the best choice in the long term).

Querying the arXiv API turned out to be very straightforward (I don't know why I struggled so much with it earlier). I wrote a few simple functions to facilitate what I need to do for now. I expect I might need to add a bit more sophistication in the future, so I left some space for that.

from dataclasses import dataclass
from typing import List, Dict, Any
from urllib.parse import urlencode
import feedparser
from html2text import html2text

@dataclass
class ArxivPaper:

    title: str
    authors: List[str]
    publish_date: str
    abstract: str

    def to_dict(self) -> Dict[str, Any]:

        return {
            "title": self.title,
            "authorList": ", ".join(self.authors),
            "publicationDate": self.publish_date,
            "abstract": self.abstract
        }


ARXIV_API_URL = "http://export.arxiv.org/api/query"

def get_formatted_arxiv_api_url(
        search_query: str = None,
        id_list: List[str] = None,
        start: int = None,
        max_results: int = None
):
    query_params = {}
    if search_query is not None:
        query_params['search_query'] = search_query
    if id_list is not None:
        query_params['id_list'] = ','.join(map(str,id_list))
    if start is not None:
        query_params['start'] = start
    if max_results is not None:
        query_params['max_results'] = max_results

    return f"{ARXIV_API_URL}?{urlencode(query_params)}"

def fetch_arxiv_papers(id_list: List[str]) -> List[ArxivPaper]:

    url = get_formatted_arxiv_api_url(id_list=id_list)

    paper_details = feedparser.parse(url)['entries']

    arxiv_papers = [
        ArxivPaper(
        title=paper['title'],
        abstract=html2text(paper['summary']),
        publish_date=paper['published'],
        authors=[author['name'] for author in paper['authors']]
        )
        for paper in paper_details
    ]

    return arxiv_papers
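
As a quick sanity check of the fetching code, usage looks like this (the ID here is arbitrary; any valid arXiv ID will do);

papers = fetch_arxiv_papers(id_list=["1711.06287"])
print(papers[0].title)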

Then I could replace my dummy papers route with the real functionality, making use of my earlier work on embedding papers;

@app.route('/get_closest_papers', methods=['POST'])
def get_closest_papers():

    if not request.method == "POST":
        return ""

    request_data = request.get_json()

    search_term = request_data['search_term']

    embedding = embed_abstract(search_term).squeeze()
    paper_ids = get_closest_papers_to_embedding(embedding)

    arxiv_papers = fetch_arxiv_papers(paper_ids)

    return [paper.to_dict() for paper in arxiv_papers]
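
For completeness, hitting the new route from Python looks something like this (assuming the Flask dev server is running locally on its default port, and with an arbitrary search term);

import requests

resp = requests.post(
    "http://localhost:5000/get_closest_papers",
    json={"search_term": "gravitational waves from binary black holes"},
)
print(resp.json())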

and update the frontend component to make the proper POST request

function SearchField({ setPapersInView }) {

  const fetchPapersForSearchTerm = async () => {
    let searchBox = document.getElementById("searchBox");
    let searchTerm = searchBox.value;

    let postBody = {
      "search_term": searchTerm
    }

    fetch(
      "/get_closest_papers",
      {
        method: "POST",
        headers: {'Content-Type': 'application/json'}, 
        body: JSON.stringify(postBody)
      }
    )
    .then( (resp) => {
      return resp.json()
    })
    .then((papers) => {
      setPapersInView(papers)
    })

  }

  return (
    <>
    <input id="searchBox" type="text" />
    <button onClick={fetchPapersForSearchTerm} >Search</button>
    </>
  )
}

and everything works, at least functionally. It's not the prettiest, and isn't responsive at all, but those are problems for the future.

In the next post, I want to add some more features. Top of my list is a "show me similar papers" feature, but I may also start working on the paper favouriting feature. I also want to do a bit of repo maintenance, setting up stylecheckers and linters, and writing docstrings, before that job becomes unmanageable.



arXiv Lunacy - Post 2

13/07/2023 - Jim Barrett

This post relates to my arXiv lunacy project. You can find the introduction post here. This post covers my progress up to commit #ec11477.

I feel like the next step in development of the project should be to implement functionality to be able to add new embeddings to my dataset. This will be necessary when fetching the latest preprints, screening them for papers which might be interesting to me and then saving them for future reference.

It feels like I should probably be using a relational database or something similar in order to store the embeddings and update them. Honestly though, since this app is in all likelihood only going to be used by me, I don't really care about a few seconds of latency. Instead, I will opt for simply storing the tables I produce as blobs in container storage, then download them and load them into pandas when I need them. I will try and design the schema of the tables such that I could theoretically transform things over to a relational database if I ever need to.

I have mostly developed my personal projects on Google Cloud, and I will do the same here. The plan is that I will put the Feather-encoded table of embeddings into a Google Cloud Storage bucket. I happen to already have some code in the backend of this website (where you're reading this blog) which solves some of these problems. One day I might look into a good way of separating out some of these general purpose utilities I write and making them available across all my projects (by having my own package index or something), but for now I will shamelessly copy-paste and modify the code for the new use case.

After wrestling a little bit with getting Google authentication working reliably locally, and then spending even longer overcomplicating my solution using various I/O tricks with GCP's own Python tools, I discovered the wonderful, if slightly opaquely named, library gcsfs. With this, I was able to write the code to both load and dump dataframes to GCP storage buckets;


import pandas as pd

from util.constants import GCP_BUCKET_NAME

def get_blob_stored_dataframe(blob_name: str) -> pd.DataFrame:
    """
    Retrieves and deserialises the dataframe stored as a blob at blob_name
    """
    df = pd.read_feather(f'gs://{GCP_BUCKET_NAME}/{blob_name}')

    return df


def save_dataframe_to_blob(
    df: pd.DataFrame, blob_name: str
) -> None:
    """
    Serialises the dataframe to feather and stores it as a blob at blob_name
    """

    df.to_feather(f'gs://{GCP_BUCKET_NAME}/{blob_name}')
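
Usage is then just the following (EMBEDDINGS_DF_FILENAME being the constant I use for the blob name; the back-to-back round trip is only to show both directions);

from util.constants import EMBEDDINGS_DF_FILENAME

# load the stored embeddings table, and write it straight back
embeddings_df = get_blob_stored_dataframe(EMBEDDINGS_DF_FILENAME)
save_dataframe_to_blob(embeddings_df, EMBEDDINGS_DF_FILENAME)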

We'll wait and see if this method is fast enough once I start writing functionality that uses the dataframe, but for now I feel I can work with it.

The next step is to start regularly embedding any new papers and adding them to the dataframe. I fiddled around for a while trying to figure out the arXiv API to search for papers and abstracts, but then figured that a better approach would be to instead listen to the RSS feed, which posts the metadata about all the papers released each day. One problem I foresaw with this approach is that there will inevitably be a few days' gap between the embeddings I produced initially from the monthly data dump to Kaggle, and the first time I start adding data from the RSS feed. However, I figured this is a problem easily fixed in the future by waiting a month or two, and then merging in whatever I'm missing from a future monthly Kaggle dump.

I started by writing a function which reads the IDs and abstracts from the RSS feed into a format I can work with. I tried parsing the feed by hand for a while using the Python standard XML parsing library, but it quickly became evident that it would be much easier to have the feedparser package handle everything for me. I then noticed that the abstracts were HTML formatted, and so I found the simple html2text tool to parse things like <p> tags into whitespace. The function ultimately looked like this;


from util.constants import INTERESTING_ARXIV_CATEGORIES
import feedparser
from html2text import html2text

def get_arxiv_rss_url(arxiv_category: str):

    return f"http://export.arxiv.org/rss/{arxiv_category}"

def get_latest_ids_and_abstracts():

    paper_id_to_abstract = {}
    for category in INTERESTING_ARXIV_CATEGORIES:

        url = get_arxiv_rss_url(category)
        rss_content = feedparser.parse(url)
        for entry in rss_content['entries']:
            paper_id = entry['id'].split('/')[-1]
            paper_id_to_abstract[paper_id] = html2text(entry['summary'])

    return paper_id_to_abstract

Now, to embed them, we simply use the same libraries and code as we did for the initial embedding. I did a bit of refactoring to the original script to pull this functionality into a common place. At this point, I'm not especially happy with the structure of the repo. I'm hoping that when it matures a bit, a more logical arrangement of the code will become obvious. In any case, the embedding code looks like so;


def get_latest_embedding_df():

    paper_id_to_abstract = get_latest_ids_and_abstracts()

    paper_ids = []
    embeddings = []
    for paper_id, abstract in paper_id_to_abstract.items():

        paper_ids.append(str(paper_id))
        embeddings.append(embed_abstract(abstract))

    embeddings = np.array(embeddings).squeeze()
    embeddings_df = pd.DataFrame(embeddings)
    # feather doesn't like numerical column names
    embeddings_df = embeddings_df.rename(columns={i:f"dim{i}" for i in embeddings_df.columns})
    embeddings_df["id"] = paper_ids

    return embeddings_df

The final thing I want to set up in this post is the running of the code at regular intervals. I considered for a while putting an endpoint on this website, which tends to operate as my general purpose "always-on" machine. But I know that's not the best practice, and I want to start getting a bit better at the cloud infrastructure.

Ultimately, what I want is an occasional, short-running, scheduled job. It seems to me that the best solution will be a GCP "cloud function", in combination with the cloud scheduler to trigger it daily. Walking through the wizard for creating a cloud function, it looks as though for a non-trivial function (beyond a few lines of code), the easiest thing to do is to package up a zip file with the relevant code and requirements and then put it somewhere on cloud storage. The zip needs the entrypoint function in a main.py file, a requirements.txt, and whatever other code we need.

I don't really like the idea of cobbling together the relevant code every time I want to update the function, so I figured I'd try and write a script to put together the relevant files and zip them up. This was the script I came up with;


import shutil
import tempfile as tmp
import zipfile
from pathlib import Path

import functions_framework
import pandas as pd

from arxiv_lunacy.embeddings import get_embeddings_df
from arxiv_lunacy.latest_papers import get_latest_embedding_df
from util.constants import (EMBEDDINGS_DF_FILENAME, GCP_FUNCTION_ZIPFILE_NAME,
                            REPO_ROOT)
from util.storage import save_dataframe_to_blob, save_file_to_blob


@functions_framework.http
def update_embeddings_df(_):

    current_embeddings_df = get_embeddings_df()
    embedding_update_df = get_latest_embedding_df()

    all_embeddings_df = pd.concat([current_embeddings_df, embedding_update_df])
    all_embeddings_df = all_embeddings_df.drop_duplicates(subset='id').reset_index().drop(columns='index')
    save_dataframe_to_blob(all_embeddings_df, EMBEDDINGS_DF_FILENAME)

    return '{"status":"200", "data": "OK"}' 


def create_and_upload_gcp_function_zipfile():

    tmpdir = Path(tmp.mkdtemp())

    zipfile_path = tmpdir / GCP_FUNCTION_ZIPFILE_NAME

    zf = zipfile.ZipFile(zipfile_path, 'w', zipfile.ZIP_DEFLATED)

    # add the relevant code directories to the zip archive
    for dir in ['arxiv_lunacy', 'util']:
        for f in (REPO_ROOT / dir).iterdir():
            if not str(f).endswith('.py'):
                continue
            zf.write(str(f), f'{dir}/{f.name}')

    zf.write(REPO_ROOT / 'requirements.txt', 'requirements.txt')
    zf.write(REPO_ROOT / 'scripts/produce_gcp_functions.py', 'main.py')
    zf.close()

    save_file_to_blob(zipfile_path)
    shutil.rmtree(tmpdir)


if __name__ == '__main__':

    create_and_upload_gcp_function_zipfile()

It took a few attempts to get it right. The main problems I ran into were making sure that the entrypoint function actually returned something (I have no idea if the JSON response is necessary, but if it ain't broke, don't fix it), and making sure that the cloud scheduler has sufficient permissions to make the HTTP request to trigger the function. The function seems to take around a minute to execute (which might reduce if it's clever about caching the SentenceTransformer model). I will monitor it over the next few days, but for now it seems like a success!
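
On the caching point: the simplest improvement I can think of (a sketch, not something I've implemented yet) would be to load the SentenceTransformer once per instance, behind an lru_cache, so that warm invocations reuse it rather than reloading the model every time;

from functools import lru_cache

from sentence_transformers import SentenceTransformer

MODEL_NAME = 'sentence-transformers/all-MiniLM-L6-v2'

@lru_cache(maxsize=1)
def get_model() -> SentenceTransformer:
    # loaded on the first call of a warm instance, reused on subsequent calls
    return SentenceTransformer(MODEL_NAME)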

Next step: a frontend.



arXiv Lunacy - Post 1

08/07/2023 - Jim Barrett

This post relates to my arXiv lunacy project. You can find the introduction post here. This post covers my progress up to commit #fd091c1.

In this post I cover the simple MVP I have built for finding the N papers most similar to a paper input by the user.

The first thing I needed to do was to get some data. Fortunately, arXiv themselves provide a dataset via Kaggle with all of the information I expect I will ever need for bootstrapping my model with all the papers up until now. In particular, they provide the title, arXiv ID, categories and abstract of all preprints hosted on the arXiv. I downloaded the data, which is provided in JSONL format, and wrote a small function to serve up the papers with categories I'm personally interested in (here stored in a constant set);

import json
from pathlib import Path
from typing import Any, Dict, Generator

ARXIV_METADATA_SNAPSHOT_FILE = Path('./arxiv-metadata-oai-snapshot.json')

def records_gen() -> Generator[Dict[str, Any], None, None]:
    """
    Generate the relevant records
    """
    with ARXIV_METADATA_SNAPSHOT_FILE.open('r') as f:
        for line in f:
            record = json.loads(line)
            # INTERESTING_ARXIV_CATEGORIES is the constant set of categories mentioned above
            cats = set(record['categories'].split())
            if len(cats & INTERESTING_ARXIV_CATEGORIES) == 0:
                continue

            yield record
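
The embedding code below also uses a get_number_of_records helper which I haven't reproduced here; a minimal version (the repo version may differ) is just a count over the same generator;

def get_number_of_records() -> int:
    """
    Counts the relevant records; only used to give the progress bar a total
    """
    return sum(1 for _ in records_gen())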

I decided to use a transformer model to produce embeddings of the abstracts, and chose the most popular sentence embedding model from Huggingface for my MVP. It does say in the model description that the model is only designed for texts up to 256 word pieces, so we can undoubtedly do better at some point, but for now it's good enough. I did look into using one of Google's cloud based models for sentence embedding, but a back of the envelope calculation suggested it would cost approximately $30 to produce the initial embeddings. I might do this at some point, but not for now. I produce the embeddings like so;


import pandas as pd
import torch
from sentence_transformers import SentenceTransformer
from tqdm import tqdm

MODEL_NAME = 'sentence-transformers/all-MiniLM-L6-v2'
TORCH_DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def produce_embeddings_df() -> pd.DataFrame:
    """
    Produce a dataframe of embeddings
    """

    # calculates how many relevant papers there are, simply for the progress bar
    n_records = get_number_of_records()

    ids = []
    embeddings = []

    model = SentenceTransformer(MODEL_NAME)

    for record in tqdm(records_gen(), total=n_records):
        ids.append(record['id'])
        embeddings.append(model.encode(record['abstract'], device=TORCH_DEVICE))

    embeddings_df = pd.DataFrame(embeddings)
    embeddings_df = embeddings_df.rename(columns={i:f"dim{i}" for i in embeddings_df.columns})
    embeddings_df["id"] = ids

    return embeddings_df

This took about an hour running on my modest gaming PC to embed the ~250k papers in the categories relevant to me, producing a file just over 400MB when saved to the feather format. Depending on how I ultimately end up deploying the model, this might be a bit cumbersome, but that's a problem for another day.

Finally, I wrote some functionality for finding the N papers most similar to a given paper. I implemented the cosine similarity as my comparison metric, which I think is a sensible choice for this problem. I might look into other, more "search engine"-y techniques if I feel I need it, but with some numpy magic, it's pretty trivial and fast to compute.


def cosine_similarity(all_embeddings, embeddings_to_compare):

    numerators = np.dot(all_embeddings, embeddings_to_compare)
    # normalise each pair individually: row norms of the stored embeddings
    # times column norms of the embeddings we're comparing against
    denominators = np.outer(
        np.linalg.norm(all_embeddings, axis=1),
        np.linalg.norm(embeddings_to_compare, axis=0)
    )

    return numerators / denominators

And then finally we can compute the closest papers.


def get_closest_papers(paper_ids, top_n = 10):

    embeddings_df = get_embeddings_df()
    id_to_index = get_paper_id_to_index(embeddings_df)
    paper_inds = [id_to_index[paper_id] for paper_id in paper_ids]

    ids = embeddings_df.id.to_numpy()
    embeddings_df = embeddings_df.drop(columns="id")
    paper_embeddings = embeddings_df.loc[paper_inds].to_numpy()
    other_embeddings = embeddings_df.to_numpy()

    cosine_sims = cosine_similarity(other_embeddings, paper_embeddings.T)

    top_n_inds = np.argpartition(cosine_sims, -top_n, axis=0)[-top_n:]

    return ids[top_n_inds]

An annoying amount of this code is just rearranging the data from how I have it stored, so I might look to improve that at some point, but for now it only takes a second or so to produce all the cosine similarities and provide the top N, which is pretty good!

The final thing to test is whether or not it actually works! I googled to find a paper broadly related to my current field of work, and found the paper Multi-Task Learning for Extraction of Adverse Drug Reaction Mentions from Tweets. I passed this into my similarity function and asked for the top 3 (well, top 4, since a paper always comes back as its own top match, which makes for a good sanity check!). These were the papers it gave me back;

Really cool! It seems to be working great! We'll have to see what more we can do with it in the next post!



arXiv Lunacy - Introduction

07/07/2023 - Jim Barrett

I miss the old Arxiv-Sanity. For me, it was one of the most effective ways of keeping on top of the latest developments in my fields of interest. Not that I blame the author for no longer hosting the old version, but I do miss it.

I have therefore decided to look into creating my own version, a journey which I plan to document in a sporadic series of blog posts here. The features I want to eventually include are;

  • Given a paper X, find me similar papers
  • Keep an up to date directory of papers over time
  • Given a series of papers I have favourited, notify me when something I might be interested in appears
  • A lightweight web interface which allows me to quickly digest abstracts of papers recommended to me, and allow me to favourite relevant ones
  • Maybe, in the long run, some option for me to add some notes to the papers I read.

I have no real intention, at least in the short to medium term, of hosting the tool such that it is available to other users, but I will develop all of the code here. I am lovingly calling the project arxiv lunacy, in homage to the tool that inspired it. Perhaps one day I'll branch it out to look at published literature too, and then I'll have to rename it. Watch this space for updates.



Gaussian Process Demo

09/11/2020 - Jim Barrett

I was recently asked to introduce my team to Gaussian processes. In my experience, Gaussian processes are often introduced in a very abstract manner, and it took me some time to understand exactly what was going on behind the scenes.

As such, I prepared the following notebook which walks you through what is actually going on when you train a Gaussian process model. If you want to run or modify the notebook, you can download it here.



Gaussian Process Update Equations

14/07/2017 - Jim Barrett

When you start to learn about Gaussian processes, you come across the update equations fairly early on. The update equations are fairly intimidating to look at, and are typically dismissed as trivial to derive (for example, Rasmussen and Williams simply point you towards a statistics book from the 1930's, which is neither available online nor in our university library...). I had a go at the derivation, and promptly realised it wasn't trivial at all from a cold start.

Fortuitously, at the PEN emulators workshop I recently attended, there was an introductory lecture from Jochen Voss, where he went through a univariate case, and then gave us the structure of the derivation for the full multivariate case. Luckily, this gave me the push I needed to try the derivation again, so I went away and filled in the gaps.

So here it is, in all its glory; the derivation of the update equations for Gaussian processes.

The overall endgame of Gaussian process regression is to write down a conditional distribution for a set of predicted outputs $\mathbf{y}_*$ given a set of training (observed) outputs $\mathbf{y}$. By the product rule

$$ p(\mathbf{y}_* \mid \mathbf{y}) = \frac{p(\mathbf{y}, \mathbf{y}_*)}{p(\mathbf{y})} \propto p(\mathbf{y}, \mathbf{y}_*) $$

Since we have a constant set of data, $p(\mathbf{y})$ is just a constant in this expression.

The underlying assumption in Gaussian process regression is that outputs are jointly Gaussian distributed, so that

$$ p(\mathbf{y}, \mathbf{y}_*) \propto \exp\left(-\frac{1}{2} \begin{pmatrix} \mathbf{y} \\ \mathbf{y}_* \end{pmatrix}^T \Sigma^{-1} \begin{pmatrix} \mathbf{y} \\ \mathbf{y}_* \end{pmatrix}\right) $$

Where $\Sigma$ is the joint covariance matrix. Remember that under the Gaussian process model we have trained a function which computes the elements of the covariance matrix purely as a function of the inputs; it is only the distribution over the outputs that we're trying to find. We can define the covariance matrix blockwise

$$ \Sigma = \begin{pmatrix} K & K_* \\ K_*^T & K_{**} \end{pmatrix} $$

Where $K$ is the covariance matrix computed using only the training inputs $X$, $K_{**}$ is the covariance matrix computed using the prediction inputs $X_*$, and $K_*$ is the cross terms (i.e. the covariance between $\mathbf{y}$ and $\mathbf{y}_*$, computed using $X$ and $X_*$). It is a well known result (it's in Numerical Recipes, or on Wikipedia) that you can blockwise invert a matrix;

$$ \Sigma^{-1} = \begin{pmatrix} K^{-1} + K^{-1} K_* M^{-1} K_*^T K^{-1} & -K^{-1} K_* M^{-1} \\ -M^{-1} K_*^T K^{-1} & M^{-1} \end{pmatrix} $$

Where $M = K_{**} - K_*^T K^{-1} K_*$. So, we can directly compute our Gaussian density

$$ p(\mathbf{y}, \mathbf{y}_*) \propto \exp\left(-\frac{1}{2}\left[\mathbf{y}^T\left(K^{-1} + K^{-1} K_* M^{-1} K_*^T K^{-1}\right)\mathbf{y} - \mathbf{y}^T K^{-1} K_* M^{-1} \mathbf{y}_* - \mathbf{y}_*^T M^{-1} K_*^T K^{-1} \mathbf{y} + \mathbf{y}_*^T M^{-1} \mathbf{y}_*\right]\right) $$

However, the only thing that isn't a constant here is $\mathbf{y}_*$, so we can drop a bunch of terms (since we're only interested in the density, not absolute values)

$$ p(\mathbf{y}_* \mid \mathbf{y}) \propto \exp\left(-\frac{1}{2}\left[\mathbf{y}_*^T M^{-1} \mathbf{y}_* - \mathbf{y}^T K^{-1} K_* M^{-1} \mathbf{y}_* - \mathbf{y}_*^T M^{-1} K_*^T K^{-1} \mathbf{y}\right]\right) $$

If we take the transpose of the middle term (it's a scalar, so this changes nothing), we can group the terms together a bit more

$$ p(\mathbf{y}_* \mid \mathbf{y}) \propto \exp\left(-\frac{1}{2}\left[\mathbf{y}_*^T M^{-1} \mathbf{y}_* - 2\,\mathbf{y}_*^T M^{-1} K_*^T K^{-1} \mathbf{y}\right]\right) $$

Now, in general, a multivariate Gaussian has the form;

$$ p(\mathbf{x}) \propto \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T C^{-1} (\mathbf{x} - \boldsymbol{\mu})\right) $$

If we remember that covariance matrices are symmetric, we can expand, drop some constant terms and then rearrange this to

$$ p(\mathbf{x}) \propto \exp\left(-\frac{1}{2}\left[\mathbf{x}^T C^{-1} \mathbf{x} - 2\,\mathbf{x}^T C^{-1} \boldsymbol{\mu}\right]\right) $$

we can therefore see that both $p(\mathbf{x})$ and $p(\mathbf{y}_* \mid \mathbf{y})$ have exactly the same form. We can therefore straightforwardly match expressions for the covariance, $C = M = K_{**} - K_*^T K^{-1} K_*$.

The expression for the mean $\boldsymbol{\mu}$ requires a little more work. We start by matching terms

$$ \mathbf{y}_*^T C^{-1} \boldsymbol{\mu} = \mathbf{y}_*^T M^{-1} K_*^T K^{-1} \mathbf{y} $$

We can rearrange this a little bit (both sides are scalars, so we can transpose them, and the equality must hold for any $\mathbf{y}_*$)

$$ \boldsymbol{\mu}^T M^{-1} = \mathbf{y}^T K^{-1} K_* M^{-1} $$

We know that $M^{-1}$ is a symmetric matrix (we just showed that its inverse is the covariance matrix). So, if we right multiply by the covariance matrix and take the transpose, we finally arrive at

$$ \boldsymbol{\mu} = K_*^T K^{-1} \mathbf{y} $$

And, so, in conclusion we know that

$$ \mathbf{y}_* \mid \mathbf{y} \sim \mathcal{N}\left(K_*^T K^{-1}\,\mathbf{y},\; K_{**} - K_*^T K^{-1} K_*\right) $$

So, not quite as trivial as the textbooks claim!
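
For anyone who wants to see the final result written out, here it is in numpy (a toy squared-exponential kernel and arbitrary inputs, purely for illustration);

import numpy as np

rng = np.random.default_rng(42)

def kernel(a, b):
    # toy squared-exponential covariance function
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

x_train = rng.uniform(-3, 3, size=5)
x_pred = rng.uniform(-3, 3, size=3)
y_train = np.sin(x_train)  # any observed outputs will do

K = kernel(x_train, x_train) + 1e-9 * np.eye(5)  # jitter for numerical stability
K_star = kernel(x_train, x_pred)                 # cross covariance
K_star_star = kernel(x_pred, x_pred)

# the update equations derived above
mean = K_star.T @ np.linalg.solve(K, y_train)
cov = K_star_star - K_star.T @ np.linalg.solve(K, K_star)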



(un-)Predictability of Sports

31/12/2017 - Jim Barrett

A subject of debate (mostly in the pub) over the last few months has been the predictability of sports results. My hypothesis was that the total number of points scored in a team-vs-team sport should be roughly Poisson distributed, and as such you would expect greater statistical fluctuations in the scores of sports where the score-lines are typically low (e.g. football), compared to sports with a high scoreline (e.g. basketball). My thinking was that the "better" team should win more consistently in high scoreline sports.
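
To make the hypothesis concrete, here's roughly the toy model I had in mind (an illustration, not taken from the notebook): two teams whose scores are Poisson with a fixed relative difference in means, counting how often the weaker team wins outright;

import numpy as np

rng = np.random.default_rng(0)

def upset_rate(mean_strong, mean_weak, n_games=100_000):
    # fraction of games the lower-scoring-on-average team wins outright under Poisson scoring
    strong = rng.poisson(mean_strong, n_games)
    weak = rng.poisson(mean_weak, n_games)
    return (weak > strong).mean()

# same 25% relative advantage, very different typical scorelines
print(upset_rate(1.5, 1.2))     # football-like scorelines: upsets are common
print(upset_rate(100.0, 80.0))  # basketball-like scorelines: upsets are rare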

Over the Christmas break, I decided to put my money where my mouth is, and actually investigate it. You can find the Jupyter notebook here. I downloaded a bunch of historical results from football, ice hockey and basketball, together with bookmakers pre-game odds. The data took a bit of cleaning, but I ultimately looked at how often the bookie's favourite lost for each sport. The tl;dr conclusion is that I was wrong. Football is the most accurately predicted, and has the lowest average scoreline.

[Figure: rate of upsets]

I had intended to clean the notebook up a little bit, and do some more investigation (e.g. was I wrong because the score-lines aren't Poisson distributed? Or because the teams' individual scorelines are too highly correlated with each other?), but whilst I was clearing out some of my external hard-drives I accidentally formatted my internal hard-drive too. That was a lot of fun to fix... Anyway, I lost all my squeaky clean data, so I couldn't make prettier plots. Apologies.



Random Samples from a Sphere

26/07/2017 - Jim Barrett

During my work today, at some point I had to draw samples uniformly from a sphere, and made an observation I hadn't really appreciated before. The name of the game is to make sure that every infinitesimal shell has the same density of samples, on average. The volume of a thin shell at radius $r$ grows as

$$ dV = 4\pi r^2 \, dr $$

Or in other words, the volume gets more spread out, so we need more samples for higher $r$. The density is simply constant with respect to direction (angles). So, to get the correct distribution we simply need to sample the radius from

$$ p(r) \, dr \propto r^2 \, dr $$

Or in other words, just sample from a power law distribution with index 2. Elementary observation? Maybe, but it makes it easier to code up!
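
Coding it up via inverse-CDF sampling is then a couple of lines: $p(r) \propto r^2$ gives $r = R\,u^{1/3}$ for $u$ uniform on $[0, 1]$, with the usual uniform-direction trick for the angles (a minimal sketch);

import numpy as np

rng = np.random.default_rng(0)

def sample_uniform_ball(n, radius=1.0):
    # radial part: inverse-CDF sample of p(r) proportional to r^2, i.e. r = R * u^(1/3)
    r = radius * rng.random(n) ** (1.0 / 3.0)
    # directional part: uniform on the sphere (cos(theta) uniform, phi uniform)
    cos_theta = rng.uniform(-1.0, 1.0, n)
    phi = rng.uniform(0.0, 2.0 * np.pi, n)
    sin_theta = np.sqrt(1.0 - cos_theta ** 2)
    return np.column_stack([
        r * sin_theta * np.cos(phi),
        r * sin_theta * np.sin(phi),
        r * cos_theta,
    ])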

[Figure: samples from a sphere]



New Paper - Accuracy of Inference

04/12/2017 - Jim Barrett

A couple of weeks ago I put my newest paper on to the arXiv, which you can find here; https://arxiv.org/abs/1711.06287

The paper describes how much we can hope to learn from gravitational-wave observations of black holes, focusing in particular on how well we can constrain our models of the evolution of binary stars. In order to calculate this I used COMPAS, the binary evolution simulation software we are developing here in Birmingham, and a technique called Fisher information matrices.

If you have a model which depends on some parameters, and makes predictions about some observable random variable, then the Fisher information quantifies how sensitive the expected observations are to changes in the underlying parameters. If some parameters have a large effect on our expected observations, then it makes sense that we can learn a lot about them.
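
For reference, the standard definition (written loosely; the construction in the paper has more moving parts) is

$$ F_{ij} = \mathrm{E}\left[\frac{\partial \ln \mathcal{L}(d \mid \theta)}{\partial \theta_i}\,\frac{\partial \ln \mathcal{L}(d \mid \theta)}{\partial \theta_j}\right] $$

and the Cramér-Rao bound then says that the covariance of any unbiased estimator of $\theta$ satisfies $\mathrm{cov}(\hat{\theta}) \geq F^{-1}$, which is what turns the Fisher matrix into a statement about how accurately the parameters can be measured.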

I calculated the Fisher matrix by running COMPAS (a lot!) and seeing how the output changed with respect to four different bits of physics; the strength of supernova kicks, the ferocity of stellar winds during two different stellar evolutionary phases, and the frictional energetics during the common envelope phase. Below is the money plot from the paper.

[Figure: accuracy of inference]

The plot shows how accurately we will be able to measure each of these parameters after 1000 observations of binary black holes. The fuzziness comes from a quantification of the various sources of uncertainty in this process. It also shows how the different parameters behave with respect to each other (their correlation). The way I like to think about this plot is in terms of null hypothesis testing. If we find the truth is a massive outlier with respect to these distributions, then we reject the null hypothesis that our model is correct. This plot shows that the model doesn't have to be wrong by more than a few percent before we can rule out our model.



It's Jim's life, but not as we know it!

07/06/2016 - Jim Barrett

This morning I participated in the Astronomy and Astrophysics group Science Jamboree 2016, where I managed to win a prize (chocolates!) for the best talk. I'm not quite sure how I managed it, since in the midst of a series of informative and engaging talks, I performed a silly little poem;

For my science Jamboree
I'll tell you a bit about me
My journey through physics
in a few clumsy limericks
Right through to my PhD

I was born in July '91
In the blistering east Norfolk sun
And... oh wait a minute,
there's a strict time limit
So I had best be getting along

We'll skip ahead 18 odd years
Whilst I, still wet 'round the ears
kicked off my degree,
eager to see,
How I felt after 2 dozen beers

But I did learn some physics as well
Einstein, Newton, Maxwell
But as some of you know,
not a lot of astro
Although I think you can probably all tell

To be honest I found space a bit dreary
I preferred condensed matter theory
my vague proclivity
for superconductivity
"I should do a PhD in this clearly!"

Alas, I left those low temperatures
And became a software developer
but the real world is crap
I resolved to come back
And I somehow became an Astronomer....

My first project involved an obsession
with continuous autoregression
take an impulse response
to stochastic nonsense
et voila! banded matrix compression!

And with that O(n) inference
Time to write a paper of significance!
Except instabilities
destroy all utility
Try selling that at a conference!

So my PhD project, take 2!
But what was I going to do?
I felt it was prudent
to become Ilya's student
since bless him, he's only got a few

Binary population synthesis;
i.e how did it end up like this?
which hyperparameters
make a chirp that glamorous?
The combinations seem virtually limitless

So here's where I come in
When parameter space coverage is thin
we can interpolate
perhaps extrapolate
what happens when things spiral in

Now this work isn't finished yet
I've tried random forests and neural nets
but despite all my labour
it seems nearest neighbours
may well be our safest bet

To conclude my science Jamboree
I'll make a short summary
my project got switched
because CARMA's a bitch
and I write really shit poetry!