Sitemap

A list of all the posts and pages found on the site. For you robots out there, there is an XML version available for digesting as well.

Posts

Isolated uv Dependency Management in Monorepos

4 minute read

Published: January 22, 2026

As a machine learning team, we have a number of large projects with their own repositories, but this is not practical for the many smaller, more data-science-focused projects we carry out. These smaller projects are stored within large thematic monorepos, each in its own subfolder.

Migrating Sub-Directory While Preserving Git History

1 minute read

Published: January 19, 2026

Recently we decided to open source some code we use internally for bibliometrics analysis and share it in a public repository. However, this code existed as a sub-folder in our analytics monorepo, so the question was: how could we migrate the contents of that sub-folder to a new repo while preserving the git history for only those files?

Merging Git Repos into an Existing Repository on GitHub

5 minute read

Published: November 28, 2025

Recently we launched a new project seeking to improve the data quality of our impact reporting. This project has grown out of a number of smaller projects, some of which have their own git repository on GitHub. For project and code management reasons, we have decided to combine these into a single monorepo, bringing all components of this project into one place.

Is The Average of Averages The Same as The Overall Average?

2 minute read

Published: June 13, 2025

There is a simple question that always seems to crop up when discussing everything from analytics dashboards to parallel computation. That is, when you have multiple groups of data (e.g., from different sources or processes), each with its own average, can you simply average those individual averages to get the true average across all the data?
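The question above can be made concrete with a small sketch. The group names and numbers below are made up purely for illustration. It compares the plain average of group averages with the overall average, and shows that weighting each group average by its group size recovers the overall figure.

```python
# Hypothetical groups of different sizes (made-up numbers for illustration).
groups = {"a": [10.0, 20.0], "b": [30.0, 40.0, 50.0, 60.0]}

# Average within each group.
group_means = {name: sum(vals) / len(vals) for name, vals in groups.items()}

# Unweighted average of the group averages.
mean_of_means = sum(group_means.values()) / len(group_means)

# True average across every data point.
all_values = [v for vals in groups.values() for v in vals]
overall_mean = sum(all_values) / len(all_values)

# Weighting each group average by its group size recovers the overall average.
weighted_mean = sum(group_means[n] * len(groups[n]) for n in groups) / len(all_values)

print(mean_of_means, overall_mean, weighted_mean)
```

With unequal group sizes the two figures differ. They coincide when every group contains the same number of values.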

Postgres Not Equal Only Returns Non-Null Values

1 minute read

Published: June 06, 2025

As part of an analysis, I recently ran a PostgreSQL query to return only the rows in which the status column was not equal to “started”. However, after running the query below using the <> operator, I noticed it returned far fewer rows than expected.
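A minimal sketch of the behaviour named in the post title. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, on the assumption that both follow the same SQL rule that comparing NULL with <> is neither true nor false. The `tasks` table and its rows are hypothetical. Only the `status` column and the “started” value come from the post.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO tasks (status) VALUES (?)",
    [("started",), ("finished",), (None,)],
)

# <> skips the NULL row, because NULL <> 'started' evaluates to NULL, not true.
not_equal = conn.execute(
    "SELECT status FROM tasks WHERE status <> 'started' ORDER BY id"
).fetchall()

# Adding an explicit NULL check keeps the rows with no status.
with_nulls = conn.execute(
    "SELECT status FROM tasks WHERE status <> 'started' OR status IS NULL ORDER BY id"
).fetchall()

print(not_equal)
print(with_nulls)
```

In PostgreSQL itself, `status IS DISTINCT FROM 'started'` is a more compact way to write the second filter.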

Neo4j Super Node Performance Issues

8 minute read

Published: February 26, 2024

We recently developed a knowledge graph with billions of relationships, representing the wider academic landscape. How we developed the data pipeline to build this graph in Neo4j is a story in itself, one I may write up here or on the Wellcome Data blog. This post, however, focuses on a specific database modelling issue that has caused us severe query performance problems: low cardinality “super nodes”.

Multimodal Document Classification

39 minute read

Published: October 10, 2020

The automatic classification of documents remains an important and only partially solved information management problem within the upstream oil and gas industry. Companies in this sector have typically amassed vast repositories of unstructured data measuring in the range of tens to thousands of terabytes. An estimated 80% of all data within the upstream oil and gas industry is stored within large unstructured document repositories.

Creating Reproducible Data Science Projects

13 minute read

Published: March 11, 2020

A Nightmare Scenario - Imagine you completed a one-off analysis a few months ago, creating a fairly complex data pipeline, machine learning model and visualisations. Fast forward to today and you have Emily, a senior executive at your company, asking you to reuse that work to help solve a similar, time-critical business problem. She looks stressed.

Write up of the UK’s first Subsurface Data Science Hackathon

3 minute read

Published: February 13, 2019

Recently I attended the London OGA Data Science Hackathon, the first hackathon in the UK to focus on the use of machine learning with subsurface data. The event brought together oil and gas professionals from a range of companies and disciplines, including geoscientists, engineers, developers and data scientists.

publications

GraphTranslate: Predicting Clinical Trial Translation using Graph Neural Networks on Biomedical Literature

Published in Proceedings of the Fifth Workshop on Scholarly Document Processing, Association for Computational Linguistics, 2025

A novel graph neural network that leverages both semantic and structural information to predict which research publications will lead to clinical trials. Our model analyses a comprehensive dataset of 19 million publication nodes, using transformer-based title and abstract sentence embeddings within their citation network context. Our graph-based architecture, which employs attention mechanisms over local citation neighbourhoods, outperforms traditional convolutional approaches by effectively capturing knowledge flow patterns. Our metadata is carefully selected to eliminate potential biases from researcher-specific information, while maintaining predictive power through network structural features.

Recommended citation: Muller E, Boylan-Toomey J, Ekinsmyth J, Robben A, Cardona MDLP, Langfelder A. 2025. GraphTranslate: Predicting Clinical Trial Translation using Graph Neural Networks on Biomedical Literature. In Proceedings of the Fifth Workshop on Scholarly Document Processing (SDP 2025), pages 31–41, Vienna, Austria. Association for Computational Linguistics (ACL).
Download Paper

Justin Boylan-Toomey
