Posts by Tags

data science

Is The Average of Averages The Same as The Overall Average?

2 minute read

Published: June 13, 2025

There is a simple question that always seems to crop up when discussing everything from analytics dashboards to parallel computation. That is, when you have multiple groups of data (e.g., from different sources or processes), each with its own average, can you simply average those individual averages to get the true average across all the data?

Multimodal Document Classification

39 minute read

Published: October 10, 2020

The automatic classification of documents remains an important and only partially solved information management problem within the upstream oil and gas industry. Companies in this sector have typically amassed vast repositories of unstructured data measuring in the range of tens to thousands of terabytes. An estimated 80% of all data within the upstream oil and gas industry is stored within large unstructured document repositories.

Creating Reproducible Data Science Projects

13 minute read

Published: March 11, 2020

A Nightmare Scenario - Imagine you completed a one-off analysis a few months ago, creating a fairly complex data pipeline, machine learning model and visualisations. Fast forward to today and you have Emily, a senior executive at your company, asking you to reuse that work to help solve a similar, time-critical business problem. She looks stressed.

Write up of the UK’s first Subsurface Data Science Hackathon

3 minute read

Published: February 13, 2019

Recently I attended the London OGA Data Science Hackathon, the first hackathon in the UK to focus on the use of machine learning with subsurface data. The event brought together oil and gas professionals from a range of companies and disciplines, including geoscientists, engineers, developers and data scientists.

databases

Postgres Not Equal Only Returns Non-Null Values

1 minute read

Published: June 06, 2025

As part of an analysis, I recently ran a PostgreSQL query to filter results down to rows in which the status column was not equal to “started”. However, after running the query below using the <> operator, I noticed it returned far fewer rows than expected.

Neo4j Super Node Performance Issues

8 minute read

Published: February 26, 2024

We recently developed a knowledge graph containing billions of relationships, representing the wider academic landscape. How we developed the data pipeline to build this graph in Neo4j is a story in itself, one I may write up here or on the Wellcome Data blog. This post, however, focuses on a specific database modelling issue that has caused us severe query performance problems: low cardinality “super nodes”.

image processing

Multimodal Document Classification

39 minute read

Published: October 10, 2020

The automatic classification of documents remains an important and only partially solved information management problem within the upstream oil and gas industry. Companies in this sector have typically amassed vast repositories of unstructured data measuring in the range of tens to thousands of terabytes. An estimated 80% of all data within the upstream oil and gas industry is stored within large unstructured document repositories.

machine learning

Write up of the UK’s first Subsurface Data Science Hackathon

3 minute read

Published: February 13, 2019

Recently I attended the London OGA Data Science Hackathon, the first hackathon in the UK to focus on the use of machine learning with subsurface data. The event brought together oil and gas professionals from a range of companies and disciplines, including geoscientists, engineers, developers and data scientists.

mlops

Isolated uv Dependency Management in Monorepos

4 minute read

Published: January 22, 2026

As a machine learning team we have a number of large projects that have their own repositories, but this is not practical for the large number of smaller, more data science focused projects we carry out. These smaller projects are stored within large thematic monorepos, with each project contained within its own subfolder.

Migrating Sub-Directory While Preserving Git History

1 minute read

Published: January 19, 2026

Recently we decided to open source some code we use internally for bibliometrics analysis and share it in a public repository. However, this code existed as a sub-folder in our analytics monorepo, so the question was how we could migrate the contents of that sub-folder to a new repo while preserving the git history for only those files.

Merging Git Repos into an Existing Repository on GitHub

5 minute read

Published: November 28, 2025

We recently launched a new project seeking to improve the data quality of our impact reporting. This project has grown out of a number of smaller projects, some of which have their own git repository on GitHub. For project and code management reasons, we have decided to combine these into a single monorepo, bringing all components of this project into one place.

Creating Reproducible Data Science Projects

13 minute read

Published: March 11, 2020

A Nightmare Scenario - Imagine you completed a one-off analysis a few months ago, creating a fairly complex data pipeline, machine learning model and visualisations. Fast forward to today and you have Emily, a senior executive at your company, asking you to reuse that work to help solve a similar, time-critical business problem. She looks stressed.

natural language processing

Multimodal Document Classification

39 minute read

Published: October 10, 2020

The automatic classification of documents remains an important and only partially solved information management problem within the upstream oil and gas industry. Companies in this sector have typically amassed vast repositories of unstructured data measuring in the range of tens to thousands of terabytes. An estimated 80% of all data within the upstream oil and gas industry is stored within large unstructured document repositories.

network analysis

Neo4j Super Node Performance Issues

8 minute read

Published: February 26, 2024

We recently developed a knowledge graph containing billions of relationships, representing the wider academic landscape. How we developed the data pipeline to build this graph in Neo4j is a story in itself, one I may write up here or on the Wellcome Data blog. This post, however, focuses on a specific database modelling issue that has caused us severe query performance problems: low cardinality “super nodes”.

statistics

Is The Average of Averages The Same as The Overall Average?

2 minute read

Published: June 13, 2025

There is a simple question that always seems to crop up when discussing everything from analytics dashboards to parallel computation. That is, when you have multiple groups of data (e.g., from different sources or processes), each with its own average, can you simply average those individual averages to get the true average across all the data?

timeseries

Write up of the UK’s first Subsurface Data Science Hackathon

3 minute read

Published: February 13, 2019

Recently I attended the London OGA Data Science Hackathon, the first hackathon in the UK to focus on the use of machine learning with subsurface data. The event brought together oil and gas professionals from a range of companies and disciplines, including geoscientists, engineers, developers and data scientists.

what I learned today

Isolated uv Dependency Management in Monorepos

4 minute read

Published: January 22, 2026

As a machine learning team we have a number of large projects that have their own repositories, but this is not practical for the large number of smaller, more data science focused projects we carry out. These smaller projects are stored within large thematic monorepos, with each project contained within its own subfolder.

Migrating Sub-Directory While Preserving Git History

1 minute read

Published: January 19, 2026

Recently we decided to open source some code we use internally for bibliometrics analysis and share it in a public repository. However, this code existed as a sub-folder in our analytics monorepo, so the question was how we could migrate the contents of that sub-folder to a new repo while preserving the git history for only those files.

Merging Git Repos into an Existing Repository on GitHub

5 minute read

Published: November 28, 2025

We recently launched a new project seeking to improve the data quality of our impact reporting. This project has grown out of a number of smaller projects, some of which have their own git repository on GitHub. For project and code management reasons, we have decided to combine these into a single monorepo, bringing all components of this project into one place.

Justin Boylan-Toomey

data science

Is The Average of Averages The Same as The Overall Average?

Multimodal Document Classification

Creating Reproducible Data Science Projects

Write up of the UK’s first Subsurface Data Science Hackathon

databases

Postgres Not Equal Only Returns Non-Null Values

Neo4j Super Node Performance Issues

image processing

Multimodal Document Classification

machine learning

Write up of the UK’s first Subsurface Data Science Hackathon

mlops

Isolated uv Dependency Management in Monorepos

Migrating Sub-Directory While Preserving Git History

Merging Git Repos into an Existing Repository on GitHub

Creating Reproducible Data Science Projects

natural language processing

Multimodal Document Classification

network analysis

Neo4j Super Node Performance Issues

statistics

Is The Average of Averages The Same as The Overall Average?

timeseries

Write up of the UK’s first Subsurface Data Science Hackathon

what I learned today

Isolated uv Dependency Management in Monorepos

Migrating Sub-Directory While Preserving Git History

Merging Git Repos into an Existing Repository on GitHub