Posts by Tags

data science

Is The Average of Averages The Same as The Overall Average?

2 minute read

Published:

There is a simple question that always seems to crop up when discussing everything from analytics dashboards to parallel computation. That is, when you have multiple groups of data (e.g., from different sources or processes), each with its own average, can you simply average those individual averages to get the true average across all the data?
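As a quick illustration of the question, here is a minimal sketch with made-up numbers. The group names and values are hypothetical, not data from the post. It compares the plain average of the group averages with the overall average and with a size-weighted average of the group averages.

```python
from statistics import mean

# Hypothetical groups of unequal size (illustrative values only).
groups = {
    "source_a": [10, 20, 30],
    "source_b": [100],
}

# Average of each group taken on its own.
group_means = {name: mean(values) for name, values in groups.items()}

# Naive approach: average the group averages, ignoring group sizes.
naive = mean(list(group_means.values()))

# True overall average: pool every observation first.
overall = mean([v for values in groups.values() for v in values])

# Weighting each group average by its group size recovers the overall average.
total_count = sum(len(values) for values in groups.values())
weighted = sum(group_means[name] * len(values) for name, values in groups.items()) / total_count

print(naive, overall, weighted)
```

With unequal group sizes the naive figure differs from the overall average, while the size-weighted figure matches it. If every group had the same number of observations, all three would agree.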

Multimodal Document Classification

39 minute read

Published:

The automatic classification of documents remains an important and only partially solved information management problem within the upstream oil and gas industry. Companies in this sector have typically amassed vast repositories of unstructured data measuring in the range of tens to thousands of terabytes. An estimated 80% of all data within the upstream oil and gas industry is stored within large unstructured document repositories.

Creating Reproducible Data Science Projects

13 minute read

Published:

A Nightmare Scenario - Imagine you completed a one-off analysis a few months ago, creating a fairly complex data pipeline, machine learning model and visualisations. Fast forward to today and you have Emily, a senior executive at your company, asking you to reuse that work to help solve a similar, time-critical business problem. She looks stressed.

Write up of the UK’s first Subsurface Data Science Hackathon

3 minute read

Published:

Recently I attended the London OGA Data Science Hackathon, the first hackathon in the UK to focus on the use of machine learning with subsurface data. The event brought together oil and gas professionals from a range of companies and disciplines, including geoscientists, engineers, developers and data scientists.

databases

Postgres Not Equal Only Returns Non-Null Values

1 minute read

Published:

As part of an analysis, I recently ran a PostgreSQL query to return only the rows in which the status column was not equal to “started”. However, after running the query using the <> operator, I noticed it returned far fewer rows than expected.
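The behaviour can be reproduced in a small sketch. This one uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, on the assumption that both follow the standard SQL rule that comparing NULL with <> yields neither true nor false. The jobs table and its rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for PostgreSQL (an assumption for
# illustration; the table and rows below are hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO jobs (status) VALUES (?)",
    [("started",), ("finished",), (None,)],
)

# A plain <> comparison silently drops rows where status is NULL.
not_equal = conn.execute(
    "SELECT id FROM jobs WHERE status <> 'started' ORDER BY id"
).fetchall()

# Handling NULL explicitly keeps those rows.
null_safe = conn.execute(
    "SELECT id FROM jobs WHERE status <> 'started' OR status IS NULL ORDER BY id"
).fetchall()

print(not_equal)
print(null_safe)
conn.close()
```

The first query returns only the row with status "finished", while the second also returns the row whose status is NULL. In PostgreSQL, the IS DISTINCT FROM operator is another way to write the null-safe comparison.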

Neo4j Super Node Performance Issues

8 minute read

Published:

We recently developed a knowledge graph with several billion relationships, representing the wider academic landscape. How we developed the data pipeline to build this graph in Neo4j is a story in itself, one I may write up here or on the Wellcome Data blog. This post, however, focuses on a specific database modelling issue that has caused us severe query performance problems: low cardinality “super nodes”.

image processing

Multimodal Document Classification

39 minute read

Published:

The automatic classification of documents remains an important and only partially solved information management problem within the upstream oil and gas industry. Companies in this sector have typically amassed vast repositories of unstructured data measuring in the range of tens to thousands of terabytes. An estimated 80% of all data within the upstream oil and gas industry is stored within large unstructured document repositories.

machine learning

Write up of the UK’s first Subsurface Data Science Hackathon

3 minute read

Published:

Recently I attended the London OGA Data Science Hackathon, the first hackathon in the UK to focus on the use of machine learning with subsurface data. The event brought together oil and gas professionals from a range of companies and disciplines, including geoscientists, engineers, developers and data scientists.

mlops

Merging Git Repos into an Existing Repository on GitHub

5 minute read

Published:

We recently launched a new project seeking to improve the data quality of our impact reporting. This project has grown out of a number of smaller projects, some of which have their own Git repository on GitHub. For project and code management reasons, we have decided to combine these into a single monorepo, bringing all components of this project into one place.

Creating Reproducible Data Science Projects

13 minute read

Published:

A Nightmare Scenario - Imagine you completed a one-off analysis a few months ago, creating a fairly complex data pipeline, machine learning model and visualisations. Fast forward to today and you have Emily, a senior executive at your company, asking you to reuse that work to help solve a similar, time-critical business problem. She looks stressed.

natural language processing

Multimodal Document Classification

39 minute read

Published:

The automatic classification of documents remains an important and only partially solved information management problem within the upstream oil and gas industry. Companies in this sector have typically amassed vast repositories of unstructured data measuring in the range of tens to thousands of terabytes. An estimated 80% of all data within the upstream oil and gas industry is stored within large unstructured document repositories.

network analysis

Neo4j Super Node Performance Issues

8 minute read

Published:

We recently developed a knowledge graph with several billion relationships, representing the wider academic landscape. How we developed the data pipeline to build this graph in Neo4j is a story in itself, one I may write up here or on the Wellcome Data blog. This post, however, focuses on a specific database modelling issue that has caused us severe query performance problems: low cardinality “super nodes”.

statistics

Is The Average of Averages The Same as The Overall Average?

2 minute read

Published:

There is a simple question that always seems to crop up when discussing everything from analytics dashboards to parallel computation. That is, when you have multiple groups of data (e.g., from different sources or processes), each with its own average, can you simply average those individual averages to get the true average across all the data?

timeseries

Write up of the UK’s first Subsurface Data Science Hackathon

3 minute read

Published:

Recently I attended the London OGA Data Science Hackathon, the first hackathon in the UK to focus on the use of machine learning with subsurface data. The event brought together oil and gas professionals from a range of companies and disciplines, including geoscientists, engineers, developers and data scientists.

what I learned today

Merging Git Repos into an Existing Repository on GitHub

5 minute read

Published:

We recently launched a new project seeking to improve the data quality of our impact reporting. This project has grown out of a number of smaller projects, some of which have their own Git repository on GitHub. For project and code management reasons, we have decided to combine these into a single monorepo, bringing all components of this project into one place.