2. Extracting more data for local analysis
In the last notebook, we saw that the /works API can do some clever querying and filtering. However, we often have questions which can't be answered by the API by itself. In those cases, it's useful to collect a large amount of data from the API and then analyse it locally.
In this notebook, we'll try to query the API for bigger chunks of data so that we can answer a more interesting question.
We'll aim to find out:
If we filter the works API for a set of subjects, can we find the other subjects that most commonly co-occur with them?
We'll start by fetching all of the works which are tagged with a single subject.
Here's our base URL again:
base_url = "https://api.wellcomecollection.org/catalogue/v2/"
Let's make a request to the API, asking for all the works which are tagged with the subject "Influenza".
import requests
response = requests.get(
    base_url + "works", params={"subjects.label": "Influenza"}
).json()
response["totalResults"]
2.1 Page sizes
response["totalPages"]
len(response["results"])
At the moment, we're getting our results spread across 9 pages, because pageSize is set to 10 by default. We can increase the pageSize to get all 81 of our works in one go (up to a maximum of 100):
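The relationship between these numbers is simple: totalPages is totalResults divided by pageSize, rounded up. A quick sketch, using the 81-result count we saw above:

```python
import math

total_results = 81  # the totalResults we saw for "Influenza"

# with the default pageSize of 10, the results span 9 pages
assert math.ceil(total_results / 10) == 9

# with pageSize=100, everything fits on a single page
assert math.ceil(total_results / 100) == 1
```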
import requests
response = requests.get(
    base_url + "works", params={"subjects.label": "Influenza", "pageSize": 100}
).json()
response["totalResults"]
response["totalPages"]
2.2 Requesting multiple pages of results
Some subjects only appear on a few works, but others appear on thousands. If we want to be able to analyse those larger subjects, we'll need to fetch more than 100 works at a time. To do this, we'll page through the results, making multiple requests and building a local list of results as we go.
If the API finds more than one page of results for a query, it will provide a nextPage field in the response, with a link to the next page of results. We can use this to fetch the next page of results, and the next, and the next, until the nextPage field is no longer present, at which point we know we've got all the results.
We're going to use these results to answer our question from the introduction, so we'll also ask the API to include the subjects which are associated with each work, and collect them too.
from tqdm.auto import tqdm
results = []
# fetch the first page of results
response = requests.get(
    base_url + "works",
    params={
        "subjects.label": "England",
        "include": "subjects",
        "pageSize": "100",
    },
).json()

# start a progress bar to keep track of how many results we've fetched
progress_bar = tqdm(total=response["totalResults"])

# add our results to the list and update our progress bar
results.extend(response["results"])
progress_bar.update(len(response["results"]))

# as long as there's a "nextPage" key in the response, keep fetching results,
# adding them to the list, and updating the progress bar
while "nextPage" in response:
    response = requests.get(response["nextPage"]).json()
    results.extend(response["results"])
    progress_bar.update(len(response["results"]))

progress_bar.close()
works_about_england = results
Let's check that we've got the correct number of results:
len(works_about_england) == response["totalResults"]
Great! Now let's try collecting works for a second subject:
results = []
response = requests.get(
    base_url + "works",
    params={
        "subjects.label": "Germany",
        "include": "subjects",
        "pageSize": "100",
    },
).json()

progress_bar = tqdm(total=response["totalResults"])
results.extend(response["results"])
progress_bar.update(len(response["results"]))

while "nextPage" in response:
    response = requests.get(response["nextPage"]).json()
    results.extend(response["results"])
    progress_bar.update(len(response["results"]))

progress_bar.close()
works_about_germany = results
2.3 Analyzing our two sets of results
Let's find the works which are tagged with both subjects by filtering the results of the first list by IDs from the second list.
ids_from_works_about_england = set([work["id"] for work in works_about_england])
works_about_england_and_germany = [
    work
    for work in works_about_germany
    if work["id"] in ids_from_works_about_england
]
len(works_about_england_and_germany)
works_about_england_and_germany
That's 32 works which are tagged with both England and Germany. Let's see if we can find the other subjects which most commonly appear on these works, using a Counter to figure that out:
N.B. We're collecting the concepts on each work because they are the atomic constituent parts of subjects. Our catalogue includes subjects like "Surgery - 18th Century" which are made up of the concepts "Surgery" and "18th Century". It's more desirable to compare the concepts, because the subjects can be so specific and are less likely to overlap.
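As a concrete illustration of that nested structure, here's a hand-written (not fetched) sketch of a work's subjects field, and a flattening that pulls out the individual concept labels. The field names match the real API responses; the values are invented:

```python
# a hand-written sketch of the nested subjects/concepts structure described above
work = {
    "id": "abcd1234",  # invented id, for illustration only
    "subjects": [
        {
            "label": "Surgery - 18th Century",
            "concepts": [{"label": "Surgery"}, {"label": "18th Century"}],
        },
        {"label": "Medicine", "concepts": [{"label": "Medicine"}]},
    ],
}

# flatten the nesting: each subject contains a list of concepts
concept_labels = [
    concept["label"]
    for subject in work["subjects"]
    for concept in subject["concepts"]
]
print(concept_labels)  # ['Surgery', '18th Century', 'Medicine']
```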
from collections import Counter
concepts = Counter()
for record in works_about_england_and_germany:
    # we need to navigate the nested structure of the subject and its concepts to
    # get the complete list of _concepts_ on each work
    for subject in record["subjects"]:
        for concept in subject["concepts"]:
            concepts.update([concept["label"]])
The Counter object keeps track of the counts of each unique item we pass to it. Now that we've added the complete list, we can ask it for the most common items:
concepts.most_common(20)
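To see the same mechanism in miniature, here's a Counter fed a few repeated labels by hand:

```python
from collections import Counter

counts = Counter()
counts.update(["Medicine", "Surgery"])
counts.update(["Medicine"])

# most_common returns (item, count) pairs, highest count first
print(counts.most_common(2))  # [('Medicine', 2), ('Surgery', 1)]
```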
Great! We've solved our original problem:
If we filter the works API for a set of subjects, can we find the other concepts that most commonly co-occur with them?
2.4 Creating a generic function for finding subject intersections
Now that we've solved this problem, let's try to make it more generic so that we can use it for other pairs of subjects.
We can re-use a lot of the code we've already written, and wrap it in a couple of reusable function definitions.
def get_subject_results(subject):
    response = requests.get(
        base_url + "works",
        params={
            "subjects.label": subject,
            "include": "subjects",
            "pageSize": "100",
        },
    ).json()

    progress_bar = tqdm(total=response["totalResults"])
    results = response["results"]
    progress_bar.update(len(response["results"]))

    while "nextPage" in response:
        response = requests.get(response["nextPage"]).json()
        results.extend(response["results"])
        progress_bar.update(len(response["results"]))

    progress_bar.close()
    return results
def find_intersecting_subject_concepts(subject_1, subject_2, n=20):
    subject_1_results = get_subject_results(subject_1)
    subject_2_results = get_subject_results(subject_2)

    subject_2_ids = set(result["id"] for result in subject_2_results)
    intersecting_results = [
        result for result in subject_1_results if result["id"] in subject_2_ids
    ]

    concepts = Counter()
    for record in intersecting_results:
        for subject in record["subjects"]:
            for concept in subject["concepts"]:
                concepts.update([concept["label"]])

    return concepts.most_common(n)
Calling the find_intersecting_subject_concepts() function with any two subjects will return a list of the most common concepts (with their counts) found on the works which are tagged with both subjects.
find_intersecting_subject_concepts("Europe", "United States")
find_intersecting_subject_concepts("Vomiting", "Witchcraft")
Exercises
- Try running the function with different subjects. Use the API to find two subjects which appear on a few hundred or a few thousand works, and see if you can find the most common concepts which appear on both of them.
- Adapt the code to compare an arbitrary number of subjects, rather than just two.