Validate, inspect & standardize identifiers

To make data queryable by an entity identifier, one needs to ensure that identifiers comply with a chosen standard.

Bionty enables this by mapping metadata onto versioned ontologies using validate() and inspect().

For terms that are not directly mappable, standardize() maps synonyms, and lookup() and search() help find the intended term (also see Search & lookup terms).

import bionty as bt
import pandas as pd

Inspect and map synonyms of gene identifiers

To illustrate, let us generate a DataFrame that stores a number of gene identifiers, some of which are corrupted.

data = {
    "gene symbol": ["A1CF", "A1BG", "FANCD1", "corrupted"],
    "ncbi id": ["29974", "1", "5133", "corrupted"],
    "ensembl_gene_id": [
        "ENSG00000148584",
        "ENSG00000121410",
        "ENSG00000188389",
        "ENSGcorrupted",
    ],
}
df_orig = pd.DataFrame(data).set_index("ensembl_gene_id")
df_orig
                 gene symbol    ncbi id
ensembl_gene_id
ENSG00000148584         A1CF      29974
ENSG00000121410         A1BG          1
ENSG00000188389       FANCD1       5133
ENSGcorrupted      corrupted  corrupted

First, we check which of our values validate against the ontology reference.

Tip: available fields are accessible via gene_bt.fields

gene_bt = bt.Gene()

gene_bt
Gene
Species: human
Source: ensembl, release-110
#terms: 75719

📖 Gene.df(): ontology reference table
🔎 Gene.lookup(): autocompletion of terms
🎯 Gene.search(): free text search of terms
✅ Gene.validate(): strictly validate values
🧐 Gene.inspect(): full inspection of values
👽 Gene.standardize(): convert to standardized names
🪜 Gene.diff(): difference between two versions
🔗 Gene.ontology: Pronto.Ontology object
validated = gene_bt.validate(df_orig.index, gene_bt.ensembl_gene_id)
validated
✅ 3 terms (75.00%) are validated
❗ 1 term (25.00%) is not validated: ENSGcorrupted
array([ True,  True,  True, False])
# show not validated terms
df_orig.index[~validated]
Index(['ENSGcorrupted'], dtype='object', name='ensembl_gene_id')
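
Conceptually, strict validation behaves like an exact membership test against the reference table. The following is a minimal sketch of that idea in plain Python, not Bionty's implementation; the `validate_exact` helper and the `reference_ids` set are hypothetical stand-ins for the Ensembl reference.

```python
def validate_exact(values, reference):
    """Return one boolean per value: True if it matches a reference entry exactly."""
    reference = set(reference)
    return [value in reference for value in values]


# Hypothetical stand-in for the reference table (the three valid IDs from the example)
reference_ids = {"ENSG00000148584", "ENSG00000121410", "ENSG00000188389"}
ids = ["ENSG00000148584", "ENSG00000121410", "ENSG00000188389", "ENSGcorrupted"]

mask = validate_exact(ids, reference_ids)
# Keep only the values that failed validation
not_validated = [value for value, ok in zip(ids, mask) if not ok]
print(mask)
print(not_validated)
```

Returning a boolean mask rather than a filtered list keeps the result aligned with the input, which is what makes the `~validated` indexing above possible.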

The same procedure works for NCBI gene IDs and gene symbols. We validate the NCBI gene IDs first, then check which symbols are mappable against the ontology.

gene_bt.validate(df_orig["ncbi id"], gene_bt.ncbi_gene_id)
✅ 3 terms (75.00%) are validated
❗ 1 term (25.00%) is not validated: corrupted
array([ True,  True,  True, False])
validated_symbols = gene_bt.validate(df_orig["gene symbol"], gene_bt.symbol)
✅ 2 terms (50.00%) are validated
❗ 2 terms (50.00%) are not validated: FANCD1, corrupted
df_orig["gene symbol"][~validated_symbols]
ensembl_gene_id
ENSG00000188389       FANCD1
ENSGcorrupted      corrupted
Name: gene symbol, dtype: object

Here, 2 of the gene symbols are not validated. What shall we do? Let’s run a full inspection of these symbols:

gene_bt.inspect(df_orig["gene symbol"], gene_bt.symbol);
✅ 2 terms (50.00%) are validated for symbol
❗ 2 terms (50.00%) are not validated for symbol: FANCD1, corrupted
💡    detected 1 terms with synonym: FANCD1
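
The difference between validation and inspection can be sketched as a three-way split: exact matches, known synonyms, and unknown terms. This is only an illustration of the idea under stated assumptions; `inspect_values`, `names`, and `synonyms` are hypothetical and do not reflect how Bionty stores its reference.

```python
def inspect_values(values, names, synonyms):
    """Split values into validated names, known synonyms, and unknown terms."""
    report = {"validated": [], "synonyms": [], "unknown": []}
    for value in values:
        if value in names:
            report["validated"].append(value)
        elif value in synonyms:
            report["synonyms"].append(value)
        else:
            report["unknown"].append(value)
    return report


# Hypothetical reference: standardized symbols plus a synonym -> symbol table
names = {"A1CF", "A1BG", "BRCA2"}
synonyms = {"FANCD1": "BRCA2"}

report = inspect_values(["A1CF", "A1BG", "FANCD1", "corrupted"], names, synonyms)
print(report)
```

Terms in the `synonyms` bucket can be fixed automatically, while terms in the `unknown` bucket need manual curation.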

inspect() detects synonyms and suggests using .standardize():

# mapping synonyms returns a list of standardized terms:
mapped_symbol_synonyms = gene_bt.standardize(df_orig["gene symbol"])

mapped_symbol_synonyms
💡 standardized 3/4 terms
['A1CF', 'A1BG', 'BRCA2', 'corrupted']

Optionally, return only a mapper of {synonym: standardized name}:

gene_bt.standardize(df_orig["gene symbol"], return_mapper=True)
💡 standardized 3/4 terms
{'FANCD1': 'BRCA2'}
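
The two return modes can be pictured as one synonym table applied in two ways. The sketch below is a simplified illustration, assuming a hypothetical `synonyms` dictionary and a hypothetical `standardize_symbols` helper; it is not the library's implementation.

```python
def standardize_symbols(values, synonyms, return_mapper=False):
    """Replace known synonyms with their standardized name.

    Unknown values pass through unchanged, so the output stays
    aligned with the input.
    """
    mapper = {value: synonyms[value] for value in values if value in synonyms}
    if return_mapper:
        return mapper
    return [mapper.get(value, value) for value in values]


# Hypothetical synonym table with the single synonym used in this guide
synonyms = {"FANCD1": "BRCA2"}
symbols = ["A1CF", "A1BG", "FANCD1", "corrupted"]

standardized = standardize_symbols(symbols, synonyms)
mapper = standardize_symbols(symbols, synonyms, return_mapper=True)
print(standardized)
print(mapper)
```

The mapper form is handy when the values live in an index or column that should be renamed in place rather than replaced wholesale.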

We can use the standardized symbols as the new index:

df_curated = df_orig.reset_index()
df_curated.index = mapped_symbol_synonyms
df_curated
           ensembl_gene_id  gene symbol    ncbi id
A1CF       ENSG00000148584         A1CF      29974
A1BG       ENSG00000121410         A1BG          1
BRCA2      ENSG00000188389       FANCD1       5133
corrupted    ENSGcorrupted    corrupted  corrupted

Standardize and look up unmapped CellMarker identifiers

Depending on how the data was collected and which terminology was used, it is not always possible to curate values. Some values might have used a different standard or be corrupted.

This section will demonstrate how to look up unmatched terms and curate them using CellMarker.

First, we take an example DataFrame whose index contains valid and invalid cell markers (antibody targets) and an additional feature (Time) from a flow cytometry dataset.

markers = pd.DataFrame(
    index=[
        "KI67",
        "CCR7",
        "CD14",
        "CD8",
        "CD45RA",
        "CD4",
        "CD3",
        "CD127a",
        "PD1",
        "Invalid-1",
        "Invalid-2",
        "CD66b",
        "Siglec8",
        "Time",
    ]
)

Let’s instantiate the CellMarker ontology with the default database and version.

cellmarker_bt = bt.CellMarker()

cellmarker_bt
CellMarker
Species: human
Source: cellmarker, 2.0
#terms: 15466

📖 CellMarker.df(): ontology reference table
🔎 CellMarker.lookup(): autocompletion of terms
🎯 CellMarker.search(): free text search of terms
✅ CellMarker.validate(): strictly validate values
🧐 CellMarker.inspect(): full inspection of values
👽 CellMarker.standardize(): convert to standardized names
🪜 CellMarker.diff(): difference between two versions
🔗 CellMarker.ontology: Pronto.Ontology object

Now let’s check which cell markers from the file can be found in the reference:

cellmarker_bt.inspect(markers.index, cellmarker_bt.name);
✅ 6 terms (42.90%) are validated for name
❗ 8 terms (57.10%) are not validated for name: KI67, CCR7, CD14, CD4, CD127a, Invalid-1, Invalid-2, Time
💡    detected 4 terms with inconsistent casing/synonyms: KI67, CCR7, CD14, CD4

Logging suggests we map synonyms:

synonyms_mapper = cellmarker_bt.standardize(markers.index, return_mapper=True)
💡 standardized 10/14 terms

We have now mapped 4 additional terms:

synonyms_mapper
{'KI67': 'Ki67', 'CCR7': 'Ccr7', 'CD14': 'Cd14', 'CD4': 'Cd4'}
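
All four mapped terms differ from the reference only in casing. A case-insensitive match of this kind can be sketched as follows; `case_mapper` and the `reference_names` list are hypothetical, and Bionty may resolve casing and synonyms differently.

```python
def case_mapper(values, reference_names):
    """Map values to reference names that differ only in letter casing."""
    by_lower = {name.lower(): name for name in reference_names}
    mapper = {}
    for value in values:
        standard = by_lower.get(value.lower())
        # Record only values that exist in the reference under a different casing
        if standard is not None and standard != value:
            mapper[value] = standard
    return mapper


# Hypothetical reference names, chosen to be consistent with the outputs above
reference_names = [
    "Ki67", "Ccr7", "Cd14", "Cd4", "CD8", "CD45RA",
    "CD3", "PD1", "CD66b", "Siglec8", "CD127",
]
marker_names = [
    "KI67", "CCR7", "CD14", "CD8", "CD45RA", "CD4", "CD3",
    "CD127a", "PD1", "Invalid-1", "Invalid-2", "CD66b", "Siglec8", "Time",
]

casing_mapper = case_mapper(marker_names, reference_names)
print(casing_mapper)
```

Terms that already match exactly, such as CD8, are left out of the mapper, and terms with a real spelling difference, such as CD127a, are not touched.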

Let’s replace the synonyms with standardized names in the markers DataFrame:

markers.rename(index=synonyms_mapper, inplace=True)

After standardization, 4 terms are still not found in the reference.

Among them, Time, Invalid-1 and Invalid-2 are non-marker channels, which won’t be curated by CellMarker.

cellmarker_bt.inspect(markers.index, cellmarker_bt.name);
✅ 10 terms (71.40%) are validated for name
❗ 4 terms (28.60%) are not validated for name: CD127a, Invalid-1, Invalid-2, Time

CD127a is not found in the reference, so let’s check the lookup with auto-completion:

lookup = cellmarker_bt.lookup()
lookup.cd127
CellMarker(name='CD127', synonyms='', gene_symbol='IL7R', ncbi_gene_id='3575', uniprotkb_id='P16871', _5='cd127')

Indeed, the marker should be CD127; CD127a was a typo.

Now let’s fix the markers so all of them can be linked:

Tip

Using .lookup() instead of typing a string helps eliminate possible typos!

curated_df = markers.rename(index={"CD127a": lookup.cd127.name})
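
Why attribute access guards against typos can be shown with a small stand-in for the lookup object. This sketch assumes a hypothetical `Marker` record and builds the lookup with `types.SimpleNamespace`; the real lookup object is richer and is not implemented this way as far as this guide shows.

```python
from collections import namedtuple
from types import SimpleNamespace

# Hypothetical record type; field values are taken from outputs shown in this guide
Marker = namedtuple("Marker", ["name", "gene_symbol"])
records = [Marker("CD127", "IL7R"), Marker("CD120a", "TNFRSF1A")]

# Expose each record as an attribute keyed by its lower-cased name
lookup = SimpleNamespace(**{record.name.lower(): record for record in records})

standard_name = lookup.cd127.name
print(standard_name)

# A misspelled attribute fails loudly instead of silently producing a wrong key
try:
    lookup.cd127a
    typo_caught = False
except AttributeError:
    typo_caught = True
print(typo_caught)
```

A mistyped string literal would pass through a rename unnoticed, whereas a mistyped attribute raises immediately.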

Optionally, run a fuzzy match:

cellmarker_bt.search("CD127a").head()
        synonyms  gene_symbol  ncbi_gene_id  uniprotkb_id  __agg__  __ratio__
name
CD127                    IL7R          3575        P16871    cd127  90.909091
CD1                      CD1A           910        P29016      cd1  90.000000
CD120a               TNFRSF1A          7132        P19438   cd120a  83.333333
CD167a                   None          None          None   cd167a  83.333333
CD172a                   None          None          None   cd172a  83.333333
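
For readers who want to see the idea behind fuzzy matching without the library, here is a rough sketch using the standard library's `difflib`. It uses a different similarity score than the `__ratio__` column above, so the ranking of lower hits can differ; `fuzzy_search` and the `names` list are hypothetical.

```python
import difflib


def fuzzy_search(query, names, n=3, cutoff=0.6):
    """Return up to n reference names that most closely resemble the query."""
    # Compare lower-cased strings so that casing does not affect the score
    by_lower = {name.lower(): name for name in names}
    hits = difflib.get_close_matches(query.lower(), list(by_lower), n=n, cutoff=cutoff)
    return [by_lower[hit] for hit in hits]


# Hypothetical reference names, taken from the search result shown above
names = ["CD127", "CD1", "CD120a", "CD167a", "CD172a"]

matches = fuzzy_search("CD127a", names)
print(matches)
```

With this scoring, the closest candidate for "CD127a" is "CD127", which agrees with the top hit of the search above.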

Now we can run inspect again: all cell markers are linked, and only the non-marker channels remain unvalidated.

cellmarker_bt.inspect(curated_df.index, cellmarker_bt.name);
✅ 11 terms (78.60%) are validated for name
❗ 3 terms (21.40%) are not validated for name: Invalid-1, Invalid-2, Time