Introducing SBPAX – A New Glue to Integrate SBML/VCML-type Models with BioPAX-type Pathways
Computational biology has two thriving communities, each with a large pool of data and its own established exchange formats: the modeling community, with formats like SBML and VCML, and the pathway exchange community around BioPAX. Major efforts have been invested in integrating data from both communities, but these efforts suffer from semantic differences between the formats.
SBPAX is an RDF schema that serves as a glue to integrate model-centered formats similar to SBML or VCML with pathway-centered formats similar to BioPAX, despite their semantic differences. The discussion here focuses on SBML and BioPAX Level 2, but SBPAX can easily be adapted to work with any format whose semantic structure is similar to either of these.
BioPAX and SBML are two major standards in the computational biology community that describe similar things: both typically deal with biochemical reactions and the substances participating in them. This suggests great benefit from exchanging data between the BioPAX and SBML communities, and in fact the similarities between the two have encouraged several efforts to design software that converts from one format to the other. Such conversions are always lossy for general sets of data, but are often still useful.
Under simple conditions, when one does not have to worry about catalysts, different locations or small modifications on large molecules, conversion between SBML and BioPAX can be straightforward. Under these conditions, a direct conversion would map SBML's reactions to BioPAX's conversion and SBML's species to BioPAX's physicalEntity. The information contained in BioPAX's physicalEntityParticipant (such as location or stoichiometric coefficient) would be divided between SBML's reactions and species.
The participation of a catalyst reveals some structural differences between BioPAX and SBML, but can usually be converted losslessly: in SBML, the participation of a catalyst is expressed analogously to the participation of a reactant or product, while BioPAX has special classes (control and its derivatives) to link catalysts to the conversion they affect.
The most obvious sources of loss are kinds of information supported in one format but not in the other: BioPAX, unlike SBML, classifies substances into classes such as proteins, DNA, RNA, complexes and smallMolecules, and supports storing various references. SBML, on the other hand, supports a range of quantitative data essential to modeling that BioPAX does not support, including rate laws and initial conditions.
Less obvious but more severe are semantic differences between SBML and BioPAX types. Some of these differences manifest themselves in a different placement of various pieces of information, such as location or stoichiometric coefficients. Others are more subtle: the information is not merely organized differently, but its meaning is not the same, even though it appears to be at first glance.
This difference is crucial for integration. If some information is supported by one format but not the other, then as long as the semantic structures are compatible, it is straightforward to create such support by extensions and then map one-to-one. Often, however, the fact that one format supports some kind of information while the other does not indicates that the semantic structures are not compatible. In that case no one-to-one map is possible, and we instead have to invest in creating a meaningful set of relationships. In other words, we have to create some glue.
SBPAX is designed to be this glue: it serves to describe relationships between the terms used to describe SBML-type models and those used to describe BioPAX-type pathways.
Open Worlds versus File Boundaries
In SBML, a file contains exactly one model and a model is contained in exactly one file. SBML does not support relationships between data contained in different files. The user may know that something in file A is identical or somehow related to something in file B, but cannot express this in SBML.
In contrast, BioPAX provides mechanisms for identification. Since BioPAX is based on RDF, all things are resources, each of which is regularly assigned an RDF ID designed to identify the thing anywhere in the universe. This comes naturally with what is called the Open World Assumption in the RDF world, which means that for any thing, there may be additional information somewhere else. In addition, BioPAX supports references to database entries on substances, which can identify various molecules, especially proteins. Using these references to assert that a physicalEntity in file A is identical to a physicalEntity in file B is considered one of the primary achievements by some BioPAX users and developers.
In SBML, on the other hand, the focus is on models and therefore reactions and species are seen primarily as parts of models. Identifying the boundaries of a file with the boundaries of a model is the simplest and most natural choice.
If an SBML file is converted to a BioPAX file, this conserves the file boundaries, but not the meaning of the file boundaries. If such data is then circulated in the BioPAX community, users would expect that no relevant information is lost by moving entries between files, but instead they do lose the information about which entries belonged to the same model. While it is a convention in SBML that a model corresponds to exactly one file, such a convention goes against basic BioPAX and RDF philosophies and is therefore unenforceable.
SBPAX provides a simple solution: an RDF resource represents the model, and there are relationships between the model and all the resources that represent the parts of the model. This way, models remain intact in BioPAX, even if pieces of data are moved around.
But there is even more to it: not only does SBPAX provide the expressive power of SBML to define models, but we also get for free something that SBML does not provide naturally, namely the capacity to say that the same thing is part of more than one model. In SBML, we would have to include the thing in more than one file, and stating that the copies are the same is not supported. With SBPAX, however, it simply means that more than one model refers to the same resource.
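The idea can be illustrated with a minimal Python sketch that uses plain tuples as stand-ins for RDF triples. The property name sbpax:partOf and all resource names below are hypothetical, chosen only for illustration; they are not taken from the SBPAX schema.

```python
# Triples are (subject, predicate, object) tuples; all names are illustrative.
PART_OF = "sbpax:partOf"  # hypothetical property name, not from the SBPAX schema

triples = {
    ("ex:glucose", PART_OF, "ex:modelA"),
    ("ex:atp", PART_OF, "ex:modelA"),
    ("ex:atp", PART_OF, "ex:modelB"),  # the same resource belongs to two models
    ("ex:insulin", PART_OF, "ex:modelB"),
}


def parts_of(model, graph):
    """Return all resources linked to the given model resource."""
    return {s for (s, p, o) in graph if p == PART_OF and o == model}


def models_of(resource, graph):
    """Return all models that the given resource is part of."""
    return {o for (s, p, o) in graph if p == PART_OF and s == resource}


# Splitting the triples across two "files" and merging them again
# does not lose any model membership, because membership is explicit.
file_1 = {t for t in triples if t[0] == "ex:atp"}
file_2 = triples - file_1
merged = file_1 | file_2

print(sorted(parts_of("ex:modelA", merged)))
print(sorted(models_of("ex:atp", merged)))
```

Because membership is stored as an explicit relationship rather than implied by a file boundary, the same resource (here ex:atp) can be shared by several models without duplication.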
Substances
Systems biology deals with a rich and diverse assortment of substances, ranging from tiny molecules like water and alcohol, through larger ones like glucose and ATP, to even larger ones like proteins, which themselves range from small peptides like insulin to large complexes such as cytochrome c oxidase.
Each of these terms refers to a class of compounds, often molecules, or at least consisting largely of molecules. Whether two compounds belong to the same class is not a trivial question: small molecules are not in the same class if they differ in an atom, a bond or a charge, while large molecules are often still considered the same class even if entire molecular groups, such as phospho groups, are added. Epidermal growth factor receptor (EGFR) is still EGFR even if it is phosphorylated multiple times, but water ceases to be water if we add or remove a single hydrogen atom. This inconsistency poses a challenge for any classification scheme intending to cover the whole range of substances relevant in systems biology.
BioPAX approaches the problem by categorization: substances fall under physicalEntity, which has five subclasses: rna, dna, protein, complex and smallMolecule. While it is not clearly stated in the documentation, rna, dna and protein entities are identified primarily by their sequence; in particular, such an entity is still considered the same object after small modifications such as phosphorylations. If these modifications are relevant for an interaction, the interaction will refer to a sequenceParticipant, which refers to the necessary information. By definition, everything that is not a dna, rna, protein or complex is a smallMolecule, even if it is very large, and a smallMolecule is typically identified by its chemical structure. While this approach is intuitive, it is also inconsistent.
In SBML, there is practically no limit on what substances can be covered by a species, as long as it makes sense within a particular model. A species may, as SBML users sometimes like to point out, cover items that are not substances at all, such as cells or radiation. Among other things, a species may be a protein, or cover only certain states of a protein, or it may be a group of similar proteins. In particular, a species in one model might overlap, but not coincide, with a species in another model. What matters is that it is some useful class that can be quantified and can participate in reactions, and that it is possible to formulate the corresponding rate laws for these reactions.
Note that to describe substances, an ontology contains classes of classes. For example, protein in BioPAX is a class with instances such as I ("insulin"), A ("actin") or D ("succinate dehydrogenase"). Each of these instances is itself a class containing the actual molecules: for example, class I ("insulin") contains i1 ("an insulin molecule") and i2 ("another insulin molecule"), and A contains a1 ("an actin molecule") and a2 ("another actin molecule").
Representing substances by classes is essential for relating various substances to each other by subclasses and superclasses. For example, the substance "enzyme" is a superclass of "succinate dehydrogenase". Subclasses can be formed on the basis of a variety of distinctions, for example "insulin" has subclasses such as "human insulin" or "insulin in the blood" and so on, or "EGFR" has subclasses "EGFR phosphorylated" or "EGFR unphosphorylated".
The conclusion is that an SBML species S and a BioPAX physicalEntity E can relate to each other in different ways: S may be identical to E, or a subset, or a superset, or overlap otherwise, or be disjoint. Where E is not fully covered by S, it is further of interest how each connected physicalEntityParticipant corresponds to S. Similar considerations apply to SBML's speciesType, which is a class covering species that are chemically identical but have different locations.
SBPAX provides properties to express all these relations.
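The five possible relations can be sketched in Python by treating each class as a finite set of instances. This is only a toy illustration: real substance classes are not enumerable, the function name relate and the instance labels are hypothetical, and the returned strings are not SBPAX property names.

```python
def relate(s, e):
    """Classify how two classes, given as finite sets of instances, relate."""
    s, e = set(s), set(e)
    if s == e:
        return "identical"
    if s < e:
        return "subset"
    if s > e:
        return "superset"
    if s & e:
        return "overlap"
    return "disjoint"


# Hypothetical instance labels standing in for states of molecules.
entity_egfr = {"EGFR", "EGFR-P", "EGFR-PP"}       # a physicalEntity E
species_phospho = {"EGFR-P", "EGFR-PP"}           # covers only some states of E
species_mixed = {"EGFR-P", "ErbB2-P"}             # a group of similar proteins
species_water = {"H2O"}

print(relate(species_phospho, entity_egfr))
print(relate(species_mixed, entity_egfr))
print(relate(species_water, entity_egfr))
```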
Reactions
By analogy with substances being classes of things, reactions are classes of events. For example, BioPAX's biochemicalReaction has an instance R, which is called "Succinate and ubiquinone react to fumarate and ubiquinol under catalysis of succinate dehydrogenase". R is related to five substances, which we shall call S ("succinate"), N ("ubiquinone"), F ("fumarate"), L ("ubiquinol") and D ("succinate dehydrogenase"). This means that any instance r1 of R can be defined and identified by some instance s1 from S, n1 from N, f1 from F, l1 from L and d1 from D (assuming, for simplicity, that every instance of a substance lives only once, from one reaction to the next, and that any attempt to restore it leads to a different instance). Another instance of R is r2, which is identified by s2, n2, f2, l2 and d2 accordingly.
In fact, the canonical way to define R is the class of all reactions of one reactant each from S and N, one product each from F and L, and one catalyst from D.
If we replace any of these classes with a subclass or superclass, the resulting class of reactions will be a subclass or superclass of the original class of reactions. For example, if we replace D ("succinate dehydrogenase") with Dh ("human succinate dehydrogenase"), we turn R into Rh, a subclass of R, which would read:
"Succinate and ubiquinone react to fumarate and ubiquinol under catalysis of human succinate dehydrogenase"
Analogous to the case for substances, SBPAX provides terms to describe the relations between an SBML reaction R and a BioPAX biochemicalReaction B. Again, R and B might be identical, or B might be a subclass or superclass of R, or they may overlap otherwise, or they may be disjoint.
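The claim that narrowing a participant class narrows the reaction class can be checked with a small Python sketch. It models a reaction class as the set of all combinations of one instance per participant class. This is a simplification: it ignores the assumption above that each instance lives only once, and the function name reaction_class and all instance labels are hypothetical.

```python
from itertools import product


def reaction_class(*participant_classes):
    """All events formed by picking one instance from each participant class."""
    return set(product(*participant_classes))


# Toy participant classes with hypothetical instance labels.
S = {"s1", "s2"}              # succinate
N = {"n1"}                    # ubiquinone
F = {"f1"}                    # fumarate
L = {"l1"}                    # ubiquinol
D = {"d_human", "d_yeast"}    # succinate dehydrogenase
Dh = {"d_human"}              # human succinate dehydrogenase, a subclass of D

R = reaction_class(S, N, F, L, D)
Rh = reaction_class(S, N, F, L, Dh)

# Dh is a proper subclass of D, so Rh is a proper subclass of R.
print(Dh < D, Rh < R)
```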
Reaction Kinetics
SBML supports rate laws for its reactions, which is crucial for modeling, while BioPAX does not. We might be tempted to think that an easy fix to make the two compatible would be to add rate laws to BioPAX's conversions, but there is a deeper semantic problem. This problem has to do with BioPAX describing pathways, while SBML describes models of pathways. In other words, we face a distinction between objective and subjective information.
Some information about reactions is objective, most importantly the participants and their stoichiometric coefficients. Once we have complete and correct knowledge of the participants and their stoichiometric coefficients, we will agree on what they are with anyone who also has this knowledge. Models, on the other hand, are subjective information, meaning they depend not only on the truth about the subject matter, but also on the requirements and preferences of the author or user of a model. The rate law is an example of this: it is by nature an approximation, and whether it is sufficient or not depends on the purpose the model serves. For two different applications, the best choice of rate law for the same reaction may very well be different.
For SBPAX, we do not see this as a problem, but as an opportunity: we simply define two different terms, Reaction and ReactionModel. To a Reaction we attach all objective information, such as participants and stoichiometric coefficients, while subjective information such as the rate law is attached to the ReactionModel. A ReactionModel is attached to one Reaction, but a Reaction can be attached to several ReactionModels. The PathwayModel naturally refers not to Reactions directly, but to ReactionModels.
This also allows us to make the following distinction: if two PathwayModels contain the same Reaction, they may use either different ReactionModels for it or the same ReactionModel. Only by separating Reaction and ReactionModel can this difference be properly expressed.
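The separation can be sketched with Python dataclasses. The class names follow the terms above, but the field names, the representation of rate laws as strings and the example reaction are assumptions made for illustration, not part of the SBPAX schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Reaction:
    """Objective information: participants and stoichiometric coefficients."""
    name: str
    reactants: tuple  # (substance, coefficient) pairs
    products: tuple   # (substance, coefficient) pairs


@dataclass(frozen=True)
class ReactionModel:
    """Subjective information attached to exactly one Reaction."""
    reaction: Reaction
    rate_law: str     # assumed here to be a plain formula string


@dataclass
class PathwayModel:
    """Refers to ReactionModels, never to Reactions directly."""
    name: str
    reaction_models: list = field(default_factory=list)

    def reactions(self):
        return {rm.reaction for rm in self.reaction_models}


r = Reaction("A -> B", (("A", 1),), (("B", 1),))
mass_action = ReactionModel(r, "k * [A]")
michaelis_menten = ReactionModel(r, "Vmax * [A] / (Km + [A])")

m1 = PathwayModel("m1", [mass_action])
m2 = PathwayModel("m2", [michaelis_menten])  # same Reaction, different ReactionModel
m3 = PathwayModel("m3", [mass_action])       # same Reaction, same ReactionModel

print(m1.reactions() == m2.reactions())                  # same Reaction in both
print(m1.reaction_models[0] == m2.reaction_models[0])    # different ReactionModels
print(m1.reaction_models[0] is m3.reaction_models[0])    # shared ReactionModel
```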
Stoichiometric Coefficients
Stoichiometric coefficients are simple items, except that they are used in two different ways. First, there is the actual number of compounds participating in a single actual reaction. It is always an integer without a unit, and it is not subject to choice. Second, a multiple of the actual numbers is often given instead, for a variety of reasons: because the actual number is not known, to simplify the numbers, or to attach convenient units.
SBPAX represents stoichiometric coefficients as resources with two properties: the actual coefficient, a number, and the effective coefficient, an observable consisting of a number and a unit (although it can also be without a unit). If no effective value is given, the actual value is used as the effective value. A ReactionModel can provide its own stoichiometric coefficients, or inherit them from the Reaction, or multiply the coefficients of the Reaction by a factor, which is also an observable and can carry a unit. The multiple will have a modified effective value but the same actual value.
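A minimal Python sketch of these rules follows. The class and method names are hypothetical, units are represented as plain strings, and combining units by joining the strings is a deliberate simplification rather than real unit arithmetic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Observable:
    value: float
    unit: str = ""  # empty string stands for "no unit"


@dataclass(frozen=True)
class Coefficient:
    actual: int                             # integer count, never rescaled
    effective: Optional[Observable] = None  # optional number with unit

    def effective_value(self):
        """Fall back to the actual value if no effective value is given."""
        if self.effective is None:
            return Observable(self.actual)
        return self.effective

    def scaled(self, factor):
        """Multiply by a factor: new effective value, same actual value."""
        base = self.effective_value()
        # Naive unit handling: join the non-empty unit strings.
        unit = " ".join(u for u in (base.unit, factor.unit) if u)
        return Coefficient(self.actual, Observable(base.value * factor.value, unit))


from_reaction = Coefficient(actual=2)                 # as stated on the Reaction
in_model = from_reaction.scaled(Observable(0.5, "mol"))  # as used by a ReactionModel

print(from_reaction.effective_value())
print(in_model)
```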
Observables and Initial Conditions
SBPAX provides observables to represent stoichiometric coefficients, quantities of substances or other properties. An observable consists essentially of a number and a unit, where units carry dimensions. All important units and dimensions are provided.
An important special observable is the quantity, which is connected to a substance and quantifies this substance, for example in terms of concentration, mass or molarity. Time is an observable with a unit of dimension time. A time slice consists of a time and an arbitrary number of other observables, typically including quantities of all substances. A PathwayModel has a time slice as its initial conditions.
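These terms can be sketched as Python dataclasses. The class names mirror the terms above, while the field names, the string representation of dimensions and the example values are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    symbol: str
    dimension: str  # e.g. "time" or "concentration"


@dataclass(frozen=True)
class Observable:
    value: float
    unit: Unit


@dataclass(frozen=True)
class Quantity(Observable):
    """A special observable that quantifies a particular substance."""
    substance: str


@dataclass(frozen=True)
class TimeSlice:
    time: Observable
    observables: tuple = ()

    def __post_init__(self):
        # Time must be an observable whose unit has dimension time.
        if self.time.unit.dimension != "time":
            raise ValueError("time must have a unit of dimension time")


@dataclass
class PathwayModel:
    name: str
    initial_conditions: TimeSlice


SECOND = Unit("s", "time")
MOLAR = Unit("mol/L", "concentration")

t0 = TimeSlice(
    time=Observable(0.0, SECOND),
    observables=(Quantity(5.0e-3, MOLAR, "glucose"), Quantity(1.0e-3, MOLAR, "ATP")),
)
model = PathwayModel("demo", initial_conditions=t0)
print(model.initial_conditions.time)
```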
--
OliverRuebenacker - 13 Mar 2008