Chothia Canonical Assignment Discovery


Antibodies can recognise virtually any given molecule, mainly by variation in the length and sequence of their Complementarity-Determining Regions (CDRs), which form the antibody’s binding interface. Three CDRs are found in the antibody’s Heavy chain (CDR-H1, -H2, -H3) and three in the Light chain (CDR-L1, -L2, -L3). CDRs were first defined by Wu & Kabat (1970) in an analysis of the variable domains of Bence-Jones proteins and myeloma Light chains. Later, Kabat and colleagues compared the sequences of the hypervariable regions in the then known structures and observed that the residues are conserved at 13 sites in the Light and 7 in the Heavy chains (Kabat, Wu & Bilofsky, 1977). They suggested that these positions in the sequence are involved with structure rather than specificity, introducing for the first time a possible relationship between sequence and loop conformation in antibodies. A second set of observations of the crystal structures of Fab fragments and myeloma proteins revealed that, in many cases, hypervariable regions with the same length but different sequences have the same main chain conformation (de la Paz et al., 1986).

It was in 1986 (Chothia et al., 1986) that specific residues were directly associated with the conformation of the hypervariable regions during a visual analysis of the sequence and structure of antibody D1.3, thus introducing the notion of the “canonical model”. From this point, various further studies enriched the table of structurally-determining residues (canonical residues), by observing the amino acid similarities at key interacting positions within sequences of members of any given conformational class, of the known and newly defined canonical structures, for the three CDRs in Light and the first two in Heavy chains (Chothia & Lesk, 1987; Chothia et al., 1989; Chothia et al., 1992; Barré et al., 1994; Tomlinson et al., 1995; Guarne et al., 1996; Martin & Thornton, 1996; Morea, Lesk & Tramontano, 2000; Vargas-Madrazo & Paz-García, 2002). Therefore, these collections of structurally-determining residues created canonical templates for each known conformational class, which defined the allowed residues per identified position in the variable chain. These canonical templates could then be used for prediction, from sequence alone, of the conformation of a new CDR by requiring its variable chains match as many, if not all, of the allowed residues present in the template. Regarding the sixth and final CDR-H3, a number of studies (Shirai, Kidera & Nakamura, 1996; Shirai, Kidera & Nakamura, 1999; Furukawa et al., 2001; Kuroda et al., 2008) provided structure-determining sequence rules for the prediction of the CDR-H3-base (or ‘take-off’, ‘torso’ or ‘anchor’) conformation.

In the latest relevant study (North, Lehmann & Dunbrack, 2011), it was inferred that the effect of canonical residue overlap between templates caused by the proliferation of structures was diminishing the efficacy of the canonical model. Instead, a mixed approach was proposed for prediction of CDR conformation, sometimes based on the presence of a very small number of statistically prominent structurally-determining residues, the gene source, CDR length or even the use of Hidden Markov Models (HMMs). Therefore most conformational clusters/classes were noted as not canonical, while a considerable number were characterised as non-predictable altogether. Furthermore, concerns were raised regarding the predictability from sequence of the bulged (including double-bulged) CDR-H3-base conformation.

The accurate prediction of CDR conformation is important in modelling antibodies for protein engineering applications (e.g., ab initio design of antibodies, antibody humanisation, vaccine design, etc.). Specifically, knowledge of the CDR conformation is crucial for the creation of a stable binding interface, modification of the antibody’s binding affinity or even identification of an epitope. Computational methods such as the canonical model or CDR-H3 sequence rules, which attempt conformational prediction of CDRs from sequence alone, have the advantage of being inexpensive and fast while requiring only a simple input; their major drawback being the inability to predict conformations that were never observed before experimentally. In this context, a re-evaluation of the performance of the canonical model in predicting the class of CDR conformation from sequence alone is presented in light of the latest new and multi-level complete CDR clustering (Nikoloudis, Pitts & Saldanha, 2014). The key residues are updated in the existing canonical templates from the sequences of members of each level-1 cluster/class, and correspondingly the canonical templates for new clusters in a given length are populated, using the key positions defined for that length by Martin & Thornton (1996). Those defined key positions are identical for all clusters of a given length. In this way, an assessment as to whether the canonical model is still effective as the quickest and simplest prediction method for antibody CDR conformation is carried out, and the effect of canonical residues’ overlap between templates caused by the proliferation of cluster sequence populations can be evaluated.

For the hypervariable (both in sequence and conformation) CDR-H3, the sequence rules for CDR-H3-base prediction described in Shirai, Kidera & Nakamura (1999) are tested, as well as their updated versions in Kuroda et al. (2008). The goal here is to compare the accuracy of the two sets of rules and, more importantly, to find out if the continual adaptation to new sequences with additional rules, exceptions and overrides is beneficial to this predictive model.

Besides testing these two popular and historic approaches on an updated dataset, a new predictive model from sequence alone is also introduced which aims to bring improved accuracy over previous sequence-based methods, while retaining their rapid execution and simplicity of usage. All the characteristics of the new method are detailed, step-by-step: inception, goals, basic concepts and definitions, implementation strategies, training and prediction workflows. A demonstration is presented of a standard predictive model derived from the method as well as an assessment of its efficacy on the same set of CDRs employed for the testing of the canonical model and CDR-H3-base rules. As this new method allows parameterisation, future dedicated work could take advantage of the general framework provided and propose a number of different or improved implementations.

The prediction results obtained by the new method are directly compared to those from previous approaches and complemented by statistical characteristics of the training, validation and test sets. Additionally, special importance is attributed to each method’s performance in predicting the major cluster/conformation (class-I) in any given CDR/length combination (e.g., CDR-L1 11-residues). Indeed, as is revealed by the population percentages per cluster in Nikoloudis, Pitts & Saldanha (2014), in each CDR/length with more than 10 unique sequences there is usually a single cluster which regroups the large majority of the known conformations, while the remaining fraction may be populating a considerable number of much smaller clusters. In the 15 lengths (first 5 CDRs) that contained more than 10 unique sequences in their clustered population and produced more than one cluster, the major cluster of each length represented on average 74% of the available unique sequences (median: 86%). As a consequence, these major conformations are expected to occur more frequently and are accordingly more probable to prove of interest in research scenarios. For this reason further analysis is undertaken of the prediction results to calculate the precision, recall and F-measure for all major clusters, and the corresponding comparisons between methods are presented.


A new blind dataset

As the clustering dataset in Nikoloudis, Pitts & Saldanha (2014) was locked on the 31/12/2011 edition of the PDB (Berman et al., 2000), this presented an opportunity to conduct a true blind test by downloading the antibody structures that were released subsequently. Hence, for the new dataset, a search was performed in the PDB for structures released between 01/01/2012 and 21/11/2013, using the same methodology as in Nikoloudis, Pitts & Saldanha (2014), which returned 312 files, two of which contained structures from 3 antibodies (PDB codes 3ULU, 3ULV). After removing redundant sequences, there remained a total of 230 antibody structures: 210 had both Heavy and Light chains, 4 had only a Light chain and 16 had only a Heavy chain. All redundant instances (i.e., multiple copies of the same CDR sequence within the same structure and CDRs from different structure files with identical Fv sequences) were additionally searched for different CDR conformations. Only one of the 230 structures was retained despite being redundant (4DN4), because a different CDR-L1 conformation was observed between the two crystal structures (4DN3/4DN4, free and bound versions, respectively).

As DCP required parameter tuning, a validation step had to be inserted. However, since the majority of clusters in the data to be predicted contained only between one and three unique sequences, it proved impractical to perform a traditional k-fold cross-validation on the clustered set, as these smaller clusters could not be further subdivided in a meaningful way. Instead, a 3-way experiment was designed, where the previously clustered dataset was used for training, while the new dataset was divided approximately in half into a validation set and a test set. The validation set comprised all PDB files released between 01/01/2012 and 14/03/2013 (113 non-redundant antibody structures), while the test set included all the subsequently released structures (15/03/2013–21/11/2013, 117 non-redundant antibody structures). This division of the dataset by time preserved the double-blind nature of the experiments, since the complete test dataset was also constructed with time of release as the sole criterion, thus eliminating any subjectivity from the selection, analysis and interpretation.
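The time-based division described above can be sketched as follows. This is only an illustration under stated assumptions: the cut-off dates are those given in the text, while the function name `assign_subset` and the shortcut of labelling every earlier release as training data are hypothetical simplifications (the actual training set is the previously clustered dataset).

```python
from datetime import date

# Cut-off dates taken from the text; everything else is illustrative.
NEW_DATA_START = date(2012, 1, 1)
VALIDATION_END = date(2013, 3, 14)
TEST_END = date(2013, 11, 21)


def assign_subset(release_date):
    """Map a PDB release date to 'training', 'validation', 'test' or None.

    Assumption: any structure released before 01/01/2012 is treated as
    belonging to the (previously clustered) training data.
    """
    if release_date < NEW_DATA_START:
        return "training"
    if release_date <= VALIDATION_END:
        return "validation"
    if release_date <= TEST_END:
        return "test"
    return None  # released after the dataset was closed
```

Because the release date is the only criterion, no structure can be moved between subsets on the basis of its sequence or conformation.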

An examination of the redundant sequence content (complete Fv identity) between training and test datasets was also performed. This count revealed a 7%–9% fully redundant sequence content in the test dataset (i.e., present in the training dataset) in all considered CDRs (specifically, full count[subset-1 count/subset-2 count]: 11[4/7], 13[7/6], 17[7/10], 17[7/10] and 17[7/10], for CDR-L1, CDR-L3, CDR-H1, CDR-H2, CDR-H3-base, respectively, Supplemental Information 3). While the fully redundant content appeared to be relatively low, the concerned entries were retained in the test dataset in order to allow an appreciation of the methods’ accuracy in predicting a trained sequence and demonstrate their capacity to overcome overlapping predictive definitions.

By using this new dataset, it was possible to retain the previous entire clustered set as a prior knowledgebase and to assess the sequence-based prediction methods in realistic conditions without discarding or ignoring any data, both during training/updating and testing. This ensured that DCP training and canonical templates’ updating remained blind toward the new PDB files. In terms of predictions with canonical templates, the entire new dataset served for testing since no validation step was required. However, for practical reasons, the above first subset will henceforth be called “the validation set” (for DCP) and the second subset “the test set” (for DCP), despite the fact that both constitute test sets for the canonical method.

All conformational predictions were applied at the first level of the clustering set’s nested scheme. New Fv sequences were numbered using the numbering scheme and CDR extents described in Nikoloudis, Pitts & Saldanha (2014). The Cα-backbones of new CDRs were then successively superposed onto the medoid structure of every cluster of the same length, in order to determine the actual conformation of new CDRs. For a new CDR to be assigned to a pre-existing conformational cluster, its RMSD to the cluster’s medoid was required to be lower than the cluster’s radius.
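The assignment criterion can be illustrated with a minimal sketch. It assumes that the RMSD of the new CDR to each medoid has already been computed by an external superposition step; the function name and the dictionary layout are hypothetical. Returning no assignment when zero or several clusters satisfy the criterion is also an assumption made for this sketch.

```python
def assign_to_cluster(rmsd_to_medoid, cluster_radius):
    """Return the cluster whose medoid lies within its radius, else None.

    rmsd_to_medoid: dict mapping cluster name -> RMSD (in Å) of the new CDR
                    to that cluster's medoid (computed elsewhere).
    cluster_radius: dict mapping cluster name -> cluster radius (in Å).
    """
    hits = [
        name
        for name, rmsd in rmsd_to_medoid.items()
        if rmsd < cluster_radius[name]
    ]
    # Assumption: an unambiguous assignment requires exactly one hit.
    return hits[0] if len(hits) == 1 else None
```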

A new method for prediction of CDR conformation from sequence

Method presentation

It has been made clear through various studies (Chothia et al., 1989; Chothia et al., 1992; Alzari et al., 1990; Al-Lazikani, Lesk & Chothia, 1997; Martin & Thornton, 1996; Morea, Lesk & Tramontano, 2000; Vargas-Madrazo & Paz-García, 2002; Shirai, Kidera & Nakamura, 1996; Shirai, Kidera & Nakamura, 1999; Kuroda et al., 2008) that the CDR sequence is not always the sole determinant of the CDR conformation. Several residues external to the CDR, from the framework, other CDRs or the second Fv chain, were retained as structurally-determinant and included in predictive canonical templates or sequence rules. These residues were identified, after meticulous visual examination of a number of antibody structures of interest, as making important contacts with CDR residues. However, this process can potentially lead to misleading generalisations due to crystal errors, or the intrinsic backbone and side-chain flexibility of surface residues such as those in CDR sequences.

In the new method now presented, a generalisation for the presence of class-specific combinations of residues is proposed. These combinations of residues would represent conformation-influencing synergies that are expected to appear exclusively or preferentially in members of one cluster. As far as the physico-chemical aspect of the residues’ interaction is concerned, these combinations may be representing steric effects, creation of a hydrophobic pocket or local environment, hydrogen-bonding, van der Waals’ contacts, salt bridges, backbone flexibilities, etc. Of course any investigation of sequence sets with such physico-chemical criteria would dramatically increase the complexity of any method. Instead a simpler model is proposed where the nature of these interactions, as well as the very residues which participate, remain irrelevant to the prediction procedure. More specifically, it would be of interest to search for those combinations of positions in the antibody Fv sequences that contain combinations of residues that are always different between different conformational clusters, i.e., combinations of positions that present disjoint combinations of residues between classes. In this way the sequence differences between different classes are examined, instead of the sequence similarities within a class as is the case with the canonical model. This approach was named ‘Disjoint Combinations Profiling’, or DCP, and all its characteristics are further detailed in the following sections.

Basic definitions

For the formulation of this new method a number of novel features needed to be defined, which are detailed later. The basic terms used in the DCP prediction method are provided here in Table 1, both as an introduction and for quick reference.

DCP setting-up and training

In this demonstration of DCP, all neighbouring residues of a CDR are included, within a radius of 4 Å, 6 Å or 8 Å, as potentially interacting with the CDR in a way that is influencing its conformation. The initial assumption is that these neighbourhoods of members of the same conformational cluster have equivalent influence on the observed conformation. Therefore, it is expected that within these neighbourhoods there exist combinations of positions that make distinct conformational-influencing synergies, and whose sequences are never observed in members of a different cluster. These synergies could be caused by any number of the aforementioned residue-to-residue interactions. The theoretical basis behind this parameter could be the chained influence that residues may have on a local conformational feature, also implicating residues that make indirect contact with the CDR; e.g., a cascade of interactions between 3 or 4 residues where the last residue resides on the CDR but makes no contact whatsoever with the first residue of the cascade. It is therefore possible that DCP captures such chained synergies, which are different between different conformational classes.

All the Fv positions that are predominantly found within the selected radius of an examined CDR, its residues included, define its ‘Interaction Frame’ (IF). This frame of positions was constructed after visual examination with the graphics program Swiss-PdbViewer (Spdbv; Guex & Peitsch, 1997) of a large number of antibody structures. During visual examination, all positions that satisfied the radius criterion and were common to all members of all clusters, were retained. As the antibody framework is very stable, the vast majority of neighbouring positions that were observed (over 90%) was topologically preserved between the examined CDRs. This operation was repeated for each CDR.

Once the IF is selected for a given CDR, the sequences of all cluster members per CDR/length combination are parsed for the residues that occupy the Fv positions found in the IF. These residues are then arranged in the same order as the respective positions appear in the IF, in order to form the corresponding ‘IF sequence’. This way, each cluster now has a set of IF sequences that can be compared with each other for the detection of disjoint combinations of residues between them. A graphical representation of these setting-up steps can be seen in Fig. 1.
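As an illustration of this step, the assembly of an IF sequence from a numbered Fv sequence could look like the sketch below. The position labels, the dictionary representation of a numbered sequence and the gap character used for unoccupied positions are assumptions made for the example, not details taken from the method description.

```python
def build_if_sequence(numbered_fv, interaction_frame, gap="-"):
    """Concatenate the residues found at the IF positions, in IF order.

    numbered_fv:       dict mapping a position label (e.g. 'L90') to a
                       one-letter residue code.
    interaction_frame: ordered list of position labels defining the IF.
    gap:               placeholder for positions absent from the sequence
                       (an assumption of this sketch).
    """
    return "".join(numbered_fv.get(pos, gap) for pos in interaction_frame)


# Hypothetical example: a five-position frame applied to one sequence.
frame = ["L89", "L90", "L91", "L95", "L96"]
numbered = {"L89": "Q", "L90": "Q", "L91": "W", "L95": "P"}
if_sequence = build_if_sequence(numbered, frame)
```

Keeping the IF order fixed means that the same index always refers to the same Fv position across all IF sequences of a given CDR.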

A common problem in CDR conformational prediction from sequence alone is the presence of sparsely populated clusters/classes. The sequence examples of those clusters are often so few that it becomes impossible to detect sequence features that are at the same time common between members of that cluster but different from other clusters. Especially so when the major cluster in a given length also has few members; any comparisons between the different clusters’ sequences become prohibitively risky. For the DCP training process, this obstacle was overcome by regrouping the sequences of all clusters in that length, except for the one that is being profiled. Indeed, in searching for differences, the profiled class needs to be presented against an ‘anti-reference’ rather than a traditional ‘reference’ used in many prediction methods. For example, it is possible to screen class A against what “is_not_class_A”, so by regrouping all “non_class_A” instances there is a practical enrichment of the volume of sets of sequences to be compared.

The ‘Query IF sequence set’ was defined as the group of non-redundant IF sequences of all members of the cluster under examination and the ‘Target IF sequence set’ was defined as a group of non-redundant IF sequences from members of all clusters except for the one that is being profiled. For example, when examining cluster-1 in a CDR/length with 4 clusters, a comparison is made of Query IF sequence set [cluster-1] versus Target IF sequence set [clusters-2/3/4]. The profiling for disjoint combinations can then be initiated by cycling through all combinations of Fv positions within the IF, up to the maximum combinatorial order that is pre-selected (e.g., singlets, couplets, triplets, quadruplets, or quintets, etc., each time including combinations of lower order), and extracting the corresponding amino acid sequences from the Query/Target IF sequence sets. Each combination of positions was called an ‘IF fragment’ and, accordingly, the corresponding extracted residues formed an ‘IF fragment sequence’.
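The construction of the Query/Target sets and the enumeration of IF fragments can be sketched as follows. This is a minimal illustration that assumes IF sequences are equal-length strings and that IF positions are referred to by their index in the IF; all function names are hypothetical.

```python
from itertools import combinations


def query_and_target(cluster_if_seqs, profiled):
    """Split non-redundant IF sequences into Query and Target sets.

    cluster_if_seqs: dict mapping cluster name -> iterable of IF sequences.
    profiled:        name of the cluster being profiled (the Query).
    """
    query = set(cluster_if_seqs[profiled])
    target = set()
    for name, seqs in cluster_if_seqs.items():
        if name != profiled:
            target.update(seqs)  # regroup all other clusters
    return query, target


def if_fragments(n_positions, max_order):
    """Yield every combination of IF indices up to max_order, lowest first."""
    for order in range(1, max_order + 1):
        yield from combinations(range(n_positions), order)


def fragment_sequences(if_sequences, fragment):
    """Extract the set of residue tuples observed at the fragment's positions."""
    return {tuple(seq[i] for i in fragment) for seq in if_sequences}
```

For an IF of n positions and a maximum order k, the number of fragments examined is the sum of C(n, j) for j = 1..k, which is why the maximum combinatorial order is treated as a tunable parameter.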

Once all respective amino acid fragment sequences are acquired from both Query and Target sets, the corresponding fragment sets are then examined for disjointness, i.e., that no sequence fragment is shared between the two sets. If the sets prove to be disjoint, that IF fragment is retained as pointing to a potentially significant difference between the two sets. This IF fragment is called a ‘Signature signal’. The rationale is that if any sequence combination of the examined IF fragment is shared even once between the members of the different clusters, then the examined IF fragment sequences are not mutually exclusive and therefore cannot be theoretically considered as unique to any conformation. The complete list of signature signals constitutes the ‘DCP signature’ of the examined (Query) cluster/class, which is consequently used for its prediction with new sequences. A graphic representation of this training process can be seen in Fig. 2.
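The disjointness test at the heart of the training process can be illustrated with a short sketch. It assumes equal-length IF sequences given as strings and index-based fragments; the function name and the toy sequences are hypothetical.

```python
from itertools import combinations


def train_signature(query_set, target_set, max_order):
    """Return every IF fragment whose Query and Target residues are disjoint.

    query_set, target_set: collections of equal-length IF sequences.
    max_order:             largest number of positions per fragment.
    """
    n_positions = len(next(iter(query_set)))
    signature = []
    for order in range(1, max_order + 1):
        for fragment in combinations(range(n_positions), order):
            query_frags = {tuple(s[i] for i in fragment) for s in query_set}
            target_frags = {tuple(s[i] for i in fragment) for s in target_set}
            # A signature signal shares no fragment sequence between the sets.
            if query_frags.isdisjoint(target_frags):
                signature.append(fragment)
    return signature


# Toy example: position 0 separates the Query (A) from the Target (S/T).
toy_signature = train_signature(["AGY", "AGF"], ["SGY", "TGW"], max_order=2)
```

In this toy case the fragment (1, 2) is rejected because the residue pair G-Y occurs in both sets, even though most of its other sequences differ.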

As a note, the basic properties of combinations imply that the observance of any signature signal of lower order automatically renders equally disjoint any combination of greater order, which contains all the IF positions of the lower order combination. For example, when IF fragment L90–L95 is disjoint, thus becoming a signature signal, any higher order combinations containing the previous IF positions are also disjoint; e.g., L90–L91–L95, L89–L90–L95, L89–L90–L91–L95, etc., are all equally signature signals. Therefore, in order to avoid unnecessary redundancies within a DCP signature which may affect prediction scoring, a filtering is performed that removes signature signals from the DCP signature when they contain other signals of lower order.
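A minimal sketch of this filtering step is given below, using the position labels from the example above. Representing each signal as a tuple of position labels is an assumption of the sketch.

```python
def filter_signature(signals):
    """Drop signals that contain all positions of a lower-order signal."""
    kept = []
    # Process lower orders first so that supersets are compared against them.
    for signal in sorted(signals, key=len):
        if not any(set(lower) < set(signal) for lower in kept):
            kept.append(signal)
    return kept


raw_signals = [
    ("L90", "L95"),
    ("L90", "L91", "L95"),
    ("L89", "L90", "L95"),
    ("L89", "L90", "L91", "L95"),
    ("L89", "L91"),
]
filtered = filter_signature(raw_signals)
```

Here only L90–L95 and L89–L91 survive, since every other signal contains the positions of L90–L95.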

Prediction of CDR conformation with DCP signatures

Once a DCP signature and a Target IF sequence set are acquired for each conformational class, it becomes possible to predict the unknown conformation of CDRs (from new Fv sequences) by scoring the differences (disjoint combinations). New Fv sequences will henceforth be referred to as “Query” sequences, as they become the profiled object. The first step is, again, to number the Query Fv sequence and to assemble the respective IF sequence for each CDR to be predicted from the residues that correspond to the IF positions (defined previously during training). Subsequently, the DCP signature and the corresponding Target IF sequence set for each class of the corresponding CDR/length are loaded in turn. For each screened class, the signature signals are read one-by-one and the corresponding sets of IF fragment sequences are re-constructed. These sets of Target IF fragment sequences are then examined for disjointness versus the corresponding Query IF fragment sequence from the unknown CDR. If disjointness is observed between the Query fragment sequence and the Target fragment sequences in a given IF fragment (i.e., the Query fragment sequence is not in the list of Target fragment sequences), then the comparison score is increased by 1 and comparisons proceed with the next signature signal until all comparisons are performed. It is important to note again that signal matching is achieved by observing sequence differences (i.e., disjoint fragments) and not sequence similarities as is more common in the canonical model.

The final signature matching score ($R_{\mathrm{DCPsignature}}$) of a given class is equal to the comparison score (total number of disjoint signals), divided by the total number of signature signals in the DCP signature:

$$R_{\mathrm{DCPsignature}} = \frac{\text{number of disjoint signals}}{\text{total number of signature signals}} \qquad (1)$$

Once all classes in the given CDR/length are scored, the predicted conformation is the one with the $R_{\mathrm{DCPsignature}}$ ratio closest to 1, and the workflow is repeated for the next CDR conformation to be predicted. A representation of the prediction workflow by DCP signatures can be seen in Fig. 3.
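The scoring and class selection can be sketched as follows. The sketch assumes index-based signature signals and string IF sequences; the function names are hypothetical, and assigning a score of 0 to a class with an empty signature is an assumption not stated in the text. Exact fractions are used so that equal ratios compare as equal.

```python
from fractions import Fraction


def signature_score(query_if_seq, signature, target_set):
    """Ratio of disjoint signals over all signals in a class's DCP signature."""
    if not signature:
        return Fraction(0)  # assumption: an empty signature cannot match
    disjoint = 0
    for fragment in signature:
        query_frag = tuple(query_if_seq[i] for i in fragment)
        target_frags = {tuple(s[i] for i in fragment) for s in target_set}
        if query_frag not in target_frags:
            disjoint += 1
    return Fraction(disjoint, len(signature))


def predict(query_if_seq, classes):
    """Score every class; return the top-scoring class name(s) and all scores.

    classes: dict mapping class name -> (signature, target IF sequence set).
    """
    scores = {
        name: signature_score(query_if_seq, signature, target_set)
        for name, (signature, target_set) in classes.items()
    }
    best = max(scores.values())  # ratios never exceed 1, so max is closest to 1
    return sorted(n for n, s in scores.items() if s == best), scores


toy_classes = {
    "I": ([(0,), (0, 1)], ["SGY", "TGW"]),
    "II": ([(0,)], ["AGY", "AGF"]),
}
predicted, scores = predict("AGW", toy_classes)
```

When several classes share the maximum ratio, all of them are returned, which corresponds to an ambiguous outcome rather than a single prediction.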

Canonical templates

The canonical templates were derived for every applicable conformational cluster, using the definitions of structurally-determining residues described in Martin & Thornton (1996). This choice was guided by the fact that the aforementioned study remains the most extended work on canonical residues, providing detailed tables of canonical templates for each conformational class.

Table 2 shows the canonical positions used for the creation of predictive templates in each applicable cluster, while the detailed canonical templates employed during blind-testing can be consulted in Supplemental Information 5. These templates were derived from the exact same training sequences used during DCP training, in order to allow a straight comparison between the two methods. It can be argued, that due to the nature of the level-1 clusters produced in Nikoloudis, Pitts & Saldanha (2014), the respective canonical templates may contain an unwarranted number of allowed residues, leading to misclassifications. This eventuality was explored by concurrently constructing, in selected cases (e.g., CDR-L3/9-residues, CDR-H1/13-residues), canonical templates from a small centralised portion of the cluster’s population, where conformation variations are minimal; namely those members that belonged to the cluster’s core. However, this training restriction led to an increased rate of misclassifications by canonical templates, probably because the sets of allowed canonical residues were not rich enough. For both this reason and for complete training conformity between the two methods, the exact same training sequences were used for DCP and canonical prediction from sequence.
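For comparison with DCP, the use of a canonical template can be sketched as a similarity score: the fraction of key positions at which the query carries an allowed residue. The representation of a template as a mapping from position to a set of allowed residues, the function names and the example positions are illustrative assumptions; the actual templates are those of Supplemental Information 5.

```python
def template_score(numbered_fv, template):
    """Fraction of key positions whose residue is allowed by the template.

    numbered_fv: dict mapping position label -> one-letter residue code.
    template:    dict mapping key position label -> set of allowed residues.
    """
    matched = sum(
        1 for pos, allowed in template.items() if numbered_fv.get(pos) in allowed
    )
    return matched / len(template)


def update_template(template, numbered_fv):
    """Add the residues of a new cluster member at the fixed key positions."""
    for pos, allowed in template.items():
        residue = numbered_fv.get(pos)
        if residue is not None:
            allowed.add(residue)


# Hypothetical two-position template and query.
template = {"L2": {"I"}, "L25": {"A", "S"}}
query = {"L2": "V", "L25": "A"}
score_before = template_score(query, template)
update_template(template, query)
score_after = template_score(query, template)
```

The example also shows why templates grow with the training population: each new member can only enlarge the sets of allowed residues, which increases the chance of overlap between templates of different classes.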

Sequence rules for CDR-H3-base prediction

Two sets of sequence rules for the prediction of the CDR-H3-base conformation were used: the first set from Shirai, Kidera & Nakamura (1999) and the updated set from Kuroda et al. (2008). The second set is an extension of the original set of rules based on examination of 314 new, non-redundant structures from the PDB. Blind-testing both sets of rules on the available test sets presented a good opportunity to examine their validity and, importantly, assess their extensibility by constant adaptation to new sequence findings. Although the respective publication was made in 2008, the updated set is referred to as “H3-rules 2007” in the corresponding text, so will henceforth be referred to accordingly.

Identification of multi-conformation full-rogue CDRs

During clustering, two conformational clusters that contain one or more members with identical CDR sequences were defined as ‘rogue’. For the DCP training and construction of canonical templates, it was also essential to search for, and deal with, structures that have the exact same Light and Heavy chain sequences within the clustering (training) dataset, but contain a CDR that belongs to different conformational clusters. These CDR structures were named ‘multi-conformation full-rogue CDRs’. Indeed, the presence of such CDRs in the training set would void DCP, as it would no longer be possible to detect any disjoint combinations between the sequences of the affected clusters. To a lesser degree, the same event would be detrimental for canonical predictions as well, since these full-rogue CDRs would have rogue templates, in the sense employed by Martin & Thornton (1996). However, as noted in North, Lehmann & Dunbrack (2011) and also observable in the detailed updated canonical templates (see Results section), the constantly increasing number of new antibody structures is already transforming most canonical templates into a ‘rogue’ status.

A visual examination of all detected occurrences was performed and detailed observations for Light and Heavy chain CDRs, and CDR-H3-base can be found in Supplemental Information 1. Based on these findings, it was decided to make no arbitrary exclusion of CDRs from the training set. The reason was that many rogue cases could warrant a dedicated study in order to make inferences on structure validity or potential conformational switches due to antigen/ligand contacts or backbone flexibility. Instead, it was decided that the affected clusters be merged into a combination of predictable conformations. In other words, affected clusters were treated as one during training for DCP and derivation of canonical templates. The implications of this choice are debated in the Discussion section. Finally, this identification of multi-conformational full-rogue members is presented as a piece of subsequent analysis based on the results of the complete clustering performed in Nikoloudis, Pitts & Saldanha (2014).
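The merging of affected clusters can be sketched with a small union-find routine. The function name and input layout are hypothetical, and treating merges as transitive (if clusters A and B share a full-rogue CDR, and B and C share another, all three are merged) is an assumption of this sketch rather than a statement from the text.

```python
def merge_rogue_clusters(clusters, rogue_pairs):
    """Group clusters that share at least one full-rogue CDR.

    clusters:    iterable of cluster names for one CDR/length.
    rogue_pairs: iterable of (cluster_a, cluster_b) pairs that contain a CDR
                 with identical Fv sequence but different conformation.
    """
    clusters = list(clusters)
    parent = {c: c for c in clusters}

    def find(c):
        # Follow parent links to the representative, compressing the path.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for a, b in rogue_pairs:
        parent[find(a)] = find(b)

    groups = {}
    for c in clusters:
        groups.setdefault(find(c), []).append(c)
    return sorted(sorted(group) for group in groups.values())
```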

Validation of DCP training parameters

The DCP method allows selection of the CDR neighbourhood radius (IF) and the maximum combinatorial order of IF fragments. In this demonstration, IF radii of 4 Å, 6 Å and 8 Å (3 possible selections) were considered, as well as maximum orders up to triplets and up to quadruplets (2 possible selections). Therefore, DCP training per CDR/length was repeated for all 6 combinations of parameters and validated each time on the validation set. The combination of parameters that resulted in the highest predictive accuracy was retained for the final evaluation of the method on the test set. For the prediction of the CDR-H3-base conformation, quintets were also considered, resulting in 3 additional training sessions. The selected parameters are listed in the Results section.
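The parameter search amounts to a small grid search, sketched below. The `evaluate` callback is hypothetical: it stands for training DCP with the given radius and maximum order and returning the accuracy obtained on the validation set. How ties between parameter combinations were resolved is not stated in the text; this sketch simply keeps the first best combination.

```python
from itertools import product

RADII = (4, 6, 8)        # IF neighbourhood radii in Å, from the text
MAX_ORDERS = (3, 4)      # up to triplets, up to quadruplets
H3_MAX_ORDERS = (3, 4, 5)  # quintets also considered for the CDR-H3-base


def select_parameters(evaluate, radii=RADII, max_orders=MAX_ORDERS):
    """Return the (radius, max_order) pair with the best validation accuracy.

    evaluate: hypothetical callable (radius, max_order) -> accuracy.
    """
    return max(product(radii, max_orders), key=lambda params: evaluate(*params))
```

With the default settings, 3 radii × 2 orders give the 6 training sessions mentioned above; passing `H3_MAX_ORDERS` gives 9, i.e., the 3 additional sessions for the CDR-H3-base.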

Blind-testing of sequence-based prediction methods for CDR conformation

Prediction results were categorised into four types: accurate, uncertain, false predictions and novel conformations. Predictions were considered failed in all cases other than the category “accurate”. As the prediction result from DCP signatures and canonical templates is based on the ratio of matched over the total number of signals/canonical residues, it is possible for two conformational classes to obtain the same maximum score. In these cases, the prediction is ‘uncertain’, and all classes with identical maximum score are output for reference. For an accurate prediction, the RMSD distance of the examined CDR conformation from a single cluster’s medoid was required to fall within that cluster’s radius. If this requirement was not matched, then the conformation was considered novel. In a few cases, the examined conformation appeared as an outlier between two clusters, displaying very similar RMSD distances to both their medoids; these outliers were also considered as novel conformations. Conformations with a CDR length with only one available cluster did not count towards any evaluation.
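The four outcome categories can be expressed as a small decision function. This is a sketch under assumptions: the true class is taken to be `None` when the structural comparison found a novel conformation, and the order in which the checks are applied (novel first, then uncertain) is chosen for the illustration rather than taken from the text.

```python
def categorise(predicted_classes, true_class):
    """Classify one prediction as accurate, uncertain, false or novel.

    predicted_classes: list of class names sharing the maximum score.
    true_class:        class determined from structure, or None if the
                       conformation fell within no single cluster's radius.
    """
    if true_class is None:
        return "novel"
    if len(predicted_classes) > 1:
        return "uncertain"
    return "accurate" if predicted_classes[0] == true_class else "false"


def is_failed(category):
    """Every category other than 'accurate' counts as a failed prediction."""
    return category != "accurate"
```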

For the assessment of each method’s performance with regard to the prediction of the major cluster (class-I) in each CDR/length, the following measures are calculated:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (2)$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (3)$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (4)$$

with TP, True Positive; TN, True Negative; FP, False Positive; FN, False Negative. Here, the positive class is the major conformation and the negative class refers to all the other conformations in that length. Therefore, ‘True Negative’ refers to the accurate prediction of a conformation other than the major in that length. Accordingly, ‘False Negative’ refers to the false prediction of a conformation other than the major one in the given length, while the actual conformation is the positive class.

Finally, as a technical appreciation of the combination of precision and recall, the F-measure is also provided:

$$F\text{-measure} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (5)$$

For the ‘uncertain’ predictions, with more than one class attaining equal maximum score, it was judged as more equitable to consider them as Negative results in all cases, since their predictive value is minimal in practice (i.e., the true conformation may be one or none of those reported). For those cases, if the true conformation of a CDR matches the major cluster in CDR/length, then that prediction counted as a False Negative for all further calculations, and as a True Negative if the true conformation does not match the major class.
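The calculation of precision, recall and F-measure for the major cluster, with uncertain predictions counted as negative calls, can be sketched as follows. The record layout and function name are hypothetical, and returning 0 when a denominator is empty is a convention chosen for the sketch. True Negatives are not counted because none of the three measures uses them.

```python
def major_cluster_metrics(records, major="I"):
    """Compute precision, recall and F-measure for the major cluster.

    records: iterable of (true_class, predicted_classes) pairs, where
             predicted_classes lists every class sharing the maximum score.
    """
    tp = fp = fn = 0
    for true_class, predicted in records:
        # An uncertain prediction (several classes) is never a positive call.
        positive_call = len(predicted) == 1 and predicted[0] == major
        if positive_call and true_class == major:
            tp += 1
        elif positive_call:
            fp += 1
        elif true_class == major:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f_measure = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f_measure


toy_records = [
    ("I", ["I"]),         # true positive
    ("I", ["I"]),         # true positive
    ("I", ["I", "II"]),   # uncertain, true class is major: false negative
    ("II", ["I"]),        # false positive
    ("II", ["II"]),       # negative, not used by these measures
]
toy_metrics = major_cluster_metrics(toy_records)
```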

Post-evaluation DCP training and canonical templates’ updating

In order to evaluate the evolution in predictive accuracy of the different methods, an experiment was performed where both the training set and the validation set were combined and subsequently used for DCP training and canonical templates’ updating. The DCP parameters were retained from the previous validation step, meaning that parameters were not re-validated in this phase. Then, a final evaluation was performed on the test set. This stage was called ‘Phase 2’ and was analogous to a single cycle holdout experiment (Table 3). Phase-2 allowed an appreciation of the methods’ performances in time, as more antibody structures become available.

Publicly available prediction tool

A GUI was developed with the Java Swing API for a computational tool that implements the prediction algorithms described for DCP and canonical templates (‘yCDRp’). The package (a jar file and a definitions folder) is available for download and use as a stand-alone desktop application at the “Humanisation bY Design” website, hosted by Birkbeck College, London. The GUI guides the user through manual structural numbering of their input Fv sequences, and the tool’s initial release applies Phase-1 DCP signatures and canonical templates (‘definitions’) for CDR conformational prediction.


Selected Interaction Frames for testing

Although three IFs were assessed during validation, only the IF neighbourhood radius that gave the best predictive accuracy was considered in the following comparison of prediction results. Table 4 shows the IFs that gave the best prediction results and their corresponding CDR neighbourhood radius. Positions at the end of the IF, marked as ‘nx’, refer to CDR-H3 positions at a sequential distance x from the last residue n (H102). Since the length of CDR-H3 is hypervariable, this notation was found to reflect the topological equivalence of numbered positions better.
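As a minimal sketch of the ‘nx’ idea, the helper below indexes a CDR-H3 from its last residue rather than from its first, so that loops of different lengths are compared at equivalent positions near the C-terminal end. The function name, the assumption that the input string ends at H102, and the two example sequences are all hypothetical and are not taken from the study.

```python
def residue_from_end(cdr_h3: str, x: int) -> str:
    """Return the residue at sequential distance x before the last residue n.

    Assumes cdr_h3 is a one-letter sequence whose final character is the
    residue labelled n (H102); x = 0 returns n itself.
    """
    if not 0 <= x < len(cdr_h3):
        raise ValueError("x must lie within the CDR-H3 sequence")
    return cdr_h3[len(cdr_h3) - 1 - x]


# Two illustrative (made-up) CDR-H3 sequences of different lengths.
long_loop = "ARDYYGSSYFDY"
short_loop = "ARGFDY"

# Counting from the end picks out equivalent positions in both loops,
# whereas counting from the start would not.
tail_long = [residue_from_end(long_loop, x) for x in range(3)]
tail_short = [residue_from_end(short_loop, x) for x in range(3)]
```

Indexing from the end keeps the comparison aligned on the loop base regardless of how many residues are inserted in the middle of the loop.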

Notations ‘E’, ‘K’ and ‘K+’, at the end of the CDR-H3-base IF, refer to the β-hairpin type that is favoured at the CDR-H3 apex, depending on the formation of an Extended E (Extended Negative EN and Extended Positive EP both resulting in the same β-hairpin ladder), single-bulged Kinked (K) or Kinked with double-bulged (K+) base. The hypothetical β-hairpin types (A/B/C/D) were derived from the definitions of the base type in Shirai, Kidera & Nakamura (1999). The profiling of an IF fragment that contains a hypothetical β-hairpin type would give the following correspondence in English: “is the co-existence of specific residues at specific Fv positions with a hypothetical β-hairpin type in CDR-H3 distinct within a class and therefore a disjoint event between different classes?” These categorical IF positions were introduced experimentally to the CDR-H3 IF and proved beneficial in practice. It was thus demonstrated that IFs may also include categorical features (another categorical example would be the CDR length) in order to allow the consideration of more complex combinations, for instance between residues and structural features.

Summary results for all experiments

Tables 5 and 6 show the accuracy of each method in each subset and experiment. Novel (not previously clustered) conformations observed in the new dataset are removed from the totals, so that performance is assessed only on conformations that are predictable. Similarly, structures with a CDR length that contained fewer than 10 unique sequences in the clustered set were not considered. The canonical templates’ results show a reduced total test population in CDR-L3, because no templates were available for a length of 11 residues. Individual results are commented on later, per corresponding CDR.

Reconciling the Structural Attributes of Avian Antibodies*

  1. Paul J. Conroy,
  2. Ruby H. P. Law1,
  3. Sarah Gilgunn§2,
  4. Stephen Hearty,
  5. Tom T. Caradoc-Davies,
  6. Gordon Lloyd,
  7. Richard J. O'Kennedy§,234 and
  8. James C. Whisstock135
  1. From the Department of Biochemistry and Molecular Biology, Faculty of Medicine, Nursing and Health Science, Monash University, Melbourne, Victoria 3800, Australia,
  2. §School of Biotechnology, Dublin City University, Dublin 9, Ireland,
  3. Biomedical Diagnostics Institute, National Centre for Sensor Research, Dublin City University, Dublin 9, Ireland, and
  4. Australian Synchrotron, 800 Blackburn Road, Clayton, Melbourne, Victoria 3168, Australia
  1. 4 To whom correspondence may be addressed: School of Biotechnology, Dublin City University, Dublin 9, Ireland. Tel.: 353-1-7007810; Fax: 353-1-7006558; E-mail: richard.okennedy{at}
  2. 5 An Australian Research Council Fellow and an Honorary National Health and Medical Research Council Principal Research Fellow. To whom correspondence may be addressed. Tel.: 61-4-18170585; Fax: 61-3-99029500; E-mail: james.whisstock{at}

Background: Antibodies from alternative immune hosts provide insights into novel mechanisms of antibody diversity in restricted germ-line repertoires.

Results: The high-resolution crystal structures of the first two chicken single chain antibodies (scFv) with prototypical binding sites are described.

Conclusion: Chickens exhibit unique canonical classes in the CDRL1.

Significance: Aves employ distinct mechanisms to generate diversity resulting in unique binding-site topologies.


Antibodies are high value therapeutic, diagnostic, biotechnological, and research tools. Combinatorial approaches to antibody discovery have facilitated access to unique antibodies by surpassing the diversity limitations of the natural repertoire, exploitation of immune repertoires from multiple species, and tailoring selections to isolate antibodies with desirable biophysical attributes. The V-gene repertoire of the chicken does not utilize highly diverse sequence and structures, which is in stark contrast to the mechanism employed by humans, mice, and primates. Recent exploitation of the avian immune system has generated high quality, high affinity antibodies to a wide range of antigens for a number of therapeutic, diagnostic and biotechnological applications. Furthermore, extensive examination of the amino acid characteristics of the chicken repertoire has provided significant insight into mechanisms employed by the avian immune system. A paucity of avian antibody crystal structures has limited our understanding of the structural consequences of these uniquely chicken features. This paper presents the crystal structure of two chicken single chain fragment variable (scFv) antibodies generated from large libraries by phage display against important human antigen targets, which capture two unique CDRL1 canonical classes in the presence and absence of a non-canonical disulfide constrained CDRH3. These structures cast light on the unique structural features of chicken antibodies and contribute further to our collective understanding of the unique mechanisms of diversity and biochemical attributes that render the chicken repertoire of particular value for antibody generation.



Antibodies are natural components of the vertebrate immune system produced by B-cells and function to identify foreign or “non-self” molecules. Due to their exquisite specificity, tunable affinity, potency, stability, and ease of manufacture, antibodies have enjoyed enormous success. The pharmaceutical industry has invested heavily in antibody-based therapeutics with 34 antibodies or antibody fragments approved worldwide, 28 of which are approved in both European and United States markets (1, 2), and an estimated 350 antibody-based therapeutics are in the clinical pipeline (2). Furthermore, antibody-based reagents are ubiquitous in diagnostic and research settings as valuable recognition tools. Although the pharmaceutical market is estimated to be worth more than forty billion U.S. dollars, diagnostic- and research-based antibody markets in 2012 were valued at eight billion and two billion, respectively. Antibodies have come of age due to our increased understanding of their function, specificity, and origins. This collectively assembled knowledge base draws observations from a number of disciplines including structural biology, immunogenetics, cellular immunology, molecular biology, and bioinformatics (4). The three-dimensional structure of antibodies, both free and antigen-complexed, has played a central role in elucidating humoral immune response mechanisms, evolution of the antibody repertoire, and optimization of in vitro-generated antibodies (5). This has resulted in a number of powerful technological advances that have allowed protein engineers to actively harness and augment the potential of the immune system, including in vitro display technologies (6), humanization (7), and engineering of biophysical properties (e.g. affinity, functional activity, specificity) (4).

The natural immune repertoire is dynamic, with the capacity to generate a repertoire of 10⁸ by affinity maturation in vivo, which is substantiated by continual exposure or sensitization to antigens from the environment or by immunization (8). Artificial manipulation of antibody genes has facilitated in vitro selection of truly unique antibodies from extremely large combinatorial libraries, which can be constructed from virtually any species, isolated from B-cells derived from naïve, immunized, or infected subjects, or partially or wholly synthesized in vitro (8). Display technologies such as phage, yeast, and ribosome display, when combined with high-throughput approaches for judicious library screening, have enabled the development of antibodies with highly tailored affinities, specificities, and biophysical properties (9, 10). The merits of in vitro-based antibody approaches, although largely under-appreciated by the research community, allow one to isolate antibodies with properties that are extremely difficult, if not impossible, to attain using the immune system alone (6, 8, 11). These powerful avenues of antibody development are of considerable importance for development of novel therapeutic entities and next generation diagnostic reagents and address the consternation among researchers arising from a preponderance of subpar research antibodies (12–14).

The specificity of antibodies is dictated by the hypervariable loops (15) or complementarity-determining regions (CDRs) (16) that form the sites for contact with their cognate antigen. The CDRs are described by canonical conformations, which are defined by the length of the hypervariable loop and by conserved residues in both the hypervariable loop and the framework region (FWR) (15). The FWR provides a structural scaffold for the antigen binding site, is important for structural diversity and the VL/VH orientation (4, 17), and in some instances may make direct antigen contacts. The IgY (Fig. 1A) is the typical low molecular weight antibody (180 kDa) of birds, amphibians, and reptiles and is considered to be the ancestral form of the mammalian IgG and IgE. Although similar to IgG, the IgY is structurally distinct (18) due to the presence of an additional constant heavy domain, lack of a bona fide hinge region, and differing oligosaccharide side-chain composition and, unlike IgG, is capable of eliciting anaphylactic mechanisms. As the hinge region is absent in IgY, its flexibility is derived from proline-glycine-rich regions at the Cν1-Cν2 and Cν2-Cν3 domains (18). At the genetic level, in contrast to humans, mice, and primates, the v-gene repertoire of chickens employs single functional v-genes for the heavy (VH3 family) and light chains (exclusively λ light chains), which contain unique VL-JL and VH-DH-JH segments (19). In addition to somatic hypermutation, to generate a diverse functional antibody repertoire from such a restricted v-gene germ-line, chickens employ “gene conversion”. This process is analogous to that in rabbits where each v-gene is significantly diversified by recombination of segments from upstream pseudogene blocks, which lack recombination signal sequences (Fig. 1B). As a consequence, the repertoire requires sequence homology between the germ line and pseudogenes to allow for donation of gene segments. 
Hence, a low level of FWR mutations was observed in VH repertoire analysis coupled with maintenance of CDR structural residues, but modulation of those residues, which affect VH/VL interfaces (20). The chicken immune system can, therefore, introduce variability at selected FWR residues, resulting in structural diversity of VH/VL angles (4, 20). The D-segments of chickens (15 functional segments) are highly homologous, and hyper-diversification is achieved by gene conversion at D-D junctions, creating “mosaic CDRs” (19). These D-segments contain cysteine residues at a far higher frequency than humans or mice. This prevalence results in >50% of the chicken repertoire containing non-canonical CDRH3 cysteine residues and potentially plays an important role in functional diversity. Structures of such disulfide-containing CDRH3 across species are rare, as cysteine residues are observed at low frequency in mature B-cells (20). However, selection of avian CDRH3 non-canonical disulfide-containing clones from Escherichia coli indicates that it is possible to efficiently sample the full breadth of the chicken repertoire by phage display (19).


Diagrammatic representation of antibody structure and the mechanism of gene conversion. A, the IgY (180 kDa) from chicken is composed of two identical polypeptide chains and, unlike IgG, has an additional constant heavy (red) domain and additional carbohydrate sites (pink dots). The single chain fragment variable (scFv) is constructed by isolation of the VL (blue) and VH (green) genes, which are linked in the VL-VH orientation by a flexible linker (black line). B, the process of gene conversion in chicken is depicted for both the heavy and light chain. In the H germ line, a functional VH domain is composed of unique VH and Jμ gene segments with one of a family of Dμ elements (∼15). In the L germ line, only one light chain exists (λ), and every light chain is composed of the same Vλ-Jλ arrangement, which in itself generates minimal diversity. Interchromosomal gene conversion in immature B-cells gives rise to diversity by translocation of pseudogene sequences into the V-genes. Typically the closest pseudogene is used more frequently in gene conversion; however, in the light chain the pseudogenes are distributed across a 20-kb region preceding the VL gene.

The use of chickens as hosts for generating diagnostic, and indeed therapeutic, antibodies is advantageous due to their phylogenetic distance from man, tolerance of multiple immunogens, and demonstrated successes in generating antibodies to a wide range of antigens (10, 19). Furthermore, the simple arrangement of v-genes, high core temperature, and increased CDRH3 length provide potentially superior antibody biophysical attributes. The unique properties of chicken antibodies have been demonstrated experimentally and are illustrated by detailed repertoire analysis. However, few investigations have examined the structural basis of these attributes to explain the demonstrated functional biology. In this study we have resolved the high resolution crystal structures of two unique chicken antibodies, selected from protein- and peptide-immunized repertoires by phage display, with nanomolar and picomolar affinity (10). The crystal structures have revealed unique topologies and arrangements within the CDRs of chicken antibodies that contribute to our increased understanding of antibody diversity.



Antibody Selection

The single chain fragment variable (scFv) libraries were generated in the VL-VH orientation, as described by Andris-Widhopf et al. (21). Clone 180 was selected from a cardiac Troponin I (cTnI) peptide (39KISASRKLQLKT50)-immunized repertoire by iterative cycles of phage display, as described previously (10). Clone B8 was selected from a PSA protein (human seminal fluid; SCIPAC)-immunized repertoire. Briefly, an adult leghorn was immunized with PSA and sacrificed, and the antibody repertoire was accessed from mRNA isolated from B-cells (femur bone marrow and spleen tissue) and displayed on the surface of filamentous phage (21). The antibody was isolated by iterative cycles of phage display with increasing stringency exerted by serial limitation of adsorbed PSA.

Antibody Expression and Purification

ScFv antibodies were expressed within the periplasmic space of E. coli Top10F′ (Invitrogen) with the pComb3x vector (21). Single colonies were selected from LB-agar supplemented with 25 μg/ml carbenicillin and grown overnight in 5 ml of Superbroth supplemented with 25 μg/ml carbenicillin and 1% (w/v) glucose at 37 °C with shaking at 220 rpm. This starter culture was used to inoculate 100 ml of Superbroth with 25 μg/ml carbenicillin and was grown to A600 = ∼0.6 before subculturing into 10 × 500 ml of Superbroth with 25 μg/ml carbenicillin (2-liter flasks). At A600 = ∼0.6, the cultures were induced with 0.2 mM isopropyl 1-thio-β-D-galactopyranoside at 30 °C with shaking at 230 rpm overnight (∼16 h). The bacteria were harvested by centrifugation (3220 × g) at 4 °C for 20 min. Soluble scFv were released from the periplasmic space by osmotic shock in a two-step process. The pellet was first thoroughly resuspended in 1× TBS (25 mM Tris, pH 8.0, 150 mM NaCl), and an equal volume of 2× shock buffer (50 mM Tris, pH 8.0, 300 mM NaCl, 1 M sucrose, and 2 mM EDTA) was added before incubation at room temperature for 15 min. Shocked cells were recovered by centrifugation (12,400 × g at 4 °C for 20 min) followed by resuspension in ice-cold 5 mM MgSO4 and incubation on ice for 15 min. The periplasmic-stripped cells were collected by centrifugation (27,200 × g at 4 °C for 20 min), and 0.2 times the volume of 5× binding buffer (125 mM Tris, pH 8.0, 750 mM NaCl, 50 mM imidazole, 0.02% NaN3) was added to the supernatant. HisBind (Novagen) resin (1 ml equilibrated in 30 ml of 1× binding buffer) was added, and scFv was recovered by batch binding for 2 h at 4 °C on an end-over-end roller. The resin was collected by gravity flow and washed with 30 ml of binding buffer followed by a second wash with 30 ml of wash buffer (25 mM Tris, pH 8.0, 150 mM NaCl, 20 mM imidazole, 0.05% (v/v) Tween® 20, 0.02% NaN3). 
Bound protein was eluted with 7–10 ml of elution buffer (1× running buffer, 300 mM imidazole) in 0.5-ml fractions, and protein-containing fractions were pooled and concentrated to 2 ml in a 3-kDa concentrator (Merck-Millipore). The pooled fractions were resolved by size exclusion chromatography (S75 16/60; GE Healthcare) in 25 mM Tris, pH 7.4, 150 mM NaCl, 0.02% NaN3. The purified protein was analyzed by SDS-PAGE and Western blot.

Affinity Measurement

Both scFvs were analyzed by surface plasmon resonance-based kinetic evaluation in a hemagglutinin (HA) epitope-capture approach. An anti-HA monoclonal antibody (Thermo Fisher Scientific) was immobilized by 1-ethyl-3-(3-dimethylaminopropyl)carbodiimide (EDC)-N-hydroxysuccinimide (NHS) (GE Healthcare) coupling on the surface of a CM5 chip (GE Healthcare) and capped with 1 M ethanolamine (10, 22). ScFv 180 was analyzed on a Biacore™ 4000 (GE Healthcare) using serial dilutions (12 to 0.19 nM) of purified human cTnI (Life Diagnostics) in HBS-EP+ (GE Healthcare) at 25 °C. ScFv B8 was analyzed on a Biacore™ 3000 (GE Healthcare) using serial dilutions of purified human PSA (12.5 to 0.39 nM) in a method described previously (22). The data were collected and processed using the dedicated Biaevaluation software (GE Healthcare). The data were double-referenced by subtraction of a “buffer only” control against the reference cell/spot-subtracted sensorgrams.
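The double-referencing step can be illustrated with a short sketch. It assumes that the “buffer only” injection is itself reference-subtracted before being removed from the analyte sensorgram, which is the usual convention but is not spelled out above. The function names and the response values are hypothetical; the actual processing was done in the instrument software.

```python
def reference_subtract(active, reference):
    """Subtract the reference cell/spot response from the active one."""
    if len(active) != len(reference):
        raise ValueError("sensorgrams must share the same time points")
    return [a - r for a, r in zip(active, reference)]


def double_reference(active, reference, buffer_active, buffer_reference):
    """Reference-subtract the analyte and buffer-only injections,
    then subtract the buffer-only trace from the analyte trace."""
    analyte = reference_subtract(active, reference)
    blank = reference_subtract(buffer_active, buffer_reference)
    return [a - b for a, b in zip(analyte, blank)]


# Made-up response units at three time points, for illustration only.
corrected = double_reference(
    active=[0.0, 10.0, 20.0],
    reference=[0.0, 1.0, 2.0],
    buffer_active=[0.0, 2.0, 3.0],
    buffer_reference=[0.0, 1.0, 1.0],
)
```

The first subtraction removes bulk refractive-index and non-specific surface effects; the second removes drift and injection artefacts common to every cycle.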


ScFv 180 and scFv B8 were concentrated to 16 and 5.8 mg/ml, respectively, with a 3-kDa concentrator (Merck-Millipore). Crystallization was carried out by the hanging drop method with a 1:1 mixture of protein and mother liquor at 4 °C, where crystals formed in 3–7 days. Initial crystal-forming conditions were identified from the Basic Crystallization kit for Proteins (Sigma). After refinement of the conditions, crystals of scFv 180 were obtained in 0.1 M Tris-HCl, pH 9.6, 0.2 M NaOAc, 29% (w/v) PEG 4000 with 0.1 M glycine. Crystals of scFv B8 were obtained in 0.1 M NaOAc, pH 4.6, 0.2 M ammonium sulfate. The crystals were flash-cooled in liquid nitrogen using 25% (v/v) glycerol as the cryoprotectant.

Structure and Refinement

Data sets were collected at the Australian Synchrotron MX2 beamline at 100 K (23). The data were merged and processed using XDS (24), POINTLESS, and SCALA (25). Five percent of the data set was flagged as a validation set for calculation of the Rfree. Molecular replacement of scFv 180 was carried out using PDB code 1MCP as a search probe. Molecular replacement was carried out with the hypervariable loops removed, initially with the light chain only and then with the heavy chain while the light chain was fixed. One molecule was found per asymmetric unit, and an initial model was generated using wARP (26). Model building was performed using COOT (27), and refinement was performed using PHENIX (28) and REFMAC (29). Molecular replacement of scFv B8 was carried out using scFv 180 as a model, and one molecule was found in the asymmetric unit. The starting model, model building, and refinement were carried out as for scFv 180. Crystallographic and structural analysis was performed using the CCP4 suite (30) unless otherwise specified. Secondary structure assignment was carried out using STRIDE (31). The figures were generated using PyMOL (32), and the structural validation was performed using MolProbity (33). All atomic coordinates and structure factors were deposited in the PDB under codes 4P48 (scFv 180) and 4P49 (scFv B8).



The antibody fragments were selected from VL-VH-orientated libraries constructed from high-titer chicken immunization regimens using both peptide (scFv 180)- and native protein-based (scFv B8) immunogens. In both cases the scFvs were selected by iterative cycles of phage display against purified human proteins (data not shown) (10).

Binding Analysis

Detailed kinetic analysis (Fig. 2) was carried out using surface plasmon resonance technology in a capture approach. The pComb3x-encoded HA tag facilitated oriented capture of highly functional antibody surfaces, with the binding responses monitored with the protein antigen as ligands. Both scFvs were found to interact with their respective antigens with high affinity (Table 1). The selection approach applied to scFv 180 (10) was tailored to ensure that the kinetics were suited to a room temperature, point-of-care (POC) device. Its rapid association (ka) was therefore expected to perform optimally in a short assay time frame, and once bound, its slow dissociation rate constant (kd) gave rise to a particularly high affinity antibody in the low picomolar range (18 pM; Table 1). In analysis using the cognate peptide conjugated to a carrier protein, the affinity was found to be ∼5-fold lower (99 pM; data not shown). ScFv B8 exhibits classical single digit nanomolar affinity (1 nM) associated with in vivo-matured antibodies.
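For readers less familiar with kinetic constants, the sketch below shows how an equilibrium dissociation constant follows from the two rate constants, assuming a simple 1:1 interaction model (KD = kd/ka). The rate constants used here are hypothetical values chosen only to land on the 18 pM scale quoted above; they are not the measured values from Table 1. The fold difference between the peptide-conjugate and protein affinities uses the two KD values given in the text.

```python
def dissociation_constant(ka: float, kd: float) -> float:
    """Equilibrium dissociation constant K_D (M) for a 1:1 interaction.

    ka: association rate constant in M^-1 s^-1
    kd: dissociation rate constant in s^-1
    """
    if ka <= 0:
        raise ValueError("ka must be positive")
    return kd / ka


def to_picomolar(molar: float) -> float:
    """Convert a concentration from M to pM."""
    return molar * 1e12


# Hypothetical rate constants (illustration only, not measured data).
ka_example = 1.0e6   # M^-1 s^-1
kd_example = 1.8e-5  # s^-1
kd_pm = to_picomolar(dissociation_constant(ka_example, kd_example))

# Fold difference between the two affinities quoted in the text.
fold_lower = 99.0 / 18.0
```

A fast ka paired with a slow kd is what gives a low KD, which is why selection for both properties yields picomolar binders.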


Kinetic evaluation of the avian scFvs with their respective protein antigens. Both scFv were analyzed in a HA capture configuration, which oriented the scFv on the surface of the sensor chip. A, scFv 180 was analyzed on a Biacore 4000™. The flow cell (FC) contained 5 spots; spots 1 and 5 were active (anti-HA + scFv + cTnI), spots 2 and 4 were control spots (anti-HA + cTnI), and spot 3 was an activated-deactivated surface control. Each was independently monitored, the sensorgrams were double reference-subtracted (spot 1–2 and 5–4 and a buffer only control), and the 3 nM concentration was carried out in duplicate. B, scFv B8 was analyzed on a Biacore 3000™. FC 1 and 2 were functionalized with the anti-HA capture antibody. Purified scFv was captured on FC 2 only and PSA passed over FC 1 and 2. The sensorgrams were FC2–1 and double (buffer) reference-subtracted, and the 3.16 nM concentration was carried out in duplicate. Three independent analyses were carried out, and kinetic constants are reported ± standard error (Table 1). RU, response units.


SPR kinetic analysis of scFvs

Sequence Analysis

Alignment of the VL, VH, and germ-line sequences (Fig. 3) illustrates the typical FWR sequence uniformity observed in chicken antibodies. Within the FWR of the VL, mutations occurred at residues L17 and L20 (scFv 180 only) and at Vernier positions L47 and L71 (scFv B8 only). In the VH, mutations were observed at Vernier positions H2 (scFv 180 only), H47 (light chain contact residue), and H78. Within the VL CDRs, scFv 180 exhibits a high level of CDRL1 diversification with a significant insertion of aromatic (Tyr), small (Gly), and charged (Arg) amino acids, resulting in a 14-residue CDRL1 (Kabat definition unless otherwise stated; see Fig. 3), which correlates with a recent study where CDRL1 canonical structure distribution could differentiate antibodies for specific antigens (proteins, peptides, and haptens) (4, 5). Anti-peptide binding antibodies tend to favor longer CDRL1 loops (11–13) over the shorter loops observed within anti-protein binding antibodies (6–8 residues) (5). The antibodies have similar CDRH1; however, the CDRH2 are significantly diversified, containing small (Gly/Ser), negatively charged (Asp), hydrophobic (Ile), and aromatic (Tyr) residues. Unsurprisingly, the two scFv have highly divergent sequence composition of the CDRH3, and furthermore, exhibit a differing residue composition of the paratope when compared with humans and mice (20). In both cases the CDRH3 are biased toward small amino acids (Gly/Ser/Ala/Cys) with a low frequency of Tyr, which is the dominant residue of mice and humans (4). Both CDRs maintain the conserved residues Lys/Arg-94 and Asp-101 of the theoretical bulged/kinked CDR (34, 35), which appears to hold true for the majority but not all structures (36). The CDRH3 of scFv 180 is 15 residues long, dominated by small (Gly and Ser), negatively charged (Asp) residues with a single aromatic (Tyr) residue. 
The CDRH3 of scFv B8 is 14 residues long, composed of small (Gly/Ser/Ala/Cys), positive (His/Arg), and hydrophobic (Ile/Leu) residues. It is of the chicken VH major structural class type 1, due to the pair of non-canonical cysteine residues (94ARSHCSGCRNAALIDA102), as defined by Wu et al. (20).


Amino acid alignment of avian scFv and germ-line sequence. The chicken scFvs were numbered by the Chothia, Chothia with structurally corrected framework indels, and Kabat schemes. The framework regions (FR) and CDRs are indicated below the numbering schemes, with the associated CDR definitions for each scheme indicated by the red-shaded areas. Variations in amino acid sequence from the germ-line (for scFv 180 and B8) are highlighted in gray for clarity, insertions are indicated by the yellow boxes, and gaps are illustrated by dashes. The sequences of the VL and VH of the avian Fab PDB code 4GLR are shown at the bottom for comparison (19). The chicken mature light chain is two residues shorter than that of the typical mammalian λ light chain and, therefore, begins at position number 3. The antibodies were numbered using “Abysis” (53).

Crystal Structure

To investigate the structural attributes of avian fragments, crystal screening was undertaken. The scFvs were purified to homogeneity from the periplasmic space of E. coli in a two-step purification protocol and were concentrated to 16 mg/ml (scFv 180) and 5.8 mg/ml (scFv B8) for x-ray crystallography. The resultant structures were solved at high resolution (Table 2): 1.35 Å in the P2 space group for scFv 180 and 1.40 Å in the P6 space group for scFv B8. In both cases there was a single molecule in the asymmetric unit. The two structures illustrate the prototypical binding site topologies for anti-peptide and anti-protein binding antibodies. The long CDRL1 and CDRH3 of scFv 180 created a protruding binding site (Fig. 4B) with a groove at the center (Fig. 4C), which is the classical anti-peptide binding antibody topology (4). Electrostatic surface analysis of the scFv shows a negatively charged binding site (Fig. 4D). In contrast, anti-protein antibodies tend to have larger, flat binding sites, and the prototypical topology was observed in the scFv B8 structure (Fig. 4, B and C) where the electrostatic surface reveals a predominantly positively charged binding site (Fig. 4D). As these two structures represent two of only three chicken structures available, a comparison of human and murine canonical CDRs, as classified by North et al. (36), was undertaken. Alignment of the two scFv with the chicken Fab (PDB code 4GLR) shows significant FWR sequence similarity (Fig. 5A). The chicken CDRs L2, L3, H1, and H2 comply with canonical conformations seen in mammalian structures. ScFv 180 could be classified into the clusters L2–8, L3–11, H1–13, H2–10. ScFv B8 was classified into the clusters L2–8, L3–9, H1–13, and H2–10; although the CDRH2 length was 11 residues, the structural class was well matched. 
In the case of both antibody structures, the CDRH3 adopted the kinked/bulged conformations of the torso, and the CDRL1 had canonical conformations that are distinct from all clusters described in human and rodents to date. The CDRL1 of scFv B8 matches the unique canonical conformation described by Shih et al. (19) for the chicken anti-pTau antibody (PDB code 4GLR) (Fig. 5B). The two non-canonical cysteine residues in the CDRH3 of scFv B8 form a disulfide within the CDRH3 and appear to support the adjacent, positively charged His-96 and Arg-100A residues. This bonding appears at high frequency in a number of other species including camelids, sharks, cows, pigs, and platypus (4), where the rigidity imparted by these disulfide bonds to long loops may be advantageous by minimizing entropic penalties during binding events (37) and has been shown to be essential for stability and binding function (data not shown) (38). This is the second example of such a CDRL1 canonical class that appears to be unique to chicken and was observed in crystal structures of two avian antibodies with intra-CDRH3 disulfide bonds (Fig. 5B). Although both of these chicken structures show this unique binding site arrangement, the nature of the antigens bound by the topology is different (PDB code 4GLR is phosphopeptide binding, and scFv B8 is protein binding). It would, therefore, appear that the mechanism is not implicitly antigen type-specific, although the binding site topology of PDB code 4GLR does not possess a typical “grooved” paratope associated with peptide binding antibodies because of the “bowl-like” recess in the CDRH2 that accommodates the phosphate group (19). In the case of scFv B8 CDRH2, the tyrosine residue at position H56 fills this recess, and the CDRH3 is flatter, with small amino acids directed toward the CDRH2 within the disulfide-bonded loop, which anchors two positively charged residues (His-96 and Arg-100A) positioned either side of the loop. 
This is in contrast to PDB code 4GLR, where the disulfide bond is positioned to hold the CDRH3 in such a conformation as to position the phosphothreonine (Thr(P)-231) in contact with the bowl-like recess in CDRH2 through orientation of key CDRH3 antigen-contacting residues (19). The scFv 180 CDRL1 (24SGGGRYYDGSYYYG34) is 14 residues in length, composed of small (Gly/Ser; 50%), hydrophobic (Tyr; 35.7%), and charged residues (Arg/Asp; 14.3%). The long loop is stabilized by a network of inter- and intra-CDR contacts and contacts with both the LFR and HFR. Its length groups it in the CDRL1–14 cluster; however, a structural comparison with light chains of the CDRL1–14-1 and CDRL1–14-2 median structures, PDB codes 1NC2 and 1DCL, indicated that it adopts an extended loop conformation (Fig. 5C). Furthermore, the loop appears to be more structurally similar to the extended loops of the longer clusters CDRL1–15-1 (PDB code 1EJO), L1–15-2 (PDB code 1I7Z), and L1–16-1 (PDB code 2D03) when aligning the CDRL1 loops only, independent of FWR (data not shown). The long CDRH3 (16 residues, North et al. (36), CDR definition) contains small (Gly/Ser; 43.75%), hydrophobic (Ala/Ile/Tyr; 25%), and charged residues (Asp/Lys; 31.25%), forming a grooved pocket with the CDRL1 (Fig. 4, B and C), which is negatively charged (Fig. 4D) and would appear to be a logical outcome given the highly charged nature of both cTnI (pI = 9.87) and the peptide (pI = 11.26) at physiological pH. The crystal structure of cTnI (PDB code 1J1D) indicates that this N-terminal region is α-helical and strongly positively charged. This antibody, although raised against a linear synthetic peptide, also recognizes the epitope in the context of the native protein (Fig. 2 and Table 1) as previously described (10). 
It is likely either that the peptide immunogen adopted a conformation that mimics this helical region or that binding to the antibody causes the peptide to adopt such a conformation. A recent study suggests that key “anchor” residues in the C terminus of such peptides are responsible for the peptide specificity of antibodies and other peptide-binding proteins (39).
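The residue-class percentages quoted for the scFv 180 CDRL1 can be checked directly from the loop sequence given in the text. The following is a minimal sketch, not part of the original analysis; the function name `class_composition` and the class labels are illustrative assumptions, and the grouping of residues simply follows the classes named in the text for this loop.

```python
from collections import Counter

# CDRL1 sequence of scFv 180 as quoted in the text (14 residues).
CDRL1_SCFV180 = "SGGGRYYDGSYYYG"

# Residue classes as named in the text for this loop (labels are illustrative).
RESIDUE_CLASSES = {
    "small": "GS",        # Gly/Ser
    "hydrophobic": "Y",   # Tyr
    "charged": "RD",      # Arg/Asp
}


def class_composition(sequence, classes):
    """Return the percentage of residues falling in each class, to one decimal."""
    counts = Counter(sequence)
    total = len(sequence)
    return {
        name: round(100 * sum(counts[aa] for aa in members) / total, 1)
        for name, members in classes.items()
    }


composition = class_composition(CDRL1_SCFV180, RESIDUE_CLASSES)
for name, percent in composition.items():
    print(f"{name}: {percent}%")
```

With the sequence above, the three classes account for 7, 5, and 2 of the 14 residues, which reproduces the 50%, 35.7%, and 14.3% stated in the text. The same function could be applied to the CDRH3, but its sequence is not given in this section.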


Structures of scFv 180 and B8. A, diagrammatic representation of the scFv format colored by chain (VL, blue; VH, green) with the CDR L1 (red), L2 (salmon), L3 (brown), H1 (yellow), H2 (orange), and H3 (magenta) shown. The flexible linker (black) connects the C terminus of the VL to the N terminus of the VH but was not modeled in the solved structure. B, scFv 180 is dominated by the CDRL1 and H3, which protrude into the solvent, creating a grooved binding site that is typical of an anti-peptide antibody. The CDR arrangement of scFv B8 is more compact, creating a “shelf-like” arrangement for antigen binding. C, the antibody binding sites as viewed from the antigen perspective. D, the transparent electrostatic surface view of the scFvs, from the antigen perspective, with the CDR loops visible. The positively and negatively charged areas are indicated in blue and red, respectively. ScFv 180 has a predominantly negatively charged pocket that is largely attributed to CDRs L1 and H3. In contrast, scFv B8 has a predominantly positively charged surface.


Uniquely chicken CDRL1 canonical structures. A, schematic view of avian scFv 180 (blue-white), B8 (cyan), and Fab PDB code 4GLR (gray) illustrating the degree of structural similarity of the avian variable domains. The view is rotated through 90° (below) to view the antibodies from the antigen perspective. In both views, the CDRL1 and H3 are highlighted for each avian antibody: scFv 180 (red), scFv B8 (green), and Fab PDB code 4GLR (orange). B, schematic view (as in A) illustrating the CDR arrangement of scFv B8. The CDRL1 and H3 (green) are overlaid with PDB code 4GLR (orange) with the remainder of the CDRs colored in gray. The CDRs are labeled on the view rotated through 90° to view the CDR arrangement from the antigen perspective (right) showing the non-canonical cysteine residues (Cys-97 and Cys-100) that form the intra-CDRH3 disulfide bond. The schematic view rotated through 90° (bottom) shows the avian scFv B8 CDRL1 aligned and superimposed onto the median canonical mammalian clusters, CDRL1–10-1 (PDB code 1YQV: L, cyan) and CDRL1–10-2 (PDB code 1AY1: L, purple), and the avian anti-pTau Fab (PDB code 4GLR: I, orange). The scFv B8 L1 is eight residues in length (residues 24–34) and does not match any of the canonical clusters defined by North et al. (36); however, it does share a canonical structure with that of the chicken Fab, PDB code 4GLR. C, schematic view of scFv 180 CDRL1 and H3 (red, in the same view as A) with the light chain aligned with representative structures of the canonical mammalian CDRL1 clusters. The CDRs are labeled on the view rotated through 90° to view the CDR arrangement from the antigen perspective (right). The view is rotated 90° to view the CDRL1 canonical structure alignment (bottom), where the CDRL1 (red) is superimposed onto the canonical mammalian clusters: CDRL1–13-1 (blue; PDB code 2A9M), CDRL1–14-1 (orange; PDB code 1NC2), CDRL1–14-2 (purple; PDB code 1DCL), CDRL1–15-1 (magenta; PDB code 1EJO), CDRL1–15-2 (yellow; PDB code 1I7Z), and CDRL1–16-1 (teal; PDB code 2D03). The scFv 180 CDRL1 is 14 residues in length (residues 24–34); however, it is not represented in the L1–14-1 cluster as defined by North et al. (36). The closest structural representative loop conformation is that of the murine anti-glycophorin A κ-light chain (PDB code 2D03; teal), which belongs to the L1–16-1 cluster.
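The cluster labels used in this legend encode the CDR, the loop length, and a cluster index (for example, L1–14-2 is the second cluster of 14-residue L1 loops), so loop length alone narrows the candidate clusters before any structural comparison is made. The sketch below illustrates only that length-based pre-grouping, using the cluster labels and PDB codes listed in the legend. The names `MEDIAN_STRUCTURES` and `candidate_clusters` are hypothetical, labels are written with plain ASCII hyphens, and the table is limited to the clusters named here rather than the full North et al. (36) set.

```python
import re
from collections import defaultdict

# Median structures named in the legend: cluster label -> PDB code.
# Only the clusters mentioned in the text are listed (not the full set).
MEDIAN_STRUCTURES = {
    "L1-10-1": "1YQV",
    "L1-10-2": "1AY1",
    "L1-13-1": "2A9M",
    "L1-14-1": "1NC2",
    "L1-14-2": "1DCL",
    "L1-15-1": "1EJO",
    "L1-15-2": "1I7Z",
    "L1-16-1": "2D03",
}

# Label layout: <CDR>-<loop length>-<cluster index>, e.g. "L1-14-2".
LABEL_PATTERN = re.compile(r"^(?P<cdr>[LH][123])-(?P<length>\d+)-(?P<index>\d+)$")


def candidate_clusters(cdr, length, medians=MEDIAN_STRUCTURES):
    """Return (label, PDB code) pairs whose CDR and loop length both match."""
    by_key = defaultdict(list)
    for label, pdb_code in medians.items():
        match = LABEL_PATTERN.match(label)
        if match is None:
            raise ValueError(f"unrecognised cluster label: {label}")
        by_key[(match["cdr"], int(match["length"]))].append((label, pdb_code))
    return by_key.get((cdr, length), [])


# The scFv 180 CDRL1 is 14 residues long, so length alone points at two clusters.
print(candidate_clusters("L1", 14))
```

For a 14-residue L1 loop this returns the L1-14-1 (1NC2) and L1-14-2 (1DCL) entries. Length matching is only a first filter: as the legend notes, the scFv 180 loop falls in the 14-residue group by length yet is structurally closest to the 16-residue representative, PDB code 2D03.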


Refinement statistics

Highest resolution shell is shown in parentheses. ASU, asymmetric unit.


In chicken, as in higher vertebrates, the primary mechanisms for VH diversification are V-D-J recombination and somatic hypermutation (20, 40, 41). However, in place of a diverse repertoire of sequences/structures, the chicken VH repertoire is diversified by gene conversion, whereby multiple upstream pseudogenes can undergo recombination into the VH gene after functional V-D-J rearrangements. This mechanism leads to both mutation of the CDRs and modulation of the FWR (20). The chicken antibody repertoire has been extensively examined at the genomic level and also in terms of functional isotype content (4, 42). Recently, a number of studies have exploited the chicken repertoire to generate high quality antibodies to a wide range of antigens (10, 19, 22, 43–48), have extensively examined the characteristics of the repertoire with detailed analysis of amino acid diversity in both naïve and selected repertoires (20), and have presented the crystal structure of a chicken Fv domain (19).

This study presents two high affinity chicken scFv crystal structures with prototypical binding site topologies for peptide and protein antigen binding. The antibodies were generated from immunized repertoires by phage display and interacted with their cognate antigens with high affinity (scFv B8, 1 nM; scFv 180, 20 pM). These antibodies exemplify the power of display technologies, in particular phage display, to not only recapitulate (scFv B8) but also exceed (scFv 180) the diversity of the host immune system to generate antibodies that surmount the theoretical affinity ceiling (8, 49). Combinatorial libraries facilitate combinations of heavy and light chains that are truly randomized, not represented or accessible in the natural B-cell repertoire, and routinely permit isolation of antibodies with sub-single digit nanomolar affinities (8). Thus, such technologies facilitate exploitation of the true affinity potential of the natural repertoire in vitro.

The structures revealed a number of interesting deviations from the established canonical structural groupings that may well be distinctly “chicken.” At present the paucity of chicken structures in the Protein Data Bank precludes a rational, focused appraisal of the consequences of novel CDRL1 canonical structures or the non-canonical disulfide-bonded CDRH3 for the classical modes of protein, peptide, or hapten binding. The CDRL1 of scFv B8 mirrors the chicken canonical cluster described by Shih et al. (19), and its presence is also concurrent with a long, non-canonical disulfide-constrained CDRH3, which exhibits a distinct bias toward small amino acids (Gly/Ser/Ala/Cys/Thr) (4, 20). This may suggest that such a binding site arrangement is not antigen binding mode-specific but raises the possibility that a short CDRL1 is necessary to facilitate the elongated and disulfide-constrained CDRH3. These descriptive structures further support the increasing genetic, functional, and now structural evidence that such a constrained CDRH3 is an active diversification strategy in the restricted germ-line repertoire of chicken (19, 20). This CDRL1-CDRH3 arrangement forms a common structure in chicken antibodies (two of the three available chicken structures) that is distinctly different from the structures described in mammals to date (19, 36, 50). Furthermore, the CDRL1 of scFv 180 exhibits a conformation that has not been observed to date in humans and rodents and that is distinct from those observed for both scFv B8 and PDB code 4GLR; it forms an extended CDRL1 and, in combination with a long CDRH3, is the basis of a grooved peptide-binding paratope.

The selection of an avian CDRH3 with non-canonical disulfides from E. coli demonstrates not only the capacity to fully access the chicken repertoire but also the importance of including such mechanisms of diversity in the selection of novel antibody candidates (19, 51). The restricted V-gene germ-line repertoire in chicken has evolved mechanisms that achieve equivalent levels of Ig protection; these mechanisms are distinct from those of mouse, human, and other primates, and the repertoire is fully capable of broad range antigen recognition (20). Therefore, chicken antibodies present a valuable reservoir that could be tapped for superlative diagnostics and possibly therapeutic entities.

The recent crystal structures of chicken (described here and by Shih et al. (19)) and bovine (52) antibodies have revealed unique and even surprising mechanisms to generate diversity within the immunoglobulin fold, often with restricted germ-line repertoires. Accruing knowledge of antibody structure, function, repertoire, and origin is crucial for a number of important applications, including man-made, knowledge-based repertoire synthesis and antibody-based drug discovery. These structures give structural endorsement to the active mechanisms of diversification employed by the chicken repertoire by highlighting two descriptive examples of uniquely chicken antibody structures.


We acknowledge the infrastructure support from Monash University platforms: Protein Production, Biomedical Proteomics, and Macromolecular Crystallization. This research was undertaken on the MX2 beamline at the Australian Synchrotron, Victoria, Australia.


  • 1 Supported by Australian Research Council and National Health and Medical Research Council grants.

  • 2 Supported by the Irish Cancer Society program Grant PCI11WAT, as part of the Prostate Cancer Research Consortium, Dublin, Ireland.

  • 3 Both are joint senior authors.

  • * This work was supported by the Science Foundation Ireland (SFI) under Centres for Science Engineering and Technology (CSET) Grant 10/CE/B1821 (to P. J. C., S. H., and R. O. K.) and a SFI Short-term Travel Fellowship (to P. J. C.).

  • The atomic coordinates and structure factors (codes 4P48 and 4P49) have been deposited in the Protein Data Bank (

  • 6 S. Yu, unpublished data.

  • 7 The abbreviations used are: CDR, complementarity determining region; cTnI, cardiac troponin I; FWR, frame-work region; PSA, prostate-specific antigen; scFv, single chain fragment variable; FC, flow cell.
  • Received March 18, 2014.
  • Revision received April 10, 2014.
  • © 2014 by The American Society for Biochemistry and Molecular Biology, Inc.

