## Info

Figure 1. a) Non-linear map of the 166 aromatic substituents described by six substituent constants (π, HBA, HBD, MR, F, and R). b) to g) Plot of the scaled values of the six parameters used to describe the substituents on each point of the non-linear map. Squares (positive values) and circles (negative values) are proportional in size to the magnitude of the parameters. In Fig. 1c and 1d the dots indicate the substituents which do not have the ability to accept and donate H-bonds,...

## Int

Figure 5. π-σ Craig plot of the set of substituents listed in Table 1. The dashed lines refer to the limits chosen for π and σ, respectively. An additional MR axis perpendicular to the plane of the diagram would separate the pairs 3/7, 4/8, 1/5 and 2/6 further (numbering of the substituents as in Table 1). An illustration of this procedure is given in Table 1, which refers to the selection of 9 substituents for aromatic rings, based on the list of well-characterized substituents for aromatic...

## A

Here $\bar{x}_k$ is the mean of column k, $t_{ia}$ is the a-th score for compound i, $p_{ak}$ is the a-th loading for variable k, and $e_{ik}$ is the residual. The A components are calculated to make the residuals $e_{ik}$ as small as possible in the least squares sense. The number of components is determined by cross validation [6]. SIMCA works by deriving a model for each class. Thereafter, classification is accomplished by projecting the data of the test compounds onto each of the training sets via the class models in the descriptor...

## Strategies in Experimental Design

The experimental design methods that have been proposed in the literature can be divided into two categories: 1) methods which are aimed at a direct and, therefore, (supposedly) quick optimization of lead compounds, and 2) methods that provide a strong basis for deriving reliable structure-activity relationships. Methods which belong to the first category are only suitable in the final stages of an optimization procedure, for which reliable fundamental structure-activity relationships are...

## Conclusions

Representation of biological data on a parabolic surface is possible, when the data set reveals quadratic non-linearity. In this case, it is useful to analyze the data, in order to determine the extremum point. The analysis leads to the determination of the extremum point in a straightforward manner, determining whether the extremum is a maximum or minimum. When the data are displayed as a contour plot, as in Fig. 1 and 2, the structure information is more clearly visible. Further, it is easier...

## An Example of DPPs in Design and Analysis

In a previous study on substituent descriptors, the potential use of w-scales (PPs) was illustrated with a series of tricyclic neuroleptics [7] (Fig. 4). Using MLR and the classical Hansch approach, the following equation between ataxia in mice and a steric ($V_r$) and an electronic descriptor ($\sigma_r$) was found:

$$pD_{50} = 0.533\,\sigma_r + 0.0366\,V_r - 0.00062\,V_r^2 \tag{1}$$

n = 16, r² = 0.76, Q² = 0.42

Using the PPs, obtained as w-scales, and again MLR for the analysis, a significant equation was obtained, but which was difficult...

## J 087

Matched display of Figs. 7a and 8a after normalization of their respective scores. The correlation coefficients between the scores on the axes of the co-inertia analysis are $r_1$ and $r_2$.

some basic principles of graphics and to try to illustrate them in the particular case dealing with the study of the co-structure between two data tables. Furthermore, it is obvious that these principles must be only considered as guides and not as rigid laws. Indeed, as mentioned by Tufte [2], with...

## References

1. Spilker, B., Multinational Drug Companies. Issues in Drug Discovery and Development, Raven Press, New York, 1989
2. van de Waterbeemd, H., Quant. Struct.-Act. Relat. 11, 200-204 (1992)
3. van de Waterbeemd, H., Drug Des. Disc. 9, 277-285 (1993)
4. Hansch, C., Maloney, P. P., Fujita, T., and Muir, R. M., Nature 194, 178-180 (1962)
5. Purcell, W. P., Bass, G. E., and Clayton, J. M., Strategy in Drug Design. A Guide to Biological Activity, Wiley, New York, 1973
6. Gould, R. F., ed., Biological Correlations...

## Vch

Weinheim, New York, Basel, Cambridge, Tokyo

Prof. Raimund Mannhold, Biomedical Research Center, Molecular Drug Research Group, Heinrich-Heine-Universität, Universitätsstraße 1, D-40225 Düsseldorf, Germany

Prof. Povl Krogsgaard-Larsen, Dept. of Organic Chemistry, Royal Danish School of Pharmacy, DK-2100 Copenhagen, Denmark

Prof. Hendrik Timmerman, Faculty of Chemistry, Dept. of Pharmacochemistry, Free University of Amsterdam, De Boelelaan 1083, NL-1081 HV Amsterdam, The Netherlands

This book was carefully produced....

## The Hansch Approach

In the early 1960s, Hansch and co-workers systematically investigated ways of expressing the relationships between structural and physico-chemical properties and activities quantitatively. The traditional QSAR paradigm is often formulated as shown in Eq. (1). More recently, due to the confusion surrounding the term QSAR, Hansch has referred to the science of chemical ↔ biological interactions as the unnamed science [28]. Since it is evident that a biological effect seldom depends on just...

## Symbols

- A: Loading matrix (also factor pattern)
- $a_{kj}$: Loading of the j-th variable in the k-th principal component or factor
- $a_k$: Corresponding normalized eigenvector
- BRDa: Decrease of diastolic (D) blood pressure in Wistar rats at dose a (after logarithmic transformation)
- BRDb: Decrease of diastolic (D) blood pressure in Wistar rats at dose b (after

## Spectral Map Analysis

Spectral Map Analysis (SMA) is a straightforward multidimensional extension of the two-dimensional geometrical constructions in Figs. 4 and 5. The original idea was to define a transformation which would transform the observed log X data into specificities, Z, and which would have the geometrical effects such as is illustrated in Figs. 6 and 7. In the first of these, we consider each of the n compounds to be represented as points in a p-dimensional coordinate space, Sp, by means of their log...
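The transformation from log data to specificities can be sketched in a few lines. The sketch below assumes the commonly described form of spectral mapping: logarithms are taken, the table is double-centered (row and column means are removed), and the resulting specificities are factored by a singular value decomposition. Uniform weights for compounds and tests are an assumption made for brevity, and the function and variable names are illustrative only.

```python
import numpy as np

def spectral_map(X, n_factors=2):
    """Log-transform, double-centre and factor a table of positive activities.

    Rows are compounds, columns are tests. Uniform row and column weights
    are assumed here (an illustrative simplification).
    """
    L = np.log10(np.asarray(X, dtype=float))
    row_mean = L.mean(axis=1, keepdims=True)   # average level of each compound
    col_mean = L.mean(axis=0, keepdims=True)   # average level of each test
    Z = L - row_mean - col_mean + L.mean()     # specificities (double-centred)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    compound_scores = U[:, :n_factors] * s[:n_factors]
    test_loadings = Vt[:n_factors].T
    return Z, compound_scores, test_loadings
```

After double-centering, every row and every column of Z sums to zero, so only contrasts between compounds and between tests remain for the factor analysis.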

## Ch

Orbital parameters for these same atoms using the AM1 Hamiltonian showed that the E-state values were significantly better in modeling a structure-activity relationship. The atoms implicated by the E-state analysis are the same as those implicated using the less successful molecular orbital model. The equation, relating the E-state indices to the inhibitory potency, pI50, is

2.2.4.2 Adenosine A1 Inhibitors

Joshi and Kier [19], using E-state indices, analyzed a series of xanthines reported by...

## Graphical Displays

In this paragraph, our intention is to formulate some of the basic principles which allow graphs to be employed more incisively in medicinal chemistry. Indeed, the study of the theory of data graphics is beyond the scope of this paper and can be found in numerous reference textbooks, e.g. [2, 13, 14]. When a graph is drawn up, quantitative and categorical information is encoded chiefly through the combined use of points, lines, numbers, symbols, words, scales, and/or colors. This graph should...

## Oh

Structure of the nine sesquiterpenoid unsaturated dialdehydes under study.

chemical investigations. These studies have shown that small structural variations in the molecules can considerably change their activity [32, 33]. They have also stressed numerous contradictory results, certainly in relation with the instability of some of these chemicals in assay media [34]. In order to confirm this hypothesis, we have tried to find the co-structure between a data matrix (similar to Table Y in...

## Chemometrics

The term chemometrics was coined in the 1970s and is defined as the chemical discipline that uses statistical and mathematical methods for selecting and optimizing analytical and preparative methods, as well as procedures for the analysis and interpretation of data. Chemometrics has found wide application in analytical chemistry [10-15]. Two specific journals are devoted to the development and applications of chemometrics, namely Chemometrics and Intelligent Laboratory Systems (1986) and The...

## Statistical Validation of QSAR Results 51 Validation Tools

- IC50: Concentration needed to lower cell viability by 50%
- PCA: Principal components analysis
- PLS: Projection to latent structures
- PRESS: Prediction error sum of squares or Predictive residual sum of squares
- Q2: Amount of predicted (CV) variance
- R2: Amount of modelled sum of squares (variance)
- RSD: Residual standard deviation
- SSy: The sum of squares of the response values
- X matrix: Table of N compounds × K structure descriptors

The procedure for establishing reliable quantitative structure-activity...

## Qsar

In a CoMFA study [57], or on applying PLS to the energy field computed by GRID [58], the rigorous procedure suggested by GOLPE may be impractical, since the variables in 3D QSAR are in the order of hundreds or thousands. Therefore, a strategy providing a reduced number of variables from the beginning and, therefore, a reduced number of combinations, would be highly desirable. In Sect. 5.2.3.1, we showed that D-optimal designs are more efficient than FFDs in constrained problems. Accordingly, in...

## Nhch

TMIC plot for 40 substituents (variables π, F, R, and MR).

simple visual inspection. Substituents which are distant from each other are selected in such a way that the whole space is systematically covered, while not including points which have a symmetrical position with respect to the origin. A TMIC map for 40 substituents (parameter space π, F, R, MR) is shown in Fig. 19. A distinct clustering of substituents, similar with respect to the properties considered, becomes evident and...

## Modern Chemometric Approaches in Molecular Design

Biological activities seldom depend on just one or two chemical properties, and consequently, a complex matrix of data must often be analyzed. Biological data can vary from simple affinity data (IC50 values) to complex in vivo data, reflecting only the activity or inactivity of a compound. The selection of the appropriate method for handling such data is extremely important if any useful conclusive results are to be obtained. The present volume first describes molecular concepts and...

## Nh

1) nature of the substituent: aliphatic (−) or aromatic (+) (the term aromatic refers to an aromatic ring that is directly joined to the benzimidazole), 2) size: small (−) or large (+), 3) lipophilicity: hydrophilic (−) or lipophilic (+). The latter two features need further specification: the smallest common aromatic substituent is a phenyl group and, therefore, it would be reasonable to consider groups with less than, say 9 non-hydrogen (second row) atoms as small. The nature of the phenyl...

## Ann

- ALS: Adaptive Least Squares
- ANN: Artificial Neural Networks
- CA: Cluster Analysis
- CCA: Canonical Correlation Analysis
- CFA: Correspondence Factor Analysis
- CR: Continuum Regression
- CSA: Cluster Significance Analysis
- FA: Factor Analysis
- FB: Fujita-Ban analysis
- FW: Free-Wilson analysis
- GOLPE: Generating Optimal Linear PLS Estimations
- kNN: k-Nearest Neighbor
- LDA: Linear Discriminant Analysis
- LLM: Linear Learning Machine
- MLR: Multiple Linear Regression
- NLM: Non-Linear Mapping
- PCA...

## Pc

At some of the software products, which are more specialized in molecular design. A list of these products, which are currently on the market, is given in Table 2. Reviews of new products appear twice a year in the Newsletter of the EFMC (European Federation for Medicinal Chemistry) [49] and the International QSAR Society [50]. In particular, those products which offer molecular display, statistical and graphical tools, such as TSAR [91], are potentially very useful and would be of considerable...

## O

Spectral map of pharmacological activities of 40 neuroleptics in three selected pharmacological tests, as defined in Table 1. The three selected tests are poles of the spectral map of Fig. 10. The three bipolar axes through the tests represent contrasts. Such values are expressed as ratios between activity values in Table 1. The projection of the center (+) onto a bipolar axis defines the origin of the axis of contrast. All other conventions are the same as in Fig. 10. Note the...

## Examples of Design and PLS Modeling

In this section eight examples are given with the aim of illustrating the concepts of experimental design and PLS analysis in the development of QSARs. The examples concern a diverse series of peptides or peptoids (Secs. 3.2.3.1 to 3.2.3.4), halogenated alkanes (Sec. 3.2.3.5), dibenzofurans (Sec. 3.2.3.6), aromatics (Sec. 3.2.3.7) and corrosive carboxylic acids (Sec. 3.2.3.8). In the examples given in Secs. 3.2.3.1 to 3.2.3.4, the statistically designed training sets were constructed in...

## J 133 V3

Figure 2. Illustration of the stabilization of an intermediate carbocation by the methoxy group.

In analogy to the σ constants for electron-withdrawing groups, Brown and coworkers [9] developed a set of substituent constants σ+ for groups that release electron density via resonance. The reaction selected for defining σ+ was the SN1 solvolysis of t-cumyl chlorides in 90% aqueous acetone at 25 °C. Electron-releasing substituents such as...

## Tryptamine

Figure 6. Extension of the double-log chart of Fig. 4 for multiple dimensions. The n compounds are represented as points in p-dimensional space, Sp, spanned by p tests. The diagonal line represents the potencies of the compounds, and the (hyper)plane perpendicular to it defines the specificities between compounds and tests. Spectral mapping performs a Principal Components Analysis (PCA) of the specificities in the (hyper)plane.

## Summary How to Develop and Interpret a PLS model

1. One must have a good understanding of the given problem, in particular, which biological properties of interest are to be measured and modeled, and which structural features should be varied.
2. Good data, both Y (activity) and X (structural descriptors) must be obtained. Multivariate Y variables provide much more information, because they can first be analyzed separately by PCA. This gives a good idea about the amount of systematic variation in Y, and which Y variables should be analyzed in...

## Introduction

The SIMCA method of pattern recognition and classification (henceforth abbreviated PARC) was first described in 1976 [1] and, as a tool in drug design, it was last reviewed by the authors in 1990 [2]. Initially, SIMCA was an acronym for SIMple Classification Analysis, but was soon reinterpreted by Dave Duewer as Soft Independent Modeling of Chemical Analogy. While other methods of PARC have been applied to drug design problems, SIMCA remains the method of choice, and a number of recent quantitative...

## Applications of Principal Component and Factor Analysis in Medicinal Chemistry

Some typical situations for the application of factor and principal component analysis in medicinal chemistry are summarized in Table 1. The examples, to be

Table 1. Selected applications of factorial methods in medicinal chemistry

Potency in a set of parallel biological tests with similar objects (e.g. bacteria, fungi): Relations between tests; recognition of redundancies and of tests with high information content. Potency...

## Steps in a Pattern Recognition Study

PARC studies are carried out in defined steps, which are: 1) selecting the training set compounds and developing the training data set (Fig. 1), 2) data preprocessing (transformation, scaling, centering), 3) developing, optimizing and validating the classification models, and 4) classification of the test set compounds. Step 1, ideally, should involve experimental design if the training set is to span the descriptor space. This topic is discussed by others in this volume. Data preprocessing, Step...
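The preprocessing in Step 2 can be illustrated with a minimal sketch of centering and scaling to unit variance (autoscaling). The key point is that the column means and standard deviations are estimated from the training set only and then reused unchanged for the test compounds in Step 4. The function names are hypothetical, and autoscaling is only one of several possible scaling choices.

```python
import numpy as np

def fit_autoscaler(X_train):
    """Estimate column means and standard deviations from the training set."""
    X_train = np.asarray(X_train, dtype=float)
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0, ddof=1)
    std[std == 0] = 1.0          # leave constant descriptors unscaled
    return mean, std

def apply_autoscaler(X, mean, std):
    """Centre and scale any data set with the training-set parameters."""
    return (np.asarray(X, dtype=float) - mean) / std

# Hypothetical descriptor values for three training compounds and one test compound
X_train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
mean, std = fit_autoscaler(X_train)
Z_train = apply_autoscaler(X_train, mean, std)
Z_test = apply_autoscaler([[4.0, 40.0]], mean, std)
```

Reusing the training parameters keeps the test compounds in the same descriptor space as the class models, which is what projection onto those models requires.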

## Maximal Standard Deviation

Figure 1. Flow diagram of co-inertia analysis in the context of SAR and QSAR studies.

4.2.3.2 Example: Structure-Reactivity Relationships for Unsaturated Dialdehydes

Chemical Stability of Sesquiterpenoid Unsaturated Dialdehydes

A large number of terpenoids with an unsaturated dialdehyde functionality group have been isolated from various organisms, which occupy different trophic levels in the environment (e.g....

## D

Illustration of the dual geometrical representation of the data in an n×p table of transformed data (specificities). Panel a shows n compounds in test space, Sp. Panel b shows p tests in compound space, Sn. The axes represent the principal components or factors of the patterns. Corresponding factors are common in the two spaces. The elliptic contours are a schematic representation of the probability density contours of the patterns of points. The diagram corresponds to the...

## N

Fusaric (R5 = n-Bu) and picolinic (R5 = H) acids.

pKa = -6.77(±1.39) R(R4) - 4.06(±1.17) F(R5)

n = 15, r = 0.977, s = 0.291, F = 52.9

pKb = 2.53(±1.00) π(R4) + 1.81(±0.46) F(R5) + 1.41(±0.30) (R5) + 13.01(±0.19)

Table 4. Pharmacological data of antihypertensive fusaric and picolinic acids. pI50: inhibition of DBH in vitro (I: incubation with an excess of Cu2+ ions; II: no excess of Cu2+ ions); BR: log (decrease in blood pressure), male Wistar rats with renal hypertension; S: systolic, D: diastolic. a 0.25...

## Strategy for QSAR Development in Drug Design

In the preceding paragraph, some general, but important remarks on modern QSAR analysis were given. These considerations have been incorporated into a strategy for QSAR development, which is described in the next few sections. This strategy consists of six steps, which are closely linked to each other, and are based on the two principal methods of statistical experimental design and multivariate data analysis [5]. Briefly, these steps are: (1) Formulation of classes of similar compounds, (2)...

## Activity Potency Sensitivity and Specificity

Spectral mapping is an unsupervised multivariate QSAR method. The term multivariate indicates that the method is applicable in the case when several compounds are studied simultaneously in multiple tests. The term unsupervised implies that the method does not rely on a specific model for structure-activity. In contrast, supervised methods, such as those based on regression and discriminant analysis, rely on a training set of compounds in order to determine the parameters of the model. Such a...

## The QSAR model

Any scientific model consists of several stages, from the philosophical viewpoint through conceptualization to execution. All aspects are essential in order to comprehend the model and its underlying concepts. Our way of thinking in chemistry consists of translating the influence of structure on activity in terms of effects, such as lipophilic, steric, polar, hydrogen bonding, and possibly other effects. Some of these can be localized to a part of a molecule, for instance, a part that...

## Discussion and Conclusions

Quantitative structure-activity relationships are valuable tools for modeling and predicting the biological responses of chemical compounds, and for the identification of potential structures with optimized biological properties. It is important to realize that QSAR modeling is not only restricted to small and semi-rigid molecules, but as shown here, it can be applied to long-chain and highly flexible chemical structures, such as peptides. Thus, QSAR modeling of flexible peptide sequences need...

## Smi

Figure 4. Correlation matrix of the five physico-chemical variables under study.

data and scatter plots. The data matrix of molecular descriptors (Table X, Fig. 1) was processed by a standardized PCA. According to the eigenvalues (Fig. 5a), PC1 and PC2 enable a graphical interpretation of the data. Fig. 5b shows that PC1 is mainly explained by the angle, distance and dipole X variables, which are negatively...

## Vss

We all remember that, having got lost in the wonderland, little Alice met the Cat. She asked him, "Would you tell me, please, which way I ought to go from here?" "That depends a good deal on where you want to get to," said the Cat. "I don't much care where," said Alice. "Then it doesn't matter which way you go," said the Cat. "...so long as I get somewhere," Alice added as an explanation. "Oh, you are sure to do that if you only walk long enough," concluded the Cat [1]. This little excerpt stresses the point...

## Objectives and Data Homogeneity

Data analysis is very much like chemical analysis: one must know what one is looking for in order to select an appropriate analytical method, and a given problem can be solved by a variety of methods. Moreover, for a given problem, not all of the data is of interest, just as a chemical sample contains constituents of little interest. We shall refer to the uninteresting parts of the data as noise, and the data of interest as information. In this context, we must remember that noise is only partly...

## Hn

Antibacterial fluoroquinolones.

rotation of the axes, a distinct clustering of points can still be seen. The largest cluster contains the tests P. ae., P. mi., E. cl., K. pn., E. co., M. mo. and S. ma. (see Table 2 for abbreviations). Obviously, these tests are very similar, at least with respect to the substances investigated. Situated fairly close to this cluster is another cluster with the tests E. fa. and S. au., again indicating similarity, but at the same time, suggesting that...

## Qng

$$x_{ik} = \bar{x}_k + \sum_{a=1}^{A} t_{ia}\,p_{ak} + e_{ik} \tag{2}$$

$$y_{ij} = \bar{y}_j + \sum_{a=1}^{A} u_{ia}\,c_{aj} + f_{ij} \tag{3}$$

The variables, ts and us, are latent variables calculated along the axes of greatest variation in X and Y. The latent variables make the x-residuals and y-residuals as small as possible and are maximally correlated. They are related through the inner relation, expressed by Eq. (4). The PLS model is shown graphically in Fig. 3.

Figure 3. Graphical representation of the PLS model.

## Developing Optimizing and Validating Classification Rules

SIMCA classification rules are geometric structures in descriptor space. They are (for 3 or more variables) a sphere or hypersphere for A = 0 (A is the number of components or product terms in Eq. 1), a cylinder or hypercylinder for A = 2, and a parallelepiped or hyperparallelepiped for A = 3 or more. Attempts to improve classification by adjusting the SIMCA classification rules have been limited in number. An early report by Forina and Lanteri [21] suggested that SIMCA models be modified to...

## Sxo

A plot of Δχ0 (labeled Dx0) versus Σχ0 (labeled Sx0) for the bioconcentration data, showing the elliptical contours for pBCF with the contour intervals shown at -8.20, -6.13, -4.06, -1.99, 0.09, 2.16, and 4.23 (the smallest oval). The data points are shown as lower case letters corresponding to the plot symbol in Table 2. The maximum point is labeled as '+'. The pair of points for i and j occur at the same coordinates as k and l as well as q and v.

nectivity chi indexes. There are...

## Conclusion

The aim of this study was not to give a catalogue raisonné of all the graphical methods which can be used in medicinal chemistry to enhance the statistical results produced by SAR and QSAR studies. Indeed, our intention was only to present

Figure 8. a) Representation of the compounds, as defined by their chemical stability on the first co-inertia plane, b) Weights of the chemical stability parameters in the equations of scores of the...

## 4

The correlation between variables y and yr is given by the following In order to find the minimum number of components, p, necessary for data reproduction within residual error, the components are added step by step to the model according to Eq. (12). After each step, the data matrix is reproduced, and the procedure is continued until only non-systematic noise remains. A criterion, which was sometimes used to recognize this salient feature was to Accept only components with eigenvalues > 1....

## C

A mapping error (E) is calculated from the distances in the two spaces. The original mapping error (E) calculation for NLM, devised by Sammon [10] on the basis of the Euclidean distance, is stated as follows (Eq. (1)) where $d^*_{ij}$ and $d_{ij}$ are the Euclidean interpoint distances in the original and display spaces, respectively. The procedure proposed by Sammon [10] has been significantly modified by Kowalski and Bender [18]. They defined the mapping error function, E, as shown in Eq. (2), where (d*)1' is a...
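As an illustration of how such a mapping error is evaluated, the sketch below computes Sammon's error in its usual published form, $E = \frac{1}{\sum_{i<j} d^*_{ij}} \sum_{i<j} (d^*_{ij} - d_{ij})^2 / d^*_{ij}$, for a given pair of configurations. It only evaluates E; the iterative minimization that moves the points in the display space is not shown. The function names are illustrative, and the code assumes that no two points coincide in the original space.

```python
import numpy as np

def pairwise_distances(points):
    """Euclidean distance between every pair of rows."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def sammon_error(X_original, X_display):
    """Sammon's mapping error between original and display configurations."""
    X_original = np.asarray(X_original, dtype=float)
    X_display = np.asarray(X_display, dtype=float)
    upper = np.triu_indices(len(X_original), k=1)      # each pair i < j once
    d_star = pairwise_distances(X_original)[upper]     # original space
    d = pairwise_distances(X_display)[upper]           # display space
    return ((d_star - d) ** 2 / d_star).sum() / d_star.sum()
```

Dividing each squared difference by the original distance gives more weight to small distances, so local structure is preserved preferentially.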

## Software

One of the most difficult tasks for the medicinal chemist is the interpretation of biological test results and how the test results correlate with the chemical data. The choice of appropriate software tools to achieve this is a prerequisite for extracting all the available information from the data. Although seemingly trivial, simple 2D scatter plots of either biological or chemical data are still highly informative, as was recently illustrated in an optimization study of antibacterial agents...

## Determining the Model Complexity

An important point to stress is the difference between prediction error and fitting error [18]. Fitting error is based on predicting the training objects and decreases with model complexity (adding components). Prediction error is based on estimation of compounds not included in model development. It decreases, goes through a minimum and then generally increases with model complexity. Selection of components based on cross-validation [6] gives models with optimal prediction capability. Indeed,...
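The contrast between fitting and prediction error can be made concrete with a small sketch. It assumes a single response, a NIPALS-type PLS algorithm, and leave-one-out cross-validation, and it reports R2 = 1 - RSS/SSy and Q2 = 1 - PRESS/SSy for an increasing number of components. The function names are hypothetical, and other cross-validation schemes (e.g. leaving out groups of compounds) are equally possible.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """NIPALS PLS for one response; returns (x_mean, y_mean, coefficients)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean              # centred X and y residuals
    W, P, c = [], [], []
    for _ in range(n_components):
        w = E.T @ f
        w = w / np.linalg.norm(w)              # weight vector
        t = E @ w                              # score vector
        tt = t @ t
        p = E.T @ t / tt                       # X loading
        c_a = f @ t / tt                       # y loading
        E = E - np.outer(t, p)                 # deflate X
        f = f - c_a * t                        # deflate y
        W.append(w); P.append(p); c.append(c_a)
    W, P, c = np.array(W).T, np.array(P).T, np.array(c)
    b = W @ np.linalg.solve(P.T @ W, c)        # regression coefficients
    return x_mean, y_mean, b

def pls1_predict(model, X):
    x_mean, y_mean, b = model
    return y_mean + (X - x_mean) @ b

def r2_and_q2(X, y, max_components):
    """R2 (fit) and leave-one-out Q2 for 1..max_components components."""
    ss_y = ((y - y.mean()) ** 2).sum()
    rows = []
    for a in range(1, max_components + 1):
        rss = ((y - pls1_predict(pls1_fit(X, y, a), X)) ** 2).sum()
        press = 0.0
        for i in range(len(y)):
            keep = np.arange(len(y)) != i      # leave compound i out
            model = pls1_fit(X[keep], y[keep], a)
            press += (y[i] - pls1_predict(model, X[i:i + 1])[0]) ** 2
        rows.append((a, 1 - rss / ss_y, 1 - press / ss_y))
    return rows
```

In such a table R2 can only increase as components are added, whereas Q2 typically passes through a maximum; the number of components at that maximum is the natural choice of model complexity.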

## Historical Background

Spectral mapping was proposed by the author as a multivariate QSAR method in 1975. The design of the method was the result of favorable circumstances. It first came about at the research laboratory at Janssen in Beerse, which at that time had implemented a number of simple, but highly effective ideas for the statistical analysis of the results of its screening tests. The procedures that resulted from these ideas relied heavily on graphical displays, as will be shown later on. Secondly, the...

## I

M = more active, E = equiactive, L = less active. Descending lines indicate sequence. Square brackets indicate alternates. Compared to 4-H compound.

Figure 2. Operational scheme for the optimization of aromatic substitution patterns (reprinted with permission from Ref. [1], Copyright 1972, American Chemical Society).

The scheme for modifying aliphatic chains can also be applied analogously. The second procedure for optimizing the substitution on aromatic rings, also suggested by Topliss [2], begins with a...

## Pp1

Scatter plot of PP2 vs PP1 for the 58 haloalkanes. The compounds submitted to biological testing (training and validation sets) are marked with filled circles. The compounds are numbered as in Table 6.

58 compounds with up to four carbon atoms and a varying degree of halogenation [5]. Interestingly, this class contained many chlorinated chemicals that are used on a large scale in industrial applications (as solvents, etc.), and also several chlorofluorocarbons (CFCs). In order...

## Examples

3.3.4.1 Neurotoxicity of Fluorophosphorous Compounds

Mager [10] has reported the neurotoxicity of 21 fluorophosphates, fluorophosphonates and fluorophosphorodiamides. In this data set, there are cyclic and non-cyclic alkyl portions of esters and amides alongside one aromatic substituent. The activity to produce ataxia in adult white hens was reported as mg/kg. The doses were expressed on a molar basis and converted to the negative logarithm, pD, which ranges from 3.20 to 5.92. The data are...
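The conversion of a dose in mg/kg to pD can be sketched as follows. The sketch assumes that the molar dose is expressed in mol/kg before the negative decadic logarithm is taken; the function name and the numerical values in the example are hypothetical and are not taken from the data set.

```python
import math

def p_dose(dose_mg_per_kg, molar_mass_g_per_mol):
    """Negative log10 of the dose expressed in mol/kg (assumed unit)."""
    mol_per_kg = dose_mg_per_kg / 1000.0 / molar_mass_g_per_mol
    return -math.log10(mol_per_kg)

# Hypothetical compound: 18.4 mg/kg with a molar mass of 184 g/mol
example_pd = p_dose(18.4, 184.0)
```

Expressing doses on a molar basis removes the trivial influence of molecular weight, so that compounds are compared per molecule rather than per unit mass.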

## Representation of a Response Surface

When two or more structure variables are required to represent the variation in the list of molecules, the non-linear equation becomes more complicated. In addition to linear and squared terms, there are cross terms in the structure variables. For a two-variable case:

$$pC = a_1 u_1 + b_1 u_1^2 + a_2 u_2 + b_2 u_2^2 + c\,u_1 u_2 + d \tag{3}$$

The term response surface arises because of the contours generated in a plot of $u_1$ versus $u_2$. The right hand side of Eq. (3) expresses a general parabolic surface; one can draw contours at...
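Locating and classifying the extremum of such a surface is a short calculation. The sketch below assumes the two-variable quadratic form pC = a1·u1 + b1·u1² + a2·u2 + b2·u2² + c·u1·u2 + d, sets the gradient to zero, and uses the eigenvalues of the matrix of second derivatives to decide whether the stationary point is a maximum, a minimum or a saddle point. The function name and the coefficients in the example are hypothetical.

```python
import numpy as np

def stationary_point(a1, b1, a2, b2, c, d):
    """Stationary point of the assumed two-variable quadratic surface."""
    H = np.array([[2.0 * b1, c], [c, 2.0 * b2]])   # second derivatives
    g = np.array([a1, a2])                         # gradient at the origin
    u1, u2 = np.linalg.solve(H, -g)                # gradient = 0
    eig = np.linalg.eigvalsh(H)
    if np.all(eig < 0):
        kind = "maximum"
    elif np.all(eig > 0):
        kind = "minimum"
    else:
        kind = "saddle"
    value = a1 * u1 + b1 * u1**2 + a2 * u2 + b2 * u2**2 + c * u1 * u2 + d
    return (u1, u2), value, kind

# Hypothetical surface: pC = -(u1 - 1)^2 - 2(u2 - 2)^2 + 5
point, value, kind = stationary_point(a1=2.0, b1=-1.0, a2=8.0, b2=-2.0, c=0.0, d=-4.0)
```

When both eigenvalues are negative the contours are closed ellipses around a maximum; eigenvalues of mixed sign give a saddle point, for which no optimum exists within the surface.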

## Symmetric and Asymmetric Data Structures

The problem of predicting that some compounds will be biologically active and others will be inactive stimulated much of the early applications of PARC to structure-activity data. Such applications led to our proposal that QSAR problems give rise to two types of data structure: 1) symmetric data and 2) asymmetric data [4, 16]. Fig. 2 is an example of symmetric data structure. Two or more classes form well-defined clusters in descriptor space. This results in classification studies of antagonists vs....

## Y

Figure 1. Data of a QSAR model can be arranged into two tables, matrices, X and Y. Note that the raw data may be transformed (e.g. logarithmically), and are usually centered and scaled before the analysis.

In the example, the side chains of the AAs are modeled using three lipophilicity...

## Basic Principles

4.1.2.1 Principal Component Analysis

If for n chemical compounds (i = 1, ..., n) biological potencies are measured in m biological tests (j = 1, ..., m), the results can be arranged in a matrix which we shall call the biological data matrix. If the tests are put into the columns (variables) and the compounds into the rows (objects), then the matrix has the following form where the element in row i and column j is the biological potency of the i-th compound in the j-th test. In order to give all variables (which may be on quite...
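A principal component analysis of such a biological data matrix can be sketched with a singular value decomposition. The sketch assumes that each test (column) is first standardized to zero mean and unit variance, so that the analysis is equivalent to an eigenanalysis of the correlation matrix between tests. The function and variable names are illustrative only.

```python
import numpy as np

def pca_standardized(Y, n_components):
    """PCA of a compounds-by-tests matrix after standardizing each test."""
    Y = np.asarray(Y, dtype=float)
    Z = (Y - Y.mean(axis=0)) / Y.std(axis=0, ddof=1)   # standardized data
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    eigenvalues = s ** 2 / (len(Y) - 1)     # eigenvalues of the correlation matrix
    scores = U[:, :n_components] * s[:n_components]    # compounds
    loadings = Vt[:n_components].T                     # tests
    return Z, scores, loadings, eigenvalues
```

Because each standardized test has unit variance, the eigenvalues sum to the number of tests m, which makes the share of variance carried by each component easy to read off.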

## Establishing the Training Sets

The training set refers to the set of compounds, whose relevant descriptors or features are to be used in the learning phase. Historically, training sets are designed from a lead compound. The result is a group of compounds, which are similar to the lead compound and, for the most part, those that should be most easily synthesized. As mentioned above, training sets should be established from experimental design methods, but this is seldom the case. Even though far from ideal, such data can be,...

## 1

M = more active, E = equiactive, L = less active. Descending lines indicate sequence. Square brackets indicate alternates.

Figure 3. Operational scheme for modifications of aliphatic chains (reprinted with permission from Ref. [1], Copyright 1972, American Chemical Society).

additional substitution patterns have been proposed that might possibly improve activity. In the present example 4-i-Pr, 4-t-Bu, 3,4-di-Me, 4-O-n-Bu, 4-O-Bz, and 4-N(Et)2 were proposed as additional candidates. An illustration...

## Residuals

The residuals of Y and X are of diagnostic value in determining the quality of the model. A normal probability plot of the Y residuals (Fig. 14) of the final AA model shows a fairly straight line with all values within ±3 SDs. In order to be a serious outlier, a point must clearly deviate from this line and be outside of the limit of 4 SDs. Since there are many X residuals (N × K), one needs a summary for each compound in order not to be cluttered with unnecessary detail. This is provided by the...

## List of Contributors

Karl Thomae GmbH, Chemical Department, Birkendorfer Straße 65, Postfach 1755, D-88397 Biberach an der Riss 1, Germany. Tel. +497351542974, Fax +497351542165

Dr. Daniel Chessel, Université Lyon I, URA CNRS 1451, F-69622 Villeurbanne, France

Prof. Sergio Clementi, Laboratorio di Chemiometria, Dipartimento di Chimica, Università di Perugia, Via Elce di Sotto 8, I-06123 Perugia, Italy. Tel. and Fax +397545646, gabri@chemiome.chm.unipg.it

Dr. Gabriele Costantino, Istituto di Chimica Farmac...