Normalization

Despite great care in keeping experimental conditions constant, random effects are unavoidable. In circadian research, multiple chips (12 in our case) are usually used to measure temporal changes in mRNA expression. Because chip-to-chip variability is inevitable, appropriate mathematical procedures must be applied before expression levels can be compared across chips. "Normalization" refers to procedures that reduce the impact of such effects on the data, and many methods have been proposed (25,26). In this section we adopt the following approach: we scale the expression levels on each chip so that the average expression level is equal across all chips, on the assumption that all chips have been stained with roughly equal amounts of total mRNA. Another popular technique is to scale the expression levels so that all chips have the same median. Although more sophisticated techniques are now available (25), normalization to the average or the median is still the first-choice method (Fig. 1). The following subheadings give the Mathematica code for performing these normalization steps.
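The scaling described above can be written compactly. The symbols below are introduced here for illustration only and do not appear in the original text: x_{gc} is the raw expression level of probe g on chip c, G is the number of probes, and T is the common target level (1000 is the value used in Subheading 3.3.2).

```latex
\begin{align}
s_c &= \frac{T}{\frac{1}{G}\sum_{g=1}^{G} x_{gc}}, \qquad x'_{gc} = s_c \, x_{gc}, \\
\ln x'_{gc} &= \ln x_{gc} + \ln s_c .
\end{align}
```

The second line shows why, on a logarithmic axis, this kind of normalization shifts the whole distribution of a chip by the constant ln s_c without changing its shape. For median normalization, the mean in the denominator of s_c is replaced by the median over g.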

3.3.1. Load Packages and Expression Profile Data

1. Before starting, load the Mathematica packages required for the subsequent analyses.

Needs["Statistics`MultiDescriptiveStatistics`"]
Needs["Statistics`ContinuousDistributions`"]

2. Load the previously prepared raw expression data, using the following Mathematica code:

dataTable = ReadList["C:\\work\\data.txt", {Word, Number, Number, Number, Number, Number, Number, Number, Number, Number, Number, Number, Number}];

Now the variable "dataTable" is a table (two-dimensional matrix) whose rows represent genes and whose columns hold the probe IDs (column 1) and the expression profiles (columns 2 to 13).
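Typing Number twelve times by hand is error-prone. As a minimal sketch, the record specification can be generated instead. The in-memory sample string and the names sampleText, nChips, recordSpec, and demoTable are hypothetical stand-ins for the real data file, and only three of the 12 expression columns are included.

```mathematica
(* Hypothetical two-row excerpt standing in for the data file *)
sampleText = "1415670_at 313.6 332.7 313.1\n1415671_at 680.4 799 805.5";

(* One Word (probe ID) followed by nChips Numbers (expression levels) *)
nChips = 3;  (* would be 12 for the full data set *)
recordSpec = Prepend[Table[Number, {nChips}], Word];

(* ReadList accepts an input stream as well as a file name *)
stream = StringToStream[sampleText];
demoTable = ReadList[stream, recordSpec];
Close[stream];

(* Number of rows (genes) and columns (ID plus expression values) *)
Dimensions[demoTable]
```

For the real file, the same recordSpec with nChips = 12 could be passed to ReadList in place of the hand-typed list.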

3. Separate probe IDs from expression profiles using the following code:

idList=Transpose[dataTable][[1]];
rawExpressionTable=Transpose[Drop[Transpose[dataTable],{1}]];

Table 1

Summary of Microarray Studies on Circadian Rhythms

Authors

Year Sample

DNA Chip Design

Analysis method

Schaffer et al. (19) 2001

Claridge-Chang 2001

McDonald and Rosbash (20) 2001

Grundschober et al. (21) 2001

Humphries et al. (15) 2002

Duffield et al. (17) 2002

Arabidopsis Arabidopsis

Drosophila head HDO

Drosophila head HDO

Rat-1 fibroblasts HDO

Rat liver cDNA Rat kidney

Rat pineal gland cDNA

Mouse liver cDNA Mouse hypothalamus

Rat-1 fibroblasts cDNA

Drosophila head HDO

Mouse heart Mouse liver Mouse SCN Mouse liver Drosophila head

Mouse SCN Mouse liver

12 time-points, 4-h interval, LD followed by DD, n = 3

20 time-points, 4-h interval, DD, n = 1 2 time-points, 12-h interval, LD, n = 1

13 time-points, 4-h interval, DD, n = 1 12 time-points, 4-h interval, LD and DD, n = 1

HDO 6 time-points, 4-h interval, LD, n = 2-3, and DD, n = 2 HDO 12 time-points, 4-h interval, LD and DD, n = 1

Cross correlation with cosine waves Two time-point comparison

Fourier analysis

Cross correlation with cosine waves

Spectral analysis

Two time-point comparison

Two time-point comparison Anchored comparison Moving window analysis Cosine wave fitting Cross correlation with cosine waves

Autocorrelation analysis

Cosine wave fitting

Autocorrelation analysis

Cross correlation with cosine waves

2002 Drosophila head

Hirota et al. (16) 2002 Rat-1 fibroblasts

Nowrousian et al. (22) 2003 Neurospora

2003 Mouse liver

2003 Arabidopsis

2004 Mouse liver

HDO 12 time-points, 4-h interval, LD and DD, Cosine wave fitting n = 2

HDO 3 time-points, 0 h, 1 h, 4 h, n = 1 cDNA 5 time-points, 4-h interval, 1 cycle, DD, n = 3 and temperature entrainment HDO 2 time-points, 12-h interval, 1 cycle, DD, n = 1

HDO 2 time-points, 12-h interval, 1 cycle, DD, n = 2-3

Time-point comparison Time-point comparison Cosine wave fitting Two time-point comparison

Time-point comparison Two time-point comparison

Studies are listed by publication date.

HDO, high-density oligonucleotide microarray; cDNA, complementary DNA microarray; SCN, suprachiasmatic nucleus; LD, light-dark; DD, constant darkness.

Table 2

Profiles of Gene Expression Over 2 d at 4-h Intervals

1415670_at      313.6   332.7   313.1   425     599.7   463.8   429.2   324.6   554.4   461.2   575.6   349.5
1415671_at      680.4   799     805.5   1019.7  1031.7  1008.5  1006.5  707.5   756.8   1123.4  1195.1  675
1415672_at      1281.6  1484.1  872.7   1058.8  1184    1084    1227.2  931.4   1059.4  1214.8  1203    764.2
1415673_at      124.3   95.3    80.4    110.3   132.9   112     103.9   58      64.9    108.1   101.5   65.4
1415674_a_at    307.4   335.8   312.1   376.6   350.2   340.7   394.6   289     284.8   385     375.8   245.7
1415675_at      258.8   229.4   231.9   282.3   245.4   271.7   315.5   228.3   167.5   227.4   242.4   170.9
1415676_a_at    1094.3  1415.7  1330.2  1327.6  1242.9  1221.8  1722.2  1248.4  1092.6  1446.3  1311.5  1173.6
1415677_at      441.3   480.6   557.8   737.4   434.2   523.2   789.9   635     372.1   850.9   524.1   625
1415678_at      828.6   930.7   884     967.3   950     818.9   749.2   687     685     984.4   792.3   570.2
1415679_at      1274.7  1409.7  1202    1358.2  1286.6  1249.3  1500.2  993.2   1185.1  1428.4  1565.9  958.2

This table was created with Microsoft Excel. The first column shows the Affymetrix probe set IDs, and the following 12 columns give the expression level of each gene at successive time-points. The first 10 of 22,690 rows are shown here. The table has no header row, which simplifies the Mathematica code.

Fig. 1. Schematic representation of the normalization procedure. Gene expression data from two different chips are shown before (A) and after (B) normalization to illustrate how these procedures transform the data sets. The normalized distributions, shown in (B), are shifted and aligned at their centers. Gene expression comparisons between the two distributions can now be made without systematic experimental bias. The axis in each panel is log_e (expression).

In the code of step 3, the first statement transposes "dataTable" and extracts the first row of the result, which holds the probe IDs (column 1 of "dataTable"). The second statement transposes "dataTable", drops that first row, and transposes the result back. The resulting "idList" is an array of probe IDs, and "rawExpressionTable" is a table (two-dimensional matrix) whose rows represent genes and whose columns represent the expression levels on each chip.
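The same separation can also be written with Part specifications instead of a double transposition. This is a sketch that assumes Mathematica 6 or later, where the Span notation (;;) is available. The names ending in Alt are hypothetical, and demoTable is a small stand-in for "dataTable" built from the first values of Table 2.

```mathematica
(* Small stand-in for dataTable: probe ID followed by three expression values *)
demoTable = {{"1415670_at", 313.6, 332.7, 313.1},
             {"1415671_at", 680.4, 799, 805.5}};

(* Column 1 of every row: probe IDs *)
idListAlt = demoTable[[All, 1]];

(* Columns 2 to the end of every row: expression profiles *)
rawExpressionTableAlt = demoTable[[All, 2 ;;]];

(* Compare with the transpose-based statements; both lines give True *)
idListAlt === Transpose[demoTable][[1]]
rawExpressionTableAlt === Transpose[Drop[Transpose[demoTable], {1}]]
```

The Part form avoids building two transposed copies of the table, which may matter little for 12 columns but keeps the intent (take column 1, take the rest) visible at a glance.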

3.3.2. Equalize Average or Median of Each Chip

1. Scale the level of expression of each probe so that the average expression level for each chip becomes 1000 (see Note 1) using the following Mathematica code:

normalizationFactors=1000/Mean[rawExpressionTable];

normalizedExpressionTable=rawExpressionTable.DiagonalMatrix[normalizationFactors];

The first line calculates the scaling factors, one per chip, and stores them in a vector. The second line multiplies "rawExpressionTable" by a diagonal matrix of the scaling factors to produce "normalizedExpressionTable," whose rows contain the normalized expression profile of each gene.

2. Alternatively, scale the expression levels of each probe so that the median of each chip becomes 1000 (see Note 1), using the following Mathematica code:

normalizationFactors=1000/Median[rawExpressionTable];
normalizedExpressionTable=rawExpressionTable.DiagonalMatrix[normalizationFactors];
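As a quick sanity check, both normalizations can be tried on a small table with known answers. The sketch below uses a made-up table of three genes on two chips (toyTable and the other names are hypothetical) and assumes that Mean and Median act column by column on a matrix. Exact integers are used so that the results are exact.

```mathematica
(* Made-up raw data: rows are genes, columns are chips *)
toyTable = {{100, 200}, {200, 400}, {600, 1200}};

(* Mean normalization: column means {300, 600} give factors {10/3, 5/3} *)
meanFactors = 1000/Mean[toyTable];
meanNormalized = toyTable . DiagonalMatrix[meanFactors];
Mean[meanNormalized]      (* {1000, 1000} *)

(* Median normalization: column medians {200, 400} give factors {5, 5/2} *)
medianFactors = 1000/Median[toyTable];
medianNormalized = toyTable . DiagonalMatrix[medianFactors];
Median[medianNormalized]  (* {1000, 1000} *)
```

Because every value on a chip is multiplied by the same factor, the ratio between any two genes on that chip is unchanged by either procedure; only the overall level of the chip is adjusted.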
