# Normalization

Despite great care in keeping experimental conditions constant, random effects are unavoidable. In circadian research we usually use multiple chips (12 chips in our case) to measure temporal changes in mRNA expression. Because stochastic variability is inevitable, proper mathematical procedures must be applied to allow cross-chip comparisons. "Normalization" describes processes that reduce the impact of random effects on the data, and many methods have been proposed (25,26). In this section we adopt the following approach: we scale the expression levels on each chip so that the average expression level is equal across all chips, on the assumption that all chips have been stained with roughly equal amounts of total mRNA. Another popular technique is to scale the expression levels so that all chips have equal medians. Although more sophisticated techniques are now available (25), normalization of the average or the median is still the first-choice method (Fig. 1). In the following subheadings, we describe Mathematica code that performs these normalization steps.
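In symbols, the two scalings described above can be written compactly as follows. The notation is introduced here only for illustration and is not part of the original protocol: $x_{gc}$ denotes the raw expression level of gene $g$ on chip $c$, $G$ is the number of genes, and $T$ is the common target value (Subheading 3.3.2 uses $T = 1000$).

```latex
\begin{align}
\text{mean scaling:}\quad   x'_{gc} &= x_{gc}\,\frac{T}{\bar{x}_c}, & \bar{x}_c &= \frac{1}{G}\sum_{g=1}^{G} x_{gc},\\
\text{median scaling:}\quad x'_{gc} &= x_{gc}\,\frac{T}{m_c},       & m_c &= \operatorname{median}_{g}\; x_{gc}.
\end{align}
```

After either scaling, every chip has the same mean (or median) $T$, so two chips that differ only by a constant multiplicative factor are mapped onto each other exactly.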

### 3.3.1. Load Packages and Expression Profile Data

1. Before starting, load the Mathematica packages required for the subsequent analyses.

Needs["Statistics`MultiDescriptiveStatistics`"]
Needs["Statistics`ContinuousDistributions`"]

2. Load the previously prepared raw expression data, using the following Mathematica code:

dataTable = ReadList["C:\\work\\data.txt", {Word, Number, Number, Number, Number, Number, Number, Number, Number, Number, Number, Number, Number}];

Now the variable "dataTable" is a table (two-dimensional matrix) whose rows represent genes and whose columns hold the probe IDs (column 1) and the expression profiles (columns 2 to 13).

3. Separate probe IDs from expression profiles using the following code:

idList=Transpose[dataTable][[1]];
rawExpressionTable=Transpose[Drop[Transpose[dataTable],{1}]];
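The file-reading step (step 2) can be rehearsed without the data file. The following is a minimal sketch, assuming a two-line string in the same layout as the file but with only three chips. The names toyStream, recordSpec, and toyTable are hypothetical, and the record specification is built with Join and Table rather than typed out.

```mathematica
(* Two records laid out like data.txt, but with only 3 chips (made-up subset) *)
toyStream = StringToStream["1415670_at 313.6 332.7 313.1\n1415671_at 680.4 799 805.5"];

(* One Word (probe ID) followed by one Number per chip *)
recordSpec = Join[{Word}, Table[Number, {3}]];

toyTable = ReadList[toyStream, recordSpec];
Close[toyStream];

Dimensions[toyTable]  (* {2, 4}: two genes, one ID column plus three chips *)
```

For the full data set described here, the same check on "dataTable" should give 22,690 rows and 13 columns. A different result usually means that the file contains a header row or a different number of chips.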

Table 1

Summary of Microarray Studies on Circadian Rhythms

Authors

Year Sample

DNA Chip Design

Analysis method

Schaffer et al. (19) 2001

Claridge-Chang 2001

McDonald and Rosbash (20) 2001

Grundschober et al. (21) 2001

Humphries et al. (15) 2002

Duffield et al. (17) 2002

Arabidopsis Arabidopsis

Rat-1 fibroblasts HDO

Rat liver cDNA Rat kidney

Rat pineal gland cDNA

Mouse liver cDNA Mouse hypothalamus

Rat-1 fibroblasts cDNA

Mouse heart Mouse liver Mouse SCN Mouse liver Drosophila head

Mouse SCN Mouse liver

12 time-points, 4-h interval, LD followed by DD, n = 3

20 time-points, 4-h interval, DD, n = 1 2 time-points, 12-h interval, LD, n = 1

13 time-points, 4-h interval, DD, n = 1 12 time-points, 4-h interval, LD and DD, n = 1

HDO 6 time-points, 4-h interval, LD, n = 2-3, and DD, n = 2 HDO 12 time-points, 4-h interval, LD and DD, n = 1

Cross correlation with cosine waves Two time-point comparison

Fourier analysis

Cross correlation with cosine waves

Spectral analysis

Two time-point comparison

Two time-point comparison Anchored comparison Moving window analysis Cosine wave fitting Cross correlation with cosine waves

Autocorrelation analysis

Cosine wave fitting

Autocorrelation analysis

Cross correlation with cosine waves

Hirota et al. (16) 2002 Rat-1 fibroblasts

Nowrousian et al. (22) 2003 Neurospora

2003 Mouse liver

2003 Arabidopsis

2004 Mouse liver

HDO 12 time-points, 4-h interval, LD and DD, Cosine wave fitting n = 2

HDO 3 time-points, 0 h, 1 h, 4 h, n = 1 cDNA 5 time-points, 4-h interval, 1 cycle, DD, n = 3 and temperature entrainment HDO 2 time-points, 12-h interval, 1 cycle, DD, n = 1

HDO 2 time-points, 12-h interval, 1 cycle, DD, n = 2-3

Time-point comparison Time-point comparison Cosine wave fitting Two time-point comparison

Time-point comparison Two time-point comparison

Studies are listed by publication date.

HDO, high-density oligonucleotide microarray; cDNA, complementary DNA microarray; SCN, suprachiasmatic nucleus; LD, light-dark; DD, constant darkness.

Table 2

Profiles of Gene Expression Over 2 d at 4-h Intervals

1415670_at    313.6   332.7   313.1   425     599.7   463.8   429.2   324.6   554.4   461.2   575.6   349.5
1415671_at    680.4   799     805.5   1019.7  1031.7  1008.5  1006.5  707.5   756.8   1123.4  1195.1  675
1415672_at    1281.6  1484.1  872.7   1058.8  1184    1084    1227.2  931.4   1059.4  1214.8  1203    764.2
1415673_at    124.3   95.3    80.4    110.3   132.9   112     103.9   58      64.9    108.1   101.5   65.4
1415674_a_at  307.4   335.8   312.1   376.6   350.2   340.7   394.6   289     284.8   385     375.8   245.7
1415675_at    258.8   229.4   231.9   282.3   245.4   271.7   315.5   228.3   167.5   227.4   242.4   170.9
1415676_a_at  1094.3  1415.7  1330.2  1327.6  1242.9  1221.8  1722.2  1248.4  1092.6  1446.3  1311.5  1173.6
1415677_at    441.3   480.6   557.8   737.4   434.2   523.2   789.9   635     372.1   850.9   524.1   625
1415678_at    828.6   930.7   884     967.3   950     818.9   749.2   687     685     984.4   792.3   570.2
1415679_at    1274.7  1409.7  1202    1358.2  1286.6  1249.3  1500.2  993.2   1185.1  1428.4  1565.9  958.2

This table was created with Microsoft Excel. The first column shows the Affymetrix probe set IDs, and the following 12 columns give the expression level of each gene at each time-point. The first 10 of 22,690 rows are shown here. The header row is omitted to simplify the Mathematica code.

Fig. 1. Schematic representation of the normalization procedure. Gene expression data from two different chips are shown before (A) and after (B) normalization to illustrate how these procedures transform the data sets. The normalized distributions, shown in (B), are shifted and aligned at their centers. Gene expression comparisons between the two distributions can now be made without systematic experimental bias. Axis label in both panels: log_e (expression).

In step 3, the first statement transposes dataTable (exchanging its rows and columns) and extracts the first row of the result, which holds the probe IDs. The second statement transposes dataTable, drops that first row, and transposes the result back. The resulting "idList" is a list of probe IDs, and "rawExpressionTable" is a table (two-dimensional matrix) whose rows represent genes and whose columns correspond to the 12 chips, so that each row is the expression profile of one gene.
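The separation step can be traced on a small table. This is a minimal sketch with made-up values: toyTable, toyIds, and toyExpr are hypothetical stand-ins for "dataTable", "idList", and "rawExpressionTable", with three genes and two chips.

```mathematica
(* Hypothetical 3-gene, 2-chip table: probe ID followed by expression levels *)
toyTable = {{"g1", 10., 20.}, {"g2", 30., 40.}, {"g3", 50., 60.}};

(* Same two statements as in step 3, applied to the toy table *)
toyIds  = Transpose[toyTable][[1]];
toyExpr = Transpose[Drop[Transpose[toyTable], {1}]];

toyIds               (* {"g1", "g2", "g3"} *)
toyExpr              (* {{10., 20.}, {30., 40.}, {50., 60.}} *)
Dimensions[toyExpr]  (* {3, 2}: genes in rows, chips in columns *)

(* Equivalent formulation that removes the first element of every row *)
Map[Rest, toyTable] == toyExpr  (* True *)
```

The Map[Rest, ...] form avoids the two transpositions. Either version leaves the genes in the same order as "idList", which is what allows IDs and profiles to be matched up again later.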

### 3.3.2. Equalize Average or Median of Each Chip

1. Scale the level of expression of each probe so that the average expression level for each chip becomes 1000 (see Note 1) using the following Mathematica code:

normalizationFactors=1000/Mean[rawExpressionTable];
normalizedExpressionTable=rawExpressionTable.DiagonalMatrix[normalizationFactors];

The first line calculates the scaling factors, one per chip, and stores them in a vector. The second line multiplies "rawExpressionTable" by a diagonal matrix of the scaling factors to produce "normalizedExpressionTable," whose rows contain the normalized expression profile of each gene.

Alternatively, scale the expression levels for each probe so that the median of each chip becomes 1000 (see Note 1), using the following Mathematica code:

normalizationFactors=1000/Median[rawExpressionTable];
normalizedExpressionTable=rawExpressionTable.DiagonalMatrix[normalizationFactors];
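As a quick check of both variants, the sketch below applies them to a small made-up table of five genes on two chips, in which the second chip is exactly twice as bright as the first. The names toyExpr, meanFactors, meanNorm, medianFactors, and medianNorm and all values are hypothetical. Mean and Median are assumed to act column-wise on a matrix, as they do in the code above.

```mathematica
(* Hypothetical raw data: rows are genes, columns are chips; chip 2 = 2 x chip 1 *)
toyExpr = {{50., 100.}, {50., 100.}, {100., 200.}, {300., 600.}, {500., 1000.}};

(* Mean normalization: column means are {200., 400.} *)
meanFactors = 1000/Mean[toyExpr];                    (* {5., 2.5} *)
meanNorm = toyExpr . DiagonalMatrix[meanFactors];
Mean[meanNorm]                                       (* {1000., 1000.} *)

(* Median normalization: column medians are {100., 200.} *)
medianFactors = 1000/Median[toyExpr];                (* {10., 5.} *)
medianNorm = toyExpr . DiagonalMatrix[medianFactors];
Median[medianNorm]                                   (* {1000., 1000.} *)

(* Row-wise scaling gives the same result without building a diagonal matrix *)
meanNorm == Map[# meanFactors &, toyExpr]            (* True *)
```

In both cases the two columns become identical after normalization, which is the expected outcome when chips differ only by an overall scale factor. The two methods give different absolute values here because the toy distribution is skewed (mean 200, median 100 on chip 1), so the choice between them matters when a few highly expressed genes dominate the average.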