### Overview

Two different types of processing were explored in the present study: a hybrid method utilizing Z-Score standardization and least squares regression (Z-LSR), as well as the established method of Singular Value Decomposition (SVD) and reconstruction by eigenvectors. Images must also be pre-processed to remove spurious signals due to cosmic rays. Image post-processing includes contrast normalization and optionally false color-mapping, or bicubic interpolation if required [12], as illustrated in figure 7.

### Raman Data acquisition

Cultured HeLa cells were grown in Dulbecco's modified Eagle's medium (DMEM; Nissui) supplemented with 10% fetal calf serum, 2 mM glutamine, 100 units/mL penicillin, and 100 μg/mL streptomycin, in a humidified atmosphere (5% CO2, 95% air) at 37°C. 1 mm thick, 25 mm diameter quartz substrates were used as cell culture substrates for low background signal during Raman microscopy. Raman imaging was performed using the 532 nm excitation laser of a confocal Raman micro-spectroscopy system (Raman-11, Nanophoton). The excitation source was a semiconductor laser operating at 532 nm, with a power of up to 300 mW. The laser beam was focused into a line illumination pattern and irradiated the sample via a water-immersion objective lens (Nikon, 100X, 1.0NA). Scattered light was collected with the same objective and guided into a Czerny-Turner-type spectrometer with a focal length of 50 cm, and the final spectroscopic data was collected by an electrically cooled CCD camera (Princeton Instruments, Pixis 400). The overall spectral resolution was 1.6 cm^{-1}, and optical resolution was approximately 350 nm. Due to the line illumination configuration of the beam, laser power or intensity at the focus was difficult to measure but was between 0.1 and 2.5 mW per square micrometer for all sample images.

Spectral data wavenumber calibration was evaluated by collecting the Raman spectra of ethanol and comparing peak positions to ethanol spectra in the Spectral Database for Organic Compounds (SDBS) [13] with peaks matching within three wavenumbers as follows: (comparison shows measured/database peak pairs in cm^{-1}) 1094/1097, 1452/1455, 2877/2876, 2927/2927, and 2971/2973.

### Cosmic Ray Reduction

There are numerous sources of noise in acquired Raman datasets. These include the granular nature of the low photon numbers, thermal noise in the detection, background Raman or fluorescence signals from quartz substrates along with other sources of noise that are more or less evenly distributed throughout the dataset. The most corrupting noise type in the datasets tends to be due to cosmic rays impinging on the detector. The cosmic rays usually corrupt only a single pixel or small group of pixels, replacing the data value with an erroneous value often an order of magnitude higher than the Raman data values. The cosmic ray corruption therefore appears as a spurious delta function isolated in both x-y dimension and spectral value. The cosmic ray reduction pre-processing step removes these outlier values that corrupt or wash out valid signal peak information corresponding to useful biological macrophage detail. Cosmic rays are an inevitable result of long exposure times, which are required in Raman microscopy, and there are many potential approaches to removing such corruption of the data. Keeping in mind the targets of automated data processing and scalability, we used a simple heuristic algorithm determined by the specific physical nature of the cosmic ray based noise, which primarily manifests as large deviations in pixel value but only affects isolated pixels. The algorithm is as follows: for a pixel in x-y, any wavenumber value which lies outside 3 standard deviations from the average for all wavenumbers in that pixel are deemed to be outliers and are assumed to be cosmic ray based corruption of the data. To remove such cosmic ray outliers, we compute the mean and variance. After the mean and variance is determined, if the deviation for any wavenumber is above the chosen threshold (typically 3 standard deviations) then we consider it a cosmic ray. Cosmic rays are replaced by the last value, which lay within the threshold (i.e. the last wavenumber value which was not a cosmic ray). This has the effect of replacing cosmic ray noise with local "trusted" data values, since cosmic rays typically only modify one or two adjacent pixels in the data stack.

### Standardized Regression (Z-LSR)

In an attempt to manage the aforementioned drawbacks of SVD, we applied statistical methods utilizing Z-Score standardization and least squares regression [12] which we refer to as the Z-LSR method. The computational complexity of this method is Ο(*MN*) such that *M* is the number of (*x*, *y*) points (e.g. 77 × 96) and *N* is the length of the w-dimension (e.g. 1,340) offering substantial performance improvement with similar (figure 2) or improved noise reduction compared with the SVD approach.

For each (*x*, *y*) position in the 3D Raman data stack, there is a w-axis spectral vector denoted as . The first step is to compute the z-score standardization from the spectral vector as

where the mean (μ) and standard deviation (σ) are computed for the specific vector . The output vector has the same length as the input vector . The second step is to compute the least squares regression slope *m* for each vector . The slope associates a single scalar value with each standardized spectral vector, the hypothesis being that similar vector spectra in the data stack will have similar slopes. The slope is then given by

The third and last step is to produce a new Z-LSR spectral vector by multiplying all values in the vector by their associated slope *m*. This process is repeated for all z-score vectors until the Raman data stack has been replaced with all of the vectors. Ultimately, the image representation is revealed by consistent regression trends multiplied by the standardized value.

### Automated Color-Mapping

The automated color-mapping was done by searching through the data to locate the strongest three peaks, for all spectral data. Red, green and blue channels were then assigned respectively on a per-pixel basis so that the pixel color corresponded to the relative strengths of the previously defined three peaks. This produced fully automated color imaging based on the strongest spectral information in the data. Due to the differences in sample composition and varying amounts of cell to substrate signal in different experiments, automated color-mapping produced different pseudocolor images for different data sets. This differences were determined predominantly by the ratio of the cell to substrate area in the dataset, since the strongest contribution to the spectra came from the background if there was little cellular material in the data set and vice versa for datasets where the substrate was covered by cells. We also experimented with mapping groups of peaks to each individual color channels, but have shown only color images where the top three peaks were mapped to a single color channel.

### SVD

SVD [14] begins with the construction of matrix *A*
_{
(N × M)
} such that n is the number of rows and m is the number of columns. This matrix is then decomposed into three matrices defined by the equation: such that *U* represents the left singular vectors, *S* represents the singular values in non-increasing order along the diagonal, and the transpose of *V* represents the right singular vectors. In the present work, the matrix *A* was constructed from the raw Raman spectra. For example, the data set depicted in figure 1 contains the dimensions 77 × 96 × 1340; therefore, Matrix *A* will become a 2D representation of this data set with 7,392 columns (i.e. the X/Y planes are constructed as a 1D piecewise linear array) and 1,340 rows representing the spectral data: *A*
_{
(1340 × 7392)
}. The computation of the SVD means solving for the eigenvalues and eigenvectors of the matrices *AA*
^{T} (i.e. the columns of *U*) and *A*
^{T}
*A* (i.e. the columns of *V*). The concept of orthogonality can be expressed by the equation: *e*
^{T}
*e* = *I* such that *e* is an eigenvector and *I* is the identity matrix. This is the underlying reason why we can separate the Raman spectra into 'noise' and 'signal'.

Identifying the eigenspectra of interest was done by selecting the eigenspectra with the highest root-mean-squared deviation (RMSD). Given a vector with *N* spectral values, such that one vector exists for each eigenspectra reconstruction, we sorted the eigenvector index position using its corresponding RMSD value defined as , and sorted eigenvectors in descending RMSD value. This approach empirically performed well because noisy eigenspectra oscillate very near to zero, whereas signals arising from bio-molecules of interest, have positive intensities (e.g., figure 3C).