Generate sample matrix
The generate_sample_matrix.py script generates a sample matrix from the file names in a given directory. In Illumina sequencing, it is common to use file names to identify samples, but the file names are often hard to understand and may not reflect the properties of the samples.
Here, given a directory containing files, the script will generate a sample matrix with the following columns:
- Sample: the original sample name
- Label: the user-defined label
- Group: the user-defined group
- Replicate: the user-defined replicate
- Batch: the user-defined batch
- Mark: the user-defined mark
- PeakType: the user-defined peak type
- FileName: the original file name
The script uses all files in the directory whose names end with the given file suffix.
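To make the column rules concrete, here is a minimal sketch of how such a matrix could be built. It is not the script's actual implementation: the function names `build_rows` and `write_matrix` are hypothetical, and the sketch assumes that the sample name is the file name with the suffix stripped, that Label starts as a copy of Sample, and that the remaining user-defined columns start empty.

```python
import csv
from pathlib import Path

# Column order as listed above.
COLUMNS = ["Sample", "Label", "Group", "Replicate",
           "Batch", "Mark", "PeakType", "FileName"]


def build_rows(directory, suffix):
    """Return one row per file in `directory` whose name ends with `suffix`."""
    rows = []
    for path in sorted(Path(directory).iterdir()):
        name = path.name
        if not path.is_file() or not name.endswith(suffix):
            continue
        # Assumption: the sample name is the file name minus the suffix.
        sample = name[: len(name) - len(suffix)]
        row = dict.fromkeys(COLUMNS, "")  # user-defined columns start empty
        row.update(Sample=sample, Label=sample, FileName=name)
        rows.append(row)
    return rows


def write_matrix(rows, output_file):
    """Write the rows as CSV with a header line."""
    with open(output_file, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)
```

Files are sorted by name so that the output is stable across runs; the real script may order rows differently.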
Usage
python3 sample_matrix.py <directory> <suffix> <output_file>
For example, given a directory ./data containing the following files:
sample1_rep1_trimmed.fastq.gz
sample2_rep2_trimmed.fastq.gz
sample3_rep1_trimmed.fastq.gz
We can generate a sample matrix with the following command:
python3 sample_matrix.py ./data _trimmed.fastq.gz sample_matrix.csv
The generated sample_matrix.csv will look like this:
| Sample | Label | Group | Replicate | Batch | Mark | PeakType | FileName |
|---|---|---|---|---|---|---|---|
| sample1_rep1 | sample1_rep1 | | | | | | sample1_rep1_trimmed.fastq.gz |
| sample2_rep2 | sample2_rep2 | | | | | | sample2_rep2_trimmed.fastq.gz |
| sample3_rep1 | sample3_rep1 | | | | | | sample3_rep1_trimmed.fastq.gz |
This sample matrix is applicable to most types of analysis. However, some columns may not apply to a particular analysis, and the user can remove them from the sample matrix.
The script automatically populates the Label column with the same content as the Sample column. However, users can modify the Label column to better reflect the properties of the samples. The Label column is used to identify the files in the sample matrix, so make sure that every Label is unique and matches the corresponding file name without its extension.
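After editing labels by hand, it can help to check for accidental duplicates. The following is a minimal sketch, assuming the matrix is a CSV file with a header row that contains a Label column; the helper name `duplicate_labels` is hypothetical and not part of the scripts described here.

```python
import csv
from collections import Counter


def duplicate_labels(matrix_file):
    """Return the sorted list of Label values that occur more than once."""
    with open(matrix_file, newline="") as handle:
        labels = [row["Label"] for row in csv.DictReader(handle)]
    counts = Counter(labels)
    return sorted(label for label, n in counts.items() if n > 1)
```

An empty list means every label is unique.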
If Label has been modified in the sample matrix, ensure that you also run file_name_conversion.py to rename the files in the directory to match the Label column in the sample matrix. For details, see here.
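To preview which files a modified Label column would affect, a dry-run listing can be useful before renaming anything. The sketch below does not reproduce the logic of file_name_conversion.py. It assumes the matrix is a CSV file with Label and FileName columns and that the target name is the label followed by the same suffix used to generate the matrix; `plan_renames` is a hypothetical helper.

```python
import csv


def plan_renames(matrix_file, suffix):
    """Return (old_name, new_name) pairs for rows whose file name would change.

    Nothing is renamed; this only reports the differences.
    """
    plan = []
    with open(matrix_file, newline="") as handle:
        for row in csv.DictReader(handle):
            # Assumption: a renamed file is named Label + suffix.
            new_name = row["Label"] + suffix
            if new_name != row["FileName"]:
                plan.append((row["FileName"], new_name))
    return plan
```

Rows whose Label still equals the original sample name produce no entry, so an empty result means no renaming is needed.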