Naming conventions

FASTQ/A Files - WORK IN PROGRESS

FASTQ and FASTA files contain biological sequence data, usually nucleotide or amino acid sequences. FASTQ files contain sequence quality information (for example, Phred/Q scores), while FASTA files contain only the sequence.

For sequence data generated by the Ocean DNA genome skimming project, it is helpful to have a universal format for file names. This will facilitate downstream file parsing and data management.

Ideas for required fields: - Voucher/Catalog ID - Taxonomic ID (Family-Genus-Species?), no abbreviations - anything else?

Need to consider the issue of when the same sample is sequenced twice. Include a unique identifier based on sequencing batch, plate, and well?

NO SPACES in file name. Use consistent delimiter (underscore? period? dash?) to separate fields in the file name. Delimiter must NOT be used within the fields. If a field is unknown for a specific sample, insert NA, do not skip the field.

Example: NMNH-12345_Percidae-Etheostoma-olmstedi_OtherStuff.fastq.gz

Why is this helpful? Let’s imagine you wanted to create a table to species names from a recent sequencing run. With a consistent naming scheme for your files, it’s easy!

# print list of unique species in my FASTQ files
find /PATH/TO/MY/DIRECTORY/*.fastq.gz -printf "%f\n" | cut -f2 -d"_" | cut -f2,3 -d "-" | sort | uniq

Please keep FASTQ/A files gzip compressed to minimize disk space.

Compress a file: gzip fileName.fastq

Decompress a file (if needed): gunzip fileName.fastq.gz