The FASTA sequence matrix to be processed looks like this (the data has been altered and is not real):
>CPBOL076-11|JF952978|LiujqAB08|Abies squamata|matK
CGATCGTTGTCTCATGGATCTTTCCCTGATCAAACTCATTTTGATCGAAAGATCAAACATATTATCAGAAATTATCGTCG
AAATTCACTGAAAAGTATCTGGTCGTTGAAGGATCCTAGAATTCACTATGTTAGATATGCAGAAAGATCTATTATAGCTA
TAAAGGGTACTCATCTCCTAGTGAAAAAATGTAGATATCATCTTCCAATTTTTCGGCAATTTTATTTCCATCTTTGGTCC
GAACCATATAGGGTATGTTCTCATCAATTATC
>CPBOL121-11|JF953012|Yangqe0222|Aconitum angustius|matK
GCCCCCTTTTTGCACTTATTGAGACTCTTTCTCTACGAGTATCATCATTGGAATATTCTTATTACTCAAAAAAATCAAAT
GAATTTCTTTTTTTCAAAAGAGAATCAAAGATTTTTTCTGTTCCTATATAATTTTCATGTATATGAATCGGAATCCATAT
TCGTTTTTCTCCGTAAACAATCTTCTCATTTACGATCAACATCCTCTAGAGCTTTTCTTGATCGAACAC
...
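The matrix is made up of multiple sequence records. Each record consists of one comment line (starting with >) followed by one or more lines of sequence data.
What the program has to do: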
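Temporarily change the line terminator (Perl's input record separator $/) and discard every character before the first record:
$/ = ">";     # read up to and including the first ">"
<>;           # throw away whatever precedes the first record
$/ = "\n";    # go back to reading line by line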
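The effect of switching $/ can be checked in isolation. The following is only a small sketch, not part of the original script: it reads from a made-up in-memory string instead of a real file, and it uses `local` so that $/ is restored automatically at the end of the block rather than by hand.

```perl
use strict;
use warnings;

# Hypothetical in-memory input with junk text before the first record.
my $data = "junk line\n>id1|desc\nACGT\n";
open my $fh, '<', \$data or die "cannot open in-memory file: $!";

my $skipped;
{
    local $/ = ">";      # read up to and including the first ">"
    $skipped = <$fh>;    # this is the part that gets thrown away
}                        # $/ is back to "\n" from here on

my $header = <$fh>;      # an ordinary line read now yields the header
chomp $header;

print "skipped: [$skipped]\n";
print "header:  [$header]\n";
```

Note that the header comes back without its leading ">", because that character was consumed as the record separator.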
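Extract the information from the comment line:
while (<>) {
    my $sample_id  = "";
    my $genbank_no = "";
    my $seq        = "";
    # Capture the 2nd and 3rd "|"-separated fields of the comment line.
    if (/^.+\|(.+)\|(.+)\|.+\|.+$/) {
        $genbank_no = $1;
        $sample_id  = $2;
    } else {
        next;    # not a comment line in the expected format
    }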
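As an alternative to the regular expression, the comment line can also be taken apart with `split`. This is a sketch under assumptions: `parse_header` is a hypothetical helper, and the hash keys `process_id`, `species` and `marker` are names invented here for the fields the original script does not use. Only `sample_id` and `genbank_no` come from the script itself.

```perl
use strict;
use warnings;

# Hypothetical helper: split a comment line into its five "|"-separated fields.
sub parse_header {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;             # strip a trailing LF or CRLF
    $line =~ s/^>//;                  # tolerate a leading ">"
    my @f = split /\|/, $line, -1;    # -1 keeps empty trailing fields
    return unless @f == 5;            # anything else is not a header of this form
    my %h;
    # Field names other than sample_id/genbank_no are assumptions.
    @h{qw(process_id genbank_no sample_id species marker)} = @f;
    return \%h;
}

my $h = parse_header("CPBOL076-11|JF952978|LiujqAB08|Abies squamata|matK\n");
print "$h->{sample_id}\t$h->{genbank_no}\n";
```

Unlike the greedy `.+` pattern, this version rejects lines that do not have exactly five fields, which may or may not be what you want for your own data.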
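Read the sequence data and strip every unwanted character. Because the sequence data may span several lines, the line terminator has to be changed temporarily once more:
    $/ = ">";
    $seq = <>;          # everything up to and including the next ">"
    $/ = "\n";
    $seq =~ s/\s//g;    # all whitespace, including \r and \n
    $seq =~ s/-//g;     # "-" characters
    $seq =~ s/~//g;     # "~" characters
    $seq =~ s/>$//;     # the ">" that ended the read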
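The chain of substitutions above can also be written with a single `tr///d`. The sketch below is one possible way to do it, not the original author's code: `clean_seq` is a hypothetical helper name, and it additionally returns an empty string when there is no data at all (for example a comment line at the very end of the file).

```perl
use strict;
use warnings;

# Hypothetical helper: remove whitespace, "-", "~" and the trailing ">".
sub clean_seq {
    my ($seq) = @_;
    return "" unless defined $seq;    # comment line with no data after it
    $seq =~ s/>\z//;                  # separator left over from $/ = ">"
    $seq =~ tr/ \t\r\n\f~-//d;        # delete ASCII whitespace, "~" and "-"
    return $seq;
}

print clean_seq("ACG-T\r\nAC~GT \n>"), "\n";
```

Here the `-` is placed last in the `tr` list so that it is taken literally instead of as a range operator.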
Write the target file:
    my $new_sample_id = "CPBOL2010-" . $sample_id;
    my $filename = $new_sample_id . ".fasta";
    # Lexical filehandle, and stop if the file cannot be created.
    open my $out, ">", $filename or die "Cannot open $filename: $!";
    print {$out} ">" . $new_sample_id . "\r\n";
    print {$out} $seq . "\r\n";
    close $out;
Print the mapping between file name, new sample ID and GenBank number:
    print "$filename\t\t\t$new_sample_id\t$genbank_no\r\n";
Move on to the next record, until the input is exhausted:
}
exit 0;
Done.
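For comparison, the same job can be done without switching $/ back and forth, by reading one whole record per iteration. This is only a sketch of that alternative design, under assumptions: `read_records` is a hypothetical helper, the input is a shortened in-memory sample, and the records are collected in a list and printed instead of being written to files.

```perl
use strict;
use warnings;

# Hypothetical helper: read every record from a filehandle in a single pass.
sub read_records {
    my ($fh) = @_;
    local $/ = ">";                      # one read per FASTA record
    my @records;
    while (my $chunk = <$fh>) {
        chomp $chunk;                    # chomp removes the current $/, i.e. ">"
        my ($header, $body) = split /\r?\n/, $chunk, 2;
        next unless defined $header && defined $body;
        next unless $header =~ /^.+\|(.+)\|(.+)\|.+\|.+$/;
        my ($genbank_no, $sample_id) = ($1, $2);
        $body =~ tr/ \t\r\n\f~-//d;      # strip whitespace, "~" and "-"
        push @records, {
            sample_id  => $sample_id,
            genbank_no => $genbank_no,
            seq        => $body,
        };
    }
    return @records;
}

# Shortened, made-up sample in the same layout as the matrix above.
my $data = ">CPBOL076-11|JF952978|LiujqAB08|Abies squamata|matK\nCGAT-CG\nTTGT\n"
         . ">CPBOL121-11|JF953012|Yangqe0222|Aconitum angustius|matK\nGCCC\n";
open my $fh, '<', \$data or die "cannot open in-memory file: $!";

for my $r (read_records($fh)) {
    print "CPBOL2010-$r->{sample_id}\t$r->{genbank_no}\t$r->{seq}\n";
}
```

The first chunk read is whatever precedes the first ">", so it has no header line and is skipped, which replaces the explicit "discard" step of the original script.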