Converting a GTF file to a GFF file

The script comes from: https://metacpan.org/pod/distribution/GBrowse/bin/gtf2gff3.pl

/opt/biosoft/GBrowse-2.54/bin/gtf2gff3.pl --cfg gtf2gff3.cfg hg19_genesymbol.gtf > test.gff3
# Note: before running the .pl script, first add this line to it:
use Config::Std;
# Then run the command above.
# Sample cfg file:


###################################################################################################
#This config file allows the user to customize the gtf2gff3
#converter.

[INPUT_FEATURE_MAP]
#Use INPUT_FEATURE_MAP to map your GTF feature types (column 3 in GTF) to valid SO types.
#Don't edit the SO tags below.
#Mapping must be many to one.  That means that exon_this and exon_that could both
#map to the SO exon tag, but exon_this could not map to multiple SO tags.

#GTF Tag                  #SO Tag
gene                      = gene
mRNA                      = mRNA
exon                      = exon
five_prime_utr            = five_prime_utr
start_codon               = start_codon
CDS                       = CDS
stop_codon                = stop_codon
three_prime_utr           = three_prime_utr
3UTR                      = three_prime_utr
3'-UTR                    = three_prime_UTR
5UTR                      = five_prime_utr
5'-UTR                    = five_prime_UTR
ARS                       = ARS
binding_site              = binding_site
BLASTN_HIT                = nucleotide_match
CDS_motif                 = nucleotide_motif
CDS_parts                 = mRNA_region
centromere                = centromere
chromosome                = chromosome
conflict                  = conflict
Contig                    = contig
insertion                 = insertion
intron                    = intron
LTR                       = long_terminal_repeat
misc_feature              = sequence_feature
misc_RNA                  = transcript
nc_primary_transcript     = nc_primary_transcript
ncRNA                     = ncRNA
nucleotide_match          = nucleotide_match
polyA_signal              = polyA_signal_sequence
polyA_site                = polyA_site
promoter                  = promoter
pseudogene                = pseudogene
real_mRNA                 = mRNA
region                    = region
repeat_family             = repeat_family
repeat_region             = repeat_region
repeat_unit               = repeat_region
rep_origin                = origin_of_replication
rRNA                      = rRNA
snoRNA                    = snoRNA
snRNA                     = snRNA
source                    = sequence_feature
telomere                  = telomere
transcript_region         = transcript_region
transposable_element      = transposable_element
transposable_element_gene = transposable_element
tRNA                      = tRNA

[GTF_ATTRB_MAP]
#Maps attribute keys to keys used internally in the code.
#Don't edit the code tags.
#Note that the gene_id and transcript_id tags tell the script
#who the parents of a feature are.

#Code Tag    #GTF Tag
gene_id    = gene_id
gene_name  = gene_name
trnsc_id   = transcript_id
trnsc_name = transcript_name
id         = ID
parent     = Parent
name       = Name

[GFF3_ATTRB_MAP]
#Maps tags used internally to output GFF3 attribute tags.
#Also, when LIMIT_ATTRB is set to 1, only these tags will be
#output to the GFF3 attributes column.

#Code Tag  #GFF3 Tag
PARENT   = Parent
ID       = ID
NAME     = Name

[MISC]
# Limit the attribute tags printed to only those in the GFF3_ATTRB_MAP
LIMIT_ATTRB     = 0 
#A perl regexp that splits the attributes column into separate attributes.
ATTRB_DELIMITER = \s*;\s*
#A perl regexp that captures the tag value pairs.
ATTRB_REGEX     = ^\s*(\S+)\s+(\"[^\"]*\")\s*$
#If CDSs are annotated in the GTF file, are the start codons already included (1=yes 0=no)
START_IN_CDS    = 1
#If CDSs are annotated in the GTF file, are the stop codons already included (1=yes 0=no)
STOP_IN_CDS     = 0
###################################################################################################
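
The [INPUT_FEATURE_MAP] section only helps if the types in column 3 of your GTF file actually appear on its left-hand side. The following is a minimal Python sketch for checking that before running the converter. It is not part of gtf2gff3.pl: `read_section` and `unmapped_types` are hypothetical helper names, the tiny parser only understands `[SECTION]` headers, `#` comments and `key = value` lines, and what the real script does with an unmapped type is not covered here.

```python
import re

def read_section(cfg_text, section):
    """Collect 'key = value' pairs from one [SECTION] of cfg-style text."""
    pairs = {}
    current = None
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        header = re.fullmatch(r"\[(\w+)\]", line)
        if header:
            current = header.group(1)
            continue
        if current == section and "=" in line:
            key, value = line.split("=", 1)
            pairs[key.strip()] = value.strip()
    return pairs

def unmapped_types(gtf_lines, feature_map):
    """Return the column-3 types that have no entry in the feature map."""
    seen = set()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 3:
            seen.add(cols[2])
    return seen - set(feature_map)

# Shortened, made-up config and GTF lines, for illustration only.
cfg = """
[INPUT_FEATURE_MAP]
exon = exon
CDS  = CDS
3UTR = three_prime_utr
[MISC]
LIMIT_ATTRB = 0
"""
fmap = read_section(cfg, "INPUT_FEATURE_MAP")
gtf = [
    'chr1\tsrc\texon\t1\t10\t.\t+\t.\tgene_id "g"; transcript_id "t";',
    'chr1\tsrc\tSelenocysteine\t4\t6\t.\t+\t.\tgene_id "g"; transcript_id "t";',
]
print(unmapped_types(gtf, fmap))
```

Because each GTF type is a dictionary key, it can point to only one SO type, while several keys may share the same value. That is the many-to-one rule described in the config comments.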


# Error explanations
DIAGNOSTICS
ERROR: Missing or non-standard attributes: parse_attributes
A line in the GTF file did not have any attributes, or its attributes column was unparsable.
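
To see which attributes would fail to parse, the two patterns from the [MISC] section of the sample config (ATTRB_DELIMITER and ATTRB_REGEX) can be applied by hand. This is a rough Python sketch, assuming that Python's `re` treats these two patterns the same way Perl does. The function name `parse_attributes` is borrowed from the error message, but the body is my own illustration and not the script's code.

```python
import re

# Patterns copied from the [MISC] section of the sample config.
ATTRB_DELIMITER = re.compile(r"\s*;\s*")
ATTRB_REGEX = re.compile(r'^\s*(\S+)\s+(\"[^\"]*\")\s*$')

def parse_attributes(column):
    """Split GTF column 9 and return (parsed pairs, rejected chunks)."""
    pairs, rejects = [], []
    for chunk in ATTRB_DELIMITER.split(column.strip()):
        if not chunk:
            continue  # the trailing ';' leaves an empty final chunk
        match = ATTRB_REGEX.match(chunk)
        if match:
            pairs.append((match.group(1), match.group(2)))
        else:
            rejects.append(chunk)
    return pairs, rejects

print(parse_attributes('gene_id "A1BG"; transcript_id "NM_130786";'))
print(parse_attributes('gene_id "g1"; exon_number 1;'))
```

Two things follow from the patterns themselves: the captured value keeps its double quotes, and an unquoted value such as `exon_number 1` does not match ATTRB_REGEX. A quoted value that contains a semicolon would also be cut in two by ATTRB_DELIMITER.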

ERROR: Non-transcript gene feature not supported. Please contact the author for support: build_gene
This warning indicates that a line was skipped because it contained a non-transcript gene feature, and the code is not currently equipped to handle this type of feature. This probably isn't too hard to add, so contact me if you get this error and would like to have these features supported.

ERROR: Must have at least exons or CDSs to build a transcript: build_trnsc
Some feature had a transcript_id and yet there were no exons or CDSs associated with that transcript_id so the script failed to build a transcript.
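
A quick way to find the offending transcripts before conversion is to collect the feature types seen for each transcript_id. This is only a sketch under stated assumptions: the helper name is hypothetical, the feature types are assumed to be spelled exactly `exon` and `CDS` in column 3 (that is, before any INPUT_FEATURE_MAP renaming), and the transcript_id is pulled out with a simple regular expression rather than the script's own parser.

```python
import re
from collections import defaultdict

TRANSCRIPT_ID = re.compile(r'\btranscript_id\s+"([^"]*)"')

def transcripts_without_exon_or_cds(gtf_lines):
    """Return transcript_ids that have neither an exon nor a CDS feature."""
    types = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GTF record
        match = TRANSCRIPT_ID.search(cols[8])
        if match:
            types[match.group(1)].add(cols[2])
    return sorted(t for t, seen in types.items() if not seen & {"exon", "CDS"})

# Made-up records: T2 has only a start_codon.
example = [
    'chr1\tsrc\texon\t1\t10\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr1\tsrc\tstart_codon\t20\t22\t.\t+\t.\tgene_id "G2"; transcript_id "T2";',
]
print(transcripts_without_exon_or_cds(example))
```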

ERROR: seq_id conflict: validate_and_finish_trnsc
Found two features within the same transcript that didn't share the same seq_id.

ERROR: source conflict: validate_and_finish_trnsc
Found two features within the same transcript that didn't share the same source.

ERROR: type conflict: validate_and_finish_trnsc
Found two features within the same transcript that were expected to share the same type and yet they didn't.

ERROR: strand conflict: validate_and_finish_trnsc
Found two features within the same transcript that didn't share the same strand.

ERROR: seq_id conflict: validate_and_build_gene
Found two features within the same gene that didn't share the same seq_id.

ERROR: source conflict: validate_and_build_gene
Found two features within the same gene that didn't share the same source.

ERROR: strand conflict: validate_and_build_gene
Found two features within the same gene that didn't share the same strand.

ERROR: gene_id conflict: validate_and_build_gene
Found two features within the same gene that didn't share the same gene_id.
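
The seq_id, source and strand conflicts above can be looked for in the input before running the converter. The following Python sketch groups records by transcript_id (or gene_id) and reports any ID whose features disagree on one of those three columns. It is an illustration with hypothetical names, not the script's own validation logic, and it does not attempt the type or gene_id checks. One way such a conflict can arise is when the same ID is annotated at more than one locus.

```python
import re
from collections import defaultdict

FIELDS = {"seq_id": 0, "source": 1, "strand": 6}  # GTF column indexes

def find_conflicts(gtf_lines, key="transcript_id"):
    """Report IDs whose features disagree on seq_id, source or strand."""
    pattern = re.compile(r"\b" + re.escape(key) + r'\s+"([^"]*)"')
    seen = defaultdict(lambda: {name: set() for name in FIELDS})
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GTF record
        match = pattern.search(cols[8])
        if not match:
            continue
        for name, index in FIELDS.items():
            seen[match.group(1)][name].add(cols[index])
    return sorted(
        (ident, name, sorted(values))
        for ident, fields in seen.items()
        for name, values in fields.items()
        if len(values) > 1
    )

# Made-up records: T1 / G1 appear on two different sequences.
example = [
    'chrX\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chrY\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr1\tsrc\texon\t5\t50\t.\t-\t.\tgene_id "G2"; transcript_id "T2";',
]
print(find_conflicts(example))
print(find_conflicts(example, key="gene_id"))
```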

FATAL: Can't open GTF file: file_name for reading.
Unable to open the GTF file for reading.

FATAL: Need exons or CDSs to build transcripts: process_start
A start_codon feature was annotated and yet there were no exons or CDSs associated with that transcript_id so the script failed.

FATAL: Untested code in process_start. Contact the aurthor for support.
The script is written to infer a start codon based on the presence of a 5' UTR, but we had no example GTF of this type when we wrote the code, so we killed the process rather than run untested code. Contact the author for support.

FATAL: Invalid feature set: process_start
We tried to consider all possible ways of inferring a start codon or inferring a non-coding gene, and yet we've failed. Your combination of gene features doesn't make sense to us. You should never get this error, and if you do, we'd really like to see the GTF file that generated it. Please contact the author for support.

FATAL: Need exons or CDSs to build transcripts: process_stop
A stop_codon feature was annotated and yet there were no exons or CDSs associated with that transcript_id so the script failed.

FATAL: Untested code in process_stop. Contact the aurthor for support.
The script is written to infer a stop codon based on the presence of a 3' UTR, but we had no example GTF of this type when we wrote the code, so we killed the process rather than run untested code. Contact the author for support.

FATAL: Invalid feature set: process_stop
We tried to consider all possible ways of inferring a stop codon or inferring a non-coding gene, and yet we've failed. Your combination of gene features doesn't make sense to us. You should never get this error, and if you do, we'd really like to see the GTF file that generated it. Please contact the author for support.

FATAL: Invalid feature set: process_exon_CDS_UTR
We tried to consider all possible ways of inferring exons, CDSs and UTRs, and yet we've failed. Your combination of gene features doesn't make sense to us. You should never get this error, and if you do, we'd really like to see the GTF file that generated it. Please contact the author for support.

FATAL: Array reference required: sort_features.
A user shouldn't be able to trigger this error. It almost certainly indicates a software bug. Please contact the author.

FATAL: Can't determine strand in: sort_feature_types.
This may indicate that your GTF file does not indicate the strand for features that require it. It may also indicate a software bug. Please contact the author.

FATAL: Hash reference required: sort_feature_types.
A user shouldn't be able to trigger this error. It almost certainly indicates a software bug. Please contact the author.

FATAL: Invalid value passed to strand: strand.
This may indicate that your GTF file does not indicate the strand for features that require it. Consider using the DEFAULT_STRAND parameter in the config file. It may also indicate a software bug. Please contact the author.
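
To check whether missing strand information is the cause of the two strand errors above, the strand column (column 7) of the GTF file can be scanned for anything other than `+` or `-`. This is a small Python sketch with a hypothetical helper name. It flags every such record, because which feature types the script actually requires a strand for is not stated here.

```python
def lines_with_bad_strand(gtf_lines):
    """Return (line number, feature type, strand) for strands other than + or -."""
    bad = []
    for number, line in enumerate(gtf_lines, start=1):
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 7:
            continue  # too short to contain a strand column
        if cols[6] not in {"+", "-"}:
            bad.append((number, cols[2], cols[6]))
    return bad

# Made-up records: the second one has '.' as its strand.
example = [
    'chr1\tsrc\texon\t1\t10\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr1\tsrc\texon\t20\t30\t.\t.\t.\tgene_id "G1"; transcript_id "T1";',
]
print(lines_with_bad_strand(example))
```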

CONFIGURATION AND ENVIRONMENT
A configuration file is provided with this script. The script will look for that configuration file in ./gtf2gff3.cfg, ~/gtf2gff3.cfg or /etc/gtf2gff3.cfg in that order. If the configuration file is not found in one of those locations and one is not provided via the --cfg flag it will try to choose some sane defaults, but you really should provide the configuration file. See the supplied configuration file itself as well as the README that came with this package for format and details about the configuration file.
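
The lookup order described above can be sketched in Python to check which configuration file would be picked up on a given machine. This is an illustration only: `find_cfg` is a hypothetical name, and letting an explicit path (the equivalent of --cfg) take precedence over the search is my assumption, since the text does not spell out the precedence.

```python
from pathlib import Path

def find_cfg(explicit=None, name="gtf2gff3.cfg"):
    """Return the first config path found, or None if there is none."""
    if explicit is not None:
        return Path(explicit)  # assumption: an explicit path wins
    # Search order from the documentation: ./, ~/, then /etc/
    for directory in (Path.cwd(), Path.home(), Path("/etc")):
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None

print(find_cfg())
```

A result of `None` corresponds to the case where the script falls back to its built-in defaults.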
