There are various known VCF validators; this time I tried the EBI vcf validator.
To start with, I just fed it a VCF file I had on hand.
mijinko@linux:~$ /home/mijinko/ダウンロード/vcf_validator_linux -i '/home/mijinko/ドキュメント/example.vcf'
[info] Reading from input file...
Lines read: 10000
Lines read: 20000
Lines read: 30000
Lines read: 40000
Lines read: 50000
Lines read: 60000
Lines read: 70000
Lines read: 80000
Lines read: 90000
Lines read: 100000
Lines read: 110000
Lines read: 120000
Lines read: 130000
Lines read: 140000
Lines read: 150000
Lines read: 160000
Lines read: 170000
Lines read: 180000
Lines read: 190000
Lines read: 200000
Lines read: 210000
Lines read: 220000
Lines read: 230000
Lines read: 240000
Lines read: 250000
Lines read: 260000
Lines read: 270000
Lines read: 280000
Lines read: 290000
Lines read: 300000
Lines read: 310000
Lines read: 320000
Lines read: 330000
Lines read: 340000
Lines read: 350000
[info] Summary report written to : /home/mijinko/ドキュメント/example.vcf.errors_summary.1596778720325.txt
[info] According to the VCF specification, the input file is not valid
I'm not sure why, but it seems to have failed. Let's look at the error summary.
----------
According to the VCF specification, the input file is not valid
Error: Format is not a colon-separated list of alphanumeric strings. This occurs 358937 time(s), first time in line 222.
Warning: Comma found in the ID column; if used as separator, please replace it with semi-colon. This occurs 2 time(s), first time in line 211263.
----------
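Apparently so... Line 222 is where the header ends and the records begin, and the error occurs 358937 times, so presumably every record is affected. I suspect the file's format itself is off.
So next I'll try a VCF that I had downloaded from ClinVar earlier.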
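To see what the validator is objecting to, it may help to look at the FORMAT column (the ninth tab-separated field) of the first data line. The following is only a rough sketch: the helper names `check_format_column` and `first_data_line` are made up for illustration, and the key pattern is an assumption that approximates the validator's "alphanumeric strings" wording rather than its actual rule.

```python
import re

# Assumption: a FORMAT key consists of letters, digits, underscores or dots.
# This only approximates the validator's check; the VCF spec is the authority.
FORMAT_KEY = re.compile(r"^[A-Za-z0-9_.]+$")


def check_format_column(line):
    """Return a list of problems found in the FORMAT column of one VCF data line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return ["no FORMAT column (fewer than 9 tab-separated fields)"]
    keys = fields[8].split(":")
    return [f"bad FORMAT key: {key!r}" for key in keys if not FORMAT_KEY.match(key)]


def first_data_line(path):
    """Return (line_number, line) for the first non-header line, or None."""
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if not line.startswith("#"):
                return number, line
    return None
```

Calling `first_data_line` on the file and passing the returned line to `check_format_column` should show whether the column is missing entirely (for example, if the file is not tab-separated) or contains unexpected characters.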
mijinko@linux:~$ time /home/mijinko/ダウンロード/vcf_validator_linux -i '/home/mijinko/デスクトップ/ClinVar/clinvar_20181028.vcf'
[info] Reading from input file...
Lines read: 10000
Lines read: 20000
Lines read: 30000
Lines read: 40000
Lines read: 50000
Lines read: 60000
Lines read: 70000
Lines read: 80000
Lines read: 90000
Lines read: 100000
Lines read: 110000
Lines read: 120000
Lines read: 130000
Lines read: 140000
Lines read: 150000
Lines read: 160000
Lines read: 170000
Lines read: 180000
Lines read: 190000
Lines read: 200000
Lines read: 210000
Lines read: 220000
Lines read: 230000
Lines read: 240000
Lines read: 250000
Lines read: 260000
Lines read: 270000
Lines read: 280000
Lines read: 290000
Lines read: 300000
Lines read: 310000
Lines read: 320000
Lines read: 330000
Lines read: 340000
Lines read: 350000
Lines read: 360000
Lines read: 370000
Lines read: 380000
Lines read: 390000
Lines read: 400000
Lines read: 410000
Lines read: 420000
[info] Summary report written to : /home/mijinko/デスクトップ/ClinVar/clinvar_20181028.vcf.errors_summary.1596781075944.txt
[info] According to the VCF specification, the input file is not valid
real 0m13.629s
user 0m13.165s
sys 0m0.072s
It ran very fast, but it looks like this one failed too...
According to the VCF specification, the input file is not valid
Error: INFO DBVARID metadata Number is not 1. This occurs 1 time(s), first time in line 22.
Warning: Chromosome/contig '1' is not described in a 'contig' meta description. This occurs 30663 time(s), first time in line 29.
Warning: Reference and alternate alleles do not share the first nucleotide. This occurs 1928 time(s), first time in line 463.
Error: Info field value is not a comma-separated list of valid strings (maybe it contains whitespaces?). This occurs 987 time(s), first time in line 10379.
Warning: Chromosome/contig '2' is not described in a 'contig' meta description. This occurs 47880 time(s), first time in line 31164.
・
・
・
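Since the summary goes on for many lines, it could be easier to read after parsing it and separating errors from warnings. This is a minimal sketch that assumes every entry has the shape seen above ("Error: ... This occurs N time(s), first time in line M."); `parse_summary` and `errors_only` are hypothetical helper names, not part of the validator.

```python
import re

# Matches one summary entry in the shape shown in the report above.
SUMMARY_LINE = re.compile(
    r"^(?P<level>Error|Warning): (?P<message>.*) "
    r"This occurs (?P<count>\d+) time\(s\), first time in line (?P<first>\d+)\.$"
)


def parse_summary(text):
    """Turn the summary report text into a list of dicts, skipping other lines."""
    entries = []
    for line in text.splitlines():
        match = SUMMARY_LINE.match(line.strip())
        if match:
            entries.append({
                "level": match["level"],
                "message": match["message"],
                "count": int(match["count"]),
                "first_line": int(match["first"]),
            })
    return entries


def errors_only(entries):
    """Keep only the entries reported as errors, dropping warnings."""
    return [entry for entry in entries if entry["level"] == "Error"]
```

Reading the `errors_summary` file written by the validator and passing its text to `parse_summary` gives a list that can be sorted by `count` or `first_line`.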
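Hmm. Either NCBI's files use a different format to begin with, or I'm using the tool incorrectly.
I realized I need to look into that properly (a bit late for that).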
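One place to start that investigation might be the contig warnings, which suggest that the header has no `##contig` lines for the chromosomes used in the records. The sketch below counts records whose CHROM value is not declared in the header; `undeclared_contigs` is a hypothetical helper, and it assumes the header lines come before the data lines, as they do in a normal VCF.

```python
import re

# Captures the ID from a header line such as ##contig=<ID=1,length=...>
CONTIG_ID = re.compile(r"^##contig=<.*?\bID=([^,>]+)")


def undeclared_contigs(lines):
    """Map each CHROM value missing from the ##contig header lines to its record count."""
    declared = set()
    undeclared = {}
    for line in lines:
        if line.startswith("##"):
            match = CONTIG_ID.match(line)
            if match:
                declared.add(match.group(1))
        elif line.startswith("#"):
            continue  # the #CHROM column header line
        elif line.strip():
            chrom = line.split("\t", 1)[0]
            if chrom not in declared:
                undeclared[chrom] = undeclared.get(chrom, 0) + 1
    return undeclared
```

Because an open file object yields lines, it can be passed directly, for example `undeclared_contigs(open(path, encoding="utf-8"))`.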