Mucin domains of MUC2/MUC5AC/MUC5B/MUC6 in the RefSeq, Korean, and American assemblies
The mucin domains of MUC2 in the RefSeq, Korean, and American assemblies are 5884, 4471, and 8773 bases long, respectively. NCBI also holds a nucleotide entry for the human MUC2 DNA sequence under accession number NM_002457.4. The mucin domain of this entry is 9270 bases long, and only part of it can be found in RefSeq. Nevertheless, we regard the mucin domain of NM_002457.4 as the most complete among all available MUC2 mucin domain assemblies.
The mucin domains of MUC5AC in the RefSeq, Korean, and American assemblies are 10371, 10576, and 11196 bases long, respectively. We regard the MUC5AC mucin domain in RefSeq as the most accurate of the three assemblies.
The mucin domains of MUC5B in the RefSeq, Korean, and American assemblies are 10893, 11589, and 10772 bases long, respectively. We regard the MUC5B mucin domain in RefSeq as the most accurate of the three assemblies.
The mucin domains of MUC6 in the RefSeq, Korean, and American assemblies are 3009, 13299, and 8727 bases long, respectively. In the American assembly, the head of the MUC6 mucin domain runs into a gap region. We regard the MUC6 mucin domain of the Korean individual as the most complete of the three assemblies.
Programming pipeline to obtain consensus sequences from SMRT reads
Taking the reads in one specific region, we used multiple alignment methods to obtain an alignment. For each position in the alignment, we took the symbol (A/T/C/G or the gap character ‘-’, which marks an insertion in other reads) that appears the greatest number of times. We then removed all gap characters to obtain the consensus sequence. Next, we aligned the consensus back to all reads and corrected the errors caused by several gaps adjacent to one nucleotide, or by several identical nucleotides adjacent to one gap (Fig. 1). For example, where three ‘-’s and one ‘A’ occur together, some reads give ‘- - - A’ in the alignment and others give ‘A - - -’. The nucleotide can then be outvoted by a gap in every column and be removed from the consensus, leaving one nucleotide missing. By the same principle, several identical nucleotides together with a gap might leave a redundant nucleotide in the consensus. Either problem can cause a frameshift, which we corrected according to the translation result. If the error rate of SMRT reads is taken as 15% (18), the error rate at each position is 0.15 to the power of the number of reads aligned at that position.
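The column-wise vote, and the way a nucleotide can be lost when reads place it in different columns, can be illustrated with a short sketch. This is not the pipeline's actual code: the function name `column_consensus` and the toy reads are hypothetical, the input is assumed to be an already computed multiple alignment with rows of equal length, and ties are simply resolved in favour of the symbol seen first.

```python
from collections import Counter


def column_consensus(aligned_reads):
    """Majority vote per alignment column; '-' symbols are dropped afterwards.

    Hypothetical helper for illustration only. Assumes every row of the
    alignment has the same length.
    """
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0])
    if any(len(row) != length for row in aligned_reads):
        raise ValueError("all aligned rows must have the same length")

    consensus = []
    for column in zip(*aligned_reads):
        # Most frequent symbol in this column (A/T/C/G/-).
        symbol, _count = Counter(column).most_common(1)[0]
        if symbol != "-":
            consensus.append(symbol)
    return "".join(consensus)


# Toy alignment: every read is "GAT" once the '-' symbols are removed,
# but the 'A' sits in a different column in different reads.
toy_rows = [
    "G---AT",
    "G---AT",
    "GA---T",
    "GA---T",
    "G-A--T",
]

print(column_consensus(toy_rows))
```

In this toy case the ‘A’ is outvoted by ‘-’ in every column, so the consensus loses a nucleotide that every read contains. This is the kind of error that the realignment step described above is meant to correct.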
Assembly of mucin domain of MUC2 in HX1 with SMRT reads
For MUC2, among all downloaded SMRT reads, 4 reads cover both the intron before and the intron after the mucin domain exon (Fig. 2). The human MUC2 mucin domain contains a CysD domain in the middle of a region rich in proline, threonine, and serine (PTS). Although the tandem repeat (TR) structure of the PTS domain makes it impossible to locate precisely by similarity search, the CysD domain can be located precisely. We therefore made use of the CysD domain in the middle of the PTS exon and took the last cysteine of the CysD domain as the delimiter. For the mucin domain exon of NCBI MUC2 (accession number NM_002457.4), which contains 9270 bases, the left part contains 1736 bases and the right part contains 7534 bases. We thus searched for two types of read: I. previous intron + left part of exon; II. right part of exon + next intron. We found 18 type I and 9 type II reads. Together with the 4 reads that cover both the previous and the next intron, 22 reads could be used to build the left part and 13 reads could be used to build the right part (Fig. 2). For the left part, at least 11 reads are aligned at every position, so the maximum error rate is 0.15 to the power of 11 and the accuracy is at least 99.9999999%. For the right part, at least 7 reads are aligned at every position, so the maximum error rate is 0.15 to the power of 7 and the accuracy is at least 99.9998%. The whole mucin domain exon of MUC2 in HX1 has 8994 bases.
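The read counts and accuracy figures above follow from simple arithmetic, sketched below. The function names are hypothetical, and the model is the one stated in the text (a per-read error rate of 0.15 raised to the power of the minimum number of aligned reads), not an independent error analysis.

```python
READ_ERROR_RATE = 0.15  # per-read SMRT error rate assumed in the text


def per_position_error(min_reads, read_error_rate=READ_ERROR_RATE):
    """Upper bound on the consensus error rate at the least-covered position."""
    return read_error_rate ** min_reads


def accuracy_percent(min_reads, read_error_rate=READ_ERROR_RATE):
    """Lower bound on the per-position accuracy, in percent."""
    return 100.0 * (1.0 - per_position_error(min_reads, read_error_rate))


# Reads available for each part: type I or type II reads plus the
# 4 reads that span both flanking introns.
spanning_reads = 4
left_reads = 18 + spanning_reads
right_reads = 9 + spanning_reads

for label, n_reads, min_cov in (("left", left_reads, 11), ("right", right_reads, 7)):
    print(label, n_reads, per_position_error(min_cov), accuracy_percent(min_cov))
```

The same two functions apply to any other minimum coverage, for example 6 or 5 reads.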
MUC2 mucin domain TR structure
The protein sequence of the left part of the MUC2 mucin domain in HX1 has two CysD domains, one at the beginning and one at the end (Fig. 2). They have 95 and 97 amino acids, respectively. Between the two CysD domains lies the PTS TR short part (Fig. 3A). The TR lengths vary considerably. We define each TR as beginning with the motif “PS”. By this definition, the PTS TR short part has 28 TRs. The shortest TR has 7 amino acids and the longest has 26 amino acids. The protein sequences of the PTS TR short part and of the two CysD domains of the MUC2 mucin domain in NCBI (nucleotide accession number NM_002457.4; protein accession number NP_002448.4) are exactly the same as those in HX1.
The PTS sequence after the 2nd CysD domain of the MUC2 mucin domain in HX1 is the PTS TR long part. It has 101 TRs, one PTS head, and one PTS tail (Fig. 3B). Of the TRs, 98 have 23 amino acids each; the 3rd TR has 24 amino acids, the 4th TR has 22 amino acids, and the 57th TR has 21 amino acids. The PTS head has 14 amino acids and the PTS tail has 84 amino acids.
The protein sequence of the PTS TR long part of the MUC2 mucin domain in NCBI (nucleotide accession number NM_002457.4; protein accession number NP_002448.4) has 105 repeats, one PTS head, and one PTS tail. Compared with the PTS TR long part of the MUC2 mucin domain in HX1, it has 4 additional TRs (CNV) directly after the 89th repeat and 18 SNPs at repeats 15, 18, 25, 25, 38, 38, 39, 39, 39, 40, 40, 40, 41, 41, 42, 42, 60, and 104 (repeat 100 in HX1), respectively (Fig. 3C and 3D).
The protein sequences of the PTS TR short part and of the two CysD domains of the MUC2 mucin domain in BAC clone RP-13870H17 (nucleotide accession number MH593786.1) are the same as those in HX1. However, the PTS TR long part of the MUC2 mucin domain in BAC clone RP-13870H17 has only 98 TRs. Therefore, different individuals can have different numbers of TRs in the PTS TR long part of the MUC2 mucin domain.
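As a consistency check, the repeat lengths given for the PTS TR long part in HX1 can be tallied. The sketch below only re-adds the numbers stated in the text; the variable names are hypothetical, and the pieces are assumed to be contiguous and non-overlapping.

```python
# TR lengths (amino acids) in the PTS TR long part of MUC2 in HX1.
# Default length is 23; the three exceptions are listed in the text.
tr_lengths = {index: 23 for index in range(1, 102)}  # TRs 1..101
tr_lengths[3] = 24
tr_lengths[4] = 22
tr_lengths[57] = 21

pts_head = 14
pts_tail = 84

n_regular = sum(1 for length in tr_lengths.values() if length == 23)
tr_residues = sum(tr_lengths.values())
total_residues = pts_head + tr_residues + pts_tail

print(n_regular, tr_residues, total_residues, 3 * total_residues)
```

The last printed value is the number of bases that would encode this stretch, which can be compared with the length of the right part of the exon.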
Assembly of mucin domain of MUC5AC in HX1 with SMRT reads
For MUC5AC, among all downloaded SMRT reads, 10 reads cover both the intron before and the intron after the mucin domain exon. At least 6 reads are aligned at every position, so the maximum error rate is 0.15 to the power of 6 and the accuracy is at least 99.9988%. The whole mucin domain exon of MUC5AC in HX1 has 10371 bases.
MUC5AC mucin domain TR structure
The protein sequence of the MUC5AC mucin domain in HX1 has one main head, one main tail, 6 CysD domains, 2 Long Tandem Repeat (LTR) groups, 4 Short Tandem Repeat (STR) groups, and 1 unique short piece (Fig. 4A). The main head is composed of a PTS domain of 45 amino acids and a CysD-like domain of 99 amino acids (Fig. 4B). The main tail is composed of a PTS domain of 50 amino acids and a short piece of 12 amino acids (Fig. 4C). Each of the 6 CysD domains has 105 amino acids and is located after one LTR/STR group (Fig. 4L). Each of the 2 LTR groups is composed of one PTS domain of 95 amino acids, one CysD-like domain of 101 amino acids, and one PTS domain of 65 amino acids (Fig. 4E). Besides one CysD domain, there is one small PTS piece of 7 amino acids between the 2 LTR groups (Fig. 4D). Each of the 4 STR groups has one PTS head of 36 amino acids and one PTS tail of 13 amino acids (Fig. 4J and 4K). The 1st, 2nd, 3rd, and 4th STR groups have 119, 18, 35, and 65 STRs, respectively (Fig. 4F, 4G, 4H and 4J). Each repeat has 8 amino acids, except that the 17th repeat of the 3rd STR group has only 7 amino acids (Fig. 4F, 4G, 4H and 4I). The 6 CysD domains, which delimit the LTR/STR groups, are quite similar to one another (Fig. 4L).
The protein sequence of the MUC5AC mucin domain in NCBI (nucleotide accession number NM_001304359.1; protein accession number NP_001291288.1) has the same length and TR structure as that of the MUC5AC mucin domain in HX1. There are only 3 SNPs: one in the 99th repeat of the 1st STR group, and one each in the 3rd and 4th CysD domains (Fig. 4M).
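The component lengths listed for the MUC5AC mucin domain can be added up to see how many residues the described structure accounts for. This is only a tally of the numbers in the text, under the assumption that the listed components are exhaustive and do not overlap, and that the 7-residue PTS piece is the unique short piece; all names are hypothetical.

```python
# Fixed components (amino acids), as listed in the text.
main_head = 45 + 99            # PTS domain + CysD-like domain
main_tail = 50 + 12            # PTS domain + short piece
cysd_domains = 6 * 105
ltr_groups = 2 * (95 + 101 + 65)
short_pts_piece = 7

# STR groups: each has a 36-residue head and a 13-residue tail.
str_counts = [119, 18, 35, 65]
str_heads_tails = len(str_counts) * (36 + 13)
# Every STR has 8 residues except one repeat with 7 residues.
str_residues = 8 * sum(str_counts) - 1

total_residues = (
    main_head + main_tail + cysd_domains + ltr_groups
    + short_pts_piece + str_heads_tails + str_residues
)

print(sum(str_counts), str_residues, total_residues)
```

The total can be set against the exon length divided by three to check that the description covers essentially the whole domain.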
Assembly of mucin domain of MUC5B in HX1 with SMRT reads
For MUC5B, among all downloaded SMRT reads, 9 reads cover both the intron before and the intron after the mucin domain exon. At least 5 reads are aligned at every position, so the maximum error rate is 0.15 to the power of 5 and the accuracy is at least 99.99%. The whole mucin domain exon of MUC5B in HX1 has 10893 bases.
MUC5B mucin domain TR structure
The protein sequence of the MUC5B mucin domain in HX1 has one main head, one main tail, 1 CysD-similar domain, 6 CysD domains, and 7 PTS domains (Fig. 5A). The main head is a small piece of 8 amino acids (Fig. 5B). The main tail is a small piece of 12 amino acids (Fig. 5C). The CysD-similar domain has 100 amino acids (Fig. 5M). The 2nd CysD domain has 102 amino acids, and each of the other 5 CysD domains has 101 amino acids (Fig. 5L). Each of the 7 CysD and CysD-similar domains is located before one PTS domain (Fig. 5A). The first 2 PTS domains contain no repeats but consist of long pieces of 70 and 180 amino acids, respectively (Fig. 5D and 5E). Each of the last 5 PTS domains has several STRs and one PTS tail (Fig. 5F, 5G, 5H, 5I, 5J, 5K, and 5N). The bodies of the 3rd, 4th, 5th, 6th, and 7th PTS domains contain 10, 11, 16, 11, and 22 STRs, respectively (Fig. 5F, 5G, 5H, 5I and 5J). Among all the STRs in the bodies of the last 5 PTS domains, 5 have 24 amino acids each, 8 have 26 amino acids each, 8 have 28 amino acids each, 48 have 29 amino acids each, and one has 34 amino acids (Fig. 5F, 5G, 5H, 5I and 5J). The PTS tails of the 3rd, 4th, 5th, and 6th PTS domains are homologous and each has 147 amino acids (Fig. 5K). The PTS tail of the 7th PTS domain has 87 amino acids (Fig. 5N). The 2nd, 3rd, 4th, 5th, and 6th CysD domains, which delimit the PTS domains, are quite similar to one another (Fig. 5L).
The protein sequence of the MUC5B mucin domain in NCBI (nucleotide accession number NM_002458.2; protein accession number NP_002449.2) has the same length and TR structure as that of the MUC5B mucin domain in HX1. There are only 7 SNPs. Three are in the PTS tail of the 3rd PTS domain, one is in the 11th repeat of the 5th PTS domain, one is in the 4th repeat of the 6th PTS domain, and three are in the 7th, 8th, and 9th repeats of the 7th PTS domain, respectively (Fig. 5O).
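The two ways the STRs are counted above (per PTS domain and per length class) should agree. The following sketch cross-checks them using only the numbers stated in the text; the dictionary names are hypothetical.

```python
# Number of STRs in the body of each repeat-containing PTS domain.
strs_per_pts_domain = {3: 10, 4: 11, 5: 16, 6: 11, 7: 22}

# Number of STRs per length class (length in amino acids -> count).
strs_per_length = {24: 5, 26: 8, 28: 8, 29: 48, 34: 1}

total_by_domain = sum(strs_per_pts_domain.values())
total_by_length = sum(strs_per_length.values())

# Residues contributed by all STRs together.
str_residues = sum(length * count for length, count in strs_per_length.items())

print(total_by_domain, total_by_length, str_residues)
```

Both counts give the same number of STRs, which supports the internal consistency of the description.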
Assembly of mucin domain of MUC6 in HX1 with SMRT reads
For MUC6, among all downloaded SMRT reads, only 3 reads cover both the intron before and the intron after the mucin domain exon. It is therefore not possible to determine the correct nucleotide at every position. However, owing to the TR structure, the number of TRs and the length of each TR could still be obtained. The NCBI RefSeq of MUC6 contains all of the non-TR part of the mucin domain, so we used it as the template to obtain the whole mucin domain exon of MUC6 in HX1, which has 13470 bases.
MUC6 mucin domain TR structure
The protein sequence of the MUC6 mucin domain in HX1 has one head, one tail, and 27 TRs (Fig. 6A). The head has 60 amino acids and the tail has 265 amino acids (Fig. 6B and 6C). The 27 TRs lie between the head and the tail. The 1st, 2nd, 3rd, 4th, 5th, 7th, 8th, 9th, 12th, 13th, 14th, 18th, 22nd, and 26th TRs have 169 amino acids each. This is the most common length among all TRs, so we call this type the “typical TR”. The 6th TR has 171 amino acids and contains a “TG” insertion compared with the typical TR. The 10th, 11th, 15th, 19th, and 23rd TRs have 168 amino acids each and contain a deletion compared with the typical TR. The 16th, 20th, and 24th TRs have 150 amino acids each and correspond to the first 150 amino acids of the typical TR. The 17th, 21st, and 25th TRs have 74 amino acids each and correspond to the last 74 amino acids of the typical TR. The 27th TR has 115 amino acids and corresponds to the first 115 amino acids of the typical TR (Fig. 6D).
The protein sequence of the MUC6 mucin domain in NCBI (nucleotide accession number NM_005961.2; protein accession number NP_005952.2) contains only the head, the 1st TR, the first 33 amino acids of the 2nd TR, the last 117 amino acids of the 24th TR, the 25th TR, the 26th TR, the 27th TR, and the tail (Fig. 6E). Since the exact nucleotide at each position cannot be determined with only 3 reads available, we cannot report SNP information.
The protein sequence of the MUC6 mucin domain in BAC clone RP-13870H17 has one head, one tail, and 24 TRs (ref). Therefore, different individuals can have different numbers of TRs in the MUC6 mucin domain.
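The grouping of the 27 TRs by length can be checked mechanically: every TR index from 1 to 27 should appear in exactly one length class. The sketch below encodes the classes as given in the text; the names are hypothetical and the computation is only a tally.

```python
# TR length (amino acids) -> indices of the TRs with that length.
tr_classes = {
    169: [1, 2, 3, 4, 5, 7, 8, 9, 12, 13, 14, 18, 22, 26],  # "typical TR"
    171: [6],
    168: [10, 11, 15, 19, 23],
    150: [16, 20, 24],
    74: [17, 21, 25],
    115: [27],
}

all_indices = [i for indices in tr_classes.values() for i in indices]
# Each of the 27 TRs must be listed once and only once.
covers_all_once = sorted(all_indices) == list(range(1, 28))

tr_residues = sum(length * len(indices) for length, indices in tr_classes.items())
head, tail = 60, 265
total_residues = head + tr_residues + tail

print(covers_all_once, tr_residues, total_residues)
```

The total can be compared with the exon length divided by three as a rough plausibility check on the assembled structure.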
Estimation of number of TRs in right part of mucin domain of MUC2 for another individual
In the pipeline output, all frameshifts are caused by runs of identical nucleotides. In HX1, the DNA sequence of the TRs in the right part of the MUC2 mucin domain contains no consecutive “T”s other than at two SNPs, in the 46th and 96th TRs, each of which places two “T”s together (Fig. 7). Therefore, the number of “T”s in the TR part of the pipeline consensus can be used to estimate the number of repeats without correcting each frameshift (Table 1).
In the right part of the MUC2 mucin domain of HX1, the number of “T”s remains the same after adjustment. Within a repeat, “T” occurs 2, 3, 4, or 5 times in most cases, 6 times in fewer cases, and 7 times only once. Therefore, for another individual, the repeat number can be estimated roughly by counting “T”s and comparing with HX1. In HX1, the right part has 101 TRs and 364 “T”s. The 46th and 96th TRs each contain one “TT”. Since “TT” does not occur in common repeats, we regard these cases as SNPs and count each “TT” as one “T”. A “T” count of 362 therefore corresponds to 101 repeats, and each repeat has 3.58 “T”s on average (Fig. 7). Dividing a difference in “T” count by 3.58 gives a rough difference in repeat number. For instance, the most complete MUC2 in NCBI (protein accession number NP_002448.4) has 13 more “T”s and 4 more repeats. Some SNPs (T/A, T/C, or T/G) might affect the difference in “T” count, but a rough estimate of the repeat number can still be obtained in this way (Fig. 8).
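The estimate described above reduces to one division, sketched here with the HX1 numbers from the text. The constant and function names are hypothetical, and the sketch uses the unrounded average rather than 3.58, which does not change the rounded result.

```python
HX1_REPEATS = 101   # TRs in the right part of the MUC2 mucin domain (HX1)
HX1_T_RAW = 364     # "T"s counted in that part
HX1_TT_SNPS = 2     # "TT" cases (46th and 96th TR), each counted as one "T"

HX1_T_ADJUSTED = HX1_T_RAW - HX1_TT_SNPS
T_PER_REPEAT = HX1_T_ADJUSTED / HX1_REPEATS


def repeat_difference(t_count_difference):
    """Rough difference in repeat number implied by a difference in "T" count."""
    return t_count_difference / T_PER_REPEAT


# Example from the text: 13 more "T"s than HX1.
print(round(T_PER_REPEAT, 2), repeat_difference(13), round(repeat_difference(13)))
```

For a difference of 13 “T”s the estimate rounds to 4 repeats, in line with the NCBI example given in the text.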