Raw DNA Phasing Six Siblings with One Parent – Part 1 Homozygosity

I have written many Blogs on raw DNA phasing with my siblings and my mother. I have done this phasing using a Whit Athby paper and MS Access. I had my last sibling tested this past Summer, so thought I would see how his phasing would work using this method. The goal of this phasing would be to get four files of data representing the DNA from my four grandparents. I have four such files already, but they were created by M MacNeill a while ago and he didn’t have all my siblings’ data at the time. I would like to learn how to upload these files on my own.

Jim’s Raw DNA

Jim was my last brother to be tested. He was tested at FTDNA as that was the kit I had at the time. The first step is to find Jim’s DNA download from FTDNA and extract it. However, before I do that, I need to know what build to download. As I look at my old blogs, it appears that I was working in Build 37. Gedmatch has historically used Build 36. However, Gedmatch is being migrated to Genesis. Here is a comment I found on Facebook:

All Genesis tools natively work in B37 *meaning that all matching is done based on B37), but we decided to map all of the B37 positions to B36 and B38 when printing out segment start/end positions, with the choice given to the user which to display.

We will begin to migrate this to other tools as soon as we can. I hope you find it useful.

Build 37 DNA

All this to say that I want Build 37. I assume that I used Build 36 for Gedmatch, so I’ll do a new download for Jim:

I chose Build 37 Raw Data Concatenated. Unfortunately, my computer wants me to find an app to extract this file.

In the past, I have used Notepad, so I’ll try that. The gz file is about 6.4 MB. I can see Notepad was the wrong thing to open this with:

So I guess I need Winzip. I downloaded that and then opened Jim’s file. It opened as a csv file, but I saved it as an Excel File as that is what I will be using in Access. Jim has double A DNA:

Actually the DNA at the first position of his tested Chromosome was AA. He got an A from his dad and an A from his mom. Jim has a lot of DNA

This shows that Jim has 720,450 tested DNA positions. That is pretty good. However, there are some positions that don’t have results indicated by –. Between my mother, me and my five siblings there are about 4 million autosomal results to look at.

One thing that I notice that is different from this AncestryDNA file:

AncestryDNA has a separate column for allele 1 and allele 2. That would be better for me as I am trying to separate these alleles out.

This looks better. However, when I try to import this into Access, I get this error:

My guess is that Access does not like the dashes where there are no results. So I’ll take out every dash in Jim’s DNA results. That was close to 60,000 dashes. I tried that, but I still got the error. One on-line suggestion was to compact and repair the database. That seemed to work, but there was this problem:

I didn’t realize that there was a new header at line 702542. That imported like this:

Also for the AncestryDNA files, the X Chromosome shows as Chromosome 23, which should work better. My import to Access took out the ‘X’. After I removed the internal header and changed the X Chromosome to 23, I imported Jim’s raw DNA with no problems.

Giving Jim some Maternal DNA

Now Jim’s DNA is in shape for doing something with it. The next step is pretty simple. Every time my mom has two alleles that are the same, we know that allele is maternal for Jim. I originally tested my mom at FTDNA, so it would make sense to download her DNA from there.

Importing Mom’s DNA to MS Access

I already learned a few things from downloading JIm’s DNA from FTDNA. I used the same steps for my mom, except that I didn’t delete the dashes to see if that would make a difference. That gave me an error, so I deleted the dashes. Now I am in business.

My mom has 711,398 locations tested at FTDNA. This is a bit less than Jim’s  720,450 tested locations.

Next, I want to see what happens if I compare Mom’s ID’s with Jim’s ID’s:

Here at Access I have an equal join between the RSID fields of both tables. That results in 709,632 positions that Jim and Mom have in common. When I compare the positions between the two, I get 712,452. That is more than my mom had, so that doesn’t make sense. Actually, I shouldn’t be comparing by position, because those are positions along each of the 23 chromosomes. There may be repeats. That is good to know.

Where is Mom Homozygous?

If Mom’s Allele #1 is the same as Allele #2, that is called homozygous. I’ll perform this simple query on Mom:

I asked Access to show my where my Mom’s Allele 2 is the same as Allele 1:

Mom has that in 500,995 places. However, next, I need to get rid of the blanks:

I added Is Not Null to the criteria on the Result Column:

That gets me down to 496,136 homozygous positions. That means that more than 2/3 or almost 70% of Mom’s results are homozygous. Those are the alleles that will be Jim’s maternal alleles.

Where is Jim Homozygous?

Where Mom is homozygous, we’ll add a Mom allele to JIm. But first, where Jim is homozygous, we will add a Mom and Dad allele. I created a simple query in Access:

I’m creating two new columns for Jim. One will give me the alleles that Jim got from his Dad and the other will give me the alleles that Jim got from his Mom. In the criteria row I have that Jim’s allele 1 must equal Jim’s allele 2. When that happens, put in Jim’s allele 1 into the column. That gives me this:

 

That gives me over 500,000 rows of paternal and maternal alleles for Jim. However, I do have blanks. When I filter the blanks out, I get 491759 rows. That is a fast way to get almost 1 million alleles for Jim. Next, I’ll make a table of this query in Access. When I do this, I notice that Access has changed my query:

Access liked this better as it was simpler. I would think that JImallele2 does not have to be there twice, so I took one out and got the same result:

Access is trying to teach me to make better queries.

Adding Maternal Alleles from Mom

Here is a summary of where we are for Jim:

 

Just by assigning Jim’s own homozygoius alleles to his paternal and maternal sides, he is now 71% phased. I also see that mom had 496,136 homozygous alleles. These need to be added to Jim’s homozygous results. However, I want to be careful:

  • When I add Mom’s alleles, I don’t want to erase the ones I already gave to Jim
  • There may be homozygous alleles that mom had that Jim didn’t even test for. These could be added to Jim as bonus alleles.
  • In adding mom’s homozygous alleles to Jim’s list, we also have to add in where the position of those alleles are on the Chromosome and the RSID.

First, I note that mom has 496,136 homozygous alleles. This is more than Jim’s homozygous alleles.

First, I’ll create a query for Mom’s homozygous alleles.

Here I want there to be a non-blank result and I want Mom’s allele 1 to be the same as allele2.

Next, I’ll check to see how many of Jim’s homozygous alleles are the same as mom’s homozygous alleles.

I’ll do this by an equal join on the RSID which is a unique identifier. Here is what I get from this query:

However, there are still blanks there. I had trouble getting rid of the blanks, but I can temporarily get rid of them by filtering the results.

This gets rid of about 17,000 blanks.

This tells me that Mom has 496,136 homozygous alleles, but 381,721 of those Jim already has. That means we need to add 114,415 maternal alleles to Jim’s list. That would get his AllelesFromMom up to 606,174.

Next, I want to get a list of all of Mom’s homozygous alleles that Jim doesn’t have, so we can add them to Jim’s list. There is a little trick to getting this in
Access. First I create an unequal join:

 

On the query above on the left are all of Mom’s homozygous alleles On the right are Jim’s homozgous alleles that match Mom’s homozygous alleles. The #2 radio box is checked. That means I want everything on Mom’s side and everything where the RSID’s are equal. However, in the criteria, I’ll put an ‘is null’ on JIm’s side:

This adds 97,451 of Mom’s homozygous alleles to JIm. This is less than the 114,415 that I was looking for. One guess is that these are positions that Mom had tested that Jim did not. Somewhere I lost 7,000 of Mom’s homozygous alleles. Or this may have to do with the blanks in some of the tables. I was able to get rid of the blanks in Jim’s table and the new number came out right:

Adding 114,000 Maternal Alleles to Jim

Now that I’ve found 114,000 maternal alleles for JIm, I’d like to add them to his table. There are probably a few ways to do this in Access. One way is called Append Table. I’ll try that as I will need that later on in the process. If only I remembered how to do that. I could put Jim’s table into Excel and just add Mom’s table. However, I’m not sure Excel will appreciate the large files.

The directions that I found for Append Query said to use the data you want to copy first. That was in this Query:

What I want to add is from a Query called Mom Homo Jim Missing. These were Mom’s extra alleles. I chose to append these to a table called Jim Homozygous. But on second thought, I want it going to a new table, so I’ll copy Jim Homozygous and call it Jim Plus Mom Homozygous. First I want to review the results using the view button. I guess it looks right. It only shows the records to be added. Then I push Run and I get a warning saying that this cannot be undone.

Here is what the Appended Table looks like:

This is the point at which the appending took place. What I wasn’t expecting was that Access added the ID. This is the ID that Access originally assigned to the raw data. So now I have Jim’s ID’s and Mom’s ID’s in the same Table.

Phased Allele Update Alert

These two operations based on homozygosity alone put Jim’s phased alleles at over one million. Bing, bing, bing. Jim is already almost 80% phased. Maternally, he is close to 88% phased.

Other Phasing – Visual

I’m not the most experienced raw data phaser in the world, but I have worked on three, four five, and now six sibling raw data phasing. I have also done a lot of work with three, four, five and six sibling Visual Phasing. Here is Chromosome 1 using the Steven Fox Spreadsheet:

I can use the raw data phasing to confirm the Visual Phasing. I can also use the Visual Phasing to know where to look for crossovers. For example, I already see a problem with the map above in the bottom right corner. I will need to change the crossover designations there.

The other reason stated at the top of the Blog is that I should be able to create a file to upload to Gedmatch for each of these four grandparents. That could make searching for DNA matches easier.

Summary and Conclusions

  • I started phasing the sixth of six siblings based on homozygosity.
  • Using homozygosity alone, I got my brother Jim up to 80% phased.
  • Raw data phasing is considered an advanced topic, but the basics are quite simple. If you have two alleles that are the same, one must be from your father and one from your mother. If you are a parent and you have two alleles that are the same, you had to have passed down that same allele to your child.
  • I also used MS Access which is best suited for large databases.
  • My goal is to get four grandparent files to upload to Gedmatch (or Genesis). In the past, I have run out of steam on these projects.
  • I will be able to use my past work on visual phasing as a roadmap to finding crossovers and assigning grandparents.
  • I should be able to use my past raw data phasing experience to streamline the process.
  • With six siblings, I am expecting good results. However, as in the Visual Phasing process, the more siblings you have, you will have more combinations of sibling comparisons you have to look at.
  • Next up, I expect to look at heterozygosity.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Part 7 – Raw DNA From 5 Siblings and a Mother – DNA From Mom

I’ve spent my last 6 Blogs on this topic finding out which alleles came from my dad. In this Blog, I would like to work on finding my siblings’ and my alleles that come from mom.

The Ironic Step of Phasing – Mom Alleles from Dad Alleles

I call this ironic step in that it was my mom that was tested for DNA. Based on her results we found out a lot of the alleles that her children got from our dad who passed away quite a while ago. Now, we use those alleles we got from dad to figure out which alleles we got from mom. From the Whit Athey Paper referenced at the ISOGG Web Page on Phasing:

If a child is heterozygous at a particular SNP, and if it is possible to determine which parent contributed one of the bases, then the other parent necessarily contributed the other (or alternate) base.

 

First I copy my FillinOne Table to a MomfromDadOne Table. Then I’ll do a query on that.

This says where I am heterozygous, and I have an allele from dad, I want to see where I’m missing one from mom.

I have over 50,000 of these which will be easy to update. I will want to put Joelallele2 in the blank where JoelfromDad = Joelallele1. Then I will want Joelallele1 in the JoelfromMom space when my allele from Dad is Joelallele2.

I ran this query twice for each sibling, so 10 times. This updated 50-60,000 alleles per sibling, so about a quarter of a million alleles altogether.

Finding Mom Patterns

Now that I have filled in more alleles from Mom, it should be easier to find Mom Patterns. Here is a Query to find Min and Max for the AAAAB Pattern:

Results in:

This saves a lot of time and gives me the start and stop positions of all the AAAAB Mom Patterns. In my previous look which I now see as premature, I only found 2 AAAAB Patterns. Now thanks to my MomfromDad update above, I have at least 17 AAAAB Patterns. The only drawback is that if there is more than one AAAAB Pattern within a Chromosome, it will not show that. However, if I run all the Mom Patterns, and find overlapping Patterns, that can be reconciled later. In fact, I see an overlap already:

The first AAAAB Pattern I found was 162-233M which I did see as large. I already had found an AAABA Pattern from 192-249M. This could mean that AAAAB goes from 162-192 and that the 233M AAAAB pattern was just an outlying singleton.

I also recall that I want ID’s, so I’ll add that to my query:

Because I have so much new information, I’ll put this into a new spreadsheet:

AAABA Mom Pattern

I just have to change the Query slightly to get the AAABA Mom Pattern:

The results of this Query go into the new spreadsheet. This spreadsheet will be sorted by Chromosome later.

I added a column for IDEnd minus IDStart:

Where this is zero, it would indicate a single Pattern.

I went through all the Mom Patterns and got a spreadsheet of 194 rows that need to be reconciled. Here are Chromosomes 1 and 2 sorted:

Reconciling Chromosome 1

I have added in a column for possible assignment of a crossover to a sibling. Note that up to about 20M everything looks OK. There are discrete Patterns. ABBBA to AABBA is a change in the second position which belongs to Sharon. The change from AABBA to AABBB goes to Lori. Then the AABBB is the same as BBAAA which goes to ABAAA. That would be my crossover [Joel].

I did a Query showing where all the alleles were filled in for the Mom Patterns:

This shows where my Crossover is at ID # 8984. I have added a few more columns to my Mom Pattern Spreadsheet to add the more refined cut points:

Next I’ll look at 77M.

As best I can tell, there are two single AABAB’s in the middle of an AABBB Pattern. Next I will want to find the start of that AABBB Pattern. To find that I do a query to look for the AABBB Pattern in Chromosome 1. That Query results in more AABBB Patterns.

A Problem

I have a problem in that it appears that the Mom Patterns of AABBB and AABAB appear to overlap each other on Chromosome 1. I assume that means that I did something wrong.

refilling the dad patterns

That means that I should go back and fill the Dad Pattern back in:

First I recreate a Fill-in Table using the old Three Principles Table. Then I do update queries on that. Hopefully these numbers will work:

Back to Mom Patterns From Dad Patterns

Just so I’m not going backwards, I’ll redo this step. I copied my revised fill-in Table to a revised Mom from Dad Table. This time I’ll keep track of the alleles for fun:

So in retrospect, I don’t know if I made a mistake with the Dad fill-in’s or in the Mom fill-in from the Dad Pattern. Hopefully, there were no mistakes this time.

 

Part 6 – Raw DNA From 5 Siblings and a Mother – Filling In Paternal Blanks

In my last Blog, I said that I would work on the Maternal Patterns and then fill in blanks. However, my Maternal Pattern Table is not very filled. After some thought and re-reading the Whit Athey Paper on Phasing, on which I base this work, I decided to:

  1. Fill in the Paternal Blanks
  2. Use the Paternal Data to fill in the Maternal alleles
  3. Fill out the Maternal Pattern Table
  4. Fill in the Maternal blanks based on the Maternal Pattern Table

Filling In Paternal Blanks

I might as well start filling in the AAAAA Patterns. On my Dad Pattern Excel Spreadsheet, I can filter for that pattern:

However, I now need a formula for Excel including all the ID positions above. This was the point of my starting this project over – to get those IDs. The formula will be in the form of “Between A And B OR Between C And D OR…” So first I need a formula in Excel to create the formula in Access. That formula is called Concatenate. According to a Google search, concatenate means to “link (things) together in a chain or series”. The symbol in Excel for concatenate is simply the ampersand (&).

Here is my formula and the outcome:

However, I have another idea. I can concatenate the concatenation. First, I add an extra space on the end of my “Or”. Then I drag down the formula to fill in the other chromosomes. Then I take off the last “Or”.

That gives me this helpful string of AAAAA Positions:

This will save me a lot of cutting and pasting in Access.

Back to Access

First I copied my old Table to a new one called tblFillInOne. I will create an Update Query for that Table.

I am only updating Dad alleles from other Dad alleles, so I import those 5 alleles plus the location ID. Then I use the expression builder, to paste in the location of the AAAAA Patterns in all 22 chromosomes. So now I have the Pattern and the location, but I need some more criteria. I would like the criteria to say if there is any allele in any of the five columns and any blanks in those columns, then replace the blank space with one of the existing alleles.

Here is a simple Update Query:

This says, that if my allele is null and Sharon’s isn’t, then replace mine with Sharon’s. The problem is that this would take four separate Update Queries. With 5 siblings, that would be 20 queries.

Another risky Update Query would use this form:

Here I am saying if any allele is not null (other than mine) replace that in my slot where I have a blank. The thing I don’t know if the Update To: field can have a variable criteria. I’ll try it. When I run this as a Select Query, it puts a bit of a strain on my computer. Eventually, it gives me 18,385 rows. When I run the View function on the Update Query, I get the same number of rows, so I’ll hit the Run button and hope for the best.

If I run this Select Query again, I should get no results if everything updated.  I did get no results, so I assume that it worked. I want to save this Update Query and use it for the other four siblings.

Filling in Sharon’s missing alleles from the AAAAA Paternal pattern

I used the same logic for Sharon:

Now she has all the Is Null values and I don’t. I moved the Update To: criteria over to Sharon. I took out Sharon’s allele and added mine in her place. Again, this gives my old computer a workout. I get 18,315 rows again which seems suspicious. I see the problem. I appears that Access updated my results with a (-1) rather than with an allele.

That means that I just have to do 20 Queries. However, they should go quickly.

Back to the Simple AAAAA Query

Due to all the Update Queries, I’ll make a Spreadsheet to keep track of each Update Query I do:

It turns out that it is easier to run this Update Query sorted by ‘From’:

That way, I can just move Sharon’s allele from Dad and the Is Null along the Update Query:

With these fast 20 Update Queries, I updated over 100,000 alleles:

AAAAB Fill-in

This could be a little easier. For this one, we don’t want to touch the last ‘B’. The last B represents Lori, so we will only be filling in to and with the other four siblings.

And then we need the fill-in locations.

AAABA and AAABB Fill-in

AAABA is about the same as AAAAB except the B in the AAABA corresponds to Jonathan. He is all alone as a B so he gives no alleles and takes no alleles. The other siblings share their AAAA’s in this Pattern.

In an AAABB Pattern, the three A’s will share with each other and the two B’s will share with each other. This happens to break down along V1 and V2 lines, so I expect there will not be as much sharing as between AncestryDNA versions.  The sharing of A’s and B’s looks like this in my Fill-in Tracker:

I have darkened out the areas where an A cannot share their A with a B and a B cannot share their B with an A. As I predicted, the AAABB filled-in alleles were less:

All the other patterns filled in

All the other patterns will be of the same type. There is one AAAAA which is all the same. The other combinations are four of one type and one of the other or three of one type and two of the other.

There are 20 fill-in’s for AAAAA. As a quality check, there are 12 fill-in’s for a 4-1 Pattern and there are eight fill-in’s for a 3-2 Pattern. I would recommend using a fill-in tracker to make sure all the combinations are being covered. The specific numbers of alleles being filled in for each combination of each Pattern are not all that important, but they are interesting.

Fixing an abbab mistake

When I was filling in the ABBAB Pattern, I noticed a mistake I made. I filled in 3754 rows of Joel alleles into Heidi blank spots. In an ABBAB Pattern, I am only supposed to be filling in my alleles into Jon’s blanks. Here is the mistake:

That means in those positions, I’ll have an ABAAB Pattern where I should have an ABBAB Pattern. Oh no. So how do I fix that? I need a fix query. Under Pos ID, I’ll put in all the locations that are supposed to be ABBAB. Then I’ll make sure the first position isn’t the same as the second:

That results in only 103 rows.

If I update those 103 rows to Null, that should be a start:

Next I set the first position to be different than the last in this ABBAB Pattern:

That fixes another 212 rows. That may be all the rows to fix. I looked for more JoelfromDad = Heidi from dad where JoelfromDad <> LorifromDad and where JonfromDad <> LorifromDad, but didn’t see anything. The other updates must have been in areas with AAAAA by chance areas. In the meantime, I copied the first two columns on the left to the right, so I don’t lose my place when I am scanning across the spreadsheet.

Dad Pattern Fill-in First Round

The dark blue areas are the ones where there should not be any filling in based on the Pattern.

Summary

  • The Fill-in Step is a major part of phasing. In this step I filled in over 1 million paternal alleles in my DNA and in my 4 siblings’ DNA.
  • I noticed a mistake I made along the way, but figured out a way to fix it.
  • I figured out a shortcut to describe the different patterns by way of ID’s. The shortcut involved using a concatenation of a concatenation.
  • I haven’t yet filled in the random AAAAA Patterns that are within the other patterns. I imagine that would be important to do at some point. I know that David Pike has a utility to find Runs of Homozygosity. I suppose that would be useful for filling in alleles.

 

 

 

Part 5 – Raw DNA From 5 Siblings and a Mother – A New and Improved Method

In my last Blog, Part 4, I found that I needed to go back to improve a method from an earlier step to make a later step work much easier. This did two things:

  1. Gave me a cleaner database
  2. Set me back a ways

Re-do Principle 1: Homozygous Siblings

I need now to create a new table. This will have the allele from Mom and Dad for each sibling. I copied my previous table to a new one called tblV1andV2HomozygousSibs. I opened my new table in design view and added the 10 new fields that I needed:

The first five of the new fields will be have the Mom Patterns and the last five will have the Dad Patterns. Right now they are just blank. I’ll use an update query to add in homozygous alleles:

This query says when my allele 1 is the same as allele 2 (homozygous), put allele 1 into the slot from my Mom and my Dad. The Dad slot goes off the page, but is there. When I run the update, it fills in over 485,000 lines. To do this by hand would have taken a while. This is the first step to filling in the Mom and Dad Patterns:

I do the same query for each of my four other siblings. Care needs to  be made the the right alleles are going in the right place. For example I wouldn’t want to put a Lori allele into a JonfromDad column. Then I check to see if the columns are filling in:

If I recall right, this step fills in (or phases) about 8 million alleles. We don’t see any patterns yet other than AAAAA, but patterns are emerging in other parts of the table.

Step 2 – Homozygous Mom

Here when mom has a GG for example, she would have to give a G to each child at that position, as that is all she has. I’ll use the Update Query again for this. Here is the Criteria:

Here is the Update part:

Step 3 – Heterozygous Siblings

Here is an example:

I have TC in my two alleles at this position as do my siblings. My mom must have had CC as she gave a C to each of her siblings. That leaves a T that we must have gotten from our dad. It looks like I may need 10 Update Queries for this one. Here is the criteria:

The query says that Joelallele1 is not the same as Joelallele2 (Heterozygous Sibling) and I received my allele1 from mom.

I update the table to say I got my other allele from my Dad. This is a little more complicated Update Query. I then reverse the Joelallele 1 and 2. When I get allele2 from mom, I get allele1 from Dad. Before I run the Update Query, I view it each time to see if there is a reasonable number of rows being updated. If no rows are updated, there is probably an error in my query. This update is in the 40-50,000 row range. Also, if I get values in the view panes, it often means I have put the results in the wrong field. Usually many empty rows in the view output is a good thing.

I forgot to copy and rename my Homozygous Sibs Table, so I just renamed it to tblV1andV2w3Principles.

Finding Patterns

This time, I want to add ID’s to my patterns, so I’ll add two columns to my old Pattern Spreadsheet in Excel:

Rather than do formulae for each pattern again, I’ll just scroll through my table to see if I can finesse the Pattern boundaries and add Position IDs.

Finessing Pattern Boundaries

Here is an example at Chromosome 1 in the 77M range. There I had a change from ABBBB to ABABB. In my previous query, I only looked at Dad patterns where all the alleles were filled in. However, in the original pattern, we can infer the pattern even when alleles are missing.

Previously, I had the change at the top row where there is a full pattern. However, in going from ABBBB to ABABB, we only need the first three positions to identify the pattern. And actually, we only need position 2 and 3 to identify ABABB. At ID 25839, there is an AGG??? Pattern. This has to be in the form of ABBBB. Then 4 lines later, is ?AG??. This has to be an ABABB Pattern. Here is how I noted the change in the 77M range on my Excel Dad Pattern Spreadsheet:

The DadStart and DadEnd columns have the refined Position numbers.

Refined Chromosome 1

  • I had noted previously a possible AAAAA Pattern between AAAAB and ABBBB. It turns out that that is required. This is because to go from AAAAB (same as BBBBA) to ABBBB would require two changes. Only one change is allowed at a time. I will need to fill in the Positions and IDs.
  • The three ABABA Pattern areas need to be combined into one. They occur in a Centromere and in an excess IBD area. The Genealogy Junkie has a good Blog on that topic. I downloaded a file she had with the exact areas.
  • I added the IDs for the start and stop of the Chromosome as tested as well as the start of the next Chromosome. These are highlighted in dark purple.
  • Only 22 chromosomes to go.
Chromosome 2 Refined

Here I added a new column. This is the number of IDs or SNP positions between patterns. Note that there is a negative 4 in one case. This was an odd case where the two patterns at the crossover were inverted. I didn’t know what to do there, so I left it as is. There is a Centromere from 92-95M, so I will combine the two AAAAB Patterns that I have when I create the clean version of this table.

Chromosome 4 refined

Here I had to add a green AAAAA Pattern to make this work. Note that I am getting fewer crossovers.

Chromosome 5

Here is another case where an AAAAA Pattern is needed:

The pattern is needed between AABAA and AAAAB for two reasons. For one, there is a large gap between the end of AABAA and the beginning of AAAAB. Also, to go from AABAA and AAAAB requires two changes and only one is allowed. That requires an intermediary step of AAAAA between these two patterns [AABAA > AAAAA > AAAAB].

Here is Chromosome 5 completed:

The addition of the AAAAA Pattern results in the addition of two crossovers. Another note is that I could have had the first pattern start at the beginning of the Chromosome and have the last pattern end at the end of the chromosome. That is because there is not much room there for other crossovers.

a chromosome 8 Decision

The issue here is the two AAAAB Patterns in a row. Should they be combined or should I add an AAAAA Pattern between the two AAAAB’s? I’m going with combining. The reason is that if I put an AAAAA between the two, that would give Lori two paternal crossovers in a fairly short span. This does not happen in nature – at least in the middle of a chromosome. This would be like inheriting a 2 cM segment from a grandparent.

Chromosome 9 decision

Lori has two crossovers in a row, which is not ideal. Then there are two ABAAA patterns in a row. I decided to combine these. This is because when I look at the table, there is a centromere in there and a lot of missing SNPs. If I did create an AAAAA pattern, that would result in two close crossover for Sharon.

Here is the cleaned up version with the rogue SNPs taken out:

Missing Pattern Chromosome 10

There is a missing pattern between Lori and Jonathan’s first crossovers. AABAA > ABAAA is two changes, so I need to insert an AAAAA Pattern between the two. This will result in two new crossovers: one for Heidi and one for Sharon.

Chromosome 11 – Halfway?

The good news is that I’m at about 2/3 of the way. I have over 900,000 locations and Chromosome 11 brings us past the 600,000 mark. Note again the need for an AAAAA Pattern between the last two patterns. That will add a Lori crossover and a Jonthathan crossover, so they won’t be left out.

Chromosome 12 patterns

Chromosome 12 looks like it is missing a lot betwee ABBBB and AABAA. However, it is just missing an AAAAA. That is because ABBBB is the same as BAAAA. The progress goes BAAAA > AAAAA > AABAA. As it turns out the crossovers that have to do with transposing relate to me (Joel). The extra crossovers go to me and Heidi.

Here are the numbers filled in for Chromosome 12:

Sketchy Chromosome 13

I note that Chromosome 13 is a bit sketchy, with no identified sibling crossovers. It appears that AAAAA Patterns are needed here also.  There is about a 4M space where there is room for an AAAAA Pattern between 24M and 28M. There is also room after the AAAAB Pattern which would give Lori another Paternal crossover. This last crossover is shown in Gedmatch:

These are matches of my father’s first cousin to myself and four other siblings. This shows Lori’s crossover on the bottom match. As all siblings match to the end of the Chromosome, that would be the AAAAA Pattern.

Here is the finished Chromosome 13:

Lori’s crossover as shown in Gedmatch shows on my table at about 90M. Keep in mind that Gedmatch uses Build 36 and my table is in Build 37.

Chromosome 21: Refinement Example

Here is an example of a refinement. In my initial query, I was looking for patterns that were filled. However, in going from AABBA to ABAAB (which defines my crossover), it is the same as going from BBAAB to ABAAB. The only change in pattern is in the first three letters: BBA to ABA. We can see that change here even though the last three letters are missing:

Chromosome 22: Extra AAAAA needed

On Chromosome 22, there is a lot of room at the beginning of the Chromosome to put in an AAAAA Pattern:

There are about 5M SNPs between the start of the Chromosome (16M) to where the AAABA Pattern starts at 21M. I have 4 of my siblings mapped out using visual phasing:

This shows on the paternal side (Frazer) that there is an AAAAA Pattern. That is represented above at the start of the Chromosome in blue. I am just missing Lori. Without looking at all her results, I see she has a full match with Heidi at the beginning of the Chromosome:

And here is the last Paternal Chromosome finished:

It was a lot of work, but now I have what should be the start and stop points for all the Paternal segments for me and my four siblings.

Summary

  • I needed sequential IDs for my Access Queries to fill in missing alleles
  • To do this I needed to go back to the beginning and re-import the raw data for six people
  • I created a table for five siblings showing where they got their paternal and maternal alleles based on three principals.
  • I went back to my Paternal Pattern Table and refined what I had already done
  • I also added IDs to my Paternal Pattern Table
  • Next up is to look at the Maternal Pattern Table and start filling in blanks using MS Access

 

 

 

Part 4 – Raw DNA From 5 Siblings and a Mother – A Problem with Filling in the Blanks

In my last Blog, I looked at Maternal and Paternal Patterns created by initial phasing of my raw DNA. The Paternal side patterns were pretty complete, but I didn’t have much on the Maternal side. I summarized the patterns I found in a spreadsheet. I added siblings’ crossovers where found and added start and end positions on each chromosome where found.

Filling in the Blanks

  • First I made copies of my Mom and Dad Spreadsheets
  • Then I cleaned them up, taking out the Patterns that were only at one location (one SNP)
  • Then I used a filter to get the patterns.

The AAAAB Pattern filter on the Paternal Patterns spreadsheet looks like this:

On the Maternal Table, it looks like this:

 

I’ll start with the maternal Patterns as it will be easier. First I copied that Table I was working on to create a new Fillin Table called tblFillinStep1.

Here is what I portion of the Chromosome 10 AAAAB Maternal Pattern looks like:

 

In my previous analyses before I had 5 siblings tested, I had sequential ID’s. This made things easier in choosing patterns. Now, however, I am having trouble getting the ID’s sequential. This is probably due to my adding the extra alleles. My options now are to update the missing alleles by hand or create queries to update them.

Just Like Starting Over – a New Access Database

I decided to try importing all the raw DNA files into a new Access Database. This time I won’t let Access assign the key data field. I’ll use the rsid as the key data field. Perhaps then I can assign a new ID that will apply to the merged V1/V2 AncestryDNA dataset.

One problem that I’ve noted with the AncestryDNA downloads is that they only give you a date on the filesname. There is no indication in the file or in the file name of who the data is for. That means that it is important to rename your files. I uploaded my sister Heidi’s results to Gedmatch on 26 April 2015, so that is a clue. I see an AncestryDNA zip file from 25 April 2015, so that is a clue. I guess I’ll add her name to the file now!

The Clean start

Here is the Access Database with just 5 tables:

Here’s my Query to create an AncestryDNA V1 Raw Data Table:

I see in my results, I got more lines than last time. I checked and I forgot to only include Chromosome 1-23. I corrected that and got the 700,153 lines as before. I did and analogous query for Jon and Lori and got 666,532 rows.

All V1 Results plus the v2 results where they are they match v1

Next, I’ll create a Query between the two tables I just made.

This will have a right hand join. As stated above, it will produce all the V1 data and only those records from V2 where they equal V1. That brings in Jonathan and Lori into the V1 results. I made this data into a new table.

Finding v2 data not in v1

Next I want to add only the V2 part that isn’t in V1. When I get this, I can add this to my query to get all the results. This is the query that I couldn’t remember how to do last time. I put V2 on the left, with a right hand query.

This gives everything from V2 plus the V2 that equals V1. However, the trick is that I set V1 to ‘Is Null’ which takes out the V1. That should give only the right section of the peach colored circle below.

For some reason, my new number for the right entire circle is 666,532. The new query results is the right hand circle minus the overlapping data which is 242,494. I’ll append this query to the table I made with my previous Query. I renamed this table as tblV1andV2 and it has 942, 647 rows.

Next, I want to sort the table and add an ID.  I copied the structure of the table to a new table. Then I added a Position ID (Pos ID) with an autonumber. Then I made a query to append the old data to the new table with the autonumbered Pos ID. This gave me what I wanted but not in the right order. So then I used the sort function to get the table in the shape I wanted:

I can tell the table is right by going to the last record:

Before the last record would include the V2 alleles that were added but they weren’t sorted. This long process gives me a sorted ID in the same relative position as the sorted Chromosome positions. The reason I need that is to describe my patterns  in a simpler way than by Chromosome start and stop.

Summary and Next Steps

  • Going down the road of filling in patterns, I found something that I needed to correct in an earlier step
  • This caused me to start afresh with cleaner tables.
  • I was able to add a unique ordered Position ID to the combined V1/V2 table
  • This position ID will be used to identify each Pattern for filling in missing alleles
  • However, I have to re-do the patterns. This time, I will include the Position ID in the start and end of each pattern in my summary of Mom and Dad Patterns.

 

 

Part 3 – Raw DNA From 5 Siblings and a Mother – Patterns

In my previous Blog, I looked at Whit Athey’s Principle 3 for my mom, my 4 siblings and myself. Based on that Priniciple and the previous 2, I phased our DNA up to a point. The next step in the phasing has to do with patterns.

Patterns

The patterns I am talking about are the patterns that the five siblings receive from either their mother or father.  For example, an AAAAB pattern means that the first 4 siblings received the same allele and the fifth sibling received a different one. I had mentioned previously that the patterns should be in this form:

  • AAAAA
  • AAAAB
  • AAABA
  • AAABB
  • AABAA
  • AABAB
  • AABBB
  • ABAAA
  • ABAAB
  • ABABB
  • ABBBB

The first situation is a special case as this situation can happen within the other patterns ‘by accident’ as Athey puts it.

AAAAB Dad pattern

First I’ll look at a query to find an AAAAB pattern.

That Query results in this:

Except there are actually over 8,000 lines. I summarized the rough starts and stops in an Excel Spreadsheet:

This part can get a bit tedious. In Chromosome 2, I noted a possible break between 89 and 96M, so I’ll need to keep an eye on that. Highlighted in yellow are single patterns which may or may not be significant.

quality check

I took my AAAAB Query results and put them into an Excel spreadsheet. Then I subtracted the previous position number from the current position number to see where there were gaps. Then I filtered the gap to 1,000,000 or more positions:

 

This is my gap analysis. I highlighted the 7 million position gap where I put in an extra segment on the AAAAB pattern. This points out some of the single AAAAB patterns also.

mapping the initial results

Let’s look at Chromosome 13 between 28 and 87M. With an AAAAB pattern, that means that Joel, Sharon, Heidi and Jon match the same paternal grandparent. Lori matches a different one. However, we don’t know which paternal grandparent without a reference cousin. Fortunately, I have one. He is my dad’s first cousin. He would match on my paternal grandfather’s side. That grandfather is James Hartley, b. 1891:

Paternal cousin Jim matches the 5 siblings here:

As you may guess, Lori is on the bottom (#5). She has a crossover at about 85.5M according to Gedmatch. That means that before 85.5M she is matching on my father’s mother’s side: Marion Frazer. So, if I wanted to, I could start to map Chromosome 13. From 28 to 87M, I could say that 4 siblings got their DNA from their paternal Hartley grandfather and one sibling, Lori got hers from her paternal Frazer grandmother.

Further, I would expect an AAAAA pattern starting at 87M based on the gedmatch browser results above. The bad thing about an AAAAA pattern is that there is some missing DNA for the other grandparent. In this case, the Frazer DNA is lost on the right side of the map below. Another point is that these patterns change one letter at a time. So it makes sense that an AAAAB would go to an AAAAA. For example, an AAAAA would never go directly to an ABABA.

Here is a paternal only map of Chromosome 13 based on our very initial results:

aaaab Mom pattern

I notice that the formula that I used to find the AAAAB Dad pattern, I can move over to the mom side. So I might as well do this while I’m thinking of AAAAB pattenrs and put the results in Excel.

I randomly used Heidi as the ‘A’. So Lori not matching Heidi becomes the ‘B’. The results for this maternal query was much smaller with only 189 lines.

 

This was a lot easier. The Mom and Dad Patterns don’t interrelate with each other, so I have them on separate worksheets. Note that there is the same AAAAB pattern in the same starting place on Chromosome 13 as there was on the paternal side. This is a coincidence and the starting spot is a coincidence. This is just a rough number now and may be refined later. I could make a map of this also.

Here is a cousin on my mom’s father’s side:

Here she matches Joel, Sharon, Heidi and Lori from about 74-99M. Here is a map drawn on the Gedmatch browser and raw data phasing:

 

This shows what a AAAAB pattern looks like that is both paternal and maternal between 28 and 45M. I also show two crossovers for Lori: (Frazer to Hartley) and (Rathfelder to Lentz). In addition, Jon has to have a crossover from Rathfelder to Lentz and Lori has to have another crossover from Lentz to Rathfelder somewhere in the white spaces. There is a reason that I could tell the maternal A’s of the AAAAB pattern were Rathfelder even though our cousin match did not overlap that area. It is because the patterns do not change that fast as I explained above.

Now that I know which sibling has one of the paternal crossovers I can mark it on the Dad Pattern Spreadsheet:

I name the crossover column in the spreadsheet for the end of the pattern position, so it will be clear where it is. This is the ultimate goal of the process: to find the crossover locations and assign them to the siblings. Once this is done a map may be drawn for all the siblings.

The Next Step

In the next step, I could fill in the missing alleles between the Start and End positions of the AAAAB patterns. Here is how that will be done:

The highlighted row is where the AAAAB Pattern starts. Basically, what will happen is if there is at least one Allele in the first four positions, I will be able to fill in any of the other alleles in those first four positions with the same allele. However, in the last row, for example, there is just one G in the last position. We don’t know if the other four alleles will be a G or another letter. The row that has TTT??. We know that we can fill in the fourth T to TTTT?. However, the last allele we don’t know if it will be a T again or a different alelle. So we also need to leave that space blank.

However, I want to make sure I have all my patterns right, so I will look at all the patterns first and reconcile them.

AAABA Pattern

If I drew my map correctly above, I will be expecting Lori to have a maternal AAABA pattern on Chromosome 13. This should change to an AAABB pattern at about position 95M. I’m already on a maternal query, so I’ll start there.

 

I used Heidi again as the A. Now Jon is the B that is different than Heidi. I was surprised with the results as I only had this maternal pattern in Chromosome 1 and 23:

 

My prediction of a Chromosome 13 AAABA Pattern did not come true. I wonder what went wrong?

Paternal AAABA Pattern

Here is a partial summary of the Paternal AAABA Pattern:

 

On Chromosome 11, we see the AAABA pattern twice with an AAAAB pattern in-between. To go from an AAAAB to AAABA there has to be a transition pattern: either AAAAA or AAABB. Hopefully this prediction will be correct! That leads me to the AAABB pattern.

AAABB

This pattern requires a slight modification of my previous query:

 

This pattern is adjacent to the AAABA Pattern, so I will be able to assign some crossovers:

 

These crossovers belong to Jon and Lori as Jon is in the next to the last position of the patterns and Lori is in the last position of the patterns. Note that in Chromosome 19, Lori goes from an AAABA to an AAABB at about 5M. However, there is a rogue AAABB in the AAABA pattern at around 3M. That could be due to a misread or a mutation. I’m not sure. Jon has a crossover on Chromosome 8. These are all Lori and Jon crossovers, due to the positions of the pattern changes we are looking at. The changes are all in the last two positions.

AAABB Maternal

I’m still getting very few crossovers here. I’m not sure why:

 

I’m not sure why the maternal side is not keeping up with the paternal. I have no crossovers here yet.

AABAA

Following my alphabetical reasoning, AABAA is my next pattern. I’ll start with the maternal:

 

I changed to having my [Joel’s] allele the ‘A’ in the Pattern. The results look right:

 

It seemed like there was a break in Chromosome 5 between 46 and 50M.

AABAA Paternal

 

On Chromosome 5, there was a gap similar to the one on the Maternal side.

Centromeres

According to ISOGG, these are the Build 37 Centromeres:

This is good information to have. I assume that the Centromere is not counted, so I will ignore the Chromosome 5 missing area and make a note that the centromere is there. This also makes a difference on all the results.

AABAB: Are We There Yet?

 

Here are Heidi’s first crossovers. I’ve also heard of crossovers referred to as cut points. I am noting where the centromere is – though not quite spelling it correctly above.

Here is the Maternal AABAB. I am still annoyed that there are so few patterns. They seem to be missing for some reason:

 

I suppose, if this trend continues, I could do the project over and add in my mother’s and my FTDNA raw DNA results.

AABBA

I didn’t find any AABBA Patterns on the maternal side. However, that was with a query using my results as A. However, from my previous Blog, I recall this chart:

 

This showed that on the Mom side, Jon and Lori had the most alleles. I’ll run the query again this way:

 

Still no patterns.

Here are some more Dad Patterns:

 

However, there are a few problems. Chromosome 17 is missing a pattern. I can solve this by looking at the original table.

 

Here the pattern is AABAA.

The next problem is that there are two patterns in one spot on Chromosome 22. I ran pattern AAABA again and see it should have ended earlier:

 

Here is the right answer below that also shows a Heidi crossover at that location:

 

AABBB

Paternal

 

Maternal side

Still nothing.

ABAAA

Maternal side

 

Paternal

This side had more patterns.

ABAAB

Had several of ABAAB Patterns on the dad side, but only one on the mom side. I think that there is a fill-in step that fills in the mom side from the dad side that may correct this later.

ABABA, ABABB, ABBAA

I did notice a Dad Pattern discrepancy on Chromosome 6:

 

There are three single patterns, I figured were discrepancies. However, there appeared to be a longer AAABB Pattern within the ABBAA Pattern. This is where it helps to look at the raw data.

 

The blue section is the start of the AAABB rogue pattern that I had. However, a closer examination reveals that this pattern is not continuous from position 30514810 to 30594827. Between those two there are a lot of ABBAA patterns. This is clear at position 30544401. However, this is also clear wherever the first 2 alleles are different. For example, on the last line, I see GT???. This will be filled in with GTTAA as this is within the ABBAA Pattern area. So what happened was that there were two single AAABB patterns. When I did the query for these, it looked like the pattern was continuous, but it was not. Based on the above, I’ve modified my Dad Pattern Spreadsheet to show two single discrepancies:

 

I won’t overwrite this information, but I will keep it in mind for later in case it is important. If this was a real crossover, it would be mine. However, crossovers in the middle of a chromosome don’t change that fast for one person on one copy of their chromosome.

Some of the Dad Pattern Crossovers are starting to fill in:

 

Starts and Ends of Chromosomes

At some point, it is important to know where the Chromosomes start and end. The testing companies don’t always start at the beginning positions of each chromosome. The ends are different also based on the lengths of the chromosomes.

I was able to find what I was looking for using a min/max Query in Access. I took my table with the 900,000 plus alleles and made a query that looks like this:

 

When I run the Query, I get this helpful table:

This tells me the start and end locations for each of the chromosomes that I am looking at.

I put this into Excel and highlighted the information in purple. Then I sorted it into my mom and dad pattern spreadsheets:

 

Now, I can tell that I am near the beginning and the end of Chromosome 20 with the pattern locations. However, on Chromosomes 21 and 22, there is still room for more at the beginning of those Chromosomes. As the Chromosome 20 patterns are complete, this also tells me that my sister Lori has no paternal crossover on Chromosome 20.

ABBAB, ABBBA, and ABBBB

These are the last three patterns, not counting AAAAA. I finally have one crossover on the maternal side. It is on the X Chromosome:

 

I have a mess to clean up on Paternal Chromosome 2 :

 

There appear to be two patterns occupying the same space between 123 and 128M, which is not good. I’ll take a look at my Table: It appears that the AAAAB at 127,841,390 is a one-time occurrence. Here is my correction:

 

Note that there is still a gap at AAAAB. There may be an AAAAA Pattern stuck in there.

Lessons Learned and Next Up

  • It is good to document the process in case something goes wrong
  • The start and end points are needed for each chromosome
  • The start and end for each centromere is needed also
  • Attention is needed for the location of each crossover and who it goes to as this is a main point of all the work.
  • Changes along each copy of the chromosome are gradual. They happen one at a time and those one at a time changes correspond to siblings.
  • Next up is filling in the blanks. That was discussed briefly in this Blog.

Raw DNA From 5 Siblings and a Mother: Part 2

In my previous Blog, I started to phase 5 siblings based on their raw data and the raw DNA data from their mom. I looked at homozygous results. That is, when each sibling had the same allele, it meant that they got one of each of those same alleles from each parent. Also when my Mom had homozygous results, say GG, she had to have given one of those G’s to each of her children in that location.

I am using an Athey paper on Phasing from 2010. I looked at his first 2 principles in my previous Blog. Here is Principle 3:

Principle 3 — A final phasing principle is almost trivial, but it is normally not useful because there is usually no way to satisfy its conditions: If a child is heter
ozygous at a particular SNP, and if it is possible to determine which parent contributed one of the bases, then the other parent necessarily contributed the other (or alternate) base.
Heterozygous is a fancy term meaning two different alleles. This principle also lends itself to MS Access, but it requires a few more steps. In my case, the known contributor is my Mom. So in the case where my Allele 1 is different from my Allele 2 and I have an allele from mom. My allele from dad will be my other allele. I just have to make a formula out of that. It sounds like a high school math word problem.
First, I copy my homozyous allele from mom table to a new table. This is in case I make a mistake and have to go back to my previous table. I’ll call my new table, ‘tbl5SiblsHeteroMomtoDad’. Again, I’ll use an Update Query, to update the table with the new ‘from Dad’ alleles. There shouldn’t be an allele from Dad in any of these situations, as we have only put those in where the children were homozygous.
I used the Access Expression Builder to get my heterozygous results:
Here is the second part of the criteria:
This part says that where I’m heterozygous, and my allele from mom was allele1, put allele2 in as from Dad. Before I run this, I presently have 485,834 alleles from Dad. When I go to run the Update Query, I get this message:
After I run the Update Query, I now have 533,517 results. This is the same as 485,834 plus 47,683, so I assume that I am on the right track. I next have to run this one more time for myself for the case when my allele from Mom is allele2 and my allele from Dad would then be allele1. Then I will run this eight times for my four siblings.
5 Phased Sibs Update: V1 and V2

I did all my Principle 3 phasing and here is the update:

What is a little surprising is that Jon and Lori who were tested as AncestryDNA V2 had more Mom-phased alleles. I did mention above that they were getting extra phasing on SNPs that they hadn’t tested from their mom, but I didn’t realize how much.

I mentioned in my previous blog that the combined number of SNPs tested between V1 and V2 is 942,269. That number represents the merging of V1 and V2.

Also some of the specifics are a bit off. For example, my numbers include phased results for myself from my dad (16,536) on the X Chromosome. Well, I didn’t get an X from my dad. This means that the JoelfromDad and JonfromDad numbers above are a bit high.

Next up: DNA patterns

 

Playing With Raw DNA Data From 5 Siblings and a Mother: Part 1

In many past Blogs, I have written about the raw DNA data of my siblings and my mother. They can be searched in my Blogs under “Raw DNA Data”. First, I looked at three siblings compared with our mom. Next I looked at the results of four siblings. Now I have a 5th sibling tested.

Phasing With Raw DNA Data

The reference I use is Whit Athey’s 2010 paper called, Phasing the Chromosomes of a Family Group When One Parent is Missing.

Basically, Whit uses certain rules and iterative processes to fill in blanks of what the parent’s alleles would be as passed down to their children. From these lists of alleles one can see patterns. From the patterns, one can see where the maternal and paternal crossovers would be. The process is similar to the visual phasing process developed by Kathy Johnston. However, Kathy’s version does not require the use of a parent. Also Kathy’s version does not require looking at hundreds of thousands of alleles.

Lori’s raw data

The fifth sibling in my family to have an autosomal DNA test is Lori. She tested at Ancestry DNA. First I unzipped her results. They open in notepad. I then opened those results in Excel, so all the data would align in columns.

 

The columns are a SNP ID, the Chromosome, the position on the Chromosome in Build 37 and allele 1 and 2. I added Lori to the allele columns, so I could distinguish between siblings when comparing. One quirk is that when I convert from text to Excel, the blanks in the allele columns go to zeros. I then have to search for all zero’s in those columns and replace them with blanks. The blanks are no-calls.

This data shows Lori’s alleles unsorted. We do know that where she has a C and a C, that one is from her dad and one from her mom. However, where she has an A and G on Line 380436, we don’t know which is from her dad and which is from her mom.

Lori’s DNA in MS Access

I didn’t realize I could upgrade my old computer to Windows 10. It was just new enough to do that. When I did that, I rented Office 365 which includes Access. Access is good for comparing large amounts of data. Lori has 666,531 lines of data. There are 2 alleles for each position. So with six sets of data, that is a lot of alleles. I figure about 8 million. However, the crossovers occur at a distinct point. Finding crossovers is like finding a needle in a haystack.

First I import Lori’s Excel File into Access. It looks pretty much the same there. Except that Access adds an extra ID to keep track of things. Next I want to make an Access Query based on Athey’s Principal 1:

Principle 1If a person is “homozygous” at a location that is, having the same base on each of the two chromosomes of a pair, then obviously at that location it is possible to know with certainty that both chromosomes of the pair have that base at that location, but this is an almost trivial form of phasing.

Principle 1 in Access Query form

Here is Lori’s first query in design view:

It’s a bit small. All I did is put all of Lori’s imported raw data into a query. Then in the last columns I created a field called Lorifromdad. Then there is a formula that says if Lori’s allele1 is the same as her allele2, then put in allele 2. When I run that query, I get this:

Next, I want Lori from mom, which will look the same as Lori from dad. This is easy. I can just copy the same formula and give it a different name:

Also, I forgot that Ancestry has other DNA information in the raw data that I don’t need so I need to restrict the data to Chromosomes 1-23:

It’s nice to check the results to make sure you are getting what you want. This looks pretty simple, but Access does this operation over 600,000 times, so it saves a lot of time.

Next I add Jon:

I have the same kind of formula to Jon’s homozygous results from his mom and dad. I made an equal join in the query above. Note that Jon and Lori both tested with AncestryDNA V2. That means that they have the same SNPs tested. My 2 sisters, my mom and I all tested with V1. So we have to be careful with these joins. If I was to have used an equal join between a V1 and a V2 test, I would only get the results which were common to both.

When I view the query above, it looks like this:

Note that on the third line, Lori has homozyzous results and Jon does not.

Adding AncestryDNA V1 and V2 raw results

The next step is that I would like to carefully add the V1 homozgous results to the V2 homozygous results. Also I would like to make a large table out of what I get.

  • On my existing V1 Homozygous Table, I have 700,153 rows
  • On my new V2 Homozygous Table, I have 666,153 rows
  • The V1 SNPs that are the same as the V2 SNPs are 424037 rows or results

That means that I would like to have a table that has the V1 results plus the V2 results, minus the results in common, so that should be 942269 rows. Somehow I ended up with such a table. I know that’s not very scientifically reproducible, but that is what happened. I’m not sure how important it is to have the V2 results as they won’t phase with my mom. However, I’ll have them in case I need them.

The results of the query left the two V2 siblings’ results on the bottom of the table, but they can easily be sorted:

Principle 2

According to the Athey paper:

Principle 2 — If data from one of the parents are available, and that parent is homozygous at a SNP location, then another almost trivial phasing is possible
since obviously that parent had to send the only type of base s/he had at that location to the child.

 

This principle lends itself to Access. Basically, I want to tell Access that if Mom has the same two alleles, then show that each child got that allele from her. However, there are a few considerations. If mom has no-reads and the child doesn’t, then we don’t want to overwrite a good read with a bad one. The other consideration is, if mom had an incorrect read and the children had a correct one, we wouldn’t want to overwrite that either. However, I don’t know how rare that is. I guess it is pretty rare. I did a query to check and didn’t find any such instance. So that is one less thing to worry about.

Principle 2 in access

I want to say if mom has two non-null alleles that are the same, put that allele in as from her for all her children. Looking at my old queries, it looks like I need an Update Query. First, I copy my previous table of results to a new table called tbl5sibsMomHomozygous. I’ll try this query:

Before I update, I’ll take a view:

 

If I take out the ‘And is not null’ statement, I get the same results. I then changed the syntax to ‘Is not Null’ first and got one less record: 481977. It makes me wonder what that record is? I’ll use my second wording as it may be more accurate. Next I hit the !Run button and it updates the table I recently made.

This will give V1 mom alleles to Lori and Jon even when they weren’t tested for them. Here is an updated table view of just the alleles from Dad and the alleles from Mom:

I picked these results at random about halfway down the table. It looks like about half the alleles are filled in already. So now my siblings are more than half phased. The first 5 rows are alleles from Dad for each of the 5 siblings. The 2nd five rows are alleles from Mom for those same siblings.

a pattern preview

The highlighted row shows a pattern from Dad and one from Mom. The first row also shows the same pattern. This is what we will be looking at later in more detail to determine crossovers on the maternal and paternal side. This is what I’ll call the ABABA pattern for both. Here it is coincidental that both the Dad and Mom patterns are the same. Obviously with 5 siblings, there will be a lot of different types of patterns:

  • AAAAA
  • AAAAB
  • AAABA
  • AAABB
  • AABAA
  • AABAB
  • AABBB
  • ABAAA
  • ABAAB
  • ABABB
  • ABBBB

Those are the combinations that I can think of right now. AAAAA is a special case. This could mean that all five siblings could share the same grandparent or sometimes an AAAAA pattern is that way by coincidence. Where the maternal or paternal pattern changes is where the crossover is. This pattern should be gradual. That is, only one letter should change in a pattern change. For example, ABABA may change to ABABB or AAABB. There are many possibilities but only one letter will change. The placement of the letter represents one sibling. So that sibling will own that crossover. For example, a maternal ABABA to ABABB change would represent a maternal crossover for Lori as she is in the last position on my table. The place where the A goes to a B is the location of the crossover.

Next Up

Next up is Athey’s Principal 3 as it applies to 5 siblings and a Mom.

 

 

 

Mapping My Chromosome 20 Using My Raw DNA Results

In a past blog, I mentioned My Big Fat Chromosome 20. That blog is also referenced on the ISOGG Chromosome Mapping Page. This particular Chromosome had puzzled me for a while due to the preponderance of matches I was getting there. I used visual phasing and determined that the overload of matches was on my paternal grandmother’s Frazer side rather than the Hartley side. I had previously supposed that the Hartley side held the key to all my matches as that side had colonial Massachusetts roots. Since that time, I had my brother’s DNA tested. He is shown as F in the bottom row below. I thought that his results might add some clarity to Chromosome 20.

chrom204sibs

Rather than clarifying things, I just got a shorter version of what I already had for Jon (F) than I had for myself (J) and my two sisters. The problem is the phenomenon of close crossovers at the beginning and end of each chromosome.  Jon also has quite a few matches in Chromosome 20 (unlike my sister Sharon who had Hartley DNA in most of her paternal Chromosome 20). He has almost 30% of his phased matches there according to his match spreadsheet based on Gedmatch.

Going to the Source – Raw Data Phasing

I have been learning how to phase my raw data based on a Whit Athey article, MS Access and the work that M Macneill has done. The Whit Athey Paper describes how to manipulate the raw DNA data of one parent and four siblings to get Dad Patterns and Mom Patterns. I have found these patterns to be useful.

Dad Patterns

Even though my dad never had his DNA tested, based on the certain principles, I have come up with a spreadsheet that shows for various sections of the chromosomes matching patterns that I have with my other three siblings. I use A’s and B’s to give a generalized pattern. The patterns will be in the order of Joel, Sharon, Heidi and Jon. Here is my Dad Pattern spreadsheet showing Chromosome 20:

dadpatternchr20

I find my gap to next column handy. The first thing that I notice is that there are not many large gaps. If there were very large gaps, that might indicate an AAAA pattern where all the siblings match (in this case a paternal grandparent). One thing that I added today is a Start and Stop. This is the first and last tested position of the Chromosome. This is good to know in case a pattern is hiding at the beginning or end of the chromosome. Let’s just look at the second line of the spreadsheet. This shows that there is a pattern of ABAB from position 0 to 10M. This means that the first and third people (Joel and Heidi) match the same paternal grandparent and the 2nd and 4th siblings (Sharon and Jon) match the other paternal grandparent.

In the third row of the spreadsheet, a new paternal pattern starts (at 10M). This is ABAA. Now sibling 1, 3, and 4 (Joel, Heidi, and Jon) match each other. The difference between ABAB and ABAA is in the last position where I have Jon. He switched from a B to an A and now no longer matches Sharon, but he does match his other three siblings on the paternal side. As Jon is the one that changed, he gets the paternal crossover at this position.

A few other notes
  • These patterns are gradual. That means that there can be only one change at a time.
  • If it looks like there are two or more changes, then either something was done wrong or you have to invert the A’s and B’s
  • For example, above in row 4, I have an AABA pattern that goes to and ABAB. On face value, it looks like three changes. However, AABA is the same as BBAB. Actually it is the first B changing to an A. This is my position A, so I have a crossover around 54M on the paternal copy of my Chromosome 20.
  • These areas of patterns are also used to fill in bases received from Dad or Mom in the particular areas that the patterns occur in each chromosome.
  • If there are only three siblings tested, these patterns are not as informative.
Mom Pattern spreadsheet

I would not want to leave mom out. Here is the pattern of her 4 children matching on the maternal side:

mompatternchr20

Like the Dad Pattern Spreadsheet, everything looks well behaved as there are no large gaps between patterns. Also there are no gaps at the beginning or end of Chromosome 20. So there you have it. That is the phased DNA for myself and my other three siblings. But it doesn’t jump out at you and I don’t have a map yet. That is where I bring in the MacNeill <prairielad_genealogy@hotmail.com> Spreadsheet.

MacNeill’s Excel Spreadsheet

I adjusted MacNeill’s Chromosome 1 spreadsheet by replacing default numbers for Chromosome 20. Then I added in the locations I had in the spreadsheet above. Those are the Start36 and Stop36 columns. The 36 refers to Build 36 locations which Gedmatch uses. After that I colored in the bars to be consistent with the visual phasing I had done previously.

chr20map1

Actually, I now see that I colored Sharon’s paternal  bar backwards. She should have mostly Hartley (blue). This transposition also carried through to the next image, but I corrected it in the final image. I like having labels, so I copied this into PowerPoint and added some:

chr20map2

Next I add any appropriate cousin matches for Chromosome 20. I also made the sibling names on the left a little bigger. My mistake above on Sharon’s paternal bar is corrected and verified by her large paternal Hartley cousin match with Jim below.

chr20withmatches

I had to bring this back into PowerPoint to re-add the surnames. The places where the cousin matches start or stop may be crossovers for me and my siblings. From comparing the top part of the chart to the bottom, it should be obvious which crossovers are for me and my siblings and which are for the cousins. The good news is that the raw DNA phasing confirms my initial visual phasing done in January, 2016. The raw DNA phasing just filled in what I was unable to. The other good news was that there were significant cousin matches on both the paternal and maternal side of Chromosome 20 to make sure that all the grandparents were identified correctly. Since I did the original visual phasing last January 2016, I have gotten the DNA results of 2 more cousins. Also one additional cousin who previously had her match to only me at 23andme uploaded her results to Gedmatch.

Notes/Summary

  • The hard work in Raw DNA phasing is assigning all the bases of the siblings to the correct parent. Then patterns are discerned and noted.
  • The fun part is mapping out the results.
  • Raw DNA phasing and mapping is more accurate and complete than visual phasing. However, it takes a lot of work and works best when there is at least one tested parent.
  • The comparison of the raw DNA mapping to the actual cousin matches points out the fuzzy boundaries noted by others. This may be seen in Sharon’s short Lentz segment. Her cousin Judy match (who has Lentz ancestry) appears to exceed the length of Sharon’s Lentz segment.
  • Out of the four siblings, Sharon is the one who didn’t get the huge dose of Frazer ancestor matches. That means that she would be the best for looking for smaller matches at Gedmatch.com. Her smallest match is 9.3 cM (5.9 Gen) and my smallest match at Gedmatch is 10.7 cM (5.2 Gen).
  • At a glance, one can see who is the best person for finding matches with each of the four side of the family. For example, I received a full dose of Lentz DNA on Chromosome 20. Here is my Lentz grandmother (b. 1900) in her younger days. Her DNA is represented in yellow in the charts above.

emma

Using M MacNeills Raw DNA Phasing Spreadsheet and My Problem Chromosome 10

I have written many blogs about phasing my own raw DNA. One of the things that was bothering me while going through the process was the presentation of the results. It is possible to phase millions of bases using the raw DNA results from one parent and at least 3 siblings. But once the DNA is phased, how can those results be best portrayed? In my previous Blog on the subject, I was able to figure out a fairly simple way to show my results, but the outcome was not totally satisfactory.

chr7patmatmap

I liked how I was able to get the grandparents’ surnames at least in the first 2 bars. I also liked how I had a simple scale at the bottom. However, one of my bars went too far. Also, my simple chart started at zero and Chromosomes start at different positions. I was able to fix the bar going too far today. Excel makes these bars based on distance rather than positions, so one of my equations was wrong.

I told M MacNeill <prairielad_genealogy@hotmail.com> of my concerns and he sent me his spreadsheet. One feature I really liked about the MacNeill Spreadsheet is that it had a place for cousin matches at the bottom. Below is the first Chromosome where I used my phased raw data from my mom and 3 other siblings to create a MacNeill Chart.

chromosome15macneill

Sharon’s maternal first little segment didn’t work out perfectly, but that didn’t bother me. I know that the beginning and ends of Chromosomes can have small problematic segments. Note at the bottom that my match to Carolyn in yellow shows where my maternal crossover is in the upper part of the chart where I go from red to orange.

My Chromosome 10

I am looking at my Chromosome 10 because, for one thing, I have had trouble trying to visually phase this Chromosome in the past. Here is my attempt at visual phasing from early in 2016:

chr10visphase

Here is another try including additional cousins that tested:

10r1visphase

Note how different the maternal (lower) side is. I switched most of the maternal grandparents around.

Here is the MacNeill spreadsheet showing just the cousin matching part:

cousinmatch10macneill

I have some good matches here. Blue is Hartley, green is Frazer, yellow is Lentz. Red is Rathfelder. This makes it clear that my chromosome is mapped wrong. I need more Hartley and Lentz. The above chart includes my brother who I had tested not too long ago.

Here is another try with my brother’s DNA results included:

10visphase3

My sister Sharon (S) has a better look now on her maternal side. I got rid of the small purple segment.

Looking At the Raw DNA Phasing – Paternal Side

I have two spreadsheet summarizing the results of the many hours of work it took to phase my family’s DNA  from the raw data. One spreadsheet is for the paternal side phased DNA and the other is maternal. I have patterns for both sides. They are based on the order of my siblings: me (Joel), Sharon, Heidi and Jonathan. So an ABBB pattern would mean that Sharon, Heidi, and Jonathan all get their DNA from one grandparent, and I get mine from the other. Here is the paternal spreadsheet:

dadpatternchr10

These patterns go logically one to the other. The first pattern goes from AABA to AAAA at position 2,605,158. The B changed to an A in Heidi position, so the crossover goes to her at that position. I have a column called GaptoNext. This is based on the number of tested SNPs between patterns. When this number is large, I suspect an AAAA pattern. That was the case above highlighted in yellow. Except there is a problem. To go from ABAB to AAAA means 2 changes, and there should only be one change (or crossover) at a time. This caused me to look at the bases.

A Paternal pattern missed

Here is what I found.

chr10patternmissed

I had missed an AABA pattern at Build 36 Position 30,683,878. I took another look by setting my MS Access query so that Sharon and Heidi would have a different base from Dad:

chr10rawpatterns

This shows that the there is a change from ABAB to AABA even sooner than I thought between ID 400008 and 400045. This is an ID I created that sequentially numbers the tested SNPs. You can see another way I missed this pattern, because I didn’t fill in the missing bases. TTC? should be TTCT. CCT? should be CCTC.

What does the missing pattern represent?

The pattern of ABAB TO AABA is actually my crossover (Joel). It is a bit more difficult to see than the others. That is because the ABAB pattern is the same as BABA. The change of BABA to AABA is my change of the first B to the first A. Naturally, I put myself in the first position. In rough terms, that gives me a paternal crossover at about position 30.5M. This is a good location as it does not interfere with a large match that I have with an unknown paternal DNA relative named Shamus:

shamus

Here is my corrected Dad Pattern for Chromosome 10:

dadpatternchr10corrected

I have gone from 6 to 8 crossovers as the previous correction lead to another one. I also took out one of Heidi’s crossovers that I had wrongly identified. So fixing one problem fixed a lot of others. It helps to describe the start and stop of each pattern and to describe each crossover. The important results are the person and the last Position column. These show who the crossover belongs to and where that crossover occurs on the chromosome. I then entered the paternal crossover results into the MacNeill Spreadsheet and got this:

patchr10chart

I took out the large space between the siblings. The problem is that the space is now the same as between the maternal and paternal phased part for each sibling. Excel has no happy medium that I’ve found.

The blue is Hartley and green is Frazer. The raw phasing in the upper part of the chart matches with the cousin matches below. It is interesting that some of the cousin matches define the crossovers. For example, the Jim to Sharon match gives Sharon’s crossover. Also the Paul to Sharon match gives Sharon’s other crossover. The Paul to Jonathan match gives Jon’s first crossover.

The Maternal Side

Hopefully resolving the maternal phasing will be easier than the paternal side. My visual phasing only showed four crossovers. Here is my unfinished spreadsheet showing 5 crossovers (under the Person column):

maternalchr10

Here, it looks like I already added an AAAA pattern to the end. That was because the AABA pattern ended at about 114M and the Chromosome itself ends at about 135M. My GapstoNext column showed that gap as almost 20,000 SNPs. My question now is: should I add an AAAA pattern to the beginning also? Perhaps. An AAAA pattern means that 4 siblings match and all got their DNA in that area from their maternal (in this case) grandmother. Those results were consistent with how I had the visual phasing done. In fact, the visual phasing indicated that the 4 siblings should all get their maternal DNA from the Lentz side up until about 60M. Let’s take a closer look. This gets at my first note above in the spreadsheet image. There were only 3 single SNPs showing the AAAB pattern and they were spaced a long way apart – over 10 Megabases each. In this case, I will disregard those 3 widely spaced patterns as some type of mistake and stay with the AAAA pattern. Once I made the change from the AAAA pattern to the AAAB pattern, that brings us up to about 60M for my (Joel’s) first crossover. That seems to fit well. That leaves us with 4 crossovers – one per sibling as opposed to the two per sibling on the paternal side.

First I’ll compact the Gedmatch browser results, then show the raw DNA Phasing results on the MacNeill Chart:

gedmatchcheckofrawphase

chr10phasemap

When I compare the results, I see a problem I had with the visual phasing. The next to the last crossover looked to belong to Sharon, but instead it belonged to Heidi. Also Jon’s second paternal crossover should have been marked as an “F” above. That was just a typo. The third J for Joel crossover that I had above was not a crossover. In the middle, the 2 close crossovers of J and S should be instead S and J if I’m reading the MacNeill Chart correctly. It looks like all the FIRs and HIRs, etc. match. Once I did the raw DNA phasing, it is obvious how the gedmatch browser results had to match the raw DNA phasing results. Before, I did the raw DNA phasing it was not so obvious.

I’m happy with the results. I get to pick whatever colors I want for the four grandparents. It still would be nice to have some sort of labels or color key. After a hard day of phasing DNA, it is rewarding to see the results displayed so nicely. Thank you Mr. MacNeill.

A few observations:

  • The 4 siblings did not inherit any Rathfelder DNA (brown) on the left side of Chromosome 10
  • Lentz DNA (yellow) is missing from the right side of the Chromosome for the same 4 siblings
  • As I have my mother’s DNA results, that would make up for the missing DNA from those 2 maternal grandparents
  • Short segments of Hartley DNA (blue) are missing near the beginning and near the end of the Chromosome (i.e. none of the four siblings inherited Hartley grandfather DNA in those areas).

Summary

  • M MacNeill has the best display that I am aware of for mapping phased DNA.
  • The final mapping is like the final exam where previous mistakes are brought out, but there is a chance to correct them.
  • The phasing process is difficult, but there are built in checks and balances to find and correct mistakes or missed patterns.
  • The raw DNA phasing procedure (I use the Athey method) would generally be used if a parent has been tested and the visual one is used if a parent has not been tested. However, the visual phasing as developed by Kathy Johnston is important to use as a framework for the raw DNA phasing as well as a check for the end result.
  • The raw DNA phasing results appear to be better than what I was able to get using the visual phasing. Not because the visual phasing method is bad; more because I have not mastered it.
  • If you are using someone else’s spreadsheet, it is a good idea to know how they work in case anything goes wrong.
  • After writing many blogs on visual and raw data DNA phasing, it is nice to see everything come together using the MacNeill Spreadsheets and Charts.