In my last blog, I found a few errors when I was checking some odd results. This led me to think that it would be better to start the phasing process from the beginning. The beginning means using the raw data from 4 siblings and from my mom. This time I will be more methodical and keep track of the results. I have a new spreadsheet called The Base Tracker. At every step, it will keep track of how many bases from each sibling have been assigned to a parent.
A New Table
First I’ll create a new table from the raw data. I’ll start with my mom, me and my 2 sisters, as we were all tested using Ancestry Version 1.
I called the table tbl3Sibs.
Next, I combined tbl3Sibs with Jon’s Ancestry V2 results into a new table called tbl4SibsNew. I made sure the join arrow was set up correctly: I wanted everything in the 3Sibs table plus whatever matched in Jon’s information. If I had left it as an equal join, I would have lost the bases that are in Version 1 but not in Version 2 of the AncestryDNA results.
It is important here to connect by rsid. I made the mistake of connecting by IDs last time. As the different AncestryDNA test versions had different IDs, this produced crazy results. I also used only Chromosomes 1-22, as there are too many special cases for the X Chromosome.
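For readers who don't use Access, the join described above can be sketched in Python. This is only an illustration under assumptions: the rsids, genotypes, and field names are made up, only two of the three V1 siblings are shown, and `None` stands in for a blank Access cell.

```python
# Hypothetical rows keyed by rsid. Real raw data files have many more rows.
three_sibs = {
    "rs0001": {"chromosome": 1, "Mom": "AA", "Joel": "AG", "Sharon": "AA"},
    "rs0002": {"chromosome": 1, "Mom": "CT", "Joel": "CC", "Sharon": "CT"},
    "rs0003": {"chromosome": 23, "Mom": "GG", "Joel": "GG", "Sharon": "GG"},
}
jon_v2 = {
    "rs0001": "AG",
    "rs9999": "TT",  # present only in Jon's V2 file
}

def outer_join_by_rsid(sibs, jon):
    """Keep every chromosome 1-22 row in sibs; add Jon's genotype where the rsid matches."""
    joined = {}
    for rsid, row in sibs.items():
        if not 1 <= row["chromosome"] <= 22:
            continue  # leave out anything that is not chromosome 1-22
        new_row = dict(row)
        new_row["Jon"] = jon.get(rsid)  # None if Jon has no result at this rsid
        joined[rsid] = new_row
    return joined

four_sibs = outer_join_by_rsid(three_sibs, jon_v2)
```

In this sketch, `rs0002` survives with a blank for Jon. An equal join would have dropped it, which is the loss of V1-only bases the text warns about.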
Then I used a count function to count the number of bases each sibling had. I also figured out how many blank lines there were out of the 682,549 and subtracted those 8,229 sibling blanks from the total to get 674,320. I’ll use that number to figure out the percent phased. This is the Count Query showing the Totals button in the Access Ribbon:
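The arithmetic behind the percent phased figure can be written out as a small Python sketch. The two totals come from the text above; the helper names and the example count are hypothetical, and using 674,320 as the denominator for every sibling is my reading of the post.

```python
TOTAL_ROWS = 682549   # rows in the combined table
BLANK_ROWS = 8229     # rows with sibling blanks
USABLE_ROWS = TOTAL_ROWS - BLANK_ROWS  # 674320

def count_filled(rows, column):
    """Rough equivalent of an Access Count on one column: blanks are not counted."""
    return sum(1 for row in rows if row.get(column))

def percent_phased(phased_count, denominator=USABLE_ROWS):
    """Percent of usable rows that have a phased base."""
    return 100.0 * phased_count / denominator

# Hypothetical example: a sibling with half of the usable rows phased.
example = percent_phased(337160)
```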
The results of this query were put in the RawBases Row below.
My New Base Tracker: % Phased
The first column has the step taken. P1 is Principle 1. JoelFD is the Joel from Dad column, so all the Dad bases are on the left and mom bases are on the right. This table will give me the % phased for each sibling.
Principle 1 Query – Homozygous Siblings
This Principle is one of the Principles from a Whit Athey paper. It applies where a child has 2 bases the same: one of them came from each parent. The last time I did this, I may have had too much in a query at a time. This time, I’ll do the query separately for each sibling.
First, I opened up my tbl4SibsNew in design view and added more fields to hold the new dad and mom bases.
Actually, before adding those fields, I copied the table, so I’d have the raw data table with no additions. I called my new table tbl4SibsNewPrinciples. That is where the phased bases will go.
Here is a simple Principle 1 Update Query for me:
It says where I am homozygous, put both those bases in my JoelFromDad and JoelFromMom columns in the new tbl4SibsR1Principles.
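The logic of that update query can be sketched in Python. This is a sketch under assumptions, not the actual Access query: the field names (`JoelAllele1`, `JoelAllele2`, `JoelFromDad`, `JoelFromMom`) are guesses at the column layout, and no-calls are assumed to be stored as blanks or "0".

```python
NO_CALL = (None, "", "0")  # assumed encodings for a missing base

def apply_principle_1(row, sibling):
    """If the sibling is homozygous, assign one copy of the base to each parent.

    Returns True if the row was updated.
    """
    a1 = row.get(f"{sibling}Allele1")
    a2 = row.get(f"{sibling}Allele2")
    if a1 in NO_CALL or a2 in NO_CALL:
        return False
    if a1 != a2:
        return False  # heterozygous: Principle 1 does not apply
    row[f"{sibling}FromDad"] = a1
    row[f"{sibling}FromMom"] = a2
    return True

# Hypothetical rows for one sibling.
rows = [
    {"rsid": "rs0001", "JoelAllele1": "A", "JoelAllele2": "A"},
    {"rsid": "rs0002", "JoelAllele1": "C", "JoelAllele2": "T"},
    {"rsid": "rs0003", "JoelAllele1": "0", "JoelAllele2": "0"},
]
updated = sum(apply_principle_1(row, "Joel") for row in rows)
```

Each updated row phases two bases, one to each side, so the count of phased bases is twice the count of updated rows.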
That little query phased over 900,000 of my bases into Paternal and Maternal sides.
I was interested in seeing the effect of Jon’s testing using AncestryDNA V2:
Jon has a ways to go to catch up on being phased. This is due to the differences in AncestryDNA V1 and V2.
Principle 2 – Homozygous Mom
Here, if my mom has the same base twice, one of those has to go to her child. Here is a query to update my mom bases. As my dad’s DNA was not tested, he gets a non-applicable in that column.
Note that I have a criterion of ‘Is Null’. This means only update this base if there is a blank there already. Here is the Principle 2: Homozygous Mom summary:
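As a rough Python sketch of this principle, with the same caveats as before: the field names (`MomAllele1`, `MomAllele2`, `JoelFromMom`) are assumed, and checking for `None` stands in for a check that the cell is still blank.

```python
def apply_principle_2(row, sibling):
    """If Mom is homozygous, her base goes in the sibling's FromMom column.

    Only fills the column when it is still blank. Returns True if updated.
    """
    m1 = row.get("MomAllele1")
    m2 = row.get("MomAllele2")
    if not m1 or not m2 or m1 != m2:
        return False  # Mom is missing or heterozygous here
    if row.get(f"{sibling}FromMom") is not None:
        return False  # already phased by an earlier step, so leave it alone
    row[f"{sibling}FromMom"] = m1
    return True

# Hypothetical rows.
rows = [
    {"MomAllele1": "G", "MomAllele2": "G", "JoelFromMom": None},
    {"MomAllele1": "G", "MomAllele2": "G", "JoelFromMom": "G"},  # set by Principle 1
    {"MomAllele1": "C", "MomAllele2": "T", "JoelFromMom": None},
]
updated = sum(apply_principle_2(row, "Joel") for row in rows)
```

Because the rule depends only on Mom's result, it can fill a sibling's FromMom column even at a position where that sibling has no test result of their own.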
Here I don’t know why my Principle 2 Bases were so low. I think it is because I made a mistake above, so I’ll do these steps over from the beginning.
Here I get more consistent results for my mom bases:
Here is the revised Principle 2 Summary:
Jon’s results also changed to be more realistic to where he was after Principle 1. I can also use the Access Count function to check these numbers:
All the numbers match up except for JonsFromMom. For some reason, the spreadsheet is showing a higher number of Total Bases from Mom for Jon: 540956. If I subtract his Principle 1 bases from Mom from that total, I get 272250. I’ll put that in as his Principle 2 bases from mom and assume that I made a mistake in writing down Jon’s Principle 2 bases from Mom number.
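The reconciliation is simple subtraction, sketched here in Python. The total (540956) and the Principle 2 figure (272250) are quoted above; the Principle 1 figure of 268706 is not quoted in the post and is only what those two numbers imply.

```python
def principle_2_bases(total_from_mom, principle_1_from_mom):
    """Bases added in Principle 2 = running total minus what Principle 1 contributed."""
    return total_from_mom - principle_1_from_mom

jon_total_from_mom = 540956
jon_p1_from_mom = 268706  # implied by the two quoted numbers, not quoted directly
jon_p2_from_mom = principle_2_bases(jon_total_from_mom, jon_p1_from_mom)
```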
I suppose it’s like reconciling my bank statement. I assume that these are Jon’s mom bases filling in where Jon didn’t have test results that lined up with the AncestryDNA V1 results for his mom and siblings.
Moving On To Principle 3: Heterozygous Siblings
This works when the child is heterozygous and has one base phased to one parent. Then the other base is phased to the other parent. It appears that this would have to work just from the mom side for now to fill in the dad side. That is because we haven’t filled in the ‘fromDad’ side with any Heterozygous sibling results yet.
This query says in the situation where I am heterozygous and I get my allele2 from mom, assign my allele1 to be from my dad. But only do that where there isn’t already a JoelFromDad base there.
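Here is a minimal Python sketch of that rule. The field names are assumed, and the sketch is slightly more general than the query described: it handles the mom base matching either allele1 or allele2, where the query above covers only the allele2 case.

```python
def apply_principle_3_from_mom(row, sibling):
    """Heterozygous sibling with a known mom base: the other base goes to dad.

    Only fills FromDad when it is still blank. Returns True if updated.
    """
    a1 = row.get(f"{sibling}Allele1")
    a2 = row.get(f"{sibling}Allele2")
    from_mom = row.get(f"{sibling}FromMom")
    if not a1 or not a2 or a1 == a2:
        return False  # missing or homozygous: Principle 3 does not apply
    if row.get(f"{sibling}FromDad") is not None:
        return False  # the 'Is Null' guard: never overwrite an existing base
    if from_mom == a2:
        row[f"{sibling}FromDad"] = a1
    elif from_mom == a1:
        row[f"{sibling}FromDad"] = a2
    else:
        return False  # mom base unknown or inconsistent with the genotype
    return True

# Hypothetical rows.
rows = [
    {"JoelAllele1": "A", "JoelAllele2": "G", "JoelFromMom": "G", "JoelFromDad": None},
    {"JoelAllele1": "C", "JoelAllele2": "T", "JoelFromMom": None, "JoelFromDad": None},
    {"JoelAllele1": "C", "JoelAllele2": "T", "JoelFromMom": "C", "JoelFromDad": None},
]
updated = sum(apply_principle_3_from_mom(row, "Joel") for row in rows)
```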
However, this raises a question. Here is the same query without the ‘Is Null’ criteria:
As you can tell, I am beginning to doubt my work. The question is, if there has been no previous addition of Joel bases from dad based on my heterozygous results, why is there a difference between the two queries?
I checked Sharon’s results and found that she didn’t have the same situation. Where she was heterozygous, she didn’t have any bases from dad assigned to her.
Here is a query showing my problem:
It is not a problem for phasing, but more for what I will enter into my Base Tracker. Fortunately, I can do a Count Query:
This shows that my JoelFromDad bases have gone up by 25589 somehow since I last tracked them. This means that I should use the larger number for my Base Tracker.
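One way to look for the source of such a difference is to count the heterozygous rows that already have a dad base before Principle 3 runs. The Python sketch below shows the idea; the field names and sample rows are hypothetical, and it does not explain why such rows exist in the real table.

```python
def count_prefilled_het_rows(rows, sibling):
    """Count rows where the sibling is heterozygous and FromDad is already filled."""
    count = 0
    for row in rows:
        a1 = row.get(f"{sibling}Allele1")
        a2 = row.get(f"{sibling}Allele2")
        heterozygous = bool(a1) and bool(a2) and a1 != a2
        if heterozygous and row.get(f"{sibling}FromDad") is not None:
            count += 1
    return count

# Hypothetical rows: only the second one is heterozygous with a dad base present.
rows = [
    {"JoelAllele1": "A", "JoelAllele2": "A", "JoelFromDad": "A"},
    {"JoelAllele1": "A", "JoelAllele2": "G", "JoelFromDad": "A"},
    {"JoelAllele1": "C", "JoelAllele2": "T", "JoelFromDad": None},
]
prefilled = count_prefilled_het_rows(rows, "Joel")
```

If the two versions of the query differ only in the ‘Is Null’ criterion, rows like the second one are the ones the guarded query skips.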
Here is the Principle 3 Summary in my Base Tracker:
In a few hours, I’ve phased over 4 million bases. And that time includes making mistakes and fixing them. All siblings are phased at over 80% at this point except for Jon. His Paternal phasing is lagging at only one half.
I suppose that this is the time for me to say that it takes 20% of your time to get 80% of the result and 80% of your time to get the last 20% of your result.
Summary Part 7
- After making mistakes, it feels good to start with a clean slate
- Principles 1-3 of the Athey paper are easy to implement using MS Access
- If a mistake is found, it is usually good to start from a clean table of data and fix it from there
- The Patterns don’t lend themselves as well to Access and take more time to get
- Having a Table to track the work and results is helpful and interesting.
- In the next Blog (Part 8), I will be back looking at filling in the Patterns areas