Playing With Raw DNA Data From 5 Siblings and a Mother: Part 1

In many past Blogs, I have written about the raw DNA data of my siblings and my mother. They can be searched in my Blogs under “Raw DNA Data”. First, I looked at three siblings compared with our mom. Next I looked at the results of four siblings. Now I have a 5th sibling tested.

Phasing With Raw DNA Data

The reference I use is Whit Athey’s 2010 paper called, Phasing the Chromosomes of a Family Group When One Parent is Missing.

Basically, Whit uses certain rules and iterative processes to fill in blanks of what the parent’s alleles would be as passed down to their children. From these lists of alleles one can see patterns. From the patterns, one can see where the maternal and paternal crossovers would be. The process is similar to the visual phasing process developed by Kathy Johnston. However, Kathy’s version does not require the use of a parent. Also Kathy’s version does not require looking at hundreds of thousands of alleles.

Lori’s raw data

The fifth sibling in my family to have an autosomal DNA test is Lori. She tested at Ancestry DNA. First I unzipped her results. They open in notepad. I then opened those results in Excel, so all the data would align in columns.

The columns are a SNP ID, the Chromosome, the position on the Chromosome in Build 37 and allele 1 and 2. I added Lori to the allele columns, so I could distinguish between siblings when comparing. One quirk is that when I convert from text to Excel, the blanks in the allele columns go to zeros. I then have to search for all zero’s in those columns and replace them with blanks. The blanks are no-calls.

This data shows Lori’s alleles unsorted. We do know that where she has a C and a C, that one is from her dad and one from her mom. However, where she has an A and G on Line 380436, we don’t know which is from her dad and which is from her mom.

Lori’s DNA in MS Access

I didn’t realize I could upgrade my old computer to Windows 10. It was just new enough to do that. When I did that, I rented Office 365 which includes Access. Access is good for comparing large amounts of data. Lori has 666,531 lines of data. There are 2 alleles for each position. So with six sets of data, that is a lot of alleles. I figure about 8 million. However, the crossovers occur at a distinct point. Finding crossovers is like finding a needle in a haystack.

First I import Lori’s Excel File into Access. It looks pretty much the same there. Except that Access adds an extra ID to keep track of things. Next I want to make an Access Query based on Athey’s Principal 1:

Principle 1—If a person is “homozygous” at a location —that is, having the same base on each of the two chromosomes of a pair, then obviously at that location it is possible to know with certainty that both chromosomes of the pair have that base at that location, but this is an almost trivial form of phasing.

Principle 1 in Access Query form

Here is Lori’s first query in design view:

It’s a bit small. All I did is put all of Lori’s imported raw data into a query. Then in the last columns I created a field called Lorifromdad. Then there is a formula that says if Lori’s allele1 is the same as her allele2, then put in allele 2. When I run that query, I get this:

Next, I want Lori from mom, which will look the same as Lori from dad. This is easy. I can just copy the same formula and give it a different name:

Also, I forgot that Ancestry has other DNA information in the raw data that I don’t need so I need to restrict the data to Chromosomes 1-23:

It’s nice to check the results to make sure you are getting what you want. This looks pretty simple, but Access does this operation over 600,000 times, so it saves a lot of time.

Next I add Jon:

I have the same kind of formula to Jon’s homozygous results from his mom and dad. I made an equal join in the query above. Note that Jon and Lori both tested with AncestryDNA V2. That means that they have the same SNPs tested. My 2 sisters, my mom and I all tested with V1. So we have to be careful with these joins. If I was to have used an equal join between a V1 and a V2 test, I would only get the results which were common to both.

When I view the query above, it looks like this:

Note that on the third line, Lori has homozyzous results and Jon does not.

Adding AncestryDNA V1 and V2 raw results

The next step is that I would like to carefully add the V1 homozgous results to the V2 homozygous results. Also I would like to make a large table out of what I get.

On my existing V1 Homozygous Table, I have 700,153 rows
On my new V2 Homozygous Table, I have 666,153 rows
The V1 SNPs that are the same as the V2 SNPs are 424037 rows or results

That means that I would like to have a table that has the V1 results plus the V2 results, minus the results in common, so that should be 942269 rows. Somehow I ended up with such a table. I know that’s not very scientifically reproducible, but that is what happened. I’m not sure how important it is to have the V2 results as they won’t phase with my mom. However, I’ll have them in case I need them.

The results of the query left the two V2 siblings’ results on the bottom of the table, but they can easily be sorted:

Principle 2

According to the Athey paper:

Principle 2 — If data from one of the parents are available, and that parent is homozygous at a SNP location, then another almost trivial phasing is possible

since obviously that parent had to send the only type of base s/he had at that location to the child.

This principle lends itself to Access. Basically, I want to tell Access that if Mom has the same two alleles, then show that each child got that allele from her. However, there are a few considerations. If mom has no-reads and the child doesn’t, then we don’t want to overwrite a good read with a bad one. The other consideration is, if mom had an incorrect read and the children had a correct one, we wouldn’t want to overwrite that either. However, I don’t know how rare that is. I guess it is pretty rare. I did a query to check and didn’t find any such instance. So that is one less thing to worry about.

Principle 2 in access

I want to say if mom has two non-null alleles that are the same, put that allele in as from her for all her children. Looking at my old queries, it looks like I need an Update Query. First, I copy my previous table of results to a new table called tbl5sibsMomHomozygous. I’ll try this query:

Before I update, I’ll take a view:

If I take out the ‘And is not null’ statement, I get the same results. I then changed the syntax to ‘Is not Null’ first and got one less record: 481977. It makes me wonder what that record is? I’ll use my second wording as it may be more accurate. Next I hit the !Run button and it updates the table I recently made.

This will give V1 mom alleles to Lori and Jon even when they weren’t tested for them. Here is an updated table view of just the alleles from Dad and the alleles from Mom:

I picked these results at random about halfway down the table. It looks like about half the alleles are filled in already. So now my siblings are more than half phased. The first 5 rows are alleles from Dad for each of the 5 siblings. The 2nd five rows are alleles from Mom for those same siblings.

a pattern preview

The highlighted row shows a pattern from Dad and one from Mom. The first row also shows the same pattern. This is what we will be looking at later in more detail to determine crossovers on the maternal and paternal side. This is what I’ll call the ABABA pattern for both. Here it is coincidental that both the Dad and Mom patterns are the same. Obviously with 5 siblings, there will be a lot of different types of patterns:

AAAAA
AAAAB
AAABA
AAABB
AABAA
AABAB
AABBB
ABAAA
ABAAB
ABABB
ABBBB

Those are the combinations that I can think of right now. AAAAA is a special case. This could mean that all five siblings could share the same grandparent or sometimes an AAAAA pattern is that way by coincidence. Where the maternal or paternal pattern changes is where the crossover is. This pattern should be gradual. That is, only one letter should change in a pattern change. For example, ABABA may change to ABABB or AAABB. There are many possibilities but only one letter will change. The placement of the letter represents one sibling. So that sibling will own that crossover. For example, a maternal ABABA to ABABB change would represent a maternal crossover for Lori as she is in the last position on my table. The place where the A goes to a B is the location of the crossover.

Next Up

Next up is Athey’s Principal 3 as it applies to 5 siblings and a Mom.