Beth’s Hartley DNA

In this Blog, I will be looking at Beth’s autosomal DNA. That is the DNA that she got from both her parents. However, I am more interested in Beth’s father’s mother’s DNA as she was a Hartley and the DNA that we share would be Hartley DNA.

Hartley Tree of DNA Testers

Here are those closer relatives that have had their DNA tested and uploaded to Gedmatch.com:

Here Hartley is shown as green and Snells are shown as yellow. The DNA testers are in gold. Any DNA that the four DNA testers have in common will belong to James Hartley and Annie Snell. However, it will be difficult to tell which. Any DNA that Patricia and Beth share could also belong to Charles Nute which Jim and my family will not share. Here is an example of that on Chromosome 1.

Here is a photo believed to be Mary Hartley with her sister Nellie:

Hartley and Nute DNA On Chromosome 1

This is a Chromosome browser from Gedmatch.com showing where Beth shares DNA with Heidi (1), Joel (2), Sharon (3), Jim (4) and her first cousin Patricia (5). Is the DNA that Beth and Patricia share Hartley DNA or Nute DNA? To find that out we can look at Patricia’s DNA browser. If she shares DNA in this same area with Heidi and Jim, then it will be Hartley DNA.

The above Browser shows Patricia matching Beth (1), Jim (2) and Joel (3). This means that the DNA that first cousins Beth and Patricia share in Chromosome 1 is Nute DNA. If I were to map Patricia’s maternal Chromosome 1, it would probably look like this:

This shows that Patricia got her green DNA (matching Jim and me) from her Hartley maternal grandmother and her pink DNA (matching Beth) from her Nute maternal grandfather.

First Cousins Vs. Second Cousins

First cousins share two grandparent as their most recent common ancestor. Second cousins share two great grandparents and get their shared DNA from one of them. The first cousin DNA matches will be larger in general. The second cousin matches will tend to be smaller.

First cousins

As shown above, first cousins will share the DNA from two of their grandparents. In the case of Patricia and Beth, those two grandparents will be maternal grandparents. The catch is, that when two first cousins match each other, they won’t know which grandparent they match on. They just know that it will be one or the other. In the example above, we did know which grandparent matched because of other second cousin matches.

second cousins – Two common Great grandparents

Second cousins have as their most recent common ancestors two of their great grandparents. But again they won’t know which great grandparent they are matching on.

The best way to identify which great grandparent the gold people match on would be to have a third cousin that is only related on the Hartley side OR the Snell side. I don’t know of anyone in this category right now, so I’m a bit stuck. I would like to figure out which DNA is which. The main reason is that I’m stuck on the Hartley genealogy. I know that Greenwood’s father was Robert, but before that, I’m not sure. If we could find another Hartley relative going back then it might break down the Hartley brick wall.

Any Other Way To Separate Hartley DNA From Snell DNA?

There is one main difference from James Hartley and Annie Snell above as it relates to their DNA. James was born in Bacup, Lancashire, England and Annie was born in Rochester, Massachusetts. All of James ancestors would also have been born in Lancashire. On the other hand, all of Annie’s ancestors that would produce matches go back to Colonial Southeastern New England. That means that if we find a match that is from England and has no ancestors in the United States, there would be a good chance that that DNA match was through the James Hartley side.

Beth’s X Chromosome

First, let’s look at my family. There is  no Hartley X Chromosome sharing with this group because the X-DNA does not travel from father to son.

Second, look at Beth compared to Jim:

Beth got one of her X Chromosomes from her dad. This was the same X that he got from his mother Mary. Jim got an X Chromosome from his mother. She got it from James Hartley b. 1862 and Annie Snell. So Beth and Jim have James Hartley and Annie Snell in common.

These pieces of blue where Beth and Jim match represent DNA that they share from James Hartley and/or Annie Snell.

How do Patricia and Beth compare by X-DNA?

Next we will look at Patricia and Beth. They will share X-DNA with their grandmother Mary Hartley. Beth’s dad got no X-DNA from his Nute dad, so Beth and Patricia will only match on Mary Hartley.

Note here that Beth and Patricia share some X-DNA from their grandmother that isn’t shared between Jim and Beth on the left side. They also share a longer segment at the right hand side than Beth and Jim shared. However, Jim and Beth shared a segment from 123 to 138M that wasn’t shared between Patricia and Beth.

Let’s See How Patricia Compares With Jim

The only comparison left is between Patricia and Jim.

I compared the three comparisons and came up with a bit of an X Chromosome map. In the first match between Beth and Patricia, I have that match in red. On the very right there are three matches, so I have that as great grandparent 1. We don’t know which great grandparent it is – just that it is the same one. On Jim’s map, it is his grandparent 1. Going from right to left on Jim’s map, he changes from getting his X-DNA from grandparent 1 to grandparent 2. However, Patricia and Beth continue to match on great grandparent 1. In the middle there are no matches, so we can’t tell what is going on. Also the two reds and one blue on the left may actually be two blues and a red as we don’t know how they match with the segments on the right.

Beth’s Hartley (and Snell) Chromosome Map

If we look at all the matches Beth has with Jim, my siblings and me, we will have a map of her known Hartley (and Snell) DNA:

I didn’t use the DNA shared between Patricia and Beth as they are first cousins. As such, they will share Nute and Hartley DNA and it will not be as easy to tell which is which. So second cousins are good for these maps. The red is in the bottom part of each chromosome. That represents the paternal chromosome. We have not mapped any of Beth’s maternal chromosome. If Beth were to look for Hartley or Snell matches, it looks like her best bet would be on Chromosome 12.

For comparison, here is my Chromosome Map.

On my map, the blue corresponds to Beth’s red Hartley DNA. We seem to share a stretch of Hartley DNA on Chromosome 1. But where Beth has a long stretch of Hartley DNA on Chromosome 12, I have none.

 

Mapping My Chromosome 20 Using My Raw DNA Results

In a past blog, I mentioned My Big Fat Chromosome 20. That blog is also referenced on the ISOGG Chromosome Mapping Page. This particular Chromosome had puzzled me for a while due to the preponderance of matches I was getting there. I used visual phasing and determined that the overload of matches was on my paternal grandmother’s Frazer side rather than the Hartley side. I had previously supposed that the Hartley side held the key to all my matches as that side had colonial Massachusetts roots. Since that time, I had my brother’s DNA tested. He is shown as F in the bottom row below. I thought that his results might add some clarity to Chromosome 20.

chrom204sibs

Rather than clarifying things, I just got a shorter version of what I already had for Jon (F) than I had for myself (J) and my two sisters. The problem is the phenomenon of close crossovers at the beginning and end of each chromosome.  Jon also has quite a few matches in Chromosome 20 (unlike my sister Sharon who had Hartley DNA in most of her paternal Chromosome 20). He has almost 30% of his phased matches there according to his match spreadsheet based on Gedmatch.

Going to the Source – Raw Data Phasing

I have been learning how to phase my raw data based on a Whit Athey article, MS Access and the work that M Macneill has done. The Whit Athey Paper describes how to manipulate the raw DNA data of one parent and four siblings to get Dad Patterns and Mom Patterns. I have found these patterns to be useful.

Dad Patterns

Even though my dad never had his DNA tested, based on the certain principles, I have come up with a spreadsheet that shows for various sections of the chromosomes matching patterns that I have with my other three siblings. I use A’s and B’s to give a generalized pattern. The patterns will be in the order of Joel, Sharon, Heidi and Jon. Here is my Dad Pattern spreadsheet showing Chromosome 20:

dadpatternchr20

I find my gap to next column handy. The first thing that I notice is that there are not many large gaps. If there were very large gaps, that might indicate an AAAA pattern where all the siblings match (in this case a paternal grandparent). One thing that I added today is a Start and Stop. This is the first and last tested position of the Chromosome. This is good to know in case a pattern is hiding at the beginning or end of the chromosome. Let’s just look at the second line of the spreadsheet. This shows that there is a pattern of ABAB from position 0 to 10M. This means that the first and third people (Joel and Heidi) match the same paternal grandparent and the 2nd and 4th siblings (Sharon and Jon) match the other paternal grandparent.

In the third row of the spreadsheet, a new paternal pattern starts (at 10M). This is ABAA. Now sibling 1, 3, and 4 (Joel, Heidi, and Jon) match each other. The difference between ABAB and ABAA is in the last position where I have Jon. He switched from a B to an A and now no longer matches Sharon, but he does match his other three siblings on the paternal side. As Jon is the one that changed, he gets the paternal crossover at this position.

A few other notes
  • These patterns are gradual. That means that there can be only one change at a time.
  • If it looks like there are two or more changes, then either something was done wrong or you have to invert the A’s and B’s
  • For example, above in row 4, I have an AABA pattern that goes to and ABAB. On face value, it looks like three changes. However, AABA is the same as BBAB. Actually it is the first B changing to an A. This is my position A, so I have a crossover around 54M on the paternal copy of my Chromosome 20.
  • These areas of patterns are also used to fill in bases received from Dad or Mom in the particular areas that the patterns occur in each chromosome.
  • If there are only three siblings tested, these patterns are not as informative.
Mom Pattern spreadsheet

I would not want to leave mom out. Here is the pattern of her 4 children matching on the maternal side:

mompatternchr20

Like the Dad Pattern Spreadsheet, everything looks well behaved as there are no large gaps between patterns. Also there are no gaps at the beginning or end of Chromosome 20. So there you have it. That is the phased DNA for myself and my other three siblings. But it doesn’t jump out at you and I don’t have a map yet. That is where I bring in the MacNeill <prairielad_genealogy@hotmail.com> Spreadsheet.

MacNeill’s Excel Spreadsheet

I adjusted MacNeill’s Chromosome 1 spreadsheet by replacing default numbers for Chromosome 20. Then I added in the locations I had in the spreadsheet above. Those are the Start36 and Stop36 columns. The 36 refers to Build 36 locations which Gedmatch uses. After that I colored in the bars to be consistent with the visual phasing I had done previously.

chr20map1

Actually, I now see that I colored Sharon’s paternal  bar backwards. She should have mostly Hartley (blue). This transposition also carried through to the next image, but I corrected it in the final image. I like having labels, so I copied this into PowerPoint and added some:

chr20map2

Next I add any appropriate cousin matches for Chromosome 20. I also made the sibling names on the left a little bigger. My mistake above on Sharon’s paternal bar is corrected and verified by her large paternal Hartley cousin match with Jim below.

chr20withmatches

I had to bring this back into PowerPoint to re-add the surnames. The places where the cousin matches start or stop may be crossovers for me and my siblings. From comparing the top part of the chart to the bottom, it should be obvious which crossovers are for me and my siblings and which are for the cousins. The good news is that the raw DNA phasing confirms my initial visual phasing done in January, 2016. The raw DNA phasing just filled in what I was unable to. The other good news was that there were significant cousin matches on both the paternal and maternal side of Chromosome 20 to make sure that all the grandparents were identified correctly. Since I did the original visual phasing last January 2016, I have gotten the DNA results of 2 more cousins. Also one additional cousin who previously had her match to only me at 23andme uploaded her results to Gedmatch.

Notes/Summary

  • The hard work in Raw DNA phasing is assigning all the bases of the siblings to the correct parent. Then patterns are discerned and noted.
  • The fun part is mapping out the results.
  • Raw DNA phasing and mapping is more accurate and complete than visual phasing. However, it takes a lot of work and works best when there is at least one tested parent.
  • The comparison of the raw DNA mapping to the actual cousin matches points out the fuzzy boundaries noted by others. This may be seen in Sharon’s short Lentz segment. Her cousin Judy match (who has Lentz ancestry) appears to exceed the length of Sharon’s Lentz segment.
  • Out of the four siblings, Sharon is the one who didn’t get the huge dose of Frazer ancestor matches. That means that she would be the best for looking for smaller matches at Gedmatch.com. Her smallest match is 9.3 cM (5.9 Gen) and my smallest match at Gedmatch is 10.7 cM (5.2 Gen).
  • At a glance, one can see who is the best person for finding matches with each of the four side of the family. For example, I received a full dose of Lentz DNA on Chromosome 20. Here is my Lentz grandmother (b. 1900) in her younger days. Her DNA is represented in yellow in the charts above.

emma

DNA Phasing of 4 Siblings When One Parent Is Missing: Part 9

Mom Patterns

Up to this point, I have phased 4 siblings based on 3 principles outlined by Whit Athey. I have looked at the bases the 4 siblings had from their Dad. Those Dad bases made up patterns. Based on those patterns, other Dad bases were added to those siblings within those pattern areas. After those bases were added, mom bases were added where the siblings were heterozygous. The changes were documented in a Base Tracker.

Start stop using access min max – AAAB Mom Pattern

I can just look at my previous Blog to see what I did for my dad pattern. The results of this query:

mompatternquery

get copied to this spreadsheet where I added a column for Pattern:

mompatternspreadsheet

That was my big time saving step from my last query. Before I run each Min Max Total Query, I check a regular Select Query to make sure I have the right pattern. For example, here is my ABBA Mom Pattern check:

abbacheck

In a few minutes, I have 111 Start/Stop Mom Pattern pairs. This time, I’ll add conditional formatting to point out the one position patterns:

startstoponepattern

These single patterns tend to mess me up as I’m looking for patterns, so I’ll take them out of my spreadsheet, but not out of my Access data tables. There were 10 of these. I don’t know if that is a lot.

Getting better starts and stop for the mom patterns

The next step takes a little while. I look at the [now] 95 Start/Stop pairs for the various patterns. I highlighted the overlapping areas in yellow:

mompatternoverlaps

Actually, the first pattern overlaps into the second also. Some of these may be caused by single location patterns. For example at Chromosome 1, when I got to ID# 548 I find this:

momchr1

There is an ABAA Pattern, but it only lasts for one position and then is on to an ABBB pattern. I copied the end location for ABAA and put it at the end of Chromosome 1 to check later and made note of the one position pattern:

onepatternchr1mom

After that, it makes more sense that the ABBB pattern Stop at 2314 goes into an AABB Pattern Start at 2317. Here is the adjusted Chromosome 1 for my siblings’ Mom Patterns:

cleaneduppatternsmom

I moved the first Start to the Start of Chromosome 1 and last Stop to the end of Chromosome 1 as they were already pretty close to those positions. All combinations of patterns are represented here except for ABAB. I don’t have a start and stop for the single patterns as I’ll be taking them out later.

Filling In Mom Patterns

Now that I have all the mom patterns and their starts and stops as well as I can, I will fill in the patterns. I’ll start with AAAB. First I use the Concatenate formula in Excel to get my starts and stops in Access language. Then I sort the patterns in Excel:

mompatterssorted

I have 19 AAAB Mom Patterns. Next I go into Access and create an Update Query using the table called tbl4SibsNewMomPatternsFillin. In the AAAB Pattern, I will want to fill in the missing A’s.

updatequeryaaabmom

This looks like a good query, but I want to track how many bases I’m updating, so this query would make it difficult to track that as I’m adding bases to Sharon and Heidi. So again, I will go with the simpler query.

aaabsimplerupdate

Here is the first Mom Pattern Fill-in update on the Base Tracker:

basetrackeraaabmom

I continued the same process down the Mom Patterns, filling in what was missing from each of the siblngs:

momfillinupdatetracker

In each case for each pattern, I added less than 5,000 bases to each sibling. I also added to my spreadsheet a percentage of overall phasing which is now at 89.1%. This is how the 4 siblings are phased on average. Jon, who tested with the Ancestry V2 is bringing the other siblings’ overall average down.

Principle 3 – Dad Bases From Mom Bases

This is the icing on the cake for me. After all the work of determining Patterns and Starts and Stops, I have an easy step to add bases. Principle 3 says if you are heterozygous and you know one of your bases is assigned to one parent, then the other base must be assigned to the other parent.

I had to look at my previous blog to see how I did this. Let’s see if this looks right:

udatedadfrommom

The first column makes sure that I am heterozygous as my 2 alleles are not the same. The 2nd columns says that I know that I got allele2 from Mom. The 3rd column says to put my allele1 as the one I got from dad. That seems to make sense. This results in 9523 rows of updates in 22 Chromosomes. In part 2 of this Update Query, I switch the alleles:

dadfrommomforjoel

This says if my allele1 is from Mom assign allele2 to be from Dad.

Summary of Pattern Filling In and Dad Bases from Mom Bases

btdadfrommom

Here the overall phasing is 90%, but I had a pretty strict measure of phasing. It involved alleles that Jon was not even tested for. Here we are getting a diminishing return. I could continue the process, but I won’t.

Next Steps

Now I have a good idea where all the crossovers are. I need to assign those to siblings. Then I need to figure out how to portray the final results.

Assigning Crossovers to Siblings

I might as well jump right in. I’ll try a Chromosome that McNeill has mapped. Actually, he only did the 3 siblings at the time, so it may be a little different.

Chromosome 7 Crossovers

This has been mapped by MacNeill to 3 siblings. Let’s see how my mapping compares. Here is the mom pattern:

momstartstop7

Here I have by my own ID’s the start and stop. Then I have gap to the next pattern. This may indicate an AAAA pattern. Under description, I have what the pattern changes are. Then I have the person assigned to the Crossover. Then I have the approximate location of the Crossover. The first line I have the description as ABBA to ABBB. Here, Jon (in the last position) was matching with me as I’m in the first position of course. Then he changed to match with Sharon and Heidi. So I assigned the crossover to him.

Look at the 5th line. The pattern is ABAB to AAAB. This goes through a gap of over 6,000 ID’s. That usually means there is an AAAA pattern there.  AAAA could go to AAAB easily, but to go from ABAB to AAAA would take two crossovers. I don’t have a good idea where the crossover is, so I’ll go to gedmatch. The good news is that I have already tried using visual phasing on this Chromsome:

chr7vismapjon

The crossovers that I looked at above in my spreadsheet were on the maternal side. So that would be the top part of the bar (green-orange). It looks like I have 11 or 12 maternal crossovers, if I did it right. Looking at the top part of the image above, notice the non-match areas. These have no blue bar below and have red areas above. These are important. The reason is that if there is any of these areas at any place, there cannot be an AAAA pattern for maternal or paternal. That means that all 4 siblings cannot match the same grandparent in any of these areas. The only potential AAAA patterns, then are at either ends of the Chromosome or in the middle. The middle locations are about 60-70M. Also note that I have Rathfelder as the same match for each sibling from 56-70M.

There is a discrepancy between my spreadsheet crossovers (7) and the visual above (11 or 12). The other problem is that I need a double conversion from my spreadsheet. The spreadsheet is in ID’s which refers to Build 37 locations and Gedmatch is in Build 36.

Before I start converting numbers, I’ll look at what I have for the Dad Crossovers.

dadpatterncrossover7

Here I added a position number for the Chromosome (Build 37). This matches up with the visual phasing above. What is missing would be the crossover for Joel after an AAAA pattern at the beginning of the Chromosome.

Where is Heidi?

As I look at the maternal visual phasing, I see that Heidi has 3 crossovers. On my spreadsheet, she doesn’t have any. One can be explained as going onto the right end of the Chromosome to an AAAA pattern, but what about the other 2 crossovers, in the middle of the Chromosome? I got these positions from an old file where I compared myself to my 2 sisters. Then I put those in a spreadsheet and converted them to Build 37:

findingheidi

The Chromosome position numbers in blue were where I had Heidi’s crossovers. I then went to my Access Database.

Heidi found

heidi

Here is an ABBB Mom Pattern that I missed. Going through the list, I updated my crossover list:

updatedchr7xover

Now I am up to 12 Maternal Crossovers. The AAAA patterns tend to fit in naturally. Note next to the first blue ‘Joel’. There would be no way to go to an ABBB pattern to an AAAB pattern without 2 changes. That is why an AAAA pattern is required within the other 2 patterns.

Paternal Crossovers – Chromosome 7

crossoverschr7

Here I only show 2 crossovers, where on my map above, I show 3. I am just missing my own crossover from AAAA to ABBB. This is at the beginning of Chromosome 7. Here is my database table for my Dad Patterns:

chr7dadpattern

The position I have highlighted would still be an AAAA pattern as I have A??A. So that is the last position with that pattern. Id 285993 is the first spot I have the ABBB pattern, so I chose the crossover as ID# 285992 (under App. ID):

 

dadcrossover7

Here is what MacNeill had for 3 siblings at Chromosome 7:

macneill-chr7

What is now clear from have 4 DNA tested siblings is that my first crossover is paternal and not maternal. For my first crossover to be maternal, I would have had to have gone from an AAAA pattern to an ABBA pattern which would have been a double crossover. Having my brother Jon (the last ‘A’) tested made that clear.

Summary

In this Blog, I have looked at the Mom Patterns created by 4 siblings. Based on those patterns I have filled in alleles from other siblings. I have also filled in alleles for heterozygous siblings. This is based on the Mom allele being known and assigning the other allele as from the Dad. Then I looked at assigning crossovers to the various siblings. Based on the Patterns, it seemed clear who the crossovers should be assigned to. I then checked the crossovers I had with a visual phasing based on gedmatch. This showed where I was missing crossovers, which I was able to add using Chromosome 7 as an example.

Next: How to show the final results?

DNA Phasing of 4 Siblings When One Parent Is Missing: Part 8

Dad Patterns

In my last Blog, I looked at the Whit Athey 3 Principles and used MS Access to assign bases to the paternal or maternal side for the 4 tested siblings of my family. The next step is to look at Dad Patterns. I have been doing this by querying for a pattern and then scrolling down for start and stop positions. This has been quite tedious. It occurred to me that there may be another way to do this.

MS Access Min Max Functions

Access has a function that finds a minimum or maximum value in a group. In this case the group can be Chromosome.

AAAB Dad Pattern – Access to the rescue

 

aaabminmax

To get the total line I hit the summation [totals] icon in the Show/Hide Group. This adds a Group By to each field to in the Total row. Here I looked for the Minimum and Maximum ID for each chromosome for the AAAB Dad Pattern. That is where Joel’s base from dad was the same as Sharon’s. Sharon’s base from dad was the same as Heidi’s and Heidi’s base from Dad was different from Jon’s. Here is the output for the AAAB Dad Pattern:

aaabminmaxresults

This step has revolutionized my work as it saves me from scrolling through 100’s of thousands of dad base AAAB Patterns.  This takes about 2 minutes vs. the old way which seemed like an hour.

The upside of this method is that it is fast. The downside is that it only finds the minimum and maximum of a pattern within a chromosome. It doesn’t find all the breaks in the patterns within the chromosomes.

Using this method, in a couple of minutes I have 91 Start and Stop locations for all the possible patterns – except for AAAA.

Here are the sorted results for Chromosome 1:

dadpatternchr1

Note that there are some overlaps that will need to be resolved. However, there also clean breaks such as between ABBB and ABAB. ABBB stops at ID# 19797 and ABAB starts at 19837. Also note the last line. AABA has the same Min and Max ID#. This means that this is a single AABA pattern apparently within the AABB pattern.

Looking at the Table

In this step, I’ll look at tbl4SibsNewDadPattern and use the Access Pattern Mins and Maxes to get more accurate Start and Stop points. My spreadsheet above shows that ABAA starts at ID 52. I scroll up from there:

chr1tabledadpattern

At ID# 18 I see ?AG?. I can imagine that being an ABAA pattern, so why not start the ABAA Dad Pattern at ID# 1? Out of 680,000 ID’s, that doesn’t seem too much of a stretch.

Next it seems like the ABAA should stop somewhere before ID# 6605. I’ll hasten the process by a query that looks at the case where Sharon’s base from Dad is not equal to Heidi’s Base from Dad:

abaa-stop

Clearly, there is a break at ID# 5127, so I’ll use that.

chr1dadpatternstartstop

Here, I’ve added a finer Start and Stop for Dad Pattern ABAA. What that means is that in this segment of Chromosome 1, I got my DNA from one of my dad’s grandparents as did Heidi and Jon. Sharon got here DNA from the opposite paternal grandparent.

Here is the Start/Stops filled in:

chr1dadpatternfilledin

I highlighted the 57205 as a reminder that I needed to add an extra ABAA pattern in later. There is a gap between ABAA and ABBB of 1477 ID’s where there is a likely AAAA pattern, which means the 4 siblings got their DNA from the same paternal grandparent.

Finished Start Stop Dad Pattern Spreadsheet

I took out the single patterns and re-sorted by pattern. Then I wrote a formula to get the locations in Access language:

dadpatternstartstop

Next I made a copy of my working table in Access to a new table called tbl4SibsNewDadPatternFillin. I’ll use this to fill in the Dad Patterns.

Filling in the First AAAB Pattern

In this pattern, I will be filling in all the missing ‘A’s of the AAAB pattern. I won’t fill in the B as I won’t know if an ‘A’ or a ‘B’ belong there. Here is my first update query:

aaabupdate

This says if I am missing a base from dad in any of the AAAB Pattern areas that I am in and Sharon has that base, I’ll take the base she has. I can save a little time, by adding on to that query:

joelaaabfromsharonheidi

It is important to put the second ‘Is Not Null’ and ‘Is Null’ on a separate line as that is the ‘or’ line. Otherwise, I would only get the Sharon from Dad and Heidi from Dad bases where they equaled each other.

First I run the query to make sure it shows what I want.

aaabqueryex

It does [although, see below. For one thing I missed the ID criteria in the 2nd line of criteria!]. If I had the criteria all on one line, I wouldn’t have gotten the Heidi from Dad bases where Sharon is missing a base (ID# 63) and visa versa. I will want to check my query later, so I can check it at least two ways. One way is to check at ID# 63 and 99 to see if that base was added. The other way is to see if the Update Query updates 49094 lines as that is the number of lines in the above query.

When I went to run my query, I got this error:

udateaaaberror

Before I give up on this double query, I’ll try one more thing:

heidiorsharonaaabtojoel

Here I say if the conditions I mentioned above apply give either Heidi’s base from Dad or Sharon’s base from dad to me. I note that the update is for 49094 rows, so that seems on the right track. The reason why I don’t mind doing a double query here is that either Heidi’s base from Dad or Sharon’s should always be the same in an AAAB pattern.

I ran this and now I am checking ID# 63:

erroraaab

Unfortunately, Access gave me a -1 instead of Heidi’s C Base from dad. Part of why I wanted to do the one query is so I wouldn’t have to add the 2 queries. However, instead, I’ll just add a line to my base tracker:

basetrackernew

That means that I am back to my simpler query. Sharon should add 3975 bases from Dad to my bases from Dad:

3975row

Heidi was going to add over 2200 of her bases from Dad before Sharon gave me hers. Now it is a lower number:

heidibasestojoel

Now check Line 63:

line63

My base from Dad still isn’t filled in. But that is a good thing. When I checked my double query above, it gave me areas outside the AAAB Pattern area. ID# 63 is actually a different pattern. So that is why the number was so high also. The lesson learned is to keep the queries simple.

Now I’ve updated my Base Tracker for the AAAB Dad Pattern:

aaabbasetracker

Note that the Heidi from Dad Bases didn’t go up in the second round of this query. After she had gotten her extra Dad bases from me in the AAAB region, Sharon didn’t have any extra ones to give to her that I hadn’t already.

nodadbasestoheidifromsharon

AABA Fill-in

This time Heidi will be left out and Joel, Sharon and Jon will get new bases from dad based on others from the AABA areas. This is the same simple query as before, except that the ID#’s are different:

aabafillin

Here is Jon’s first bases from Dad from one of his siblings:

jonfromdad

This brings up an interesting point. There may be cases where Jon has a phased base at a location which his DNA test didn’t cover.

AABB Fill-IN

Here there should be Bases for all siblings. Wherever there is an A and an missing A, add it, and the same for B. Again my first query is the same except for the ID#’s:

aabbfillin

On the AABB bases from Dad, Jon doesn’t have a lot to add to Heidi’s bases, but Heidi has a lot to add to Jon’s:

aabbbasetracker

abaa dad pattern fill-in

Here we start with Joel being updated with Heidi’s bases from Dad because Sharon is the lone B.

abaa

There are more rows updated as the ABAA Dad Patterns had more regions than the other patterns.

In my last update query, I made a mistake:

jonfromjoelmistake

I’m not sure if it makes a difference. I said that in the case where my base from Mom is not null, give Jon my Base from Dad where he doesn’t have any. To check, I run the correct query:

abaaquerymistake

This shows that there are still 2063 bases that didn’t get added to Jon from my bases from Dad. I will add them now. Plus I will add that number to the previous 29113 bases I added to Jon’s bases from Dad from my bases from Dad.

abaatracker

As there were 3 siblings the same in this pattern, I again took 2 rows to add the bases to the table.

ABAB Dad Pattern Fill-in

ababtracker

Jon now has more bases phased than he had tested on his paternal side. He already had more than he had tested on the maternal side.

ABBA and ABBB Dad Pattern Fill-ins

basetrackersummary

As expected, Jon made out best in this Pattern Phasing.

Mom Bases From Dad Bases

This is the part of the project that seems ironic. My dad who wasn’t tested for DNA is now supplying bases to his children that were from their mom. Here I’m looking for where the siblings are heterozygous. In those cases where there is now a Dad base from the patterns and a mom base is missing, we can fill it in.

First, I am making another copy of my table called tble4SibsNewMomPatternFillin.

Here is my first Mom from Dad Update Query:

joeldadfrommom

It says where I am heterozygous and my Dad base is my 2nd one put my first base in as the base I got from Mom, but only if she doesn’t already have a base there. The last part is just an extra precaution so that I don’t overwrite anything.

In the next query, I just reverse the Joelallele1 and 2 to get 12,000 more rows of phased DNA:

momfromdad2

Summary of Mom Bases from Dad Bases

trackermomfromdad

Check the numbers

I have been adding up the rows added. But now I will check my table to see of the Total Bases Phased added up. And the answer is:

countfromtbl

The numbers are pretty close. The above Heidi from Dad is higher than my tracker. I’m guessing the table sums are correct and mine are a little off. The means that Heidi’s paternal phasing should be a little lower.

Part 8 Summary

  • The use of MS Access Min and Max functions to get Dad Pattern starts and stops saved a lot of time
  • It still takes time to verify those starts and stops
  • The Base Tracker makes it easier to track the numbers and the process. It is also interesting to see how the % phased goes up with each round of updates
  • I wasn’t expecting the numbers from my base tracker and actual updated bases to reconcile perfectly, but most of the numbers did. It is possible the discrepancies are from the 2 minor errors I made and tried to correct along the way.

 

 

DNA Phasing of 4 Siblings and One Parent: Part 7 (Starting Over)

In my last blog, I found a few errors when I was checking some odd results. This lead me to think that it would be better to start the phasing process from the beginning. The beginning means using 4 siblings’ raw data and my mom’s raw data. This time I will be more methodical and keep track of the results. I have a new spreadsheet called The Base Tracker. Every step that I take, it will keep track of the bases from each sibling when they assigned to a parent.

A New Table

First I’ll create a new table from the raw data. I’ll start with my mom, me and my 2 sisters as they are all tested using Ancestry Version 1.

3sibtable

I called the table tbl3Sibs.

Next, I combined tbl3Sibs with Jon’s Ancestry V2 results into a new table called tble4SibsNew. I made sure I had a right connect on the arrow. That means that I wanted everything in the 3Sibs table plus what was in Jon’s information. If I had left it an equal join, I would have lost the bases that are in Version 1 but not Version of the AncestryDNA results.

mergejonwsibs

It is important here to connect by rsid. I made the mistake of connecting by IDs last time. As the different AncestryDNA test results versions had different ID’s, this produced crazy results. I also used only Chromosome 1-22 as there are too many special cases for the X Chromosome.

tbl4sibsr1

Then I used a count function to count the number of bases each sibling had. I also figured out how many blank lines there were out of the 682549 and subtracted those 8229 sibling blanks from the total to get 674,320. I’ll use that number to figure out the percent phased. This is the Count Query showing the Totals button in the Access Ribbon:

countrawbases

The results of this query were put in the RawBases Row below.

My New Base Tracker: % Phased

basetracker

The first column has the step taken. P1 is Principle 1. JoelFD is the Joel from Dad column, so all the Dad bases are on the left and mom bases are on the right. This table will give me the % phased for each sibling.

Principle 1 Query – Homozygous Siblings

This Principle is on the Principles from a Whit Athey Paper where you have 2 bases the same and each one is from each parent. The last time I did this, I may have had too much in a query at a time. This time, I’ll do the query separately for each sibling.

First, I opened up my tbl4SibsNew in design view and added more fields to put the new dad and mom bases.

newbasefields

First, I copied the table, so I’d have the raw data table with no additions. I called my new table tbl4SibsNewPrinciples. That is where the phased bases will go.

Here is a simple Principle 1 Update Query for me:

joelprinciple1

It says where I am homozygous, put both those bases in my JoelFromDad and JoelFromMom columns in the new tbl4SibsR1Principles.

joelprq

That little query phased over 900,000 of my bases into Paternal and Maternal sides.

I was interested in seeing the effect of Jon’s testing using AncestryDNA V2:

jontracker

Jon has a ways to go to catch up on being phased. This is due to the differences in AncestryDNA V1 and V2.

Principle 2 – Homozygous Mom

Here if my mom has the same base twice, one of those has to go her child. Here is a query to update my mom bases. As my dad’s DNA was not tested, he gets a non-applicable in that column.

joelpr2

Note that I have a criteria ‘Is Null’. This means only update this base if there is a blank there already. Here is the Principle 2: Homozygous Mom summary:

p2summary

Here I don’t know why my Principle 2 Bases were so low. I think it is because I made a mistake above, so I’ll do these steps over from the beginning.

Here I get more consistent results for my mom bases:

pr2joelfix

Here is the revised Principle 2 Summary:

prtrackerrev

Jon’s results also changed to be more realistic to where he was after Principle 1. I can also use the Access Count function to check these numbers:

countpr2

All the numbers match up except for JonsFromMom. For some reason, the spreadsheet is showing a higher number of Total Bases from Mom for Jon of 540956. If I subtract that from his Principle 1 bases from Mom, I get 272250. I’ll put that in as his Principle 2 bases from mom and assume that I made a mistake in writing down Jon’s Principle 2 base from Mom number.

pr2summaryreconciled

I suppose it’s like reconciling my bank statement. I assume that these are Jon’s mom bases filling in where Jon didn’t have test results that lined up with the AncestryDNA V1 results for his mom and siblings.

Moving On To Principle 3: Heterozyous Siblings

This works when the child is heterozygous and has one base phased to one parent. Then the other base is phased to the other parent. It appears that this would have to work just from the mom side for now to fill in the dad side. That is because we haven’t filled in the ‘fromDad’ side with any Heterozygous sibling results yet.

pr3joel

This query says in the situation where I am heterozygous and I get my allele2 from mom, assign my allele1 to be from my dad. But only do that where there isn’t already a JoelFromDad base there.

However, this raises a question. Here is the same query without the ‘Is Null’ criteria:

pr3joellarge

As you can tell, I am beginning to doubt my work. The question is, if there has been no previous addition of Joel bases from dad based on my heterozygous results why is there a difference between the two queries?

I checked Sharon’s results and found that she didn’t have the same situation. Where she was heterozygous, she didn’t have any bases from dad assigned to her.

Here is a query showing my problem:

p3problem

It is not a problem for phasing, but more for what I will enter into my Base Tracker. Fortunately, I can do a Count Query:

countjoelfromdad

This shows that my JoelFromDad bases have gone up by 25589 somehow since I last tracked them. This means that I should use the larger number for my Base Tracker.

Here is the Principle 3 Summary in my Base Tracker:

p3summary

In a few hours, I’ve phased over 4 million bases. And that time includes making mistakes and fixing them. All siblings are phased at over 80% at this point except for Jon. His Paternal phasing is lagging at only at one half.

I suppose that this is the time for me to say that it takes 20% of your time to get 80% of the result and 80% of your time to get the last 20% of your result.

Summary Part 7

  • After making mistakes, it feels good to start with a clean slate
  • Principals 1-3 of the Athey paper are easy to implement using MS Access
  • If a mistake is found, it is usually good to start from a clean table of data and fix it from there
  • The Patterns don’t lend themselves as well to Access and take more time to get
  • Having a Table to track the work and results is helpful and interesting.
  • In the next Blog (Part 8), I will be back looking at filling in the Patterns areas

Raw DNA Phasing of 4 Siblings Using One Parent’s DNA: Part 6

In my last Blog, I was still playing catchup in going from my original 3 sibling phasing, to incorporating my brother’s new DNA results.

Missing Principle 2 for Jon

Here is Principle 2 from the Whit Athey Phasing Paper I’ve been using:

Principle 2 –If data from one of the parents are available, and that parent is homozygous at a SNP location, then another almost trivial phasing is possible
since obviously that parent had to send the only type of base s/he had at that location to the child
I checked this in MS Access. Here is the query:
homozygousmom
This says if mom is homozygous, here allele1 is the same as her allele2. For those if Jon has null values in his FromMom column, then I skipped this step.
homomomerror
Clearly, I did mess this step from position one. As I was doing my previous steps, I thought that Jon’s results were very sparse.
principle 2 fix

For this, I will again use the update query.

homomomfix

In this case, I didn’t bother writing ‘Is Null’ under the JonFromMom column. That is because even if there is something in there, I would just as soon overwrite it, as this is such a basic principle. I only missed 481,000 rows.

second part of fix

Now that I have mom’s bases, I will go back and fill in Jon’s dad bases based on his mom bases. Those are also Principle 2 fillns where Jon is heterozygous. I don’t mind doing these updates in Access as they are so easy.

dadsbasefrommomjon

This says in the case where Jon is heterozygous and his mom has allele1, put Jon’s allele2 in as Jon’s allele from Dad. This query says if a Jon’s has allele1 from his Mom, the allele2 has to go to his dad.

jonallele1fordad

So that is an easy way to update over 7,000 rows in a few minutes.

Next, On to Mom Patterns

It’s a good thing that I added these mom bases to Jon, because now it is time to look at mom patterns. From Athey:

In the next step, we use the pattern on the mother’s side to fill in as many more cells as possible. Finally, we can project the information in those newly filled cells back to the father’s side using Principle 3 again.

 This procedure will be the same one that I used for the Dad Patterns.
aaab mom pattern
 I might as well go in alphabetical order. In this pattern, Jon will not match the other siblings.
aaabmompattern
This works, but it doesn’t include the areas where there are missing mom bases. So I will use it to get rough ID’s. There were about 45 AAAB Mom Patterns that I found. Perhaps the rough ID’s will do.
AAAB Quality Control

My spreadsheet counts the numbers of ID’s between patterns.

aaabqc

619 is close to the cutoff that I had set. I went back to the original spreadsheet and found other AAAB patterns between the Stop and next Start. So I can combine those 2 AAAB Chromosome 15 patterns. I checked another pattern with about 700 ID’s from the Stop to the next AAAB Start. However, there was another pattern between, so those were a valid Stop and Start. There were about 45 AAAB Mom Patterns or about 2 per chromosome which seems like a lot.

ABAA Mom Pattern

The query should be similar to the previous one. If Sharon isn’t the same as her siblings, we will have an ABAA Pattern.

abaaqpattern

This pattern was easier to figure out. There were about 35 of them.

aaba Mom pattern

This is the one I should have done second if I had wanted to stay in alphabetical order. I checked a few with differences of about 500 between Stop and next Start, but they looked OK. There were a few single allele patterns.

aabb mom pattern

I have 3 criteria for this one:

aabbmompatternquery

I had to enter that Sharon’s allele from mom could not be the same as Heidi’s allele from mom or I would get a lot of AAAA Patterns. When I looked for these, there appear to be 19 AABB maternal patterns.

abab mom pattern

Again, this is a bit out of alphabetical order. This query is not unlike the previous one.

ababq

When I make Heidi’s mom base different from Sharon’s mom base, that gives me the ABAB pattern:

ababresults

Here I have Excel on the left where I am entering the results from the Mom Patterns that I found in Access.

ababworksheet

The jump in Chromosome 4 from position 6M to 37.9M indicates a change in pattern. That is entered in Excel on the left. The change from the previous pattern is shown as 7544 ID’s. ID’s should be the same as SNPs.

A change in Chromosome is an obvious Stop and Start:

ababex

There were about 30 ABAB mom patterns for me and my 3 other siblings. I’ve done:

  • AAAB
  • AABA
  • AABB
  • ABAA
  • ABAB
abbb mom pattern

It looks like this must be the last Mom Pattern. This is the mom pattern where I show my individualism – unlike my siblings who have the same mom base:

abbbq

Here’s an ABBB example:

abbbex

In this case on Chromosome 9, there is a jump from position 38M to 71M. However, the SNP (or ID) count between the two is only 190. That means this must be an area where the SNPs are not counted for some reason, so I would think that I could continue the Mom Pattern through that area. However, when I look at my Access table, I see this:

chr9ex

Above ID 370485 is a different pattern of AABB in the last four columns. This would have come out when I merged all my patterns and I would have had to fix it then. However, I might as well get this as good as I can now. As it is, there will be a discrepancy to work out:

chr9discrepancy

The AABB pattern started at 369193 which is before the ABBB Pattern stopped at 370295. This means I need to go back to the Table:

 

chr9problem

 

Here is position 370295 where I had the ABBB Pattern ending. However, this is a a very small pattern, going only up to ID# 370290. Before that is the AABB pattern again. Here the AABB Pattern picks up again.

chr9aabb

Here is how I corrected my Chromosome 9 Mom AABB Pattern:

chr9aabbcorrected

However, note that I had to break my 500 ID/SNP rule. That 51 represents the tiny ABBB Pattern between two AABB Mom Patterns.

Here is the start of the AABB pattern at 369193:

morechr9issues

First note, that it would actually start at 369192. Before that is a single ABBB pattern. Then above that in the first row is an ABAA pattern. The first row is the end of an ABAA Pattern that I already recorded in my spreadsheet at ID# 369181, so that doesn’t need to change:

chr9abaa

At 369190 there is a single pattern of ABBB. This will be noted in my spreadsheet, but not entered as a start/stop position.

Re-Sort the Mom Patterns by Pattern

Now I have 426 lines of Mom Pattern Locations. I need to sort them by pattern and hope there are not many weird issues like I found in Chromosome 9. I will also take out the single patterns. When I do this, I get quite a mess. Here is Chromosome 1:

chr1mompatternsorted

Here we have quite a few nested patterns.

chr1fix

The first AABB pattern is a single, so I can take that out, but what do I do with the AABB Stop? It looks like that was a single also, so I can take that out.

chr1fix2

The AAGG is between a CTTT and an AGG? which would turn out to be an AGGG. What I had previously described was a single pattern going to another single pattern within a valid non-single pattern.

Next, starting at ID# 6608 I have three starts in a row which cannot be good. Looking at the first two patterns of ABAA and AABB, they look like they could be good.

fixchr1

I’ll add a ‘G’ where the cursor is above and call that the end to a very short ABAA Mom Pattern.

chr1fix3

Here is the corrected ABAA stop. I highlighted the next ABAA Stop in yellow as that will need work.

Next I’ll look at the ABBB Start at 19885. It looks like I missed the previous AABB Stop at 19884.

chr1fix4

 At least that makes for a clean cut. I made a note of my correction:
chr1fix5
I also made note to look at the next AABB Stop (in yellow). Now there is a Start for an ABBB followed by a Stop for an AABB which looks fishy. Here is the area following ID# 19885:
chr1table
It seems that there are about 5 ABBB patterns followed by a single AABB Pattern, a single ABBB pattern and another AABB Pattern. As this looks confusing, let’s look at the full table for the single ABBB Pattern area at ID# 19905:
fullspreadsheet

Time for Quality Control

Are there any errors here? Principle 1 says that if a person is homozygous, then one base is from the dad and one is from the mom. I have CC and Jon has TT. My assignment is correct, but Jon is missing a T from his dad.
Let’s look at this Query:
jonqc
This looks for missing Dad bases for Jon that should be there where Jon is homzygous. It turns out he is mising about 1300 results:
jonqcresults
I ran this query to see if Jon was missing any mom bases and he wasn’t. I also ran this query for myself and saw that I was missing dad bases. I will have to re-run this update to the current table. This is not a problem as this is an easy thing to do in Access.

Just Like Starting Over

Based on the errors that I’ve found, I will start from scratch in Part 7

A Hartley Z17911 STR Tree

In my previous blogs on Hartley YDNA, I mentioned that my terminal SNP is Z17911. That is a part of the L513 Branch of the larger L21 Branch of R1b. Here is what the L513 Branch looks like. This Tree represents those who have taken the Big Y Test in the colored area above.

l513chart

My Hartley Z17911 is difficult to see but it is slightly to the left of the middle and to the left of an orange area. The checkerboard pattern shows the part of England that my Hartleys are from. As far as I know I am the only Hartley that has had SNPs tested positive for Z17911, or for L513, for that matter.

STRs and Z17911

However, quite a few Hartleys have tested their YDNA. They have tested STRs. As a result, it is possible to do a comparison to others taking this test. STRs are not SNPs which are a more definitive designation of where you are on the Y Tree. However, they can suggest what SNP you should belong to. I belong to an L513 and the Administrator Mike is actively looking for others that might be in L513. As a result, Mike has put out lists of people that appear to be L513 based on their STR patterns. I have mentioned in past Blogs that some of those people are Hartleys.

Here is a recent list:

suspectedz17911

The first on the list above is me. Then follows three other Hartleys. Administrator Mike has grouped these other 3 Hartleys next to me. Based on their STRs, he has grouped them as Z17911. This is even though these 3 have not tested for Z17911, L513, or probably not even for L21 which is way up on the Y Tree. The row with the orange, green and yellow above the results has what is called STR Rates. These are the rates at which each individual STR mutates. Some are very slow and some mutate relatively quickly. The selected mode above is likely the mode of L513. This will come in handy later on in this Blog in a few ways.

Z17911 and Signature STRs

It turns out that STRs form themselves into groups. That means that for groups of people that are related by YDNA have combinations of STRs that are almost always unique to that group. Here I will make an assumption that the other 3 Hartleys are indeed Z17911, even though they haven’t tested their SNPs.

In the results section to the right of the Hartley names are the values for each STR marker. The colored values are the ones that vary from the L513 Mode. These values, especially the ones that are in the darker colors will result in a signature for these Z17811 Hartleys. The darker colors indicate more of a variance or distance from the mode. Another way to put it, is that the L513 mode is the older value and the Z17811 Hartley numbers are the newer values for the STRs that have mutated away from the L513 mode.

Up or Down?

These Z17811 STRs may mutate up or down. The blue shaded numbers are going down and the reds are going up. Why is this important? It is important as I’d like to build a tree from these 4 Hartleys. I will need to know who is descending from whom. Or at least, which of the 4 branches of Hartleys may be the oldest.

Here is an example:

str-example

These are some of the results of our 4 presumed positive Z17911 Hartleys. It is  difficult to create a mode of these results as the mode is the value which occurs the most. If there are 2 of each value, which value do you use? This happens the #449 Marker results. I am 31 at the top, but there are two 31’s and two 32’s. I have the L513 mode at the top of the image. The value for Marker #449 is 29. That means I have the older 31 value and the other 2 Hartleys have newer 32 values. They are moving away from 29.

Defining Hartley Z17911 STRs

Next, I looked at all the STRs where the 4 Hartley had different results. The other results are interesting but in comparing Hartley to Hartley they don’t matter if they are the same. Well, they might matter if there was a STR that mutated up and back down again, but the chance of that happening should be relatively rare.

hartley-strs

Here I have compacted 67 STR results to 12. This is a good time to point out the STR rates. The rate for 447 is about 0.09. The rate for CDYb is 35. That means that CDYb will change over 350 times as fast as 447. Another point is that Hartley #4 seems to be a special case. He was categorized as a non-L513 person which was thought by the L513 Administrator Mike to be a mistake. I don’t know if that was ever resolved. I do note that some of his STRs are a bit different than the other 3 Hartleys, but not totally different. I also note that he has tested positive for R-L21, so perhaps this has been resolved.

But Wait, There’s More

I had forgotten, there is one more Hartley in the group. He doesn’t have a Hartley last name but believes that he is descended from the Hartley Line. Great news. I will call him Hartley #5.

5-hartleys

Previously, I had missed Marker 481. Also when I copied things, my numbers didn’t get colors, but that’s alright. Now I have 13 markers and 5 Hartleys.

References for Trees

I’m aware of 3 references for creating STR trees.

  • Robert Laurence Baber – He has written quite a few articles on STR trees. I have not read them all yet. I downloaded a 5 part study he wrote but I don’t totally understand his method yet – though I understand some of the principles. He uses an upstream STR mode as I tried to do above.
  • Robb Hand Drawn Tree example – He compares a hand drawn tree to the Fluxus software. Although he likes the hand drawn version better, he learns some from using difficult to use the Fluxus software
  • Gleeson STR Tree – Maurice Gleeson gives a method and example of how to build a STR tree

More on Modes

I seem to be getting hung up on Modes:

more-modes

Here I have the L513 Mode and various modes from downstream SNPs. The 458 mode went quickly from 17 for L513 to 19 for S5668 and then appeared to stay there for quite a while.As a result, I chose 19 for the mode. Had I just looked at the older L513 Mode, I may have come to a different conclusion as to which way this STR was mutating.

Then the very fast CDYb seemed to move up in a regular way through the ages. Of course, in reality, it could have gone up and down over that period of time, but we wouldn’t know it if it did. I picked the lower 39 value for the CDYb STR at the Hartley mode level. To the right, I have the GD or generational distance from the Hartley Mode. This says that these Hartleys should be related at about the same level – around 4 or 5 GDs or STR mutations.

A 5 Hartley Likely Z17911 STR Tree

Here is the tree I came up with. It is along the line of and in the form of the Gleeson STR Tree example mentioned above:

5hartleytree

  • The Hartley common ancestor’s signature STR values are listed at the top. The mutations from that are shown down the branches to the individual Hartleys.
  • I also added some dates assuming that on average, a STR will mutate every 170 years given a test of 67 STRs. The lower horizontal lines above happen at the 2 or 3 STR mutation rate (which is the same as the GD). The top horizontal line happens at a GD of 4 or 5. The Hartley #5 horizontal line is up higher as the 358b mutation is a double one from 16 to 18.
  • In the above scenario, Hartley #5 is by himself. Another scenario would have Hartley #4 and Hartley #5 together as they share a mutation at 389b. Instead, I chose the above tree due to Hartley #1, 4, 3, and 2 each sharing 2 STRs.

This image shows some of my rationale for the tree:

5hartley-groping

I chose the double combo of 25-32 that Hartley #2 and #3 shared. I also chose the double combo of 17-40 (in yellow) that Hartley #1 and #4 shared. Other possible single combos that I didn’t choose to group were the two step 16>18 mutation for Hartley #4 and 5, the 11 mutation for Hartley #1 and 5 and the 16 mutation for Hartley #1 and 3. The principle used is to try to get the tree as simple as possible. This is what Gleeson calls the parsimony principle. My assumption is that my groupings achieve that goal.

How Do the Hartleys Compare to the Z17911 Mode?

In comparing Hartleys to the Z17911 Mode,  I go from the age of surnames to before the age of surnames. There are 4 that have tested positive for Z17911. They are Hartley (me), Goff, Thomas and Merrick. In that group, the level of GD’s and the variance in surnames indicate a pre-surname common ancestor.

So the GD’s will be further back also.

z17911gds

Here I am assuming no back mutations. Under the previous tree I assumed that Hartley #5 had a back mutation at CDYb. Due to the volatility of this marker, it is sometimes ignored in these analyses. Notice that now the range of GDs is from 3 to 8. Again, I group Hartley #1 and #4 together and Hartley #2 and #3 together.

z17911tree

Hartley #4 has the GD of 8. This is due to 2 double mutations. That pushes back his connection to Z17911 to around the year 600. This seems to be pushing back to a possible age of Z17911. Z17911 positive Thomas has submitted his Big Y results to YFull, so I am hoping to get a date from YFull for Z17911. It will be interesting to see what they come up with. The structure of the tree is the same as the previous Hartley Tree. I just adjusted the relative heights of the horizontal arms.

Summary and Conclusion

  • STRs from 5 Hartleys who have tested their YDNA seem to indicate a relatively close relationship – at least in YDNA terms
  • I have had my SNPs tested and the administrator of the R1b-L513 project has grouped the other STR-testing Hartleys in the same Z17911 group as me based on similar STR patterns. That is quite a way down the SNP tree.
  • If any of these Hartley were to test for for the L513 SNP or further down for Z17911, it could confirm what the STRs seem to be saying. Then I wouldn’t be the lone SNP tested Z17911 Hartley
  • SNPs create a solid reliable marker for relationships. It is best to have the SNP relationship established through testing before doing this type of STR analysis. However, even without SNP testing, STR trees can be informative
  • Back mutations and the different mutation rates leading to unpredictable STR mutations are the 2 major variables that make STR testing less accurate than SNP testing
  • The weakness of the SNP testing is that many have not done it. The other issue is SNP testing may only take you up to a certain date. After that date, STR analysis is  more useful
  • STR testing is best used in conjunction with SNP testing
  • Making a STR tree takes some practice and knowledge of STRs and mutations.
  • This YDNA research and resulting connections could shed light on the history of this branch of the Hartley family over the past 400-1400 years or so.

 

Raw Data Phasing Via Access, Athey and MacNeill: Part 2

In my last Blog on raw data phasing, I went through 3 principals that Whit Athey laid out in a paper on phasing raw data when one parent’s DNA results were missing. Using those principals, and the MS Access program, I was able to sort many of my bases and 2 sisters’ bases into ones we received from our mom and ones that we received from our dad. I checked a few of my results with a chromosome map made for me by M Macneill.

Paternal Patterns

I had gotten to the part of the Athey paper where he talks about paternal patterns of bases that the sibling combinations received. I noted a space between the first two paternal patterns that I looked at. Below the pattern goes from an ABA pattern to an ABB pattern.

change-in-dad-pattern-hilite

There was a gap between the ABA and ABB pattern where there was no ‘pattern’ as my 2 sisters and I shared the same base there. When my sisters and I all share the same base, that is an AAA “pattern”. That AAA area corresponded exactly to the area between the 2 yellow lines below in the chromosome map made for me by M MacNeill – prairielad_genealogy@hotmail.com .

macneill-chr1-hilite

In the map above, MacNeill was able to determine that my 2 sisters and I got our DNA from our paternal grandmother in the area between the 2 yellow lines. Further, the first yellow line described Sharon’s first paternal crossover point and the second yellow line described my (Joel’s) first paternal crossover point.

Finding All the Paternal Crossover Points

At this point in the Athey Paper, he recommended looking at the paternal pattern and filling in the missing bases based on the known pattern. I was looking for an easier way to do this, so decided to take a different approach. I decided that I would find all the paternal crossover points first. Then, armed with that information, I would create a formula that would fill in most or all of the missing bases for each pattern.

However, this required a modification of my database to make the work easier. I wanted a number to define the range of patterns, so that I could apply an easy query to add missing bases. I already had this but I hadn’t used it. Back when I imported the 4 sets of raw data into Access, Access assigned an ID to every row of data. That meant that I needed to add that ID into all the queries that I had done previously to make tables and further queries. This took a while, but I believe that it was worth it.

table-with-id

The ID is the first column.

I started going down all my data and noting the change of each pattern. I put the results into an Excel table. Here the Start and Stop numbers are the Access assigned ID numbers. The ID’s corrrespond with the number of DNA locations looked at. In this case there were a bit over a total of 700,000 of these locations for my mom, my 2 sisters, and me.

excel-pattern

Then I noted the patterns are repeating as would be expected. For example, my first pattern was ABA, but 3 patterns later, that same ABA repeated. My thought was to create a query just for ABA patterns. Then when scrolling down looking for changes, the separation between rows should be greater and it would be easier to see where those changes were.

Here is what my Access query looks like. I changed the query name to DadSpecificPattern.

dad-specific-pattern-queryquery

This particular query gives me the ABB pattern. I have the HeidifromDad base equal to the SharonFromDad base. That makes me the A and Sharon and Heidi the BB of the ABB Pattern. If you think about it, that also means in these areas that Heidi and Sharon will have their base from the same paternal grandparent and mine will be from the other paternal grandparent. I’m learning as I go. I’m sure that information will come in handy later.

My plan seemed to be good, but there was one catch. Once I refined my query, most or all of the blanks disappeared. That meant that the start and end points might not be exact. Here is an example of what I mean.

change-in-pattern-rouch

This is from my old Dad Pattern query with the blanks still there. The change from ABB to ABA happens at ID or line 19809. However, the new query takes out the blanks to make it look like the change is at ID Line 19826.

Here is what my DNA results look like so far without a filter (or query). The last 3 columns are the bases from Dad columns. There is a lot going on between lines 19809 and 19826.

pattern-unfiltered

Once I apply a formula to add bases, it will say something like: In the lines that have the ABA pattern where there is a blank at either A spot, replace the blank with the A that is there. If I apply the rule too late, I will be missing an area. Worse, If I were to use the 19826 cutoff, I may be still using the previous rule. That rule would say basically the same thing except, “Where the row is ABB and one of the B’s is missing replace the missing B with the one that is there.” If I apply an ABB rule to an ABA area, I’ll get bad results.

Long story short, I ended up recording a rough start and stop in my Excel Spreadsheet.

revised-spreadsheet-for-pattern

I started naming the segments, but realized that was not necessary. Some of the patterns were only at one point rather than in a long segment. I believe that is an anomaly due to a bad read, mutation or some other problem. Those are the ones in the spreadsheet that had no end point. It took me part of a morning to get all the paternal crossover pattern points for all 23 chromosomes. Fortunately for 3 siblings, the patterns are only ABA, AAB and ABB.

I just went back and checked the error points/aonomalies. I reran the Heterozygous Sibling Query and it fixed at least the first problem and hopefully the others. When I added the ID’s in, I had to redo all the queries quickly, so I suppose that is where the errors came in. That is not a problem as long as the problem can be found a fix can usually also be found. There actually weren’t that many errors. There are still some anomalies that are just anomalies. I have left those in yellow in the spreadsheet image below.

So in my spreadsheet, I have all the rough starts and ends for all the crossovers for my 2 sisters and myself. Here is the top part of the spreadsheet sorted by rough start:

rough-start-sort

Next, all I need are more exact start and end points. Here is the start of what I have:

pos-and-id-and-pattern

I picked this section because it looks pretty complete already. Note that my Start and Stop numbers are pretty close to each other. That means that there are no other AAA segments in-between. I had to do an additional Access query to add in the position numbers for the Start and Stop of each chromosome’s pattern change. This was important if I want to convert the results from Build 37 to Build 36 to compare to MacNeill’s work or to gedmatch.com.

Starting to Find Paternal Crossovers and Assigning to Siblings

Previously I had been calling the start and end of my patterns crossovers. These two terms aren’t totally interchangeable as the start or stop of a pattern may happen at the beginning or end of a Chromosome and therefor not be a crossover at that point. It seems like it should be pretty easy to find the crossovers. Look at the image above. The first and second rows show ABA going to AAA. The order in me and my siblings are JSH or Joel, Sharon and Heidi. The only letter that changes is the B to A. That is the position that Sharon is in, so the paternal crossover has to go to her. From row 2 to row 3 the pattern changes from AAA to ABB at Chromosome 1, position 23,288,828, Build 37. That doesn’t mean that 2 siblings have a crossover there as we are looking at the patterns, not the letters. It is actually the letter that stayed the same that represents the crossover here. AAA to ABB means: all the same (AAA) goes to one different and 2 the same (ABB) – in this case Sharon and Heidi). The one that is different is me and I get the crossover at this location. The next change is from ABB to ABA. This is a little harder to see. I would say that that this crossover goes to Heidi if my reasoning is right. BB was the same before and goes to BA. It must be Heidi that changed because now she matches Joel who didn’t change. I’ll need to figure out how to make better bar graphs in Excel, but here is how the beginning part my father’s Chromosome 1 broke up for 3 of his children. Or another way to look at it the vertical lines are where my father’s maternal and paternal chromosomes combined in each of his 3 children that we are now looking at.

excel-bar-chart-chr1

Where:

  • Series 1 is Sharon. Where the color goes from blue to orange is where Sharon has a change from one paternal grandparent’s DNA to another paternal grandparent’s DNA. The number to the right of Series 1 is the Build 37 Chromosome position number for Sharon’s crossover.
  • Series 2 is Joel’s first crossover (between orange and gray) and
  • Series 3 is Heidi’s first crossover position between gray and yellow [The same explanation under Sharon above applies to Joel and Heidi]

I’ll go back to the M MacNeill Standard. It’s like having an answer sheet to my questions.

macneill-chr1-hilite

According to MacNeill, I have assigned the crossovers to the correct siblings. In the above chart, just look at the red. I haven’t gotten to the maternal part yet, which MacNeill has in blue. The first 3 crossovers are where the red changes from light to dark or dark to light red. The difference in the MacNeill Chart is that his chart is split out one bar for each sibling. The other difference is that MacNeill has build 36 Chromosome position numbers and the numbers I have are from Build 37.

The Process

  1. Phase the siblings into maternal and paternal DNA using the principles that Athey outlines
  2. Find the paternal and maternal crossovers by pattern changes
  3. Assign the crossovers to the correct sibling using the pattern changes
  4. Assign the segments to the correct grandparent. This requires knowledge of cousin matches on the appropriate grandparent side.

That is the big picture which I am understanding as long as I don’t get too lost in the details.

Back to the Details: Fill in More A’s, G’s, C’s and T’s

I have been setting up my data for this, so hopefully, this will be easy. I now have 3 areas to look at:

  • AAB
  • ABA
  • ABB
AAB paternal update

Now I go back to my spreadsheet and sort it by Dad Pattern:

sort-by-pattern

The Start and Stop areas are the ones I want to update. First, I’ll copy my most up to date Table in Access which is tblSibHetorzygous. I’ll rename that tblDadPatternUpdate. Then I want to look for missing data and update the blanks using the AAB pattern.

In Access, I create a query with the new table.

dad-pattern-update-1

I chose the position fields and Paternal Pattern fields. I will change this to an update query which adds an Update To row. The criteria I want is when JoelFromDad = Sharon from Dad (AAB). Actually, I forgot, I was going to use ID criteria. So in the ID field, I need a lot of information. For the first AAB segment, I need everything between ID 45393 and 54155. This is what the criteria looks like:

aab-first-area

When I choose that area, I get over 8,000 lines. However, I only want to update when there is one missing value in the first 2 and the one that isn’t missing is not equal to the third. Here is the result of the above query in my first AAB area:

aab-patterns

I assume that the first blank should be a T. This would be one of the AAA results by chance in an AAB area. I don’t want to fill in the second line as I don’t know if it will be GGG or something else. That is what I meant by saying I don’t want to fill anything in unless there is only one missing value. In the 5th line there is A?G. That would have to be AAG (in an AAB Pattern area). There are some lines that have everything missing that I don’t want to touch.

How to create a query?

First, I want the situation where Joel doesn’t equal Sharon or Joel Doesn’t equal Sharon. That would create an AAB situation:

heid-not-joel-or-sharon

This query results in 1,666 rows of data including rows that are already filled in. Note that I had to write the range of ID’s twice because in order to get an OR situation I needed to put Joel not equat to Heidi and Sharon not equal to Heidi on separate lines. A simpler query is this one:

heidi-not-joel-or-sharon-one-line

The above achieves the same results in one line. Now, for this query, if Joel is blank, replace it with Sharon’s results. If Sharon is blank, replace it with Joel’s results. Here is the query prior to the updating part:

joel-sharon-blanks

This shows that there are 29 blanks for Joel and Sharon meeting this AAB criteria in the first range of AAB’s:

29-records-aab

Next, I apply the same logic to all the AAB segments. In the Expression Builder of Access, I type in this simple formula:

Between 45393 And 54155 Or Between 60990 And 72548 Or Between 207109 And 220679 Or Between 313271 And 317516 OR Between 326845 And 326912 OR Between 389395 And 390311 OR Between 400045 And 405578 OR Between 419982 and 427158 OR Between 433191 And 446672
OR Between 482297 And 492542 OR Between 532520 And 539292 OR Between 571557 And 579594 OR Between 589614 And 589666 OR Between 630037 And 630314 OR Between 630319 And 630378 OR Between 658744 And 659375 OR Between 670533 And 672360 OR Between 673325 And 682544

Simple but long. This has the AAB Starts and Stops for 23 chromosomes. Then I copy it into the next ID criteria line and get this result:

all-missing-aabs

It took a few minutes to type the criteria, but the goal is to update 1,514 lines of missing Paterrnal Pattern data with the push of one button. I still think it is quicker than going line by line and will be more accurate if I got the criteria right.

Next, I change the above Select Query to an Update Query.

paternal-aab-update-query

When my (Joel’s) base from Dad is missing, I update to Sharon’s base. When Sharon’s base from Dad is missing her base is updated with mine. Isn’t sharing great? I didn’t look at the case where Heidi’s base from dad was missing, because if that was missing we wouldn’t be able to see any AAB Pattern.

Let’s UPdate

I push the run button and check the results. Here is my standard dire warning:

standard-dire-warning

Now I will check if it worked. I’ll try ID or Line # 682124:

bad-aab-results

Unfortunately, that was an undesirable result. Before I had A?G. I changed this to ?AG. It appears that my query both replaced my value with Sharon’s, but replaced Sharon’s with my blank. I hadn’t expected that. Next, I’ll check ID# 682182. I had ?AG and replaced it with A?G. So until, I can think of a solution, I’ll need to split the 2 queries.

Fix it! Quick!

First I recopied by Heterozygous Sibling Table back to the Dad Pattern Update 1 Table. This got the table back to the way it was. Here is my simpler query.

dad-aab-simpler-query

Here if my base from Dad is null, replace it with Sharon’s base from Dad. I’ll check ID# 682182 again:

second-mistake

This gets into the category of trial and error. Sharon’s result still got replaced with nothing. See in the previous query I still was telling Access to put update Sharon’s results with mine. I needed to take that out:

fix

There. Now the SharonFromDad Update To is blank. I go through the same procedures and now it looks right.

right-results

We now went from ?AG to AAG in the last 3 columns. These are the bases from Dad columns.

The next step is pretty easy:

sharon-missing-aab

I took out my criteria and put criteria in the SharonFromDad field. When she has a blank, replace it with Joel’s base from Dad. I hit run and it updated over 600 rows. Here is my original check spot at ID# 682124 with better results in the last 3 columns:

better-results

It took a while, but at least I got it right. The moral of the story is to not ask Access to do 2 things at once when those 2 things involve the same 2 people.

The Next Step: ABA

This time I’ll try a different query. I want there to be a B from the ABA in each case, so I’ll make sure that Sharon’s base from Dad is there:

aba-query

Maybe I’ll figure what went wrong last time or come up with a new error. Above, I want the criteria on the first line to be for my blank base: If Sharon’s base from Dad is not equal to Heidi’s Base from Dad Put Heidi’s base from Dad in my blank spot. For Heidi, When Joel’s base from Dad doesn’t equal Sharon’s base from Dad, put Joel’s Base in Heidi’s spot.

I’m so tempted to try this query, but before I do, I’ll copy the previous table of the DadPatternUpdate to a new Dad Pattern Update ABA Table.  This will preserve what I have in the now older DadPatternUpdate Table in case anything goes wrong. Hey, what could go wrong?

query-aba-dad

I pushed the Update Button and updated over 30,000 rows. The results don’t appear to be any better, so I’m back to my 2 step process.

Here is my new slimmed down query:

slimmed-down-query

This new Update Query should update my Line 18 in the new UpdateABA Dad Pattern Table and it does:

lne-18

I now have a full ABA pattern on that line. According to Access over 30,000 Lines were updated, so it wasn’t a total waste of time.

heidi-aba

Run and check Line 149:

check-149

We have ABA in the last 3 columns, so that is good. Line 18 is still OK. I checked it just to make sure.

Query AAB Revised

After seeing how well the ABA Query went, I decided to revise the old AAB Query:

aab-query-rev

This is now looking at over 37,000 rows. This updates my AAB Blanks to tblDadPatternAAB. I don’t know if it is a better query, but at least I’m being consistent.

sharon-missing-aab-rev

This was over 80,000 rows, so I’ll assume that bigger is better.

I copied that resulting Table to tblDadPatternUpdateABA and reran the 2 ABA Update Queries. Here is one of the rerun queries updating the ABA Paternal Table:

rerun-aba

Down to ABB

My Last updated Paternal Table was updating ABA, so I’ll copy that to a new Table called tblDadPatternUpdateABB. I’ll also copy my last query and put in the appropriate Starts and Stops for the paternal ABB patterns. Again,

abb1

This says when Joel’s base from dad is not the same as Heidi, put that Joel from Dad into the space. Probably a more precise query would have said when Sharon from Dad is null and Joel from Dad is not equal to Heidi from Dad. I suppose technically the above query could be writing over a base with the same base in most cases.

I’ll fix that and notice that I had the wrong table in the top, so I’ll change that also.

abb-rev

This only updated 944 rows, so maybe bigger is not better. Here is Part 2:

abb2

This was almost 3,000 rows updated. Now I should check if it worked. I scrolled for an ABB Pattern in an old query and found this:

dad-pattern-abb

Here is my check:

abb-check

I guess I’ve been working too long. Here I have an AAB instead of the ABB I wanted. That is because I had Heidi updated to me (the A) instead of Sharon (the B). Here is the correction:

abb-corrections

I made a fresh Table of ABB. When I opened up the Query, it was saved this way:

corrected-abb

So Access changed my query. Note that there are 2 fields with HeidiFromDad in them. One is for the Update To and the other has Criteria. That is probably a clearer way to do it. Who should argue with Access?

I updated that and I take a cue from Access for Part 2:

access-abb-part-2

In English, the above says, “For this range when JoelFromDad is not blank but Sharon from Dad is, and Joel from Dad has a different value that Heidi from Dad, put that Heidi from Dad value where Sharon had the blank. It sounds a little complicated.

Back to Row 197704 and I’ll look at 197709 while I’m at it:

corrected-abb-pattern

Oh no, it is still wrong! I checked the previous ABA Table and that was the reason for the error. The error is also in the old AAB Table. However, the error was not in the file before that. My guess is that the AAB rule got applied to the wrong range of rows. I don’t see an error there, so I’ll have to rerun all the queries.

That’s OK, because I’m brushing up on the queries and will use the Is Null value so we will only be filling in the missing bases.

rev-aab-query

I had more problems, so I deleted the AAB Table and recopied the previous Table into it. I reran the Revised AAB Query halfway and it looked OK. However, when I ran the second half of the AAB query – filling Sharon’s results, the problem came back at ID# 197704. Very mysterious. The problem was where I thought it was originally. Look at the ID Criteria for the AAB Pattern Query:

the-problem

There is an extra digit in the first between. The range goes from 45393 to 544155. The second number should be 54155. So this query was performed on 450,000 more rows than intended. I updated the AAB query with fewer rows. Again fewer is better. After many requeryings, I got the desired result for ID# 197704:

197704

That should be the end of the first phase of nit picky work on the Paternal Side.

Summary, Conclusion and What’s Next

  • This was a lot of work, but the good news is that this update is for all the Chromosomes at once.
  • The bad news is that I have to do this again for the Maternal Side
  • Next up should be easy. That is just re-applying the Principles that Whit Athey Outlined on the new bases that I added from knowing the patterns. This should update missing maternally received bases from the updated paternally received bases.
  • I haven’t filled in blanks for the AAA patterns yet.
  • I am a little ahead of the game as I looked at how some of the first paternal crossovers will look.
  • Also with some basic phasing, I was able to deduce who those first paternal crossovers belonged to – one each to my two sisters and one for me.
  • If anything can go wrong it will

Phasing Raw DNA with MS Access a la Whit Athey: Part 1

In this Blog, I would like to look at my raw DNA data. Those are the A’s, T’s, G’s and C’s. I have tested at AncestryDNA as has my mom and 2 sisters, so I will use those results. Whit Athey has a paper that describes how to phase your DNA when the DNA from one parent is missing:

Journal of Genetic Genealogy, 6(1), 2010
Journal of Genetic Genealogy
Fall 2010, Vol. 6, Number 1
T. Whit Athey
Many have used MS Excel to phase their raw DNA results. However, it occurred to me that perhaps MS Access would be a better tool for phasing than Excel. When I download my AncestryDNA data, I get about 700,000 lines of data. That is a lot more data than Excel can handle easily. I will go through the Athey Paper and use Access to get results. However, I will not be giving a tutorial on Access as that would take too long.

Downloading AncestryDNA: Getting Rid of Zeros

Many people have downloaded raw data to upload to gedmatch.com. Ancestry raw data is in text form. Access gets along with Excel well, so first I import the AncestryDNA text data into Excel. Perhaps if you are curious, you have taken a look at your raw data to see what it looks like. Unfortunately, it takes a while to open up such a large file. Here is what a few lines of my AncestryDNA text file look like:

ancestry-text-file

It is important to note in the information above that Ancestry uses Build 37. That means that these results need to be converted to compare to Build 36. For example, Gedmatch uses Build 36. I remove the information above the column titles and bring it into Excel. However, I put my name on the top of the last 2 columns because eventually there will be columns for 4 people’s results (mine, my mom’s and my 2 sisters’). I will need to distinguish between each person’s alleles. It is important to note that when importing this text file to Excel, Excel retains the file as text. This is probably such a file as note that the no-calls have been changed to zeros. To save the file as an Excel file, you must specifically do that step.

Here is a file with the no-calls as blanks, like I want them, and with my name at the top and the verbiage removed:

joelallele12-text

Here is the file in Excel. I have used the search and replace in the last 2 columns. I want blanks for no-calls and not zeros which Excel likes to add.

ancestry-raw-excel

Using Access

At this point, I had to switch to my laptop as I don’t have Access on my desk top. I open up Access and name a new database. I go to External Data and choose the Excel icon with the arrow pointing up to import my 4 Excel Files of Raw DNA for Mom, my 2 sisters and me.

access-import

Next under Create, I choose Query Design. I choose the 4 Excel files that I have imported to Excel.

4-tables-in-query

I should note that when I imported the Excel files, that Access creates a unique ID for each row. I let Access do that. It has set that ID as a key identifier. I could have used the rsid as a key that is somewhat as a unique constant. Next I will connect each table by the rsid’s with something called an equal join. That is the dark line I added between the rsid Field for each persons DNA data.

equal-join

This means return the results when the rsid is the same for each file. Note the last table ( 2 images above) was wrong, so I took that out and added my sister Heidi’s Raw data table on the right. It is important to get the initial importing right and in the right format as this will save a lot of time later. Here is the form that I want the data in:

whit-table-1

This is a portion of Table 1 from the Whit Athey Paper. The difference is that Whit only had part of Chromosome 16. I will have all Chromosomes at once. In my Access query I choose the Excel Titles as Fields. I need the rsid, chromosome and chromosome position only once. Then I add the 2 alleles for each person. FTDNA uses right and left alleles. AncestryDNA uses allele 1 and 2. They are the same undifferentiated alleles.

fields-for-query

When I run the view the query results, I get this:

view-first-query-results

So with one push of the button, I have all the raw results of 4 people in my family in one area. I actually have more information than I need. AncestryDNA includes chromosome 24 and 25 which is YDNA and mitochondrial information that I don’t care about here. This is easily filtered out in the criteria section of the design view. I choose ‘Between 1 and 23’ there. That gives me each chromosome between and including 1 and 23.

chromosome-criteria

Now I am down from roughly 701,000 lines of data to the 700,000 lines that I want. It is important to save these results as a Table in Access as we will be using that Table to make more tables. Also save the query. Even though I say to do this, I didn’t. but just saved the results under the next step.

Whit Athey’s Principle 1

This Principle is simple and straightforward. It says that if you have two letters the same in your results, one of those came from one parent and one came from the other. In line 1 of my results above I have TT. All my siblings have this result also. My mother is already shown as TT as she was tested. My father who was not tested must have had a T which he gave to me and my 2 sisters. Here is Table 2 from Athey showing the next set of data that we need to produce the AncestryDNA raw data. Ancestry didn’t tell us which side each of our bases came from, so we will figure that out.

athey-table-2

I have only 3 siblings that I’m looking at right now, so I need 6 more ‘Fields’ in my database. There are a few ways to do this in Access. Here is one way that I did it.

Athey Principal 1 in Access: Homozygous Siblings

Homozygous is just a fancy term for my TT result found in the 1st position tested of my 1st Chrmosome. I created 6 more fields. These are to show what allele (letter) I got from my dad and my mom when I had a TT or other such homozygous results. Here is what the first field out of six that I added looks like on the Access Query Screen.

homozyous-sib-query

JoelFromDad is the first new field name. After the semicolon is the criteria. In English is says that if my allele1 is the same as my allele2, then put my allele1 in as the result I got from my dad. I used the same reasoning for a field called JoelFromMom and in similar fields for my two sisters. I viewed the results to make sure they made sense. I chose Make Table as I want the results in a Table to use later.

make-table

 I hit the Run button and created a Table called tblAncestrySibHomozygous. Here I have squished the results together.
tblsibhomozygous
The results are as above: MomAllele1,2, etc. Then I added in the last 6 columns: JoelFromDad; SharonFromDad; HeidiFromDad; JoelFromMom; etc. In the first line above, The T’s that we all had were added as contributed from our mom and dad. There appears to be an error on line 3. Note that there were no-calls for Joel and Heidi. What we got from Dad was right, but we shouldn’t know what we got from our mom, just based on our own results. I must have saved my next step to this table also.
Fortunately, when I view the original query, the results are correct:
qrysibhomozygous
Note now that the blanks that should be there in the end of the 3rd line are there. Now I have 700,153 lines of results showing where my 2 sisters and I got our DNA from each parents just based on our own ‘homozygous’ results. Good old Principle 1.
Another tip is that when you make a Table from a query, the order may be slightly different than what you want. To keep the same order, in the Sort row, choose Ascending for the chromosome and position.
sort-order

This will make sure that the chromosomes and positions within the chromosomes stay in the correct order. Otherwise, Access may try to sort by the first field which is the rsid.

Principle 2 in Access: Homozygous Parent

In my case, the homozygous parent is my mom. I spilled the beans already by my mistake above. In Line 3 above, my mom is GG. That means she had no other choice than but to contribute one of those G’s to each of her children at that location on Chrmomosome 1. Now I will put that Principle into Access language. For this portion I will use an Update Table. An Update Table will add new information to an existing Table. In this case, I added it to my tblAncestrySibHomozygous Table. That is why it showed the results already above. Here is what the Update Query looks like in design:

update-for-momhomozygous

Here I have the tblAncestrySibHomozygous Table which I reran (or un-updated). This query says for the criteria where Momallel1 equals Momallele2, update the JoelFromMom, etc Fields with the Momallele2 value. Obviously I could have chosen either of her alleles to update the fields as they are the same. In the bottom left of the image above there is a pink highlighted query called qryMomHomozygous for Table. That is this update query. the ! means that it is going to create something. I assume that the little symbol to the left of the ! means that it is an update query. I ran the query and then created a new table with the results called tableAncestryMomHomozygous. Again, what I had forgotten was that by running this query, I also updated tblSibAncestryHomozygous. It’s always good to do quality checks – especially when you are dealing with over 700,000 rows of results at one time.

I did the update and got a warning from Access that I was updating over 400,000 rows.  And that action cannot be reversed. Here is my old tblAncestryMomHomozygous to show the zeros that I didn’t like:

old-mom-homozygous

I’ll delete that table and replace it with the update on tblAncestrySibHomozygous that I just did. Here is the new table without zeros.
new-momhomozygous
I still had to sort the table to get it right. The trick is to sort the position first and then the chromosome and everything comes out in the right order. Notice that I got rid of my old zero problem. Now I have over 700,000 rows of phased DNA based on homozygous results. Next, I look at heterozygous DNA. Whoa.

Principle 3: Heterozygous Child

I’ll copy the Athey Principle as he stated it as it is slightly more complicated than the previous two:

Principle 3 — A final phasing principle is almost trivial, but it is normally not useful because there is usually no way to satisfy its conditions: If a child is heterozygous at a particular SNP, and if it is possible to determine which parent contributed one of the bases, then the other parent necessarily contributed the other (or alternate) base. This principle will be very useful in the present approach.

How to put Principle 3 into access?

Here is an example of heterozygous children alleles where the mother’s contributing base is known:

heterozygous-example

We know that each sibling got a G from mom as she only has G’s at this location. All the siblings have TG for their raw results, which means the T must have come from dad. I can go through over 700,000 lines and apply that rule or try to use the Access Update Query to produce the same results. This time I copied the tblAncestryMomHomozygous to a table called tblSibHeterozygous before I did the update to maintain the integrity of the older table. In the Update Query, I combined 2 steps. First I set a criteria that there has to be something in the JoelFromMom for this to work. So I said that JoelFromMom is not Null. Next, my allele1 is T. I want this to go into my JoelFromDad spot. If this T doesn’t equal the G I got from my mom, I am already heterozygous, so I don’t need an extra query for that. [That was the step that I didn’t need.]

Here is what I have for an Update Query:

query-heterozygous-allele2

However, note that instead of looking at allele1 here I chose allele2. I am thinking that this will be a 2 step process for each allele. This query is updating over 70,000 rows, or a little over 10% of all the data. I’m trying to show that this Update Query did not address my example above which had to do with allele1 and it didn’t:

allele2-query-heterozygous

The first line was my example. The 3 blanks in the first line are the bases from Dad that were not produced from the query as expected. However, it did work for the second line. In that case, my allele2 was not equal to the allele I got from my mom, so it inserted that allele (G) as the allele I got from my Dad. Next I’ll copy the query: qryHeterozygousSibsAllele2 and rename it as qryHeterozygousSibsAllele1. Then I changed 6 of the allele2’s to allele1’s. This is to cover my original example where allele 1 wasn’t the same as the base contributed from Mom.

allele1-query-heterozygous

In English: When allele1 doesn’t equal the allele that you got from Mom, put it in as the allele you got from Dad. This results in over 76,000 row changes. By the way, if I haven’t mentioned it, in the Update Query, the row that says Update To is the one where the update to your data is happening. So in the above example if Sharon’s allele1 doesn’t equal the one she got from Mom, that allele is then known to be the one from dad and is inserted in the correct place in a new table.

I check my updated table for good old rs13303118 and find:

update-allele1-sib-heterozygous

So I think that it looks pretty good. The first line is now filled in with our Dad’s contributing base. Also all the applicable following lines out of 700,000. There are some situations there should be blanks. In the third line, my mom is AC and I am AC. That is the situation where it is not possible to know what base came from what parent. So the base for each of my contributing parent is left blank – meaning that it is unknown.

One Last Step: Looking for Patterns

This is about as far as I’ve gotten and understood. The next issue that Whit Athey looked at in his paper were patterns. In his example there were 4 siblings tested, so more patterns. He added a column between the allele inherited from Dad and one from Mom called Dad informative pattern.

dad-informative-pattern

The idea is that there will be a pattern that lasts for a long time as we go down the results sequentially. These are the patterns of the segments that we inherit from our grandparents. Where the patterns change are the crossovers. Whit says to use those patterns to fill some of the missing letters. I haven’t started filling in the missing bases yet for a few reasons. One is that I’m not sure why I need to. In scanning the Athey paper there is a repetitive procedure of going back and forth between the data using the base from Dad’s side fill-in’s to help with the base from Mom’s side and then back again. First I’m not sure how to automate this yet. And if I could, how much better would the data be? I have quite a bit of data already. Once I get some answers of why I need to do this, I will continue on.

Here is a paragraph from the Athey Paper concerning the above Table 2:

Note the pattern of inheritance from Dad shown in Table 2 for the four siblings in the leftmost four columns. The first few rows show an AABB base pattern, but this gives way in about lines 12-13 to a new pattern, ABBB. Even though we only can see the pattern showing in some of the rows, these patterns persist over hundreds or thousands of SNPs, and can be assumed to exist also in the intervening rows where no pattern was discernable (and in the underlying sequence). Note that often there will be the same base in every location, a case of “accidental matching” which does not contribute to or detract from the pattern we are looking for. When two or more bases are different in a row, however, this represents an informative pattern—if any two are different, then since there are only two possible chromosomes contributing, it means we can see the chromosomal origins of the bases.

One of the reasons that I quote the above is to address the accidental matching where there was the same contributing parent base for each sibling. However, what I didn’t see addressed is that there are cases where that is not just accidental which I will discuss later.

Finding the crossovers

I do know the importance of finding the crossovers. I wrote a query in Access to cull out the patterns that Whit mentions.

query-for-patterns

Above is my query in design view using the table that has Principles, 1, 2, and 3 already applied. This query basically filters out the situations where the 3 siblings have the same base. The thought is that if one sibling has one base that is different from one of the others, then the three siblings’ will not share the same base.

results-of-dad-pattern-query

Above is the start of the results of the query. Note the XYX pattern. This should make it possible to fill in Heidi’s missing bases from Dad. It looks like multiple choice test answers, but I would add C, G, C, C, C and A in the last column for the bases that Heidi got from Dad.  My homework assignment is to find a formula to fill in those letters so I don’t have to do it manually 10’s of thousands of times.

Another thing I want Access to do is find where the crossovers are. Here I scrolled down all the bases the my sisters and I had from Dad. I can see where the XYX pattern changes to XYY:

change-in-dad-pattern

But there was a problem. the XYX pattern stopped at position 18,759,377 and the XYY pattern started at 23,288,828. That means we have a large area with no pattern. Exactly. That is the area of XXX pattern that I just queried out. That has to be the area where all three siblings match the same paternal grandparent.

Checking my results with m macneill’s work

Fortunately, I have secret weapon. M MacNeill – prairielad_genealogy@hotmail.com has also been looking at my raw DNA using his own Excel spreadsheet method. Here is what he has for Chromosome 1:

macneill-chr1

Now just look at the first 3 red bars above. They represent my paternal side. The first break would be on Sharon’s bar – the third red bar from the top. The end of her dark red bar is at 18,631,964:

sharon-chr-1

Look at Sharon’s bar in that region and then scan up the 3 red bars. There is an area where all three siblings match on the paternal grandmother side (lighter red).

That is my paternal XXX Pattern.

To satisfy my curiosity, I went back to my unfiltered/unqueried table at the spot that the first pattern changed from XYX to XXX. The end of the first base pattern from Dad is highlighted in blue.

xyx-to-xxx

Line 2 is a no-call. Line 3 is one of the random XXX matches in the XYX pattern area that Athey mentioned above. Note that I could not likely fill in line 4 with what I know as I don’t know if that should be AAA, AGA, or something else. Actually, I could fill in Heidi’s with an A. If her results are AAA or AGA, Heidi still gets the A from Dad. It is only Sharon’s base from Dad that I don’t know.

However, starting at CCC, it seems like it would make sense to fill in all the letters in the XXX pattern area – even if there is only one known base out of three.

Converting Build 37 to Build 36 positions

At the top of the Blog I had mentioned that AncestryDNA results were in Build 37. M MacNeill’s work is in Build 36. I really didn’t want to have to convert results and thought that I was being clever by using all AncestryDNA results. However, to compare to M MacNeill’s Map above or to Gedmatch results, I still have to convert positions. Hey, life is tough.

NCBI genome Remapping service

Fortunately there is a way to convert positions here.

conversion

Assuming we are all homo sapiens, we select that choice and we select that we want to go from Build 37 to 36:

build-37-to-36

Here is the place to enter the data we want converted. It has to be in the format below – “chr1:” followed by the position number.  There is also a place to upload a file which I haven’t tried.

data-for-conversion

These are the 2 positions from my query where one pattern stopped and another started. Here is what they look like in Build 36 under Map Location:

chr-conversion-results

These Build 36 position numbers match up perfectly with M MacNeill’s map positions which gives me some confidence. This is where I’ll end Part 1.

I have found 2 paternal crossover points. However, I have not yet figured out which siblings they belong to – unless I cheat and look at the MacNeill Map above. I can easily do the same thing and find the pattern changes for the maternal side. I have shown 2 crossovers, but all the others exist in my query for 23 chromosomes. I just haven’t looked for them yet.

Summary

  • The Whit Athey Paper has been very helpful in phasing my raw DNA based on my mother and 2 siblings test results.
  • M MacNeill has piqued an interest in raw DNA data that I never thought I would have
  • M MacNeill’s Chromosome Maps are very helpful in checking my work
  • MS Access appears to be a great tool to use to quickly phase a lot of raw DNA
  • There is probably no way around DNA remapping or conversions
  • I still need:
    • An easy way to find all the crossover points
    • A formula to fill in the various patterns
    • A good reason to fill in those missing bases
  • I have a lot more to learn about DNA phasing using raw DNA data

 

 

 

More Hartley DNA – Patricia’s DNA

This blog is a follow-up on my last Blog: My Hartley Autosomal DNA. I was inspired to write that blog following this year’s Hartley reunion in Rochester, Massachusetts. I intended to send around a little poster I made up about Hartley DNA and get a DNA sample from my father’s cousin Martha, but didn’t get a chance to. Instead I wrote a blog. I did talk to Patricia though. She is my second cousin and the sister of my childhood best friend, Warren. She had taken an AncestryDNA test. I think her daughter bought it for her. I asked if she could upload her DNA to gedmatch.com and she said that her daughter would be good at doing that.

Here are Patricia’s 2 brothers and Patricia. The one in the middle was my best friend in my first 6 years of school. I remember seeing home movies of Curtis, Warren’s older brother. He came to one of my older siblings’ birthday party when he was about this age.

Patricia and family

In my last blog, I wrote about the Hartley DNA matches my father’s first cousin Jim had with me and my 2 sisters. I was surprised to find out that every match that we had represented one of my four 2nd Great Grandparents. They were all born around the 1830’s. It turns out that Patricia’s matches with cousin Jim represent the same four 2nd great grandparents. In addition Patricia’s DNA matches with my 2 sisters and me represent the same four old timers.

Here is what my DNA match to Patricia looks like at AncestryDNA:

Patricia Ancestry

Here, AncestryDNA has it right that we are 2nd cousins. They show we match for a total of 206 cM (centimorgans) across 14 DNA segments. That is about all you can get out of ancestry. They won’t tell you which chromosomes we match on or how much we match on each chromosome. That is why people upload their results to gedmatch.com. Ancestry does show other people that match DNA to both Patricia and me. These are my 2 sisters and 5 others. All these people also descend from the same Rochester Hartley ancestors, but none of them have uploaded their results to gedmatch.com, so we don’t know their detailed DNA matching information.

Here is the same match between Patricia and me at Gedmatch:

Pat Joel Gedmatch

Ancestry has 14 segments vs. the 8 at Gedmatch. But at Gedmatch we know on which chromosome we match, how much on each chromosome and the exact start and stop location on the Chromosome. However, even with Ancestry’s 14 segments, their total is a bit smaller. Here is how I match Patricia on Chromosome 15 in the Gedmatch Chromosome Browser:

Joel Pat Chr 15

The blue areas represent the two DNA matches Patricia and I have on Chromosome 15.

Patricia on the Hartley Family Tree

Growing up, Patricia’s grandmother was my great aunt and also one of my neighbors, my Aunt Mary.

Patricia's Tree

The bottom box in each row are the people that have tested their DNA and uploaded to gedmatch.com. I now show 3 of the 13 children of James Hartley and Annie Louisa Snell (James, Mary and Annie). I now can check how my sisters and I match Patricia’s DNA as well as how Patricia matches Jim’s DNA.

Here are my great grandparents and three of their older children.

James and Annie Hartley

It is in interesting photo. Two of the children are looking away. I think that one is my grandfather James. The mother, Annie, is looking at something in her hands. The older son Dan is looking at a book and the father James doesn’t look comfortable being dressed up.

Patricia’s DNA at Gedmatch

One of the basic functions at gedmatch is called ‘One to Many’. In this case, I took Patricia’s DNA and compared them to everyone else that has ever uploaded their DNA results to gedmatch. Here are her 1st 4 matches:

Patricia's 1st 4 matches

Not surprisingly, her top matches are her 1st cousin, once removed, Jim, me and my sister’s Sharon and Heidi. The Gen column lists how far away gedmatch thinks Patricia’s matches are to a common ancestor. Patricia and I are 3 generations to James Hartley and Annie Snell, so that is right. Patricia shows 2.6 generations to a common ancestor with her match to Jim. A first cousin once removed would typically be 2.5 generations, so she shares a little less DNA than average here with Jim. Patricia also shares 19.3 cM of the X Chromosome with cousin Jim which I find interesting.

The Hartley X Chromosome

I’m taking the X Chromosome out of order because I find it interesting. There is one most important thing to know about the X Chromosome. If you are a male, you get one from your mother. If you are a female, you get one from your mother and one from your father. My father only got an X chromosome from his Frazer mother, so he doesn’t match anyone further up on the Hartley line by the X Chromosome. However Patricia and Jim both have maternal matches that carry up the line.

Here is how Jim got his X Chromosome from his mother and her ancestors:

Jim's X Inheritance

Jim only inherited his X Chromosome from those ancestors in pink or blue. So, for example, he got no X Chromosome from any Bradford before Harvey Bradford.

We need to compare Jim’s chart with Patricia’s X Inheritance Chart:

Patricia's X Inheritance

Here I didn’t show the X Chromosome that Patricia got from her father as this won’t match Jim. Then of what I show, only the bottom half will match Jim. This means that going back 4 generations from Patricia, she could match Jim by the X Chromosome on the Emmet, Snell or Bradford Line. One other difference between Jim and Patricia is that Jim got 100% of his total X Chromosome from his mother and Patricia only got 50%. However, that is a confusing way to put it because Patricia did get 2 X Chromosomes. So her one 50% must be similar to Jim’s 100% if that makes sense.

Here is what the X Chromosome match looks like between Patricia and Jim at gedmatch.com on their browser:

Jim Patricia X Match

The yellow part with the blue under it is where they match at the end of the X Chromosome. That is enough on my X diversion for now.

Back to the Hartley DNA Matches on the Other 22 Chromsomes

At gedmatch, I go to the Jim’s ‘One to Many’ matches to see how he matches my family and Patricia. Here are Jim’s top 4 matches. You may have already guessed who they are:

Jim's top 4 matches

Above, I said that Patricia matched Jim a little less than expected. My sister Heidi at the top of the list matches him a little more than average.

Here are Jim’s DNA matches on Chromosome 1

Pat Chr 1

  1. Me
  2. Heidi
  3. Sharon
  4. Patricia

Here Patricia has identified a new piece of DNA in green that is a Hartley ancestor that we didn’t know about before. Again, this “Hartley” ancestor may be Hartley, Emmet, Snell or Bradford.

Here is another new Hartley segment on Chromosome 2:

Pat Chr 2

Patricia matched Jim on Chromosome 2. My sisters and I had no match with Jim on that Chromosome.

It looks like Patricia got a double segment of Hartley DNA on Chromosome 5:

Patricia Chr 5

Patricia is #1 above. Where the color changes from orange to yellow likely represents a change from Greenwood Hartley to Ann Emmet DNA or Isaiah Snell to Hannah Bradford DNA.

Patricia Helping Me Map My Chromosome 7

I’ve tried to map all my chromosomes as well as my 2 sisters’ to my 4 grandparents. I got a little stuck on Chromosome 7:

Chr 7 Map Pat

My chromosome 7 depiction is the one with the J to the left of it. On my paternal side (which is the blue (FRAZER) and red bar), I have the DNA I got from my dad’s mother in blue and my dad’s Hartley dad in red. Above that is the gedmatch depiction of how I match my 2 sisters by DNA and how they match each other. The bright green bar is called the Fully Identical Region or FIR. This means wherever that occurs a sibling matches the other sibling by getting the same DNA from the same 2 grandparents (one maternal and the other paternal). So in comparing Sharon to Heidi, they have that FIR from 0 to 25. It turns out that their 2 grandparents were their mother’s mother (Lentz) and their father’s father (Hartley). In the tiny section between 0 and 4, I have what is called a Half Identical Region or HIR. That means that I shared one grandparent’s DNA  with my sisters and the other grandparent I didn’t get any of their DNA. In this case I had to share either the Lentz or Hartley grandparent with my 2 sisters, but I didn’t know which.

That is where Patricia’s results came in handy. Here is how she matches Sharon, Heidi and me:

Patricia Chromosome 7

Patricia has 3 good matches with Sharon and Heidi and one tiny one with me (#3 on the Chromosome Browser). However, the tiny one is the one I need. The pink match shows that my Chromosome 7 from 0-4 (in millions) is where I got my DNA from my Hartley grandfather and not my Frazer grandmother.

Here is my completed Chromosome 7 thanks to Patricia. I extended the Rathfelder on my Chromosome 7 all the way to the left or beginning and added a small chunk of red Hartley from my grandfather.

Chr 7 complete

Another Type of Chromosome Mapping

There’s is another type of Chromosome Mapping developed by Kitty Munson. The way the Munson Mapping is generally used is to map out your relatives’ common ancestors. In the case of Patricia and Jim our common ancestors are James Hartley and Annie Louisa Snell. Here is what my new Chromosome Map looks like with the addition of Patricia’s DNA matches with me shown in blue.

New Kitty Map for Joel based on Pat

Well, that’s about enough for Patricia’s DNA for now.

Summary and Conclusions

  • Patricia shared the first Hartley X Chromosome match that I’ve seen.
  • The X tends to shy away from the male line, so Patricia and Jim’s match is more likely down somewhere in the Massachusetts colonial line rather than the English Line.
  • I would like to use Hartley DNA to break through the Hartley genealogical brick wall. Right now I’m stuck in the early 1800’s in Trawden, England. There were too many Hartleys there with the same first name to figure out who was who. Patricia’s DNA may help in finding matches to other Hartleys
  • Patricia’s DNA helped me in mapping my chromosomes in 2 different ways.