In my last Blog, I looked at Maternal and Paternal Patterns created by initial phasing of my raw DNA. The Paternal side patterns were pretty complete, but I didn’t have much on the Maternal side. I summarized the patterns I found in a spreadsheet. I added siblings’ crossovers where found and added start and end positions on each chromosome where found.
Filling in the Blanks
- First I made copies of my Mom and Dad Spreadsheets
- Then I cleaned them up, taking out the Patterns that were only at one location (one SNP)
- Then I used a filter to get the patterns.
The AAAAB Pattern filter on the Paternal Patterns spreadsheet looks like this:
On the Maternal Table, it looks like this:
I’ll start with the maternal Patterns as it will be easier. First I copied that Table I was working on to create a new Fillin Table called tblFillinStep1.
Here is what I portion of the Chromosome 10 AAAAB Maternal Pattern looks like:
In my previous analyses before I had 5 siblings tested, I had sequential ID’s. This made things easier in choosing patterns. Now, however, I am having trouble getting the ID’s sequential. This is probably due to my adding the extra alleles. My options now are to update the missing alleles by hand or create queries to update them.
Just Like Starting Over – a New Access Database
I decided to try importing all the raw DNA files into a new Access Database. This time I won’t let Access assign the key data field. I’ll use the rsid as the key data field. Perhaps then I can assign a new ID that will apply to the merged V1/V2 AncestryDNA dataset.
One problem that I’ve noted with the AncestryDNA downloads is that they only give you a date on the filesname. There is no indication in the file or in the file name of who the data is for. That means that it is important to rename your files. I uploaded my sister Heidi’s results to Gedmatch on 26 April 2015, so that is a clue. I see an AncestryDNA zip file from 25 April 2015, so that is a clue. I guess I’ll add her name to the file now!
The Clean start
Here is the Access Database with just 5 tables:
Here’s my Query to create an AncestryDNA V1 Raw Data Table:
I see in my results, I got more lines than last time. I checked and I forgot to only include Chromosome 1-23. I corrected that and got the 700,153 lines as before. I did and analogous query for Jon and Lori and got 666,532 rows.
All V1 Results plus the v2 results where they are they match v1
Next, I’ll create a Query between the two tables I just made.
This will have a right hand join. As stated above, it will produce all the V1 data and only those records from V2 where they equal V1. That brings in Jonathan and Lori into the V1 results. I made this data into a new table.
Finding v2 data not in v1
Next I want to add only the V2 part that isn’t in V1. When I get this, I can add this to my query to get all the results. This is the query that I couldn’t remember how to do last time. I put V2 on the left, with a right hand query.
This gives everything from V2 plus the V2 that equals V1. However, the trick is that I set V1 to ‘Is Null’ which takes out the V1. That should give only the right section of the peach colored circle below.
For some reason, my new number for the right entire circle is 666,532. The new query results is the right hand circle minus the overlapping data which is 242,494. I’ll append this query to the table I made with my previous Query. I renamed this table as tblV1andV2 and it has 942, 647 rows.
Next, I want to sort the table and add an ID. I copied the structure of the table to a new table. Then I added a Position ID (Pos ID) with an autonumber. Then I made a query to append the old data to the new table with the autonumbered Pos ID. This gave me what I wanted but not in the right order. So then I used the sort function to get the table in the shape I wanted:
I can tell the table is right by going to the last record:
Before the last record would include the V2 alleles that were added but they weren’t sorted. This long process gives me a sorted ID in the same relative position as the sorted Chromosome positions. The reason I need that is to describe my patterns in a simpler way than by Chromosome start and stop.
Summary and Next Steps
- Going down the road of filling in patterns, I found something that I needed to correct in an earlier step
- This caused me to start afresh with cleaner tables.
- I was able to add a unique ordered Position ID to the combined V1/V2 table
- This position ID will be used to identify each Pattern for filling in missing alleles
- However, I have to re-do the patterns. This time, I will include the Position ID in the start and end of each pattern in my summary of Mom and Dad Patterns.