New Contact Info!
PSD, Inc.
1338 S. Foothill Suite 324
Salt Lake City, UT 84108
(801) 349-5559

 

GenMergeDB Overview

The technology used in GenMerge for record linking is available in GenMergeDB. You can license GenMergeDB or let our professionals work on your project on a fee basis.

GenMergeDB uses the same record linking algorithms as the desktop version, but uses a database system as its backing store, which eliminates the 200,000 record limit on record linking projects.  Many record linking parameters, including cutoff scores and weight tables, can be changed, which makes GenMergeDB useful for more general record linking projects.  GenMergeDB can be used to build large genealogies from family data, but it can also be used to reconstitute populations based on census records, vital statistics records (birth, death and marriage), probate data and other files with identifying information for individuals and families.  The following details describe the use of GenMergeDB to build a genealogy from GEDCOM files.

Building a genealogy consists of the following steps:

  • Preprocess source GEDCOM files
  • Link the GEDCOM files to the genealogy
  • Review the linking results
  • Update the genealogy database

Chart 1 illustrates this process.

GenMergeDB can also be used to link other types of family data (such as census data) or data with individual identifying information (such as DMV data or medical data).  Using source data in a different format requires implementing a driver in Java that reads the data and maps it to the GenMergeDB data model.

Pre-process Source GEDCOM files

Good record linking results are only obtained with good quality source data.  The first phase of the record linking job is the preparation of the source data.  Each GEDCOM file is processed individually to eliminate internal duplicates and to find and repair two types of errors: loops and inconsistent birth years.  A loop occurs when an individual is their own ancestor or a spouse is an individual’s ancestor.  The software breaks the loop by removing the individual as a child from the parents’ marriage.  A birth year inconsistency occurs when an individual has a birth year earlier than that of a parent or another ancestor.  If the inconsistency can be resolved by supplying a century (for example, the birth year is 20 and the relatives have birth years in the 1800s), the birth year is corrected.  Otherwise, the individual with the offending birth year is removed as a child from their parents’ marriage.
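The century repair can be sketched as follows. This is only an illustration of the idea, not GenMergeDB's code: the class and method names and the plausible parent-age window (12 to 70 years) are assumptions introduced here.

```java
final class BirthYearRepair {
    // Hypothetical bounds on a parent's age at the child's birth (assumed, not from GenMergeDB).
    static final int MIN_PARENT_AGE = 12;
    static final int MAX_PARENT_AGE = 70;

    /**
     * Returns the child's birth year, corrected with a century if needed,
     * or -1 when no century makes it consistent with the parent's birth year.
     */
    static int repair(int childYear, int parentYear) {
        if (childYear > parentYear) {
            return childYear; // already consistent: the child was born after the parent
        }
        if (childYear < 0 || childYear > 99) {
            return -1; // not a two-digit year, so supplying a century cannot fix it
        }
        int parentCentury = (parentYear / 100) * 100;
        // Try the parent's century first, then the following one.
        for (int century = parentCentury; century <= parentCentury + 100; century += 100) {
            int gap = century + childYear - parentYear;
            if (gap >= MIN_PARENT_AGE && gap <= MAX_PARENT_AGE) {
                return century + childYear;
            }
        }
        return -1;
    }
}
```

A result of -1 corresponds to the case where the individual would be removed as a child from the parents’ marriage.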

A quality score is computed for each GEDCOM file.  This score is a value between 0 and 100 that reflects, among other things, the number of errors, the number of individuals that have no names and no places, and the maximum number of generations found in the pedigree.  The lower the score, the poorer the quality of the GEDCOM file.  GEDCOM files with low scores (less than 15) are usually best eliminated from consideration.

Link the GEDCOM Files to the Genealogy

Once a set of GEDCOM files has been prepared and has been loaded into the database tables (GenyInd and GenyMar) they can be linked to the existing genealogy.  This process consists of building partitions, finding the best matches (find candidates) and then using a recursive step to look at the families of the best matches.   The final step is to update the CompositePerson, CompositeMarriage, UniquePerson and UniqueRelation tables with the results of the record linking.

Chart 1

Build Partitions

When linking a new data source to the existing genealogy it would be ideal to compare every record in the new data with each record in the genealogy.  In most cases this proves impractical.  Binning criteria are used to create partitions that contain the records most likely to be the same.  The binning criteria used by GenMergeDB are a phonetic coding of the surname and given name.  Each person with a surname and given name that sound similar will be grouped in a partition.  Comparisons are only done between records in the same partition.  Because of the recursive step in the record linking process, there is rarely any benefit in changing the binning criteria.  As long as at least one person in a family tree links, the entire family will be considered for linking.
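A minimal sketch of phonetic binning follows. The text does not say which phonetic coding GenMergeDB uses, so American Soundex stands in here as an assumption; the class name, the key format and the representation of a person as a {surname, given name} pair are also hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class PartitionSketch {
    // Soundex digit for each letter A..Z ('0' means the letter is not coded).
    private static final String CODES = "01230120022455012623010202";

    /** American Soundex: first letter plus up to three digits, zero padded. */
    static String soundex(String name) {
        StringBuilder out = new StringBuilder();
        char last = '0';
        for (char raw : name.toUpperCase(Locale.ROOT).toCharArray()) {
            if (raw < 'A' || raw > 'Z') continue;          // ignore non-letters
            char code = CODES.charAt(raw - 'A');
            if (out.length() == 0) {                       // keep the first letter as-is
                out.append(raw);
                last = code;
                continue;
            }
            if (raw == 'H' || raw == 'W') continue;        // H and W do not separate equal codes
            if (code != '0' && code != last) out.append(code);
            last = code;                                   // vowels reset the previous code
            if (out.length() == 4) break;
        }
        while (out.length() < 4) out.append('0');
        return out.toString();
    }

    /** Partition key built from both phonetic codes (format is an assumption). */
    static String key(String surname, String given) {
        return soundex(surname) + "|" + soundex(given);
    }

    /** Groups {surname, given} pairs so that only same-bin records are compared. */
    static Map<String, List<String[]>> partition(List<String[]> people) {
        Map<String, List<String[]>> bins = new TreeMap<>();
        for (String[] p : people) {
            bins.computeIfAbsent(key(p[0], p[1]), k -> new ArrayList<>()).add(p);
        }
        return bins;
    }
}
```

With this stand-in coding, "John Smith" and "Jon Smyth" land in the same partition, while "Mary Jones" lands in a different one.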

Find Candidates

The purpose of this record linking step is to find links that are well above the confidence threshold.   This process is called Find Candidates.  Each individual in a partition is compared with all other individuals in the same partition that have the same gender (or the gender of one person is unknown) and a birth year within 20 years (or the birth year of one person is unknown).  All individuals with a score over the Minimum Individual Score (default 110) are then checked for a matching family.  If the family matching is successful, a link is added to a cluster.  If the family matching fails, the link is rejected. 
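The comparison filter and score cutoff described above can be sketched as below. The 20 year window and the default of 110 come from the text; the class name, the use of 'U' for an unknown gender and of null for an unknown birth year are assumptions made for the example.

```java
final class CandidateFilter {
    static final int MAX_BIRTH_YEAR_GAP = 20;     // from the text
    static final int MIN_INDIVIDUAL_SCORE = 110;  // documented default

    /**
     * Decides whether two individuals in the same partition are compared at all.
     * Gender is 'M', 'F' or 'U' (unknown); a birth year of null means unknown.
     */
    static boolean eligible(char genderA, Integer yearA, char genderB, Integer yearB) {
        boolean genderOk = genderA == 'U' || genderB == 'U' || genderA == genderB;
        boolean yearOk = yearA == null || yearB == null
                || Math.abs(yearA - yearB) <= MAX_BIRTH_YEAR_GAP;
        return genderOk && yearOk;
    }

    /** Only scores over the minimum go on to family matching. */
    static boolean needsFamilyCheck(int score) {
        return score > MIN_INDIVIDUAL_SCORE;
    }
}
```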

The family matching looks at the two sets of parents, any spouses and any children.  In the event there are no overlapping relatives or there are fewer than three generations (parents, spouse, children), a sliding scale is used to allow individuals with high scores to match.  For example, if two candidates have no relatives, or no relatives that overlap, the score between the candidates must be 500 or over for the link to be accepted.

These links between individuals that do not have overlapping relatives are called “weak links” because there is no information available to verify the family connections.  A parameter controls whether these weak links are accepted.  The default is to accept weak links, and this is usually not problematic because of the high score required.  However, for some data it may be appropriate to change the value of this parameter and reject weak links.
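The weak link decision can be illustrated with a small sketch. It covers only the one point of the sliding scale that the text states (a score of 500 or over when no relatives overlap); the intermediate steps of the scale are not given, so they are left out. The class, method and parameter names are hypothetical.

```java
final class WeakLinkRule {
    static final int NO_RELATIVES_MIN_SCORE = 500; // from the example in the text

    /**
     * @param score               score between the two candidates
     * @param overlappingRelatives number of relatives the two candidates have in common roles
     * @param familyMatches       result of family matching when relatives overlap
     * @param acceptWeakLinks     the weak link parameter (accepting is the documented default)
     */
    static boolean accept(int score, int overlappingRelatives,
                          boolean familyMatches, boolean acceptWeakLinks) {
        if (overlappingRelatives > 0) {
            return familyMatches; // family evidence is available, so it decides
        }
        // No overlapping relatives: this is a weak link, accepted only on a very high score.
        return acceptWeakLinks && score >= NO_RELATIVES_MIN_SCORE;
    }
}
```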

As candidates are added to clusters several things can happen.  If neither of the candidates is already in a cluster, then a new cluster is created and the two individuals are added to the new cluster.

If one of the candidates is already in a cluster, then the new person is scored against everyone already in the cluster to ensure that there are no conflicts.  If the new person does not score well with someone already in the cluster, we remove the person that provides the least information (has the lowest score).  This means that the new person may displace someone already in the cluster or the new person will not be added.  Because we exhaustively compare every person in the partition, the displaced individual will be seen again, so that at the end of the comparisons each record is in the best cluster and no disambiguation is required.  This process ensures that the clusters have closure: each individual in the cluster links to every other individual.

The final case is when both people are already in clusters.  All individuals in both clusters are compared and either the clusters will be combined or the new potential link is discarded. 
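The three cases can be sketched with a small cluster store that preserves closure. This is a simplification under stated assumptions: when a conflict is found the new link is simply rejected, so the displacement of the lowest scoring member is not modelled. The class name, the string identifiers and the pluggable scoring function are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.ToIntBiFunction;

final class ClusterSketch {
    private final Map<String, Set<String>> byMember = new HashMap<>();
    private final ToIntBiFunction<String, String> score;
    private final int threshold;

    ClusterSketch(ToIntBiFunction<String, String> score, int threshold) {
        this.score = score;
        this.threshold = threshold;
    }

    /** Adds an accepted link between a and b; returns false if closure would be broken. */
    boolean addLink(String a, String b) {
        Set<String> ca = byMember.get(a);
        Set<String> cb = byMember.get(b);
        if (ca == null && cb == null) {              // case 1: start a new cluster
            Set<String> fresh = new HashSet<>(Arrays.asList(a, b));
            byMember.put(a, fresh);
            byMember.put(b, fresh);
            return true;
        }
        if (ca != null && ca == cb) return true;     // already in the same cluster
        // Cases 2 and 3: every member on one side must score well with every member on the other.
        Set<String> left = ca != null ? ca : Collections.singleton(a);
        Set<String> right = cb != null ? cb : Collections.singleton(b);
        for (String x : left) {
            for (String y : right) {
                if (score.applyAsInt(x, y) < threshold) return false;
            }
        }
        Set<String> merged = new HashSet<>(left);
        merged.addAll(right);
        for (String id : merged) byMember.put(id, merged);
        return true;
    }

    /** Members of the cluster containing id, or an empty set if id is unclustered. */
    Set<String> members(String id) {
        return byMember.getOrDefault(id, Collections.emptySet());
    }
}
```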

When Find Candidates completes, the clusters each contain two or more individuals who are duplicates.  This is the set of duplicates that seeds the Recursive record linking step.

Recursive Step

The final phase of record linking is to look at the people that are already clustered and decide if their parents, spouses and children should also be clustered.  Consider a simple example of a cluster with the records for two men, who have only parents for relatives.  In an ideal situation, the two fathers would already be in a cluster and the two mothers would be in a cluster and there is no work to do.  However, because of name variations or missing data this is often not the case, so this is an opportunity to cluster records that were missed in the initial pass.  Once the father and mother clusters are added (if they weren’t already there), these clusters are then used to check their relatives for duplicates.   In this way the entire connected family is checked for possible duplication. 

The recursive step is critical for picking up duplication where there are differences in names.
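The way linked pairs propagate through a family can be sketched with a work queue. This is an illustration only: it assumes each person has at most one relative per role (for example "father" or "mother"), which does not cover multiple spouses or children, and it assumes a single score threshold for relatives. The class name, the map layout and the scoring function are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.ToIntBiFunction;

final class RecursiveStepSketch {
    /**
     * Starting from seed pairs, links same-role relatives that score at or above
     * the threshold, then repeats for the relatives of each new pair.
     * relatives maps a person id to (role -> relative id).
     */
    static List<String[]> expand(List<String[]> seeds,
                                 Map<String, Map<String, String>> relatives,
                                 ToIntBiFunction<String, String> score,
                                 int threshold) {
        Deque<String[]> queue = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        for (String[] seed : seeds) {
            if (seen.add(seed[0] + "|" + seed[1])) queue.add(seed);
        }
        List<String[]> links = new ArrayList<>();
        while (!queue.isEmpty()) {
            String[] pair = queue.poll();
            links.add(pair);
            Map<String, String> left = relatives.getOrDefault(pair[0], Collections.emptyMap());
            Map<String, String> right = relatives.getOrDefault(pair[1], Collections.emptyMap());
            for (Map.Entry<String, String> entry : left.entrySet()) {
                String a = entry.getValue();
                String b = right.get(entry.getKey());      // relative in the same role
                if (b == null || a.equals(b)) continue;    // nothing to compare
                // Each relative pair is examined once; accepted pairs are expanded in turn.
                if (seen.add(a + "|" + b) && score.applyAsInt(a, b) >= threshold) {
                    queue.add(new String[] {a, b});
                }
            }
        }
        return links;
    }
}
```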

Update the Genealogy

The genealogy consists of two sets of tables.  The first set, CompositePerson and CompositeMarriage, is used for additional record linking.  CompositePerson contains one row for each cluster.  The demographic data for the composite record is the most complete data available from all the records in the cluster.  The second set, UniquePerson and UniqueRelation, is essentially the permanent record of the clusters.  The UniquePerson table maps each CompositePerson record to the original source records that make up that person.

Once the record linking is complete and the clusters are satisfactory, these genealogy tables are updated.  Information from new records that linked to an existing CompositePerson record is added to the CompositePerson record if it enhances the CompositePerson record.  New CompositePerson records are built for any new records that did not link to existing records.  CompositeMarriage records are handled the same way.
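One way to picture how a new record enhances a composite record is the sketch below. It assumes that "enhances" means filling in a field the composite record is missing; the real rule may be richer. The class name and the representation of a record as a map of field names to values are hypothetical.

```java
import java.util.Map;

final class CompositeMergeSketch {
    /**
     * Copies values from a newly linked source record into the composite record
     * for every field the composite lacks. Returns true if anything was added.
     */
    static boolean enhance(Map<String, String> composite, Map<String, String> source) {
        boolean changed = false;
        for (Map.Entry<String, String> entry : source.entrySet()) {
            String current = composite.get(entry.getKey());
            String incoming = entry.getValue();
            boolean missing = current == null || current.isEmpty();
            if (missing && incoming != null && !incoming.isEmpty()) {
                composite.put(entry.getKey(), incoming); // existing values are never overwritten
                changed = true;
            }
        }
        return changed;
    }
}
```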

The UniquePerson table is updated to reflect the clusters.  New records that linked to an existing CompositePerson add rows to the corresponding UniquePerson.  New records that did not link to CompositePerson records are added as new UniquePerson records.  The UniqueRelation table is updated in the same way.

These updates are accomplished with a series of database queries issued by GenMergeDB.

Running a record linking job

GenMergeDB has a programmer interface that allows record linking jobs to be initiated and monitored under program control.  The API consists of a set of methods that can be called to set up and run the different steps of the record linking job.  Details of the interface may be found in the Javadoc supplied with the release.

The basic steps of a record linking job are:  preprocessing the GEDCOM files, linking the GEDCOM files to the genealogy and updating the genealogy with the record linking results.

The following methods are all on the static GenMergeDB object which is created and referenced in the DataSource class:

public static GenMergeDB gmdb = DataSource.gmdbDS;

Initialize GenMergeDB

GenMergeDB relies on a property file for several settings to connect to the database.  Call InitSystem to pass the property file to GenMergeDB.  This will connect to the database and set up internal structures in preparation of other method calls.

public boolean InitSystem (String propertyFile)

Preprocess GEDCOM files 

GEDCOM files are preprocessed and added to the GenyInd and GenyMar database tables as the first step in the record linking process.   Each GEDCOM file added to the database has a row in the Source table and is identified with a unique identifier called the gid.  Once a GEDCOM file is in the GenMergeDB system it is referred to using the gid number.

The following methods are available to add and remove GEDCOM files from the database:

public int addGedcoms ( String [] gedPathList, boolean verbose )

Given an array of path names, this method preprocesses each file, puts the results in GenyInd and GenyMar, and adds a row to the Source table.  The row has an initial status of UNPROCESSED.  Setting the verbose parameter to true echoes progress to the command window (System.out).  If a pathname already exists in the database, that file will not be reprocessed.  To reprocess a file, its row must first be removed from Source.  The value returned from addGedcoms is the number of files added to the database.

An entry is also made in the RLJobs table for each file added successfully, in anticipation of record linking.

public int removeGedcoms ( int [] gidList, boolean verbose )

This method removes GEDCOM files from the database (all the rows from GenyInd, GenyMar, Source and RLJobs).  GEDCOM files that have been linked to the genealogy cannot be removed.
 
If the RLJobs table does not contain the gids of the files to be processed, the following methods can update the RLJobs table with either all unprocessed sources, or just a specific list.

public int addUnProcessedSources ( boolean verbose )
 
Reset the RLJobs table to contain all sources that have not been linked. 
 
public int addUnProcessedSources ( int [] gidList, boolean verbose )

Reset the RLJobs table to contain the gids in the provided gidList.

Run the record linking job

Once the RLJobs table has the list of gids to process, a record linking job is started using the following method:

public int startLinkingRun ( int command )

The commands that can be used are:

FULL - build partitions, record link, update the database
LINK - build partitions, record link
UPDATEDB - update the database

The RLStatus table consists of one row that is updated as a record linking job progresses.  As a step completes, the status is updated to reflect the completed step.  When the status of a record linking job is CREATEPARTITIONS the partitions have been created and FINDCANDIDATES is in progress.  When the status is FINDCANDIDATES, the RECURSIVE step is in progress.  When the status is RECURSIVE the record linking is complete.  The final state of a record linking job is UPDATEDB.  When a row in the database has the value UPDATEDB there is no active record linking job.  This row may be removed from the table at this point.  GenMergeDB removes completed rows when a new job is started. There can be only one record linking job in progress at a time.  If there is a row in RLStatus with a status that is not UPDATEDB, then a new record linking job may not be initiated. 
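The status values and what each one implies can be summarised in a small enum. This is a reading aid, not part of the GenMergeDB API: the enum, its method names and the description strings are hypothetical, and the meaning given for JOBSTARTED is inferred from its use as a reset state rather than stated in the text.

```java
enum RunStatus {
    JOBSTARTED, CREATEPARTITIONS, FINDCANDIDATES, RECURSIVE, UPDATEDB;

    /** What the recorded status says about the job, following the description above. */
    String describe() {
        switch (this) {
            case JOBSTARTED:
                return "job started, building partitions"; // inferred, not stated in the text
            case CREATEPARTITIONS:
                return "partitions created, find candidates in progress";
            case FINDCANDIDATES:
                return "candidates found, recursive step in progress";
            case RECURSIVE:
                return "record linking complete, database not yet updated";
            default:
                return "database updated, no active record linking job";
        }
    }

    /** A new job may start only when RLStatus has no row (null) or its row says UPDATEDB. */
    static boolean canStartNewJob(RunStatus current) {
        return current == null || current == UPDATEDB;
    }
}
```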

As each step of a record linking job is completed, a row is entered in the Log table with the runid, a time stamp and the description of the step.

If the record linking job is only run through the record linking step one of three actions is appropriate:

a) the jobmanager is started to complete the linking run (UPDATEDB)
b) linking parameters are modified, the job is reset (see resetLinkingRun) and the jobmanager is used to rerun the linking
c) all the data for this run is deleted (see discardRLRun)

Discard the Results of a Record Linking Run

If the results of a record linking run are not satisfactory, the following method clears all the temporary tables, deletes the rows from RLJobs and deletes the row from RLStatus.  An entry is made in the log file indicating that the run was discarded.

public int discardRLRun ( )

Reset a Record Linking Run

The reset method resets the record linking run to a previous state:

public int resetLinkingRun ( int state )

The reset state can be one of the following two previous states:

JOBSTARTED - Reset to this step to repeat the building of partitions.  Do this if the binning criteria change.

CREATEPARTITIONS - Reset to this step to repeat the record linking.  Do this if record linking options or thresholds have changed.
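The call sequence for a link-only run followed by a rerun can be sketched as below. Because the GenMergeDB classes are not reproduced here, the sketch uses a hypothetical LinkingApi interface that mirrors the method signatures shown in this document so that it compiles on its own. The integer values of the commands and states, and the meaning of the int return codes, are not documented above, so they are passed through as parameters rather than interpreted.

```java
import java.util.OptionalInt;

/** Hypothetical stand-in that mirrors the documented GenMergeDB method signatures. */
interface LinkingApi {
    boolean InitSystem(String propertyFile);
    int addGedcoms(String[] gedPathList, boolean verbose);
    int startLinkingRun(int command);
    int resetLinkingRun(int state);
    int discardRLRun();
}

final class LinkingWorkflowSketch {
    /**
     * Initializes the system, adds the GEDCOM files and starts a run with the
     * given command (for example LINK). Returns the value from startLinkingRun,
     * or empty if initialization failed or no file was added.
     */
    static OptionalInt linkOnly(LinkingApi api, String propertyFile,
                                String[] paths, int linkCommand) {
        if (!api.InitSystem(propertyFile)) {
            return OptionalInt.empty();
        }
        int added = api.addGedcoms(paths, false); // documented: number of files added
        if (added == 0) {
            return OptionalInt.empty();
        }
        return OptionalInt.of(api.startLinkingRun(linkCommand));
    }

    /**
     * After linking parameters have been changed: reset to the given state
     * (for example CREATEPARTITIONS) and run the linking again.
     */
    static int rerunLinking(LinkingApi api, int resetState, int linkCommand) {
        api.resetLinkingRun(resetState);
        return api.startLinkingRun(linkCommand);
    }
}
```

In real use the calls would go to the static GenMergeDB object (gmdb) rather than to this interface, and the command and state arguments would be the constants defined by the release.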

 



Pleiades Software Development, Inc.