% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/fitPoly.r
\name{fitOneMarker}
\alias{fitOneMarker}
\title{Function to fit multiple mixture models to signal ratios of a single
bi-allelic marker}
\usage{
fitOneMarker(ploidy, marker, data, diplo=NULL, select=TRUE,
diploselect=TRUE, pop.parents=NULL, population=NULL, parentalPriors=NULL,
samplePriors=NULL, startmeans=NULL, maxiter=40, maxn.bin=200, nbin=200,
sd.threshold=0.1, p.threshold=0.99, call.threshold=0.6, peak.threshold=0.85,
try.HW=TRUE, dip.filter=1, sd.target=NA,
plot="none", plot.type="png", plot.dir, sMMinfo=NULL)
}
\arguments{
\item{ploidy}{The ploidy level, 2 or higher: 2 for diploids, 3 for triploids
etc.}

\item{marker}{A marker name or number. Used to select the data for one marker,
referring to the MarkerName column of parameter data. If a number, the number
of the marker based on alphabetic order of the MarkerNames in data.}

\item{data}{A data frame with the polyploid samples, with (at least) columns
MarkerName, SampleName and ratio, where ratio is the Y-allele signal
divided by the sum of the X- and Y-allele signals: ratio == Y/(X+Y)}

\item{diplo}{NULL or a data frame like data, with the diploid samples and (a subset
of) the same markers as in data. Genotypic scores for diploid samples are
calculated according to the best-fitting model calculated for the polyploid
samples and therefore may range from 0 (nulliplex) to <ploidy>, with the
expected dosages 0 and <ploidy> for the homozygotes and <ploidy/2> for the
heterozygotes.\cr
Note that diplo can also be used for any other samples that need to be
scored, but that should not affect the fitted models.}

\item{select}{A logical vector, recycled if shorter than nrow(data):
indicates which rows of data are to be used (default TRUE, i.e. keep all rows)}

\item{diploselect}{A logical vector like select, matching diplo instead of data}

\item{pop.parents}{NULL or a data.frame specifying the population structure. The
data frame has 3 columns: the first containing population IDs, the 2nd and 3rd
with the population IDs of the parents of these populations (if F1's) or NA
(if not). The population IDs should match those in parameter population. If
pop.parents is NULL all samples are considered to be in one population, and
parameter population should also be NULL (default).}

\item{population}{NULL or a data.frame specifying to which population each
sample belongs. The data frame has two columns, the first containing
the SampleName (containing all SampleNames occurring in data),
the second column containing population IDs that match pop.parents. In both
columns NA values are not allowed. Parameters pop.parents and population
should both be NULL (default) or both be specified.}

\item{parentalPriors}{NULL or a data frame specifying the prior dosages for
the parental populations. The data frame has one column MarkerName
followed by one column for each F1 parental population. Column names (except
first) are population IDs matching the parental populations in pop.parents.
In case there is just one F1 population in pop.parents, it is possible to
have two columns for each of the two parental populations instead of one
(making it possible to specify two different prior dosages per parent); in
that case both columns for a parent have the same caption. Each row specifies the priors for
one marker. The contents of the data frame are dosages, as integers from 0
to <ploidy>; NA values are allowed.\cr
Note: when reading the data frame with read.table or read.csv, set
check.names=FALSE so column names (population IDs) are not changed.}

\item{samplePriors}{NULL or a data.frame specifying prior dosages for individual
samples. The first column called MarkerName is followed by one column per
sample; not all samples in data need to have a column here, only
those samples for which prior dosages for one or more markers are available.
Each row specifies the priors for one marker. The contents of the data frame
are dosages, as integers from 0 to <ploidy>; NA values are allowed.\cr
Note: when reading the data frame with read.table or read.csv, set
check.names=FALSE so column names (sample names) are not changed.}

\item{startmeans}{NULL or a data.frame specifying the prior means of
the mixture distributions. The data frame has one column MarkerName,
followed by <ploidy+1> columns with the prior means on the original
(untransformed) scale. Each row specifies the
means for one marker in strictly ascending order (all means NA is allowed, but
markers without start means can also be omitted).}

\item{maxiter}{A single integer, passed to CodomMarker, see there for explanation}

\item{maxn.bin}{A single integer, passed to CodomMarker, see there for explanation}

\item{nbin}{A single integer, passed to CodomMarker, see there for explanation}

\item{sd.threshold}{The maximum value allowed for the (constant) standard
deviation of each peak on the arcsine-square root transformed scale,
default 0.1. If the optimal model has a larger standard deviation the marker
is rejected. Set to a large value (e.g. 1) to disable this filter.}

\item{p.threshold}{The minimum P-value required to assign a genotype (dosage)
to a sample; default 0.99. If the P-value for all possible genotypes is less
than p.threshold the sample is assigned genotype NA. Set to 1 to disable
this filter.}

\item{call.threshold}{The minimum fraction of samples to have genotypes
assigned ("called"); default 0.6. If under the optimal model the fraction of
"called" samples is less than call.threshold the marker is rejected. Set to 0
to disable this filter.}

\item{peak.threshold}{The maximum allowed fraction of the scored samples that
are in one peak; default 0.85. If any of the possible genotypes (peaks in the
ratio histogram) contains more than peak.threshold of the samples the marker
is rejected (because the remaining samples offer too little information for
reliable model fitting).}

\item{try.HW}{Logical: if TRUE (default), try models with and without a
constraint on the mixing proportions according to Hardy-Weinberg equilibrium
ratios. If FALSE, only try models without this constraint. Even when the HW
assumption is not applicable, setting try.HW to TRUE often still leads to
a better model. For more details on how try.HW is used see the Details
section.}

\item{dip.filter}{if 1 (default), select best model only from models
that do not have a dip (a lower peak surrounded by higher peaks: these are not
expected under Hardy-Weinberg equilibrium or in cross progenies). If all
fitted models have a dip still the best of these is selected. If 2, similar,
but if all fitted models have a dip the marker is rejected. If 0, select best
model among all fitted models, including those with a dip.}

\item{sd.target}{If the fitted standard deviation of the peaks on the
transformed scale is larger than sd.target a penalty is given (see Details);
default NA i.e. no penalty is given.}

\item{plot}{String, "none" (default), "fitted" or "all". If "fitted" a plot
of the best fitting model and the assigned genotypes is saved with filename
<marker number><marker name>.<plot.type>, preceded by "rejected_" if the
marker was rejected. If "all", small plots of all models are saved to files
(8 per file) with filename
<"plots"><marker number><A..F><marker name>.<plot.type> in addition to the
plot of the best fitting model.}

\item{plot.type}{String, "png" (default), "emf", "svg" or "pdf". Indicates
format for saving the plots.}

\item{plot.dir}{String, the directory where to save the plot files. Must be
specified if plot is not "none". Set this to "" to save plot files
in the current working directory.}

\item{sMMinfo}{NULL (default), for internal use only. Prevents unneeded checking
and recalculation of input parameters when called from saveMarkerModels.}
}
\value{
A list with components:
\describe{
 \item{log}{A character vector with the lines of the log text.}
 \item{modeldata}{A data frame as allmodeldata (see below) with only the
 one row with data on the selected model.}
 \item{allmodeldata}{A data frame with for each tried model one row with
 the marker number, marker name, number of samples and (if the marker is
 not rejected) data of the fitted model (see below).}
 \item{scores}{A data frame with the name and data for all samples
 (including NA's for the samples that were not selected, see parameter
 select), with columns:\cr
 marker (the sequential number of the marker, based on alphabetic
 order of the marker names in data)\cr
 MarkerName\cr
 SampleName\cr
 ratio (the given ratio from parameter data)\cr
 P0 .. P<ploidy> (the probabilities that this sample belongs to each of the
 <ploidy+1> mixture components)\cr
 maxgeno (0..ploidy, the genotype = mixture component with the highest P
 value)\cr
 maxP (the P value for this genotype)\cr
 geno (the assigned genotype number: same as maxgeno, or NA if
 maxP < p.threshold).}
 \item{diploscores}{A data frame like scores for the samples in the data
 frame supplied with argument diplo. If diplo is NA also diploscores will be
 NA.}
}
The modeldata and allmodeldata data frames present data on a fitted model.
modeldata presents data on the selected model; allmodeldata lists all
attempted models. Both data frames contain the following columns:
\describe{
 \item{marker}{the sequential number of the marker (based on alphabetic
 order of the marker names in data)}
 \item{MarkerName}{the name of the marker}
 \item{m}{the number of the fitted model}
 \item{model}{the type of the fitted model. Possible values are "b1", "b2", "b1,q",
 "b2,q", each by itself or followed by "HW" or "pop". The first 4 refer to
 the models for the mixture means: b1 and b2 indicate 1 or 2
 parameters for signal background, q indicates that a quadratic term in the
 signal response was fitted as well. HW and pop refer to the restrictions on
 the mixing proportions: HW indicates that the mixing proportions were
 constrained according to Hardy-Weinberg equilibrium ratios in case of only
 one population, pop indicates that multiple populations were fitted (see
 Details section). For more details see Voorrips et al (2011),
 doi:10.1186/1471-2105-12-172.}
 \item{nsamp}{the number of samples for this marker for which select==TRUE,
 i.e. the number on which the call rate is based.}
 \item{nsel}{the number of these samples that have a non-NA ratio value}
 \item{npar}{the number of free parameters fitted}
 \item{iter}{the number of iterations to reach convergence}
 \item{dip}{whether the model had a dip (a smaller peak surrounded by
 larger peaks): 0=no, 1=yes}
 \item{LL}{the log-likelihood of the model}
 \item{AIC}{Akaike's Information Criterion}
 \item{BIC}{Bayesian Information Criterion}
 \item{selcrit}{the selection criterion; the model with the lowest selcrit
 is selected. If argument sd.target is NA selcrit is equal to BIC, else
 selcrit is larger than BIC if the standard deviation of the mixture
 components is larger than sd.target; see the Details section.}
 \item{minsepar}{a measure of the minimum peak separation. Each difference
 of the means of two successive mixture components is divided by the average
 of the standard deviations of the two components. The minimum of the
 values is reported. All calculations are on the arcsine-square root
 transformed scale.}
 \item{meanP}{For each sample the maximum probability of belonging to any
 mixture component is calculated. The average of these P values is reported
 in meanP}
 \item{P80 .. P99}{the fraction of samples that have a probability of at
 least 0.80 .. 0.99 to belong to one of the mixture components (by default a
 level of 0.99 is required to assign a genotype score to a sample)}
 \item{muact0 ..}{the actual means of the samples in each of
 the mixture components for dosages 0 .. <ploidy> on the transformed scale}
 \item{sdact0 ..}{the actual standard deviations of the samples in each of
 the mixture components for dosages 0 .. <ploidy> on the transformed scale}
 \item{mutrans0 ..}{the means of the mixture components for dosages 0 ..
 <ploidy> on the transformed scale}
 \item{sdtrans0 ..}{the standard deviations of the mixture components for
 dosages 0 .. <ploidy> on the transformed scale}
 \item{P0 ..}{the mixing proportions of the mixture components for dosages
 0 to <ploidy>. If multiple populations are specified there are two
 possibilities: (1) the specified population structure is used in the
 current model; then for each population the mixing proportions are given
 as <npop> sequences of <ploidy+1> fractions, or (2) the population
 structure is ignored for the current model, the mixing proportions are given
 in the first sequence of <ploidy+1> fractions and all following sequences
 are filled with NA. The item names are adapted to have the
 population names between the P and the dosage}
 \item{mu0 ..}{the model means of the <ploidy+1> mixture components
 back-transformed to the original scale}
 \item{sd0 ..}{the model standard deviations of the <ploidy+1> mixture
 components back-transformed to the original scale}
 \item{message}{if no model was fitted or the model was rejected, the reason
 is reported here}
}
}
\description{
This function takes a data frame with allele signal ratios for
multiple bi-allelic markers and samples, and fits multiple mixture models to
a selected marker. It returns a list, reporting on the performance of these
models, selecting the best one based on the BIC criterion, optionally
plotting results.
}
\details{
fitOneMarker fits a series of mixture models for the given marker by
repeatedly calling CodomMarker and selects the optimal one. The initial
models vary according to the values of try.HW, pop.parents,
parentalPriors, samplePriors and startmeans:
\itemize{
 \item no pop.parents, try.HW FALSE: 4 models with different constraints
  on the means (different or equal X and Y background signal, ratio a linear or
  quadratic function of dosage), no restrictions on the mixing proportions
  (the fractions of samples in each dosage peak)
 \item no pop.parents, try.HW TRUE: The previous 4 models are fitted and
 also 4 models with the same restrictions on the means and the mixing
 proportions restricted to Hardy-Weinberg ratios (assuming polysomic
 inheritance)
 \item pop.parents specified, no parentalPriors / samplePriors / startmeans,
 try.HW FALSE: 4 models
 are fitted with the same restrictions on the means as above, but with
 different restrictions on the mixing proportions for each population:
 no restriction on parental populations, none on accession panels, polysomic
 F1 segregation ratios on F1 populations. Additionally 4 models are fitted
 with all samples considered as one population, with the same 4 models for
 the means and no restrictions on mixing proportions.
 \item pop.parents specified, no parentalPriors / samplePriors / startmeans,
 try.HW TRUE: 4 models
 are fitted with the same restrictions on the means as above, but with
 different restrictions on the mixing proportions for each population:
 no restriction on parental populations, HW-ratios for accession panels,
 polysomic F1 segregation ratios on F1 populations. Additionally 4 models
 are fitted with all samples considered as one population, with the same 4
 models for the means and mixing proportions according to HW ratios.
 \item pop.parents and parentalPriors specified, try.HW FALSE: 4 models
 are fitted with the same restrictions on the means as above, but with
 different restrictions on the mixing proportions for each population:
 no restriction on parental populations and the accession panels,
 polysomic F1 segregation ratios on F1 populations ignoring the parental
 priors. Additionally 4 models are fitted with the same restrictions on the
 means and mixing proportions of the accession panels, but where the
 mixing proportions of the parental populations are set to (almost) 1 for
 the prior dosage and (almost) 0 for all other dosages, and those for the
 F1 populations to the polysomic segregation ratios expected for the
 parental priors.
 \item pop.parents and parentalPriors specified, try.HW TRUE: same as with
 try.HW FALSE, except that the mixing proportions of accession panels are now
 restricted to HW ratios.
 \item if parentalPriors and/or samplePriors are specified, these and the
 signal ratios of the corresponding samples are (also) used to estimate starting
 values of the mixture component means in the EM algorithm. Alternatively
 startmeans can be specified directly.
}
Because convergence to the optimal solution often fails, the models are fitted
with several start values for the <ploidy+1> means of the mixture
distributions: (1) based on initial clustering of the ratios, (2) based on
a uniform distribution from 0.02 to pi/2-0.02 on the asin(sqrt(x)) scale,
and (3) if startmeans are specified or can be calculated from samplePriors
and/or parentalPriors these are used for a third set of model fits.\cr
The main difference between parentalPriors and samplePriors is that
parentalPriors are treated as fixed (and if both parents of an F1 population
have priors, the F1 segregation is also fixed) while samplePriors are only
used to calculate starting ratio means for each dosage. Depending on the
confidence the user has in the prior dosages of the parents they can be
supplied as parentalPriors or samplePriors.
In some cases an additional fit is performed with a modified set of initial
means.\cr
An optimal model is selected based on the Bayesian Information Criterion
(BIC), which takes into account the Log-Likelihood and the number of fitted
parameters of the models. If sd.target is specified and the standard
deviation of the mixture model components is larger than this target a
penalty is applied, making it less likely that that model is selected.\cr
The plots consist of one histogram per (non-parent) population showing the
frequency distribution of the signal ratios of the samples in that population.
The fitted model is shown in green (density and means), and for F1 populations
the samples of parent 1 and 2 are shown as red and blue triangles.\cr
If diplo data are present, a histogram for the diploid samples is plotted
in the top histogram (diploid bars are narrower and gray). The diploid bars
are scaled so the maximum bar is half the maximum polyploid bar. At the
bottom of the plot for the fitted model a rug plot shows the scores of each
sample, while the bottom (red) samples are unscored.
}
\examples{
\donttest{
 # These examples run for a total of about 9 sec.

 data(fitPoly_data)
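
 # A minimal sketch (not part of the package data) of how a data frame for
 # the data argument could be built from raw allele signals. The marker name,
 # sample names and the X and Y values are made up for illustration; only the
 # columns MarkerName, SampleName and ratio are described as required above.
 sig <- data.frame(MarkerName="mrkA",
                   SampleName=paste0("s", 1:4),
                   X=c(900, 600, 300, 50),
                   Y=c(100, 400, 700, 950))
 sig$ratio <- sig$Y / (sig$X + sig$Y)   # ratio == Y/(X+Y)
 mydata <- sig[, c("MarkerName", "SampleName", "ratio")]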

 # triploid, no specified populations
 fp <- fitOneMarker(ploidy=3, marker="mrk039",
                    data=fitPoly_data$ploidy3$dat3x)
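
 # A sketch of the calling rule described for p.threshold and call.threshold,
 # applied to a made-up table with 3 dosage classes (all values hypothetical).
 # The columns maxgeno, maxP and geno are also listed for the scores
 # component, so the same rule could be re-applied to fp$scores with another
 # threshold; this is an illustration, not the package's internal code.
 toy <- data.frame(P0=c(0.995, 0.40, 0.00, 0.01),
                   P1=c(0.005, 0.58, 0.02, 0.98),
                   P2=c(0.000, 0.02, 0.98, 0.01))
 # most probable dosage per sample (column index minus 1 gives dosage 0..2)
 toy$maxgeno <- max.col(as.matrix(toy[, c("P0", "P1", "P2")]),
                        ties.method="first") - 1
 toy$maxP <- pmax(toy$P0, toy$P1, toy$P2)
 # assign a genotype only if maxP reaches p.threshold (here 0.99)
 toy$geno <- ifelse(toy$maxP >= 0.99, toy$maxgeno, NA)
 # fraction of called samples, to be compared with call.threshold
 callrate <- mean(!is.na(toy$geno))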

 # tetraploid, specified populations
 # plot of the fitted model saved in tempdir()
 fp <- fitOneMarker(ploidy=4, marker=2,
                    data=fitPoly_data$ploidy4$dat4x,
                    population=fitPoly_data$ploidy4$pop4x,
                    pop.parents=fitPoly_data$ploidy4$pop.par4x,
                    plot="fitted",
                    plot.dir=paste0(tempdir(),"/fpPlots4x"))
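
 # A sketch of the layout described for pop.parents and population, using
 # made-up population and sample names. The arguments are described by column
 # position only, so the column names chosen here are an assumption.
 # One F1 population "F1" with parental populations "P1" and "P2", plus an
 # accession panel "panel":
 mypop.parents <- data.frame(popID=c("P1", "P2", "F1", "panel"),
                             parent1=c(NA, NA, "P1", NA),
                             parent2=c(NA, NA, "P2", NA))
 mypopulation <- data.frame(SampleName=c("mother", "father", "off1",
                                         "off2", "acc1"),
                            popID=c("P1", "P2", "F1", "F1", "panel"))
 # population IDs of the samples should match those in pop.parents,
 # and NA values are not allowed in population
 all(is.element(mypopulation$popID, mypop.parents$popID))
 !anyNA(mypopulation)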

 # hexaploid, specified populations, start values for means,
 # plot of the fitted model saved in tempdir()
 fp <- fitOneMarker(ploidy=6, marker=1,
                    data=fitPoly_data$ploidy6$dat6x,
                    population=fitPoly_data$ploidy6$pop6x,
                    pop.parents=fitPoly_data$ploidy6$pop.par6x,
                    startmeans=fitPoly_data$ploidy6$startmeans6x,
                    plot="fitted", plot.dir=paste0(tempdir(),"/fpPlots6x"))
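
 # A sketch of the start values mentioned in Details under (2). Here
 # "uniform" is interpreted as evenly spaced (an assumption) between 0.02
 # and pi/2-0.02 on the asin(sqrt(x)) scale; sin(x)^2 transforms back to the
 # original ratio scale. The marker name and the column names mu0..mu4 are
 # made up; startmeans needs MarkerName followed by <ploidy+1> means.
 ploidy <- 4
 mu.trans <- seq(0.02, pi/2 - 0.02, length.out=ploidy + 1)
 mu.ratio <- sin(mu.trans)^2   # inverse of asin(sqrt(ratio))
 mystartmeans <- data.frame(MarkerName="mrkA", t(mu.ratio))
 names(mystartmeans)[-1] <- paste0("mu", 0:ploidy)
 # the means must be in strictly ascending order
 all(diff(mu.ratio) > 0)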
}

}
