Single Nucleotide Polymorphisms (SNPs) are common among human populations.
SNPs that are proximally located within a small human chromosome region are generally strongly
correlated that a subset of SNPs, termed tag SNPs, can provide enough information to infer neigh-
boring SNPs. Such correlations are generally known as linkage disequilibrium (LD) and are measured
either pair-wise, such as $r^2$, or multi-to-one (multi-marker). For any given set of SNPs, a variety of
algorithms have been proposed to identify a subset of tag SNPs by which the remaining SNPs can
be inferred. This paper focuses on finding that number of tag SNPs from which remaining SNPs
can be inferred through multi-allelic LD or pair-wise LD with a pre-defined $r^2$ threshold. We call
this the optimal tag SNP selection problem. Although this problem is theoretically NP-hard, it can
be formulated as an integer programming (IP) problem under a certain constraint, and the opti-
mal solution can be efficiently found by our newly developed IPMarker program. In addition, the
flexibility of the computational framework allows us to formulate and solve the problem of finding
common tag SNPs for multiple populations that have different LD patterns. Various datasets, in-
cluding ENCODE and the Major Histocompatiability Complex (MHC) region, were used to evaluate
the performance of IPMarker. We also extended IPMarker to the whole genome HapMap Phase I
data. Results showed that IPMarker significantly reduces the number of tag SNPs required when
compared to the most widely used program, Haploview, although a significant longer running time
is required. Thus, overall, genotyping a selected set of tag SNPs is the most cost-effective way to
conduct large-scale genome-wide association studies.