Data mining is defined as the process of seeking interesting or
valuable information within large data sets. This process presents novel challenges and
problems, distinct from those typically arising in the allied areas of
statistics, machine learning, pattern recognition or database science. A
distinction is drawn between the two data mining activities of model building
and pattern detection. Even though statisticians are familiar with the former,
the large data sets involved in data mining mean that novel problems do arise.
The second of the activities, pattern detection, presents entirely new classes
of challenges, some arising, again, as a consequence of the large sizes of the
data sets. Data quality, a particularly troublesome issue in data mining
applications, is also examined. The discussion is illustrated with a
variety of real examples.