ISSN 2477-9105 Número 23 Vol.1 (2020)
method of clustering based on division. Given a certain value K, the algorithm divides the data into K disjoint groups. The K-means algorithm is simple and fast: its complexity is O(l·k·n), where l is the number of iterations, k the number of clusters, and n the number of data points. Furthermore, the algorithm normally converges in a small number of iterations (9).
K-means is a very popular method for general clustering. In K-means, each cluster is represented by the center of mass of its members, and it can be shown that the algorithm, which alternates between assigning each data vector to the nearest cluster center and recomputing each center as the centroid of its member vectors, is equivalent to minimizing a sum-of-squares cost function by coordinate descent (10).
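The alternating scheme described above can be sketched in a few lines. The following is a minimal illustration only (not an optimized implementation), using synthetic two-blob data invented for the example:

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Minimal K-means: alternate assignment and centroid updates.

    Each sweep is one pass of coordinate descent on the
    sum-of-squares cost J = sum_i ||x_i - c_{a_i}||^2.
    """
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each data vector goes to the nearest center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: each center becomes the centroid of its members.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Two well-separated synthetic blobs should be recovered exactly.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])
labels, centers = kmeans(X, k=2)
```

Consistent with the text, the loop typically converges here in a handful of iterations, well before the `n_iter` cap.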
The K-means algorithm is sensitive to outliers, since an object with an extremely large value can substantially distort the data distribution. How could the algorithm be modified to decrease that sensitivity? Instead of taking the mean of the objects in a cluster as a reference point, a medoid can be used: the most centrally located object in the cluster. Partitioning can then be performed on the principle of minimizing the sum of the dissimilarities between each object and its corresponding reference point. This forms the basis of the K-medoids method (11).
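The difference is easy to demonstrate numerically. In the sketch below (illustrative data only), a single extreme outlier drags the mean far away from the cluster, while the medoid, being an actual member object, stays inside it:

```python
import numpy as np

def medoid(X):
    """Return the most centrally located object: the data point whose
    total distance to all other points is smallest."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    return X[d.sum(axis=1).argmin()]

# A tight cluster near the origin plus one extreme outlier.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [0.3, 0.2], [100.0, 100.0]])
mean = X.mean(axis=0)  # dragged far toward the outlier
med = medoid(X)        # remains inside the tight cluster
```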
Hierarchical clustering with correlation (Hierarchical): This algorithm produces a hierarchy of clusters rather than a number of clusters fixed in advance. At the initial level, each observation forms its own group. At each subsequent level, the two "closest" groups are merged into a larger one. The "average" method is used, meaning that the distance between two groups is the average of the pairwise distances between their members (12).
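As a sketch of this average-linkage scheme (the implementations discussed in the text are R functions; SciPy is used here only as a stand-in), with invented two-group data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two tight groups; each observation starts as its own cluster.
X = np.vstack([rng.normal(0, 0.2, (5, 2)), rng.normal(3, 0.2, (5, 2))])

# "average" linkage: inter-group distance is the mean pairwise distance.
Z = linkage(X, method="average")

# Cutting the hierarchy at 2 clusters recovers the two groups.
labels = fcluster(Z, t=2, criterion="maxclust")
```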
Diana: At each step, a divisive method splits a group into two smaller ones until, finally, every group contains a single element. The hierarchy is thus built in n-1 steps when the data set contains n objects. A divisive analysis proceeds by a series of successive divisions. In step 0 (before the algorithm starts), all objects are together in a single cluster. At each step a group is divided, until in step n-1 all objects are separated (forming n groups, each with a single object) (13).
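Diana itself is an R function; a single division step can be sketched as follows. This is a simplified version, assuming the usual splinter procedure: seed the splinter group with the object having the largest average dissimilarity, then migrate objects while they are closer, on average, to the splinter group than to the remainder:

```python
import numpy as np

def split_cluster(X, idx):
    """One DIANA-style division of the objects indexed by idx: seed a
    splinter group with the most dissimilar object, then move over every
    object that is, on average, closer to the splinter group."""
    d = np.linalg.norm(X[idx][:, None, :] - X[idx][None, :, :], axis=2)
    seed = int(d.mean(axis=1).argmax())
    splinter = [seed]
    rest = [i for i in range(len(idx)) if i != seed]
    moved = True
    while moved and len(rest) > 1:
        moved = False
        for i in list(rest):
            if len(rest) == 1:
                break
            to_splinter = d[i, splinter].mean()
            to_rest = d[i, [j for j in rest if j != i]].mean()
            if to_splinter < to_rest:  # closer to the splinter group
                rest.remove(i)
                splinter.append(i)
                moved = True
    return [idx[i] for i in splinter], [idx[i] for i in rest]

# Two separated groups: the first division pulls them apart.
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
a, b = split_cluster(X, list(range(6)))
```

Repeating this step on the resulting groups (choosing, e.g., the group with the largest diameter next) would build the full n-1-step hierarchy described above.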
Agglomerative nesting (Agnes function): The Agnes function is of the agglomerative hierarchical type; it therefore produces a sequence of clusterings. In the first clustering, each of the n objects forms its own separate group. In later steps, groups are merged until (after n - 1 steps) only one large group remains. There are many such merging methods; in Agnes, the group-average method is taken as the default, based on robustness, monotonicity, and consistency arguments (14).
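Since an agglomerative run performs exactly one merge per step, n objects always yield n - 1 merges. This can be checked directly; SciPy's group-average linkage is used here only as a stand-in for the R function, on invented data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3))  # n = 12 observations

# Group-average linkage (the default method in Agnes).
Z = linkage(X, method="average")

# Each row of Z records one merge: n objects require n - 1 merges.
assert Z.shape[0] == len(X) - 1
```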
Clustering Large Applications (Clara): Clara can deal with much larger data sets. Internally, this is achieved by considering sub-datasets of fixed size, so that the time and storage requirements become linear in n rather than quadratic (15).
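The idea can be sketched as follows. This is a simplification (real Clara runs full Pam on several subsamples and keeps the best medoid set; here a naive alternating medoid update stands in, and all data and sizes are invented):

```python
import numpy as np

def kmedoids(S, k, n_iter=10, seed=0):
    """Naive alternating k-medoids on a small set (a stand-in for PAM)."""
    rng = np.random.default_rng(seed)
    med = S[rng.choice(len(S), size=k, replace=False)]
    for _ in range(n_iter):
        lab = np.linalg.norm(S[:, None, :] - med[None, :, :], axis=2).argmin(axis=1)
        for j in range(k):
            members = S[lab == j]
            if len(members):
                dd = np.linalg.norm(members[:, None, :] - members[None, :, :], axis=2)
                med[j] = members[dd.sum(axis=1).argmin()]
    return med

def clara_like(X, k, sample_size, seed=0):
    """CLARA-style sketch: find medoids on a fixed-size subsample only,
    then one linear pass assigns every observation to its nearest medoid."""
    rng = np.random.default_rng(seed)
    sub = X[rng.choice(len(X), size=sample_size, replace=False)]
    med = kmedoids(sub, k, seed=seed)
    labels = np.linalg.norm(X[:, None, :] - med[None, :, :], axis=2).argmin(axis=1)
    return labels, med

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.3, (200, 2)), rng.normal(6, 0.3, (200, 2))])
labels, med = clara_like(X, k=2, sample_size=40)
```

Only the subsample incurs the quadratic distance computation; the full data set is touched once per assignment, which is the linear-in-n behavior the text describes.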
Partition around the medoids (Pam): The Pam algorithm is based on the search for k representative objects, or medoids, among the observations in the data set. These observations should represent the structure of the data. After finding a set of k medoids, k clusters are constructed by assigning each observation to the nearest medoid (15).
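The objective Pam optimizes is the total distance from each observation to its nearest medoid. A hedged sketch of that objective, on toy data, shows why a representative medoid set scores better than a poorly chosen one:

```python
import numpy as np

def pam_cost(X, medoid_idx):
    """Total distance from each observation to its nearest medoid --
    the quantity PAM's search tries to minimize."""
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [4.0, 4.0], [4.1, 4.0], [4.0, 4.1]])
# A poor medoid pair (both in the first group) vs. a representative one.
bad = pam_cost(X, [0, 1])
good = pam_cost(X, [0, 3])
```

Pam's swap phase repeatedly exchanges a medoid for a non-medoid whenever doing so lowers this cost.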
Biological validation measures: Biological validation evaluates the ability of a clustering algorithm to produce biologically meaningful clusters. A typical application is to microarray data, where the observations correspond to genes ("genes" here may be open reading frames (ORFs), expressed sequence tags (ESTs), serial analysis of gene expression (SAGE) tags, etc.). Two measures are available: the biological homogeneity index (BHI) and the biological stability index (BSI) (16).
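As a hedged sketch of the BHI idea (the reference implementation is the R clValid package), the index can be computed as the fraction of annotated gene pairs within each cluster that share at least one functional class, averaged over clusters. The gene names and functional classes below are hypothetical, for illustration only:

```python
from itertools import combinations

def bhi(clusters, annotations):
    """Biological Homogeneity Index sketch: per cluster, the fraction of
    annotated gene pairs sharing a functional class, averaged over clusters."""
    scores = []
    for genes in clusters:
        annotated = [g for g in genes if annotations.get(g)]
        pairs = list(combinations(annotated, 2))
        if not pairs:
            continue  # clusters with < 2 annotated genes are skipped
        hits = sum(1 for g, h in pairs if annotations[g] & annotations[h])
        scores.append(hits / len(pairs))
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical genes and functional classes.
annotations = {"g1": {"ribosome"}, "g2": {"ribosome"},
               "g3": {"kinase"}, "g4": {"kinase"}, "g5": {"ribosome"}}
homogeneous = bhi([["g1", "g2", "g5"], ["g3", "g4"]], annotations)
mixed = bhi([["g1", "g3"], ["g2", "g4", "g5"]], annotations)
```

A clustering that keeps functionally related genes together scores higher, which is exactly the biological significance the text describes.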
These measures can also be used for any other molecular expression data. The biological homogeneity index (BHI) and the biological stability index (BSI) both assess the performance of an algorithm at producing biologically similar