
Knowledge-Based Systems 37 (2013) 62–69

Contents lists available at SciVerse ScienceDirect

Knowledge-Based Systems
journal homepage: www.elsevier.com/locate/knosys

Evolutionary algorithm based on different semantic similarity functions
for synonym recognition in the biomedical domain
José M. Chaves-González *, Jorge Martínez-Gil
University of Extremadura, Escuela Politécnica, 10003 Cáceres, Spain
* Corresponding author. E-mail addresses: jm@unex.es (J.M. Chaves-González), jorgemar@unex.es (J. Martínez-Gil).
doi: http://dx.doi.org/10.1016/j.knosys.2012.07.005

Article info

Article history:
Received 12 December 2011
Received in revised form 9 May 2012
Accepted 13 July 2012
Available online 17 August 2012
Keywords:
Semantic similarity
Evolutionary computation
Semantic web
Synonym recognition
Differential evolution

Abstract
One of the most challenging problems in the semantic web field consists of computing the semantic similarity between different terms. The problem here is the lack of accurate domain-specific dictionaries for particular and dynamic fields such as biomedicine or finance. In this article we propose a new approach which uses different existing semantic similarity methods to obtain precise results in the biomedical domain. Specifically, we have developed an evolutionary algorithm which uses information provided by different semantic similarity metrics. Our results have been validated against a variety of biomedical datasets and different collections of similarity functions. The proposed system provides very high quality results when compared against similarity ratings provided by human experts (in terms of the Pearson correlation coefficient), surpassing the results of other relevant works previously published in the literature.

© 2012 Elsevier B.V. All rights reserved.

1. Introduction

The vast amount of information on the World Wide Web (WWW) has made the study of semantic web techniques [27] one of the most interesting areas of research. The semantic web is based on the semantic organization of resources so that they can be processed in a more efficient manner. The semantic similarity [4] of text expressions is a very relevant problem, especially in fields such as data integration [10], query expansion [9] or document classification [12], among others. The reason is that those specific fields need semantic similarity computation to work properly.
On the other hand, semantic similarity measurements are usually performed with some kind of metric [20]. The most common metrics are text-based semantic similarity measures obtained from the degree of similarity or dissimilarity between two text strings [5]. Semantic similarity measures provide a floating point number between 0 (total dissimilarity) and 1 (complete similarity).
Most existing works reach a high level of accuracy when solving general purpose datasets [22], because they describe approaches based on large and updated dictionaries [20]. However, these methods do not obtain precise results when they are applied to specific domains, such as biomedicine. The reason is that most methods in this domain work with a unique ontology. Therefore, results depend on the ontological detail and coverage [26]. In this work, we propose a new technique which beats other methods
based on semantic similarity measures previously published in the specific literature. The evolutionary algorithm (EA) we have designed combines information provided by a variety of well-known semantic similarity functions. Thus, our EA works as a high-level heuristic designed to improve the results obtained by using each individual similarity function. Furthermore, in order to validate our results, we use different biomedical datasets for which classical algorithms do not usually get optimal results.

The rest of the manuscript is organized as follows: Section 2 discusses related work. Section 3 describes the problem and the approaches we have used in this paper. The methodology and results are explained in Section 4. Finally, conclusions and possible lines of future work are discussed in the last section.

2. Previous work
Semantic similarity measurement has traditionally been an interesting research area within the Natural Language Processing (NLP) field [13,18]. The reason is that synonym recognition is a key aspect of human conversations. In fact, the fast development of the semantic web has led researchers to focus on the development of techniques based on synonym recognition to improve the discovery of resources on the WWW [14]. Thus, a user who is looking for the term cars obtains results including terms such as automobiles and vehicles.
A first approach to measure the semantic similarity between
two terms consists of computing the Euclidean distance (one of
the most popular metrics) between those words.

Fig. 1. Illustrative diagram of the proposed approach, HH(DE).

However, Euclidean distance is not suitable for all types of input data, such as the case in which we have to compute the distance between different word meanings. For this reason, most previous works have focused on designing new semantic similarity measures.
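To make the limitation concrete, here is a small illustrative sketch (plain Python; the word pairs come from examples used later in this paper, and the normalization is ours): a purely string-based distance scores near-identical spellings as similar regardless of meaning, while true synonyms with different spellings score poorly.

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance between two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def string_similarity(a: str, b: str) -> float:
    # Edit distance normalized to [0, 1]; 1 means identical strings.
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

print(string_similarity("lift", "elevator"))  # 0.25: synonyms, but low string similarity
print(string_similarity("car", "cat"))        # ~0.67: unrelated meanings, but high string similarity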
Traditional semantic similarity techniques use some kind of dictionary in order to compute the degree of similarity between words. The problem here is that most works use general purpose resources such as WordNet [20]. However, these sources offer limited coverage of biomedical terms. For this reason, several resources have been developed in recent years to improve semantic similarity measures, for example, Medical Subject Headings (MeSH) [22] or the Unified Medical Language System (UMLS) [21]. Semantic similarity measures fall into three main categories:
Path-based measures are based on dictionaries or thesauri. If a given word has two or more meanings in those sources, then multiple paths may exist between the two words. The problem with this method is that it relies on the notion that all links in the taxonomy are at uniform distances.

Information content measures are based on frequency counts of concepts as they are found in the corpus text (Pedersen et al. [20]). These measures assign higher values to specific concepts (e.g. pitchfork), and lower scores to more general terms (e.g. idea).

Feature based measures consider the similarity between terms according to their properties. In general, it is possible to estimate semantic similarity according to the number of common features. For example, these methods could be based on the relations between similar terms according to concept descriptions retrieved from dictionaries.
As was previously mentioned, the problem of traditional semantic similarity metrics is that there are no complete and up-to-date dictionaries for specific fields. If we focus on the biomedical domain, several outstanding works have been proposed in recent years. For instance, Pirró [22] proposed a new information content measure using the MeSH biomedical ontology which successfully improved existing similarity methods. However, our study improves on the results obtained in that work by using a combination of several similarity functions. Al-Mubaid and Nguyen [1] also proposed an ontology-based semantic similarity measure and applied it to the biomedical domain. This proposal is based on the path length between concept nodes and the depth of each term in the ontology hierarchy tree. Our results are also nearer to human judgments in this case. Furthermore, there are important works by Sánchez et al. [24,26] in which several semantic similarity measures based on approximating concept semantics in terms of information content are successfully presented. Once again, our technique obtains more precise results. Finally, Pedersen et al. [21] implemented and evaluated a variety of semantic similarity measures based on ontologies and terminologies found in the Unified Medical Language System (UMLS). Our results are also better in this case because we combine information provided by different methods.
3. Differential evolution for synonym recognition

In this section we describe the problem and the proposed solution. Our approach is based on the similarity scores provided by different atomic similarity functions. The evolutionary algorithm works as a hyper-heuristic which assigns different coefficient values to the similarity scores obtained from the pool of functions included in the system. At the beginning of the process, all functions (or metrics) contribute evenly to the semantic similarity value calculated for a specific pair of terms. Then, the system evolves so that the functions which provide the values most similar to the human expert's opinion receive the highest coefficients. Fig. 1 shows the working diagram of the proposed approach.

The differential evolution (DE) algorithm [29] was chosen among other candidates because, after a preliminary study, we concluded that DE obtained very competitive results for the problem addressed. The reason lies in how the algorithm makes the solutions evolve. Our system can be considered a hyper-heuristic (HH) which uses differential evolution, HH(DE), to assign a specific coefficient to each similarity function. These values modify the relevance of each function. Differential evolution performs its search by making small additions and subtractions between the members of its population (see Section 3.3). This feature fits the problem perfectly, because the algorithm works with the scores provided by the similarity functions (Fig. 1). In fact, the individual is defined as an array of floating point values, s, where s(fx) is the coefficient which modifies the result provided by the similarity function fx. Fig. 2 illustrates the representation of the individual used in this work.
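As a minimal sketch of this representation (Python; the paper does not publish its implementation, so the names SimilarityFn and combined_similarity are ours), an individual is simply a weight vector over the pool of atomic similarity functions, and the aggregated score of a word pair is the weighted sum of the atomic scores:

from typing import Callable, Sequence

# An atomic metric takes two terms and returns a score in [0, 1].
SimilarityFn = Callable[[str, str], float]

def combined_similarity(s: Sequence[float],
                        pool: Sequence[SimilarityFn],
                        term1: str, term2: str) -> float:
    # s[x] is the coefficient that modifies the result of the similarity function pool[x].
    return sum(w * f(term1, term2) for w, f in zip(s, pool))

# At the start of the evolution all functions contribute evenly, e.g.:
# s = [1.0 / len(pool)] * len(pool)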
3.1. The synonym recognition problem
Given two text expressions a and b, the problem consists of providing the degree of synonymy between them. However, synonym recognition usually extends beyond synonymy and also involves semantic similarity measurements. According to Bollegala et al. [3], a certain degree of semantic similarity is observed not only between synonyms (e.g. lift and elevator), but also between metonyms (e.g. car and wheel), hyponyms (e.g. leopard and cat), related words (e.g. blood and hospital), and even between antonyms (e.g. day and night).

Fig. 2. Individual representation.


Table 1
Classification of the most relevant similarity metrics.

Type | Similarity function and reference | Description
Path based measures | Path length (PATH) [20] | PATH is inversely proportional to the number of nodes along the shortest path between the words
Path based measures | Leacock and Chodorow (LCH) [15] | LCH can be computed as -log(length/(2*D)), where length is the length of the shortest path between the two words and D is the maximum depth of the taxonomy
Path based measures | Wu and Palmer (WUP) [30] | WUP considers the depths of the words in the taxonomies, along with the depth of the LCS (least common subsumer). The formula is: score = 2*depth(LCS)/(depth(s1) + depth(s2))
Path based measures | Hirst and St-Onge (HSO) [11] | HSO looks for lexical chains linking the word senses
Information content measures | Resnik (RES) [23] | RES computes the information content of the LCS
Information content measures | Jiang and Conrath (JCN) [6] | JCN can be calculated as 1/(IC(term1) + IC(term2) - 2*IC(LCS)), where IC refers to information content
Information content measures | Lin (LIN) [17] | LIN can be computed as 2*IC(LCS)/(IC(term1) + IC(term2))
Feature based measures | Adapted Lesk (LESK) [16] | The similarity score is the sum of the squares of the overlap lengths
Feature based measures | Gloss Vector (vector) [2] | This method works by forming second-order co-occurrence vectors from the glosses or WordNet definitions of concepts
Feature based measures | Gloss Vector, pairwise modified (vector_pairs) [20] | This metric forms separate vectors corresponding to each of the adjacent glosses
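As a worked illustration of the formulas in Table 1 (a sketch under the assumption that information content (IC) values and taxonomy depths are already available from a corpus or ontology; the helper names are ours, not the authors'):

import math

def lch(path_length: int, max_depth: int) -> float:
    # LCH = -log(length / (2 * D)), D being the maximum depth of the taxonomy.
    return -math.log(path_length / (2.0 * max_depth))

def wup(depth_s1: int, depth_s2: int, depth_lcs: int) -> float:
    # WUP = 2 * depth(LCS) / (depth(s1) + depth(s2)).
    return 2.0 * depth_lcs / (depth_s1 + depth_s2)

def res(ic_lcs: float) -> float:
    # RES is simply the information content of the LCS.
    return ic_lcs

def jcn(ic_term1: float, ic_term2: float, ic_lcs: float) -> float:
    # JCN = 1 / (IC(term1) + IC(term2) - 2 * IC(LCS)); undefined when the distance is zero.
    return 1.0 / (ic_term1 + ic_term2 - 2.0 * ic_lcs)

def lin(ic_term1: float, ic_term2: float, ic_lcs: float) -> float:
    # LIN = 2 * IC(LCS) / (IC(term1) + IC(term2)).
    return 2.0 * ic_lcs / (ic_term1 + ic_term2)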

We have designed an algorithm which provides a floating point
number. This number refers to the similarity between two biomedical expressions. A value of 0 indicates that the words compared
are absolutely dissimilar. On the other hand, a value of 1 means
that the expressions share exactly the same meaning.
3.2. Semantic similarity metrics

A large number of similarity metrics have been published in the literature. Table 1 summarizes the most relevant similarity functions according to their classification. All of them have been used in our work. The first column (Table 1) indicates the general type of the function. The second column contains the similarity metrics and a reference where more detailed information can be obtained. Finally, the third column includes a brief explanation of each metric.
The main advantage of path based measures is that they are easy to develop because they are easy to understand. On the other hand, this kind of measure needs rich taxonomies. Moreover, they only work with nodes belonging to those taxonomies, and only the is-a relation can be used to link two terms. The advantage of information content measures is that they use empirical information from real corpora. Although information content measures only worked with nodes belonging to specific taxonomies in the past, Sánchez et al. [24] show that this kind of measure obtains very good results with real corpora. Finally, feature based measures do not require underlying structures and use implicit knowledge from real corpora. As a disadvantage of these metrics, the definitions of terms can be too short and the computation tends to be very intensive in most cases.
3.3. The differential evolution algorithm
Differential evolution (DE) is a population based evolutionary
algorithm (EA) for global optimization created by Storn and Price
[29]. Due to its simplicity and effectiveness, DE has been applied
to a large number of optimization problems in a wide range of
domains [7]. It is based on the generation of new individuals by
calculating vector differences between randomly-selected solutions. Fig. 3 illustrates the explanation of our version of the
algorithm. Our particular DE has been carefully configured and
adapted to the problem managed in our study, as will be explained
in Section 4. For this reason, among the different variants of the
algorithm, we chose the discrete handling approach version with
the selection scheme best/1/bin [19]. This configuration provides
more competitive results than the rest of variants. The notation

Fig. 3. Outline of the DE algorithm for the Synonym Recognition Problem.

best/1/bin indicates how the crossover and mutation operators
work [19]. Thus, our DE includes binomial crossover (bin) and uses
a unique (/1/) difference vector for the mutation of the best solution (best) taken from the population.
As we can see in Algorithm 1, DE starts with the random generation of the population (line 1) through the assignment of a random coefficient to each gene of the individual (see Fig. 2). As was explained previously, the population consists of a certain number of solutions (this is a parameter to be configured). Each individual is represented by a vector of weighting factors, as described in Fig. 2. After the generation of the population, the fitness of each individual is assigned to each solution using the Pearson correlation [28]. This correlation, corr(X, Y), is calculated according to Eq. (1) with the values provided by a human expert for every pair of words of the specific dataset under study. The final result is a floating point value between +1 (perfect positive linear correlation) and -1 (perfect negative linear correlation) which indicates the degree of linear dependence between the variables X (human expert opinion) and Y (our solution).
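A sketch of this fitness evaluation (plain Python; the helper names are ours, and combined_similarity is the weighted aggregation sketched in Section 3): the fitness of a weight vector is the Pearson correlation between the human scores and the aggregated scores it produces over the whole dataset.

import math

def pearson(x, y):
    # corr(X, Y) = cov(X, Y) / (sigma_X * sigma_Y), i.e. Eq. (1).
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def fitness(s, pool, dataset):
    # dataset: a list of (term1, term2, human_score) triples.
    human = [h for _, _, h in dataset]
    ours = [combined_similarity(s, pool, t1, t2) for t1, t2, _ in dataset]
    return pearson(human, ours)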

Algorithm 1. Pseudo-code for the DE algorithm

1:  generateRandomPopulation (population)
2:  calculateFitness (population)
3:  while (stop condition not reached) do
4:    for (each individual of the population)
5:      selectIndividuals (xTarget, xBest, xInd1, xInd2)
6:      xMut ← diffMutation (xBest, F, xInd1, xInd2)
7:      xTrial ← binCrossOver (xTarget, xMut, CrossProb)
8:      calculateFitness (xTrial)
9:      updateIndividual (xTarget, xTrial)
10:   endfor
11: endwhile
12: return bestIndividual (population)

corr(X, Y) = cov(X, Y) / (σX σY) = E[(X − μX)(Y − μY)] / (σX σY)    (1)

The nearer the value of the correlation is to either of the extreme values (−1 or +1), the stronger the correlation between the variables and the higher the quality of the solution. On the other hand, if the result gets nearer to 0, the variables are closer to being uncorrelated and, consequently, the solution must be considered of poor quality. We use the correlation so we can compare in fair terms against other published studies [20,22]. For a more detailed explanation of Eq. (1), please consult [28].

The main loop of the algorithm (Fig. 3) starts after evaluating the whole population (line 2). DE is an iterative algorithm, where the successive generations try to reach an optimal solution, stopping when the maximum number of generations is reached (line 3).

First, we select four solutions (line 5). xTarget and xBest are the solution being processed and the best solution found so far, respectively. xInd1 and xInd2 are two randomly chosen solutions different from xTarget and xBest. Next, mutation is performed (line 6) according to the expression: xMut ← xBest + F · (xInd1 − xInd2). This operator is explained in Fig. 3 in two steps (diffMutation 1 and 2). In the first part, xDiff is calculated through the expression: xDiff ← F · (xInd1 − xInd2). xDiff represents the mutation to be applied to the best solution. The parameter F ∈ [0, 2] establishes the mutation amount used in that operation. Then, xMut is generated by modifying each gene of xBest with the mutation indicated in xDiff (Fig. 3). After mutation, the xTarget and xMut individuals are crossed (line 7) using binomial crossover [31] according to a crossover probability, crossProb ∈ [0, 1]. Then, the obtained individual, xTrial, is evaluated (Eq. (1)) to check its quality (line 8) and compared against xTarget. The best individual is saved in the xTarget position (line 9 and Fig. 3). This process is repeated for each individual in the population (line 4) while the stop condition is not satisfied (line 3). In our case, the stop condition is a certain number of generations, which is also a parameter to be configured (see Section 4.1). At the end of the process, the best individual (Fig. 2) is returned as the final result of our system (line 12). It is important to point out here that these results have been obtained after a complete experimental process using different biomedical datasets. If datasets from other domains were used, further experiments would be necessary.
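The loop just described can be sketched as follows (Python; our reading of Algorithm 1 under the parameter setting of Table 2, not the authors' code; fitness is the Pearson-correlation function sketched above, and higher is better):

import random

def de_best_1_bin(fitness, dim, pop_size=100, F=0.5, cross_prob=0.1,
                  max_gens=100, low=-100.0, high=100.0):
    # Random initial population (line 1) and its evaluation (line 2).
    pop = [[random.uniform(low, high) for _ in range(dim)] for _ in range(pop_size)]
    fit = [fitness(ind) for ind in pop]
    for _ in range(max_gens):                       # stop condition (line 3)
        for i in range(pop_size):                   # every individual (line 4)
            target = pop[i]
            best = pop[max(range(pop_size), key=fit.__getitem__)]
            # Two random solutions different from xTarget and xBest (line 5).
            ind1, ind2 = random.sample(
                [p for p in pop if p is not target and p is not best], 2)
            # Mutation (line 6): xMut = xBest + F * (xInd1 - xInd2).
            mut = [b + F * (a - c) for b, a, c in zip(best, ind1, ind2)]
            # Binomial crossover (line 7); j_rand keeps at least one mutated gene.
            j_rand = random.randrange(dim)
            trial = [m if (random.random() < cross_prob or j == j_rand) else t
                     for j, (t, m) in enumerate(zip(target, mut))]
            f_trial = fitness(trial)                # evaluation (line 8)
            if f_trial > fit[i]:                    # keep the better solution (line 9)
                pop[i], fit[i] = trial, f_trial
    return pop[max(range(pop_size), key=fit.__getitem__)]   # best individual (line 12)

For example, with a pool of ten similarity functions one would call de_best_1_bin(lambda ind: fitness(ind, pool, dataset), dim=len(pool)).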
4. Experiments and results

In this section we summarize the main experiments and the results obtained in our study. We have used different similarity metrics and biomedical datasets to test the proposed system.

4.1. Methodology

All the experiments have been run under the same environment: an Intel Xeon 2.33 GHz processor and 8 GB of RAM. On the software side, we have used the GCC 4.1.2 compiler on a Scientific Linux 5.3 64-bit OS. Since we are dealing with a stochastic algorithm, we have carried out 100 independent runs for each experiment. The results provided in the following subsections are average results of these 100 executions. It is important to point out here that the use of the arithmetic mean is a valid statistical measurement because the results follow a normal distribution (a p-value greater than 0.05 for the Shapiro-Wilk test confirms this fact [8]). Moreover, all results present an extremely low dispersion, since the standard deviation for all experiments is lower than 10^-15, so the results can be considered statistically reliable.

In the following subsection we will discuss the results obtained in our experiments using different similarity metrics and different word datasets. Table 2 summarizes the parameter adjustment performed for the developed system. The parameter setting is very relevant, since the quality of the results largely depends on the accuracy of this adjustment. Therefore, we performed a complete and precise study for each parameter.

Table 2
Optimal parameter setting.

Parameter | Optimal value
Population size | 100
Mutation factor, F | 0.5
Crossover probability, crossProb | 0.1
Max generations | 100
MIN, MAX | -100, 100
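The statistical check described above can be sketched as follows (using scipy.stats, which provides the Shapiro-Wilk test; the helper and the 0.05 threshold follow the text):

from statistics import mean, stdev
from scipy.stats import shapiro

def summarize_runs(correlations):
    # correlations: the 100 per-run correlation values of one experiment.
    _, p_value = shapiro(correlations)
    if p_value > 0.05:  # normality not rejected, so the arithmetic mean is meaningful
        return mean(correlations), stdev(correlations)
    raise ValueError("runs are not normally distributed; report medians instead")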
4.2. Result discussion
We have performed two kinds of experiments. First, we explain the results provided by our proposed system using two different sets of similarity metrics (from the WordNet dictionary, http://wordnet.princeton.edu/, and from the Pirró study [22]). Next, we discuss the results obtained using two different datasets from the biomedical domain [1,22].
4.2.1. Experiments with different metrics

We compare our results against two sources. First, we use the study published by Pirró [22], in which a new feature-based metric (P&S) is proposed. Table 3 summarizes the study using the same biomedical dataset. Every value is normalized in the interval [0, 1].

The first column (in Table 3) identifies the word pair. Then, each word under evaluation is presented (columns 2 and 3). The fourth column corresponds to the similarity value provided by a human expert. Next, all computational methods appear, grouped by methodology. Our results are shown in the last column. As described previously (Section 4.1), they are highly reliable because they are average results from 100 independent executions.

Our results are similar to human judgment when the scores of the functions included in our hyper-heuristic (Fig. 1) are precise. For example, our scores for the pairs labelled P17, P24 and P35 in Table 3 are close to the human opinion because the basic functions provide precise results (see Table 3). On the other hand, if the basic functions do not perform accurately for a specific pair (e.g. P13, Table 3), our score is better than most of the values provided by those functions.



Table 3
Similarity results obtained by our system (last column) against other published results. Resnik, Lin and J&C are IC-based measures; Li is a hybrid measure; P&S is a feature-based measure; HH(DE) is our evolutionary approach.

Word pair | Word 1 | Word 2 | Human expert | Resnik | Lin | J&C | Li | P&S | HH(DE)
P01 | Anemia | Appendicitis | 0.031 | 0.000 | 0.000 | 0.190 | 0.130 | 0.133 | 0.116
P02 | Otitis Media | Infantile Colic | 0.156 | 0.000 | 0.000 | 0.160 | 0.100 | 0.000 | 0.056
P03 | Dementia | Atopic Dermatitis | 0.060 | 0.000 | 0.000 | 0.290 | 0.130 | 0.202 | 0.165
P04 | Bacterial Pneumonia | Malaria | 0.156 | 0.000 | 0.000 | 0.030 | 0.100 | 0.000 | 0.024
P05 | Osteoporosis | Patent Ductus Arteriosus | 0.156 | 0.000 | 0.000 | 0.150 | 0.000 | 0.000 | 0.037
P06 | Sequence | Antibacterial Agents | 0.155 | 0.000 | 0.000 | 0.270 | 0.160 | 0.184 | 0.159
P07 | Acq. Immunno. Syndrome | Congenital Heart Defects | 0.060 | 0.000 | 0.000 | 0.070 | 0.080 | 0.000 | 0.030
P08 | Meningitis | Tricuspid Atresia | 0.031 | 0.000 | 0.000 | 0.190 | 0.130 | 0.131 | 0.115
P09 | Sinusitis | Mental Retardation | 0.031 | 0.000 | 0.000 | 0.360 | 0.130 | 0.117 | 0.152
P10 | Hypertension | Failure | 0.500 | 0.000 | 0.000 | 0.210 | 0.130 | 0.109 | 0.112
P11 | Hyperlipidemia | Hyperkalemia | 0.156 | 0.331 | 0.483 | 0.470 | 0.510 | 0.561 | 0.443
P12 | Hypothyroidism | Hyperthyroidism | 0.406 | 0.619 | 0.726 | 0.750 | 0.630 | 0.718 | 0.665
P13 | Sarcoidosis | Tuberculosis | 0.406 | 0.000 | 0.000 | 0.250 | 0.070 | 0.169 | 0.134
P14 | Vaccines | Immunity | 0.593 | 0.000 | 0.000 | 0.520 | 0.000 | 0.344 | 0.251
P15 | Asthma | Pneumonia | 0.375 | 0.517 | 0.790 | 0.870 | 0.520 | 0.749 | 0.627
P16 | Diabetic Nephropathy | Diabetes Mellitus | 0.500 | 0.612 | 0.759 | 0.790 | 0.770 | 0.741 | 0.696
P17 | Lactose Intolerance | Irritable Bowel Syndrome | 0.468 | 0.468 | 0.468 | 0.470 | 0.360 | 0.468 | 0.451
P18 | Urinary Tract Infection | Pyelonephritis | 0.656 | 0.470 | 0.588 | 0.670 | 0.420 | 0.604 | 0.533
P19 | Neonatal Jaundice | Sepsis | 0.187 | 0.000 | 0.000 | 0.190 | 0.160 | 0.000 | 0.073
P20 | Anemia | Deficiency Anemia | 0.437 | 0.601 | 0.720 | 0.790 | 0.360 | 0.712 | 0.622
P21 | Psychology | Cognitive Science | 0.593 | 0.627 | 0.770 | 0.810 | 0.800 | 0.751 | 0.714
P22 | Adenovirus | Rotavirus | 0.437 | 0.267 | 0.332 | 0.450 | 0.350 | 0.398 | 0.358
P23 | Migraine | Headache | 0.718 | 0.229 | 0.243 | 0.370 | 0.170 | 0.269 | 0.266
P24 | Myocardial Ischemia | Myocardial Infarction | 0.750 | 0.595 | 0.918 | 0.890 | 0.800 | 0.830 | 0.713
P25 | Hepatitis B | Hepatitis C | 0.562 | 0.649 | 0.823 | 0.860 | 0.660 | 0.790 | 0.715
P26 | Carcinoma | Neoplasm | 0.750 | 0.246 | 0.626 | 0.850 | 0.450 | 0.651 | 0.488
P27 | Pulmonary Stenosis | Aortic Stenosis | 0.531 | 0.658 | 0.781 | 0.810 | 0.660 | 0.763 | 0.707
P28 | Failure to Thrive | Malnutrition | 0.625 | 0.000 | 0.000 | 0.180 | 0.130 | 0.126 | 0.111
P29 | Breast Feeding | Lactation | 0.843 | 0.000 | 0.000 | 0.040 | 0.080 | 0.029 | 0.033
P30 | Antibiotics | Antibacterial Agents | 0.937 | 1.000 | 1.000 | 1.000 | 0.990 | 1.000 | 1.000
P31 | Seizures | Convulsions | 0.843 | 0.880 | 1.000 | 0.900 | 0.810 | 0.990 | 0.887
P32 | Pain | Ache | 0.875 | 0.861 | 1.000 | 1.000 | 0.990 | 0.954 | 0.920
P33 | Malnutrition | Nutritional Deficiency | 0.875 | 0.622 | 1.000 | 1.000 | 0.980 | 0.874 | 0.780
P34 | Measles | Rubeola | 0.906 | 0.924 | 1.000 | 1.000 | 0.990 | 1.000 | 0.965
P35 | Chicken Pox | Varicella | 0.968 | 1.000 | 1.000 | 1.000 | 0.990 | 1.000 | 1.000
P36 | Down Syndrome | Trisomy 21 | 0.875 | 1.000 | 1.000 | 1.000 | 0.990 | 1.000 | 1.000

Table 4
Pearson correlation between computational methods and human judgments.

Similarity function | Correlation
Resnik | 0.721
Lin | 0.718
J&C | 0.718
Li | 0.707
P&S | 0.725
HH(DE) | 0.732

Table 4 presents the Pearson correlation between the methods that appear in Table 3 [22] and the human expert score. As can be seen, our approach provides the best results of the study. HH(DE) surpasses the other similarity metrics because it is more robust than most existing methods.

The second step in our experimental evaluation consists of using a different set of similarity functions to tackle the same word dataset. Table 5 presents the results obtained with every metric included in the WordNet::Similarity resources [20].

Every metric used has been previously explained in Section 3.2 (Table 1). Our results appear in the last column (Table 5). Our scores are also statistically robust since they are average results of 100 independent executions, with very low standard deviations (see Section 4.1). Although there are functions which do not obtain good correlation results (Table 6), such as HSO (0.332), JCN (0.237), LIN (0.218) or vector_pairs (0.333), they have been included in our system (Fig. 1), because the higher the number of functions used, the better the results obtained (see Table 7). Even if the global correlation of a particular function is not very good, that function is still important in our system, since our hyper-heuristic modifies its coefficient so that it contributes relevant information to the global system.

Once again, we can see that our HH(DE) improves on the results provided by the basic functions because it uses a linear combination of every function (see Section 3). Even if a specific function obtains bad results, it is unlikely that every metric provides poor results.

As may be seen in Table 6, our proposal significantly improves on the rest of the metrics. In fact, the correlation value reached (0.809) is even higher than the result obtained in the previous set of experiments (0.732, Table 4), although in that case every similarity function provided better results. The scores provided by every similarity function are worse than in the previous experiment because WordNet is a general purpose resource which is not very appropriate for the biomedical domain [25]. Therefore, we can conclude that the higher the number of similarity functions included in our hyper-heuristic (see Fig. 1), the higher the quality of the global results. Table 7 shows how the global quality of our method improves as more basic functions are incorporated into our pool.
4.2.2. Experiments with different datasets

In this subsection we study the results provided by our system using another biomedical dataset [1]. The configuration of our system is exactly the same; therefore, we can say that the parameter setting of our HH(DE) is consistent across at least two datasets.

Table 5
Similarity results obtained by our system (last column) against WordNet results.

Word pair | Human expert | HSO | JCN | WUP | PATH | LIN | LESK | RES | LCH | Vector_pairs | Vector | HH(DE)
P01 | 0.031 | 0.250 | 0.000 | 0.842 | 0.250 | 0.000 | 0.004 | 0.000 | 0.624 | 0.010 | 0.183 | 0.139
P02 | 0.156 | 0.188 | 0.000 | 0.800 | 0.200 | 0.000 | 0.002 | 0.000 | 0.564 | 0.007 | 0.211 | 0.169
P03 | 0.06 | 0.000 | 0.000 | 0.480 | 0.071 | 0.000 | 0.002 | 0.000 | 0.285 | 0.013 | 0.111 | 0.115
P04 | 0.156 | 0.125 | 0.092 | 0.720 | 0.250 | 0.524 | 0.010 | 0.000 | 0.436 | 0.080 | 0.326 | 0.113
P05 | 0.156 | 0.000 | 0.000 | 0.174 | 0.050 | 0.000 | 0.002 | 0.000 | 0.188 | 0.018 | 0.081 | 0.034
P06 | 0.155 | 0.000 | 0.058 | 0.300 | 0.077 | 0.060 | 0.010 | 0.000 | 0.305 | 0.022 | 0.152 | 0.073
P07 | 0.06 | 0.000 | 0.052 | 0.375 | 0.091 | 0.075 | 0.028 | 0.000 | 0.350 | 0.051 | 0.397 | 0.195
P08 | 0.031 | 0.000 | 0.000 | 0.609 | 0.100 | 0.000 | 0.002 | 0.000 | 0.376 | 0.011 | 0.121 | 0.134
P09 | 0.031 | 0.000 | 0.000 | 0.556 | 0.111 | 0.000 | 0.003 | 0.000 | 0.404 | 0.003 | 0.057 | 0.083
P10 | 0.5 | 0.250 | 0.000 | 0.842 | 0.250 | 0.000 | 0.007 | 0.000 | 0.624 | 0.018 | 0.457 | 0.281
P11 | 0.156 | 0.313 | 0.000 | 0.889 | 0.333 | 0.000 | 0.078 | 0.331 | 0.702 | 0.195 | 0.727 | 0.472
P12 | 0.406 | 1.000 | 0.000 | 0.900 | 0.333 | 0.000 | 0.019 | 0.619 | 0.702 | 0.221 | 0.097 | 0.199
P13 | 0.406 | 0.000 | 0.000 | 0.720 | 0.125 | 0.000 | 0.007 | 0.000 | 0.436 | 0.070 | 0.222 | 0.165
P14 | 0.593 | 0.000 | 0.048 | 0.267 | 0.083 | 0.000 | 0.009 | 0.000 | 0.326 | 0.016 | 0.251 | 0.125
P15 | 0.375 | 0.313 | 0.000 | 0.923 | 0.333 | 0.000 | 0.013 | 0.517 | 0.702 | 0.227 | 0.375 | 0.358
P16 | 0.5 | 1.000 | 0.044 | 0.182 | 0.053 | 0.000 | 0.105 | 0.612 | 0.202 | 0.041 | 0.396 | 0.435
P17 | 0.468 | 0.000 | 0.000 | 0.250 | 0.053 | 0.000 | 0.001 | 0.468 | 0.202 | 0.034 | 0.108 | 0.298
P18 | 0.656 | 0.250 | 0.000 | 0.963 | 0.500 | 0.000 | 0.050 | 0.470 | 0.812 | 0.062 | 0.591 | 0.536
P19 | 0.187 | 0.125 | 0.059 | 0.400 | 0.077 | 0.221 | 0.011 | 0.000 | 0.305 | 0.043 | 0.112 | 0.029
P20 | 0.437 | 0.000 | 0.108 | 0.571 | 0.100 | 0.471 | 0.051 | 0.601 | 0.376 | 0.074 | 0.329 | 0.450
P21 | 0.593 | 0.375 | 0.000 | 0.900 | 0.333 | 0.000 | 0.153 | 0.627 | 0.702 | 0.167 | 0.515 | 0.539
P22 | 0.437 | 0.250 | 0.000 | 0.842 | 0.250 | 0.000 | 0.029 | 0.267 | 0.624 | 0.042 | 0.093 | 0.212
P23 | 0.718 | 0.250 | 0.000 | 0.957 | 0.500 | 0.000 | 0.428 | 0.229 | 0.812 | 0.020 | 0.612 | 0.504
P24 | 0.75 | 0.000 | 0.000 | 0.720 | 0.125 | 0.000 | 0.038 | 0.595 | 0.436 | 0.034 | 0.116 | 0.446
P25 | 0.562 | 0.313 | 0.000 | 0.933 | 0.333 | 0.000 | 0.074 | 0.649 | 0.702 | 0.359 | 0.583 | 0.463
P26 | 0.75 | 0.313 | 0.000 | 0.889 | 0.250 | 0.000 | 0.228 | 0.246 | 0.624 | 0.134 | 0.480 | 0.385
P27 | 0.531 | 0.313 | 0.000 | 0.917 | 0.333 | 0.000 | 0.100 | 0.658 | 0.702 | 0.438 | 0.833 | 0.550
P28 | 0.625 | 0.000 | 0.000 | 0.636 | 0.111 | 0.000 | 0.029 | 0.000 | 0.404 | 0.125 | 0.309 | 0.164
P29 | 0.843 | 0.250 | 0.000 | 0.857 | 0.250 | 0.000 | 0.017 | 0.000 | 0.624 | 0.218 | 0.849 | 0.366
P30 | 0.937 | 0.250 | 1.000 | 0.941 | 0.500 | 1.000 | 0.661 | 1.000 | 0.812 | 0.119 | 0.769 | 0.894
P31 | 0.843 | 0.313 | 0.455 | 0.952 | 0.500 | 0.897 | 0.098 | 0.880 | 0.812 | 0.197 | 0.628 | 0.710
P32 | 0.875 | 0.313 | 0.402 | 0.947 | 0.500 | 0.861 | 0.152 | 0.861 | 0.812 | 0.050 | 0.419 | 0.562
P33 | 0.875 | 0.000 | 0.000 | 0.571 | 0.100 | 0.000 | 0.040 | 0.622 | 0.376 | 0.022 | 0.322 | 0.554
P34 | 0.906 | 1.000 | 0.000 | 1.000 | 1.000 | 0.000 | 1.000 | 0.924 | 1.000 | 0.333 | 1.000 | 0.791
P35 | 0.968 | 0.250 | 0.000 | 0.966 | 0.500 | 0.000 | 0.752 | 1.000 | 0.812 | 0.055 | 0.720 | 1.000
P36 | 0.875 | 1.000 | 0.000 | 1.000 | 1.000 | 0.000 | 1.000 | 1.000 | 1.000 | 0.500 | 1.000 | 0.720
Table 6
Pearson correlation between the computational methods calculated with WordNet::Similarity and human judgments.

Metrics | Correlation
HSO | 0.332
JCN | 0.237
WUP | 0.490
PATH | 0.517
LIN | 0.218
LESK | 0.517
RES | 0.721
LCH | 0.553
Vector_pairs | 0.333
Vector | 0.593
HH(DE) | 0.809

To the best of our knowledge, there are no works in which more datasets from the biomedical domain have been used, so we cannot perform further comparisons. Table 8 presents the word pairs of the dataset and the scores provided by human experts. As in the previous case, every value is normalized in the interval [0, 1].
Our results are shown in Table 9. The first two columns are taken from Table 8 and identify the word pair and the human expert score. The next 10 columns correspond to the values obtained from the WordNet similarity tool [20]. The last column contains the result obtained by our HH(DE). As in previous experiments, these results are statistically reliable because they are average results of 100 independent executions, with very low standard deviations (lower than 10^-10).

Table 7
Pearson correlation values for different numbers of similarity functions. In each case, HH(DE) uses the similarity functions of highest quality.

Similarity functions included in HH(DE) | Correlation
10 | 0.809
9 | 0.771
8 | 0.769
7 | 0.768
6 | 0.701
5 | 0.692
4 | 0.692
3 | 0.678
2 | 0.658
1 | 0.642

As may be observed in Table 9, our results clearly surpass the results provided by the rest of the metrics. We can also verify that our results are more reliable across different datasets than those provided by other functions: for example, the HSO function obtains poor results on the previous dataset (0.332, Table 6) but provides interesting results in this case (0.701, Table 10), using WordNet in both cases.

Furthermore, as happened in the previous experiment, our strategy improves the results for word pairs nicely evaluated by the basic functions (e.g. WP01, WP04 or WP29), and it provides acceptable results for word pairs poorly evaluated by them (e.g. WP21 or WP30).

Table 8
Human expert scores for another biomedical word dataset.

Word pair | Word 1 | Word 2 | Human expert
WP01 | Renal failure | Kidney failure | 1
WP02 | Heart | Myocardium | 0.75
WP03 | Stroke | Infarct | 0.7
WP04 | Abortion | Miscarriage | 0.825
WP05 | Delusion | Schizophrenia | 0.55
WP06 | Congestive heart failure | Pulmonary edema | 0.35
WP07 | Metastasis | Adenocarcinoma | 0.45
WP08 | Calcification | Stenosis | 0.5
WP09 | Diarrhea | Stomach cramps | 0.325
WP10 | Mitral stenosis | Atrial fibrillation | 0.325
WP11 | Chronic obstructive pulmonary disease | Lung infiltrates | 0.475
WP12 | Rheumatoid arthritis | Lupus | 0.275
WP13 | Brain tumor | Intracranial hemorrhage | 0.325
WP14 | Carpel tunnel syndrome | Osteoarthritis | 0.275
WP15 | Diabetes mellitus | Hypertension | 0.25
WP16 | Acne | Syringe | 0.25
WP17 | Antibiotic | Allergy | 0.3
WP18 | Cortisone | Total knee replacement | 0.25
WP19 | Pulmonary embolus | Myocardial infarction | 0.3
WP20 | Pulmonary fibrosis | Lung cancer | 0.35
WP21 | Cholangiocarcinoma | Colonoscopy | 0.25
WP22 | Lymphoid hyperplasia | Laryngeal cancer | 0.25
WP23 | Multiple sclerosis | Psychosis | 0.25
WP24 | Appendicitis | Osteoporosis | 0.25
WP25 | Rectal polyp | Aorta | 0.25
WP26 | Xerostomia | Alcoholic cirrhosis | 0.25
WP27 | Peptic ulcer disease | Myopia | 0.25
WP28 | Depression | Cellulites | 0.25
WP29 | Varicose vein | Entire knee meniscus | 0.25
WP30 | Hyperlidpidemia | Metastasis | 0.25

Table 9
Similarity results obtained by our system (last column) against the results obtained with WordNet using the second biomedical dataset.

Word pair | Human expert | HSO | JCN | WUP | PATH | LIN | LESK | RES | LCH | Vector_pairs | Vector | HH(DE)
WP01 | 1 | 1.000 | 0.000 | 1.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.010 | 0.183 | 1.000
WP02 | 0.75 | 0.313 | 0.078 | 0.600 | 0.111 | 0.370 | 0.210 | 0.320 | 0.404 | 0.007 | 0.211 | 0.517
WP03 | 0.7 | 0.000 | 0.055 | 0.333 | 0.077 | 0.079 | 0.060 | 0.066 | 0.305 | 0.013 | 0.111 | 0.094
WP04 | 0.825 | 1.000 | 0.000 | 1.000 | 1.000 | 0.000 | 0.195 | 0.907 | 1.000 | 0.080 | 0.326 | 0.742
WP05 | 0.55 | 0.188 | 0.000 | 0.778 | 0.200 | 0.000 | 0.114 | 0.469 | 0.564 | 0.018 | 0.081 | 0.193
WP06 | 0.35 | 0.000 | 0.000 | 0.556 | 0.111 | 0.000 | 0.047 | 0.269 | 0.404 | 0.022 | 0.152 | 0.019
WP07 | 0.45 | 0.000 | 0.000 | 0.174 | 0.050 | 0.000 | 0.018 | 0.000 | 0.188 | 0.036 | 0.327 | 0.050
WP08 | 0.5 | 0.000 | 0.000 | 0.556 | 0.111 | 0.000 | 0.027 | 0.269 | 0.404 | 0.051 | 0.397 | 0.030
WP09 | 0.325 | 0.000 | 0.057 | 0.333 | 0.077 | 0.000 | 0.074 | 0.066 | 0.305 | 0.011 | 0.121 | 0.123
WP10 | 0.325 | 0.188 | 0.000 | 0.833 | 0.200 | 0.000 | 0.022 | 1.000 | 0.564 | 0.003 | 0.057 | 0.128
WP11 | 0.475 | 0.000 | 0.000 | 0.500 | 0.200 | 0.000 | 0.008 | 0.297 | 0.298 | 0.018 | 0.457 | 0.030
WP12 | 0.275 | 0.188 | 0.000 | 0.846 | 0.200 | 0.000 | 0.114 | 0.582 | 0.564 | 0.195 | 0.727 | 0.079
WP13 | 0.325 | 0.000 | 0.098 | 0.750 | 0.143 | 0.540 | 0.043 | 0.508 | 0.472 | 0.221 | 0.097 | 0.056
WP14 | 0.275 | 0.000 | 0.000 | 0.160 | 0.046 | 0.000 | 0.029 | 0.000 | 0.162 | 0.070 | 0.222 | 0.055
WP15 | 0.25 | 0.000 | 0.000 | 0.560 | 0.083 | 0.000 | 0.095 | 0.477 | 0.326 | 0.016 | 0.251 | 0.124
WP16 | 0.25 | 0.000 | 0.043 | 0.167 | 0.048 | 0.000 | 0.036 | 0.000 | 0.175 | 0.227 | 0.375 | 0.023
WP17 | 0.3 | 0.000 | 0.000 | 0.200 | 0.059 | 0.000 | 0.077 | 0.000 | 0.232 | 0.041 | 0.396 | 0.125
WP18 | 0.25 | 0.000 | 0.000 | 0.300 | 0.067 | 0.000 | 0.040 | 0.052 | 0.266 | 0.034 | 0.108 | 0.018
WP19 | 0.3 | 0.000 | 0.000 | 0.300 | 0.067 | 0.000 | 0.037 | 0.066 | 0.266 | 0.062 | 0.591 | 0.066
WP20 | 0.35 | 0.000 | 0.000 | 0.667 | 0.100 | 0.000 | 0.022 | 0.508 | 0.376 | 0.043 | 0.112 | 0.007
WP21 | 0.25 | 0.000 | 0.000 | 0.222 | 0.046 | 0.000 | 0.066 | 0.066 | 0.162 | 0.500 | 1.000 | 0.298
WP22 | 0.25 | 0.000 | 0.104 | 0.583 | 0.091 | 0.540 | 0.045 | 0.477 | 0.350 | 0.074 | 0.329 | 0.029
WP23 | 0.25 | 0.000 | 0.000 | 0.632 | 0.125 | 0.000 | 0.052 | 0.352 | 0.436 | 0.167 | 0.515 | 0.029
WP24 | 0.25 | 0.000 | 0.000 | 0.286 | 0.063 | 0.000 | 0.012 | 0.066 | 0.248 | 0.042 | 0.093 | 0.010
WP25 | 0.25 | 0.000 | 0.000 | 0.261 | 0.056 | 0.000 | 0.033 | 0.052 | 0.216 | 0.020 | 0.612 | 0.018
WP26 | 0.25 | 0.000 | 0.000 | 0.571 | 0.100 | 0.000 | 0.014 | 0.352 | 0.376 | 0.034 | 0.116 | 0.005
WP27 | 0.25 | 0.000 | 0.000 | 0.692 | 0.111 | 0.000 | 0.022 | 0.508 | 0.404 | 0.359 | 0.583 | 0.191
WP28 | 0.25 | 0.000 | 0.000 | 0.375 | 0.091 | 0.000 | 0.025 | 0.052 | 0.350 | 0.134 | 0.480 | 0.081
WP29 | 0.25 | 0.000 | 0.000 | 0.546 | 0.091 | 0.000 | 0.021 | 0.320 | 0.350 | 0.438 | 0.833 | 0.221
WP30 | 0.25 | 0.000 | 0.000 | 0.250 | 0.077 | 0.000 | 0.048 | 0.000 | 0.305 | 0.125 | 0.309 | 0.170

Table 10 shows a wide range of scores for the correlation coefficient. This is mainly due to the fact that the correlation depends not only on the strategy implemented but also on the amount and quality of the background data. For example, metrics based on vectors are often good; however, they are not successful in our case because the terms examined are not extracted from high quality corpora. A general purpose dictionary (WordNet) does not contain many biomedical terms, so these functions are not precise in this case.
Table 10
Correlation between different computational methods and human judgments for the second dataset.

Metrics | Correlation
HSO | 0.701
JCN | 0.111
WUP | 0.483
PATH | 0.753
LIN | 0.077
LESK | 0.712
RES | 0.106
LCH | 0.687
Vector_pairs | 0.351
Vector | 0.289
HH(DE) | 0.885

Table 11
Pearson correlations between different approaches and the human expert opinion.

Similarity function (metric) | Correlation
Vector [21] | 0.76
LIN [21] | 0.69
J&C [21] | 0.55
RES [21] | 0.55
Path [21] | 0.48
L&C [21] | 0.47
PATH [1] | 0.818
L&C [1] | 0.833
W&P [1] | 0.778
C&K [1] | 0.702
Metrics proposed in [1] | 0.836
HH(DE) | 0.885

Our HH(DE) assigns a coefficient to each metric to modify the importance of each metric in the global system and avoid negative results.

Finally, Table 11 summarizes all results related to the second dataset. As can be seen, our approach improves on every other similarity function applied over the same word dataset. There are several reasons which can explain this good behaviour. For example, the approaches presented in [21] are limited by the fact that a unique ontology is exploited. According to Sánchez et al. [26], these results rely completely on the coverage and completeness of the input ontology. On the other hand, the metrics proposed in [1] are nearer to our results, since their strategies consist of experimentally optimized parameters for the evaluated dataset. However, even compared against those metrics, our hyper-heuristic obtains better results.
5. Conclusions and future work
In this work, we have presented a novel approach that surpasses existing similarity functions when dealing with datasets
from the biomedical domain. The novelty of our work consists of
using other similarity functions as black boxes which are smartly
combined. This allows our HH(DE) to make use of the best features
from every similarity function.
Applying an evolutionary algorithm to this kind of problem is itself a novel contribution. Furthermore, it provides the best similarity scores (see Section 4) when compared against other relevant works published in the literature.
As future work, we propose to explore further possibilities for synonym recognition in other domains, especially those in which good synonym dictionaries do not exist. We also intend to improve our fitness function to make it domain independent.
References
[1] H. Al-Mubaid, H.A. Nguyen, Measuring semantic similarity between
biomedical concepts within multiple ontologies, IEEE Transactions on
Systems, Man, and Cybernetics, Part C 39 (4) (2009) 389–398.
[2] S. Banerjee, T. Pedersen, Extended gloss overlaps as a measure of semantic
relatedness, IJCAI (2003) 805–810.
[3] D. Bollegala, Y. Matsuo, M. Ishizuka, Measuring semantic similarity between
words using web search engines, in: Proceedings of the World Wide Web
Conference, 2007, pp. 757–766.
[4] A. Budanitsky, G. Hirst, Evaluating wordnet-based measures of lexical
semantic relatedness, Computational Linguistics 32 (1) (2006) 13–47.
[5] P.-I. Chen, S.-J. Lin, Word AdHoc Network: using Google core distance to extract the most relevant information, Knowledge-Based Systems 24 (3) (2011) 393–405.
[6] D. Conrath, J. Jiang, Semantic similarity based on corpus statistics and lexical
taxonomy, in: Comp. Linguist Proc., Taiwan, 1997, pp. 19–33.
[7] S. Das, P.N. Suganthan, Differential evolution: a survey of the state-of-the-art,
IEEE Transaction on Evolutionary Computation 15 (1) (2011) 4–31.
[8] J. Demsar, Statistical comparison of classifiers over multiple data sets, Journal
of Machine Learning Research 7 (2006) 1–30.
[9] F.A. Grootjen, T.P. van der Weide, Conceptual query expansion, Data
Knowledge Engineering 56 (2) (2006) 174–193.
[10] A.Y. Halevy, A. Rajaraman, J.J. Ordille, Data integration: the teenage years,
VLDB (2006) 9–16.
[11] G. Hirst, D. St-Onge, Lexical chains as representations of context for the
detection and correction of malapropisms, in: C. Fellbaum (Ed.), WordNet: An
Electronic Lexical Database, MIT Press, 1998.
[12] J. Hu, R.S. Kashi, G.T. Wilfong, Comparison and classification of documents
based on layout similarity, Information Retrieval 2 (2/3) (2000) 227–243.
[13] A. Java, S. Nirenburg, M. McShane, T.W. Finin, J. English, A. Joshi, Using a
natural language understanding system to generate semantic web content,
International Journal on Semantic Web and Information Systems 3 (4) (2007)
50–74.
[14] E. Kaufmann, A. Bernstein, Evaluating the usability of natural language query
languages and interfaces to Semantic Web knowledge bases, Journal of Web
Semantics 8 (3) (2010) 377–393.
[15] C. Leacock, M. Chodorow, G.A. Miller, Using corpus statistics and wordnet
relations for sense identification, Computational Linguistics 24 (1) (1998) 147–
165.
[16] M. Lesk, Information in data: using the Oxford English dictionary on a
computer, SIGIR Forum 20 (1–4) (1986) 18–21.
[17] D. Lin, An information-theoretic definition of similarity, in: Int. Conf. ML Proc.,
San Francisco, CA, USA, 1998, pp. 296–304.
[18] C.D. Manning, H. Schütze, Foundations of Statistical Natural Language
Processing, MIT Press, Cambridge, Massachusetts, 1999.
[19] E. Mezura-Montes, J. Velázquez-Reyes, C.A. Coello-Coello, A comparative study
of differential evolution variants for global optimization, in: Proceedings of the
8th Annual Conference on Genetic and Evolutionary Computation (GECCO ‘06),
ACM, New York, NY, USA, 2006, pp. 485–492.
[20] T. Pedersen, S. Patwardhan, J. Michelizzi, WordNet::Similarity – measuring the relatedness of concepts, Association for the Advancement of Artificial Intelligence (2004) 1024–1025.
[21] T. Pedersen, S. Pakhomov, S. Patwardhan, C.G. Chute, Measures of semantic
similarity and relatedness in the biomedical domain, Journal of Biomedical
Informatics 40 (3) (2007) 288–299.
[22] G. Pirró, A semantic similarity metric combining features and intrinsic
information content, Data Knowledge Engineering 68 (11) (2009) 1289–1308.
[23] P. Resnik, Using information content to evaluate semantic similarity in a
taxonomy, IJCAI (1995) 448–453.
[24] D. Sánchez, M. Batet, D. Isern, Ontology-based information content computation, Knowledge-Based Systems 24 (2) (2011) 297–303.
[25] D. Sánchez, M. Batet, D. Isern, A. Valls, Ontology-based semantic similarity: a
new feature-based approach, Expert Systems with Applications 39 (9) (2012)
7718–7728.
[26] D. Sánchez, A. Solé-Ribalta, M. Batet, F. Serratosa, Enabling semantic similarity
estimation across multiple ontologies: an evaluation in the biomedical
domain, Journal of Biomedical Informatics 45 (1) (2012) 141–155.
[27] N. Shadbolt, T. Berners-Lee, W. Hall, The semantic web revisited, IEEE
Intelligent Systems 21 (3) (2006) 96–101.
[28] J.S. Simonoff, Smoothing Methods in Statistics, Springer, 1996.
[29] R. Storn, K. Price, Differential Evolution – A Simple and Efficient Adaptive
Scheme for Global Optimization Over Continuous Spaces, TR-95-012,
International Computer Science Institute, Berkeley, 1995.
[30] Z. Wu, M. Palmer, Verb semantics and lexical selection, in: Assoc. Comput.
Linguist Proc., Las Cruces, NM, USA, 1994, pp. 133–138.
[31] D. Zaharie, A comparative analysis of crossover variants in differential
evolution, in: Proceedings of the International Multiconference on Computer
Science and Information Technology, Wisla, Poland, 2007, pp. 171–181.

