Martinez-Gil et al.
iiWAS ’17, December 4–6, 2017, Salzburg, Austria
Figure 2: This automatically generated heatmap should be interpreted in the following way: if the practitioner experiences
some smoke (especially black smoke), noise, and a possible power loss in a given vehicle, then a problem with the engine is
expected. For batteries, just smoke and noise are expected, whereas for mufflers, just smoke is expected.
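A heatmap of this kind can be read as a component-by-symptom matrix of co-occurrence counts. The following is a minimal sketch under stated assumptions, not the procedure used to generate the figure: the vocabularies, the snippets, and the helper names `cooccurrence` and `as_matrix` are hypothetical, and plain substring matching stands in for real term recognition.

```python
from collections import Counter

# Hypothetical vocabularies and text snippets, for illustration only.
COMPONENTS = ["engine", "battery", "muffler"]
SYMPTOMS = ["smoke", "noise", "power loss"]

SNIPPETS = [
    "black smoke and noise from the engine",
    "engine power loss with some smoke",
    "battery smoke and a buzzing noise",
    "smoke coming out of the muffler",
]


def cooccurrence(snippets, components, symptoms):
    """Count the snippets in which a component and a symptom appear together."""
    counts = Counter()
    for text in snippets:
        lowered = text.lower()
        for component in components:
            if component not in lowered:
                continue
            for symptom in symptoms:
                if symptom in lowered:
                    counts[(component, symptom)] += 1
    return counts


def as_matrix(counts, components, symptoms):
    """Arrange the counts as rows (components) by columns (symptoms)."""
    return [[counts[(c, s)] for s in symptoms] for c in components]


matrix = as_matrix(cooccurrence(SNIPPETS, COMPONENTS, SYMPTOMS), COMPONENTS, SYMPTOMS)
```

Each row of `matrix` corresponds to one component and each column to one symptom; higher counts would be drawn as darker cells in a heatmap.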
languages and styles. Another serious problem is word stemming,
i.e., although two different pieces of literature can refer to the same
mechanical component or prognosis method, these pieces can be
written using plurals, different tenses, slang words, and so on. For
this reason, it is necessary to develop methods that can identify
mechanical components and prognosis methods independently of
how they appear written in literature. In general, we propose to
avoid the processing of verbs (which can usually adopt a wide variety of forms) and focus on nouns that usually adopt a much more
features, so that it is possible to partly alleviate this problem.
However, there is still an issue concerning the use of very
common words in either the question or the answers. The
problem with common words is that they carry little
meaning, and therefore we have a list of common words to
• Degree of uncertainty on the accuracy of the contents. In
this work, we assume that the corpus to be analyzed is
quite likely to contain some errors or inaccurate
information about the facts to be discovered.
However, we foresee that the impact of these errors will
be blurred by the overwhelming presence of correct information.
• Language in which the information is represented. To overcome this limitation, we have decided to use only English in
this first version of our approach. Considering other
languages remains potential future work.
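The surface-form and common-word issues raised above can be illustrated with a small normalization step. This is only a sketch under stated assumptions: the stop list is a tiny hypothetical sample, `singularize` is a naive plural stripper rather than a real stemmer, and the function names are illustrative and not taken from the prototype.

```python
import re

# Tiny hypothetical stop list; a real system would use a much larger one.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to", "what", "with"}


def singularize(token):
    """Naively strip common English plural endings (not a full stemmer)."""
    if token.endswith("ies") and len(token) > 4:
        return token[:-3] + "y"
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token


def content_terms(text):
    """Lowercase the text, drop stop words, and normalize plural forms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [singularize(t) for t in tokens if t not in STOP_WORDS]
```

With this normalization, "mufflers" and "muffler" map to the same term, so a question and a candidate answer can be matched regardless of which form each one uses.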
We explain here the implementation decisions that we have taken
in order to achieve a prototype for testing our hypothesis. The most
important implementation details of our approach include:
• Limitations concerning the corpus size. With the emergence
of new paradigms for parallelization and big data management, this kind of problem is losing importance.
• Variability inherent to the processing of natural language.
It is widely assumed that meaning is usually represented by
nouns (and noun phrases), so it is common to build retrieval methods based on extracted noun representations.
For this reason, we have implemented some functionality to
avoid processing verbs and common stop words.
• Issues concerning domain nomenclature. The problem for
methods trying to exploit information extraction strategies
is that they should be adapted to each different domain. However, we are explicitly avoiding Token-based and Structural
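The noun-focused filtering described in the list above can be sketched as follows. The sketch assumes that the tokens have already been labeled by some part-of-speech tagger using Penn Treebank tags; the tagger itself is out of scope here, the tags in the example are assigned by hand, and `keep_nouns` is a hypothetical helper name.

```python
# Penn Treebank tags for common and proper nouns, singular and plural.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def keep_nouns(tagged_tokens):
    """Keep only the tokens tagged as nouns, lowercased; verbs and the rest are dropped."""
    return [word.lower() for word, tag in tagged_tokens if tag in NOUN_TAGS]


# Hand-tagged example sentence, for illustration only.
tagged = [
    ("The", "DT"),
    ("engine", "NN"),
    ("is", "VBZ"),
    ("emitting", "VBG"),
    ("black", "JJ"),
    ("smoke", "NN"),
]

nouns = keep_nouns(tagged)
```

Here only "engine" and "smoke" survive, which is the kind of representation a noun-based retrieval method would index.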
In order to validate our proposal, we have performed a set of experiments following the Question Answering style that our approach
needs in order to appropriately suggest prognosis activities in the mechanical domain. This means that we have retrieved ten samples
from the Stanford Question Answering Dataset (SQuAD) and
we wish to automatically solve these questions using our approach.
The questions that we have selected are those related to fault prognosis in the mechanical field. These questions simulate the situation
whereby a practitioner could ask themselves what is the way to proceed