“Let’s test how accurate Athento’s Text Analyzer is, compared to Apache’s Stanbol, which is used by suppliers of document management systems, such as Nuxeo.”
Over the past year, semantics has been gaining ground in document management. Its importance lies in the ability to build relationships between documents, which in turn paves the way to building knowledge. Put another way, it means transforming unstructured data into information that you can use to understand a system, the business, a project, processes, etc. At Yerbabuena, we’ve been talking about semantics and intelligent technologies for two years, and these days other DMS providers are beginning to give these technologies more and more importance. Nuxeo is one of the companies that has taken an important step, with its “Semantic Entities” package, which uses the Apache Stanbol semantic engine to find the names of people, places and organizations and then links them to their respective entries in DBpedia. This Nuxeo plug-in displays each entity found with an image (for example, a photo of the person named, or a flag) and provides a link to all the documents in which that entity appears.
Athento contains a similar semantic module, though a more advanced one: it can find any term deemed important within the context of a document’s text and convert those terms into tags, which let us relate documents that share a theme or have information in common. (Click here to see an example of Auto-tagging in action, as it manages résumés.)
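To make the idea of relating documents through tags more concrete, here is a minimal sketch in Python. It is not Athento’s implementation: the file names, the tags and the `related_documents` helper are all invented for illustration. It simply treats two documents as related when they share at least one tag.

```python
from collections import defaultdict

# Hypothetical documents and the tags an auto-tagger might have assigned to them.
DOC_TAGS = {
    "resume_ana.pdf": {"java", "project management", "madrid"},
    "resume_luis.pdf": {"java", "ocr"},
    "invoice_0001.pdf": {"invoice", "madrid"},
}


def related_documents(doc_tags):
    """Map each document to the other documents that share at least one tag."""
    # Inverted index: tag -> set of documents carrying that tag.
    by_tag = defaultdict(set)
    for doc, tags in doc_tags.items():
        for tag in tags:
            by_tag[tag].add(doc)

    # Every document is related to all others found under any of its tags.
    related = {doc: set() for doc in doc_tags}
    for docs in by_tag.values():
        for doc in docs:
            related[doc] |= docs - {doc}
    return related


RELATED = related_documents(DOC_TAGS)
```

In this toy data set, the first résumé ends up related to both other files (through “java” and “madrid”), while the invoice is related only to the first résumé.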
We wanted to test the accuracy of Athento’s text analyzer against the Stanbol semantic engine when extracting data from documents. To do that, we uploaded the same PDF document to a Nuxeo and Stanbol configuration and to a Nuxeo and Athento configuration. The document, “Seis Pasos Para Liberar A Mi Empresa Del Papel” (“Six Steps to Free My Company from Paper”), is in Spanish; many of our Spanish readers will already know its content, since it is published as an entry on our Spanish blog. The document is about digitalization. For most people, the most relevant terms in the text would center on these concepts:
document management, project, costs, digitalization, investment, business, paper, expenses, documents, capture, benefits, software, hardware, OCR, scanner, information, digital, distributed, documentation, extract
We wanted to see how effective both Stanbol and Athento would be in extracting these words from the text, and, to be honest, the results were fairly surprising:
Words identified by Stanbol: “como” [as] and “espaa”.
We assumed that the first term came up because of the number of times it appears in the text, and that the second ought to be “España” [Spain], but because of some character-encoding issue, the system only extracted “espaa”.
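We don’t know exactly what went wrong inside the pipeline, but one common way for “España” to turn into “espaa” is a step that lowercases the text and silently discards every non-ASCII character. The sketch below reproduces that behaviour in Python; the `strip_non_ascii` function is hypothetical and only illustrates the suspected cause.

```python
def strip_non_ascii(text):
    """Lowercase the text and silently drop every character outside ASCII."""
    return text.lower().encode("ascii", errors="ignore").decode("ascii")


# "\u00f1" is the precomposed letter "ñ"; it has no ASCII equivalent,
# so the "ignore" error handler removes it instead of raising an error.
EXTRACTED = strip_non_ascii("Espa\u00f1a")
print(EXTRACTED)
```

Run as written, this prints “espaa”, the same truncated term that appeared in our test.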
Words identified by Athento: 77
| Keyword | Found by Stanbol? | Found by Athento? |
|---|---|---|
| document management | no | yes |
| digitalization | no | yes |
| paper | no | yes |
| documents | no | yes |
| software | no | no |
| scanner | no | yes |
| distributed | no | yes |
| project | no | yes |
| investment | no | no |
| expenses | no | no |
| capture | no | yes |
| hardware | no | no |
| information | no | yes |
| documentation | no | yes |
| costs | no | no |
| business | no | yes |
| ICR | no | no |
| benefits | no | yes |
| OCR | no | no |
| digital | no | yes |
| extract | no | yes |
Athento’s hit rate: 61.9%
Stanbol’s hit rate: 0%
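For readers who want to redo the arithmetic, here is a small Python sketch. It assumes the hit rate is simply the number of expected keywords found divided by the number of keywords checked; the post does not spell out the formula, so treat that as an assumption. The `ATHENTO` dictionary copies the “Found by Athento?” column of the table above, and `hit_rate` is an illustrative helper name.

```python
# "Found by Athento?" column, copied row by row from the results table.
ATHENTO = {
    "document management": True, "digitalization": True, "paper": True,
    "documents": True, "software": False, "scanner": True,
    "distributed": True, "project": True, "investment": False,
    "expenses": False, "capture": True, "hardware": False,
    "information": True, "documentation": True, "costs": False,
    "business": True, "ICR": False, "benefits": True,
    "OCR": False, "digital": True, "extract": True,
}

# Stanbol found none of the expected keywords.
STANBOL = dict.fromkeys(ATHENTO, False)


def hit_rate(found):
    """Percentage of expected keywords that the engine actually found."""
    return 100 * sum(found.values()) / len(found)


print(f"Athento: {hit_rate(ATHENTO):.1f}%")
print(f"Stanbol: {hit_rate(STANBOL):.1f}%")
```

Tallied this way, the table as printed gives 14 of 21 keywords for Athento, about 66.7%, whereas the 61.9% quoted above corresponds to 13 of 21; one row may have been recorded differently when the figure was calculated. Stanbol’s 0% matches either way.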