Artezio has developed a system called Sea-ART for advanced information search on the Internet. It supports the following tasks:
1. It lets a user build flexible queries from any of the following inputs:
- a text string
- formatted text
- a file
- a link to a resource whose content is similar to what is being sought
2. It supports training of search campaigns: while manually reviewing intermediate search results, a user can rate the subjective relevance of each result, which improves the quality of the results at each subsequent search iteration.
3. It provides search campaign management, including:
- viewing the history of search results (cached copies of the results)
- viewing direct links to the sources found
- running several campaigns simultaneously
- classifying search results, including excluding undesirable categories
4. It notifies the user when the required information has been located.
The core of the system is the Apache Solr search platform. In particular, the system relies on a Bayesian text classification algorithm, modified for the needs of the project, which has proved to be one of the best probabilistic methods for comparing data files of any kind, including texts. In the first step, the search mechanism receives a raw information flow from sources such as:
- Google search results
- Corporate e-mail communications
- Social network news feed
- Other sources
In the subsequent stages, objects in the input information flow are compared with those designated as samples, and a relevance level is calculated for each of them. As a result, the user receives search results that are much more relevant than those returned by search engines by default.
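The comparison of incoming objects with user-designated samples can be illustrated with a textbook naive Bayes classifier. The sketch below is not Sea-ART's modified algorithm, whose details are not described here. It is a minimal two-class multinomial naive Bayes with add-one smoothing, and the names `NaiveBayesRelevance`, `train` and `relevance` are hypothetical, chosen only for this illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesRelevance:
    """Illustrative two-class multinomial naive Bayes (not the Sea-ART code)."""

    LABELS = ("relevant", "irrelevant")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = {label: 0 for label in self.LABELS}

    def train(self, text, label):
        """Record one sample rated by the user as 'relevant' or 'irrelevant'."""
        self.word_counts[label].update(tokenize(text))
        self.doc_counts[label] += 1

    def _log_score(self, tokens, label, vocab_size):
        total_words = sum(self.word_counts[label].values())
        # Log prior: share of training samples that carry this label.
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for token in tokens:
            # Add-one smoothing keeps unseen words from zeroing the product.
            count = self.word_counts[label][token] + 1
            score += math.log(count / (total_words + vocab_size))
        return score

    def relevance(self, text):
        """Return the estimated P(relevant | text) as a float in [0, 1]."""
        if 0 in self.doc_counts.values():
            raise ValueError("need at least one sample of each label")
        vocab = set(self.word_counts["relevant"]) | set(self.word_counts["irrelevant"])
        tokens = tokenize(text)
        rel = self._log_score(tokens, "relevant", len(vocab))
        irr = self._log_score(tokens, "irrelevant", len(vocab))
        # Normalise in log space to avoid floating-point underflow.
        top = max(rel, irr)
        exp_rel, exp_irr = math.exp(rel - top), math.exp(irr - top)
        return exp_rel / (exp_rel + exp_irr)
```

In this sketch, each manual rating from a campaign training session would become one `train` call, and each object in the incoming flow would be ranked by the value of `relevance`, so the ranking shifts as more ratings accumulate.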
The Sea-ART architecture allows the system to be delivered either as a standalone solution or integrated with external systems (up to embedding it as a component in Java-based systems).
The scalability of the system also deserves special emphasis. Because it uses the Apache Hadoop distributed computing platform, the system is almost insensitive to growth in the volume of information to be analyzed. Simply adding hardware (or cloud) resources lets the system scale to any information flow without code changes.
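Scaling without code changes is possible in a map-reduce design because the result does not depend on how the data is partitioned. The sketch below illustrates that property in plain Python. It does not use the Hadoop API, and the functions `map_document`, `reduce_counts` and `count_words` are hypothetical names for this example. It counts words per label, which is the kind of aggregation a Bayesian classifier needs for training.

```python
from collections import defaultdict
from itertools import chain


def map_document(label, text):
    """Map step: emit ((label, word), 1) for every word of one labelled document."""
    for word in text.lower().split():
        yield (label, word), 1


def reduce_counts(pairs):
    """Reduce step: sum the values that share the same key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def count_words(partitions):
    """Process each partition independently, then merge the partial counts."""
    partials = []
    for partition in partitions:
        mapped = chain.from_iterable(
            map_document(label, text) for label, text in partition
        )
        # Each partition is reduced on its own, as a worker node would do.
        partials.append(reduce_counts(mapped))
    # The final merge reuses the same summing function on the partial results.
    return reduce_counts(chain.from_iterable(p.items() for p in partials))
```

Splitting the same documents into one partition or into many yields identical counts, so in a real cluster the number of workers can grow with the data while the map and reduce functions stay unchanged.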
We are planning further development of the product, including:
- creating separate components focused on search within narrow information flows (e-mail, social networks, CRM and EDMS)
- document clustering, i.e. splitting a set of documents into previously unknown, automatically determined topics.
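To make the idea of clustering into previously unknown topics concrete, here is a minimal sketch of one simple technique: single-pass clustering by cosine similarity, where the number of clusters is not fixed in advance. It is an illustration only and not the planned Sea-ART component. The function names and the `threshold` value of 0.3 are assumptions made for this example.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn a text into a bag-of-words term-frequency vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(documents, threshold=0.3):
    """Assign each document to its most similar cluster, or start a new one.

    Returns a list of clusters, each a list of document indices.
    """
    clusters = []  # each entry: {"centroid": Counter, "members": [indices]}
    for index, text in enumerate(documents):
        vec = vectorize(text)
        best, best_sim = None, threshold
        for candidate in clusters:
            sim = cosine(vec, candidate["centroid"])
            if sim >= best_sim:
                best, best_sim = candidate, sim
        if best is None:
            # Nothing is similar enough: this document opens a new topic.
            clusters.append({"centroid": Counter(vec), "members": [index]})
        else:
            # Summed term counts serve as the centroid; cosine ignores scale.
            best["centroid"].update(vec)
            best["members"].append(index)
    return [c["members"] for c in clusters]
```

A lower threshold merges documents into fewer, broader topics, while a higher one produces more, narrower topics. The outcome of a single-pass method also depends on document order, which is one reason production systems often prefer iterative algorithms.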