How to Search Your Predictive Analytics Data

By Anasse Bari, Mohamed Chaouchi, Tommy Jung

To use your predictive analytics data, you need to know how to find the information you want. Two main ideas come into play when you prepare to search your data for predictive analytics:

  • Getting ready to go beyond the basic keyword search

  • Making your data semantically searchable

How to use keyword-based search in predictive analytics

Imagine that you’re tasked with searching large amounts of data. One way to approach the problem is to issue a search query that consists (obviously) of words. The search tool looks for matching words in the database, in the data warehouse, or in any other text in which your data resides.

Assume you’re issuing the following search query: the President of the United States visits Africa. The search results will consist of text that contains one or more of the words President, United States, visits, and Africa. You might get the exact information you’re looking for, but not always.

What about documents that contain none of those words but instead contain something like the following: Obama’s trip to Kenya?

None of the words you initially searched for are in there, but the search result is semantically (meaningfully) useful. How can you prepare your data to be semantically retrievable? How can you go beyond the traditional keyword search? Your answers can be found if you continue reading.
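The limitation described above can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular search tool works: the function names, the tiny stop-word list, and the two sample documents are assumptions made for the example.

```python
import re

def tokenize(text):
    """Lowercase the text and return its words as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def keyword_search(query, documents):
    """Return the documents that share at least one query word.

    Very common words are ignored so they do not cause trivial matches.
    The stop-word list is a small, hypothetical one for this sketch.
    """
    stop_words = {"the", "of", "to", "a", "in"}
    query_terms = tokenize(query) - stop_words
    return [doc for doc in documents if query_terms & tokenize(doc)]

documents = [
    "The President of the United States visits Africa next week.",
    "Obama's trip to Kenya",
]

results = keyword_search(
    "the President of the United States visits Africa", documents
)
for doc in results:
    print(doc)
```

Only the first document is returned. The second one is relevant to the query but shares no keyword with it, so a purely word-matching search never sees it.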

How to use semantic-based searches in predictive analytics

An illustration of how semantic-based search works is a project that Anasse Bari led at the World Bank Group, an international organization whose primary mission is to fight poverty around the world.

The project aimed to investigate the large-scale enterprise search and analytics tools on the market and to build a prototype of a cutting-edge framework that would organize the World Bank’s data, most of which was an unstructured collection of documents, publications, project reports, briefs, and case studies.

This massive body of valuable knowledge is a resource used toward the Bank’s main mission of reducing world poverty. But because it’s unstructured, it is challenging to access, capture, share, understand, search, data-mine, and visualize.

The World Bank is an immense organization, with many divisions around the globe. One of the main divisions that was striving to have such a framework, and that was ready to allocate resources to assist the Bari team, was the Human Development Network.

The vice president of the Human Development Network outlined one problem that sprang from ambiguity: His division used several terms and concepts that had the same overall meaning but different nuances.

For instance, terms such as climatology, climate change, gas ozone depletion, and greenhouse emissions were all semantically related but not identical in meaning. He wanted a search capability smart enough to extract documents that contained related concepts when someone searched for any of these terms.

The framework that the Bari team selected for the prototype of that capability was the Unstructured Information Management Architecture (UIMA), a software-based solution. Originally designed by IBM Research, UIMA is available in IBM software such as IBM Content Analytics, one of the tools that powered IBM Watson, the famous computer that won on the quiz show Jeopardy!

The Bari team joined forces with a very talented team from IBM Content Management and Enterprise Search, and later with an IBM Watson team, to collaborate on this project.

An Unstructured Information Management (UIM) solution is a software system that analyzes large volumes of unstructured information (text, audio, video, images, and so on) to discover, organize, and deliver relevant knowledge to the client or the application end user.

The ontology of a domain is a collection of concepts and related terms particular to that domain. A UIMA-based solution uses ontologies to provide semantic tagging, which allows enriched searching independent of data format (text, speech, PowerPoint presentation, e-mail, video, and so on). UIMA appends another layer to the captured data, and then adds metadata to identify data that can be structured and semantically searched.
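The idea of semantic tagging can be sketched in plain Python. This is not the UIMA API, only a hedged illustration of adding a metadata layer on top of raw text. The `Annotation` class, the `tag` function, and the concept label "climate" are hypothetical names chosen for the example, and the ontology holds a few of the climate terms mentioned earlier.

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    concept: str  # ontology concept the matched term belongs to
    term: str     # the term as listed in the ontology
    begin: int    # character offset where the match starts
    end: int      # character offset just past the match

# Hypothetical one-concept ontology for illustration only.
ONTOLOGY = {
    "climate": ["climatology", "climate change", "greenhouse emissions"],
}

def tag(text, ontology=ONTOLOGY):
    """Return one annotation per occurrence of an ontology term in text."""
    lowered = text.lower()
    annotations = []
    for concept, terms in ontology.items():
        for term in terms:
            start = lowered.find(term)
            while start != -1:
                annotations.append(
                    Annotation(concept, term, start, start + len(term))
                )
                start = lowered.find(term, start + 1)
    # Sort by position so the tags read in document order.
    return sorted(annotations, key=lambda a: a.begin)

text = "The report links greenhouse emissions to climate change."
annotations = tag(text)
for a in annotations:
    print(a.concept, "->", text[a.begin:a.end])
```

The original text is left untouched. The annotations sit beside it as metadata, so a search for the concept can find this sentence even when the query uses a different term from the same concept.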

Semantic search is based on the contextual meaning of search terms as they appear in the searchable data space that UIMA builds. Semantic search is more accurate than the usual keyword-based search because a user query returns not only documents that contain the search terms but also documents that are semantically relevant to the query.

If you’re searching for biodiversity in Africa, a typical (keyword-based) search will return documents that have the exact words biodiversity and Africa. A UIMA-based semantic search will return not only the documents that have those two words, but also anything that is semantically relevant to “biodiversity in Africa”: documents that contain such combinations of words as “plant resources in Africa,” “animal resources in Morocco,” or “genetic resources in Zimbabwe.”
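The biodiversity example can be approximated with a simple query-expansion sketch. This is not how UIMA or IBM Content Analytics is implemented. It is a minimal Python illustration that assumes a hand-written ontology mapping each concept to related terms, and the names `ONTOLOGY`, `expand`, and `semantic_search` are hypothetical.

```python
# Hypothetical ontology: each concept maps to terms treated as related.
ONTOLOGY = {
    "biodiversity": ["plant resources", "animal resources", "genetic resources"],
    "africa": ["morocco", "zimbabwe"],
}

def expand(concept):
    """Return the concept itself plus its related terms."""
    return [concept] + ONTOLOGY.get(concept, [])

def semantic_search(concepts, documents):
    """Return documents that mention every concept or one of its related terms."""
    hits = []
    for doc in documents:
        text = doc.lower()
        if all(any(term in text for term in expand(c)) for c in concepts):
            hits.append(doc)
    return hits

documents = [
    "Biodiversity in Africa",
    "Plant resources in Africa",
    "Animal resources in Morocco",
    "Genetic resources in Zimbabwe",
    "Road construction in Peru",
]

hits = semantic_search(["biodiversity", "africa"], documents)
for doc in hits:
    print(doc)
```

The first four documents are returned, although only the first contains both original words. Simple substring matching is used here for brevity; a real system would rely on proper text analysis rather than raw string containment.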

Through semantic tagging and use of ontologies, information becomes semantically retrievable, independent of the language or the medium in which the information was created (Word, PowerPoint, e-mail, video, and so on). This solution provides a single hub where data can be captured, organized, exchanged, and rendered semantically retrievable.

Open-source (freely available) dictionaries of synonyms and related terms exist, or you can develop your own dictionaries specific to your domain or your data. You can build a spreadsheet with the root word and its corresponding related words, synonyms, and broader terms. The spreadsheet can be uploaded into a search tool such as IBM Content Analytics (ICA) to power the enterprise search and content analytics.
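As a rough sketch of what such a spreadsheet might look like once exported to CSV, the Python below reads one into a dictionary keyed by root word. The column names, the semicolon separator inside cells, and the sample rows are all assumptions for illustration. They do not reflect the upload format that ICA or any other tool expects.

```python
import csv
import io

# Hypothetical spreadsheet export: one row per root word,
# multiple terms in a cell separated by semicolons.
SPREADSHEET_CSV = """root,synonyms,related_terms,broader_terms
climate change,global warming,greenhouse emissions,climatology
biodiversity,,plant resources;animal resources;genetic resources,
"""

def load_dictionary(csv_text):
    """Map each root word to the list of all terms associated with it."""
    dictionary = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        terms = []
        for column in ("synonyms", "related_terms", "broader_terms"):
            # Skip empty cells and strip stray whitespace around terms.
            terms.extend(t.strip() for t in row[column].split(";") if t.strip())
        dictionary[row["root"].strip().lower()] = terms
    return dictionary

dictionary = load_dictionary(SPREADSHEET_CSV)
for root, terms in dictionary.items():
    print(root, "->", terms)
```

Keeping the dictionary in a spreadsheet means that domain experts can maintain it without touching code, while the search side only needs a small loader like this one.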