How to get your data without technical knowledge
In this time of digitalization, it is quite safe to say that every company has some type of database. It could be a small database that stores some contact information, or a big one filled with huge amounts of data used by different teams in the company. It doesn’t matter if the database contains only a few entries or thousands: if you don’t speak “database,” you need help to query it for the data you would like to receive.
Depending on the size of the company you are working in, getting this data can be quite complicated and time-consuming – and you never know how many people are in between you and the answer, or how they will interpret your request. How easy it is to misinterpret a simple query! Take, for example, the following request: “Give me all cinemas in German cities with parking lots.” It is likely that the requester is searching for cinemas with parking lots (“Give me all cinemas…with parking lots”), but it’s possible that the requester means that the parking lot can be anywhere in the city (i.e. cinemas in German cities with public parking lots). If there is no clarification between the requester and the technical person whose job it is to query the database, this can lead to wrong data, and it’s probable that the requester won’t even realize the request was misunderstood. The scenario becomes even more problematic when domain-specific language is involved. For instance, you are researching scenes which were filmed using a fisheye lens (ultra-wide angle). If you ask for “all scenes using fish eye,” you may get your wide-angle scenes, or you may find yourself with a lot of fishy scenes, depending on the expertise of the database technician.
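To make the two readings of the cinema request concrete, here is a minimal sketch in Python using the standard sqlite3 module. The schema, the column names (has_parking_lot, has_public_parking) and all rows are invented for illustration; a real database would look different.

```python
import sqlite3

# Hypothetical schema and rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE city (
    id INTEGER PRIMARY KEY, name TEXT, country TEXT,
    has_public_parking INTEGER
);
CREATE TABLE cinema (
    id INTEGER PRIMARY KEY, name TEXT,
    city_id INTEGER REFERENCES city(id),
    has_parking_lot INTEGER
);
INSERT INTO city VALUES (1, 'Berlin', 'Germany', 1), (2, 'Bonn', 'Germany', 0);
INSERT INTO cinema VALUES
    (1, 'Kino A', 1, 1), (2, 'Kino B', 1, 0), (3, 'Kino C', 2, 1);
""")

# Reading 1: the cinema itself has a parking lot.
CINEMA_HAS_PARKING = """
SELECT cinema.name FROM cinema
JOIN city ON city.id = cinema.city_id
WHERE city.country = 'Germany' AND cinema.has_parking_lot = 1
ORDER BY cinema.name
"""

# Reading 2: the city has public parking lots somewhere.
CITY_HAS_PARKING = """
SELECT cinema.name FROM cinema
JOIN city ON city.id = cinema.city_id
WHERE city.country = 'Germany' AND city.has_public_parking = 1
ORDER BY cinema.name
"""

reading_1 = [row[0] for row in conn.execute(CINEMA_HAS_PARKING)]
reading_2 = [row[0] for row in conn.execute(CITY_HAS_PARKING)]
print(reading_1)  # ['Kino A', 'Kino C']
print(reading_2)  # ['Kino A', 'Kino B']
```

Both queries are valid answers to the same sentence, yet they return different cinemas, and nothing in the result tells the requester which reading was used.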
Natural language interfaces (NLIs) for databases try to solve both issues: the time-consuming process that involves several people, and the gap between the technical person and the requesters with their specific knowledge of the requested topic. NLIs are comparable to search engines, but specific to your company’s database: the requester can write “all movies” into an input field, and the NLI translates this request into an SQL query (or a query in any other technical query language), here “SELECT movie.title FROM movie”, and displays the result (e.g. a table with all movie titles). Today, almost all NLIs also show the requester how they understood the question (e.g. the SQL query or some graphic representation) and offer ways to adapt the interpretation (e.g. to decide whether the filter “with parking lots” applies to the “German cities” or to the “cinemas”).
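The following toy sketch illustrates only the round trip described above: request in, SQL and result out. The lookup table KNOWN_REQUESTS and the function answer are hypothetical names, and a real NLI parses the request rather than looking it up; the movie table and its rows are likewise assumptions.

```python
import sqlite3

# Hypothetical one-table schema with two illustrative rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE movie (id INTEGER PRIMARY KEY, title TEXT);
INSERT INTO movie (title) VALUES ('Sin City'), ('Pulp Fiction');
""")

# Toy stand-in for the translation step: known phrases mapped to SQL.
KNOWN_REQUESTS = {
    "all movies": "SELECT movie.title FROM movie",
}


def answer(request: str):
    """Return (sql, rows) so the requester can check the interpretation."""
    sql = KNOWN_REQUESTS.get(request.strip().lower())
    if sql is None:
        raise ValueError(f"cannot interpret request: {request!r}")
    rows = [row[0] for row in conn.execute(sql)]
    return sql, rows


sql, titles = answer("All movies")
print("Interpreted as:", sql)
print("Result:", sorted(titles))
```

Returning the SQL together with the rows mirrors the point made above: the requester sees how the question was understood, not only the answer.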
In addition, for some reason, many NLIs cannot handle negations. In the research benchmarks, this type of request is often not present at all, or only in such small numbers that it is ignored because the impact is too low. In real settings, such as actual companies, I cannot imagine that negations of any kind never appear in queries to the database. Even for movies, negations can be important. For example, Quentin Tarantino has his own style of movies, and if you are not a fan, you probably want to be able to search for “movies not directed by Tarantino.”
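A short sketch shows what is at stake when the word “not” is dropped. The table layout (a single director column per movie) and the three rows are simplifying assumptions for illustration.

```python
import sqlite3

# Hypothetical schema; the rows are a tiny illustrative sample.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE movie (title TEXT, director TEXT);
INSERT INTO movie VALUES
    ('Pulp Fiction', 'Quentin Tarantino'),
    ('Alien', 'Ridley Scott'),
    ('Amelie', 'Jean-Pierre Jeunet');
""")

# "movies directed by Tarantino"
DIRECTED_BY = "SELECT title FROM movie WHERE director = ? ORDER BY title"
# "movies not directed by Tarantino": only the comparison operator changes.
NOT_DIRECTED_BY = "SELECT title FROM movie WHERE director <> ? ORDER BY title"

name = ("Quentin Tarantino",)
with_tarantino = [row[0] for row in conn.execute(DIRECTED_BY, name)]
without_tarantino = [row[0] for row in conn.execute(NOT_DIRECTED_BY, name)]
print(with_tarantino)     # ['Pulp Fiction']
print(without_tarantino)  # ['Alien', 'Amelie']
```

An interface that silently ignores the negation returns the first list when the requester asked for the second, which is the exact opposite of what was wanted.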
And last but not least, there is a difference between technical language and natural language. The request for “all movies with the same genres as Sin City” sounds easy enough, but to formulate it in a technical language, you need to create multiple nested subqueries, a painful and error-prone task. There are solutions for that too: instead of putting everything into a single question, the requester is asked to pose only simple questions but is allowed to ask follow-up questions based on the previous result. For example, you would first ask for the genres of Sin City, and then ask a follow-up question for movies with the genres found.
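The sketch below contrasts the single nested query with the two-step follow-up approach. It rests on several assumptions: the schema (movie, movie_genre), the placeholder titles, the genre assignments, and the reading of “same genres” as “shares at least one genre”. On this toy schema one subquery is enough; a real schema may need more nesting.

```python
import sqlite3

# Hypothetical schema; titles and genre assignments are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE movie (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE movie_genre (movie_id INTEGER REFERENCES movie(id), genre TEXT);
INSERT INTO movie VALUES
    (1, 'Sin City'), (2, 'Movie X'), (3, 'Movie Y'), (4, 'Movie Z');
INSERT INTO movie_genre VALUES
    (1, 'crime'), (1, 'thriller'), (2, 'crime'),
    (3, 'comedy'), (4, 'thriller'), (4, 'comedy');
""")

# One question, one nested query.
NESTED = """
SELECT DISTINCT m.title FROM movie m
JOIN movie_genre g ON g.movie_id = m.id
WHERE g.genre IN (
    SELECT g2.genre FROM movie_genre g2
    JOIN movie m2 ON m2.id = g2.movie_id
    WHERE m2.title = ?
) AND m.title <> ?
ORDER BY m.title
"""
nested = [row[0] for row in conn.execute(NESTED, ("Sin City", "Sin City"))]

# Two simple questions: first the genres, then a follow-up on that result.
genres = [row[0] for row in conn.execute(
    "SELECT g.genre FROM movie_genre g JOIN movie m ON m.id = g.movie_id "
    "WHERE m.title = ? ORDER BY g.genre", ("Sin City",))]
placeholders = ", ".join("?" for _ in genres)
FOLLOW_UP = f"""
SELECT DISTINCT m.title FROM movie m
JOIN movie_genre g ON g.movie_id = m.id
WHERE g.genre IN ({placeholders}) AND m.title <> ?
ORDER BY m.title
"""
two_step = [row[0] for row in conn.execute(FOLLOW_UP, (*genres, "Sin City"))]

print(genres)    # ['crime', 'thriller']
print(nested)    # ['Movie X', 'Movie Z']
print(two_step)  # ['Movie X', 'Movie Z']
```

Both routes return the same movies; the follow-up route simply lets the requester inspect the intermediate result (the genres) before asking the second question.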
In recent years, researchers have been trying to train neural networks to translate natural language into technical languages (e.g. SQL). The idea behind this approach is that neural networks can learn to handle the verbosity and ambiguity of natural language. In addition, they show promising results in handling more complex questions with multiple subqueries. One of the big challenges for neural networks is the lack of training data. This leads to the idea of hybrid approaches: enhancing traditional NLIs with neural networks to improve their robustness to natural language input.
NLIs are usually able to help a lot (they are so helpful, in fact, that even database people are starting to use them to reduce the time they spend querying), but there are still some challenges to solve, because natural language is verbose and ambiguous. Clarification dialogues are the usual tactic for these challenges, but the NLI first needs to recognize that a clarification is needed, which is not as easy as it sounds. Academic researchers usually focus on a single issue or challenge (e.g. clarification or subqueries). This shows us something interesting: all of these problems are solvable; we just need to combine the individual solutions into a single one. And don’t forget about negations while doing so!