Brmson

founder:	pasky
depends on:
interested:
software license:	-
hardware license:	-

status:	active

Our very own IBM Watson - approximated using open source technology and replicated in a non-supercomputer environment.

The goal is to build a system that can chew on a few open semantic databases, Wikipedia and Sbrm, and then answer general questions like “List the biggest nuclear explosions in Russia” or “When do I pay brmlab membership fees?” or “What was the best time travel movie?”. The primary language in this phase will be English. Then we can take things further - start supporting more advanced inference (“What resistance do I need in series with a random red LED on 5V?”), add some autonomous goal-based processing etc. Only the Strong AI is the limit!

We already have a working software stack with reasonable performance; we are currently focusing on reviewing it for bugs and wrapping it up for a milestone scientific paper publication. Speed has not been a focus so far. Accuracy on our 430-question trivia test set is a little above 30% as of Jan 2015.

We aim to do as little coding as possible, at least initially, instead focusing on integration of existing technologies. Most impressive initial results in the shortest time!

Homepage: http://ailao.eu/yodaqa/

Live demo: http://live.ailao.eu/

Pre-print of the first paper on brmson: http://pasky.or.cz/dev/brmson/yodaqa-poster2015.pdf

(Original story on Pasky's blog: http://log.or.cz/?p=317)

Status and Planning

A question-answering engine “YodaQA” (custom-made from the ground up) by Pasky is set up on his home server (AMD FX-8350, 24G RAM), together with an enwiki full-text index and DBpedia. It is connected to IRC and hangs out at #brmson, freenode.

All our code, incl. documentation and setup instructions, is open source and lives in the github brmson organization.

(Historical) Knowledge Base

Starting points:

The Free and Open Software Behind IBM’s Jeopardy Champion Watson; blogpost technology summary
The AI Behind Watson; nice high-level overview scientific article
This is Watson! Special issue of the IBM research journal on Watson; extremely detailed and technical
IBM Watson -- How to replicate Watson hardware and systems design for your own use in your basement
MiniDeepQA; technical summary of a student project replicating (in small scale) IBM Watson in cooperation with IBM Research!
SolrSherlock; another effort at recreating DeepQA, see also the qa-oss group; seems a bit unfocused and stalled to pasky, let's just watch them for now?

In bold are our current choices that we are running with.

Data Sources

All sources listed here must be freely available.

Structured:
- WordNet
- YAGO
- DBPedia
Unstructured:
- Wikipedia, Everything2 ?, Wiktionary
- TVTropes, Urban Dictionary
- News articles (theregister, /., bbc, cnn, reuters)
- Sbrm, laws, patents, …
- IRC logs, Bitcoin forums, transcripts (lectures, Tetra), …

Unstructured Data Sources Interfaces

UIMA http://uima.apache.org/
- Industry standard, used by the IBM Watson team as well
- Extremely unstructured data: Ubiquitous_Knowledge_Processing_Lab#DKPro
- Both unstructured data interface (with appropriate plugins) and a general processing framework
General Architecture for Text Engineering http://gate.ac.uk/
- A popular(?) alternative
Natural Language Toolkit http://nltk.org/
- Python-based, more tinkering/beginner friendly? but probably less advanced

In our architecture, we can probably try / mix multiple unstructured data architectures.

Question Answering Framework

In the long run, question answering may not be the only capability of the system, but it is an excellent starting point and benchmark.

Off-the-shelf solutions:

OpenQA/OAQA http://oaqa.github.io/
- Opensource framework (on top of UIMA) that seems very close to actual IBM Watson tech; https://github.com/oaqa
- Some inspiration may come from the old website? See e.g. https://mu.lti.cs.cmu.edu/trac/oaqa/wiki/OAQADocumentation/Architecture
- Joint IBM-CMU paper - Open Advancement of Question Answering (some interesting problems defined! esp. “learning by reading”, “sustained investigation”)
- Full-fledged OAQA pipeline instances publicly available:
  - We have rolled our own, blanqa, loosely inspired by the DSO project codebase
  - helloqa-prototype (DSO project) - for setup instructions, see OAQA / OpenQA Setup Guide
  - rzhao-prototype
UIMA-based http://www.iiitb.ac.in/sites/default/files/uploads/IIITB-TR-2012-001.pdf http://sourceforge.net/projects/questnanswering/
- Small gradstudent project, but it may be a good prototyping base; got no response from affiliated people
OpenEphyra http://www.ephyra.info/
- Seems to be a somewhat obsolete, old-style solution? but ready-to-use; we can (and do) use some of its components in an OAQA-based solution, at least as placeholders for better solutions
QA component of the “Taming Text” book's codebase https://github.com/tamingtext/book
- Simple, a bit hackish, tightly integrated with solr

Custom (Watson-inspired) solution structure:

Parse the question in multiple independent ways, with assigned confidences
Process the question in multiple independent ways (variety of sources etc.), with assigned confidences
- Generating and verifying hypothetic answers
Pick the highest-confidence answer(s)
This can be heavily parallelized in the future. Confidences may be assigned using internal solvers' knowledge and feature-based machine learning methods (even naive bayes on a training set, with trivial NLP features for starters).
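
As a rough illustration of the structure above, here is a minimal Python sketch. It is not the YodaQA or blanqa code; all names (`answer_question`, the toy parsers and solvers) and all confidence values are hypothetical, and summing the products of confidences is just one simple assumption about how evidence could be combined. A trained model (e.g. the naive Bayes mentioned above) could replace that step.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Hypothetical interfaces: a parser maps a question to (interpretation,
# confidence) pairs; a solver maps an interpretation to (answer, confidence).
Parser = Callable[[str], List[Tuple[str, float]]]
Solver = Callable[[str], List[Tuple[str, float]]]


def answer_question(question: str, parsers: List[Parser],
                    solvers: List[Solver],
                    top_n: int = 1) -> List[Tuple[str, float]]:
    """Rank candidate answers by accumulated confidence."""
    scores: Dict[str, float] = defaultdict(float)
    for parse in parsers:
        for interpretation, parse_conf in parse(question):
            for solve in solvers:
                for answer, answer_conf in solve(interpretation):
                    # Assumption: evidence from independent paths adds up.
                    scores[answer] += parse_conf * answer_conf
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


# Toy stand-ins with made-up data and confidences.
def keyword_parser(question: str) -> List[Tuple[str, float]]:
    return [(question.lower().rstrip("?"), 0.6)]


def focus_parser(question: str) -> List[Tuple[str, float]]:
    return [("time travel movie", 0.9)]


def lookup_solver(interpretation: str) -> List[Tuple[str, float]]:
    toy_index = {
        "time travel movie": [("Back to the Future", 0.7), ("Primer", 0.5)],
    }
    return toy_index.get(interpretation, [])


def substring_solver(interpretation: str) -> List[Tuple[str, float]]:
    if "time travel" in interpretation:
        return [("Back to the Future", 0.4)]
    return []


best = answer_question(
    "What was the best time travel movie?",
    parsers=[keyword_parser, focus_parser],
    solvers=[lookup_solver, substring_solver],
)
print(best)
```

An answer proposed along several independent paths ends up ranked above one proposed only once, which is the point of keeping multiple parsers and solvers.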

Scaling Up

Notes about clustering:

Embarrassingly parallel
Common: UIMA-AS + Hadoop
https://github.com/DigitalPebble/behemoth
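
To show what "embarrassingly parallel" means here, a small single-machine Python sketch follows. It uses only the standard library thread pool, not UIMA-AS, Hadoop or Behemoth, and the function and solver names are hypothetical. The assumption is that each solver works on the question independently, so the solvers can run concurrently and their candidates are simply merged afterwards.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Tuple

Candidate = Tuple[str, float]  # (answer, confidence)
Solver = Callable[[str], List[Candidate]]


def run_solvers_in_parallel(question: str, solvers: List[Solver],
                            max_workers: int = 4) -> List[Candidate]:
    """Run independent solvers concurrently and merge their candidates."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(solver, question) for solver in solvers]
        candidates: List[Candidate] = []
        # Collect in submission order so the merged list is deterministic.
        for future in futures:
            candidates.extend(future.result())
    return candidates


def best_candidate(candidates: List[Candidate]) -> Optional[Candidate]:
    """Pick the highest-confidence candidate, or None if there are none."""
    return max(candidates, key=lambda c: c[1]) if candidates else None


# Toy solvers with made-up answers and confidences.
def solver_a(question: str) -> List[Candidate]:
    return [("answer one", 0.8)]


def solver_b(question: str) -> List[Candidate]:
    return [("answer two", 0.6), ("answer three", 0.3)]


merged = run_solvers_in_parallel("toy question", [solver_a, solver_b])
print(best_candidate(merged))
```

Because the solvers share no state, the same pattern carries over to processes or cluster nodes; only the executor changes.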
