====== Brmson ======

{{template>:project:infobox|
name=Brmson|
image=brmson.png?200|
sw=-|
hw=-|
founder=[[user:pasky]]|
interested=|
status=active
}}

~~META:
status = active
&relation firstimage = :project:brmson.png
~~

Our very own [[wp>IBM Watson]] - approximated using open source technology and replicated in a non-supercomputer environment.

The goal is to build a system that can chew on a few open semantic databases, Wikipedia and Sbrm, and then answer general questions like "List the biggest nuclear explosions in Russia" or "When do I pay brmlab membership fees?" or "What was the best time travel movie?". The primary language in this phase will be English.
Then we can take things further - start supporting more advanced inference ("What resistance do I need in series with a random red LED on 5V?"), add some autonomous goal-based processing, etc. Only Strong AI is the limit!

We already **have a working software stack** with reasonable performance; we are currently focusing on reviewing it for bugs and wrapping it up for a milestone scientific paper publication. Speed has not been a focus so far. Accuracy on our 430-question trivia test set is a little above 30% as of Jan 2015.

We aim to do as //little// coding as possible, at least initially, instead focusing on integration of existing technologies. Most impressive initial results in the shortest time! :-)

**Homepage: [[http://ailao.eu/yodaqa/]]**

**Live demo: [[http://live.ailao.eu/]]**

Pre-print of the first paper on brmson: [[http://pasky.or.cz/dev/brmson/yodaqa-poster2015.pdf]]

(Original story on Pasky's blog: [[http://log.or.cz/?p=317]])

===== Status and Planning =====

A question-answering engine "[[https://github.com/brmson/yodaqa|YodaQA]]" (custom-made from the ground up) by Pasky is set up on his home server (AMD FX-8350, 24G RAM), together with an enwiki fulltext index and DBpedia. It is connected to IRC and hangs out at #brmson on freenode.

All our code, incl. documentation and setup instructions, is **open source** and lives in the [[https://github.com/brmson|github brmson organization]].

===== (Historical) Knowledge Base =====

Starting points:
  * [[http://www.heatonresearch.com/content/free-and-open-software-behind-ibm%E2%80%99s-jeopardy-champion-watson|The Free and Open Software Behind IBM’s Jeopardy Champion Watson]]; blogpost technology summary
  * [[http://www.aaai.org/Magazine/Watson/watson.php|The AI Behind Watson]]; nice high-level overview scientific article
  * [[http://www.andrew.cmu.edu/user/ooo/watson/|This is Watson! Special issue of the IBM research journal on Watson]]; extremely detailed and technical
  * [[https://www.ibm.com/developerworks/community/blogs/InsideSystemStorage/entry/ibm_watson_how_to_build_your_own_watson_jr_in_your_basement7|IBM Watson -- How to replicate Watson hardware and systems design for your own use in your basement]]
  * [[http://www.slideshare.net/jahendler/watson-summer-review82013final|MiniDeepQA]]; technical summary of a student project replicating (in small scale) IBM Watson in cooperation with IBM Research!
  * [[http://debategraph.org/SolrSherlock|SolrSherlock]]; another effort at recreating DeepQA, see also [[https://groups.google.com/forum/#!forum/qa-oss|the qa-oss group]]; seems a bit unfocused and stalled to pasky, let's just watch them for now?

The choices we are currently running with are in bold.

==== Data Sources ====

All sources listed here must be freely available.

  * Structured:
    * WordNet
    * YAGO
    * DBpedia
  * Unstructured:
    * **Wikipedia**, Everything2 ?, Wiktionary
    * TVTropes, Urban Dictionary
    * News articles (theregister, /., bbc, cnn, reuters)
    * Sbrm, laws, patents, ...
    * IRC logs, Bitcoin forums, transcripts (lectures, Tetra), ...

==== Unstructured Data Sources Interfaces ====

  * **[[wp>UIMA]]** http://uima.apache.org/
    * Industry standard, used by the IBM Watson team as well
    * Extremely unstructured data: [[wp>Ubiquitous_Knowledge_Processing_Lab#DKPro]]
    * Both unstructured data interface (with appropriate plugins) and a general processing framework
  * [[wp>General Architecture for Text Engineering]] http://gate.ac.uk/
    * A popular(?) alternative
  * [[wp>Natural Language Toolkit]] http://nltk.org/
    * Python-based, more tinkering/beginner friendly? but probably less advanced

In our architecture, we can probably try / mix multiple unstructured data architectures.

==== Question Answering Framework ====

In the long run, question answering may not be the only capability of the system, but it is an excellent starting point and benchmark.

Off-the-shelf solutions:
  * **OpenQA/OAQA** http://oaqa.github.io/
    * Open source framework (on top of UIMA) that seems very close to actual IBM Watson tech; https://github.com/oaqa
    * Some inspiration may come from the old website? See e.g. https://mu.lti.cs.cmu.edu/trac/oaqa/wiki/OAQADocumentation/Architecture
    * [[http://domino.watson.ibm.com/library/CyberDig.nsf/1e4115aea78b6e7c85256b360066f0d4/d12791eaa13bb952852575a1004a055c?OpenDocument&Highlight=0,rc24789|Joint IBM-CMU paper - Open Advancement of Question Answering]] (some interesting problems defined! esp. "learning by reading", "sustained investigation")
    * Full-fledged OAQA pipeline instances publicly available:
      * We have rolled our own, **[[https://github.com/brmson/blanqa|blanqa]]**, loosely inspired by the DSO project codebase
      * [[https://github.com/oaqa/helloqa/tree/prototype|helloqa-prototype]] ([[https://github.com/oaqa/helloqa/wiki/DSO-Project|DSO project]]) - for setup instructions, see [[project/brmson/helloqa-prototype-howto]]
      * [[https://github.com/rzhao1/helloqa/tree/prototype|rzhao-prototype]]
  * UIMA-based http://www.iiitb.ac.in/sites/default/files/uploads/IIITB-TR-2012-001.pdf http://sourceforge.net/projects/questnanswering/
    * Small grad student project, but it may be a good prototyping base; got no response from affiliated people
  * OpenEphyra http://www.ephyra.info/
    * Seems to be a somewhat obsolete, old-style solution? but ready-to-use; we can (and do) use some of its components in the OAQA-based solution, at least as placeholders for better solutions
  * QA component of the "Taming Text" book's codebase https://github.com/tamingtext/book
    * Simple, a bit hackish, tightly integrated with solr

Custom (Watson-inspired) solution structure:
  * Parse the question in multiple independent ways, with assigned confidences
  * Process the question in multiple independent ways (variety of sources etc.), with assigned confidences
    * Generating and verifying hypothetical answers
  * Pick the highest-confidence answer(s)
  * This can be heavily parallelized in the future. Confidences may be assigned using internal solvers' knowledge and feature-based machine learning methods (even naive Bayes on a training set, with trivial NLP features for starters).
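
A minimal Python sketch of the structure above, just to make the data flow concrete. Everything in it is an assumption made for illustration: the ''Candidate'' type, the two toy solvers with their hard-coded facts and confidences, and the noisy-OR rule used to merge confidences for the same answer. None of it is taken from the blanqa or YodaQA code.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Candidate:
    """A hypothetical answer together with a confidence in [0, 1]."""
    answer: str
    confidence: float


# Hypothetical solver interface: question text in, candidate answers out.
Solver = Callable[[str], List[Candidate]]


def keyword_solver(question: str) -> List[Candidate]:
    """Toy solver: looks for a known key phrase in the question."""
    facts = {"capital of france": "Paris"}  # stand-in for a real data source
    q = question.lower()
    return [Candidate(ans, 0.6) for key, ans in facts.items() if key in q]


def pattern_solver(question: str) -> List[Candidate]:
    """Toy solver: a second, independent way of handling the same question."""
    if question.lower().startswith("what is the capital of france"):
        return [Candidate("Paris", 0.5), Candidate("Lyon", 0.2)]
    return []


def merge(candidates: List[Candidate]) -> List[Candidate]:
    """Combine confidences per answer with noisy-OR: 1 - prod(1 - c).

    Noisy-OR is an assumption of this sketch; a trained classifier
    (e.g. naive Bayes over answer features) could replace it.
    """
    remaining: Dict[str, float] = defaultdict(lambda: 1.0)
    for cand in candidates:
        remaining[cand.answer] *= 1.0 - cand.confidence
    merged = [Candidate(ans, 1.0 - rest) for ans, rest in remaining.items()]
    return sorted(merged, key=lambda c: c.confidence, reverse=True)


def answer(question: str, solvers: List[Solver], top_n: int = 1) -> List[Candidate]:
    """Run every solver, merge their hypotheses, keep the best top_n."""
    candidates = [cand for solver in solvers for cand in solver(question)]
    return merge(candidates)[:top_n]


if __name__ == "__main__":
    for cand in answer("What is the capital of France?",
                       [keyword_solver, pattern_solver], top_n=2):
        print(cand.answer, round(cand.confidence, 2))
```

The hand-set confidences in the toy solvers mark the spot where a learned model would plug in; the solvers do not share state, which is what makes the later parallelization straightforward.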

==== Scaling Up ====

Notes about clustering:
  * Embarrassingly parallel
  * Common: UIMA-AS + Hadoop
  * https://github.com/DigitalPebble/behemoth
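
To illustrate what "embarrassingly parallel" means here: each candidate passage can be scored without looking at any other, so the work splits cleanly across workers. The sketch below uses Python's standard ''concurrent.futures'' thread pool purely as a single-machine stand-in; it is not UIMA-AS or Hadoop, and the ''overlap_score'' function is a hypothetical placeholder scorer.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial
from typing import List


def overlap_score(question: str, passage: str) -> float:
    """Hypothetical scorer: fraction of question terms found in the passage."""
    q_terms = set(question.lower().split())
    p_terms = set(passage.lower().split())
    return len(q_terms & p_terms) / len(q_terms) if q_terms else 0.0


def score_all(question: str, passages: List[str], workers: int = 4) -> List[float]:
    """Score every passage independently; results keep the input order."""
    score = partial(overlap_score, question)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(score, passages))


if __name__ == "__main__":
    passages = ["the biggest nuclear test", "time travel movie"]
    print(score_all("biggest nuclear explosions", passages))
```

Threads are used only to keep the example short and portable; for CPU-heavy scoring in Python a process pool or a real cluster framework would be the more realistic choice.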