founder: pasky
depends on:
software license: -
hardware license: -
status: active

Our very own IBM Watson - approximated using open source technology, replicated in the brmlab hackerspace environment.

The first goal is to build a system that can chew on a few open semantic databases, Wikipedia and Sbrm, and then answer general questions like “List the biggest nuclear explosions in Russia” or “When do I pay brmlab membership fees?” or “What was the best time travel movie?”. The primary language in this phase will be English.

Then, we can take things further - start supporting more advanced inference (“What resistance do I need in series with a random red LED on 5V?”), add some autonomous goal-based processing etc. Only Strong AI is the limit!

The current primary focus is on a working software stack (regardless of speed); after that, we can think about how to speed it up using cluster technologies.

We aim to do as little coding as possible, at least initially, instead focusing on integration of existing technologies. Most impressive initial results in the shortest time! :-)

(Story on Pasky's blog: http://log.or.cz/?p=317)

Status and Planning

A custom-made question-answering engine “BlanQA” by Pasky is set up at brmson.dyn.brm (a virtual machine on sargon). It is connected to IRC as brmson@freenode, hanging out at #brmson.

All our code lives in the brmson organization on GitHub.

To set up brmson on your machine, please refer to the BlanQA installation instructions. It's straightforward!

Next steps:

  • [WIP] Revamp the BlanQA architecture to fit better with UIMA; this is where YodaQA comes in. (Each artifact should live in its own CAS, instead of everything being meshed together as in OpenQA. This will enable us to use third-party annotators for document analysis and rapidly expand BlanQA capabilities.)

brmson.dyn.brm Installation Notes

Pieces of the puzzle:

  • The virtual machine itself. If sargon is rebooted, the virtual machine needs to be restarted. As root@sargon:
    • ifdown eth3; ifconfig eth3 up; brctl addbr br0; brctl addif br0 eth3; ifup br0 (you probably want to do this locally as it temporarily breaks sargon's network connectivity)
    • virsh → start brmson
  • blanqa: The initial QA system implementation within the Brmson project, based on the OAQA/OpenQA framework
    • irssi: Runs the brmson IRC gateway. blanqa's pipeline is accordingly modified to use FIFO communication rather than interactive mode. The top of contrib/irssi-brmson-pipe.pl contains setup instructions.
    • enwiki solr: The local configuration of blanqa at brmson.brm uses enwiki as the dataset. This is huge and memory-hungry, and is currently set up on pasky's home machine pasky.or.cz. Local replication of this setup is possible; blanqa's README contains detailed instructions - TODO.
    • Other local changes should all live in the brmson.brm git branch of blanqa.
  • uima-ecd: A bugfixed version of upstream uima-ecd, the fundamental component of OpenQA.
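
The FIFO hand-off between the irssi gateway and blanqa mentioned above can be pictured roughly as follows. This is only a minimal sketch, not the actual contrib/irssi-brmson-pipe.pl setup: it assumes one FIFO per direction and one question or answer per line, and the names serve_questions, run_over_fifos and the FIFO paths are hypothetical.

```python
import os
from typing import Callable, TextIO


def serve_questions(
    questions: TextIO,
    answers: TextIO,
    answer_fn: Callable[[str], str],
) -> int:
    """Read one question per line, write one answer per line.

    Returns the number of questions handled. Blank lines are skipped.
    The line-based protocol is an assumption made for this sketch.
    """
    handled = 0
    for line in questions:
        question = line.strip()
        if not question:
            continue
        answers.write(answer_fn(question) + "\n")
        # Flush so the reader on the other end sees the answer immediately.
        answers.flush()
        handled += 1
    return handled


def run_over_fifos(
    question_fifo: str,
    answer_fifo: str,
    answer_fn: Callable[[str], str],
) -> int:
    """Serve questions over two named pipes (POSIX only).

    Both paths are hypothetical. open() on a FIFO blocks until the peer
    opens the other end, so the peer must open the two FIFOs in the same
    order as this function does (questions first, then answers).
    """
    for path in (question_fifo, answer_fifo):
        if not os.path.exists(path):
            os.mkfifo(path)
    with open(question_fifo, "r") as questions, open(answer_fifo, "w") as answers:
        return serve_questions(questions, answers, answer_fn)
```

Keeping the loop independent of how the streams were opened means the same function can be exercised with in-memory streams, while a gateway process would call run_over_fifos with whatever paths it has agreed on with the QA process.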

Old and/or obsolete pieces of the puzzle:

  • baseqa etc.: OpenQA foundations that are just downloaded from an external repository and not customized or installed locally for now; you may ignore these for the time being, though they are undergoing active upstream development (in mysterious branches)
  • solr-provider: OpenQA components of UIMA providing fulltext search (we do use this in blanqa)
  • indri: A search engine supporting structured queries on top of freetext/structured data
  • helloqa-prototype: OpenQA pipeline instance that can actually (clumsily) answer terrorism-related questions based on pieces of free-text
  • rzhao-prototype: Another OpenQA pipeline instance to investigate
    • Unfortunately, some source code files are missing.

Knowledge Base

Starting points:

In bold are our current choices that we are running with.

Data Sources

All sources listed here must be freely available.

  • Structured:
    • WordNet
    • YAGO
    • DBPedia
  • Unstructured:
    • Wikipedia, Everything2 ?, Wiktionary
    • TVTropes, Urban Dictionary
    • News articles (theregister, /., bbc, cnn, reuters)
    • Sbrm, laws, patents, …
    • IRC logs, Bitcoin forums, transcripts (lectures, Tetra), …

Unstructured Data Sources Interfaces

In our architecture, we can probably try / mix multiple unstructured data architectures.

Question Answering Framework

In the long run, question answering may not be the only capability of the system, but it is an excellent starting point and benchmark.

Off-the-shelf solutions:

Custom (Watson-inspired) solution structure:

  • Parse the question in multiple independent ways, with assigned confidences
  • Process the question in multiple independent ways (variety of sources etc.), with assigned confidences
    • Generating and verifying hypothetical answers
  • Pick the highest-confidence answer(s)
  • This can be heavily parallelized in the future. Confidences may be assigned using the solvers' internal knowledge and feature-based machine learning methods (even naive Bayes on a training set, with trivial NLP features for starters).
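
The structure outlined above could be sketched along these lines. Everything here is illustrative: the Scored type, the parser and solver signatures, the toy stages and their confidence values are all hypothetical, and multiplying the two confidences is just a placeholder for whatever learned scoring (for example the naive Bayes approach) ends up being used.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Scored:
    """A piece of text (interpretation or answer) with a confidence in [0, 1]."""
    value: str
    confidence: float


# Hypothetical stage signatures: a parser maps a raw question to scored
# interpretations, a solver maps one interpretation to scored answers.
Parser = Callable[[str], Iterable[Scored]]
Solver = Callable[[str], Iterable[Scored]]


def answer(question: str, parsers: List[Parser], solvers: List[Solver],
           top_n: int = 1) -> List[Scored]:
    """Run every parser and solver, then return the top_n answers."""
    best: Dict[str, float] = {}
    for parse in parsers:
        for interpretation in parse(question):
            for solve in solvers:
                for hypothesis in solve(interpretation.value):
                    # Placeholder combination rule (an assumption):
                    # the product of the two stage confidences.
                    score = interpretation.confidence * hypothesis.confidence
                    # Keep the best score seen for each distinct answer.
                    if score > best.get(hypothesis.value, 0.0):
                        best[hypothesis.value] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return [Scored(value, score) for value, score in ranked[:top_n]]


# Toy stages with made-up confidences, for demonstration only.
def verbatim_parser(question: str) -> Iterable[Scored]:
    yield Scored(question, 0.5)


def normalizing_parser(question: str) -> Iterable[Scored]:
    yield Scored(question.strip().lower().rstrip("?"), 0.8)


TOY_FACTS = {"what is 2+2": "4"}  # stand-in for a real knowledge source


def toy_lookup_solver(interpretation: str) -> Iterable[Scored]:
    if interpretation in TOY_FACTS:
        yield Scored(TOY_FACTS[interpretation], 0.9)


def fallback_solver(interpretation: str) -> Iterable[Scored]:
    yield Scored("unknown", 0.1)
```

Calling answer("What is 2+2?", [verbatim_parser, normalizing_parser], [toy_lookup_solver, fallback_solver]) ranks "4" above "unknown". Since each (parser, solver) pair is independent of the others, the nested loops are the natural place to parallelize later.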

Scaling Up

Notes about clustering:

Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Noncommercial-Share Alike 3.0 Unported