founder: pasky
depends on:
software license: -
hardware license: -
status: active

Our very own IBM Watson - approximated using open source technology, replicated in a non-supercomputer environment.

The goal is to build a system that can chew on a few open semantic databases, Wikipedia and Sbrm, and then answer general questions like “List the biggest nuclear explosions in Russia” or “When do I pay brmlab membership fees?” or “What was the best time travel movie?”. The primary language in this phase will be English. Then we can take things further: start supporting more advanced inference (“What resistance do I need in series with a random red LED on 5V?”), add some autonomous goal-based processing, etc. Only Strong AI is the limit!

We already have a working software stack with reasonable performance and are currently focusing on reviewing it for bugs and wrapping it up for a milestone scientific paper publication. Speed has not been a focus so far. Accuracy on our 430 trivia question testset is a little above 30% as of Jan 2015.

We aim to do as little coding as possible, at least initially, instead focusing on integration of existing technologies. Most impressive initial results in the shortest time! :-)

(Original story on Pasky's blog: http://log.or.cz/?p=317)

Status and Planning

A question-answering engine “YodaQA” (custom-made from the ground up) by Pasky is set up on his home server (AMD FX-8350, 24G RAM), together with an enwiki fulltext index and DBPedia. It is connected to IRC and hangs out at #brmson, freenode.

All our code, incl. documentation and setup instructions, is open source and lives in the github brmson organization.

(Historical) Knowledge Base

Starting points:

In bold are our current choices that we are running with.

Data Sources

All sources listed here must be freely available.

  • Structured:
    • WordNet
    • YAGO
    • DBPedia
  • Unstructured:
    • Wikipedia, Everything2 ?, Wiktionary
    • TVTropes, Urban Dictionary
    • News articles (theregister, /., bbc, cnn, reuters)
    • Sbrm, laws, patents, …
    • IRC logs, Bitcoin forums, transcripts (lectures, Tetra), …

Unstructured Data Sources Interfaces

In our architecture, we can probably try / mix multiple unstructured data architectures.

Question Answering Framework

In the long run, question answering may not be the only capability of the system, but it is an excellent starting point and benchmark.

Off-the-shelf solutions:

Custom (Watson-inspired) solution structure:

  • Parse the question in multiple independent ways, with assigned confidences
  • Process the question in multiple independent ways (variety of sources etc.), with assigned confidences
    • Generating and verifying hypothetical answers
  • Pick the highest-confidence answer(s)
  • This can be heavily parallelized in the future. Confidences may be assigned using internal solvers' knowledge and feature-based machine learning methods (even naive Bayes on a training set, with trivial NLP features for starters).
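
The structure above can be illustrated with a minimal Python sketch. This is not YodaQA code: the `Scored` type, the toy parsers and solvers, the `TOY_FACTS` table and the rule of multiplying parse and answer confidences are all assumptions made up for illustration. A real system would more likely learn the confidence combination from features, as noted in the last bullet.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Scored:
    """A piece of text (interpretation or answer) with a confidence in [0, 1]."""
    value: str
    confidence: float


# Hypothetical interfaces: a parser maps a question to scored interpretations,
# a solver maps one interpretation to scored candidate answers.
Parser = Callable[[str], Iterable[Scored]]
Solver = Callable[[str], Iterable[Scored]]


def answer(question: str, parsers: List[Parser], solvers: List[Solver],
           top_n: int = 1) -> List[Scored]:
    """Run every parser and every solver, then return the top-scoring answers."""
    best: Dict[str, float] = {}
    for parse in parsers:
        for interp in parse(question):
            for solve in solvers:
                for cand in solve(interp.value):
                    # Assumption: combine confidences by simple multiplication
                    # and keep the best score seen for each distinct answer.
                    score = interp.confidence * cand.confidence
                    if score > best.get(cand.value, 0.0):
                        best[cand.value] = score
    ranked = sorted((Scored(v, c) for v, c in best.items()),
                    key=lambda s: s.confidence, reverse=True)
    return ranked[:top_n]


# --- Toy components, purely illustrative ---

def keyword_parser(question: str) -> Iterable[Scored]:
    """Normalize the question to lowercase without the question mark."""
    yield Scored(question.lower().rstrip("?"), 0.9)


def verbatim_parser(question: str) -> Iterable[Scored]:
    """Pass the question through unchanged, with lower confidence."""
    yield Scored(question, 0.5)


TOY_FACTS = {"capital of france": "Paris"}  # made-up mini knowledge base


def lookup_solver(interpretation: str) -> Iterable[Scored]:
    """Propose an answer when a known key occurs in the interpretation."""
    for key, value in TOY_FACTS.items():
        if key in interpretation:
            yield Scored(value, 0.8)


def guess_solver(interpretation: str) -> Iterable[Scored]:
    """Fallback solver that always proposes a low-confidence placeholder."""
    yield Scored("unknown", 0.1)


if __name__ == "__main__":
    result = answer("What is the capital of France?",
                    [keyword_parser, verbatim_parser],
                    [lookup_solver, guess_solver])
    print(result[0].value)
```

Because each parser/solver pair is independent of the others, the nested loops are the natural place to parallelize later; only the final ranking needs all results together.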

Scaling Up

Notes about clustering:

Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Noncommercial-Share Alike 3.0 Unported