Sunday 15 April 2012

knowledge is always good, and certainly always better than ignorance

The NLP (Natural Language Processing) course from Stanford mentions a bootstrapping algorithm applied by Sergey Brin, a co-founder of Google. He used it to extract (author, book) pairs from the internet. It is a very pretty algorithm. You start with a seed pair, for example (John Eldredge, The way of the wild heart). Then you search for text that mentions this pair and look at the words around it. For example, in the sentence "The way of the wild heart by John Eldredge is", the word "by" appears between the book and the author, and the word "is" appears right after the author. If this context shows up many times, it becomes a pattern. Next you look for other (author, book) pairs that appear with "by" between them and "is" after them. You might find (Mickiewicz, Pan Tadeusz) or (Shakespeare, Romeo and Juliet). Using the new pairs you find more words that appear between and after an (author, book) pair, which gives you new patterns. You keep repeating the process to finally find all books and their authors ;)

In general, we extract the words that appear before, between, or after a known (author, book) pair and use them as patterns to find new pairs.
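To make the loop concrete, here is a minimal Python sketch of the idea. It is a toy illustration, not Brin's actual implementation: the function names (find_patterns, find_pairs, bootstrap), the five-sentence corpus, and the simplification that a pattern is just one word between the book and the author plus one word after the author are all my own assumptions.

```python
import re
from collections import Counter

def find_patterns(corpus, pairs, min_count=1):
    """Collect (middle, suffix) word contexts around known (author, book) pairs."""
    counts = Counter()
    for author, book in pairs:
        # Assumed sentence shape: <book> <middle word> <author> <suffix word>
        regex = re.compile(
            re.escape(book) + r"\s+(\w+)\s+" + re.escape(author) + r"\s+(\w+)"
        )
        for sentence in corpus:
            for match in regex.finditer(sentence):
                counts[(match.group(1), match.group(2))] += 1
    return {pattern for pattern, n in counts.items() if n >= min_count}

def find_pairs(corpus, patterns):
    """Apply each (middle, suffix) pattern to the corpus to extract new pairs."""
    pairs = set()
    for middle, suffix in patterns:
        regex = re.compile(
            r"^(.+?)\s+" + re.escape(middle) + r"\s+(.+?)\s+" + re.escape(suffix) + r"\b"
        )
        for sentence in corpus:
            match = regex.search(sentence)
            if match:
                pairs.add((match.group(2), match.group(1)))  # (author, book)
    return pairs

def bootstrap(corpus, seeds, max_rounds=5):
    """Alternate between learning patterns and extracting pairs until nothing changes."""
    pairs = set(seeds)
    for _ in range(max_rounds):
        patterns = find_patterns(corpus, pairs)
        new_pairs = pairs | find_pairs(corpus, patterns)
        if new_pairs == pairs:
            break
        pairs = new_pairs
    return pairs

# Toy corpus (made up for this example)
corpus = [
    "The way of the wild heart by John Eldredge is a book.",
    "Pan Tadeusz by Mickiewicz is an epic poem.",
    "Romeo and Juliet by Shakespeare is a tragedy.",
    "Romeo and Juliet by Shakespeare was first printed in 1597.",
    "Hamlet by Shakespeare was written around 1600.",
]
seeds = {("John Eldredge", "The way of the wild heart")}

found = bootstrap(corpus, seeds)
for author, book in sorted(found):
    print(author, "/", book)
```

In this toy run the seed teaches the pattern ("by", "is"), which finds the Mickiewicz and Shakespeare pairs. The Shakespeare pair then teaches a second pattern, ("by", "was"), which finds Hamlet. On real web text the patterns would have to be filtered much more carefully, because one overly general pattern can pull in many wrong pairs.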
