Paraphrase algorithm
From Christoph's Personal Wiki
This article describes my work on developing a paraphrase algorithm using Project Gutenberg as my corpus. Paraphrasing is a natural language processing (NLP) task within computational linguistics.
The idea is to first extract comparable sub-corpora from my main corpus and use them as my training set. To start with, I will build a small parallel corpus and use it for sentence clustering. The goal is to collect data for inferring templates from sentences that appear similar on a word-by-word level.
Examples
Sem Experiment
See: Statistical Paraphrasing Project from the Cornell Natural Language Processing Group
Let A1 = Isaiah 2:4, B1 = Micah 4:3, and C1 = Joel 3:10, with:
- A1
- And he shall judge among the nations, and shall rebuke many people: and they shall beat their swords into plowshares, and their spears into pruninghooks: nation shall not lift up sword against nation, neither shall they learn war any more.
- B1
- And he shall judge among many people, and rebuke strong nations afar off; and they shall beat their swords into plowshares, and their spears into pruninghooks: nation shall not lift up a sword against nation, neither shall they learn war any more.
- C1
- Beat your plowshares into swords, and your pruninghooks into spears: let the weak say, I am strong.
- sentence clustering:
A1a: And he shall judge among the nations,
B1a: And he shall judge among many people,
A1b: and shall rebuke many people:
B1b: and rebuke strong nations afar off;
A1c: and they shall beat their swords into plowshares,
B1c: and they shall beat their swords into plowshares,
A1d: and their spears into pruninghooks:
B1d: and their spears into pruninghooks:
A1e: nation shall not lift up sword against nation,
B1e: nation shall not lift up a sword against nation,
A1f: neither shall they learn war any more.
B1f: neither shall they learn war any more.
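The clause pairing above can be approximated in Python. This is only a minimal sketch and not necessarily how the listing above was produced. It assumes that clauses end at `,`, `;`, `:` or `.`, scores clause pairs by word overlap with `difflib.SequenceMatcher`, and aligns the two clause sequences in order with a small dynamic program. The names `split_clauses`, `similarity`, and `align_clauses` are hypothetical.

```python
import re
from difflib import SequenceMatcher

A1 = (
    "And he shall judge among the nations, and shall rebuke many people: "
    "and they shall beat their swords into plowshares, and their spears into "
    "pruninghooks: nation shall not lift up sword against nation, neither "
    "shall they learn war any more."
)
B1 = (
    "And he shall judge among many people, and rebuke strong nations afar "
    "off; and they shall beat their swords into plowshares, and their spears "
    "into pruninghooks: nation shall not lift up a sword against nation, "
    "neither shall they learn war any more."
)


def split_clauses(verse):
    """Split a verse after , ; : or . (assumed clause boundaries)."""
    return [c for c in re.split(r"(?<=[,;:.])\s+", verse.strip()) if c]


def similarity(clause_a, clause_b):
    """Word-level similarity in [0, 1], ignoring case and punctuation."""
    words_a = re.findall(r"[a-z']+", clause_a.lower())
    words_b = re.findall(r"[a-z']+", clause_b.lower())
    return SequenceMatcher(None, words_a, words_b, autojunk=False).ratio()


def align_clauses(clauses_a, clauses_b):
    """Order-preserving alignment that maximises the summed similarity.

    Returns a list of (index_in_a, index_in_b) pairs; clauses may be skipped.
    """
    n, m = len(clauses_a), len(clauses_b)
    # best[i][j] = (score, pairs) for clauses_a[i:] versus clauses_b[j:]
    best = [[(0.0, [])] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            score, pairs = best[i + 1][j + 1]
            candidates = [
                (score + similarity(clauses_a[i], clauses_b[j]),
                 [(i, j)] + pairs),   # pair clause i with clause j
                best[i + 1][j],       # skip clause i of A
                best[i][j + 1],       # skip clause j of B
            ]
            best[i][j] = max(candidates, key=lambda c: c[0])
    return best[0][0][1]


if __name__ == "__main__":
    clauses_a, clauses_b = split_clauses(A1), split_clauses(B1)
    for i, j in align_clauses(clauses_a, clauses_b):
        print(f"A1{'abcdef'[i]}: {clauses_a[i]}")
        print(f"B1{'abcdef'[j]}: {clauses_b[j]}")
```

An order-preserving alignment is used here because a plain best-overlap match would pair A1b with B1a (they share "and", "shall", "many", and "people"), whereas A1b and B1b share only two words.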
- inducing patterns (arguments in square brackets):
{A1c,A1d,A1f} = {B1c,B1d,B1f}
A1a: And he shall judge among [the nations],
B1a: And he shall judge among [many people],
A1b: and [shall rebuke] [many people]:
B1b: and [rebuke] [strong nations] [afar off];
A1e: nation shall not lift up [sword] against nation,
B1e: nation shall not lift up [a sword] against nation,
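A rough way to induce such patterns automatically is to diff two aligned clauses word by word and bracket every span where they differ. The sketch below assumes `difflib.SequenceMatcher` as the diff and drops punctuation. The name `induce_pattern` is hypothetical. Because it is a plain word diff, its slots need not match the hand-made bracketing above: for A1e and B1e it yields an empty slot `[]` against `[a]` rather than `[sword]` against `[a sword]`.

```python
import re
from difflib import SequenceMatcher


def induce_pattern(clause_a, clause_b):
    """Return both clauses with their differing word spans in brackets.

    Shared words are kept as the fixed part of the pattern; each span that
    differs becomes one bracketed argument slot (possibly empty on one side).
    """
    words_a = re.findall(r"[A-Za-z']+", clause_a)
    words_b = re.findall(r"[A-Za-z']+", clause_b)
    out_a, out_b = [], []
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        if tag == "equal":
            out_a.extend(words_a[a0:a1])
            out_b.extend(words_b[b0:b1])
        else:  # "replace", "delete" or "insert": open an argument slot
            out_a.append("[" + " ".join(words_a[a0:a1]) + "]")
            out_b.append("[" + " ".join(words_b[b0:b1]) + "]")
    return " ".join(out_a), " ".join(out_b)


if __name__ == "__main__":
    pattern_a, pattern_b = induce_pattern(
        "And he shall judge among the nations,",
        "And he shall judge among many people,",
    )
    print("A1a:", pattern_a)
    print("B1a:", pattern_b)
```

Identical clause pairs such as A1c and B1c come back without any brackets, which corresponds to the equality stated at the top of the listing.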
References
- Barzilay R, Lee L (2003). "Learning to Paraphrase: An Unsupervised Approach Using Multiple-Sequence Alignment". Proceedings of HLT-NAACL, pp 16-23.
See also
- wikipedia:Natural language processing
- wikipedia:Computational linguistics
- wikipedia:Corpus linguistics
- wikipedia:Part-of-speech tagging (POS tagging or POST, also called grammatical tagging)
- Computational Linguistics (journal)
External links
- COMPUTATIONAL LINGUISTICS: Models, Resources, Applications — free online book
- An informal explanation of "Learning to Paraphrase: An Unsupervised Approach Using Multiple-Sequence Alignment"
- Javascript Diff Algorithm
- DIRT Paraphrase Collection
- Distributional Hypothesis
- Statistical Semantics
- VerbOcean