Building and verifying parallel corpora between Arabic and English

Alkahtani, Saad (2015) Building and verifying parallel corpora between Arabic and English. PhD thesis, Prifysgol Bangor University.

Arabic and English are two major natural languages, each used across many countries and regions. Previous literature concludes that the quality of machine translation (MT) between them remains unsatisfactory. This research aims to improve MT quality between Arabic and English by developing higher quality parallel corpora. The thesis develops a higher quality parallel test corpus drawn from Al Hayat articles and the OPUS open-source online corpora database. A new metric for sentence alignment, based on Prediction by Partial Matching (PPM), is applied to verify translation quality between the sentence pairs in the test corpus. The metric combines two techniques: the traditional approach based on sentence length, and a second based on compression code length. A higher quality parallel corpus has also been constructed from the existing resources. Built from sentences and words obtained from the two online sources, Al Hayat and OPUS, the new corpus contains 27,775,663 words in Arabic and 30,808,480 in English. Experimental results on sample data indicate that combining the PPM-based technique with sentence length improves alignment accuracy on this corpus compared with sentence length alone.
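
This record does not give the formula for the thesis's metric, so the following is only an illustrative sketch of the general idea of scoring a candidate sentence pair by both sentence length and compression code length. It makes several assumptions that are not taken from the thesis: an adaptive order-0 byte model stands in for PPM (which conditions on longer contexts), the two signals are each expressed as a min/max ratio, and they are mixed with a hypothetical weight parameter. The function names are likewise invented for illustration.

```python
import math
from collections import Counter


def code_length_bits(text: str, alphabet_size: int = 256) -> float:
    """Bits needed to encode `text` under an adaptive order-0 byte model.

    Assumption: this simple model is a stand-in for PPM, not PPM itself.
    Each byte is coded with a Laplace-smoothed probability estimated
    from the bytes seen so far.
    """
    counts = Counter()
    total = 0
    bits = 0.0
    for byte in text.encode("utf-8"):
        probability = (counts[byte] + 1) / (total + alphabet_size)
        bits -= math.log2(probability)
        counts[byte] += 1
        total += 1
    return bits


def ratio(a: float, b: float) -> float:
    """Min/max ratio in [0, 1]; two empty inputs count as a perfect match."""
    largest = max(a, b)
    return 1.0 if largest == 0 else min(a, b) / largest


def alignment_score(source: str, target: str, weight: float = 0.5) -> float:
    """Hypothetical combined score: higher means a more plausible pair.

    `weight` (an assumed parameter) balances the sentence-length ratio
    against the code-length ratio.
    """
    length_part = ratio(len(source), len(target))
    code_part = ratio(code_length_bits(source), code_length_bits(target))
    return weight * length_part + (1 - weight) * code_part


if __name__ == "__main__":
    print(alignment_score("The cat sat on the mat.", "A cat was sitting on a mat."))
    print(alignment_score("The cat sat on the mat.", "Yes."))
```

Under these assumptions, a pair whose sentences carry a similar amount of information scores close to 1, while a pair that differs sharply in length or code length scores lower. For an Arabic-English pair, the raw byte-level code lengths would differ systematically because Arabic characters occupy two bytes in UTF-8, so a practical version would need language-specific models or normalisation.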

Item Type: Thesis (PhD)
Subjects: Degree Thesis
Departments: College of Physical and Applied Sciences > School of Computer Science
Date Deposited: 21 Apr 2016 11:24
Last Modified: 18 May 2016 08:46
URI: http://e.bangor.ac.uk/id/eprint/6546