Extractive summarization of Malayalam documents using latent Dirichlet allocation: An experience

Journal of Intelligent Systems 31 (1):393-406 (2022)

Abstract

Automatic text summarization (ATS) extracts information from a source text and presents it to the user in a condensed form while preserving its primary content. Many text summarization approaches have been investigated in the literature for highly resourced languages. ATS remains a complicated and challenging task, however, for under-resourced languages like Malayalam, where the lack of a standard corpus and of adequate processing tools hampers language processing. In the absence of a standard corpus, we have developed a dataset consisting of Malayalam news articles. This article proposes an extractive, topic modeling-based multi-document text summarization approach for Malayalam news documents. We first cluster the contents based on latent topics identified using the latent Dirichlet allocation (LDA) topic modeling technique. Then, by adopting the vector space model, the topic vector and sentence vectors of the given document are generated. Sentences are ranked according to their relevant status value, computed between the document’s topic vector and each sentence vector. The summary obtained is optimized for non-redundancy. Evaluation results on Malayalam news articles show that the summary generated by the proposed method is closer to human-generated summaries than those of existing text summarization methods.
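
The ranking and non-redundancy steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The topic vector is assumed to be given (in the paper it would come from LDA), cosine similarity stands in for the relevant status value, whose exact formula the abstract does not state, and whitespace tokenization over English toy sentences stands in for Malayalam preprocessing. The names `cosine`, `summarize`, `k`, and `max_overlap` are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def summarize(sentences, topic_vector, k=2, max_overlap=0.7):
    """Pick up to k sentences that match the topic and do not repeat each other.

    Assumption: similarity to the topic vector approximates the ranking
    score; the paper's actual scoring function may differ.
    """
    # Term-frequency vector for each sentence (vector space model).
    vectors = [Counter(s.lower().split()) for s in sentences]
    # Rank sentence indices by similarity to the topic vector, best first.
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: cosine(vectors[i], topic_vector),
        reverse=True,
    )
    chosen = []
    for i in ranked:
        # Greedy redundancy filter: skip near-duplicates of chosen sentences.
        if all(cosine(vectors[i], vectors[j]) < max_overlap for j in chosen):
            chosen.append(i)
        if len(chosen) == k:
            break
    # Restore the original document order for readability.
    return [sentences[i] for i in sorted(chosen)]


# Toy data: the topic weights are invented for illustration only.
topic_vector = {"flood": 0.5, "kerala": 0.3, "rain": 0.2}
sentences = [
    "heavy rain caused flood in kerala",
    "heavy rain caused flood in kerala again",
    "the cricket team won the match",
    "relief camps opened after the flood",
]
summary = summarize(sentences, topic_vector)
print(summary)
```

On this toy input the second sentence is dropped as a near-duplicate of the first, so the summary keeps the first and fourth sentences. In practice the topic vector would be read from a trained LDA model's topic-word distribution and not written by hand.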
