Abstract
The ever-increasing volume of electronic legal documents calls for effective, language-specific summarization and headline generation techniques to make legal content more accessible and easier to use. In the context of Italian law, existing summarization models are either extractive or focused on generating long-form abstractive summaries. As a result, the generated summaries suffer from low readability or are ill-suited to summarizing common legal documents such as norms. This paper proposes LegItBART, a new abstractive summarization model that leverages a BART-based sequence-to-sequence architecture specifically pre-trained on Italian legal corpora. To enable the generation of concise summaries and headlines, we release two new annotated datasets tailored to the Italian legal domain, namely LawCodes and LegItConcepts. To handle input documents that exceed the maximum token length, such as verbose norms, codes, or legal articles, we also extend BART with a global-sparse-local attention mechanism. We empirically analyze the performance of different combinations of pre-training and fine-tuning strategies. The results show that mixing general-purpose and domain-specific pre-training yields significant improvements in summarization performance. The fine-tuned version of LegItBART outperforms all tested baselines, even those with a significantly larger number of model parameters.