Abstract
Technology has substantially transformed the way legal services operate in many different countries. With a large and complex collection of digitized legal documents, the judiciary system worldwide presents a promising scenario for the development of intelligent tools. In this work, we tackle the challenging task of organizing and summarizing the constantly growing collection of legal documents, uncovering hidden topics, or themes that later can support tasks such as legal case retrieval and legal judgment prediction. Our approach to this problem relies on topic discovery techniques combined with a variety of preprocessing techniques and learning-based vector representations of words, such as Doc2Vec and BERT-like models. The proposed method was validated using four different datasets composed of short and long legal documents in Brazilian Portuguese, from legal decisions to chapters in legal books. Analysis conducted by a team of legal specialists revealed the effectiveness of the proposed approach to uncover unique and relevant topics from large collections of legal documents, serving many purposes, such as giving support to legal case retrieval tools and also providing the team of legal specialists with a tool that can accelerate their work of labeling/tagging legal documents.