Abstract
The legal industry is characterized by dense and complex documents, which require automated processing methods to manage and analyze large volumes of data. Traditional methods for extracting legal information depend heavily on substantial quantities of annotated data during the training phase. However, a question arises: how can information be extracted effectively in contexts where annotated data is scarce or unavailable? This study investigates Large Language Models (LLMs) as a transformative solution for legal term extraction, presenting a novel approach to overcome the constraints associated with the need for extensive annotated datasets. We explored methods such as prompt engineering and fine-tuning to enhance LLM performance, and evaluated four LLMs (GPT-4, Miqu-1-70b, Mixtral-8x7b, and Mistral-7b) against rule-based and BERT-based baselines under limited annotated data availability. We implemented and assessed our methodologies using Luxembourg's traffic regulations as a case study. Our findings underscore the capacity of LLMs to handle legal term extraction successfully, emphasizing the benefits of their one-shot and zero-shot learning capabilities in reducing reliance on annotated data, reaching an F1 score of 0.690. Moreover, our study sheds light on best practices for employing LLMs in legal information processing, offering insights into challenges and limitations, including issues related to term boundary extraction.