LEGAL-BERT: The Muppets Straight Out of Law School

ABSTRACT: “BERT has achieved impressive performance in several NLP tasks. However, there has been limited investigation on its adaptation guidelines in specialised domains. Here we focus on the legal domain, where we explore several approaches for applying BERT models to downstream legal tasks, evaluating on multiple datasets. Our findings indicate that the previous guidelines for pre-training and fine-tuning, often blindly followed, do not always generalize well in the legal domain. Thus we propose a systematic investigation of the available strategies when applying BERT in specialised domains. These are: (a) use the original BERT out of the box, (b) adapt BERT by additional pre-training on domain-specific corpora, and (c) pre-train BERT from scratch on domain-specific corpora. We also propose a broader hyper-parameter search space when fine-tuning for downstream tasks and we release LEGAL-BERT, a family of BERT models intended to assist legal NLP research, computational law, and legal technology applications.”
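The abstract's call for a broader hyper-parameter search space when fine-tuning amounts, in practice, to grid search over more values than the usual BERT defaults. A minimal sketch of what such a search looks like, assuming a purely illustrative grid (the specific values below are placeholders, not the paper's actual search space):

```python
from itertools import product

# Illustrative fine-tuning search space -- these values are placeholders,
# not the exact grid reported in the LEGAL-BERT paper.
learning_rates = [2e-5, 3e-5, 5e-5]
batch_sizes = [16, 32]
num_epochs = [3, 4]
dropout_rates = [0.1, 0.2]

def build_grid():
    """Enumerate every fine-tuning configuration in the search space."""
    return [
        {"lr": lr, "batch_size": bs, "epochs": ep, "dropout": dr}
        for lr, bs, ep, dr in product(
            learning_rates, batch_sizes, num_epochs, dropout_rates
        )
    ]

grid = build_grid()
print(len(grid))  # 3 * 2 * 2 * 2 = 24 configurations to fine-tune and compare
```

Each configuration would then be fine-tuned on the downstream legal task and scored on a validation set; the point of the paper's broader grid is that the conventional defaults are not reliably the best in specialised domains.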

Congrats to all of the authors on their acceptance at the Conference on Empirical Methods in Natural Language Processing (EMNLP) in November.

In the legal scientific community, we are witnessing increasing efforts to connect general-purpose NLP advances to domain-specific applications within law. First we saw word embeddings (e.g., word2vec), and now transformers (e.g., BERT). (And don't forget about GPT-3.) Indeed, the development of LexNLP is centered around the idea that in order to have better-performing Legal AI, we will need to connect broader NLP developments to the domain-specific needs within law. Stay tuned!

Rethinking Attention with Performers (Important New Paper on arXiv)

Transformers (such as BERT) suffer from complexity that is quadratic in the number of tokens in the input sequence, which makes training incredibly laborious and expensive … so this is an important paper by researchers from Google, the University of Cambridge and DeepMind …

ABSTRACT: “We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attention-kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods. FAVOR+ can be also used to efficiently model kernelizable attention mechanisms beyond softmax. This representational power is crucial to accurately compare softmax with other kernels for the first time on large-scale tasks, beyond the reach of regular Transformers, and investigate optimal attention-kernels. Performers are linear architectures fully compatible with regular Transformers and with strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence and low estimation variance. We tested Performers on a rich set of tasks stretching from pixel-prediction through text models to protein sequence modeling. We demonstrate competitive results with other examined efficient sparse and dense attention methods, showcasing effectiveness of the novel attention-learning paradigm leveraged by Performers.” ACCESS THE PAPER from arXiv.
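The core FAVOR+ trick can be sketched in a few lines of NumPy: approximate exp(q·k) with positive random features φ(x) = exp(wᵀx − ‖x‖²/2)/√m, so that softmax attention becomes φ(Q)(φ(K)ᵀV) with a row-wise normalizer, computed in time and memory linear in sequence length. This is a minimal sketch of the idea only, not the authors' implementation (which adds orthogonal random features and numerical-stability refinements):

```python
import numpy as np

def positive_random_features(x, w):
    """FAVOR+ positive feature map: phi(x) = exp(w @ x - |x|^2 / 2) / sqrt(m).

    The identity E_w[phi(q) . phi(k)] = exp(q . k) for w ~ N(0, I) makes the
    dot product of features an unbiased estimate of the softmax kernel."""
    m = w.shape[0]
    return np.exp(x @ w.T - 0.5 * np.sum(x**2, axis=-1, keepdims=True)) / np.sqrt(m)

def performer_attention(Q, K, V, num_features=4096, seed=0):
    """Linear-time approximation of softmax attention via random features."""
    d = Q.shape[-1]
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((num_features, d))        # i.i.d. Gaussian features
    scale = d ** -0.25                                # split 1/sqrt(d) between Q and K
    q_prime = positive_random_features(Q * scale, w)  # shape (n, m)
    k_prime = positive_random_features(K * scale, w)  # shape (n, m)
    numerator = q_prime @ (k_prime.T @ V)             # O(n*m*d), never forms n x n matrix
    denominator = q_prime @ k_prime.sum(axis=0)       # row-wise softmax normalizer
    return numerator / denominator[:, None]

def exact_softmax_attention(Q, K, V):
    """Quadratic reference implementation, for comparison."""
    logits = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

With enough random features, the two functions agree closely on small inputs, but `performer_attention` never materializes the n×n attention matrix, which is what unlocks the long sequences the abstract mentions.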

Back to the Future in Legal Artificial Intelligence — Expert Systems, Data Science and the Need for Peer-Reviewed Technical Scholarship

In the broader field of Artificial Intelligence (A.I.), there is a major divide between data-driven A.I. and rules-based A.I. Of course, it is possible to combine these approaches, but let's keep them separate and easy for now. Rules-based A.I. in the form of expert systems peaked in the late 1980s and culminated in the last A.I. Winter. Absent a few commercial examples such as TurboTax, the world moved on and data-driven A.I. took hold.

But here in #LegalTech #LawTech #LegalAI #LegalAcademy, it seems more and more like we have gone 'Back to the A.I. Future' (and brought an IF-THEN back in the DeLorean). Even in 2020, we see individuals and companies touting themselves for taking us back to the A.I. future.

There is nothing wrong with expert systems or rules-based A.I. per se. In law, the first expert system was created by Richard Susskind and Phillip Capper in the 1980s. Richard discussed this back at ReInventLaw NYC in 2014. There are some use cases where legal expert systems (rules-based A.I.) are appropriate. For example, they make the most sense in the access-to-justice (A2J) context; indeed, offerings such as A2J Author and Docassemble are good examples. However, for many (most) problems, particularly those with a decent level of complexity, such rules-based methods alone are really not appropriate.
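At bottom, an expert system is an ordered collection of IF-THEN rules fired against a set of facts. A toy sketch makes both the appeal and the brittleness visible; the rules and thresholds below are entirely hypothetical, for illustration only (they are not drawn from A2J Author, Docassemble, or any real statute):

```python
# A toy rules-based "expert system": an ordered list of IF-THEN rules.
# All rules and thresholds are hypothetical and purely illustrative.

RULES = [
    (lambda f: f["income"] > 50_000, "ineligible: income above threshold"),
    (lambda f: f["has_eviction_notice"], "eligible: emergency housing assistance"),
    (lambda f: f["is_veteran"], "eligible: veterans legal clinic"),
]

DEFAULT = "refer to general intake"

def evaluate(facts):
    """Fire the first rule whose IF-condition matches the facts."""
    for condition, conclusion in RULES:
        if condition(facts):
            return conclusion
    return DEFAULT

print(evaluate({"income": 20_000, "has_eviction_notice": True, "is_veteran": False}))
# -> "eligible: emergency housing assistance"
```

Even this toy shows why such systems struggle with complexity: every contingency must be anticipated and hand-encoded in advance, and the rule base grows combinatorially as the legal problem does.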

Data Science — mostly leveraging methods from Machine Learning (including Deep Learning), as well as Natural Language Processing (NLP) and other allied computational methods (Network Science, etc.) — is the modern coin of the realm (in both the commercial and academic spheres).

As the image above highlights, the broader A.I. world faces challenges associated with overhyped AI and faux expertise. #LegalAI also faces the problem of individuals and companies passing themselves off as “cutting edge AI experts” or “offering cutting edge AI products” without an academic record or codebase to their name. 

In the academy, we judge scholars on academic papers published in appropriate outlets. For someone to be genuinely considered an {A.I. and Law Scholar, Computational Law Expert, NLP and Law Researcher}, that scholar should publish papers in technically oriented peer-reviewed journals (*not* law reviews or trade publications alone). On the engineering or computer science side of the equation, it is possible to substitute a codebase (such as a major Python package or contribution) for peer-reviewed papers. For this field to be taken seriously within the broader academy (particularly by technically inclined faculty), we need more peer-reviewed technical publications and more codebases. If we do not take ourselves seriously, how can we expect others to do so?

On the commercial side, we need more objectively verifiable technology offerings that are not in line with Andriy Burkov's picture shown above … this is one of the reasons that we open-sourced the core version of ContraxSuite / LexNLP.

NLLP Workshop 2020, Session 1: Legal Text Analysis — Video of the Natural Legal Language Processing Workshop is Now on YouTube


Unfortunately, I was not available to participate, as I was teaching class at the time of the workshop. However, Corinna Coupette and Dirk Hartung represented us well!

A copy of the paper presented is available here:
SSRN LINK: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3602098
arXiv LINK: https://arxiv.org/abs/2005.07646

2nd Workshop on Natural Legal Language Processing (NLLP) – Co-Located with the broader 2020 KDD Virtual Conference

Today is the 2nd Workshop on Natural Legal Language Processing (NLLP), which is co-located with the broader 2020 KDD Virtual Conference. Corinna Coupette is presenting our paper 'Complex Societies and the Growth of the Law' as a non-archival paper. NLLP is a strong scientific workshop (I delivered one of the keynote addresses last year and found it to be a very good group of scholars and industry experts). More information is located here.

GPT-3 and Another Chat About the End of Lawyers

Today I was quoted in Legal IT Insider discussing GPT-3 and what it might mean for the future of Legal Tech / Legal AI … "There are a few demos out there which look promising (but lots of things look good in a demo and are not robust in the end). I think the open question is always to what extent we can project any general advance in NLP to the domain specific challenges here in law-law land … On the positive side, I think every major and even minor advance presents opportunities for us here in legal. It took a while, but for example you see many of the products in our space taking advantage of Word Embedding [or transformer] methods (Word2Vec, BERT, ELMo, etc.)"