#WiECon19 Blog Series: The Next Frontier of Machine Learning
In modern eDiscovery, it’s always good when you can get data scientists talking about their craft and machine learning in ways that everyone can understand and relate to our day-to-day work. But don’t get me wrong, these ladies also went deep as the title of their session indicates. This session was a perfect illustration of making the complex understandable. It was also yet another example of the value derived from and the quality of the work that went into the inaugural Women in eDiscovery (WiE) conference, so please join me to gain a much deeper understanding of machine learning in current systems and those that will take us to the next frontier.
Jigna Dalal, eDiscovery Consultant, Squire Patton Boggs
Jyothi Vinjumur, Sr. Data Scientist, Walmart
Irina Matveeva, Head of Machine Learning, NexLP
DEFINITION: MACHINE LEARNING IN EDISCOVERY
Jigna Dalal, eDiscovery Consultant at Squire Patton Boggs, kicked off the session with the basics, a definition of machine learning in legal (shown below). While we could debate whether this definition is broad enough to encompass all machine learning in legal, I think we can agree that it at least covers most modern eDiscovery systems.
Machine Learning in Legal:
An algorithm (or set of algorithms) that organizes or classifies documents by analyzing their features.
Understanding machine learning at a basic level is really not that difficult, if you have a good analogy. Here’s mine. If you’ve ever ordered food or a cocktail by telling the waiter or bartender what you like or prefer and he has made suggestions or prepared the perfect blend, then you have a grasp of the basics of machine learning in legal. You feed information to the system, and it then “organizes or classifies documents by analyzing their features,” bringing documents to the fore that meet your specified criteria.
This is exactly what our VenioOne Continuous Active Learning (CAL) module does. A project manager gives the system basic criteria to use in a profile. Human reviewers begin reviewing documents and the system brings forward more of the items meeting their criteria, shaping a model of their preferences as it progresses, until it reaches the point where it can predict the rest.
“VenioOne’s Continuous Active Learning module classifies documents by analyzing their content and metadata and shows reviewers the ones that meet your specified criteria.”
UNSUPERVISED VS. SUPERVISED MACHINE LEARNING
Jigna included a diagram mapping out the different types of machine learning. The diagram is from the “Business Intelligence” article linked in the Additional Resources section below, which I highly recommend for additional reading. She focused on two types of machine learning – supervised and unsupervised – with relevant examples in eDiscovery.
Several examples of unsupervised machine learning exist in most modern eDiscovery software, although many may not think of them in terms of machine learning. Email threading, near-duplicate clustering, and concept searching are relevant examples. The eDiscovery platform simply brings forth those items without any human interaction on the front end. Interfaces like the VenioOne Social Network Diagram, which allows you to see clusters of communication and delve deeper into them, are a good example of this. Our Email Threading View, which proactively lets you know which emails are missing and recreates them from strings in other emails, is another example.
What most of us identify as machine learning is supervised machine learning. Supervised machine learning is exemplified by traditional technology assisted review (TAR) or more modern predictive coding like the CAL module I mentioned earlier. The machine infers from the human knowledge and builds the model based on human input.
WHY DO WE CARE ABOUT MACHINE LEARNING?
Case Law and Regulations Related to eDiscovery
After the basics of machine learning, Jigna defined the why for the audience by going through relevant case law, studies, and regulations that relate to eDiscovery and machine learning. The widely cited first ruling on the use of TAR in e-discovery was Federal Magistrate Judge Andrew Peck’s decision in Da Silva Moore v. Publicis Groupe (Southern District of New York, 2012). It is hard to believe that ruling was seven years ago, and we are still discussing why we should care about machine learning in eDiscovery.
If you need further encouragement to utilize machine learning in your eDiscovery workflows, the panel cited a 2016 case, Dynamo Holdings, which actually commended the parties for developing a “predictive coding protocol.” They also discussed In re Broiler Chicken, a 2018 case discussing the need for increased transparency and cooperation when using machine learning. For additional historical case law, see the Additional Resources below.
While this need for the parties to be on board is important during legal proceedings, I feel like it is important to note that machine learning can still be utilized for improved early case assessment and locating additional custodians or relevant data in produced content.
Studies Regarding Machine Learning in Legal
Jigna also cited several studies showing the time savings and greater accuracy achieved by using machine learning in legal. Again, additional details are included below. The most recent and perhaps most relevant study she cited was done by LawGeex. It pitted expert contract reviewers against a machine learning system. The machine learning system achieved greater accuracy, and it finished in 26 seconds compared with two-plus hours for the humans. Wow, what a difference!
We humans are very attached to our ways of doing things and resistant to change. We also fail to recognize our biases and failings. This is a lethal combination in today’s legal environment. As the panel pointed out, it is time to get on board and change the discussion from whether to utilize machine learning in eDiscovery to how to use it.
The panel also pointed out that Texas, my home state and location of the WiE conference, is the 36th state to adopt the American Bar Association (ABA) model rule regarding the ethical duty for attorneys to be competent in technology. While rather generic in its language, clearly the ABA expects attorneys to keep up with relevant technology. Like the Da Silva Moore decision, this has been around since 2012 – seven long years.
The panel also discussed how even the US government is now getting involved by introducing legislation designed to reduce biases in algorithms being used by social media and companies. Clearly, the time for understanding more about machine learning is upon us.
CONCEPT OF TECHNOLOGY ASSISTED REVIEW
Jyothi Vinjumur, Sr. Data Scientist at Walmart, led the next segment with the concept of TAR in general before getting more specific about how it works. She explained that TAR is a form of supervised machine learning in that it uses documents which humans have actually coded. You take that set of documents and put it in the “TAR black box,” which gives you a result in the form of ranked documents. Like all things digital, the rank is a number: a probability from 0 to 1, with 1 being the highest certainty and 0.5 being a 50% probability.
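To make the idea of a ranked, probability-scored output concrete, here is a minimal sketch. It assumes a model that has already learned a weight per term; the weights, the bias, and the document IDs are made-up values for illustration, not anything presented in the session. The logistic function is one common way to squeeze a raw score into the 0 to 1 range.

```python
import math

# Hypothetical per-term weights a trained model might hold (made-up values).
WEIGHTS = {"merger": 1.8, "confidential": 1.2, "lunch": -1.5}
BIAS = -0.5

def relevance_probability(text):
    """Map a document to a score between 0 and 1 with the logistic function."""
    tokens = set(text.lower().split())
    z = BIAS + sum(w for term, w in WEIGHTS.items() if term in tokens)
    return 1.0 / (1.0 + math.exp(-z))

def rank(docs):
    """Return (doc_id, probability) pairs, most likely relevant first."""
    scored = [(doc_id, relevance_probability(text)) for doc_id, text in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

docs = {
    "DOC-1": "confidential merger terms attached",
    "DOC-2": "lunch on friday",
    "DOC-3": "quarterly newsletter",
}
ranking = rank(docs)
for doc_id, p in ranking:
    print(f"{doc_id}: {p:.2f}")
```

Reviewers would then work from the top of that list down, which is what “ranked documents” means in practice.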
TAR makes its predictions based on the patterns in the documents that you feed into the black box. Jyothi explained that the machine is simply “a steady learner listening for information.” The better trained it is, the better your results. An additional benefit of machine learning is that your models can be used to look for similar patterns in the future. You don’t have to train TAR from scratch again, because your model is already built.
TAR 1.0 vs. TAR 2.0
With TAR 1.0, it was all about the richness of the control set that was fed into the system, so it involved a lot of testing to perfect what gets fed to the black box. For example, you may have created a sample set of documents by using search terms. That filtration leaves a lot of data behind. She explained that this is the problem with TAR 1.0: it is most useful when you are sure about your case, what’s in the data, your search terms, who the custodians are, etc. It works best when you don’t need flexibility in the learning.
TAR 2.0 or CAL, on the other hand, starts with one document and works better when you:
- Know little about the case or data
- May be adding more custodians over time
- Need to identify custodians
- Are not certain what search terms should be used
- Have no defined control set
The work is done in iterations with human reviewers providing input into the black box. The machine learns until it has enough information and is able to predict the remaining documents.
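The iterative loop described above can be sketched in a few lines. This is a toy under stated assumptions, not how any particular product works: the term-weight update, the batch size, and the stopping rule (stop when no unreviewed document scores above zero) are all simplifications chosen for illustration, and `cal_review` is a hypothetical name.

```python
from collections import Counter

def tokens(text):
    return text.lower().split()

def score(text, weights):
    # Sum of learned term weights; higher means "more like documents coded relevant".
    return sum(weights[t] for t in set(tokens(text)))

def cal_review(docs, reviewer, seed_id, batch_size=2):
    """Review in iterations: code a batch, update the model, re-rank the rest."""
    weights = Counter()
    coded = {}
    queue = [seed_id]
    while queue:
        for doc_id in queue:
            relevant = reviewer(doc_id)  # the human decision
            coded[doc_id] = relevant
            for t in set(tokens(docs[doc_id])):
                weights[t] += 1 if relevant else -1
        remaining = [d for d in docs if d not in coded]
        remaining.sort(key=lambda d: score(docs[d], weights), reverse=True)
        # Assumed stopping rule: nothing left looks relevant, so stop reviewing.
        queue = [d for d in remaining[:batch_size] if score(docs[d], weights) > 0]
    return coded, weights

docs = {
    "A": "pricing agreement with competitor",
    "B": "competitor pricing call notes",
    "C": "team lunch menu",
    "D": "agreement on pricing floor",
    "E": "holiday party menu",
}
truth = {"A": True, "B": True, "C": False, "D": True, "E": False}
coded, weights = cal_review(docs, truth.get, seed_id="A")
print(sorted(coded))
```

Starting from the single seed document, the loop only puts documents in front of the reviewer that resemble what has already been coded relevant, and the lunch and party menus are never reviewed at all.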
What Happens Inside the Machine Learning Black Box?
As reviewers categorize the documents, they are telling the machine the following:
- Who the custodians are
- Which content is relevant or important
- What may be privileged
- Which domains are involved
- Which dates are important
Think of all of the bits of information that a document contains, both in its metadata and its content. The black box is looking at all of that information to bring what it thinks are more relevant documents forward. It is building its model based on the people, terms, and dates contained in each of those documents, and human reviewers are confirming the correct decisions.
“All the information that a document contains, both in its metadata and its content, is used by eDiscovery machine learning to bring forward the most relevant documents for reviewers.”
How Machine Learning Interprets Data
Data interpretation by the black box can be tricky in some ways. For example, relationships that we understand as humans may not be understood by the machine initially. A person with multiple email addresses is seen as several separate entities unless those addresses have been combined, so early clean-up of your data pool helps tremendously.
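As a small illustration of that clean-up step, the sketch below maps several addresses to one canonical person before any modeling happens. The alias table, names, and addresses are hypothetical; in a real matter the mapping would come from your own data review.

```python
# Hypothetical alias table; in practice this would come from your data clean-up.
ALIASES = {
    "jsmith@example.com": "Jane Smith",
    "jane.smith@example.org": "Jane Smith",
    "bob@example.com": "Bob Lee",
}

def normalize_sender(address):
    """Return one canonical name per person; unknown addresses pass through."""
    key = address.strip().lower()
    return ALIASES.get(key, key)

senders = ["JSmith@example.com", "jane.smith@example.org", "bob@example.com"]
raw_entities = {s.lower() for s in senders}
merged_entities = {normalize_sender(s) for s in senders}
print(len(raw_entities), "entities before merging,", len(merged_entities), "after")
```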
Places may be important, but the machine only sees those as words initially. Relationships between individuals may also be important to your case. Some examples of that are the role of CEO (rank matters) or the relationship between a client and lawyer (privilege matters). These are the features that matter to the machine and your case, so it should start making those connections over time.
Jyothi gave a great example. Take three documents with the following text:
- I like deep learning.
- I like NLP.
- I enjoy flying.
Machines are binary. They only understand 0 and 1. 1 is true. 0 is false. “I” in the document equals 1. “I” not in the document equals 0. However, the machine does this for hundreds of vectors for each word or piece of metadata. Eventually, the machine will learn that “deep learning” and “NLP” may be related, because they appear frequently together in similar documents. Needless to say, the math gets very complicated very fast. That’s why the machine ultimately is better than we humans at processing all those bits of related data.
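Here is a minimal sketch of that 0/1 encoding using Jyothi’s three sentences. The tokenizer (lowercasing and stripping punctuation) is my own assumption; real systems use far richer features than single words.

```python
import string

docs = ["I like deep learning.", "I like NLP.", "I enjoy flying."]

def tokenize(text):
    # Lowercase and strip punctuation so "learning." and "learning" match.
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

# One column per distinct word across all documents.
vocab = sorted({tok for doc in docs for tok in tokenize(doc)})

def binary_vector(text):
    """1 if the word is in the document, 0 if it is not."""
    present = set(tokenize(text))
    return [1 if term in present else 0 for term in vocab]

vectors = [binary_vector(d) for d in docs]
print(vocab)
for doc, vec in zip(docs, vectors):
    print(vec, doc)
```

Notice that the first two rows share the columns for “i” and “like” while the third shares only “i.” That overlap in surrounding words is the kind of pattern that lets a model start to treat “deep learning” and “NLP” as related.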
As you can imagine, the sophistication of the natural language processing (NLP) being used by the system is important to machine learning. Jyothi used the example of sentence diagramming that you may have done in elementary school. Verbs, nouns, adjectives are easy to identify, but the modifiers become very important, because they can completely change the meaning. She used these two sentences as an example:
- I see a girl using the telescope.
- I see a girl using the telescope, right?
They have very different meanings, right? There are thousands of language features to understand, but the model remembers patterns very well. That is the key to the computations being done. The more information that is fed to the black box and the greater the detail, the better it gets.
The final segment of the session was led by Irina Matveeva, Head of Machine Learning at NexLP. She dug deeper by explaining the importance of sentiment analysis for eDiscovery and how it works.
Why Sentiment Analysis is Important for eDiscovery
Investigations are looking for problems and sentiment analysis adds another layer to identify issues that may not be overt or clear in the words being used – assessing personality and tone. For example, you might start by looking more deeply at the time a conversation took place and then use sentiment analysis to go deeper. Weekend or night conversations could indicate negative sentiment or urgency. The goal with sentiment analysis is to look at a document in the broader sense, because the sentiment can be identified as either positive, negative, or neutral. Being able to understand at that level without the need for human review is a game changer for eDiscovery.
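The timing signal mentioned above is simple to sketch. The working-day boundaries here (7:00 to 19:00) are an assumption for illustration, and `is_off_hours` is a hypothetical helper, not a feature of any product discussed in the session.

```python
from datetime import datetime

def is_off_hours(sent_at, day_start=7, day_end=19):
    """True for weekends or for times outside the assumed working day."""
    is_weekend = sent_at.weekday() >= 5  # Monday is 0, so 5 and 6 are Sat/Sun
    is_night = sent_at.hour < day_start or sent_at.hour >= day_end
    return is_weekend or is_night

messages = {
    "weekday morning": datetime(2019, 6, 12, 10, 0),
    "weekday late night": datetime(2019, 6, 12, 23, 30),
    "saturday afternoon": datetime(2019, 6, 15, 14, 0),
}
flags = {label: is_off_hours(ts) for label, ts in messages.items()}
for label, flagged in flags.items():
    print(label, "->", "look closer" if flagged else "normal hours")
```

A flag like this only narrows where to look; the sentiment analysis then does the deeper work.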
How Sentiment Analysis Works
Sentiment analysis is opening the next frontier by looking at different levels of text to reach a conclusion. To illustrate how sentiment analysis works, Irina used an ambiguous review of an iPhone by a young purchaser who loves their new phone. Sentiment analysis is going beyond the star rating and looking at why a review is positive, negative, or neutral. Sentences in the review like “Mom was upset. She thinks the phone is too expensive.” can throw off the model, because the context is not there if you’re only looking at the document level. In that case you need to go to the paragraph level or feature level versus just looking at the sentence level. Document-level sentiment is too coarse. Sentence-level classification is much more precise. As these systems become available, these are the questions you should be asking. What level of analysis is being done? Will this system be able to solve my problem with its capabilities?
For sentiment classification, you must know what the important features of the text are. Words like “nice” and “amazing” are positive, but adding words like “isn’t” or “don’t” completely changes the meaning. Also, knowing the different uses of a word like “like” and being able to identify the intended one from the context around it makes a huge difference. Sentiment analysis also has to identify things like sarcasm. This is done by looking for contradictions within the sentence. Positive and negative in one sentence may indicate sarcasm.
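The negation and contradiction ideas can be shown with a toy, word-list classifier. Everything here is an assumption for illustration: the word lists are tiny and hand-picked, the negation rule simply flips the next sentiment word, and a “mixed” label is only a hint that a sentence deserves a closer look, not a sarcasm detector.

```python
import re

POSITIVE = {"nice", "amazing", "love", "great"}
NEGATIVE = {"upset", "expensive", "terrible", "awful"}
NEGATORS = {"not", "isn't", "don't", "never"}

def sentence_signals(text):
    """Return (positive_hits, negative_hits), flipping a word that follows a negator."""
    words = re.findall(r"[a-z']+", text.lower().replace("’", "'"))
    pos = neg = 0
    negate = False
    for w in words:
        if w in NEGATORS:
            negate = True
            continue
        if w in POSITIVE or w in NEGATIVE:
            is_positive = (w in POSITIVE) != negate  # flip polarity if negated
            if is_positive:
                pos += 1
            else:
                neg += 1
            negate = False
    return pos, neg

def classify(text):
    pos, neg = sentence_signals(text)
    if pos and neg:
        return "mixed"  # contradiction: possible sarcasm, or just too coarse a unit
    if pos:
        return "positive"
    if neg:
        return "negative"
    return "neutral"

review = "I love my new phone. Mom was upset. She thinks the phone is too expensive."
sentences = re.split(r"(?<=[.!?])\s+", review)
labels = [classify(s) for s in sentences]
print(labels)            # one label per sentence
print(classify(review))  # the whole review treated as one unit
```

Run on the whole review, the toy classifier sees both positive and negative words and cannot decide; run sentence by sentence, it separates the buyer’s enthusiasm from Mom’s reaction.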
Neutral classifications can be difficult, because they can throw off the model. Sentiment analysis is creating a mental model. Therefore, you have to ensure that you’re giving the model the correct features to teach it. Feature engineering is the key to the success of sentiment classification, but it is both an art and a science due to the numerous language and emotional components. For more reading, Irina suggested looking up Pedro Domingos, a leading machine learning researcher.
Generic Sentiment Classification Model
Another consideration when choosing a system is whether a generic sentiment classification model will work in your environment or can be adapted, if needed. You may need further training of a generic model to suit your specific data sets. Irina used a great example: “That vacuum cleaner really sucks”. The statement is positive in that context, but most of the time it is not. The key is to determine whether the patterns are the same.
Probably the most powerful idea is the ability to teach these models on one data set and then reuse them on your next case, or to take knowledge from multiple previous cases and reuse it for future cases. As Irina advised, this allows you to leverage the knowledge and expertise from previous cases and solve the “cold start” problem of building a model from scratch. Law firms are already vast repositories of knowledge, so why not leverage it? For example, when investigating collusion, a firm that has been doing that for twenty years has probably identified the five things that usually happen, so you’re starting from a better knowledge point.
The biggest challenge with reusing models will be data privacy concerns, especially if you are looking at pooling resources between law firms or clients. You will have to identify how much private data is in the model before the model can be reused.
While going deeper, sentiment analysis alone does not always paint a complete picture. Across the numerous human emotions, there are ranges of both positive and negative sentiment and numerous causes for those emotions. If we are using sentiment analysis in internal investigations, looking at things that indicate stress is good, because that definitely affects behavior. However, you have to go deeper into the context, because it could be an indicator of bad management, the company culture, discrimination, personal issues, or any number of other things.
THE FINAL FRONTIER: BEHAVIOR ANALYTICS
That is when we start to move into the final frontier: behavior analytics. These are advanced models using multiple signals in concert to build a model based on patterns of behavior. With things like fraud and collusion, no one is using those terms and the indicators may not be obvious from the language, but there can be communication or behavior patterns that a good behavior analysis model could identify.
A good example is the “Fraud Triangle” by Donald Cressey, which looks at three different indicators (pressure, opportunity, and rationalization) as a framework for spotting fraud or collusion. The key is in the triangulation of the items in those three categories.
A person committing fraud might indicate or do some of the following:
- Poor financial situation
- Pressure to perform – quotas
- Transactions not audited
- Open systems
- “Everyone does it.”
- “I deserve…” or “They owe it to me.”
- “Fat cats get away with everything.”
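The triangulation idea can be sketched as a check that signals from all three categories show up across one person’s communications. This is only an illustration: the phrase lists are hypothetical, and plain phrase matching stands in for whatever richer signals a real behavior analytics model would extract, since the whole point of such models is that people rarely use the obvious words.

```python
# Hypothetical example phrases per Fraud Triangle category.
SIGNALS = {
    "pressure": ["missed quota", "behind on payments"],
    "opportunity": ["not audited", "nobody checks"],
    "rationalization": ["everyone does it", "i deserve", "they owe it to me"],
}

def triangle_hits(messages):
    """Collect which categories show up anywhere across a person's messages."""
    hits = {category: [] for category in SIGNALS}
    for msg in messages:
        lowered = msg.lower()
        for category, phrases in SIGNALS.items():
            for phrase in phrases:
                if phrase in lowered:
                    hits[category].append(phrase)
    return hits

def is_triangulated(messages):
    """True only when every one of the three categories has at least one hit."""
    return all(triangle_hits(messages).values())

person_a = [
    "I missed quota again this quarter.",
    "Those transfers are not audited until year end.",
    "Honestly, everyone does it.",
]
person_b = ["I missed quota again this quarter.", "See you at the offsite."]
print(is_triangulated(person_a), is_triangulated(person_b))
```

No single message from the first person is alarming on its own; it is the combination across messages that stands out.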
Behavior analysis provides opportunities for much more complex situations. It goes well beyond the words in the document. The behavior will be spread over multiple communications and may not involve specific words. This is the difference between looking at data as being confined to the four corners of a document versus a complex model of behavior over time. Privilege is another good example in the eDiscovery realm. Identifying it takes more than the words in the document: the model must also consider whether the conversation involves lawyers, the titles of the people communicating, and the nature of their communications.
Q & A ON MACHINE LEARNING CONCEPTS
A number of good subjects came up in the question and answer session after the formal presentation, so I’ve included those as well.
Use or Avoid Tricky Documents
The first question asked was whether you should avoid tricky documents or use them. The panelists indicated that if you would like to adapt your model to them, e.g. for a case about vacuum cleaners, then use them. However, if it is one conversation that is different or irrelevant, leave it out. If it is just one document, then the model will eventually figure it out, but if it is multiple documents, then leave them out. The panel also indicated that it may help to look at the documents that sit at a neutral prediction, as opposed to those that are definitely positive or negative.
Documents in Different Languages
Language requirements definitely need to be considered. There are some more advanced systems where the models do machine translation and those work very well. However, most of the time, you will need to train the model in a single language, because languages have such different structures. Therefore, documents must be converted to a single language and then fed into the system.
Spatial Model of Analysis
Someone indicated that they had been working with a system that was advertised as being based on a spatial model of analysis vs. language analysis. The panel explained that with spatial analysis, each word is put in a multidimensional space to create the context. Each document is converted into all of the terms it contains and then related to documents within that same space. It creates a very advanced representation of the document collection that could be 300 or 1,000 dimensions long, depending on how the data scientists determine what is best for the data set. The panelists emphasized that while spatial analysis is being used in many advanced systems, it is more complex than what was described in the session content.
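A rough sketch of the “same space” idea: turn each document into a vector and measure how closely two vectors point in the same direction. This toy uses one dimension per distinct word and cosine similarity, which is my own simplification; the systems the panel described use learned representations with a few hundred dimensions.

```python
import math
from collections import Counter

def vectorize(text):
    """Represent a document as term counts, one dimension per distinct word."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means nothing in common."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

d1 = vectorize("pricing agreement draft")
d2 = vectorize("draft pricing agreement terms")
d3 = vectorize("holiday party menu")
print(round(cosine(d1, d2), 3), round(cosine(d1, d3), 3))
```

Documents that land close together in the space can be grouped or surfaced together, even when nobody searched for the words they share.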
Transfer of Learning
Someone wanted to know if they could use a model on documents produced to them, especially when protective orders require the deletion or destruction of documents in a case. Is there a way to comply and still use the information? The panelists said that you don’t need the documents after the model is created. You save the model and the knowledge. If necessary, proprietary information could be redacted.
Crafting Better Search Terms
In the tradition of saving the best for last, this question was it! “How can attorneys craft better search terms?” Irina quickly fired back, “Well, use machine learning is the answer. You don’t need search terms.”
Jigna disagreed and indicated that at Squire Patton Boggs they work it backwards. They use their systems to come up with some basic search terms based on hot documents, but then go above and beyond to train the system with a large number of documents, because the richness of their data sets is not usually 20% of the data. It is frequently only 2-3%, so they have to bring in a large diversity of data. If you have sophisticated text searching tools like those built into VenioOne, you can do this type of analysis. They then work with a data scientist to feed those terms back into the system, so they are marrying training the system with using search terms on the back end. In other words, they are developing the search terms from the data.
The panel further explained that getting to the next level is not just search terms, because that doesn’t give you the context. They suggested asking your vendors for a concept search or search term expansion model, which makes recommendations of search terms you may not have used.
Two great examples of this are VenioOne’s fuzzy and synonym search expansion features. The fuzzy search finds words with similar spelling, which can be helpful for finding misspelled words or slightly different spellings. Synonym search expansion is even more on point, because it finds different words that might be used for a single concept. Providing counts for both of these term hits and the ability to use these searches with other search criteria provides the ability to add the context needed. For example, finding “attorney client communication,” “privilege,” or “privileged” in correlation with an attorney or custodian’s name starts to get at context.
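To show the general idea behind these two kinds of expansion, here is a small sketch using Python’s standard `difflib` module for the fuzzy part. It is not VenioOne’s implementation or API; the vocabulary, the synonym table, and the similarity cutoff are all hypothetical values chosen for illustration.

```python
import difflib

# Hypothetical synonym table and index vocabulary; not VenioOne's data or API.
SYNONYMS = {"privileged": ["confidential", "protected"]}
VOCABULARY = ["privileged", "priviledged", "privilege",
              "confidential", "protected", "invoice"]

def expand_term(term, cutoff=0.8):
    """Return the term plus similarly spelled words and listed synonyms."""
    fuzzy = difflib.get_close_matches(term, VOCABULARY, n=5, cutoff=cutoff)
    synonyms = SYNONYMS.get(term, [])
    expanded = []
    for candidate in [term, *fuzzy, *synonyms]:
        if candidate not in expanded:  # keep order, drop duplicates
            expanded.append(candidate)
    return expanded

expanded_terms = expand_term("privileged")
print(expanded_terms)
```

The fuzzy step picks up the misspelling “priviledged,” the synonym step adds different words for the same concept, and an unrelated word like “invoice” is left out.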
ADDITIONAL RESOURCES
Big Data Defined
What Judges are really saying about Technology Assisted Review by James A. Sherer, David Choi, and Csilla Boga-Lofaro of Baker Hostetler
A deeper dive into the case law by those same authors is their article Court Guideposts for the Path to Technology Assisted Review Adoption published in Computer Science and Information Technology in early 2018 and originally presented as part of the proceedings of the 2017 Georgetown Advanced eDiscovery Institute
Technology Assisted Review & Predictive Coding — a Library is a resource from Fenwick & West containing case law summaries and links to cases through late 2018.
Why we’re training the next generation of lawyers in big data, October 2018 article by Georgia State University law professors, which details several studies and comparisons of humans to machine learning.
AI vs. Lawyers: The Future of Artificial Intelligence and Law, December 2018 article discussing studies and the use of AI in law.
ABA for Law Students site February 2019 article entitled Law students—avoid malpractice and embrace technology!