Lauren Wood
Robustness and Replicability Concerns in LLM-Based Legal Scholarship
Abstract profile. Full document pending author claim.
Authors:
Lauren Wood, Aileen Nielsen
Date Created:
2025-01-01
Course Title:
Professor:
Not specified
About Paper:
Legal scholars are increasingly investigating the capabilities of large language models (LLMs) to serve as tools for legal practice and education. Current legal scholarship on LLMs reflects varying degrees of technical and epistemic understanding of models, with some scholars attributing advanced legal reasoning capabilities to LLMs. In several cases, low methodological transparency or limited empirical testing create serious problems for replicability and robustness of findings. Concerns about this scholarship are not merely academic: overstated claims regarding LLM performance are already influencing how judges and other legal professionals integrate these tools into their work. Through an extensive literature review, this research identifies common patterns and assumptions embedded in how legal academics discuss and deploy LLMs. We then attempt to replicate one key study from Yonathan Arbel and David Hoffman’s influential 2024 article, Generative Interpretation, which uses LLMs to analyze the contract dispute in Ellington v. EMI. Using the authors’ provided code and chat conversation history, we re-run their LLM interactions and record the resulting outputs. To test robustness, we vary temperature settings, rephrase input prompts, and run the queries on updated models. Where significant output variation is observed, we conduct ablation studies to determine how sensitive the model is to specific prompt components. We expect to encounter significant difficulties with replication due to variability in model outputs as well as model deprecation. These challenges raise broader questions about the stability and interpretive assumptions underpinning Arbel and Hoffman’s methodology and prompt-based legal analysis as a whole. Through this exercise, the project also offers generalizable insights into replication and reproducibility standards for LLM-based legal research. Future work should apply the replication framework to additional studies within this field of scholarship and develop clear guidelines to help legal professionals understand the limitations and sensitivities of consumer-facing LLMs.

Empirical Effects of Expungement and Methods to Close the Expungement Gap
Audrey Yang, D. James Greiner
Harvard College | Currier House | Computer Science | 2027
Approximately one in three American adults possesses some form of criminal record. For these individuals, criminal-record clearing, or expungement, holds the potential to facilitate a fresh start. Currently, over two-thirds of U.S. states offer some form of adult criminal-record expungement. However, a persistent gap remains between those who are eligible for expungement and those who ultimately pursue it, an issue commonly referred to as the “expungement gap.” Despite policy interest in record-clearing mechanisms, the causal effects of expungement remain underexplored in the empirical literature. This study implements a randomized controlled trial (RCT) to estimate the effects of expungement on key life outcomes, including recidivism, housing stability, and employment. In collaboration with Kansas Legal Services, a nonprofit legal aid provider, we randomly assigned expungement-eligible individuals into two groups: one receiving intensive legal assistance through assignment to a specialized expungement attorney (substantially increasing the likelihood of record clearance), and the other receiving only self-help materials, which make expungement less likely. The study is ongoing, and all participants are being followed over a five-year period to assess impacts on criminal justice involvement, employment stability, and housing outcomes. In addition to quantitative outcome measures acquired from government records such as tax information and wage rates, we collect quantitative participant survey data by asking for self-perception ratings on a scale from 0 to 10 to explore how expungement affects self-perception and identity. While preliminary quantitative findings from survey responses have not shown significant expungement effects on employment and housing, we hope to find more substantive results from the ongoing quantitative data collection. This study is the first randomized controlled trial to examine the effects of criminal-record expungement on housing stability and recidivism. It also evaluates different interventions that may help close the expungement gap and inform broader policy discussions surrounding criminal justice reform and economic reintegration.

Impact of Medicaid Expansion and Cuts on Pregnancy-Related Healthcare Costs
Grace Yang, Danielle Braun, Amruta Nori-Sarma
Harvard College | Adams House | Statistics | 2027
Recent policy changes to Medicaid eligibility, services, and funding threaten the health and livelihood of millions of individuals in the U.S., especially mothers and children. Since 41% of births are currently covered by Medicaid, it is important to study the impact of this new policy and quantify the costs, first by investigating the changes in costs that occurred during the Medicaid expansion under the Affordable Care Act and then by applying the knowledge to quantify the costs of the new bill passed under President Trump. To do this, a series of analyses were conducted, including initial descriptive analyses and a two-way fixed effects model to determine the impact of Medicaid expansion on overall medical costs associated with pregnancies. Finally, a cost analysis was conducted to translate the previous effects into economic terms. Current results suggest that states that chose to expand Medicaid may have experienced fewer emergency visits and less long-term cost. When applying these findings to determine the impact of Medicaid policy changes, the projections suggest that reversing Medicaid funding would likely increase overall costs. Therefore, the evidence emphasizes that Medicaid expansion not only supported public health but also produced cost-effective outcomes, meaning that its reversal could have detrimental economic and health consequences.

Summer Program for Undergraduates in Data Science

Quote Submission and Price Impact in Modern Limit Order Books
Levi Backman, Neil Shephard
Harvard College | Cabot House | Computer Science | 2027
Modern financial markets, including major digital exchanges such as the NYSE and NASDAQ, function primarily as limit order books (LOBs). In this structure, participants submit quotes, that is, offers to buy (bids) or sell (asks), which are stored in the book until matched with a counterparty. When a bid meets or exceeds an ask, a trade is executed. Alongside limit orders, market orders allow for immediate execution at the best available price. On these high-frequency exchanges, thousands of quotes and trades can occur for a single security each second. This fast-paced environment provides a rich foundation for studying market microstructure: the mechanisms through which trades, prices, and liquidity interact.
This project investigates the causal relationship between quote submission and short-term price movements of the corresponding security. Specifically, we ask: What is the causal effect of posting a quote on the price of that security? To explore this question, we used the NYSE’s TAQ (Trade and Quote) dataset accessed via the WRDS platform. While this data source offers high-resolution information on quotes and trades, it does not include full order book depth, limiting the scope of LOB-specific analyses.
Given the complexity of market microstructure modeling and my own learning curve with advanced econometric tools, a significant portion of this project focused on researching relevant literature and exploring its application to real data. Drawing on models outlined in foundational market microstructure research, we applied early frameworks such as the Roll model and asymmetric information/sequential trading models to the TAQ data to identify potential signals of quote-driven price impact. Although our results remain preliminary, this modeling work informed our understanding of the data and clarified possible paths for more robust causal analysis.
Understanding the potential influence of quote activity on prices has important implications for both practitioners and regulators. Traders may already be exploiting this mechanism to move markets or extract profit, while large institutional investors could benefit from insights that minimize the market impact of their trades and enhance liquidity strategies.
Though this project remains in an early stage, the groundwork laid here, particularly in understanding how to adapt structural economic models to partial TAQ data, positions us well for further causal inference work and exploration of the financial market field.

Salata Summer Undergraduate Research Fund

An Evaluation of Software-Assisted Digitization Methods for Environmental Research Applications
Wissam Alghabra, Michael Foley, Peter Huybers
Harvard College | Leverett House | Computer Science | 2026
Digitization is crucial to the preservation and accessibility of historical archives across all disciplines, but manual data-entry is extremely time-consuming and is subject to human error. This project investigates software-assisted digitization methods, such as optical character recognition (OCR), in the context of food-focused environmental research, with the aim of establishing best practices for digitization and an evaluation of current methods based on our systematic experiments. Two primary methods were tested on tabular data from the Syrian Ministry of Agriculture statistical reports (1977-2023): the PyMuPDF Python library and the Tesseract OCR Engine. Evaluation metrics focused on speed and accuracy, excluding post-processing. Both methods’ effectiveness correlates strongly with source document quality, performing optimally with English text and well-defined table borders. In the case of searchable documents (e.g., PDF exports of Excel Spreadsheets), PyMuPDF demonstrated exceptional efficiency, processing a 240-page document within 16 seconds. The method shows promise for well-formatted documents that include non-Latin scripts and was proven to be effective for Arabic documents, though it could not be used to process non-searchable documents (such as scans). Tesseract OCR effectively handles well-formatted English tabular data but requires longer processing times, approximately 5 minutes for the same 240-page document. However, OCR is more prone to error and requires more pre-processing and post-processing time compared to PyMuPDF. OCR also failed to interpret non-Latin characters when testing on Arabic documents, though language-specific training data may improve performance. Both methods are significantly less effective against documents that lack clear formatting, requiring manual interventions like cropping extraneous information. Further challenges arise for non-English and non-typed or handwritten documents. Future work includes improving OCR through appropriate training data, automating table boundary detection, and exploring large language models (LLMs) as potential solutions. While current software-assisted digitization methods are a significant improvement over manual data-entry, substantial challenges remain for documents in suboptimal conditions, warranting systematic investigation into LLM assistance.

Recognition through Work and Environmental Justice
Sydney Black, Michèle Lamont
Harvard College | Currier House | Social Studies | 2027
As inequality grows and economic opportunity wanes, social inclusion has become a new marker of hope. How does recognition appear across comparative contexts and vary by form? Using computer software to analyze and code 60 interviews with Carolinian and Chamorro individuals (from the Commonwealth of the Northern Mariana Islands) and Algonquin Anishinaabe First Nation members (from Pikwakanagan and Kebaowek Nations in Quebec and Ontario), we examined the ways in which Indigenous peoples determine the meaning of work and the consequences of environmental harm. Each interview focused on the ways individuals perceived their role in society and received recognition, affirming positive attributes of human groups, particularly in the context of sustained militarization in the CNMI and the construction of the Chalk River Project, a nuclear waste disposal site on Algonquin land. Combined with a separate case study on political recognition, Recognition Globally will demonstrate the numerous ways people across economic and social contexts perceive their value amidst dramatic global social change and uncertainty about the future.

Ivy League Survival: Characterizing Trait Plasticity of Boston Ivy to Atmospheric Drought
Ludmila Blackappl, Hannes De Deurwaerder, Noel (Missy) Holbrook
Harvard College | Quincy House | Human Developmental and Regenerative Biology | 2028
As drought conditions become more frequent and intense, plants face increasing pressure to adapt to survive and remain competitive within their ecosystems, relocate to more favorable areas, or perish altogether. Plant trait plasticity, a species-specific ability to modify physiological and anatomical traits in response to environmental stressors, outlines the potential and the limitations of a plant’s competitive strategy to persist. Thus, understanding how different plant species express trait plasticity, i.e., which and to what extent specific traits are modified, can help predict survival, competitive success, and future shifts in species distributions.
We examine the trait plasticity of Boston Ivy (Parthenocissus tricuspidata), initially grown under identical conditions and subsequently subject to distinct atmospheric drought conditions, i.e., a high and low vapor pressure deficit (VPD). We analyzed leaf traits (stomatal density and size, leaf area), stem vessel characteristics (vessel size and density), and stem tissue composition (parenchyma fraction and hydraulic tissue fraction).
We observed that Boston Ivy grown under high VPD developed smaller leaves with lower stomatal density, indicative of a strategy for reducing transpirational water loss. Additionally, high VPD individuals also showed changes in stem anatomy, such as smaller vessel diameters and increased parenchyma area, suggesting a more embolism-resistant hydraulic strategy to maintain hydraulic integrity under drought.
Despite the limited duration of this project, Boston Ivy demonstrates significant changes in various leaf and stem traits, supporting its capacity to adapt. This high plasticity may allow Boston Ivy to cope with projected future drought conditions, likely giving the species a competitive advantage over less plastic species.
Source:
Harvard / Joseph Wang, Mengyu Wang / 2025
Topics:
expungement, legal, model, document, medicaid, cost, effect, impact, llms, quote, price, trait