Final projects from last year (2024)

Title Members Abstract Link
Training and Benchmarking Neural Machine Translation Models Ethan Mathieu, Shankara Abbineni In this project, we ask two questions: what are the gains from fine-tuning general language models on translation, and can general language models, when fine-tuned, perform better on translation tasks than a model trained solely for translation? To answer them, we train the DeLighT transformer model for English-to-French translation and compare its BLEU performance to other neural machine translation models which we fine-tune. We find that fine-tuned general language models can perform better than translation-specific models. Additionally, we build a Next.js web application to allow end users to experiment with the different models and view their performance. View Project
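A minimal sketch of the kind of BLEU comparison described above, using the sacrebleu library (the hypothesis and reference sentences are placeholders, not the project's data):

```python
# Minimal sketch: scoring candidate translations with sacreBLEU.
import sacrebleu

hypotheses = ["le chat est assis sur le tapis"]      # model outputs, one per sentence
references = [["le chat est assis sur le tapis"]]    # one reference stream, parallel to hypotheses

score = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {score.score:.2f}")
```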
Improved Protein Function Prediction by Combining Persistent Cohomology and ProteinBERT Embeddings Anna Su, Jason Apostol Understanding the molecular function of proteins is extremely important in elucidating their biological mechanisms and in engineering new therapeutics. We present a protein function classifier combining features from both sequence and structure: sequence embeddings generated by a pretrained ProteinBERT model trained on ~100M proteins, supplemented with structural embeddings generated by a molecular function-specific implementation of PersLay trained on our smaller target dataset of ~6,000 human protein structures. We show that supplementing the sequence embeddings with structural embeddings improves classifier accuracy by approximately 4% while using a relatively small number of parameters, and demonstrate that the H_1 homology group is the most important for performance. This work has applications to drug discovery, elucidation of biological pathways, and protein engineering, as it provides a high-fidelity estimate of the role of a protein in a biological system. View Project
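To illustrate the feature-combination step, here is a hedged sketch in scikit-learn; the embedding dimensions, class count, and random arrays are stand-ins for the actual ProteinBERT and PersLay outputs:

```python
# Sketch (assumed shapes): concatenate per-protein sequence embeddings with
# topological (persistence-based) embeddings, then train a classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

n = 6000
seq_emb = np.random.randn(n, 512)      # stand-in for ProteinBERT embeddings
topo_emb = np.random.randn(n, 64)      # stand-in for PersLay (H_1) features
labels = np.random.randint(0, 10, n)   # molecular-function classes

X = np.concatenate([seq_emb, topo_emb], axis=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("accuracy:", clf.score(X_te, y_te))
```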
Biomedical Lay Summarization Xincheng Cai, Mengmeng Du Biomedical research articles contain vital information for a wide audience, yet their complex language and specialized terminology often hinder comprehension for non-experts. Inspired by the BIONLP 2024 workshop, we propose an NLP solution to generate lay summaries, which are more readable to diverse audiences. We implemented two transformer-based models, specifically BART and BART-PubMed. Our study investigates the performance of these models across different biomedical topics and explores methods to improve summarization quality through definition retrieval from Webster's Medical Dictionary. By enhancing the readability of biomedical publications, our work aims to promote the accessibility of scientific information. View Project
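A summarization pass of the sort described can be sketched with the Hugging Face transformers pipeline; the checkpoint below is the public BART summarizer, standing in for the project's BART-PubMed variant:

```python
# Sketch: generating a summary with a BART checkpoint via transformers.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
article = "Biomedical research articles contain vital information ..."  # full article text here
summary = summarizer(article, max_length=130, min_length=30, do_sample=False)
print(summary[0]["summary_text"])
```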
Advancing AI Safety in LLMs through Dynamic Multi-Agent Debates Vincent Li, Anna Zhang, Lindsay Chen The safety and security of large language models (LLMs) have garnered significant attention with the advent of multi-agent frameworks. Our research expands on methodologies proposed in 'Combating Adversarial Attacks with Multi-Agent Debate' (1) by introducing dynamic role allocation and diversifying agent capabilities within multi-agent frameworks. These enhancements address key limitations, including static role allocation and agent homogeneity, which limit the adaptability of debates in uncovering adversarial strategies. Our proposed framework incorporates dynamic roles such as proposer, opposer, questioner, and mediator, alongside enhanced agent capabilities that allow for nuanced exploration of adversarial dialogues. The framework is implemented and trained using state-of-the-art LLMs and evaluated on existing datasets, demonstrating its effectiveness in identifying and mitigating adversarial threats in LLMs. This approach advances AI safety by fostering more robust and versatile multi-agent interactions, contributing to secure and reliable LLM applications. View Project
Transfer Learning is All You Need for Sentiment Analysis Minyi Chen, Zishun Zhou, Bowen Duanmu Transfer learning is a crucial technique that helps us learn from external sources, thus improving model performance on small datasets. In this paper, we work on Twitter Sentiment Datasets with three categories: Neutral, Positive, and Negative, using models like BERT and Gemma, and explore the impact of transfer learning on classification performance. We experimented with various data preprocessing strategies, such as removing stop words and special characters like emojis. We pre-trained our model on different datasets with similar or different tasks. During fine-tuning, we also tried various layer-freezing strategies. Our best model achieves 93.5% accuracy, 93.1% recall, and a 93.4% F1 score on the test set. Experimental results indicate that the performance of transfer learning is influenced by various factors, including the model, dataset relationships, and freezing strategies. View Project
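One of the layer-freezing strategies mentioned can be sketched as follows; freezing the embedding layer plus the first 8 encoder layers is an illustrative assumption, not the authors' setting:

```python
# Sketch: freeze lower BERT layers, fine-tune the rest plus the classifier head.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3  # Neutral / Positive / Negative
)

for param in model.bert.embeddings.parameters():
    param.requires_grad = False
for layer in model.bert.encoder.layer[:8]:
    for param in layer.parameters():
        param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```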
Deciphering Clinical Trial Reports: A Novel NLP Task and Corpus for Evidence Inference Xinyi Di, Chengxi Wang, Yun Yang In healthcare, accurate assessment of treatment efficacy is crucial but hindered by the complex and voluminous nature of clinical trial reports. Traditional methods fall short, highlighting the need for advanced automated solutions. Our research addresses this challenge by developing NLP models that utilize sophisticated attention mechanisms to improve the extraction and synthesis of evidence from these reports. By incorporating LoRA, we enhance the fine-tuning efficiency of our models, making large language models more accessible and effective. We evaluate our approach by comparing the performance of a BERT-based baseline model with advanced models constructed using BioBERT and ClinicalBERT. This study not only advances the field of NLP in healthcare but also has the potential to revolutionize the way clinical evidence is processed, hence enhancing patient care. View Project
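The LoRA setup referenced above might look roughly like this with the peft library; the rank and target modules are illustrative assumptions, not the project's hyperparameters:

```python
# Sketch: wrapping a BioBERT-style classifier with LoRA adapters via peft.
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model, TaskType

base = AutoModelForSequenceClassification.from_pretrained(
    "dmis-lab/biobert-v1.1", num_labels=2
)
config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8, lora_alpha=16, lora_dropout=0.1,
    target_modules=["query", "value"],  # attention projections in BERT
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the low-rank adapters train
```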
Llama3-8-Bing: A sarcastic language model learns from Chandler Bing Yuntian Liu, Zihan Dong In this project, we explored various large language models and fine-tuning or alignment techniques to classify and generate sarcastic dialogue. We applied generative AI to the study of sarcasm and trained a sarcastic chatbot, based on the Llama3-8B model, that learned from Chandler Bing. View Project
Biomedical Document Summarization Models (BDSM) Lleyton Emery, Diego Aspinwall In this writeup, we address the natural language processing task of BioLaySumm, which aims to generate layperson-friendly summaries of biomedical research articles. We implemented and compared various summarization approaches, including extractive and abstractive summarization models, large language models, and ensemble models. Through systematic evaluation of the relevance, readability, and factuality of summaries, we sought to identify the most effective summarization strategies for this domain, ultimately contributing to the advancement of health literacy and informed decision-making in the biomedical field. View Project
Models Understand Models: Predicting Unknown from What We Know Kaiyuan Guan Our research builds upon existing frameworks with the aim of linking the assessment of abstract capabilities to the model's performance on problem sets with established ground truths. We propose a computationally economical, easy-to-use, and interpretable method to diagnose the inherent ability of any given LLM, by leveraging the advantages of linear probes and discoveries in self-consistency. If we want to build a model that is better than humans, it is crucial to know what leads to failure. Like a variety of prior research, we start with a crucial discovery: language models can produce well-calibrated predictions for token probabilities on-distribution (Guo et al. (2017)). Based on this, we train an MLP on the activations of the last token in the LLM's chain-of-thought (CoT) answers, which are elicited by a compound strategy that draws on few-shot prompting, reasoning, and the model's intuition. View Project
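A hedged sketch of the probing step, training an MLP on cached last-token activations; the shapes and correctness labels are placeholders for the actual cached CoT activations:

```python
# Sketch: an MLP probe over last-token hidden states from CoT answers.
import torch
import torch.nn as nn

hidden = torch.randn(2048, 4096)                 # (n_answers, d_model) cached activations
correct = torch.randint(0, 2, (2048,)).float()   # 1 = the CoT answer was right

probe = nn.Sequential(nn.Linear(4096, 256), nn.ReLU(), nn.Linear(256, 1))
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for _ in range(100):
    opt.zero_grad()
    loss = loss_fn(probe(hidden).squeeze(-1), correct)
    loss.backward()
    opt.step()
print("final loss:", loss.item())
```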
Predicting Primary Sub-Categories of Statistics arXiv Papers Ali Aldous, Eugene Han, Elder Veliz We investigate the application of natural language processing techniques for automatic classification and category moderation within the arXiv repository, specifically for classifying statistics papers by primary sub-category using their titles and abstracts. Previous work has demonstrated the efficacy of fine-tuning BERT-based models for classifying arXiv papers, but only on balanced datasets with broad categories such as biology and physics, or on distinct sub-categories under subjects other than statistics. Using a dataset of 60,648 arXiv papers within the statistics category, we experiment with TF-IDF embeddings combined with a Linear Support Vector Classifier, SPECTER2 embeddings combined with Logistic Regression, and RoBERTa models, extending past research to include imbalanced sub-categories with significant content overlap. Our results show that while fine-tuning RoBERTa substantially increases performance on unseen paper titles and abstracts, it underperforms compared to other baselines, which may highlight potential shortcomings of this approach. Comprehensive details on the source code are available in the GitHub repository ehan03/arxiv-stat-nlp. Instructions for setup are provided to facilitate replication and verification by other researchers. View Project
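The TF-IDF + Linear SVC baseline is straightforward to sketch in scikit-learn (the two documents and labels below are invented examples, not the project's data):

```python
# Sketch: TF-IDF features over titles+abstracts, classified with a linear SVM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

docs = [
    "Bayesian shrinkage priors for sparse regression ...",   # stat.ME
    "Convergence rates of stochastic gradient descent ...",  # stat.ML
]
labels = ["stat.ME", "stat.ML"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(docs, labels)
print(clf.predict(["Variational inference for deep models"]))
```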
Synthetic Data for Cross-Domain Uncertainty Analysis Stephanie Hu In many real-world applications of machine learning, obtaining labeled data in sufficient quantities can be a challenging and resource-intensive task. This project adapts a common approach in image generation and processing to address this issue. I design and train a Conditional Generative Adversarial Network (CGAN) for synthetic labeled data generation in a domain distinct from that of the training data. Unfortunately, my results show that the CGAN model architecture and fine-tuning methods I chose are not capable of generating high-quality synthetic data: the generated samples are not recognizably English and perform no better than random bag-of-words sampling. Furthermore, it appears that the conditional label has limited weight in the generator model, suggesting my model was unable to extract aspect-level features. I conclude by discussing the limitations of my approach and suggesting further experimentation for model improvement. View Project
Reimplementation of Topic Modeling with Wasserstein Autoencoders Aryaan Khan, Yuhang Cui, Raymond Lee This project re-implements topic modeling with Wasserstein auto-encoders (WAE) from [2], which have much faster training times than traditional topic modeling with LDA and, unlike variational auto-encoders (VAE), allow direct Dirichlet distribution matching without a Gaussian approximation. Re-implementing WAE in the more popular PyTorch framework allows easier integration and better usability, and we also verify the original paper's claims by comparing the performance of WAE against LDA. Our implementation confirmed the original paper's results that WAE can perform on par with or better than LDA while training much faster. This project verifies the results of the original WAE paper and provides a PyTorch implementation for future use, which is available on GitHub. View Project
Discrimination Risks in LLMs Conrad Lee, Irine Juliet Otieno We explore the risks of discrimination and bias in LLMs with a short survey paper and an experiment. We find that biases in LLMs have their roots in many sources, not just in training corpora. We categorize different manifestations and targets of LLM biases as well as types of debiasing solutions. We support these findings through direct experimentation following the procedure of Dhamala et al. (2021). We validate that racial biases exist within BERT, particularly towards the African American population. View Project
Advancing Author-Specific Language Generation with a Custom Generative Pre-trained Transformer Emilia Liu, Jingjia Meng This project advances natural language processing by developing a custom Generative Pre-trained Transformer (GPT) model designed specifically to emulate Ernest Hemingway's distinctive writing style. Utilizing 'The First Forty-Nine Stories' as the training corpus, the model leverages a multi-head attention mechanism, inspired by the paper 'Attention Is All You Need'. The model's performance was evaluated using various metrics, including ROUGE, METEOR, and BERT scores, to assess its efficacy in style mimicry compared to traditional language models. Results indicated that while the model can generate text with lexical diversity and sentence complexity akin to Hemingway's style, challenges remain in capturing the full spectrum of his stylistic essence. Future work will focus on optimizing model architecture and training processes to enhance the fidelity of generated text to Hemingway's style. This approach not only demonstrates the capabilities of GPT models in personalized language modeling but also opens avenues for future research into author-specific language generation. Such developments hold significant promise for applications in digital humanities, authorial style emulation, and beyond. Code is available at: https://github.com/jjmeng08/CPSC_Project.git View Project
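The causal multi-head self-attention block at the core of such a model can be sketched in PyTorch; the dimensions below are illustrative, not the project's configuration:

```python
# Sketch: one multi-head self-attention block, following "Attention Is All You Need".
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        T = x.size(1)
        # Causal mask: each position may only attend to earlier positions.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return out

x = torch.randn(2, 16, 256)  # (batch, sequence, d_model)
print(MultiHeadSelfAttention()(x).shape)
```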
Instruction Tuning to Improve Multi-Document Processing Capabilities of LLMs Gabrielle Kaili-May Liu, Richard Luo Multi-document pre-training objectives are a strong approach to boosting LLM performance on downstream multi-document tasks, yet such approaches tend to be less general and scalable to broader model types and sizes. Additionally complicating multi-document task capabilities is the 'lost-in-the-middle' phenomenon, whereby the performance of long-context language models decreases significantly when relevant information is located in the middle of the context as opposed to the beginning or end. Recent work suggests instruction tuning as a scalable method for enabling automatic instruction generation/following in LLMs. In this project we therefore leverage such an approach to confer LLMs with improved long-context multi-document capabilities in a more scalable way. Preliminary results demonstrate promise for our proposed approach in the zero-shot setting. Our code is available at https://github.com/pybeebee/577_final_project. View Project
Multimodal ClinicalEDBERT: Predicting Hospital Admissions with MIMIC-IV ED Database Yufei Deng, Yihan Liu, Hang Shi View Project
Advancements in NLP for Autonomous Robotic Systems through Transformer Models and Neural Networks Liam Merz Hoffmeister, Stephen Miner View Project
Mixture-of-Experts Transformers: A Survey Yizheng (Jerry) Shi, Ginny Xiao, Abhisar Mittal, Ardavan (Harry) Abiri View Project
Multimodal Training of Transformers Leo deJong, Reese Johnson, Siva Nalabothu The transformer architecture by Vaswani et al. (2017) has replaced RNN-LSTM based approaches in state-of-the-art language modeling. In particular, decoder-only generative large language models in the range of 7 billion to 300 billion parameters (or more than 1 trillion for some mixture-of-experts models) have shown remarkable performance in mimicking human conversational abilities on a wide range of topics. One of the important directions for continuous improvement of these transformer models is natively adding support for multimodal input and output. To this end, this paper reviews state-of-the-art approaches in multimodality today and presents results related to replicating Lewkowycz et al.'s (2022) attempt to train language models to solve quantitative problems. Our code is available at https://github.com/snalabothu/multimodal-training-of-transformers. View Project
Contextual Embeddings for Sentiment Classification Accuracy Carl Viyar, Christopher Nathan Our project explores the benefits of contextual word embeddings for the task of sentiment classification. We base our approach on the findings of the 2017 paper 'Learned in Translation: Contextualized Word Vectors' by McCann et al., which uses CoVe contextualized embeddings to achieve a 6.8% increase in sentiment classification accuracy. In this paper, we compare performance using BERT-derived token embeddings with baseline performance using GloVe embeddings, finding that BERT-derived embeddings demonstrate a similar improvement. View Project
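The contrast between static and contextual embeddings can be sketched as follows: unlike a GloVe lookup table, BERT produces a different vector for each token occurrence depending on its sentence context (model and sentence below are illustrative):

```python
# Sketch: extracting contextual token embeddings from BERT, the drop-in
# replacement for static GloVe vectors in a sentiment classifier.
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

batch = tok(["the movie was quietly devastating"], return_tensors="pt")
with torch.no_grad():
    out = bert(**batch)
embeddings = out.last_hidden_state  # (batch, tokens, 768), context-dependent
print(embeddings.shape)
```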
Knowledge Distillation From Gemini to Mistral for Earnings Call Transcript Summarization Rohan Phanse, Joonhee Park Earnings call transcripts are invaluable to investors because they contain insights that can lead to profitable investments and optimal decision-making. However, these calls are often lengthy, making it difficult for investors to quickly identify key insights from them. Prior work applying large language models to financial document summarization partly addresses this need, but still struggles to identify the most important information that should be included in summaries. In this project, we approach this challenge by finetuning Mistral 7B-Instruct on an augmented version of Mukherjee et al.'s ECTSum benchmark, in which we replaced the bullet-point summaries in ECTSum with longer summaries. We used Gemini Pro to create this augmented ECTSum dataset and developed a quality ranking system to select the augmented summaries that best aligned with the information in ECTSum's bullet-point summaries. We then performed knowledge distillation by finetuning Mistral 7B-Instruct on the augmented dataset to align it with Gemini's outputs. After finetuning, we observed improvements in ROUGE performance across the board and an improved ability to recall important statistics from the transcripts. View Project
Domain-Specific Value Alignment of Large Language Models Kevin Chan, Paul Lin, Jonah Sparling, Rami Pellumbi Large Language Models (LLMs) are increasingly being used in a wide range of applications, and organizations may wish to align these models with specific sets of values for particular use cases. In this project, we explore the feasibility of aligning an open-source LLM, Mistral, to create an educational chatbot suitable for young children. We generate a dataset of child-appropriate and inappropriate prompt-completion pairs using ChatGPT-4 and Claude-3 models. We then employ three approaches to align Mistral: 1) a prompt-based method using a safety prefix, 2) supervised fine-tuning using the LoRA technique, and 3) applying control vectors to model activations during inference. To evaluate the safety of the model outputs, we use GPT-4 to perform automatic evaluation on the prompt-completion pairs. Our results show that while the prompt-based approach and fine-tuning significantly improve the model's ability to provide child-appropriate responses, the control vector method underperforms in comparison. This work demonstrates the feasibility of aligning LLMs for specific use cases and highlights the importance of carefully evaluating alignment approaches to ensure they meet the desired safety criteria. View Project
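Approach (3) above, applying control vectors at inference time, can be sketched with a PyTorch forward hook; the layer index, the vector itself, and the attribute path in the commented usage line are assumptions for illustration, not the authors' implementation:

```python
# Sketch: adding a steering/control vector to one decoder layer's hidden
# states during generation. The vector would be derived elsewhere, e.g.
# from contrastive "appropriate vs. inappropriate" prompt activations.
import torch

def make_hook(control_vector: torch.Tensor, strength: float = 1.0):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + strength * control_vector  # shift every position
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# Hypothetical usage (layer 15 chosen arbitrarily):
# handle = model.model.layers[15].register_forward_hook(make_hook(v))
# ... model.generate(...) ...
# handle.remove()
```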
Genre, Period, and Provenience Classification in Akkadian Cuneiform Documents Avital Romach Many under-resourced languages still lag behind the promise that large language models have to offer. It is inefficient to train large language models on small amounts of data, and there are no clear guidelines or recommendations for transferability between languages and scripts [1,2]. For ancient and dead languages, such as Akkadian, written in the cuneiform script, the challenges are more pronounced, as Akkadian texts are interpreted through the bias of modern scholars. In particular, there are several levels of interpretation which can be used as input for machine learning models. This paper presents the first attempt to perform genre, period, and provenience classification in Akkadian cuneiform documents, while trying to assess the issues discussed above. I use two baseline models, Naive Bayes and Logistic Regression, and three BERT models fine-tuned to this task. Each model and classification task was trained and tested on four versions of the same Akkadian texts: lemmatized, normalized (phonetically reconstructed), segmented Unicode cuneiform, and unsegmented Unicode cuneiform. The best performing models for each classification task are multilingual BERT with normalization for genre (96% weighted F1), Arabic BERT with segmented Unicode cuneiform for period (97%), and multilingual BERT with normalization for provenience (93%). I further assess how preprocessing and tokenization methods affect the models' accuracy, how modern editorial practices potentially contribute bias to certain identifications, and the specific difficulties in each type of classification task. The code is available in a GitHub repository. View Project
Noise and Nuance: Impact of Input Noise on Translation Accuracy in Transformer Models Andrew Pan, Nandan Sarkar, Aditya Kulkarni View Project
DPO Can Reveal Latent Hallucinations Nathan Shan Hallucinations in large language models, defined as convincing yet incorrect outputs, are a critical challenge that can arise during instruction tuning. Techniques adjacent to reinforcement learning from human feedback (RLHF), such as Direct Preference Optimization (DPO), can theoretically reduce model tendencies to hallucinate by incorporating human preferences for appropriate certainty and accuracy. We empirically analyze DPO by investigating logit differences of model checkpoints in the TÜLU 2 family (Ivison et al., 2023) to better understand the effect of DPO on hallucinations. We introduce the concept of 'latent hallucinations' and demonstrate their prevalence in models tuned with DPO, suggesting that current alignment methods may not adequately capture human preferences for uncertainty over inaccuracy. Additionally, we show that DPO fails to fulfill its theoretical promise of reducing hallucinations expressed by instruction-tuned models. We also bring attention to limitations of popular benchmarks in detecting hallucinations. Our findings highlight the need for improved evaluation methods and understanding of alignment techniques to reduce hallucinations in LLMs. Our code is posted at https://github.com/nshan144/DPO Hallucination View Project
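A hedged sketch of the logit-difference analysis, comparing a DPO-tuned TÜLU 2 checkpoint with its SFT-only counterpart on a single candidate token; the prompt and token are invented examples, and the exact checkpoints and metric used in the project may differ:

```python
# Sketch: how much does DPO shift the logit of a plausible-but-wrong token?
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

sft = AutoModelForCausalLM.from_pretrained("allenai/tulu-2-7b")
dpo = AutoModelForCausalLM.from_pretrained("allenai/tulu-2-dpo-7b")
tok = AutoTokenizer.from_pretrained("allenai/tulu-2-7b")

prompt = "The capital of Australia is"
ids = tok(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    logits_sft = sft(ids).logits[0, -1]
    logits_dpo = dpo(ids).logits[0, -1]

# First sub-token of a common wrong answer, as an illustrative probe.
tok_id = tok(" Sydney", add_special_tokens=False).input_ids[0]
print("logit shift for ' Sydney':", (logits_dpo - logits_sft)[tok_id].item())
```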
Enhancing Text Summarization of Biomedical Journal Papers with Domain-Specific Knowledge Integration in State-of-the-Art NLP Models Nilay Bhatt, Tom Shin, Luning Yang The exponential growth of biomedical literature over the past decade has created a pressing need for efficient summarization tools, which are crucial for researchers trying to stay informed about recent developments in their field. As the volume and complexity of scientific papers increase, automated summarization has become indispensable for researchers aiming to distill key information rapidly. Although modern Natural Language Processing (NLP) models like BERT and GPT have shown promising results in text summarization, they often struggle to fully capture the nuances and domain-specific language inherent in biomedical texts, producing summaries that lack accuracy or comprehensiveness. To address these challenges, this project leverages state-of-the-art NLP models, including BART, T5, BioGPT, and LED, and supplements them with domain-specific biomedical knowledge. This approach is designed to enhance the summarization quality of biomedical journal papers: by integrating specialized knowledge with these advanced models, we aim to improve the accuracy, conciseness, and contextual relevance of summaries, enabling researchers to navigate the rapidly expanding scientific literature more effectively. Our experimental design involves in-domain and cross-domain summarization tasks to rigorously assess and refine our models. Ultimately, our goal is to establish new benchmarks for summarization in this specialized field. View Project
Retrieval augmented generation to improve text summarization of biomedical research articles Andrew Ton, Yuxuan Cheng View Project
Multi-Modal Data Augmentation for Radiology Report Generation Andrew Tran, Haroon Mohamedali, Howard Dai We approach the Stanford AIMI Radiology Report Generation challenge by fine-tuning RadFM, a model pretrained and instruction-tuned on a variety of radiology tasks. The task is to write accurate and useful radiology reports given sets of x-ray images. Considering the relatively small scale of radiology datasets compared to general image-text datasets, our method involves creating synthetic data to supplement the training data, by directly interpolating between images and by interpolating between report texts via GPT-3.5. We make our dataset generation, model training, and inference code publicly accessible on GitHub. View Project
Decoder-only Cognate Prediction Lasse van den Berg, Adnan Bseisu This project introduces a decoder-only transformer-based approach to cognate prediction, leveraging its capacity to handle sequential data and capture linguistic patterns. Cognates, words across different languages with shared origins, provide significant insights into language history and evolution. However, their prediction is challenging, primarily due to the subtle phonetic and semantic shifts over long periods of time. Our method employs a decoder-only architecture typically used in generative tasks. We adapt this architecture for the task of predicting a cognate in one language, given its cognate pair in a related language. We train our model on a dataset of Romance language and Germanic language cognates. The results demonstrate non-trivial performance with interesting generalization patterns. View Project
Enhancing Text Classification with GraphSAGE Shurui Wang, Lang Ding, Weiyi You This project explores the enhancement of text classification through the novel application of TextGraphSAGE, a graph-based neural network model that integrates textual and relational data. By constructing text graphs at a granular level with nodes representing individual words or phrases and edges reflecting adjacency, we aim to capture both local and global textual contexts more effectively. The project compares the performance of our TextGraphSAGE model with conventional deep learning models like CNNs and LSTMs, as well as another graph-based method, TextGCN, across two datasets: Reuters R8 and Twitter Asian Prejudice. Our results indicate that TextGraphSAGE outperforms the baseline models, demonstrating its potential to leverage relational information for superior text classification accuracy and efficiency. Our findings affirm the potential of graph-based methods in advancing text classification tasks. Code is available at https://github.com/JadenWSR/TextGraphSAGE. View Project
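A two-layer GraphSAGE classifier over a word-node graph can be sketched with torch_geometric; the node features, edges, and dimensions below are placeholders, not the TextGraphSAGE configuration:

```python
# Sketch: GraphSAGE message passing over a graph whose nodes are words/phrases
# and whose edges reflect adjacency, producing per-node logits.
import torch
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv

class TextGraphSAGESketch(torch.nn.Module):
    def __init__(self, in_dim=300, hidden=128, n_classes=8):
        super().__init__()
        self.conv1 = SAGEConv(in_dim, hidden)
        self.conv2 = SAGEConv(hidden, n_classes)

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        return self.conv2(x, edge_index)

x = torch.randn(100, 300)                     # word-node embeddings
edge_index = torch.randint(0, 100, (2, 400))  # adjacency edges
logits = TextGraphSAGESketch()(x, edge_index)
print(logits.shape)  # (100, 8): per-node logits to pool per document
```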
Evaluating the Efficacy of Two LLM Defenses Against Adversarial Prompting Feranmi Oluwadairo, Liam Varela, Bryan Wee Large Language Models (LLMs) are increasingly utilised in various user-facing settings. This opens them up to possible adversarial prompts that can be used to generate unaligned responses, jeopardising the safety of the system. This paper investigates the efficacy of two LLM defenses, SmoothLLM and Erase-and-Check, by utilising Greedy Coordinate Gradient (GCG) and PAIR attacks to adversarially prompt both a Llama2-7b model and gpt-3.5-turbo. Our results reiterate the ongoing threat that GCG attacks continue to pose to LLMs, and also explore the feasibility of using SmoothLLM as a plug-and-play defense for closed-source models. View Project
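The SmoothLLM defense evaluated here can be sketched as randomized smoothing over prompts: perturb the input several times, query the model on each copy, and majority-vote, so that a brittle adversarial suffix (e.g. from GCG) breaks under noise. In this sketch, `query_model` and `is_refusal` are hypothetical stand-ins for the target LLM and a refusal detector:

```python
# Sketch of the SmoothLLM idea: character-level perturbation + majority vote.
import random
import string

def perturb(prompt: str, rate: float = 0.1) -> str:
    chars = list(prompt)
    for i in range(len(chars)):
        if random.random() < rate:
            chars[i] = random.choice(string.printable)
    return "".join(chars)

def smooth_llm(prompt: str, query_model, is_refusal, n_copies: int = 10) -> bool:
    votes = [is_refusal(query_model(perturb(prompt))) for _ in range(n_copies)]
    return sum(votes) > n_copies / 2  # True -> treat prompt as adversarial
```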
Data Augmentation for Machine-Generated Text Detection June Yoo, Helen Zhou Machine-Generated Text (MGT) Detection is the problem of classifying the author of a body of text as being either a human or a machine. With the advent of larger LLMs, this task has become increasingly difficult due to the risk of zero-day attacks. In this project, we test if we can increase robustness to unseen authors by fine-tuning a PLM (RoBERTa) against various Authorship Obfuscation (AO) methods for MGT detection in English, and investigate if using a mixture of these models can create a model which is robust to unseen data. Finally, we apply this to SemEval-2024 Task 8 subtask A, which deals with monolingual binary black-box machine-generated text detection. View Project
Multimodal NLP for Patent Documents Katherine He, Bill Qian, Aaron Yu Millions of patent applications are submitted every year, each one containing both extensive text and visual content. Our research integrated techniques from Natural Language Processing (NLP) and Computer Vision (CV) to improve the experience of referencing and understanding patent documents with a multimodal approach that reflects the nature and complexity of the documents. In this paper, we investigated the current capabilities of large vision-language models on NLP tasks and finetuned an existing open-source model on U.S. patent databases, using a multimodal dataset with a combination of text and image data that we constructed ourselves. We found that using multimodal approaches for patent classification outperforms both text-only and image-only approaches, in addition to demonstrating the gains in performance that can be achieved by finetuning on multimodal patent data. Our work not only seeks to improve the efficiency of patent work for inventors, patent attorneys, examiners, and the public, but also lays groundwork for future advancements in multimodal analysis techniques and applications in traditional expert domains such as law and intellectual property. View Project
HanScripter: Unveiling the Wisdom of Classical Chinese with Llama Ke Lyu, Abbey Yuan, Kai Gao The HanScripter project presents a specialized translation model aimed at accurately translating classical Chinese texts into English. Leveraging Meta's LLaMA 3 architecture, the model was fine-tuned using Quantized Low-Rank Adaptation (QLoRA) instruction tuning and parallel instruction-output datasets. This methodology involved constructing a comprehensive dataset containing parallel corpora and developing instruction-tuning templates for nuanced translation. Evaluation metrics, including sacreBLEU, chrF, METEOR, and BERTScore, were employed to assess translation quality. The results indicate that the HanScripter model significantly improves translation accuracy and preserves the meaning of classical Chinese texts, offering a robust framework for bridging the linguistic gap between classical and modern languages. View Project
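A QLoRA-style setup of the kind described, 4-bit NF4 quantization plus LoRA adapters, can be sketched as follows; the hyperparameters are illustrative assumptions, not the project's values:

```python
# Sketch: QLoRA fine-tuning setup for a Llama 3 base model.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", quantization_config=bnb
)
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # adapters are the only trainable weights
```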
In Sync with Stories: A Book Recommendation System Aligning with Reader Preferences Runqiu Zhang Our study introduces 'In Sync with Stories,' a recommendation system designed to match books with readers' narrative tastes. The system uses Latent Dirichlet Allocation (LDA) to analyze introductory book texts, aligning them with user-provided inputs. By understanding readers' preferences, this genre-aware system recommends books that closely resonate with the requested themes. Utilizing a dataset of science fiction titles, the system identifies topics and matches introductions with user preferences through KL-divergence scoring. The results are promising, with the recommendations accurately reflecting relevant narrative genres. While the current implementation relies on LDA, future work will explore integrating deep learning models to enhance accuracy. GitHub link: View Project
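The LDA + KL-divergence matching step can be sketched with scikit-learn and SciPy; the corpus, query, and topic count below are invented placeholders:

```python
# Sketch: fit LDA on book introductions, then rank books by KL divergence
# between the user query's topic distribution and each book's.
import numpy as np
from scipy.special import rel_entr
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

intros = ["a generation ship drifts between dying stars ...",
          "the detective android questions its own memories ..."]
vec = CountVectorizer(stop_words="english")
counts = vec.fit_transform(intros)
lda = LatentDirichletAllocation(n_components=5, random_state=0).fit(counts)
book_topics = lda.transform(counts)  # (n_books, n_topics), rows sum to 1

query = lda.transform(vec.transform(["stories about AI and identity"]))[0]
kl = [rel_entr(query, bt).sum() for bt in book_topics]  # lower = closer match
print("best match:", int(np.argmin(kl)))
```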
Adapting Transformer Model for Live EN to CN Translation Jason Zheng, Kenny Li In this paper, we construct the framework for a machine translation model designed to facilitate real-time, bidirectional communication between English and Chinese speakers, specifically addressing the unique linguistic challenges faced by multigenerational immigrant families. Our approach leverages an encoder-decoder architecture enhanced with multi-head self-attention mechanisms to ensure that both linguistic accuracy and cultural nuances are preserved in translations. By potentially integrating this model with a user-friendly interface in the future, we aim to provide a practical tool for immediate language translation, thereby reducing the emotional and practical challenges associated with language barriers within these communities. View Project
Unpacking Large Language Model's Performance on Quantitative Understanding: NumEval @ SemEval-2024 Jielan Helen Zheng, Xiaomeng Miranda Zhu View Project
Supreme Court Verdict Prediction with LLMs Zachary Zitzewitz, Raja Moreno, Tom Sutter In this paper, we investigate several approaches to applying language models to predicting United States Supreme Court case outcomes. We use GPT-2 combined with a classification head to perform text classification on the facts and legal question of the case. We also use LLaMA-2 to perform open-ended generation based on the facts of the case that we then classify to make a final prediction. Despite testing several language models and architectures, we were unable to attain accuracy scores better than human evaluation, task-specific architectures, or even simple heuristics. However, fine-tuning LLaMA-2 on our dataset led to improved accuracy scores. We conclude that language models may not be natively well-suited to predicting Supreme Court outcomes, but that fine-tuning on tailored datasets can improve their capabilities in this task. View Project
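The GPT-2 + classification head setup can be sketched with transformers' GPT2ForSequenceClassification; the facts string is an invented placeholder, and the two labels stand for the project's binary outcome:

```python
# Sketch: GPT-2 with a sequence-classification head for case-outcome prediction.
from transformers import GPT2Tokenizer, GPT2ForSequenceClassification

tok = GPT2Tokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token  # GPT-2 has no pad token by default

model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=2)
model.config.pad_token_id = tok.pad_token_id

facts = "Petitioner challenges the state statute under the First Amendment ..."
batch = tok(facts, truncation=True, padding=True, return_tensors="pt")
logits = model(**batch).logits
print(logits.softmax(-1))  # outcome probabilities after fine-tuning
```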