About Me
I am an assistant professor at Drexel University focusing on Natural Language Processing and Artificial Intelligence. I'm interested in planning and reasoning using Large Language Models. I earned my PhD from the University of Pennsylvania, where I had the honor of being mentored by Prof. Chris Callison-Burch, with a thesis committee chaired by Prof. Dan Roth. I earned my BS from the University of Michigan in 2018, mentored by Prof. Rada Mihalcea and Prof. Dragomir Radev.
Harry.Zhang@drexel.edu

Mentorship and Teaching
CIS 530: Computational Linguistics (Winter, Fall 2020)
EECS 595: Natural Language Processing (Fall 2018)
EECS 280: Programming and Introductory Data Structures (Winter, Fall 2016)

Service
I have reviewed more than 50 papers for, and served as a chair at, many NLP conferences and workshops.
Potential PhD students should email me with [PhD 2025] in the subject line and apply online. It is necessary to demonstrate past experience in NLP/AI/ML research. I am also hiring paid research assistants/interns and looking for unpaid visiting students/collaborators, with potential conversion to a PhD position, starting in Jan 2026. Those interested should fill out this form; please do not email separately on this matter.
CV
University of Michigan
B.S.E.; Aug 2015 to Dec 2018
Shenzhen Middle School
High School Diploma; Sept 2012 to Jun 2015
PhD Students
Cassie Huang
Research Assistants & Interns
Krystal Gong
Past Mentored Students
Tianyi Zhang
Hainiu Xu, King's College London
Zhaoyi Hou, University of Pittsburgh
Young-Min Cho, University of Pennsylvania
Manni Arora, Apple
Teaching Assistant
2020
2016 - 2018
Area Chair of COLING 2025, ARR Aug 2024, ARR Jun 2024 / EMNLP 2024, ARR Feb 2024 / ACL 2024
Session Chair of ACL 2024, AACL-IJCNLP 2020
Reviewer of LREC-COLING 2024, EMNLP 2023, ACL 2023, ARR Mar 2022, DaSH Workshop @ EMNLP 2022, COLING 2022, LREC 2022, ARR Nov 2021, COLING 2020, Computer Speech and Language 2018
Program Chair of MASC-SLL 2023, MASC-SLL 2021
Co-organizer of CLUNCH 2020
Events and procedures play a major role in human language, so reasoning about them is crucial for AI and NLP. My work combines data-driven methods such as large language models (LLMs) with symbolic, structured representations of events to advance the state of the art on many downstream tasks, such as question answering, dialog, story generation, and planning. [PhD thesis] Roughly, I have explored three types of methods:
1. (ongoing) Use LLMs to predict a fully structured, symbolic representation of an environment and problem (e.g., in PDDL or Python) that is then processed by a solver (e.g., a planner or an interpreter), leading to an executable, verifiable, and interpretable reasoning process.
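To make this pipeline concrete, here is a minimal sketch (not the PDDLEGO or PROC2PDDL code) of the LLM-as-formalizer idea: an LLM drafts a PDDL domain and problem, and an off-the-shelf classical planner searches for the plan. The generate_pddl stub and the Fast Downward invocation are illustrative assumptions.

import subprocess
from pathlib import Path

def generate_pddl(description: str) -> tuple[str, str]:
    """Hypothetical LLM call that turns a textual environment and goal into a PDDL (domain, problem) pair."""
    raise NotImplementedError("plug in an LLM API of your choice")

def plan(description: str, workdir: str = "planning") -> list[str]:
    domain, problem = generate_pddl(description)
    Path(workdir).mkdir(exist_ok=True)
    (Path(workdir) / "domain.pddl").write_text(domain)
    (Path(workdir) / "problem.pddl").write_text(problem)
    # Delegate the search to a symbolic planner; its output is an executable,
    # step-by-step plan that can be verified against the formal goal.
    subprocess.run(
        ["fast-downward.py", f"{workdir}/domain.pddl", f"{workdir}/problem.pddl",
         "--search", "astar(lmcut())"],
        check=True,
    )
    return Path("sas_plan").read_text().splitlines()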
[29] PDDLEGO: Iterative Planning in Textual Environments; Li Zhang, Peter Jansen, Tianyi Zhang^Mentored student, Peter Clark, Chris Callison-Burch and Niket Tandon; in *SEM 2024.Paper BibTeX Repo
@inproceedings{zhang-etal-2024-pddlego, title = "{PDDLEGO}: Iterative Planning in Textual Environments", author = "Zhang, Li and Jansen, Peter and Zhang, Tianyi and Clark, Peter and Callison-Burch, Chris and Tandon, Niket", editor = "Bollegala, Danushka and Shwartz, Vered", booktitle = "Proceedings of the 13th Joint Conference on Lexical and Computational Semantics (*SEM 2024)", month = jun, year = "2024", address = "Mexico City, Mexico", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2024.starsem-1.17", pages = "212--221", abstract = "Planning in textual environments have been shown to be a long-standing challenge even for current models. A recent, promising line of work uses LLMs to generate a formal representation of the environment that can be solved by a symbolic planner. However, existing methods rely on a fully-observed environment where all entity states are initially known, so a one-off representation can be constructed, leading to a complete plan. In contrast, we tackle partially-observed environments where there is initially no sufficient information to plan for the end-goal. We propose PDDLEGO that iteratively construct a planning representation that can lead to a partial plan for a given sub-goal. By accomplishing the sub-goal, more information is acquired to augment the representation, eventually achieving the end-goal. We show that plans produced by few-shot PDDLEGO are 43{\%} more efficient than generating plans end-to-end on the Coin Collector simulation, with strong performance (98{\%}) on the more complex Cooking World simulation where end-to-end LLMs fail to generate coherent plans (4{\%}).", }
[28] PROC2PDDL: Open-Domain Planning Representations from Texts; Tianyi Zhang*Equal contribution^Mentored student, Li Zhang*Equal contribution, Zhaoyi Hou^Mentored student, Ziyu Wang^Mentored student, Yuling Gu, Peter Clark, Chris Callison-Burch and Niket Tandon; in ACL 2024 2nd Workshop on Natural Language Reasoning and Structured Explanations.Paper BibTeX Repo
@inproceedings{zhang-etal-2024-proc2pddl, title = "PROC2PDDL: Open-Domain Planning Representations from Texts", author = "Zhang, Tianyi and Zhang, Li and Hou, Zhaoyi and Wang, Ziyu and Gu, Yuling and Clark, Peter and Callison-Burch, Chris and Tandon, Niket", booktitle = "Proceedings of the 2nd Workshop on Natural Language Reasoning and Structured Explanations (NLRSE)", month = aug, year = "2024", address = "Bangkok, Thailand", publisher = "Association for Computational Linguistics", }
[20] Faithful Chain-of-Thought Reasoning; Qing Lyu*Equal contribution, Shreya Havaldar*Equal contribution, Adam Stein*Equal contribution, Li Zhang, Delip Rao, Eric Wong, Marianna Apidianaki and Chris Callison-Burch; in IJCNLP-AACL 2023.Paper BibTeX Repo
@inproceedings{lyu-etal-2023-faithful, title = "Faithful Chain-of-Thought Reasoning", author = "Lyu, Qing and Havaldar, Shreya and Stein, Adam and Zhang, Li and Rao, Delip and Wong, Eric and Apidianaki, Marianna and Callison-Burch, Chris", editor = "Park, Jong C. and Arase, Yuki and Hu, Baotian and Lu, Wei and Wijaya, Derry and Purwarianti, Ayu and Krisnadhi, Adila Alfa", booktitle = "Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)", month = nov, year = "2023", address = "Nusa Dua, Bali", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2023.ijcnlp-main.20", pages = "305--329", }
2. Use LLMs to predict a semi-structured, symbolic representation of events (specifically, entities), which aids the models' decision making and reasoning via in-context learning.
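As a toy illustration (an assumption, not the OpenPI2.0 code), entity state changes can be serialized into the prompt as a chain-of-thought scaffold before the question; here, complete stands in for any LLM completion API.

def complete(prompt: str) -> str:
    """Stand-in for an LLM completion call."""
    raise NotImplementedError

def format_entity_states(states: dict[str, dict[str, tuple[str, str]]]) -> str:
    lines = []
    for entity, attrs in states.items():
        for attr, (before, after) in attrs.items():
            lines.append(f"- {entity}.{attr}: {before} -> {after}")
    return "\n".join(lines)

def answer(procedure: str, question: str, states: dict) -> str:
    # Salient entity states act as intermediate reasoning steps for the model.
    prompt = (
        f"Procedure:\n{procedure}\n\n"
        f"Salient entity state changes:\n{format_entity_states(states)}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return complete(prompt)

# e.g., for a recipe: {"pan": {"temperature": ("cold", "hot")}, "egg": {"state": ("raw", "cooked")}}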
[23] OpenPI2.0: An Improved Dataset for Entity Tracking in Texts; Li Zhang, Hainiu Xu^Mentored student, Abhinav Kommula, Chris Callison-Burch and Niket Tandon; in EACL 2024.Paper BibTeX Repo
@inproceedings{zhang-etal-2024-openpi2, title = "{O}pen{PI}2.0: An Improved Dataset for Entity Tracking in Texts", author = "Zhang, Li and Xu, Hainiu and Kommula, Abhinav and Callison-Burch, Chris and Tandon, Niket", editor = "Graham, Yvette and Purver, Matthew", booktitle = "Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)", month = mar, year = "2024", address = "St. Julian{'}s, Malta", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2024.eacl-long.10", pages = "166--178", abstract = "Much texts describe a changing world (e.g., procedures, stories, newswires), and understanding them requires tracking how entities change. An earlier dataset, OpenPI, provided crowdsourced annotations of entity state changes in text. However, a major limitation was that those annotations were free-form and did not identify salient changes, hampering model evaluation. To overcome these limitations, we present an improved dataset, OpenPI2.0, where entities and attributes are fully canonicalized and additional entity salience annotations are added. On our fairer evaluation setting, we find that current state-of-the-art language models are far from competent. We also show that using state changes of salient entities as a chain-of-thought prompt, downstream performance is improved on tasks such as question answering and classical planning, outperforming the setting involving all related entities indiscriminately. We offer OpenPI2.0 for the continued development of models that can understand the dynamics of entities in text.", }
[19] Causal Reasoning of Entities and Events in Procedural Texts; Li Zhang*Equal contribution, Hainiu Xu*Equal contribution^Mentored student, Yue Yang, Shuyan Zhou, Weiqiu You, Manni Arora and Chris Callison-Burch; in Findings of EACL 2023.Paper BibTeX Repo
@inproceedings{zhang-etal-2023-causal, title = "Causal Reasoning of Entities and Events in Procedural Texts", author = "Zhang, Li and Xu, Hainiu and Yang, Yue and Zhou, Shuyan and You, Weiqiu and Arora, Manni and Callison-burch, Chris", booktitle = "Findings of the Association for Computational Linguistics: EACL 2023", month = may, year = "2023", address = "Dubrovnik, Croatia", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2023.findings-eacl.31", pages = "415--431", abstract = "Entities and events are crucial to natural language reasoning and common in procedural texts. Existing work has focused either exclusively on entity state tracking (e.g., whether a pan is hot) or on event reasoning (e.g., whether one would burn themselves by touching the pan), while these two tasks are often causally related. We propose CREPE, the first benchmark on causal reasoning of event plausibility and entity states. We show that most language models, including GPT-3, perform close to chance at .35 F1, lagging far behind human at .87 F1. We boost model performance to .59 F1 by creatively representing events as programming languages while prompting language models pretrained on code. By injecting the causal relations between entities and events as intermediate reasoning steps in our representation, we further boost the performance to .67 F1. Our findings indicate not only the challenge that CREPE brings for language models, but also the efficacy of code-like prompting combined with chain-of-thought prompting for multihop event reasoning.", }
3. Finetune LLMs on language-based data (specifically, event relations) to improve performance on various downstream tasks.
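A generic finetuning sketch in the spirit of the wikiHow work listed below (not any paper's exact setup): goal-step pairs are framed as binary relevance classification on top of a pretrained encoder. The model name, toy data, and hyperparameters are placeholders.

from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

pairs = Dataset.from_dict({
    "goal": ["Make coffee", "Make coffee"],
    "step": ["Grind the beans", "Inflate the tires"],
    "label": [1, 0],   # 1 = the step helps the goal, 0 = unrelated
})

tok = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

def encode(batch):
    # Encode each (goal, step) pair as a single sequence-pair input.
    return tok(batch["goal"], batch["step"], truncation=True,
               padding="max_length", max_length=64)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="goal_step_model", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=pairs.map(encode, batched=True),
)
trainer.train()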
[6] Reasoning about Goals, Steps, and Temporal Ordering with WikiHow; Li Zhang*Equal contribution, Qing Lyu*Equal contribution and Chris Callison-Burch; in EMNLP 2020.Paper BibTeX Repo
@inproceedings{zhang-etal-2020-reasoning, title = "Reasoning about Goals, Steps, and Temporal Ordering with {W}iki{H}ow", author = "Zhang, Li and Lyu, Qing and Callison-Burch, Chris", booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)", month = nov, year = "2020", address = "Online", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/2020.emnlp-main.374", pages = "4630--4639", }
[15] Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models; ... Li Zhang*Equal contribution, Qing Lyu*Equal contribution and Chris Callison-Burch; in TMLR.Paper
[10] Show Me More Details: Discovering Hierarchies of Procedures from Semi-structured Web Data; Shuyan Zhou*Equal contribution, Li Zhang*Equal contribution, Yue Yang, Qing Lyu, Pengcheng Yin, Chris Callison-Burch and Graham Neubig; in ACL 2022.Paper BibTeX Demo Repo
@inproceedings{zhou-etal-2022-show, title = "Show Me More Details: Discovering Hierarchies of Procedures from Semi-structured Web Data", author = "Zhou, Shuyan and Zhang, Li and Yang, Yue and Lyu, Qing and Yin, Pengcheng and Callison-Burch, Chris and Neubig, Graham", booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)", month = may, year = "2022", address = "Dublin, Ireland", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2022.acl-long.214", pages = "2998--3012", abstract = "Procedures are inherently hierarchical. To {``}make videos{''}, one may need to {``}purchase a camera{''}, which in turn may require one to {``}set a budget{''}. While such hierarchical knowledge is critical for reasoning about complex procedures, most existing work has treated procedures as shallow structures without modeling the parent-child relation. In this work, we attempt to construct an open-domain hierarchical knowledge-base (KB) of procedures based on wikiHow, a website containing more than 110k instructional articles, each documenting the steps to carry out a complex procedure. To this end, we develop a simple and efficient method that links steps (e.g., {``}purchase a camera{''}) in an article to other articles with similar goals (e.g., {``}how to choose a camera{''}), recursively constructing the KB. Our method significantly outperforms several strong baselines according to automatic evaluation, human judgment, and application to downstream tasks such as instructional video retrieval.", }
[8] Goal-Oriented Script Construction; Qing Lyu*Equal contribution, Li Zhang*Equal contribution and Chris Callison-Burch; in INLG 2021.Paper BibTeX Repo
@inproceedings{lyu-etal-2021-goal, title = "Goal-Oriented Script Construction", author = "Lyu, Qing and Zhang, Li and Callison-Burch, Chris", booktitle = "Proceedings of the 14th International Conference on Natural Language Generation", month = aug, year = "2021", address = "Aberdeen, Scotland, UK", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2021.inlg-1.19", pages = "184--200", abstract = "The knowledge of scripts, common chains of events in stereotypical scenarios, is a valuable asset for task-oriented natural language understanding systems. We propose the Goal-Oriented Script Construction task, where a model produces a sequence of steps to accomplish a given goal. We pilot our task on the first multilingual script learning dataset supporting 18 languages collected from wikiHow, a website containing half a million how-to articles. For baselines, we consider both a generation-based approach using a language model and a retrieval-based approach by first retrieving the relevant steps from a large candidate pool and then ordering them. We show that our task is practical, feasible but challenging for state-of-the-art Transformer models, and that our methods can be readily deployed for various other datasets and domains with decent zero-shot performance.", }
[7] Intent Detection with WikiHow; Li Zhang, Qing Lyu, Chris Callison-Burch; in AACL-IJCNLP 2020.Paper BibTeX Repo
@inproceedings{zhang-etal-2020-intent, title = "Intent Detection with {W}iki{H}ow", author = "Zhang, Li and Lyu, Qing and Callison-Burch, Chris", booktitle = "Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing", month = dec, year = "2020", address = "Suzhou, China", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/2020.aacl-main.35", pages = "328--333", abstract = "Modern task-oriented dialog systems need to reliably understand users{'} intents. Intent detection is even more challenging when moving to new domains or new languages, since there is little annotated data. To address this challenge, we present a suite of pretrained intent detection models which can predict a broad range of intended goals from many actions because they are trained on wikiHow, a comprehensive instructional website. Our models achieve state-of-the-art results on the Snips dataset, the Schema-Guided Dialogue dataset, and all 3 languages of the Facebook multilingual dialog datasets. Our models also demonstrate strong zero- and few-shot performance, reaching over 75{\%} accuracy using only 100 training examples in all datasets.", }
[9] Visual Goal-Step Inference using wikiHow; Yue Yang, Artemis Panagopoulou, Qing Lyu, Li Zhang, Mark Yatskar and Chris Callison-Burch; in EMNLP 2021.Paper BibTeX
@inproceedings{yang-etal-2021-visual, title = "Visual Goal-Step Inference using wiki{H}ow", author = "Yang, Yue and Panagopoulou, Artemis and Lyu, Qing and Zhang, Li and Yatskar, Mark and Callison-Burch, Chris", booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing", month = nov, year = "2021", address = "Online and Punta Cana, Dominican Republic", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2021.emnlp-main.165", pages = "2167--2179", abstract = "Understanding what sequence of steps are needed to complete a goal can help artificial intelligence systems reason about human activities. Past work in NLP has examined the task of goal-step inference for text. We introduce the visual analogue. We propose the Visual Goal-Step Inference (VGSI) task, where a model is given a textual goal and must choose which of four images represents a plausible step towards that goal. With a new dataset harvested from wikiHow consisting of 772,277 images representing human actions, we show that our task is challenging for state-of-the-art multimodal models. Moreover, the multimodal representation learned from our data can be effectively transferred to other datasets like HowTo100m, increasing the VGSI accuracy by 15 - 20{\%}. Our task will facilitate multimodal reasoning about procedural events.", }
[26] Tailoring with Targeted Precision: Edit-Based Agents for Open-Domain Procedure Customization; Yash Kumar Lal, Li Zhang, Faeze Brahman, Bodhisattwa Prasad Majumder, Peter Clark, Niket Tandon; in Findings of ACL 2024.Paper BibTeX
@inproceedings{lal-etal-2024-tailoring, title = "Tailoring with Targeted Precision: Edit-Based Agents for Open-Domain Procedure Customization", author = "Lal, Yash Kumar and Zhang, Li and Brahman, Faeze and Majumder, Bodhisattwa Prasad and Clark, Peter and Tandon, Niket", editor = "Ku, Lun-Wei and Martins, Andre and Srikumar, Vivek", booktitle = "Findings of the Association for Computational Linguistics: ACL 2024", month = aug, year = "2024", address = "Bangkok, Thailand", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2024.findings-acl.921", doi = "10.18653/v1/2024.findings-acl.921", pages = "15597--15611", abstract = "How-to procedures, such as how to plant a garden, are now used by millions of users, but sometimes need customizing to meet a user{'}s specific needs, e.g., planting a garden without pesticides. Our goal is to measure and improve an LLM{'}s ability to perform such customization. Our approach is to test several simple multi-LLM-agent architectures for customization, as well as an end-to-end LLM, using a new evaluation set, called CustomPlans, of over 200 WikiHow procedures each with a customization need. We find that a simple architecture with two LLM agents used sequentially performs best, one that edits a generic how-to procedure and one that verifies its executability, significantly outperforming (10.5{\%} absolute) an end-to-end prompted LLM. This suggests that LLMs can be configured reasonably effectively for procedure customization. This also suggests that multi-agent editing architectures may be worth exploring further for other customization applications (e.g. coding, creative writing) in the future.", }
[25] CLIN: A Continually Learning Language Agent for Rapid Task Adaptation and Generalization; Bodhisattwa Prasad Majumder, Bhavana Dalvi Mishra, Peter Jansen, Oyvind Tafjord, Niket Tandon, Li Zhang, Chris Callison-Burch, Peter Clark; in COLM 2024.Paper BibTeX Repo
@misc{majumder2023clin, title={CLIN: A Continually Learning Language Agent for Rapid Task Adaptation and Generalization}, author={Bodhisattwa Prasad Majumder and Bhavana Dalvi Mishra and Peter Jansen and Oyvind Tafjord and Niket Tandon and Li Zhang and Chris Callison-Burch and Peter Clark}, year={2023}, eprint={2310.10134}, archivePrefix={arXiv}, primaryClass={cs.CL} }
[24] Choice-75: A Dataset on Decision Branching in Script Learning; Zhaoyi Joey Hou^Mentored student, Li Zhang, Chris Callison-Burch; in LREC-COLING 2024.Paper BibTeX Repo
@inproceedings{hou-etal-2024-choice-75, title = "Choice-75: A Dataset on Decision Branching in Script Learning", author = "Hou, Zhaoyi and Zhang, Li and Callison-Burch, Chris", editor = "Calzolari, Nicoletta and Kan, Min-Yen and Hoste, Veronique and Lenci, Alessandro and Sakti, Sakriani and Xue, Nianwen", booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)", month = may, year = "2024", address = "Torino, Italia", publisher = "ELRA and ICCL", url = "https://aclanthology.org/2024.lrec-main.285", pages = "3215--3223", abstract = "Script learning studies how daily events unfold. It enables machines to reason about narratives with implicit information. Previous works mainly consider a script as a linear sequence of events while ignoring the potential branches that arise due to people{'}s circumstantial choices. We hence propose Choice-75, the first benchmark that challenges intelligent systems to make decisions given descriptive scenarios, containing 75 scripts and more than 600 scenarios. We also present preliminary results with current large language models (LLM). Although they demonstrate overall decent performances, there is still notable headroom in hard scenarios.", }
[21] Human-in-the-Loop Schema Induction; Tianyi Zhang^Mentored student, Isaac Tham, Zhaoyi Hou^Mentored student, Jiaxuan Ren, Liyang Zhou^Mentored student, Hainiu Xu^Mentored student, Li Zhang, Lara J. Martin, Rotem Dror, Sha Li, Heng Ji, Martha Palmer, Susan Brown, Reece Suchocki, Chris Callison-Burch; in ACL 2023 Demos.Paper BibTeX Demo
@inproceedings{zhang-etal-2023-human, title = "Human-in-the-loop Schema Induction", author = "Zhang, Tianyi and Tham, Isaac and Hou, Zhaoyi and Ren, Jiaxuan and Zhou, Leon and Xu, Hainiu and Zhang, Li and Martin, Lara and Dror, Rotem and Li, Sha and Ji, Heng and Palmer, Martha and Brown, Susan Windisch and Suchocki, Reece and Callison-Burch, Chris", booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)", month = jul, year = "2023", address = "Toronto, Canada", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2023.acl-demo.1", pages = "1--10", abstract = "Schema induction builds a graph representation explaining how events unfold in a scenario. Existing approaches have been based on information retrieval (IR) and information extraction (IE), often with limited human curation. We demonstrate a human-in-the-loop schema induction system powered by GPT-3. We first describe the different modules of our system, including prompting to generate schematic elements, manual edit of those elements, and conversion of those into a schema graph. By qualitatively comparing our system to previous ones, we show that our system not only transfers to new domains more easily than previous approaches, but also reduces efforts of human curation thanks to our interactive interface.", }
[17] Unsupervised Entity Linking with Guided Summarization and Multiple Choice Selection; Young Min Cho^Mentored student, Li Zhang and Chris Callison-Burch; in EMNLP 2022.Paper BibTeX Repo
@inproceedings{cho-etal-2022-unsupervised, title = "Unsupervised Entity Linking with Guided Summarization and Multiple-Choice Selection", author = "Cho, Young Min and Zhang, Li and Callison-Burch, Chris", booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing", month = dec, year = "2022", address = "Abu Dhabi, United Arab Emirates", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2022.emnlp-main.638", pages = "9394--9401", abstract = "Entity linking, the task of linking potentially ambiguous mentions in texts to corresponding knowledge-base entities, is an important component for language understanding. We address two challenge in entity linking: how to leverage wider contexts surrounding a mention, and how to deal with limited training data. We propose a fully unsupervised model called SumMC that first generates a guided summary of the contexts conditioning on the mention, and then casts the task to a multiple-choice problem where the model chooses an entity from a list of candidates. In addition to evaluating our model on existing datasets that focus on named entities, we create a new dataset that links noun phrases from WikiHow to Wikidata. We show that our SumMC model achieves state-of-the-art unsupervised performance on our new dataset and on exiting datasets.", }
[13] QuakerBot: A Household Dialog System Powered by Large Language Models; Artemis Panagopoulou, Manni Arora^Mentored student, Li Zhang, Dimitri Cugini, Weiqiu You, Yue Yang, Liyang Zhou^Mentored student, Yuxuan Wang, Zhaoyi Hou^Mentored student, Alyssa Hwang, Lara Martin, Sherry Shi, Chris Callison-Burch and Mark Yatskar; in Alexa Prize Proceedings 2022.Paper BibTeX
@Inproceedings{Pennsylvania2022, author = {Panagopoulou, Artemis and Arora, Manni and Zhang, Li and Cugini, Dimitri and You, Weiqiu and Yang, Yue and Zhou, Liyang and Wang, Yuxuan and Hou, Zhaoyi and Hwang, Alyssa and Martin, Lara and Shi, Sherry and Callison-Burch, Chris and Yatskar, Mark}, title = {QuakerBot: A household dialog system powered by large language models}, year = {2022}, url = {https://www.amazon.science/alexa-prize/proceedings/quakerbot-a-household-dialog-system-powered-by-large-language-models}, booktitle = {Alexa Prize TaskBot Challenge Proceedings}, }
Modern large language models are pre-trained not only on text but also on code. We explore ways to interface LLMs with code, either by superficially representing a problem as pseudo-code, or by having the models generate code that is executed to reach the final answer. We evaluate these methods by their performance as well as their faithfulness to the reasoning process.
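A minimal sketch of the second strategy, generating code and then executing it (in the spirit of the faithful chain-of-thought line of work, though not its implementation): nl_to_program stands in for an LLM that translates a question into a small Python program, and running that program is the reasoning process itself.

def nl_to_program(question: str) -> str:
    """Hypothetical LLM call; should return a Python snippet that
    assigns the final answer to a variable named `answer`."""
    raise NotImplementedError

def solve(question: str):
    program = nl_to_program(question)
    scope: dict = {}
    exec(program, scope)     # the executed program *is* the reasoning chain
    return scope["answer"]

# e.g., for "Anna has 3 apples and buys 2 more; how many does she have?"
# a faithful program would simply be: "answer = 3 + 2"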
[27] Calibrating Large Language Models with Sample Consistency; Qing Lyu*Equal contribution, Kumar Shridhar*Equal contribution, Chaitanya Malaviya, Li Zhang, Yanai Elazar, Niket Tandon, Marianna Apidianaki, Mrinmaya Sachan and Chris Callison-Burch; in AAAI 2025.Paper BibTeX
@misc{lyu2024calibrating, title={Calibrating Large Language Models with Sample Consistency}, author={Qing Lyu and Kumar Shridhar and Chaitanya Malaviya and Li Zhang and Yanai Elazar and Niket Tandon and Marianna Apidianaki and Mrinmaya Sachan and Chris Callison-Burch}, year={2024}, eprint={2402.13904}, archivePrefix={arXiv}, primaryClass={cs.CL} }
[22] Exploring the Curious Case of Code Prompts; Li Zhang*Equal contribution, Liam Dugan*Equal contribution, Hainiu Xu^Mentored student*Equal contribution and Chris Callison-Burch; in ACL 2023 1st Workshop on Natural Language Reasoning and Structured Explanations.Paper BibTeX Repo
@inproceedings{zhang-etal-2023-exploring, title = "Exploring the Curious Case of Code Prompts", author = "Zhang, Li and Dugan, Liam and Xu, Hainiu and Callison-burch, Chris", booktitle = "Proceedings of the 1st Workshop on Natural Language Reasoning and Structured Explanations (NLRSE)", month = jun, year = "2023", address = "Toronto, Canada", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2023.nlrse-1.2", pages = "9--17", abstract = "Recent work has shown that prompting language models with code-like representations of natural language leads to performance improvements on structured reasoning tasks. However, such tasks comprise only a small subset of all natural language tasks. In our work, we seek to answer whether or not code-prompting is the preferred way of interacting with language models in general. We compare code and text prompts across three popular GPT models (davinci, code-davinci-002, and text-davinci-002) on a broader selection of tasks (e.g., QA, sentiment, summarization) and find that with few exceptions, code prompts do not consistently outperform text prompts. Furthermore, we show that the style of code prompt has a large effect on performance for some (but not all) tasks and that fine-tuning on text instructions leads to better relative performance of code prompts.", }
Semantic role labeling answers the question of "who did what to whom, when, and how," extracting important information about a predicate. While previous work has treated semantic role labels as purely symbolic, we explicitly leverage their definitions from annotation guidelines and advance the state of the art (given gold predicate senses), with especially large gains in low-resource settings.
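A toy illustration (an assumption, not the paper's model) of injecting label definitions into the argument-classification input, so a classifier can match an argument against what each role means rather than a bare symbol; the input template is made up, and the example glosses follow the "work" predicate mentioned in the paper.

ROLE_DEFS = {  # glosses retrieved from annotation guidelines for the predicate "work"
    "ARG0": "worker",
    "ARG1": "job",
    "ARG2": "employer",
}

def build_input(sentence: str, predicate: str, argument: str) -> str:
    # Append each role's definition so the model sees label semantics, not just symbols.
    defs = " ".join(f"{role}: {gloss}." for role, gloss in ROLE_DEFS.items())
    return (f"{sentence} </s> predicate: {predicate} </s> "
            f"argument: {argument} </s> {defs}")

print(build_input("She works for a tech firm.", "works", "a tech firm"))
# A classifier then scores each candidate role, informed by the glosses.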
[11] Label Definitions Improve Semantic Role Labeling; Li Zhang, Ishan Jindal, Yunyao Li; in NAACL 2022.Paper BibTeX Repo
@inproceedings{zhang-etal-2022-label-definitions, title = "Label Definitions Improve Semantic Role Labeling", author = "Zhang, Li and Jindal, Ishan and Li, Yunyao", booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies", month = jul, year = "2022", address = "Seattle, United States", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2022.naacl-main.411", pages = "5613--5620", abstract = "Argument classification is at the core of Semantic Role Labeling. Given a sentence and the predicate, a semantic role label is assigned to each argument of the predicate. While semantic roles come with meaningful definitions, existing work has treated them as symbolic. Learning symbolic labels usually requires ample training data, which is frequently unavailable due to the cost of annotation. We instead propose to retrieve and leverage the definitions of these labels from the annotation guidelines. For example, the verb predicate {``}work{''} has arguments defined as {``}worker{''}, {``}job{''}, {``}employer{''}, etc. Our model achieves state-of-the-art performance on the CoNLL09 dataset injected with label definitions given the predicate senses. The performance improvement is even more pronounced in low-resource settings when training data is scarce.", }
Do large language models know that a "favorite new movie" is not necessarily a "new favorite movie"?
[12] Is "my favorite new movie" my favorite movie? Probing the Understanding of Recursive Noun Phrases; Qing Lyu, Hua Zheng, Daoxin Li, Li Zhang, Marianna Apidianaki and Chris Callison-Burch; in NAACL 2022.Paper BibTeX Repo
@inproceedings{lyu-etal-2022-favorite, title = "Is {``}My Favorite New Movie{''} My Favorite Movie? Probing the Understanding of Recursive Noun Phrases", author = "Lyu, Qing and Hua, Zheng and Li, Daoxin and Zhang, Li and Apidianaki, Marianna and Callison-Burch, Chris", booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies", month = jul, year = "2022", address = "Seattle, United States", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2022.naacl-main.388", pages = "5286--5302", abstract = "Recursive noun phrases (NPs) have interesting semantic properties. For example, {``}my favorite new movie{''} is not necessarily my favorite movie, whereas {``}my new favorite movie{''} is. This is common sense to humans, yet it is unknown whether language models have such knowledge. We introduce the Recursive Noun Phrase Challenge (RNPC), a dataset of three textual inference tasks involving textual entailment and event plausibility comparison, precisely targeting the understanding of recursive NPs. When evaluated on RNPC, state-of-the-art Transformer models only perform around chance. Still, we show that such knowledge is learnable with appropriate data. We further probe the models for relevant linguistic features that can be learned from our tasks, including modifier semantic category and modifier scope. Finally, models trained on RNPC achieve strong zero-shot performance on an extrinsic Harm Detection evaluation task, showing the usefulness of the understanding of recursive NPs in downstream applications.", }
Split and Rephrase is a text simplification task that rewrites a complex sentence into several simpler ones. We show that the existing benchmark is too simplistic by developing a rule-based model that uses no training data yet performs on par with the state-of-the-art neural model. We then propose two new crowdsourced benchmarks of higher quality. We also study the flaws of the BLEU score and the cost-efficiency of using crowd workers to evaluate models.
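For flavor, a heavily simplified, regex-only stand-in for the kind of rule-based splitter discussed above (the actual system relies on syntactic structure, so treat this purely as an illustrative assumption):

import re

def naive_split(sentence: str) -> list[str]:
    # Split a compound sentence on ", and" / ", but" and restate each clause.
    clauses = [c.strip() for c in re.split(r",\s*(?:and|but)\s+", sentence.rstrip("."))]
    return [c[0].upper() + c[1:] + "." for c in clauses if c]

print(naive_split("The film was shot in Prague, and it premiered in 2019."))
# ['The film was shot in Prague.', 'It premiered in 2019.']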
[5] Small but Mighty: New Benchmarks for Split and Rephrase; Li Zhang, Huaiyu Zhu, Siddhartha Brahma and Yunyao Li; in EMNLP 2020; a part of the GEM Benchmark.Paper BibTeX Repo
@inproceedings{zhang-etal-2020-small, title = "Small but Mighty: New Benchmarks for Split and Rephrase", author = "Zhang, Li and Zhu, Huaiyu and Brahma, Siddhartha and Li, Yunyao", booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)", month = nov, year = "2020", address = "Online", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/2020.emnlp-main.91", pages = "1198--1205", }
[16] GEMv2: Multilingual NLG Benchmarking in a Single Line of Code; ... Li Zhang, Huaiyu Zhu, Siddhartha Brahma, Yunyao Li, ...; in EMNLP 2022.Paper
Recent advances in neural sentence embeddings show highly competitive performance on semantic similarity tasks. However, the embeddings do not usually work off-the-shelf; we show that the transfer learning methodology is crucial to performance. We propose a fine-tuning approach and a multi-label approach that outperform most alternative transfer learning approaches on semantic similarity tasks, achieving state-of-the-art performance on multiple datasets.
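A condensed sketch of the multi-label idea (shared encoder, one regression head per relation, losses summed so every relation informs the shared representation); the bag-of-embeddings encoder, dimensions, and relation names here are placeholders rather than the paper's LSTM setup.

import torch
import torch.nn as nn

class MultiRelationRegressor(nn.Module):
    def __init__(self, vocab_size: int, dim: int = 128,
                 relations=("similarity", "relatedness")):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim)  # toy sentence encoder
        self.heads = nn.ModuleDict({r: nn.Linear(2 * dim, 1) for r in relations})

    def forward(self, a_ids: torch.Tensor, b_ids: torch.Tensor) -> dict:
        a, b = self.embed(a_ids), self.embed(b_ids)
        pair = torch.cat([a, b], dim=-1)
        return {r: head(pair).squeeze(-1) for r, head in self.heads.items()}

model = MultiRelationRegressor(vocab_size=10_000)
a = torch.randint(0, 10_000, (4, 6))   # a batch of 4 token-id sequences
b = torch.randint(0, 10_000, (4, 6))
preds = model(a, b)
# Joint training: sum a regression loss over all relations, then backpropagate once.
targets = {r: torch.rand(4) for r in preds}
loss = sum(nn.functional.mse_loss(preds[r], targets[r]) for r in preds)
loss.backward()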
[4] Multi-Label Transfer Learning for Multi-Relational Semantic Similarity; Li Zhang, Steven R. Wilson and Rada Mihalcea; In *SEM 2019. Paper BibTeX Slides
@inproceedings{zhang-etal-2019-multi, title = "Multi-Label Transfer Learning for Multi-Relational Semantic Similarity", author = "Zhang, Li and Wilson, Steven and Mihalcea, Rada", booktitle = "Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*{SEM} 2019)", month = jun, year = "2019", address = "Minneapolis, Minnesota", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/S19-1005", pages = "44--50", abstract = "Multi-relational semantic similarity datasets define the semantic relations between two short texts in multiple ways, e.g., similarity, relatedness, and so on. Yet, all the systems to date designed to capture such relations target one relation at a time. We propose a multi-label transfer learning approach based on LSTM to make predictions for several relations simultaneously and aggregate the losses to update the parameters. This multi-label regression approach jointly learns the information provided by the multiple relations, rather than treating them as separate tasks. Not only does this approach outperform the single-task approach and the traditional multi-task learning approach, but it also achieves state-of-the-art performance on all but one relation of the Human Activity Phrase dataset.", }
[3] Direct Network Transfer: Transfer Learning of Sentence Embeddings for Semantic Similarity; Li Zhang, Steven R. Wilson and Rada Mihalcea; in arXiv pre-print; presented at IC2S2 2018.Paper BibTeX Poster
@misc{zhang2018direct, title={Direct Network Transfer: Transfer Learning of Sentence Embeddings for Semantic Similarity}, author={Li Zhang and Steven R. Wilson and Rada Mihalcea}, year={2018}, eprint={1804.07835}, archivePrefix={arXiv}, primaryClass={cs.CL} }
This work is a part of the UM-IBM Sapphire project. The goal is to build a dialog system that can answer questions about university course information. While tackling the task of translating natural language to SQL, we identified flaws in the current text-to-SQL evaluation scheme and proposed alternatives. I contributed to building a text-to-SQL dataset and implementing named entity recognition as a preprocessing step.
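A toy illustration of NER-based preprocessing for text-to-SQL (an assumption, not the project's code): entity mentions in the question are replaced with typed placeholders so the parser learns query structure rather than memorizing specific course names, then the values are filled back into the predicted SQL.

import re

COURSE_PATTERN = re.compile(r"\b(EECS|CIS)\s*\d{3}\b")

def anonymize(question: str) -> tuple[str, dict[str, str]]:
    slots: dict[str, str] = {}
    def repl(match: re.Match) -> str:
        slot = f"course{len(slots)}"
        slots[slot] = match.group(0)
        return slot
    return COURSE_PATTERN.sub(repl, question), slots

question, slots = anonymize("Who teaches EECS 280 in the fall?")
# question -> "Who teaches course0 in the fall?", slots -> {"course0": "EECS 280"}
sql = "SELECT instructor FROM courses WHERE name = 'course0' AND term = 'fall'"
for slot, value in slots.items():
    sql = sql.replace(slot, value)   # restore the concrete entity in the query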
[1] Improving Text-to-SQL Evaluation Methodology; Catherine Finegan-Dollak, Jonathan K. Kummerfeld, Li Zhang, Karthik Ramanathan Dhanalakshmi Ramanathan, Sesh Sadasivam, Rui Zhang and Dragomir Radev; in ACL 2018.Paper BibTeX Repo Poster
@InProceedings{acl18sql, author = {Catherine Finegan-Dollak\* and Jonathan K. Kummerfeld\* and Li Zhang and Karthik Ramanathan and Sesh Sadasivam and Rui Zhang and Dragomir Radev}, title = {Improving Text-to-SQL Evaluation Methodology}, booktitle = {Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)}, shortvenue = {ACL}, month = {July}, year = {2018}, address = {Melbourne, Victoria, Australia}, pages = {351--360}, abstract = {To be informative, an evaluation must measure how well systems generalize to realistic unseen data. We identify limitations of and propose improvements to current evaluations of text-to-SQL systems. First, we compare human-generated and automatically generated questions, characterizing properties of queries necessary for real-world applications. To facilitate evaluation on multiple datasets, we release standardized and improved versions of seven existing datasets and one new text-to-SQL dataset. Second, we show that the current division of data into training and test sets measures robustness to variations in the way questions are asked, but only partially tests how well systems generalize to new queries; therefore, we propose a complementary dataset split for evaluation of future work. Finally, we demonstrate how the common practice of anonymizing variables during evaluation removes an important challenge of the task. Our observations highlight key difficulties, and our methodology enables effective measurement of future development.}, url = {http://aclweb.org/anthology/P18-1033}, software = {https://github.com/jkkummerfeld/text2sql-data}, data = {https://github.com/jkkummerfeld/text2sql-data}, }
This work is a part of the DARPA AIDA project. From texts, audio, and video recounting the 2014 Russia-Ukraine conflict, the goal is to extract knowledge elements and generate hypotheses about real-life events. I used named entity recognition, keyword extraction, and word embeddings to extract textual entities from the data and assign them categories from the given ontology.
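A compact sketch (an assumption, not the AIDA pipeline itself) of the last step, assigning an extracted mention to an ontology category by embedding similarity; embed stands in for any pretrained word or phrase embedding lookup, and the category list is illustrative.

import numpy as np

ONTOLOGY = ["person", "organization", "weapon", "vehicle", "location"]

def embed(text: str) -> np.ndarray:
    """Hypothetical lookup into a pretrained embedding model."""
    raise NotImplementedError

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def assign_category(mention: str) -> str:
    # Pick the ontology category whose embedding is closest to the mention's.
    vec = embed(mention)
    return max(ONTOLOGY, key=lambda cat: cosine(vec, embed(cat)))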
[2] Entity and Event Extraction from Scratch Using Minimal Training Data; Laura Burdick, Steven R. Wilson, Oana Ignat, Charles F. Welch, Li Zhang, Mingzhe Wang, Jia Deng and Rada Mihalcea; in TAC 2018.Paper BibTeX Poster
@article{Burdick2018EntityAE, title={Entity and Event Extraction from Scratch Using Minimal Training Data}, author={Laura Burdick and Steven R. Wilson and Oana Ignat and Charles F Welch and Li Zhang and Mingzhe Wang and Jia Deng and Rada Mihalcea}, journal={Text Analysis Conference (TAC)}, year={2018} }
Each issue of the New Yorker magazine features a cartoon caption contest to which thousands of readers submit funny captions. The goal is to automatically divide them into clusters based on their theme of humor (what they are joking about) using unsupervised learning. Prior work existed, but the code was scattered and underdocumented. As a freshman, I was in charge of bringing the existing system up to date and optimizing it.
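A minimal sketch of the kind of unsupervised pipeline this describes; TF-IDF features plus k-means are an assumption for illustration, not necessarily the system's actual components.

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

captions = [
    "My lawyer says I can keep the desk.",
    "The desk stays with me, counselor.",
    "I told you the squirrels would unionize.",
    "Squirrel union negotiations, day three.",
]
X = TfidfVectorizer(stop_words="english").fit_transform(captions)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# Captions sharing a joke premise (the desk, the squirrels) tend to land in the same cluster.
print(list(zip(labels, captions)))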
AAN encompasses our corpus of resources on NLP and related fields and the research projects that build upon this corpus. We have collected around 6,500 surveys, tutorials, and other resources and created a search engine that allows users to easily browse them. I helped build and maintain this anthology, which includes information on numerous papers from top NLP venues and features paper citation, author citation, and author collaboration networks.
[18] Language Models are Drummers: Drum Composition with Natural Language Pre-Training; Li Zhang and Chris Callison-Burch; in AAAI 2023 Workshop on Creative AI Across Modalities.Paper Repo BibTeX
@InProceedings{gpt3drum, author = {Li Zhang and Chris Callison-Burch}, title = {Language Models are Drummers: Drum Composition with Natural Language Pre-Training}, venue = {AAAI 2023 1st Workshop on Creative AI Across Modalities}, month = {February}, year = {2023}, address = {Washington, D.C., USA}, abstract = {Automatic music generation with artificial intelligence typically requires a large amount of data which is hard to obtain for many less common genres and musical instruments. To tackle this issue, we present ongoing work and preliminary findings on the possibility for deep models to transfer knowledge from language to music, by finetuning large language models pre-trained on a massive text corpus on only hundreds of MIDI files of drum performances. We show that by doing so, one of the largest, state-of-the-art models (GPT3) is capable of generating reasonable drum grooves, while models that are not pre-trained (Transformer) shows no such ability beyond naive repetition. Evaluating generated music is a challenging task, more so is evaluating drum grooves with little precedence in literature. Hence, we propose a tailored structural evaluation method and analyze drum grooves produced by GPT3 compared to those played by human professionals, exposing the strengths and weaknesses of such generation by language-to-music transfer. Our findings suggest that language-to-music transfer learning with large language models is viable and promising.}, url = {https://arxiv.org/abs/2301.01162}, software = {https://github.com/zharry29/drums-with-llm}, }