Weize Liu


I am a first-year Ph.D. student in Computer Science at the University of Maryland, College Park, advised by Prof. Furong Huang.

My research focuses on large language models (LLMs), particularly on improving models’ reasoning, agentic capabilities, reliability, and efficiency by developing advanced post-training (SFT, RL) and data synthesis methods.

I am actively seeking a research internship for summer 2026 (based in the United States) and welcome any referrals or connections. I am also open to research collaborations; if you are interested in working together, please feel free to reach out via email.

news

Jan 2026 A paper was accepted to ICLR 2026! Thanks to all co-authors. See you in Brazil! Feel free to say hi and chat with me!
Sep 2025 Started the Computer Science Ph.D. program at the University of Maryland, College Park.
Jun 2025 Completed the M.Eng. in Computer Technology at Zhejiang University.
May 2025 Started a research internship at Alibaba Group (Foundation Model Training Team, Future Living Lab), enhancing the reasoning capabilities of Qwen3 models through data synthesis techniques.

selected publications

  1. DESIGNER: Design-Logic-Guided Multidisciplinary Data Synthesis for LLM Reasoning
    Weize Liu*, Yongchi Zhao*, Yijia Luo, and 8 more authors
    ICLR 2026
    * Equal contribution
    • Post-training and even mid-training rely heavily on exam-style data, yet many low-resource disciplines still lack sufficient high-quality questions. Existing data synthesis methods face two major challenges: query-centric approaches are limited by seed-pool coverage and model bias, while document-centric approaches lack control over question difficulty. We propose DESIGNER: a DESIGN-logic-guidEd Reasoning data synthesis pipeline for synthesizing multidisciplinary reasoning questions from raw corpora.
    • The central insight is the notion of "Design Logic", a form of reusable meta-knowledge that encapsulates how human experts transform knowledge points into complex exam questions. Design logic enables LLMs to generate new questions with the same complex reasoning patterns from entirely different source texts, with explicit control over difficulty, diversity, and question types. We extracted over 120,000 design logics from filtered human-authored multidisciplinary question banks using LLMs.
    • We designed a two-stage retrieve-and-generate mechanism to precisely match design logics with raw corpora that underwent our multi-dimensional labeling and filtering process, synthesizing two large-scale datasets spanning 75 diverse disciplines: DLR-Book (3.04 million questions from book corpora) and DLR-Web (1.66 million questions from web corpora). A minimal sketch of the retrieve-and-generate step appears after this entry.
    • Data analysis shows that questions synthesized by our method exhibit significantly greater difficulty and diversity compared to existing datasets. A series of SFT experiments on the Qwen3 and Llama3 model families demonstrate that our data substantially enhances LLMs’ multidisciplinary reasoning capabilities, outperforming baseline datasets. Notably, by applying SFT on the base versions of these models using only our data, we even surpass their officially released post-trained final models.
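    A minimal sketch of the two-stage retrieve-and-generate idea in Python. The `embed` and `llm` callables, the `DesignLogic` fields, and the prompt wording are illustrative assumptions, not the paper’s actual implementation:

```python
# Hypothetical sketch of the two-stage retrieve-and-generate step.
# `embed` and `llm` are assumed callables (any sentence-embedding model and
# any instruction-following LLM); all names and prompts are illustrative.
from dataclasses import dataclass
from typing import Callable, List
import numpy as np

@dataclass
class DesignLogic:
    discipline: str     # e.g., "organic chemistry"
    difficulty: str     # e.g., "hard"
    question_type: str  # e.g., "multiple choice"
    logic: str          # reusable meta-knowledge: how experts build the question

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def retrieve_logics(doc: str, pool: List[DesignLogic],
                    embed: Callable[[str], np.ndarray], k: int = 3) -> List[DesignLogic]:
    """Stage 1: match a labeled source document to the k most relevant design logics."""
    d = embed(doc)
    return sorted(pool, key=lambda dl: cosine(d, embed(dl.logic)), reverse=True)[:k]

def synthesize_question(doc: str, dl: DesignLogic, llm: Callable[[str], str]) -> str:
    """Stage 2: instantiate a retrieved design logic on the new source text."""
    prompt = (
        f"You are an exam designer in {dl.discipline}.\n"
        f"Design logic to follow:\n{dl.logic}\n\n"
        f"Using only the source text below, write one {dl.difficulty} "
        f"{dl.question_type} question that follows this design logic.\n\n"
        f"Source text:\n{doc}"
    )
    return llm(prompt)
```

    In this framing, controlling difficulty, diversity, and question type amounts to selecting which design logics are retrieved for a given document.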
  2. Mind’s Mirror: Distilling Self-Evaluation Capability and Comprehensive Thinking from Large Language Models
    Weize Liu, Guocong Li, Kai Zhang, and 6 more authors
    NAACL 2024
    • We proposed a novel data distillation (data synthesis) approach that distills the self-evaluation capability of LLMs into small language models (SLMs). By learning from the analysis and evaluation of CoT correctness, SLMs gain an understanding of the potential reasons behind correct or incorrect reasoning, comprehend problems more deeply, and thus produce more accurate and reliable answers.
    • To overcome the randomness and limitations of generated synthetic data, we further proposed distilling diverse chains of thought along with their corresponding multiple self-evaluations from LLMs, enabling SLMs to learn the more comprehensive reasoning paths and thinking spaces of LLMs (see the sketch after this entry).
    • Comprehensive experiments demonstrated that our method enables SLMs to successfully learn the self-evaluation capability and more comprehensive thinking of LLMs, significantly enhancing the performance and reliability of the trained SLMs and outperforming previous CoT distillation methods. This shows that our method is particularly well suited to achieving efficient, high-quality, and reliable reasoning with SLMs, especially in resource-constrained environments.
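    A minimal sketch of this distillation recipe, assuming `teacher` is any large LLM exposed as a callable; the prompts and record fields are illustrative, not the paper’s exact format:

```python
# Hedged sketch: sample diverse CoTs from a teacher LLM, then have the teacher
# evaluate each CoT; the SLM is fine-tuned on both kinds of targets.
from typing import Callable, Dict, List

def build_distillation_records(question: str,
                               teacher: Callable[[str], str],
                               n_paths: int = 4) -> List[Dict[str, str]]:
    records = []
    for _ in range(n_paths):
        # 1) Sample one of several diverse chains of thought (assuming the
        #    teacher decodes with nonzero temperature, each call can differ).
        cot = teacher(f"Question: {question}\nThink step by step, then answer.")
        # 2) Ask the teacher to evaluate that chain of thought: correct or not, and why.
        evaluation = teacher(
            f"Question: {question}\nProposed reasoning:\n{cot}\n"
            "Evaluate whether this reasoning is correct, explain the likely "
            "sources of error (or why it is sound), then give a verdict."
        )
        # The SLM learns both to reason (question -> CoT) and to self-evaluate
        # (question + CoT -> evaluation).
        records.append({"cot_input": question, "cot_target": cot,
                        "eval_input": f"{question}\n{cot}", "eval_target": evaluation})
    return records
```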
  3. Unraveling Babel: Exploring Multilingual Activation Patterns of LLMs and Their Applications
    Weize Liu, Yinlong Xu, Hongxia Xu, and 3 more authors
    EMNLP 2024
    • To analyze the similarities and differences in internal neuron activities when LLMs process different languages, we designed a method to convert dense LLMs into fine-grained MoE architectures, and visually analyzed multilingual activation patterns within LLMs through expert activation frequency heatmaps.
    • Through extensive experiments across different model families, model sizes, and variants, we analyzed the distribution of high-frequency activated neurons for different languages, the distribution of multilingual shared neurons, whether activation patterns of different languages relate to their language families, and the impact of instruction tuning on activation patterns.
    • We further explored leveraging the discovered differences in expert activation frequencies to guide sparse activation and pruning during model inference. Our method significantly outperformed random expert pruning and, in some languages, even exceeded the performance of the original unpruned models. We also found that setting different pruning rates for different layers, based on their differences in activation levels, yields better results. These applications further validate the effectiveness of the neuron activation frequency patterns we discovered (see the sketch after this entry).
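    A rough numpy sketch of the frequency-then-prune idea, under strong assumptions: the dense FFN’s hidden units are split into equal-width “experts”, an expert counts as activated on a token if any of its units is nonzero after the activation function, and pruning keeps the most frequently activated experts for the target language. Shapes, thresholds, and names are hypothetical:

```python
# Illustrative sketch: view a dense FFN as fine-grained experts, measure how
# often each expert fires for a given language, and keep only high-frequency
# experts at inference time. All details here are assumptions.
import numpy as np

def activation_frequencies(acts: np.ndarray, n_experts: int) -> np.ndarray:
    """acts: (tokens, hidden) post-activation FFN values for one language.
    Splits the hidden dimension into n_experts equal slices (hidden must be
    divisible by n_experts) and returns each expert's firing rate."""
    experts = np.split(acts, n_experts, axis=1)
    return np.array([np.any(e > 0, axis=1).mean() for e in experts])

def prune_mask(freq: np.ndarray, keep_ratio: float = 0.8) -> np.ndarray:
    """Keep the most frequently activated experts; mask out the rest."""
    k = max(1, int(round(keep_ratio * len(freq))))
    mask = np.zeros(len(freq), dtype=bool)
    mask[np.argsort(freq)[-k:]] = True
    return mask
```

    Per-layer pruning rates could then be set from each layer’s overall activation level, mirroring the layer-wise finding above.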
  4. From Misleading Queries to Accurate Answers: A Three-Stage Fine-Tuning Method for LLMs
    Guocong Li, Weize Liu, Yihang Wu, and 4 more authors
    ACL 2025 (Findings)
    • Users often submit inaccurate queries to LLMs, sometimes containing misleading information, and LLM responses are susceptible to such misleading content. We proposed a three-stage fine-tuning method that trains LLMs to detect and correct misleading information in queries, improving the accuracy and robustness of responses to such queries and reducing the negative impact of misinformation on the model.
    • Specifically, the three stages are: (1) training LLMs to identify whether a query contains misleading information; (2) training LLMs to correct the misleading information in the query using internal or external knowledge; and (3) training LLMs to generate accurate and reliable answers based on the corrected query (see the sketch after this entry).
    • To validate our method, we constructed two datasets containing misleading information. Our trained model also detected misleading information in some questions from commonly used benchmarks; removing those questions significantly improves measured model accuracy, while the model trained with our method maintains robust responses and higher performance whether or not the query contains misleading information.
    • Experimental results across multiple datasets and tasks demonstrate that our method significantly improves the accuracy and factuality of LLM responses, while enhancing LLMs’ hallucination detection capabilities and reducing hallucinations in model outputs, especially when queries contain misleading information.
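    A hedged sketch of how the three stages could be laid out as supervised fine-tuning data; the `Example` shape, prompts, and field names are assumptions for illustration only:

```python
# Hypothetical stage-wise SFT data construction for the three-stage method.
from dataclasses import dataclass

@dataclass
class Example:
    prompt: str
    target: str

def stage1_detect(query: str, is_misleading: bool) -> Example:
    # Stage 1: learn to flag misleading information in the query.
    return Example(f"Does this query contain misleading information?\nQuery: {query}",
                   "Yes" if is_misleading else "No")

def stage2_correct(query: str, corrected_query: str, evidence: str = "") -> Example:
    # Stage 2: learn to rewrite the query using internal or external knowledge.
    ctx = f"Relevant knowledge: {evidence}\n" if evidence else ""
    return Example(f"{ctx}Correct any misleading information in this query.\nQuery: {query}",
                   corrected_query)

def stage3_answer(corrected_query: str, answer: str) -> Example:
    # Stage 3: learn to answer the corrected query accurately and reliably.
    return Example(f"Answer the following query.\nQuery: {corrected_query}", answer)
```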