Research
One line of work: the reliability of retrieval and LLM systems. Knowing when a generated answer can be trusted, and getting there without overspending compute. It runs from a co-authored production RAG method to sole-authored work on calibrated abstention, cross-model reliability, evaluation, and model selection.
Publications
Retrieval-Augmented Generation for Domain-Specific Question Answering
AAAI 2024 · SDU Workshop
A retrieval-augmented generation approach for closed, domain-specific question answering that uses user interaction signals to improve retrieval and reduce ungrounded answers. It underpins production question answering at Adobe and has been built on by independent academic and industry groups.
Co-author (one of eight). Cited 50+ times.
EVICT: Evidence-Sufficiency Verification via Counterfactual Dropout for Visually-Grounded Selective QA
CVPR · GRAIL-V
Vision-language models often answer confidently while relying on the wrong evidence. EVICT tests this directly: it masks the image region the model claims to depend on, then re-runs the same question. If the answer does not change, the model was not actually using the evidence it cited, and the answer is flagged as unverified.
The probe needs no training and no ground-truth labels, so it is cheap to run as a reliability guardrail on top of an existing model. Its honest limitation: it detects evidence-independence, not correctness. An answer can be genuinely grounded and still wrong.
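The mask-and-re-ask probe described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: `answer_fn` is a hypothetical stand-in for any vision-language model call, the image is a plain grid of pixel values, the cited region is a hypothetical `(top, left, bottom, right)` box, and answers are compared by normalized string equality, which a real system would likely replace with something more tolerant.

```python
from typing import Callable, Dict, List, Tuple

Image = List[List[int]]            # toy grayscale image: rows of pixel values
Box = Tuple[int, int, int, int]    # (top, left, bottom, right), end-exclusive


def mask_region(image: Image, box: Box, fill: int = 0) -> Image:
    """Return a copy of the image with the boxed region overwritten by `fill`."""
    top, left, bottom, right = box
    return [
        [fill if top <= r < bottom and left <= c < right else px
         for c, px in enumerate(row)]
        for r, row in enumerate(image)
    ]


def evidence_probe(
    answer_fn: Callable[[Image, str], str],  # hypothetical model interface
    image: Image,
    question: str,
    cited_box: Box,
) -> Dict[str, object]:
    """Re-ask the question with the cited evidence masked out.

    If the answer survives the masking, the model did not depend on the
    region it cited, so the answer is reported as unverified.
    """
    original = answer_fn(image, question)
    counterfactual = answer_fn(mask_region(image, cited_box), question)
    changed = original.strip().lower() != counterfactual.strip().lower()
    return {"answer": original, "verified": changed}
```

As the text notes, a `verified` flag here says only that the answer depended on the cited region, not that the answer is correct.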
Sole-authored.
PASC: Pipeline-Aware Conformal Prediction for Multi-Stage NLP Pipelines
ICML · EIML
In a multi-stage system (NER → disambiguation → typing), errors compound, so calibrating each stage alone under-covers while a Bonferroni union bound over-covers. PASC reduces joint coverage to a single conformal problem on the pipeline's maximum nonconformity score.
On a three-stage pipeline over CoNLL-2003 it reaches 96.4% end-to-end coverage versus 93.4% (Bonferroni) and 86.5% (independent calibration), at the same prediction-set size. It also empirically holds target coverage under distribution shift, where independent calibration collapses to 59%.
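The reduction to a single conformal problem can be illustrated with standard split conformal prediction applied to the per-example maximum score. This is a sketch under assumptions rather than the paper's code: the function names are hypothetical, and it assumes each stage already produces a nonconformity score (higher means less plausible) on a shared scale.

```python
import math
from typing import List, Sequence


def joint_threshold(cal_scores: Sequence[Sequence[float]], alpha: float) -> float:
    """Calibrate one threshold for the whole pipeline.

    cal_scores[i][k] is the nonconformity score of the TRUE label of
    calibration example i at stage k. Every stage is covered exactly when
    the maximum of these scores is below the threshold, so the standard
    split-conformal quantile is taken over the per-example maxima.
    """
    n = len(cal_scores)
    max_scores = sorted(max(row) for row in cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return max_scores[rank - 1]


def prediction_sets(stage_scores: Sequence[Sequence[float]], qhat: float) -> List[List[int]]:
    """Per stage, keep every candidate label whose score is within qhat.

    stage_scores[k][j] is the score of candidate label j at stage k for
    one test example. Returns label indices per stage.
    """
    return [[j for j, s in enumerate(scores) if s <= qhat] for scores in stage_scores]
```

For example, with per-example maxima of 0.2, 0.3, 0.5, and 0.9 and `alpha = 0.5`, the corrected rank is 3 and the threshold is 0.5; every stage then shares that one threshold instead of receiving its own.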
Sole-authored.
PromptPort: A Reliability Layer for Cross-Model Structured Extraction
preprint
Formalizes "format collapse," where one prompt yields clean JSON on one model and malformed output on another, and adds a canonicalization and verification layer so strict parsers stop rejecting correct extractions.
It repairs form, not meaning: it can rescue a malformed-but-correct extraction, but it will not catch a confidently wrong one.
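A canonicalize-then-verify step of the kind described above might look like the following. It is a deliberately small sketch, assuming the only damage is extra text wrapped around an otherwise valid JSON object; the function name and the `required_keys` check are hypothetical and are not PromptPort's actual interface.

```python
import json
from typing import Any, Dict, Iterable, Optional


def canonicalize(raw: str, required_keys: Iterable[str]) -> Optional[Dict[str, Any]]:
    """Recover a JSON object from chatty model output, then verify its shape.

    Canonicalization: keep only the span from the first '{' to the last '}'.
    Verification: the span must parse as a JSON object and contain every
    required key. Returns None when either step fails.
    """
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        obj = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    if not all(key in obj for key in required_keys):
        return None
    return obj
```

A strict parser would reject `Here you go: {"name": "Ada"} Hope that helps.` outright, while this layer returns the embedded object. It still has no way to tell whether `"Ada"` is the right value, which is the limitation the text names.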
Sole-authored.
Forecasting Model Success at Inference Time: Calibrated Probabilistic Forecasts for Cost-Optimal LLM Cascades
ICML · Forecast
Predicts, per query, whether a smaller model will succeed, and uses that calibrated forecast to decide when to escalate. Lower cost for the same quality.
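One simple way to turn such a forecast into an escalation decision is an expected-utility comparison. The rule below is an illustrative assumption, not the paper's decision procedure: it supposes a calibrated success probability for each model, a fixed per-call cost, and a single hypothetical `value` placed on a correct answer, all in the same units.

```python
def choose_model(
    p_small: float,     # forecast probability that the small model succeeds
    p_large: float,     # forecast probability that the large model succeeds
    cost_small: float,  # cost of one small-model call
    cost_large: float,  # cost of one large-model call
    value: float,       # assumed value of a correct answer
) -> str:
    """Pick the model with the higher expected utility for this query.

    Expected utility = P(success) * value - cost. Ties go to the small
    model because it is the cheaper call.
    """
    utility_small = p_small * value - cost_small
    utility_large = p_large * value - cost_large
    return "small" if utility_small >= utility_large else "large"
```

The rule is only as good as the forecast's calibration: if `p_small` is systematically too high, the cascade under-escalates and quality drops without any visible cost signal.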
Sole-authored.
Two Wrongs, No Right: Auditing Social-Desirability Bias in LLM Annotators for Computational Social Science
ICML · AI4GOOD
When LLMs annotate contested social and political text they fail in opposite directions at once (one model over-flags where another under-flags), and they can underestimate how much opposition a population holds by 24 to 40 points. Worse, aggregate accuracy can look near-perfect through “accidental cancellation” while both directional errors stay large.
Sole-authored.
Not All Queries Need Rewriting: When Prompt-Only LLM Refinement Helps and Hurts Dense Retrieval
ICLR · CAO
Shows that rewriting a query before retrieval is strongly domain-dependent: it helped on TREC-COVID but hurt on FiQA, because rewrites that swap out domain-specific terms degrade queries that already matched well.
Sole-authored.
The Compositional Generalization Gap in Named Entity Recognition
ICML · CompLearn
Standard NER benchmarks reuse the same entities across train and test, so a high score can reflect memorization rather than generalization. This work measures how far those scores overstate performance once entity types recombine into novel, unseen compositions (the conditions production systems actually face) and argues for evaluation that reflects them.
Sole-authored.
Architecture-Homogeneous Model Selection for Representational Alignment
ICLR · Re-Align
When you compare two models by their internal representations, which models you pick can drive the result. This selects architecture-homogeneous models so an alignment score reflects the representations themselves, not architectural confounds.
Sole-authored.