Research

One line of work: the reliability of retrieval and LLM systems, meaning knowing when a generated answer can be trusted and getting there without overspending compute. It runs from a co-authored production RAG method to sole-authored work on calibrated abstention, cross-model reliability, evaluation, and model selection.

Full list and citation metrics on Google Scholar →

Publications

Selected

Retrieval-Augmented Generation for Domain-Specific Question Answering

AAAI 2024 · SDU Workshop

Adobe's production RAG method. Cited 50+ times by independent groups.

A retrieval-augmented generation approach for closed, domain-specific question answering that uses user interaction signals to improve retrieval and reduce ungrounded answers. It underpins production question answering at Adobe and has been built on by independent academic and industry groups.

Co-author (one of eight).

arXiv:2404.14760 →

EVICT: Evidence-Sufficiency Verification via Counterfactual Dropout for Visually-Grounded Selective QA

CVPR · GRAIL-V

A training-free probe that catches answers not actually grounded in the evidence.

Vision-language models often answer confidently while relying on the wrong evidence. EVICT tests this directly: it masks the image region the model claims to depend on, then re-runs the same question. If the answer does not change, the model was not actually using the evidence it cited, and the answer is flagged as unverified.

The probe needs no training and no ground-truth labels, so it is cheap to run as a reliability guardrail on top of an existing model. Its honest limitation: it detects evidence-independence, not correctness. An answer can be genuinely grounded and still wrong.
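The mask-and-re-ask step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the EVICT implementation: `answer_fn` is a hypothetical stand-in for any vision-language model, the image is a toy grid of pixel values, and "the answer did not change" is approximated by a case-insensitive exact string match.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]                  # toy grayscale image, rows of pixel values
Box = Tuple[int, int, int, int]          # (top, left, bottom, right), end-exclusive
AnswerFn = Callable[[Image, str], str]   # hypothetical model interface (assumption)


def mask_region(image: Image, box: Box, fill: int = 0) -> Image:
    """Return a copy of the image with the boxed region overwritten by `fill`."""
    top, left, bottom, right = box
    return [
        [fill if top <= r < bottom and left <= c < right else px
         for c, px in enumerate(row)]
        for r, row in enumerate(image)
    ]


def evidence_probe(answer_fn: AnswerFn, image: Image,
                   question: str, claimed_box: Box) -> dict:
    """Re-ask the question with the claimed evidence masked out.

    If the answer survives the masking unchanged, the model did not depend
    on that region, so the answer is reported as unverified.
    """
    original = answer_fn(image, question)
    counterfactual = answer_fn(mask_region(image, claimed_box), question)
    # Assumption: exact match after normalisation stands in for "same answer".
    unchanged = original.strip().lower() == counterfactual.strip().lower()
    return {"answer": original, "verified": not unchanged}
```

A model that reads the masked pixels changes its answer and passes the probe, while a model that ignores them is flagged. Neither outcome says anything about whether the answer is correct.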

Sole-authored.

OpenReview →

PASC: Pipeline-Aware Conformal Prediction for Multi-Stage NLP Pipelines

ICML · EIML

Distribution-free coverage guarantees for the whole pipeline, not each stage in isolation.

In a multi-stage system (NER → disambiguation → typing) errors compound, so calibrating each stage alone under-covers while a Bonferroni union bound over-covers. PASC reduces joint coverage to a single conformal problem on the pipeline's maximum nonconformity score.

On a three-stage pipeline over CoNLL-2003 it reaches 96.4% end-to-end coverage versus 93.4% (Bonferroni) and 86.5% (independent calibration), at the same prediction-set size. It also empirically holds target coverage under distribution shift, where independent calibration collapses to 59%.
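The reduction to a single conformal problem can be illustrated with ordinary split conformal calibration on the per-example maximum score. The sketch below is an assumption-laden illustration, not the PASC code: the function names are hypothetical, and it assumes the stage scores have already been put on a comparable scale. The key observation is that every stage's true label falls inside its prediction set exactly when the maximum stage score is at or below the threshold.

```python
import math
from typing import Dict, List, Sequence, Set


def pipeline_threshold(cal_scores: Sequence[Sequence[float]], alpha: float) -> float:
    """Split-conformal threshold on the per-example maximum stage score.

    cal_scores[i][k] is the nonconformity score of the true label at stage k
    for calibration example i (lower means the label conforms better).
    """
    max_scores = sorted(max(stages) for stages in cal_scores)
    n = len(max_scores)
    rank = math.ceil((n + 1) * (1 - alpha))   # finite-sample corrected rank
    if rank > n:
        return math.inf                        # too few calibration examples
    return max_scores[rank - 1]


def prediction_sets(stage_scores: Sequence[Dict[str, float]],
                    threshold: float) -> List[Set[str]]:
    """At each stage, keep every candidate label whose score is within the threshold."""
    return [{label for label, score in scores.items() if score <= threshold}
            for scores in stage_scores]
```

Because one threshold is shared by all stages, the joint guarantee needs no union bound, which is the source of the tighter coverage relative to Bonferroni.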

Sole-authored.

arXiv:2605.18812 →

PromptPort: A Reliability Layer for Cross-Model Structured Extraction

preprint

Keeps structured output valid when the same prompt behaves differently across models.

Formalizes "format collapse," where one prompt yields clean JSON on one model and malformed output on another, and adds a canonicalization and verification layer so strict parsers stop rejecting correct extractions.

It repairs form, not meaning: it can rescue a malformed-but-correct extraction, but it will not catch a confidently wrong one.
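A canonicalization and verification layer of this kind might look like the following. It is a rough sketch, not PromptPort itself: the specific repairs (stripping a Markdown code fence, dropping surrounding chatter, removing trailing commas) and the `required_keys` check are illustrative assumptions about what such a layer could do.

```python
import json
import re
from typing import Optional

TICKS = "`" * 3  # Markdown code-fence marker that some models wrap around JSON
FENCE = re.compile(TICKS + r"(?:json)?\s*(.*?)" + TICKS, re.DOTALL | re.IGNORECASE)
# Naive repair: also rewrites a comma-then-bracket sequence inside string values.
TRAILING_COMMA = re.compile(r",\s*([}\]])")


def canonicalize(raw: str, required_keys: frozenset) -> Optional[dict]:
    """Try to recover one JSON object from raw model output.

    Returns the parsed object, or None when nothing valid can be recovered.
    """
    text = raw.strip()
    fenced = FENCE.search(text)
    if fenced:
        text = fenced.group(1)                 # keep only the fenced payload
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        return None                            # no object-shaped span at all
    candidate = TRAILING_COMMA.sub(r"\1", text[start:end + 1])
    try:
        parsed = json.loads(candidate)
    except json.JSONDecodeError:
        return None
    # Verification step: structure only, never the values themselves.
    if not isinstance(parsed, dict) or not required_keys.issubset(parsed):
        return None
    return parsed
```

The verification step only inspects structure, so a well-formed object with a wrong value passes unchanged, which matches the stated limitation.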

Sole-authored.

arXiv:2601.06151 →

Workshop papers

Forecasting Model Success at Inference Time: Calibrated Probabilistic Forecasts for Cost-Optimal LLM Cascades

ICML · Forecast

Per-query success forecasts route work through cheap-then-expensive cascades.

Predicts, per query, whether a smaller model will succeed, and uses that calibrated forecast to decide when to escalate, lowering cost at the same quality.
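The routing decision can be sketched as a threshold on the forecast. Everything here is an illustrative assumption rather than the paper's method: the `Costs` values, the fixed threshold, and the simplification that an escalated query goes straight to the large model without paying for the small one first.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Costs:
    small: float   # hypothetical per-query cost of the cheap model
    large: float   # hypothetical per-query cost of the expensive model


def route(p_small_succeeds: float, threshold: float) -> str:
    """Use the cheap model only when its forecast success is high enough."""
    return "small" if p_small_succeeds >= threshold else "large"


def average_cost(forecasts: Sequence[float], threshold: float, costs: Costs) -> float:
    """Mean per-query cost when every query is routed by its forecast."""
    total = sum(
        costs.small if route(p, threshold) == "small" else costs.large
        for p in forecasts
    )
    return total / len(forecasts)
```

The threshold only has a stable meaning if the forecasts are calibrated: a forecast of 0.8 has to correspond to roughly 80% observed success for the cost and quality trade-off to hold.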

Sole-authored.

OpenReview →

Two Wrongs, No Right: Auditing Social-Desirability Bias in LLM Annotators for Computational Social Science

ICML · AI4GOOD

LLM annotators can fail in two opposite directions at once.

When LLMs annotate contested social and political text, they fail in opposite directions at once: one model over-flags where another under-flags, and they can underestimate how much opposition a population holds by 24 to 40 points. Worse, aggregate accuracy can look near-perfect through “accidental cancellation” while both directional errors stay large.
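Accidental cancellation is easy to see with a small audit that reports the two error directions separately from their net effect. The function name and the toy labels in the usage note are made up for illustration and are not results from the paper.

```python
from typing import Dict, Sequence


def directional_audit(gold: Sequence[int], pred: Sequence[int]) -> Dict[str, float]:
    """Separate over-flagging from under-flagging for binary labels (1 = flagged)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    n = len(gold)
    over = sum(1 for g, p in zip(gold, pred) if p and not g)    # flagged, should not be
    under = sum(1 for g, p in zip(gold, pred) if g and not p)   # missed, should be flagged
    return {
        "over_flag_rate": over / n,
        "under_flag_rate": under / n,
        # Error in the estimated share of flagged items; the two directions cancel here.
        "net_prevalence_error": (over - under) / n,
    }
```

On a toy set of 100 items where 40 are truly flagged, an annotator that misses 15 of them and wrongly flags 15 others still reports exactly 40% prevalence. The net error is zero even though 30 items are mislabeled, which is why the directional rates need to be reported alongside the aggregate.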

Sole-authored.

arXiv:2606.12426 →

Not All Queries Need Rewriting: When Prompt-Only LLM Refinement Helps and Hurts Dense Retrieval

ICLR · CAO

Prompt-only query rewriting in RAG helps in some domains and hurts in others.

Shows that the effect of rewriting a query before retrieval is strongly domain-dependent: it helped on TREC-COVID but hurt on FiQA, because rewrites that swap out domain-specific terms degrade queries that already matched well.

Sole-authored.

arXiv:2603.13301 →

The Compositional Generalization Gap in Named Entity Recognition

ICML · CompLearn

Static NER benchmarks overstate transferable performance under compositional shift.

Standard NER benchmarks reuse the same entities across train and test, so a high score can reflect memorization rather than generalization. The paper measures how far those scores overstate performance once entity types recombine into novel, unseen compositions, which are the conditions production systems actually face, and argues for evaluation that reflects them.
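A simple way to expose the memorization effect is to split recall by whether a test entity's surface form already appeared in training. This is a much cruder proxy than the compositional split the paper studies, and the function name and mention format below are assumptions made for illustration.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

TrainMention = Tuple[str, str]    # (surface form, entity type)
Mention = Tuple[int, str, str]    # (sentence id, surface form, entity type)


def recall_by_exposure(train_mentions: Iterable[TrainMention],
                       gold: Set[Mention],
                       predicted: Set[Mention]) -> Dict[str, Optional[float]]:
    """Recall on test entities, split by whether the surface form was seen in training."""
    seen_surfaces = {surface.lower() for surface, _ in train_mentions}
    buckets = {"seen": [0, 0], "unseen": [0, 0]}   # [hits, total] per bucket
    for mention in gold:
        key = "seen" if mention[1].lower() in seen_surfaces else "unseen"
        buckets[key][1] += 1
        buckets[key][0] += mention in predicted    # exact match on id, surface, type
    return {key: (hits / total if total else None)
            for key, (hits, total) in buckets.items()}
```

A large gap between the two buckets suggests the headline score leans on entities the model has already seen.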

Sole-authored.

OpenReview →

Architecture-Homogeneous Model Selection for Representational Alignment

ICLR · Re-Align

A method for choosing models when comparing internal representations.

When you compare two models by their internal representations, which models you pick can drive the result. The method selects architecture-homogeneous models so that an alignment score reflects the representations themselves, not architectural confounds.

Sole-authored.

OpenReview →

Also

Co-authored conference paper on LLM-orchestrated, cross-cloud data-engineering pipelines (IEEE GCWCN 2025; proceedings record forthcoming). Sole-authored work under review at NeurIPS 2026, ACL 2026 (Industry), COLM 2026, and ACM Multimedia 2026 (Brave New Ideas).

Patent

Generating Answers to Contextual Queries within a Closed Domain

pending

First-named inventor of six. US Patent Application 2025/0252265 A1, Adobe Inc.; published August 2025 (pending, not granted). Google Patents →