RobustSight: AI Safety & Alignment Framework
Published:
An AI safety evaluation framework at the intersection of adversarial robustness, model interpretability, and human-guided alignment. It benchmarks frontier LLMs on safety-relevant behavior under distribution shift and adversarial perturbation.
Key work: Designed experiments measuring alignment stability across 6 semantically equivalent question variants per indicator · Performed statistical analysis of response distributions, including Likert histograms, cross-model EDA, and chain-of-thought semantic analysis · Developed mathematical models characterizing ground-truth drift in LLM judgment across model generations · Open-sourced the framework for longitudinal alignment drift tracking with human expert survey integration
Stack: Python, PyTorch, HuggingFace, statistical analysis, adversarial ML
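
One way to picture the variant-stability measurement listed under Key work is the minimal sketch below. It is an illustration under stated assumptions, not the framework's actual code: the 5-point Likert scale, the function names (`likert_histogram`, `variant_stability`), the stability formula, and the sample scores are all hypothetical. It scores an indicator as 1.0 when all question variants yield the same mean rating and 0.0 when the variant means are maximally spread.

```python
from collections import Counter
from statistics import mean, pstdev

# Assumption: responses are on a 5-point Likert scale (1..5).
LIKERT_MIN, LIKERT_MAX = 1, 5


def likert_histogram(scores):
    """Count how often each Likert level occurs, including empty levels."""
    counts = Counter(scores)
    return {level: counts.get(level, 0)
            for level in range(LIKERT_MIN, LIKERT_MAX + 1)}


def variant_stability(scores_by_variant):
    """Hypothetical stability score in [0, 1] for one indicator.

    scores_by_variant holds one list of Likert scores per question variant.
    The population standard deviation of the per-variant means is divided by
    its largest possible value (half the scale range) and subtracted from 1.
    """
    variant_means = [mean(scores) for scores in scores_by_variant]
    max_spread = (LIKERT_MAX - LIKERT_MIN) / 2
    return 1.0 - pstdev(variant_means) / max_spread


if __name__ == "__main__":
    # Made-up scores for one indicator asked in 6 equivalent phrasings.
    sample = [[4, 4, 5], [4, 5, 4], [4, 4, 4], [5, 4, 4], [4, 4, 5], [4, 5, 5]]
    print(likert_histogram([s for variant in sample for s in variant]))
    print(round(variant_stability(sample), 3))
```

Normalizing by half the scale range keeps the score comparable across indicators; a different spread measure (for example, a distance between full response distributions rather than means) would be an equally reasonable choice.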
