RobustSight: AI Safety & Alignment Framework

Published:

An AI safety evaluation framework that investigates how adversarial robustness, model interpretability, and human-guided alignment interact. It benchmarks frontier LLMs on safety-relevant behavior under distribution shifts and adversarial perturbations.

Key work: Designed experiments measuring alignment stability across 6 semantically equivalent question variants per indicator · Analyzed response distributions using Likert histograms, cross-model exploratory data analysis (EDA), and chain-of-thought semantic analysis · Built mathematical models characterizing ground-truth drift in LLM judgment across model generations · Open-sourced the framework for longitudinal alignment-drift tracking with human expert survey integration
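One way to picture the variant-stability measurement is the minimal sketch below. It is an illustration, not the framework's actual code. It assumes a 5-point Likert scale, and it uses mean pairwise total variation distance between per-variant histograms as a stand-in instability score. The names `likert_histogram`, `total_variation`, and `variant_instability` are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Assumption: responses are integers on a 5-point Likert scale.
LIKERT_SCALE = (1, 2, 3, 4, 5)


def likert_histogram(responses):
    """Return the normalized frequency of each Likert level."""
    if not responses:
        raise ValueError("need at least one response")
    if any(r not in LIKERT_SCALE for r in responses):
        raise ValueError("response outside the assumed Likert scale")
    counts = Counter(responses)
    return [counts.get(level, 0) / len(responses) for level in LIKERT_SCALE]


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def variant_instability(responses_by_variant):
    """Mean pairwise total variation distance across question variants.

    0.0 means every variant produced the same response distribution;
    1.0 means the variants produced fully disjoint distributions.
    """
    histograms = [likert_histogram(r) for r in responses_by_variant.values()]
    pairs = list(combinations(histograms, 2))
    if not pairs:
        raise ValueError("need at least two variants to compare")
    return sum(total_variation(p, q) for p, q in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical responses for one indicator, one list per question variant.
    example = {
        "variant_1": [4, 4, 5, 4],
        "variant_2": [4, 5, 5, 4],
        "variant_3": [2, 3, 4, 4],
    }
    print(variant_instability(example))
```

A score near zero suggests the model answers the rephrasings of an indicator consistently, while a larger score flags sensitivity to wording. Other distances (for example, Jensen-Shannon divergence) could be swapped in without changing the overall structure.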

Stack: Python, PyTorch, HuggingFace, statistical analysis, adversarial ML

GitHub →