RobustSight: AI Safety & Alignment Framework

Published: August 01, 2025

A comprehensive AI safety evaluation framework investigating the intersection of adversarial robustness, model interpretability, and human-guided alignment. Benchmarks frontier LLMs on safety-relevant behavior across distribution shifts and adversarial perturbations.

Key work: Designed experiments measuring alignment stability across 6 semantically equivalent question variants per indicator · Ran statistical analysis over response distributions, including Likert histograms, cross-model EDA, and chain-of-thought semantic analysis · Built mathematical models characterizing ground-truth drift in LLM judgment across model generations · Open-sourced the framework for longitudinal tracking of alignment drift, with human expert survey integration
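As a rough illustration of the variant-stability idea above, here is a minimal sketch, not the project's actual code. It assumes a 5-point Likert scale and uses the spread of per-variant mean scores as a stand-in stability measure; the function names, the scale, and the sample responses are all hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev

# Assumption: responses are integers on a 5-point Likert scale.
LIKERT_SCALE = (1, 2, 3, 4, 5)


def likert_histogram(responses):
    """Count how often each Likert point occurs, including zero counts."""
    counts = Counter(responses)
    return {point: counts.get(point, 0) for point in LIKERT_SCALE}


def variant_stability(responses_by_variant):
    """Population standard deviation of the per-variant mean scores.

    0.0 means every phrasing of the question produced the same average
    answer; larger values mean the model's answer depends on wording.
    This is one illustrative choice of metric, not the framework's own.
    """
    variant_means = [mean(scores) for scores in responses_by_variant.values()]
    return pstdev(variant_means)


# Hypothetical responses for one indicator, asked in 6 equivalent phrasings.
responses_by_variant = {
    "variant_1": [4, 4, 5, 4],
    "variant_2": [4, 5, 4, 4],
    "variant_3": [3, 4, 4, 4],
    "variant_4": [4, 4, 4, 5],
    "variant_5": [2, 3, 4, 3],
    "variant_6": [4, 4, 5, 5],
}

if __name__ == "__main__":
    for name, scores in responses_by_variant.items():
        print(name, likert_histogram(scores))
    print("stability (std of variant means):", variant_stability(responses_by_variant))
```

A low score suggests the model's judgment on that indicator holds up under rephrasing. Comparing the same score across model generations is one simple way to look for drift.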

Stack: Python, PyTorch, HuggingFace, statistical analysis, adversarial ML

GitHub →

Aravind Kannappan