What Makes Safety Fine-tuning Methods Safe? A Mechanistic Study
2024
Conference Paper
Author(s): | Jain, S. and Lubana, E. S. and Oksuz, K. and Joy, T. and Torr, P. H. S. and Sanyal, A. and Dokania, P. K. |
Book Title: | ICML 2024 Workshop on Mechanistic Interpretability (Spotlight) |
Year: | 2024 |
Month: | July |
Department(s): | Empirical Inference |
Bibtex Type: | Conference Paper (conference) |
Event Place: | Vienna, Austria |
State: | Published |
URL: | https://openreview.net/forum?id=BS2CbUkJpy |
BibTeX:
@conference{Jainetal24,
  title = {What Makes Safety Fine-tuning Methods Safe? A Mechanistic Study},
  author = {Jain, S. and Lubana, E. S. and Oksuz, K. and Joy, T. and Torr, P. H. S. and Sanyal, A. and Dokania, P. K.},
  booktitle = {ICML 2024 Workshop on Mechanistic Interpretability (Spotlight)},
  month = jul,
  year = {2024},
  doi = {},
  url = {https://openreview.net/forum?id=BS2CbUkJpy},
  month_numeric = {7}
}