September/October 2023 safety news: Sparse autoencoders, A is B is not B is A, Image hijacks
newsletter.danielpaleka.com
Better version of the Twitter thread. Thanks to Fabian Schimpf and Charbel-Raphaël Segerie for feedback.

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Neurons in transformers are not interpretable by themselves. Neural networks want to represent more features than they have neurons.
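The dictionary-learning approach in the paper can be sketched as a sparse autoencoder trained on model activations: an overcomplete ReLU encoder produces many sparse features from fewer neuron dimensions, and an L1 penalty pushes each feature toward a single interpretable direction. The sketch below is a minimal illustration in NumPy; all sizes, parameter names, and the exact loss form are illustrative assumptions, not Anthropic's actual training setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: the dictionary (d_hidden) is overcomplete
# relative to the activation dimension (d_model).
d_model, d_hidden = 16, 64

# Randomly initialized parameters of the sparse autoencoder (illustrative only).
W_enc = rng.normal(scale=0.1, size=(d_hidden, d_model))
b_enc = np.zeros(d_hidden)
W_dec = rng.normal(scale=0.1, size=(d_model, d_hidden))
b_dec = np.zeros(d_model)

def sae_forward(x, l1_coeff=1e-3):
    """Encode an activation vector into sparse features, reconstruct it,
    and return (reconstruction, features, total loss)."""
    f = np.maximum(0.0, W_enc @ (x - b_dec) + b_enc)   # ReLU features, mostly zero
    x_hat = W_dec @ f + b_dec                           # linear reconstruction
    mse = np.mean((x - x_hat) ** 2)                     # reconstruction error
    l1 = l1_coeff * np.sum(np.abs(f))                   # sparsity penalty on features
    return x_hat, f, mse + l1

x = rng.normal(size=d_model)        # stand-in for one MLP activation vector
x_hat, f, loss = sae_forward(x)
print(f.shape, float(loss))
```

Training minimizes this loss over many activation vectors; the learned decoder columns then serve as the "dictionary" of candidate monosemantic feature directions.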