But is it really in Rome? Limitations of the ROME model editing technique
I just published a new post on LessWrong. It's about the causal tracing and model editing paper (ROME).
An incomplete list of projects I'd like to work on in 2023
Wrote up a short (incomplete) bullet-point list of the projects I'd like to work on in 2023. Here's the link.
(Linkpost) Results for a survey of tool use and workflows in alignment research
On March 22nd, 2022, we released a survey with an accompanying post for the purpose of getting more insight into
How learning efficiently applies to alignment research
As we are trying to optimize for actually solving the problem, we should not fall into the trap of learning
Differential Training Process: Delaying capabilities until inner aligned
I've been ruminating on an idea ever since I read the section on deception in "The Core of the Alignment
Near-Term AI capabilities probably bring low-hanging fruit for global poverty/health
I'm an alignment researcher, but I still think we should be vigilant about how models like GPT-N could potentially be
Is the "Valley of Confused Abstractions" real?
Epistemic Status: Quite confused. Using this short post as a signal for discussion.
Here's a link to the LessWrong post
Foresight for AGI Safety Strategy
For discussion: Link to LessWrong post. Link to EA Forum post.
This post is about why I think we should
Notes on Cicero
Link to YouTube explanation by Yannic Kilcher:
Link to paper (sharing on GDrive since it's behind a paywall on Science)
Detail about factual knowledge in Transformers
This post is currently in the Appendix of a much longer post that I'm editing while waiting for feedback.