Using LLMs at Oxide
TL;DR Highlight
Oxide, a systems software company, published their internal LLM usage principles — focused on accountability, rigor, empathy, and teamwork rather than 'use it fast and often.'
Who Should Read
Engineering managers and team leads thinking about how to set healthy norms around LLM use in their org, and individual devs wanting a framework for responsible AI use.
Core Mechanics
- Oxide's guidelines aren't about maximizing LLM usage — they center on values like accountability (own the output), rigor (verify before shipping), empathy (consider the reader), and teamwork (don't let AI erode collaboration).
- The core principle: LLM output is your responsibility once you use it. 'The AI wrote it' is not an acceptable excuse for incorrect, sloppy, or misleading content.
- They explicitly warn against using LLMs to pad or obscure — generated content should be as tight and precise as anything you'd write manually.
- On teamwork: LLMs can subtly reduce knowledge-sharing and mentorship if people stop asking colleagues questions and start asking models instead. The guidelines encourage preserving human interaction.
- Rigor includes not just fact-checking but also stylistic review — LLM output tends toward certain patterns (verbose intros, hedge words, passive voice) that need to be actively edited out.
- The document positions LLMs as tools that amplify your existing quality standards, not tools that establish new (lower) ones.
Evidence
- HN discussion was unusually positive — many commenters praised this as a rare example of an org thinking carefully about LLM use rather than just adopting it uncritically.
- Several engineers shared similar internal guidelines they'd written, suggesting this is a widespread but rarely published concern.
- A few commenters noted the irony that the guidelines were apparently drafted collaboratively by humans, not generated by LLMs — which itself demonstrates the values they're promoting.
- Some pushback: critics argued that overly restrictive guidelines risk making a company less competitive as AI-augmented peers become more productive.
How to Apply
- Use this as a template to draft your own team's LLM usage guidelines — adapt the values section to match your org's existing engineering culture.
- Add a step in your code review process where authors flag AI-generated sections, enabling reviewers to apply extra scrutiny.
- For technical writing: run a dedicated pass to strip LLM-isms (hedging phrases, unnecessary preamble, passive constructions) before publishing; a rough sketch of such a pass follows this list.
- Explicitly discuss LLM use in onboarding — set expectations early rather than letting norms drift organically.
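A minimal sketch of what that stripping pass might look like, assuming a hand-maintained phrase list and a crude passive-voice heuristic (both are illustrative placeholders, not a vetted style rule set):

    import re
    import sys

    # Illustrative phrase list -- tune it to the LLM-isms your team actually sees.
    LLM_ISMS = [
        r"\bit'?s worth noting\b",
        r"\bit is important to note\b",
        r"\bdelve into\b",
        r"\bin today'?s fast-paced\b",
        r"\bin conclusion\b",
    ]

    # Crude passive-voice heuristic: a form of "to be" followed by a word ending in "ed".
    PASSIVE = re.compile(r"\b(?:is|are|was|were|been|being|be)\s+\w+ed\b", re.IGNORECASE)

    def flag(path):
        """Print line-numbered warnings for phrases worth a manual edit."""
        with open(path, encoding="utf-8") as f:
            for lineno, line in enumerate(f, start=1):
                for pattern in LLM_ISMS:
                    if re.search(pattern, line, re.IGNORECASE):
                        print(f"{path}:{lineno}: LLM-ism: {pattern}")
                if PASSIVE.search(line):
                    print(f"{path}:{lineno}: possible passive construction")

    if __name__ == "__main__":
        for path in sys.argv[1:]:
            flag(path)

Run it over a draft and treat each hit as a prompt for a human edit, not an automatic rewrite; the point of the guidelines is that a person still owns the final wording.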
Terminology
Related Papers
Can LLMs model real-world systems in TLA+?
A benchmark study that systematically checks LLM-written TLA+ specifications: they pass syntax checks easily, but behavioral conformance with the real system sits at only about 46%, showing the practical limits of AI-assisted formal verification.
Natural Language Autoencoders: Turning Claude's Thoughts into Text
Anthropic published NLA, a technique that converts the numeric vectors (activations) inside an LLM into directly readable natural language. It is a new step forward in interpretability research into what the model is actually "thinking."
ProgramBench: Can language models rebuild programs from scratch?
A new benchmark measuring whether LLMs can reimplement real software like FFmpeg, SQLite, and a PHP interpreter from scratch using only documentation; even the best models reach 95%+ test pass rates on only 3% of the tasks.
MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents
Split a request into three tickets and even Claude/GPT will go ahead and write code containing security vulnerabilities 53-86% of the time.
Refusal in Language Models Is Mediated by a Single Direction
Open-source chat models encode refusal as a single direction in activation space; ablating that direction effectively bypasses their safety fine-tuning.
Show HN: A new benchmark for testing LLMs for deterministic outputs
Structured Output Benchmark assesses LLM JSON handling across seven metrics, revealing performance differences that schema compliance alone doesn't capture.