Faster asin() was hiding in plain sight
TL;DR Highlight
While optimizing a ray tracer, the author replaced the system asin() with hand-written approximations based on Taylor series and Padé approximation, ending up with a version that is faster than the library call while staying accurate enough for rendering.
Who Should Read
Graphics programmers, game engine developers, and performance engineers who care about floating-point performance and numerical methods.
Core Mechanics
- For ray tracing performance, the author needed a fast asin() (arcsine) function and found the system math library version was slower than necessary for their precision requirements.
- Taylor series approximation: expanding asin(x) around x=0 gives a polynomial that's fast to compute but loses accuracy near x=±1.
- Padé approximation: a rational function (polynomial/polynomial) that provides better accuracy than Taylor series for the same number of terms, especially near the boundaries of the domain.
- The custom implementation was faster than the system asin() while meeting the ray tracer's precision needs — the key insight being that ray tracers often don't need full IEEE 754 double precision.
- This kind of micro-optimization matters when asin() is called millions of times per frame in tight inner loops.
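The Taylor-versus-Padé trade-off described above can be sketched as follows. This is a minimal illustration, not the author's code: `taylor5_asin` and `pade_asin` are hypothetical names, and the low-order [3/2] Padé approximant x(60 - 17x^2)/(60 - 27x^2) was derived here from the first three Taylor terms of asin(x) purely for demonstration.

```cpp
#include <cmath>

// Taylor polynomial of asin(x) around 0, truncated after the x^5 term:
//   asin(x) ~= x + x^3/6 + 3x^5/40
// Evaluated in Horner form on u = x^2.
double taylor5_asin(double x) {
    const double u = x * x;
    return x * (1.0 + u * (1.0 / 6.0 + u * (3.0 / 40.0)));
}

// [3/2] Pade approximant built from the same three Taylor coefficients:
//   asin(x) ~= x * (60 - 17x^2) / (60 - 27x^2)
// A single division takes the place of additional polynomial terms.
// (Illustrative low-order approximant; not the one from the article.)
double pade_asin(double x) {
    const double u = x * x;
    return x * (60.0 - 17.0 * u) / (60.0 - 27.0 * u);
}
```

At x = 0.5 the Padé form is off by roughly 1.2e-4, against about 4.2e-4 for the Taylor polynomial built from the same coefficients. Both degrade sharply as |x| approaches 1, so a sketch like this would still need a fallback or a different formula near the ends of the domain.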
Evidence
- The author shared benchmark results comparing their Padé approximation against std::asin() on multiple platforms, showing 2-3x speedup.
- HN commenters with numerical methods background discussed the tradeoffs between Taylor, Chebyshev, and Padé approximations for different domains.
- Some noted that SIMD-vectorized versions of these approximations (using AVX2/NEON) can push even larger speedups.
- Others pointed out that the wins depend heavily on the CPU microarchitecture and on the math library in use. Note that x86 has no arcsine instruction (the x87 FPU offers only FPATAN), so asin() is computed in software either way.
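A comparison like the one reported above could be set up with a small timing harness. The sketch below is an assumption about how such a measurement might look, not the author's benchmark: `BenchResult`, `time_asin`, and `make_inputs` are hypothetical names, and a single pass timed with `std::chrono::steady_clock` is only a rough indicator.

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

struct BenchResult {
    double seconds;   // wall-clock time for one pass over the inputs
    double checksum;  // sum of all results, returned so the loop is not optimized away
};

// Times one pass of `f` over `inputs`. `f` is any callable double -> double.
template <typename F>
BenchResult time_asin(F f, const std::vector<double>& inputs) {
    const auto start = std::chrono::steady_clock::now();
    double sum = 0.0;
    for (double x : inputs) {
        sum += f(x);
    }
    const auto stop = std::chrono::steady_clock::now();
    return {std::chrono::duration<double>(stop - start).count(), sum};
}

// Evenly spaced inputs covering [-1, 1]. Requires n >= 2.
std::vector<double> make_inputs(std::size_t n) {
    std::vector<double> v(n);
    for (std::size_t i = 0; i < n; ++i) {
        v[i] = -1.0 + 2.0 * static_cast<double>(i) / static_cast<double>(n - 1);
    }
    return v;
}
```

Because `std::asin` is overloaded, it is easiest to pass it through a lambda, for example `time_asin([](double x) { return std::asin(x); }, make_inputs(1000000))`. Repeating the measurement several times and comparing against the same inputs for the candidate approximation gives a more trustworthy ratio than a single run.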
How to Apply
- If you're calling math functions millions of times per frame, profile to confirm they're actual bottlenecks before optimizing — the standard library is often fast enough.
- When you do need a custom approximation, Padé approximations generally outperform Taylor series for the same computation cost when you need accuracy across a wider domain.
- Consider the precision requirements carefully: graphics often tolerates 1e-6 relative error, physics simulation might need 1e-10 — match the approximation to the requirement.
- Check if SIMD-vectorized math libraries (Intel SVML, Sleef) already provide what you need before writing your own approximation.
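Matching an approximation to a precision requirement means measuring its error first. The sketch below samples an approximation against `std::asin` on an even grid and reports the worst absolute error (not relative error); `max_abs_error` and `meets_tolerance` are hypothetical helper names, and a sampled grid can miss narrow error peaks, so treat the result as an estimate.

```cpp
#include <algorithm>
#include <cmath>

// Largest absolute deviation from std::asin over `samples` evenly spaced
// points in [lo, hi]. Requires samples >= 2 and -1 <= lo < hi <= 1.
template <typename F>
double max_abs_error(F approx, double lo, double hi, int samples) {
    double worst = 0.0;
    for (int i = 0; i < samples; ++i) {
        const double t = static_cast<double>(i) / static_cast<double>(samples - 1);
        // Clamp so rounding can never push x past hi (std::asin returns NaN outside [-1, 1]).
        const double x = std::min(hi, lo + (hi - lo) * t);
        worst = std::max(worst, std::fabs(approx(x) - std::asin(x)));
    }
    return worst;
}

// True when the approximation stays within `tolerance` on the sampled grid.
template <typename F>
bool meets_tolerance(F approx, double lo, double hi, double tolerance,
                     int samples = 100001) {
    return max_abs_error(approx, lo, hi, samples) <= tolerance;
}
```

A check such as `meets_tolerance(candidate, -1.0, 1.0, 1e-6)` can then gate whether a faster approximation is acceptable for a given workload, where `candidate` stands for whichever approximation is under test.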
Code Example
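// fast_asin based on the NVIDIA Cg library (Stack Overflow: https://stackoverflow.com/a/26030435)
// Original source: Hastings 1955 / Abramowitz & Stegun 4.4.45
#include <cmath>

float fast_asin(float x) {
    // 1.0f for negative inputs, 0.0f otherwise; restores the sign without a branch.
    const float negate = float(x < 0);
    x = std::fabs(x);
    // Cubic polynomial in |x|, evaluated with Horner's scheme.
    float ret = -0.0187293f;
    ret *= x;
    ret += 0.0742610f;
    ret *= x;
    ret -= 0.2121144f;
    ret *= x;
    ret += 1.5707288f;
    // asin(|x|) ~= pi/2 - sqrt(1 - |x|) * poly(|x|)
    ret = 3.14159265358979f * 0.5f - std::sqrt(1.0f - x) * ret;
    // For negative inputs this evaluates to ret - 2 * ret == -ret.
    return ret - 2 * negate * ret;
}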
// Taylor series approximation with four correction terms, through x^9
// (author's own implementation, ~5% improvement, used only for |x| <= 0.8).
// Requires <cmath> for the std::asin fallback.
double _asin_approx_private(const double x) {
    if ((x < -0.8) || (x > 0.8)) {
        return std::asin(x);  // fallback: the truncated series is too inaccurate near +/-1
    }
    // Series coefficients 1/2, 3/8, 5/16, 35/128, each built from the previous one.
    constexpr double a = 0.5;
    constexpr double b = a * 0.75;
    constexpr double c = b * (5.0 / 6.0);
    constexpr double d = c * (7.0 / 8.0);
    // Odd powers of x divided by their exponent: x^3/3, x^5/5, x^7/7, x^9/9.
    const double aa = (x * x * x) / 3.0;
    const double bb = (x * x * x * x * x) / 5.0;
    const double cc = (x * x * x * x * x * x * x) / 7.0;
    const double dd = (x * x * x * x * x * x * x * x * x) / 9.0;
    return x + (a * aa) + (b * bb) + (c * cc) + (d * dd);
}
Related Resources
- Original: Faster asin() Was Hiding In Plain Sight
- Stack Overflow: Fast asin approximation (NVIDIA CG-based)
- Introduction to Chebyshev Approximation (embeddedrelated.com)
- Remez Algorithm (Wikipedia)
- Robin Green: Faster Math Functions (GDC slide deck 1)
- iquilezles: Avoiding Trigonometry (noacos)
- glibc asin implementation source
- Intel AVX-512 asin intrinsic guide