Faster asin() was hiding in plain sight
TL;DR Highlight
While optimizing a ray tracer, the author hand-implemented asin() using Taylor series and Padé approximation — faster and more accurate than the system library version.
Who Should Read
Graphics programmers, game engine developers, and performance engineers who care about floating-point performance and numerical methods.
Core Mechanics
- For ray tracing performance, the author needed a fast asin() (arcsine) function and found the system math library version was slower than necessary for their precision requirements.
- Taylor series approximation: expanding asin(x) around x=0 gives a polynomial that's fast to compute but loses accuracy near x=±1.
- Padé approximation: a rational function (polynomial/polynomial) that provides better accuracy than Taylor series for the same number of terms, especially near the boundaries of the domain.
- The custom implementation was faster than the system asin() while meeting the ray tracer's precision needs — the key insight being that ray tracers often don't need full IEEE 754 double precision.
- This kind of micro-optimization matters when asin() is called millions of times per frame in tight inner loops.
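To make the Taylor-vs-Padé tradeoff concrete, here is a minimal sketch (not the author's actual coefficients) comparing a degree-5 Taylor polynomial with the [3/2] Padé approximant built from the same series data, asin(x) ≈ x(1 − (17/60)x²) / (1 − (9/20)x²):

```cpp
#include <cmath>

// Degree-5 Taylor polynomial: asin(x) ≈ x + x^3/6 + 3x^5/40.
double asin_taylor5(double x) {
    const double x2 = x * x;
    return x * (1.0 + x2 * (1.0 / 6.0 + x2 * (3.0 / 40.0)));
}

// [3/2] Pade approximant matched to the same Taylor series through the x^5 term:
// one division buys noticeably better accuracy toward the ends of the domain.
double asin_pade32(double x) {
    const double x2 = x * x;
    return x * (1.0 - (17.0 / 60.0) * x2) / (1.0 - (9.0 / 20.0) * x2);
}
```

Both functions use the same series data through x⁵, yet at x = 0.9 the Padé form is roughly twice as accurate as the plain polynomial, which is exactly the behavior described above.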
Evidence
- The author shared benchmark results comparing their Padé approximation against std::asin() on multiple platforms, showing 2-3x speedup.
- HN commenters with numerical methods background discussed the tradeoffs between Taylor, Chebyshev, and Padé approximations for different domains.
- Some noted that SIMD-vectorized versions of these approximations (using AVX2/NEON) can push even larger speedups.
- Others cautioned that the measured wins depend heavily on the CPU microarchitecture and the math library in use; x86 has no dedicated hardware asin instruction, so std::asin performance itself varies widely across platforms.
How to Apply
- If you're calling math functions millions of times per frame, profile to confirm they're actual bottlenecks before optimizing — the standard library is often fast enough.
- When you do need a custom approximation, Padé approximations generally outperform Taylor series for the same computation cost when you need accuracy across a wider domain.
- Consider the precision requirements carefully: graphics often tolerates 1e-6 relative error, physics simulation might need 1e-10 — match the approximation to the requirement.
- Check if SIMD-vectorized math libraries (Intel SVML, Sleef) already provide what you need before writing your own approximation.
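The precision advice above is easy to act on: before committing to a custom approximation, scan its worst-case error against the library function over the exact input range you use. A minimal harness (the sample count and the degree-9 Taylor candidate are illustrative, not the article's code):

```cpp
#include <algorithm>
#include <cmath>

// Maximum absolute error of approx() vs std::asin over [lo, hi],
// sampled on an evenly spaced grid.
template <typename F>
double max_abs_error(F approx, double lo, double hi, int samples = 10000) {
    double worst = 0.0;
    for (int i = 0; i <= samples; ++i) {
        const double x = lo + (hi - lo) * i / samples;
        worst = std::max(worst, std::fabs(approx(x) - std::asin(x)));
    }
    return worst;
}

// Example candidate: degree-9 Taylor polynomial for asin.
double asin_taylor9(double x) {
    const double x2 = x * x;
    return x * (1.0 + x2 * (1.0 / 6.0 + x2 * (3.0 / 40.0
               + x2 * (15.0 / 336.0 + x2 * (105.0 / 3456.0)))));
}
```

For this candidate the error is roughly 1e-5 at |x| = 0.5 but grows to a few times 1e-3 by |x| = 0.8; numbers like these are what you compare against your error budget before choosing an approximation.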
Code Example
// NVIDIA CG library's fast asin (via Stack Overflow: https://stackoverflow.com/a/26030435)
// Original source: Hastings 1955 / Abramowitz & Stegun 4.4.45
#include <cmath>

float fast_asin(float x) {
    const float negate = float(x < 0);  // asin is odd: remember the sign, work on |x|
    x = std::fabs(x);                   // fabs, not int abs(), which would truncate
    // Degree-3 polynomial fit, evaluated by Horner's rule
    float ret = -0.0187293f;
    ret *= x;
    ret += 0.0742610f;
    ret *= x;
    ret -= 0.2121144f;
    ret *= x;
    ret += 1.5707288f;
    ret = 3.14159265358979f * 0.5f - std::sqrt(1.0f - x) * ret;
    return ret - 2.0f * negate * ret;   // equals -ret when the input was negative
}
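Abramowitz & Stegun quote |ε| ≤ 5·10⁻⁵ for this fit; a quick self-contained check against std::asin (the function is reproduced here, in Horner form, so the snippet compiles on its own) confirms the error stays at that scale across [-1, 1]:

```cpp
#include <algorithm>
#include <cmath>

// Hastings / A&S 4.4.45 fit, reproduced from the snippet above.
float fast_asin(float x) {
    const float negate = float(x < 0);
    x = std::fabs(x);
    float ret = -0.0187293f;
    ret = ret * x + 0.0742610f;
    ret = ret * x - 0.2121144f;
    ret = ret * x + 1.5707288f;
    ret = 3.14159265358979f * 0.5f - std::sqrt(1.0f - x) * ret;
    return ret - 2.0f * negate * ret;
}

// Worst-case absolute error against the library asin over [-1, 1].
float fast_asin_max_error() {
    float worst = 0.0f;
    for (int i = -1000; i <= 1000; ++i) {
        const float x = static_cast<float>(i) / 1000.0f;
        worst = std::max(worst, std::fabs(fast_asin(x) - std::asin(x)));
    }
    return worst;
}
```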
// Degree-9 Taylor polynomial for asin (first five terms). The author reports a ~5%
// speedup over std::asin; accuracy holds only on roughly [-0.8, 0.8], with a
// library fallback outside that range. Needs <cmath> for std::asin.
double _asin_approx_private(const double x) {
    if ((x < -0.8) || (x > 0.8)) {
        return std::asin(x);  // the series loses accuracy near ±1
    }
    // asin(x) = x + x^3/6 + 3x^5/40 + 15x^7/336 + 105x^9/3456 + ...
    // Each coefficient follows from the previous one via a double-factorial recurrence.
    constexpr double a = 0.5;              // 1/2
    constexpr double b = a * 0.75;         // 3/8
    constexpr double c = b * (5.0 / 6.0);  // 15/48
    constexpr double d = c * (7.0 / 8.0);  // 105/384
    const double x2 = x * x;
    const double x3 = x * x2;
    const double aa = x3 / 3.0;             // x^3/3
    const double bb = x3 * x2 / 5.0;        // x^5/5
    const double cc = x3 * x2 * x2 / 7.0;   // x^7/7
    const double dd = x3 * x2 * x2 * x2 / 9.0;  // x^9/9
    return x + (a * aa) + (b * bb) + (c * cc) + (d * dd);
}

Terminology
Related Papers
Training an LLM in Swift, Part 1: Taking matrix mult from Gflop/s to Tflop/s
A detailed walkthrough of hand-writing matrix multiplication kernels in Swift on Apple Silicon, optimizing step by step across CPU, SIMD, AMX, and GPU (Metal) to push performance from Gflop/s to Tflop/s. A rare resource for developers who want to implement the core computation of LLM training from scratch, without frameworks, and get a feel for Apple Silicon's performance limits.
Removing fsync from our local storage engine
FractalBits shares the design of an SSD-only key-value storage engine built without fsync, achieving roughly 65% higher write performance under identical conditions. The core idea is a combination of preallocation, O_DIRECT, and a journal aligned to the SSD's atomic write unit, which avoids fsync's metadata overhead.
Google Chrome silently installs a 4 GB AI model on your device without consent
Google Chrome was found to silently download the 4 GB Gemini Nano model file without user consent, and to re-download it even after deletion. Commenters raise possible GDPR violations and the environmental cost of doing this across billions of devices.
How OpenAI delivers low-latency voice AI at scale
OpenAI redesigned its WebRTC stack to serve real-time voice AI to over 900 million users, detailing the design decisions and trade-offs of a relay + transceiver split architecture.
Efficient Test-Time Inference via Deterministic Exploration of Truncated Decoding Trees
Deterministic Leaf Enumeration (DLE) cuts self-consistency’s redundant sampling by deterministically exploring a tree of possible sequences, simultaneously improving math/code reasoning performance and speed.
Related Resources
- Original: Faster asin() Was Hiding In Plain Sight
- Stack Overflow: Fast asin approximation (NVIDIA CG-based)
- Introduction to Chebyshev Approximation (embeddedrelated.com)
- Remez Algorithm (Wikipedia)
- Robin Green: Faster Math Functions (GDC slide deck 1)
- iquilezles: Avoiding Trigonometry (noacos)
- glibc asin implementation source
- Intel AVX-512 asin intrinsic guide