Show HN: We fingerprinted 178 AI models' writing styles and similarity clusters
TL;DR Highlight
This study quantified the writing styles of 178 AI models along 32 dimensions and measured their similarity, finding that some model pairs with large price differences share over 78% of their writing patterns.
Who Should Read
Developers deciding which AI model to use, or ML engineers researching AI-generated text detection or model selection criteria.
Core Mechanics
- The writing styles of a total of 178 AI models were quantified into 32 dimensions and their similarities were measured and clustered.
- The analysis revealed cases where models with very large price differences had over 78% overlap in writing style. For example, Gemini 2.5 Flash Lite Preview 06-17 and Claude 3 Opus recorded a similarity of 78.2%.
- The research team argued that when a cheaper model has the same writing style as an expensive one, the price difference amounts to 'only paying for the brand.'
- Cluster analysis offers clues as to whether a model borrowed parameters from another model or underwent a distillation process (a technique for transferring knowledge from a large model to a smaller one).
- The original website is blocked by Vercel security checkpoints, making direct verification difficult, but key claims and some methodologies of the study have been shared through community comments.
- Discussions also arose regarding whether unique patterns appearing in the writing of models (e.g., the use of '--' symbols) are natural byproducts of the RL (reinforcement learning) process, or intentionally inserted watermarks.
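The pipeline described above (style features → similarity → clusters) can be sketched roughly as follows. This is a minimal illustration, not the study's actual method: the four features, the sample texts, and the choice of cosine similarity are assumptions standing in for the undisclosed 32 dimensions.

```python
import re
from math import sqrt

def style_vector(text):
    """Map a text onto a few illustrative style dimensions.
    (The study used 32 dimensions; these four are hypothetical stand-ins.)"""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = text.split()
    n_words = max(len(words), 1)
    return [
        len(words) / max(len(sentences), 1),   # mean sentence length
        sum(len(w) for w in words) / n_words,  # mean word length
        text.count(",") / n_words,             # comma density
        text.count("--") / n_words,            # '--' density (the dash tic)
    ]

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Toy samples standing in for outputs from three models
samples = {
    "model_a": "Certainly -- here is a concise answer. It covers the key point, briefly.",
    "model_b": "Sure -- here is a short answer. It addresses the main point, quickly.",
    "model_c": "The committee convened. Deliberations were extensive and thorough and long.",
}

vecs = {name: style_vector(t) for name, t in samples.items()}
sim_ab = cosine(vecs["model_a"], vecs["model_b"])  # stylistically close pair
sim_ac = cosine(vecs["model_a"], vecs["model_c"])  # stylistically distant pair
```

From pairwise similarities like these, the study's clusters would follow from any standard grouping step (e.g. agglomerative clustering over the similarity matrix); the thresholds it used are not disclosed.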
Evidence
- There was strong opposition to the claim that models can be substituted simply because their writing styles are similar. Users who had actually used several models pointed out that 'even if the styles are similar, there is a clear difference in how well they understand my intent; what you are really paying for is intelligence, not writing style.'
- Doubts about the reliability of the methodology were also raised. It is unclear whether the 32 dimensions used were derived through principal component analysis (PCA) or arbitrarily selected, and the criterion of '75% similarity = same writing' was criticized as being arbitrary and without justification. There were also criticisms that there was a lack of linguistic theoretical basis.
- The lack of disclosure of prompts and actual response content was criticized for making it impossible to verify the numbers. There was an opinion that 'benchmarks should show both prompts and responses to be meaningful, but without them, it's just numbers.'
- Interesting speculation emerged about whether the frequent appearance of unique symbols like '--' when using Claude or ChatGPT is a byproduct of RL training, or an intentional fingerprint to prevent AI-generated text from being re-introduced into the training data (model collapse).
- Experiences comparing the actual frequency of hallucinations across models were also shared. One user reported that, in their experience, Gemini hallucinated less than OpenAI's or Anthropic's paid models, and speculated that this may be due to Google's better training data or heavier use of RAG (Retrieval-Augmented Generation).
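The '--' speculation above suggests a simple check anyone can run: count candidate marker tokens per 1,000 words of a model's output. A minimal sketch; the marker list here is hypothetical, not taken from the study.

```python
def marker_rates(text, per=1000):
    """Occurrences of each candidate fingerprint marker per `per` words.
    MARKERS is an illustrative guess, not a verified fingerprint set."""
    markers = ["--", "Certainly", "delve"]
    n_words = max(len(text.split()), 1)
    return {m: text.count(m) * per / n_words for m in markers}

# Toy sample standing in for a model's output
sample = ("Certainly -- let's delve into the details. "
          "The answer -- in short -- is nuanced.")
rates = marker_rates(sample)  # e.g. rates["--"] gives '--' per 1,000 words
```

Comparing such rates across models (or against human-written text) is the simplest form of the fingerprinting the study describes; whether elevated rates come from RL training or deliberate watermarking is exactly the open question in the thread.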
How to Apply
- If you are reviewing several models and cost reduction is your goal, do not decide on a substitute model based solely on writing style similarity, but conduct direct A/B tests on actual tasks (inference, coding, fact retrieval, etc.). Even if the styles are the same, the actual capabilities may differ significantly.
- If you are creating an AI-generated text detection system, you can extract specific writing patterns of a model (repeating symbols, sentence structure, etc.) as features, as in this study, and use them for model fingerprinting.
- If you want to monitor how much the writing style of a model distilled or fine-tuned in your service has changed from the original base model, you can refer to the 32-dimensional style analysis method of this study and build a pipeline to quantify style drift.
- If you need to share model selection decisions within your team, treat 'writing style similarity' as just one reference indicator; build a comprehensive evaluation table with actual performance metrics (accuracy, hallucination frequency, response speed) and use that as the basis for the decision.
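The style-drift monitoring idea above can be sketched as a comparison of mean style vectors between base-model and fine-tuned outputs. Everything here (the three features, the per-dimension drift metric) is a hypothetical stand-in for the study's 32-dimension analysis.

```python
import re

def style_vector(text):
    """Tiny illustrative stand-in for the study's 32 style dimensions."""
    words = text.split()
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n = max(len(words), 1)
    return {
        "mean_sentence_len": len(words) / max(len(sentences), 1),
        "comma_density": text.count(",") / n,
        "dash_density": text.count("--") / n,
    }

def style_drift(base_outputs, tuned_outputs):
    """Per-dimension absolute difference between the mean style vectors
    of two output corpora (a simple, hypothetical drift metric)."""
    def mean_vec(texts):
        vecs = [style_vector(t) for t in texts]
        return {k: sum(v[k] for v in vecs) / len(vecs) for k in vecs[0]}
    b, t = mean_vec(base_outputs), mean_vec(tuned_outputs)
    return {k: abs(b[k] - t[k]) for k in b}

# Toy corpora standing in for sampled base-model and fine-tuned outputs
base = ["Short reply. Another short reply.", "Yes. It works."]
tuned = ["A much longer, more elaborate reply that keeps going -- with flourishes.",
         "Indeed -- the tuned model writes longer, comma-heavy sentences."]
drift = style_drift(base, tuned)  # large values flag dimensions that moved
```

Run on fixed prompt sets before and after fine-tuning, a report like this makes style drift a tracked number rather than an impression.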
Terminology
fingerprinting: A technique for numerically measuring the uniquely repeating writing habits or patterns of each model, enabling identification of which model wrote a given text.
distillation: A technique for training a small model (student) on the knowledge of a large model (teacher), maintaining similar performance while reducing model size.
RL: Reinforcement Learning. A method of training AI by rewarding good responses and penalizing bad ones. ChatGPT, Claude, and others are aligned to human preferences this way.
model collapse: A phenomenon in which diversity decreases and specific patterns grow stronger as AI-generated text is re-uploaded to the internet and used as training data for the next generation of models.
cluster: A group of models automatically grouped together by similar writing characteristics. Models in the same cluster are likely to share similar training data or training methods.
RAG: Retrieval-Augmented Generation. A method in which the model searches external documents or databases in real time when answering. It can use up-to-date information not in the model weights and helps reduce hallucination.