Benchmarking Headphones

Whenever I publish a headphone or source equipment review, the most common question I get is something along the lines of “how much better is it than X?”, where X is either something the person already owns or something they’re considering instead. For headphone reviewers, it’s hard to know where to begin with this type of question, because “better” is rarely defined by the person asking. At the same time, for anyone trying to get the most accurate evaluations, it must be frustrating to rely on a multitude of occasionally contrasting subjective impressions from reviewers.

Maybe certain things like detail retrieval or stage could be given a relative comparison (like “headphone A is better than headphone B at detail retrieval”), but there seems to be a consistent desire to be able to benchmark headphones on a numerically represented sliding scale or similarly satisfying conclusion. Most people want me to give an answer like “5% better”, or “20% better” and so on, so the person gets a sense of whether or not it’s worth it for them. I’ve even seen questions on source equipment to the effect of “can I expect a 20%, 30%, or higher improvement over my phone?”. While I’ve tried to give numerical values for my evaluations (at least for headphones), I’m still at a bit of a loss as to how to answer these questions.

I think this notion likely comes from the clearly demonstrable improvements that you can talk about and even represent in this way in other areas. We’re used to being able to do this with PC components - CPUs, GPUs, and so on. We can say things like “this laptop is 30% faster than that laptop” (in whatever benchmark tests we’ve run), and have it make sense in a way that similar statements about headphones wouldn’t. If it doesn’t come from benchmarks in the tech world, I’m sure you could point to any number of other hobbies and interests (cars, for example) where those kinds of statements make sense - where the improvements are measurable and tangible in much more straightforward ways.

But that also doesn’t mean there’s nothing we can anchor headphone and source evaluations to in some fashion. I’m not suggesting headphone appreciation is as much a matter of taste as many of us think it is, in the way the traditional view treats food or wine appreciation (this is a very interesting topic, but it’s way too deep to go into here; I think there’s a more nuanced view of this stuff that lines up more closely with headphone evaluations). There are tangible improvements that show up when going from an M50x to a HiFiMAN Susvara, for example, that don’t really depend on taste - both when it comes to the experience and when it comes to what’s happening physically with the technology.

At the very least we can talk about deviations from frequency response targets. Certain review sites like RTings have devised index scores that are based on the frequency response’s adherence to or deviation from target curves. In my opinion, this didn’t work out so well in practice at first, because too many other factors were included, and this skewed the results into counter-examples that showed crazy things like the Stax L300 being equivalent to the Bose QC35ii. Perhaps because of this problem, their process has since been improved so that the score reflects the use case, which yields much more useful scores. But the idea in principle - the idea of creating some kind of index score that you can use to get more than just a relative sense of sound quality - is a good one. It’s something that a lot of prospective headphone enthusiasts who don’t want to bother reading and comparing all the subjective reviews out there would find useful.

So at the very least, we could reduce frequency response deviations to a numerical value or index of some kind. But this also leaves out other important factors. At the moment, there’s no way to identify things like detail retrieval in measurements (Sean Olive may disagree, I’ll leave the door open to that). It may very well be the case that technical characteristics like detail retrieval and macrodynamics - even soundstage - could be captured by the more fine-grained frequency response measurements, but we don’t know where to reliably look for it yet. As I’ve said in the past, we can’t hold up a frequency response graph and say “here’s the detail”. And this unfortunately means our index scores (or any comparable percentage result) would be incomplete.
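As a rough illustration of what reducing frequency response deviation to a single number could look like, here is a minimal Python sketch. It is not the method used by RTings or any other site: the function name `deviation_score`, the 20 Hz to 20 kHz band, and the choice of a level-matched RMS error are all assumptions made for the example, and it assumes both curves are sampled at the same (ideally log-spaced) frequencies.

```python
import math


def deviation_score(freqs_hz, measured_db, target_db, lo_hz=20.0, hi_hz=20000.0):
    """Return the RMS deviation (in dB) of a measured response from a target.

    Hypothetical helper for illustration only. The average level offset is
    removed first, so a headphone that is simply louder or quieter than the
    target is not penalised; only the shape of the curve counts.
    """
    if not (len(freqs_hz) == len(measured_db) == len(target_db)):
        raise ValueError("all three sequences must have the same length")

    # Keep only the points inside the band of interest.
    diffs = [
        m - t
        for f, m, t in zip(freqs_hz, measured_db, target_db)
        if lo_hz <= f <= hi_hz
    ]
    if not diffs:
        raise ValueError("no measurement points inside the requested band")

    # Level-match: subtract the mean difference before measuring the error.
    offset = sum(diffs) / len(diffs)
    return math.sqrt(sum((d - offset) ** 2 for d in diffs) / len(diffs))
```

A lower number means closer adherence to the target. Nothing in a score like this says anything about detail retrieval, macrodynamics, or stage, which is exactly the gap described above.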

Many reviewers, myself included, do provide numerical scores (or grades, the way Crinacle does) to help make comparing different equipment a bit easier. But this is at best still subjective benchmarking. It’s all just based on comparing different equipment and determining which sounds more ‘clear’ or identifying winners between compared equipment along given performance dimensions. Maybe this is all readers are looking for, but I think this type of evaluation still misses some important pieces to be able to say one headphone is a given percentage better or worse than another.

I may have provided answers to that effect in the past, but I tend to think any answer that reduces sound quality improvements to a percentage or some kind of scaled value depends on two important coefficients:

  1. What you’re used to
  2. How much you care

In other words, statements made in evaluations that include the subjective benchmarking that gets done by reviewers like myself and others, where we do provide numerical or graded scores, contain the implicit assumption “if you care about this stuff the way I do”. And I’d like to think that reviewers in general do care a lot about this stuff - even the ones we may disagree with. Whether it’s me, Josh, Metal, Max, Crin, etc., I tend to think everyone has the same passion for getting their music represented in the best and most enjoyable way possible.

If you’re a prospective buyer who hasn’t had a chance to experience the variety of equipment reviewers are fortunate enough to hear, and you’re used to an M50x or similar equipment, and that’s been your expectation of audio quality in general, going up to something like the Susvara or Focal Utopia is going to reveal massive differences. But that alone can’t determine the degree of improvement or “whether it’s worth it” for the listener.

If you care a lot, the difference between an M50x and a Susvara is going to be way more significant than for someone who doesn’t really care at all. I’ve met people who put on flagship headphones and go “it sounds fine”, but their reactions don’t really indicate they’re as impressed as someone who really cares a lot about how their music is represented. And this, to me, is the most important factor.

Perhaps a more realistic example is the difference between the Focal Clear and the Focal Utopia. To me (and many enthusiasts), this comparison reveals a substantial difference in terms of image clarity and overall fidelity, more so than between the Clear and the Elear. But there are those - even within the audiophile community - who aren’t as taken by that difference. This means they’re able to enjoy the Clear’s tonality over the Utopia, in spite of the Utopia’s technical advantages. In this case, it doesn’t necessarily mean they don’t care about their music as much, it just means they value tonality over image clarity and detail.

But I still think the bottom line for answering the initial question of “what percentage better is this headphone than that headphone”, or “how much improvement should I expect to hear”, ultimately comes down to how much you care about how well your music is represented, and then identifying which dimensions of headphone performance are going to yield the best representation for you. Is it detail? Tonality? Soundstage? Timbre? The very real differences that exist may only be perceived as a small incremental improvement by some, but they might mean the world to others.

16 Likes

You’ve met my wife then? :wink:

7 Likes

This is a nice write-up. There are mountains of egg-head philosophy and consciousness theory papers trying to understand qualia, the role of the perceiver and the perceived, and aesthetics.

My take-away after years of struggle and pondering: enjoy life! It’s short!

7 Likes

Very interesting post. This is why I take things with a grain of salt and try to make my own conclusion based on the review, my experience, and what I like/or look for from said equipment… Kinda like conspiracy theory… It might all be BS, but at the end you are free to make your own conclusion out of the BS you heard.

I have learned to trust your reviews more than others because you share a similar signature to what I look for, BUT I still look for that extra mile that only you yourself can know to look for… At the end of the day, you really have to like what you are looking for. If not, it’s just going to be… Mehhh

I’ve been burned many times looking for something that had been misrepresented, but that’s part of the learning curve and process.

We wouldn’t be here if we weren’t looking for something in our Audiophile hobby :smiley:

6 Likes

I don’t think you can put any meaningful objective/numerical quantifier on something that you cannot directly map to a repeatable, objective, measurement.

Even among those that try, you nearly always find it’ll start at something like a 10 point scale, and then after a handful of assessments, it’s suddenly getting sub-divided into halves or tenths of a point … and maintaining consistency even on a 10 point scale is just about impossible due to the raw number of variables involved (physical and not).

Which isn’t to say there isn’t a way to map between a suitable sample of subjective assessments and preferences across a suitable sample of evaluators (formal or otherwise)*. Though this is ultimately constrained by the consistency of the evaluators’ responses to a given item. And, of course, there is no current data set available that would support this.


*In fact I just completed paperwork for a patent filing on just such a system and method, and means for accumulating the necessary data to operate it.

7 Likes

I wouldn’t call it all ‘egg-head’ necessarily haha. It gets a bit into the whole ‘matter of taste’ thing that I mentioned. But I tend to think aesthetics is all reducible as well - even if it’s not yet clear. So we might not know why we prefer one thing over another, but presumably there is a reason. Maybe that’s physiological, maybe cultural. But just because differences in preference exist, doesn’t mean they’re not grounded in something more substantial.

1 Like

Yeah and with headphones, when we are doing ‘subjective benchmarking’, it’s always relative to other headphones or experiences we’ve had. It’s not independently generated.

2 Likes

You could, fairly easily, control for a baseline headphone for all other comparisons (getting agreement on the baseline is vastly harder than the control for it).
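A minimal sketch of what controlling for a baseline could look like in practice, assuming each evaluator scores the agreed baseline headphone alongside everything else. The function name `relative_to_baseline` and the nested-dictionary layout are hypothetical, chosen only for this example; no real data set is implied.

```python
from statistics import mean, stdev


def relative_to_baseline(ratings, baseline):
    """Express each headphone's score as a difference from the baseline.

    ratings:  {evaluator: {headphone: score}} (hypothetical layout)
    baseline: name of the reference headphone every evaluator should score
    Returns   {headphone: (mean_delta, spread, n_evaluators)}
    """
    deltas = {}
    for scores in ratings.values():
        if baseline not in scores:
            # Without a baseline score this evaluator cannot be anchored.
            continue
        base = scores[baseline]
        for headphone, score in scores.items():
            if headphone != baseline:
                deltas.setdefault(headphone, []).append(score - base)

    # The spread (sample standard deviation) shows how much evaluators
    # disagree; it is reported as 0.0 when only one evaluator is available.
    return {
        headphone: (mean(d), stdev(d) if len(d) > 1 else 0.0, len(d))
        for headphone, d in deltas.items()
    }
```

Subtracting each evaluator’s own baseline score removes differences in how generously people use the scale, but the remaining spread still carries everything else that varies between listeners.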

What’s harder still to control for are all the other variables - not the least of which is that people’s hearing responses diverge significantly.

An 18 year old that’s never been subjected to loud noises (e.g. has never even been to a live concert) will have a very different hearing response to a 50 year old that spent their teenage years moshed up against the speaker arrays at Death Metal concerts and goes shooting without hearing protection.

Though given the preponderance of Beyerdynamic headphones as the “End Game” choice for “systems” on a certain site, and their high incidence of being positioned that way by pre-college-grad-age listeners, I’m fairly certain most of them are either already suffering from extreme high-frequency hearing loss or they all just like massive amounts of treble.

5 Likes

I think the quote I came up with this morning fits this…

“Objectively measured, subjectively enjoyed” ~TylerM

:face_with_monocle::nerd_face:

Also, the things that make up one’s opinion about something are pretty broad; then throw in groupthink or “sheep” mentality and it makes it very hard to measure what people actually like… or quantify it accurately

But I’m no scientist🙃

3 Likes

As a self-deprecating former total egg-head now in a “12 Step Program” to escape egg-headedness…I’d still call myself dangerously close to being an egg-head, but I use the term harmlessly. In essence, some people tend to obsess over perfect explanations when no perfect explanation is possible or will ever be possible. Most of the world simply calls them “professors” or “intellectuals.” :expressionless: :crazy_face:

Ohhhhh, fighting words in the egg-head land! Get your armor and your weapons ASAP! The pure reductionists think this is possible, but there are arguments as to why it’s not (e.g., emergent properties that require a certain level of complexity to exist at all. Consider the societal changes following the introduction of: 1. the printing press, 2. radio, 3. movies (and talkies), 4. television, 5. the Internet, and 6. smart phones with 24/7 social media. Each era was characterized by quantum shifts and was indescribable in the language of the prior era. Those born today see only the end product, without ever learning how to be alone, ever learning how to read a map, ever needing to choose how to spend that $15 on a potentially garbage CD instead of having access to unlimited music. To ‘reduce’ social media natives is to destroy what defines them. - written in typical egg-head fashion where the footnote is longer than the statement).

This is the “McDonald’s is just fine” phenomenon in a new domain. Those on a budget and without experience cannot care, and literally cannot hear until their perceptual systems are trained to hear. See my philosophy links above. Biological differences play a substantial role – I myself reached a “good enough” plateau with my HD-600s that lasted for years. So many subjective/objective discussions don’t consider that people can and do change and grow with experience. Your views follow from your personal history, as impressed on either a good or bad version of human anatomy and physiology (aka cognitive psychology and the philosophy of mind).

So much egg-head residue in my head…and it’s Friday beer time!

5 Likes

It’s okay, I’ve got my egg-head armor and weapons out at all times haha.

My intuitions in general about this stuff align most closely to the views of a perceptualist, externalist, physicalist, monist, subject-sensitive invariantist (to join a long string of ‘isms’). Although I’m somewhat in opposition to strong versions of representationalism. My take on theory of mind is that there’s something true about representationalism, at least insofar as one is a phenomenal externalist as well. So in this case, we could say the enjoyment had from a given headphone experience is the mental representational content of some physical thing going on, regardless of whether we know what that is. And I don’t think this is as much of an emergent property as it is a necessary component of experience (not just for headphones).

But as much as that’s the line I tend to draw, I’ve always preferred to answer questions of qualia more in terms of their epistemological content (kind of like Jackson’s “what Mary didn’t know”). We can look at graphs until we’re blue (pun intended…) in the face, learn everything there is to know about it, but when we actually experience the headphones, we do get something new. So I suppose for anyone who hasn’t had that experience, it’d be impossible to communicate it in any way that didn’t require a kind of emergent property in the individual.

As for the McDonald’s aesthetics argument, I think that while what you say is true about new experiences changing people, this may be where the normative reductionist position gains a bit of ground back. Not only could the difference at least be explained in terms of the grounding properties (physiological, cultural etc.), even in the absence of the “Mary didn’t know” experiential content, but it also doesn’t require the person on a budget to be incapable of those experiences, or to have never had them in other areas.

Apologies to everyone for the egg-headery here.

3 Likes

Just some random beer-inspired (or deformed) observations, not criticism here…

You speak with the voice of analytical philosophy. I’m from an empirical testing psychology/cognitive science background. This tradition takes the view that (1) introspection runs in circles, and (2) language/“isms” run in circles. It follows from the fact that early psychological research (circa 1900) fell flat on its face trying to bring philosophy into the sciences by reporting on introspection. Following that failure, they concluded that little insight would be gained without direct behavioral and measurable testing.

Psychologists are the KISS people of philosophy in the end: they start from the major concepts, accept huge uncertainties given human inconsistencies, and then shift their efforts toward methods and probabilities. They have indeed made some solid contributions, but hit plenty of walls too. They won’t accept theoretical complications without data, and grind hard on formal linguistics/Chomsky too. [However, psych research was really bleak and narrow back in the Behaviorism era – B.F. Skinner.]

2 Likes

Yeah, the ‘isms’ got to me as well. But I think it was mostly just annoying. I found myself in subjects where I had no way of communicating information without using them, and then it required that other people have the same background, ultimately leading the analytic philosophy crowd to only talk to one another. Definitely fun, but less fruitful than I think it ought to be.

I remember meeting one of my philosophy heroes in grad school, and when I told him “you’re the reason I got into this”, he replied “I’m truly sorry for that”. But I’m glad to have the background now. It’s certainly helped me identify inconsistencies between intuition and ideological/fundamental commitments - along with a pretty good BS detector.

3 Likes

Not that you can boil a headphone (or any other device) down to one metric (or grade).
At the minimum, you would need three:

  • Comfort (or usability), what good is the best sound ever when it makes your ears fall off?
  • Sound - Good luck boiling all the aspects of that down to one score…
  • Worth the price? - Fit & Finish play a role, provided accessories, etc.

Benchmarking sound equipment will be expensive. Just as you have
Source → DAC → Amp → Headphone
you would have:
Waveform generator → Headphone → Measurement mic(s) → Frequency analyzer (or Mixed-Domain Oscilloscope)
The above measurement setup does not have built-in compensation for measurement errors, which is another can of worms.
As for what to measure: No frequency sweeps, that is for sure.

1 Like

I remember going to more than a few happy hours in grad school with David “Take Consciousness Seriously” Chalmers during his Glorious Big Hair days:

With regard to benchmarking, I’m very much into probabilities following from measurement. This is what the big vendors in fact do for product research. It gets esoteric in the audiophile world as the samples and resources aren’t there. We probably aren’t that far apart regarding the underlying process.

2 Likes