Measurements: Charts, Graphs, Software & Methods

This isn’t targeted directly at you, but since you raise some questions/counters that I see raised quite a bit regarding measurements (that is, individual measurements taken by people, not the concept of using measurements in general), and that I have some opinions on, you have the poor luck of having to hear my thoughts :stuck_out_tongue:

With the caveat that we’re referring to “sufficiently good” measurement systems - which I suppose would include anything based on the standards, bar IEC 60318-1 couplers, as well as a number of ways of doing probe measurements on real ears - the disparity in results between two systems measuring the same headphone should not be very significant after accounting for the difference between the transfer functions of those systems. As an example, I have measured headphones on my HATS’ head both with its internal microphones, built into its ear sims, and with a small open-canal probe microphone I constructed. If I plot a single headphone’s measurement on both, we see a large difference.

Shock and horror - clearly one of these is an unsuitable measurement system, right? Well, maybe not - after all, measurements of SPL in our own ears should certainly correlate with what we hear, and measurements on standard fixtures like my HATS also correlate with perception. How do we bridge the gap, then? It comes down to the difference in what’s shaping the incoming sound here - my probe microphone bypasses the canal and takes its measurement at the entrance, so that part of the HRTF is absent, leaving a difference by frequency (made by subtracting one measurement from the other) that looks like this. To me, this looks pretty close to the delta between the eardrum and open-canal measurements found by Hammershøi & Møller 2008 (HRTFs under three conditions - blocked canal, open canal, and eardrum - are given in Table II).
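To make that subtraction concrete, here’s a minimal sketch - all names and numbers are illustrative, not my actual data. Two dB-SPL traces of the same headphone on a shared frequency grid are differenced to get the between-system delta:

```python
import numpy as np

# Illustrative numbers only - not real measurement data.
freqs = np.logspace(np.log10(20), np.log10(20_000), 6)        # shared grid, Hz
ear_sim_db = np.array([84.0, 85.5, 88.0, 92.0, 97.5, 90.0])   # eardrum-referenced, dB SPL
probe_db   = np.array([84.0, 85.0, 87.0, 89.5, 91.0, 86.0])   # open-canal probe, dB SPL

# Because dB is logarithmic, a ratio of sound pressures becomes a simple
# difference, so the between-system "difference by frequency" is just
# pointwise subtraction of the two traces.
delta_db = ear_sim_db - probe_db
for f, d in zip(freqs, delta_db):
    print(f"{f:8.0f} Hz  delta {d:+.1f} dB")
```

With real data you’d smooth and interpolate onto a common grid first, but the core operation really is this simple.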

So long as this difference is constant with respect to headphones (which, depending on what’s different about the transfer function, may not be the case - /u/oratory1990 on Reddit has written a fair bit about where he sees acoustic impedance mismatch making this misleading), we can compensate for it universally - either by subtraction, or in this case by using the eardrum data for the ear sims and the open-canal data for the probe mics. In the case of this particular mic, several headphones show pretty consistent differences - not complete agreement (there are, after all, other factors influencing FR variation, including slightly different placements and environmental noise), but close enough that, with a compensation shaped like the delta between the systems for any one headphone, you’d get a pretty good prediction of how a headphone that measured a certain way on one system would measure on the other. And in that case, you have comparability - which is exactly the standard I set for a useful measurement system. Not everyone necessarily needs a HATS, so long as the system they use varies in a static way relative to different headphones - I’ve banged on about this here before, but I really do think it’s significant; if both systems are “human-accurate enough”, all you need to know is what the differences are, and for standards-compliant professional equipment, these differences will be quite small for ear simulators and pinnae.

In general, this is a worthy consideration “internally” to a measurement lab - when something looks “strange” in my measurements, the first step in figuring out what’s up is doing some loopback tests, checking against a backup DAC/ADC, etc. Sometimes you’ve left an EQ on, sometimes a USB device has decided it doesn’t like your ports and needs to be power cycled, and so on. However, unless whoever is doing the measurement simply does not care about what he’s measuring, the odds of an aberration induced this way making its way into a published measurement are pretty low, bar one case I can think of (noise messing with distortion measurements - a few published reviews have had that issue…). Barring defect or malfunction, the chain used to measure headphones simply shouldn’t substantially impact their frequency response - even a high Zout will only cause “big” differences in some truly outlier cases. If the FR looks like noise, that’s definitely a good sign that something’s gone wrong, but subtler issues are…niche, at minimum.
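On the Zout point, the mechanism is just a voltage divider between the amp’s output impedance and the headphone’s (frequency-dependent) impedance, so FR only shifts where the headphone’s impedance swings. A rough sketch, with made-up impedance numbers:

```python
import math

def level_shift_db(z_headphone: float, z_out: float) -> float:
    """Level at the headphone relative to an ideal zero-ohm source (voltage divider)."""
    return 20 * math.log10(z_headphone / (z_headphone + z_out))

# Made-up numbers: a dynamic driver whose impedance peaks at resonance.
z_at_resonance = 300.0   # ohms, e.g. a bass impedance peak
z_elsewhere    = 60.0    # ohms, nominal impedance

for z_out in (0.5, 10.0, 120.0):
    swing = level_shift_db(z_at_resonance, z_out) - level_shift_db(z_elsewhere, z_out)
    print(f"Zout = {z_out:5.1f} ohm -> FR change across the impedance swing: {swing:+.2f} dB")
```

With a sub-ohm Zout the swing works out to a few hundredths of a dB; it takes an unusually high Zout, or an unusually wild impedance curve, before the FR change becomes clearly audible - hence “truly outlier cases”.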

This is actually what made me respond here - I’ve seen this concern, that operator error is a significant factor, raised more and more frequently lately, and I’m really torn on it. There are certainly cases where operator knowledge/experience impacts the results - for example, I’m now more consistent at getting a tight coupling of headphones to my 4128 than I was when I first got it; as another, I’ve known folks newer to measurements who didn’t calibrate SPL when testing distortion, making the measurements…not terribly handy.

But the biggest factor we hear and care about is overwhelmingly frequency response, and barring deliberate malfeasance, it’s not really that easy for an operator to mess up how a circumaural headphone measures for frequency response. You can place it a bit askew on the test fixture, sure, and you can have a leak between the pads and the head, but the former isn’t necessarily wrong (after all, how many of us carefully center our headphones on our pinnae?), and the latter is pretty obvious when looking at the data (and also not necessarily wrong, depending on the amount of leak - some of us wear glasses!).

It’s worth thinking about operator effects, don’t get me wrong…but it’s important not to let that be a brush we allow to sweep away any data that we don’t intuitively agree with! There are areas where the measurement engineer’s impacts may make themselves known (particularly the relative extremes of frequency), and when things look a bit eccentric there, it’s worth pondering if that might be why…but barring sabotage, no operator error is going to turn an HD800 into a GS2000e, if you take my meaning :smile:

There’s…nuance to this one. I really, really recommend reading at least Toole or Olive’s summaries of Olive’s work in this area - and better yet some of the source papers - but while a measurement doesn’t necessarily allow me to predict with perfect certainty how one individual will hear a headphone, the variations we see aren’t as extreme as some people make them out to be. It’s true that “how close a headphone is to the 2017 Harman target” isn’t a perfect prediction of how you or I will like it relative to another headphone…but preferences cluster pretty close to something that’s close to it.

Olive, Welti, & Khonsaripour 2019 sliced up the variation in preferences a bit and found that there were subsets of people who preferred relatively different headphones on average - which, you know, aligns with all of our experience - but even among these subgroups there was pretty strong agreement about which headphones sounded better and which sounded worse. Older work has shown similar things when giving people control of equalizers - you see different adjustments from different individuals…but they don’t pull that far apart.

Mind you, the distance is still enough for things like my own differences with the Harman target’s high-frequency response (I prefer diffuse field, unaltered or with a slight shelf down) to fit, IMO, within the research, so I’d also be cautious of anyone who thinks he can predict exactly what you, personally, will think of a headphone from its frequency response…but I don’t see that claim made very often. What I do sometimes see is people pushing back against the idea that, even for a specific individual, we can make some quite well-informed guesses about which FR features will not sound so good - and that idea is a very reasonable claim.

It’s certainly possible that such a thing could be done - some would argue that certain audio review publications do this to smooth over measured issues - but I think it’s key to separate what the measurements say from what we say about the measurements. There are really only two ways to massage what the measurements themselves say: choosing which measurements to conduct based on the sort of result you want to show (which will make it fairly clear what you like and dislike if you give some gear a harder set of trials than others), or outright tampering with the data.

The latter is quite trivial to do - I can mess up the FR or distortion of any gear I measure essentially arbitrarily - but, well, there’s a reason that people make jokes about people lying on the internet. You can draw your own graphs in paint then digitize them with webplotdigitizer if that floats your goat, but I’ve never seen anyone doing that, and unless something is truly inexplicable otherwise, or the person seems like a bad faith actor, I’m generally very skeptical of fraudulent data as an explanation.

As far as what we say about the measurements goes, of course, it’s a whole different ballgame. I like pointing to three reviews of the Objective2 as a neat picture of this: NwAvGuy’s own, Tom Christiansen of Neurochrome’s, and Amirm of AudioScienceReview’s. If there’s a technical disagreement between their data, I’ve missed it or forgotten it since last reading (so it can’t have been that big), but the three construct quite different narratives from the same product and largely the same data about the same things. NwAvGuy was, naturally, quite keen on his creation, whereas Tom and Amir are both less impressed with its performance - even though it is, indeed, the same performance, analyzed through pretty similar lenses.

Is the O2 a cheap, audibly transparent headphone amplifier that you can build yourself as a fun project? A marginally performing, mid-to-low-power headamp with less-than-SOTA distortion performance and limited output current? Yes to both, but which story we tell depends on what we care about - and that, of course, is subjective. I don’t recommend taking other people’s narrative evaluations of measurements for much more than colour, generally - a really key reason I point people back to the primary lit; measurements are useful when you can interpret them, but if you’re relying on someone else’s analysis, you could get a similar result just from a summary.

This said, the O2 reviews IMO highlight that when we see narratives diverging, it’s as often because the people writing them have different standards as it is because they’re trying to plump for different gear. Amir and Tom are very consistent about what they look for in good gear, and it’s different from what NwAvGuy looked for - their differing impressions of the same product start from different criteria for what “good” is, rather than their definition of acceptability being shifted to suit where the product being reviewed fell. So long as this is the case, if you know what someone is looking for (and most of these folks will tell you quite up-front), IMO there’s no harm at all in that kind of fluff - hell, it’s fun to read!

Anyway, sorry to ramble at you, I’m just feeling a bit like the caveats to measurements need their own caveats stated these days!
