This isn't targeted directly at you, but since you raise some questions/counters about measurements (that is, individual measurements taken by people, not the concept of using measurements in general) that I see raised quite a bit, and that I have some opinions on, you have the poor luck of having to hear my thoughts.
With the caveat that we're referring to "sufficiently good" measurement systems - which would include anything based on the standards bar IEC 60318-1 couplers, I suppose, as well as a number of ways of doing probe measurements on real ears - the disparity in results between two systems measuring the same headphone should not be very significant after accounting for the difference between the transfer functions of those systems. As an example, I have measured headphones on my HATS with both its internal microphones, built into its ear sims, and with a small open-canal probe microphone I constructed. If I plot a single headphone's measurement on both, we'll see a large difference.
Shock and horror, clearly one of these is an unsuitable measurement system, right? Well, maybe not - after all, measurements of SPL in our own ears should certainly correlate with what we hear, and measurements on standard fixtures like my HATS also correlate with perception. How do we bridge the gap, then? Well, it's down to the difference in what's impacting incoming sound here - my microphone bypasses the canal and takes its measurement at the entrance, so that part of the HRTF is absent, leaving a difference by frequency (made by subtracting one measurement from the other) that looks like this. That looks, to me, pretty close to the delta between the eardrum and open-canal measurements found by Hammershøi & Møller 2008 (HRTFs under three conditions - blocked canal, open canal, and eardrum - are found in Table II).
So long as this difference is constant with respect to headphones (which, depending on what's different about the transfer functions, may not be the case - /u/oratory1990 on Reddit has written a fair bit about where he sees acoustic impedance mismatch as misleading here), we can compensate for it universally, either by subtraction, or in this case by using the eardrum data for the ear sims and the open-canal data for the probe mics. In the case of this particular mic, several headphones show pretty consistent differences - not complete agreement (there are, after all, other factors driving FR variation, including slightly different placements and environmental noise), but close enough that with a compensation shaped like the delta between the systems for any one headphone, you'd get a pretty good prediction of how a headphone that measured a certain way on one system would measure on the other. And in that case, you have comparability - that's the standard I set for a useful measurement system. Not everyone necessarily needs a HATS, so long as the system they use varies in a static way relative to different headphones - I've banged on about this here before, but I really do think it's significant; if both systems are "human-accurate enough", all you need to know is what the differences are, and of course for standards-compliant professional equipment, those differences will be quite small for ear simulators and pinnae.
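To make the "compensate by subtraction" idea concrete, here's a minimal Python sketch - every frequency point and dB value here is invented purely for illustration, not real data from my rig:

```python
# Two hypothetical rigs measuring the same headphone at the same
# frequency points (all values invented for illustration).
freqs_hz = [100, 1000, 3000, 10000]
fr_ear_sim_db = [84.0, 85.0, 95.0, 78.0]   # HATS ear-sim mic
fr_probe_db = [84.0, 85.0, 88.0, 76.0]     # open-canal probe mic

# The system delta: the transfer-function difference between the rigs,
# found by subtracting one measurement from the other (in dB).
delta_db = [a - b for a, b in zip(fr_ear_sim_db, fr_probe_db)]

# If that delta is constant across headphones, a probe measurement of a
# *different* headphone can be mapped onto the ear-sim scale:
fr_probe_new_db = [80.0, 83.0, 90.0, 74.0]
fr_ear_sim_predicted_db = [p + d for p, d in zip(fr_probe_new_db, delta_db)]
print(fr_ear_sim_predicted_db)
```

If the predicted curve lands close to what the ear sim actually reports for that second headphone, the two systems are comparable in the sense above; where it doesn't, you're looking at placement variation, noise, or a genuinely headphone-dependent transfer-function difference.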
In general, this is a worthy consideration "internally" to a measurement lab - when something looks "strange" in my measurements, the first step in figuring out what's up is doing some loopback tests, checking a backup DAC/ADC, etc. Sometimes you leave an EQ on, sometimes a USB device decides it doesn't like your ports and needs to be power cycled, and so on. However, unless whoever is doing the measurement simply does not care about what he's measuring, it's pretty unlikely that an aberration induced this way makes its way into a published measurement, bar one case I can think of (noise messing with distortion measurements - a few published reviews have had that issue…). Barring defect or malfunction, the chain used to measure headphones simply shouldn't substantially impact their frequency response - even a high Zout will only cause "big" differences in some truly outlier cases. If the FR looks like noise, that's definitely a good sign that something's gone wrong, but subtler issues are… niche, at minimum.
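On the Zout point: the reason even a fairly high output impedance rarely changes FR much is that it forms a voltage divider with the headphone's impedance, so only the frequency-dependent part of that impedance tilts the response. A quick sketch with an invented impedance curve (the numbers are placeholders, not any real headphone):

```python
import math

def level_shift_db(z_load_ohm: float, z_out_ohm: float) -> float:
    """dB change at one frequency vs. an ideal (0-ohm) source,
    from the voltage divider Z_load / (Z_load + Z_out)."""
    return 20 * math.log10(z_load_ohm / (z_load_ohm + z_out_ohm))

# Invented impedance curve: a dynamic driver with a bass resonance.
impedance_ohm = {100: 60.0, 1000: 32.0, 10000: 34.0}  # Hz -> ohms

for zout_ohm in (0.5, 10.0):
    shifts = [level_shift_db(z, zout_ohm) for z in impedance_ohm.values()]
    # The FR change is the *spread* of shifts across frequency -
    # a uniform level drop isn't an FR change at all.
    tilt_db = max(shifts) - min(shifts)
    print(f"Zout = {zout_ohm} ohm -> FR tilt of about {tilt_db:.2f} dB")
```

Even the 10-ohm source only tilts this particular curve by about 1 dB; you need a much swingier impedance curve (or a far higher Zout) before the effect gets "big", which is the outlier case mentioned above.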
This is actually what made me respond here - I've seen this concern, that operator error is a significant factor, raised increasingly often lately, and I'm really torn on it. There are certainly cases where operator knowledge/experience affects the results - for example, I'm now more consistent at getting a tight coupling of headphones to my 4128 than I was when I first got it; as another, I've known folks newer to measurements who didn't calibrate SPL when testing distortion, making the measurements… not terribly handy.
But the biggest factor we hear and care about is overwhelmingly frequency response, and barring deliberate malfeasance, it's not really that easy for an operator to mess up how a circumaural headphone measures for frequency response. You can place it a bit askew on the test fixture, sure, and you can have a leak between the pads and the head, but the former isn't necessarily wrong (after all, how many of us carefully center our headphones on our pinnae?), and the latter is pretty obvious when looking at the data (and also not necessarily wrong, depending on the amount of leak - some of us wear glasses!).
It's worth thinking about operator effects, don't get me wrong… but it's important not to let that become a brush we use to sweep away any data we don't intuitively agree with! There are areas where the measurement engineer's impact may make itself known (particularly the relative extremes of frequency), and when things look a bit eccentric there, it's worth pondering whether that might be why… but barring sabotage, no operator error is going to turn an HD800 into a GS2000e, if you take my meaning.
There's… nuance to this one. I really, really recommend reading at least Toole's or Olive's summaries of Olive's work in this area - and better yet some of the source papers - but while a measurement doesn't necessarily allow me to predict with perfect certainty how one individual will hear a headphone, the variations we see aren't as extreme as some people make them out to be. It's true that "how close a headphone is to the 2017 Harman target" isn't a perfect prediction of how you or I will like it relative to another headphone… but preferences cluster pretty tightly around something close to it.
Olive, Welti, & Khonsaripour 2019 sliced up the variation in preferences a bit and found subsets of people who, on average, preferred relatively different headphones - which, you know, aligns with all of our experience, of course - but even among these subgroups there was pretty strong agreement about which headphones sounded better and which sounded worse. Older work has shown similar things by giving people control of equalizers - you see different adjustments from different individuals… but they don't pull that far apart.
Mind you, the distance is still enough for things like my own differences with the Harman target's high-frequency response (I prefer diffuse field, unaltered or with a slight shelf down) to fit, IMO, within the research, so I'd also be cautious of anyone who thinks he can predict exactly what you, personally, will think of a headphone from its frequency response… but I don't see that claim made very often. What I do sometimes see is people pushing back against the idea that, even for a specific individual, we can make some quite well-informed guesses about which FR features will not sound so good - and that idea is a very reasonable one.
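For a flavour of how this kind of FR-based prediction works mechanically - and this is only a sketch of the general approach (summarizing the error curve against a target with simple statistics, in the spirit of Olive et al.'s work), with invented values and none of the published model's actual coefficients:

```python
import math
import statistics

def error_stats(measured_db, target_db, freqs_hz):
    """Standard deviation and log-frequency slope (dB/decade) of a
    headphone's error curve relative to a target response."""
    error = [m - t for m, t in zip(measured_db, target_db)]
    sd = statistics.pstdev(error)
    # Least-squares slope of the error over log10(frequency).
    x = [math.log10(f) for f in freqs_hz]
    mx, me = statistics.fmean(x), statistics.fmean(error)
    slope = sum((xi - mx) * (ei - me) for xi, ei in zip(x, error)) / sum(
        (xi - mx) ** 2 for xi in x
    )
    return sd, slope

freqs_hz = [100, 1000, 10000]     # a real model uses many more points
target_db = [3.0, 0.0, -2.0]      # made-up target curve
measured_db = [6.0, 0.0, -6.0]    # made-up headphone
sd, slope = error_stats(measured_db, target_db, freqs_hz)
print(f"error SD = {sd:.2f} dB, error slope = {slope:.2f} dB/decade")
```

Smaller SD and a flatter error slope roughly mean "closer to target"; the real models fit coefficients to listener data on top of statistics like these, which is exactly why they predict averages well and individuals only approximately.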
It's certainly possible that such a thing could be done - some would argue that certain audio review publications do this to smooth over measured issues, as an example - but I think it's key to separate what the measurements say from what we say about the measurements here. There are really only two ways to massage what the measurements themselves say: choosing which measurements to conduct based on the sort of result you want to show - which will make it fairly clear what you like and dislike if you give some gear a harder set of trials than others - or outright tampering with the data.
The latter is quite trivial to do - I can mess up the FR or distortion of any gear I measure essentially arbitrarily - but, well, there's a reason people make jokes about lying on the internet. You can draw your own graphs in Paint and then digitize them with WebPlotDigitizer if that floats your goat, but I've never seen anyone doing that, and unless something is truly inexplicable otherwise, or the person seems like a bad-faith actor, I'm generally very skeptical of fraudulent data as an explanation.
As far as what we say about the measurements goes, of course, it's a whole different ballgame. I like pointing to three reviews of the Objective2 as a neat picture of this: NwAvGuy's own, Tom Christiansen of Neurochrome's, and Amirm of AudioScienceReview's. If there's a technical disagreement between their data, I've missed it or forgotten it since last reading (so it can't have been that big), but the three build quite different narratives from the same product and much the same data. NwAvGuy was, naturally, quite keen on his creation, whereas Tom and Amir are both less impressed with its performance - even though it is, indeed, the same performance, analyzed through pretty similar lenses.
Is the O2 a cheap, audibly transparent headphone amplifier that you can build yourself as a fun project? A marginally performing, mid-to-low-power headamp with less-than-SOTA distortion performance and limited output current? Yes to both, but which story we tell depends on what we care about - and that, of course, is subjective. I don't recommend taking other people's narrative evaluations of measurements for much more than colour, generally - a really key reason I point people back to the primary lit; measurements are useful when you can interpret them yourself, but if you're relying on someone else's analysis, you could get a similar result just from a summary.
This said, the O2 reviews, IMO, highlight that when we see narratives diverging, it's as often because the people writing them have different standards as it is because they're trying to plump for different gear. Amir and Tom are very consistent about what they look for in good gear, and it's different from what NwAvGuy looked for - their differing impressions of the same product start from different criteria for what "good" is, rather than from a definition of acceptability shifted to move where the product under review falls. So long as this is the case, if you know what someone is looking for (and most of these folks will tell you quite up-front), IMO there's no harm at all in that kind of fluff - hell, it's fun to read!
Anyway, sorry to ramble at you - I'm just feeling a bit like the caveats to measurements need their own caveats stated these days!