Blind Testing Source Equipment - The limits of discernibility

This is likely going to be part one in a blind testing series. We’ll see how it goes.


I’ve been wanting to do some blind testing for a while now, especially after reading Jason Stoddard’s excellent article on ‘blind listening’ in his Schiit Happened blog series. If anything, that article indicates that when we don’t know what we’re listening to, concerns about perfect measurements and transparency drop out entirely. This is an important conclusion for many of us who may have only seen data points and haven’t had the opportunity to evaluate the equipment in question. But in this post I also want to point out something about the limits of blind testing - something that I discovered (confirmed?) partly by trying it out. My takeaway at the moment is that while the process is interesting, it may not be as valuable as some may think.

One of the reasons I wanted to try out a version of blind listening is in part to try and test myself, to see if I could actually tell the difference and identify each source correctly. I’ve been a staunch defender of the notion that DACs and amps do make a difference for the experience beyond what measurements indicate, and that most people can recognize those differences when paying attention to specific things. This statement of course comes with the following disclaimers:

  1. How noticeable the differences are depends on how revealing the headphones are
  2. There isn’t a huge difference in DACs past a certain point
  3. Headphones make the biggest difference overall

Setup
I have to state that this was not a scientific, definitive or ideal setup, and I think I could improve it over time. It was still effective at achieving the ‘blind’ part, however. It involved using two completely different source streams, volume matched (this is essential), playing the same file at the same time on repeat, while I was blindfolded so I couldn’t see anything. I used the Room Equalization Wizard (REW) and the EARS rig for volume matching. I then had my girlfriend plug the headphones I was using into each source at random and I had to guess which one was which. She even tried to throw me off deliberately by plugging the headphones into the same source several times in a row. For this test I used both the $2000 Audio-Technica ATH-ADX5000 and the $350 HiFiMAN Sundara (just to test if lower end source equipment was distinguishable from high end gear on appropriately priced headphones as well). For this first experiment I tested them with the following sources to compare gear from three separate price categories:

  1. iFi Pro iDSD ($2500)
  2. Mytek Liberty DAC/amp ($995)
  3. iFi Hip DAC ($150)

Ideally I’d want to have a wall between myself and the sources being used, or some way of ensuring that there’s complete separation between the subject (me) and the test equipment, as well as some way of having additional source streams hooked up. At the moment I’m limited to only comparing two at a time. I’ve considered using some form of switch or splitter to get an immediate changeover so the headphones don’t need to be unplugged and re-plugged, but that may also make the test easier.
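The random plugging described in the setup could also be drawn up ahead of time, so the helper only has to follow a list. What follows is a minimal sketch and not what was used for this test: `make_schedule` and `score` are hypothetical helper names, and the labels "A" and "B" stand in for whichever two sources are being compared. Each trial is drawn independently, so runs of the same source happen on their own rather than as a deliberate trick.

```python
import random


def make_schedule(n_trials, sources=("A", "B"), seed=None):
    """Return one source label per trial, each drawn independently at random."""
    rng = random.Random(seed)  # a fixed seed reproduces the same schedule
    return [rng.choice(sources) for _ in range(n_trials)]


def score(schedule, guesses):
    """Count how many guesses match the schedule."""
    if len(schedule) != len(guesses):
        raise ValueError("schedule and guesses must have the same length")
    return sum(actual == guess for actual, guess in zip(schedule, guesses))


if __name__ == "__main__":
    # The helper keeps this list hidden from the listener until the end.
    schedule = make_schedule(10)
    print(schedule)
```

The listener's guesses would be written down per trial and only compared against the schedule with `score` once the run is over.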

The other limitation is that I wasn’t able to test just the DAC portions through a common amplifier, again because I didn’t use an RCA switch. This means that I was just evaluating two completely distinct source chains for differences, and not individual pieces of each chain. The main reason for doing this is that the sources I happened to be using were DAC/amp combos, and so they each had capable headphone outputs. I had tried testing them in the past by running them to a common amplifier, but found that this introduced too many additional variables, and the result would depend more on the way each source functioned, for example whether one is being run as a pre-amp and the other not. Maybe I’ll try this again with different DAC units at some point.

Results
In any case, once I was blindfolded and got the test running I realized this was going to be harder than I thought. This is partly because I had spent all this time setting everything up and I wasn’t properly adjusted to what I should be listening for. This is probably the biggest factor that skewed my results, and I will explain why in a moment, but for each comparison here’s how well I was able to distinguish each source:

  1. Mytek Liberty vs iFi Pro iDSD (ATH-ADX5000): This had the biggest difference between the two, but I was only able to correctly choose which was which 7/10 times (there’s a reason for this).
  2. iFi Hip DAC vs Pro iDSD (Sundara): Comparing these two, there was less of a difference, but I only got it wrong once.

If you’re thinking that this seems counter-intuitive, that’s because it is. Why did I score better for the test that I thought didn’t show as significant of a difference? Why did I score worse for the one that I thought had a more significant difference?

There are a number of explanations, but the first is to confirm one of the caveats noted above: namely that how revealing the headphone is will determine discernibility of sources. The ATH-ADX5000 is noticeably more revealing of the sources than the HiFiMAN Sundara, and this explains why I found the biggest difference to be in the first test. This may sound like it should be obvious, however it doesn’t explain why I didn’t score as well for that test as I did on the one with less revealing headphones.

As mentioned earlier, I think I missed a few on the Mytek Liberty vs the Pro iDSD because I was coming at it cold. Basically, I wasn’t warmed up to either of the sources yet and therefore wasn’t able to properly understand what it was that I was hearing and where to look for differences. I found that as the test went on, it became easier and easier to tell them apart. For anyone wondering, I’ve spent quite some time evaluating each of these before this test, and the biggest difference that I’ve found over time is that the Liberty has a slightly more articulate and analytical representation of instruments in certain parts of the treble range. The Pro iDSD is also quite well defined, but on the ATH-ADX5000 I found the elevated treble response to sound just a touch more relaxed. My pick for the ATH-ADX5000 would probably be the Pro iDSD as a result.

With the second test, while these two sources are in vastly different price categories, I suspect I scored better because I had been warmed up to the Pro iDSD by this point and already had a solid imprint of what that sounded like to compare the Hip DAC to. The nice thing about this test not revealing as strong of a difference is that we can take comfort in knowing that our entry to mid-level headphones (even benchmark headphones like the Sundara) don’t require us to have crazy expensive source gear to get the most - or close to the most - out of them. In fact, I’d say for most headphones the iFi Hip DAC or other entry level sources are probably good enough, and it’s only when you get up into the flagship territory that the differences start to matter.

There’s also another explanation for these results that’s perhaps even more important to keep in mind, and it’s also why I start to think that maybe this blind testing approach isn’t as valuable for differentiating between equipment chains as many of its proponents imagine. For this test I was listening to a specific song, switching several times over the course of it, and for identifying differences I think this is actually not the right approach - even if it’s a realistic way for us to use the equipment in question.

Using a full song (on repeat) means that you may be thrown off due to a low or high energy part of the song. This is especially true for well-recorded music where there is a lot of dynamic range. So you may think you hear an extra bit of glare or definition in the treble at one part of the song, but it’s more to do with the way the drummer hit the cymbal at that particular part than anything else, and this can lead to guessing wrong if that happened to be the part you were listening to when the sources were switched. To get around this problem, we should probably be listening to specific parts of a song on repeat, or potentially even consistent test recordings that last 15-30 seconds at most.
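For anyone who wants to try the short-excerpt idea, here is a rough sketch of cutting a fixed segment out of a track so the same few seconds can be looped on both sources. It is only an illustration under some assumptions: it uses Python's standard `wave` module, so it handles uncompressed PCM WAV files only (FLAC or MP3 would need other tools), and `extract_segment` and the file names are made up for the example.

```python
import wave


def extract_segment(src_path, dst_path, start_s, duration_s):
    """Copy duration_s seconds, starting at start_s, from one PCM WAV file to another.

    Returns the number of audio frames written.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        start = int(start_s * rate)
        count = int(duration_s * rate)
        if start < 0 or start >= src.getnframes():
            raise ValueError("start_s is outside the file")
        src.setpos(start)                # jump to the first frame of the excerpt
        frames = src.readframes(count)   # returns fewer frames if the file ends early

    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)            # same channels, sample width and sample rate
        dst.writeframes(frames)          # frame count in the header is corrected on close

    return len(frames) // (params.sampwidth * params.nchannels)


# Hypothetical usage: a 20 second excerpt starting one minute into the track.
# extract_segment("track.wav", "excerpt.wav", start_s=60, duration_s=20)
```

Picking the excerpt by timestamp keeps the low and high energy passages identical between trials, which is the point of the suggestion above.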

Conclusion… for now
In my mind this calls into question whether or not we should care about blind testing/listening at all for anything beyond the fact that it’s an interesting experience. So by that I mean, it may be interesting to see how well you can identify your sources from one another, but it probably shouldn’t be involved in any equipment recommendations. Moreover, the ability to distinguish one source from another does not mean that the subject properly knows or understands how it sounds - and this is something that I think blind test proponents make a mistake about.

To make this point more strongly, I’m reminded of a problem found in epistemology that investigates whether knowledge attributions have a two-part or three-part structure (see Jonathan Schaffer’s argument in favor of ‘contrastivism’). The basic premise of this idea is that discernibility is an implicit yet essential component of knowledge statements that look something like “Resolve knows that the Mytek Liberty DAC sounds like this”. Supposedly these statements include the ability to distinguish this from something else (in this case the Pro iDSD). I imagine this is the attractive part about blind testing and discernibility - the ability to distinguish one from the other should indicate that A) there is a difference and B) the test subject knows what that difference is, or at worst, if we can’t distinguish them from one another there either isn’t a difference or we don’t know what the difference is.

But the problem with this line of thinking (and indeed one of the problems with Schaffer’s claim) is that discernibility on its own is not enough for the knowledge claim “Resolve knows X” to be true. Just because you can correctly guess which is which between two sources in the ideal test environment, using the same partial recording - even with more reliable certainty than I was able to do with the testing above - does not mean that you’ve gained comprehensive insight into the way each source sounds. Or in other words, the assertion that “there’s a difference” isn’t a complete description of what each piece of source equipment sounds like.

While I was able to correctly guess the sources most of the time, it would become exponentially more difficult if more sources were included in the test. So rather than distinguishing between two sources, imagine if you had to distinguish among four different sources, or six. Failing to correctly guess which is which in this situation (which is likely) doesn’t mean the subject wasn’t able to hear a difference, it just means that hearing a difference wasn’t enough to correctly identify them. This is where discernibility starts to show its weaknesses, and it makes me wonder if blind testing proponents place too much value on it.
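To put rough numbers on how quickly the guessing baseline falls as sources are added, here is a small sketch. It assumes two simplified test formats that are not taken from the test above: naming one source per trial out of k, and labelling all k sources correctly in one go. The function names are made up for illustration.

```python
from math import factorial


def per_trial_chance(k):
    """Chance of naming the right source on a single trial by pure guessing."""
    return 1 / k


def full_labelling_chance(k):
    """Chance of assigning all k labels correctly in one go by pure guessing."""
    return 1 / factorial(k)


for k in (2, 4, 6):
    print(k, round(per_trial_chance(k), 4), round(full_labelling_chance(k), 5))
```

The baseline for getting everything right shrinks with the factorial of the number of sources, so a miss with four or six sources says much less about whether a difference was heard than a miss with two.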

In any case, in spite of my reservations about the value of blind testing and discernibility, I want to see if I can improve the test conditions and make the setup easier to do. The main reason for this is simply to become a better listener. The value I see in doing something like this is more along the lines of “oh neat” in most cases, unless there are examples where I score so poorly that it starts to show that I can’t hear a difference. I think there are a number of source comparisons like that, where we start to convince ourselves that we hear things that aren’t actually there.
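One way to put a number on "scoring poorly" is to ask how often pure guessing would do at least as well. The sketch below is an illustration rather than part of the original test: it assumes each trial is an independent 50/50 guess between two sources, and `chance_of_at_least` is a hypothetical helper name. It uses the 7/10 from the first comparison; the second comparison is left out because the number of trials isn't stated.

```python
from math import comb


def chance_of_at_least(correct, trials, p_guess=0.5):
    """Probability of getting `correct` or more right out of `trials` by guessing alone."""
    if not 0 <= correct <= trials:
        raise ValueError("correct must be between 0 and trials")
    return sum(
        comb(trials, k) * p_guess**k * (1 - p_guess) ** (trials - k)
        for k in range(correct, trials + 1)
    )


p = chance_of_at_least(7, 10)
print(f"{p:.3f}")  # 0.172
```

By that measure a 7/10 would come up from guessing roughly one time in six, so on its own it doesn't settle much either way; more trials would be needed to say more.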

Next steps
If anyone has done any blind testing, let me know your findings. I’ll be posting updates here from different tests, along with some additional methodological improvements. Ideally I’d like to find a way of doing this that yields more meaningful results - if that’s possible.

30 Likes

How can I like this more than once? Just one heart seems inadequate.

12 Likes

Ok so you can or might be able to tell the differences…fine.

But which one is “correct”…

Or rather…which one did you prefer?? And did you need to do blind testing to determine this??

Alex

2 Likes

Yeah I don’t think you need to do blind testing to determine which one is best (for your headphones). Like I mentioned in the post, being able to identify a difference only goes so far. I actually think regular evaluations that involve sighted comparisons but over longer periods of time yield more useful results.

For the ADX5K I preferred the Pro iDSD, but for my Verite I prefer the Liberty.

10 Likes

I find this “Blind Listening” approach more interesting than “Blind Testing”.

7 Likes

Yeah, ‘blind listening’ - or some form of that - is what I’m looking to incorporate into some evaluations moving forward. There are so many entry level DACs and amps out there that perform incredibly well and will likely yield surprising results with this type of testing. It’s a bit more obvious in this one example I did, but I imagine it’ll get a lot more difficult when comparing similarly priced gear as well. The trick is going to be to ensure each piece of source equipment makes sense for the headphones that are being used to compare with, and that isn’t always the case for a variety of reasons.

6 Likes

This is the thing that I believe most audiophiles don’t understand about blind testing, double-blind testing, versus sighted comparisons, etc. It’s about eliminating bias. Further, as with any test, the objective needs to be stated before the test can be conducted.

The objectives “can I tell a difference between these DACs”, “one of these DACs is mine, can I pick it out” and “which of these do I prefer” are 3 wildly different objectives. Obtaining relevant data for one objective should yield nearly zero pertinent data for the others, and if it didn’t then the test wasn’t blind (or blind enough) as some knowledge about the test entered the tester’s awareness before, during, or after the test.

A truly blind test is where the test taker doesn’t know what the objective of the test is and has nothing to do with how it’s given.

In a double blind test, the person being tested doesn’t know what the test is about, and neither does the person giving the test. Both sides are blind. A third party creates the test, creates questions that won’t lead or guide either the test giver or the taker into suspecting what the test is about, collects the answers and data, and comes to a conclusion.

And there is the catch-22 - if you are performing blind testing on yourself, using a test you created the conditions for, and where you know beforehand the possible outcomes… Then you cannot eliminate bias. Period. Even if you have someone else perform the test on you.

Concerning the objective where you try to see which one you prefer, the results will be specific to you and only you. I believe Schiit’s blind listening (or even sighted comparisons) are just as valid and much easier. It’s like saying people can’t tell the difference between an orange, a lime, and a grapefruit unless the test was blind. It’s absurd.

7 Likes

Is there a general consensus amongst a good number of this forum’s users that amps and DACs do sound different? I have been on a few forums (cough, Reddit) where the ‘hive mind’ will say that there are no differences at all when volume matched. I don’t personally subscribe to this theory myself as I hear differences in my amps and DACs. Fair enough, I haven’t volume matched them precisely, but I am certain they do sound different. Just wondering what your thoughts are on there being differences between amps and DACs.

I am not trying to disrespect the folk over on Reddit, far from it, as I go there myself. But it’s just an impression I’ve picked up over there. I know I am generalizing too, but I am just trying to put a viewpoint across.

8 Likes

I believe there are indeed audible differences. I hear differences with the limited amount of equipment I own, and I’m relatively new to this hobby.

7 Likes

Me too. I don’t have the experience of @Resolve or @Torq or for that matter the gear. But even on relatively modest gear they seem to have different flavours so to speak. For instance I have an O2 amp that I would certainly say is on the bright side whereas the Atom is less so. Sabre chipset DACs are also on the bright and thin side in my experience when compared to AKM.

I am sorry if it isn’t on subject. Perhaps I should have posted elsewhere. If need be, perhaps the mods will move it.

3 Likes

As much as I know it’s all in the implementation, I do find I have an easier time identifying differences when comparing DACs that use different chips (that should probably be obvious). But I think we also have to be careful not to let volume differences or expectation biases lead to conclusions. I bet there’s probably a lot of stuff where the differences are so insignificant that we can’t actually discern them. I remember the first time I went “wow what a difference”, and it was when using the balanced out on the Cayin IHA-6 with the MrSpeakers Aeon. Of course now I know that it’s mostly just a case of those headphones benefiting greatly from high current.

6 Likes

I have done a few blind (and double blind) tests over the years, usually related to tracks recorded in a studio. The idea was to pick out the best track without knowing which take it was and without looking at waveforms on a screen.

However, one of the blind tests (or listens) that stuck with me was when I was invited to a factory tour of a large PA speaker manufacturer. They had a great demo room set up, probably around 800 m² with all their different ranges of speakers and systems set up around and above a virtual stage.

There was auditorium style seating at the back of the room and without knowing which speakers were playing, they randomly swapped rigs to see which we (it was me and two other colleagues) preferred. All the systems were volume matched.

They did three full rounds of all their different rigs, in random order and on each round we picked our favourite.

After the third round, we got to know which rigs we had picked. My surprise was that in all three rounds I had picked what was essentially a set of small cube speakers (they were running multiples of them at the same time to be able to match the volume of the big rigs) that were essentially 10% of the price of the next step up in their line.

To be honest I have never been a fan of that brand (which is why I am not sharing any names) but they are highly regarded by others and their big rigs get high praise for stadiums etc.

That taught me that no matter how many rave reviews something gets, it doesn’t mean I will like it and also that no matter the increase in price, it doesn’t mean I will like it more (or less).

Sorry for what has turned into a completely off topic post!

8 Likes

[quote=“SenyorC, post:12, topic:6204, full:true”]
(they were running multiples of them at the same time to be able to match the volume of the big rigs) [/quote]

But that makes sense since running multiples of that cheaper speaker decreases the driver excursion and distortion. What if they had run the same multiples of their larger/more expensive speakers, at much higher volumes? Results may have been different, no?
In the same vein, I am experimenting with arrays of multiple planar tweeters…

2 Likes

This is the great benefit and/or humiliation of blind testing. Emotion driven people hate that. Sometimes cheap stuff is fine, and sometimes not. As I’m primarily focused on comfortable, quality music I tend to stop when it feels good (no pain/no ringing in my ears) and go back to the content.

A while back I was testing a cheap amp vs. a more expensive amp and forgot which one I was using. Turned the wrong volume knob and was perplexed… Oh well, the cheap one was good enough for the content.

5 Likes

The result would have been different, yes; it would have been much louder than I would have been comfortable with (we are talking stadium level volumes in an empty, though well treated, 800 m² venue).

And I don’t know about limiting excursion, those small cubes were being pushed to the limit to reach those volumes.

I don’t see how there could not be differences, and probably audible ones. A DAC does not live entirely in the digital domain. Even if the ones and zeros are read identically, and converted identically to an analog form, that analog form is output using whatever choice of analog components the DAC manufacturer chose.

I’m also sure the same applies in reverse, when an analog sound is digitized. The quality of the input will affect the stream of ones and zeros produced.

4 Likes

An A/B box may help you with that. I watched your video earlier, and the gap when your lady switched between amps was (at least) twice as long as when she tried to trick you by reinserting into the same jack. Your mind probably figured that out.

You already have the tricky part, which is two computers. It’s a PITA with 1 computer (and the coordinated movement of switching the A/B box switch and OS sound card at the same time).

Good stuff.

2 Likes

Actually she was deliberately trying to throw me off by plugging it into the same jack more than twice. The test went on much longer than what I showed in the video. I could’ve shown the whole thing but that would have been a 35 minute long video that’s pretty boring haha. But yeah, next time I want to get to a point where even she doesn’t know which one is which.

5 Likes

If you are into DIY and some simple programming, you could build a switch box using some relays and an Arduino board and run it to randomly switch between the sources and/or outputs each time you press a button.

I saw a guide to building one of these some time ago, I think it may have been on Instructables.

3 Likes

In animal testing, researchers must employ extreme controls to prevent the animals from being trained by slight differences, even with rats, pigeons, and dogs. All the cases of counting horses actually involved the horses spotting movement in the human observers.

2 Likes