Do DACs/Amps matter?

Placement of carved figurines on the Bifrost 2/64 is critical. Try reversing the aural polarity by having the figurine’s head face toward the connection. Also, Orgone reception is reportedly better on the switch side, so place it more to the left. The right may be reserved for traditional Netsuke (depending, of course, on which net you use).

Yeah, we’re the most clearheaded and reliable forum.

6 Likes

I will get around to the blind testing after the holiday. I listened to the Spring 3 KTE all last night (~3 hours) through my stereo. I’ll hold off on sound quality comments until I can A/B switch with the Bifrost 2/64, but one thing stood out from my sighted impressions last night: fatigue. Two songs off the album Con Todo El Mundo by Khruangbin and the song Lateralus by Tool had my ears feeling something fierce. They’re still tender today. So not the best first impression. This unit is “burned in” since I purchased it used from the original owner, who got it in January of this year, so that’s not an explanation. I never exceeded my normal listening levels (~76 dB average) either.

So how did the TT2 and Ferrum Oor work out? Were there noticeable/worthwhile differences compared to other DACs/amps you were testing?

I’m interested because my experience AB testing low-to-mid price solid state amps (up to the level of the Jotunheim 2) was the same as yours.

Up to Jotunheim 2 or up to and including Jotunheim 2? I ask because my Schiit Lyr+ in SS mode is the first SS amp I’ve heard any difference in. It’s got me thinking about trying a Jotunheim 2 or maybe even something like the SPL Phonitor.

Up to and including the Jot 2. I did extensive blind AB testing against the THX 789 using my Focal Clear OG, DCA Ether 2 and HD 800S, and could hear no difference.

This was all single-ended (that’s all my switch box could do), so we can’t rule out differences in the XLR outputs. Sighted swapping between the balanced outputs suggested the Jot may have been a tiny bit wider, but I have no way of knowing if I could have distinguished the two blind.

Anyway, the testing I did was enough to convince me that I can’t personally hear differences between these reputedly very different-sounding amps, and I ultimately sold both (I prefer the aesthetics and ergonomics of the JDS El Amp and Element).

5 Likes

I should add that sighted ABing between the two single-ended outputs supported the general consensus that the Jot 2 is slightly warm and tubey (especially single-ended), and the THX is sterile. Blind, these differences disappeared.

This was enough to convince me that I am personally highly susceptible to framing effects, and these have a far greater impact on what I hear than any actual differences in the sound (at least as far as the Jot 2 and THX 789 are concerned).

5 Likes

Kudos for performing these tests for yourself. Also, the human interface, features, and ease of use can be a deal breaker - one reason I’m not particularly interested in Chord, or even considering HQ Player unless I have too much time on my hands.

I do find (and I think @generic would agree) that extended listening can reveal more than an AB test on a personal level. For him, I think it manifests as fatigue or pain. For me, as I acclimate to the sound of a setup, I’m sometimes able to resolve more than I did at first. Maybe I’m just filtering out the chaff perceptually.

6 Likes

My motivation for exploring audiophile headphones started as an effort to get beyond the absolutely nasty sounds I heard from my ancient Sony consumer headphones. Headaches. Ears ringing. Etc.

I’d absolutely, positively fail many (though not all) blind AB or ABX tests if each setup were used for just a few seconds. It takes a while to adapt or acclimate. Sometimes I can’t hear differences for an hour, but then my ears start to ring after an album or two and the setup becomes intolerable. This can happen with compressed music sources too. Sometimes a setup has given me a headache for an entire day. This is not possible to fake. Finicky cat I am.

Most of my successful amp/DAC identifications came through treble issues. Some stereotypically “harsh” amps, such as the THX AAA 789, are by no means the worst (nor the most easily detected). They are antiseptic but often fine fatigue-wise. The worst have random treble sprays / glare / brightness (e.g., the original Magni 3; a bad or worn tube) and make my ears hiss in seconds. The DACs I don’t like tend to whine, whistle, and ring with my known problem test tracks, brass instruments, and female vocals. Frankly, one should use a stereotypically bright and harsh DAC (e.g., ESS, Cirrus Logic) if one wants to identify amps. Testing won’t be pleasant though.

1 Like

If you set both DACs to true 4 V output you will be fine, so long as you set the volumes on the amp the same. Alternatively, you can cut a hole in a piece of cardboard and shove the mic through it to measure SPL. Still not truly ideal, but much better than measuring in the open.

Also, if either DAC is new to you, I would highly recommend using it for a week or so straight before jumping into ABX. It can take some time to kinda “get used to” a new DAC.

Lastly, unless you are using the new BF2 true NOS board, you will want to oversample in Roon. Without this you will end up with some treble roll-off on the Spring, making the comparison not quite fair.
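
For a rough sense of how much treble a true NOS (zero-order-hold) path gives up, here’s a minimal sketch. It assumes 44.1 kHz material and only the idealized hold response, ignoring whatever analog filtering a given DAC adds on top:

```python
import numpy as np

# Idealized zero-order-hold (NOS) droop: |sinc(f / fs)|, where
# np.sinc(x) = sin(pi * x) / (pi * x). Real DACs add analog filtering on top.
fs = 44_100  # assumed sample rate (Hz)

for f in (1_000, 10_000, 15_000, 20_000):
    droop_db = 20 * np.log10(np.sinc(f / fs))
    print(f"{f / 1000:>4.0f} kHz: {droop_db:6.2f} dB")

# Approximately: -0.01 dB at 1 kHz, -0.75 dB at 10 kHz,
# -1.72 dB at 15 kHz, -3.17 dB at 20 kHz.
```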

This is really interesting. TBH I think the Spring 3 is one of the least fatiguing DACs I’ve ever heard (even with the AHB2, which is a bit of a fatigue monster, TBH). I doubt it’s the issue given that the HPA4 is a monster, but if your Spring 3 has the pre, it may be worthwhile lowering the output on the Spring to see if you are clipping or something.

1 Like

Worth checking how you were calculating your listening volume.

Spring 3 output is hotter than most DACs. And if you have the preamp module it can go much higher than other DACs.
You may have been listening a fair bit louder than you thought.

This Spring 3 is without the preamp. Volume was standardized between the two DACs using pink noise, REW, and a UMIK-1 for my stereo. LCD-5 volume was matched by sealing the left cup over a flat surface, sliding my phone’s mic underneath it, and playing pink noise with an SPL app running. I got both setups, with both DACs, within 0.1 dB of each other.
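
For anyone who wants to sanity-check a level match like this, the arithmetic is just an RMS comparison of the two pink-noise captures. A minimal sketch, assuming both captures were made with the same mic gain (the WAV filenames are hypothetical placeholders; REW with a UMIK-1 gives calibrated SPL directly, this only compares relative level):

```python
import numpy as np
from scipy.io import wavfile

def rms_db(path: str) -> float:
    """Relative RMS level (in dB) of a WAV capture of pink noise."""
    _rate, data = wavfile.read(path)
    data = data.astype(np.float64)
    if data.ndim > 1:            # collapse a stereo capture to mono
        data = data.mean(axis=1)
    return 20 * np.log10(np.sqrt(np.mean(data ** 2)))

# Hypothetical captures of the same pink-noise track through each DAC
level_a = rms_db("pink_spring3.wav")
level_b = rms_db("pink_bifrost2.wav")
print(f"Level difference: {level_a - level_b:+.2f} dB")  # aim for under 0.1 dB
```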

During sighted A/B testing I could not hear a difference between these DACs when rapidly switching. I’m not going to say everyone wouldn’t be able to hear a difference, because telling people what they can or can’t hear is lame. However, when I listened to each DAC for a longer period of time (3+ songs), the fatigue I noticed with the Spring the first night continued to rear its head. I really can’t explain it… Treble sounds the same, with cymbals not sounding any different. It’s worth mentioning that in the past I found Chord DACs fatiguing (namely the Mojo 2 and Hugo 2).

With blind testing I covered the HPA4’s screen and held the input switch button for a random length of time so I didn’t know which input it had cycled to. I also listened for long periods of time (3+ songs) before looking and checking. I focused only on fatigue, since there was no tonal or technical difference I could perceive between the DACs. The first night I was 37.5% correct (3/8), the second night I was 70% correct (7/10), and the third night I was 72% correct (8/11). Results would likely be different if it were blind A/B/X testing, but this is enough of a conclusion for me.
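
For reference, the chance of scoring at least this well by pure guessing each night (and pooled across nights) works out with a one-sided binomial tail; a quick sketch using scipy:

```python
from scipy.stats import binom

# (correct, total) per night, from the results above
nights = [(3, 8), (7, 10), (8, 11)]

for correct, total in nights:
    # P(at least `correct` right out of `total`) under pure guessing (p = 0.5)
    p = binom.sf(correct - 1, total, 0.5)
    print(f"{correct}/{total}: p = {p:.3f}")

# Pooled across the three nights
hits = sum(c for c, _ in nights)      # 18
trials = sum(n for _, n in nights)    # 29
print(f"{hits}/{trials}: p = {binom.sf(hits - 1, trials, 0.5):.3f}")

# Approximate output: 3/8 -> 0.855, 7/10 -> 0.172, 8/11 -> 0.113, 18/29 -> 0.132
```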

Takeaways from this further cement my position as a centrist in the never-ending debates between objectivists and subjectivists. I genuinely cannot hear a difference in timbre, resolution, space… anything between these DACs, but I sure as hell feel a tingly sensation with the Spring 3 after a couple of songs. I guess I’m a synergist? Also, this experiment showed me the Bifrost 2/64 is a great DAC and, to my ear, a steal at less than a third of the price of the Spring 3 KTE.

Thanks for reading.

14 Likes

Sorry to necro this, but a note: -100 dB + -100 dB = -94 dB assumes that the signals sum coherently. For this to be the case, the changes must be identical for both, including in phase. Because noise is very unlikely (like, statistically impossible over a longer-than-a-minute timespan) to be in phase, it sums non-coherently, so two noise sources of equal level sum to +3 dB rather than +6 dB.

If the changes were completely out of phase, the sum would attenuate rather than increase - you could consider ESS’s software controlled 2nd and 3rd order correction to be a case of this.
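
A quick numerical illustration of the two cases, using synthetic noise so the +6 dB coherent vs. +3 dB non-coherent sums fall out of the arithmetic (so two correlated -100 dB sources sum to -94 dB, while two uncorrelated ones land around -97 dB):

```python
import numpy as np

rng = np.random.default_rng(0)

def level_db(x):
    """RMS level in dB relative to full scale = 1.0."""
    return 20 * np.log10(np.sqrt(np.mean(x ** 2)))

n = 1_000_000
a = rng.standard_normal(n)  # noise source A (RMS ~1, i.e. ~0 dB)
b = rng.standard_normal(n)  # independent noise source B at the same level

print(f"A alone:              {level_db(a):+.2f} dB")      # ~ +0.00 dB
print(f"A + A (coherent):     {level_db(a + a):+.2f} dB")  # ~ +6.02 dB
print(f"A + B (uncorrelated): {level_db(a + b):+.2f} dB")  # ~ +3.01 dB
```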

While I’m being annoying, it’s best to sum your trials for better robustness! In this case, you have 18 correct trials from 29 total, which has a cumulative probability of roughly 13%, or a bit over one in eight - not likely, but also far from impossible. The nice thing about adding trials is that it makes small but noticeable effects visible - 6/10 is not statistically significant, but the odds of getting 60/100 trials right are actually quite low!

7 Likes

Not if this task involves training effects. One might learn to hear/see/understand the differences over time. Many perceptual phenomena are a one-way street, such as the Stroop Effect for those fluent in a given language:

[image: Stroop Effect demonstration table (columns of color names printed in congruent vs. incongruent ink colors)]

It takes people much, much longer to process column 3 than column 2.

6 Likes

Awww, you inserted a table. For a minute I wondered if you had figured out how to make text in colors.

2 Likes

Good point! That’s not necessarily to say that one should sum his or her lifetime results, but I see folks doing a series of small trial runs in fairly close succession (because let’s be honest, who wants to sit down and AB an audio device 20 times in one evening?), and in that case you’ll get a more meaningful measure by aggregating. Not much to be learned from 4/4, but 17/18 is far more meaningful (although that then opens the question of what caused it, where level matching is a prime culprit).

2 Likes

Now it’s my turn to get annoying…we are heading into areas adjacent to my professional background…

Conducting just 4 trials may have little value in human testing. With 18 trials… validity remains a testable question. Hypothetically, performance could change over dozens of trials, and change again over weeks to months of regular use. Detection accuracy might fluctuate up and down, since greater exposure has all sorts of biological consequences, and people periodically become ill and age too.

Statistical testing (i.e., for human research) routinely uses power estimates to assess the validity of any research design or sample size. Regardless of whether testing is tedious, you’ve got to do what it takes to understand the probabilities of correct perception versus random chance (i.e., assess statistical validity). The appropriate test procedure will vary with (1) the effect size of the change to a system setup, (2) human variables such as training, fatigue, hearing thresholds, personal hearing loss, etc., (3) device calibration such as dB level matching, headphone pad pressure and thickness matching, cable weight matching, etc., and (4) double-blind methods to control for preferences/biases.
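
To make the power point concrete, here’s a minimal sketch under assumed numbers (a hypothetical listener with a true 70% detection rate, a one-sided 5% threshold, exact binomial throughout): it estimates how often an ABX session of n trials would actually come out significant.

```python
from scipy.stats import binom

ALPHA = 0.05      # one-sided significance threshold
TRUE_RATE = 0.7   # assumed real probability of a correct ABX answer (hypothetical)

for n in (10, 20, 40, 80):
    # Smallest hit count that would be significant at ALPHA under pure guessing
    k_crit = int(binom.ppf(1 - ALPHA, n, 0.5)) + 1
    # Power: chance a listener with TRUE_RATE actually reaches that criterion
    power = binom.sf(k_crit - 1, n, TRUE_RATE)
    print(f"n = {n:3d}: need >= {k_crit} correct, power ~ {power:.2f}")

# Approximate output:
# n =  10: need >= 9 correct, power ~ 0.15
# n =  20: need >= 15 correct, power ~ 0.42
# n =  40: need >= 26 correct, power ~ 0.81
# n =  80: need >= 48 correct, power ~ 0.98
```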

The audio industry often has solid electrical engineering, but would greatly benefit from incorporating many decades of work on human research methods. But it’s indeed tedious. But it’s indeed costly. But it still won’t affect personal preferences and ‘buying with your eyes’ because some equipment looks cool or disgusting regardless of how it sounds.

6 Likes

Also as @Mad_Economist touched on, the percentage of runs you get right in and of itself often doesn’t tell you much when ABXing.

What’s actually important is how likely it is that you would have gotten a result at least that good purely by chance, rather than by hearing a difference.
In statistics this is called the ‘P-Value’.

Say that in an ABX comparison you got 80% of your runs correct. If that was only 4/5 runs, there’s an 18.75% chance that you got that result just by guessing, so the result is not conclusive.

Do 10 runs, keeping the same 80% success rate, and the chance you got at least 8/10 right by guessing is now 5.47%.
Way lower (though still not low enough to be considered conclusive).

Do 40 runs and get 80% (32 runs) correct, and the chance you got that by guessing is now 0.0091%.

A lower % of correct runs is still a better result if it gives you a lower P-Value. Do more runs whenever you can.

e.g., getting 70% of 50 runs correct (35/50) is a more significant result than getting 90% of 10 runs correct.

35/50 has a P-Value of 0.0033 (0.33% chance the result was obtained by luck); 9/10 has a P-Value of 0.010742 (1.074% chance the result was obtained by luck).
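
If anyone wants to check these numbers (or score their own runs), the exact one-sided binomial p-values can be reproduced in a few lines; scipy is one option:

```python
from scipy.stats import binomtest

# (correct, total) pairs quoted above
runs = [(4, 5), (8, 10), (32, 40), (9, 10), (35, 50)]

for correct, total in runs:
    # One-sided exact test against p = 0.5 (pure guessing)
    result = binomtest(correct, total, p=0.5, alternative="greater")
    print(f"{correct}/{total}: p = {result.pvalue:.6f}")

# Prints roughly: 0.187500, 0.054688, 0.000091, 0.010742, 0.003300
```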

3 Likes

p-values are indeed what’s reported for significant effects, but they are a stumbling block in understanding the full scope and relevance of what’s going on. You can often achieve “human research publishable” p = 0.05 (5% error rate) significance simply by increasing the number of trials (e.g., 5 → 50 → 500 → 5,000). As the number of trials increases, the necessary effect size for p = 0.05 decreases (see the link on power above).
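
To put rough numbers on that, here’s a quick sketch of the smallest hit rate that clears a one-sided p = 0.05 under pure guessing as the trial count grows:

```python
from scipy.stats import binom

ALPHA = 0.05  # one-sided threshold

for n in (5, 50, 500, 5_000):
    # Smallest hit count whose one-sided p-value (vs. guessing) is <= ALPHA
    k_crit = int(binom.ppf(1 - ALPHA, n, 0.5)) + 1
    print(f"n = {n:5d}: {k_crit}/{n} correct ({k_crit / n:.1%}) is enough")

# Approximate output:
# n =     5: 5/5 correct (100.0%) is enough
# n =    50: 32/50 correct (64.0%) is enough
# n =   500: 269/500 correct (53.8%) is enough
# n =  5000: 2559/5000 correct (51.2%) is enough
```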

Believe it or not, psychic/telekinetic/magical research can achieve statistical significance and good p-values. This typically follows from minor methodological issues across many trials rather than genuine phenomena. [Insert some joke about audiophile 100-hour item burn-in.]

3 Likes

I think the key thing is to put a bit of subjective evaluation on the objective results.

Whilst 5% is often the ‘default’ value in many examples, when looking at something fairly clear-cut like whether an ‘obvious’ audible difference exists, I don’t think it’d be unfair to expect/ask for P-Values of 0.001, for example.
There’s ‘statistically significant’ and then there’s ‘beyond reasonable doubt’.

As someone doing a test, I’d want to be providing P-Values that didn’t really leave much in the way of doubt.
And as someone looking at results from someone else, to be honest, 5% still leaves things up for debate IMO.

But yeah, results and stats can be easily misrepresented.
I worked in private equity before audio and decisions that were made on misleading statistics were not at all uncommon…

3 Likes