For audio purposes, as @ProfFalkin has said, we usually refer to “upscaling” as “upsampling” - and sometimes as “oversampling”. These terms are often used interchangeably, although a more proper application of them would be that “oversampling” means sampling at a higher rate than Nyquist for the source signal and “upsampling” means performing a sample-rate-conversion (SRC) from one already-sampled source to a higher rate.
While both “oversampling” and “upsampling” work to solve a similar problem, specifically making it easier to implement the necessary filters for sampling (brick-wall/anti-aliasing) and replay (reconstruction/anti-imaging), the first is applied at capture time in the ADC and the second on the replay side, in or ahead of the DAC (though there are some DAC architectures which also do internal “oversampling” earlier in their conversion steps).
If you sample a normal audio signal at 44.1 kHz, which is the CD standard, you need a brick-wall filter that ensures no content with a frequency higher than 22,050 Hz reaches the ADC (otherwise you’ll get aliases - i.e. false data - folded down into the audio band). If you want a flat response from 20 Hz to 20 kHz, then that means you have to attenuate the input from 0 dBFS to -96 dBFS over just 2,050 Hz. If you oversample the input at, say, 176.4 kHz, for the same audio content, your filter now simply has to go from 0 dBFS to -96 dBFS over a span of 68.2 kHz (88.2 kHz - 20 kHz), which is a much shallower curve and easier (and cheaper) to engineer reliably.
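To put numbers on that, here is a small sketch of the arithmetic. The function name and the 20 kHz / 96 dB defaults are just my own choices to match the example above, and the "dB per octave" figure is only a rough way to compare how steep the two filters need to be, not a filter design.

```python
import math

def transition_band(sample_rate_hz, passband_edge_hz=20_000.0, stopband_db=96.0):
    """Return (width_hz, octaves, db_per_octave) for the input filter.

    The filter must fall from 0 dBFS at the passband edge to -stopband_db
    at the Nyquist frequency (half the sample rate).
    """
    nyquist_hz = sample_rate_hz / 2.0
    width_hz = nyquist_hz - passband_edge_hz
    octaves = math.log2(nyquist_hz / passband_edge_hz)
    return width_hz, octaves, stopband_db / octaves

for rate in (44_100, 176_400):
    width, octaves, slope = transition_band(rate)
    print(f"{rate} Hz: {width:.0f} Hz wide, {octaves:.2f} octaves, ~{slope:.0f} dB/octave")
```

At 44.1 kHz the whole 96 dB drop has to happen in a small fraction of an octave; at 176.4 kHz it is spread over a little more than two octaves, which is why the analog filter gets so much easier to build.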
Remember that the input filter operates in the analog domain as it must occur prior to the signal reaching the ADC!
There’s a decent overview of it, with illustrations and examples, here. And I’m happy to get into a detailed discussion on specific aspects of it as needed/desired.
It is worth noting that many DACs, and in particular delta-sigma designs, already do their own upsampling - whether you want them to or not (though some allow you to choose if it happens, and sometimes by how much)!
Schiit’s entire multi-bit line oversamples (for Yggdrasil it is to 8x … or 8 fs - where “fs” is the base sample rate, so 44.1 kHz input gets upsampled to 352.8 kHz). Chord’s DACs do an even more extreme upsampling, in two stages; DAVE, for example, goes first to 16 fs and then further, to 256 fs.
These DACs also use proprietary filters (“Super Combo Burrito” for Schiit’s line, “Watts Transient-Aligned” for Chord’s for example). A typical filter, built into a DAC chip, might use 256 “taps”. When you see references to “tap length” or “filter length”, each “tap” is a specific conversion coefficient, and the longer the filter the more likely you are to get to conversion coefficients of zero. Higher sample rates require longer filters (more taps) to do this. There is no benefit to having a million-tap filter on raw 44.1 kHz (non-upsampled) content, as the vast majority of the taps will have a zero coefficient.
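For anyone who wants to see what "taps" look like in practice, here is a minimal sketch of a textbook upsampler: zero-stuff the input, then low-pass it with a Hann-windowed sinc FIR. This is emphatically not Schiit's or Chord's filter; the function names and the `taps_per_phase` parameter are made up for illustration, and it uses only the Python standard library (so it is slow).

```python
import math

def windowed_sinc_taps(factor, taps_per_phase=16):
    """FIR coefficients for upsampling by `factor` (Hann-windowed sinc).

    taps_per_phase should be even so the filter has an exact centre tap.
    """
    n_taps = factor * taps_per_phase + 1   # odd length, symmetric
    centre = (n_taps - 1) / 2
    taps = []
    for n in range(n_taps):
        t = (n - centre) / factor          # time in units of input samples
        sinc = 1.0 if t == 0 else math.sin(math.pi * t) / (math.pi * t)
        window = 0.5 - 0.5 * math.cos(2 * math.pi * n / (n_taps - 1))
        taps.append(sinc * window)
    return taps

def upsample(samples, factor, taps):
    """Insert factor-1 zeros after each sample, then apply the FIR filter."""
    stuffed = []
    for s in samples:
        stuffed.append(s)
        stuffed.extend([0.0] * (factor - 1))
    half = (len(taps) - 1) // 2            # centre the filter (no net delay)
    out = []
    for i in range(len(stuffed)):
        acc = 0.0
        for k, h in enumerate(taps):
            j = i + half - k
            if 0 <= j < len(stuffed):
                acc += h * stuffed[j]
        out.append(acc)
    return out

# 4x upsampling of a short sine wave
source = [math.sin(2 * math.pi * 0.05 * n) for n in range(32)]
result = upsample(source, 4, windowed_sinc_taps(4))
print(len(source), "->", len(result), "samples")
```

Each entry in `taps` is one coefficient. Because the sinc crosses zero at every whole input-sample offset, the original samples pass through essentially unchanged and only the new in-between samples are computed. Longer filters (more taps) approximate the ideal sinc more closely, at the cost of more multiply-adds per output sample.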
Moving from theory to practice, let’s talk about actual application and software - per the questions in the original post.
Upsampling can, indeed, be done in software. In fact, on both macOS and Windows, if you set the output rate for your audio device (e.g. via the Audio MIDI Setup utility on macOS) to a higher rate than the source material being played, then the OS will upsample the content on the fly.
This is generally NOT a desirable thing as you have no control over how this upsampling is done, and there are multiple approaches, filters and levels of precision that can be applied, which have different implications and potential artifacts - the built-in OS upsampling generally isn’t as good as dedicated software.
Of note, here, is what happens by default in Android-based systems. Android’s standard audio-stack assumes a sample rate of 48 kHz. Any source material not at a multiple of 48 kHz undergoes sample-rate-conversion. For example, standard streaming content, CD content, and most compressed audio will be resampled from 44.1 kHz to 48 kHz. This is a non-integer conversion, which makes the math and precision much more involved (and critical) than a simple powers-of-two conversion (e.g. 48 kHz -> 96 kHz).
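To see why 44.1 kHz to 48 kHz is the awkward case, reduce the two rates to a ratio. This is just arithmetic with Python's standard `fractions` module; `resample_ratio` is a name I made up, and it says nothing about how Android's resampler is actually implemented.

```python
from fractions import Fraction

def resample_ratio(source_hz, target_hz):
    """Return (up, down): the reduced integer ratio target/source."""
    ratio = Fraction(target_hz, source_hz)
    return ratio.numerator, ratio.denominator

print(resample_ratio(44_100, 48_000))  # (160, 147)
print(resample_ratio(48_000, 96_000))  # (2, 1)
```

Going from 48 kHz to 96 kHz just means one new sample between each existing pair. Going from 44.1 kHz to 48 kHz works out to 160/147, so a straightforward rational resampler would conceptually upsample by 160 (to 7.056 MHz), filter, and then keep every 147th sample. Real implementations use shortcuts, and those shortcuts are where quality is traded for power.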
More precise conversions and filters (e.g. an ideal sinc filter) are more demanding in terms of power (battery) and CPU than is ideal for a cellphone, and as a result those sample-rate-conversion implementations are optimized for power rather than quality. Thus we want to avoid that conversion in the device if we can, and this is one reason why Android-based DAPs sometimes tout having a custom audio stack to bypass this process.
…
Going further …
On a Mac or a PC, there are myriad ways to do upsampling in software. Many high-end music-player applications allow you to enable upsampling, and they generally implement much more sophisticated schemes than you’ll find built into the OS.
Audirvana+, for example, not only allows you to specify many of the details of how the upsampling is performed, and to what degree, but even lets you choose between two different upsampling engines, “SoX” (open source) and “iZotope”.
If you want more control, and even more sophisticated approaches, including control over things like filter type, tap-length and noise-shaping (required by all 1-bit, delta-sigma and DSD conversions), then you want to look at “HQPlayer”.
Most conversions, at sane upsampling rates, can be done easily on the fly. However, extreme upsampling, and the long, complex filters and noise-shapers you want to apply there, are VERY processing-power intensive. HQPlayer, for example, converting 44.1 kHz PCM to DSD512, and then using the highest fidelity poly-sinc filter and high-order noise-shaping, will require a dedicated multi-core computer (or significant GPU compute capacity) to work, and even then can have significant startup latency.
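For a sense of scale, here is the arithmetic behind those DSD rates. DSD rates are named as multiples of 44.1 kHz (DSD64 is 64 x 44.1 kHz, and so on); the function name is my own, and the comparison is simply raw bits per second per channel against 16-bit CD audio.

```python
BASE_RATE_HZ = 44_100  # CD sample rate, i.e. "1 fs"

def dsd_rate_hz(multiple):
    """Sample rate of 1-bit DSD at the given multiple of 44.1 kHz."""
    return BASE_RATE_HZ * multiple

# One channel of 16-bit PCM at 44.1 kHz, in bits per second
cd_bits_per_second = BASE_RATE_HZ * 16

for multiple in (64, 128, 256, 512):
    rate = dsd_rate_hz(multiple)
    ratio = rate / cd_bits_per_second
    print(f"DSD{multiple}: {rate / 1e6:.4f} MHz, {ratio:.0f}x CD's per-channel bit rate")
```

DSD512 means the filter and noise-shaper have to produce over 22 million output samples per second, per channel, which is where the CPU/GPU load comes from.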
Hardware up-samplers/filters originated when the required processing was more than was easily accommodated on reasonably priced general-purpose hardware/computers. Most of that is now handled in software in the real world (either on the computer or on a basic DSP chip in the DAC).
Extreme hardware up-sampling, and in particular the necessary filtering and noise-shaping you must apply to get the benefits of it, still requires serious processing power (as per the HQPlayer example above). This is where things like Chord’s M-Scalers come in … as they use a massively-parallel DSP approach to do both the upsampling and then the complex filtering and noise-shaping over very long tap length filters.
The Chord Hugo M-Scaler, which is to my knowledge the most advanced and extreme hardware audio upsampler/filter available, uses an FPGA that provides 740 DSP cores, and utilizes 528 of those in parallel to upsample to 4096 fs before applying a 1,015,808-tap implementation of Rob Watts’ “WTA” filter, and reducing the final output rate to something the DAC can handle (up to 768 kHz in the case of Chord’s newer DACs). Even with such powerful hardware on tap, this incurs about a 1.4 second latency. The result is effectively an ideal implementation of a sinc filter that optimally recovers the originally sampled data for material up to 44.1 kHz and 16 bits, and gets closer than anything else I’m aware of for higher rates and bit depths.
So, short version - you can experiment with upsampling (and filtering) in software. Doing so to a high degree requires special software and powerful hardware. Otherwise you can look at various hardware options, the highest-spec of which is, today, the M-Scaler. From there the rubber meets the road as you start to consider the audible effects of this processing vs. what it means in terms of math, theory and the demands/easements it enables on the actual hardware implementation.