[SSE][AVX][SIMD] Horizontal Sum (sum simd vector – intrinsic)

When working with SIMD intrinsics, most of the job is finding the right instructions for what you want to do.
That can be tricky, because there are usually several ways to get the same result.
And sometimes I simply forget that a given instruction exists, or that combining it with another one does the job.
So here is one way to compute the horizontal sum of a SIMD vector, that is, the sum of all its lanes.
There are other ways, but I want to avoid hadd (_mm_hadd_ps / _mm_hadd_pd) because it is expensive: on most current x86 CPUs it decodes into several micro-ops (typically two shuffles and an add).

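Note that the AVX functions mostly operate on __m128(d) values, but they use intrinsics from the AVX instruction set (such as _mm_permute_ps and _mm_permute_pd), so they require AVX support even though those operands are only 128 bits wide.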

SSE – Single

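    inline float HorizontalSumSse(const __m128 val) {
        // movehl copies the upper two lanes down: lanes 0 and 1 now hold v0+v2 and v1+v3
        const __m128 val02_13_22_33 = _mm_add_ps(val, _mm_movehl_ps(val, val));
        // shuffle with immediate 1 moves lane 1 into lane 0, so lane 0 becomes (v0+v2) + (v1+v3)
        const __m128 res = _mm_add_ss(val02_13_22_33, _mm_shuffle_ps(val02_13_22_33, val02_13_22_33, 1));
        return _mm_cvtss_f32(res);
    }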
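As a usage sketch, a function like the one above typically sits at the end of a loop: accumulate four lanes at a time, then collapse the accumulator once. The helper name SumArraySse is hypothetical (it is not part of the original snippets), and the block restates HorizontalSumSse so that it compiles on its own with nothing beyond SSE.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics

// Restated from the SSE single-precision section so the sketch is self-contained.
inline float HorizontalSumSse(const __m128 val) {
    const __m128 sums = _mm_add_ps(val, _mm_movehl_ps(val, val));
    const __m128 res = _mm_add_ss(sums, _mm_shuffle_ps(sums, sums, 1));
    return _mm_cvtss_f32(res);
}

// Hypothetical helper: sums `size` floats starting at `data`.
inline float SumArraySse(const float* data, const std::size_t size) {
    assert(data != nullptr || size == 0);
    __m128 acc = _mm_setzero_ps();
    std::size_t idx = 0;
    // Vertical adds in the loop: each lane accumulates every fourth element.
    for (; idx + 4 <= size; idx += 4) {
        acc = _mm_add_ps(acc, _mm_loadu_ps(data + idx));
    }
    // One horizontal sum at the very end, outside the hot loop.
    float sum = HorizontalSumSse(acc);
    // Scalar tail for the remaining 0 to 3 elements.
    for (; idx < size; ++idx) {
        sum += data[idx];
    }
    return sum;
}
```

Because the lanes are accumulated separately, the additions happen in a different order than in a plain scalar loop, so for general floating-point data the result can differ in the last bits.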

SSE – Double

    inline double HorizontalSumSse(const __m128d val) {
        const __m128d res = _mm_add_pd(val, _mm_shuffle_pd(val, val, 1));
        return _mm_cvtsd_f64(res);
    }

AVX – Single

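    inline float HorizontalSumAvx(const __m256 val) {
        const __m128 valupper = _mm256_extractf128_ps(val, 1);
        const __m128 vallower = _mm256_castps256_ps128(val);
        // valval = (a, b, c, d): lane-wise sum of the two 128-bit halves
        const __m128 valval = _mm_add_ps(valupper, vallower);
        // 0x1B reverses the lanes, so valsum = (a+d, b+c, c+b, d+a)
        const __m128 valsum = _mm_add_ps(_mm_permute_ps(valval, 0x1B), valval);
        // 0xB1 swaps neighbouring lanes; add to valsum (not valval) so every lane holds a+b+c+d
        const __m128 res = _mm_add_ps(_mm_permute_ps(valsum, 0xB1), valsum);
        return _mm_cvtss_f32(res);
    }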
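The same accumulate-then-reduce pattern can be sketched for AVX with eight floats per step. In this restatement of the single-precision AVX reduction, each permuted vector is added back to the vector it was permuted from, so lane 0 ends up holding the sum of all lanes. The names SumArrayAvx and EXAMPLE_TARGET_AVX are hypothetical; the macro assumes GCC or Clang and only exists so that the sketch builds without -mavx. With other compilers, enable AVX on the command line instead (for example /arch:AVX with MSVC).

```cpp
#include <cstddef>
#include <immintrin.h>  // AVX intrinsics

// Assumption: GCC/Clang per-function target attribute; expands to nothing elsewhere.
#if defined(__GNUC__) || defined(__clang__)
#define EXAMPLE_TARGET_AVX __attribute__((target("avx")))
#else
#define EXAMPLE_TARGET_AVX
#endif

EXAMPLE_TARGET_AVX
static inline float HorizontalSumAvx(const __m256 val) {
    const __m128 valupper = _mm256_extractf128_ps(val, 1);
    const __m128 vallower = _mm256_castps256_ps128(val);
    const __m128 valval = _mm_add_ps(valupper, vallower);  // (a, b, c, d)
    // Reverse the lanes and add: every lane holds a+d or b+c.
    const __m128 valsum = _mm_add_ps(_mm_permute_ps(valval, 0x1B), valval);
    // Swap neighbouring lanes and add: every lane holds a+b+c+d.
    const __m128 res = _mm_add_ps(_mm_permute_ps(valsum, 0xB1), valsum);
    return _mm_cvtss_f32(res);
}

// Hypothetical helper: sums `size` floats starting at `data`.
EXAMPLE_TARGET_AVX
float SumArrayAvx(const float* data, const std::size_t size) {
    __m256 acc = _mm256_setzero_ps();
    std::size_t idx = 0;
    // Vertical adds: each of the eight lanes accumulates every eighth element.
    for (; idx + 8 <= size; idx += 8) {
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(data + idx));
    }
    // Single horizontal reduction after the loop.
    float sum = HorizontalSumAvx(acc);
    // Scalar tail for the remaining 0 to 7 elements.
    for (; idx < size; ++idx) {
        sum += data[idx];
    }
    return sum;
}
```

SumArrayAvx takes a plain pointer and a length, so code compiled without AVX can call it without passing __m256 values across the call boundary. The CPU still has to support AVX at run time.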

AVX – Double

    inline double HorizontalSumAvx(const __m256d val) {
        const __m128d valupper = _mm256_extractf128_pd(val, 1);
        const __m128d vallower = _mm256_castpd256_pd128(val);
        const __m128d valval = _mm_add_pd(valupper, vallower);
        const __m128d res = _mm_add_pd(_mm_permute_pd(valval,1), valval);
        return _mm_cvtsd_f64(res);
    }