ThreadedDenseSparseMul.jl

ThreadedDenseSparseMul.jl is a Julia package that provides a threaded implementation of dense-sparse matrix multiplication, built on top of Polyester.jl.
This package addresses the need for fast computation of C ← C + D × S, where D and S are dense and sparse matrices, respectively. It differs from existing solutions in the following ways:
- SparseArrays.jl doesn't support threaded multiplication.
- MKLSparse.jl doesn't support dense × sparse-CSC multiplication directly.
- ThreadedSparseCSR.jl and ThreadedSparseArrays.jl support only sparse × dense multiplication.
ThreadedDenseSparseMul.jl shows significant performance improvements over base Julia implementations, especially for large matrices.
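For reference, here is a minimal illustration of the target operation in plain (single-threaded) Base Julia, on toy sizes:

```julia
using SparseArrays

D = rand(4, 6); S = sprand(6, 8, 0.5); C = zeros(4, 8)
C .+= D * S  # the dense × sparse-CSC update this package accelerates
```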
Performance

`fastdensesparsemul_threaded!` outperforms MKLSparse by 2x:

`fastdensesparsemul_outer_threaded!` outperforms SparseArrays by 4x:
Usage

To use ThreadedDenseSparseMul.jl, simply install and import the package, and launch Julia with multiple threads (e.g., `julia --threads=auto`). Then you can use any of the following accelerated functions:
```julia
using ThreadedDenseSparseMul
# ^ calls ThreadedDenseSparseMul.set_num_threads(Threads.nthreads()) during `__init__`
using SparseArrays

A = rand(1_000, 2_000)
B = sprand(2_000, 30_000, 0.05)
buf = similar(A, size(A, 1), size(B, 2))  # preallocate the output

fastdensesparsemul!(buf, A, B, 1, 0)
fastdensesparsemul_threaded!(buf, A, B, 1, 0)
fastdensesparsemul_outer!(buf, @view(A[:, 1]), B[1, :], 1, 0)
fastdensesparsemul_outer_threaded!(buf, @view(A[:, 1]), B[1, :], 1, 0)
```
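As a quick sanity check (illustrative, not part of the package docs), the results agree with the equivalent Base expressions:

```julia
using LinearAlgebra

fastdensesparsemul_threaded!(buf, A, B, 1, 0)
@assert buf ≈ A * B  # matches the Base dense × sparse product

outbuf = zeros(size(A, 1), size(B, 2))
fastdensesparsemul_outer_threaded!(outbuf, @view(A[:, 1]), B[1, :], 1, 0)
@assert outbuf ≈ A[:, 1] * B[1, :]'  # matches the Base rank-1 product
```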
The interface is adapted from the 5-parameter definition used by `mul!` and BLAS. To instead override `mul!` directly so you can run `C = A*B`, see the next section.
Overriding mul! with Type Piracy ☠️

If you are aware of the consequences, you can also commit type piracy (harrrr!) and override `mul!` and `(*)` by calling
```julia
ThreadedDenseSparseMul.override_mul!()       # threaded
# or
ThreadedDenseSparseMul.override_mul!(false)  # non-threaded
```
Then, you can just run e.g.
```julia
D = rand(100, 200); S = sprand(200, 500, 0.1);
C = D * S
C .+= D[:, 1] * (S[1, :]')
```
and the two multiplication calls will dispatch to this library (through `mul!` and `(*)` dispatch, respectively). Since the outer product doesn't go through `mul!` by default though, you can still manually call `fastdensesparsemul_outer_threaded!` to avoid an extra allocation, for example:
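```julia
# Allocation-free equivalent of the rank-1 update above; β = 1 accumulates
# into the existing contents of C (5-argument signature from the API reference).
fastdensesparsemul_outer_threaded!(C, @view(D[:, 1]), S[1, :], 1, 1)
```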
API Reference
ThreadedDenseSparseMul.fastdensesparsemul! — Method

```julia
fastdensesparsemul!(C, A, B, α, β)
```

BLAS-like interface computing `C .= β*C + α*A*B`, but much faster than the Base implementation.

Also see `fastdensesparsemul_threaded!` for a multi-threaded version using Polyester.jl.
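For instance (an illustrative snippet, not from the package docs), setting β = 1 turns the call into an in-place accumulation:

```julia
# C .= 1*C .+ 2.0 .* (A * B): accumulate a scaled product into the existing C
fastdensesparsemul!(C, A, B, 2.0, 1.0)
```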
ThreadedDenseSparseMul.fastdensesparsemul_outer! — Method

```julia
fastdensesparsemul_outer!(C, a, b, α, β)
```

Fast outer-product update computing `C .= β*C + α * a*b'`, but much faster than the Base implementation. Here `a` is a dense vector (or view), `b` is a sparse vector, and `C` is a dense matrix (or view).

Also see `fastdensesparsemul_outer_threaded!` for a multi-threaded version using Polyester.jl.
ThreadedDenseSparseMul.fastdensesparsemul_outer_threaded! — Method

```julia
fastdensesparsemul_outer_threaded!(C, a, b, α, β)
```

Threaded fast outer-product update computing `C .= β*C + α * a*b'`, using Polyester.jl. Here `a` is a dense vector (or view), `b` is a sparse vector, and `C` is a dense matrix (or view).

Also see `fastdensesparsemul_outer!` for a single-threaded version.
ThreadedDenseSparseMul.fastdensesparsemul_threaded! — Method

```julia
fastdensesparsemul_threaded!(C, A, B, α, β)
```

Threaded BLAS-like interface computing `C .= β*C + α*A*B`, but much faster than the Base implementation.

Also see `fastdensesparsemul!` for a single-threaded version.
ThreadedDenseSparseMul.get_num_threads — Method

```julia
ThreadedDenseSparseMul.get_num_threads()
```

Gets the number of threads used by this package's threaded multiplication routines.

ThreadedDenseSparseMul.set_num_threads — Method

```julia
ThreadedDenseSparseMul.set_num_threads(n::Int)
```

Sets the number of threads used by this package's threaded multiplication routines.
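For example (a quick sketch, assuming Julia was started with multiple threads):

```julia
using ThreadedDenseSparseMul

ThreadedDenseSparseMul.get_num_threads()   # defaults to Threads.nthreads() via __init__
ThreadedDenseSparseMul.set_num_threads(4)  # restrict the threaded kernels to 4 threads
```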
Implementation

The core implementation is quite simple, leveraging Polyester.jl for threading: each column of the sparse matrix yields an independent update of one output column. The result is something similar to

```julia
using Polyester, SparseArrays

function fastdensesparsemul_threaded!(C::AbstractMatrix, A::AbstractMatrix, B::SparseMatrixCSC, α::Number, β::Number)
    @batch for j in axes(B, 2)           # threads split B's columns among themselves
        C[:, j] .*= β                    # scale the output column in place
        C[:, j] .+= A * (α .* B[:, j])   # dense matrix × sparse column
    end
    return C
end
```
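The outer-product kernel can be sketched analogously by iterating only over the stored entries of the sparse vector. The following is an assumption about the approach (with a hypothetical name), not the package's exact source:

```julia
using Polyester, SparseArrays

# Hypothetical sketch of a threaded rank-1 update C .= β*C + α * a*b'
function outer_update_sketch!(C::AbstractMatrix, a::AbstractVector, b::SparseVector, α::Number, β::Number)
    β != 1 && (C .*= β)                  # scale the whole target up front
    cols = SparseArrays.nonzeroinds(b)   # columns of C touched by b'
    vals = SparseArrays.nonzeros(b)
    @batch for k in eachindex(cols)      # one dense column update per stored entry
        C[:, cols[k]] .+= (α * vals[k]) .* a
    end
    return C
end
```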
Contributing
Contributions to ThreadedDenseSparseMul.jl are welcome! Please feel free to submit issues or pull requests on the GitHub repository.
License
ThreadedDenseSparseMul.jl is licensed under the MIT License. See the LICENSE file in the package repository for more details.