AI News spoke with Damian Bogunowicz, a machine learning engineer at Neural Magic, to discuss the company's approach to deep learning model optimisation and inference on CPUs. Compound sparsity is a concept that combines techniques such as unstructured pruning, quantisation, and distillation to shrink neural networks while maintaining accuracy. Neural Magic's sparsity-aware runtime leverages CPU architecture to accelerate these sparse models, allowing practitioners to sidestep the limitations and costs associated with GPUs. Enterprises stand to benefit because up to 90 percent of a network's parameters can be removed without impacting accuracy. Large language models (LLMs) are a particularly exciting development, with potential applications in AI agents and natural language processing.
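To make the idea concrete, below is a minimal, illustrative PyTorch sketch of two of the ingredients mentioned above: unstructured magnitude pruning (removing roughly 90 percent of weights) followed by post-training dynamic quantisation. This is a simplified assumption-laden example, not Neural Magic's actual pipeline; in practice, compound sparsity interleaves pruning with fine-tuning and distillation during training, and the resulting model is exported for a sparsity-aware runtime.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# A small stand-in network; a real workload would use a full model.
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

# Unstructured magnitude pruning: zero out the 90% of weights with the
# smallest absolute values in each Linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.9)
        prune.remove(module, "weight")  # make the sparsity permanent

# Post-training dynamic quantisation: store Linear weights as int8.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# The sparse, quantised model could then be exported (e.g. to ONNX)
# and served on a sparsity-aware CPU runtime.
print(quantized)
```

The sketch omits the retraining and distillation steps that recover accuracy after aggressive pruning; it is intended only to show how sparsification and quantisation compose on the same model.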
