AMD has published new technical details outlining how its AMD Instinct MI355X accelerator addresses the growing inference ...
A new technical paper titled “Architecting Long-Context LLM Acceleration with Packing-Prefetch Scheduler and Ultra-Large Capacity On-Chip Memories” was published by researchers at Georgia Institute of ...
SOUN's hybrid AI model blends speed, accuracy, and cost control, outpacing LLM-only rivals in real-world deployments.
Rearranging the computations and hardware used to serve large language ...
A research article by Horace He and the Thinking Machines Lab (founded by ex-OpenAI CTO Mira Murati) addresses a long-standing issue in large language models (LLMs). Even with greedy decoding by setting ...
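The teaser points at a familiar pain point: even nominally greedy (temperature-zero) decoding can return different outputs across runs. As an illustrative sketch of the usual root cause, the toy Python example below (the values and reduction orders are hypothetical, not taken from the article) shows how floating-point summation order alone can shift a result, which is enough to flip an argmax between two near-tied tokens:

```python
import random

# Illustrative only: floating-point addition is not associative, so
# reductions that accumulate the same numbers in different orders
# (as can happen across different batch sizes or kernel schedules)
# yield slightly different results. When two candidate tokens have
# nearly tied logits, that tiny drift can flip the argmax, which is
# one way "greedy" decoding ends up nondeterministic in practice.
random.seed(0)
vals = [random.uniform(-1.0, 1.0) for _ in range(100_000)]

forward_sum = sum(vals)               # left-to-right accumulation
reverse_sum = sum(reversed(vals))     # same numbers, opposite order

print(forward_sum == reverse_sum)     # typically False
print(abs(forward_sum - reverse_sum)) # tiny but nonzero difference
```

The takeaway of the sketch is that exact bit-level reproducibility requires a fixed reduction order, not just a zero sampling temperature.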
A new technical paper titled “Breakthrough low-latency, high-energy-efficiency LLM inference performance using NorthPole” was published by researchers at IBM Research. At the IEEE High Performance ...
Forged in collaboration with founding contributors CoreWeave, Google Cloud, IBM Research, and NVIDIA, and joined by industry leaders AMD, Cisco, Hugging Face, Intel, Lambda, and Mistral AI, and university ...
SHANGHAI--(BUSINESS WIRE)--VeriSilicon (688521.SH) today announced that its ultra-low-energy, high-performance Neural Network Processing Unit (NPU) IP now supports on-device inference of large ...
The company tackled inference on the Llama 3.1 405B foundation model and just crushed it. And for the crowds at SC24 this week in Atlanta, the company also announced it is 700 times faster than ...