PLENA: Asymmetric Precision Configurable Transformer Inference Acceleration Framework
Haoran Wu, Can Xiao, Jiayi Nie, Xuan Guo, Binglei Lou, Jeffrey T. H. Wong, Zhiwen Mo, Cheng Zhang, Przemyslaw Forys, Wayne Luk, Hongxiang Fan, Jianyi Cheng, Timothy M. Jones, Rika Antonova, Robert Mullins, Aaron Zhao
Abstract
LLMs now form the backbone of AI agents for a diverse array of applications, including tool use, command-line agents, and web or computer use agents. These agentic LLM inference tasks are fundamentally different from chatbot-focused inference --- they often have much larger context lengths to capture complex, prolonged inputs, such as entire webpage DOMs or complicated tool call trajectories. This, in turn, generates significant off-chip memory traffic for the underlying hardware at the inference stage and causes the workload to be constrained by two memory walls, namely the bandwidth and capacity memory walls, preventing the on-chip compute units from achieving high utilization. In this paper, we introduce PLENA, a hardware-software co-designed system that applies three core optimization pathways to tackle these challenges. PLENA includes an efficient hardware implementation of compute and memory units supporting an asymmetric quantization scheme. PLENA also features a novel flattened systolic array architecture that has native support for FlashAttention to tackle these memory walls in the scenario of inference serving for long-context LLMs. Additionally, PLENA is developed with a complete stack, including a custom ISA, a compiler, a cycle-emulated simulator, and an automated design space exploration flow. Simulated results show that PLENA achieves up to 8.5$\times$ higher utilization than existing accelerators, and delivers 2.24$\times$ higher throughput than the A100 GPU and 3.85$\times$ higher throughput than the TPU v6e, under the same multiplier count and memory settings. The full PLENA system will also be open-sourced.
Introduction
Transformer models have revolutionised AI across numerous fields, including language, vision, and science [34, 65, 69]. Decoder-only transformer-based autoregressive large language models (LLMs), like GPT [50] and LLaMA [63], are now widely deployed in many applications, such as real-time chatbots [49], code generation [32] and agentic tool-use and computer-use workflows [48].
The rapid rise of agentic LLM capabilities, e.g. computer use [41], tool use [27, 46], and command-line agents [1], relies heavily on their ability to process and reason over very long contexts. For instance, command-line agents need to both comprehend and generate large-scale codebases [30, 55, 71], while tool- and computer-use agentic workflows must keep track of multiple pieces of information across prolonged inputs, such as an entire web page DOM, which typically require very long contexts [12, 20, 35]. Figure 1(a) shows that, compared with chatbot workloads, agentic workloads consume 100 × more tokens per inference on average and up to 1,000 × in the extreme case. In response, modern LLMs have expanded their context windows: the original GPT-3 [11] supports roughly 2K tokens, whereas GPT-4 [50] reaches up to 32K tokens, and LLaMA4-Maverick [2] extends the context window to 1M tokens.
To clarify the computational impact of agentic workloads, Figure 1(b) analyzes a LLaMA 3.3 70B model with long-context capability and shows that, when the number of generated tokens is low, the Feed-Forward Networks (FFNs) account for most of the total inference FLOPs, whereas the attention layers become dominant as the number of generated tokens grows. Notice these two phases can happen in a single inference run since we are performing autoregressive decoding. For instance, in the Longwriter [8] workload, the
Background
Model Quantization
Quantization compresses LLMs by mapping high-precision floating-point parameters X into lower-bit representations. Following the standard integer quantization definition [47], we formalize the process over an arbitrary target data format under a single-level scaling scheme using three elements: the data format (𝜏), the scale factor (𝑠), and the zero point (𝑧).
A data format is defined as a tuple 𝜏 = (𝑑, 𝑏), where 𝑑 denotes the numerical datatype and 𝑏 is the bit-width specifying its precision. For a datatype 𝜏, the values it can represent are restricted to a finite interval. We denote this interval as the representable set:
$$
\Omega(\tau) = \{ x \in \mathbb{R} \mid \min\nolimits_\tau \le x \le \max\nolimits_\tau \} \tag{1}
$$
with min 𝜏 and max 𝜏 as the representable bounds. The scale factor 𝑠 maps the dynamic range of X into Ω ( 𝜏 ) , typically defined as:
$$
s = \frac{\max(X) - \min(X)}{\max\nolimits_\tau - \min\nolimits_\tau} \tag{2}
$$
while the zero-point 𝑧 shifts the range for alignment (with 𝑧 = 0 in symmetric quantization). Quantization then maps X into the target format as:
$$
X_q = \mathrm{clip}\!\left(\mathrm{RTN}\!\left(\frac{X}{s}\right) + z,\ \min\nolimits_\tau,\ \max\nolimits_\tau\right) \tag{3}
$$
where RTN (·) denotes round-to-nearest. To approximate the original tensor, the quant-dequant operator is:
$$
\hat{X} = s \cdot (X_q - z) \tag{4}
$$
In Equation 3, values exceeding the representable range are clipped, introducing the clipping error. In this work, we address this with a novel adaptive clipping search, described in Section 4.2.
As the tensor size grows, the probability of containing outliers increases, widening the dynamic range and amplifying clipping error. Prior work mitigates this by varying the granularity at which scale and zero-point parameters are shared: from per-tensor, to per-channel, to vector-wise schemes. In this work, we adopt block-wise micro-scaling datatypes (MXINT and MXFP), with both software and hardware implementations to support our data-format-aware co-design, which we defer to Section 4.2.
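To make the quantization, clipping, and dequantization steps defined above concrete, the following sketch implements the asymmetric quantize/dequantize operator with numpy; the function and parameter names are our own illustrative choices, not PLENA's.

```python
# Illustrative sketch of asymmetric integer quantize/dequantize with clipping.
import numpy as np

def quantize(x, bits=8):
    """Asymmetric integer quantization: returns (x_q, scale, zero_point)."""
    qmin, qmax = 0, 2**bits - 1                      # representable set of a UINT-b format
    s = (x.max() - x.min()) / (qmax - qmin)          # scale factor
    z = np.round(qmin - x.min() / s)                 # zero point aligns min(x) with qmin
    x_q = np.clip(np.round(x / s) + z, qmin, qmax)   # round-to-nearest, then clip
    return x_q, s, z

def dequantize(x_q, s, z):
    """Quant-dequant approximation of the original tensor."""
    return s * (x_q - z)

x = np.array([-1.0, -0.5, 0.0, 0.7, 2.3])
x_q, s, z = quantize(x, bits=4)
x_hat = dequantize(x_q, s, z)
# for values inside the dynamic range, round-to-nearest bounds the error by s/2
assert np.max(np.abs(x - x_hat)) <= s / 2 + 1e-9
```

Note that once outliers stretch `x.max() - x.min()`, the scale `s` grows and every in-range value loses precision, which is exactly the motivation for the finer-grained sharing schemes discussed above.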
Table 1: A comparison of LLM accelerators: most lack cycle-accurate simulators for RTL-level timing, omit accurate HBM simulation in evaluation, are constrained by the lack of an ISA with compiler support, and accelerate only a subset of kernels, resulting in restricted flexibility, the need to offload to GPUs/CPUs, frequent host-device transfers, and significant data-movement overheads.
L1: functional simulator; L2: cycle-accurate simulator; L3: cycle-accurate simulator with HBM enabled.
-: partial or planned open-source. Full inference coverage∗: all Transformer computations executed on-accelerator.

Figure 3: A typical setting of the MX data formats in this design. A scale is shared by a group of elements; the scale is quantized to a power of two, and the elements can be quantized to integer or minifloat formats.
Additionally, quantization approaches generally fall into two categories: quantization-aware training (QAT), which integrates quantization during fine-tuning, and post-training quantization (PTQ), which applies quantization directly to a pretrained model. Our PTQ method achieves accuracy competitive with full-precision baselines, even under aggressive low-bit, full-system quantization, as demonstrated in Table 3.
Microscaling. Microscaling (MX) data formats, proposed in prior work [56], define a standardized format that enables block-wise scaling sharing. These formats support multilevel scaling schemes. We adopt only the level-1 scaling strategy, as illustrated in Figure 3. The scaling factor in the MX format can be computed similarly to Equation (2), after which it is quantized using power-of-two (PoT) quantization. The data elements in MX formats can be represented either as integers or as minifloats. In our design, we include both representations in the search space to evaluate software performance.
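The level-1 MX scheme described above can be sketched as follows: each block shares one power-of-two scale, and elements are quantized to integers within the block. Block size, bit-width, and function names here are illustrative assumptions, not the MX specification's exact encoding.

```python
# Minimal sketch of level-1 MX-INT block quantization: one PoT scale per block.
import numpy as np

def mxint_quantize(x, block_size=32, bits=8):
    qmax = 2**(bits - 1) - 1                          # symmetric INT-b range
    blocks = x.reshape(-1, block_size)
    absmax = np.maximum(np.abs(blocks).max(axis=1, keepdims=True), 1e-30)
    # per-block scale, rounded up to a power of two (PoT quantization)
    scale = 2.0 ** np.ceil(np.log2(absmax / qmax))
    q = np.clip(np.round(blocks / scale), -qmax, qmax)
    return q, scale

def mxint_dequantize(q, scale):
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
x = rng.standard_normal(128).astype(np.float64)
q, scale = mxint_quantize(x, block_size=32, bits=8)
x_hat = mxint_dequantize(q, scale)
# per-element error is bounded by half of that block's (shared) scale
assert np.max(np.abs(x - x_hat)) <= scale.max() / 2 + 1e-12
```

Because each block's scale tracks only its own absmax, an outlier inflates precision loss for just its 32 neighbors rather than the whole tensor, which is the key advantage over per-tensor scaling.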
Quantization Comparison. In long-context scenarios, KV cache size is a key challenge [76], but hardware support for efficient quantization of KV cache remains limited [25, 54]. Existing frameworks often treat quantization in isolation rather than as part of full-system design, leaving gaps in non-GEMM operations and causing a mismatch between algorithmic advances and practical deployment on hardware.
FlashAttention
FlashAttention optimizes memory I/O in the standard attention layer [16]. In a standard attention layer, computing 𝑄𝐾 ⊤ produces a prohibitively large square matrix, often thousands by thousands in size. Because on-chip memory cannot hold this intermediate result, it must be written to off-chip memory and later reloaded for the subsequent softmax and 𝑃𝑉 steps, which significantly degrades performance. FlashAttention avoids this round trip by tiling and fusing the attention computation (GEMM-Softmax-GEMM) so that all intermediate results fit on-chip.
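The tiled GEMM-Softmax-GEMM fusion can be sketched with the textbook online-softmax recurrence; this is a functional reference for the algorithm, not PLENA's exact hardware schedule, and the tile size is an illustrative assumption.

```python
# Sketch of FlashAttention's tiled computation: the full score matrix S = QK^T
# is never materialized; K/V are streamed tile by tile while running row-wise
# max/sum statistics keep the softmax numerically exact.
import numpy as np

def flash_attention(Q, K, V, tile=16):
    n, d = Q.shape
    O = np.zeros((n, d))
    m = np.full(n, -np.inf)          # running row-wise max
    l = np.zeros(n)                  # running row-wise sum of exponentials
    for j in range(0, K.shape[0], tile):
        S = Q @ K[j:j + tile].T                      # (n, tile) partial scores
        m_new = np.maximum(m, S.max(axis=1))
        alpha = np.exp(m - m_new)                    # rescale old accumulators
        P = np.exp(S - m_new[:, None])
        l = l * alpha + P.sum(axis=1)
        O = O * alpha[:, None] + P @ V[j:j + tile]
        m = m_new
    return O / l[:, None]

rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 64, 8))
S = Q @ K.T
P = np.exp(S - S.max(axis=1, keepdims=True))
ref = (P / P.sum(axis=1, keepdims=True)) @ V         # naive attention reference
assert np.allclose(flash_attention(Q, K, V), ref)
```

The row-wise `max`, `exp`, and `sum` inside the loop are precisely the in-line reductions and nonlinear operations that a GEMM-only accelerator cannot express, which is why native FlashAttention support requires the coupled vector/scalar datapath described later.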
LLM-related Accelerators
Our Rust-based cycle-accurate simulator offers significant advantages over the functional-level simulators used in most published accelerators:
· Supports full cycle-accurate emulation.
· Event-driven simulation that directly executes the generated machine code from the compiler.
· HBM-enabled simulation, incorporating realistic HBM timing and bandwidth characteristics (via Ramulator [42]).
This simulator supports the same data types and precisions as the PLENA accelerator, and we verified that it generates results closely matching the RTL simulation of the accelerator.
PLENA Hardware System
The overall configuration of PLENA is shown in Figure 4. It is designed to support instruction-level pipelining and mainly consists of three compute units: the Matrix Unit, the Vector Unit, and the Scalar Unit. All units are highly configurable, supporting multiple data types and precisions, enabling the application of different quantization methods to the accelerator. PLENA also includes two main on-chip SRAM blocks. The Vector SRAM acts as a scratchpad for computation, storing frequently used data such as activations, which do not need to be written back to HBM, thereby reducing memory access overhead. The custom Matrix SRAM is dedicated to loading weights and KV tensors and supports reading data in either transposed or untransposed layouts with no additional overhead.
Overall Architecture
Hardware Support for Asymmetric Arithmetic Types
To support asymmetric quantization strategies (Section 4.1), PLENA natively offers multiple numeric formats, covering different data types and precisions, across its compute and memory units (Table 14). This asymmetric data-handling configuration has the following characteristics:
(i) Activations are stored in a high-precision floating-point (FP) format on-chip in the Vector SRAM, as they are more sensitive to quantization errors than KV or weights.
(ii) KV and weights, being less accuracy-sensitive, can be more aggressively quantized and staged in the Matrix SRAM using lower-precision MX formats (MX-FP or MX-INT).
(iii) An optional on-chip rotation step can suppress outliers before quantization to preserve accuracy.
Figure 4: PLENA architecture overview. Execution is controlled by the decoder's system-pipeline controller, which derives control signals from decoded instructions and monitors memory dependencies. For example, if the current instruction needs to read from a Vector SRAM row that is still being updated by the vector or matrix unit, the controller inserts a stall to ensure correctness. Vector SRAM acts as the on-chip scratchpad, providing data to the matrix and vector units and accepting their results.

Figure 5 illustrates the precision formats used by each unit and the dataflow between them. When appending newly computed 𝐾 and 𝑉 to the KV cache, we optionally apply a selective rotation (Hadamard transform) to suppress outliers before quantizing to MX-INT. Because 𝐾 and 𝑉 are consumed only by the attention layer's GEMM, they are loaded exclusively into the Matrix SRAM. Before use, the matrix unit applies the inverse Hadamard transform to de-rotate 𝐾 and 𝑉. These rotation/de-rotation stages can be selectively applied per tensor; for example, weights loaded into the matrix unit bypass the inverse transform.
Figure 5: Asymmetric-precision datapath example. Vector SRAM stores FP4 values, whereas Matrix SRAM stores MX-INT4 values. Green paths denote the selective rotational quantization flow: a fast Walsh-Hadamard transform is applied, with its inverse used to map back [51]. Blue paths indicate the data flow for the remaining computation.
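The rotation step's effect can be sketched with a fast Walsh-Hadamard transform (FWHT): because the transform is orthonormal, it spreads a single outlier's energy across the whole vector (shrinking the dynamic range the quantizer must cover) and is exactly undone by applying it again. The vector length and outlier magnitude below are illustrative assumptions.

```python
# Sketch of rotational outlier suppression via the fast Walsh-Hadamard transform.
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), h * 2):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(len(x))   # with this scaling, fwht is its own inverse

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
x[3] = 100.0                     # a single large outlier
rotated = fwht(x)
# the outlier's energy is spread over 64 coordinates, shrinking the dynamic range
assert np.abs(rotated).max() < np.abs(x).max()
# orthonormal self-inverse: de-rotation recovers the input
assert np.allclose(fwht(rotated), x)
```

Quantizing `rotated` instead of `x` lets a shared block scale stay small, and the consumer (here, the matrix unit) de-rotates after the GEMM's operand load, mirroring the green paths in Figure 5.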
Computational Units
Figure 6: Processing flow for the weight-activation GEMM. Because memory capacity constrains batch size, the 𝑀 dimension remains small. Setting BLEN = 𝑀 on the flattened systolic array yields near-100% utilization.

All compute units are optimized for feed-forward (FFN) and attention computations in transformer inference, with particular emphasis on long-context workloads. As shown in Figure 2(b), long-context workloads frequently involve fat GEMMs, where the batch-related dimension (typically 𝑀 in (𝑀, 𝐾) × (𝐾, 𝑁)) is much smaller than the others, resulting in uneven matrix shapes (Figure 6). The reduction dimension 𝐾 tends to be very long. For example, the weight-activation GEMM reduces over the model's hidden size (e.g., 4,096 for LLaMA-8B and 8,192 for LLaMA-70B). In addition, a variety of arithmetic operations, such as elementwise addition, summation, and special functions like the exponential, are required across long-dimension tensors.
Matrix Unit. To optimize GEMM in long-context workloads involving fat GEMMs, we propose flattened systolic arrays, enabling higher utilization across the entire fat GEMM computation flow. The unit computes a (BLEN, MLEN) × (MLEN, BLEN) GEMM and produces results of shape (BLEN, BLEN); normally BLEN is set to be much smaller than MLEN to match the workload characteristics of long-context LLM inference.

Figure 7: The flattened systolic array is composed of a series of smaller square-shaped systolic arrays arranged in a row to form the desired fat GEMM shape. Each receives inputs distributed from the MLEN vector buffers W and X, as shown in Figure 4.
This flattened systolic array is designed for output-stationary dataflow in order to maintain high utilization and avoid frequent reads/writes of partial sums, as well as the bubbles associated with streaming operands into the systolic array. As shown in Figure 6, operands stream along the large reduction dimension 𝐾 while partial sums remain resident in the PEs. The array is fully pipelined, eliminating bubbles between consecutive GEMM tiles.
The microarchitecture of the flattened systolic array is shown in Figure 7. It is built from a series of small square-shaped systolic arrays (sub-arrs), each consisting of a grid of processing elements (PEs). Each PE repeatedly performs multiply-accumulate operations and passes data to its neighboring PEs below and to the right across the array. As described in Section 3.1, the systolic array is designed to natively accept data in the MX format. The detailed PE configuration is provided in Figure 13.
On each cycle, the flattened systolic array fetches two MLEN-wide inputs, one from the Matrix SRAM (top) and one from the Vector SRAM (left). These inputs are buffered and reordered, then partitioned into MLEN/BLEN vectors (assuming MLEN is divisible by BLEN), each of length BLEN. Each vector is then fed to a corresponding sub-arr from the top and left directions.
However, a matrix unit composed solely of sub-arrs is insufficient to complete a (BLEN, MLEN) × (MLEN, BLEN) GEMM. Each sub-array accumulates only partial sums for a fragment of the final result; producing a complete (BLEN, BLEN) output requires a cross-array reduction that sums the partial sums held in the PEs across the tiled row. To address this, we integrate an output adder tree (see Figure 7) that performs the cross-array summation efficiently. This unit is invoked via a dedicated instruction, as only one cross-array summation is required when computing a GEMM along the large reduction dimension. This prevents bubbles and improves computational efficiency.
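The split of work between the sub-arrays and the output adder tree can be captured functionally: each sub-array reduces over its own BLEN-wide slice of the streamed operands, and the cross-array reduction sums their partial results. The dimensions and names below are illustrative, not PLENA's actual parameters.

```python
# Functional sketch of the flattened systolic array dataflow for a
# (BLEN, K) x (K, BLEN) fat GEMM, with K streamed in MLEN-wide slices.
import numpy as np

def flattened_systolic_gemm(X, W, BLEN, MLEN):
    """X: (BLEN, K) activations, W: (K, BLEN) weights; K a multiple of MLEN."""
    n_sub = MLEN // BLEN
    acc = np.zeros((n_sub, BLEN, BLEN))          # output-stationary partial sums
    for k in range(0, X.shape[1], MLEN):         # stream MLEN-wide operand slices
        for s in range(n_sub):                   # each sub-array gets a BLEN slice
            lo = k + s * BLEN
            acc[s] += X[:, lo:lo + BLEN] @ W[lo:lo + BLEN, :]
    return acc.sum(axis=0)                       # cross-array reduction (adder tree)

BLEN, MLEN, K = 4, 16, 64
rng = np.random.default_rng(1)
X = rng.standard_normal((BLEN, K))
W = rng.standard_normal((K, BLEN))
assert np.allclose(flattened_systolic_gemm(X, W, BLEN, MLEN), X @ W)
```

Because the per-sub-array accumulators stay resident across the whole K loop, the final `sum(axis=0)` is needed only once per output tile, which is why a single adder-tree instruction suffices.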
Vector Unit. This unit supports all vector operations required during LLM inference, including elementwise computations (e.g., addition, multiplication, and exponential) and reduction operations (e.g., summation, maximum). The vector dimension is parameterised by VLEN . A complete list of vector-unit instructions is provided in Table 12.
Scalar Unit. The scalar unit contains two separate ALUs supporting two types of computation: integer (INT) and floating point (FP). Both the INT and FP units are connected to their respective SRAMs and register files and operate independently.
INT operations are used primarily for on-chip address generation and indexing, and run on a control path decoupled from the FP datapath. In contrast, the FP unit implements basic arithmetic and the non-linear functions required by transformer workloads (e.g., exponential, reciprocal, and reciprocal square root (rsqrt)). To accommodate future models that may require additional special functions, we also include a look-up table (LUT) unit so new functions can be realized via table lookups without introducing additional logic.
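As a sketch of how a LUT unit can realize a special function without dedicated logic, the following approximates the exponential with a precomputed table plus linear interpolation. The table size, input range, and interpolation scheme are our illustrative assumptions, not PLENA's LUT configuration.

```python
# Sketch of LUT-based evaluation of exp(x) over a fixed range, as a scalar
# special-function unit might: table lookup + linear interpolation.
import numpy as np

TABLE_BITS = 8
LO, HI = -8.0, 0.0                                  # softmax exponents are <= 0
GRID = np.linspace(LO, HI, 2**TABLE_BITS)
TABLE = np.exp(GRID)                                # precomputed once, stored on-chip

def lut_exp(x):
    x = np.clip(x, LO, HI)
    idx = (x - LO) / (HI - LO) * (2**TABLE_BITS - 1)
    i = np.minimum(idx.astype(int), 2**TABLE_BITS - 2)
    frac = idx - i
    return TABLE[i] * (1 - frac) + TABLE[i + 1] * frac   # linear interpolation

x = np.linspace(-8.0, 0.0, 1000)
# a 256-entry table keeps the interpolation error well below 1e-3 on this range
assert np.max(np.abs(lut_exp(x) - np.exp(x))) < 1e-3
```

Adding a new function then amounts to loading a different table, with no change to datapath logic.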
Memory System
Our memory system is characterized by two key properties:
· Support for asymmetric precisions, variable-length memory transfers, and strided loads/stores to HBM.
· Latency hiding for HBM accesses via a hardware prefetcher, enabling high bandwidth utilization.
To make more effective use of HBM capacity, as discussed in Section 3.1, all data stored in HBM is kept in MX format. However, due to address alignment constraints, it is impractical to concatenate each data block with its associated per-block scales: the resulting combined size is seldom a power-of-two (2 𝑛 ) multiple, making it inefficient for the memory system.
To address this problem, we store the blocks and their scales separately: all blocks are laid out contiguously, followed by the corresponding scales at the end of the block region. With this technique, memory address alignment is preserved while locality is maintained. The resulting layout is shown in Figure 8.
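The address arithmetic for this split layout is simple enough to sketch directly; the block size and element/scale widths below are illustrative assumptions (e.g., an MX-INT8-like format with one byte-wide shared scale per 32 elements).

```python
# Sketch of the block/scale split layout: blocks contiguous, scales appended
# after the whole block region, so every block start stays power-of-two aligned.
BLOCK_ELEMS = 32                          # elements sharing one scale (assumed)
ELEM_BYTES = 1                            # e.g. 8-bit MX elements (assumed)
SCALE_BYTES = 1                           # e.g. an 8-bit shared exponent (assumed)
BLOCK_BYTES = BLOCK_ELEMS * ELEM_BYTES    # 32 bytes: a power-of-two record size

def block_addr(base, i):
    """Address of block i's element data."""
    return base + i * BLOCK_BYTES

def scale_addr(base, i, n_blocks):
    """Address of block i's scale, stored after the entire block region."""
    return base + n_blocks * BLOCK_BYTES + i * SCALE_BYTES

n_blocks = 1024
# interleaving block+scale would yield 33-byte records and break alignment;
# the split layout keeps every block start at a multiple of BLOCK_BYTES
assert all(block_addr(0, i) % BLOCK_BYTES == 0 for i in range(n_blocks))
assert scale_addr(0, 0, n_blocks) == n_blocks * BLOCK_BYTES
```

Given a transfer size and active precision, the controller can compute both addresses from the same base pointer, which is what lets it fetch the matching scales automatically.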
To support variable-length transfers, the HBM controller integrates two data-packing units. MX-format blocks fetched via TileLink [62] (the on-chip interconnect used to access the HBM controller) are repacked into (i) MLEN-wide vectors for the Matrix SRAM and (ii) VLEN-wide vectors for the Vector SRAM. The controller automatically locates and fetches the corresponding per-block scales based on the active precision and the requested transfer size. On the write path, dedicated units accept vectors from the Matrix and Vector SRAMs, partition them into MX blocks, attach the appropriate per-block scales, and commit the aligned layout back to HBM.
Figure 8: Data layout and interaction in HBM. Data of different precisions can be stored simultaneously according to the defined storage pattern in HBM. Strided load and store operations are managed by the address remap unit, which generates and passes strided addresses to the TileLink channel.

The loading logic is critical to fully utilizing the HBM memory bandwidth. The hardware load unit resides in both the Matrix and Vector SRAMs and is connected directly to the HBM controller. This enables background fetching and streaming into each SRAM while the rest of PLENA executes other instructions, sustaining full utilization of the matrix unit and avoiding stalls on HBM accesses. The two load units are controlled directly by instructions, with the amount of data to be loaded encoded in each instruction. For example, during weight-activation GEMMs, where GEMM operations are invoked repeatedly while streaming data across the hidden dimension, the loaded amount is set to this dimension, so the load instruction only needs to be issued once.
PLENA ISA
Our customized ISA is designed to cover all operations required for transformer inference. The instructions are structured to balance efficiency with flexibility and are built to support multiple transformer-based models and computation optimizations. In addition to FlashAttention, the ISA also supports different transformer variants, such as MHA, MLA [43], and MoE [5]. A brief summary is provided in Table 11, with the detailed specification given in Table 12.
To balance efficiency and flexibility, the ISA is designed to minimize overhead while maximizing utilization of compute and memory resources. This is achieved through features such as tile-level scheduling, which enables fine-grained control of computation and memory instructions at the tile granularity. Furthermore, the ISA defines dedicated instruction classes (Matrix, Vector, Scalar, Memory, and Control) that decouple responsibilities, simplify scheduling, and allow flexible mixing across different computation types.
The instructions (32 bits per instruction) are dynamically passed from the CPU to the instruction buffer via PCIe. The scalar unit contains an integer register file storing on-chip addresses. Vector- or matrix-related instructions control reads and writes to the matrix and vector SRAMs using the specified integer registers. Simple arithmetic operations in the scalar unit are used for address manipulation.
FlashAttention
Most current accelerators cannot execute FlashAttention natively because (i) they expose only GEMM primitives and lack in-line, row-wise reductions and nonlinear operations (max/sum, exp, div) required for the online softmax; (ii) they lack memory-layout support such as transpose-on-read and efficient strided/blocked streaming; and (iii) they rely on rigid ISAs with fixed scheduling and coarse-grained kernel boundaries, preventing tile-by-tile flexible execution.
In PLENA, we address (i) with tightly coupled vector and scalar units that implement the required reductions and elementwise operations; the vector unit's width is configurable to match the tile dimensions used by FlashAttention. For (ii), we introduce a Matrix SRAM that can be read in either standard or transposed order without extra data movement. In the QK⊤ step, explicitly transposing large tiles on the fly is costly in area, energy, and latency, and storing K⊤ in HBM is impractical because it complicates appending new K vectors to the existing K cache during decoding. The Matrix SRAM avoids both issues by banking the storage across multiple sub-SRAMs and using lightweight address remapping to present a transposed view at read time (implementation details in Figure 12). For (iii), our custom ISA offers composable, fine-grained control that enables persistent, tile-by-tile scheduling of the fused attention pipeline, so that each operation in FlashAttention can be controlled individually at the tile level. Combined, these capabilities allow PLENA to support FlashAttention natively.
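To make the fused pipeline concrete, the online-softmax recurrence that tile-by-tile FlashAttention execution relies on can be sketched in NumPy. This is a minimal software sketch: the tile size and tensor shapes are illustrative and do not correspond to PLENA's actual MLEN/VLEN parameters.

```python
import numpy as np

def online_softmax_attention(Q, K, V, tile=2):
    """Tile-by-tile attention with the online-softmax recurrence.

    K/V are processed in tiles while a running row-wise max (m), a
    running softmax denominator (l), and a rescaled partial output (O)
    are maintained, so the full score matrix is never materialized.
    """
    n_q, d = Q.shape
    m = np.full(n_q, -np.inf)          # running row-wise max
    l = np.zeros(n_q)                  # running softmax denominator
    O = np.zeros((n_q, d))             # running (unnormalized) output
    for start in range(0, K.shape[0], tile):
        Kt, Vt = K[start:start + tile], V[start:start + tile]
        S = Q @ Kt.T                   # scores for this tile
        m_new = np.maximum(m, S.max(axis=1))
        alpha = np.exp(m - m_new)      # rescale factor for old state
        P = np.exp(S - m_new[:, None])
        l = alpha * l + P.sum(axis=1)
        O = alpha[:, None] * O + P @ Vt
        m = m_new
    return O / l[:, None]
```

The per-tile max/sum, exp, and div steps above are exactly the row-wise reductions and elementwise operations listed in (i).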
Quantization
Asymmetric Quantization
The proposed quantization framework supports a wide range of datatypes and precisions. As shown in Figure 9, to accurately reflect hardware behavior in LLM architectures, the framework must satisfy two key requirements: 1) different operands within the same operation can be quantized to different datatypes and precisions, and 2) all operations in the model must be quantized. Table 2 summarizes existing quantization methods. Most of these approaches focus only on GEMMs; several support mixed precision, but none supports mixed data types. In contrast, our quantization flow allows both mixed precision and mixed data types in GEMMs, with all intermediate data between GEMM operations quantized.
For GEMM operations (e.g., linear layers and matrix multiplications between activations), the two operands can have two different precisions; e.g., INT4 activations multiplied
Table 2: Comparison of post-training quantization methods for LLMs across key features. (QW, QACT, QKV) denote quantization of weights, activations, and key-value cache, respectively. Each decoder layer in LLaMA contains nine matrix multiplications, as outlined in Algorithm 2. PLENA introduces the first accuracy evaluator supporting mixed MX datatypes, providing software emulation for MXINT, MXFP, and MiniFloat formats. Unlike prior approaches, it fully simulates hardware-precision behavior in software, extending quantization beyond matrix multiplications to embedding layers, LM output heads, and nonlinear operations such as RMSNorm, softmax, and SiLU (see Algorithm 2).
✓∗ denotes partial quantization support; ∗At the time of writing, MicroscopiQ has not yet released its code; the comparison is based on information obtained directly from the authors.

Figure 9: A dataflow graph of LLM workloads in PLENA. The blue lines indicate data represented in MX datatypes, and the orange lines indicate minifloat datatypes. The GEMM and projection layers are executed on the PLENA matrix unit, which takes inputs in MX formats and produces outputs in minifloats. All other operations are executed on the vector unit in minifloats.
with INT8 weights. The operands may also use two different datatypes, e.g., MXFP activations multiplied with MXINT weights. The GEMM operation also models the casting of its output to minifloat. For non-GEMM operations, which are executed on the vector machine in hardware, the data is stored as minifloats. When data flows from a non-GEMM operation to a GEMM, a cast module converts minifloats to the corresponding target formats, and this is also modelled in the quantization framework. Beyond this basic setup, we adopt and refine advanced quantization tactics that go beyond plain casting, including Hessian-based quantization optimization (GPTQ) and selective online activation rotation (QuaRot).
Fusing Output-Guided Blockwise Clipping into GPTQ
GPTQ was initially designed for integer quantization. When adapting GPTQ for MXFP/MXINT, we observe that the clipping range within each microscaling block significantly affects overall model performance. To address this problem, we propose a blockwise clipping range search method that minimizes the quantization error of each output block.
Algorithm 1 outlines the quantization process of PLENA. PLENA uses the per-microscaling-block quantization error to guide the search for the clipping range, and fuses this clipping-range optimization into GPTQ's iterative error propagation. This also mitigates the weight-outlier problem, which would otherwise distort the shared exponents in the MX format, and ultimately enables better end-to-end model performance.
Formally, let X ∈ ℝ^{M×K} be the inputs for calibration, and W ∈ ℝ^{N×K} the layer weights. Given a linear layer Y = XW⊤, slice the weights across the K dimension with the block size B (i.e., MLEN in Figure 6) defined in our MX data format τ, yielding W_b ∈ ℝ^{N×B} to be quantized. We also slice activations across the K dimension with the same block size, giving X_b ∈ ℝ^{M×B}. Let Quantize(·; p, τ) denote per-row quantization in data format τ with clipping percentile p. For each row i = 1, ..., N, we search for the percentile by
$$
p_i^{\star} = \operatorname*{arg\,min}_{p \in \mathcal{P}} \left\lVert \mathbf{X}_b \mathbf{W}_{b,i}^{\top} - \mathbf{X}_b \,\mathrm{Quantize}(\mathbf{W}_{b,i};\, p, \tau)^{\top} \right\rVert_2^2
$$
and get the quantized weight block:
$$
\widehat{\mathbf{W}}_{b,i} = \mathrm{Quantize}(\mathbf{W}_{b,i};\, p_i^{\star}, \tau)
$$
We now detail the per-block clipping search. Following the quantization definitions in Section 2.1, consider a block of weights w_τ in data format τ, with representable range [min_τ, max_τ] and empirical weight range [x_min, x_max]. Directly mapping the full weight range usually wastes precision due to extreme outliers. To mitigate this, we introduce a clipping parameter p ∈ 𝒫 ⊂ [0.5, 0.99], which shrinks the
Algorithm 1 PLENA L2-Norm-Guided Hessian-Based Weight Quantization
effective range to [p·x_min, p·x_max]. We then adopt the symmetric quantization setting with the zero-point fixed at z = 0; the corresponding scale factor is
$$
s = \frac{\max\left( \lvert p\, x_{\min} \rvert,\ \lvert p\, x_{\max} \rvert \right)}{\max_{\tau}}
$$
The blockwise quant-dequant operator then becomes
$$
\widehat{w} = s \cdot \mathrm{RTN}\!\left( \mathrm{clip}\!\left( \frac{w}{s},\ \min_{\tau},\ \max_{\tau} \right) \right)
$$
where RTN denotes round-to-nearest MX numbers.
By sweeping over a discrete candidate set P of clipping parameters, we evaluate multiple effective ranges and select the 𝑝 per block that minimizes the output reconstruction loss defined in Equation (5).
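The sweep above can be sketched as follows. This is a simplified stand-in: plain signed-integer round-to-nearest replaces the MX quant-dequant operator, and the candidate set and function names are illustrative.

```python
import numpy as np

def quantize_sym(w, p, n_bits=4):
    """Symmetric round-to-nearest quant-dequant with clipping ratio p.

    The effective range is shrunk to [p*min(w), p*max(w)] before the
    scale is derived; values outside it are clipped.
    """
    qmax = 2 ** (n_bits - 1) - 1
    s = max(abs(p * w.min()), abs(p * w.max())) / qmax
    if s == 0:
        return np.zeros_like(w)
    return s * np.clip(np.rint(w / s), -qmax, qmax)

def search_clip(Xb, wb, candidates=(0.5, 0.6, 0.7, 0.8, 0.9, 0.99)):
    """Pick the clipping ratio that minimizes the output reconstruction
    error ||Xb @ wb - Xb @ quant(wb; p)||_2 for one weight block."""
    ref = Xb @ wb
    errs = {p: np.linalg.norm(ref - Xb @ quantize_sym(wb, p))
            for p in candidates}
    return min(errs, key=errs.get)
```

Because the error is measured on the block's *output* rather than on the weights directly, the search naturally accounts for which weights matter most for the calibration inputs.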
Output-norm guidance.
Selective Online Activation Rotation
As shown in prior work, activations in LLMs typically contain more outliers than weights [37] and are therefore more sensitive to quantization. QuaRot [6] recently demonstrated that applying a rotation matrix to LLMs can effectively suppress outliers. However, we observe that rotating all the tensors suggested by QuaRot may not yield the best performance for MX formats. When a tensor (e.g., the weight matrix) does not exhibit significant outliers, the benefit of rotation diminishes. Equation (9) shows a simplified rotation mechanism from QuaRot [6], where a Hadamard matrix H smooths out the activation distributions and its inverse is fused into the weights. We performed experiments to empirically identify the activation tensors with extreme outliers and propose a selective online rotation scheme.
We notice that applying the rotation to finer-grained weight quantization (e.g., MXINT with smaller block sizes) may increase perplexity. Intuitively, weights have smaller dynamic ranges than activations. The rotation may be unnecessary since most weight outliers are effectively captured by the shared exponents, while permuting the weights with H leads to different quantized values, which may degrade model performance.
$$
\mathbf{Y} = \mathrm{Quantize}(\mathbf{X}\mathbf{H}) \;\mathrm{Quantize}(\mathbf{H}^{-1}\mathbf{W}^{\top})
$$
Since weights with fine-grained blocking do not need rotation, we propose an activation-only rotation strategy. As shown in Equation (10), the inverse rotation matrix H⁻¹ is decoupled from weight quantization and is instead applied directly to the quantized rotated activation at runtime.
$$
\mathbf{Y} = \left( \mathrm{Quantize}(\mathbf{X}\mathbf{H})\, \mathbf{H}^{-1} \right) \mathrm{Quantize}(\mathbf{W})^{\top}
$$
The activation distribution varies significantly across layers; consequently, the effect of rotation also differs from layer to layer. Rather than rotating all activations, we apply the rotation matrix selectively: a search identifies the layers where rotation yields the greatest benefit. This selective activation rotation is performed on the fly (the green paths in Figure 5). The ablation of the above quantization modifications is shown in Table 8.
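The activation-only rotation path can be sketched as below. This is a minimal sketch under stated assumptions: a Sylvester-constructed orthonormal Hadamard matrix, a per-tensor integer fake-quantizer standing in for the MX formats, and illustrative function names.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of 2),
    normalized so that H @ H.T == I (hence H^{-1} == H.T)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def fake_quant(x, n_bits=4):
    """Per-tensor symmetric round-to-nearest stand-in for MX quantization."""
    qmax = 2 ** (n_bits - 1) - 1
    s = np.abs(x).max() / qmax
    return s * np.clip(np.rint(x / s), -qmax, qmax)

def linear_act_rotated(X, W, rotate=True):
    """Activation-only rotation: quantize X @ H, undo the rotation with
    H^{-1} at runtime, and leave the weights unrotated when quantized."""
    if rotate:
        H = hadamard(X.shape[1])
        Xq = fake_quant(X @ H) @ H.T   # H orthonormal, so H^{-1} = H.T
    else:
        Xq = fake_quant(X)
    return Xq @ fake_quant(W).T
```

Whether `rotate=True` helps depends on the layer's activation outlier profile, which is exactly what the selective search decides per layer.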
Activation & KV Quantization
Table 8: Ablation study on quantization techniques, covering all nine GEMMs in the Llama-3-8B model. Quantization is configured with W_MXINT4, A_MXINT4, and KV_MXINT4 with block size 16.
Implementation and Evaluation of the Quantization Framework
Weight Optimization
PLENA Software Tooling
As shown in Table 1, existing works lack several key components necessary for complete end-to-end LLM inference, including a compiler, a simulator, and design-space exploration tools. In contrast, PLENA features a complete design and verification framework that allows it to rapidly adapt to new models or even new hardware accelerators and optimize for them. We also anticipate that future accelerators in the field could reuse components of this framework to establish end-to-end performance comparisons.
Compiler
To efficiently deploy decoder-style LLMs, we design a compiler stack targeting only LLM models on our PLENA hardware. The models are first exported from the PyTorch framework into the ONNX format [7], where standard graph optimizations such as constant folding are applied. The optimized graph is then parsed into our custom IR through pattern matching, which lowers high-level operators into primitives such as GEMM, quantization, dequantization, and FlashAttention.
The critical challenge lies in searching for an optimal scheduling strategy tailored to PLENA. Our scheduling policies include operator fusion, tiling configurations, memory
placement, and loop transformations, which jointly determine data reuse, memory traffic, and compute-unit utilization. To accelerate the search, we systematically traverse candidate configurations and validate them by checking memory-footprint constraints and transaction requirements. Feasible candidates are further evaluated by a lightweight roofline-based performance model, and finally, the top-K schedules are selected to generate the assembly code for execution on PLENA.
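The feasibility check followed by roofline ranking can be sketched as below. The peak-throughput, bandwidth, and SRAM numbers are illustrative placeholders, not PLENA's actual parameters.

```python
def roofline_time(flops, bytes_moved, peak_flops=1.0e15, peak_bw=3.35e12):
    """Lightweight roofline estimate used to rank candidate schedules.

    A schedule is compute-bound when its arithmetic intensity
    (flops / bytes) exceeds the machine balance point, and
    memory-bound otherwise; the slower wall dominates.
    """
    t_compute = flops / peak_flops
    t_memory = bytes_moved / peak_bw
    return max(t_compute, t_memory)

def rank_schedules(candidates):
    """Drop infeasible candidates, then sort the rest by modeled time.
    Each candidate is (name, flops, bytes_moved, sram_footprint_bytes)."""
    SRAM_BYTES = 4 * 2**20            # assumed on-chip budget
    feasible = [c for c in candidates if c[3] <= SRAM_BYTES]
    return sorted(feasible, key=lambda c: roofline_time(c[1], c[2]))
```

Taking the first K entries of the returned list corresponds to the top-K schedule selection described above.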
Cycle-accurate Simulator
Our Rust-based cycle-accurate simulator offers significant advantages over the functional-level simulators used in most published accelerators:
· Full cycle-accurate emulation.
· Event-driven simulation that directly executes the machine code generated by the compiler.
· HBM-enabled simulation, incorporating realistic HBM timing and bandwidth characteristics (via Ramulator [42]).
This simulator supports the same data types and precisions as the PLENA accelerator, and we verified that it produces results closely matching the RTL simulation of the accelerator.
Hardware-Software Co-Design
To automate the search for optimal hardware-design and quantization parameters, we employ active learning for design space exploration (DSE). We also provide the capability to investigate trade-offs between objectives, such as maximizing accuracy while minimizing latency and area. For this, we employ multi-objective Bayesian optimization (BO), which explores the Pareto frontier in an active manner.
BO is a framework for optimising non-differentiable functions [59]. Multi-objective BO searches for optimal points in the design space that minimize a multi-objective function f, i.e., f(x∗) = min_x f(x). In our case, the objective function has three components: perplexity, latency, and chip area: f = (f_p(·), f_l(·), f_a(·)). f is modelled with a multi-output Gaussian Process, which keeps track of the predictive mean and uncertainty for all points x in the design space. BO selects which candidate to evaluate next, such that uncertainty is reduced globally, but also returns to regions with high predictive mean to further improve upon previous points with favorable outcomes. BO scales to high-dimensional spaces [40, 66], supports both discrete and continuous search variables [9, 17, 19], and does not impose limiting restrictions on the properties of the objective f. Its model of the global posterior also facilitates interpretable analysis of the search results. Hence, this setup yields a flexible and informative framework for automating DSE.
We base our DSE implementation on the Optuna package [3] and conduct experiments with a BoTorch sampler

Figure 10: An overview of the open-source PLENA system.
and a tree-search sampler. With the BoTorch [9] sampler, we treat the design space as continuous during posterior modelling, but discretize the points proposed by BO when evaluating concrete design choices. We also test an alternative, the Tree-Structured Parzen Estimator [68], often used for discrete spaces.
In our co-design setup, we incorporate post-training quantization directly into the optimization loop. This allows us to evaluate candidate hardware and quantization configurations jointly, using pre-trained model weights while searching over quantization parameters such as datatype and precision settings for activations and KV cache. The joint search space is defined in Table 14. For each candidate design, we assess accuracy , latency , and area :
· Accuracy is measured in terms of language-modeling quality: we evaluate perplexity on Wikitext2 using our accuracy evaluator.
· Latency and area utilization are obtained from our roofline-based simulators, as illustrated in Figure 10.
To ensure efficient exploration, we impose input constraints over the design space (Table 15) and apply rejection sampling to discard invalid or infeasible candidates. This avoids unnecessary costly objective evaluations and accelerates convergence of the search. We first conduct experiments on Llama3.2-1B to enable rapid iteration, and then extend our evaluation to Llama-3-8B. The results are described in Section 5.3.
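The constrained sampling step can be sketched in plain Python. This is a toy sketch: the search-space names and the feasibility rule are illustrative, not the actual PLENA parameters or constraints from Tables 14 and 15.

```python
import random

# Toy joint hardware/quantization search space (illustrative names).
SPACE = {
    "mlen":     [64, 128, 256, 512],
    "act_bits": [4, 6, 8],
    "kv_bits":  [4, 8],
    "sram_kb":  [256, 512, 1024],
}

def sample(rng):
    """Draw one candidate configuration uniformly from the space."""
    return {k: rng.choice(v) for k, v in SPACE.items()}

def feasible(cfg):
    """Toy input constraint: a wide matrix unit must fit its tiles in
    SRAM. Rejecting here avoids a costly objective evaluation."""
    return cfg["mlen"] * cfg["act_bits"] <= cfg["sram_kb"] * 8

def propose(rng, max_rejects=1000):
    """Rejection sampling: draw until a feasible candidate appears."""
    for _ in range(max_rejects):
        cfg = sample(rng)
        if feasible(cfg):
            return cfg
    raise RuntimeError("no feasible candidate found")
```

In the real flow, each feasible candidate returned by `propose` would then be scored on perplexity, latency, and area by the evaluators in Figure 10.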
Evaluation
Experiment Setup
Models and Datasets. We evaluate our quantization framework mainly on two families of LLMs, namely LLaMA-2 [64] and LLaMA-3 [45]. We also demonstrate our system on MoE (e.g., GPT-OSS) and MLA-based (QWen-MLA) models. Quantization performance is measured in terms of perplexity on the WikiText-2 dataset [44]. The entire quantization process requires approximately 2-20 GPU hours on NVIDIA H100 GPUs, depending on the model size and configuration.
Quantization Baselines. We compare against several state-of-the-art quantization methods, including software-based approaches such as GPTQ [22], OmniQuant [61], and QuaRot [6], as well as hardware-accelerated approaches such as Atom [75] and MicroscopiQ [54].
Accelerator Implementation. PLENA is implemented in SystemVerilog RTL. We perform synthesis using Synopsys Design Compiler with the 7 nm OpenROAD predictive process design kit [13] to generate area and power estimates under a 1 GHz clock frequency.
Accelerator Baselines. Since the works we selected for comparison, MicroscopiQ [54], FIGNA [31], and Olive [25], are not open-source and were not evaluated using the same technology node or toolchain, we re-implemented their core components and integrated them into the PLENA system for a fair inference performance comparison. Additionally, DeepScale [58] is used for overall system performance estimation, scaling all designs to the 7 nm process. The detailed area and power of their core units are evaluated using our own implementations.
Inference Process. Beyond SOTA accelerators, we also evaluate against the latest high-performance commercial compute units, including GPUs (A100 80G) and TPUs (v6e-8). The GPU experiments are conducted in an environment with Ubuntu 22.04, CUDA 12.8, Python 3.11, PyTorch 2.8.0, and vLLM 0.10 V1. The TPU experiments are conducted with the v2-alpha-tpuv6e software and the vllm\vllm_tpu docker image.
Quantization compresses LLMs by mapping high-precision floating-point parameters X into lower-bit representations. Following the standard integer quantization definition [47], we formalize the process over an arbitrary target data format under a single-level scaling scheme using three elements: the data format (τ), the scale factor (s), and the zero point (z).
A data format is defined as a tuple τ = (d, b), where d denotes the numerical datatype and b is the bit-width specifying its precision. For a data format τ, the representable values are restricted to a finite interval, which we denote as the representable set:
$$
\Omega(\tau) = \left[ \min_{\tau},\ \max_{\tau} \right]
$$
with min_τ and max_τ as the representable bounds. The scale factor s maps the dynamic range of X into Ω(τ), typically defined as:
$$
s = \frac{\max(\mathbf{X}) - \min(\mathbf{X})}{\max_{\tau} - \min_{\tau}}
$$
while the zero-point 𝑧 shifts the range for alignment (with 𝑧 = 0 in symmetric quantization). Quantization then maps X into the target format as:
$$
\mathbf{X}_{\tau} = \mathrm{clip}\!\left( \mathrm{RTN}\!\left( \frac{\mathbf{X}}{s} \right) + z,\ \min_{\tau},\ \max_{\tau} \right)
$$
where RTN (·) denotes round-to-nearest. To approximate the original tensor, the quant-dequant operator is:
$$
\widehat{\mathbf{X}} = s \cdot \left( \mathbf{X}_{\tau} - z \right)
$$
In Equation 3, values exceeding the representable range are clipped, introducing the clipping error. In this work, we address this with a novel adaptive clipping search, described in Section 4.2.
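The definitions above can be made concrete with a short NumPy sketch of the symmetric case (z = 0, signed integers). This is a minimal sketch, not the MX implementation used later.

```python
import numpy as np

def quantize(X, n_bits=8):
    """Symmetric integer quantization per the definitions above:
    the scale s maps the dynamic range into the representable set,
    z = 0, RTN rounds, and out-of-range values are clipped."""
    qmax = 2 ** (n_bits - 1) - 1            # max_tau for signed integers
    s = np.abs(X).max() / qmax              # symmetric scale factor
    Xq = np.clip(np.rint(X / s), -qmax, qmax)
    return Xq, s

def dequantize(Xq, s):
    """Quant-dequant approximation of the original tensor."""
    return s * Xq
```

With this symmetric scale, no value is actually clipped and the round-trip error is bounded by half a quantization step, s/2; clipping error appears once the scale is shrunk below the tensor's dynamic range, as in the adaptive clipping search of Section 4.2.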
As the tensor size grows, the probability of encountering such outliers increases, widening the dynamic range and amplifying the clipping error. Prior work mitigates this by varying the granularity at which scale and zero-point parameters are shared: from per-tensor, to per-channel, to vector-wise schemes. In this work, we adopt block-wise micro-scaling datatypes (MXINT and MXFP), with both software and hardware implementations supporting our data-format-aware co-design, which we detail in Section 4.2.
Table 1: A comparison of LLM accelerators: most lack cycle-accurate simulators for RTL-level timing, omit accurate HBM simulation in evaluation, are constrained by the lack of an ISA with compiler support, and accelerate only a subset of kernels, resulting in restricted flexibility, the need to offload to GPUs/CPUs, frequent host-device transfers, and significant data-movement overheads.
L1: functional simulator; L2: cycle-accurate simulator; L3: cycle-accurate simulator with HBM enabled.
- : partial or planned open-source. Full inference coverage ∗ : all Transformer computations executed on-accelerator.

Figure 3: A typical setting of the MX data formats in this design. A scale is shared by a group of elements. The scale is quantized to a power of two, and the elements can be quantized to integers or minifloats.
Additionally, quantization approaches generally fall into two categories: quantization-aware training (QAT), which integrates quantization during fine-tuning, and post-training quantization (PTQ), which applies quantization directly to a pretrained model. Our PTQ method achieves accuracy competitive with full-precision baselines, even under aggressive low-bit, full-system quantization, as demonstrated in Table 3.
Microscaling. Microscaling (MX) data formats, proposed in prior work [56], define a standardized format that enables block-wise scale sharing. These formats support multi-level scaling schemes; we adopt only the level-1 scaling strategy, as illustrated in Figure 3. The scaling factor in the MX format can be computed similarly to Equation (2), after which it is quantized using power-of-two (PoT) quantization. The data elements in MX formats can be represented either as integers or as minifloats. In our design, we include both representations in the search space to evaluate software performance.
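A level-1 MX-style quantizer can be sketched as below. This is a simplified software sketch with integer elements only; the block size and bit-widths are illustrative, and the actual MX element encodings are richer than shown here.

```python
import numpy as np

def mx_quantize(x, block=16, elem_bits=4):
    """Level-1 MX-style quantization sketch: each block of `block`
    elements shares one power-of-two scale; elements are stored as
    signed integers of `elem_bits` bits. Returns the dequantized view."""
    qmax = 2 ** (elem_bits - 1) - 1
    out = np.empty_like(x, dtype=float)
    for i in range(0, x.size, block):
        blk = x[i:i + block]
        amax = np.abs(blk).max()
        if amax == 0:
            out[i:i + block] = 0.0
            continue
        # Smallest power-of-two scale that keeps |blk/s| within qmax.
        s = 2.0 ** np.ceil(np.log2(amax / qmax))
        out[i:i + block] = s * np.clip(np.rint(blk / s), -qmax, qmax)
    return out
```

Rounding the scale up to a power of two is what makes the hardware scale multiply a simple exponent add, at the cost of at most one extra bit of quantization step per block.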
Quantization Comparison. In long-context scenarios, KV cache size is a key challenge [76], but hardware support for efficient quantization of KV cache remains limited [25, 54]. Existing frameworks often treat quantization in isolation rather than as part of full-system design, leaving gaps in non-GEMM operations and causing a mismatch between algorithmic advances and practical deployment on hardware.
Table 10: This table investigates the effect of online rotation on activations in the linear layers. For the LLaMA2-7B model, applying rotation to the down_proj layer results in worse performance compared to not rotating it, whereas this effect is not observed in the LLaMA3-8B model. Moreover, rotating the o_proj layer severely degrades the performance of LLaMA3-8B. These results suggest that the effectiveness of rotation is highly model-dependent.
Full System Design
Our memory system is characterized by two key properties:
· Support for asymmetric precisions, variable-length memory transfers, and strided loads/stores to HBM. · Latency hiding for HBM accesses via a hardware prefetcher, enabling high bandwidth utilization.
To make more effective use of HBM capacity, as discussed in Section 3.1, all data stored in HBM is kept in MX format. However, due to address-alignment constraints, it is impractical to concatenate each data block with its associated per-block scales: the combined size is seldom a power-of-two (2^n) multiple, making it inefficient for the memory system.
To address this problem, we store the blocks and their scales separately: all blocks are laid out contiguously, followed by the corresponding scales at the end of the block region. This preserves memory-address alignment while maintaining locality. The resulting layout is shown in Figure 8.
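The address arithmetic implied by this layout can be sketched as a small helper; the byte sizes are illustrative, not the actual MX block or scale widths.

```python
def mx_addresses(base, n_blocks, block_bytes=16, scale_bytes=1):
    """Address helper for the block/scale split layout: element blocks
    are laid out contiguously from `base`, and the per-block scales are
    packed together immediately after the block region, so both regions
    keep power-of-two alignment. Returns (block_addr, scale_addr) per
    block index."""
    scale_base = base + n_blocks * block_bytes
    return [(base + i * block_bytes, scale_base + i * scale_bytes)
            for i in range(n_blocks)]
```

Because the scale region starts at a fixed offset from the base, the controller can locate a block's scale from the block index alone, which is what enables the automatic scale fetching described next.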
To support variable-length transfers, the HBM controller integrates two data-packing units. MX-format blocks fetched via TileLink [62] (the on-chip interconnect used to access the HBM controller) are repacked into (i) MLEN-wide vectors for the Matrix SRAM and (ii) VLEN-wide vectors for the Vector SRAM. The controller automatically locates and fetches the corresponding per-block scales based on the active precision and the requested transfer size. On the write path, dedicated units accept vectors from the Matrix and Vector SRAMs, partition them into MX blocks, attach the appropriate per-block scales, and commit the aligned layout back to HBM.
The loading logic is critical to fully utilizing HBM bandwidth. The hardware load unit resides in both

Figure 8: Data layout and interaction in HBM. Data of different precisions can be stored simultaneously according to the defined storage pattern in HBM. Strided load and store operations are managed by the address remap unit, which generates and passes strided addresses to the TileLink channel.
the Matrix and Vector SRAMs and is connected directly to the HBM controller. This enables background fetching and streaming into each SRAM while the rest of PLENA executes other instructions, sustaining full utilization of the matrix unit and avoiding stalls on HBM accesses. The two load units are controlled directly by instructions, with the amount of data to be loaded encoded in each instruction. For example, during weight-activation GEMMs, where GEMM operations are invoked repeatedly while streaming data across the hidden dimension, the load amount is set to this dimension, so the load instruction only needs to be issued once.
Co-design
This subsection presents the results of our design space exploration experiments. Figure 11 shows the Empirical Attainment Surfaces (EAS) for the Pareto fronts found when optimizing with Llama3.2-1B and Llama-3-8B. EAS is a visualization approach well-suited to conveying the uncertainty of Pareto fronts from multiple runs with different random seeds [21, 33]. Existing tools support visual analysis for two objectives [67], hence we plot EAS for accuracy and latency first; then, in Table 16, we analyze the relationship between all objectives. Figure 11 shows that active learning with the BoTorch sampler achieves a significantly better trade-off between latency and perplexity than naive random sampling. The Tree-Structured Parzen Estimator (TPE) shows more modest gains when optimizing with Llama3.2-1B, thus we focus on BoTorch for the experiments with Llama-3-8B.
Compute Performance
The system-level performance comparison is shown in Table 5, evaluating both small and large GQA-based LLaMA

Figure 11: Empirical Attainment Surfaces for latency ( ↓ ) and perplexity ( ↓ ) objectives across multiple seeds, evaluated with Llama3.2-1B and Llama-3-8B over the co-design space shown in Table 14. For the 1B model, we run 9 seeds with 50 trials, comparing BoTorch and TPE methods against Random sampling. For the 8B model, we run 5 seeds with 50 trials, comparing BoTorch against Random. Shaded regions show the 25% and 75% attainment bands across seeds.
models as well as the recently published MoE-based GPT-OSS model, all implemented in 7 nm technology and supporting long-context inputs. This experiment investigates peak TPS by scaling the batch size to the maximum capacity that HBM can accommodate. As shown, PLENA achieves consistently higher TPS than both the A100 and TPU v6e under identical HBM settings and multiplier counts, with peak performance reaching up to 2.24× that of the A100 and 3.85× that of the TPU v6e. The higher TTFT observed in PLENA is
explained by its ability to store more batches within the same HBM capacity using our quantization scheme. As batch size increases, the prefill stage grows longer due to additional memory accesses and computation.
Table 7: Compute area, utilization, and attainable FLOPs of systolic arrays under W4A4KV4 bitwidth for LLaMA-3.3-70B. Baselines use 64×64 arrays, while PLENA employs a flattened (4, 512) array. Results are shown for Standard (Prompt = 1k, Gen = 128) and Agentic (Prompt = 5.6k, Gen = 8k) workloads.
∗ Attainable FLOPs are computed from utilization and peak design throughput. Micro = MicroscopiQ. S. A FLOPs = Standard workload attainable FLOPs. A. A FLOPs = Agentic workload attainable FLOPs.
As shown in Table 7, PLENA achieves significantly higher utilization than prior designs in both short- and long-context workloads, with up to an 8.5× improvement in attainable utilization.
Conclusion
This paper introduces PLENA, a hardware-software co-designed system that features a flattened systolic array, an asymmetric quantization scheme, and native architectural support for FlashAttention, addressing the underutilization caused by the memory bandwidth and capacity walls. Beyond the hardware, PLENA is supported by a full toolchain, including a compiler, a cycle-accurate simulator, and a design space exploration framework, that enables rapid adaptation and optimization for emerging transformer models. Future work will focus on further optimizing GEMM utilization in FlashAttention and extending PLENA with a multi-core flattened systolic array to better exploit parallelism. In addition, the compiler can be enhanced to provide finer-grained control over execution scheduling. Finally, we plan to integrate PLENA with GPUs to form a heterogeneous LLM acceleration system.
Appendix
Ablation Study on Quantization Methods
Table 8: Ablation study on quantization techniques, covering all 9 GEMMs in the Llama-3-8B model. Quantization is configured as W_MXINT4, A_MXINT4, KV_MXINT4 with block size 16.
A/KV Datatype Search
Table 9: Perplexity on WikiText2 (lower is better) for various quantization settings applied to LLaMA-3 8B. We kept MXINT4 for weights.
Selective Rotation
Table 10: This table investigates the effect of online rotation on activations in the linear layers. For the LLaMA2-7B model, applying rotation to the down_proj layer results in worse performance compared to not rotating it, whereas this effect is not observed in the LLaMA3-8B model. Moreover, rotating the o_proj layer severely degrades the performance of LLaMA3-8B. These results suggest that the effectiveness of rotation is highly model-dependent.
Vector Core Datatype Search
Table 9: Perplexity on WikiText2 (lower is better) for various quantization settings applied to LLaMA-3 8B. We kept MXINT4 for weights.
Computation Flow of LLaMA Decoder-only Transformer
Algorithm 2: Computation flow of a LLaMA decoder-only Transformer with embedding, lm_head, and L layers: each decoder layer performs [MatMul1-9] interleaved with RMSNorm, RoPE, and nonlinear activations (Softmax, SiLU).
Matrix SRAM

Figure 12: This matrix SRAM supports both transposed and untransposed reads without additional cost. The key idea is to store each row of data separately across a set of sub-SRAMs, where the number of sub-SRAMs equals the vector dimension being stored. The row index assigned to each element differs across the sub-SRAMs, ensuring that elements from the same matrix column (green dotted line) are distributed across different sub-SRAMs. With this organization, when reading from the SRAM, whether in transposed or untransposed mode, each requested element resides in a different sub-SRAM. As a result, only one read port per sub-SRAM is required.
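A minimal software model of this diagonally skewed banking scheme illustrates why one read port per sub-SRAM serves both access patterns. This is our own sketch; the bank assignment (r + c) mod n and the addressing are illustrative assumptions consistent with the description, not the RTL.

```python
# Skewed banking for an n x n tile: element (r, c) lives in sub-SRAM
# (r + c) % n at address r. Any full row OR full column then touches each
# sub-SRAM exactly once, so one read port per bank supports both transposed
# and untransposed reads.

def write_tile(tile):
    n = len(tile)
    banks = [dict() for _ in range(n)]  # one dict per sub-SRAM: addr -> value
    for r in range(n):
        for c in range(n):
            banks[(r + c) % n][r] = tile[r][c]
    return banks

def read_row(banks, r):
    n = len(banks)
    # column c of row r lives in bank (r + c) % n; banks are all distinct
    return [banks[(r + c) % n][r] for c in range(n)]

def read_col(banks, c):
    n = len(banks)
    # row r of column c lives in bank (r + c) % n; banks are again distinct
    return [banks[(r + c) % n][r] for r in range(n)]
```

Within a row read, all banks are addressed at the same row index; within a column read, each bank is addressed at a different index, but still only one element is requested per bank.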
PE Array

Figure 13: In the hardware implementation of the PE array, elements and scales flow from top to bottom and from left to right. All computations are performed using integer arithmetic.
Custom ISA
Table 11: A summary of the PLENA customized ISA for the accelerator.
Custom Instructions
Table 12: Summary of Custom ISA Instructions.
Downstream Tasks
Table 13: Zero-shot accuracy of Llama-3 and Llama-2 models at 4 bits (A4W4KV4), comparing only with QuaRot, on PIQA (PQ), WinoGrande (WG), HellaSwag (HS), Arc-Easy (A-e), Arc-Challenge (A-c), and LAMBADA (LA). Baseline and QuaRot results are taken from QuaRot Tables 2 and 12.
Co-Design Space and Analysis
Table 14: Hardware and quantisation parameters co-design search space. Categorical parameters are one-hot encoded, integer parameters are expressed as a power of 2.
Table 16: Design space exploration on Llama-3-8B: multi-objective results for five configurations from a BoTorch run. We report perplexity (↓) from the accuracy evaluator, end-to-end latency (seconds, ↓), and area (µm², ↓) from the respective cost models. Perplexity is computed with GEMM-only emulation (nonlinear ops omitted) for faster iteration; therefore the FP setting affects latency and area but not the accuracy metric. We load weights pre-quantized to MXINT4 via our PTQ method and quantize activations and the KV cache on-the-fly during inference.
Processing Elements
Models and Datasets. We evaluate our quantization framework mainly on two families of LLMs, namely LLaMA-2 [64] and LLaMA-3 [45]. We also demonstrate our system on MoE (e.g., GPT-OSS) and MLA-based (QWen-MLA) models. Quantization performance is measured in terms of perplexity on the WikiText-2 dataset [44]. The entire quantization process requires approximately 2-20 GPU hours on NVIDIA H100 GPUs, depending on the model size and configuration.
Quantization Baselines. We compare against several state-of-the-art quantization methods, including software-based approaches such as GPTQ [22], OmniQuant [61], and QuaRot [6], as well as hardware-accelerated approaches such as Atom [75] and MicroscopiQ [54].
Accelerator Implementation. PLENA is implemented in SystemVerilog RTL. We perform synthesis using Synopsys Design Compiler with the 7 nm OpenROAD predictive process design kit [13] to generate area and power estimates under a 1 GHz clock frequency.
Accelerator Baselines. Since the works we selected for comparison, MicroscopiQ [54], FIGNA [31], and Olive [25], are not open-source and were not evaluated using the same technology node or toolchain, we re-implemented their core components and integrated them into the PLENA system for a fair inference performance comparison. Additionally, DeepScale [58] is used for overall system performance estimation, scaling all designs to the 7 nm process. The detailed area and power of their core units are evaluated using our own implementations.
Inference Process. Instead of comparing only with SOTA accelerators, we also evaluate against the latest high-performance commercial compute units, including GPUs (A100 80G) and TPUs (v6e-8). The GPU experiments are conducted in an environment with Ubuntu 22.04, CUDA 12.8, Python 3.11, PyTorch 2.8.0, and vLLM 0.10 V1. The TPU experiments are conducted in an environment with v2-alpha-tpuv6e software and the vllm\vllm_tpu docker image.
Compute Performance Experiment Settings
$$ \Omega(\tau) = \{\, x \in \mathbb{R} \;|\; \min_{\tau} \leq x \leq \max_{\tau} \,\}, $$
$$ s = \frac{\max(\mathbf{X})}{\max_{\tau}}, \label{eq:scaling_calculating} $$ \tag{eq:scaling_calculating}
$$ \label{equ:clip} X_\tau = \mathrm{clip}\!\left(\mathrm{RTN}\!\left(\tfrac{\mathbf{X}}{s}\right)+z,\ \min_{\tau},\ \max_{\tau}\right), $$ \tag{equ:clip}
$$ Q(\mathbf{X}; s, \tau) = s ( X_\tau - z ). $$
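Putting the four equations together, a minimal numeric sketch of the quantize/dequantize round trip might look as follows. The integer grid bounds and zero point are illustrative placeholders for a generic asymmetric format, not the exact hardware MX encoding.

```python
# Sketch of the asymmetric quantization pipeline:
#   s     = max(X) / max_tau                  (scaling factor)
#   X_tau = clip(RTN(X / s) + z, min_tau, max_tau)
#   Q     = s * (X_tau - z)                   (dequantization)
# RTN = round-to-nearest; the grid [min_t, max_t] and z are illustrative.

def quantize(xs, min_t, max_t, z=0):
    s = max(abs(x) for x in xs) / max_t                      # scaling factor
    q = [min(max(round(x / s) + z, min_t), max_t) for x in xs]  # RTN + clip
    return q, s

def dequantize(q, s, z=0):
    return [s * (v - z) for v in q]                          # s * (X_tau - z)
```

On an INT4-style grid [-8, 7], the value -1.0 maps to code -7 and dequantizes back to -1.0 exactly, while intermediate values incur at most half a quantization step of error.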
$$ \label{eq:rowwise-argmin} p_i^\star = \arg\min_{p\in\mathcal{P}} \left\| \mathbf{X}_b \Big( \mathbf{W}_{i,\,b:b+B-1} - \textsc{Quantize}(\mathbf{W}_{i,\,b:b+B-1};\,p,\tau) \Big)^{\!\top} \right\|_2^2 $$ \tag{eq:rowwise-argmin}
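The output-aware percentile search in Eq. (eq:rowwise-argmin) can be sketched as follows: for each candidate clipping percentile p, quantize a weight row with its range clipped, and keep the p that minimizes the error of the block's output X_b W^T rather than the weight error itself. The 4-bit grid and the fake_quant details are illustrative assumptions, not the exact Quantize operator.

```python
# Output-aware clipping-percentile search (hedged sketch).
# fake_quant clips the row's range to p * max|w| before uniform quantization;
# out_err measures || X_b w^T - X_b wq^T ||_2^2 for one output row.

def fake_quant(row, p, levels=7):
    m = p * max(abs(w) for w in row)             # clipped dynamic range
    s = m / levels if m > 0 else 1.0
    return [s * min(max(round(w / s), -levels - 1), levels) for w in row]

def out_err(xb, w_row, wq_row):
    return sum((sum(x * w for x, w in zip(xrow, w_row))
                - sum(x * w for x, w in zip(xrow, wq_row))) ** 2
               for xrow in xb)

def best_percentile(xb, w_row, candidates=(1.0, 0.9, 0.8, 0.7)):
    return min(candidates, key=lambda p: out_err(xb, w_row, fake_quant(w_row, p)))
```

The key point is that the argmin is taken over the output error, so a tighter clip that sacrifices an outlier weight can still win if it reduces the error of the layer's activations.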
$$ \mathbf{Y} = \mathrm{Quantize}(\mathbf{X}\mathbf{H}) \cdot \mathrm{Quantize}(\mathbf{H}^{-1}\mathbf{W}) \label{eq:quant_with_rotated_weight} $$ \tag{eq:quant_with_rotated_weight}
\caption{PLENA L2-Norm-Guided Hessian-Based Weights Quantization}
\label{alg:plena-wq}
\begin{algorithmic}[1]
% \small
\Require full-precision weight matrix $\mathbf{W} \in \mathbb{R}^{N \times K}$
\Require calibration activations $\mathbf{X} \in \mathbb{R}^{M \times K}$
\Require block size ${B}$ (i.e., $\text{MLEN}$) defined in our MX data format $\tau$
\Require percentile set $\mathcal{P}$, target format $\tau$
\Ensure quantized weight matrix $\mathbf{Q}$; block quantisation errors $\mathbf{E} $
\State Initialize quantized weights $\mathbf{Q} \gets \mathbf{0} \in \mathbb{R}^ {N \times K}$
\State Initialize quantisation errors $\mathbf{E} \gets \mathbf{0} \in \mathbb{R}^ {N \times B}$
\State $\mathbf{H}^{-1} = (2 \mathbf{X} \mathbf{X}^\top + \lambda \mathbf{I})^{-1}$
\State $\mathbf{H}^{-1} \gets \text{Cholesky}( \mathbf{H}^{-1})^\top$
\For{each block $b = 0, B, 2B, \ldots, K-1$}
\State $b_2 \gets \min(b{+}B,\, K)$
\State $\mathbf{W}_b \gets \mathbf{W}_{:, b:b_2}$ \Comment{Extract weight block}
\State $\mathbf{X}_b \gets \mathbf{X}_{:,\, b:b_2}$ \Comment{Extract activation block}
\State Initialize $\mathbf{Q}^{\text{best}}_b \gets \mathbf{0}$, $\boldsymbol{\epsilon}^{\text{best}} \gets \infty$
\For{each candidate percentile $p \in \mathcal{P}$}
\State $\widetilde{\mathbf{W}}_b \gets \textsc{Quantize}(\mathbf{W}_b, p, \tau)$ ; \quad $\boldsymbol{\epsilon} \gets \left\| \mathbf{X}_b \mathbf{W}_b^\top - \mathbf{X}_b \widetilde{\mathbf{W}}_b^\top \right\|^2_2$
\State $\mathbf{mask} \gets \boldsymbol{\epsilon} < \boldsymbol{\epsilon}^{\text{best}} $
\State $\mathbf{Q}^{\text{best}}_b[\mathbf{mask}, :] \gets \widetilde{\mathbf{W}}_b[\mathbf{mask}, :]$ ; \quad $\boldsymbol{\epsilon}^{\text{best}}[\mathbf{mask}] \gets \boldsymbol{\epsilon}[\mathbf{mask}]$
\EndFor
\State $\mathbf{Q}_{:, b:b_2} \gets \mathbf{Q}^{\text{best}}_b$ ; \quad $\boldsymbol{\Delta}_b \gets \mathbf{W}_b - \mathbf{Q}^{\text{best}}_b $ ; \quad $\mathbf{d}_{bb} \gets \mathrm{diag}\bigl(\mathbf{H}^{-1}_{\, b:b_2,\, b:b_2}\bigr)$
\State $\mathbf{E}_b \gets \boldsymbol{\Delta}_b \, \mathrm{diag}(\mathbf{d}_{bb})^{-1} $ ; \quad $\mathbf{W}_{:,\, b_2:} \gets \mathbf{W}_{:,\, b_2:} \;-\; \mathbf{E}_b \cdot \mathbf{H}^{-1}_{\, b:b_2,\, b_2:}$
\EndFor
\end{algorithmic}
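The error-feedback structure of this algorithm can be sketched in pure Python, simplified to block size B = 1 and a plain round-to-nearest quantizer in place of the percentile search; `hinv` stands in for the (Cholesky-factored) inverse Hessian surrogate. This is our own simplification for exposition, not the production quantizer.

```python
# GPTQ-style blockwise quantization with error feedback (B = 1 sketch):
# quantize each column, then push the scaled residual
#   E_b = (W_b - Q_b) / Hinv[b][b]
# into the not-yet-quantized columns via W[:, b2:] -= E_b * Hinv[b, b2:].

def quantize_with_feedback(W, hinv, qstep=1.0):
    n, k = len(W), len(W[0])
    W = [row[:] for row in W]                         # work on a copy
    Q = [[0.0] * k for _ in range(n)]
    for b in range(k):                                # column == block (B = 1)
        d = hinv[b][b]
        for i in range(n):
            Q[i][b] = qstep * round(W[i][b] / qstep)  # quantize this column
            e = (W[i][b] - Q[i][b]) / d               # scaled residual
            for j in range(b + 1, k):
                W[i][j] -= e * hinv[b][j]             # compensate later columns
    return Q
```

With an identity `hinv`, no error propagates and the result reduces to plain rounding, which makes the role of the off-diagonal Hessian terms easy to isolate.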
\caption{Computation flow of a LLaMA decoder-only Transformer with embedding, lm\_head, and with \(\,L\,\) layers: each decoder layer performs \mat{1--9} interleaved with \nonlin{RMSNorm}, \linop{RoPE} , and nonlinear activations (\nonlin{Softmax}, \nonlin{SiLU}).}
\label{alg:llama-decoder}
\begin{algorithmic}[1]
\small
\Require $t \in [V]^T$ \Comment{token ids}
\Require $B,T,d,L,H,H_{\mathrm{kv}}$ \Comment{batch, seq, hidden\_dim, \#layers, \#Q heads, \#KV heads}
\Require $(\cos\theta,\sin\theta)$ \Comment{RoPE parameters}
% ---- Embedding (minimal) ----
\State $X^{(1)} \gets \textsc{Embed}(t)$ \Comment{$X^{(1)} \in \mathbb{R}^{B\times T\times d}$}
% ---- Decoder stack ----
\For{$\ell = 1$ \textbf{to} $L$}
\State \textit{Layer input:} $X^{(\ell)} \in \mathbb{R}^{B\times T\times d}$
\State $X_n \gets \nonlin{RMSNorm}(X^{(\ell)})$
\State $Q \gets X_n W_Q$ \hfill \mat{1}
\State $K \gets X_n W_K$ \hfill \mat{2}
\State $V \gets X_n W_V$ \hfill \mat{3}
\State $(Q, K) \gets \linop{RoPE}(Q,K;\cos\theta,\sin\theta)$
\State $(K, V) \gets \mathrm{RepeatGroups}(K, V, H/H_{\mathrm{kv}})$ \Comment{GQA}
\State $A_w \gets \nonlin{Softmax}\!\left(\tfrac{QK^{\top}}{\sqrt{d_h}}\right)$ \hfill \mat{4}
\State $A_w \gets A_w V$ \hfill \mat{5}
\State $A_o \gets A_w W_O$ \hfill \mat{6}
\State $X' \gets X^{(\ell)} + A_o$ \Comment{residual add}
\State $X_n' \gets \nonlin{RMSNorm}(X')$
\State $X_{\text{act}} \gets \nonlin{SiLU}(X_n' W_\text{up})$ \hfill \mat{7}
\State $X_{\text{gate}} \gets X_n' W_\text{gate}$ \hfill \mat{8}
\State $X_{\text{mlp}} \gets (X_{\text{act}} \odot X_{\text{gate}}) W_\text{down}$ \hfill \mat{9}
\State $X^{(\ell+1)} \gets X' + X_{\text{mlp}}$ \Comment{residual add}
\EndFor
% ---- LM head (minimal) ----
\State $\text{logits} \gets X^{(L+1)} W_{\text{LM}}$ \hfill
\State $\hat{p} \gets \nonlin{Softmax}(\text{logits})$
\State \Return $\hat{p}$
\end{algorithmic}
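The same computation flow can be traced in a toy single-batch, single-head sketch that executes MatMul1-9 in order. RoPE and the GQA group-repeat are omitted for brevity, and all weights are illustrative; this follows the algorithm as written above (SiLU applied to the up projection), not any particular framework implementation.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def rmsnorm(X, eps=1e-6):
    return [[v / math.sqrt(sum(u * u for u in row) / len(row) + eps) for v in row]
            for row in X]

def softmax_rows(X):
    out = []
    for row in X:
        m = max(row)
        e = [math.exp(v - m) for v in row]
        s = sum(e)
        out.append([v / s for v in e])
    return out

def decoder_layer(X, Wq, Wk, Wv, Wo, Wup, Wgate, Wdown):
    d_h = len(Wq[0])
    Xn = rmsnorm(X)
    Q, K, V = matmul(Xn, Wq), matmul(Xn, Wk), matmul(Xn, Wv)    # MatMul1-3
    scores = [[s / math.sqrt(d_h) for s in row]
              for row in matmul(Q, list(map(list, zip(*K))))]   # MatMul4: Q K^T
    Aw = matmul(softmax_rows(scores), V)                        # MatMul5
    Xp = [[x + a for x, a in zip(xr, ar)]
          for xr, ar in zip(X, matmul(Aw, Wo))]                 # MatMul6 + residual
    Xn2 = rmsnorm(Xp)
    act = [[u / (1 + math.exp(-u)) for u in row]
           for row in matmul(Xn2, Wup)]                         # MatMul7 + SiLU
    gate = matmul(Xn2, Wgate)                                   # MatMul8
    mlp = matmul([[a * g for a, g in zip(ar, gr)]
                  for ar, gr in zip(act, gate)], Wdown)         # MatMul9
    return [[x + m for x, m in zip(xr, mr)] for xr, mr in zip(Xp, mlp)]
```

Tracing the sketch makes the phase split in Figure 1(b) concrete: MatMul1-3 and 6-9 touch only weights (FFN-like cost), while MatMul4-5 grow with the context length through K and V.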
Table 17: Configuration settings for compute performance experiments, chosen to match the multiplier count of the A100 GPU. For MicroscopiQ, MLEN and BLEN are set to the same value to form a square shape.
| | PLENA | PICACHU [53] | MicroScopiQ [54] | FlightLLM [73] | Tender [36] | FIGNA [31] | SystolicAttention [38] | Olive [25] |
|---|---|---|---|---|---|---|---|---|
| Simulator | L3 | L1 | L2 | L1 | L1 | L1 | L3 | L1 |
| Custom ISA & Auto Code Gen | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ | ✓ | ✗ |
| DSE | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| FlashAttention support | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ✗ |
| Full inference coverage ∗ | ✓ | ✗ | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ |
| Open source | ✓ | ✗ | ✗ | ✗ | ✓ | | | |
| | PLENA | MicroScopiQ* [54] | GPTQ [22] | QuaRot [6] | OmniQuant [60] | SmoothQuant [70] | Atom [75] | KiVi [76] | M-ANT [29] |
|---|---|---|---|---|---|---|---|---|---|
| (QW, QACT, QKV) | (✓, ✓, ✓) | (✓, ✓, ✓) | (✓, ✗, ✗) | (✓, ✓, ✓) | (✓, ✓, ✓) | (✓, ✓, ✓) | (✓, ✓, ✓) | (✗, ✗, ✓) | (✓, ✓, ✓) |
| No. GEMMs | 9/9 | 9/9 | 7/9 | 7/9 | 9/9 | 7/9 | 9/9 | 0/9 | 9/9 |
| Nonlinear_FN | ✓ | ✗ | ✗ | ✗ | ✓ * | ✗ | ✗ | ✗ | ✗ |
| Embedding & lm_head | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| RMSNorm | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| RoPE | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| Supported datatypes | MXFP, MXINT, MiniFloat | MXFP, MXINT | INT | INT | INT | INT | INT, FP | INT | MANT |
| Mixed-precision | ✓ | ✓ | ✗ | ✓ | ✓ | ✓ | ✓ | ✗ | ✓ |
| Method | W/A/KV | LLaMA-2 [64] 7B | LLaMA-2 [64] 13B | LLaMA-2 [64] 70B | LLaMA-3 [45] 8B | LLaMA-3 [45] 70B |
|---|---|---|---|---|---|---|
| Baseline | 16/16/16 | 5.47 | 4.83 | 3.31 | 6.13 | 2.85 |
| GPTQ [22] | 4/16/16 | 6.23 | 5.58 | 4.28 | 8.12 | 3.75 |
| AWQ [39] | 4/16/16 | 5.82 | 5.19 | 4.08 | 7.96 | 3.58 |
| OmniQuant [61] | 4/16/16 | 5.74 | 5.02 | 3.47 | 7.09 | 3.46 |
| MicroScopiQ [54] | 4/16/16 | 5.65 | 5.02 | 3.42 | 6.89 | 3.25 |
| QuaRot [6] | 4/16/16 | 5.60 | 5.00 | 3.41 | 6.52 ∗ | 3.53 ∗ |
| Ours | 4/16/16 | 5.61 | 4.97 | 3.41 | 6.45 | 3.59 |
| OmniQuant [61] | 4/4/16 | 11.47 | 8.32 | 5.41 | 10.21 | 5.30 |
| SmoothQuant [70] | 4/4/16 | 20.47 | 15.63 | 17.62 | 29.54 | 19.32 |
| Atom [75] | 4/4/16 | 6.16 | 6.12 | 5.20 | 8.12 | 4.69 |
| MicroScopiQ [54] | 4/4/16 | 6.11 | 5.57 | 4.48 | 8.12 | 4.65 |
| QuaRot [6] | 4/4/16 | 6.02 ∗ | 5.36 ∗ | 3.78 | 8.00 ∗ | 6.33 ∗ |
| M-ANT [29] | 4/4/16 | 5.92 | 5.24 | - | - | - |
| Ours | 4/4/16 | 5.69 | 5.03 | 3.59 | 6.76 | 4.51 |
| QuaRot [6] | 4/4/4 | 6.10 | 5.40 | 3.79 | 8.16 | 6.66 |
| QuaRot-128G [6] | 4/4/4 | 5.93 | 5.26 | 3.61 | 7.36 | 5.51 |
| Ours | 4/4/4 | 5.89 | 5.18 | 3.62 | 7.22 | 4.77 |
| Method | W/A/KV | LLaMA-2 [64] 7B | LLaMA-2 [64] 13B | LLaMA-2 [64] 70B | LLaMA-3 [45] 8B | LLaMA-3 [45] 70B |
|---|---|---|---|---|---|---|
| Baseline | 16/16/16 | 5.47 | 4.83 | 3.31 | 6.13 | 2.85 |
| Ours | 4/4/4 | 5.89 | 5.18 | 3.62 | 7.22 | 4.77 |
| Ours-Full System | 4/4/4 | 5.91 | 5.19 | 3.63 | 7.23 | 4.82 |
| System | LLaMA-3.1-8B Standard TTFT (s) | LLaMA-3.1-8B Standard TPS (×A100) | LLaMA-3.1-8B Agentic TTFT (s) | LLaMA-3.1-8B Agentic TPS (×A100) | LLaMA-3.3-70B Standard TTFT (s) | LLaMA-3.3-70B Standard TPS (×A100) | LLaMA-3.3-70B Agentic TTFT (s) | LLaMA-3.3-70B Agentic TPS (×A100) |
|---|---|---|---|---|---|---|---|---|
| A100 80G | 6.20 | 1.00× | 0.22 | 1.00× | 1.12 | 1.00× | 1.05 | 1.00× |
| A100 80G (With Q) ∗ | 5.13 | 1.66× | 0.19 | 1.39× | 3.41 | 1.23× | 2.46 | 1.32× |
| TPU v6e | 5.63 | 0.90× | 4.58 | 0.39× | 50.07 | 0.31× | 7.98 | 0.84× |
| MicroScopiQ [54] | 16.43 | 0.35× | 3.27 | 0.57× | 61.63 | 0.16× | 19.23 | 0.09× |
| PLENA | 4.19 | 2.24× | 0.21 | 1.58× | 5.65 | 1.19× | 1.27 | 1.49× |
| System | GPT-OSS 20B (MoE) Standard TTFT (s) | GPT-OSS 20B (MoE) Standard TPS (×A100) | GPT-OSS 20B (MoE) Agentic TTFT (s) | GPT-OSS 20B (MoE) Agentic TPS (×A100) | qwen2.5-7B ∗ Standard TTFT (s) | qwen2.5-7B ∗ Standard TPS (×A100) | qwen2.5-7B ∗ Agentic TTFT (s) | qwen2.5-7B ∗ Agentic TPS (×A100) |
|---|---|---|---|---|---|---|---|---|
| A100 80G | 9.39 | 1.00× | 1.87 | 1.00× | 8.21 | 1.00× | 1.17 | 1.00× |
| PLENA | 6.13 | 1.36× | 1.41 | 1.21× | 5.71 | 1.42× | 1.52 | 1.30× |
| Metric | Micro [54] | Olive [25] | FIGNA [31] | PLENA |
|---|---|---|---|---|
| Comp Area (mm²) | 0.1378 | 0.319 | 0.471 | 0.237 |
| TOPs/mm² | 59.45 | 25.66 | 17.39 | 34.49 |
| S. A FLOPs/mm² ∗ | 28.76 | 11.59 | 7.51 | 32.8 |
| A. A FLOPs/mm² ∗ | 1.08 | 0.44 | 0.31 | 5.31 |
| Method | Metric (e.g., PPL ↓) |
|---|---|
| Baseline FP16 | 6.14 |
| RTN | 8.2763 (2.3793 ↑) |
| RTN + Err_w Clip | 8.1948 (0.0815 ↓) |
| GPTQ + Err_w Clip | 8.5193 (0.3245 ↑) |
| GPTQ + Err_y Clip | 7.6026 (0.9167 ↓) |
| GPTQ + Err_y Clip + Selective Rotation | 7.2218 (0.3808 ↓) |
| Quant Method | e1m2 (MXFP4) | e2m1 (MXFP4) | MXINT4 |
|---|---|---|---|
| Baseline FP16 | 6.14 | 6.14 | 6.14 |
| Ours | 8.7205 | 23.1579 | 7.22 |
| Rotated Layer | LLaMA2-7B | LLaMA3-8B |
|---|---|---|
| Attn Only | 5.9367 | 7.3933 |
| Attn + Down_proj | 5.9405 | 7.2721 |
| Attn + Up_proj | 5.9263 | 7.3529 |
| Attn + Gate_proj | 5.9241 | 7.3875 |
| Attn + Q_proj | 5.9183 | 7.3616 |
| Attn + K_proj | 5.9182 | 7.3555 |
| Attn + V_proj | 5.9322 | 7.3788 |
| Attn + O_proj | 5.9238 | nan |
| Instruction Type | Description | Instruction No. |
|---|---|---|
| Matrix | Controls GEMM and GEMV operations, with or without matrix transposition | 6 |
| Vector | Performs elementwise and reduction operations, and rotation for quantization | 13 |
| Scalar | Performs scalar INT and FP arithmetic | 17 |
| HBM | Handles data transfers between HBM and matrix/vector SRAMs | 3 |
| Control | Defines operation settings, including the HBM physical address | 4 |
| Type | Instruction (Format) | Description |
|---|---|---|
| Matrix (M) | M_MM (opcode, rd, rs1, rs2) | Multiply Matrix[rs2] and Vector[rs1]; accumulate in systolic array. |
| Matrix (M) | M_TMM (opcode, rd, rs1, rs2) | Same as M_MM but with matrix transpose. |
| Matrix (M) | M_MV (opcode, rd, rs1) | Multiply Matrix[rs2] and Vector[rs1]; store in first row of systolic array. |
| Matrix (M) | M_TMV (opcode, rd, rs1) | Same as M_MV but with matrix transpose. |
| Matrix (M) | M_MV_WO (opcode, rd, imm) | Write out first row of systolic array to Vector SRAM[rd+imm]. |
| Matrix (M) | M_MM_WO (opcode, rd, imm) | Write out systolic array results to Vector SRAM[rd+imm]. |
| Vector (V) | V_ADD_VV (opcode, rd, rs1, rs2) | Elementwise vector addition. |
| Vector (V) | V_ADD_VF (opcode, rd, rs1, rs2) | Vector plus broadcasted FP register. |
| Vector (V) | V_SUB_VV (opcode, rd, rs1, rs2) | Elementwise vector subtraction. |
| Vector (V) | V_SUB_VF (opcode, rd, rs1, fp2) | Vector minus broadcasted FP register. |
| Vector (V) | V_MUL_VV (opcode, rd, rs1, rs2) | Elementwise vector multiplication. |
| Vector (V) | V_MUL_VF (opcode, rd, rs1, fp2) | Vector times broadcasted FP register. |
| Vector (V) | V_EXP_V (opcode, rd, rs1) | Elementwise exponentiation. |
| Vector (V) | V_REC_V (opcode, rd, rs1) | Elementwise reciprocal. |
| Vector (V) | V_LD_F (opcode, rd, rs1) | Broadcast FP register value to vector. |
| Vector (V) | V_RED_SUM (opcode, rd, rs1) | Reduction sum of vector into FP register. |
| Vector (V) | V_RED_MAX (opcode, rd, rs1) | Reduction max of vector into FP register. |
| Vector (V) | V_ROTATION_EN (opcode, rd, rs1) | Selectively apply Hadamard rotation. |
| Scalar (S) | S_ADD_INT (opcode, rd, rs1, rs2) | Integer addition. |
| Scalar (S) | S_ADDI_INT (opcode, rd, rs1, imm) | Integer add immediate. |
| Scalar (S) | S_SUB_INT (opcode, rd, rs1, rs2) | Integer subtraction. |
| Scalar (S) | S_LUI_INT (opcode, rd, imm) | Load upper immediate. |
| Scalar (S) | S_MUL_INT (opcode, rd, rs1, rs2) | Integer multiplication. |
| Scalar (S) | S_DIV_INT (opcode, rd, rs1, rs2) | Integer division. |
| Scalar (S) | S_LD_INT (opcode, rd, rs1, imm) | Load from FIX_MEM into integer register. |
| Scalar (S) | S_ST_INT (opcode, rd, rs1, imm) | Store integer register to FIX_MEM. |
| Scalar (S) | S_ADD_FP (opcode, rd, rs1, rs2) | FP addition. |
| Scalar (S) | S_SUB_FP (opcode, rd, rs1, rs2) | FP subtraction. |
| Scalar (S) | S_MUL_FP (opcode, rd, rs1, rs2) | FP multiplication. |
| Scalar (S) | S_EXP_FP (opcode, rd, rs1) | FP exponentiation. |
| Scalar (S) | S_MAX_FP (opcode, rd, rs1, rs2) | FP maximum. |
| Scalar (S) | S_LD/ST_FP (opcode, rd, rs1, imm) | Load/store FP register from/to FP_MEM. |
| Memory (H) | H_PREFETCH_M (opcode, rd, rs1, rs2, rstride, prec) | Prefetch specified rows from HBM to Matrix SRAM. |
| Memory (H) | H_PREFETCH_V (opcode, rd, rs1, rs2) | Prefetch specified number of rows from HBM to Vector SRAM. |
| Memory (H) | H_STORE_V (opcode, rd, rs1, rs2, stride, prec) | Store VLEN rows from Vector SRAM to HBM. |
| Control (C) | C_SET_ADDR_REG (opcode, rd, rs1, rs2) | Set HBM address register from two FIX regs. |
| Control (C) | C_SET_SCALE_REG (rd, opcode) | Set MX scale offset for quantized data. |
| Control (C) | C_SET_LUT_REG (rd, opcode) | Set MX scale offset for quantized data. |
| Control (C) | C_BREAK (opcode) | Trigger breakpoint exception. |
| Model | Method | PQ [10] | WG[57] | HS [72] | A-e [14] | A-c [14] | LA [52] | Avg. |
|---|---|---|---|---|---|---|---|---|
| Llama-3-8B | FP16 | 80.74 | 72.77 | 79.06 | 77.82 | 53.33 | 75.63 | 73.22 |
| Llama-3-8B | QuaRot [6] | 75.14 | 65.82 | 72.94 | 68.01 | 43.34 | 65.81 | 65.18 |
| Llama-3-8B | Ours | 79.11 | 71.35 | 76.97 | 74.07 | 50.51 | 74.07 | 71.01 |
| Llama-2-7B | FP16 | 79.11 | 69.06 | 75.99 | 74.58 | 46.25 | 73.9 | 69.82 |
| Llama-2-7B | QuaRot [6] | 76.77 | 63.77 | 72.16 | 69.87 | 40.87 | 70.39 | 65.64 |
| Llama-2-7B | Ours | 78.73 | 68.19 | 74.24 | 72.52 | 43.69 | 73.3 | 68.45 |
| Parameter | Description | Search range |
|---|---|---|
| BLEN | Tile size of block unit | [2, 4, 8, 16, 32] |
| MLEN | Tile size of matrix unit | [2, 4, 8, 16, 32, 64, 128, 256, 512] |
| VLEN | Tile size of vector unit | [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024] |
| HBM_M_Prefetch | Prefetch amount for matrix data from HBM | [2, 4, 8, 16, 32, 64, 128, 256] |
| HBM_V_Prefetch | Prefetch amount for vector data from HBM | [2, 4, 8, 16, 32, 64, 128, 256] |
| HBM_V_Writeback | Writeback amount for vector data to HBM | [2, 4, 8, 16, 32, 64, 128, 256] |
| ACT_WIDTH | Activation precision | MXINT_{2,3,4,8}, MXFP_{E1M2, E2M1, E3M4, E4M3, E5M2} |
| KV_WIDTH | Key/Value precision | MXINT_{2,3,4,8}, MXFP_{E1M2, E2M1, E3M4, E4M3, E5M2} |
| FP_SETTING | Floating-point precision setting | FP_{E3M2, E2M3, E6M5, E5M6, E4M7, E8M5} |
| INT_DATA_WIDTH | Integer data width | [16, 32, 64] |
| Constraint | Description |
|---|---|
| MLEN ≥ BLEN | Matrix tile size must be at least the block tile size |
| MLEN mod BLEN = 0 | Matrix tile size must be divisible by the block tile size |
| MATRIX_SRAM_DEPTH ≥ 2 × MLEN | Matrix SRAM depth must accommodate 2 × MLEN |
| VECTOR_SRAM_DEPTH ≥ (2 × HEAD_DIM + HIDDEN_DIM) / VLEN | Vector SRAM depth must store heads and hidden slices |
| INT_SRAM_DEPTH ≥ 16 | Minimum integer SRAM depth |
| FP_SRAM_DEPTH ≥ 3 × MLEN + FP_CONSTANT_NUM | Floating-point SRAM depth constraint |
| MLEN × ACT_WIDTH + (MLEN / BLEN) × ACT_SCALE_WIDTH < 1510 | Bandwidth constraint at 1 GHz, 1 TB/s |
| VLEN × ACT_WIDTH + (VLEN / BLEN) × ACT_SCALE_WIDTH < 1510 | Bandwidth constraint at 1 GHz, 1 TB/s |
| MLEN × ACT_WIDTH + (MLEN / BLEN) × ACT_SCALE_WIDTH < 1510 | Bandwidth constraint at 1 GHz, 1.5 TB/s |
| VLEN × ACT_WIDTH + (VLEN / BLEN) × ACT_SCALE_WIDTH < 1510 | Bandwidth constraint at 1 GHz, 1.5 TB/s |
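The DSE flow only evaluates design points that satisfy these constraints; the feasibility filter can be sketched as below. Field names mirror the table, the fraction in the vector-SRAM constraint is our reading of the flattened layout, and the fixed bound 1510 is taken from the table as-is; all of these are assumptions about the exact implementation.

```python
# Hedged sketch of the constraint check a candidate design point must pass
# before the cost models are invoked. `p` is a dict of the parameters in
# Table 14 plus the model-derived quantities (HEAD_DIM, HIDDEN_DIM, ...).

def feasible(p):
    checks = [
        p["MLEN"] >= p["BLEN"],
        p["MLEN"] % p["BLEN"] == 0,
        p["MATRIX_SRAM_DEPTH"] >= 2 * p["MLEN"],
        p["VECTOR_SRAM_DEPTH"] >= (2 * p["HEAD_DIM"] + p["HIDDEN_DIM"]) / p["VLEN"],
        p["INT_SRAM_DEPTH"] >= 16,
        p["FP_SRAM_DEPTH"] >= 3 * p["MLEN"] + p["FP_CONSTANT_NUM"],
        # per-cycle bit budget from HBM bandwidth at 1 GHz
        p["MLEN"] * p["ACT_WIDTH"]
            + (p["MLEN"] // p["BLEN"]) * p["ACT_SCALE_WIDTH"] < 1510,
        p["VLEN"] * p["ACT_WIDTH"]
            + (p["VLEN"] // p["BLEN"]) * p["ACT_SCALE_WIDTH"] < 1510,
    ]
    return all(checks)
```

Pruning infeasible points up front keeps the BoTorch/TPE samplers from spending trials on configurations that could never be built.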
The first ten columns are the searched parameters; the last three are the resulting metrics.

| BLEN | MLEN | VLEN | HBM_M Prefetch | HBM_V Prefetch | HBM_V Writeback | ACT WIDTH | KV WIDTH | FP SETTING | INT_DATA WIDTH | Perplexity ↓ | Lat (s) ↓ | Area (µm²) ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 32 | 128 | 32 | 16 | 8 | 256 | MXFP_E4M3 | MXFP_E3M4 | FP_E4M7 | 64 | 6.70 | 0.24 | 49615017.52 |
| 32 | 128 | 64 | 4 | 8 | 256 | MXINT_8 | MXINT_4 | FP_E3M2 | 32 | 6.76 | 0.24 | 51639793.20 |
| 32 | 256 | 128 | 256 | 64 | 128 | MXFP_E1M2 | MXINT_8 | FP_E6M5 | 16 | 12.14 | 0.15 | 99425984.56 |
| 8 | 128 | 32 | 128 | 8 | 256 | MXFP_E3M4 | MXFP_E3M4 | FP_E5M6 | 16 | 6.54 | 1.47 | 26456937.52 |
| 16 | 128 | 16 | 4 | 16 | 64 | MXINT_8 | MXFP_E4M3 | FP_E3M2 | 64 | 6.60 | 0.49 | 31983011.76 |
| System | Freq (GHz) | MLEN | BLEN | VLEN | SRAM (MB) | W. Width | A. Width | KV Width | FP Setting |
|---|---|---|---|---|---|---|---|---|---|
| PLENA | 1 | 2048 | 32 | 2048 | 128 | MXINT4 | MXINT4 | MXINT4 | FP E4M3 |
| MicroscopiQ | 1 | 256 | 256 | 2048 | 128 | MXINT4 | MXINT4 | MXINT4 | FP E4M3 |
Transformer models have revolutionised AI across numerous fields, including language, vision, and science (Wei et al., 2022; Vaswani et al., 2023; Kojima et al., 2023). Decoder-only transformer-based autoregressive large language models (LLMs), like GPT (OpenAI et al., 2024) and LLaMA (Touvron et al., 2023a), are now widely deployed in many applications, such as real-time chatbots (OpenAI, 2024), code generation (Jiang et al., 2024) and agentic tool-use and computer-use workflows (Niu et al., 2025).
The rapid rise of agentic LLM capabilities, e.g. computer use (Lu et al., 2024), tool use (Müller and Žunič, 2024; He et al., 2024), and command-line agents (Agarwal et al., 2020), relies heavily on their ability to process and reason over very long contexts. For instance, command-line agents need to both comprehend and generate large-scale codebases (Ishibashi and Nishimura, 2024; Zan et al., 2025; Rando et al., 2025), while tool- and computer-use agentic workflows must keep track of multiple pieces of information across prolonged inputs, such as an entire web page DOM, which typically require very long contexts (Lee et al., 2025; Drouin et al., 2024; Chezelles et al., 2025). Figure 1(a) shows that, compared with chatbot workloads, agentic workloads consume 100× more tokens per inference on average and up to 1,000× in the maximum case. In response, modern LLMs have expanded their context windows: the original GPT-3 (Brown et al., 2020) supports roughly 2K tokens, whereas GPT-4 (OpenAI et al., 2024) reaches up to 32K tokens, and LLaMA4-Maverick (AI, 2025) extends the context window to 1M tokens.
To clarify the computational impact of agentic workloads, Figure 1(b) analyzes a LLaMA 3.3 70B model with long-context capability and shows that, when the number of generated tokens is low, the Feed-Forward Networks (FFNs) account for most of the total inference FLOPs, whereas the attention layers become dominant as the number of generated tokens grows. Note that these two phases can occur within a single inference run, since decoding is autoregressive. For instance, in the Longwriter (Bai et al., 2024) workload, the prefilling phase finishes at 5K tokens, and the decoding phase continues from there, expanding the context to up to 85K tokens; as shown in Figure 1(b), this shifts the workload from the FFN-intensive to the attention-intensive region in terms of FLOPs within a single inference run. Furthermore, at such large context lengths, the KV cache becomes the primary consumer of HBM resources.
Figure 1(c) also identifies two major limiting factors on the memory side. The large number of KV values and weights that must be read, together with the portion of KV values written back, imposes substantial memory bandwidth demands. In addition, as context length increases, the KV-cache requirement grows linearly, quickly inflating memory usage and often surpassing the size of the model weights, making HBM capacity a primary limiting factor. For example, in LLaMA-3.3-70B at a 128K context (Meta, 2024), the FP16 KV cache for a single batch is approximately 39 GB, which limits how many batches can be kept on the chip (Gholami et al., 2024). Building on this observation, we identify two main challenges on the off-chip memory side: (i) limited memory bandwidth and (ii) restricted memory capacity, which we collectively term the memory walls. Together, they prevent devices from reaching peak performance at inference time, consistent with observations in prior work (Gholami et al., 2024; Davies et al., 2025; Zhang et al., 2024).
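For illustration, the ~39 GB figure above can be reproduced with a short back-of-the-envelope calculation, assuming LLaMA-3.3-70B's published configuration (80 layers, 8 KV heads under GQA, head dimension 128) and 128K = 128,000 tokens; the 4-bit variant ignores the small overhead of shared scales:

```python
def kv_cache_bytes(context_len, n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Per-batch KV-cache size: 2 tensors (K and V) per layer, each
    context_len x n_kv_heads x head_dim elements."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * context_len

fp16 = kv_cache_bytes(128_000)                       # FP16: 2 bytes per element
mx4 = kv_cache_bytes(128_000, bytes_per_elem=0.5)    # ~4-bit MXINT, scales ignored
print(f"FP16 KV cache: {fp16 / 2**30:.1f} GiB")      # 39.1 GiB
print(f"MXINT4 KV cache: {mx4 / 2**30:.1f} GiB")     # 9.8 GiB
```

The 4$\times$ reduction from FP16 to a 4-bit KV cache directly translates into proportionally larger feasible batch sizes under the same HBM capacity.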
The memory wall phenomenon leads to underutilization of computing resources on modern hardware, including TPUs and GPUs. This effect is particularly evident in compute units dedicated to General Matrix–Matrix Multiplication (GEMM) operations ($\mathbb{R}^{M\times K}\times\mathbb{R}^{K\times N}\to\mathbb{R}^{M\times N}$, denoted as $(M,K)\times(K,N)$), which constitute the core computational workload during LLM inference (Guo et al., 2025). At the microarchitectural level, most hardware is built with square-shaped systolic arrays or matrix multiplication units, typically designed so that the $M$ and $N$ dimensions are close in size to $K$. For example, TPU v3 (Google, 2025) features a 128$\times$128 systolic array, supporting $M=K=N=128$ GEMM operations. The NVIDIA Blackwell B200 architecture (Corporation, 2024) introduces a minimal computation granularity of $64\times 8\times 16$. However, in long-context models, as demonstrated in Figure 1(c), memory often constrains the inference batch size. This results in a fat GEMM, where the batch-related dimension (typically $M$ in $(M,K)\times(K,N)$) is much smaller than the others, producing an uneven matrix shape. This imbalance prevents systolic arrays and Tensor Cores from achieving high utilization, leaving computational resources significantly underused (Hong et al., 2024).
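A crude utilization model makes the gap concrete. The sketch below counts the fraction of PEs holding useful outputs under an output-stationary dataflow, ignoring pipeline fill/drain; the array dimensions (a 128$\times$128 square array vs. a flattened array of 8$\times$8 sub-arrays) and the decode-step GEMM shape are illustrative assumptions:

```python
def square_array_util(M, K, N, dim=128):
    # Output-stationary square array: only a (min(M,dim) x min(N,dim))
    # patch of the dim x dim PE grid holds useful partial sums.
    return (min(M, dim) * min(N, dim)) / (dim * dim)

def flattened_array_util(M, K, N, blen=8, mlen=2048):
    # Flattened array: blen x blen output tiles; mlen-wide K slices
    # stream through, so PE busy-ness depends only on the output tile.
    return (min(M, blen) * min(N, blen)) / (blen * blen)

# Fat GEMM from a small-batch decode step: M = 8, K = N = 8192
print(square_array_util(8, 8192, 8192))     # 0.0625 -> 6.25% of PEs useful
print(flattened_array_util(8, 8192, 8192))  # 1.0 -> all PEs useful
```

Under this simplified model, flattening the array yields a 16$\times$ improvement in attainable PE utilization for the small-batch case.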
To this end, we propose the Programmable Long-context Efficient Neural Accelerator (PLENA), an efficient transformer-model accelerator system designed to maintain high utilization of GEMM units across all inference stages (prefilling and decoding), particularly for agentic LLM inference tasks with large contexts. PLENA achieves high efficiency for long-context inference by exploring three optimization pathways across both hardware and software design spaces: i) a flattened systolic array architecture tailored to fat GEMM (large inner dimension $K$); ii) a set of quantization methods with mixed data types and precisions to address both memory wall challenges; and iii) a set of custom instructions (the PLENA_ISA) with native FlashAttention support (Dao, 2023).
Figure 2 shows how these three pathways together can increase utilization compared with conventional square-shaped GEMM hardware without any optimization. First, our novel flattened systolic-array optimization (Pathway 1) achieves higher attainable compute utilization. The $(M,K,N)$ matrix multiplication typically has $N\ll K$ because of the memory capacity wall: all KVs must be stored, so the batch size (the $N$ dimension) is kept lower than the hidden size ($K$), and while various offloading techniques are available (Aminabadi et al., 2022), they complicate system-level trade-offs and tend to make the system more memory I/O–bound. Our flattened systolic array thus makes more effective use of the multiplication resources, as illustrated in Figure 2(a) and Figure 2(b). Second, we apply an asymmetric quantization strategy (Pathway 2), where weights (W), activations (A), and the KV cache (KV) can be assigned different arithmetic widths and precisions to address both memory bandwidth and capacity limitations. With more aggressive W and KV-cache quantization, such as $W_{\text{mxint4}}, A_{\text{mxint8}}, KV_{\text{mxint4}}$, we free up more space in HBM for data scaling (e.g., larger batch sizes). We strategically integrate and enhance state-of-the-art (SOTA) quantization schemes, incorporating techniques such as micro-scaling (Rouhani et al., 2023), output-norm-guided Hessian-based iterative optimization (Frantar et al., 2022), and selective rotation (Ashkboos et al., 2024). Our final quantization results demonstrate SOTA performance, effectively reducing limitations related to both memory bandwidth and capacity. Finally, we design and implement PLENA with a custom ISA that natively supports FlashAttention (Pathway 3), since Figure 1(b) shows that attention dominates at longer context lengths.
We present a novel approach for effectively supporting FlashAttention on systolic-array-based architectures that avoids excessive off-chip memory I/O during attention. Together, these three optimizations yield significantly higher utilization than conventional square-shaped systolic-array accelerators. The main contributions of our work are as follows:
We analytically characterize the bandwidth and capacity memory walls in agentic LLM inference and show that existing systolic-array accelerators are typically under-utilized when running these workloads.
We introduce three optimization pathways that jointly address the under-utilization caused by memory walls: (i) a flattened systolic array architecture; (ii) an asymmetric quantization scheme with mixed data types and precisions; and (iii) native support for FlashAttention.
We present PLENA, a complete hardware–software system that realizes the above optimizations. It comprises (i) a custom instruction set (PLENA_ISA) for large Transformer inference; (ii) a PyTorch-to-PLENA_ISA compiler; (iii) an HBM-enabled transaction-level simulator; (iv) an automated, accuracy-aware design space exploration (DSE) flow; and (v) a full RTL implementation. We demonstrate that PLENA supports different SOTA transformer model variants (e.g., GQA, MHA, and MLA (Meng et al., 2025), Dense and MoE (Artetxe et al., 2022)). We also show that PLENA achieves SOTA energy efficiency for agentic LLM inference tasks – it achieves 2.24$\times$ higher throughput than the A100 GPU and 3.85$\times$ higher throughput than the TPU v6e. The overall PLENA system is illustrated in Figure 10, and the entire system will be fully open-sourced upon acceptance.
L1: functional simulator; L2: cycle-accurate simulator; L3: cycle-accurate simulator with HBM enabled.
: partial or planned open-source. Full inference coverage∗: all Transformer computations executed on-accelerator.
Quantization compresses LLMs by mapping high-precision floating-point parameters $\mathbf{X}$ into lower-bit representations. Following the standard integer quantization definition (Nagel et al., 2021), we formalize the process over an arbitrary target data format under a single-level scaling scheme using three elements: the data format ($\tau$), the scale factor ($s$), and the zero point ($z$).
A data format is defined as a tuple $\tau=(d,b)$, where $d$ denotes the numerical datatype and $b$ is the bit-width specifying its precision. For a datatype $\tau$, the values it can represent are restricted to a finite interval. We denote this interval as the representable set:
$$\Omega(\tau)=[\min_{\tau},\max_{\tau}],$$
with $\min_{\tau}$ and $\max_{\tau}$ as the representable bounds. The scale factor $s$ maps the dynamic range of $\mathbf{X}$ into $\Omega(\tau)$, typically defined as:
$$s=\frac{\max(\mathbf{X})-\min(\mathbf{X})}{\max_{\tau}-\min_{\tau}},$$
while the zero-point $z$ shifts the range for alignment (with $z=0$ in symmetric quantization). Quantization then maps $\mathbf{X}$ into the target format as:
$$\mathbf{X}_{\tau}=\mathrm{clip}\!\left(\mathrm{RTN}\!\left(\frac{\mathbf{X}}{s}\right)+z,\ \min_{\tau},\ \max_{\tau}\right),$$
where $\mathrm{RTN}(\cdot)$ denotes round-to-nearest. To approximate the original tensor, the quant–dequant operator is:
$$\hat{\mathbf{X}}=s\,(\mathbf{X}_{\tau}-z).$$
In Equation 3, values exceeding the representable range are clipped, introducing the clipping error. In this work, we address this with a novel adaptive clipping search, described in Section 4.2.
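For illustration, the definitions above can be written as a minimal numpy sketch; the tensor values and the signed 4-bit range are illustrative assumptions, and the clipped final element shows the clipping error just discussed:

```python
import numpy as np

def quantize(X, s, z, qmin, qmax):
    # Scale, shift by the zero-point, round-to-nearest, then clip
    # to the representable set [qmin, qmax].
    return np.clip(np.rint(X / s) + z, qmin, qmax)

def dequantize(Xq, s, z):
    return s * (Xq - z)

X = np.array([-1.0, -0.3, 0.0, 0.7, 2.5])
qmin, qmax = -8, 7                       # signed 4-bit integer range
s = (X.max() - X.min()) / (qmax - qmin)  # per-tensor scale
z = 0                                    # symmetric quantization
Xq = quantize(X, s, z, qmin, qmax)       # [-4, -1, 0, 3, 7]: 2.5 is clipped
X_hat = dequantize(Xq, s, z)
```

Note that 2.5 maps to the upper bound 7 rather than its rounded value 11, so its reconstruction error is dominated by clipping rather than rounding.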
As the tensor size grows, the probability of having these outliers increases, widening the dynamic range and amplifying clipping error. Prior work mitigates this by varying the granularity at which scale and zero-point parameters are shared: from per-tensor, to per-channel, and to vector-wise schemes. In this work, we adopt block-wise micro-scaling datatypes (MXINT and MXFP), with both software and hardware implementations to support our data-format-aware co-design, which we defer to Section 4.2.
Additionally, quantization approaches generally fall into two categories: quantization-aware training (QAT), which integrates quantization during fine-tuning, and post-training quantization (PTQ), which applies quantization directly to a pretrained model. Our PTQ method achieves accuracy competitive with full-precision baselines, even under aggressive low-bit, full-system quantization, as demonstrated in Table 3.
Microscaling (MX) data formats, proposed in prior work (Rouhani et al., 2023), define a standardized format that enables block-wise scale sharing. These formats support multi-level scaling schemes. We adopt only the level-1 scaling strategy, as illustrated in Figure 3. The scaling factor in the MX format can be computed similarly to Equation 2, after which it is quantized using power-of-two (PoT) quantization. The data elements in MX formats can be represented either as integers or as minifloats. In our design, we include both representations in the search space to evaluate software performance.
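As a sketch of the level-1 scheme, the snippet below quantizes each block of 32 elements with one shared power-of-two scale and integer elements; the block size, element width, and rounding of the scale exponent via ceiling are illustrative assumptions rather than a full MX implementation:

```python
import numpy as np

def mx_quantize(x, block=32, elem_bits=8):
    """Level-1 MX-style block quantization sketch: one power-of-two
    shared scale per block, integer data elements."""
    qmax = 2 ** (elem_bits - 1) - 1            # 127 for 8-bit elements
    x = x.reshape(-1, block)
    amax = np.abs(x).max(axis=1, keepdims=True)
    e = np.ceil(np.log2(amax / qmax))          # PoT exponent per block
    scale = 2.0 ** e
    q = np.clip(np.rint(x / scale), -qmax - 1, qmax)
    return q, scale

np.random.seed(0)
x = np.random.randn(4, 32).astype(np.float32)
q, scale = mx_quantize(x.ravel())
x_hat = q * scale                              # block-wise dequantization
```

Rounding the exponent up guarantees that no element overflows the integer range, so all error comes from rounding rather than clipping.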
In long-context scenarios, KV cache size is a key challenge (Zirui Liu et al., 2023), but hardware support for efficient quantization of KV cache remains limited (Ramachandran et al., 2025; Guo et al., 2023). Existing frameworks often treat quantization in isolation rather than as part of full-system design, leaving gaps in non-GEMM operations and causing a mismatch between algorithmic advances and practical deployment on hardware.
FlashAttention optimizes memory I/O in the standard attention layer (Dao, 2023). In a standard attention layer, computing $QK^{\top}$ produces a prohibitively large square matrix, often thousands by thousands in size. Because on-chip memory cannot hold this intermediate result, it must be written to off-chip memory and later reloaded for the subsequent softmax and $PV$ steps, which significantly degrades performance. FlashAttention avoids this round trip by tiling and fusing the attention computation (GEMM–Softmax–GEMM) so that all intermediate results fit on-chip.
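The online-softmax rescaling at the heart of this tiling can be sketched in numpy as follows; sizes and the tile width are illustrative, and the reference implementation materializes the full score matrix only to validate the result:

```python
import numpy as np

def flash_attention(Q, K, V, tile=64):
    """Tiled attention: K/V are streamed tile by tile, so the full
    S = Q K^T matrix is never materialized."""
    n, d = Q.shape
    O = np.zeros_like(Q)
    m = np.full(n, -np.inf)   # running row-wise max
    l = np.zeros(n)           # running softmax denominator
    for j in range(0, K.shape[0], tile):
        S = Q @ K[j:j + tile].T / np.sqrt(d)   # small (n, tile) tile
        m_new = np.maximum(m, S.max(axis=1))
        alpha = np.exp(m - m_new)              # rescale old accumulators
        P = np.exp(S - m_new[:, None])
        l = alpha * l + P.sum(axis=1)
        O = alpha[:, None] * O + P @ V[j:j + tile]
        m = m_new
    return O / l[:, None]

np.random.seed(0)
Q, K, V = (np.random.randn(128, 64) for _ in range(3))
# Reference: naive attention with the full score matrix
S = Q @ K.T / np.sqrt(64)
P = np.exp(S - S.max(axis=1, keepdims=True))
ref = (P / P.sum(axis=1, keepdims=True)) @ V
print(np.abs(flash_attention(Q, K, V) - ref).max())  # numerically tiny
```

Only the running max, denominator, and output accumulator persist across tiles, which is exactly what PLENA keeps resident on-chip.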
The overall configuration of PLENA is shown in Figure 4. It is designed to support instruction-level pipelining and mainly consists of three compute units: the Matrix Unit, the Vector Unit, and the Scalar Unit. All units are highly configurable, supporting multiple data types and precisions, enabling the application of different quantization methods to the accelerator. PLENA also includes two main on-chip SRAM blocks. The Vector SRAM acts as a scratchpad for computation, storing frequently used data such as activations, which do not need to be written back to HBM, thereby reducing memory access overhead. The custom Matrix SRAM is dedicated to loading weights and KV tensors and supports reading data in either transposed or untransposed layouts with no additional overhead.
To support asymmetric quantization strategies (Section 4.1), PLENA natively offers multiple numeric formats—covering different data types and precisions—across its compute and memory units (Table 14). This innovative asymmetric data-handling configuration has the following characteristics:
(i) Activations are stored in a high-precision floating-point (FP) format on-chip in the Vector SRAM, as they are more sensitive to quantization errors than KV or weights. (ii) KV and weights, being less accuracy-sensitive, can be more aggressively quantized and staged in the Matrix SRAM using lower-precision MX formats (MX-FP or MX-INT). (iii) An optional on-chip rotation step can suppress outliers before quantization to preserve accuracy.
Figure 5 illustrates the precision formats used by each unit and the dataflow between them. When appending newly computed $K$ and $V$ to the KV cache, we optionally apply a selective rotation (Hadamard transform) to suppress outliers before quantizing to MX-INT. Because $K$ and $V$ are consumed only by the attention layer’s GEMM, they are loaded exclusively into the Matrix SRAM. Before use, the matrix unit applies the inverse Hadamard transform to de-rotate $K$ and $V$. These rotation/de-rotation stages can be selectively applied per tensor; for example, weights loaded into the matrix unit bypass the inverse transform.
All compute units are optimized for feed-forward (FFN) and attention computations in transformer inference, with particular emphasis on long-context workloads. As shown in Figure 2(b), long-context workloads frequently involve fat GEMMs, where the batch-related dimension (typically $M$ in $(M,K)\times(K,N)$) is much smaller than the others, resulting in uneven matrix shapes (Figure 6). The reduction dimension $K$ tends to be very long. For example, the weight–activation GEMM reduces over the model’s hidden size (e.g., $4{,}096$ for LLaMA-8B and $8{,}192$ for LLaMA-70B). In addition, a variety of arithmetic operations, such as elementwise addition, summation, and special functions like the exponential, are required across long-dimension tensors.
To optimize GEMM in long-context workloads involving fat GEMMs, we propose flattened systolic arrays, enabling higher utilization across the entire fat GEMM computation flow. The unit computes a $(\texttt{BLEN},\texttt{MLEN})\times(\texttt{MLEN},\texttt{BLEN})$ GEMM and produces results of shape $(\texttt{BLEN},\texttt{BLEN})$; BLEN is normally set much smaller than MLEN to match the workload characteristics of long-context LLM inference.
This flattened systolic array is designed for an output-stationary dataflow in order to maintain high utilization and avoid frequent reads/writes of partial sums, as well as the bubbles associated with streaming operands into the systolic array. As shown in Figure 6, operands stream along the large reduction dimension $K$ while partial sums remain resident in the PEs. The array is fully pipelined, eliminating bubbles between consecutive GEMM tiles.
The microarchitecture of the flattened systolic array is shown in Figure 7. It is built from a series of small square-shaped systolic arrays (sub-arrs), each consisting of a grid of processing elements (PEs). Each PE repeatedly performs multiply–accumulate operations and passes data to its neighboring PEs below and to the right across the array. As described in Section 3.1, the systolic array is designed to natively accept data in the MX format. The detailed PE configuration is provided in Figure 13.
On each cycle, the flattened systolic array fetches two MLEN-wide inputs, one from the Matrix SRAM (top) and one from the Vector SRAM (left). These inputs are buffered and reordered, then partitioned into $\texttt{MLEN}/\texttt{BLEN}$ vectors (assuming MLEN is divisible by BLEN), each of length BLEN. Each vector is then fed to a corresponding sub-arr from the top and left directions.
However, a matrix unit composed solely of sub-arrs is insufficient to complete a $(\texttt{BLEN},\texttt{MLEN})\times(\texttt{MLEN},\texttt{BLEN})$ GEMM. Each array accumulates only partial sums for a fragment of the final result; producing a complete $(\texttt{BLEN},\texttt{BLEN})$ output requires a cross-array reduction that sums the partial sums held in the PEs across the tiled row. To address this, we integrate an output adder tree (see Figure 7) that performs the cross-array summation efficiently. This unit is invoked via a dedicated instruction, as only one cross-array summation is required when computing a GEMM along the large reduction dimension. This prevents bubbles and improves computational efficiency.
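Functionally, the per-sub-array accumulation followed by the adder-tree reduction can be emulated as below; BLEN and MLEN values are illustrative, and the pairwise loop stands in for the hardware adder tree:

```python
import numpy as np

BLEN, MLEN = 8, 64            # illustrative sizes; MLEN/BLEN = 8 sub-arrays

def flattened_gemm(A, B):
    """Emulate one (BLEN, MLEN) x (MLEN, BLEN) tile: each BLEN x BLEN
    sub-array accumulates partial sums over its own K-slice, then the
    output adder tree reduces the partials across the sub-array row."""
    n_sub = MLEN // BLEN
    partials = [
        A[:, i * BLEN:(i + 1) * BLEN] @ B[i * BLEN:(i + 1) * BLEN, :]
        for i in range(n_sub)
    ]
    # Output adder tree: pairwise reduction across sub-arrays
    while len(partials) > 1:
        partials = [partials[i] + partials[i + 1] for i in range(0, len(partials), 2)]
    return partials[0]

np.random.seed(0)
A, B = np.random.randn(BLEN, MLEN), np.random.randn(MLEN, BLEN)
assert np.allclose(flattened_gemm(A, B), A @ B)
```

Because the reduction is invoked once per tile rather than once per K-slice, it adds negligible overhead to the streamed accumulation.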
This unit supports all vector operations required during LLM inference, including elementwise computations (e.g., addition, multiplication, and exponential) and reduction operations (e.g., summation, maximum). The vector dimension is parameterized by VLEN. A complete list of vector-unit instructions is provided in Table 12.
The scalar unit has two separate ALUs supporting two types of computation: integer (INT) and floating point (FP). Both the INT and FP units are connected to their respective SRAMs and register files and operate independently.
INT operations are used primarily for on-chip address generation and indexing, and run on a control path decoupled from the FP datapath. In contrast, the FP unit implements basic arithmetic and the non-linear functions required by transformer workloads (e.g., exponential, reciprocal, and reciprocal square root (rsqrt)). To accommodate future models that may require additional special functions, we also include a look-up table (LUT) unit so new functions can be realized via table lookups without introducing additional logic.
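As a sketch of how such a LUT unit might approximate a special function, the snippet below tabulates the exponential over the range relevant to softmax and evaluates it by piecewise-linear interpolation; the table size and input range are illustrative assumptions, not PLENA's actual LUT parameters:

```python
import numpy as np

def make_lut(fn, lo, hi, entries=256):
    # Precompute a uniform table of (input, output) pairs
    xs = np.linspace(lo, hi, entries)
    return xs, fn(xs)

def lut_eval(x, xs, ys):
    """Piecewise-linear table lookup: index into the table and
    interpolate between adjacent entries."""
    return np.interp(x, xs, ys)

xs, ys = make_lut(np.exp, -8.0, 0.0)   # exp over softmax's negative input range
np.random.seed(0)
x = np.random.uniform(-8, 0, 1000)
err = np.abs(lut_eval(x, xs, ys) - np.exp(x)).max()  # well under 1e-3
```

A 256-entry table with linear interpolation already yields sub-1e-3 absolute error here, which suggests why table lookup is an attractive fallback for functions not hardened in logic.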
Our memory system is characterized by two key properties:
Support for asymmetric precisions, variable-length memory transfers, and strided loads/stores to HBM.
Latency hiding for HBM accesses via a hardware prefetcher, enabling high bandwidth utilization.
To make more effective use of HBM capacity, as discussed in Section 3.1, all data stored in HBM is kept in MX format. However, due to address alignment constraints, it is impractical to concatenate each data block with its associated per-block scales: the resulting combined size is seldom a power-of-two ($2^n$) multiple, making it inefficient for the memory system.
To address this problem, we store the blocks and their scales separately – laying out all blocks contiguously, followed by the corresponding scales at the end of the block region. With this technique, the memory address alignment is preserved while locality is maintained. The resulting layout is shown in Figure 8.
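The resulting address map can be sketched as follows; the 32-element MXINT4 block (16 bytes) and one-byte E8M0 scale follow the MX specification, while the base address and block count are hypothetical:

```python
def mx_addresses(base, n_blocks, block_bytes=16, scale_bytes=1):
    """Blocks-then-scales layout: all element blocks laid out
    contiguously from `base`, followed by their per-block scales
    at the end of the block region."""
    block_addr = [base + i * block_bytes for i in range(n_blocks)]
    scale_base = base + n_blocks * block_bytes
    scale_addr = [scale_base + i * scale_bytes for i in range(n_blocks)]
    return block_addr, scale_addr

# e.g. 32-element MXINT4 blocks: 32 * 4 bits = 16 bytes, one scale byte each
blocks, scales = mx_addresses(base=0x1000, n_blocks=4)
print([hex(a) for a in blocks])  # ['0x1000', '0x1010', '0x1020', '0x1030']
print([hex(a) for a in scales])  # ['0x1040', '0x1041', '0x1042', '0x1043']
```

Every block start stays 16-byte aligned, so the power-of-two alignment problem described above never arises, while the scales remain a fixed offset away for locality.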
To support variable-length transfers, the HBM controller integrates two data-packing units. MX-format blocks fetched via TileLink (SiFive, Inc., 2020) (the on-chip interconnect used to access the HBM controller) are repacked into (i) MLEN-wide vectors for the Matrix SRAM and (ii) VLEN-wide vectors for the Vector SRAM. The controller automatically locates and fetches the corresponding per-block scales based on the active precision and the requested transfer size. On the write path, dedicated units accept vectors from the Matrix and Vector SRAMs, partition them into MX blocks, attach the appropriate per-block scales, and commit the aligned layout back to HBM.
The loading logic is critical for fully utilizing HBM memory bandwidth. The hardware load unit resides in both the Matrix and Vector SRAMs and is connected directly to the HBM controller. This enables background fetching and streaming into each SRAM while the rest of PLENA executes other instructions, sustaining full utilization of the matrix unit and avoiding stalls on HBM accesses. The two load units are controlled directly by instructions, with the amount of data to be loaded encoded in each instruction. For example, during weight–activation GEMMs, where GEMM operations are invoked repeatedly while streaming data across the hidden dimension, the loaded amount is set to this dimension, so the load instruction only needs to be issued once.
Our customized ISA is designed to cover all operations required for transformer inference. The instructions are structured to balance efficiency with flexibility and are built to support multiple transformer-based models and computation optimizations. In addition to FlashAttention, the ISA also supports different transformer variants, such as MHA, MLA (Meng et al., 2025), and MoE (Artetxe et al., 2022). A brief summary is provided in Table 11, with the detailed specification given in Table 12.
To achieve the efficiency and flexibility balance, the ISA is designed to minimize overhead while maximizing utilization of compute and memory resources. This is achieved through features such as tile-level scheduling, which enables fine-grained control of computation and memory instructions at the tile granularity. Furthermore, the ISA defines dedicated instruction classes (Matrix, Vector, Scalar, Memory, and Control) that decouple responsibilities, simplify scheduling, and allow flexible mixing across different computation types.
The instructions (32 bits per instruction) are dynamically passed from the CPU to the instruction buffer via PCIe. The scalar unit contains an integer register file storing on-chip addresses. Vector- or matrix-related instructions control reads and writes to the Matrix and Vector SRAMs using the specified integer registers. Simple arithmetic operations in the scalar unit are used for address manipulation.
Most current accelerators cannot execute FlashAttention natively because (i) they expose only GEMM primitives and lack in-line, row-wise reductions and nonlinear operations (max/sum, exp, div) required for the online softmax; (ii) they lack memory-layout support such as transpose-on-read and efficient strided/blocked streaming; and (iii) they rely on rigid ISAs with fixed scheduling and coarse-grained kernel boundaries, preventing tile-by-tile flexible execution.
In PLENA, we address (i) with tightly coupled vector and scalar units that implement the required reductions and elementwise operations; the vector unit’s width is configurable to match the tile dimensions used by FlashAttention. For (ii), we introduce a Matrix SRAM that can be read in either standard or transposed order without extra data movement. In the $QK^{\top}$ step, explicitly transposing large tiles on the fly is costly in area, energy, and latency, and storing $K^{\top}$ in HBM is impractical because it complicates appending new $K$ vectors to the existing $K$ cache during decoding. The Matrix SRAM avoids both issues by banking the storage across multiple sub-SRAMs and using lightweight address remapping to present a transposed view at read time (implementation details in Figure 12). For (iii), our custom ISA offers composable, fine-grained control that enables persistent, tile-by-tile scheduling of the fused attention pipeline. This allows each operation in FlashAttention to be controlled individually at the tile level. Combined with the above capabilities, this allows PLENA to support FlashAttention natively.
✓∗ denotes partial quantization support; ∗At the time of writing, MicroScopiQ has not yet released its code; the comparison is based on information obtained directly from the authors.
The proposed quantization framework supports a wide range of datatypes and precisions. As shown in Figure 9, to accurately reflect hardware behavior in LLM architectures, the framework must satisfy two key requirements: 1) different operands within the same operation can be quantized to different datatypes and precisions, and 2) all operations in the model must be quantized. Table 2 summarizes existing quantization methods. Most of these approaches focus only on GEMMs; several support mixed precision, but none support mixed data types. In contrast, our quantization flow allows both mixed precision and mixed data types in GEMMs, with all intermediate data between GEMM operations quantized.
For GEMM operations (e.g., linear layers and matrix multiplications between activations), the two operands can have different precisions, e.g., INT4 activations multiplied with INT8 weights. The operands may also use different datatypes, e.g., MXFP activations multiplied with MXINT weights. The GEMM operation also models the casting of outputs to minifloat. For non-GEMM operations, which are executed on the vector machine in hardware, the data is stored as minifloats. When data flows from a non-GEMM operation to a GEMM, a cast module converts minifloats to the corresponding target formats, and this is also modelled in the quantization framework. Beyond this basic setup, we adopt and refine advanced quantization tactics for a more aggressive quantization scheme than plain casting, including Hessian-based quantization optimization (GPTQ) and selective online activation rotation (QuaRot).
GPTQ was initially designed for integer quantization. When adapting GPTQ for MXFP/MXINT, we observe that the clipping range within each microscaling block significantly affects overall model performance. To address this problem, we propose a blockwise clipping range search method that minimizes the quantization error of each output block.
Algorithm 1 outlines the quantization process of PLENA. PLENA uses the per-microscaling-block quantization error to guide the search for the clipping range, and fuses this clipping-range optimization into GPTQ’s iterative error propagation. This also mitigates the weight-outlier problem, which would otherwise affect the value of the shared exponents in MX format, and ultimately enables better end-to-end model performance.
Formally, let $\mathbf{X}\in\mathbb{R}^{M\times K}$ be the inputs for calibration, and $\mathbf{W}\in\mathbb{R}^{N\times K}$ the layer weights. Given a linear layer $\mathbf{Y}=\mathbf{X}\mathbf{W}^{\top}$, we slice the weights across the $K$ dimension with the block size $B$ (i.e., MLEN in Figure 6) defined in our MX data format $\tau$, yielding blocks $\mathbf{W}_{b}\in\mathbb{R}^{N\times B}$ to be quantized. We also slice activations across the $K$ dimension with the same block size, giving $\mathbf{X}_{b}\in\mathbb{R}^{M\times B}$. Let $\textsc{Quantize}(\cdot;\,p,\tau)$ denote per-row quantization in data format $\tau$ with clipping percentile $p$. For each row $i=1,\dots,N$, we search for the percentile by
$$p_{i}^{\star}=\arg\min_{p\in\mathcal{P}}\left\|\mathbf{X}_{b}\Big(\mathbf{W}_{i,\,b:b+B-1}-\textsc{Quantize}(\mathbf{W}_{i,\,b:b+B-1};\,p,\tau)\Big)^{\top}\right\|_{2}^{2}$$
and get the quantized weight block:
$$\widehat{\mathbf{W}}_{i,\,b:b+B-1}=\textsc{Quantize}\big(\mathbf{W}_{i,\,b:b+B-1};\,p_{i}^{\star},\tau\big).$$
We now detail the per-block clipping search. Following the quantization definitions in Section 2.1, consider a block of weights $w_{\tau}$ in data format $\tau$, with representable range $[\min_{\tau},\max_{\tau}]$ and empirical weight range $[x_{\min},x_{\max}]$. Directly mapping the full weight range usually wastes precision due to extreme outliers. To mitigate this, we introduce a clipping parameter $p\in\mathcal{P}\subset[0.5,0.99]$, which shrinks the effective range to $[p\,x_{\min},p\,x_{\max}]$. We then adopt the symmetric quantization setting with the zero-point fixed at $z=0$; the corresponding scale factor is
$$s=\frac{p\,\max(|x_{\min}|,|x_{\max}|)}{\max_{\tau}}.$$
The blockwise quant–dequant operator then becomes
$$\hat{w}=s\cdot\mathrm{clip}\!\left(\mathrm{RTN}\!\left(\frac{w}{s}\right),\ \min_{\tau},\ \max_{\tau}\right).$$
By sweeping over a discrete candidate set $\mathcal{P}$ of clipping parameters, we evaluate multiple effective ranges and select the $p$ per block that minimizes the output reconstruction loss defined in Equation 5.
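A minimal sketch of this per-block sweep is shown below; the 4-bit integer target, the 20-point grid, the calibration shapes, and the injected outlier are illustrative assumptions standing in for the full GPTQ-fused search:

```python
import numpy as np

def quantize_block(w, p, qmax=7):
    """Symmetric RTN quantization of one weight block, with the
    clipping parameter p shrinking the effective range."""
    s = p * np.abs(w).max() / qmax
    return s * np.clip(np.rint(w / s), -qmax - 1, qmax)

def search_clip(Xb, Wb, grid=np.linspace(0.5, 0.99, 20), qmax=7):
    # Pick the p minimizing the output reconstruction error
    # || Xb (Wb - Quantize(Wb; p))^T ||_2^2 for this block.
    best_p, best_err = None, np.inf
    for p in grid:
        err = np.sum((Xb @ (Wb - quantize_block(Wb, p, qmax)).T) ** 2)
        if err < best_err:
            best_p, best_err = p, err
    return best_p

np.random.seed(0)
Xb = np.random.randn(16, 32)   # calibration activations for this block
Wb = np.random.randn(1, 32)    # one weight row-block
Wb[0, 0] = 8.0                 # inject an outlier weight
p_star = search_clip(Xb, Wb)
```

Weighting the search by the calibration activations, rather than by raw weight error, is what lets a block trade a large error on one outlier weight for smaller errors everywhere else.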
As shown in prior work, activations in LLMs typically contain more outliers than weights (Li et al., 2024) and are therefore more sensitive to quantization. QuaRot (Ashkboos et al., 2024) recently demonstrated that applying a rotation matrix to LLMs can effectively suppress outliers. However, we observe that rotating all tensors as suggested by QuaRot may not yield the best performance for MX formats. When a tensor (e.g., the weight matrix) does not exhibit significant outliers, the benefit of rotation diminishes. Equation 9 shows a simplified rotation mechanism in QuaRot (Ashkboos et al., 2024), where a Hadamard matrix $\mathbf{H}$ smooths out the activation distributions and its inverse is fused into the weights. We performed experiments to empirically identify the activation tensors with extreme outliers and propose a selective online rotation scheme.
We notice that applying the rotation to finer-grained weight quantization (e.g., MXINT with smaller block sizes) may increase perplexity. Intuitively, weights have smaller dynamic ranges than activations. The rotation may be unnecessary since most weight outliers are effectively captured by the shared exponents, while permuting the weights with $\mathbf{H}$ leads to different quantized values, which may impact model performance.
Since weights with fine-grained blocking do not require rotation, we propose an activation-only rotation strategy. As shown in Equation 10, the inverse rotation matrix $H^{-1}$ is decoupled from weight quantization and is instead applied directly to the quantized rotated activation at runtime.
The activation distribution varies significantly across layers. Consequently, the effect of rotation also differs from layer to layer. Rather than rotating all activations, we apply the rotation matrix selectively. A search is performed to identify the layers where rotation yields the greatest benefit. This selective activation rotation is performed on-the-fly (the green paths in Figure 5). The ablation of the above quantization modifications is shown in Table 8.
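The effect of rotating, quantizing, and de-rotating an activation vector with an outlier can be sketched as follows; the vector size, 8-bit RTN quantizer, and injected outlier are illustrative assumptions, and the Sylvester construction stands in for a hardware Hadamard transform:

```python
import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)   # orthonormal, so H^{-1} = H^T

def quantize_rtn(x, bits=8):
    qmax = 2 ** (bits - 1) - 1
    s = np.abs(x).max() / qmax
    return s * np.clip(np.rint(x / s), -qmax - 1, qmax)

np.random.seed(0)
d = 64
x = np.random.randn(d)
x[3] = 30.0                               # an activation outlier
H = hadamard(d)

plain = quantize_rtn(x)                   # outlier inflates the scale
rotated = quantize_rtn(x @ H) @ H.T       # rotate, quantize, de-rotate online
plain_err = np.abs(plain - x).mean()
rotated_err = np.abs(rotated - x).mean()  # smaller: outlier energy is spread out
```

The Hadamard transform spreads the single outlier's energy across all coordinates, shrinking the per-element quantization step by roughly the ratio of the pre- and post-rotation dynamic ranges.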
As discussed in Table 1, existing works lack several key components necessary to achieve complete end-to-end LLM inference. Some of these missing elements include a compiler, a simulator, and design-space exploration tools. In contrast, PLENA features a complete design and verification framework that allows it to rapidly adapt to new models or even new hardware accelerators and optimize for them. We also anticipate that future accelerators in the field could reuse certain components of this comprehensive framework to establish end-to-end performance comparisons.
To efficiently deploy decoder-style LLMs, we design a compiler stack targeting LLM models on our PLENA hardware. The models are first exported from the PyTorch framework into the ONNX format (Bai et al., 2019), where standard graph optimizations such as constant folding are applied. The optimized graph is then parsed into our custom IR through pattern matching, which lowers high-level operators into primitives such as GEMM, quantization, dequantization, and FlashAttention.
The critical challenge lies in searching for an optimal scheduling strategy tailored to PLENA. Our scheduling policies include operator fusion, tiling configurations, memory placement, and loop transformations, which jointly determine data reuse, memory traffic, and compute unit utilization. To accelerate the search, we systematically traverse candidate configurations and validate them by checking memory footprint constraints and transaction requirements. Feasible candidates are further evaluated by a lightweight roofline-based performance model, and finally, the top-K schedules are selected to generate the assembly code for execution on PLENA.
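The pruning-then-ranking step can be sketched as follows; the peak-FLOPs and bandwidth figures, candidate names, and footprint/byte counts are all hypothetical placeholders for the compiler's actual cost tables:

```python
def roofline_time(flops, bytes_moved, peak_flops=2e14, peak_bw=1.6e12):
    """Roofline estimate: execution time is bounded by compute
    throughput or memory bandwidth, whichever is slower."""
    return max(flops / peak_flops, bytes_moved / peak_bw)

def rank_schedules(candidates, sram_bytes):
    # Prune candidates whose working set overflows on-chip SRAM,
    # then rank the rest by modelled execution time.
    feasible = [c for c in candidates if c["footprint"] <= sram_bytes]
    return sorted(feasible, key=lambda c: roofline_time(c["flops"], c["bytes"]))

candidates = [  # hypothetical tiling configurations for one GEMM
    {"name": "tile_64",  "footprint": 2**20, "flops": 1e12, "bytes": 4e12},
    {"name": "tile_128", "footprint": 2**23, "flops": 1e12, "bytes": 1e9},
    {"name": "tile_256", "footprint": 2**26, "flops": 1e12, "bytes": 5e8},
]
ranked = rank_schedules(candidates, sram_bytes=2**24)
print(ranked[0]["name"])  # tile_128: tile_256 overflows SRAM; tile_64 is memory-bound
```

Even this coarse model captures the relevant trade-off: larger tiles reduce memory traffic but must still fit the on-chip footprint constraint.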
Our Rust-based cycle-accurate simulator offers significant advantages over the functional-level simulators used in most published accelerators:
Supports full cycle-accurate emulation.
Event-driven simulation that directly executes the generated machine code from the compiler.
HBM-enabled simulation, incorporating realistic HBM timing and bandwidth characteristics (via Ramulator (Luo et al., 2023)).
This simulator supports the same data types and precisions as the PLENA accelerator, and we have verified that it generates results closely matching the RTL simulation of the accelerator.
To automate finding optimal hardware design and quantization parameters, we propose to employ active learning for design space exploration (DSE). We also provide the capability to investigate trade-offs between different objectives, such as maximizing accuracy while minimizing latency and area. For this, we employ multi-objective Bayesian optimization (BO), which allows exploring the Pareto frontier in an active manner.
BO is a framework for optimising non-differentiable functions (Shahriari et al., 2015). Multi-objective BO searches for optimal points in the design space that minimize a multi-objective function $\mathbf{f}$, i.e., $\mathbf{f}(\mathbf{x}^{*})=\min_{\mathbf{x}}\mathbf{f}(\mathbf{x})$. In our case, the objective function has three components: perplexity, latency, and chip area: $\mathbf{f}=\big[f_{p}(\cdot),f_{l}(\cdot),f_{a}(\cdot)\big]$. $\mathbf{f}$ is modelled with a multi-output Gaussian Process, which keeps track of the predictive mean and uncertainty for all points $\mathbf{x}$ in the design space. BO selects which candidate to evaluate next so that uncertainty is reduced globally, while also returning to regions with high predictive mean to further improve upon previous points with favorable outcomes. BO scales to high-dimensional spaces (Wang et al., 2016; Liu et al., 2020), supports both discrete and continuous search variables (Balandat et al., 2020; Deshwal et al., 2021; Daulton et al., 2022), and does not impose limiting restrictions on the properties of the objective $\mathbf{f}$. Its model of the global posterior also facilitates interpretable analysis of the search results. Hence, this setup yields a flexible and informative framework for automating DSE.
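The dominance relation underlying the Pareto frontier can be sketched in a few lines. The helper is illustrative, not our DSE implementation; the sample points are (perplexity, latency in s, area in mm²) values in the spirit of the Table 16 configurations:

```python
# All three objectives (perplexity, latency, area) are minimized.
def dominates(a, b):
    # a dominates b if it is no worse in every objective and strictly better in one.
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    # Keep only the points no other point dominates.
    return [p for p in points if not any(dominates(q, p) for q in points)]

designs = [(6.7, 0.24, 49.6), (6.8, 0.24, 51.6), (12.1, 0.15, 99.4), (6.5, 1.47, 26.5)]
front = pareto_front(designs)
# (6.8, 0.24, 51.6) is dominated by (6.7, 0.24, 49.6); the other three trade off
# perplexity, latency, and area against each other and all stay on the front.
```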
We base our DSE implementation on the Optuna package (Akiba et al., 2019) and conduct experiments with a BoTorch sampler and a tree-structured sampler. With the BoTorch (Balandat et al., 2020) sampler, we treat the design space as continuous during posterior modelling, but discretize the points proposed by BO when evaluating concrete design choices. We also test an alternative, the Tree-Structured Parzen Estimator (Watanabe, 2023b), which is often used for discrete spaces.
In our co-design setup, we incorporate post-training quantization directly into the optimization loop. This allows us to evaluate candidate hardware and quantization configurations jointly, using pre-trained model weights while searching over quantization parameters such as datatype and precision settings for activations and KV cache. The joint search space is defined in Table 14. For each candidate design, we assess accuracy, latency, and area:
Accuracy is measured in terms of language modeling quality, where we evaluate perplexity on Wikitext2 using our accuracy evaluator.
Latency and area utilization are obtained from our Roofline-based simulators, as illustrated in Figure 10.
To ensure efficient exploration, we impose input constraints on the design space (Table 15) and apply rejection sampling to discard invalid or infeasible candidates. This avoids unnecessarily costly objective evaluations and accelerates convergence of the search. We first conduct experiments on Llama 3.2-1B to enable rapid iteration, and then extend our evaluation to Llama-3-8B. The results are described in Section 5.3.
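A minimal sketch of the rejection-sampling step, assuming simplified constraints in the spirit of Table 15 (the widths, scale size, and bandwidth bound are illustrative placeholders, not PLENA's exact parameters):

```python
import random

# Candidates are resampled until they satisfy the constraints, so no objective
# evaluation is ever spent on an infeasible configuration.
ACT_WIDTHS = {"MXINT4": 4, "MXINT8": 8, "MXFP_E4M3": 8}   # bits per element

def feasible(blen, mlen, act, bw_bits=1510):
    if mlen < blen or mlen % blen != 0:          # tile-size constraints
        return False
    scale_bits = 8                               # assumed shared-scale width
    # Per-cycle transfer (elements plus one scale per block) must fit the budget.
    return mlen * ACT_WIDTHS[act] + (mlen // blen) * scale_bits < bw_bits

def sample_candidate(rng):
    while True:                                  # rejection sampling
        blen = rng.choice([2, 4, 8, 16, 32])
        mlen = rng.choice([2, 4, 8, 16, 32, 64, 128, 256])
        act = rng.choice(list(ACT_WIDTHS))
        if feasible(blen, mlen, act):
            return blen, mlen, act
```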
We evaluate our quantization framework mainly on two families of LLMs, namely LLaMA-2 (Touvron et al., 2023b) and LLaMA-3 (Meta, 2024). We also demonstrate our system on MoE (e.g., GPT-OSS) and MLA-based (QWen-MLA) models. Quantization performance is measured in terms of perplexity on the WikiText-2 dataset (Merity et al., 2016). The entire quantization process requires approximately 2–20 GPU hours on NVIDIA H100 GPUs, depending on the model size and configuration.
We compare against several state-of-the-art quantization methods, including software-based approaches such as GPTQ (Frantar et al., 2022), OmniQuant (Shao et al., 2023b), and QuaRoT (Ashkboos et al., 2024), as well as hardware-accelerated approaches such as Atom (Zhao et al., 2024) and MicroscopiQ (Ramachandran et al., 2025).
PLENA is implemented in SystemVerilog RTL. We perform synthesis using Synopsys Design Compiler with the 7 nm OpenROAD predictive process design kit (Clark et al., 2016) to generate area and power estimates under a 1 GHz clock frequency.
Since the works we selected for comparison, MicroScopiQ (Ramachandran et al., 2025), FIGNA (Jang et al., 2024), and Olive (Guo et al., 2023), are not open-source and were not evaluated using the same technology node or toolchain, we re-implemented their core components and integrated them into the PLENA system for a fair inference performance comparison. Additionally, DeepScale (Sarangi and Baas, 2021) is used for overall system performance estimation, scaling all designs to the 7 nm process. The detailed area and power of their core units are evaluated using our own implementations.
Instead of comparing only with SOTA accelerators, we also evaluate against the latest high-performance commercial compute units, including GPUs (A100 80G) and TPUs (v6e-8). The GPU experiments are conducted in an environment with Ubuntu 22.04, CUDA 12.8, Python 3.11, PyTorch 2.8.0, and vLLM 0.10 V1. The TPU experiments are conducted with the v2-alpha-tpuv6e software and the vllm\vllm_tpu docker image.
Note: Results marked with ∗ are reproduced from the authors’ released code. Specifically, for the QuaRot 4/4/16 configuration, we follow the experimental setup described in their paper, where activations are per-token symmetric-quantized with a clipping ratio of 0.9.
We evaluate our quantization method against related work; results are summarized in Table 3. For a fair comparison, we first match prior settings by quantizing only the nine GEMMs in the Llama decoder. We then report full-system experiments that also quantize nonlinear functions, RoPE, and embeddings in Table 4. In the W4A4KV16 setting, our results outperform all related work. For Llama-3-8B, our method reduces perplexity by at least 1.24 compared with prior approaches. The improvement comes from three aspects: (1) MXInt operation: while previous work (Hu et al., 2025) adopts a group size of 32, our design keeps the group size small while still maintaining high hardware efficiency. (2) Selective rotation: our approach searches for the best layer-wise rotation combination for each model. Unlike QuaRot (Ashkboos et al., 2024), which merges rotation into weights, we apply online rotation only to specific layers, providing an additional design space for finding optimal solutions in the PTQ setting. (3) Clipping strategy: by integrating output-guided, blockwise clipping into iterative weight quantization, we validate that output reconstruction error correlates strongly with end-task performance; consequently, our approach substantially reduces perplexity degradation.
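As a sketch of the MXInt scheme with a small group size, the following shows group-wise quantization with one shared power-of-two scale per group. The rounding and clamping details are assumptions for illustration, not PLENA's exact RTL behavior:

```python
import math

# MXINT-style group quantization: each group shares one power-of-two scale and
# elements are stored as narrow signed integers (4-bit here, per the W4 setting).
def mxint_quantize(group, bits=4):
    qmax = 2 ** (bits - 1) - 1                        # e.g. +7 for 4-bit
    amax = max(abs(v) for v in group) or 1.0
    scale = 2.0 ** math.ceil(math.log2(amax / qmax))  # shared power-of-two scale
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in group]
    return q, scale

def mxint_dequantize(q, scale):
    return [v * scale for v in q]

g = [0.31, -0.12, 0.05, 0.44]
q, s = mxint_quantize(g, bits=4)
approx = mxint_dequantize(q, s)     # elementwise error is at most scale / 2
```

A smaller group size means the shared scale tracks local dynamic range more tightly, at the cost of storing more scales, which is the trade-off the hardware design keeps efficient.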
Note: MicroScopiQ (Ramachandran et al., 2025) was re-implemented by us, and we deploy its replicated compute unit on the PLENA platform for testing. The version of LLaMA-3.1-8B used is LLaMA-3.1-8B-Instruct-quantized.w8a8. "With Q"∗ denotes QuaRot quantization (Ashkboos et al., 2024).
We performed a brute-force sweep to select the vector-core precision, finding that quantizing the remaining operators to a MiniFloat E6M5 format is effectively lossless in perplexity while reducing bit width by 25% relative to FP16. As shown in Table 4, the maximum perplexity increase under full-system quantization is ≤ 0.05.
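A round-to-nearest minifloat quantizer along these lines illustrates the E6M5 rounding; flushing subnormals to zero and saturating on overflow are simplifying assumptions, not PLENA's exact behavior:

```python
import math

def quantize_minifloat(x, e_bits=6, m_bits=5):
    # Round x to a sign / e_bits-exponent / m_bits-mantissa value (E6M5 by default).
    if x == 0.0:
        return 0.0
    sign = math.copysign(1.0, x)
    m, e = math.frexp(abs(x))              # abs(x) = m * 2**e with m in [0.5, 1)
    frac = 2 * m - 1                       # value = (1 + frac) * 2**(e - 1)
    exp = e - 1
    qfrac = round(frac * 2 ** m_bits) / 2 ** m_bits
    if qfrac == 1.0:                       # mantissa rounded up past 2.0
        qfrac, exp = 0.0, exp + 1
    bias = 2 ** (e_bits - 1) - 1
    if exp < 1 - bias:
        return 0.0                         # flush-to-zero instead of subnormals
    if exp > bias:                         # saturate at the largest finite value
        qfrac, exp = 1 - 2 ** -m_bits, bias
    return sign * (1 + qfrac) * 2 ** exp

q = quantize_minifloat(0.1)                # 0.099609375 in E6M5
```

With 5 mantissa bits the relative rounding error is at most about 2⁻⁶, which is consistent with the near-lossless perplexity observed for the nonlinear operators.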
Note: The remaining accelerators and TPUs are not included since they do not support these configurations.
This subsection presents the results of our design space exploration experiments. Figure 11 shows the Empirical Attainment Surfaces (EAS) for the Pareto fronts found when optimizing with Llama 3.2-1B and Llama-3-8B. EAS is a visualization approach well-suited to conveying the uncertainty of Pareto fronts obtained from multiple runs with different random seeds (Knowles, 2005; Fonseca et al., 2011). Existing tools support visual analysis for two objectives (Watanabe, 2023a), hence we plot EAS for accuracy and latency first. Then, in Table 16, we analyze the relationship between all objectives. Figure 11 shows that active learning with the BoTorch sampler achieves a significantly better tradeoff between latency and perplexity than naive randomized sampling. The Tree-Structured Parzen Estimator (TPE) shows more modest gains when optimizing with Llama 3.2-1B compared to the BoTorch sampler, so we focus on the latter for experiments with Llama-3-8B.
The system-level performance comparison is shown in Table 5, evaluating both small and large GQA-based LLaMA models as well as the recently published MoE-based GPT-OSS model, all implemented in 7 nm technology and supporting long-context inputs. This experiment investigates peak TPS by scaling the batch size to the maximum capacity that HBM can accommodate. As shown, PLENA achieves consistently higher TPS than both the A100 and TPU v6e under identical HBM settings and multiplier counts, with peak performance reaching up to 2.24× that of the A100 and 3.85× that of the TPU v6e. The higher TTFT observed in PLENA is explained by its ability to store more batches within the same HBM capacity using our quantization scheme: as batch size increases, the prefill stage grows longer due to additional memory accesses and computation.
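The capacity effect can be seen with a back-of-envelope KV-cache calculation. The model dimensions below are assumed Llama-3.1-8B-like values (32 layers, 8 KV heads, head dim 128) and illustrative weight footprints, not measured PLENA figures:

```python
# The peak batch size is whatever fits in HBM after the (quantized) weights.
def max_batch(hbm_gib, weight_gib, layers, kv_heads, head_dim, ctx_len, kv_bits):
    kv_bytes_per_token = 2 * layers * kv_heads * head_dim * kv_bits / 8  # K and V
    free_bytes = (hbm_gib - weight_gib) * 2**30
    return int(free_bytes // (kv_bytes_per_token * ctx_len))

# FP16 KV vs 4-bit KV at a ~90k-token agentic context (5.6k prompt + 85k generation),
# on the 320 GiB aggregate HBM used as the reference in Table 5:
b16 = max_batch(320, 16, layers=32, kv_heads=8, head_dim=128, ctx_len=90_000, kv_bits=16)
b4 = max_batch(320, 4, layers=32, kv_heads=8, head_dim=128, ctx_len=90_000, kv_bits=4)
# b4 exceeds 4x b16: lower KV precision (plus smaller weights) lets the same HBM
# hold several times more concurrent sequences, which raises TPS but lengthens prefill.
```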
∗Attainable FLOPs are computed from utilization and peak design throughput. Micro = MicroScopiQ. S. A FLOPs = standard-workload attainable FLOPs. A. A FLOPs = agentic-workload attainable FLOPs.
As shown in Table 7, PLENA achieves significantly higher utilization than prior designs in both short- and long-context workloads, with up to 8.5× improvement in attainable utilization.
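The utilization gap can be illustrated with a simplified row-occupancy model; the array shapes follow Table 7 (a 64×64 square baseline versus the flattened (4, 512) array), and the model ignores pipeline fill and drain as a simplifying assumption:

```python
# With a small batch dimension M, a square systolic array leaves most operand rows
# idle, while a flattened array with BLEN matched to M keeps every PE row busy.
def pe_utilization(m, rows):
    # Fraction of PE rows carrying operand rows during a fat GEMM.
    return min(m, rows) / rows

square = pe_utilization(4, 64)   # 4 of 64 rows active on the square baseline
flat = pe_utilization(4, 4)      # BLEN = M = 4: all rows of the flattened array active
```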
This paper introduces PLENA, a hardware–software co-designed system that features a flattened systolic array, an asymmetric quantization scheme, and native architectural support for FlashAttention, addressing the underutilization challenges posed by memory bandwidth and capacity walls. Beyond the hardware, PLENA is supported by a full toolchain—including a compiler, cycle-accurate simulator, and design space exploration framework—that enables rapid adaptation and optimization for emerging transformer models. Future work will focus on further optimizing GEMM utilization in FlashAttention and extending PLENA with a multi-core flattened systolic array to better exploit parallelism. In addition, the compiler can be enhanced to provide finer-grained control over execution scheduling. Finally, we plan to integrate PLENA with GPUs to form a heterogeneous LLM acceleration system.
Table: S1.T1: A comparison of LLM accelerators: most lack cycle-accurate simulators for RTL-level timing, omit accurate HBM simulation in evaluation, are constrained by the lack of an ISA with compiler support, and accelerate only a subset of kernels, resulting in restricted flexibility, the need to offload to GPUs/CPUs, frequent host–device transfers, and significant data-movement overheads.
| Feature | PLENA | PICACHU (Qin et al., 2025) | MicroScopiQ (Ramachandran et al., 2025) | FlightLLM (Zeng et al., 2024) | Tender (Lee et al., 2024) | FIGNA (Jang et al., 2024) | SystolicAttention (Lin et al., 2025) | Olive (Guo et al., 2023) |
|---|---|---|---|---|---|---|---|---|
| Simulator | L3 | L1 | L2 | L1 | L1 | L1 | L3 | L1 |
| Custom ISA & Auto Code Gen | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ | ✓ | ✗ |
| DSE | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| FlashAttention support | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ✗ |
| Full inference coverage∗ | ✓ | ✗ | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ |
| Open source | ✓ | ✗ | ✗ | ✗ | ✓ |
Table: S4.T2: Comparison of post-training quantization methods for LLMs across key features. (QW, QACT, QKV) denote quantization of weights, activations, and key–value cache, respectively. Each decoder layer in LLaMA contains nine matrix multiplications, as outlined in Algorithm 2. PLENA introduces the first accuracy evaluator supporting mixed MX datatypes, providing software emulation for MXINT, MXFP, and MiniFloat formats. Unlike prior approaches, it fully simulates hardware-precision behavior in software, extending quantization beyond matrix multiplications to RMSNorm, embedding layers, LM output heads, and nonlinear operations such as softmax and SiLU (see Algorithm 2).
| Method | PLENA | MicroScopiQ∗ (Ramachandran et al., 2025) | GPTQ (Frantar et al., 2022) | QuaRot (Ashkboos et al., 2024) | OmniQuant (Shao et al., 2023b) | SmoothQuant (Xiao et al., 2023) | Atom (Zhao et al., 2024) | KiVi | M-ANT (Hu et al., 2025) |
|---|---|---|---|---|---|---|---|---|---|
| (QW, QACT, QKV) | (✓,✓,✓) | (✓,✓,✓) | (✓,✗,✗) | (✓,✓,✓) | (✓,✓,✓) | (✓,✓,✓) | (✓,✓,✓) | (✗,✗,✓) | (✓,✓,✓) |
| No. GEMMs | 9/9 | 9/9 | 7/9 | 7/9 | 9/9 | 7/9 | 9/9 | 0/9 | 9/9 |
| Nonlinear_FN | ✓ | ✗ | ✗ | ✗ | ✓* | ✗ | ✗ | ✗ | ✗ |
| Embedding & lm_head | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| RMSNorm | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| RoPE | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| Supported datatypes | MXFP, MXINT, MiniFloat | MXFP, MXINT | INT | INT | INT | INT | INT, FP | INT | MANT |
| Mixed-precision | ✓ | ✓ | ✗ | ✓ | ✓ | ✓ | ✓ | ✗ | ✓ |
Table: S6.T3: WikiText-2 perplexity (lower is better) under GEMM-only emulation (nonlinear ops left in full precision) for Llama. Results for GPTQ, AWQ, OmniQuant, and Atom are taken from MicroScopiQ; QuaRot numbers are from the paper or reproduced when not reported. W/A/KV denote bit widths for weights, activations, and KV cache.
| LLaMA-2 (Touvron et al., 2023b) | LLaMA-3 (Meta, 2024) | |||||
| Method | W/A/KV | 7B | 13B | 70B | 8B | 70B |
| Baseline | 16/16/16 | 5.47 | 4.83 | 3.31 | 6.13 | 2.85 |
| GPTQ (Frantar et al., 2022) | 4/16/16 | 6.23 | 5.58 | 4.28 | 8.12 | 3.75 |
| AWQ (Lin et al., 2024) | 4/16/16 | 5.82 | 5.19 | 4.08 | 7.96 | 3.58 |
| OmniQuant (Shao et al., 2023b) | 4/16/16 | 5.74 | 5.02 | 3.47 | 7.09 | 3.46 |
| MicroScopiQ (Ramachandran et al., 2025) | 4/16/16 | 5.65 | 5.02 | 3.42 | 6.89 | 3.25 |
| QuaRot (Ashkboos et al., 2024) | 4/16/16 | 5.60 | 5.00 | 3.41 | 6.52∗ | 3.53∗ |
| Ours | 4/16/16 | 5.61 | 4.97 | 3.41 | 6.45 | 3.59 |
| OmniQuant (Shao et al., 2023b) | 4/4/16 | 11.47 | 8.32 | 5.41 | 10.21 | 5.30 |
| SmoothQuant (Xiao et al., 2023) | 4/4/16 | 20.47 | 15.63 | 17.62 | 29.54 | 19.32 |
| Atom (Zhao et al., 2024) | 4/4/16 | 6.16 | 6.12 | 5.20 | 8.12 | 4.69 |
| MicroScopiQ (Ramachandran et al., 2025) | 4/4/16 | 6.11 | 5.57 | 4.48 | 8.12 | 4.65 |
| QuaRot (Ashkboos et al., 2024) | 4/4/16 | 6.02∗ | 5.36∗ | 3.78 | 8.00∗ | 6.33∗ |
| M-ANT (Hu et al., 2025) | 4/4/16 | 5.92 | 5.24 | - | - | - |
| Ours | 4/4/16 | 5.69 | 5.03 | 3.59 | 6.76 | 4.51 |
| QuaRot (Ashkboos et al., 2024) | 4/4/4 | 6.10 | 5.40 | 3.79 | 8.16 | 6.66 |
| QuaRot-128G (Ashkboos et al., 2024) | 4/4/4 | 5.93 | 5.26 | 3.61 | 7.36 | 5.51 |
| Ours | 4/4/4 | 5.89 | 5.18 | 3.62 | 7.22 | 4.77 |
Table: S6.T4: Quantization results for Llama comparing GEMM-only quantization with full-system quantization (including GEMM, nonlinear ops, input embeddings, and LM head). Nonlinear operators are simulated in MiniFloat E6M5.
| LLaMA-2 (Touvron et al., 2023b) | LLaMA-3 (Meta, 2024) | |||||
| Method | W/A/KV | 7B | 13B | 70B | 8B | 70B |
| Baseline | 16/16/16 | 5.47 | 4.83 | 3.31 | 6.13 | 2.85 |
| Ours | 4/4/4 | 5.89 | 5.18 | 3.62 | 7.22 | 4.77 |
| Ours-Full System | 4/4/4 | 5.91 | 5.19 | 3.63 | 7.23 | 4.82 |
Table: S6.T5: System-level comparison across standard (Prompt = 1k, Gen = 128) and agentic (Prompt = 5.6k, Gen = 85k) workloads. For fairness, we use four A100 GPUs with a total HBM capacity of 320 GB as the reference. PLENA and MicroScopiQ are both assumed to have four cores and identical HBM configurations, including capacity and bandwidth. The selected configurations are listed in Table 17. Since the GPU’s silicon area includes significant overhead for non-compute functionality, we ensure that the multiplier count is matched across systems for a balanced comparison.
| System | LLaMA-3.1-8B | | | | LLaMA-3.3-70B | | | |
| | Standard | | Agentic | | Standard | | Agentic | |
| | TTFT (s) | TPS (×A100) | TTFT (s) | TPS (×A100) | TTFT (s) | TPS (×A100) | TTFT (s) | TPS (×A100) |
|---|---|---|---|---|---|---|---|---|
| A100 80G | 6.20 | 1.00× | 0.22 | 1.00× | 1.12 | 1.00× | 1.05 | 1.00× |
| A100 80G (With Q)∗ | 5.13 | 1.66× | 0.19 | 1.39× | 3.41 | 1.23× | 2.46 | 1.32× |
| TPU v6e | 5.63 | 0.90× | 4.58 | 0.39× | 50.07 | 0.31× | 7.98 | 0.84× |
| MicroScopiQ (Ramachandran et al., 2025) | 16.43 | 0.35× | 3.27 | 0.57× | 61.63 | 0.16× | 19.23 | 0.09× |
| PLENA | 4.19 | 2.24× | 0.21 | 1.58× | 5.65 | 1.19× | 1.27 | 1.49× |
Table: S6.T6: System-level comparison on GPT-OSS 20B (MoE) (Artetxe et al., 2022) and QWen with MLA (Meng et al., 2025), showing that PLENA can be adapted to new models with both MLA and MoE configurations and achieve higher TPS than A100 80G under the same experimental settings as Table 5.
| System | GPT-OSS 20B (MoE) | qwen2.5-7B∗ | ||||||
| Standard | Agentic | Standard | Agentic | |||||
| TTFT (s) | TPS (×A100) | TTFT (s) | TPS (×A100) | TTFT (s) | TPS (×A100) | TTFT (s) | TPS (×A100) | |
| A100 80G | 9.39 | 1.00× | 1.87 | 1.00× | 8.21 | 1.00× | 1.17 | 1.00× |
| PLENA | 6.13 | 1.36× | 1.41 | 1.21× | 5.71 | 1.42× | 1.52 | 1.30× |
Table: S6.T7: Compute area, utilization, and attainable FLOPs of systolic arrays under W4A4KV4 bitwidth for LLaMA-3.3-70B. Baselines use 64×64 arrays, while PLENA employs a flattened (4, 512) array. Results are shown for Standard (Prompt = 1k, Gen = 128) and Agentic (Prompt = 5.6k, Gen = 8k) workloads.
| Metric | Micro (Ramachandran et al., 2025) | Olive (Guo et al., 2023) | FIGNA (Jang et al., 2024) | PLENA |
|---|---|---|---|---|
| Comp Area (mm²) | 0.1378 | 0.319 | 0.471 | 0.237 |
| TOPs/mm² | 59.45 | 25.66 | 17.39 | 34.49 |
| S. A FLOPs/mm²∗ | 28.76 | 11.59 | 7.51 | 32.80 |
| A. A FLOPs/mm²∗ | 1.08 | 0.44 | 0.31 | 5.31 |
Table: A1.T8: Ablation study on quantization techniques, covering all 9 GEMMs in the Llama-3-8B model. Quantization is configured with W_MXINT4, A_MXINT4, KV_MXINT4 with block size 16.
| Method | Metric (PPL ↓) |
|---|---|
| Baseline FP16 | 6.14 |
| RTN | 8.2763 (2.3793 ↑) |
| RTN + Err_w Clip | 8.1948 (0.0815 ↓) |
| GPTQ + Err_w Clip | 8.5193 (0.3245 ↑) |
| GPTQ + Err_y Clip | 7.6026 (0.9167 ↓) |
| GPTQ + Err_y Clip | 7.2218 (0.3808 ↓) |
Table: A1.T9: Perplexity on WikiText2 (lower is better) for various quantization settings applied to LLaMA-3 8B. We kept MXINT4 for weights.
| Quant Method | A/KV Datatype Search | ||
|---|---|---|---|
| e1m2 (MXFP4) | e2m1 (MXFP4) | MXINT4 | |
| Baseline FP16 | 6.14 | ||
| Ours | 8.7205 | 23.1579 | 7.22 |
Table: A1.T10: This table investigates the effect of online rotation on activations in the linear layers. For the LLaMA2-7B model, applying rotation to the down_proj layer results in worse performance compared to not rotating it, whereas this effect is not observed in the LLaMA3-8B model. Moreover, rotating the o_proj layer severely degrades the performance of LLaMA3-8B. These results suggest that the effectiveness of rotation is highly model-dependent.
| Rotation setting | LLaMA2-7B | LLaMA3-8B |
|---|---|---|
| Attn Only | 5.9367 | 7.3933 |
| Attn + Down_proj | 5.9405 | 7.2721 |
| Attn + Up_proj | 5.9263 | 7.3529 |
| Attn + Gate_proj | 5.9241 | 7.3875 |
| Attn + Q_proj | 5.9183 | 7.3616 |
| Attn + K_proj | 5.9182 | 7.3555 |
| Attn + V_proj | 5.9322 | 7.3788 |
| Attn + O_proj | 5.9238 | nan |
Table: A1.T11: A summary of the PLENA customized ISA for the accelerator.
| Instruction Type | Description | Instruction No. |
|---|---|---|
| Matrix | Controls GEMM and GEMV operations, with or without matrix transposition | 6 |
| Vector | Performs elementwise, reduction operations, and rotation for quantization | 13 |
| Scalar | Performs scalar INT and FP arithmetic | 17 |
| HBM | Handles data transfers between HBM and matrix/vector SRAMs | 3 |
| Control | Defines operation settings, including the HBM physical address | 4 |
Table: A1.T13: Zero-shot accuracy of Llama-3 and Llama-2 models at 4 bits (A4W4KV4), comparing only with QuaRot on PIQA (PQ), WinoGrande (WG), HellaSwag (HS), Arc-Easy (A-e), Arc-Challenge (A-c), and LAMBADA (LA). Baseline and QuaRot results are taken from QuaRot Tables 2 and 12.
| Model | Method | PQ (Bisk et al., 2020) | WG (Sakaguchi et al., 2021) | HS (Zellers et al., 2019) | A-e (Clark et al., 2018) | A-c (Clark et al., 2018) | LA (Paperno et al., 2016) | Avg. |
|---|---|---|---|---|---|---|---|---|
| Llama-3-8B | FP16 | 80.74 | 72.77 | 79.06 | 77.82 | 53.33 | 75.63 | 73.22 |
| QuaRot (Ashkboos et al., 2024) | 75.14 | 65.82 | 72.94 | 68.01 | 43.34 | 65.81 | 65.18 | |
| Ours | 79.11 | 71.35 | 76.97 | 74.07 | 50.51 | 74.07 | 71.01 | |
| Llama-2-7B | FP16 | 79.11 | 69.06 | 75.99 | 74.58 | 46.25 | 73.90 | 69.82 |
| QuaRot (Ashkboos et al., 2024) | 76.77 | 63.77 | 72.16 | 69.87 | 40.87 | 70.39 | 65.64 | |
| Ours | 78.73 | 68.19 | 74.24 | 72.52 | 43.69 | 73.30 | 68.45 |
Table: A1.T14: Hardware and quantization parameters co-design search space. Categorical parameters are one-hot encoded; integer parameters are expressed as a power of 2.
| Parameter | Description | Search range |
|---|---|---|
| BLEN | Tile size of block unit | [2, 4, 8, 16, 32] |
| MLEN | Tile size of matrix unit | [2, 4, 8, 16, 32, 64, 128, 256, 512] |
| VLEN | Tile size of vector unit | [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024] |
| HBM_M_Prefetch | Prefetch amount for matrix data from HBM | [2, 4, 8, 16, 32, 64, 128, 256] |
| HBM_V_Prefetch | Prefetch amount for vector data from HBM | [2, 4, 8, 16, 32, 64, 128, 256] |
| HBM_V_Writeback | Writeback amount for vector data to HBM | [2, 4, 8, 16, 32, 64, 128, 256] |
| ACT_WIDTH | Activation precision | MXINT_{2,3,4,8}, MXFP_{E1M2, E2M1, E3M4, E4M3, E5M2} |
| KV_WIDTH | Key/Value precision | MXINT_{2,3,4,8}, MXFP_{E1M2, E2M1, E3M4, E4M3, E5M2} |
| FP_SETTING | Floating-point precision setting | FP_{E3M2, E2M3, E6M5, E5M6, E4M7, E8M5} |
| INT_DATA_WIDTH | Integer data width | [16, 32, 64] |
Table: A1.T15: Constraints applied to the hardware and quantization co-design search space.
| Constraint | Description |
|---|---|
| MLEN ≥ BLEN | Matrix tile size must be at least the block tile size |
| MLEN mod BLEN = 0 | Matrix tile size must be divisible by the block tile size |
| MATRIX_SRAM_DEPTH ≥ 2 × MLEN | Matrix SRAM depth must accommodate 2× MLEN |
| VECTOR_SRAM_DEPTH ≥ 2 × HEAD_DIM + HIDDEN_DIM / VLEN | Vector SRAM depth must store heads and hidden slices |
| INT_SRAM_DEPTH ≥ 16 | Minimum integer SRAM depth |
| FP_SRAM_DEPTH ≥ 3 × MLEN + FP_CONSTANT_NUM | Floating-point SRAM depth constraint |
| MLEN × ACT_WIDTH + (MLEN / BLEN) × ACT_SCALE_WIDTH < 1510 | Bandwidth constraint at 1 GHz, 1 TB/s |
| VLEN × ACT_WIDTH + (VLEN / BLEN) × ACT_SCALE_WIDTH < 1510 | Bandwidth constraint at 1 GHz, 1 TB/s |
| MLEN × ACT_WIDTH + (MLEN / BLEN) × ACT_SCALE_WIDTH < 1510 | Bandwidth constraint at 1 GHz, 1.5 TB/s |
| VLEN × ACT_WIDTH + (VLEN / BLEN) × ACT_SCALE_WIDTH < 1510 | Bandwidth constraint at 1 GHz, 1.5 TB/s |
Table: A1.T16: Design space exploration on Llama-3-8B: multi-objective results for five configurations from a BoTorch run. We report perplexity (↓) from the accuracy evaluator, end-to-end latency (seconds ↓), and area (µm² ↓) from the respective cost models. Perplexity is computed with GEMM-only emulation (nonlinear ops omitted) for faster iteration; therefore the FP setting affects latency and area but not the accuracy metric. We load weights pre-quantized to MXINT4 via our PTQ method and quantize activations and the KV cache on-the-fly during inference.
| Parameters | | | | | | | | | | Metrics | | |
| BLEN | MLEN | VLEN | HBM_M_Prefetch | HBM_V_Prefetch | HBM_V_Writeback | ACT_WIDTH | KV_WIDTH | FP_SETTING | INT_DATA_WIDTH | Perplexity ↓ | Lat (s) ↓ | Area (µm²) ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 32 | 128 | 32 | 16 | 8 | 256 | MXFP_E4M3 | MXFP_E3M4 | FP_E4M7 | 64 | 6.70 | 0.24 | 49615017.52 |
| 32 | 128 | 64 | 4 | 8 | 256 | MXINT_8 | MXINT_4 | FP_E3M2 | 32 | 6.76 | 0.24 | 51639793.20 |
| 32 | 256 | 128 | 256 | 64 | 128 | MXFP_E1M2 | MXINT_8 | FP_E6M5 | 16 | 12.14 | 0.15 | 99425984.56 |
| 8 | 128 | 32 | 128 | 8 | 256 | MXFP_E3M4 | MXFP_E3M4 | FP_E5M6 | 16 | 6.54 | 1.47 | 26456937.52 |
| 16 | 128 | 16 | 4 | 16 | 64 | MXINT_8 | MXFP_E4M3 | FP_E3M2 | 64 | 6.60 | 0.49 | 31983011.76 |
Table: A1.T17: Configuration settings for compute performance experiments, chosen to match the multiplier count of the A100 GPU. For MicroscopiQ, MLEN and BLEN are set to the same value to form a square shape.
| System | Freq (GHz) | MLEN | BLEN | VLEN | SRAM (MB) | W. Width | A. Width | KV Width | FP Setting |
|---|---|---|---|---|---|---|---|---|---|
| PLENA | 1 | 2048 | 32 | 2048 | 128 | MXINT4 | MXINT4 | MXINT4 | FP E4M3 |
| MicroscopiQ | 1 | 256 | 256 | 2048 | 128 | MXINT4 | MXINT4 | MXINT4 | FP E4M3 |
(a) Compared with standard chatbot workloads, the selected agentic web and code tasks generally consume over 100×\times more tokens.
(b) Compute shifts from FFN to Attention with increasing context length.
(c) KV cache scales with context length, eventually dominating memory usage.
(a) PLENA achieves higher utilization than the standard square systolic array (same resources).
(b) PLENA’s optimization pathways—(1) a flattened systolic array and (2) asymmetric quantization—together achieve improved effective memory bandwidth utilization and help reduce memory capacity limitations.
A typical setting of the MX data formats in this design. A scale is shared by a group of elements. The scale is quantized to a power of two, and the elements can be quantized to integer or minifloat.
PLENA architecture overview. Execution is controlled by the decoder’s system-pipeline controller, which derives control signals from decoded instructions and monitors memory dependencies. For example, if the current instruction needs to read from a Vector SRAM row that is still being updated by the vector or matrix unit, the controller inserts a stall to ensure correctness. Vector SRAM acts as the on-chip scratchpad, providing data to the matrix and vector units and accepting their results.
Asymmetric-precision datapath example. Vector SRAM stores FP4 values, whereas Matrix SRAM stores MX-INT4 values. Green paths denote the selective rotational quantization flow: a fast Walsh–Hadamard transform is applied, with its inverse used to map back (Pan et al., 2021). Blue paths indicate the data flow for the remaining computation.
Processing flow for the weight–activation GEMM. Because memory capacity constrains batch size, the M dimension remains small. Setting BLEN = M on the flattened systolic array yields near-100% utilization.
The flattened systolic array is composed of a series of smaller square-shaped systolic arrays arranged in a row to form the desired fat GEMM shape. Each receives inputs distributed from the MLEN vector buffers W and X, as shown in Figure 4.
Data layout and interaction in HBM. Data of different precisions can be stored simultaneously according to the defined storage pattern in HBM. Strided load and store operations are managed by the address remap unit, which generates and passes strided addresses to the TileLink channel.
A dataflow graph of LLM workloads in PLENA. The blue lines indicate data represented in MX datatypes, and the orange lines indicate minifloat datatypes. The GEMM and projection layers are executed on the PLENA matrix unit, which takes inputs in MX and produces outputs in minifloats. All other operations are executed on the vector unit in minifloats.
An overview of the open-source PLENA system.
Empirical Attainment Surfaces for latency (↓) and perplexity (↓) objectives across multiple seeds, evaluated with Llama 3.2-1B and Llama-3-8B over the co-design space shown in Table 14. For the 1B model, we run 9 seeds with 50 trials, comparing BoTorch and TPE methods against Random sampling. For the 8B model, we run 5 seeds with 50 trials, comparing BoTorch against Random. Shaded regions show the 25% and 75% attainment bands across seeds.
This matrix SRAM supports both transposed and untransposed reads without additional cost. The key idea is to store each row of data separately across a set of sub-SRAMs, where the number of sub-SRAMs equals the vector dimension being stored. The row index assigned to each element differs across the sub-SRAMs, ensuring that elements from the same matrix column (green dotted line) are distributed across different sub-SRAMs. With this organization, when reading from the SRAM—whether in transposed or untransposed mode—each requested element resides in a different sub-SRAM. As a result, only one read port per sub-SRAM is required.
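The skewed placement described above can be sketched directly; the modular mapping is illustrative of the scheme, not the exact RTL addressing:

```python
# Element (r, c) of an n×n tile is placed in sub-SRAM (r + c) % n (at local row r).
# A row read and a transposed (column) read then each touch every sub-SRAM exactly
# once, so a single read port per sub-SRAM suffices in both access modes.
def bank_of(r, c, n):
    return (r + c) % n

n = 8
row_banks = [bank_of(3, c, n) for c in range(n)]   # untransposed read of row 3
col_banks = [bank_of(r, 5, n) for r in range(n)]   # transposed read of column 5
# Both lists are permutations of 0..n-1: no two requested elements share a bank.
```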
In the hardware implementation of the PE array, elements and scales flow from top to bottom and left to right. All computations are performed using integer arithmetic.
$$ s(p)=\frac{p\,x_{\max}}{\tau_{\max}}. \tag{S4.E7} $$
Algorithm 2: Weights Quantization (blockwise, with output-guided percentile clipping)

Require: full-precision weight matrix W ∈ R^{N×K}; calibration activations X ∈ R^{M×K}; block size B (i.e., MLEN) defined in our MX data format; percentile set P; target format τ
Ensure: quantized weight matrix Q; block quantization errors E

1: H⁻¹ ← (2XXᵀ + λI)⁻¹
2: H⁻¹ ← Cholesky(H⁻¹)ᵀ
3: Initialize Q ← 0 ∈ R^{N×K}; E ← 0 ∈ R^{N×B}
4: for each block b = 0, B, 2B, …, K − 1 do
5:   b₂ ← min(b + B, K)
6:   W_b ← W_{:, b:b₂}  ⊲ extract weight block
7:   X_b ← X_{:, b:b₂}  ⊲ extract activation block
8:   Initialize Q_best ← 0, ε_best ← ∞
9:   for each candidate percentile p ∈ P do
10:    W̃_b ← Quantize(W_b, p, τ)
11:    ε ← ‖X_b W_bᵀ − X_b W̃_bᵀ‖²
12:    mask ← ε < ε_best
13:    Q_best[mask, :] ← W̃_b[mask, :]; ε_best[mask] ← ε[mask]
14:  Q_{:, b:b₂} ← Q_best; Δ_b ← W_b − Q_best; d_bb ← diag(H⁻¹_{b:b₂, b:b₂})
15:  E_b ← Δ_b · diag(d_bb)⁻¹; W_{:, b₂:} ← W_{:, b₂:} − E_b · H⁻¹_{b:b₂, b₂:}
| LLaMA-2 [64] | LLaMA-2 [64] | LLaMA-2 [64] | LLaMA-3 [45] | LLaMA-3 [45] | ||
|---|---|---|---|---|---|---|
| Method | W/A/KV | 7B | 13B | 70B | 8B | 70B |
| Baseline | 16/16/16 | 5.47 | 4.83 | 3.31 | 6.13 | 2.85 |
| GPTQ [22] | 4/16/16 | 6.23 | 5.58 | 4.28 | 8.12 | 3.75 |
| AWQ [39] | 4/16/16 | 5.82 | 5.19 | 4.08 | 7.96 | 3.58 |
| OmniQuant [61] | 4/16/16 | 5.74 | 5.02 | 3.47 | 7.09 | 3.46 |
| MicroScopiQ [54] | 4/16/16 | 5.65 | 5.02 | 3.42 | 6.89 | 3.25 |
| QuaRot [6] | 4/16/16 | 5.60 | 5.00 | 3.41 | 6.52 ∗ | 3.53 ∗ |
| Ours | 4/16/16 | 5.61 | 4.97 | 3.41 | 6.45 | 3.59 |
| OmniQuant [61] | 4/4/16 | 11.47 | 8.32 | 5.41 | 10.21 | 5.30 |
| SmoothQuant [70] | 4/4/16 | 20.47 | 15.63 | 17.62 | 29.54 | 19.32 |
| Atom [75] | 4/4/16 | 6.16 | 6.12 | 5.20 | 8.12 | 4.69 |
| MicroScopiQ [54] | 4/4/16 | 6.11 | 5.57 | 4.48 | 8.12 | 4.65 |
| QuaRot [6] | 4/4/16 | 6.02 ∗ | 5.36 ∗ | 3.78 | 8.00 ∗ | 6.33 ∗ |
| M-ANT [29] | 4/4/16 | 5.92 | 5.24 | - | - | - |
| Ours | 4/4/16 | 5.69 | 5.03 | 3.59 | 6.76 | 4.51 |
| QuaRot [6] | 4/4/4 | 6.10 | 5.40 | 3.79 | 8.16 | 6.66 |
| QuaRot-128G [6] | 4/4/4 | 5.93 | 5.26 | 3.61 | 7.36 | 5.51 |
| Ours | 4/4/4 | 5.89 | 5.18 | 3.62 | 7.22 | 4.77 |
| Method | W/A/KV | LLaMA-2 [64] 7B | LLaMA-2 [64] 13B | LLaMA-2 [64] 70B | LLaMA-3 [45] 8B | LLaMA-3 [45] 70B |
|---|---|---|---|---|---|---|
| Baseline | 16/16/16 | 5.47 | 4.83 | 3.31 | 6.13 | 2.85 |
| Ours | 4/4/4 | 5.89 | 5.18 | 3.62 | 7.22 | 4.77 |
| Ours-Full System | 4/4/4 | 5.91 | 5.19 | 3.63 | 7.23 | 4.82 |
| System | LLaMA-3.1-8B Standard TTFT (s) | LLaMA-3.1-8B Standard TPS (×A100) | LLaMA-3.1-8B Agentic TTFT (s) | LLaMA-3.1-8B Agentic TPS (×A100) | LLaMA-3.3-70B Standard TTFT (s) | LLaMA-3.3-70B Standard TPS (×A100) | LLaMA-3.3-70B Agentic TTFT (s) | LLaMA-3.3-70B Agentic TPS (×A100) |
|---|---|---|---|---|---|---|---|---|
| A100 80G | 6.20 | 1.00× | 0.22 | 1.00× | 1.12 | 1.00× | 1.05 | 1.00× |
| A100 80G (With Q) ∗ | 5.13 | 1.66× | 0.19 | 1.39× | 3.41 | 1.23× | 2.46 | 1.32× |
| TPU v6e | 5.63 | 0.90× | 4.58 | 0.39× | 50.07 | 0.31× | 7.98 | 0.84× |
| MicroScopiQ [54] | 16.43 | 0.35× | 3.27 | 0.57× | 61.63 | 0.16× | 19.23 | 0.09× |
| PLENA | 4.19 | 2.24× | 0.21 | 1.58× | 5.65 | 1.19× | 1.27 | 1.49× |
| System | GPT-OSS 20B (MoE) Standard TTFT (s) | GPT-OSS 20B (MoE) Standard TPS (×A100) | GPT-OSS 20B (MoE) Agentic TTFT (s) | GPT-OSS 20B (MoE) Agentic TPS (×A100) | qwen2.5-7B ∗ Standard TTFT (s) | qwen2.5-7B ∗ Standard TPS (×A100) | qwen2.5-7B ∗ Agentic TTFT (s) | qwen2.5-7B ∗ Agentic TPS (×A100) |
|---|---|---|---|---|---|---|---|---|
| A100 80G | 9.39 | 1.00× | 1.87 | 1.00× | 8.21 | 1.00× | 1.17 | 1.00× |
| PLENA | 6.13 | 1.36× | 1.41 | 1.21× | 5.71 | 1.42× | 1.52 | 1.30× |
| Metric | MicroScopiQ [54] | Olive [25] | FIGNA [31] | PLENA |
|---|---|---|---|---|
| Comp Area (mm 2 ) | 0.1378 | 0.319 | 0.471 | 0.237 |
| TOPs/mm 2 | 59.45 | 25.66 | 17.39 | 34.49 |
| S. A FLOPs/mm 2 ∗ | 28.76 | 11.59 | 7.51 | 32.8 |
| A. A FLOPs/mm 2 ∗ | 1.08 | 0.44 | 0.31 | 5.31 |
| Method | Metric (e.g., PPL ↓) |
|---|---|
| Baseline FP16 | 6.14 |
| RTN | 8.2763 (2.3793 ↑) |
| RTN + Err 𝑤 Clip | 8.1948 (0.0815 ↓) |
| GPTQ + Err 𝑤 Clip | 8.5193 (0.3245 ↑) |
| GPTQ + Err 𝑦 Clip | 7.6026 (0.9167 ↓) |
| GPTQ + Err 𝑦 Clip + Selective Rotation | 7.2218 (0.3808 ↓) |
| Quant Method | A/KV: e1m2 (MXFP4) | A/KV: e2m1 (MXFP4) | A/KV: MXINT4 |
|---|---|---|---|
| Baseline FP16 | 6.14 | 6.14 | 6.14 |
| Ours | 8.7205 | 23.1579 | 7.22 |
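For reference, MXINT4, the best-performing A/KV datatype in this search, groups elements into blocks that share one power-of-two scale. The following is a minimal sketch of such block quantization; the block size, scale selection, and rounding details are illustrative, not PLENA's exact implementation.

```python
import numpy as np

def mxint4_quantize(x, block=32):
    """Sketch of MXINT4 block quantization: each block of `block` elements
    shares one power-of-two scale (an E8M0-style exponent), and elements
    are rounded to 4-bit signed integers in [-8, 7]."""
    x = np.asarray(x, dtype=np.float64)
    pad = (-len(x)) % block
    xp = np.pad(x, (0, pad)).reshape(-1, block)
    amax = np.abs(xp).max(axis=1, keepdims=True)
    amax[amax == 0] = 1.0
    # Choose 2^exp so the block max falls in [4, 8), inside the INT4 range.
    exp = np.floor(np.log2(amax)) - 2
    scale = 2.0 ** exp
    q = np.clip(np.round(xp / scale), -8, 7)
    return (q * scale).reshape(-1)[: len(x)]   # dequantized values
```

Because the scale is a pure exponent, dequantization on hardware is a shift rather than a multiply, which is what makes the asymmetric MXINT pipeline cheap in the compute units.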
| Rotated Layer | LLaMA2-7B | LLaMA3-8B |
|---|---|---|
| Attn Only | 5.9367 | 7.3933 |
| Attn + Down_proj | 5.9405 | 7.2721 |
| Attn + Up_proj | 5.9263 | 7.3529 |
| Attn + Gate_proj | 5.9241 | 7.3875 |
| Attn + Q_proj | 5.9183 | 7.3616 |
| Attn + K_proj | 5.9182 | 7.3555 |
| Attn + V_proj | 5.9322 | 7.3788 |
| Attn + O_proj | 5.9238 | nan |
Require: t ∈ [V]^T                                 ⊲ token ids
Require: B, T, d, L, H, H_kv                       ⊲ batch, seq, hidden_dim, #layers, #Q heads, #KV heads
Require: (cos θ, sin θ)                            ⊲ RoPE parameters
1: X^(1) ← Embed(t)                                ⊲ X^(1) ∈ R^{B×T×d}
2: for ℓ = 1 to L do
3:     Layer input: X^(ℓ) ∈ R^{B×T×d}
4:     X_n ← RMSNorm(X^(ℓ))
5:     Q ← X_n W_Q                                 [MatMul1]
6:     K ← X_n W_K                                 [MatMul2]
7:     V ← X_n W_V                                 [MatMul3]
8:     (Q, K) ← RoPE(Q, K; cos θ, sin θ)
9:     (K, V) ← RepeatGroups(K, V, H/H_kv)         ⊲ GQA
10:    A_w ← Softmax(QK⊤ / √d)                     [MatMul4]
11:    A_h ← A_w V                                 [MatMul5]
12:    A_o ← A_h W_O                               [MatMul6]
13:    X′ ← X^(ℓ) + A_o                            ⊲ residual add
14:    X′_n ← RMSNorm(X′)
15:    X_act ← SiLU(X′_n W_up)                     [MatMul7]
16:    X_gate ← X′_n W_gate                        [MatMul8]
17:    X_mlp ← (X_act ⊙ X_gate) W_down             [MatMul9]
18:    X^(ℓ+1) ← X′ + X_mlp                        ⊲ residual add
19: logits ← X^(L+1) W_LM
20: p̂ ← Softmax(logits)
21: return p̂
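The listing above can be condensed into a single-head NumPy sketch that makes the nine MatMuls explicit; RoPE and the GQA group repetition are omitted for brevity, and the weight names are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # numerically stable
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def rms_norm(x, eps=1e-6):
    return x / np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)

def decoder_layer(X, Wq, Wk, Wv, Wo, Wup, Wgate, Wdown):
    """Single-head sketch of one decoder layer from the listing above.
    The nine @ products below correspond to MatMul1-9."""
    T, d = X.shape
    Xn = rms_norm(X)
    Q, K, V = Xn @ Wq, Xn @ Wk, Xn @ Wv       # MatMul1-3
    Aw = softmax(Q @ K.T / np.sqrt(d))        # MatMul4
    Ah = Aw @ V                               # MatMul5
    Xr = X + Ah @ Wo                          # MatMul6 + residual add
    Xn2 = rms_norm(Xr)
    silu = lambda z: z / (1.0 + np.exp(-z))
    Xact = silu(Xn2 @ Wup)                    # MatMul7
    Xgate = Xn2 @ Wgate                       # MatMul8
    return Xr + (Xact * Xgate) @ Wdown        # MatMul9 + residual add
```

This grouping (MatMul1-3 and MatMul7-9 weight-stationary, MatMul4-5 activation-activation) is exactly the split that motivates the asymmetric-precision datapaths in the accelerator.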
| Instruction Type | Description | Instruction No. |
|---|---|---|
| Matrix | Controls GEMM and GEMV operations, with or without matrix transposition | 6 |
| Vector | Performs elementwise, reduction operations, and rotation for quantization | 13 |
| Scalar | Performs scalar INT and FP arithmetic | 17 |
| HBM | Handles data transfers between HBM and matrix/vector SRAMs | 3 |
| Control | Defines operation settings, including the HBM physical address | 4 |
| Type | Instruction (Format) | Description |
|---|---|---|
| Matrix (M) | M_MM (opcode, rd, rs1, rs2) | Multiply Matrix[rs2] and Vector[rs1]; accumulate in systolic array. |
| Matrix (M) | M_TMM (opcode, rd, rs1, rs2) | Same as M_MM but with matrix transpose. |
| Matrix (M) | M_MV (opcode, rd, rs1) | Multiply Matrix[rs2] and Vector[rs1]; store in first row of systolic array. |
| Matrix (M) | M_TMV (opcode, rd, rs1) | Same as M_MV but with matrix transpose. |
| Matrix (M) | M_MV_WO (opcode, rd, imm) | Write out first row of systolic array to Vector SRAM[rd+imm]. |
| Matrix (M) | M_MM_WO (opcode, rd, imm) | Write out systolic array results to Vector SRAM[rd+imm]. |
| Vector (V) | V_ADD_VV (opcode, rd, rs1, rs2) | Elementwise vector addition. |
| Vector (V) | V_ADD_VF (opcode, rd, rs1, rs2) | Vector plus broadcasted FP register. |
| Vector (V) | V_SUB_VV (opcode, rd, rs1, rs2) | Elementwise vector subtraction. |
| Vector (V) | V_SUB_VF (opcode, rd, rs1, fp2) | Vector minus broadcasted FP register. |
| Vector (V) | V_MUL_VV (opcode, rd, rs1, rs2) | Elementwise vector multiplication. |
| Vector (V) | V_MUL_VF (opcode, rd, rs1, fp2) | Vector times broadcasted FP register. |
| Vector (V) | V_EXP_V (opcode, rd, rs1) | Elementwise exponentiation. |
| Vector (V) | V_REC_V (opcode, rd, rs1) | Elementwise reciprocal. |
| Vector (V) | V_LD_F (opcode, rd, rs1) | Broadcast FP register value to vector. |
| Vector (V) | V_RED_SUM (opcode, rd, rs1) | Reduction sum of vector into FP register. |
| Vector (V) | V_RED_MAX (opcode, rd, rs1) | Reduction max of vector into FP register. |
| Vector (V) | V_ROTATION_EN (opcode, rd, rs1) | Selectively apply Hadamard rotation. |
| Scalar (S) | S_ADD_INT (opcode, rd, rs1, rs2) | Integer addition. |
| Scalar (S) | S_ADDI_INT (opcode, rd, rs1, imm) | Integer add immediate. |
| Scalar (S) | S_SUB_INT (opcode, rd, rs1, rs2) | Integer subtraction. |
| Scalar (S) | S_LUI_INT (opcode, rd, imm) | Load upper immediate. |
| Scalar (S) | S_MUL_INT (opcode, rd, rs1, rs2) | Integer multiplication. |
| Scalar (S) | S_DIV_INT (opcode, rd, rs1, rs2) | Integer division. |
| Scalar (S) | S_LD_INT (opcode, rd, rs1, imm) | Load from FIX_MEM into integer register. |
| Scalar (S) | S_ST_INT (opcode, rd, rs1, imm) | Store integer register to FIX_MEM. |
| Scalar (S) | S_ADD_FP (opcode, rd, rs1, rs2) | FP addition. |
| Scalar (S) | S_SUB_FP (opcode, rd, rs1, rs2) | FP subtraction. |
| Scalar (S) | S_MUL_FP (opcode, rd, rs1, rs2) | FP multiplication. |
| Scalar (S) | S_EXP_FP (opcode, rd, rs1) | FP exponentiation. |
| Scalar (S) | S_MAX_FP (opcode, rd, rs1, rs2) | FP maximum. |
| Scalar (S) | S_LD/ST_FP (opcode, rd, rs1, imm) | Load/store FP register from/to FP_MEM. |
| Memory (H) | H_PREFETCH_M (opcode, rd, rs1, rs2, rstride, prec) | Prefetch specified rows from HBM to Matrix SRAM. |
| Memory (H) | H_PREFETCH_V (opcode, rd, rs1, rs2) | Prefetch the specified number of rows from HBM to Vector SRAM. |
| Memory (H) | H_STORE_V (opcode, rd, rs1, rs2, stride, prec) | Store VLEN rows from Vector SRAM to HBM. |
| Control (C) | C_SET_ADDR_REG (opcode, rd, rs1, rs2) | Set HBM address register from two FIX regs. |
| Control (C) | C_SET_SCALE_REG (rd, opcode) | Set MX scale offset for quantized data. |
| Control (C) | C_SET_LUT_REG (rd, opcode) | Set MX scale offset for quantized data. |
| Control (C) | C_BREAK (opcode) | Trigger breakpoint exception. |
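To illustrate how these instructions compose, the following hypothetical lowering expresses a numerically stable row softmax — the core of FlashAttention's online normalization — as a sequence of vector instructions from the table, with a minimal emulator for just that subset. Operand names and the tuple encoding are invented for illustration; they are not the real instruction encoding.

```python
import numpy as np

def lower_softmax(x_reg="v0", tmp_reg="v1", m="f0", s="f1"):
    """Hypothetical lowering of softmax(x) onto the vector ISA above."""
    return [
        ("V_RED_MAX", m, x_reg),               # f0 <- max(v0)
        ("V_SUB_VF",  x_reg, x_reg, m),        # v0 <- v0 - f0 (broadcast)
        ("V_EXP_V",   x_reg, x_reg),           # v0 <- exp(v0)
        ("V_RED_SUM", s, x_reg),               # f1 <- sum(v0)
        ("V_LD_F",    tmp_reg, s),             # v1 <- broadcast f1
        ("V_REC_V",   tmp_reg, tmp_reg),       # v1 <- 1 / v1
        ("V_MUL_VV",  x_reg, x_reg, tmp_reg),  # v0 <- v0 * v1
    ]

def run(prog, vregs, fregs):
    """Minimal emulator for the instruction subset used above."""
    for op, *a in prog:
        if op == "V_RED_MAX":  fregs[a[0]] = vregs[a[1]].max()
        elif op == "V_SUB_VF": vregs[a[0]] = vregs[a[1]] - fregs[a[2]]
        elif op == "V_EXP_V":  vregs[a[0]] = np.exp(vregs[a[1]])
        elif op == "V_RED_SUM": fregs[a[0]] = vregs[a[1]].sum()
        elif op == "V_LD_F":   vregs[a[0]] = np.full_like(next(iter(vregs.values())), fregs[a[1]])
        elif op == "V_REC_V":  vregs[a[0]] = 1.0 / vregs[a[1]]
        elif op == "V_MUL_VV": vregs[a[0]] = vregs[a[1]] * vregs[a[2]]
    return vregs, fregs
```

Seven vector instructions suffice because the reductions (V_RED_MAX, V_RED_SUM) land in FP registers that the broadcast forms (V_SUB_VF, V_LD_F) can consume directly.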
| Model | Method | PQ [10] | WG [57] | HS [72] | A-e [14] | A-c [14] | LA [52] | Avg. |
|---|---|---|---|---|---|---|---|---|
| Llama-3-8B | FP16 | 80.74 | 72.77 | 79.06 | 77.82 | 53.33 | 75.63 | 73.22 |
| Llama-3-8B | QuaRot [6] | 75.14 | 65.82 | 72.94 | 68.01 | 43.34 | 65.81 | 65.18 |
| Llama-3-8B | Ours | 79.11 | 71.35 | 76.97 | 74.07 | 50.51 | 74.07 | 71.01 |
| Llama-2-7B | FP16 | 79.11 | 69.06 | 75.99 | 74.58 | 46.25 | 73.9 | 69.82 |
| Llama-2-7B | QuaRot [6] | 76.77 | 63.77 | 72.16 | 69.87 | 40.87 | 70.39 | 65.64 |
| Llama-2-7B | Ours | 78.73 | 68.19 | 74.24 | 72.52 | 43.69 | 73.3 | 68.45 |
| Parameter | Description | Search range |
|---|---|---|
| BLEN | Tile size of block unit | [2, 4, 8, 16, 32] |
| MLEN | Tile size of matrix unit | [2, 4, 8, 16, 32, 64, 128, 256, 512] |
| VLEN | Tile size of vector unit | [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024] |
| HBM_M_Prefetch | Prefetch amount for matrix data from HBM | [2, 4, 8, 16, 32, 64, 128, 256] |
| HBM_V_Prefetch | Prefetch amount for vector data from HBM | [2, 4, 8, 16, 32, 64, 128, 256] |
| HBM_V_Writeback | Writeback amount for vector data to HBM | [2, 4, 8, 16, 32, 64, 128, 256] |
| ACT_WIDTH | Activation precision | MXINT_{2,3,4,8}, MXFP_{E1M2, E2M1, E3M4, E4M3, E5M2} |
| KV_WIDTH | Key/Value precision | MXINT_{2,3,4,8}, MXFP_{E1M2, E2M1, E3M4, E4M3, E5M2} |
| FP_SETTING | Floating-point precision setting | FP_{E3M2, E2M3, E6M5, E5M6, E4M7, E8M5} |
| INT_DATA_WIDTH | Integer data width | [16, 32, 64] |
| Constraint | Description |
|---|---|
| MLEN ≥ BLEN | Matrix tile size must be at least the block tile size |
| MLEN mod BLEN = 0 | Matrix tile size must be divisible by the block tile size |
| MATRIX_SRAM_DEPTH ≥ 2 × MLEN | Matrix SRAM depth must accommodate 2 × MLEN |
| VECTOR_SRAM_DEPTH ≥ (2 × HEAD_DIM + HIDDEN_DIM) / VLEN | Vector SRAM depth must store head and hidden slices |
| INT_SRAM_DEPTH ≥ 16 | Minimum integer SRAM depth |
| FP_SRAM_DEPTH ≥ 3 × MLEN + FP_CONSTANT_NUM | Floating-point SRAM depth constraint |
| (MLEN × ACT_WIDTH + (MLEN / BLEN) × ACT_SCALE_WIDTH) < 1510 | Bandwidth constraint at 1 GHz, 1 TB/s |
| (VLEN × ACT_WIDTH + (VLEN / BLEN) × ACT_SCALE_WIDTH) < 1510 | Bandwidth constraint at 1 GHz, 1 TB/s |
| (MLEN × ACT_WIDTH + (MLEN / BLEN) × ACT_SCALE_WIDTH) < 1510 | Bandwidth constraint at 1 GHz, 1.5 TB/s |
| (VLEN × ACT_WIDTH + (VLEN / BLEN) × ACT_SCALE_WIDTH) < 1510 | Bandwidth constraint at 1 GHz, 1.5 TB/s |
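The constraint table can be read as a feasibility predicate over candidate DSE points. A sketch of such a checker follows; the dict field names (HEAD_DIM, ACT_SCALE_WIDTH, FP_CONSTANT_NUM, ...) are assumed from the table, and only the 1510-bits-per-cycle budget is encoded.

```python
def check_dse_point(p):
    """Return the list of violated constraints for a candidate DSE
    point `p` (a dict of parameters); an empty list means feasible."""
    v = []
    if p["MLEN"] < p["BLEN"]:
        v.append("MLEN >= BLEN")
    if p["MLEN"] % p["BLEN"] != 0:
        v.append("MLEN mod BLEN == 0")
    if p["MATRIX_SRAM_DEPTH"] < 2 * p["MLEN"]:
        v.append("matrix SRAM depth")
    if p["VECTOR_SRAM_DEPTH"] < (2 * p["HEAD_DIM"] + p["HIDDEN_DIM"]) / p["VLEN"]:
        v.append("vector SRAM depth")
    if p["INT_SRAM_DEPTH"] < 16:
        v.append("integer SRAM depth")
    if p["FP_SRAM_DEPTH"] < 3 * p["MLEN"] + p["FP_CONSTANT_NUM"]:
        v.append("FP SRAM depth")

    def bits_per_cycle(n):
        # Data bits plus one shared MX scale per BLEN-element block.
        return n * p["ACT_WIDTH"] + (n // p["BLEN"]) * p["ACT_SCALE_WIDTH"]

    if bits_per_cycle(p["MLEN"]) >= 1510 or bits_per_cycle(p["VLEN"]) >= 1510:
        v.append("HBM bandwidth budget (1510 bits/cycle)")
    return v
```

Pruning with a cheap predicate like this before simulation keeps the DSE loop tractable, since only feasible points are handed to the cycle-emulated simulator.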
| BLEN | MLEN | VLEN | HBM_M Prefetch | HBM_V Prefetch | HBM_V Writeback | ACT WIDTH | KV WIDTH | FP SETTING | INT_DATA WIDTH | Perplexity ↓ | Lat (s) ↓ | Area (µm²) ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 32 | 128 | 32 | 16 | 8 | 256 | MXFP_E4M3 | MXFP_E3M4 | FP_E4M7 | 64 | 6.70 | 0.24 | 49615017.52 |
| 32 | 128 | 64 | 4 | 8 | 256 | MXINT_8 | MXINT_4 | FP_E3M2 | 32 | 6.76 | 0.24 | 51639793.20 |
| 32 | 256 | 128 | 256 | 64 | 128 | MXFP_E1M2 | MXINT_8 | FP_E6M5 | 16 | 12.14 | 0.15 | 99425984.56 |
| 8 | 128 | 32 | 128 | 8 | 256 | MXFP_E3M4 | MXFP_E3M4 | FP_E5M6 | 16 | 6.54 | 1.47 | 26456937.52 |
| 16 | 128 | 16 | 4 | 16 | 64 | MXINT_8 | MXFP_E4M3 | FP_E3M2 | 64 | 6.60 | 0.49 | 31983011.76 |
| System | Freq (GHz) | MLEN | BLEN | VLEN | SRAM (MB) | W. Width | A. Width | KV Width | FP Setting |
|---|---|---|---|---|---|---|---|---|---|
| PLENA | 1 | 2048 | 32 | 2048 | 128 | MXINT4 | MXINT4 | MXINT4 | FP E4M3 |
| MicroScopiQ | 1 | 256 | 256 | 2048 | 128 | MXINT4 | MXINT4 | MXINT4 | FP E4M3 |

References
[nvidia2024blackwell] NVIDIA Corporation. (2024). NVIDIA Blackwell Architecture Technical Brief.
[rouhani2023microscaling] Rouhani, Bita Darvish, Zhao, Ritchie, More, Ankit, Hall, Mathew, Khodamoradi, Alireza, Deng, Summer, Choudhary, Dhruv, Cornea, Marius, Dellinger, Eric, Denolf, Kristof, others. (2023). Microscaling data formats for deep learning. arXiv preprint arXiv:2310.10537.
[li2024svdquant] Li, Muyang, Lin, Yujun, Zhang, Zhekai, Cai, Tianle, Li, Xiuyu, Guo, Junxian, Xie, Enze, Meng, Chenlin, Zhu, Jun-Yan, Han, Song. (2024). Svdquant: Absorbing outliers by low-rank components for 4-bit diffusion models. arXiv preprint arXiv:2411.05007.
[nvidia2020ampere] NVIDIA Corporation. (2020). NVIDIA A100 Tensor Core GPU Architecture.
[boolq] Clark, Christopher, Lee, Kenton, Chang, Ming-Wei, Kwiatkowski, Tom, Collins, Michael, Toutanova, Kristina. (2019). Boolq: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044.
[hellaswag] Zellers, Rowan, Holtzman, Ari, Bisk, Yonatan, Farhadi, Ali, Choi, Yejin. (2019). Hellaswag: Can a machine really finish your sentence?. arXiv preprint arXiv:1905.07830.
[piqa] Bisk, Yonatan, Zellers, Rowan, Gao, Jianfeng, Choi, Yejin, others. (2020). Piqa: Reasoning about physical commonsense in natural language. Proceedings of the AAAI conference on artificial intelligence.
[arcc] Clark, Peter, Cowhey, Isaac, Etzioni, Oren, Khot, Tushar, Sabharwal, Ashish, Schoenick, Carissa, Tafjord, Oyvind. (2018). Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457.
[mmlu] Hendrycks, Dan, Burns, Collin, Basart, Steven, Zou, Andy, Mazeika, Mantas, Song, Dawn, Steinhardt, Jacob. (2020). Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.
[paperno2016lambada] Paperno, Denis, Kruszewski, Germán, Lazaridou, Angeliki, others. (2016). The LAMBADA dataset: Word prediction requiring a broad discourse context. arXiv preprint arXiv:1606.06031.
[sakaguchi2021winogrande] Sakaguchi, Keisuke, Bras, Ronan Le, Bhagavatula, Chandra, Choi, Yejin. (2021). Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM.
[hu2025m] Hu, Weiming, Zhang, Haoyan, Guo, Cong, Feng, Yu, Guan, Renyang, Hua, Zhendong, Liu, Zihan, Guan, Yue, Guo, Minyi, Leng, Jingwen. (2025). M-ANT: Efficient Low-bit Group Quantization for LLMs via Mathematically Adaptive Numerical Type. 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[nvidia2023h100] NVIDIA Corporation. (2023). NVIDIA H100 Tensor Core GPU Architecture.
[ptx8.8] NVIDIA Corporation. (2025). Parallel Thread Execution ISA Version 8.8.
[hooper2025fgmp] Hooper, Coleman, Sakr, Charbel, Keller, Ben, Venkatesan, Rangharajan, Keutzer, Kurt, Shao, Sophia, Khailany, Brucek. (2025). FGMP: Fine-Grained Mixed-Precision Weight and Activation Quantization for Hardware-Accelerated LLM Inference. arXiv preprint arXiv:2504.14152.
[AI_and_Mem_Wall] Gholami, Amir, Yao, Zhewei, Kim, Sehoon, Hooper, Coleman, Mahoney, Michael W., Keutzer, Kurt. (2024). AI and Memory Wall. IEEE Micro. doi:10.1109/MM.2024.3373763.
[guo2022ant] Guo, Cong, Zhang, Chen, Leng, Jingwen, Liu, Zihan, Yang, Fan, Liu, Yunxin, Guo, Minyi, Zhu, Yuhao. (2022). Ant: Exploiting adaptive numerical data type for low-bit deep neural network quantization. 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO).
[Transistive] Cong Guo, Chiyue Wei, Jiaming Tang, Bowen Duan, Song Han, Hai Li, Yiran Chen. (2025). Transitive Array: An Efficient GEMM Accelerator with Result Reuse.
[zadeh2020gobo] Zadeh, Ali Hadi, Edo, Isak, Awad, Omar Mohamed, Moshovos, Andreas. (2020). Gobo: Quantizing attention-based nlp models for low latency and energy efficient inference. 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[guo2023olive] Guo, Cong, Tang, Jiaming, Hu, Weiming, Leng, Jingwen, Zhang, Chen, Yang, Fan, Liu, Yunxin, Guo, Minyi, Zhu, Yuhao. (2023). Olive: Accelerating large language models via hardware-friendly outlier-victim pair quantization. Proceedings of the 50th Annual International Symposium on Computer Architecture.
[frantar2022gptq] Frantar, Elias, Ashkboos, Saleh, Hoefler, Torsten, Alistarh, Dan. (2022). Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323.
[lin2024awq] Lin, Ji, Tang, Jiaming, Tang, Haotian, Yang, Shang, Chen, Wei-Ming, Wang, Wei-Chen, Xiao, Guangxuan, Dang, Xingyu, Gan, Chuang, Han, Song. (2024). AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration. Proceedings of Machine Learning and Systems.
[zhao2024atom] Zhao, Yilong, Lin, Chien-Yu, Zhu, Kan, Ye, Zihao, Chen, Lequn, Zheng, Size, Ceze, Luis, Krishnamurthy, Arvind, Chen, Tianqi, Kasikci, Baris. (2024). Atom: Low-bit quantization for efficient and accurate llm serving. Proceedings of Machine Learning and Systems.
[xiao2023smoothquant] Xiao, Guangxuan, Lin, Ji, Seznec, Mickael, Wu, Hao, Demouth, Julien, Han, Song. (2023). Smoothquant: Accurate and efficient post-training quantization for large language models. International Conference on Machine Learning.
[shao2023omniquant] Shao, Wenqi, Chen, Mengzhao, Zhang, Zhaoyang, Xu, Peng, Zhao, Lirui, Li, Zhiqian, Zhang, Kaipeng, Gao, Peng, Qiao, Yu, Luo, Ping. (2023). Omniquant: Omnidirectionally calibrated quantization for large language models. arXiv:2308.13137.
[Microscopiq] Ramachandran, Akshat, Kundu, Souvik, Krishna, Tushar. (2025). Microscopiq: Accelerating foundational models through outlier-aware microscaling quantization. Proceedings of the 52nd Annual International Symposium on Computer Architecture.
[paszke2019pytorch] Paszke, Adam, Gross, Sam, Massa, Francisco, Lerer, Adam, Bradbury, James, Chanan, Gregory, Killeen, Trevor, Lin, Zeming, Gimelshein, Natalia, Antiga, Luca, others. (2019). Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems.
[llama2] Touvron, Hugo, Martin, Louis, Stone, Kevin, Albert, Peter, Almahairi, Amjad, Babaei, Yasmine, Bashlykov, Nikolay, Batra, Soumya, Bhargava, Prajjwal, Bhosale, Shruti, others. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
[wikitext2] Merity, Stephen, Xiong, Caiming, Bradbury, James, Socher, Richard. (2016). Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.
[llama3] Meta, AI. (2024). Introducing meta llama 3: The most capable openly available llm to date. Meta AI.
[PICACHU] Qin, Jiajun, Xia, Tianhua, Tan, Cheng, Zhang, Jeff, Zhang, Sai Qian. (2025). PICACHU: Plug-In CGRA Handling Upcoming Nonlinear Operations in LLMs. Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. doi:10.1145/3676641.3716013.
[FlightLLM] Zeng, Shulin, Liu, Jun, Dai, Guohao, Yang, Xinhao, Fu, Tianyu, Wang, Hongyi, Ma, Wenheng, Sun, Hanbo, Li, Shiyao, Huang, Zixiao, Dai, Yadong, Li, Jintao, Wang, Zehao, Zhang, Ruoyu, Wen, Kairui, Ning, Xuefei, Wang, Yu. (2024). FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs. Proceedings of the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays. doi:10.1145/3626202.3637562.
[attention] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. (2023). Attention Is All You Need.
[wei2022emergentabilitieslargelanguage] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, William Fedus. (2022). Emergent Abilities of Large Language Models.
[kojima2023largelanguagemodelszeroshot] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, Yusuke Iwasawa. (2023). Large Language Models are Zero-Shot Reasoners.
[Efficient_LLM_Inference] Michael Davies, Neal Crago, Karthikeyan Sankaralingam, Christos Kozyrakis. (2025). Efficient LLM Inference: Bandwidth, Compute, Synchronization, and Capacity are all you need.
[GPT3] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei. (2020). Language Models are Few-Shot Learners.
[llama] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample. (2023). LLaMA: Open and Efficient Foundation Language Models.
[LLMCompass] Zhang, Hengrui, Ning, August, Prabhakar, Rohan, Wentzlaff, David. (2024). LLMCompass: Enabling Efficient Hardware Design for Large Language Model Inference. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA). doi:10.1109/ISCA59077.2024.00082.
[deepseekv3] DeepSeek-AI, others. (2025). DeepSeek-V3 Technical Report. arXiv preprint arXiv:2412.19437.
[MoE] Mikel Artetxe, Shruti Bhosale, Naman Goyal, Todor Mihaylov, Myle Ott, Sam Shleifer, Xi Victoria Lin, Jingfei Du, Srinivasan Iyer, Ramakanth Pasunuru, Giri Anantharaman, Xian Li, Shuohui Chen, Halil Akin, Mandeep Baines, Louis Martin, Xing Zhou, Punit Singh Koura, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Mona Diab, Zornitsa Kozareva, Ves Stoyanov. (2022). Efficient Large Scale Language Modeling with Mixtures of Experts.
[MLA] Fanxu Meng, Pingzhi Tang, Xiaojuan Tang, Zengwei Yao, Xing Sun, Muhan Zhang. (2025). TransMLA: Multi-Head Latent Attention Is All You Need.
[FlashDecoding] Ke Hong, Guohao Dai, Jiaming Xu, Qiuli Mao, Xiuhong Li, Jun Liu, Kangdi Chen, Yuhan Dong, Yu Wang. (2024). FlashDecoding++: Faster Large Language Model Inference on GPUs.
[mixed-precision_decoding] Hao Mark Chen, Fuwen Tan, Alexandros Kouris, Royson Lee, Hongxiang Fan, Stylianos I. Venieris. (2025). Progressive Mixed-Precision Decoding for Efficient LLM Inference.
[Roofline_Model] Williams, Samuel, Waterman, Andrew, Patterson, David. (2009). Roofline: an insightful visual performance model for multicore architectures. Commun. ACM. doi:10.1145/1498765.1498785.
[TPU_V3_Systolic_Array] Google. (2025). System Architecture: TPU VM.
[microscaling] Bita Darvish Rouhani, Ritchie Zhao, Ankit More, Mathew Hall, Alireza Khodamoradi, Summer Deng, Dhruv Choudhary, Marius Cornea, Eric Dellinger, Kristof Denolf, Stosic Dusan, Venmugil Elango, Maximilian Golub, Alexander Heinecke, Phil James-Roxby, Dharmesh Jani, Gaurav Kolhe, Martin Langhammer, Ada Li, Levi Melnick, Maral Mesmakhosroshahi, Andres Rodriguez, Michael Schulte, Rasoul Shafipour, Lei Shao, Michael Siu, Pradeep Dubey, Paulius Micikevicius, Maxim Naumov, Colin Verrilli, Ralph Wittig, Doug Burger, Eric Chung. (2023). Microscaling Data Formats for Deep Learning.
[FlashAttention] Tri Dao. (2023). FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning.
[QuaRot] Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L. Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, James Hensman. (2024). QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs.
[hadamard_transform] Hongyi Pan, Diaa Dabawi, Ahmet Enis Cetin. (2021). Fast Walsh-Hadamard Transform and Smooth-Thresholding Based Binary Layers in Deep Neural Networks.
[kivi] Liu, Zirui, others. (2023). KIVI: Plug-and-play 2bit KV Cache Quantization with Streaming Asymmetric Quantization. doi:10.13140/RG.2.2.28167.37282.
[deepspeedinference] Reza Yazdani Aminabadi, Samyam Rajbhandari, Minjia Zhang, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Jeff Rasley, Shaden Smith, Olatunji Ruwase, Yuxiong He. (2022). DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale.
[chatgpt] OpenAI. (2024). ChatGPT.
[onnx] Bai, Junjie, Lu, Fang, Zhang, Ke. (2019). ONNX: Open Neural Network Exchange. GitHub repository.
[grammar_checking] Tao Fang, Derek F. Wong, Lusheng Zhang, Keyan Jin, Qiang Zhang, Tianjiao Li, Jinlong Hou, Lidia S. Chao. (2024). LLMCL-GEC: Advancing Grammatical Error Correction with LLM-Driven Curriculum Learning.
[code_gen] Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, Sunghun Kim. (2024). A Survey on Large Language Models for Code Generation.
[agentic_workflow] Boye Niu, Yiliao Song, Kai Lian, Yifan Shen, Yu Yao, Kun Zhang, Tongliang Liu. (2025). Flow: Modularized Agentic Workflow Automation.
[gpt2] Radford, Alec, Wu, Jeffrey, Child, Rewon, Luan, David, Amodei, Dario, Sutskever, Ilya. (2019). Language models are unsupervised multitask learners. OpenAI blog.
[gpt4] OpenAI, others. (2024). GPT-4 Technical Report. arXiv preprint arXiv:2303.08774.
[FIGNA] Jang, Jaeyong, others. (2024). FIGNA: Integer Unit-Based Accelerator Design for FP-INT GEMM Preserving Numerical Accuracy. 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA).
[SystolicAttention] Lin, Jiawei, Chen, Guokai, Li, Yuanlong, Bourgeat, Thomas. (2025). SystolicAttention: Fusing FlashAttention within a Single Systolic Array.
[gpt4_turbo] OpenAI. (2024). GPT-4 Turbo Model Documentation.
[claude] Anthropic. (2025). Claude Opus 4.1 Documentation.
[OmniParser] Yadong Lu, Jianwei Yang, Yelong Shen, Ahmed Awadallah. (2024). OmniParser for Pure Vision Based GUI Agent.
[browser_use2024] Müller, Magnus, Žunič, Gregor. (2024). Browser Use: Enable AI to control your browser.
[command_agent] Mayank Agarwal, Jorge J. Barroso, Tathagata Chakraborti, Eli M. Dow, Kshitij Fadnis, Borja Godoy, Madhavan Pallan, Kartik Talamadupula. (2020). Project CLAI: Instrumenting the Command Line as a New Environment for AI Agents.
[longwriter] Yushi Bai, Jiajie Zhang, Xin Lv, Linzhi Zheng, Siqi Zhu, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li. (2024). LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs.
[multi-swe] Daoguang Zan, Zhirong Huang, Wei Liu, Hanwu Chen, Linhao Zhang, Shulin Xin, Lu Chen, Qi Liu, Xiaojian Zhong, Aoyan Li, Siyao Liu, Yongsheng Xiao, Liangqiang Chen, Yuyu Zhang, Jing Su, Tianyu Liu, Rui Long, Kai Shen, Liang Xiang. (2025). Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving.
[LongCodeBench] Stefano Rando, Luca Romani, Alessio Sampieri, Luca Franco, John Yang, Yuta Kyuragi, Fabio Galasso, Tatsunori Hashimoto. (2025). LongCodeBench: Evaluating Coding LLMs at 1M Context Windows.
[code_gen_long_context] Yoichi Ishibashi, Yoshimasa Nishimura. (2024). Self-Organized Agents: A LLM Multi-Agent Framework toward Ultra Large-Scale Code Generation and Optimization.
[webdom_long_context] Dongjun Lee, Juyong Lee, Kyuyoung Kim, Jihoon Tack, Jinwoo Shin, Yee Whye Teh, Kimin Lee. (2025). Learning to Contextualize Web Pages for Enhanced Decision Making by LLM Agents.
[llama4_maverick] Meta AI. (2025). The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation.
[WebVoyager] Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, Dong Yu. (2024). WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models.
[BrowserGym] Thibault Le Sellier De Chezelles, Maxime Gasse, Alexandre Drouin, Massimo Caccia, Léo Boisvert, Megh Thakkar, Tom Marty, Rim Assouel, Sahar Omidi Shayegan, Lawrence Keunho Jang, Xing Han Lù, Ori Yoran, Dehan Kong, Frank F. Xu, Siva Reddy, Quentin Cappart, Graham Neubig, Ruslan Salakhutdinov, Nicolas Chapados, Alexandre Lacoste. (2025). The BrowserGym Ecosystem for Web Agent Research.
[workarena] Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H. Laradji, Manuel Del Verme, Tom Marty, David Vazquez, Nicolas Chapados, Alexandre Lacoste. (2024). WorkArena: How Capable are Web Agents at Solving Common Knowledge Work Tasks? Proceedings of the 41st International Conference on Machine Learning.
[TileLink] SiFive, Inc. (2020). TileLink Specification.
[Ramulator] Haocong Luo, Yahya Can Tuğrul, F. Nisa Bostancı, Ataberk Olgun, A. Giray Yağlıkçı, Onur Mutlu. (2023). Ramulator 2.0: A Modern, Modular, and Extensible DRAM Simulator.
[openroad] L. T. Clark, V. Vashishtha, L. Shifren, A. Gujja, S. Sinha, B. Cline, C. Ramamurthy, G. Yeric. (2016). ASAP: A 7-nm finFET predictive process design kit. Microelectronics Journal. doi:10.1016/j.mejo.2016.04.006.
[DeepScale] Satyabrata Sarangi, Bevan Baas. (2021). DeepScaleTool: A Tool for the Accurate Estimation of Technology Scaling in the Deep-Submicron Era. 2021 IEEE International Symposium on Circuits and Systems (ISCAS). doi:10.1109/ISCAS51556.2021.9401196.
[shahriari2015taking] Shahriari, Bobak, Swersky, Kevin, Wang, Ziyu, Adams, Ryan P, De Freitas, Nando. (2015). Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE.
[wang2023recent] Wang, Xilu, Jin, Yaochu, Schmitt, Sebastian, Olhofer, Markus. (2023). Recent advances in Bayesian optimization. ACM Computing Surveys.
[srinivas2009gaussian] Srinivas, Niranjan, Krause, Andreas, Kakade, Sham M, Seeger, Matthias. (2009). Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995.
[wang2016bayesian] Wang, Ziyu, Hutter, Frank, Zoghi, Masrour, Matheson, David, De Freitas, Nando. (2016). Bayesian optimization in a billion dimensions via random embeddings. Journal of Artificial Intelligence Research.
[liu2020gaussian] Liu, Haitao, Ong, Yew-Soon, Shen, Xiaobo, Cai, Jianfei. (2020). When Gaussian process meets big data: A review of scalable GPs. IEEE transactions on neural networks and learning systems.
[balandat2020botorch] Balandat, Maximilian, Karrer, Brian, Jiang, Daniel, Daulton, Samuel, Letham, Ben, Wilson, Andrew G, Bakshy, Eytan. (2020). BoTorch: A framework for efficient Monte-Carlo Bayesian optimization. Advances in neural information processing systems.
[deshwal2021bayesian] Deshwal, Aryan, Belakaria, Syrine, Doppa, Janardhan Rao. (2021). Bayesian optimization over hybrid spaces. International Conference on Machine Learning.
[daulton2022bayesian] Daulton, Samuel, Wan, Xingchen, Eriksson, David, Balandat, Maximilian, Osborne, Michael A, Bakshy, Eytan. (2022). Bayesian optimization over discrete and mixed spaces via probabilistic reparameterization. Advances in Neural Information Processing Systems.
[akiba2019optuna] Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, Masanori Koyama. (2019). Optuna: A Next-generation Hyperparameter Optimization Framework.
[knowles2005eas] Knowles, Joshua. (2005). A summary-attainment-surface plotting method for visualizing the performance of stochastic multiobjective optimizers. 5th International Conference on Intelligent Systems Design and Applications (ISDA'05).
[fonseca2011eas] Fonseca, Carlos M, Guerreiro, Andreia P, López-Ibánez, Manuel, Paquete, Luís. (2011). On the computation of the empirical attainment function. International Conference on Evolutionary Multi-criterion Optimization.
[watanabe2023eas] Watanabe, Shuhei. (2023). Python tool for visualizing variability of Pareto fronts over multiple runs. arXiv preprint arXiv:2305.08852.
[eval-harness] Gao, Leo, Tow, Jonathan, Abbasi, Baber, Biderman, Stella, Black, Sid, DiPofi, Anthony, Foster, Charles, Golding, Laurence, Hsu, Jeffrey, Le Noac'h, Alain, Li, Haonan, McDonell, Kyle, Muennighoff, Niklas, Ociepa, Chris, Phang, Jason, Reynolds, Laria, Schoelkopf, Hailey, Skowron, Aviya, Sutawika, Lintang, Tang, Eric, Thite, Anish, Wang, Ben, Wang, Kevin, Zou, Andy. The Language Model Evaluation Harness. doi:10.5281/zenodo.12608602.
[bib2] Agarwal et al. (2020) Mayank Agarwal, Jorge J. Barroso, Tathagata Chakraborti, Eli M. Dow, Kshitij Fadnis, Borja Godoy, Madhavan Pallan, and Kartik Talamadupula. 2020. Project CLAI: Instrumenting the Command Line as a New Environment for AI Agents. arXiv:2002.00762 [cs.HC] https://arxiv.org/abs/2002.00762
[bib3] Meta AI. 2025. The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation. https://ai.meta.com/blog/llama-4-multimodal-intelligence/ Accessed: 2025-08-16.
[bib4] Akiba et al. (2019) Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. 2019. Optuna: A Next-generation Hyperparameter Optimization Framework. arXiv:1907.10902
[bib5] Aminabadi et al. (2022) Reza Yazdani Aminabadi, Samyam Rajbhandari, Minjia Zhang, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Jeff Rasley, Shaden Smith, Olatunji Ruwase, and Yuxiong He. 2022. DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale. arXiv:2207.00032 [cs.LG] https://arxiv.org/abs/2207.00032
[bib6] Artetxe et al. (2022) Mikel Artetxe, Shruti Bhosale, Naman Goyal, Todor Mihaylov, Myle Ott, Sam Shleifer, Xi Victoria Lin, Jingfei Du, Srinivasan Iyer, Ramakanth Pasunuru, Giri Anantharaman, Xian Li, Shuohui Chen, Halil Akin, Mandeep Baines, Louis Martin, Xing Zhou, Punit Singh Koura, Brian O’Horo, Jeff Wang, Luke Zettlemoyer, Mona Diab, Zornitsa Kozareva, and Ves Stoyanov. 2022. Efficient Large Scale Language Modeling with Mixtures of Experts. arXiv:2112.10684 [cs.CL] https://arxiv.org/abs/2112.10684
[bib7] Ashkboos et al. (2024) Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L. Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. 2024. QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs. arXiv:2404.00456 [cs.LG] https://arxiv.org/abs/2404.00456
[bib8] Bai et al. (2019) Junjie Bai, Fang Lu, and Ke Zhang. 2019. ONNX: Open Neural Network Exchange. https://github.com/onnx/onnx. GitHub repository (2019).
[bib9] Bai et al. (2024) Yushi Bai, Jiajie Zhang, Xin Lv, Linzhi Zheng, Siqi Zhu, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. 2024. LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs. arXiv:2408.07055 [cs.CL] https://arxiv.org/abs/2408.07055
[bib10] Balandat et al. (2020) Maximilian Balandat, Brian Karrer, Daniel Jiang, Samuel Daulton, Ben Letham, Andrew G Wilson, and Eytan Bakshy. 2020. BoTorch: A framework for efficient Monte-Carlo Bayesian optimization. Advances in neural information processing systems 33 (2020), 21524–21538.
[bib11] Bisk et al. (2020) Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. 2020. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, Vol. 34. 7432–7439.
[bib12] Brown et al. (2020) Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. arXiv:2005.14165 [cs.CL] https://arxiv.org/abs/2005.14165
[bib13] Chezelles et al. (2025) Thibault Le Sellier De Chezelles, Maxime Gasse, Alexandre Drouin, Massimo Caccia, Léo Boisvert, Megh Thakkar, Tom Marty, Rim Assouel, Sahar Omidi Shayegan, Lawrence Keunho Jang, Xing Han Lù, Ori Yoran, Dehan Kong, Frank F. Xu, Siva Reddy, Quentin Cappart, Graham Neubig, Ruslan Salakhutdinov, Nicolas Chapados, and Alexandre Lacoste. 2025. The BrowserGym Ecosystem for Web Agent Research. arXiv:2412.05467 [cs.LG] https://arxiv.org/abs/2412.05467
[bib14] Clark et al. (2016) L. T. Clark, V. Vashishtha, L. Shifren, A. Gujja, S. Sinha, B. Cline, C. Ramamurthy, and G. Yeric. 2016. ASAP: A 7-nm finFET predictive process design kit. Microelectronics Journal 53 (July 2016), 105–115. doi:10.1016/j.mejo.2016.04.006
[bib15] Clark et al. (2018) Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457 (2018).
[bib16] NVIDIA Corporation. 2024. NVIDIA Blackwell Architecture Technical Brief. Technical Report. NVIDIA Corporation. https://resources.nvidia.com/en-us-blackwell-architecture
[bib17] Tri Dao. 2023. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. arXiv:2307.08691 [cs.LG] https://arxiv.org/abs/2307.08691
[bib18] Daulton et al. (2022) Samuel Daulton, Xingchen Wan, David Eriksson, Maximilian Balandat, Michael A Osborne, and Eytan Bakshy. 2022. Bayesian optimization over discrete and mixed spaces via probabilistic reparameterization. Advances in Neural Information Processing Systems 35 (2022), 12760–12774.
[bib19] Davies et al. (2025) Michael Davies, Neal Crago, Karthikeyan Sankaralingam, and Christos Kozyrakis. 2025. Efficient LLM Inference: Bandwidth, Compute, Synchronization, and Capacity are all you need. arXiv:2507.14397 [cs.AR] https://arxiv.org/abs/2507.14397
[bib20] Deshwal et al. (2021) Aryan Deshwal, Syrine Belakaria, and Janardhan Rao Doppa. 2021. Bayesian optimization over hybrid spaces. In International Conference on Machine Learning. PMLR, 2632–2643.
[bib21] Drouin et al. (2024) Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H. Laradji, Manuel Del Verme, Tom Marty, David Vazquez, Nicolas Chapados, and Alexandre Lacoste. 2024. WorkArena: How Capable are Web Agents at Solving Common Knowledge Work Tasks?. In Proceedings of the 41st International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 235), Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp (Eds.). PMLR, 11642–11662. https://proceedings.mlr.press/v235/drouin24a.html
[bib22] Fonseca et al. (2011) Carlos M Fonseca, Andreia P Guerreiro, Manuel López-Ibánez, and Luís Paquete. 2011. On the computation of the empirical attainment function. In International Conference on Evolutionary Multi-criterion Optimization. Springer, 106–120.
[bib23] Frantar et al. (2022) Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2022. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323 (2022).
[bib24] Gholami et al. (2024) Amir Gholami, Zhewei Yao, Sehoon Kim, Coleman Hooper, Michael W. Mahoney, and Kurt Keutzer. 2024. AI and Memory Wall. IEEE Micro 44, 3 (May 2024), 33–39. doi:10.1109/MM.2024.3373763
[bib25] Google. 2025. System Architecture: TPU VM. Technical Report. Google Cloud. Last updated August 1, 2025.
[bib26] Guo et al. (2023) Cong Guo, Jiaming Tang, Weiming Hu, Jingwen Leng, Chen Zhang, Fan Yang, Yunxin Liu, Minyi Guo, and Yuhao Zhu. 2023. Olive: Accelerating large language models via hardware-friendly outlier-victim pair quantization. In Proceedings of the 50th Annual International Symposium on Computer Architecture. 1–15.
[bib27] Guo et al. (2025) Cong Guo, Chiyue Wei, Jiaming Tang, Bowen Duan, Song Han, Hai Li, and Yiran Chen. 2025. Transitive Array: An Efficient GEMM Accelerator with Result Reuse. arXiv:2504.16339 [cs.AR] https://arxiv.org/abs/2504.16339
[bib28] He et al. (2024) Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. 2024. WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models. arXiv:2401.13919 [cs.CL] https://arxiv.org/abs/2401.13919
[bib29] Hong et al. (2024) Ke Hong, Guohao Dai, Jiaming Xu, Qiuli Mao, Xiuhong Li, Jun Liu, Kangdi Chen, Yuhan Dong, and Yu Wang. 2024. FlashDecoding++: Faster Large Language Model Inference on GPUs. arXiv:2311.01282 [cs.LG] https://arxiv.org/abs/2311.01282
[bib30] Hu et al. (2025) Weiming Hu, Haoyan Zhang, Cong Guo, Yu Feng, Renyang Guan, Zhendong Hua, Zihan Liu, Yue Guan, Minyi Guo, and Jingwen Leng. 2025. M-ANT: Efficient Low-bit Group Quantization for LLMs via Mathematically Adaptive Numerical Type. In 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 1112–1126.
[bib31] Yoichi Ishibashi and Yoshimasa Nishimura. 2024. Self-Organized Agents: A LLM Multi-Agent Framework toward Ultra Large-Scale Code Generation and Optimization. arXiv:2404.02183 [cs.SE] https://arxiv.org/abs/2404.02183
[bib32] Jang et al. (2024) Jaeyong Jang, Yulhwa Kim, Juheun Lee, and Jae-Joon Kim. 2024. FIGNA: Integer Unit-Based Accelerator Design for FP-INT GEMM Preserving Numerical Accuracy. In 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA). 760–773. doi:10.1109/HPCA57654.2024.00064
[bib33] Jiang et al. (2024) Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. 2024. A Survey on Large Language Models for Code Generation. arXiv:2406.00515 [cs.CL] https://arxiv.org/abs/2406.00515
[bib34] Joshua Knowles. 2005. A summary-attainment-surface plotting method for visualizing the performance of stochastic multiobjective optimizers. In 5th International Conference on Intelligent Systems Design and Applications (ISDA’05). IEEE, 552–557.
[bib35] Kojima et al. (2023) Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2023. Large Language Models are Zero-Shot Reasoners. arXiv:2205.11916 [cs.CL] https://arxiv.org/abs/2205.11916
[bib36] Lee et al. (2025) Dongjun Lee, Juyong Lee, Kyuyoung Kim, Jihoon Tack, Jinwoo Shin, Yee Whye Teh, and Kimin Lee. 2025. Learning to Contextualize Web Pages for Enhanced Decision Making by LLM Agents. arXiv:2503.10689 [cs.CL] https://arxiv.org/abs/2503.10689
[bib37] Lee et al. (2024) Jungi Lee, Wonbeom Lee, and Jaewoong Sim. 2024. Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization. In 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA). 1048–1062. doi:10.1109/ISCA59077.2024.00080
[bib38] Li et al. (2024) Muyang Li, Yujun Lin, Zhekai Zhang, Tianle Cai, Xiuyu Li, Junxian Guo, Enze Xie, Chenlin Meng, Jun-Yan Zhu, and Song Han. 2024. Svdquant: Absorbing outliers by low-rank components for 4-bit diffusion models. arXiv preprint arXiv:2411.05007 (2024).
[bib39] Lin et al. (2025) Jiawei Lin, Guokai Chen, Yuanlong Li, and Thomas Bourgeat. 2025. SystolicAttention: Fusing FlashAttention within a Single Systolic Array. arXiv:2507.11331 [cs.AR] https://arxiv.org/abs/2507.11331
[bib40] Lin et al. (2024) Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. 2024. AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration. Proceedings of Machine Learning and Systems 6 (2024), 87–100.
[bib41] Liu et al. (2020) Haitao Liu, Yew-Soon Ong, Xiaobo Shen, and Jianfei Cai. 2020. When Gaussian process meets big data: A review of scalable GPs. IEEE transactions on neural networks and learning systems 31, 11 (2020), 4405–4423.
[bib42] Lu et al. (2024) Yadong Lu, Jianwei Yang, Yelong Shen, and Ahmed Awadallah. 2024. OmniParser for Pure Vision Based GUI Agent. arXiv:2408.00203 [cs.CV] https://arxiv.org/abs/2408.00203
[bib43] Luo et al. (2023) Haocong Luo, Yahya Can Tuğrul, F. Nisa Bostancı, Ataberk Olgun, A. Giray Yağlıkçı, and Onur Mutlu. 2023. Ramulator 2.0: A Modern, Modular, and Extensible DRAM Simulator. arXiv:2308.11030 [cs.AR] https://arxiv.org/abs/2308.11030
[bib44] Meng et al. (2025) Fanxu Meng, Pingzhi Tang, Xiaojuan Tang, Zengwei Yao, Xing Sun, and Muhan Zhang. 2025. TransMLA: Multi-Head Latent Attention Is All You Need. arXiv:2502.07864 [cs.LG] https://arxiv.org/abs/2502.07864
[bib45] Merity et al. (2016) Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843 (2016).
[bib46] Meta AI. 2024. Introducing Meta Llama 3: The most capable openly available LLM to date. Meta AI (2024).
[bib47] Magnus Müller and Gregor Žunič. 2024. Browser Use: Enable AI to control your browser. https://github.com/browser-use/browser-use
[bib48] Nagel et al. (2021) Markus Nagel, Marios Fournarakis, Rana Ali Amjad, Yelysei Bondarenko, Mart Van Baalen, and Tijmen Blankevoort. 2021. A white paper on neural network quantization. arXiv preprint arXiv:2106.08295 (2021).
[bib49] Niu et al. (2025) Boye Niu, Yiliao Song, Kai Lian, Yifan Shen, Yu Yao, Kun Zhang, and Tongliang Liu. 2025. Flow: Modularized Agentic Workflow Automation. arXiv:2501.07834 [cs.AI] https://arxiv.org/abs/2501.07834
[bib50] OpenAI. 2024. ChatGPT. https://openai.com/index/chatgpt/. Accessed: 2024-08-04.
[bib51] OpenAI et al. (2024) OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner, Lenny Bogdonoff, Oleg Boiko, Madelaine Boyd, Anna-Luisa Brakman, Greg Brockman, Tim Brooks, Miles Brundage, Kevin Button, Trevor Cai, Rosie Campbell, Andrew Cann, Brittany Carey, Chelsea Carlson, Rory Carmichael, Brooke Chan, Che Chang, Fotis Chantzis, Derek Chen, Sully Chen, Ruby Chen, Jason Chen, Mark Chen, Ben Chess, Chester Cho, Casey Chu, Hyung Won Chung, Dave Cummings, Jeremiah Currier, Yunxing Dai, Cory Decareaux, Thomas Degry, Noah Deutsch, Damien Deville, Arka Dhar, David Dohan, Steve Dowling, Sheila Dunning, Adrien Ecoffet, Atty Eleti, Tyna Eloundou, David Farhi, Liam Fedus, Niko Felix, Simón Posada Fishman, Juston Forte, Isabella Fulford, Leo Gao, Elie Georges, Christian Gibson, Vik Goel, Tarun Gogineni, Gabriel Goh, Rapha Gontijo-Lopes, Jonathan Gordon, Morgan Grafstein, Scott Gray, Ryan Greene, Joshua Gross, Shixiang Shane Gu, Yufei Guo, Chris Hallacy, Jesse Han, Jeff Harris, Yuchen He, Mike Heaton, Johannes Heidecke, Chris Hesse, Alan Hickey, Wade Hickey, Peter Hoeschele, Brandon Houghton, Kenny Hsu, Shengli Hu, Xin Hu, Joost Huizinga, Shantanu Jain, Shawn Jain, Joanne Jang, Angela Jiang, Roger Jiang, Haozhun Jin, Denny Jin, Shino Jomoto, Billie Jonn, Heewoo Jun, Tomer Kaftan, Łukasz Kaiser, Ali Kamali, Ingmar Kanitscheider, Nitish Shirish Keskar, Tabarak Khan, Logan Kilpatrick, Jong Wook Kim, Christina Kim, Yongjik Kim, Jan Hendrik Kirchner, Jamie Kiros, Matt Knight, Daniel Kokotajlo, Łukasz Kondraciuk, Andrew Kondrich, Aris Konstantinidis, Kyle Kosic, Gretchen Krueger, Vishal Kuo, Michael Lampe, Ikai Lan, Teddy Lee, Jan Leike, Jade Leung, Daniel Levy, Chak Ming Li, 
Rachel Lim, Molly Lin, Stephanie Lin, Mateusz Litwin, Theresa Lopez, Ryan Lowe, Patricia Lue, Anna Makanju, Kim Malfacini, Sam Manning, Todor Markov, Yaniv Markovski, Bianca Martin, Katie Mayer, Andrew Mayne, Bob McGrew, Scott Mayer McKinney, Christine McLeavey, Paul McMillan, Jake McNeil, David Medina, Aalok Mehta, Jacob Menick, Luke Metz, Andrey Mishchenko, Pamela Mishkin, Vinnie Monaco, Evan Morikawa, Daniel Mossing, Tong Mu, Mira Murati, Oleg Murk, David Mély, Ashvin Nair, Reiichiro Nakano, Rajeev Nayak, Arvind Neelakantan, Richard Ngo, Hyeonwoo Noh, Long Ouyang, Cullen O’Keefe, Jakub Pachocki, Alex Paino, Joe Palermo, Ashley Pantuliano, Giambattista Parascandolo, Joel Parish, Emy Parparita, Alex Passos, Mikhail Pavlov, Andrew Peng, Adam Perelman, Filipe de Avila Belbute Peres, Michael Petrov, Henrique Ponde de Oliveira Pinto, Michael, Pokorny, Michelle Pokrass, Vitchyr H. Pong, Tolly Powell, Alethea Power, Boris Power, Elizabeth Proehl, Raul Puri, Alec Radford, Jack Rae, Aditya Ramesh, Cameron Raymond, Francis Real, Kendra Rimbach, Carl Ross, Bob Rotsted, Henri Roussez, Nick Ryder, Mario Saltarelli, Ted Sanders, Shibani Santurkar, Girish Sastry, Heather Schmidt, David Schnurr, John Schulman, Daniel Selsam, Kyla Sheppard, Toki Sherbakov, Jessica Shieh, Sarah Shoker, Pranav Shyam, Szymon Sidor, Eric Sigler, Maddie Simens, Jordan Sitkin, Katarina Slama, Ian Sohl, Benjamin Sokolowsky, Yang Song, Natalie Staudacher, Felipe Petroski Such, Natalie Summers, Ilya Sutskever, Jie Tang, Nikolas Tezak, Madeleine B. 
Thompson, Phil Tillet, Amin Tootoonchian, Elizabeth Tseng, Preston Tuggle, Nick Turley, Jerry Tworek, Juan Felipe Cerón Uribe, Andrea Vallone, Arun Vijayvergiya, Chelsea Voss, Carroll Wainwright, Justin Jay Wang, Alvin Wang, Ben Wang, Jonathan Ward, Jason Wei, CJ Weinmann, Akila Welihinda, Peter Welinder, Jiayi Weng, Lilian Weng, Matt Wiethoff, Dave Willner, Clemens Winter, Samuel Wolrich, Hannah Wong, Lauren Workman, Sherwin Wu, Jeff Wu, Michael Wu, Kai Xiao, Tao Xu, Sarah Yoo, Kevin Yu, Qiming Yuan, Wojciech Zaremba, Rowan Zellers, Chong Zhang, Marvin Zhang, Shengjia Zhao, Tianhao Zheng, Juntang Zhuang, William Zhuk, and Barret Zoph. 2024. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL] https://arxiv.org/abs/2303.08774
[bib52] Pan et al. (2021) Hongyi Pan, Diaa Dabawi, and Ahmet Enis Cetin. 2021. Fast Walsh-Hadamard Transform and Smooth-Thresholding Based Binary Layers in Deep Neural Networks. arXiv:2104.07085 [cs.CV] https://arxiv.org/abs/2104.07085
[bib53] Paperno et al. (2016) Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. 2016. The LAMBADA dataset: Word prediction requiring a broad discourse context. arXiv preprint arXiv:1606.06031 (2016).
[bib54] Qin et al. (2025) Jiajun Qin, Tianhua Xia, Cheng Tan, Jeff Zhang, and Sai Qian Zhang. 2025. PICACHU: Plug-In CGRA Handling Upcoming Nonlinear Operations in LLMs. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (Rotterdam, Netherlands) (ASPLOS ’25). Association for Computing Machinery, New York, NY, USA, 845–861. doi:10.1145/3676641.3716013
[bib55] Ramachandran et al. (2025) Akshat Ramachandran, Souvik Kundu, and Tushar Krishna. 2025. Microscopiq: Accelerating foundational models through outlier-aware microscaling quantization. In Proceedings of the 52nd Annual International Symposium on Computer Architecture. 1193–1209.
[bib56] Rando et al. (2025) Stefano Rando, Luca Romani, Alessio Sampieri, Luca Franco, John Yang, Yuta Kyuragi, Fabio Galasso, and Tatsunori Hashimoto. 2025. LongCodeBench: Evaluating Coding LLMs at 1M Context Windows. arXiv:2505.07897 [cs.CL] https://arxiv.org/abs/2505.07897
[bib57] Rouhani et al. (2023) Bita Darvish Rouhani, Ritchie Zhao, Ankit More, Mathew Hall, Alireza Khodamoradi, Summer Deng, Dhruv Choudhary, Marius Cornea, Eric Dellinger, Kristof Denolf, Stosic Dusan, Venmugil Elango, Maximilian Golub, Alexander Heinecke, Phil James-Roxby, Dharmesh Jani, Gaurav Kolhe, Martin Langhammer, Ada Li, Levi Melnick, Maral Mesmakhosroshahi, Andres Rodriguez, Michael Schulte, Rasoul Shafipour, Lei Shao, Michael Siu, Pradeep Dubey, Paulius Micikevicius, Maxim Naumov, Colin Verrilli, Ralph Wittig, Doug Burger, and Eric Chung. 2023. Microscaling Data Formats for Deep Learning. arXiv:2310.10537 [cs.LG] https://arxiv.org/abs/2310.10537
[bib58] Sakaguchi et al. (2021) Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2021. Winogrande: An adversarial winograd schema challenge at scale. Commun. ACM 64, 9 (2021), 99–106.
[bib59] Satyabrata Sarangi and Bevan Baas. 2021. DeepScaleTool: A Tool for the Accurate Estimation of Technology Scaling in the Deep-Submicron Era. In 2021 IEEE International Symposium on Circuits and Systems (ISCAS). 1–5. doi:10.1109/ISCAS51556.2021.9401196
[bib60] Shahriari et al. (2015) Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P Adams, and Nando De Freitas. 2015. Taking the human out of the loop: A review of Bayesian optimization. Proc. IEEE 104, 1 (2015), 148–175.
[bib61] Shao et al. (2023a) Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, and Ping Luo. 2023a. Omniquant: Omnidirectionally calibrated quantization for large language models. arXiv preprint arXiv:2308.13137 (2023).
[bib62] Shao et al. (2023b) Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, and Ping Luo. 2023b. Omniquant: Omnidirectionally calibrated quantization for large language models. arXiv:2308.13137 (2023).
[bib63] SiFive, Inc. 2020. TileLink Specification. Specification v1.8.1. SiFive, Inc. https://starfivetech.com/uploads/tilelink_spec_1.8.1.pdf Version 1.8.1.
[bib64] Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023a. LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971 [cs.CL] https://arxiv.org/abs/2302.13971
[bib65] Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023b. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023).
[bib66] Vaswani et al. (2023) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2023. Attention Is All You Need. arXiv:1706.03762 [cs.CL] https://arxiv.org/abs/1706.03762
[bib67] Wang et al. (2016) Ziyu Wang, Frank Hutter, Masrour Zoghi, David Matheson, and Nando De Freitas. 2016. Bayesian optimization in a billion dimensions via random embeddings. Journal of Artificial Intelligence Research 55 (2016), 361–387.
[bib68] Shuhei Watanabe. 2023a. Python tool for visualizing variability of Pareto fronts over multiple runs. arXiv preprint arXiv:2305.08852 (2023).
[bib69] Shuhei Watanabe. 2023b. Tree-Structured Parzen Estimator: Understanding Its Algorithm Components and Their Roles for Better Empirical Performance. arXiv:2304.11127
[bib70] Wei et al. (2022) Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. 2022. Emergent Abilities of Large Language Models. arXiv:2206.07682 [cs.CL] https://arxiv.org/abs/2206.07682
[bib71] Xiao et al. (2023) Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. 2023. Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning. PMLR, 38087–38099.
[bib72] Zan et al. (2025) Daoguang Zan, Zhirong Huang, Wei Liu, Hanwu Chen, Linhao Zhang, Shulin Xin, Lu Chen, Qi Liu, Xiaojian Zhong, Aoyan Li, Siyao Liu, Yongsheng Xiao, Liangqiang Chen, Yuyu Zhang, Jing Su, Tianyu Liu, Rui Long, Kai Shen, and Liang Xiang. 2025. Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving. arXiv:2504.02605 [cs.SE] https://arxiv.org/abs/2504.02605
[bib73] Zellers et al. (2019) Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830 (2019).
[bib74] Zeng et al. (2024) Shulin Zeng, Jun Liu, Guohao Dai, Xinhao Yang, Tianyu Fu, Hongyi Wang, Wenheng Ma, Hanbo Sun, Shiyao Li, Zixiao Huang, Yadong Dai, Jintao Li, Zehao Wang, Ruoyu Zhang, Kairui Wen, Xuefei Ning, and Yu Wang. 2024. FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs. In Proceedings of the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (Monterey, CA, USA) (FPGA ’24). Association for Computing Machinery, New York, NY, USA, 223–234. doi:10.1145/3626202.3637562
[bib75] Zhang et al. (2024) Hengrui Zhang, August Ning, Rohan Baskar Prabhakar, and David Wentzlaff. 2024. LLMCompass: Enabling Efficient Hardware Design for Large Language Model Inference. In 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA). 1080–1096. doi:10.1109/ISCA59077.2024.00082
[bib76] Zhao et al. (2024) Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, and Baris Kasikci. 2024. Atom: Low-bit quantization for efficient and accurate llm serving. Proceedings of Machine Learning and Systems 6 (2024), 196–209.
[bib77] Liu et al. (2023) Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. 2023. KIVI: Plug-and-play 2bit KV Cache Quantization with Streaming Asymmetric Quantization. (2023). doi:10.13140/RG.2.2.28167.37282