1. Introduction

Continuous batching processes multiple requests dynamically: instead of handling each request individually, requests are grouped into a batch that is continually refreshed as sequences finish. This improves GPU efficiency and throughput, which makes it ideal for real-time applications like text generation and inference serving.

Let us understand this concept with an example. Later, we’ll implement it from scratch using PyTorch.

Example Scenario: Customer Support Chatbot

Initial Batch:  Imagine an e-commerce platform’s AI chatbot handling customer queries. Here are 8 simultaneous queries:

  1. “How do I track my order?”
  2. “What’s your return policy?”
  3. “Is this product available in red?”
  4. “Explain the features of the latest smartphone model.”
  5. “How long does shipping take?”
  6. “I received a damaged item. What should I do?”
  7. “Are there any ongoing sales?”
  8. “How do I change my delivery address?”

Dynamic Processing:
As the model generates responses, some queries are answered quickly (e.g., “What’s your return policy?”), while others take longer (e.g., “Explain the features of the latest smartphone model”). The BatchManager removes completed responses and adds new ones to maintain optimal GPU utilization.

Continuous Flow:
Once Customers 2, 5, and 7 receive quick responses, they are removed from the batch, and new queries are added:

  1. “Do you offer gift wrapping?”
  2. “What are your best-selling items this week?”
  3. “How do I apply a discount code?”

Continuous Operation:
This process continuously adapts the batch, removing completed queries and adding new ones, maintaining optimal GPU utilization and minimizing customer wait times.

Benefits

  • Efficient Resource Utilization: Maximizes GPU throughput by always processing an optimal number of queries.
  • Reduced Latency: Shorter queries are processed quickly, without waiting for longer ones.
  • Scalability: Smoothly handles a large number of incoming queries.
  • Adaptability: Manages varying query lengths and complexities efficiently.

    Continuous batching enables real-time processing of multiple, varying-length inputs, which is essential for many NLP applications like chatbots.

    We’ve now got a good picture of continuous batching. To make it concrete, though, let’s pin down exactly what it changes compared to synchronous batching.

Advantages of Continuous Batching over Synchronous Batching:

Dynamic batch management:

  • Continuous batching can add new prompts to the batch as others finish.
  • Synchronous batching processes fixed-size batches, potentially leading to underutilization when prompts finish at different times.

Efficient resource utilization:

  • Continuous batching keeps the GPU more consistently occupied, as it doesn’t wait for all prompts in a batch to finish before starting new ones.

Reduced latency for shorter prompts:

  • In continuous batching, shorter prompts can be completed and replaced with new ones without waiting for longer prompts to finish.

Better handling of varying prompt lengths:

  • Continuous batching adapts well to prompts of different lengths.
  • Synchronous batching might waste resources on padding for shorter prompts within a batch.
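
The contrast is easiest to see as two scheduling loops. The toy simulation below is not the real implementation we build later; each request is reduced to the number of decode steps it needs, and a “step” stands in for one forward pass of the model:

# Toy comparison of the two scheduling strategies (illustrative only).
def synchronous_batching(requests, batch_size):
    steps = 0
    for i in range(0, len(requests), batch_size):
        batch = list(requests[i:i + batch_size])
        while any(r > 0 for r in batch):           # the whole batch waits for its slowest member
            batch = [max(r - 1, 0) for r in batch]
            steps += 1
    return steps

def continuous_batching(requests, batch_size):
    queue, active, steps = list(requests), [], 0
    while queue or active:
        take = batch_size - len(active)
        active += queue[:take]                     # freed slots are refilled immediately
        queue = queue[take:]
        active = [r - 1 for r in active]
        active = [r for r in active if r > 0]      # finished requests leave the batch
        steps += 1
    return steps

requests = [5, 40, 3, 60, 8, 12, 4, 25]              # decode steps each request needs
print(synchronous_batching(requests, batch_size=4))  # 85 steps: each batch runs as long as its longest prompt
print(continuous_batching(requests, batch_size=4))   # 60 steps: bounded only by the single longest prompt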

The code, output, and logs comparing Continuous Batching and Synchronous Batching are provided at the end of this blog. Take a glance and come back; there’s more ahead.

2. Project Overview

Our implementation creates a flexible continuous batching system for text generation tasks. We use PyTorch for tensor operations, the Transformers library for working with pre-trained language models, and Streamlit for the user interface.

The core components of our system include:

  • Model and tokenizer loading
  • Tokenizer utilities
  • Token generation
  • Batch management
  • Continuous batching logic

Let’s explore each of these components in detail.
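
Based on the imports in the Streamlit app shown in section 5, the code is organized roughly as follows. The layout is inferred from those imports; the exact folder names for anything not imported (such as where the Streamlit script lives) are assumptions:

config/
    config.py               # device, model, tokenizer, generation, and batching settings
model/
    model_loader.py         # ModelLoader
    token_generator.py      # TokenGenerator
batching/
    batch_manager.py        # BatchManager
    continuous_batching.py  # ContinuousBatcher
utils/
    tokenizer_utils.py      # TokenizerUtils
<ui folder>/
    streamlit_app.py        # user interface (adds the project root to sys.path)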

3. Key Components
3.1 Model and Tokenizer Loading

The ModelLoader class handles the loading of the pre-trained model and tokenizer. It supports both standard and quantized (GPTQ) models, providing flexibility in model selection based on available resources.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

from config.config import config


class ModelLoader:
    def __init__(self, model_name):
        self.model_name = model_name
        self.device = config.device.device
        self.model = None
        self.tokenizer = None

    def load_model(self):
        # Load the model based on configuration
        if config.model.use_gptq:
            gptq_config = GPTQConfig(
                bits=config.model.gptq_bits,
                use_exllama=config.model.use_exllama,
            )
            self.model = AutoModelForCausalLM.from_pretrained(
                self.model_name,
                quantization_config=gptq_config,
                device_map=self.device,
                trust_remote_code=True,
            )
        else:
            self.model = AutoModelForCausalLM.from_pretrained(
                self.model_name,
                device_map=self.device,
                trust_remote_code=True,
                torch_dtype=torch.float16,
            )

        # Move model to device and set to evaluation mode
        self.model.to(self.device)
        self.model.eval()

        # Load and configure the tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(
            self.model_name,
            use_fast=config.tokenizer.use_fast,
        )

        # Use the EOS token for padding and mirror that in the model config
        self.tokenizer.pad_token = self.tokenizer.eos_token
        self.model.config.pad_token_id = self.model.config.eos_token_id

        self.tokenizer.padding_side = config.tokenizer.padding_side
        self.tokenizer.truncation_side = config.tokenizer.truncation_side

        return self.model, self.tokenizer
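
With this in place, loading boils down to two lines; the model name is whatever config.model.model_name points to:

# Minimal usage sketch of ModelLoader
loader = ModelLoader(config.model.model_name)
model, tokenizer = loader.load_model()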

3.2 Tokenizer Utilities

The TokenizerUtils class provides essential methods for working with the tokenizer, including encoding prompts, decoding responses, and managing token counts.

from typing import List, Tuple


class TokenizerUtils:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def encode_prompt(self, prompt: str, system_prompt: str) -> str:
        # Combine system prompt and user prompt
        return f"{system_prompt}\n\nQ: {prompt}\nA:"

    def decode_response(self, response: str) -> str:
        # Extract the model's response from the full output
        parts = response.split("A:")
        return parts[-1].strip() if len(parts) > 1 else response.strip()

    def tokenize_batch(self, prompts: List[Tuple[str, int]]) -> dict:
        # Tokenize a batch of prompts
        texts = [prompt for prompt, _ in prompts]
        max_tokens = [tokens for _, tokens in prompts]
        inputs = self.tokenizer(texts, padding=True, return_tensors="pt")
        return {**inputs, "max_tokens": max_tokens, "original_prompts": texts}

    # … (other utility methods)
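
For illustration, here is how these utilities fit together. This is a hypothetical usage snippet; the system prompt string is just a placeholder for whatever config.system_prompt holds:

# Hypothetical usage of TokenizerUtils; the system prompt text is a placeholder.
utils = TokenizerUtils(tokenizer)
formatted = utils.encode_prompt("How do I track my order?", "You are a helpful support agent.")
batch = utils.tokenize_batch([(formatted, 64)])
print(batch["input_ids"].shape, batch["max_tokens"])  # padded token ids plus per-prompt budgets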

3.3 Token Generation

The TokenGenerator class is responsible for generating tokens using the loaded model. It applies various logit processors to control the generation process, such as temperature scaling, top-k, and top-p sampling.

import torch
from transformers import (
    LogitsProcessorList,
    RepetitionPenaltyLogitsProcessor,
    TemperatureLogitsWarper,
    TopKLogitsWarper,
    TopPLogitsWarper,
)


class TokenGenerator:
    def __init__(self, model, tokenizer, device):
        self.model = model
        self.tokenizer = tokenizer
        self.device = device

    def generate_batch_tokens_with_past(
        self, inputs, repetition_penalty=1.8, temperature=0.7, top_k=150, top_p=0.95
    ):
        # Generate the next token for a batch of inputs
        with torch.no_grad():
            outputs = self.model(**inputs, use_cache=True)
        logits = outputs.logits
        last_logits = logits[:, -1, :]

        # Apply logits processors
        logits_processors = LogitsProcessorList([
            RepetitionPenaltyLogitsProcessor(repetition_penalty),
            TemperatureLogitsWarper(temperature),
            TopKLogitsWarper(top_k=top_k),
            TopPLogitsWarper(top_p=top_p),
        ])
        last_logits = logits_processors(inputs["input_ids"], last_logits)
        next_token_ids = last_logits.argmax(dim=1)

        return next_token_ids, outputs.past_key_values

    def get_next_inputs(self, batch, next_token_ids, past_key_values):
        # Prepare the inputs for the next iteration
        new_accumulate_ids = [
            acc + [new_id.item()]
            for acc, new_id in zip(batch["accumulate_ids"], next_token_ids)
        ]
        next_responses = self.tokenizer.batch_decode(
            new_accumulate_ids,
            skip_special_tokens=True,
            clean_up_tokenization_spaces=False,
        )

        return {
            "input_ids": next_token_ids.reshape((-1, 1)).to(self.device),
            "position_ids": batch["position_ids"][:, -1].unsqueeze(-1).to(self.device) + 1,
            "attention_mask": torch.cat([
                batch["attention_mask"].to(self.device),
                torch.ones((next_token_ids.shape[0], 1), device=self.device),
            ], dim=1).to(self.device),
            "past_key_values": past_key_values,
            "responses": next_responses,
            "tokens_remaining": [v - 1 for v in batch["tokens_remaining"]],
            "accumulate_ids": new_accumulate_ids,
        }
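
As a small aside, the warper chain can be exercised on its own. The snippet below uses dummy logits just to show the filtering effect; note that the class above then takes an argmax over the filtered logits, which amounts to greedy decoding over the surviving candidates, whereas true sampling would draw with torch.multinomial as shown here:

import torch
from transformers import (
    LogitsProcessorList,
    TemperatureLogitsWarper,
    TopKLogitsWarper,
    TopPLogitsWarper,
)

# Dummy example: one sequence, a 10-token vocabulary
input_ids = torch.tensor([[1, 2, 3]])
scores = torch.randn(1, 10)

warpers = LogitsProcessorList([
    TemperatureLogitsWarper(0.7),
    TopKLogitsWarper(top_k=5),
    TopPLogitsWarper(top_p=0.95),
])
filtered = warpers(input_ids, scores)               # disallowed tokens are set to -inf
probs = torch.softmax(filtered, dim=-1)
next_id = torch.multinomial(probs, num_samples=1)   # sampling variant of the final step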

3.4 Batch Management

The BatchManager class handles the creation, merging, and filtering of batches. It’s crucial for maintaining efficient batch sizes and removing completed sequences.
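
The full class is not reproduced here. The sketch below is a simplified illustration, not the exact implementation: it shows the interface the rest of the system relies on (init_batch, merge_batches, filter_batch) using the batch-dictionary fields introduced with get_next_inputs above, while slicing and concatenating the per-layer past_key_values cache is deliberately elided.

import torch

class BatchManager:
    def __init__(self, tokenizer, device):
        self.tokenizer = tokenizer
        self.device = device

    def init_batch(self, requests):
        # requests is a list of (prompt, max_new_tokens) tuples
        prompts = [p for p, _ in requests]
        inputs = self.tokenizer(prompts, padding=True, return_tensors="pt").to(self.device)
        position_ids = (inputs["attention_mask"].cumsum(dim=1) - 1).clamp(min=0)
        return {
            "input_ids": inputs["input_ids"],
            "attention_mask": inputs["attention_mask"],
            "position_ids": position_ids,
            "responses": ["" for _ in requests],
            "tokens_remaining": [t for _, t in requests],
            "accumulate_ids": [[] for _ in requests],
        }

    def filter_batch(self, batch):
        # Indices of sequences that have exhausted their token budget
        # (a full version would also stop at the EOS token)
        remove_indices = [i for i, left in enumerate(batch["tokens_remaining"]) if left <= 0]
        keep = [i for i in range(len(batch["tokens_remaining"])) if i not in remove_indices]

        filtered = dict(batch)
        for key in ("input_ids", "attention_mask", "position_ids"):
            filtered[key] = batch[key][keep]
        for key in ("responses", "tokens_remaining", "accumulate_ids"):
            filtered[key] = [batch[key][i] for i in keep]
        # past_key_values must also be sliced per layer and truncated to the
        # longest remaining sequence (omitted in this sketch)
        return filtered, remove_indices

    def merge_batches(self, batch1, batch2):
        # Left-pad the shorter attention mask so both batches share one sequence length,
        # then concatenate every field; past_key_values must be padded and concatenated
        # layer by layer as well (omitted in this sketch)
        ...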


3.5 Continuous Batching Logic

The ContinuousBatcher class ties everything together, implementing the core logic for continuous batching. It manages the flow of prompts through the system, from initial batching to final output generation.

import time

from tqdm import tqdm


class ContinuousBatcher:
    def __init__(self, batch_manager, token_generator):
        self.batch_manager = batch_manager
        self.token_generator = token_generator

    def run(self, request_queue, batch_size):
        t0 = time.time()
        latencies = []

        with tqdm(total=len(request_queue), desc=f"bs={batch_size}") as pbar:
            # Initialize the first batch
            batch = self.batch_manager.init_batch(request_queue[:batch_size])
            batch_start_time = time.time()
            cached_batch = self.token_generator.get_next_inputs(
                batch, *self.token_generator.generate_batch_tokens_with_past(batch)
            )
            request_queue = request_queue[batch_size:]

            while len(request_queue) > 0 or cached_batch["input_ids"].size(0) > 0:
                # Calculate available capacity in the current batch
                batch_capacity = batch_size - cached_batch["input_ids"].size(0)

                # Add new requests to the batch if there's capacity
                if batch_capacity > 0 and len(request_queue) > 0:
                    new_batch = self.batch_manager.init_batch(request_queue[:batch_capacity])
                    new_batch = self.token_generator.get_next_inputs(
                        new_batch,
                        *self.token_generator.generate_batch_tokens_with_past(new_batch),
                    )
                    request_queue = request_queue[batch_capacity:]
                    cached_batch = self.batch_manager.merge_batches(cached_batch, new_batch)

                # Generate next tokens for the current batch
                cached_batch = self.token_generator.get_next_inputs(
                    cached_batch,
                    *self.token_generator.generate_batch_tokens_with_past(cached_batch),
                )

                # Filter completed sequences
                remember_this_batch = cached_batch
                cached_batch, remove_indices = self.batch_manager.filter_batch(cached_batch)

                # Advance the progress bar by the number of finished sequences
                pbar.update(len(remove_indices))

                # Yield completed sequences
                for idx in remove_indices:
                    prompt_duration = time.time() - batch_start_time
                    prompt_tokens = len(self.batch_manager.tokenizer.encode(remember_this_batch["responses"][idx]))
                    prompt_latency = prompt_duration / prompt_tokens
                    latencies.append(prompt_latency)

                    yield (
                        remember_this_batch["responses"][idx],
                        prompt_tokens,
                        prompt_latency,
                    )

                    pbar.set_postfix({"latency": f"{prompt_latency:.4f}", "tokens": f"{prompt_tokens}"})

                remember_this_batch = cached_batch

            duration_s = time.time() - t0
            print("duration", duration_s)
            print(latencies)
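
Used on its own (outside the Streamlit app shown later), the batcher is simply a generator you iterate over. The queue entries below are illustrative (prompt, max_new_tokens) pairs:

# Illustrative usage; prompts and token budgets are arbitrary examples.
request_queue = [
    ("Explain the water cycle", 64),
    ("Compose a haiku about autumn", 32),
]
batcher = ContinuousBatcher(batch_manager, token_generator)
for response, n_tokens, latency in batcher.run(request_queue, batch_size=8):
    print(f"{n_tokens} tokens, {latency:.4f}s per token:\n{response}\n")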

4. Performance Optimization

To optimize performance, we implemented several techniques:

  1. Efficient batch merging and filtering to maintain optimal batch sizes.
  2. Dynamic truncation of attention masks to reduce unnecessary computations.
  3. Caching of past key values to speed up subsequent token generations.
  4. Use of PyTorch’s no_grad() context to disable gradient calculation during inference.
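
As a quick illustration of points 3 and 4, the incremental decoding step boils down to something like the following. This is a minimal sketch assuming a Hugging Face causal LM, not the exact TokenGenerator code above:

import torch

@torch.no_grad()  # no gradients are needed at inference time
def decode_step(model, input_ids, attention_mask, past_key_values=None):
    # Reuse the cached keys/values from previous steps so only the new token is processed
    outputs = model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        past_key_values=past_key_values,
        use_cache=True,
    )
    next_ids = outputs.logits[:, -1, :].argmax(dim=-1, keepdim=True)
    return next_ids, outputs.past_key_values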

5. User Interface

We created a Streamlit app (streamlit_app.py) to provide an interactive interface for users to input prompts and view generated responses. This app allows users to:

  • Enter multiple prompts
  • Set the maximum number of tokens to generate
  • View generated responses in real-time

import streamlit as st
import sys
import os

sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "..")))

from config.config import config
from model.model_loader import ModelLoader
from model.token_generator import TokenGenerator
from batching.batch_manager import BatchManager
from batching.continuous_batching import ContinuousBatcher
from utils.tokenizer_utils import TokenizerUtils

# Initialize session state
if "prompts" not in st.session_state:
    st.session_state.prompts = []

@st.cache_resource
def load_model_and_tokenizer():
    model_loader = ModelLoader(config.model.model_name)
    model, tokenizer = model_loader.load_model()
    return model, tokenizer

model, tokenizer = load_model_and_tokenizer()

# Initialize utility classes
token_generator = TokenGenerator(model, tokenizer, config.device.device)
batch_manager = BatchManager(tokenizer, config.device.device)
continuous_batcher = ContinuousBatcher(batch_manager, token_generator)
tokenizer_utils = TokenizerUtils(tokenizer)

st.title("Continuous Batching Text Generation")

# Input area for new prompts
new_prompt = st.text_input("Enter a prompt:")
max_tokens = st.number_input("Max tokens to generate:", min_value=1, value=config.generation.max_new_tokens)

if st.button("Add Prompt"):
    if new_prompt:
        st.session_state.prompts.append((new_prompt, max_tokens))
        st.success(f"Prompt added: {new_prompt}")

# Display current prompts
st.write("Current Prompts:")
for i, (prompt, tokens) in enumerate(st.session_state.prompts):
    st.write(f"{i+1}. {prompt} (Max Tokens: {tokens})")

# Generate button
if st.button("Generate"):
    if st.session_state.prompts:
        # Create placeholders for each prompt
        placeholders = [st.empty() for _ in st.session_state.prompts]

        # Prepare prompts
        formatted_prompts = [
            (tokenizer_utils.encode_prompt(prompt, config.system_prompt), tokens)
            for prompt, tokens in st.session_state.prompts
        ]

        # Run continuous batching
        for i, (response, tokens, latency) in enumerate(
            continuous_batcher.run(formatted_prompts, config.batching.batch_size)
        ):
            answer = tokenizer_utils.decode_response(response)

            placeholders[i].write(
                f"Question: {st.session_state.prompts[i][0]}\n\nAnswer: {answer}\nTokens: {tokens}\nLatency: {latency:.4f}"
            )

        # Clear prompts after generation
        st.session_state.prompts = []
    else:
        st.warning("No prompts to generate. Please add some prompts first.")

# Clear all prompts button
if st.button("Clear All Prompts"):
    st.session_state.prompts = []
    st.success("All prompts cleared.")

6. Challenges and Solutions

During the implementation of our continuous batching system, we encountered several challenges:

1. Varying Sequence Lengths:
One of the main challenges was efficiently managing prompts and responses of different lengths within the same batch.
Solution: We implemented dynamic batch filtering and merging in the BatchManager class. This allows us to remove completed sequences and add new ones on the fly, maintaining optimal batch sizes throughout the generation process.

2. Memory Optimization:
With long sequences, memory usage can quickly become a bottleneck, especially on GPUs with limited VRAM.
Solution: We implemented attention mask truncation and past key value optimization in the BatchManager.filter_batch method. This reduces the memory footprint by removing unnecessary padding and history for completed sequences.

3. Performance Tuning:
Balancing between batch size, generation speed, and memory usage required careful tuning.
Solution: We made batch size and other generation parameters configurable, allowing users to adjust these based on their specific hardware and requirements.

4. State Management:
Keeping track of the state for each sequence in the batch (e.g., tokens generated, tokens remaining) was crucial for correct operation.
Solution: We maintained state information within the batch dictionary, including tokens_remaining and accumulate_ids, allowing us to track progress for each sequence individually.
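
Concretely, the state carried from one decoding step to the next is just the dictionary returned by get_next_inputs in section 3.3. Annotated with illustrative shapes for a batch of B active sequences, it looks like this:

batch_state = {
    "input_ids":        ...,  # (B, 1) last token generated for each sequence
    "position_ids":     ...,  # (B, 1) next position index for each sequence
    "attention_mask":   ...,  # (B, L) grows by one column per decoding step
    "past_key_values":  ...,  # cached keys/values for every transformer layer
    "responses":        ...,  # list of B partially decoded strings
    "tokens_remaining": ...,  # list of B per-prompt budgets, decremented every step
    "accumulate_ids":   ...,  # list of B lists of generated token ids
}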

7. Conclusion and Future Work

Our implementation of continuous batching demonstrates an efficient approach to processing multiple text generation tasks in parallel. By leveraging PyTorch and the Transformers library, we’ve created a flexible system that can handle varying input lengths and generation requirements.

Key advantages of this implementation include:

  • Efficient utilization of GPU resources
  • Ability to handle prompts of varying lengths in the same batch
  • Real-time generation and display of results
  • Configurable parameters for different use cases and hardware setups

For future improvements, we could consider:

  1. Advanced Scheduling Algorithms: Implement more sophisticated scheduling algorithms to optimize the order of prompt processing based on expected generation time or priority.
  2. Multi-GPU Parallelism: Extend the system to utilize multiple GPUs, potentially allowing for even larger batch sizes and faster processing.
  3. Adaptive Batch Sizing: Implement dynamic adjustment of batch sizes based on real-time performance metrics and available resources.
  4. Streaming Output: Modify the system to stream tokens as they’re generated, rather than waiting for complete sequences. This could provide a more responsive user experience for long-form generation tasks.
  5. Integration with Different Models: Extend support for a wider range of language models and architectures, making the system more versatile.

By implementing continuous batching from scratch, we’ve gained deep insights into the process and created a flexible system that can be adapted to various text generation tasks. This approach allows for efficient utilization of GPU resources and can significantly improve throughput in production environments.

8. Code Snippets

Here is a simple implementation of synchronous and continuous batching, along with their outputs and logs:

# This code is not executable because it is dependent on other helper functions.
queue_size = 32
batch_size = 8

request_queue = [
    ("Explain the concept of supply and demand", 255),
    ("Write a short story about a time traveler", 39),
    ("Explain how a computer processor works", 26),
    ("Write a brief biography of Marie Curie", 83),
    ("Describe the process of photosynthesis", 72),
    ("Explain the rules of chess", 98),
    ("Write a recipe for chocolate chip cookies", 30),
    ("Describe the water cycle", 37),
    ("Explain the theory of relativity", 87),
    ("Write a short essay on climate change", 62),
    ("Describe the structure of a cell", 24),
    ("List the main events of World War II", 75),
    ("Compose a haiku about autumn", 17),
    ("Summarize the plot of Romeo and Juliet", 291),
    ("Describe the process of making coffee", 46),
    ("Write a short story about a time traveler", 68),
    ("Explain the concept of supply and demand", 13),
    ("Describe the water cycle", 95),
    ("Explain how a computer processor works", 58),
    ("Write a brief biography of Marie Curie", 42),
    ("Compose a haiku about autumn", 81),
    ("Explain the rules of chess", 33),
    ("Write a recipe for chocolate chip cookies", 77),
    ("Describe the structure of a cell", 50),
    ("Summarize the plot of Romeo and Juliet", 222),
    ("List the main events of World War II", 66),
    ("Describe the process of photosynthesis", 89),
    ("Write a short essay on climate change", 35),
    ("Explain the theory of relativity", 70),
    ("Describe the process of making coffee", 19),
    ("Write a short story about a time traveler", 93),
    ("Explain the concept of supply and demand", 448),
] * 1
batches = [
    request_queue[i:i + batch_size]
    for i in range(0, len(request_queue), batch_size)
]

# Synchronous Batching

# generate tokens for all batches and record duration

average_latencies = []
t0 = time.time()
with tqdm(total=len(batches), desc=f"bs={batch_size}") as pbar:

    for i, batch in enumerate(batches):
        t1 = time.time()
        batch_max_tokens = [b[1] for b in batch]
        max_tokens = max(batch_max_tokens)

        pbar.set_postfix({'max_tokens': max_tokens})

        batch_prompts = [p[0] for p in batch]

        inputs = tokenizer(
            batch_prompts, return_tensors='pt', padding=True,
        )
        generate_batch(inputs, max_tokens=max_tokens)

        duration = time.time() - t1

        batch_latency = duration / len(batch)
        print(f"Batch size: {len(batch)}, Maximum Token in this batch: {max_tokens}, Latency for this batch: {batch_latency}")
        average_latencies.append(batch_latency)

        pbar.set_postfix({'max_tokens': max_tokens, 'latency': f"{batch_latency:.4f}"})
        pbar.update(1)

duration_s = time.time() - t0
print("Duration: ", duration_s)

## Output:

bs=8:  12%|█▎        | 1/8 [00:11<01:19, 11.36s/it, max_tokens=291]               
Batch size: 8, Maximum Token in this batch: 255, Latency for this batch: 1.4196937084197998
bs=8:  25%|██▌       | 2/8 [00:22<01:07, 11.22s/it, max_tokens=95]                
Batch size: 8, Maximum Token in this batch: 291, Latency for this batch: 1.390859216451645
bs=8:  38%|███▊      | 3/8 [00:25<00:37,  7.59s/it, max_tokens=448]              
Batch size: 8, Maximum Token in this batch: 95, Latency for this batch: 0.4074409008026123
bs=8:  50%|█████     | 4/8 [00:45<00:49, 12.33s/it, max_tokens=255]               
Batch size: 8, Maximum Token in this batch: 448, Latency for this batch:
2.4483776092529297
bs=8:  62%|██████▎   | 5/8 [00:54<00:34, 11.36s/it, max_tokens=291]               
Batch size: 8, Maximum Token in this batch: 255, Latency for this batch: 1.2074491679668427
bs=8:  75%|███████▌  | 6/8 [01:06<00:22, 11.30s/it, max_tokens=95]                
Batch size: 8, Maximum Token in this batch: 291, Latency for this batch: 1.3954536616802216
bs=8:  88%|████████▊ | 7/8 [01:09<00:08,  8.69s/it, max_tokens=448]              
Batch size: 8, Maximum Token in this batch: 95, Latency for this batch: 0.41396722197532654
bs=8: 100%|██████████| 8/8 [01:28<00:00, 11.03s/it, max_tokens=448, latency=2.3426]
Batch size: 8, Maximum Token in this batch: 448, Latency for this batch: 2.342629700899124
Duration:  88.22081732749939

# Continuous Batching

t0 = time.time()

latencies = []
total_token_generated = 0
with tqdm(total=len(request_queue), desc=f"bs = {batch_size}") as pbar:

    batch = init_batch(request_queue[:batch_size])  # prefill the first batch

    batch_start_time = time.time()  # timestamp taken just before generation starts

    cached_batch = generate_next_token(batch)  # generates the next token for each sequence in the batch

    batch_tokens = cached_batch['input_ids'].size(0)  # tokens generated across the whole batch in this step

    request_queue = request_queue[batch_size:]

    # continue until the request queue is drained and every cached sequence has finished
    while len(request_queue) > 0 or cached_batch['input_ids'].size(0) > 0:
        batch_capacity = batch_size - cached_batch['input_ids'].size(0)

        if batch_capacity > 0 and len(request_queue) > 0:

            new_batch = init_batch(request_queue[:batch_capacity])

            new_batch = generate_next_token(new_batch)

            # even when a new batch is added, the tokens generated for it must be
            # counted too, since they obviously affect the measured latency
            batch_tokens += new_batch["input_ids"].size(0)
            request_queue = request_queue[batch_capacity:]

            # merge the new requests into the running batch
            cached_batch = merge_batches(cached_batch, new_batch)

        cached_batch = generate_next_token(cached_batch)

        batch_tokens += cached_batch['input_ids'].size(0)

        remember_this_batch = cached_batch
        # remove any inputs that have finished generation
        cached_batch, remove_indices = filter_batch(cached_batch)

        pbar.update(len(remove_indices))

        for idx in remove_indices:
            prompt_duration = time.time() - batch_start_time
            prompt_tokens = len(remember_this_batch['responses'][idx].split())
            prompt_latency = prompt_duration / prompt_tokens
            latencies.append(prompt_latency)

            print(f"\nCompleted prompt {idx}:")
            print(f"Output: {remember_this_batch['responses'][idx]}")
            print(f"Tokens: {prompt_tokens}")
            print(f"Latency: {prompt_latency:.4f}")

            pbar.set_postfix({'latency': f'{prompt_latency:.4f}', 'tokens': f'{prompt_tokens}'})
            pbar.update(1)

        # keep the reference in sync with the filtered batch
        remember_this_batch = cached_batch

duration_s = time.time() - t0
print("duration", duration_s)
print(latencies)

## Output:    

bs = 8:   0%|          | 0/32 [00:00<?, ?it/s]
bs = 8:   6%|▋         | 2/32 [00:01<00:03,  8.48it/s, latency=0.0666, tokens=24]

Completed prompt 2:
Output: Explain how a computer processor works.

How to use the computer to perform a task.

How to use the computer to perform a task.

Tokens: 24
Latency: 0.0666
bs = 8:  12%|█▎        | 4/32 [00:02<00:03,  7.15it/s, latency=0.0655, tokens=29]

Completed prompt 5:
Output: Write a recipe for chocolate chip cookies.

I’ve been making these for a while now, and I’ve been making them for a while now, and I’ve been making them
Tokens: 29
Latency: 0.0655
bs = 8:  22%|██▏       | 7/32 [00:02<00:04,  5.12it/s, latency=0.0621, tokens=42]

Completed prompt 5:
Output: Describe the water cycle of the world.
The water cycle of the world is a cycle of water.

The water cycle of the world is a cycle of water.

The water cycle
Tokens: 32
Latency: 0.0754

Completed prompt 1:
Output: Write a short story about a time traveler who is stranded on a planet where the only way to get back to Earth is to travel back in time.

The story is set in the year 2099, and is set in the
Tokens: 42
Latency: 0.0621
bs = 8:  31%|███▏      | 10/32 [00:04<00:08,  2.63it/s, latency=0.1791, tokens=23]

Completed prompt 6:
Output: Describe the structure of a cell.

The cell is a structure that is composed of two parts: a nucleus and a nucleus-containing cell.
Tokens: 23
Latency: 0.1791
bs = 8:  38%|███▊      | 12/32 [00:05<00:07,  2.54it/s, latency=0.0868, tokens=58]

Completed prompt 2:
Output: Describe the process of photosynthesis in the laboratory.

The photosynthetic process is the process of photosynthesis that occurs when the photosynthetic material is exposed to sunlight. The photosynthetic material is a mixture of carbon dioxide and hydrogen, which is the primary fuel for the photosynthetic process. The photosynthetic material is then heated to a temperature of about 1,000
Tokens: 58
Latency: 0.0868
bs = 8:  44%|████▍     | 14/32 [00:05<00:05,  3.47it/s, latency=0.3183, tokens=17]

Completed prompt 6:
Output: Compose a haiku about autumn and winter.
The first thing you’ll notice is that the haiku is
Tokens: 17
Latency: 0.3183
bs = 8:  50%|█████     | 16/32 [00:06<00:04,  3.56it/s, latency=0.0898, tokens=67]

Completed prompt 1:
Output: Write a brief biography of Marie Curie.

The book is a collection of essays, essays, and interviews with Marie Curie, the author of “The Great American Novel.”

The book is available in paperback and Kindle.

The book is available in print and digital formats.

The book is available in print and digital formats.

The book is available in print and digital formats.

The book is available in
Tokens: 67
Latency: 0.0898
bs = 8:  56%|█████▋    | 18/32 [00:06<00:04,  3.39it/s, latency=0.1195, tokens=56]

Completed prompt 3:
Output: Write a short essay on climate change, and then write a short essay on climate change.

The first step is to write a short essay on climate change.

The second step is to write a short essay on climate change.

The third step is to write a short essay on climate change.

The fourth step is
Tokens: 56
Latency: 0.1195
bs = 8:  62%|██████▎   | 20/32 [00:07<00:03,  3.45it/s, latency=0.0881, tokens=83]
Completed prompt 1:
Output: Explain the rules of chess.

The rules of chess are not the rules of chess. They are the rules of chess.

The rules of chess are not the rules of chess. They are the rules of chess.

The rules of chess are not the rules of chess. They are the rules of chess.

The rules of chess are not the rules of chess. They are the rules of chess.

The rules of chess are not the rules of chess. They are the rules
Tokens: 83
Latency: 0.0881
bs = 8:  69%|██████▉   | 22/32 [00:07<00:02,  3.92it/s, latency=0.4585, tokens=17]

Completed prompt 6:
Output: Explain the concept of supply and demand.

The first thing to understand is that the supply and
Tokens: 17
Latency: 0.4585
bs = 8:  78%|███████▊  | 25/32 [00:08<00:01,  4.48it/s, latency=0.1238, tokens=70]

Completed prompt 1:
Output: Explain the theory of relativity, and the theory of relativity, and the theory of relativity, and the theory of relativity, and the theory of relativity, and the theory of relativity, and the theory of relativity, and the theory of relativity, and the theory of relativity, and the theory of relativity, and the theory of relativity, and the theory of relativity, and the theory of relativity, and the theory of relativity, and the theory of relativity, and the
Tokens: 77
Latency: 0.1106

Completed prompt 1:
Output: List the main events of World War II.

The first event was the Battle of the Bulge, which was fought on the day of the Battle of the Bulge. The Battle of the Bulge was the first major battle in the history of the United States. The Battle of the Bulge was the first major battle in the history of the United States.

The second event was the Battle of
Tokens: 70
Latency: 0.1238
bs = 8:  88%|████████▊ | 28/32 [00:09<00:01,  3.80it/s, latency=0.2046, tokens=46]

Completed prompt 2:
Output: Describe the process of making coffee.

The process of making coffee is a very simple one. You start with a cup of coffee and then you add a few drops of water. The water is a mixture of water and sugar. The sugar is a mixture
Tokens: 46
Latency: 0.2046
bs = 8:  94%|█████████▍| 30/32 [00:12<00:01,  1.43it/s, latency=0.1985, tokens=61]

Completed prompt 2:
Output: Write a short story about a time traveler who is stranded on a planet where the only way to get back to Earth is to travel back in time.

The story is set in the year 2099, and is set in the year 2099.

The story is set in the year 2099, and is set in the year 2099.

The story
Tokens: 61
Latency: 0.1985
bs = 8: 34it [00:12,  3.37it/s, latency=0.4027, tokens=31]                       

Completed prompt 4:
Output: Write a brief biography of Marie Curie.

The book is a collection of essays, essays, and interviews with Marie Curie, the author of “The Great American Novel.”

The book is available in paperback and Kindle.

Tokens: 35
Latency: 0.3566
Completed prompt 6:
Output: Explain the rules of chess.

The rules of chess are not the rules of chess. They are the rules of chess.

The rules of chess are not the rules of chess
Tokens: 31
Latency: 0.4027
bs = 8: 35it [00:13,  2.83it/s, latency=0.2699, tokens=49]

Completed prompt 3:
Output: Explain how a computer processor works.

How to use the computer to perform a task.

How to use the computer to perform a task.

How to use the computer to perform a task.

How to use the computer to perform a task.

How to use the computer to perform
Tokens: 49
Latency: 0.2699
bs = 8: 39it [00:16,  1.56it/s, latency=0.2552, tokens=66]

Completed prompt 2:
Output: Describe the water cycle of the world.

The water cycle of the world is a cycle of water.

The water cycle of the world is a cycle of water.

The water cycle of the world is a cycle of water.

The water cycle of the world is a cycle of water.

The water cycle of the world is a cycle of water.

The water cycle of the world is a cycle of water.

The water cycle of the
Tokens: 78
Latency: 0.2137

Completed prompt 2:
Output: Compose a haiku about autumn and winter.

The first thing you’ll notice is that the haiku is a bit of a mess. It’s not a haiku about the weather, but about the way things are. It’s not a haiku about the weather, but about the way things are. It’s not a haiku about the weather, but about the way things are. It’s not a haiku about
Tokens: 66
Latency: 0.2552
bs = 8: 42it [00:18,  1.70it/s, latency=0.4268, tokens=43]

Completed prompt 3:
Output: Describe the structure of a cell.

The cell is a structure that is composed of two parts: a nucleus and a nucleus-containing cell. The nucleus is a structure that is composed of two parts: a nucleus and a nucleus-containing cell.

The nucleus is
Tokens: 43
Latency: 0.4268
bs = 8: 43it [00:20,  1.10s/it, latency=0.3333, tokens=63]

Completed prompt 4:
Output: List the main events of World War II.

The first event was the Battle of the Bulge, which was fought on the day of the Battle of the Bulge. The Battle of the Bulge was the first major battle in the history of the United States. The Battle of the Bulge was the first major battle in the history of the United States.
Tokens: 63
Latency: 0.3333
bs = 8: 45it [00:21,  1.51it/s, latency=0.3119, tokens=68]

Completed prompt 2:
Output: Write a recipe for chocolate chip cookies.

I’ve been making these for a while now, and I’ve been making them for a while now, and I’ve been making them for a while now, and I’ve been making them for a while now, and I’ve been making them for a while now, and I’ve been making them for a while now, and I’ve been making them for a while
Tokens: 68
Latency: 0.3119
bs = 8: 47it [00:21,  2.21it/s, latency=0.6126, tokens=35]

Completed prompt 4:
Output: Write a short essay on climate change, and then write a short essay on climate change.

The first step is to write a short essay on climate change.

The second step is to write a
Tokens: 35
Latency: 0.6126
bs = 8: 50it [00:23,  1.77it/s, latency=1.1144, tokens=21]

Completed prompt 5:
Output: Describe the process of making coffee.

The process of making coffee is a very simple one. You start with a cup
Tokens: 21
Latency: 1.1144
bs = 8: 52it [00:24,  1.71it/s, latency=0.1230, tokens=201]

Completed prompt 0:
Output: Explain the concept of supply and demand.

The first thing to understand is that the supply and demand are not the same thing. The supply and demand are not the same thing.

The supply and demand are not the same thing.

The supply and demand are not the same thing.

The supply and demand are not the same thing.

The supply and demand are not the same thing.

The supply and demand are not the same thing.

The supply and demand are not the same thing.

The supply and demand are not the same thing.
The supply and demand are not the same thing.

The supply and demand are not the same thing.

The supply and demand are not the same thing.

The supply and demand are not the
Tokens: 201
Latency: 0.1230
bs = 8: 54it [00:26,  1.75it/s, latency=0.4133, tokens=63]

Completed prompt 3:
Output: Explain the theory of relativity, and the theory of relativity, and the theory of relativity, and the theory of relativity, and the theory of relativity, and the theory of relativity, and the theory of relativity, and the theory of relativity, and the theory of relativity, and the theory of relativity, and the theory of relativity, and the theory of relativity, and the theory
Tokens: 63
Latency: 0.4133
bs = 8: 56it [00:26,  2.27it/s, latency=0.3875, tokens=69]

Completed prompt 2:
Output: Describe the process of photosynthesis in the laboratory.

The photosynthetic process is the process of photosynthesis that occurs when the photosynthetic material is exposed to sunlight. The photosynthetic material is a mixture of carbon dioxide and hydrogen, which is the primary fuel for the photosynthetic process. The photosynthetic material is then heated to a temperature of about 1,000 degrees Fahrenheit (1,200 degrees Celsius) and then the photosynthetic material is
Tokens: 69
Latency: 0.3875
bs = 8: 58it [00:29,  1.22it/s, latency=0.3699, tokens=80]

Completed prompt 2:
Output: Write a short story about a time traveler who is stranded on a planet where the only way to get back to Earth is to travel back in time.
The story is set in the year 2099, and is set in the year 2099.

The story is set in the year 2099, and is set in the year 2099.

The story is set in the year 2099, and is set in the year 2099.

The story is set in the
Tokens: 80
Latency: 0.3699
bs = 8: 60it [00:32,  1.02s/it, latency=0.1095, tokens=297]

Completed prompt 0:
Output: Summarize the plot of Romeo and Juliet, and the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the
Tokens: 297
Latency: 0.1095
bs = 8: 62it [00:33,  1.37it/s, latency=0.1470, tokens=228]

Completed prompt 0:
Output: Summarize the plot of Romeo and Juliet, and the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the story of the
Tokens: 228
Latency: 0.1470
bs = 8: 64it [00:52,  1.22it/s, latency=0.1513, tokens=346]

Completed prompt 0:
Output: Explain the concept of supply and demand.

The first thing to understand is that the supply and demand are not the same thing. The supply and demand are not the same thing.

The supply and demand are not the same thing.

The supply and demand are not the same thing.

The supply and demand are not the same thing.

The supply and demand are not the same thing.

The supply and demand are not the same thing.

The supply and demand are not the same thing.

The supply and demand are not the same thing.

The supply and demand are not the same thing.

The supply and demand are not the same thing.

The supply and demand are not the same thing.

Tokens: 346
Latency: 0.1513
duration 52.36468815803528
[0.06657714645067851, 0.06551452340750859, 0.0754406750202179, 0.06205427646636963, 0.17905473709106445, 0.086844304512287, 0.3183482394498937, 0.08977151272901848, 0.11949176447732109, 0.08813720151602504, 0.4585063176996568, 0.11057375623034192, 0.12381176948547364, 0.20455515902975333, 0.19853858478733752, 0.356649569102696, 0.4026893723395563, 0.2699056012289865, 0.21369612216949463, 0.25516481471784186, 0.42684131444886675, 0.3332569674840049, 0.311912196524003, 0.6126059600285121, 1.1143998645600819, 0.12299618554945609, 0.4132767745426723, 0.3874759985053021, 0.36986699402332307, 0.10945840636487762, 0.14702578594810084, 0.15132888824264437]