Text Streaming: 10x Speed Boost

Nov 5, 2024

Matthew Fedoseev

How We Achieved a 10x Speedup in Text Streaming with a Simple Optimization

In our fast-paced world of AI-driven applications, ensuring high performance and scalability is essential. Here's how we tackled and resolved a significant performance bottleneck in our text streaming pipeline.

While working on a recent AI project that involved streaming text from our endpoints, we noticed a significant bottleneck: small chunks of text loaded almost instantly, but larger files (>1 MB) took nearly 25 seconds to stream. This was surprising, considering the data was coming from Redis deployed on-premises within the same network.

After some investigation, we identified the bottleneck:

from starlette.responses import StreamingResponse

async def foo() -> StreamingResponse:
    # Some logic; by this point text_file holds the entire payload as one str

    return StreamingResponse(iter(text_file), media_type="text/plain")

At first glance, this implementation might seem fine. Indeed, it works well when streaming tokens from an LLM or chunks of a binary file object. However, when handed a raw string, iter() yields one character per iteration, so StreamingResponse sends each character as its own chunk, resulting in:

- Very small chunks (1 character per iteration)

- Excessive I/O operations

- Significant TCP overhead

These inefficiencies caused the streaming delays we saw with large files, as the short demo below makes concrete.
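A minimal sketch of the per-character behavior, using an illustrative payload:

text = "Hello, world! " * 100_000   # roughly 1.4 MB of ASCII text

# Iterating over a plain str yields one character at a time, so a
# StreamingResponse built from iter(text) emits one-character chunks.
chunks = list(iter(text))

print(len(chunks))       # 1,400,000 iterations, one per character
print(repr(chunks[0]))   # 'H'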

The Solution: Chunking the Data

To address this, we introduced a simple generator that splits the string into fixed-size chunks before streaming:

from starlette.responses import StreamingResponse

async def foo() -> StreamingResponse:
    # Some logic; by this point text_file holds the entire payload as one str

    async def chunk_generator(content: str, chunk_size: int = 8192):
        # Yield fixed-size slices of the string instead of single characters
        for i in range(0, len(content), chunk_size):
            yield content[i:i + chunk_size]

    return StreamingResponse(chunk_generator(text_file), media_type="text/plain")
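As a quick sanity check, the generator can also be exercised on its own; this sketch uses an illustrative 1.2 MB ASCII string and counts the chunks it yields:

import asyncio

async def chunk_generator(content: str, chunk_size: int = 8192):
    for i in range(0, len(content), chunk_size):
        yield content[i:i + chunk_size]

async def count_chunks(content: str) -> int:
    # One pass over the async generator, counting the yielded chunks
    count = 0
    async for _ in chunk_generator(content):
        count += 1
    return count

text = "x" * 1_200_000                  # roughly a 1.2 MB ASCII string
print(asyncio.run(count_chunks(text)))  # 147 chunks instead of 1,200,000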

This small change reduced the streaming time for a 1.2 MB text file from 25 seconds to about 2.3 seconds, roughly a 10x speedup.
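Your numbers will depend on the payload and the network. A rough way to reproduce this kind of measurement, assuming the httpx client is available and the endpoint is exposed locally at a hypothetical /stream route, is to time how long a client takes to drain the response:

import time
import httpx

def time_stream(url: str) -> float:
    # Stream the response and measure how long it takes to drain it fully
    start = time.perf_counter()
    with httpx.stream("GET", url, timeout=None) as response:
        for _ in response.iter_bytes():
            pass
    return time.perf_counter() - start

print(f"Streamed in {time_stream('http://localhost:8000/stream'):.2f}s")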

Why Chunking Works

Chunking improves performance by:

- Reducing I/O Overhead: Each chunk is sent as soon as it’s ready, avoiding per-character delays.

- Balancing Memory and Speed: The chunk size (8192 characters) aligns with typical I/O buffer sizes (4 KB - 64 KB).

- Scalability: The same code path handles both small and large files without modification.

For our use case, we chose 8192 characters per chunk as the default, but this parameter can be tuned to your workload. Keep encoding in mind: with multi-byte characters, an 8192-character chunk can encode to more than 8192 bytes, so you may need to adjust the chunk size.
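As a concrete illustration of the encoding point, a chunk measured in characters can encode to considerably more bytes on the wire:

chunk = "é" * 8192                   # 8192 characters, 2 bytes each in UTF-8
print(len(chunk))                    # 8192 characters
print(len(chunk.encode("utf-8")))    # 16384 bytes sent over the network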

This optimization was crucial for our AI project, enabling our text processing pipelines to efficiently handle both small and large files. By identifying and resolving the bottleneck, we not only achieved a significant performance boost but also future-proofed the system for growing demands. As AI applications continue to evolve, small yet impactful optimizations like these will be key to maintaining performance and meeting new challenges.
