AI proxy middlewares are a hack

Written By
Sarmad Qadri

The current architecture for monitoring, model routing and prompt management is not enough.

Now that I have your attention with the headline, let me take a moment to explain what’s going on. We are starting to see a new concept emerge in the LLM Ops workflow — AI proxy middleware.

AI proxies are services that stand between your application and the model inference provider, like OpenAI, Hugging Face, etc. They are responsible for consolidating some important and necessary steps in the generative AI developer workflow:

  • Call different models (LLaMA, GPT*, Mixtral) with a single API — usually by reusing the openai.chat.completions API with a custom baseUrl (see the sketch after this list).

  • Monitor usage, latency, cost, etc.

  • Caching & throttling of inference requests

  • API key management for inference providers
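
To make the pattern concrete, here is a minimal sketch of how an application typically calls such a proxy, assuming a hypothetical endpoint at https://my-ai-proxy.example.com/v1 and a hypothetical proxy-issued key. The only change from calling OpenAI directly is the base URL:

import os
from openai import OpenAI

# Point the standard OpenAI client at the proxy instead of api.openai.com.
# The proxy URL and key below are placeholders, not a real service.
client = OpenAI(
  base_url="https://my-ai-proxy.example.com/v1",
  api_key=os.environ["PROXY_API_KEY"],
)

# The proxy inspects the model name and routes the request to a provider.
response = client.chat.completions.create(
  model="gpt-4",
  messages=[{"role": "user", "content": "Tell me 10 fun attractions to do in NYC."}],
)
print(response.choices[0].message.content)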

We believe this is, unfortunately, the wrong architecture for solving the right problems. Instead, we propose — and demonstrate — that the capabilities these middlewares provide can be handled more gracefully by frameworks and protocols that avoid middleware altogether.

The problem with AI proxy middlewares

The motivating problems that AI proxies are trying to solve are valid and important:

  1. Separation of concerns: Decouple model-specific logic from application code.
    Applications should be able to invoke different models through one consistent API surface, without wrangling model-specific APIs. That lets developers mix models within the same application flow (e.g. GPT-4 for a complicated prompt, and LLaMA or Mixtral for a simpler one).

  2. Monitoring generative AI usage, latency, and cost for your application.

  3. Caching inference requests using semantic caching, and managing throttling.

  4. Managing API keys for different inference providers

Putting all of this in a proxy middleware introduces a new and avoidable set of problems:

  • Monolithic design — monitoring, observability and caching are already well-known concepts in the existing software development workflow, with dedicated systems for each of them. They don’t need to be subsumed by a proxy service.

  • Security risk: The proxy adds an unnecessary service layer between the application and the model provider, so every request (and any user-specific data in it) now passes through an additional party, which you then have to mitigate by encrypting requests and sensitive user data.

  • Latency and lack of debuggability: Routing through the proxy adds an extra network hop on every request to the LLM provider, which can degrade performance. And if something goes wrong, there’s limited visibility into the proxy for debugging.

  • Local vs. remote: A hosted service layer doesn’t work for local models, which are becoming increasingly important as models get smaller and more efficient.

Most importantly, many of these proxy middlewares are closed, managed services run by third parties. That makes them a critical external dependency with no failover strategy.

A better alternative — Frameworks over Proxies

We believe the AI proxy layer can be replaced by an open-source AI framework and storage format that still provides a uniform API, while connecting to dedicated services to handle monitoring, caching and key management separately.

To that effect, we have created AIConfig — a config-driven framework that manages prompts, models and inference settings as JSON-serializable configs.

These configs are the generative AI artifact for an application, and can be version controlled, evaluated, monitored, and edited in a notebook-like playground. In other words, they embed directly into the developer workflow as it is today.

There are other AI frameworks out there, but the two key observations behind AIConfig are:

  1. Prompts, models and inference settings should be saved as config, not code.

  2. A common storage format that is model-agnostic and multi-modal makes switching between models straightforward.

Comparison with proxy middleware

Let’s revisit what the proxy middleware enables and achieve the same capabilities without a man-in-the-middle. More specifically, let’s do it in a way that doesn’t require a monolithic architecture.

Breaking down a monolithic service into its constituent parts allows you to use existing service providers for inference, monitoring, caching, KMS, etc.

✅ Separation of concerns — Decoupling model-specific logic from application code.

By design, AIConfig allows you to store and iterate on prompts separately from application code. It provides a uniform API surface across any model and for any modality.

For example, this aiconfig creates a trip planner app using Gemini and GPT-4:

{
  "name": "NYC Trip Planner",
  "description": "Intrepid explorer with ChatGPT and AIConfig",
  "schema_version": "latest",
  "metadata": {
    "models": {
      "gemini-pro": {
        "model": "gemini-pro"
      },
      "gpt-4": {
        "model": "gpt-4",
        "max_tokens": 3000,
        "system_prompt": "You are an expert travel coordinator with exquisite taste."
      }
    },
    "default_model": "gemini-pro"
  },
  "prompts": [
    {
      "name": "get_activities",
      "input": "Tell me 10 fun attractions to do in NYC.",
      "metadata": {
        "model": "gemini-pro"
      }
    },
    {
      "name": "gen_itinerary",
      "input": "Generate an itinerary ordered by {{order_by}} for these activities: {{get_activities.output}}.",
      "metadata": {
        "model": "gpt-4",
        "parameters": {
          "order_by": "geographic location"
        }
      }
    }
  ]
}

You can invoke either model with the same API:

import asyncio
from aiconfig import AIConfigRuntime, InferenceOptions

async def main():
  # Load the aiconfig
  config = AIConfigRuntime.load('travel.aiconfig.json')

  # Run a Google Gemini prompt (with streaming)
  inference_options = InferenceOptions(stream=True)
  await config.run("get_activities", options=inference_options)

  # Run a GPT-4 prompt (same API!)
  await config.run("gen_itinerary", options=inference_options)

asyncio.run(main())

✅ Monitoring

The framework provides callback handlers for registering usage tracking hooks.

The main philosophy is that monitoring for generative AI isn’t particularly different from monitoring any other service. It should simply hook into wherever you already do monitoring for the rest of your application (Datadog, CloudWatch, Prometheus, etc.).

from aiconfig import CallbackManager, CallbackEvent
import pprint

async def custom_callback(event: CallbackEvent) -> None:
  """
  This is a custom callback that prints the event to stdout.

  Args:
      event (CallbackEvent): The event that triggered the callback.
  """
  print(f"Event triggered: {event.name}")
  pprint.pprint(event, width=150)

callback_manager = CallbackManager([custom_callback])
config.set_callback_manager(callback_manager)
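
As a rough sketch of hooking this into existing infrastructure, the callback below forwards each event to the standard logging module, which most metrics pipelines can already ingest. The logger name and fields are illustrative assumptions, not part of AIConfig; in practice you would call your Datadog, CloudWatch or Prometheus client here instead:

import logging
import time
from aiconfig import CallbackManager, CallbackEvent

logger = logging.getLogger("genai.metrics")

async def metrics_callback(event: CallbackEvent) -> None:
  # Record which event fired and when, in a form any log-based or
  # metrics-based monitoring backend can pick up.
  logger.info("aiconfig_event", extra={"event_name": event.name, "timestamp": time.time()})

config.set_callback_manager(CallbackManager([metrics_callback]))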

✅ Caching

There are already excellent solutions for semantic caching, like GPTCache. Having a framework instead of a proxy allows for a straightforward integration with the best tool for the job.
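
As a simple illustration of the idea (exact-match caching rather than true semantic caching, with a hypothetical cached_run helper, and assuming config.run can be called with just a prompt name as in the earlier examples), a cache can wrap the framework’s run call directly:

import hashlib

_cache: dict[str, object] = {}

async def cached_run(config, prompt_name: str) -> object:
  # Exact-match cache keyed on the prompt name; a semantic cache like
  # GPTCache would key on embedding similarity of the resolved prompt.
  key = hashlib.sha256(prompt_name.encode()).hexdigest()
  if key not in _cache:
    _cache[key] = await config.run(prompt_name)
  return _cache[key]

# Usage: result = await cached_run(config, "get_activities")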

✅ API Key Management

We don’t think API key management is a new problem: there are already good KMS and secrets-management services, and the same ones can be reused for managing inference endpoint keys.
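
For example, here is a minimal sketch that pulls a provider key from AWS Secrets Manager at startup and exposes it through the environment variable the OpenAI SDK reads. The secret name is a placeholder, and any secrets manager (Vault, GCP Secret Manager, etc.) works the same way:

import os
import boto3

def load_openai_key(secret_id: str = "prod/openai-api-key") -> None:
  # Fetch the key from your existing secrets manager rather than a proxy.
  client = boto3.client("secretsmanager")
  secret = client.get_secret_value(SecretId=secret_id)
  os.environ["OPENAI_API_KEY"] = secret["SecretString"]

load_openai_key()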

Framework Extensions

In addition to the above, there are other parts of the generative AI workflow that are critical for building production applications. A framework lets you bring existing tools to bear on these as well:

Evaluation

Having a dedicated config artifact allows developers to define evals for it, and trigger those eval runs as part of CI/CD every time the config changes.

You can read more about generative AI evaluation here.
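
As a rough sketch of what a CI eval could look like, here is a test using pytest and pytest-asyncio, with a trivial keyword assertion standing in for a real eval suite. The test itself is an illustrative assumption, and it assumes AIConfigRuntime exposes a get_output_text accessor for a prompt's latest output:

import pytest
from aiconfig import AIConfigRuntime

@pytest.mark.asyncio
async def test_get_activities_mentions_new_york():
  config = AIConfigRuntime.load("travel.aiconfig.json")
  await config.run("get_activities")
  output = config.get_output_text("get_activities")
  # A trivial check; real evals would score relevance, quality, cost, etc.
  assert "NYC" in output or "New York" in output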

Local Experimentation

A framework can collapse experimentation and productionization of generative AI into a single workflow. For example, an aiconfig can be opened in a notebook-like playground for visual editing and rapid prototyping, and also used in application code.

Governance and version control

As a version controlled artifact, aiconfig can be used for reproducibility and provenance of the generative AI bits of your application. Learn more here: https://github.com/lastmile-ai/aiconfig.

So why are there so many proxy middlewares?

As stated above, proxy middlewares are an attempt to solve legitimate problems with the current generative AI workflow. They are also a convenient way for a vendor to insert itself into the critical path of generative AI development; in a competitive market, that is a way to add a sticky dependency.

However, evaluated purely from the developer’s perspective, the capabilities these proxies provide are better delivered as extensible frameworks that offer hooks into external services.

Protocols over Frameworks

The underlying point in all of this is that we need a standard interaction model that defines how applications work across model providers, how we evaluate generative AI components, how we monitor, etc. (And we haven’t even touched agentic interactions in this blog).

Standardizing this interaction model as a protocol will eventually make the overall developer workflow more streamlined, and lead to a more open developer ecosystem for generative AI. Our early thinking on this is inspired by SMTP for email, and LSP (Language Server Protocol) for IDEs.

There is a need for a Generative AI Protocol that defines the following (a rough interface sketch follows the list):

  • Standardized storage format for prompts, models and inference settings

  • Standardized API for running inference, including routing

  • Evaluating generative AI components

  • Monitoring and callbacks

  • Caching

  • Experimentation and UX
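
To make this less abstract, here is one possible shape such a protocol surface could take in code. This is purely a speculative sketch of the idea, not an existing API; every name below is hypothetical:

from typing import Any, Awaitable, Callable, Protocol

class GenAIProtocol(Protocol):
  """A hypothetical, minimal surface for a generative AI protocol."""

  def load(self, path: str) -> None:
    """Load a standardized artifact of prompts, models and inference settings."""

  async def run(self, prompt_name: str, params: dict | None = None) -> Any:
    """Run inference for a named prompt, routing to the configured model."""

  async def evaluate(self, prompt_name: str, evaluator: Callable[[Any], Awaitable[float]]) -> float:
    """Score the output of a prompt with a caller-supplied evaluator."""

  def register_callback(self, callback: Callable[[Any], Awaitable[None]]) -> None:
    """Hook monitoring, caching and other callbacks into every inference call."""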

AIConfig is a step in this direction, and we’d love to collaborate with the community to take it forward.

