Advanced Architecture for AI Application (AKA AAAA!)

Surprise! This is a bonus blog post for the AI for Web Devs series I recently wrapped up. If you haven’t read that series yet, I’d encourage you to check it out.

This post will look at the existing project architecture and ways we can improve it for both application developers and the end user.

I’ll be discussing some general concepts, and using specific Akamai products in my examples.

Basic Application Architecture

The existing application is pretty basic. A user submits two opponents, then the application streams back an AI-generated response of who would win in a fight.

The architecture is also simple:

  1. The client sends a request to a server.

  2. The server constructs a prompt and forwards the prompt to OpenAI.

  3. OpenAI returns a streaming response to the server.

  4. The server makes any necessary adjustments and forwards the streaming response to the client.
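
As a rough sketch, the server-side part of those four steps could look something like this. The `StreamingClient` type and `fakeClient` function are hypothetical stand-ins for the real OpenAI client (they are not part of any SDK), and the prompt wording is made up for illustration.

```typescript
// Hypothetical stand-in for a streaming LLM client: takes a prompt and
// yields the response in chunks. The real OpenAI client is not shown here.
type StreamingClient = (prompt: string) => AsyncIterable<string>;

// Step 2: build a prompt from the two opponents (wording is illustrative).
function buildPrompt(opponent1: string, opponent2: string): string {
  return `Who would win in a fight between ${opponent1} and ${opponent2}?`;
}

// Steps 2-4: forward the prompt upstream and relay each chunk as it arrives.
async function* handleFight(
  opponent1: string,
  opponent2: string,
  client: StreamingClient
): AsyncGenerator<string> {
  const prompt = buildPrompt(opponent1, opponent2);
  for await (const chunk of client(prompt)) {
    // Any "necessary adjustments" would go here; this sketch passes chunks through.
    yield chunk;
  }
}

// A fake client so the sketch runs without network access.
async function* fakeClient(prompt: string): AsyncGenerator<string> {
  yield "Echo: ";
  yield prompt;
}

// Helper that drains a stream into one string, e.g. for logging.
async function collect(stream: AsyncIterable<string>): Promise<string> {
  let text = "";
  for await (const chunk of stream) text += chunk;
  return text;
}
```

Passing the client in as a parameter is just a convenience for this sketch: it keeps the flow visible without needing an API key.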

I used Akamai’s cloud compute services (formerly Linode) but this would be the same for any hosting service, really.

Architecture diagram showing a client connecting to a server inside the Cloud, which forwards the request to OpenAI, then returns to the server and back to the client.

🤵 looks like a server at a fancy restaurant, and 👁️‍🗨️ is “a eye”, or AI. lolz

Technically this works fine, but there are a couple of problems, particularly when users make duplicate requests. It could be faster and more cost-effective to store responses on our server and only go to OpenAI for unique requests.

This assumes we don’t need every single response to be non-deterministic (where the same input can produce a different output). Let’s assume it’s OK for the same input to produce the same output. After all, a prediction for who would win in a fight wouldn’t likely change.

Add Database Architecture

If we want to store responses from OpenAI, a practical place to put them is in some sort of database that allows for quick and easy lookup using the two opponents. This way, when a request is made, we can check the database first:

  1. The client sends a request to a server.

  2. The server checks for an existing entry in the database that matches the user’s input.

  3. If a previous record exists, the server responds with that data, and the request is complete. Skip the following steps.

  4. If not, the server continues from step two in the previous flow (constructing the prompt and forwarding it to OpenAI).

  5. Before closing the response, the server stores the OpenAI results in the database.
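
The flow above is essentially a check-then-store cache. Here is a minimal sketch of it, assuming a hypothetical `ResponseStore` interface in place of a real database and an `askOpenAI` function that is passed in rather than implemented. The key format (trimmed, lowercased names joined with `|`) is also just an assumption for the example.

```typescript
// Hypothetical storage interface; a real app would back this with a database.
interface ResponseStore {
  get(key: string): Promise<string | undefined>;
  set(key: string, value: string): Promise<void>;
}

// In-memory stand-in so the sketch is runnable.
class MemoryStore implements ResponseStore {
  private data = new Map<string, string>();
  async get(key: string): Promise<string | undefined> {
    return this.data.get(key);
  }
  async set(key: string, value: string): Promise<void> {
    this.data.set(key, value);
  }
}

// Normalize the input so "Bear" and " bear " map to the same record.
// Order still matters here: (bear, shark) and (shark, bear) are different keys.
function cacheKey(opponent1: string, opponent2: string): string {
  return [opponent1, opponent2]
    .map((name) => name.trim().toLowerCase())
    .join("|");
}

async function getFightResult(
  opponent1: string,
  opponent2: string,
  store: ResponseStore,
  askOpenAI: (o1: string, o2: string) => Promise<string>
): Promise<{ text: string; cached: boolean }> {
  const key = cacheKey(opponent1, opponent2);

  // Steps 2-3: return the stored record if one exists.
  const existing = await store.get(key);
  if (existing !== undefined) return { text: existing, cached: true };

  // Steps 4-5: otherwise ask OpenAI, then store the result for next time.
  const text = await askOpenAI(opponent1, opponent2);
  await store.set(key, text);
  return { text, cached: false };
}
```

For simplicity this version waits for the full text before storing it. With a streaming response, the server would need to collect the chunks as they pass through and write the combined text once the stream ends.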

Architecture diagram showing a client connecting to a server inside the Cloud, which checks for data in a database, then optionally forwards the request to OpenAI to get the results, then returns the data back to the client.

Dotted lines represent optional requests, and the 💽 kind of looks like a hard disk.

With this setup, any duplicate requests will be handled by the database. By making some of the OpenAI requests optional, we can potentially reduce the amount of latency users experience, plus save money by reducing the number of API requests.

This is a good start, especially if the server and the database exist in the same region. It would make for much quicker response times than going to OpenAI’s servers.

However, as our application becomes more popular, we may start getting users from all over the world. Faster database lookups are great, but what happens if the bottleneck is the latency from the time spent in flight?

We can address that concern by moving things closer to the user.

Bring in Edge Compute

If you’re not already familiar with the term “edge”, this part might be confusing, but I’ll try to explain it simply. Edge refers to content being as close to the user as possible. For some people, that could mean IoT devices or cellphone towers, but in the case of the web, the canonical example is a Content Delivery Network (CDN).

I’ll spare you the details, but a CDN is a network of globally distributed computers that can respond to user requests from the nearest node in the network (something I’ve written about in the past). While traditionally they were designed for static assets, in recent years, they started supporting edge compute (also something I’ve written about in the past).

With edge compute, we can move a lot of our backend logic super close to the user, and it doesn’t stop at compute. Most edge compute providers also offer some sort of eventually-consistent key-value store in the same edge nodes.

How could that impact our application?

  1. The client sends a request to our backend.

  2. The edge compute network routes the request to the nearest edge node.

  3. The edge node checks for an existing entry in the key-value store that matches the user’s input.

  4. If a previous record exists, the edge node responds with that data and the request is complete. Skip the following steps.

  5. If not, the edge node forwards the request to the origin server, which passes it along to OpenAI and yadda yadda yadda.

  6. Before closing the response, the server stores the OpenAI results in the edge key-value store.
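
To make the split of responsibilities a bit more concrete, here is a sketch where the edge node only reads from the key-value store and the origin server is the one that writes to it. The `KeyValueStore` interface, `edgeHandler`, and `createOrigin` are all hypothetical names for illustration. This is not the EdgeWorkers or EdgeKV API.

```typescript
// Hypothetical key-value interface. This is NOT the EdgeKV API; it only
// models "read at the edge, write from the origin".
interface KeyValueStore {
  get(key: string): Promise<string | undefined>;
  put(key: string, value: string): Promise<void>;
}

// The origin is modeled as a function from a cache key to a response body.
type Origin = (key: string) => Promise<string>;

// Steps 3-5: runs on the edge node nearest to the user.
async function edgeHandler(key: string, kv: KeyValueStore, origin: Origin) {
  const hit = await kv.get(key);
  if (hit !== undefined) {
    return { body: hit, servedFrom: "edge" as const };
  }
  // Nothing stored yet, so forward the request to the origin server.
  const body = await origin(key);
  return { body, servedFrom: "origin" as const };
}

// Step 6: runs on the origin server. It generates the result (by calling
// OpenAI in the real app) and stores it in the key-value store.
function createOrigin(
  kv: KeyValueStore,
  generate: (key: string) => Promise<string>
): Origin {
  return async (key) => {
    const body = await generate(key);
    await kv.put(key, body);
    return body;
  };
}
```

Because the store is eventually consistent, a write may take a moment to show up at other edge nodes, so a few duplicate requests can still reach the origin right after the first one.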

The edge node is the blue box, represented by 🔪 because it has an edge. EdgeWorkers is Akamai’s edge compute product, represented by 🧑‍🏭, and EdgeKV is Akamai’s key-value store, represented by 🔑🤑🏪. The edge box is closer to the client than the origin server in the cloud to represent physical distance.

The origin server may not be strictly necessary here, but I think it’s more likely to be there. For the sake of data, compute, and logic flow, this is mostly the same as the previous architecture. The main difference is that the previously stored results now exist super close to users and can be returned almost immediately.

(Note: although the data is being cached at the edge, the response is still dynamically constructed. If you don’t need dynamic responses, it may be simpler to use a CDN in front of the origin server and set the correct HTTP headers to cache the response. There’s a lot of nuance here, and I could say more but…well, I’m tired and don’t really want to. Feel free to reach out if you have any questions.)
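
For the simpler CDN-only option mentioned in the note, a sketch of the origin's side might look like this. It assumes the request is a GET with the opponents in the query string, since caches generally key on the URL. The `/api/fight` path and the parameter names are hypothetical, and whether a given CDN honors these headers depends on how it is configured.

```typescript
// Build a cacheable URL for a pair of opponents.
// The path and parameter names are made up for this example.
function fightUrl(opponent1: string, opponent2: string): string {
  const o1 = encodeURIComponent(opponent1);
  const o2 = encodeURIComponent(opponent2);
  return `/api/fight?opponent1=${o1}&opponent2=${o2}`;
}

// Response headers that allow a shared cache (such as a CDN) to store the response.
function cacheHeaders(maxAgeSeconds: number): Record<string, string> {
  return {
    "Content-Type": "text/plain; charset=utf-8",
    // `public` lets shared caches store the response;
    // `s-maxage` sets the lifetime for shared caches only, in seconds.
    "Cache-Control": `public, s-maxage=${maxAgeSeconds}`,
  };
}
```

With something like `cacheHeaders(86400)`, a CDN that respects origin headers could keep a response for up to a day before asking the origin again.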

Now we’re cooking! Any duplicate requests will be responded to almost immediately, while also saving us unnecessary API requests.

This sorts out the architecture for the text responses, but we also have AI-generated images.

Cache Those Images

The last thing we’ll consider today is images. When dealing with images, we need to think about delivery and storage. I’m sure that the folks at OpenAI have their own solutions, but some organizations want to own the entire infrastructure for security, compliance, or reliability reasons. Some may even run their own image generation services instead of using OpenAI.

In the current workflow, the user makes a request that ultimately makes its way to OpenAI. OpenAI generates the image but doesn’t return it. Instead, they return a JSON response with the URL for the image, hosted on OpenAI’s infrastructure. With this response, an <img> tag can be added to the page using the URL, which kicks off another request for the actual image.

If we want to host the image on our own infrastructure, we need a place to store it. We could write the images onto the origin server’s disk, but that could quickly use up the disk space, and we’d have to upgrade our servers, which can be costly. Object storage is a much cheaper solution (I’ve also written about this). Instead of using the OpenAI URL for the image, we could upload it to our own object storage instance and use that URL instead.
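
Here is one way the "upload it and use our own URL" step could be sketched. Everything provider-specific is an assumption: `ObjectStorage` is a hypothetical interface (not any real SDK), the `download` function is passed in, and the `fights/` prefix, `.png` extension, and base URL are placeholders.

```typescript
// Hypothetical object storage interface; real providers have their own SDKs.
interface ObjectStorage {
  put(key: string, data: Uint8Array, contentType: string): Promise<void>;
}

// Turn a name into a URL-safe slug: "The Shark!" becomes "the-shark".
function slug(name: string): string {
  return name
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Derive a predictable object key from the two opponents.
// Note: different inputs can share a slug, so a real app may want a hash instead.
function imageObjectKey(opponent1: string, opponent2: string): string {
  return `fights/${slug(opponent1)}-vs-${slug(opponent2)}.png`;
}

// Copy the generated image into our own bucket and return our URL for it.
async function rehostImage(
  sourceUrl: string,
  objectKey: string,
  download: (url: string) => Promise<Uint8Array>,
  storage: ObjectStorage,
  publicBaseUrl: string,
  contentType: string
): Promise<string> {
  const data = await download(sourceUrl);
  await storage.put(objectKey, data, contentType);
  return `${publicBaseUrl}/${objectKey}`;
}
```

The returned URL is the one that would go into the `<img>` tag instead of the OpenAI URL.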

That solves the storage question, but object storage buckets are generally deployed to a single region. This echoes the problem we had with storing text in a database. A single region may be far away from users, which could cause a lot of latency.

Having introduced the edge already, it would be pretty trivial to add CDN features for just the static assets (frankly, every site should have a CDN). Once configured, the CDN will pull images from object storage on the initial request and cache them for any future requests from visitors in the same region.

Here’s how our flow for images would look:

  1. The client sends a request to generate an image based on their opponents.

  2. Edge compute checks if the image data for that request already exists. If so, it returns the URL.

  3. The image is added to the page with the URL and the browser requests the image.

  4. If the image has been previously cached in the CDN, the browser loads it almost immediately. This is the end of the flow.

  5. If the image has not been previously cached, the CDN will pull the image from the object storage location, cache a copy of it for future requests, and return the image to the client. This is another end of the flow.

  6. If the image data is not in the edge key-value store, the request to generate the image goes to the server and on to OpenAI, which generates the image and returns the URL information. The server starts a task to save the image in the object storage bucket, stores the image data in the edge key-value store, and returns the image data to edge compute.

  7. With the new image data, the client creates the image, which creates a new request and continues from step five above.
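
Steps four and five describe what is often called a pull-through cache. A toy, in-memory model of that behavior is sketched below. The `PullThroughCache` class and `OriginFetch` type are hypothetical, and a real CDN also deals with expiry, eviction, and many nodes, none of which are modeled here.

```typescript
// Hypothetical origin lookup: returns the stored bytes, or undefined if missing.
// In the flow above, the origin would be the object storage bucket.
type OriginFetch = (path: string) => Uint8Array | undefined;

class PullThroughCache {
  private cache = new Map<string, Uint8Array>();
  private fetchFromOrigin: OriginFetch;
  // Counts how often the cache had to go back to the origin.
  originPulls = 0;

  constructor(fetchFromOrigin: OriginFetch) {
    this.fetchFromOrigin = fetchFromOrigin;
  }

  get(path: string): { data: Uint8Array; cacheStatus: "HIT" | "MISS" } | undefined {
    // Step 4: already cached, so respond right away.
    const cached = this.cache.get(path);
    if (cached) return { data: cached, cacheStatus: "HIT" };

    // Step 5: pull from the origin, keep a copy, then respond.
    this.originPulls += 1;
    const data = this.fetchFromOrigin(path);
    if (!data) return undefined; // nothing at the origin either
    this.cache.set(path, data);
    return { data, cacheStatus: "MISS" };
  }
}
```

The first request for an image is a miss that pulls from the bucket, and later requests for the same path are served from the cached copy.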

Architecture diagram showing a client connecting to an edge node which checks the edge key-value store, then optionally passes the request to a cloud server and on to OpenAI before returning the data to the client. Additionally, if the user makes a request for an image, the request will check a CDN first, and if it doesn't exist, will pull it from Object Storage where it was placed from OpenAI.

Content delivery network denoted by delivery truck (🚚) and network signal (📶), and object storage denoted by socks in a box (🧦📦), or objects in storage. This caption is probably not necessary, as I think these are clear, but I’m too proud of my emoji game and require validation. Thank you for indulging me. Carry on.

This last architecture is, admittedly, a little bit more complex, but if your application is going to handle serious traffic, it’s worth considering.

Voilà

Right on! With all those changes in place, we generate AI text and images for unique requests and serve cached content from the edge for duplicate requests. The result is faster response times and a much better user experience (in addition to fewer API calls).

I kept these architecture diagrams applicable across various databases, edge compute, object storage, and CDN providers on purpose. I like my content to be broadly applicable. But it’s worth mentioning that integrating the edge is about more than just performance. There are a lot of really cool security features you can enable as well.

For example, on Akamai’s network, you can have access to things like web application firewall (WAF), distributed denial of service (DDoS) protection, intelligent bot detection, and more. That’s all beyond the scope of today’s post, though.

So for now, I’ll leave you with a big “thank you” for reading. I hope you learned something. And as always, feel free to reach out any time with comments, questions, or concerns.

Thank you so much for reading. If you liked this article, and want to support me, the best ways to do so are to share it, sign up for my newsletter, and follow me on Twitter.


Originally published on austingil.com.
