Building AI apps

As a technology matures, the focus gradually shifts from pure research to development concerns such as deployment, scaling, and benchmarking. Just as general AI usage by non-programmers has surged, tools for AI application development have also progressed significantly, especially in the last few years. And while I think there is still much progress to be made by AI researchers, it’s fun to learn about some of the things AI developers deal with if you are thinking about building an AI-powered app yourself in 2025.

Implementing a backend to query LLMs: APIs and vector databases

The major AI vendors host LLMs on their own hardware, which users can access via RESTful APIs after buying access tokens. Alternatively, as I mentioned previously, you can download open-source LLMs and run them locally. With this approach, you don’t need access tokens, and your queries and results remain on your computer. You can launch an LLM and interact with it directly on your local machine using the Ollama client, which sends and receives REST requests locally under the hood (vLLM and LM Studio work similarly). This Ollama chat interface can feel like a functional app already: you can ask the running LLM to do sentiment analysis on a text passage, or describe an image, without writing a single line of code.
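To make the “REST under the hood” part concrete, here is a minimal sketch of querying a local Ollama server directly over its HTTP API, using only the Python standard library. The model name `llama3.2` is just an example; substitute whatever model you have pulled.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Assemble a POST request for Ollama's /api/generate endpoint."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        OLLAMA_URL,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

def query_ollama(model: str, prompt: str) -> str:
    """Send the prompt to the local server and return the model's text reply."""
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]

# Example (requires a running local Ollama server with the model pulled):
# print(query_ollama("llama3.2", "Summarize RESTful APIs in one sentence."))
```

This is exactly the kind of request the Ollama chat client makes for you; libraries like LangChain just wrap the same calls in a friendlier interface.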

In addition to these vendor-provided RESTful APIs, which you can access through Python libraries like LangChain and PydanticAI (introduced to me by my former colleague Dan), the opaquely-named Model Context Protocol (or simply MCP) bills itself as “the USB-C port of AI applications,” providing SDKs in popular languages for connecting LLMs to common tools and data sources (the name is less opaque once you realize “Model” here refers to LLMs). Beyond serving API calls, MCP also features dynamic self-discovery: a client can ask a server at runtime what functionality is available, and make that accessible to the programmer.

As is typical of application development, you’ll need your AI app to store and retrieve data. Since many LLM operations depend on vectors (particularly embeddings), you’ll want to use a vector database like Chroma, Pinecone, Milvus, etc. Here’s a quick high-level comparison of vector databases.
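Whatever the vendor, these databases all do the same core job: store embedding vectors and answer nearest-neighbor queries. Here’s a toy sketch of that idea, with hard-coded three-dimensional “embeddings” standing in for what a real embedding model would produce:

```python
import math

# Toy "embeddings": in a real app these come from an embedding model, and a
# vector database indexes them for fast (approximate) nearest-neighbor search.
docs = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.1, 0.9, 0.2],
    "doc_c": [0.8, 0.2, 0.1],
}

def cosine(u, v):
    """Cosine similarity: how closely two vectors point in the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(query, k=2):
    """Return the k stored documents most similar to the query vector."""
    return sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)[:k]

print(nearest([1.0, 0.0, 0.0]))  # → ['doc_a', 'doc_c']
```

A real vector database adds persistence, metadata filtering, and approximate indexes so this lookup stays fast at millions of vectors, but the query semantics are the same.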

Putting a pleasant frontend on your AI app: Streamlit and Gradio

Once you start querying an LLM and getting results back, you technically have your first AI app. However, these prototypes are usually Python scripts or IPython notebooks that are clumsy to use. Streamlit is a common framework for Python apps, as I have written previously, and works for AI apps too. Another option is Gradio, which can embed interactive elements in your notebook or serve the entire app as a stand-alone webpage.

Building more complex AI apps: workflows, agents and orchestration

As your app becomes more complex, it’s good to consider how you would architect future iterations. Anthropic (who created MCP) has some useful observations and recommendations: start by building apps as workflows, which chain several LLMs or LLM-querying parts together. This simple design lets you see whether any individual part is malfunctioning, easing troubleshooting. As your app grows more complicated, some parts can augment the knowledge you get from the LLM(s). One such augmentation scheme is Retrieval-Augmented Generation (RAG), where additional information is retrieved by the AI app and used to generate the final output. Depending on the size and frequency of these data retrievals, Cache-Augmented Generation (CAG) can be more appropriate than RAG, as discussed here.
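A toy RAG sketch makes the pattern visible: retrieve relevant context, then hand it to the LLM along with the question. Retrieval here is naive keyword overlap, a stand-in for the embedding similarity search a vector database provides, and `call_llm` is a stub representing any LLM backend; all the names and documents below are illustrative.

```python
import re

# Documents the app can draw on; in a real RAG setup these would live in a
# vector database and be retrieved by embedding similarity.
DOCS = [
    "Chroma is an open-source vector database for embeddings.",
    "Streamlit turns Python scripts into shareable web apps.",
    "Workflows chain several LLM-querying parts together.",
]

def tokens(text: str) -> set:
    return set(re.findall(r"\w+", text.lower()))

def retrieve(question: str, k: int = 1) -> list:
    """Return the k documents sharing the most words with the question."""
    return sorted(DOCS, key=lambda d: len(tokens(d) & tokens(question)),
                  reverse=True)[:k]

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g. via Ollama or a vendor API)."""
    return f"[LLM answer grounded in: {prompt.splitlines()[1]}]"

def rag_answer(question: str) -> str:
    # Augment the prompt with retrieved context before generation.
    context = "\n".join(retrieve(question))
    prompt = f"Answer using this context:\n{context}\n\nQuestion: {question}"
    return call_llm(prompt)
```

Because each stage (retrieve, assemble prompt, generate) is a separate function, a bad answer can be traced to a bad retrieval or a bad prompt, which is the troubleshooting benefit of the workflow style.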

Eventually, you will encounter a use case where simply chaining tools together isn’t enough: maybe the outputs from your tools aren’t predictable, or the steps in your workflow change dynamically depending on input. In that situation, you may move beyond workflows to agents, which can cope with this uncertainty but are also more complex. You can even use more than one agent in a multi-agent system, which requires orchestration.
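The distinction can be sketched in a few lines: instead of a fixed chain, an agent loop lets a decision step choose the next tool based on the current state. In a real agent that decision comes from an LLM; here it is a hard-coded stub, and every name below is illustrative.

```python
def decide_next(state: dict) -> str:
    """Stub for an LLM deciding which tool to invoke next, given the state."""
    if "summary" not in state:
        return "summarize"
    if "sentiment" not in state:
        return "analyze"
    return "done"

# Tools are plain functions from state to updated state; their results here
# are canned so the control flow is easy to follow.
TOOLS = {
    "summarize": lambda s: {**s, "summary": f"summary of {s['text']!r}"},
    "analyze": lambda s: {**s, "sentiment": "positive"},
}

def run_agent(text: str, max_steps: int = 5) -> dict:
    state = {"text": text}
    for _ in range(max_steps):  # cap iterations so the loop can't run away
        action = decide_next(state)
        if action == "done":
            break
        state = TOOLS[action](state)
    return state
```

Orchestrating multiple agents layers another such decision step on top of several of these loops, which is where the added complexity (and the need for dedicated frameworks) comes from.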

In the same article, Anthropic also listed some agent frameworks available at the time: Rivet, Vellum, and Amazon Bedrock Agents, though there will likely be others as time goes on.

Benchmarking your AI app

Another common task in application development is measuring performance, and there are many benchmarks available depending on the task your app performs, as well as benchmarks for its safety: SWE-bench is commonly used for coding agents, AI2 ARC for science question answering, WinoGrande for commonsense reasoning, and HarmBench for assessing AI safety, among others. While very important to app development, AI safety is a large and rapidly evolving field beyond the scope of this blog post.

Weekend project: AIBookButler

I had written back in 2019 that recommender systems are really interesting, and while they haven’t needed AI to work well (in that post I built a basic movie recommender using Spark and scikit-learn), I wanted to make another recommender from scratch as a project to explore AI tools.

Since I loved using Pandora in college, I first thought about an AI music recommender. But a bit of digging revealed two problems: music that is royalty-free and contains robust metadata is scarce, and audio data has many higher-order acoustic and temporal properties that greatly increase the number of features the recommender has to train on. A useful, accurate general-purpose music recommender would take money and time, not to mention the headaches involved in licensing commercial music. All this made it too much for a weekend project.

So instead I put together the beginnings of AIBookButler in a long afternoon. The app takes some book metadata (title, author, synopsis, etc., but not the full text itself) and loads it into a vector database (I used Chroma) along with text embeddings generated via LangChain. This allows the database to be queried using similarity search, which returns books related to your query. As you can see in the project repo, this doesn’t require very much code, partly because I used a very small subset of a public Kaggle dataset that required minimal data cleaning. This is something AI projects share with old-school data science: garbage in, garbage out. If this had been a production tool with millions of rows of input, the project would have needed a significant amount of data cleaning.

Another part of the app performs some sentiment analysis on text that the user enters. This was simply a matter of wrapping Streamlit around LLM chat functionality, though later it would be nice to train the LLM on books it hasn’t been exposed to.

Written on June 6, 2025