Current Best Solution for Chat-Enabled Services
A technical survey of the state of the art and market for large language models.
I’ve been doing some work on new natural-language or “chat-enabled” services for a company in the finance industry. In contrast to general chatbots, these services require a Large Language Model (LLM) to be augmented with two technical features: (1) Functions, which give the bot access to user-specific data from the company’s API, and (2) Embeddings, which enable the bot to incorporate information from proprietary documents. The finance industry also demands a high level of security and privacy for any system that can touch financial or client data.
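To make the “Functions” idea concrete, here is a minimal sketch of the pattern: the LLM emits a structured call, and application code executes it against the company’s API. The function name, the JSON shape, and the stand-in API below are all illustrative assumptions, not any vendor’s actual schema.

```python
# Hypothetical sketch of LLM function calling. The model is instructed
# (via its prompt or a tool schema) to answer data questions by emitting
# a JSON "function call"; our code parses it and runs the real lookup.
import json

def get_account_balance(account_id: str) -> dict:
    """Stand-in for a call to the company's API (fabricated data)."""
    fake_api = {"acct-001": {"balance": 2500.00, "currency": "USD"}}
    return fake_api.get(account_id, {"error": "unknown account"})

AVAILABLE_FUNCTIONS = {"get_account_balance": get_account_balance}

def dispatch(llm_function_call: str) -> dict:
    """Parse the model's JSON function call and invoke the matching function."""
    call = json.loads(llm_function_call)
    fn = AVAILABLE_FUNCTIONS[call["name"]]
    return fn(**call["arguments"])

# Simulated model output for "What's my balance?":
result = dispatch('{"name": "get_account_balance", "arguments": {"account_id": "acct-001"}}')
print(result)  # {'balance': 2500.0, 'currency': 'USD'}
```

The point is that the LLM never touches the API directly; it only proposes calls, which the host application validates and executes.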
Isn’t this just ChatGPT? By now everybody is familiar with ChatGPT, OpenAI’s chatbot interface to the company’s LLMs. It’s easy for anyone to license and deploy a chat-enabled service directly from OpenAI. But, as I’ll explain shortly, OpenAI doesn’t meet all the requirements here.
What about open source? Sometimes it’s better to maintain the control you get with open-source systems than to go with a third-party cloud-based solution. And it turns out that the number of open-source LLMs has exploded. (For a good time, try sorting through the literally hundreds of thousands of models indexed here!) But an LLM alone does not make a complete natural-language service. Creating an effective chatbot interface for an LLM is not trivial: maintaining context and controlling attention are ongoing areas of research and development. LLM tuning and prompt engineering can also require extensive and specialized work. Techniques for embedding content in an LLM are still rapidly evolving. Do you have a team of researchers keeping up with the state of the art in each of these areas? A big part of the value in going with a large cloud LLM provider is that these technical details are increasingly handled under the hood in their offerings.
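For readers unfamiliar with how embedded content actually reaches the model, here is a toy sketch of the retrieval step: document chunks and the user’s question are mapped to vectors by an embedding model, and the closest chunk is handed to the LLM as extra context. The 4-dimensional vectors below are made-up stand-ins; real embedding models return hundreds or thousands of dimensions.

```python
# Minimal sketch of embedding-based retrieval: score each document
# chunk against the question by cosine similarity and pick the best.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Pretend embeddings of three proprietary-document chunks:
chunks = {
    "fee schedule":   [0.9, 0.1, 0.0, 0.2],
    "wire transfers": [0.1, 0.8, 0.3, 0.0],
    "tax documents":  [0.0, 0.2, 0.9, 0.1],
}
# Pretend embedding of the question "What are your fees?":
question_vec = [0.85, 0.15, 0.05, 0.1]

best = max(chunks, key=lambda name: cosine(chunks[name], question_vec))
print(best)  # fee schedule
```

The engineering difficulty mentioned above lives in everything around this loop: chunking strategy, embedding-model choice, index scaling, and keeping retrieved context within the LLM’s window.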
I looked at technical solutions incorporating both 3rd- and 4th-generation LLMs. What is the difference? Gen 3 (example: GPT-3.5) is like having a professional writer who has the reasoning skills of a high schooler with ADD. Gen 4 is like having a great writer with an advanced degree.
For now, Gen 4 LLMs are only offered as cloud-based services. Most are priced around $20/MMtoken, except that Google seems intent on undercutting all competition on pricing and is presently under $1/MMtoken.¹ Here are the current Gen 4 offerings:
Google Gemini Pro: Google’s enterprise AI services meet the same levels of enterprise security as Google Cloud’s other products. That is, if a company is comfortable storing data in Google Drive, then it should be comfortable passing it through Gemini.
OpenAI GPT-4: OpenAI’s enterprise services are SOC 2 compliant. However, as I discovered when I built a prototype, OpenAI’s Functions and Embedding features are still in beta, and the company has been aggravatingly unresponsive to developers wondering when those will be ready for production.
Anthropic Claude-2.1: It is unlikely that Anthropic’s current security policies would meet the requirements of the finance industry.
Other possible options like Falcon-180B and Mistral-Large are a step down from the service levels that can be expected from the likes of OpenAI and Google, and are unlikely to be able to meet the security and privacy requirements of the finance sector.
All of these services include native support for embeddings and functions. This is an important consideration because, as mentioned earlier, there is significant technical risk and cost associated with implementing those features using open-source alternatives.
Local / On-Premises Options
Another way to avoid security concerns is to stay off the cloud altogether. Presently, no off-cloud LLMs provide Gen 4 performance. And getting generalized Gen 3 performance requires serious hardware: consider two of the most established open-source models that perform at the level of OpenAI’s GPT-3.5, Mixtral 8x7B and LLaMa-2 70B. To run at a tolerable speed (about 50 tokens/second), these require dual RTX 3090 or 4090 GPUs, which presently cost about $2k each, or something like an Apple Silicon M2 Ultra, which is also about $4k.
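As a sanity check on what “tolerable” means, here is the back-of-envelope arithmetic behind that 50 tokens/second figure. The 300-token reply length is an assumption on my part (roughly 225 words at the usual ~0.75 words per token), not a measured figure.

```python
# How long does a local model at ~50 tokens/second take to stream a
# typical chatbot reply? Reply length is an assumed, illustrative value.
tokens_per_second = 50
reply_tokens = 300          # assumed typical reply (~225 words)

seconds = reply_tokens / tokens_per_second
print(f"{seconds:.0f} s to stream a {reply_tokens}-token reply")  # 6 s
```

Six seconds of streaming text is acceptable for chat; halve the throughput and the experience starts to feel sluggish, which is why weaker hardware is a hard sell here.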
Like all LLMs, open-source models can also be run in the cloud. As a benchmark for cloud pricing: Mistral’s Gen 3 API is currently $0.70/MMtoken. (Note that this is comparable to Google’s price for its Gen 4 API.)
Lighter Gen 3 models (e.g., the 7B versions of LLaMa-2, Gemma, and Mistral) can be tailored to perform well in specific domains, so a chatbot that doesn’t require as much intelligence could probably be developed to run on a typical PC. But this would carry significant technical risk and require substantial time to fine-tune an LLM – itself a specialized field of expertise.
The Winner
Right now, Google’s Gemini API is the clear solution to support Function- and Embedding-enabled chatbots. It offers the latest technology and features, meets the highest standards of security, and also happens to be the cheapest option on the market.
¹ Google prices their API by the character, not the token, so I made a rough conversion into token terms. One MMtoken is roughly 750,000 words, and the cost is paid for tokens both sent to and received from the LLM. Where price is a consideration, all of the Gen 4 APIs make it easy to route usage to their less expensive Gen 3 models if/when desired.
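For anyone wanting to redo the conversion with current rates, here is the arithmetic, using the common rule of thumb of ~4 characters per token. The per-character price below is a placeholder I chose so the result lands in the "under $1/MMtoken" range quoted above; it is not a quoted Google rate.

```python
# Convert a per-1,000-character price into $/MMtoken. The price here is
# a placeholder for illustration, not an actual published rate.
price_per_1k_chars = 0.000125   # hypothetical $ per 1,000 characters
chars_per_token = 4             # common rule of thumb for English text

price_per_mm_tokens = price_per_1k_chars / 1000 * chars_per_token * 1_000_000
print(f"${price_per_mm_tokens:.2f} per MMtoken")  # $0.50 per MMtoken
```

Swap in the provider’s current per-character rate to keep the comparison against per-token competitors honest.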