The winning model for knowledge base bots - Llama3 vs Claude3 vs GPT-4

May 10, 2024
Jerry Jeon
Software Engineer

Introduction

Knowledge base bots have become an essential tool for managing large pools of information and providing instant responses based on accumulated data. The language models behind them often have specific restrictions, such as a knowledge cutoff date, which limits responses to information available up to that date, and no access to private information, which protects security and privacy.

Retrieval-Augmented Generation (RAG) is a technology that enhances knowledge base bots by integrating a retrieval mechanism. This allows the model to fetch relevant documents and use them as references to generate more informed and accurate responses.
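
As a rough illustration of the retrieve-then-generate flow described above, here is a minimal Python sketch. It is not Sendbird's implementation: the word-overlap scoring, the function names (`retrieve`, `build_prompt`), the prompt wording, and the sample documents are all assumptions made for illustration. Production systems typically rank documents by embedding similarity rather than shared words.

```python
import re
from collections import Counter

# Hypothetical stand-ins for indexed sources (illustrative text only).
SAMPLE_DOCS = [
    "Wikipedia page about the 96th Academy Awards: nominees, winners, notable events.",
    "Census Bureau report on commuting patterns in the United States in 2022.",
    "Sendbird FAQ: common questions about Sendbird services and products.",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(query, doc):
    """Count the words shared by the query and the document (toy relevance score)."""
    query_words = Counter(tokenize(query))
    doc_words = Counter(tokenize(doc))
    return sum(min(count, doc_words[word]) for word, count in query_words.items())


def retrieve(query, docs, k=2):
    """Return the k documents with the highest overlap score, best first."""
    ranked = sorted(docs, key=lambda doc: overlap_score(query, doc), reverse=True)
    return ranked[:k]


def build_prompt(query, docs, k=2):
    """Put the retrieved documents in front of the question as numbered references."""
    references = "\n".join(
        f"[{i + 1}] {doc}" for i, doc in enumerate(retrieve(query, docs, k))
    )
    return (
        "Answer using only the references below.\n\n"
        f"References:\n{references}\n\n"
        f"Question: {query}"
    )


if __name__ == "__main__":
    # The resulting prompt is what would be sent to the LLM of your choice.
    print(build_prompt("Who were the winners at the Academy Awards?", SAMPLE_DOCS))
```

The model never sees the whole knowledge base, only the few references selected for the question, which is how RAG works around both the knowledge cutoff and the lack of access to private data.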

At Sendbird, we utilize RAG to enhance our knowledge base bot, supporting a variety of Large Language Models (LLMs). Given the rapid emergence of many models recently, I was curious about which model best serves as a knowledge base bot. To satisfy this curiosity, I planned a small experiment with three LLMs: Llama 3 70b, GPT-4-Turbo-2024-04-09, and Claude 3 Sonnet.

Here’s a brief explanation for each model:

Llama3 70b

Developed by Meta and open-sourced on April 18, 2024, Llama3 70b has a knowledge cutoff of December 2023. As of May 10, it ranks sixth on the LMSYS Chatbot Arena. For deeper insights, refer to my article here.

GPT-4-Turbo-2024-04-09

Developed by OpenAI, GPT-4-Turbo was released on April 9, 2024, and has a knowledge cutoff of December 2023. As of May 10, it holds first place on the LMSYS Chatbot Arena, making it the top-ranked model there.

Claude3 Sonnet

Released by Anthropic on March 4, 2024, Claude3 Sonnet has a knowledge cutoff of August 2023. As of May 10, it ranks eighth on the LMSYS Chatbot Arena. More details can be found in my detailed review here.

Test method - Llama3 vs Claude3 vs GPT-4

While acknowledging the limitations of my test methods, I sought to independently evaluate how each model answered questions using RAG with various sources such as websites, PDFs, and text files.

Testing sources

  • Website: 96th Academy Awards

    This Wikipedia page provides comprehensive details about the 96th Academy Awards, including nominees, winners, and notable events, serving as a rich source for testing the models' ability to extract and summarize event information.

  • PDF file: Commuting in the United States: 2022

    This official report by the U.S. Census Bureau details commuting patterns and trends in the United States for the year 2022, ideal for testing the models' capabilities in processing and summarizing detailed statistical data.

  • Text file: Sendbird FAQ

    This is the FAQ document for Sendbird, which answers common questions about using Sendbird's services and products. This source tests the models' ability to provide clear, concise answers from a customer support perspective.

  • Website: Sendbird Docs

    This documentation site provides detailed technical information about implementing and using Sendbird's APIs and SDKs, testing the models' ability to interpret and summarize technical instructions and data.

Sample questions from the test

  • 96th Academy Awards: Summarize the awards in 3 bullet points.

  • Sendbird FAQ: What are the benefits of creating a chatbot via Sendbird?

  • PDF file - Commuting in the United States: 2022: Summarize the report in three bullet points.

  • Sendbird Docs: What are the benefits of using MessageCollection?

You can find the full set of Q&As in this gist.

The full set is too long to include here, so I picked the following example to show the differences between the models:

PDF file - Commuting in the United States: 2022: Summarize the report in three bullet points.

Llama 3

Here's how Llama 3 answered:


Here are three bullet points summarizing the report:

• The COVID-19 pandemic led to a significant shift towards home-based work, with the percentage of workers working from home increasing from 5.7% in 2019 to 15.2% in 2022.

• The pandemic also led to changes in commuting behaviors, including a decline in driving alone and public transportation usage, and an increase in carpooling and flexible work arrangements.

• Despite some recovery, public transportation usage remained below pre-pandemic levels, and commute times and departure times also changed, with more workers departing for work during traditional morning rush hour.


Llama 3 provides concise bullet points that summarize the significant shifts in commuting behavior due to the pandemic and cites specific statistics, resulting in a neatly organized and comprehensible summary.

Claude 3

Here's how Claude 3 answered:


Here is a three bullet point summary of the key findings from the report:

• The COVID-19 pandemic led to a significant increase in home-based work and corresponding decrease in commuting, especially by public transit, in the United States between 2019 and 2021.

• While commuting levels partially rebounded in 2022 as pandemic restrictions eased, the share of workers commuting by driving alone or public transit remained well below pre-pandemic 2019 levels.

• The shift to home-based work persisted in 2022, contributing to fewer commuters during peak hours and reduced impacts on transportation infrastructure compared to before the pandemic.


Claude 3 offers a neatly organized summary but lacks specific data points, which may leave the magnitude of the changes it describes unclear.

GPT-4

Here's how GPT-4 answered:


- Persistent Shift to Home-Based Work: The COVID-19 pandemic led to a significant and lasting increase in home-based work among U.S. workers, contributing to a roughly 10 percent reduction in commuters during core commuting hours from 2019 to 2022.

- Decline in Public Transportation Usage: There was a notable decline in the use of public transportation for commuting, with the share of public transportation commuters not returning to pre-pandemic levels by 2022, reflecting ongoing changes in commuting behaviors.

- Changes in Commuting Times: The pandemic influenced the timing of commutes, with fewer workers traveling during traditional peak hours, leading to altered traffic patterns and impacts on transportation infrastructure.


GPT-4 delivers a very structured response that not only addresses the key shifts succinctly but also labels each change by category, which makes it the easiest of the three to grasp.

Quality of responses

The overall pattern was quite similar across all three models, with each providing sufficiently good answers. However, GPT-4 consistently delivered the best responses among them.

Speed

In my tests, GPT-4 was generally the slowest, while Llama3 and Claude3 responded noticeably faster and at similar speeds to each other, making them more suitable for applications where speed is a priority.

Here's a side-by-side comparison of the response times of GPT-4 and Claude-3:


You can see that Claude-3 is significantly faster than GPT-4. Although not shown in the image, Llama3's speed was similar to that of Claude-3, which is fast enough for most applications.
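
If you want to repeat this kind of speed comparison on your own questions, a small timing harness is enough. The sketch below is an assumption about how one might do it, not the setup used for the test above: `time_call` and `compare_latency` are hypothetical helper names, and each model is represented by a plain Python callable that you would replace with a real client call.

```python
import statistics
import time


def time_call(ask, question):
    """Call ask(question) once and return (answer, elapsed_seconds)."""
    start = time.perf_counter()
    answer = ask(question)
    return answer, time.perf_counter() - start


def compare_latency(models, question, runs=3):
    """Return the median response time in seconds for each model.

    `models` maps a display name to a callable that takes the question
    and returns the answer. The median dampens one-off network spikes.
    """
    summary = {}
    for name, ask in models.items():
        timings = [time_call(ask, question)[1] for _ in range(runs)]
        summary[name] = statistics.median(timings)
    return summary


if __name__ == "__main__":
    # Stand-in callables that only sleep; swap in real API calls to measure models.
    def fake_fast(question):
        time.sleep(0.01)
        return "fast answer"

    def fake_slow(question):
        time.sleep(0.05)
        return "slow answer"

    results = compare_latency({"fast": fake_fast, "slow": fake_slow}, "Summarize the report.")
    for name, seconds in sorted(results.items(), key=lambda item: item[1]):
        print(f"{name}: {seconds:.3f}s")
```

Asking every model the same question several times and comparing medians gives a fairer picture than a single side-by-side run.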

Pricing

Pricing is a major consideration when deploying technology solutions. Under Sendbird's pricing, the cost per bot message differs substantially between models:

  • Claude3 Sonnet requires 3 message credits per bot message.

  • Llama3 is notably more cost-effective, requiring only 1 message credit per bot message.

  • GPT-4 is the most expensive, costing 10 message credits per bot message.

These differences in cost can significantly influence the decision-making process, especially for businesses operating at scale or those with limited budgets.
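
To see how these per-message figures add up at scale, the arithmetic can be sketched in a few lines of Python. The credit values come from the list above; the function names and the example volume of 1,000 bot messages are assumptions chosen for illustration, and the price of a credit itself is not covered here.

```python
# Message credits per bot message, as listed above.
CREDITS_PER_BOT_MESSAGE = {
    "Llama3": 1,
    "Claude3 Sonnet": 3,
    "GPT-4": 10,
}


def credits_needed(model, bot_messages):
    """Total message credits consumed by a given number of bot messages."""
    return CREDITS_PER_BOT_MESSAGE[model] * bot_messages


def relative_cost(model, baseline="GPT-4"):
    """Cost of `model` as a fraction of the baseline model's cost."""
    return CREDITS_PER_BOT_MESSAGE[model] / CREDITS_PER_BOT_MESSAGE[baseline]


if __name__ == "__main__":
    volume = 1000  # hypothetical number of bot messages
    for model in CREDITS_PER_BOT_MESSAGE:
        total = credits_needed(model, volume)
        share = relative_cost(model)
        print(f"{model}: {total} credits ({share:.0%} of GPT-4)")
```

Because the credit cost is a flat multiplier per bot message, the ratio between models stays the same at any volume: Llama3 always uses one tenth of the credits GPT-4 does.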

Conclusion - Llama3 vs Claude3 vs GPT-4

All three models are sufficiently competent, and one can choose a model based on specific requirements. GPT-4 stands out for its accuracy, and if this is the most crucial factor for a customer, it would be the best choice.
However, if efficiency is the priority, Llama3 is the better option. In Sendbird message credits, it costs only 10% of what GPT-4 does (1 credit versus 10 per bot message), and it also offers faster response times.

Personally, I think Llama3 is the best choice: its reasonable price delivers good value to customers, and it also responds quickly.
Sendbird offers a variety of models, so it's beneficial to test and use them based on your specific needs.

The only no-code, UI-supported solution for using Llama 3 in production

While Llama 3 can readily provide accurate answers for your needs, you still need it to engage properly with your users. Many other solutions offer a chatbot experience built on LLMs, but users now expect polished, production-grade chat experiences similar to those in Snapchat, iMessage, or WhatsApp.

To truly maximize the capabilities of Llama 3 in your service, it's imperative to support it with a modern and elegant UI that includes:

  • Message cards to display product images

  • Suggested replies

  • Message status receipts for sent, delivered, and read messages

  • Typing indicators

  • Offline support

Have Llama 3 on your own website within minutes

Sendbird can help you take the next step in building your AI chatbot. In addition to rich messaging features, Sendbird lets you integrate Llama 3 into your website without any coding. You can add knowledge sources from a straightforward dashboard in just a few clicks.

Check out the official integration of Llama 3 for your Sendbird chatbot projects here!
