
7 Lessons from building a Large Language Model summarization tool

Myriam Debahy
May 21, 2024 · 10 min read

Imagine opening your inbox after a long day and drowning in a sea of unread emails, Slack threads, and meeting notes. Extracting the vital information and acting on it feels impossible. This information overload not only consumes valuable time; it also makes it hard to identify what is critical and respond effectively. To tackle this challenge, we developed LunaVista, an AI-powered extraction and summarization solution built on Large Language Models (LLMs).

What initially seemed as straightforward as submitting a thread to ChatGPT turned out to be a complex project, with challenges rooted in the current state and capabilities of LLMs. After weeks of dedicated work, extensive prompt iterations, and millions of tokens processed, we want to share the key lessons learned along the way.

High-level overview of LunaVista’s POC

  • Use cases:
    • Summarize: automate executive summaries tailored to different audiences (e.g. Leadership, Marketing, Sales, Product teams) providing a concise overview of project progress.
    • Extract and classify: generate structured project status reports encompassing risks, decisions, recent progress, and upcoming priorities.
    • Answer (soon): provide succinct and relevant answers to any project-related questions.
  • Model: GPT-4 on Azure OpenAI
    • We chose Azure OpenAI because, in addition to granting access to OpenAI models, it has a strong focus on meeting enterprise customers' requirements, such as availability, security, privacy, and compliance in the production environment.
    • Note: we are currently testing Anthropic's latest release, Claude 3 Opus, extensively; it delivers very promising results on large structured input (see lesson #3).
  • We built the app on Streamlit for its fast prototyping capabilities, seamless integration with LLM APIs, and efficient feedback capturing mechanisms.

Main challenges encountered

  1. Incompleteness: the model often omitted important information and struggled to extract all entities, leading to incomplete summaries and reports.
  2. Stability: results diverged across multiple runs, even with temperature set to zero, making it difficult to achieve consistent outputs.
  3. Chronology of events: the model struggled to capture the chronological sequence of events and extract the latest state of a specific event (e.g. a decision or risk) previously mentioned in the thread, impacting the accuracy of the generated reports.
  4. Computing deadlines and timelines: the model struggled to compute exact dates like 'next Tuesday' or 'next Monday' based on the date a message or email was sent, leading to incorrect deadline calculations.
  5. Latency: in some cases, latency could exceed one minute (!), resulting in a poor user experience.

⚠️ These challenges were more pronounced with larger input sizes, such as lengthy email threads (~ 10+ emails).

Lesson #1: LLMs struggle with dates

LLMs generate human-like text by predicting the most probable next word or sequence of words based on the input and their training data. They struggle with date-based reasoning because their training data lacks consistent representations of time and chronology, making it difficult for them to learn the logical rules involved in tracking dates, deadlines, and scheduling over time. For instance, when an email mentions a deadline for "next Monday," the model often fails to determine the corresponding date.

💡 Tip: to overcome this issue, provide the model with today's date and the corresponding day of the week. Instruct the model on how to compute deadlines:

{"role": "system", "content": "Make sure you compute deadlines and due dates correctly. You should compute them from the date of the message that mentions the deadline or due date. For example, if an email from 4 April 2024 mentions a deadline for 'next Tuesday', then the deadline = Tuesday 9 April 2024. As a reminder, today's date is " + today.strftime("%d %B %Y") + " and the day is " + today.strftime("%A")}
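For illustration, the date rule we instruct the model to follow can also be expressed directly in Python. This is a hypothetical sketch (the helper name is our own, not part of LunaVista):

```python
from datetime import date, timedelta

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def next_weekday(message_date: date, weekday_name: str) -> date:
    """Resolve 'next <weekday>' relative to the date a message was sent."""
    target = WEEKDAYS.index(weekday_name)
    days_ahead = (target - message_date.weekday()) % 7
    if days_ahead == 0:
        days_ahead = 7  # "next Tuesday" said on a Tuesday means a week later
    return message_date + timedelta(days=days_ahead)

# The example from the prompt: an email sent Thursday 4 April 2024
# mentioning "next Tuesday" resolves to Tuesday 9 April 2024.
print(next_weekday(date(2024, 4, 4), "Tuesday"))  # 2024-04-09
```

Giving the model today's date in the system prompt lets it apply this same arithmetic in-context, without tool calls.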

Lesson #2: LLMs have a “short memory”

When dealing with long inputs, LLMs tend to forget parts of the information, especially in the middle section. They focus primarily on the beginning and the end of the input (the so-called "lost in the middle" effect).

💡 Tips: 

  1. Repeat the important guidelines, and the points where the model fails, multiple times.
  2. Use capital letters to draw attention to specific guidelines.
  3. Explicitly ask the model to consider the entire document.
  4. End by telling the model that it missed some information and should repeat the work.

You could for example add this instruction in the prompt: 

"IMPORTANT: Please make sure to consider the ENTIRE email thread when generating the summary. Make sure to include information from the middle of the thread, as it may contain crucial details."

.. and this last user message:

{"role": "user", "content": "You missed some information. Make sure to read THE WHOLE TEXT, PAY ATTENTION TO THE CHRONOLOGICAL SEQUENCE OF EVENTS and repeat the work"}

Lesson #3: Claude 3 Opus outperforms GPT-4 on large, structured input

With lengthy email threads, the model often omitted crucial information and failed to capture the chronological sequence of events correctly. For instance, it would flag a risk as open although it had been mitigated a few messages later. Despite multiple iterations and various prompt engineering techniques - including splitting the prompt into smaller subtasks, providing examples, etc. - we struggled to improve GPT-4's performance on long email threads.

However, when we tested the newly released Anthropic Claude 3 Opus model on email threads structured with XML tags, the results showed significantly better recall and precision while achieving lower latency. 

For example, for a 38-email thread of 20k characters:

Differences between Claude 3 Opus and GPT-4

Lesson #4: The quality of the input is as important as the prompt

Using XML tags to encapsulate each email drastically improved performance with Claude 3 Opus because this format aligns with how the model was trained to process lengthy documents.

Here's an example of how to feed your large language model (LLM) an email thread structured with XML tags:

How to structure the input for Claude models

⚠️ We are currently in the process of automating the transformation of an email thread into an XML stream to feed the LLM. This step has not yet been implemented in LunaVista.
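As a sketch of that transformation, each email in the thread can be wrapped in its own tags. The field names below are illustrative assumptions, not LunaVista's actual schema:

```python
from xml.sax.saxutils import escape

def thread_to_xml(emails: list[dict]) -> str:
    """Wrap each email of a thread in XML tags before passing it to the model.

    Field names ('from', 'date', 'body') are illustrative assumptions.
    """
    parts = ["<thread>"]
    for i, email in enumerate(emails, start=1):
        parts.append(
            f'  <email index="{i}">\n'
            f"    <from>{escape(email['from'])}</from>\n"
            f"    <date>{escape(email['date'])}</date>\n"
            f"    <body>{escape(email['body'])}</body>\n"
            f"  </email>"
        )
    parts.append("</thread>")
    return "\n".join(parts)
```

Escaping the bodies matters: raw `<` or `&` characters in an email would otherwise corrupt the structure the model relies on.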

Lesson #5: Enable streaming to improve perceived latency

While GPT-4 and Claude 3 Opus outperform their predecessors in capability, their increased complexity and size (GPT-4 is estimated to have well over 100 billion parameters) require more computational resources, leading to higher latency.

Quality and speed of top LLMs
Source: Artificial Analysis

💡 Tip: Enable streaming by adding 'stream=True' to the API call. This way, results are displayed as they are generated, giving the perception of reduced latency.
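The consuming side of a streamed response can be sketched as follows. Here `stream` stands in for the iterator of text deltas the client yields when called with stream=True; we simplify the chunk objects to plain strings:

```python
def consume_stream(stream) -> str:
    """Display text deltas as soon as they arrive, then return the full text.

    With a real client the iterator would come from e.g.
    client.chat.completions.create(..., stream=True), with each delta
    extracted from the chunk object; here we assume plain strings.
    """
    parts = []
    for delta in stream:
        if delta:
            print(delta, end="", flush=True)  # show text as it is generated
            parts.append(delta)
    print()
    return "".join(parts)
```

The total generation time is unchanged, but the user starts reading after the first token instead of waiting for the whole response.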

Lesson #6: You don’t always need LangChain

Initially, we considered using the LangChain library, as it was recommended by many in the community. However, after evaluating it for our specific use cases, which involved relatively straightforward querying of language models, we didn't find significant added value over directly interacting with the model's API. While LangChain can be really helpful in orchestrating complex workflows involving multiple models, data sources, and components, we found that for our requirements, the abstraction layer introduced more complexity than benefits.

Lesson #7: Ground your LLM for more accurate responses

Providing the model with a specific role or persona significantly improved the quality and relevance of its outputs, particularly the summaries. The assigned role also impacted the information the model chose to present and how it formulated that information. By carefully selecting the appropriate persona, you can effectively guide the model to produce more targeted and valuable outputs.

For example, when generating the executive summaries:

  • With no defined role: the model included too many unnecessary details in the summary, without selecting or prioritizing the information that provides useful project visibility.
  • When grounded as an analyst: the model prioritized factual information and enumerated it concisely.
  • When grounded as a project manager: the model reformulated its responses with a process-oriented approach, focusing on milestones, risks, and their mitigation.
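A minimal sketch of how such grounding can be wired in; the persona wording below is illustrative, not our production prompt:

```python
# Illustrative personas - not LunaVista's actual system prompts.
PERSONAS = {
    "analyst": (
        "You are a data analyst. Prioritize factual information "
        "and enumerate it concisely."
    ),
    "project_manager": (
        "You are a project manager. Focus on milestones, risks, "
        "and their mitigation."
    ),
}

def build_messages(persona: str, thread: str) -> list[dict]:
    """Prepend the chosen persona as the system message."""
    return [
        {"role": "system", "content": PERSONAS[persona]},
        {"role": "user", "content": f"Summarize the following email thread:\n{thread}"},
    ]
```

Swapping the persona is then a one-line change per audience (Leadership, Marketing, Sales, Product), with no change to the user-facing prompt.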

Other essential lessons

  • Positive prompt: phrase prompts positively, stating what the model should do (e.g. "write in concise bullet points") rather than what it should not do (e.g. "do not write long paragraphs"), as LLMs tend to overlook or misinterpret negative instructions.
  • Temperature setting: for more predictable and stable outputs, set the temperature to zero. 💡 Temperature is a hyperparameter that controls the randomness of the model's output: a high temperature produces more unpredictable and creative results, while a low temperature produces more deterministic and conservative output.

Looking ahead

As LLMs continue to evolve at an unprecedented pace, it is crucial that we stay informed about the latest developments and best practices in the field. We are continuously experimenting with new model releases and iterating on our solutions to achieve a consistently high degree of precision.

Building upon this POC, we are excited to embark on the next phase of our AI journey. Our roadmap includes:

  • Further refining our summarization and extraction capabilities, focusing on improving the handling of large email threads.
  • Launching our question-answering functionality.
  • Productizing AI capabilities within the Luna platform, leveraging your own project data.
