Discover 7 lessons from building an AI summarization tool using GPT-4 and Claude 3. Explore challenges with dates, memory, input formatting, latency, and tips to optimize LLM performance.
Imagine opening your inbox after a long day and drowning in a sea of unread emails, Slack threads, and meeting notes. This information overload not only consumes valuable time but also makes it hard to identify critical information and act on it. To tackle this challenge, we developed LunaVista, an AI-powered extraction and summarization solution built on Large Language Models (LLMs).
What initially seemed as straightforward as submitting a thread to ChatGPT turned out to be a complex project, with challenges tied to the current state and capabilities of LLMs. After weeks of dedicated work, extensive prompt iterations, and millions of tokens processed, we want to share the key lessons we learned along the way.
⚠️ These challenges were more pronounced with larger input sizes, such as lengthy email threads (~ 10+ emails).
LLMs generate human-like text by predicting the most probable next word or sequence of words based on the input and their training data. They struggle with date-based reasoning because their training data lacks consistent representations of time and chronology, making it difficult for them to learn the logical rules involved in tracking dates, deadlines, and scheduling over time. For instance, when an email mentions a deadline for "next Monday," the model fails to determine the corresponding date.
💡 Tip: to overcome this issue, provide the model with today's date and the corresponding day of the week. Instruct the model on how to compute deadlines:
{"role": "system", "content": "Make sure you compute deadlines and due dates correctly. You should compute them from the date of the message that mentions this deadline or due date. For example, if an email from 4 April 2024, mentions a deadline for 'next Tuesday,' then the deadline = Tuesday 9 April 2024. As a reminder, today's date is "+ today.strftime("%d %B %Y") +" and the day is "+today.strftime("%A")}
When dealing with long inputs, LLMs tend to forget parts of the information, especially the middle section: they focus primarily on the beginning and the end of the input.
💡 Tips:
You could for example add this instruction in the prompt:
"IMPORTANT: Please make sure to consider the ENTIRE email thread when generating the summary. Make sure to include information from the middle of the thread, as it may contain crucial details."
... and append this follow-up user message:
{"role": "user", "content": "You missed some information. Make sure to read THE WHOLE TEXT, PAY ATTENTION TO THE CHRONOLOGICAL SEQUENCE OF EVENTS and repeat the work"}
With lengthy email threads, the model often omitted crucial information and failed to capture the chronological sequence of events correctly. For instance, it would flag a risk as open even though it had been mitigated a few messages later. Despite multiple iterations and various prompt engineering techniques (splitting the prompts into smaller subtasks, providing examples, etc.), we struggled to improve GPT-4's performance on long email threads.
However, when we tested the newly released Anthropic Claude 3 Opus model on email threads structured with XML tags, the results showed significantly better recall and precision while achieving lower latency.
For example, the difference was striking on a 38-email thread of roughly 20,000 characters.
Using XML tags to encapsulate each email drastically improved performance with Claude 3 Opus because this format aligns with how the model was trained to process lengthy documents.
Here's an example of how to feed the model an email thread structured with XML tags:
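The tag names, email contents, and model call below are invented for illustration; the point is that each message gets explicit boundaries and metadata (position, sender, date) so the model can track the chronology:

```python
import anthropic

anthropic_client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

# Illustrative tag names and contents; they are not LunaVista's actual schema.
xml_thread = """
<thread subject="Q2 budget review">
  <email index="1" from="alice@example.com" date="2 April 2024">
    Hi team, please send your budget updates by next Tuesday.
  </email>
  <email index="2" from="bob@example.com" date="4 April 2024">
    Update attached. Flagging a risk on the vendor contract.
  </email>
  <email index="3" from="alice@example.com" date="5 April 2024">
    Thanks Bob, the vendor risk was mitigated after today's call.
  </email>
</thread>
"""

message = anthropic_client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1024,
    system="Summarize the email thread. Respect the chronological order of events.",
    messages=[{"role": "user", "content": xml_thread}],
)
print(message.content[0].text)
```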
⚠️ We are currently automating the transformation of an email thread into an XML stream to feed the LLM. This step has not yet been implemented in LunaVista.
While GPT-4 and Claude 3 Opus outperform their predecessors in terms of capabilities, their increased complexity and size (e.g., GPT-4 has over 100 billion parameters) require more computational resources, leading to higher latency.
💡 Tip: Enable streaming by passing stream=True to the API call. Results are then displayed as they are generated, giving the perception of reduced latency.
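As a brief sketch with the OpenAI Python client, reusing `client` and `messages` from the earlier snippets (the model name is again illustrative):

```python
stream = client.chat.completions.create(
    model="gpt-4-turbo",  # illustrative model name
    messages=messages,
    stream=True,  # tokens are returned incrementally as they are generated
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta is not None:
        print(delta, end="", flush=True)  # render partial output immediately
```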
Initially, we considered using the LangChain library, as it was recommended by many in the community. However, after evaluating it for our specific use cases, which involved relatively straightforward querying of language models, we didn't find significant added value over directly interacting with the model's API. While LangChain can be really helpful in orchestrating complex workflows involving multiple models, data sources, and components, we found that for our requirements, the abstraction layer introduced more complexity than benefits.
Providing the model with a specific role or persona significantly improved the quality and relevance of its outputs, particularly the summaries. The assigned role also impacted the information the model chose to present and how it formulated that information. By carefully selecting the appropriate persona, you can effectively guide the model to produce more targeted and valuable outputs.
For example, assigning a dedicated persona when generating the executive summaries noticeably changed which details the model surfaced and how it phrased them.
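The exact persona used in LunaVista isn't reproduced in this post, so the system message below is purely an illustration of the technique, reusing `client` and `email_thread` from the earlier sketches:

```python
# Invented persona for illustration; the production prompt may differ.
executive_persona = {
    "role": "system",
    "content": (
        "You are a chief of staff preparing a briefing for a busy executive. "
        "Summarize the thread in at most five bullet points, lead with decisions, "
        "deadlines and open risks, and leave out pleasantries and low-level detail."
    ),
}

summary = client.chat.completions.create(
    model="gpt-4-turbo",  # illustrative model name
    messages=[executive_persona, {"role": "user", "content": email_thread}],
)
print(summary.choices[0].message.content)
```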
As LLMs continue to evolve at an unprecedented pace, it is crucial that we stay informed about the latest developments and best practices in the field. We are continuously experimenting with new model releases and iterating on our solutions to achieve a high degree of precision that users can trust.
Building upon this POC, we are excited to embark on the next phase of our AI journey. Our roadmap includes: