AI Jira Fix Version summaries: automate release reporting
AI Jira Fix Version summaries automate progress updates, risk detection, and trade-off decisions: giving PMs and EMs instant visibility.
Learn critical LLM lessons from building an AI Jira analytics tool with GPT-4 & Claude 3.7. Avoid common pitfalls in data, prompts, temperature & reasoning.
Building AI tools with Large Language Models (LLMs) like GPT-4 and Claude 3.7 can feel like navigating uncharted territory. The potential is immense, but the practical challenges of applied AI are very real. Over the past several months, our team at Luna has been developing an AI-powered Jira analytics system. Our goal: extract actionable insights from complex Jira data to help engineering and product teams track progress, identify risks, and predict potential delays more effectively.
What started as an exciting technical challenge quickly showed us the real-world ups and downs of working with LLMs. In this article, I'll share the five most important LLM lessons we learned; insights that would have saved us weeks of troubleshooting and iteration had we known them from the start.
Whether you're actively building AI products, integrating LLMs into existing workflows, looking for LLM best practices, or simply curious about the hurdles in AI project management, these lessons will help you avoid common pitfalls and build more reliable, effective AI systems.
Our final lesson addresses a subtle but critical LLM challenge: models can sometimes be too agreeable, potentially reinforcing user assumptions or biases rather than providing objective, critical analysis. This is sometimes referred to as the "sycophantic" or "yes-man" effect.
We observed that models would occasionally:
Interestingly, we found differences between models in this regard during our LLM testing. In certain analytical tasks, we observed that Claude models sometimes provided more pushback – flagging gaps, surfacing potential blockers unprompted, and offering more critical assessments compared to some other models we tested at the time.
What we learned: To counteract the "yes-man" effect and promote objective AI analysis:
Verify, don't just trust: Remember that fluent, confident-sounding responses don't automatically equate to accuracy. Always cross-reference critical insights, especially those derived from complex Jira data.
When an LLM produces unexpected or incorrect output, the natural reaction is to blame the model or refine the prompt. We spent countless hours tweaking instructions, adding constraints, and breaking tasks into smaller steps – often with minimal improvement in our AI Jira analytics output.
The breakthrough came when we realized that in approximately 80% of problematic cases, the underlying issue wasn't our prompt engineering but the quality and consistency of the data we were feeding into the LLMs.
Common data quality issues impacting LLM performance:
What we learned: Before overhauling your prompts, rigorously audit your input data quality:
We found that when presented with conflicting or ambiguous data, LLMs don't "hallucinate" randomly; they attempt to improvise reasonable interpretations based on flawed input. By cleaning and standardizing our Jira data inputs, we achieved far more reliable and accurate outputs, often without needing overly complex prompt engineering. Ensuring high data quality for AI is key.
One of our most surprising discoveries was that even advanced models like Claude 3.7 and GPT-4 have no inherent understanding of the current date or how to interpret relative time references ("last week," "next sprint") without explicit guidance. This is a critical LLM challenge for time-sensitive analysis.
This became particularly problematic when analyzing sprint deadlines, issue aging, and time-based risk factors within our Jira analytics tool. The models would fail to identify overdue tasks or miscalculate time remaining until deadlines simply based on dates alone.
Time-related LLM issues we encountered:
What we learned: Implement a pre-processing layer to handle all time calculations and provide absolute and relative temporal context before passing data to the LLM:
By pre-computing all time-based calculations and providing explicit context (including the "current date" used for calculations), we eliminated an entire category of errors and significantly improved the accuracy of our AI-powered Jira insights.
For highly structured LLM outputs like analytics reports or risk summaries, our initial assumption was that setting the temperature parameter to 0 (minimizing randomness) would yield the most consistent and accurate results. This proved counterproductive for nuanced analysis.
At temperature = 0, we observed that models often became:
What we learned: For tasks requiring both structure AND analytical judgment (like risk assessment in Jira data), a temperature setting between 0.2 and 0.3 often provides the optimal balance:
This small adjustment to LLM parameters dramatically improved our risk assessment accuracy and reduced the need for extensive prompt engineering just to handle edge cases.
Even with state-of-the-art models like GPT-4 and Claude 3.7, we noticed occasional lapses in reasoning or simple computational errors that affected output quality. The solution came from adapting techniques like "chain-of-thought" prompting: requiring models to explain their reasoning step-by-step before providing the final conclusion.
This approach produced remarkable improvements in our LLM development process:
What we learned: For any prompt involving analysis, calculation, or decision-making, add a specific instruction to force self-reflection:
"Before providing your final analysis/summary/risk score, explain your reasoning step-by-step. Show how you evaluated the input data, what calculations you performed, and how you reached your conclusions."
While this approach consumes additional tokens (increasing cost), the return on investment in terms of LLM accuracy, explainability, and trustworthiness has been significant for our AI Jira analytics use case.
LLMs can do amazing things. But letting them "talk" too much, especially on tasks needing clear, logical steps, can actually open the door for more errors, weird inconsistencies, or just plain unnecessary fluff.
The fix? Get straight to the point:
Why this actually works:
Good to keep in mind:
Helping your LLM focus by asking for less can significantly boost its performance. Sometimes, the best way to get more from your AI is to ask it to say less.
The latest generation of reasoning models like Claude 3.7, GPT o-series and Gemini 2.5 Pro represent a fundamental shift in how LLMs approach problems. They're not just "smarter LLMs" – these models think differently, solving problems step-by-step in a visible, logical chain rather than jumping straight to an answer.
But here's the catch: They don't use that reasoning fully unless you explicitly prompt for it.
It's like having a brilliant colleague who could map out a complex problem on a whiteboard – but instead does it all in their head and hands you a final answer. You don't see the steps. You can't verify the logic. And worse – they're more likely to miss guidelines or make mistakes without working things out step by step.
For reasoning models, explicitly request visible reasoning:
"Explicit your reasoning inside your <thinking> bloc."
The tradeoffs to consider:
Building reliable AI systems with Large Language Models requires more than just technical knowledge of APIs and prompts. It demands practical, hands-on experience and a deep understanding of these models' unique characteristics, strengths, and limitations, especially when dealing with domain-specific data like Jira project management information.
By proactively addressing data quality issues, providing explicit temporal context, carefully choosing LLM parameters like temperature, implementing self-reflection prompts for improved reasoning, constraining output scope for better focus, leveraging reasoning models effectively, and actively guarding against excessive agreeableness (model bias), we've dramatically improved the reliability, accuracy, and overall effectiveness of our AI Jira Fix Version and Sprint summaries
These seven LLM development lessons have transformed how we approach LLM integration at Luna. We're continuously refining our techniques as we build tools that empower product and engineering teams to extract truly actionable insights from their valuable Jira data.
Are you interested in learning more about how Luna can help your team gain deeper insights from your Jira data? Visit withLuna.ai to discover how our AI-powered Jira analytics can identify risks earlier, track engineering progress and product management KPIs more effectively, and help predict potential delays before they impact your roadmap.