5 critical LLM lessons building an AI Jira analytics tool

Learn critical LLM lessons from building an AI Jira analytics tool with GPT-4 & Claude 3.7. Avoid common pitfalls in data, prompts, temperature & reasoning.

Myriam Debahy
May 07, 2025 · 10 min read

Building AI tools with Large Language Models (LLMs) like GPT-4 and Claude 3.7 can feel like navigating uncharted territory. The potential is immense, but the practical challenges of applied AI are very real. Over the past several months, our team at Luna has been developing an AI-powered Jira analytics system. Our goal: extract actionable insights from complex Jira data to help engineering and product teams track progress, identify risks, and predict potential delays more effectively.

What started as an exciting technical challenge quickly showed us the real-world ups and downs of working with LLMs. In this article, I'll share the five most important LLM lessons we learned: insights that would have saved us weeks of troubleshooting and iteration had we known them from the start.

Whether you're actively building AI products, integrating LLMs into existing workflows, looking for LLM best practices, or simply curious about the hurdles in AI project management, these lessons will help you avoid common pitfalls and build more reliable, effective AI systems.

Critical LLM development lessons building an AI Jira analytics tool

Lesson 1: Combatting the LLM 'Yes-Man' - ensuring critical analysis & objectivity

Our first lesson addresses a subtle but critical LLM challenge: models can sometimes be too agreeable, potentially reinforcing user assumptions or biases rather than providing objective, critical analysis. This is sometimes referred to as the "sycophantic" or "yes-man" effect.

We observed that models would occasionally:

  • Mirror our implicit assumptions within prompts rather than challenging them based on the data.
  • Agree with flawed premises or leading questions instead of identifying underlying problems.
  • Omit crucial caveats or risks if not explicitly prompted to look for them.
  • Deliver potentially inaccurate or hallucinated information with unnervingly high confidence in their tone.

Interestingly, we found differences between models in this regard during our LLM testing. In certain analytical tasks, we observed that Claude models sometimes provided more pushback – flagging gaps, surfacing potential blockers unprompted, and offering more critical assessments compared to some other models we tested at the time.

What we learned: To counteract the "yes-man" effect and promote objective AI analysis:

  • Explicitly instruct models to be critical: Add prompts like "Critically evaluate this data. What's missing? What assumptions might be flawed? What are the biggest risks you see?"
  • Frame questions neutrally: Avoid leading questions that suggest a desired answer. Ask "What does the data indicate about sprint progress?" instead of "Is the sprint significantly delayed based on this data?"
  • Consider model diversity: For tasks requiring strong analytical distance or critical evaluation, experiment with different models known for different strengths (always verifying outputs).
  • Verify, don't just trust: Remember that fluent, confident-sounding responses don't automatically equate to accuracy. Always cross-reference critical insights, especially those derived from complex Jira data.
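To make this concrete, here is roughly how we reframe a leading question into a neutral, critical one. The strings are illustrative, not our exact production wording:

```python
# Leading framing: bakes an assumption into the question and invites agreement.
leading_prompt = "Is the sprint significantly delayed based on this data?"

# Neutral framing plus an explicit instruction to push back on the data itself.
neutral_prompt = (
    "What does the data indicate about sprint progress? "
    "Critically evaluate this data: what is missing, which assumptions might be "
    "flawed, and what are the biggest risks you see?"
)
```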

Lesson 2: LLM mistakes? Audit your input data first (It's often not the prompt)

When an LLM produces unexpected or incorrect output, the natural reaction is to blame the model or refine the prompt. We spent countless hours tweaking instructions, adding constraints, and breaking tasks into smaller steps – often with minimal improvement in our AI Jira analytics output.

The breakthrough came when we realized that in approximately 80% of problematic cases, the underlying issue wasn't our prompt engineering but the quality and consistency of the data we were feeding into the LLMs.

Common data quality issues impacting LLM performance:

  • Inconsistent calculations: The same engineering metrics (like sprint velocity or cycle time) being calculated differently across various Jira projects or data sources.
  • Missing context: Values or statuses that seemed obvious to human readers but lacked necessary contextual information for the AI to interpret correctly (e.g., 'Done' meaning different things in different workflows).
  • Conflicting information: Subtle contradictions between different data points (e.g., start/end dates) that humans might ignore or reconcile but thoroughly confused the LLM.

What we learned: Before overhauling your prompts, rigorously audit your input data quality:

  • Ensure naming conventions (for fields, statuses, projects) are consistent across all data sources.
  • Verify that key metrics are calculated identically everywhere they appear.
  • Actively check for gaps, ambiguities, or conflicts in your input data before passing it to the model.

We found that when presented with conflicting or ambiguous data, LLMs don't "hallucinate" randomly; they attempt to improvise reasonable interpretations based on flawed input. By cleaning and standardizing our Jira data inputs, we achieved far more reliable and accurate outputs, often without needing overly complex prompt engineering. Ensuring high data quality for AI is key.
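To make this audit concrete, here is a minimal sketch of the kind of pre-flight check we mean, run before any data reaches the model. The field names, status vocabulary, and sample issues are illustrative placeholders, not Luna's actual schema:

```python
from datetime import date

# Hypothetical canonical vocabulary; replace with your own workflow statuses.
CANONICAL_STATUSES = {"To Do", "In Progress", "In Review", "Done"}
REQUIRED_FIELDS = {"key", "status", "story_points", "start_date", "end_date"}

def audit_issue(issue: dict) -> list[str]:
    """Return a list of data-quality problems found in a single Jira issue dict."""
    problems = []
    key = issue.get("key", "?")

    # Gaps: fields a human reader would shrug off, but the model cannot infer.
    missing = REQUIRED_FIELDS - issue.keys()
    if missing:
        problems.append(f"{key}: missing fields {sorted(missing)}")

    # Inconsistent naming: 'Closed' vs 'Done' vs 'done' across projects.
    status = issue.get("status")
    if status is not None and status not in CANONICAL_STATUSES:
        problems.append(f"{key}: non-canonical status '{status}'")

    # Conflicting information: e.g. an end date that precedes the start date.
    start, end = issue.get("start_date"), issue.get("end_date")
    if isinstance(start, date) and isinstance(end, date) and end < start:
        problems.append(f"{key}: end_date {end} is before start_date {start}")

    return problems

issues = [
    {"key": "PROJ-1", "status": "Done", "story_points": 3,
     "start_date": date(2025, 3, 1), "end_date": date(2025, 3, 5)},
    {"key": "PROJ-2", "status": "Closed", "story_points": 5,
     "start_date": date(2025, 3, 10), "end_date": date(2025, 3, 8)},
]

report = [problem for issue in issues for problem in audit_issue(issue)]
if report:
    raise ValueError("Fix the input data before prompting the LLM:\n" + "\n".join(report))
```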

Lesson 3: LLMs & time blindness - explicit temporal context is non-negotiable

One of our most surprising discoveries was that even advanced models like Claude 3.7 and GPT-4 have no inherent understanding of the current date or how to interpret relative time references ("last week," "next sprint") without explicit guidance. This is a critical LLM challenge for time-sensitive analysis.

This became particularly problematic when analyzing sprint deadlines, issue aging, and time-based risk factors within our Jira analytics tool. The models would fail to identify overdue tasks or miscalculate the time remaining until deadlines when given raw dates alone.

Time-related LLM issues we encountered:

  • Models couldn't determine if dates like "April 10" were in the past or future without the current date as an explicit reference point.
  • Calculating "days ago" or "days remaining" based solely on dates produced inconsistent or incorrect results.
  • Time-based risk assessments (like flagging items stuck in a status too long) were unreliable without pre-calculated durations.

What we learned: Implement a pre-processing layer to handle all time calculations and provide absolute and relative temporal context before passing data to the LLM:

  • Instead of: "Task created on March 20, deadline April 10" (Assuming current date is March 26, 2025)
  • Use: "Task created on March 20, 2025 (6 days ago), deadline April 10, 2025 (15 days remaining from today, March 26, 2025)"
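A minimal sketch of that pre-processing step, assuming Python's standard `datetime` module (the phrasing mirrors the example above; the function and argument names are illustrative):

```python
from datetime import date

def annotate_dates(created: date, deadline: date, today: date) -> str:
    """Render absolute dates with explicit relative context so the model never
    has to guess the current date or do calendar arithmetic itself."""
    days_since_created = (today - created).days
    days_until_deadline = (deadline - today).days

    if days_until_deadline >= 0:
        deadline_note = f"{days_until_deadline} days remaining from today, {today:%B %d, %Y}"
    else:
        deadline_note = f"overdue by {-days_until_deadline} days as of today, {today:%B %d, %Y}"

    return (
        f"Task created on {created:%B %d, %Y} ({days_since_created} days ago), "
        f"deadline {deadline:%B %d, %Y} ({deadline_note})"
    )

print(annotate_dates(date(2025, 3, 20), date(2025, 4, 10), today=date(2025, 3, 26)))
# Task created on March 20, 2025 (6 days ago),
# deadline April 10, 2025 (15 days remaining from today, March 26, 2025)
```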

By pre-computing all time-based calculations and providing explicit context (including the "current date" used for calculations), we eliminated an entire category of errors and significantly improved the accuracy of our AI-powered Jira insights.

Lesson 4: Beyond temperature 0 - finding the sweet spot for structured LLM outputs

For highly structured LLM outputs like analytics reports or risk summaries, our initial assumption was that setting the temperature parameter to 0 (minimizing randomness) would yield the most consistent and accurate results. This proved counterproductive for nuanced analysis.

At temperature = 0, we observed that models often became:

  • Excessively rigid: Treating minor potential issues with the same alarming severity as major blockers.
  • Context-blind: Missing important nuances or mitigating factors that required proportional, less absolute responses.
  • Overly deterministic: Making definitive declarations even when data was incomplete or ambiguous.
  • Edge-case fragile: "Panicking" over normal conditions (like zero velocity at the very start of a sprint) instead of recognizing them as expected patterns.

What we learned: For tasks requiring both structure AND analytical judgment (like risk assessment in Jira data), a temperature setting between 0.2 and 0.3 often provides the optimal balance:

  • It maintains consistent formatting and output structure.
  • It allows for appropriate nuance in assessments, recommendations, and severity scoring.
  • It handles edge cases and ambiguity more gracefully.
  • It produces more natural, human-like analysis that users find more trustworthy and actionable.

This small adjustment to LLM parameters dramatically improved our risk assessment accuracy and reduced the need for extensive prompt engineering just to handle edge cases.
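In practice this is a one-line change wherever the model is called. Below is a minimal sketch using the OpenAI Python SDK; the model name, prompts, and input string are placeholders, and the same `temperature` parameter exists in other providers' chat APIs:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

sprint_data_as_text = "Sprint 42: 18 of 30 story points done, 4 working days remaining, ..."

response = client.chat.completions.create(
    model="gpt-4o",     # placeholder; use whichever model your account has access to
    temperature=0.25,   # the 0.2-0.3 range we settled on for structured-but-nuanced analysis
    messages=[
        {"role": "system", "content": "You are an engineering analytics assistant. "
                                      "Return a risk summary in the agreed report structure."},
        {"role": "user", "content": sprint_data_as_text},
    ],
)
print(response.choices[0].message.content)
```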

Lesson 5: Boost LLM accuracy with forced self-reflection (Chain-of-thought)

Even with state-of-the-art models like GPT-4 and Claude 3.7, we noticed occasional lapses in reasoning or simple computational errors that affected output quality. The solution came from adapting techniques like "chain-of-thought" prompting: requiring models to explain their reasoning step-by-step before providing the final conclusion.

This approach produced remarkable improvements in our LLM development process:

  • Better prioritization: Models became more discerning about what truly constituted a high-priority risk versus a minor observation.
  • Improved computation: Simple mathematical errors (like summing story points or calculating percentages) decreased significantly.
  • More balanced analysis: The tendency to overreact to single data points or minor signals was reduced.
  • Enhanced debugging & Explainable AI (XAI): When errors did occur, the explicit reasoning steps allowed us to quickly pinpoint exactly where the LLM reasoning process went wrong.

What we learned: For any prompt involving analysis, calculation, or decision-making, add a specific instruction to force self-reflection:

  • "Before providing your final analysis/summary/risk score, explain your reasoning step-by-step. Show how you evaluated the input data, what calculations you performed, and how you reached your conclusions."
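Wiring that instruction into a prompt can be as simple as the sketch below. The delimiter-based split is our own simplifying assumption for separating the reasoning from the final answer; it is one convenient option, not the only one:

```python
REASONING_INSTRUCTION = (
    "Before providing your final analysis, explain your reasoning step-by-step: "
    "show how you evaluated the input data, what calculations you performed, "
    "and how you reached your conclusions. Then write 'FINAL ANALYSIS:' followed "
    "by the final analysis only."
)

def build_analysis_prompt(jira_context: str, question: str) -> str:
    """Compose a prompt that forces chain-of-thought reasoning before the answer."""
    return f"{jira_context}\n\n{question}\n\n{REASONING_INSTRUCTION}"

def split_reasoning(model_output: str) -> tuple[str, str]:
    """Separate the step-by-step reasoning (kept for debugging and explainability)
    from the final analysis (shown to users)."""
    reasoning, _, final = model_output.partition("FINAL ANALYSIS:")
    return reasoning.strip(), final.strip()
```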

While this approach consumes additional tokens (increasing cost), the return on investment in terms of LLM accuracy, explainability, and trustworthiness has been significant for our AI Jira analytics use case.

Conclusion: practical insights for building reliable LLM applications

Building reliable AI systems with Large Language Models requires more than just technical knowledge of APIs and prompts. It demands practical, hands-on experience and a deep understanding of these models' unique characteristics, strengths, and limitations, especially when dealing with domain-specific data like Jira project management information.

By proactively addressing data quality issues, providing explicit temporal context, carefully choosing LLM parameters like temperature, implementing self-reflection prompts for improved reasoning, and actively guarding against excessive agreeableness (model bias), we've dramatically improved the reliability, accuracy, and overall effectiveness of our AI-powered Jira analytics system. Applying these lessons allows us to generate more valuable outputs, such as automated summaries that save teams significant time – for instance, our work on producing concise AI Jira Fix Version and Sprint summaries to streamline release and sprint reporting.

These five LLM development lessons have transformed how we approach LLM integration at Luna. We're continuously refining our techniques as we build tools that empower product and engineering teams to extract truly actionable insights from their valuable Jira data.

Are you interested in learning more about how Luna can help your team gain deeper insights from your Jira data? Visit withLuna.ai to discover how our AI-powered Jira analytics can identify risks earlier, track engineering progress and product management KPIs more effectively, and help predict potential delays before they impact your roadmap.

Launch with Luna AI now!
