7 critical LLM lessons building an AI Jira analytics tool

Learn critical LLM lessons from building an AI Jira analytics tool with GPT-4 & Claude 3.7. Avoid common pitfalls in data, prompts, temperature & reasoning.

Myriam Debahy

May 07, 2025 . 10 min read

Building AI tools with Large Language Models (LLMs) like GPT-4 and Claude 3.7 can feel like navigating uncharted territory. The potential is immense, but the practical challenges of applied AI are very real. Over the past several months, our team at Luna has been developing an AI-powered Jira analytics system. Our goal: extract actionable insights from complex Jira data to help engineering and product teams track progress, identify risks, and predict potential delays more effectively.

What started as an exciting technical challenge quickly showed us the real-world ups and downs of working with LLMs. In this article, I'll share the five most important LLM lessons we learned; insights that would have saved us weeks of troubleshooting and iteration had we known them from the start.

Whether you're actively building AI products, integrating LLMs into existing workflows, looking for LLM best practices, or simply curious about the hurdles in AI project management, these lessons will help you avoid common pitfalls and build more reliable, effective AI systems.

Critical LLM development lessons building an AI Jira analytics tool

Lesson 1: Combatting the LLM 'Yes-Man' - ensuring critical analysis & objectivity

Our final lesson addresses a subtle but critical LLM challenge: models can sometimes be too agreeable, potentially reinforcing user assumptions or biases rather than providing objective, critical analysis. This is sometimes referred to as the "sycophantic" or "yes-man" effect.

We observed that models would occasionally:

Mirror our implicit assumptions within prompts rather than challenging them based on the data.
Agree with flawed premises or leading questions instead of identifying underlying problems.
Omit crucial caveats or risks if not explicitly prompted to look for them.
Deliver potentially inaccurate or hallucinated information with unnervingly high confidence in their tone.

Interestingly, we found differences between models in this regard during our LLM testing. In certain analytical tasks, we observed that Claude models sometimes provided more pushback – flagging gaps, surfacing potential blockers unprompted, and offering more critical assessments compared to some other models we tested at the time.

What we learned: To counteract the "yes-man" effect and promote objective AI analysis:

Explicitly instruct models to be critical: Add prompts like "Critically evaluate this data. What's missing? What assumptions might be flawed? What are the biggest risks you see?"
Frame questions neutrally: Avoid leading questions that suggest a desired answer. Ask "What does the data indicate about sprint progress?" instead of "Is the sprint significantly delayed based on this data?"
Consider model diversity: For tasks requiring strong analytical distance or critical evaluation, experiment with different models known for different strengths (always verifying outputs).

Verify, don't just trust: Remember that fluent, confident-sounding responses don't automatically equate to accuracy. Always cross-reference critical insights, especially those derived from complex Jira data.

Lesson 2: LLM mistakes? Audit your input data first (It's often not the prompt)

When an LLM produces unexpected or incorrect output, the natural reaction is to blame the model or refine the prompt. We spent countless hours tweaking instructions, adding constraints, and breaking tasks into smaller steps – often with minimal improvement in our AI Jira analytics output.

The breakthrough came when we realized that in approximately 80% of problematic cases, the underlying issue wasn't our prompt engineering but the quality and consistency of the data we were feeding into the LLMs.

Common data quality issues impacting LLM performance:

Inconsistent calculations: The same engineering metrics (like sprint velocity or cycle time) being calculated differently across various Jira projects or data sources.
Missing context: Values or statuses that seemed obvious to human readers but lacked necessary contextual information for the AI to interpret correctly (e.g., 'Done' meaning different things in different workflows).
Conflicting information: Subtle contradictions between different data points (e.g., start/end dates) that humans might ignore or reconcile but thoroughly confused the LLM.

What we learned: Before overhauling your prompts, rigorously audit your input data quality:

Ensure naming conventions (for fields, statuses, projects) are consistent across all data sources.
Verify that key metrics are calculated identically everywhere they appear.
Actively check for gaps, ambiguities, or conflicts in your input data before passing it to the model.

We found that when presented with conflicting or ambiguous data, LLMs don't "hallucinate" randomly; they attempt to improvise reasonable interpretations based on flawed input. By cleaning and standardizing our Jira data inputs, we achieved far more reliable and accurate outputs, often without needing overly complex prompt engineering. Ensuring high data quality for AI is key.

Lesson 3: LLMs & time blindness - explicit temporal context is non-negotiable

One of our most surprising discoveries was that even advanced models like Claude 3.7 and GPT-4 have no inherent understanding of the current date or how to interpret relative time references ("last week," "next sprint") without explicit guidance. This is a critical LLM challenge for time-sensitive analysis.

This became particularly problematic when analyzing sprint deadlines, issue aging, and time-based risk factors within our Jira analytics tool. The models would fail to identify overdue tasks or miscalculate time remaining until deadlines simply based on dates alone.

Time-related LLM issues we encountered:

Models couldn't determine if dates like "April 10" were in the past or future without the current date as an explicit reference point.
Calculating "days ago" or "days remaining" based solely on dates produced inconsistent or incorrect results.
Time-based risk assessments (like flagging items stuck in a status too long) were unreliable without pre-calculated durations.

What we learned: Implement a pre-processing layer to handle all time calculations and provide absolute and relative temporal context before passing data to the LLM:

Instead of: "Task created on March 20, deadline April 10" (Assuming current date is March 26, 2025)
Use: "Task created on March 20, 2025 (6 days ago), deadline April 10, 2025 (15 days remaining from today, March 26, 2025)"

By pre-computing all time-based calculations and providing explicit context (including the "current date" used for calculations), we eliminated an entire category of errors and significantly improved the accuracy of our AI-powered Jira insights.

Lesson 4: Beyond temperature 0 - finding the sweet spot for structured LLM outputs

For highly structured LLM outputs like analytics reports or risk summaries, our initial assumption was that setting the temperature parameter to 0 (minimizing randomness) would yield the most consistent and accurate results. This proved counterproductive for nuanced analysis.

At temperature = 0, we observed that models often became:

Excessively rigid: Treating minor potential issues with the same alarming severity as major blockers.
Context-blind: Missing important nuances or mitigating factors that required proportional, less absolute responses.
Overly deterministic: Making definitive declarations even when data was incomplete or ambiguous.
Edge-case fragile: "Panicking" over normal conditions (like zero velocity at the very start of a sprint) instead of recognizing them as expected patterns.

What we learned: For tasks requiring both structure AND analytical judgment (like risk assessment in Jira data), a temperature setting between 0.2 and 0.3 often provides the optimal balance:

It maintains consistent formatting and output structure.
It allows for appropriate nuance in assessments, recommendations, and severity scoring.
It handles edge cases and ambiguity more gracefully.
It produces more natural, human-like analysis that users find more trustworthy and actionable.

This small adjustment to LLM parameters dramatically improved our risk assessment accuracy and reduced the need for extensive prompt engineering just to handle edge cases.

Lesson 5: Boost LLM accuracy with forced self-reflection (Chain-of-thought)

Even with state-of-the-art models like GPT-4 and Claude 3.7, we noticed occasional lapses in reasoning or simple computational errors that affected output quality. The solution came from adapting techniques like "chain-of-thought" prompting: requiring models to explain their reasoning step-by-step before providing the final conclusion.

This approach produced remarkable improvements in our LLM development process:

Better prioritization: Models became more discerning about what truly constituted a high-priority risk versus a minor observation.
Improved computation: Simple mathematical errors (like summing story points or calculating percentages) decreased significantly.
More balanced analysis: The tendency to overreact to single data points or minor signals was reduced.
Enhanced debugging & Explainable AI (XAI): When errors did occur, the explicit reasoning steps allowed us to quickly pinpoint exactly where the LLM reasoning process went wrong.

What we learned: For any prompt involving analysis, calculation, or decision-making, add a specific instruction to force self-reflection:

"Before providing your final analysis/summary/risk score, explain your reasoning step-by-step. Show how you evaluated the input data, what calculations you performed, and how you reached your conclusions."

While this approach consumes additional tokens (increasing cost), the return on investment in terms of LLM accuracy, explainability, and trustworthiness has been significant for our AI Jira analytics use case.

Lesson 6: Is your LLM talking too much? Focus through constraint

LLMs can do amazing things. But letting them "talk" too much, especially on tasks needing clear, logical steps, can actually open the door for more errors, weird inconsistencies, or just plain unnecessary fluff.

The fix? Get straight to the point:

Ask for less: Simply reduce the number of sections or pieces of information you want in the output.
Offload basic tasks: If something can be done easily outside the LLM (like simple math or data sorting), do it there. Let the LLM focus on what only it can do.
One task at a time: If you have several different things you need, use separate LLM calls. Don't try to cram everything into one giant prompt.
Consider shorter output limits: Setting a lower max token limit can force the LLM to be more concise and focused.

Why this actually works:

LLMs aren't infinite: They have limits on how much information they can process well at once (context windows) and how much "thinking" they can do for a single query. Fewer tasks mean they can dedicate more of their capacity to getting each one right.
Focus is key (even for AIs): Just like humans, LLMs don't multitask as well as we might think. Giving them fewer things to do at once usually leads to better quality on each specific task.

Good to keep in mind:

Creative tasks are different – if you're brainstorming or exploring ideas, letting the LLM generate more content can be very useful.
Newer models are getting better at handling more complex requests and reasoning.

Helping your LLM focus by asking for less can significantly boost its performance. Sometimes, the best way to get more from your AI is to ask it to say less.

Lesson 7: Are you really making the most of LLM reasoning models?

The latest generation of reasoning models like Claude 3.7, GPT o-series and Gemini 2.5 Pro represent a fundamental shift in how LLMs approach problems. They're not just "smarter LLMs" – these models think differently, solving problems step-by-step in a visible, logical chain rather than jumping straight to an answer.

But here's the catch: They don't use that reasoning fully unless you explicitly prompt for it.

It's like having a brilliant colleague who could map out a complex problem on a whiteboard – but instead does it all in their head and hands you a final answer. You don't see the steps. You can't verify the logic. And worse – they're more likely to miss guidelines or make mistakes without working things out step by step.

For reasoning models, explicitly request visible reasoning:

"Explicit your reasoning inside your <thinking> bloc."

The tradeoffs to consider:

⏳ Latency: Reasoning takes time. Do you show the thinking, or hide it behind a wait?
💸 Cost: More tokens = higher cost. But for high-stakes analytical calls, the transparency and trust can be worth it.‍
🎯 Accuracy: Visible reasoning leads to more reliable outputs, especially for complex analytical tasks like risk assessment.

Conclusion: practical insights for building reliable LLM applications

Building reliable AI systems with Large Language Models requires more than just technical knowledge of APIs and prompts. It demands practical, hands-on experience and a deep understanding of these models' unique characteristics, strengths, and limitations, especially when dealing with domain-specific data like Jira project management information.

By proactively addressing data quality issues, providing explicit temporal context, carefully choosing LLM parameters like temperature, implementing self-reflection prompts for improved reasoning, constraining output scope for better focus, leveraging reasoning models effectively, and actively guarding against excessive agreeableness (model bias), we've dramatically improved the reliability, accuracy, and overall effectiveness of our AI Jira Fix Version and Sprint summaries

These seven LLM development lessons have transformed how we approach LLM integration at Luna. We're continuously refining our techniques as we build tools that empower product and engineering teams to extract truly actionable insights from their valuable Jira data.

Are you interested in learning more about how Luna can help your team gain deeper insights from your Jira data? Visit withLuna.ai to discover how our AI-powered Jira analytics can identify risks earlier, track engineering progress and product management KPIs more effectively, and help predict potential delays before they impact your roadmap.

7 critical LLM lessons building an AI Jira analytics tool

Lesson 1: Combatting the LLM 'Yes-Man' - ensuring critical analysis & objectivity

Lesson 2: LLM mistakes? Audit your input data first (It's often not the prompt)

Lesson 3: LLMs & time blindness - explicit temporal context is non-negotiable

Lesson 4: Beyond temperature 0 - finding the sweet spot for structured LLM outputs

Lesson 5: Boost LLM accuracy with forced self-reflection (Chain-of-thought)

Lesson 6: Is your LLM talking too much? Focus through constraint

Lesson 7: Are you really making the most of LLM reasoning models?

Conclusion: practical insights for building reliable LLM applications

Related articles

AI Jira Fix Version summaries: automate release reporting

AI Jira Sprint summaries: save hours on reporting

7 Lessons from building a Large Language Model summarization tool

Stop drowning in busywork.
Start delivering value!

7 critical LLM lessons building an AI Jira analytics tool

Lesson 1: Combatting the LLM 'Yes-Man' - ensuring critical analysis & objectivity

Lesson 2: LLM mistakes? Audit your input data first (It's often not the prompt)

Lesson 3: LLMs & time blindness - explicit temporal context is non-negotiable

Lesson 4: Beyond temperature 0 - finding the sweet spot for structured LLM outputs

Lesson 5: Boost LLM accuracy with forced self-reflection (Chain-of-thought)

Lesson 6: Is your LLM talking too much? Focus through constraint

Lesson 7: Are you really making the most of LLM reasoning models?

Conclusion: practical insights for building reliable LLM applications

Related articles

AI Jira Fix Version summaries: automate release reporting

AI Jira Sprint summaries: save hours on reporting

7 Lessons from building a Large Language Model summarization tool

Stop drowning in busywork. Start delivering value!

Stop drowning in busywork.
Start delivering value!