Prompting LLMs for Code: Patterns for Unit Tests and Refactors

Bekah Funning Jun 11 2026 Artificial Intelligence
Stop asking your AI assistant to "write a function." That’s like handing a blueprint to a contractor without specifying the materials or the load-bearing walls. You’ll get something that looks like a house, but it won’t stand up when the wind blows, or in our case, when you run your test suite.

We’ve all been there. You paste a vague request into ChatGPT or GitHub Copilot, hit enter, and get back code that compiles but fails every edge case. The problem isn’t the model’s intelligence; it’s the ambiguity of your instruction. Recent research has shifted the focus from complex, multi-step reasoning chains to precise, single-shot prompts that embed context, constraints, and examples directly into the request. This approach saves tokens, reduces latency, and drastically improves the reliability of generated code.

The Shift from Chat to Context

Early advice on using Large Language Models (LLMs) for coding often relied on Chain-of-Thought (CoT) prompting. This technique asks the model to "think step-by-step" before writing code. While useful for debugging logic errors, CoT can be inefficient for routine generation. It inflates token usage, increases cost and latency, and can lead to hallucinations where the model contradicts its own reasoning.

A more effective strategy, supported by studies involving benchmarks like BigCodeBench and HumanEval+, is the "Context and Instruction" pattern. Instead of a conversational back-and-forth, you provide a dense, structured prompt that includes:

  • Method Signatures: Exact input types and return values.
  • Docstrings: Clear descriptions of intended behavior.
  • Pre-conditions: What must be true before the code runs (e.g., "input array is not null").
  • Post-conditions: What must be true after execution (e.g., "returns sorted list").

This method treats the prompt as a specification document rather than a chat message. When you define the boundaries clearly, the model doesn’t need to guess your intent. It simply fills in the implementation details within those strict guardrails.
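As a minimal sketch of the four ingredients listed above, the snippet below assembles a specification-style prompt from a Python function stub. The `binary_search` stub, the `build_spec_prompt` helper, and the wording of the prompt are all hypothetical, chosen only for illustration; adapt them to your own code and model.

```python
import inspect


def binary_search(items: list[int], target: int) -> int:
    """Return the index of target in items, or -1 if it is absent."""
    raise NotImplementedError  # the model is asked to supply the body


def build_spec_prompt(func, preconditions, postconditions) -> str:
    """Assemble a specification-style prompt from a function stub."""
    # Signature and docstring are read from the stub itself, so the prompt
    # cannot drift from the code you actually have.
    signature = f"def {func.__name__}{inspect.signature(func)}:"
    lines = [
        "Implement the following Python function.",
        "",
        "Signature:",
        signature,
        "",
        "Docstring:",
        inspect.getdoc(func) or "(none)",
        "",
        "Pre-conditions:",
        *[f"- {p}" for p in preconditions],
        "",
        "Post-conditions:",
        *[f"- {p}" for p in postconditions],
        "",
        "Return the complete function; do not change the signature.",
    ]
    return "\n".join(lines)


prompt = build_spec_prompt(
    binary_search,
    preconditions=["items is sorted in ascending order", "items is not None"],
    postconditions=["items is not modified", "runs in O(log n) time"],
)
print(prompt)
```

Reading the signature and docstring with `inspect` rather than retyping them is a design choice, not a requirement: it keeps the specification and the stub in one place.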

Patterns for Reliable Unit Test Generation

Generating unit tests is one of the highest-value uses of LLMs, yet it’s also where models fail most spectacularly if prompted poorly. A generic request like "write tests for this class" often results in superficial checks that miss critical edge cases.

To fix this, use the "Recipe" pattern. This involves providing concrete examples of input/output pairs alongside the code you want tested. Here’s how to structure it:

  1. Provide the Source Code: Paste the function or class under test.
  2. Define Edge Cases Explicitly: List specific scenarios the model might overlook, such as empty inputs, null values, maximum integer limits, or concurrent access issues.
  3. Specify the Framework: Mention whether you’re using Jest, PyTest, JUnit, or Go testing packages. Include any specific assertion libraries you prefer.
  4. Show One Example Test: Give the model a template of a passing test so it matches your style and structure.

For instance, instead of saying "test this Python function," try: "Write PyTest cases for calculate_discount(). Include tests for negative prices, discounts over 100%, and null customer objects. Follow the existing pattern in test_utils.py which uses pytest.raises for validation errors."

This specificity pushes the model to align with your project’s conventions and coverage goals. Research suggests that including implementation details such as these in the prompt increases the pass rate of generated tests on the first attempt.
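The four steps of the "Recipe" pattern can be sketched as a small data structure that refuses to render until edge cases are listed. Everything here is an assumption made for illustration: the `RecipePrompt` class, the body of `calculate_discount()` (the article names the function but does not show it), and the sample test.

```python
from dataclasses import dataclass, field


@dataclass
class RecipePrompt:
    source_code: str          # step 1: the code under test
    framework: str            # step 3: the testing framework
    edge_cases: list[str] = field(default_factory=list)  # step 2
    example_test: str = ""    # step 4: one test that shows the house style

    def render(self) -> str:
        # An empty edge-case list is the "write tests for this" failure mode,
        # so treat it as an error rather than sending a vague prompt.
        if not self.edge_cases:
            raise ValueError("list at least one edge case explicitly")
        cases = "\n".join(f"- {case}" for case in self.edge_cases)
        return (
            f"Write {self.framework} tests for the code below.\n\n"
            f"Source code:\n{self.source_code}\n\n"
            f"Cover these edge cases:\n{cases}\n\n"
            f"Match the style of this existing test:\n{self.example_test}\n"
        )


# Hypothetical implementation, shown only so the prompt has something to test.
SOURCE = """\
def calculate_discount(price, percent, customer):
    if customer is None or price < 0 or not 0 <= percent <= 100:
        raise ValueError("invalid input")
    return price * (1 - percent / 100)
"""

EXAMPLE_TEST = """\
def test_rejects_negative_price():
    with pytest.raises(ValueError):
        calculate_discount(-1, 10, customer=object())
"""

recipe = RecipePrompt(
    source_code=SOURCE,
    framework="PyTest",
    edge_cases=["negative prices", "discounts over 100%", "customer is None"],
    example_test=EXAMPLE_TEST,
)
print(recipe.render())
```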

Refactoring with Precision

Refactoring is trickier than generation because the model must preserve existing behavior while changing structure. A common pitfall is "semantic drift," where the rewritten code subtly alters logic, breaking features that weren’t explicitly mentioned in the prompt.

To prevent this, anchor your refactor request with behavioral contracts. Use the following structure:

Key Elements for Safe Refactoring Prompts

Element                | Purpose                      | Example
Current Behavior       | Defines what must not change | "Must maintain O(n) time complexity"
Target Structure       | Defines the desired outcome  | "Extract helper methods for validation"
Constraints            | Limits technical choices     | "Do not introduce new dependencies"
Edge Case Preservation | Ensures robustness           | "Handle legacy date formats unchanged"

When you ask an LLM to refactor, you are essentially giving it a surgical order. If you don’t specify which arteries to avoid, it will cut them. By listing pre-conditions and post-conditions, you create a safety net. For example: "Refactor this Java service class to use dependency injection. Ensure that the processOrder() method still throws InsufficientFundsException when balance is negative. Do not change the public API signatures."

This level of detail reduces the need for iterative corrections. Studies indicate that well-crafted single prompts can achieve satisfactory results in fewer interactions than prolonged dialogues, saving developers hours of back-and-forth tweaking.
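A behavioral contract is easier to trust when you can check it. The sketch below is one possible way to spot semantic drift: run the original and the refactored function on the same inputs and report where they disagree, treating a raised exception as part of the behavior. The article does not prescribe this step; `observe`, `find_drift`, and the Python stand-ins for the `processOrder()` example are assumptions for illustration.

```python
class InsufficientFundsError(Exception):
    """Python stand-in for the InsufficientFundsException in the example."""


def process_order(balance, amount):
    """Original behavior: only a negative balance is rejected."""
    if balance < 0:
        raise InsufficientFundsError("negative balance")
    return balance - amount


def process_order_refactored(balance, amount):
    """Hypothetical refactor with subtle drift: a zero balance is rejected too."""
    if balance <= 0:
        raise InsufficientFundsError("negative balance")
    return balance - amount


def observe(func, args):
    """Run func and record either its return value or the exception type."""
    try:
        return ("returned", func(*args))
    except Exception as exc:
        return ("raised", type(exc).__name__)


def find_drift(original, refactored, cases):
    """Return the inputs on which the two implementations disagree."""
    return [
        args for args in cases
        if observe(original, args) != observe(refactored, args)
    ]


cases = [(100, 30), (0, 0), (-5, 10)]
drift = find_drift(process_order, process_order_refactored, cases)
print(drift)  # [(0, 0)]
```

The check is only as good as the inputs you feed it, so the edge cases named in the prompt are a natural starting set.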

Handling Ambiguity and Security

LLMs are probabilistic engines. They predict the next likely token based on training data. If your prompt contains ambiguous terms, the model will choose the most common interpretation, which may not be yours. Words like "efficient," "clean," or "secure" are subjective. To an LLM, "efficient" might mean fewer lines of code, not faster execution.

Resolve ambiguities by replacing subjective words with concrete requirements. Instead of "make it secure," say "sanitize SQL inputs to prevent injection attacks and validate user roles against JWT claims." Instead of "clean code," say "adhere to SOLID principles, specifically Single Responsibility, by separating data fetching from business logic."

Security is a critical area where poor prompting leads to vulnerabilities. Generic prompts often result in code that ignores authentication checks or exposes sensitive data. Incorporate security constraints directly into your prompt templates. For example: "Generate a Node.js endpoint for user profile updates. Ensure that users can only update their own profiles by verifying the userId in the JWT matches the requested ID. Return 403 Forbidden otherwise."

This proactive approach integrates security considerations into the design phase, reducing the risk of introducing flaws that require later patching.
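To make the ownership rule in that prompt concrete, here is a framework-free sketch of the check itself, written in Python rather than Node.js. It assumes the JWT has already been verified and decoded into a `claims` dictionary by your authentication layer; the function name and the `userId` claim key are illustrative assumptions, not a real library's API.

```python
from http import HTTPStatus


def authorize_profile_update(claims: dict, requested_user_id: str) -> HTTPStatus:
    """Allow the update only when the token's user matches the target profile."""
    token_user_id = claims.get("userId")
    # A missing claim is treated the same as a mismatch: deny by default.
    # Both sides are compared as strings so that 42 and "42" are equal.
    if token_user_id is None or str(token_user_id) != str(requested_user_id):
        return HTTPStatus.FORBIDDEN
    return HTTPStatus.OK


print(authorize_profile_update({"userId": "42"}, "42"))
print(authorize_profile_update({"userId": "42"}, "7"))
```

Generated code that lacks an equivalent of this comparison, or that reads the user ID from the request body instead of the token, is a sign the security constraint was dropped.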

Practical Workflow Integration

Integrating these patterns into your daily workflow requires discipline. Don’t treat the LLM as a magic wand; treat it as a junior developer who needs clear instructions. Create a library of prompt templates for common tasks:

  • Unit Test Template: Includes placeholders for source code, edge cases, and framework specifics.
  • Refactor Template: Includes sections for current behavior, target structure, and constraints.
  • Debugging Template: Includes error logs, expected vs. actual output, and relevant code snippets.

Store these templates in a snippet manager or IDE extension. When you need to generate code, fill in the blanks rather than starting from scratch. This consistency ensures that you’re always providing the necessary context for the model to succeed.
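If you would rather keep templates in code than in a snippet manager, Python's standard `string.Template` is one lightweight option. The sketch below shows a hypothetical refactor template whose placeholders mirror the elements discussed earlier; the template wording and the `fill` helper are assumptions.

```python
from string import Template

REFACTOR_TEMPLATE = Template(
    "Refactor the code below.\n\n"
    "Current behavior (must not change):\n$current_behavior\n\n"
    "Target structure:\n$target_structure\n\n"
    "Constraints:\n$constraints\n\n"
    "Code:\n$code\n"
)


def fill(template: Template, **fields: str) -> str:
    """Fill every placeholder; fail loudly if one is forgotten."""
    # substitute() raises KeyError for a missing placeholder, which stops a
    # half-filled prompt from being sent to the model.
    return template.substitute(**fields)


prompt = fill(
    REFACTOR_TEMPLATE,
    current_behavior="Must maintain O(n) time complexity",
    target_structure="Extract helper methods for validation",
    constraints="Do not introduce new dependencies",
    code="def validate(order): ...",
)
print(prompt)
```

Using `substitute()` instead of `safe_substitute()` is deliberate here: a forgotten field becomes an error instead of a silently vague prompt.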

Remember, the goal is not to eliminate human oversight but to reduce the cognitive load of initial drafting. The LLM handles the syntax and boilerplate; you handle the architecture and verification. By refining your prompts, you shift your role from coder to reviewer, allowing you to focus on higher-level design decisions.

Common Pitfalls to Avoid

Even with good patterns, mistakes happen. Here are three common pitfalls that undermine prompt effectiveness:

  1. Overloading Context: Pasting entire files when only a few functions are relevant confuses the model. Trim the context to what’s strictly necessary.
  2. Vague Success Criteria: Saying "it should work" provides no measurable standard. Define exactly what "working" means: does it pass specific tests? Does it meet performance benchmarks?
  3. Ignoring Model Limitations: Different models have different strengths. A fast, inexpensive model such as GPT-4o-mini may be enough for concise, routine code generation, while a model like Llama 3.3 70B might handle complex architectural questions better. Match the tool to the task.

Avoiding these traps keeps your workflow smooth and your codebase clean. The key is iteration, not just in the code, but in your prompting strategy. Analyze failed generations to identify missing context, then update your templates accordingly.

What is the best prompt pattern for generating unit tests?

The "Recipe" pattern is most effective. It involves providing the source code, explicitly listing edge cases (like null inputs or boundary values), specifying the testing framework (e.g., PyTest, Jest), and including one example test case to match your style. This reduces ambiguity and ensures comprehensive coverage.

How do I prevent semantic drift when refactoring code with an LLM?

Anchor your request with behavioral contracts. Clearly define pre-conditions (what must be true before execution) and post-conditions (what must be true after). Specify that public API signatures must remain unchanged and highlight critical edge cases that must be preserved. This acts as a safety net against unintended logic changes.

Why is Chain-of-Thought prompting less ideal for code generation?

Chain-of-Thought (CoT) increases token usage, computational costs, and inference latency. It also exacerbates hallucination risks as the model generates lengthy reasoning steps that may contradict each other. For code generation, single-shot, context-rich prompts are more efficient and reliable.

Can LLMs generate secure code reliably?

Only if security constraints are explicitly included in the prompt. Generic requests often lead to vulnerable code. You must specify requirements like input sanitization, authentication checks, and role-based access control. Treat security as a non-negotiable constraint in your prompt templates.

Which LLMs are best suited for coding tasks?

Models like GPT-4o-mini, Llama 3.3 70B Instruct, Qwen2.5 72B Instruct, and DeepSeek Coder V2 Instruct are highly regarded for code generation. GPT-4o-mini offers speed and accuracy for standard tasks, while larger open-source models like Llama 3.3 excel in complex architectural reasoning. Choose based on your specific needs for latency, cost, and complexity.
