Student Name: Lu Jiaxiang
The task I chose is code generation.
In my experiments, I observed several test results, summarized in the tables below. The most significant improvement comes from adjusting the prompt, which yields roughly a 1% relative accuracy increase over the zero-shot baseline (67.07% → 67.68%).
Implemented Methods
Architecture
| Model | Test set | Test set size |
| --- | --- | --- |
| Llama-3.1-8B Instruct | HumanEval | 164 |
Hyperparameters
| Temperature | Top_p | Top_k |
| --- | --- | --- |
| 0.1-0.5 | 0.1-0.5 | model default |
- Across test cases that use different methods, I employed the same set of parameters.
- Within test cases that use the same method, I experimented with different parameters to identify the best-performing configuration.
- For method comparison, I fixed Top_p and Temperature at 0.1 to minimize the influence of other factors (see the sketch below).
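As an illustration of these two settings, the sketch below writes them out explicitly; the exact sweep grid beyond the stated 0.1-0.5 range is an assumption, not the precise values I tested.

```python
# Sampling settings used in the experiments (illustrative sketch; the exact
# sweep grid is an assumption beyond the 0.1-0.5 range stated above).
SWEEP_GRID = {
    "temperature": [0.1, 0.2, 0.3, 0.4, 0.5],  # tuned within a single method
    "top_p": [0.1, 0.2, 0.3, 0.4, 0.5],
}

# Fixed setting used when comparing methods against each other.
COMPARISON_SETTING = {
    "temperature": 0.1,
    "top_p": 0.1,
    # top_k is left at the model/server default, as noted in the table.
}
```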
Code Structure
In this section, essential code snippets are provided to help understand the experiments conducted.
api.py
    class Model:

This class defines the logic for accessing the model API.
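Below is a minimal sketch of what such a wrapper might look like, assuming the model is served behind an OpenAI-compatible endpoint (for example via vLLM); the base URL, model name, and the `chat` method signature are my assumptions, not the actual implementation.

```python
# api.py -- minimal sketch, assuming an OpenAI-compatible serving endpoint.
# The base_url, model name, and method signature are assumptions.
from openai import OpenAI


class Model:
    def __init__(self, base_url: str = "http://localhost:8000/v1",
                 model: str = "meta-llama/Llama-3.1-8B-Instruct"):
        self.client = OpenAI(base_url=base_url, api_key="EMPTY")
        self.model = model

    def chat(self, messages, temperature: float = 0.1, top_p: float = 0.1) -> str:
        # Send the message list to the model and return the text of the reply.
        resp = self.client.chat.completions.create(
            model=self.model,
            messages=messages,
            temperature=temperature,
            top_p=top_p,
        )
        return resp.choices[0].message.content
```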
baseline.py
    def genMess(assi: bool = False):

- This function generates the input messages for the model.
- Its assi argument indicates whether to add an assistant message that clarifies the expected output format.
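A sketch of how genMess might assemble these messages; for self-containedness I pass the problem prompt as an argument (the original signature takes only assi), and the example assistant content is purely illustrative.

```python
# baseline.py -- sketch of the message builder. The prompt wording, the extra
# problem_prompt parameter, and the assistant example are illustrative assumptions.
def genMess(problem_prompt: str, assi: bool = False) -> list[dict]:
    messages = [
        {"role": "system", "content": "Environment: ipython\n"},
        {"role": "user", "content": problem_prompt},
    ]
    if assi:
        # Optionally seed an assistant turn that demonstrates the expected
        # output format (bare Python code, no extra explanation).
        messages.append({
            "role": "assistant",
            "content": "```python\n# complete the function below\n```",
        })
    return messages
```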
main.py
    def run_zero_shot():

The zero-shot function calls the API to chat with the model and writes the results directly to a JSONL file.
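A sketch of that loop, reusing the Model and genMess sketches above and the read_problems / write_jsonl helpers from the openai/human-eval package; assuming that package is how the data is handled, and the output file name is illustrative.

```python
# main.py -- sketch of the zero-shot loop. Assumes the openai/human-eval
# package for data handling; Model and genMess are the sketches above.
from human_eval.data import read_problems, write_jsonl


def run_zero_shot():
    model = Model()
    samples = []
    for task_id, problem in read_problems().items():
        messages = genMess(problem["prompt"])
        completion = model.chat(messages, temperature=0.1, top_p=0.1)
        samples.append({"task_id": task_id, "completion": completion})
    # Results go straight to a JSONL file, which can then be scored for
    # functional correctness against the HumanEval unit tests.
    write_jsonl("zero_shot_samples.jsonl", samples)
```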
LEVER
The main function is run(), which partially implements the LEVER method from the paper "LEVER: Learning to Verify Language-to-Code Generation with Execution".
- The LEVER (Learning to Verify Language-to-Code Generation with Execution) method uses a medium-sized LLM to generate, test, and refine code. Here are the steps in the LEVER process:
Step Descriptions:
1. Input problem description: the user provides a problem described in natural language that needs to be translated into code.
2. LLM generates candidate code: a language model generates candidate code based on the problem description.
3. Execute code: the candidate code is executed in a controlled environment, and the outputs or errors are captured (see the execution sketch after this list).
4. Verifier evaluates: a trained verifier assesses the correctness of the code based on the execution results and the problem description.
5. Are the execution results correct?
   - If yes, output the refined code.
   - If no, the LLM regenerates the code based on the feedback and the process returns to step 3 for re-execution.
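As an illustration of the execution step, the sketch below runs a candidate program in a separate process with a timeout and captures its output or error; this is my own minimal version, not the executor from the paper.

```python
# Sketch of step 3 (Execute code): run a candidate program in a separate
# process with a timeout and capture stdout/stderr. Minimal illustration,
# not the executor used in the LEVER paper.
import subprocess
import sys


def execute_candidate(code: str, timeout: float = 5.0) -> tuple[bool, str]:
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout,
        )
        ok = proc.returncode == 0
        return ok, proc.stdout if ok else proc.stderr
    except subprocess.TimeoutExpired:
        return False, "TimeoutExpired"
```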
- This formula shows how a candidate output is scored: the LM's generation probability (P_LM) is combined with the verifier's probability that the program is correct given its execution results (P_θ).
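My paraphrase of the paper's reranking score is below; the notation is reconstructed from the paper, so treat the exact form as an approximation rather than a quote.

```latex
% Reranking score in LEVER (paraphrased):
%   x                    -- natural-language problem description
%   \hat{y}              -- a candidate program
%   \mathcal{E}(\hat{y}) -- execution result of the candidate
%   P_{LM}               -- generation probability of the code LLM
%   P_{\theta}           -- probability assigned by the learned verifier
R(x, \hat{y}) \;=\; P_{LM}(\hat{y} \mid x)\,\cdot\,P_{\theta}\!\bigl(v = 1 \mid x, \hat{y}, \mathcal{E}(\hat{y})\bigr)
```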
Code that implements LEVER

    def run():
Code commentary
- As outlined above, this code can be divided into three parts (a condensed sketch follows this list):
  1. LLM generates candidate code: the model generates candidate code from the problem description.
  2. Execute code: the candidate code is executed in a controlled environment, and the outputs or errors are captured.
  3. Verifier evaluates: the verifier assesses the correctness of the code based on the execution results and the problem description.
- If an error occurs again, the process goes back to the first part and the code is regenerated.
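A condensed sketch of run(), reusing the Model, genMess, execute_candidate, read_problems, and write_jsonl pieces introduced above; the retry limit, prompts, and file name are my own illustrative choices, not the exact implementation.

```python
# main.py -- condensed sketch of run(): generate, execute, verify, retry.
# MAX_RETRIES, the prompts, and the output file name are illustrative choices.
MAX_RETRIES = 3


def run():
    model = Model()
    samples = []
    for task_id, problem in read_problems().items():
        messages = genMess(problem["prompt"])
        completion = model.chat(messages)                      # part 1: generate
        for _ in range(MAX_RETRIES):
            # Part 2: executing prompt + completion catches syntax/runtime
            # errors raised when the candidate function is defined.
            ok, feedback = execute_candidate(problem["prompt"] + completion)
            # Part 3: the same Llama model acts as the verifier, judging the
            # candidate from the execution feedback (LEVER trains a separate
            # verifier for this).
            verdict = model.chat([
                {"role": "system", "content": "You verify Python solutions."},
                {"role": "user",
                 "content": f"Problem:\n{problem['prompt']}\n"
                            f"Candidate:\n{completion}\n"
                            f"Execution feedback:\n{feedback}\n"
                            "Answer 'yes' if the candidate is correct, else 'no'."},
            ])
            if ok and verdict.strip().lower().startswith("yes"):
                break
            # Otherwise, go back to part 1: regenerate using the feedback.
            messages += [
                {"role": "assistant", "content": completion},
                {"role": "user",
                 "content": f"The previous attempt failed:\n{feedback}\n"
                            "Return a corrected completion only."},
            ]
            completion = model.chat(messages)
        samples.append({"task_id": task_id, "completion": completion})
    write_jsonl("lever_samples.jsonl", samples)
```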
Unlike LEVER, we use the same model (Llama-3.1-8B) for test generation.
Zero-shot baseline
| Data size | Correctness (%) | Total tokens | Total time (s) | Avg. tokens per problem | Avg. time per problem (s) |
| --- | --- | --- | --- | --- | --- |
| 164 | 67.07 | 46759 | 155 | 285.12 | 0.95 |
This zero-shot result is based on only two messages:
- one from the system,
- one from the user.
    {"role": "system", "content": "Environment: ipython\n"},
    {"role": "user", "content": test_prompt}
Improved Prompt
Based on my experiments, some prompt adjustments lead to better model performance (a hypothetical example of such an adjustment follows the table below).
| Data size | Correctness (%) | Total tokens | Total time (s) | Avg. tokens per problem | Avg. time per problem (s) |
| --- | --- | --- | --- | --- | --- |
| 164 | 67.68 | 45361 | 204 | 276.59 | 1.24 |
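As a purely hypothetical illustration of the kind of adjustment this refers to (not the exact prompt behind the 67.68% result), one might tighten the system message around output format:

```python
# Hypothetical prompt adjustment; NOT the exact prompt used in the experiments.
from human_eval.data import read_problems

IMPROVED_SYSTEM = (
    "Environment: ipython\n"
    "You are an expert Python programmer. Complete the given function and "
    "return only valid Python code, with no explanations."
)

test_prompt = read_problems()["HumanEval/0"]["prompt"]  # example problem
messages = [
    {"role": "system", "content": IMPROVED_SYSTEM},
    {"role": "user", "content": test_prompt},
]
```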
LEVER
| Data size | Correctness (%) | Total tokens | Total time (s) | Avg. tokens per problem | Avg. time per problem (s) |
| --- | --- | --- | --- | --- | --- |
| 164 | 66.46 | 46468 | 375 | 283.34 | 2.29 |
Reason for Subpar Results with LEVER
Despite employing the LEVER method, which integrates code generation, execution, and verification into a cohesive workflow, the results did not improve over the zero-shot baseline (66.46% vs. 67.07%). There are several potential reasons for this outcome:
- Model Capacity: The LEVER method, in this case, uses the Llama-3.1-8B model for both code generation and verification. It is possible that the model's size and capacity are not sufficient to handle the complexity of the coding problems effectively.
- Quality of Training Data: The performance of the verifier, a critical component of LEVER, is heavily reliant on the quality and quantity of the training data. If the training data is limited or not diverse enough, the verifier may not generalize well to unseen problems.
- Error Handling: While LEVER includes a mechanism to refine code based on execution results, it may not be robust enough to catch all types of errors, especially those that are not immediately apparent through simple execution.
- Overfitting to Training Data: The verifier might be overfitting to the specific patterns seen in the training data, leading to poor performance on new, unseen problems.
- Complexity of Coding Problems: Some coding problems may be inherently complex, requiring a deeper understanding and more nuanced reasoning than current LLMs, including Llama-3.1-8B, can provide.
- Integration of Components: The seamless integration of the three components (code generation, execution, and verification) in LEVER can be challenging, and any inefficiencies in this pipeline could lead to suboptimal results.
To improve the outcomes, it may be necessary to explore larger models, enhance the training data for the verifier, or refine the error handling and feedback mechanisms within the LEVER framework.