COMP7607 (Assignment 1)
2024-10-30 15:29:06

Student Name: Lu Jiaxiang
The task I chose for this assignment is code generation.

My experiments produced the results shown in the figure below. The most significant improvement comes from adjusting the prompt, which yields roughly a 1% accuracy increase over the zero-shot baseline.
[Figure: test results for the evaluated methods]

Implemented Methods


Architecture

| Model | Test set | Test set size |
| --- | --- | --- |
| Llama-3.1-8B Instruct | HumanEval | 164 |

Hyperparameters

| Temperature | Top_p | Top_k |
| --- | --- | --- |
| 0.1-0.5 | 0.1-0.5 | model default |
  • When comparing different methods, I use the same set of sampling parameters across all test cases.
  • Within a single method, I experimented with different parameter values to identify the best-performing configuration; a sketch of this setup follows the list.
  • For method comparison, Top_p and Temperature are fixed at 0.1 to minimize the influence of factors other than the method itself.
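
A minimal sketch of this setup (the constant names below are illustrative, not taken from the original code):

```python
# Fixed sampling parameters used whenever two methods are compared
# (values from the report); top_k is left at the model default.
COMPARISON_PARAMS = {"temperature": 0.1, "top_p": 0.1}

# Illustrative grid used when tuning parameters within a single method,
# covering the 0.1-0.5 ranges listed in the table above.
SWEEP_GRID = [
    {"temperature": t, "top_p": p}
    for t in (0.1, 0.2, 0.3, 0.4, 0.5)
    for p in (0.1, 0.2, 0.3, 0.4, 0.5)
]
```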

Code Structure

This section provides the essential code snippets needed to understand the experiments.

api.py

```python
import time

from openai import OpenAI


class Model:
    def __init__(self):
        api_keys = []  # fill in one or more SambaNova API keys
        self.clients = []
        for key in api_keys:
            self.clients.append(
                OpenAI(base_url="https://api.sambanova.ai/v1", api_key=key)
            )
        self.clientInd = 0
        self.lastcall = 0
        self.completions = []

    def chat(self, msg):
        wait_time = 1  # start with a 1-second wait time

        while True:
            try:
                # Rotate between clients to spread requests over the API keys.
                self.clientInd = (self.clientInd + 1) % len(self.clients)

                completion = self.clients[self.clientInd].chat.completions.create(
                    model="Meta-Llama-3.1-8B-Instruct",
                    messages=msg,
                    stream=False,
                    temperature=0.1,
                    top_p=0.1,
                )
                self.completions.append(completion)

                wait_time = 1
                return completion

            except Exception as e:
                # Back off and retry, e.g. when the rate limit is exceeded.
                print(f"Error occurred: {e}")
                print("Rate limit exceeded. Waiting for", wait_time, "seconds...")
                time.sleep(wait_time)
                wait_time += 0.5
```

This code defines the logic for accessing the model API: it rotates through multiple API keys and retries with a growing back-off whenever a request fails (for example, when the rate limit is exceeded).
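
A minimal usage sketch (assuming the file above is importable as a module named api and that at least one valid key has been added to api_keys):

```python
from api import Model

model = Model()  # requires api_keys to be filled in inside Model.__init__

message = [
    {"role": "system", "content": "Environment: ipython\n"},
    {"role": "user", "content": "Write a function that returns the sum of a list of integers."},
]

completion = model.chat(message)
print(completion.choices[0].message.content)
```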

baseline.py

```python
import random

from human_eval.data import read_problems


def genMess(assi: bool = False):
    problems = read_problems()

    prompts = []
    messages = []

    for i, p in enumerate(problems):
        prompt = problems[p]["prompt"]
        entry_point = problems[p]["entry_point"]
        prompts.append(prompt)

        test_prompt = prompt
        message = [
            {"role": "system", "content": f"Environment: ipython\n Function name: {entry_point}\n"},
            {"role": "user", "content": test_prompt},
        ]

        if assi:
            # Pick a different problem whose canonical solution is shown as an
            # assistant turn, so the model sees the expected output format.
            assistantIndex = random.randint(0, len(problems) - 1)
            if assistantIndex == i:
                assistantIndex = (i + 1) % len(problems)

            assistantProblem = list(problems.keys())[assistantIndex]
            message = [
                {"role": "system", "content": f"Environment: ipython\n Function name: {entry_point} \n"},
                {"role": "user", "content": problems[assistantProblem]["prompt"]},
                {"role": "assistant", "content": problems[assistantProblem]["canonical_solution"]},
                {"role": "user", "content": "This is the assistant's solution, you can check it out"},
                {"role": "user", "content": test_prompt},
            ]
        messages.append(message)

    return messages
```
  • This function generates the input messages for the model.
  • Its assi argument controls whether an example assistant solution (taken from a different problem) is included, so the model can better follow the expected output format.
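
A minimal usage sketch (assuming baseline.py exposes genMess and api.py exposes Model as shown above):

```python
from api import Model
from baseline import genMess

model = Model()
messages = genMess(assi=True)  # include an example assistant solution in each prompt

# Send the first prompt to the model and inspect the generated code.
completion = model.chat(messages[0])
print(completion.choices[0].message.content)
```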

main.py

```python
import json

# messages, model and res_to_dict are defined at module level
# (see baseline.py and api.py above).


def run_zero_shot():
    with open('response.jsonl', 'w') as jsonl_file:
        for i, message in enumerate(messages):
            testres = False
            testoutput = None
            res_dict = {}
            count = 0

            # Generate a candidate solution and convert the completion into a dict.
            res_dict = res_to_dict(model.chat(message))

            source_code = res_dict["choices"][0]["message"]["content"]

            # Write the raw response for later evaluation.
            jsonl_file.write(json.dumps(res_dict) + '\n')
```

The run_zero_shot function calls the API to chat with the model and writes each raw response directly to a JSONL file.
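
The helper res_to_dict is not shown above; a minimal sketch of what it is assumed to do, namely converting the completion object returned by the openai v1 client into a plain, JSON-serializable dictionary:

```python
def res_to_dict(completion):
    # The openai v1 client returns pydantic models, which expose model_dump();
    # this yields a plain dict that json.dumps can serialize.
    return completion.model_dump()
```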

LEVER

The main function is run(), which partly implements the LEVER method from the paper "LEVER: Learning to Verify Language-to-Code Generation with Execution".

  • The LEVER (Learning to Verify Language-to-Code Generation with Execution) method uses a language model together with execution results to verify and refine generated code. The steps in the LEVER process are:

[Figure: LEVER pipeline overview]
Step Descriptions:

  1. Input Problem Description: A user provides a problem described in natural language that needs to be translated into code.
  2. LLM Generates Candidate Code: A Language Model (LLM) generates candidate code based on the problem description.
  3. Execute Code: The candidate code is executed in a controlled environment, and the outputs or errors are captured.
  4. Verifier Evaluates: A trained verifier assesses the correctness of the code based on the execution results and the problem description.
  5. Are Execution Results Correct?: The verifier determines whether the execution results meet the expected outcomes.
     • If yes, output the refined code.
     • If no, the LLM regenerates the code based on the feedback and returns to step 3 for re-execution.

[Figure: LEVER reranking formula]

  • This formula shows that the final score for a candidate program combines the language model's generation probability (P_LM) with the verifier's probability that the program is correct given its execution results (P_θ).
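
For reference, the reranking rule can be written as follows (a reconstruction from the LEVER paper, since the original image is not reproduced here); P_LM is the generator's probability, P_θ is the verifier's probability that candidate ŷ is correct, and E(ŷ) denotes its execution result:

$$
P_R(\hat{y} \mid x) \;\propto\; P_{LM}(\hat{y} \mid x) \cdot P_\theta\!\left(v = 1 \mid x,\, \hat{y},\, \mathcal{E}(\hat{y})\right)
$$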

Code that implements LEVER

```python
# messages, model, problems, tests, res_to_dict and run_tests are defined at
# module level (see the other files above).


def run():
    with open('response.jsonl', 'w') as jsonl_file:
        for i, message in enumerate(messages):

            print("run {}/{}".format(i + 1, len(messages)))
            testres = False
            testoutput = None
            res_dict = {}
            count = 0

            # Generate a candidate solution.
            res_dict = res_to_dict(model.chat(message))
            source_code = res_dict["choices"][0]["message"]["content"]

            # Ask the model to write test code, showing a randomly chosen
            # canonical test as a formatting example.
            testInd = random.randint(0, len(messages) - 1)
            function_name = problems[list(problems.keys())[testInd]]["entry_point"]
            message.append({"role": "user", "content": f"""
Please provide a test code to test the source code.
- Example test code for function {function_name}:
``python
{tests[testInd]}
``
- Note: You can use the above test code as a reference.
- At least 8 test cases should be provided.
- Make sure the test cases cover all possible cases, including edge cases.
- The test cases should be independent and not rely on each other.
- This test code will be used to run test cases on the previous code you provided.
- I will give the test result to you if any error occurs, and you shall provide a new source code for testing.
- You only need to output code without any descriptions.
"""})

            # Execute the candidate code against the generated tests.
            test_res_dict = res_to_dict(model.chat(message))
            test_code = test_res_dict["choices"][0]["message"]["content"]
            testres, testoutput = run_tests(source_code, test_code)

            function_name = problems[list(problems.keys())[i]]["entry_point"]

            # LEVER-style refinement loop: feed execution results back to the
            # model and regenerate until the tests pass or 5 attempts are used.
            while not testres and count < 5:

                print("Error occurred in the previous code, I will provide a new source code for test.")
                count += 1
                message.append({"role": "user", "content": testoutput + f"""\n
Test results are listed below, please provide another source code (not test code) that resolves the previous errors.
In the source code you only need to finish the definition of function {function_name}()
Example output:
``python
def {function_name}():
    # remaining code here
    # ....
``
You only need to output code without any descriptions.
"""})

                res_dict = res_to_dict(model.chat(message))
                source_code = res_dict["choices"][0]["message"]["content"]
                testres, testoutput = run_tests(source_code, test_code)

                if testres:
                    print("Error fixed!")

            jsonl_file.write(json.dumps(res_dict) + '\n')
```


Code Comments

  • This code can be divided into three parts, as described above:
    First, LLM Generates Candidate Code: the LLM generates candidate code based on the problem description.
    Second, Execute Code: the candidate code is executed in a controlled environment, and the outputs or errors are captured.
    Third, Verifier Evaluates: the verifier assesses the correctness of the code based on the execution results and the problem description.
    Then, if an error occurs again, the process returns to the first step.

Unlike LEVER, we use the same model (Llama-3.1-8B) for test generation and verification rather than a separately trained verifier.
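
The helper run_tests is referenced in run() but not shown. A minimal sketch of what it is assumed to do (the fence-stripping helper and its name are assumptions, not the original implementation): strip markdown fences from the model output, run the candidate code together with the generated tests in a fresh Python process, and return a pass flag plus the captured output.

```python
import re
import subprocess
import sys
import tempfile


def _strip_fences(text: str) -> str:
    # Keep only the code inside a markdown code fence if one is present.
    match = re.search(r"`{2,3}(?:python)?\n(.*?)`{2,3}", text, re.DOTALL)
    return match.group(1) if match else text


def run_tests(source_code: str, test_code: str, timeout: int = 10):
    # Concatenate the candidate code and the generated tests, execute them in a
    # separate Python process, and report (passed, captured output).
    program = _strip_fences(source_code) + "\n\n" + _strip_fences(test_code)
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, text=True, timeout=timeout
        )
        return result.returncode == 0, result.stdout + result.stderr
    except subprocess.TimeoutExpired:
        return False, "Execution timed out."
```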

Zero-shot baseline


| Data size | Correctness (%) | Total Tokens | Total Time (s) | Average Tokens | Average Time cost (s) |
| --- | --- | --- | --- | --- | --- |
| 164 | 67.07 | 46759 | 155 | 285.12 | 0.95 |

This zero-shot result is based on only two messages:

  • one from the system,
  • one from the user.

```python
{"role": "system", "content": f"Environment: ipython\n"},
{"role": "user", "content": test_prompt}
```

Improved Prompt


Based on my experiments, some prompt adjustments improve the model's performance.

| Data size | Correctness (%) | Total Tokens | Total Time (s) | Average Tokens | Average Time cost (s) |
| --- | --- | --- | --- | --- | --- |
| 164 | 67.68 | 45361 | 204 | 276.59 | 1.24 |
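
One visible adjustment relative to the zero-shot prompt (compare genMess above with the zero-shot messages shown earlier) is that the system message also names the target function. A sketch of the adjusted message, assuming the same test_prompt and entry_point variables as in baseline.py:

```python
message = [
    {"role": "system", "content": f"Environment: ipython\n Function name: {entry_point}\n"},
    {"role": "user", "content": test_prompt},
]
```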

LEVER

| Data size | Correctness (%) | Total Tokens | Total Time (s) | Average Tokens | Average Time cost (s) |
| --- | --- | --- | --- | --- | --- |
| 164 | 66.46 | 46468 | 375 | 283.34 | 2.29 |

Reason for Subpar Results with LEVER

Despite employing the LEVER method, which integrates code generation, execution, and verification into a cohesive workflow, the results were slightly worse than the zero-shot baseline. There are several potential reasons for this outcome:

  • Model Capacity: The LEVER method, in this case, uses the Llama-3.1-8B model for both code generation and verification. It is possible that the model's size and capacity are not sufficient to handle the complexity of the coding problems effectively.

  • Quality of Training Data: The performance of the verifier, a critical component of LEVER, is heavily reliant on the quality and quantity of the training data. If the training data is limited or not diverse enough, the verifier may not generalize well to unseen problems.

  • Error Handling: While LEVER includes a mechanism to refine code based on execution results, it may not be robust enough to catch all types of errors, especially those that are not immediately apparent through simple execution.

  • Overfitting to Training Data: The verifier might be overfitting to the specific patterns seen in the training data, leading to poor performance on new, unseen problems.

  • Complexity of Coding Problems: Some coding problems may be inherently complex, requiring deeper understanding and more nuanced reasoning than current LLMs, including Llama-3.1-8B, might be capable of providing.

  • Integration of Components: The seamless integration of the three components (code generation, execution, and verification) in LEVER might be challenging, and any inefficiencies in this pipeline could lead to suboptimal results.

To improve the outcomes, it may be necessary to explore larger models, enhance the training data for the verifier, or refine the error handling and feedback mechanisms within the LEVER framework.