
Commit b896bd4: fix some bugs (1 parent: 39e580c)

File tree: 16 files changed, +69 −118 lines

MODEL_LICENSE
Lines changed: 1 addition & 1 deletion

@@ -1,4 +1,4 @@
-The aiXcoder License
+The aiXcoder Model License
 
 1. Definitions
 

README.md
Lines changed: 28 additions & 15 deletions

@@ -1,5 +1,9 @@
 # aiXcoder-7B Code Large Language Model
 
+<p align="center">
+    🏠 <a href="https://www.aixcoder.com/" target="_blank">Official website</a>|🛠 <a href="https://marketplace.visualstudio.com/items?itemName=aixcoder-plugin.aixcoder" target="_blank">VS Code Plugin</a>|🛠 <a href="https://plugins.jetbrains.com/plugin/13574-aixcoder-code-completer" target="_blank">Jetbrains Plugin</a>|🤗 <a href="https://huggingface.co/aiXcoder/aiXcoder-7b" target="_blank">Model Weights</a>|<a href="" target="_blank">WeChat</a>|<a href="./assets/wechat_2.jpg" target="_blank">WeChat Official Account</a>
+</p>
+
 Welcome to the official repository of aiXcoder-7B Code Large Language Model. This model is designed to understand and generate code across multiple programming languages, offering state-of-the-art performance in code completion, comprehension, generation, and other programming tasks.
 
 Table of Contents
@@ -34,13 +38,13 @@ In our ongoing exploration to apply large code models, the release of aiXcoder 7
 However, plans for further development of the aiXcoder model series are already in motion. In the near future, we aim to release new versions of the model that have been meticulously instruct-tuned for a wider range of programming tasks, including but not limited to test case generation and code debugging. Through these instruct-tuned models, we anticipate offering developers more comprehensive and deeper programming support, helping them to maximize efficiency at every stage of software development.
 
 ![table_1](./assets/table_1.png)
-> aiXcoder 7B surpasses mainstream models in the nl2code benchmark.
+> aiXcoder 7B surpasses mainstream models in the nl2code benchmark. aiXcoder-7B is an enhancement of aiXcoder-7B-Base, fine-tuned for one epoch on one hundred thousand data entries in the style of Evol-instruct.
 
 <br>
 <br>
 
 ![table_3](./assets/table_3.png)
-> aiXcoder 7B surpasses mainstream models in code completion scenarios.
+> aiXcoder 7B Base surpasses mainstream models in code completion scenarios.
 
 <br>
 <br>
@@ -61,6 +65,8 @@ To run the model inference code, you'll need the following environment setup:
 Please ensure all dependencies are installed using the following command:
 
 ```bash
+conda create -n aixcoder-7b python=3.11
+conda activate aixcoder-7b
 git clone git@github.com:aixcoder-plugin/aiXcoder-7b.git
 cd aiXcoder-7b
 pip install -r requirements.txt
@@ -180,30 +186,36 @@ print(quick_sort(arr)) # [1, 2, 3, 4, 5]
 
 ```python
 
-from transformers import AutoModelForCausalLM, AutoTokenizer
-import torch
 
+import torch
+import sys
+from hf_mini.utils import input_wrapper
+from transformers import AutoModelForCausalLM, AutoTokenizer
 
 device = "cuda" # the device to load the model onto
 
 tokenizer = AutoTokenizer.from_pretrained("aiXcoder/aiXcoder-7b")
 model = AutoModelForCausalLM.from_pretrained("aiXcoder/aiXcoder-7b", torch_dtype=torch.bfloat16)
 
 
-text = """▁<AIX-SPAN-PRE>▁<AIX-SPAN-POST>
-# test
-arr = [3, 2, 1, 4, 5]
-print(quick_sort(arr)) # [1, 2, 3, 4, 5]▁<AIX-SPAN-MIDDLE># the file path is: test.py
-# the code file is written by Python
-# quicksort algorithm"""
+text = input_wrapper(
+    # for FIM style input, code_string stands for the prefix context
+    code_string="# quicksort algorithm",
+    # for FIM style input, later_code stands for the suffix context
+    later_code="\n# test\narr = [3, 2, 1, 4, 5]\nprint(quick_sort(arr)) # [1, 2, 3, 4, 5]",
+    # path should be the file's path within the project
+    path="test.py"
+)
 
+if len(text) == 0:
+    sys.exit()
 
 inputs = tokenizer(text, return_tensors="pt", return_token_type_ids=False)
 
 inputs = inputs.to(device)
 model.to(device)
 
-outputs = model.generate(**inputs, max_new_tokens=512)
+outputs = model.generate(**inputs, max_new_tokens=256)
 print(tokenizer.decode(outputs[0], skip_special_tokens=False))
 
 
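Read side by side, the raw prompt string this hunk deletes documents the span layout that `input_wrapper` now hides. The sketch below reconstructs that layout from the deleted lines only: the helper name `fim_prompt` is hypothetical, the real implementation is `hf_mini.utils.input_wrapper`, and details such as how the "written by Python" language hint is derived (presumably from the file extension) are assumptions.

```python
# Hypothetical reconstruction of the FIM prompt layout, inferred solely from the
# raw string this commit deletes. The real helper is hf_mini.utils.input_wrapper,
# which may differ (for example, it can evidently return an empty string on bad
# input, hence the `if len(text) == 0` guard added in the diff above).
def fim_prompt(code_string: str, later_code: str, path: str) -> str:
    return (
        "▁<AIX-SPAN-PRE>"       # prefix span marker; the prefix itself is empty here
        "▁<AIX-SPAN-POST>"      # the suffix context follows this marker
        + later_code
        + "▁<AIX-SPAN-MIDDLE>"  # the model generates the "middle" after this marker
        + f"# the file path is: {path}\n"
        + "# the code file is written by Python\n"  # language hint seen in the deleted lines
        + code_string
    )
```

Called with the example arguments above, this returns exactly the raw string the commit removes, which makes it a convenient sanity check if you ever need to assemble FIM prompts without the helper.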
@@ -240,7 +252,7 @@ def quick_sort(arr):
 
 ## Data for aiXcoder 7B
 
-The core dataset for aiXcoder 7B comprises the programming languages commonly used in development, as well as natural languages closely related to code. The core dataset's programming languages mainly include nearly a hundred mainstream languages such as C++, Python, Java, and JavaScript, while the natural language component primarily consists of StackOverflow Q&As, technical blogs, code documentation, and computer science papers.
+The data for aiXcoder is divided into a core dataset and an extended dataset. The core dataset comprises the programming languages commonly used in development, as well as natural languages closely related to code. The core dataset's programming languages mainly include nearly a hundred mainstream languages such as C++, Python, Java, and JavaScript, while the natural language component primarily consists of StackOverflow Q&As, technical blogs, code documentation, and computer science papers. The extended data mainly consists of filtered open-source code datasets, high-quality English natural language datasets, and high-quality Chinese natural language datasets.
 
 <!-- <br>
 <br>
@@ -327,7 +339,7 @@ Currently, the mainstream evaluation dataset for context-aware code completion i
 
 To further evaluate the code completion capabilities of large language models for code in a more fine-grained manner, aiXcoder has built an evaluation dataset that is larger in size, more diverse in the code being tested, longer in the context length of the code being tested, and closer to real-world development projects. This evaluation dataset will also be open-sourced on GitHub simultaneously. During the evaluation process, we ensure that the different large language models for code all use the same maximum sequence length of 16K, and we evaluate generation performance in different scenarios, such as generating complete method blocks, conditional blocks, loop processing blocks, and exception handling blocks, thirteen scenarios in total.
 
-Table 3 shows the average generation performance of different models in different languages. The final evaluation results are the average of all completion scenarios and evaluation samples. The aiXcoder 7B model achieves the best performance across major programming languages and various evaluation criteria, indicating that aiXcoder 7B has the best basic code completion capability among all open-source models of the same scale and is the most suitable base model for providing code completion capabilities in real-world programming scenarios.
+Table 3 shows the average generation performance of different models in different languages. The final evaluation results are the average of all completion scenarios and evaluation samples. The aiXcoder 7B Base model achieves the best performance across major programming languages and various evaluation criteria, indicating that aiXcoder 7B Base has the best basic code completion capability among all open-source models of the same scale and is the most suitable base model for providing code completion capabilities in real-world programming scenarios.
 
 ![table_3](./assets/table_3.png)
 
@@ -362,11 +374,12 @@ In Table 8, we first evaluate the generation capability of each large language m
 ## License
 
 
-This project is licensed under the [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0) License - see the LICENSE file for details. The model weights are licensed under the Model License.
+The source code in this repository is licensed under the [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0) License - see the LICENSE file for details.
+The model weights are licensed under the [Model License](./MODEL_LICENSE) for academic research use; for commercial use, please apply by sending an email to support@aiXcoder.com.
 
 
 ## Acknowledgments
 
 We would like to thank all contributors to the open-source projects and datasets that made this work possible.
 
-Thank you for your interest in our Code Large Language Model. We look forward to your contributions and feedback!
+Thank you for your interest in our Code Large Language Model. We look forward to your contributions and feedback!

assets/table_1.png (134 KB)
assets/table_2.png (50.1 KB)
assets/table_3.png (165 KB)
assets/table_4.png (76.4 KB)
assets/table_5.png (74.2 KB)
assets/table_6.png (77.4 KB)
assets/table_7.png (83.3 KB)
assets/table_8.png (82.9 KB)
