Welcome to the official repository of the aiXcoder-7B Code Large Language Model. This model is designed to understand and generate code across multiple programming languages, offering state-of-the-art performance in code completion, comprehension, generation, and other programming tasks.

In our ongoing exploration to apply large code models, the release of aiXcoder 7B represents a significant milestone.

However, plans for further development of the aiXcoder model series are already in motion. In the near future, we aim to release new versions of the model, meticulously instruct-tuned for a wider range of programming tasks, including but not limited to test case generation and code debugging. Through these instruct-tuned models, we anticipate offering developers more comprehensive and deeper programming support, helping them maximize efficiency at every stage of software development.

> aiXcoder 7B surpasses mainstream models on NL2Code benchmarks. aiXcoder-7B is an enhancement of aiXcoder-7B-Base, fine-tuned for one epoch on one hundred thousand data entries similar to Evol-Instruct.

<br>
<br>

> aiXcoder 7B Base surpasses mainstream models in code completion scenarios.

<br>
<br>

To run the model inference code, you'll need a recent Python environment with PyTorch and the other dependencies listed in the repository's requirements file. Please ensure all dependencies are installed before running the model.
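
For example, a typical installation step, assuming the repository provides a requirements.txt at its root (adjust the path to your checkout if it differs):

```bash
# Install the Python dependencies declared by the repository (path assumed).
pip install -r requirements.txt
```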
## Training Data
The data for aiXcoder is divided into a core dataset and an extended dataset. The core dataset comprises the programming languages commonly used in development, as well as natural languages closely related to code. The core dataset's programming languages mainly include nearly a hundred mainstream languages such as C++, Python, Java, and JavaScript, while the natural language component primarily consists of StackOverflow Q&As, technical blogs, code documentation, and computer science papers. The extended data mainly consists of filtered open-source code datasets, high-quality English natural language datasets, and high-quality Chinese natural language datasets.

Currently, the mainstream evaluation dataset for context-aware code completion is limited in scale, in the diversity of the code it covers, and in context length.

To evaluate the code completion capabilities of large language models for code in a more fine-grained manner, aiXcoder has built an evaluation dataset that is larger, covers more diverse code, has longer contexts, and is closer to real-world development projects. This evaluation dataset will also be open-sourced on GitHub. During evaluation, we ensure that all models use the same maximum sequence length of 16K and measure generation performance in different scenarios, such as generating complete method blocks, conditional blocks, loop-processing blocks, and exception-handling blocks, covering thirteen scenarios in total.
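
As an illustration only, here is a minimal sketch of what such a fixed-budget completion loop could look like with the Hugging Face transformers library; the checkpoint id, the truncation policy, and the toy sample below are assumptions rather than the harness actually used for these results.

```python
# Hypothetical evaluation-loop sketch; not the official aiXcoder harness.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "aiXcoder/aixcoder-7b-base"  # assumed Hugging Face checkpoint id
MAX_SEQ_LEN = 16 * 1024                 # the same 16K token budget for every model

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def complete(context: str, max_new_tokens: int = 128) -> str:
    """Generate a completion while keeping prompt + output inside the 16K window."""
    ids = tokenizer(context, return_tensors="pt").input_ids
    ids = ids[:, -(MAX_SEQ_LEN - max_new_tokens):]  # drop the oldest context first
    out = model.generate(ids.to(model.device), max_new_tokens=max_new_tokens)
    return tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True)

# One (context, reference) pair per completion scenario: method block, loop body, ...
samples = [("def add(a, b):\n    ", "return a + b")]  # toy placeholder data
for context, reference in samples:
    prediction = complete(context)
    hit = prediction.strip().startswith(reference)  # one crude scoring choice
```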
Table 3 shows the average generation performance of different models in different languages. The final evaluation results are the average of all completion scenarios and evaluation samples. The aiXcoder 7B Base model achieves the best performance across major programming languages and various evaluation criteria, indicating that aiXcoder 7B Base has the best basic code completion capability among all open-source models of the same scale and is the most suitable base model for providing code completion capabilities in real-world programming scenarios.

In Table 8, we first evaluate the generation capability of each large language model.
## License
The source code in this repository is licensed under the [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0) License; see the LICENSE file for details.

The model weights are licensed under the [Model License](./MODEL_LICENSE) for academic research use; for commercial use, please apply by sending an email to support@aiXcoder.com.
## Acknowledgments
We would like to thank all contributors to the open-source projects and datasets that made this work possible.

Thank you for your interest in our Code Large Language Model. We look forward to your contributions and feedback!