Commit 091e23e

Author: anonymous
Add submission instructions in README.md
1 parent 5755a3e · commit 091e23e

File tree: 124 files changed (+30, −118908 lines)


README.md

Lines changed: 30 additions & 2 deletions
@@ -2,18 +2,25 @@
 
 JavaBench is a project-level Java benchmark that contains four projects at graduate-level difficulty. The difficulty and quality of JavaBench are validated and guaranteed by graduate students across four years. Please check our [Leaderboard](https://java-bench.github.io/leaderboard.html) for the visualization of the evaluation results.
 
+## Updates
+
+- 2024-06-08 Publish benchmark and leaderboard
+- 2024-07-24 Add instructions for submitting results
+
 ## Benchmark Dataset
 
 The four Java projects in JavaBench are designed for undergraduate students throughout the four academic years from 2019 to 2022. We then use students’ overall scores as evidence of difficulty levels.
 
 ![Dataset](./paper_plot/images/projects.png)
 
 The benchmark dataset is accessible at `./datasets`. We provide three types of datasets with different context settings.
+
 - Maximum Context: The dataset contains as much context information as possible (limited by the LLM context window).
 - Minimum Context: The dataset contains no context information.
 - Selective Context: The dataset contains only the context information that includes method signatures of dependencies extracted by [jdeps](https://docs.oracle.com/en/java/javase/11/tools/jdeps.html).
 
 Below is the structure of the dataset:
+
 - `task_id`: The ID of the completion task, composed of the assignment number and class name.
 - `target`: The file path of the task in the Java project.
 - `code`: The code snippet that needs to be completed with `// TODO`.
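The dataset fields listed above can be inspected with a short script. This is a minimal sketch assuming the JSONL layout implied by the field list; the example record and file name are illustrative, not taken from the repository, and the real files under `./datasets` may carry additional fields:

```python
import json
from pathlib import Path

# Illustrative record shaped like the documented fields
# (`task_id`, `target`, `code`); values are made up for this sketch.
sample = {
    "task_id": "PA19-1.MyClass",
    "target": "src/main/java/MyClass.java",
    "code": "public class MyClass {\n    // TODO\n}\n",
}

# JSONL means one JSON object per line.
path = Path("example.jsonl")
path.write_text(json.dumps(sample) + "\n", encoding="utf-8")

tasks = [json.loads(line) for line in path.read_text(encoding="utf-8").splitlines()]
for task in tasks:
    print(task["task_id"], "->", task["target"])
    # Each completion task marks the code to fill in with `// TODO`.
    assert "// TODO" in task["code"]
```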
@@ -127,7 +134,7 @@ For example:
 
 ```bash
 python evaluation.py test-wise \
---output output/result-PA19/gpt-3.5-turbo/test-wise_result.json \
+--output output/result-PA19/gpt-3.5-turbo/result-full.json \
 --tests data/dataset/testcase/test-PA19.jsonl \
 output/result-PA19/gpt-3.5-turbo/samples.jsonl
 ```
@@ -142,6 +149,27 @@ Below are the instructions for the class-wise evaluation output format:
 - `has_todo`: Indicates whether the inference result contains `// TODO`, as LLMs may exhibit laziness.
 - `can_replace`: Indicates whether the inference result contains a complete class.
 
+### Submission
+
+Now you have three files:
+
+- `samples.jsonl`: Completed code generated by LLMs.
+- `single_class.json`: Evaluation results at class-wise granularity.
+- `result-full.json`: Evaluation results at test-wise granularity.
+
+**If you're having trouble with the evaluation step, you can just upload `samples.jsonl` and we'll evaluate it for you!**
+
+The next step is to submit a pull request to the project:
+
+1. [Fork](https://help.github.com/articles/fork-a-repo/) the repository into your own GitHub account.
+2. [Clone](https://docs.github.com/en/repositories/creating-and-managing-repositories/cloning-a-repository) the repository to your local machine.
+3. Check out a new branch from `main`.
+4. Make a new directory under the output folder corresponding to the dataset (e.g. `./output/holistic-selective/result-PA19/gpt-3.5-turbo-1106`) and copy all the files above into it.
+5. Submit the pull request.
+6. The maintainers will review your pull request soon.
+
+Once your pull request is accepted, we will update the [Leaderboard](https://java-bench.github.io/leaderboard.html) with your results.
+
 ## Contributors
 
-## Citation
+## Citation
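The submission steps added above can be sketched as shell commands. Everything here is illustrative: `FORK_URL`, the branch name, and the model directory are placeholders, and the `git` commands for the fork/clone/push steps are shown as comments because they require your own fork:

```shell
# Sketch of the submission steps; all names are placeholders.

# 1-3. Fork in the GitHub UI, then clone your fork and branch off main:
#   git clone "$FORK_URL" JavaBench && cd JavaBench
#   git checkout -b add-my-results main

# 4. Create the output directory matching the dataset you evaluated
#    and copy the three result files into it (if they exist locally).
RESULT_DIR=output/holistic-selective/result-PA19/gpt-3.5-turbo-1106
mkdir -p "$RESULT_DIR"
for f in samples.jsonl single_class.json result-full.json; do
    if [ -f "$f" ]; then
        cp "$f" "$RESULT_DIR"/
    fi
done
ls "$RESULT_DIR"

# 5-6. Commit, push, and open the pull request from your fork:
#   git add "$RESULT_DIR"
#   git commit -m "Add gpt-3.5-turbo-1106 results"
#   git push origin add-my-results
```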
