This repository, "test_code_eval," contains software for testing whether AI-generated code can help test software. It also contains the scoring and validation code for the NIST GenAI Code Pilot Evaluation. Although the test cases include some sample data that can be used to explore this package, the NIST GenAI Code Pilot Evaluation Data is not in this repository.
This code contains the genai_code_test package.
Full documentation of the genai_code_test package is in the pre-built documentation.
For specific software questions, please contact Sonika Sharma sonika.sharma@nist.gov or Peter Fontana peter.fontana@nist.gov. For questions related to the NIST Generative AI (GenAI) Program, please contact genai_poc@nist.gov.
The contributors to this code repository are:
- Sonika Sharma sonika.sharma@nist.gov
- Peter Fontana peter.fontana@nist.gov
Most configuration options are provided in a .ini file. A typical default file, config.ini, is provided as an example.
Additionally, this program uses one environment variable for convenience:

`GENAI_CODE_CONFIG_PATH` is the absolute path to the default config file, which may be the config.ini file. Within the configuration file is the variable repo_dir, which provides the absolute path to this repository directory.

For convenience, we also store the root path to this repository in the environment variable `GENAI_CODE_REPO_DIR`.
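As a sketch of how these pieces fit together, the config file can be located through `GENAI_CODE_CONFIG_PATH` and read with Python's standard configparser. This is an illustration only; the fallback path and the section name are assumptions, so check config.ini for the actual layout.

```python
import configparser
import os

# Locate the config file via the environment variable; the fallback to a
# local config.ini is an assumption for this sketch.
config_path = os.environ.get("GENAI_CODE_CONFIG_PATH", "config.ini")

config = configparser.ConfigParser()
config.read(config_path)

# repo_dir provides the absolute path to this repository; the "DEFAULT"
# section name is an assumption (check config.ini for the actual section).
repo_dir = config.get("DEFAULT", "repo_dir")
print(repo_dir)
```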
The code is tested on Python 3.12.6 and 3.12.8 and is pip installable with:

```
pip install .
```

or can be installed in editable mode with:

```
pip install -e .
```

The installed package is named genai_code_test. The package can be uninstalled with:

```
pip uninstall genai_code_test
```

or, if installed in editable mode, with:

```
rm -r genai_code_test.egg-info
pip uninstall genai_code_test
```

The folder organization is shown below. There is also this README.md and a CHANGELOG.md.
test_code_eval (<root directory of repository, referred to as $GENAI_CODE_REPO_DIR in this README.md>)
- docs (pre-built documentation)
- genai_code_test (root code directory for the Python code for the genai code experiment and evaluation)
  - baseline_system_creation (scripts to create human-generated baseline submissions from code bank files)
  - evaluation_environment (directory with the scorer and validator)
    - evaluate_submission.py (the scorer)
    - validate_submission.py (the validator)
  - utils (utility scripts that ease working with json submissions and creating data files; relevant scripts include)
    - create_code_files_from_json_input.py (takes a validly formatted json submission and converts it to a folder of folders and files to allow human-readable viewing of the code strings as files)
    - create_json_files_from_code_files.py (takes a folder of folders and files and converts it to a json submission; useful when one has converted a json to folders with the create_code_files_from_json_input.py script and then modified the files in that folder, since this allows one to convert it back to a json)
    - extract_test_code_from_test_output.json (the tool we used to extract validly formatted test_code from LLM outputs; it handles LLM output that followed our prompt instructions but will not handle arbitrary LLM output)
- test_data (directory with test data for test cases)
- test_working_space (working directory where test cases create temporary working files and produce output files)
- tests (location of test suite code)
For an example submission, look at the file test_data/submissions_test/test_smoke_various/test1_v0d99_smoke.json.
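To get a quick feel for the submission format without any tooling, one can load this file with Python's json module. This is a minimal sketch that makes no assumptions about the schema beyond the file being valid json:

```python
import json

# Path relative to $GENAI_CODE_REPO_DIR.
path = "test_data/submissions_test/test_smoke_various/test1_v0d99_smoke.json"

with open(path) as f:
    submission = json.load(f)

# Print the top-level structure of the submission.
if isinstance(submission, dict):
    for key, value in submission.items():
        print(key, type(value).__name__)
else:
    print(type(submission).__name__)
```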
The scripts evaluate_submission.py and validate_submission.py are command-line scripts, and their help menus can be accessed with:

```
python evaluate_submission.py -h
```

and

```
python validate_submission.py -h
```

If you are in the directory $GENAI_CODE_REPO_DIR/genai_code_test/evaluation_environment, the script to validate the example test submission is:
```
# Create output and working directories
mkdir -p $GENAI_CODE_REPO_DIR/scratch_output/evaluation
mkdir -p $GENAI_CODE_REPO_DIR/scratch_working_space
# Switch to the evaluation environment directory
cd $GENAI_CODE_REPO_DIR/genai_code_test/evaluation_environment
# Run validation
python validate_submission.py -i $GENAI_CODE_REPO_DIR/test_data/code_files_test/prob_data/input_smoke_v1d00.json -o $GENAI_CODE_REPO_DIR/scratch_output/evaluation -w $GENAI_CODE_REPO_DIR/scratch_working_space -s $GENAI_CODE_REPO_DIR/test_data/submissions_test/test_smoke_various/test1_smoke.json -v
```

and the script to score the test submission is:
```
python evaluate_submission.py -k $GENAI_CODE_REPO_DIR/test_data/code_files_test/key_data/key_smoke_v1d00.json -o $GENAI_CODE_REPO_DIR/scratch_output/evaluation -w $GENAI_CODE_REPO_DIR/scratch_working_space -s $GENAI_CODE_REPO_DIR/test_data/submissions_test/test_smoke_various/test1_smoke.json
```

In both of these scripts, we provide a working directory with -w <working_dir_path>, where the code can both write temporary files and delete any files in that directory. We also specify the output directory with -o <output_dir_path>.
Both validate_submission.py and evaluate_submission.py may remove all files in the working directory specified by
-w <working_dir_path>. Please provide -w with an empty directory where the code can create and delete files.
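One safe pattern (an illustration, not a requirement of the scripts) is to hand -w a freshly created empty directory, for example via Python's tempfile module. The input and submission paths below are placeholders; the flags mirror the README's examples:

```python
import subprocess
import tempfile

# A fresh, empty working directory the scripts may freely populate and clear.
working_dir = tempfile.mkdtemp(prefix="genai_code_working_")

# Mirrors the README's validation example; replace the placeholder paths
# with real files (see python validate_submission.py -h for the flags).
subprocess.run(
    ["python", "validate_submission.py",
     "-i", "input.json",
     "-o", "output_dir",
     "-w", working_dir,
     "-s", "submission.json",
     "-v"],
    check=True,
)
```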
The test suite, generation of rendered API documentation, and code style checking with a lint tool can all be run locally. Instructions are below.
We have a test suite using the pytest package, with code coverage measured by coverage. This requires the packages coverage and pytest, both of which can be installed with pip.
The following commands run all the unit tests and output code coverage into htmlcov/index.html:

```
coverage run --branch --source=./genai_code_test -m pytest -s tests/ -v
coverage report -m
coverage html
```

When running the tests, a fixture defined in /tests/conftest.py removes all of the files in test_working_space/temp_working_space and test_working_space/temp_output.
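For reference, a cleanup fixture of that kind typically looks like the sketch below. This is an illustration of the pattern, not the repository's actual conftest.py:

```python
import pathlib
import shutil

import pytest


@pytest.fixture(autouse=True)
def clean_working_dirs():
    # Empty and recreate the temporary working and output directories
    # before each test runs; directory names follow this README.
    for name in ("test_working_space/temp_working_space",
                 "test_working_space/temp_output"):
        path = pathlib.Path(name)
        if path.exists():
            shutil.rmtree(path)
        path.mkdir(parents=True, exist_ok=True)
    yield
```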
The CI uses flake8 to check the code formatting with the command:

```
flake8 --extend-ignore=E712,E402 genai_code_test tests --max-line-length=120 --exclude=docs,./.*
```

For automatic styling, we use the autopep8 package. To style the code, use:

```
autopep8 --max-line-length 120 --aggressive --aggressive --ignore E226,E24,W50,W690,E712,E402 -r genai_code_test tests --in-place
```

If you wish to see what the changes would be without making them, use the --diff option:

```
autopep8 --max-line-length 120 --aggressive --aggressive --ignore E226,E24,W50,W690,E712,E402 -r genai_code_test tests --diff
```

To build the documentation with sphinx and autodoc, run:
```
# A pip installation may be necessary to generate the docs. Install the package with:
# pip install -U -e .
sphinx-apidoc -fMeT -o docs/api genai_code_test
sphinx-build -av --color -b html docs docs/_build
```

to generate the docs. The pip install command is needed for sphinx to recognize the genai_code_test module. (If you wish to document what is installed by pip, use the commented line to upgrade the pip installation.)
See the Sphinx Installation Documentation
for more information on how to install Sphinx. You will also need the m2r package, which is a requirement of this package.
We are using Sphinx 7.
The license is documented in the LICENSE file and on the NIST website.
Certain commercial equipment, instruments, software, or materials are identified in this document to specify the experimental procedure adequately. Such identification is not intended to imply recommendation or endorsement by NIST, nor is it intended to imply that the items identified are necessarily the best available for the purpose. The descriptions and views contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of NIST or the U.S. Government.