Getting Started


Installation

Install the package from PyPI:

# (Recommended) Create a new conda environment.
conda create -n tail python=3.10 -y
conda activate tail

# Install tailtest
pip install tail-test
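
To confirm the package installed correctly, you can ask pip for its metadata (a quick sanity check; the tail-cli commands below exercise the install for real):

pip show tail-test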

Set your OPENAI_API_KEY as an environment variable.

export OPENAI_API_KEY="..."

For more details, see the Installation Guide.

Prepare a long document

TAIL generates QA pairs for your benchmark based on the documents you input. Prepare the input document as a JSON file in the format [{"text": YOUR_LONG_TEXT}], where YOUR_LONG_TEXT is a long string. We provide an example input document file at /data/example_input.json; if you don't have time to collect your own documents, you can use it to generate a benchmark.
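
For example, here is a minimal Python sketch that wraps a plain-text file into this format (the file paths are placeholders for your own):

import json

# Read your raw document and wrap it in the [{"text": ...}] format TAIL expects.
# "my_document.txt" and "/data/raw.json" are example paths; substitute your own.
with open("my_document.txt", "r", encoding="utf-8") as f:
    long_text = f.read()

with open("/data/raw.json", "w", encoding="utf-8") as f:
    json.dump([{"text": long_text}], f, ensure_ascii=False)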

Generate your own benchmark

The next step is to set the document_length and depth for your benchmark. document_length determines how long the test document in the benchmark will be, while depth indicates where a question's evidence is located within the test document, expressed as a percentage of the document length. For example, setting document_length to 8000 and depth to 50 generates a QA pair and a test document of 8000 tokens, where the evidence for the question is located around the middle of the test document.
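
As an illustration of these semantics (not TAIL's internal code), depth can be read as a percentage offset into the test document:

# depth is a percentage offset into the test document (illustrative only).
document_length = 8000  # target test document length, in tokens
depth = 50              # evidence depth, as a percentage

evidence_position = document_length * depth // 100
print(evidence_position)  # 4000 -> around the middle of the document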

Provide the path to your long document and the path where the benchmark should be saved, specify document_length and depth, and then run tail-cli.build to start benchmark generation. Here's an example:

tail-cli.build --raw_document_path "/data/raw.json" --QA_save_path "/data/QA.json" --document_length 8000 32000 64000 --depth_list 25 50 75

Test LLMs on your benchmark

After generating your benchmark, it's time to evaluate LLMs on it. Provide the test model's name and the path to the saved benchmark, along with the document_length and depth values you want to test; TAIL will automatically run the evaluation and store visualizations in test_result_save_dir.

tail-cli.eval --QA_save_path "/data/QA.json" --test_model_name "gpt-4o" --test_depth_list 25 75 --test_doc_length 8000 32000 --test_result_save_dir "/data/result/"
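
The depths and lengths you pass at evaluation time should presumably be drawn from those you generated in the previous step (25/75 and 8000/32000 above are subsets of the generated 25/50/75 and 8000/32000/64000). For an OpenAI model such as gpt-4o, the OPENAI_API_KEY set during installation also needs to be present in your environment.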