Tutorial on Dataset Preparation#
This hands-on tutorial demonstrates how to prepare and upload a dataset for fine-tuning machine learning models in AMD AI Workbench. You’ll learn to download a dataset from Hugging Face, convert it to the proper JSONL format, and upload it to your workspace using both command-line and Python methods.
By the end of this tutorial, you’ll have a complete workflow for preparing datasets that can be used for model fine-tuning and other AI workloads.
Prerequisites#
Before starting this tutorial, ensure you have:
AMD AI Workbench access: An active JupyterLab workspace in your project
Python knowledge: Basic familiarity with Python programming and Jupyter notebooks
Dataset Preparation and Upload#
Dataset preparation is a crucial step in the machine learning workflow. Your data needs to be in the correct format and structure before it can be used for training or fine-tuning models.
In this tutorial, we’ll work with the JSONL (JSON Lines) format, which is commonly used for conversational AI and instruction-following datasets. We’ll walk through the complete process: downloading a dataset from Hugging Face, converting it to the proper format, and uploading it to AMD AI Workbench for use in your projects.
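Each line in a JSONL file is an independent JSON object. As a preview of the target format this tutorial produces, a single chat-style record (the content shown here is made up purely for illustration) looks like this:
{"messages": [{"role": "user", "content": "Describe the function of this protein. Sequence: MKTAYIAK..."}, {"role": "assistant", "content": "This protein catalyzes ..."}]}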
Step 1: Prepare Your Dataset#
We’ll use the BAAI/OPI dataset from Hugging Face, which contains instruction-following examples perfect for demonstrating the conversion process. This dataset includes structured data with instruction, input, and output fields that we’ll transform into a conversation format.
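For orientation, a raw OPI record contains these three fields; the values below are abbreviated and purely illustrative:
{
  "instruction": "Describe the function of the following protein.",
  "input": "MKTAYIAKQR...",
  "output": "This protein catalyzes ..."
}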
What this code does:
Downloads the dataset from Hugging Face Hub
Converts the JSON format to JSONL with proper message structure
Creates a smaller sample for testing (optional)
Copy and execute the following code in a new cell in your JupyterLab notebook:
from huggingface_hub import hf_hub_download
import json
import os
import random


def convert_opi(input_file, output_file):
    """
    Converts a JSON array of objects with 'instruction', 'input', and 'output' fields
    into a JSONL file formatted for chat-based fine-tuning.

    This function transforms the OPI dataset format into the message format
    expected by modern LLM training frameworks.
    """
    with open(input_file, 'r') as f:
        data = json.load(f)
    with open(output_file, 'w') as f:
        for row in data:
            if all(k in row for k in ("instruction", "input", "output")):
                # Format as conversation with user instruction and assistant response
                line = {
                    "messages": [
                        {"role": "user", "content": f"{row['instruction']} Sequence: {row['input']}"},
                        {"role": "assistant", "content": row["output"]}
                    ]
                }
                f.write(json.dumps(line) + "\n")


def create_sample(input_jsonl, output_jsonl, n):
    """Create a random sample of n lines from input_jsonl for faster experimentation."""
    with open(input_jsonl, 'r') as f:
        lines = f.readlines()
    sample = random.sample(lines, min(n, len(lines)))
    with open(output_jsonl, 'w') as f:
        f.writelines(sample)


# Configuration
repo_id = "BAAI/OPI"       # Hugging Face dataset repository
target_dir = "./datasets"  # Local directory for downloaded files
output_dir = "./datasets"  # Directory for processed datasets
data_in = [
    "OPI_DATA/OPI_updated_160k.json",  # Main dataset file
]

# Create a smaller sample for faster experimentation (set to None to use full dataset)
create_sample_n = 1000

# Process each dataset file
for file in data_in:
    # Download from Hugging Face
    hf_hub_download(repo_id=repo_id,
                    filename=file,
                    repo_type="dataset",
                    local_dir=target_dir)
    print(f'✓ Downloaded {file}')

    # Convert to JSONL format
    file_out = file.split('/')[1].replace(".json", ".jsonl")
    out_path = os.path.join(output_dir, file_out)
    convert_opi(os.path.join(target_dir, file), out_path)
    print(f'✓ Converted {file} to {file_out}')

    # Create a sample for testing (optional)
    if create_sample_n is not None:
        sample_out = out_path.replace(".jsonl", f".sample{create_sample_n}.jsonl")
        create_sample(out_path, sample_out, create_sample_n)
        print(f'✓ Created random sample of {create_sample_n} lines: {sample_out}')
Expected output: After running this code, you should see messages confirming:
✓ Downloaded OPI_DATA/OPI_updated_160k.json
✓ Converted OPI_DATA/OPI_updated_160k.json to OPI_updated_160k.jsonl
✓ Created random sample of 1000 lines: ./datasets/OPI_updated_160k.sample1000.jsonl
Your ./datasets/ directory will now contain the formatted dataset files ready for upload.
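Before uploading, it is worth sanity-checking the generated file. A minimal check, assuming the sample file path printed above, could look like this:
import json

sample_path = "./datasets/OPI_updated_160k.sample1000.jsonl"

# Inspect the first record to confirm the expected message structure
with open(sample_path) as f:
    first = json.loads(f.readline())

print(first["messages"][0]["role"])   # expected: "user"
print(first["messages"][1]["role"])   # expected: "assistant"

# Count the records in the sample file
with open(sample_path) as f:
    print(sum(1 for _ in f))          # expected: 1000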
Step 2: Upload Your Dataset#
Now that your dataset is properly formatted, the next step is to upload it to AMD AI Workbench. This makes the dataset available across your workspace and ready for use in fine-tuning workflows.
The AMD AI Workbench provides a REST API for dataset uploads. We’ll show you two methods: a direct cURL command for quick uploads, and a Python script for programmatic integration into your workflows.
Note
In the examples below, replace the project ID and token placeholders (YOUR_PROJECT_UUID_HERE and YOUR_TOKEN_HERE in the cURL example, ADD_YOUR_PROJECT_ID and ADD_YOUR_TOKEN in the Python example) with your actual project ID and authentication token, and substitute your own dataset name, description, and file path.
Option A: API Call Using cURL#
This method is ideal for quick, one-time uploads directly from your terminal.
curl -X 'POST' \
'https://airmapi.silogen-demo.silogen.ai/v1/datasets/upload?project_id=YOUR_PROJECT_UUID_HERE' \
-H 'accept: application/json' \
-H 'Authorization: Bearer YOUR_TOKEN_HERE' \
-H 'Content-Type: multipart/form-data' \
-F 'name=your_dataset_name' \
-F 'description=Your dataset description' \
-F 'type=Fine-tuning' \
-F 'jsonl=@dataset.jsonl'
Option B: API Call Using Python#
This method is better suited for automated workflows or when you need error handling and response processing.
from pathlib import Path

import requests, certifi

BASE_URL = "https://airmapi.silogen-demo.silogen.ai/v1/datasets/upload?project_id=ADD_YOUR_PROJECT_ID"

file_path = Path("path_to_your_dataset")
headers = {"accept": "application/json", "Authorization": "Bearer ADD_YOUR_TOKEN"}
data = {"name": "dataset_name", "description": "dataset_description", "type": "Fine-tuning"}

with file_path.open("rb") as f:
    response = requests.post(
        url=BASE_URL,
        headers=headers,
        data=data,
        files={"jsonl": f},
        verify=certifi.where(),
        timeout=300,
    )

# Check the response
if response.status_code == 200:
    print("✓ Dataset uploaded successfully!")
    print(response.json())
else:
    print(f"✗ Upload failed with status {response.status_code}")
    print(response.text)
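To keep credentials out of notebooks and scripts, you can read the project ID and token from environment variables instead of hard-coding them. The sketch below assumes the variable names AIRM_PROJECT_ID and AIRM_TOKEN; these names are an example, not part of the Workbench API:
import os

# Hypothetical environment variable names; set them in your shell or workspace before running
project_id = os.environ["AIRM_PROJECT_ID"]
token = os.environ["AIRM_TOKEN"]

BASE_URL = f"https://airmapi.silogen-demo.silogen.ai/v1/datasets/upload?project_id={project_id}"
headers = {"accept": "application/json", "Authorization": f"Bearer {token}"}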
Conclusion#
Congratulations! You’ve successfully completed the dataset preparation workflow. You now have:
Technical Skills:
Experience downloading and converting datasets from Hugging Face
Knowledge of JSONL formatting for conversational AI datasets
Familiarity with AMD AI Workbench’s dataset upload API
Hands-on experience with JupyterLab in AMD AI Workbench
Practical Outcomes:
A formatted dataset ready for fine-tuning workflows
Understanding of the complete data preparation pipeline
Reusable code templates for future dataset preparation tasks
Next Steps:
With your dataset now uploaded to AMD AI Workbench, you can:
Explore fine-tuning tutorials using your prepared dataset
Experiment with different dataset sizes
Apply these techniques to your own custom datasets
Integrate this workflow into your AI development pipeline