
Alchemist

Alchemist is a platform for building instruction fine-tuning datasets for language models. By focusing on dataset curation rather than the fine-tuning process itself, Alchemist helps users assemble high-quality datasets efficiently, setting the stage for more effective model training.

Sample Ingestion

Work in Alchemist begins with uploading prompt logs from your existing systems. These logs serve as the raw material for your fine-tuning dataset. Alchemist’s interface makes it easy to import large volumes of data, giving you a rich pool of samples to work with. This initial step lays the foundation for the entire curation process.
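
Alchemist’s exact upload schema isn’t described here, but the general shape of a prompt log is usually simple: one record per prompt/response pair, serialized as JSONL. The sketch below is purely illustrative, the field names (`timestamp`, `prompt`, `response`, `model`) and file name are assumptions, not Alchemist’s actual format.

```python
import json

# Hypothetical prompt-log records as they might come out of an application's
# logging layer; field names are illustrative, not Alchemist's schema.
raw_logs = [
    {
        "timestamp": "2024-05-01T12:34:56Z",
        "prompt": "Summarize the attached support ticket in two sentences.",
        "response": "The customer reports intermittent login failures after the last update...",
        "model": "gpt-4o",
    },
]

# Write one JSON object per line (JSONL), a common interchange format for
# uploading large volumes of samples.
with open("prompt_logs.jsonl", "w", encoding="utf-8") as f:
    for record in raw_logs:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```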

Data Curation

Once your data is uploaded, Alchemist provides a robust set of tools for searching through and curating a subset of samples that best represent your desired outcomes. This curation process combines manual search capabilities with Alchemist’s proprietary algorithms, creating a semi-automated, human-in-the-loop workflow. Users can leverage advanced search functionalities to identify relevant samples, while Alchemist’s algorithms assist in surfacing potentially valuable data points that might otherwise be overlooked. This hybrid approach ensures both efficiency in processing large datasets and the quality that comes from human oversight.
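
Alchemist’s ranking algorithms are proprietary and not documented here. Purely as an illustration of the semi-automated, human-in-the-loop pattern described above, the following sketch uses off-the-shelf embedding similarity (the `sentence-transformers` library and an assumed `prompt` field from the ingestion example) to surface candidate samples, then leaves the final keep/discard decision to a reviewer.

```python
import json
import numpy as np
from sentence_transformers import SentenceTransformer

# Load the ingested samples (see the JSONL sketch above).
with open("prompt_logs.jsonl", encoding="utf-8") as f:
    samples = [json.loads(line) for line in f]

# Embed every prompt once; "all-MiniLM-L6-v2" is just a small general-purpose model.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode([s["prompt"] for s in samples], normalize_embeddings=True)

def search(query: str, top_k: int = 20):
    """Return the samples most similar to a free-text query."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = embeddings @ q  # cosine similarity, since vectors are normalized
    ranked = np.argsort(-scores)[:top_k]
    return [samples[i] for i in ranked]

# Human-in-the-loop step: a reviewer inspects the candidates and keeps only
# the ones that represent the desired behavior.
candidates = search("summarization of customer support tickets")
curated = [s for s in candidates if input(f"Keep? {s['prompt'][:60]} [y/N] ").lower() == "y"]
```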

Instruction Generation

After curating your dataset, Alchemist takes the process a step further by automatically generating instructions from the samples in your curated set. These instructions are tailored to fit the format required by your chosen model or platform, whether it’s the Llama format for AWS or the OpenAI format for Azure. This versatility ensures that your curated dataset can be seamlessly integrated into various fine-tuning pipelines. While Alchemist doesn’t perform the actual fine-tuning, it provides you with a meticulously prepared dataset, formatted and ready for use, significantly reducing the time and effort required to prepare data for model fine-tuning.
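
To make the target formats concrete, here is a minimal sketch of what the generated output might look like for one curated sample. The OpenAI chat fine-tuning format genuinely uses a JSONL file of `messages` lists; the prompt/completion shape shown for Llama on AWS is the commonly used layout, but the system prompt text, field mapping, and file names below are assumptions for illustration. Check your platform’s documentation for the exact fields it expects.

```python
import json

# A curated sample (same illustrative schema as in the ingestion example).
sample = {
    "prompt": "Summarize the attached support ticket in two sentences.",
    "response": "The customer reports intermittent login failures after the last update...",
}

# OpenAI-style chat fine-tuning record: one JSONL line with a "messages" list.
openai_record = {
    "messages": [
        {"role": "system", "content": "You are a concise support assistant."},
        {"role": "user", "content": sample["prompt"]},
        {"role": "assistant", "content": sample["response"]},
    ]
}

# Prompt/completion record of the kind commonly used for Llama fine-tuning on AWS.
llama_record = {
    "prompt": sample["prompt"],
    "completion": sample["response"],
}

with open("openai_finetune.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(openai_record, ensure_ascii=False) + "\n")

with open("llama_finetune.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(llama_record, ensure_ascii=False) + "\n")
```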