This Space is designed to provide you with an easy way to get started generating synthetic datasets using Spaces compute to host open LLMs. The Space comes with a ready-to-go environment and a series of notebooks showing various examples of generating synthetic datasets. You can read more about the aims of the Space in this blog post.

What's covered?

Currently this Space has notebooks covering the following topics:

Creating synthetic text similarity datasets

A set of notebooks covering the steps for creating a synthetic dataset for fine-tuning a sentence similarity model. These notebooks cover:

  • How to do structured generation with the outlines library to gain more control over the outputs generated by an LLM.
  • How to use LlamaIndex to chunk texts so they fit within the context length of sentence embedding models.
  • How to use vLLM to efficiently create a dataset that can be used to fine-tune a sentence similarity model.
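
As a rough illustration of the chunking step in the list above, the sketch below splits a text into overlapping word windows using only the Python standard library. It is not the LlamaIndex API and it is not taken from the notebooks: `chunk_words`, `chunk_size`, and `overlap` are hypothetical names, and whitespace-separated words stand in for model tokens, which a real embedding model would count differently.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 128, overlap: int = 16) -> List[str]:
    """Split text into windows of at most `chunk_size` words.

    Consecutive windows share `overlap` words so that sentences cut at a
    boundary still appear with some context in the next chunk.
    Assumption: words approximate tokens; a real tokenizer will differ.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far each window advances
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(10))
    for chunk in chunk_words(sample, chunk_size=4, overlap=1):
        print(chunk)
```

In practice you would pick `chunk_size` from the maximum sequence length of the sentence embedding model you plan to fine-tune, leaving some headroom because tokens are usually shorter than words.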

Using the Space

To use this Space, duplicate it. To make sure your work is saved, enable persistent storage for your Space. To start, you may want to use a smaller GPU such as the T4, then switch to a larger GPU when you want to run larger LLMs or generate more data. You can also preview the notebooks in the Space without running them. The Jupyter notebooks are in the notebooks folder.

Duplicate the Space to run your own instance



The default login token for the notebook environment is huggingface.

This template was created by camenduru and nateraw, with contributions from osanseviero and azzr.