Today, we walk through the process of building datasets for fine-tuning large language models (LLMs). Starting with the considerations to weigh before constructing a dataset, we work through key pipeline questions, such as whether you need embeddings at all. We then discuss how to structure raw text data for fine-tuning, illustrated with real coding and medical appeals scenarios.
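To make the structure concrete, here is a minimal Python sketch of one common way to lay raw text out as instruction-style JSONL records. The field names (instruction/input/output) follow the Alpaca-style convention and are an assumption, not necessarily the schema used in the video, and the example entries are invented.

```python
# Minimal sketch: writing instruction-style fine-tuning records to JSONL.
# Field names ("instruction", "input", "output") are one common convention,
# assumed here for illustration; the example entries are made up.
import json

records = [
    {
        "instruction": "Summarize the key argument of this medical appeal.",
        "input": "The patient was denied coverage for an MRI despite ...",
        "output": "The appeal argues the MRI is medically necessary because ...",
    },
    {
        "instruction": "Write a Python function that reverses a string.",
        "input": "",
        "output": "def reverse_string(s):\n    return s[::-1]",
    },
]

with open("finetune_dataset.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```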
We also explore how to leverage embeddings to supply additional context to our models, a crucial step in building more general and robust systems. The video then shows how to turn books into structured datasets using LLMs, with an example that converts 'Twenty Thousand Leagues Under the Sea' into a question-and-answer format.
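As a rough illustration of the book-to-dataset idea, the sketch below chunks a plain-text copy of the book and asks an LLM to produce one question-answer pair per chunk. The file name, chunk size, prompt wording, and the ask_llm stub are all assumptions standing in for whatever model and settings are actually used in the video.

```python
# Rough sketch of turning a book into Q&A training pairs with an LLM.
# ask_llm() is a stand-in for whichever local or hosted model you prefer.
import json
import textwrap


def ask_llm(prompt: str) -> str:
    # Placeholder: route the prompt to your model of choice and return its reply.
    # Stubbed here so the pipeline shape is clear without a live model.
    return json.dumps({"question": "...", "answer": "..."})


def chunk_text(text: str, chunk_chars: int = 2000) -> list[str]:
    # Naive fixed-width chunking; splitting on chapters or paragraphs is usually better.
    return textwrap.wrap(text, chunk_chars)


# Assumes a local plain-text copy of the book (e.g., from Project Gutenberg).
with open("twenty_thousand_leagues.txt", encoding="utf-8") as f:
    book = f.read()

qa_pairs = []
for chunk in chunk_text(book):
    prompt = (
        "Read the passage below and write one question a reader might ask, "
        "plus its answer. Respond as JSON with keys 'question' and 'answer'.\n\n"
        + chunk
    )
    qa_pairs.append(json.loads(ask_llm(prompt)))

with open("book_qa_dataset.jsonl", "w", encoding="utf-8") as f:
    for pair in qa_pairs:
        f.write(json.dumps(pair, ensure_ascii=False) + "\n")
```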
In addition, we look at fine-tuning LLMs to write code in specific languages, with a practical example: generating Cypher queries for graph databases. Lastly, we demonstrate how to improve a medical application's performance by supplying embedded information through Superbooga.
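For the code-generation use case, a training record might pair a natural-language request with the Cypher it should produce, in the same JSONL layout as the earlier sketch. The example below is hypothetical: the graph schema (Person/Movie nodes, ACTED_IN relationship) is invented for illustration and is not taken from the linked training datasets.

```python
# Hypothetical natural-language-to-Cypher training pair for a coding
# fine-tune. The graph schema here is invented for illustration only.
import json

example = {
    "instruction": "Write a Cypher query that returns the titles of all movies "
                   "Tom Hanks acted in.",
    "output": (
        "MATCH (p:Person {name: 'Tom Hanks'})-[:ACTED_IN]->(m:Movie)\n"
        "RETURN m.title"
    ),
}

print(json.dumps(example, indent=2))
```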
Whether you're interested in coding, medical applications, converting books, or fine-tuning LLMs in general, this video offers practical insights. Tune in to see how to augment your models with these techniques and tools, and join us on the live stream for a deeper dive into broadening the context of local models, along with results from our book-training and comedy datasets.
0:00 Intro
0:44 Considerations For Finetuning Datasets
2:45 Reviewing Embeddings
5:35 Finetuning With Embeddings
8:31 Creating Datasets From Raw/Books
12:08 Coding Finetuning Example
14:02 Medicare/Medicaid Appeals Example
17:01 Outro
Training datasets: https://github.com/tomasonjo/blog-dat...
Massive Text Embeddings: https://huggingface.co/blog/mteb
GitHub Repo: https://github.com/Aemon-Algiz/Datese...
#machinelearning #ArtificialIntelligence #LargeLanguageModels #FineTuning #DataPreprocessing #Embeddings