Part 1: Deep Dive into Instruction Tuning

The effectiveness of large language models on code has largely been a result of pre-training on billions of lines of code, with only a minor share of the data budget, if any, spent on post-training for code generation (Touvron et al., 2023; Chowdhery et al., 2022). Over the last few years this trend has been changing, and a number of post-training recipes have been proposed to improve the performance of language models on code, math, and reasoning (Rozière et al., 2023; Dubey et al., 2024; Gunter et al., 2024).

In this post, we study post-training techniques that use instruction tuning to boost code-generation performance and instruction-following ability.

Instruction tuning for code generation

Most pipelines for generating instruction-tuning data can be divided into four parts (a minimal sketch of how these stages fit together follows the list):

  1. Seed data collection
  2. Synthetic data generation (usually using an LLM)
  3. Augmentation/filtering/post-processing
  4. Feedback enrichment
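
To make the structure concrete, here is a minimal sketch of how the four stages might be wired together. Every function name and threshold below is an illustrative placeholder, not a reference to any specific published recipe.

```python
# Minimal sketch of a four-stage instruction-tuning data pipeline.
# All names and thresholds are illustrative placeholders, not a
# reference to any specific published recipe.

def collect_seeds():
    """Stage 1: gather seed examples (annotated tasks, snippets, commits)."""
    return ["def add(a, b):\n    return a + b"]

def synthesize(seed, llm):
    """Stage 2: ask an LLM to turn a seed into an instruction/solution pair."""
    prompt = f"Create a coding problem and a correct solution inspired by:\n{seed}"
    return llm(prompt)

def filter_and_augment(samples, min_len=50):
    """Stage 3: deduplicate and drop low-quality or trivial samples."""
    return [s for s in dict.fromkeys(samples) if len(s) >= min_len]

def enrich_with_feedback(samples, run_checks):
    """Stage 4: attach execution or test feedback to each sample."""
    return [{"sample": s, "feedback": run_checks(s)} for s in samples]

def build_dataset(llm, run_checks):
    """Run the four stages end to end."""
    raw = [synthesize(seed, llm) for seed in collect_seeds()]
    return enrich_with_feedback(filter_and_augment(raw), run_checks)
```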

Seed Data Collection

Seed data plays a crucial role in determining the quality of the samples synthesized further down the pipeline. The earliest attempts, such as CodeAlpaca and WizardCoder, relied on manually annotated data. This approach, although ubiquitous, is limited by the biases present in the available data and is expensive to scale.
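
For concreteness, a manually written seed record in this style is usually just a short instruction/response pair. The fields below follow the widely used Alpaca-style schema; the content is an invented example, not drawn from either dataset.

```python
# An invented, Alpaca-style seed record of the kind a manual annotation
# pipeline produces. Field names follow the common Alpaca schema; the
# content is illustrative only.
seed_example = {
    "instruction": "Write a Python function that reverses a string.",
    "input": "",
    "output": "def reverse_string(s: str) -> str:\n    return s[::-1]",
}
```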

Magicoder (Wei et al., 2024) identified this issue and, instead of using human-generated labels, used code snippets selected from large codebases (such as The Stack) as seeds. The idea is attractive given the abundance of code; however, it is difficult to control the quality, complexity, and nature of the problems that get created, and the resulting problem distribution may require substantial processing before it reflects real-world issues.
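
The core recipe can be sketched in a few lines: sample a snippet from a code corpus and prompt a model to invent a self-contained problem and solution around it. The prompt wording and the `generate` callable below are assumptions for illustration, not Magicoder's actual implementation.

```python
import random

# Sketch of seeding synthetic problems with real code snippets, in the
# spirit of Magicoder's OSS-Instruct. The prompt text and the `generate`
# callable are placeholders, not the paper's actual setup.

PROMPT_TEMPLATE = (
    "Gain inspiration from the following code snippet and create a "
    "self-contained coding problem, then write a correct solution.\n\n"
    "Code snippet:\n{snippet}\n"
)

def sample_snippet(corpus, max_lines=15):
    """Pick a short random window from one source file in the corpus."""
    lines = random.choice(corpus).splitlines()
    start = random.randrange(max(1, len(lines) - max_lines + 1))
    return "\n".join(lines[start:start + max_lines])

def make_synthetic_example(corpus, generate):
    """Turn one real-code seed into a synthetic problem/solution sample."""
    snippet = sample_snippet(corpus)
    return generate(PROMPT_TEMPLATE.format(snippet=snippet))
```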

Dubey et al., 2024, WaveCoder (Yu et al., 2024), and StarCoderInstruct (Wei et al., 2024) use a similar approach for seed data but add extra filtering and curation steps to preserve dataset diversity. Recognizing the limitations of API-based approaches to data generation, OctoPack (Muennighoff et al., 2024) instead collects and filters commit data from across GitHub, producing CommitPack, the largest dataset of its kind.
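
As a sketch of the kind of heuristic filtering such pipelines rely on, consider turning raw commits into instruction-style edit pairs and keeping only the informative ones. The prefixes and thresholds below are invented for illustration; the actual CommitPack filtering rules are more involved.

```python
# Illustrative heuristics for turning raw commits into instruction-style
# edit pairs. Prefixes and thresholds are invented for illustration; the
# real CommitPack filtering rules are more involved.

BAD_PREFIXES = ("merge", "bump version", "update readme", "wip")

def keep_commit(message: str, diff: str) -> bool:
    """Keep commits whose messages read like instructions and whose diffs are non-trivial."""
    msg = message.strip().lower()
    if msg.startswith(BAD_PREFIXES):        # drop uninformative messages
        return False
    if not 3 <= len(msg.split()) <= 30:     # too short/long to act as an instruction
        return False
    if not 20 <= len(diff) <= 5_000:        # drop trivial or enormous diffs
        return False
    return True

def to_instruction_pair(message: str, before: str, after: str) -> dict:
    """Frame a kept commit as 'edit this code according to the instruction'."""
    return {"instruction": message, "input": before, "output": after}
```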

Broadly speaking, sources of seed data in the current open literature are as follows:

  1. Manually annotated instruction data (e.g., CodeAlpaca, WizardCoder)
  2. Code snippets sampled from large open codebases such as The Stack (e.g., Magicoder, WaveCoder, StarCoderInstruct)
  3. Commit data mined from GitHub (e.g., OctoPack's CommitPack)

However, if OSS models are to catch up with closed-source models, we need to think beyond these sources and consider the applications where these models are used directly, such as forums, documentation, and debugging.

This leads to the following additional resources from which seed data can be collected without a direct annotation pipeline (though filtering and curation would still be required).