Most foundational LLMs today consistently produce coherent, conversational and insightful responses, and they excel at connecting disparate ideas to offer novel solutions. However, they remain limited in complex logic and reasoning, particularly in tasks that require multiple sequential steps. With ongoing model refinement through advanced fine-tuning and evaluation, these capabilities can be improved, propelling LLMs toward becoming truly agentic.
The challenge
Enhancing LLMs’ reasoning abilities requires high-quality, validated data that covers complex topics and domains. Further, the data needs to reflect the sequential nature of logic and reasoning processes in order to enable transformative LLM improvements. Finally, as an LLM learns, model developers require benchmarks that showcase its progress, highlighting where the model excels and where it still falls short. This means sourcing contributors with the right qualifications to generate the training data, as well as validating that the data is of sufficient quality for model training. Identifying a gap in the availability of such data, TELUS Digital set out to fill it by creating an off-the-shelf dataset to further the evolution of LLMs.
The TELUS Digital solution
To develop the dataset of 50,000 prompt-response pairs (PRPs) spanning science, technology, engineering and mathematics (STEM), we sourced approximately 300 highly skilled contributors, including university students, graduates and professors, from around the world. Candidates were screened and verified to confirm they had the right expertise to curate a high-quality, reliable dataset. The contributors created the PRPs, which were specifically designed to fine-tune and evaluate a model’s ability to handle complex calculations and reasoning. The dataset contains pairs that explain complex problems in simple terms, with each step clearly detailed to support the model’s ability to understand and perform effectively.
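For illustration, a single PRP might look like the sketch below. The dataset’s actual schema is not published here, so the field names (domain, prompt, reasoning_steps, final_answer) are assumptions rather than the real format.

```python
# Illustrative sketch of one step-by-step PRP entry; the actual schema is not
# published, so these field names are assumptions.
example_prp = {
    "domain": "physics",
    "prompt": (
        "A car accelerates uniformly from rest to 20 m/s in 8 s. "
        "How far does it travel during this time?"
    ),
    "reasoning_steps": [
        "With uniform acceleration from rest, distance = (1/2) * a * t^2.",
        "Acceleration a = change in velocity / time = 20 / 8 = 2.5 m/s^2.",
        "Distance = 0.5 * 2.5 * 8^2 = 0.5 * 2.5 * 64 = 80 m.",
    ],
    "final_answer": "80 m",
}
```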
Our team of experts carefully designed the dataset to offer a well-balanced distribution across various sub-topics and complexity levels, covering fundamental concepts drawn from university-level curricula. Each entry underwent automated quality checks on our platform, Fine-Tune Studio, to identify inconsistencies. A team of subject matter experts then conducted detailed reviews and edits to validate accuracy and ensure the content met high academic standards.
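As a rough illustration of the kind of automated checks an entry might pass through before expert review, consider the sketch below. Fine-Tune Studio’s actual rules are internal, so these specific checks, and the entry format they assume, are illustrative only.

```python
# Hypothetical pre-review checks on a PRP entry shaped like the sketch above;
# these rules are illustrative, not Fine-Tune Studio's actual checks.
def find_issues(entry: dict) -> list[str]:
    """Return a list of problems found in one entry; an empty list means it passes."""
    issues = []
    if not entry.get("prompt", "").strip():
        issues.append("empty prompt")
    steps = entry.get("reasoning_steps", [])
    if not steps:
        issues.append("no reasoning steps provided")
    elif any(not step.strip() for step in steps):
        issues.append("blank reasoning step")
    if not entry.get("final_answer", "").strip():
        issues.append("missing final answer")
    return issues

# Entries that raise issues would be routed to subject matter experts for editing.
```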
The results
While the dataset was created for supervised fine-tuning efforts that enhance artificial intelligence (AI) models’ precision in tackling STEM-related questions, it also serves as a robust benchmark for evaluating LLM performance. It can be used to compare capabilities across models, as well as to track progress in reasoning and problem-solving.
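As a minimal sketch of how held-out PRPs could serve as a benchmark, assuming a generic generate(prompt) callable for the model under test, evaluation might look like the following; exact substring matching on the final answer is a simplification of real grading.

```python
# Minimal evaluation sketch: score a model on held-out PRP entries.
# `model_generate` is any callable that maps a prompt string to a response string.
def evaluate(model_generate, benchmark: list[dict]) -> float:
    """Return the fraction of benchmark prompts whose final answer appears in the output."""
    correct = 0
    for entry in benchmark:
        prediction = model_generate(entry["prompt"])
        if entry["final_answer"].strip().lower() in prediction.lower():
            correct += 1
    return correct / len(benchmark)

# Comparing this score across models, or across fine-tuning checkpoints of the
# same model, tracks progress in STEM reasoning and problem-solving.
```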
One of our existing clients, a leading enterprise AI platform and LLM builder, needed only the physics and mathematics PRPs from the licensable dataset to fine-tune the model that powers their open science initiatives. We tailored the original dataset to their needs, enabling the model to perform better on STEM-related tasks. This partnership not only advanced their mission of delivering cutting-edge AI, but also showcased the dataset's transformative potential.