
Empowering AI advancements through data-centric strategies: A triumph of large language models and tools like Cleanlab

The AI revolution was decades in the making. It was a field filled with excitement, yet often punctuated by disappointments and “AI winters.” But recently, something shifted. Large Language Models (LLMs) like ChatGPT, Claude, and Bard catapulted AI from laboratory curiosity to the mainstream.

This shift wasn’t solely a triumph of AI but also a victory over the intricacies of large and messy data. As the saying goes, “garbage in, garbage out.” New tools are emerging that focus on improving the underlying data, therefore improving LLMs.

The Double Challenge of LLMs

The term “Large Language Models” holds within it two great challenges. First, the sheer volume of data. We’re talking upwards of a petabyte (a million gigabytes) of data for GPT-4, encompassing millions of books, blogs, social media posts, video transcripts, and more. This colossal scale offers vast potential but also poses significant logistical considerations.

Second, the complexity of natural language. Context-dependent, ambiguous, and diverse, language data is a wild beast that even the best algorithms struggle to tame. It’s impossible to accurately label all this data, which inevitably means that even state-of-the-art LLMs are trained on tons of incorrectly labeled data.

In facing these challenges, new data-centric tools and methodologies emerged, enabling a true leap in what AI is capable of. Solutions like Cleanlab and others began to offer ways to collect diverse data, automate quality control, and process language into a form suitable for AI models.

These tools did not merely offer incremental improvements; they fundamentally reshaped the approach to AI data handling. They transformed the task of handling large-scale language data from a manual, error-prone process into an automated, precise one, democratizing the field and enabling advancements at an unprecedented pace.

Why Data-Centric AI is Needed (With a Python Demo)

In AI, real-world datasets contain annotation errors, with error rates ranging from 7% to 50%. These imperfections significantly hamper both training and evaluation. Data-centric AI emphasizes improving the quality of the dataset itself.
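To make the evaluation side of this concrete, here is a minimal sketch (toy data, hypothetical helper names, not taken from the original demo) showing that even a model that is never wrong appears to lose accuracy when the labels it is scored against are noisy.

```python
import random

def measured_accuracy(predictions, observed_labels):
    """Fraction of predictions that match the (possibly noisy) observed labels."""
    matches = sum(p == y for p, y in zip(predictions, observed_labels))
    return matches / len(observed_labels)

def flip_labels(true_labels, noise_rate, num_classes, seed=0):
    """Return a copy of true_labels with a fixed fraction replaced by a wrong class."""
    rng = random.Random(seed)
    noisy = list(true_labels)
    n_flip = round(noise_rate * len(noisy))
    for i in rng.sample(range(len(noisy)), n_flip):
        wrong_classes = [c for c in range(num_classes) if c != noisy[i]]
        noisy[i] = rng.choice(wrong_classes)
    return noisy

# Hypothetical 3-class ground truth and a model that always predicts it correctly.
true_labels = [i % 3 for i in range(1000)]
perfect_predictions = list(true_labels)

# Corrupt 10% of the labels, as an annotator might.
noisy_labels = flip_labels(true_labels, noise_rate=0.10, num_classes=3)

print(measured_accuracy(perfect_predictions, true_labels))   # 1.0
print(measured_accuracy(perfect_predictions, noisy_labels))  # 0.9
```

With 10% of the test labels wrong, the reported accuracy is capped at 90% no matter how good the model is, which is one reason cleaning labels matters for evaluation as well as training.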

OpenAI’s strategy, for instance, illustrates this emphasis: “We prioritized filtering out all of the bad data over leaving in all of the good data. This is because we can always fine-tune our model with more data later to teach it new things, but it’s much harder to make the model forget something that it has already learned.”

Manually filtering data, however, is time-consuming and expensive. The Cleanlab package is a popular open-source framework for practicing data-centric AI. It lets you run data quality algorithms on your trained ML model’s outputs to detect common dataset issues such as label errors, outliers, and drift.

With just a few lines of code, you can automatically find problems in various types of data, such as image, text, tabular, and audio. You can then decide how to improve your dataset, re-train your ML model, and see its performance improve without any changes to your existing code.

Cleanlab Studio, on the other hand, is more than just an extension of the Cleanlab package; it’s a no-code platform designed to find and fix problems in real-world datasets. It doesn’t just stop at detecting issues but goes further in handling data curation and correction, and even automates almost all the hard parts of turning raw data into reliable ML or Analytics.

Let’s use the Cleanlab package to demonstrate the power of data-centric AI.

1. Preparing data and fine-tuning

We start with the Stanford Politeness Dataset. Ensure you have the train and test sets loaded. In this demo, we’ll fine-tune the Davinci LLM for 3-class classification, first without Cleanlab, and then see how we can improve accuracy with data-centricity. We can run a simple bash command to train a model.

!openai api fine_tunes.create -t "train_prepared.jsonl" -v "test_prepared.jsonl" --compute_classification_metrics --classification_n_classes 3 -m davinci --suffix "baseline"

When that’s done, we can query a fine_tunes.results endpoint to see the test accuracy.

!openai api fine_tunes.results -i ft-9800F2gcVNzyMdTLKcMqAtJ5 > baseline.csv

import pandas as pd

df = pd.read_csv('baseline.csv')

baseline_acc = df.iloc[-1]['classification/accuracy']

We get a result of 63% accuracy. Let’s see if we can improve this.
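Reading the last row works when that row carries the metric. As a hedged alternative, the sketch below uses only the standard library and picks the last non-empty value in the `classification/accuracy` column, on the assumption that some rows of a results file may leave that column blank. The helper name and the sample rows are made up for illustration and are not real fine-tuning output.

```python
import csv
import io

def final_accuracy(csv_text, column="classification/accuracy"):
    """Return the last non-empty value in `column` as a float, or None if absent."""
    last = None
    for row in csv.DictReader(io.StringIO(csv_text)):
        value = (row.get(column) or "").strip()
        if value:
            last = float(value)
    return last

# Made-up rows for illustration only.
sample = """step,classification/accuracy
1,
2,0.5
3,
4,0.625
5,
"""

print(final_accuracy(sample))  # 0.625
```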

2. Obtain Predicted Class Probabilities

Now, let’s use OpenAI’s API to compute embeddings and fit a logistic regression model to obtain out-of-sample predicted class probabilities.

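# Get embeddings from OpenAI (openai.embeddings_utils is only available in the legacy openai<1.0 package).
import numpy as np
from openai.embeddings_utils import get_embedding

embedding_model = "text-similarity-davinci-001"
train["embedding"] = train.prompt.apply(lambda x: get_embedding(x, engine=embedding_model))

# Stack the per-row embedding lists into a 2D array (n_examples x embedding_dim) for scikit-learn.
embeddings = np.array(train["embedding"].tolist())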

# Get out-of-sample predicted class probabilities via cross-validation.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

model = LogisticRegression()
labels = train["completion"].values
pred_probs = cross_val_predict(estimator=model, X=embeddings, y=labels, cv=10, method="predict_proba")
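"Out-of-sample" here means that each example's probabilities come from a model that never saw that example during training. The sketch below illustrates only that mechanic in plain Python. It is a simplification and not what scikit-learn does internally: the folds are made by simple striding, and the "model" is a toy that ignores features and predicts training-set class frequencies. All function names are hypothetical.

```python
from collections import Counter

def k_fold_indices(n_examples, k):
    """Split range(n_examples) into k folds by striding (fold j gets j, j+k, ...)."""
    return [list(range(j, n_examples, k)) for j in range(k)]

def fit_class_frequencies(labels, num_classes):
    """Toy 'model': ignore features and predict the training class frequencies."""
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in range(num_classes)]

def out_of_sample_pred_probs(labels, num_classes, k):
    """For each example, predict with a model fit only on the other folds."""
    pred_probs = [None] * len(labels)
    for held_out in k_fold_indices(len(labels), k):
        held_out_set = set(held_out)
        train_labels = [y for i, y in enumerate(labels) if i not in held_out_set]
        probs = fit_class_frequencies(train_labels, num_classes)
        for i in held_out:
            pred_probs[i] = probs
    return pred_probs

labels = [0, 0, 1, 2, 0, 1, 2, 2, 1, 0]  # hypothetical 3-class labels
pred_probs = out_of_sample_pred_probs(labels, num_classes=3, k=5)

print(pred_probs[0])  # [0.375, 0.25, 0.375], computed without examples 0 and 5
```

Because no example influences its own prediction, a mislabeled example cannot "vouch for itself", which is what makes these probabilities useful for spotting label issues.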

With just one line of code, Cleanlab estimates which examples have label issues in our training dataset.

from cleanlab.filter import find_label_issues

Now we can get indices of examples estimated to have label issues:

issue_idx = find_label_issues(labels, pred_probs, return_indices_ranked_by='self_confidence') # sort indices by likelihood of label error
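For intuition about the `self_confidence` ranking, the self-confidence of an example is the predicted probability the model assigns to that example's given label, and lower values are more suspicious. The sketch below reproduces only that ranking score on toy numbers. It does not reproduce the step in which cleanlab decides which examples count as issues in the first place, and the function name is hypothetical.

```python
def rank_by_self_confidence(labels, pred_probs):
    """Return example indices sorted from least to most self-confident."""
    # Self-confidence: predicted probability of the label the example was given.
    self_confidence = [probs[y] for y, probs in zip(labels, pred_probs)]
    return sorted(range(len(labels)), key=lambda i: self_confidence[i])

# Toy data: 4 examples, 3 classes.
labels = [0, 1, 2, 0]
pred_probs = [
    [0.90, 0.05, 0.05],  # given label 0, model agrees
    [0.70, 0.20, 0.10],  # given label 1, model prefers class 0
    [0.10, 0.10, 0.80],  # given label 2, model agrees
    [0.05, 0.15, 0.80],  # given label 0, model strongly prefers class 2
]

print(rank_by_self_confidence(labels, pred_probs))  # [3, 1, 2, 0]
```

Examples 3 and 1 come first because the model gives their assigned labels little probability, so they are the most likely to be mislabeled.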

3. Filter Label Issues and Re-Train

Now, we’ve automatically extracted the indices of potentially mislabeled examples, so we can remove them and train a new classifier.

# Remove the label errors (format_data is a helper that is not defined in this article)

train_cl = train.drop(issue_idx).reset_index(drop=True)
format_data(train_cl, "train_cl.jsonl")
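Since the `format_data` helper is not shown here, the following is a rough stand-in under stated assumptions rather than the author's implementation: it drops the flagged rows and serializes the rest as JSON Lines, one object per line, with the `prompt` and `completion` fields used above. The function names, the example prompts, and the integer class encoding are all made up for illustration.

```python
import json

def drop_flagged(records, issue_idx):
    """Return the records whose position is not listed in issue_idx."""
    flagged = set(issue_idx)
    return [r for i, r in enumerate(records) if i not in flagged]

def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(r) + "\n" for r in records)

# Made-up examples; the label encoding (0, 1, 2) is hypothetical.
train_records = [
    {"prompt": "Thanks so much for the quick fix!", "completion": 2},
    {"prompt": "Why would anyone write it this way?", "completion": 2},
    {"prompt": "Could you add a test for this case?", "completion": 1},
]
issue_idx = [1]  # pretend the second example was flagged as mislabeled

train_cl_records = drop_flagged(train_records, issue_idx)
jsonl_text = to_jsonl(train_cl_records)
print(jsonl_text, end="")
```

Writing `jsonl_text` to a file would give a cleaned training file in the same shape as the original one, minus the flagged rows.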

Now let’s train a more robust classifier with better data.

!openai api fine_tunes.create -t "train_cl_prepared.jsonl" -v "test_prepared.jsonl" --compute_classification_metrics --classification_n_classes 3 -m davinci --suffix "dropped"

# Evaluate model on test data

!openai api fine_tunes.results -i ft-InhTRQGu11gIDlVJUt0LYbEx > cleanlab.csv

df = pd.read_csv('cleanlab.csv')
dropped_acc = df.iloc[-1]['classification/accuracy']

We get an accuracy of over 66%, improving a state-of-the-art fine-tunable model (GPT-3, as you can’t fine-tune GPT-4), merely by automatically improving the dataset, without any change to the model.
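To put the gain in perspective, a small calculation helps. It treats the reported figures as exactly 63% and 66% (the second is stated as "over 66%", so this is a lower bound) and expresses the change both in percentage points and as a relative reduction in test errors. The function name is hypothetical.

```python
def error_reduction(baseline_acc, new_acc):
    """Relative reduction in error rate when accuracy moves from baseline to new."""
    baseline_err = 1.0 - baseline_acc
    new_err = 1.0 - new_acc
    return (baseline_err - new_err) / baseline_err

baseline_acc = 0.63  # reported baseline accuracy
dropped_acc = 0.66   # reported accuracy after dropping flagged examples (lower bound)

abs_gain = dropped_acc - baseline_acc
rel = error_reduction(baseline_acc, dropped_acc)

print(f"{abs_gain * 100:.1f} points, {rel:.1%} fewer errors")
# 3.0 points, 8.1% fewer errors
```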

With Cleanlab Studio, it’s also possible to automatically fix the incorrect labels instead of just removing them outright, improving accuracy even further. A guide by Cleanlab shows that this takes accuracy up to 77%.
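How Cleanlab Studio corrects labels is not described here, so the following is only one simple way to "fix instead of drop", offered as an assumption: replace each flagged example's label with the class the model considers most likely. The function name and toy numbers are hypothetical.

```python
def relabel_to_most_likely(labels, pred_probs, issue_idx):
    """Return a copy of labels where flagged examples take the model's top class."""
    fixed = list(labels)
    for i in issue_idx:
        probs = pred_probs[i]
        fixed[i] = max(range(len(probs)), key=lambda c: probs[c])
    return fixed

# Toy data: 4 examples, 3 classes; examples 3 and 1 were flagged.
labels = [0, 1, 2, 0]
pred_probs = [
    [0.90, 0.05, 0.05],
    [0.70, 0.20, 0.10],
    [0.10, 0.10, 0.80],
    [0.05, 0.15, 0.80],
]
issue_idx = [3, 1]

print(relabel_to_most_likely(labels, pred_probs, issue_idx))  # [0, 0, 2, 2]
```

Compared with dropping, relabeling keeps the training set at its original size, at the risk of writing in the model's own mistakes, so a human review of the flagged examples is a sensible complement.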

Takeaways

Using data-centric tools like Cleanlab, you can efficiently find and fix data and label issues, leading to significant improvements in the performance of LLMs like Davinci. This approach does not alter the model architecture or hyperparameters and focuses only on enhancing the quality of the training data.

The approach outlined in this guide could be the key to unlocking even greater accuracy and robustness in AI models, even with future advanced LLMs like GPT-5.


This article was originally published by Frederik Bussler on Hackernoon.
