Fine-Tuning Transformers with On-Device GPU Resources

by Ekaterina Butyugina

An abstract image of books being digitalized

The Importance of Summarization in NLP

Summarization in Natural Language Processing (NLP) is a crucial task due to its applications in research and business. It addresses the issue of information overload and enhances time efficiency by distilling long documents into their essence.

Transformer Models — The Backbone of Generative AI

Transformer models are the backbone of much of today’s generative AI: they can generate new content based on the input they receive. This architecture is behind many well-known AI models, such as GPT (Generative Pre-trained Transformer) and BERT (Bidirectional Encoder Representations from Transformers). Beyond summarization, models built on it can generate text, answer questions, and more.

Why a Local GPU?

Choosing between a local GPU and a cloud service depends on factors like performance needs, data privacy concerns, cost, internet stability, and the nature of your work. For many professionals and researchers, the advantages of a local GPU make it a preferred choice for demanding computational tasks.

Here you can find the full code, explanations, and examples.

Abstractive Text Summarization

Extractive summarization selects a key subset of the original content and generates nothing new. Abstractive summarization, in contrast, uses natural language generation: the model draws on knowledge bases and semantic representations to write a summary in its own words, much as a human would.

Setting up the environment

First, we install libraries for handling datasets and transformers.

install datasets

We’ll use the CNN/Daily Mail news articles dataset, which includes full articles and summaries, perfect for training our model.

datasets import load

The example article from this dataset will look like a dictionary with the following content:

dataset for a dictionary
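A minimal sketch of the loading step, assuming the dataset is pulled from the Hugging Face Hub under the name cnn_dailymail with configuration 3.0.0 (neither is stated above). The preview helper is hypothetical and only shortens long fields for printing. Each record is expected to carry article, highlights, and id fields.

```python
def load_cnn_dailymail(version="3.0.0"):
    """Download CNN/Daily Mail from the Hugging Face Hub (needs network access).

    The dataset name and version string are assumptions, not taken from the article.
    """
    from datasets import load_dataset  # lazy import: only needed when downloading
    return load_dataset("cnn_dailymail", version)


def preview(example, width=80):
    """Hypothetical helper: copy one record, shortening long strings for printing."""
    return {
        key: value[:width] + "..." if len(value) > width else value
        for key, value in example.items()
    }


# Usage (commented out because it downloads the full dataset):
# raw_datasets = load_cnn_dailymail()
# print(preview(raw_datasets["train"][0]))  # keys: article, highlights, id
```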

As you may have noticed, we also imported the load_metric function, which I explain in the next section.

Evaluating with ROUGE Score

To assess our model’s performance, we use the ROUGE score, a metric that compares AI-generated summaries against human-produced ones.

install rouge_score

ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics used for evaluating automatic summarization in natural language processing.

To make this more precise, suppose we want to compare the following two summaries:


Let’s focus on ROUGE-1, which looks at the overlap of each individual word (1-gram).

Recall in ROUGE-1 is the number of words in the generated summary that also appear in the reference summary, divided by the total number of words in the reference summary. In this case, the words “I”, “loved”, “reading”, “the”, “Hunger”, and “Games” are in both summaries. The reference summary has 6 words, and all 6 are in the generated summary, so the recall is 100%.

Precision is the same overlap count divided by the total number of words in the generated summary. The generated summary has 7 words, and 6 of these appear in the reference summary. So, the precision is 6/7, or approximately 85.71%.

F-Measure (F1 Score) combines precision and recall into a single score. It’s calculated as the harmonic mean of the two: 2 × precision × recall / (precision + recall). For this example, the F1 score is 12/13, or approximately 92.3%.

Scores range from 0 to 1, and in our example all three are close to 1, so our generated summary is likely capturing the key points effectively.
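The ROUGE-1 arithmetic can be checked with a few lines of plain Python. This is a simplified sketch: it splits on whitespace and lowercases, whereas the rouge_score package applies its own tokenization, so real scores may differ slightly. The two sentences are assumed examples that match the word counts discussed here when “Hunger” and “Games” are counted separately (6 reference words, 7 generated words).

```python
from collections import Counter


def rouge1(generated, reference):
    """Compute ROUGE-1 precision, recall, and F1 on lowercased whitespace tokens."""
    gen_tokens = generated.lower().split()
    ref_tokens = reference.lower().split()
    # Clipped overlap: each word counts at most as often as it occurs in both texts.
    overlap = sum((Counter(gen_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(gen_tokens) if gen_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Assumed example sentences (the original pair is not reproduced in the text).
generated_summary = "I absolutely loved reading the Hunger Games"
reference_summary = "I loved reading the Hunger Games"

scores = rouge1(generated_summary, reference_summary)
print({name: round(value, 4) for name, value in scores.items()})
# → {'precision': 0.8571, 'recall': 1.0, 'f1': 0.9231}
```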

Data Preprocessing and Model Selection

Before we can feed those texts to our model, we need to preprocess them. This is done by a Hugging Face Transformers tokenizer, which (as the name indicates) tokenizes the inputs and puts them in the format the model expects.

tokenizing the input

Basically, we turn the words into numbers.


This code is built to run with any model checkpoint from the Model Hub as long as that model has a sequence-to-sequence version in the Transformers library. Here we picked the t5-small checkpoint.


The T5 model can handle a variety of tasks: translation, summarization, question answering, and so on.

When using a T5 model, we have to prefix the inputs with “summarize:” (the model can also translate, and it needs the prefix to know which task to perform).
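The prefixing and tokenization step might look like the sketch below. The field names article and highlights are those of CNN/Daily Mail; the two length limits are assumed values, and passing the tokenizer as an argument is a choice made here so the function is easy to try in isolation.

```python
PREFIX = "summarize: "      # task prefix T5 expects for summarization
MAX_INPUT_LENGTH = 1024     # assumed truncation limit for articles
MAX_TARGET_LENGTH = 128     # assumed truncation limit for summaries


def preprocess_function(examples, tokenizer):
    """Tokenize a batch of articles (inputs) and highlights (labels)."""
    inputs = [PREFIX + doc for doc in examples["article"]]
    model_inputs = tokenizer(inputs, max_length=MAX_INPUT_LENGTH, truncation=True)
    # text_target tells the tokenizer these strings are the targets, not the inputs.
    labels = tokenizer(
        text_target=examples["highlights"],
        max_length=MAX_TARGET_LENGTH,
        truncation=True,
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


# Usage (commented out because it downloads the tokenizer; raw_datasets is
# assumed to be the DatasetDict returned by load_dataset):
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("t5-small")
# tokenized_datasets = raw_datasets.map(
#     preprocess_function, batched=True, fn_kwargs={"tokenizer": tokenizer}
# )
```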


Fine-Tuning the Transformer Model

Now that our data is ready, we can download the pre-trained model and fine-tune it.

The T5 model was pre-trained on a different type of text, which is why we need to fine-tune it on our dataset.

First, we want to adjust the training parameters, such as learning rate and batch size:
  • we set the evaluation to be done at the end of each epoch
  • tweak the learning rate
  • customize the weight decay

Since the Seq2SeqTrainer will save the model regularly and our dataset is quite large, we tell it to keep at most three saves.

Lastly, we use the predict_with_generate option (to properly generate summaries) and activate mixed precision training (to go a bit faster).
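The settings described above could be collected as in the sketch below. The learning rate, weight decay, batch size, and output folder are assumed values that are not given in the text. The save limit, three epochs, predict_with_generate, and mixed precision follow the description. The evaluation argument is looked up at runtime because it was renamed from evaluation_strategy to eval_strategy in newer Transformers releases.

```python
import inspect

BATCH_SIZE = 16  # assumed value; lower it if the GPU runs out of memory

TRAINING_KWARGS = {
    "output_dir": "t5-small-finetuned-cnn-dailymail",  # hypothetical folder name
    "learning_rate": 2e-5,                 # assumed value
    "per_device_train_batch_size": BATCH_SIZE,
    "per_device_eval_batch_size": BATCH_SIZE,
    "weight_decay": 0.01,                  # assumed value
    "save_total_limit": 3,                 # keep at most three checkpoints
    "num_train_epochs": 3,
    "predict_with_generate": True,         # call generate() during evaluation
    "fp16": True,                          # mixed precision; requires a CUDA GPU
}


def build_training_args():
    """Create Seq2SeqTrainingArguments that evaluate at the end of each epoch."""
    from transformers import Seq2SeqTrainingArguments  # lazy import

    params = inspect.signature(Seq2SeqTrainingArguments.__init__).parameters
    # Pick whichever name this installed version accepts.
    strategy_key = "eval_strategy" if "eval_strategy" in params else "evaluation_strategy"
    return Seq2SeqTrainingArguments(**{strategy_key: "epoch"}, **TRAINING_KWARGS)


# Usage (needs transformers, PyTorch, and a CUDA GPU because fp16 is enabled):
# training_args = build_training_args()
```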

The last thing to define for our Seq2SeqTrainer is how to compute the metrics from the predictions. We need to define a helper function for this, which you can find in the original notebook.
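For orientation, such a helper might be shaped like the sketch below. It is not the notebook's version: make_compute_metrics and score_fn are hypothetical names, and the scoring function is passed in so that any ROUGE implementation can be plugged in. The one step that is easy to miss is replacing the -100 padding values in the labels before decoding.

```python
def make_compute_metrics(tokenizer, score_fn):
    """Build a compute_metrics callback for Seq2SeqTrainer.

    score_fn is a hypothetical callable taking (predictions, references) as
    lists of strings and returning a dict of metric names to floats.
    """
    pad_id = tokenizer.pad_token_id

    def clean(batch):
        # -100 marks positions the loss ignores; it is not a real token id.
        return [[int(tok) if tok != -100 else pad_id for tok in row] for row in batch]

    def compute_metrics(eval_pred):
        predictions, labels = eval_pred
        decoded_preds = tokenizer.batch_decode(clean(predictions), skip_special_tokens=True)
        decoded_labels = tokenizer.batch_decode(clean(labels), skip_special_tokens=True)
        return score_fn(
            [pred.strip() for pred in decoded_preds],
            [label.strip() for label in decoded_labels],
        )

    return compute_metrics


# Usage (assumes the evaluate package is installed and tokenizer is already loaded):
# import evaluate
# rouge = evaluate.load("rouge")
# compute_metrics = make_compute_metrics(
#     tokenizer, lambda preds, refs: rouge.compute(predictions=preds, references=refs)
# )
```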

Then we just need to pass all of this, along with our datasets, to the Seq2SeqTrainer:

trainer

Running the Model locally

We can now fine-tune our model by just calling the train method:
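Putting the pieces together might look like the sketch below. The fine_tune wrapper and its parameter names are hypothetical. It assumes the tokenized dataset has train and validation splits and that the training arguments and metrics function were prepared beforehand.

```python
def fine_tune(tokenized_datasets, training_args, compute_metrics, checkpoint="t5-small"):
    """Assemble a Seq2SeqTrainer, run training, and return the trainer."""
    from transformers import (  # lazy import: only needed when training
        AutoModelForSeq2SeqLM,
        AutoTokenizer,
        DataCollatorForSeq2Seq,
        Seq2SeqTrainer,
    )

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
    # Pads inputs and labels dynamically to the longest item in each batch.
    data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

    trainer = Seq2SeqTrainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_datasets["train"],
        eval_dataset=tokenized_datasets["validation"],
        data_collator=data_collator,
        compute_metrics=compute_metrics,
    )
    trainer.train()  # this single call runs the whole fine-tuning loop
    return trainer


# Usage (commented out: downloads the checkpoint and starts training):
# trainer = fine_tune(tokenized_datasets, training_args, compute_metrics)
```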


To run it locally, you need a CUDA-compatible GPU in your machine.
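A quick way to check this before starting a long run, assuming PyTorch is the backend in use (describe_gpu is a hypothetical helper name):

```python
def describe_gpu():
    """Return the name of the first CUDA device, or None if no usable GPU is found."""
    try:
        import torch
    except ImportError:
        return None  # PyTorch is not installed
    if not torch.cuda.is_available():
        return None  # no CUDA device or driver visible to PyTorch
    return torch.cuda.get_device_name(0)


print(describe_gpu() or "No CUDA-compatible GPU detected")
```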

I have an NVIDIA RTX A5000 in my HP ZBook Fury, which allows me to run the model with the following training history:

3 Epochs of Running the Model on Local GPU

The runtime varies between 20 and 26 minutes for three epochs, which is quite fast for such a big dataset.

I was curious about the GPU usage, so here you can see it as well.

Memory Usage during Training

100% GPU utilization is generally a good sign when performing intensive tasks like training machine learning models, as it indicates that the GPU is being fully used for the computations. However, prolonged usage at 100% could lead to overheating if the cooling system isn’t adequate, so be careful and don’t burn your machines!

Testing and observing results

Now is the time to test our fine-tuned model with recent news, such as Google’s Gemini model launch. The full text can be found here.
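Generating a summary for a new article might look like the sketch below. The summarize function is a hypothetical helper, and the length limits and beam count are assumed values. It expects a fine-tuned sequence-to-sequence model and its tokenizer to be loaded already.

```python
def summarize(text, model, tokenizer, max_input_length=1024, max_new_tokens=128):
    """Generate a summary for one article with a fine-tuned T5-style model."""
    inputs = tokenizer(
        "summarize: " + text,          # same task prefix as used during training
        max_length=max_input_length,
        truncation=True,
        return_tensors="pt",
    ).to(model.device)                 # move tensors to the model's device (GPU or CPU)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Usage (commented out; checkpoint_dir is a placeholder for your saved model folder):
# from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)
# model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint_dir).to("cuda")
# print(summarize(news_text, model, tokenizer))
```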

And that’s our result:

generated_summary vs. reference_summary


We’ve just explored the fascinating journey of applying Generative AI to the task of text summarization, with a focus on local execution. As we’ve seen, while the process shares similarities with cloud-based modeling, running it locally offers unique advantages that enhance your work. We have not only explored the capabilities of Generative AI but also highlighted the efficiency and effectiveness of leveraging local resources for such sophisticated tasks.

I will be happy to read your opinion on the result we’ve got. Stay tuned for more insights into the exciting world of AI!

This demo was made in collaboration with Dipanjan Sarkar, our lead data scientist.

*As a Z by HP Global Data Science Ambassador, I have been provided with HP products to facilitate our innovative work.
