New Transformer variants continue to flood the market. Here’s one from Microsoft called Fastformer

The Transformer has become a de facto model for text understanding and is widely used in other areas of artificial intelligence, including computer vision, video and audio processing. But it has a shortcoming: self-attention has quadratic complexity in the length of the input sequence, which makes the model inefficient and computationally heavy on long inputs.

Since the Transformer was introduced in 2017, many approaches have been proposed to speed it up, including Longformer, Linformer, BigBird, Reformer and Poolingformer, along with benchmarks such as Long Range Arena for comparing them. For example, BigBird computes sparse attention instead of dense attention, using a mixture of local attention, global attention at certain positions, and random attention between a number of tokens.
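To make the idea of mixing local, global and random attention concrete, here is a minimal sketch of a sparse attention mask in plain Python. It is not BigBird's actual implementation; the function name, the window size, the choice of position 0 as the global token and the number of random links are all assumptions made for illustration.

```python
import random


def sparse_attention_mask(seq_len, window=1, global_positions=(0,), num_random=1, seed=0):
    """Return a seq_len x seq_len boolean mask; True means query i may attend to key j."""
    rng = random.Random(seed)
    mask = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        # Local attention: a sliding window around each position.
        for j in range(max(0, i - window), min(seq_len, i + window + 1)):
            mask[i][j] = True
        # Global attention: chosen positions see every token and are seen by every token.
        for g in global_positions:
            mask[i][g] = True
            mask[g][i] = True
        # Random attention: a few extra keys sampled for each query.
        for j in rng.sample(range(seq_len), num_random):
            mask[i][j] = True
    return mask


mask = sparse_attention_mask(seq_len=8)
allowed = sum(sum(row) for row in mask)
print(f"{allowed} of {8 * 8} query-key pairs are computed")
```

Only the pairs marked True would be scored, which is where the savings over dense attention come from.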


However, sparse attention usually cannot fully model the global context. Linformer, by contrast, exploits the low-rank characteristics of the self-attention matrix by computing an approximation of it. It projects the attention keys and values into low-dimensional matrices whose size is independent of the sequence length. However, this approximation is context-agnostic, which can weaken the Transformer’s ability to model context. In addition, most of these methods are still not efficient enough when the input sequence is very long.
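The low-rank projection described above can be sketched roughly as follows. This is a simplified, single-head illustration rather than Linformer's real code: the helper names, the random projection matrices and the sizes (8 tokens compressed to 2 projected positions) are assumptions chosen for the example.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def low_rank_attention(queries, keys, values, proj_k, proj_v):
    """Attend over k projected positions instead of all n tokens."""
    # The projections are fixed parameters, so they do not depend on the input context.
    small_keys = matmul(proj_k, keys)      # (k x n) times (n x d) gives k x d
    small_values = matmul(proj_v, values)  # (k x n) times (n x d) gives k x d
    scale = math.sqrt(len(queries[0]))
    dim = len(small_values[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, key)) / scale for key in small_keys]
        weights = softmax(scores)          # only k weights per query, not n
        out.append([sum(w * v[j] for w, v in zip(weights, small_values)) for j in range(dim)])
    return out


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n, k, d = 8, 2, 4
Q, K, V = (random_matrix(n, d, rng) for _ in range(3))
E, F = random_matrix(k, n, rng), random_matrix(k, n, rng)
out = low_rank_attention(Q, K, V, E, F)
print(len(out), len(out[0]))  # 8 4
```

Each query is compared with k projected keys rather than n original ones, so the cost per query no longer grows with the sequence length.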

To solve this problem, a team from Microsoft Research Asia and Tsinghua University proposed Fastformer, an efficient transformer variant based on additive attention. The new method provides efficient context modeling with linear complexity.

“In Fastformer, instead of modeling pairwise interactions between tokens, we first use additive attention mechanisms to model global contexts, and then further transform each token representation based on its interaction with the global context representations,” the team explained. In this way, Fastformer can achieve efficient context modeling with linear complexity.
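The mechanism in the quote can be illustrated with a short sketch in plain Python. It loosely follows the description above and is not the authors' implementation: it assumes a single attention head, takes the query, key and value matrices as given, uses hypothetical scoring vectors `w_q` and `w_k`, and omits the learned input and output projections.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def additive_pool(vectors, w):
    """Additive attention: score each vector with w, then return the weighted sum."""
    scale = math.sqrt(len(w))
    scores = [sum(wi * vi for wi, vi in zip(w, vec)) / scale for vec in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    return [sum(a * vec[j] for a, vec in zip(weights, vectors)) for j in range(dim)]


def fastformer_head(queries, keys, values, w_q, w_k):
    """One simplified Fastformer-style head; every step is linear in sequence length."""
    # 1. Summarize all query vectors into a single global query.
    global_q = additive_pool(queries, w_q)
    # 2. Mix the global query into each key with an element-wise product.
    mixed_keys = [[g * k for g, k in zip(global_q, key)] for key in keys]
    # 3. Summarize the mixed keys into a single global key.
    global_k = additive_pool(mixed_keys, w_k)
    # 4. Mix the global key into each value, then add the query back as a residual.
    return [
        [g * v + q for g, v, q in zip(global_k, value, query)]
        for value, query in zip(values, queries)
    ]


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
seq_len, dim = 6, 4
Q, K, V = (random_matrix(seq_len, dim, rng) for _ in range(3))
w_q = [rng.gauss(0, 1) for _ in range(dim)]
w_k = [rng.gauss(0, 1) for _ in range(dim)]
out = fastformer_head(Q, K, V, w_q, w_k)
print(len(out), len(out[0]))  # 6 4
```

No token-by-token attention matrix is ever built here. Each token interacts only with the two global vectors, which is why the cost grows linearly with the sequence length.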

Fastformer architecture (Source: arXiv)


The researchers experimented with five benchmark datasets across a variety of tasks, including sentiment classification, topic prediction, news recommendation, and text summarization.

The datasets used are Amazon (review rating prediction), IMDB (movie review rating prediction), MIND (news recommendation and intelligence), CNN/DailyMail (text summarization) and PubMed (a text summarization dataset with much longer documents).

In their experiments, the researchers used GloVe embeddings to initialize the token embedding matrix. Additionally, to obtain embeddings in the classification and news recommendation tasks, the team applied an additive attention network to convert the matrix output by Fastformer into a single embedding. Finally, in the news recommendation task, they used Fastformer in a hierarchical fashion, first to learn news embeddings from news headlines and then to learn user embeddings from the embeddings of historically clicked news.

The team used Adam for model optimization and ran their experiments on an NVIDIA Tesla V100 GPU with 32 GB of memory. The researchers repeated each experiment five times and reported the average performance and standard deviations, as shown below.

(Source: arXiv)

For classification tasks, the researchers used accuracy and macro-F scores as performance metrics. For news recommendation tasks, the team used AUC, MRR, nDCG@5 and nDCG@10. Finally, they used the ROUGE-1, ROUGE-2 and ROUGE-L metrics to evaluate the summaries generated in the text summarization tasks.
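As a rough illustration of two of the ranking metrics named above, the sketch below computes the reciprocal rank and nDCG@k for a single impression with binary click labels. The function names and the sample labels are hypothetical, and evaluation scripts differ in detail (for example, some average the reciprocal ranks of all clicked items rather than using only the first), so treat this as one common formulation rather than the paper's exact procedure.

```python
import math


def reciprocal_rank(ranked_labels):
    """1 / rank of the first clicked item (label 1), or 0.0 if nothing was clicked."""
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            return 1.0 / rank
    return 0.0


def dcg_at_k(ranked_labels, k):
    """Discounted cumulative gain over the top-k positions."""
    return sum(
        label / math.log2(rank + 1)
        for rank, label in enumerate(ranked_labels[:k], start=1)
    )


def ndcg_at_k(ranked_labels, k):
    """DCG@k divided by the DCG@k of the ideal ordering."""
    ideal = dcg_at_k(sorted(ranked_labels, reverse=True), k)
    return dcg_at_k(ranked_labels, k) / ideal if ideal > 0 else 0.0


# Hypothetical impression: click labels listed in the order the model ranked the news.
ranked = [0, 1, 0, 0, 1, 0]
print(reciprocal_rank(ranked))  # 0.5
print(round(ndcg_at_k(ranked, 5), 3))
```

MRR is then the mean of the reciprocal rank over all impressions, and nDCG@5 and nDCG@10 are averaged across impressions in the same way.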

The experiments conducted by the researchers at Microsoft Research Asia and Tsinghua University showed that Fastformer achieves fairly competitive results in long text modeling. In addition, the results showed that Fastformer was much more efficient than many existing Transformer models.

Here are some of the main strengths of Fastformer:

  • According to the authors, Fastformer is, to the best of their knowledge, the most efficient Transformer architecture.
  • It models the interaction between global contexts and token representations via element-wise product, which can help model contextual information more efficiently.
  • Experiments with five datasets show that it is much more efficient than many Transformer models while achieving competitive performance.

Going forward, the researchers said they plan to pre-train Fastformer-based language models to better empower NLP tasks that involve long document modeling. They also plan to explore applying Fastformer to other scenarios, such as e-commerce recommendation and ad CTR prediction, to improve user modeling based on long sequences of user behavior.

Dealing with news recommendation bias

News recommendation is essential for personalized access to news. Existing news recommendation methods infer user interests from historically clicked news articles and train recommendation models by predicting future news clicks. In other words, they assume that news click behavior indicates user interest.

In reality, click behavior can also be affected by other factors, such as the way news is presented on the online platform. For example, news displayed in higher positions and at larger sizes is generally more likely to be clicked. This click bias can disrupt user interest modeling and model training, thereby harming the personalized news recommendation model.

To solve this problem, the team also proposed a bias-aware personalized news recommendation method called “DebiasRec”. This technique can handle bias information for more accurate inference of user interests and for model training. It includes a bias representation module, a bias-aware user modeling module and a bias-aware click prediction module. “Experiments on two real-world datasets show that our method can effectively improve the performance of news recommendation,” the researchers said.
