Realistic Document Generation using Generative Adversarial Networks

Aline Van Driessche
Ixor
6 min read · Jul 7, 2021

Introduction

Over the past decades, the number of digital documents such as contracts, invoices and receipts has grown exponentially. Traditionally, all these documents were handled manually, but recently there has been a shift towards automating their processing using artificial intelligence (AI) methods. This yields cost savings and helps address challenges concerning document sensitivity, language barriers and timezone delays.

Unfortunately, modern AI methods require a large amount of (often hand-labelled) data for good generalisation. Therefore, it is worthwhile to research and implement data augmentation techniques that expand the dataset with artificial, anonymous documents at little cost. This blog summarises the findings of a Master's thesis done at Ixor on solving this problem using Generative Adversarial Networks (GANs).

GAN: How it works

GANs are an approach to generative modelling using deep learning methods [1]. The basic idea is to train a model that automatically discovers and learns the patterns in given input data, such that it can reproduce or generate new examples as if they were drawn from the original dataset.

A GAN consists of two components: a generator and a discriminator, which are pitted against each other. Both networks are trained on the same dataset.

The goal of the discriminator (D) is to classify whether a sample was drawn from the dataset or not. The generator (G) has to fool the discriminator by creating samples that resemble those in the dataset. The generator's input is a noise vector.

Both networks play a minimax game in which D is trained to maximise the probability of assigning the correct label to real and fake samples, while G is trained to minimise the probability that its fake samples are recognised as fake.
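In the original GAN formulation [1], this minimax game corresponds to the value function:

```latex
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]
```

D tries to push both expectations up (correctly scoring real samples high and generated samples low), while G only influences the second term and tries to push it down.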

General Information Flow in GANs

Text Representation

Since GANs were first proposed in 2014, they have been the centre of attention in numerous visual applications to generate new images based on a given dataset. However, since GANs are developed to solve problems in a continuous context (e.g. pixels in an image), little research has been done on applying these types of networks to discrete data such as text. In theory, though, it should also work by replacing the usual pixel values with some kind of word representation value(s).

The easiest approach is to represent the text as a one-hot encoded vector. If the vocabulary contains four unique words, a four-dimensional representation vector is needed and each unique word is associated with an index in this vector, as shown in the figure.

One-hot encoded word representation

This method quickly becomes very computationally inefficient for large vocabularies. Moreover, the one-hot encoded approach is discrete in nature and does not allow for information-sharing across words with similar properties.
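As a minimal sketch (the four-word sentence is hypothetical), one-hot encoding can be implemented like this:

```python
def one_hot_encode(sentence):
    # Assign each unique word an index, then build a vector per word with a
    # single 1 at that index. Vector length equals the vocabulary size.
    vocab = {}
    for word in sentence.split():
        vocab.setdefault(word, len(vocab))
    return {
        word: [1 if i == idx else 0 for i in range(len(vocab))]
        for word, idx in vocab.items()
    }

# A hypothetical four-word example, giving four-dimensional vectors.
vectors = one_hot_encode("total amount due today")
```

With a realistic invoice vocabulary of tens of thousands of words, each vector would be tens of thousands of dimensions long with a single non-zero entry, which illustrates the inefficiency described above.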

To overcome these limitations, modern methods use distributed representations, also referred to as word embeddings. The advantage is that this technique introduces dependencies between different words, as embeddings capture relationships in language, both grammatical (e.g. invoice/invoices) and semantic (e.g. school/college), that are very difficult to capture otherwise [3]. Additionally, these dense vectors imply fewer parameters to estimate.
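The relationship-capturing property can be sketched with toy embeddings; the vectors below are made up for illustration (real embeddings are learned from data and typically have hundreds of dimensions):

```python
import math

# Toy 3-dimensional embeddings with invented values.
embeddings = {
    "invoice":  [0.9, 0.1, 0.3],
    "invoices": [0.8, 0.2, 0.3],
    "school":   [0.1, 0.9, 0.7],
    "college":  [0.2, 0.8, 0.6],
}

def cosine(u, v):
    # Cosine similarity: close to 1.0 for related words, lower for unrelated ones.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

sim_related = cosine(embeddings["invoice"], embeddings["invoices"])
sim_unrelated = cosine(embeddings["invoice"], embeddings["school"])
```

Related word pairs end up close together in the embedding space, so the network can share information between them, unlike with one-hot vectors, which are all mutually orthogonal.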

Word embedding representation

Model architecture

Real samples are converted to word vectors (or word labels) before being fed into the discriminator. The generator's task is to generate a grid in which each individual pixel is represented by such an embedding. In this system, content and layout are generated simultaneously and evaluated in one joint loss function. The proposed system focuses all attention on the capabilities of a GAN network, without incorporating additional techniques such as an LSTM network.

GAN-model to generate images with text

The generator and discriminator are both deep convolutional networks.

Training

While training, a few issues quickly surfaced:

  • Discriminator loss converging to zero
  • Repetitive patterns within the predicted grid
  • Mode Collapse

The discriminator loss quickly converges to zero when the discriminator is too powerful compared to the generator, resulting in little to no feedback for the generator to improve its images. Explained in terms of the images: it is too easy for the discriminator to determine whether an image is real or fake.
A solution to this hurdle is to simply tune the learning rates of the generator and discriminator separately. This is called the Two Time-scale Update Rule (TTUR), which means that both networks have an individual learning rate. A learning rate for D two to four times lower than that of G creates more space for the generator to learn before the discriminator stabilises again.
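The idea reduces to giving each network its own optimiser with its own learning rate. A minimal sketch with plain SGD (the learning-rate values are illustrative, not the thesis' exact settings):

```python
# Two Time-scale Update Rule: separate learning rates for G and D.
lr_G = 2e-4          # generator learning rate (illustrative value)
lr_D = lr_G / 4.0    # discriminator learns 4x slower, giving G room to improve

def sgd_step(params, grads, lr):
    # One vanilla gradient-descent update: p <- p - lr * g
    return [p - lr * g for p, g in zip(params, grads)]

# Each network's parameters are updated with its own rate.
g_params = sgd_step([0.5, -0.3], [0.2, -0.1], lr_G)
d_params = sgd_step([0.1, 0.8], [0.2, -0.1], lr_D)
```

In a deep learning framework this amounts to constructing two optimisers, one per network, each with its own learning rate.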

To deal with the overly repetitive patterns, a relatively easy fix is to add each pixel's positional information to the pixel itself [4]. A CNN learns to recognise patterns in the given dataset in a location-independent manner. To re-introduce location knowledge, the number of channels is extended with two extra channels (the x- and y-coordinate of each pixel, respectively) before feeding the samples into the discriminator.
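A sketch of this coordinate-channel trick on plain nested lists (a framework implementation would do the same with tensor operations):

```python
def add_coord_channels(channels):
    # `channels` is a list of H x W grids (one grid per channel).
    # Appends an x-channel and a y-channel with coordinates scaled to [0, 1],
    # re-introducing the positional information a CNN would otherwise discard.
    h, w = len(channels[0]), len(channels[0][0])
    x_chan = [[x / (w - 1) for x in range(w)] for _ in range(h)]
    y_chan = [[y / (h - 1)] * w for y in range(h)]
    return channels + [x_chan, y_chan]

image = [[[0.0] * 4 for _ in range(3)]]  # one channel, 3 rows x 4 columns
augmented = add_coord_channels(image)    # now three channels
```

Every pixel now carries its own location, so the discriminator can penalise patterns that repeat in places where the real invoices never repeat them.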

The architecture of the discriminator defines which parts of the images are read and how they are interpreted. Small feature sizes do not capture much information about neighbouring pixels, while too-large feature sizes may overlook details that are important for constructing an image similar to those in the invoice dataset. A solution that combines the benefits of both is to create two separate discriminators that evaluate the same samples through differently designed networks, and to combine their two outputs in a weighted manner into the final discriminator output. The discriminator with more features/channels in its layers is responsible for the general structure, while the smaller discriminator looks at the smaller compositions present on the invoices.
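The weighted combination itself is a one-liner; the sketch below uses stand-in discriminators and an illustrative 50/50 weighting (the thesis' actual networks and weights are not specified here):

```python
def combined_score(sample, d_coarse, d_fine, w_coarse=0.5):
    # d_coarse: larger network judging the overall invoice structure.
    # d_fine: smaller network judging local compositions.
    # The final discriminator output is a weighted blend of both opinions.
    return w_coarse * d_coarse(sample) + (1.0 - w_coarse) * d_fine(sample)

# Stand-in discriminators for illustration only.
d_coarse = lambda s: 0.8
d_fine = lambda s: 0.2
score = combined_score("some sample", d_coarse, d_fine)
```

Both discriminators receive gradients from the same joint loss, so the generator must satisfy the coarse and the fine critic at the same time.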

Finally, the most impactful issue encountered during training is mode collapse, both when using the label embeddings and when using the word vectors. This means that during training, the generator collapses into producing nearly identical outputs, no matter what the latent input vector is. In this case, the generator may succeed in continuously fooling the discriminator with this one sample, but it fails to represent the full distribution of the given dataset.

A major improvement regarding these issues was obtained by using a Wasserstein GAN, where the loss is measured in terms of the Wasserstein distance between the distribution of the data observed in the training set and the distribution observed in the generated examples [5]. Although it originates from a completely different theoretical grounding than the original GAN, the changes required in the actual implementation are limited, while the effects are considerable.
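The limited implementation changes boil down to new loss functions and, in the original WGAN recipe, clipping the critic's weights; a sketch under those assumptions:

```python
# Wasserstein GAN losses [5]. Scores are the critic's raw, unbounded
# outputs; unlike the original GAN, no sigmoid or log is applied.

def mean(xs):
    return sum(xs) / len(xs)

def critic_loss(real_scores, fake_scores):
    # The critic maximises E[D(real)] - E[D(fake)]; we minimise the negation.
    return mean(fake_scores) - mean(real_scores)

def generator_loss(fake_scores):
    # The generator pushes the critic's score on its fakes upward.
    return -mean(fake_scores)

def clip_weights(weights, c=0.01):
    # Weight clipping keeps the critic roughly 1-Lipschitz,
    # as required for the Wasserstein-distance estimate to be valid.
    return [max(-c, min(c, w)) for w in weights]
```

Because the critic's score no longer saturates, its loss keeps providing a useful gradient to the generator even when the critic is winning, which directly addresses the vanishing-feedback and mode-collapse problems described above.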

Results

The following figures (left: word embeddings, right: label embeddings) show some final outputs using the above-described improvements (and some further, smaller adjustments described in the thesis).

Generated invoice-layout results

Not only the generated layout is of importance, but also the content contained within that layout. Unfortunately, in most of the generated samples the content does not make sense; often one word is repeated at several locations throughout the full invoice. In some cases, however, parts of a real invoice, such as a price list or an address block, could be extracted (as illustrated below).

Generated invoice-content results

Conclusion

The results show potential that invoices and other documents could be created using a GAN, although drastic further improvements are required to enable its use in production.

At IxorThink, the machine learning practice of Ixor, we are constantly trying to improve our methods to create state-of-the-art solutions. As a software company, we can provide stable products from proof-of-concept to deployment. Feel free to contact us for more information.
