In the fast-evolving world of generative models, Transformer- and Diffusion-based architectures have delivered exciting breakthroughs, showing exceptional performance in image generation. In video generation, however, they hit a roadblock best known from Large Language Models (LLMs): the hallucination issue.
As of 2023, the Vanilla Autoencoder remains the preferred architecture for video generation tasks built on the img2img framework. Its self-reconstruction vectors are harder to interpolate than the highly interpolable vectors of a Variational Autoencoder, but once that hurdle is overcome they are remarkably stable in the temporal domain. It's a reminder that sometimes the simplest solution is the most effective. Thanks to Geoffrey Hinton et al.
Although autoencoders alone are not typically considered full-fledged generative models like GANs or VAEs, they serve as the generative component in deepfakes.
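To make that concrete, here is a minimal PyTorch sketch of the classic deepfake autoencoder layout: one encoder shared by both identities and one decoder per identity, with the swap performed by decoding a source face through the target's decoder. Layer sizes are illustrative, not any specific tool's configuration.

```python
import torch
import torch.nn as nn

def down(c_in, c_out):   # conv block that halves spatial resolution
    return nn.Sequential(nn.Conv2d(c_in, c_out, 5, 2, 2), nn.LeakyReLU(0.1))

def up(c_in, c_out):     # transposed conv block that doubles it
    return nn.Sequential(nn.ConvTranspose2d(c_in, c_out, 4, 2, 1), nn.LeakyReLU(0.1))

encoder = nn.Sequential(down(3, 64), down(64, 128), down(128, 256))   # shared
decoder_a = nn.Sequential(up(256, 128), up(128, 64),
                          nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Sigmoid())
decoder_b = nn.Sequential(up(256, 128), up(128, 64),
                          nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Sigmoid())

# Training: each identity reconstructs itself through the shared encoder,
# e.g. mse_loss(decoder_a(encoder(face_a)), face_a), and likewise for B.
# Swapping: encode a face of A, then decode it with B's decoder.
face_a = torch.rand(1, 3, 256, 256)
swap_ab = decoder_b(encoder(face_a))
```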
The term “deepfake” gained attention in 2017 when an open-source faceswap model appeared on Reddit. Since then, many new tools have been developed based on this architecture.
Today, one of the most widely used tools is DeepFaceLab, created by Ivan Petrov. Despite some controversy surrounding him, he has built tools the film industry can actually use. With models that surpass traditional deepfake models, particularly in shadow and light transfer, he has come a step closer to the quality VFX companies have been waiting for.
However, identity leakage, consistent shadow and light transfer, and the huge memory requirements of high-resolution training remain known challenges.
Today, we are announcing PAGI Gen, which overcomes these challenges. We designed our model architecture around ID-conditional autoencoders, which allows us to keep all layers shared, including the intermediate layers. This new architecture brings the following benefits.
1) Outstanding Light and Expression Transfer
In contrast to other face-swap models, all weights in this model, including the intermediate layers, are shared between target and source. This enables the model to capture significantly more precise shadow and light details.
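The simplified PyTorch sketch below illustrates the general idea of such a fully shared, ID-conditional autoencoder: a single encoder/decoder pair for all identities, with each identity injected as a learned embedding. It is not our production architecture; the layer sizes and the fusion scheme are illustrative assumptions.

```python
import torch
import torch.nn as nn

class IDConditionalAE(nn.Module):
    """Every weight (encoder, intermediate, decoder) is shared across
    identities; only a small per-identity embedding differs."""
    def __init__(self, num_ids: int, id_dim: int = 128, ch: int = 256):
        super().__init__()
        self.id_embed = nn.Embedding(num_ids, id_dim)   # one vector per identity
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 5, 2, 2), nn.LeakyReLU(0.1),
            nn.Conv2d(64, ch, 5, 2, 2), nn.LeakyReLU(0.1))
        self.proj = nn.Linear(id_dim, ch)   # lift the ID vector into latent space
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(2 * ch, 64, 4, 2, 1), nn.LeakyReLU(0.1),
            nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Sigmoid())

    def forward(self, x, id_idx, id_scale: float = 1.0):
        z = self.encoder(x)                              # identity-agnostic content code
        e = self.proj(self.id_embed(id_idx)) * id_scale  # scaled identity code
        e = e[:, :, None, None].expand(-1, -1, z.shape[2], z.shape[3])
        return self.decoder(torch.cat([z, e], dim=1))

# Swapping = encode any source frame, decode under the target's ID:
model = IDConditionalAE(num_ids=2)
frame = torch.rand(1, 3, 256, 256)
swapped = model(frame, torch.tensor([1]))                # render as identity 1
```

Because every convolution sees every identity's lighting conditions during training, shading cues learned from one face benefit all others.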
2) High Resolution
Reducing the model's components and parameters enables training at higher resolutions and batch sizes without compromising quality. On an Nvidia RTX 3090, the model processes a batch of 2 at 1024×1024 resolution in just 900 milliseconds.
This makes it possible for indie VFX artists to produce Hollywood-quality content on the GPUs at their disposal. In theory, the model has no resolution limit as long as you have sufficient graphics memory; a minimum of 24 GB of VRAM is required to train at Full HD resolution.
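As a quick sanity check on those numbers (assuming the quoted 900 ms covers one full training step at batch size 2):

```python
# Back-of-envelope throughput from the figures above.
frames_per_step = 2
step_seconds = 0.9
fps = frames_per_step / step_seconds            # ~2.2 frames/s
print(f"{fps:.1f} frames/s -> {fps * 3600:,.0f} training frames/hour")
```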
3) Overcoming ID Leaks
Identity leakage is a common concern in face-swapping tools today.
Because our model operates as an ID-conditional network, the ID information multiplier can be used as a hyperparameter during the prediction phase. It can be adjusted within limits that do not disrupt expressions and lighting.
This new architecture ensures that the final output truly resembles your intended target, not an impersonator.
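Continuing the illustrative sketch from feature 1, where the multiplier is exposed as `id_scale`, using it at prediction time could look like this. The values shown are examples, not recommended settings.

```python
import torch

# Sweep the ID multiplier at inference; higher values strengthen the
# target identity and suppress source leakage, but pushing too far
# starts to distort expressions and lighting.
with torch.no_grad():
    for id_scale in (1.0, 1.2, 1.4):
        out = model(frame, torch.tensor([1]), id_scale=id_scale)
```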
4) ID Interpolability and Few-Shot Training
While this usage is still in the research stage, the images below were created using only 30 frames and no post-processing.
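In the terms of the earlier sketch, interpolating identities amounts to blending two rows of the learned ID embedding table before decoding. The helper below is illustrative, not our production code.

```python
import torch

def decode_with_id(model, x, id_vec, id_scale=1.0):
    """Mirror of IDConditionalAE.forward, but taking a raw ID vector."""
    z = model.encoder(x)
    e = model.proj(id_vec) * id_scale
    e = e[:, :, None, None].expand(-1, -1, z.shape[2], z.shape[3])
    return model.decoder(torch.cat([z, e], dim=1))

# A 50/50 blend of two learned identities:
mix = 0.5 * model.id_embed.weight[0] + 0.5 * model.id_embed.weight[1]
blended = decode_with_id(model, frame, mix[None])
```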
5) Quick Addition of New Targets
Since there are no separate intermediate or decoder components per identity, the model can be fine-tuned very quickly for a new target. There is no need to reset any components or weights, so learned expressions, lighting, and shadows are preserved.
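In the sketch's terms, adding a target reduces to growing the ID embedding table by one row and fine-tuning; the snippet below is illustrative, not our actual training code.

```python
import torch
import torch.nn as nn

# Grow the embedding table by a single row; every shared weight stays
# exactly as trained, so nothing learned is thrown away.
old = model.id_embed
new = nn.Embedding(old.num_embeddings + 1, old.embedding_dim)
with torch.no_grad():
    new.weight[:-1] = old.weight          # keep the existing identities
model.id_embed = new

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# ...then run a short fine-tune on frames of the new identity
# (ID index old.num_embeddings).
```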
6) Auto Blend
Auto Blend removes the need for a manual post-process to blend faces. This feature is still in the research stage but can be experimented with through the advanced options. The following video is just one of our experiments with it.
7) Screen Swap
Screen Swap removes the need for a manual post-processing and blending phase. This feature is also in the research stage but can be experimented with through the advanced options. The following video is just one of our experiments with it.
8) Greater-than-8-bit Color Depth Support
The model's compact structure makes it well suited to modification, an aspect we consider important for producing 4K content. Until we make improvements in this regard, you can convert your 4K datasets to 8-bit color depth. In our tests, the ffmpeg branch that performs this conversion with the least visual loss is jellyfin's, and we include this branch in the dataset extraction process within PAGI Gen.
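For reference, a generic conversion with stock ffmpeg looks like the following; the bundled jellyfin build may choose different defaults, and the file names here are placeholders.

```python
import subprocess

# Re-encode a 10-bit 4K source down to 8-bit 4:2:0 with a near-lossless CRF.
subprocess.run([
    "ffmpeg", "-i", "input_4k_10bit.mp4",
    "-pix_fmt", "yuv420p",             # force 8-bit 4:2:0
    "-c:v", "libx264", "-crf", "16",   # near-lossless re-encode
    "output_4k_8bit.mp4",
], check=True)
```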
Beyond the model itself, PAGI Gen encompasses modules such as real-time swapping and voice swap within an end-to-end generation framework. We will delve into these tools, including Dataset Builder, Curation, and Blender, in upcoming blog posts.
Responsibility
Up until now, generative applications in the cloud have been given the green light while offline faceswap tools have been condemned, and there is a clear reluctance to release such tools to the public. In the age of GenAI's rapid evolution, we, drawing on our extensive experience in cybersecurity, view this approach as similar to allowing Microsoft Visual Studio to compile exclusively in the cloud. We consider this stance outdated, and that is why we have decided to make our products accessible to all while taking responsibility:
We can see both the advantages and disadvantages of offering our product for public use. However, we do not believe that withholding such tools from the public, or adding filtering measures on the cloud side of GenAI tools, can prevent the deepfake threat.
We are aware of our responsibilities in this regard, have taken measures to prevent the misuse of our software for malicious deepfakes, and plan to improve them further.
1) Adult Content Detection
As most will agree, the deepfake threat has targeted women the most. The privacy of women, already turned into advertising commodities by capitalism, has been violated even further with synthetic images. PAGI Gen scans datasets for adult content and does not allow training on it.
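Conceptually, the gate is a simple dataset scan. The sketch below is illustrative only; `load_nsfw_classifier` is a hypothetical stand-in, as we do not document the detector we actually ship.

```python
from pathlib import Path

def scan_dataset(root: str, threshold: float = 0.5) -> None:
    """Refuse to train if any frame in the dataset is flagged as adult content."""
    classifier = load_nsfw_classifier()        # hypothetical helper
    flagged = [p for p in Path(root).rglob("*.png")
               if classifier.nsfw_probability(p) > threshold]
    if flagged:
        raise RuntimeError(f"{len(flagged)} flagged frames; training refused.")
```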
2) Invisible Watermark
PAGI Gen adds invisible watermarks to its outputs so that third parties can recognize them as machine-generated.
Contact [email protected] for the detection module.
We hope Google will soon release its AI-based watermarking technology, SynthID, to the public, enabling us to incorporate more robust watermarks into our products. Additionally, we plan to integrate Adobe's authentication into our products as a whitelisting strategy.
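To illustrate the general idea at its simplest, here is a toy least-significant-bit watermark in NumPy; our production scheme is different and considerably more robust.

```python
import numpy as np

def embed_bits(frame: np.ndarray, bits: np.ndarray) -> np.ndarray:
    """Hide 0/1 bits in the least significant bit of the blue channel."""
    out = frame.copy()                  # H x W x 3, uint8, RGB
    blue = out.reshape(-1)[2::3]        # flat view of the blue channel
    blue[: bits.size] = (blue[: bits.size] & 0xFE) | bits
    return out

def extract_bits(frame: np.ndarray, n: int) -> np.ndarray:
    return frame.reshape(-1)[2::3][:n] & 1

frame = np.random.randint(0, 256, (1080, 1920, 3), dtype=np.uint8)
bits = np.unpackbits(np.frombuffer(b"PAGI", dtype=np.uint8))
marked = embed_bits(frame, bits)
assert np.array_equal(extract_bits(marked, bits.size), bits)
```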
3) Deepware Deepfake Scanner
At Deepware, we have released a basic deepfake scanner.
You can scan suspicious videos for free at: scanner.deepware.ai
Try PAGI Gen Beta
If you’re interested in trying out PAGI Gen, please submit a beta request.