Voice Gender Conversion using a Hierarchical Autoencoder with Adversarial Learning

This paper presents a structured neural voice conversion (VC) architecture that allows the manipulation of voice attributes (e.g., gender and age) based on the adversarial learning of a hierarchically structured speech and speaker encoding. The proposed VC architecture employs multiple auto-encoders to encode speech as a set of ideally independent linguistic and extra-linguistic representations, which are learned adversarially and can be manipulated during VC. Moreover, the proposed architecture is time-synchronized, so the original voice timing is preserved during conversion, which allows lip-sync applications. A set of objective and subjective evaluations conducted on the VCTK dataset shows the efficiency of the proposed framework on the task of voice gender manipulation. Further work will investigate the generalization of the proposed framework to other voice attributes, such as age, attitudes, and emotions. [Read More](https://arxiv.org/pdf/2107.12346)
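
To make the time-synchronization idea in the abstract concrete, the toy sketch below shows frame-wise conversion in which only a gender attribute is changed while the number of frames stays the same. It is an illustration under stated assumptions, not the paper's implementation: the feature sizes, the untrained random linear maps `W_enc` and `W_dec`, the scalar gender value, and the function names are all hypothetical, and the adversarial training that would make the content code independent of gender is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen for illustration only (not taken from the paper).
N_FEATS, CONTENT_DIM = 80, 16

# Untrained random linear maps standing in for learned encoder/decoder networks.
W_enc = rng.standard_normal((N_FEATS, CONTENT_DIM)) * 0.1
W_dec = rng.standard_normal((CONTENT_DIM + 1, N_FEATS)) * 0.1


def encode_content(frames):
    """Map each input frame to a content code: one code per frame."""
    return np.tanh(frames @ W_enc)


def decode(content, gender):
    """Rebuild frames from per-frame content plus one utterance-level gender value."""
    gender_col = np.full((content.shape[0], 1), gender)
    return np.concatenate([content, gender_col], axis=1) @ W_dec


def convert(frames, target_gender):
    """Keep the per-frame content codes and swap only the gender attribute."""
    return decode(encode_content(frames), target_gender)


# 120 frames of toy acoustic features (random numbers, not real speech).
utterance = rng.standard_normal((120, N_FEATS))
converted_g0 = convert(utterance, 0.0)
converted_g1 = convert(utterance, 1.0)
```

Because every step operates frame by frame, `converted_g0` and `converted_g1` each contain exactly as many frames as `utterance`, which is the property that would keep converted speech aligned with the original timing.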