As more artificial intelligence applications are built, the need for a text-to-speech (TTS) engine API grows. The good news is that there are many open-source TTS modules. This story covers Python’s top text-to-speech (TTS) libraries.
gTTS
gTTS (Google Text-to-Speech) is a Python library and command-line tool that converts text to speech through Google Translate’s text-to-speech API. It’s designed to be easy to use and provides a few options for controlling the speech output, such as the language, the regional accent, and a slower reading speed. It does not expose volume or pitch controls.
When I wrote this post, the project had 1.7k stars on GitHub.
Usage
To use gTTS, you will need to install the library using pip:
pip install gTTS
Then, you can use the gTTS class to create an instance of the text-to-speech converter. You can pass the text you want to convert to speech as a string to the gTTS constructor. For example:
from gtts import gTTS
tts = gTTS("Hello this is a normal text and i am a python package. lol")
Once you have an instance of the gTTS class, you can use the save method to save the speech to a file. For example:
tts.save("speech.mp3")
You can also pass arguments to the gTTS constructor to change the language of the speech output or to slow the speech down. For example:
tts = gTTS("Bonjour, ceci est un texte normal et je suis un paquet python. lol", lang='fr')
tts.save("speech.mp3")
tts = gTTS("Hello this is a normal text and i am a python package. lol", slow=True)
tts.save("speech.mp3")
Complete code and output
from gtts import gTTS
tts = gTTS("Hello this is a normal text and i am a python package. lol")
tts.save("speech.mp3")
A few other options are available, such as the tld argument, which selects the Google Translate domain and therefore the regional accent. gTTS does not provide volume or pitch controls. You can find more information about these options in the gTTS documentation.
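The variations shown earlier all save to speech.mp3, so each call overwrites the previous file. Here is a minimal sketch of one way to keep them apart, by mapping each output name to its gTTS arguments. The names VARIANTS and render_all are made up for this example, and the tld entry is an extra option not used earlier in the post. Running render_all needs gTTS installed and an internet connection.

```python
from pathlib import Path

# Hypothetical mapping: output file name -> keyword arguments for gTTS.
VARIANTS = {
    "speech_en.mp3": {"text": "Hello, I am a Python package.", "lang": "en"},
    "speech_fr.mp3": {"text": "Bonjour, je suis un paquet Python.", "lang": "fr"},
    "speech_slow.mp3": {"text": "Hello, I am a Python package.", "lang": "en", "slow": True},
    "speech_uk.mp3": {"text": "Hello, I am a Python package.", "lang": "en", "tld": "co.uk"},
}


def render_all(variants, out_dir="."):
    """Save one MP3 per variant and return the paths that were written."""
    # Imported here so the mapping can be inspected without gTTS installed.
    from gtts import gTTS

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for name, kwargs in variants.items():
        target = out / name
        gTTS(**kwargs).save(str(target))
        written.append(target)
    return written
```

Calling render_all(VARIANTS) would then write four separate files instead of reusing one name.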
CoquiTTS
I already have a series of videos and posts about CoquiTTS that you can find here.
CoquiTTS is a neural text-to-speech (TTS) library developed in PyTorch. It is designed to be easy to use and provides a range of options for controlling the speech output, such as setting the language, the pitch, and the duration of the speech.
It is the most popular package in this list, with 7.4k stars on GitHub.
To use CoquiTTS, you will need to install the library using pip:
pip install TTS
Once you have installed the library, you can use the Synthesizer class to create an instance of the text-to-speech converter. You pass the paths of the downloaded model checkpoint and config, plus those of the vocoder, to the Synthesizer constructor. For example:
# import all the modules that we will need to use
from TTS.utils.manage import ModelManager
from TTS.utils.synthesizer import Synthesizer
path = "/path/to/pip/site-packages/TTS/.models.json"
model_manager = ModelManager(path)
model_path, config_path, model_item = model_manager.download_model("tts_models/en/ljspeech/tacotron2-DDC")
voc_path, voc_config_path, _ = model_manager.download_model(model_item["default_vocoder"])
syn = Synthesizer(
    tts_checkpoint=model_path,
    tts_config_path=config_path,
    vocoder_checkpoint=voc_path,
    vocoder_config=voc_config_path
)
Once you have an instance of the Synthesizer class, you can use the tts method to generate speech. You can save the speech to a file using the save_wav method. For example:
text = "Hello from a machine"
outputs = syn.tts(text)
syn.save_wav(outputs, "audio-1.wav")
Complete code and output
from TTS.utils.manage import ModelManager
from TTS.utils.synthesizer import Synthesizer
import site
location = site.getsitepackages()[0]
path = location+"/TTS/.models.json"
model_manager = ModelManager(path)
model_path, config_path, model_item = model_manager.download_model("tts_models/en/ljspeech/tacotron2-DDC")
voc_path, voc_config_path, _ = model_manager.download_model(model_item["default_vocoder"])
synthesizer = Synthesizer(
    tts_checkpoint=model_path,
    tts_config_path=config_path,
    vocoder_checkpoint=voc_path,
    vocoder_config=voc_config_path
)
text = "Hello from a machine"
outputs = synthesizer.tts(text)
synthesizer.save_wav(outputs, "audio-1.wav")
You can find the documentation here.
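After saving a file such as audio-1.wav, it can be useful to check that it has a sensible length. The sketch below reads the duration and sample rate of a WAV file with Python's standard wave module. It assumes the file is plain PCM, which is what the wave module can read. wav_duration_seconds and write_silence are hypothetical helper names, and write_silence only creates a stand-in file so the example runs without any TTS library.

```python
import wave


def wav_duration_seconds(path):
    """Return (duration in seconds, sample rate) of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
    return frames / rate, rate


def write_silence(path, seconds=1.0, rate=22050):
    """Write a mono 16-bit silent WAV, used here only as a stand-in file."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
```

With a real output file you would call wav_duration_seconds("audio-1.wav") directly and skip write_silence.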
TensorFlowTTS
TensorFlowTTS (TensorFlow Text-to-Speech) is a deep learning-based text-to-speech (TTS) library built on TensorFlow 2, an open-source platform for machine learning and artificial intelligence. It is maintained by the TensorSpeech community rather than by the TensorFlow team. It is designed to be easy to use and provides a range of features for building TTS systems, including support for multiple languages and customizable models.
It has 3k stars on GitHub.
To use TensorFlowTTS, you will need to install the library using pip:
pip install TensorFlowTTS
Sample code
import numpy as np
import soundfile as sf
import yaml
import tensorflow as tf
from tensorflow_tts.inference import TFAutoModel
from tensorflow_tts.inference import AutoProcessor
# initialize fastspeech2 model.
fastspeech2 = TFAutoModel.from_pretrained("tensorspeech/tts-fastspeech2-ljspeech-en")
# initialize mb_melgan model
mb_melgan = TFAutoModel.from_pretrained("tensorspeech/tts-mb_melgan-ljspeech-en")
# load the text processor and convert the text to input IDs
processor = AutoProcessor.from_pretrained("tensorspeech/tts-fastspeech2-ljspeech-en")
input_ids = processor.text_to_sequence("Hello from a computer.")
# fastspeech inference
mel_before, mel_after, duration_outputs, _, _ = fastspeech2.inference(
    input_ids=tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
    speaker_ids=tf.convert_to_tensor([0], dtype=tf.int32),
    speed_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    f0_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    energy_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
)
# melgan inference
audio_before = mb_melgan.inference(mel_before)[0, :, 0]
audio_after = mb_melgan.inference(mel_after)[0, :, 0]
# save to file
sf.write('./audio_before.wav', audio_before, 22050, "PCM_16")
sf.write('./audio_after.wav', audio_after, 22050, "PCM_16")
TensorFlowTTS also provides pre-trained models for various languages, including English, Chinese, and Japanese. You can use these models to perform speech synthesis without training your own model. You can find more information about how to use TensorFlowTTS and the available options on GitHub.
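The sf.write calls in the sample pass 22050 as the sample rate and "PCM_16" as the subtype, meaning 16-bit integer samples. As a rough illustration of what that conversion involves (this is not how soundfile implements it), the sketch below scales float samples in the range -1.0 to 1.0 to 16-bit integers using only the standard library. float_to_pcm16 and write_pcm16_wav are hypothetical helper names.

```python
import struct
import wave


def float_to_pcm16(samples):
    """Clip floats to [-1.0, 1.0] and scale them to signed 16-bit integers."""
    clipped = (max(-1.0, min(1.0, float(s))) for s in samples)
    return [int(round(s * 32767)) for s in clipped]


def write_pcm16_wav(path, samples, rate=22050):
    """Write mono float samples as a 16-bit PCM WAV file."""
    pcm = float_to_pcm16(samples)
    # WAV stores samples as little-endian signed 16-bit integers.
    frames = struct.pack("<{}h".format(len(pcm)), *pcm)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(frames)
```

At 22050 samples per second, an array of 44100 samples plays for two seconds, which is why the sample rate passed when writing has to match the rate the model generated.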
pyttsx3
pyttsx3 is a Python text-to-speech (TTS) library that works offline by driving the speech engine already installed on your system: SAPI5 on Windows, NSSpeechSynthesizer on macOS, and eSpeak on Linux. pyttsx3 is designed to be easy to use and provides options for controlling speech output, such as the speaking rate, volume, and voice.
It has 1.3k stars on GitHub.
To use pyttsx3, you will need to install the library using pip:
pip install pyttsx3
Once you have installed the library, you can use the pyttsx3.init function to create an instance of the text-to-speech converter. You can optionally pass the name of the TTS driver you want to use as an argument to the init function; without one, pyttsx3 picks the default driver for your platform. For example:
import pyttsx3
engine = pyttsx3.init()
Once you have an instance of the TTS engine, you can use the say method to generate speech from text. The say method takes the text you want to synthesize as an argument and queues it; runAndWait then processes the queue and blocks until the speech has finished. For example:
engine.say("Hello from a machine")
engine.runAndWait()
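To make the options for controlling speech output concrete, here is a small sketch that sets the speaking rate and volume before speaking, and optionally writes the audio to a file instead. It uses pyttsx3's setProperty and save_to_file methods. speak and clamp_volume are hypothetical helper names, and the default rate of 150 words per minute is just an assumed starting value.

```python
def clamp_volume(volume):
    """Keep volume inside the 0.0 to 1.0 range that pyttsx3 expects."""
    return max(0.0, min(1.0, float(volume)))


def speak(text, rate=150, volume=1.0, out_file=None):
    """Speak text aloud, or save it to out_file when a path is given."""
    # Imported here because pyttsx3 needs a platform speech driver to load.
    import pyttsx3

    engine = pyttsx3.init()
    engine.setProperty("rate", rate)  # speaking rate in words per minute
    engine.setProperty("volume", clamp_volume(volume))
    if out_file is None:
        engine.say(text)
    else:
        engine.save_to_file(text, out_file)
    engine.runAndWait()  # process the queued command and block until done
```

For example, speak("Hello from a machine", rate=120, volume=0.8) would read the sentence a little slower and quieter than the defaults.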
larynx
Larynx is a text-to-speech (TTS) library written in Python that runs neural voice models locally and offline, using gruut for phonemization together with a TTS model and a vocoder, rather than calling a cloud API.
To use Larynx, you will need to install the library using pip:
pip install larynx
Once you have installed the library, you can use the text_to_speech function to convert text to speech. It accepts many parameters:
text: str,
lang: str,
tts_model: typing.Union[TextToSpeechModel, Future],
vocoder_model: typing.Union[VocoderModel, Future],
audio_settings: AudioSettings,
number_converters: bool = False,
disable_currency: bool = False,
word_indexes: bool = False,
inline_pronunciations: bool = False,
phoneme_transform: typing.Optional[typing.Callable[[str], str]] = None,
text_lang: typing.Optional[str] = None,
phoneme_lang: typing.Optional[str] = None,
tts_settings: typing.Optional[typing.Dict[str, typing.Any]] = None,
vocoder_settings: typing.Optional[typing.Dict[str, typing.Any]] = None,
max_workers: typing.Optional[int] = 2,
executor: typing.Optional[Executor] = None,
phonemizer: typing.Optional[gruut.Phonemizer] = None,
You can save the speech to a file using the wavfile.write function. For example:
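from larynx import text_to_speech
from larynx import wavfile
import numpy as np
# params is a dict holding the arguments listed above (text, lang, models, settings, ...)
# Collect the results once: a generator that has been printed cannot be looped over again
text_and_audios = list(text_to_speech(**params))
print(text_and_audios)
audios = []
for _, audio in text_and_audios:
    audios.append(audio)
# rate must match the model's sample rate; 22050 Hz is an assumption here
wavfile.write(data=np.concatenate(audios), rate=22050, filename="a.wav")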
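One thing to watch for when working with results like these: if text_to_speech returns a lazy iterator (an assumption here, not something confirmed from the Larynx source), it can only be consumed once. The sketch below uses a stand-in generator, fake_text_to_speech, with plain lists in place of audio arrays, to show collecting the chunks once and reusing them.

```python
def fake_text_to_speech(sentences):
    """Stand-in generator yielding (text, audio) pairs; audio is a list of ints."""
    for sentence in sentences:
        yield sentence, [len(word) for word in sentence.split()]


# Consume the generator exactly once and keep the results in a list.
results = list(fake_text_to_speech(["Hello from", "a machine"]))

# The list can be read as many times as needed.
texts = [text for text, _ in results]
samples = [value for _, audio in results for value in audio]

# By contrast, a second pass over the same generator yields nothing.
gen = fake_text_to_speech(["Hello"])
first_pass = list(gen)
second_pass = list(gen)
```

Here second_pass ends up empty, which is the same situation as looping over a generator after it has already been printed with list().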