Speech synthesis is the production of human speech by artificial means, and reproducing the human voice has been one of the more popular pursuits throughout the history of computing. The basic idea dates back to the 18th century, when professor Christian Kratzenstein, working in Russia, built an apparatus modeled on the human vocal tract to demonstrate the physiological differences involved in producing the five long vowel sounds.
Today speech synthesis is used in many areas of life. Public transport stations and airports, for example, use the technology to generate announcements for passengers. It can also give a voice to people who cannot speak themselves, the most famous example being Stephen Hawking.
Introduction
Speech synthesis is a part of computer science, and there are two main types of implementation, software and hardware. Both types follow the same working logic:
- Text normalization: converting anything that is not plain words, such as numbers and abbreviations, into a written-word equivalent.
- Text-to-phoneme conversion: assigning a phonetic transcription to each word, and dividing and marking the text into prosodic units such as phrases and sentences.
- Sound generation: converting the symbolic representation into audible speech.
From the point of view of the end user listening to the speech, the difference between the two types of implementation does not matter. From a developer's point of view, the difference is obvious. A software implementation requires developing a solution hosted on a PC, with potentially many dependencies on the host OS; popular online text-to-speech sites are one example. With a hardware solution, the TextToSpeech click can be hosted by an MCU, and the processor on the click board does the heavy lifting.
Text to Speech Click
The TextToSpeech click board carries the S1V30120, a Speech Synthesis IC that provides a solution for adding Text To Speech (TTS) and ADPCM speech processing to a range of portable devices. The IC is powered by the Fonix DECtalk® v5 speech synthesis engine, which can make your robot or portable device talk in US English, Castilian Spanish, or Latin American Spanish, in one of nine pre-defined voices.
The S1V30120 contains all the required analog codecs, memory, and EPSON-supplied embedded algorithms. All applications are controlled over a single SPI interface, allowing control from a wide range of hosts. Our click board supports both 3.3 V and 5 V power supplies.
As a cost-effective solution, the click board doesn't carry its own MCU, so you need to be careful when selecting an MCU to use with it. Choose an MCU with at least 45 KB of flash. The reason is the initialization data, which must be uploaded to the click board every time it is powered on.
DECtalk is billed as the world's most intelligible TTS synthesizer with the most natural-sounding voice, and it has one of the smallest memory footprints in the industry for a full-featured, multi-language voice synthesizer, which makes it an excellent embedded solution.
Feel free to explore the links provided in the references; you will find several tutorials on how to make your robot's voice more natural, as well as ready-made songs that use DECtalk.
Library
In this section we will build a simple example and explain how to use our library. The library covers all functionalities of the S1V30120. The datasheet and documentation provided could be helpful if you want to explore the functionality of the library and the click board further. The library, packaged for our compilers, can be found on Libstock.
First of all, as many times before, we want to assign the proper pins to our click board, initialize the SPI bus for communication between the click board and the MCU, and call tts_init(), which must be executed before any other function.
One more thing that is not mandatory but can be helpful is assigning callbacks to the library. There are three callbacks that can be added. The first is executed every time we receive a message block response; this response is usually received when the system controller determines that there are insufficient system resources to service the request.
The second and third are more important, because they indicate errors. Errors from the device can be divided into two types: fatal and non-fatal. In the case of a fatal error, every further message sent to the device will trigger the error response again. The best solution when that kind of error occurs is a hardware reset of the device.
#include "text_to_speech.h"
#include "text_to_speech_img.h"

sbit TTS_RST  at GPIOC_ODR.B2;
sbit TTS_CS   at GPIOD_ODR.B13;
sbit TTS_MUTE at GPIOA_ODR.B4;
sbit TTS_RDY  at GPIOD_IDR.B10;

void msg_blk( uint16_t *req, uint16_t *err );
void fatal_err( uint16_t *err );
void system_init( void );

void msg_blk( uint16_t *req, uint16_t *err )
{

}

void fatal_err( uint16_t *err )
{

}

void system_init()
{
    GPIO_Digital_Output( &GPIOC_ODR, _GPIO_PINMASK_2 );
    GPIO_Digital_Output( &GPIOD_ODR, _GPIO_PINMASK_13 );
    GPIO_Digital_Output( &GPIOA_ODR, _GPIO_PINMASK_4 );
    GPIO_Digital_Input( &GPIOD_IDR, _GPIO_PINMASK_10 );
    Delay_ms( 200 );

    SPI3_Init_Advanced( _SPI_FPCLK_DIV128,
                        _SPI_MASTER | _SPI_8_BIT |
                        _SPI_CLK_IDLE_HIGH |
                        _SPI_SECOND_CLK_EDGE_TRANSITION |
                        _SPI_MSB_FIRST | _SPI_SS_DISABLE |
                        _SPI_SSM_ENABLE | _SPI_SSI_1,
                        &_GPIO_MODULE_SPI3_PC10_11_12 );
    Delay_ms( 200 );
}

void main()
{
    system_init();
    tts_init();
    tts_msg_block_callback( msg_blk );
    tts_fatal_err_callback( fatal_err );
}
The device expects a few rules to be followed while booting the firmware and entering the main mode. For this procedure we will create a function, tts_setup(), that takes care of uploading the initialization data and entering the main mode.
After power-on the device enters boot mode, in which the initialization data can be uploaded. In this mode, the first thing you should do is send a firmware version request, just to check that the device has actually entered boot mode. If the device responds positively, we can start uploading the initialization data.
The data upload is executed in transfer sequences, each at most BOOT_MESSAGE_MAX bytes in size. If you haven't noticed, a header file named text_to_speech_img.h is provided with this library; it contains the binary data that will be uploaded. If your MCU does not have enough flash space for this header, you can upload the data from an external EEPROM or flash chip instead. After uploading the init data we can test the device with tts_interface_test(); this test serves as confirmation of the upload process and switches the device to the main working mode.
Once we have entered the main working mode we can start configuring the device. The library has built-in functions for the default configuration, which we are going to use in this example.
void tts_setup()
{
    if ( !tts_version_boot( &version ) )
    {
        UART1_Write_Text( "Firmware version : " );
        UART1_Write_Text( version.hwver );
        UART1_Write_Text( "\r\n\n" );
    }

    tts_image_load( ( uint8_t* )TTS_INIT_DATA, sizeof( TTS_INIT_DATA ) );
    tts_image_exec();
    tts_interface_test();

    if( !tts_version_main( &version ) )
    {
        UART1_Write_Text( "Hardware version : " );
        UART1_Write_Text( version.hwver );
        UART1_Write_Text( "\r\n\n" );
        UART1_Write_Text( "Firmware version : " );
        UART1_Write_Text( version.fwver );
        UART1_Write_Text( "\r\n\n" );
    }

    tts_power_default_config();
    tts_audio_default_config();
    tts_volume_set( 0 );
    tts_default_config();
}
Alongside this default configuration there are functions that allow you to configure the device the way you want. The most important is tts_config, where you configure the most vital parameters of the device: language, speech rate, voice type, and Epson parser usage. Audio configuration has its own function for configuring the audio parameters. The power and codec configurations have no settable parameters, but it is recommended to write any needed values into those particular registers.
After the setup we can send our first string and hear the TTS for the first time. This is an example of how to execute a simple TTS conversion, so feel free to explore the library further.
#include "text_to_speech.h"
#include "text_to_speech_img.h"

sbit TTS_RST  at GPIOC_ODR.B2;
sbit TTS_CS   at GPIOD_ODR.B13;
sbit TTS_MUTE at GPIOA_ODR.B4;
sbit TTS_RDY  at GPIOD_IDR.B10;

void msg_blk( uint16_t *req, uint16_t *err );
void fatal_err( uint16_t *err );
void system_init( void );
void tts_setup( void );

void msg_blk( uint16_t *req, uint16_t *err )
{
    char txt[ 8 ];

    UART1_Write_Text( "< Request blocked >\r\n" );
    sprinti( txt, "%x\r\n", *req );
    UART1_Write_Text( txt );
    sprinti( txt, "%x\r\n", *err );
    UART1_Write_Text( txt );
}

void fatal_err( uint16_t *err )
{
    UART1_Write_Text( "< Fatal Error Detected >" );
    tts_init();
    tts_fatal_err_callback( fatal_err );
}

void system_init()
{
    GPIO_Digital_Output( &GPIOC_ODR, _GPIO_PINMASK_2 );
    GPIO_Digital_Output( &GPIOD_ODR, _GPIO_PINMASK_13 );
    GPIO_Digital_Output( &GPIOA_ODR, _GPIO_PINMASK_4 );
    GPIO_Digital_Input( &GPIOD_IDR, _GPIO_PINMASK_10 );
    Delay_ms( 200 );

    SPI3_Init_Advanced( _SPI_FPCLK_DIV128,
                        _SPI_MASTER | _SPI_8_BIT |
                        _SPI_CLK_IDLE_HIGH |
                        _SPI_SECOND_CLK_EDGE_TRANSITION |
                        _SPI_MSB_FIRST | _SPI_SS_DISABLE |
                        _SPI_SSM_ENABLE | _SPI_SSI_1,
                        &_GPIO_MODULE_SPI3_PC10_11_12 );
    Delay_ms( 200 );

    UART1_Init( 57600 );
    Delay_ms( 200 );
}

void tts_setup()
{
    if ( !tts_version_boot( &version ) )
    {
        UART1_Write_Text( "Firmware version : " );
        UART1_Write_Text( version.hwver );
        UART1_Write_Text( "\r\n\n" );
    }

    tts_image_load( ( uint8_t* )TTS_INIT_DATA, sizeof( TTS_INIT_DATA ) );
    tts_image_exec();
    tts_interface_test();

    if( !tts_version_main( &version ) )
    {
        UART1_Write_Text( "Hardware version : " );
        UART1_Write_Text( version.hwver );
        UART1_Write_Text( "\r\n\n" );
        UART1_Write_Text( "Firmware version : " );
        UART1_Write_Text( version.fwver );
        UART1_Write_Text( "\r\n\n" );
    }

    tts_power_default_config();
    tts_audio_default_config();
    tts_volume_set( 0 );
    tts_default_config();
}

void main()
{
    system_init();
    tts_init();
    tts_msg_block_callback( msg_blk );
    tts_fatal_err_callback( fatal_err );
    tts_setup();
    tts_speak( "Hello world" );
}
Example
After a successful test we can now implement something more complex than a linear application that just speaks text provided at compile time.
The idea is to make an application that speaks text provided through the serial port. For that kind of job, MikroC has tools and built-in functions in the UART library that can be very helpful. First, there is the UART Terminal, the tool we will use to type the text we want converted to speech. Second, the function UART_Read_Text has a delimiter argument, which lets us stop reading from the UART and pass the text on to the TextToSpeech click.
char tmp_txt[ 256 ];

void main()
{
    system_init();
    tts_init();
    tts_setup();

    while( 1 )
    {
        if( UART1_Data_Ready() )
        {
            UART1_Read_Text( tmp_txt, "\n", 255 );
            UART1_Write_Text( tmp_txt );
            tts_speak( tmp_txt );
        }
    }
}
Now you can compile the application, start the terminal via Tools > USART Terminal, and connect to the proper USB UART port. Don't forget to check the Append New Line option, because our application expects the newline character to stop reading from the UART.
Summary
This device has one more advantage: there are many tools to help you make your own dictionary for this click board, which you can find in the manufacturer's documentation. With our library and these tools you can make your robot say exactly what you wish in a very short amount of time. You will have more time to concentrate on building the brain for your HAL 9000; the solution for the voice is already here.
References
Libstock, Libraries for TextToSpeech click, 2016
Speech synthesis, Wikipedia, 2016
Christian Gottlieb Kratzenstein, Wikipedia, 2016
S1V30120, Epson Text to Speech, 2016
DECtalk Guide, 2016
Songs for DECtalk, 2016