TTS interface

This document describes how to use the Readit text-to-speech (TTS) interface.

The interface offers two ways to synthesize speech:

  1. Regular synthesis
    • Intended for producing shorter audio samples: at most 1000 characters of raw text or 1500 characters of SSML.
    • The audio data is returned directly in the response.
  2. Batch synthesis
    • Intended for producing longer audio samples.
    • Batch synthesis uses two requests. The first request submits the text and starts the synthesis in the background; depending on the input length, this can take several minutes. The second request retrieves the audio data once it is ready.
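
As a rough illustration of choosing between the two modes, the sketch below picks an endpoint from the input length. The function name `choose_synthesis_url` is hypothetical, and it assumes the service counts characters the same way Python's `len` does, which this document does not confirm.

```python
# Limits taken from this document: regular synthesis accepts at most
# 1000 characters of raw text or 1500 characters of SSML.
MAX_TEXT_CHARS = 1000
MAX_SSML_CHARS = 1500

REGULAR_URL = "https://api.aimater.com/tts/v1/synthesize"
BATCH_START_URL = "https://api.aimater.com/tts/v1/batch/synthesize"


def choose_synthesis_url(content: str, is_ssml: bool = False) -> str:
    """Return the regular endpoint if the input fits its limit, else the batch one.

    Assumption: len() counts characters the way the service does.
    """
    limit = MAX_SSML_CHARS if is_ssml else MAX_TEXT_CHARS
    return REGULAR_URL if len(content) <= limit else BATCH_START_URL
```

Input at or under the limit goes to regular synthesis, which returns the audio in a single round trip; anything longer goes to the batch endpoint.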

Requests

Regular Synthesis

HTTP Request

POST https://api.aimater.com/tts/v1/synthesize

Request Body

Contains the input text to be synthesized in a JSON object.

{
  "input": {
    "text": string,
    "ssml": string
  },
  "inputConfig": {
    "phonemized": boolean
  },
  "voice": {
    "languageCode": string,
    "name": string
  },
  "audioConfig": {
    "audioEncoding": string,
    "speakingRate": number,
    "volumeGainDb": number
  },
  "responseConfig": {
    "responseType": string,
    "includePhonemizedText": boolean
  },
  "auth": {
    "key": string
  }
}

Request fields

  • input: Input can be provided either as raw text or in SSML format, but not both.
    • text: string
      • Raw text to be synthesized. Maximum 1000 characters.
    • ssml: string
      • Input in SSML format. Maximum 1500 characters. See SSML.
  • inputConfig
    • phonemized: boolean
      • Whether the input text is already in phonemized form. Default is false.
      • Note: Not supported for English.
  • voice
    • languageCode: string
      • Language code for the material to be synthesized. Options are fi, sv-fi and en. Default is fi.
    • name: string
      • If you have a custom voice, you can switch to it by providing the voice name here.
  • audioConfig
    • audioEncoding: string
      • Encoding of the synthesized audio. Options are WAV/LINEAR16, MP3 and RAW (WAV without header). Default is WAV.
    • speakingRate: number
      • Speed multiplier for the synthesized audio, in the range [0.5, 2.0]. Default is 1.0.
    • volumeGainDb: number
      • Volume gain in decibels for the synthesized audio. Must be between -100 and 100. Default is 0.0.
  • responseConfig
    • responseType: string
      • Response type. Options are JSON and BINARY. Default is JSON.
    • includePhonemizedText: boolean
      • Whether to return the phonemized form of the text along with the audio. Works only when responseType is JSON. Default is false.
      • Note: Not supported for English.
  • auth
    • key: string
      • API authentication key. If missing or incorrect, HTTP code 401 is returned.
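
The field rules above can be checked on the client before sending a request. The following is a minimal sketch, not an official client: the function name `build_synthesis_request` and its parameter names are hypothetical, and the checks only mirror the limits stated in this document.

```python
from typing import Any, Dict, Optional


def build_synthesis_request(
    api_key: str,
    text: Optional[str] = None,
    ssml: Optional[str] = None,
    language_code: str = "fi",
    audio_encoding: str = "WAV",
    speaking_rate: float = 1.0,
    volume_gain_db: float = 0.0,
    response_type: str = "JSON",
) -> Dict[str, Any]:
    """Build a regular synthesis request body, rejecting out-of-range values early."""
    # Exactly one of text or ssml may be given.
    if (text is None) == (ssml is None):
        raise ValueError("exactly one of text or ssml must be given")
    if text is not None and len(text) > 1000:
        raise ValueError("text field maximum length is 1000 characters")
    if ssml is not None and len(ssml) > 1500:
        raise ValueError("ssml field maximum length is 1500 characters")
    if not 0.5 <= speaking_rate <= 2.0:
        raise ValueError("speakingRate has to be between 0.5 and 2.0")
    if not -100 <= volume_gain_db <= 100:
        raise ValueError("volumeGainDb must be between -100 and 100")

    return {
        "input": {"text": text} if text is not None else {"ssml": ssml},
        "voice": {"languageCode": language_code},
        "audioConfig": {
            "audioEncoding": audio_encoding,
            "speakingRate": speaking_rate,
            "volumeGainDb": volume_gain_db,
        },
        "responseConfig": {"responseType": response_type},
        "auth": {"key": api_key},
    }
```

Client-side checks like these only save a round trip; the server remains the authority and may still reject a request.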

Response Body

The response body depends on the requested response type. When the response type is binary, the body is the synthesized audio data itself.

If the type is JSON, the body looks like this:

{
  "audioContent": string,
  "phonemizedText": string // Present only if `includePhonemizedText` was set to true.
}

Response fields

  • audioContent: string
    • Base64-encoded audio data.
  • phonemizedText: string
    • Input text in phonemized form from which the audio was generated.
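
With the JSON response type, the audio has to be base64-decoded before it can be played or saved. A minimal sketch using only the Python standard library; the function name `extract_audio` is hypothetical.

```python
import base64
import json


def extract_audio(response_body: str) -> bytes:
    """Decode the base64 audioContent field of a JSON response into raw bytes."""
    payload = json.loads(response_body)
    return base64.b64decode(payload["audioContent"])
```

The returned bytes can be written to a file opened in binary mode, for example `open("out.wav", "wb")` when the requested encoding was WAV.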

Error Situations

In error situations, the response body is always JSON, even if the requested type was binary. It contains an error response:

{
  "error": string,
  "errorCode": string,
  "status": number
}

Error response fields

  • error: string
    • Error message explaining what is not allowed in the given parameters.
  • errorCode: string
    • Error code corresponding to the error message.
  • status: number
    • HTTP status of the error response.

Possible Errors

errorCode               status  error
empty-request-body      400     Invalid argument: empty request body.
invalid-audio-encoding  400     Invalid argument: AudioEncoding.
invalid-input           400     Invalid argument: only one of text or ssml allowed.
invalid-language        400     Invalid argument: invalid language given.
invalid-request-body    400     Invalid argument: invalid JSON.
invalid-response-type   400     Invalid argument: responseType must be json or binary.
invalid-speaking-rate   400     Invalid argument: speakingRate has to be between 0.5 and 2.0.
invalid-ssml            400     Invalid argument: ssml field maximum length is 1500 characters.
invalid-text            400     Invalid argument: text field maximum length is 1000 characters.
invalid-voice           400     Invalid argument: The given voice is not allowed.
invalid-volume-gain-db  400     Invalid argument: volumeGainDb must be between -100 and 100.
missing-input           400     Invalid argument: text or ssml must be specified.
invalid-api-key         401     Invalid API key.
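
Because every error shares the same JSON shape, a client can turn error bodies into exceptions in one place. The sketch below is one possible approach; the class `TtsApiError` and the function `parse_error` are hypothetical names, not part of the interface.

```python
import json
from typing import Optional


class TtsApiError(Exception):
    """Carries the fields of an error response body."""

    def __init__(self, error_code: str, status: int, message: str,
                 details: Optional[str] = None) -> None:
        super().__init__(f"{status} {error_code}: {message}")
        self.error_code = error_code
        self.status = status
        self.message = message
        self.details = details


def parse_error(response_body: str) -> TtsApiError:
    """Build a TtsApiError from an error response body."""
    payload = json.loads(response_body)
    return TtsApiError(
        error_code=payload["errorCode"],
        status=payload["status"],
        message=payload["error"],
        # "details" is only documented for the batch fetch endpoint.
        details=payload.get("details"),
    )
```

Callers can then branch on `error_code` (for example `invalid-api-key`) rather than matching on the human-readable message.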

Batch Synthesis Start

HTTP Request

POST https://api.aimater.com/tts/v1/batch/synthesize

Request Body

Contains the input text to be synthesized in a JSON object.

{
  "input": {
    "text": string,
    "ssml": string
  },
  "voice": {
    "languageCode": string,
    "name": string
  },
  "audioConfig": {
    "audioEncoding": string,
    "speakingRate": number,
    "volumeGainDb": number
  },
  "responseConfig": {
    "responseType": string
  },
  "auth": {
    "key": string
  }
}

Request fields

  • input: Input can be provided either as raw text or in SSML format, but not both.
    • text: string
      • Raw text to be synthesized.
    • ssml: string
      • Input in SSML format. See SSML.
  • voice
    • languageCode: string
      • Language code for the material to be synthesized. Options are fi, sv-fi and en. Default is fi.
    • name: string
      • If you have a custom voice, you can switch to it by providing the voice name here.
  • audioConfig
    • audioEncoding: string
      • Encoding of the synthesized audio. Options are WAV/LINEAR16, MP3 and RAW (WAV without header). Default is WAV.
    • speakingRate: number
      • Speed multiplier for the synthesized audio, in the range [0.5, 2.0]. Default is 1.0.
    • volumeGainDb: number
      • Volume gain in decibels for the synthesized audio. Must be between -100 and 100. Default is 0.0.
  • responseConfig
    • responseType: string
      • Response type. Options are JSON and BINARY. Default is JSON.
  • auth
    • key: string
      • API authentication key. If missing or incorrect, HTTP code 401 is returned.

Response Body

If the request is successful, the response body contains JSON data formatted as follows:

{
  "audioId": string
}

Response fields

  • audioId: string
    • UUID identifier used to retrieve the audio data.

Error Situations

In error situations, the response body contains an error message:

{
  "error": string,
  "errorCode": string,
  "status": number
}

Error response fields

  • error: string
    • Error message explaining what is not allowed in the given parameters.
  • errorCode: string
    • Error code corresponding to the error message.
  • status: number
    • HTTP status of the error response.

Possible Errors

errorCode               status  error
empty-request-body      400     Invalid argument: empty request body.
invalid-audio-encoding  400     Invalid argument: AudioEncoding.
invalid-input           400     Invalid argument: only one of text or ssml allowed.
invalid-language        400     Invalid argument: invalid language given.
invalid-request-body    400     Invalid argument: invalid JSON.
invalid-response-type   400     Invalid argument: responseType must be json or binary.
invalid-speaking-rate   400     Invalid argument: speakingRate has to be between 0.5 and 2.0.
invalid-voice           400     Invalid argument: The given voice is not allowed.
invalid-volume-gain-db  400     Invalid argument: volumeGainDb must be between -100 and 100.
missing-input           400     Invalid argument: text or ssml must be specified.
invalid-api-key         401     Invalid API key.

Retrieving Completed Batch Synthesis Audio

HTTP Request

POST https://api.aimater.com/tts/v1/batch/fetch

Request Body

Contains the UUID identifier corresponding to the synthesized audio data in a JSON object.

{
  "input": {
    "audioId": string
  },
  "auth": {
    "key": string
  }
}

Request fields

  • input
    • audioId: string
      • The UUID identifier (audioId) returned by the batch synthesis start request.
  • auth
    • key: string
      • API authentication key. If missing or incorrect, HTTP code 401 is returned.

Response Body

The response body depends on the response type requested when the batch synthesis was started. When the response type is binary, the body is the synthesized audio data itself.

If the type is JSON, the body looks like this:

{
  "audioContent": string
}

Response fields

  • audioContent: string
    • Base64-encoded audio data.

Error Situations

In error situations, the response body is always JSON, even if the requested type was binary. If synthesis is still in progress or has failed, the response body contains an error message:

{
  "error": string,
  "errorCode": string,
  "status": number,
  "details": string | null
}

Error response fields

  • error: string
    • Error message. This changes depending on whether synthesis is still in progress or has failed.
  • errorCode: string
    • Error code corresponding to the error message.
  • status: number
    • HTTP status of the error response.
  • details: string | null
    • May contain a more detailed error description if available.

The response status codes directly indicate the synthesis status:

  • Complete: 200
  • In Progress: 202
  • Failed: 400

Possible Errors

errorCode                status  error
audio-not-ready          202     Audio generation for the given UUID is not ready yet.
audio-generation-failed  400     Audio generation for the given UUID has failed.
empty-request-body       400     Invalid argument: empty request body.
invalid-request-body     400     Invalid argument: invalid JSON.
invalid-uuid             400     Invalid UUID or the UUID is not allowed for the given API key.
invalid-api-key          401     Invalid API key.
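
Since status 202 means the audio is not ready yet, a client normally polls the fetch endpoint until it gets 200 or an error. The sketch below shows one way to structure that loop. The name `wait_for_audio`, the 5-second interval, and the 60-attempt cap are assumptions rather than documented values, and `fetch_once` stands for any caller-supplied function that performs one fetch request and returns the HTTP status together with the parsed JSON body (so it assumes the JSON response type).

```python
import time
from typing import Any, Callable, Dict, Tuple


def wait_for_audio(
    fetch_once: Callable[[], Tuple[int, Dict[str, Any]]],
    poll_interval: float = 5.0,
    max_attempts: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll until the batch audio is complete, then return the response payload."""
    for _ in range(max_attempts):
        status, payload = fetch_once()
        if status == 200:
            # Complete: payload holds audioContent.
            return payload
        if status != 202:
            # Failed or rejected: surface the documented error fields.
            raise RuntimeError(
                f"{status} {payload.get('errorCode')}: {payload.get('error')}"
            )
        # In progress: wait before asking again.
        sleep(poll_interval)
    raise TimeoutError(f"audio not ready after {max_attempts} attempts")
```

Passing `sleep` as a parameter keeps the loop easy to test without real delays; in production the default `time.sleep` applies.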

Fetching Available Voices

You can fetch all available voices for your project by calling this endpoint with your API key.

HTTP Request

GET https://api.aimater.com/tts/v2/voices/<api_key>

Replace <api_key> with your API key.

Response Body

{
    "voices": [
        {
            "background": string | null, // Deprecated
            "color": string,
            "languageCode": string,
            "name": string
        },
        ...
    ]
}

Response fields

  • voices
    • List of available voices.
    • These objects can be used in the voice field of synthesis requests.
    • The objects have the following fields:
      • background: string | null
        • Voice background as a CSS background property value.
        • Not used actively anymore. May be removed in future versions.
      • color: string
        • Voice color.
        • Used in Readit's products to distinguish voices from each other.
      • languageCode: string
        • Voice language code.
      • name: string
        • Voice name.
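
To use a fetched voice in a synthesis request, only the languageCode and name fields are needed. The sketch below filters a voices response by language and returns objects ready for the voice field. The function name `voices_for_language` is hypothetical.

```python
import json
from typing import Dict, List


def voices_for_language(response_body: str, language_code: str) -> List[Dict[str, str]]:
    """Return voice objects for one language, keeping only the fields a request uses."""
    voices = json.loads(response_body)["voices"]
    return [
        {"languageCode": voice["languageCode"], "name": voice["name"]}
        for voice in voices
        if voice["languageCode"] == language_code
    ]
```

Each returned object can be placed directly in the voice field of a regular or batch synthesis request.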

Other

General Notes

  • A single sentence in the input must not exceed 1000 characters.

SSML input

SSML input must be enclosed within <speak> tags.

<speak>Test.</speak>

Supported SSML Tags

Break

<break>

Attributes:

  • time
    • Defines the pause duration in either milliseconds (ms) or seconds (s).

Example:

<speak>One second pause.<break time="1s" />Half second pause.<break time="500ms" /></speak>
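
When SSML is assembled from arbitrary text, characters such as & and < need escaping so the markup stays well-formed. The following sketch builds a `<speak>` document with optional pauses. The helper names `pause` and `speak` are hypothetical, and the duration check (whole numbers followed by ms or s) is a conservative assumption, since this document does not say whether other forms are accepted.

```python
import re
from typing import List, Optional, Tuple
from xml.sax.saxutils import escape

# Assumption: durations are a whole number followed by "ms" or "s".
_DURATION = re.compile(r"\d+(ms|s)")


def pause(duration: str) -> str:
    """Return a <break> tag for the given duration, e.g. "1s" or "500ms"."""
    if not _DURATION.fullmatch(duration):
        raise ValueError(f"unsupported break duration: {duration!r}")
    return f'<break time="{duration}" />'


def speak(segments: List[Tuple[str, Optional[str]]]) -> str:
    """Wrap (text, pause-after-or-None) segments in <speak> tags, escaping the text."""
    body = "".join(
        escape(text) + (pause(duration) if duration else "")
        for text, duration in segments
    )
    return f"<speak>{body}</speak>"
```

For example, `speak([("One second pause.", "1s"), ("Half second pause.", "500ms")])` reproduces the SSML example shown above.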

Examples

Regular Synthesis

Input

{
  "input": {
    "text": "Tämä on testi."
  },
  "voice": {
    "languageCode": "fi"
  },
  "responseConfig": {
    "includePhonemizedText": true
  },
  "auth": {
    "key": "12345678-9abc-def1-2345-6789abcdef12"
  }
}

Output

{
  "audioContent": "UklGRlYHAQBXQVZFZm10IBAAAAABAAEAwF0AAIC7AAACABAAZGF0YTIHAQAAAAAAAAA...",
  "phonemizedText": "Tämä on testi."
}
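
For reference, the request in this example could be sent with Python's standard library roughly as follows. This is a sketch under assumptions: the helper names `make_post_request` and `send` are hypothetical, the Content-Type header is not specified by this document and is assumed to be application/json, and the JSON response type is assumed so that both success and error bodies can be parsed as JSON.

```python
import json
import urllib.error
import urllib.request
from typing import Any, Dict, Tuple

SYNTHESIZE_URL = "https://api.aimater.com/tts/v1/synthesize"


def make_post_request(url: str, body: Dict[str, Any]) -> urllib.request.Request:
    """Prepare a POST request carrying the body as UTF-8 encoded JSON."""
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        # Assumption: the service expects a JSON content type.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> Tuple[int, Dict[str, Any]]:
    """Send the request and return (HTTP status, parsed JSON body)."""
    try:
        with urllib.request.urlopen(request) as response:
            return response.status, json.loads(response.read())
    except urllib.error.HTTPError as err:
        # 4xx responses raise HTTPError; their body is the JSON error response.
        return err.code, json.loads(err.read())
```

Calling `send(make_post_request(SYNTHESIZE_URL, body))` with the input shown above would return the status code and the parsed output object.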

Regular Synthesis with SSML input

Input

{
  "input": {
    "ssml": "<speak>Tämä on testi. <break time=\"1s\" /></speak>"
  },
  "voice": {
    "languageCode": "fi"
  },
  "auth": {
    "key": "12345678-9abc-def1-2345-6789abcdef12"
  }
}

Output

{
  "audioContent": "UklGRlYHAQBXQVZFZm10IBAAAAABAAEAwF0AAIC7AAACABAAZGF0YTIHAQAAAAAAAAA..."
}

Batch Synthesis Start

Input

{
  "input": {
    "text": "This is a long text that is more than 1000 characters..."
  },
  "voice": {
    "languageCode": "en"
  },
  "auth": {
    "key": "12345678-9abc-def1-2345-6789abcdef12"
  }
}

Output

{
  "audioId": "e8cb3912-543b-43a2-89c5-48b0867c54dd"
}

Retrieving Completed Batch Synthesis Audio

Input

{
  "input": {
    "audioId": "e8cb3912-543b-43a2-89c5-48b0867c54dd"
  },
  "auth": {
    "key": "12345678-9abc-def1-2345-6789abcdef12"
  }
}

Output

{
  "audioContent": "UklGRlYHAQBXQVZFZm10IBAAAAABAAEAwF0AAIC7AAACABAAZGF0YTIHAQAAAAAAAAA..."
}