aix.audio

Audio operations interface for AIX.

Provides text-to-speech (TTS) and speech-to-text (transcription) capabilities.

Examples

Text to speech:

>>> from aix.audio import text_to_speech
>>> audio = text_to_speech("Hello, world!")  # doctest: +SKIP
>>> audio.save("hello.mp3")  # doctest: +SKIP

Transcription:

>>> from aix.audio import transcribe
>>> text = transcribe("speech.mp3")  # doctest: +SKIP
>>> print(text)  # doctest: +SKIP
Hello, this is a test recording.

class aix.audio.GeneratedAudio(data: bytes, model: str = None, text: str = None, voice: str = None, format: str = 'mp3')[source]

Wrapper for generated audio.

Provides convenient access to audio data and saving.

Examples

>>> audio = GeneratedAudio(data=b'...', model="tts-1")
>>> audio.save("output.mp3")
>>> data = audio.as_bytes()

as_bytes() → bytes[source]

Get audio as bytes.

Returns: Audio data as bytes

play()[source]

Play the audio.

Requires a system audio player or library like pygame/pyaudio.

Examples

>>> audio.play()
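
Because playback depends on what is installed, it can help to check for a player before calling play(). The following is a minimal sketch, not the aix implementation: the helper name find_system_player and the candidate command list are assumptions chosen for illustration, and which players aix actually tries is not documented here.

```python
import shutil
from typing import Optional, Sequence

# Assumed candidate commands; the players aix really looks for may differ.
CANDIDATE_PLAYERS = ("ffplay", "afplay", "mpg123", "aplay")


def find_system_player(
    candidates: Sequence[str] = CANDIDATE_PLAYERS,
) -> Optional[str]:
    """Return the full path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None
```

If the helper returns None, saving the audio to a file and opening it manually is a reasonable fallback.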

save(path: str | Path)[source]

Save audio to file.

Parameters: path – Output file path

Examples

>>> audio.save("output.mp3")
>>> audio.save("speech.wav")
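
The save / as_bytes round trip can be sketched with a hypothetical stand-in class. AudioSketch below is not the real GeneratedAudio, whose internals are not shown in this reference; it assumes that saving writes the raw bytes verbatim, with no transcoding.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional, Union


@dataclass
class AudioSketch:
    """Hypothetical stand-in for GeneratedAudio; not the aix implementation."""

    data: bytes
    model: Optional[str] = None
    format: str = "mp3"

    def as_bytes(self) -> bytes:
        # Return the raw audio payload unchanged.
        return self.data

    def save(self, path: Union[str, Path]) -> Path:
        # Assumption: the bytes are written as-is, so the file extension
        # does not change the encoding of the audio.
        target = Path(path)
        target.write_bytes(self.data)
        return target
```

If the same assumption holds for the real class, audio.save("speech.wav") would produce a file whose contents are still in the format requested via response_format, so matching the extension to that format is the safer choice.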

class aix.audio.TranscriptionResult(text: str, language: str = None, duration: float = None, segments: list = None, model: str = None)[source]

Result of audio transcription.

Contains the transcribed text and optional metadata like segments and timestamps.

Examples

>>> result = TranscriptionResult(text="Hello world")
>>> print(result.text)
Hello world

aix.audio.text_to_speech(text: str, *, model: str = None, voice: str = None, speed: float = None, response_format: str = 'mp3', api_key: str = None, **kwargs) → GeneratedAudio[source]

Convert text to speech audio.

Parameters:

text – Text to convert to speech
model – TTS model to use (e.g., 'tts-1', 'tts-1-hd')
voice – Voice to use ('alloy', 'echo', 'fable', 'onyx', 'nova', 'shimmer')
speed – Playback speed (0.25 to 4.0)
response_format – Audio format ('mp3', 'opus', 'aac', 'flac')
api_key – API key for the provider (optional)
**kwargs – Additional provider-specific parameters

Returns:

GeneratedAudio object

Raises:

ImportError – If LiteLLM is not installed

Examples

>>> from aix.audio import text_to_speech
>>> audio = text_to_speech("Hello, how are you?")
>>> audio.save("greeting.mp3")

>>> # Different voice and speed
>>> audio = text_to_speech(
...     "This is a test.",
...     voice="nova",
...     speed=1.2
... )

>>> # High quality
>>> audio = text_to_speech(
...     "Important announcement",
...     model="tts-1-hd",
...     voice="onyx"
... )
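
The documented option ranges can be checked before making a request. This is a minimal sketch under stated assumptions: check_tts_options is a hypothetical helper, not part of aix, and the allowed values are copied from the parameter list above. A given provider or model may accept other voices or formats, and whether aix validates these itself is not stated here.

```python
from typing import Optional

# Values copied from the parameter list; a provider may accept others.
KNOWN_VOICES = {"alloy", "echo", "fable", "onyx", "nova", "shimmer"}
KNOWN_FORMATS = {"mp3", "opus", "aac", "flac"}
MIN_SPEED, MAX_SPEED = 0.25, 4.0


def check_tts_options(
    voice: Optional[str] = None,
    speed: Optional[float] = None,
    response_format: str = "mp3",
) -> None:
    """Raise ValueError for options outside the documented ranges."""
    if voice is not None and voice not in KNOWN_VOICES:
        raise ValueError(f"unknown voice: {voice!r}")
    if speed is not None and not (MIN_SPEED <= speed <= MAX_SPEED):
        raise ValueError(
            f"speed must be between {MIN_SPEED} and {MAX_SPEED}, got {speed}"
        )
    if response_format not in KNOWN_FORMATS:
        raise ValueError(f"unknown response_format: {response_format!r}")
```

Calling check_tts_options(voice="nova", speed=1.2) passes silently, while speed=5.0 raises ValueError before any network call is attempted.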

aix.audio.transcribe(audio: str | Path | BinaryIO | bytes, *, engine: str = None, model: str = None, language: str = None, prompt: str = None, response_format: str = 'text', temperature: float = None, timestamp_granularities: list[str] = None, api_key: str = None, **kwargs) → str | TranscriptionResult[source]

Transcribe audio to text.

By default, this routes through LiteLLM (OpenAI-style transcription). Pass engine= to delegate to a scribed backend instead. scribed is a single façade over many ASR engines (local Whisper / faster-whisper / vosk, or cloud Deepgram / AssemblyAI / Groq / ElevenLabs / Google …) and supports speaker diarization and SRT/VTT output. The return type is the same on both paths, so existing callers are unaffected.

Parameters:

audio – Audio file path, file object, or bytes.
engine – Optional scribed backend id (e.g. "faster-whisper", "deepgram"). When given, transcription is delegated to scribed (which resolves that engine’s own credentials); the LiteLLM path is bypassed. Requires pip install 'aix[scribed]'. See scribed.list_backends().
model – Transcription model (e.g. 'whisper-1'); for a scribed engine, the engine-specific model (e.g. a Whisper size).
language – Source language (ISO-639-1 code, e.g. 'en', 'es').
prompt – Optional text to guide the model’s style (LiteLLM path).
response_format – 'text' (default) → str; 'srt'/'vtt' → subtitle str (scribed path); else → TranscriptionResult.
temperature – Sampling temperature (LiteLLM path).
timestamp_granularities – Timestamp types ('word', 'segment') (LiteLLM path).
api_key – API key for the provider (optional).
**kwargs – Additional parameters (forwarded to LiteLLM, or to the scribed backend, e.g. diarize=True).

Returns:

str for response_format in {text, srt, vtt}, else a TranscriptionResult.

Examples

>>> from aix.audio import transcribe
>>> text = transcribe("recording.mp3")
>>> # delegate to a scribed engine (local, free, diarized SRT):
>>> srt = transcribe(
...     "meeting.wav", engine="faster-whisper", response_format="srt"
... )
>>> dg = transcribe(
...     "call.mp3", engine="deepgram", diarize=True,
...     response_format="verbose_json",
... )
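
Because the return type depends on response_format, calling code that only wants the transcript may need to handle both shapes. The sketch below assumes only what is documented above: the result is either a str or an object with a .text attribute. The helper as_text is hypothetical, and SimpleNamespace stands in for a TranscriptionResult so the example runs without aix installed.

```python
from types import SimpleNamespace
from typing import Any


def as_text(result: Any) -> str:
    """Return the text from either a str or a TranscriptionResult-like object."""
    if isinstance(result, str):
        return result
    # Assumption: any non-str result exposes the transcript on .text,
    # as TranscriptionResult does.
    return result.text


# Stand-in for a TranscriptionResult; not the real class.
fake_result = SimpleNamespace(text="Hello world", segments=[])
print(as_text("Hello world"))
print(as_text(fake_result))
```

Note that for 'srt' and 'vtt' the returned str is subtitle markup rather than a plain transcript, so this helper passes it through unchanged.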

aix.audio.transcribe_with_timestamps(audio: str | Path | BinaryIO | bytes, *, granularity: str = 'segment', model: str = None, **kwargs) → TranscriptionResult[source]

Transcribe audio with detailed timestamps.

Parameters:

audio – Audio file path, file object, or bytes
granularity – Timestamp granularity ('word' or 'segment')
model – Transcription model
**kwargs – Additional parameters for transcribe()

Returns:

TranscriptionResult with detailed segments

Examples

>>> from aix.audio import transcribe_with_timestamps
>>> result = transcribe_with_timestamps("lecture.mp3")
>>> for segment in result.segments:
...     start = segment['start']
...     end = segment['end']
...     text = segment['text']
...     print(f"[{start:.2f}-{end:.2f}] {text}")
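
The same segment fields can be turned into subtitle text by hand. This is a sketch, assuming each segment is a mapping with 'start' and 'end' in seconds and a 'text' string, as in the loop above. The helpers srt_timestamp and segments_to_srt are hypothetical and are not part of aix; output from the scribed 'srt' path may differ in detail.

```python
from typing import Iterable, Mapping


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: Iterable[Mapping]) -> str:
    """Build SRT text from segments with 'start', 'end', and 'text' keys."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        # Each cue is: sequence number, time range, then the caption text.
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    # Cues are separated by a blank line.
    return "\n".join(blocks)
```

With a result from transcribe_with_timestamps, segments_to_srt(result.segments) would give text that can be written to a .srt file.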

aix.audio.translate_audio(audio: str | Path | BinaryIO | bytes, *, model: str = None, prompt: str = None, api_key: str = None, **kwargs) → str[source]

Translate audio from any language to English.

Note: This currently relies on Whisper's translation capability, which translates only into English.

Parameters:

audio – Audio file path, file object, or bytes
model – Translation model (typically 'whisper-1')
prompt – Optional text to guide translation
api_key – API key for the provider (optional)
**kwargs – Additional provider-specific parameters

Returns:

Translated text in English

Examples

>>> from aix.audio import translate_audio
>>> english_text = translate_audio("spanish_audio.mp3")
>>> print(english_text)
This is the English translation.
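
Like transcribe(), this function accepts audio as a path, a file object, or bytes. A caller that wants to inspect the payload first (for example to check its size) could normalize the three forms as sketched below. read_audio_bytes is a hypothetical helper written for this example; it does not describe how aix handles the argument internally.

```python
from pathlib import Path
from typing import BinaryIO, Union

AudioInput = Union[str, Path, BinaryIO, bytes]


def read_audio_bytes(audio: AudioInput) -> bytes:
    """Normalize a path, an open binary file, or raw bytes to bytes."""
    if isinstance(audio, (bytes, bytearray)):
        return bytes(audio)
    if isinstance(audio, (str, Path)):
        # Strings are treated as file paths, never as audio content.
        return Path(audio).read_bytes()
    # Anything else is assumed to be a binary file object; reading starts
    # at its current position.
    return audio.read()
```

The resulting bytes can then be passed to translate_audio() or transcribe() directly, since both accept bytes.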