aix.audio
Audio operations interface for AIX.
Provides text-to-speech (TTS) and speech-to-text (transcription) capabilities.
Examples
Text to speech:
>>> from aix.audio import text_to_speech
>>> audio = text_to_speech("Hello, world!")  # doctest: +SKIP
>>> audio.save("hello.mp3")  # doctest: +SKIP
Transcription:
>>> from aix.audio import transcribe
>>> text = transcribe("speech.mp3")  # doctest: +SKIP
>>> print(text)  # doctest: +SKIP
Hello, this is a test recording.
- class aix.audio.GeneratedAudio(data: bytes, model: str = None, text: str = None, voice: str = None, format: str = 'mp3')
Wrapper for generated audio.
Provides convenient access to audio data and saving.
Examples
>>> audio = GeneratedAudio(data=b'...', model="tts-1")
>>> audio.save("output.mp3")
>>> data = audio.as_bytes()
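The wrapper's behaviour can be pictured with a small stand-in. The sketch below is not the aix implementation: AudioBlob is a hypothetical class that mirrors only the documented constructor fields and the save() and as_bytes() methods used above, and it assumes that saving simply writes the raw bytes to the given path.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional, Union


@dataclass
class AudioBlob:
    """Hypothetical stand-in mirroring the documented GeneratedAudio fields."""

    data: bytes
    model: Optional[str] = None
    text: Optional[str] = None
    voice: Optional[str] = None
    format: str = "mp3"

    def as_bytes(self) -> bytes:
        # Return the raw audio payload unchanged.
        return self.data

    def save(self, path: Union[str, Path]) -> Path:
        # Assumption: saving means writing the raw bytes to disk as-is.
        target = Path(path)
        target.write_bytes(self.data)
        return target
```

Keeping the metadata (model, voice, source text) next to the bytes lets a caller log or cache a result without tracking those values separately.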
- class aix.audio.TranscriptionResult(text: str, language: str = None, duration: float = None, segments: list = None, model: str = None)
Result of audio transcription.
Contains the transcribed text and optional metadata like segments and timestamps.
Examples
>>> result = TranscriptionResult(text="Hello world")
>>> print(result.text)
Hello world
- aix.audio.text_to_speech(text: str, *, model: str = None, voice: str = None, speed: float = None, response_format: str = 'mp3', api_key: str = None, **kwargs) → GeneratedAudio
Convert text to speech audio.
- Parameters:
text – Text to convert to speech
model – TTS model to use (e.g., 'tts-1', 'tts-1-hd')
voice – Voice to use ('alloy', 'echo', 'fable', 'onyx', 'nova', 'shimmer')
speed – Playback speed (0.25 to 4.0)
response_format – Audio format ('mp3', 'opus', 'aac', 'flac')
**kwargs – Additional provider-specific parameters
- Returns:
GeneratedAudio object
- Raises:
ImportError – If LiteLLM is not installed
Examples
>>> from aix.audio import text_to_speech
>>> audio = text_to_speech("Hello, how are you?")
>>> audio.save("greeting.mp3")
>>> # Different voice and speed
>>> audio = text_to_speech(
...     "This is a test.",
...     voice="nova",
...     speed=1.2
... )
>>> # High quality
>>> audio = text_to_speech(
...     "Important announcement",
...     model="tts-1-hd",
...     voice="onyx"
... )
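The documented values can be checked before a request is sent. The helper below is a hypothetical pre-flight check, not part of aix. It encodes only the voice names, formats, and speed range listed in the parameter descriptions, and a given provider may well accept other values.

```python
from typing import List, Optional

# Values copied from the parameter descriptions; providers may accept others.
VOICES = {"alloy", "echo", "fable", "onyx", "nova", "shimmer"}
FORMATS = {"mp3", "opus", "aac", "flac"}
MIN_SPEED, MAX_SPEED = 0.25, 4.0


def check_tts_args(
    text: str,
    voice: Optional[str] = None,
    speed: Optional[float] = None,
    response_format: str = "mp3",
) -> List[str]:
    """Return a list of problems; an empty list means the arguments look fine."""
    problems = []
    if not text.strip():
        problems.append("text is empty")
    if voice is not None and voice not in VOICES:
        problems.append(f"unknown voice: {voice!r}")
    if speed is not None and not (MIN_SPEED <= speed <= MAX_SPEED):
        problems.append(f"speed {speed} outside {MIN_SPEED}-{MAX_SPEED}")
    if response_format not in FORMATS:
        problems.append(f"unknown format: {response_format!r}")
    return problems
```

Returning a list rather than raising on the first failure lets a caller report every problem at once, which tends to be friendlier in batch jobs.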
- aix.audio.transcribe(audio: str | Path | BinaryIO | bytes, *, engine: str = None, model: str = None, language: str = None, prompt: str = None, response_format: str = 'text', temperature: float = None, timestamp_granularities: list[str] = None, api_key: str = None, **kwargs) → str | TranscriptionResult
Transcribe audio to text.
By default this routes through LiteLLM (OpenAI-style transcription). Pass engine= to delegate instead to a scribed backend: one façade over many ASR engines (local Whisper / faster-whisper / vosk, or cloud Deepgram / AssemblyAI / Groq / ElevenLabs / Google …) with speaker diarization and SRT/VTT output. The return type is unchanged either way, so existing callers are unaffected.
- Parameters:
audio – Audio file path, file object, or bytes.
engine – Optional scribed backend id (e.g. "faster-whisper", "deepgram"). When given, transcription is delegated to scribed (which resolves that engine's own credentials); the LiteLLM path is bypassed. Requires pip install 'aix[scribed]'. See scribed.list_backends().
model – Transcription model (e.g. 'whisper-1'); for a scribed engine, the engine-specific model (e.g. a Whisper size).
language – Source language (ISO-639-1 code, e.g. 'en', 'es').
prompt – Optional text to guide the model's style (LiteLLM path).
response_format – 'text' (default) → str; 'srt' / 'vtt' → subtitle str (scribed path); else → TranscriptionResult.
temperature – Sampling temperature (LiteLLM path).
timestamp_granularities – Timestamp types ('word', 'segment') (LiteLLM path).
**kwargs – Additional parameters (forwarded to LiteLLM, or to the scribed backend, e.g. diarize=True).
- Returns:
str for response_format in {text, srt, vtt}, else a TranscriptionResult.
Examples
>>> from aix.audio import transcribe
>>> text = transcribe("recording.mp3")
>>> # delegate to a scribed engine (local, free, diarized SRT):
>>> srt = transcribe(
...     "meeting.wav", engine="faster-whisper", response_format="srt"
... )
>>> dg = transcribe(
...     "call.mp3", engine="deepgram", diarize=True,
...     response_format="verbose_json",
... )
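Because the return type depends on response_format, calling code sometimes wants a single accessor that works for both cases. The following is a minimal sketch that assumes only what is documented for this function: the result is either a plain str or an object exposing a text attribute. as_text is a hypothetical helper and not part of aix.

```python
from typing import Any


def as_text(result: Any) -> str:
    """Return the transcript text from either documented return shape."""
    if isinstance(result, str):
        # 'text', 'srt' and 'vtt' responses are already plain strings.
        return result
    text = getattr(result, "text", None)
    if isinstance(text, str):
        # TranscriptionResult-like objects carry the transcript in .text.
        return text
    raise TypeError(f"unsupported transcription result: {type(result).__name__}")
```

Note that for 'srt' and 'vtt' the returned string is subtitle markup rather than a bare transcript, so this helper passes it through untouched.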
- aix.audio.transcribe_with_timestamps(audio: str | Path | BinaryIO | bytes, *, granularity: str = 'segment', model: str = None, **kwargs) → TranscriptionResult
Transcribe audio with detailed timestamps.
- Parameters:
audio – Audio file path, file object, or bytes
granularity – Timestamp granularity (‘word’ or ‘segment’)
model – Transcription model
**kwargs – Additional parameters for transcribe()
- Returns:
TranscriptionResult with detailed segments
Examples
>>> from aix.audio import transcribe_with_timestamps
>>> result = transcribe_with_timestamps("lecture.mp3")
>>> for segment in result.segments:
...     start = segment['start']
...     end = segment['end']
...     text = segment['text']
...     print(f"[{start:.2f}-{end:.2f}] {text}")
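The loop in this example can be wrapped in a small reusable formatter. The sketch assumes each segment is a mapping with 'start', 'end', and 'text' keys, as the example suggests; format_timestamp and format_segments are hypothetical helpers, not part of aix.

```python
from typing import Iterable, Mapping


def format_timestamp(seconds: float) -> str:
    """Render seconds as MM:SS.cc, rounding to the nearest centisecond."""
    total_cs = round(seconds * 100)
    minutes, rem_cs = divmod(total_cs, 6000)
    secs, cs = divmod(rem_cs, 100)
    return f"{minutes:02d}:{secs:02d}.{cs:02d}"


def format_segments(segments: Iterable[Mapping]) -> str:
    """Return one '[start - end] text' line per segment."""
    lines = []
    for seg in segments:
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        lines.append(f"[{start} - {end}] {seg['text'].strip()}")
    return "\n".join(lines)
```

Working in integer centiseconds avoids the rounding glitch where a value such as 59.999 would otherwise be printed as 60.00 seconds within the same minute.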
- aix.audio.translate_audio(audio: str | Path | BinaryIO | bytes, *, model: str = None, prompt: str = None, api_key: str = None, **kwargs) → str
Translate audio from any language to English.
Note: This currently relies on Whisper's translation capability, which translates into English only.
- Parameters:
audio – Audio file path, file object, or bytes
model – Translation model (typically ‘whisper-1’)
prompt – Optional text to guide translation
**kwargs – Additional provider-specific parameters
- Returns:
Translated text in English
Examples
>>> from aix.audio import translate_audio
>>> english_text = translate_audio("spanish_audio.mp3")
>>> print(english_text)
This is the English translation.
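Like the transcription functions, translate_audio accepts a path, an open file object, or raw bytes. How aix normalizes these inputs internally is not documented here. The sketch below shows one way a caller could reduce all three shapes to bytes, for example to check the payload size before sending it; read_audio_bytes is a hypothetical helper.

```python
from pathlib import Path
from typing import BinaryIO, Union

AudioInput = Union[str, Path, BinaryIO, bytes]


def read_audio_bytes(audio: AudioInput) -> bytes:
    """Reduce a path, binary file object, or bytes to raw bytes."""
    if isinstance(audio, (bytes, bytearray)):
        return bytes(audio)
    if isinstance(audio, (str, Path)):
        # Strings are treated as filesystem paths, never as audio content.
        return Path(audio).read_bytes()
    if hasattr(audio, "read"):
        data = audio.read()
        if not isinstance(data, bytes):
            raise TypeError("file object must be opened in binary mode")
        return data
    raise TypeError(f"unsupported audio input: {type(audio).__name__}")
```

Reading a file object consumes it, so a caller that still needs the stream afterwards would have to seek back to the start or pass the returned bytes along instead.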