Multimodal Assistant: Speech, Text, and Image Interaction

Record a spoken question and upload an image. The assistant describes the image and answers your question with a spoken reply.
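
The interaction described above implies three stages: turning the recording into text, answering that text with reference to the image, and turning the answer back into speech. The following is a minimal sketch of that flow, not the assistant's actual implementation. The names `run_turn`, `AssistantReply`, and the three callables passed in are hypothetical, introduced only for illustration, and the stand-in functions at the bottom replace real speech and vision models so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AssistantReply:
    transcript: str  # text recognized from the user's recording
    answer: str      # text answer that refers to the image
    audio: bytes     # spoken version of the answer


def run_turn(
    audio: bytes,
    image: bytes,
    transcribe: Callable[[bytes], str],
    answer_about_image: Callable[[bytes, str], str],
    synthesize: Callable[[str], bytes],
) -> AssistantReply:
    """Run one turn: speech -> text -> image-grounded answer -> speech."""
    # Both inputs are required, matching the description above.
    if not audio:
        raise ValueError("an audio recording is required")
    if not image:
        raise ValueError("an image is required")

    transcript = transcribe(audio).strip()
    answer = answer_about_image(image, transcript)
    return AssistantReply(transcript, answer, synthesize(answer))


# Hypothetical stand-ins so the sketch runs without any models installed.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_answer(image: bytes, query: str) -> str:
    return f"Image of {len(image)} bytes. You asked: {query}"


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


reply = run_turn(
    b" What is this? ", b"\x89PNG",
    fake_transcribe, fake_answer, fake_synthesize,
)
print(reply.answer)
```

Passing the three stages in as callables keeps the flow independent of any particular speech or vision model, so each stage can be swapped or tested separately.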