Using multimedia in WhatsApp bots: Sending and handling video, audio and documents

6.10.2025

In the ever-evolving landscape of digital communication, WhatsApp has emerged as a powerful tool for businesses to engage with their customers. With over two billion users worldwide, the platform's bots, powered by the WhatsApp Business API, offer a seamless way to automate interactions. The inclusion of multimedia elements such as videos, audio files and documents transforms these bots from basic text responders into dynamic tools capable of sharing tutorials, product demonstrations, voice memos, contracts and much more. This improves the user experience, boosts engagement, and streamlines processes such as customer support, marketing, and sales.

The WhatsApp Cloud API, which is hosted by Meta, forms the basis for developing these bots. It enables developers to send and receive messages without hosting the API software themselves, and it scales automatically to handle high volumes. Unlike the on-premises version, the Cloud API simplifies the setup process and carries no hosting fees once the business is verified, although Meta still bills for messages under its own pricing. It supports a range of multimedia formats, enabling bots to deliver rich content directly in chats.

This article explores the options for sending and processing multimedia in WhatsApp bots. It covers API mechanics, code examples, best practices, limitations and security considerations. Leveraging these features enables developers to create bots that feel personal and interactive, driving better business outcomes in 2025 and beyond.

An overview of the WhatsApp Cloud API for multimedia in bots

The WhatsApp Cloud API offers a robust framework for incorporating multimedia into bots. To begin using it, businesses must register for a WhatsApp Business Account via the Meta Business Suite, obtain API access and set up webhooks for real-time notifications. The API uses RESTful endpoints with bearer token authentication.

Multimedia messages fall into five categories: audio (including voice notes), documents, images, stickers and videos. The supported formats are chosen for compatibility across devices. For example, audio can be AAC, MP3 or OGG (with the Opus codec); documents can be PDF, DOCX or XLSX; images can be JPEG or PNG; videos can be MP4 or 3GP (with the H.264 codec); and stickers use WebP. Size limits vary by type: 16 MB for audio and video, 5 MB for images, 100 MB for documents, and 100 KB for static or 500 KB for animated stickers.
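The size limits above can be checked before a file is uploaded or sent. The following is a minimal sketch, assuming the figures quoted in this article and treating 1 MB as 1024 × 1024 bytes; the names MAX_BYTES and check_media_size are illustrative and not part of the API.

```python
# Per-type size limits in bytes, taken from the figures quoted in the article.
# Assumption: 1 MB = 1024 * 1024 bytes; stickers are treated as static (100 KB),
# while animated stickers allow up to 500 KB.
MB = 1024 * 1024
MAX_BYTES = {
    "audio": 16 * MB,
    "video": 16 * MB,
    "image": 5 * MB,
    "document": 100 * MB,
    "sticker": 100 * 1024,
}


def check_media_size(media_type: str, size_in_bytes: int) -> bool:
    """Return True if a file of this size fits the limit for its media type."""
    if media_type not in MAX_BYTES:
        raise ValueError(f"Unsupported media type: {media_type}")
    return size_in_bytes <= MAX_BYTES[media_type]
```

A bot could call check_media_size("video", len(data)) and fall back to a compressed file or a document message when the check fails.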

Uploading media involves POSTing the file to /PHONE_NUMBER_ID/media, which returns a media ID. Either this ID or a public URL can then be used to attach the media to messages. To retrieve media, a GET request on the media ID returns a temporary download URL (valid for five minutes) along with details such as the MIME type and SHA-256 hash. Uploaded media persists for 30 days, so a bot can upload a file once and reuse its ID across many messages.
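The upload step can be sketched with the Python standard library alone. This is an illustration under stated assumptions rather than a definitive client: it assumes the media endpoint accepts a multipart form with the fields messaging_product, type and file, and the helpers build_multipart and upload_media are hypothetical names introduced here.

```python
import json
import urllib.request
import uuid

GRAPH_URL = "https://graph.facebook.com/v23.0"  # API version used in this article


def build_multipart(fields: dict, filename: str, data: bytes, mime_type: str):
    """Encode text fields plus one file as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"'
            f"\r\n\r\n{value}\r\n".encode()
        )
    file_header = (
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{filename}"\r\nContent-Type: {mime_type}\r\n\r\n'
    ).encode()
    parts.append(file_header + data + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def upload_media(phone_number_id, token, filename, data, mime_type):
    """POST a file to /PHONE_NUMBER_ID/media and return the media ID."""
    body, content_type = build_multipart(
        {"messaging_product": "whatsapp", "type": mime_type},
        filename, data, mime_type,
    )
    request = urllib.request.Request(
        f"{GRAPH_URL}/{phone_number_id}/media",
        data=body,
        headers={"Authorization": f"Bearer {token}", "Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["id"]
```

The returned ID can be stored alongside its upload time so the bot knows when the 30-day lifetime is about to lapse.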

Bots built with languages and runtimes such as Python, PHP or Node.js can integrate with this API. Tutorials emphasise setting up webhooks for incoming events and using web frameworks such as Flask or Express to handle requests. This setup enables bots to respond contextually; for example, they can send a video tutorial in response to a query.

Sending Multimedia Messages

Sending multimedia via the Cloud API uses the POST /PHONE_NUMBER_ID/messages endpoint. The payload specifies the type (e.g., "video") and includes either a media ID or link, plus optional captions (up to 1024 characters for non-audio/sticker types).
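The payload rules just described (a type, exactly one of a media ID or link, and an optional caption that audio and stickers do not accept) can be captured in a small builder. This is a sketch only; build_media_message and its parameters are hypothetical names, and the validation reflects the rules stated in this article rather than an exhaustive reading of the API.

```python
MEDIA_TYPES = {"audio", "document", "image", "sticker", "video"}
NO_CAPTION_TYPES = {"audio", "sticker"}
MAX_CAPTION_LENGTH = 1024


def build_media_message(to, media_type, media_id=None, link=None,
                        caption=None, filename=None):
    """Build the JSON body for POST /PHONE_NUMBER_ID/messages."""
    if media_type not in MEDIA_TYPES:
        raise ValueError(f"Unsupported media type: {media_type}")
    # Exactly one source is allowed: an uploaded media ID or a public link.
    if (media_id is None) == (link is None):
        raise ValueError("Provide exactly one of media_id or link")
    media = {"id": media_id} if media_id is not None else {"link": link}
    if caption is not None:
        if media_type in NO_CAPTION_TYPES:
            raise ValueError(f"{media_type} messages do not accept captions")
        if len(caption) > MAX_CAPTION_LENGTH:
            raise ValueError("Caption exceeds 1024 characters")
        media["caption"] = caption
    if filename is not None:
        if media_type != "document":
            raise ValueError("filename applies to documents only")
        media["filename"] = filename
    # The media object is stored under a key equal to the message type.
    return {"messaging_product": "whatsapp", "to": to,
            "type": media_type, media_type: media}
```

Rejecting invalid combinations locally gives clearer errors than waiting for the API to refuse the request.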

For videos, use "type": "video" with a "video" object containing either "id" or "link", plus an optional "caption". Example cURL request:

curl -X POST 'https://graph.facebook.com/v23.0/FROM_PHONE_NUMBER_ID/messages' \
-H 'Authorization: Bearer ACCESS_TOKEN' \
-H 'Content-Type: application/json' \
-d '{
  "messaging_product": "whatsapp",
  "to": "RECIPIENT_PHONE",
  "type": "video",
  "video": {
    "link": "https://example.com/video.mp4",
    "caption": "Product Demo"
  }
}'


This sends a video preview with playback controls. Bots can use this for tutorials or promotions.

Audio messages ("type": "audio") cover voice notes and do not accept captions. Example:

curl -X POST 'https://graph.facebook.com/v23.0/FROM_PHONE_NUMBER_ID/messages' \
-H 'Authorization: Bearer ACCESS_TOKEN' \
-H 'Content-Type: application/json' \
-d '{
  "messaging_product": "whatsapp",
  "to": "RECIPIENT_PHONE",
  "type": "audio",
  "audio": {
    "id": "AUDIO_ID"
  }
}'


This is ideal for personalized responses such as spoken confirmations.

Documents ("type": "document") accept a "filename", which is shown to the recipient and helps identify the file. Like images and videos, they can also carry an optional "caption". Example:

curl -X POST 'https://graph.facebook.com/v23.0/FROM_PHONE_NUMBER_ID/messages' \
-H 'Authorization: Bearer ACCESS_TOKEN' \
-H 'Content-Type: application/json' \
-d '{
  "messaging_product": "whatsapp",
  "to": "RECIPIENT_PHONE",
  "type": "document",
  "document": {
    "link": "https://example.com/contract.pdf",
    "filename": "Contract.pdf"
  }
}'


This enables sharing invoices or guides.

In bot development, the same requests are made from application code. In Python, using the requests library:

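import requests

# PHONE_ID, TOKEN and RECIPIENT are placeholders for your own values.
url = "https://graph.facebook.com/v23.0/PHONE_ID/messages"
headers = {"Authorization": "Bearer TOKEN"}
payload = {
    "messaging_product": "whatsapp",
    "to": "RECIPIENT",
    "type": "image",
    # The key of the media object must match the "type" above.
    "image": {"link": "https://example.com/image.jpg", "caption": "Info"}
}
# json= serializes the payload and sets the Content-Type header.
response = requests.post(url, headers=headers, json=payload, timeout=10)
response.raise_for_status()  # raise on 4xx/5xx instead of failing silently
print(response.json())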


This modular approach allows bots to dynamically select media based on user input, enhancing interactivity.
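One way to picture dynamic selection is a keyword lookup that returns a ready-to-send payload. The sketch below is purely illustrative: the catalogue, its keywords and the function select_media_payload are invented for this example, and the URLs are the placeholders used earlier in the article.

```python
# Hypothetical catalogue mapping keywords to media; the URLs are placeholders.
MEDIA_CATALOGUE = {
    "demo": ("video", {"link": "https://example.com/video.mp4",
                       "caption": "Product Demo"}),
    "contract": ("document", {"link": "https://example.com/contract.pdf",
                              "filename": "Contract.pdf"}),
    "info": ("image", {"link": "https://example.com/image.jpg",
                       "caption": "Info"}),
}


def select_media_payload(to: str, user_text: str):
    """Return a media message payload for the first matching keyword, else None."""
    # Lowercase the text and strip common punctuation so "Demo?" matches "demo".
    words = [word.strip(".,!?") for word in user_text.lower().split()]
    for keyword, (media_type, media) in MEDIA_CATALOGUE.items():
        if keyword in words:
            return {"messaging_product": "whatsapp", "to": to,
                    "type": media_type, media_type: dict(media)}
    return None
```

A production bot would more likely route on intents from an NLP layer, but the shape of the returned payload stays the same.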

Handling Incoming Multimedia

Receiving multimedia occurs via webhooks, configured in the app settings. When a user sends media, a POST notification hits your server with a JSON payload.
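Before notifications arrive, Meta confirms the webhook URL with a GET request carrying hub.mode, hub.verify_token and hub.challenge query parameters. A framework-neutral sketch of that check follows; handle_verification is a hypothetical helper, and the verify token is whatever secret string was entered in the app settings.

```python
def handle_verification(query: dict, verify_token: str):
    """Answer the webhook verification GET; returns (status_code, body)."""
    mode_ok = query.get("hub.mode") == "subscribe"
    token_ok = query.get("hub.verify_token") == verify_token
    if mode_ok and token_ok:
        # Echo the challenge back unchanged to confirm ownership of the URL.
        return 200, query.get("hub.challenge", "")
    return 403, "Forbidden"
```

In Flask or Express, the query dictionary would come from the request's query string and the tuple would be turned into the HTTP response.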

The payload's "messages" array details the type and media object. For video ("type": "video"):


{
  "object": "whatsapp_business_account",
  "entry": [{
    "changes": [{
      "value": {
        "messages": [{
          "type": "video",
          "video": {
            "id": "VIDEO_ID",
            "mime_type": "video/mp4",
            "sha256": "HASH",
            "caption": "User Video"
          }
        }]
      }
    }]
  }]
}


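Bots retrieve the media in two steps: a GET request on /MEDIA_ID returns a temporary download URL, and a second GET request on that URL, sent with the same bearer token, returns the file itself. Because the URL expires after five minutes, the download should follow immediately.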
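The two-step retrieval can be sketched as follows, using only the standard library. The helper names http_get and download_media are assumptions made for this example, and the get parameter exists solely so the network call can be swapped out in tests.

```python
import json
import urllib.request

GRAPH_URL = "https://graph.facebook.com/v23.0"  # API version used in this article


def http_get(url: str, token: str) -> bytes:
    """GET a URL with the bearer token and return the raw response body."""
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )
    with urllib.request.urlopen(request) as response:
        return response.read()


def download_media(media_id: str, token: str, get=http_get):
    """Resolve a media ID to its temporary URL, then download the file."""
    # Step 1: the media ID yields metadata, including the short-lived URL.
    metadata = json.loads(get(f"{GRAPH_URL}/{media_id}", token))
    # Step 2: fetch the file right away, before the URL expires.
    content = get(metadata["url"], token)
    return metadata, content
```

The returned metadata carries the MIME type and SHA-256 hash, which are useful for choosing a file extension and checking integrity.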

For audio:


{
  "messages": [{
    "type": "audio",
    "audio": {
      "id": "AUDIO_ID",
      "mime_type": "audio/ogg"
    }
  }]
}


Process audio by downloading and analyzing it, for example by transcribing a voice note and running sentiment analysis on the text.

Documents include "filename" and "caption":


{
  "messages": [{
    "type": "document",
    "document": {
      "id": "DOC_ID",
      "mime_type": "application/pdf",
      "sha256": "HASH",
      "filename": "File.pdf",
      "caption": "Attached Doc"
    }
  }]
}


In code, a Node.js handler built with Express might look like this:

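const express = require('express');
const app = express();
app.use(express.json()); // parse JSON webhook bodies into req.body

app.post('/webhook', (req, res) => {
  // Status notifications carry no "messages" array, so guard every level.
  const message = req.body?.entry?.[0]?.changes?.[0]?.value?.messages?.[0];
  if (message?.type === 'document') {
    // Retrieve and process the document using message.document.id
  }
  res.sendStatus(200); // acknowledge receipt of the notification
});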


This enables bots to store, analyze, or respond to media, like OCR on documents or keyword extraction from audio.
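For bots written in Python, the same extraction can be done with a pure function that walks the webhook payload. This is a sketch based on the payload shapes shown in this section; extract_media is a hypothetical helper name.

```python
MEDIA_TYPES = ("audio", "document", "image", "sticker", "video")


def extract_media(payload: dict) -> list:
    """Collect one dict per incoming media message in a webhook payload."""
    found = []
    for entry in payload.get("entry", []):
        for change in entry.get("changes", []):
            # Status updates have no "messages" key, so default to an empty list.
            for message in change.get("value", {}).get("messages", []):
                media_type = message.get("type")
                if media_type in MEDIA_TYPES and media_type in message:
                    # Merge the type with the media fields (id, mime_type, ...).
                    found.append({"type": media_type, **message[media_type]})
    return found
```

Each returned dictionary contains the media ID needed for retrieval, so the list can be handed straight to a download or processing queue.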


Best Practices and Limitations

Best practices include using rich media sparingly to avoid overwhelming users, personalizing content (e.g., dynamic videos), and tracking engagement via webhooks. Integrate multimedia with text for context, and test across devices. For D2C brands, use urgency in media messages to boost conversions.

Limitations: File sizes are capped at 100 MB, with stricter per-type limits; sent media cannot be edited afterwards; and media sent by link is cached for 10 minutes, so an updated file is only picked up if the query string on its URL changes. Bots must comply with messaging policies to avoid bans, and template messages are needed outside the 24-hour window that follows a user's last message. To work within these limits, compress files and host linked media on cloud storage.
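Two of these limitations lend themselves to small helpers: changing the query string of a media link, and checking whether the 24-hour window has closed. The sketch below is illustrative; with_cache_buster, needs_template and the "v" parameter name are assumptions made for this example.

```python
import time
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

SESSION_WINDOW_SECONDS = 24 * 60 * 60


def with_cache_buster(link: str, version: str) -> str:
    """Append a query parameter so an updated file is not served from cache."""
    parts = urlsplit(link)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("v", version))  # "v" is an arbitrary parameter name
    return urlunsplit(parts._replace(query=urlencode(query)))


def needs_template(last_user_message_at: float, now=None) -> bool:
    """Return True once 24 hours have passed since the user's last message."""
    now = time.time() if now is None else now
    return now - last_user_message_at > SESSION_WINDOW_SECONDS
```

For example, with_cache_buster("https://example.com/video.mp4", "2") yields a link ending in ?v=2, and bumping the version after each re-upload forces a fresh fetch.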

In 2025, prioritize concise, bite-sized multimedia and gather user feedback for optimization.


Security Considerations

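Security is paramount. Messages are protected with Signal Protocol encryption between the user's device and the Cloud API; because Meta hosts the API on the business's behalf, the business remains responsible for securing media once its bot has downloaded it. Enable two-factor authentication, keep the business profile verified, and audit access regularly. Limit API access, comply with GDPR, and monitor for spam to prevent restrictions. For multimedia, comparing a downloaded file's SHA-256 hash with the value reported by the API confirms that it arrived intact.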
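The hash check can be sketched in a few lines. One assumption is labelled explicitly: this article does not specify whether the reported hash is hex or base64 encoded, so the hypothetical helper media_matches_hash accepts either form.

```python
import base64
import hashlib


def media_matches_hash(content: bytes, expected_sha256: str) -> bool:
    """Compare downloaded bytes against the SHA-256 value reported for the media."""
    digest = hashlib.sha256(content).digest()
    # Assumption: the hash may arrive hex- or base64-encoded, so accept either.
    candidates = (digest.hex(), base64.b64encode(digest).decode())
    return expected_sha256 in candidates
```

A bot would typically discard or re-download a file when this check returns False.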


Conclusion

Multimedia in WhatsApp bots transforms basic automation into engaging experiences. By mastering sending via APIs, handling through webhooks, and adhering to best practices, developers can build scalable, secure bots. As adoption grows in 2025, expect advancements like enhanced AI integration for media analysis. Embrace these tools to foster deeper customer connections and drive innovation.

