```shell
npx skills add https://github.com/adaptationio/skrillz --skill gemini-3-multimodal
```
# Gemini 3 Pro Multimodal Input Processing
Comprehensive guide for processing multimodal inputs with Gemini 3 Pro, including image understanding, video analysis, audio processing, and PDF document extraction. This skill covers INPUT processing (analyzing media); see gemini-3-image-generation for OUTPUT (generating images).
## Overview
Gemini 3 Pro provides native multimodal capabilities for understanding and analyzing various media types. This skill covers all input processing operations with granular control over quality, performance, and token consumption.
### Key Capabilities

- **Image Understanding**: Object detection, OCR, visual Q&A, code from screenshots
- **Video Processing**: Up to 1 hour of video, frame analysis, OCR
- **Audio Processing**: Up to 9.5 hours of audio, speech understanding
- **PDF Documents**: Native PDF support, multi-page analysis, text extraction
- **Media Resolution Control**: Low/medium/high resolution for token optimization
- **Token Optimization**: Granular control over processing costs
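The capability list above maps each input to one of four media categories (image, video, audio, PDF). A small routing helper can pick the category from a file's MIME type before deciding which prompt template or resolution setting to use. This is an illustrative sketch using only the standard library; `classify_media` is a hypothetical helper, not part of the Gemini SDK.

```python
import mimetypes

def classify_media(path: str) -> str:
    """Return this skill's input category for a file: image, video, audio, pdf, or unknown."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        return "unknown"
    if mime == "application/pdf":
        return "pdf"
    # "image/jpeg" -> "image", "video/mp4" -> "video", "audio/mpeg" -> "audio"
    return mime.split("/")[0]

print(classify_media("photo.jpg"))   # image
print(classify_media("talk.mp4"))    # video
print(classify_media("report.pdf"))  # pdf
```

In a larger application, the returned category could select the prompt template and `media_resolution` setting before uploading the file.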
## When to Use This Skill
- Analyzing images, photos, or screenshots
- Processing video content for insights
- Transcribing or understanding audio/speech
- Extracting information from PDF documents
- Building multimodal applications
- Optimizing media processing costs
## Quick Start

### Prerequisites

- Gemini API setup (see the gemini-3-pro-api skill)
- Media files in supported formats

### Python Quick Start
```python
import google.generativeai as genai
from pathlib import Path

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-3-pro-preview")

# Upload and analyze an image
image_file = genai.upload_file(Path("photo.jpg"))

response = model.generate_content([
    "What's in this image?",
    image_file,
])
print(response.text)
```
### Node.js Quick Start
```javascript
import { GoogleGenerativeAI } from "@google/generative-ai";
import { GoogleAIFileManager } from "@google/generative-ai/server";

const genAI = new GoogleGenerativeAI("YOUR_API_KEY");
const fileManager = new GoogleAIFileManager("YOUR_API_KEY");

// Upload and analyze an image
const uploadResult = await fileManager.uploadFile("photo.jpg", {
  mimeType: "image/jpeg",
});

const model = genAI.getGenerativeModel({ model: "gemini-3-pro-preview" });
const result = await model.generateContent([
  "What's in this image?",
  { fileData: { fileUri: uploadResult.file.uri, mimeType: uploadResult.file.mimeType } },
]);
console.log(result.response.text());
```
## Core Tasks

### Task 1: Analyze Image Content

**Goal:** Extract information, objects, text, or insights from images.

**Use Cases:**
- Object detection and recognition
- OCR (text extraction from images)
- Visual Q&A
- Code generation from UI screenshots
- Chart/diagram analysis
- Product identification
**Python Example:**

```python
import google.generativeai as genai
from pathlib import Path

genai.configure(api_key="YOUR_API_KEY")

# Configure the model with high resolution for best quality
model = genai.GenerativeModel(
    "gemini-3-pro-preview",
    generation_config={
        "thinking_level": "high",
        "media_resolution": "high",  # 1,120 tokens per image
    },
)

# Upload the image
image_path = Path("screenshot.png")
image_file = genai.upload_file(image_path)

# Analyze with a specific prompt
response = model.generate_content([
    """Analyze this image and provide:
1. Main objects and their locations
2. Any visible text (OCR)
3. Overall context and purpose
4. If code/UI: describe the functionality
""",
    image_file,
])
print(response.text)

# Check token usage
print(f"Tokens used: {response.usage_metadata.total_token_count}")
```
**Node.js Example:**

```typescript
import { GoogleGenerativeAI } from "@google/generative-ai";
import { GoogleAIFileManager } from "@google/generative-ai/server";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const fileManager = new GoogleAIFileManager(process.env.GEMINI_API_KEY!);

// Upload the image
const uploadResult = await fileManager.uploadFile("screenshot.png", {
  mimeType: "image/png",
});

// Configure the model with high resolution
const model = genAI.getGenerativeModel({
  model: "gemini-3-pro-preview",
  generationConfig: {
    thinking_level: "high",
    media_resolution: "high", // Best quality for OCR
  },
});

const result = await model.generateContent([
  `Analyze this image and provide:
1. Main objects and their locations
2. Any visible text (OCR)
3. Overall context and purpose`,
  { fileData: { fileUri: uploadResult.file.uri, mimeType: uploadResult.file.mimeType } },
]);
console.log(result.response.text());
```
**Resolution Options:**

| Resolution | Tokens per Image | Best For |
|---|---|---|
| `low` | 280 | Quick analysis, low detail |
| `medium` | 560 | Balanced quality/cost |
| `high` | 1,120 | OCR, fine details, small text |
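Because per-image token costs are fixed per resolution tier, the cost of a multi-image request can be estimated before uploading anything. A minimal sketch using the token counts from the table above; `estimate_image_tokens` is a hypothetical helper, not an SDK function, and covers only the image portion of the prompt (text tokens are extra).

```python
# Token cost per image at each media_resolution setting (from the table above).
TOKENS_PER_IMAGE = {"low": 280, "medium": 560, "high": 1120}

def estimate_image_tokens(num_images: int, resolution: str = "medium") -> int:
    """Estimate the image-token cost of attaching num_images at the given resolution."""
    if resolution not in TOKENS_PER_IMAGE:
        raise ValueError(f"resolution must be one of {sorted(TOKENS_PER_IMAGE)}")
    return num_images * TOKENS_PER_IMAGE[resolution]

# A 10-screenshot batch at high resolution (needed for reliable OCR):
print(estimate_image_tokens(10, "high"))  # 11200
# The same batch at low resolution for a quick first pass:
print(estimate_image_tokens(10, "low"))   # 2800
```

Comparing the estimate against `response.usage_metadata.total_token_count` after a real call is a quick sanity check that the intended resolution was applied.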
**Supported Formats:** JPEG, ...