gemini-3-multimodal

from adaptationio/skrillz

1 star · 0 forks · Updated Jan 16, 2026
npx skills add https://github.com/adaptationio/skrillz --skill gemini-3-multimodal

SKILL.md

Gemini 3 Pro Multimodal Input Processing

Comprehensive guide to processing multimodal inputs with Gemini 3 Pro, including image understanding, video analysis, audio processing, and PDF document extraction. This skill covers INPUT processing (analyzing media); for OUTPUT (generating images), see gemini-3-image-generation.

Overview

Gemini 3 Pro provides native multimodal capabilities for understanding and analyzing various media types. This skill covers all input processing operations with granular control over quality, performance, and token consumption.

Key Capabilities

  • Image Understanding: Object detection, OCR, visual Q&A, code from screenshots
  • Video Processing: Up to 1 hour of video, frame analysis, OCR
  • Audio Processing: Up to 9.5 hours of audio, speech understanding
  • PDF Documents: Native PDF support, multi-page analysis, text extraction
  • Media Resolution Control: Low/medium/high resolution for token optimization
  • Token Optimization: Granular control over processing costs

When to Use This Skill

  • Analyzing images, photos, or screenshots
  • Processing video content for insights
  • Transcribing or understanding audio/speech
  • Extracting information from PDF documents
  • Building multimodal applications
  • Optimizing media processing costs

Quick Start

Prerequisites

  • Gemini API setup (see gemini-3-pro-api skill)
  • Media files in supported formats
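
Before uploading, it can help to confirm what kind of media a file is. The following is a minimal sketch using only Python's standard `mimetypes` module. The helper name `classify_media` and its category labels are hypothetical, and the function only guesses a type from the file extension; it does not verify that the Gemini API accepts that particular format.

```python
import mimetypes
from pathlib import Path
from typing import Tuple


def classify_media(path: str) -> Tuple[str, str]:
    """Guess a file's MIME type from its extension and map it to a broad category.

    Returns (mime_type, category), where category is "image", "video",
    "audio", or "pdf". Raises ValueError for anything else.
    """
    mime_type, _ = mimetypes.guess_type(Path(path).name)
    if mime_type is None:
        raise ValueError(f"cannot determine media type for {path!r}")
    if mime_type == "application/pdf":
        return mime_type, "pdf"
    top_level = mime_type.split("/", 1)[0]
    if top_level in ("image", "video", "audio"):
        return mime_type, top_level
    raise ValueError(f"{path!r} has non-media type {mime_type}")


if __name__ == "__main__":
    for name in ("photo.jpg", "screenshot.png", "report.pdf"):
        print(name, classify_media(name))
```

The guessed MIME type is the same kind of value that the Node.js examples in this skill pass as `mimeType` when uploading a file.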

Python Quick Start

import google.generativeai as genai
from pathlib import Path

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-3-pro-preview")

# Upload and analyze image
image_file = genai.upload_file(Path("photo.jpg"))
response = model.generate_content([
    "What's in this image?",
    image_file
])
print(response.text)

Node.js Quick Start

import { GoogleGenerativeAI } from "@google/generative-ai";
import { GoogleAIFileManager } from "@google/generative-ai/server";

const genAI = new GoogleGenerativeAI("YOUR_API_KEY");
const fileManager = new GoogleAIFileManager("YOUR_API_KEY");

// Upload and analyze image
const uploadResult = await fileManager.uploadFile("photo.jpg", {
  mimeType: "image/jpeg"
});

const model = genAI.getGenerativeModel({ model: "gemini-3-pro-preview" });
const result = await model.generateContent([
  "What's in this image?",
  { fileData: { fileUri: uploadResult.file.uri, mimeType: uploadResult.file.mimeType } }
]);

console.log(result.response.text());

Core Tasks

Task 1: Analyze Image Content

Goal: Extract information, objects, text, or insights from images.

Use Cases:

  • Object detection and recognition
  • OCR (text extraction from images)
  • Visual Q&A
  • Code generation from UI screenshots
  • Chart/diagram analysis
  • Product identification

Python Example:

import google.generativeai as genai
from pathlib import Path

genai.configure(api_key="YOUR_API_KEY")

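# Configure the model with high resolution for best quality.
# Note: "thinking_level" and "media_resolution" are written as this skill
# documents them; confirm that your installed SDK version accepts these keys.
model = genai.GenerativeModel(
    "gemini-3-pro-preview",
    generation_config={
        "thinking_level": "high",
        "media_resolution": "high"  # 1,120 tokens per image
    }
)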

# Upload image
image_path = Path("screenshot.png")
image_file = genai.upload_file(image_path)

# Analyze with specific prompt
response = model.generate_content([
    """Analyze this image and provide:
    1. Main objects and their locations
    2. Any visible text (OCR)
    3. Overall context and purpose
    4. If code/UI: describe the functionality
    """,
    image_file
])

print(response.text)

# Check token usage
print(f"Tokens used: {response.usage_metadata.total_token_count}")

Node.js Example:

import { GoogleGenerativeAI } from "@google/generative-ai";
import { GoogleAIFileManager } from "@google/generative-ai/server";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const fileManager = new GoogleAIFileManager(process.env.GEMINI_API_KEY!);

// Upload image
const uploadResult = await fileManager.uploadFile("screenshot.png", {
  mimeType: "image/png"
});

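// Configure the model with high resolution.
// Note: thinking_level and media_resolution are written as this skill
// documents them; confirm that your installed SDK version accepts these keys.
const model = genAI.getGenerativeModel({
  model: "gemini-3-pro-preview",
  generationConfig: {
    thinking_level: "high",
    media_resolution: "high"  // Best quality for OCR
  }
});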

const result = await model.generateContent([
  `Analyze this image and provide:
  1. Main objects and their locations
  2. Any visible text (OCR)
  3. Overall context and purpose`,
  { fileData: { fileUri: uploadResult.file.uri, mimeType: uploadResult.file.mimeType } }
]);

console.log(result.response.text());

Resolution Options:

Resolution | Tokens per Image | Best For
low        | 280              | Quick analysis, low detail
medium     | 560              | Balanced quality/cost
high       | 1,120            | OCR, fine details, small text
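
To make the trade-off concrete, the per-image figures above can be turned into a rough budget estimate. The sketch below is plain arithmetic over those three numbers. The names `TOKENS_PER_IMAGE`, `estimate_image_tokens`, and `best_resolution_within` are hypothetical helpers, not part of any Gemini SDK, and the estimate covers the images only, not prompt text or model output.

```python
from typing import Optional

# Per-image token costs, copied from the resolution options listed above.
TOKENS_PER_IMAGE = {"low": 280, "medium": 560, "high": 1120}


def estimate_image_tokens(num_images: int, resolution: str) -> int:
    """Return the approximate input tokens consumed by the images alone."""
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    if resolution not in TOKENS_PER_IMAGE:
        raise ValueError(f"unknown resolution: {resolution!r}")
    return num_images * TOKENS_PER_IMAGE[resolution]


def best_resolution_within(num_images: int, token_budget: int) -> Optional[str]:
    """Pick the highest resolution whose image cost fits the budget, if any."""
    for resolution in ("high", "medium", "low"):
        if estimate_image_tokens(num_images, resolution) <= token_budget:
            return resolution
    return None


if __name__ == "__main__":
    for level in TOKENS_PER_IMAGE:
        print(level, estimate_image_tokens(25, level))
    print(best_resolution_within(25, 20_000))
```

For example, 25 images at high resolution cost 25 × 1,120 = 28,000 tokens, which exceeds a 20,000-token budget, while medium costs 14,000 and fits, so the helper selects medium.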

Supported Formats: JPEG,

...
