Voice Activity Detection (VAD)
Micdrop uses voice activity detection (VAD) to tell speech from silence, and sends audio chunks to the server only while speech is detected.
Supported VAD Types
Micdrop supports the following VADs by name:
- 'volume': Volume-based VAD (default)
- 'silero': AI-based VAD using Silero
You can also pass instances of these VADs, or combine them in an array. See below for details.
Note: Only 'volume' and 'silero' are supported as string names. Custom VADs must be passed as instances.
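To make the accepted shapes concrete, here is a minimal sketch of how a `vad` option could be normalized into a flat list of instances. It is not Micdrop's actual code: `VadLike`, `VolumeVadStub`, `SileroVadStub`, and `resolveVads` are hypothetical names used only for illustration.

```typescript
// Hypothetical stand-ins for the real VAD classes; only the shape matters here.
interface VadLike {
  readonly kind: string
}
class VolumeVadStub implements VadLike {
  readonly kind = 'volume'
}
class SileroVadStub implements VadLike {
  readonly kind = 'silero'
}

type VadName = 'volume' | 'silero'
type VadConfig = VadName | VadLike | Array<VadName | VadLike>

// Turn any accepted config shape into a flat list of instances.
// Assumption: an omitted option falls back to 'volume', the documented default.
function resolveVads(config: VadConfig = 'volume'): VadLike[] {
  const items = Array.isArray(config) ? config : [config]
  return items.map((item) => {
    if (typeof item !== 'string') return item // already an instance
    switch (item) {
      case 'volume':
        return new VolumeVadStub()
      case 'silero':
        return new SileroVadStub()
      default:
        // Any other string is rejected: custom VADs must be instances.
        throw new Error(`Unknown VAD name: ${String(item)}`)
    }
  })
}
```

Under these assumptions, `resolveVads(['volume', new SileroVadStub()])` yields two instances, while an unknown string such as 'custom' throws.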
Quick Start
Configure VAD when starting a call:
import { Micdrop } from '@micdrop/client'
// Use volume-based detection (default)
await Micdrop.start({
  url: 'ws://localhost:8081',
  vad: 'volume',
})
// Use AI-based detection for better accuracy
await Micdrop.start({
  url: 'ws://localhost:8081',
  vad: 'silero',
})
// Combine multiple VADs for best results
await Micdrop.start({
  url: 'ws://localhost:8081',
  vad: ['volume', 'silero'],
})
Or when starting the microphone (before starting the call):
Micdrop.startMic({ vad: 'volume' })
Volume VAD: Speech detection based on volume
By default, MicdropClient uses VolumeVAD for speech detection. You can use it explicitly when starting Micdrop:
Micdrop.start({ vad: 'volume' })
or when starting the microphone (before starting the call):
Micdrop.startMic({ vad: 'volume' })
VolumeVAD is inspired by hark and triggers speech detection events based on volume changes.
You can also pass an instance of VolumeVAD to MicdropClient:
import { Micdrop, VolumeVAD } from '@micdrop/client'
const vad = new VolumeVAD({
  history: 5, // Number of recent frames considered when evaluating volume
  threshold: -55, // Speech detection threshold, in decibels
})
Micdrop.start({ vad })
- Default options: { history: 5, threshold: -55 }
- Persistence: Options are saved to localStorage and restored automatically.
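To give a feel for how `history` and `threshold` might interact, here is a rough sketch of a volume gate. It is an illustration under stated assumptions, not VolumeVAD's actual implementation: `frameToDb` and `VolumeGate` are hypothetical names, the level is computed as RMS in decibels relative to full scale, and speech is assumed for as long as any of the last `history` frames was above the threshold.

```typescript
// Convert one frame of PCM samples (floats in [-1, 1]) to decibels full scale.
function frameToDb(samples: Float32Array): number {
  let sumSquares = 0
  for (let i = 0; i < samples.length; i++) {
    sumSquares += samples[i] * samples[i]
  }
  const rms = Math.sqrt(sumSquares / Math.max(samples.length, 1))
  return rms > 0 ? 20 * Math.log10(rms) : -Infinity
}

class VolumeGate {
  private readonly history: number
  private readonly threshold: number
  private recent: boolean[] = []

  constructor(history = 5, threshold = -55) {
    this.history = history
    this.threshold = threshold
  }

  // Feed one frame level in dB; returns true while speech is assumed.
  push(db: number): boolean {
    this.recent.push(db > this.threshold)
    if (this.recent.length > this.history) this.recent.shift()
    // Assumption: one loud frame within the window keeps the gate open.
    return this.recent.some(Boolean)
  }
}
```

In this sketch, a larger `history` keeps the gate open longer across short pauses, and a higher (less negative) `threshold` ignores more quiet sound.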
When to use Volume VAD:
- ✅ Low latency requirements
- ✅ Quiet environments
- ✅ Clear speech patterns
- ❌ Noisy environments
- ❌ Soft-spoken users
Silero VAD: Human speech detection with AI
To use SileroVAD for speech detection:
Micdrop.start({ vad: 'silero' })
SileroVAD is based on @ricky0123/vad-web, which runs a Silero VAD model in the browser using ONNX Runtime Web.
It is more accurate than VolumeVAD and works better with quiet voices.
You can also pass an instance of SileroVAD to MicdropClient:
import { Micdrop, SileroVAD } from '@micdrop/client'
const vad = new SileroVAD({
  positiveSpeechThreshold: 0.18, // Probability above which a frame counts as speech
  negativeSpeechThreshold: 0.11, // Probability below which a frame counts as silence
  minSpeechFrames: 8, // Minimum number of speech frames for a segment to count as speech
  redemptionFrames: 20, // Number of silent frames to wait before ending a speech segment
})
Micdrop.start({ vad })
- Default options: { positiveSpeechThreshold: 0.18, negativeSpeechThreshold: 0.11, minSpeechFrames: 8, redemptionFrames: 20 }
- Persistence: Options are saved to localStorage and restored automatically.
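The four options work together as a hysteresis over per-frame speech probabilities. The sketch below replays a list of probabilities and returns the events such a detector could emit. It is a simplified illustration, not the code of SileroVAD or @ricky0123/vad-web: the `replay` function and the mapping of thresholds to Micdrop event names are assumptions.

```typescript
type VadEvent = 'StartSpeaking' | 'ConfirmSpeaking' | 'CancelSpeaking' | 'StopSpeaking'

interface Thresholds {
  positiveSpeechThreshold: number
  negativeSpeechThreshold: number
  minSpeechFrames: number
  redemptionFrames: number
}

const defaults: Thresholds = {
  positiveSpeechThreshold: 0.18,
  negativeSpeechThreshold: 0.11,
  minSpeechFrames: 8,
  redemptionFrames: 20,
}

// Replay per-frame speech probabilities (0..1) through the hysteresis.
function replay(probabilities: number[], options: Thresholds = defaults): VadEvent[] {
  const events: VadEvent[] = []
  let speaking = false
  let confirmed = false
  let speechFrames = 0
  let quietFrames = 0

  for (const p of probabilities) {
    if (p > options.positiveSpeechThreshold) {
      quietFrames = 0 // any speech frame resets the silence countdown
      if (!speaking) {
        speaking = true
        speechFrames = 0
        events.push('StartSpeaking')
      }
      speechFrames++
      if (!confirmed && speechFrames >= options.minSpeechFrames) {
        confirmed = true
        events.push('ConfirmSpeaking')
      }
    } else if (speaking && p < options.negativeSpeechThreshold) {
      quietFrames++
      if (quietFrames >= options.redemptionFrames) {
        // Too few speech frames means the start was a false positive.
        events.push(confirmed ? 'StopSpeaking' : 'CancelSpeaking')
        speaking = false
        confirmed = false
        quietFrames = 0
      }
    }
    // Probabilities between the two thresholds change nothing.
  }
  return events
}
```

In this sketch, lowering `positiveSpeechThreshold` makes the detector start sooner, and raising `redemptionFrames` makes it more tolerant of pauses at the cost of a later StopSpeaking.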
When to use Silero VAD:
- ✅ Noisy environments
- ✅ Soft-spoken users
- ✅ Multiple speakers
- ✅ Background music/TV
- ❌ Extremely low latency needs (adds ~50ms processing)
Multiple VAD: Combine multiple VADs
Combining multiple VADs is useful to get more accurate speech detection:
- Volume filters out sounds that are too quiet
- Silero checks that the sound is human speech
You can combine multiple VADs by passing an array of VAD names:
Micdrop.start({ vad: ['volume', 'silero'] })
Or with instances:
const vad = [new VolumeVAD(), new SileroVAD()]
Micdrop.start({ vad })
Or mix names and instances:
await Micdrop.start({
  vad: ['volume', new SileroVAD({ positiveSpeechThreshold: 0.15 })],
})
How it works:
- StartSpeaking is emitted when any VAD detects possible speech.
- ConfirmSpeaking is emitted only when all VADs confirm speech.
- StopSpeaking is emitted when all VADs detect silence.
- CancelSpeaking is emitted if all VADs agree speech was a false positive.
This approach reduces false positives while maintaining quick response times.
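The combination rules above can be sketched as a small pure function over the statuses of the individual VADs. This is an illustration of the rules as described, not Micdrop's internal code: `combine` is a hypothetical name, and the sketch assumes each child VAD exposes one of the three statuses.

```typescript
type Status = 'Silence' | 'MaybeSpeaking' | 'Speaking'
type VadEvent = 'StartSpeaking' | 'ConfirmSpeaking' | 'CancelSpeaking' | 'StopSpeaking'

// Given the previous combined status and the current status of every child
// VAD, return the new combined status and the events to emit.
function combine(
  previous: Status,
  children: Status[]
): { status: Status; events: VadEvent[] } {
  const events: VadEvent[] = []
  let status: Status = previous
  if (children.length === 0) return { status, events }

  const allSilent = children.every((s) => s === 'Silence')
  const allSpeaking = children.every((s) => s === 'Speaking')

  // Any VAD hearing something is enough to start.
  if (status === 'Silence' && !allSilent) {
    status = 'MaybeSpeaking'
    events.push('StartSpeaking')
  }
  if (status === 'MaybeSpeaking') {
    if (allSpeaking) {
      status = 'Speaking' // every VAD confirmed
      events.push('ConfirmSpeaking')
    } else if (allSilent) {
      status = 'Silence' // every VAD backed off: false positive
      events.push('CancelSpeaking')
    }
  } else if (status === 'Speaking' && allSilent) {
    status = 'Silence'
    events.push('StopSpeaking')
  }
  return { status, events }
}
```

For example, `combine('Silence', ['MaybeSpeaking', 'Silence'])` starts a tentative segment, while `combine('Speaking', ['Silence', 'Speaking'])` keeps the segment open because one VAD still hears speech.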
VAD Events
VADs emit the following events:
- StartSpeaking: Possible speech detected (not yet confirmed)
- ConfirmSpeaking: Speech confirmed
- CancelSpeaking: Speech start was a false positive (noise, etc.)
- StopSpeaking: Speech ended
- ChangeStatus: Status changed (Silence, MaybeSpeaking, Speaking)
Monitor VAD activity in your application:
Micdrop.vad.on('StartSpeaking', () => {
  console.log('🎤 Possible speech detected...')
  showListeningIndicator()
})
Micdrop.vad.on('ConfirmSpeaking', () => {
  console.log('✅ Speech confirmed - recording')
  highlightMicrophoneButton()
})
Micdrop.vad.on('StopSpeaking', () => {
  console.log('🔇 Speech ended')
  resetMicrophoneButton()
})
Micdrop.vad.on('CancelSpeaking', () => {
  console.log('❌ False positive - not speech')
  hideListeningIndicator()
})
Micdrop.vad.on('ChangeStatus', (status) => {
  console.log('VAD status:', status) // 'Silence', 'MaybeSpeaking', 'Speaking'
})
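The relationship between the four speech events and the three statuses can be pictured as a small state machine. The sketch below is an assumption inferred from the event descriptions, not taken from Micdrop's source: `nextStatus` and `trackStatus` are hypothetical names.

```typescript
type VadStatus = 'Silence' | 'MaybeSpeaking' | 'Speaking'
type VadEvent = 'StartSpeaking' | 'ConfirmSpeaking' | 'CancelSpeaking' | 'StopSpeaking'

// Assumed mapping from each speech event to the status it leads to.
const nextStatus: Record<VadEvent, VadStatus> = {
  StartSpeaking: 'MaybeSpeaking',
  ConfirmSpeaking: 'Speaking',
  CancelSpeaking: 'Silence',
  StopSpeaking: 'Silence',
}

// Replay a sequence of events and collect every status change, which is
// what a ChangeStatus listener would observe under this mapping.
function trackStatus(events: VadEvent[], initial: VadStatus = 'Silence'): VadStatus[] {
  const changes: VadStatus[] = []
  let current = initial
  for (const event of events) {
    const next = nextStatus[event]
    if (next !== current) {
      current = next
      changes.push(current)
    }
  }
  return changes
}
```

Under this mapping, a confirmed utterance passes through MaybeSpeaking, Speaking, and back to Silence, while a false positive goes from MaybeSpeaking straight back to Silence.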
Custom VAD
You can also pass your own VAD implementation:
Micdrop.start({ vad: new MyVAD() })
Your VAD implementation should extend the VAD class:
import { VAD } from '@micdrop/client'
class MyVAD extends VAD {
  private started = false
  get isStarted(): boolean {
    return this.started
  }
  async start(stream: MediaStream) {
    this.started = true
    // The calls below only show which events exist.
    // Emit each one from your own detection logic at the right moment.
    // When possible speech is detected, emit StartSpeaking
    this.emit('StartSpeaking')
    // When speech is confirmed, emit ConfirmSpeaking
    this.emit('ConfirmSpeaking')
    // When the speech start turns out to be a false positive, emit CancelSpeaking
    this.emit('CancelSpeaking')
    // When speech stops, emit StopSpeaking
    this.emit('StopSpeaking')
  }
  async stop() {
    this.started = false
  }
}
Tip: See the codebase for HarkVAD as another example of a custom VAD. It is not exported by default, as VolumeVAD is the recommended and supported option inspired by Hark.
For a more complex example with audio analysis:
import { Micdrop, VAD } from '@micdrop/client'
class CustomVAD extends VAD {
  private started = false
  private audioContext?: AudioContext
  private analyzer?: AnalyserNode
  get isStarted(): boolean {
    return this.started
  }
  async start(stream: MediaStream) {
    this.started = true
    // Set up audio analysis
    this.audioContext = new AudioContext()
    this.analyzer = this.audioContext.createAnalyser()
    const source = this.audioContext.createMediaStreamSource(stream)
    source.connect(this.analyzer)
    // Start your detection logic
    this.detectSpeech(this.analyzer)
  }
  async stop() {
    this.started = false
    await this.audioContext?.close()
    this.audioContext = undefined
    this.analyzer = undefined
  }
  private detectSpeech(analyzer: AnalyserNode) {
    const dataArray = new Uint8Array(analyzer.frequencyBinCount)
    // Last reported state, so events are emitted only when it changes
    let speaking = false
    const analyze = () => {
      if (!this.started) return
      analyzer.getByteFrequencyData(dataArray)
      // Your custom detection algorithm
      const isSpeaking = this.customDetection(dataArray)
      if (isSpeaking && !speaking) {
        this.emit('StartSpeaking')
        this.emit('ConfirmSpeaking')
      } else if (!isSpeaking && speaking) {
        this.emit('StopSpeaking')
      }
      speaking = isSpeaking
      requestAnimationFrame(analyze)
    }
    analyze()
  }
  private customDetection(audioData: Uint8Array): boolean {
    // Your custom speech detection logic here
    // Return true if speech is detected, false otherwise
    return false
  }
}
// Use your custom VAD
await Micdrop.start({
  vad: new CustomVAD(),
})
VAD Delay
All VADs have a delay property (default: 100 ms) that controls the interval between speech detection checks. You can adjust this in custom VADs if needed.