Controlling music or video playback without using your hands is possible with this gaze and blink recognition system. It works in real time using just a regular webcam and open-source tools like OpenCV and dlib. Instead of pressing buttons or using a remote, the system responds to where the user is looking and how many times they blink. Two quick blinks skip to the next track, while three blinks pause or resume playback.
The system relies on facial landmark detection to locate the eyes and uses the Eye Aspect Ratio (EAR) method to identify deliberate blinks. There is no need for special equipment, which makes it simple to use and easy to set up.
🖥️ How the System Works
The gaze-controlled media player operates by using a standard webcam to track the user’s eye movements and intentional blinks. It captures real-time video, identifies the face, isolates the eye region, and then processes this data to understand what the user wants to do—whether that’s pausing a track, switching to the next one, or playing media. No keyboard or mouse is needed.
There are four main parts that handle different stages of this process:
- The webcam continuously captures video at roughly 26 frames per second.
- A facial landmark model detects key features of the face, especially around the eyes.
- Blink detection and gaze estimation are handled by observing how open or closed the eyes are, along with the position of the iris.
- Based on what the eyes are doing, the system either plays, pauses, or skips the current track.
This setup works in two modes. One connects with external platforms like VLC or YouTube and mimics keyboard actions. The other plays MP3 files stored locally using Python libraries. A minimal sketch of both modes follows Figure 1.

Figure 1. The flow of the gaze-controlled media player. The system captures video from a webcam, detects facial landmarks to locate the eyes, estimates gaze direction, and tracks blinks. These signals are then interpreted as media commands, either by simulating keyboard inputs for external applications or controlling internal audio playback.
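As a rough illustration of the two modes, the sketch below pairs PyAutoGUI key presses for external players with Pygame's mixer for local files. The MP3 path and function names are illustrative, not the project's exact API; the key bindings are the VLC and YouTube defaults.

```python
# Minimal sketch of the two playback modes; the MP3 path and
# function names are illustrative, not the project's exact API.
import pyautogui   # simulates key presses for external players
import pygame      # local MP3 playback via the mixer module

def external_play_pause():
    # Space toggles play/pause in both VLC and YouTube
    # while the player window has focus.
    pyautogui.press("space")

def external_next_track():
    # "n" jumps to the next playlist item in VLC;
    # YouTube uses Shift+N instead.
    pyautogui.press("n")

pygame.mixer.init()
_paused = False

def internal_play(path="media/track01.mp3"):
    # Load and start an MP3 stored locally.
    pygame.mixer.music.load(path)
    pygame.mixer.music.play()

def internal_toggle():
    # Pause or resume the currently loaded track.
    global _paused
    if _paused:
        pygame.mixer.music.unpause()
    else:
        pygame.mixer.music.pause()
    _paused = not _paused
```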
🧿 How the System Detects Gaze and Blinks
The system works by analyzing the face in each video frame to track where the user is looking and whether their eyes are open or closed. It uses a model that marks 68 specific points on the face. Out of these, six points around each eye are used to understand the shape and position of the eyes.
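As a concrete illustration, the six points around each eye can be read off dlib's 68-point predictor as sketched below; the .dat path refers to the standard pre-trained model and is an assumption about this setup.

```python
# Sketch of extracting the six landmarks around each eye with dlib's
# 68-point model; the .dat file is the standard pre-trained predictor.
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

EYE_1 = range(36, 42)  # six landmarks around one eye
EYE_2 = range(42, 48)  # six landmarks around the other

def eye_points(frame):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None, None
    shape = predictor(gray, faces[0])
    def pts(idx):
        return [(shape.part(i).x, shape.part(i).y) for i in idx]
    return pts(EYE_1), pts(EYE_2)
```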
To find where the user is looking, the system locates the center of the iris. This is usually the darkest part of the eye. It converts the eye region to grayscale, boosts the contrast, filters out noise, and detects the darkest spot using contours. From there, it calculates the center of the iris using a formula based on image moments.
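The centroid follows from the image moments as cx = M10/M00 and cy = M01/M00. A sketch of that pipeline on a cropped grayscale eye image, where the fixed threshold of 40 is an assumed value that would need tuning:

```python
# Sketch of iris localization on a cropped grayscale eye image;
# the threshold value (assumed 40 here) would need tuning per setup.
import cv2

def iris_center(eye_gray):
    eye = cv2.equalizeHist(eye_gray)            # boost contrast
    eye = cv2.GaussianBlur(eye, (5, 5), 0)      # suppress noise
    # The iris/pupil is the darkest region: keep only dark pixels.
    _, mask = cv2.threshold(eye, 40, 255, cv2.THRESH_BINARY_INV)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    blob = max(contours, key=cv2.contourArea)   # largest dark blob
    m = cv2.moments(blob)
    if m["m00"] == 0:
        return None
    # Centroid from image moments: cx = m10/m00, cy = m01/m00.
    return int(m["m10"] / m["m00"]), int(m["m01"] / m["m00"])
```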
To detect a blink, it computes the Eye Aspect Ratio (EAR): the ratio between the vertical and horizontal distances across the six key points. When this value drops below a fixed threshold for two or three consecutive frames, the system registers a deliberate blink. Two quick blinks skip to the next track; three blinks pause or play the media.
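Following Soukupová and Čech (2016), the EAR is the sum of the two vertical landmark distances divided by twice the horizontal one. A sketch, where the 0.2 threshold and two-frame minimum are assumed values:

```python
# Sketch of the Eye Aspect Ratio and blink counting; the 0.2
# threshold and two-frame minimum are assumed values.
import numpy as np

def eye_aspect_ratio(eye):
    # eye: six (x, y) points ordered p1..p6 around the eye contour.
    p1, p2, p3, p4, p5, p6 = (np.array(p) for p in eye)
    vertical = np.linalg.norm(p2 - p6) + np.linalg.norm(p3 - p5)
    horizontal = np.linalg.norm(p1 - p4)
    return vertical / (2.0 * horizontal)

EAR_THRESHOLD = 0.2     # eye counts as closed below this
MIN_CLOSED_FRAMES = 2   # frames the eye must stay closed

closed_frames = 0
blinks = 0

def update(ear):
    """Call once per frame; returns the running blink count."""
    global closed_frames, blinks
    if ear < EAR_THRESHOLD:
        closed_frames += 1
    else:
        if closed_frames >= MIN_CLOSED_FRAMES:
            blinks += 1
        closed_frames = 0
    return blinks
```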
🛠️ How the System Is Built and Runs
The system is written in Python and uses open-source libraries. Its structure is kept simple, with two main modes of use. One mode works with external apps like VLC or YouTube. The other plays MP3 files stored in a folder on the computer.
It uses OpenCV for working with webcam frames, Dlib for face and eye detection, NumPy for calculations, PyAutoGUI to simulate key presses, and Pygame for audio playback.
Each frame goes through several steps (a condensed sketch follows the list):
- The face is located in the image.
- Sixty-eight facial landmarks are identified, with attention on the ones near the eyes.
- The eyes are cropped and converted to grayscale for better contrast.
- The system checks if the user is blinking and where their iris is located.
- A matching action is taken, such as play, pause, or skip.
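Put together, the per-frame loop might look like the following; eye_points, eye_aspect_ratio, and update refer to the helper sketches above, and dispatch_command is a hypothetical stand-in for the blink-to-action mapping.

```python
# Condensed per-frame loop. eye_points, eye_aspect_ratio, and update
# are the helper sketches above; dispatch_command is a hypothetical
# stand-in for the blink-to-action mapping.
import cv2

cap = cv2.VideoCapture(0)
while True:
    ok, frame = cap.read()                 # read a webcam frame
    if not ok:
        break
    eye1, eye2 = eye_points(frame)         # face + 68 landmarks
    if eye1 is not None:
        ear = (eye_aspect_ratio(eye1) + eye_aspect_ratio(eye2)) / 2.0
        blinks = update(ear)               # blink detection
        dispatch_command(blinks)           # play / pause / skip
    cv2.imshow("gaze player", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):  # q quits
        break
cap.release()
cv2.destroyAllWindows()
```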
Threading is used to keep the system responsive. While one frame is being processed, the next is already being read, which keeps the speed close to 26 frames per second.
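A common way to achieve this, sketched here, is a daemon reader thread that keeps overwriting the latest frame so the processing loop never blocks on the camera:

```python
# Sketch of the threaded capture pattern: a background reader keeps
# pulling frames so the processing loop never waits on the camera.
import threading
import cv2

class FrameGrabber:
    def __init__(self, src=0):
        self.cap = cv2.VideoCapture(src)
        self.frame = None
        self.running = True
        threading.Thread(target=self._reader, daemon=True).start()

    def _reader(self):
        # Continuously overwrite the latest frame in the background.
        while self.running:
            ok, frame = self.cap.read()
            if ok:
                self.frame = frame

    def read(self):
        # Return the most recent frame without blocking.
        return self.frame

    def stop(self):
        self.running = False
        self.cap.release()
```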
📊 Results From Testing the System
The system was tested in regular indoor lighting using a standard laptop. It was able to keep up with real-time input from the webcam and respond to user actions with little delay.
Performance data:
- The average speed was around 26 frames per second, which allowed smooth interaction.
- The time taken between a blink and the system response was under half a second.
- Blink sequences were correctly detected in over 92 percent of attempts when users blinked on purpose.
- Random, natural blinks were not picked up as commands, thanks to filtering based on blink duration.
Blink stability:
The system detected blinks against a fixed EAR threshold: the ratio had to stay below it for several consecutive frames before a blink was accepted, which helped reduce false triggers. In some cases, sudden lighting changes made the iris harder to detect; this could be improved with better threshold adjustment.
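One possible refinement, sketched below, is Otsu's method, which derives the dark/bright split from each frame's histogram instead of using a fixed value:

```python
# Possible threshold improvement: let Otsu's method pick the
# dark/bright split per frame instead of a fixed value.
import cv2

def iris_mask(eye_gray):
    eye = cv2.GaussianBlur(eye_gray, (5, 5), 0)
    # Otsu computes the threshold from the image histogram, so the
    # mask adapts when lighting changes between frames.
    _, mask = cv2.threshold(eye, 0, 255,
                            cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    return mask
```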
Gaze tracking:
While the main input is based on blinks, the system also displays the estimated gaze point as visual feedback. It was accurate enough in steady lighting. Rapid head movement or shadows caused minor shifts, but these did not affect overall control.

Eye blink-based control in action. The system detects blink patterns in real time using webcam input.
Media control experience:
The system was tested with VLC, YouTube, and a local MP3 folder. It worked well with each of them. In external mode, the user could control the media by blinking while the player was active on the screen. In internal mode, the user could play or skip songs from a folder without touching the keyboard or mouse.

Figure 2. Plot of Eye Aspect Ratio (EAR) over time. Sharp drops in the graph represent eye blinks. Two quick dips signal a track skip, while three dips trigger a pause or play command. This shows how the system maps blink patterns to control actions.
🧩 What the Project Shows and What Comes Next
This project shows that media players can be controlled through eye movement and blinking without needing extra hardware. It uses a regular webcam, tracks facial points to locate the eyes, and measures how open or closed they are to decide if the user blinked on purpose. It works in two modes, letting users control apps like VLC or play music stored on the system.
The blink logic is simple. Two fast blinks move to the next song. Three blinks pause or resume playback. The system runs in real time, responds quickly, and avoids reacting to normal eye movements. This keeps the interaction smooth.
This approach can be useful for people who have limited movement, or for public systems where hands-free control is preferred. It can be made better by adding voice input, tracking head tilt for volume, or letting the eyes pick items from the screen.
The current setup already works without needing any training or calibration, and it stays accurate in most home lighting conditions. With some updates, it can be turned into a full-featured tool for daily use.
📚 References
- Soukupová, T., and Čech, J. (2016). Real-Time Eye Blink Detection Using Facial Landmarks. Proceedings of the 21st Computer Vision Winter Workshop.
- Kazemi, V., and Sullivan, J. (2014). One Millisecond Face Alignment with an Ensemble of Regression Trees. IEEE Conference on Computer Vision and Pattern Recognition, pp. 1867–1874.
- OpenCV. Open Source Computer Vision Library. https://opencv.org/
- Dlib C++ Library. Machine Learning Toolkit. http://dlib.net/
- PyAutoGUI. Cross-platform GUI Automation. https://pyautogui.readthedocs.io/en/latest/
- Pygame Documentation. Mixer Module. https://www.pygame.org/docs/ref/mixer.html
- CS50 AI Course. Harvard University. https://cs50.harvard.edu/ai/
💾 Code and Project Files
The full source code for the gaze-controlled media player is available on GitHub. It includes the blink detection scripts, MP3 playback mode, external media control, and setup instructions. You can test it using any standard webcam and a folder with MP3 files.
🔗 GitHub Repository: github.com/AbhishekTyagi404/gaze-media-player
The repository includes:
- Python scripts for blink-based control
- MP3 player mode using Pygame
- External control mode using PyAutoGUI
- Sample media folder setup
- Instructions for installing dependencies and running the app
The project runs best on a system with at least 8 GB RAM and a webcam that can deliver 25 or more frames per second.