MCP servers are getting more popular. However, they lack the ability to interact with what is shown on screen.
I suggest creating a dedicated API that describes every action available on the current screen and how to invoke it.
For example, suppose I am watching a YouTube video. The current activity shows the content and declares what it can do to an LLM, in a format like:
[
  {
    "description": "rewind video by an integer number of seconds, where positive means forward and negative means backward",
    "callback": callback
  },
  {
    "description": "stop video",
    "callback": callback
  },
  {
    "description": "find video by search term",
    "callback": callback
  }
]
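To make the idea concrete, here is a minimal sketch of what the activity side could look like. Everything below (`ScreenAction`, `ScreenActionRegistry`, the stub `player`) is hypothetical and invented for illustration, not part of any existing MCP SDK:

```typescript
// Hypothetical sketch, not an existing MCP API: a registry of screen actions
// that the current activity exposes to an LLM.

interface ScreenAction {
  // Natural-language description the LLM uses to pick an action.
  description: string;
  // Invoked with arguments the LLM derives from the user's request.
  callback: (args: Record<string, unknown>) => Promise<string>;
}

class ScreenActionRegistry {
  private actions = new Map<string, ScreenAction>();

  register(name: string, action: ScreenAction): void {
    this.actions.set(name, action);
  }

  // Serializable listing the activity hands to the LLM.
  describe(): { name: string; description: string }[] {
    return [...this.actions.entries()].map(([name, a]) => ({
      name,
      description: a.description,
    }));
  }

  async invoke(name: string, args: Record<string, unknown>): Promise<string> {
    const action = this.actions.get(name);
    if (!action) throw new Error(`Unknown action: ${name}`);
    return action.callback(args);
  }
}

// Stub video player standing in for the real activity.
const player = {
  position: 0,
  playing: true,
  seekBy(seconds: number) { this.position += seconds; },
  stop() { this.playing = false; },
};

const registry = new ScreenActionRegistry();
registry.register("rewind", {
  description:
    "rewind video by an integer number of seconds, where positive means forward and negative means backward",
  callback: async (args) => {
    player.seekBy(Number(args.seconds));
    return `moved playback by ${args.seconds} seconds`;
  },
});
registry.register("stop", {
  description: "stop video",
  callback: async () => {
    player.stop();
    return "video stopped";
  },
});
```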
Using this interface, anyone could build an app that works hands-free, whether through voice commands or an autonomous assistant.
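On the consumer side, a voice assistant could forward the transcribed command together with the declared action list to an LLM and invoke whichever callback it selects. A sketch under the same assumptions, reusing `registry` from above; `transcribe` and `chooseAction` are placeholder signatures, not real library functions:

```typescript
// Placeholder signatures; real implementations would call a speech-to-text
// service and an LLM. Both names are assumptions made for this sketch.
declare function transcribe(audio: ArrayBuffer): Promise<string>;
declare function chooseAction(
  utterance: string,
  actions: { name: string; description: string }[],
): Promise<{ name: string; args: Record<string, unknown> }>;

// Assistant loop: hand the user's words plus the declared actions to an LLM,
// then invoke whatever it picks.
async function handleVoiceCommand(audio: ArrayBuffer): Promise<void> {
  const utterance = await transcribe(audio); // e.g. "go back ten seconds"
  const choice = await chooseAction(utterance, registry.describe());
  const result = await registry.invoke(choice.name, choice.args);
  console.log(result);
}
```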