Apple AI researchers boast a useful on-device model that significantly outperforms GPT-4

Siri has been attempting to describe images received in Messages when using CarPlay or the Announce Notifications feature for a while now. In typical Siri fashion, the feature is inconsistent and delivers mixed results.

However, Apple is pushing ahead with delivering on the promise of artificial intelligence. In a newly published research paper, Apple's AI experts describe a system in which Siri can do much more than just try to recognize what's in an image. The best part? Apple believes one of the models it uses for this task benchmarks better than GPT-4.

In the paper (ReALM: Reference Resolution As Language Modeling), Apple describes something that could give a voice assistant enhanced by a large language model a useful boost. ReALM takes into account both what's on your screen and which tasks are active. Here is an excerpt from the paper describing the kinds of entities it considers (a rough sketch of how such candidates might be handed to a language model follows the list):

1. On-screen entities: These are the entities that are currently displayed on the user's screen.

2. Conversational entities: These are entities relevant to the conversation. They may come from a previous turn by the user (for example, when the user says “Call Mom,” the contact for Mom would be the relevant entity in question), or from the virtual assistant (for example, when the agent provides the user with a list of places or alarms to choose from).

3. Background entities: These are relevant entities that come from background processes and may not necessarily be a direct part of what the user sees on their screen or their interaction with the virtual agent; for example, an alarm that starts ringing or music playing in the background.
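To make the idea concrete, here is a minimal sketch of how candidate entities from those three sources and a user request might be flattened into a plain-text prompt for a language model to choose from. The class names, fields, and prompt format are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass
from enum import Enum


class EntityCategory(Enum):
    """The three candidate sources described in the paper."""
    ON_SCREEN = "on-screen"            # currently displayed on the user's screen
    CONVERSATIONAL = "conversational"  # from earlier turns with the assistant
    BACKGROUND = "background"          # e.g. a ringing alarm or music playing


@dataclass
class Entity:
    """A hypothetical candidate entity the assistant could act on."""
    name: str
    category: EntityCategory


def build_prompt(entities: list[Entity], utterance: str) -> str:
    """Flatten the candidates and the user's request into a single text
    prompt so a text-only model can pick the referenced entity.
    (Illustrative format only.)"""
    lines = ["Candidate entities:"]
    for i, entity in enumerate(entities, start=1):
        lines.append(f"{i}. [{entity.category.value}] {entity.name}")
    lines.append(f'User request: "{utterance}"')
    lines.append("Answer with the numbers of the entities the request refers to.")
    return "\n".join(lines)


if __name__ == "__main__":
    candidates = [
        Entity("Pharmacy on 5th Ave (555-0134)", EntityCategory.ON_SCREEN),
        Entity("Contact: Mom", EntityCategory.CONVERSATIONAL),
        Entity("Alarm ringing: 7:00 AM", EntityCategory.BACKGROUND),
    ]
    print(build_prompt(candidates, "Call the bottom one"))
```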

If it works well, this sounds like a recipe for a smarter, more useful Siri. Apple also seems confident in its ability to complete such a task with impressive speed. The benchmarking is against OpenAI's GPT-3.5 and GPT-4:

As another baseline, we run the GPT-3.5 (Brown et al., 2020; Ouyang et al., 2022) and GPT-4 (Achiam et al., 2023) variants of ChatGPT, as available on January 24, 2024, with in-context learning. As in our setup, we aim to have both variants predict the list of entities from the available set. In the case of GPT-3.5, which accepts only text, our input consists of the prompt alone; however, in the case of GPT-4, which also has the ability to contextualize on images, we provide the system with a screenshot for the task of on-screen reference resolution, which we find helps substantially improve performance.
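The distinction the paper draws is that the GPT-3.5 baseline sees only a textual prompt, while the GPT-4 baseline also receives a screenshot. As a rough illustration of that setup (not Apple's actual benchmarking code), here is what the two calls could look like with the OpenAI Python SDK; the model names, prompt text, and screenshot path are assumptions for the sake of the example, and an API key is assumed to be configured.

```python
# Sketch of the two baselines described above: a text-only call for GPT-3.5,
# and a text-plus-screenshot call for GPT-4. Illustrative assumptions only.
import base64

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

prompt = (
    "Candidate entities:\n"
    "1. [on-screen] Pharmacy on 5th Ave (555-0134)\n"
    "2. [conversational] Contact: Mom\n"
    'User request: "Call the bottom one"\n'
    "Answer with the numbers of the entities the request refers to."
)

# Text-only baseline: GPT-3.5 accepts only text, so the prompt goes in alone.
text_only = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)

# Multimodal baseline: GPT-4 can also take the screen as an image input.
with open("screenshot.png", "rb") as f:  # hypothetical screenshot file
    screenshot_b64 = base64.b64encode(f.read()).decode()

multimodal = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{screenshot_b64}"}},
        ],
    }],
)

print(text_only.choices[0].message.content)
print(multimodal.choices[0].message.content)
```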

So how does Apple's model work?

We demonstrate large improvements over an existing system with similar functionality across different types of references, with our smallest model obtaining absolute gains of over 5% for on-screen references. We also benchmark against GPT-3.5 and GPT-4, with our smallest model achieving performance comparable to that of GPT-4, and our larger models substantially outperforming it.

Substantially outperforming GPT-4, you say? The paper concludes, in part:

We show that ReaLM outperforms previous approaches, and performs roughly as well as today's state-of-the-art LLM, GPT-4, despite consisting of far fewer parameters, even for on-screen references and despite being purely in the textual domain. It also outperforms GPT-4 for domain-specific user utterances, making ReaLM an ideal choice for a practical reference resolution system that can exist on-device without compromising on performance.

“On-device without compromising on performance” appears to be the key for Apple. The next few years of platform development should be interesting, hopefully starting with iOS 18 and WWDC 2024 on June 10.
