Back to ;login: Online

Evaluating AI Agents at Production Scale: A Multi-Signal Framework

Lessons from Generating Thousands of Mobile Apps Daily
September 12, 2025
Case Study
Authors: Nikita Kryzhanouski
Article shepherded by: Rik Farrow

When Single Metrics Break Down

When you're generating thousands of AI outputs daily, traditional evaluation approaches hit a wall. The problem isn't technical complexity—it's that the metrics we rely on for smaller systems fundamentally don't capture what matters at scale.

Take response latency. Your monitoring dashboard shows the model generated code in 2.3 seconds. Great! But if that code crashes when users try to run it, the speed is meaningless. Users will happily wait ten minutes for something that actually works rather than get instant garbage.

Or consider user ratings. You add thumbs up/down buttons to capture satisfaction. Sounds reasonable, except users are loss-averse by nature. They'll definitely hit dislike when something breaks, but they rarely bother with like when everything works as expected. You end up measuring frustration more accurately than satisfaction, which skews your optimization in the wrong direction.

At Rork App, we learned this lesson while scaling our AI-generated React Native apps from tens to thousands per day. Traditional metrics that felt scientific and objective were actually steering us away from what users needed. What we discovered is that effective evaluation requires multiple perspectives working together, with AI-powered analysis filling the gaps that surveys and simple metrics can't reach.

A Multi-Signal Evaluation Framework

The solution we built combines four different types of signals. Think of them as different lenses for looking at the same problem—each one shows you something the others miss.

1. Behavioral Signals: What Users Actually Do

The most honest signal comes from watching user actions, not asking for opinions. When someone reverts a generated app version, that's a costly decision. They're throwing away time and progress, which means they're genuinely dissatisfied with what they got.

We track these revert events along with context: which model generated the app, how complex the request was, and how long it took before the user gave up. Unlike survey responses, behavioral data is hard to fake because the action itself is costly. Users might tell you they're happy to be polite, but they won't waste their own time to be nice.

The downside is that behavioral signals only capture one side of the story. You see when users are frustrated enough to take action, but plenty of users just tolerate mediocre results without reverting. Others have unrealistically high standards and revert over even minor issues. So while revert rates are a strong signal of failure, they don't tell you much about success.
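As a rough illustration of the revert tracking described above, the events could be aggregated per model as follows. This is a minimal sketch: the record shape, field names, and function names are assumptions made for the example, not Rork's actual schema.

```typescript
// Hypothetical shape of one generated app version; field names are illustrative.
interface GenerationRecord {
  model: string;            // which model generated this version
  reverted: boolean;        // did the user revert it?
  minutesToRevert?: number; // time before the user gave up, if reverted
}

interface RevertSummary {
  generations: number;
  reverts: number;
  revertRate: number; // reverts divided by generations
  medianMinutesToRevert: number | null;
}

// Median of a list of numbers, or null for an empty list.
function medianOf(values: number[]): number | null {
  if (values.length === 0) return null;
  const sorted = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Group records by model and compute revert rate and median time-to-revert.
function summarizeRevertsByModel(records: GenerationRecord[]): Record<string, RevertSummary> {
  const grouped: Record<string, GenerationRecord[]> = {};
  for (const r of records) {
    (grouped[r.model] = grouped[r.model] || []).push(r);
  }
  const summary: Record<string, RevertSummary> = {};
  for (const model of Object.keys(grouped)) {
    const rows = grouped[model];
    const reverted = rows.filter((r) => r.reverted);
    const times = reverted
      .map((r) => r.minutesToRevert)
      .filter((m): m is number => typeof m === "number");
    summary[model] = {
      generations: rows.length,
      reverts: reverted.length,
      revertRate: reverted.length / rows.length,
      medianMinutesToRevert: medianOf(times),
    };
  }
  return summary;
}
```

Reporting the rate rather than the raw count keeps models with different traffic volumes comparable.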

2. AI-Powered Intent Classification: Predicting User Momentum

Here's where AI evaluation of AI outputs gets interesting. Instead of asking users "how satisfied are you?" we analyze their messages to predict "are you making progress toward something real?"

We built a classifier that uses GPT-4 to score every user message on a scale from -10 (about to churn) to +10 (ready to ship). The key insight is that user satisfaction correlates with momentum toward their goals, not with perfect outputs.

Churn Risk Territory:

  • completely_stuck (-10): "Nothing works anymore", "The app won't even start"
  • major_blocker (-7): "Login doesn't work at all", "Can't save any data"  
  • frustrated (-4): "There are bugs everywhere", "Why does nothing work properly?"
  • polish_issues (-2): "Fix this typo", "The button color is wrong"

Neutral Ground:

  • exploring (0): "How do I add authentication?", "What's possible with this tool?"

Progress Territory:

  • iterating (+3): "Make this button bigger", "Change the color scheme"
  • building (+6): "Add a settings screen", "Create user profiles"
  • expanding (+9): "Add a complete payment system", "Build an admin dashboard"
  • shipping (+10): "Add app store screenshots", "Prepare for app store submission"
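For downstream code, the scale above can be captured as a typed lookup table. The sketch below copies the category names and scores from the list; the `territoryOf` and `isIntentCategory` helpers are hypothetical names added for illustration.

```typescript
type IntentCategory =
  | "completely_stuck"
  | "major_blocker"
  | "frustrated"
  | "polish_issues"
  | "exploring"
  | "iterating"
  | "building"
  | "expanding"
  | "shipping";

// Scores as listed in the scale above.
const INTENT_SCORES: Record<IntentCategory, number> = {
  completely_stuck: -10,
  major_blocker: -7,
  frustrated: -4,
  polish_issues: -2,
  exploring: 0,
  iterating: 3,
  building: 6,
  expanding: 9,
  shipping: 10,
};

type Territory = "churn_risk" | "neutral" | "progress";

// Negative scores are churn risk, zero is neutral, positive scores are progress.
function territoryOf(category: IntentCategory): Territory {
  const score = INTENT_SCORES[category];
  if (score < 0) return "churn_risk";
  if (score > 0) return "progress";
  return "neutral";
}

// Type guard for validating a label that came back from a model as a plain string.
function isIntentCategory(value: string): value is IntentCategory {
  return Object.prototype.hasOwnProperty.call(INTENT_SCORES, value);
}
```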

The classifier analyzes three things simultaneously: Where is this user in their journey to build something real? Are they moving forward or stuck? How likely are they to keep using the platform?

What makes this powerful is that it runs in real-time. When a user drops to -4, we can proactively reach out with support before they churn. When someone hits +9, we know they're a candidate for our showcase program. Traditional surveys would require us to interrupt users and hope they respond honestly.

The validation numbers tell the story: intent scores predict 30-day retention with 67% correlation. That's nearly double what we get from satisfaction surveys (34% correlation) and significantly better than pure behavioral metrics like session frequency (41% correlation).
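The article does not say which correlation statistic was used. Assuming a Pearson coefficient between each user's intent score and a 0/1 retention flag (which is equivalent to a point-biserial correlation), the calculation could be sketched like this:

```typescript
// Pearson correlation coefficient between two equally long numeric series.
function pearson(xs: number[], ys: number[]): number {
  if (xs.length !== ys.length || xs.length < 2) {
    throw new Error("need two series of equal length with at least 2 points");
  }
  const n = xs.length;
  const meanX = xs.reduce((a, b) => a + b, 0) / n;
  const meanY = ys.reduce((a, b) => a + b, 0) / n;
  let cov = 0;
  let varX = 0;
  let varY = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX;
    const dy = ys[i] - meanY;
    cov += dx * dy;
    varX += dx * dx;
    varY += dy * dy;
  }
  if (varX === 0 || varY === 0) {
    throw new Error("correlation is undefined when a series has zero variance");
  }
  return cov / Math.sqrt(varX * varY);
}

// Illustrative usage with made-up data: intent scores and whether each
// user was still active 30 days later (1 = retained, 0 = churned).
const exampleScores = [-10, -4, 0, 6, 10];
const exampleRetained = [0, 0, 1, 1, 1];
const exampleCorrelation = pearson(exampleScores, exampleRetained);
```

Running the same function over survey ratings or session frequency against the same retention flags gives directly comparable numbers.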

3. Technical Quality Metrics: The Engineering Health Check

Technical metrics don't predict user happiness directly, but they reveal systematic problems before users complain about them. We track compilation errors, runtime exceptions, generation failures, and library usage mistakes.

The real value here is pattern recognition. If TypeScript property errors spike 15% in one week, that suggests prompt degradation or model drift. Users might not report these issues immediately, but the technical data gives us early warning that something's wrong with our system.
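A week-over-week check of this kind can be very small. The sketch below assumes error rates are already normalized per generation (so traffic growth does not look like a spike) and reads "15%" as a relative increase. Both are assumptions, as are the type and function names.

```typescript
interface ErrorRateSample {
  errorType: string;
  previousRate: number; // errors per generation, previous week
  currentRate: number;  // errors per generation, current week
}

interface ErrorSpike {
  errorType: string;
  relativeChange: number; // 0.2 means a 20% increase over the previous week
}

// Flag error types whose rate rose by at least `threshold` relative to last week.
function findErrorSpikes(samples: ErrorRateSample[], threshold = 0.15): ErrorSpike[] {
  const spikes: ErrorSpike[] = [];
  for (const s of samples) {
    if (s.previousRate === 0) {
      // A brand-new error type has no baseline; treat any occurrence as a spike.
      if (s.currentRate > 0) {
        spikes.push({ errorType: s.errorType, relativeChange: Infinity });
      }
      continue;
    }
    const relativeChange = (s.currentRate - s.previousRate) / s.previousRate;
    if (relativeChange >= threshold) {
      spikes.push({ errorType: s.errorType, relativeChange });
    }
  }
  return spikes;
}
```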

These metrics are also crucial for guiding prompt engineering. After analyzing thousands of errors, we discovered that 94% came from just five error types. That insight directly shaped how we restructured our system prompts to avoid the most common mistakes.
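Finding out how concentrated errors are is a simple Pareto-style calculation: sort error types by count and measure what share the top few account for. A minimal sketch, with hypothetical names and made-up counts in the usage note:

```typescript
// Return the k most frequent error types and the share of all errors they cover.
function topKCoverage(
  counts: Record<string, number>,
  k: number
): { top: string[]; coverage: number } {
  const types = Object.keys(counts).sort((a, b) => counts[b] - counts[a]);
  const total = types.reduce((sum, t) => sum + counts[t], 0);
  const top = types.slice(0, k);
  const covered = top.reduce((sum, t) => sum + counts[t], 0);
  return { top, coverage: total === 0 ? 0 : covered / total };
}
```

Calling `topKCoverage(errorCounts, 5)` on a week of categorized errors shows whether a handful of types dominates, which is the situation where targeted prompt changes pay off most.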

4. Functional Capability Benchmarks: Testing What Actually Matters

The final piece is structured testing of whether generated outputs actually work for their intended purpose. We break this down by category—games need animation and physics, photography apps need camera integration and image processing, social platforms need real-time messaging and user profiles.

For each category, we define what "working" means through boolean evaluation criteria. A photography app passes if the camera component renders without errors, photo capture works, at least three filters are available, filter preview updates in real-time, and edited photos can be saved to the device.
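The photography criteria above can be written down as a list of named boolean checks. In this sketch the observation object is assumed to be filled in by a human tester or an automated harness; how it gets collected is outside the example, and all names are illustrative.

```typescript
// What a tester or harness observed about one generated photography app.
interface PhotoAppObservation {
  cameraRendersWithoutErrors: boolean;
  photoCaptureWorks: boolean;
  filterCount: number;
  filterPreviewIsRealTime: boolean;
  canSaveEditedPhoto: boolean;
}

interface Criterion<T> {
  name: string;
  check: (observation: T) => boolean;
}

// One entry per criterion named in the text above.
const PHOTO_APP_CRITERIA: Criterion<PhotoAppObservation>[] = [
  { name: "camera component renders without errors", check: (o) => o.cameraRendersWithoutErrors },
  { name: "photo capture works", check: (o) => o.photoCaptureWorks },
  { name: "at least three filters are available", check: (o) => o.filterCount >= 3 },
  { name: "filter preview updates in real time", check: (o) => o.filterPreviewIsRealTime },
  { name: "edited photos can be saved to the device", check: (o) => o.canSaveEditedPhoto },
];

// An app passes only if every criterion holds; failures are listed by name.
function evaluateCriteria<T>(
  observation: T,
  criteria: Criterion<T>[]
): { passed: boolean; failures: string[] } {
  const failures = criteria.filter((c) => !c.check(observation)).map((c) => c.name);
  return { passed: failures.length === 0, failures };
}
```

Keeping each check boolean and named makes results comparable across runs and makes the failure list readable without opening the app.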

This is expensive to run comprehensively: testing 30 prompts with 30 checks each means 900 individual checks per run, which requires significant time and effort. But it's the only way to definitively answer "does this system generate apps that actually work?" When we need statistical confidence for major decisions, functional benchmarks provide the rigor that other signals can't match.
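One common way to attach statistical confidence to a benchmark pass rate is a Wilson score interval. The article does not say which method was used, so the sketch below is an assumption. It treats each trial as independent, which is more defensible at the prompt level (30 prompts) than at the level of individual checks within the same app.

```typescript
// Wilson score interval for a binomial proportion.
// z = 1.96 corresponds to an approximate 95% confidence level.
function wilsonInterval(
  successes: number,
  trials: number,
  z = 1.96
): { lower: number; upper: number } {
  if (trials <= 0 || successes < 0 || successes > trials) {
    throw new Error("invalid counts");
  }
  const p = successes / trials;
  const z2 = z * z;
  const denom = 1 + z2 / trials;
  const center = (p + z2 / (2 * trials)) / denom;
  const half =
    (z * Math.sqrt((p * (1 - p)) / trials + z2 / (4 * trials * trials))) / denom;
  return { lower: center - half, upper: center + half };
}
```

With 27 of 30 prompts passing, the interval runs from roughly 0.74 to 0.97. That width is why a small suite can support only coarse conclusions, and why comprehensive runs are reserved for major decisions.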

The trick is knowing when you need that level of precision. Most changes only affect narrow capabilities, so we can test just the relevant functionality instead of running full evaluations. Authentication improvements get auth-specific tests, navigation changes get navigation-focused prompts.
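Targeted runs like these can be driven by tagging each benchmark prompt with the capabilities it exercises and filtering on the capabilities a change touches. A minimal sketch, assuming a hypothetical tagging scheme:

```typescript
interface BenchmarkPrompt {
  id: string;
  prompt: string;
  capabilities: string[]; // e.g. "auth", "navigation", "camera"
}

// Select only the prompts that exercise at least one changed capability.
// An empty change list means the scope is unknown, so the full suite is returned.
function selectPrompts(
  suite: BenchmarkPrompt[],
  changedCapabilities: string[]
): BenchmarkPrompt[] {
  if (changedCapabilities.length === 0) return suite;
  return suite.filter((p) =>
    p.capabilities.some((c) => changedCapabilities.indexOf(c) !== -1)
  );
}
```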

Deep Dive: AI-Powered Intent Analysis

The most innovative component of our framework transforms subjective user satisfaction into objective momentum prediction through structured prompt engineering.

Prompt Architecture

const prompt = `

  # User Journey Intent Classification
  Classify user messages based on their progress toward building a working app and likelihood to continue using the platform.

  ## NEGATIVE MOMENTUM (Churn Risk)

  **completely_stuck** (-10): App is completely unusable, user likely to abandon
  - "Nothing works anymore"
  - "The app won't even start"
  - "I can't do anything"
  - "This is completely broken"

  **major_blocker** (-7): Core functionality is broken, significant friction
  - "Login doesn't work at all"
  - "Can't save any data"
  - "The whole navigation is broken"
  - "Users can't sign up"

  **frustrated** (-4): Multiple issues, user losing confidence in the tool
  - "There are bugs everywhere"
  - "Why does nothing work properly?"
  - "This keeps breaking"
  - "So many errors"

  **polish_issues** (-2): Minor problems, cosmetic fixes needed
  - "Fix this typo"
  - "The button color is wrong"
  - "Text is misaligned"
  - "Small spacing issue"

  ## NEUTRAL (0)

  **exploring** (0): Learning, planning, not yet committed to building
  - "How do I add authentication?"
  - "What's possible with this tool?"
  - "Can you explain how this works?"
  - "What's the best approach?"

  ## POSITIVE MOMENTUM (Retention)

  **iterating** (+3): Improving existing features, making steady progress
  - "Make this button bigger"
  - "Change the color scheme"
  - "Improve the layout"
  - "Add loading states"

  **building** (+6): Adding new functionality, clear forward momentum
  - "Add a settings screen"
  - "Create user profiles"
  - "Add search functionality"
  - "Build a shopping cart"

  **expanding** (+9): Scaling up significantly, high engagement with project
  - "Add a complete payment system"
  - "Create a multi-step onboarding flow"
  - "Build an admin dashboard"
  - "Add real-time notifications"

  **shipping** (+10): Approaching real product, preparing for users
  - "Add app store screenshots"
  - "Set up production environment"
  - "Create onboarding for real users"
  - "Prepare for app store submission"

  ## Classification Principles

  1. **Focus on user journey stage**: Where are they in building a real app?
  2. **Assess momentum**: Are they moving forward or stuck?
  3. **Consider retention risk**: How likely are they to continue using the platform?
  4. **Value proximity**: How close are they to having something useful?

  ## Key Distinctions

  - **Stuck vs Frustrated**: Stuck = specific blocker, Frustrated = losing faith
  - **Building vs Expanding**: Building = new features, Expanding = complex systems
  - **Iterating vs Exploring**: Iterating = improving existing, Exploring = learning

`

Our intent classifier uses a hierarchical prompt structure with three key components:

Category definitions with behavioral anchors: Each intent level includes specific user language patterns and contextual markers. "Completely stuck" maps to phrases like "nothing works anymore" and "I can't do anything," while "shipping" corresponds to "prepare for app store submission" and "set up production environment."

Classification principles: The prompt emphasizes outcome-focused evaluation: proximity to user goals, forward momentum assessment, and retention risk prediction. This grounds subjective language analysis in objective business metrics.

Confidence scoring: Every classification includes certainty measurement, enabling human review of edge cases and continuous prompt improvement.
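The prompt excerpt above does not show the output format, so the following sketch assumes the model is asked to reply with JSON of the form `{"category": "...", "confidence": 0.0 to 1.0}`. Under that assumption, validating the reply and routing uncertain cases to human review could look like this (the 0.7 threshold and all function names are illustrative):

```typescript
const KNOWN_CATEGORIES = [
  "completely_stuck", "major_blocker", "frustrated", "polish_issues",
  "exploring", "iterating", "building", "expanding", "shipping",
] as const;

type Category = typeof KNOWN_CATEGORIES[number];

interface Classification {
  category: Category;
  confidence: number; // assumed to be in the range 0..1
}

// Parse and validate the raw model reply; return null for anything malformed.
function parseClassification(raw: string): Classification | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch (_err) {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;
  const rec = data as Record<string, unknown>;
  const category = rec["category"];
  const confidence = rec["confidence"];
  if (typeof category !== "string" || typeof confidence !== "number") return null;
  if ((KNOWN_CATEGORIES as readonly string[]).indexOf(category) === -1) return null;
  if (confidence < 0 || confidence > 1) return null;
  return { category: category as Category, confidence };
}

// Malformed or low-confidence results go to a human review queue.
function needsHumanReview(result: Classification | null, minConfidence = 0.7): boolean {
  return result === null || result.confidence < minConfidence;
}
```

Treating unparseable replies the same way as low-confidence ones means prompt regressions show up in the review queue instead of silently skewing scores.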

Operational Integration

Intent scores trigger automated workflows: users below -4 receive proactive support outreach, users above +6 get feature expansion suggestions, and users at +9 to +10 enter our product showcase pipeline.

This transforms reactive customer service into predictive user success management.
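The routing rules above amount to a few threshold checks. The sketch below applies them literally (strictly below -4, strictly above +6, and +9 or higher). Whether a boundary score such as -4 itself should trigger outreach is a tuning decision, and the workflow names are hypothetical.

```typescript
type Workflow =
  | "support_outreach"
  | "feature_expansion_suggestions"
  | "showcase_pipeline";

// Map an intent score to the workflows it triggers. A score can trigger
// more than one workflow: +9 is both "above +6" and in the showcase range.
function workflowsFor(score: number): Workflow[] {
  const workflows: Workflow[] = [];
  if (score < -4) workflows.push("support_outreach");
  if (score > 6) workflows.push("feature_expansion_suggestions");
  if (score >= 9) workflows.push("showcase_pipeline");
  return workflows;
}
```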

What We Tried That Didn't Work

Our path to this framework went through several dead ends that taught us important lessons about what evaluation can and can't do.

Human taste at small scale was our first approach. Engineers would test a few apps and declare which model "felt better." This worked fine when we generated dozens of apps per day, but collapsed as we scaled. "Better" had no consistent definition—sometimes it meant fewer crashes, other times it meant the app looked nicer. Human judgment varies wildly between evaluators and doesn't scale beyond small teams.

User voting seemed like the obvious solution. We built A/B testing infrastructure with like/dislike buttons, thinking we'd solved the evaluation problem by measuring user preference directly. The flaw revealed itself quickly: people are loss-averse. Users whose apps crashed would definitely hit dislike, but users whose apps worked perfectly often couldn't be bothered with like. We were systematically over-measuring negative experiences and under-measuring positive ones.

Pure error tracking felt more scientific. We categorized every TypeScript error, bundle failure, and generation timeout. This gave us fascinating insights—agentic AI clearly outperformed single-call models on complex projects, while simple models excelled at contained edits. But once we started optimizing prompts, the data turned slippery. One error type would drop 4% while another rose 4%. Net improvement was unclear, and neither our team nor users could confidently say the experience had actually gotten better.

Revert rates were more promising because they measured costly user actions rather than opinions. But the signal was incomplete. Some users tolerated mediocre results and never reverted. Others had unrealistically high standards and reverted over minor cosmetic issues. Reverts told us clearly when we'd failed, but they didn't tell us when we'd succeeded.

Each approach captured something valuable but insufficient. The breakthrough was realizing we didn't need to find the perfect metric—we needed to combine complementary signals that covered each other's blind spots.

Implementation Guidelines for Production Teams

When Each Signal Type Matters Most

Behavioral monitoring should be always-on once you're generating more than a few hundred outputs daily. It's your canary in the coal mine: when revert or abandonment rates spike, something's wrong with your system even if you don't know what yet.

Intent classification becomes critical when user success depends on sustained engagement over multiple sessions. If people need to iterate and refine outputs to achieve their goals, understanding their momentum is essential. For simple one-shot tasks, it's probably overkill.

Technical metrics are indispensable for engineering-driven optimization. They guide prompt engineering, reveal systematic failures, and help you understand why certain improvements work. But they're means, not ends—optimize technical metrics to improve user outcomes, not for their own sake.

Functional benchmarks are your heavyweight evaluation tool. Use them for major releases, competitive analysis, and when you need statistical confidence for important decisions. They're too expensive to run continuously, but irreplaceable when you need definitive answers.

Resource Allocation and Team Structure

Once your AI system generates more than 1,000 outputs daily, assign dedicated evaluation engineering resources. Shared responsibility creates bottlenecks and inconsistent measurement practices. Budget roughly 15-20% of your AI development resources for evaluation infrastructure—this includes building classification models, automating data pipelines, and developing statistical analysis capabilities.

For tooling, use off-the-shelf platforms like Langfuse or Weights & Biases for basic tracking and visualization, but expect to build custom components for domain-specific evaluation logic and multi-signal orchestration. Every production AI application has unique success criteria that generic tools can't capture.

Beyond Mobile Apps: Broader Applications

This multi-signal approach generalizes well to other production AI applications. Code generation platforms can combine code acceptance rates (behavioral), developer problem-solving progression analysis (intent), compilation success (technical), and task completion rates (functional). Content creation systems might track engagement patterns, audience response prediction, quality metrics, and format-specific capability testing.

The core principle stays the same: combine economically honest behavioral signals, AI-powered outcome prediction, objective technical measurement, and structured capability testing. The specific metrics change, but the framework of multiple complementary perspectives remains constant.

The Future of AI Evaluating AI

As AI systems handle more complex and diverse tasks, evaluation must evolve from measuring accuracy to predicting outcomes. The most promising directions involve cross-modal evaluation that combines text, code, visual, and behavioral signals; predictive evaluation that uses early interaction patterns to forecast long-term success; and adaptive evaluation that dynamically adjusts measurement strategies based on user behavior.

The fundamental insight driving this evolution is that effective evaluation measures user outcomes, not system outputs. What matters isn't whether the AI produces syntactically correct code or grammatically proper text—it's whether users achieve meaningful goals with those outputs.

Conclusion: Evaluation as Strategic Leverage

Production AI evaluation isn't about finding perfect metrics—it's about building systematic approaches that acknowledge uncertainty while providing actionable insights. Single metrics, whether technical, behavioral, or survey-based, capture important signals but miss the complexity of real-world success.

Our multi-signal framework demonstrates that AI can effectively evaluate AI at scale through structured analysis and automated outcome prediction. By combining behavioral honesty, predictive intent analysis, technical monitoring, and functional validation, you can build evaluation systems that guide optimization toward user success rather than metric gaming.

The most important insight for production teams: evaluation is strategic leverage. What you measure determines what you optimize. Focus on user outcomes rather than system convenience, combine complementary signals to eliminate blind spots, and invest in AI-powered analysis to scale beyond human evaluation limits. The question isn't whether to invest in comprehensive evaluation—it's whether to build systems that predict user success or remain trapped optimizing for metrics that don't actually matter.

Article Categories: 
Programming
AI/ML
Last updated September 12, 2025
Authors: 

Nikita Kryzhanouski: Full-stack engineer and AI systems architect with 7+ years building production applications from zero to millions of users. Currently scaling AI-powered mobile app generation at Rork App, processing thousands of outputs daily.


© USENIX 2025