BizIdea

INFERENCE CHIP ai-infra Scan 2026-05-13 to 2026-05-13 Run 20260514080039

Scheduling middleware that batches and routes long-context LLM jobs intelligently, cutting inference costs 60% without new hardware.

ML engineering teams building production agentic systems and RAG pipelines over large document corpora face ballooning inference costs and unpredictable latency as context windows grow beyond 32k tokens. Existing serving frameworks like vLLM and TGI were optimized for latency-first online serving: they provide continuous batching, but lack native support for deadline-aware batching, sharding, and prioritization of throughput-bound long-context jobs.

Overall rating 3.8 / 5.0
  1. Market · 4/5
    $600.0M TAM growing 19.2% CAGR with five mapped competitors supports a meaningful AI infra wedge, but not yet a billion-dollar category.

  2. Differentiation · 4/5
    A model-agnostic, drop-in scheduler that routes across clouds and self-hosted endpoints is sharper than single-vendor stacks, but still copyable.

  3. Execution · 3/5
    Five planned early hires, clear pilot milestones, 72% gross margin, 7.1x LTV/CAC, and 7-month payback are strong, but four model flags keep risk elevated.

  4. Timeliness · 4/5
    Fractile's fresh $220M round, a cited 30x throughput gap, and four recent signals make the need urgent, though the core trigger is still single-source.

Why now

  1. Fractile's $220M raise from Accel and Founders Fund in May 2026 establishes top-tier investor consensus that inference throughput is the new AI infrastructure battleground, creating a buyer narrative that software-layer scheduling tools can ride immediately.
  2. Fractile's reported jump from 40 to 1,200 tokens per second quantifies a 30x throughput gap; software scheduling can capture roughly half that gap on existing GPUs today while Fractile hardware remains 12–18 months out.
  3. Some long-context inference jobs already take weeks on conventional chips per Fractile's own pitch disclosure—a failure mode enterprise AI teams are hitting right now, creating immediate budget urgency for software fixes.
  4. Investor CapEx is explicitly rotating from training to inference infrastructure, as evidenced by Fractile's post-training hardware positioning—software scheduling tools are the first layer ML teams will budget for while specialized chips are still pre-commercial.

Catalyst. Fractile's $220M raise in May 2026 crystallizes investor and operator consensus that inference throughput is the new frontier; ML teams are actively seeking software-layer fixes while next-generation hardware is still 12–18 months from general availability.

The idea

The product is a cloud-hosted inference scheduling control plane deployed as a thin proxy in front of any OpenAI-compatible endpoint. When a request arrives, the scheduler profiles the job by context length, priority tier, and deadline, then dynamically batches it with compatible in-flight requests and selects the cheapest endpoint that meets the latency SLA. Teams instrument their code with a single SDK call and immediately see per-job cost breakdowns, queue depth, and throughput metrics in a dashboard. The scheduler's batching engine achieves GPU utilization rates above 80% compared to the industry-typical 30–40% seen in over-provisioned setups. For jobs with relaxed deadlines—overnight batch analysis, large-corpus ingestion—the scheduler automatically downgrades to preemptible compute and reduces effective per-token cost by 60–70%.
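The request path described above (profile the job, then pick the cheapest endpoint that still meets the latency SLA) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the product's actual scheduler: the `Endpoint` and `Job` types, the endpoint names, the prices, and the throughput figures are all hypothetical, latency is approximated as context tokens divided by sustained throughput, and dynamic batching is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Endpoint:
    name: str                  # hypothetical label, not a real provider SKU
    usd_per_1k_tokens: float   # assumed blended price
    tokens_per_second: float   # assumed sustained throughput for this job shape
    preemptible: bool = False


@dataclass(frozen=True)
class Job:
    job_id: str
    context_tokens: int
    priority_tier: str         # "interactive" or "batch" (assumed tiers)
    deadline_seconds: float


def estimated_seconds(job: Job, ep: Endpoint) -> float:
    """Crude runtime estimate: tokens divided by sustained throughput."""
    return job.context_tokens / ep.tokens_per_second


def estimated_cost(job: Job, ep: Endpoint) -> float:
    """Cost in USD at the endpoint's per-1k-token price."""
    return job.context_tokens / 1000 * ep.usd_per_1k_tokens


def route(job: Job, endpoints: list[Endpoint]) -> Optional[Endpoint]:
    """Return the cheapest endpoint whose estimated runtime meets the deadline.

    Interactive jobs are never sent to preemptible capacity.
    Returns None when no endpoint qualifies.
    """
    eligible = [
        ep for ep in endpoints
        if estimated_seconds(job, ep) <= job.deadline_seconds
        and not (ep.preemptible and job.priority_tier == "interactive")
    ]
    return min(eligible, key=lambda ep: estimated_cost(job, ep), default=None)


ENDPOINTS = [
    Endpoint("on_demand_gpu", usd_per_1k_tokens=0.010, tokens_per_second=2000),
    Endpoint("spot_gpu", usd_per_1k_tokens=0.004, tokens_per_second=500, preemptible=True),
]

overnight = Job("doc-batch-1", 100_000, "batch", deadline_seconds=8 * 3600)
live = Job("agent-call-1", 100_000, "interactive", deadline_seconds=60)

print(route(overnight, ENDPOINTS).name)  # spot_gpu
print(route(live, ENDPOINTS).name)       # on_demand_gpu
```

With these illustrative prices the overnight job lands on preemptible capacity at 60% lower cost, while the interactive job stays on on-demand capacity because preemptible endpoints are excluded for that tier and the slower endpoint would also miss the 60-second deadline.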

What's different. Existing serving frameworks and gateways (vLLM, TGI, LiteLLM) optimize for median online latency and treat every request as equally urgent; this product is the first to apply throughput-aware batch scheduling with SLA-tiered prioritization designed specifically for long-context jobs. Unlike cloud providers' native batch APIs, the scheduler is model-agnostic and works across any OpenAI-compatible endpoint including self-hosted models on private infrastructure. The percentage-of-savings pricing model aligns incentives with engineering teams who must justify new tooling to a CFO without upfront commitment risk.

Startup thesis
Beachhead. Series A–C AI-native startups (10–80 engineers) running production document-analysis or code-review agents that regularly process 64k–200k token contexts and are already spending more than $30k per month on inference
Wedge. A drop-in Python SDK plus cloud control plane that intercepts inference requests, profiles them by context length, batches compatible jobs, and routes to the cheapest available endpoint, reducing per-job cost 50–70% with a sub-hour integration
Non-obvious insight. The inference bottleneck is not purely hardware-limited: it is also a scheduling problem. Long-context LLM jobs have throughput-bound, batch-friendly characteristics much like those of HPC workloads, yet every major inference serving framework treats them as low-latency online requests. Fractile's raise is a hardware bet; the complementary software scheduling layer (a scheduler that understands job shape, context length, and SLA tiers) is the whitespace a software-only startup can capture now without fabricating chips.
Venture-scale path. Start with the scheduling SDK for agentic inference teams, expand into a full inference FinOps platform (cost attribution, budget alerts, capacity planning), then offer a managed multi-cloud inference broker that abstracts across Nvidia, Fractile, and cloud TPUs as heterogeneous inference hardware proliferates.
Target user
Primary user. ML infrastructure engineers at Series A–C AI-native startups running production agentic pipelines with context windows exceeding 32k tokens
Secondary user. Platform engineering leads at mid-market enterprises adopting LLM-backed document processing or legal review workflows
Economic buyer. VP of Engineering or Head of ML Infrastructure
Go-to-market seed
First customer. Head of ML Infrastructure at a 30–60-person Series B AI-native startup whose core product is a document-review or code-analysis agent, currently spending $50k–$120k/month on inference across AWS Bedrock or Azure OpenAI, who has already tried vLLM tuning and is still hitting latency spikes on 100k+ token jobs
Buying trigger. First month the inference bill exceeds $50k, or first time an SLA breach on a long-context job causes a customer escalation
Current alternative. Manual GPU cluster over-provisioning combined with vLLM or TGI self-hosting, ad-hoc queue management scripts, and periodic renegotiation of cloud GPU reserved instances
Switching reason. A 30-minute SDK integration that immediately surfaces job-level cost attribution and cuts the monthly bill by 50% beats months of devops toil with no new hardware procurement required
Pricing hypothesis. Percentage-of-savings pricing (20% of documented inference cost reduction) during a 90-day pilot, transitioning to a monthly platform fee of $2,000–$8,000 per cluster plus per-token overage above a committed volume
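
The pricing hypothesis reduces to two small formulas, sketched below. The 20% savings share and the $2,000 per-cluster fee come from the text above; the function names, the committed volume, and the overage rate of $0.50 per million tokens are hypothetical placeholders, since the document does not state them.

```python
def pilot_fee(baseline_spend: float, actual_spend: float, share: float = 0.20) -> float:
    """Savings-share fee: a fraction of documented savings, floored at zero."""
    savings = max(baseline_spend - actual_spend, 0.0)
    return share * savings


def platform_fee(clusters: int, fee_per_cluster: float, tokens_used: int,
                 committed_tokens: int, overage_usd_per_million: float) -> float:
    """Flat per-cluster fee plus overage on tokens above the committed volume."""
    overage_tokens = max(tokens_used - committed_tokens, 0)
    return clusters * fee_per_cluster + overage_tokens / 1_000_000 * overage_usd_per_million


# A $50k monthly bill cut by 50% yields $25k of savings and a $5k pilot fee.
print(pilot_fee(50_000, 25_000))

# One cluster at the low end of the fee range, 200M tokens over commitment
# at an assumed $0.50 per million tokens.
print(platform_fee(1, 2_000, 1_200_000_000, 1_000_000_000, 0.50))
```

At the $50k buying trigger, the pilot fee and the low end of the later platform fee are the same order of magnitude, which is one way to sanity-check the transition between the two pricing phases.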

Jobs to be done

Job | Current alternative | Success metric
When my inference bill spikes above budget, help me understand which jobs are driving cost so I can selectively defer low-priority batch jobs and protect latency SLAs on customer-facing requests. | Manual review of cloud cost explorer dashboards with no per-job cost granularity | Monthly inference spend reduced by 50% with no degradation to P95 latency on SLA-critical requests
When I need to process 500 documents overnight with 100k-token contexts each, help me schedule and batch the workload intelligently so I can complete the run without provisioning additional GPU capacity. | Over-provisioning H100 instances and manually queuing jobs with vLLM's default scheduler | Batch run completes on existing infrastructure with GPU utilization above 75% and per-document cost reduced by 60%
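
The first job above depends on per-job cost attribution. A minimal sketch of that aggregation follows, assuming a hypothetical request log with `workload`, `tier`, `tokens`, and `usd_per_1k` fields; the field names and prices are illustrative and not taken from any real SDK.

```python
from collections import defaultdict

# Hypothetical request log; field names and prices are illustrative.
REQUESTS = [
    {"workload": "contract-review", "tier": "batch", "tokens": 100_000, "usd_per_1k": 0.01},
    {"workload": "contract-review", "tier": "batch", "tokens": 120_000, "usd_per_1k": 0.01},
    {"workload": "support-chat", "tier": "interactive", "tokens": 8_000, "usd_per_1k": 0.01},
]


def cost_by(records, key):
    """Sum request cost in USD, grouped by the given record field."""
    totals = defaultdict(float)
    for r in records:
        totals[r[key]] += r["tokens"] / 1000 * r["usd_per_1k"]
    return dict(totals)


by_workload = cost_by(REQUESTS, "workload")
by_tier = cost_by(REQUESTS, "tier")

# Share of spend that sits in the batch tier and could therefore be deferred.
deferrable_share = by_tier.get("batch", 0.0) / sum(by_tier.values())

print(by_workload)
print(round(deferrable_share, 3))
```

Grouping by tier as well as by workload gives the share of spend that is a candidate for deferral, which is the number the buyer needs before deciding what to move off the latency-critical path.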
Long-Context Inference Scheduling Control Plane
flowchart LR
  Client[ML Engineer / Agent App] --> Proxy[Scheduling Proxy SDK]
  Proxy --> Profiler[Job Profiler\nContext Length and Priority]
  Profiler --> Batcher[Dynamic Batcher\nSLA-Tier Queue]
  Batcher --> Router[Endpoint Router]
  Router --> GPU1[Current GPUs\nH100 / A100]
  Router --> GPU2[Preemptible Compute\nSpot / Batch]
  GPU1 --> Dashboard[FinOps Dashboard\nCost Attribution]
  GPU2 --> Dashboard
Idea scorecard: average 4.2 / 5 · 5 axes
Signal 4/5 · Pain 5/5 · Wedge 5/5 · Defense 3/5 · Scale 4/5
  • Signal · 4/5 Fractile's $220M raise from top-tier VCs directly validates inference throughput as the binding constraint; single-source evidence limits confidence but the signal is large and well-funded.
  • Pain · 5/5 Inference costs and latency are causing budget escalations and SLA breaches at AI-native startups now; Fractile's own "weeks on conventional chips" disclosure confirms operational severity before new hardware ships.
  • Wedge · 5/5 The drop-in SDK with a 30-minute integration time and immediate cost-attribution dashboard is a concrete, testable wedge with a clear value moment tied directly to the first monthly bill reduction.
  • Defense · 3/5 Scheduling algorithms are imitable; defensibility builds through proprietary job-shape datasets, FinOps integration lock-in, and eventual multi-hardware routing advantage as Fractile and other accelerators ship.
  • Scale · 4/5 The inference FinOps and scheduling TAM tracks the LLM inference market itself; if inference spend reaches $50B+ by 2028, a 20% savings take-rate on even 2% of that market implies $200M+ ARR potential.
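
The Scale estimate is sensitive to what the 20% is applied to. The sketch below reproduces the $200M figure when the take-rate is read as applying to managed spend, and shows the lower result when it applies only to realized savings. The function name is hypothetical, and the 50% and 70% savings rates are borrowed from the cost-reduction range claimed elsewhere in this report.

```python
from typing import Optional


def arr_potential(market_usd: float, share_managed: float, take_rate: float,
                  savings_rate: Optional[float] = None) -> float:
    """ARR if take_rate applies to managed spend, or to savings when savings_rate is given."""
    managed_spend = market_usd * share_managed
    base = managed_spend if savings_rate is None else managed_spend * savings_rate
    return take_rate * base


on_spend = arr_potential(50e9, 0.02, 0.20)                # take-rate on managed spend
on_savings_low = arr_potential(50e9, 0.02, 0.20, 0.50)    # take-rate on 50% savings
on_savings_high = arr_potential(50e9, 0.02, 0.20, 0.70)   # take-rate on 70% savings

print(f"{on_spend / 1e6:.0f}M, {on_savings_low / 1e6:.0f}M, {on_savings_high / 1e6:.0f}M")
```

Under the savings-based reading, which matches the percentage-of-savings pricing hypothesis, the same inputs give roughly $100M to $140M rather than $200M.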
Business model canvas
Key partners
  • GPU cloud providers (AWS, Azure, GCP) for preemptible and reserved compute access
  • Model API providers (OpenAI, Anthropic, Mistral) for endpoint compatibility testing
  • ML framework communities (vLLM, LiteLLM) for open-source ecosystem distribution
Key activities
  • Developing and iterating the job profiling and batching engine
  • Integrating with GPU cloud providers and model hosting APIs
  • Building cost attribution and FinOps reporting pipeline
Key resources
  • Scheduling algorithm IP and batch-optimization engine for long-context job shapes
  • OpenAI-compatible proxy infrastructure with multi-region low-latency endpoints
  • ML engineering team with deep inference serving and GPU scheduling expertise
Value propositions
  • 50–70% reduction in inference cost on long-context jobs with no hardware change
  • Sub-hour integration via drop-in Python SDK proxying any OpenAI-compatible endpoint
  • Per-job cost attribution and FinOps dashboard replacing opaque cloud inference bills
  • SLA-tiered job scheduling that eliminates multi-hour queue spikes on batch workloads
Customer relationships
  • Self-serve SDK onboarding with a 30-day free pilot
  • Dedicated ML infra success manager for accounts above $3k per month
  • Community Slack for ML engineers sharing scheduling configurations and benchmarks
Channels
  • Direct outbound to ML infra leads at AI-native startups through LinkedIn and AI Slack communities
  • Open-source long-context inference cost profiling tool as a top-of-funnel lead magnet
  • Product-led growth through free SDK tier with usage-based upgrade triggers
Customer segments
  • Series A–C AI-native startups running production agentic pipelines with 64k+ token contexts
  • Mid-market enterprises with LLM-backed document processing spending $30k+/month on inference
Cost structure
  • Cloud compute for scheduling control plane and proxy infrastructure
  • ML engineering salaries for scheduling algorithm and inference serving expertise
  • Customer success and sales costs for direct enterprise sales motion
Revenue streams
  • Percentage-of-savings pricing (20% of documented cost reduction) during pilot phase
  • Monthly platform fee of $2,000–$8,000 per cluster after pilot conversion
  • Per-token overage charges above committed monthly volume

Market

Market sizing
TAM (total addressable) $600.0M · SAM (serviceable available) $72.0M · SOM (serviceable obtainable) $5.4M
Market sizing overview
TAM $600.0M Bottom-up estimate: ~20,000 global organizations likely to sustain production GenAI workloads heavy enough to care about scheduler economics × estimated $30k annual control-plane budget; cross-checks to a 2025 AI inference market already above $100B, implying a scheduler layer can be a sub-1% software slice.
SAM $72.0M Estimate: ~3,000 reachable North America/Europe AI-native and enterprise teams with long-context document/code workloads today × ~$24k annual budget for optimization/control software.
SOM $5.4M Year-3 reachable case: 150 customers × ~$36k average annual spend from direct sales plus open-source-led conversion into high-spend inference teams.
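
Each of the three estimates above is an account count multiplied by an annual budget. A short sketch that reproduces them from the stated inputs (the function name is illustrative):

```python
def bottom_up(accounts: int, annual_budget_usd: int) -> int:
    """Bottom-up market size: number of accounts times annual spend per account."""
    return accounts * annual_budget_usd


tam = bottom_up(20_000, 30_000)  # global orgs x control-plane budget
sam = bottom_up(3_000, 24_000)   # reachable NA/Europe teams x optimization budget
som = bottom_up(150, 36_000)     # year-3 customers x average annual spend

print(tam, sam, som)
print(f"SAM/TAM = {sam / tam:.1%}, SOM/SAM = {som / sam:.1%}")
```

The ratios make the funnel explicit: the serviceable market is 12% of TAM, and the year-3 obtainable case is 7.5% of SAM.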

Executive takeaways

  • Long-context inference is now a real operations problem, but incumbents mostly offer point optimizations—batch APIs, prompt caching, or provider-specific serving stacks—rather than a neutral scheduler that classifies jobs by context length, SLA, and cheapest viable endpoint [18][2][4][40].
  • The best initial buyer is still the ML infra lead already paying for production inference, because enterprise adoption is moving from pilots to scaled deployments while organizations still struggle to fully scale experiments and governance [11][13].
  • Competition is intense but fragmented: hyperscalers sell throughput reservations, managed platforms sell faster inference on their own clouds, and open-source stacks expose primitives; the whitespace is cross-provider cost attribution plus SLA-aware routing for long-context jobs [1][30][9][16][28].
  • The technology thesis is credible because the literature already shows meaningful gains from continuous batching, prefix caching, and prefill/decode disaggregation; a startup can commercialize orchestration before custom chips like Fractile or Cerebras become broadly standard [37][38][39][12].
  • Regulatory friction is manageable but non-trivial: buyers routing sensitive documents across multiple endpoints will need auditable governance, regional controls, and data-protection posture rather than pure latency wins [31][14][25][29].

Market definition

Control-plane software for long-context LLM inference that sits above model APIs, self-hosted runtimes, and managed inference clouds. It profiles job shape, queues compatible requests, applies batch/caching-aware routing, and attributes spend by workload. The market excludes general model hosting and broad observability unless they explicitly optimize long-context batch economics and cross-provider scheduling [2][30][35][41].

Customer and buyer

Primary users are ML infrastructure engineers and platform teams operating production document-analysis, code-analysis, or agentic workflows with large prompts and variable deadlines. The economic buyer is usually a Head of ML Infrastructure, VP Engineering, or central platform owner responsible for throughput, latency, and cloud GPU spend [11][13][9].

Buying triggers

  • Monthly inference spend becomes material enough that batch, cache, and reservation discounts are worth operationalizing systematically rather than manually [1][29][4][16][36].
  • Long-context jobs or overnight document runs cause visible queueing, slow decode speed, or unstable tail latency on current vLLM/TGI-based stacks [19][9][5].
  • Leadership sees inference as the next infrastructure bottleneck and budgets for optimization before next-generation hardware is broadly deployed [18][22][7][26].

Willingness to pay

The market already accepts explicit price discrimination for optimization features. Anthropic charges cache reads at a fraction of base input cost, AWS and Azure both advertise 50% batch discounts, Fireworks discounts cached and batch tokens, and Together markets batch inference at 50% lower cost. That creates a credible budget envelope for a scheduler that proves savings beyond what one provider can offer alone [4][1][29][16][36].

Category dynamics

Growth signal 19.2% CAGR

Tailwinds

  • Enterprise GenAI adoption is moving from pilots to scaled programs, increasing the number of organizations with real inference operations to optimize.
  • Infrastructure capital is rotating toward inference, not just training, as shown by Fractile, Groq, Baseten, and SambaNova/Intel signals.
  • Clouds and vendors now explicitly price batch, cache, and throughput options, which normalizes the buyer conversation around savings and workload classes.

Headwinds

  • A meaningful share of easy savings can already be captured with provider-native batch or caching features.
  • Sophisticated buyers can self-build on open-source stacks, reducing willingness to pay for undifferentiated routing logic.
  • Governance and data-protection requirements can limit endpoint choice for the exact document-heavy workloads that most need cost optimization.

Validation signals

  • Fractile’s $220M financing explicitly argues inference latency is the next bottleneck after training.
  • Baseten and Groq both raised large rounds around inference infrastructure, indicating continued capital formation in the category.
  • Cloud vendors and managed platforms now advertise batch and cache discounts directly, showing buyers already optimize around workload class rather than raw model quality alone.
  • Community and operator materials still surface long-context performance pain and continuous-batching gains, suggesting the operational problem is not abstract.

Regulatory & technical constraints

  • Batch APIs often exclude interactive features like tool calling or structured output, which limits where a scheduler can defer work without changing application behavior.
  • Prompt caching gains depend on stable prefixes and sometimes session affinity or replica locality, so software scheduling must understand workload shape, not just price tables.
  • Throughput reservations and regional deployment modes are provider-specific, complicating a neutral control plane that must normalize quotas, latencies, and residency options.
  • Sensitive enterprise document flows need auditable governance and data-protection controls when requests may cross regions or model providers.
Long-context inference control landscape
[Figure: quadrant chart. Axes: cross-provider neutrality (low to high) and optimization depth (low to high). Q1 is marked as the winning zone. Plotted: Proposed startup, AWS Bedrock, Baseten, Fireworks AI, Together AI, LiteLLM.]

Competition

The field breaks into four groups. Hyperscalers (AWS, Azure, Vertex) sell batch, caching, and reserved throughput inside one cloud [1][30][21]. Managed inference platforms like Baseten, Fireworks, and Together optimize latency and cost, but within their own infrastructure and commercial incentives [9][16][36]. Open-source and orchestration layers like Anyscale (Ray Serve), vLLM, LiteLLM, BentoML, and KServe expose primitives yet still push integration and policy work onto the buyer [6][41][28][10][27]. Specialized hardware vendors like Groq, Cerebras, SambaNova, and Fractile raise the performance ceiling but also increase endpoint heterogeneity, which strengthens the case for a neutral routing layer [22][12][26][18].

Competitor | Stage | Wedge | Pricing | Strength | Weakness vs. us
AWS Bedrock | incumbent | Native batch inference, cache pricing, and reserved/provisioned cloud controls inside AWS. | Usage-based token pricing with 50% lower batch pricing on supported models; provisioned throughput via account team. | Trusted enterprise procurement path with integrated logging, quotas, and adjacent AWS services. | Primarily optimizes workloads that stay on AWS, not a neutral broker across clouds, self-hosted stacks, and new hardware.
Baseten | scale-up | Managed inference platform combining infrastructure and runtime optimizations for reliability, speed, and cost. | Public token pricing for model APIs plus compute-based dedicated deployments and enterprise plans. | Deep production focus, self-host options, and strong inference branding backed by fresh capital. | Wants customers on Baseten rather than routing across every endpoint the buyer already uses.
Fireworks AI | scale-up | Fast serverless and deployment-based inference with public cached-token and batch discounts. | Public per-token serverless pricing, 50% cached-input discount, 50% batch discount, and GPU-hour on-demand pricing. | Clear speed/cost messaging and prompt-caching mechanics for repeated-prefix workloads. | Still a single-vendor execution venue rather than an optimizer across clouds and self-hosted clusters.
Together AI | scale-up | Open-source model cloud with serverless, dedicated, and batch inference under one commercial umbrella. | Public per-token pricing with batch inference marketed at 50% lower cost for many models. | Broad model catalog and strong appeal to AI-native teams that prefer open models. | Economic incentive is to land more traffic on Together rather than arbitrage between providers.
Anyscale | scale-up | Ray Serve orchestration plus vLLM-based serving for teams willing to run a more customizable platform. | Custom / enterprise-led pricing not publicly detailed on the cited serving overview pages. | Strong open-source credibility and a flexible orchestration story for advanced teams. | Heavier platform adoption path than a drop-in scheduler and still leaves cost attribution policy work to the customer.

Why incumbents do not win by default

  • Cloud platforms. AWS, Azure, and Vertex increasingly expose batch and reserved-throughput controls, but those controls are cloud-native and do not solve cross-provider scheduling or hardware-neutral FinOps by default.
  • Managed inference platforms. Baseten, Fireworks, and Together already monetize speed and cost-efficiency, but they win by keeping workloads on their own stack rather than brokering the cheapest endpoint across many stacks.
  • Open source and in-house. vLLM, LiteLLM, Ray Serve, BentoML, and KServe give sophisticated teams the raw pieces, yet a buyer still has to design routing policy, savings attribution, and reliability operations internally.
  • Specialized inference hardware. Groq, Cerebras, SambaNova, and Fractile can improve tokens-per-second, but they do not eliminate the need to classify workloads, arbitrate among endpoints, or explain spend by job type.

Business plan

Long Context Inference Scheduler is a control plane for AI teams whose document-analysis and code-analysis workloads now hit long-context inference cost and queueing limits before they can justify new hardware. The first customer is a 10-80 engineer Series A-C AI-native startup already spending roughly $50k-$120k per month on inference and seeing either a budget shock or an SLA breach on 64k-200k token jobs. The product wedges in as a drop-in SDK and proxy that classifies requests by context length, priority, and deadline, then batches and routes deferrable jobs to the cheapest endpoint that still meets policy. Research supports a focused starting market rather than a broad inference platform story, with an estimated $600.0M TAM, $72.0M SAM, and $5.4M year-3 SOM for the initial long-context control-plane category. The buyer case is credible because clouds, model vendors, and managed inference platforms already train customers to pay for batching, caching, and throughput optimization, but most tools remain provider-specific or require substantial internal platform work. The deliberate strategy is to win one narrow workflow first, prove incremental savings after native provider optimizations are enabled, and only then expand into broader inference FinOps and multi-hardware brokerage. The main reasons to stay cautious are intense substitution risk from hyperscalers and open source, governance friction for sensitive document routing, and limited direct customer evidence beyond the research corpus and one headline catalyst around Fractile's financing. This should therefore be treated as a strong operating thesis with clear falsification tests, not yet as a de-risked infrastructure investment.

Problem

  • AI-native teams running 64k-200k token document or code-analysis jobs on vLLM, TGI, or cloud model APIs end up overprovisioning GPUs, absorbing queue spikes, and losing cost visibility because current serving stacks treat long-context workloads like low-latency requests.
  • Provider-native batch, caching, and provisioned-throughput controls help inside one stack, but buyers still lack a neutral system that decides which jobs are deferrable, which endpoint is cheapest under policy, and how spend maps back to workload-level business value.

Solution

  • Deploy a thin proxy plus Python SDK in front of OpenAI-compatible endpoints so each request is profiled by context length, SLA tier, and deadline before execution.
  • Use shadow mode first, then controlled routing to batch compatible long-context jobs, steer overnight or low-priority work toward cheaper capacity, and expose job-level cost attribution that proves savings beyond native vendor features.

Why we win

  • The beachhead is a narrow but urgent workflow where the buyer already owns a visible monthly bill and can measure savings, queue reduction, and deployment speed faster than in a broad platform sale.
  • A neutral scheduler can sit above AWS, Azure, self-hosted vLLM stacks, and emerging hardware instead of forcing workload migration onto one vendor's commercial stack.
  • Proprietary telemetry on context shape, deadline slack, cache reuse, and realized savings can compound into a routing and policy moat that open-source primitives do not accumulate by default.
Strategic choices
Beachhead. Series A-C AI-native startups with 10-80 engineers running production document-analysis or code-analysis agents that regularly process 64k-200k token contexts and already spend more than $30k per month on inference.
Wedge rationale. This slice feels pain immediately, has a technical buyer who can install a proxy quickly, and has enough deferrable or overnight work to show savings fast. Starting with broader enterprise AI programs, latency-sensitive chat, or full model hosting would add security reviews, integrations, and competitive surface area before the company proves incremental value over native batch and cache features.
Sequencing. The company should begin with read-only telemetry and savings dashboards, then enable policy-based routing for a limited class of long-context jobs, then add broader FinOps controls and hardware-aware brokerage only after pilot customers convert. Hiring and partnerships should follow the same order: build the scheduler core first, sell founder-led to design partners, and only add solutions, partnerships, and broader compliance packaging after repeatable pilot-to-production conversion exists.
Not yet. Latency-critical chat inference for end-user interactive workloads · Full managed model hosting or GPU cloud resale · Highly sensitive EU and UK document-routing use cases before policy controls are proven · Broad enterprise observability positioning disconnected from long-context scheduling ROI
Go-to-market
Wedge. Sell a 90-day pilot to the Head of ML Infrastructure or VP Engineering at an AI-native startup immediately after monthly inference spend becomes material or a long-context SLA breach triggers executive attention; start in shadow mode on one document or code-analysis workflow, then convert to production once savings and queue reduction are verified.
Channels. Founder-led outbound to ML infrastructure and platform leaders at AI-native startups with visible inference spend · Open-source profiling and benchmarking tooling distributed into vLLM, LiteLLM, Ray, BentoML, and KServe communities · Cloud, model, and accelerator partnerships after the initial scheduler wedge proves repeatable demand
Funnel targets. Target account to qualified pilot 20-30%, qualified pilot to paid pilot 50%+, paid pilot to production 50%+, production account to referenceable case study 50%+.
Pricing. Use percentage-of-savings pricing during the pilot so the first buyer can justify deployment against a live bill, then convert to a monthly platform fee of roughly $2k-$8k per managed cluster plus usage overages tied to scheduled token volume. This keeps the first contract aligned to a measurable cost-reduction event and later shifts to predictable software spend once the scheduler becomes operational infrastructure.
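
The funnel targets above imply how many target accounts are needed per production customer. A small sketch of that arithmetic, using the stated floor rates and exact fractions to avoid rounding surprises (the function name is illustrative):

```python
from fractions import Fraction
from math import ceil


def accounts_needed(production_customers: int, stage_rates_pct: list[int]) -> int:
    """Target accounts required to yield the given number of production customers."""
    overall = Fraction(1)
    for pct in stage_rates_pct:
        overall *= Fraction(pct, 100)
    return ceil(production_customers / overall)


# Stages: target -> qualified pilot -> paid pilot -> production.
print(accounts_needed(1, [20, 50, 50]))  # low end of the first-stage target
print(accounts_needed(1, [30, 50, 50]))  # high end of the first-stage target
```

At the floor rates, one production customer takes roughly 14 to 20 target accounts, so even the first production conversion requires a prospect list of about that size.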
Product roadmap
MVP. The MVP is a shadow-mode and limited-control scheduler for OpenAI-compatible endpoints that profiles jobs, measures context length and deadline slack, recommends batching and cheaper execution venues, and shows verified workload-level savings in a dashboard. It must work with existing vLLM, LiteLLM, or cloud API stacks before the company asks buyers to move all inference traffic.
6 months. Ship production-ready SDK instrumentation, shadow-mode telemetry, job-level cost attribution, policy controls for one class of deferrable long-context jobs, and 2 to 3 design-partner pilots on live document or code-analysis workloads.
12 months. Add write-path routing for more workload classes, provider-specific pricing normalization, region and endpoint policy controls, benchmark reporting against native batch and cache features, and the first repeatable production conversions.
24 months. Expand into a broader inference FinOps and brokerage layer with capacity planning, reserved-versus-spot policy automation, and routing across heterogeneous clouds and specialized inference hardware while preserving the neutral control-plane position.
Key bets. Enough target workloads are deferrable or batch-friendly that a scheduler can save money without harming critical latency paths. · Buyers will install a neutral proxy in shadow mode sooner than they will migrate inference stacks to a single managed platform. · The company can prove savings after native provider batch and caching features are already enabled, not only before buyers operationalize them. · Governance controls for region, provider approval, and audit logging will be sufficient for the first wave of document-heavy customers.
Business model
Revenue streams. Savings-share pilot fees tied to documented reduction in inference spend · Recurring platform subscription per managed cluster or deployment · Usage overages and premium modules for advanced FinOps controls, policy governance, and multi-provider brokerage
Unit of value. Managed inference cluster and scheduled long-context workload volume
Target gross margin. 70%
Expansion levers. More workloads and clusters inside each customer after the first long-context deployment · Upsell from routing and savings analytics into broader inference FinOps and policy controls · Expand from startup customers into larger platform teams once governance and region controls mature
Strategy map
North-star metric. Long-context jobs completed under scheduler control with verified cost savings and SLA compliance
Input metrics. Percentage of monitored workloads classified as deferrable or batch-friendly · Incremental savings versus native provider features alone · Pilot to production conversion rate · P95 latency compliance on routed workloads · Median time from SDK install to first savings report
Moats to build. Cross-provider telemetry set on context length, cache reuse, deadline slack, and realized cost per completed job · Policy engine and benchmark corpus that shows when native batch, caching, or specialized hardware do or do not outperform default routing · Deep integrations with open-source serving stacks and provider APIs that shorten deployment versus internal builds
Kill criteria. Fewer than 5 of the first 20 target accounts show a workload where at least 20% of long-context volume is credibly deferrable or batchable. · Benchmarks on 5 production workloads fail to deliver at least 15% incremental savings after native provider batch and caching features are already enabled. · Fewer than 2 of the first 4 paid pilots convert to production contracts within 6 months of pilot completion. · Security or platform teams reject hot-path deployment in most qualified deals, forcing the product to remain a dashboard with no control authority.
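
The four kill criteria are concrete enough to encode as a checklist. The sketch below is one possible reading, with two assumptions flagged in the comments: the 15% savings bar is applied to the mean across benchmarked workloads, and "most qualified deals" is taken to mean more than half. The type and function names are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class PilotEvidence:
    deferrable_accounts: int          # of the first 20 target accounts
    incremental_savings: list[float]  # one fraction per benchmarked workload
    pilots_converted: int             # of the first 4 paid pilots, within 6 months
    hot_path_rejections: int          # deals where hot-path deployment was refused
    qualified_deals: int


def triggered_kill_criteria(e: PilotEvidence) -> list[str]:
    """Return the names of the kill criteria that the evidence trips."""
    hits = []
    if e.deferrable_accounts < 5:
        hits.append("too_few_deferrable_workloads")
    # Assumption: the 15% bar is read as the mean across benchmarked workloads.
    if mean(e.incremental_savings) < 0.15:
        hits.append("insufficient_incremental_savings")
    if e.pilots_converted < 2:
        hits.append("weak_pilot_conversion")
    # Assumption: "most" means strictly more than half of qualified deals.
    if e.hot_path_rejections * 2 > e.qualified_deals:
        hits.append("hot_path_rejected")
    return hits


evidence = PilotEvidence(6, [0.18, 0.20, 0.12, 0.16, 0.22], 1, 2, 10)
print(triggered_kill_criteria(evidence))  # ['weak_pilot_conversion']
```

Writing the thresholds down this way forces a decision on the ambiguous ones before pilots start, rather than after the data arrives.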

Milestones

0-12 months
  • Sign 2 to 3 design partners in the core AI-native startup beachhead
  • Launch shadow-mode telemetry and workload-level savings dashboard
  • Complete five comparative benchmarks versus native provider features
  • Convert at least 2 paid pilots and 1 production customer
12-24 months
  • Standardize policy-based routing for approved providers and regions
  • Reach repeatable pilot-to-production conversion on one workload class
  • Expand from one workload into broader inference FinOps controls inside existing accounts
  • Establish initial cloud, model, or accelerator partnerships that widen endpoint coverage
24-36 months
  • Support heterogeneous endpoint routing across clouds, self-hosted clusters, and emerging inference hardware
  • Build a referenceable benchmark corpus and policy dataset that improves routing quality over time
  • Expand into larger platform teams and selected enterprise accounts with stronger governance requirements
  • Reach the modeled year-3 SOM trajectory through multi-workload expansion inside lighthouse customers
Strategy map
flowchart LR
  Wedge[Long-context document and code workloads] --> MVP[Shadow-mode scheduler and savings dashboard]
  MVP --> Proof[Paid pilots with verified savings and SLA protection]
  Proof --> Expansion[Inference FinOps and multi-hardware brokerage]

Founding team

Role | Start timing | Rationale
Founder / CEO | Month 0 | Founder-led sales and design-partner discovery are required because the wedge depends on nuanced buyer pain, live telemetry access, and fast pricing iteration.
Founding eng | Month 0 | The company needs an owner for the SDK, proxy, routing policy engine, and initial dashboard before it can run credible pilots.
Founding ML systems eng | Month 1 | Benchmarking, provider-specific tuning, and workload-shape analysis are core to proving incremental savings over native infrastructure features.
Product engineer | Month 4 | The dashboard, policy controls, and operator workflow must become production-ready once the first pilots start sharing live traces.
Solutions engineer | Month 6 | Deployment support, security reviews, and ROI instrumentation become bottlenecks only after the company reaches repeatable paid pilots.

Experiment roadmap

Horizon | Experiment | Hypothesis | Success metric | Owner
0-90 days | Interview 20 ML infrastructure and platform leaders and collect anonymized workload traces for document and code-analysis jobs above 64k tokens. | The target beachhead has a repeatable pattern of deferrable work, bill shock, and queue pain that justifies a focused scheduler purchase. | At least 8 targets confirm budget-worthy pain and at least 5 share enough telemetry to classify deadline slack and workload shape. | Founder / CEO
0-90 days | Build a shadow-mode SDK and dashboard that measures context length, queue depth, provider choice, and workload-level cost on one existing stack. | Buyers will install read-only instrumentation quickly if time to first benchmark is short and no traffic is intercepted. | Two design partners install the SDK in under one day and produce a usable savings baseline within one week. | Founding eng
90-180 days | Run controlled bake-offs against native batch and caching features on five production workloads. | A workload-aware scheduler can still deliver at least 15% incremental savings beyond provider-native optimizations. | Three of five workloads show at least 15% incremental savings with no material breach of agreed latency thresholds. | Founding ML systems eng
90-180 days | Convert two shadow-mode accounts into paid pilots with limited routing authority on overnight or low-priority workloads. | Buyers will fund a pilot once the dashboard quantifies savings and hot-path risk is contained to one workload class. | Two paid pilots signed and at least one workload moved from observe-only to controlled routing within 60 days of kickoff. | Founder / CEO
180-365 days | Ship region and provider policy controls plus audit logs for sensitive document-routing accounts. | Governance objections can be reduced enough to unlock larger contracts without abandoning the neutral multi-endpoint thesis. | Three late-stage prospects clear security review with policy-based routing controls and at least one converts to production. | Product engineer
180-365 days | Launch an open-source profiling tool and benchmark dataset for long-context workload economics. | Open-source distribution will create qualified pipeline more efficiently than paid top-of-funnel marketing in this technical category. | Ten qualified inbound conversations and three pilot evaluations sourced directly from the open-source channel. | Developer relations / founder

Risk assessment

Business plan risks — 5 mapped
Risk matrix (impact by likelihood): High impact, high likelihood: R1, R2 · High impact, medium likelihood: R3, R5 · Medium impact, medium likelihood: R4
  1. R1: Native batch, caching, and provisioned-throughput features absorb most of the available savings before the startup lands. · High likelihood / High impact. Mitigation: Benchmark against native controls first and sell only where cross-provider routing or workload classification still shows measurable incremental savings.
  2. R2: Platform teams resist putting a startup proxy in the inference hot path. · High likelihood / High impact. Mitigation: Start in shadow mode, limit early routing authority to non-critical workloads, and ship strong audit and rollback controls before expanding scope.
  3. R3: Open-source projects or internal platform teams replicate the basic scheduler primitives. · Medium likelihood / High impact. Mitigation: Compete on time to deployment, benchmark accountability, proprietary telemetry, and policy controls rather than on routing logic alone.
  4. R4: Sensitive document workflows trigger provider approval, data residency, or governance objections that reduce endpoint flexibility. · Medium likelihood / Medium impact. Mitigation: Prioritize lower-sensitivity US-first accounts, add provider and region policy controls early, and avoid broad enterprise expansion until these controls are proven.
  5. R5: The beachhead does not expand into a broader inference FinOps or brokerage product. · Medium likelihood / High impact. Mitigation: Test expansion demand in every pilot, instrument adjacent buyer requests, and keep hiring modest until multi-workload expansion is real.
First customer
Title Head of ML Infrastructure at a Series B AI-native startup
Profile A 30-60 person startup running production document-review or code-analysis agents on 64k-200k token contexts with visible monthly spend across cloud APIs or self-hosted inference.
Trigger Monthly inference spend crosses roughly $50k or a long-context queueing incident causes a customer-facing escalation.
Buyer VP Engineering or Head of ML Infrastructure
Initial contract 90-day paid pilot priced as 20% of verified savings with a practical range of roughly $15k-$40k, converting to about $30k-$80k annual recurring software spend for one to two managed clusters plus usage overages.
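The savings-share pricing in the initial contract reduces to simple arithmetic. The sketch below shows what verified savings the $15k-$40k pilot range implies at a 20% share, and how the fee moves with spend, savings rate, and measurement window. The function names are hypothetical, and the plan does not state the window over which savings are verified, so the `months` parameter and the 15% and 12-month example inputs are illustrative assumptions only.

```python
SAVINGS_SHARE_PCT = 20  # pilot priced as 20% of verified savings (from the contract terms)


def pilot_fee(verified_savings_usd):
    """Fee owed for a given amount of verified savings."""
    return verified_savings_usd * SAVINGS_SHARE_PCT / 100


def implied_savings(fee_usd):
    """Verified savings needed to produce a given pilot fee."""
    return fee_usd * 100 / SAVINGS_SHARE_PCT


def verified_savings(monthly_spend_usd, savings_pct, months):
    """Savings over an assumed measurement window (window length is not given in the plan)."""
    return monthly_spend_usd * savings_pct * months / 100


# The stated fee range implies this much verified savings per pilot.
low_savings = implied_savings(15_000)    # 75,000
high_savings = implied_savings(40_000)   # 200,000

# Illustrative inputs: $50k monthly spend (the trigger level) and 15% savings.
fee_12_month_window = pilot_fee(verified_savings(50_000, 15, 12))  # 18,000
fee_90_day_window = pilot_fee(verified_savings(50_000, 15, 3))     # 4,500
```

With these illustrative inputs, the measurement window materially changes where a pilot lands relative to the $15k-$40k range, so it is worth pinning down in the pilot contract.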

What must be true

  • At least one quarter of qualified target accounts must have enough deferrable long-context work to save money through batching and routing without breaking latency commitments.
  • Buyers must still see at least 15% incremental savings after they enable native batch, cached-input, or throughput-reservation features on their current provider.
  • A shadow-mode deployment must reach first benchmark output within one day and first controlled routing within two weeks in the majority of pilots.
  • The economic buyer must fund the purchase from infrastructure or platform budget rather than an experimental AI budget in at least the first several paid pilots.
  • Early customers must show willingness to expand into broader FinOps or multi-provider brokerage once the first workload is proven, or the wedge will cap out as a narrow utility.

Open diligence questions

  • What share of the target customer's long-context volume is truly deadline-flexible rather than user-facing and latency-critical?
  • How much savings remains after the customer fully enables native batch, prompt caching, and reserved-throughput options on its existing provider?
  • Why would a buyer install a neutral proxy instead of moving more traffic to Baseten, Fireworks, Together, or cloud-native controls?
  • Which governance objections appear first in real deals: hot-path reliability, provider approval, or data residency?
  • Does one workload land at $30k-$80k annual spend without requiring services-heavy custom deployment?
Investor verdict
Call Watch
Conviction Real pain and a coherent wedge, but conviction stays moderate until the team proves incremental savings beyond native tools and shows buyers accept a hot-path control plane.
Why believe The company targets a specific budget owner, a measurable cost event, and a neutral cross-provider gap that incumbents only partially address today.
Why doubt Clouds, managed inference vendors, and open-source communities can close much of the surface area quickly unless the startup captures proprietary workload data and proves faster deployment with better accountability.
Next diligence The next proof point is 2 paid pilots on live long-context workloads that show double-digit incremental savings after native optimizations and convert into annual contracts.
Section

Financial model

3-year totals
Year 1: revenue $95K · EBITDA -$1.02M · Cash EOP $1.98M
Year 2: revenue $818K · EBITDA -$1.17M · Cash EOP $808K
Year 3: revenue $2.65M · EBITDA -$393K · Cash EOP $416K
Unit economics
ARPU (annual) $60K
Gross margin 72%
CAC $25K · Payback 7.0 months
LTV / CAC 7.1x · LTV $180K
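The unit economics above can be reproduced with a common SaaS heuristic: LTV as monthly gross profit divided by monthly churn, and payback as CAC divided by monthly gross profit. This is a sketch under that assumption; the report does not state its exact formulas. The 2.0% monthly churn and $25.2K CAC inputs are the base-case values given in the sensitivity table and key assumptions later in the model, and the function name is hypothetical.

```python
def unit_economics(arpu_annual, gross_margin, monthly_churn, cac):
    """Heuristic LTV, LTV/CAC, and CAC payback for a subscription deployment."""
    monthly_gross_profit = arpu_annual / 12 * gross_margin
    ltv = monthly_gross_profit / monthly_churn  # expected lifetime = 1 / churn months
    return {
        "ltv": ltv,
        "ltv_to_cac": ltv / cac,
        "payback_months": cac / monthly_gross_profit,
    }


# Base case: $60K ARPU, 72% gross margin, 2.0% monthly churn, $25.2K CAC.
base = unit_economics(60_000, 0.72, 0.02, 25_200)
print({k: round(v, 1) for k, v in base.items()})
```

Under these inputs the heuristic gives an LTV of about $180K, an LTV/CAC of about 7.1x, and a payback of about 7.0 months, matching the stated figures.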
Funding ask
Round pre-seed · $3.0M
Runway 30 months
Milestone Exit Y2 with 25 active paid deployments, ~68% gross margin, multiple production references, and enough routing proof to support a seed round around repeatable pilot-to-production conversion.

Model sanity

  • Revenue engine. The base case reaches 67 active paid deployments by Q4Y3 at roughly $60K blended annual revenue per deployment, with most of the lift coming from repeat pilot conversion after Year 1.
  • Must go right. The product must keep proving savings beyond native batch and caching features so net adds can sustain 4-6 per quarter in Y2 and 8-12 per quarter in Y3.
  • Model breaks if. If pricing slips toward $52K and gross margin stalls near 68%, downside cash falls below zero before the next round case is proven.
  • Next-round proof. The next financing is justified if the company exits Y2 with 25 active deployments, ~68% gross margin, and multiple production references that shorten the sales cycle.
Revenue, cash, and EBITDA — 12-month Y1 + 8-quarter Y2/Y3
Chart axes: $0K to $3.00M (y); M1 through Q4Y3 (x)
  • Revenue (line, area)
  • Cash EOP (dashed)
  • EBITDA (bars, gray = loss)
Use of funds — $3.0M pre-seed
Engineering 42% · GTM 29% · G&A 13% · Buffer (6 mo) 16%
Headcount build by role, peak 12 FTE
Q1Y1 3 · Q2Y1 5 · Q3Y1 5 · Q4Y1 6 · Q1Y2 6 · Q2Y2 6 · Q3Y2 6 · Q4Y2 9 · Q1Y3 9 · Q2Y3 9 · Q3Y3 9 · Q4Y3 12
  • Founder / CEO
  • Platform eng
  • ML systems eng
  • Product eng
  • Solutions eng
  • Sales
  • Customer success
  • Partnerships and ops
Year-3 scenarios — base / downside / upside
Scenario | Y3 revenue | Y3 EBITDA | Cash low point | Description
Downside | $1.89M | -$1.02M | -$378K | Native vendor features compress pricing, pilot conversion slows, and the company reaches fewer net paid deployments before a seed round.
Base | $2.65M | -$393K | $387K | The company proves incremental savings beyond native tooling, keeps pricing near the middle of the BP production band, and compounds founder-led sales with open-source-led inbound.
Upside | $3.45M | $282K | $886K | Benchmark wins and partner referrals produce faster conversion, cleaner deployment economics, and a stronger seed-ready growth story.
Sensitivity — Y3 cash and revenue impact, sorted by magnitude
Variable | Downside | Upside | Cash impact | Revenue impact
CAC | $35K CAC because pilots require heavier founder and solutions time | $20K CAC via open-source and partner-sourced pipeline | -$300K | -$140K
Sales cycle | 9-month pilot-to-production cycle | 4-5 month cycle with warm references | -$290K | -$320K
ARPU | $52K annual revenue per active deployment | $66K annual revenue per active deployment | -$255K | -$354K
Gross margin | 68% steady-state gross margin | 75% steady-state gross margin | -$215K | $0K
Hiring pace | Add one seller and one CS hire two quarters earlier than planned | Delay one non-critical GTM hire until deployment count exceeds 30 | -$210K | -$90K
Churn | 3.0% monthly churn on active deployments | 1.5% monthly churn | -$135K | -$175K

Scenarios

Scenario Y3 revenue Y3 EBITDA Cash low point Description Key changes
Downside $1.89M $-1.02M $-378K Native vendor features compress pricing, pilot conversion slows, and the company reaches fewer net paid deployments before a seed round.
  • Annual revenue per active deployment falls to about $52K.
  • Net new deployments slow to 3,4,5,5 in Y2 and 6,8,10,10 in Y3.
  • Gross margin tops out near 68% because support and infra overhead stay elevated.
Base $2.65M $-393K $387K The company proves incremental savings beyond native tooling, keeps pricing near the middle of the BP production band, and compounds founder-led sales with open-source-led inbound.
  • Annual revenue per active deployment stays at $60K.
  • Net new deployments follow 4,5,6,6 in Y2 and 8,10,12,12 in Y3.
  • Gross margin rises from 50% early in Y1 to 72% in Y3.
Upside $3.45M $282K $886K Benchmark wins and partner referrals produce faster conversion, cleaner deployment economics, and a stronger seed-ready growth story.
  • Annual revenue per active deployment reaches about $66K through larger production scopes.
  • Net new deployments rise to 5,6,7,7 in Y2 and 10,12,14,14 in Y3.
  • Gross margin improves to roughly 75% as routing and support become more standardized.

Sensitivity

Variable | Downside | Base | Upside
ARPU | $52K annual revenue per active deployment | $60K annual revenue per active deployment | $66K annual revenue per active deployment
CAC | $35K CAC because pilots require heavier founder and solutions time | $25.2K CAC | $20K CAC via open-source and partner-sourced pipeline
Churn | 3.0% monthly churn on active deployments | 2.0% monthly churn | 1.5% monthly churn
Sales cycle | 9-month pilot-to-production cycle | 6-month blended cycle | 4-5 month cycle with warm references
Gross margin | 68% steady-state gross margin | 72% steady-state gross margin | 75% steady-state gross margin
Hiring pace | Add one seller and one CS hire two quarters earlier than planned | Add hires only after production proof points | Delay one non-critical GTM hire until deployment count exceeds 30
Key assumptions (24)
ID | Name | Value | Unit | Source
A1 | Model start month | 2026-06 | month | [BP date 2026-05-14]; model assumes the pre-seed closes the month after the plan date.
A2 | Starting cash at M1 | 3000 | USDK | [BP fundingAsk $2-4M pre-seed]; base case uses a $3.0M close inside the stated range and sized to cover the Year-2 proof point plus 6 months of buffer.
A3 | Customer unit in the model | active paid deployment | definition | [BP businessModel unitOfValue managed inference cluster and workload volume]; customersEop represents paid pilot or production deployments generating recurring software revenue.
A4 | Starting customers (M1) | 0 | count | [BP milestones] design-partner work starts before paid deployments.
A5 | Blended annual revenue per active deployment | 60.0 | USDK | [BP firstCustomer initialContract $15k-$40k pilot and $30k-$80k production ARR]; base case uses a blended value near the midpoint once pilots and production accounts mix together.
A6 | Revenue recognition for adds | average active customers per period | formula | Startup finance heuristic anchored to BP pilot-to-production timing: revenue is modeled as ((beginning customers + ending customers) / 2) × ARPU for each month or quarter.
A7 | Year 1 new paid deployments by month | [0,0,0,1,0,1,0,0,1,0,0,1] | count | [BP milestones 0-12 months] paced to reach multiple paid pilots and the first production conversion without assuming a fast enterprise ramp.
A8 | Year 2 new paid deployments by quarter | [4,5,6,6] | count | [BP milestones 12-24 months; BP gtm funnelTargets; BP sequencingRationale] assumes founder-led references plus open-source-led inbound create a repeatable but still narrow motion.
A9 | Year 3 new paid deployments by quarter | [8,10,12,12] | count | [BP market SOM 150 customers at ~$36k; BP expansionLevers] exits Year 3 at 67 active deployments, still conservative versus the stated SOM ceiling.
A10 | Gross margin ramp | 50% M1-M6; 60% M7-M12; 68% Y2; 72% Y3 | percent | [BP businessModel targetGrossMarginPct 70; BP risks on native vendor substitution and hot-path reliability] the model starts below target while the team absorbs infrastructure and support cost, then moves above target after routing and support normalize.
A11 | Founder / CEO fully-loaded salary | 150.0 | USDK annual per FTE | Startup finance heuristic for a U.S. pre-seed infra founder taking a below-market but real cash salary.
A12 | Platform engineer fully-loaded salary | 170.0 | USDK annual per FTE | [BP team Founding eng] startup-finance heuristic for early AI infrastructure engineering talent with payroll overhead.
A13 | ML systems engineer fully-loaded salary | 185.0 | USDK annual per FTE | [BP team Founding ML systems eng] startup-finance heuristic for benchmark and routing-optimization talent.
A14 | Product engineer fully-loaded salary | 155.0 | USDK annual per FTE | [BP team Product engineer] startup-finance heuristic for an early full-stack product builder with benefits and taxes.
A15 | Solutions engineer fully-loaded salary | 140.0 | USDK annual per FTE | [BP team Solutions engineer] startup-finance heuristic for deployment and security-review support.
A16 | Sales fully-loaded salary | 160.0 | USDK annual per FTE | [BP GTM founder-led outbound and later repeatable motion] startup-finance heuristic for early technical sales compensation including variable pay.
A17 | Customer success fully-loaded salary | 110.0 | USDK annual per FTE | Startup finance heuristic for post-pilot onboarding and reference-customer support.
A18 | Partnerships and ops fully-loaded salary | 130.0 | USDK annual per FTE | [BP channels and later partnerships] startup-finance heuristic for one operator supporting ecosystem and internal scaling late in Year 3.
A19 | Payroll allocation policy | founder 60% S&M and 40% G&A; solutions engineer 60% S&M and 40% R&D; partnerships and ops 50% S&M and 50% G&A; sales and customer success 100% S&M; all engineering roles 100% R&D | policy | [BP team role rationales; BP sequencingRationale] reflects a product-first org with founder-led selling and deployment-heavy early GTM.
A20 | Hiring sequence beyond the named founding team | first seller M10; second platform engineer M16; first customer success M18; second seller M21; partnerships and ops M27; third seller M30; second customer success M33 | timing | [BP team; BP milestones; BP sequencingRationale] adds GTM and customer coverage only after early pilots and production references exist.
A21 | Non-payroll operating expense schedule | S&M monthly Y1 [4,4,4,5,5,6,6,7,7,8,8,9], quarterly Y2-Y3 [33,36,39,42,45,48,51,54]; R&D monthly Y1 [12,12,12,14,14,15,15,16,16,17,17,18], quarterly Y2-Y3 [57,60,63,66,69,72,75,78]; G&A monthly Y1 [6,6,6,6,6,7,7,7,7,8,8,8], quarterly Y2-Y3 [27,30,33,36,39,42,45,48] | USDK | [BP operations; BP risks; Research regulatoryLandscape and distributionChannels] covers cloud tooling, benchmark infrastructure, travel, security/legal work, and lightweight overhead.
A22 | Net customer schedule embeds churn | 2.0% monthly churn in unit economics; customer adds shown in A7-A9 are net active deployments after expected churn and modest seat expansion. | policy | Startup finance heuristic anchored to an infra product with decent stickiness but real pilot fallout and vendor substitution risk.
A23 | Blended CAC | 25.2 | USDK per net new customer | Calculated from modeled Y2-Y3 sales and marketing spend of about $1.59M divided by 63 net new active deployments; consistent with founder-led outbound plus open-source distribution.
A24 | Funding sizing rule | end of Y2 milestone plus 6-month buffer | policy | Developer instruction; [BP fundingAsk] the round is sized to reach repeatable pilot-to-production conversion before the next institutional round.
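The revenue rule in A6 and the add schedules in A7-A9 are enough to recompute the headline revenue figures. The sketch below applies ((beginning customers + ending customers) / 2) × ARPU per period to each scenario. Function and variable names are hypothetical, and it assumes the Year 1 add schedule and a single ARPU apply unchanged in every scenario, because the scenario notes only list Y2 and Y3 changes.

```python
Y1_MONTHLY_ADDS = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1]  # A7

SCENARIOS = {  # ARPU in USD thousands per year; net adds per quarter
    "downside": {"arpu_k": 52, "y2": [3, 4, 5, 5], "y3": [6, 8, 10, 10]},
    "base": {"arpu_k": 60, "y2": [4, 5, 6, 6], "y3": [8, 10, 12, 12]},
    "upside": {"arpu_k": 66, "y2": [5, 6, 7, 7], "y3": [10, 12, 14, 14]},
}


def period_revenue(start_customers, net_adds, arpu_per_period):
    """Apply the A6 rule per period; return (total revenue, ending customers)."""
    total, customers = 0.0, start_customers
    for adds in net_adds:
        end = customers + adds
        total += (customers + end) / 2 * arpu_per_period
        customers = end
    return total, customers


def yearly_revenue_k(arpu_k, y2, y3):
    """Revenue in USD thousands for Y1 (monthly), Y2 and Y3 (quarterly)."""
    rev1, c1 = period_revenue(0, Y1_MONTHLY_ADDS, arpu_k / 12)
    rev2, c2 = period_revenue(c1, y2, arpu_k / 4)
    rev3, c3 = period_revenue(c2, y3, arpu_k / 4)
    return {"y1": rev1, "y2": rev2, "y3": rev3, "exit_y2": c2, "exit_y3": c3}


results = {name: yearly_revenue_k(**s) for name, s in SCENARIOS.items()}

# A23: blended CAC = Y2-Y3 S&M spend (about $1.59M) / net new deployments in Y2-Y3.
base_net_adds_y2_y3 = sum(SCENARIOS["base"]["y2"]) + sum(SCENARIOS["base"]["y3"])
blended_cac_k = 1590 / base_net_adds_y2_y3
```

Under these assumptions the base case gives 95.0, 817.5, and 2,655.0 (USD thousands) for Years 1 to 3 with 25 and 67 active deployments at the Y2 and Y3 exits, and the downside and upside Y3 figures come to 1,885.0 and 3,448.5. These line up with the stated totals and scenario table, and the CAC ratio works out to roughly 25.2.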
Unit economics flow
flowchart LR
  Leads --> PaidPilots
  PaidPilots --> ActiveDeployments
  ActiveDeployments --> PlatformRevenue
  PlatformRevenue --> GrossProfit
  GrossProfit --> Cash

Flags: The model still burns more than $1.0M in Y1 and $1.17M in Y2, so execution risk is concentrated in proving faster pilot conversion before cash falls under $1M. · Revenue per exit FTE only reaches the low end of SaaS benchmarks because solutions work and deployment support remain meaningful through Y3. · Q4Y3 is the first positive EBITDA quarter, so a one-to-two quarter slip in conversion timing would likely pull fundraising forward. · Customer counts are modeled as net of churn and small expansions; if real gross churn exceeds the 2.0% heuristic without offsetting upsell, Y3 revenue will miss materially.

Section

Top risks

  • Cloud provider counter-move. AWS, Azure, and GCP could bundle native long-context batch scheduling into their managed inference APIs, commoditizing the core wedge within 12–18 months. Mitigation: Build multi-cloud and self-hosted model compatibility before any single provider ships competing long-context scheduling, and shift the moat to proprietary job-shape datasets and FinOps integrations that cloud providers cannot easily replicate.
  • Open-source substitution. The vLLM or LiteLLM communities could add throughput-aware batch scheduling as an open-source feature, undermining the commercial value proposition. Mitigation: Contribute core scheduling primitives to open source as a distribution channel while keeping the managed control plane, cost attribution engine, and multi-endpoint routing proprietary.
  • Single-source evidence gap. Fractile's hardware performance claims (40 to 1,200 tokens/sec) are unverified by independent benchmarks; if the hardware thesis proves overblown, urgency signals for inference scheduling may weaken. Mitigation: Validate customer pain directly through ML infra interviews before Series A; the scheduling product's value case is independent of Fractile's hardware shipping on schedule.
Section

Evidence

Cited sources (40)

  1. AWS. Amazon Bedrock Pricing – AWS · https://aws.amazon.com/bedrock/pricing
  2. AWS. Process multiple prompts with batch inference - Amazon Bedrock · https://docs.aws.amazon.com/bedrock/latest/userguide/batch-inference.html
  3. Anthropic. Batch processing · https://platform.claude.com/docs/en/build-with-claude/batch-processing
  4. Anthropic. Prompt caching · https://platform.claude.com/docs/en/build-with-claude/prompt-caching
  5. Anyscale. How continuous batching enables 23x throughput in LLM inference · https://speakerdeck.com/anyscale/how-continuous-batching-enables-23x-throughput-in-llm-inference
  6. Anyscale. LLMs and agentic AI on Anyscale | Anyscale Docs · https://docs.anyscale.com/llm
  7. Baseten. Announcing Baseten’s $75M Series C · https://www.baseten.co/blog/announcing-baseten-75m-series-c
  8. Baseten. Cloud Pricing · https://www.baseten.co/pricing
  9. Baseten. The Baseten Inference Stack | Guides · https://www.baseten.co/resources/guide/the-baseten-inference-stack
  10. BentoML. LLM Inference Handbook · https://www.bentoml.com/llm
  11. Capgemini. Harnessing the value of AI: Unlocking scalable advantage · https://www.capgemini.com/insights/research-library/generative-ai-in-organizations-2025
  12. Cerebras. Inference - Cerebras · https://www.cerebras.ai/inference
  13. Deloitte. State of Generative AI Q4 – Press Release · https://www.deloitte.com/us/en/about/press-room/state-of-generative-ai.html
  14. EU AI Act. EU Artificial Intelligence Act | Up-to-date developments and analyses of the EU AI Act · https://artificialintelligenceact.eu/
  15. Exploding Topics. How Many AI Companies Are There? (2025) · https://explodingtopics.com/blog/number-ai-companies
  16. Fireworks AI. Fireworks - Pricing · https://fireworks.ai/pricing
  17. Fireworks AI. Prompt caching - Fireworks AI Docs · https://docs.fireworks.ai/guides/prompt-caching
  18. Fractile. Fractile raises $220M to build the next generation of inference hardware · https://www.fractile.ai/news/fractile-raises-220m-to-build-the-next-generation-of-inference-hardware
  19. GitHub / vLLM. [Performance]: decoding speed on long context · Issue #11286 · vllm-project/vllm · https://github.com/vllm-project/vllm/issues/11286
  20. Google Cloud. Batch predictions | Generative AI on Vertex AI | Google Cloud Documentation · https://docs.cloud.google.com/vertex-ai/generative-ai/docs/maas/capabilities/batch-prediction
  21. Google Cloud. Throughput quota | Generative AI on Vertex AI | Google Cloud Documentation · https://docs.cloud.google.com/vertex-ai/generative-ai/docs/resources/throughput-quota
  22. Groq. Groq Raises $750 Million as Inference Demand Surges · https://groq.com/newsroom/groq-raises-750-million-as-inference-demand-surges
  23. Groq. Rate Limits - GroqDocs · https://console.groq.com/docs/rate-limits
  24. Hugging Face. Text Generation Inference · Hugging Face · https://huggingface.co/docs/text-generation-inference/en/index
  25. ICO. Guidance on AI and data protection · https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/artificial-intelligence/guidance-on-ai-and-data-protection
  26. Intel. Intel, SambaNova Planning Multi-Year Collaboration for Xeon-Based AI Inference · https://newsroom.intel.com/data-center/intel-and-sambanova-planning-multi-year-collaboration-for-xeon-based-ai-inference
  27. KServe. Overview | KServe · https://kserve.github.io/website/docs/admin-guide/overview
  28. LiteLLM. Router - Load Balancing | liteLLM · https://docs.litellm.ai/docs/routing
  29. Microsoft Azure. Azure OpenAI Service - Pricing | Microsoft Azure · https://azure.microsoft.com/en-us/pricing/details/azure-openai
  30. Microsoft Learn. What Is Provisioned Throughput for Foundry Models? - Microsoft Foundry · https://learn.microsoft.com/en-us/azure/foundry/openai/concepts/provisioned-throughput?tabs=global-ptum
  31. NIST. Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile · https://www.nist.gov/publications/artificial-intelligence-risk-management-framework-generative-artificial-intelligence
  32. NVIDIA. Removing the Guesswork from Disaggregated Serving | NVIDIA Technical Blog · https://developer.nvidia.com/blog/removing-the-guesswork-from-disaggregated-serving
  33. PR Newswire / MarketsandMarkets. AI Inference Market worth $254.98 billion by 2030 - Exclusive Report by MarketsandMarkets™ · https://www.prnewswire.com/news-releases/ai-inference-market-worth-254-98-billion-by-2030---exclusive-report-by-marketsandmarkets-302388315.html
  34. Together AI. Overview - Together AI docs · https://docs.together.ai/docs/inference/batch/overview
  35. Together AI. Pricing | Together AI · https://www.together.ai/pricing
  36. arXiv. Efficient Memory Management for Large Language Model Serving with PagedAttention · https://arxiv.org/abs/2309.06180
  37. arXiv. SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills · https://arxiv.org/abs/2308.16369
  38. arXiv. Splitwise: Efficient generative LLM inference using phase splitting · https://arxiv.org/abs/2311.18677
  39. vLLM. Automatic Prefix Caching - vLLM · https://docs.vllm.ai/en/latest/features/automatic_prefix_caching
  40. vLLM. Parallelism and Scaling - vLLM · https://docs.vllm.ai/en/latest/serving/parallelism_scaling