Scale AI

Scale AI specializes in providing high-quality labeled data essential for training AI applications across various industries, including autonomous vehicles, robotics, augmented reality, and more.

Confidential
Do not share

Highlights

Scale AI is a rapidly growing AI data infrastructure platform backed by Accel, Index, Founders Fund, Coatue, Spark, Tiger, and Amazon.

As reported by YC, the company tripled ARR in 2023 and is expected to reach $1.4B by the end of 2024. With revenue growing 200% year on year, Scale AI also expects to be profitable by the end of this year. It reportedly reached $760M in 2023, and current customers include OpenAI, Anthropic, Microsoft, Meta, Google, Cohere, Adept, Nvidia, General Motors, Toyota, Etsy, Instacart, Chegg, the US Army, and many more. Scale’s customers are increasing their investments in data to serve larger models and more enterprises, and Scale is a critical component in creating and improving models and deploying generative AI solutions. Just as Nvidia ($2.4T valuation) has established a leading position in AI chips, Scale AI has become the dominant player in data solutions for AI models, where data is the most crucial component of any model.

We are investing at a 4% premium to Scale’s $1B Series F, led by tier 1 VC Accel at a ~$13.69B valuation. Given that the company is close to profitability and demand for well-trained AI continues to grow rapidly, it is on a trajectory toward a strong potential IPO.
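The premium arithmetic can be checked with a minimal sketch. The wording leaves open whether ~$13.69B is the Series F valuation (with the premium applied on top) or the entry valuation after the premium, so the code computes both readings. The function names are illustrative, and treating the premium as a simple multiplier on valuation is an assumption.

```python
# Minimal sketch of the premium arithmetic, in billions of USD.
# Assumption: the premium is a simple multiplier on the round valuation.

def apply_premium(round_valuation_b: float, premium: float) -> float:
    """Entry valuation if the premium is applied on top of the round valuation."""
    return round_valuation_b * (1 + premium)


def strip_premium(entry_valuation_b: float, premium: float) -> float:
    """Round valuation implied if the quoted figure already includes the premium."""
    return entry_valuation_b / (1 + premium)


QUOTED_VALUATION_B = 13.69  # figure quoted in the memo
PREMIUM = 0.04              # 4% premium quoted in the memo

# Reading 1: $13.69B is the Series F valuation, and the entry price sits 4% above it.
entry_if_quoted_is_round = apply_premium(QUOTED_VALUATION_B, PREMIUM)

# Reading 2: $13.69B is the entry valuation, which already includes the 4% premium.
round_if_quoted_is_entry = strip_premium(QUOTED_VALUATION_B, PREMIUM)

print(f"Reading 1 entry valuation: ${entry_if_quoted_is_round:.2f}B")
print(f"Reading 2 round valuation: ${round_if_quoted_is_entry:.2f}B")
```

Which reading applies depends on the deal terms, which the text above does not spell out.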

Recent Updates:

$1B Series F closed - Video of the CEO: “Our data engine generates nearly all of the data necessary to fuel the leading LLMs in the industry today”
Meta & Scale Partner - To Drive Enterprise Adoption of Llama

INTRODUCTION

AI is now a critical component of major products and companies that are defining the future; examples include Tesla’s Autopilot, GitHub Copilot, and even TikTok’s content recommendations. A long-running obstacle to building artificial intelligence and machine learning applications has been the lack of well-organized data required to build models. This lack of data extends the timelines required to build AI models and reduces the accuracy of the resulting application. Without a strong training dataset, applications often have reduced capabilities and increased vulnerability, and in some cases cannot be built at all. For example, in medical research, the limited data available on rare diseases and conditions makes an AI application that identifies them difficult to build and often inaccurate.

Scale AI aims to fix these issues. Its vision is to be the foundational infrastructure behind artificial intelligence and machine learning applications. The company began with data labeling and annotation for AI/ML models, which involves tagging relevant information or metadata in a dataset so that it can be used to train an ML model. Any ML model needs to be trained on accurate, correctly labeled data. Scale AI’s core value proposition is built around ensuring that companies have correctly labeled data so they can build effective ML models. By building comprehensive datasets to train AI/ML applications, Scale AI seeks to enable developers to build accurate applications with increased capability and limited vulnerability.

PRODUCT

To understand Scale AI, it helps to understand the lifecycle of building a machine learning model in any given industry vertical. The lifecycle begins with data and its sources before moving to data engineering, which is a component of data science.

Source: Andy Scherpenberg

Scale AI’s core value proposition is built around the data engineering component of this lifecycle. Specifically, Scale AI helps companies with data annotation and labeling of “ground truth” data. Ground truth data is data that has been correctly labeled in an expected format, such as a picture of a cat tagged as “cat,” or an image in which a dog is distinguished from a cat. Beyond labeling, Scale AI offers products across the rest of the ML lifecycle, including data management, automated data extraction, model evaluation, and synthetic data generation.

Scale Data Engine

Scale AI’s main product is its data engine, which companies use to build and train ML algorithms. The data engine collects, curates, and annotates data to train and evaluate models. Companies including Lyft, Toyota, Airbnb, and General Motors pay Scale AI for high-quality annotated data labeled by human contractors or by an ML algorithm.

Data Annotation and Labeling

Scale AI annotates many different types of data including 3D sensor fusion, image, video, text, audio, and maps. Although image, video, text, and audio products could be generalized across several industries, 3D sensor fusion and map labeling are specific to the autonomous driving, robotics, and Augmented Reality and Virtual Reality (AR/VR) industries.

Scale Rapid is a labeling platform for ML teams to quickly develop production-quality training data. It allows users to upload data, set up labeling instructions, and get feedback and calibration on preliminary labels in a few hours, in order to quickly scale up the data labeling process to larger volumes. Scale AI provides the annotator workforce necessary to label the data.
Scale Studio is a platform to manage a company’s annotation projects and workforce. This product offering provides a tool that tracks and visualizes annotator metrics and also provides ML-assisted annotation tooling to speed up annotations. It tracks metrics such as throughput, efficiency, and accuracy.

The difference between Scale Studio and Scale Rapid is the approach to labeling the data. Scale Rapid requires that the data be annotated by Scale AI, while Scale Studio requires the company to bring its own annotator workforce.
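The annotator metrics mentioned above can be illustrated with a short sketch. It is not Scale Studio's implementation: the `LabelTask` record, the `reference_label` field, and the metric definitions (throughput as labels per hour, accuracy as agreement with an audited reference label) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class LabelTask:
    """One completed labeling task (hypothetical record layout)."""
    annotator: str
    label: str
    reference_label: str   # assumed audited "ground truth" label
    seconds_spent: float


def annotator_metrics(tasks: list[LabelTask]) -> dict[str, dict[str, float]]:
    """Aggregate per-annotator throughput (labels/hour) and accuracy."""
    totals: dict[str, dict[str, float]] = {}
    for task in tasks:
        t = totals.setdefault(
            task.annotator, {"count": 0, "correct": 0, "seconds": 0.0}
        )
        t["count"] += 1
        t["correct"] += int(task.label == task.reference_label)
        t["seconds"] += task.seconds_spent

    return {
        name: {
            # Labels completed per hour of recorded working time.
            "throughput_per_hour": (
                t["count"] * 3600 / t["seconds"] if t["seconds"] > 0 else 0.0
            ),
            # Share of labels that match the reference label.
            "accuracy": t["correct"] / t["count"],
        }
        for name, t in totals.items()
    }


tasks = [
    LabelTask("ann_1", "cat", "cat", 30.0),
    LabelTask("ann_1", "dog", "cat", 30.0),
    LabelTask("ann_2", "dog", "dog", 20.0),
]
metrics = annotator_metrics(tasks)
for name, m in metrics.items():
    print(name, m)
```

The same aggregation applies whether the workforce is supplied by the vendor or brought by the customer; only the source of the task records changes.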

Manage and Evaluate Data

In August 2020, Scale AI launched Nucleus, a “data debugging SaaS product.” Nucleus provides advanced tooling for understanding, visualizing, curating, and collaborating on companies’ data, allowing teams to build better ML models. Specifically, Nucleus allows for data exploration, debugging of bad labels, comparing accuracy metrics of different versions of ML models, and finding failure cases.

Source: Not Boring

Generative AI Platform

Scale AI also develops custom ML models and solutions as a service, including its Document AI product. Document AI extracts information from digital documents; companies like Brex and Flexport use it for invoices and logistics paperwork. Scale AI claims that Document AI produces higher-quality data with lower latency, saving time and money compared with traditional optical character recognition (OCR) methods.

Source: Not Boring

Scale Forge is an AI-powered marketing suite that enables marketers and creatives to generate product imagery, social media ads, and lifestyle pictures. Scale AI claims that these images can be generated in seconds, letting marketers experiment with and prototype different visuals and improve conversion with high-quality images.

Source: Scale AI

Scale E-Commerce AI is a product for ecommerce platforms to create, enrich, and enhance catalog data. Scale AI claims that this product improves engagement, discoverability, and conversion. It enables ecommerce and retail teams to get high-quality data from seller feeds and the public internet, and it uses annotation to remove duplicates, merge variants, fix inconsistencies, and correct errors on ecommerce platforms.

Scale Content Understanding aims to improve business intelligence and analysis by enriching content metadata, discovering trend insights, and flagging sensitive content. Specifically, Content Understanding can reduce overhead by fully managing complex ML capabilities such as deduplication, object identification, and fraud detection.

Scale Synthetic helps companies generate synthetic data: annotated information that computer algorithms generate as an alternative to real-world data. Synthetic data can improve the performance of ML models and costs less to acquire. As of September 2023, Scale AI supports generating synthetic images, videos, and 3D point clouds. However, domain gaps between synthetic and real data mean that synthetic data does not always improve ML model performance; Scale AI acknowledges this risk. Additionally, generating 3D synthetic data is costly in both compute and human effort, although this cost may decrease over time.

Scale Donovan is an AI suite for the federal government. Donovan ingests data from cloud, hybrid, and on-prem sources, organizes it so that it can be queried, and enables operators and analysts to ask questions of sensor feeds and map/model data. Donovan also produces courses of action, summary reports, and other actionable insights to help operators achieve mission objectives.

Source: Scale AI

Scale Spellbook is Scale AI’s product for developers to build, compare, and deploy large language model apps. Announced in November 2022, its features include scaling CPU and GPU compute, managing model deployments and A/B testing, and monitoring real-time metrics such as uptime, latency, and performance. Spellbook also includes structured testing for ML models through regression tests and model comparisons.

CUSTOMERS

Scale AI has adopted a sales model in which it derives most of its revenue from a small set of large data-labeling customers. These include General Motors’ Cruise, Zoox, Nuro, and other autonomous driving companies that require vast volumes of labeled camera data. Scale AI’s customers also include robotics companies such as Kodiak Trucks, Embark, Skydio, and Toyota Research Institute.

With the Document AI product, Scale AI expanded its customer base to companies such as Flexport, Brex, and SAP. Scale AI also has startup customers that use computer vision for their products, including CellarEye for managing wine collections, TimberEye for optimizing log inventory and management, and States Title for faster real-estate transactions.

Scale AI’s marketing and ecommerce suites enabled Scale AI to access marketers and retail platforms. As of September 2023, Scale Forge was still a new product being rolled out gradually via waitlist, so the company did not list notable customers. On the other hand, Scale AI’s ecommerce suite is utilized by companies including Instacart, Faire, Pinterest, and Square.

With the introduction of Scale Donovan, Scale AI expanded to serve the federal government and defense contractors. Key customers include the US Army, the US Air Force, and the Defense Innovation Unit. Scale AI has also secured a $249M contract with the Department of Defense.

Source: Scale AI

MARKET SIZE

The rise of AI can be attributed to several key factors, including increased computing power in AI chips, a growing volume of training data, the easing of technical bottlenecks (such as vanishing gradients, which helped motivate the development of transformers), and a decrease in cloud storage and compute costs. With its data labeling and annotation products, Scale AI mainly targets the data collection and labeling market, which is projected to grow at a CAGR of 28.9% from 2023 to 2030 and reach $17.1 billion by 2030.

Scale AI’s model customization and data debugging product lines have expanded in scope to address the global AI market. The global artificial intelligence market was valued at $136.6 billion in 2022 and is expected to grow at a CAGR of 37.3% through 2030.
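The growth figures in this section follow the standard compound annual growth rate (CAGR) relation, future = base × (1 + rate)^years. The sketch below applies it to the quoted numbers to back out what they imply. It assumes annual compounding, that the 37.3% rate for the global AI market runs from the 2022 base year, and function names chosen for illustration; the results are implied values, not figures from the cited market reports.

```python
# Minimal CAGR sketch, in billions of USD. Assumes annual compounding.

def project_forward(base_value: float, cagr: float, years: int) -> float:
    """Value after compounding base_value at cagr for the given number of years."""
    return base_value * (1 + cagr) ** years


def implied_base(future_value: float, cagr: float, years: int) -> float:
    """Starting value that compounds to future_value at cagr over the given years."""
    return future_value / (1 + cagr) ** years


# Data collection and labeling: $17.1B in 2030 at a 28.9% CAGR from 2023.
labeling_2023_b = implied_base(17.1, 0.289, 2030 - 2023)

# Global AI market: $136.6B in 2022 at a 37.3% CAGR through 2030
# (assumption: compounding starts from the 2022 base year).
ai_market_2030_b = project_forward(136.6, 0.373, 2030 - 2022)

print(f"Implied 2023 labeling market: ${labeling_2023_b:.1f}B")
print(f"Implied 2030 global AI market: ${ai_market_2030_b:,.0f}B")
```

Under these assumptions the labeling market is roughly a few billion dollars today, while the broader AI market compounds into the trillions, which is the gap the product expansion is aimed at.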

BUSINESS MODEL

Scale AI does not publicly disclose detailed pricing. It offers two pricing tiers: one for enterprise clients and one for individuals.

Enterprise

Source: Scale AI

Scale AI provides data annotation for the enterprise on a custom pricing basis.

Self-Serve Data Engine

Source: Scale AI

For Scale AI’s self-serve data engine, a client can manage and annotate data for ML projects in one place using its own workforce. Scale AI prices this product on a pay-as-you-go basis, billed by credit card. The first 1K labeling units are free; the price of labels beyond 1K is not disclosed.

Companies pay Scale AI to label data, and the price varies with the volume and the data type (image, video, text, 3D LiDAR, etc.). Scale AI labels the data using a workforce of more than 100K contractors, automates parts of the labeling process with its own ML algorithms, and builds in-house algorithms to check the quality of the data.

Scale AI has gone beyond the autonomous vehicle labeling market to pick up large government contracts to label geospatial data. In addition, Scale AI has managed to garner enterprise contracts with companies like Brex and OpenAI for natural language processing. The company has ramped up its release of products in recent years, growing what was previously an exclusively annotation-based product line into something that includes model training, collection, and debugging.

Source: Not Boring

Key Opportunities

Specific Data Labeling for More Industries

Scale AI has focused on developing data labeling and annotation services for specific industries, starting with autonomous driving. Acquiring new customers and expanding into new industries is a key opportunity. In 2018, Scale AI focused on autonomous driving companies such as GM, Cruise, Lyft, Zoox, and nuTonomy; since then, it has proven itself by labeling a variety of data types.

In 2023, its customers include government agencies like the DoD, marketplaces like Airbnb, fintech companies like Brex, and AI developer OpenAI. Each has very different data labeling needs, but Scale AI has proven it can win contracts and deliver quality service to each of them.

Source: Scale

Product Expansion

Scale AI could expand its products across the ML lifecycle. It has already launched Nucleus, which helps companies evaluate and debug data, and Synthetic, which generates synthetic data for training ML models. Although synthetic data differs from real-world data in some respects, combining the two for training could increase model performance while decreasing data acquisition costs. Scale AI can also expand the capabilities of synthetic data with recent advances in AI-generated art. As of September 2023, Scale AI is developing and rolling out Forge for marketing teams to generate images. In the future, Scale AI could continue to search for new AI applications for other professions and use cases.

COMPETITION

Scale AI is expanding into parts of the ML stack beyond data labeling, including ML model debugging and evaluation with products like Nucleus. However, each of these ML infrastructure segments has many more competitors, including Databricks, Labelbox Model, and Snorkel Flow. Scale AI’s core differentiator is its lower cost of human-in-the-loop data labeling at scale.

TEAM & FOUNDERS

Alexandr Wang (CEO) and Lucy Guo (co-founder) founded Scale AI in 2016. Wang and Guo met while working at Quora. Wang was a machine-learning enthusiast and recognized the importance of training data in advancing artificial intelligence. He came up with the idea while studying at MIT after noticing his peers weren’t building AI products, despite their training, because there was a lack of well-organized data required for them to build models. He identified that there was a hole in the market: in order to bridge the gap between human and machine-learning capabilities, there was a need for accurately labeled datasets that could train AI models.

Wang recruited his coworker, product designer Lucy Guo, to help build this vision. The team’s mission was to build a platform that combined human intelligence with machine learning algorithms to create a reliable data training system for AI. At that time, AI development was limited by data labeling, annotation, and quality control, and Wang and Guo set out to address these limitations. Wang dropped out of MIT and Guo dropped out of Carnegie Mellon to build Scale AI.

Round: Secondary - Series C shares
Investors: Accel, Index, Founders Fund, Coatue, Spark, Tiger, Amazon
Date: Sept 2024

Questions

team@joinbeyond.co
