Today I’m going to summarise Scale AI’s 2024 AI Readiness Report.

The report focuses on what it takes for organisations to transition from merely adopting AI to actively optimising and evaluating it.

To understand the current state of AI development and adoption, Scale surveyed over 1,800 ML practitioners and leaders directly involved in building or applying AI solutions.

Key challenges identified include security concerns, lack of expertise, and insufficient benchmarks to effectively evaluate models. While improving operational efficiency was the top reason for adopting AI (cited by 79%), only half are measuring the business impact of their AI initiatives.

AI Year in Review

2023 saw rapid advancements in generative AI from leaders like OpenAI, Google, Anthropic, and Meta.

Key milestones included:

  • OpenAI's release of GPT-4 in March 2023, demonstrating human-level performance on a range of professional and academic benchmarks
  • Google's launch of Bard and PaLM 2
  • Anthropic's release of Claude 2 in the summer with a 100K context window
  • Meta's unveiling of Llama 2 and Code Llama
  • Google DeepMind's release of Gemini in late 2023, with Gemini Ultra reported to outperform human experts on the MMLU benchmark
  • Emergence of open source models like Falcon, Mixtral, and DBRX enabling local inference with less compute
  • Anthropic's launch of Claude 3 in March 2024, doubling the context window to 200K tokens
  • Cohere's release of Command R designed for scalability and long context

Frontier research also made significant strides in mathematical reasoning, model interpretability, and performance improvements with smaller models fine-tuned on high-quality data.

Key stats on the impact of generative AI:

  • 74% report that generative AI prompted them to create an AI strategy in 2024, up from 53% in 2023
  • 82% plan to increase investment in commercial closed-source models over the next 3 years
  • 85% consider AI very or highly critical to their business in the next 3 years, up from 77%
  • Only 4% have no plans to work with generative AI
  • 38% have generative AI models in production, a substantial increase from 19% in 2023

Applying AI

This section examines trends in enterprise AI adoption, model preferences, investment plans, use cases, and challenges.

Stage of adoption:

  • 22% have one model in production
  • 27% have multiple models in production
  • 49% still evaluating use cases or developing first model/application

Model preferences:

  • 58% use OpenAI's GPT-4, the latest model at the time of the survey; 44% use GPT-3.5
  • Google Gemini used by 39%
  • OpenAI overwhelmingly the preferred vendor

Planned model investments:

  • 72% increasing investment in commercial closed-source models
  • 67% increasing investment in open-source models

Top use cases driving adoption:

  1. Improved operational efficiency (61%)
  2. Improved customer experience (55%)
  3. Computer programming and content generation

Challenges:

  • Infrastructure, tooling, and solutions not meeting needs (61%)
  • Insufficient budget (54%)
  • Data privacy concerns (52%)

For the 60% who have not yet adopted AI, security concerns and lack of expertise were the top reasons holding them back. Software and internet companies cited other priorities taking precedence.

"RAG aims to address a key challenge with LLMs - while they are very creative, they lack factual understanding of the world and struggle to explain their reasoning. RAG tackles this by connecting LLMs to known data sources, like a bank's general ledger, using vector search on a database. This augments the LLM prompts with relevant facts."

Jon Barker, Customer Engineer, Google
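
To make the pattern Barker describes concrete, here is a minimal sketch of retrieval-augmented prompting. Everything in it is illustrative: the ledger entries are invented, and the toy bag-of-words "embedding" stands in for the learned embedding model and vector database a real RAG system would use.

```python
# Minimal sketch of the RAG pattern described above: embed documents,
# retrieve the nearest ones for a query, and prepend them to the prompt.
# The embedding here is a toy bag-of-words vector purely for illustration;
# a real system would use a learned embedding model and a vector database.
import math
from collections import Counter

documents = [
    "Account 4010: office supplies, $1,200 posted on 2024-03-01.",
    "Account 5020: travel expenses, $3,400 posted on 2024-03-05.",
    "Account 4010: office supplies, $800 posted on 2024-03-12.",
]

def embed(text: str) -> Counter:
    """Toy embedding: lowercase bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query: str) -> str:
    """Augment the user's question with retrieved context."""
    context = "\n".join(retrieve(query))
    return f"Answer using only these ledger entries:\n{context}\n\nQuestion: {query}"

print(build_prompt("How much was spent on office supplies?"))
```

The design point is that the model never has to memorise the ledger; retrieval injects the relevant facts into the prompt at query time.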

Building AI

This section explores the key pillars needed to build effective models, including model architecture innovations, computational resource trends, and the high-quality data imperative.

New neural network designs like sparse expert models are enabling larger, more capable models that efficiently activate only relevant subsets of neurons for each input. Example models leveraging these architectures include Falcon, Mixtral, DBRX and xAI's Grok.
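
As a rough illustration of how such sparse (mixture-of-experts) layers work, the sketch below routes each token to only the top-k experts. The dimensions, weights, and NumPy framing are toy assumptions, not any listed model's actual implementation.

```python
# Minimal sketch of top-k routing in a sparse mixture-of-experts layer:
# a gate scores every expert per token, and only the top-k experts run.
# Toy sizes in plain NumPy; real models wire this into transformer blocks
# (Mixtral, for instance, activates 2 of 8 experts per token).
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, k = 16, 8, 2

gate_w = rng.normal(size=(d_model, n_experts))            # router weights
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_layer(x: np.ndarray) -> np.ndarray:
    """x: (d_model,) token activation -> (d_model,) output."""
    logits = x @ gate_w                                   # score each expert
    top = np.argsort(logits)[-k:]                         # pick top-k experts
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over chosen
    # Only the chosen experts' parameters are touched for this token,
    # which is why capacity can grow without growing per-token compute.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.normal(size=d_model)
print(moe_layer(token).shape)  # (16,)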

The transition to GPU and TPU-centric AI workloads presents challenges:

  • 48% rate compute resource management as "most challenging" or "very challenging"
  • 38% cite lack of suitable AI-specific tools and frameworks as a major obstacle
"CPUs consume about 80% of IT workloads today. GPUs consume about 20%. That's going to flip in the short term, meaning 3 to 5 years. Many industry leaders that I've talked to at Google and elsewhere believe that in 3 to 5 years, 80% of IT workloads will be running on some type of architecture that is not CPU, but rather some type of chip architecture like a GPU."

Jon Barker, Customer Engineer, Google

Data is critical to building effective models. Survey results highlight:

  • Labeling quality is the top challenge in preparing training data (a common way to measure it is sketched after the quote below)
  • 55% leverage internal labeling teams
  • 50% engage specialized data labeling services
  • 29% use crowdsourcing
"Even if you train long enough with enough GPUs, you'll get similar results with any modern model. It's not about the model, it's about the data that it was trained with. The difference between performance is the volume and quality of data, especially human feedback data. You absolutely need it. That will determine your success."

Ashiqur Rahman, Machine Learning Researcher, Kimberly-Clark
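
One common way to quantify the labeling-quality challenge above is inter-annotator agreement. Below is a minimal Cohen's kappa sketch; the two annotators and their labels are invented for illustration.

```python
# Minimal sketch of Cohen's kappa, a standard inter-annotator agreement
# measure teams use to quantify labeling quality; the two annotators'
# labels below are invented for illustration.
from collections import Counter

ann_a = ["spam", "ham", "spam", "spam", "ham", "ham", "spam", "ham"]
ann_b = ["spam", "ham", "ham", "spam", "ham", "spam", "spam", "ham"]

def cohens_kappa(a: list[str], b: list[str]) -> float:
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n      # raw agreement rate
    ca, cb = Counter(a), Counter(b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(ca[label] * cb[label] for label in ca) / (n * n)
    return (observed - expected) / (1 - expected)

print(round(cohens_kappa(ann_a, ann_b), 3))  # 0.5: moderate agreement
```

Kappa near 1 means annotators agree far beyond chance; values near 0 suggest the labeling guidelines need tightening before the data is used for training.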

Going forward, key priorities include:

  • Acquiring domain-specific human-generated datasets
  • Investing in human-in-the-loop pipelines to refine model outputs
  • Collecting multimodal data spanning text, speech, images and video

Evaluating AI

This section dives into current model evaluation practices and challenges for both model builders and enterprises applying AI.

Top reasons for evaluating models:

  1. Reliability (68%)
  2. Performance (67%)
  3. Security (62%)
  4. Safety (54%)

Evaluation approaches:

  • Automated model metrics (61%)
  • Benchmarks (42%)
  • Human preference ranking (41%)
  • Human evaluation (41%)

Automated metrics and human preference ranking surfaced issues the fastest, with over 70% of teams using those approaches discovering problems within a week. Existing benchmarks still fall short, however: 48% of respondents report a lack of security benchmarks and 50% a lack of industry-specific benchmarks.
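
For a sense of how human preference ranking is often turned into per-model scores, here is a minimal Elo-style aggregation sketch. The models, judgments, and K factor are invented for illustration; the report does not prescribe any particular scheme.

```python
# Minimal sketch of aggregating pairwise human preferences into per-model
# scores with Elo updates, one common way "human preference ranking" is
# operationalized; all models and judgments here are invented.
K = 32  # update step size
ratings = {"model_a": 1000.0, "model_b": 1000.0, "model_c": 1000.0}

def update(winner: str, loser: str) -> None:
    """Shift ratings toward the observed preference, zero-sum."""
    expected_win = 1 / (1 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
    ratings[winner] += K * (1 - expected_win)
    ratings[loser] -= K * (1 - expected_win)

# Hypothetical judgments: each pair is (preferred output's model, the other).
for winner, loser in [("model_a", "model_b"), ("model_a", "model_c"),
                      ("model_b", "model_c"), ("model_a", "model_b")]:
    update(winner, loser)

print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```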

For model builders:

  • 87% evaluate models/applications
  • 46% have internal teams with dedicated test & evaluation platforms
  • 64% leverage internal proprietary platforms
  • 40% use third-party evaluation platforms

For enterprises applying AI:

  • 72% evaluate models/applications
  • 49% use internal proprietary platforms
  • 42% have internal teams using external evaluation platforms
  • 38% adopt third-party platforms
"Evaluating generative AI performance is complex due to evolving benchmarks, data drift, model versioning, and the need to coordinate across diverse teams. The key question is how the model performs on specific data and use cases... Centralized oversight of the data flow is essential for effective model evaluation and risk management in order to achieve high acceptance rates from developers and other stakeholders."

Babar Bhatti, AI Customer Success Lead, IBM

Gaps remain in current evaluation practices. Only about half of organizations are measuring the business impact of AI models on key outcomes like revenue and profitability. Performance and usability benchmarks, along with industry-specific standards, are needed as AI permeates different sectors.

Conclusion

The report concludes that optimization and evaluation are key to unlocking AI performance and ROI, whether organizations are building or applying the technology. The two most significant trends are:

  1. The growing need for model evaluation frameworks and private benchmarks
  2. Ongoing challenges optimizing models for specific use cases without sufficient tooling for data preparation, model training, and deployment

Scale reaffirms its mission to accelerate AI application development and its commitment to shedding light on the latest trends, challenges, and requirements for building, applying, and evaluating AI.
