A Harvard dropout creates an AI chip company: raises $120 million, challenges Nvidia

Recently, AI chip startup Etched announced that it has raised $120 million to challenge Nvidia in AI chip design.

Etched is designing a new chip called Sohu, built to handle one key piece of AI workloads: the Transformer. The company says that by etching the Transformer architecture into the chip itself, it is building the world's most powerful Transformer inference server, and it claims Sohu is the fastest Transformer chip ever made.

Primary Venture Partners and Positive Sum Ventures led the round, with participation from institutional investors including Hummingbird, Fundomo, Fontinalis, Lightscape, Earthshot, Two Sigma Ventures (strategic), and Skybox Data Centers (strategic).

Notably, the company's angel investors include Peter Thiel, Stanley Druckenmiller, David Siegel, Balaji Srinivasan, Amjad Masad, Kyle Vogt, Kevin Hartz, Jason Warner, Thomas Dohmke, Bryan Johnson, Mike Novogratz, Immad Akhund, Jawed Karim, and Charlie Cheever.

Alex Handy, Director of the Thiel Fellowship, said in a statement: "Investing in Etched is a strategic bet on the value of artificial intelligence. Their chip solves the scalability issues that competitors dare not address, challenging the stagnation that prevails in the industry. The founders of Etched embody the unconventional talent we back: dropping out of Harvard to take on the semiconductor industry. They have done the hard work so that everyone else in Silicon Valley can keep coding in peace, without having to worry about the hardware underneath."

Transformers dominate the world, and GPUs hit a wall

As everyone knows, AI workloads have so far run on GPUs. But Etched argued in a blog post that Santa Clara's secret is that GPUs have not been getting better, just bigger: for four years, compute density per unit of chip area (TFLOPS per mm²) has been almost flat.

They noted that NVIDIA's B200, AMD's MI300, Intel's Gaudi 3, and Amazon's Trainium2 all count two dies as one card to "double" performance. From 2022 to 2025, AI chips did not really get better; they got bigger, and every GPU performance gain in that window leans on this trick. Etched is the exception.

Before Transformers dominated, many companies built flexible AI chips and GPUs to handle hundreds of different architectures. Some examples:

NVIDIA's GPUs, Google's TPUs, Amazon's Trainium, AMD's accelerators, Graphcore's IPUs, SambaNova SN Series, Cerebras's CS-2, Groq's GroqNode, Tenstorrent's Grayskull, D-Matrix's Corsair, Cambricon's Siyuan, and Intel's Gaudi.

No one had ever built an AI chip hard-wired to a single algorithm (an ASIC). A chip project costs between $50 million and $100 million and takes years to reach production. When Etched started, there was no market for one.

Suddenly, the situation changed:

Unprecedented demand: Before ChatGPT, the Transformer inference market was worth about $50 million; now it is worth several billion dollars. Every major technology company runs Transformer models (OpenAI, Google, Amazon, Microsoft, Facebook, etc.).

Convergence in architecture: AI models used to vary widely. But since GPT-2, the architecture of state-of-the-art models has barely changed. OpenAI's GPT series, Google's PaLM, Facebook's LLaMa, and even Tesla FSD are all Transformers.

When training a model costs over $1 billion and inference costs over $10 billion, dedicated chips are inevitable. At that scale, a 1% improvement pays for a $50 million to $100 million custom chip project.
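As a sanity check, the break-even arithmetic implied here is simple enough to compute directly (all figures are the article's own, not independent data):

```python
# Back-of-the-envelope check of the "1% improvement justifies a custom chip"
# claim, using only the figures quoted in the article.
inference_spend = 10e9            # annual inference spend at this scale, USD
improvement = 0.01                # a 1% efficiency gain from a dedicated chip
project_cost_low, project_cost_high = 50e6, 100e6  # custom chip project, USD

annual_savings = inference_spend * improvement
print(f"1% of $10B inference spend = ${annual_savings / 1e6:.0f}M per year")

# Even the high end of the project cost is recouped within roughly a year.
print(annual_savings >= project_cost_high)
```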

In fact, ASICs are orders of magnitude faster than GPUs: when Bitcoin ASIC miners hit the market in 2014, it became cheaper to throw GPUs away than to mine Bitcoin with them.

With tens of billions of dollars at stake, the same will happen in artificial intelligence.

Transformer models are also strikingly similar to one another: tweaks such as SwiGLU activations and RoPE positional encodings show up everywhere, in LLMs, embedding models, image inpainting, and video generation.

Although GPT-2 and Llama-3 are state-of-the-art (SoTA) models separated by five years, their architectures are almost identical. The only major difference is scale.

Etched believes in the hardware lottery: The winning models are those that run the fastest and cheapest on hardware. Transformers are powerful, practical, and profitable enough to dominate every major AI computing market before alternatives emerge:

Transformers power every large AI product: from agents to search to chat. AI labs have spent hundreds of millions on R&D to optimize GPUs for Transformers. Both current and next-generation SoTA models are Transformers.

As models expand from billion-dollar to hundred-billion-dollar to trillion-dollar training runs in the coming years, the risk of testing new architectures soars. Instead of re-testing scaling laws and performance, it is better to spend time building features on top of Transformers, such as multi-token prediction.

Today's software stack is optimized for Transformers. Every popular library (TensorRT-LLM, vLLM, Huggingface TGI, etc.) has special kernels for running Transformer models on GPUs. Many features built on transformers are not easily supported in alternatives (e.g., speculative decoding, tree search).

The future hardware stack will be optimized for Transformers. NVIDIA's GB200 has dedicated Transformer support (the Transformer Engine). ASICs like Sohu entering the market is a one-way street: for a "Transformer killer" to win on GPUs, it would have to run faster there than Transformers run on Sohu. And if that happens, we will build an ASIC for it too!

Two Harvard dropouts founded a chip company

As generative AI reaches more and more industries, the companies making the chips that run these models are reaping the rewards, none more than NVIDIA, which holds roughly 70% to 95% of the AI chip market. Cloud providers from Meta to Microsoft have poured tens of billions of dollars into NVIDIA GPUs for fear of falling behind in the generative AI race.

Understandably, generative AI vendors are unhappy with this state of affairs, since their fortunes hinge on the whims of the dominant chipmakers. So, together with opportunistic venture capital firms, they are backing promising startups to challenge the AI chip giants.

Etched is one of many alternative chip companies vying for a piece of the pie, and one of the most interesting. Founded just two years ago by Harvard dropouts Gavin Uberti (formerly of OctoML and Xnor.ai) and Chris Zhu, together with Robert Wachen and former Cypress Semiconductor CTO Mark Ross, Etched is trying to build a chip that does exactly one thing: run AI models.

That alone is not unusual: many startups and tech giants are developing chips dedicated to running AI models, known as inference chips. Meta has MTIA, Amazon has Inferentia, and so on. What makes Etched's chip unique is that it runs only one type of model: the Transformer.

The Transformer was proposed by a Google research team in 2017 and has since become the dominant architecture for generative AI models.

Transformers are the foundation of OpenAI's video generation model Sora. They are the core of text generation models such as Anthropic's Claude and Google's Gemini. They also power art generators like the latest version of Stable Diffusion.

In a new blog post, Etched's founders stated that the company made its biggest bet on AI in June 2022 when it bet on a new AI model to take over the world: the Transformer.

In Etched's view, within five years, AI models will be smarter than humans in most standardized tests.

How could this happen? Because the computational power used by Meta to train Llama 400B (2024 SoTA, smarter than most humans) is 50,000 times that used by OpenAI on GPT-2 (2019 SoTA).

By providing AI models with more computing power and better data, they become smarter. Scale is the only secret that has consistently worked for decades, and every major AI company (Google, OpenAI/Microsoft, Anthropic/Amazon, etc.) will invest over $100 billion in the next few years to maintain scale. We are in the midst of the largest infrastructure construction in history.

But scaling another 1,000x will be very expensive. The next generation of data centers will cost more than the GDP of a small country; at the current pace, our hardware, our power grids, and our wallets cannot keep up.

We are not worried about running out of data. Whether through synthetic data, annotation pipelines, or new AI-labeled data sources, we believe the data problem is really an inference-compute problem. Mark Zuckerberg, Dario Amodei, and Demis Hassabis seem to agree.

"In 2022, we bet that Transformers will dominate the world," said Etched CEO Uberti in an interview with TechCrunch. "In the development of artificial intelligence, we have reached a point where dedicated chips that outperform general-purpose GPUs are inevitable - technology decision-makers around the world know this."

At the time, there were many kinds of AI models: CNNs for self-driving cars, RNNs for language, U-Nets for generating images and videos. The Transformer (the "T" in ChatGPT) was the first architecture that scaled.

"We bet that if intelligence keeps scaling with compute, within a few years companies will be spending billions of dollars on AI models, all running on dedicated chips," CEO Gavin Uberti wrote in a blog post. "We have spent two years building the world's first Transformer-specific chip (ASIC), called Sohu. We have etched the Transformer architecture into our silicon, and we cannot run traditional AI models: the DLRMs behind your Instagram feed, protein-folding models from biology labs, or linear regressions from data science."

A 4nm chip called "Sohu"

Etched's chip, named Sohu, is an ASIC (Application-Specific Integrated Circuit). Uberti claims that Sohu, manufactured using TSMC's 4nm process, can provide better inference performance than GPUs and other general-purpose AI chips while consuming less energy.

Uberti said, "When running text, image, and video transformers, Sohu is even an order of magnitude faster than Nvidia's next-generation Blackwell GB200 GPU, and it's cheaper. A single Sohu server can replace 160 H100 GPUs... For business leaders who need dedicated chips, Sohu will be a more economical, efficient, and environmentally friendly choice."

Uberti added, "We also cannot run CNNs, RNNs, or LSTMs. But for Transformers, Sohu is the fastest chip ever built. It has no competitors. Sohu is an order of magnitude faster and cheaper than even Nvidia's next-generation Blackwell (GB200) GPU for text, audio, image, and video Transformers."

Uberti noted that since the company's founding, every major AI model (ChatGPT, Sora, Gemini, Stable Diffusion 3, Tesla FSD, etc.) has become a Transformer. But if the Transformer were suddenly displaced by an SSM, a Monarch Mixer, or some other architecture, Etched's chips would be useless.

"But if we are right, Sohu will change the world," Uberti said confidently.

Through specialization, Sohu has achieved unprecedented performance. An 8xSohu server can process more than 500,000 Llama 70B tokens per second.

According to the company, Sohu supports only Transformer inference, whether for Llama or Stable Diffusion 3. Sohu supports all of today's models (Google, Meta, Microsoft, OpenAI, Anthropic, etc.) and can handle tweaks to future ones.

Because Sohu runs only one algorithm, the vast majority of control-flow logic can be stripped out, leaving room for many more math blocks. As a result, Sohu achieves over 90% FLOPS utilization, compared with about 30% on a GPU running TRT-LLM.

How is such powerful performance achieved?

How does Sohu achieve all of this? In several ways, but the most obvious (and most intuitive) is simplifying the inference hardware and software pipeline. Because Sohu never runs non-Transformer models, the Etched team can drop the hardware components Transformers do not need and cut the software overhead traditionally spent deploying and serving other model types.

Etched said in a blog post that the NVIDIA H200 has 989 TFLOPS of non-sparse FP16/BF16 compute. That is state of the art (better even than Google's new Trillium chip), and the GB200, launching in 2025, raises it by only 25% (1,250 TFLOPS per die).

Because the vast majority of the area of the GPU is used for programmability, focusing on transformers allows you to perform more calculations. You can prove this to yourself from first principles:

A single FP16/BF16/FP8 multiply-accumulate circuit, the building block of all matrix math, takes about 10,000 transistors. The H100 SXM has 528 tensor cores, each with 4 × 8 × 16 FMA circuits. Multiply it out, and the H100 has about 2.7 billion transistors dedicated to tensor cores.

But the H100 has 80 billion transistors in total! That means only 3.3% of the transistors on an H100 GPU do matrix multiplication.
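The transistor arithmetic above can be reproduced in a few lines. The per-FMA transistor count and tensor-core dimensions are the article's figures, not official NVIDIA numbers:

```python
# Reproduce the article's estimate of how much of an H100 die does matrix math.
transistors_per_fma = 10_000        # per FP16/BF16/FP8 multiply-accumulate circuit
tensor_cores = 528                  # H100 SXM
fmas_per_core = 4 * 8 * 16          # FMA circuits per tensor core

tensor_core_transistors = tensor_cores * fmas_per_core * transistors_per_fma
total_transistors = 80e9            # whole H100 die

print(f"tensor-core transistors: {tensor_core_transistors / 1e9:.1f}B")
# A little over 3% of the die, in line with the ~3.3% cited above.
print(f"share of die: {tensor_core_transistors / total_transistors:.2%}")
```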

This is a deliberate design decision by NVIDIA and the other makers of flexible AI chips: if you want to support many kinds of models (CNNs, LSTMs, SSMs, etc.), there is no better way.

By running only Transformers, Etched can pack far more FLOPS onto the chip without resorting to lower precision or sparsity.

But isn't inference bottlenecked by memory bandwidth rather than compute? For modern models like Llama-3, the answer is no.

Take the standard benchmark NVIDIA and AMD use: 2048 input tokens and 128 output tokens. Most AI products have prompts far longer than their completions (even the new Claude chat carries over 1,000 tokens of system prompt).

On GPUs and on Sohu, inference runs in batches: each batch loads all the model weights once and reuses them for every token in the batch. In general, LLM input processing is compute-intensive while LLM output generation is memory-intensive, but when input and output tokens are combined through continuous batching, the workload becomes heavily compute-intensive.

Here is an example of LLM continuous batching, running sequences with four input tokens and four output tokens each (in the original figure, each color is a different sequence).
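The idea is easier to grasp as a simulation. The sketch below is illustrative only (the slot-refilling scheduler is a simplification, not Etched's or any serving library's actual implementation): it counts forward passes when a finished sequence immediately hands its batch slot to a waiting one, versus static batches that each wait for their longest member.

```python
from collections import deque

def continuous_steps(output_lens, slots):
    """Forward passes when a finished sequence's slot is refilled immediately."""
    waiting = deque(output_lens)
    active = []                      # tokens still to generate, per active sequence
    steps = 0
    while waiting or active:
        while waiting and len(active) < slots:
            active.append(waiting.popleft())   # admit new sequences into free slots
        active = [n - 1 for n in active]       # one pass = one token per sequence
        active = [n for n in active if n > 0]  # finished sequences free their slot
        steps += 1
    return steps

def static_steps(output_lens, slots):
    """Forward passes when each fixed batch waits for its longest sequence."""
    return sum(max(output_lens[i:i + slots])
               for i in range(0, len(output_lens), slots))

lens = [4, 1, 2, 3, 4, 1, 2, 3]         # output lengths of eight sequences
print(continuous_steps(lens, slots=4))  # 6
print(static_steps(lens, slots=4))      # 8
```

With uniform lengths the two schemes tie; the gap grows as lengths vary, which is why real serving stacks batch continuously.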

We can extend the same trick to run Llama-3-70B with 2048 input tokens and 128 output tokens. Let each batch contain 2048 input tokens of one sequence and 127 output tokens of 127 different sequences.

Do this, and each batch requires roughly (2048 + 127) × 70B parameters × 2 FLOPs per parameter = 304 TFLOPs of compute, while loading only 70B parameters × 2 bytes per parameter = 140 GB of model weights and about 127 × 64 × 8 × 128 × (2048 + 127) × 2 × 2 = 72 GB of KV cache. That is far more compute than memory traffic: an H200 would need 6.8 PFLOPS of compute to saturate its memory bandwidth, and that is at 100% utilization; at 30% utilization, a third of that compute would already saturate it.

Because Sohu has such enormous compute and such high utilization, it can run at enormous throughput without hitting a memory-bandwidth bottleneck.
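The batch arithmetic can be checked directly. All constants below are the article's (the KV-cache shape factors follow its formula verbatim), and the ~4.8 TB/s H200 bandwidth is an assumption consistent with the 6.8 PFLOPS figure:

```python
# Check the Llama-3-70B batch arithmetic from the text.
in_tok, out_tok = 2048, 127      # one prefill sequence + 127 decoding sequences
params = 70e9
tokens = in_tok + out_tok

flops = tokens * params * 2                       # ~304 TFLOPs of work per batch
weight_bytes = params * 2                         # FP16 weights, loaded once: ~140 GB
kv_bytes = 127 * 64 * 8 * 128 * tokens * 2 * 2    # article's KV formula: ~72 GB

print(f"compute: {flops / 1e12:.0f} TFLOPs")
print(f"memory traffic: {(weight_bytes + kv_bytes) / 1e9:.0f} GB")

intensity = flops / (weight_bytes + kv_bytes)     # FLOPs per byte moved
bandwidth = 4.8e12                                # assumed H200 HBM bandwidth, B/s
print(f"compute to saturate bandwidth: {intensity * bandwidth / 1e15:.1f} PFLOPS")
```

At 100% utilization, the ratio implies roughly 6.8 to 6.9 PFLOPS to keep a ~4.8 TB/s part busy, an order of magnitude above the H200's 989 TFLOPS, so this batch is compute-bound, as the text says.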

In the real world, batches are much larger, with input lengths varying and requests arriving according to a Poisson distribution. This technique works better in these situations, but we use the 2048/128 benchmark in this example because NVIDIA and AMD use it.

It is well known that on GPUs and TPUs, software is a nightmare. Handling arbitrary CUDA and PyTorch code requires a very complex compiler. Third-party AI chips (AMD, Intel, AWS, etc.) have spent billions of dollars on software with little to show for it.

But because Sohu only runs transformers, we only need to write software for transformers!

Most companies that run open-source or proprietary models use transformer-specific inference libraries, such as TensorRT-LLM, vLLM, or HuggingFace's TGI. These frameworks are very rigid - although you can adjust model hyperparameters, there is no actual support for changing the underlying model code. But that's okay - because all transformer models are very similar (even text/image/video models), and adjusting hyperparameters is all you really need.

While this supports 95% of AI companies, the largest AI labs adopt customization. They have teams of engineers manually tuning GPU kernels to squeeze out a little more utilization, reverse-engineering which registers have the lowest latency for each tensor core.

With Etched, you no longer need to reverse-engineer anything, because Etched's software, from drivers to kernels to the serving stack, will all be open source. If you want to implement a custom Transformer layer, your kernel engineers are free to do so.

Etched will become the world's first

Uberti said that every large, homogeneous compute market eventually ends with a dedicated chip: networking, Bitcoin mining, and high-frequency trading algorithms have all been hard-coded into silicon.

Those chips are orders of magnitude faster than GPUs. No one mines Bitcoin on GPUs; they simply cannot compete with dedicated Bitcoin miners. The same will happen in artificial intelligence. With trillions of dollars at stake, Uberti said, specialization is inevitable.

"We believe the vast majority of spending (and value) will go to models with more than 10 trillion parameters. Thanks to the economies of scale of continuous batching, these models will run in the cloud on one of a few dozen MegaClusters," said Uberti. "The trend will mirror chip fabs: there used to be hundreds of cheap, low-resolution fabs, but today a leading-edge fab costs roughly $20 to $40 billion to build. There are only a few MegaFabs in the world, and they all use very similar underlying technology (EUV, 858 mm² reticles, 300 mm wafers, etc.)."

Etched argues that switching away from the Transformer would be enormously costly. Even if an architecture better than the Transformer were invented, the friction of rewriting kernels, rebuilding speculative decoding, designing new specialized hardware, re-validating scaling laws, and retraining teams is immense. Uberti said such a shift will happen only once or twice a decade, much as in chipmaking: lithography, reticle and wafer sizes, and photoresist chemistry do keep changing, but very slowly.

"The more we scale AI models, the more we will focus on model architecture. Innovation will happen elsewhere: speculative decoding, tree search, and new sampling algorithms," said Uberti. "In a world where the cost of training a model is 10 billion dollars and the cost of chip manufacturing is 50 million dollars, specialized chips are inevitable. The company that manufactures them first will win."

Etched asserts that no one has ever built an AI chip for a single architecture; even a year ago, it would have made no sense. An architecture-specific chip requires both huge demand and a firm belief in the architecture's longevity.

Uberti stated: "We are betting on the Transformer, and both requirements are becoming a reality."

The company points out that demand has reached unprecedented levels: the Transformer inference market started at under $50 million and has now passed $5 billion. Every major technology company runs Transformer models (OpenAI, Google, Amazon, Microsoft, Facebook, etc.).

Uberti said that they are seeing an architectural convergence: in the past, AI models would change a lot. But since GPT-2, the architecture of the most advanced models has remained almost unchanged. OpenAI's GPT series, Google's PaLM, Facebook's LLaMa, and even Tesla FSD are all Transformers.

Uberti said that the company is working at an extremely fast pace to make Sohu a reality.

Uberti emphasized: "We are running one of the fastest chip projects ever, going from architecture to validated silicon on a 4nm process. We are working directly with TSMC and dual-sourcing HBM3E from two top suppliers. We have tens of millions of dollars in bookings from AI and foundation-model companies, and ample supply-chain capacity to scale. If our bet is right and we execute, Etched will become one of the largest companies in the world."

The company reiterated that if its prediction is right, Sohu will change the world.

Today, AI coding agents cost $60 per hour in compute and take hours to finish tasks, more than a human software engineer costs. Gemini takes over 60 seconds to answer a question about a video. Video models generate one frame per second, and OpenAI exhausted its GPU capacity when ChatGPT reached just 10 million registered users (only 0.15% of the world's population).

We cannot solve this problem - even if we continue to manufacture larger GPUs at a rate of 2.5 times every two years, it will take a decade to achieve real-time video generation.

Imagine what happens if AI models suddenly become 20x faster and cheaper. With Sohu, real-time video, audio, agents, and search finally become possible; Uberti said the unit economics of every AI product would flip overnight.

The company's early customers have reportedly reserved tens of millions of dollars' worth of hardware.

When asked how a small company like Etched can defeat Nvidia, Etched Chief Operating Officer and co-founder Robert Wachen said in an email to VentureBeat:

"In the past, the AI compute market was fragmented: people used different types of models, such as CNNs, DLRMs, LSTMs, RNNs, and dozens of others across domains. Spending on each architecture was only in the tens to hundreds of millions of dollars, so no single workload was big enough to justify its own silicon, and general-purpose chips (GPUs) prevailed," Wachen said.

He pointed out that the market is rapidly consolidating around one architecture, the Transformer. In a world where billions of dollars are spent on Transformer models and a custom chip costs between $50 million and $100 million, specialized chips are inevitable.

"Our chip cannot beat the GPU on most workloads; we cannot support them at all. But for Transformer inference, which powers every major generative AI product, we will dominate the market. With this degree of specialization, our chip is an order of magnitude faster than the next-generation Blackwell GPU," Wachen said.
