Sony, Toshiba say Cell Processor makes computing more connected
(11/30/04, 09:46:52 PM GMT)
San Francisco, Ca. – More details are emerging
about the revolutionary “cell processor” that IBM, Sony, and Toshiba have been
hinting about for months.
At a press conference here on Monday, Nov.
29th, abstracts of the papers the companies will be presenting at the
International Solid State Circuits Conference in February of next year were
released, focusing on the hardware features of the radically new computer architecture.
As enlightening as they are about the hardware,
additional documentation that is also available makes it clear that what the
companies are after is not just a new CPU that can be used in a number of
different net-centric computing appliance applications.
Their aim is a fundamental reordering of
existing computer hardware and software architecture to reflect the realities of
the new pervasively connected computing environment.
The ISSCC abstracts reveal a multicore 64-bit Power CPU architecture with embedded streaming processors, high-speed I/O, SRAM and a fine-grained clock-gated double-precision multiplier.
Currently, target applications (Figure
1, below) for the Cell architecture depend on who is doing the talking. In gaming circles it is viewed
as the muscular gaming engine for Sony's new Playstation 3. But it has also
been promoted for use in set-top boxes, mobile devices and workstations. A
version of the Cell processor is already being used by Sony in workstations for game developers.
Each processing element consists of an IBM
Power-architecture 64-bit RISC CPU, a highly sophisticated direct-memory access
controller and up to eight identical streaming processors, all of which reside
on a very fast local bus.
Each processing element is also connected to others over parallel bundles of high-speed serial I/O links, each capable of a throughput of about 6.4 Gbits/second.
It seems to be conceptually similar to the
multi-plane architecture used in network processing units: in the control plane, the Power processors handle supervisory, I/O, interface and traditional computational tasks,
while in the data plane, the streaming processors -- self-contained SIMD units that operate autonomously once they are
launched -- focus on data movement.
Data and instructions are moved about via (1) a 128-kbyte local pipelined SRAM located between each stream processor and the local bus, (2) a bank of one hundred twenty-eight 128-bit registers and (3) a bank of four floating-point and four integer execution units. All
operate in single-instruction, multiple-data mode from one instruction stream.
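The single-instruction, multiple-data mode described above can be illustrated with a toy model: one instruction is applied across every lane of a wide register at once. The lane count and widths below are illustrative only, not the actual Cell register layout.

```python
# Toy model of SIMD execution: one instruction stream drives every
# lane of a wide register simultaneously.
LANES = 4  # e.g. four 32-bit lanes in a 128-bit register (assumed split)

def simd_add(reg_a, reg_b):
    """Apply one 'add' instruction to all lanes at once."""
    assert len(reg_a) == len(reg_b) == LANES
    return [a + b for a, b in zip(reg_a, reg_b)]

result = simd_add([1, 2, 3, 4], [10, 20, 30, 40])
```

One instruction, four results: that is the essence of the streaming processors' execution model.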
To make all processing resources appear in a
single pool under control of the system software and operate as a tightly coupled
multiprocessor, the hardware includes a new DMA controller design that allows
any processor in the system to access any bank of DRAM in a particular cell
module through a bank-switching arrangement.
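The idea behind the bank-switching DMA arrangement can be sketched as follows: any processor in the module issues transfers through a DMA controller that can address any DRAM bank, with no per-CPU ownership of banks. Class and method names here are invented for illustration; this is not the announced controller design.

```python
# Hypothetical sketch of bank-switched DMA: every processor can reach
# every DRAM bank through the controller, so all memory appears as one
# pool to the system software.
class DmaController:
    def __init__(self, num_banks):
        # Each bank modeled as a simple address -> value mapping.
        self.banks = {b: {} for b in range(num_banks)}

    def write(self, processor_id, bank, addr, value):
        # Any processor may target any bank; no static ownership.
        self.banks[bank][addr] = (processor_id, value)

    def read(self, bank, addr):
        return self.banks[bank][addr][1]
```

A write issued by one processor is immediately visible to reads against the same bank, which is what lets the system software treat the processors as a tightly coupled multiprocessor.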
Traditional CPU architectures
As impressive as the hardware design and performance are, they are driven by, and reflect, a fundamental rethinking of the
common programming model and software architecture from which most modern
standalone RISC architectures are derived.
According to the principal developers, the cell
processor architecture represents a fundamental shift to a new architectural
paradigm that reflects the new connected computing environment.
According to them, the RISC processors and controllers in current use were all conceived in the era before the Internet and World Wide Web became a mainstream phenomenon and were designed principally for standalone operation.
The sharing of data and application programs
over a computer network was not a principal design goal of these CPUs. And while they all have a common RISC heritage, the processor environment on the Internet is far from uniform.
Each CPU has its own particular instruction set architecture (ISA): its own particular set of assembly language instructions and its own structure for the principal computational and memory elements that execute those instructions.
Not only does this make a programmer's life more complicated, it increases the cost of application development, since the same application has to be rewritten to reflect not only each processor's ISA but also the physical constraints and the specific requirements of the device in which it is used, which in the new connected computing environment are extensive.
In addition to personal computers (PCs) and
servers, they point out, a diversity of computing devices have emerged,
including cellular telephones, mobile computers, personal digital assistants (PDAs),
set top boxes, digital televisions and many others. The sharing of data and applications among this assortment of computers and computing devices presents substantial problems.
Java is not enough
According to the inventors of the new
architecture, a number of techniques in the past have been employed to
overcome these problems, including sophisticated interfaces and complicated
programming techniques, all of which require substantial increases in processing
power to implement. The result has been a substantial increase in the time required to
process applications and to transmit data over networks.
One way around this that is commonly employed
is to transmit the data and the applications code separately over the Internet.
While this approach minimizes the amount of bandwidth needed, it also often
causes frustration among users.
The correct application, or the most current
application, for the transmitted data may not be available on the client's
computer. This approach also requires the writing of multiple versions of each
application for each CPU ISA used on the network.
The Java Virtual Machine “write one, run
everywhere” model, they point out -- which uses a platform independent virtual
machine written in interpretive form, rather than compiled to make maximum use of
each target processor’s resources -- is a partial and increasingly unsuccessful
attempt to solve this problem.
And it will become more inadequate as real-time, multimedia network applications become more pervasive, they point out. Such net-centric applications will require many thousands of megabits of data per second, and the Java programming model makes reaching such processing speeds extremely difficult.
Therefore, a new
network-optimized computer architecture and a new programming model are required,
they believe, to overcome the problems of
sharing data and applications among the various members of a network without
imposing added computational burdens. This new computer architecture and
programming model also should overcome the security problems inherent in sharing
applications and data among the members of a network.
“Software cells” turn Java Upside Down
At the core of the new connected computing
architecture the companies have developed is a new “software cell”-based programming
model for transmitting data and applications over a network and among the network's members. In one sense, it turns the
Java model on its head.
While it can operate in the Java mode, which downloads a platform-independent program to run on a node, it can also be described as a
"write once, reside anywhere and participate everywhere" programming model.
Another way to look at the software cell model
is as the Web Services paradigm writ small, in that an application does not have
to depend only on the resources resident on the hardware where it resides but
can incorporate services from external resources to accomplish its task.
It also differs from the traditional approach in that it combines
application and data in the same deliverable "software cell," or apulet, designed for transmission
over the network for processing by any processor on the network.
The code for the applications preferably is
based upon the same common instruction set and ISA. Each software cell
preferably contains a global identification (global ID) and information
describing the amount of computing resources required for the cell's processing.
Since all computing resources have the same
basic structure and employ the same ISA, the particular resource performing this
processing can be located anywhere on the network and dynamically assigned.
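The software cell structure described above, an apulet carrying its code and data together with a global ID and a statement of required resources, can be sketched roughly as below. The field names are guesses for illustration, not the patented cell format.

```python
# Illustrative shape of a "software cell" (apulet): code and data travel
# together, tagged with a network-unique ID and the computing resources
# the cell needs, so any node on the network can decide to accept it.
from dataclasses import dataclass

@dataclass
class SoftwareCell:
    global_id: str       # network-unique identifier
    required_apus: int   # resources the cell declares it needs
    code: bytes          # application code (common ISA assumed)
    data: bytes = b""    # the data the code operates on

def can_run(cell, free_apus):
    """A node may accept a cell only if it can meet its stated needs."""
    return free_apus >= cell.required_apus
```

Because every node speaks the same ISA, dynamic assignment reduces to this resource check: a server with many free APUs and a PDA with one are distinguished only by capacity, not by compatibility.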
Identical, scalable hardware resources
Making the "software cell" approach work, however, requires the use of a modular hardware architecture from which all members of
the network (Figure One, above) -- clients, servers, PCs, mobile
computers, game machines, PDAs, set top boxes, appliances, digital televisions
-- can be constructed.
This common computing module requires a consistent structure and
preferably the same ISA. In this approach, the only difference between a PDA or mobile
phone and a server is the number of resources available locally in the hardware module for
execution of the software cell.
Even if the resources are not available locally, that does not mean that a PDA could not run the application. Since the hardware modules and software cells are identical in structure, if the network bandwidth and the application requirements allowed it, a software cell could be executed remotely and its results delivered locally to provide the functionality the PDA requires.
The consistent modular structure, the
developers point out, also enables efficient, high speed processing of
applications and data by the network's members and the rapid transmission of
applications and data over the network. It also simplifies the building of
members of the network of various sizes and processing power and the preparation
of applications for processing by these members.
The basic processing module
(Figure 2, above) includes a processor element (PE), which consists of a processing unit (PU);
a direct memory access controller (DMAC); and a number of attached processing
units (APUs). In the case of the hardware implementation to be described at the ISSCC, each PE consists of an IBM Power CPU core, and the APUs are dedicated streaming processors.
Typically a single PE would consist of one PU
and up to eight APUs which interact with a shared dynamic random access memory
(DRAM) using a cross-bar architecture. The PU schedules and orchestrates the
processing of data and applications by the APUs. The APUs perform this
processing in a parallel and independent manner. The DMAC controls accesses by
the PU and the APUs to the data and applications stored in the shared DRAM. The number of PEs used in any particular network connected appliance device depends
on the processing power required locally. A server may use four PEs, while a workstation may employ two PEs and a PDA
may require only one PE. The number of APUs of a PE assigned to processing a
particular software cell depends upon the complexity and magnitude of the
programs and data within the cell (Figure 3, below).
New hardware building blocks
To make this architecture work, radical new
approaches have had to be developed for almost every aspect of a computer
system: DRAM, DMAC, synchronization, bus and I/O architecture, security, remote
procedure command sequencing, and timing.
Currently the companies have applied for and/or
been granted nine patents covering almost every aspect of the hardware design,
details of which will be described in more depth in February at the ISSCC.
Typically, however, the shared DRAM is
configured into sixty-four memory banks, each of which has one megabyte of
storage capacity. Each section of the DRAM is controlled by a bank controller,
and each DMAC has equal access to each bank controller. What this allows, the
developers say, is access by the DMAC to any portion of the shared DRAM.
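The described layout of sixty-four 1-Mbyte banks implies a simple mapping from a flat address to a (bank, offset) pair. The split below is an assumed interpretation for illustration, not a published memory map.

```python
# Sketch of the described shared-DRAM layout: 64 banks of 1 Mbyte each,
# every DMAC able to reach every bank controller. A flat address is
# decomposed into which bank it hits and the offset within that bank.
NUM_BANKS = 64
BANK_SIZE = 1 << 20  # 1 Mbyte

def locate(address):
    """Map a flat address into (bank, offset-within-bank)."""
    assert 0 <= address < NUM_BANKS * BANK_SIZE
    return address // BANK_SIZE, address % BANK_SIZE
```

Since every DMAC has equal access to every bank controller, any portion of the 64-Mbyte shared space is reachable from any processor by this kind of decomposition.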
The synchronization system developed by the companies to allow an APU to read data from, and write data to, the shared DRAM is designed to avoid conflicts among the multiple APUs and multiple PEs sharing the DRAM.
This is done by setting aside an area of DRAM for storing full-empty bits, each of which corresponds to a designated area of the DRAM. Because it is integrated into the DRAM, the synchronization system avoids the computational overhead of a data synchronization scheme implemented in software.
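The full-empty bit handshake can be modeled in a few lines: a guarded area may be written only when "empty" and read only when "full". This is a simplified software analogue of what the article says is done in the DRAM hardware.

```python
# Toy model of full-empty bit synchronization for one guarded DRAM area.
# A writer may fill the area only when the bit is "empty"; a reader may
# consume only when it is "full". Consuming flips the bit back to empty.
class SyncedArea:
    def __init__(self):
        self.full = False
        self.value = None

    def write(self, value):
        if self.full:
            return False      # writer must wait: data not yet consumed
        self.value, self.full = value, True
        return True

    def read(self):
        if not self.full:
            return None       # reader must wait: no fresh data
        self.full = False
        return self.value
```

The bit acts as a one-slot producer-consumer lock per area, which is how multiple APUs and PEs can share the DRAM without racing each other.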
Cell has on-chip security "sandboxes"
To deal with security issues, “sandboxes” are incorporated into the DRAM to protect the data of a program being processed by one APU from corruption by a program being processed by another APU.
Each sandbox defines an area of the shared DRAM beyond which a particular APU,
or set of APUs, cannot read or write data.
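The sandbox rule stated above amounts to a bounds check on every access: each APU, or set of APUs, is bound to a window of shared DRAM and refused outside it. The sketch below is a software-flavored stand-in for what the memory hardware is said to enforce.

```python
# Minimal sketch of a DRAM sandbox: an APU (or APU group) may only
# touch addresses inside its assigned [base, base + size) window.
class Sandbox:
    def __init__(self, base, size):
        self.base, self.size = base, size

    def allows(self, address):
        """True if the address falls inside this sandbox's window."""
        return self.base <= address < self.base + self.size
```

Accesses one byte past the window fail the check, which is exactly the isolation property that keeps one APU's program from corrupting another's data.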
The new hardware module architecture also
handles remote procedure calls in a different way. They are issued by a main PU
to the APUs to initiate processing of applications and data. These commands,
called APU remote procedure calls (ARPCs), enable the PUs to orchestrate and
coordinate the APUs' parallel processing of applications and data without the
APUs performing the role of co-processors.
Considerable new work has gone into the
development of a dedicated pipeline structure for the processing of streaming
data. With this structure, a coordinated group of APUs, and a coordinated group
of memory sandboxes associated with these APUs, are established by a PU for the
processing of data. The pipeline's dedicated APUs and memory sandboxes remain
dedicated to the pipeline during periods that the processing of data does not
occur and are placed in a reserved state during these periods.
Timing is of the essence in this new approach
to connected computing. So the companies have developed -- and patented -- a new
absolute timer design that is independent of the frequency of the clocks
employed by the APUs for the processing of applications and data.
Applications are written based upon the time
period for tasks defined by the absolute timer. If the frequency of the APU
clocks increases because of enhancements to the APUs, the time period for a
given task as defined by the absolute timer remains the same.
What this scheme allows, the developers said, is the use of enhanced processing times by newer versions of the APUs without preventing those newer APUs from processing older applications written for the slower processing times of older APUs.
The new architecture also required the
development of an alternate scheme for allowing newer, faster APUs to process
older applications written for the slower processing speeds of older APUs.
The approach the developers of the architecture
have taken is to analyze, in real time, the particular instructions or microcode
employed by the APUs in processing these older applications for problems in the
coordination of the APUs' parallel processing created by the enhanced speeds.
"No operation" ("NOOP") instructions are then
inserted into the instructions executed by some of these APUs to maintain the
sequential completion of processing by the APUs expected by the program. By
inserting these NOOPs into these instructions, the developers point out, the
correct timing for the APUs' execution of all instructions is maintained.
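The NOOP-insertion step can be illustrated as padding: if a faster APU would finish its instruction slice in fewer cycles than the original program expects, NOOPs are appended so the sequential completion order across APUs is preserved. The one-cycle-per-instruction assumption below is a simplification for illustration.

```python
# Toy illustration of NOOP padding for backward timing compatibility:
# pad an instruction slice so it consumes the cycle count the older
# program expects, assuming (for simplicity) one cycle per instruction.
def pad_with_noops(instructions, expected_cycles, cycles_per_instr=1):
    actual = len(instructions) * cycles_per_instr
    deficit = max(0, expected_cycles - actual)
    return instructions + ["NOOP"] * (deficit // cycles_per_instr)
```

A two-instruction slice expected to take five cycles gains three NOOPs; a slice already at or over budget is left alone, so the coordination the original program relied on is maintained without slowing anything that need not wait.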
Moving ahead with Cell
In addition to Sony's PlayStation and workstation plans for Cell, IBM plans to begin pilot production of Cell-based microprocessors during the first half of next year, and Toshiba is planning to launch a Cell-based product next year.
While the abstracts do not go into much detail on throughput, the clock rate of the streaming-processor/SRAM block has been estimated at about 4.8 GHz, while a four Power CPU-element Cell module would have a performance of about one teraflops.
The companies will be presenting five papers at
the ISSCC. Focusing on key concepts of Cell architecture is "The Design and
Implementation of a First-Generation Cell Processor" (session 10.2). Other
papers are "A Streaming Processing Unit for a Cell Processor" (session 7.4) and
"A 4.8GHz Fully Pipelined Embedded SRAM in the Streaming Processor of a Cell
Processor" (session 26.7).
Two additional papers on the Cell design
include "A Double-Precision Multiplier with Fine-Grained
Clock-Gating Support for a First-Generation Cell Processor" (session 20.3) by
IBM, and "Clocking and Circuit Design for a Parallel I/O on a First-
Generation Cell Processor" (session 28.9) by Rambus Inc and Stanford University.