I remember hearing somebody talk about programming hot loops for either the PS3 or PS2 in Excel, to get a good handle on the concurrency question by having assembler in multiple columns next to each other.
That would be the PS2’s VUs, which had an upper and a lower pipe, and it was easier to write instructions for each in separate columns. Then in one SDK we received a program called vcl, which took a single list of instructions and did all the pipelining for you, as well as optimizing loops and assigning registers automatically. It was a godsend.
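To make the two-column idea concrete, here is a toy sketch in Python that packs an in-order instruction list into (upper, lower) rows, the way you might lay them out by hand in a spreadsheet. It is not how vcl works: the instruction names, the register names and the pairing rule are all made up for illustration, and real latencies are ignored.

```python
from typing import NamedTuple, Optional, Tuple

class Instr(NamedTuple):
    name: str                 # illustrative mnemonic, not a real VU opcode
    pipe: str                 # "upper" or "lower"
    dst: Optional[str]        # register written, or None
    srcs: Tuple[str, ...]     # registers read

NOP = Instr("nop", "", None, ())

def pair_in_order(program):
    """Pack an in-order instruction list into (upper, lower) rows.

    Assumed rule (hypothetical, far simpler than real hardware): two
    neighbouring instructions share a row only if they use different
    pipes and the second does not read or overwrite the first's result.
    """
    rows = []
    i = 0
    while i < len(program):
        first = program[i]
        second = program[i + 1] if i + 1 < len(program) else None
        can_pair = (
            second is not None
            and second.pipe != first.pipe
            and (first.dst is None or first.dst not in second.srcs)
            and (first.dst is None or first.dst != second.dst)
        )
        slot = {"upper": NOP, "lower": NOP}
        slot[first.pipe] = first
        if can_pair:
            slot[second.pipe] = second
            i += 2
        else:
            i += 1
        rows.append((slot["upper"], slot["lower"]))
    return rows

program = [
    Instr("mul", "upper", "vf3", ("vf1", "vf2")),
    Instr("load", "lower", "vf4", ("vi1",)),
    Instr("add", "upper", "vf5", ("vf3", "vf4")),
    Instr("store", "lower", None, ("vf5",)),
]
rows = pair_in_order(program)
for upper, lower in rows:
    print(f"{upper.name:8s}{lower.name}")
```

In this example the mul and load share a row, while the store has to wait a row because it reads the result of the add. Filling those nop gaps by reordering is the tedious part that a tool can take over.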
I can't remember the details because we coded the SPU in C, but the PS3 SPUs had odd and even pipelines with different properties too.
Sounds like a Gantt chart with code might fit.
I love those
I remember discussion at the time about how the PS3 was a uniquely difficult architecture to emulate. Was that true? Have those difficulties now been overcome? I see RPCS3 exists but I’ve no idea if it has done the difficult parts.
Depends on your definition of "overcome". RPCS3 does emulate the architecture, and many games are playable on it, but it's still far from being perfect. Many games have stability issues due to timing/synchronization inaccuracies, for example.
With sufficient thrust, pigs fly just fine. Eventually you can overcome any issues by throwing more CPU at the problem
So, I'd have to dig through some older notes I have, however, some of this information seems inaccurate based upon my own interpretation of the specs (and writing code...specifically, but not limited to, the PowerPC part). A suggestion from me is to provide sources, and also maybe an epub of this.
Please see this: https://github.com/flipacholas/Architecture-of-consoles
> A suggestion from me is to provide sources, and also maybe an epub of this
What do you mean?
It seems they missed this. https://payhip.com/copetti
That was a small fundraiser started to convert all articles into epubs, finished in 2022
i did a bit of dev on ps3 and i remember there was a small memory on the chip, like 256k, that was accessible to the programmer.
i always found this very appealing, having a blazing fast memory under programmer control so i wonder: why don't we have that on other cpus?
> why don't we have that on other cpus
Pure speculation from my side, but I'd think that the advantages over traditional big register banks and on-chip caches are not that great, especially when you're writing 'cache-aware code'. You also need to consider that the PS3 was full of design compromises to keep cost down, e.g. there simply might not have been enough die space for a cache controller for each SPU, or the die space was more valuable for a few more kilobytes of static scratch memory than for the cache logic.
Also, AFAIK some GPU architectures have something similar in the form of per-core static scratch space. That's where restrictions like uniform data per shader invocation being limited to 64 KBytes on some GPU architectures come from.
> why don't we have that on other cpus?
We do, it's called "cache" or "registers".
The TI-99/4A had 256 BYTES (128 words) of static RAM available to the CPU. All accesses to the 16K of main memory had to be done through the video chip. This made a lot of things on the TI-99/4A slow, but there were occasional bits of brilliance where you could see a tiny bit of the system it could've been. Thanks to the fast SRAM and 16-bit CPU, the smooth scrolling in Parsec was done entirely in software, since the TMS9918A video chip lacks scroll registers entirely.
> The EIB is made of twelve nodes called Ramps, each one connecting one component of Cell... Having said that, instead of recurring to single bus topologies (like the Emotion Engine and its precursor did), ramps are inter-connected following the token ring topology, where data packets must cross through all neighbours until it reaches the destination (there’s no direct path).
I knew IBM was involved in the design of the Cell BE, but I had no idea some successor of IBM's token ring tech (at least the concept of it) lived on in it. I'm sure there's other hardware (probably mainframe hardware) from 2006 and earlier with similar interconnects.
The EIB has nothing to do with 1980s Token Ring and this is arguably a mistake in the article. It's just a ring topology.
I suspect it’s an attempt at a metaphor that isn’t clearly marked as such.
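For a sense of what "just a ring topology" means for latency, here is a minimal sketch of hop counts on a ring. The twelve nodes come from the article quote above; the node numbering is hypothetical, and the sketch assumes traffic can go around the ring in either direction and always takes the shorter way.

```python
def ring_hops(src: int, dst: int, nodes: int = 12) -> int:
    """Fewest hops between two nodes on a ring that carries data both ways."""
    clockwise = (dst - src) % nodes
    return min(clockwise, nodes - clockwise)

# Worst case on a 12-node ring is the node directly opposite.
worst_case = max(ring_hops(0, d) for d in range(12))
print(worst_case)
```

Under these assumptions a packet crosses at most half the ring, and neighbouring components talk in a single hop, which is why placement on the ring matters.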
Can it run deep learning workloads?
The PS3 was used a few times in clusters – some NN work was done on it back in the day. My understanding (somewhat echoed in TFA) is that when programming Cell, you really needed to think about communication patterns to avoid quickly running into memory bandwidth limitations, especially given the memory hierarchy and bus quirks.
https://open.clemson.edu/all_theses/629/
For a while, it was a major player in protein folding. I remember the PS3 was particularly apt at doing that sort of work.
For its day, it packed a lot of compute into a cheap package, so long as you could do something useful with a data set that fit into 256kB, the size of the local memory buffer on each SPE. If you overflowed that, the anemic system bandwidth would make it suck. Protein folding was an example of a problem that back then used tons of compute but could be fit into a small space.
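A rough sketch of the "does it fit" arithmetic, in Python for readability. The 256 kB figure is from the comment above; the code footprint, the two-buffer scheme and the 16-byte rounding are assumptions picked for illustration, not measured values.

```python
LOCAL_STORE_BYTES = 256 * 1024  # per-SPE local store size quoted above

def chunk_plan(total_bytes: int, code_bytes: int, buffers: int = 2):
    """Return (chunk_bytes, chunk_count) for streaming data through a local store.

    Assumptions (hypothetical): code and data share the local store, the free
    space is split evenly between `buffers` so one chunk can be processed
    while the next is being fetched, and chunks are rounded down to a
    multiple of 16 bytes.
    """
    free = LOCAL_STORE_BYTES - code_bytes
    if free <= 0:
        raise ValueError("code alone does not fit in the local store")
    chunk = (free // buffers) // 16 * 16
    if chunk == 0:
        raise ValueError("no room left for a data buffer")
    count = -(-total_bytes // chunk)  # ceiling division
    return chunk, count

# 1 MiB of data with a hypothetical 64 KiB of code resident in the local store.
plan = chunk_plan(1024 * 1024, 64 * 1024)
print(plan)
```

With these numbers each buffer is 96 KiB and the data set takes 11 round trips to main memory. Whether that hurts depends on how much compute you do per chunk.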
It was the biggest contributor to Folding@home at one point. The client came bundled with the PS3, played relaxing music and showed a heat map of the world's PS3 compute nodes as it went on. There was also https://en.wikipedia.org/wiki/PlayStation_3_cluster
they've also been used for crypto mining/cracking
See also QPACE https://en.wikipedia.org/wiki/QPACE
With enough effort you could definitely do it. Just remember it is a device that came out in 2006 and it has 256MB of system RAM and 256MB of VRAM, so at best you're running quite a small model after a lot of work trying to port some inference code to CELL processors. Honestly it does sound like a cool excuse to write code for the CELL processors, but don't expect amazing performance or anything.
It's a nearly 20-year-old gaming console. Even if you could port a deep learning workload to run efficiently on the Cell architecture, it would be thoroughly outclassed by a modern cell phone (to say nothing of a desktop computer).
Eugh, maybe?
The PS3 only had 256 MB of main memory, so you'd be pretty limited there. Memory bandwidth, great at the time, is pretty poor by today's standards (25 GB/s).
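Back-of-the-envelope numbers for those limits, as a small Python sketch. The 256 MB and 25 GB/s figures are taken from the comments above; the share of memory left over for weights and the bytes per parameter are assumptions, and the bandwidth bound ignores everything except reading the weights once.

```python
MAIN_MEMORY_BYTES = 256 * 1024**2      # main memory size quoted above
BANDWIDTH_BYTES_PER_S = 25 * 10**9     # peak figure quoted above

def max_params(bytes_per_param: int, reserved_fraction: float = 0.5) -> int:
    """Parameters that fit if `reserved_fraction` of memory is kept for
    everything else (OS, activations, code). The default is a guess."""
    usable = MAIN_MEMORY_BYTES * (1 - reserved_fraction)
    return int(usable // bytes_per_param)

def max_weight_passes_per_second(model_bytes: int) -> float:
    """Upper bound on how often all weights can be streamed from memory."""
    return BANDWIDTH_BYTES_PER_S / model_bytes

params = max_params(bytes_per_param=2)             # 16-bit weights
passes = max_weight_passes_per_second(params * 2)  # one pass reads every weight
print(params, round(passes))
```

Under these assumptions that is roughly 67 million 16-bit parameters, streamed at most a couple of hundred times per second, so "quite a small model" is about right.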
https://en.wikipedia.org/wiki/PlayStation_3_cluster