DOS Tomb Raider 1 (unaccelerated) is an interesting test for FPU performance as well, and a special case of FPU parallelism with an integer-based pixel pipeline (or, if "pixel pipeline" isn't technically correct, at least an integer/ALU-based rasterizer).
I haven't found any articles detailing how Tomb Raider works at a low level, but given the performance I've gotten from 486DX and 486DLC+coprocessor configurations (i.e. actually playable on a DLC-40 + copro, especially at a reduced but not tiny screen size like 256x160 or 240x150), it definitely doesn't use anything Quake-like in rendering, but it also definitely requires an FPU for any sort of usability. It also requires a 486-compatible CPU, as it hangs/crashes after the title screen on a vanilla 386DX (or SX). Presumably it would also work on a 486SLC + copro, but I don't have one (or at least don't have one installed on a motherboard ... I've got some bare SLC chips for potential projects, but that's it).
TR will run without an FPU on a 486DLC or 486SX, as in it will enter the main game and render, but extremely slowly, at sub-1 FPS, which implies it falls back to some sort of FPU emulator. I'm not sure how/why it does that rather than refusing to run at all, but maybe it was designed with NexGen Nx586 systems in mind, given those were the only 586/Pentium-class CPUs ever sold without FPUs (or arguably those and the UMC U5S Green CPUs ... which I don't think I've tested with TR). Or I might not have actually tested on a 486SX, and it might rely on L1 cache for fast enough FP-to-integer conversion ... and I haven't run tests with a 486DLC using the Cyrix setup utility to make sure cache and other settings are properly configured. (I know I got slower Doom performance than I should have because of that.)
In any case, DOS TR can be made very playable on vanilla Intel/AMD/Cyrix 486DXs or AMD/Cyrix 5x86s, and it would be interesting to add to a comparison list (though it doesn't have convenient frame counters or timedemos like Quake ... or maybe there's at least a console command for enabling a frame counter). Given there's a shareware demo, there's potential for writing a timedemo-style script without legal complications, I suppose (not modifying the source code, but running a script that plays back the same keyboard inputs to give the same results every time).
Now, this would also be relevant for testing actual Pentium-class CPUs, and would be a very rough approximation of "what if Quake had a 486DX renderer option," but it's still also relevant to 387-class FPU tests: since it works with a 486DLC (and should with an SLC), you could take a 486DLC or DRx2 system and just run through all the available 387-socket FPUs.
And as a general side note, targeting the 486DX (with FPU) as a baseline low-end standard alternative to the Pentium would probably have been a sensible/reasonable compromise for mid/late-90s games, even ones with partial or full P5 or P6 feature targets and/or 3D accelerator patches or versions. Developers would probably at least want to do all the vertex math on the FPU and be consistent with that across the board (and you do get some parallelism for FPU+ALU operations on the 486 ... or at least on 586/686-class CPUs with FPU prefetch/FIFO), especially if you're mandating 32-bit fixed-point precision at a minimum.
That said, game engines fully optimized for 16-bit operations (i.e. 16x16=32-bit multiply-accumulates and 32/16=16-bit divides), effectively using the CPU as a 16-bit DSP or geometry engine, would still probably be consistently faster than anything using the FPU on a 486/5x86/Cx6x86/K6, and probably on a K5 too (the K5 has 1-clock peak FMUL throughput IIRC, but it also has a very fast ALU). Cyrix CPUs going all the way back to the 486DLC/SLC have a fast 16-bit hardware multiply unit (3/5 clocks from register/cache hit, so very, very fast for 16-bit math, and still 7/9 clocks for 32-bit reg/cache, though that's slower than a Cx5x86/6x86 FPU). This would, admittedly, also still apply to 3D accelerators, since you could write drivers or custom renderers using 16-bit fixed-point integer math for maximum speed at the expense of vertex precision, only using 32-bit integer math where absolutely necessary.
You could even have drivers/games optimized like this first and foremost, then re-optimized for the Pentium FPU by doing the same computations in floating point and converting the output to integer values, since the P5 FPU does FP-to-integer conversions very fast (part of why Quake's FPU pixel pipeline works so well). This also goes for MMX optimization, since MMX does packed fixed-point multiplies very fast, and I think does packed 16-bit multiplies fastest at 2 multiplies per clock (and most/all 16-bit operations faster than 32-bit), so an actual MMX integer-optimized 3D math driver would use as much 16-bit integer math as possible, and drivers using that for APIs requiring 32-bit precision (FP or integer) could do all the math at lower precision and convert it to 'fake' higher-precision values for the final output. (This is what should have been done to make the most of all CPUs with MMX units, especially fast MMX units, and it could have been done in parallel with supporting the later 3DNow! and SSE transparently, by just padding/converting lower-precision values to higher-precision ones at the expense of possible image quality. It could/should have just been one more quality setting in various games, and even within the market Intel dominated, you'd have had Intel's own CPUs competing on MMX vs raw FPU performance, P5/PPro vs P55C/PII and later, albeit MMX doesn't come around until 1997.) The PPro/PII also does integer math extremely fast, with 1-cycle multiply throughput and 2-cycle divides (from memory; 3 cycles from register, both at all precisions), so an all-integer renderer on the PPro could be faster than an FPU-driven one. The fast in-memory vs register operations might also mitigate (or exceed) the advantages of doing pixel rendering/buffering inside FPU registers ...
except P6 FDIV is actually listed at just 1-cycle throughput, but then there's a much higher figure for reciprocal throughput and latency in both cases, and 16-bit is much faster than 32-bit. (And with divides used less frequently, the latency would be what matters most, unless you pipelined your computations to do several sequential divides.)
There's also FPU parallelism on P6, so it should be even faster to divide work up into parallel chunks, convert all final (or shared) outputs to a common integer or FP format, and stagger integer and FP divides (unless both use shared functional units, in which case there might be no real gain). In any case, latency and reciprocal throughput of 32-bit integer division are nearly identical to FP, but 16-bit DIV/IDIV is almost twice as fast and 8-bit is even faster (though situations where that would be useful would be quite limited). So it would really just be the P5 Pentium family where making heavy use of 16-bit (or mixed 16- and 32-bit) fixed-point math would actually be slower (significantly slower, even) than doing as much as possible in FP; any 386- or 486-class CPU, any non-Intel 586/686-class CPU, and Intel's P6/PII/PIII would be faster, or at least very near the same speed, doing all-integer ops like that. The integer side of the P6 CPUs is very DSP-like in its fast multiply-accumulate operations, and the Cyrix 486DLC was probably the first x86-family CPU similarly optimized for DSP-like operations. (The hardware multiplier in the NEC V33 was similarly oriented, with a 12-cycle 16-bit multiply, but limited to Real Mode and the 286 instruction set: it supported EMS memory mapping internally for a 24-bit, 16 MB external address space, but not XMS/protected mode ... I'm not sure if it even supported the 286's HMA, and it also wasn't 286 pin compatible for some reason.)
For that matter, with that DSP-like integer performance, it's possible that highly 16-bit-integer-optimized code would be faster on a 486DLC than on a P5 Pentium, or at least on a Socket 3 Pentium OverDrive at the same clock speed (i.e. tach fan removed, locking it to a 1x multiplier ... I have one like that, but it won't POST over 50 MHz, where the same boards will do an i486DX-50 and run 486SX-33s at 1x66 MHz ... some DX-50s and SX-33s run OK at 1x60 MHz at 3.3~3.45 V, even). Cyrix 486SX/DX chips with larger L1 caches should have a bigger advantage. In which case, any software designed that way would benefit from a P5-specific patch or rewrite doing most of those integer ops as FP ones instead (even if it meant just converting them to 16-bit integer values afterward, for minimal change in code).
Looking at this for cycle times:
https://www.agner.org/optimize/instruction_tables.pdf
(covers P5/P55C, P6/PII/PIII, K7, and more, but doesn't have the K5, K6, or any Cyrix processors)
And this for the 486DLC:
http://www.bitsavers.org/components/cyrix/Cyr … Sheet_May92.pdf
Also worth noting: both the PlayStation's GTE and the N64's RSP make heavy use of 16-bit fixed-point math, though both can also do 32-bit fixed-point math (and I believe the N64's so-called "standard microcode" uses RSP routines heavily biased toward full 32-bit vertex precision).
Fast lookup tables + shifts + adds would also be faster in some situations, though this is more clear-cut for 386-optimized engines than 486 ones (probably depending on whether the LUTs are small enough to fit into the 486's L1 cache, and on not targeting Cyrix's fast 16x16-bit hardware multiplier).