DOS Tomb Raider 1 (unaccelerated) is an interesting test for FPU performance as well, and a special case of FPU parallelism with an integer-based pixel pipeline (or, if "pixel pipeline" isn't technically correct, at least an integer/ALU-based rasterizer).
I haven't found any articles detailing how Tomb Raider works at a low level, but given the performance I've gotten from 486DX and 486DLC+coprocessor configurations (i.e. actually playable on a DLC-40 + copro, especially at a reduced but not tiny screen size like 256x160 or 240x150), it definitely doesn't use anything Quake-like in rendering, but it also definitely requires an FPU for any sort of usability. It also requires a 486-compatible CPU, as it hangs/crashes after the title screen on a vanilla 386DX (or SX). Presumably it would also work on a 486SLC + copro, but I don't have one (or at least don't have one installed on a motherboard ... I've got some bare SLC chips for potential projects, but that's it).
TR will run without an FPU on a 486DLC or 486SX, as in it will enter the main game and render, but extremely slowly, at sub-1 FPS, which implies it falls back to some sort of FPU emulator. I'm not sure how/why it does that rather than refusing to run at all, but maybe it was designed with NexGen Nx586 systems in mind, given those were the only 586/Pentium-class CPUs ever sold without FPUs (or arguably those and the UMC U5S Green CPUs ... which I don't think I've tested with TR). Or I might not have actually tested on a 486SX, and it might rely on L1 cache for fast enough FP-to-integer conversion ... and I haven't run tests with a 486DLC using the Cyrix setup utility to make sure cache and other settings are properly configured. (I know I got slower Doom performance than I should have because of that.)
In any case, DOS TR can be made very playable on vanilla Intel/AMD/Cyrix 486DXs or AMD/Cyrix 5x86s, and it would be interesting to add to a comparison list (though it doesn't have convenient frame counters or timedemos like Quake ... or maybe there's at least a console command for enabling a frame counter). Given there's a shareware demo, there's potential for writing a timedemo-style script without legal complications, I suppose (not modifying the source code, but running a script that plays back the same keyboard inputs to give the same results every time).
Now, this would also be relevant for testing actual Pentium-class CPUs, and would be a very rough approximation of "what if Quake had a 486DX renderer option," but it's still also relevant to 387-class FPU tests: since it works with a 486DLC (and should with an SLC), you could take a 486DLC or DRx2 system and just run through all the available 387-socket FPUs.
And as a general side note, targeting the 486DX (with FPU) as a baseline low-end standard alternative to the Pentium would probably have been a sensible/reasonable compromise for mid/late-90s games, even ones with partial or full P5 or P6 feature targets and/or 3D accelerator patches or versions. Developers would probably at least want to do all the vertex math on the FPU and be consistent with that across the board (and you do get some parallelism for FPU+ALU operations on the 486 ... or at least on 586/686-class CPUs with FPU prefetch/FIFO), especially if you're mandating 32-bit fixed-point precision at a minimum.
That said, game engines fully optimized for 16-bit operations (i.e. 16x16=32-bit multiply-accumulates and 32/16=16-bit divides), effectively using the CPU as a 16-bit DSP or geometry engine, would still probably be consistently faster than anything using the FPU on a 486/5x86/Cx6x86/K6, and probably on a K5 too (the K5 has 1-clock peak FMUL throughput IIRC, but it also has a very fast ALU). Cyrix CPUs going all the way back to the 486DLC/SLC have a fast 16-bit hardware multiply unit (3/5 clocks from register/cache hit, so very, very fast for 16-bit math, and still 7/9 clocks for 32-bit reg/cache, though that's slower than a Cx5x86/6x86 FPU). This would, admittedly, also still apply to 3D accelerators, since you could write drivers or custom renderers using 16-bit fixed-point integer math for maximum speed at the expense of vertex precision, only using 32-bit integer math where absolutely necessary.
You could even have drivers/games optimized like this first and foremost, then re-optimized for the Pentium FPU by doing the same computations in floating point and converting the output to integer values, since the P5 FPU does FP-to-integer conversions very fast (part of why Quake's FPU pixel pipeline works so well). This also goes for MMX optimization, since MMX does packed fixed-point multiplies very fast, and I think does packed 16-bit multiplies fastest at 2 multiplies per clock (and most/all 16-bit operations faster than 32-bit), so an actual MMX integer-optimized 3D math driver would use as much 16-bit integer math as possible, and drivers using that for APIs requiring 32-bit precision (FP or integer) could do all the math at lower precision and convert it to 'fake' higher-precision values for the final output. (This is what should have been done to make the most of all CPUs with MMX units, especially fast MMX units, and it could have been done in parallel with supporting the later 3DNow! and SSE transparently, by just padding/converting lower-precision values to higher-precision ones at the expense of possible image quality. It could/should have just been one more quality setting in various games, and even within the market Intel dominated, you'd have had Intel's own CPUs competing on MMX vs raw FPU performance, P5/PPro vs P55C/PII and later, albeit MMX doesn't come around until 1997.) The PPro/PII also does integer math extremely fast, with 1-cycle multiply throughput and 2-cycle divides (from memory; 3 cycles from register, both at all precisions), so an all-integer renderer on the PPro could be faster than an FPU-driven one. The fast in-memory vs register operations might also mitigate (or exceed) the advantages of doing pixel rendering/buffering inside FPU registers ...
except P6 FDIV is actually listed at just 1-cycle throughput, but then there's a much higher figure for reciprocal throughput and latency in both cases, and 16-bit is much faster than 32-bit. (And with divides used less frequently, the latency would be what matters most, unless you pipelined your computations to do several sequential divides.)
There's also FPU parallelism on P6, so it should be even faster to divide work up into parallel chunks, convert all final (or shared) outputs to a common integer or FP format, and stagger integer and FP divides (unless both use shared functional units, in which case there might be no real gain). In any case, latency and reciprocal throughput of 32-bit integer division are nearly identical to FP, but 16-bit DIV/IDIV is almost twice as fast and 8-bit is even faster (though situations where that would be useful would be quite limited). So it would really just be the P5 Pentium family where making heavy use of 16-bit (or mixed 16- and 32-bit) fixed-point math would actually be slower (significantly slower, even) than doing as much as possible in FP; any 386- or 486-class CPU, any non-Intel 586/686-class CPU, and Intel's P6/PII/PIII would be faster, or at least very near the same speed, doing all-integer ops like that. The integer side of the P6 CPUs is very DSP-like in its fast multiply-accumulate operations, and the Cyrix 486DLC was probably the first x86-family CPU similarly optimized for DSP-like operations. (The hardware multiplier in the NEC V33 was similarly oriented, with a 12-cycle 16-bit multiply, but limited to Real Mode and the 286 instruction set: it supported EMS memory mapping internally for a 24-bit, 16 MB external address space, but not XMS/protected mode ... I'm not sure if it even supported the 286's HMA, and it also wasn't 286 pin compatible for some reason.)
For that matter, with that DSP-like integer performance, it's possible that highly 16-bit-integer-optimized code would be faster on a 486DLC than on a P5 Pentium, or at least on a Socket 3 Pentium OverDrive at the same clock speed (i.e. tach fan removed, locking it to a 1x multiplier ... I have one like that, but it won't POST over 50 MHz, where the same boards will do an i486DX-50 and run 486SX-33s at 1x66 MHz ... some DX-50s and SX-33s run OK at 1x60 MHz at 3.3~3.45 V, even). Cyrix 486SX/DX chips with larger L1 caches should have a bigger advantage. In which case, any software designed that way would benefit from a P5-specific patch or rewrite doing most of those integer ops as FP ones instead (even if it meant just converting them to 16-bit integer values afterward, for minimal change in code).
Looking at this for cycle times:
https://www.agner.org/optimize/instruction_tables.pdf
(covers P5/P55C, P6/PII/PIII, K7, and more, but doesn't have the K5, K6, or any Cyrix processors)
And this for the 486DLC:
http://www.bitsavers.org/components/cyrix/Cyr … Sheet_May92.pdf
Also worth noting: both the PlayStation's GTE and the N64's RSP make heavy use of 16-bit fixed-point math, though both can also do 32-bit fixed-point math (and I believe the N64's so-called "standard microcode" uses RSP routines heavily biased toward full 32-bit vertex precision).
Fast lookup tables + shifts + adds would also be faster in some situations, though this is more clear-cut for 386-optimized engines than 486 ones (probably depending on whether the LUTs are small enough to fit into the 486's L1 cache, and on not targeting Cyrix's fast 16x16-bit hardware multiplier).