VOGONS


First post, by Tramboi

User metadata
Rank Newbie

Hi guys,
It seems I can't post the patches in the patch section of the forum, so here they are.

1: This one is really minor: a bit of bit twiddling in the text rendering code to avoid a LUT (better CPI on most (all?) architectures I know of).

2: I wanted dynamic code selection on x86/x64, so this patch lets client code test for SSE1 or SSE2 availability and dispatch accordingly (it should work on MSVC and gcc). I hope this encourages people to submit more SIMD optimizations (good for slower Atom machines).

3: For machines with SSE2 (yes, there was a logic behind the previous one :)), RENDER_StartLineHandler processes more items at a time.

4: MEMBlockWrite, invoked mostly when servicing int21 disk reads, is now paging-aware and branches much less, falling back to memcpy for sub-blocks.
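
To illustrate the idea behind patch 4, here is a minimal sketch of a page-aware block write. It is not the code from the attached patch: ToyMemory, kPageSize and block_write are hypothetical names, and the toy paging layer assumes every page is plain RAM. A real emulator would presumably still need a slower path for pages with special handlers.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for an emulator's paging layer: each 4 KiB guest
// page has its own host buffer, so guest-contiguous pages need not be
// contiguous in host memory.
constexpr std::size_t kPageSize = 4096;

struct ToyMemory {
    std::vector<std::vector<std::uint8_t>> pages;

    explicit ToyMemory(std::size_t page_count)
        : pages(page_count, std::vector<std::uint8_t>(kPageSize, 0)) {}

    // Byte-at-a-time access: one page lookup per byte (the slow pattern).
    void write_byte(std::size_t addr, std::uint8_t value) {
        pages[addr / kPageSize][addr % kPageSize] = value;
    }
    std::uint8_t read_byte(std::size_t addr) const {
        return pages[addr / kPageSize][addr % kPageSize];
    }
};

// Page-aware block write: one page lookup and one memcpy per touched page.
// The caller must keep [addr, addr + size) inside the allocated pages.
void block_write(ToyMemory& mem, std::size_t addr, const void* data, std::size_t size) {
    const std::uint8_t* src = static_cast<const std::uint8_t*>(data);
    while (size > 0) {
        const std::size_t offset = addr % kPageSize;
        // Copy up to the end of the current page, or less if the block ends first.
        const std::size_t chunk = std::min(size, kPageSize - offset);
        std::memcpy(mem.pages[addr / kPageSize].data() + offset, src, chunk);
        addr += chunk;
        src += chunk;
        size -= chunk;
    }
}
```

The loop branches once per page instead of once per byte, which is where the gain for large disk buffers would come from.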

Hope you'll consider them ok!

Cheers,
Tramb

Edit: Patch 1 fixed
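
The thread does not reproduce patch 1, so the following is only a sketch of the general "bit twiddling instead of a LUT" idea, assuming the text renderer expands font bits into per-pixel byte masks. The function names are hypothetical. It expands a 4-bit value into four bytes of 0x00 or 0xFF with two multiplies and a mask, and keeps a table version alongside for comparison.

```cpp
#include <cstdint>

// Reference version: table lookup. Byte i of the result is 0xFF when
// bit i of n is set, 0x00 otherwise.
inline std::uint32_t expand_nibble_lut(unsigned n) {
    static const std::uint32_t table[16] = {
        0x00000000u, 0x000000FFu, 0x0000FF00u, 0x0000FFFFu,
        0x00FF0000u, 0x00FF00FFu, 0x00FFFF00u, 0x00FFFFFFu,
        0xFF000000u, 0xFF0000FFu, 0xFF00FF00u, 0xFF00FFFFu,
        0xFFFF0000u, 0xFFFF00FFu, 0xFFFFFF00u, 0xFFFFFFFFu,
    };
    return table[n & 15u];
}

// LUT-free version. Multiplying by 0x00204081 places copies of the nibble
// at bit offsets 0, 7, 14 and 21, so bit i lands on bit 8*i. The mask keeps
// one bit per byte, and multiplying by 0xFF widens each 0x01 to 0xFF.
// The shifted copies never overlap, so no carries occur.
inline std::uint32_t expand_nibble(unsigned n) {
    const std::uint32_t spread =
        (static_cast<std::uint32_t>(n & 15u) * 0x00204081u) & 0x01010101u;
    return spread * 0xFFu;
}
```

Whether this beats the table on a given machine depends on cache behaviour and multiplier latency, so it is worth measuring rather than assuming.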

Attachments

  • Filename
    tramb_patches_v2.zip
    File size
    4.3 KiB
    Downloads
    735 downloads
    File license
    Fair use/fair dealing exception
Last edited by Tramboi on 2011-07-07, 07:45. Edited 1 time in total.

Reply 2 of 19, by Tramboi

User metadata
Rank Newbie

Here is the fourth patch again, sorry for the inconvenience 😖

Attachments

Reply 3 of 19, by danoon

User metadata
Rank Member

I like the memblock optimization; I experimented with the same thing in the dostring functions. Some cases saw a 100x speed improvement in the Java port, though I haven't checked them in yet. It's hard to justify large chunks of optimized code replacing 2 lines of easy to read/debug code when they don't get called too much. Did you see good improvements in real-world games, like the start-up time of Doom, on slower machines?

Reply 4 of 19, by Tramboi

User metadata
Rank Newbie

I saw a 15-20% speedup on my OpenPandora and 5-10% on my laptop for the benchmark I tried: a default OpenWatcom installation. I think a lot of utilities that hit the disk hard would benefit from this one.
I understand the point about maintainability, but at some point, only the user counts, I guess.
At work I don't mind having duplicated code for the PS3 PPU, the PS3 SPU, the 360, and PC if each reaps performance benefits, but I understand the opposite POV 😀

Reply 9 of 19, by Tramboi

User metadata
Rank Newbie

Hi guys,
ykhwong, I revised my first patch; could you give it a try and see if it fixes your game? Thanks in advance 😀
wc, my goal is to look at the top functions that come out of oprofile on my laptop, with consideration for smaller machines (I own an OpenPandora but I can't really do profiling on it without heavy lifting). If I gain a few percent here and there on my PC, which has a quite modern CPU, without compromising readability too much, I'm happy, because it will certainly give bigger gains on ARM and Atom CPUs, which don't have as much in the way of caching, store-to-load forwarding, smart-ass branch prediction and super-deep pipelines.

Cheers,
Tramb

Attachments

  • Filename
    tramb_patches_v2.zip
    File size
    4.3 KiB
    Downloads
    554 downloads
    File license
    Fair use/fair dealing exception

Reply 10 of 19, by wd

User metadata
Rank DOSBox Author

my goal is to look at the top functions that come out of oprofile on my laptop

That's only of use if oprofile can gather information about the generated code from the recompiler. Otherwise your "few percent" are "pretty much zero" which is why I've asked for real benchmarks on this.

Reply 11 of 19, by Tramboi

User metadata
Rank Newbie

oprofile does report the generated executable pages as "anonymous".
For instance, on my latest run for CPU_CLK_UNHALTED events, I had this:
3681 anon
2394 vga_read_p3da
1955 VGA_TEXT_Draw_Line
864 Normal1x_9_32_R
534 DBOPL::Channel* DBOPL::Channel::BlockTemplate<(DBOPL::SynthMode)1>(DBOPL::Chip*, unsigned int, int*)
474 mem_readw_checked_drc(unsigned int)
and so on.
A lot of samples land in the CRT too, but those are shared with the other processes on the machine.

It of course greatly depends on the game tested but these functions are clearly not negligible in the overall picture.
When applications load data, MEMBlockWrite is a big contender too, stalling the emulation for many host cycles, especially with big buffers.
The RENDER_StartLineHandler optimisation was a huge win for the function IIRC, doing half as many loads and compares (a quarter as many on 32-bit CPUs).
The scalers should use SSE2 too, but the macro machinery makes it a bit tricky, so I'm still pondering how.
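
As a rough illustration of the kind of change described for RENDER_StartLineHandler, the sketch below compares a source line against a cached copy 16 bytes at a time with SSE2 intrinsics. It is not the patch itself: line_changed and the SKETCH_HAVE_SSE2 macro are hypothetical, and the SSE2 path is selected at compile time here for brevity, where the patches discussed in this thread pick it at run time.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define SKETCH_HAVE_SSE2 1
#else
#define SKETCH_HAVE_SSE2 0
#endif

// Returns true if src and cache differ anywhere in the first `bytes` bytes.
inline bool line_changed(const std::uint8_t* src, const std::uint8_t* cache,
                         std::size_t bytes) {
    std::size_t i = 0;
#if SKETCH_HAVE_SSE2
    for (; i + 16 <= bytes; i += 16) {
        // Unaligned loads (movdqu): no alignment assumption on either buffer.
        const __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
        const __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(cache + i));
        // cmpeq sets each equal byte to 0xFF; movemask packs the 16 top bits,
        // so 0xFFFF means all 16 bytes matched.
        if (_mm_movemask_epi8(_mm_cmpeq_epi8(a, b)) != 0xFFFF) return true;
    }
#endif
    // Scalar tail (and the whole buffer when SSE2 is not compiled in).
    return std::memcmp(src + i, cache + i, bytes - i) != 0;
}
```

One 16-byte compare replaces two 8-byte or four 4-byte word compares, which matches the ratio quoted above.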

What sampling profiler do you use in the team by the way? CodeAnalyst, VTune, another?

Reply 12 of 19, by wd

User metadata
Rank DOSBox Author

oprofile does report the generated executable pages as "anonymous".

They're usually more than 50% of the overall time so not sure if that anon accounts for the recompiler code.

The RENDER_StartLineHandler optimisation was a huge win for the function IIRC

Sounds like a compiler problem to me though, when not using default "has to run even on 386 systems"
flags the auto profiling should reorder things for good.
But maybe I'm just missing the point of the optimization, it definitely didn't improve readability.

What sampling profiler do you use in the team by the way? CodeAnalyst, VTune, another?

Depends on what you're trying to do, VTune is usually pretty usable for full profiles.
But the really relevant things are game benchmarks.

Reply 13 of 19, by Tramboi

User metadata
Rank Newbie

They're usually more than 50% of the overall time so not sure if that anon accounts for the recompiler code.

The anon in question was in the dosbox process, which I had just compiled with full DWARF info, so I do think so, but I'll check at home tonight with the normal core to rule it out. Here we probably have a game that is not CPU intensive and does a lot of sync checks on the p3da, no?

Sounds like a compiler problem to me though, when not using default "has to run even on 386 systems"
flags the auto profiling should reorder things for good.
But maybe I'm just missing the point of the optimization, it definitely didn't improve readability.

First, it doesn't improve readability, it's all about speed there.
I don't know of a compiler on earth that would do the same optimization; even icc probably would not (and if it did, it would probably generate a prologue so it could use movdqa instead of movdqu in the loop). I am 150% positive that gcc and msvc would just compile the code as it is written. I check their asm output every day at work, I'd know it 😀
Anyway I use the default ./configure script so it means the shipped builds don't do it.

Depends on what you're trying to do, VTune is usually pretty usable for full profiles.
But the really relevant things are game benchmarks.

What metric do you use? FPS without vsync?
I don't know how to do it in DOSBox; can you explain it to me?
And it doesn't measure latency hogs like MEMBlockWrite that appear during int21 loading operations.

Hope this properly answers your remarks 😀
Tell me if you (as the project) are not interested at all in low level optimizations, I won't submit more.

Reply 14 of 19, by wd

User metadata
Rank DOSBox Author

Here we probably have a game that is not CPU intensive and does a lot of sync checks on the p3da, no?

Yes, the usual targets of optimizations are games that push the system to limits.

I am 150% positive that gcc and msvc would just compile the code as it is written

Profile-guided optimization usually gets it done if the code isn't prepared for other guidances.
If not and the code can't be re-arranged to allow the compiler to do the optimizations you should
still be very sure that the code you're looking at is worth adding custom code like SSE.

I don't know how to do it in DOSBox; can you explain it to me?

Use some game that has a fps display, like quake or duke3d, then choose some defined procedure
for measurement (quake has some console commands for example).

And it doesn't measure latency hogs like MEMBlockWrite that appear during int21 loading operations.

Why not? If you're maxing out your CPU this should result in different framerates.

Tell me if you (as the project) are not interested at all in low level optimizations, I won't submit more.

Maybe you got that wrong, but I don't want to distract you from finding optimizations, I'm just
trying to share some experiences. There are a lot of different games out there with varying
characteristics that benefit from different approaches of optimization. If you aim at getting more
speed out of demanding games, the textmode stuff may not be of relevance at all but if
textmode games on low powered systems are your target it very well may be. But keep in
mind that those changes will generally only be made if there's some real noticeable benefit,
as otherwise the chance of bugs is always present, no matter how trivial a change may look, so
it's simply not worth it.

Reply 15 of 19, by Tramboi

User metadata
Rank Newbie

Yes, the usual targets of optimizations are games that push the system to limits.

Yes, but on small ARM devices and small Atoms/VIA, all games push the system to limits 😀 You just can't run the higher-profile games.

Profile-guided optimization usually gets it done if the code isn't prepared for other guidances.
If not and the code can't be re-arranged to allow the compiler to do the optimizations you should
still be very sure that the code you're looking at is worth adding custom code like SSE.

PGO won't do this kind of optimization either. It's more useful for moving branches around according to likelihood, speculating virtual calls, inlining selectively and so on, and anyway, integrating PGO into software releases is a real PITA.
And there's the issue of getting a representative PGO session.
But of course, you have to draw a line on whether an optimization is worth it, considering readability and regression risks.

Use some game that has a fps display, like quake or duke3d, then choose some defined procedure
for measurement (quake has some console commands for example).

Unfortunately the kind of game that interests me the most, i.e. Martian Dreams, Ultima VI or Veil of Darkness, which I have profiled, doesn't have (I think) such measurements 😀
3D games are a bit out of reach for the smaller machines, which explains why the CPU emulation is maybe relatively less of a problem in the kind of game I profile.

Reply 16 of 19, by wd

User metadata
Rank DOSBox Author

PGO won't do this kind of optimization either. It's more useful for moving branches around according to likelihood, speculating virtual calls, inlining selectively and so on, and anyway, integrating PGO into software releases is a real PITA.
And there's the issue of getting a representative PGO session.

PGO is (imo) only for a quick check if likeliness evaluation makes sense. If you get a significant
speed boost in certain code parts you can have a closer look and reorder code as you say,
because then *everybody* no matter how he compiles the code will benefit from it.

Can you please post the system specs of your target devices?

Reply 17 of 19, by Tramboi

User metadata
Rank Newbie

Yep, that's a good and unobtrusive way of using PGO.

The device I'd like to play more games with:
http://openpandora.org/

OTOH, SSE and SSE2 vector instructions are available on virtually any x86 box, including small Atom machines like the MSI Wind or Eee PC.
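
For reference, run-time detection of SSE and SSE2 can be sketched as below. This is an illustration rather than the code from the attached patch: SimdSupport and detect_simd are hypothetical names. It reads CPUID leaf 1, where EDX bit 25 reports SSE and bit 26 reports SSE2, using `__cpuid` on MSVC and `__get_cpuid` from `<cpuid.h>` on gcc/clang, and it assumes the OS saves the XMM registers, which any current desktop OS does. On non-x86 targets it simply reports no support.

```cpp
#include <cstdint>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>
#define SKETCH_X86_MSVC 1
#elif (defined(__GNUC__) || defined(__clang__)) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>
#define SKETCH_X86_GNU 1
#endif

struct SimdSupport {
    bool sse;
    bool sse2;
};

inline SimdSupport detect_simd() {
    std::uint32_t edx = 0;  // stays 0 (no features) on non-x86 builds
#if defined(SKETCH_X86_MSVC)
    int regs[4] = {0, 0, 0, 0};
    __cpuid(regs, 0);  // regs[0] = highest supported basic leaf
    if (regs[0] >= 1) {
        __cpuid(regs, 1);
        edx = static_cast<std::uint32_t>(regs[3]);
    }
#elif defined(SKETCH_X86_GNU)
    unsigned int a = 0, b = 0, c = 0, d = 0;
    // __get_cpuid returns 0 when the requested leaf is unsupported.
    if (__get_cpuid(1, &a, &b, &c, &d)) edx = d;
#endif
    SimdSupport s;
    s.sse = ((edx >> 25) & 1u) != 0;   // CPUID.1:EDX bit 25
    s.sse2 = ((edx >> 26) & 1u) != 0;  // CPUID.1:EDX bit 26
    return s;
}
```

Client code would call detect_simd() once at start-up, cache the result, and pick the SSE2 or the plain C++ routine through a function pointer or a branch.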

Reply 18 of 19, by Tramboi

User metadata
Rank Newbie

Hello again!
Could you please consider integrating only the 4th patch?
It is portable, not too ugly (IMO), and it does help file-intensive processes on small ARM machines. In exchange I won't bother you anymore with the other patches 😀
Cheers,
Tramb

Reply 19 of 19, by barf982

User metadata
Rank Newbie

I've been using the DOSBox 0.74 SVN build 20110705. As an end user, what attracted me to this build was the save/load game feature. I've been gaming for many years and have simply avoided certain games out of frustration at not being able to save at any given point and having to return to the beginning of the level, or worse. I'm sure you've all experienced this, and it might even have caused certain objects to become airborne at great velocity. I know I have. In any case, a game from maybe ten years ago, which was eventually converted from DOS to Windows to 3dfx, was "Montezuma's Return". This particular game had super graphics for the time, decent sound, good gameplay and so on. It was just plain ole' fun, until you used up all your continues, fell in lava, or experienced one of the many excruciating deaths, and had to reload from the start or the beginning of the level. As I've said, I avoided this game, even though it was very cool, because it was also very frustrating. Then we have DOSBox ("wow"), and then we get a feature that allows saving and loading ("double wow").
OK, so now we get to the point of this post. Yes, I've been playing Montezuma's Return, Abandonia's version, which includes both the DOS version and the Windows version. Of course I'm playing the DOS version. It's fantastic, I'm enjoying it so much and having so much fun. Total immersion, except for one very annoying bug. The game saves, loads, saves again and loads again, and allows different slots, which is even better. But often, after loading, the sound drops out. Sometimes restarting the game (not DOSBox) and reloading the saved game brings it back; sometimes not. Sometimes restarting DOSBox works; sometimes not. As I've stated, I'm only an end user and have no real knowledge of programming. I have tried many different DOSBox configs, but the problem still occurs. Of course, if I don't reload a saved game during gameplay, the sound never drops out; it happens only on a load. I was wondering if anyone had a clue, or a clue and a solution, that someone like myself could understand and possibly implement. If not, then thanks for listening to my lengthy discussion of an era past but still very much alive to many gamers such as yourselves. Later