Dual GPU Usage


A nice bonus we came across recently at the lab is how Windows 8.1 handles systems with two GPUs. Since the beginning of our lab, our CAVE machines have each had two GPUs, but it was always a challenge to program a scenario that actually made use of both at once. This was particularly true in Windows XP, where customized OpenGL hooks and functions had to be employed, and even then they did not always deliver the hoped-for gains in frame rate. It turns out that in newer versions of Windows, if you launch an application on the desktop assigned to the second GPU, such that the window covers only that desktop, the second GPU is used automatically by the system. This has enabled a near doubling of frame rates in many of our CAVE applications!
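As a concrete illustration, here is a minimal Win32 sketch of positioning a window so that it exactly covers a secondary monitor's desktop, under the assumption (true on our machines) that the secondary desktop is the one driven by the second GPU. The function and variable names are my own, and error handling is trimmed.

// Sketch: move a window so it fully covers the second monitor's desktop,
// assuming that desktop is driven by the second GPU (Windows 8.1, Win32).
#include <windows.h>
#include <vector>

static BOOL CALLBACK collectMonitors(HMONITOR monitor, HDC, LPRECT, LPARAM param)
{
    reinterpret_cast<std::vector<HMONITOR>*>(param)->push_back(monitor);
    return TRUE;  // keep enumerating
}

void moveWindowToSecondMonitor(HWND hwnd)
{
    std::vector<HMONITOR> monitors;
    EnumDisplayMonitors(nullptr, nullptr, collectMonitors,
                        reinterpret_cast<LPARAM>(&monitors));
    if (monitors.size() < 2)
        return;  // no second desktop attached

    MONITORINFO info = {};
    info.cbSize = sizeof(MONITORINFO);
    GetMonitorInfo(monitors[1], &info);
    const RECT& r = info.rcMonitor;

    // Cover the second desktop exactly so the OS assigns rendering to the
    // GPU that owns it.
    SetWindowPos(hwnd, HWND_TOP, r.left, r.top,
                 r.right - r.left, r.bottom - r.top, SWP_SHOWWINDOW);
}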

The Black Art of Projectors

Projectors that accompany immersive display systems are often not the easiest pieces of machinery to operate.  Sure, turning them on and getting an image is simple, but at least for the projectors we have, some features and settings are far from obvious in terms of what they do or how to set them properly.  Then there's the constant problem of maintaining consistent color calibration between projectors.  Calibration ends up being a function of each projector's bulb life span, the physical location of the projection surface relative to any lighting in the area, light reflected between adjacent projectors, and so on.  Auto-calibration procedures for this are few and far between, and those that do exist are complex.

One aspect of these projectors that took a while to understand is a projector's ability to natively provide active stereo support, regardless of whether stereo is enabled at the operating system level (i.e., in the nVidia control panel).  That's right: the projectors we have in our CAVE and Dev Lab can essentially override that setting and still provide various forms of stereo.  This helped us work around a particular problem we were seeing on Windows 8+ operating systems, where enabling stereo at the operating system level would cap the stereo frame rate at half the display refresh rate.  By disabling stereo at the OS level, we were still able to obtain active stereo through the projector, simply by requesting a stereo pixel format during Win32 window creation in an OpenGL application.
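For reference, here is a minimal sketch of what requesting that stereo pixel format looks like during Win32/OpenGL window setup (names are illustrative and error handling is trimmed); at render time the application then draws to GL_BACK_LEFT and GL_BACK_RIGHT each frame.

// Sketch: request a quad-buffered (active) stereo pixel format at window
// creation, independent of the stereo setting in the driver control panel.
#include <windows.h>

bool setupStereoPixelFormat(HDC hdc)
{
    PIXELFORMATDESCRIPTOR pfd = {};
    pfd.nSize      = sizeof(pfd);
    pfd.nVersion   = 1;
    pfd.dwFlags    = PFD_DRAW_TO_WINDOW | PFD_SUPPORT_OPENGL |
                     PFD_DOUBLEBUFFER | PFD_STEREO;   // ask for quad-buffered stereo
    pfd.iPixelType = PFD_TYPE_RGBA;
    pfd.cColorBits = 32;
    pfd.cDepthBits = 24;

    int format = ChoosePixelFormat(hdc, &pfd);
    if (format == 0 || !SetPixelFormat(hdc, format, &pfd))
        return false;

    // Confirm the format we actually received supports stereo.
    DescribePixelFormat(hdc, format, sizeof(pfd), &pfd);
    return (pfd.dwFlags & PFD_STEREO) != 0;
}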

Cruising Along


Over the past year, I proudly welcomed a new addition to my family – Weston William Tredinnick.  Now that we've settled into a pretty smooth routine with him and he's sleeping well, I'll be aiming to write a bit more on this blog.

Several other exciting changes have been happening around the LEL – our former director, Patti Brennan, is now the director of the National Library of Medicine.  We also have a new faculty member at the lab – Karen Schloss, from Brown University.  Karen specializes in color perception and is a great addition to the lab!  With all of these changes, the lab continues to cruise along, working on new projects and virtual reality research as usual!

The Oculus + Laptop Conundrum

In general, I think Oculus has been making good moves and good progress on the advancement of their HMD.  I am looking forward to trying out the consumer version of the HMD later this spring.

One area that hasn't made a huge amount of sense to me is their move to prevent the Oculus from working with most laptops.  It sounds like the decision was due to many laptops having more than one graphics card (an on-board chip plus perhaps an nVidia mobile card), along with a piece of technology in these laptops called "Optimus" that automatically switches between the two cards depending on which application is currently running and has focus.

It sounds like the Optimus technology interferes with their desire and ability to implement their "direct to HMD" mode, which in the long run will probably make it easier for most people to run applications.

What this does take away, in many regards, is the ability to have a portable HMD in a nice form factor.  I could fit the Oculus and a laptop in a single bag, go anywhere in the world, and show off our research and the software technology driving many of our projects.  (I still can, actually, thanks to my refusal to upgrade the Oculus runtime on some of our laptops.)  But this does cut us off from keeping up to date with the latest available runtime and its benefits.

No one wants to lug around an entire desktop PC with the Oculus, but many people are now being forced to.

It sounds like there are laptops out there that are compatible because they don't have a mobile graphics card; however, the price tags on these laptops are not cheap – a minimum of $2,000 a pop from what I've seen.

Perhaps this is a smart play by Oculus in the long run: maybe it will drive laptop manufacturers to start wiring non-mobile cards into their machines, and as more hardware companies do so, prices will come down across the board, which will benefit the public.  But that will depend on the success of the Oculus after its consumer release!

Couple Pieces of Exciting New Tech

A couple of new pieces of technology have caught my attention recently.  One is Valve's Lighthouse tracking system.

This technology may replace the $10,000–$30,000 tracking systems that presently exist and remain one of the most expensive parts of VR display systems.

Another one that has caught my attention is Google's Project Soli.  It's essentially an embedded gesture recognition chip that analyzes radio wave (radar) interactions between the chip and human fingers.

Embedding some of these in a VR display system (such as an HMD) may bring about a new method of natural interaction with your hands (arguably better than using cameras à la the Leap Motion).  One could also place them around the walls of a CAVE system for easier gesture detection.

 

The Limits of Points

With the vizHOME project, a significant challenge has been finding techniques to efficiently render the millions of points we capture with the LiDAR scanner.  Some of the homes we scanned resulted in data sets of upwards of 750 million points.

Although very powerful, modern GPUs are nowhere near being able to render that many points per frame while still maintaining interactive frame rates.

So how many points can GPUs handle before they start to slow down?

I did some benchmarking on one of our power wall machines that has a Quadro 5000 in it with 2.5 GB of graphics memory.

The test consisted of drawing some number of random points within a 50 m × 50 m × 50 m cube, with the viewpoint at (0, 0, 50) facing straight down the -z axis so that the whole cube of points was in view.

The test was done using basic point drawing (squares) with a very simple shader and no smoothing.  Here are the results:

[Chart: point drawing benchmark – frame rate vs. number of points drawn]

As can be seen, 60 fps is lost fairly quickly, right around 8–10 million points.  Considering a LiDAR scanner can capture 11 million points in 5–6 minutes, we have a challenging task on our hands if we want to maintain high frame rates.

The additional colored marks are sanity checks performed when we sorted the data back to front or front to back, and when rendering in stereo.  Sorting front to back increased frame rates fairly significantly in this situation; however, this won't necessarily carry over to the general case, since this was close to a worst-case scenario.  In the stereo trials, we saw frame rates roughly cut in half, as expected (since we rendered twice as many points).

It would be interesting to see how these numbers match up against newer cards (this card is about 5 years old now).
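For anyone who wants to run a similar measurement on newer hardware, here is a rough sketch of the kind of setup the benchmark used: N random points generated inside the cube, uploaded once, and drawn every frame as plain GL_POINTS.  Names and constants are illustrative; window creation, the trivial shader, and the timing code are omitted, and a VAO is assumed to already be bound.

// Sketch: build a VBO of N random points inside a 50 m cube (assumed here to
// be centered at the origin) and draw them as unsmoothed GL_POINTS.
#include <GL/glew.h>
#include <cstdlib>
#include <vector>

GLuint makeRandomPointCloud(size_t numPoints)
{
    std::vector<float> pts(numPoints * 3);
    for (float& v : pts)
        v = (rand() / float(RAND_MAX)) * 50.f - 25.f;  // each axis in [-25, 25] m

    GLuint vbo = 0;
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, pts.size() * sizeof(float),
                 pts.data(), GL_STATIC_DRAW);
    return vbo;
}

void drawPoints(GLuint vbo, size_t numPoints)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glEnableVertexAttribArray(0);
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, nullptr);
    glDrawArrays(GL_POINTS, 0, static_cast<GLsizei>(numPoints));
}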

 

 

New HMD Comparison

It seems that new Head Mounted Displays (HMDs) are being released every month or so these days, with their cost declining rapidly since 2004–2006, when I worked with one in graduate school.

Overall, my impression of recent HMDs is that they are hands down better than the one I worked with during graduate school (an nVis nVisor SX).  In my opinion, the Oculus Rift's design and form factor are leaps and bounds ahead of the nVis HMD (lighter, less light leak-through, thinner tether cables), which is a great thing for VR!

With so many HMDs arriving, I figured it’s useful to track the specs and features of all of these new displays for easy comparison.  Instead of providing a completely new list, here are some links to existing ones, followed by a small chart comparing the most recent HMDs.

http://www.stereo3d.com/hmd.htm#chart

http://www.roadtovr.com/head-mounted-display-hmd-vr-headset-comparison/

 

HMD | Resolution (per eye) | Refresh | Field of View | Weight | Wireless | Cost | Display Type | Tracking
Oculus DK1 | 640×800 | 60 Hz | 110° | 380 g | No | $300 | LCD | Rotation
Oculus DK2 | 960×1080 | 75 / 72 / 60 Hz | 100° | 440 g | No | $350 | OLED | Rotation + position via external camera
Oculus Crescent Bay | 1280×1440 | 90 Hz | 110° | ? | No | ? | OLED | Rotation + position via external camera
Oculus/Samsung Gear VR | 1280×1440 | 60 Hz | 96° | ? | Yes | $200 (+ $700 Samsung Galaxy) | AMOLED | Rotation
Sony Morpheus | 960×1080 | 120 Hz | 100° | ? | No | ? | OLED | Rotation + position via PlayStation Move camera
Valve Vive | 1080×1200 | 90 Hz | 110° | ? | No | $200–$500 | ? | Rotation + position, 15 ft × 15 ft tracked space

 

Comparing all of these to the nVis nVisor SX: the nVis had a 1280×1024 resolution per eye, a 60 Hz refresh rate, and only a 60 degree field of view.  Plus the kicker: it cost $24,000…

It’s definitely exciting times for HMD technology.

Multi-threaded loading of data in OpenGL applications

During implementation of the vizHOME viewer point rendering application, it quickly became apparent that, given the amount of data we are gathering with the LiDAR scanner, we would need a multi-threaded technique for loading points outside of the main rendering thread – often known as an "out-of-core" technique.

Implementation details on how to properly execute this sort of technique aren't exactly abundant.  A common use in graphics applications is loading level-of-detail versions of large terrains on the fly, in games or geographic applications.  Previous point cloud renderers have done it in various forms, but not a whole lot is out there on the correct way to do it.

Having had various multi-threaded programming experiences over the years, I've put together a couple of different techniques for out-of-core loading in the current viewing tool.  In working through this problem, two variations stood out: creating multiple threads that share a single OpenGL context (MTSC – multiple threads, single context), or creating multiple threads, each with its own OpenGL context (MTMC – multiple threads, multiple contexts).

The big difference between these two variations is that with MTSC, physical memory has to be allocated each time a file of points is read from disk, whereas with MTMC we can share OpenGL objects across threads and read data directly into GPU memory via glMapBufferRange on the actual reading thread.  This avoids passing the read data back to the main rendering thread in a physical memory buffer and uploading it to the GPU there via glBufferData.  Uploading to the GPU on the reading thread saves an extra copy of the data and means we don't have to worry about things like memory fragmentation from dynamic allocations (assuming we aren't using some sort of memory pool to alleviate this).  On the other hand, the literature states that the more OpenGL contexts you have active, and the more you have going on in those contexts, the more often the OpenGL pipeline needs to perform a "context switch", which, according to this nice blog post about multiple contexts, incurs a performance penalty.
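To make the MTMC idea concrete, here is a rough sketch of what one loader thread looks like in that variation: it has its own OpenGL context sharing buffer objects with the render context, and writes point data straight into GPU memory with glMapBufferRange.  Context and handle names are illustrative and error handling is omitted; the actual loader also manages queues and an octree, which are left out here.

// Sketch: MTMC loader thread – its own context, shared buffer objects,
// direct upload via glMapBufferRange.
#include <GL/glew.h>
#include <windows.h>
#include <cstring>

void loaderThread(HDC hdc, HGLRC loaderContext,
                  GLuint pointVbo, const float* srcPoints, size_t numFloats)
{
    // loaderContext was created on the main thread and set up to share
    // objects with the render context (e.g. via wglShareLists).
    wglMakeCurrent(hdc, loaderContext);

    glBindBuffer(GL_ARRAY_BUFFER, pointVbo);
    void* dst = glMapBufferRange(GL_ARRAY_BUFFER, 0,
                                 numFloats * sizeof(float),
                                 GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);
    if (dst)
    {
        std::memcpy(dst, srcPoints, numFloats * sizeof(float));  // memory-mapped blob -> GPU
        glUnmapBuffer(GL_ARRAY_BUFFER);
    }

    // Make sure the upload has completed before the render context draws
    // from this buffer (a fence sync object would be the more precise tool).
    glFinish();
    wglMakeCurrent(nullptr, nullptr);
}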

In our situation we aren't actually performing any rendering within the other contexts, just uploading to the GPU, so the question becomes: does this still create a significant performance hit?  If so, is it worse than having to pass around a bunch of memory buffers at run time instead of being able to upload directly to the GPU?

To test this out, I conducted a short experiment comparing the rate at which data loads, and the overall frame rate of our point rendering application, between the two techniques.  The test was performed in a 1280×800 window on a Windows 7 64-bit machine with 8 GB of RAM, 8 cores, and a GeForce GT 750M nVidia card with 4 GB of graphics memory.  The test point cloud consisted of 156.4 million points.  Timing started from a position within the model where 2,259 octants and 4,433,172 points needed to be loaded, and measured the time between the start of reading and the moment all of the reading queues became empty.  All times were recorded after the nodes had been cached by the operating system, to remove this factor from the measurements as much as possible (no other files or applications were accessed in between).  Also, the reading here is actually done from a big binary blob that has been memory mapped (this will be discussed in a future post), so essentially a memcpy copies the data out of the binary blob either into a physically allocated float array (MTSC) or into a direct GPU memory pointer returned by glMapBufferRange (MTMC).  The test was conducted with 1, 2, 4, and 8 threads in each case.  I recorded 10 timings for MTSC and MTMC at each thread count and threw out the minimum and maximum of each set of 10.  I then took the average of the remaining 8 samples and plotted them (milliseconds to converge on the y-axis, number of threads on the x-axis):

[Chart: time to convergence (ms) vs. number of reading threads, MTSC vs. MTMC]

A couple of fairly interesting things stand out here.  There was a decent difference in overall convergence time: multiple threads with a single OpenGL context was faster when using fewer threads.  However, when the thread count increased to 4–8, the times to convergence were very similar, with multiple threads and multiple OpenGL contexts actually performing about a quarter to a half millisecond faster.

So it appears that using multiple OpenGL contexts for reading can be a win: with a larger number of threads the reading speeds are the same, but there is no need for physical memory allocation.  It would be interesting to see whether more samples, or more threads, yield the same results.  Also, there are other things going on in this test case besides reading (the data is actually being drawn as well while it converges), so it would be interesting to see if there is any difference in drawing time between the two – although frame rates during both tests leveled out at 30 fps.

Has anyone else out there dealt with multiple OpenGL contexts?  How has performance been affected in your application?

 

Don’t Forget The Fundamentals

A big project I'm currently involved with is developing an out-of-core point cloud renderer for the vizHOME project.  This application enables viewing of LiDAR-scanned homes within our C6-CAVE, as well as on the Oculus Rift and the DSCVR system.  The application is fully functional, and I'm now at the stage of optimizing it.  In working on those optimizations, I came across some "gotchas" that I had definitely overlooked.

Early on in development, I decided to work with std::list rather than std::vector to serve as a type of priority queue for rendering.  At the time, this was due to some issues getting a sort to work correctly on a vector of pointers to objects.  The problem had something to do with how the predicate for std::sort was being declared – check out the fun syntax for the now-correct declaration:

bool nearDistSort(const VizHomePCD * const & a, const VizHomePCD * const & b);
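And a hypothetical usage of that predicate, sorting a vector of octant pointers before rendering (VizHomePCD is the viewer's own point-cloud octant type; the wrapper function name is just for illustration):

#include <algorithm>
#include <vector>

void sortOctantsByDistance(std::vector<VizHomePCD*>& visibleOctants)
{
    // Assuming nearDistSort orders by distance, nearest octants end up first.
    std::sort(visibleOctants.begin(), visibleOctants.end(), nearDistSort);
}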

I've recently discovered that working with std::list is a really bad idea in the majority of cases.  Adding some profiling code to the project showed that basic std::list operations like push_back and pop_front were eating a significant amount of program time, particularly once the list grew to 2,000–3,000 elements.

The following blog post confirms the poor, poor performance of std::list vs. std::vector in the majority of cases:

List vs. Vector vs. Deque

Fundamental lesson #1: carefully consider which STL data structures are really appropriate for your application – and in most cases prefer vector or deque over list.
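For completeness, here is a small sketch of the same push-back / pop-front queue pattern backed by std::deque instead of std::list; std::deque stores its elements in contiguous blocks, which avoids a per-node allocation and is generally much friendlier to the cache (the type and function names are illustrative).

#include <deque>

struct VizHomePCD;  // point-cloud octant type from the viewer (declaration only)

std::deque<VizHomePCD*> loadQueue;

void enqueueOctant(VizHomePCD* octant)
{
    loadQueue.push_back(octant);   // enqueue work at the back
}

VizHomePCD* dequeueOctant()
{
    if (loadQueue.empty())
        return nullptr;
    VizHomePCD* next = loadQueue.front();
    loadQueue.pop_front();         // dequeue work from the front
    return next;
}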

——————————————————————————————————————

While performing some code refactoring and replacing an older math class (a custom class a visiting professor wrote) with a newer and more widely used one (glm), I came across a case of very poor naming of a function in glm.

It turns out that glm has two different length() functions.  If you call the length() function on a declared object, i.e.

glm::vec3 test(1.f, 5.f, 20.f);

test.length();  // returns 3 – the number of components, not the vector's magnitude

The length() call here will always return 3, as opposed to the length of the vector test.  What?  The function returns how many elements the type holds rather than calculating the vector's length.  glm instead provides a global function, length(vector), that calculates the actual vector length, i.e.

glm::length(test);  // calculates the correct (Euclidean) length – about 20.64 here

Fundamental lesson #2: don't make assumptions about what a function does – actually look at the code (if possible) before using it.  And name your functions clearly (particularly when millions of people will be using them).

Even after working with C++ for over 15 years, I find that small fundamental things like these can still go unnoticed and cause chaos without careful testing.

Fundamental lesson  #3: always profile your application.

 

Welcome

Welcome!  This website is meant to serve as a location to share various thoughts and excerpts from a virtual reality researcher and software developer.  I have been employed at the University of Wisconsin–Madison's Living Environments Laboratory since February 2012, where I develop software, maintain hardware, work with students, and contribute to virtual reality research.  With a new wave of low-cost virtual reality hardware on the horizon, I hope the page serves as a place where common problems can be discussed with others interested and working in the field of virtual reality.