NO Projects Yet Assigned for Spring 2008 CSE613 (What follows is from 2003)
Projects for CSE613 Fall 2003 — v1 (See project
hints at end)

Taken: raytrace water-spatial ocean barnes fmm

 

Submitting your final reports for cse613 projects:

You each have a /home/u26/acctname directory for your project this semester. You can cd /home/u26 from compserv1 and from sbcube6.

I would like you to finish your projects and submit your final reports to me by Saturday 20 December, the end of final exams if you have no final exams. Please email your final report to ldw as a text, pdf, or doc (MS Word 97/98 or earlier) file. Make sure you include your name, your project, the submission date and report title. In your email, also tell me the exact pathname of the zip file containing all files that you wish to submit as part of your project report. Slide a printed copy of your final report under the door of my lab 1306 Computer Science, before noon on Wednesday 17 December, if it is ready then. Otherwise, reach me by email to discuss how to get a printed copy to me for grading by Saturday 20 December. If you have finals during finals week, make arrangements with me so you do not work on this project during finals week until your exams are done. I will give you time, within reason, to finish without penalty.

Please DO NOT SEND HUGE zip files by email. My disk quota is too small to accept them.

Use your directory on /home/u26 to leave ALL your data as a zip (not gzip) file containing all source files and execution records that you want me to see when grading your report. Include the final report that you email to me and submit in printed form. In your official email by Saturday, tell me the exact pathname of your project zip file. If you do not want others to see your material, leave the outer level of your /home/u26 project directory public-read and public-execute (the norm), but create an inner directory with a name such as ForLDW and make that directory public-execute only by the command chmod 711 ForLDW

Put your zip file (and any other files that you want me to see) inside that ForLDW directory. Make sure the zip file is public-read (chmod 744 FILE.z). Given your email that tells me the pathname of your zip file (e.g., /home/u26/ACCT/ForLDW/FILE.z), I can cd into /home/u26/ACCT/ForLDW and copy FILE.z but others who do not know the name FILE.z cannot access it since the directory for ForLDW is not public-readable.

If you are not satisfied with your results on the project, you may request a few more days after the final exam to finish. Send email to ldw.  

 

Project Assignment for CSE613 Fall 2003

Each of you must select a different one of the C implementations of the SPLASH-2 algorithms found at /home/u26/cse502, /home/u26/lw/cse613splash2, http://www-flash.stanford.edu/pub/flash and ftp://www-flash.stanford.edu/apps/SPLASH2. As posted, the codes run only sequentially on a single processor of the SGI machines sbcube6 or sbspimII. The profiling and code analysis tools are very good on the SGI machines. Each of you must:

  1. Compile and execute your code on the sbcube6 SGI and save the output;
  2. Use the perfex package to count dynamically all load and store instructions executed by the sequential code.
  3. Add #pragma instructions to the C code and replace the macros calls for parallel synchronization operations by the proper #pragmas for correct parallel execution on 1, 2, 4, 6, 8, and 12 sbcube6 CPUs. Verify that you get exactly the same output as from runs of the sequential code, except for timing related values. If not, correct your lock and synchronization pragmas.
  4. Use perfex to count dynamically all load and store instructions executed by the parallel code for 1, 2, 4, 6, 8, and 12 CPUs. Graph the counts and extrapolate to 32 CPUs as given in the text table and in the Splash2 overview paper included with the Splash2 C source files. The values of dynamic counts of load and store operations extrapolated to 32 CPUs should closely (say, within 10%) match the Splash2 reported counts for total reads and writes in the parallel code.
  5. Correct your use of locks and other synchronizations until the parallel code actually speeds up as you go from 1 to 2 … to 8 to 12 CPUs. At first, the parallel runs may even be slower than the sequential run. The most common mistakes that slow runs, especially for larger numbers of CPUs, are uses of overly general #pragma synchronize commands and #pragma lock commands. Use of globally shared variables by multiple CPUs requires locks for correct behavior. Frequent access shared locked variables can drastically slow execution. Use private variables wherever feasible. The #pragma alock (array lock) command allows the creation and use of hundreds of fine-grained locks, as needed by some of the applications with dynamic data structures of hundreds of computation cells. Not all writes to shared variables need be lock-protected. If you can otherwise guarantee that only one processor at a time as access to a variable, then locking is not needed. An example is a variable output by one processor but read by other processors only after a phase-ending #pragma synchronize command. A variable that is only read, even by many processors, need not (should not) be locked. After eliminating excess locking, the speedup graph for your parallel code should closely match that given in the Splash2 paper. Talk to me if it does not.
  6. Once you have your parallel code correctly but efficiently locked so that it gives the same output as the sequential codes for all numbers of CPUs and so that its speedup curve closely matches that in the Splash2 paper, recheck your prefex counts of total writes (stores) and total reads (loads) for each number of CPUs. Then identify the shared writes in your code and add an increment to an entry in a large temporary array used to count shared writes. Run your instrumented code for the same numbers of CPUs as before using perfex to again count all writes (stores). The difference between your write counts before adding the increment commands and after adding them is your estimate of the number of shared writes. Extrapolate your estimates to 32 CPUs. If you marked too many writes as shared, your shared write count for 32 CPUs will be high. Systematically reduce your marked writes until the extrapolated count of shared writes matches that in the paper.
  7. Remove the shared write markers remaining in your code. For each substitute a delay mechanism for each marker. Have a mechanism that corresponds to execution delays of 10 nanoseconds (ns), 10 ns, 1 microsecond (m s), 10 m s,100 m s, 1 millisecond (ms), and 100 ms - all times from 10 ns to 100 ms in steps of 10X. For 10 ns, you may assume that 4 ALU instructions on sbcube6 equals 10 ns. If one of you checks this by timing - via clock() - 4,000 ALU instructions repeated 1,000,000 times in a loop and finds the count to be closer to 2 or 3, please tell me and the rest of the class by email. You may want to use inline code for both 10 and 100 ns delays. For longer delays, use a delay function call such as suggested in the Splash2SGIhints file for Hint 5. Calibrate whatever delay function you chose.
  8. For each delay function value from 10 ns to 100 ms, run your sequential code and your parallel codes for at least 1, 2, 4, 6, 8, and 12 CPUs. Record the run-times carefully, until execution time per run takes more than 30 minutes or execution times have clearly become much slower than for slightly shorter delay values. Plot your run-times for each number of CPUs. If there is a sharp drop in performance - increase in execution time - between two delay function values, you do not need to make many more runs for that number of CPUs. However, consider runs for 2X, 4X, 8X, 10X and 16X the delay time before the drop in performance for a given number of CPUs to get a more detailed picture of the performance drop. If your times are noisy because of other jobs sharing sbcube6, make several runs at the same delay value and number of runs and plot the MINIMUM times.
  9. Examine your results in tables and graphs and predict the execution time of your parallel code as a function of number of CPUs running concurrently and the delay time imposed before each shared write that you identified.
  10. Write a carefully worded and organized report of 8 to 20 pages that gives your results, explains exactly what methods you used to validate and measure your code, and what conclusions you draw from your measurements about the behavior of your parallel code if it runs on CPUs that are more and more separated in physical distance from each other.

Include the raw data used to generate your plots in tables near the end of your report. You will be asked to handin both printed and electronic copies of your report, as well as a zip file containing your original and changed versions of the C-code and your script-produced files showing the raw data from your runs.

Finally, in your report, tell me any problems that you had doing this project that were especially annoying and unnecessary. Perhaps, I can eliminate such problems for future students. Good Luck. Larry

=========================================================

7 Hints and Questions for Splash2 Parallel Code Project:

1) It would REALLY REALLY help if we knew what the MACROS are supposed to do. Specific what are the inputs and outputs? What are the preconditions? What are the postconditions? The names of the MACROS are not very useful for trying to figure out what they do. There are about 2K lines in our code. It would really be a lot of extra uninstructive work for us to sift through the code to determine what the NULL MACROS are supposed to do.

Just do diff water.c water.C and you will see.

=========================================================

2) Is there documentation on these MACROS?

No, ignore them except for the diff of .C and .c main files.

What they do is convert the generic system calls into the right calls for the SGI IRIX version of unix. The null macros eliminate all multiprocessor synchronization calls so the default .c file works for one processor, not more.

=========================================================

3) Why can't I login to sbspimii?

The name of the two SGI systems are sbspimII and sbcube6. The capitalization of "II" in the name sbspimII is significant.

=========================================================

4) How exactly do you prefer us to measure time for our program?

Two methods we have been using

A. Millisecond print out at start and end of execution

B. Millisecond print out of system clock and CPU times.

Is one prefered over the other?

The real question is how accurate are your timing values. I prefer that you use either clock time or CPU time as reported by the operating system, but only times that are reported with millisecond accuracy. The times that you use should be reasonably consistent for several runs of the same job. One runtime by itself is not very convincing.

 

What is the range of your run times? If you are comparing runtimes of 50 ms vs 80 ms, you need a more precise timing measure. One way is to run the code 10 (or 100) times in a loop and divide the total end-start time difference by 10 (or 100) to get more accurate measurements.

=========================================================

5) Please tell me where to find profiling tools on the SGI computers. What are their names? I could not find a man page for gprof nor any gprof binary. What should I use on the SGI machines?

 

Start with man ssrun . There is also an earlier execution time profiler called pixie. See man pixie . Pixie can be started from ssrun. See also man perfex .

For an overview of SpeedShop (ssrun) and perfex execution performance measurement tools on SGI computers, see OvervuSGIperfexProfileTools.htm

 

If you cannot see man pages on SGI, copy the MANPATH entries from my ~ldw/.cshrc account environment setup file. The GNU gprof tool is not in the list of profiling tools on SGI. Here is how I found ssrun:

 

man -k prof lists, among unrelated commands, these commands for job time profiling:

dprof (1) - a memory access profiling tool

fbdump (1) - Writes compiler feedback files from prof(1)

kernprof (1) - Special executable for SpeedShop performance measurements on the UNIX kernel

prof (1) - Analyzes and displays SpeedShop performance data

profil (2) - execution time profile

sprofil (2) - execution time profile for disjoint text spaces

 

man -k speed lists these commands for job time profiling with the SpeedShop suite of tools:

calloc, free, malloc, memalign, realloc, ssmalloc_error, valloc (3) - SpeedShop memory allocation library

fpe_trace_option (3) - SpeedShop Floating-Point Exception (FPE) tracing library

io_ss (3) - SpeedShop input/output (I/O) tracing library

kernprof (1) - Special executable for SpeedShop performance measurements on the UNIX kernel

prof (1) - Analyzes and displays SpeedShop performance data

SpeedShop (1) - An integrated package of performance tools

ssaggregate (1) - Combines multiple SpeedShop experiment files into one

ssdump (1) - print out the contents of SpeedShop performance experiment data files

ssrt_buffer_clear, ssrt_caliper_point, ssrt_experiment_stop, ssrt_interface_routine (3) - Invokes SpeedShop runtime library routines

ssrun (1) - Collects SpeedShop and WorkShop performance data

sswsextr (1) - Extracts working set files from SpeedShop ideal-time experiment

=========================================================

6) How can I get accurate run times to calculate my improvement percentage? I make many runs, but the average jumps all over the place depending on the time of day.

Do NOT use the AVERAGE of many runs. Take many runs, but USE the MINIMUM RUNTIME value. To be convincing, take many runs and show me, via a table and a graph, that there are several runs almost as short as the one you have observed as the minimum. Get the minimum for each optimization level and both before and after you make your improvements to the best optimized code. It is OK and normal that some runtimes are much larger than the minimum. Show all runtimes, not just the closest few. Comment about the differences in your runtimes from the largest to smallest. Which runtimes are unusual?

If you have trouble getting consistent minima, run your measurements at times of the day when there are few other users (4 to 8 AM in most computer science departments); or use a universally accessible (so I can check your results) computer that is rarely used by others, at least for parts of the day. However, all your calibration runs to measure execution times, must be on this same machine. Be sure to identify your machine and run parameters in your report.

=========================================================

7) What is expected for the output and documentation of the project? We are almost done with improvements and are interested in starting and completing our documenting of the project.

Here is one format for your report. You may use another format so long as it tells me clearly what you did in your project.

 

SUMMARY of your work (1 or 2 paragraph abstract)

 

INTRODUCTION

Who are you? What code did you study? Generally how does the code work (I do not want lots of details)?

 

METHODOLOGY

Memory access counting

How did you count total reads, total writes, and shared writes for your sequential and parallel codes?

Timing

What method did you use to determine the run time for each of your runs at each of the delay values? How fast did the code run of the two machines that you used to determine the best optimization level? How many runs did you make with each version of the code to determine minimum run times. What were those run times (I want to see them to be convinced that your times are accurate)?

Comparison of run outputs

How did you determine that the results of two runs were the same? Give a specific example from your runs.

Computer and compiler optimizations used

What computer, compiler and compiler optimization level did you chose for your runs?

Modifications to the code.

Generally what kinds of modifications did you use to make your code efficiently in parallel? What did you have to do to get the same results in parallel as from the sequential code? How did you have to modify the code and #pragmas to get the parallel code to run efficiently before delays were added. In what places in the code was special care needed using the right #pragmas? Did you make any other modifications to the Splash2 source code? 

RESULTS

How well did your results on code read counts, write counts, and shared write counts match the results in the Splash2 paper for 32 CPUs? What were the counts for the sequential code? What, for each number of CPUs running the final version of the parallel code: 1, 2, 4, 6, 8, and 12 CPUs? How many static locations in your code did you mark as being shared writes? In general, where were they located and what information did they pass to other processors and under what synchronization conditions?

How well did your results on parallel code speedup match the results in the Splash2 paper for 1 to 32 CPUs? What were the runtimes for the sequential code? What, for each number of CPUs running the final version of the parallel code: 1, 2, 4, 6, 8, and 12 CPUs?

How did the runtimes for your parallel code change as you increased the delay time before each memory store counted as a shared write.

 

CONCLUSIONS

Examine your results in tables and graphs and predict the execution time of your parallel code as a function of number of CPUs running concurrently and the delay time imposed before each shared write that you identified.

Are any of the shared values written by your code, read by more than one remote CPU before the value is changed again? Where are any multiply shared writes in your parallel code? If there are any, suggest how to modify your code to reflect the greater performance degradation from shared written values that are read by one CPU versus those shared written values that are read by many different CPUs.

What conclusions do you draw from your measurements about the behavior of your parallel code if it runs on CPUs that are more and more separated in physical distance from each other. Assuming that each shared write causes a delay equal to the propagation delay for a signal travelling at 2/3 light-speed (so, 200 kilometers per millisecond), is there any distance at which adding CPUs will not make your parallel code run faster than the sequential code on a single CPU?

 

DIARY

(Comments here will only improve your grade, not detract from it) What, if any, unnecessary problems did you have in doing this project? How did you solve the problem(s)? What improvements do you think are needed to help students do similar projects in the future?

 

General comment: some of the data that you put into tables may be easier for me to understand if you include a graph of the data values, in addition to the table.

 

=========================================================