NO Projects Yet Assigned for Spring 2008 CSE613 (What follows is from 2003)
Projects for CSE613 Fall 2003 v1 (See project
Taken:
raytrace water-spatial ocean barnes fmm
Submitting your final reports for cse613 projects:
You each have a /home/u26/acctname directory for your project this semester. You can cd /home/u26 from compserv1 and from sbcube6.
I would like you to finish your projects and submit your final reports to me by Saturday 20 December, the end of final exams if you have no final exams. Please email your final report to ldw as a text, pdf, or doc (MS Word 97/98 or earlier) file. Make sure you include your name, your project, the submission date and report title. In your email, also tell me the exact pathname of the zip file containing all files that you wish to submit as part of your project report. Slide a printed copy of your final report under the door of my lab 1306 Computer Science, before noon on Wednesday 17 December, if it is ready then. Otherwise, reach me by email to discuss how to get a printed copy to me for grading by Saturday 20 December. If you have finals during finals week, make arrangements with me so you do not work on this project during finals week until your exams are done. I will give you time, within reason, to finish without penalty.
Please DO NOT SEND HUGE zip files by email. My disk quota is too small to accept them.
Use your directory on /home/u26 to leave ALL your data as a zip (not gzip) file containing all source files and execution records that you want me to see when grading your report. Include the final report that you email to me and submit in printed form. In your official email by Saturday, tell me the exact pathname of your project zip file. If you do not want others to see your material, leave the outer level of your /home/u26 project directory public-read and public-execute (the norm), but create an inner directory with a name such as ForLDW and make that directory public-execute only by the command chmod 711 ForLDW
Put your zip file (and any other files that you want me to see) inside that ForLDW directory. Make sure the zip file is public-read (chmod 744 FILE.z). Given your email that tells me the pathname of your zip file (e.g., /home/u26/ACCT/ForLDW/FILE.z), I can cd into /home/u26/ACCT/ForLDW and copy FILE.z but others who do not know the name FILE.z cannot access it since the directory for ForLDW is not public-readable.
If you are not satisfied with your results on the project, you may request a few more days after the final exam to finish. Send email to ldw.
Project Assignment for CSE613 Fall 2003
Each of you must select a different one of the C implementations of the SPLASH-2 algorithms found at /home/u26/cse502, /home/u26/lw/cse613splash2,
http://www-flash.stanford.edu/pub/flash and ftp://www-flash.stanford.edu/apps/SPLASH2. As posted, the codes run only sequentially on a single processor of the SGI machines sbcube6 or sbspimII. The profiling and code analysis tools are very good on the SGI machines. Each of you must:
Include the raw data used to generate your plots in tables near the end of your report. You will be asked to hand in both printed and electronic copies of your report, as well as a zip file containing your original and changed versions of the C code and your script-produced files showing the raw data from your runs.
Finally, in your report, tell me any problems that you had doing this project that were especially annoying and unnecessary. Perhaps, I can eliminate such problems for future students. Good Luck. Larry
=========================================================
7 Hints and Questions for Splash2 Parallel Code Project:
1) It would REALLY REALLY help if we knew what the MACROS are supposed to do. Specifically, what are the inputs and outputs? What are the preconditions? What are the postconditions? The names of the MACROS are not very useful for trying to figure out what they do. There are about 2K lines in our code. It would really be a lot of extra uninstructive work for us to sift through the code to determine what the NULL MACROS are supposed to do.
Just do diff water.c water.C and you will see.
=========================================================
2) Is there documentation on these MACROS?
No, ignore them except for the diff of .C and .c main files.
What they do is convert the generic system calls into the right calls for the SGI IRIX version of unix. The null macros eliminate all multiprocessor synchronization calls so the default .c file works for one processor, not more.
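As a rough illustration only, the idea of a "null" macro set can be sketched with the C preprocessor. The macro names, the PARALLEL_BUILD switch, and the use of pthreads below are all assumptions made for this sketch; they are not copied from the SPLASH-2 macro files or from the SGI IRIX versions, so rely on the diff of the .C and .c files for the real definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical macro set: names and the PARALLEL_BUILD switch are
   illustrative assumptions, not the actual SPLASH-2 macros. */
#ifdef PARALLEL_BUILD
#include <pthread.h>
typedef pthread_mutex_t lock_t;
#define LOCK_INIT(l) pthread_mutex_init(&(l), NULL)
#define LOCK(l)      pthread_mutex_lock(&(l))
#define UNLOCK(l)    pthread_mutex_unlock(&(l))
#else
/* "Null" versions: every synchronization call expands to a no-op,
   so the same source is correct only on a single processor. */
typedef int lock_t;
#define LOCK_INIT(l) ((void)((l) = 0))
#define LOCK(l)      ((void)(l))
#define UNLOCK(l)    ((void)(l))
#endif

static lock_t total_lock;
static long total_events;

/* Reset the shared counter and its lock. */
void init_counter(void)
{
    LOCK_INIT(total_lock);
    total_events = 0;
}

/* Add n events to the shared counter and return the new total. */
long count_events(long n)
{
    long total;
    assert(n >= 0);            /* event counts are never negative */
    LOCK(total_lock);          /* disappears in the null build */
    total_events += n;
    total = total_events;
    UNLOCK(total_lock);
    return total;
}
```

Compiled without PARALLEL_BUILD, the lock calls vanish and count_events is plain sequential code; that is the sense in which the default .c file works for one processor only.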
=========================================================
3) Why can't I login to sbspimii?
The names of the two SGI systems are sbspimII and sbcube6. The capitalization of "II" in the name sbspimII is significant.
=========================================================
4) How exactly do you prefer us to measure time for our program?
Two methods we have been using
A. Millisecond print out at start and end of execution
B. Millisecond print out of system clock and CPU times.
Is one preferred over the other?
The real question is how accurate are your timing values. I prefer that you use either clock time or CPU time as reported by the operating system, but only times that are reported with millisecond accuracy. The times that you use should be reasonably consistent for several runs of the same job. One runtime by itself is not very convincing.
What is the range of your run times? If you are comparing runtimes of 50 ms vs 80 ms, you need a more precise timing measure. One way is to run the code 10 (or 100) times in a loop and divide the total end-start time difference by 10 (or 100) to get more accurate measurements.
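The repeat-and-divide approach might look like the following minimal sketch. It assumes wall-clock time from gettimeofday() is acceptable; the helper names wall_ms, time_per_call_ms, and sample_kernel are hypothetical, and sample_kernel stands in for whatever short computation you are timing.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Wall-clock time in milliseconds, read with gettimeofday(). */
double wall_ms(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}

/* Call work() reps times in a loop and return the mean time per
   call in milliseconds (total end-start difference divided by reps). */
double time_per_call_ms(void (*work)(void), int reps)
{
    double start;
    assert(work != NULL && reps > 0);
    start = wall_ms();
    for (int i = 0; i < reps; i++)
        work();
    return (wall_ms() - start) / reps;
}

/* Hypothetical stand-in for the short computation being timed. */
static volatile double sink;

void sample_kernel(void)
{
    double sum = 0.0;
    for (int i = 1; i <= 100000; i++)
        sum += 1.0 / i;
    sink = sum;   /* keep the compiler from discarding the loop */
}
```

Calling time_per_call_ms(sample_kernel, 100) returns the per-call time averaged over 100 back-to-back calls, which smooths out the clock's resolution limit. This averaging inside one run is separate from how you compare several runs of the same job.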
=========================================================
5) Please tell me where to find profiling tools on the SGI computers. What are their names? I could not find a man page for gprof nor any gprof binary. What should I use on the SGI machines?
Start with man ssrun. There is also an earlier execution time profiler called pixie. See man pixie. Pixie can be started from ssrun. See also man perfex. For an overview of SpeedShop (ssrun) and perfex execution performance measurement tools on SGI computers, see
OvervuSGIperfexProfileTools.htm
If you cannot see man pages on SGI, copy the MANPATH entries from my ~ldw/.cshrc account environment setup file. The GNU gprof tool is not in the list of profiling tools on SGI. Here is how I found ssrun:
man -k prof
lists, among unrelated commands, these commands for job time profiling:
dprof (1) - a memory access profiling tool
fbdump (1) - Writes compiler feedback files from prof(1)
kernprof (1) - Special executable for SpeedShop performance measurements on the UNIX kernel
prof (1) - Analyzes and displays SpeedShop performance data
profil (2) - execution time profile
sprofil (2) - execution time profile for disjoint text spaces
man -k speed
lists these commands for job time profiling with the SpeedShop suite of tools:
calloc, free, malloc, memalign, realloc, ssmalloc_error, valloc (3) - SpeedShop memory allocation library
fpe_trace_option (3) - SpeedShop Floating-Point Exception (FPE) tracing library
io_ss (3) - SpeedShop input/output (I/O) tracing library
kernprof (1) - Special executable for SpeedShop performance measurements on the UNIX kernel
prof (1) - Analyzes and displays SpeedShop performance data
SpeedShop (1) - An integrated package of performance tools
ssaggregate (1) - Combines multiple SpeedShop experiment files into one
ssdump (1) - print out the contents of SpeedShop performance experiment data files
ssrt_buffer_clear, ssrt_caliper_point, ssrt_experiment_stop, ssrt_interface_routine (3) - Invokes SpeedShop runtime library routines
ssrun (1) - Collects SpeedShop and WorkShop performance data
sswsextr (1) - Extracts working set files from SpeedShop ideal-time experiment
=========================================================
6) How can I get accurate run times to calculate my improvement percentage? I make many runs, but the average jumps all over the place depending on the time of day.
Do NOT use the AVERAGE of many runs. Take many runs, but USE the MINIMUM RUNTIME value. To be convincing, take many runs and show me, via a table and a graph, that there are several runs almost as short as the one you have observed as the minimum. Get the minimum for each optimization level and both before and after you make your improvements to the best optimized code. It is OK and normal that some runtimes are much larger than the minimum. Show all runtimes, not just the closest few. Comment about the differences in your runtimes from the largest to smallest. Which runtimes are unusual?
If you have trouble getting consistent minima, run your measurements at times of the day when there are few other users (4 to 8 AM in most computer science departments); or use a universally accessible (so I can check your results) computer that is rarely used by others, at least for parts of the day. However, all your calibration runs to measure execution times must be on this same machine. Be sure to identify your machine and run parameters in your report.
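One possible way to summarize a set of recorded runtimes along these lines is sketched below. The function names and the choice of a fractional tolerance are assumptions for illustration; pick whatever threshold you can justify in your report.

```c
#include <assert.h>
#include <stddef.h>

/* Return the smallest of n recorded runtimes (milliseconds). */
double min_runtime(const double *ms, size_t n)
{
    double best;
    assert(ms != NULL && n > 0);
    best = ms[0];
    for (size_t i = 1; i < n; i++)
        if (ms[i] < best)
            best = ms[i];
    return best;
}

/* Count the runs whose time is within `fraction` of the minimum
   (for example 0.01 for 1%). The minimum itself is included. */
size_t runs_within(const double *ms, size_t n, double fraction)
{
    double limit = min_runtime(ms, n) * (1.0 + fraction);
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (ms[i] <= limit)
            count++;
    return count;
}
```

If runs_within reports only one run near the minimum, that minimum is not yet convincing and more runs are needed. The full list of runtimes still belongs in your table and graph.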
=========================================================
7) What is expected for the output and documentation of the project? We are almost done with improvements and are interested in starting and completing our documenting of the project.
Here is one format for your report. You may use another format so long as it tells me clearly what you did in your project.
SUMMARY of your work (1 or 2 paragraph abstract)
INTRODUCTION
Who are you? What code did you study? Generally how does the code work (I do not want lots of details)?
METHODOLOGY
Memory access counting
How did you count total reads, total writes, and shared writes for your sequential and parallel codes?
Timing
What method did you use to determine the run time for each of your runs at each of the delay values? How fast did the code run on the two machines that you used to determine the best optimization level? How many runs did you make with each version of the code to determine minimum run times? What were those run times (I want to see them to be convinced that your times are accurate)?
Comparison of run outputs
How did you determine that the results of two runs were the same? Give a specific example from your runs.
Computer and compiler optimizations used
What computer, compiler and compiler optimization level did you choose for your runs?
Modifications to the code.
Generally, what kinds of modifications did you use to make your code run efficiently in parallel? What did you have to do to get the same results in parallel as from the sequential code? How did you have to modify the code and #pragmas to get the parallel code to run efficiently before delays were added? In what places in the code was special care needed in using the right #pragmas? Did you make any other modifications to the Splash2 source code?
RESULTS
How well did your results on code read counts, write counts, and shared write counts match the results in the Splash2 paper for 32 CPUs? What were the counts for the sequential code? What, for each number of CPUs running the final version of the parallel code: 1, 2, 4, 6, 8, and 12 CPUs? How many static locations in your code did you mark as being shared writes? In general, where were they located and what information did they pass to other processors and under what synchronization conditions?
How well did your results on parallel code speedup match the results in the Splash2 paper for 1 to 32 CPUs? What were the runtimes for the sequential code? What, for each number of CPUs running the final version of the parallel code: 1, 2, 4, 6, 8, and 12 CPUs?
How did the runtimes for your parallel code change as you increased the delay time before each memory store counted as a shared write?
CONCLUSIONS
Examine your results in tables and graphs and predict the execution time of your parallel code as a function of number of CPUs running concurrently and the delay time imposed before each shared write that you identified.
Are any of the shared values written by your code read by more than one remote CPU before the value is changed again? Where are any multiply shared writes in your parallel code? If there are any, suggest how to modify your code to reflect the greater performance degradation from shared written values that are read by one CPU versus those shared written values that are read by many different CPUs.
What conclusions do you draw from your measurements about the behavior of your parallel code if it runs on CPUs that are more and more separated in physical distance from each other? Assuming that each shared write causes a delay equal to the propagation delay for a signal travelling at 2/3 light-speed (so, 200 kilometers per millisecond), is there any distance at which adding CPUs will not make your parallel code run faster than the sequential code on a single CPU?
DIARY
(Comments here will only improve your grade, not detract from it) What, if any, unnecessary problems did you have in doing this project? How did you solve the problem(s)? What improvements do you think are needed to help students do similar projects in the future?
General comment: some of the data that you put into tables may be easier for me to understand if you include a graph of the data values, in addition to the table.
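For the distance question under CONCLUSIONS, the arithmetic can be sketched as follows. This is a deliberately simple model and an assumption of this sketch, not a required method: it charges one full propagation delay, serially, for every shared write on the critical path, and the function names are hypothetical. Your own measurements with inserted delays should decide whether such a model fits.

```c
#include <assert.h>

/* Signal speed from the text: 2/3 light-speed, about 200 km per ms. */
#define KM_PER_MS 200.0

/* One-way propagation delay in milliseconds over distance_km. */
double propagation_delay_ms(double distance_km)
{
    assert(distance_km >= 0.0);
    return distance_km / KM_PER_MS;
}

/* Assumed model: predicted parallel time = measured parallel time
   with no delay + (serialized shared writes) * (delay per write). */
double predicted_ms(double t_par_ms, long shared_writes, double distance_km)
{
    assert(shared_writes >= 0);
    return t_par_ms + shared_writes * propagation_delay_ms(distance_km);
}

/* Distance at which the predicted parallel time equals the
   sequential time t_seq_ms under the model above. */
double break_even_km(double t_seq_ms, double t_par_ms, long shared_writes)
{
    assert(shared_writes > 0);
    return KM_PER_MS * (t_seq_ms - t_par_ms) / shared_writes;
}
```

With made-up numbers (1000 ms sequential, 250 ms parallel, 1500 serialized shared writes), this model gives a break-even separation of 100 km; beyond that distance the parallel code would be predicted to run slower than the sequential code.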
=========================================================