Here are key links, some or all of which I probably mentioned in our discussions:
SYCL special topic: Cornell class (9/14 and 9/21)
The presentation from September 14, 2023 is here: https://github.com/jamesreinders/syclmultiple/blob/master/20230914%20SYCL%20class%20Reinders%20Cornell.pdf
Read https://tinyurl.com/ReadmeIDC
Get started by signing up for an account - and getting to the part where you ssh to a head node.
Then do these steps to grab the program, support files, etc., and log in to try it out:
git clone https://github.com/jamesreinders/syclmultiple.git
srun --pty bash
source /opt/intel/oneapi/setvars.sh
unset ONEAPI_DEVICE_SELECTOR
cd syclmultiple
make
You should see the program compile and run - and it should look a little like this:
icpx -o edge -fsycl edge.cpp
./edge goldfish.png
Input file: goldfish.png
Output file: blurred_goldfish.png
Running on Intel(R) Data Center GPU Max 1100
UUID = 134.128.218.11.47.0.0.0.41.0.0.0.0.0.0.0
Second queue is running on Intel(R) Data Center GPU Max 1100
UUID = 134.128.218.11.47.0.0.0.58.0.0.0.0.0.0.0
inImgWidth: 512
inImgHeight: 512
channels: 3
filterWidth: 11
halo: 5
profiling: Operation completed on device1 in 6.03712e+06 nanoseconds (0.00603712 seconds)
chrono: Operationd completed on device1 in 3.19526e+08 nanoseconds (0.319526 seconds)
chrono more than profiling by 3.13489e+08 nanoseconds (0.313489 seconds)
First 800 digits of pi: 31415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679821480865132823066470938446095505822317253594081284811174502841027019385211055596446229489549303819644288109756659334461284756482337867831652712019091456485669234603486104543266482133936072602491412737245870066063155881748815209209628292540917153643678925903600113305305488204665213841469519415116094330572703657595919530921861173819326117931051185480744623799627495673518857527248912279381830119491298336733624406566430860213949463952247371907021798609437027705392171762931767523846748184676694051320005681271452635608277857713427577896091736371787214684409012249534301465495853710507922796892589235420199561121290219608640344181598136297747713099605187072113499999983729780499510597317328160963185
profiling: Operation completed on device2 in 1.34719e+08 nanoseconds (0.134719 seconds)
Note: Something is up with the profiling. It suggests that computing pi took about 22 times longer than the image processing (1.34719e+08 versus 6.03712e+06 nanoseconds). Kudos if you figure out why (I do not know - yet).
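If you want to cross-check the numbers yourself, here is a minimal C++ sketch of host-side wall-clock timing that reports nanoseconds and seconds the way the output above does. It assumes the "chrono" lines come from something like std::chrono wrapped around the whole operation; the names wall_clock_ns and ns_to_seconds are hypothetical and are not taken from edge.cpp. Host-side timing of a submit-and-wait includes host overhead as well as device execution, so it is expected to be larger than the device profiling number. This sketch does not explain the pi anomaly.

```cpp
#include <cassert>
#include <chrono>

// Convert a nanosecond count to seconds (for the "(… seconds)" part of a report).
inline double ns_to_seconds(double ns) { return ns / 1e9; }

// Run `work` once and return the elapsed host wall-clock time in nanoseconds.
// `work` is any callable; for a device operation it should block until the
// operation has finished, otherwise only the submission is timed.
template <typename Work>
double wall_clock_ns(Work&& work) {
    auto start = std::chrono::steady_clock::now();
    work();
    auto stop = std::chrono::steady_clock::now();
    assert(stop >= start);  // steady_clock is monotonic
    return std::chrono::duration<double, std::nano>(stop - start).count();
}
```

Calling wall_clock_ns with a lambda that submits the blur and waits for it would give a number comparable to the "chrono" line, which you can then set against the "profiling" line for the same device.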
As an aside - if you have not read the article (or seen the talk) titled "A New Golden Age for Computer Architecture" - please do (see the section below for links).
Your project for the next week is to modify the program to use multiple GPUs to achieve a speed-up.
The program already uses two GPUs - one to blur, and one to compute 800 digits of pi.
Your goal is to have Queue2 (which computes the 800 digits of pi) help with the blurring. Can you split the work (half an image on Queue1 and half on Queue2)?
Pay attention to performance - try to make things faster.
The system you are on has 4 GPUs - so you can extend your thinking to try to use four GPUs.
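As a starting point for thinking about the split, here is a minimal C++ sketch (plain host code, no SYCL calls) of how the image rows might be divided among two or four devices. It assumes the blur reads "halo" rows above and below each output row, as the "halo: 5" line in the output suggests, and that the image is split by rows. The names RowSlice and split_rows are hypothetical and do not come from edge.cpp.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// One device's share of the image.
// [begin, end) are the output rows the device writes.
// [read_begin, read_end) are the input rows it needs to read, i.e. its
// output rows widened by the halo and clipped to the image.
struct RowSlice {
    int begin;
    int end;
    int read_begin;
    int read_end;
};

// Split `height` rows as evenly as possible across `num_devices` devices.
// When height is not a multiple of num_devices, the first few devices get
// one extra row each.
std::vector<RowSlice> split_rows(int height, int halo, int num_devices) {
    assert(height > 0 && halo >= 0 && num_devices > 0);
    std::vector<RowSlice> slices;
    const int base = height / num_devices;
    const int extra = height % num_devices;
    int begin = 0;
    for (int d = 0; d < num_devices; ++d) {
        const int rows = base + (d < extra ? 1 : 0);
        const int end = begin + rows;
        slices.push_back({begin, end,
                          std::max(0, begin - halo),
                          std::min(height, end + halo)});
        begin = end;
    }
    return slices;
}
```

For the 512-row goldfish image with a halo of 5, split_rows(512, 5, 2) has the first device write rows 0-255 and the second write rows 256-511, while they read rows 0-260 and 251-511 respectively. The overlap in the read ranges is what each device needs so that the blur stays correct along the seam.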
The program already finds four GPUs (assuming you unset ONEAPI_DEVICE_SELECTOR). You may want to learn more about this environment variable to control your environment (how many cards, or which ones, your program sees). You can read more at https://intel.github.io/llvm-docs/EnvironmentVariables.html. When you first log in, the variable is set to limit you to one card. That's why I have you "unset" it, so that your program will see everything. The program "sycl-ls" will list what can be seen (sycl-ls, like most tools, only appears after you source setvars.sh).
Can you find useful work for all the GPU devices?
Extend the program any way you wish - even try other algorithms if you please.
Feel free to email me (james.r.reinders) at intel.com - please put "Cornell Class" as the first words in your subject. I will usually reply within a day, but probably not more often. Please be very very clear. Attach screenshots or code if you need to do so.
Please always attach a screenshot of your session if you are having trouble using the Developer Cloud account. Please run "sycl-ls" and show me what it says. If it doesn't find "sycl-ls", issue the command: source /opt/intel/oneapi/setvars.sh
If sycl-ls does not show a GPU, try the command: unset ONEAPI_DEVICE_SELECTOR
By end of day Wednesday, September 20th, email me:
- Short explanation of what you did (500 words or less)
- Output from your program (describe in 100 words or less what it means)
- Your source code (only files you changed or created)
A New Golden Age for Computer Architecture
In my opinion, THE most important paper/talk for today's age - how we got here, and where we go next.
- video - David Patterson - A New Golden Age for Computer Architecture: History, Challenges and Opportunities
- PDF of A New Golden Age for Computer Architecture by John Hennessy and David Patterson
For fun: Musings about ChatGPT
Access to Ponte Vecchios
LIMITED TIME "beta" (early access) to use Intel Data Center GPU Max cards (Ponte Vecchio GPUs) with Intel 4th Gen Xeon Processors (Sapphire Rapids CPUs).
It's a SLURM environment, with interactive Linux command line access as well as Jupyter notebooks, on systems that have 4 GPU cards per node! Lots of performance.
Learn SYCL
We have an uncorrected PROOF of the 2nd edition of our book "Data Parallel C++", which teaches SYCL programming.
It should be available online and in print by Q4 2023 - it is at the publisher now being finalized.
- UNCORRECTED 2nd edition proof: https://www.dropbox.com/s/ypsa9uvlmzz4hy5/SYCL-UncorrectedProof-DataParallelC%2B%2B-second-edition-June-2023.pdf?dl=0 Please email me if you see anything we should correct.
- First edition in PDF: https://link.springer.com/book/10.1007/978-1-4842-5574-2
Recent News Bites / Blogs
Some recent blogs with cool information related to areas that I work in:
- Velocity Bench - a cool effort by compiler engineers to evaluate hardware/software stacks to highlight tuning opportunities - positive press on it by HPCwire. Helps look at hardware/software stack results across 15 areas (more coming) using SYCL, CUDA, OpenCL, and OpenMP. Since all aim to give similar access to hardware performance, this is an invaluable tool for tuning stacks.
- Llama 2 benchmarking on Intel hardware - released in support of Meta announcement.
- Blog about APX (Advanced Performance Extensions), published concurrently with four technical documents with APX details. Another step in the evolution of the x86 instruction set - should appear in products several years from now.
oneAPI
- Latest blog from me: https://www.intel.com/content/www/us/en/developer/articles/news/oneapi-2023-2.html
- Toolkits from Intel (intel.com/oneAPI)
- Industry initiative (oneAPI.com)
Student Ambassador Program - learn, possible internships, great interactions
Learn more: https://github.com/jamesreinders/syclmultiple/blob/master/Student%20Ambassador%201-Pager.pdf
Jobs: Internships and Graduates - Jobs at Intel
- visit https://jobs.intel.com/en/internships (there is also a link for Graduates there)
- application advice:
- find positions you are interested in and apply
- visit often to look for new positions (every 2 weeks is good)
- it is okay to upload updated resumes from time to time even for the same position
- resume advice:
- make sure you have a clear objective (e.g., "I'm looking for a summer 2024 internship, willing to relocate anywhere for summer...")
- make sure the most critical information appears in the first half page - don't assume a manager will read further than that unless it is interesting by then
- don't sweat squeezing onto one page - it is nice to fit one page, but don't do tricks with margins and small fonts to do that - focus on good content
- Good luck - it's been my great fortune to work at Intel for a long time - it's a high energy place with lots of very smart people solving problems (aka "engineering") - it drives us all to do the best we can together!
If you have feedback, or questions, please drop me a note at reinders AT intel.com.