Short message: use this:
alegarra@genotoul2 ~/work $ cat test_omp.sh
#!/bin/bash
#$ -pe parallel_smp 10
export OMP_NUM_THREADS=10
your_program
alegarra@genotoul2 ~/work $ qsub test_omp.sh
where 10 is the number of threads that you want.
Long explanation:
I have a simple example that just creates a big matrix and inverts it. Let's start with a 1000 x 1000 matrix: the program runs smoothly without any explicit request for a parallel environment:
$ cat test_omp.sh
#!/bin/bash
echo "1000 1000" | /save/alegarra/progs/parallel_inverse/a.out
alegarra@genotoul2 ~/work $ qsub test_omp.sh
nanim,nsnpp
1000 1000
Z=rnd()
GG=ZZ
Dgemm MKL #threads= 20 40 Elapsed omp_get_time: 0.1530
GG=GG/(tr/n)
GG=0.95GG+0.05I
GG^-1
Inverse LAPACK MKL dpotrf/i #threads= 40 40 Elapsed omp_get_time: 0.1071
Epilog : job finished at Thu Mar 9 15:45:53 CET 2017
However, this fails if I try a larger 10000 x 10000 matrix:
OMP: Error #34: System unable to allocate necessary resources for OMP thread:
OMP: System error #11: Resource temporarily unavailable
OMP: Hint: Try decreasing the value of OMP_NUM_THREADS.
OMP: Error #178: Function pthread_getattr_np failed:
OMP: System error #12: Cannot allocate memory
forrtl: error (76): Abort trap signal
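The reason is that every thread allocates its own memory. With no OMP_NUM_THREADS set, the run above started up to 40 threads, and for a matrix of this size that adds up to more than the job can obtain.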
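To get a rough sense of the baseline memory involved, the size of one matrix can be computed by hand. This is only a sketch: it assumes the program stores double-precision values (8 bytes each, as the names Dgemm and dpotrf suggest) and it ignores workspace and any per-thread overhead. The variable names are made up for the illustration.

```shell
#!/bin/bash
# Back-of-the-envelope size of one n x n matrix of doubles.
# Assumption: 8 bytes per element (double precision).
n=10000
bytes_per_double=8
matrix_bytes=$(( n * n * bytes_per_double ))
matrix_mib=$(( matrix_bytes / 1024 / 1024 ))
echo "one ${n} x ${n} matrix: ${matrix_bytes} bytes (~${matrix_mib} MiB)"
```

The program holds more than one such matrix (at least Z and GG in the log), so the real footprint is a multiple of this figure.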
Next, I explicitly request 4 threads with a 1000 x 1000 matrix. This is done by requesting the appropriate parallel environment:
-pe parallel_smp X requests X cores on the same node (multi-thread, OpenMP), as explained in the docs.
Do not forget to put the SAME number in -pe parallel_smp and in export OMP_NUM_THREADS:
alegarra@genotoul2 ~/work $ cat test_omp.sh
#!/bin/bash
export OMP_NUM_THREADS=4
echo "1000 1000" | /save/alegarra/progs/parallel_inverse/a.out
alegarra@genotoul2 ~/work $ qsub -pe parallel_smp 4 test_omp.sh
and it does work:
alegarra@genotoul2 ~/work $ cat test_omp.sh.o8160541
nanim,nsnpp
1000 1000
Z=rnd()
GG=ZZ
Dgemm MKL #threads= 4 4 Elapsed omp_get_time: 0.0320
GG=GG/(tr/n)
GG=0.95GG+0.05I
GG^-1
Inverse LAPACK MKL dpotrf/i #threads= 4 4 Elapsed omp_get_time: 0.0230
Epilog : job finished at Thu Mar 9 16:01:30 CET 2017
Now, try with 10000 x 10000:
alegarra@genotoul2 ~/work $ cat test_omp.sh.o8160558
nanim,nsnpp
10000 10000
Z=rnd()
GG=ZZ
Dgemm MKL #threads= 4 4 Elapsed omp_get_time: 45.8333
GG=GG/(tr/n)
GG=0.95GG+0.05I
GG^-1
Inverse LAPACK MKL dpotrf/i #threads= 4 4 Elapsed omp_get_time: 29.2803
Epilog : job finished at Thu Mar 9 16:05:11 CET 2017
and it also works. I can also put the -pe within the script:
alegarra@genotoul2 ~/work $ cat test_omp.sh
#!/bin/bash
#$ -pe parallel_smp 4
export OMP_NUM_THREADS=4
echo "10000 10000" | /save/alegarra/progs/parallel_inverse/a.out
alegarra@genotoul2 ~/work $ qsub test_omp.sh
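One possible way to avoid keeping the two numbers in sync by hand is to derive OMP_NUM_THREADS from NSLOTS, the variable that Grid Engine sets to the number of slots granted to the job. This is a sketch, not something tested on this cluster; the fallback to 1 thread outside the scheduler is my own choice, and the program call is left as a commented placeholder.

```shell
#!/bin/bash
#$ -pe parallel_smp 4
# NSLOTS is set by Grid Engine to the number of slots granted by -pe.
# Outside the scheduler it is unset, so fall back to a single thread.
export OMP_NUM_THREADS="${NSLOTS:-1}"
echo "running with OMP_NUM_THREADS=${OMP_NUM_THREADS}"
# your_program   # placeholder: put the real command here
```

With this, changing the number after -pe parallel_smp is enough, and the thread count follows.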
Now, I try with 10 threads and it also works. Then I will try using yams:
A couple more tricks. To list the parallel environments available on the cluster and see their characteristics:
alegarra@genotoul2 ~/work/mtr_2015/mf/methodR $ qconf -spl
parallel_10
parallel_20
parallel_40
parallel_48
parallel_fill
parallel_fill_amd
parallel_rr
parallel_rr_amd
parallel_smp
parallel_testq
alegarra@genotoul2 ~/work/mtr_2015/mf/methodR $ qconf -sp parallel_smp
pe_name parallel_smp
slots 9999
user_lists NONE
xuser_lists NONE
start_proc_args NONE
stop_proc_args NONE
allocation_rule $pe_slots
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
accounting_summary TRUE
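The allocation_rule line is the one that matters here: $pe_slots means that all slots of a job are placed on a single host, which is what a multi-threaded program needs. As a sketch, the rule can be pulled out for every parallel environment at once. The helper name pe_field is made up for this example, and the sample text is copied from the listing above so that the snippet also runs where qconf is not installed.

```shell
#!/bin/bash
# pe_field FIELD: read "qconf -sp <pe>" output on stdin and print FIELD's value.
# (pe_field is a hypothetical helper, not part of Grid Engine.)
pe_field() {
    awk -v key="$1" '$1 == key { print $2 }'
}

# Sample lines copied from the parallel_smp listing above.
sample='pe_name parallel_smp
slots 9999
allocation_rule $pe_slots
control_slaves TRUE'

rule=$(printf '%s\n' "$sample" | pe_field allocation_rule)
echo "allocation_rule = $rule"

# On the cluster itself: one line per parallel environment.
if command -v qconf >/dev/null 2>&1; then
    for pe in $(qconf -spl); do
        printf '%s\t%s\n' "$pe" "$(qconf -sp "$pe" | pe_field allocation_rule)"
    done
fi
```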