Note: This discussion is about an older version of the COMSOL Multiphysics® software. The information provided may be out of date.


COMSOL in parallel mode: a speedup of 20% using 4 cores?


Hello everyone,

I have performed the following experiment. Using the command-line option -np #cores, I forced COMSOL Multiphysics to use 1, 2, 4 or 8 of the available cores on my computer. I also used the default configuration of COMSOL Multiphysics, i.e., without the -np option. I performed the experiment using two models comprising two drift-diffusion equations coupled with the Poisson equation: one is quasi-linear, the other is highly non-linear. The solver used was Pardiso. The results are in the attached file.

I have obtained a speed-up of about 20% when I activated 4 cores, as compared with the duration of calculations using only 1 core. Do you believe that this improvement of 20% is the typical speedup that can be obtained? Does anyone have some more data or experience on this issue?
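For anyone who wants to repeat the test, the benchmark loop can be sketched like this (the model filename is a placeholder, and the flag names follow the COMSOL batch syntax of my version -- check yours; the actual solver line is commented out so the sketch only prints what it would run):

```shell
# Dry-run sketch of the benchmark: the same study timed at several core
# counts. "model.mph" is a placeholder filename; uncomment the timed
# line to do the real solves (assumes "comsol" is on PATH).
for n in 1 2 4 8; do
  echo "comsol batch -np $n -inputfile model.mph -outputfile out_np$n.mph"
  # /usr/bin/time -p comsol batch -np $n -inputfile model.mph -outputfile out_np$n.mph
done
```

Comparing the wall-clock times of the four runs then gives the speedup curve directly.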

Thank you for your help.

Kind regards,
Pedro


42 Replies Last Post 2014/09/02 5:10 GMT-4
Ivar KJELBERG COMSOL Multiphysics(r) fan, retired, former "Senior Expert" at CSEM SA (CH)


Posted: 1 decade ago 2009/09/23 7:18 GMT-4
Hi

The improvement with multicore is very model dependent; we see a 2-6x improvement on an 8-core CPU for models that fit fully into RAM and that can be multithreaded for "long" periods across all cores (Linux-based OS).

Do not forget that 8 cores with 4 GB of RAM each require 32 GB of total RAM, plus something for the OS too. If you do not have enough RAM you will only see the disk access time for swapping, and this is rather slow, especially on MS-based OSes.

In Unix/Linux use the "top" command and look at your threads (-H) to see if you really are using what you think; on MS OSes you can use the "Process Explorer" tool from www.sysinternals.com

CPU is not all, you need, as usual, a globally optimised system ;)
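A concrete version of that check on Linux (the process name "comsol" is an assumption; the snippet falls back to the current shell's PID so it runs even without a solver going):

```shell
# List the threads of the solver process with the CPU each one sits on
# (PSR) and its CPU usage. "comsol" is the assumed process name; fall
# back to this shell's PID so the snippet works standalone.
pid=$(pgrep -n comsol || echo $$)
ps -Lo pid,lwp,psr,pcpu,comm -p "$pid"
```

If only one thread shows a high %CPU while the rest idle, the solver is not actually using the cores you gave it.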

Good luck
Ivar

Posted: 1 decade ago 2009/09/23 7:26 GMT-4
Hi,

COMSOL doesn't use swap/virtual memory (disk memory); only the RAM is used.
Under Linux, COMSOL takes as much memory as it needs, which means that the other running processes (for the OS) have to switch to swap. This is also the case under MS Windows.

Ivar KJELBERG COMSOL Multiphysics(r) fan, retired, former "Senior Expert" at CSEM SA (CH)


Posted: 1 decade ago 2009/09/23 9:41 GMT-4
Hi Faycal

I do not 100% agree with you there. COMSOL might not use the OS swap (I haven't checked that), but it definitely accesses the local disk heavily; if you happen to have your -tmpdir pointing at a network disk, you will see the effect on your network and on the processing time.

Basic rule for any heavy FEM tool (not only COMSOL): run it from a local account on a good, fast workstation in stand-alone mode.

On the other hand, I was able to do 80% of my work on a powerful laptop. Now we have switched; the workstation is much faster for large jobs, the same for the others, but I'm tied to my workstation ;)

Ivar

Posted: 1 decade ago 2009/09/23 11:11 GMT-4
Thank you Ivar and Faycal for your input.

The model is running on a workstation with two quad-core Intel Xeon processors and 8 GB of RAM. It is a shared-memory environment. The model is not very large, only about 40000 DOFs, so virtual memory is not used and is not the cause of the small speedup...

Maybe a good option is to select a model, for example this one: comsol.com/community/exchange/62/, and test it on my computer with 1, 2, 4 or 8 active cores. I will do this and then post the results here.

Kind regards,
Pedro

Ivar KJELBERG COMSOL Multiphysics(r) fan, retired, former "Senior Expert" at CSEM SA (CH)


Posted: 1 decade ago 2009/09/23 17:38 GMT-4
Hi

I tried your model; it runs in less than 1.3 sec on one CPU, so I do not even have time to see the RAM load.

Here are some quick tests on other files from the model library ("model reset" before each run):

Linux 64-bit, 24 GB RAM, 8-core Intel CPU (with 15 other sleeping users; 15 GB RAM used by others, but 100% idle tonight)

File cylinder_flow.mph (model reset before) Comsol V3.5a
CPU 1 651.718sec Normal mesh 16764 dofs < 1.68GbRAM 102% (7 sec time simulation, 872 iteration steps)
CPU 2 425.439sec Normal mesh 16764 dofs < 1.78GbRAM 201% (7 sec time simulation, 872 iteration steps)
CPU 4 401.757sec Normal mesh 16764 dofs < 1.86GbRAM 401% (7 sec time simulation, 872 iteration steps)
CPU 8 400.527sec Normal mesh 16764 dofs < 2.15GbRAM 798% (7 sec time simulation, 872 iteration steps)

But 872 iterations across 3 solvers in 400 sec is about 2 iterations per second;
most likely memory access speed is limiting, and little RAM is used anyway.

maragoni.mph, reset model before, using extremely fine mesh
cpu 8 28.893sec extremely fine mesh 85054 dofs 14 iteration steps 2.4GbRAM 300-750%
cpu 4 32.315sec extremely fine mesh 85054 dofs 14 iteration steps 2.1GbRAM 200-350%
cpu 2 41.644sec extremely fine mesh 85054 dofs 14 iteration steps 1.9GbRAM 130-170%
cpu 1 61.635sec extremely fine mesh 85054 dofs 14 iteration steps 2.1GbRAM 101%
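For reference, the maragoni.mph timings above work out to the following speedups and parallel efficiencies relative to the 1-core run (a quick awk check):

```shell
# Speedup s = t(1)/t(n) and efficiency s/n for the maragoni.mph runs
# above, relative to the 1-core time of 61.635 s.
printf '%s\n' "1 61.635" "2 41.644" "4 32.315" "8 28.893" |
awk '{ s = 61.635/$2
       printf "cores=%d  speedup=%.2f  efficiency=%.0f%%\n", $1, s, 100*s/$1 }'
# cores=1  speedup=1.00  efficiency=100%
# cores=2  speedup=1.48  efficiency=74%
# cores=4  speedup=1.91  efficiency=48%
# cores=8  speedup=2.13  efficiency=27%
```

So even in the best of these runs, each of the 8 cores is doing barely a quarter of the useful work a lone core would.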

Obviously the results are rather solver dependent, and these tests hardly use any RAM.

Anyhow, a good model, well optimised, should run for at most a few seconds on a laptop, don't you think?

;) Good night
Ivar

Posted: 1 decade ago 2011/01/14 10:48 GMT-5
Hi all,

I am trying to do a 3D computation with nearly 1.5 million DOF. I (probably) use a lot of memory for defining geometries and domains, as mine is an externally generated mesh.

First I had a problem with the Java heap size, which I increased from 1 GB to 2 GB in the configuration file. I still get some GC memory errors.

COMSOL shows 8 GB RAM (all that I have) and some 14 GB of virtual memory. I have 8 GB for swap, so right now COMSOL is using less than RAM+swap.

Now the question is: will I be able to solve larger systems if I increase the swap size?
I understand it will be very slow and that the best solution is to get more RAM, but before buying I want to see if I can push my domain size as a test case for demonstration.
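For reference, the heap setting lives in the launcher's .ini configuration file; here is a quick way to check what is currently set (the install path below is a guess for my system, so adjust it to yours -- -Xmx is the standard JVM maximum-heap flag):

```shell
# Show the Java heap maximum configured for the COMSOL launcher.
# COMSOL_DIR is an assumed install location -- adjust to your system.
COMSOL_DIR=/usr/local/comsol41
grep -rh -- '-Xmx' "$COMSOL_DIR"/bin/*.ini 2>/dev/null \
  || echo '-Xmx2048m   # the value I set; the default was 1 GB'
```

Note that this heap only covers the Java side (geometry, mesh handling, GUI); the solver's matrix memory is separate and is what actually needs the RAM+swap headroom.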

Regards,
Kodanda

Ivar KJELBERG COMSOL Multiphysics(r) fan, retired, former "Senior Expert" at CSEM SA (CH)


Posted: 1 decade ago 2011/01/14 15:45 GMT-5
Hi

I cannot really tell, as I'm getting lazy: when I see my WS start to swap, I stop the solver and re-study my model to make it lighter.

--
Good luck
Ivar

Posted: 1 decade ago 2011/01/15 9:42 GMT-5
This is quite an interesting topic for me now.

I have an FSI model and did a steady-state run on a 48-core machine with ~300 GB RAM, with 500k DOF.

The results are crazy. It seems that if I want to do 50 simulations, the best option is to run 50 single-core jobs! I have tuned the Pardiso solver for the maximum possible speedup.

Attached are my results.

a.k.a. a real disappointment.


--
Comsol 4.1
Ubuntu 10.04.1


Andrew Prudil Nuclear Materials


Posted: 1 decade ago 2011/01/15 15:23 GMT-5
Danial,
I don't think that should be too much of a surprise to anyone. Here is why:

When you write parallel code, some sections can be done in parallel and others must be done in serial. Because of this, the best a parallel program could ever hope for is to keep every core busy all the time. When you run real code, some cores finish their part early and have to wait for the others to catch up before the serial parts, and you lose that CPU time. If you run 50 instances of a single-core program, each instance can use its one core and never has to wait for the other cores. It is disappointing, I know, but effective CPU output for parallel tasks always scales at a power less than 1.
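In numbers, this is Amdahl's law: with a parallel fraction p of the work, the speedup on n cores is capped at 1/((1-p) + p/n). A quick illustration (p = 0.85 here is an arbitrary example value, not a measured one):

```shell
# Amdahl's law: S(n) = 1 / ((1-p) + p/n), with p = 0.85 chosen purely
# as an illustrative parallel fraction.
for n in 1 2 4 8 48; do
  awk -v p=0.85 -v n="$n" \
    'BEGIN { printf "n=%2d  max speedup=%.2f\n", n, 1/((1-p) + p/n) }'
done
# n= 1  max speedup=1.00
# n= 2  max speedup=1.74
# n= 4  max speedup=2.76
# n= 8  max speedup=3.90
# n=48  max speedup=5.96
```

Even with 85% of the work parallel, 48 cores can never beat ~6x, and the curve is nearly flat past 8 cores, which is exactly the kind of rollover you observed.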

Jim Freels mechanical side of nuclear engineering, multiphysics analysis, COMSOL specialist


Posted: 1 decade ago 2011/01/16 12:54 GMT-5
First, if you want distributed parallel processing (DPP), then you must use the MUMPS direct solver. No other solver works in DPP. That is stated in the COMSOL documentation.

Second, if you have a single node, multiple cores can be used in shared-memory parallel processing with several solvers and also in other parts of the solution process such as meshing and finite-element assembly. There is essentially no constraint on the shared-memory mode of parallelism in COMSOL.

See my paper on the conference CD for some benchmarking of COMSOL v4-beta2 using DPP. I am still benchmarking with v4.1, but I can say it is at least as good as v4-beta2. Still limited to MUMPS, but it does work.

Posted: 1 decade ago 2011/02/08 16:20 GMT-5
Hi,
I am new to working with COMSOL. I have a problem running COMSOL in parallel. Would you please let me know, step by step, what I should do to use all the cores?

Best regards,
Giti

Jim Freels mechanical side of nuclear engineering, multiphysics analysis, COMSOL specialist


Posted: 1 decade ago 2011/02/08 16:23 GMT-5
What is your problem running in parallel ?

Do you have a shared-memory system ?

Do you have a distributed-parallel processing system ?

We must know these before we can get started with any help.

Posted: 1 decade ago 2011/02/08 16:24 GMT-5

First, if you want distributed parallel processing (DPP), then you must use the MUMPS direct solver. No other solver works in DPP. That is stated in the COMSOL documentation.

Second, if you have a single node, multiple cores can be used in shared-memory parallel processing with several solvers and also in other parts of the solution process such as meshing and finite-element assembly. There is essentially no constraint on the shared-memory mode of parallelism in COMSOL.

See my paper on the conference CD for some benchmarking of COMSOL v4-beta2 using DPP. I am still benchmarking with v4.1, but I can say it is at least as good as v4-beta2. Still limited to MUMPS, but it does work.


What I noticed is that above 100k DOF, Pardiso is usually faster than MUMPS for any number of cores. So I cannot see a case where PARDISO scales poorly while MUMPS scales better.

So again, I don't think COMSOL is scaling well. It is faster than a single core, but for a 20%-30% gain on many cores I don't see the benefit when I have many simulations to run. COMSOL is strong on preprocessing but weak on the solver side, contrary to well-established software. Please correct me if I'm wrong.

This is the subject of hot discussion among our colleagues. We run on a single shared-memory machine with many cores (48).

Edit: can we please stay on the same topic, and Giti can you please open a new thread.

--
Comsol 4.1
Ubuntu 10.04.1

Jim Freels mechanical side of nuclear engineering, multiphysics analysis, COMSOL specialist


Posted: 1 decade ago 2011/02/08 16:44 GMT-5
Danial, in your case, where you only have a single node but several cores using the same shared memory, PARDISO will be best. It will run in parallel on all your cores. We have a similar machine and see much better speedup than yours on a problem large enough to fill the memory but not force any disk swapping.

The latest result I have for distributed parallel processing on a multi-node cluster, combining shared memory per node with distributed processing across nodes, is a speedup of ~7.5 across 12 nodes and 96 cores. If we had more cores in our cluster, the speedup would be better because it has not rolled over yet. See the attached plot. This cluster uses InfiniBand to communicate between nodes, which helps tremendously. Also, across the nodes, the memory requirements per node are reduced significantly.

I see your scaling plot now. I am a little confused. You say you have a single-node shared memory machine. However, you also say you have 48 nodes available with a np=1. Perhaps we should be clear on terminology. We do agree that a node is a separate computer. So, if you have 48 nodes, that means you have a cluster with 48 separate computers in it. This is configured in COMSOL using the nn switch. Then, within each node, you have a number of cores on any number of processors all sharing memory. This is configured in COMSOL using the np switch. Is this also your understanding ?

If you are running all those cores on a single node, perhaps your hardware communication bus has saturated. What is the bus speed ?

As you can see from the plot I attached, I get nearly 400% speedup (compared to your 20-30% speedup) on only 8 cores and single node with np=8.

You must be doing something different.
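To make the nn/np distinction concrete, here is the shape of a cluster launch (a dry-run sketch only: the filenames are placeholders, and the exact flag spellings vary between COMSOL versions, so check your documentation; remove the echo to actually launch):

```shell
# Dry-run sketch of a distributed COMSOL launch:
#   -nn = number of nodes (separate computers),
#   -np = cores per node (shared memory within each node).
# hostfile.txt and model.mph are placeholder filenames.
nn=12; np=8
echo comsol batch -nn $nn -np $np -f hostfile.txt \
     -inputfile model.mph -outputfile out.mph
echo "total cores requested: $((nn * np))"
```

With -nn 1 you get pure shared-memory parallelism (PARDISO is fine); with -nn > 1 the distributed part must go through MUMPS, as noted above.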

Posted: 1 decade ago 2011/02/08 16:56 GMT-5

Danial, in your case where you only have a single node, but several cores using the same shared memory, then PARDISO will be best. It will run in parallel on all your cores. We have a similar machine and see much better than your speedup on a problem large enough to fill the memory, but not use any disk swapping.


I have an FSI problem similar to the MEMS FSI problem in COMSOL v3.5a. So it is possible that my K matrix is very sparse? I am not sure.


The latest result I have for distributed parallel processing on a multi-node cluster using both shared memory (per node) and distributed parallel processing we see a speedup of ~ 7.5 across 12 nodes and 96 cores. If we had more cores in our cluster, the speedup would be better because it has not rolled over yet. See attached plot. This cluster uses infiniband to communicate between nodes which helps tremendously. Also, across the nodes, the memory requirements are reduced significantly.

That sounds like an amazing speedup. However, not in my case. I don't have experience with clusters (DPP), since we could not get COMSOL to work like that (long licensing/Linux issues).



Edit after your comment edit:
-------
Yes, 48 cores, on a single node.

I'll attach the hardware specs of both machines in a few minutes.



Many thanks,
Danial


--
Comsol 4.1
Ubuntu 10.04.1

Posted: 1 decade ago 2011/02/08 17:15 GMT-5
Here is the solver configuration (steady-state) with a sample log file.

Here I am using a segregated solver to add another PDE to solve (variable c), but in my tests above I used a fully coupled solver (without the second PDE) with single FSI physics.

--
Comsol 4.1
Ubuntu 10.04.1


Jim Freels mechanical side of nuclear engineering, multiphysics analysis, COMSOL specialist


Posted: 1 decade ago 2011/02/08 17:24 GMT-5
Danial, I cannot tell from the information you sent what the computer bus speed is. How many processors do you have holding the 48 cores? Do you have 48 physical cores? On our Linux box here, we have 6 cores on each of two processors for a total of 12 physical cores. However, the Linux operating system takes advantage of the "hyperthreading" feature of Intel processors, which makes it appear to the operating system as 24 cores (it doubles the core count reported to the operating system, even though there are only 12 physical cores). We have found that hyperthreading actually only gives us about 30% of an extra physical core. So the most we could theoretically gain with this system is a speedup of 12*1.3, or 15.6, not 24. It is better than nothing, but hyperthreading can be misleading if you don't know what you have in the box. Is your OS using hyperthreading? Is the BIOS on the box set to enable hyperthreading?

If you have 8 physical cores on a processor, that would be 6 processors in your box without hyperthreading.

If you have 6 cores on a processor, with 4 processors in the box and hyperthreading enabled, that would be 48 and that might explain at least part of your problem. This would be my guess. I have not seen a 48-physical-core single node before.
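On Linux the physical-vs-logical count is easy to check with lscpu (the exact output format can vary slightly between distributions):

```shell
# Physical cores = Socket(s) x Core(s) per socket; if "Thread(s) per
# core" is 2, hyperthreading is on and the "CPU(s)" total counts each
# physical core twice.
lscpu | grep -E '^(CPU\(s\)|Thread|Core|Socket)'
```

If "CPU(s)" is twice Sockets x Cores-per-socket, the OS is reporting hyperthreaded logical CPUs, not physical cores.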

Posted: 1 decade ago 2011/02/08 17:34 GMT-5

Danial, I cannot tell from the information you sent what the computer bus speed is. How many processors do you have that hold the 48 cores ? Do you have 48 physical cores ? On our linux box here, we have 6 cores on each of two processors for a total of 12 physical cores. However, the linux operating system takes advantage of the "hyperthreading" feature of Intel processors which appears to the operating system to look like 24 cores (doubles the number of cores to the operating system, even though there are only 12 physical cores). We have found that hyperthreading actually only give us about 30% of a physical core. So, the most we could theoretically gain with this system is a speedup of (12*1.3) or 15.6 not 24. It is better than nothing, but hyperthreading can be misleading if you don't know what you have in the box. Is your OS using hyperthreading ? Is the bios on the box set to enable hyperthreading ?

If you have 8 physical cores on a processor, that would be 6 processors in your box without hyperthreading.

If you have 6 cores on a processor, with 4 processors in the box and hyperthreading enabled, that would be 48 and that might explain at least part of your problem. This would be my guess. I have not seen a 48-physical-core single node before.


CPU:
AMD Opteron 6174 12-core Magny-Cours 2.2GHz Processor: goo.gl/rMk6I

Bus speed I have to ask the admin tomorrow.




--
Comsol 4.1
Ubuntu 10.04.1

Jim Freels mechanical side of nuclear engineering, multiphysics analysis, COMSOL specialist

Please login with a confirmed email address before reporting spam

Posted: 1 decade ago 2011/02/08 17:36 GMT-5
Danial, another explanation could be your problem size and type. Is your problem 3D? It is running very fast at only 227 cpu sec for a single iteration. How much memory is your problem using for > 2 Mdof? On the problem I just showed you, we have 1 Mdof; it takes about 20 GB of memory and 47286.764 cpu sec to run 21 segregated-step iterations on a single node using the MUMPS direct solver. If you run on multiple nodes, the memory requirement per node drops significantly (about 8 GB for the 12-node case).

Your problem may not be running long enough in the solver to see any speedup. It should be spending most of the time in PARDISO to see a speedup (or MUMPS if running a cluster).

Also, the other codes you mention are probably using explicit methods to solve. Those take many more iterations to converge. COMSOL uses an implicit method, which needs far fewer iterations, but at the price of more memory.

Jim Freels mechanical side of nuclear engineering, multiphysics analysis, COMSOL specialist

Please login with a confirmed email address before reporting spam

Posted: 1 decade ago 2011/02/08 17:45 GMT-5
OK. It is not the Intel hyperthreading issue. Our cluster uses 4-core and 8-core processors at two processors per node. We could have obtained the 12-core processors for extra money, but opted for more nodes instead (they were expensive). So, you are running a single node with 4 processors and 12 cores per processor, yielding 48 cores on the motherboard. I think you are OK here.

My best guess is the problem is too small to see the speed up. Try to get it to spend more time in the solver by filling up all the memory.

Please login with a confirmed email address before reporting spam

Posted: 1 decade ago 2011/02/08 17:51 GMT-5

My best guess is the problem is too small to see the speed up. Try to get it to spend more time in the solver by filling up all the memory.


I think you are quite right on this one. PARDISO only runs briefly on a 500k-DOF problem. Can you elaborate on how I can fill up the memory?

I see that Transient solver has "store solutions out of core", which I will disable. There is no such option in Steady solver, however.

Danial


--
Comsol 4.1
Ubuntu 10.04.1

Please login with a confirmed email address before reporting spam

Posted: 1 decade ago 2011/02/09 9:30 GMT-5
Hi James and Danial,

I am also working on the same 48-cores Linux server where Danial did some of the scaling experiments, so I can confirm what he mentioned and also reinforce this with other observations.

I have a rather large 3-D problem (coupled flow with mass transport) that can fill all the memory (~256 GB), and still the parallel performance is bad.

Just to illustrate what I mean, I ran a quick (smaller) test a few minutes ago (~8 GB and -np 8).
The problem is that Pardiso works only part of the time in parallel:
-> Solving the linear system starts with the assembly (~1 min), done fast enough,
-> Then in the matrix factorization step the solver spends ~10 min doing something on 1 core. This was checked with 'htop'. Pardiso evolution counter stays all the time at 0 % in this first phase.
-> Finally, you have the parallelized part working on 8 cores, done in ~3 min. Only in this stage does the counter advance from 0% to 100%. This second phase of the solution process does gain speed with more cores, but the first phase is limiting. You can imagine that for the "real" problem this is even more troublesome.

Comsol recommends iterative solvers for large 3-D problems.
However, no iterative solver could work on my problem - at all. Nor could COMSOL support tell me why that is.

So, if we want to speed-up the parallel solution with direct solvers, we should "get rid" of the first phase of Pardiso. MUMPS does exactly the same. Could it be the preordering?

Cristian

Please login with a confirmed email address before reporting spam

Posted: 1 decade ago 2011/02/09 10:09 GMT-5
Yes, I can confirm that the row preordering was the limiting step (because it is not done in parallel)!
Switching it off dramatically improved the solution time, without compromising the quality of the solution (for my problem, at least).
I will try to show you later also the scaling tests.
Cristian

Jim Freels mechanical side of nuclear engineering, multiphysics analysis, COMSOL specialist

Please login with a confirmed email address before reporting spam

Posted: 1 decade ago 2011/02/09 10:36 GMT-5
When you run htop or top under linux, and then monitor comsolauncher while it is running, you should see the %cpu increase up to np*100% on each node running. So, in my case, on each node running with 8 cores, then I see significant periods of time while pardiso is running and the cpu is showing at 800%. In your case, since you have 48 cores, it should be 4800%. You said it was showing 100% which tells me that you are not running in parallel.

You should also be able to toggle the cpu listings by typing the "1" key (numeric 1) to switch between the composite cpu view and each individual cpu. You should see all 48 cores listed individually. If you do not see this, it could be that your linux kernel is not set correctly. I have seen instances where a kernel is not a parallel-enabled kernel by default. There are three separate memory-size settings when the kernel is configured for compilation. You always want to use the >4G setting for the memory.

Also, the iterative solver is tricky to set up, but it does work once set up correctly. It does not run in distributed parallel mode now. I have found that manual meshing works best if you do not have your entire model using "free" meshing. The automatic multigrid remeshing only seems to work if you have the entire model in free mesh.

When you start up comsol, do you include "-np 48" on the command line ?
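If the top/htop percentages are ambiguous, on Linux you can also read the thread count of the solver process straight from /proc. A minimal sketch, assuming a Linux /proc filesystem (the demo inspects its own PID; in practice you would substitute the comsolauncher PID shown by top):

```python
import os

def thread_count(pid):
    """Return the number of threads of a process by reading
    /proc/<pid>/status (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("Threads:"):
                return int(line.split()[1])
    raise ValueError(f"no Threads: field for pid {pid}")

# Demonstration on this Python process itself; a solver launched
# with -np 8 should show (at least) 8 worker threads here.
print(thread_count(os.getpid()))
```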

Please login with a confirmed email address before reporting spam

Posted: 1 decade ago 2011/02/09 12:57 GMT-5
No James, somehow you misunderstood my previous message.

1. The simulations for the smaller test case were run in parallel with Comsol started with -np 8 (see previous post).

2. In htop I saw CPU ~100% for longer periods in the preordering phase of the Jacobian matrix, then CPU ~800% in the second phase. So, one phase is not parallelized, the other one is. The 0-100% refers to the PARDISO counter in Comsol, not to the htop CPU counter (please read previous post).

I am now running a larger problem (3 million DOFs, with 80 GB needed by PARDISO), and the speed-up from running on multiple cores is indeed very significant when you uncheck the row preordering box. That was the time-limiting calculation.

Jim Freels mechanical side of nuclear engineering, multiphysics analysis, COMSOL specialist

Please login with a confirmed email address before reporting spam

Posted: 1 decade ago 2011/02/09 13:19 GMT-5
OK. That makes sense. Yes, I recall the PARDISO completion notice (from memory): 0%, 24%, then 100%, but nothing in between. I was not aware of the slowdown when "row preordering" is checked. I will have to remember that one.

Please login with a confirmed email address before reporting spam

Posted: 1 decade ago 2011/02/11 5:34 GMT-5
So, if I may ask it in yes-or-no terms, can or should we disable row preordering to utilize COMSOL's parallelism to the maximum? If yes, shouldn't the same apply to the MUMPS solver as well?

Jim Freels mechanical side of nuclear engineering, multiphysics analysis, COMSOL specialist

Please login with a confirmed email address before reporting spam

Posted: 1 decade ago 2011/02/11 12:03 GMT-5
I have not tested with/without row preordering. My recommendation would be to try it and see. It may be problem dependent.

Please login with a confirmed email address before reporting spam

Posted: 1 decade ago 2011/02/12 6:29 GMT-5
The speed-up is problem-dependent indeed, but in my case there was much gain with PARDISO when the preordering was disabled. I attach here my results for scale-up experiments done on a 48-core server (2.2 GHz CPUs).

>>> Problem:
- Strongly coupled stationary laminar flow with solute mass transport. That is, the flow and mass transport equations must be solved simultaneously because the concentration affects liquid density, viscosity and (more importantly) the boundary conditions for flow (through osmotic pressure).
- Complex 3-d geometry with inlet, outlet, walls and periodic boundaries.

>>> Solution method:
- The system of equations is non-linear and, with the given initial values, 4 Newton iterations are needed for each solution.
The total solution time is reported in the attached file (roughly, divide the time by 4 to see how long one linear solution took).
- I tested on a mesh that leads to 3,057,962 DOFs. In these conditions PARDISO used at most 80 to 90 GB of RAM.
- The tests are with nested dissection multithreaded preordering and without any preordering. However, I did not notice any improvement from the "multithreaded" version. I use Comsol 4.1 update 2.

>>> Results:
- There is a clear improvement in speed when not using the preordering.
- All solutions returned are correct when 1, 4, 8, 16 and 32 cores are used.
However, interestingly, on 48 cores PARDISO did not correctly solve the system (neither with nor without preordering)!
- I attach results (scaling.pdf) for the total time (seconds), speed-up (time on 1 core / time on np cores) and parallel efficiency.
- For many cores, the preordering clearly becomes speed-limiting and slows down the calculations 3-4 times!

Cristian
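For anyone tabulating results like these, the two derived columns are simple ratios of wall times; a minimal sketch (the timings below are made-up illustrations, not the attached data):

```python
def scaling_table(times):
    """times: dict mapping core count -> wall time (s).
    Returns {np: (speedup, efficiency)}, with
    speedup = t(1 core) / t(np cores) and efficiency = speedup / np."""
    t1 = times[1]
    return {n: (t1 / t, (t1 / t) / n) for n, t in sorted(times.items())}

# Illustrative numbers only:
times = {1: 4000.0, 4: 1250.0, 8: 800.0, 16: 640.0}
for n, (s, e) in scaling_table(times).items():
    print(f"np={n:2d}  speedup={s:5.2f}  efficiency={e:4.2f}")
```

An efficiency well below 1 at high core counts is exactly the plateau discussed in this thread: some sequential phase (assembly, preordering) or memory bandwidth is limiting.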


Jim Freels mechanical side of nuclear engineering, multiphysics analysis, COMSOL specialist

Please login with a confirmed email address before reporting spam

Posted: 1 decade ago 2011/02/12 10:25 GMT-5
Cristian,

Thank you for posting your results. Your results are similar to what I have observed here and are what I might have expected. It appears also that even the shared-memory parallel processing will saturate the bandwidth enough to plateau after enough cores are activated. It appears that you have effectively used the hardware you have to the extent possible.

One suggestion to save some additional clock time. You might expand your solver scheme to use the segregated solver, while still using the PARDISO direct solver within each segregated step. You could first separate into two steps: 1) the u-p laminar flow (u,v,w,p) and 2) the remaining variables. This will likely take more iterations, but still save overall cpu/clock time. You may also be able to break down the 2nd step further in some cases.

The other comment is that if COMSOL users have access to a cluster, the memory requirement per node also reduces significantly, in addition to the speedup of the solution. I have had at least one case where a problem would not fit on a single node but could be run on the multi-node cluster. I expect a lot more of that in the future here.

Please login with a confirmed email address before reporting spam

Posted: 1 decade ago 2011/02/12 10:38 GMT-5

One suggestion to save some additional clock time. You might expand your solver scheme to use the segregated solver, while still using the PARDISO direct solver within each segregated step. You could first separate into two steps: 1) the u-p laminar flow (u,v,w,p) and 2) the remaining variables. This will likely take more iterations, but still save overall cpu/clock time. You may also be able to break down the 2nd step further in some cases.


James, using a segregated solver seemed possible to me initially. However, the problem is too "strongly" coupled (velocity u affects concentration c, and c affects u as well). I tried a segregated solution and it did not work at all. It is better suited to weakly coupled equations (only u=>c, for example).
Anyway, thanks for the suggestion.

I am still trying to make the iterative solvers work, but this is not a problem for this queue...

Please login with a confirmed email address before reporting spam

Posted: 1 decade ago 2011/04/07 8:41 GMT-4
Anyhow a good model, well optimised, should run for at most some seconds on a laptop, dont you think so ?


How serious are you? We are solving 15 coupled diffusion-reaction equations, 180,000 DOFs, and in order to let it run completely through we need to limit the timestep ... It takes something like 15 hours, and the speedup of 2.5 on 16 cores is pretty bad -- what do you think we could optimise? Is there a way to closely monitor what resources are needed, in order to identify bottlenecks?

Thanks & best,
Philipp

Please login with a confirmed email address before reporting spam

Posted: 1 decade ago 2012/01/11 8:42 GMT-5
Hello,

This is a very interesting thread. However, I still do not understand whether hyperthreading can slow down COMSOL or not (I know, you say that COMSOL does not currently benefit from hyperthreading).

Let me explain. My computer has 4 physical cores (8 if I enable hyperthreading) and 16 GiB RAM, running Windows 7 x64. The default COMSOL behaviour is to use only 4 threads to solve my models (4.4 million DOF, magnetostatic, FGMRES solver). When solving, Windows distributes the 4-thread load over 8 hyperthreaded cores, achieving a 50% CPU load. However, 4 of these logical processors are not physical and share a physical core two by two. Therefore, if the OS does not correctly distribute the 4 threads between hyperthreaded processors that belong to different physical cores, the worst case is that only two physical cores end up processing the 4 threads (on 4 logical hyperthreaded processors), slowing down the process. If hyperthreading is disabled, the OS can only distribute the 4 threads between 4 cores, all of them physical, thereby optimizing the processing power. Do you think I am right? Is it possible to tell Windows (or linux) to use only 4 logical processors that belong to different physical cores? I know that you can set the CPU affinity in Windows, but you cannot be sure which numbered cores (from 0 to 7) belong to a specific physical core.

Thanks.
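On Linux, at least, the last question can be answered directly: /proc/cpuinfo lists each logical processor's "physical id" and "core id", so you can pick one logical CPU per physical core and pin the process to exactly those with os.sched_setaffinity. A sketch, assuming the usual Linux /proc/cpuinfo layout (on Windows you would need an affinity mask instead, e.g. via start /affinity):

```python
import os

def one_logical_per_core(cpuinfo_text):
    """Pick one logical CPU per (physical id, core id) pair from the
    text of /proc/cpuinfo (standard Linux field names assumed)."""
    chosen, seen = [], set()
    cpu = phys = core = None
    for line in cpuinfo_text.splitlines() + [""]:
        if ":" in line:
            key, _, val = line.partition(":")
            key, val = key.strip(), val.strip()
            if key == "processor":
                cpu = int(val)
            elif key == "physical id":
                phys = val
            elif key == "core id":
                core = val
        elif cpu is not None:          # blank line ends one CPU block
            if (phys, core) not in seen:
                seen.add((phys, core))
                chosen.append(cpu)
            cpu = phys = core = None
    return sorted(chosen)

# Pin the current process to one logical CPU per physical core
# (Linux only; run inside the process that launches COMSOL):
# with open("/proc/cpuinfo") as f:
#     os.sched_setaffinity(0, one_logical_per_core(f.read()))
```

With 4 physical cores and hyperthreading on, this selects 4 of the 8 logical CPUs, one per physical core, which is exactly the "4 threads on 4 different physical cores" layout described above.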

Please login with a confirmed email address before reporting spam

Posted: 1 decade ago 2012/02/23 7:53 GMT-5
Hi Iker,

as stated in the COMSOL Support Knowledge Base: "COMSOL does not benefit from hyperthreading"

Ref: www.comsol.fr/support/knowledgebase/1096/

Fortunately, only the node count or CPU count displayed by monitoring software is affected by hyperthreading, not the physical CPU usage.

Stephan

Ivar KJELBERG COMSOL Multiphysics(r) fan, retired, former "Senior Expert" at CSEM SA (CH)

Please login with a confirmed email address before reporting spam

Posted: 1 decade ago 2012/02/23 15:18 GMT-5
Hi

I would add that not all operations done during an iterative solving process can be distributed; therefore one sees the number of active CPUs and processors change from 1 to n and back regularly.

Furthermore, whether running on n or 2*n processors, these still need to exchange data with the RAM, and depending on when and how this is done, only a few can access RAM over the available bus lines. Here again one has multiplexing and possible blocking of one cpu w.r.t. the others, leaving the overall active CPU count at far less than 100%.

--
Good luck
Ivar

Please login with a confirmed email address before reporting spam

Posted: 1 decade ago 2012/02/25 5:41 GMT-5
Hi Stephan,

Although COMSOL doesn't support hyper-threading, I did notice the same thing as Iker. By default, I have hyperthreading ON, on a similar system to Iker's: a Core i7 860 quad-core processor w/ HT. If I follow the KB article and look for NUMBER_OF_PROCESSORS on my system, it is listed as 8. What I noticed was:

1. When I run COMSOL (without choosing np), my CPU load is about 50%, with about 6 cores being utilized.
2. When I run "COMSOL.exe -np 8" and run the same problem, my CPU utilization is 90-98%.

I haven't benchmarked the performance in terms of computation time, but will I get any improvement from using -np 8? If not, does it make more sense for me to turn off hyperthreading? As a note, I'm running an FSI problem that uses 2 segregated steps (MUMPS & PARDISO).

Jim Freels mechanical side of nuclear engineering, multiphysics analysis, COMSOL specialist

Please login with a confirmed email address before reporting spam

Posted: 1 decade ago 2012/02/25 10:42 GMT-5
I think hyperthreading will improve the performance of COMSOL even though it is "not supported". To COMSOL, or to any process running on a Linux system for that matter, the OS handles the hyperthreading so that an n-core system appears to the user like a 2n-core system. I have found that each hyperthread core is equivalent to about 1/3 of a real core in terms of cpu speed improvement.

We run COMSOL on hyperthreaded machines routinely. For example, presently I have COMSOL running on a machine with 12 physical cores on it, but 24 total cores since it is hyperthreaded. The load factor periodically goes up to indicate that all 24 cores are being used and the turnaround time is likewise improved. I can start the COMSOL server with the "-np 24" switch enabled to use all the cores on that system.

Please login with a confirmed email address before reporting spam

Posted: 1 decade ago 2012/02/27 3:51 GMT-5

[QUOTE]
I think hyperthreading will improve the performance of COMSOL even though it is "not supported". To COMSOL, or any process for that matter running on a Linux system anyway, the OS takes care of the hyperthreading such that an n-core system appears like a 2n-core system to the users. I have found that each hyperthread core is equivalent to about 1/3 of a real core in terms of cpu speed improvement.

We run COMSOL on hyperthreaded machines routinely. For example, presently I have COMSOL running on a machine with 12 physical cores on it, but 24 total cores since it is hyperthreaded. The load factor periodically goes up to indicate that all 24 cores are being used and the turnaround time is likewise improved. I can start the COMSOL server with the "-np 24" switch enabled to use all the cores on that system.
[/QUOTE]


I am not really sure. When COMSOL did not detect hyperthreading (before version 4.1), it used as many threads as cores detected (physical and logical). I then benchmarked my system, and it was faster with HT disabled than enabled. I do not know what happens now in version 4.2a. More threads impose a bigger overhead to manage them, and if the program is not very well optimized it can even be slower.
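One way to settle the HT-on vs. HT-off question for your own model is to time the same batch job at several `-np` values. A sketch only: the `comsol batch` flags and the `model.mph` filename are assumptions about a standard installation, and the loop merely prints the commands so you can run them deliberately on a machine that has COMSOL:

```shell
# Print timed benchmark commands for several thread counts; on a machine
# with COMSOL installed, run them (or uncomment the eval) and compare.
for np in 1 2 4 8 16; do
  cmd="comsol batch -np $np -inputfile model.mph -outputfile out_${np}.mph"
  echo "time $cmd"
  # eval "time $cmd"
done
```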

Posted: 1 decade ago 2013/03/15 13:58 GMT-4
Hello Ivar,

I just got a new Dell PC with 8 cores, 64 GB RAM and 64-bit Windows 7. May I bother you to learn how to set up COMSOL 4.0 or 4.2 to use all 8 cores during every simulation?
Thank you a lot.
Guang

Ivar KJELBERG COMSOL Multiphysics(r) fan, retired, former "Senior Expert" at CSEM SA (CH)

Posted: 1 decade ago 2013/03/15 16:28 GMT-4
Hi

Normally there is nothing to "set up", but forget v4.0 and use the latest version you have; it's far more stable.
COMSOL detects the number of cores (check whether that is already the case in 4.2); otherwise you must change your calling sequence to add -np 4. As you have 8 multithreaded cores, I suggest using only 50% of them, to leave yourself some responsiveness; in any case, for heavy calculations you do not really gain a lot, as the RAM bandwidth becomes too low.

Do not forget that not all operations can be parallelized, so you do not get a 4x gain, particularly not for "small models", but it's far better for large ones.

--
Good luck
Ivar
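For the Windows question above: the `-np` switch simply goes on the command line (or in the desktop shortcut's Target field). The installation paths below are assumptions about a default COMSOL 4.2 install, and the commands are shown via `echo` so the sketch runs anywhere:

```shell
# Illustrative Windows command lines; the C:\COMSOL42 path is an assumption.
NP=4   # half of the 8 cores, per Ivar's suggestion
echo "GUI:   \"C:\\COMSOL42\\bin\\comsol.exe\" -np $NP"
echo "Batch: \"C:\\COMSOL42\\bin\\comsolbatch.exe\" -np $NP -inputfile model.mph"
```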

Posted: 1 decade ago 2013/09/06 7:27 GMT-4
I have seen your results. This trend is highly probable if your problem size is comparatively small even for a single core. If you then partition the problem into smaller pieces, there is a lot of communication overhead among the cores compared to the computation, which can hamper your overall parallel performance.
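The small-model effect described above is essentially Amdahl's law: if only a fraction p of the solve parallelizes, extra cores cannot help much. A minimal sketch (the p values below are illustrative guesses, not measured COMSOL numbers):

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Ideal speedup on n cores when a fraction p of the work parallelizes."""
    return 1.0 / ((1.0 - p) + p / n)

# A ~20% gain on 4 cores (the figure from the original post) corresponds
# to only about a quarter of the runtime being parallelizable:
print(round(amdahl_speedup(0.25, 4), 2))  # 1.23
# whereas a well-parallelized solve (p = 0.9) would give nearly 3x:
print(round(amdahl_speedup(0.90, 4), 2))  # 3.08
```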

Charalampos Ferekidis

Posted: 10 years ago 2014/09/02 5:10 GMT-4

[QUOTE]
I think hyperthreading will improve the performance of COMSOL even though it is "not supported". To COMSOL, or any process for that matter running on a Linux system anyway, the OS takes care of the hyperthreading such that an n-core system appears like a 2n-core system to the users. I have found that each hyperthread core is equivalent to about 1/3 of a real core in terms of cpu speed improvement.

We run COMSOL on hyperthreaded machines routinely. For example, presently I have COMSOL running on a machine with 12 physical cores on it, but 24 total cores since it is hyperthreaded. The load factor periodically goes up to indicate that all 24 cores are being used and the turnaround time is likewise improved. I can start the COMSOL server with the "-np 24" switch enabled to use all the cores on that system.
[/QUOTE]


I have made a similar observation on Windows 7 x64 running COMSOL 4.4. When running rather small models (ca. 100,000 DOFs, PARDISO) on a 2x6-core machine with 48 GB of RAM, the CPU usage goes up with the "-np 24" switch set, while the solving time also increases by about 20% :-(
So I think one is better off not using the -np switch.
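The slowdown reported above can be sketched with a toy extension of Amdahl's law that charges a coordination cost c per extra thread. Both p and c are made-up parameters, chosen only to illustrate how oversubscribing with -np 24 can end up slower than a serial run:

```python
def speedup_with_overhead(p: float, n: int, c: float) -> float:
    """Amdahl speedup with an added coordination overhead c per extra thread
    (c expressed as a fraction of the serial runtime)."""
    return 1.0 / ((1.0 - p) + p / n + c * (n - 1))

# With a modest per-thread overhead, 24 threads are *slower* than one
# (speedup below 1.0, i.e. roughly the ~20% slowdown reported):
print(round(speedup_with_overhead(0.5, 24, 0.03), 2))  # 0.83
# while a handful of threads still helps:
print(round(speedup_with_overhead(0.5, 4, 0.03), 2))   # 1.40
```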

Note that while COMSOL employees may participate in the discussion forum, COMSOL® software users who are on-subscription should submit their questions via the Support Center for a more comprehensive response from the Technical Support team.