Long running work units
Author | Message |
---|---|
Joined: 23 Mar 15 Posts: 3 Credit: 155,811,308 RAC: 0 |
Am I the only one getting long-running work units mixed in with regular ones? They appear to come in bunches. The 333-credit work units usually take between 6,000 and 12,000 seconds on the various machines I have, but occasionally I get a work unit that takes 60,000-plus seconds. These get the same 333 credits when they complete. Examples include http://universeathome.pl/universe/workunit.php?wuid=3718041, http://universeathome.pl/universe/workunit.php?wuid=3717416, http://universeathome.pl/universe/workunit.php?wuid=3717415 and http://universeathome.pl/universe/workunit.php?wuid=3717828 on a stock i7. Ok, let's ignore that last one. That's an ARM processor which took as long as the i7 ... A stock i3 sees http://universeathome.pl/universe/workunit.php?wuid=3675859 and http://universeathome.pl/universe/workunit.php?wuid=3675872. As you can see, sometimes the wingmen also get the really long run times, sometimes they get slightly longer runtimes and sometimes they get standard runtimes. All my machines run Intel processors at stock frequencies with decent cooling under Linux. There is no pattern to the wingmen, as they run a variety of processors and operating systems. I had written it off as "one of those things" for a while, until I looked into it this weekend and found that the wingmen are not affected the same way I am. So, does anyone else see a pile of these day-long work units mixed in with their 2-3 hour units? Does anyone have any idea what is going on? |
Joined: 21 Feb 15 Posts: 64 Credit: 65,733,511 RAC: 0 |
My '333' WUs on an i7 take between 10,900 and 15,900 seconds. Longer WUs are granted higher credit. This is how I see it. No dedicated cruncher here; all are productive PCs. |
Joined: 4 Feb 15 Posts: 847 Credit: 144,180,465 RAC: 0 |
@fractal, you are probably just getting occasional resends of longer work units from the past. Krzysztof 'krzyszp' Piszczek Member of Radioactive@Home team My Patreon profile Universe@Home on YT |
Joined: 21 Feb 15 Posts: 46 Credit: 926,538,317 RAC: 0 |
I have observed these vastly longer running work units too. But you can see (notably in your examples) that they have one thing in common: universe_bh_AAA_10_CCCCC_D-EEEEEE_FFFFFF. All these work units carry the number 10 in position B, no matter what values the other positions have. Look in your work unit lists to verify this. |
Joined: 23 Mar 15 Posts: 3 Credit: 155,811,308 RAC: 0 |
"I have observed these vastly longer running work units too. But you can see (notably in your examples) that they have one thing in common:" Good observation. Not sure what that field means but, yes, all of the ones I spotted seem to have a "10" there. I wonder what it is about those work units that causes some machines to take so much longer than others. Edit: I just went through 1000 of the 8000 units in my "valid tasks" list and you are 33 for 33. All 33 long runners out of the 1000 have 10 in that position. http://universeathome.pl/universe/workunit.php?wuid=3676163 is interesting in that it took 2x as long as it should. Maybe 1/3 of the 33 took 2x as long as normal, with the rest taking 8x as long. Occasionally the wingman takes the normal time, sometimes it takes 2x as long, and rarely both of us take the full extra-long time. Oh, and every unit gets 333 credits no matter how long it took to compute. |
Joined: 4 Feb 15 Posts: 847 Credit: 144,180,465 RAC: 0 |
"I have observed these vastly longer running work units too. But you can see (notably in your examples) that they have one thing in common:" This is very interesting... The "10" in the WU name means that it is the 10th iteration of the loop which generates one of the starting parameters ("idum"). I strongly suspect that it combines with another parameter to produce a longer computing time and, probably, interesting results. Let me investigate this, please :) Krzysztof 'krzyszp' Piszczek |
Joined: 21 Feb 15 Posts: 46 Credit: 926,538,317 RAC: 0 |
"I wonder what it is about those work units that causes some machines to take so much longer than others." Furthermore, I see that only my AMD Bulldozer-based machines (Opteron and Kaveri) are not affected by these long runtimes with the "10" work units (so they take the normal time)! AMD K8, K10, *Cat and Intel C2D and Core i CPUs all take the long time. All of this is under Linux; on Windows, however, I have not noticed it. |
Joined: 22 Feb 15 Posts: 23 Credit: 37,205,060 RAC: 0 |
I have observed the same thing. It seems AMD runs faster on these "10" tasks, but some i3 and i5 CPUs also seem to go faster. It's very interesting that I can find a minor correlation with the OS (Win vs. Linux). I see the same thing on my E3-1240V2 CPUs as I do on the 2p X5680 setup. However, they ALL get the same 333 points... how weird... In many cases, the i3/i5 CPUs that do significantly better are running Windows instead of Linux, so some compiler thing may be going on as well... 'Tis for the developers to clear this up, I think... :D |
Joined: 4 Feb 15 Posts: 49 Credit: 15,956,546 RAC: 0 |
I have had at least one long one before that ran over a day (about 30 hours; my wingman only took 5 hours). I currently have one that has been running for over 21 hours and has been stuck at 10.075% completed for many hours without changing. It is using a full CPU core, so it appears to be running OK. Both long units ran on this hardware, an AMD Phenom II X6 running Linux Fedora 16 x64. Waiting to see if it completes. The previous long runner got the same 333 points, so I assume this one will too. It will be a poor return on resource use, but there are only a few of them. They are recent creations, not resends. Conan |
Joined: 6 Mar 15 Posts: 13 Credit: 20,214,952 RAC: 0 |
A couple of days ago I received some of those long-running WUs as well. And just like on your machine(s), they all got stuck at random levels of completion (which stands in contrast to the previous, occasionally occurring long-running WUs). Some ran longer than 9 hours and did not proceed farther than 10%; others were stuck at around 90% after running for 21 hours. I canceled all of them, as they appeared to me to be somehow corrupted. |
Joined: 4 Feb 15 Posts: 847 Credit: 144,180,465 RAC: 0 |
I found that at least some of them break on Linux computers and finish successfully (and get points) on Windows... I will try to find out why this happens. Krzysztof 'krzyszp' Piszczek |
Joined: 4 Feb 15 Posts: 49 Credit: 15,956,546 RAC: 0 |
Mine has now moved to over 32 hours and is still stuck at 10.075%; this is on a Linux machine. Should I abort it? It is still using CPU, and the Task Manager (Linux equivalent) says it has been running for over 3 days. Conan |
Joined: 4 Feb 15 Posts: 49 Credit: 15,956,546 RAC: 0 |
"Mine has now moved to over 32 hours and is still stuck at 10.075%; this is on a Linux machine." Don't worry about this one, as I have aborted it. It reached nearly 40 hours and had only made one checkpoint, at 8 minutes 50 seconds, then nothing after that. The percentage done remained at 10.075% the whole time. The properties of the work unit (in BOINC Manager) said it was running at just 0.3 GFLOPs/sec, which seems a bit slow. So even though it was using a CPU, it did not appear to be doing much. Conan |
Joined: 21 Feb 15 Posts: 46 Credit: 926,538,317 RAC: 0 |
"I found that at least some of them break on Linux computers and finish successfully (and get points) on Windows..." That affected only universe_bh_361_0 WUs like this one: http://universeathome.pl/universe/workunit.php?wuid=4823677 |
Joined: 22 Feb 15 Posts: 23 Credit: 37,205,060 RAC: 0 |
I now have 4 long ones on each of two computers (8 in total) that have run for 6 days on one setup and over 1 day on the other. Both are running Linux. I let them run because I was curious... I'll abort them all after I report what is in the slots directories now... Both setups are server grade, with an E3-1230V3 and an E3-1240V2 CPU. Both run Linux Mint 17.3. In any case, I checked the slot files and nothing is changing except for some content in the error.dat files...
error.dat =
error: function Lzahbf(M,Mc) should not be called for HM stars
error: function Lzahbf(M,Mc) should not be called for HM stars
unexpected remnant case for K=5-6: 254568
error.dat2 =
error: bondi() accreted mass (2.801104) larger than envelope mass (2.786657) (190181)
error.dat3 =
error: bondi() accreted mass (2.801104) larger than envelope mass (2.786657) (190181)
The boinc_mmap_file has some binary junk in it... Nothing in the stderr.txt file.
log.txt =
00:00:00 00:00:00 PROGRAM START: Sun Apr 24 23:43:52 2016
00:00:00 00:00:00 no checkpoint.dat file found
00:00:00 00:00:00 cleaning checkpoints
00:00:00 00:00:00 gw_cpfile: source file "data0.dat2" not present
00:00:00 00:00:00 gw_cpfile: source file "data1.dat2" not present
00:00:00 00:00:00 gw_cpfile: source file "data2.dat2" not present
00:00:00 00:00:00 gw_cpfile: source file "error.dat2" not present
00:00:00 00:00:00 reading checkpoint: istart: -1; pp: 0; n: -1
00:00:00 00:00:00 checkpoint read
00:00:00 00:00:00 default values set
00:00:00 00:00:00 Reading param.in file
00:00:00 00:00:00 PARAMIN: num_tested = 20000
00:00:00 00:00:00 PARAMIN: hub_val = 1000
00:00:00 00:00:00 PARAMIN: idum = -943500
00:00:00 00:00:00 PARAMIN: OUTPUT = 3
00:00:00 00:00:00 PARAMIN: Sal = -2.3
00:00:00 00:00:00 PARAMIN: Mmina = 5.0
00:00:00 00:00:00 PARAMIN: Mminb = 3.0
00:00:00 00:00:00 PARAMIN: Fa = 1
00:00:00 00:00:00 PARAMIN: ZZ = 0.0001
00:00:00 00:00:00 param.in file read
00:00:00 00:00:00 idum: -943500; num_tested: 20000
00:03:53 00:03:53 making checkpoint: j: 1000; iidd: 118910
00:03:53 00:00:00 gw_cpfile: data0.dat appended to data0.dat2
00:03:53 00:00:00 gw_cpfile: data1.dat appended to data1.dat2
00:03:53 00:00:00 gw_cpfile: data2.dat appended to data2.dat2
00:03:53 00:00:00 gw_cpfile: error.dat appended to error.dat2
00:03:53 00:00:00 gw_cpfile: data0.dat appended to data0.dat3
00:03:53 00:00:00 gw_cpfile: data1.dat appended to data1.dat3
00:03:53 00:00:00 gw_cpfile: data2.dat appended to data2.dat3
00:03:53 00:00:00 gw_cpfile: error.dat appended to error.dat3
00:07:50 00:03:57 making checkpoint: j: 2000; iidd: 242302
00:07:50 00:00:00 gw_cpfile: data0.dat appended to data0.dat2
00:07:50 00:00:00 gw_cpfile: data1.dat appended to data1.dat2
00:07:50 00:00:00 gw_cpfile: data2.dat appended to data2.dat2
00:07:50 00:00:00 gw_cpfile: error.dat appended to error.dat2
00:07:50 00:00:00 gw_cpfile: data0.dat appended to data0.dat3
00:07:50 00:00:00 gw_cpfile: data1.dat appended to data1.dat3
00:07:50 00:00:00 gw_cpfile: data2.dat appended to data2.dat3
00:07:50 00:00:00 gw_cpfile: error.dat appended to error.dat3
00:00:00 00:00:00 PROGRAM START: Sun May 1 06:16:53 2016
00:00:00 00:00:00 reading checkpoint: istart: 2000; pp: 242302; n: 2
00:00:00 00:00:00 checkpoint read
00:00:00 00:00:00 default values set
00:00:00 00:00:00 Reading param.in file
00:00:00 00:00:00 PARAMIN: num_tested = 20000
00:00:00 00:00:00 PARAMIN: hub_val = 1000
00:00:00 00:00:00 PARAMIN: idum = -943500
00:00:00 00:00:00 PARAMIN: OUTPUT = 3
00:00:00 00:00:00 PARAMIN: Sal = -2.3
00:00:00 00:00:00 PARAMIN: Mmina = 5.0
00:00:00 00:00:00 PARAMIN: Mminb = 3.0
00:00:00 00:00:00 PARAMIN: Fa = 1
00:00:00 00:00:00 PARAMIN: ZZ = 0.0001
00:00:00 00:00:00 param.in file read
00:00:00 00:00:00 idum: -943500; num_tested: 20000
00:00:00 00:00:00 random number generator initialised: 242302
00:00:00 00:00:00 PROGRAM START: Sun May 1 06:28:15 2016
00:00:01 00:00:01 reading checkpoint: istart: 2000; pp: 242302; n: 2
00:00:01 00:00:00 checkpoint read
00:00:01 00:00:00 default values set
00:00:01 00:00:00 Reading param.in file
00:00:01 00:00:00 PARAMIN: num_tested = 20000
00:00:01 00:00:00 PARAMIN: hub_val = 1000
00:00:01 00:00:00 PARAMIN: idum = -943500
00:00:01 00:00:00 PARAMIN: OUTPUT = 3
00:00:01 00:00:00 PARAMIN: Sal = -2.3
00:00:01 00:00:00 PARAMIN: Mmina = 5.0
00:00:01 00:00:00 PARAMIN: Mminb = 3.0
00:00:01 00:00:00 PARAMIN: Fa = 1
00:00:01 00:00:00 PARAMIN: ZZ = 0.0001
00:00:01 00:00:00 param.in file read
00:00:01 00:00:00 idum: -943500; num_tested: 20000
00:00:01 00:00:00 random number generator initialised: 242302
00:00:00 00:00:00 PROGRAM START: Sun May 1 06:57:59 2016
00:00:00 00:00:00 reading checkpoint: istart: 2000; pp: 242302; n: 2
00:00:00 00:00:00 checkpoint read
00:00:00 00:00:00 default values set
00:00:00 00:00:00 Reading param.in file
00:00:00 00:00:00 PARAMIN: num_tested = 20000
00:00:00 00:00:00 PARAMIN: hub_val = 1000
00:00:00 00:00:00 PARAMIN: idum = -943500
00:00:00 00:00:00 PARAMIN: OUTPUT = 3
00:00:00 00:00:00 PARAMIN: Sal = -2.3
00:00:00 00:00:00 PARAMIN: Mmina = 5.0
00:00:00 00:00:00 PARAMIN: Mminb = 3.0
00:00:00 00:00:00 PARAMIN: Fa = 1
00:00:00 00:00:00 PARAMIN: ZZ = 0.0001
00:00:00 00:00:00 param.in file read
00:00:00 00:00:00 idum: -943500; num_tested: 20000
00:00:00 00:00:00 random number generator initialised: 242302
This task is one of four on this setup that have been running for over 6 days. I've restarted it a couple of times to see if something changes... it seems the completion percentage is moving up... 8-) |
Joined: 22 Feb 15 Posts: 23 Credit: 37,205,060 RAC: 0 |
The 2p setup just finished a couple dozen 12+ hour tasks like this one... and only 333 points? http://universeathome.pl/universe/workunit.php?wuid=4907485 I think even Credit New would give more points... but we're all in the same boat, I guess... 8-) |
Joined: 4 Feb 15 Posts: 847 Credit: 144,180,465 RAC: 0 |
We experimented with Credit New... after a few hours of work, some tasks got 5 points while others got 1500... Believe me, everything is better than Credit New ;) Anyway, thank you for the detailed info about the errors; it is very useful. Krzysztof 'krzyszp' Piszczek |
Joined: 22 Feb 15 Posts: 23 Credit: 37,205,060 RAC: 0 |
Well, hope it helps. One other thing I notice (like others mentioned) is that when the task properties are viewed, they report only 0.06 GFLOPs/sec???? If the tasks are not doing much math, that is why Credit New is so flaky.
Computer: Linux-DX5680
Project Universe@Home
Name universe_bh_362_16_20000_1-999999_335600_1
Application Universe BHspin 0.09
Workunit name universe_bh_362_16_20000_1-999999_335600
State Running
Received 5/1/2016 10:50:38 PM
Report deadline 5/15/2016 10:50:23 PM
Estimated app speed 0.06 GFLOPs/sec
Estimated task size 807 GFLOPs
CPU time at last checkpoint 01:36:19
CPU time 01:42:02
Elapsed time 01:42:53
Estimated time remaining 02:20:58
Fraction done 42.160%
Virtual memory size 12.80 MB
Working set size 3.46 MB
Directory slots/1
Process ID 2447
This is on a task near completion. If your WUs are not doing a lot of math, what are they doing? They use less than 4 MB of RAM... just curious... 8-) |
Joined: 4 Feb 15 Posts: 847 Credit: 144,180,465 RAC: 0 |
This is what BM reports, but I don't really believe the numbers... Just look at the CPU usage... Krzysztof 'krzyszp' Piszczek |
Joined: 22 Feb 15 Posts: 23 Credit: 37,205,060 RAC: 0 |
"This is what BM reports, but I don't really believe the numbers... Just look at the CPU usage..." CPU usage is normal, as in 98% or higher. I see no problems in the longer running WUs, only that they run longer... except for the ones that never finish! Hope you can figure those out. 8-) |
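
For anyone who wants to repeat the "position B" check from earlier in the thread on their own task list, here is a minimal sketch in Python. It assumes names follow the layout quoted above (universe_bh_AAA_B_CCCCC_D-EEEEEE_FFFFFF, with an optional trailing "_N" on task names). The function names and the regex group names are made up for illustration; only position B has a meaning stated in the thread, and the third sample name is invented.

```python
import re
from collections import Counter

# Layout quoted in the thread: universe_bh_AAA_B_CCCCC_D-EEEEEE_FFFFFF
# Task names (as opposed to workunit names) carry an extra "_N" suffix.
# Group names a..f are placeholders; only "b" has a meaning given in the thread.
WU_NAME = re.compile(
    r"^universe_bh_(?P<a>\d+)_(?P<b>\d+)_(?P<c>\d+)"
    r"_(?P<d>\d+)-(?P<e>\d+)_(?P<f>\d+)(?:_(?P<replica>\d+))?$"
)


def position_b(name):
    """Return the integer in position B, or None if the name does not fit the layout."""
    match = WU_NAME.match(name.strip())
    return int(match.group("b")) if match else None


def tally_position_b(names):
    """Count how many names carry each position-B value, skipping unparseable names."""
    values = (position_b(name) for name in names)
    return Counter(v for v in values if v is not None)


if __name__ == "__main__":
    sample = [
        "universe_bh_362_16_20000_1-999999_335600_1",  # task name posted in the thread
        "universe_bh_362_16_20000_1-999999_335600",    # its workunit name
        "universe_bh_123_10_20000_1-999999_000001",    # invented name, for illustration only
    ]
    print(tally_position_b(sample))
```

Feeding it the names of only your long-running tasks (copied from the project's task list by hand, for example) should show whether they all share the same position-B value, as reported above. Names that do not fit the layout are skipped rather than guessed at.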