Long running work units Universe@Home

fractal Joined: 23 Mar 15 Posts: 3 Credit: 155,811,308 RAC: 0	Message 1053 - Posted: 30 Jan 2016, 18:10:26 UTC
Am I the only one getting long-running work units mixed in with the regular ones? They appear to come in bunches. The 333-credit work units usually take between 6,000 and 12,000 seconds on the various machines I have, but occasionally I get a work unit that takes 60,000 seconds or more. These get the same 333 credits when they complete.
Examples on a stock i7 include
http://universeathome.pl/universe/workunit.php?wuid=3718041,
http://universeathome.pl/universe/workunit.php?wuid=3717416,
http://universeathome.pl/universe/workunit.php?wuid=3717415 and
http://universeathome.pl/universe/workunit.php?wuid=3717828.
Ok, let's ignore that last one. That's an ARM processor which took as long as the i7 ...
A stock i3 sees
http://universeathome.pl/universe/workunit.php?wuid=3675859 and
http://universeathome.pl/universe/workunit.php?wuid=3675872.
As you can see, sometimes the wingmen also get the really long run times, sometimes they get slightly longer run times and sometimes they get standard run times. All my machines run Intel processors at stock frequencies with decent cooling under Linux. There is no pattern to the wingmen, as they run a variety of processors and operating systems.
I had written it off as "one of those things" for a while, until I looked into it this weekend and found that the wingmen are not affected the same way I am. So, does anyone else see a pile of these day-long work units mixed in with their 2-3 hour units? Does anyone have any idea what is going on?
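To put numbers on the poor return described above, here is a minimal sketch that converts a fixed 333-credit award into credits per hour for the run times quoted in the post. The function name is made up for illustration, and the only inputs are the figures fractal reports (6,000, 12,000 and 60,000 seconds).

```python
def credits_per_hour(credits: float, runtime_seconds: float) -> float:
    """Return the credit earned per hour of run time."""
    if runtime_seconds <= 0:
        raise ValueError("runtime_seconds must be positive")
    return credits * 3600.0 / runtime_seconds


# Run times quoted in the post: typical range and a long-running unit.
for seconds in (6_000, 12_000, 60_000):
    rate = credits_per_hour(333, seconds)
    print(f"{seconds:>6} s -> {rate:6.1f} credits/hour")
```

Under these figures a 60,000-second unit earns roughly a tenth of the hourly rate of a 6,000-second unit, which is the imbalance the thread goes on to discuss.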

alex Joined: 21 Feb 15 Posts: 64 Credit: 65,733,511 RAC: 0	Message 1054 - Posted: 30 Jan 2016, 20:23:48 UTC
My '333' WUs on an i7 take between 10,900 and 15,900 seconds. Longer WUs are granted higher credit. This is how I see it. I have no dedicated cruncher; all of these are PCs in productive use.

Krzysztof Piszczek - wspieram ... Project administrator Project developer Project tester Joined: 4 Feb 15 Posts: 849 Credit: 144,180,465 RAC: 0	Message 1055 - Posted: 30 Jan 2016, 21:38:15 UTC - in response to Message 1053.
@fractal, you are probably just occasionally getting resends of longer work units from the past.
Krzysztof 'krzyszp' Piszczek Member of Radioactive@Home team My Patreon profile Universe@Home on YT

cyrusNGC_224@P3D Joined: 21 Feb 15 Posts: 46 Credit: 926,538,317 RAC: 0	Message 1056 - Posted: 30 Jan 2016, 23:28:43 UTC
I have observed these vastly longer running work units too. But you can see (notably in your examples) that they have one thing in common:
universe_bh_AAA_10_CCCCC_D-EEEEEE_FFFFFF
All these work units carry the number 10 in position B, no matter what values the other positions have. Look in your work unit lists to verify this.
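The naming pattern above can be checked mechanically. The following is a minimal sketch, assuming the layout universe_bh_A_B_C_D-E_F given in the post, with an optional trailing _N on task names (as in the task name quoted later in the thread). The regular expression and the function name are illustrative and not part of the project's own tooling.

```python
import re
from typing import Optional

# Assumed layout: universe_bh_A_B_C_D-E_F, optionally followed by _N,
# the suffix that appears on task names (as opposed to workunit names).
WU_NAME = re.compile(
    r"^universe_bh_(?P<a>\d+)_(?P<b>\d+)_(?P<c>\d+)"
    r"_(?P<d>\d+)-(?P<e>\d+)_(?P<f>\d+)(?:_(?P<copy>\d+))?$"
)


def position_b(name: str) -> Optional[int]:
    """Return the number in position B, or None if the name does not match."""
    match = WU_NAME.match(name)
    return int(match.group("b")) if match else None


# Task name taken from a later post in this thread.
print(position_b("universe_bh_362_16_20000_1-999999_335600_1"))
```

Filtering a task list for `position_b(name) == 10` would reproduce the check suggested in the post.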

fractal Joined: 23 Mar 15 Posts: 3 Credit: 155,811,308 RAC: 0	Message 1057 - Posted: 30 Jan 2016, 23:40:24 UTC - in response to Message 1056. Last modified: 31 Jan 2016, 0:04:35 UTC
"I have observed these vastly longer running work units too. But you can see (notably in your examples) that they have one thing in common: universe_bh_AAA_10_CCCCC_D-EEEEEE_FFFFFF All these work units carry the number 10 in position B, no matter what values the other positions have. Look in your work unit lists to verify this."
Good observation. Not sure what that field means but, yes, all of the ones I spotted seem to have a "10" there. I wonder what it is about those work units that causes some machines to take so much longer than others.
edit: I just went through 1000 of the 8000 units in my "valid tasks" list and you are 33 for 33. All 33 long-running units out of the 1000 have 10 in that position. http://universeathome.pl/universe/workunit.php?wuid=3676163 is interesting in that it took 2x as long as it should. Maybe 1/3 of the 33 took 2x as long as normal, with the rest taking 8x as long. Occasionally the wingman takes the normal time, sometimes it takes 2x as long, and rarely both of us take the full extra-long time. Oh, and every unit gets 333 credits no matter how long it took to compute.
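The manual tally described in the edit above could be automated along these lines. This is only a sketch: the threshold (four times the median run time) is an arbitrary assumption, and the sample records are invented placeholders, not real tasks from anyone's list.

```python
from collections import Counter
from statistics import median


def slow_tasks_by_b(tasks, factor=4.0):
    """Count tasks slower than factor * median run time, keyed by position B.

    tasks: iterable of (workunit_name, runtime_seconds) pairs.
    """
    tasks = list(tasks)
    typical = median(runtime for _, runtime in tasks)
    counts = Counter()
    for name, runtime in tasks:
        if runtime > factor * typical:
            # Position B is the fourth underscore-separated field.
            counts[name.split("_")[3]] += 1
    return counts


# Hypothetical records for illustration only.
sample = [
    ("universe_bh_100_3_20000_1-999999_000100", 7000),
    ("universe_bh_100_10_20000_1-999999_000200", 61000),
    ("universe_bh_101_7_20000_1-999999_000300", 9000),
    ("universe_bh_101_10_20000_1-999999_000400", 64000),
    ("universe_bh_102_5_20000_1-999999_000500", 11000),
]
print(slow_tasks_by_b(sample))
```

If every slow task lands under the key "10", that matches the 33-for-33 result reported above. A lower factor would also catch the units that only took about twice as long.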

Krzysztof Piszczek - wspieram ... Project administrator Project developer Project tester Joined: 4 Feb 15 Posts: 849 Credit: 144,180,465 RAC: 0	Message 1058 - Posted: 31 Jan 2016, 7:45:22 UTC - in response to Message 1056.
"I have observed these vastly longer running work units too. But you can see (notably in your examples) that they have one thing in common: universe_bh_AAA_10_CCCCC_D-EEEEEE_FFFFFF All these work units carry the number 10 in position B, no matter what values the other positions have. Look in your work unit lists to verify this."
This is very interesting... The "10" in the WU name means that it is the 10th iteration of the loop that generates one of the starting parameters ("idum"). I strongly suspect that it combines with another parameter to give longer computing times and, probably, interesting results. Let me investigate this, please :)

cyrusNGC_224@P3D Joined: 21 Feb 15 Posts: 46 Credit: 926,538,317 RAC: 0	Message 1065 - Posted: 9 Feb 2016, 22:53:11 UTC - in response to Message 1057.
"I wonder what it is about those work units that causes some machines to take so much longer than others."
Further, I see that only my AMD Bulldozer-based machines (Opteron and Kaveri) are not affected by these long run times with the "10" work units (so they take the normal time)! AMD K8, K10, *Cat and Intel C2D and Core i CPUs all take the long time. All of this is under Linux; on Windows, however, I have not noticed it.

Tex1954 Joined: 22 Feb 15 Posts: 23 Credit: 37,205,060 RAC: 0	Message 1150 - Posted: 21 Apr 2016, 15:56:07 UTC Last modified: 21 Apr 2016, 15:57:20 UTC
I have observed the same thing. It seems AMD runs faster on these "10" tasks, but some i3 and i5 CPUs also seem to go faster. It's very interesting that I can find a minor correlation with the OS (Win vs. Linux). I see the same thing on my E3-1240V2 CPUs as I do on the 2p X5680 setup. However, they ALL get the same 333 points... how weird...
In many cases, the i3/i5 CPUs that do significantly better are running Windows instead of Linux, so some compiler thing may be going on as well...
'Tis for the developers to clear this up, I think... :D

Conan Joined: 4 Feb 15 Posts: 49 Credit: 15,956,546 RAC: 0	Message 1153 - Posted: 27 Apr 2016, 1:56:16 UTC
I have had at least one long one before that ran over a day (about 30 hours; my wingman only took 5 hours). I currently have one that has been running for over 21 hours and has been stuck at 10.075% completed for many hours without changing. It is using a full CPU core, so it appears to be running OK.
Both long units ran on the same hardware, an AMD PhenomII X6 running Linux Fedora16 X64.
I am waiting to see if it completes. The previous long runner got the same 333 points, so I assume this one will too. It will be a poor return on resource use, but there are only a few of them. They are recent creations, not resends.
Conan

TheHoosh Joined: 6 Mar 15 Posts: 13 Credit: 20,214,952 RAC: 0	Message 1154 - Posted: 28 Apr 2016, 9:09:33 UTC
A couple of days ago I received some of those long-running WUs as well. And just like on your machine(s), they all got stuck at random levels of completion (which stands in contrast to the earlier, occasionally occurring long-running WUs). Some ran longer than 9 hours and did not proceed farther than 10%; others were stuck at around 90% after running for 21 hours. I canceled all of them, as they appeared to me to be somehow corrupted.

Krzysztof Piszczek - wspieram ... Project administrator Project developer Project tester Joined: 4 Feb 15 Posts: 849 Credit: 144,180,465 RAC: 0	Message 1155 - Posted: 28 Apr 2016, 10:21:04 UTC
I found that at least some of them break on Linux computers but finish successfully (and get points) on Windows... I will try to find out why this happens.

Conan Joined: 4 Feb 15 Posts: 49 Credit: 15,956,546 RAC: 0	Message 1156 - Posted: 28 Apr 2016, 22:31:52 UTC
Mine has now moved to over 32 hours and is still stuck at 10.075%. This is on a Linux machine. Should I abort it? It is still using CPU, and the Task Manager (Linux equivalent) says it has been running for over 3 days.
Conan

Conan Joined: 4 Feb 15 Posts: 49 Credit: 15,956,546 RAC: 0	Message 1157 - Posted: 29 Apr 2016, 8:08:58 UTC - in response to Message 1156.
"Mine has now moved to over 32 hours and is still stuck at 10.075%. This is on a Linux machine. Should I abort it? It is still using CPU, and the Task Manager (Linux equivalent) says it has been running for over 3 days. Conan"
Don't worry about this one, as I have aborted it. It reached nearly 40 hours and had only made one checkpoint, at 8 minutes 50 seconds, then nothing after that. The percentage done remained at 10.075% the whole time. The properties of the work unit (in BOINC Manager) said it was using just 0.3 GFlops per second, which seems a bit slow. So even though it was using a CPU, it did not appear to be doing much.
Conan

cyrusNGC_224@P3D Joined: 21 Feb 15 Posts: 46 Credit: 926,538,317 RAC: 0	Message 1158 - Posted: 29 Apr 2016, 10:47:16 UTC - in response to Message 1155.
"I found that at least some of them break on Linux computers but finish successfully (and get points) on Windows... I will try to find out why this happens."
That affected only universe_bh_361_0 WUs like this one: http://universeathome.pl/universe/workunit.php?wuid=4823677

Tex1954 Joined: 22 Feb 15 Posts: 23 Credit: 37,205,060 RAC: 0	Message 1159 - Posted: 1 May 2016, 12:19:23 UTC Last modified: 1 May 2016, 12:25:15 UTC
I now have 4 long ones on each of two computers (total 8) that have run for 6 days on one setup and over 1 day on the other setup. Both are running Linux. I let them run because I was curious... I'll abort them all after I report what is in the slots directories now...
Both setups are server grade, with an E3-1230V3 and an E3-1240V2 CPU. Both run Linux Mint 17.3.
In any case, I checked the slot files and nothing is changing except some stuff in the error.dat files...
error.dat =
error: function Lzahbf(M,Mc) should not be called for HM stars
error: function Lzahbf(M,Mc) should not be called for HM stars
unexpected remnant case for K=5-6: 254568
error.dat2 =
error: bondi() accreted mass (2.801104) larger than envelope mass (2.786657) (190181)
error.dat3 =
error: bondi() accreted mass (2.801104) larger than envelope mass (2.786657) (190181)
The boinc_mmap_file has some binary junk in it...
Nothing in the stderr.txt file.
log.txt =
00:00:00 00:00:00 PROGRAM START: Sun Apr 24 23:43:52 2016
00:00:00 00:00:00 no checkpoint.dat file found
00:00:00 00:00:00 cleaning checkpoints
00:00:00 00:00:00 gw_cpfile: source file "data0.dat2" not present
00:00:00 00:00:00 gw_cpfile: source file "data1.dat2" not present
00:00:00 00:00:00 gw_cpfile: source file "data2.dat2" not present
00:00:00 00:00:00 gw_cpfile: source file "error.dat2" not present
00:00:00 00:00:00 reading checkpoint: istart: -1; pp: 0; n: -1
00:00:00 00:00:00 checkpoint read
00:00:00 00:00:00 default values set
00:00:00 00:00:00 Reading param.in file
00:00:00 00:00:00 PARAMIN: num_tested = 20000
00:00:00 00:00:00 PARAMIN: hub_val = 1000
00:00:00 00:00:00 PARAMIN: idum = -943500
00:00:00 00:00:00 PARAMIN: OUTPUT = 3
00:00:00 00:00:00 PARAMIN: Sal = -2.3
00:00:00 00:00:00 PARAMIN: Mmina = 5.0
00:00:00 00:00:00 PARAMIN: Mminb = 3.0
00:00:00 00:00:00 PARAMIN: Fa = 1
00:00:00 00:00:00 PARAMIN: ZZ = 0.0001
00:00:00 00:00:00 param.in file read
00:00:00 00:00:00 idum: -943500; num_tested: 20000
00:03:53 00:03:53 making checkpoint: j: 1000; iidd: 118910
00:03:53 00:00:00 gw_cpfile: data0.dat appended to data0.dat2
00:03:53 00:00:00 gw_cpfile: data1.dat appended to data1.dat2
00:03:53 00:00:00 gw_cpfile: data2.dat appended to data2.dat2
00:03:53 00:00:00 gw_cpfile: error.dat appended to error.dat2
00:03:53 00:00:00 gw_cpfile: data0.dat appended to data0.dat3
00:03:53 00:00:00 gw_cpfile: data1.dat appended to data1.dat3
00:03:53 00:00:00 gw_cpfile: data2.dat appended to data2.dat3
00:03:53 00:00:00 gw_cpfile: error.dat appended to error.dat3
00:07:50 00:03:57 making checkpoint: j: 2000; iidd: 242302
00:07:50 00:00:00 gw_cpfile: data0.dat appended to data0.dat2
00:07:50 00:00:00 gw_cpfile: data1.dat appended to data1.dat2
00:07:50 00:00:00 gw_cpfile: data2.dat appended to data2.dat2
00:07:50 00:00:00 gw_cpfile: error.dat appended to error.dat2
00:07:50 00:00:00 gw_cpfile: data0.dat appended to data0.dat3
00:07:50 00:00:00 gw_cpfile: data1.dat appended to data1.dat3
00:07:50 00:00:00 gw_cpfile: data2.dat appended to data2.dat3
00:07:50 00:00:00 gw_cpfile: error.dat appended to error.dat3
00:00:00 00:00:00 PROGRAM START: Sun May 1 06:16:53 2016
00:00:00 00:00:00 reading checkpoint: istart: 2000; pp: 242302; n: 2
00:00:00 00:00:00 checkpoint read
00:00:00 00:00:00 default values set
00:00:00 00:00:00 Reading param.in file
00:00:00 00:00:00 PARAMIN: num_tested = 20000
00:00:00 00:00:00 PARAMIN: hub_val = 1000
00:00:00 00:00:00 PARAMIN: idum = -943500
00:00:00 00:00:00 PARAMIN: OUTPUT = 3
00:00:00 00:00:00 PARAMIN: Sal = -2.3
00:00:00 00:00:00 PARAMIN: Mmina = 5.0
00:00:00 00:00:00 PARAMIN: Mminb = 3.0
00:00:00 00:00:00 PARAMIN: Fa = 1
00:00:00 00:00:00 PARAMIN: ZZ = 0.0001
00:00:00 00:00:00 param.in file read
00:00:00 00:00:00 idum: -943500; num_tested: 20000
00:00:00 00:00:00 random number generator initialised: 242302
00:00:00 00:00:00 PROGRAM START: Sun May 1 06:28:15 2016
00:00:01 00:00:01 reading checkpoint: istart: 2000; pp: 242302; n: 2
00:00:01 00:00:00 checkpoint read
00:00:01 00:00:00 default values set
00:00:01 00:00:00 Reading param.in file
00:00:01 00:00:00 PARAMIN: num_tested = 20000
00:00:01 00:00:00 PARAMIN: hub_val = 1000
00:00:01 00:00:00 PARAMIN: idum = -943500
00:00:01 00:00:00 PARAMIN: OUTPUT = 3
00:00:01 00:00:00 PARAMIN: Sal = -2.3
00:00:01 00:00:00 PARAMIN: Mmina = 5.0
00:00:01 00:00:00 PARAMIN: Mminb = 3.0
00:00:01 00:00:00 PARAMIN: Fa = 1
00:00:01 00:00:00 PARAMIN: ZZ = 0.0001
00:00:01 00:00:00 param.in file read
00:00:01 00:00:00 idum: -943500; num_tested: 20000
00:00:01 00:00:00 random number generator initialised: 242302
00:00:00 00:00:00 PROGRAM START: Sun May 1 06:57:59 2016
00:00:00 00:00:00 reading checkpoint: istart: 2000; pp: 242302; n: 2
00:00:00 00:00:00 checkpoint read
00:00:00 00:00:00 default values set
00:00:00 00:00:00 Reading param.in file
00:00:00 00:00:00 PARAMIN: num_tested = 20000
00:00:00 00:00:00 PARAMIN: hub_val = 1000
00:00:00 00:00:00 PARAMIN: idum = -943500
00:00:00 00:00:00 PARAMIN: OUTPUT = 3
00:00:00 00:00:00 PARAMIN: Sal = -2.3
00:00:00 00:00:00 PARAMIN: Mmina = 5.0
00:00:00 00:00:00 PARAMIN: Mminb = 3.0
00:00:00 00:00:00 PARAMIN: Fa = 1
00:00:00 00:00:00 PARAMIN: ZZ = 0.0001
00:00:00 00:00:00 param.in file read
00:00:00 00:00:00 idum: -943500; num_tested: 20000
00:00:00 00:00:00 random number generator initialised: 242302
This task is one of four on this setup that have been running for over 6 days. I've restarted it a couple of times to see if something changes... it seems the completion percent is moving up...
8-)
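For anyone wanting to check their own slot directory, here is a minimal sketch that pulls the "making checkpoint" entries out of a log.txt like the one above. The regular expression is inferred from the pasted log only, and the reading of j as progress toward num_tested is an assumption, not something the project has confirmed.

```python
import re

# Matches entries such as:
#   00:03:53 00:03:53 making checkpoint: j: 1000; iidd: 118910
CHECKPOINT = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}) \d{2}:\d{2}:\d{2} "
    r"making checkpoint: j: (\d+); iidd: (\d+)"
)


def checkpoints(log_text):
    """Return (elapsed_seconds, j, iidd) for every checkpoint entry."""
    found = []
    for h, m, s, j, iidd in CHECKPOINT.findall(log_text):
        elapsed = int(h) * 3600 + int(m) * 60 + int(s)
        found.append((elapsed, int(j), int(iidd)))
    return found


# Excerpt copied from the log in the post above.
excerpt = (
    "00:03:53 00:03:53 making checkpoint: j: 1000; iidd: 118910 "
    "00:03:53 00:00:00 gw_cpfile: data0.dat appended to data0.dat2 "
    "00:07:50 00:03:57 making checkpoint: j: 2000; iidd: 242302"
)
NUM_TESTED = 20000  # from "PARAMIN: num_tested = 20000"

for elapsed, j, iidd in checkpoints(excerpt):
    # Assumption: j counts systems processed out of num_tested.
    print(f"{elapsed:5d} s  j={j:5d}  fraction={j / NUM_TESTED:.1%}")
```

In the pasted log the last checkpoint is j: 2000 at 00:07:50, and every later restart resumes from istart: 2000 without writing a new checkpoint, which is consistent with a task that stops making progress shortly after its second checkpoint.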

Tex1954 Joined: 22 Feb 15 Posts: 23 Credit: 37,205,060 RAC: 0	Message 1161 - Posted: 2 May 2016, 2:24:19 UTC Last modified: 2 May 2016, 2:24:38 UTC
The 2p setup just finished a couple dozen 12+ hour tasks like this one... and only 333 points?
http://universeathome.pl/universe/workunit.php?wuid=4907485
I think even Credit New would give more points... but we're all in the same boat, I guess...
8-)

Krzysztof Piszczek - wspieram ... Project administrator Project developer Project tester Joined: 4 Feb 15 Posts: 849 Credit: 144,180,465 RAC: 0	Message 1162 - Posted: 2 May 2016, 11:55:00 UTC - in response to Message 1161.
We experimented with Credit New... after a few hours of work some tasks got 5 points while others got 1500... Believe me, everything is better than Credit New ;)
Anyway, thank you for the detailed info about the errors - it is very useful.

Tex1954 Joined: 22 Feb 15 Posts: 23 Credit: 37,205,060 RAC: 0	Message 1163 - Posted: 2 May 2016, 12:59:15 UTC - in response to Message 1162.
Well, hope it helps. One other thing I notice (like others mentioned) is that when the task properties are viewed, they report only 0.06 GFLOPS/Sec???? If the tasks are not doing much math, that is why Credit New is so flaky.
Computer: Linux-DX5680
Project Universe@Home
Name universe_bh_362_16_20000_1-999999_335600_1
Application Universe BHspin 0.09
Workunit name universe_bh_362_16_20000_1-999999_335600
State Running
Received 5/1/2016 10:50:38 PM
Report deadline 5/15/2016 10:50:23 PM
Estimated app speed 0.06 GFLOPs/sec
Estimated task size 807 GFLOPs
CPU time at last checkpoint 01:36:19
CPU time 01:42:02
Elapsed time 01:42:53
Estimated time remaining 02:20:58
Fraction done 42.160%
Virtual memory size 12.80 MB
Working set size 3.46 MB
Directory slots/1
Process ID 2447
This is on a task near completion. If your WUs are not doing a lot of math, what are they doing? They use less than 4 MB of RAM... just curious...
8-)
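The figures in the task properties above can be cross-checked with a little arithmetic. The sketch below is not how BOINC computes its estimates; it only compares two simple readings of the posted numbers: a linear extrapolation from elapsed time and fraction done, and task size divided by app speed. The helper names are made up for illustration.

```python
def hms_to_seconds(text):
    """Convert 'HH:MM:SS' to a number of seconds."""
    h, m, s = (int(part) for part in text.split(":"))
    return h * 3600 + m * 60 + s


def remaining_from_fraction(elapsed_s, fraction_done):
    """Linear extrapolation: time left if progress continues at the same rate."""
    return elapsed_s * (1.0 - fraction_done) / fraction_done


# Values copied from the task properties in the post.
elapsed = hms_to_seconds("01:42:53")
reported_remaining = hms_to_seconds("02:20:58")
extrapolated = remaining_from_fraction(elapsed, 0.42160)
size_over_speed = 807 / 0.06  # estimated task size / estimated app speed

print(f"elapsed:             {elapsed} s")
print(f"reported remaining:  {reported_remaining} s")
print(f"extrapolated:        {extrapolated:.0f} s")
print(f"size / speed:        {size_over_speed:.0f} s")
```

The linear extrapolation lands within a few seconds of the reported "Estimated time remaining", while size divided by speed gives about 13,450 seconds for the whole task. This suggests the 0.06 GFLOPs/sec figure is a derived estimate rather than a measurement of how much math the task is doing.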

Krzysztof Piszczek - wspieram ... Project administrator Project developer Project tester Joined: 4 Feb 15 Posts: 849 Credit: 144,180,465 RAC: 0	Message 1164 - Posted: 2 May 2016, 13:11:16 UTC - in response to Message 1163.
This is what BM (BOINC Manager) reports, but I do not really believe those numbers... Just look at the CPU usage...

Tex1954 Joined: 22 Feb 15 Posts: 23 Credit: 37,205,060 RAC: 0	Message 1165 - Posted: 2 May 2016, 17:42:36 UTC - in response to Message 1164.
"This is what BM (BOINC Manager) reports, but I do not really believe those numbers... Just look at the CPU usage..."
CPU usage is normal, as in 98% or higher. I see no problems in the longer-running WUs, only that they run longer... excepting the ones that never finish! Hope you can figure those out.
8-)