Message boards : Number crunching : extreme long wu's

extreme long wu's

Matthias Lehmkuhl

Joined: 23 Feb 15
Posts: 2
Credit: 1,021,845
RAC: 0
Message 2309 - Posted: 24 Jul 2017, 16:45:42 UTC - in response to Message 2279.  

I have a long-running result, now at more than 11 days of CPU time, and I am going to cancel it.
The last checkpoint was on 13.07.2017, and progress shows a fraction_done of 0.450050.
http://universeathome.pl/universe/workunit.php?wuid=10535753
wu_name: universe_bh2_160803_154_2_20000_1-999999_470000
result_name: universe_bh2_160803_154_2_20000_1-999999_470000_0
app_file: BHspin2_1_windows_intelx86.exe

error.dat does show:
error: bondi() accreted mass (6.458614) larger than envelope mass (4.354907) (2714882)
error: in Renv_con() unknown Ka type: 1, iidd_old: 2840284
error: in Menv_con() unknown Ka type: 1, iidd_old: 2840284
error.dat2 does show:
error: bondi() accreted mass (5.652698) larger than envelope mass (5.233360) (240724)
error: bondi() accreted mass (7.216964) larger than envelope mass (6.333716) (2569362)
error.dat3 does show:
error: bondi() accreted mass (5.652698) larger than envelope mass (5.233360) (240724)
error: bondi() accreted mass (7.216964) larger than envelope mass (6.333716) (2569362)

log.txt contains:
00:00:00 00:00:00 PROGRAM START: Thu Jul 13 02:29:34 2017
00:00:00 00:00:00 no checkpoint.dat file found
00:00:00 00:00:00 cleaning checkpoints
00:00:00 00:00:00 gw_cpfile: source file "data0.dat2" not present
00:00:00 00:00:00 gw_cpfile: source file "data1.dat2" not present
00:00:00 00:00:00 gw_cpfile: source file "data2.dat2" not present
00:00:00 00:00:00 gw_cpfile: source file "error.dat2" not present
00:00:00 00:00:00 reading checkpoint: istart: -1; pp: 0; n: -1
00:00:00 00:00:00 checkpoint read
00:00:00 00:00:00 default values set
00:00:00 00:00:00 Reading param.in file
00:00:00 00:00:00 PARAMIN: num_tested = 20000
00:00:00 00:00:00 PARAMIN: hub_val = 1000
00:00:00 00:00:00 PARAMIN: idum = -470000
00:00:00 00:00:00 PARAMIN: OUTPUT = 3
00:00:00 00:00:00 PARAMIN: Mmina = 5.0
00:00:00 00:00:00 PARAMIN: Mminb = 3.0
00:00:00 00:00:00 PARAMIN: golambda = 0.1
00:00:00 00:00:00 PARAMIN: Beta = 0.1
00:00:00 00:00:00 PARAMIN: Fa = 1.0
00:00:00 00:00:00 PARAMIN: Sigma3 = 0
00:00:00 00:00:00 PARAMIN: Sal = -2.7
00:00:00 00:00:00 PARAMIN: SS = 0
00:00:00 00:00:00 PARAMIN unknown parameter: name: SS; value: 0
00:00:00 00:00:00 PARAMIN: ZZ = 0.0001
00:00:00 00:00:00 param.in file read
00:00:00 00:00:00 idum: -470000; num_tested: 20000
00:05:24 00:05:24 making checkpoint: j: 1000; iidd: 282852
00:05:24 00:00:00 gw_cpfile: data0.dat appended to data0.dat2
00:05:24 00:00:00 gw_cpfile: data1.dat appended to data1.dat2
00:05:24 00:00:00 gw_cpfile: data2.dat appended to data2.dat2
00:05:24 00:00:00 gw_cpfile: error.dat appended to error.dat2
00:05:24 00:00:00 gw_cpfile: data0.dat appended to data0.dat3
00:05:24 00:00:00 gw_cpfile: data1.dat appended to data1.dat3
00:05:24 00:00:00 gw_cpfile: data2.dat appended to data2.dat3
00:05:24 00:00:00 gw_cpfile: error.dat appended to error.dat3
00:11:01 00:05:37 making checkpoint: j: 2000; iidd: 575529
00:11:01 00:00:00 gw_cpfile: data0.dat appended to data0.dat2
00:11:01 00:00:00 gw_cpfile: data1.dat appended to data1.dat2
00:11:01 00:00:00 gw_cpfile: data2.dat appended to data2.dat2
00:11:01 00:00:00 gw_cpfile: error.dat appended to error.dat2
00:11:01 00:00:00 gw_cpfile: data0.dat appended to data0.dat3
00:11:01 00:00:00 gw_cpfile: data1.dat appended to data1.dat3
00:11:01 00:00:00 gw_cpfile: data2.dat appended to data2.dat3
00:11:01 00:00:00 gw_cpfile: error.dat appended to error.dat3
00:16:28 00:05:27 making checkpoint: j: 3000; iidd: 869551
00:16:28 00:00:00 gw_cpfile: data0.dat appended to data0.dat2
00:16:28 00:00:00 gw_cpfile: data1.dat appended to data1.dat2
00:16:28 00:00:00 gw_cpfile: data2.dat appended to data2.dat2
00:16:28 00:00:00 gw_cpfile: error.dat appended to error.dat2
00:16:28 00:00:00 gw_cpfile: data0.dat appended to data0.dat3
00:16:28 00:00:00 gw_cpfile: data1.dat appended to data1.dat3
00:16:28 00:00:00 gw_cpfile: data2.dat appended to data2.dat3
00:16:28 00:00:00 gw_cpfile: error.dat appended to error.dat3
00:22:36 00:06:08 making checkpoint: j: 4000; iidd: 1164932
00:22:36 00:00:00 gw_cpfile: data0.dat appended to data0.dat2
00:22:36 00:00:00 gw_cpfile: data1.dat appended to data1.dat2
00:22:36 00:00:00 gw_cpfile: data2.dat appended to data2.dat2
00:22:36 00:00:00 gw_cpfile: error.dat appended to error.dat2
00:22:36 00:00:00 gw_cpfile: data0.dat appended to data0.dat3
00:22:36 00:00:00 gw_cpfile: data1.dat appended to data1.dat3
00:22:36 00:00:00 gw_cpfile: data2.dat appended to data2.dat3
00:22:36 00:00:00 gw_cpfile: error.dat appended to error.dat3
00:28:27 00:05:51 making checkpoint: j: 5000; iidd: 1449110
00:28:27 00:00:00 gw_cpfile: data0.dat appended to data0.dat2
00:28:27 00:00:00 gw_cpfile: data1.dat appended to data1.dat2
00:28:27 00:00:00 gw_cpfile: data2.dat appended to data2.dat2
00:28:27 00:00:00 gw_cpfile: error.dat appended to error.dat2
00:28:27 00:00:00 gw_cpfile: data0.dat appended to data0.dat3
00:28:27 00:00:00 gw_cpfile: data1.dat appended to data1.dat3
00:28:27 00:00:00 gw_cpfile: data2.dat appended to data2.dat3
00:28:27 00:00:00 gw_cpfile: error.dat appended to error.dat3
00:34:07 00:05:40 making checkpoint: j: 6000; iidd: 1740336
00:34:07 00:00:00 gw_cpfile: data0.dat appended to data0.dat2
00:34:07 00:00:00 gw_cpfile: data1.dat appended to data1.dat2
00:34:07 00:00:00 gw_cpfile: data2.dat appended to data2.dat2
00:34:07 00:00:00 gw_cpfile: error.dat appended to error.dat2
00:34:07 00:00:00 gw_cpfile: data0.dat appended to data0.dat3
00:34:07 00:00:00 gw_cpfile: data1.dat appended to data1.dat3
00:34:07 00:00:00 gw_cpfile: data2.dat appended to data2.dat3
00:34:07 00:00:00 gw_cpfile: error.dat appended to error.dat3
00:39:57 00:05:50 making checkpoint: j: 7000; iidd: 2037642
00:39:57 00:00:00 gw_cpfile: data0.dat appended to data0.dat2
00:39:57 00:00:00 gw_cpfile: data1.dat appended to data1.dat2
00:39:57 00:00:00 gw_cpfile: data2.dat appended to data2.dat2
00:39:57 00:00:00 gw_cpfile: error.dat appended to error.dat2
00:39:57 00:00:00 gw_cpfile: data0.dat appended to data0.dat3
00:39:57 00:00:00 gw_cpfile: data1.dat appended to data1.dat3
00:39:57 00:00:00 gw_cpfile: data2.dat appended to data2.dat3
00:39:57 00:00:00 gw_cpfile: error.dat appended to error.dat3
00:45:09 00:05:12 making checkpoint: j: 8000; iidd: 2308124
00:45:09 00:00:00 gw_cpfile: data0.dat appended to data0.dat2
00:45:09 00:00:00 gw_cpfile: data1.dat appended to data1.dat2
00:45:09 00:00:00 gw_cpfile: data2.dat appended to data2.dat2
00:45:09 00:00:00 gw_cpfile: error.dat appended to error.dat2
00:45:09 00:00:00 gw_cpfile: data0.dat appended to data0.dat3
00:45:09 00:00:00 gw_cpfile: data1.dat appended to data1.dat3
00:45:09 00:00:00 gw_cpfile: data2.dat appended to data2.dat3
00:45:09 00:00:00 gw_cpfile: error.dat appended to error.dat3
00:50:43 00:05:34 making checkpoint: j: 9000; iidd: 2581356
00:50:43 00:00:00 gw_cpfile: data0.dat appended to data0.dat2
00:50:43 00:00:00 gw_cpfile: data1.dat appended to data1.dat2
00:50:43 00:00:00 gw_cpfile: data2.dat appended to data2.dat2
00:50:43 00:00:00 gw_cpfile: error.dat appended to error.dat2
00:50:43 00:00:00 gw_cpfile: data0.dat appended to data0.dat3
00:50:43 00:00:00 gw_cpfile: data1.dat appended to data1.dat3
00:50:43 00:00:00 gw_cpfile: data2.dat appended to data2.dat3
00:50:43 00:00:00 gw_cpfile: error.dat appended to error.dat3
Matthias
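The checkpoint lines in the log above can be read programmatically to see when the task last made progress. The following is a minimal sketch, not part of the project's tooling. It assumes the first column is the elapsed time since program start (inferred from the sample, not documented here), and the function names are hypothetical.

```python
import re

# Matches lines such as:
# "00:05:24 00:05:24 making checkpoint: j: 1000; iidd: 282852"
# Assumption: the first HH:MM:SS column is elapsed time since program start.
CHECKPOINT_RE = re.compile(
    r"^(\d+):(\d{2}):(\d{2}) \S+ making checkpoint: j: (\d+); iidd: (\d+)"
)


def checkpoint_times(log_text):
    """Return a list of (elapsed_seconds, j) for every checkpoint line."""
    points = []
    for line in log_text.splitlines():
        m = CHECKPOINT_RE.match(line.strip())
        if m:
            h, mi, s, j, _iidd = (int(g) for g in m.groups())
            points.append((h * 3600 + mi * 60 + s, j))
    return points


def checkpoint_intervals(points):
    """Return the seconds between consecutive checkpoints."""
    return [b[0] - a[0] for a, b in zip(points, points[1:])]


# Three checkpoint lines copied from the log above, plus one unrelated line.
sample = """\
00:05:24 00:05:24 making checkpoint: j: 1000; iidd: 282852
00:05:24 00:00:00 gw_cpfile: data0.dat appended to data0.dat2
00:11:01 00:05:37 making checkpoint: j: 2000; iidd: 575529
00:16:28 00:05:27 making checkpoint: j: 3000; iidd: 869551
"""

points = checkpoint_times(sample)
intervals = checkpoint_intervals(points)
print(points)
print(intervals)
```

In the full log the last checkpoint is j: 9000 at 00:50:43 with num_tested = 20000, which appears consistent with the reported fraction_done of about 0.45 (9000 / 20000), so the task seems to have stalled before reaching the next checkpoint.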
JugNut
Joined: 11 Mar 15
Posts: 30
Credit: 59,188,106
RAC: 186,598
Message 2310 - Posted: 28 Jul 2017, 13:53:51 UTC - in response to Message 2030.  
Last modified: 28 Jul 2017, 14:12:04 UTC

Hey krzys
I've just found a bunch of these bad WUs across all my PCs. What a mess.
I found out the hard way that even if I abort them they still don't die. They have to be manually killed from Task Manager. If you don't manually kill them after aborting them, BOINC thinks they've gone and assigns new work to the slot that is already loaded and running. Not good!!
I will kill this WU now. This is the fifth in the last few hours :(

The only way I can tell they're locked up and not just long-running is by keeping an eye on the checkpointing. The WU below has been running for 15hrs 23mins and hasn't checkpointed for the last 13hrs. At least now I know what to look for and how to treat them. Aggressively...
As usual the stderr is empty, but if you're interested I kept a copy of the slot directory before I aborted it. If you would like it, just ask.
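Watching the checkpointing by hand could be approximated with a small script that looks at how long ago the checkpoint file was last written. This is only a sketch under assumptions: it supposes the application rewrites a file named checkpoint.dat in the slot directory at every checkpoint (the log earlier in the thread mentions checkpoint.dat, but its exact behaviour is not confirmed here), and the slot path and function names are hypothetical.

```python
import os
import time


def seconds_since_checkpoint(checkpoint_path, now=None):
    """Seconds elapsed since the checkpoint file was last modified."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(checkpoint_path)


def looks_stalled(checkpoint_path, max_age_hours=1.0, now=None):
    """True if the checkpoint file is older than max_age_hours."""
    age = seconds_since_checkpoint(checkpoint_path, now)
    return age > max_age_hours * 3600


if __name__ == "__main__":
    # Hypothetical location; adjust to the real BOINC data directory and slot.
    path = os.path.join("slots", "0", "checkpoint.dat")
    if os.path.exists(path):
        print(path, "stalled" if looks_stalled(path) else "still checkpointing")
    else:
        print(path, "not found")
```

A task that is merely slow keeps refreshing the file, so this check separates "long-running" from "locked up" in the same way as the manual check described above.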

http://universeathome.pl/universe/results.php?hostid=1679&offset=0&show_names=0&state=6&appid=

Contents of error.dat3.....
error: bondi() accreted mass (6.024443) larger than envelope mass (5.618883) (60413)
error: bondi() accreted mass (7.907802) larger than envelope mass (7.402827) (144779)
error: bondi() accreted mass (5.415336) larger than envelope mass (2.590284) (146410)
error: bondi() accreted mass (9.456944) larger than envelope mass (9.022832) (258705)
error: bondi() accreted mass (9.386890) larger than envelope mass (5.976198) (356863)
error: bondi() accreted mass (11.491090) larger than envelope mass (7.780758) (361139)
error: bondi() accreted mass (6.318919) larger than envelope mass (5.818937) (438696)
error: bondi() accreted mass (5.645096) larger than envelope mass (5.213384) (445394)
error: bondi() accreted mass (5.773027) larger than envelope mass (5.230975) (671283)
error: bondi() accreted mass (12.410371) larger than envelope mass (8.284976) (693333)
error: bondi() accreted mass (8.904030) larger than envelope mass (6.786716) (702075)
error: bondi() accreted mass (6.480212) larger than envelope mass (6.192082) (750103)
error: bondi() accreted mass (5.496527) larger than envelope mass (4.505465) (818009)

EDIT: Just tried to kill another one but instead this time it killed my PC, (blue screened)
Gibson Praise
Joined: 26 Feb 15
Posts: 3
Credit: 25,228,744
RAC: 26
Message 2322 - Posted: 12 Aug 2017, 2:50:54 UTC - in response to Message 2279.  

This is still an issue! :(

I have to ask -- is this problem being worked on?

The general response seems to be "deal with it": these WUs come in spurts and it is a cost of doing business. I can handle that, but it is concerning that such a long-standing problem has still not been successfully addressed and does not seem to be a priority.
JugNut
Joined: 11 Mar 15
Posts: 30
Credit: 59,188,106
RAC: 186,598
Message 2330 - Posted: 14 Aug 2017, 6:09:46 UTC - in response to Message 2322.  
Last modified: 14 Aug 2017, 6:10:18 UTC

Another two never ending BHspin tasks :(

So far the wingman involved has had the same problem, although it wouldn't surprise me if they eventually validated.

http://universeathome.pl/universe/result.php?resultid=24645815
http://universeathome.pl/universe/result.php?resultid=24645698
JugNut
Joined: 11 Mar 15
Posts: 30
Credit: 59,188,106
RAC: 186,598
Message 2331 - Posted: 14 Aug 2017, 9:47:37 UTC
Last modified: 14 Aug 2017, 9:58:32 UTC

Another bad batch. These WUs should have been in the above post, but regardless, they all had to be manually aborted. Only my Raspberry Pis do not show this behaviour.
This is just a sample. I have many more like 'em, if you're interested.

http://universeathome.pl/universe/workunit.php?wuid=10841172
http://universeathome.pl/universe/workunit.php?wuid=10860295
Conan
Joined: 4 Feb 15
Posts: 44
Credit: 2,530,446
RAC: 0
Message 2338 - Posted: 24 Aug 2017, 23:54:30 UTC
Last modified: 24 Aug 2017, 23:55:16 UTC

I have a BH Spin WU that has already been running for one and a half days. It is at 2% with over 74 days still to go, and counting.

I suspect this is a faulty WU that will never finish, and it will go over the deadline anyway.

I will probably abort it this afternoon.

Conan
Conan
Joined: 4 Feb 15
Posts: 44
Credit: 2,530,446
RAC: 0
Message 2339 - Posted: 25 Aug 2017, 3:13:56 UTC
Last modified: 25 Aug 2017, 3:17:10 UTC

I have just noticed that the percentage done has not moved for many, many hours, so I am aborting this WU. Time to complete has reached 80 days, with the percentage at 2.029% after 1 day 16 hours.

Conan
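The remaining-time figures quoted in these two posts are roughly what a plain linear extrapolation from the fraction done gives. The sketch below assumes progress is linear in time; BOINC's own estimate may be computed differently, and the function name is hypothetical.

```python
def remaining_days(elapsed_hours, fraction_done):
    """Linear estimate of the days still needed, given progress so far."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    total_hours = elapsed_hours / fraction_done
    return (total_hours - elapsed_hours) / 24


# 1 day 16 hours (40 h) at 2.029 %, as reported in the post above.
later_estimate = remaining_days(40, 0.02029)
# One and a half days (36 h) at 2 %, as reported in the earlier post.
earlier_estimate = remaining_days(36, 0.02)

print(round(later_estimate, 1), "days")
print(round(earlier_estimate, 1), "days")
```

Both values land close to the 80 and 74 days reported. The estimate growing while the percentage stays flat is the sign of a stuck task: elapsed time keeps rising while the fraction done does not.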
Krzysztof Piszczek - wspieram Polski projekt BOINC
Project administrator
Project developer
Project tester
Joined: 4 Feb 15
Posts: 650
Credit: 85,833,298
RAC: 256,836
Message 2340 - Posted: 25 Aug 2017, 14:02:29 UTC - in response to Message 2339.  

If any WU calculates for longer than 6 hours, feel free to abort it.
Or abort it if there is no percentage progress for over an hour.
Krzysztof 'krzyszp' Piszczek

Member of Radioactive@Home project team
My Patreon profile
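The rule of thumb in this post can be written down as a simple predicate. This is a sketch of the rule as stated, not an official project tool. The function and parameter names are hypothetical, and the thresholds are the 6 hours and 1 hour given above.

```python
def should_abort(runtime_hours, hours_without_progress,
                 max_runtime_hours=6.0, max_stall_hours=1.0):
    """Apply the stated rule: abort when a WU has run for too long,
    or when its progress has not moved for too long."""
    too_long = runtime_hours > max_runtime_hours
    stalled = hours_without_progress > max_stall_hours
    return too_long or stalled


# A WU running 2 hours with steady progress is left alone.
print(should_abort(2.0, 0.1))
# A WU running 3 hours with no progress for 90 minutes is aborted.
print(should_abort(3.0, 1.5))
# The case reported earlier: about 15.4 hours running, 13 hours without a checkpoint.
print(should_abort(15.4, 13.0))
```

A script like this would still need to read the runtime and progress of each task from the BOINC client, which is outside this sketch.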
Jacob Klein
Joined: 21 Feb 15
Posts: 53
Credit: 1,383,221
RAC: 0
Message 2341 - Posted: 25 Aug 2017, 18:59:26 UTC
Last modified: 25 Aug 2017, 19:00:33 UTC

That doesn't work for unattended machines.
Please fix your problem already, so computer resources aren't continually wasted!
You were provided details 6 months ago.
ritterm
Joined: 6 Mar 15
Posts: 28
Credit: 16,721,329
RAC: 13,633
Message 2346 - Posted: 29 Aug 2017, 16:53:13 UTC - in response to Message 2340.  

If any WU calculates for longer than 6 hours, feel free to abort it...

But if it keeps checkpointing and appearing to advance toward completion, is there any reason to abort?
[AF>Libristes] erik
Joined: 21 Feb 15
Posts: 2
Credit: 56,485,727
RAC: 210,796
Message 2356 - Posted: 11 Sep 2017, 11:59:25 UTC

But if it keeps checkpointing and appearing to advance toward completion, is there any reason to abort?

No. Everything seems normal now.
Mattmon
Joined: 29 May 17
Posts: 1
Credit: 2,938,600
RAC: 0
Message 2357 - Posted: 12 Sep 2017, 11:23:25 UTC - in response to Message 2340.  

If any WU calculates for longer than 6 hours, feel free to abort it.


Or turn off getting new tasks until this is fixed.
Jacob Klein
Joined: 21 Feb 15
Posts: 53
Credit: 1,383,221
RAC: 0
Message 2359 - Posted: 15 Sep 2017, 12:09:38 UTC - in response to Message 2357.  
Last modified: 15 Sep 2017, 12:10:10 UTC

If any WU calculates for longer than 6 hours, feel free to abort it.


Or turn off getting new tasks until this is fixed.


EXACTLY.

I have several PCs doing BOINC work, and I can't monitor the details of every task that they do. This problem is real, and it wastes resources, making a CPU thread completely useless, as it spins its wheels on a task that won't complete...

The devs here should put more effort into solving this problem, instead of not caring about wasted resources. Hell, for that reason alone I'd set "No New Tasks", but I'll also do it because the tasks here sometimes don't work and waste my CPU.

I still hope for a fix, but in the meantime, you don't deserve my CPU if you're going to abuse it. "No New Tasks" for you.
Jim1348
Joined: 28 Feb 15
Posts: 177
Credit: 41,273,747
RAC: 181,679
Message 2360 - Posted: 18 Sep 2017, 11:01:41 UTC - in response to Message 2359.  

I may have mentioned this before, but the long runners seem to correlate with running other work units. I have been running Universe/BHspin v2 mostly by itself for a couple of months, and saw no long runners. Recently, I added LHC/SixTrack to this machine, and picked up a couple of long runners today.
http://universeathome.pl/universe/result.php?resultid=26150652
http://universeathome.pl/universe/result.php?resultid=26150707

That is not much proof, and may be hard to fix, but I mention it for what it is worth.
cykodennis
Joined: 4 Feb 15
Posts: 24
Credit: 6,003,527
RAC: 0
Message 2361 - Posted: 19 Sep 2017, 8:03:39 UTC - in response to Message 2360.  

AFAIR, I can confirm this. Things started to get messy on my machine when I ran LHC and Universe together.
That doesn't necessarily have to mean something, however...
"I should bring one important point to the attention of the authors and that is, the world is not the United States..."
Jim1348
Joined: 28 Feb 15
Posts: 177
Credit: 41,273,747
RAC: 181,679
Message 2362 - Posted: 19 Sep 2017, 13:40:08 UTC - in response to Message 2361.  

I am going to try a little trick, and see how it works.

Normally, LHC/SixTrack has either a lot of work or none at all. So instead of mixing it up with Universe, I have set Universe to 0 resource share. That way, when SixTrack has work, it will run by itself. And then, when SixTrack is out of work, Universe will run by itself. Maybe it will avoid some of the problems.
Luigi R.
Joined: 10 Sep 15
Posts: 12
Credit: 5,407,933
RAC: 0
Message 2363 - Posted: 19 Sep 2017, 16:37:51 UTC
Last modified: 19 Sep 2017, 16:40:41 UTC

I don't run SixTrack, but I have got 4 stuck WUs and 1 suspect one.

They're all named universe_bh2_160803_181_*.

http://universeathome.pl/universe/workunit.php?wuid=11051278
http://universeathome.pl/universe/workunit.php?wuid=11051513
http://universeathome.pl/universe/workunit.php?wuid=11051517
http://universeathome.pl/universe/workunit.php?wuid=11051573
http://universeathome.pl/universe/workunit.php?wuid=11051697

error.dat files
error: function Lzahbf(M,Mc) should not be called for HM stars
error: function Lzahbf(M,Mc) should not be called for HM stars
unexpected remnant case for K=5-6: 288457

error: function Lzahbf(M,Mc) should not be called for HM stars
error: function Lzahbf(M,Mc) should not be called for HM stars
unexpected remnant case for K=5-6: 682603

error: bondi() accreted mass (35.166687) larger than envelope mass (33.435905) (357777)
error: bondi() accreted mass (9.873656) larger than envelope mass (8.763593) (395479)
error: bondi() accreted mass (12.880583) larger than envelope mass (9.454412) (459214)
error: bondi() accreted mass (9.679004) larger than envelope mass (8.249970) (469690)
error: bondi() accreted mass (9.564457) larger than envelope mass (9.345383) (585174)
error: bondi() accreted mass (10.187786) larger than envelope mass (6.455279) (611918)
error: bondi() accreted mass (5.985729) larger than envelope mass (4.112997) (645555)

error: bondi() accreted mass (35.166687) larger than envelope mass (33.435905) (357777)
error: bondi() accreted mass (9.873656) larger than envelope mass (8.763593) (395479)
error: bondi() accreted mass (12.880583) larger than envelope mass (9.454412) (459214)
error: bondi() accreted mass (9.679004) larger than envelope mass (8.249970) (469690)
error: bondi() accreted mass (9.564457) larger than envelope mass (9.345383) (585174)
error: bondi() accreted mass (10.187786) larger than envelope mass (6.455279) (611918)
error: bondi() accreted mass (5.985729) larger than envelope mass (4.112997) (645555)

error: function Lzahbf(M,Mc) should not be called for HM stars
error: function Lzahbf(M,Mc) should not be called for HM stars
unexpected remnant case for K=5-6: 446956

error: bondi() accreted mass (5.599060) larger than envelope mass (4.544127) (28008)
error: bondi() accreted mass (5.330190) larger than envelope mass (4.395586) (105539)
error: bondi() accreted mass (2.216546) larger than envelope mass (1.953147) (135074)
error: bondi() accreted mass (5.860709) larger than envelope mass (3.212747) (195016)
error: bondi() accreted mass (5.714418) larger than envelope mass (5.400754) (218800)
error: bondi() accreted mass (5.962354) larger than envelope mass (5.099944) (257882)
error: bondi() accreted mass (9.910301) larger than envelope mass (8.742725) (301521)
error: bondi() accreted mass (7.603677) larger than envelope mass (6.561131) (313873)
error: bondi() accreted mass (5.856054) larger than envelope mass (5.343091) (321142)
error: bondi() accreted mass (5.580022) larger than envelope mass (4.316905) (340643)

error: bondi() accreted mass (5.599060) larger than envelope mass (4.544127) (28008)
error: bondi() accreted mass (5.330190) larger than envelope mass (4.395586) (105539)
error: bondi() accreted mass (2.216546) larger than envelope mass (1.953147) (135074)
error: bondi() accreted mass (5.860709) larger than envelope mass (3.212747) (195016)
error: bondi() accreted mass (5.714418) larger than envelope mass (5.400754) (218800)
error: bondi() accreted mass (5.962354) larger than envelope mass (5.099944) (257882)
error: bondi() accreted mass (9.910301) larger than envelope mass (8.742725) (301521)
error: bondi() accreted mass (7.603677) larger than envelope mass (6.561131) (313873)
error: bondi() accreted mass (5.856054) larger than envelope mass (5.343091) (321142)
error: bondi() accreted mass (5.580022) larger than envelope mass (4.316905) (340643)

error: bondi() accreted mass (9.476663) larger than envelope mass (8.257562) (3481)
error: function Lzahbf(M,Mc) should not be called for HM stars
error: function Lzahbf(M,Mc) should not be called for HM stars
unexpected remnant case for K=5-6: 21404


What a waste!
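To compare error.dat files across several stuck WUs, the lines listed above could be tallied by type. The following is a minimal sketch that relies only on the three message shapes visible in this thread. The category labels and function name are hypothetical, and the third number in a bondi() line is treated as an opaque identifier because its meaning is not stated here.

```python
import re
from collections import Counter

# Shape taken from the error.dat lines quoted in this thread.
BONDI_RE = re.compile(
    r"bondi\(\) accreted mass \(([\d.]+)\) "
    r"larger than envelope mass \(([\d.]+)\) \((\d+)\)"
)


def summarize(error_text):
    """Count error lines by type and track the largest
    accreted-minus-envelope mass difference among bondi() errors."""
    kinds = Counter()
    worst_excess = 0.0
    for line in error_text.splitlines():
        line = line.strip()
        if not line:
            continue
        m = BONDI_RE.search(line)
        if m:
            kinds["bondi"] += 1
            accreted, envelope = float(m.group(1)), float(m.group(2))
            worst_excess = max(worst_excess, accreted - envelope)
        elif "Lzahbf" in line:
            kinds["Lzahbf"] += 1
        elif line.startswith("unexpected remnant case"):
            kinds["remnant"] += 1
        else:
            kinds["other"] += 1
    return kinds, worst_excess


# The last error.dat listing from the post above.
sample = """\
error: bondi() accreted mass (9.476663) larger than envelope mass (8.257562) (3481)
error: function Lzahbf(M,Mc) should not be called for HM stars
error: function Lzahbf(M,Mc) should not be called for HM stars
unexpected remnant case for K=5-6: 21404
"""

kinds, worst_excess = summarize(sample)
print(dict(kinds))
print(round(worst_excess, 6))
```

A tally like this only describes what the files contain. It does not show which message, if any, is connected to the hang.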
Luigi R.
Joined: 10 Sep 15
Posts: 12
Credit: 5,407,933
RAC: 0
Message 2364 - Posted: 19 Sep 2017, 16:40:05 UTC - in response to Message 2363.  
Last modified: 19 Sep 2017, 16:40:25 UTC

Delete this post.
nexiagsi16v
Joined: 28 Feb 15
Posts: 8
Credit: 8,705,913
RAC: 8,010
Message 2367 - Posted: 20 Sep 2017, 18:20:43 UTC

hsdecalc
Joined: 2 Mar 15
Posts: 6
Credit: 2,605,104
RAC: 6,128
Message 2375 - Posted: 25 Sep 2017, 12:35:37 UTC

I regularly get endless tasks too, on a Win 10 PC which runs 24/7. After aborting the last WU at 14 hours, I found the process still running in Task Manager two days later!!! So the process wasn't cancelled, only removed from BOINC. I have wasted 80 hours. So I can't run any more WUs because of this bad behavior.
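On Windows, a leftover process like this can be ended from a script instead of Task Manager by calling the standard taskkill command. This is a cautious sketch, not a recommended procedure. The helper names are hypothetical, and the image name is the app_file quoted earlier in the thread. Note that killing by image name ends every process with that name, including healthy tasks of the same application.

```python
import subprocess


def build_kill_command(image_name):
    """Build a taskkill command line.
    /F forces termination, /IM selects processes by image name."""
    return ["taskkill", "/F", "/IM", image_name]


def kill_leftover(image_name):
    """Run taskkill (Windows only) and return its exit code."""
    result = subprocess.run(build_kill_command(image_name), check=False)
    return result.returncode


if __name__ == "__main__":
    # Only prints the command; call kill_leftover() to actually run it.
    print(" ".join(build_kill_command("BHspin2_1_windows_intelx86.exe")))
```

To target a single stuck task rather than all of them, taskkill also accepts /PID with the process ID shown in Task Manager.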