extreme long wu's
Author | Message |
---|---|
Joined: 12 May 15 Posts: 3 Credit: 5,097,036 RAC: 0 |
What are these errors? I've cancelled 8 tasks at about 130-145 hours each; they were suspended but still occupied 100% CPU. Over the last 24 hours, task progress was zero. |
Joined: 1 Oct 16 Posts: 32 Credit: 268,033 RAC: 0 |
I've just aborted another one. No New Tasks set. There is a serious flaw in the application here; I wonder how many crunchers are having their systems' time wasted. |
Joined: 28 Feb 15 Posts: 253 Credit: 200,562,581 RAC: 0 |
It must be specific to some types of systems. I haven't seen it in my current logs at all, and I think only one earlier, as posted above. http://universeathome.pl/universe/results.php?hostid=57772&offset=0&show_names=0&state=4&appid= I am on Ubuntu 16.10, running on an i7-4790, and this is a dedicated crunching machine that I leave on 24/7. Maybe if people report more about their systems, some sort of pattern will emerge. |
Joined: 4 Feb 15 Posts: 847 Credit: 144,180,465 RAC: 0 |
Quote: "I've just aborted another one. No New Tasks set. There is a serious flaw in the application here, I wonder how many crunchers are having their systems time wasted." The error rate is 0.6106%, and that covers all errors, not only this one... Krzysztof 'krzyszp' Piszczek Member of Radioactive@Home team My Patreon profile Universe@Home on YT |
Joined: 21 Feb 15 Posts: 8 Credit: 5,355,308 RAC: 0 |
I have aborted 6 WUs in the last few days! At some point checkpointing stops working, and they go on "forever" (or so it seems :)). The last one had run for 20 hours (25.xxx%) when checkpointing stopped, after running for approx. 1 hour! I restarted BOINC, and the WU resumed from the last checkpoint (20.xxx%), then came to a halt again after a short time, on reaching the same 25.xxx% as before. I let it run for 3-4 hours more with no progress, so I aborted it... |
Joined: 4 Feb 15 Posts: 847 Credit: 144,180,465 RAC: 0 |
The only thing I can suggest is to abort work units that compute for longer than the usual tasks do on a particular computer. It's a very rare situation, and I suspect it is due more to a particular configuration than to the source code and/or algorithm. Krzysztof 'krzyszp' Piszczek |
Joined: 1 Oct 16 Posts: 32 Credit: 268,033 RAC: 0 |
That is fine if people are sitting in front of their machines watching their tasks progress. That is not a realistic scenario; people who leave a machine running BOINC may not look at it for hours or days. I sometimes leave my machines running BOINC unattended for weeks if I am away. There is an issue with this project that is not an issue with others, simple as that. You find and fix it, or the crunchers leave. |
Joined: 5 Jul 16 Posts: 31 Credit: 18,447,833 RAC: 0 |
I must agree. It is an issue. |
Joined: 21 Feb 15 Posts: 53 Credit: 1,385,888 RAC: 0 |
Do the tasks get into a "running forever with no chance to complete" state? If so, then this seriously needs to get fixed. I have some computers also running unattended, and if they're wasting their resources, that is unacceptable. Admins: Has anything been done to determine the cause of these problems? I've tried to give relevant info in my other thread ... but we need input and advice from you too, on how these tasks are supposed to behave, and what you can do to fix the bad behaviors! Can you please help us? I'm going to set No New Tasks for this project, until the admins can fix this. Please, admins! ========================== For instance, the problematic one I'm going to abort right now had this. Application: Universe BHspin v2 0.01. Task Name: universe_bh2_160803_59_3_20000_1-999999_360000_2. URL: http://universeathome.pl/universe/result.php?resultid=19571218. "CPU time at last checkpoint" is 02:07:46 (2 hrs). "CPU time" is 14:32:06 (14.5 hrs). Estimated time remaining: 03:19:50 (AND INCREASING). Fraction done: 77.415%. log.txt has several "making checkpoint" entries, but the last one was at: 02:08:41 00:08:21 making checkpoint: j: 15000; iidd: 4305399. checkpoint.dat: 15000 4305399 0 1 2. error.dat: error: in Renv_con() unknown Ka type: 1, iidd_old: 4436666; error: in Menv_con() unknown Ka type: 1, iidd_old: 4436666. error.dat2 and error.dat3: warning: derivative from dlnRdt() not accurate, error: 1e+030, K: 8, 501481; warning: derivative from dlnRdlnM() not accurate, error: 1e+030, K: 8, 501481; warning: derivative from dlnRdt() not accurate, error: 1e+030, K: 8, 1081777; warning: derivative from dlnRdlnM() not accurate, error: 1e+030, K: 8, 1081777; warning: derivative from dlnRdt() not accurate, error: 1e+030, K: 8, 1748698; warning: derivative from dlnRdlnM() not accurate, error: 1e+030, K: 8, 1748698; warning: derivative from dlnRdt() not accurate, error: 1e+030, K: 8, 4203947; warning: derivative from dlnRdlnM() not accurate, error: 1e+030, K: 8, 4203947 |
Joined: 28 Feb 15 Posts: 253 Credit: 200,562,581 RAC: 0 |
I just terminated one also. I only noticed it because it was running "high priority" after almost five days. http://universeathome.pl/universe/result.php?resultid=19652511 But it was apparently completed (though not validated yet) by one other user. http://universeathome.pl/universe/workunit.php?wuid=8550767 I get the long ones only about once every 200 work units (estimate - the logs don't go back to the last one). Whether that is a problem for the project I can't say. |
Joined: 4 Feb 15 Posts: 847 Credit: 144,180,465 RAC: 0 |
I have an idea, let me check it... Krzysztof 'krzyszp' Piszczek |
Joined: 4 Feb 15 Posts: 847 Credit: 144,180,465 RAC: 0 |
OK, I have identified the functions where the problem exists and... it is really interesting! We will check it REALLY deeply, as it is... something which can give us more data. Thank you for the investigation and the deep checking (it would be impossible for us without your feedback). Krzysztof 'krzyszp' Piszczek |
Joined: 21 Feb 15 Posts: 53 Credit: 1,385,888 RAC: 0 |
Sounds great, honestly! Once you've resolved the issue, please let us know when it is safe to disable "No New Tasks", with no fear of getting a task that wastes the CPU. I'll be monitoring this thread. |
Joined: 6 Mar 15 Posts: 28 Credit: 16,721,329 RAC: 0 |
Quote: "Thank you for investigation and for deep checking (it will be impossible for us without your feedback)." Well done, Jacob Klein. |
Joined: 7 Feb 17 Posts: 6 Credit: 1,410,423 RAC: 0 |
I think I've had this issue too, but only on my Raspberry Pi model B. Tasks seemingly hanging on ~1% completion, with time remaining stretching months ahead. My Raspberry Pi 2 seems to crunch away just fine though, as does my desktop. Btw thanks for supporting the Raspberry Pi! |
Joined: 21 Feb 15 Posts: 53 Credit: 1,385,888 RAC: 0 |
krzyszp: Any response yet for: - Is this an application issue, or is it a bad batch of tasks? - Is it fixed? |
Joined: 4 Feb 15 Posts: 847 Credit: 144,180,465 RAC: 0 |
A new version of all apps will be available next week; I hope this resolves the problem. I ran a few of those problematic tasks locally and found that the same code compiled without BOINC support causes no problems, but compiled with BOINC support it shows the problem. So it suggests that some BOINC client versions cause the problem... Krzysztof 'krzyszp' Piszczek |
Joined: 1 Oct 16 Posts: 32 Credit: 268,033 RAC: 0 |
Have the new apps been released yet? It has been a week. |
Joined: 30 Jun 16 Posts: 42 Credit: 309,815,029 RAC: 0 |
Yep, I just aborted all my WUs. Whole stack of errors. My last valid WU was 19/02/17. |
Joined: 1 Oct 16 Posts: 32 Credit: 268,033 RAC: 0 |
Ah, err, that doesn't sound exactly optimal. Is the failure the same one this thread was about, or is this something new that the "fix" has introduced? |
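
One post in this thread reports a task whose "CPU time" (14:32:06) had run far past its "CPU time at last checkpoint" (02:07:46). As a rough illustration only, the Python sketch below computes that gap and flags a task once the gap exceeds a limit. The function names and the 3-hour limit are assumptions made up for this example; they are not part of BOINC or of the Universe@Home applications.

```python
from datetime import timedelta


def parse_hms(text: str) -> timedelta:
    """Convert an 'HH:MM:SS' string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def checkpoint_gap(cpu_time: str, cpu_time_at_checkpoint: str) -> timedelta:
    """CPU time spent since the last checkpoint was written."""
    return parse_hms(cpu_time) - parse_hms(cpu_time_at_checkpoint)


def looks_stalled(
    cpu_time: str,
    cpu_time_at_checkpoint: str,
    max_gap: timedelta = timedelta(hours=3),  # assumed limit, tune per machine
) -> bool:
    """True when the task has run longer than max_gap without checkpointing."""
    return checkpoint_gap(cpu_time, cpu_time_at_checkpoint) > max_gap


# Figures taken from the post quoted in this thread.
gap = checkpoint_gap("14:32:06", "02:07:46")
print(gap)                                      # 12:24:20
print(looks_stalled("14:32:06", "02:07:46"))    # True
```

For the task quoted in the thread the gap is 12 hours 24 minutes 20 seconds, so it would be flagged under the assumed limit. A sensible limit depends on how often a healthy task checkpoints on a given machine.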