Message boards :
Number crunching :
extreme long wu's
Author | Message |
---|---|
Send message Joined: 21 Feb 15 Posts: 53 Credit: 1,385,888 RAC: 0 |
Admins: Could you please try again to fix this? I am offering to help out with additional testing to help reproduce, isolate, and fix this issue. Until it is fixed, you are wasting resources, and I'm encouraging everyone I can to set "No New Tasks" on Universe@Home. Your lack of effort leaves no choice. |
Send message Joined: 10 Sep 15 Posts: 12 Credit: 20,067,933 RAC: 0 |
I noticed there is no "fraction done" progress when a task is faulty. I made a BASH script that suspends all the tasks that behave like this for a set interval (default = 10 minutes). I hope this helps on the volunteer side. Note, however, that in BOINC a project with suspended tasks will not fetch new tasks either. So I suggest choosing a backup project (resource share = 0) if you don't have another CPU project, because you won't get new tasks from U@H until you decide what to do (abort/resume) with the possibly stuck task. You only have to modify the boinc_path and boinccmd variables. https://pastebin.com/Rui6artX |
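For anyone who wants the idea without fetching the pastebin, here is a minimal sketch of such a watchdog. This is not Luigi's actual script: the project URL and interval are placeholder assumptions, and it relies only on the standard `boinccmd --get_tasks` and `boinccmd --task ... suspend` commands.

```shell
#!/bin/sh
# Sketch of a volunteer-side watchdog (assumptions: boinccmd is on PATH,
# and PROJECT_URL / INTERVAL match your setup). It takes two snapshots of
# "fraction done" per task, INTERVAL seconds apart, and suspends any task
# whose progress did not move between the snapshots.

PROJECT_URL="https://universeathome.pl/universe/"
INTERVAL=600   # 10 minutes, like the default mentioned above

# Turn `boinccmd --get_tasks` output into "name fraction" pairs, one per line.
parse_tasks() {
    awk '/^ *name: /          { name = $2 }
         /^ *fraction done: / { print name, $3 }'
}

snapshot() { boinccmd --get_tasks | parse_tasks; }

watchdog() {
    before=$(snapshot)
    sleep "$INTERVAL"
    after=$(snapshot)
    # Suspend every task that reports the same fraction in both snapshots.
    echo "$before" | while read -r name frac; do
        if echo "$after" | grep -qxF "$name $frac"; then
            echo "no progress on $name, suspending"
            boinccmd --task "$PROJECT_URL" "$name" suspend
        fi
    done
}

# Only run when a BOINC client is actually reachable.
if command -v boinccmd >/dev/null 2>&1; then
    watchdog
fi
```

One design note: matching on the exact fraction string is deliberately strict; any movement at all counts as progress, so healthy tasks near a slow phase are not suspended by mistake.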
Send message Joined: 2 Mar 15 Posts: 7 Credit: 4,296,304 RAC: 0 |
A set of >80 WUs ran without error, except for one. Today I again killed an endless job (26 h): it was still running, but BOINC showed it as suspended (Win10, German locale). https://universeathome.pl/universe/workunit.php?wuid=11890988 I have a copy of the slot folder. File error.dat: error: in Renv_con() unknown Ka type: 1, iidd_old: 197723; error: in Menv_con() unknown Ka type: 1, iidd_old: 197723 |
Send message Joined: 28 Feb 15 Posts: 253 Credit: 200,562,581 RAC: 0 |
I am going to try a little trick, and see how it works. I tried my trick (https://universeathome.pl/universe/forum_thread.php?id=199&postid=2362#2362) for a couple of weeks with no problems. Then, I disabled LHC entirely for the last week or two, and ran only Universe. But I just got a long runner today that was going for about 8 or 9 hours as I recall. However, a reboot fixed it and it seems to have completed normally. So the long runners still occur about once a month, though it is a minor problem for me. |
Send message Joined: 28 Feb 15 Posts: 253 Credit: 200,562,581 RAC: 0 |
But I just got a long runner today that was going for about 8 or 9 hours as I recall. However, a reboot fixed it and it seems to have completed normally. Apparently the reboot did not fix it. I kept track of the work unit number this time, and the reboot just reset the time to a low value (30 minutes), but it was still stuck at 52% complete (at 19 hours before the reboot). Here it is, and I will abort it shortly: https://universeathome.pl/universe/result.php?resultid=27353831 But it completed normally on another machine, so whatever causes them to get stuck seems to be almost random. There may be nothing "wrong" with the work units, just something external in the system that sometimes causes problems. It will be hard to track down. |
Send message Joined: 10 Sep 15 Posts: 12 Credit: 20,067,933 RAC: 0 |
My script caught a faulty task today. A lot of hours saved. :) https://universeathome.pl/universe/result.php?resultid=27717637 |
Send message Joined: 2 Mar 15 Posts: 7 Credit: 4,296,304 RAC: 0 |
Yesterday again: one WU whose latest checkpoint was 30 h old! I terminated BOINC and restarted. The job started from the last checkpoint (1:07 runtime). Two hours later, still no new checkpoint. An endlessly running job, aborted. Conclusion: no new tasks in the coming months! |
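A related volunteer-side check, sketched under the assumption of a Linux client with the Debian/Ubuntu default data directory (adjust the path for other setups): since a stuck job stops writing checkpoints, a slot directory whose files have not changed for hours is a reasonable red flag.

```shell
#!/bin/sh
# Sketch: flag slot directories where no file has been modified for a while,
# a sign that a task has stopped checkpointing. SLOTS is the Debian/Ubuntu
# default path (an assumption; adjust it for your installation).

find_stale_slots() {   # usage: find_stale_slots <slots_dir> <minutes>
    for d in "$1"/*/; do
        [ -d "$d" ] || continue
        # any file modified within the last N minutes? if not, flag the slot
        if [ -z "$(find "$d" -type f -mmin -"$2" 2>/dev/null)" ]; then
            echo "possibly stuck: $d"
        fi
    done
}

SLOTS=/var/lib/boinc-client/slots
[ -d "$SLOTS" ] && find_stale_slots "$SLOTS" 120
```

The two-hour threshold is an arbitrary choice; a normally checkpointing BHspin task should touch its slot files far more often than that.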
Send message Joined: 28 Feb 15 Posts: 253 Credit: 200,562,581 RAC: 0 |
I have moved my Universe work to an Ivy Bridge (i7-3770) machine to see if it changes the long runners from what I was getting on my Haswell (i7-4770) machine. Both are running Ubuntu 16.04.3 and Linux 4.10.0-37. They are the only CPU work units running, and I have been getting a long runner only every six weeks or so. If that doesn't do it, I will try a later version of Linux. |
Send message Joined: 28 Feb 15 Posts: 253 Credit: 200,562,581 RAC: 0 |
I have moved my Universe work to an Ivy Bridge (i7-3770) machine to see if it changes the long runners from what I was getting on my Haswell (i7-4770) machine. Both are running Ubuntu 16.04.3 and Linux 4.10.0-37. They are the only CPU work units running, and I have been getting a long runner only every six weeks or so. I tried it on two Ivy Bridge machines (both i7-3770 with identical motherboards), and picked up long runners on both right away. One was running Ubuntu 16.04 (Linux 4.10.0-37) and the other Ubuntu 17.10 (Linux 4.13.0-16). So changing the OS or the CPUs does not fix it. But all the long runners were picked up in the batch right after a reboot. That may explain why I normally don't get them very often, because my machines run 24/7 for weeks at a time. I will just leave this one running and see how it does. |
Send message Joined: 28 Feb 15 Posts: 253 Credit: 200,562,581 RAC: 0 |
But all the long runners were picked up in the batch right after a reboot. That may explain why I normally don't get them very often, because my machines run 24/7 for weeks at a time. I will just leave this one running and see how it does. I had to reboot on 3 November to install a utility, but that did not cause any long runners this time. The long runners are probably induced when I pause Universe work units in order to allow others to finish. This would also explain why running multiple projects could produce long runners, since the Universe WUs would then be paused occasionally by the BOINC scheduler. But I have not really tested this out yet, I am just letting the PC run with only Universe work units to see how far it can go without a long runner. My previous record was somewhere around 2 months, so maybe this time I can exceed it by not doing anything to pause a WU. Hopefully no forced shutdowns, etc. either during the test. |
Send message Joined: 23 Mar 16 Posts: 96 Credit: 23,431,842 RAC: 0 |
My Ubuntu 16.04 PC has a credit of 203,000, and gets completely shut down every day (hibernation has never worked properly). I don't do anything special for BOINC when shutting down. I've yet to experience a "long runner" WU on this or any of my other machines running U@H (which do occasionally get shut down completely, but not very often). So I'm not convinced you're on the right track there, but maybe I just need to crunch more WUs, and I don't have any better ideas. However, I can't help but notice that there appears to be a problem with a couple of WUs successfully crunched by two of my machines (both R-Pi): https://universeathome.pl/universe/workunit.php?wuid=11721442 https://universeathome.pl/universe/workunit.php?wuid=12326875 These appear to be causing indigestion for other machines which ordinarily crunch successfully. One machine went "over time", and terminated the task. See also: https://universeathome.pl/universe/forum_thread.php?id=282 |
Send message Joined: 28 Feb 15 Posts: 253 Credit: 200,562,581 RAC: 0 |
My Ubuntu 16.04 PC has a credit of 203,000, and gets completely shut down every day (hibernation has never worked properly). I don't do anything special for BOINC when shutting down. I've yet to experience a "long runner" WU on this or any of my other machines running U@H (which do occasionally get shut down completely, but not very often). So I'm not convinced you're on the right track there, but maybe I just need to crunch more WUs, and I don't have any better ideas. Your experience shows that R-Pi machines may not have the problem, even though the Intel machines do. However, you really don't have enough samples to draw much of a conclusion. I usually have to go several weeks of continuous operation before getting one (total credit of 9,440,456 over several machines now). And often when I check my "long runners", I find that they have completed successfully on other machines, and I think vice-versa. So it is some seemingly random phenomenon, and not necessarily "bad" work units (or at least not different from any others, and they all share some weakness). I have informed the project developer of several of my aborted work units, as he requested, but apparently he could not reproduce it. EDIT: My point may not have been clear, but I now think that the reboot of this machine did not induce the long runner. Rather, when I attached this machine to Universe, I probably paused the newly-downloaded Universe work units in order to allow some other project (Rosetta as I recall) to finish. That may have induced the problem, but that is what I am testing now. |
Send message Joined: 23 Mar 16 Posts: 96 Credit: 23,431,842 RAC: 0 |
Just recently I've been playing with BOINC on Android. Initially I had it crunch only when the charger was plugged in, but every time I switched the phone to another power socket, BOINC tasks would suspend and I would lose a percentage of the progress. These losses are often quite large, and certainly more than could be explained by how often BOINC writes to disk. One U@H task lost 100 percentage points after days of crunching. So there does appear to be something wrong with what happens when a task suspends. |
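A partial mitigation, assuming the standard BOINC client preferences (these are stock global_prefs_override.xml settings, not anything U@H-specific): keeping suspended tasks in memory avoids falling back to the last checkpoint at all, and a short disk interval lets applications checkpoint more often. Note that the disk interval only permits checkpoints; it cannot force an application that rarely offers them.

```xml
<!-- global_prefs_override.xml (sketch; place it in the BOINC data directory
     and re-read it with "boinccmd --read_global_prefs_override") -->
<global_preferences>
   <!-- keep suspended non-GPU tasks in RAM instead of evicting them -->
   <leave_apps_in_memory>1</leave_apps_in_memory>
   <!-- ask applications to checkpoint at most every 60 seconds -->
   <disk_interval>60</disk_interval>
</global_preferences>
```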
Send message Joined: 15 Oct 17 Posts: 11 Credit: 4,735,011 RAC: 0 |
I have regular endless tasks too, on a Win 10 PC which runs 24/7. Two days after aborting the last WU (at 14 hours), I found the process still running in Task Manager!!! So the process wasn't cancelled, only removed from BOINC. I have 80 hours of wasted time, so I can't run any more WUs because of this bad behavior. My i7-6700K Windows 10 Pro system has just had its first task which got stuck in a loop (the _2 task in WU universe_bh2_160803_202_16199984_20000_1-999999_200200, with the _1 task being successfully completed using the android_arm_pie application). The task made a single checkpoint after 0:03:27 and was aborted in BOINC Manager 5:20:29 later. It must have been in a loop because it kept running and had to be manually killed from Process Explorer (i.e. it was never executing the BOINC API code which checks for messages from the BOINC core client). It was run with no interruptions, and BOINC's event log shows that the _0 task from universe_bh2_160803_202_94384906_20000_1-999999_385200 was successfully completed 1:14:40 before the problem task started: 07-Nov-2017 14:24:10 [Universe@Home] Starting task universe_bh2_160803_202_94384906_20000_1-999999_385200_0 "The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer |
Send message Joined: 28 Feb 15 Posts: 253 Credit: 200,562,581 RAC: 0 |
After crunching for a week with no problems (setup as noted above with no suspensions or other projects running), I picked up four long runners in quick succession over a 24-hour period. They ran for about 17 hours and then got stuck. But rather than aborting them, I rebooted, and two of them got unstuck and completed normally. https://universeathome.pl/universe/result.php?resultid=28274101 https://universeathome.pl/universe/result.php?resultid=28267843 However, the last two remained stuck. So after the first two completed, I rebooted again. Surprisingly, the last two then completed normally also. https://universeathome.pl/universe/result.php?resultid=28274042 https://universeathome.pl/universe/result.php?resultid=28274043 So there are no "bad" work units, only something causing them to get stuck at random intervals ranging from days to about two months apart for me. That is strange. Maybe it is a hyper-threading problem, or a CPU cache limitation that makes some of them hang? I don't know at this point. |
Send message Joined: 9 Nov 17 Posts: 21 Credit: 563,207,000 RAC: 0 |
Luigi R. wrote: I noticed there is no "fraction done" progress when a task is faulty. Luigi, thank you very much for this script. It's a workaround, but an excellent one. |
Send message Joined: 9 Nov 17 Posts: 21 Credit: 563,207,000 RAC: 0 |
Jim1348 wrote: After crunching for a week with no problems (setup as noted above with no suspensions or other projects running), I picked up four long runners in quick succession over a 24-hour period. They ran for about 17 hours and then got stuck. It seems random to me. If I switch the option "Leave non-GPU tasks in memory while suspended" off, suspend tasks which get stuck, and a while later un-suspend them, I have seen any one of the following outcomes:
|
Send message Joined: 28 Feb 15 Posts: 253 Credit: 200,562,581 RAC: 0 |
Actually, I have also seen situations where a reboot did not fix the stuck ones and they had to be aborted. I think you have a good list of the possibilities. I will mention another possibility. I have just built a new Ryzen 1700 machine on 10 November, and have been running BHspin v2.0 on 15 cores (with one core being reserved for CPU support) for 2 1/2 days now with no long runners. That does not prove anything yet, but I am impressed with the consistency of the run times, which are all in the range of 1.5 hours to a little over 4 hours, averaging around 2 1/2 hours. That is about the same average as for my i7-3770 machine (both on Ubuntu), but the times ranged from 1.5 hours to around 12 hours on the Intel chip. My guess is that has something to do with the "long runners", and the more consistent performance of the AMD chip hopefully means there will be fewer (or none) of them. But it will have to run for at least several weeks to get some idea about that. |
Send message Joined: 28 Feb 15 Posts: 253 Credit: 200,562,581 RAC: 0 |
My Ryzen machine has now picked up two long-runners, so it is not immune either. After a reboot, both started working again, but there is a hitch. The first one to get stuck resumed at the beginning (0 percent), while the second one to get stuck resumed progress where it left off at the time of the reboot (around 30 percent). Also, there are now two other work units that are past the 6 1/2 hour point, even though they are continuing to make progress. But that is a little long for this machine, and I have not seen it before. So my present guess is that the first long runner induces problems with the other work units that are running at the time, causing them to slow down and eventually get stuck too. Maybe it is due to the first long runner using up too much cache memory, or some other common resource? I have to leave it to the experts here. I am out of hardware/software combinations to try. Good luck to all. |
Send message Joined: 4 Nov 17 Posts: 3 Credit: 27,209,667 RAC: 0 |
Hi, in the last few days on my dual Xeon I have been getting many long WUs that stay stuck. I cancelled them, but it wastes a lot of resources. Will this be fixed, or is it better to stop working for this project until it is? I'm able to send you the numbers of these WUs ... |