Message boards : Number crunching : extreme long wu's
Author | Message |
---|---|
Joined: 19 May 16 Posts: 9 Credit: 215,352,825 RAC: 0 |
I don't think they're long WUs; they're stuck WUs. So far the ones I get don't checkpoint at all: when I look in the slot folder, there is no update to any file even after many, many hours. My sample is very small, but so far it's happened on my Intel CPUs but not on my AMD. |
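A quick way to confirm that kind of stall is to check whether anything in the task's slot directory is still being written. Here is a minimal C++17 sketch (my own, not from this thread); the default slot path is only an example and will differ between installations:

```cpp
// Print how long ago each file in a BOINC slot directory was last written.
// If nothing in the slot has been touched for hours, the task is likely stuck.
// The default path below is an example; pass the real slot directory as argv[1].
#include <chrono>
#include <filesystem>
#include <iostream>

int main(int argc, char* argv[]) {
    namespace fs = std::filesystem;
    const fs::path slot = (argc > 1) ? argv[1] : "/var/lib/boinc-client/slots/0";
    const auto now = fs::file_time_type::clock::now();
    for (const auto& entry : fs::directory_iterator(slot)) {
        const auto age = std::chrono::duration_cast<std::chrono::minutes>(
            now - entry.last_write_time()).count();
        std::cout << entry.path().filename().string()
                  << " last written " << age << " min ago\n";
    }
}
```

Checking a suspect task's slot an hour or two apart and seeing no file ages reset is a fairly reliable sign it is stuck rather than merely long.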
Joined: 29 Sep 17 Posts: 5 Credit: 17,563,100 RAC: 0 |
Been getting WUs that stick at 100% but never actually complete, so I have to abort them. This has been constant on an ARM CPU running Android. I've never had so many in a row, continuing for days, before. Oddly, my other systems have been running fine for once. |
Joined: 21 Feb 15 Posts: 53 Credit: 1,385,888 RAC: 0 |
If the admin were serious about fixing the problem, it would have been fixed. I gave everything I could to try to help him, but he never treated it as a significant problem. All I can do is recommend people set "No New Tasks" for Universe@Home until the admin takes this more seriously and has fixes for us to test. Jacob Klein |
Joined: 18 Nov 17 Posts: 3 Credit: 7,333 RAC: 0 |
I wish I'd looked more deeply into this before joining up. My favorite is a task listed with a 30-day estimated runtime but a due date 14 days in the future. Of course I aborted that, but I shouldn't have to micromanage compute; tasks should be well behaved, self-policing, etc. I did bump up to the 7.8.4 client, just in case (kind of a hassle on xenial, but not too bad). It looks like U@H is not the only problematic project, but it seems to have the biggest problems. IIRC, UID 1 has backed off of the project due to family constraints, and UID 2 basically owns the project. My guess is they just really do not have the time or resources to troubleshoot this fully, but are hesitant to kill the project either. |
Joined: 27 Feb 15 Posts: 1 Credit: 103,310,906 RAC: 0 |
universe_bh2_160803_212_1994999_20000_1-999999_995200: my WU ran for 5d 21h 49m, so I restarted it. Now it shows 1h 44m and stops again at 91.970%. I will abort it. Greetings, [SG-2W]Kurzer |
Joined: 15 Oct 15 Posts: 1 Credit: 29,834,581 RAC: 0 |
I noticed long work units on only one of my systems: Linux, Ubuntu 16.04 LTS x86_64 (for the amdgpu driver), AMD Ryzen 7 1700. I have 4 tasks hanging right now. Their slots/*/error.dat contents are as follows:

    error: function Lzahbf(M,Mc) should not be called for HM stars
    error: function Lzahbf(M,Mc) should not be called for HM stars
    unexpected remnant case for K=5-6: 456132
    error: function Lzahbf(M,Mc) should not be called for HM stars
    error: function Lzahbf(M,Mc) should not be called for HM stars
    unexpected remnant case for K=5-6: 490907
    error: function Lzahbf(M,Mc) should not be called for HM stars
    error: function Lzahbf(M,Mc) should not be called for HM stars
    unexpected remnant case for K=5-6: 739907
    error: function Lzahbf(M,Mc) should not be called for HM stars
    error: function Lzahbf(M,Mc) should not be called for HM stars
    unexpected remnant case for K=5-6: 615205

Reading through the thread I noticed that the errors reported in error.dat have changed over time, probably being fixed in the app or in the input data, but the problem remains. So it is probably a layer up: something is messed up in the handling of these errors.

I also checked what working and "hanging" tasks do. strace output looks similar, but ltrace is very different.

Working task's ltrace sample (ltrace -fp PID):

    [pid 20453] pow(0x7f0d884383a0, 1022, 0x7f0d88491d00, 384) = 19
    [pid 20453] pow(0x7f0d884383a0, 0x7f0d884363a0, 0x40c0000000000000, 333) = 279
    [pid 20453] log10(0x7f0d884383a0, 0x7f0d884363a0, 0x4010000000000000, 61) = 0x7f0d88492440
    [pid 20453] pow(0x7f0d884383a0, 1022, 0x7f0d88491d00, 384) = 19
    [pid 20453] pow(0x7f0d884383a0, 0x7f0d884363a0, 0x40c0000000000000, 333) = 631
    [pid 20453] pow(0x7f0d884383a0, 0x7f0d884363a0, 0x4000000000000000, 291) = 83
    [pid 20453] pow(0x7f0d884383a0, 0x7f0d884363a0, 0x4040000000000000, 33) = 347
    [pid 20453] pow(0x7f0d884383a0, 0x7f0d884363a0, 0x3fd0000000000000, 425) = 395
    [pid 20453] pow(0x7f0d884383a0, 0x7f0d884363a0, 0x3fe0000000000000, 421) = 0xbff0000000000000
    [pid 20453] pow(0x7f0d884383a0, 0x4002005e, 0x4002005e, 0x4002005e573c3572) = 777
    [pid 20453] log10(0x7f0d884383a0, 0x7f0d884363a0, 0x4090000000000000, 377) = 0x7f0d88492440
    [pid 20453] pow(0x7f0d884383a0, 1022, 0x7f0d88491d00, 384) = 19
    [pid 20453] pow(0x7f0d884383a0, 0x7f0d884363a0, 0x40c0000000000000, 333) = 279
    [pid 20453] log10(0x7f0d884383a0, 0x7f0d884363a0, 0x4010000000000000, 61) = 0x7f0d88492440
    [pid 20453] pow(0x7f0d884383a0, 1022, 0x7f0d88491d00, 384) = 19
    [pid 20453] pow(0x7f0d884383a0, 0x7f0d884363a0, 0x40c0000000000000, 333) = 631
    [pid 20454] gettimeofday(0x888cde30, 0) = 0
    [pid 20454] gettimeofday(0x888cde30, 0) = 0
    [pid 20454] usleep(100000 <unfinished ...>
    [pid 20453] pow(0x7f0d884383a0, 0x7f0d884363a0, 0x4000000000000000, 291) = 83
    [pid 20453] pow(0x7f0d884383a0, 0x7f0d884363a0, 0x4040000000000000, 33) = 725
    [pid 20453] pow(0x7f0d884383a0, 0x7f0d884363a0, 0x3f40000000000000, 553) = 333
    [pid 20453] pow(0x7f0d884383a0, 0x7f0d884363a0, 0x40e0000000000000, 483) = 421
    [pid 20453] pow(0x7f0d884383a0, 0x7f0d884363a0, 0x4180000000000000, 39) = 373
    [pid 20453] pow(0x7f0d884383a0, 0x7f0d884363a0, 0x40b0000000000000, 527) = 903
    [pid 20453] pow(0x7f0d884383a0, 0x7f0d884363a0, 0x40f0000000000000, 319) = 725
    [pid 20453] log10(0x7f0d884383a0, 0x7f0d884363a0, 0x3f40000000000000, 553) = 0x7f0d88492440
    [pid 20453] pow(0x7f0d884383a0, 1022, 0x7f0d88491d00, 384) = 19
    [pid 20453] pow(0x7f0d884383a0, 0x7f0d884363a0, 0x40c0000000000000, 333) = 431
    [pid 20453] pow(0x7f0d884383a0, 0x7f0d884363a0, 0x4000000000000000, 139) = 61
    [pid 20453] log10(0x7f0d884383a0, 0x7f0d884363a0, 0x3ff0000000000000, 605) = 0x7f0d88492440
    [pid 20453] pow(0x7f0d884383a0, 1022, 0x7f0d88491d00, 384) = 19
    [pid 20453] pow(0x7f0d884383a0, 0x7f0d884363a0, 0x40c0000000000000, 333) = 279

and so on...

"Hanging" task's ltrace sample (ltrace -fp PID):

    [pid 13889] free(0x1a4e560) = <void>
    [pid 13889] free(0x1a4e530) = <void>
    [pid 13889] free(0x1a4e500) = <void>
    [pid 13889] free(0x1a4e4d0) = <void>
    [pid 13889] free(0x1a4e4a0) = <void>
    [pid 13889] free(0x1a4e470) = <void>
    [pid 13889] __isnan(0xffffffff, 0x7fc692dddb30, 0x1a4e490, 0) = 1
    [pid 13889] malloc(40) = 0x1a4e470
    [pid 13889] malloc(40) = 0x1a4e4a0
    [pid 13889] malloc(40) = 0x1a4e4d0
    [pid 13889] malloc(40) = 0x1a4e500
    [pid 13889] malloc(40) = 0x1a4e530
    [pid 13889] malloc(40) = 0x1a4e560
    [pid 13889] free(0x1a4e560) = <void>
    [pid 13889] free(0x1a4e530) = <void>
    [pid 13889] free(0x1a4e500) = <void>
    [pid 13889] free(0x1a4e4d0) = <void>
    [pid 13889] free(0x1a4e4a0) = <void>
    [pid 13889] free(0x1a4e470) = <void>
    [pid 13889] __isnan(0xffffffff, 0x7fc692dddb30, 0x1a4e490, 0) = 1
    [pid 13889] malloc(40) = 0x1a4e470
    [pid 13889] malloc(40) = 0x1a4e4a0
    [pid 13889] malloc(40) = 0x1a4e4d0
    [pid 13889] malloc(40) = 0x1a4e500
    [pid 13889] malloc(40) = 0x1a4e530
    [pid 13889] malloc(40) = 0x1a4e560
    [pid 13889] free(0x1a4e560) = <void>
    [pid 13889] free(0x1a4e530) = <void>
    [pid 13889] free(0x1a4e500) = <void>
    [pid 13889] free(0x1a4e4d0) = <void>
    [pid 13889] free(0x1a4e4a0) = <void>
    [pid 13889] free(0x1a4e470) = <void>
    [pid 13889] __isnan(0xffffffff, 0x7fc692dddb30, 0x1a4e490, 0) = 1

Looks like an infinite loop to me.

"Hanging" task's gdb backtrace, taken at random:

    (gdb) bt
    #0 get_T (L=L@entry=651959.17475758004, R=R@entry=1.60374190786905) at singl.c:4688
    #1 0x0000000000418274 in dorbdt (t=<optimized out>, y=y@entry=0x1a4e560, dydx=dydx@entry=0x1a4e4d0, KTa=KTa@entry=0, KTb=KTb@entry=2.9179459219507452e-10, Ma=Ma@entry=1.4399999999999999, Mb=Mb@entry=24.12289019752216, dMa=dMa@entry=nan(0x8000000000000), dMb=dMb@entry=0.55262531449484709, Ia=Ia@entry=-nan(0x8000000000000), Ib=Ib@entry=4.6753715902533024, Ra=Ra@entry=1.60374190786905, Rb=Rb@entry=1.6037419078694282, Rca=Rca@entry=inf, Rcb=Rcb@entry=0, wcrit_a=wcrit_a@entry=0, wcrit_b=wcrit_b@entry=0, La=La@entry=651959.17475758004, Lb=Lb@entry=651959.17475832114, Ka=Ka@entry=-978270896, Kb=Kb@entry=7, magb_a=magb_a@entry=0, magb_b=magb_b@entry=0) at binary.cpp:7817
    #2 0x000000000043244c in rkck1 (y=0x1a4f190, dydx=0x1a4f1c0, n=4, x=3.4806341581987525, h=<optimized out>, yout=0x1a4f220, yerr=0x1a4f1f0, derivs1=0x4181d0 <dorbdt(double, double*, double*, double, double, double, double, double, double, double, double, double, double, double, double, double, double, double, double, int, int, int, int)>, KTa=0, KTb=2.9179459219507452e-10, Ma=1.4399999999999999, Mb=24.12289019752216, dMa=nan(0x8000000000000), dMb=0.55262531449484709, Ia=-nan(0x8000000000000), Ib=4.6753715902533024, Ra=1.60374190786905, Rb=1.6037419078694282, Rca=inf, Rcb=0, wcrit_a=0, wcrit_b=0, La=651959.17475758004, Lb=651959.17475832114, Ka=-978270896, Kb=7, magb_a=0, magb_b=0) at binary.cpp:8427
    #3 0x000000000043398f in rkqs1 (y=y@entry=0x1a4f190, dydx=dydx@entry=0x1a4f1c0, n=n@entry=4, x=x@entry=0x7ffdc5b0b9b0, htry=0, eps=nan(0x8000000000000), eps@entry=0.001, yscal=yscal@entry=0x1a4f160, hdid=hdid@entry=0x7ffdc5b0b9d0, hnext=hnext@entry=0x7ffdc5b0b9c0, derivs1=derivs1@entry=0x4181d0 <dorbdt(double, double*, double*, double, double, double, double, double, double, double, double, double, double, double, double, double, double, double, double, int, int, int, int)>, KTa=KTa@entry=0, KTb=KTb@entry=2.9179459219507452e-10, Ma=Ma@entry=1.4399999999999999, Mb=Mb@entry=24.12289019752216, dMa=dMa@entry=nan(0x8000000000000), dMb=dMb@entry=0.55262531449484709, Ia=Ia@entry=-nan(0x8000000000000), Ib=Ib@entry=4.6753715902533024, Ra=Ra@entry=1.60374190786905, Rb=Rb@entry=1.6037419078694282, Rca=Rca@entry=inf, Rcb=Rcb@entry=0, wcrit_a=wcrit_a@entry=0, wcrit_b=wcrit_b@entry=0, La=La@entry=651959.17475758004, Lb=Lb@entry=651959.17475832114, Ka=Ka@entry=-978270896, Kb=Kb@entry=7, magb_a=magb_a@entry=0, magb_b=magb_b@entry=0) at binary.cpp:8365
    #4 0x000000000042f0ff in odeint1 (ystart=ystart@entry=0x1a4f130, nvar=nvar@entry=4, x1=x1@entry=3.4806341581987525, x2=x2@entry=3.4806531067030479, eps=eps@entry=0.001, h1=0.15274083617124923, hmin=hmin@entry=0, nok=nok@entry=0x7ffdc5b0bc50, nbad=nbad@entry=0x7ffdc5b0bc60, derivs1=derivs1@entry=0x4181d0 <dorbdt(double, double*, double*, double, double, double, double, double, double, double, double, double, double, double, double, double, double, double, double, int, int, int, int)>, rkqs1=rkqs1@entry=0x433760 <rkqs1(double*, double*, int, double*, double, double, double*, double*, double*, void (*)(double, double*, double*, double, double, double, double, double, double, double, double, double, double, double, double, double, double, double, double, int, int, int, int), double, double, double, double, double, double, double, double, double, double, double, double, double, double, double, double, int, int, int, int)>, KTa=0, KTb=2.9179459219507452e-10, Ma=Ma@entry=1.4399999999999999, Mb=Mb@entry=24.12289019752216, dMa=dMa@entry=nan(0x8000000000000), dMb=dMb@entry=0.55262531449484709, Ia=Ia@entry=-nan(0x8000000000000), Ib=Ib@entry=4.6753715902533024, Ra=Ra@entry=1.60374190786905, Rb=Rb@entry=1.6037419078694282, Rca=Rca@entry=inf, Rcb=Rcb@entry=0, wcrit_a=wcrit_a@entry=0, wcrit_b=wcrit_b@entry=0, La=La@entry=651959.17475758004, Lb=Lb@entry=651959.17475832114, Ka=Ka@entry=-978270896, Kb=Kb@entry=7, magb_a=magb_a@entry=0, magb_b=magb_b@entry=0) at binary.cpp:8326
    #5 0x0000000000430c93 in orb_change (t1=t1@entry=3.4806341581987525, t2=t2@entry=3.4806531067030479, a=a@entry=0x7ffdc5b0c400, e=e@entry=0x7ffdc5b0c410, wa=wa@entry=0x7ffdc5b0c420, wb=wb@entry=0x7ffdc5b0c430, tvira=0, tvirb=3.7897020774656023e-05, Ma=1.4399999999999999, Mb=24.12289019752216, M0a=1.4399999999999999, M0b=24.12289019752216, Mzamsa=147.78874082664947, Mzamsb=125.59129163845198, dMwinda=dMwinda@entry=nan(0x8000000000000), dMwindb=dMwindb@entry=0.55262531449484709, Mca=0, Mcb=0, Ra=1.60374190786905, Rb=1.6037419078694282, Raold=Raold@entry=nan(0x8000000000000), Rbold=Rbold@entry=1.6037347409964342, La=651959.17475758004, Lb=651959.17475832114, Ka=-978270896, Kb=7, Kaold=Kaold@entry=1, Kbold=Kbold@entry=7, mt=mt@entry=0, ce=ce@entry=0, mttype=0, Iaold=Iaold@entry=0x7ffdc5b0c440, Ibold=Ibold@entry=0x7ffdc5b0c450, KTa=KTa@entry=0x7ffdc5b0c590, KTb=KTb@entry=0x7ffdc5b0c598, dMmta=0, dMmtb=0, Mdisa=0, Mdisb=0, darwin=darwin@entry=0x7ffdc5b0c2e4) at binary.cpp:7750
    #6 0x000000000040f58c in main (argc=<optimized out>, argv=<optimized out>) at binary.cpp:1899

Those NaN values are quite high up in the backtrace, but I don't know whether they should be there or not. |
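The frames above (rkck1, rkqs1, odeint1) follow the classic Numerical Recipes pattern for adaptive step-size Runge-Kutta integration, and frame #3 shows eps arriving as nan and htry as 0. If the stepper works the way that pattern usually does, a hang like this has a simple mechanism: every comparison against NaN evaluates to false, so once NaN reaches the error estimate a step-acceptance test such as errmax <= 1.0 can never succeed and the shrink-and-retry loop never exits. A minimal sketch of that mechanism, as an illustration only and not the project's actual code:

```cpp
// Minimal sketch of how a Numerical Recipes style step-size loop can hang on
// NaN. This is an illustration, not code from binary.cpp.
#include <cmath>
#include <cstdio>

// Stand-in for one rkck-style trial step; it just returns an error estimate.
// Any NaN in the inputs propagates straight into the estimate.
double trial_error(double h, double dydx) {
    return std::fabs(h * dydx) * 1e-3;
}

// Shrink the step until the scaled error is acceptable. The loop is bounded
// by max_tries only so that this demo always terminates.
bool take_step(double& h, double dydx, double eps, int max_tries = 60) {
    for (int i = 0; i < max_tries; ++i) {
        double errmax = trial_error(h, dydx) / eps;
        if (errmax <= 1.0)         // always false when errmax is NaN
            return true;           // step accepted
        if (std::isnan(errmax))    // a guard of this kind fails fast
            return false;          // instead of retrying forever
        h *= 0.5;                  // shrink the step and try again
    }
    return false;
}

int main() {
    double h = 0.1;
    std::printf("finite dydx: %s\n", take_step(h, 1e6, 1e-3) ? "accepted" : "gave up");
    h = 0.1;
    std::printf("NaN dydx:    %s\n", take_step(h, std::nan(""), 1e-3) ? "accepted" : "gave up");
}
```

With the NaN input the acceptance test never passes; without an explicit std::isnan()/std::isfinite() check (or an iteration cap), the loop just keeps shrinking the step, which would match the busy but progress-free free()/__isnan()/malloc() cycle in the ltrace above.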
Joined: 23 Mar 16 Posts: 96 Credit: 23,431,842 RAC: 0 |
This one appears to be a candidate: https://universeathome.pl/universe/workunit.php?wuid=13505387 This is the first and only long-runner I've had to date (I aborted it when I noticed it was lagging behind other tasks and not making any progress). One host has managed to complete it and have it validated, but note that the run time was 1,870.11s and the CPU time was 0s. Otherwise, most hosts have choked on it. |
Joined: 4 Feb 15 Posts: 24 Credit: 7,035,527 RAC: 0 |
Is there any progress with this issue? I would like to do some work for a repaired U@H again... Btw, respect for all those users who try to analyze the issue. "I should bring one important point to the attention of the authors and that is, the world is not the United States..." |
Joined: 2 Jun 16 Posts: 169 Credit: 317,253,046 RAC: 0 |
"Is there any progress with this issue?" The issue is over two years old and spans several apps. I wouldn't count on it. |
Joined: 22 Feb 15 Posts: 23 Credit: 37,205,060 RAC: 0 |
I've been getting a TON of WUs over 3 and 4 days long, but they do complete. Problem is, I still get the same 666.67 points for a 3- or 4-day WU as I do for a 2-hour WU... Umm, maybe something could be done to correct this points problem? 8-) |
Joined: 29 Sep 17 Posts: 5 Credit: 17,563,100 RAC: 0 |
After a week of work units stuck at 100% across multiple operating systems and processors, I'm officially done running this project. I've not seen it this bad before, in both solo and pool crunching. Having to babysit 6 devices is completely unnecessary. I hope those that ride it out eventually get some satisfaction from whoever is maintaining this. |
Joined: 22 Feb 15 Posts: 23 Credit: 37,205,060 RAC: 0 |
Does any admin actually read the message boards???? 8-) |
Joined: 21 Feb 15 Posts: 53 Credit: 1,385,888 RAC: 0 |
In this thread, an admin gave up trying to solve the issue, despite my providing as much seemingly useful info as I could and offering my time to help reproduce and isolate it. I have set No New Tasks for Universe@Home on all my PCs so that my resources do not get wasted, and I've recommended the same to other BOINC friends. I hope an admin takes this more seriously at some point. |
Joined: 23 Mar 16 Posts: 96 Credit: 23,431,842 RAC: 0 |
As a result of the Meltdown/Spectre U@H shutdown, the density of long-runners has increased, because these are the tasks getting bounced from host to host. I have received four such tasks in the last couple of days. All four have been completed and validated by one host, but with a CPU time of zero seconds. Of the three Win 10 tasks, two have completed just fine, but one jams at 54.905%. When I first noticed it had jammed, I shut down BOINC Manager and waited for the BHSpin task to close, which it did after a number of minutes. When I restarted BOINC Manager the BHSpin task restarted from the checkpoint, and then jammed again at exactly the same point. I have saved the files, if anyone wants them. The fourth task is on a Pi and is awaiting other (non-U@H) tasks to complete. Edit: These entries in one of the files don't sound good: "error: in Renv_con() unknown Ka type: 1, iidd_old: 1324350" and "error: in Menv_con() unknown Ka type: 1, iidd_old: 1324350". |
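A message of that shape, with the task continuing to run and then re-jamming at the same checkpoint, would be consistent with an error branch that only prints and falls through, leaving the caller to carry on with an unusable value that then ends up in the checkpointed state. The sketch below is purely a guess at that kind of code path; Renv_con_sketch is my own stand-in, not the project's source:

```cpp
// Hypothetical illustration only: Renv_con_sketch is NOT the project's
// Renv_con(). It shows how a "log and fall through" error branch can keep a
// run alive while poisoning later math, which is one way NaN/inf could reach
// the integrator and then survive a restart from checkpoint.
#include <cstdio>
#include <limits>

double Renv_con_sketch(int Ka, double R) {
    switch (Ka) {
        case 2: case 3: case 4: case 5: case 6:
            return 0.5 * R;   // placeholder for a properly defined envelope radius
        default:
            std::fprintf(stderr, "error: in Renv_con() unknown Ka type: %d\n", Ka);
            // Returning a poison value keeps the task running but sick;
            // exiting with a nonzero status here would at least fail fast.
            return std::numeric_limits<double>::quiet_NaN();
    }
}

int main() {
    double r = Renv_con_sketch(1, 1.6);   // Ka = 1, as in the logged message
    std::printf("Renv = %f\n", r);        // prints nan; downstream math inherits it
}
```

If something of this kind is what is happening, turning such branches into hard failures (a nonzero exit that BOINC reports as a computation error) would at least stop these tasks from sitting at the same spot for days.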
Joined: 7 Nov 17 Posts: 2 Credit: 21,333 RAC: 0 |
Hi all, same long runs here, running Windows 7 on an Intel i7 CPU. When downloaded, the tasks said something like 2 hrs to complete. Now that I've passed the third day, I'm seriously thinking of aborting the tasks for a second time, and of dropping the project too. This project is not only eating my CPU time, it also steals time from other projects I could participate in. Any admins here? Why such long and unpredictable runs, and why such low credit for the tasks? Is the code behind the tasks really so inefficient? What language do you use for programming? What's the development methodology for this project? Regards |
Joined: 19 May 16 Posts: 9 Credit: 215,352,825 RAC: 0 |
"As a result of the Meltdown/Spectre U@H shutdown, the density of long-runners has increased, because these are the tasks getting bounced from host to host. I have received four such tasks in the last couple of days. All four have been completed and validated by one host, but with a CPU time of zero seconds." Exactly what I've observed. Sometimes rebooting starts them again. Often it does not. BTW, I'd call them stuck WUs, not long-runners. |
Joined: 29 Jan 18 Posts: 6 Credit: 523,333 RAC: 0 |
"The only thing I can suggest is to abort work units which compute longer than usual tasks on a particular computer. It's a very rare situation and ... The error rate is 0.6106% - all errors, not only this one..." I only got my first two work units and they both turned out erroneous. That is a 100% error rate for me :( |
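For context on how unlikely that is under the quoted average: if the 0.6106% error rate applied independently to each task, the chance of a host's first two tasks both failing would be about 0.006106² ≈ 0.000037, i.e. roughly 0.004%.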
Joined: 23 Mar 16 Posts: 96 Credit: 23,431,842 RAC: 0 |
Two tasks that have had to be resent many times arrived on two of my hosts. Both the desktop PC running Ubuntu and the Pi running Raspbian Stretch completed their tasks without a hitch. @Chris: Unfortunately you've chosen a bad time to join the project as new task creation has been temporarily turned off, which means you'll likely see mainly tasks that have a tendency to get stuck. My Linux hosts have not had one stuck task, and until very recently my Windows host had also seen none. I get the impression some hosts experience more difficulties than others. |
Joined: 2 Jun 16 Posts: 169 Credit: 317,253,046 RAC: 0 |
With no new work being generated, the only tasks left are the long-running kind that get aborted and resent to another user. I doubt it has anything to do with Meltdown/Spectre. |
Joined: 23 Mar 16 Posts: 96 Credit: 23,431,842 RAC: 0 |
If you read the news, https://universeathome.pl/universe/forum_thread.php?id=309, you'll see that Meltdown/Spectre is the reason no new work is being created, and therefore it is the ultimate reason the concentration of problematic tasks is currently increasing. |