Message boards : Number crunching : extreme long wu's
Beyond
Joined: 19 May 16
Posts: 9
Credit: 212,144,158
RAC: 0
Message 2496 - Posted: 15 Nov 2017, 5:29:16 UTC

I don't think they're long WUs; they're stuck WUs. So far the ones I get don't checkpoint at all: when I look in the slot folder, no file has been updated even after many, many hours. My sample is very small, but so far it has happened on my Intel CPUs and not on my AMD.
jmeister

Joined: 29 Sep 17
Posts: 5
Credit: 12,892,567
RAC: 23,038
Message 2497 - Posted: 17 Nov 2017, 17:45:36 UTC

I've been getting WUs that get stuck at 100% but never actually complete, so I have to abort them. This has been constant on an ARM CPU running Android. I've never had so many in a row, continuing for days, before. Oddly, my other systems have been running fine for once.
Jacob Klein

Joined: 21 Feb 15
Posts: 53
Credit: 1,385,888
RAC: 0
Message 2498 - Posted: 19 Nov 2017, 9:05:41 UTC

If the admin were serious about fixing the problem, it would have been fixed.
I gave everything I could to try to help him, but he never treated it as a significant problem.

All I can do is recommend that people set "No New Tasks" for Universe@Home until the admin takes this more seriously and has fixes for us to test.

Jacob Klein
xaminmo

Joined: 18 Nov 17
Posts: 3
Credit: 7,333
RAC: 0
Message 2501 - Posted: 20 Nov 2017, 8:32:11 UTC - in response to Message 2498.  

I wish I'd looked more deeply into this before joining up.

My favorite is a task with an estimated runtime of 30 days but a deadline only 14 days in the future.

Of course I aborted it, but I shouldn't have to micromanage compute; tasks should be well behaved, self-policing, etc.

I did bump up to the 7.8.4 client, just in case (kind of a hassle on xenial, but not too bad).

It looks like U@H is not the only problematic project, but it seems to have the biggest problems.

IIRC, UID 1 has backed off the project due to family constraints, and UID 2 basically owns it.

My guess is that they just don't have the time or resources to troubleshoot this fully, yet are reluctant to kill the project either.
[SG-2W]Kurzer
Joined: 27 Feb 15
Posts: 1
Credit: 87,913,239
RAC: 288,969
Message 2512 - Posted: 4 Dec 2017, 13:10:24 UTC

universe_bh2_160803_212_1994999_20000_1-999999_995200
My WU ran for 5 d 21 h 49 m, so I restarted it. Now it shows 1 h 44 m and stalls again at 91.970%. I will abort it.

Greetings, [SG-2W]Kurzer
apohawk
Joined: 15 Oct 15
Posts: 1
Credit: 8,311,081
RAC: 7,335
Message 2558 - Posted: 10 Jan 2018, 0:40:28 UTC

I noticed long work units on only one of my systems:
Linux, Ubuntu 16.04 LTS x86_64 (for the amdgpu driver), AMD Ryzen 7 1700.
I have 4 tasks hanging right now. Their slots/*/error.dat files read as follows:

error: function Lzahbf(M,Mc) should not be called for HM stars
error: function Lzahbf(M,Mc) should not be called for HM stars
unexpected remnant case for K=5-6: 456132

error: function Lzahbf(M,Mc) should not be called for HM stars
error: function Lzahbf(M,Mc) should not be called for HM stars
unexpected remnant case for K=5-6: 490907

error: function Lzahbf(M,Mc) should not be called for HM stars
error: function Lzahbf(M,Mc) should not be called for HM stars
unexpected remnant case for K=5-6: 739907

error: function Lzahbf(M,Mc) should not be called for HM stars
error: function Lzahbf(M,Mc) should not be called for HM stars
unexpected remnant case for K=5-6: 615205

Reading through the thread, I noticed that the errors reported in error.dat have changed over time, probably as individual ones were fixed in the app or in the input data, but the problem remains. So it is probably one layer up: something is wrong with how these errors are handled.

I also checked what working and "hanging" tasks actually do. The strace output looks similar for both, but ltrace is very different.
A working task's ltrace sample (ltrace -fp PID):
[pid 20453] pow(0x7f0d884383a0, 1022, 0x7f0d88491d00, 384) = 19
[pid 20453] pow(0x7f0d884383a0, 0x7f0d884363a0, 0x40c0000000000000, 333) = 279
[pid 20453] log10(0x7f0d884383a0, 0x7f0d884363a0, 0x4010000000000000, 61) = 0x7f0d88492440
[pid 20453] pow(0x7f0d884383a0, 1022, 0x7f0d88491d00, 384) = 19
[pid 20453] pow(0x7f0d884383a0, 0x7f0d884363a0, 0x40c0000000000000, 333) = 631
[pid 20453] pow(0x7f0d884383a0, 0x7f0d884363a0, 0x4000000000000000, 291) = 83
[pid 20453] pow(0x7f0d884383a0, 0x7f0d884363a0, 0x4040000000000000, 33) = 347
[pid 20453] pow(0x7f0d884383a0, 0x7f0d884363a0, 0x3fd0000000000000, 425) = 395
[pid 20453] pow(0x7f0d884383a0, 0x7f0d884363a0, 0x3fe0000000000000, 421) = 0xbff0000000000000
[pid 20453] pow(0x7f0d884383a0, 0x4002005e, 0x4002005e, 0x4002005e573c3572) = 777
[pid 20453] log10(0x7f0d884383a0, 0x7f0d884363a0, 0x4090000000000000, 377) = 0x7f0d88492440
[pid 20453] pow(0x7f0d884383a0, 1022, 0x7f0d88491d00, 384) = 19
[pid 20453] pow(0x7f0d884383a0, 0x7f0d884363a0, 0x40c0000000000000, 333) = 279
[pid 20453] log10(0x7f0d884383a0, 0x7f0d884363a0, 0x4010000000000000, 61) = 0x7f0d88492440
[pid 20453] pow(0x7f0d884383a0, 1022, 0x7f0d88491d00, 384) = 19
[pid 20453] pow(0x7f0d884383a0, 0x7f0d884363a0, 0x40c0000000000000, 333) = 631
[pid 20454] gettimeofday(0x888cde30, 0) = 0
[pid 20454] gettimeofday(0x888cde30, 0) = 0
[pid 20454] usleep(100000 <unfinished ...>
[pid 20453] pow(0x7f0d884383a0, 0x7f0d884363a0, 0x4000000000000000, 291) = 83
[pid 20453] pow(0x7f0d884383a0, 0x7f0d884363a0, 0x4040000000000000, 33) = 725
[pid 20453] pow(0x7f0d884383a0, 0x7f0d884363a0, 0x3f40000000000000, 553) = 333
[pid 20453] pow(0x7f0d884383a0, 0x7f0d884363a0, 0x40e0000000000000, 483) = 421
[pid 20453] pow(0x7f0d884383a0, 0x7f0d884363a0, 0x4180000000000000, 39) = 373
[pid 20453] pow(0x7f0d884383a0, 0x7f0d884363a0, 0x40b0000000000000, 527) = 903
[pid 20453] pow(0x7f0d884383a0, 0x7f0d884363a0, 0x40f0000000000000, 319) = 725
[pid 20453] log10(0x7f0d884383a0, 0x7f0d884363a0, 0x3f40000000000000, 553) = 0x7f0d88492440
[pid 20453] pow(0x7f0d884383a0, 1022, 0x7f0d88491d00, 384) = 19
[pid 20453] pow(0x7f0d884383a0, 0x7f0d884363a0, 0x40c0000000000000, 333) = 431
[pid 20453] pow(0x7f0d884383a0, 0x7f0d884363a0, 0x4000000000000000, 139) = 61
[pid 20453] log10(0x7f0d884383a0, 0x7f0d884363a0, 0x3ff0000000000000, 605) = 0x7f0d88492440
[pid 20453] pow(0x7f0d884383a0, 1022, 0x7f0d88491d00, 384) = 19
[pid 20453] pow(0x7f0d884383a0, 0x7f0d884363a0, 0x40c0000000000000, 333) = 279

...and so on.
A "hanging" task's ltrace sample (ltrace -fp PID):
[pid 13889] free(0x1a4e560) = <void>
[pid 13889] free(0x1a4e530) = <void>
[pid 13889] free(0x1a4e500) = <void>
[pid 13889] free(0x1a4e4d0) = <void>
[pid 13889] free(0x1a4e4a0) = <void>
[pid 13889] free(0x1a4e470) = <void>
[pid 13889] __isnan(0xffffffff, 0x7fc692dddb30, 0x1a4e490, 0) = 1
[pid 13889] malloc(40) = 0x1a4e470
[pid 13889] malloc(40) = 0x1a4e4a0
[pid 13889] malloc(40) = 0x1a4e4d0
[pid 13889] malloc(40) = 0x1a4e500
[pid 13889] malloc(40) = 0x1a4e530
[pid 13889] malloc(40) = 0x1a4e560
[pid 13889] free(0x1a4e560) = <void>
[pid 13889] free(0x1a4e530) = <void>
[pid 13889] free(0x1a4e500) = <void>
[pid 13889] free(0x1a4e4d0) = <void>
[pid 13889] free(0x1a4e4a0) = <void>
[pid 13889] free(0x1a4e470) = <void>
[pid 13889] __isnan(0xffffffff, 0x7fc692dddb30, 0x1a4e490, 0) = 1
[pid 13889] malloc(40) = 0x1a4e470
[pid 13889] malloc(40) = 0x1a4e4a0
[pid 13889] malloc(40) = 0x1a4e4d0
[pid 13889] malloc(40) = 0x1a4e500
[pid 13889] malloc(40) = 0x1a4e530
[pid 13889] malloc(40) = 0x1a4e560
[pid 13889] free(0x1a4e560) = <void>
[pid 13889] free(0x1a4e530) = <void>
[pid 13889] free(0x1a4e500) = <void>
[pid 13889] free(0x1a4e4d0) = <void>
[pid 13889] free(0x1a4e4a0) = <void>
[pid 13889] free(0x1a4e470) = <void>
[pid 13889] __isnan(0xffffffff, 0x7fc692dddb30, 0x1a4e490, 0) = 1

Looks like an infinite loop to me.

A "hanging" task's gdb backtrace, taken at a random moment:
(gdb) bt
#0 get_T (L=L@entry=651959.17475758004, R=R@entry=1.60374190786905) at singl.c:4688
#1 0x0000000000418274 in dorbdt (t=<optimized out>, y=y@entry=0x1a4e560, dydx=dydx@entry=0x1a4e4d0, KTa=KTa@entry=0, KTb=KTb@entry=2.9179459219507452e-10, Ma=Ma@entry=1.4399999999999999,
Mb=Mb@entry=24.12289019752216, dMa=dMa@entry=nan(0x8000000000000), dMb=dMb@entry=0.55262531449484709, Ia=Ia@entry=-nan(0x8000000000000), Ib=Ib@entry=4.6753715902533024,
Ra=Ra@entry=1.60374190786905, Rb=Rb@entry=1.6037419078694282, Rca=Rca@entry=inf, Rcb=Rcb@entry=0, wcrit_a=wcrit_a@entry=0, wcrit_b=wcrit_b@entry=0, La=La@entry=651959.17475758004,
Lb=Lb@entry=651959.17475832114, Ka=Ka@entry=-978270896, Kb=Kb@entry=7, magb_a=magb_a@entry=0, magb_b=magb_b@entry=0) at binary.cpp:7817
#2 0x000000000043244c in rkck1 (y=0x1a4f190, dydx=0x1a4f1c0, n=4, x=3.4806341581987525, h=<optimized out>, yout=0x1a4f220, yerr=0x1a4f1f0,
derivs1=0x4181d0 <dorbdt(double, double*, double*, double, double, double, double, double, double, double, double, double, double, double, double, double, double, double, double, int, int, int, int)>, KTa=0, KTb=2.9179459219507452e-10, Ma=1.4399999999999999, Mb=24.12289019752216, dMa=nan(0x8000000000000), dMb=0.55262531449484709, Ia=-nan(0x8000000000000),
Ib=4.6753715902533024, Ra=1.60374190786905, Rb=1.6037419078694282, Rca=inf, Rcb=0, wcrit_a=0, wcrit_b=0, La=651959.17475758004, Lb=651959.17475832114, Ka=-978270896, Kb=7, magb_a=0,
magb_b=0) at binary.cpp:8427
#3 0x000000000043398f in rkqs1 (y=y@entry=0x1a4f190, dydx=dydx@entry=0x1a4f1c0, n=n@entry=4, x=x@entry=0x7ffdc5b0b9b0, htry=0, eps=nan(0x8000000000000), eps@entry=0.001,
yscal=yscal@entry=0x1a4f160, hdid=hdid@entry=0x7ffdc5b0b9d0, hnext=hnext@entry=0x7ffdc5b0b9c0,
derivs1=derivs1@entry=0x4181d0 <dorbdt(double, double*, double*, double, double, double, double, double, double, double, double, double, double, double, double, double, double, double, double, int, int, int, int)>, KTa=KTa@entry=0, KTb=KTb@entry=2.9179459219507452e-10, Ma=Ma@entry=1.4399999999999999, Mb=Mb@entry=24.12289019752216, dMa=dMa@entry=nan(0x8000000000000),
dMb=dMb@entry=0.55262531449484709, Ia=Ia@entry=-nan(0x8000000000000), Ib=Ib@entry=4.6753715902533024, Ra=Ra@entry=1.60374190786905, Rb=Rb@entry=1.6037419078694282, Rca=Rca@entry=inf,
Rcb=Rcb@entry=0, wcrit_a=wcrit_a@entry=0, wcrit_b=wcrit_b@entry=0, La=La@entry=651959.17475758004, Lb=Lb@entry=651959.17475832114, Ka=Ka@entry=-978270896, Kb=Kb@entry=7,
magb_a=magb_a@entry=0, magb_b=magb_b@entry=0) at binary.cpp:8365
#4 0x000000000042f0ff in odeint1 (ystart=ystart@entry=0x1a4f130, nvar=nvar@entry=4, x1=x1@entry=3.4806341581987525, x2=x2@entry=3.4806531067030479, eps=eps@entry=0.001,
h1=0.15274083617124923, hmin=hmin@entry=0, nok=nok@entry=0x7ffdc5b0bc50, nbad=nbad@entry=0x7ffdc5b0bc60,
derivs1=derivs1@entry=0x4181d0 <dorbdt(double, double*, double*, double, double, double, double, double, double, double, double, double, double, double, double, double, double, double, double, int, int, int, int)>,
rkqs1=rkqs1@entry=0x433760 <rkqs1(double*, double*, int, double*, double, double, double*, double*, double*, void (*)(double, double*, double*, double, double, double, double, double, double, double, double, double, double, double, double, double, double, double, double, int, int, int, int), double, double, double, double, double, double, double, double, double, double, double, double, double, double, double, double, int, int, int, int)>, KTa=0, KTb=2.9179459219507452e-10, Ma=Ma@entry=1.4399999999999999, Mb=Mb@entry=24.12289019752216,
dMa=dMa@entry=nan(0x8000000000000), dMb=dMb@entry=0.55262531449484709, Ia=Ia@entry=-nan(0x8000000000000), Ib=Ib@entry=4.6753715902533024, Ra=Ra@entry=1.60374190786905,
Rb=Rb@entry=1.6037419078694282, Rca=Rca@entry=inf, Rcb=Rcb@entry=0, wcrit_a=wcrit_a@entry=0, wcrit_b=wcrit_b@entry=0, La=La@entry=651959.17475758004, Lb=Lb@entry=651959.17475832114,
Ka=Ka@entry=-978270896, Kb=Kb@entry=7, magb_a=magb_a@entry=0, magb_b=magb_b@entry=0) at binary.cpp:8326
#5 0x0000000000430c93 in orb_change (t1=t1@entry=3.4806341581987525, t2=t2@entry=3.4806531067030479, a=a@entry=0x7ffdc5b0c400, e=e@entry=0x7ffdc5b0c410, wa=wa@entry=0x7ffdc5b0c420,
wb=wb@entry=0x7ffdc5b0c430, tvira=0, tvirb=3.7897020774656023e-05, Ma=1.4399999999999999, Mb=24.12289019752216, M0a=1.4399999999999999, M0b=24.12289019752216,
Mzamsa=147.78874082664947, Mzamsb=125.59129163845198, dMwinda=dMwinda@entry=nan(0x8000000000000), dMwindb=dMwindb@entry=0.55262531449484709, Mca=0, Mcb=0, Ra=1.60374190786905,
Rb=1.6037419078694282, Raold=Raold@entry=nan(0x8000000000000), Rbold=Rbold@entry=1.6037347409964342, La=651959.17475758004, Lb=651959.17475832114, Ka=-978270896, Kb=7,
Kaold=Kaold@entry=1, Kbold=Kbold@entry=7, mt=mt@entry=0, ce=ce@entry=0, mttype=0, Iaold=Iaold@entry=0x7ffdc5b0c440, Ibold=Ibold@entry=0x7ffdc5b0c450, KTa=KTa@entry=0x7ffdc5b0c590,
KTb=KTb@entry=0x7ffdc5b0c598, dMmta=0, dMmtb=0, Mdisa=0, Mdisb=0, darwin=darwin@entry=0x7ffdc5b0c2e4) at binary.cpp:7750
#6 0x000000000040f58c in main (argc=<optimized out>, argv=<optimized out>) at binary.cpp:1899

Those NaN values appear quite high up in the backtrace, but I don't know whether they should be there or not.
Brummig
Joined: 23 Mar 16
Posts: 64
Credit: 12,928,075
RAC: 27,351
Message 2559 - Posted: 10 Jan 2018, 7:50:26 UTC
Last modified: 10 Jan 2018, 7:52:28 UTC

This one appears to be a candidate:
https://universeathome.pl/universe/workunit.php?wuid=13505387
This is the first and only long-runner I've had to date (I aborted it when I noticed it was lagging behind other tasks and making no progress). One host has managed to complete it and have it validated, but note that the run time was 1,870.11 s while the CPU time was 0 s. Otherwise, most hosts have choked on it.
cykodennis
Joined: 4 Feb 15
Posts: 24
Credit: 7,035,527
RAC: 0
Message 2570 - Posted: 23 Jan 2018, 9:04:16 UTC

Is there any progress with this issue?
I would like to do some work for a repaired U@H again...

Btw, respect for all those users who try to analyze the issue.
"I should bring one important point to the attention of the authors and that is, the world is not the United States..."
mmonnin

Joined: 2 Jun 16
Posts: 150
Credit: 200,043,546
RAC: 47,946
Message 2572 - Posted: 23 Jan 2018, 21:36:18 UTC - in response to Message 2570.  

Is there any progress with this issue?
I would like to do some work for a repaired U@H again...

Btw, respect for all those users who try to analyze the issue.


The issue is over two years old and spans several apps. I wouldn't count on it.
Tex1954

Joined: 22 Feb 15
Posts: 23
Credit: 36,303,060
RAC: 0
Message 2573 - Posted: 25 Jan 2018, 16:09:30 UTC

I've been getting a TON of WUs over 3-4 days long, but they do complete.

The problem is, I still get the same 666.67 credits for a 3-4-day WU as I do for a 2-hour WU...

Maybe something could be done to correct this points problem?

8-)
jmeister

Joined: 29 Sep 17
Posts: 5
Credit: 12,892,567
RAC: 23,038
Message 2574 - Posted: 26 Jan 2018, 3:20:40 UTC

After a week of work units stuck at 100% across multiple operating systems and processors, I'm officially done running this project. I've never seen it this bad before, whether crunching solo or in a pool. Having to babysit 6 devices is completely unnecessary. I hope those who ride it out eventually get some satisfaction from whoever is maintaining this.
Tex1954

Joined: 22 Feb 15
Posts: 23
Credit: 36,303,060
RAC: 0
Message 2577 - Posted: 27 Jan 2018, 18:50:48 UTC

Does any admin actually read the message boards????

8-)
Jacob Klein

Joined: 21 Feb 15
Posts: 53
Credit: 1,385,888
RAC: 0
Message 2586 - Posted: 30 Jan 2018, 22:22:59 UTC - in response to Message 2577.  

In this thread, an admin gave up trying to solve the issue, despite my providing as much seemingly useful information as I could and offering my time to help reproduce and isolate it.

I have set No New Tasks for Universe@Home on all my PCs, so that my resources do not get wasted, and I've recommended the same to other BOINC friends.

I hope an admin takes this more seriously at some point.
Brummig
Joined: 23 Mar 16
Posts: 64
Credit: 12,928,075
RAC: 27,351
Message 2587 - Posted: 31 Jan 2018, 17:52:00 UTC
Last modified: 31 Jan 2018, 18:01:49 UTC

As a result of the Meltdown/Spectre U@H shutdown, the density of long-runners has increased, because these are the tasks that keep getting bounced from host to host. I have received four such tasks in the last couple of days. All four have since been completed and validated by one host, but with a CPU time of zero seconds.

Of the three Win 10 tasks, two completed just fine, but one jams at 54.905%. When I first noticed it had jammed, I shut down BOINC Manager and waited for the BHSpin task to close, which it did after a few minutes. When I restarted BOINC Manager, the BHSpin task resumed from its checkpoint and then jammed again at exactly the same point. I have saved the files if anyone wants them.

The fourth task is on a Pi and awaiting other (non-U@H) tasks to complete.

Edit: This entry in one of the files doesn't sound good:
error: in Renv_con() unknown Ka type: 1, iidd_old: 1324350error: in Menv_con() unknown Ka type: 1, iidd_old: 1324350
Skywalker@Athens

Joined: 7 Nov 17
Posts: 2
Credit: 21,333
RAC: 0
Message 2594 - Posted: 5 Feb 2018, 6:44:28 UTC

Hi all,
same long runs here,
running Windows 7 on an Intel i7 CPU. When downloaded, the tasks said something like 2 hours to complete. Now that I've passed the third day, I'm seriously considering aborting the tasks for a second time, and perhaps the project too.
This project is not only eating my CPU time, it is also stealing time from other projects I could participate in.

Any admins here? Why such long and unpredictable runs, and why so little credit for the tasks? Is the code behind the tasks really that inefficient? What language do you use, and what is the development methodology for this project?

Regards
Beyond
Joined: 19 May 16
Posts: 9
Credit: 212,144,158
RAC: 0
Message 2597 - Posted: 6 Feb 2018, 16:45:35 UTC - in response to Message 2587.  

As a result of the Meltdown/Spectre U@H shutdown, the density of long-runners has increased, because these are the tasks that keep getting bounced from host to host. I have received four such tasks in the last couple of days. All four have since been completed and validated by one host, but with a CPU time of zero seconds.

Exactly what I've observed. Sometimes rebooting gets them going again; often it does not. BTW, I'd call them stuck WUs, not long-runners.
Chris

Joined: 29 Jan 18
Posts: 6
Credit: 523,333
RAC: 0
Message 2599 - Posted: 7 Feb 2018, 9:41:59 UTC - in response to Message 1898.  

Only what I can suggest is to abort work units which compute longer than usual tasks on a particular computer. It's a very rare situation and ...

Error rate is 0.6106% - all errors, not only this one...

I only got my first two work units, and they both turned erroneous.
That is a 100% error rate for me :(
Brummig
Joined: 23 Mar 16
Posts: 64
Credit: 12,928,075
RAC: 27,351
Message 2600 - Posted: 7 Feb 2018, 13:21:18 UTC - in response to Message 2599.  

Two tasks that had to be resent many times arrived on two of my hosts. Both the desktop PC running Ubuntu and the Pi running Raspbian Stretch completed them without a hitch.

@Chris: Unfortunately, you've joined at a bad time: new task creation has been temporarily turned off, which means you'll mostly see tasks that have a tendency to get stuck. My Linux hosts have not had a single stuck task, and until very recently my Windows host had seen none either. I get the impression some hosts experience more difficulties than others.
mmonnin

Joined: 2 Jun 16
Posts: 150
Credit: 200,043,546
RAC: 47,946
Message 2603 - Posted: 8 Feb 2018, 21:07:16 UTC

With no new work being generated, the only tasks left are the long-running kind that get aborted and resent to other users. I doubt it has anything to do with Meltdown/Spectre.
Brummig
Joined: 23 Mar 16
Posts: 64
Credit: 12,928,075
RAC: 27,351
Message 2604 - Posted: 9 Feb 2018, 12:16:22 UTC - in response to Message 2603.  

If you read the news (https://universeathome.pl/universe/forum_thread.php?id=309), you'll see that the reason no new work is being created is Meltdown/Spectre, and that is therefore the ultimate reason the concentration of problematic tasks is currently increasing.




Copyright © 2021 Copernicus Astronomical Centre of the Polish Academy of Sciences
Project server and website managed by Krzysztof 'krzyszp' Piszczek