Message boards : Number crunching : Long running work units
fractal

Joined: 23 Mar 15
Posts: 3
Credit: 155,811,308
RAC: 84
Message 1053 - Posted: 30 Jan 2016, 18:10:26 UTC

Am I the only one getting the long running work units mixed in with regular running units? They appear to come in bunches.

The 333-credit work units usually take between 6,000 and 12,000 seconds on the various machines I have, but occasionally I get a work unit that takes 60,000-plus seconds. These get the same 333 credits when they complete.

Examples include http://universeathome.pl/universe/workunit.php?wuid=3718041, http://universeathome.pl/universe/workunit.php?wuid=3717416, http://universeathome.pl/universe/workunit.php?wuid=3717415 and http://universeathome.pl/universe/workunit.php?wuid=3717828 on a stock i7. Ok, let's ignore that last one. That's an ARM processor which took as long as the i7 ...

A stock i3 sees http://universeathome.pl/universe/workunit.php?wuid=3675859 and http://universeathome.pl/universe/workunit.php?wuid=3675872

As you can see, sometimes the wingmen also get the really long run times, sometimes they get slightly longer runtimes and sometimes they get standard runtimes.

All my machines run Intel processors at stock frequencies with decent cooling under Linux. There is no pattern to the wingmen as they run a variety of processors and operating systems.

I had written it off as "one of those things" for a while, until I looked into it this weekend and found that the wingmen are not affected the same way I am.

So, does anyone else see a pile of these day-long work units mixed in with their 2-3 hour units? Does anyone have any idea what is going on?
alex

Joined: 21 Feb 15
Posts: 64
Credit: 65,733,511
RAC: 298
Message 1054 - Posted: 30 Jan 2016, 20:23:48 UTC

My '333' WUs on an i7 take between 10,900 and 15,900 sec. Longer WUs are granted higher credit.
This is how I see it.
No dedicated crunchers; all are PCs in productive use.
Profile Krzysztof Piszczek - wspieram ...
Project administrator
Project developer
Project tester
Joined: 4 Feb 15
Posts: 841
Credit: 144,180,465
RAC: 2
Message 1055 - Posted: 30 Jan 2016, 21:38:15 UTC - in response to Message 1053.  

@fractal, you are probably sometimes getting resends of longer work units from the past.
Krzysztof 'krzyszp' Piszczek

Member of Radioactive@Home team
My Patreon profile
Universe@Home on YT
cyrusNGC_224@P3D

Joined: 21 Feb 15
Posts: 46
Credit: 926,538,317
RAC: 584
Message 1056 - Posted: 30 Jan 2016, 23:28:43 UTC

I have observed these vastly longer running work units too. But you can see (notably in your examples) that they have one thing in common:
universe_bh_AAA_10_CCCCC_D-EEEEEE_FFFFFF

All these work units carry the number 10 in position B! No matter what value the other positions have.
Look in your work unit lists to verify this.
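
For anyone who wants to check their own task list against this pattern, here is a minimal sketch in Python. It assumes the names follow the layout shown above; the field labels A to F are just the placeholders from this post, and the optional trailing number is the per-task suffix seen on task names elsewhere in this thread. The second sample name is made up for illustration.

```python
import re

# Field labels A-F are the placeholders used in the post, not documented names.
WU_NAME = re.compile(
    r"^universe_bh_(?P<A>\d+)_(?P<B>\d+)_(?P<C>\d+)_(?P<D>\d+)-(?P<E>\d+)_(?P<F>\d+)"
    r"(?:_(?P<replica>\d+))?$"  # optional task suffix, e.g. the trailing _1
)

def parse_wu_name(name):
    """Return a dict of the name fields, or None if the name does not match."""
    m = WU_NAME.match(name.strip())
    return m.groupdict() if m else None

def has_b_value(name, b="10"):
    """True if the name parses and its position-B field equals `b`."""
    fields = parse_wu_name(name)
    return fields is not None and fields["B"] == b

names = [
    "universe_bh_362_16_20000_1-999999_335600_1",  # task name quoted later in this thread
    "universe_bh_123_10_20000_1-999999_000001",    # hypothetical name for illustration
]
print([n for n in names if has_b_value(n)])
```

Paste the names from your "valid tasks" list into `names` and compare the matches with the run times you see.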
fractal

Joined: 23 Mar 15
Posts: 3
Credit: 155,811,308
RAC: 84
Message 1057 - Posted: 30 Jan 2016, 23:40:24 UTC - in response to Message 1056.  
Last modified: 31 Jan 2016, 0:04:35 UTC

cyrusNGC_224@P3D wrote:
> I have observed these vastly longer running work units too. But you can see (notably in your examples) that they have one thing in common:
> universe_bh_AAA_10_CCCCC_D-EEEEEE_FFFFFF
>
> All these work units carry the number 10 in position B! No matter what value the other positions have.
> Look in your work unit lists to verify this.

Good observation. Not sure what that field means but, yes, all of the ones I spotted seem to have a "10" there. I wonder what it is about those work units that causes some machines to take so much longer than others.

edit: I just went through 1000 of the 8000 units in my "valid tasks" list and you are 33 for 33. All 33/1000 have 10 in that position. http://universeathome.pl/universe/workunit.php?wuid=3676163 is interesting in that it took 2x as long as it should. Maybe 1/3 of the 33 took 2x as long as normal and the rest took 8x as long. Occasionally the wingman takes the normal time, sometimes it takes 2x as long, and rarely both of us take the full extra-long time.

Oh, and every unit gets 333 credits no matter how long they took to compute.
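
To put the fixed credit in perspective, a small sketch of the arithmetic, using only the run times quoted in the first post (6,000, 12,000 and 60,000 seconds) and the flat 333 credits. The function names are made up for this example.

```python
FIXED_CREDIT = 333  # credit granted per work unit, as reported in this thread

def credit_per_hour(runtime_s, credit=FIXED_CREDIT):
    """Credit earned per hour of run time for one work unit."""
    return credit * 3600 / runtime_s

def slowdown(runtime_s, typical_s):
    """How many times longer a unit ran compared with a typical one."""
    return runtime_s / typical_s

for label, runtime_s in [("short", 6000), ("typical upper", 12000), ("long", 60000)]:
    print(f"{label:>13}: {credit_per_hour(runtime_s):6.1f} credits/hour, "
          f"{slowdown(runtime_s, 6000):4.1f}x the short run time")
```

With a flat credit, a unit that runs ten times longer earns one tenth of the credit per hour.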
Profile Krzysztof Piszczek - wspieram ...
Project administrator
Project developer
Project tester
Joined: 4 Feb 15
Posts: 841
Credit: 144,180,465
RAC: 2
Message 1058 - Posted: 31 Jan 2016, 7:45:22 UTC - in response to Message 1056.  

cyrusNGC_224@P3D wrote:
> I have observed these vastly longer running work units too. But you can see (notably in your examples) that they have one thing in common:
> universe_bh_AAA_10_CCCCC_D-EEEEEE_FFFFFF
>
> All these work units carry the number 10 in position B! No matter what value the other positions have.
> Look in your work unit lists to verify this.

This is very interesting...

The "10" in the WU name means that it is the 10th iteration of the loop that generates one of the starting parameters ("idum"). I strongly suspect that it combines with another parameter to produce a longer computing time and, probably, interesting results.

Let me investigate this, please :)
Krzysztof 'krzyszp' Piszczek

Member of Radioactive@Home team
My Patreon profile
Universe@Home on YT
cyrusNGC_224@P3D

Joined: 21 Feb 15
Posts: 46
Credit: 926,538,317
RAC: 584
Message 1065 - Posted: 9 Feb 2016, 22:53:11 UTC - in response to Message 1057.  

fractal wrote:
> I wonder what it is about those work units that causes some machines to take so much longer than others.

Further, I see that only my AMD Bulldozer based (Opteron and Kaveri) machines are not affected by these long runtimes with the "10" work units (they take the normal time)! AMD K8, K10, *Cat and Intel C2D and Core i CPUs all take the long time.
All of this is under Linux; under Windows, however, I have not noticed it.
Tex1954

Joined: 22 Feb 15
Posts: 23
Credit: 37,205,060
RAC: 32
Message 1150 - Posted: 21 Apr 2016, 15:56:07 UTC
Last modified: 21 Apr 2016, 15:57:20 UTC

I have observed the same thing. It seems AMD runs faster on these "10" tasks, but some i3 and i5 CPUs also seem to go faster. It's very interesting that I can find a minor correlation with the OS (Win vs. Linux). I see the same thing on my E3-1240V2 CPUs as I do on the 2p X5680 setup.

However, they ALL get the same 333 points... how weird...

In many cases, the i3/i5 CPUs that do significantly better are running Windows instead of Linux, so some compiler thing may be going on as well...

Tis for the developers to clear this up I think...

:D
Profile Conan
Joined: 4 Feb 15
Posts: 48
Credit: 15,956,546
RAC: 54
Message 1153 - Posted: 27 Apr 2016, 1:56:16 UTC

I have had at least one long one before that ran over a day (about 30 hours, my wingman only took 5 hours).
I currently have one that has been running for over 21 hours and has been stuck at 10.075% completed for many hours without changing.
It is using a full CPU core so appears to be running OK.
Both long units used this hardware, an AMD PhenomII X6, running Linux Fedora16 X64.

Waiting to see if it completes. The previous long runner got the same 333 points, so I assume this one will too. It will be a poor return on resource use, but there are only a few of them.
They are recent creations, not resends.

Conan
TheHoosh

Joined: 6 Mar 15
Posts: 13
Credit: 20,214,952
RAC: 0
Message 1154 - Posted: 28 Apr 2016, 9:09:33 UTC

A couple of days ago I received some of those long-running WUs as well.
And just like on your machine(s), they all got stuck at random levels of completion (which stands in contrast to the earlier, occasionally occurring long-running WUs).

Some ran longer than 9 hours and did not proceed farther than 10%, others were stuck at around 90% after running for 21 hours.
I canceled all of them as they appeared to me to be somehow corrupted.
Profile Krzysztof Piszczek - wspieram ...
Project administrator
Project developer
Project tester
Joined: 4 Feb 15
Posts: 841
Credit: 144,180,465
RAC: 2
Message 1155 - Posted: 28 Apr 2016, 10:21:04 UTC

I found that at least some of them break on Linux computers but finish successfully (and get points) on Windows...

I will try to find out why this happened.
Krzysztof 'krzyszp' Piszczek

Member of Radioactive@Home team
My Patreon profile
Universe@Home on YT
Profile Conan
Joined: 4 Feb 15
Posts: 48
Credit: 15,956,546
RAC: 54
Message 1156 - Posted: 28 Apr 2016, 22:31:52 UTC

Mine has now moved to over 32 hours and is still stuck at 10.075%; this is on a Linux machine.
Should I abort it?
It is still using CPU and the Task Manager (Linux equivalent) says it has been running for over 3 days.

Conan
Profile Conan
Joined: 4 Feb 15
Posts: 48
Credit: 15,956,546
RAC: 54
Message 1157 - Posted: 29 Apr 2016, 8:08:58 UTC - in response to Message 1156.  

Conan wrote:
> Mine has now moved to over 32 hours and is still stuck at 10.075%; this is on a Linux machine.
> Should I abort it?
> It is still using CPU and the Task Manager (Linux equivalent) says it has been running for over 3 days.
>
> Conan


Don't worry about this one as I have aborted it.
It reached nearly 40 hours and had only made one checkpoint at 8 minutes 50 seconds then nothing after that.
The percentage done remained at 10.075% all the time.
The properties of the work unit (in BOINC manager) said it was using just 0.3 GFlops per second, which seems a bit slow.
So even though it was using a CPU it did not appear to be doing much.

Conan
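
A rough way to flag a task like this one is to compare the time since its last checkpoint with the usual gap between checkpoints. The sketch below is only a heuristic: the function name, the factor of 10 and the choice of "typical interval" are all assumptions, not anything the project or BOINC defines.

```python
def looks_stalled(cpu_time_s, cpu_at_last_checkpoint_s, typical_interval_s, factor=10):
    """Return True when the time spent since the last checkpoint is more
    than `factor` times the usual gap between checkpoints."""
    since_checkpoint = cpu_time_s - cpu_at_last_checkpoint_s
    return since_checkpoint > factor * typical_interval_s

# Figures from the post above: about 40 hours of run time (treated here as
# CPU time) and a single checkpoint at 8 min 50 s.
# Assumption: the typical interval is taken to be the time to that first checkpoint.
first_checkpoint_s = 8 * 60 + 50
print(looks_stalled(40 * 3600, first_checkpoint_s, typical_interval_s=first_checkpoint_s))
```

The "CPU time" and "CPU time at last checkpoint" values shown in the task properties in BOINC Manager can be plugged in the same way.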
cyrusNGC_224@P3D

Joined: 21 Feb 15
Posts: 46
Credit: 926,538,317
RAC: 584
Message 1158 - Posted: 29 Apr 2016, 10:47:16 UTC - in response to Message 1155.  

Krzysztof Piszczek wrote:
> I found that at least some of them break on Linux computers but finish successfully (and get points) on Windows...
>
> I will try to find out why this happened.

That affected only universe_bh_361_0 WUs like this one: http://universeathome.pl/universe/workunit.php?wuid=4823677
Tex1954

Joined: 22 Feb 15
Posts: 23
Credit: 37,205,060
RAC: 32
Message 1159 - Posted: 1 May 2016, 12:19:23 UTC
Last modified: 1 May 2016, 12:25:15 UTC

I now have 4 long ones on each of two computers (8 total) that have run for 6 days on one setup and over 1 day on the other. Both are running Linux. I let them run because I was curious... I'll abort them all after I report what is in the slots directories now...

Both setups are server grade, with an E3-1230V3 and an E3-1240V2 CPU. Both run Linux Mint 17.3.

In any case, I checked the slot files and nothing is changing except some stuff in the error.dat files...

error.dat =
error: function Lzahbf(M,Mc) should not be called for HM stars
error: function Lzahbf(M,Mc) should not be called for HM stars
unexpected remnant case for K=5-6: 254568

error.dat2 =
error: bondi() accreted mass (2.801104) larger than envelope mass (2.786657) (190181)

error.dat3 =
error: bondi() accreted mass (2.801104) larger than envelope mass (2.786657) (190181)

The boinc_mmap_file has some binary junk in it...

Nothing in the stderr.txt file.

log.txt =
00:00:00 00:00:00 PROGRAM START: Sun Apr 24 23:43:52 2016
00:00:00 00:00:00 no checkpoint.dat file found
00:00:00 00:00:00 cleaning checkpoints
00:00:00 00:00:00 gw_cpfile: source file "data0.dat2" not present
00:00:00 00:00:00 gw_cpfile: source file "data1.dat2" not present
00:00:00 00:00:00 gw_cpfile: source file "data2.dat2" not present
00:00:00 00:00:00 gw_cpfile: source file "error.dat2" not present
00:00:00 00:00:00 reading checkpoint: istart: -1; pp: 0; n: -1
00:00:00 00:00:00 checkpoint read
00:00:00 00:00:00 default values set
00:00:00 00:00:00 Reading param.in file
00:00:00 00:00:00 PARAMIN: num_tested = 20000
00:00:00 00:00:00 PARAMIN: hub_val = 1000
00:00:00 00:00:00 PARAMIN: idum = -943500
00:00:00 00:00:00 PARAMIN: OUTPUT = 3
00:00:00 00:00:00 PARAMIN: Sal = -2.3
00:00:00 00:00:00 PARAMIN: Mmina = 5.0
00:00:00 00:00:00 PARAMIN: Mminb = 3.0
00:00:00 00:00:00 PARAMIN: Fa = 1
00:00:00 00:00:00 PARAMIN: ZZ = 0.0001
00:00:00 00:00:00 param.in file read
00:00:00 00:00:00 idum: -943500; num_tested: 20000
00:03:53 00:03:53 making checkpoint: j: 1000; iidd: 118910
00:03:53 00:00:00 gw_cpfile: data0.dat appended to data0.dat2
00:03:53 00:00:00 gw_cpfile: data1.dat appended to data1.dat2
00:03:53 00:00:00 gw_cpfile: data2.dat appended to data2.dat2
00:03:53 00:00:00 gw_cpfile: error.dat appended to error.dat2
00:03:53 00:00:00 gw_cpfile: data0.dat appended to data0.dat3
00:03:53 00:00:00 gw_cpfile: data1.dat appended to data1.dat3
00:03:53 00:00:00 gw_cpfile: data2.dat appended to data2.dat3
00:03:53 00:00:00 gw_cpfile: error.dat appended to error.dat3
00:07:50 00:03:57 making checkpoint: j: 2000; iidd: 242302
00:07:50 00:00:00 gw_cpfile: data0.dat appended to data0.dat2
00:07:50 00:00:00 gw_cpfile: data1.dat appended to data1.dat2
00:07:50 00:00:00 gw_cpfile: data2.dat appended to data2.dat2
00:07:50 00:00:00 gw_cpfile: error.dat appended to error.dat2
00:07:50 00:00:00 gw_cpfile: data0.dat appended to data0.dat3
00:07:50 00:00:00 gw_cpfile: data1.dat appended to data1.dat3
00:07:50 00:00:00 gw_cpfile: data2.dat appended to data2.dat3
00:07:50 00:00:00 gw_cpfile: error.dat appended to error.dat3
00:00:00 00:00:00 PROGRAM START: Sun May 1 06:16:53 2016
00:00:00 00:00:00 reading checkpoint: istart: 2000; pp: 242302; n: 2
00:00:00 00:00:00 checkpoint read
00:00:00 00:00:00 default values set
00:00:00 00:00:00 Reading param.in file
00:00:00 00:00:00 PARAMIN: num_tested = 20000
00:00:00 00:00:00 PARAMIN: hub_val = 1000
00:00:00 00:00:00 PARAMIN: idum = -943500
00:00:00 00:00:00 PARAMIN: OUTPUT = 3
00:00:00 00:00:00 PARAMIN: Sal = -2.3
00:00:00 00:00:00 PARAMIN: Mmina = 5.0
00:00:00 00:00:00 PARAMIN: Mminb = 3.0
00:00:00 00:00:00 PARAMIN: Fa = 1
00:00:00 00:00:00 PARAMIN: ZZ = 0.0001
00:00:00 00:00:00 param.in file read
00:00:00 00:00:00 idum: -943500; num_tested: 20000
00:00:00 00:00:00 random number generator initialised: 242302
00:00:00 00:00:00 PROGRAM START: Sun May 1 06:28:15 2016
00:00:01 00:00:01 reading checkpoint: istart: 2000; pp: 242302; n: 2
00:00:01 00:00:00 checkpoint read
00:00:01 00:00:00 default values set
00:00:01 00:00:00 Reading param.in file
00:00:01 00:00:00 PARAMIN: num_tested = 20000
00:00:01 00:00:00 PARAMIN: hub_val = 1000
00:00:01 00:00:00 PARAMIN: idum = -943500
00:00:01 00:00:00 PARAMIN: OUTPUT = 3
00:00:01 00:00:00 PARAMIN: Sal = -2.3
00:00:01 00:00:00 PARAMIN: Mmina = 5.0
00:00:01 00:00:00 PARAMIN: Mminb = 3.0
00:00:01 00:00:00 PARAMIN: Fa = 1
00:00:01 00:00:00 PARAMIN: ZZ = 0.0001
00:00:01 00:00:00 param.in file read
00:00:01 00:00:00 idum: -943500; num_tested: 20000
00:00:01 00:00:00 random number generator initialised: 242302
00:00:00 00:00:00 PROGRAM START: Sun May 1 06:57:59 2016
00:00:00 00:00:00 reading checkpoint: istart: 2000; pp: 242302; n: 2
00:00:00 00:00:00 checkpoint read
00:00:00 00:00:00 default values set
00:00:00 00:00:00 Reading param.in file
00:00:00 00:00:00 PARAMIN: num_tested = 20000
00:00:00 00:00:00 PARAMIN: hub_val = 1000
00:00:00 00:00:00 PARAMIN: idum = -943500
00:00:00 00:00:00 PARAMIN: OUTPUT = 3
00:00:00 00:00:00 PARAMIN: Sal = -2.3
00:00:00 00:00:00 PARAMIN: Mmina = 5.0
00:00:00 00:00:00 PARAMIN: Mminb = 3.0
00:00:00 00:00:00 PARAMIN: Fa = 1
00:00:00 00:00:00 PARAMIN: ZZ = 0.0001
00:00:00 00:00:00 param.in file read
00:00:00 00:00:00 idum: -943500; num_tested: 20000
00:00:00 00:00:00 random number generator initialised: 242302

This task is one of four on this setup running over 6 days. I've restarted it a couple times to see if something changes... seems the completion percent is moving up...

8-)
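
For comparing several slots, the log.txt above can be summarised with a short script. This is a minimal sketch: it only looks for the three kinds of line visible in the log (PROGRAM START, the num_tested parameter, and "making checkpoint"), and it assumes that `j` counts towards `num_tested`, which the log suggests but does not state.

```python
import re

CHECKPOINT = re.compile(r"making checkpoint: j: (\d+); iidd: (\d+)")
NUM_TESTED = re.compile(r"PARAMIN: num_tested = (\d+)")

def summarize_log(text):
    """Count restarts and checkpoints and report the last checkpointed j."""
    checkpoints = [int(j) for j, _iidd in CHECKPOINT.findall(text)]
    tested = NUM_TESTED.search(text)
    num_tested = int(tested.group(1)) if tested else None
    last_j = checkpoints[-1] if checkpoints else 0
    return {
        "starts": text.count("PROGRAM START:"),
        "checkpoints": len(checkpoints),
        "last_j": last_j,
        "num_tested": num_tested,
        # Assumption: j runs from 0 to num_tested over the life of the task.
        "fraction_checkpointed": last_j / num_tested if num_tested else None,
    }

# A few lines copied from the log above.
sample = """\
00:00:00 00:00:00 PROGRAM START: Sun Apr 24 23:43:52 2016
00:00:00 00:00:00 PARAMIN: num_tested = 20000
00:03:53 00:03:53 making checkpoint: j: 1000; iidd: 118910
00:07:50 00:03:57 making checkpoint: j: 2000; iidd: 242302
00:00:00 00:00:00 PROGRAM START: Sun May 1 06:16:53 2016
"""
print(summarize_log(sample))
```

To run it on a real slot, read the file with `open("log.txt").read()` and pass the text in; a task that keeps restarting from the same `last_j` has not checkpointed since.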

Tex1954

Joined: 22 Feb 15
Posts: 23
Credit: 37,205,060
RAC: 32
Message 1161 - Posted: 2 May 2016, 2:24:19 UTC
Last modified: 2 May 2016, 2:24:38 UTC

The 2p setup just finished a couple dozen 12+ hour tasks like this one... and only 333 points?

http://universeathome.pl/universe/workunit.php?wuid=4907485

I think even Credit New would give more points... but all in the same boat I guess...

8-)
Profile Krzysztof Piszczek - wspieram ...
Project administrator
Project developer
Project tester
Joined: 4 Feb 15
Posts: 841
Credit: 144,180,465
RAC: 2
Message 1162 - Posted: 2 May 2016, 11:55:00 UTC - in response to Message 1161.  

We experimented with Credit New... after a few hours of work some tasks got 5 points while others got 1500... Believe me, anything is better than Credit New ;)

Anyway, thank you for the detailed info about the errors - it is very useful.
Krzysztof 'krzyszp' Piszczek

Member of Radioactive@Home team
My Patreon profile
Universe@Home on YT
Tex1954

Joined: 22 Feb 15
Posts: 23
Credit: 37,205,060
RAC: 32
Message 1163 - Posted: 2 May 2016, 12:59:15 UTC - in response to Message 1162.  

Well, hope it helps. One other thing I notice (like others mentioned) is that when the task properties are viewed, they report only 0.06 GFLOPS/sec???? If the tasks are not doing much math, that would explain why Credit New is so flaky.

Computer: Linux-DX5680
Project Universe@Home

Name universe_bh_362_16_20000_1-999999_335600_1

Application Universe BHspin 0.09
Workunit name universe_bh_362_16_20000_1-999999_335600
State Running
Received 5/1/2016 10:50:38 PM
Report deadline 5/15/2016 10:50:23 PM
Estimated app speed 0.06 GFLOPs/sec
Estimated task size 807 GFLOPs
CPU time at last checkpoint 01:36:19
CPU time 01:42:02
Elapsed time 01:42:53
Estimated time remaining 02:20:58
Fraction done 42.160%
Virtual memory size 12.80 MB
Working set size 3.46 MB
Directory slots/1
Process ID 2447



This is on a task near completion. If your WUs are not doing a lot of math, what are they doing? They use less than 4 MB of RAM... just curious...

8-)
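
As a sanity check on the numbers in that properties dump, here is a small sketch of the arithmetic. It makes no claim about how BOINC itself derives these fields; it just divides the estimated task size by the estimated app speed, and separately extrapolates the remaining time from the elapsed time and the fraction done. The function names are made up for this example.

```python
def hms_to_seconds(hms):
    """Convert 'HH:MM:SS' to seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s

def seconds_to_hms(seconds):
    """Convert seconds (rounded to the nearest second) to 'HH:MM:SS'."""
    seconds = int(round(seconds))
    return f"{seconds // 3600:02d}:{seconds % 3600 // 60:02d}:{seconds % 60:02d}"

def implied_runtime_s(task_size_gflop, app_speed_gflops):
    """Run time implied by the estimated task size and estimated app speed."""
    return task_size_gflop / app_speed_gflops

def remaining_from_fraction(elapsed_s, fraction_done):
    """Linear extrapolation of the remaining time from the fraction done."""
    return elapsed_s * (1.0 - fraction_done) / fraction_done

elapsed = hms_to_seconds("01:42:53")                    # Elapsed time
print(seconds_to_hms(implied_runtime_s(807, 0.06)))     # 807 GFLOPs at 0.06 GFLOPs/sec
print(seconds_to_hms(remaining_from_fraction(elapsed, 0.4216)))  # Fraction done 42.160%
```

The first figure comes out at roughly 3 hours 44 minutes for the whole task, and the second at about 2 hours 21 minutes remaining, which is close to the 02:20:58 shown in the dump.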
Profile Krzysztof Piszczek - wspieram ...
Project administrator
Project developer
Project tester
Joined: 4 Feb 15
Posts: 841
Credit: 144,180,465
RAC: 2
Message 1164 - Posted: 2 May 2016, 13:11:16 UTC - in response to Message 1163.  

This is what BOINC Manager reports, but I don't really believe the numbers... Just look at the CPU usage...
Krzysztof 'krzyszp' Piszczek

Member of Radioactive@Home team
My Patreon profile
Universe@Home on YT
Tex1954

Joined: 22 Feb 15
Posts: 23
Credit: 37,205,060
RAC: 32
Message 1165 - Posted: 2 May 2016, 17:42:36 UTC - in response to Message 1164.  

Krzysztof Piszczek wrote:
> This is what BOINC Manager reports, but I don't really believe the numbers... Just look at the CPU usage...


CPU usage is normal, as in 98% or higher. I see no problems in the longer running WU's, only that they run longer... excepting the ones that never finish! Hope you can figure those out.

8-)
Copyright © 2024 Copernicus Astronomical Centre of the Polish Academy of Sciences
Project server and website managed by Krzysztof 'krzyszp' Piszczek