Message boards : Number crunching : extreme long wu's

Sebastian M. Bobrecki
Volunteer tester
Joined: 4 Feb 15
Posts: 17
Credit: 158,222,691
RAC: 0
Message 2013 - Posted: 15 Mar 2017, 20:03:48 UTC

A quick, ugly script to abort tasks with enormous run times:

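#!/bin/sh
# Abort Universe@Home tasks whose reported run time exceeds MAX_TIME.
# Any arguments (e.g. --host, --passwd) are passed through to boinccmd.

MAX_TIME="20240"    # seconds; tune to match host speed

##########

RM="/bin/rm"
GREP="/bin/grep"
AWK="/usr/bin/awk"
TAIL="/usr/bin/tail"
DATE="/bin/date"
TMP="/tmp/task.tmp"
BCMD="/usr/bin/boinccmd $*"

# Fetch the task list unless a readable copy already exists.
if [ ! -r "${TMP}" ]; then
    ${BCMD} --get_tasks > "${TMP}"
fi

# "^ *name: " tolerates any indentation and skips the "WU name:" lines.
for TASK in $(${GREP} "^ *name: " "${TMP}" | ${AWK} -F ": " '{print $2}'); do
    # The project URL is expected two lines below the task name.
    URL=$(${GREP} -A2 "${TASK}" "${TMP}" | ${TAIL} -n1 | ${AWK} -F ": " '{print $2}')
    # POSIX sh compares strings with "=", not "==".
    if [ "${URL}" = "http://universeathome.pl/universe/" ]; then
        # The time field is expected 15 lines below the task name; keep whole seconds.
        TIME=$(${GREP} -A15 "${TASK}" "${TMP}" | ${TAIL} -n1 | ${AWK} -F ": " '{print $2}' | ${AWK} -F "." '{print $1}')
        if [ "${TIME:-0}" -ge "${MAX_TIME}" ]; then
            # printf replaces the non-portable "echo -e".
            printf '%s\tName:\t%s\tTime:\t%s\n' "$(${DATE} "+%Y.%m.%d %H:%M:%S")" "${TASK}" "${TIME}"
            ${BCMD} --task "${URL}" "${TASK}" abort
        fi
    fi
done

if [ -r "${TMP}" ]; then
    ${RM} -f "${TMP}"
fi

exit 0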

It works on Linux, but since it uses boinccmd's remote RPC it can connect over the network to any client that has RPC enabled:

./script.sh --host 192.168.x.y --passwd password

MAX_TIME is in seconds and should be tuned to match host speed. Run it from cron, for example every 5 minutes, and voilà.
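
The script above finds the project URL and the time by fixed line offsets (-A2, -A15), which may shift between client versions. A rough alternative sketch that keys on the field labels instead is shown here. The function name long_tasks is made up for illustration, and the default label "current CPU time" is an assumption; check what your own `boinccmd --get_tasks` prints and pass the right label as the third argument.

```shell
#!/bin/sh
# Print "task_name<TAB>seconds" for every task of one project whose time
# field is at or above a limit. Reads `boinccmd --get_tasks` style output
# on stdin. ASSUMPTION: the time field label defaults to "current CPU time".
long_tasks() {
    # $1 = project URL, $2 = limit in seconds, $3 = field label (optional)
    awk -v url="$1" -v max="$2" -v label="${3:-current CPU time}" '
        {
            line = $0
            sub(/^[ \t]+/, "", line)      # drop indentation
            sep = index(line, ": ")       # split "key: value" at the first ": "
            if (sep == 0) next            # separator lines such as "1) ------"
            key = substr(line, 1, sep - 1)
            val = substr(line, sep + 2)
        }
        key == "name"        { task = val; proj = "" }   # a new task record starts
        key == "project URL" { proj = val }
        key == label && proj == url && val + 0 >= max + 0 {
            printf "%s\t%d\n", task, val
        }
    '
}
```

It could be fed like this: boinccmd --get_tasks | long_tasks "http://universeathome.pl/universe/" 20240, and each printed task name could then be handed to boinccmd --task URL NAME abort as in the script.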
ID: 2013 · Rating: 0
adrianxw
Joined: 1 Oct 16
Posts: 32
Credit: 268,033
RAC: 0
Message 2014 - Posted: 15 Mar 2017, 20:19:33 UTC

I agree with Jacob: there are people who would help you if they knew what the REAL situation was. You did say, for example, that you had a fix but couldn't compile it.
ID: 2014 · Rating: 0
Krzysztof Piszczek - wspieram ...
Project administrator
Project developer
Project tester
Joined: 4 Feb 15
Posts: 841
Credit: 144,180,465
RAC: 2
Message 2017 - Posted: 16 Mar 2017, 18:28:07 UTC - in response to Message 2012.  

If there is anything you'd like me to test or try, tell me what to do and I'll do it. I want it solved, and am willing to try things for you.

Unfortunately I can't, because I can't send out the source code.
The problem is that the code compiles correctly without the BOINC libraries, but not with them.
What makes the situation strange is that the BOINC code in the sources hasn't changed for almost a year, so it is well tested by us (all of us).
We run checks nearly every day trying to find the faulty part, but that means going piece by piece through more than 20k lines of source code.

With the current app, I can't identify the problem, as it occurs only on some computers under very rare conditions and doesn't reproduce on my two development computers.
Krzysztof 'krzyszp' Piszczek

Member of Radioactive@Home team
My Patreon profile
Universe@Home on YT
ID: 2017 · Rating: 0
adrianxw
Joined: 1 Oct 16
Posts: 32
Credit: 268,033
RAC: 0
Message 2019 - Posted: 16 Mar 2017, 20:52:05 UTC
Last modified: 16 Mar 2017, 20:58:36 UTC

If the routines in the libraries have not changed, then look at the way you call the routines: the parameters you pass them, their type, number, etc. You say compile, so presumably it is not a linking problem. You will understand that without the code, or specific error messages from the compiler, it is not possible to be specific. Have you changed versions of something somewhere?

Some of our embedded systems had over a million lines of code.
ID: 2019 · Rating: 0
Krzysztof Piszczek - wspieram ...
Project administrator
Project developer
Project tester
Joined: 4 Feb 15
Posts: 841
Credit: 144,180,465
RAC: 2
Message 2020 - Posted: 16 Mar 2017, 20:59:54 UTC - in response to Message 2019.  
Last modified: 16 Mar 2017, 21:01:06 UTC

If the routines in the libraries have not changed, then look at the way you call the routines: the parameters you pass them, their type, number, etc. You say compile, so presumably it is not a linking problem. You will understand that without the code, or specific error messages from the compiler, it is not possible to be specific. Have you changed versions of something somewhere?

Some of our embedded systems had over a million lines of code.

In fact, the "compilation problem" is at the linking stage.
The linker just says that some statics are declared twice, which is not true.

Edit:
The environment hasn't changed. The machines are still based on Debian 6 and have been in the same state for two years.
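
One way to cross-check a "declared twice" report without sharing any source, sketched under assumptions: list the symbols each object file defines with nm -A and look for names that are defined in more than one file. The helper name dup_symbols is hypothetical, and the parsing assumes the usual "file:address type name" layout of nm -A output. Weak and common symbols are ignored here, since linkers normally merge those.

```shell
#!/bin/sh
# Read `nm -A` style lines on stdin and print every global symbol that is
# defined (types T, D, B, R, S) in more than one object file.
dup_symbols() {
    awk '
        NF >= 2 && $(NF-1) ~ /^[TDBRS]$/ {
            sym = $NF
            file = $1
            sub(/:[0-9A-Fa-f]*$/, "", file)     # strip the trailing ":address"
            if (!seen[sym SUBSEP file]++) {     # count each file once per symbol
                count[sym]++
                files[sym] = files[sym] " " file
            }
        }
        END {
            for (s in count)
                if (count[s] > 1) print s ":" files[s]
        }
    '
}
```

It could be run as nm -A *.o | dup_symbols, adding the BOINC static libraries to the nm command line so that clashes between the application and the libraries show up as well.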
Krzysztof 'krzyszp' Piszczek

Member of Radioactive@Home team
My Patreon profile
Universe@Home on YT
ID: 2020 · Rating: 0
Yavanius
Joined: 13 May 15
Posts: 87
Credit: 4,320,738
RAC: 5
Message 2021 - Posted: 17 Mar 2017, 17:43:32 UTC - in response to Message 1999.  

krzyszp, can't you do something to stop the bleeding, even??


Here, stick your finger right here on your neck and hold it. Hold right there.

(Pulls out a bunch of gauze tape and wraps your finger and hand to your neck tightly.)

There, bleeding stopped.

What do you mean you're losing feeling in your hand? You were bleeding out before and now you're worried about a little loss of sensation in your hand??

Geesh, people have no sense of priorities.
ID: 2021 · Rating: 0
Jacob Klein
Joined: 21 Feb 15
Posts: 53
Credit: 1,385,888
RAC: 0
Message 2022 - Posted: 17 Mar 2017, 17:56:14 UTC - in response to Message 2021.  

Hi. You are posting on a forum where computer enthusiasts care passionately about their usage of computer resources. A bug like this can result in large numbers of computers wasting their resources, and wasting the environment's resources, for no gain, in an indefinite loop with no way to exit the problem (tasks run indefinitely).

If you're not here to help, then don't post. I'm offering my help, despite my disagreement with the admins' response thus far.
ID: 2022 · Rating: 0
adrianxw
Joined: 1 Oct 16
Posts: 32
Credit: 268,033
RAC: 0
Message 2023 - Posted: 18 Mar 2017, 16:18:08 UTC

Are you using statics in your program?
ID: 2023 · Rating: 0
Krzysztof Piszczek - wspieram ...
Project administrator
Project developer
Project tester
Joined: 4 Feb 15
Posts: 841
Credit: 144,180,465
RAC: 2
Message 2024 - Posted: 18 Mar 2017, 16:37:30 UTC - in response to Message 2023.  

Are you using statics in your program?

Yes, loads of them.
Krzysztof 'krzyszp' Piszczek

Member of Radioactive@Home team
My Patreon profile
Universe@Home on YT
ID: 2024 · Rating: 0
Dariusz Kozlowski
Joined: 4 Mar 17
Posts: 1
Credit: 29,980,900
RAC: 0
Message 2025 - Posted: 18 Mar 2017, 16:58:26 UTC

I have all the symptoms described by other members above.
The elapsed time increases, but so does the remaining estimate.
From the original estimate of slightly over 2 hours, I get an elapsed time of 16 hours and over 1 hour remaining, and it keeps growing.
Restarting either the task, the project or BOINC gets these numbers back to something more reasonable (the last checkpoint), from where they keep growing again.
I did not record file names, but these tasks do get stuck, with no way of finishing them.
I'm not a scientist, and I wouldn't run a scientific project. When I hear that there are errors during compilation or linking and the app is still distributed (did I get it right?) ..... c'mon, get a software engineer to help out.
I have wasted enough time on these tasks....
Please either fix it or let me know how I can help. Until then, I'm donating my CPU where it is going to do actual work.
No hard feelings.
ID: 2025 · Rating: 0
Jacob Klein
Joined: 21 Feb 15
Posts: 53
Credit: 1,385,888
RAC: 0
Message 2026 - Posted: 18 Mar 2017, 18:36:36 UTC

krzyszp:

Would it help if I actively tried to get the problem to happen on a task on my machine(s)? Then what is the next step? Do you have debug output we can get to? Do we copy the files locally to run them outside of BOINC, to see if it is still reproducible on my machine?

Come on ---- Let's SOLVE this already! :)
ID: 2026 · Rating: 0
adrianxw
Joined: 1 Oct 16
Posts: 32
Credit: 268,033
RAC: 0
Message 2027 - Posted: 19 Mar 2017, 20:41:16 UTC

On the info available I am forced to be vague, but are the "names" of your own and BOINC's statics being hashed, so that you are getting a hash clash? Hell, this is frustrating.
ID: 2027 · Rating: 0
Krzysztof Piszczek - wspieram ...
Project administrator
Project developer
Project tester
Joined: 4 Feb 15
Posts: 841
Credit: 144,180,465
RAC: 2
Message 2028 - Posted: 20 Mar 2017, 16:18:53 UTC - in response to Message 2026.  

krzyszp:

Would it help if I actively tried to get the problem to happen on a task on my machine(s)? Then what is the next step? Do you have debug output we can get to? Do we copy the files locally to run them outside of BOINC, to see if it is still reproducible on my machine?

Come on ---- Let's SOLVE this already! :)

Yes, you can run it outside BOINC: simply copy the executable file and the param.in file to another folder and start it from the command line. This is the way I used to check it.
Sometimes I think that it MAY BE a BOINC API problem and/or a specific combination of software/hardware/drivers on a particular machine.

We will see with the new app version, but I'm still waiting for the scientists to check some parts of the code.
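
For anyone who wants to try the run outside BOINC, here is a minimal sketch of the copy-and-run step. The function name run_standalone and all paths are placeholders, and it assumes the app takes no arguments and reads param.in from its current directory, which is how the description above reads.

```shell
#!/bin/sh
# Copy the application and its param.in into a scratch directory and start it
# there, outside the BOINC client.
run_standalone() {
    # $1 = path to the application executable
    # $2 = path to param.in
    # $3 = scratch directory (created if missing)
    mkdir -p "$3" || return 1
    cp "$1" "$2" "$3"/ || return 1
    exe=$(basename "$1")
    # Run from inside the scratch directory so the app finds param.in next to it.
    ( cd "$3" && chmod +x "./$exe" && "./$exe" )
}
```

A call would look like run_standalone /path/to/app /path/to/param.in /tmp/universe_test, with the real file locations taken from the task's directory on your own host. If the task also stalls there, BOINC itself is less likely to be the trigger.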
Krzysztof 'krzyszp' Piszczek

Member of Radioactive@Home team
My Patreon profile
Universe@Home on YT
ID: 2028 · Rating: 0
Jacob Klein
Joined: 21 Feb 15
Posts: 53
Credit: 1,385,888
RAC: 0
Message 2029 - Posted: 20 Mar 2017, 17:10:50 UTC

When running a new task, what is the first indication that we've encountered the problem in this thread?
ID: 2029 · Rating: 0
Jim1348
Joined: 28 Feb 15
Posts: 253
Credit: 200,562,581
RAC: 0
Message 2030 - Posted: 20 Mar 2017, 18:44:25 UTC - in response to Message 2029.  

When running a new task, what is the first indication that we've encountered the problem in this thread?

I think the best clue is when the progress percentage does not increase, and the time remaining does increase. For me, anything over about twelve hours falls into that category thus far, but I would not count on that, since I just completed a couple of tasks that took almost eight hours, while they normally take less than three hours on this machine.
http://universeathome.pl/universe/result.php?resultid=21185473
http://universeathome.pl/universe/result.php?resultid=21185592
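
That clue can be turned into a rough check on two readings of the same task taken some minutes apart. This is only a sketch of the heuristic described above: the function name looks_stalled is made up, and the four numbers are assumed to come from the fraction done and estimated time remaining that the client reports.

```shell
#!/bin/sh
# Succeed (exit status 0) when progress has not advanced between two readings
# while the remaining-time estimate has grown.
looks_stalled() {
    # $1, $2 = fraction done (earlier, later)
    # $3, $4 = estimated seconds remaining (earlier, later)
    awk -v f0="$1" -v f1="$2" -v r0="$3" -v r1="$4" \
        'BEGIN { exit !((f1 + 0 <= f0 + 0) && (r1 + 0 > r0 + 0)) }'
}
```

For example, looks_stalled 0.42 0.42 3600 4100 && echo "possible long-runner". As noted above, some slow tasks still finish, so a single positive result is a hint rather than proof.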
ID: 2030 · Rating: 0
adrianxw
Joined: 1 Oct 16
Posts: 32
Credit: 268,033
RAC: 0
Message 2031 - Posted: 21 Mar 2017, 8:11:48 UTC

>>> Sometimes I think that it MAY BE a BOINC API problem and/or a specific combination of software/hardware/drivers on a particular machine.

At this project only...
ID: 2031 · Rating: 0
Brummig
Joined: 23 Mar 16
Posts: 95
Credit: 23,431,842
RAC: 23
Message 2032 - Posted: 21 Mar 2017, 11:56:26 UTC

Just to redress the balance, I have a Raspberry Pi that has been flawlessly running tasks throughout this "long WU" period. Some tasks have taken quite a bit longer than others, but they complete. I would be disappointed if the project was put on hold until the problem was fixed just because some computers were experiencing difficulties.

I found I had problems crunching climateprediction.net tasks on my work PC (they almost always crashed), so I simply stopped crunching them, but I can see from my negative progress in the climateprediction.net rankings that others are crunching successfully. Every so often I go back and try again.
ID: 2032 · Rating: 0
Jim1348
Joined: 28 Feb 15
Posts: 253
Credit: 200,562,581
RAC: 0
Message 2033 - Posted: 21 Mar 2017, 15:14:29 UTC - in response to Message 2032.  

I found I had problems crunching climateprediction.net tasks on my work PC (they almost always crashed), so I simply stopped crunching them, but I can see from my negative progress in the climateprediction.net rankings that others are crunching successfully. Every so often I go back and try again.

That makes the point that each project is different and may depend on subtle hardware differences to succeed. I found that I could do well on CPDN by the use of a write-cache, at least on the older projects; it seems less critical now with the new WAH2. But apparently write-contention was causing a lot of problems and people didn't realize the cause (I have posted on it).

The magic bullet on Universe is not clear yet. Ideally, it would work for everyone. I am currently investigating the idea that running other projects at the same time may sometimes trigger the long-runners. Such interactions are not impossible. There is the case now on ATLAS that running them by themselves works, while running them with other projects at the same time results in validation errors, and I can think of other cases. So I am running Universe by itself for a few weeks to see what happens.
ID: 2033 · Rating: 0
Krzysztof Piszczek - wspieram ...
Project administrator
Project developer
Project tester
Joined: 4 Feb 15
Posts: 841
Credit: 144,180,465
RAC: 2
Message 2034 - Posted: 21 Mar 2017, 15:54:05 UTC - in response to Message 2031.  

>>> Sometimes I think that it MAY BE a BOINC API problem and/or a specific combination of software/hardware/drivers on a particular machine.

At this project only...

Believe me, it's not.
In my few years (actually, about 17 years) with projects, it has happened quite often.
Some time ago we had an application version on which most AMD CPUs crashed... without any serious reason, until... a Windows update fixed something in the Windows drivers. That was during the test phase of the project, two years (or so) ago...
Krzysztof 'krzyszp' Piszczek

Member of Radioactive@Home team
My Patreon profile
Universe@Home on YT
ID: 2034 · Rating: 0
adrianxw
Joined: 1 Oct 16
Posts: 32
Credit: 268,033
RAC: 0
Message 2035 - Posted: 21 Mar 2017, 20:58:21 UTC
Last modified: 21 Mar 2017, 21:02:08 UTC

>>> I would be disappointed if the project was put on hold until the problem was fixed just because some computers were experiencing difficulties.

I wonder how you would feel if it was one of your systems that was having its resources stolen. What is more, he KNOWS the problem exists, yet continues to send the work units with it included. Help has been offered, and rejected. You say you would be disappointed; I AM disappointed.
ID: 2035 · Rating: 0

Copyright © 2024 Copernicus Astronomical Centre of the Polish Academy of Sciences
Project server and website managed by Krzysztof 'krzyszp' Piszczek