[SATLUG] xargs performance with python

Donald L Wilcox makiten at neonnightrider.com
Wed Feb 6 00:36:06 CST 2013


On 2/5/2013 10:55 PM, Brad Knowles wrote:
> On Feb 5, 2013, at 8:25 PM, Donald L Wilcox
> <makiten at neonnightrider.com> wrote:
>> When I use xargs with wget and awk on my VM, I don't see any 
>> performance changes. I was doing some research on xargs and
>> apparently it's designed for dealing with a lot of arguments and
>> parameters, but I don't know what constitutes "a lot."
> The xargs program is designed to figure out the maximum allowed
> under the circumstances, and to fill the buffers until it reaches
> that point.  If the system can handle 5000 arguments, then by
> default xargs will feed it that many.
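
Good to know. If this is the GNU findutils xargs (I haven't checked
which one the VM ships), it can apparently print the limits it
computed, which would take the guesswork out of "a lot":

  xargs --show-limits < /dev/null
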
>> My command is something like this:
>> awk -F: -v data_dir=$TMP_PATH '{print "xargs -n3 -P16 wget -q -O 
>> "data_dir"/datafiles/"$0".xml 
>> http://website.com/GetPlatform.php?id="$0}'
>> $TMP_PATH/datafiles/ids.txt
> From the man page:
> -n number
>       Set the maximum number of arguments taken from standard input
>       for each invocation of utility.  An invocation of utility will
>       use less than number standard input arguments if the number of
>       bytes accumulated (see the -s option) exceeds the specified
>       size or there are fewer than number arguments remaining for
>       the last invocation of utility.  The current default value for
>       number is 5000.
>
> and:
>
> -P maxprocs
>       Parallel mode: run at most maxprocs invocations of utility at
>       once.
> In my experience, using xargs to run things in parallel sometimes
> works well and sometimes it doesn't.  I have found the "gnu
> parallel" program to be more consistent in this regard.
> However, when you lock the system down to just using three
> arguments at a time, you're not likely to see a whole lot of
> speedup regardless.
I actually played around with the -n flag, since some sites were
suggesting -n1, and checking the manpage convinced me to go with the
default value in the end. It didn't make a difference, so I'll give
parallel a shot.
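
Off the top of my head, I'm guessing the parallel version would look
roughly like this (untested; assumes GNU parallel's default {}
replacement string and one id per line in ids.txt):

  parallel -j16 wget -q -O $TMP_PATH/datafiles/{}.xml \
    "http://website.com/GetPlatform.php?id={}" \
    < $TMP_PATH/datafiles/ids.txt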

> Also, have you checked that you're not being rate-limited or
> connection-limited by the remote end?

Yes, and there are no limits (yet) on the remote host.

>> Is there a more efficient way to read a file line by line and
>> use wget and xargs outside of awk to get these xml files? I've 
>> tried other approaches, but they don't work.
> My normal approach would be to try to construct the input file in
> such a way that you could do "wget -i filename" and have it do all
> the work for you.  If you wanted to split that up into multiple
> parallel chunks, then that would be easy enough to work out.
> But I don't have wget installed on this machine and it's pretty
> late at night, so I can't work out the exact syntax to make that
> happen.
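
That makes sense. If I'm following you, something like this (untested;
guessing at the sed, and assuming wget's -i/--input-file option and
GNU split's -n for chunking):

  sed "s|.*|http://website.com/GetPlatform.php?id=&|" \
    $TMP_PATH/datafiles/ids.txt > urls.txt
  split -n l/4 urls.txt chunk.
  for f in chunk.*; do wget -q -i "$f" & done; wait

Though I'd lose the per-id .xml filenames doing it that way.
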
> Have you tried curl?
No, but I'll give it a try tomorrow or Thursday.
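
If it behaves the way I expect, the xargs incantation would just swap
in curl (untested; assumes xargs -I substitution plus curl's -s and
-o flags):

  xargs -P16 -I{} curl -s -o $TMP_PATH/datafiles/{}.xml \
    "http://website.com/GetPlatform.php?id={}" \
    < $TMP_PATH/datafiles/ids.txt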

> -- Brad Knowles <brad at shub-internet.org> LinkedIn Profile:
> <http://tinyurl.com/y8kpxu>

Donald Wilcox        Web: http://www.neonnightrider.com
San Antonio, TX LinkedIn: http://www.linkedin.com/in/donaldwilcoxjr

