[SATLUG] xargs performance with python

Brad Knowles brad at shub-internet.org
Tue Feb 5 22:55:53 CST 2013


On Feb 5, 2013, at 8:25 PM, Donald L Wilcox <makiten at neonnightrider.com> wrote:

> When I use xargs with wget and awk on my VM, I don't see any
> performance changes. I was doing some research on xargs and apparently
> it's designed for dealing with a lot of arguments and parameters, but
> I don't know what constitutes "a lot."

The xargs program is designed to figure out the maximum command-line length the system allows, and to pack arguments into each invocation until it reaches that limit.  If the system can handle 5000 arguments at a time, then by default xargs will feed the utility that many per invocation.
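
You can see the batching behavior for yourself with something like:

     printf '%s\n' one two three four five | xargs -n 2 echo

which should run echo three times: once with two arguments, once with two more, and once with the leftover fifth.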

> My command is something like this:
> 
> awk -F: -v data_dir=$TMP_PATH '{print "xargs -n3 -P16 wget -q -O
> "data_dir"/datafiles/"$0".xml
> http://website.com/GetPlatform.php?id="$0}' $TMP_PATH/datafiles/ids.txt

From the man page:

     -n number
             Set the maximum number of arguments taken from standard input for
             each invocation of utility.  An invocation of utility will use
             less than number standard input arguments if the number of bytes
             accumulated (see the -s option) exceeds the specified size or
             there are fewer than number arguments remaining for the last
             invocation of utility.  The current default value for number is
             5000.

and:

     -P maxprocs
             Parallel mode: run at most maxprocs invocations of utility at
             once.

In my experience, using xargs to run things in parallel sometimes works well and sometimes it doesn't.  I have found the GNU parallel program to be more consistent in this regard.
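
I can't test it from here, but assuming GNU parallel is installed and ids.txt holds one id per line, something along these lines should keep sixteen downloads going at once:

     parallel -j 16 wget -q -O "$TMP_PATH/datafiles/{}.xml" \
         "http://website.com/GetPlatform.php?id={}" < "$TMP_PATH/datafiles/ids.txt"

parallel substitutes each input line for the {} placeholders, so each job fetches exactly one id.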

However, when you limit each invocation to just three arguments at a time, you're not likely to see a whole lot of speedup regardless.
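
If you do want to stay with xargs, one restructuring worth trying (untested, and assuming your xargs supports -P) is to have awk print the output file and the URL as a pair, and let xargs run wget directly rather than having awk print xargs command lines:

     awk -F: -v data_dir=$TMP_PATH \
         '{print data_dir "/datafiles/" $0 ".xml", "http://website.com/GetPlatform.php?id=" $0}' \
         $TMP_PATH/datafiles/ids.txt |
       xargs -n 2 -P 16 wget -q -O

Each pair of arguments then becomes one "wget -q -O file url" call, with up to sixteen of them running at once.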

Also, have you checked that you're not being rate-limited or connection-limited by the remote end?

> Is there a more efficient way to read a file line by line
> and use wget and xargs outside of awk to get these xml files? I've
> tried other approaches, but they don't work.

My normal approach would be to try to construct the input file in such a way that you could do "wget -i filename" and have it do all the work for you.  If you wanted to split that up into multiple parallel chunks, then that would be easy enough to work out.

But I don't have wget installed on this machine and it's pretty late at night, so I can't work out the exact syntax to make that happen.
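
Roughly, though, the shape would be something like this (untested, from memory, so double-check the flags):

     # one URL per line, then split into chunks and fetch each chunk in the background
     awk '{print "http://website.com/GetPlatform.php?id=" $0}' $TMP_PATH/datafiles/ids.txt > urls.txt
     split -l 500 urls.txt chunk.
     for f in chunk.*; do
         wget -q -i "$f" -P "$TMP_PATH/datafiles" &
     done
     wait

One thing to watch: wget will name the files after the URLs rather than as id.xml, so you may still want a rename pass afterward.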

Have you tried curl?
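
For a single fetch, the curl equivalent of the wget call above would be roughly this (again untested; ID stands in for one value from ids.txt):

     curl -s -o "$TMP_PATH/datafiles/ID.xml" "http://website.com/GetPlatform.php?id=ID"

and it should slot into the same xargs or parallel pipelines if you swap "wget -q -O" for "curl -s -o".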

--
Brad Knowles <brad at shub-internet.org>
LinkedIn Profile: <http://tinyurl.com/y8kpxu>


