Info:

Github:
Userusage's Github page can be found here.

Readme:
The full README, which covers everything from installation to usage, is available on the Github page.

Story:

This is not the first iteration of Userusage. In fact, this isn't even the first Python iteration of Userusage.

We manage a lot of servers at CU. We have 1849 systems in our database at this moment. These are everything from web servers to student dev servers, and we can't always keep a close eye on what is going on. There are plenty of tools out there to measure the "whats" in a system, but none to keep track of the "whos." What is overheating? What drive is full? What UPS is off? All of those can be answered with widely available tools (Nagios, for example) in a quick and efficient manner. Who's using all of that disk space? Now that is an equally important question with a much more difficult answer. However, the disk is going to continue being full if nobody knows or cares that it is full, and that can cause even more problems.

This is where the first iteration of Userusage came into play. The first round was a Perl program that was about 5 years old. It took a monstrous 4 hours to run on a standard-size (1TB /home) server, but it did its job. However, when I started learning Python and joined the CU Unix team in 2014, my boss handed it to me with a simple directive: "make it better." Better can mean a lot of things, but reading through a complicated man page for an hour and then running a 30-minute script is still faster (in a mono-focus sense) than four hours of waiting. If you are doing computations on the solar eclipse we had here in Boulder not too long ago, you might not have four hours to spare while space gets freed up. So the biggest goal was to speed it up.

Storage is a really complicated thing. All I can say is "thank God there are more talented people than me managing that," because physical storage is a difficult problem to solve. Because of this, I decided the best thing to do would be to lean on what was already written. Python is pretty slow compared to C, especially when you are looking at a terabyte of data. So we wanted to use standard command line tools to do the computations, parse their output with Python, and go home happy.

This is where the first iteration of Python Userusage came about. The first one did exactly what the Perl version did: it ran this command

find . -user username -print

It then used Python to build the file list for each user and total the space those files took up. Next up, we tried a few pure Python implementations.

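import os
space = 0
for dirpath, _, filenames in os.walk('.'):  # os.walk yields (dirpath, dirnames, filenames)
    paths = [os.path.join(dirpath, name) for name in filenames]
    space += sum(os.path.getsize(p) for p in paths if os.path.isfile(p))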

We also rebuilt our find command

find . -user username -type f -exec du -k {} \;

What we found was about a 25% speedup, a difference of an hour from the original four. This was good, but I really wanted better. The other important thing to note is that each of these approaches was built around one thing or the other.

What I mean by that is that each approach used either pure Python or pure C tools, because people rarely mixed the two in the way I wanted to. This meant that no matter what we tried, it would churn out results similar in speed.

The other issue is one everyone who uses Python might have said from the beginning: "subprocesses are slow." Simply put, the time it takes to shell out to bash and back is significant when you are running lots of commands.

The best thing to do in this case, following prior logic, is to run only one subprocess throughout the whole script, get back data, and parse it using Python. Both sides of the equation are doing what they are strong at and never interact with each other outside of one-way communication from the command to Python. The question now became, what command?

Find was good because we could search for a username, but we had to run a command for every username. We have hundreds of students on some systems, so that's not acceptable. That's when I found stat, a built-in utility for finding tons of information about files. A few hours of manpages later, I came up with this.

find / -type f -exec stat -c '%U %s' {} +

This churns out one line for every file, with the owner's username and the file's size in bytes on each line. Then we add up the sizes per user in a dict, use sorted() to order it, and produce the results accordingly. The time? 15 minutes, down from four hours. That is a nearly 94% reduction in run time, just from using 2 things for their strengths.
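
The add-it-up step can be sketched roughly like this. It is a minimal illustration, not the code from the Userusage repository: the function names total_by_user and rank_users and the sample lines are made up for the example, and it assumes every line looks like "username size_in_bytes".

```python
from collections import defaultdict


def total_by_user(lines):
    """Sum file sizes per user from 'username size' lines.

    Assumes each line is an owner name, whitespace, then a size in
    bytes. Lines that do not match that shape are skipped.
    """
    totals = defaultdict(int)
    for line in lines:
        # Split from the right so the size is always the last field.
        parts = line.rsplit(maxsplit=1)
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # not a 'username size' line
        user, size = parts
        totals[user] += int(size)
    return dict(totals)


def rank_users(totals):
    """Return (user, bytes) pairs, largest consumer first."""
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sample output, for illustration only.
sample = [
    "alice 4096",
    "bob 1024",
    "alice 2048",
    "not-a-valid-line",
]
ranking = rank_users(total_by_user(sample))
for user, size in ranking:
    print(f"{user}\t{size}")
```

Because the function takes any iterable of lines, it can be fed straight from a subprocess pipe without holding the whole file list in memory.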

After that I did some bugfixes, added a config file, license, and extended readme, and threw it up on Github with hopes that other people ran into the same problems we did.

Tech:

Userusage is as straightforward as it sounds, but I also made it pretty feature heavy. With the config the way it is, you can configure it once, throw it in a cronjob and fohgettaboutit. It runs (as described before) a find -exec stat command and just parses the results, nothing too fancy.
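
For the curious, the "one subprocess, one-way pipe" shape might look something like the sketch below. The helper names build_find_command and stream_lines are hypothetical and not taken from the Userusage source, and the command assumes GNU find and GNU stat (the -c flag is GNU-specific). The demo at the bottom uses a stand-in command so it runs without walking a real filesystem.

```python
import subprocess
import sys


def build_find_command(root):
    """Build the single find/stat invocation as an argument list.

    Passing a list (no shell) means '%U %s' needs no extra quoting.
    Assumes GNU find and GNU stat.
    """
    return ["find", root, "-type", "f",
            "-exec", "stat", "-c", "%U %s", "{}", "+"]


def stream_lines(command):
    """Run one subprocess and yield its stdout line by line."""
    with subprocess.Popen(command, stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:
            yield line.rstrip("\n")


# Stand-in command that prints two fake 'username size' lines;
# in real use you would pass build_find_command("/home") instead.
demo = [sys.executable, "-c", "print('alice 10'); print('bob 5')"]
lines = list(stream_lines(demo))
print(lines)
```

Reading the pipe lazily keeps memory use flat even when the command reports on millions of files.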

Status:

Userusage's status is currently DEAD. It does its job well, but I am no longer working with the CU Unix team, so I simply do not have the need to improve it further. If you would like to use it and find a bug, feel free to submit an issue or PR on Github and I'll see if I can fix it.

License:

Userusage is licensed under Beerware R42.