Oskar Pearson (oskar@linux.org.za) wrote the original version of this document, with the support of both Internet Solutions (Pty) Ltd (http://www.is.co.za/) and Qualica Technologies (Pty) Ltd (http://www.qualica.com/), both of South Africa. He gives many thanks to both of these organisations for their support.
Caching stresses certain hardware subsystems more than others. Although the key to good cache performance is good overall system performance, the following list is arranged in order of decreasing importance:
Disk random seek time
Amount of system memory
Sustained disk throughput
CPU power
Do not drastically underpower any one subsystem, or performance will suffer. In the case of catastrophic hardware failure you must have a ready supply of alternate parts. When your cache is critical, you should have a (working!) standby machine with operating system and Squid installed. This can be kept ready for nearly instantaneous swap-out. This will, of course, increase your costs, something that you may want to take into account. Chapter 13 covers standby procedures in detail.
When deciding on your cache's horsepower, many factors must be taken into account. To decide on your machine, you need an idea of the load that it will need to sustain: the peak number of requests per minute. This number indicates the number of 'objects' downloaded in a minute by clients, and can be used to get an idea of your cache load.
Computing the peak number of requests is difficult, since it depends on the browsing habits of users. This, in turn, makes deciding on the required hardware difficult. If you don't have many statistics as to your Internet usage, it is probably worth your while installing a test cache server (on any machine that you have handy) and pointing some of your staff at it. Using ratios you can estimate the number of requests with a larger user base.
When gathering statistics, make sure that you judge the 'peak' number of requests, rather than an average value. You shouldn't take the number of requests per day and divide, since your peak (during, for example, lunch hour) can be many times your average number of requests.
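One way to measure a peak rather than an average is to count requests per minute in the test cache's access log and report the busiest minute. The following is only a sketch: it assumes that the first field of every log line is a Unix timestamp in seconds (as in Squid's native access log format), and the function name and log path are illustrative.

```shell
# Report the largest number of requests seen in any single minute.
# Assumption: field 1 of each input line is a Unix timestamp in seconds.
peak_requests_per_minute() {
    awk '
        { count[int($1 / 60)]++ }          # bucket each request by minute
        END {
            peak = 0
            for (m in count) if (count[m] > peak) peak = count[m]
            print peak
        }'
}

# Usage (the log path is an assumption; adjust it to your installation):
# peak_requests_per_minute < /usr/local/squid/logs/access.log
```

To scale the result to a larger user base, multiply it by the ratio of total users to test users; this is a rough estimate only.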
It's a very good idea to over-estimate hardware requirements. Stay ahead of the growth curve too, since an overloaded cache can spiral out of control due to a transient network problem. If a cache cannot deal with incoming requests for some reason (say a DNS outage), it still continues to accept incoming requests, in the hope that it can deal with them. If no requests can be handled, the number of concurrent connections will increase at the rate that new requests arrive.
If your cache runs close to capacity, a temporary glitch can increase the number of concurrent, waiting, requests tremendously. If your cache can't cope with this number of established connections, it may never be able to recover, with current connections never being cleared while it tries to deal with a huge backlog.
Squid 2.0 may be configured to use threads to perform asynchronous Input/Output on operating systems that support Posix threads. Including async-IO can dramatically reduce your cache latency, allowing you to use a less powerful machine. Unfortunately not all systems support Posix threads correctly, so your choice of hardware can depend on the abilities of your operating system. Your choice of operating system is discussed in the next section - see if your system will support threads there.
There are numerous things to consider when buying disks. Earlier on we mentioned the importance of disks with a fast random-seek time, and with high sustained-throughput. Having the world's fastest drive is not useful, though, if it holds a tiny amount of data. To cache effectively you need disks that can hold a significant amount of downloaded data, but that are fast enough to not slow your cache to a crawl.
Seek time is one of the most important considerations if your cache is going to be loaded. If you have a look at a disk's documentation there is normally a random seek time figure. The smaller this value the better: it is the average time that the disk's heads take to move from a random track to another (in milliseconds). Operating systems do all sorts of interesting things (which are not covered here) to attempt to speed up disk access times: waiting for disks can slow a machine down dramatically. These operating system features make it difficult to estimate how many requests per second your cache can handle before being slowed by disk access times (rather than by network speed). In the next few paragraphs we ignore operating system readahead, inode update seeks and more: it's a back of the envelope approximation for your use.
If your cache does not use asynchronous Input/Output (described in the Operating system section shortly) then your cache loses a lot of the advantage gained by multiple disks. If your cache is going to be loaded (or is running anywhere approaching capacity according to the formulae below) you must ensure that your operating system supports Posix threads!
A cache with one disk has to seek at least once per request (ignoring
RAM caching of the disk and inode update times). If you have only one disk,
the formula for working out seeks per second (and hence requests per
second) is quite simple:
requests per second = 1000 / (seek time in milliseconds)
Squid load-balances writes between multiple cache disks, so if you
have more than one data disk your seeks-per-second per disk will be lower.
On almost all operating systems the number of random seeks you can perform per second grows in a semi-linear fashion as you add more disks, though some impose a small performance penalty. If you add more disks to the equation, the requests per second value becomes even more approximate! To simplify things in the meantime, we are going to assume that you use only disks with the same seek time. Our formula thus becomes:
theoretical requests per second = 1000 / ((seek time) / (number of disks))
Let's consider a less theoretical example:
I have three disks - all have 12ms seek times. I can thus (theoretically,
as always) handle:
requests per second = 1000/(12/3) = 1000/4 = 250 requests per second
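The same back-of-the-envelope calculation can be sketched as a small shell function. The function name is illustrative, and the result carries all the caveats mentioned above (no readahead, no inode updates, identical disks); fractions are truncated.

```shell
# theoretical requests per second = 1000 / (seek time in ms / number of disks)
requests_per_second() {
    # $1 = random seek time in milliseconds, $2 = number of cache disks
    awk -v seek_ms="$1" -v disks="$2" \
        'BEGIN { printf "%d\n", 1000 / (seek_ms / disks) }'
}

requests_per_second 12 1   # a single 12ms disk
requests_per_second 12 3   # three 12ms disks, as in the example above
```

The second call reproduces the figure of 250 requests per second worked out above.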
While we are on this topic: many people query the use of IDE disks in caches. IDE disks these days generally have very similar seek times to SCSI disks, and (with DMA-compatible IDE controllers) approach SCSI data transfer speeds without slowing the whole machine down.
Deciding how much disk space to allocate to Squid is difficult. For the pilot project you can simply allocate a few megabytes, but this is unlikely to be useful on a production cache.
The amount of disk space required depends on quite a few factors.
Assume that you were to run a cache just for yourself. If you were to allocate 1 gig of disk, and you browse pages at a rate of 10 megabytes per day, it will take at least 100 days for you to fill the cache.
You can thus see that the rate of incoming cache queries influences the amount of disk to allocate.
If you examine the other end of the scale, where you have 10 megabytes of disk, and 10 incoming queries per second, you will realize that at this rate your disk space will not last very long. Objects are likely to be pushed out of the cache as they arrive, so getting a hit would require two people to be downloading the object at almost exactly the same time. Note that the latter is definitely not impossible, but it happens only occasionally on loaded caches.
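The relationship between cache size and incoming data can be checked with simple shell arithmetic. This is a minimal sketch: the function name is made up for illustration, a gigabyte is taken as 1000 megabytes, and the division truncates to whole days.

```shell
# How many days of traffic fit in the cache before objects must be expelled?
days_to_fill() {
    # $1 = cache size in megabytes, $2 = new data per day in megabytes
    echo $(( $1 / $2 ))
}

days_to_fill 1000 10   # the single-user example: 1 gig of disk, 10 megabytes per day
```

For the single-user example this gives the 100 days mentioned above; a small disk behind a busy cache gives a figure close to zero.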
The above certainly appears simple, but many people do not extrapolate. The same relationships govern the expulsion of objects from your cache at larger cache store sizes. When deciding on the amount of disk space to allocate, you should determine approximately how much data will pass through the cache each day. If you are unable to determine this, you could simply use the theoretical maximum transfer rate of your line as a basis. A 1 Mbit/s line can transfer about 125 000 bytes per second. If all clients were set up to access the cache, disk would be used at about 125 KB per second, which translates to about 450 megabytes per hour. If the bulk of your traffic is transferred during an eight-hour working day, you are probably transferring 3.6 gigabytes per day. If your line was 100% used, however, you would probably have upgraded it a while ago, so let's assume you transfer 2 gigabytes per day. If you wanted to keep ALL data for a day, you would have to have 2 gigabytes of disk for Squid.
The feasibility of caching depends on two or more users visiting the same page while the object is still on disk. This is quite likely to happen with the large sites (search engines, and the default home pages in respective browsers), but the chances of two users visiting the same obscure page are slim, simply due to the volume of pages. In many cases the obscure pages are on the slowest links, frustrating users. Depending on the number of users requesting pages you should keep pages for longer, so that the chances of different users accessing the same page twice are higher. Determining this value, however, is difficult, since it also depends on the average object size, which, in turn, depends on user habits.
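The line-speed estimate above can be sketched in shell. The function name is illustrative, and the figure of eight busy hours per day is an assumption (it is the value that turns 450 megabytes per hour into 3.6 gigabytes per day); substitute your own.

```shell
# Upper bound on the data passing through the cache in a day.
daily_bytes() {
    # $1 = line speed in bits per second, $2 = busy hours per day (assumed)
    bytes_per_second=$(( $1 / 8 ))
    echo $(( bytes_per_second * 3600 * $2 ))
}

daily_bytes 1000000 8   # a 1 Mbit/s line that is busy for eight hours a day
```

The result is in bytes and describes a line that is 100% used; a real line will carry less, which is why the text above settles on a lower figure.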
Some people use RAID systems on their caches. This can dramatically increase availability, but a RAID-5 system can reduce disk throughput significantly. If you are really concerned with uptime, you may find a RAID system useful. Since the actual data in the cache store is not vital, though, you may prefer to manually fail-over the cache, simply re-formatting or replacing drives. Sure, your cache may have a lower hit-ratio for a short while, but you can easily balance this minute cost against what hardware to do automatic failover would have cost you.
You should probably base your purchase on the bandwidth description above, and use the data discussed in chapter 11 to decide when to add more disk.
Squid keeps an in-memory table of objects in RAM. Because of the way that Squid checks if objects are in the file store, fast access to the table is very important. Squid slows down dramatically when parts of the table are in swap.
Since Squid is one large process, swapping is particularly bad. If the operating system has to swap data, Squid is placed on the 'sleeping tasks' queue, and cannot service other established connections. (? hmm. it will actually get woken up straight away. I wonder if this is relevant ?)
Each object stored on disk uses about 75 bytes (? get exact value ?) of RAM in the index. The average size of an object on the Internet is about 13 KB, so if you have a gigabyte of disk space you will probably store around 80 000 objects.
At 75 bytes of RAM per object, 80 000 objects require about six megabytes of RAM. If you have 8 gigabytes of disk you will need 48 MB of RAM just for the object index. It is important to note that this excludes memory for your operating system, the Squid binary, memory for in-transit objects and spare RAM for disk cache.
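The index arithmetic above can be sketched as a shell function. The function name is illustrative; the 13 kilobyte average object size and the 75 bytes per index entry are the approximate figures quoted above and should be treated as assumptions rather than exact values.

```shell
# Rough RAM needed for the in-memory object index, in bytes.
index_ram_bytes() {
    # $1 = cache disk in gigabytes
    # $2 = average object size in kilobytes (about 13 in the text above)
    # $3 = index bytes per object (about 75 in the text above)
    objects=$(( $1 * 1024 * 1024 / $2 ))
    echo $(( objects * $3 ))
}

index_ram_bytes 1 13 75   # roughly six megabytes
index_ram_bytes 8 13 75   # roughly 48 megabytes
```

Remember that this covers the index only: the operating system, the Squid binary, in-transit objects and the disk cache all need memory on top of it.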
So, what should the sustained throughput of your disks be? Squid tends to read in small blocks, so throughput is of lesser importance than random seek times. Generally disks with fast seeks are high throughput, and most disks (even IDE disks these days) can transfer data faster than clients can download it from you. Don't blow a year's budget on really high-speed disks: go for lower seek times instead, or add more disks.
Squid is not generally CPU intensive. On startup Squid can use a lot of CPU while it works out what is in the cache, and a slow CPU can slow down access to the cache for the first few minutes upon startup. A Pentium 133 machine generally runs pretty idle, while receiving 7 TCP requests a second. A multiprocessor machine generally doesn't increase speed dramatically: only certain portions of the Squid code are threaded. These sections of code are not processor intensive either: they are the code paths where Squid is waiting for the operating system to complete something. A multiprocessor machine generally does not reduce these wait times: more memory (for caching of data) and more disks may help more.
Where I work, we run many varieties of Unix. When I first installed Squid it was on my desktop Linux machine - if I break it by mistake it's not going to cause users hassles, so I am free to do on it what I wish.
Once I had tested Squid, we decided to allow general access to the cache. I installed Squid on the fastest unused machine we had available at the time: a (then, at least) top of the range Pentium 133 with 128Mb of RAM running FreeBSD.
I was much more familiar with Linux at that stage, and eventually installed Linux on the public cache machine. Though running Linux caused some inconveniences (specifically with low per-process filehandle limits), it was the right choice, simply because I could maintain the machine better. Many times my experience with Linux has gotten me out of potentially sticky situations.
If your choice of operating system saves you time, and runs Squid, use it! Just as I didn't use Digital Unix (Squid is developed on funded Digital Unix machines at NLANR), you don't need to use Linux just because I do.
Most modern operating systems sport both similar performance and similar feature sets. If your system is commonly used and roughly Posix compliant at the source level, it will almost certainly be supported by Squid.
When was the last time you had an outage due to hardware failure? Unless you are particularly unlucky, the interval between hardware failures is long. While the quality of hardware has increased dramatically, software often does not keep pace. Many outages are caused by faulty application or operating system software. You must thus be able to pick up the pieces if your operating system crashes for some reason.
If you normally work on a specific operating system, you should probably not use your cache as a system to experiment with a new 'flavor' of Unix. If you have more experience in an operating system, you should use that system as the basis for your cache server. Customers rapidly turn off caching when a cache stops accepting requests (while you learn your way around some 'feature').
Your cache system will almost certainly form a core part of your network as soon as it is stable. You must be able to return the system to working order in minimal time in the event of a system failure, and this is where your existing experience becomes crucial. If the failure happens out of business hours you may not be able to get technical support from your vendor. A dialup ISP's hours of business differ dramatically from those of operating system vendors.
Though most operating systems support similar features, there are often no standards for functions required for some of the less commonly used operating system features. One example is transparency: many operating systems can now support transparent redirection to a local program, but almost all of them function in a different way, since there is not a real standard for the way an operating system is supposed to function in this scenario.
If you are unable to find information about Squid on your operating system, you may want to organize a trial hardware installation (assuming that you are using a commercial operating system) as a test. Only when you have the system running can you be sure that your operating system supports the required features.
Squid works on the following systems: (? List ?)
If you are using Squid without extensions like transparency and ARP access control lists, you should not have problems. For your convenience a table of operating system support of specific features is included. Since Squid is constantly being developed, it's likely that this list will change.
Squid is written on Digital Unix (?version ?) machines running the GNU C compiler (GCC). GCC is included with free operating systems such as Linux and FreeBSD, and is easily available for many other operating systems and hardware platforms. The GNU compiler adheres as closely to the ANSI C standard as possible, so if a different compiler is included with your operating system, it may (or may not) have trouble interpreting Squid's source code, depending on its level of ANSI compliance. In practice, most compilers work fine.
Some commercial compilers choose backward compatibility with older versions over ANSI compliance. These compilers generally support an option that turns on 'ANSI compliant mode'. If you have trouble compiling Squid you may have to turn this mode on. (? is this still valid? I remember things like this back in the Borland C days - though I seem to remember this on a Unix system too... ?) In the worst possible scenario you may have to compile GCC with your existing compiler and use GCC to compile Squid.
If you do not have a compiler, you may be able to find a precompiled version of GCC for your system on the Internet. Be very careful when installing software from untrusted sources. This is discussed shortly in the "precompiled binary" section.
If you cannot find versions of GCC for your platform, you may have to factor in the cost of the compiler when deciding on your operating system and hardware.
Before you even install the operating system, it's best to get an idea as to how the system will look once Squid is up and running. This will allow you to partition the disks on the machine so that their mount path will match Squid's default configuration.
Normally Squid's directory tree looks like this:
bin. The Squid binary and associated tools are stored in this directory. Some tools are included with the Squid source to help you manage and tune your cache server.
cache. Squid has to store cached data on disk somewhere. The path /usr/local/squid/cache is the default location. You can change the location of this directory by editing the Squid config file.
etc. Squid configuration files are stored in this directory. The most commonly changed file in here is squid.conf. We discuss the basic tags in that file in the next chapter.
src. Since you are likely to download the source code for Squid from the net, it is useful to compile the code where you can find it easily. I generally create a src directory and extract the code in there. This way I can revert to a previous version (without downloading it all over again). If you wish, you can easily keep Squid in your /usr/local/src directory, or delete it completely once you have installed the binaries.
Back to the cache directory: if you have more than one partition for the cached data, you can make subdirectories for each of the filesystems in the cache directory. Normally people name these directories cache1, cache2, cache3 and so forth. Your cache directories should be mounted somewhere like /usr/local/squid/cache/1/ and /usr/local/squid/cache/2/. If you have only one cache disk, you can simply name the directory /usr/local/squid/cache/.
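The layout described above can be sketched as a short shell function. It assumes two cache disks, and the function name is illustrative; on a real machine the cache/1 and cache/2 directories would be the mount points of the cache partitions rather than plain directories.

```shell
# Create the skeleton of the Squid directory tree under a given prefix.
make_squid_tree() {
    prefix="$1"
    mkdir -p "$prefix/bin" "$prefix/etc" "$prefix/src" \
             "$prefix/cache/1" "$prefix/cache/2"
}

# Usage (needs write access to the parent directory):
# make_squid_tree /usr/local/squid
```

Deciding on this layout before partitioning means the mount paths will match Squid's default configuration.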
In Squid-1.1 cache directories had to be identical in size. This is no longer the case, so if you are upgrading to Squid 2.0 you may be able to resize your cache partitions. To do this, however, you may have to repartition disks and reformat.
When you upgrade to the latest version of Squid, it's a good idea to keep the old working compiled source tree somewhere. If you upgrade to the latest Squid and encounter problems, simply kill Squid, change to the previous source directory and reinstall the old binaries. This is a lot faster than trying to remember which source tree you were running, downloading it, compiling it, applying local patches and then reinstalling.
Squid, like most daemon processes on Unix machines, normally runs as the user nobody and with the group nogroup.
For the maximum flexibility in allowing root and non-root users to manipulate the Squid configuration, you should make a new user and two new groups, specifically for the Squid system, rather than using the nobody and nogroup IDs. Throughout this book we assume that you have done so, and that a group and a user have been created (both called squid), along with a second admin group, called squidadm. The squid user's primary group should be squid, and the user's home directory should be /usr/local/squid (the default squid software install destination).
When you have multiple administrators of a cache machine, it is useful to have a dedicated squidadm group, with sub-administrators added to this group. This way, you don't have to change to the root user whenever you want to make changes to the Squid config. It's possible for users in the squidadm group to gain root access, so you shouldn't place people without root access in the squidadm group.
When the config file has been changed, a signal has to be sent to the Squid process to inform it that the config files are to be re-read. Sending signals to running processes isn't possible when the signal sender isn't the same userid as the receiver. Other config file maintainers need permission to change their user-id (either by using the 'su' command, or by logging in with another session) to either the root user or to the user Squid is running as.
In some environments cache software maintainers aren't trusted with root access, and the user nobody isn't allowed to log in. The best solution is to allow users that need to make changes to the config file access to a reload script using sudo. Sudo is available for many systems, and source code is available.
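A reload script of this kind might look something like the sketch below. It is hypothetical: the function name is made up, the default pid file location is an assumption about your installation, and it relies on Squid re-reading its configuration when it receives a hangup signal (SIGHUP).

```shell
#!/bin/sh
# Hypothetical reload script, intended to be run via sudo.
# Assumption: Squid writes its process id to the pid file named below.
reload_squid() {
    pidfile="${1:-/usr/local/squid/logs/squid.pid}"
    if [ ! -r "$pidfile" ]; then
        echo "cannot read pid file: $pidfile" >&2
        return 1
    fi
    # Ask the running Squid process to re-read its configuration.
    kill -HUP "$(cat "$pidfile")"
}

# Usage:
# reload_squid
```

Granting sudo access to a single script like this keeps the config file maintainers away from a general root shell.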
In Chapter 4 we go through the process of changing the user-id that Squid runs as, so that files Squid creates are owned by the squid user-id, and by the group squid. Binaries are owned by root, and config files are changeable by the squidadm group.
Now that your machine is ready for your Squid install, you need to download and install the Squid program. This can be done in two ways: you can download a source version and compile it, or you can download a precompiled binary version and install that, relying on someone else to do the compilation for you.
Binary versions of Squid are generally easier to install than source code versions, specifically if your operating system vendor distributes a package which you can simply install.
Installing Squid from source code is recommended. This method allows you to turn on compile-time options that may not be included in distributed binary versions (one of many examples: SNMP support is not compiled in unless it is specifically requested, and most binary versions available do not include SNMP support). If your operating system has been optimized so that Squid can run better (let's say you have increased the number of open filehandles per process) a precompiled binary will not take advantage of this tuning, since your compiler header files are probably different to the ones where the binaries were compiled.
It's also a little worrying running binaries that other people distribute (unless, of course, they are officially supplied by your operating system vendor): what if they have placed a trojan into the binary version? To ensure the security of your system it is recommended that you compile from the official source tree.
Since we suggest installing from source code first, we cover that first: if you have to download a Squid binary from somewhere, simply skip to the next sub-section: Getting a binary version of Squid.
Squid source is mirrored by numerous sites. For a list of mirrors, have a look at
Deciding which of the available files to download can become an issue, especially if you are not familiar with the Squid version naming convention. Squid is (as of this writing) in version 2. As features are added, the minor version number is incremented (Squid 2.0 becomes Squid 2.1, then Squid 2.2 and so on). Since new features may introduce new bugs, the first version including new features is distributed as a pre-release (or beta) version. The first pre-release of Squid 2.1 is called squid-2.1.PRE1-src.tar.gz. The second is squid-2.1.PRE2-src.tar.gz. Once Squid is considered stable, a general release version is distributed: the first release version is called squid-2.0.RELEASE-src.tar.gz, the second (which would include minor bugfixes) squid-2.0.RELEASE2-src.tar.gz.
In short, files are named as follows: squid-2.minor-version-number.stability-info.release-number.tar.gz. Unless you are a Squid developer, you should download the last available RELEASE version: you are less likely to encounter bugs this way.
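If you fetch files with a script, the naming convention makes it easy to tell release versions from pre-releases. The following is a minimal sketch based only on the example filenames given above; the function name is illustrative.

```shell
# Succeed only for general release tarballs, not pre-releases.
is_release_tarball() {
    case "$1" in
        squid-*.RELEASE*.tar.gz) return 0 ;;   # e.g. squid-2.0.RELEASE2-src.tar.gz
        *)                       return 1 ;;   # e.g. squid-2.1.PRE1-src.tar.gz
    esac
}

for f in squid-2.0.RELEASE-src.tar.gz squid-2.1.PRE2-src.tar.gz; do
    if is_release_tarball "$f"; then
        echo "$f: release version"
    else
        echo "$f: not a release version"
    fi
done
```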
Squid source is normally available via FTP (the File Transfer Protocol), so you should be able to download Squid source by using the ftp program, available on almost every Unix system. If you are not familiar with ftp, you can simply select the mirror closest to you with your browser and save the Squid source to your disk by right-clicking on the filename and selecting save as (do not simply click on the filename - many browsers attempt to extract compressed files, printing the tar file to your browser window: this is definitely not what you want!). Once the download is complete, transfer the file to the cache machine.
Finding binary versions of Squid to install is easy: deciding which binary to trust is more difficult. If you do not choose carefully, someone could undermine your system security. If you cannot compile Squid, but know (and trust) someone that can do it for you, get them to help. It's better than downloading a version contributed by someone that you don't know.
The worst places to download precompiled packages from are sites that accept contributions from the public at large: avoid files in paths like incoming or uploads, since the source of the file is unknown.
Mailing lists are often good places to find compiled software (though people become irritated if you do not actually make a concerted effort to find a trusted version before bothering the list). Regular contributors to mailing lists have a reputation at stake, and are likely to provide binary versions of software that actually match the official source.
Binaries compiled by people the core Squid developers (www.ircache.net) know and trust are available at ftp://squid.nlanr.net/pub/contrib/binaries/. You may be able to find a Squid binary for your operating system here.
Files can be distributed in many different ways. Generally Squid is transformed into a package that can be installed with some package tool. There are many competing package managers, so there is no way of covering them all here.
Compiling Squid is quite easy: you need the right tools to do the job, though. First, let's go through getting the tools, then you can extract the source code package, include optional Squid components (using the configure command) and then actually compile the distributed code into a binary format.
A word of warning, though: this is the stage where most people run into problems. If you haven't compiled source before, try and follow the next section in order - it shouldn't be too bad. If you don't manage to get Squid running, at least you have gained experience.
All GNU utilities mentioned below are available via FTP from the official GNU ftp site or one of its mirrors. A list of mirrors is available at http://www.gnu.org/, or download them directly from ftp://ftp.gnu.org/.
The Squid source is distributed in compressed form. First a standard tar file is created. This file is then compressed with the GNU gzip program. To decompress this file you need a copy of gzip. GCC (the GNU C Compiler) is the recommended compiler: the developers wrote Squid with it, and it is available for almost all systems. The GNU compiler is only distributed as source, creating a chicken-and-egg problem if you do not have a compiler: you may have to do an Internet search (using one of the standard search engines) to try and find a binary copy of the GNU compiler for your system.
You will also need the make program, of which there is also a GNU version easily available.
If possible, install a C debugger: the GNU debugger (GDB) is available for most platforms. Though a debugger is not necessary for installation, it is very useful in the case of software bugs (as discussed in chapter 13).
Earlier we looked at the tree structure of the /usr/local/squid directory. I suggest extracting the Squid source to the /usr/local/squid/src directory. So, create the directory and copy the downloaded Squid tar.gz file into it.
First let's decompress the file. Some versions of tar can decompress the file in one step, but for compatibility's sake we are going to do it in two steps. Decompress the tar file by running gzip -dv squid-version.tar.gz. If all has gone well you should have a file called squid-version.tar in the current directory. To get the files out of the "tarball", run tar xvf squid-version.tar.
Tar automatically puts the files into a subdirectory: something like squid-2.1.PRE2. Change into the extracted directory, and we can start configuring the Squid source.
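The two extraction steps can be wrapped in a small shell function so that the second only runs if the first succeeds. This is a sketch: the function name is illustrative, and squid-version.tar.gz stands for whichever file you downloaded.

```shell
# Decompress and unpack a gzipped tar file in two separate steps.
extract_squid_source() {
    tarball="$1"                      # e.g. squid-version.tar.gz
    gzip -dv "$tarball" || return 1   # leaves squid-version.tar behind
    tar xvf "${tarball%.gz}"          # unpacks into a subdirectory
}

# Usage, from the directory holding the download:
# extract_squid_source squid-version.tar.gz
```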
Squid features are enabled (or disabled) with the configure shell script. Some Squid features have to be specifically enabled when Squid is compiled, which can mean that you have to recompile at a later stage. There are two reasons that a feature can be disabled by default:
Operating system compatibility. Although Squid is written in as generic a way as possible, certain functions (such as async-io, transparency and ARP-based access control lists) are not available on all operating systems. When many operating systems cannot use a feature, it is included as a compile time option.
Efficiency. On a very lightly loaded cache, async-io can actually slow down requests minutely. Some system administrators may wish to disable certain features to speed up their caches.
You may be wondering why there simply aren't config file options for these less used features. For most of the features there really isn't a reason other than minimalism. Why have code sitting in the executable that isn't actually used? You can include the features that you might use at some time in the future without detrimental effects (other than a slightly larger binary), so as to avoid having to recompile the Squid source later on.
The configure program also has a second function: with some source code you have to edit a header file which tells the compiler which function calls to use on the system. This very often makes source compilation difficult. With Squid, however, the GNU configure script checks what programs, libraries and function calls are available on your system. This simplifies setup dramatically.
To make configure as generic as possible, it's actually a Bourne Shell /bin/sh script. If you have replaced your /bin/sh shell with a less Posix-capable shell (like ash) you may not be able to run configure. If this is the case you will have to change the first line of the configure script to run the full shell.
All source inclusion options are set with the command './configure option'. On most systems root doesn't have a '.' in their search path for security reasons, so you have to fully specify the path to the binary (hence the './').
To turn more than one configuration option on at once you simply append
each option to the end of the command line.
You can, for example, change the prefix install directory and turn
Async-IO on with a command like the following (more on what each of these
options is for shortly).
./configure --prefix=/usr/people/staff/oskar/squid --enable-async-io
Note that only the commonly used configuration options are included here.
To get a complete list of options you can run './configure --help'. Many of
the resulting options are standard to the GNU configure script that Squid
uses, and are used for some things like cross compilation.
If you wish to find out about some of the more obscure options you may have to ask someone on one of the relevant mailing lists, or even read the source code!
When you run configure you normally get a fairly verbose output as to what is being checked for. Most people don't need all this information, so there is an option to stop configure printing the messages that aren't important. To reduce the amount of printed output, use the --quiet option. This way you will only see error messages, not debug information.
The first time you run configure you should run it in verbose mode. The configure process can take a while on slower machines, so you should get an idea as to how long it will take to run. Should you need to submit a bug report, you should always include as much information as possible, and should include the full configure output.
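One way to keep the full configure output for a later bug report is to copy it into a file while it scrolls past. The following is only a sketch: the helper name run_logged and the log file name configure.log are my own inventions, not part of Squid.

```shell
# Run a command, showing its output on screen and keeping a copy in a log file.
# Usage: run_logged LOGFILE COMMAND [ARGS...]
# (run_logged is a hypothetical helper name, not something shipped with Squid.)
run_logged() {
    log=$1
    shift
    # 2>&1 folds error messages into the same stream, so the log is complete.
    "$@" 2>&1 | tee "$log"
}

# Example, run from the extracted Squid source directory:
#   run_logged configure.log ./configure --prefix=/usr/local/squid
```

The resulting configure.log can then be attached to a bug report as it stands.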
Some system administrators would prefer to dispense with the /usr/local/squid directory described earlier. On some systems you may even be installing Squid on a machine where you do not have root access (and can thus not create the /usr/local/squid directory). In either of these cases you will need to change your destination path.
Throughout this book I assume that you have installed Squid in the default directory. Using the default destination will make it easier for you to follow the examples in this book.
Changing the destination directory is done with the --prefix configure option. Here are some examples where we use this option.
Installing Squid in your home directory:
./configure --prefix=/usr/people/staff/oskar/squid
If you are installing Squid on a dedicated cache machine you may wish to
place all Squid-related files in the /usr/local directory. Config
files (for example) will thus live in /usr/local/etc.
./configure --prefix=/usr/local/
The memory allocation routines included with many operating systems aren't very good for the way that Squid allocates and uses memory. Squid uses the memory subsystem more intensively than most programs, since it's a single process which runs for an extended period of time and continuously allocates and frees small sections of memory. On some systems the Squid process size increases at a rapid rate. When it eventually consumes all the memory on the system, it crashes.
This option enables a different system memory allocator: DL-Malloc, by Doug Lea, which is known to be efficient for Squid's allocation patterns.
Squid will increase in size as objects are added to the disk cache, as discussed in the Hardware Requirements section. The index of objects in the disk cache takes up RAM, so make sure that you have sufficient memory in your system before deciding that the memory allocation system is at fault.
If a recently started copy of Squid uses substantially less memory than one that has been running for a few days (with the same size cache store), you might want to configure Squid to use DL-Malloc.
The included DL-Malloc memory allocation routines are not thread-safe, so you may not be able to use this option in conjunction with Async-IO. (? need to check details ?)
To use DL-Malloc, simply use the --enable-dl-malloc option:
./configure --enable-dl-malloc
Regular expressions allow you to do complex string matching, and are used for various things in the Squid config files (most notably in the rules that control how long objects stay in the cache).
On some systems you may wish to replace the default regular-expression routines with the GNU routines. This may be because the default operating system ones are incompatible with Squid or do not function correctly. If your system doesn't have regular expression libraries, Squid will automatically use the GNU library. The GNU regular expression routines are included in the default Squid source code tree, and don't have to be downloaded separately.
To enable use of the GNU libraries, simply use the --enable-gnuregex configure option.
Squid 2.0 includes a major performance increase in the form of Async-IO.
It's important to remember that Squid is one process. In many Internet daemons, more than one copy runs at a time, so if one process is blocked by a system call, it does not affect the other running copies.
Squid is only one process. If the main loop stops running for some reason, all connections are slowed. In all versions of Squid, the main loop uses the select and poll system calls to decide which connections to service. As Squid receives data from the server, it writes the data to disk and to the client.
To write data to disk, a file has to be opened on the cache drive. When lots of clients are opening and closing connections to a busy cache, the main loop has to make lots of calls to open and close network and disk filehandles (note that the word filehandle can refer to both a network connection and an on-disk file). These two functions block the flow of all data through the cache. While waiting for open to return, Squid cannot perform any other functions.
When you enable Async-IO, Squid 2.0 uses threads to open and close filedescriptors. A thread is part of the main Squid program in most ways, except that if it makes use of a blocking system call (such as open), only the thread stops, not the main loop or other threads. Note that there is not one thread per connection.
Using threads to make calls to blocking function calls reduces the latency that a cache adds to each request. (People sometimes worry about the latency that caches add, but if you have a fast enough cache the latency is not an issue - the client sees no noticeable overhead. Network overhead normally outweighs Squid overhead). Async-IO drastically reduces cache overhead when you have a loaded cache.
Unfortunately Posix threads aren't available on all operating systems. This ties your hardware choice into your choice of operating system, since if your operating system does not support threads there may be no choice but to use a faster system, or even to split the load between multiple machines. (? need a table of machines that work ?)
You should probably try and run Squid with Async-IO enabled if you have a few thousand requests per hour. Some systems only support threads properly with a fair amount of initial setup. If your load is low and Async-IO doesn't work straight away you can leave Squid in the default configuration.
Use the --enable-async-io configure option to include the async-io code into Squid.
Most modern browsers include a header with each outgoing request that includes some basic information about the user's browser and operating system. This header is called a 'user-agent' header, since it describes the agent program (the browser) used. An automated agent sends a different user-agent header, so logging user-agent headers allows you to see if someone is using an automated web fetcher program (commonly referred to as a spider) to fetch pages on their behalf. It can also be used to find statistics as to the most commonly used browsers. The captured information is written to a log file specified in the configuration file. To include the code responsible for logging this information into the Squid binary, use the --enable-useragent-log option to configure.
Enabling the Simple Network Management Protocol (SNMP) allows you to query your cache machine with one of the many SNMP tools available. If you have an existing SNMP monitoring system, you should be able to use your existing software to monitor Squid performance and retrieve usage information. This is discussed in detail in Chapter 6.
Some tools will read the Squid MIB (Management Information Base) included with Squid (as /usr/local/squid/etc/mib.txt, once Squid is installed). Some tools, on the other hand, will have to be patched to understand the MIB that Squid uses. Since most SNMP products are written with a router in mind, they may not talk to an application like Squid, since the Squid MIB is quite different from a router MIB. (For more information on Squid and SNMP, see chapter 11)
Use the --enable-snmp configure option to enable the Squid SNMP code.
Since Squid will be a very important part of your network when it is installed, you will probably have a program which simply restarts Squid if the running process exits. The RunCache program included with Squid does just this.
If you are doing maintenance on the cache system and actually wanted to kill the Squid process, having it automatically restarted as you work can be irritating, or even cause real problems.
This option puts code into Squid that kills the parent process if Squid is shut down cleanly. If Squid crashes it leaves the parent process alone, and will thus be automatically restarted.
Use the --enable-kill-parent-hack option to enable killing of the parent process on exit.
If you don't use this option, the correct procedure is to kill the parent with the kill command, and then to use the shutdown command described in the Running Squid section to shut down Squid. Do not use the 'kill' command on Squid itself if you can avoid it: Squid needs time to shut down cleanly, since it writes a complete list of objects to disk.
When writing logs of cache events and client accesses, Squid calls the gettimeofday() operating system call to determine the accurate time.
This system call can take a short while to return, leaving Squid doing nothing while it could be reading and writing data for something that doesn't require logging. The amount of time that Squid takes to make the system call is negligible on most machines, but under very high load the huge number of calls can impact overall performance. Enabling the 'time-hack' option makes Squid update the clock only once per second, reducing the overhead dramatically on such caches. This does mean that your log messages are less accurate. The log accuracy is important to some people, though. When you have accurate time stamps of how long transfers take, you can create graphs of response time, and use them to decide when you need to upgrade your machine. (More on this in chapter 11: Cache analysis).
Most people do not need to use the --enable-time-hack option. It's useful mainly on very slow machines, or on operating systems where the gettimeofday call is very slow.
All ethernet cards have a (supposedly) unique identifier which is used as an address for all network traffic destined for that card. This number is referred to as a MAC address. If the card didn't have this address the operating system would have to check every packet on the network and decide if the packet was destined for its IP address. With ethernet, however, the card's internal optimized hardware can check all the packets and decide if the packet needs to be passed up to the operating system. The network protocol that associates MAC addresses with IP numbers is known as ARP (Address Resolution Protocol).
If you want to control cache access by MAC address, you can enable ARP access control lists.
This option is only available on certain operating systems, since there is no standard method of finding the MAC address of a host when you are connected at the TCP level. As of this writing, ARP ACLs only work on Linux. If you are running an operating system that can return this information to a user-level process, use the --enable-arp-acl option to use MAC ACLs.
Squid includes multiple Inter-Cache communication protocols. By default, the original Inter-Cache protocol (ICP) is included in the source code. If you wish to include some of the less used protocols, you will need to include them at compile time. Inter-cache communication is covered in depth in chapter 8. For the initial install you should probably not enable these protocols, since they may not be used.
If you are planning on joining an existing hierarchy you should ask the hierarchy administrators as to what protocols are supported or needed. If you are setting up a new hierarchy then you should only enable these after reading the above referenced chapter.
You can enable cache digests with the --enable-cache-digests option, and the Hyper Text Caching Protocol (HTCP) with --enable-htcp.
(? I have never used this function. I think that it may be used mainly by the NLANR caches. I need to find out exactly what this is used for. This is my 'best guess' in the meantime. ?)
When Squid caches forward requests on to a destination server (or, in fact, to a parent cache) it adds headers to the request indicating both the origin IP of the requester and the IP address of the cache that is doing the forwarding (its own IP). Squid can be configured to keep track of both of these headers for access logs of incoming requests. If you have caches beneath yours, this logs the headers the client caches add.
This feature is only really useful if you are at the top of a hierarchy and want to see who the biggest users of lower caches are. Currently, you can only access the data stored in this way with the cachemgr.cgi cgi program. (? not sure ?).
You probably don't want to enable this option, but if you do, use the --enable-forw-via-db option.
When Squid is unable to fulfill a request, an error page is returned to
the user with information on what went wrong. This page can be in the
language of your choice. Squid already includes error pages in quite a
number of languages: for a list of included languages, check the contents of
the directory errors/ in the extracted source directory.
cache:~/src/squid-2.0.RELEASE> ls errors/
Bulgarian Estonian Italian Russian-1251 list
Czech French Makefile.in Russian-koi8-r
Dutch German Polish Spanish
English Hungarian Portuguese Turkish
The file 'list' contains a list of files to edit, when creating your own language error files.
Unfortunately there are not versions of the config file in different languages - only the error messages returned to users have been translated. The language defaults to English if you do not specify a language.
To use a specific language, replace language-name in the text below with something like Bulgarian.
--enable-err-language=language-name
Now that you have decided which options to use, it's time to run configure. Here's an example:
./configure --enable-err-language=Bulgarian --prefix=/usr/local
Running ./configure with the options that you have chosen should go smoothly. In the unlikely event that configure returns with an error message, here are some suggestions that may help.
The most common problem for new installers is that there is a problem with the installed compiler (or the headers) for the system.
To test this theory simply try and run configure with no options at all. If you still get an error message it is almost certainly a compiler or header file problem.
To make sure try and compile a program that uses some of the less used system calls and see if this compiles.
If your compiler doesn't compile files correctly, you might want to check if the header files exist, and if they do, the permissions on the directory and on the include files themselves.
If you have installed GCC in a non-standard directory, or if you are cross compiling, you may need configure to append options to the GCC command it uses during its tests. You can get configure to append options to the GCC command line by setting the 'CFLAGS' shell variable prior to running configure. If, for example, your compiler only works when you modify the default include directory, you can get configure to append that option to the default command line by setting and exporting CFLAGS in your (Bourne) shell before you run configure.
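Setting CFLAGS for configure might look something like the following in a Bourne shell. The include directory /opt/gcc/include is a made-up path used purely for illustration; substitute the directory where your own header files live.

```shell
# Hypothetical include directory: replace it with the real location of your headers.
CFLAGS="-I/opt/gcc/include"
# Export the variable so that configure (a child process) can see it.
export CFLAGS

# Then, in the extracted Squid source directory:
#   ./configure
```

The variable only lasts for the current shell session, so it has to be set again if you log out before re-running configure.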
Some configuration options exclude the use of others. This is another common cause of problems. To test this you should just try and run configure without any options at all, and see if the problem disappears. If so, you can try and work out which option is causing the conflict by adding each option to the configure command line one-by-one. You may find that you have to choose between two options (for example Async-IO and the DL-Malloc routines). In this case you may have to decide which of the options is the most important in your setup.
Now that you have configured Squid, you need to make the Squid binaries.
You should simply have to run make in the extracted source directory, and
a binary will be created as src/squid.
cache:/ # cd /usr/local/squid/src/squid-2.2.RELEASE
cache:/usr/local/squid/src/squid-2.2.RELEASE # make
If the compilation fails, it may be because of conflicting configure options as described in the configure section. Follow the same instructions described there to find the offending option. (You should run make clean between configure runs, to ensure that old binaries are removed.) As a start, try running configure without any options at all and then see if make completes. If this works, try additional configure options one at a time to see which one causes the problem.
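If you have a handful of options to work through, it can help to write the sequence of builds down before you start. The sketch below only prints the command lines to try, adding one option per step; it does not run anything. The function name option_steps is hypothetical, and the two options passed to it are simply examples taken from this chapter.

```shell
# Print the build commands to try, adding one configure option per step.
# This is a dry run: nothing is configured or compiled.
# (option_steps is a made-up helper name.)
option_steps() {
    opts=""
    for opt in "$@"; do
        opts="$opts $opt"
        echo "make clean; ./configure$opts && make"
    done
}

option_steps --enable-snmp --enable-async-io
# prints:
#   make clean; ./configure --enable-snmp && make
#   make clean; ./configure --enable-snmp --enable-async-io && make
```

The first step at which make fails points at the option that was added last.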
The make command creates the binary, but doesn't install it.
Running make install creates the /usr/local/squid/bin and /usr/local/squid/etc subdirectories, and copies the binaries and default config files in the appropriate directories. Permissions may not be set correctly, but we will work through all created directories and set them up correctly shortly.
This command also copies the relevant config files into the default directories. The standard config file included with the source is placed in the etc subdirectory, as are the mime.types file and the default Squid MIB file (squid.mib).
If you are upgrading (or reinstalling), make install will overwrite binary files in the bin directory, but will not overwrite your painfully manipulated configuration files. If the destination configuration file exists, make install will instead create a file called filename.default. This allows you to check if useful options have been added by comparing config files.
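Comparing your working file with the freshly installed defaults can be done with diff. This is a minimal sketch: the function name compare_with_default is my own, and the path in the usage comment assumes the default install directory.

```shell
# Show the differences between a config file and the .default copy
# that make install leaves next to it.
# (compare_with_default is a hypothetical helper name.)
compare_with_default() {
    conf=$1
    if [ -f "$conf.default" ]; then
        # Lines starting with '+' exist only in the new default file.
        diff -u "$conf" "$conf.default"
    else
        echo "no $conf.default found" >&2
        return 2
    fi
}

# Example:
#   compare_with_default /usr/local/squid/etc/squid.conf
```

Note that diff exits with a non-zero status when the files differ, which is the normal case here and not an error.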
If all has gone well you should have a fully installed (but unconfigured) Squid system setup.
Congratulations!
The first high-performance proxy-cache program was developed as part of the Harvest project. The Harvest project was an NSF (?check this info for accuracy?) funded project to create a web indexing system. Part of this project included writing a high-performance cache daemon, or cached (pronounced "Cache-Dee") to speed the re-indexing of pages. Once the project was completed the cached source code was used as the basis for many commercial cache servers, as the source was freely available. Many of the cached developers moved on to or formed companies that developed commercial cache software.
I remember first installing cached: I was boggled at the number of options in the configuration file. I tried working through the options from top to bottom, deciding which to change and which to leave. I had no idea what they all meant. As I worked through the file, I figured more and more options out, though others remained mysteries.
After a lot of changes I tried to start cached, and had no luck. It spat out loads of errors, and I couldn't connect to the machine with my web browser at all. I had no idea what the real problem was - and I changed more and more options with time. This simply buried the real problem beneath hundreds of other possible problems.
Though Squid is now easier to install, the lessons I learned then are still relevant. The default configuration file is probably right for 90% of installations - once you have Squid running, you should change the configuration file one option at a time. Don't get over-ambitious in your changes quite yet! Leave things like refresh rules until you have experimented with the basic options - what port you want your cache to accept requests on, what user to run as, and where to keep cached pages on your drives.
So that you can get Squid running, this chapter works through the basic Squid options, giving you background information and introducing you to some of the basic concepts. In later chapters you'll move on to more advanced topics.
The Squid config file is not arranged in the same order as this book. The config file also does not progress from basic to advanced config options in any specific order, but instead consists of related sections, with all hierarchy settings in a specific section of the file, all access controls in another and so forth.
To make changes detailed in this chapter you are going to have to skip around in the config file a bit. It's probably easiest to simply search for the options discussed in each subsection of this chapter, but if you have some time it will be best if you read through the config file, so that you have an idea of how sections fit together.
The chapter also points out options that may have to be changed on the other 10% of machines. If you have a firewall, for example, you will almost certainly have to configure Squid differently to someone that doesn't.
I recommend that you put all Squid configuration files and startup scripts under revision control. If you are like me, you love to play with new software. You change an option, get the program to re-read the configuration file, and see what difference it makes. By repeating this process, I learn what each option does, and at the same time I gain experience, and discover why the program is written the way it is. Quite often configuration files make no sense until you discover the overall structure of the underlying program.
The best way for you to understand each of the options in the Squid config file (and to understand Squid itself) is to experiment with the multitude of options. At some point in the experimentation, you will find that you break something. It's useful to be able to revert to a previous version (or simply to be reminded what changes you have made).
Many readers will already have used a Revision Control System. The RCS
system is included with many Unix systems, and source is freely
available. For the few that haven't used RCS, however, it's worth
including some pointers to some manual pages:
ci(1)
co(1)
rcs(1)
rcsdiff(1)
rlog(1)
One of the wonders of Unix is the ability to create scripts which reduce
the number of commands that you have to type to get something done. I have
a short script on all the machines I maintain called rvi. Using
rvi instead of vi allows me to use one command to edit files
under RCS (as opposed to the customary four). Put this file somewhere in
your path and make it executable (chmod +x rvi). You can then simply
use a command like rvi squid.conf to edit files that are under
revision control. This is a lot quicker than running each of the co,
rcsdiff and ci commands.
#!/bin/sh
# rvi: edit a file that is under RCS control with one command
# check out the latest revision and lock it
co -l "$1"
# edit with your preferred editor (vi if VISUAL is not set)
${VISUAL:-vi} "$1"
# show what changed, then check in and keep a read-only working copy
rcsdiff -u "$1"
ci -u "$1"
All Squid configuration files are kept in the directory /usr/local/squid/etc. Though there is more than one file in this directory, only one file is important to most administrators, the squid.conf file. Though there are (as of this writing) one hundred and twenty five option tags in this file, you should only need to change eight options to get Squid up and running. The other one hundred and seventeen options give you amazing flexibility, but you can learn about them once you have Squid running, by playing with the options or by reading the descriptions in chapter 10.
Squid assumes that you wish to use the default value if there is no occurrence of a tag in the squid.conf file. Theoretically, you could even run Squid with a zero length configuration file.
The remainder of this chapter works through the options that you may need to change to get Squid to run. Most people will not need to change all of these settings. You will need to change at least one part of the configuration file though: the default squid.conf denies access to all browsers. If you don't change this, Squid will not be very useful!
The first option in the squid.conf file sets the HTTP port(s) that Squid will listen to for incoming requests.
Network services listen on particular ports. Ports below 1024 can only be used by the system administrator, and are used by programs that provide basic Internet services: SMTP, POP, DNS and HTTP (web). Ports above 1024 are used for untrusted services (where a service does not run as administrator), and for transient connections, such as outgoing data requests.
Typically, web servers listen for incoming web requests (using the HyperText Transfer Protocol - HTTP) on port 80.
Squid's default HTTP port is 3128. Many people run their cache servers on a port which is easier to remember (something like 80 or 8080). If you choose a low-numbered port, you will have to start Squid as root (otherwise you are considered untrusted, and you will not be able to start Squid). Many ISPs use port 8080, making it an accepted pseudo-standard.
If you wish, you can use multiple ports by appending a second port number
to the http_port variable. Here is an example:
http_port 3128 8080
It is very important to refer to your cache server with a generic DNS name. Simply because you only have one server now does not mean that you should not plan for the future. It is a good idea to setup a DNS hostname for your proxy server. Do this right away! A simple DNS entry can save many hours further down the line. Configuring client machines to access the cache server by IP address is asking for a long, painful transition down the road. Generally people add a hostname like cache.mydomain.com to the DNS. Other people prefer the name proxy, and create a name like proxy.mydomain.com.
HTTP defines the format of both the request for information and the format of the server response. The basic aspects of the protocol are quite straightforward: a client (such as your browser) connects to port 80 and asks for the file by supplying the full path and filename that it wishes to download. The client also specifies the version of the HTTP protocol it wishes to use for the retrieval.
With a proxy request the format is only a little different. The client specifies the whole URL instead of just the path to the file. The proxy server then connects to the web server specified in the URL, and sends a normal HTTP request for the page. (? The format of HTTP requests is described in more detail in chapter 4, where you type in an HTTP request, just as a browser would send it to test that the cache is responding to requests - may use the 'client' program instead.?)
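The difference between the two request formats is easiest to see side by side. The sketch below builds both request lines from the same URL. The function names are my own, the URL is only a placeholder, HTTP/1.0 is assumed as the protocol version, and only plain http:// URLs without an explicit port are handled.

```shell
# Split an http:// URL into the host and the path.
# (All function names here are hypothetical, chosen for this illustration.)
url_host() {
    rest=${1#http://}
    echo "${rest%%/*}"
}
url_path() {
    rest=${1#http://}
    case $rest in
        */*) echo "/${rest#*/}" ;;
        *)   echo "/" ;;
    esac
}

# Request line sent straight to the web server: the path only.
direct_request_line() {
    echo "GET $(url_path "$1") HTTP/1.0"
}

# Request line sent to a proxy such as Squid: the whole URL.
proxy_request_line() {
    echo "GET $1 HTTP/1.0"
}

url=http://www.example.com/docs/index.html
echo "to $(url_host "$url") port 80: $(direct_request_line "$url")"
echo "to the proxy: $(proxy_request_line "$url")"
```

In the direct case the browser already knows which host to connect to, so only the path travels in the request; the proxy needs the whole URL because it has to make that connection itself.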
Since the format of proxy requests is so similar to a normal HTTP request, it is not especially surprising that many web servers can function as proxy servers too. Changing a web server program to function as a proxy normally involves comparatively small changes to the code, especially if the code is written in a modular manner - as is the Apache web server. In many cases the resulting server is not as fast, or as configurable, as a dedicated cache server can be.
The CERN web server httpd was the first widely available web proxy server. The whole WWW system was initially created to give people easy access to CERN data, and CERN HTTPD was thus the de-facto test-bed for new additions to the initial informal HTTP specification. Most (and certainly at one stage all) of the early web sites ran the CERN server. Many system administrators who wanted a proxy server simply used their standard CERN web server (listening on port 80) as their proxy server, since it could function as one. It is easy for the web server to distinguish a proxy request from a normal web page request, since it simply has to check if the full URL is given instead of simply a path name. Given the choice (even today) many system administrators would choose port 80 as their proxy server port simply because 'port 80 is the standard port for web requests'.
There are, however, good reasons for you to choose a port other than 80.
Running both services on the same port meant that if the system administrator wanted to install a different web server package (for extra features available in the new software) they would be limited to software that could perform both as a web server and as a proxy. Similarly, if the same sysadmin found that their web server's low-end proxy module could not handle the load of their ever-expanding local client base, they would be restricted to a proxy server that could function as a web server. The only other alternative is to re-configure all the clients, which normally involves spending a few days apologizing to users and helping them through the steps involved in changing over.
Microsoft uses the Microsoft web server (IIS) as a basis for its proxy server component, and Microsoft proxy thus only (? tried once - let's see if it's changed since ?) accepts incoming proxy requests on port 80. If you are installing a Squid system to replace either CERN, Apache or IIS running in both web-server and cache-server modes on the same port, you will have to set http_port to 80. Squid is written only as a high-performance proxy server, so there is no way for it to function as a web server, since Squid has no support for reading files from a local disk, running CGI scripts and so forth. There is, however, a workaround.
If you have both services running on the same port, and you cannot change your client PCs, do not despair. Squid can accept requests in web-server format and forward them to another server. If you have only one machine, and you can get your web server software to accept incoming requests on a non-default port (for example 81), Squid can be configured to forward incoming web requests to that port. This is called accelerator mode (since its initial purpose was to speed up very slow web servers). Squid effectively does some translation on the original request, and then simply acts as if the request were a proxy request and connects to the host: the fact that it's not a remote host is irrelevant. Accelerator mode is discussed in more detail in chapter 9. Until then, get Squid installed and running on another port, and work your way through the first couple of chapters of this book, until you have a working pilot-phase system. Once Squid is stable and tested you can move on to changing web server settings. If you feel adventurous, however, you can skip there shortly!
Cached data has to be kept somewhere. In the section on hardware sizing, we discussed the size and number of drives to use for caching. Squid cannot autodetect where to store this data, though, so you need to let Squid know which directories it can use for data storage.
The cache_dir operator in the squid.conf file is used to configure specific storage areas. If you use more than one disk for cached data, you may need more than one mount point (for example /usr/local/squid/cache1 for the first disk, /usr/local/squid/cache2 for the second). Squid allows you to have more than one cache_dir option in your config file.
Let's consider only one cache_dir entry in the
meantime. Here I am using the default values from the standard
squid.conf.
cache_dir /usr/local/squid/cache/ 100 16 256
The first option to the cache_dir
tag sets the directory where data will be stored. The prefix value
simply has /cache/ tagged onto the end and
it's used as the default directory. This directory is also made by
the make install command that we used earlier.
The next option to cache_dir is straightforward: it's a size value. Squid will store up to that amount of data in that directory. The value is in megabytes; the default is 100 megabytes.
The other two options are more complex: they set the number of subdirectories (first and second tier) to create in this directory. Squid makes lots of directories and stores a few files in each of them in an attempt to speed up disk access (finding the correct entry in a directory with one million files in it is not efficient: it's better to split the files up into lots of smaller sets of files... don't worry too much about this for the moment). I suggest that you use the default values for these options in the mean time: if you have a very large cache store you may want to increase these values, but this is covered in the section on
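To get a feel for what the two directory values mean, you can work out how many second-tier directories a cache_dir line creates and roughly how much data ends up in each. This is only back-of-the-envelope arithmetic using the default values quoted earlier (100 16 256); the function name cache_dir_layout is hypothetical, and the even spread of data across directories is an assumption.

```shell
# Rough arithmetic for a cache_dir line: SIZE_MB FIRST_TIER SECOND_TIER.
# (cache_dir_layout is a made-up helper name; an even spread of data is assumed.)
cache_dir_layout() {
    size_mb=$1
    l1=$2
    l2=$3
    # Every first-tier directory contains l2 second-tier directories.
    dirs=$((l1 * l2))
    echo "second-tier directories: $dirs"
    echo "average data per directory: $((size_mb * 1024 / dirs)) KB"
}

cache_dir_layout 100 16 256
# prints:
#   second-tier directories: 4096
#   average data per directory: 25 KB
```

With the defaults, each of the 4096 directories holds only a small amount of data, which is what keeps directory lookups quick.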
If Squid dies email is sent to the address specified with the cache_mgr tag. This address is also appended to the end of error pages returned to users if, for example, the remote machine is unreachable.
Squid can only bind to low numbered ports (such as port 80) if it is started as root. Squid is normally started by your system's rc scripts when the machine boots. Since these scripts run as root, Squid is started as root at bootup time.
Once Squid has been started, however, there is no need to run it as root. Good security practice is to run programs as root only when it's absolutely necessary, and for this reason Squid changes user and group IDs once it has bound to the incoming network port.
The cache_effective_user and cache_effective_group tags tell Squid what IDs to change to. The Unix security system would be useless if it allowed all users to change their IDs at will, so Squid only attempts to change IDs if the main program is started as root.
If you do not have root access to the machine, and are thus not starting Squid as root, you can simply leave this option commented out. Squid will then run with whatever user ID starts the actual Squid binary.
As discussed in chapter 2, this book assumes that you have created both a squid user and a squid group on your cache machine. The above tags should thus both be set to "squid".
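Assuming the squid user and group exist as described, the two lines would read something like this:

```
# Drop root privileges to this user and group after binding the port
cache_effective_user squid
cache_effective_group squid
```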
Squid can act as a proxy server for various Internet protocols. The most commonly used protocol is HTTP, but the File Transfer Protocol (FTP) is still alive and well.
FTP was written for authenticated file transfer (it requires a username and password). To provide public access, a special account is created: the anonymous user. When you log into an FTP server you use this as your username. As a password you generally use your email address. Most browsers these days automatically enter a useless email address.
It's polite to give an address that works, though. If one of your users abuses a site, it allows the site admin to get hold of you easily.
Squid allows you to set the email address that is used with the ftp_user tag. You should probably create a squid@yourdomain.example email address specifically for people to contact you on.
There is another reason to enter a proper address here: some servers require a real email address. For your proxy to log into these FTP servers you will have to enter a real email address here.
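As a small sketch, using the contact address suggested above (a placeholder, not a real mailbox):

```
# Sent as the password for anonymous FTP logins
ftp_user squid@yourdomain.example
```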
Squid could not be used in an ISP environment without a sophisticated access control system. Indeed, Squid should not be used in ANY environment without some kind of basic authentication system. It is amazing how fast other Internet users will find out that they can relay requests through your cache, and then proceed to do so.
Why? Sometimes to obfuscate their real identity, and other times because they have a fast line to you, but a slow line to the remainder of the Internet.
In many cases only the most basic level of access control is needed. If you have a small network, and do not wish to use things like user/password authentication or blocking by destination domain, you may find that this small section is sufficient for all your access control setup. If not, you should read chapter 6, where access control is discussed in detail.
The simplest way of restricting access is to only allow IPs that are on your network. If you wish to implement different access control, it's suggested that you put this in place later, after Squid is running. In the meantime, set it up, but only allow access from your PC's IP address.
Example access control entries are included in the default squid.conf. The included entries should help you avoid some of the more obscure problems, such as bandwidth-chewing loops, cache tunneling with SSL CONNECTs and other strange access problems. In chapter 6 we work through the config file's default config options, since some of them are pretty complex.
Access control is done on a per-protocol basis: when Squid accepts an HTTP request, the list of HTTP controls is checked. Similarly, when an ICP request is accepted, the ICP list is checked before a reply is sent.
Assume that you have a list of IP addresses that are to have access to your cache. If you want them to be able to access your cache with both HTTP and ICP, you would have to enter the list of IP addresses twice: you would have lines something like this:
Rule sets like the above are great for small organisations: they are straightforward.
For large organisations, though, things are more convenient if you can create classes of users. You can then allow or deny classes of users in more complex relationships. Let's look at an example like this, where we duplicate the above example with classes of users:
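A class-based rule set could be sketched roughly as follows. The acl names (myNet, partnerNet) and the address ranges are assumptions made up for illustration; they are not taken from the default squid.conf:

```
# Define each class of users once (hypothetical names and ranges)
acl myNet src 10.0.0.0/255.255.0.0
acl partnerNet src 192.168.1.0/255.255.255.0

# Then refer to the classes by name for each protocol
http_access allow myNet
http_access allow partnerNet
icp_access allow myNet
icp_access allow partnerNet
```

Each address range is now written down only once, however many protocols refer to it.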
Sure, it's more complex for this example. The benefits only become apparent if you have large access lists, or when you want to integrate refresh-times (which control how long objects are kept) and the sources of incoming requests. I am getting quite far ahead of myself, though, so let's skip back.
We need some terminology to discuss access control lists, otherwise this could become a rather long chapter. So: lines beginning with acl are (appropriately, I believe) acl lines. The lines that use these acls (such as http_access and icp_access in the above example) are called acl-operators. An acl-operator can either allow or deny a request.
So, to recap: acls are used to define classes. When Squid accepts a request it checks the list of acl-operators specific to the type of request: an HTTP request causes the http_access lines to be checked; an ICP request checks the icp_access lists.
Acl-operators are checked in the order that they occur in the file (i.e. from top to bottom). The first acl-operator line that matches causes Squid to drop out of the acl list. Squid will not check through all acl-operators if the first denies the request.
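To see why the order matters, here is a small sketch. The acl names and address ranges are assumptions chosen for illustration:

```
# badNet is a small range that sits inside the larger myNet range
acl badNet src 10.0.1.0/255.255.255.0
acl myNet src 10.0.0.0/255.0.0.0

# Checked top to bottom: the deny line must come first to have any effect
http_access deny badNet
http_access allow myNet
```

A request from 10.0.1.5 matches badNet first and is denied, even though that address also falls inside myNet. With the two http_access lines swapped, the same request would be allowed.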
In the previous example, we used a src acl: this checks that the source of the request is within the given IP range. The src acl-type accepts IP address lists in many formats, though we used the subnet/netmask in the earlier example. CIDR (Classless Inter-Domain Routing) notation can also be used here. Here is an example of the same address range in either notation:
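For instance, taking a hypothetical 10.0.0.0 network with a 16-bit network part (the acl names are made up for this sketch), the two lines below describe exactly the same range of addresses:

```
# Subnet/netmask notation
acl myNet-netmask src 10.0.0.0/255.255.0.0
# CIDR notation: 16 leading one-bits in the mask
acl myNet-cidr src 10.0.0.0/16
```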
Access control lists inherit permissions when there is no matching acl
If all acl-operators in the file are checked and no match is found, Squid falls back on the last acl-operator in the list and applies the opposite action: if the last line was an allow, the request is denied, and if it was a deny, the request is allowed. This can be confusing, so it's normally a good idea to place a final "catch-all" acl-operator at the end of the list. The simplest way to create such an operator is to create an acl that matches any IP address. This is done with a src acl with a netmask of all 0's. When the netmask arithmetic is done, Squid will find that any IP matches this acl.
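A catch-all of this kind can be sketched as below. The acl name all is an assumption here; the stock squid.conf may already define an acl by that name, in which case only the access lines are needed:

```
# Matches every source address: a netmask of all 0's
acl all src 0.0.0.0/0.0.0.0

# Final lines of each list: anything not allowed above is refused
http_access deny all
icp_access deny all
```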
Your cache server may well be on a network that is included in the relevant allow lists on your cache. If you were to run the client on the cache machine (as opposed to another machine somewhere on your network), the above acl and http_access rules would then allow you to test the cache. In many cases, however, a program running on the cache server will end up connecting to (and from) the address '127.0.0.1' (also known as localhost). Your cache should thus allow requests to come from the address 127.0.0.1/255.255.255.255. In the below example we don't allow ICP requests from the localhost address, since there is no reason to run two caches on the same machine.
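A minimal sketch of such a localhost rule follows. The acl name localhost is an assumption (the stock squid.conf may already define it, in which case the acl line can be left out), and there is deliberately no icp_access line for it:

```
# Exactly one address: the loopback interface
acl localhost src 127.0.0.1/255.255.255.255

# HTTP only; no ICP from the local machine
http_access allow localhost
```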
The squid.conf file that comes with Squid includes acls that deny all HTTP requests. To use your cache, you need to explicitly allow incoming requests from the appropriate range. The squid.conf file includes text that reads:
#
# INSERT YOUR OWN RULE(S) HERE TO ALLOW ACCESS FROM YOUR CLIENTS
#
To allow your client machines access, you need to add rules similar to the below in this space. The default access-control rules stop people exploiting your cache, so it's best to leave them in.
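Something along these lines would do for a single client network. The acl name myNet and the address range are placeholders, so substitute your own range:

```
# Hypothetical client range: replace with your own network
acl myNet src 192.168.1.0/255.255.255.0
http_access allow myNet
```

These lines need to appear before the final deny line in the http_access list, since the first matching line wins.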
Acl-operator lines are not only used for authentication. In an earlier section we discussed communication with other cache servers. Acl lines are used to ensure that requests for specific URLs are handled by your cache, not passed on to another (further away) cache.
If you don't have a parent cache (for example, a firewall or your ISP's cache), you can probably skip this section.
Let's assume that you connect to your ISP's cache server as a parent. A client machine (on your local network) connects to your cache and requests http://www.yourdomain.example/. Your cache server will look in the local cache store. If the page is not there, Squid will connect to its configured parent (your ISP's cache: across your serial link), and request the page from there. The problem, though, is that there is no need to connect across your internet line: the web server is sitting a few feet from your cache in the machine room.
Squid cannot know that it's being very inefficient unless you give it a list of sites that are "near by". This is not the only way around this problem though: your browser could be configured to ignore the cache for certain IPs and domains, and the request will never reach the cache in the first place. Browser config is covered in Chapter 5, but in the meantime here is some info on how to configure Squid to communicate directly with internal machines.
The acl-operators always_direct and never_direct determine whether to pass the connection to a parent or to proceed directly.
The following set of operators is based on the final configuration created in the previous section, but uses the never_direct and always_direct operators. It is assumed that all servers that you wish to connect to directly are in the address ranges specified with the my-iplist directives. In some cases you may run a web server on the same machine as the cache server, and the localhost acl is thus also considered local.
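A sketch of how these two operators fit together is given below. The acl name local-servers and the address ranges are assumptions for illustration; note that it uses dst acls, since it is the address of the destination server (not of the client) that decides whether a request is "near by":

```
# Destinations that should be fetched directly (hypothetical ranges)
acl local-servers dst 192.168.1.0/255.255.255.0
acl local-servers dst 127.0.0.1/255.255.255.255
acl all src 0.0.0.0/0.0.0.0

# Nearby servers: always connect directly, never via the parent
always_direct allow local-servers

# Everything else: never connect directly, always use a parent
never_direct allow all
```

The never_direct line only makes sense if you really cannot reach the outside world without a parent; leave it out if going direct is an acceptable fallback.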
The always_direct and never_direct tags are covered in more detail in Chapter 7, where we cover hierarchies in detail.
Squid always attempts to cache pages. If you have a large Intranet system, it's a waste of cache store disk space to cache your Intranet. Controlling which URLs and IP ranges not to cache is covered in detail in chapter 6, using the no_cache acl operator.
Squid supports the concept of a hierarchy of proxies. If your proxy does not have an object on disk, its default action is to connect to the origin web server and retrieve the page. In a hierarchy, your proxy can communicate with other proxies (in the hope that one of these servers will have the relevant page). You will, obviously, only peer with servers that are 'close' to you, otherwise you would end up slowing down access. If access to the origin server is faster than access to neighboring cache servers it is not a good idea to get the page from the slower link!
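As a brief taste of what that looks like (the acl name intranet and its address range are assumptions, and the syntax shown is that of the no_cache operator in the Squid releases this book covers):

```
# Hypothetical intranet range
acl intranet dst 10.0.0.0/255.0.0.0

# Fetch these objects for clients, but do not store them on disk
no_cache deny intranet
```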
Having the ability to treat other caches as siblings is very useful in some interactions. For example: if you often do business with another company, and have a permanent link to their premises, you can configure your cache to communicate with their cache. This will reduce overall latency: it's almost certainly faster to get the page from them than from the other side of the country.
When querying more than one cache, Squid does not query each in turn, and wait for a reply from the first before querying the second (since this would create a linear slowdown as you add more siblings, and if the first server stops responding, you would slow down all incoming requests). Squid thus sends all ICP queries together - without waiting for replies. Squid then puts the client's request on hold until the first positive reply from a sibling cache is received, and will retrieve the object from the fastest-replying cache server. Since the earliest returning reply packet is usually on the fastest link (and from the least loaded sibling server), your server gets the page fast.
Squid will always get the page from the fastest-responding cache - be it a parent or a sibling.
The cache_peer option allows you to specify proxy servers that
your server is to communicate with. The first line of the following
example configures Squid to query the cache machine
cache.myparent.example as a parent. Squid will communicate with
the parent on HTTP port 3128, and will use ICP to query the server using
port 3130. Configuring Squid to query more than one server is easy:
simply add another cache_peer line. The second line configures
cache.sibling.example as a sibling, listening for HTTP requests on port
8080 and ICP queries on port 3130.
cache_peer cache.myparent.example parent 3128 3130
cache_peer cache.sibling.example sibling 8080 3130
If you do not wish to query any other caches, simply leave all cache_peer lines commented out: the default is to talk directly to origin servers.
Cache peering and hierarchy interactions are discussed in quite some detail in this book. In some cases hierarchy setups are the most difficult part of your cache setup process (especially in a distributed environment like a nationwide ISP). In-depth discussion of hierarchies is beyond the scope of this chapter, so much more information is given in chapter 8. There are cases where you need at least one hierarchy line to get Squid to work at all. This section covers the basics, just for those setups.
You only need to read this material if one of the following scenarios applies to you:
You have to use your Internet Service Provider's cache.
You have a firewall.
If you have to use your Internet Service Provider's cache, you will have to configure Squid to query that machine as a parent. Configuring their cache as a sibling would probably return error pages for every URL that they do not already have in their cache.
Squid will attempt to contact parent caches with ICP for each request.
This is essentially a ping. If there is no response to this
request, Squid will attempt to go direct to the origin server. Since
(in this case, at least) you cannot bypass your ISP's cache, you may
want to reduce the latency added by this extra query. To do this,
place the default and no-query keywords at the end of
your cache_peer line:
cache_peer cache.myisp.example parent 3128 3130 default no-query
The default option essentially tells Squid "Go through this
cache for all requests. If it's down, return an error message to the
client: you cannot go direct".
The no-query option gets Squid to ignore the given ICP port (leaving the port number out will return an error), and never to attempt to query the cache with ICP.
Firewalls can make cache configuration hairy. Inter-cache protocols generally use packets which firewalls inherently distrust. Most caches (Squid included) use ICP, which is a layer on top of UDP. UDP is difficult to make secure, and firewall administrators generally disable it if at all possible.
It's suggested that you place your cache server on your DMZ (if you have one). There are a few advantages to this:
Your cache server is kept secure.
The firewall can be configured to hand off requests to the cache server, assuming it is capable.
You will be able to peer with other, outside, caches (like your ISP's), since DMZ networks generally have less rigid rule sets.
The remainder of this section should help you get Squid and your firewall to co-operate. A few cases are covered for each type of firewall: the cache inside the firewall; the cache outside the firewall; and, finally, on the DMZ.
The vast majority of firewalls know nothing about ICP. If, on the other hand, your firewall does not support HTTP, it's a good time to have a serious talk to the buyer that had an all-expenses-paid weekend on the firewall supplier. Configuring the firewall to understand ICP is likely to be painful, but HTTP should be easy.
If you are using a proxy-level firewall, your client machines are probably configured to use the firewall's internal IP address as their proxy server. Your firewall could also be running in transparent mode, where it automatically picks up outgoing web requests. If you have a fair number of client machines, you may not relish the idea of reconfiguring all of them. If you fall into this category, you may wish to put your cache on the outside (or on the DMZ) and configure the firewall to pass requests to the cache, rather than reconfiguring all client machines.
The cache is considered a trusted host, and is protected by the firewall. You will configure client machines to use the cache server in their browser proxy settings, and when a request is made, the cache server will pass the outgoing request to the firewall, treating the firewall as a parent proxy server. The firewall will then connect to the destination server. If you have a large number of clients configured to use the firewall as their proxy server, you could get the firewall to hand off incoming HTTP requests back into the network, to the cache server. This is less efficient though, since the cache will then have to re-pass these requests through the firewall to get to the outside, using the parent option to cache_peer. Since the latter involves traffic passing through the firewall twice, your load is very likely to increase. You should also beware of loops, with the cache server parenting to the firewall and the firewall handing off the cache's request back to the cache!
As described in chapter 1, Squid will also send ICP queries to parents. Firewalls don't care for UDP packets, and normally log (and then discard) such packets.
When Squid does not receive a response from a configured parent, it will mark the parent as down, and proceed to go directly.
Whenever Squid is set up to use a parent that does not support ICP, the cache_peer line should include the "default" and "no-query" options. These options stop Squid from attempting to go direct when all caches are considered down, and specify that Squid is not to send ICP requests to that parent.
Here is an example config entry:
cache_peer inside.fw.address.domain parent 3128 3130 default no-query
There are only two major reasons for you to put your cache outside the firewall:
One: Although Squid can be configured to do authentication, this can lead to the duplication of effort (you will encounter the "add new staff to 500 servers" syndrome). If you want to continue to authenticate users on the firewall, you will have to put your cache on the outside or on the DMZ. The firewall will thus accept requests from clients, authenticate them, and then pass them on to the cache server.
Two: Communicating with cache hierarchies is easy. The cache server can communicate with other systems using any protocol. Sibling caches, for example, are difficult to contact through a proxying firewall.
You can only place your cache outside if your firewall supports hand-offs. Browsers inside will connect to the firewall and request a URL, and the firewall will connect to the outside cache and request the page.
If you place your cache outside your firewall, you may find that your client PCs have problems connecting to internal web servers (your intranet, for example, may be unreachable). The problem is that the cache is unable to connect back through to your internal network (which is actually a good thing: don't change that). The best thing to do here is to add exclusions to your browser settings: this is described in Chapter 5 - you should specifically have a look at the section on browser autoconfig. In the meantime, let's just get Squid going, and we will configure browsers once you have a cache to talk to.
Since the cache is not protected by the firewall, it must be very carefully configured - it must only accept requests from the firewall, and must not run any strange services. If possible, you should disable telnet, and use something like SSH (Secure SHell) instead. The access control lists (which you will set up shortly) must only allow the firewall, otherwise people will be able to relay their requests through your cache, using your bandwidth.
If you place the cache outside the firewall, your client PCs will be configured to use the firewall as their proxy server (this is probably the case already). The firewall must be configured to hand off client HTTP requests to the cache server. The cache must be configured to only allow HTTP requests from the firewall's outside IP address. If not configured this way, other Internet users could use your cache server as a relay, using your bandwidth and hardware resources for illegitimate (and possibly illegal) purposes.
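An access list that admits only the firewall could be sketched like this. The acl name firewall and the address 192.0.2.1 are placeholders for your firewall's outside IP address, and the acl named all may already be defined in the stock squid.conf:

```
# 192.0.2.1 stands in for the firewall's outside address
acl firewall src 192.0.2.1/255.255.255.255
acl all src 0.0.0.0/0.0.0.0

# Only the firewall may use this cache; everyone else is refused
http_access allow firewall
http_access deny all
```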
With your cache server on the outside network, you should treat the machine as a completely untrusted host, lest a cracker find a hole somewhere on the system. It is recommended that you place the cache server on a dedicated firewall network card, or on a switched ethernet port. This way, if your cache server were to be cracked, the cracker would only be able to read passing HTTP data. Since the majority of sensitive information is sent via email, this would reduce the potential for sensitive data loss.
Since your cache server only accepts requests from the firewall, there is no cache_peer line needed in the squid.conf. If you have to talk to your ISP's cache you will, of course, need one: see the section on this a bit further back.
The best place for a cache is your DMZ.
If you are concerned with the security of your cache server, and want to be able to communicate with outside cache servers (using ICP), you may want to put your cache on the DMZ.
With Squid on your DMZ, internal client PCs are set up to proxy to the firewall. The firewall is then responsible for handing off these HTTP requests to the cache server (so the firewall in fact treats the cache server as a parent).
Since your cache server is (essentially) on the outside of the firewall, the cache doesn't need to treat the firewall as a parent or sibling: it only accepts requests from the firewall: it never passes them to the firewall.
If your cache is outside your firewall, you will need to configure your client PCs not to use the firewall as a proxy server for internal hosts. This is quite easy, and is discussed in the chapter on browser configuration.
Since the firewall is acting as a filter between your cache and the outside world, you are going to have to open up some ports on the firewall. The cache will need to be able to connect to port 80 on any machine on the outside world. Since some valid web servers will run on ports other than 80, you should consider allowing connections to any port from the cache server. In short, allow connections to:
Port 80 (for normal HTTP requests)
Port 443 (for HTTPS requests)
Ports higher than 1023 (site search engines often use high-numbered ports)
If you are going to communicate with a cache server outside the firewall, you will need even more ports opened. If you are going to communicate with ICP, you will need to allow UDP traffic from and to your cache machine on port 3130. You may find that the cache server that you are peering with uses different ports for reply packets. It's probably a bad idea to open all UDP traffic, though.
Squid will normally live on the inside of your packet-filtering firewall. If you have a DMZ, it may be best to put your cache on this network, as you may want to allow UDP traffic to and from the cache server (to communicate with other caches).
To configure your firewall correctly, you should make the minimum number of holes in your filter set. In the remainder of this section we assume that your internal machines can connect to the cache server unimpeded. If your cache is on the DMZ (or outside the firewall altogether) you will need to allow TCP connections from your internal network (on a random source port) to the HTTP port that Squid will be accepting requests on (this is the port that you set a bit earlier, in the "Setting Squid's HTTP Port" section of this chapter).
First, let's consider the firewall setup when you do not query any outside caches. On accepting a request, Squid will attempt to connect to a machine on the Internet at large. Almost always, the destination port will be the default HTTP port, port 80. A few percent of the time, however, the request will be destined for a high-numbered port (any port number higher than 1023 is a high-numbered port). Squid always sources TCP requests from a high-numbered port, so you will thus need to allow TCP requests (all HTTP is TCP-based) from a random high-numbered port to both port 80 and any high-numbered port.
There is another low-numbered port that you will probably need to open. The HTTPS port (used for secure Internet transactions) is normally listening on TCP port 443, so this should also be opened.
In the second situation, let's look at cache-peering. If you are planning to interact with other caches, you will need to open a few more ports. First, let's look at ICP. As mentioned previously, ICP is UDP-based. Almost all ICP-compliant caches listen for ICP requests on UDP port 3130. Squid will always source requests from port 3130 too, though other ICP-compliant caches may source their requests from a different port.
It's probably not a good idea to allow these UDP packets no matter what source address they come from. Your filter should probably specify the IP addresses for each of the caches that you wish to peer with, rather than allowing UDP packets from any source address. That should be it: you should now be able to save the config file, and get ready to start the Squid program.
Before we can start Squid, we have to create a few directories on the system. It's important that these directories have the correct permissions, otherwise someone with a login on the cache may be able to gain root access. Let's work through the default directory tree, and set the permissions on each directory correctly. Since you may have special requirements, I won't simply give you a sequence of commands to run: if you need to use different permissions, it's important to understand the possible consequences.
In Chapter 2 we created a squid user and group, and created another group, squidadm, for the people that will maintain the cache. When Squid starts up, it changes its user and group ids to squid (thanks to the cache_effective_user and cache_effective_group tags in squid.conf). Changing userids reduces the chance of a complete exploit because of a bug in Squid. It's important, however, to remember that users in the squidadm group can probably get root on your machine, so you should not put people that do not already have root on the machine in that group: it's just so that you don't have to su to root continuously.