How the Web Works

If you’ve wondered how the web works, but were afraid to ask, this article is for you.

By Bob Ray
February 22, 2022

The information in this article won’t be particularly deep or technical, and highly web-savvy purists may find that some points are simplified so much that they are not completely accurate. The goal of these articles is to help people begin to understand things, not to prepare them for writing or responding to RFCs.

With today’s tools, it’s entirely possible for a web designer, even with some coding experience in various web programming languages, to create an entire website without really understanding what’s going on behind the scenes. This is especially true for people with a background in a non-web programming language like Java or C++ who are getting started with creating websites. If this describes you and you’d like to know more about how the Web really works—but were afraid to ask—this article is for you.

I created a surprising number of successful websites before I really understood what I was doing. I learned to use HTML and CSS to create attractive (at least to me) websites, but I had no real understanding of how the pages of those sites made it from the server to the user’s screen.

NOTE: In a few places in this article, I refer to the `.htaccess` file, which is a configuration file for Apache web servers. If you are using an IIS or NGINX server, which use different configuration files with different rules, these examples will not apply, but the basic concepts will be the same.

The URL

Since you got to this page, you probably know what a URL is. URL stands for “Uniform Resource Locator” and it’s really just an address like the address for where you live. Like street addresses, each URL specifies a unique location on the Web.

The first part of the URL is a “protocol” (http://, https://, ftp://, etc.), which just specifies the rules of communication between the computers involved in the process. The http(s) protocol is fairly complex. If you really want to look at it, you can wade through the text and links here. For our purposes, we’ll assume that it’s https://, for a standard, secure web page (Google treats HTTPS as a ranking signal, so pages that are not secure are at a disadvantage).

Next comes the name of the (optional) subdomain and a dot, followed by the domain, another dot, and the Top-Level Domain (TLD), e.g., .com or .edu. Together, these make up the host, which identifies the web server that has the page. After the host come (optionally) a colon and a port number, a path with one or more directories, and finally, the name of the “page” on the host (a file name). There are two slashes after the protocol to distinguish it from the slashes in the path. There may also be a reference to a named anchor at the end of the URL (preceded by a #), which points to a specific location within the page.

There’s a nice graphic for this here. We’ll see a concrete example in a bit.

The protocol and host (including at least a domain and a TLD) are required for a legitimate URL. The path, if missing, points to the web root of the site (e.g., public_html). If the resource name (filename) is missing, it is usually assumed to be index followed by a suffix like .html, .htm, or .php. So if the path/filename part is missing or contains the name of a directory rather than a file, most hosts will look for an index.htm, index.html, or index.php file in that directory (the site root, if there is no path). The order of that search can be set in the .htaccess file, and you can also set the name of the default index file there. If there is no index file, many servers will show a list of all the files in the directory, unless told not to in the .htaccess file.
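The index-file lookup described above can be sketched in a few lines of PHP. This is only an illustration of the idea, not how Apache actually implements it; the function name `find_index_file` and the candidate list are assumptions made up for the example.

```php
<?php
/**
 * Return the first candidate index file that exists in $dir, or null.
 * Hypothetical helper for illustration; a real web server does this internally.
 */
function find_index_file(string $dir, array $candidates): ?string
{
    foreach ($candidates as $name) {
        $path = rtrim($dir, '/') . '/' . $name;
        if (is_file($path)) {
            return $name; // first match wins, so the order of the list matters
        }
    }
    return null; // no index file: the server may list the directory or refuse
}

// The order of this list plays the role of the search order set in .htaccess.
$order = ['index.html', 'index.htm', 'index.php'];
$found = find_index_file(__DIR__, $order);
echo $found ?? 'no index file found', "\n";
```

Because the first match wins, a leftover index.html will be chosen ahead of index.php with the order shown here.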

There are a couple of things relating to the index file that you can do to make your site more secure. You can tell the server not to show the file list (by adding `Options -Indexes` to the `.htaccess` file), and you can change the name of the default file the server looks for from “index” to something long and arbitrary. If you’re paranoid, you can do both.

URLs can also contain arbitrary information at the end in a “query” section. When a URL contains a question mark after the path/filename, everything after the question mark is the query. The query can contain almost anything, but usually it’s in the form of several variables and their values.

https://wordsmatter.softville.com/show-quotations.html?topic=aging

In the example URL above, https:// is the protocol, wordsmatter. is the subdomain, softville is the domain, and .com is the Top-Level Domain (TLD). The page being requested is show-quotations.html, and the query section after the question mark indicates that the requested topic is aging.
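If you want to see those parts for yourself, PHP’s built-in parse_url() and parse_str() functions will split the example URL for you. This is just a small sketch; the variable names are arbitrary.

```php
<?php
$url = 'https://wordsmatter.softville.com/show-quotations.html?topic=aging';

// parse_url() splits a URL into its named parts.
$parts = parse_url($url);

echo $parts['scheme'], "\n"; // https
echo $parts['host'], "\n";   // wordsmatter.softville.com
echo $parts['path'], "\n";   // /show-quotations.html
echo $parts['query'], "\n";  // topic=aging

// parse_str() turns the query string into an array of variables.
parse_str($parts['query'], $query);
echo $query['topic'], "\n";  // aging
```

Note that parse_url() returns the whole host as one piece. It does not separate the subdomain, domain, and TLD.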

The Request

When your browser processes a URL, it basically sends out a message, technically called a “request,” into the ether that contains the URL and some other information, some of which is optional. The first step is a check with a DNS (Domain Name System) server to get the IP address of the host (the subdomain, domain, and TLD) specified in the URL. After that, all the processing points between you and the specified server pass the request in the server’s direction until it gets there. Two identical requests made one after the other could take completely different routes to their destinations, and the response can also take multiple routes. For the purposes of this article, we’ll assume that all just happens by magic, since you rarely have to know the details. The request also carries the address of your computer, so the target server knows where to send the response.

If you’re submitting a form, the process is the same, except that the message also contains information about the fields you’ve selected or filled out on the form. The process is essentially the same whether the form posts back to the server it came from or to a server on the other side of the world.

Any time that a new web page is displayed in your browser, you can assume that a request has been sent and a response has been returned from the target server.
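To make the idea of a request a little more concrete, here is a rough sketch of the text a browser might send for the example URL used earlier. The helper name `build_get_request` is made up for this illustration, and a real browser sends many more header lines than this.

```php
<?php
/**
 * Build the text of a minimal HTTP/1.1 GET request for a URL.
 * Hypothetical helper for illustration; real browsers send many more headers.
 */
function build_get_request(string $url): string
{
    $parts = parse_url($url);

    // The request line carries the path and query, not the whole URL.
    $target = $parts['path'] ?? '/';
    if (isset($parts['query'])) {
        $target .= '?' . $parts['query'];
    }

    return "GET {$target} HTTP/1.1\r\n"
         . "Host: {$parts['host']}\r\n" // the host travels in its own header
         . "Connection: close\r\n"
         . "\r\n"; // a blank line ends the headers
}

echo build_get_request('https://wordsmatter.softville.com/show-quotations.html?topic=aging');
```

The sender’s address is not part of this text. It travels with the underlying network connection, which is how the server knows where to send the response.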

Client vs. Server

You may already be familiar with these terms, but in case you’re not, here’s a brief bit of information on them. In a typical exchange of information on the web, the web browser is the client and the computer (or collection of computers) at the host that contains the information is the server. The browser sends a message that says, “Hi, I’d like to be one of your clients.” The server does a little checking to make sure you’re not blocked and the information is available and says, “Sure, why not.”

The Web Server

The server at your web host (or your own machine if you have a localhost install) has one main job. When a user requests a page (by clicking on a link, or typing in a URL), a request is sent to the target server. If the request is an http:// or https:// request, the server looks for the requested page and (assuming that the requester is not blocked for some reason and the page is available) sends it back to the user’s browser. In the very early days of the Web, that was the whole story, and it’s still the most important part. In those days, the pages were HTML files on the server’s drive and were essentially just read from the disk and sent directly to the user’s browser for display.

These days, the process is a little more complicated, but still fairly basic. If there is an .htaccess file (or other routing file) on the site, the server will apply any rules in that file that alter the page request before looking for the requested resource (or telling a banned user to get lost). The resource requested in the URL may be a physical file on the server, but in the case of a Content Management System (CMS) like MODX or WordPress, it may not be a file at all. In many CMS platforms, the HTML code for the requested page may be stored in a database. The returned page may also contain images, CSS code, and JavaScript code in addition to HTML.

On modern servers, something else may happen before the page is returned. If the requested page has a file extension other than .htm or .html (e.g., .php, .aspx, .cfm, .rhtml, etc.), the server will recognize it as needing further processing. If the server has an installed engine for processing the particular extension, it will pass the content of the requested page to the engine, then return the result it gets back from the engine to the user’s browser. The engine most often converts the results of the page’s code to HTML and returns it to the server, which sends it on to the client that requested it.

As we saw above, the returned page may contain images, CSS, or JavaScript. That content can be included directly in the returned page, but it often isn’t. Often, just the URL of the image, CSS, or JS file is returned in the page, inside a tag that identifies what it is. In other words, the server is telling the user’s browser, “it’s here, get it yourself.” In that case, the browser sends another request to the included URL (unless it already has a copy in the browser cache), which may take a different route than the original request. Every request takes time because it involves a round-trip between the client and the server identified in the URL. That is why it can speed things up a lot to combine multiple JS or CSS files into a single file: it cuts down the number of requests that have to be made before the page can be rendered in the browser.
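Combining files can be as simple as gluing their contents together. The sketch below shows the basic idea in PHP. The function name `combine_files` and the file names in the usage comment are hypothetical, and real build tools usually also minify the result.

```php
<?php
/**
 * Concatenate several CSS (or JS) files into one string.
 * Hypothetical helper for illustration; real build tools also minify.
 */
function combine_files(array $paths): string
{
    $combined = '';
    foreach ($paths as $path) {
        if (!is_readable($path)) {
            continue; // this sketch silently skips missing files
        }
        // A comment marks where each file starts (valid in both CSS and JS).
        $combined .= '/* ' . basename($path) . " */\n";
        $combined .= rtrim(file_get_contents($path)) . "\n";
    }
    return $combined;
}

// Usage (hypothetical file names): one request instead of two.
// file_put_contents('site.css', combine_files(['reset.css', 'layout.css']));
```

The order of the list matters, since later CSS rules override earlier ones and scripts may depend on code loaded before them.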

Another way to speed things up is to have the pages compressed before being sent. That reduces the amount of raw data that has to be transferred in the response. The compression is done by a compression engine on the server and is usually triggered by a directive in the .htaccess file. If the server sees such a directive, it knows that just before returning the response, it should submit the response to the compression engine and return what it gets back to the user’s browser, along with information that tells the browser that the information is compressed and what compression method has been used. The browser’s request lists the compression methods it understands, so the server can pick one the browser knows how to decompress.
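You can get a feel for how much compression helps with a quick experiment. The sketch below assumes PHP’s zlib extension is installed (it usually is) and uses a made-up, repetitive chunk of HTML as a stand-in for a page. On a real server, the compression is handled by the server itself, not by code like this.

```php
<?php
// A repetitive string stands in for an HTML page; markup compresses well.
$page = str_repeat('<li class="item">How the Web Works</li>' . "\n", 200);

// gzencode() produces gzip data, one of the common compression methods.
$compressed = gzencode($page, 6);

// gzdecode() plays the browser's part and restores the original bytes.
$restored = gzdecode($compressed);

printf("original: %d bytes, compressed: %d bytes\n", strlen($page), strlen($compressed));
var_dump($restored === $page); // bool(true)
```

Real pages are less repetitive than this sample, so the savings will be smaller, but text-based content like HTML, CSS, and JS generally compresses well.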

Server Variables

In PHP, information about the request (and some other things) is available in the $_SERVER array. Some of it comes from information in the request itself and some from information the server already knows. The name of the server itself, for example, is in the $_SERVER['SERVER_NAME'] variable. The $_SERVER['HTTP_HOST'] variable contains the host name the client asked for, taken from the request’s Host header. The $_SERVER['REMOTE_ADDR'] variable contains the IP address of the machine that sent the request. There are many variables in the $_SERVER array, but not all of them are reliable: anything that comes from the request itself, such as HTTP_HOST, can be spoofed by a clever programmer.

For a complete list of the $_SERVER variables, see this page. You can also see the $_SERVER variables and their values by putting this code on a page and viewing it. (In the case of MODX, you’d put the code in a Snippet and put a tag for the Snippet in the page):

$output = '<pre>';
$output .= print_r($_SERVER, true);
$output .= '</pre>';
echo $output;
/* Or in a MODX snippet:
    return $output;
*/

Request Variables

Part of the information in a request the server receives is converted to a PHP array called $_REQUEST. The $_REQUEST array often contains information from a form the user has submitted and is processed by the PHP of the requested page, but it can be used for anything.

The simplest part of the $_REQUEST array is the $_GET array, which is constructed from the URL itself. Earlier, we saw an example of a URL with a query section at the end. Suppose a form contains two fields, term and topic. If the form’s method is “get,” the browser will automatically add a question mark onto the end of the URL and then tack on those two variables and their values, separated by an ampersand (for example, ?term=modx&topic=plugins). When the server processes that request, it will automatically put those two variables into the $_GET array:

$_GET = array(
    'term' => 'modx',
    'topic' => 'plugins'
);

In PHP code on the requested page, you might see those variables extracted from the $_GET array like this:

$term = $_GET['term'];
$topic = $_GET['topic'];
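One caution: if a variable is missing from the URL, reading it directly produces a warning, and anything a user can type into the address line should be treated with suspicion. Here is a slightly more defensive sketch. The helper name `request_value` and the default value 'general' are made up for the example.

```php
<?php
/**
 * Read a value from a request array, falling back to a default when it is missing.
 * Hypothetical helper name; pass $_GET, $_POST, or any other array.
 */
function request_value(array $source, string $key, string $default = ''): string
{
    $value = $source[$key] ?? $default;
    // Arrays can arrive via ?term[]=x, so accept only plain strings here.
    return is_string($value) ? trim($value) : $default;
}

$term  = request_value($_GET, 'term');
$topic = request_value($_GET, 'topic', 'general');

// Escape user input before echoing it back into an HTML page.
echo htmlspecialchars($topic, ENT_QUOTES, 'UTF-8'), "\n";
```

The same helper works for form fields sent with the “post” method if you pass it $_POST instead of $_GET.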

The $_POST array works the same way, except that the information is sent in the body of the request rather than in the URL. Using the $_POST array (by setting a form’s method to “post,” for example) is generally considered safer than using $_GET, since it keeps the information out of the address line. It is not really hidden, though. It’s easy enough to use tools that will display or alter the $_POST data as it’s sent; it’s just a little less convenient than simply typing the variables into the browser’s address line, which is what you do to create $_GET variables.

The third member of the $_REQUEST array is the $_COOKIE array, which contains any cookie information stored on the user’s machine related to the current website. If a server’s response to a request contains cookies, the user’s browser will store them (unless they’ve turned cookies off). The next time the user visits a page at the same site, the browser will send any stored cookies in the request. Cookies allow information to persist across visits. They can allow users to be permanently logged in, for example, or store their preferences. For example, cookies let Amazon remind you what you were looking at and make suggestions based on past visits.
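As a small sketch of the cookie round trip, the code below stores a display preference and reads it back on a later request. The cookie name theme, its allowed values, and both function names are assumptions made up for this example.

```php
<?php
/**
 * Ask the browser to store a preference for 30 days.
 * Must run before any page output, because cookies travel in the response headers.
 */
function save_theme(string $theme): bool
{
    return setcookie('theme', $theme, time() + 30 * 24 * 60 * 60, '/');
}

/**
 * Read the preference back on a later request, with a default for first-time visitors.
 */
function current_theme(array $cookies, string $default = 'light'): string
{
    $theme = $cookies['theme'] ?? $default;
    // The cookie comes from the user's machine, so accept only known values.
    return in_array($theme, ['light', 'dark'], true) ? $theme : $default;
}

echo current_theme($_COOKIE), "\n";
```

Since cookies are stored on the user’s machine, the user can edit them, which is why the sketch checks the value against a short list instead of trusting it.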

One last point about the $_REQUEST array for coders: you can usually get $_GET, $_POST, or $_COOKIE data from the $_REQUEST array, but the arrays are separate entities. (Which of the three $_REQUEST includes, and which one wins when names collide, depends on the request_order setting in php.ini; many configurations leave cookies out.) In other words, if you modify the $_GET, $_POST, or $_COOKIE array yourself, the $_REQUEST array will not change, and vice versa.
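You can confirm that the arrays are separate with a quick test. The function name and the demo_flag key below are made up for the illustration.

```php
<?php
/**
 * Change $_GET at runtime and report whether $_REQUEST picked up the change.
 */
function request_sees_get_change(): bool
{
    $_GET['demo_flag'] = 'added later';
    $seen = isset($_REQUEST['demo_flag']);
    unset($_GET['demo_flag']); // tidy up after the experiment
    return $seen;
}

var_dump(request_sees_get_change()); // bool(false)
```

PHP fills $_REQUEST once, when the request starts, so later changes to $_GET are not copied over.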

Back to the Browser (Client)

We’ve talked about how servers respond to requests, but what about when the response arrives back at the user’s browser? Before displaying the page, the browser makes further requests for any URLs contained in the response to get any required images, JS, or CSS. It plugs those into the page, uses the CSS as a guide, and renders the page’s HTML for viewing. It also executes any JavaScript code it encounters and alters the page accordingly.

Unfortunately, different browsers have different ideas about how to interpret and render the data they receive, so cross-browser testing is still necessary in many cases. I hope I live to see the day when every browser renders the same code in the same way, but I’m not counting on it.

How This Relates to MODX

In MODX, requests from a browser are tied to specific Resources. When a request comes in, the server first looks for the actual file named in the URL (e.g., MyFile.html). If it finds it, that file is returned to the client (browser) and MODX is not involved.

The .htaccess file on a MODX website (which is read before MODX gets involved) contains a rule that says if the file is not found, the request is re-routed to the MODX index.php file. This leads to an interesting and hard-to-diagnose error if you have an actual index.htm or index.html file in addition to the MODX index.php file at the root of your site. Most servers will serve up .htm or .html files right away if they find them. This can happen if you try to convert a non-MODX website to a MODX one and leave the old index.htm or index.html file in place. That old file will be served up for every request where the named file doesn’t exist. MODX will never know the request exists.

When the request makes it to the MODX index.php file, MODX looks for a Resource in the database that corresponds to the one named in the request. If it doesn’t find one, it returns the content of the MODX error (page not found) page. This is always the page with the Resource ID specified in the MODX error_page System Setting. By default, this is the Home Page resource (id = 1) created by MODX in all original installs. If you keep ending up at your home page, MODX is not finding the requested page in the database, usually because of an error in the path to the file or an incomplete switch to Friendly URLs. For Friendly URLs, you need to set the MODX Friendly URL System Settings, and uncomment the Friendly URL section of the MODX .htaccess file.

If MODX does find the requested resource in the database, it checks to see if it’s published and the current user is allowed to see it, and if so, it does all its MODX magic to render the page (e.g., getting the template, getting the page content and other fields from the database, processing tags, etc.), then sends it back to the browser.
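The decision flow described in this section can be summarized in a short sketch. This is not MODX code and uses nothing from the MODX API: plain arrays stand in for the file system and the database, every name is made up, and it assumes (for simplicity) that an unpublished resource is treated like a missing one. Permission checks are left out.

```php
<?php
/**
 * Decide what to serve for a request path, following the steps described above.
 * Hypothetical sketch; none of these names come from the MODX API.
 */
function route_request(string $path, array $files, array $resources, int $errorPageId): string
{
    if (in_array($path, $files, true)) {
        return "file:{$path}"; // a real file exists, so MODX is never involved
    }

    $resource = $resources[$path] ?? null;
    if ($resource === null || !$resource['published']) {
        return "resource:{$errorPageId}"; // fall back to the error_page resource
    }

    return "resource:{$resource['id']}"; // found: render this resource
}

$files     = ['/MyFile.html'];
$resources = ['/about.html' => ['id' => 5, 'published' => true]];

echo route_request('/MyFile.html', $files, $resources, 1), "\n";  // file:/MyFile.html
echo route_request('/about.html', $files, $resources, 1), "\n";   // resource:5
echo route_request('/missing.html', $files, $resources, 1), "\n"; // resource:1
```

With the error page set to resource 1, the last line shows why a mistyped URL can land you on the home page.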

Summing Up

The key to understanding how the web works is the request. Web browsers make requests, and servers respond to them. The request contains lots of information, and the server uses that information to tailor the response. The process is relatively simple, though (as long as you ignore the intricacies of the http(s) protocol), and understanding it can help you bring your web skills to the next level.


Bob Ray is the author of MODX: The Official Guide and dozens of MODX Extras including QuickEmail, NewsPublisher, SiteCheck, GoRevo, Personalize, EZfaq, MyComponent, and many more. His website is Bob’s Guides. It includes not only a plethora of MODX tutorials but also some really great bread recipes.