The io_write() function

Another easy function to understand is the io_write() function. It gets a little more complicated because we have to handle allocating blocks when we run out (i.e., when we need to extend the file because we have written past the end of the file).

The io_write() functionality is presented in two parts, one is a fairly generic io_write() handler, the other is the actual block handler that writes the data to the blocks.

The generic io_write() handler looks at the current size of the file, the OCB's offset member, and the number of bytes being written to determine if the handler needs to extend the number of blocks stored in the fileblocks member of the extended attributes structure. Once that determination is made, and blocks have been added (and zeroed!), then the RAM-disk-specific write handler, ramdisk_io_write(), is called.

The following diagram illustrates the case where we need to extend the blocks stored in the file:

Figure 1. A write that overwrites existing data in the file, adds data to the “unused” portion of the current last block, and then adds one more block of data.

The following shows what happens when the RAM disk fills up. Initially, the write would want to perform something like this:

Figure 2. A write that requests more space than exists on the disk.

However, since the disk is full (we could allocate only one more block), we trim the write request to match the maximum space available:

Figure 3. A write that's been trimmed due to lack of disk space.

There was only 4 KB more available, but the client requested more than that, so the request was trimmed.

int
cfs_io_write (resmgr_context_t *ctp, io_write_t *msg,
              RESMGR_OCB_T *ocb)
{
  cfs_attr_t    *attr;
  int       i;
  off_t       newsize;

  if ((i = iofunc_write_verify (ctp, msg, ocb, NULL)) != EOK) {
    return (i);
  }

  // shortcuts
  attr = ocb -> attr;
  newsize = ocb -> offset + msg -> i.nbytes;

  // 1) see if we need to grow the file
  if (newsize > attr -> attr.nbytes) {
    // 2) truncate to new size using TRUNCATE_ERASE
    cfs_a_truncate (attr, newsize, TRUNCATE_ERASE);
    // 3) if it's still not big enough
    if (newsize > attr -> attr.nbytes) {
      // 4) trim the client's size
      msg -> i.nbytes = attr -> attr.nbytes - ocb -> offset;
      if (!msg -> i.nbytes) {
        return (ENOSPC);
      }
    }
  }

  // 5) call the RAM disk version
  return (ramdisk_io_write (ctp, msg, ocb));
}

The code walkthrough is as follows:

We compare the newsize (derived by adding the OCB's offset plus the number of bytes the client wants to write) against the current size of the resource. If the newsize is less than or equal to the existing size, then it means we don't have to grow the file, and we can skip to step 5.
We decided that we needed to grow the file. We call cfs_a_truncate(), a utility function, with the parameter TRUNCATE_ERASE. This will attempt to grow the file to the required size by adding zero-filled blocks. However, we could run out of space while we're doing this. There's another flag we could have used, TRUNCATE_ALL_OR_NONE, which would either grow the file to the required size or not. The TRUNCATE_ERASE flag grows the file to the desired size, but does not release newly added blocks in case it runs out of room. Instead, it simply adjusts the base attributes structure's nbytes member to indicate the size it was able to grow the file to.
Now we check to see if we were able to grow the file to the required size.
If we can't grow the file to the required size (i.e., we're out of space), then we trim the size of the client's request by storing the actual number of bytes we can write back into the message header's nbytes member. (We're pretending that the client asked for fewer bytes than they really asked for.) We calculate the number of bytes we can write by subtracting the total available bytes minus the OCB's offset member.
Finally, we call the RAM disk version of the io_write() routine, which deals with getting the data from the client and storing it in the disk blocks.

As mentioned above, the generic io_write() function isn't doing anything that's RAM-disk-specific; that's why it was separated out into its own function.

Now, for the RAM-disk-specific functionality. The following code implements the block-management logic (refer to the diagrams for the read logic):

int
ramdisk_io_write (resmgr_context_t *ctp, io_write_t *msg,
                  RESMGR_OCB_T *ocb)
{
  cfs_attr_t  *attr;
  int         sb;      // startblock
  int         so;      // startoffset
  int         lb;      // lastblock
  int         nbytes, nleft;
  int         toread;
  iov_t       *newblocks;
  int         i;
  off_t       newsize;
  int         pool_flag;

  // shortcuts
  nbytes = msg -> i.nbytes;
  attr = ocb -> attr;
  newsize = ocb -> offset + nbytes;

  // 1) precalculate the block size constants...
  sb = ocb -> offset / BLOCKSIZE;
  so = ocb -> offset & (BLOCKSIZE - 1);
  lb = newsize / BLOCKSIZE;

  // 2) allocate IOVs
  i = lb - sb + 1;
  if (i <= 8) {
    newblocks = mpool_malloc (mpool_iov8);
    pool_flag = 1;
  } else {
    newblocks = malloc (sizeof (iov_t) * i);
    pool_flag = 0;
  }

  if (newblocks == NULL) {
    return (ENOMEM);
  }

  // 3) calculate the first block size
  toread = BLOCKSIZE - so;
  if (toread > nbytes) {
    toread = nbytes;
  }
  SETIOV (&newblocks [0], (char *)
          (attr -> type.fileblocks [sb].iov_base) + so, toread);

  // 4) now calculate zero or more blocks;
  //    special logic exists for a short final block
  nleft = nbytes - toread;
  for (i = 1; nleft > 0; i++) {
    if (nleft > BLOCKSIZE) {
      SETIOV (&newblocks [i],
              attr -> type.fileblocks [sb + i].iov_base, BLOCKSIZE);
      nleft -= BLOCKSIZE;
    } else {
      SETIOV (&newblocks [i],
              attr -> type.fileblocks [sb + i].iov_base, nleft);
      nleft = 0;
    }
  }

  // 5) transfer data from client directly into the ramdisk...
  resmgr_msgreadv (ctp, newblocks, i, sizeof (msg -> i));

  // 6) clean up
  if (pool_flag) {
    mpool_free (mpool_iov8, newblocks);
  } else {
    free (newblocks);
  }

  // 7) use the original value of nbytes here...
  if (nbytes) {
    attr -> attr.flags |= IOFUNC_ATTR_MTIME | IOFUNC_ATTR_DIRTY_TIME;
    ocb -> offset += nbytes;
  }
  _IO_SET_WRITE_NBYTES (ctp, nbytes);
  return (EOK);
}

We precalculate some constants to make life easier later on. The sb variable contains the starting block number where our writing begins. The so variable (“start offset”) contains the offset into the start block where writing begins (we may be writing somewhere other than the first byte of the block). Finally, lb contains the last block number affected by the write. The sb and lb variables define the range of blocks affected by the write.
We're going to allocate a number of IOVs (into the newblocks array) to point into the blocks, so that we can issue the MsgRead() (via resmgr_msgreadv() in step 5, below). The + 1 is in place in case the sb and lb are the same—we still need to read at least one block.
The first block that we read may be short, because we don't necessarily start at the beginning of the block. The toread variable contains the number of bytes we transfer in the first block. We then set this into the first newblocks array element.
The logic we use to get the rest of the blocks is based on the remaining number of bytes to be read, which is stored in nleft. The for loop runs until nleft is exhausted (we are guaranteed to have enough IOVs, because we calculated the number in step 1, above).
Here we use the resmgr_msgreadv() function to read the actual data from the client directly into our buffers through the newblocks IOV array. We don't read the data from the passed message, msg, because we may not have enough data from the client sitting in that buffer (even if we somehow determine the size a priori and specify it in the resmgr_attr.msg_max_size, the network case doesn't necessarily transfer all of the data). In the network case, this resmgr_msgreadv() may be a blocking call—just something to be aware of.
Clean up after ourselves. The flag pool_flag determines where we allocated the data from.
If we transferred any data, adjust the access time as per POSIX.