

Finding Tokens in a String

Programs often need to do simple kinds of lexical analysis and parsing, such as splitting a command string into tokens. You can do this with the strtok function, declared in the header file `string.h'.

Function: char * strtok (char *newstring, const char *delimiters)
A string can be split into tokens by making a series of calls to the function strtok.

The string to be split up is passed as the newstring argument on the first call only. The strtok function uses this to set up some internal state information. Subsequent calls to get additional tokens from the same string are indicated by passing a null pointer as the newstring argument. Calling strtok with another non-null newstring argument reinitializes the state information. It is guaranteed that no other library function ever calls strtok behind your back (which would mess up this internal state information).

The delimiters argument is a string that specifies a set of delimiters that may surround the token being extracted. All the initial characters that are members of this set are discarded. The first character that is not a member of this set of delimiters marks the beginning of the next token. The end of the token is found by looking for the next character that is a member of the delimiter set. This character in the original string newstring is overwritten by a null character, and the pointer to the beginning of the token in newstring is returned.

On the next call to strtok, the search begins at the character following the one that marked the end of the previous token. Note that the delimiters argument does not have to be the same on every call in a series of calls to strtok.
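
As an illustration of changing the delimiter set between calls, here is a minimal sketch that splits a string of the form NAME=dir1:dir2. The helper name split_assignment and the input format are assumptions made for this example only; they are not part of the library.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: splits a writable "NAME=dir1:dir2:..." string
   in place.  Stores the text before '=' in *name and up to max_dirs
   directory pointers in dirs.  Returns the number of directories.  */
size_t
split_assignment (char *text, char **name, char **dirs, size_t max_dirs)
{
  size_t count = 0;
  char *token;

  /* First call: pass the string, with "=" as the delimiter set.  */
  *name = strtok (text, "=");
  if (*name == NULL)
    return 0;

  /* Later calls: pass NULL and switch to ":" as the delimiter set.  */
  while (count < max_dirs && (token = strtok (NULL, ":")) != NULL)
    dirs[count++] = token;

  return count;
}
```

The caller passes a writable buffer, since strtok overwrites the delimiters it finds with null characters.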

If the end of the string newstring is reached, or if the remainder of the string consists only of delimiter characters, strtok returns a null pointer.

Warning: Since strtok alters the string it is parsing, you should always copy the string to a temporary buffer before parsing it with strtok. If you allow strtok to modify a string that came from another part of your program, you are asking for trouble; that string might be used for other purposes after strtok has modified it, and it would not have the expected value.

The string that you are operating on might even be a constant. Then when strtok tries to modify it, your program will get a fatal signal for writing in read-only memory. See section Program Error Signals.

This is a special case of a general principle: if a part of a program does not have as its purpose the modification of a certain data structure, then it is error-prone to modify the data structure temporarily.
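
One way to follow this advice is to tokenize a private copy and leave the caller's string untouched. The sketch below uses only standard functions (malloc, memcpy, free); the helper name count_tokens is an assumption for this example.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: counts the tokens in input without modifying
   it.  Returns -1 if the temporary copy cannot be allocated.  */
long
count_tokens (const char *input, const char *delimiters)
{
  size_t size = strlen (input) + 1;
  char *copy = malloc (size);
  char *token;
  long count = 0;

  if (copy == NULL)
    return -1;
  memcpy (copy, input, size);          /* Make writable copy.  */

  /* Only the copy is altered by strtok.  */
  for (token = strtok (copy, delimiters);
       token != NULL;
       token = strtok (NULL, delimiters))
    count++;

  free (copy);
  return count;
}
```

Because the function works on its own buffer, it can safely be given a string constant.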

The function strtok is not reentrant. See section Signal Handling and Nonreentrant Functions, for a discussion of where and why reentrancy is important.

Here is a simple example showing the use of strtok.

#include <string.h>
#include <stddef.h>

...

const char string[] = "words separated by spaces -- and, punctuation!";
const char delimiters[] = " .,;:!-";
char *token, *cp;

...

cp = strdupa (string);                /* Make writable copy.  */
token = strtok (cp, delimiters);      /* token => "words" */
token = strtok (NULL, delimiters);    /* token => "separated" */
token = strtok (NULL, delimiters);    /* token => "by" */
token = strtok (NULL, delimiters);    /* token => "spaces" */
token = strtok (NULL, delimiters);    /* token => "and" */
token = strtok (NULL, delimiters);    /* token => "punctuation" */
token = strtok (NULL, delimiters);    /* token => NULL */
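
In practice the number of tokens is usually not known in advance, so the calls are written as a loop that stops when strtok returns a null pointer. The following is a minimal sketch of that pattern; the helper name split_tokens and the fixed-size output array are assumptions for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: stores up to max_tokens token pointers in
   tokens and returns how many were stored.  text is modified in
   place, so it must be writable.  */
size_t
split_tokens (char *text, const char *delimiters,
              char **tokens, size_t max_tokens)
{
  size_t count = 0;
  char *token = strtok (text, delimiters);   /* First call.  */

  while (token != NULL && count < max_tokens)
    {
      tokens[count++] = token;
      token = strtok (NULL, delimiters);     /* Subsequent calls.  */
    }
  return count;
}
```

The stored pointers refer to positions inside text, so they remain valid only as long as that buffer does.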

The GNU C library contains two more functions for tokenizing a string which overcome the limitation of non-reentrancy.

Function: char * strtok_r (char *newstring, const char *delimiters, char **save_ptr)
Just like strtok, this function splits the string into several tokens which can be accessed by successive calls to strtok_r. The difference is that the information about the next token is stored in the space pointed to by the third argument, save_ptr, which is a pointer to a string pointer. Calling strtok_r with a null pointer for newstring, and leaving save_ptr unchanged between the calls, continues the scan without hindering reentrancy.

This function is defined in POSIX.1 and can be found on many systems which support multi-threading.
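
Because each scan keeps its state in its own save_ptr, two scans can be in progress at once. The sketch below nests a word scan inside a line scan, which strtok cannot do with its single internal state. The helper name count_lines_and_words is an assumption for this example, and the feature test macro is an assumption about the build environment.

```c
#define _POSIX_C_SOURCE 200809L   /* Assumed: makes strtok_r visible.  */
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts the non-empty lines and the words in a
   writable buffer.  */
void
count_lines_and_words (char *text, size_t *lines, size_t *words)
{
  char *line_state;   /* Saved position for the outer (line) scan.  */
  char *word_state;   /* Saved position for the inner (word) scan.  */
  char *line, *word;

  *lines = 0;
  *words = 0;

  for (line = strtok_r (text, "\n", &line_state);
       line != NULL;
       line = strtok_r (NULL, "\n", &line_state))
    {
      ++*lines;
      /* The inner scan uses its own state, so it does not disturb
         the outer scan.  */
      for (word = strtok_r (line, " \t", &word_state);
           word != NULL;
           word = strtok_r (NULL, " \t", &word_state))
        ++*words;
    }
}
```

Empty lines are not counted, since consecutive newline characters are treated as a single run of delimiters.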

Function: char * strsep (char **string_ptr, const char *delimiter)
This function resembles strtok_r with the newstring argument replaced by the save_ptr argument. The caller must initialize the moving pointer, *string_ptr, to the start of the string before the first call. Successive calls to strsep move the pointer along the tokens separated by delimiter, returning the address of the next token and updating *string_ptr to point just past the delimiter that ended it, which is where the following token begins.

If the input string contains more than one character from delimiter in a row, strsep returns an empty string for each pair of adjacent characters from delimiter. This means that a program normally should test whether strsep returned an empty string before processing the token.

This function was introduced in 4.3BSD and therefore is widely available.
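
The difference in how adjacent delimiters are handled matters for input in which empty fields are significant. The following sketch counts the fields of the same input with both functions; the helper names are assumptions for this example, and the feature test macro is an assumption about the build environment.

```c
#define _DEFAULT_SOURCE   /* Assumed: makes strsep and strtok_r visible.  */
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts tokens using strtok_r, which treats a
   run of delimiters as a single separator.  */
size_t
count_with_strtok_r (char *text, const char *delimiters)
{
  size_t count = 0;
  char *state;
  char *token;

  for (token = strtok_r (text, delimiters, &state);
       token != NULL;
       token = strtok_r (NULL, delimiters, &state))
    count++;
  return count;
}

/* Hypothetical helper: counts fields using strsep, which reports an
   empty field between two adjacent delimiters.  */
size_t
count_with_strsep (char *text, const char *delimiters)
{
  size_t count = 0;
  char *running = text;

  while (strsep (&running, delimiters) != NULL)
    count++;
  return count;
}
```

For the input "a,,c" with "," as the delimiter, the first helper counts two tokens and the second counts three fields, the middle one being empty.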

Here is what the above example looks like when strsep is used.

#include <string.h>
#include <stddef.h>

...

const char string[] = "words separated by spaces -- and, punctuation!";
const char delimiters[] = " .,;:!-";
char *running;
char *token;

...

running = strdupa (string);
token = strsep (&running, delimiters);    /* token => "words" */
token = strsep (&running, delimiters);    /* token => "separated" */
token = strsep (&running, delimiters);    /* token => "by" */
token = strsep (&running, delimiters);    /* token => "spaces" */
token = strsep (&running, delimiters);    /* token => "" */
token = strsep (&running, delimiters);    /* token => "" */
token = strsep (&running, delimiters);    /* token => "" */
token = strsep (&running, delimiters);    /* token => "and" */
token = strsep (&running, delimiters);    /* token => "" */
token = strsep (&running, delimiters);    /* token => "punctuation" */
token = strsep (&running, delimiters);    /* token => "" */
token = strsep (&running, delimiters);    /* token => NULL */
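
When the empty tokens are not wanted, the test for an empty string can be folded into a loop. This is a minimal sketch of that pattern; the helper name collect_fields and the fixed-size output array are assumptions for illustration, and the feature test macro is an assumption about the build environment.

```c
#define _DEFAULT_SOURCE   /* Assumed: makes strsep visible.  */
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: stores pointers to the non-empty tokens of
   text in fields, up to max_fields, and returns how many were
   stored.  text is modified in place, so it must be writable.  */
size_t
collect_fields (char *text, const char *delimiters,
                char **fields, size_t max_fields)
{
  size_t count = 0;
  char *running = text;   /* The moving pointer updated by strsep.  */
  char *token;

  while (count < max_fields
         && (token = strsep (&running, delimiters)) != NULL)
    {
      if (*token != '\0')   /* Skip empty tokens.  */
        fields[count++] = token;
    }
  return count;
}
```

Applied to the string from the example above, this keeps the six non-empty tokens and drops the empty ones.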

