Java String Tokenisation

Handra
Dec 6, 2019

String tokenisation, sometimes called string splitting, is one of the most common string operations in programming. Take a CSV (Comma-Separated Values) file for example, which separates its content with the comma (,) character.

Java makes string tokenisation very easy. The most commonly used approach is the split method provided by the String class. Besides the built-in split method, we can also use a class named StringTokenizer provided by the Apache Commons Text (commons-text) library. Note that this is a different class from the java.util.StringTokenizer that ships with the JDK.

Using built-in split function

Let’s say we have a string with the following value:

2019-11-25|12:26:10,496|default task-1|"Enroll Certificate"||Start|10.85.75.100

Now, as you should have realised from reading the split method documentation, it accepts a regular expression (regex) instead of just the plain separator text. This means you have to be careful with the separator you are going to use and with how you tell Java about it.
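To make the regex point concrete, here is a small sketch that splits the IP address from our sample on a dot. The class name RegexSeparatorDemo and the helper countTokens are made up for this illustration. An unescaped dot matches every character, so nothing is left over as a token.

```java
class RegexSeparatorDemo {
    // Hypothetical helper: returns how many tokens split() produces for the given regex.
    static int countTokens(String text, String regex) {
        return text.split(regex).length;
    }

    public static void main(String[] args) {
        String ip = "10.85.75.100";
        // "." is a regex that matches any character, so every character is treated as a separator.
        System.out.println(countTokens(ip, "."));   // prints 0
        // "\\." is the escaped, literal dot.
        System.out.println(countTokens(ip, "\\.")); // prints 4
    }
}
```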

In our string, the pipe (|) character is the separator. The pipe has a special meaning in regular expressions (it denotes alternation), so we need to tell Java to treat it literally by escaping it with a backslash. Inside a Java string literal the backslash itself must be escaped too, which gives \\| as the argument passed to split.
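A minimal sketch of this split call could look like the following. The class name SplitDemo and the method tokenise are illustrative, and the sample line is assumed to carry double quotes around Enroll Certificate and to end in 10.85.75.100, as the printed output suggests.

```java
class SplitDemo {
    // Splits a pipe-delimited line. The pipe is escaped because split() expects a regex.
    static String[] tokenise(String line) {
        return line.split("\\|");
    }

    public static void main(String[] args) {
        // Assumed sample line (quotes and final octet inferred from the printed output).
        String line = "2019-11-25|12:26:10,496|default task-1|\"Enroll Certificate\"||Start|10.85.75.100";
        for (String token : tokenise(line)) {
            System.out.println(token);
        }
    }
}
```

If escaping by hand feels error-prone, java.util.regex.Pattern.quote("|") builds the literal pattern for you.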

Now when we print out the value of each token, we will get the following output.

2019-11-25
12:26:10,496
default task-1
"Enroll Certificate"

Start
10.85.75.100

All looks good, until we realise that something is not quite right. If you take a look at the output, one of the values is "Enroll Certificate", quotes included. Most of the time this is not what we really want. We should get the value Enroll Certificate only, without any quotes.

Of course, fixing this is as simple as removing the leading and trailing quotes. However, this becomes one more thing to handle in our code. In the next section, we will use a library developed by Apache that makes this kind of situation easier to deal with.
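For reference, the manual fix mentioned above might look like this sketch. The class QuoteStripper and the method stripQuotes are hypothetical names, and the sketch only handles one pair of surrounding double quotes.

```java
class QuoteStripper {
    // Removes one leading and one trailing double quote, but only if both are present.
    static String stripQuotes(String token) {
        if (token.length() >= 2 && token.startsWith("\"") && token.endsWith("\"")) {
            return token.substring(1, token.length() - 1);
        }
        return token;
    }

    public static void main(String[] args) {
        System.out.println(stripQuotes("\"Enroll Certificate\"")); // prints Enroll Certificate
        System.out.println(stripQuotes("Start"));                  // prints Start
    }
}
```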

Using StringTokenizer from Apache Commons

Now, let's assume we have the same string value as in the example above, and we want to tokenise it on the pipe character.

First of all, we need to make sure that the Apache commons-text library is available to our Java project. In this example, I'm using Maven as the build and dependency management tool. Adding commons-text is as simple as putting its dependency entry into the pom.xml file, inside the dependencies tag.
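A sketch of that dependency entry, using the usual Maven coordinates for commons-text and the version this article works with:

```xml
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-text</artifactId>
    <version>1.8</version>
</dependency>
```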

Adding dependency for Apache commons-text

As can be seen from the above configuration, I am using commons-text library version 1.8. As of this writing, this is the latest version of Apache commons-text.

Once the library is imported, we can now start doing tokenisation of a string by utilising the class named StringTokenizer.
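A minimal sketch of such a tokeniser setup follows. It assumes commons-text is on the classpath. The class name TokenizerDemo is illustrative, and the sample line is assumed to carry double quotes around Enroll Certificate and to end in 10.85.75.100, as the printed output suggests.

```java
import org.apache.commons.text.StringTokenizer;

class TokenizerDemo {
    public static void main(String[] args) {
        // Assumed sample line (quotes and final octet inferred from the printed output).
        String line = "2019-11-25|12:26:10,496|default task-1|\"Enroll Certificate\"||Start|10.85.75.100";

        // The delimiter is plain text here, not a regex, so no escaping is needed.
        StringTokenizer tokenizer = new StringTokenizer(line, "|");
        // Treat the double quote as the quote character so it is stripped from tokens.
        tokenizer.setQuoteChar('"');
        // Keep empty tokens (they are ignored by default).
        tokenizer.setIgnoreEmptyTokens(false);

        String[] tokens = tokenizer.getTokenArray();
        for (String token : tokens) {
            System.out.println(token);
        }
    }
}
```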

Example of using StringTokenizer

As can be seen, I am using exactly the same string. The only difference is in how the tokenisation is performed. In this case, I'm using the StringTokenizer class. Using it is as simple as creating an instance and passing the string together with the delimiter as parameters.

By now, you should have noticed that the delimiter I passed is just a single pipe character (|) instead of the escaped form with the double backslash (\\|) that we used previously. This is because the delimiter accepted by the StringTokenizer class is not a regular expression.

We can also tell the tokeniser that we want to use the double quote symbol (") as the quote character. This is easily done by calling the setQuoteChar method of the tokeniser. We can also tell the tokeniser whether to ignore empty tokens by calling the setIgnoreEmptyTokens method. Its default value is true, so set it to false if you want to keep empty tokens.

Lastly, we can simply call the getTokenArray method, which returns all the tokens as an array of String. We can then iterate over this array and use the values. Below is the output of printing each token obtained this way.

2019-11-25
12:26:10,496
default task-1
Enroll Certificate

Start
10.85.75.100

Notice that this output no longer contains the quote (") characters. This is usually what we want, because quotes are most often used to tell the parser that everything inside them must be treated as a single value, even if the separator itself appears inside the quoted string. StringTokenizer handles this situation out of the box, whereas the split method of the String class cannot do so without additional logic.
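To see why split struggles with a separator inside quotes, consider this sketch with a made-up line (not taken from the log sample above). The names QuotedSeparatorDemo and naiveSplit are hypothetical.

```java
class QuotedSeparatorDemo {
    // Plain regex split: it knows nothing about quotes.
    static String[] naiveSplit(String line) {
        return line.split("\\|");
    }

    public static void main(String[] args) {
        // Hypothetical line whose second field contains the separator inside quotes.
        String line = "Start|\"Enroll|Certificate\"|End";
        for (String token : naiveSplit(line)) {
            System.out.println(token);
        }
        // The quoted field is cut in two, so four tokens are printed instead of three.
    }
}
```

Putting the halves back together by hand is the additional logic that a quote-aware tokeniser saves us from writing.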

Conclusion

String tokenisation is easy to do in Java. Java already provides a built-in method for splitting a string on a delimiter expressed as a regular expression. To handle more cases, such as quoted values, StringTokenizer from Apache Commons can make our life easier. In the end, neither approach is right or wrong. We need to assess our requirements first and then choose the method that fits.

That’s it for today. Happy coding !!!

Handra

I am a software engineer focusing on Java programming language and Public Key Infrastructure (PKI). Loves Linux and open-source technology.