When to use what - RegExp, String Replace & Character Replace
Sometimes it’s hard to know what to use, or even why to use it.
What
In most, or dare I say all, popular programming languages there are several string replacement methods. Most commonly there is one String-based and one RegExp-based method. Some languages, such as Java, also have a special method for replacing Characters in a String.
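As a rough illustration, here is how the three flavours look in Kotlin, the language used for the benchmark later in this post. The sample text and the function names are made up for this sketch; the calls themselves are the standard library's `String.replace(Char, Char)`, `String.replace(String, String)` and `Regex.replace`.

```kotlin
// Character-based: swaps one Char for another.
fun replaceChar(text: String): String = text.replace('a', 'd')

// String-based: swaps a literal substring for another (no pattern syntax).
fun replaceString(text: String): String = text.replace("ab", "d")

// RegExp-based: swaps everything the pattern matches.
fun replaceRegex(text: String): String = Regex("[abc]").replace(text, "d")

fun main() {
    val sample = "abcabc" // hypothetical sample text
    println(replaceChar(sample))   // dbcdbc
    println(replaceString(sample)) // dcdc
    println(replaceRegex(sample))  // dddddd
}
```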
Why
Performance sometimes matters, sometimes it doesn’t. But when it does, it’s really good to know which method to use, as the speed-up can be substantial!
The use-case
Replace “a”, “b” & “c” with “d”. It’s simple, but good. As data I’m using a few of Shakespeare’s works, which total 4.5 million characters. I’ve also added two variants of this text (the first 1,000 characters, and the full text repeated 10 times), as shown in the table.
Type | Length (characters) | Iterations | Average (ms) | Normalized to RegExp |
---|---|---|---|---|
RegExp | 1k | 1 million | 0.0049ms | 1x |
Char | 1k | 1 million | 0.0027ms | 0.55x |
String | 1k | 1 million | 0.0087ms | 1.76x |
— | — | — | — | — |
RegExp | 4.5 million | 1k | 29.67ms | 1x |
Char | 4.5 million | 1k | 11.84ms | 0.39x |
String | 4.5 million | 1k | 57.20ms | 1.92x |
— | — | — | — | — |
RegExp | 45 million | 10 | 361.8ms | 1x |
Char | 45 million | 10 | 117.0ms | 0.32x |
String | 45 million | 10 | 588.1ms | 1.63x |
As shown, the Character-based replace is much faster! Its lead over the RegExp also grows with the size of the input: it takes 0.55x of the RegExp time at 1k characters, but only 0.32x at 45 million.
I think an interesting test would be to do character swaps using these methods and see if these results hold.
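A quick sketch of why swaps are a different beast: chained replacements run one pass per character, so a later pass can undo an earlier one. Below, swapping "a" and "b" with two chained character replaces collapses everything into "a", while a single pass (here with `map` over the characters, or a RegExp with a transform lambda) keeps the swap intact. The function names and the sample are invented for illustration, and I haven't benchmarked these variants.

```kotlin
// Naive: two passes. The second pass also rewrites the 'b's produced by the first.
fun swapChained(text: String): String =
    text.replace('a', 'b').replace('b', 'a')

// Single pass over the characters: each character is decided exactly once.
fun swapSinglePass(text: String): String =
    text.map { ch ->
        when (ch) {
            'a' -> 'b'
            'b' -> 'a'
            else -> ch
        }
    }.joinToString("")

// Single pass with a RegExp (compiled once) and a transform lambda.
val swapPattern = Regex("[ab]")

fun swapRegex(text: String): String =
    swapPattern.replace(text) { m -> if (m.value == "a") "b" else "a" }

fun main() {
    val sample = "abba cab" // hypothetical sample text
    println(swapChained(sample))    // aaaa caa
    println(swapSinglePass(sample)) // baab cba
    println(swapRegex(sample))      // baab cba
}
```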
When to use what?
I’d say that I see a few clear results.
- Use Character Based Replace if you only need to replace characters. It’s much faster!
- Use String Based Replace if you only swap one string for another (it’s faster than RegExp). Chaining several swaps gets slow quickly, since every swap is another full pass over the text.
- Use RegExp Based Replace if you want to swap multiple strings
- Use RegExp Based Replace if you wanna do anything complex really! It’s pretty performant if you remember to compile the pattern :)
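For the "multiple strings" case, one possible sketch is to build a single alternation pattern from a lookup table and let a lambda pick the replacement, so the text is scanned once instead of once per string. The table contents and all names below are invented for the example.

```kotlin
// Hypothetical lookup table: what to replace -> what to replace it with.
val replacements = mapOf("cat" to "dog", "red" to "blue", "small" to "big")

// Compile once: an alternation of the escaped keys, so they are matched literally.
val multiPattern = replacements.keys.joinToString("|") { Regex.escape(it) }.toRegex()

// One pass over the text; the lambda looks up each match in the table.
fun replaceAllRegex(text: String): String =
    multiPattern.replace(text) { m -> replacements.getValue(m.value) }

// String-based alternative: one full pass (and one new String) per table entry.
fun replaceAllChained(text: String): String =
    replacements.entries.fold(text) { acc, (target, replacement) ->
        acc.replace(target, replacement)
    }

fun main() {
    val sample = "a small red cat" // hypothetical sample text
    println(replaceAllRegex(sample))   // a big blue dog
    println(replaceAllChained(sample)) // a big blue dog
}
```

The two only agree as long as no replacement text contains another key; with overlapping entries the chained version can rewrite its own output, which the single-pass version never does.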
Extra
Some extra comments that are good to know in cases like these.
RegExp
I’ve said this before but… Please remember to compile your patterns once, and not on every iteration of a loop. Compiling patterns is incredibly expensive! Running (1..1_000_000).forEach { regexStr.toRegex().find(str) }
is many times slower than
// compile once, outside the loop
val regexCompiled = regexStr.toRegex()
(1..1_000_000).forEach { regexCompiled.find(str) }
because in the first example the pattern is compiled on every iteration…
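A runnable version of that comparison could look like the sketch below. The pattern, the input and the iteration count are arbitrary picks for illustration, and timings vary by machine and JVM, so treat the printed numbers as a rough indication only.

```kotlin
import kotlin.system.measureTimeMillis

// Compiles the pattern on every iteration, then searches.
fun countCompileEachTime(pattern: String, input: String, iterations: Int): Int {
    var hits = 0
    repeat(iterations) {
        if (pattern.toRegex().find(input) != null) hits++
    }
    return hits
}

// Compiles the pattern once and reuses it for every search.
fun countCompileOnce(pattern: String, input: String, iterations: Int): Int {
    val compiled = pattern.toRegex()
    var hits = 0
    repeat(iterations) {
        if (compiled.find(input) != null) hits++
    }
    return hits
}

fun main() {
    val pattern = "[abc]+\\d"  // arbitrary example pattern
    val input = "xyz abc1 xyz" // arbitrary example input
    val iterations = 1_000_000

    val eachTime = measureTimeMillis { countCompileEachTime(pattern, input, iterations) }
    val once = measureTimeMillis { countCompileOnce(pattern, input, iterations) }
    println("Compile each time: $eachTime ms")
    println("Compile once:      $once ms")
}
```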
Python specific
Note that in Python, as an example, the built-in string operations are implemented in C (at least in CPython), and it’s very important to actually use them instead of hand-written Python loops if you care about performance. Even between built-ins there can be a difference: keyword in str is usually somewhat faster than str.find(keyword). Both run C code, but the in operator skips the attribute lookup and method call that str.find needs.
Appendix A. The Code
import java.io.File
import kotlin.system.measureTimeMillis

object RegexTester {
    // Full text (about 4.5 million characters), its first 1,000 characters, and the text repeated 10 times.
    val text = File("/home/londet/git/text-gen-kt/files/shakespeare.txt").readText()
    val textSmall = text.take(1000)
    val textLarge = text.repeat(10)

    // The pattern is compiled once here and reused in every iteration.
    val regex = "[abc]".toRegex()
    val charReplace = listOf('a', 'b', 'c')
    val stringReplace = listOf("a", "b", "c")

    @JvmStatic
    fun main(args: Array<String>) {
        println("Warming up JVM by running 10,000 iterations of each replacer on normal size.")
        (1..10_000)
            .forEach { regex.replace(text, "d") }
        (1..10_000)
            .forEach { charReplace.fold(text) { acc, ch -> acc.replace(ch, 'd') } }
        (1..10_000)
            .forEach { stringReplace.fold(text) { acc, ch -> acc.replace(ch, "d") } }
        println("Warmup done!")
        println()

        // Each result is the average time per iteration in milliseconds.
        val regexSmall = measureTimeMillis { (1..1_000_000).forEach { regex.replace(textSmall, "d") } } / 1_000_000.0
        val regexNormal = measureTimeMillis { (1..1_000).forEach { regex.replace(text, "d") } } / 1000.0
        val regexLarge = measureTimeMillis { (1..10).forEach { regex.replace(textLarge, "d") } } / 10.0
        // Variant that compiles the pattern on every iteration:
        // val regexLargeCompile = measureTimeMillis { (1..10).forEach { textLarge.replace("[abc]".toRegex(), "d") } } / 10.0
        println("Regex Small (1000 characters, 1,000,000 avg): $regexSmall")
        println("Regex Normal (4.5 million characters, 1000 avg): $regexNormal")
        println("Regex Large (45 million characters, 10 avg): $regexLarge")
        println()

        // Character-based: one replace pass per character in charReplace.
        val charSmall = measureTimeMillis { (1..1_000_000).forEach { charReplace.fold(textSmall) { acc, ch -> acc.replace(ch, 'd') } } } / 1_000_000.0
        val charNormal = measureTimeMillis { (1..1_000).forEach { charReplace.fold(text) { acc, ch -> acc.replace(ch, 'd') } } } / 1000.0
        val charLarge = measureTimeMillis { (1..10).forEach { charReplace.fold(textLarge) { acc, ch -> acc.replace(ch, 'd') } } } / 10.0
        println("CharReplace Small (1000 characters, 1,000,000 avg): $charSmall")
        println("CharReplace Normal (4.5 million characters, 1000 avg): $charNormal")
        println("CharReplace Large (45 million characters, 10 avg): $charLarge")
        println()

        // String-based: one replace pass per string in stringReplace.
        val stringSmall = measureTimeMillis { (1..1_000_000).forEach { stringReplace.fold(textSmall) { acc, ch -> acc.replace(ch, "d") } } } / 1_000_000.0
        val stringNormal = measureTimeMillis { (1..1_000).forEach { stringReplace.fold(text) { acc, ch -> acc.replace(ch, "d") } } } / 1000.0
        val stringLarge = measureTimeMillis { (1..10).forEach { stringReplace.fold(textLarge) { acc, ch -> acc.replace(ch, "d") } } } / 10.0
        println("StringReplace Small (1000 characters, 1,000,000 avg): $stringSmall")
        println("StringReplace Normal (4.5 million characters, 1000 avg): $stringNormal")
        println("StringReplace Large (45 million characters, 10 avg): $stringLarge")
        println()

        /**
        Regex Small (1000 characters, 1,000,000 avg): 0.004949
        Regex Normal (4.5 million characters, 1000 avg): 29.671
        Regex Large (45 million characters, 10 avg): 361.8
        CharReplace Small (1000 characters, 1,000,000 avg): 0.002752
        CharReplace Normal (4.5 million characters, 1000 avg): 11.835
        CharReplace Large (45 million characters, 10 avg): 117.0
        StringReplace Small (1000 characters, 1,000,000 avg): 0.008692
        StringReplace Normal (4.5 million characters, 1000 avg): 57.204
        StringReplace Large (45 million characters, 10 avg): 588.1
        */
    }
}