{"id":1209,"date":"2010-01-20T08:58:10","date_gmt":"2010-01-20T06:58:10","guid":{"rendered":"http:\/\/code.openark.org\/blog\/?p=1209"},"modified":"2010-02-04T07:39:43","modified_gmt":"2010-02-04T05:39:43","slug":"checking-for-string-permutation","status":"publish","type":"post","link":"https:\/\/code.openark.org\/blog\/mysql\/checking-for-string-permutation","title":{"rendered":"Checking for string permutation"},"content":{"rendered":"<p>A permutation is a change of places. Thus, <strong>&#8216;lolhe&#8217;<\/strong> is a permuted <strong>&#8216;hello&#8217;<\/strong> (commonly referred to as &#8216;scrambled text&#8217;).<\/p>\n<p>I wish to present an SQL solution for checking if two strings are permutations of the same text.<\/p>\n<h4>About permutations<\/h4>\n<p>So, if <strong>&#8216;lolhe&#8217;<\/strong> is a permutation of <strong>&#8216;hello&#8217;<\/strong>, then <strong>&#8216;hello&#8217;<\/strong> is a permutation of <strong>&#8216;lolhe&#8217;<\/strong>, as well; and both are permutations of <strong>&#8216;elloh&#8217;<\/strong>. The <strong>REVERSE()<\/strong> of a text is an example of permutation. Mathematically, string permutation is an equivalence relation, and divides all strings to equivalence classes.<\/p>\n<h4>Use cases<\/h4>\n<ul>\n<li>We may be interested in permutations when a user chooses a password. We may disallow a password which is identical to the login name; but we may also disallow upper-lower-case-only transformations of the text. We may still disallow a <em>permutation<\/em> of the text.<\/li>\n<li>On a slightly different scale, the two queries: <strong>SELECT * FROM City WHERE id IN (5, 21, 13)<\/strong> and <strong>SELECT * FROM City WHERE id IN (13, 5, 21)<\/strong> are identical. Here, the permutation is not with string characters, but with string tokens. While the solution discussed is targeted at string characters, it can be easily converted to work with string tokens.<\/li>\n<\/ul>\n<h4>Checking for permutation<\/h4>\n<p>The solution I&#8217;m suggesting checks for permutation between 2 strings by permuting both to a third, <strong>normal form<\/strong>. The two string are permutations of each other if both have <em>the same normal form<\/em>.<\/p>\n<p><!--more-->What exactly <em>is<\/em> a normal form? Well, anything you like, really, as long as it&#8217;s deterministic and works the same for all elements in the equivalence class (mathematically speaking, this was a really bad definition, I know).<\/p>\n<p>Enough of mathematical notions: on the practical side, I&#8217;ll normalize &#8216;CAB&#8217; to &#8216;ABC&#8217;, and &#8216;DOG&#8217; to &#8216;DGO&#8217;. Can you see what normalization means here? I&#8217;m merely rearranging the characters in alphabetical order. This rearrangement is in itself a permutation of the string (hence, one last mathematical statement, it can be seen as a representative of the equivalence class).<\/p>\n<p>So, to see if two strings are permutations of each other, we rearrange both in alphabetical order, and see if we got the same text. <strong>&#8216;hello&#8217;<\/strong> and <strong>&#8216;lolhe&#8217;<\/strong> both normalize to <strong>&#8216;ehllo&#8217;<\/strong>, hence are permutations.<\/p>\n<h4>Normalizing the text<\/h4>\n<p>Down to business: how do we normalize a text using SQL? Well, once again, <a href=\"http:\/\/code.openark.org\/blog\/mysql\/unwalking-a-string-with-group_concat\">string walking and string unwalking<\/a> to the rescue. The trick is to break the string apart (to distinct characters), then re-combine the characters, but instead of in original order, do that in ascending order.<\/p>\n<p>Breaking of the word &#8216;hello&#8217; is as follows:<\/p>\n<blockquote>\n<pre>SELECT\r\n SUBSTRING(s, value+1, 1) AS c\r\nFROM\r\n tinyint_asc, (SELECT 'hello' AS s) sel1\r\nWHERE\r\n value &lt; char_length(s);<\/pre>\n<\/blockquote>\n<p>Recombining it in ascending character order is as follows:<\/p>\n<blockquote>\n<pre>SELECT GROUP_CONCAT(c ORDER BY c SEPARATOR '') AS normalized FROM\r\n (\r\n SELECT\r\n   SUBSTRING(s, value+1, 1) AS c\r\n FROM\r\n   tinyint_asc, (SELECT 'hello' AS s) sel1\r\n WHERE\r\n   value &lt; char_length(s)\r\n ) walked_string_select;\r\n+------------+\r\n| normalized |\r\n+------------+\r\n| ehllo\u00a0\u00a0\u00a0\u00a0\u00a0 |\r\n+------------+\r\n1 row in set (0.00 sec)<\/pre>\n<\/blockquote>\n<p>We can write a stored function to do that more conveniently (and, while at it, also normalize character case):<\/p>\n<blockquote>\n<pre>DELIMITER $$\r\n\r\nDROP FUNCTION IF EXISTS `normalize_text`$$\r\nCREATE FUNCTION `normalize_text` (input_text VARCHAR(255) CHARSET utf8) RETURNS VARCHAR(255) CHARSET utf8\r\nDETERMINISTIC\r\nREADS SQL DATA\r\nBEGIN\r\n SELECT GROUP_CONCAT(c ORDER BY c SEPARATOR '') INTO @normalized FROM\r\n (\r\n SELECT\r\n   SUBSTRING(s, value+1, 1) AS c\r\n FROM\r\n   tinyint_asc, (SELECT input_text AS s) sel1\r\n WHERE\r\n   value &lt; char_length(s)\r\n ) walked_string_select;\r\n RETURN LOWER(@normalized);\r\nEND$$\r\n\r\nDELIMITER ;<\/pre>\n<\/blockquote>\n<p>And use like this:<\/p>\n<blockquote>\n<pre>mysql&gt; SELECT normalize_text('Smith');\r\n+-------------------------+\r\n| normalize_text('Smith') |\r\n+-------------------------+\r\n| himst\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 |\r\n+-------------------------+\r\n1 row in set (0.00 sec)\r\n\r\nmysql&gt; SELECT normalize_text('Smith') = normalize_text('Thims');\r\n+---------------------------------------------------+\r\n| normalize_text('Smith') = normalize_text('Thims') |\r\n+---------------------------------------------------+\r\n|\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 1 |\r\n+---------------------------------------------------+\r\n1 row in set (0.00 sec)\r\n\r\nmysql&gt; SELECT normalize_text('Smith') = normalize_text('Sniff');\r\n+---------------------------------------------------+\r\n| normalize_text('Smith') = normalize_text('Sniff') |\r\n+---------------------------------------------------+\r\n|\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 0 |\r\n+---------------------------------------------------+\r\n1 row in set (0.00 sec)<\/pre>\n<\/blockquote>\n<p>To compare tokenized strings (like <strong>&#8216;21,5,13&#8217;<\/strong> vs. <strong>&#8216;5,13,21&#8217;<\/strong>), we can use the <strong>SUBSTRING_INDEX()<\/strong> function to break the string apart, instead of <strong>SUBSTRING()<\/strong>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>A permutation is a change of places. Thus, &#8216;lolhe&#8217; is a permuted &#8216;hello&#8217; (commonly referred to as &#8216;scrambled text&#8217;). I wish to present an SQL solution for checking if two strings are permutations of the same text. About permutations So, if &#8216;lolhe&#8217; is a permutation of &#8216;hello&#8217;, then &#8216;hello&#8217; is a permutation of &#8216;lolhe&#8217;, as [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"enabled":false},"version":2}},"categories":[5],"tags":[54,21],"class_list":["post-1209","post","type-post","status-publish","format-standard","hentry","category-mysql","tag-math","tag-sql"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p2bZZp-jv","_links":{"self":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts\/1209","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/comments?post=1209"}],"version-history":[{"count":18,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts\/1209\/revisions"}],"predecessor-version":[{"id":1910,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts\/1209\/revisions\/1910"}],"wp:attachment":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/media?parent=1209"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/categories?post=1209"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/tags?post=1209"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}