Programming question! This one has been floating around in the back of my head for a while.
Here's a Python 2 function which converts any unicode string into a valid Python 2 identifier:
import re
import unicodedata

def make_identifier(text):
    text = unicodedata.normalize('NFKD', text)   # Split off accent marks
    text = re.sub('[^a-zA-Z0-9_ ]+', ' ', text)  # Punctuation to spaces
    text = re.sub(' +', ' ', text)               # Collapse redundant spaces
    text = text.strip()
    text = text.replace(' ', '_')
    if not text or re.match('[0-9]', text):
        # Must not be empty or start with a digit
        text = '_' + text
    return text
Of course, many functions would serve this purpose. I chose this one because the result stays as visually close to the original as possible for human readers. It removes spurious whitespace but keeps meaningful whitespace (as "_" characters). If the input is already a valid identifier, it's left unchanged.
'foo' => 'foo'
'Hello there' => 'Hello_there'
'cöoperate' => 'co_operate'
'"Hi," he said.' => 'Hi_he_said'
'foo__bar_' => 'foo__bar_'
So here's the question: what's a good way to do this in Python 3? In Py3, identifiers are allowed to contain accented and non-Latin letters. 'cöoperate', 'αβγδε', and 'ﬄĳ' are valid identifiers. But not all Unicode symbols are fair game; '∞∞∞' and '①②③' are not valid.
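Handily, Py3 strings know the rules themselves: str.isidentifier() checks a string against the Py3 identifier grammar, so at least there's a built-in ground truth to test any conversion against. A quick sanity check on the examples above:

```python
# str.isidentifier() implements the Py3 lexical rules for identifiers.
for s in ['cöoperate', 'αβγδε', 'ﬄĳ', '∞∞∞', '①②③', '1abc']:
    print(repr(s), s.isidentifier())
# → the first three print True, the last three False
```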
As you might imagine, the Py3 rules for identifiers are baroque: http://docs.python.org/3/reference/lexical_analysis.html#identifiers
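One wrinkle buried in that page: the parser converts identifiers to NFKC while parsing, so the ligature identifier above is the same variable as its spelled-out form:

```python
import unicodedata

# Py3 folds identifiers to NFKC, so 'ﬄĳ' and 'fflij' name the same thing.
print(unicodedata.normalize('NFKC', 'ﬄĳ'))  # → fflij
```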
I would like to do the above job in the brave new Py3 world.
Again, many possible algorithms fit the bill. I'm aiming for minimum damage (leave the string visually similar to the original) and maximum efficiency. (My Py2 solution uses only regexps and built-in string functions, all of which are implemented in native code in Python.) Evil hacks are encouraged, but stick to pure Python, please.
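To make the target concrete, here's one rough sketch of the kind of Py3 answer I have in mind. The name and approach are just placeholders, and it's exactly the sort of thing I'd like to beat on efficiency — it tests every character with str.isidentifier(), one at a time, in pure-Python loops:

```python
import re
import unicodedata

def py3_identifier(text):
    # NFKC first, since that's the normalization Py3 applies to identifiers.
    text = unicodedata.normalize('NFKC', text)
    # A character may appear inside an identifier iff '_' + char is one.
    text = ''.join(c if ('_' + c).isidentifier() else ' ' for c in text)
    text = re.sub(' +', ' ', text)   # Remove redundant spaces
    text = text.strip().replace(' ', '_')
    if not text.isidentifier():      # Empty, or starts with a bad character
        text = '_' + text
    return text
```

Keywords like 'class' slip through (str.isidentifier() doesn't check for them, and neither did my Py2 version), and the per-character test is precisely the slowness I'm hoping someone knows a clever way around.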