This is missing one large avenue of speedup: the startup time, which is bad with Python. Some of my Lua scripts are done faster than Python starts up, even when doing file i/o. This is especially crucial for interactive ‘productivity’ helper scripts, which need to respond instantly.
As an anecdote, when I was making my first Lua script for Alfred on a horribly underpowered machine, I briefly couldn't figure out if I actually configured the script properly, because the dummy example output indeed appeared right as I typed the last input key. Turns out, Lua really runs that fast.
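For anyone who wants to check the startup-time point on their own machine, here is a minimal sketch. It only times how long the current Python interpreter takes to start and exit; the function name `measure_startup` is made up for illustration, and the numbers depend heavily on the machine and on what site packages are installed.

```python
import subprocess
import sys
import time


def measure_startup(runs=5):
    """Return the best wall-clock time (seconds) to start the interpreter and exit."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        # "-c pass" runs an empty program, so almost all the time is startup cost
        subprocess.run([sys.executable, "-c", "pass"], check=True)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    print(f"python startup: {measure_startup() * 1000:.1f} ms")
```

Assuming a `lua` binary is on your PATH, swapping the command for `["lua", "-e", ""]` should give a comparable number for Lua.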
Alas, the lack of proper ‘null’ bugs me—it may interfere with making middleware for external APIs, receiving and transmitting schema-less data. A bunch of libraries use cjson.null or interoperable lua-null or somesuch, but I'm sure I'll bump into a lib that won't know about those.
Ah, and by the way, from what I vaguely heard, actual native multithreading might be doable in Lua with libraries. At least I've seen a couple of such libs, but haven't tried them.
Through the years starting from around 1998 I deployed Lua into several embedded systems: hardware debugging tools, scientific instruments, medical devices, home automation and finally robotics. Back then you couldn't beat the small footprint of the environment and the decent availability of libraries so you didn't need to start everything from scratch. In contrast, for example the Pawn language was also competent for similar applications but its ecosystem was quite lacking.
Some years ago when Squirrel started to get traction, particularly due to its adoption in a couple of IoT frameworks, I fiddled with it for some months and it almost got its way into a commercial electronics project. However by then uPython was getting more popular, had better tooling support and ARM microcontroller chips had become very fast. Being able to reuse the same code on the host with minimal or no changes, or having an interface or emulation layer that you can also program in Python is a game changer. It took me less than a week to rewrite the Squirrel code to Python, with even better features thanks to the language facilities.
I thought I was never going to touch Lua again but a couple of months ago I found myself hacking some REAPER audio processing scripts, and cursing Ierusalimschy once more for his decision to ignore Dijkstra's advice and not start indexes from zero...
Lua is optimised for interpreter size, Python for "language niceness / flexibility", and it shows. Lua has lots of nits, such as the "index starts at 1" thing, or the "all numbers are floats" thing. I also find its class mechanism ugly, but the most annoying one for me is that it suffers from the totally unnecessary "a variable is global by default" bug.
However, life is full of surprises. While Lua's small engine size was a deliberate strategy to make it cheap to embed, it also meant it's much easier to optimise than something the size of Python. And ... that's what happened, so LuaJIT is the fastest interpreted language - even faster than the best JavaScript engines when I last looked. That despite JavaScript engines having had man-decades of optimisation effort put into them.
If you want even more speed for Lua, Lua's new experimental sister language, Pallene, is really exciting.
Pallene is Lua with strong types, designed for the purpose of being able to generate faster, native code ahead of time (AOT).
Its compiler design also has deep knowledge of the Lua implementation, so it short-circuits the formal Lua/C API, which makes crossing the Lua/C language boundary in Pallene even faster (and Lua's is already good... much less overhead than crossing in Python).
Easily interoperable languages present an interesting proposition: a vertically integrated approach to language design where tools are more specialised, but designed from the get-go to work together. Fascinating, and awesome.
I think they define Pallene as a true subset of Lua in that the Pallene compiler can emit a pure Lua version (which strips all the types) that will produce the same results when run.
It's also not just another Lua-like language, in that it is designed to deeply share the Lua VM, and that Lua and Pallene code can call each other easily and fast. The description "Sister language" is trying to impart that there is more going on with Pallene than just creating another new language that kind of resembles Lua.
This is a great example of something that's more than the sum of its parts. A GIL-constrained language and a single-threaded language working together to run code in parallel - something that neither language can achieve on its own.
Although I believe there is a plan to add support for this pattern (multiple threads with one interpreter state each) to pure Python in the form of "subinterpreters".
Multiprocessing requires multiple Python processes, which are expensive with regard to startup time and resource use. Multiple interpreters give you parallelism without so much overhead.
Wait, isn’t Python startup slow because of the interpreter startup and not process creation itself? A thread also has to be created (saves DLL mapping time but that’s it) and Python state also has to be initialized in complete isolation on that thread (which means setting up the entire scene again and loading the stdlib from sources). Am I wrong?
Threads don’t magically start with a Python interpreter that is ready to use; it has to be initialized for every thread out there separately as well, unless some clever immutability tricks are involved to skip big parts of that initialization.
AFAIK subinterpreters are similar to multiprocessing in that the interpreters don't share memory - but because they are within the same process, communication between them can be more efficient.
> Does it make sense to have another scripting language inside Python scripts?
Point of interest, Python already has another embedded language if you have Tkinter available: Tcl.
Larry Wall of Perl fame once said: “Tcl tends to be ported to weird places like routers.”[0]
Routers are just the beginning… see here[1] for calling Tcl scripts directly into a Python-embedded Tcl interpreter. I believe at least one of the Perl/Tk attempts did something similar (embedding Tcl wholesale) too.
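To make the Tcl point concrete, here is a small sketch of evaluating Tcl from Python without opening any window. `tkinter.Tcl()` is the stdlib way to get a Tcl-only interpreter; the wrapper name `tcl_eval` is made up here, and it assumes your Python build ships tkinter at all (some minimal builds don't, hence the guarded import).

```python
try:
    import tkinter
except ImportError:  # tkinter is optional in some Python builds
    tkinter = None


def tcl_eval(script):
    """Evaluate a Tcl script in an embedded interpreter; None if tkinter is missing."""
    if tkinter is None:
        return None
    interp = tkinter.Tcl()  # Tcl only, no Tk window is created
    return interp.eval(script)  # Tcl results come back as strings


if __name__ == "__main__":
    print(tcl_eval("expr {6 * 7}"))
```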
It's fun and probably useful, but Python is not 30 times slower than Lua (from 33s to less than a sec).
The two programs don't do the same thing: if you perform the same computation in the Python code (monochrome Mandelbrot, inline loop, 49 iterations, etc.), you get 0.8 vs 4.8 sec, which seems about right.
I've removed numpy since it's cheating if we compare raw language perfs:
So yes, Python is about 6 times slower. That's significant, but it can be worth the compromise if you consider the Python version is much, much easier to read.
And of course, a vectorized algo would be even faster (see pbourke's comment).
What's more, with SharedMemory from 3.8, you could speed that up even in pure Python using multiprocessing by efficiently sharing the array.
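A rough sketch of the SharedMemory idea, stdlib only. The helper names (`fill_row`, `demo`) and the tiny 4x3 "image" are made up for illustration. To keep it short it calls the worker function in the same process; in real use each `fill_row` call would be submitted to a process pool, and the worker would attach to the block by name in exactly the same way.

```python
from multiprocessing import shared_memory


def fill_row(shm_name, row, width, value):
    """Attach to an existing block by name and write one row of uint8 pixels."""
    shm = shared_memory.SharedMemory(name=shm_name)
    try:
        start = row * width
        shm.buf[start:start + width] = bytes([value]) * width
    finally:
        shm.close()  # detach only; the creator is responsible for unlink()


def demo(width=4, height=3):
    """Create a shared block, fill it row by row, and return its contents."""
    shm = shared_memory.SharedMemory(create=True, size=width * height)
    try:
        for row in range(height):
            # Stand-in for executor.submit(fill_row, ...) in a worker process
            fill_row(shm.name, row, width, row + 1)
        # The block may be rounded up to a page size, so slice to the real size
        return bytes(shm.buf[:width * height])
    finally:
        shm.close()
        shm.unlink()


if __name__ == "__main__":
    print(demo())
```

Only the block's name crosses the process boundary, so the pixel data itself is never pickled.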
Much of the difference is due to rewriting complex numbers into pairs of floating-point variables. Probably because Lua doesn't have complex numbers?
The same rewrite in Python would not speed things up, it would slow things down.
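For reference, this is roughly what the two variants look like in pure Python: one using the built-in complex type, one using pairs of floats the way the Lua code has to. It's a sketch with made-up function names and the 49-iteration limit mentioned in the thread; it makes no claim about which one is faster on your interpreter, so time it yourself.

```python
import timeit


def escape_count_complex(c, max_iter=49):
    """Iterations of z = z*z + c before |z| exceeds 2, using complex numbers."""
    z = 0j
    for n in range(max_iter):
        z = z * z + c
        if abs(z) > 2.0:
            return n
    return max_iter


def escape_count_floats(cr, ci, max_iter=49):
    """Same computation with the real and imaginary parts as separate floats."""
    zr = zi = 0.0
    for n in range(max_iter):
        zr, zi = zr * zr - zi * zi + cr, 2.0 * zr * zi + ci
        if zr * zr + zi * zi > 4.0:
            return n
    return max_iter


if __name__ == "__main__":
    c = complex(-0.75, 0.1)
    print(timeit.timeit(lambda: escape_count_complex(c), number=20000))
    print(timeit.timeit(lambda: escape_count_floats(c.real, c.imag), number=20000))
```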
It's more than that, though. It looks like the Lua code does only 49 repeats where the Python code does 200. It's clear that this isn't just a Python->Lua rewrite, this is a rewrite that was followed by a lot of optimisation work on the Lua code.
Despite that, I kinda buy it. I can believe that you won't get speeds in pure-Python code for this particular problem that are competitive with a JIT. numpy doesn't help, because the number of iterations varies for each point, so you can't easily parallelise. And speaking of parallelisation, he has a point about threading: Doing the computations in pure-Python would not perform well over multiple threads, you need Cython or something. And that something can apparently be Lua.
It's not the same algorithm: That code completes all 200 iterations unconditionally. For some parts of the input domain, the numpy speed-up is enough to compensate and make it faster anyway, but for other parts it will be slower.
Agreed, although it compensates by supplying a gradually shrinking compute mask, just not to all operations. Still, the effect is there: less populated chunks compute faster.
No. Lua here uses luajit, a jit compiler and interpreter for Lua which Python can not beat. Python has the numba package, which is in the same direction, less easy to use, more restricted, and I don't know if numba can approach luajit's speed here, we'd have to try it out.
(After working on the numpy code posted in this thread, I think Python is pretty competitive, but it requires quite a few contortions - rewriting it using numpy.)
Thanks for sharing, that's a big improvement! The details are easier to see, for me, in grayscale, that is using: Image.fromarray(255 - red).show()
I'm learning as I go here, using numpy functions I've never used before, but we can use more in-place modifying operations like this (np.multiply, np.add with where= and out=) and it decreases runtime of the benchmark by 25%.
Then, after that, this problem is embarrassingly parallel and thanks to numpy we can parallelize your code very well, using all cores (improves runtime for me from 3.5 to 0.92 seconds). Using Python 3.8:
import concurrent.futures
import time

from PIL import Image
import numpy as np


def compute_mandelbrot(cs):
    """
    cs: complex grid subset
    """
    z = np.zeros_like(cs)
    red = np.zeros_like(cs, dtype=np.uint8)
    mask = np.ones_like(cs, dtype=bool)
    for _ in range(199):
        # z² + c -> z
        np.multiply(z, z, where=mask, out=z)
        np.add(z, cs, where=mask, out=z)
        np.add(red, 1, where=mask, out=red)
        mask &= np.abs(z) < 2
    return red


def py_mandelbrot(size):
    t1 = time.time()
    yPts, xPts = np.ogrid[-1j:1j:size * 1j, -1.5:0.5:size * 1j]
    cs = xPts + yPts
    # over-split because chunks are not equal in time taken
    nsplit = 32
    nthreads = 8
    css = np.array_split(cs, nsplit)
    with concurrent.futures.ThreadPoolExecutor(max_workers=nthreads) as executor:
        reds_out = list(executor.map(compute_mandelbrot, css))
    red = np.concatenate(reds_out)
    dt = time.time() - t1
    print(f"dt={dt:.2f}")
    Image.fromarray(255 - red).show()


if __name__ == '__main__':
    py_mandelbrot(1280)
Python makes it easy to get computed chunks in the order they are submitted, using executor.map. I still tried out what it would look like if I concatenated the chunks in the order they were received. That's when I noticed that emptier tiles were computed faster. So then, we can split the problem into more fine-grained chunks and submit more of them while keeping the number of threads the same, for a small speedup since some chunks are faster to compute.
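The ordering and over-splitting behaviour described above can be seen without numpy. A minimal stdlib sketch, with made-up names (`split`, `process_chunk`, `run`) and a trivial squaring function standing in for the real per-chunk work. Note that pure-Python work like this will not actually run in parallel under the GIL; the point is only that `executor.map` hands results back in submission order even when there are more chunks than threads.

```python
from concurrent.futures import ThreadPoolExecutor


def split(items, nsplit):
    """Split a list into nsplit nearly equal contiguous chunks."""
    base, extra = divmod(len(items), nsplit)
    chunks, start = [], 0
    for i in range(nsplit):
        end = start + base + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def process_chunk(chunk):
    # Stand-in for compute_mandelbrot on one slice of the grid
    return [x * x for x in chunk]


def run(items, nsplit=32, nthreads=8):
    """Process more chunks than threads; results come back in input order."""
    with ThreadPoolExecutor(max_workers=nthreads) as executor:
        parts = executor.map(process_chunk, split(items, nsplit))
        return [y for part in parts for y in part]


if __name__ == "__main__":
    print(run(list(range(10)), nsplit=4, nthreads=2))
```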
To bring "another language into it", i.e. using numba as a JIT for Python, is another massive speedup for me, from 0.92 seconds to 0.26 seconds. The new code (needs some pieces from the old):
import numba
from numba import complex128, boolean


@numba.vectorize([complex128(complex128, complex128)])
def square_add(z, c):
    "compute z² + c"
    return z * z + c


@numba.vectorize([boolean(complex128)])
def not_escaped(z):
    "compute |z|² < 4"
    return ((z.conjugate() * z).real < 4.)


def compute_mandelbrot(cs):
    """
    cs: complex grid subset
    """
    z = np.zeros_like(cs)
    red = np.zeros_like(cs, dtype=np.uint8)
    mask = np.ones_like(cs, dtype=bool)
    for _ in range(199):
        # z² + c -> z
        square_add(z, cs, where=mask, out=z)
        np.add(red, 1, where=mask, out=red)
        # mask &= |z| < 2
        not_escaped(z, where=mask, out=mask)
    return red
We can compare calling into numba with calling into luajit. In this code we see how much we get for free when using numpy ufuncs! All we did with numba is to define functions for the composite operations - one for z² + c and one for |z|² < 4 - which means we avoid storing intermediate arrays and do more in fewer array traversals.
I had this update lying around now too. Same signature function, this one actually was another 200% improvement:
import numba
from numba import complex128, uint8


# guvectorize also supports target='parallel' for automatic parallelization
@numba.guvectorize([(complex128[:], uint8[:])], "(n)->(n)")
def compute_mandelbrot_gu(cs, out):
    "compute number of iterations of z² + c before it escapes"
    for i in range(cs.shape[0]):
        c = cs[i]
        z = 0j
        count = 0
        while count < 200:
            z = z * z + c
            if (z * z.conjugate()).real > 4.:
                break
            count += 1
        out[i] = count
Whether to use numba's parallel for guvectorize or mine, it doesn't matter so much, they perform about the same. Parallelizing on a higher level is generally favorable to me.
One thought here: the number of 100%+ improvements made here just underscores how slow python is to start with :). And yes, everyone in comments here complaining about numpy are sort of right, since the ultimate end here is just a flat loop without numpy, but as a numpy function.
For those who wonder how threads can make it faster given that Python has the GIL: numpy releases the GIL while doing computation, so threads can actually use multiple cores if you spend most of your time in numpy world.
For this code we can also easily try the process executor instead of the thread executor - then we avoid the GIL. In this case, it has an approximate 10% performance penalty, probably since we have to serialize data back and forth, can't share memory.
If you use a data structure that supports the buffer protocol (such as a numpy array), you won't even pay a serialization cost.
Python has made a LOT of progress, concurrency-wise. Between asyncio, process pools, threads releasing the GIL and shared memory, you can get a lot of performance with the stdlib alone.
Then of course you have Dask if you need more later.
Lua is so nice, but I don't really use it for anything practical.
I'd like to think that I'd rather have Lua as a language, with Python's class definitions and Python's ecosystem. What I mean by Lua as a language is the interpreter - and luajit - but also its `local` variables and scoping rules (Python's rules are rather strange, which the nonlocal keyword signals clearly).
Questions for the Lua critics: How many Python modules, e.g., in C, have you written? Which language is easier to write modules for, Lua or Python? Ever compare the startup time of the Lua interpreter to the Python one? What's the tally of applications that have embedded Lua interpreters versus those that have embedded Python ones?
Because it is (optionally) embedded in applications I use every day and/or rely on, whether it's a DNS load balancer, a TCP/HTTP proxy server or a fast OS kernel, I always use Lua for "practical purposes".
Python seems to have a development team of millions, but would I rather have a smaller interpreter with faster startup that is easier to write modules for? To me, that is an important question. Will I ever write Python modules in C for my own use, or will I have to choose someone else's library and become 100% reliant on it? Does that limit the true "DIY" potential of Python vis-a-vis Lua?
At least the parent did not use the "indices start at one" justification for concluding Lua is inferior. In the early days of Lua, that was a red flag to me that this language must be good, if that's the best criticism anyone could come up with.
much more than appropriate. It feels like you are serving Lua, not the other way round. Reimplementing basic functions that Lua doesn’t have, writing too much glue in C to make it look like native interface. It is also very un-crossplatform when you use rocks (Lua’s PM). Lua is nice and fast and powerful with its metatables, environments and true coroutines, and I miss that in other scripting toy-languages like python and js. And if Lua took place in js ecosystem, we’d be much less fucked on the frontend side, exactly because of these mechanisms. But it is an engine without a cabin and wheels and nothing changes.
As for nil, “:/.” and others like local…in, table=array, it just becomes an always-itching scratch after years. Lua doesn’t even respect full backwards compatibility (not that it was claimed) but its authors stick to principles that many folks have tried, to no avail, to talk them out of.
> language must be good, if that's the best criticism anyone could come up with
That’s only the shallow surface, the real talk always goes in the mailing list before the release.
I have written Lua modules in C, and I learned from and was impressed by how they designed it to be embeddable. I know the stark difference to Python now. However, my bread and butter is currently mostly in Python.
If you need C modules for Python, use Cython for the binding, don't write them "from scratch".
The main downsides of Lua are the overloaded handling of nil, and the lack of a real standard library. If it weren't for that, it'd be a pretty nice general-purpose scripting language, arguably better than Python.
I probably disagree with 90% of the criticisms leveled against Python, but it definitely is slow in "loopy" tasks, and the indentation-based syntax can make for relatively verbose code in quick ad-hoc scripts. I'd be plenty happy using Lua instead, if only it had a more thorough standard library. The nil-handling thing is annoying, but fine enough for quick scripts.
I actually really like Lua's syntax, though I think a form for short Python-style lambdas would really help (short lambdas are the easiest ones to understand, so I don't see a comprehensibility objection, and Lua's grammar already makes the necessary expression-statement distinction).
The Lua ecosystem was seriously damaged by the 5.1>5.2 split for a while since LuaJIT (hence Torch/nginx) and LÖVE both basically stayed with 5.1 and those were major projects that drew people to the language. The other cool unique thing in the Lua ecosystem, the native-widget (Windows/Gtk) IUP toolkit, remains barely known about and never ported to OS X, possibly due to a lack of Mac users in Brazil. Hence the "language without an ecosystem": it didn't happen for no reason, though the willingness to make breaking changes is a reason the language became so popular in the first place.
NB. IUP is still around, but it still doesn't have a Cocoa port.
`if x < 3: x = 3` is valid Python, although I see your point - you could add an `else` to the Lua example but Python would require a second line.
> a form for short Python-style lambdas would really help
Interesting - many in the Python community feel that its lambdas (particularly the use of `lambda` as a keyword) are too verbose and would prefer something like `x => x+1`. Maybe such an arrow-function syntax is achievable with Python's new PEG parser - I'm fairly sure it couldn't have been supported by the old LL(1) parser.
In this particular example, Python has the more wordy `x = 3 if x < 3 else x` which you'd rather write as `x = max(x, 3)` to demonstrate Python's muscles in terms of useful builtins.
About lambda syntax, to me it would be far more important to allow multiline lambdas than to have a shorter keyword.
> `x = max(x, 3)` to demonstrate Python's muscles in terms of useful builtins
Even Lua has a built-in `math.max` function that works with integers. Maybe it’s Lua’s way of flexing on Go :)
> About lambda syntax, to me it would be far more important to allow multiline lambdas than to have a shorter keyword.
Never going to happen, unfortunately. It’s just not possible within the fundamental rules of Python’s grammar: that statements use significant indentation and expressions do not.
It's not that nil itself is a problem, the real issue is that every invalid table key access doesn't trigger an exception (like Python), but instead returns nil. And everything in Lua is built with tables (including whole class systems), so this makes debugging a serious pain in the ass.
In fact, even global variable accesses are designed to work as table accesses to _G (the global table), so if you make a typo then the interpreter doesn't say that you've misspelled a variable but instead silently gives you nil. This leads to all sorts of weird bugs that are very hard to track down. Probably the number one reason why I ditched the language for something else.
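For contrast, here is how the Python side of that comparison behaves, as a small sketch (the `config` dict and the function names are made up). Indexing a missing key raises, a misspelled global raises, and you only get the Lua-like silent "nothing" if you ask for it explicitly with `dict.get`.

```python
config = {"width": 1280}


def strict_lookup(table, key):
    """Plain indexing: a missing key raises KeyError immediately."""
    return table[key]


def lua_style_lookup(table, key):
    """dict.get mimics Lua's behaviour: a missing key silently yields None."""
    return table.get(key)


def read_misspelled_global():
    """A typo in a global name raises NameError instead of yielding None."""
    return widht  # noqa: F821 -- deliberately undefined


if __name__ == "__main__":
    print(lua_style_lookup(config, "widht"))
    for failing_call in (lambda: strict_lookup(config, "widht"), read_misspelled_global):
        try:
            failing_call()
        except (KeyError, NameError) as exc:
            print(type(exc).__name__, exc)
```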
Nil serves as non-existence, so it cannot be used as a normal value. You can’t have a table (an object/struct in other languages) that has an “existing” nil-valued key, so it breaks JSON schemas. You can’t have an array that contains nil, because arrays are defined to be contiguous tables of consecutive indices that do not have nil. The # operator may return either 3 or 1 for the {1, nil, 1} array. You delete keys and last array values by setting them to nil. People have tried to introduce special NIL values (a globally available table with a clear, testable indication of its purpose), but none ever became a standard, and as a result every lib that requires one uses its own.
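To illustrate the "existing key with a null value" distinction that Lua tables can't express, here is a sketch of what it looks like in a language that has it. The `MISSING` sentinel and `describe` helper are made up for the example; the sentinel plays the same role the various NIL/cjson.null objects play in Lua libraries, just in the opposite direction (marking absence rather than null).

```python
import json

# Unique sentinel: distinguishes "key absent" from "key present with null"
MISSING = object()


def describe(table, key):
    """Classify a key as absent, present-but-null, or holding a real value."""
    value = table.get(key, MISSING)
    if value is MISSING:
        return "absent"
    if value is None:
        return "null"
    return "value"


# JSON null becomes None, and the key still exists in the dict
record = json.loads('{"a": null, "b": 1}')

if __name__ == "__main__":
    for key in ("a", "b", "c"):
        print(key, describe(record, key))
    print(json.dumps(record))
```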
Couldn't you just use `false`? I was pretty opposed to the talk of adding an "undefined" value in the discussion prior to 5.4. As someone who has to write JavaScript fairly often, having a large number of falsey values is definitely not something to be desired. I think in older versions of Lua there was a pattern to include an `n` field in your tables to keep track of its length. Lua is all about "mechanisms over policy" and provides many ways to create tables that behave like arrays in Lua.
You're right, Javascript's undefined creates a whole new set of problems that you don't want to deal with, at least Lua didn't make that mistake.
But using `false` for invalid key accesses seems to be a really bad idea. Sometimes you need to store boolean values in tables. When that table access returns `false`, you have no way to tell whether it came from a successful key retrieval or not. (Lua solves this dilemma by making storing nil in a table equivalent to storing nothing at all, which is pretty reasonable.)
If Lua had a more fleshed-out exception system (The current one based on pcall() is a bit limited both in syntax and functionality), I think it would have been better to go for the Python route (throw exceptions for invalid table access). Don't know how it will take a toll on performance though, maybe that's the reason why it chose such a simplistic route for implementing tables?
Right, and it allows for arguments of functions and destructurizers to be undefined for default and null for null. Absent and empty are really different concepts. These being merged, separate, implicit or explicit, all may be hard to manage, but at least in the case of (undefined & null) it is a most uncrippled value system. JSON also got it right, allowing null but not undefined. At the end of the day all that matters is a programmer’s convenience. Corner cases exist everywhere.
With all flaws of js, this is where it shines:
var t = undefined
var u = {b:1}
var {a=42, b=true} = {undefined, ...t, ...u}
// a==42, b==1
Why not just call into a C/other compiled language library in this case? It'll be faster and it's not like you're getting the advantages of Lua in this case because you're not actually integrating it. In the ideal case Lua likes to call into native functions to do the heavy lifting anyway.
With LuaJIT you can just write Lua and call C functions as if they were Lua. The LuaJIT FFI is very powerful and a nice way to profit from the C speedup without the hassles.
If you have a C++ compiler available you can get significant speedups without any changes to the code (except for using complex instead of np.complex, because the latter is deprecated). Just add a comment
#pythran export PyMandelbrot(uint8[:,:,:], int)
Note I passed the array into the function to remove the image creation from the module.
This yields 18.6 ms vs 1.1 s for the original at size 320.
As a side note I find it worrying that a blog which is called speed matters uses time instead of timeit.
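For what it's worth, a minimal sketch of the timeit approach. `work` is just a placeholder workload and `best_time` is a made-up helper; taking the minimum over several repeats is the usual way to reduce noise from other processes.

```python
import timeit


def work():
    # Placeholder workload; substitute the function you actually want to measure
    return sum(i * i for i in range(1000))


def best_time(func, repeat=5, number=100):
    """Best per-call time in seconds over `repeat` batches of `number` calls."""
    timings = timeit.repeat(func, repeat=repeat, number=number)
    return min(timings) / number


if __name__ == "__main__":
    print(f"{best_time(work) * 1e6:.1f} µs per call")
```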
One ray of light is that Mike Pall is working on luajit again https://github.com/LuaJIT/LuaJIT which at this point is a fork of mainline Lua in terms of version history
I don’t want to sound too negative, but that also means that LuaJIT failed to increase its bus factor with these guys who took over the maintenance recently.
Neovim adopted it as its modern scripting language alongside Vimscript. And it now has at least two interesting scripting languages that compile to it, Fennel and Teal. So it's got quite a community of Neovim plugin developers writing Lua and these languages that compile to it.
It is of major importance as an embedded scripting language. Most recently Neovim, but also it's now the preferred way to write Pandoc filters. And then there's LuaLaTeX, where you can script TeX's internals with Lua.
This is a contrived example. Sure yes it can be done but seems hard to work with. It would be hard to debug for example.
For speeding up pure python, the first thing I would do is reach for numpy. It would be interesting to compare the performance of the lua-embedded code with a version using numpy.