
Conversation

ngehrsitz

When running solarman_rtu_proxy.py continuously from the Docker container, I encountered a crash loop:

OSError: [Errno 24] No file descriptors available
ERROR:asyncio:Unhandled exception in client_connected_cb
transport: <_SelectorSocketTransport fd=7 read=polling write=<idle, bufsize=0>>
Traceback (most recent call last):
  File "/usr/local/lib/python3.13/site-packages/utils/solarman_rtu_proxy.py", line 29, in handle_client
    solarmanv5 = PySolarmanV5Async(
        logger_address, logger_serial, verbose=True, auto_reconnect=True
    )
  File "/usr/local/lib/python3.13/site-packages/pysolarmanv5/pysolarmanv5_async.py", line 66, in __init__
    self.data_wanted_ev = Event()
                          ~~~~~^^
  File "/usr/local/lib/python3.13/multiprocessing/context.py", line 93, in Event
    return Event(ctx=self.get_context())
  File "/usr/local/lib/python3.13/multiprocessing/synchronize.py", line 331, in __init__
    self._cond = ctx.Condition(ctx.Lock())
                 ~~~~~~~~~~~~~^^^^^^^^^^^^
  File "/usr/local/lib/python3.13/multiprocessing/context.py", line 78, in Condition
    return Condition(lock, ctx=self.get_context())
  File "/usr/local/lib/python3.13/multiprocessing/synchronize.py", line 221, in __init__
    self._sleeping_count = ctx.Semaphore(0)
                           ~~~~~~~~~~~~~^^^
  File "/usr/local/lib/python3.13/multiprocessing/context.py", line 83, in Semaphore
    return Semaphore(value, ctx=self.get_context())
  File "/usr/local/lib/python3.13/multiprocessing/synchronize.py", line 133, in __init__
    SemLock.__init__(self, SEMAPHORE, value, SEM_VALUE_MAX, ctx=ctx)
    ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.13/multiprocessing/synchronize.py", line 57, in __init__
    sl = self._semlock = _multiprocessing.SemLock(
                         ~~~~~~~~~~~~~~~~~~~~~~~~^
        kind, value, maxvalue, self._make_name(),
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        unlink_now)
        ^^^^^^^^^^^

This PR:

  1. Exits with a non-zero status code, causing the Docker container to be restarted instead of continuously printing the error message.
  2. Switches to a Debian-based base image, since only Alpine appears to be affected by this issue: https://stackoverflow.com/questions/77679957/python-multiprocessing-no-file-descriptors-available-error-inside-docker-alpine
    https://gitlab.alpinelinux.org/alpine/aports/-/issues/15651

@githubDante
Collaborator

Hi,

The problem is caused by the Pysolarman initialization for each new client/connection, which at some point leads to semaphore exhaustion (according to the linked posts). Modifying the proxy to use a single instance of Pysolarman should fix it.
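
For illustration, the single-instance idea looks roughly like this. This is only a sketch: the constructor arguments mirror the existing proxy, but the async send_raw_modbus_frame() call is an assumption about the library API, and real framing/error handling is omitted.

import asyncio
from functools import partial

from pysolarmanv5 import PySolarmanV5Async


async def handle_client(solarman, lock, reader, writer):
    peer = writer.get_extra_info("peername")
    print(f"{peer}: New connection")
    try:
        while frame := await reader.read(1024):
            async with lock:  # serialize access to the shared logger connection
                reply = await solarman.send_raw_modbus_frame(frame)  # assumed async API
            writer.write(reply)
            await writer.drain()
    finally:
        print(f"{peer}: Connection closed")
        writer.close()


async def run_proxy(bind, port, logger_address, logger_serial):
    # One shared instance for the lifetime of the proxy instead of one per client,
    # so no multiprocessing synchronization primitives pile up per connection.
    solarman = PySolarmanV5Async(logger_address, logger_serial, auto_reconnect=True)
    await solarman.connect()
    lock = asyncio.Lock()
    server = await asyncio.start_server(partial(handle_client, solarman, lock), bind, port)
    async with server:
        await server.serve_forever()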

@githubDante
Collaborator

Hi,

Can you try this version of the proxy on Alpine and see how it handles multiple connections / longer operating intervals?

@ngehrsitz
Author

Hi @githubDante,
unfortunately, your version does not work at all with evcc as a client. I can see that some data is sent, but I just get I/O timeouts:

('127.0.0.1', 57492): New connection
DEBUG:pysolarmanv5.pysolarmanv5:[2967017391] SENT: a5 17 00 10 45 29 00 af 17 d9 b0 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01 03 02 65 00 01 95 ad 94 15
('127.0.0.1', 57492): Connection closed
read failed: read tcp 127.0.0.1:57454->127.0.0.1:1502: i/o timeout
Power:          read failed: read tcp 127.0.0.1:57454->127.0.0.1:1502: i/o timeout
Energy:         read failed: read tcp 127.0.0.1:57456->127.0.0.1:1502: i/o timeout
Current L1..L3: read failed: read tcp 127.0.0.1:57460->127.0.0.1:1502: i/o timeout
read failed: read tcp 127.0.0.1:57463->127.0.0.1:1502: i/o timeout

I got slightly further with f889e55, but the reconnect does not work:
('127.0.0.1', 52834): New connection

DEBUG:pysolarmanv5.pysolarmanv5:[2967017391] SENT: a5 17 00 10 45 64 00 af 17 d9 b0 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01 03 02 65 00 01 95 ad cf 15
WARNING:asyncio:socket.send() raised exception.
('127.0.0.1', 52834): Connection closed

@davidrapan
Contributor

evcc supports the Solarman protocol?

@ngehrsitz
Author

evcc supports the Solarman protocol?

No, it uses Modbus RTU over TCP, which is why I am running solarman_rtu_proxy.py in the first place.

@davidrapan
Contributor

Ah, I mixed up the proxy "type", my bad. :)

@githubDante
Collaborator

Hi @githubDante, unfortunately your version does not work at all with evcc as a client. I can see that some data is sent, but I just get IO timeouts

This doesn't sound good. Maybe I missed something. I will check and update when I have more info.

@davidrapan
Contributor


Yup, can confirm:

Listening on 0.0.0.0:8899
('192.168.144.208', 46344): New connection
DEBUG:pysolarmanv5.pysolarmanv5:[????????????] SENT: a5 33 00 10 45 ........................................
('192.168.144.208', 46344): Connection closed
('192.168.144.208', 46318): New connection
DEBUG:pysolarmanv5.pysolarmanv5:[????????????] SENT: a5 33 00 10 45 ........................................
('192.168.144.208', 36698): Connection closed

Edit: And Wireshark doesn't show any communication proxy <-> logger, only client <-> proxy.

@githubDante
Collaborator

The connect call was missing. The branch has been updated. I will try to test in docker as well.

@davidrapan
Contributor


('192.168.144.208', 47340): New connection
DEBUG:pysolarmanv5.pysolarmanv5:[????????????] SENT: a5 17 00 10 45 ........................................
DEBUG:pysolarmanv5.pysolarmanv5:[????????????] RECD: a5 41 00 10 15 ........................................

👍

@githubDante
Collaborator

Also a fix for the reconnect issue has been added.

@ngehrsitz, can you try again? Thank you in advance.

    asyncio.run(run_proxy(args.bind, args.port, args.logger, args.serial))
except Exception as e:
    print(f"Exception: {e}")
    sys.exit(1)
@davidrapan
Contributor

This is a completely unnecessary call.

@ngehrsitz
Author

I also thought that an exception should cause the script to crash with a non-zero exit code. But that was not the case; instead, it kept printing the exception. I suspect this has something to do with how these scripts are made executable:

[project.scripts]
solarman-decoder = "utils.solarman_decoder:main"
solarman-rtu-proxy = "utils.solarman_rtu_proxy:main"
solarman-scan = "utils.solarman_scan:main"
solarman-uni-scan = "utils.solarman_uni_scan:main"
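
For reference, the wrapper that pip generates for a [project.scripts] entry point is roughly this (simplified):

import sys

from utils.solarman_rtu_proxy import main

if __name__ == "__main__":
    sys.exit(main())

So whatever main() does with the exception determines the exit status the container sees.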

@davidrapan
Contributor

Hmmm.. 🤔

@ngehrsitz
Author

Also a fix for the reconnect issue has been added.

@ngehrsitz can you try again. Thank you in advance.

It starts now. Let's see if it is stable. Usually, crashes occur after one or two days of runtime.

@ngehrsitz
Author

Also a fix for the reconnect issue has been added.
@ngehrsitz can you try again. Thank you in advance.

It starts now. Let's see if it is stable. Usually, crashes occur after one or two days of runtime.

@githubDante There is still some sort of locking issue. The client can only get data on the first connection. As soon as you reconnect, it stops working.
Besides, why are you using threading.Lock and not asyncio.Lock?
I also don't understand why you go to the effort of doing all of the connection setup inside handle_client. Wouldn't it be much simpler to connect once before startup, like in my variant? Then we would only need to fix the bugs in the reconnect logic.
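
On the locking point, my concern is that threading.Lock.acquire() blocks the whole event loop, while asyncio.Lock only suspends the coroutine that is waiting. A toy illustration (not the proxy code):

import asyncio


async def worker(lock: asyncio.Lock, name: str):
    async with lock:  # waiting here yields control back to the event loop
        await asyncio.sleep(1)
        print(f"{name} done")


async def main():
    lock = asyncio.Lock()
    # The three workers take turns, and other tasks keep running while each waits.
    await asyncio.gather(*(worker(lock, f"client {i}") for i in range(3)))
    # A threading.Lock used the same way would block the event loop itself inside
    # acquire(), so a second client waiting on it could stall the whole proxy.


asyncio.run(main())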

@githubDante
Collaborator

Hi,

I don't see it lock up in my test environment, but I have to admit that I'm using a simple echo server to simulate the datalogger socket. I just added a lower socket timeout during initialization and extra exception handling, to avoid the proxy being locked up by a real datalogger while it is waiting for a response (the lock can be held for too long with the default settings).
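
To sketch the idea in isolation (names and the timeout value are illustrative, not the branch's actual code):

import asyncio

CONNECT_TIMEOUT = 5  # seconds; illustrative value


async def ensure_connected(solarman, lock, state):
    # (Re)connect lazily under the lock, but never hold the lock longer than
    # CONNECT_TIMEOUT if the datalogger does not respond.
    async with lock:
        if not state["connected"]:
            try:
                await asyncio.wait_for(solarman.connect(), CONNECT_TIMEOUT)
                state["connected"] = True
            except (asyncio.TimeoutError, OSError) as e:
                print(f"Datalogger unreachable: {e}")
                raise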

Besides, why are you using threading.Lock and not asyncio.Lock?

Don't know, it's just a habit, I guess.

I also don't understand why you go to the effort of doing all of the connection setup inside handle_client.

For control over the PySolarman instance. Since the handle_client method is the main driving force of the proxy, we can check the state of PySolarman when a new client connects and prepare it to serve the requests the client will start issuing in the next moment. Otherwise, extra effort would be needed elsewhere (extra methods or tasks in the asyncio loop that check its state).

Wouldn't it be much simpler to connect once before startup, like in my variant? Then we would only need to fix the bugs in the reconnect logic.

It's basically the same at runtime: a single connect (_solarman_init acts like a singleton) and then state condition checks via _solarman_connect.
