Theoretically there are several ways to do this. Here are the two most common I have heard of.
Disclaimer: This is mostly theoretical knowledge! I haven't really done a lot of networked game development. I could be, and probably am, completely wrong about how this works.
The easiest setup is to have a single character model with a camera for first person and a camera for third person. The first person camera is close to the model, while the third person camera is farther away. Then you would just need to change the FOV on the first person camera so it looks correct.
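To make the two camera idea a bit more concrete, here is a rough, engine-agnostic sketch in Python. Everything in it is an assumption for illustration: `Camera`, `CameraRig`, the offsets, and the FOV numbers are made-up placeholders, not any engine's API or recommended values.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Camera:
    """Hypothetical camera: an offset from the character's origin plus a field of view."""
    name: str
    offset: Tuple[float, float, float]  # (x, y, z) relative to the character model
    fov_degrees: float
    active: bool = False


class CameraRig:
    """Holds both cameras and keeps exactly one of them active at a time."""

    def __init__(self) -> None:
        # Placeholder numbers only; tune these until the view looks right.
        self.first_person = Camera("first_person", (0.0, 1.7, 0.0), fov_degrees=90.0, active=True)
        self.third_person = Camera("third_person", (0.0, 2.2, -4.0), fov_degrees=70.0)

    def active_camera(self) -> Camera:
        return self.first_person if self.first_person.active else self.third_person

    def toggle(self) -> Camera:
        """Switch between the first person and third person cameras."""
        self.first_person.active = not self.first_person.active
        self.third_person.active = not self.first_person.active
        return self.active_camera()
```

In a real engine the `active` flag would map to whatever the engine uses to pick the rendering camera, and the FOV would be set on the engine's own camera object.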
However, this method can get complicated when it comes to perspective, and it isn't very common in games for a variety of reasons. The primary reason I've heard for not using it is that it is very hard to get the first person view looking right without special tricks (which, unfortunately, I do not know).
One huge plus for this method is that it is easy to sync everything across the network so the visuals are correct for every player, since there is only a single model to send data for.
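As a rough idea of what "a single model to send data for" could look like, here is a minimal sketch that packs one player's state into bytes and unpacks it on the other side. The `PlayerState` fields and the byte layout are my own assumptions for illustration; a real game would likely add things like animation state, timestamps, and interpolation.

```python
import struct
from dataclasses import dataclass

# Assumed layout: little-endian player id (unsigned 16-bit),
# then x, y, z, yaw, pitch as 32-bit floats.
STATE_FORMAT = "<H5f"


@dataclass
class PlayerState:
    """Hypothetical per-player state shared by every client."""
    player_id: int
    x: float
    y: float
    z: float
    yaw: float
    pitch: float

    def pack(self) -> bytes:
        """Serialize the state into a fixed-size packet payload."""
        return struct.pack(
            STATE_FORMAT, self.player_id, self.x, self.y, self.z, self.yaw, self.pitch
        )

    @classmethod
    def unpack(cls, data: bytes) -> "PlayerState":
        """Rebuild the state from bytes received over the network."""
        return cls(*struct.unpack(STATE_FORMAT, data))
```

Because both the first person and third person cameras look at the same model, every client can apply the same unpacked state to it and see the same thing.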
The most common way to do this is to have two different scenes. In first person, the player only sees the first person scene, which is generally only a pair of arms. There is a neat picture I saw of the Unity FPS sample that kinda showed how this works, but unfortunately I couldn't find the image.
This way gives greater control over the visuals for the first person view. Because the first person and third person views are separate, it is also easier to get the camera rendering things with the correct perspective.
The hard part with this method is syncing the third person character model with the first person one. Things like rotating to look in the correct direction, for example, can be a tad more challenging.
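For the rotation example, one way I could imagine handling it is to treat the first person look angles as the source of truth and derive the third person pose from them. This is only a sketch under that assumption: `ThirdPersonPose`, `pose_from_look`, and the 60 degree head limit are hypothetical names and values, not something from a particular engine.

```python
from dataclasses import dataclass


@dataclass
class ThirdPersonPose:
    body_yaw: float    # degrees, rotation of the whole model around the up axis
    head_pitch: float  # degrees, how far the head tilts up or down


def pose_from_look(look_yaw: float, look_pitch: float,
                   max_head_pitch: float = 60.0) -> ThirdPersonPose:
    """Derive the third person pose from the first person look direction."""
    # The whole body turns with the camera's yaw, wrapped into [0, 360).
    body_yaw = look_yaw % 360.0
    # The head follows the camera's pitch, clamped so the neck doesn't bend unnaturally.
    head_pitch = max(-max_head_pitch, min(max_head_pitch, look_pitch))
    return ThirdPersonPose(body_yaw, head_pitch)
```

The first person arms can then be animated however looks best, while the third person model only ever reads from this derived pose.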
Despite the limitations, this method is the most popular for many games because of the ability to use better visuals and animations in first person. Another bonus is that so long as your third person model is in sync with the first person model, you can reuse that data for syncing the player across the network.
Those are the two most common ways I've seen it done. Both have their pros and cons, technical limitations, and development quirks.
I would go with whatever method works best for you, as ultimately you'll have to write the code to drive it. One way to find what works best for you is to prototype if you have the time.
The two methods I listed above are commonly used, but there are other ways to go about it that, depending on the project, might be a better fit. Hopefully this helps :smile: